Solve a large-scale radiation heat transfer problem in parallel
Showcase a parallel finite-to-finite area view factor code
Evaluate the performance of solving a large dense matrix on Keeneland using MAGMA and ScaLAPACK
Modify View3D, an open-source serial code
Implement pre-detection and elimination of potential obstructions
Use finite-to-finite and differential-to-finite view factor calculations
Order of computations improved from O(N² · X) to O(N² · X / M), where N = number of surfaces, X = number of obstructing surfaces, M = number of processors
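The decomposition above can be sketched as follows. This is an illustrative Python sketch with hypothetical names, not View3D's actual code: the N(N-1)/2 surface pairs are dealt out round-robin to the M processes, so each rank performs roughly O(N² · X / M) of the obstruction checks.

```python
def pairs_for_rank(n_surfaces, rank, nprocs):
    """Return the (i, j) surface pairs assigned to one rank (round-robin)."""
    all_pairs = [(i, j)
                 for i in range(n_surfaces)
                 for j in range(i + 1, n_surfaces)]
    # Strided slice: rank r takes pairs r, r+M, r+2M, ...
    return all_pairs[rank::nprocs]
```

Each rank then computes the view factors (including obstruction tests against the pre-filtered candidate list) for its own slice, and the results are gathered onto one process.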
[Figure: Time vs. # of Processors]
Produces LU factors and a pivot vector compatible with ScaLAPACK PZGETRF (double-complex version only)
Uses a left-looking out-of-core algorithm to solve problems larger than the available device memory on the GPU
Uses ScaLAPACK PZGETRF for factorization of a narrow column panel and CUBLAS v2 for computation on the GPU
The algorithm assumes access to one GPU per MPI task; the amount of device memory used is set by the user
Data on the GPU must be transferred to buffers on the CPU before it can be communicated by MPI
Tests on Keeneland successfully used multiple GPUs across multiple nodes
The code was extended to multiple GPUs by assigning each process a device based on its rank modulo the number of CUDA-enabled devices present
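The rank-to-device mapping described above is just modular arithmetic. In the real code the result would be passed to cudaSetDevice(); this minimal sketch (with an illustrative function name) shows only the assignment logic.

```python
def device_for_rank(rank, num_devices):
    """Assign each MPI rank a CUDA device in round-robin fashion."""
    return rank % num_devices
```

With 12 ranks on a node holding 3 GPUs, each device is shared by 4 ranks.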
Extended to support double precision (real)
Portions of the source code were rewritten to translate the data types
ScaLAPACK PZxxxx calls changed to PDxxxx; type declarations changed accordingly (e.g., DOUBLE COMPLEX to DOUBLE PRECISION)
Function call changed from:
PDGETRF( M, N, MEM( IPA ), 1, 1, DESCA, MEM( IPPIV ), INFO )
to:
PDGETRF_OOC2( M, N, MEM( IPA ), 1, 1, DESCA, MEM( IPPIV ), MEMSIZE, INFO )
Experiments were performed on Keeneland; each node has 12 CPU cores at 2.8 GHz and 3 Tesla M2090 GPUs
All performance measures are in Gigaflop/s
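The poster does not state its exact flop-counting formula; a common convention for dense LU, sketched here as an assumption, is (2/3)N³ operations divided by wall-clock time.

```python
def lu_gflops(n, seconds):
    """Gflop/s for an n x n LU factorization, using the standard
    (2/3) * n^3 operation count (an assumption, not stated on the poster)."""
    return (2.0 / 3.0) * n ** 3 / seconds / 1e9
```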
Table 1: Comparing in-node and multi-node with the same # of CPU processes (all tests with 3 GPUs)

Problem Size   1 node, 3 ppn   3 nodes, 1 ppn   1 node, 12 ppn   3 nodes, 4 ppn
15000          101             121              171              214
25000          159             182              263              355
35000          194             213              308              450
Table 2: Comparing in-node and multi-node with 3 times the # of CPUs and GPUs, problem size N = 25000

Single Node     Performance   Three Nodes     Performance
1 CPU, 1 GPU    63            3 CPU, 3 GPU    166
3 CPU, 1 GPU    110           9 CPU, 3 GPU    315
3 CPU, 3 GPU    159           9 CPU, 9 GPU    340
[Figure: Performance vs. problem size (N = 15000, 25000, 35000), comparing 3 CPU, 3 GPU within a node against 12 CPU, 3 GPU across 3 nodes]
Table 3: Scaling problem size with 7 nodes (21 CPU, 21 GPU)

Problem Size   Performance   Problem Size   Performance
5000           30            55000          1,054
15000          351           65000          1,252
25000          602           75000          1,310
35000          809           85000          1,372
45000          975           90000          1,526
Table 4: Varying the process grid on single and multi-node runs

P    Q    # Nodes   Performance
1    3    1         160
3    4    1         263
12   7    7         455
3    7    7         602
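The P x Q grids in Table 4 follow the usual BLACS-style 2D process layout. Assuming row-major ordering (an assumption; ScaLAPACK also supports column-major grids), rank r sits at grid coordinates (r // Q, r % Q):

```python
def grid_coords(rank, P, Q):
    """Map a linear MPI rank onto a row-major P x Q process grid."""
    assert 0 <= rank < P * Q
    return rank // Q, rank % Q
```

For the 3 x 7 grid on 7 nodes, the 21 ranks fill the grid exactly one-to-one with the 21 CPUs/GPUs used in Table 3.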
Table 2: Comparing single-GPU MAGMA to multi-GPU static and dynamic scheduling, problem size N = 25000

Single GPU            311 Gflop/s
Multi-GPU, static     823 Gflop/s
Single-GPU, dynamic
Multi-GPU, dynamic
Utilizing 3 CPUs and 3 GPUs spread across 3 nodes yielded higher performance than the same resources on a single node
Moving from 3 CPU, 1 GPU on a single node to 9 CPU, 3 GPU on three nodes increases performance by roughly a factor of three (110 to 315 Gflop/s)
Scaling tests show GPU performance increasing as the problem size grows, indicating that at smaller problem sizes the GPUs are left idle for extended periods
On one node a 4:1 CPU:GPU ratio maximized performance, while on seven nodes a 1:1 CPU:GPU ratio maximized performance
Larger problem sizes on many nodes increase the O(N³) GEMM work done on the GPU relative to the panel factorizations, so fewer CPUs are needed to factor the panels quickly enough
MAGMA's single-GPU and multi-GPU code is efficient at keeping data resident in GPU memory, minimizing the overhead of data transfer
Innovative Computing Lab, University of Tennessee, Knoxville