MS PowerPoint presentation

Tuning LINPACK NxN
for HP Platforms
Hsin-Ying Lin [lin@rsn.hp.com]
Piotr Luszczek [luszczek@utk.edu]
MLIB team/HEPS/SCL/TCD
Hewlett Packard Company
HiPer’01 Bremen, Germany
October 8, 2001
Technical Systems Division * Scalable Computing Lab
Hsin-Ying Lin lin@rsn.hp.com
(T)(972)497-4897
hiper01.ppt
Printed:3/11/2016 5:43:35 PM
2
LINPACK
TOP500
Why Tune LINPACK NxN?
Customers use the TOP500 list as one of their
criteria when purchasing machines
HP wants to increase the number of its
computers on the TOP500 list and to
demonstrate HP’s commitment to high
performance computing
See http://www.top500.org/
What is LINPACK NxN?
LINPACK NxN benchmark
• Solves a dense system of linear equations by a method of the implementer's choosing
• Allows the vendor to choose the problem size for the benchmark
• Measures the execution time for each problem size
LINPACK NxN report
• Nmax – the size of the chosen problem run on the machine
• Rmax – the performance in Gflop/s for the chosen problem size
run on the machine
• N1/2 – the problem size at which half of the Rmax execution rate is achieved
• Rpeak – the theoretical peak performance in Gflop/s for the
machine
LINPACK NxN is used to rank the 500 fastest computers in the
world (the TOP500 list)
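As a rough illustration of how the reported quantities relate, the sketch below converts a timed run into Gflop/s and picks Nmax and Rmax from a set of runs. It assumes the conventional LINPACK operation count of 2/3 n^3 + 2 n^2; the function names and the timings in the usage note are hypothetical and do not come from the slides.

```python
def linpack_flops(n: int) -> float:
    """Conventional LINPACK operation count for one n-by-n solve (assumed here)."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def gflops(n: int, seconds: float) -> float:
    """Achieved rate in Gflop/s for one timed run of problem size n."""
    return linpack_flops(n) / seconds / 1e9


def pick_nmax_rmax(runs):
    """Given (n, seconds) pairs, return (Nmax, Rmax): the size with the best rate."""
    rated = [(n, gflops(n, t)) for n, t in runs]
    return max(rated, key=lambda pair: pair[1])
```

For example, `pick_nmax_rmax([(1000, 1.0), (2000, 4.0)])` (made-up timings) selects N = 2000, because the larger problem achieves the higher rate.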
TOP500 – Past, Present, and Future
June 2000 – 47 HP systems
• Cut-off: 43.82 Gflop/s (performance of the 500th computer)
November 2000 – 5 HP systems
• Cut-off: 55.1 Gflop/s (26% increase from June 2000)
June 2001 – 41 HP systems
• Cut-off: 67.78 Gflop/s (23% increase from
November 2000)
November 2001 – ??? HP systems
• Cut-off: 83-92 Gflop/s (23-36% estimated increase
from June 2001)
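The percentages and the projected range above follow from simple growth arithmetic. The sketch below reproduces them from the cut-off values on this slide; the helper names are illustrative only.

```python
# Cut-off values (Gflop/s) as quoted on the slide.
cutoffs = {"2000-06": 43.82, "2000-11": 55.1, "2001-06": 67.78}


def growth_percent(old: float, new: float) -> float:
    """Percentage increase from one list's cut-off to the next."""
    return (new / old - 1.0) * 100.0


def project(current: float, percent: float) -> float:
    """Cut-off expected after a given percentage increase."""
    return current * (1.0 + percent / 100.0)


low = project(cutoffs["2001-06"], 23)   # lower estimate for November 2001
high = project(cutoffs["2001-06"], 36)  # upper estimate for November 2001
```

Rounded to whole Gflop/s, `low` and `high` give the 83-92 range quoted for November 2001.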
HP list in TOP500 (June 2001)
Rank     Computer   Clock Rate (MHz)  CPU Peak (Gflop/s)  Processors  Rmax (Gflop/s)  Rpeak (Gflop/s)  Efficiency
119      V2600      552               2.208               256         196             565              35%
122      V2500      440               1.760               320         189             563              34%
162      V2500      440               1.760               192         139             338              41%
205      SD/N4000   552               2.208               120         123             265              46%
211      N4000      552               2.208               176         122             389              31%
216      Superdome  552               2.208               96          121             212              57%
266      N4000      552               2.208               144         103             318              32%
305-329  Superdome  552               2.208               64          91              141              64%
358      N4000      440               1.760               128         85              225              38%
363      V2500      440               1.760               128         84              225              37%
364-368  Superdome  552               2.208               64          84              141              60%
463      Superdome  552               2.208               48          74              106              70%
496      Superdome  552               2.208               44          69              97               71%
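The Rpeak and Efficiency columns can be derived from the other columns: Rpeak is the processor count times the per-CPU peak, and efficiency is Rmax divided by Rpeak. A minimal sketch, with illustrative function names, using two rows of the table as checks:

```python
def rpeak(processors: int, cpu_peak_gflops: float) -> float:
    """Theoretical peak of the whole machine in Gflop/s."""
    return processors * cpu_peak_gflops


def efficiency(rmax: float, processors: int, cpu_peak_gflops: float) -> float:
    """Fraction of the theoretical peak achieved by the benchmark run."""
    return rmax / rpeak(processors, cpu_peak_gflops)


# Rank 119 (V2600, 256 CPUs) and rank 496 (Superdome, 44 CPUs) from the table.
v2600 = efficiency(196, 256, 2.208)
superdome_44 = efficiency(69, 44, 2.208)
```

The smaller Superdome configurations reach a much higher fraction of peak than the large multi-node V-class and N-class systems, which is the scalability gap the following slides address.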
HP’s TOP500 Status and Goals
About 30 systems missed the entry threshold
of 55.1 Gflop/s by 1 Gflop/s on Nov. 1, 2000
Goal for Nov. 1, 2001: Ensure all 64-CPU
Superdome systems are listed in the TOP500
HP lacks a strong MPI-based LINPACK NxN
algorithm despite relatively good single-node
LINPACK NxN performance
Goal for Nov. 1, 2001: Develop a more scalable
algorithm for multiple-node systems
The Road to a Highly Scalable
LINPACK NxN Algorithm
Studied the public-domain software HPL
(High Performance LINPACK benchmark):
Q: Why HPL?
A: Other vendors use HPL for their
LINPACK NxN benchmark runs and show good
scalability.
See: http://www.netlib.org/benchmark/hpl
HPL (High Performance LINPACK)
MPI implementation of LINPACK NxN benchmark
Algorithm keywords
• One- and two-dimensional block-cyclic data
distribution
• Right-looking variant of the LU factorization
• Row partial pivoting
• Multiple look-ahead depths
• Recursive panel factorization
Highly tunable (matrix dimension, blocking factor, grid
topology, broadcast/factorization algorithms, data
alignment)
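To make the block-cyclic data distribution concrete, the sketch below maps a global matrix entry to the process that owns it. It assumes the common convention in which block (I, J) goes to process (I mod P, J mod Q) on a P-by-Q grid with no source offset; the function name and parameters are illustrative and are not taken from HPL's source.

```python
def block_cyclic_owner(i: int, j: int, nb: int, p: int, q: int):
    """Return (process row, process column) owning global entry (i, j).

    nb is the blocking factor and p-by-q is the process grid.
    Indices are 0-based; setting p = 1 gives the one-dimensional case.
    """
    block_row = i // nb  # which row block the entry falls in
    block_col = j // nb  # which column block the entry falls in
    return (block_row % p, block_col % q)
```

With `nb=2` on a 2-by-3 grid, entry (3, 6) lies in block (1, 3) and therefore belongs to process (1, 0). The blocking factor and grid shape are among the tunable parameters listed above.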
HPL (High Performance LINPACK)
HPL solves a linear system of order n of the form:
Ax = b
Compute the LU factorization with partial pivoting of the
n-by-(n+1) augmented matrix:
[A, b] = [[L, U], y]
Since the lower triangular factor L is applied
to b as the factorization progresses, the solution x is
obtained by solving the upper triangular system:
Ux = y
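The idea of factoring the augmented matrix and then solving only the upper triangular system can be sketched in a few lines of plain Python. This is an unblocked, single-process illustration, not HPL's algorithm: the function name is hypothetical, and the multipliers that would form L are discarded for brevity.

```python
def solve_augmented(A, b):
    """Solve Ax = b by eliminating on the n-by-(n+1) matrix [A | b]."""
    n = len(A)
    # Build the augmented matrix; the last column holds b and becomes y.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or below row k.
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        # Eliminate below the pivot; b is updated along with A.
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back-substitution on the upper triangular system Ux = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

Because b is carried through the elimination as an extra column, no separate forward solve with L is needed.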
Caveats of HPL
The lower triangular matrix L is left unpivoted and the array of pivots is not
returned.
The right-hand side b is stored as part of the matrix A.
Together these mean that HPL is not
general-purpose LU factorization software, and it cannot be
used to solve for multiple right-hand sides
simultaneously.
Cyclic 1D division of matrix into
8 panels – with 4 processors
Panel:      0   1   2   3   4   5   6   7
Processor:  P0  P1  P2  P3  P0  P1  P2  P3
Factor panel 0
Update panels 1-7 using panel 0
Factor panel 1
Update panels 2-7 using panel 1
Factor panel 2
...
Factor panel 7
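The factor-then-update order for the cyclic 1D layout can be written out programmatically. The sketch below lists each step together with the processor that owns the panel being touched, assuming panel j belongs to processor j mod 4 as in the diagram; the function name is illustrative.

```python
def right_looking_schedule(num_panels: int, num_procs: int):
    """List (owner, action) steps for a right-looking panel factorization."""
    steps = []
    for k in range(num_panels):
        # The owner of panel k factors it.
        steps.append((k % num_procs, f"factor panel {k}"))
        # Every panel to the right is then updated by its own owner.
        for j in range(k + 1, num_panels):
            steps.append((j % num_procs, f"update panel {j} using panel {k}"))
    return steps


schedule = right_looking_schedule(8, 4)
owners = [j % 4 for j in range(8)]  # panel-to-processor map for 8 panels
```

In this ordering no panel is factored until every update from the previous panel has finished, so the other processors wait while one processor factors.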
Look Ahead Algorithm
Panel:      0   1   2   3   4   5   6   7
Processor:  P0  P1  P2  P3  P0  P1  P2  P3
Factor panel 0
Update panel 1 using panel 0
Factor panel 1
Mark panel 1 as factored
Update panel 5 using panel 0
Update panel 5 using panel 1
...
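One way to read the look-ahead ordering is that the next panel is updated and factored first, and only then are the remaining panels updated. The sketch below generates such an order for a look-ahead depth of 1 and filters it for processor P1, on the assumption that the steps on the slide are shown from the viewpoint of P1 (which owns panels 1 and 5). The function name and the depth-1 restriction are illustrative choices.

```python
def lookahead_schedule(num_panels: int, num_procs: int):
    """List (owner, action) steps for a depth-1 look-ahead ordering."""
    def owner(j: int) -> int:
        return j % num_procs

    steps = [(owner(0), "factor panel 0")]
    for k in range(num_panels - 1):
        nxt = k + 1
        # Bring the next panel up to date and factor it right away.
        steps.append((owner(nxt), f"update panel {nxt} using panel {k}"))
        steps.append((owner(nxt), f"factor panel {nxt}"))
        # The remaining panels receive the update from panel k afterwards.
        for j in range(k + 2, num_panels):
            steps.append((owner(j), f"update panel {j} using panel {k}"))
    return steps


p1_steps = [action for who, action in lookahead_schedule(8, 4) if who == 1]
```

The first entries of `p1_steps` are the update and factorization of panel 1 followed by the updates of panel 5 using panels 0 and 1, which matches the order on the slide. The total work is the same as in the plain right-looking order; only the sequence changes, so a factored panel is available earlier.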
Characteristics of HPL
HPL is best suited to cluster systems, i.e. relatively
many low-performance CPUs connected by a
relatively low-speed network.
HPL is not well suited to SMPs, because MPI incurs
overhead that substantially degrades
performance for a benchmark code.
When the look-ahead technique is used with MPI, it
requires additional memory to be allocated on
each CPU for communication buffers. In an SMP
system, such buffers are unnecessary because the
memory is shared.
Approach for Tuning LINPACK NxN
Leverage algorithms in HPL
• Use pthreads instead of MPI on a
single node
• Use a hybrid of MPI and pthreads on
multi-node (Constellation) systems:
MPI across nodes and pthreads
within each node
Leverage HP MLIB’s BLAS routines to
improve single-CPU performance. See
http://www.hp.com/go/mlib
SD PA8600 vs. other machines
Rank  System        Rmax (Gflop/s)  CPUs  Ratio
      SD PA8600     100             64    1.0
255   SGI O3800     107             128   1.9
258   Sun HPC 1000  105             168   2.5
273   IBM Power3    100             96    1.5
284   Compaq EV67   99              128   2.0
Ratio: Rmax per CPU of the SD PA8600 divided by Rmax per CPU of the listed system
Note: Smaller is better for the number under “Ratio”
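The numbers under “Ratio” are consistent with dividing the per-CPU Rmax of the SD PA8600 by the per-CPU Rmax of each other system. A short sketch that recomputes them from the Rmax and CPU columns (the names are illustrative):

```python
# (Rmax in Gflop/s, number of CPUs) as listed on the slide.
reference = (100, 64)  # SD PA8600
others = {
    "SGI O3800": (107, 128),
    "Sun HPC 1000": (105, 168),
    "IBM Power3": (100, 96),
    "Compaq EV67": (99, 128),
}


def per_cpu(rmax: float, cpus: int) -> float:
    """Rmax delivered per CPU in Gflop/s."""
    return rmax / cpus


def ratio(ref, other) -> float:
    """How many times more Rmax per CPU the reference system delivers."""
    return per_cpu(*ref) / per_cpu(*other)


ratios = {name: round(ratio(reference, spec), 1) for name, spec in others.items()}
```

A ratio of 2.5 for the Sun HPC 1000, for instance, means that each SD PA8600 CPU delivers about 2.5 times the benchmark performance of one CPU in that system.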
Constellation PA8600 Performance
[Chart: Gflop/s (left axis, 0 to 250) and Parallel Efficiency (right axis, 0 to 1.2) versus CPUs for the configurations 1x32, 2x32 G, 1x64, 4x32 G, and 4x32 H. Annotated speedups: 1.9x, 1.9x, 3.8x, 3.9x.]
G: Gigabit Ethernet
H: Hyper Fabric
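Parallel efficiency is conventionally the measured speedup divided by the factor by which the CPU count grew. The sketch below applies that definition; it assumes the speedups annotated on the chart are relative to the 1x32 configuration, which the extracted slide text does not state explicitly.

```python
def parallel_efficiency(speedup: float, cpu_multiple: float) -> float:
    """Speedup achieved per unit of added hardware (1.0 is ideal scaling)."""
    return speedup / cpu_multiple


# Assumed reading of the chart: 4x the CPUs of the 1x32 baseline.
eff_3_8 = parallel_efficiency(3.8, 4)
eff_3_9 = parallel_efficiency(3.9, 4)
```

Under that assumption, a 3.8x speedup on four times the CPUs corresponds to about 95% parallel efficiency, and 3.9x to about 97.5%.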
Summary
We believe that we have reached our first goal.
We accomplished our second goal of having more
scalable code for HP Constellation systems.
A 4x32-CPU SD PA8600 could be ranked close to the
TOP 100, based on the TOP500 list of June 2001.
A 1x64-CPU SD PA8600 could be ranked within the TOP
250, based on the TOP500 list of June 2001.
Performance per CPU of the SD PA8600 is about 1.5x, 1.9x,
and 2.5x that of the IBM Power3, SGI O3800, and Sun
HPC 1000, respectively.