Tuning LINPACK NxN for HP Platforms

Hsin-Ying Lin [lin@rsn.hp.com]
Piotr Luszczek [luszczek@utk.edu]
MLIB team / HEPS / SCL / TCD, Hewlett-Packard Company
HiPer'01, Bremen, Germany, October 8, 2001

Technical Systems Division * Scalable Computing Lab
Hsin-Ying Lin, lin@rsn.hp.com, (T) (972) 497-4897

Why Tune LINPACK NxN?

• Customers use the TOP500 list as one of their criteria when purchasing machines.
• HP wants to increase the number of its computers on the TOP500 list and to demonstrate HP's commitment to high-performance computing.
• See http://www.top500.org/

What Is LINPACK NxN?

The LINPACK NxN benchmark:
• Solves a system of linear equations by some method
• Allows the vendor to choose the problem size for the benchmark
• Measures the execution time for each problem size

The LINPACK NxN report:
• Nmax – the size of the chosen problem run on a machine
• Rmax – the performance in Gflop/s for the chosen problem size
• N1/2 – the problem size at which half of the Rmax execution rate is achieved
• Rpeak – the theoretical peak performance of the machine in Gflop/s

LINPACK NxN is used to rank the TOP500 fastest computers in the world.

TOP500 – Past, Present, and Future

• June 2000 – 47 HP systems; cut-off: 43.82 Gflop/s (performance of the 500th computer)
• November 2000 – 5 HP systems; cut-off: 55.1 Gflop/s (26% increase from June 2000)
• June 2001 – 41 HP systems; cut-off: 67.78 Gflop/s (23% increase from November 2000)
• November 2001 – ???
HP systems; estimated cut-off: 83–92 Gflop/s (a 23–36% increase from June 2001)

HP List in TOP500 (June 2001)

Rank     Computer    Clock (MHz)  CPU Peak (Gflop/s)  Processors  Rmax (Gflop/s)  Rpeak (Gflop/s)  Efficiency
119      V2600       552          2.208               256         196             565              35%
122      V2500       440          1.760               320         189             563              34%
162      V2500       440          1.760               192         139             338              41%
205      SD/N4000    552          2.208               120         123             265              46%
211      N4000       552          2.208               176         122             389              31%
216      Superdome   552          2.208               96          121             212              57%
266      N4000       552          2.208               144         103             318              32%
305-329  Superdome   552          2.208               64          91              141              64%
358      N4000       440          1.760               128         85              225              38%
363      V2500       440          1.760               128         84              225              37%
364-368  Superdome   552          2.208               64          84              141              60%
463      Superdome   552          2.208               48          74              106              70%
496      Superdome   552          2.208               44          69              97               71%

HP's TOP500 Status and Goals

• About 30 HP systems missed the November 1, 2000 entry threshold of 55.1 Gflop/s by as little as 1 Gflop/s.
  Goal for November 1, 2001: ensure that every 64-CPU Superdome system is listed in the TOP500.
• HP lacked an excellent MPI-based LINPACK NxN algorithm, despite relatively good single-node LINPACK NxN performance.
  Goal for November 1, 2001: develop a better, more scalable algorithm for multi-node systems.

The Road to a Highly Scalable LINPACK NxN Algorithm

We studied the public-domain software HPL (High Performance LINPACK benchmark).
Q: Why HPL?
A: Other vendors use HPL for their LINPACK NxN benchmarks and show good scalability.
See http://www.netlib.org/benchmark/hpl

HPL (High Performance LINPACK)

HPL is an MPI implementation of the LINPACK NxN benchmark. Algorithm keywords:
• One- and two-dimensional block-cyclic data distribution
• Right-looking variant of the LU factorization
• Row partial pivoting
• Multiple look-ahead depths
• Recursive panel factorization
It is highly tunable: matrix dimension, blocking factor, grid topology, broadcast and factorization algorithms, data alignment.

HPL solves a linear system of order n:

    Ax = b

It computes the LU factorization with row partial pivoting of the n-by-(n+1) matrix [A, b]:

    [A, b] = [[L, U], y]

Since the lower triangular factor L is applied to b as the factorization progresses, the solution x is obtained by solving the upper triangular system:

    Ux = y

Caveat of HPL

The lower triangular matrix L is left unpivoted and the array of pivots is not returned. The right-hand side b is stored as part of the matrix A. Consequently, HPL is not general LU factorization software, and it cannot be used to solve for multiple right-hand sides simultaneously.

Cyclic 1D Division of the Matrix into 8 Panels, with 4 Processors

Panels 0–7 are assigned cyclically to processors P0, P1, P2, P3, P0, P1, P2, P3. The factorization proceeds panel by panel:
• Factor panel 0
• Update panels 1–7 using panel 0
• Factor panel 1
• Update panels 2–7 using panel 1
• Factor panel 2
• ... and so on, until panel 7 is factored
Look-Ahead Algorithm

With the same cyclic assignment of panels 0–7 to P0–P3, look-ahead lets the next panel be factored before the whole trailing update has finished:
• Factor panel 0
• Update panel 1 using panel 0
• Factor panel 1; mark panel 1 as factored
• Update panel 5 using panel 0
• Update panel 5 using panel 1
• ...

Characteristics of HPL

• HPL is best suited to cluster systems, i.e. relatively many low-performance CPUs connected by a relatively low-speed network.
• It is less suitable for SMPs, because MPI incurs overhead that substantially degrades the performance of a benchmark code.
• When the look-ahead technique is used with MPI, additional memory must be allocated on each CPU for communication buffers. On an SMP system such buffers are unnecessary, thanks to the shared-memory mechanism.

Approach for Tuning LINPACK NxN

Leverage the algorithms in HPL:
• Use pthreads instead of MPI on a single node
• Use a hybrid of MPI and pthreads on a multi-node (Constellation) system: MPI across nodes, pthreads within each node
Leverage HP MLIB's BLAS routines to improve single-CPU performance. See http://www.hp.com/go/mlib
SD PA8600 vs. Other Machines

System         Rank  Rmax (Gflop/s)  CPUs  Ratio
SD PA8600      –     100             64    1.0
SGI O3800      255   107             128   1.9
Sun HPC 1000   258   105             168   2.5
IBM Power3     273   100             96    1.5
Compaq EV67    284   99              128   2.0

Note: for "Ratio" (CPUs per Gflop/s relative to the SD PA8600), smaller is better.

Constellation PA8600 Performance

[Chart: Gflop/s (0–250) and parallel efficiency (0–1.2) versus CPU configuration for 1x32, 2x32 G, 1x64, 4x32 G, and 4x32 H, where G = Gigabit Ethernet and H = HyperFabric; the speedup annotations read 1.9x, 1.9x, 3.8x, and 3.9x.]

Summary

• We believe that we reached our first goal.
• We accomplished our second goal: better-scaling code for the HP Constellation system.
• A 4x32-CPU SD PA8600 could be ranked close to the TOP 100, based on the June 2001 TOP500 list.
• A 1x64-CPU SD PA8600 could be ranked within the TOP 250, based on the June 2001 TOP500 list.
• Performance per CPU of the SD PA8600 is about 1.5x, 1.9x, and 2.5x that of the IBM Power3, SGI O3000, and Sun HPC1000, respectively.