Achieving Scalability to over 1000 Processors on the HPCx system
Joachim Hein, Mark Bull, Gavin Pringle
EPCC
The University of Edinburgh
Mayfield Rd
Edinburgh EH9 3JZ
Scotland, UK
HPCx is the UK’s largest High Performance Computing Service, consisting of
40 IBM Regatta-H SMP nodes, each containing 32 POWER4 processors. The
main objective of the system is to provide a capability computing service for
a range of key scientific applications, i.e. a service for applications that can
utilise a significant fraction of the resource. To achieve this capability computing objective, applications must be able to scale effectively to around 1000
processors. This presents a considerable challenge, and requires an understanding of the system and its bottlenecks. In this paper we present results from a
detailed performance investigation on HPCx, highlighting potential bottlenecks
for applications and how these may be avoided. For example, we achieve good
scaling on a benchmark code through effective use of environment variables,
level 2 cache and underpopulated logical partitions.
1 Introduction
HPCx is the UK’s newest and largest National
High Performance Computing system. This
system has been funded by the British Government, through the Engineering and Physical Sciences Research Council (EPSRC). The
project is run by the HPCx Consortium, a
consortium led by the University of Edinburgh
(through Edinburgh Parallel Computing Centre
(EPCC)), with the Central Laboratory for the
Research Councils in Daresbury (CLRC) and
IBM as partners.
The main objective of the system is to provide a world-class service for capability computing for the UK scientific community. Achieving effective scaling on over 1000 processors for
the broad range of application areas studied in
the UK, such as materials science, atomic and
molecular physics, computational engineering
and environmental science, is a key challenge
of the service.
To achieve this, we require a detailed understanding of the system and its bottlenecks.
Hence in this paper we present results from a
detailed performance investigation on HPCx,
using a simple iterative Jacobi application.
This highlights a number of potential bottlenecks and how these may be avoided.
2 The HPCx system
HPCx consists of 40 IBM p690 Regatta-H frames. Each frame has 32 POWER4 processors with a clock speed of 1.3 GHz. This provides a peak performance of 6.6 Tflop/s and up to 3.2 Tflop/s sustained performance. The frames are connected via IBM's SP Colony switch. Within each frame, the processors are grouped into 4 multi-chip modules (MCMs), where each MCM has 8 processors. In order to increase the communication bandwidth of the system, the frames have been divided into 4 logical partitions (lpars), coinciding with the MCMs. Each lpar is operated as an 8-way SMP, running its own copy of the operating system AIX.
The system has three levels of cache. There is 32 kB of level 1 cache per processor. The
level 2 cache of 1440kB is shared between two
processors. The eight processors inside an lpar
share 8GB of main memory, the memory bus
and the level 3 cache of 128MB.
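The quoted peak figure follows directly from the processor count and clock speed, assuming the four floating point operations per cycle provided by the POWER4's two multiply-add units:

    40 frames × 32 processors × 1.3 GHz × 4 flop/cycle ≈ 6.6 Tflop/s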
3 Case study code

Our case study inverts a lattice Laplacian in two dimensions using the Jacobi algorithm

    I_n(x_1, x_2) = 1/4 [ I_{n-1}(x_1 + 1, x_2) + I_{n-1}(x_1 - 1, x_2)
                        + I_{n-1}(x_1, x_2 + 1) + I_{n-1}(x_1, x_2 - 1)
                        - E(x_1, x_2) ]                                  (1)

We start the iteration with I_0 = E. The matrix E is constant. This benchmark contains typical features of a field theory with nearest-neighbour interactions. The code has been parallelised using MPI_Sendrecv to exchange the halos. The present version does not contain global communications. The modules have been compiled using version 8.1 of the IBM XL Fortran compiler with the options -O3 and -qarch=pwr4. We have been using version 5.1 of the AIX operating system.

4 Tasks per logical partition

We measured the performance of our application code on three different problem sizes, small = 840 × 1008, medium = 1680 × 2016 and large = 3360 × 4032, on a range of processors and lpars. Our results are shown in Figure 1. The points give the fastest observed run time out of three or more trials. To guide the eye, we connected runs on the same number of lpars. The straight lines give "lines of perfect scaling"; they are separated by factors of two.

We start the discussion with the results for a single lpar and the medium problem size (1680 × 2016). By increasing the number of active processors on the lpar, we note a drop in efficiency to slightly less than 50%. This pattern is observed for all numbers of lpars and problem sizes. When using large numbers of processors per lpar, the data is required at a higher rate than the memory system is able to deliver.

In this context it is interesting to compare the results for 8 processors and different numbers of lpars for the medium problem size. For example, running an 8 processor job across 8 lpars (i.e. 1 processor per lpar), rather than within 1 lpar (i.e. 8 processors per lpar), reduces the execution time by 35%. With 8 processors across 8 lpars, each processor is no longer sharing its memory bus and level 3 cache with 7 other processors. Also the level 2 cache is no longer shared between two active processors. As a consequence, the processors can access their data at a higher rate. However, time on HPCx is charged to users on an lpar basis. Hence, the run with 8 lpars is 8 times as expensive but only 35% more efficient than the single lpar run. In summary, running with a single CPU per lpar is not a good deal. On HPCx it is advisable to choose the number of processors per lpar which gives the fastest wall clock time for the selected number of lpars. Figure 1 does not give a consistent picture here: depending on the parameters, either 7 or 8 processors per lpar appears to be optimal.

When comparing the different single processor results, the run for the small size is 4 times faster than the medium size run. This is expected, since the problem is 4 times smaller. However, the large size is more than 5 times slower than the medium size. This reflects the fact that the large size does not fit into the level 3 cache of a single lpar. When running on two lpars it fits into level 3 cache, and we observe reasonable scaling between the 2 lpar runs for the medium and large problem sizes.
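The benchmark source is not reproduced here. The following is a minimal sketch of the parallelisation outlined in Section 3: it assumes a one-dimensional decomposition over x_2 with one halo column on either side of each local block; the routine name, array names and the fixed treatment of the x_1 boundaries are illustrative assumptions, and only the use of MPI_Sendrecv for the halo exchange is taken from the text.

  ! Sketch of one Jacobi iteration: swap the halo columns of the previous
  ! iterate with MPI_Sendrecv, then apply the update of eq. (1).
  subroutine jacobi_sweep(inew, iold, e, n1, n2loc, up, dn, comm)
    use mpi
    implicit none
    integer, intent(in) :: n1, n2loc      ! local block of n1 x n2loc points
    integer, intent(in) :: up, dn, comm   ! neighbour ranks (MPI_PROC_NULL at the
                                          ! ends of the chain) and communicator
    double precision, intent(inout) :: iold(n1, 0:n2loc+1)  ! previous iterate, with halos
    double precision, intent(out)   :: inew(n1, 0:n2loc+1)  ! new iterate
    double precision, intent(in)    :: e(n1, n2loc)
    integer :: x1, x2, ierr, status(MPI_STATUS_SIZE)

    ! Exchange the halo columns with both neighbours.
    call MPI_Sendrecv(iold(1, n2loc), n1, MPI_DOUBLE_PRECISION, up, 1, &
                      iold(1, 0),     n1, MPI_DOUBLE_PRECISION, dn, 1, &
                      comm, status, ierr)
    call MPI_Sendrecv(iold(1, 1),       n1, MPI_DOUBLE_PRECISION, dn, 2, &
                      iold(1, n2loc+1), n1, MPI_DOUBLE_PRECISION, up, 2, &
                      comm, status, ierr)

    ! Jacobi update of eq. (1); the first and last x_1 rows are kept fixed
    ! here for simplicity.
    do x2 = 1, n2loc
       do x1 = 2, n1 - 1
          inew(x1, x2) = 0.25d0 * (iold(x1+1, x2) + iold(x1-1, x2)  &
                                 + iold(x1, x2+1) + iold(x1, x2-1)  &
                                 - e(x1, x2))
       end do
    end do
  end subroutine jacobi_sweep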
5 Cache utilisation
For the above, the algorithm has been implemented using Fortran90 array syntax. In total we used three different arrays, corresponding to the matrices I_n, I_{n-1} and E in eq. (1). After each iteration, I_n has to be copied into I_{n-1}. To investigate the efficiency of the code generated from array syntax, we compared against an implementation using two explicit do-loops for eq. (1) and another two do-loops for copying I_n into I_{n-1}.

It is possible to fuse these sets of do-loops by copying element I_n(x_1, x_2 - 1) into element I_{n-1}(x_1, x_2 - 1) directly after having calculated I_n(x_1, x_2), assuming x_2 is the index of the outer loop. For this algorithm only two lines of I_n need to be stored. We call this the "compact version" of the update code. The performance of the three implementations is compared in Figure 2.

[Figure 1: Wall clock time in seconds versus number of processors for a given number of logical partitions (1, 2, 4 and 8 lpars) and the problem sizes 840 × 1008, 1680 × 2016 and 3360 × 4032. Results are for 2000 iterations.]

[Figure 2: Performance comparison of the different versions of the update code (array syntax, do-loops, compact) for 8 processors on a single lpar; run time relative to the array syntax version for the problem sizes 420 × 504, 840 × 1008, 1680 × 2016 and 3360 × 4032.]

For none of the four problem sizes do we observe any significant difference between the version using array syntax and the one using explicit do-loops. However, the compact version is faster in all cases. The improvement is dramatic for the three larger problem sizes: here the compact version is more than 2 times faster, which is due to less memory traffic and better cache reuse. For the smallest problem size, which fits into level 2 cache, the difference reduces but is still significant.
6 MPI protocol
The environment variable MP_EAGER_LIMIT
controls the protocol used for the exchange of
messages under MPI. For messages of a size
smaller than MP_EAGER_LIMIT an MPI standard send is implemented as a buffered send,
leading to a lower latency but increasing the
memory consumption of the MPI library. For
messages larger than the MP_EAGER_LIMIT the
standard send is implemented as a synchronous
send. Both the default and maximum values
of MP_EAGER_LIMIT depend on the number of
MPI tasks.
In Figure 3 we demonstrate the effect of eager (full symbols) and non-eager (open symbols) sending on the performance of our Jacobi inverter. The figure shows the efficiency
E(nproc) = t(1)/[t(nproc) · nproc]. This study
uses the compact version of the update code.
For larger messages, i.e. smaller number of
processors, there is little difference between eager and non-eager sending. However for 1024
processors using eager sending improves the efficiency from 49% to 73%.
When increasing the number of lpars, the update code (computation) shows significant superlinear scaling. This is typical behaviour for a modern cache-based architecture: when using a larger number of logical partitions, there is more cache memory available. Above 32 processors the problem fits into level 3 cache, and for 1024 processors it fits into level 2 cache. This superlinear scaling compensates for most of the overhead associated with the increased communication when running on a larger number of processors. When using eager sending, the efficiency of the total code is almost level in the range 32 ≤ nproc ≤ 1024. For 1024 processors we observe an execution speed of 0.51 Tflop/s. Considering that the code spends half of its time in communication and is unbalanced with respect to multiplications vs additions [1], we believe this is satisfactory.

[1] The IBM POWER4 processors have two floating point multiply-add units. For optimum performance these require an equal number of multiplications and additions.

[Figure 3: Efficiency versus number of processors for different values of MP_EAGER_LIMIT (its maximum value and 0), shown separately for the update code and for the total code. The problem size is 6720 × 8064 and 8 tasks per lpar are used.]

7 Run time variations

When running our code on a large number of lpars, we observed a wide variation of run times. Interruptions by the operating system, daemons and helper tasks in the MPI system might be a cause of this noise. If these interruptions are indeed the cause, using only 7 tasks/lpar, which leaves one processor free to deal with the interruptions, should improve the situation, since each lpar is operated as an independent SMP.

In Table 1 we show results for the averaged communication and calculation times for two different problem sizes, using 7 or 8 tasks/lpar. In all cases we used 128 lpars, which is the full production region of the HPCx system. Using only 7 tasks/lpar substantially reduces the average and the standard deviation of the communication time. This confirms our expectation above that the interruptions are a major cause of the run time variations. Obviously the calculation time increases when using fewer tasks/lpar.

Table 1: Time in ms spent in communication and calculation, averaged over 20000 iterations. The numbers in parentheses give the standard deviation in the last digits.

  size:            6720 × 8064             26880 × 32256
  tasks/lpar:      7          8            7          8
  communication:   0.22(9)    0.31(99)     1.28(49)   1.67(380)
  calculation:     0.304(11)  0.265(11)    8.74(36)   6.66(33)

With respect to the overall execution time, for the smaller problem it is advantageous to use 7 tasks/lpar, since the gain in communication time outweighs the penalty on the calculation time. For the larger problem size it is the other way round: it is advantageous to use 8 tasks/lpar and to live with the run time variations.

8 Summary
In this publication we investigate the performance of a case study code on the HPCx system when using up to 1024 processors. The
memory bus has been identified as a potential
bottleneck. Better utilisation of the cache system, by reducing local workspace inside subroutines and by loop fusion, improves the situation.
For codes using MPI the environment variable
MP_EAGER_LIMIT has a large impact on the performance and should be tuned properly. When
using large numbers of processors, the interrupts from the operating system and the various daemons can impact the performance of the
code. Using fewer than eight tasks per lpar can
improve the overall performance of the application.