Mixed Mode programming on Clustered SMP Systems

Lorna Smith and Mark Bull
EPCC, James Clerk Maxwell Building, The King’s Buildings,
The University of Edinburgh, Mayfield Road, Edinburgh, EH9 3JZ, U.K.
email: l.smith@epcc.ed.ac.uk or m.bull@epcc.ed.ac.uk
Abstract
MPI / OpenMP mixed mode codes could potentially offer the most effective parallelisation
strategy for an SMP cluster, as well as allowing the different characteristics of both paradigms to be
exploited to give the best performance on a single SMP. This paper discusses the implementation,
development and performance of mixed mode MPI / OpenMP applications.
While this style of programming is often not the most effective mechanism on SMP systems,
significant performance benefit can be obtained on codes with certain communication patterns,
such as those with a large number of collective communications.
1. Introduction

Shared memory architectures have gradually become more prominent in the HPC market, as advances in technology have allowed larger numbers of CPUs to have access to a single memory space. In addition, manufacturers have increasingly clustered these SMP systems together to go beyond the limits of a single system. These clustered SMPs have recently become more prominent in the UK, one such example being the HPCx system - the UK's newest and largest National High Performance Computing system, comprising 40 IBM Regatta-H SMP nodes, each containing 32 POWER4 processors. Hence it has become important for applications to be portable and efficient on these systems.

Message passing codes written in MPI are obviously portable and should transfer easily to clustered SMP systems. Whilst message passing may be necessary to communicate between nodes, it is not immediately clear that this is the most efficient parallelisation technique within an SMP node. In theory, a shared memory model such as OpenMP should offer a more efficient parallelisation strategy within an SMP node. Hence a combination of shared memory and message passing parallelisation paradigms within the same application (mixed mode programming) may provide a more efficient parallelisation strategy than pure MPI.

In this paper we will compare and contrast MPI and OpenMP before discussing the potential benefits of a mixed mode strategy on an SMP system. We will then examine the performance of a collective communication routine, using pure MPI and mixed MPI/OpenMP.

2. HPCx

HPCx is the UK's newest and largest National High Performance Computing system. It has been funded by the Engineering and Physical Sciences Research Council (EPSRC). The project is run by the HPCx Consortium, a consortium involving The University of Edinburgh, EPCC, CCLRC's Daresbury Laboratory and IBM.

HPCx consists of 40 IBM p690 Regatta nodes, each containing 32 POWER4 processors. Within a node there are 16 chips: each chip contains two processors with their own level 1 caches and a shared level 2 cache. The chips are packaged into a Multi-Chip Module (MCM) containing 4 chips (8 processors) and a 128 Mbyte level 3 cache, which is shared by all 8 processors in the MCM. Each Regatta node contains 4 MCMs and 32 Gbytes of main memory. The MCMs are connected to each other and to main memory by a 4-way bus interconnect to form a 32-way symmetric multi-processor (SMP). In order to increase the communication bandwidth of the system, each Regatta node has been divided into 4 logical partitions (LPARs), each coinciding with an MCM. Each LPAR runs its own copy of the AIX operating system and operates as an 8-way SMP.
3. Programming model characteristics
The message passing programming model is a distributed memory model with explicit control parallelism. MPI [1] is portable to both distributed and shared memory architectures and allows static task scheduling. The explicit parallelism often provides better performance, and a number of optimised collective communication routines are available for optimal efficiency. Data placement problems are rarely observed and synchronisation occurs implicitly with subroutine calls, and hence is minimised naturally. However, MPI suffers from a few deficiencies. Decomposition, development and debugging of applications can be time consuming and significant code changes are often required. Communications can create a large overhead and the code granularity often has to be large to minimise the latency. Finally, global operations can be very expensive.
OpenMP is an industry standard [2] for shared memory programming. Based on a combination of compiler directives, library routines and environment variables, it is used to specify parallelism on shared memory machines. Communication is implicit and OpenMP applications are relatively easy to implement. In theory, OpenMP makes better use of the shared memory architecture. Run time scheduling is allowed and both fine and coarse grain parallelism are effective. OpenMP codes will, however, only run on shared memory machines, and the placement policy of data may cause problems. Coarse grain parallelism often requires a parallelisation strategy similar to an MPI strategy, and explicit synchronisation is required.
By utilising a mixed mode programming model we should be able to take advantage of the benefits of both models. For example, a mixed mode program may allow the data placement policies of MPI to be utilised with the finer grain parallelism of OpenMP. The majority of mixed mode applications involve a hierarchical model: MPI parallelisation occurring at the top level, and OpenMP parallelisation occurring below. For example, Figure 1 shows a 2D grid which has been divided geometrically between four MPI processes (processes 0-3). These sub-arrays have then been further divided between three OpenMP threads (threads 0-2). This model closely maps to the architecture of an SMP cluster, with the MPI parallelisation occurring between the SMP nodes and the OpenMP parallelisation within the nodes.

Figure 1: Schematic representation of a hierarchical mixed mode programming model for a 2D array.
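As a concrete illustration of this hierarchical model (a minimal sketch rather than code from an actual application; the array dimensions and the per-element work are arbitrary placeholders), each MPI process might own a block of rows of the grid while its OpenMP threads share the loop over that block:

    /* Minimal sketch of the hierarchical model of Figure 1: MPI
       decomposes the 2D grid between processes, OpenMP threads share
       the loop over each local sub-grid.  NROWS, NCOLS and the work
       assigned to each element are placeholders only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define NROWS 1024
    #define NCOLS 1024

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* MPI parallelism at the top level (between SMP nodes / LPARs). */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local_rows = NROWS / size;        /* geometric decomposition */
        double *u = malloc((size_t)local_rows * NCOLS * sizeof(double));

        /* OpenMP parallelism below: threads share the local sub-array. */
        #pragma omp parallel for
        for (int i = 0; i < local_rows; i++)
            for (int j = 0; j < NCOLS; j++)
                u[i * NCOLS + j] = rank + 0.001 * i;  /* placeholder work */

        /* ... halo exchanges between MPI processes would go here ... */

        free(u);
        MPI_Finalize();
        return 0;
    }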
4. Benefits of mixed mode programming

This section discusses various situations where a mixed mode code may be more efficient than a corresponding MPI implementation, whether on an SMP cluster or a single SMP system.

Codes which scale poorly with MPI

One of the largest areas of potential benefit from mixed mode programming is with codes which scale poorly with increasing numbers of MPI processes. One of the most common reasons for an MPI code to scale poorly is load imbalance. For example, irregular applications such as adaptive mesh refinement codes suffer from load balance problems when parallelised using MPI. By developing a mixed mode code for a clustered SMP system, MPI need only be used for communication between nodes, creating a coarser grained problem. The OpenMP implementation may not suffer from load imbalance and hence the performance of the code would be improved.

Fine grain parallelism problems

OpenMP generally gives better performance on fine grain problems, where an MPI application may become communication dominated. Hence, when an application requires good scaling with a fine grain level of parallelism, a mixed mode program may be more efficient. Obviously a pure OpenMP implementation would give better performance still; however, on SMP clusters MPI parallelism is still required for communication between nodes. By reducing the number of MPI processes required, the scaling of the code should be improved.

Replicated data

Codes written using a replicated data strategy often suffer from memory limitations and from poor scaling due to global communications. By using a mixed mode programming style on an SMP cluster, with the MPI parallelisation occurring across the nodes and the OpenMP parallelisation inside the nodes, the problem will be limited by the memory of an SMP node rather than the memory available to a single processor (or, to be precise, the memory of an SMP node divided by the number of processors), as is the case for a pure MPI implementation. This has obvious advantages, allowing more realistic problem sizes to be studied.

Restricted MPI process applications

A number of MPI applications require a specific number of processes to run. Whilst this may be a natural and efficient implementation, it limits the number of MPI processes to certain combinations. By developing a mixed mode MPI/OpenMP code, the natural MPI decomposition strategy can be retained, running the desired number of MPI processes, with OpenMP threads used to further distribute the work, allowing all the available processors to be used effectively.

Poorly optimised intra-node MPI

Although a number of vendors have spent considerable amounts of time optimising their MPI implementations within a shared memory architecture, this may not always be the case. On a clustered SMP system, if the MPI implementation has not been optimised, the performance of a pure MPI application across the system may be poorer than that of a mixed MPI/OpenMP code. This is obviously vendor specific, but in certain cases a mixed mode code could offer significant performance improvement. For example, IBM's MPI is not optimised for clustered systems.

Poor scaling of the MPI implementation

Clustered SMPs open the way for systems to be built with ever increasing numbers of processors. In certain situations the scaling of the MPI implementation itself may not match these ever increasing processor numbers, or may indeed be restricted to a certain maximum number. In this situation developing a mixed mode code may be of benefit (or required), as the number of MPI processes needed will be reduced and replaced with OpenMP threads.

5. Collective Communications

Having discussed various situations where a mixed mode code may be more efficient than a corresponding MPI implementation, this section considers one specific situation relevant to HPCx: collective communications.

Many scientific applications use collective communications, and to achieve good scaling on clustered SMP systems such as HPCx these communications need to be implemented efficiently. Collective communications were included in the MPI standard to allow developers to implement optimised versions of essential communication patterns. A number of collective operations can be efficiently implemented using tree algorithms, including Broadcast, Gather, Scatter and Reduce. On a system such as HPCx, where communication is faster within a node than between nodes, the tree should be constructed such that communications corresponding to branches at the same level of the tree run at the same speed; otherwise the speed of each stage of the algorithm will be limited by the performance of the slowest communication.
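As a generic illustration of the tree idea (a sketch only, not the IBM implementation, and ignoring the topology-aware ordering just described), a broadcast can be built from point-to-point messages in roughly log2 P stages, each stage doubling the number of processes that hold the data:

    /* Generic binomial-tree broadcast sketch built from point-to-point
       messages; assumes the root is rank 0.  Not a vendor implementation. */
    #include <mpi.h>

    void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* In the stage with offset 'step', every process that already
           holds the data forwards it to the process 'step' ranks above. */
        for (int step = 1; step < size; step *= 2) {
            if (rank < step && rank + step < size)
                MPI_Send(buf, count, type, rank + step, 0, comm);
            else if (rank >= step && rank < 2 * step)
                MPI_Recv(buf, count, type, rank - step, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }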
To demonstrate this, two techniques have been used. Firstly, a library has been used that, by creating multiple communicators, performs collective operations in two stages [3]. For example, an all reduce operation involves a reduction operation carried out across the processors within an LPAR, followed by an all reduce of these results across LPARs. This library uses the MPI profiling interface and hence is easy to use, as it requires no code modification.
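Purely to illustrate the underlying idea (a sketch under stated assumptions, not the library's source; the mapping of eight consecutive MPI ranks to an LPAR is illustrative), a two-stage all reduce can be built explicitly from sub-communicators as follows:

    /* Sketch of a two-stage all reduce built from sub-communicators.
       Assumes eight consecutive ranks per LPAR; the real library hides
       all of this behind the MPI profiling interface. */
    #include <mpi.h>

    void two_stage_allreduce(double *sendbuf, double *recvbuf, int count,
                             MPI_Comm comm)
    {
        int rank, intra_rank;
        MPI_Comm intra, inter;

        MPI_Comm_rank(comm, &rank);

        /* Stage 1: reduce onto one leader process within each LPAR. */
        MPI_Comm_split(comm, rank / 8, rank, &intra);
        MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, intra);

        /* Stage 2: all reduce of the partial results across the LPAR
           leaders only. */
        MPI_Comm_rank(intra, &intra_rank);
        MPI_Comm_split(comm, intra_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                       &inter);
        if (inter != MPI_COMM_NULL) {
            MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE,
                          MPI_SUM, inter);
            MPI_Comm_free(&inter);
        }

        /* Return the final result to every process in the LPAR, giving
           all reduce semantics overall. */
        MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, intra);
        MPI_Comm_free(&intra);
    }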
Secondly, a mixed mode version of the code has been developed that also performs the operation in two stages: stage 1 uses OpenMP and stage 2 MPI. For example, an all reduce operation involves a reduction operation carried out using OpenMP across threads within an LPAR, followed by an MPI all reduce operation of these results across LPARs. To ensure the MPI and OpenMP operations are comparable, the OpenMP reduction operation is carried out across variables that are private to each thread. To avoid the use of an ATOMIC directive, each private variable is copied to a different element of a shared array, based on its thread identifier. The contents of this shared array are then reduced into a second shared array by the master thread. The master thread on each LPAR is then involved in an MPI all reduce operation across LPARs.
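A minimal sketch consistent with this description is given below, reducing a local array to a single global sum (the names, the 64-slot bound on the thread count, and the assumption that MPI was initialised with MPI_THREAD_FUNNELED support are illustrative, not taken from the benchmark code):

    /* Sketch of the mixed mode two-stage all reduce: stage 1 reduces
       thread-private partial sums via a shared array indexed by thread
       id, stage 2 is an MPI all reduce across LPARs (one MPI process
       per LPAR, MPI called only by the master thread). */
    #include <mpi.h>
    #include <omp.h>

    double mixed_allreduce(const double *data, int n, MPI_Comm comm)
    {
        double slot[64];              /* shared; one element per thread */
        double node_sum = 0.0, result = 0.0;

        #pragma omp parallel
        {
            /* Each thread forms a private partial sum ... */
            double my_sum = 0.0;
            #pragma omp for
            for (int i = 0; i < n; i++)
                my_sum += data[i];

            /* ... and copies it into its own slot of the shared array,
               avoiding the need for an ATOMIC directive. */
            slot[omp_get_thread_num()] = my_sum;
            #pragma omp barrier

            /* The master thread reduces the shared array and then joins
               the MPI all reduce across LPARs. */
            #pragma omp master
            {
                for (int t = 0; t < omp_get_num_threads(); t++)
                    node_sum += slot[t];
                MPI_Allreduce(&node_sum, &result, 1, MPI_DOUBLE, MPI_SUM,
                              comm);
            }
        }   /* implicit barrier: result is visible to all threads here */
        return result;
    }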
Figure 2: Performance in Mbytes/s, as a function of message size, of the two-stage all reduce operations (MPI with the two-stage library, and mixed MPI/OpenMP) and the standard all reduce operation of the IBM MPI library.
A simple benchmark code, which executes an all reduce operation across a range of data sizes, has been used to compare the performance of these two techniques against the standard all reduce operation of the IBM MPI library. Figure 2 shows the performance (in Mbytes/s) of each operation over a range of data sizes on 32 processors. It is clear from this figure that the two-stage operation using multiple communicators is significantly faster than the standard MPI operation for all data sizes. As mentioned above, a collective operation is dominated by its slowest communication. The slowest communication within the standard MPI operation is the communication between LPARs, and this dominates the performance. By carrying out the process in two stages, the amount of data communicated between LPARs is reduced significantly, hence reducing the execution time of sending messages between LPARs.
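The benchmark itself is not reproduced here; a minimal measurement loop of this kind might look as follows (the message sizes, repetition count and use of MPI_DOUBLE are arbitrary illustrative choices, not those used on HPCx):

    /* Minimal all reduce bandwidth benchmark sketch: times the operation
       over a range of message sizes and reports Mbytes/s. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps = 100;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int n = 1; n <= 8192; n *= 2) {
            double *in  = calloc(n, sizeof(double));
            double *out = calloc(n, sizeof(double));

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int r = 0; r < reps; r++)
                MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM,
                              MPI_COMM_WORLD);
            double t = (MPI_Wtime() - t0) / reps;

            if (rank == 0)
                printf("%8zu bytes  %10.2f Mbytes/s\n", n * sizeof(double),
                       n * sizeof(double) / (t * 1.0e6));

            free(in);
            free(out);
        }
        MPI_Finalize();
        return 0;
    }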
For small data sizes, the two-stage mixed operation is faster again than the two-stage operation using multiple communicators. This operation also reduces the amount of data communicated between LPARs, reducing the execution time of sending messages between LPARs. In addition, it benefits from carrying out the communication within an LPAR through direct reads and writes to memory (using OpenMP), which eliminates the overhead of calling the MPI library. For larger data sizes, however, the performance of the mixed operation is worse than both the original MPI operation and the operation using the two-stage MPI library.

6. Conclusions

For applications that are dominated by collective communications, the two-stage MPI library offers a simple and quick mechanism to obtain a significant performance improvement. Using a mixed MPI/OpenMP version of the collective communication improves the performance for small data sizes, but has a detrimental effect for larger data sizes.

References

[1] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, June 1995. http://www.mpi-forum.org/.

[2] OpenMP, The OpenMP Architecture Review Board (ARB). http://www.openmp.org/.

[3] Two-stage collective communication library, developed by Stephen Booth, EPCC.