Mixed Mode Programming on Clustered SMP Systems

Lorna Smith and Mark Bull
EPCC, James Clerk Maxwell Building, The King's Buildings, The University of Edinburgh, Mayfield Road, Edinburgh, EH9 3JZ, U.K.
email: l.smith@epcc.ed.ac.uk or m.bull@epcc.ed.ac.uk

Abstract

MPI / OpenMP mixed mode codes could potentially offer the most effective parallelisation strategy for an SMP cluster, as well as allowing the different characteristics of both paradigms to be exploited to give the best performance on a single SMP. This paper discusses the implementation, development and performance of mixed mode MPI / OpenMP applications. While this style of programming is often not the most effective mechanism on SMP systems, significant performance benefit can be obtained on codes with certain communication patterns, such as those with a large number of collective communications.

1. Introduction

Shared memory architectures have gradually become more prominent in the HPC market, as advances in technology have allowed larger numbers of CPUs to have access to a single memory space. In addition, manufacturers have increasingly clustered these SMP systems together to go beyond the limits of a single system. These clustered SMPs have recently become more prominent in the UK, one such example being the HPCx system - the UK's newest and largest National High Performance Computing system, comprising 40 IBM Regatta-H SMP nodes, each containing 32 POWER4 processors. Hence it has become important for applications to be portable and efficient on these systems.

Message passing codes written in MPI are obviously portable and should transfer easily to clustered SMP systems. Whilst message passing may be necessary to communicate between nodes, it is not immediately clear that this is the most efficient parallelisation technique within an SMP node. In theory, a shared memory model such as OpenMP should offer a more efficient parallelisation strategy within an SMP node. Hence a combination of shared memory and message passing parallelisation paradigms within the same application (mixed mode programming) may provide a more efficient parallelisation strategy than pure MPI.

In this paper we will compare and contrast MPI and OpenMP before discussing the potential benefits of a mixed mode strategy on an SMP system. We will then examine the performance of a collective communication routine, using pure MPI and mixed MPI/OpenMP.

2. HPCx

HPCx is the UK's newest and largest National High Performance Computing system. It has been funded by the Engineering and Physical Sciences Research Council (EPSRC). The project is run by the HPCx Consortium, a consortium involving The University of Edinburgh, EPCC, CCLRC's Daresbury Laboratory and IBM.

HPCx consists of 40 IBM p690 Regatta nodes, each containing 32 POWER4 processors. Within a node there are 16 chips: each chip contains two processors with their own level 1 caches and a shared level 2 cache. The chips are packaged into a Multi-Chip Module (MCM) containing 4 chips (8 processors) and a 128 Mbyte level 3 cache, which is shared by all 8 processors in the MCM. Each Regatta node contains 4 MCMs and 32 Gbytes of main memory. The MCMs are connected to each other and to main memory by a 4-way bus interconnect to form a 32-way symmetric multi-processor (SMP). In order to increase the communication bandwidth of the system, each Regatta node has been divided into 4 logical partitions (LPARs), coinciding with each MCM.
Each LPAR runs its own copy of the AIX operating system and operates as an 8-way SMP.

3. Programming model characteristics

The message passing programming model is a distributed memory model with explicit control parallelism. MPI [1] is portable to both distributed and shared memory architectures and allows static task scheduling. The explicit parallelism often provides better performance, and a number of optimised collective communication routines are available for optimal efficiency. Data placement problems are rarely observed and synchronisation occurs implicitly with subroutine calls and hence is minimised naturally. However, MPI suffers from a few deficiencies. Decomposition, development and debugging of applications can be time consuming and significant code changes are often required. Communications can create a large overhead and the code granularity often has to be large to minimise the latency. Finally, global operations can be very expensive.

OpenMP is an industry standard [2] for shared memory programming. Based on a combination of compiler directives, library routines and environment variables, it is used to specify parallelism on shared memory machines. Communication is implicit and OpenMP applications are relatively easy to implement. In theory, OpenMP makes better use of the shared memory architecture. Run time scheduling is allowed and both fine and coarse grain parallelism are effective. OpenMP codes will however only run on shared memory machines, and the placement policy of data may cause problems. Coarse grain parallelism often requires a parallelisation strategy similar to an MPI strategy, and explicit synchronisation is required.

By utilising a mixed mode programming model we should be able to take advantage of the benefits of both models. For example, a mixed mode program may allow the data placement policies of MPI to be utilised with the finer grain parallelism of OpenMP.

The majority of mixed mode applications involve a hierarchical model: MPI parallelisation occurring at the top level, and OpenMP parallelisation occurring below. For example, Figure 1 shows a 2D grid which has been divided geometrically between four MPI processes. These sub-arrays have then been further divided between three OpenMP threads. This model closely maps to the architecture of an SMP cluster, the MPI parallelisation occurring between the SMP nodes and the OpenMP parallelisation within the nodes.

Figure 1: Schematic representation of a hierarchical mixed mode programming model for a 2D array (the array is divided between MPI processes 0-3, and each sub-array is further divided between OpenMP threads 0-2).
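To make the hierarchical model concrete, the sketch below shows a minimal mixed mode structure in C: MPI decomposes a 2D array by rows across processes, and OpenMP threads share the rows owned by each process. The array dimensions, the row-block decomposition and the initialisation kernel are illustrative assumptions only and are not taken from any particular application.

/* Minimal sketch of the hierarchical mixed mode model: MPI between
   processes (nodes), OpenMP threads within each process. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NROWS 1024
#define NCOLS 1024

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI level: each process owns a contiguous block of rows of the 2D
       array (assumes NROWS is divisible by the number of processes). */
    int local_rows = NROWS / size;
    double *local = malloc((size_t)local_rows * NCOLS * sizeof(double));

    /* OpenMP level: threads share the iterations over the local block. */
    #pragma omp parallel for
    for (int i = 0; i < local_rows; i++)
        for (int j = 0; j < NCOLS; j++)
            local[i * NCOLS + j] = (double)(rank * local_rows + i + j);

    /* Halo exchanges and global reductions between processes would be
       performed with MPI outside the OpenMP parallel regions, exactly as
       in a pure MPI code. */
    free(local);
    MPI_Finalize();
    return 0;
}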
4. Benefits of mixed mode programming

This section discusses various situations where a mixed mode code may be more efficient than a corresponding MPI implementation, whether on an SMP cluster or a single SMP system.

Codes which scale poorly with MPI

One of the largest areas of potential benefit from mixed mode programming is with codes which scale poorly with increasing MPI processes. One of the most common reasons for an MPI code to scale poorly is load imbalance. For example, irregular applications such as adaptive mesh refinement codes suffer from load balance problems when parallelised using MPI. By developing a mixed mode code for a clustered SMP system, MPI need only be used for communication between nodes, creating a coarser grained problem. The OpenMP implementation may not suffer from load imbalance and hence the performance of the code would be improved.

Fine grain parallelism problems

OpenMP generally gives better performance on fine grain problems, where an MPI application may become communication dominated. Hence when an application requires good scaling with a fine grain level of parallelism, a mixed mode program may be more efficient. Obviously a pure OpenMP implementation would give better performance still; however, on SMP clusters MPI parallelism is still required for communication between nodes. By reducing the number of MPI processes required, the scaling of the code should be improved.

Replicated data

Codes written using a replicated data strategy often suffer from memory limitations and from poor scaling due to global communications. By using a mixed mode programming style on an SMP cluster, with the MPI parallelisation occurring across the nodes and the OpenMP parallelisation inside the nodes, the problem will be limited to the memory of an SMP node rather than the memory of a processor (or, to be precise, the memory of an SMP node divided by the number of processors), as is the case for a pure MPI implementation. This has obvious advantages, allowing more realistic problem sizes to be studied.

Restricted MPI process applications

A number of MPI applications require a specific number of processes to run. Whilst this may be a natural and efficient implementation, it limits the number of MPI processes to certain combinations. By developing a mixed mode MPI/OpenMP code, the natural MPI decomposition strategy can be used, running the desired number of MPI processes, and OpenMP threads used to further distribute the work, allowing all the available processors to be used effectively.

Poorly optimised intra-node MPI

Although a number of vendors have spent considerable amounts of time optimising their MPI implementations within a shared memory architecture, this may not always be the case. On a clustered SMP system, if the MPI implementation has not been optimised, the performance of a pure MPI application across the system may be poorer than a mixed MPI / OpenMP code. This is obviously vendor specific, but in certain cases a mixed mode code could offer significant performance improvement. For example, IBM's MPI is not optimised for clustered systems.

Poor scaling of the MPI implementation

Clustered SMPs open the way for systems to be built with ever increasing numbers of processors. In certain situations the scaling of the MPI implementation itself may not match these ever increasing processor numbers, or may indeed be restricted to a certain maximum number. In this situation developing a mixed mode code may be of benefit (or required), as the number of MPI processes needed will be reduced and replaced with OpenMP threads.

5. Collective Communications

Having discussed various situations where a mixed mode code may be more efficient than a corresponding MPI implementation, this section considers one specific situation relevant to HPCx: collective communications.

Many scientific applications use collective communications, and to achieve good scaling on clustered SMP systems such as HPCx these communications need to be implemented efficiently. Collective communications were included in the MPI standard to allow developers to implement optimised versions of essential communication patterns. A number of collective operations can be efficiently implemented using tree algorithms - including Broadcast, Gather, Scatter and Reduce. On a system such as HPCx, where communication is faster within a node than between nodes, the tree algorithm should be constructed such that communications corresponding to branches of the tree at the same level run at the same speed; otherwise the speed of each stage of the algorithm will be limited by the performance of the slowest communication.

To demonstrate this, two techniques have been used. Firstly, a library has been used that, by creating multiple communicators, performs collective operations in two stages [3]. For example, an all reduce operation involves a reduction operation carried out across the processors within an LPAR, followed by an all reduce of these results across LPARs. This library uses the MPI profiling interface and hence is easy to use, as it requires no code modification.
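The library itself sits behind the MPI profiling interface, so its internals are not reproduced here; the sketch below only illustrates the underlying idea of a two-stage all reduce built from multiple communicators, assuming 8 processes per LPAR (one HPCx logical partition) and a single double-precision value per process. The function name and the fixed PROCS_PER_LPAR constant are illustrative assumptions.

/* Illustrative two-stage all reduce using multiple communicators (not the
   library of [3]): reduce within each LPAR, all reduce across the LPAR
   masters, then broadcast the result back within each LPAR. */
#include <mpi.h>

#define PROCS_PER_LPAR 8   /* one HPCx logical partition */

double two_stage_allreduce(double local_value, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Split the global communicator into one communicator per LPAR and
       one communicator containing only the LPAR masters. */
    MPI_Comm intra, inter;
    MPI_Comm_split(comm, rank / PROCS_PER_LPAR, rank, &intra);
    int intra_rank;
    MPI_Comm_rank(intra, &intra_rank);
    MPI_Comm_split(comm, intra_rank == 0 ? 0 : MPI_UNDEFINED, rank, &inter);

    double lpar_sum = 0.0, global_sum = 0.0;

    /* Stage 1: reduce within the LPAR onto the LPAR master. */
    MPI_Reduce(&local_value, &lpar_sum, 1, MPI_DOUBLE, MPI_SUM, 0, intra);

    /* Stage 2: all reduce across the LPAR masters only. */
    if (inter != MPI_COMM_NULL)
        MPI_Allreduce(&lpar_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, inter);

    /* Return the global result to every process within the LPAR. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, intra);

    if (inter != MPI_COMM_NULL) MPI_Comm_free(&inter);
    MPI_Comm_free(&intra);
    return global_sum;
}

In a real code the intra- and inter-LPAR communicators would of course be created once during initialisation and reused for every collective call, rather than rebuilt each time.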
Secondly, a mixed mode version of the code has been developed that also performs the operation in two stages: stage 1 uses OpenMP and stage 2 MPI. For example, an all reduce operation involves a reduction operation carried out using OpenMP across threads within an LPAR, followed by an MPI all reduce operation of these results across LPARs. To ensure the MPI and OpenMP operations are comparable, the OpenMP reduction operation is carried out across variables that are private to each thread. To avoid the use of an ATOMIC directive, each private variable is copied to a different element of a shared array, based on its thread identifier. The contents of this shared array are then reduced into a second shared array by the master thread. The master thread on each LPAR is then involved in an MPI all reduce operation across LPARs.
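A sketch of this mixed mode two-stage all reduce is given below. It follows the description above: each thread writes its private value into a slot of a shared array indexed by its thread identifier, the master thread reduces that array and then calls the MPI all reduce across LPARs. The function and variable names, and the single-double payload per thread, are illustrative assumptions rather than the actual benchmark code.

/* Sketch of the mixed MPI/OpenMP two-stage all reduce described above.
   Stage 1: OpenMP reduction within an LPAR via a shared array indexed by
   thread id (avoiding an ATOMIC directive).  Stage 2: MPI all reduce of
   the per-LPAR results across LPARs, performed by the master thread.
   MPI must provide at least funnelled thread support, since the MPI call
   is made inside a parallel region (by the master thread only). */
#include <mpi.h>
#include <omp.h>

double mixed_allreduce_demo(MPI_Comm comm)
{
    static double partial[64];   /* shared: one slot per thread (upper bound) */
    static double lpar_sum;      /* shared: reduction over the LPAR's threads */
    static double global_sum;    /* shared: result of the MPI all reduce      */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Each thread holds a private value to be reduced; here it is
           just a placeholder computation. */
        double my_value = (double)(tid + 1);

        /* Stage 1a: copy the private value into this thread's own slot of
           the shared array, so no ATOMIC directive is needed. */
        partial[tid] = my_value;
        #pragma omp barrier

        /* Stage 1b: the master thread reduces the shared array ... */
        #pragma omp master
        {
            lpar_sum = 0.0;
            for (int i = 0; i < nthreads; i++)
                lpar_sum += partial[i];

            /* Stage 2: ... and takes part in the MPI all reduce across
               LPARs (one MPI process per LPAR). */
            MPI_Allreduce(&lpar_sum, &global_sum, 1, MPI_DOUBLE,
                          MPI_SUM, comm);
        }
        #pragma omp barrier   /* all threads wait for the global result */
    }
    return global_sum;
}

Because each thread writes only to its own slot of the shared array, no ATOMIC directive or critical section is required; the explicit barriers simply order the per-thread copies, the master's reduction and the return of the result.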
Figure 2: Performance in Mbytes/s of the two-stage all reduce operations and the standard all reduce operation of the IBM MPI library (curves: MPI, MPI with 2-stage library, Mixed MPI/OpenMP; x-axis: message size).

A simple benchmark code, that executes an all reduce operation across a range of data sizes, has been used to compare the performance of these two techniques against the standard all reduce operation of the IBM MPI library. Figure 2 shows the performance (in Mbytes/s) for each operation on a range of data sizes on 32 processors. It is clear from this diagram that the two-stage operation, using multiple communicators, is significantly faster than the standard MPI operation for all data sizes. As mentioned above, the collective operation is dominated by its slowest communication. The slowest communication within the standard MPI operation is communication between LPARs, and this is dominating the performance. By carrying out this process in two stages, the amount of data being communicated between LPARs has been reduced significantly, hence reducing the execution time of sending messages between LPARs.

For small data sizes, the two-stage mixed operation is faster again than the two-stage operation using multiple communicators. This operation also reduces the amount of data being communicated between LPARs, reducing the execution time for sending messages between LPARs. In addition, this operation benefits from carrying out communication within an LPAR through direct reads and writes to memory (using OpenMP), which eliminates the overhead of calling the MPI library. For larger data sizes however, the performance of this mixed operation is worse than both the original MPI operation and the operation with the two-stage MPI library.

6. Conclusions

For applications that are dominated by collective communications, the two-stage MPI library offers a simple and quick mechanism to obtain significant performance improvement. Using a mixed MPI/OpenMP version of the collective communication improves the performance for low data sizes, but has a detrimental effect for larger data sizes.

References

[1] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, June 1995. http://www.mpi-forum.org/.

[2] The OpenMP Architecture Review Board (ARB). http://www.openmp.org/.

[3] Two-stage collective communication library, developed by Stephen Booth, EPCC.