Parallel Data Transfer in the Model Coupling Toolkit

Robert L. Jacob
J. Walter Larson
Mathematics and Computer Science Division*
Argonne National Laboratory
Many of the world’s current high-performance computing platforms are microprocessor-based,
scalable, distributed-memory, parallel-architecture computers. These machines present unique
challenges to the design of climate models and in particular to the design of the coupler—the
component that acts as a conduit of data between submodels of a climate model. The Model Coupling
Toolkit (MCT) is a Fortran90 library built on top of MPI with data types and methods that simplify
the construction of distributed-memory parallel couplers. Below we explain how the MCT simplifies
parallel data transfer. We also give performance data for simple test cases using an early version of
MCT.
The Model Coupling Toolkit is a product of the Department of Energy’s Accelerated Climate
Prediction Initiative Avant Garde project. One goal of this project is to increase the performance and
scalability of the NCAR Community Climate System Model (CCSM) and its components on parallel
systems. One of those components, the current CCSM flux coupler, contains shared-memory
parallelism but does not support distributed-memory data parallelism, a design that limits its
scalability on distributed-memory platforms. An additional goal of the Avant Garde project is to
increase the flexibility of the coupler: to make it easier to change how much data is exchanged, which models
form the coupled system, and how many models are coupled together. By using the objects and
methods of the Model Coupling Toolkit, the new CCSM flux coupler should be able to meet these
goals with a relatively small amount of code.
The essential function of a coupler is to repeatedly transfer data, such as atmospheric temperature and
wind speed, to other models that need this data as a boundary forcing term. The coupler must also do
additional, computationally significant work, such as interpolating the data onto a different grid or
time-averaging it. When the entire coupler, or the entire climate model, resides in a single memory
image, the data transfer is easy to conceptualize and implement as an all-to-one or one-to-one
exchange of messages between the coupler and the components. But when the coupler is a
distributed-memory parallel application, its decomposition of state may differ from that of the model
it is communicating with. This complexity can be avoided by communicating with only one node of
the coupler and broadcasting from there, but that strategy presents a bottleneck to scalability.
The most efficient, scalable communication scheme is a parallel data transfer where each node of a
component transfers coincident data to corresponding coupler nodes. The MCT provides data types
and functions that automatically determine this transfer pattern between any two decompositions of a
numerical grid.
The MCT’s treatment of a parallel data transfer is illustrated in Figure 1. The figure shows a simple
numerical grid for a possible atmosphere model. In the atmosphere (left side of Fig. 1), this grid is
decomposed over four processors (MPI processes) in a simple “checkerboard” pattern. In the coupler
(right side of Fig. 1), entire rows of the same grid have been assigned to three separate processors.
* This work is supported by the Office of Science of the U.S. Department of Energy.
[Figure 1 panels: the 20-point atmosphere grid decomposed in a checkerboard pattern over four
processors (left); the same grid decomposed by rows over three coupler processors (right); the Router
connecting the two decompositions; and the GlobalSegMap for each side, listing for every segment its
processor location (pe_loc), starting grid-point number (start), and segment length (length).
Atmosphere GlobalSegMap (pe_loc, start, length): (0,1,2), (0,5,2), (0,9,2), (1,3,2), (1,7,2),
(1,11,2), (2,13,2), (2,17,2), (3,15,2), (3,19,2).
Coupler GlobalSegMap (pe_loc, start, length): (0,1,8), (1,9,8), (2,17,4).]
Figure 1: Illustration of Model Coupling Toolkit concepts used in a parallel data transfer
The following steps are necessary to complete a parallel transfer of data from the atmosphere grid
to the coupler using MCT:
• The grid is numbered as shown in Fig. 1. Each point in the numerical grid is given a unique
integer, and the same physical point is given the same integer in both the atmosphere and the
coupler. The user may determine the numbering scheme.
• A GlobalSegMap (global segment map) is defined. The GlobalSegMap is an MCT datatype that describes
how the numbered grid points are assigned to processors. The decomposition of a grid divides the
numbered grid into segments, as shown in Fig. 1. The GlobalSegMap records, for each
segment, its MPI process rank (pe_loc), the number assigned to the first grid
point in the segment (start), and the number of consecutively numbered points in the segment
(length). To initialize a GlobalSegMap, the user provides these three arrays, which are then
stored in the GlobalSegMap datatype; a sketch of constructing them for the decomposition in
Fig. 1 appears after this list. Each process is given a copy of the entire GlobalSegMap.
• Given two GlobalSegMaps, the MCT automatically determines a Router. A Router is an MCT
datatype that describes how to transfer, or route, data between two decompositions of a numerical
grid (see Fig. 1). Like the GlobalSegMap, the Router contains start and length information
for subsegments of the locally owned GlobalSegMap segments, along with the rank of the
remote process that also owns those points. The Router works simultaneously as both a send map
and a receive map.
• Parallel data transfer occurs when all processes call MCT_Send or MCT_Recv. These routines
take two arguments: an AttributeVector (an MCT datatype that contains all the data to be sent or
received) and a Router. The details of sending and receiving to individual processors with
MPI_Send/MPI_Recv are handled internally by MCT_Send/MCT_Recv using the Router; the second
sketch below illustrates the calling sequence.
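
As a concrete illustration of the first two steps, the following Fortran 90 sketch builds the pe_loc,
start, and length arrays for the checkerboard atmosphere decomposition of Figure 1 and hands them to
the GlobalSegMap initialization. The module and routine names follow MCT conventions, but the exact
argument order of the initialization call and the component identifier AtmID are assumptions for
illustration only; consult the MCT documentation for the precise interface.

  ! Sketch: building the atmosphere GlobalSegMap of Figure 1.
  ! The grid is 4 points wide and 5 rows high, numbered 1..20 from the
  ! lower-left corner; processors 0-3 own the four checkerboard blocks.
  program build_atm_gsmap
    use m_GlobalSegMap, only : GlobalSegMap, GlobalSegMap_init => init
    implicit none
    include 'mpif.h'

    integer, parameter :: AtmID = 1      ! hypothetical component identifier
    integer, parameter :: nseg  = 10     ! ten segments in Fig. 1
    integer :: start(nseg), length(nseg), pe_loc(nseg)
    type(GlobalSegMap) :: atmGSMap
    integer :: ierr

    ! Segment description read directly off Figure 1:
    ! processor 0 owns points 1-2, 5-6, 9-10; processor 1 owns 3-4, 7-8, 11-12;
    ! processor 2 owns 13-14, 17-18; processor 3 owns 15-16, 19-20.
    start  = (/ 1, 5, 9, 3, 7, 11, 13, 17, 15, 19 /)
    length = (/ 2, 2, 2, 2, 2,  2,  2,  2,  2,  2 /)
    pe_loc = (/ 0, 0, 0, 1, 1,  1,  2,  2,  3,  3 /)

    call MPI_Init(ierr)

    ! Assumed initialization call: the three arrays are handed to MCT, which
    ! stores a copy of the whole map on every process.  The argument order is
    ! illustrative only (MCT world/component setup is also omitted for brevity).
    call GlobalSegMap_init(atmGSMap, start, length, pe_loc, 0, MPI_COMM_WORLD, AtmID)

    call MPI_Finalize(ierr)
  end program build_atm_gsmap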
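
Once both GlobalSegMaps exist, the transfer itself reduces to a few calls. The fragment below sketches
the atmosphere (sending) side; the coupler side mirrors it with MCT_Recv. The field list 't:u:v', the
component identifier CplID, the module renames behind MCT_Send, and the argument orders of
AttrVect_init and Router_init are illustrative assumptions based on the description above, not a
definitive statement of the MCT interface.

  ! Sketch: the atmosphere side of one parallel transfer.  Assumes the
  ! GlobalSegMap atmGSMap, the atmosphere communicator atm_comm, and the
  ! coupler's (hypothetical) component identifier CplID already exist.
  subroutine atm_send_to_coupler(atmGSMap, atm_comm, CplID, npts)
    use m_GlobalSegMap, only : GlobalSegMap
    use m_AttrVect,     only : AttrVect, AttrVect_init => init
    use m_Router,       only : Router, Router_init => init
    use m_Transfer,     only : MCT_Send => send
    implicit none
    type(GlobalSegMap), intent(in) :: atmGSMap
    integer,            intent(in) :: atm_comm, CplID, npts

    type(AttrVect) :: atmData   ! the fields to be sent
    type(Router)   :: atm2cpl   ! precomputed route to the coupler

    ! One value per locally owned grid point for each named field
    ! (the field list 't:u:v' is purely illustrative).
    call AttrVect_init(atmData, rList='t:u:v', lsize=npts)

    ! MCT determines the transfer pattern from the two GlobalSegMaps; only the
    ! local map and the other component's identifier are needed here.
    call Router_init(CplID, atmGSMap, atm_comm, atm2cpl)

    ! ... fill atmData with the model state to be exported ...

    ! The per-processor MPI_Send calls are handled inside MCT_Send,
    ! guided by the Router.
    call MCT_Send(atmData, atm2cpl)
  end subroutine atm_send_to_coupler

On the coupler processes the sequence is mirrored: a Router is built from the coupler's GlobalSegMap
and the atmosphere's identifier, and a call of the form MCT_Recv(cplData, cpl2atm) fills the coupler's
AttributeVector with the transferred fields.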
The performance of MCT_Send and MCT_Recv for ten transfers of sixteen fields on a T42 (128x64)
atmospheric grid is shown in Figure 2 as measured on an IBM SP3 (375 MHz). In this simple case,
the decomposition strategy of the grid and the number of processors are the same (latitudes divided
evenly between processors) for the atmosphere and the coupler. The performance for this simple case
is as expected: transfer time decreases as the message size decreases and the number of processors
assigned to each model increases.
Figure 2: MCT performance for parallel data transfer in a simple case (right panel). The left and center
panels show the decomposition for four coupler and atmosphere nodes, respectively. MCT_Recv (dotted
line) takes slightly longer because it must finish copying data from buffers into the AttributeVector.
MCT can also handle more complex cases such as that illustrated in Figure 3. In this case, the
atmosphere has a very different decomposition from the coupler, and the number of nodes assigned to
each is not the same. The Router between these two decompositions was automatically determined by
MCT. The number of coupler nodes was varied for each of three cases: with the atmosphere on 8
(black), 16 (red), and 32 (blue) nodes. The poor scaling may be an unavoidable result of doing a
parallel data transfer between two very dissimilar decompositions. But the overall transfer time is still
very small compared with the time the full model will spend computing 10 timesteps. Moreover, the
users/developers are relieved of determining the complex transfer pattern themselves.
Figure 3: MCT performance for parallel data transfer in a complex case. The decomposition is shown
for 16 atmosphere and 4 coupler nodes (left two panels).
The MCT is intended to simplify the construction of distributed-memory parallel couplers. It
provides simple routines for parallel data transfer that hide its complexity while delivering
good performance. Although we have given examples with fixed numerical grids, the datatypes and
methods in MCT have the flexibility to handle reduced grids, finite-element meshes, or
arbitrary multidimensional arrays. Moreover, the toolkit can be used to couple other models besides the
components of a climate model. Future work on MCT will provide automatic transfer mechanisms
for models that share processors but still have different decompositions.