Parallel Data Transfer in the Model Coupling Toolkit

Robert L. Jacob and J. Walter Larson
Mathematics and Computer Science Division, Argonne National Laboratory*

*This work is supported by the Office of Science of the U.S. Department of Energy.

Many of the world's current high-performance computing platforms are microprocessor-based, scalable, distributed-memory, parallel-architecture computers. These machines present unique challenges to the design of climate models and, in particular, to the design of the coupler, the component that acts as a conduit of data between the submodels of a climate model. The Model Coupling Toolkit (MCT) is a Fortran90 library built on top of MPI with data types and methods that simplify the construction of distributed-memory parallel couplers. Below we explain how MCT simplifies parallel data transfer. We also give performance data for simple test cases using an early version of MCT.

The Model Coupling Toolkit is a product of the Department of Energy's Accelerated Climate Prediction Initiative Avant Garde project. One goal of this project is to increase the performance and scalability of the NCAR Community Climate System Model (CCSM) and its components on parallel systems. One of those components, the current CCSM flux coupler, contains shared-memory parallelism but does not support distributed-memory data parallelism, a design that limits scalability on distributed-memory platforms. An additional goal of the Avant Garde project is to increase the flexibility of the coupler: to make it easier to change how much data is exchanged, which models form the coupled system, and how many models are coupled together. By using the objects and methods of the Model Coupling Toolkit, the new CCSM flux coupler should be able to meet these goals with a relatively small amount of code.

The essential function of a coupler is to repeatedly transfer data, such as atmospheric temperature and wind speed, to other models that need this data as a boundary forcing term. The coupler must also do additional, computationally significant work, such as interpolating the data onto a different grid or time-averaging it. When the entire coupler, or the entire climate model, resides in a single memory image, the data transfer is easy to conceptualize and implement as an all-to-one or one-to-one exchange of messages between the coupler and the components. But when the coupler is a distributed-memory parallel application, it has a decomposition of state that may differ from that of the model it is communicating with. This complexity can be avoided by communicating with only one node of the coupler and broadcasting from there, but that strategy would still present a bottleneck to scalability. The most efficient, scalable communication scheme is a parallel data transfer in which each node of a component transfers coincident data to the corresponding coupler nodes. MCT provides data types and functions that automatically determine this transfer pattern between any two decompositions of a numerical grid.

MCT's treatment of a parallel data transfer is illustrated in Figure 1. The figure shows a simple numerical grid for a possible atmosphere model. In the atmosphere (left side of Fig. 1), this grid is decomposed over four processors (MPI processes) in a simple "checkerboard" pattern. In the coupler (right side of Fig. 1), entire rows of the same grid have been assigned to three separate processors.
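To make the two layouts in Fig. 1 concrete, the short Fortran90 fragment below spells them out as explicit lists of (pe_loc, start, length) segments, as read off the figure. It is an illustration only: the program and its variable names are ours and are not part of the MCT interface.

    ! A minimal sketch (not MCT code): the Fig. 1 decompositions written as
    ! explicit (pe_loc, start, length) segment lists.  The grid points are
    ! numbered 1-20 over a grid of 5 rows of 4 points each.
    program fig1_segments
      implicit none

      ! Atmosphere: "checkerboard" layout over 4 processes; each process owns
      ! several short runs of 2 consecutively numbered points.
      integer, parameter :: atm_pe_loc(10) = (/ 0,0,0, 1,1,1, 2,2, 3,3 /)
      integer, parameter :: atm_start(10)  = (/ 1,5,9, 3,7,11, 13,17, 15,19 /)
      integer, parameter :: atm_length(10) = (/ 2,2,2, 2,2,2, 2,2, 2,2 /)

      ! Coupler: whole rows over 3 processes; only three long segments.
      integer, parameter :: cpl_pe_loc(3) = (/ 0, 1, 2 /)
      integer, parameter :: cpl_start(3)  = (/ 1, 9, 17 /)
      integer, parameter :: cpl_length(3) = (/ 8, 8, 4 /)

      write(*,*) 'atmosphere: ', size(atm_start), ' segments, ', &
                 sum(atm_length), ' points'
      write(*,*) 'coupler:    ', size(cpl_start), ' segments, ', &
                 sum(cpl_length), ' points'
    end program fig1_segments

The atmosphere's layout produces many short segments, while the coupler's produces a few long ones; the GlobalSegMap described below captures exactly this information.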
Figure 1: Illustration of Model Coupling Toolkit concepts used in a parallel data transfer. (The figure shows the numbered grid under the atmosphere and coupler decompositions, the GlobalSegMap describing each, and the Router between them.)

The following steps complete a parallel data transfer of data from the atmosphere grid to the coupler using MCT:

1. The grid is numbered, as shown in Fig. 1. Each point in the numerical grid is given a unique integer, and the same physical point is given the same integer in both the atmosphere and the coupler. The user may determine the numbering scheme.

2. A GlobalSegMap is defined. The GlobalSegMap is an MCT datatype that describes how the numbered grid points are assigned to processors. The decomposition of a grid divides the numbered points into segments, as shown in Fig. 1. The GlobalSegMap records, for each segment, its MPI process rank (pe_loc), the number assigned to the first grid point in the segment (start), and the number of consecutively numbered points in the segment (length). To initialize a GlobalSegMap, the user supplies these three arrays, which are then stored in the GlobalSegMap datatype. Each local process is given a copy of the entire GlobalSegMap.

3. Given two GlobalSegMaps, MCT automatically determines a Router. A Router is an MCT datatype that describes how to transfer, or route, data between two decompositions of a numerical grid (see Fig. 1). Like the GlobalSegMap, the Router contains start and length information for subsegments of the locally owned GlobalSegMap segments, together with the rank of the remote process that also owns those points. The Router works simultaneously as both a send map and a receive map.

4. Parallel data transfer occurs when all processes call MCT_Send or MCT_Recv. These routines take two arguments: an AttributeVector (an MCT datatype that contains all the data to be sent or received) and a Router. The details of sending and receiving to individual processors with MPI_Send/MPI_Recv are handled internally by MCT_Send/MCT_Recv using the Router.
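Putting the four steps together, the following Fortran90 sketch shows the shape of the calling sequence on one atmosphere process. Only the GlobalSegMap, Router, and AttributeVector datatypes and the MCT_Send/MCT_Recv routines are named above; the module names, initialization routines, argument lists, and the field list 't:u:v' used below are assumptions made for illustration and may differ from the actual MCT interface.

    ! A hedged sketch of the calling sequence described above.  Module and
    ! initialization routine names are assumed for illustration; only the
    ! datatypes and MCT_Send/MCT_Recv come from the text.
    subroutine atm_side_transfer(atm_comm, cplGSMap)
      use m_GlobalSegMap, only : GlobalSegMap, GlobalSegMap_init  ! assumed names
      use m_Router,       only : Router, Router_init              ! assumed names
      use m_AttrVect,     only : AttributeVector, AttrVect_init   ! assumed names
      use m_Transfer,     only : MCT_Send                         ! assumed module
      implicit none
      integer,            intent(in) :: atm_comm  ! atmosphere MPI communicator
      type(GlobalSegMap), intent(in) :: cplGSMap  ! coupler's map, already known

      type(GlobalSegMap)    :: atmGSMap
      type(Router)          :: atm2cpl
      type(AttributeVector) :: atmData
      integer               :: starts(3), lengths(3)

      ! Steps 1-2: describe the segments this process owns (placeholder values
      ! from processor 0 of Fig. 1) and build the atmosphere GlobalSegMap.
      starts  = (/ 1, 5, 9 /)
      lengths = (/ 2, 2, 2 /)
      call GlobalSegMap_init(atmGSMap, starts, lengths, atm_comm)

      ! Step 3: given the two GlobalSegMaps, MCT determines the Router.
      call Router_init(atm2cpl, atmGSMap, cplGSMap, atm_comm)

      ! Step 4: load the fields to be sent into an AttributeVector and send;
      ! the coupler processes make the matching MCT_Recv call.
      call AttrVect_init(atmData, rList='t:u:v', lsize=sum(lengths))
      call MCT_Send(atmData, atm2cpl)
    end subroutine atm_side_transfer

The point of the sequence is that the user never constructs the message pattern by hand: the Router, built from the two GlobalSegMaps, carries everything MCT_Send and MCT_Recv need.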
The performance of MCT_Send and MCT_Recv for ten transfers of sixteen fields on a T42 (128x64) atmospheric grid is shown in Figure 2, as measured on an IBM SP3 (375 MHz). In this simple case, the decomposition strategy of the grid and the number of processors are the same for the atmosphere and the coupler (latitudes divided evenly between processors). The performance is as expected: transfer time decreases as the message size decreases and the number of processors assigned to each model increases. MCT_Recv (dotted line) takes slightly longer than MCT_Send because it must finish copying data from buffers into the AttributeVector.

Figure 2: MCT performance for parallel data transfer in a simple case (right panel). The left and center panels show the decomposition for four coupler and atmosphere nodes, respectively.

MCT can also handle more complex cases, such as that illustrated in Figure 3. In this case the atmosphere has a very different decomposition from the coupler, and the number of nodes assigned to each is not the same. The Router between these two decompositions was determined automatically by MCT. The number of coupler nodes was varied for each of three cases: with the atmosphere on 8 (black), 16 (red), and 32 (blue) nodes.

The poor scaling may be an unavoidable result of performing a parallel data transfer between two very dissimilar decompositions. But the overall transfer time is still very small compared with the time the full model will spend computing ten timesteps. Moreover, users and developers are relieved of determining the complex transfer pattern themselves.

Figure 3: MCT performance for parallel data transfer in a complex case. The decomposition is shown for 16 atmosphere and 4 coupler nodes (left two figures).

The MCT is intended to simplify the construction of distributed-memory parallel couplers. It provides simple routines for performing parallel data transfers that hide the underlying complexity while providing good performance. Although we have given examples with fixed numerical grids, the datatypes and methods in MCT have the flexibility to handle reduced grids, finite-element meshes, or any arbitrary multidimensional array. Moreover, the toolkit can be used to couple models other than the components of a climate model. Future work on MCT will provide automatic transfer mechanisms for models that share processors but still have different decompositions.