Improving the Performance of Collective Operations in MPICH

Rajeev Thakur and William Gropp
Mathematics and Computer Science Division
Argonne National Laboratory
9700 S. Cass Avenue, Argonne, IL 60439, USA
{thakur, gropp}@mcs.anl.gov

1. Introduction

The performance of parallel applications on cluster systems depends heavily on the performance of the MPI implementation. Collective communication is an important and frequently used component of MPI. This paper focuses on improving the performance of all the collective operations in MPICH. The initial target architecture is a cluster of machines connected by a switch, such as Myrinet or the IBM SP switch. For each collective operation, multiple algorithms are used and selected according to message size: the algorithm for short messages aims to minimize latency, while the algorithm for long messages focuses on bandwidth usage. This paper describes the algorithms for allgather, broadcast, reduce-scatter, and reduce. The performance of the algorithms is evaluated with the SKaMPI benchmark.

2. Cost Model

The cost of a collective operation is expressed in terms of latency and bandwidth usage. The time taken to send a message between any two nodes is modeled as $\alpha + n\beta$, where $\alpha$ is the latency (or startup time) per message, independent of message size, $\beta$ is the transfer time per byte, and $n$ is the number of bytes transferred. Each node's network interface is assumed to be single ported: at most one message can be sent and one message can be received simultaneously. In the case of reduction operations, $\gamma$ is the computation cost per byte for performing the reduction operation locally on any process. The number of processes is denoted by $p$.

3. Algorithms

This section describes the new algorithms and their cost equations.

3.1. Allgather

MPI_Allgather gathers data from all tasks and distributes it to all processes in the group. The old algorithm uses a ring method in which the data from each process is sent around a virtual ring of processes. The ring algorithm takes $p - 1$ steps, and in each step every process sends and receives $n/p$ amount of data, so it takes $T_{ring} = (p-1)\alpha + \frac{p-1}{p} n\beta$. The new algorithm uses recursive doubling, which takes $\lg p$ steps, so it takes $T_{rec\_dbl} = \lg p \, \alpha + \frac{p-1}{p} n\beta$. In the actual implementation, recursive doubling is used for short messages and the ring algorithm for long messages.

3.2. Broadcast

MPI_Bcast broadcasts a message from the root process to all other processes in the group. The old algorithm uses a binomial tree, which takes $T_{tree} = \lg p \, (\alpha + n\beta)$. This algorithm is good for short messages. For long messages, Van de Geijn et al. proposed an algorithm in which the message is first scattered and then collected back to all processes. It takes $T_{vandegeijn} = (\lg p + p - 1)\alpha + 2\frac{p-1}{p} n\beta$.

3.3. Reduce-Scatter

MPI_Reduce_scatter combines values and scatters the results. The old algorithm is implemented as a binary tree reduce to rank 0 followed by a linear scatterv. This algorithm takes $T_{old} = (\lg p + p - 1)\alpha + (\lg p + \frac{p-1}{p}) n\beta + n \lg p \, \gamma$. For commutative operations, a recursive-halving algorithm is used: each process exchanges data with a process that is a distance $p/2$ away. This procedure continues recursively, halving the data communicated at each step, for a total of $\lg p$ steps. The time taken by this algorithm is $T_{rec\_half} = \lg p \, \alpha + \frac{p-1}{p} n\beta + \frac{p-1}{p} n\gamma$. This algorithm is used for messages up to 512 KB.
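The exchange pattern of recursive halving can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the MPICH source: it assumes a commutative sum on doubles, $p$ a power of two, and $p$ equal blocks of cnt elements per process (as in MPI_Reduce_scatter_block); the function name and buffer handling are invented for the example.

/* Sketch of recursive halving for reduce-scatter (commutative sum on
 * doubles, p a power of two, p blocks of cnt elements per process).
 * Illustrative only, not the MPICH implementation. */
#include <mpi.h>
#include <stdlib.h>

static void reduce_scatter_rec_halving(double *buf, int cnt, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Scratch space for the largest exchange (p/2 blocks in the first step). */
    double *tmp = malloc((size_t)(p / 2) * cnt * sizeof(double));

    int lo = 0;                         /* first rank of the current subgroup */
    for (int dist = p / 2; dist >= 1; dist /= 2) {
        int mid = lo + dist;            /* split point of the subgroup */
        int partner, keep_off, send_off;

        if (rank < mid) {               /* lower half keeps blocks [lo, mid) */
            partner  = rank + dist;
            keep_off = lo * cnt;
            send_off = mid * cnt;
        } else {                        /* upper half keeps blocks [mid, lo + 2*dist) */
            partner  = rank - dist;
            keep_off = mid * cnt;
            send_off = lo * cnt;
        }

        /* Exchange the half of the remaining data that the partner's side needs... */
        MPI_Sendrecv(buf + send_off, dist * cnt, MPI_DOUBLE, partner, 0,
                     tmp,            dist * cnt, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);

        /* ...and fold the received partial results into the half this process keeps. */
        for (int i = 0; i < dist * cnt; i++)
            buf[keep_off + i] += tmp[i];

        if (rank >= mid)
            lo = mid;                   /* recurse into the upper half */
    }
    /* buf[rank*cnt .. rank*cnt + cnt - 1] now holds this process's piece. */
    free(tmp);
}

After $\lg p$ steps each process holds its fully reduced block, and because the amount of data exchanged halves at every step, the bandwidth term sums to $\frac{p-1}{p} n\beta$ as in the formula above.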
If the reduction operation is not commutative, recursive halving will not work. In that case a recursive-doubling algorithm similar to the one used for allgather is used; however, more data is communicated than in allgather. The time taken by this algorithm is $T_{short} = \lg p \, \alpha + n(\lg p - \frac{p-1}{p})\beta + n(\lg p - \frac{p-1}{p})\gamma$. This algorithm is used for very short messages (< 512 bytes).

For long messages, a pairwise-exchange algorithm that takes $p - 1$ steps is used. The data exchanged at each step is only the data needed for the scattered result on the receiving process ($n/p$). The time taken by this algorithm is $T_{long} = (p-1)\alpha + \frac{p-1}{p} n\beta + \frac{p-1}{p} n\gamma$.

3.4. Reduce

MPI_Reduce reduces values on all processes to a single value at the root. The old algorithm uses a binomial tree, which takes $\lg p$ steps, and the data communicated at each step is $n$. The time taken by this algorithm is $T_{tree} = \lg p \, (\alpha + n\beta + n\gamma)$. This is a good algorithm for short messages because of the $\lg p \, \alpha$ latency term. For long messages, Rabenseifner's algorithm implements the reduce as a reduce-scatter followed by a gather to the root, which reduces the bandwidth term from $\lg p \, n\beta$ to $2 n\beta$. For predefined reduction operations, Rabenseifner's algorithm is used for long messages (> 2 KB) and the binomial tree algorithm for short messages. The time taken by Rabenseifner's algorithm is the sum of the times taken by the reduce-scatter and the gather: $T_{rabenseifner} = 2 \lg p \, \alpha + 2\frac{p-1}{p} n\beta + \frac{p-1}{p} n\gamma$.

4. Conclusion

This paper improves the performance of the collective communication algorithms in MPICH. Since the algorithms distinguish between short and long messages, an important factor is the message size at which the implementation switches from one algorithm to another.

5. Discussion

The algorithms presented here are specific to the target architectures, namely clusters connected by Myrinet or the IBM SP switch. Most clusters still use Fast Ethernet, which has higher latency. The cost model is only an empirical model; on a Linux system the actual cost behavior is more complex than the model presented here. The switch points between algorithms are selected statically, and the switch points used in this paper may not carry over to other architectures, so dynamic computation of the switch points is required.
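As a rough illustration of how a switch point could be derived from the cost model rather than fixed statically, the message size at which the van de Geijn broadcast overtakes the binomial tree can be predicted from measured $\alpha$ and $\beta$ by solving $T_{tree} = T_{vandegeijn}$ for $n$. The helper below and its sample constants are hypothetical, not part of MPICH.

/* Hypothetical helper (not MPICH code): predicted crossover size between the
 * binomial-tree broadcast, T_tree = lg(p)*(alpha + n*beta), and the
 * van de Geijn broadcast, T_vdg = (lg(p) + p - 1)*alpha + 2*((p-1)/p)*n*beta.
 * alpha (seconds) and beta (seconds per byte) would come from measurement. */
#include <math.h>
#include <stdio.h>

static double bcast_crossover_bytes(int p, double alpha, double beta)
{
    double lgp    = log2((double)p);
    double bw_gap = lgp - 2.0 * (p - 1) / (double)p;  /* bandwidth-term difference */
    if (bw_gap <= 0.0)
        return INFINITY;              /* p = 2: the tree is never slower */
    /* Solve T_tree(n) = T_vdg(n) for n. */
    return (p - 1) * alpha / (beta * bw_gap);
}

int main(void)
{
    /* Example numbers only: roughly 20 us latency, 100 MB/s bandwidth. */
    double alpha = 20e-6, beta = 1.0 / 100e6;
    for (int p = 4; p <= 64; p *= 2)
        printf("p = %2d: predicted switch at about %.0f bytes\n",
               p, bcast_crossover_bytes(p, alpha, beta));
    return 0;
}

The point is only that the crossover depends on $p$, $\alpha$, and $\beta$: since the paper's switch points are fixed statically, an interconnect with a much higher $\alpha$, such as Fast Ethernet, would shift them, which is the motivation for computing the switch points dynamically.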