
Improving the Performance of Collective Operations in MPICH
Rajeev Thakur and William Gropp
Mathematics and Computer Science Division
Argonne National Laboratory
9700 S. Cass Avenue
Argonne, IL 60439, USA
{thakur, gropp}@mcs.anl.gov
1. Introduction
The performance of parallel applications on cluster systems depends heavily
on the performance of the MPI implementation. Collective communication is an
important and frequently used component of MPI. This paper focuses on improving
the performance of all the collective operations in MPICH. The initial target
architecture is a cluster of machines connected by a switch, such as Myrinet or
the IBM SP switch. Each collective operation is implemented with multiple
algorithms, and one is selected based on message size. The algorithms for short
messages aim to minimize latency, while the algorithms for long messages focus
on bandwidth usage. This paper describes algorithms for allgather, broadcast,
reduce-scatter, and reduce. The performance of the algorithms is evaluated
using the SKaMPI benchmark.
2. Cost Model
The cost of a collective communication operation is expressed in terms of
latency and bandwidth usage. The time taken to send a message between any two
nodes is modeled as $\alpha + n\beta$, where $\alpha$ is the latency (or startup
time) per message, independent of message size, $\beta$ is the transfer time per
byte, and $n$ is the number of bytes transferred. Each node's network interface
is assumed to be single ported: at most one message can be sent and one message
can be received simultaneously. For reduction operations, $\gamma$ is the
computation cost per byte for performing the reduction operation locally on any
process. The number of processes is denoted by $p$.
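As a minimal sketch of this model, the function below evaluates the modeled time
for a single message. The function name and the numeric values chosen for the
latency and per-byte cost are illustrative assumptions, not measurements from
the paper.

```python
def point_to_point_time(n_bytes, alpha, beta):
    """Modeled time to send one n-byte message: alpha + n * beta."""
    return alpha + n_bytes * beta


# Hypothetical parameters: 10 us startup latency, 0.01 us per byte.
ALPHA_US = 10.0
BETA_US_PER_BYTE = 0.01

for n in (0, 1024, 1_000_000):
    t = point_to_point_time(n, ALPHA_US, BETA_US_PER_BYTE)
    print(f"{n:>8} bytes: {t:.2f} us")
```

With these assumed values, the latency term dominates for small messages and the
per-byte term dominates for large ones, which is why the algorithms in the next
section are chosen by message size.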
3. Algorithms
This section describes the new algorithms and their performance equations.
3.1. Allgather
MPI_Allgather gathers data from all processes and distributes it to all
processes in the group. The old algorithm uses a ring method in which the data
from each process is sent around a virtual ring of processes. The ring algorithm
takes $p - 1$ steps, and in each step every process receives $n/p$ of the data,
where $n$ is the total amount of data gathered on each process. The ring
algorithm therefore takes $T_{ring} = (p - 1)\alpha + \frac{p-1}{p} n\beta$.
The new algorithm uses recursive doubling, which takes $\lg p$ steps, so it takes
$T_{rec\_dbl} = \lg p\,\alpha + \frac{p-1}{p} n\beta$. In the actual
implementation, recursive doubling is used for short messages and the ring
algorithm for long messages.
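The recursive-doubling pattern can be illustrated with a small sequential
simulation. This is only a sketch: it assumes a power-of-two number of
processes, pairs rank r with rank r XOR distance (a common way to realize the
pattern, assumed here rather than taken from the text), and uses a hypothetical
function name. No MPI calls are made.

```python
def recursive_doubling_allgather(blocks):
    """Simulate allgather over p = len(blocks) processes (p a power of two).

    blocks[r] is the block contributed by rank r. Returns (result, steps),
    where result[r] lists every block, in rank order, as held by rank r.
    """
    p = len(blocks)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    # Each rank starts with only its own block, keyed by source rank.
    held = [{r: blocks[r]} for r in range(p)]
    steps = 0
    distance = 1
    while distance < p:
        # Every rank exchanges all the data it holds with its partner.
        new_held = [dict(h) for h in held]
        for r in range(p):
            partner = r ^ distance
            new_held[r].update(held[partner])
        held = new_held
        distance *= 2  # the amount of data held doubles at each step
        steps += 1
    result = [[h[src] for src in range(p)] for h in held]
    return result, steps


result, steps = recursive_doubling_allgather(list("abcdefgh"))
print(steps, result[0])
```

For eight processes the loop runs three times, matching the lg p step count,
whereas a ring over the same processes would need seven steps.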
3.2. Broadcast
MPI_Bcast broadcasts a message from the root process to all other processes of
the group. The old algorithm uses a binomial tree, which takes
$T_{tree} = \lg p\,(\alpha + n\beta)$. This algorithm is good for short
messages. For long messages, Van de Geijn et al. proposed an algorithm in which
the message is first scattered and the pieces are then collected back on all
processes. The time taken is
$T_{vandegeijn} = (\lg p + p - 1)\alpha + 2\frac{p-1}{p} n\beta$.
3.3. Reduce-Scatter
MPI_Reduce_scatter combines values and scatters the results. The old
algorithm performs a binary tree reduce to rank 0 followed by a linear
scatterv. This algorithm takes
$T_{old} = (\lg p + p - 1)\alpha + (\lg p + \frac{p-1}{p}) n\beta + n \lg p\,\gamma$.
For commutative operations, a recursive-halving algorithm is used. In the first
step, each process exchanges data with a process that is a distance $p/2$ away.
This procedure continues recursively, halving the data communicated at each
step, for a total of $\lg p$ steps. The time taken by this algorithm is
therefore
$T_{rec\_half} = \lg p\,\alpha + \frac{p-1}{p} n\beta + \frac{p-1}{p} n\gamma$.
This algorithm is used for messages up to 512 kB.
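The halving pattern can be sketched as a sequential simulation with addition as
the commutative operation. The sketch assumes a power-of-two process count, one
value per block, and the XOR partner rule; the function name and the
bookkeeping of blocks sent are hypothetical additions for illustration.

```python
def recursive_halving_reduce_scatter(vectors):
    """Simulate a sum reduce-scatter over p = len(vectors) processes.

    vectors[r][i] is rank r's contribution to block i (each vector has p
    entries). Returns (results, steps, blocks_sent_per_rank), where
    results[r] is the fully reduced block r.
    """
    p = len(vectors)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    # partial[r] maps block index -> partial sum currently held by rank r.
    partial = [dict(enumerate(v)) for v in vectors]
    lo, hi = [0] * p, [p] * p  # rank r still works on blocks [lo[r], hi[r])
    steps = 0
    blocks_sent = 0
    distance = p // 2
    while distance >= 1:
        new_partial = []
        for r in range(p):
            partner = r ^ distance
            mid = (lo[r] + hi[r]) // 2
            # Keep the half that contains block r; the other half is sent away.
            keep = range(lo[r], mid) if r < mid else range(mid, hi[r])
            new_partial.append(
                {i: partial[r][i] + partial[partner][i] for i in keep}
            )
            lo[r], hi[r] = keep.start, keep.stop
        partial = new_partial
        blocks_sent += distance  # each rank sends `distance` blocks this step
        distance //= 2
        steps += 1
    return [partial[r][r] for r in range(p)], steps, blocks_sent


vectors = [[10 * r + i for i in range(4)] for r in range(4)]
print(recursive_halving_reduce_scatter(vectors))
```

Each rank sends p/2, then p/4, and so on down to one block, for p - 1 blocks of
size n/p in total, which is where the (p - 1)/p factor in the bandwidth term
comes from.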
If the reduction operation is not commutative, recursive halving will not work.
A recursive-doubling algorithm similar to the one used in allgather is used
instead. However, more data is communicated than in allgather. The time taken
is
$T_{short} = \lg p\,\alpha + n(\lg p - \frac{p-1}{p})\beta + n(\lg p - \frac{p-1}{p})\gamma$.
This algorithm is used for very short messages (< 512 bytes).
For long messages, a pairwise-exchange algorithm that takes $p - 1$ steps is
used. The data exchanged is only the data needed for the scattered result on
the process ($\approx n/p$). The time taken by this algorithm is
$T_{long} = (p - 1)\alpha + \frac{p-1}{p} n\beta + \frac{p-1}{p} n\gamma$.
3.4. Reduce
MPI_Reduce reduces values on all processes to a single value. The old
algorithm uses a binomial tree, which takes $\lg p$ steps, and the data
communicated at each step is $n$. The time taken by this algorithm is
$T_{tree} = \lg p\,(\alpha + n\beta + n\gamma)$. This is a good algorithm for
short messages because of the $\lg p\,\alpha$ latency term. For long messages,
Rabenseifner implements the reduce as a reduce-scatter followed by a gather to
the root, which has the effect of reducing the bandwidth term from
$n \lg p\,\beta$ to $2n\beta$. For predefined reduction operations,
Rabenseifner's algorithm is used for long messages (> 2 kB) and the binomial
tree algorithm for short messages. The time taken by Rabenseifner's algorithm is
the sum of the times taken by reduce-scatter and gather, which is
$T_{rabenseifner} = 2 \lg p\,\alpha + 2\frac{p-1}{p} n\beta + \frac{p-1}{p} n\gamma$.
4. Conclusion
This paper improves the performance of the collective communication
algorithms in MPICH. Since these algorithms distinguish between short and long
messages, an important factor is the message size at which the implementation
switches from one algorithm to another.
5. Discussion
The algorithms presented here are specific to the target architecture, namely
Myrinet and the IBM SP switch. Most clusters still use Fast Ethernet, which has
higher latency. The cost model is only an empirical model; on a Linux system,
the actual cost behavior is more complex than presented. The switch points are
selected statically, and the switch points in this paper may not carry over to
other architectures. Dynamic computation of the switch point is therefore
required.
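One way to compute a switch point dynamically is to equate the two broadcast
cost expressions from Section 3.2 and solve for the message size, which gives
$n^* = \frac{(p-1)\alpha}{\beta(\lg p - 2\frac{p-1}{p})}$ whenever
$\lg p > 2\frac{p-1}{p}$. The sketch below follows from the cost model only and
is not a method described in the paper; the function name and the parameter
values are hypothetical, and in practice the latency and per-byte cost would
have to be measured on the target system.

```python
import math


def broadcast_switch_point(p, alpha, beta):
    """Message size (bytes) above which scatter-then-collect beats the tree.

    Returns None when the model predicts no crossover, that is, when
    lg p <= 2 * (p - 1) / p.
    """
    denominator = math.log2(p) - 2 * (p - 1) / p
    if denominator <= 0:
        return None
    return (p - 1) * alpha / (beta * denominator)


# Hypothetical parameters: 10 us latency, 0.01 us per byte.
for p in (2, 4, 16, 64):
    print(p, broadcast_switch_point(p, 10.0, 0.01))
```

Because the result depends on the process count as well as on the network
parameters, a statically chosen threshold can only be right for one
configuration.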