Pitch Patarasuk, Ahmad Faraj, Xin Yuan
Department of Computer Science
Florida State University
Tallahassee, FL 32306
Broadcast communication (MPI_Bcast)
Before: n0 holds A B C D; n1, n2, and n3 hold nothing.
After: n0, n1, n2, and n3 each hold A B C D.
Let T(msize) = time to send a message of size msize
Broadcast(msize) >= T(msize)
Problem statement:
How to efficiently realize the broadcast operation with large message sizes on
Ethernet switched clusters.
Pipelined broadcast can achieve near-optimal results: roughly T(msize) time for broadcasting a message of size msize.
Finding a contention-free broadcast tree
Finding a good segment size
• Linear tree: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
Time = (P-1) x T(msize)
• Flat tree: root 0 sends directly to 1, 2, 3, 4, 5, 6, 7
Time = (P-1) x T(msize)
• Binary tree
[Figure: binary tree on nodes 0 to 7]
Time = 2 x (log_2(P+1) - 1) x T(msize)
• k-ary tree
[Figure: k-ary tree on nodes 0 to 7]
• Binomial tree
[Figure: binomial tree on nodes 0 to 7, rooted at node 0]
Time = log_2(P) x T(msize)
• Scatter/Allgather
Before: n0 holds A B C D.
Scatter: n0 holds A, n1 holds B, n2 holds C, n3 holds D.
Allgather: n0, n1, n2, and n3 each hold A B C D.
Time = 2 x T(msize)
Algorithm            Time
Linear tree          (P-1) x T(msize)
Flat tree            (P-1) x T(msize)
Binary tree          2 x (log_2(P+1) - 1) x T(msize), approx. 2 x log_2(P) x T(msize)
Binomial tree        log_2(P) x T(msize)
Scatter/allgather    2 x T(msize)
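To make the comparison concrete, the formulas above can be evaluated for a given number of nodes P. This is a minimal sketch; the function name `broadcast_time_factors` is hypothetical, and each value is the multiple of T(msize) predicted by the corresponding formula.

```python
import math

def broadcast_time_factors(P):
    """Return, per algorithm, the multiple of T(msize) predicted for P nodes."""
    return {
        "linear tree": P - 1,
        "flat tree": P - 1,
        "binary tree": 2 * (math.log2(P + 1) - 1),
        "binomial tree": math.log2(P),
        "scatter/allgather": 2,
    }

if __name__ == "__main__":
    # Print the predicted factors for a 16-node cluster.
    for name, factor in broadcast_time_factors(16).items():
        print(f"{name:18s} {factor:6.2f} x T(msize)")
```

The factors for the tree algorithms grow with P, while the scatter/allgather factor stays constant.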
Linear pipeline: 0 -> 1 -> 2 -> 3
Performance of pipelined broadcast:
Assume no network contention. Let a message of size msize be broken into X messages of size msize/X.
H: tree height, D: the number of children
Time of one pipeline stage: D * T(msize/X)
Total time T: (X + H - 1) * (D * T(msize/X))
Linear tree: H = P, D = 1, T ≈ T(msize) when X is much larger than H
Binary tree: H = log_2(P), D = 2, T ≈ 2T(msize)
K-ary tree: H = log_k(P), D = k, T ≈ kT(msize); in general not as efficient as the binary tree.
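The total-time formula can be explored numerically. The sketch below assumes a linear cost model T(m) = alpha + beta * m (startup cost plus per-byte cost); that model, the parameter values, and the helper names are illustrative assumptions, not taken from the slides.

```python
def pipelined_time(msize, X, H, D, T):
    """Total time (X + H - 1) * D * T(msize / X) from the pipeline model."""
    return (X + H - 1) * D * T(msize / X)

def linear_cost(alpha, beta):
    """Assumed cost model: T(m) = alpha + beta * m (not from the slides)."""
    return lambda m: alpha + beta * m

def best_segment_count(msize, H, D, T, candidates):
    """Return the candidate X that minimises the modelled total time."""
    return min(candidates, key=lambda X: pipelined_time(msize, X, H, D, T))

if __name__ == "__main__":
    # Hypothetical numbers: 0.1 ms startup, 10 ns per byte, 1 MB message.
    T = linear_cost(alpha=1e-4, beta=1e-8)
    msize, P = 1_000_000, 16
    candidates = [2 ** i for i in range(11)]
    X = best_segment_count(msize, H=P, D=1, T=T, candidates=candidates)
    print("segments:", X, "segment size:", msize / X)
    print("pipelined:", pipelined_time(msize, X, P, 1, T), "T(msize):", T(msize))
```

Under this assumed model, too few segments waste time filling the pipeline and too many segments pay the startup cost too often, so the best X lies in between.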
Time Complexity for large messages
Algorithm            Time
Pipelined (linear)   T(msize)
Pipelined (binary)   2 x T(msize)
k-ary pipeline       k x T(msize)
Binomial tree        log_2(P) x T(msize)
Scatter/allgather    2 x T(msize)
How to find a contention-free broadcast tree?
How to select the best segment size?
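• Binary tree
Topology: one switch connects n0, n1, n2, n3; the other switch connects n4, n5, n6, n7.
[Figure: binary tree 0 -> (1, 2), 1 -> (3, 4), 2 -> (5, 6), 3 -> 7]
There is link contention caused by communications (1 -> 4), (2 -> 5), (2 -> 6), and (3 -> 7), which all cross the link between the two switches in the same direction.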
• Linear tree
Topology: one switch connects n0, n1, n4, n5; the other switch connects n2, n3, n6, n7.
The linear tree 0 -> 1 -> 2 -> 3 -> … -> 7 will have contention caused by (1 -> 2) and (5 -> 6).
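The contention in these two examples can be checked mechanically. The sketch below assumes a tree-shaped (cycle-free) topology with full-duplex links, and treats a directed link as contended when communications from more than one sender use it; the switch names `sA` and `sB` and all function names are hypothetical.

```python
from collections import defaultdict, deque

def path_links(adj, src, dst):
    """Directed links on the unique src -> dst path in a tree topology."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    links, node = [], dst
    while parent[node] is not None:
        links.append((parent[node], node))
        node = parent[node]
    return links[::-1]

def contended_links(adj, comms):
    """Map each directed link used by more than one sender to its communications."""
    users = defaultdict(list)
    for s, d in comms:
        for link in path_links(adj, s, d):
            users[link].append((s, d))
    return {l: c for l, c in users.items() if len({s for s, _ in c}) > 1}

def two_switch_topology(left, right):
    """Two switches, sA and sB, joined by one link; machines hang off each."""
    adj = {"sA": ["sB"], "sB": ["sA"]}
    for sw, machines in (("sA", left), ("sB", right)):
        for m in machines:
            adj[sw].append(m)
            adj[m] = [sw]
    return adj

if __name__ == "__main__":
    adj = two_switch_topology([0, 1, 4, 5], [2, 3, 6, 7])
    linear = [(i, i + 1) for i in range(7)]  # 0 -> 1 -> ... -> 7
    print(contended_links(adj, linear))
```

A node sending to several of its own children is not counted as contention here, because the pipeline model already serialises those sends within one stage.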
Algorithm for constructing contention free linear tree
Step 1: Traverse all switches using depth-first search (DFS), and name the switches S_0, S_1, S_2, and so on, in the order in which the DFS reaches them.
Step 2: The linear tree consists of all machines in switch S_0, followed by all machines in S_1, then S_2, and so on.
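The two steps can be sketched as a short recursive DFS. The data layout (one dictionary of switch-to-switch links, one of machines per switch) and the function name are assumptions made for illustration. The machine lists mirror a four-switch layout, but the inter-switch links used here are hypothetical.

```python
def contention_free_linear_order(switch_adj, machines, root):
    """List machines switch by switch, in the order a DFS first reaches each switch.

    switch_adj: switch -> list of neighbouring switches (tree topology assumed)
    machines:   switch -> list of machines attached to that switch
    """
    order, visited = [], set()

    def dfs(sw):
        visited.add(sw)
        order.extend(machines.get(sw, []))  # Step 2: append this switch's machines
        for nxt in switch_adj.get(sw, []):  # Step 1: continue the DFS
            if nxt not in visited:
                dfs(nxt)

    dfs(root)
    return order

if __name__ == "__main__":
    # Hypothetical links: S0-S1, S1-S2, S1-S3.
    switch_adj = {"S0": ["S1"], "S1": ["S0", "S2", "S3"], "S2": ["S1"], "S3": ["S1"]}
    machines = {
        "S0": [0, 1, 4, 5],
        "S1": [2, 3, 6, 7],
        "S2": [8, 9, 10, 11],
        "S3": [12, 13, 14, 15],
    }
    print(contention_free_linear_order(switch_adj, machines, "S0"))
```

Consecutive machines in the returned list form the edges of the linear tree.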
Example of contention free linear tree
Switch S0: n0, n1, n4, n5
Switch S1: n2, n3, n6, n7
Switch S2: n8, n9, n10, n11
Switch S3: n12, n13, n14, n15
Linear tree: n0 -> n1 -> n4 -> n5 -> n2 -> n3 -> n6 -> n7 -> n8 -> n9 -> … -> n15
Algorithm for constructing contention free binary tree
Start with a contention free linear tree
Recursively divide the tree into 2 sub-trees
Make sure that there cannot be contention
The sub-trees are chosen such that the height of the whole tree will be minimal
[Figure: linear order 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15, recursively divided into sub-trees]
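The recursive division can be illustrated with a deliberately simplified sketch. It takes the first node of a contiguous segment of the linear order as the sub-tree root and splits the rest into two contiguous halves. It does not perform the contention check or the height-minimising choice of split point described above, both of which need topology information. The function name is hypothetical.

```python
def build_binary_tree(order):
    """Build a binary tree from a non-empty linear order by recursive halving.

    Returns (children, height): children maps each node to its child list,
    and height counts levels, so a single node has height 1.
    Simplification: no contention check, split point is always the midpoint.
    """
    children = {}

    def build(lo, hi):
        root = order[lo]  # first node of the segment roots the sub-tree
        children[root] = []
        rest = lo + 1
        if rest >= hi:
            return 1
        mid = rest + (hi - rest + 1) // 2
        heights = []
        for a, b in ((rest, mid), (mid, hi)):
            if a < b:  # skip an empty half
                children[root].append(order[a])
                heights.append(build(a, b))
        return 1 + max(heights)

    return children, build(0, len(order))

if __name__ == "__main__":
    children, height = build_binary_tree(list(range(16)))
    print("height:", height)
    print("children of root:", children[0])
```

Because every sub-tree covers a contiguous run of the linear order, this shape is a natural starting point for a version that also checks contention against the topology.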
Binary tree height
Performance of binary pipelined broadcast depends on the height of the binary tree
Even though a contention-free binary tree may not be a complete binary tree, its height is not much greater than that of a complete binary tree
Average tree heights for 20 randomly generated topologies
Evaluation
Contention free pipelined algorithms:
Routine generators from topology information
The generated routines are based on MPICH p2p primitives.
Linear tree
Binary tree
3-ary tree
Targets for comparison:
MPICH: Binomial tree, Scatter/allgather
LAM: Flat-tree, Binomial
Topology unaware pipelined linear and binary algorithms
Evaluation
Performance of different pipelined trees (topology 1)
Comparing pipelined broadcast with other schemes
Topology unaware and contention-free pipelined broadcast
Segment size for pipelined broadcast
Pipelined broadcast is faster than current broadcast algorithms for medium and large messages
Linear pipeline has a completion time roughly equal to T(msize)
Binary pipeline broadcast is best for medium messages
Contention free broadcast tree is necessary for pipelined algorithms
A good segment size for pipelined broadcast is not difficult to find.
Questions?