Pipelined Broadcast on Ethernet Switched Clusters


Pitch Patarasuk, Ahmad Faraj, Xin Yuan

Department of Computer Science

Florida State University

Tallahassee, FL 32306

Broadcast communication (MPI_Bcast)

Before: n0 holds A B C D; n1, n2, and n3 hold nothing.

After: n0, n1, n2, and n3 each hold A B C D.

Let T(msize) = time to send a message of size msize

Broadcast(msize) >= T(msize)

Ethernet Switched Cluster

[Figure: cluster of machines connected through several Ethernet switches]

Problem statement:

How to efficiently realize the broadcast operation with large message sizes on Ethernet switched clusters.

Pipelined broadcast can achieve near-optimal results: roughly T(msize) time for broadcasting a message of size msize.

Finding a contention-free broadcast tree

Finding a good segment size

Traditional Broadcast algorithms

• Linear tree

[Figure: linear tree over nodes 0 to 7]

Time = (P-1) x T(msize)

• Flat tree

[Figure: flat tree over nodes 0 to 7]

Time = (P-1) x T(msize)

• Binary tree

[Figure: binary tree over nodes 0 to 7]

Time = 2 x (log_2(P+1) - 1) x T(msize)

• k-ary tree

[Figure: k-ary tree over nodes 0 to 7]

• Binomial tree

[Figure: binomial tree over nodes 0 to 7]

Time = log_2(P) x T(msize)

Scatter/Allgather

Before: n0 holds A B C D; n1, n2, and n3 hold nothing.

Scatter: n0 holds A, n1 holds B, n2 holds C, n3 holds D.

Allgather: n0, n1, n2, and n3 each hold A B C D.

Time = 2 x T(msize)

Time Complexity for large messages

Linear tree: (P-1) x T(msize)

Flat tree: (P-1) x T(msize)

Binary tree: 2 x (log_2(P+1) - 1) x T(msize), approx. 2 x log_2(P) x T(msize)

Binomial tree: log_2(P) x T(msize)

Scatter/allgather: 2 x T(msize)
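To make the comparison concrete, the formulas above can be evaluated for a given number of processes. The following is a minimal sketch: the function name and the choice of P = 16 with T(msize) normalized to 1 are assumptions for illustration, not values from the slides.

```python
import math


def traditional_broadcast_times(P, T):
    """Evaluate the large-message time formulas listed above.

    P is the number of processes and T stands for T(msize);
    the results are in the same unit as T.
    """
    return {
        "linear tree": (P - 1) * T,
        "flat tree": (P - 1) * T,
        "binary tree": 2 * (math.log2(P + 1) - 1) * T,
        "binomial tree": math.log2(P) * T,
        "scatter/allgather": 2 * T,
    }


# Illustrative values only: 16 processes, T(msize) normalized to 1.
times = traditional_broadcast_times(P=16, T=1.0)
for name, t in times.items():
    print(f"{name:18s} {t:6.2f} x T(msize)")
```

Only scatter/allgather stays constant as P grows; every tree-based entry increases with P.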

Pipelined Broadcast Algorithm

Linear pipeline

[Figure: linear pipeline 0 → 1 → 2 → 3]

Performance of pipelined broadcast:

Assume no network contention. Let a message of size msize be broken into X messages of size msize/X.

H: tree height, D: the number of children

Size of a pipeline stage: D * T(msize/X)

Total time T: (X + H - 1) * (D * T(msize/X))

When X is much larger than H and X * T(msize/X) is close to T(msize), this gives approximately:

Linear tree: H = P, D = 1, T = T(msize)

Binary tree: H = log_2(P), D = 2, T = 2 x T(msize)

K-ary tree: H = log_k(P), D = k, T = k x T(msize); in general not as efficient as the binary tree.
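The total-time formula above can be checked numerically. This sketch assumes a simple linear cost model T(n) = startup + n / bandwidth; the model, the 50 microsecond startup, the 12.5 MB/s bandwidth, and the names `send_time` and `pipelined_time` are all hypothetical and not taken from the slides.

```python
import math


def send_time(nbytes, startup=50e-6, bandwidth=12.5e6):
    """Hypothetical cost model: T(n) = startup + n / bandwidth, in seconds."""
    return startup + nbytes / bandwidth


def pipelined_time(msize, X, H, D, T=send_time):
    """Total time (X + H - 1) * (D * T(msize / X)) for X segments."""
    return (X + H - 1) * (D * T(msize / X))


P = 16
msize = 8 << 20  # 8 MiB message (illustrative)
X = 512          # number of segments (illustrative)

linear = pipelined_time(msize, X, H=P, D=1)
binary = pipelined_time(msize, X, H=math.ceil(math.log2(P)), D=2)

# Ratios to T(msize): close to 1 for the linear pipeline, close to 2 for binary.
linear_ratio = linear / send_time(msize)
binary_ratio = binary / send_time(msize)
print(f"linear: {linear_ratio:.3f} x T(msize)")
print(f"binary: {binary_ratio:.3f} x T(msize)")
```

Under this assumed model the ratios sit slightly above 1 and 2 because of the H - 1 fill-up stages and the per-segment startup cost.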

Time Complexity for large messages

Pipelined (linear): T(msize)

Pipelined (binary): 2 x T(msize)

k-ary pipeline: k x T(msize)

Binomial tree: log_2(P) x T(msize)

Scatter/allgather: 2 x T(msize)

Pipelined broadcast

How to find a contention-free broadcast tree?

How to select the best segment size?
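One way to explore the segment-size question is to sweep candidate sizes through the pipeline model (X + H - 1) * (D * T(msize/X)). The sketch below is model-based only: the cost model T(n) = startup + n / bandwidth, its constants, and the name `best_segment_size` are assumptions for illustration. A real system would time each candidate on the cluster instead of using a model.

```python
def send_time(nbytes, startup=50e-6, bandwidth=12.5e6):
    """Hypothetical cost model: T(n) = startup + n / bandwidth, in seconds."""
    return startup + nbytes / bandwidth


def predicted_time(msize, seg, H, D):
    """Pipeline model (X + H - 1) * D * T(seg), with X = ceil(msize / seg)."""
    X = -(-msize // seg)  # ceiling division: number of segments
    return (X + H - 1) * D * send_time(seg)


def best_segment_size(msize, H, D, candidates):
    """Return the candidate segment size with the smallest predicted time."""
    return min(candidates, key=lambda seg: predicted_time(msize, seg, H, D))


msize = 8 << 20                              # 8 MiB message (illustrative)
candidates = [2 ** k for k in range(8, 21)]  # 256 B up to 1 MiB
seg = best_segment_size(msize, H=16, D=1, candidates=candidates)
print(seg, predicted_time(msize, seg, H=16, D=1))
```

Very small segments pay the startup cost many times, and very large segments leave the pipeline mostly empty, so under this model the minimum falls between the two extremes.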


Example of network contention

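Binary tree

[Figure: two connected switches, with n0, n1, n2, n3 on one switch and n4, n5, n6, n7 on the other. Binary tree: 0 → 1, 2; 1 → 3, 4; 2 → 5, 6; 3 → 7]

There is link contention caused by communications (1 → 4), (2 → 5), (2 → 6), and (3 → 7), which all cross the link between the two switches in the same direction.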

Linear tree

[Figure: two connected switches, with n0, n1, n4, n5 on one switch and n2, n3, n6, n7 on the other]

The linear tree 0 → 1 → 2 → 3 → … → 7 will have contention caused by (1 → 2) and (5 → 6).
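The two examples above can be reproduced with a small check. This sketch assumes the switches form a tree, that each machine attaches to one switch, and that contention means a directed switch-to-switch link is used by more than one edge of the broadcast tree. The switch names "A" and "B" and all function names are hypothetical.

```python
from collections import defaultdict, deque


def switch_path(adj, src, dst):
    """Directed links (a, b) on the path from switch src to switch dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:  # breadth-first search from src over the switch graph
        s = queue.popleft()
        for nxt in adj[s]:
            if nxt not in parent:
                parent[nxt] = s
                queue.append(nxt)
    links = []
    node = dst
    while parent[node] is not None:  # walk back from dst to src
        links.append((parent[node], node))
        node = parent[node]
    return links[::-1]


def contended_links(adj, machine_switch, edges):
    """Map each directed link used by more than one tree edge to those edges."""
    users = defaultdict(list)
    for u, v in edges:
        for link in switch_path(adj, machine_switch[u], machine_switch[v]):
            users[link].append((u, v))
    return {link: comms for link, comms in users.items() if len(comms) > 1}


adj = {"A": ["B"], "B": ["A"]}  # two connected switches

# Binary tree example: n0..n3 on one switch, n4..n7 on the other.
binary_map = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "B"}
binary_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]
binary_result = contended_links(adj, binary_map, binary_edges)

# Linear tree example: n0, n1, n4, n5 on one switch, n2, n3, n6, n7 on the other.
linear_map = {0: "A", 1: "A", 4: "A", 5: "A", 2: "B", 3: "B", 6: "B", 7: "B"}
linear_edges = [(i, i + 1) for i in range(7)]
linear_result = contended_links(adj, linear_map, linear_edges)

print(binary_result)
print(linear_result)
```

Links between a machine and its switch are ignored here; only the shared inter-switch links are checked.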

Algorithm for constructing contention free linear tree

Step 1: Traverse all switches using the depth-first search (DFS) algorithm, and number the switches S0, S1, S2, and so on in the order in which DFS first visits them.

Step 2: The linear tree consists of all machines on switch S0, followed by all machines on S1, then S2, and so on.
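The two steps above can be sketched in a few lines. The function name and the data layout (an adjacency list of switches plus a list of machines per switch) are assumptions for illustration. The machine placement follows the 16-node example in these slides, but the links between switches are not recoverable from the slides, so the ones used here (S0-S1, S1-S2, S1-S3) are hypothetical.

```python
def contention_free_linear_tree(adj, machines_on, root_switch):
    """Order machines switch by switch, visiting switches in DFS order."""
    order = []
    visited = set()
    stack = [root_switch]
    while stack:
        s = stack.pop()
        if s in visited:
            continue
        visited.add(s)
        order.extend(machines_on[s])    # Step 2: append all machines on s
        stack.extend(reversed(adj[s]))  # Step 1: continue the DFS from s
    return order


# Hypothetical switch links; machine placement as in the 16-node example.
adj = {
    "S0": ["S1"],
    "S1": ["S0", "S2", "S3"],
    "S2": ["S1"],
    "S3": ["S1"],
}
machines_on = {
    "S0": [0, 1, 4, 5],
    "S1": [2, 3, 6, 7],
    "S2": [8, 9, 10, 11],
    "S3": [12, 13, 14, 15],
}

chain = contention_free_linear_tree(adj, machines_on, "S0")
print(" -> ".join(f"n{m}" for m in chain))
```

With a DFS ordering, consecutive machines in the chain cross each inter-switch link at most once in each direction, which is why no two communications in the pipeline share a directed link.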

Example of contention free linear tree

Switch S0: n0, n1, n4, n5

Switch S1: n2, n3, n6, n7

Switch S2: n8, n9, n10, n11

Switch S3: n12, n13, n14, n15

Linear tree: n0 → n1 → n4 → n5 → n2 → n3 → n6 → n7 → n8 → n9 → … → n15

Algorithm for constructing contention free binary tree

Start with a contention free linear tree

Recursively divide the tree into 2 sub-trees

Make sure that there cannot be contention

The sub-trees are chosen such that the height of the whole tree will be minimal

[Figure: the linear tree over nodes 0 to 15 recursively divided into sub-trees]

Binary tree height

Performance of binary pipeline broadcast depends on the height of a binary tree

Even though a contention-free binary tree may not be a complete binary tree, its height is not much greater than that of a complete binary tree

Average tree heights for 20 randomly generated topologies

Evaluation

Contention free pipelined algorithms:

Routine generators from topology information

The generated routines are based on MPICH p2p primitives.

Linear tree

Binary tree

3-ary tree

Targets for comparison:

MPICH: Binomial tree, Scatter/allgather

LAM: Flat-tree, Binomial

Topology unaware pipelined linear and binary algorithms

Evaluation

Performance of different pipelined trees

(topology 1)

Comparing pipelined broadcast with other schemes

Topology unaware and contention-free pipelined broadcast

Segment size for pipelined broadcast

Conclusions

Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages

Linear pipeline has a completion time roughly equal to T(msize)

Binary pipeline broadcast is best for medium messages

Contention free broadcast tree is necessary for pipelined algorithms

A good segment size for pipelined broadcast is not difficult to find.

Questions?
