The Buffer - Bandwidth Tradeoff in Lossless Networks
Alex Shpiner, Eitan Zahavi, Ori Rottenstreich
April 2014
Background: Network Incast Congestion
[Figure: three senders, each transmitting at 100% of link rate, converge on one link that is offered 300% load.]
Source: http://theithollow.com/2013/05/flow-control-explained/
© 2014 Mellanox Technologies
- Mellanox Technologies-
Background: Pause Frame Flow Control
[Figure: pause frame flow control, with the receiving buffer labeled.]
Source: http://theithollow.com/2013/05/flow-control-explained/
Background: Incast with Pause Frame Flow Control
[Figure: with pause frames, each of the three senders is held to 33% so that the congested link carries 100%.]
Source: http://theithollow.com/2013/05/flow-control-explained/
Background: Congestion Spreading Problem
[Figure: a flow that shares a paused link gets 33% instead of 66%.]
This flow is also paused, since the pause control does not distinguish between flows.
Effective link bandwidth = Link bandwidth × %unpaused
Small buffers → link pauses → congestion spreading → effective link bandwidth decrease
▪ To deal with incast we can:
• Increase buffers
• Increase link bandwidth
Trade-off
Source: http://theithollow.com/2013/05/flow-control-explained/
Buffer-Bandwidth Tradeoff
▪ Higher bandwidth allows:
• Faster buffer draining
• Pausing the link for longer periods, while achieving the same effective bandwidth
→ Reduced buffering demand.
Effective bandwidth = Link bandwidth × %unpaused
▪ Aim: evaluate the buffer-bandwidth tradeoff.
▪ Assumptions:
• Lossless network
• Congestion spreading avoidance is desired
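The effective-bandwidth relation above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names and the 40 Gbps and 56 Gbps figures are example assumptions, and a paused link is treated as carrying no traffic, exactly as the simple relation states.

```python
def effective_bandwidth(link_bw_gbps: float, paused_fraction: float) -> float:
    """Effective bandwidth = link bandwidth * fraction of time unpaused."""
    if not 0.0 <= paused_fraction <= 1.0:
        raise ValueError("paused_fraction must be in [0, 1]")
    return link_bw_gbps * (1.0 - paused_fraction)


def max_paused_fraction(link_bw_gbps: float, target_bw_gbps: float) -> float:
    """Largest pause fraction that still delivers target_bw_gbps."""
    if target_bw_gbps > link_bw_gbps:
        raise ValueError("target exceeds link bandwidth")
    return 1.0 - target_bw_gbps / link_bw_gbps


# Example values (assumed): a 56 Gbps link that must still deliver 40 Gbps.
budget = max_paused_fraction(56.0, 40.0)
print(f"pause budget: {budget:.1%}")
print(f"effective bandwidth: {effective_bandwidth(56.0, budget):.1f} Gbps")
```

Under these assumptions the faster link can spend a share of its time paused and still match the slower link's effective bandwidth, which is the room the tradeoff exploits.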
Evaluation Flow
1) Assume a network with links of bandwidth $C$ and buffers of size $B$.
[Figure: $N$ traffic injectors (0 to $N-1$), each on a link of bandwidth $C$, feed a switch with an input buffer of size $B$ per port; each input is served at $C/N$ toward a single link of bandwidth $C$ to the traffic receiver.]
2) Define the most challenging workload the network can handle without congestion spreading.
[Figure: data rate vs. time.]
3) Increase link bandwidth by a factor $\alpha$ and reduce buffer size by a factor $\beta$:
$C_{new} = \alpha C,\ \alpha > 1$
$B_{new} = \beta B,\ \beta < 1$
Transmit the workload at the same rate, while keeping the same effective link bandwidth.
4) Evaluate the relation between $\alpha$ and $\beta$ that is able to handle the workload from step 2:
$\beta = f(\alpha)$
Network Model
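The most challenging workload:
[Figure: each of the $N$ traffic injectors (0 to $N-1$) sends periodic bursts of injected traffic at data rate $C$; the switch has an input buffer of size $B$ per port and serves each input at $C/N$ toward the traffic receiver over a link of bandwidth $C$. Axes: data rate vs. time.]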
Determining the Workload Parameters
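[Figure: arrival rate $\lambda_{in}$ vs. $t$, with bursts at rate $C$ of length $T/N$ repeating with period $T$; queue size $Q$ vs. $t$, rising to $B$ during each burst.]
▪ Assumption: output link is 100% utilized.
• With $N$ senders, burst length = $T/N$
▪ $T = ?$
• It takes $t = T/N$ to fill a buffer of size $B$ at arrival rate $C$ and departure rate $C/N$:
$$\frac{T}{N} = \frac{B}{C - \frac{C}{N}} \quad\Rightarrow\quad T = \frac{N^2 B}{C(N-1)}$$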
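As a quick numeric check of the period derived above, the sketch below computes $T$ and confirms that a burst of length $T/N$ fills the buffer exactly. The function name and the sample values (10 senders, a 1 MB buffer, 40 Gbps links) are illustrative assumptions, not figures from the slides.

```python
import math


def workload_period(n_senders: int, buffer_bits: float, link_bps: float) -> float:
    """Period T = N^2 * B / (C * (N - 1)) of the most challenging workload."""
    if n_senders < 2:
        raise ValueError("incast needs at least two senders")
    return n_senders**2 * buffer_bits / (link_bps * (n_senders - 1))


# Assumed example values.
N = 10
B = 8_000_000.0   # buffer size in bits (1 MB)
C = 40e9          # link bandwidth in bits per second

T = workload_period(N, B, C)
burst_length = T / N
# Time to fill B when traffic arrives at C and drains at C/N.
fill_time = B / (C - C / N)

print(f"T = {T * 1e3:.3f} ms, burst = {burst_length * 1e6:.1f} us")
assert math.isclose(burst_length, fill_time)
```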
Effect of Buffer Size Reduction with Link Acceleration
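[Figure: arriving traffic $\lambda_{in}$ vs. $t$, with bursts at rate $C$ of length $T/N$ and period $T$. Input queue size $Q$ vs. $t$ in the original network: $Q_{max} = B$. Input queue size $Q$ vs. $t$ with $C_{new} = \alpha C$, $\alpha > 1$ and $Q_{max} = \beta B$, $\beta < 1$: the queue reaches $\beta B$ at $t_1$, stays full until $t_2$ and is empty at $t_3$.]
$$t_1 = \frac{\beta B}{C\left(1 - \frac{\alpha}{N}\right)}, \qquad t_2 = t_3 - \frac{\beta B}{\alpha C / N}, \qquad t_3 = \frac{T}{\alpha}$$
Conclusions:
• We can push more traffic.
• When the buffer is full, the link is in paused mode: $C_{eff} = \alpha C / N$ (congestion spreading)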
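The three time points can be evaluated directly. The following sketch is an illustration under assumed inputs: the function name is hypothetical, $B$ and $C$ are normalized to 1, and $T = N^2 B / (C(N-1))$ is taken from the workload definition.

```python
import math


def pause_timeline(n: int, buffer_size: float, link_bw: float,
                   alpha: float, beta: float):
    """Return (t1, t2, t3, T) for a reduced buffer beta*B and links of alpha*C."""
    if not 1.0 <= alpha < n:
        raise ValueError("model assumes 1 <= alpha < N")
    period = n**2 * buffer_size / (link_bw * (n - 1))
    # Buffer beta*B fills while traffic arrives at C and drains at alpha*C/N.
    t1 = beta * buffer_size / (link_bw * (1.0 - alpha / n))
    # All data of one burst has been drained at rate alpha*C/N.
    t3 = period / alpha
    # The full buffer needs beta*B / (alpha*C/N) to drain, ending at t3.
    t2 = t3 - beta * buffer_size / (alpha * link_bw / n)
    return t1, t2, t3, period


# Normalized example (assumed values): N = 10, alpha = 1.4, beta = 0.53.
t1, t2, t3, T = pause_timeline(10, 1.0, 1.0, 1.4, 0.53)
print(f"t1={t1:.3f} t2={t2:.3f} t3={t3:.3f} T={T:.3f}")
print(f"share of the period spent paused: {(t2 - t1) / T:.1%}")
```

The link is paused between `t1` and `t2`, so `(t2 - t1) / T` is the paused share of each period.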
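Effect of Buffer Size Reduction with Link Acceleration (cont.)
[Figure: input queue size $Q$ vs. $t$ over one period $T$; the queue is full at $\beta B$ between $t_1$ and $t_2$ and is empty at $t_3$.]
▪ When the buffer is full, the link is in paused mode:
• Paused mode: $C_{effective} = \alpha C / N$ (congestion spreading)
▪ With $t_1 = \frac{\beta B}{C(1 - \alpha/N)}$, $t_2 = t_3 - \frac{\beta B}{\alpha C / N}$ and $t_3 = \frac{T}{\alpha}$:
$$\%\mathrm{paused} = \frac{t_2 - t_1}{T} = \dots = \frac{N - \alpha - \beta(N-1)}{\alpha(N - \alpha)} \qquad (1)$$
▪ By how much can the buffer be reduced ($\beta$) while avoiding congestion spreading ($\%\mathrm{paused} = 0$)?
• For $\%\mathrm{paused} = 0$, $\alpha = \frac{56}{40} = 1.4$, $N = 2$ (incast 2 to 1):
- $\beta = 0.6$: a 40% buffer saving!
• For $\%\mathrm{paused} = 0$, $\alpha = \frac{56}{40} = 1.4$, $N = 10$ (incast 10 to 1):
- $\beta \approx 0.95$: only about 5% buffer saving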
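Setting equation (1) to zero gives $\beta = (N - \alpha)/(N - 1)$, a step the slide skips. The sketch below evaluates both expressions; the function names are hypothetical and chosen only for this illustration.

```python
import math


def paused_fraction(n: int, alpha: float, beta: float) -> float:
    """Equation (1): share of the period T during which the link is paused."""
    return (n - alpha - beta * (n - 1)) / (alpha * (n - alpha))


def beta_for_zero_pause(n: int, alpha: float) -> float:
    """Solve equation (1) for beta with the paused fraction set to zero."""
    return (n - alpha) / (n - 1)


alpha = 56 / 40  # link rate increase used on the slide
for n in (2, 10):
    beta = beta_for_zero_pause(n, alpha)
    print(f"N={n:2d}: beta={beta:.3f}, buffer saving={1 - beta:.1%}")
    # Plugging beta back into (1) must give no pausing.
    assert math.isclose(paused_fraction(n, alpha, beta), 0.0, abs_tol=1e-12)
```

For $N = 10$ the formula evaluates to roughly 0.956, which the slide reports as 0.95. The saving shrinks quickly as the incast ratio grows, because a wider incast leaves each input a smaller share of the extra drain rate.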
Effect of Buffer Size Reduction with Link Acceleration (cont.)
[Figure: input queue size $Q$ vs. $t$ over one period $T$; the queue is full at $\beta B$ between $t_1$ and $t_2$ and is empty at $t_3 = T/\alpha$.]
▪ Observation: we are allowed to pause the link, since we increased the link capacity.
• $C = \alpha C\,(1 - \%\mathrm{paused}) + \frac{\alpha C}{N} \cdot \%\mathrm{paused}$
$$\%\mathrm{paused} = \frac{\alpha - 1}{\alpha - \frac{\alpha}{N}} \qquad (2)$$
- For $\alpha = 1.4$, $N = 10$: $\%\mathrm{paused} = 32\%$
• Using (1) and (2):
$$\beta = \frac{(N - \alpha)(2N - \alpha N - 1)}{(N-1)^2}$$
- For $N = 10$, $\alpha = \frac{56}{40} = 1.4$: $\beta = 0.53$
▪ We can save 47% of buffer size with a 40% link rate increase and get the same performance!
▪ And we can also push more data in total (e.g. 56 Gbps vs. 40 Gbps)
• at the cost of congestion spreading
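The closed form for $\beta$ can be cross-checked numerically: the pause budget from (2), substituted into (1), should reproduce the same $\beta$. This is a minimal sketch with hypothetical function names, not code from the presentation.

```python
import math


def allowed_paused_fraction(n: int, alpha: float) -> float:
    """Equation (2): pause share that keeps the average rate at the original C."""
    return (alpha - 1.0) / (alpha - alpha / n)


def paused_fraction(n: int, alpha: float, beta: float) -> float:
    """Equation (1): pause share caused by a buffer of size beta*B."""
    return (n - alpha - beta * (n - 1)) / (alpha * (n - alpha))


def beta_with_pausing(n: int, alpha: float) -> float:
    """Combination of (1) and (2): smallest buffer factor for the same performance."""
    return (n - alpha) * (2 * n - alpha * n - 1) / (n - 1) ** 2


n, alpha = 10, 56 / 40
p = allowed_paused_fraction(n, alpha)
beta = beta_with_pausing(n, alpha)
print(f"paused = {p:.1%}, beta = {beta:.2f}, buffer saving = {1 - beta:.0%}")
# Consistency: a buffer of beta*B causes exactly the allowed pause share.
assert math.isclose(paused_fraction(n, alpha, beta), p)
```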
Asymptotic Analysis
For $\alpha \ge 2$, no buffering is required in the switches!
By increasing link bandwidth by X%, the buffer saving is at least X% for any incast load!
[Figure: buffer size saving $(1-\beta)$, 0% to 100%, vs. $N$ (incast ratio $N$ to 1), 0 to 50, with one curve per $\alpha$ from 1.1 to 1.9.]
[Figure: buffer size saving $(1-\beta)$, 0% to 100%, vs. $\alpha$, 1 to 2, with one curve per $N$ in {2, 4, 8, 16, 32, 64}.]
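Both claims can be checked by sweeping the single-rank result $\beta = (N-\alpha)(2N-\alpha N-1)/(N-1)^2$. The sketch below is an illustration with two assumptions of its own: a negative $\beta$ is clipped to zero (no buffer needed), and the sweep grid simply mirrors the plot ranges.

```python
def buffer_saving(n: int, alpha: float) -> float:
    """Buffer saving 1 - beta, with beta clipped at 0 (assumption of this sketch)."""
    beta = (n - alpha) * (2 * n - alpha * n - 1) / (n - 1) ** 2
    return 1.0 - max(beta, 0.0)


alphas = [i / 10 for i in range(11, 21)]   # 1.1, 1.2, ..., 2.0
incast_ratios = [2, 4, 8, 16, 32, 64]

for alpha in alphas:
    savings = [buffer_saving(n, alpha) for n in incast_ratios]
    # "X% more bandwidth saves at least X% of buffer" for every incast ratio.
    assert all(s >= (alpha - 1.0) - 1e-12 for s in savings)
    print(f"alpha={alpha:.1f}: min saving={min(savings):.1%}")

# For alpha = 2 the required buffer vanishes for every N in the sweep.
assert all(buffer_saving(n, 2.0) == 1.0 for n in incast_ratios)
```

As $N$ grows, $\beta$ approaches $2 - \alpha$, so the saving approaches $\alpha - 1$ from above; this limit is where the "at least X%" bound comes from.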
Multiple Incast Cascade Analysis
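[Figure: a cascade of $n$ switch ranks (Rank 1, Rank 2, ..., Rank $n$) between the traffic injectors and the traffic receiver; a rank-$i$ switch has $k_i$ inputs, every link has bandwidth $C$ and every buffer has size $B$.]
Last rank defines the workload parameters:
$$T_n = \frac{B \cdot k_n}{C\,(k_n - 1)}; \qquad T_{n-1} = \frac{T_n}{k_{n-1}}; \qquad T_1 = \frac{T_n}{\prod_{i=1}^{n-1} k_i}$$
[Figure: arrival rate $\lambda_{in}$ vs. $t$, with bursts at rate $C$ of length $T_n$ and period $T$; queue size $Q$ vs. $t$, peaking at $B$.]
Rank $n$, $\alpha > 1$, $\beta < 1$:
The analysis is similar to the single-rank case, but now the traffic arrives at rate $\alpha C$.
[Figure: arrival rate $\lambda_{in}$ vs. $t$, with bursts at rate $\alpha C$ of length $T_n/\alpha$ and period $T$; queue size $Q$ vs. $t$, capped at $\beta B$.]
$$\beta = 1 - \alpha \cdot \%\mathrm{paused}$$
$\beta(\alpha = 1.4,\ \%\mathrm{paused} = 32\%) = 0.55$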
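The cascade relations can be sketched as follows. The function names and the sample fan-ins are hypothetical, and the general rule $T_i = T_n / \prod_{j=i}^{n-1} k_j$ for intermediate ranks is an assumption interpolated from the two cases the slide states ($T_{n-1}$ and $T_1$).

```python
import math


def cascade_burst_lengths(buffer_size: float, link_bw: float, fan_ins: list) -> list:
    """Return [T_1, ..., T_n] for fan-ins [k_1, ..., k_n]; the last rank sets T_n."""
    k_n = fan_ins[-1]
    if k_n < 2:
        raise ValueError("last rank needs a fan-in of at least 2")
    t_n = buffer_size * k_n / (link_bw * (k_n - 1))
    # T_i = T_n / (k_i * ... * k_{n-1}); the empty product for i = n is 1.
    return [t_n / math.prod(fan_ins[i:-1]) for i in range(len(fan_ins))]


def cascade_beta(alpha: float, paused: float) -> float:
    """Buffer factor at rank n when traffic arrives at alpha*C: beta = 1 - alpha * %paused."""
    return 1.0 - alpha * paused


# Assumed example: three ranks with fan-ins 2, 4 and 10, normalized B = C = 1.
lengths = cascade_burst_lengths(1.0, 1.0, [2, 4, 10])
print("T_1..T_n:", [round(t, 4) for t in lengths])
print(f"beta = {cascade_beta(1.4, 0.32):.2f}")
```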
Conclusions
▪ We can reduce switch buffer size, while still pushing the same traffic.
• But we pay with:
- Congestion spreading (pause frames)
- Buffers at the traffic sources (NICs)
▪ By increasing the link bandwidth, we can reduce the congestion spreading
• And push more traffic.
▪ We can save at least X% of buffer size with an X% link rate increase (for any incast).
▪ When increasing the link bandwidth by a factor of at least 2 ($\alpha \ge 2$), no buffering is required at the switches.
▪ The results also hold for the multiple incast cascade.
Thank You