Defense - Northwestern Networks Group

UNDERSTANDING TCP INCAST THROUGHPUT COLLAPSE IN DATACENTER NETWORKS
Presenters:
Aditya Agarwal
Tyler Maclean
MOTIVATION/IMPORTANCE

Internet datacenters support a myriad of services and applications.
  e.g., Google, Microsoft, Yahoo, Amazon
The vast majority of datacenters use TCP for communication between nodes.
The unique workloads, scale, and environment of Internet datacenters violate the WAN assumptions on which TCP was originally designed.


Minimum RTO = 200 ms (the default value in Linux)
This is 2-3 orders of magnitude greater than the RTT in the datacenter.
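To make the gap concrete, the sketch below computes an RFC 6298-style retransmission timeout, SRTT + max(G, 4·RTTVAR), and then clamps it at the 200 ms minimum. The 100 µs SRTT, 20 µs RTTVAR, and 1 ms clock granularity G are hypothetical datacenter values chosen for illustration; Linux's own estimator differs in detail.

```python
def rto_seconds(srtt: float, rttvar: float, min_rto: float = 0.200,
                clock_granularity: float = 0.001) -> float:
    """RFC 6298-style RTO: SRTT + max(G, 4 * RTTVAR), clamped below at min_rto."""
    return max(min_rto, srtt + max(clock_granularity, 4 * rttvar))


# Hypothetical datacenter path: 100 us smoothed RTT, 20 us RTT variance.
srtt, rttvar = 100e-6, 20e-6
unclamped = srtt + max(0.001, 4 * rttvar)
rto = rto_seconds(srtt, rttvar)

print(f"RTO before the minimum clamp: {unclamped * 1e3:.3f} ms")  # 1.100 ms
print(f"effective RTO: {rto * 1e3:.0f} ms")                        # 200 ms
print(f"RTO / RTT ratio: {rto / srtt:.0f}x")                       # 2000x
```

With these assumed numbers the clamp, not the measured RTT, decides the timeout, and a single loss idles the sender for roughly 2000 RTTs.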
WHAT IS THE PROBLEM?

Incast communication pattern (figure: one client connected through a single switch to multiple servers)

Goal: understand TCP Incast throughput collapse.
  Show that the problem is general
  Build an analytical model
  Propose modifications to TCP and verify which ones work
THE CONTRIBUTIONS



Reproduce the problem in our own experimental
testbeds and demonstrate the generality of Incast.
Propose a quantitative model that accounts for some of the observed Incast behavior.
Implement several intuitive modifications to the TCP stack in Linux, and demonstrate that some modifications are more helpful than others.
ROADMAP

Experiment setting:
  Workload
Experiment results:
  Initial findings
  Deep analysis
Quantitative models
Conclusions

WORKLOAD SETTING

MapReduce-like application:
The receiver requests k blocks of data from S storage servers.
  Each block of data is striped across the S storage servers.
  Each server responds with a fixed amount of data (fixed-fragment workload).
  The client does not request block i+1 until all the fragments of block i have been received.


Settings:
  k = 100 blocks
  S = 1 to 48 servers
  Fragment size = 256 KB
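A minimal sketch of the fixed-fragment, barrier-synchronized workload with the settings above (k = 100, 256 KB fragments). The helper names are hypothetical and no network is simulated; the sketch only shows the request structure and that total data volume grows linearly with S while each server always sends the same fixed fragment.

```python
FRAGMENT_BYTES = 256 * 1024  # fragment size from the slide: 256 KB


def block_requests(num_servers: int) -> list[tuple[int, int]]:
    """Hypothetical helper: one block = one fixed-size fragment per server."""
    return [(server, FRAGMENT_BYTES) for server in range(num_servers)]


def run_workload(num_blocks: int, num_servers: int) -> int:
    """Barrier-synchronized reads: the next block is requested only after
    every fragment of the current block has arrived. Returns total bytes."""
    total = 0
    for _ in range(num_blocks):
        outstanding = block_requests(num_servers)
        # In a real client these requests go out in parallel and all the
        # responses converge on one switch port (the Incast pattern).
        # Here we simply "receive" every fragment before moving on.
        total += sum(size for _, size in outstanding)
    return total


for S in (1, 8, 48):
    total = run_workload(num_blocks=100, num_servers=S)
    print(f"S={S:2d}: {total / 2**20:.0f} MiB transferred")  # 25, 200, 1200
```

The barrier is what makes Incast painful: one fragment stuck behind an RTO stalls the whole block, no matter how quickly the other S-1 servers finished.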

DETER NETWORK SECURITY TESTBED
400 PCs, located at USC ISI and UC Berkeley
  Supported operating systems include Linux, FreeBSD, and Windows

INITIAL RESULTS

Different senders experience long, synchronized TCP retransmission timeout (RTO) events.

Minimum RTO = 200 ms (the default value, designed for the WAN environment)
MINOR AND INTUITIVE MODIFICATIONS
Decrease the minimum RTO timer from 200 ms
 Randomize the minimum RTO timer
 Use a smaller multiplier for the RTO exponential backoff
 Randomize the multiplier for the RTO exponential backoff
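The four modifications can be compared by generating the sequence of back-to-back timeout waits each one produces for a single lost segment. In this sketch the function name and the randomization ranges (minimum RTO scaled by U(1, 2), per-step multiplier drawn from U(1, m)) are assumptions made for illustration, not the values used in the study.

```python
import random


def timeout_schedule(num_timeouts, min_rto=0.200, multiplier=2.0,
                     randomize_min=False, randomize_multiplier=False,
                     rng=None):
    """Successive RTO waits (seconds) for back-to-back timeouts of one
    segment under a chosen modification. Ranges are hypothetical."""
    if rng is None:
        rng = random.Random()
    # Modification 2 (assumed range): scale the minimum RTO by U(1, 2).
    rto = min_rto * rng.uniform(1.0, 2.0) if randomize_min else min_rto
    waits = []
    for _ in range(num_timeouts):
        waits.append(rto)
        # Modification 4 (assumed range): draw the multiplier from U(1, m).
        m = rng.uniform(1.0, multiplier) if randomize_multiplier else multiplier
        rto *= m  # exponential backoff
    return waits


print(timeout_schedule(4))                 # [0.2, 0.4, 0.8, 1.6]  (default)
print(timeout_schedule(4, min_rto=0.001))  # [0.001, 0.002, 0.004, 0.008]
print(timeout_schedule(4, multiplier=1.5))
print(timeout_schedule(4, randomize_min=True, rng=random.Random(0)))
```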

INITIAL RESULTS

Smaller multiplier for the RTO exponential backoff: useless
Randomized multiplier for the RTO exponential backoff: useless
Reason: there are only a tiny number of exponential backoffs during the entire transfer.
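Why the multiplier barely matters: it only affects the second and later consecutive timeouts of the same segment. The hypothetical helper below, assuming plain geometric backoff, computes the total stall time r·(1 + m + m^2 + ... + m^(n-1)) for n consecutive timeouts; with n = 1 the result is r for every multiplier m.

```python
def stall_time(consecutive_timeouts: int, min_rto: float = 0.200,
               multiplier: float = 2.0) -> float:
    """Total idle time spent waiting on `consecutive_timeouts` back-to-back
    RTO expirations of one segment: min_rto * sum(multiplier ** i)."""
    return sum(min_rto * multiplier ** i for i in range(consecutive_timeouts))


for m in (2.0, 1.5, 1.1):
    print(f"m={m}: one timeout -> {stall_time(1, multiplier=m):.3f} s, "
          f"three timeouts -> {stall_time(3, multiplier=m):.3f} s")
```

If most loss events end after a single timeout, every row prints the same one-timeout stall, which matches the observation that tuning the backoff multiplier does not help.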
INITIAL RESULTS

Randomize the minimum RTO timer

Useless, but also no penalty
Possible reason: because the servers share the same switch, all subsequent switch buffer overflow events will be synchronized for all senders.
ANALYSIS IN DEPTH

Different minimum RTO timer values

Observations:
The initial goodput minimum occurs at the same number of servers.
 With a larger minimum RTO timer value, the goodput maximum occurs at a larger number of senders.
 A smaller minimum RTO timer value gives a faster goodput “recovery” rate.
 The rate of decrease after the local maximum is the same across different minimum RTO settings.

DELAYED ACKS AND HIGH-RESOLUTION TIMERS

Improvements proposed by [11]:
Turn off delayed ACKs (the default delayed-ACK timeout is 40 ms)
 Use high-resolution timers
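A simplified model of what delayed ACKs do to ACK timing: the receiver ACKs every second segment immediately but holds a lone unacknowledged segment until the 40 ms timer fires. The function and the 100 µs segment spacing are hypothetical, and real Linux behavior (quick-ACK mode, adaptive timeouts) is more involved; the sketch only shows how a trailing odd segment picks up a 40 ms delay, which is far longer than a 1 ms minimum RTO.

```python
def ack_times(arrivals, delayed=True, ack_every=2, delack_timeout=0.040):
    """Times (seconds) at which a receiver sends ACKs for segments arriving
    at the sorted times in `arrivals`.

    delayed=False: ACK every segment immediately.
    delayed=True:  ACK every `ack_every`-th segment immediately; otherwise
                   ACK when `delack_timeout` expires after the first
                   unacknowledged segment (simplified model)."""
    if not delayed:
        return list(arrivals)
    acks = []
    pending, first_unacked = 0, None
    for t in arrivals:
        # Fire the delayed-ACK timer if it expired before this arrival.
        if pending and t >= first_unacked + delack_timeout:
            acks.append(first_unacked + delack_timeout)
            pending, first_unacked = 0, None
        if pending == 0:
            first_unacked = t
        pending += 1
        if pending == ack_every:
            acks.append(t)
            pending, first_unacked = 0, None
    if pending:  # leftover segment waits for the timer
        acks.append(first_unacked + delack_timeout)
    return acks


# Hypothetical: three back-to-back segments arriving 100 us apart.
arrivals = [0.0, 0.0001, 0.0002]
print(ack_times(arrivals, delayed=False))
print(ack_times(arrivals))  # the third segment's ACK leaves about 40 ms later
```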

CONGESTION WINDOWS WITH/WITHOUT DELAYED ACKS
SMOOTHED RTT WITH/WITHOUT DELAYED ACKS
DIFFERENT WORKLOADS
Sub-optimal behavior with regard to delayed ACKs is workload independent.
CANNOT MATCH THE RESULTS IN PREVIOUS WORK [11]
SMOOTHED RTT WITH/WITHOUT DELAYED ACKS
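An RTT sample runs from sending a segment to receiving its ACK, so any time the receiver holds the ACK is counted as path delay. The sketch below applies the standard RFC 6298 smoothing (alpha = 1/8, beta = 1/4) to two hypothetical sample streams: one with immediate ACKs, and one where every other sample includes a full 40 ms ACK delay. Linux's estimator uses scaled integer arithmetic, so this is an illustration rather than a reproduction.

```python
def smoothed_rtt(samples, alpha=1 / 8, beta=1 / 4):
    """RFC 6298 estimator; returns (srtt, rttvar) after all samples."""
    srtt = rttvar = None
    for r in samples:
        if srtt is None:
            srtt, rttvar = r, r / 2  # first measurement
        else:
            # RTTVAR is updated first, using the previous SRTT.
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
    return srtt, rttvar


# Hypothetical samples: 100 us path RTT; delayed case adds 40 ms to half.
immediate = [100e-6] * 20
delayed = [100e-6, 40.1e-3] * 10

srtt_i, _ = smoothed_rtt(immediate)
srtt_d, _ = smoothed_rtt(delayed)
print(f"SRTT with immediate ACKs: {srtt_i * 1e3:.3f} ms")
print(f"SRTT with delayed ACKs:   {srtt_d * 1e3:.3f} ms")
```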

QUANTITATIVE MODELS

Net goodput (single sender):
$$\text{goodput} = \frac{D}{L + R \cdot r}$$

Net goodput with S servers:
$$\text{goodput} = \frac{S \cdot D}{L + R \cdot r}$$

D: total amount of data to be sent (100 blocks of 256 KB)
L: total transfer time of the workload without any RTO events
R: the number of RTO events during the transfer
S: number of servers
r: the minimum RTO timer value
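The model can be evaluated directly. In this sketch D follows the slide's definition (100 blocks of 256 KB), while S = 16, L = 4 s, and R = 10 are hypothetical inputs chosen only to show how each RTO event adds a full r seconds of idle time to the denominator.

```python
def net_goodput(S: int, D_bytes: float, L_seconds: float,
                R: int, r_seconds: float) -> float:
    """Net goodput in bytes/s from the model S*D / (L + R*r)."""
    return S * D_bytes / (L_seconds + R * r_seconds)


D = 100 * 256 * 1024  # 100 blocks of 256 KB, per the slide

# Hypothetical inputs: 16 servers, 4 s RTO-free transfer time, 10 RTO events.
for r in (0.200, 0.001):
    g = net_goodput(S=16, D_bytes=D, L_seconds=4.0, R=10, r_seconds=r)
    print(f"min RTO {r * 1e3:5.1f} ms -> {g * 8 / 1e6:.0f} Mbps")
```

With a 200 ms minimum RTO the R·r term is comparable to L itself, while at 1 ms it almost vanishes, so goodput approaches S·D/L.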
FIT THE CURVE OF THE NUMBER OF RTO EVENTS

EQUATION OF L
I: the inter-packet waiting time
HOW GOOD IS THEIR ANALYTICAL MODEL?
FURTHER ANALYSIS OF R AND I
The number of RTO events is similar for different minimum RTO values (200 ms and 1 ms).
 The inter-packet waiting time is very different for different minimum RTO values (200 ms and 1 ms).

QUALITATIVE REFINEMENT OF THEIR MODEL

As the number of senders increases, the number of RTO events per sender increases. Beyond a certain number of senders, the number of RTO events stays constant.
When a network resource becomes saturated, it is saturated at the same time for all senders.
After a congestion event, the senders enter the TCP RTO state.
The RTO timers expire at the senders with a uniform distribution in time, after a constant delay following the congestion event.
T (the width of that uniform distribution) increases as the number of senders increases; however, T is bounded.
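The refinement can be illustrated by sampling when each sender's RTO fires after a shared congestion event. Everything numeric here is a hypothetical assumption: the constant delay is set to a 200 ms minimum RTO, and T is modeled as min(T_max, c·S) with c = 1 ms and T_max = 20 ms, which is just one simple way to make T grow with S while staying bounded.

```python
import random


def rto_expiry_times(num_senders, delay, t_per_sender, t_max, rng):
    """Hypothetical sketch: after a congestion event at time 0, each sender's
    RTO fires at delay + U(0, T), with T = min(t_max, t_per_sender * S)."""
    T = min(t_max, t_per_sender * num_senders)  # grows with S, but bounded
    times = [delay + rng.uniform(0.0, T) for _ in range(num_senders)]
    return times, T


rng = random.Random(42)
for S in (4, 16, 48):
    times, T = rto_expiry_times(S, delay=0.200, t_per_sender=0.001,
                                t_max=0.020, rng=rng)
    spread = max(times) - min(times)
    print(f"S={S:2d}: T={T * 1e3:.0f} ms, observed spread {spread * 1e3:.1f} ms")
```

Because all timers start from the same congestion instant, retransmissions cluster inside a window of width T and can overflow the shared switch buffer together again.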
MORE EXPLANATIONS





A smaller minimum RTO timer value means larger goodput values at the initial minimum.
The initial goodput minimum occurs at the same number of senders, regardless of the value of the minimum RTO timer.
The second-order goodput peak occurs at a higher number of senders for a larger minimum RTO timer value.
The smaller the minimum RTO timer value, the faster the rate of recovery between the goodput minimum and the second-order goodput maximum.
After the second-order goodput maximum, the slope of goodput decrease is the same for different minimum RTO timer values.
CONCLUSIONS
Study the dynamics of Incast.
 Propose a simple mathematical model to explain the observed trends.
 Account for the differences between their observations and those in previous work.
