Transport Protocols

Chapter 7
Internet Transport Protocols
Transport Layer
Our goals:
 understand principles behind transport layer services:
  Multiplexing / demultiplexing data streams of several applications
  reliable data transfer
  flow control
  congestion control
Transport Layer
Chapter 6:
 rdt principles
Chapter 7:
 multiplex / demultiplex
 Internet transport layer protocols:
  UDP: connectionless transport
  TCP: connection-oriented transport
   • connection setup
   • data transfer
   • flow control
   • congestion control
Transport vs. network layer

Transport Layer:
 logical communication between processes
 exists only in hosts
 ignores the network; port #s used for “routing” to the intended process inside the destination computer
Network Layer:
 logical communication between hosts
 exists in hosts and in routers
 routes data through the network; IP addresses used for routing in the network

 Transport layer uses Network layer services
 adds value to these services
Socket Multiplexing
Multiplexing/demultiplexing
Multiplexing at send host:
 gather data from multiple sockets, envelop data with headers (later used for demultiplexing), pass to L3
Demultiplexing at rcv host:
 receive segment from L3
 deliver each received segment to the right socket
[diagram: hosts 1–3, each with an application / transport / network / link / physical stack; processes P1–P4 attached to sockets at the application/transport boundary]

How demultiplexing works
 host receives IP datagrams
 each datagram has source IP address, destination IP address in its header (L3 hdr)
  • used by network to get it there
 each datagram carries one transport-layer segment
 each segment has source, destination port number in its header (L4 header)
 host uses port #s(*) to direct segment to correct socket
 from socket, data gets to the relevant application process

TCP/UDP segment format (32 bits wide), carried inside the IP datagram:
 other IP header fields | source IP addr | dest IP addr
 source port # | dest port # | other header fields
 application data (message)

(*) to find a TCP socket on a server, source & dest. IP addresses are also needed; see details later
Connectionless demultiplexing (UDP)
 Processes create sockets with port numbers
 a UDP socket is identified by a pair of numbers:
  (my IP address, my port number)
 Client decides to contact:
  a server (peer IP address)
  an application (“WKP” – well-known port)
 puts those into the UDP packet sent, written as:
  dest IP address – in the IP header of the packet
  dest port number – in its UDP header
 When server receives a UDP segment:
  checks destination port number in segment
  directs UDP segment to the socket with that port number
  • single server socket per application type
  • (packets from different remote sockets directed to same socket)
  msg waits in socket queue and is processed in its turn
  answer sent to the client socket (listed in Source fields of the query packet)
Realtime UDP applications have individual server sockets per client. However, their port numbers are distinct, since they are coordinated in advance by some signaling protocol. This is possible since the port number is not used to specify the application.
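The demultiplexing behavior above can be demonstrated with ordinary OS sockets. A minimal sketch in Python (the loopback addresses and OS-chosen port stand in for the slide's server IP and “WKP”; they are illustrative, not from the slides):

```python
import socket

# Server side: a single UDP socket per application type; all clients'
# datagrams are demultiplexed to it by destination port alone.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # OS picks a free port (stand-in for a WKP)
server.settimeout(5)
server_port = server.getsockname()[1]

# Two independent clients, each with its own (IP address, port) pair.
c1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
c2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for c in (c1, c2):
    c.settimeout(5)
c1.sendto(b"query-1", ("127.0.0.1", server_port))
c2.sendto(b"query-2", ("127.0.0.1", server_port))

# Both datagrams wait in the same server socket's queue; the Source
# fields of each query provide the "return address" for its answer.
for _ in range(2):
    msg, addr = server.recvfrom(1024)
    server.sendto(b"reply to " + msg, addr)

r1, _ = c1.recvfrom(1024)
r2, _ = c2.recvfrom(1024)
```

Note that the server never opens a per-client socket: replies are steered back purely by the destination (IP, port) taken from each query's source fields.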
Connectionless demux (cont)
[diagram: server with one socket (port=53, IP=C); client A with socket (port=9157, IP=A); client B with socket (port=5775, IP=B).
Getting service: A sends [IP-Header S-IP: A, D-IP: C | UDP-Header SP: 9157, DP: 53 | message]; B sends [S-IP: B, D-IP: C | SP: 5775, DP: 53 | message]. Both wait for the application at the same server socket.
Replies: C sends [S-IP: C, D-IP: A | SP: 53, DP: 9157 | message] to A and [S-IP: C, D-IP: B | SP: 53, DP: 5775 | message] to B.]
SP = Source port number, DP = Destination port number
S-IP = Source IP Address, D-IP = Destination IP Address
SP and S-IP provide “return address”
Connection-oriented demux (TCP)
 TCP socket identified by 4-tuple:
  local (my) IP address
  local (my) port number
  remote (peer) IP address
  remote (peer) port #
 host receiving a packet uses all four values to direct the segment to the appropriate socket
 Server host may support many simultaneous TCP sockets:
  each socket identified by its own 4-tuple
 Web server dedicates a different socket to each connecting client
  If you open two browser windows, you generate 2 sockets at each end
Connection-oriented demux (cont)
[diagram: web server with three working sockets, one per connection:
 LP=80, L-IP=C, RP=9157, R-IP=A
 LP=80, L-IP=C, RP=9157, R-IP=B
 LP=80, L-IP=C, RP=5775, R-IP=B
client A socket: LP=9157, L-IP=A, RP=80, R-IP=C
client B sockets: LP=9157, L-IP=B, RP=80, R-IP=C and LP=5775, L-IP=B, RP=80, R-IP=C
packets [S-IP: A, D-IP: C, SP: 9157, DP: 80], [S-IP: B, D-IP: C, SP: 9157, DP: 80] and [S-IP: B, D-IP: C, SP: 5775, DP: 80] are each directed to a different server socket]
LP = Local Port, RP = Remote Port; L-IP = Local IP, R-IP = Remote IP
“L” = Local = My; “R” = Remote = Peer
Connection-oriented Sockets
 Client socket has a port number unique in host
 packet for client socket directed by the host OS based on dest. port only
 each server application has an always-active waiting socket;
  that socket receives all packets not belonging to any established connection
  these are packets that open new connections
 when the waiting socket accepts a ‘new connection’ segment,
  a new socket is generated at the server with the same port number
  this is the working socket for that connection
  next segments arriving at the server on that connection will be directed to the working socket
  the socket will be identified using all 4 identifiers
  last slide shows working sockets on the server side
Note: Client IP + Client Port are globally unique
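The waiting-socket / working-socket distinction maps directly onto the BSD socket API: `listen()` creates the waiting socket, and each `accept()` returns a new working socket with the same local port. A sketch in Python (loopback addresses are illustrative):

```python
import socket

# Waiting socket: always active, receives connection-opening segments.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
port = listener.getsockname()[1]

def connect():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("127.0.0.1", port))       # 3-way handshake completes here
    return s

clients = [connect(), connect()]         # e.g. two browser windows

# Each accept() generates a working socket for one established connection;
# all share local port `port` and are told apart by the full 4-tuple.
working = [listener.accept() for _ in range(2)]
local_ports = {sock.getsockname()[1] for sock, _ in working}
peer_addrs = {addr for _, addr in working}
```

Both working sockets report the same local port; only the remote (client IP, client port) half of the 4-tuple differs.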
UDP Protocol
UDP: User Datagram Protocol [RFC 768]
 simple transport protocol
 “best effort” service; UDP segments may be:
  lost
  delivered out of order to application
 with no correction by UDP
 UDP will discard bad-checksum segments if so configured by application
 connectionless:
  no handshaking between UDP sender, receiver
  each UDP segment handled independently of others

Why is there a UDP?
 no connection establishment saves delay
 no congestion control:
  better delay & BW
 simple: less memory & RT
 small segment header
 typical usage: realtime appl.
  loss tolerant
  rate sensitive
 other uses (why?):
  DNS
  SNMP
UDP segment structure (32 bits wide):
 source port # | dest port #
 length | checksum
 application data (variable length)
length = total length of segment (bytes)

Checksum computed over:
 • the whole segment, plus
 • part of the IP header:
  – both IP addresses
  – protocol field
  – UDP length
Checksum usage:
 • computed at destination to detect errors
 • on error, discard segment
 • checksum is optional
 • if not used, sender puts checksum = all zeros
 • a computed checksum of zero is sent as all ones
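The one's-complement Internet checksum used by UDP (and TCP) can be sketched in Python; the pseudo-header assembly is omitted, and the function works over any byte string:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, complemented at the end."""
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry fold
    checksum = ~total & 0xFFFF
    return checksum if checksum else 0xFFFF    # a computed zero is sent as all ones
```

At the receiver, summing the segment including the checksum field yields all ones (0xFFFF) when no error is detected.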
TCP Protocol
TCP: Overview
RFCs: 793, 1122, 1323, 2018, 2581
 point-to-point:
  one sender, one receiver
  works between sockets
 reliable, in-order byte stream:
  no “message boundaries”
 pipelined:
  TCP congestion and flow control set window size
 send & receive buffers
 full duplex data:
  bi-directional data flow in same connection
  MSS: maximum segment size
 connection-oriented:
  handshaking (exchange of control msgs) init’s sender, receiver state before data exchange
 flow controlled:
  sender will not overwhelm receiver
[diagram: the application writes data into the TCP send buffer at one socket door; TCP carries it in segments to the TCP receive buffer at the other socket door, where the application reads data]
TCP segment structure (32 bits wide):
 source port # | dest port #
 sequence number
 acknowledgement number
 head len | not used | U A P R S F | rcvr window size
 checksum | ptr urgent data
 Options (variable length)
 application data (variable length)

 head len: hdr length in 32-bit words
 sequence/ack numbers: counting by bytes of data (not segments!)
 rcvr window size: # bytes rcvr willing to accept
 checksum: Internet checksum (as in UDP)
FLAGS:
 ACK: ACK # valid
 SYN: initialize conn., synchronize SN
 FIN: I wish to disconn.
 RST: break conn. immediately
 URG: indicates start of urgent data
 PSH: indicates urgent data ends in this segm.; ptr = end of urgent data
 (PSH, URG seldom used, not clearly defined)
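The fixed 20-byte part of this header can be decoded with `struct`; a sketch (the dictionary field names are chosen here for illustration, not mandated by the slide):

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (fields as on the slide)."""
    sport, dport, seq, ack, off_flags, win, cksum, urg = struct.unpack(
        ">HHIIHHHH", segment[:20])
    return {
        "src_port": sport, "dst_port": dport,
        "seq": seq, "ack": ack,
        "hdr_len_words": off_flags >> 12,     # head len, in 32-bit words
        "flags": {name: bool(off_flags & bit) for name, bit in
                  [("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
                   ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01)]},
        "window": win, "checksum": cksum, "urgent_ptr": urg,
    }
```

For example, a header with offset 5 and flag bits 0x12 parses as a SYNACK with a 20-byte header.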
TCP sequence # (SN) and ACK # (AN)
SN:
 byte-stream “number” of first byte in segment’s data
AN:
 SN of next byte expected from other side
 it’s a cumulative ACK
Qn: how does the receiver handle out-of-order segments?
 puts them in receive buffer but does not acknowledge them

Simple data transfer scenario between Host A and Host B (some time after conn. setup):
 host A sends 100 data bytes
 host B ACKs the 100 bytes and sends 50 data bytes
 host A ACKs receipt of B’s data, sends no data – WHY?
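The scenario's byte accounting, assuming for illustration that both ISNs are 0:

```python
# A sends 100 data bytes starting at SN = 0:
seg1 = {"SN": 0, "len": 100}

# B's AN is cumulative: the SN of the next byte it expects from A.
b_an = seg1["SN"] + seg1["len"]

# B piggybacks 50 data bytes of its own on the ACK:
seg2 = {"SN": 0, "len": 50, "AN": b_an}

# A ACKs B's data; it has nothing to send, so its segment carries no data.
a_an = seg2["SN"] + seg2["len"]
```

So B acknowledges with AN = 100 and A with AN = 50: each AN names the next expected byte, not the last received segment.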
Connection Setup: Objective
 Agree on initial sequence numbers
  a sender should not reuse a seq# before it is sure that all packets with that seq# are purged from the network
  • the network guarantees that a packet too old will be purged: the network bounds the lifetime of each packet
  to avoid waiting for old packets to disappear, choose the initial SN (ISN) far away from the previous session’s
  • needs connection setup so that the sender tells the receiver its initial seq#
 Agree on other initial parameters
  e.g. Maximum Segment Size
TCP Connection Management
Setup: establish connection between the hosts before exchanging data segments
 called: 3-way handshake
 initialize TCP variables:
  seq. #s
  buffers, flow control info (e.g. RcvWindow)
 client: connection initiator
  opens socket and cmds OS to connect it to server
 server: contacted by client
  has waiting socket
  accepts connection
  generates working socket

Three-way handshake:
Step 1: client host sends TCP SYN segment to server
 specifies initial seq # (ISN)
 no data
Step 2: server host receives SYN, replies with SYNACK segment (also no data)
 allocates buffers
 specifies server initial SN & window size
Step 3: client receives SYNACK, replies with ACK segment, which may contain data

Teardown: end of connection (we skip the details)
TCP Three-Way Handshake (TWH)
[diagram: A sends SYN with ISN X; B replies SYNACK with ISN Y and AN X+1; A replies ACK with AN Y+1. Each side records the expected positions in its send and receive buffers: A expects Y+1 from B, B expects X+1 from A]
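A sketch of the TWH message sequence, with random ISNs as in practice (the dictionary field names are illustrative):

```python
import random

def three_way_handshake():
    """Return the three TWH segments as dictionaries of header fields."""
    x = random.randrange(2**32)                      # client ISN
    y = random.randrange(2**32)                      # server ISN
    syn    = {"SYN": True, "SN": x}                  # step 1: client -> server, no data
    synack = {"SYN": True, "ACK": True,              # step 2: server -> client, no data
              "SN": y, "AN": (x + 1) % 2**32}
    ack    = {"ACK": True,                           # step 3: client -> server,
              "SN": (x + 1) % 2**32,                 #         may already carry data
              "AN": (y + 1) % 2**32}
    return syn, synack, ack
```

Each AN acknowledges the peer's SYN by naming the byte after its ISN, which is why data bytes start at ISN+1 on both sides.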
Connection Close
 Objective of closure handshake:
  each side can release resources and remove state about the connection
  • Close the socket
[diagram: client sends the initial close; once no data is expected from the client, the server closes as well; each side then releases its resources. Question raised on the slide: at which point may each side safely release its resources?]
TCP reliable data transfer
 TCP creates reliable service on top of IP’s unreliable service
  pipelined segments
  cumulative acks
  single retransmission timer
  receiver accepts out-of-order segments but does not acknowledge them
 Retransmissions are triggered by timeout events
  in some versions of TCP also by triple duplicate ACKs (see later)
 Initially consider simplified TCP sender:
  ignore flow control, congestion control
TCP sender events:
data rcvd from app:
 create segment with seq #
 seq # is byte-stream number of first data byte in segment
 start timer if not already running (timer relates to oldest unACKed segment)
 expiration interval: TimeOutInterval
timeout (*):
 retransmit segment that caused timeout
 restart timer
ACK rcvd:
 if ACK acknowledges previously unACKed segments
  update what is known to be ACKed (Note: Ack is cumulative)
  start timer if there are outstanding segments
(*) retransmission done also on triple duplicate Ack (see later)
TCP sender (simplified):

NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum

loop (forever) {
  switch(event)

  event: data received from application above
    if (NextSeqNum - SendBase < N) then {
      create TCP segment with sequence number NextSeqNum
      if (timer currently not running)
        start timer
      pass segment to IP
      NextSeqNum = NextSeqNum + length(data)
    }
    else reject data /* in truth: keep in send buffer until new Ack */

  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer

  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }
} /* end of loop forever */

Comment:
 • SendBase-1: last cumulatively ACKed byte
Example:
 • SendBase-1 = 71; y = 73, so the rcvr wants 73+;
   y > SendBase, so new data is ACKed
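The pseudocode can be turned into a runnable toy model; the timer and the network are abstracted away, and the window bound N is a constructor parameter (all names here are illustrative):

```python
class SimpleTCPSender:
    """Toy model of the simplified sender above (no flow/congestion control)."""
    def __init__(self, isn=0, window=4000):
        self.next_seq = isn                  # NextSeqNum
        self.send_base = isn                 # SendBase
        self.window = window                 # N
        self.timer_running = False
        self.unacked = []                    # (seq, length) of outstanding segments

    def on_app_data(self, length):
        if self.next_seq - self.send_base >= self.window:
            return None                      # "reject data"
        seg = (self.next_seq, length)        # segment passed to IP
        self.unacked.append(seg)
        self.timer_running = True            # start timer if not already running
        self.next_seq += length
        return seg

    def on_ack(self, y):
        if y > self.send_base:               # new cumulative ACK
            self.send_base = y
            self.unacked = [(s, n) for s, n in self.unacked if s + n > y]
            self.timer_running = bool(self.unacked)

    def on_timeout(self):
        self.timer_running = True            # restart timer
        return min(self.unacked)             # oldest not-yet-acknowledged segment
```

The cumulative-ACK rule shows up in `on_ack`: a single ACK value y can retire several outstanding segments at once.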
TCP actions on receiver events:
data rcvd from IP:
 if checksum fails, ignore segment
 if checksum OK, then:
  if data came in order:
   update AN & WIN, as follows:
    AN grows by the number of new in-order bytes
    WIN decreases by same #
  if data out of order:
   put in buffer, but don’t count it for AN / WIN
application takes data:
 free the room in buffer
 give the freed cells new numbers
  circular numbering
 WIN increases by the number of bytes taken
TCP: retransmission scenarios
[diagrams, Host A and Host B; both the timer setting and the actual timer run are marked]
A. normal scenario: A starts the timer for SN 92; B’s ACK arrives in time and A stops the timer; A then starts a timer for SN 100 and again stops it when the ACK arrives (afterwards: NO timer).
B. lost ACK + retransmission: A starts the timer for SN 92; B’s ACK is lost (X); on TIMEOUT, A retransmits SN 92 and starts a new timer for it; B’s second ACK stops the timer.
TCP retransmission scenarios (more)
C. lost ACK, NO retransmission: A starts the timer for SN 92, then sends SN 100 (the timer keeps running for 92); the ACK for 92 is lost (X), but the later cumulative ACK arrives before the timeout, so A simply stops the timer – no retransmission (afterwards: NO timer).
D. premature timeout: A’s timer for SN 92 expires although B’s ACK is only delayed; A retransmits SN 92 and restarts the timer; B DROPs the duplicate data and sends a redundant ACK.
TCP ACK generation (Receiver rules) [RFC 1122, RFC 2581]

Event at Receiver                                    | TCP Receiver action
-----------------------------------------------------|----------------------------------------------------
Arrival of in-order segment with expected seq #.     | Delayed ACK. Wait up to 500ms for next segment.
All data up to expected seq # already ACKed          | If no data segment to send, then send ACK
Arrival of in-order segment with expected seq #.     | Immediately send single cumulative ACK,
One other segment has ACK pending                    | ACKing both in-order segments
Arrival of out-of-order segment with higher-than-    | Immediately send duplicate ACK, indicating seq # of
expected seq #. Gap detected                         | next expected byte; this ACK carries no data & no new WIN
Arrival of segment that partially or completely      | Immediately send ACK, provided that segment
fills gap                                            | starts at lower end of 1st gap
Fast Retransmit (Sender Rules)
 time-out period often relatively long:
  causes long delay before resending lost packet
 idea: detect lost segments via duplicate ACKs
  sender often sends many segments back-to-back
  if a segment is lost, there will likely be many duplicate ACKs for that segment
 Rule: if sender receives 4 ACKs for the same data (= 3 duplicates), it assumes that the segment after the ACKed data was lost:
  fast retransmit: resend segment immediately (before timer expires)
Fast Retransmit scenario
[diagram: Host A sends segments with seq #s x1…x5; the segment with seq # x2 is lost (X). Host B returns ACK # x2 four times (no data in segment, no window change). The triple duplicate ACKs trigger retransmission of x2 before A’s timeout expires.]
Fast retransmit algorithm:

event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else if (segment carries no data & doesn’t change WIN) {
    /* a duplicate ACK for already ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      /* fast retransmit */
      resend segment with sequence number y
      count of dup ACKs received for y = 0
    }
  }
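The duplicate-ACK counting can be isolated into a small helper (a sketch; the class and method names are chosen here, and actual retransmission is left to the caller):

```python
class DupAckDetector:
    """Toy fast-retransmit trigger for a sender that tracks SendBase."""
    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_count = 0

    def on_ack(self, y, carries_data=False, win_changed=False):
        """Return the SN to fast-retransmit, or None."""
        if y > self.send_base:                 # new cumulative ACK
            self.send_base = y
            self.dup_count = 0
            return None
        if not carries_data and not win_changed:
            self.dup_count += 1                # duplicate ACK for y
            if self.dup_count == 3:            # i.e. 4th ACK for the same data
                self.dup_count = 0
                return y                       # resend segment with SN = y
        return None
```

ACKs that carry data or advertise a new window are not counted, matching the receiver-rules table: only pure duplicate ACKs signal a suspected loss.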
TCP: setting timeouts
General idea
Q: how to set TCP timeout interval?
 should be longer than RTT
  but: RTT will vary
 if too short: premature timeout
  unnecessary retransmissions
 if too long: slow reaction to segment loss
 Set: Timeout Interval = average RTT + safe margin
Estimating Round Trip Time
 SampleRTT: measured time from segment transmission until receipt of ACK for it
 SampleRTT will vary; want a “smoother” estimated RTT
  use several recent measurements, not just current SampleRTT
[plot: RTT (milliseconds) from gaia.cs.umass.edu to fantasia.eurecom.fr over time; SampleRTT fluctuates roughly between 100 and 350 ms while EstimatedRTT tracks it smoothly]

EstimatedRTT = (1-α)*EstimatedRTT + α*SampleRTT
 Exponential weighted moving average
 influence of past sample decreases exponentially fast
 typical value: α = 0.125
Setting Timeout
Problem:
 using the average of SampleRTT will generate many timeouts due to network variations
Solution:
 EstimatedRTT plus “safety margin”
 large variation in EstimatedRTT -> requires larger safety margin
Estimate average deviation of RTT:
 DevRTT = (1-β)*DevRTT + β*|SampleRTT-EstimatedRTT|
 (typically, β = 0.25)
Then set timeout interval:
 TimeoutInterval = EstimatedRTT + 4*DevRTT
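Both formulas fit in one small estimator. A sketch, using the slides' typical α and β; the choice of DevRTT = SampleRTT/2 on the first sample follows RFC 6298 and is not from the slides:

```python
class RTTEstimator:
    """EWMA RTT estimation and timeout setting (EstimatedRTT / DevRTT)."""
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated = first_sample
        self.dev = first_sample / 2           # initial DevRTT, per RFC 6298

    def update(self, sample):
        # DevRTT = (1-beta)*DevRTT + beta*|SampleRTT - EstimatedRTT|
        self.dev = (1 - self.beta) * self.dev + self.beta * abs(sample - self.estimated)
        # EstimatedRTT = (1-alpha)*EstimatedRTT + alpha*SampleRTT
        self.estimated = (1 - self.alpha) * self.estimated + self.alpha * sample

    @property
    def timeout(self):
        return self.estimated + 4 * self.dev  # TimeoutInterval
```

With steady samples, DevRTT decays toward 0 and the timeout converges down toward EstimatedRTT; any burst of variation widens the safety margin again.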
TCP: Flow Control
TCP Flow Control: Simple Case
 TCP at A sends data to B
 application process at B may be slow at reading from buffer
 flow control: sender won’t overflow receiver’s buffer by transmitting too much, too fast
 flow control matches the send rate of A to the receiving application’s drain rate at B
 Receive buffer size set by OS at connection init
 WIN = window size = number of bytes A may send starting at AN
[diagram: TCP receive buffer at node B – data already taken by the application, then TCP data in buffer up to AN, then spare room of size WIN; data from IP (sent by TCP at A) fills the spare room]
TCP Flow control: General Case
[diagram: receive buffer at node B – data taken by application | ACKed data in buffer | spare room; data from IP (sent by TCP at A) arrives starting at AN]
Procedure:
 AN = first byte not received yet
  sent to A in TCP header
  non-ACKed data in buffer (arrived out of order) ignored
 Rcvr advertises “spare room” by including value of WIN in his segments
 Sender A is allowed to send at most WIN bytes in the range starting with AN
  guarantees that receive buffer doesn’t overflow
Formulas:
 AckedRange = AN – FirstByteNotReadByAppl
            = # bytes rcvd in sequence & not taken
 WIN = RcvBuffer – AckedRange = “SpareRoom”
 AN and WIN sent to A in TCP header
 Data received out of sequence is considered part of the ‘spare room’ range
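The WIN formula as code (parameter names mirror the slide's terms):

```python
def advertised_window(rcv_buffer, an, first_byte_not_read):
    """WIN = RcvBuffer - AckedRange, i.e. the receiver's "spare room"."""
    acked_range = an - first_byte_not_read   # bytes rcvd in sequence & not taken
    return rcv_buffer - acked_range
```

For example, with a 4096-byte buffer, AN = 1000 and the application having read up to byte 600, the receiver advertises WIN = 4096 − 400 = 3696; once the application catches up to AN, the full buffer is advertised again.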
TCP Flow Control – Example 1

TCP Flow Control – Example 2
TCP: Congestion Control
TCP Congest’n Ctrl Overview (1)
 Closed-loop, end-to-end, window-based congestion control
 Designed by Van Jacobson in the late 1980s, based on the AIMD algorithm of Dah-Ming Chiu and Raj Jain
 Works well so far: the bandwidth of the Internet has increased by more than 200,000 times
 Many versions:
  TCP-Tahoe: this is a less optimized version
  TCP-Reno: many OSs today implement Reno-type congestion control
  TCP-Vegas: not currently used
For more details: see Stevens, TCP/IP Illustrated; K-R chapter 6.7; or read
http://lxr.linux.no/source/net/ipv4/tcp_input.c for the linux implementation
TCP Congest’n Ctrl Overview (2)
 Dynamic window size [Van Jacobson]
  Initialization: MI (Multiplicative Increase)
   • Slow start
  Steady state: AIMD (Additive Increase / Multiplicative Decrease)
   • Congestion Avoidance
 “Congestion is timeout || 3 duplicate ACKs”
  TCP Tahoe: treats both cases identically
  TCP Reno: treats each case differently
 “Congestion = (also) higher latency”
  TCP Vegas
General method
 sender limits rate by limiting the number of unACKed bytes “in pipeline”:
  LastByteSent – LastByteAcked ≤ cwnd
  cwnd: differs from WIN (how, why?)
  sender limited by ewnd ≡ min(cwnd, WIN) (effective window)
 roughly: rate = ewnd / RTT bytes/sec
 cwnd is dynamic, a function of perceived network congestion
The Basic Two Phases
[plot: cwnd (in MSS) vs time – first Slow start (Multiplicative Increase), then Congestion avoidance (Additive Increase)]
Pure AIMD: Bandwidth Probing Principle
 “probing for bandwidth”: increase transmission rate on receipt of ACKs, until eventually loss occurs, then decrease transmission rate
  continue to increase on ACK, decrease on loss (since available bandwidth is changing, depending on other connections in the network)
[plot: sending rate vs time – while ACKs are being received, increase the rate slowly (AI); on loss (X), decrease the rate fast (MD); the result is TCP’s “sawtooth” behavior. This model ignores Slow Start.]
 Q: how fast to increase/decrease?
  details to follow
TCP Slowstart: MI (* used in all TCP versions)

Slowstart algorithm:
  initialize: cwnd = 1 MSS
  for (each segment ACKed)
    cwnd += MSS (*)
  until (congestion event OR cwnd ≥ threshold)
  On congestion event:
    { Threshold = cwnd/2
      cwnd = 1 MSS }

(*) cwnd is doubled per RTT:
 • exponential increase in window size (very fast!)
 • therefore slowstart lasts a short time
[diagram: Host A sends one segment in the first RTT, two in the second, four in the third, …]
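The doubling-per-RTT claim can be simulated per RTT (a sketch; real stacks grow cwnd per ACK, which amounts to the same doubling when every segment in the window is ACKed):

```python
def slow_start_trace(threshold_mss, rtts):
    """cwnd (in MSS) at the start of each RTT during slow start."""
    cwnd, trace = 1, []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= threshold_mss:
            break                  # leave slow start at the threshold
        cwnd *= 2                  # each of cwnd ACKed segments adds 1 MSS
    return trace
```

With a threshold of 16 MSS, the window passes through 1, 2, 4, 8, 16 in just four round trips, which is why slow start lasts a short time.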
TCP: congestion avoidance (CA)
 when cwnd > ssthresh, grow cwnd linearly as long as all ACKs arrive:
  increase cwnd by ≈1 MSS per RTT
  approach possible congestion slower than in slowstart
  implementation: cwnd += MSS*MSS/cwnd for each ACK received
AIMD:
 ACKs: increase cwnd by 1 MSS per RTT: additive increase
 loss(*): cut cwnd in half: multiplicative decrease
  true in macro picture
  in the actual algorithm, may have Slow Start first to grow up to this value (+)
(*) = Timeout or 3 Duplicate ACKs
(+) depends on case & TCP type
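The per-ACK implementation trick can be checked numerically: over one RTT, the cwnd/MSS increments of MSS*MSS/cwnd add up to roughly one MSS (a sketch with integer arithmetic):

```python
def ca_window_after_rtt(cwnd, mss=1000):
    """Apply one RTT's worth of per-ACK increments in congestion avoidance."""
    for _ in range(cwnd // mss):       # one ACK per full segment in flight
        cwnd += mss * mss // cwnd      # cwnd += MSS*MSS/cwnd
    return cwnd
```

Starting from cwnd = 10 MSS, one RTT of ACKs grows the window by just under 1 MSS, i.e. linear growth rather than slow start's doubling.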
TCP Tahoe
 Initialize with SlowStart state with cwnd = 1 MSS
 When cwnd ≥ ssthresh, change to CA state
 When sensing congestion(*):
  set ssthresh = ewnd/2 (+)
  set cwnd = 1 MSS
  change state to SlowStart
(*) Timeout or Triple Duplicate Ack
(+) recall ewnd = min(cwnd, WIN); in our discussion here we assume that WIN > cwnd, so ewnd = cwnd
[state diagram: SSt → CA on cwnd ≥ ssthresh; on T/O or 3 Dup, either state returns to SSt – CA grows by AI and the drop back to SSt is the MD step]
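Tahoe's window evolution can be sketched per RTT (cwnd in MSS units; WIN assumed larger than cwnd, as the slide does, so ewnd = cwnd):

```python
class TahoeCwnd:
    """Toy TCP-Tahoe window evolution, one step per RTT."""
    def __init__(self, ssthresh=64):
        self.cwnd, self.ssthresh = 1, ssthresh

    def on_rtt_all_acked(self):
        if self.cwnd < self.ssthresh:
            self.cwnd = min(self.cwnd * 2, self.ssthresh)   # slow start: MI
        else:
            self.cwnd += 1                                  # CA: AI

    def on_congestion(self):          # timeout or triple duplicate Ack
        self.ssthresh = max(self.cwnd // 2, 1)              # ssthresh = ewnd/2
        self.cwnd = 1                                       # back to slow start
```

Note that Tahoe makes no distinction between the two congestion signals: both send it back to a 1-MSS window.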
TCP Reno
Rationale:
 a triple duplicate event shows less congestion than a timeout
  the first segment was probably lost, but some others arrived
 therefore on 3Dup, cwnd is decreased to ewnd/2, skipping the SlowStart stage
  less aggressive than on T/O
  this is an approximate description; more details below
TCP Reno Procedure:
 Initialize with SlowStart
 Slowstart as in Tahoe
 CA growth as in Tahoe
 On T/O, act as in Tahoe
 On Triple Duplicate:
  set ssthresh = ewnd/2
  enter Fast Recovery state
   this is a temporary state until a non-Dup Ack arrives
   when Fast Recovery ends, set: cwnd = ssthresh
Fast Recovery
Rationale:
 cwnd increases only when a new segment is Ack’ed
 in the 3-Dup situation, it may take time until such an Ack arrives
 until that time, we increase cwnd on the arrival of each duplicate Ack, including the three that triggered Fast Retransmit
Fast Recovery State:
 Initialize: cwnd += 3 MSS
 on each additional duplicate Ack, increase cwnd by MSS
 when a new segment is acknowledged, set cwnd = ssthresh
  recall that ssthresh was set to half of the last ewnd value in CA state
TCP Reno cwnd Trace
[plot: Congestion Window (0–70) vs Time (0–60). Slow-start periods grow the window exponentially up to the threshold; CA periods then show additive increase. Timeouts drop cwnd to 1 and restart Slow Start; a triple duplicate Ack triggers fast retransmission, after which the fast recovery stage is skipped in this particular trace.]
TCP Reno Cong. Ctrl State Transition Diagram
[diagram: three states – slow start, congestion avoidance, fast recovery.
 slow start → congestion avoidance: cwnd > ssthresh
 slow start → fast recovery: loss: 3dupACK
 congestion avoidance → fast recovery: loss: 3dupACK
 fast recovery → congestion avoidance: new ACK
 each state → slow start: loss: timeout]
TCP Reno Congestion Control FSM

INIT: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start

slow start:
 new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
 duplicate ACK: dupACKcount++ (check == 3?)
 cwnd > ssthresh: → congestion avoidance
 timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
 dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; → fast recovery

congestion avoidance:
 new ACK: cwnd = cwnd + MSS*(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
 duplicate ACK: dupACKcount++ (check == 3?)
 timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
 dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; → fast recovery

fast recovery:
 duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
 new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
 timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
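The FSM above, transcribed into code (a sketch; cwnd is kept in bytes with MSS fixed at 1000 for illustration, and retransmission itself is left out):

```python
class RenoFSM:
    """Sketch of the Reno congestion-control FSM above."""
    MSS = 1000

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = self.MSS                  # 1 MSS
        self.ssthresh = 64 * 1024             # 64 KB
        self.dup = 0

    def new_ack(self):
        if self.state == "slow_start":
            self.cwnd += self.MSS
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += self.MSS * self.MSS // self.cwnd
        else:                                 # fast recovery ends on a new ACK
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup = 0

    def dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.MSS             # inflate on each extra duplicate
            return
        self.dup += 1
        if self.dup == 3:                     # triple duplicate: fast retransmit
            self.ssthresh = self.cwnd // 2
            self.cwnd = self.ssthresh + 3 * self.MSS
            self.state = "fast_recovery"

    def timeout(self):
        self.ssthresh = self.cwnd // 2
        self.cwnd = self.MSS
        self.dup = 0
        self.state = "slow_start"
```

Driving it with three duplicate ACKs and then a new ACK reproduces the Reno behavior: cwnd lands at ssthresh instead of restarting from 1 MSS.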
Popular “flavors” of TCP
[plot: cwnd (in segments) vs transmission round; after a loss event, TCP Reno continues from ssthresh while TCP Tahoe restarts from 1 MSS, both shown against the ssthresh level]
Summary: TCP Reno Congestion Control
 when cwnd < ssthresh, sender in slow-start phase, window grows exponentially.
 when cwnd >= ssthresh, sender is in congestion-avoidance phase, window grows linearly.
 when triple duplicate ACK occurs, ssthresh set to cwnd/2, cwnd eventually set to ~ssthresh (after detour to Fast Recovery state)
 when timeout occurs, ssthresh set to cwnd/2, cwnd set to 1 MSS.
TCP throughput
 Q: what’s the average throughput of TCP as a function of window size and RTT?
  ignoring slow start
 let W be the window size when loss occurs
 when the window is W, throughput is W/RTT
 just after loss, the window drops to W/2, throughput to W/2RTT, then grows linearly
 average throughput: 0.75 W/RTT
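The 0.75 W/RTT figure is just the mean of the sawtooth, which can be checked numerically (a sketch):

```python
def avg_sawtooth_throughput(w, rtt, steps=1000):
    """Average throughput over one sawtooth cycle: the window rises
    linearly from W/2 to W, and throughput at each point is window/RTT."""
    total = 0.0
    for i in range(steps):
        window = w / 2 + (w / 2) * i / steps   # linear growth over the cycle
        total += window / rtt
    return total / steps
```

The mean of a line rising from W/2 to W is 3W/4, so the result approaches 0.75·W/RTT as `steps` grows.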
TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[diagram: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]
Why is TCP fair?
Two competing sessions (Tahoe, Slow Start ignored):
 Additive increase gives a slope of 1 as throughput increases
 multiplicative decrease decreases throughput proportionally
[diagram: Connection 1 throughput (x-axis) vs Connection 2 throughput (y-axis), both bounded by R. Starting from (a,b), congestion avoidance (additive increase) moves along the line y = x + (b-a) to the loss point (a+t, b+t); halving both windows gives ((a+t)/2, (b+t)/2), which lies on y = x + (b-a)/2; the next cycle ends on y = x + (b-a)/4, and so on. The offset from the equal-bandwidth-share line y = x halves on every loss, so the operating point converges to an equal share.]
Fairness (more)
Fairness and UDP:
 multimedia apps often do not use TCP
  do not want rate throttled by congestion control
 instead use UDP:
  pump audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
 nothing prevents an appl. from opening parallel connections between two hosts
  web browsers do this
 example: link of rate R already supporting 9 connections:
  new app asks for 1 TCP, gets rate R/10
  new app asks for 11 TCPs, gets > R/2 !!
Extra Slides

Exercise
 MSS = 1000
 Only one event per row