TCP

Transport Layer Outline
• 3.1 Transport-layer services
• 3.2 Multiplexing and demultiplexing
• 3.3 Connectionless transport: UDP
• 3.4 Principles of reliable data transfer
• 3.5 Connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
• 3.6 Principles of congestion control
• 3.7 TCP congestion control
Transport Layer
3-1
Recap: rdt3.0 sender (Stop-and-wait)
[Figure: rdt3.0 sender FSM. Transitions are written as event / actions; Λ means no action.]
State "Wait for call 0 from above":
• rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK0"
• rdt_rcv(rcvpkt) / Λ (stay)
State "Wait for ACK0":
• rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ) / Λ (stay)
• timeout / udt_send(sndpkt); start_timer (stay)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer -> "Wait for call 1 from above"
State "Wait for call 1 from above":
• rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK1"
• rdt_rcv(rcvpkt) / Λ (stay)
State "Wait for ACK1":
• rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ) / Λ (stay)
• timeout / udt_send(sndpkt); start_timer (stay)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer -> "Wait for call 0 from above"
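Recap: rdt3.0: stop-and-wait operation
[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives at receiver
• last packet bit arrives, receiver sends ACK
• ACK arrives, sender sends next packet, t = RTT + L / R
$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} \approx 0.00027$$
(times in milliseconds: L / R = 0.008 ms = 8 microseconds; the utilization itself has no unit)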
Recap: Pipelining: increased utilization
[Figure: sender/receiver timing diagram with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives at receiver
• last packet bit arrives, receiver sends ACK
• last bit of 2nd packet arrives, receiver sends ACK
• last bit of 3rd packet arrives, receiver sends ACK
• ACK arrives, sender sends next packet, t = RTT + L / R
Sending 3 packets per round trip increases utilization by a factor of 3!
$$U_{sender} = \frac{3 \cdot L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
(times in milliseconds; the utilization itself has no unit)
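The two utilization figures can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name is hypothetical, and the parameters (8000-bit packets, a 1 Gbps link, a 30 ms RTT) are assumptions chosen because they give the 0.008 ms and 30.008 ms values used in the formulas.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are sent per round trip."""
    transmit_s = packet_bits / rate_bps            # L / R for one packet
    busy_s = window * transmit_s                   # time spent pushing bits onto the link
    return min(1.0, busy_s / (rtt_s + transmit_s))


# Assumed example parameters: L = 8000 bits, R = 1 Gbps, RTT = 30 ms.
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)

print(f"stop-and-wait:      {stop_and_wait:.5f}")
print(f"3-packet pipeline:  {pipelined:.5f}")
```

With these assumed inputs, the pipelined value is three times the stop-and-wait value, which is the factor of 3 claimed on the slide.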
Recap: GBN for Pipelined Error Recovery
Sender:
• There is a k-bit sequence # in the packet header
• a “window” of up to N consecutive unacknowledged packets (sent or can-be-sent) is allowed
• window moves forward by 1 packet at a time when its 1st sent pkt is acknowledged (standard behavior)
• window cannot contain acknowledged pkts
Sender must respond to three types of events:
• 1- Invocation from above: the application layer tries to send a packet; if the window is full the packet is returned, otherwise the packet is accepted and sent.
• 2- Receipt of an ACK: ACK(n) indicates that all pkts up to and including seq # n have been received - “cumulative ACK”
  - may receive duplicate ACKs (when receiver receives out-of-order packets)
• 3- A timeout event (only cause of retransmission):
  - timer for each in-flight pkt.
  - if timeout occurs: retransmit all sent packets that have not been acknowledged.
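The three sender events can be sketched as a small class. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, sequence numbers grow without the k-bit wraparound, and timers are left to the caller, who invokes `on_timeout()` when one expires.

```python
class GBNSender:
    """Toy Go-Back-N sender; sequence numbers are plain integers (no wraparound)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 0        # oldest unacknowledged seq #
        self.next_seq = 0    # next seq # to assign
        self.sent = {}       # seq # -> payload, kept for retransmission

    def send(self, payload):
        """Event 1: invocation from above. Returns False if the window is full."""
        if self.next_seq >= self.base + self.N:
            return False
        self.sent[self.next_seq] = payload
        self.next_seq += 1
        return True

    def on_ack(self, n):
        """Event 2: cumulative ACK(n) covers every packet up to and including n."""
        n = min(n, self.next_seq - 1)   # ignore ACKs for packets never sent
        if n < self.base:
            return                      # duplicate ACK: nothing new acknowledged
        for seq in range(self.base, n + 1):
            self.sent.pop(seq, None)
        self.base = n + 1               # slide the window forward

    def on_timeout(self):
        """Event 3: list every sent-but-unacknowledged packet for retransmission."""
        return [(seq, self.sent[seq]) for seq in range(self.base, self.next_seq)]


s = GBNSender(window_size=3)
accepted = [s.send(p) for p in "abcd"]   # the fourth send is refused: window full
s.on_ack(1)                              # packets 0 and 1 are acknowledged
pending = s.on_timeout()                 # only packet 2 is still outstanding
```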
Recap: Selective repeat for error recovery
Window may contain acknowledged pkts (unlike GBN)
TCP: Overview
RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point:
  - one sender, one receiver
  - no one-to-many multicast
• full duplex data:
  - bi-directional data flow in same connection at the same time
• flow controlled:
  - sender will not overwhelm receiver
• connection-oriented:
  - processes must handshake before sending data
  - three-way handshake (exchange of control msgs) initializes sender, receiver state before data exchange
• pipelined:
  - TCP congestion and flow control set window size
• send & receive buffers:
  - set aside during the 3-way handshake
[Figure: the sending application writes data through its socket door into the TCP send buffer; segments carry the data to the TCP receive buffer, from which the receiving application reads data through its socket door]
TCP: Overview - cont
Maximum Segment Size (MSS):
• Defined as the maximum amount of application-layer data in a TCP segment.
• TCP grabs data in chunks from the send buffer; the maximum chunk size is the MSS. A TCP segment consists of a TCP header plus up to MSS bytes of data.
• MSS is set by first determining the largest link-layer frame (Maximum Transmission Unit, or MTU) that can be sent by the local host.
• MSS is then chosen so that a segment carrying MSS bytes of data, put into an IP datagram, fits into a single link-layer frame. Common values of MSS are 1460 bytes, 536 bytes and 512 bytes.
TCP sequence #s:
• both sides randomly choose initial seq #s (other than 0) to avoid accepting segments of older connections that were using the same ports.
• TCP views data as an unstructured but ordered stream of bytes, so seq #s count bytes in the stream.
• for a file size of 500,000 bytes and MSS = 1,000 bytes, segment seq #s are: 0, 1000, 2000, etc.
TCP acknowledgement #s:
• uses cumulative acks: TCP only acks bytes up to the first missing byte in the stream. TCP RFCs do not address how to handle out-of-order segments.
• ACK # field holds the seq # of the next byte that the host sending the segment expects from the other side
TCP segment structure
[Figure: TCP segment layout, drawn 32 bits wide]
Fields, in order:
• source port # (16 bits), dest port # (16 bits)
• sequence number (32 bits): counts bytes of data (not segments!)
• acknowledgement number (32 bits)
• header length (4 bits, in 32-bit words), not used, flag bits U A P R S F, Receive window (16 bits)
• checksum (16 bits), Urgent data pointer (16 bits)
• Options (variable length): used to negotiate MSS
• application data (variable length)
Annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now to upper layer
• SYN/FIN: connection setup and close
• RST=1: used in response when client tries to connect to a non-open server port
• checksum: Internet checksum (as in UDP)
• Receive window: # bytes receiver is willing to accept (RcvWindow size)
• with 32-bit sequence numbers, the largest file that can be sent before sequence numbers wrap around = 2^32 bytes (4 GB); total # segments = filesize/MSS
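To make the field layout concrete, the fixed 20-byte part of the header can be packed and unpacked with Python's `struct` module. This is a minimal sketch, not a full TCP implementation: the helper names are hypothetical, options are omitted, and the checksum and urgent pointer are simply left at zero.

```python
import struct

# Two 16-bit ports, 32-bit seq and ack numbers, one 16-bit word holding
# header length + flags, then window, checksum, urgent pointer (network byte order).
TCP_HEADER = struct.Struct("!HHIIHHHH")

FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def pack_header(src, dst, seq, ack, flags, window, header_words=5):
    """Build a 20-byte header; checksum and urgent pointer are left as 0 here."""
    # Header length occupies the top 4 bits; the six flag bits U A P R S F the low bits.
    offset_flags = (header_words << 12) | (flags & 0x3F)
    return TCP_HEADER.pack(src, dst, seq, ack, offset_flags, window, 0, 0)


def unpack_header(raw):
    """Decode the fixed part of a TCP header into a dictionary."""
    src, dst, seq, ack, offset_flags, window, _checksum, _urgent = TCP_HEADER.unpack(raw[:20])
    return {
        "src": src, "dst": dst, "seq": seq, "ack": ack,
        "header_bytes": (offset_flags >> 12) * 4,   # 32-bit words -> bytes
        "SYN": bool(offset_flags & SYN),
        "ACK": bool(offset_flags & ACK),
        "FIN": bool(offset_flags & FIN),
        "window": window,
    }


# Example values only: a SYN from port 3123 to port 80 with a 65535-byte window.
hdr = unpack_header(pack_header(3123, 80, 4019802004, 0, SYN, 65535))
```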
Seq Numbers and Ack Numbers
• Suppose a data stream of size 500,000 bytes, MSS is 1,000 bytes; the first byte of the data stream is numbered zero.
  - Seq numbers of the segments: 1st seg: 0; 2nd seg: 1000; 3rd seg: 2000, …
• Ack number:
  - Assume host A is sending a segment to host B. Because TCP is full-duplex, A may be receiving data from B simultaneously.
  - The ack number that host B puts in its segment is the seq number of the next byte B is expecting from A.
  - Example: B has received all bytes numbered 0 through 535 from A. If B is about to send a segment to host A, the ack number in its segment should be 536.
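The numbering rule can be checked with a short calculation. This sketch uses hypothetical helper names and assumes the stream starts at byte 0, as in the example; real connections start from a randomly chosen initial sequence number.

```python
def segment_seq_numbers(stream_bytes, mss, first_byte=0):
    """Sequence number of each segment = number of its first data byte."""
    return list(range(first_byte, first_byte + stream_bytes, mss))


def ack_number(last_in_order_byte):
    """Cumulative ACK: the next byte the receiver is expecting."""
    return last_in_order_byte + 1


seqs = segment_seq_numbers(500_000, 1_000)
print(seqs[:3], "...", seqs[-1])   # first three and last segment numbers
print(len(seqs))                   # total number of segments = filesize / MSS
print(ack_number(535))             # B has bytes 0..535, so it asks for the next one
```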
TCP seq. #’s and ACKs - Telnet example
Seq. #’s:
• byte stream “number” of first byte in segment’s data
ACKs:
• seq # of next byte expected from other side
• cumulative ACK
Telnet uses “echo back” to ensure characters seen by the user have already been received and processed at the server.
• Assume starting seq #s are 42 and 79 for client and server respectively.
• After the connection is established, the client is waiting for byte 79 and the server for byte 42.
[Figure: simple telnet scenario between Host A (client) and Host B (server), time running downward]
• User types ‘C’ at Host A
• Host B ACKs receipt of ‘C’, echoes back ‘C’
• Host A ACKs receipt of echoed ‘C’
TCP Round Trip Time and Timeout
Q: how to set TCP timeout value? (timer management)
• based on RTT
• longer than RTT
  - but RTT varies
• too short: premature timeout
  - unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission (handing the segment to IP) until ACK receipt
  - ignore retransmissions (why?)
• SampleRTT will vary from segment to segment, want estimated RTT “smoother”
  - average several recent measurements, not just current SampleRTT
• TCP maintains an average called EstimatedRTT and uses it to calculate the timeout value
TCP Round Trip Time (RTT) and Timeout
EstimatedRTT = (1 - α) * priorEstimatedRTT + α * currentSampleRTT
• Exponential Weighted Moving Average (EWMA)
• Puts more weight on recent samples rather than old ones
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
• Formula becomes:
EstimatedRTT = 0.875 * priorEstimatedRTT + 0.125 * currentSampleRTT
Why TCP ignores retransmissions when calculating SampleRTT:
Suppose source sends packet P1, the timer for P1 expires, and the source then sends P2, a
new copy of the same packet. Further suppose the source measures SampleRTT for P2
(the retransmitted packet) and that shortly after transmitting P2 an acknowledgment for P1
arrives. The source will mistakenly take this acknowledgment as an acknowledgment for
P2 and calculate an incorrect value of SampleRTT.
RTT Sample Ambiguity
[Figure: two A/B timing diagrams showing how a retransmission makes the measured SampleRTT, and so the estimated RTT, ambiguous; one diagram marks a loss with X]
• Karn’s RTT Estimator
  - If a segment has been retransmitted:
    • Don’t count RTT sample on ACKs for this segment
    • Keep backed off time-out for next packet
    • Reuse RTT estimate only after one successful transmission
Example RTT estimation:
[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), y-axis: RTT (milliseconds, roughly 100 to 350); two curves: SampleRTT and EstimatedRTT]
TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus “safety margin”
  - large variation in EstimatedRTT -> larger safety margin
• first estimate how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)
Then set timeout interval:
TimeoutInterval = EstimatedRTT + 4 * DevRTT
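The EstimatedRTT, DevRTT and TimeoutInterval formulas fit together as in the sketch below. Several details are assumptions rather than part of the slides: the class name is hypothetical, DevRTT is seeded with half of the first sample, and the deviation is computed against the previous EstimatedRTT before that estimate is updated. Samples flagged as retransmissions are skipped, in the spirit of Karn's rule.

```python
class RttEstimator:
    """EWMA-based timeout calculation (all times in the same unit, e.g. ms)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed starting value

    def update(self, sample_rtt, retransmitted=False):
        """Fold one SampleRTT into the estimate and return the new timeout."""
        if not retransmitted:                # ambiguous samples are ignored
            deviation = abs(sample_rtt - self.estimated_rtt)
            self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
            self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        return self.timeout_interval()

    def timeout_interval(self):
        """TimeoutInterval = EstimatedRTT + 4 * DevRTT."""
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
timeout_after_sample = est.update(120.0)                           # a clean sample
timeout_after_retx = est.update(500.0, retransmitted=True)         # skipped sample
```

A noisy path raises DevRTT and therefore the safety margin, while a steady path lets the timeout settle close to EstimatedRTT.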
TCP: conn-oriented transport
• segment structure
• RTT Estimation and Timeout
• reliable data transfer
• flow control
• connection management
TCP reliable data transfer
• TCP creates rdt service on top of IP’s unreliable service
• Pipelined segments
• Cumulative acks
• TCP uses a single retransmission timer, as multiple timers require considerable overhead
• Retransmissions are triggered by:
  - timeout events
  - duplicate acks
• Initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control
TCP sender events:
data rcvd from app:
• Create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running for some other segment (think of timer as for oldest unacknowledged segment)
• expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
Ack rcvd:
• a valid ACK field (cumulative ACK) acknowledges previously unacknowledged segments:
  - update expected ACK #
  - restart timer if there are currently unacknowledged segments
TCP sender (simplified)
NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
  switch(event)
    event: data received from application above
      create TCP segment with sequence number NextSeqNum
      if (timer currently not running)
        start timer
      pass segment to IP
      NextSeqNum = NextSeqNum + length(data)
    event: timer timeout
      retransmit not-yet-acknowledged segment with smallest sequence number
      start timer
    event: ACK received, with ACK field value of y
      if (y > SendBase) {
        SendBase = y
        if (there are currently not-yet-acknowledged segments)
          start timer
      }
} /* end of loop forever */
Comment:
• SendBase-1: last cumulatively ack’ed byte
Example:
• SendBase-1 = 71; y = 73, so the rcvr wants byte 73 and beyond; y > SendBase, so new data is acked
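The simplified sender loop can be turned into a small event-driven class for experimentation. This is a sketch under stated assumptions: the class and attribute names are hypothetical, the timer is reduced to a boolean flag, "passing a segment to IP" just appends it to a list, and sequence-number wraparound is ignored.

```python
class SimpleTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, flow or congestion control."""

    def __init__(self, initial_seq):
        self.send_base = initial_seq    # oldest unacknowledged byte
        self.next_seq = initial_seq     # seq # for the next new segment
        self.unacked = {}               # seq # -> data not yet acknowledged
        self.timer_running = False
        self.wire = []                  # (seq, data) pairs handed to IP

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        if not self.timer_running:
            self.timer_running = True   # start timer
        self.wire.append((seq, data))   # pass segment to IP
        self.next_seq += len(data)

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)     # smallest not-yet-acknowledged seq #
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True   # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # Drop every segment whose last byte lies below the cumulative ACK.
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


# Example run: two segments (8 and 20 bytes) starting at seq 92.
sender = SimpleTcpSender(initial_seq=92)
sender.on_app_data(b"x" * 8)     # Seq=92
sender.on_app_data(b"y" * 20)    # Seq=100
sender.on_timeout()              # timer expires: Seq=92 is resent
sender.on_ack(120)               # cumulative ACK covers both segments
```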
TCP: retransmission scenarios
[Figure: two Host A / Host B timing diagrams, time running downward. Left: lost ACK scenario, in which an ACK is lost (X) and the Seq=92 timer expires. Right: premature timeout, in which the Seq=92 timer expires before the ACK arrives. On timeout Host A transmits the not-yet-acked segment with the smallest seq #. SendBase values of 100 and 120 are marked along the timelines.]
TCP retransmission scenarios (more)
[Figure: Cumulative ACK scenario between Host A and Host B: one ACK is lost (X), timeout interval shown on Host A’s timeline, SendBase = 120]
• Doubling the timeout value is a technique used in TCP implementations. The timeout value is doubled for every retransmission, since the timeout could have occurred because the network is congested. (The intervals grow exponentially after each retransmission and are reset after either of the two other events: data received from the application or ACK received.)
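The doubling rule amounts to a tiny piece of state. The sketch below uses hypothetical names and assumes the base interval has already been derived from EstimatedRTT and DevRTT; it also omits the upper bound that real implementations place on the interval.

```python
class BackoffTimer:
    """Retransmission timer interval with exponential backoff."""

    def __init__(self, timeout_interval):
        self.base = timeout_interval      # value derived from the RTT estimate
        self.current = timeout_interval

    def on_timeout(self):
        """Each retransmission doubles the interval used for the next attempt."""
        self.current *= 2
        return self.current

    def on_ack_or_new_data(self):
        """Either of the other two sender events resets the interval."""
        self.current = self.base
        return self.current


timer = BackoffTimer(timeout_interval=1.0)
after_timeouts = [timer.on_timeout() for _ in range(3)]   # three timeouts in a row
after_ack = timer.on_ack_or_new_data()                    # back to the base value
```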
TCP ACK generation policy [RFC 1122, RFC 2581]
1. Event at Receiver: Arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed.
   TCP Receiver action: Delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK.
2. Event at Receiver: Arrival of in-order segment with expected seq #. One other segment has ACK pending.
   TCP Receiver action: Immediately send single cumulative ACK, ACKing both in-order segments.
3. Event at Receiver: Arrival of out-of-order segment with higher-than-expected seq #. Gap detected.
   TCP Receiver action: Immediately send duplicate ACK, indicating seq # of next expected byte.
4. Event at Receiver: Arrival of segment that partially or completely fills gap.
   TCP Receiver action: Immediately send ACK, provided that segment starts at lower end of gap.
Note: this policy leaves buffering of out-of-order segments open.
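The ACK numbers produced by rows 3 and 4 of the policy can be illustrated with a toy receiver. This is a sketch under explicit assumptions: the names are hypothetical, the receiver chooses to buffer out-of-order segments (the policy leaves that open), segments never overlap, and the delayed-ACK timer of rows 1 and 2 is not modelled, so an ACK number is returned for every arrival.

```python
class TcpReceiver:
    """Tracks the next expected byte and buffers out-of-order segments."""

    def __init__(self, initial_seq):
        self.expected = initial_seq
        self.buffered = {}          # seq # -> length of an out-of-order segment

    def on_segment(self, seq, length):
        """Process one arriving segment and return the cumulative ACK number."""
        if seq == self.expected:
            self.expected += length
            # A segment that fills the lower end of a gap may release buffered data.
            while self.expected in self.buffered:
                self.expected += self.buffered.pop(self.expected)
        elif seq > self.expected:
            self.buffered[seq] = length     # gap detected: keep it for later
        return self.expected                # old duplicates just get re-ACKed


rcv = TcpReceiver(initial_seq=0)
acks = [
    rcv.on_segment(0, 100),     # in order
    rcv.on_segment(200, 100),   # gap: duplicate ACK for byte 100
    rcv.on_segment(300, 100),   # still a gap: another duplicate ACK
    rcv.on_segment(100, 100),   # fills the gap: ACK jumps past buffered data
]
```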
Fast Retransmit
• Time-out period often relatively long:
  - long delay before resending lost packet
• Detect lost segments via duplicate ACKs.
  - A dup ACK is an ACK that re-acknowledges the receipt of an already acknowledged segment
  - Sender often sends many segments back-to-back
  - If a segment is lost, there will likely be many duplicate ACKs.
• If the sender receives 3 duplicate ACKs for the same data, it supposes that the segment after the last ACKed segment was lost:
  - sender performs fast retransmit: resend segment before that segment’s timer expires
  - algorithm comes as a result of 15 years of TCP experience!
Fast retransmit algorithm:
event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {   /* a duplicate ACK for already ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      resend segment with sequence number y   /* fast retransmit */
    }
  }
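The duplicate-ACK branch can be exercised with a few lines of code. This is a minimal sketch with hypothetical names: it only counts ACKs and records which sequence number would be resent, leaving out timers and the actual segment data.

```python
from collections import Counter


class FastRetransmitSender:
    """ACK handling with fast retransmit on the third duplicate ACK."""

    DUP_ACK_THRESHOLD = 3

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = Counter()     # ACK value y -> number of duplicates seen
        self.retransmitted = []       # seq #s resent by fast retransmit

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # new data acknowledged
            self.dup_acks.clear()
        else:
            self.dup_acks[y] += 1     # duplicate ACK for already ACKed data
            if self.dup_acks[y] == self.DUP_ACK_THRESHOLD:
                self.retransmitted.append(y)   # resend segment starting at y


fr = FastRetransmitSender(send_base=100)
for ack_value in [120, 120, 120, 120, 120]:
    fr.on_ack(ack_value)      # first ACK is new; the next four are duplicates
```

Using `==` rather than `>=` for the threshold means a fourth duplicate does not trigger a second retransmission of the same segment.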
Is TCP a GBN or SR protocol ?
• TCP can buffer out-of-order segments (like SR).
• TCP has a proposed modification, selective acknowledgement (RFC 2018), that lets the receiver selectively acknowledge out-of-order segments and save on retransmissions (like SR).
• TCP sender need only maintain the smallest seq # of a transmitted but unacknowledged byte and the seq # of the next byte to be sent (like GBN).
• TCP is a hybrid between GBN and SR.
TCP Flow Control
• receive side of TCP connection has a receive buffer
• app process may be slow at reading from buffer
flow control: sender won’t overflow receiver’s buffer by transmitting too much, too fast
• speed-matching service: matching the send rate to the receiving app’s drain rate
TCP Flow control: how it works
(Suppose TCP receiver discards out-of-order segments)
• sender maintains a variable called the receive window
• spare room in buffer = RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
• TCP is not allowed to overflow the allocated buffer (LastByteRcvd - LastByteRead <= RcvBuffer)
• Rcvr advertises spare room by including value of RcvWindow in segments
• RcvWindow = RcvBuffer at the start of transmission
• Sender limits unACKed data to RcvWindow
  - sender keeps track of UnAcked data size = (LastByteSent - LastByteAcked)
  - UnAcked data size <= RcvWindow
• When the receiver’s RcvWindow = 0, the sender does not block but rather sends 1-byte segments that are acked by the receiver until RcvWindow becomes bigger.
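The bookkeeping on both sides comes down to two subtractions. The sketch below uses hypothetical function names and example byte counts; it treats the zero-window case as permission to send a single probe byte, as described above.

```python
def rcv_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Receiver side: spare room advertised in the Receive window field."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sendable_bytes(window, last_byte_sent, last_byte_acked):
    """Sender side: how much new data may be sent without overflowing the receiver."""
    if window == 0:
        return 1                          # keep probing with 1-byte segments
    unacked = last_byte_sent - last_byte_acked
    return max(0, window - unacked)


# Example values: a 4096-byte buffer with 2000 bytes not yet read by the app.
window = rcv_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
budget = sendable_bytes(window, last_byte_sent=5000, last_byte_acked=4000)
probe = sendable_bytes(0, last_byte_sent=5000, last_byte_acked=5000)
```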
Recap: TCP socket interaction
Server (running on hostid):
• create socket, port=x, for incoming request: welcomeSocket = ServerSocket()
• wait for incoming connection request: connectionSocket = welcomeSocket.accept()
• read request from connectionSocket
• write reply to connectionSocket
• close connectionSocket
Client:
• create socket, connect to hostid, port=x: clientSocket = Socket()
• send request using clientSocket
• read reply from clientSocket
• close clientSocket
[Figure: TCP connection setup links the client’s Socket() call to the server’s accept()]
TCP Connection Management
Recall: TCP sender, receiver establish “connection” before exchanging data segments
• initialize TCP variables:
  - seq. #s
  - buffers, flow control info (e.g. RcvWindow)
• client: connection initiator
  Socket clientSocket = new Socket("hostname", portNumber);
• server: contacted by client
  Socket connectionSocket = welcomeSocket.accept();
[Figure: TCP segment structure, repeated from the earlier slide]
TCP Connection Management - connecting
• Three way handshake:
  - Step 1: client host sends TCP SYN segment (SYN bit=1) to server
    • specifies initial seq # (client_isn)
    • no data
  - Step 2: server host receives SYN, replies with SYNACK segment
    • server allocates buffers
    • specifies server initial seq. # (server_isn), with ACK # = client_isn+1
  - Step 3: client receives SYNACK, replies with ACK # = server_isn+1, which may contain data
[Figure: client and server timelines: conn request (SYN), conn granted (SYNACK), ACK]
TCP Connection Setup Example
09:23:33.042318 IP 128.2.222.198.3123 > 192.216.219.96.80: S 4019802004:4019802004(0) win 65535 <mss 1260,nop,nop,sackOK>
09:23:33.118329 IP 192.216.219.96.80 > 128.2.222.198.3123: S 3428951569:3428951569(0) ack 4019802005 win 5840 <mss 1460,nop,nop,sackOK>
09:23:33.118405 IP 128.2.222.198.3123 > 192.216.219.96.80: . ack 3428951570 win 65535
• Client SYN
  - SeqC: Seq. #4019802004, window 65535, max. seg. 1260
• Server SYN-ACK+SYN
  - Receive: #4019802005 (= SeqC+1)
  - SeqS: Seq. #3428951569, window 5840, max. seg. 1460
• Client SYN-ACK
  - Receive: #3428951570 (= SeqS+1)
• sackOK: selective acknowledge
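The numbers in the trace follow directly from the handshake rules, which the sketch below reproduces. The function name and the dictionary layout are hypothetical; only the sequence and acknowledgement arithmetic is modelled, with the two initial sequence numbers taken from the trace.

```python
SEQ_SPACE = 2 ** 32    # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the seq/ack values carried by the SYN, SYNACK and final ACK."""
    syn = {"flags": "S", "seq": client_isn}
    synack = {
        "flags": "SA",
        "seq": server_isn,
        "ack": (client_isn + 1) % SEQ_SPACE,   # the SYN consumes one seq #
    }
    ack = {
        "flags": "A",
        "seq": synack["ack"],
        "ack": (server_isn + 1) % SEQ_SPACE,
    }
    return syn, synack, ack


syn, synack, ack = three_way_handshake(client_isn=4019802004, server_isn=3428951569)
print(synack["ack"])   # acknowledgement number in the server's reply
print(ack["ack"])      # acknowledgement number in the client's final ACK
```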
TCP Connection Management - disconnecting
Closing a connection:
client closes socket: clientSocket.close();
Step 1: client end system sends TCP FIN control segment (FIN bit=1) to server
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN=1.
[Figure: client and server timelines: client close, server close, client timed wait, closed]
TCP Connection Management (cont.)
Step 3: client receives FIN, replies with ACK.
• Enters “timed wait”: will respond with ACK to received FINs. A typical wait is 30 sec, after which all resources and ports are released.
Step 4: server receives ACK. Connection closed.
[Figure: client and server timelines: closing, closing, timed wait, closed, closed]
TCP Conn. Teardown Example
09:54:17.585396 IP 128.2.222.198.4474 > 128.2.210.194.6616: F 1489294581:1489294581(0) ack 1909787689 win 65434
09:54:17.585732 IP 128.2.210.194.6616 > 128.2.222.198.4474: F 1909787689:1909787689(0) ack 1489294582 win 5840
09:54:17.585764 IP 128.2.222.198.4474 > 128.2.210.194.6616: . ack 1909787690 win 65434
• Session
  - Echo client on 128.2.222.198, server on 128.2.210.194
• Client FIN
  - SeqC: 1489294581
• Server ACK + FIN
  - Ack: 1489294582 (= SeqC+1)
  - SeqS: 1909787689
• Client ACK
  - Ack: 1909787690 (= SeqS+1)
TCP Connection Management (cont)
[Figure: state diagrams of the TCP server lifecycle and the TCP client lifecycle]
Queue Management
• Two queues for each listening socket
Concurrent Server
pid_t pid;
int listenfd, connfd;

listenfd = Socket( ... );
/* fill in sockaddr_in{} with server's well-known port */
Bind(listenfd, ... );
Listen(listenfd, LISTENQ);
for ( ; ; ) {
    connfd = Accept(listenfd, ... );  /* probably blocks */
    if ( (pid = Fork()) == 0) {
        Close(listenfd);  /* child closes listening socket */
        doit(connfd);     /* process the request */
        Close(connfd);    /* done with this client */
        exit(0);          /* child terminates */
    }
    Close(connfd);        /* parent closes connected socket */
}
Concurrent Server (Cont’)
[Figure: four panels]
(a) Status before call to accept returns
(b) Status after return from accept
(c) Status after return of spawning a process
(d) Status after parent/child close appropriate sockets
TCP Summary
• TCP Properties:
  - point to point, connection-oriented, full-duplex, reliable
• TCP Segment Structure
• How TCP sequence and acknowledgement #s are assigned
• How TCP computes the timeout value needed for retransmissions using EstimatedRTT and DevRTT
• TCP retransmission scenarios, ACK generation and fast retransmit
• How TCP Flow Control works
• TCP Connection Management: 3 segments exchanged to connect and 4 segments exchanged to disconnect