Transport Layer Outline
3.1 Transport-layer services
3.2 Multiplexing and demultiplexing
3.3 Connectionless transport: UDP
3.4 Principles of reliable data transfer
3.5 Connection-oriented transport: TCP
    • segment structure
    • reliable data transfer
    • flow control
    • connection management
3.6 Principles of congestion control
3.7 TCP congestion control
Transport Layer 3-1

Recap: rdt3.0 sender (stop-and-wait)
(FSM diagram, listed here as state: event -> actions; Λ means "no action")
State "Wait for call 0 from above":
    • rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
    • rdt_rcv(rcvpkt) -> Λ
State "Wait for ACK0":
    • rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ) -> Λ
    • timeout -> udt_send(sndpkt); start_timer
    • rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) -> stop_timer; go to "Wait for call 1 from above"
State "Wait for call 1 from above":
    • rdt_send(data) -> sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
    • rdt_rcv(rcvpkt) -> Λ
State "Wait for ACK1":
    • rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ) -> Λ
    • timeout -> udt_send(sndpkt); start_timer
    • rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) -> stop_timer; go to "Wait for call 0 from above"

Recap: rdt3.0 stop-and-wait operation
(timeline figure: sender on the left, receiver on the right)
    • first packet bit transmitted, t = 0
    • last packet bit transmitted, t = L / R
    • first packet bit arrives at receiver
    • last packet bit arrives, receiver sends ACK
    • ACK arrives, sender sends next packet, t = RTT + L / R
U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027 (a dimensionless fraction)

Recap: Pipelining: increased utilization
(timeline figure: sender on the left, receiver on the right, three packets in flight)
    • first packet bit transmitted, t = 0
    • last bit transmitted, t = L / R
    • first packet bit arrives at receiver
    • last packet bit arrives, receiver sends ACK
    • last bit of 2nd packet arrives, receiver sends ACK
    • last bit of 3rd packet arrives, receiver sends ACK
    • ACK arrives, sender sends next packet, t = RTT + L / R
Sending three packets per round trip increases utilization by a factor of 3!
U_sender = (3 * L / R) / (RTT + L / R) = 0.024 / 30.008 = 0.0008 (a dimensionless fraction)

Recap: GBN for pipelined error recovery
Sender:
    • k-bit sequence # in the packet header
    • "window" of up to N consecutive unacknowledged packets (sent or can-be-sent) allowed
    • the window moves forward one packet at a time when its first sent packet is acknowledged (standard behavior)
    • the window cannot contain acknowledged packets
The sender must respond to three types of events:
    1. Invocation from above: the application layer tries to send a packet. If the window is full the packet is returned; otherwise it is accepted and sent.
    2. Receipt of an ACK: ACK(n) indicates that all packets up to and including seq # n have been received ("cumulative ACK"). The sender may receive duplicate ACKs (when the receiver gets out-of-order packets).
    3. A timeout event (the only cause of retransmission): there is a timer for each in-flight packet. If a timeout occurs, retransmit the packets that have not been acknowledged.

Recap: Selective repeat for error recovery
    • The window may contain acknowledged packets (unlike GBN)

TCP: Overview
RFCs: 793, 1122, 1323, 2018, 2581
    • point-to-point: one sender, one receiver; no one-to-many multicast
    • connection-oriented: processes must handshake before sending data; the three-way handshake (an exchange of control msgs) initializes sender and receiver state before data exchange
    • full duplex data: bi-directional data flow in the same connection at the same time
    • flow controlled: sender will not overwhelm receiver
    • pipelined: TCP congestion and flow control set the window size
    • send & receive buffers: set aside during the 3-way handshake
(figure: application writes data -> socket door -> TCP send buffer -> segment -> TCP receive buffer -> socket door -> application reads data)

TCP: Overview - cont
Maximum Segment Size (MSS): the maximum amount of application-layer data in a TCP segment. TCP grabs data in chunks from the send buffer, and the maximum chunk size is the MSS.
    • A TCP segment consists of a TCP header plus up to MSS bytes of application data.
    • The MSS is derived from the largest link-layer frame (Maximum Transmission Unit, or MTU) that the local host can send: it is set so that a segment carrying MSS bytes, placed in an IP datagram, fits into a single link-layer frame.
    • Common values of MSS are 1460 bytes, 536 bytes and 512 bytes.
TCP sequence #s:
    • Both sides randomly choose initial seq #s (other than 0), so that segments left over from an older connection on the same ports are not mistaken for segments of the current one.
    • TCP views data as an unstructured but ordered stream of bytes, so seq #s are over the stream of bytes.
    • Example: for a file of 500,000 bytes and MSS = 1,000 bytes (first byte numbered 0), the segment seq #s are 0, 1000, 2000, etc.
TCP acknowledgement #s:
    • TCP uses cumulative ACKs: it only acknowledges bytes up to the first missing byte in the stream.
    • The TCP RFCs do not specify how to handle out-of-order segments.
    • The ACK # field holds the sequence number of the next byte that the host sending the segment expects from the other side.

TCP segment structure
(figure: header layout, 32 bits wide)
    source port #                      | dest port #
    sequence number
    acknowledgement number
    header length | not used | U A P R S F | receive window
    checksum                           | urgent data pointer
    options (variable length), used to negotiate MSS
    application data (variable length)
Field notes:
    • sequence and acknowledgement numbers count bytes of data (not segments!)
    • 32-bit sequence numbers can number 2^32 bytes (4 GB) before wrapping around; total # segments = file size / MSS
    • header length: 4 bits, counted in 32-bit words
    • URG: urgent data (generally not used)
    • ACK: ACK # valid
    • PSH: push data now to upper layer
    • SYN/FIN: connection setup and close
    • RST=1: used in response when a client tries to connect to a non-open server port
    • receive window: 16 bits, # bytes the receiver is willing to accept (RcvWindow size)
    • checksum: Internet checksum (as in UDP)

Seq Numbers and Ack Numbers
Suppose a data stream of 500,000 bytes with an MSS of 1,000 bytes, where the first byte of the data stream is numbered zero.
Seq number of the segments:
    • 1st seg: 0; 2nd seg: 1000; 3rd seg: 2000, …
Ack number: assume host A is sending a segment to host B. Because TCP is full-duplex, A may be receiving data from B simultaneously.
The ack number that host B puts in its segment is the seq number of the next byte B is expecting from A.
    • Example: B has received all bytes numbered 0 through 535 from A. If B is about to send a segment to host A, the ack number in that segment should be 536.

TCP seq. #'s and ACKs - Telnet example
Telnet uses "echo back" to ensure that the characters seen by the user have already been received and processed at the server.
Assume the starting seq #s are 42 and 79 for client and server respectively. After the connection is established, the client is waiting for byte 79 and the server for byte 42.
    • Seq. #'s: byte stream "number" of the first byte in the segment's data
    • ACKs: seq # of the next byte expected from the other side; cumulative ACK
(figure: simple telnet scenario, Host A client and Host B server, time running downwards)
    1. User types 'C'
    2. Host B ACKs receipt of 'C', echoes back 'C'
    3. Host A ACKs receipt of echoed 'C'

TCP Round Trip Time and Timeout
Q: how to set the TCP timeout value? (timer management)
    • based on RTT: longer than RTT, but RTT varies
    • too short: premature timeout, unnecessary retransmissions
    • too long: slow reaction to segment loss
Q: how to estimate RTT?
    • SampleRTT: measured time from segment transmission (handing the segment to IP) until ACK receipt
    • ignore retransmissions (why?)
    • SampleRTT will vary from segment to segment; we want the estimated RTT to be "smoother"
    • average several recent measurements, not just the current SampleRTT
    • TCP maintains an average called EstimatedRTT and uses it to calculate the timeout value

TCP Round Trip Time (RTT) and Timeout
EstimatedRTT = (1 - α) * priorEstimatedRTT + α * currentSampleRTT
    • Exponential Weighted Moving Average (EWMA)
    • puts more weight on recent samples than on old ones: the influence of a past sample decreases exponentially fast
    • typical value: α = 0.125
The formula becomes:
EstimatedRTT = 0.875 * priorEstimatedRTT + 0.125 * currentSampleRTT

Why TCP ignores retransmissions when calculating SampleRTT:
Suppose the source sends packet P1, the timer for P1 expires, and the source then sends P2, a new copy of the same packet. Further suppose the source measures SampleRTT for P2 (the retransmitted packet) and that shortly after transmitting P2 an acknowledgment for P1 arrives. The source will mistakenly take this acknowledgment as an acknowledgment for P2 and calculate an incorrect value of SampleRTT.
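To make the smoothing effect of the EWMA concrete, here is a minimal Python sketch of the EstimatedRTT update with α = 0.125. The function names, the starting estimate, and the sample values are made up for illustration; they do not come from a real trace.

```python
ALPHA = 0.125  # typical weight given on the slide


def update_estimated_rtt(prior_estimate, sample_rtt, alpha=ALPHA):
    """One EWMA step: blend the prior estimate with the newest sample."""
    return (1 - alpha) * prior_estimate + alpha * sample_rtt


def smooth(samples, initial_estimate, alpha=ALPHA):
    """Return the EstimatedRTT value after each sample in turn."""
    estimates = []
    estimate = initial_estimate
    for sample in samples:
        estimate = update_estimated_rtt(estimate, sample, alpha)
        estimates.append(estimate)
    return estimates


# Hypothetical SampleRTT values in milliseconds, with one outlier (300).
samples_ms = [100.0, 120.0, 300.0, 110.0, 105.0]
estimates_ms = smooth(samples_ms, initial_estimate=100.0)

for sample, estimate in zip(samples_ms, estimates_ms):
    print(f"SampleRTT={sample:6.1f}  EstimatedRTT={estimate:8.3f}")
```

The single 300 ms outlier moves the estimate only from 102.5 to about 127.2, and the following samples pull it back slowly, which is the "smoother" behavior the slide asks for.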
RTT Sample Ambiguity
(two timeline figures between hosts A and B: in the first, the sample RTT is measured against the original transmission; in the second, a segment is lost (X) and retransmitted, so it is unclear which transmission the sample RTT belongs to)
Karn's RTT Estimator
If a segment has been retransmitted:
    • Don't count RTT samples on ACKs for this segment
    • Keep the backed-off timeout for the next packet
    • Reuse the RTT estimate only after one successful transmission

Example RTT estimation
(plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds); y-axis: RTT (milliseconds), roughly 100 to 350; two curves: SampleRTT and EstimatedRTT)

TCP Round Trip Time and Timeout
Setting the timeout
    • EstimatedRTT plus a "safety margin"
    • large variation in EstimatedRTT -> larger safety margin
    • first estimate how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)
Then set the timeout interval:
TimeoutInterval = EstimatedRTT + 4 * DevRTT

TCP: conn-oriented transport
    • segment structure
    • RTT Estimation and Timeout
    • reliable data transfer
    • flow control
    • connection management

TCP reliable data transfer
    • TCP creates an rdt service on top of IP's unreliable service
    • Pipelined segments
    • Cumulative acks
    • TCP uses a single retransmission timer, as multiple timers require considerable overhead
    • Retransmissions are triggered by:
        • timeout events
        • duplicate acks
    • Initially consider a simplified TCP sender:
        • ignore duplicate acks
        • ignore flow control, congestion control

TCP sender events:
data rcvd from app:
    • create segment with seq #
    • seq # is the byte-stream number of the first data byte in the segment
    • start timer if not already running for some other segment (think of the timer as belonging to the oldest unacknowledged segment)
    • expiration interval: TimeOutInterval
timeout:
    • retransmit the segment that caused the timeout
    • restart timer
Ack rcvd:
    • if a valid ACK field (cumulative ACK) acknowledges previously unacknowledged segments:
        • update expected ACK #
        • restart timer if there are currently unacknowledged segments
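The TimeOutInterval used by the sender's timer comes from the EstimatedRTT and DevRTT formulas on the "Setting the timeout" slide. The Python sketch below puts the two updates together with Karn's rule of skipping samples from retransmitted segments. The class name, the starting values, and the choice to update DevRTT before EstimatedRTT (using the previous estimate) are assumptions of this sketch; the slides do not fix the update order or the initial values.

```python
ALPHA = 0.125  # weight for EstimatedRTT (typical value from the slides)
BETA = 0.25    # weight for DevRTT (typical value from the slides)


class RttEstimator:
    """Hypothetical helper tracking EstimatedRTT, DevRTT and the timeout."""

    def __init__(self, estimated_rtt, dev_rtt=0.0):
        # Assumption: the caller supplies the starting values.
        self.estimated_rtt = estimated_rtt
        self.dev_rtt = dev_rtt

    def on_ack(self, sample_rtt, retransmitted=False):
        """Fold one SampleRTT into the estimates."""
        if retransmitted:
            # Karn's rule: the sample is ambiguous, so do not use it.
            return
        # Assumption: deviation is measured against the previous estimate.
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - BETA) * self.dev_rtt + BETA * deviation
        self.estimated_rtt = (1 - ALPHA) * self.estimated_rtt + ALPHA * sample_rtt

    def timeout_interval(self):
        """TimeoutInterval = EstimatedRTT + 4 * DevRTT."""
        return self.estimated_rtt + 4 * self.dev_rtt


estimator = RttEstimator(estimated_rtt=100.0)       # made-up start, in ms
estimator.on_ack(120.0)                             # clean sample
estimator.on_ack(500.0, retransmitted=True)         # ignored (Karn's rule)
print(estimator.estimated_rtt, estimator.dev_rtt, estimator.timeout_interval())
```

With these numbers the clean 120 ms sample gives EstimatedRTT = 102.5, DevRTT = 5.0, and a timeout of 122.5 ms; the 500 ms sample from a retransmitted segment leaves all three unchanged.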
TCP sender (simplified)

    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

    loop (forever) {
        switch (event)

        event: data received from application above
            create TCP segment with sequence number NextSeqNum
            if (timer currently not running)
                start timer
            pass segment to IP
            NextSeqNum = NextSeqNum + length(data)

        event: timer timeout
            retransmit not-yet-acknowledged segment with smallest sequence number
            start timer

        event: ACK received, with ACK field value of y
            if (y > SendBase) {
                SendBase = y
                if (there are currently not-yet-acknowledged segments)
                    start timer
            }
    } /* end of loop forever */

Comment:
    • SendBase-1: last cumulatively ACKed byte
Example:
    • SendBase-1 = 71; y = 73, so the rcvr wants bytes 73 onwards; y > SendBase, so new data is ACKed

TCP: retransmission scenarios
(two timeline figures, Host A sending to Host B; on a timeout A retransmits the not-yet-ACKed segment with the smallest seq #)
    • lost ACK scenario: an ACK from B is lost (X), the Seq=92 timer at A expires, and SendBase ends at 100
    • premature timeout: the Seq=92 timer at A expires before the ACKs arrive; SendBase moves to 100 and then to 120

TCP retransmission scenarios (more)
(timeline figure, Host A sending to Host B)
    • cumulative ACK scenario: one ACK is lost (X), but a later cumulative ACK arrives before the timeout and moves SendBase to 120
Doubling the timeout value: this technique is used in TCP implementations. The timeout value is doubled for every retransmission, since the timeout could have occurred because the network is congested. The intervals therefore grow exponentially with each retransmission, and the interval is reset when either of the two other sender events occurs (data received from the application, or ACK received).

TCP ACK generation policy [RFC 1122, RFC 2581]
Event at receiver -> TCP receiver action
    1. Arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
       -> Delayed ACK: wait up to 500 ms for the next segment. If no next segment arrives, send ACK.
    2. Arrival of in-order segment with expected seq #; one other segment has an ACK pending
       -> Immediately send a single cumulative ACK, ACKing both in-order segments.
    3. Arrival of out-of-order segment with higher-than-expected seq #; gap detected
       -> Immediately send a duplicate ACK, indicating the seq # of the next expected byte.
    4. Arrival of segment that partially or completely fills the gap
       -> Immediately send ACK, provided that the segment starts at the lower end of the gap.
Note: this policy leaves the buffering of out-of-order segments open.

Fast Retransmit
    • The time-out period is often relatively long: long delay before resending a lost packet
    • Detect lost segments via duplicate ACKs. A duplicate ACK is an ACK that re-acknowledges a segment that has already been acknowledged.
        • The sender often sends many segments back-to-back
        • If a segment is lost, there will likely be many duplicate ACKs
    • If the sender receives 3 duplicate ACKs for the same data, it supposes that the segment after the last ACKed segment was lost:
        • the sender performs fast retransmit: it resends the segment before that segment's timer expires
    • The algorithm is the result of 15 years of TCP experience!

Fast retransmit algorithm:

    event: ACK received, with ACK field value of y
        if (y > SendBase) {
            SendBase = y
            if (there are currently not-yet-acknowledged segments)
                start timer
        }
        else {   /* a duplicate ACK for an already ACKed segment */
            increment count of dup ACKs received for y
            if (count of dup ACKs received for y = 3) {
                resend segment with sequence number y   /* fast retransmit */
            }
        }

Is TCP a GBN or SR protocol?
    • TCP can buffer out-of-order segments (like SR).
    • TCP has a selective acknowledgement option (RFC 2018) to selectively acknowledge out-of-order segments and save on retransmissions (like SR).
    • The TCP sender need only maintain the smallest seq # of a transmitted but unacknowledged byte and the seq # of the next byte to be sent (like GBN).
    • TCP is a hybrid between GBN and SR.
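The simplified sender loop and the fast retransmit rule can be combined into one small event-driven model. The Python sketch below is only an illustration under stated assumptions: the class name, the log of transmissions, the 8-byte segments, and the starting sequence number 92 are hypothetical choices; the timer is reduced to a boolean flag; and sequence-number wrap-around, flow control, and congestion control are ignored.

```python
class SimplifiedTcpSender:
    """Hypothetical model of the simplified TCP sender with fast retransmit."""

    def __init__(self, initial_seq=0):
        self.send_base = initial_seq   # oldest unacknowledged byte
        self.next_seq = initial_seq    # seq # of the next new byte
        self.unacked = {}              # seq # -> payload still awaiting an ACK
        self.dup_acks = 0              # duplicate ACKs seen for send_base
        self.timer_running = False     # stands in for the single timer
        self.sent_log = []             # (seq #, reason) for each transmission

    def _transmit(self, seq, reason):
        # A real sender would pass the segment to IP here.
        self.sent_log.append((seq, reason))

    def send(self, data):
        """Event: data received from the application above."""
        seq = self.next_seq
        self.unacked[seq] = data
        self._transmit(seq, "new")
        if not self.timer_running:
            self.timer_running = True
        self.next_seq += len(data)

    def on_timeout(self):
        """Event: timer timeout. Resend the oldest unacknowledged segment."""
        if self.unacked:
            self._transmit(min(self.unacked), "timeout")
            self.timer_running = True

    def on_ack(self, y):
        """Event: ACK received with ACK field value y."""
        if y > self.send_base:
            self.send_base = y
            self.dup_acks = 0
            # Drop every segment whose last byte lies below y.
            acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in acked:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
        else:
            self.dup_acks += 1
            if self.dup_acks == 3 and y in self.unacked:
                self._transmit(y, "fast retransmit")


# Scenario: five 8-byte segments starting at seq 92; the one at seq 100 is lost.
sender = SimplifiedTcpSender(initial_seq=92)
for _ in range(5):
    sender.send(b"x" * 8)
sender.on_ack(100)          # first segment acknowledged
for _ in range(3):
    sender.on_ack(100)      # three duplicate ACKs caused by the gap
sender.on_ack(132)          # cumulative ACK after the retransmission arrives
print(sender.sent_log)
```

The third duplicate ACK for 100 triggers the retransmission of the segment with sequence number 100 without waiting for the timer, and the final cumulative ACK clears everything outstanding.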
TCP Flow Control
    • the receive side of a TCP connection has a receive buffer
    • the app process may be slow at reading from the buffer
    • flow control: the sender won't overflow the receiver's buffer by transmitting too much, too fast
    • speed-matching service: matching the send rate to the receiving app's drain rate

TCP Flow control: how it works
(Suppose the TCP receiver discards out-of-order segments)
    • spare room in buffer = RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
    • TCP is not allowed to overflow the allocated buffer: LastByteRcvd - LastByteRead <= RcvBuffer
    • The rcvr advertises the spare room by including the value of RcvWindow in segments
    • RcvWindow = RcvBuffer at the start of transmission
    • The sender maintains a variable called the receive window and limits unACKed data to RcvWindow
    • The sender keeps track of UnAcked data size = LastByteSent - LastByteAcked, and keeps UnAcked data size <= RcvWindow
    • When the receiver's RcvWindow = 0, the sender does not block but rather sends 1-byte segments that are ACKed by the receiver, until RcvWindow becomes bigger

Recap: TCP socket interaction
(figure: server and client columns, linked by TCP connection setup)
Server (running on hostid):
    1. create socket, port=x, for incoming request: welcomeSocket = ServerSocket()
    2. wait for incoming connection request: connectionSocket = welcomeSocket.accept()
    3. read request from connectionSocket
    4. write reply to connectionSocket
    5. close connectionSocket
Client:
    1. create socket, connect to hostid, port=x: clientSocket = Socket()
    2. send request using clientSocket
    3. read reply from clientSocket
    4. close clientSocket

TCP Connection Management
Recall: TCP sender and receiver establish a "connection" before exchanging data segments
    • initialize TCP variables:
        • seq. #s
        • buffers, flow control info (e.g. RcvWindow)
    • client: connection initiator
        Socket clientSocket = new Socket("hostname","port number");
    • server: contacted by client
        Socket connectionSocket = welcomeSocket.accept();

TCP Connection Management - connecting
Three way handshake:
Step 1: client host sends TCP SYN segment (SYN bit=1) to server
    • specifies initial seq # (client_isn)
    • no data
Step 2: server host receives SYN, replies with SYNACK segment
    • server allocates buffers
    • specifies server initial seq. # (server_isn), with ACK # = client_isn+1
Step 3: client receives SYNACK, replies with a segment carrying ACK # = server_isn+1, which may contain data
(timeline figure: client sends conn request, server replies conn granted, client sends ACK)

TCP Connection Setup Example
    09:23:33.042318 IP 128.2.222.198.3123 > 192.216.219.96.80: S 4019802004:4019802004(0) win 65535 <mss 1260,nop,nop,sackOK>
    09:23:33.118329 IP 192.216.219.96.80 > 128.2.222.198.3123: S 3428951569:3428951569(0) ack 4019802005 win 5840 <mss 1460,nop,nop,sackOK>
    09:23:33.118405 IP 128.2.222.198.3123 > 192.216.219.96.80: . ack 3428951570 win 65535
    • Client SYN: SeqC = 4019802004, window 65535, max. seg. 1260
    • Server SYN-ACK + SYN: acknowledges 4019802005 (= SeqC+1); SeqS = 3428951569, window 5840, max. seg. 1460
    • Client ACK of the server's SYN: acknowledges 3428951570 (= SeqS+1)
    • sackOK: selective acknowledgement permitted

TCP Connection Management - disconnecting
Closing a connection: the client closes the socket: clientSocket.close();
Step 1: client end system sends TCP FIN control segment (FIN bit=1) to server.
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN (FIN bit=1).
(timeline figure: client close and FIN; server ACK, close and FIN; client timed wait; closed)

TCP Connection Management (cont.)
Step 3: client receives FIN, replies with ACK.
    • Enters "timed wait": will respond with ACK to received FINs. A typical wait is 30 sec, after which all resources and ports are released.
Step 4: server receives ACK. Connection closed.
(timeline figure: client closing, server closing, client timed wait, both closed)

TCP Conn. Teardown Example
    09:54:17.585396 IP 128.2.222.198.4474 > 128.2.210.194.6616: F 1489294581:1489294581(0) ack 1909787689 win 65434
    09:54:17.585732 IP 128.2.210.194.6616 > 128.2.222.198.4474: F 1909787689:1909787689(0) ack 1489294582 win 5840
    09:54:17.585764 IP 128.2.222.198.4474 > 128.2.210.194.6616: . ack 1909787690 win 65434
    • Session: echo client on 128.2.222.198, server on 128.2.210.194
    • Client FIN: SeqC = 1489294581
    • Server ACK + FIN: Ack = 1489294582 (= SeqC+1); SeqS = 1909787689
    • Client ACK: Ack = 1909787690 (= SeqS+1)

TCP Connection Management (cont)
(two state diagrams: TCP server lifecycle and TCP client lifecycle)

Queue Management
    • Two queues for each listening socket

Concurrent Server

    pid_t pid;
    int listenfd, connfd;

    listenfd = Socket( ... );
    /* fill in sockaddr_in{} with server's well-known port */
    Bind(listenfd, ... );
    Listen(listenfd, LISTENQ);
    for ( ; ; ) {
        connfd = Accept(listenfd, ... );   /* probably blocks */
        if ( (pid = Fork()) == 0) {
            Close(listenfd);   /* child closes listening socket */
            doit(connfd);      /* process the request */
            Close(connfd);     /* done with this client */
            exit(0);           /* child terminates */
        }
        Close(connfd);         /* parent closes connected socket */
    }

Concurrent Server (Cont')
(four figures)
    (a) Status before call to accept returns
    (b) Status after return from accept
    (c) Status after return of spawning a process
    (d) Status after parent/child close appropriate sockets

TCP Summary
    • TCP properties: point to point, connection-oriented, full-duplex, reliable
    • TCP segment structure
    • How TCP sequence and acknowledgement #s are assigned
    • How TCP derives the timeout value needed for retransmissions from EstimatedRTT and DevRTT
    • TCP retransmission scenarios, ACK generation and fast retransmit
    • How TCP flow control works
    • TCP connection management: 3 segments exchanged to connect and 4 segments exchanged to disconnect
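As a closing illustration of the flow-control item in the summary, the receive-window bookkeeping can be written out in a few lines of Python. This is a sketch only: the function names and all byte counts are made up, and it models just the two formulas from the flow-control slide (RcvWindow on the receiver side, UnAcked data on the sender side).

```python
def rcv_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Receiver side: spare room = RcvBuffer - (LastByteRcvd - LastByteRead)."""
    buffered = last_byte_rcvd - last_byte_read
    assert buffered <= rcv_buffer, "receiver must not overflow its buffer"
    return rcv_buffer - buffered


def sendable_bytes(advertised_window, last_byte_sent, last_byte_acked):
    """Sender side: how much more may be sent so that UnAcked <= RcvWindow."""
    unacked = last_byte_sent - last_byte_acked
    return max(0, advertised_window - unacked)


# Hypothetical numbers: a 4096-byte buffer with 2000 bytes not yet read.
window = rcv_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)

# The sender already has 500 bytes in flight.
budget = sendable_bytes(window, last_byte_sent=3000, last_byte_acked=2500)

print(window, budget)
```

Here the receiver advertises 2096 bytes of spare room, and a sender with 500 bytes already in flight may send at most 1596 more. A zero result corresponds to the case where the sender falls back to 1-byte segments until the window reopens.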