Chapter 7 Internet Transport Protocols 1 Transport Layer Our goals: understand principles behind transport layer services: multiplexing / demultiplexing data streams of several applications reliable data transfer flow control congestion control Chapter 6: rdt principles Chapter 7: multiplex/demultiplex Internet transport layer protocols: UDP: connectionless transport TCP: connection-oriented transport • connection setup • data transfer • flow control • congestion control 2 Transport vs. network layer Transport Layer: logical communication between processes exists only in hosts ignores the network port #s used for "routing" to the intended process inside the destination computer Network Layer: logical communication between hosts exists in hosts and in routers routes data through the network IP addresses used for routing in the network The transport layer uses Network layer services and adds value to these services 3 Socket Multiplexing 4 Multiplexing/demultiplexing Multiplexing at send host: gather data from multiple sockets, envelop data with headers (later used for demultiplexing), pass to L3 Demultiplexing at rcv host: receive segment from L3, deliver each received segment to the right socket [figure: three hosts, each with application/transport/network/link/physical layers; processes P1..P4 attached to sockets in hosts 1-3] 5 How demultiplexing works host receives IP datagrams each datagram has source IP address, destination IP address in its header • used by the network to get it there each datagram carries one transport-layer segment each segment has source, destination port number in its header host uses port #s(*) to direct the segment to the correct socket from the socket, data gets to the relevant application process (*) to find a TCP socket on the server, source & dest. IP addresses are also needed, see details later [figure: datagram layout, 32 bits wide — L3 header: source IP addr, dest IP addr, other IP header fields; inside it the L4 (TCP/UDP) segment: source port #, dest port #, other header fields, application data (the appl. msg)] 6 Connectionless demultiplexing (UDP) Processes create sockets with port numbers a UDP socket is identified by a pair of numbers: (my IP address, my port number) Client decides to contact: a server (peer IP-address) and an application (well-known port, "WKP") puts those into the UDP packet sent, written as: dest IP address - in the IP header of the packet dest port number - in its UDP header When server receives a UDP segment: checks destination port number in segment directs UDP segment to the socket with that port number • single server socket per application type • (packets from different remote sockets directed to same socket) msg waits in socket queue and is processed in its turn answer sent to the client socket (listed in the Source fields of the query packet) Realtime UDP applications have individual server sockets per client. Their port numbers are nevertheless distinct, since they are coordinated in advance by some signaling protocol. This is possible since the port number is not used to specify the application.
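The connectionless demultiplexing just described can be sketched with two UDP sockets on the loopback interface. This is an illustrative sketch only: both ends run in one process, and the ports are chosen by the OS rather than being well-known ports like 53.

```python
import socket

# Minimal sketch of UDP demultiplexing on the loopback interface.
# A real server would bind a well-known port; here the OS picks the ports.

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))          # one socket per application type
server.settimeout(5)
server_addr = server.getsockname()     # (IP, port) the client must target

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))          # client port identifies this process
client.settimeout(5)

client.sendto(b"query", server_addr)   # dest IP in IP header, dest port in UDP header

# The OS demultiplexes by destination port; recvfrom() also exposes the
# source (IP, port) pair -- the "return address" for the reply.
data, peer = server.recvfrom(2048)
server.sendto(b"reply:" + data, peer)

reply, _ = client.recvfrom(2048)
print(reply.decode())                  # prints reply:query
```

Note that the server never needs to know the client's address in advance: the Source fields of the query packet are enough to route the answer back.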
7 Connectionless demux (cont) SP = Source Port number DP = Destination Port number S-IP = Source IP Address D-IP = Destination IP Address [figure: clients at IP A (client socket port 9157) and IP B (client socket port 5775) getting service from one server socket (port 53, IP C); each packet carries S-IP and D-IP in the IP header and SP and DP in the UDP header, e.g. a query from A has S-IP: A, D-IP: C, SP: 9157, DP: 53, and the reply message reverses them] SP and S-IP provide the "return address" 8 Connection-oriented demux (TCP) TCP socket identified by 4-tuple: local (my) IP address local (my) port number remote (peer) IP address remote (peer) port # host receiving a packet uses all four values to direct the segment to the appropriate socket Server host may support many simultaneous TCP sockets: each socket identified by its own 4-tuple Web server dedicates a different socket to each connecting client If you open two browser windows, you generate 2 sockets at each end 9 Connection-oriented demux (cont) "L" = Local = My "R" = Remote = Peer LP = Local Port, RP = Remote Port L-IP = Local IP, R-IP = Remote IP [figure: clients at IP A and IP B connect to Web server C; client sockets (LP=9157, L-IP=A, RP=80, R-IP=C), (LP=9157, L-IP=B, RP=80, R-IP=C) and (LP=5775, L-IP=B, RP=80, R-IP=C) map to three distinct server sockets, all with LP=80, L-IP=C but distinguished by (RP, R-IP): (9157, A), (9157, B), (5775, B); segments with the same DP=80 are demultiplexed by the full 4-tuple] 10 Connection-oriented Sockets Client socket has a port number unique in its host packet for a client socket is directed by the host OS based on dest. port only each server application has an always-active waiting socket; that socket receives all packets not belonging to any established connection these are packets that open new connections when the waiting socket accepts a 'new connection' segment, a new socket is generated at the server with the same port number this is the working socket for that connection next segments arriving at the server on that connection will be directed to the working socket the socket is identified using all 4 identifiers the last slide shows working sockets on the server side Note: Client IP + Client Port are globally unique 11 UDP Protocol 12 UDP: User Datagram Protocol [RFC 768] simple transport protocol "best effort" service, UDP segments may be: lost delivered out of order to application with no correction by UDP UDP will discard bad-checksum segments if so configured by the application connectionless: no handshaking between UDP sender, receiver each UDP segment handled independently of others Why is there a UDP? no connection establishment saves delay no congestion control: better delay & BW simple: less memory & RT small segment header typical usage: realtime appl.
loss tolerant rate sensitive other uses (why?): DNS SNMP 13 UDP segment structure [figure: segment layout, 32 bits wide — source port # | dest port #, then length | checksum, then application data (variable length); length = total length of segment in bytes] Checksum computed over: • the whole segment, plus • part of the IP header: – both IP addresses – protocol field – total IP packet length Checksum usage: • computed at destination to detect errors • on error, discard segment • checksum is optional • if not used, sender puts checksum = all zeros • if the computed checksum is zero, sender puts all ones 14 TCP Protocol 15 TCP: Overview RFCs: 793, 1122, 1323, 2018, 2581 point-to-point: one sender, one receiver works between sockets reliable, in-order byte stream: no "message boundaries" pipelined: TCP congestion and flow control set window size send & receive buffers [figure: application writes data through the socket door into the TCP send buffer; segments carry it to the TCP receive buffer, from which the application reads data] full duplex data: bi-directional data flow in same connection MSS: maximum segment size connection-oriented: handshaking (exchange of control msgs) init's sender, receiver state before data exchange flow controlled: sender will not overwhelm receiver 16 TCP segment structure [figure: segment layout, 32 bits wide — source port # | dest port #; sequence number (counting by bytes of data, not segments!); acknowledgement number; hdr length in 32-bit words, unused bits, flag bits U A P R S F; rcvr window size (# bytes rcvr willing to accept); checksum (Internet checksum, as in UDP); ptr to urgent data; Options (variable length); application data (variable length)] FLAGS: ACK: ACK # valid URG: urgent data in this segment, ptr = end of urgent data PSH: push data to application immediately (PSH, URG seldom used, not clearly defined) SYN: initialize conn., synchronize SN FIN: I wish to disconn. RST: break conn. immediately 17 TCP sequence # (SN) and ACK # (AN) SN: byte-stream "number" of first byte in segment's data AN: SN of next byte expected from other side it's a cumulative ACK Qn: how does the receiver handle out-of-order segments?
puts them in the receive buffer but does not acknowledge them [figure: simple data transfer scenario, some time after conn. setup — host A sends 100 data bytes; host B ACKs the 100 bytes and sends 50 data bytes; host A ACKs receipt of the data and sends no data. WHY?] 18 Connection Setup: Objective Agree on initial sequence numbers a sender should not reuse a seq# before it is sure that all packets with that seq# are purged from the network • the network guarantees that a packet too old will be purged: the network bounds the lifetime of each packet to avoid waiting for them to disappear, choose the initial SN (ISN) far away from the previous session • needs connection setup so that the sender tells the receiver the initial seq# Agree on other initial parameters e.g. Maximum Segment Size 19 TCP Connection Management Setup: establish connection between the hosts before exchanging data segments called: 3-way handshake initialize TCP variables: seq. #s buffers, flow control info (e.g. RcvWindow) client: connection initiator opens socket and commands OS to connect it to server server: contacted by client has waiting socket accepts connection generates working socket Teardown: end of connection (we skip the details) Three-way handshake: Step 1: client host sends TCP SYN segment to server specifies initial seq # (ISN) no data Step 2: server host receives SYN, replies with SYNACK segment (also no data) allocates buffers specifies server initial SN & window size Step 3: client receives SYNACK, replies with ACK segment, which may contain data 20 TCP Three-Way Handshake (TWH) [figure: A sends SYN with ISN X; B answers SYNACK with ISN Y, expecting X+1; A replies ACK expecting Y+1; each side records X+1 / Y+1 against its send and receive buffers] 21 Connection Close Objective of closure handshake: each side can release resources and remove state about the connection • close the socket [figure: client initiates the close; client and server each close in turn and release their resources] 22 TCP reliable data transfer TCP creates reliable service on top of IP's unreliable service pipelined segments cumulative acks single retransmission timer receiver accepts out-of-order segments but does not acknowledge them retransmissions are triggered by timeout events in some versions of TCP also by triple duplicate ACKs (see later) initially consider a simplified TCP sender: ignore flow control, congestion control 7-23 TCP sender events: data rcvd from app: create segment with seq # seq # is byte-stream number of first data byte in segment start timer if not already running (timer relates to oldest unACKed segment) expiration interval: TimeOutInterval timeout (*): retransmit segment that caused timeout restart timer ACK rcvd: if ACK acknowledges previously unACKed segments update what is known to be ACKed Note: ACK is cumulative start timer if there are outstanding segments (*) retransmission done also on triple duplicate ACK (see later) 7-24 TCP sender (simplified): NextSeqNum = InitialSeqNum SendBase = InitialSeqNum loop (forever) { switch(event) event: data received from application above if (NextSeqNum - SendBase < N) then { create TCP segment with sequence number NextSeqNum if (timer currently not running) start timer pass segment to IP NextSeqNum = NextSeqNum + length(data) } else reject data /* in truth: keep in send buffer until new Ack */ event: timer timeout retransmit not-yet-acknowledged segment with smallest sequence number start timer event: ACK received, with ACK field value of y if (y > SendBase) { SendBase = y if (there are currently not-yet-acknowledged segments) start timer } } /* end of loop forever */ Comment: • SendBase-1: last cumulatively ACKed byte Example: • SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so new data is ACKed Transport Layer 7-25 TCP actions on receiver events: data rcvd from IP: if checksum fails, ignore segment if checksum OK, then: if data
came in order: update AN & WIN as follows: AN grows by the number of new in-order bytes WIN decreases by the same # if data out of order: put in buffer, but don't count it for AN/WIN application takes data: free the room in buffer give the freed cells new numbers (circular numbering) WIN increases by the number of bytes taken 7-26 TCP: retransmission scenarios [figures: A. normal scenario — A starts the timer for SN 92, B's ACK arrives, A stops the timer, then starts a timer for SN 100; B. lost ACK + retransmission — B's ACK is lost, A's timer for SN 92 expires, A retransmits SN 92 and restarts the timer] 7-27 TCP retransmission scenarios (more) [figures: C. lost ACK, NO retransmission — the first ACK is lost, but a later cumulative ACK arrives before the timeout, so A simply stops the timer; D. premature timeout — A's timer for SN 92 expires although the ACK is only delayed; B drops the retransmitted segment and sends a redundant ACK] 7-28 TCP ACK generation (Receiver rules) [RFC 1122, RFC 2581] Event at Receiver -> TCP Receiver action: Arrival of in-order segment with expected seq #, all data up to expected seq # already ACKed -> Delayed ACK: wait up to 500 ms for next segment; if no data segment to send by then, send ACK Arrival of in-order segment with expected seq #, one other segment has ACK pending -> Immediately send single cumulative ACK, ACKing both in-order segments Arrival of out-of-order segment with higher-than-expected seq #, gap detected -> Immediately send duplicate ACK, indicating seq # of next expected byte; this ACK carries no data & no new WIN Arrival of segment that partially or completely fills gap -> Immediately send ACK, provided that segment starts at lower end of 1st gap Transport Layer 7-29 Fast Retransmit (Sender Rules) time-out period often relatively long: causes long delay before resending a lost packet idea: detect lost segments via duplicate ACKs.
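The receiver rules above — cumulative ACKs only, out-of-order data buffered but not acknowledged — can be sketched as a toy model. This is a sketch only: segment sizes are invented, delayed ACKs are ignored, and byte numbering starts at 0 for simplicity.

```python
# Toy model of the TCP receiver rules: the ACK number advances only over
# contiguous in-order data; out-of-order segments are held in a buffer.

def make_receiver():
    state = {"expected": 0, "buffer": {}}   # buffer: seq -> length
    def on_segment(seq, length):
        if seq == state["expected"]:
            state["expected"] += length
            # buffered segments may now extend the in-order prefix
            while state["expected"] in state["buffer"]:
                state["expected"] += state["buffer"].pop(state["expected"])
        elif seq > state["expected"]:
            state["buffer"][seq] = length   # gap: hold, but don't ACK
        return state["expected"]            # cumulative ACK number
    return on_segment

rcv = make_receiver()
acks = [rcv(0, 100),    # in order              -> ACK 100
        rcv(200, 100),  # gap at 100..199       -> duplicate ACK 100
        rcv(300, 100),  # still a gap           -> duplicate ACK 100
        rcv(100, 100)]  # fills the gap         -> cumulative ACK 400
print(acks)             # [100, 100, 100, 400]
```

The run shows exactly the pattern fast retransmit relies on: a lost segment produces a run of duplicate ACKs, and filling the gap produces one large cumulative jump.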
sender often sends many segments back-to-back if a segment is lost, there will likely be many duplicate ACKs for that segment Rule: if sender receives 4 ACKs for the same data (= 3 duplicates), it assumes that the segment after the ACKed data was lost: fast retransmit: resend segment immediately (before timer expires) Transport Layer 7-30 Fast Retransmit scenario [figure: Host A sends segments x1..x5; x2 is lost, so Host B answers x3, x4, x5 with duplicate ACK # x2 (no data in segment, no window change); on the triple duplicate ACKs, A resends x2 before the timeout expires] time Transport Layer 7-31 Fast retransmit algorithm: event: ACK received, with ACK field value of y if (y > SendBase) { SendBase = y if (there are currently not-yet-acknowledged segments) start timer } else { if (segment carries no data & doesn't change WIN) increment count of dup ACKs received for y if (count of dup ACKs received for y == 3) { resend segment with sequence number y count of dup ACKs received for y = 0 } } a duplicate ACK for already-ACKed data triggers fast retransmit 7-32 Transport Layer TCP: setting timeouts 33 General idea Q: how to set the TCP timeout interval? it should be longer than the average RTT but: RTT will vary if too short: premature timeout, unnecessary retransmissions if too long: slow reaction to segment loss Set timeout = average + safe margin 34 Estimating Round Trip Time SampleRTT: measured time from segment transmission until receipt of ACK for it SampleRTT will vary, want a "smoother" estimated RTT: use several recent measurements, not just the current SampleRTT [figure: SampleRTT and EstimatedRTT in milliseconds vs. time in seconds, gaia.cs.umass.edu to fantasia.eurecom.fr] EstimatedRTT = (1-α)*EstimatedRTT + α*SampleRTT exponential weighted moving average influence of past samples decreases exponentially fast typical value: α = 0.125 35 Setting Timeout Problem: using only the average of SampleRTT will generate many timeouts due to network variations Solution:
EstimatedRTT plus "safety margin" large variation in EstimatedRTT -> requires larger safety margin estimate the average deviation of RTT: DevRTT = (1-β)*DevRTT + β*|SampleRTT-EstimatedRTT| (typically, β = 0.25) Then set the timeout interval: TimeoutInterval = EstimatedRTT + 4*DevRTT 36 TCP: Flow Control 37 TCP Flow Control: Simple Case TCP at A sends data to B [figure: the TCP receive buffer at B — data taken by the application, then data in buffer up to AN, then spare room of WIN bytes; data from IP (sent by TCP at A) enters the buffer, the receive process at B drains it] application process at B may be slow at reading from buffer sender won't overflow receiver's buffer by transmitting too much, too fast flow control matches the send rate of A to the receiving application's drain rate at B receive buffer size set by OS at connection init WIN = window size = number of bytes A may send starting at AN 7-38 TCP Flow Control: General Case [figure: receive buffer — data taken by application, then ACKed data in buffer, then spare room, possibly holding non-contiguous out-of-order data; AN marks the first byte not yet received in sequence] Procedure: AN = first byte not received yet, sent to A in the TCP header non-ACKed data in the buffer (arrived out of order) is ignored rcvr advertises the "spare room" by including the value of WIN in his segments sender A is allowed to send at most WIN bytes in the range starting with AN Formulas: AckedRange = AN – FirstByteNotReadByAppl = # bytes received in sequence & not taken WIN = RcvBuffer – AckedRange = "SpareRoom" this guarantees that the receive buffer doesn't overflow AN and WIN are sent to A in the TCP header data received out of sequence is considered part of the 'spare room' range 7-39 TCP flow control – example 1 7-40 TCP flow control – example 2 7-41 TCP: Congestion Control 42 TCP Congest'n Ctrl Overview (1) Closed-loop, end-to-end, window-based congestion control Designed by Van Jacobson in late 1980s, based on the AIMD algorithm of Dah-Ming Chiu and Raj Jain Works well so far: the bandwidth of the Internet has increased by more than 200,000 times Many
versions TCP-Tahoe: this is a less optimized version TCP-Reno: many OSs today implement Reno-type congestion control TCP-Vegas: not currently used For more details: see Stevens: TCP/IP Illustrated; K-R chapter 6.7, or read: http://lxr.linux.no/source/net/ipv4/tcp_input.c for the linux implementation 43 TCP Congest'n Ctrl Overview (2) Dynamic window size [Van Jacobson] Initialization: MI (Multiplicative Increase) • Slow start Steady state: AIMD (Additive Increase / Multiplicative Decrease) • Congestion Avoidance "Congestion is timeout || 3 duplicate ACK" TCP Tahoe: treats both cases identically TCP Reno: treats each case differently "Congestion = (also) higher latency" TCP Vegas 44 General method sender limits rate by limiting the number of unACKed bytes "in pipeline": LastByteSent - LastByteAcked ≤ cwnd cwnd: differs from WIN (how, why?) sender limited by ewnd ≡ min(cwnd, WIN) (effective window) roughly, rate = ewnd/RTT bytes/sec cwnd is dynamic, a function of perceived network congestion [figure: sender emits cwnd bytes, then waits one RTT for the ACK(s)] Transport Layer 7-45 The Basic Two Phases [figure: cwnd in MSS units vs. time — Slow start: Multiplicative Increase, then Congestion avoidance: Additive Increase] 46 Pure AIMD: Bandwidth Probing Principle "probing for bandwidth": increase transmission rate on receipt of ACKs, until eventually loss occurs, then decrease transmission rate continue to increase on ACK, decrease on loss (since available bandwidth is changing, depending on other connections in the network) [figure: sending rate vs. time — TCP's "sawtooth" behavior: slow additive increase (AI) while ACKs are received, fast multiplicative decrease (MD) at each loss X; this model ignores Slow Start] Q: how fast to increase/decrease?
details to follow Transport Layer 7-47 TCP Slowstart: MI * used in all TCP versions Slowstart algorithm: initialize: cwnd = 1 MSS for (each segment ACKed) cwnd += MSS (*) until (congestion event OR cwnd ≥ threshold) On congestion event: { Threshold = cwnd/2 cwnd = 1 MSS } (*) cwnd is doubled per RTT: • exponential increase in window size (very fast!) • therefore slowstart lasts a short time [figure: Host A and Host B exchanging segments, the window doubling each RTT] 48 TCP: congestion avoidance (CA) when cwnd > ssthresh grow cwnd linearly: as long as all ACKs arrive, increase cwnd by ≈1 MSS per RTT approach possible congestion slower than in slowstart implementation: cwnd += MSS*MSS/cwnd for each ACK received AIMD ACKs: increase cwnd by 1 MSS per RTT: additive increase loss(*): cut cwnd in half: multiplicative decrease true in the macro picture; the actual algorithm may run Slow Start first to grow up to this value (+) (*) = Timeout or 3 Duplicate ACKs (+) depends on case & TCP type Transport Layer 7-49 TCP Tahoe Initialize with SlowStart state with cwnd = 1 MSS When cwnd ≥ ssthresh change to CA state When sensing congestion(*): set ssthresh = ewnd/2 (+) set cwnd = 1 MSS change state to SlowStart (*) Timeout or Triple Duplicate Ack (+) recall ewnd = min(cwnd, WIN); in our discussion here we assume that WIN > cwnd, so ewnd = cwnd [figure: Tahoe sawtooth — SSt then CA, dropping to 1 MSS on T/O or 3 Dup; MI in SSt, AI in CA, MD at each loss] 50 TCP Reno Rationale: a triple duplicate event shows less congestion than a timeout: the first segment was probably lost, but some others arrived therefore on 3Dup, cwnd is decreased to ewnd/2, less aggressive than on T/O TCP Reno Procedure (an approximate description; more details two slides below): Initialize with SlowStart Slowstart as in Tahoe CA growth as in Tahoe On T/O, act as in Tahoe On Triple Duplicate, set ssthresh = ewnd/2, enter Fast Recovery state, skipping the SlowStart stage this is a temporary state until a non-Dup Ack arrives when Fast Recovery ends, set: cwnd = ssthresh Transport Layer 7-51 Fast Recovery Rationale: cwnd increases only when a new segment is Ack'ed in the 3 Dup situation, it may take time until such an Ack arrives until that time: we increase cwnd on the arrival of each duplicate Ack, including the three that triggered Fast Retransmit Fast Recovery State: initialize: cwnd += 3 MSS on each additional duplicate Ack, increase cwnd by MSS when a new Ack arrives, set cwnd = ssthresh recall that ssthresh was set to half of the last ewnd value in CA state Transport Layer 7-52 TCP Reno cwnd Trace [figure: congestion window vs. time — Slow Start periods, CA additive-increase ramps, a fast retransmission on triple duplicate Ack (fast recovery stage, Slow Start skipped), and timeouts dropping cwnd to 1; the threshold halves at each congestion event] 53 TCP Reno Cong. Ctrl State Transition Diagram [diagram: states slow start, congestion avoidance, fast recovery; slow start -> congestion avoidance on cwnd > ssthresh; every state -> slow start on loss: timeout; slow start and congestion avoidance -> fast recovery on loss: 3dupACK; fast recovery -> congestion avoidance on new ACK] Transport Layer 7-54 TCP Reno Congestion Control FSM INIT: cwnd = 1 MSS ssthresh = 64 KB dupACKcount = 0 slow start: new ACK: cwnd = cwnd + MSS dupACKcount = 0 transmit new segment(s), as allowed duplicate ACK: dupACKcount++ check == 3? timeout: ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0 retransmit missing segment cwnd > ssthresh: go to congestion avoidance congestion avoidance: new ACK: cwnd = cwnd + MSS*(MSS/cwnd) dupACKcount = 0 transmit new segment(s), as allowed timeout: ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0 retransmit missing segment duplicate ACK: dupACKcount++ check == 3? in either state, when dupACKcount == 3: ssthresh = cwnd/2 cwnd = ssthresh + 3 MSS retransmit missing segment, enter fast recovery
fast recovery: new ACK: cwnd = ssthresh dupACKcount = 0 return to congestion avoidance duplicate ACK: cwnd = cwnd + MSS transmit new segment(s), as allowed timeout: ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0 retransmit missing segment, return to slow start Transport Layer 7-55 Popular "flavors" of TCP [figure: cwnd window size (in segments) vs. transmission round — after a loss TCP Tahoe drops to 1 and slow-starts, TCP Reno drops only to ssthresh] Transport Layer 7-56 Summary: TCP Reno Congestion Control when cwnd < ssthresh, sender is in slow-start phase, window grows exponentially. when cwnd >= ssthresh, sender is in congestion-avoidance phase, window grows linearly. when triple duplicate ACK occurs, ssthresh set to cwnd/2, cwnd eventually set to ~ssthresh (after a detour through the Fast Recovery state) when timeout occurs, ssthresh set to cwnd/2, cwnd set to 1 MSS. Transport Layer 7-57 TCP throughput Q: what's the average throughput of TCP as a function of window size and RTT? ignoring slow start let W be the window size when loss occurs. when the window is W, throughput is W/RTT just after loss, the window drops to W/2, throughput to W/2RTT, then grows linearly so average throughput: 0.75 W/RTT Transport Layer 7-58 TCP Fairness fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K [figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R] Transport Layer 7-59 Why is TCP fair?
Two competing sessions: (Tahoe, Slow Start ignored) additive increase gives slope of 1 as throughput increases multiplicative decrease decreases throughput proportionally [figure: Connection 2 throughput vs. Connection 1 throughput, both axes 0..R, with the equal-bandwidth-share line; congestion avoidance: additive increase, loss: decrease window by factor of 2] starting at point (a,b), additive increase moves along the line y = x+(b-a) to (a+t, b+t) a loss halves both coordinates to ((a+t)/2, (b+t)/2), which lies on y = x+(b-a)/2 the next AI/MD cycle, through (a/2+t/2+t1, b/2+t/2+t1), ends on y = x+(b-a)/4, and so on: the gap between the two throughputs halves with every multiplicative decrease, so the operating point converges to the equal-bandwidth-share line Transport Layer 7-60 Fairness (more) Fairness and UDP multimedia apps often do not use TCP do not want rate throttled by congestion control instead use UDP: pump audio/video at constant rate, tolerate packet loss Fairness and parallel TCP connections nothing prevents an appl. from opening parallel connections between two hosts. web browsers do this example: link of rate R already supporting 9 connections; new app asks for 1 TCP, gets rate R/10 new app asks for 11 TCPs, gets > R/2 !! Transport Layer 7-61 Extra Slides 62 Exercise MSS = 1000 Only one event per row Transport Layer 7-63
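The Reno summary above can be sketched as a per-RTT simulation, which is also a convenient way to check answers to the exercise. This is a sketch under simplifying assumptions: cwnd is counted in whole MSS units (so MSS = 1000 bytes scales the result), one event happens per RTT, ewnd = cwnd (WIN is ignored), and the cwnd inflation inside fast recovery is not modeled — a triple duplicate ACK jumps straight to cwnd = ssthresh, as in the summary slide.

```python
# Per-RTT sketch of TCP Reno's congestion window (in MSS units):
# slow start doubles cwnd each RTT, congestion avoidance adds 1 MSS per RTT,
# a triple-duplicate ACK halves ssthresh and resumes CA at cwnd = ssthresh,
# a timeout resets cwnd to 1 MSS. Event timings below are invented.

def reno_trace(events, ssthresh=8):
    cwnd, trace = 1, []
    for ev in events:
        if ev == "3dup":
            ssthresh = max(cwnd // 2, 1)
            cwnd = ssthresh              # after the fast recovery detour
        elif ev == "timeout":
            ssthresh = max(cwnd // 2, 1)
            cwnd = 1                     # back to slow start
        else:                            # "ack": one RTT of ACKs
            cwnd = cwnd * 2 if cwnd < ssthresh else cwnd + 1
        trace.append(cwnd)
    return trace

print(reno_trace(["ack"] * 6 + ["3dup"] + ["ack"] * 2 + ["timeout", "ack"]))
# → [2, 4, 8, 9, 10, 11, 5, 6, 7, 1, 2]
```

The trace shows the sawtooth of the cwnd-trace slide: exponential growth to ssthresh, linear growth beyond it, a halving on triple duplicate ACK, and a collapse to 1 MSS on timeout.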