TCP

TCP – purpose
• TCP provides reliable data transmission over an unreliable network.
• TCP provides congestion control.
• TCP provides flow control.
• TCP passes messages
  – Inputs
    • Destination address
    • Destination port
    • Source port (socket)
    • Message
  – Outputs
    • Message
    • Error reporting
If TCP reports that the message has been delivered, then we can rest assured that the receiving application has received the data. What the application does with it is another story.
At least 85% of all traffic uses TCP… but I heard that 50% of traffic in S. Korea uses UDP (gaming).
UDP
  – No flow control
  – No error reporting (little error reporting)

[Figure: protocol stack. Labels shown: BGP, FTP, HTTP, SMTP, telnet, ICMP, OSPF, TCP, UDP, IP.]

TCP header
• IP header is 20 bytes (source IP, destination IP, protocol, TTL, …)
• TCP header is 20 bytes

  Source port (16 bits)                 | Destination port (16 bits)
  Sequence # (32 bits)
  ACK # (32 bits)
  Header length (4 bits) | Reserved (6 bits) | URG ACK PSH RST SYN FIN | REC WIN (16 bits)
  CHECK SUM (16 bits)                   | Urgent ptr (16 bits)
  Options and padding

• Ports – used so a single host can have many connections at the same time. When a packet arrives, it is distinguished by the source IP, source port, destination IP, and destination port. More or less, the IPs and ports define an application.
• Sequence number – indicates the 1st byte of the data.
• ACK# is the next expected sequence number.
• Header length is in 32-bit words. 4 bits means the max size is 60 bytes. 20 bytes are used by the fixed header, so up to 40 bytes more could be in options.
• Flags
  – URG – urgent data is present and the urgent ptr is valid (e.g., ctrl-c)
  – ACK – ACK number is valid
  – PSH – the receiver should pass this data to the application as soon as possible, as opposed to waiting for a full buffer. This should be set when this packet will empty the sender's outgoing buffer, so the receiver should not wait for more data before passing it to the app. Just deliver it now.
  – RST – reset connection (something went wrong, good for detecting attacks).
  – SYN – synchronize sequence numbers
  – FIN – sender is finished sending data

Connection establishment
Node A initiates a connection with node B => node A performs an active open, node B a passive open (listen).
  source -> dest:  Send SYN            SYN=1, seq#=2197, ACK=0
  dest -> source:  Send SYN/ACK        SYN=1, seq#=197, ACK=1, ack#=2198
  source -> dest:  Send ACK (for SYN)  ACK flag=1, ack#=198, seq#=2198
The initial sequence number in the SYN depends on the implementation…

Connection establishment
• If the first SYN is dropped, then it is resent 3 seconds later. If this is dropped, it is resent after 6 seconds. And so on. The maximum waiting time is 64 seconds; it can be as high as 180 seconds, depending on the implementation.
• If the listener doesn't get an ACK, it will retransmit after 3 seconds and back off in the same way.
• But if the listener gets a data packet, the ACK flag will be set and this will end the connection establishment.
• Often during connection establishment, connection setup data is included in the options.
  – E.g., the segment size is included in the options.
  – More options are discussed later.

Connection termination
• FIN flag implies no more data will be sent from that host.
• A FIN from each side closes the connection.
• A FIN from only one side puts the connection in the half-closed state.
• Example – Node A sends first
  • A sends pkt with FIN=1 and seq#=U (A enters FIN_WAIT_1)
  • B responds with ACK and ack#=U+1 (B enters CLOSE_WAIT)
  • A receives ACK (A enters FIN_WAIT_2)
  • Now B closes
  • B sends pkt with FIN set and seq#=V (B enters LAST_ACK)
  • A responds with ACK and ack#=V+1 (A enters TIME_WAIT, stays there for 120 seconds, and then enters CLOSED)
  • B receives ACK and enters CLOSED.
• Use netstat to determine the state of the TCP connections.

Sending data
• Either side can send data.
• The sequence number indicates where the first byte is placed in the receiver buffer. The receiver responds with an ACK; the ack# indicates the next empty byte location in the buffer.
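The sequence-number and ack# rule above can be sketched with a toy receiver. This is only an illustration: the class name `Receiver` and its fields are hypothetical, and the byte values (SYN with seq#=14, then 'Steve', 'Hi', 'Bye') are taken from the example on the slides.

```python
class Receiver:
    """Toy receiver: stores bytes by sequence number and returns the ack#."""

    def __init__(self, syn_seq):
        # The SYN consumes one sequence number, so data starts at syn_seq + 1.
        self.next_expected = syn_seq + 1
        self.buffer = {}

    def receive(self, seq, data):
        # Place each byte at the buffer location given by its sequence number.
        for offset, byte in enumerate(data):
            self.buffer[seq + offset] = byte
        # The ack# is the next empty byte location after the contiguous data.
        while self.next_expected in self.buffer:
            self.next_expected += 1
        return self.next_expected


rx = Receiver(syn_seq=14)
ack_steve = rx.receive(15, "Steve")  # bytes 15..19
ack_hi = rx.receive(20, "Hi")        # bytes 20..21
ack_bye = rx.receive(22, "Bye")      # bytes 22..24
print(ack_steve, ack_hi, ack_bye)
```

Because the ack# only advances over contiguous bytes, a segment that arrives ahead of a gap leaves the ack# unchanged, which is what produces duplicate ACKs.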
Example (the SYN had seq#=14, so data starts at byte 15)
  receiver -> sender:  Seq#=1001 Ack#=20, data size = 0
                       buffer: 15 S, 16 t, 17 e, 18 v, 19 e
  sender -> receiver:  Seq#=20 Ack#=1001, Data = 'Hi', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0
                       buffer: 15 S, 16 t, 17 e, 18 v, 19 e, 20 H, 21 i
  sender -> receiver:  Seq#=22 Ack#=1001, Data = 'Bye', size = 3 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=25, data size = 0
                       buffer: 15 S, 16 t, 17 e, 18 v, 19 e, 20 H, 21 i, 22 B, 23 y, 24 e
Note: here the receiver is not sending data, so its seq num never changes and the sender's ack# never changes. But the definitions of the ACK and sequence numbers remain valid.
Note that SYN and FIN packets are special cases. No data, but the ACKs increment.

Retransmission time-out
• How to decide when a packet should be retransmitted?
• Two methods. Here we talk about the first: when the ACK has not been received in a long time, TCP assumes that the packet was dropped.
• How long is a long time…? No good solution.
RTT is the round-trip time.
SRTT is a smoothed (filtered) version of RTT.
RTTMD accounts for the variance of RTT (mean deviation).
Van Jacobson's algorithm
$SRTT_{k+1} = \alpha\, SRTT_k + (1-\alpha)\, RTT_k$
$RTTMD_{k+1} = (1-\beta)\, RTTMD_k + \beta\, |SRTT_k - RTT_k|$
$\alpha = 0.9$ or $7/8$, $\beta = 0.25$
$RTO_k = \max(SRTT_k + 4\, RTTMD_k,\ MinRTO)$
MinRTO = 200 ms in Linux, 500 ms in BSD; the RFCs say it should be 1 second.
This does not work all that well. Really, it is MinRTO that controls when time-outs occur. But more analysis is required.

RTO analysis
Suppose that the pdf of RTT is $\lambda e^{-\lambda R}$ (exponentially distributed, e.g., M/M/1 queue), so the mean is $1/\lambda$.
Mean deviation is
$\int_0^\infty |r - 1/\lambda|\, \lambda e^{-\lambda r}\, dr = \int_0^{1/\lambda} (1/\lambda - r)\, \lambda e^{-\lambda r}\, dr + \int_{1/\lambda}^\infty (r - 1/\lambda)\, \lambda e^{-\lambda r}\, dr = \frac{1}{\lambda e} + \frac{1}{\lambda e} = \frac{2}{\lambda e}$
$P(\text{timeout}) = P\left(R > \frac{1}{\lambda} + 4 \cdot \frac{2}{\lambda e}\right) = \int_{1/\lambda + 8/(\lambda e)}^\infty \lambda e^{-\lambda r}\, dr = e^{-(1 + 8/e)}$

Using the July 25, 2001 snapshot of round-trip times from the NLANR data set, we computed the empirical probability of spurious timeouts.
The total data set consists of nearly 13000 connections between 122 sites and 17.5 million round-trip time measurements. The data consisted of a time series of round-trip times for each connection, with each time series containing 1440 round-trip times (one sample per minute over the entire day).
Exponential model: $P(\text{timeout}) = e^{-(1 + 8/e)} \approx 0.019 \approx 2\%$
[Figure: empirical P(RTT > RTO) plotted against K, for K from 0 to 20; the vertical axis runs from 0 to 0.07.]

Detecting drops with triple Dup ACKs
  sender -> receiver:  Seq#=20 Ack#=1001, Data = 'Hi', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0
                       buffer: 15 to 21 hold S t e v e H i
  sender -> receiver:  Seq#=22 Ack#=1001, Data = 'Bye', size = 3 (bytes)   (dropped)
  sender -> receiver:  Seq#=25 Ack#=1001, Data = 'Wazup', size = 5 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0, Rwin=2   (1st dup ACK)
                       buffer: 25 to 29 hold W a z u p; 22 to 24 are empty
  sender -> receiver:  Seq#=30 Ack#=1001, Data = 'Give', size = 4 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0, Rwin=2   (2nd dup ACK)
  sender -> receiver:  Seq#=34 Ack#=1001, Data = 'Me', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0, Rwin=2   (3rd dup ACK)
  sender -> receiver:  Seq#=22 Ack#=1001, Data = 'Bye', size = 3 (bytes)   (retransmission)
  receiver -> sender:  Seq#=1001 Ack#=36, data size = 0, Rwin=2
                       buffer: 15 to 35 hold S t e v e H i B y e W a z u p G i v e M e

Why triple dup ACK?
• Why not one DUP ACK?
1. Bennett and Partridge, Packet reordering is not pathological network behavior, 1999. This paper showed that packet reordering can and does occur. Further research into this could be a project.
   1. The reason for the packet reordering is that the routers have parallel paths through them. So, depending on the order of arrival and the packet sizes, the incoming order will be different from the outgoing order.
   2. Supposedly this was only a problem with older model Juniper routers. There are many of these routers out there. Cisco field day!
   3. Reordering only happens when the packets arrive at nearly the same time. This might not happen that much in TCP (see ACK clocking later).
   4. However, this is an active research area.
   5.
Load balancing can cause packets to take different paths. This can cause reordering. Load balancing is a good project topic.
   6. Route flap can also cause reordering.
2. Why not a larger DUPThres (larger than 3)?
   1. This causes other problems.
   2. Limited transmit can help. See my papers on TCP-PR for details.
Using triple DUP ACKs instead of the RTO is called fast retransmit because the drop is detected faster.

Flow control – so the receiver doesn't get overwhelmed.
(SYN had seq#=14)
  sender -> receiver:  Seq#=20 Ack#=1001, Data = 'Hi', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0, Rwin=2
                       buffer: 15 S, 16 t, 17 e, 18 v, 19 e, 20 H, 21 i
  sender -> receiver:  Seq#=22 Ack#=1001, Data = 'By', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=24, data size = 0, Rwin=0
                       buffer: 15 S, 16 t, 17 e, 18 v, 19 e, 20 H, 21 i, 22 B, 23 y
  Application reads buffer
                       buffer: 24 to 31 empty
  receiver -> sender:  Seq#=1001 Ack#=24, data size = 0, Rwin=9
  sender -> receiver:  Seq#=24 Ack#=1001, Data = 'e', size = 1 (bytes)
• The amount of unacknowledged data must be no more than the receiver window.
• As the receiver's buffer fills, the receiver decreases the receiver window.

Flow control – so the receiver doesn't get overwhelmed. (window probe)
The exchange is the same up to Rwin=0. Then:
  Application reads buffer
  receiver -> sender:  Seq#=1001 Ack#=24, data size = 0, Rwin=9
  (3 s after the Rwin=0 ACK)
  sender -> receiver:  Seq#=24 Ack#=1001, no data, size = 0 (bytes)   window probe
  receiver -> sender:  Seq#=1001 Ack#=24, data size = 0, Rwin=9
  sender -> receiver:  Seq#=24 Ack#=1001, Data = 'e', size = 1 (bytes)

Flow control – so the receiver doesn't get overwhelmed.
(SYN had seq#=14; the receiver's window stays closed)
  sender -> receiver:  Seq#=20 Ack#=1001, Data = 'Hi', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=22, data size = 0, Rwin=2
  sender -> receiver:  Seq#=22 Ack#=1001, Data = 'By', size = 2 (bytes)
  receiver -> sender:  Seq#=1001 Ack#=24, data size = 0, Rwin=0
                       buffer: 15 S, 16 t, 17 e, 18 v, 19 e, 20 H, 21 i, 22 B, 23 y
  (3 s later)
  sender -> receiver:  Seq#=24 Ack#=1001, no data, size = 0 (bytes)   window probe
  receiver -> sender:  Seq#=1001 Ack#=24, data size = 0, Rwin=0
  (6 s later)
  sender -> receiver:  Seq#=24 Ack#=1001, no data, size = 0 (bytes)   window probe
• The amount of unacknowledged data must be no more than the receiver window.
• As the receiver's buffer fills, the receiver decreases the receiver window.
• Max time between probes is 60 or 64 seconds.

Receiver window
• The receiver window field is 16 bits.
• Default receiver window
  – By default, the receiver window is in units of bytes.
  – Hence 64KB is the max receiver window for any (default) implementation.
  – Ethernet packets are 1500 bytes (TCP data = 1460).
  – So that would give 44 packets.
  – If the bit-rate was 10Mbps, what is the RTT so that this window size is equal to the bandwidth delay product?
• Receiver window scale
  – During SYN, one option is receiver window scale.
  – This option provides the amount to shift the receiver window.
  – E.g., if rec win scale = 4 and rec win = 10, then the real receiver window is 10<<4 = 160 bytes.

Congestion Control
• Make sure not to overwhelm the network.
• How much data to put into the network?
• The sender maintains the congestion window (cwnd), which is the maximum number of unacknowledged packets.
• InFlight is the number of unacked packets.
• If InFlight < cwnd, then a packet can be sent.
• When an ACK arrives, InFlight decreases so another packet can be sent.
In the example that follows, cwnd = 4*MSS and MSS = 1000.
MSS is the maximum segment size = min of the segment sizes of sender and receiver. It is negotiated during SYN.
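The sending rule above (send only while InFlight < cwnd, and let each arriving ACK free one slot) can be sketched as a small simulation. The function `run_sender` and its parameters are hypothetical names used for illustration only; the values cwnd = 4*MSS, MSS = 1000 and first Seq#=20 are the ones assumed in the slide example, and each ACK is assumed to acknowledge exactly one segment.

```python
def run_sender(cwnd, n_segments, n_acks, first_seq=20, mss=1000):
    """Log (event, seq, in_flight) tuples for a sender gated by InFlight < cwnd."""
    in_flight = 0
    next_seq = first_seq
    sent = 0
    acks_left = n_acks
    log = []
    while True:
        # Send as long as there is data left and the window is not full.
        while sent < n_segments and in_flight < cwnd:
            in_flight += mss
            log.append(("send", next_seq, in_flight))
            next_seq += mss
            sent += 1
        if acks_left == 0 or in_flight == 0:
            break
        # One ACK arrives and acknowledges one segment (assumption).
        acks_left -= 1
        in_flight -= mss
        log.append(("ack", None, in_flight))
    return log


log = run_sender(cwnd=4000, n_segments=6, n_acks=2)
for event in log:
    print(event)
```

With these inputs the sender emits four segments back to back, stalls at InFlight = 4 MSS, and then sends exactly one new segment per arriving ACK.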
(suppose that cwnd = 4*MSS and MSS = 1000)
  sender -> receiver:  Seq#=20 Ack#=1001, Data = …, size = 1 MSS (bytes)     Inflight=1MSS
  sender -> receiver:  Seq#=1020 Ack#=1001, Data = …, size = 1 MSS (bytes)   Inflight=2MSS
  sender -> receiver:  Seq#=2020 Ack#=1001, Data = …, size = 1 MSS (bytes)   Inflight=3MSS
  sender -> receiver:  Seq#=3020 Ack#=1001, Data = …, size = 1 MSS (bytes)   Inflight=4MSS
  receiver -> sender:  Seq#=1001 Ack#=1020, data size = 0                    Inflight=3MSS
  sender -> receiver:  Seq#=4020 Ack#=1001, Data = …, size = 1 MSS (bytes)   Inflight=4MSS
  receiver -> sender:  Seq#=1001 Ack#=2020, data size = 0                    Inflight=3MSS
  sender -> receiver:  Seq#=5020 Ack#=1001, Data = …, size = 1 MSS (bytes)   Inflight=4MSS

ACK clocking
What is the maximum rate that ACKs can arrive at the sender?
[Figure: sender and receiver connected through three links in series: 100Mbps, 10Mbps (the bottleneck), 100Mbps.]
• Packets can leave the sender at 100Mbps.
• Packets leave the bottleneck link at a rate of 10Mbps.
• What rate do packets leave the last link? Ans: 10Mbps, they arrive at 10Mbps.
• What about the ACKs? What rate do ACKs leave the receiver? Ans: 40/1040 * 10Mbps. Or, at a rate so that if a packet is sent for each ACK, then the rate that the packets are sent is 10Mbps.
• What about the packets?
Ans: 10Mbps. Perfect!!! The ACKs arrive at the sender at the rate set by the bottleneck link, so the packets they trigger leave the sender at 10Mbps.

Congestion control
• ACK clocking makes the sender not send any faster than the bottleneck link speed.
• But how to "fill the pipe"?
[Figure: timeline of the sender alternating between sending at a "burst" rate of 10Mbps and not sending packets (wasted bandwidth).]
We only send cwnd packets in a burst. How big should cwnd be?
The number of packets sent in one RTT is the cwnd. In order to not waste bandwidth, how many packets should be sent?
  Cwnd (bytes) = link byte-rate (byte/s) * RTT (s), where the link is the bottleneck link
  Bandwidth delay product = link byte-rate (byte/s) * RTT (s)

Congestion control
• Ideally cwnd = bandwidth delay product.
• This ignores fairness. If there are N flows that also use the same link, then ideally cwnd = bandwidth delay product/N.
• But how to find this value???

TCP congestion control
• Theme: probe the system.
  – Slowly increase cwnd until there is a packet drop. That must imply that the cwnd size (or sum of window sizes) is larger than the BWDP.
  – Once a packet is dropped, decrease the cwnd. And then continue to slowly increase.
• Two phases:
  – Slow start (to get to the ballpark of the correct cwnd).
  – Congestion avoidance, to oscillate around the correct cwnd size.
[Figure: state diagram with states connection establishment, slow start, congestion avoidance, and connection termination; transitions are labelled cwnd > ssthresh, triple dup ACK, and timeout.]

Slow start
• Used when the connection first starts (and after a timeout for today's TCPs).
• Cwnd starts at 1 or 2 MSS.
• For each non-dup ACK received, the window size increases by one.
• This increase continues until the window reaches the value of ssthresh.
• The initial value of ssthresh is often large (taken as infinite). So the Rwin limits the growth of the window.

Slow start
[Figure: slow-start timeline. After the handshake (SYN: Seq#=20; SYN: Seq#=1000 Ack#=21; Seq#=21 Ack#=1001) the sender transmits 1000-byte segments (Seq#=21, 1021, 2021, …) and the receiver returns zero-length ACKs (Seq#=1001 Ack#=1021, …). The cwnd axis is marked 1 through 8, and RTT intervals are marked between bursts.]
• Cwnd doubles every RTT!!
• After a few RTTs the pipe is full!
• What is happening after that (the interval marked "RTT??")? Now the queue is filling.
Either it will fill and drop a packet, or the RecWin will stop cwnd from increasing.
• If RecWin != inf and RecWin < bandwidth delay product + queue size, and there are no other packets, then there will never be a drop. Lots of conditions, but a large number of flows do not experience drops.
• If RecWin and ssthresh are infinite and the outgoing link of the sender is not the bottleneck, then eventually there will be a drop. If the drop is detected with triple dup ACK, then cwnd = cwnd/2 and congestion avoidance is entered.
• If the drop(s) is (are) detected with a timeout, then ssthresh = cwnd/2, cwnd = 1, and slow start is continued.
• If ssthresh < bandwidth delay product + queue size and RecWin > ssthresh, then congestion avoidance is entered.

Congestion Avoidance
Basics: additive increase, multiplicative decrease (AIMD)!!
Rough view
• For every cwnd's worth of packets, cwnd is incremented by one.
• When there is a drop, cwnd = cwnd/2.
[Figure: cwnd (4, 5, 6, …) plotted against Seq# in MSS (1 to 24) for sender and receiver, showing cwnd growing by one per window and dropping to 3 after a drop.]

Rough view of TCP congestion control
[Figure: cwnd over time, alternating between slow start and congestion avoidance. Drops mark where cwnd is reduced; the point cwnd = ssthresh marks the switch from slow start to congestion avoidance.]

TCP - more detailed view
• Delayed ACKs
  – The worry was that the network was going to be all jammed up with ACKs.
  – So instead of sending an ACK for every packet, delay the ACK and maybe ACK two packets.
    • Generate an ACK for at least every other packet.
    • Don't delay an ACK by more than 500ms. (The exact number depends on the implementation.)
    • If packets are out of order, generate an ACK for every packet.
    • Also, immediately send an ACK when a "gap" in the buffer is filled.
  – Delayed ACKs can greatly slow down a connection.
    • E.g., the ACK for the first packet is delayed by 500ms.
    • Depending on the implementation, cwnd will grow more slowly.

Details - Fast recovery
• cwnd after a drop
• Recall, TCP only sends packets when InFlight < cwnd.
• InFlight only decreases when a new ACK is received, i.e., a DUP ACK does not cause InFlight to change.
  – If a DUP ACK arrives, then it means that a packet arrived at the receiver and an ACK was sent. So the number of packets in the network has decreased. So InFlight should decrease.
  – But maybe the network has duplicated the ACK. To be conservative, leave InFlight as is (I guess).

Fast recovery
• Upon the arrival of the first two DUP ACKs, do nothing. Don't send any packets (InFlight is the same).
• Upon the third DUP ACK,
  – set ssthresh = cwnd/2.
  – cwnd = cwnd/2 + 3.
  – Retransmit the requested packet.
• Upon every further DUP ACK, cwnd = cwnd + 1.
• If InFlight < cwnd, send a packet and increment InFlight.
• When a new ACK arrives, set cwnd = ssthresh (RENO).
• When an ACK arrives that ACKs all packets that were outstanding when the first drop was detected, cwnd = ssthresh (NEWRENO).

Fast recovery
[Figure: sender and receiver timelines with columns cwnd, Seq# (MSS) and Inflight, for Seq# 1 to 24 with a single drop. At the third dup ACK the figure is annotated 6 = 6/2 + 3; cwnd then grows to 7 and 8 on further dup ACKs and returns to 3 when the new ACK arrives.]

Fast recovery – multiple drops - RENO
[Figure: the same layout with two drops in one window. The first reduction is annotated 6 = 6/2 + 3, and the second 5 = 2 + 3, after which cwnd is 2.]
Why is this bad? The first drop told us that we were sending too fast. The second drop tells us the same thing (already). So why react to the same news twice… NewReno.

Fast Recovery – multiple drops - NewReno
• The problem was that one of the packets that was outstanding when the drop was detected was also dropped.
• Solution (NewReno)
  – When a drop is detected,
    • ssthresh = cwnd/2
    • cwnd = cwnd/2 + 3
    • Recover = seq# of largest byte sent.
    • Retransmit the dropped packet.
  – Upon a DUP ACK, increment cwnd and send if Inflight < cwnd.
  – If the ACK is larger than the previous ACK, but smaller than Recover (partial ACK)
    • Suppose that the previous ack# = X and now ack# = Y < Recover.
    • Retransmit the dropped packet.
    • cwnd = cwnd – (Y-X) + 1
    • Of course, Inflight = Inflight – (Y-X).
    • So transmit another packet (that makes two transmissions).
  – If ACK > Recover,
    • cwnd = ssthresh
    • Exit fast recovery.

Fast Recovery – single drops - NewReno
[Figure: timeline with columns Inflight and cwnd (both 14 before the drop), segments 16 to 21 sent, Recover=29, cwnd inflated to 17 during recovery and set to 7 on exit.]
Note how the actual number outstanding is always = 7.

Fast Recovery – multiple drops - NewReno
[Figure: the same layout with two drops and Recover=29. On the partial ACK the figure is annotated Inflight 15 = 19-4 and cwnd 16 = 19-(21-17)+1; cwnd is set to 7 on exit from fast recovery.]
• NewReno sends two packets for every ACK indicating a multiple drop.
• 2 drops take 2 RTT to recover. N drops take N RTT to recover.
• If N*RTT > RTO, then slow-and-steady => no timeout; impatient => timeout.

Other things
• Idle restart
  – If no packet has been sent in RTO seconds
    • ssthresh = cwnd
    • cwnd = 1
    • Slow start
  – Avoids big bursts after idle times
    • E.g., get data from disk
    • HTTP 1.1
• Timeout – exponential back off
  – If no ACK arrives before the RTO timer expires, then time out.
    • ssthresh = cwnd/2; cwnd = 1; slow start
    • RTO = min(2*RTO, 64s)
  – If the next packet is dropped, then the wait is longer.
  – Gives up after 9-12 tries. But this is implementation dependent (ns never stops).
• If a retransmitted packet is dropped, then TCP times out.

Dup ACKs after timeout
[Figure: NewReno timeline with Recover=29 in which a retransmission is lost. The sender eventually times out, and afterwards DUP ACKs arrive for segments sent before the timeout.]
Set send_high to the maximum seq# sent. If DUP ACKs are received for segments less than send_high, assume they do not indicate a drop. In case there was a drop, then there will be a time out.
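The timeout rule above (halve ssthresh, restart slow start, and double the RTO up to 64 s) can be sketched as follows. The function names and the `restart_cwnd` and `max_tries` parameters are hypothetical; in particular the restart window and the number of tries are implementation dependent, so the defaults here are assumptions rather than fixed values.

```python
def on_timeout(cwnd, rto, restart_cwnd=1, max_rto=64.0):
    """Return (ssthresh, cwnd, rto) after the RTO timer expires."""
    ssthresh = cwnd / 2               # remember half of the window that failed
    new_rto = min(2 * rto, max_rto)   # exponential back off, capped
    # restart_cwnd is an assumption: the restart window varies by implementation
    return ssthresh, restart_cwnd, new_rto


def backoff_schedule(initial_rto, max_rto=64.0, max_tries=9):
    """RTO value used before each successive retransmission attempt."""
    rto = initial_rto
    schedule = []
    for _ in range(max_tries):        # max_tries is an assumed give-up limit
        schedule.append(rto)
        rto = min(2 * rto, max_rto)
    return schedule


state = on_timeout(cwnd=14, rto=1.0)
schedule = backoff_schedule(3.0, max_tries=7)
print(state)
print(schedule)
```

Starting from 3 s, the waits double until they reach the 64 s cap, which matches the SYN retransmission pattern described for connection establishment.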
Selective Acknowledgment – SACK
The latest widespread congestion control
• Problem: when multiple packets are dropped, the cumulative ACK does not give information as to which packets were dropped. As a result, fast recovery is not so fast; it takes one RTT per lost packet.
• Solution: embed into the ACK some information about which packets have successfully arrived.
• TCP-SACK allows ACKs to contain information about received packets.
• If the packets are received in order, then the ACK looks the same as in TCP-RENO or TCP-NEWRENO.
• But if the packets arrive out of order, then the ACK contains SACK blocks. A SACK block indicates a sequence of segments that have been received.
[Figure: sequence-number strip from 15 to 35. Segments are marked A (ACKed), S (SACKed, in two separate runs) and N (not sent).]

TCP-SACK
[Figure: the same strip with labels for the highest ACK, the left and right edges of the 1st block, and the left and right edges of the 2nd block.]
SACK option
• SACK blocks are 8 bytes long (4 bytes for each edge).
• The SACK option includes 1 byte to specify that it is a SACK option (kind) and one byte for the option length.
• Sizes (IP header + TCP header including the option):
  1 SACK block  = 10 bytes + 2 bytes padding -> 52 bytes header
  2 SACK blocks = 18 bytes + 2 bytes padding -> 60 bytes header
  3 SACK blocks = 26 bytes + 2 bytes padding -> 68 bytes header
  4 SACK blocks = 34 bytes + 2 bytes padding -> 76 bytes header
• Max ACK is 80 bytes.
• If the time stamp option is used, then the max number of SACK blocks is 3.
Example option:
  kind=5
  length=18
  left edge of 2nd block = 26
  right edge of 2nd block = 30
  left edge of 1st block = 20
  right edge of 1st block = 23

Generation of SACKs
1. No SACK blocks if there are no out-of-order packets.
2. No delayed ACK if there are out-of-order packets (send an ACK for every received packet).
3. When an out-of-order packet arrives, the first SACK block contains the segment that just arrived.
4. The ACK should contain as many SACK blocks as fit and are required (no skimping to save bit-rate).
5. The SACK blocks included should be those that have most recently been reported (see 3). So if there are at most 3 SACK blocks, then each continuous block of segments will be reported at least 3 times.
6. If the packet that arrived has already been received (a duplicate reception), then the first SACK block should identify this packet. (This is the DSACK extension to SACK.) In this case, the next SACK block should indicate the continuous sequence of segments that contains the segments received in duplicate.
[Figure: sequence-number strip from 15 to 35 with segments marked A (ACKed), S (SACKed, in two blocks) and N (not sent), with the block edges labelled.]
Now suppose that segment 21 arrives for a second time.
SACK option
  kind=5
  length=26
  left edge of DUP packet = 21
  right edge of DUP packet = 22
  left edge of 1st block = 20
  right edge of 1st block = 23
  left edge of 2nd block = 26
  right edge of 2nd block = 30

DSACK
• DSACK is to identify packets that have been needlessly retransmitted.
• The primary source of such retransmissions is packet reordering.
• If such a retransmission occurs, it likely means that cwnd was divided by 2 needlessly.
• DSACK helps identify these needless divides by two.
• It is not clear what can be done once they are identified.
• Many ideas have been suggested, but it remains to be seen if they actually improve things.
  – Ethan Blanton, Mark Allman, On Making TCP More Robust to Packet Reordering (2002): shows that some improvement is possible.
  – Bohacek et al. show that if there is persistent reordering, more drastic measures are required.
  – Neither paper includes analysis of the current situation in the Internet.
• The current situation is not completely known.
• The homework provides backbone traces with rampant reordering.
• In my opinion (on 2/20/04) some sort of timer-based approach is necessary. The DUPACK threshold approach is not appropriate because a burst of packets (as can be seen in the homework) can be very reordered.
But reordering by more than a few milliseconds is very rare.
• A project could examine this.

Eifel Detection
• DSACK is only useful after the arrival of the second copy of the packet.
• Eifel uses time-stamps to inform the sender that a packet that was thought to have been lost has actually arrived.

TCP-SACK (Sender side)
• Slow start and the linear increase part of SACK are the same as in TCP-RENO/NEWRENO. The fast recovery part is different.
• SACK provides more information about which packets have been lost. The sender can use this to determine
  – which packets to send
  – when to send packets
• When to assume that a packet is lost
  1. If DupThresh SACK blocks (each a continuous run of segments) with larger sequence numbers have been SACKed. The idea is that DupThresh packets have been SACKed with larger sequence number, but continuous SACK blocks are counted instead.
  2. If DupThresh*MSS bytes have been SACKed that have larger sequence number.
[Figure: example with MSS = 5 bytes and DupThresh = 3, including some "little packets" of 2 bytes (packets 14 to 17, bytes 70:71, 72:73, 74:75, 76:77). Packets are marked A (ACKed), S (SACKed) or N (not sent). Annotations:
  – One packet is assumed dropped because of reasons 1 and 2: number of continuous SACK blocks with higher seq num = 4 ≥ DupThresh; number of SACKed bytes with larger seq num = 25 ≥ MSS*DupThresh.
  – One packet is assumed dropped because of reason 1 only: number of continuous SACK blocks with higher seq num = 3 ≥ DupThresh; number of SACKed bytes with larger seq num = 9 < MSS*DupThresh.
  – One packet is not assumed dropped.]

Number in "pipe" or InFlight
• If a packet has been sent, not lost, and not SACKed, then this packet is assumed to be in the pipe.
• Any packet that has been retransmitted and not SACKed is also in the pipe.
  – Retransmissions happen in order (smallest seq num first, why?).
  – Let HighRX denote the highest segment that has been retransmitted.
  – Any lost packet that has not been SACKed and has seq num less than HighRX has been retransmitted, so it is in the pipe.

Which packet to send next?
(during fast recovery)
• The next to transmit is the segment with the smallest seq num that satisfies
  1. the segment has seq num greater than HighRX (segments up to HighRX have already been retransmitted),
  2. the segment has seq num less than the largest segment in a SACK block,
  3. the segment is assumed to be lost.
[Figure: sequence-number strip from 15 to 35 with segments marked A (ACKed), S (SACKed) and N (not sent); labels show the segments already retransmitted, HighRX, and the next to be sent.]
• If the above is an empty set, then the next to be sent is the smallest segment that has not yet been sent.
• If the above is also empty (because there are no more packets to be sent), then, as the figure indicates, the next to be sent is an un-SACKed segment above HighRX.
[Figure: two further strips, one ending at the end of file, with labels for the segments already retransmitted, HighRX, and the next to be sent.]

TCP-SACK congestion control
• When a loss is detected:
  – Set RecoveryPoint = seq num of the highest segment sent. Fast recovery ends when this seq num is ACKed (SACKed is not good enough).
  – ssthresh = cwnd = Inflight/2
  – Retransmit the lost packet with the smallest seq num.
  – Set HighRX equal to the retransmitted packet.
• During recovery (until RecoveryPoint is ACKed)
  – If pipe < cwnd, then send the next to be sent.

TCP-SACK notes
• After an RTO, the TCP-SACK sender starts fresh and erases SACK info from prior to the RTO (some of it might be regained in retransmissions of SACK blocks).
• Like NEWRENO, the highest seq sent before an RTO is recorded, and a dup ACK from a packet with seq num less than this highest seq does not cause fast recovery/retransmit.
• Like NEWRENO, the retransmit timer can be reset during recovery (slow and steady) or not (impatient).

TCP-SACK timeout
[Figure: side-by-side timelines for NewReno and SACK with columns Inflight, cwnd and pkt sent (Inflight = cwnd = 14 before the drops). During recovery no more packets are sent and the sender times out.]
• SACK, NewReno, etc. will time out if a retransmission is lost.
• If SACK uses the same technique to increase cwnd as NewReno (i.e., cwnd = inflight/2 + 3…)
and if more than cwnd/2 packets are lost, SACK will time out. The ns implementation has this problem.

TCP-SACK burst
[Figure: SACK timeline with columns pipe, cwnd and pkt sent. After several drops the sender loses ACK clocking and sends a burst (segments 17, 18, 19, 20 together) before recovery ends.]
• SACK, NewReno, etc. will time out if a retransmission is lost.
• Multiple drops lead to a burst of packets being sent.

Limited Transmit
• When a packet is dropped and the window size is less than 4, TCP will always time out (not enough ACKs arrive to get triple DUP).
• If, upon receiving a DUP ACK, a packet is transmitted, then there might be enough DUP ACKs to cause fast retransmit and avoid a time-out.
• Limited transmit allows a packet to be sent when the second DUP ACK is received. (In general, for every other DUP ACK.)
• Even if a packet is lost, sending a packet for every other ACK is sending at half the bit-rate. While this helps TCP avoid time-outs, it also makes this version of TCP far more aggressive for loss probability greater than about 1% (where time-outs become quite prevalent for non-limited-transmit TCP).
[Figure: two timelines with cwnd = 3 MSS and one dropped packet. Without limited transmit only two DUP ACKs arrive and the sender times out; with limited transmit extra packets (4 and 5) are sent, a triple dup ACK occurs, and there is no time out.]

Limited Transmit
[Figure: two more timelines, with cwnd = 5 and cwnd = 4; in both cases limited transmit produces a triple dup ACK.]

ECN
• Sometimes the router will have a large enough queue to accept the packet, but the queue occupancy is beyond a threshold, so in order to try to get the TCP flows to send at a slower rate, the router would drop packets (even though there is room in the queue).
• It's funny to drop packets when there is room in the queue, so another option is to mark the packets. The receiver should indicate in the ACK that the packet being ACKed has been marked, and the sender should react to this marking as it would to a drop, except that there is no reason to retransmit the marked packet.
• This approach has little impact in general, except, like limited transmit, when the loss probability is very high, where it can reduce timeouts.
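The ECN rule above (react to a mark as to a drop, but do not retransmit) can be sketched as follows. This is a rough view only: the function `react_to_ack` and its flags are hypothetical names, and the window reduction uses the simple cwnd = cwnd/2 rule from the rough view of congestion avoidance, ignoring the +3 inflation used in fast recovery.

```python
def react_to_ack(cwnd, ssthresh, dropped=False, ecn_marked=False):
    """Return (cwnd, ssthresh, retransmit) after processing one ACK."""
    if dropped or ecn_marked:
        # A mark is treated like a drop: halve the window.
        ssthresh = cwnd / 2
        cwnd = ssthresh
    # Only a real drop needs a retransmission; a marked packet did arrive.
    retransmit = dropped
    return cwnd, ssthresh, retransmit


after_mark = react_to_ack(16, 64, ecn_marked=True)
after_drop = react_to_ack(16, 64, dropped=True)
no_signal = react_to_ack(16, 64)
print(after_mark, after_drop, no_signal)
```

The only difference between the two congestion signals in this sketch is the retransmit flag, which is why marking avoids the timeouts that a lost retransmission can cause.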