Transport Layer Our goals: • understand principles behind transport layer services: • learn about transport layer protocols in the Internet: – UDP: connectionless transport – TCP: connection-oriented transport – TCP congestion control – multiplexing/demulti plexing – reliable data transfer – flow control – congestion control Ref: slides by J. Kurose and K. Ross Xin Liu 1 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 2 3 Transport services and protocols • provide logical communication between app processes running on different hosts • transport protocols run in end systems – send side: breaks app messages into segments, passes to network layer – rcv side: reassembles segments into messages, passes to app layer • more than one transport protocol available to apps – Internet: TCP and UDP application transport network data link physical Xin Liu network data link physical network data link physical network data link physical network data link physical network data link physical application transport network data link physical Transport vs. network layer • network layer: logical communication between hosts • transport layer: logical communication between processes – relies on, enhances, network layer services Xin Liu Household analogy: 12 kids sending letters to 12 kids • processes = kids • app messages = letters in envelopes • hosts = houses • transport protocol = Ann and Bill • network-layer protocol = postal service 4 Internet transport-layer protocols • reliable, in-order delivery (TCP) – congestion control – flow control – connection setup application transport network data link physical network data link physical network data link physical network data link physical • unreliable, unordered delivery: UDP network data link physical network data link physical – no-frills extension of “besteffort” IP application transport network data link physical • services not available: – delay guarantees – bandwidth guarantees Xin Liu 5 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 6 7 Multiplexing/demultiplexing Multiplexing at send host: gathering data from multiple sockets, enveloping data with header (later used for demultiplexing) Demultiplexing at rcv host: delivering received segments to correct socket = socket application transport network link = process P3 P1 P1 application transport network P2 P4 application transport network link link physical host 1 physical host 2 Xin Liu physical host 3 8 How demultiplexing works • host receives IP datagrams – each datagram has source IP address, destination IP address – each datagram carries 1 transport-layer segment – each segment has source, destination port number (recall: well-known port numbers for specific applications) • host uses IP addresses & port numbers to direct segment to appropriate socket 32 bits source port # dest port # other header fields application data (message) TCP/UDP segment format Xin Liu Connectionless demultiplexing • Create sockets : 9 • When host receives UDP 
segment:
– checks destination port number in segment
– directs UDP segment to the socket with that port number
• IP datagrams with different source IP addresses and/or source port numbers are directed to the same socket
• UDP socket identified by two-tuple: (dest IP address, dest port number)
Creating and using the socket:
    sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));
    sendto(sock, buffer, size, 0, (struct sockaddr *)&dest, sizeof(dest));
    recvfrom(sock, buffer, buffersize, 0, (struct sockaddr *)&src, &srclen);
Xin Liu 9
Connection-oriented demux
• TCP socket identified by 4-tuple:
– source IP address
– source port number
– dest IP address
– dest port number
• recv host uses all four values to direct segment to appropriate socket
• Server host may support many simultaneous TCP sockets:
– each socket identified by its own 4-tuple
• Web servers have different sockets for each connecting client
– non-persistent HTTP will have a different socket for each request
Xin Liu 10
Outline
• Transport-layer services
• Multiplexing and demultiplexing
• Connectionless transport: UDP
• Principles of reliable data transfer
• Connection-oriented transport: TCP
– segment structure
– reliable data transfer
– flow control
– connection management
• Principles of congestion control
• TCP congestion control
Xin Liu 11
Port
• Well-known ports
• Port mapper
• Port is an abstraction
– Implementation depends on OS.
• Q: can we use pid (process id) instead of port number?
Xin Liu 12
UDP: User Datagram Protocol [RFC 768]
• “no frills,” “bare bones” Internet transport protocol
• “best effort” service, UDP segments may be:
– lost
– delivered out of order to app
• connectionless:
– no handshaking between UDP sender, receiver
– each UDP segment handled independently of others
Why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small segment header
• no congestion control: UDP can blast away as fast as desired
Xin Liu 13
DHCP client-server scenario (DHCP server: 223.1.2.5, arriving client)
DHCP discover: src: 0.0.0.0, 68; dest: 255.255.255.255, 67; yiaddr: 0.0.0.0; transaction ID: 654
DHCP offer: src: 223.1.2.5, 67; dest: 255.255.255.255, 68; yiaddr: 223.1.2.4; transaction ID: 654; lifetime: 3600 secs
DHCP request: src: 0.0.0.0, 68; dest: 255.255.255.255, 67; yiaddr: 223.1.2.4; transaction ID: 655; lifetime: 3600 secs
DHCP ACK: src: 223.1.2.5, 67; dest: 255.255.255.255, 68; yiaddr: 223.1.2.4; transaction ID: 655; lifetime: 3600 secs
Xin Liu 14
Applications and protocols (application - application-layer protocol - transport protocol)
• E-mail - SMTP - TCP
• Remote terminal access - Telnet - TCP
• Web - HTTP - TCP
• File transfer - FTP - TCP
• Streaming - proprietary - typically UDP
• IP-phone - proprietary - typically UDP
• Routing - RIP - typically UDP
• Name translation - DNS - typically UDP
• Dynamic IP - DHCP - typically UDP
• Network management - SNMP - typically UDP
Xin Liu 15
UDP: more
• often used for streaming multimedia apps
– loss tolerant
– rate sensitive
• reliable transfer over UDP: add reliability at application layer
– application-specific error recovery!
UDP segment format (32 bits wide): source port #, dest port #, length, checksum, then application data (message); the length field is the length in bytes of the UDP segment, including the header
Xin Liu 16
Checksum
• Goal: detect “errors” (e.g., flipped bits) in transmitted segment
• covers UDP header and data, and part of the IP header
• Pseudo header
– source/dest IP address
– protocol, & UDP length field (again!)
– Q: why include the pseudo header, which is not a part of the UDP segment?
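A minimal C sketch of the Internet checksum described on the next slides, assuming the caller has laid out the pseudo header, UDP header (with the checksum field zeroed), and data in one contiguous buffer; the function name inet_checksum and that buffer layout are illustrative, not from the slides:

    #include <stdint.h>
    #include <stddef.h>

    /* 16-bit one's-complement sum of the buffer, then complemented */
    uint16_t inet_checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {                        /* add successive 16-bit words */
            sum += (uint32_t)((buf[0] << 8) | buf[1]);
            buf += 2;
            len -= 2;
        }
        if (len == 1)                            /* odd trailing byte: pad with zero */
            sum += (uint32_t)(buf[0] << 8);
        while (sum >> 16)                        /* fold carries back into the low 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;                   /* one's complement of the sum */
    }

The sender writes the result into the checksum field; the receiver runs the same sum over the received bytes, checksum field included, and detects an error when the folded one's-complement sum is not all 1s (equivalently, when inet_checksum over the whole segment is non-zero).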
• Same procedure for TCP Xin Liu 17 UDP checksum Sender: Receiver: • treat segment contents as sequence of 16-bit integers • checksum: addition (1’s complement sum) of segment contents • sender puts checksum value into UDP checksum field • compute checksum of received segment • check if computed checksum equals checksum field value: – NO - error detected – YES - no error detected. But maybe errors nonetheless? – may pass the damaged data Xin Liu 18 Checksum Examples 1111 1000 ------0111 1 -----1000 A: 0111 Q: one bit flip Recv: 0011 1111 1000 0011 --------1011 A: error! Xin Liu Q: two bits flip? Q: is it possible the checksum is all 1s? 19 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 20 21 Principles of Reliable data transfer • important in app., transport, link layers • top-10 list of important networking topics! • characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt) Xin Liu 22 Reliable data transfer: getting started rdt_send(): called from above, (e.g., by app.). Passed data to deliver to receiver upper layer deliver_data(): called by rdt to deliver data to upper send side receive side udt_send(): called by rdt, to transfer packet over unreliable channel to receiver rdt_rcv(): called when packet arrives on rcv-side of channel Xin Liu 23 Reliable data transfer: getting started We’ll: • incrementally develop sender, receiver sides of reliable data transfer protocol (rdt) • consider only unidirectional data transfer – but control info will flow on both directions! • use finite state machines (FSM) to specify sender, receiver state: when in this “state” next state uniquely determined by next event event causing state transition actions taken on state transition state 1 event actions Xin Liu state 2 24 Rdt1.0: reliable transfer over a reliable channel • underlying channel perfectly reliable – no bit errors – no loss of packets • separate FSMs for sender, receiver: – sender sends data into underlying channel – receiver read data from underlying channel Wait for call from above rdt_send(data) packet = make_pkt(data) udt_send(packet) sender Wait for call from below rdt_rcv(packet) extract (packet,data) deliver_data(data) receiver Q: What about bit error? Xin Liu 25 Rdt2.0: channel with bit errors • underlying channel may flip bits in packet – recall: UDP checksum to detect bit errors • the question: how to recover from errors: – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors – sender retransmits pkt on receipt of NAK – human scenarios using ACKs, NAKs? 
• new mechanisms in rdt2.0 (beyond rdt1.0): – error detection – receiver feedback: control msgs (ACK,NAK) rcvr->sender Xin Liu 26 rdt2.0: FSM specification rdt_send(data) snkpkt = make_pkt(data, checksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && isNAK(rcvpkt) Wait for Wait for call from ACK or udt_send(sndpkt) above NAK rdt_rcv(rcvpkt) && isACK(rcvpkt) L receiver rdt_rcv(rcvpkt) && corrupt(rcvpkt) udt_send(NAK) Wait for call from below sender rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) extract(rcvpkt,data) deliver_data(data) udt_send(ACK) Xin Liu 27 rdt2.0: operation with no errors rdt_send(data) snkpkt = make_pkt(data, checksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && isNAK(rcvpkt) Wait for Wait for call from ACK or udt_send(sndpkt) above NAK rdt_rcv(rcvpkt) && isACK(rcvpkt) rdt_rcv(rcvpkt) && corrupt(rcvpkt) udt_send(NAK) Wait for call from below L rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) extract(rcvpkt,data) deliver_data(data) udt_send(ACK) Xin Liu 28 rdt2.0: error scenario rdt_send(data) snkpkt = make_pkt(data, checksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && isNAK(rcvpkt) Wait for Wait for call from ACK or udt_send(sndpkt) above NAK rdt_rcv(rcvpkt) && isACK(rcvpkt) rdt_rcv(rcvpkt) && corrupt(rcvpkt) udt_send(NAK) Wait for call from below L rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) extract(rcvpkt,data) deliver_data(data) udt_send(ACK) Q: what about ack/nak corruption? Xin Liu rdt2.0 has a fatal flaw! What happens if ACK/NAK corrupted? Handling duplicates: • sender adds sequence number to each pkt • sender retransmits current pkt if ACK/NAK garbled • receiver discards (doesn’t deliver up) duplicate pkt • sender doesn’t know what happened at receiver! • can’t just retransmit: possible duplicate What to do? • sender ACKs/NAKs receiver’s ACK/NAK? What if sender ACK/NAK lost? • retransmit, but this might cause retransmission of correctly received pkt! Xin Liu stop and wait Sender sends one packet, then waits for receiver response 29 30 rdt2.1: sender, handles garbled ACK/NAKs rdt_send(data) sndpkt = make_pkt(0, data, checksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || Wait for Wait for isNAK(rcvpkt) ) ACK or call 0 from udt_send(sndpkt) NAK 0 above rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) L rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ) udt_send(sndpkt) L Wait for ACK or NAK 1 Wait for call 1 from above rdt_send(data) sndpkt = make_pkt(1, data, checksum) udt_send(sndpkt) Xin Liu 31 rdt2.1: receiver, handles garbled ACK/NAKs rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) rdt_rcv(rcvpkt) && (corrupt(rcvpkt) extract(rcvpkt,data) deliver_data(data) sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && (corrupt(rcvpkt) sndpkt = make_pkt(NAK, chksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && not corrupt(rcvpkt) && has_seq1(rcvpkt) sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt) sndpkt = make_pkt(NAK, chksum) udt_send(sndpkt) Wait for 0 from below Wait for 1 from below rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) extract(rcvpkt,data) deliver_data(data) sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt) Xin Liu rdt_rcv(rcvpkt) && not corrupt(rcvpkt) && has_seq0(rcvpkt) sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt) rdt2.1: discussion Sender: • seq # added to pkt • two seq. #’s (0,1) will suffice. Why? • must check if received ACK/NAK corrupted • twice as many states – state must “remember” whether “current” pkt has 0 or 1 seq. 
# Receiver: • must check if received packet is duplicate – state indicates whether 0 or 1 is expected pkt seq # • note: receiver can not know if its last ACK/NAK received OK at sender Xin Liu 32 33 rdt2.2: a NAK-free protocol • same functionality as rdt2.1, using NAKs only • instead of NAK, receiver sends ACK for last pkt received OK – receiver must explicitly include seq # of pkt being ACKed • duplicate ACK at sender results in same action as NAK: retransmit current pkt Xin Liu 34 rdt2.2: sender, receiver fragments rdt_send(data) sndpkt = make_pkt(0, data, checksum) udt_send(sndpkt) rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || Wait for Wait for isACK(rcvpkt,1) ) ACK call 0 from 0 udt_send(sndpkt) above sender FSM fragment rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) udt_send(sndpkt) Wait for 0 from below rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) receiver FSM fragment rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) extract(rcvpkt,data) deliver_data(data) sndpkt = make_pkt(ACK1, chksum) udt_send(sndpkt) Xin Liu L 35 rdt3.0: channels with errors and loss New assumption: underlying channel can also lose packets (data or ACKs) – checksum, seq. #, ACKs, retransmissions will be of help, but not enough Q: how to deal with loss? – sender waits until certain data or ACK lost, then retransmits – yuck: drawbacks? Approach: sender waits “reasonable” amount of time for ACK • retransmits if no ACK received in this time • if pkt (or ACK) just delayed (not lost): – retransmission will be duplicate, but use of seq. #’s already handles this – receiver must specify seq # of pkt being ACKed • requires countdown timer Xin Liu 36 rdt3.0 sender rdt_send(data) sndpkt = make_pkt(0, data, checksum) udt_send(sndpkt) start_timer rdt_rcv(rcvpkt) L rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ) timeout udt_send(sndpkt) start_timer rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) stop_timer stop_timer timeout udt_send(sndpkt) start_timer L Wait for ACK0 Wait for call 0from above L rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ) Wait for ACK1 Wait for call 1 from above rdt_send(data) sndpkt = make_pkt(1, data, checksum) udt_send(sndpkt) start_timer Q: can we improve this one? Xin Liu rdt_rcv(rcvpkt) L 37 rdt3.0 in action Xin Liu 38 rdt3.0 in action Xin Liu 39 Performance of rdt3.0 • example: 1 Gbps link, 15 ms e-e prop. delay, 1KB packet • U sender: utilization – fraction of time sender busy sending • Utilization? Xin Liu 40 rdt3.0: stop-and-wait operation sender receiver first packet bit transmitted, t = 0 last packet bit transmitted, t = L / R first packet bit arrives last packet bit arrives, send ACK RTT ACK arrives, send next packet, t = RTT + L / R •rdt3.0 works, but performance stinks Xin Liu 41 Performance Ttransmit = U L (packet length in bits) 8kb/pkt = = 8 microsec R (transmission rate, bps) 10**9 b/sec sender = L/R RTT + L / R = .008 30.008 = 0.00027 microsec onds – 1KB pkt every 30 msec -> 33kB/sec thruput over 1 Gbps link – network protocol limits use of physical resources! 
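Written out in one place, the stop-and-wait numbers from the slide above (values from the slide: L = 8000 bits per packet, R = 10^9 bits/sec, RTT = 30 msec) are:

    T_transmit = L / R = 8000 bits / 10^9 bits/sec = 8 microsec
    U_sender = (L/R) / (RTT + L/R) = 0.008 msec / 30.008 msec ≈ 0.00027

U_sender is a dimensionless fraction of time the sender is busy, not a duration; the resulting throughput is one 1 KB packet per 30.008 msec, about 33 kB/sec over the 1 Gbps link.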
Xin Liu 42 Pipelined protocols Pipelining: sender allows multiple, “in-flight”, yetto-be-acknowledged pkts – range of sequence numbers must be increased – buffering at sender and/or receiver • Two generic forms of pipelined protocols: go-Back-N, selective repeat Xin Liu 43 Pipelining: increased utilization sender receiver first packet bit transmitted, t = 0 last bit transmitted, t = L / R first packet bit arrives last packet bit arrives, send ACK last bit of 2nd packet arrives, send ACK last bit of 3rd packet arrives, send ACK RTT ACK arrives, send next packet, t = RTT + L / R Increase utilization by a factor of 3! U sender = 3*L/R RTT + L / R = .024 30.008 Xin Liu = 0.0008 microsecon ds Go-Back-N Sender: • k-bit seq # in pkt header • “window” of up to N, consecutive unack’ed pkts allowed • ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK” -- duplicate ACKs (see receiver) • timer for each in-flight pkt • timeout(n): retransmit pkt n and all higher seq # pkts in window Xin Liu 44 45 GBN: sender extended FSM rdt_send(data) L base=1 nextseqnum=1 if (nextseqnum < base+N) { sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum) udt_send(sndpkt[nextseqnum]) if (base == nextseqnum) start_timer nextseqnum++ } else refuse_data(data) Wait rdt_rcv(rcvpkt) && corrupt(rcvpkt) L timeout start_timer udt_send(sndpkt[base]) udt_send(sndpkt[base+1]) … udt_send(sndpkt[nextseqnum-1]) rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) base = getacknum(rcvpkt)+1 If (base == nextseqnum) stop_timer else start_timer Xin Liu 46 GBN: receiver extended FSM default udt_send(sndpkt) L Wait expectedseqnum=1 sndpkt = make_pkt(expectedseqnum,ACK,chksum) rdt_rcv(rcvpkt) && notcurrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum) extract(rcvpkt,data) deliver_data(data) sndpkt = make_pkt(expectedseqnum,ACK,chksum) udt_send(sndpkt) expectedseqnum++ ACK-only: always send ACK for correctly-received pkt with highest in-order seq # – may generate duplicate ACKs – need only remember expectedseqnum • out-of-order pkt: – discard (don’t buffer) -> no receiver buffering! – Re-ACK pkt with highest in-order seq # Xin Liu 47 GBN in action Xin Liu 48 Selective Repeat • receiver individually acknowledges all correctly received pkts – buffers pkts, as needed, for eventual in-order delivery to upper layer • sender only resends pkts for which ACK not received – sender timer for each unACKed pkt • sender window – N consecutive seq #’s – again limits seq #s of sent, unACKed pkts Xin Liu 49 Selective repeat: sender, receiver windows Xin Liu 50 Selective repeat receiver pkt n in [rcvbase, rcvbase+N-1] sender data from above : • send ACK(n) • out-of-order: buffer • in-order: deliver (also deliver buffered, in-order pkts), advance window to next notyet-received pkt • if next available seq # in window, send pkt timeout(n): • resend pkt n, restart timer ACK(n) in [sendbase,sendbase+N]: • mark pkt n as received • if n smallest unACKed pkt, advance window base to next unACKed seq # pkt n in [rcvbase-N,rcvbase-1] • ACK(n) otherwise: • ignore Xin Liu 51 Selective repeat in action Xin Liu Selective repeat: dilemma Example: • seq #’s: 0, 1, 2, 3 • window size=3 • receiver sees no difference in two scenarios! • incorrectly passes duplicate data as new in (a) Q: what relationship between seq # size and window size? 
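On the dilemma just posed: the standard answer is that for selective repeat the window must be no larger than half the sequence-number space (N <= 2^(k-1) for k-bit sequence numbers), otherwise the receiver cannot distinguish a retransmitted old packet from a new one, while Go-Back-N tolerates windows up to 2^k - 1. A minimal C sketch of the Go-Back-N sender events from the FSM earlier in this section follows; make_pkt, udt_send, start_timer, stop_timer, corrupt, getacknum and refuse_data are the helpers assumed by the slides' pseudocode (declared here as stubs), and compute_checksum, the struct layouts, and the constants are illustrative:

    #include <stdbool.h>

    #define N       8                     /* sender window size */
    #define MAXSEQ  256                   /* send-buffer size (a real sender indexes modulo the seq space) */

    struct msg { char data[20]; };
    struct pkt { int seqnum; int acknum; int checksum; char payload[20]; };

    /* helpers assumed from the slides' rdt framework, not implemented here */
    extern struct pkt make_pkt(int seq, struct msg m, int checksum);
    extern void udt_send(struct pkt p);
    extern void start_timer(void);
    extern void stop_timer(void);
    extern bool corrupt(struct pkt p);
    extern int  getacknum(struct pkt p);
    extern void refuse_data(struct msg m);
    extern int  compute_checksum(struct msg m);

    static int base = 1, nextseqnum = 1;
    static struct pkt sndpkt[MAXSEQ];     /* copies of not-yet-acknowledged packets */

    void rdt_send(struct msg data) {
        if (nextseqnum < base + N) {                  /* room in the window? */
            sndpkt[nextseqnum] = make_pkt(nextseqnum, data, compute_checksum(data));
            udt_send(sndpkt[nextseqnum]);
            if (base == nextseqnum)                   /* timer covers the oldest unacked packet */
                start_timer();
            nextseqnum++;
        } else {
            refuse_data(data);                        /* window full: push back on the application */
        }
    }

    void rdt_rcv(struct pkt ack) {
        if (corrupt(ack))
            return;                                   /* ignore garbled ACKs */
        base = getacknum(ack) + 1;                    /* cumulative ACK slides the window */
        if (base == nextseqnum)
            stop_timer();
        else
            start_timer();                            /* restart for the new oldest unacked packet */
    }

    void timeout(void) {                              /* resend every unacked packet */
        start_timer();
        for (int i = base; i < nextseqnum; i++)
            udt_send(sndpkt[i]);
    }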
Xin Liu 52 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 53 54 TCP: Overview RFCs: 793, 1122, 1323, 2018, 2581 • point-to-point: • full duplex data: – one sender, one receiver – bi-directional data flow in same connection – MSS: maximum segment size • reliable, in-order byte steam: – no “message boundaries” • connection-oriented: • pipelined: – handshaking (exchange of control msgs) init’s sender, receiver state before data exchange – TCP congestion and flow control set window size • send & receive buffers • flow controlled: socket door application writes data application reads data TCP send buffer TCP receive buffer socket door segment Xin Liu – sender will not overwhelm receiver TCP segment structure 55 32 bits URG: urgent data (generally not used) ACK: ACK # valid PSH: push data now source port # sequence number acknowledgement number head not UA P R S F len used checksum RST, SYN, FIN: connection estab (setup, teardown commands) Internet checksum (as in UDP) dest port # Receive window Urg data pnter Options (variable length) application data (variable length) Xin Liu counting by bytes of data (not segments!) # bytes rcvr willing to accept TCP Options • • • • MSS Timestamp Window scale Selective-ACK Xin Liu 56 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 57 58 TCP seq. #’s and ACKs Seq. #’s: – byte stream “number” of first byte in segment’s data ACKs: – seq # of next byte expected from other side – cumulative ACK Q: how receiver handles outof-order segments – A: TCP spec doesn’t say, - up to implementor Host A User types ‘C’ Host B host ACKs receipt of ‘C’, echoes back ‘C’ host ACKs receipt of echoed ‘C’ simple telnet scenario Xin Liu time 59 TCP Round Trip Time and Timeout Q: how to set TCP timeout value? • longer than RTT – but RTT varies • too short: premature timeout – unnecessary retransmissions • too long: slow reaction to segment loss Q: how to estimate RTT? 
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want the estimated RTT “smoother”
– average several recent measurements, not just the current SampleRTT
Xin Liu
TCP Round Trip Time and Timeout
EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT
• exponential weighted moving average
• influence of past samples decreases exponentially fast
• typical value: α = 0.125
Xin Liu 60
Example RTT estimation: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr (plot of SampleRTT and EstimatedRTT over time, roughly 100 to 350 milliseconds)
Xin Liu 61
TCP Round Trip Time and Timeout
Setting the timeout
• EstimatedRTT plus a “safety margin”
– large variation in EstimatedRTT -> larger safety margin
• first estimate how much SampleRTT deviates from EstimatedRTT:
DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
(typically, β = 0.25)
Then set the timeout interval:
TimeoutInterval = EstimatedRTT + 4*DevRTT
Xin Liu 62
RTT
• The timestamp option can be used to measure RTT for each segment
• Better RTT estimate
• Q: do sender and receiver need to synchronize their clocks/timestamp units?
• No synchronization is required
Xin Liu 63
TCP reliable data transfer
• TCP creates rdt service on top of IP’s unreliable service
• pipelined segments
• cumulative ACKs
• TCP uses a single retransmission timer
• retransmissions are triggered by:
– timeout events
– duplicate ACKs
• initially consider a simplified TCP sender:
– ignore duplicate ACKs
– ignore flow control, congestion control
Xin Liu 64
TCP sender events:
data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment)
• expiration interval: TimeoutInterval
timeout:
• retransmit segment that caused timeout
• restart timer
ACK rcvd:
• if it acknowledges previously unacked segments
– update what is known to be acked
– start timer if there are outstanding segments
Xin Liu 65
TCP sender (simplified)
NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
  switch(event)
  event: data received from application above
    create TCP segment with sequence number NextSeqNum
    if (timer currently not running) start timer
    pass segment to IP
    NextSeqNum = NextSeqNum + length(data)
  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer
  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments) start timer
    }
} /* end of loop forever */
Comment: SendBase-1 is the last cumulatively ACKed byte.
Example: SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so new data is acked.
Xin Liu 66
TCP: retransmission scenarios (timeline figures: lost ACK scenario, and premature timeout in which Seq=92 is retransmitted and SendBase moves from 100 to 120)
Xin Liu 67
TCP retransmission scenarios (more): cumulative ACK scenario (timeline figure). Q: what to do at time t+t_0? Is there a timeout?
Xin Liu 68
TCP ACK generation [RFC 1122, RFC 2581]
Event at receiver and TCP receiver action:
Arrival of in-order segment with expected seq #, all data up to expected seq # already ACKed: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK.
Arrival of in-order segment with expected seq #.
One other segment has ACK pending Immediately send single cumulative ACK, ACKing both in-order segments Arrival of out-of-order segment higher-than-expect seq. # . Gap detected Immediately send duplicate ACK, indicating seq. # of next expected byte Arrival of segment that partially or completely fills gap Immediate send ACK, provided that segment startsat lower end of gap Xin Liu 69 Nagel algorithm • Small segments cannot be sent until the outstanding data is acked if a TCP connection has outstanding data that has not yet been acked. • Self-clocking: the fast the ack comes back, the faster the data is sent • May need to be disabled somtimes Xin Liu 70 Fast Retransmit • Time-out period often • If sender receives 3 relatively long: ACKs for the same – long delay before data, it supposes that resending lost packet segment after ACKed • Detect lost segments data was lost: via duplicate ACKs. – Sender often sends many segments backto-back – If segment is lost, there will likely be many duplicate ACKs. Xin Liu – fast retransmit: resend segment before timer expires 71 Fast retransmit algorithm: event: ACK received, with ACK field value of y if (y > SendBase) { SendBase = y if (there are currently not-yet-acknowledged segments) start timer } else { increment count of dup ACKs received for y if (count of dup ACKs received for y = 3) { resend segment with sequence number y } a duplicate ACK for already ACKed segment fast retransmit Xin Liu 72 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 73 TCP Flow Control flow control • receive side of TCP connection has a receive buffer: sender won’t overflow receiver’s buffer by transmitting too much, too fast • speed-matching service: matching the send rate to the receiving app’s drain rate • app process may be slow at reading from buffer Xin Liu 74 TCP Flow control: how it works (Suppose TCP receiver discards out-of-order segments) • spare room in buffer = RcvWindow = RcvBuffer-[LastByteRcvd LastByteRead] Xin Liu 75 • Rcvr advertises spare room by including value of RcvWindow in segments • Sender limits unACKed data to RcvWindow – guarantees receive buffer doesn’t overflow More • Slow receiver – Persist timer, a timer set by sender to keep asking receiver about its window size. – Ack new window • Long fat pipeline: high speed link and/or long RTT – Q: to fully utilize a long fat pipeline, what should the window size be? • At least bandwidth* RTT. • Window scale option during handshaking – Option: scale 0 -14, 2^scale Xin Liu 76 Header 32 bits source port # dest port # sequence number acknowledgement number head not UA P R S F len used checksum Receive window Urg data pnter Options (variable length) application data (variable length) Xin Liu 77 Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 79 TCP Connection Management Recall: TCP sender, receiver Three way handshake: establish “connection” before exchanging data segments • initialize TCP variables: – seq. #s – buffers, flow control info (e.g. 
RcvWindow) Step 1: client host sends TCP SYN segment to server – specifies initial seq # – no data • client: connection initiator – connect(); • server: contacted by client – accept(); Step 2: server host receives SYN, replies with SYNACK segment – server allocates buffers – specifies server initial seq. # Step 3: client receives SYNACK, replies with ACK segment, which may contain data Xin Liu 80 TCP Connection • SYN, FIN consume one sequence number • ISN increase by one every 4ms. • 9.5 hours reuse cycle. • MSS: maximum segment size Xin Liu 81 82 TCP Connection Management (cont.) Closing a connection: client client closes socket: close(); close Step 1: client end system close timed wait sends TCP FIN control segment to server Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN. server closed Xin Liu 83 TCP Connection Management (cont.) Step 3: client receives FIN, client replies with ACK. – Enters “timed wait” - will respond with ACK to received FINs Step 4: server, receives ACK. Connection closed. server closing FIN_WAIT_1 closing FIN_WAIT_2 TIME_WAIT timed wait Note: with small modification, can handle simultaneous FINs. closed Xin Liu closed 84 TCP Connection Management (cont) TCP server lifecycle TCP client lifecycle Xin Liu TCP Connection Management • Allow half-close, i.e., one end to terminate its output, but still receiving data • Allow simultaneous open • Allow simultaneous close • Crashes? Xin Liu 85 86 [root@shannon liu]# tcpdump -S tcp port 22 tcpdump: listening on eth0 23:01:51.363983 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: S 3036713598:3036713598(0) win 5840 <mss 1460,sackOK,timestamp 13989220 0,nop,wscale 0> (DF) 23:01:51.364829 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: S 2462279815:2462279815(0) ack 3036713599 win 24616 <nop,nop,timestamp 626257407 13989220,nop,wscale 0,nop,nop,sackOK,mss 1460> (DF) 23:01:51.364844 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279816 win 5840 <nop,nop,timestamp 13989220 626257407> (DF) 23:01:51.375451 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P 2462279816:2462279865(49) ack 3036713599 win 24616 <nop,nop,timestamp 626257408 13989220> (DF) 23:01:51.375478 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462279865 win 5840 <nop,nop,timestamp 13989221 626257408> (DF) 23:01:51.379319 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P 3036713599:3036713621(22) ack 2462279865 win 5840 <nop,nop,timestamp 13989221 626257408> (DF) 23:01:51.379570 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . ack 3036713621 win 24616 <nop,nop,timestamp 626257408 13989221> (DF) Xin Liu 87 23:01:51.941616 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: P 3036714373:3036714437(64) ack 2462281065 win 7680 <nop,nop,timestamp 13989277 626257462> (DF) 23:01:51.952442 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: P 2462281065:2462282153(1088) ack 3036714437 win 24616 <nop,nop,timestamp 626257465 13989277> (DF) 23:01:51.991682 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282153 win 9792 <nop,nop,timestamp 13989283 626257465> (DF) 23:01:54.699597 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: F 3036714437:3036714437(0) ack 2462282153 win 9792 <nop,nop,timestamp 13989553 626257465> (DF) 23:01:54.699880 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: . 
ack 3036714438 win 24616 <nop,nop,timestamp 626257740 13989553>(DF) 23:01:54.701129 weasel.cs.ucdavis.edu.ssh > shannon.cs.ucdavis.edu.60042: F 2462282153:2462282153(0) ack 3036714438 win 24616 <nop,nop,timestamp 626257740 13989553> (DF) 23:01:54.701143 shannon.cs.ucdavis.edu.60042 > weasel.cs.ucdavis.edu.ssh: . ack 2462282154 win 9792 <nop,nop,timestamp 13989553 626257740> (DF) 26 packets received by filter 0 packets dropped by kernel Xin Liu Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 88 89 Principles of Congestion Control Congestion: • informally: “too many sources sending too much data too fast for network to handle” • different from flow control! • manifestations: – lost packets (buffer overflow at routers) – long delays (queueing in router buffers) • a top-10 problem! Xin Liu 90 Causes/costs of congestion: scenario 1 Host A • two senders, two receivers • one router, infinite buffers • no retransmission lout lin : original data unlimited shared output link buffers Host B • large delays when congested • maximum achievable throughput Xin Liu 91 Causes/costs of congestion: scenario 2 • one router, finite buffers • sender retransmission of lost packet Host A Host B lin : original data l'in : original data, plus retransmitted data lout finite shared output link buffers • “costs” of congestion: • more work (retrans) for given “goodput” • unneeded retransmissions: link carries multiple copies of pkt Xin Liu 92 Causes/costs of congestion: scenario 2 Ideal case: knowing whether the buffer is free. Knowing which packet is lost. Xin Liu Packet loss + unnecessary retransmission 93 Causes/costs of congestion: scenario 3 Q: what happens as lin and linincrease ? • four senders • multihop paths • timeout/retransmit Host A lin : original data l'in : original data, plus retransmitted data finite shared output link buffers Host B Xin Liu lout 94 Causes/costs of congestion: scenario 3 H o s t A l o u t H o s t B Another “cost” of congestion: • when packet dropped, any “upstream transmission capacity used for that packet was wasted! 
Xin Liu 95 Approaches towards congestion control Two broad approaches towards congestion control: End-end congestion control: Network-assisted congestion control: • no explicit feedback from network • congestion inferred from end-system observed loss, delay • approach taken by TCP • routers provide feedback to end systems – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) – explicit rate sender should send at Xin Liu 96 Case study: ATM ABR congestion control ABR: available bit rate: • “elastic service” • if sender’s path “underloaded”: – sender should use available bandwidth • if sender’s path congested: – sender throttled to minimum guaranteed rate RM (resource management) cells: • sent by sender, interspersed with data cells • bits in RM cell set by switches (“network-assisted”) – NI bit: no increase in rate (mild congestion) – CI bit: congestion indication • RM cells returned to sender by receiver, with bits intact Xin Liu 97 Case study: ATM ABR congestion control • two-byte ER (explicit rate) field in RM cell – congested switch may lower ER value in cell – sender’ send rate thus minimum supportable rate on path • EFCI bit in data cells: set to 1 in congested switch – if data cell preceding RM cell has EFCI set, sender sets CI bit in returned RM cell Xin Liu Outline • • • • • Transport-layer services Multiplexing and demultiplexing Connectionless transport: UDP Principles of reliable data transfer Connection-oriented transport: TCP – – – – segment structure reliable data transfer flow control connection management • Principles of congestion control • TCP congestion control Xin Liu 98 TCP Congestion Control • end-end control (no network assistance) • sender limits transmission: LastByteSent-LastByteAcked cwnd • Roughly, rate = cwnd RTT Bytes/sec • cwnd is dynamic, function of perceived network congestion 99 How does sender perceive congestion? • loss event = timeout or 3 duplicate acks • TCP sender reduces rate (cwnd) after loss event mechanisms: – slow start – congestion avoidance – AIMD Xin Liu TCP Slow Start • When connection begins, increase rate exponentially fast until – Example: MSS = 500 first loss event bytes & RTT = 200 msec • When connection begins, cwnd = 1 MSS – initial rate = 20 kbps • available bandwidth may be >> MSS/RTT – desirable to quickly ramp up to respectable rate Xin Liu 100 101 TCP Slow Start (more) • When connection begins, increase rate exponentially until first loss event or ssthresh: Host B RTT Host A – incrementing cwnd for every ACK received – double cwnd every RTT • Summary: initial rate is slow but ramps up exponentially fast time Xin Liu Congestion Avoidance • ssthresh: when cwnd reaches ssthresh, congestion avoidance begins • Congestion avoidance: increase cwnd by 1/cwnd each time an ACK is received • Congestion happens: ssthresh=max(2MSS, cwnd/2) Xin Liu 102 103 TCP AIMD multiplicative decrease: cut cwnd in half after loss event congestion window 24 Kbytes additive increase: increase cwnd by 1 MSS every RTT in the absence of loss events: probing 16 Kbytes 8 Kbytes time Long-lived TCP connection Xin Liu Reno vs. 
Tahoe Philosophy: • After 3 dup ACKs: – cwnd is cut in half – window then grows linearly • But after timeout event: – cwnd instead set to 1 MSS; • 3 dup ACKs indicates network capable of delivering some segments • timeout before 3 dup ACKs is “more alarming” – window then grows exponentially – to a sshthresh, then grows linearly •Fast retransmission: retransmit the seemingly missing segment •Fast recovery: congestion avoidance instead of slow start. Xin Liu 104 105 Summary: TCP Congestion Control • When cwnd is below sshthresh, sender in slow-start phase, window grows exponentially. • When cwnd is above sshthresh, sender is in congestion-avoidance phase, window grows linearly. • When a triple duplicate ACK occurs, sshthresh set to cwnd/2 and cwnd set to sshthresh. • When timeout occurs, sshthresh set to cwnd/2 and cwnd is set to 1 MSS. • Read Table 3.3 in the textbook. Xin Liu Trend • Recent research proposes network-assisted congestion control: active queue management • ECN: explicit congestion notification – 2 bits: 6 &7 in the IP TOS field • RED: random early detection – Implicit – Can be adapted to explicit methods by marking instead of dropping Xin Liu 106 107 TCP Fairness Fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K TCP connection 1 TCP connection 2 bottleneck router capacity R Xin Liu Why is TCP fair? Two competing sessions: • Additive increase gives slope of 1, as throughout increases • multiplicative decrease decreases throughput proportionally R equal bandwidth share loss: decrease window by factor of 2 congestion avoidance: additive increase loss: decrease window by factor of 2 congestion avoidance: additive increase Connection 1 throughput R Xin Liu 108 Fairness (more) 109 Fairness and parallel TCP Fairness and UDP connections • Multimedia apps often • nothing prevents app from do not use TCP opening parallel cnctions – do not want rate between 2 hosts. throttled by congestion control • Web browsers do this • Instead use UDP: • Example: link of rate R – pump audio/video at supporting 9 cnctions; constant rate, tolerate packet loss – new app asks for 1 TCP, gets rate R/10 – new app asks for 11 TCPs, gets R/2 ! • Research area: TCP friendly Xin Liu Steady State Throughput • R: rate at the bottleneck router • W=R*RTT ( unit is byte) – Maximum window size before packets dropping • cwnd ( unit is byte) – 0.5 * W--- 1 * W • Average Throughput – 0.75*W/RTT =0.75 R Xin Liu 110 Steady State Loss Rate • • • • cwnd varies from W/2/RTT to W/RTT. Only one packet is lost w= W/MSS: # of segments in window W. Loss rate is Xin Liu 111 112 Delay modeling Notation, assumptions: Q: How long does it take to receive an object from a Web server after sending a request? 
Ignoring congestion, delay is influenced by:
• TCP connection establishment
• data transmission delay
• slow start
• assume one link between client and server of rate R
• S: MSS (bits)
• O: object size (bits)
• no retransmissions (no loss, no corruption)
Window size:
• first assume: fixed congestion window, W segments
• then dynamic window, modeling slow start
Xin Liu
Fixed congestion window (1)
First case: WS/R > RTT + S/R: ACK for the first segment in the window returns before a window’s worth of data has been sent
delay = 2RTT + O/R
Xin Liu 113
Fixed congestion window (2)
Second case: WS/R < RTT + S/R: wait for ACK after sending a window’s worth of data
delay = 2RTT + O/R + (K-1)[S/R + RTT - WS/R]
Xin Liu 114
TCP Delay Modeling: Slow Start (1)
Now suppose the window grows according to slow start.
Will show that the delay for one object is:
Latency = 2RTT + O/R + P[RTT + S/R] - (2^P - 1)S/R
where P is the number of times TCP idles at the server: P = min{Q, K-1}
- where Q is the number of times the server would idle if the object were of infinite size
- and K is the number of windows that cover the object
Xin Liu 115
TCP Delay Modeling: Slow Start (2)
Delay components:
• 2RTT for connection establishment and request
• O/R to transmit the object
• time the server idles due to slow start
Server idles P = min{K-1, Q} times.
Example: O/S = 15 segments, K = 4 windows, Q = 2, so P = min{K-1, Q} = 2: the server idles twice
(timing diagram: initiate TCP connection, request object, first window = S/R, second window = 2S/R, third window = 4S/R, fourth window = 8S/R, complete transmission, object delivered)
Xin Liu 116
TCP Delay Modeling (3)
S/R + RTT = time from when the server starts to send a segment until the server receives its acknowledgement
2^(k-1) * S/R = time to transmit the kth window
S/R + RTT - 2^(k-1) * S/R = idle time after the kth window
delay = 2RTT + O/R + sum over p = 1..P of idleTime_p
      = 2RTT + O/R + sum over k = 1..P of [S/R + RTT - 2^(k-1) * S/R]
      = 2RTT + O/R + P[RTT + S/R] - (2^P - 1)S/R
(same timing diagram as the previous slide)
Xin Liu 117
TCP Delay Modeling (4)
Recall K = number of windows that cover the object. How do we calculate K?
K = min{k : 2^0 S + 2^1 S + ... + 2^(k-1) S >= O}
  = min{k : 2^0 + 2^1 + ... + 2^(k-1) >= O/S}
  = min{k : 2^k - 1 >= O/S}
  = min{k : k >= log2(O/S + 1)}
  = ceil( log2(O/S + 1) )
Calculation of Q, the number of idles for an infinite-size object, is similar.
Xin Liu 118
Delay
• We showed:
Latency = 2RTT + O/R + P[RTT + S/R] - (2^P - 1)S/R
where P is the number of times TCP idles at the server: P = min{Q, K-1}
- Q is the number of times the server would idle if the object were of infinite size
- K is the number of windows that cover the object
Xin Liu 119
Summary
• What factors limit TCP throughput?
• What if the sender’s link is the bottleneck?
– What happens to cwnd?
• A bottleneck in the middle of the network?
• The receiver is the bottleneck?
• Summary: the throughput is the minimum of all of these limits
Xin Liu 120
HTTP
• Assume Web page consists of:
– 1 base HTML page (of size O bits)
– M images (each of size O bits)
• Persistent HTTP
– Multiple objects can be sent over a single TCP connection between client and server.
– HTTP/1.1 uses persistent connections in default mode
• Non-persistent HTTP
– At most one object is sent over a TCP connection.
– HTTP/1.0 uses nonpersistent HTTP Xin Liu HTTP Modeling • Non-persistent HTTP: – M+1 TCP connections in series – Response time = (M+1)O/R + (M+1)2RTT + sum of idle times • Persistent HTTP: – 2 RTT to request and receive base HTML file – 1 RTT to request and receive M images – Response time = (M+1)O/R + 3RTT + sum of idle times • Non-persistent HTTP with X parallel connections – Suppose M/X integer. – 1 TCP connection for base file – M/X sets of parallel connections for images. – Response time = (M+1)O/R + (M/X + 1)2RTT + sum of idle times Xin Liu 122 HTTP response time • Small RTT? – low bandwidth, connection & response time dominated by transmission time. – Persistent connections only give minor improvement over parallel connections. • Large RTT? – For larger RTT, response time dominated by TCP establishment & slow start delays. Persistent connections now give important improvement: particularly in high delay*bandwidth networks. Xin Liu 123 Reading • D. Chiu and R. Jain, “Analysis of the increase and decrease algorithms for the congestion avoidance in computer networks”, Computer Networks and ISDN systems, vol. 17, no. 1, pp. 1-14. (fairness and performance) • T. Lakshman, U. Madhow, “the performance of TCP/IP for networks with high-bandwidth-delay products and random losses,” IEEE/ACM Trans. On Networking, vol. 5, no. 3, pp. 336-350. Xin Liu 124 SCTP • SCTP: stream control transmission protocol • Motivation – Many applications need reliable message delivery – they do so by delineating a TCP stream – TCP provides both strict-ordering and reliability – many applications may not need both – Many applications require better fault tolerance Xin Liu 125 Motivation (contd) • HTTP is one such application – While transferring multiple embedded files we only want • Reliable file transfer for each file • Partial ordering for the packets of each file but not total ordering amongst all the packets – TCP provides more than this (but overhead?) – SCTP may help (how? – later) Xin Liu 126 What is SCTP? 
• Originally designed to support PSTN signaling messages over IP Networks • It is a reliable transport protocol operating on top of a connectionless packet network such as IP (same level as TCP) Xin Liu 127 Major Differences from TCP • Connection Setup • SCTP is message oriented as opposed to being byte stream oriented – Boundary conservation • SCTP has the concept of an association instead of a connection – Each association can have multiple streams • SCTP separates reliable transfer of datagrams from the delivery mechanism – No-head-of line blocking • SCTP supports multihoming Xin Liu 128 Similarities to TCP • Similar Flow Control and Congestion Control Strategies employed – – – – Slow Start and Congestion Avoidance phases Selective Acknowledgement Fast Retransmit Slight differences for supporting multihoming • Will allow co-existence of TCP and SCTP [JST 2000] Xin Liu 129 130 Connection Setup server Generate cookie • No action at server after send cookie • Cookie verification – Effective to SYN attacks Xin Liu Packet Format Xin Liu 131 HTTP Server Architecture 132 Single File Transfer ( Both TCP and SCTP are similar) Request file Server Fork child Client Send file Xin Liu Child process HTTP Server Architecture 133 Multiple File Transfer (Embedded files) - TCP Request file 0 Client Send file 0 Request file 1..N Send file 1,2,…N Xin Liu Server Fork child Child process SCTP Packet Format streams Xin Liu 134 HTTP Server Architecture 135 Multiple Files Transfer (Embedded Files) - SCTP Request file 0 Client Send file 0 – stream 0 Request files 1..N Send file 1 – stream 1 Send file N – stream N Xin Liu Server Fork child Child process 136 Hypothesis Server Client 1 3 2 File 3 1 3 2 File 2 1 3 2 1 File 1 TCP Receive buffer in kernel Xin Liu 137 Hypothesis Server Client 1 3 2 File 3 1 3 2 File 2 1 3 2 1 File 1 SCTP Receive buffer in kernel Xin Liu Multihome • designed to offer network resilience to failed interfaces on the host and faster recovery during network failures – Less effective if a single point of failure exists – Today’s IP subject to a routing reconvergence time Xin Liu 138 Reading • RFC 2960 • SCTP for Beginners http://tdrwww.exp-math.uni-essen.de/inhalt/forschung/sctp_fb/sctp_intro.html Xin Liu 139 Wireless TCP • Motivation – Wireless channels are unreliable and timevarying – Cause TCP timeout/Duplicate acks • Approaches Xin Liu 140 Channel Conditions • Decides transmission performance • Determined by – Strength of desired signal – Noise level • Interference from other transmissions • Background noise – Time-varying and location-dependent. 
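The channel-quality measure referred to on the slides that follow, SINR, is the ratio of received desired-signal power to the sum of interference and noise power; a one-line definition (not spelled out on the original slide) is:

    SINR = S / (I + N), often quoted in decibels as SINR_dB = 10 * log10( S / (I + N) )

where S is the desired signal power, I the total interference power, and N the background noise power.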
Xin Liu 141 142 Interference and Noise Noise Interference Interference Desired Signal Interference Xin Liu 143 Propagation Environment Weak Shadowing Strong Path Loss Multi-path Fading Xin Liu Time-varying Channel Conditions • Due to users’ mobility and variability in the propagation environment, both desired signal and interference are time-varying and locationdependent • A measure of channel quality: SINR (Signal to Interference plus Noise Ratio) Xin Liu 144 145 Illustration of Channel Conditions Based on Lee’s path loss model, log-normal shadowing, and Raleigh fading Xin Liu Random Errors • If number of errors is small, they may be corrected by an error correcting code • Excessive bit errors result in a packet being discarded, possibly before it reaches the transport layer Xin Liu 146 147 Random Errors May Cause Fast Retransmit • Fast retransmit results in – retransmission of lost packet – reduction in congestion window • Reducing congestion window in response to errors is unnecessary • Reduction in congestion window reduces the throughput Xin Liu Sometimes Congestion Response May be Appropriate in Response to Errors • On a CDMA channel, errors occur due to interference from other user, and due to noise [Karn99pilc] – Interference due to other users is an indication of congestion. If such interference causes transmission errors, it is appropriate to reduce congestion window – If noise causes errors, it is not appropriate to reduce window • When a channel is in a bad state for a long duration, it might be better to let TCP backoff, so that it does not unnecessarily attempt retransmissions while the channel remains in the bad state [Padmanabhan99pilc] Xin Liu 148 149 Burst Errors May Cause Timeouts • If wireless link remains unavailable for extended duration, a window worth of data may be lost – driving through a tunnel – passing a truck • Timeout results in slow start • Slow start reduces congestion window to 1 MSS, reducing throughput • Reduction in window in response to errors unnecessary Xin Liu 150 Random Errors May Also Cause Timeout • Multiple packet losses in a window can result in timeout when using TCP-Reno Xin Liu Impact of Transmission Errors • TCP cannot distinguish between packet losses due to congestion and transmission errors • Unnecessarily reduces congestion window • Throughput suffers Xin Liu 151 Various Schemes • • • • Link level mechanisms Split connection approach TCP-Aware link layer TCP-Unaware approximation of TCP-aware link layer • Explicit notification • Receiver-based discrimination • Sender-based discrimination Ref: tutorials by Nitin Vaidya Xin Liu 152 153 Split Connection Approach Xin Liu Split Connection Approach • End-to-end TCP connection is broken into one connection on the wired part of route and one over wireless part of the route • A single TCP connection split into two TCP connections – if wireless link is not last on route, then more than two TCP connections may be needed Xin Liu 154 Split Connection Approach • Connection between wireless host MH and fixed host FH goes through base station BS • FH-MH = FH-BS + BS-MH FH Fixed Host BS MH Base Station Xin Liu Mobile Host 155 Split Connection Approach Per-TCP connection state TCP connection TCP connection application application transport transport transport network network network link link link physical physical physical rxmt wireless Xin Liu application 156 157 Split Connection Approach : Classification • Hides transmission errors from sender • Primary responsibility at base station • If specialized 
transport protocol used on wireless, then wireless host also needs modification Xin Liu 158 Split Connection Approach : Advantages • BS-MH connection can be optimized independent of FHBS connection – Different flow / error control on the two connections • Local recovery of errors – Faster recovery due to relatively shorter RTT on wireless link • Good performance achievable using appropriate BS-MH protocol Xin Liu 159 Split Connection Approach : Disadvantages • End-to-end semantics violated • BS retains hard state – BS failure can result in loss of data (unreliability) – Hand-off latency increases due to state transfer – Buffer space needed at BS for each TCP connection • Extra copying of data at BS – copying from FH-BS socket buffer to BS-MH socket buffer – increases end-to-end latency • May not be useful if data and acks traverse different paths (both do not go through the base station) – Example: data on a satellite wireless hop, acks on a dial-up channel Xin Liu Various Schemes • • • • Link layer mechanisms Split connection approach TCP-Aware link layer TCP-Unaware approximation of TCP-aware link layer • Explicit notification • Receiver-based discrimination • Sender-based discrimination Xin Liu 160 161 TCP-Aware Link Layer Xin Liu Snoop Protocol [Balakrishnan95acm] • Retains local recovery of Split Connection approach and link level retransmission schemes • Improves on split connection – end-to-end semantics retained – soft state at base station, instead of hard state Xin Liu 162 163 Snoop Protocol Per TCP-connection state TCP connection application application application transport transport transport network network link link link physical physical physical FH BS Xin Liu rxmt wireless network MH Snoop Protocol • Buffers data packets at the base station BS – to allow link layer retransmission • When dupacks received by BS from MH, retransmit on wireless link, if packet present in buffer • Prevents fast retransmit at TCP sender FH by dropping the dupacks at BS FH BS MH Xin Liu 164 165 Snoop : Example 35 36 TCP state maintained at link layer 37 38 40 39 38 FH 37 BS 34 MH 36 Example assumes delayed ack - every other packet ack’d Xin Liu 166 Snoop : Example 35 39 36 37 38 41 40 39 34 38 36 Xin Liu 167 Snoop : Example 37 40 38 39 42 41 40 36 39 36 dupack Duplicate acks are not delayed Xin Liu 168 Snoop : Example 37 40 38 41 39 43 42 41 36 40 36 36 Duplicate acks Xin Liu 169 Snoop : Example 44 37 40 38 41 39 42 43 37 FH 41 BS Discard dupack Dupack triggers retransmission 36 of packet 37 from base station BS needs to be TCP-aware to be able to interpret TCP headers Xin Liu MH 36 36 170 Snoop : Example 45 37 40 38 41 39 42 44 43 42 37 36 36 36 Xin Liu 36 171 Snoop : Example 46 37 40 43 38 41 44 39 42 45 43 42 36 TCP sender does not fast retransmit 36 36 36 Xin Liu 41 172 Snoop : Example 47 37 40 43 38 41 44 39 42 45 46 44 43 41 TCP sender does not fast retransmit 36 36 36 36 Xin Liu 173 Snoop : Example 42 45 43 46 44 48 47 45 FH 44 BS 41 MH 43 36 36 36 36 Xin Liu Snoop [Balakrishnan95acm] 2000000 bits/sec 1600000 1200000 base TCP Snoop 800000 400000 0 no error 256K 128K 64K 32K 16K 1/error rate (in bytes) 2 Mbps Wireless link Xin Liu 174 Snoop Protocol When Beneficial? 
• Snoop prevents fast retransmit from sender despite transmission errors, and out-of-order delivery on the wireless link • delivery causes fast retransmit only if it results in at least 3 dupacks • If wireless link level delay-bandwidth product is less than 4 packets, a simple (TCP-unaware) link level retransmission scheme can suffice – Since delay-bandwidth product is small, the retransmission scheme can deliver the lost packet without resulting in 3 dupacks from the TCP receiver Xin Liu 175 Snoop Protocol : Classification • Hides wireless losses from the sender • Requires modification to only BS (networkcentric approach) Xin Liu 176 177 Snoop Protocol : Advantages • High throughput can be achieved – performance further improved using selective acks • Local recovery from wireless losses • Fast retransmit not triggered at sender despite out-of-order link layer delivery • End-to-end semantics retained • Soft state at base station – loss of the soft state affects performance, but not correctness Xin Liu 178 Snoop Protocol : Disadvantages • Link layer at base station needs to be TCP-aware • Not useful if TCP headers are encrypted (IPsec) • Cannot be used if TCP data and TCP acks traverse different paths (both do not go through the base station) Xin Liu WTCP Protocol [Ratnam98] • Snoop hides wireless losses from the sender • But sender’s RTT estimates may be larger in presence of errors • Larger RTO results in slower response for congestion losses FH BS MH Xin Liu 179 180 WTCP Protocol • WTCP performs local recovery, similar to Snoop • In addition, WTCP uses the timestamp option to estimate RTT • The base station adds base station residence time to the timestamp when processing an ack received from the wireless host • Sender’s RTT estimate not affected by retransmissions on wireless link FH BS MH Xin Liu 181 WTCP Example 3 3 FH BS 4 MH 3 Numbers in this figure are timestamps Base station residence time is 1 unit Xin Liu 182 WTCP : Disadvantages • Requires use of the timestamp option • May be useful only if retransmission times are large – link stays in bad state for a long time – link frequently enters a bad state – link delay large • WTCP does not account for congestion on wireless hop – assumes that all delay at base station is due to queuing and retransmissions – will not work for shared wireless LAN, where delays also incurred due to contention with other transmitters Xin Liu Optional Reading • S. Dawkins, et. al, “performance implications of link-layer characteristics: links and errors,” tech. rep. IETF, June, 1999. • G. Montenegro, et al, “Long thin networks”, tech. rep. IETF, May, 1999. • Nitin Vaidya, http://www.crhc.uiuc.edu/wireless/tutorials.html Xin Liu 183