IT 347: Chapter 3 — Transport Layer
Instructor: Christopher Cole
Some slides taken from the Kurose & Ross book

• Network layer: logical communication between hosts
• Transport layer: logical communication between processes
– End-to-end only
– Routers, etc. don't read segments
• The book's (admittedly weird) analogy: two families, with Bill and Ann relaying letters as the "transport layer"
• IP provides a best-effort delivery service
– No guarantees! Unreliable
• Most fundamentally: the transport layer extends host-to-host delivery to process-to-process delivery

Internet transport-layer protocols
• Reliable, in-order delivery (TCP)
– congestion control
– flow control
– connection setup
• Unreliable, unordered delivery: UDP
– no-frills extension of "best-effort" IP
• Services not available:
– delay guarantees
– bandwidth guarantees
Transport Layer 3-3

Multiplexing
• Everybody is on sockets
– Multiplexing: gathering data from sockets and passing segments to the network layer
– Demultiplexing: taking segments from the network layer and distributing them to the right sockets
– How is it done?
• What are the 2 things that applications need to talk to each other?
– Source port/IP and destination port/IP
– Server side usually specifically names a port
– Client side lets the transport layer automatically assign a port

More multiplexing
• UDP: two-tuple
– Identified by destination IP address and port number
– If two UDP segments have different source IP addresses and/or port numbers, but the same destination IP and port number, they will be directed to the same destination process via the same socket
• TCP: four-tuple
– Identified by source IP/port and destination IP/port
– Two segments with different source info, but the same destination info, will be directed to different sockets
– TCP keeps the port open as a "welcoming socket"
• The server process creates a new socket when a connection is created
• One server can have many sockets open at a time
• Web server
– Spawns a new thread for each connection (How does it know which connection belongs to whom? Source port & IP)
– Threads are like lightweight subprocesses
– Many threads for one process

UDP
• A transport layer protocol must, at minimum, provide multiplexing/demultiplexing
– That's all UDP does, besides some light error checking
– You're practically talking directly to IP
• UDP process:
– Take the message from the application
– Add source and destination port number fields
– Add length & checksum fields
– Send the segment to layer 3
• Connectionless (no handshaking)
– Why does DNS use UDP?

Advantages of UDP
• Finer application control
– Just spit the bits onto the wire. No congestion control, flow control, etc.
• Connectionless
– No extra RTT delay
– Doesn't maintain connection state or buffers, so a UDP server can take more clients than a TCP server
• Small packet overhead
– TCP header = 20 bytes
– UDP header = 8 bytes

The UDP Controversy
• UDP doesn't play nice — no congestion control
• Say UDP packets flood the lines…
– Routers get more congested
– TCP sees this, and slows down its packet sending
– UDP doesn't
– Only UDP packets end up getting through

UDP: User Datagram Protocol [RFC 768]
• "No frills," "bare bones" Internet transport protocol
• "Best effort" service; UDP segments may be:
– lost
– delivered out of order to the app
• Connectionless:
– no handshaking between UDP sender and receiver
– each UDP segment handled independently of the others

Why is there a UDP?
• No connection establishment (which can add delay)
• Simple: no connection state at sender or receiver
• Small segment header
• No congestion control: UDP can blast away as fast as desired

UDP: more
• Often used for streaming multimedia apps
– loss tolerant
– rate sensitive
• Other UDP uses: DNS, SNMP
• Reliable transfer over UDP: add reliability at the application layer
– application-specific error recovery!
• UDP segment format (32-bit rows): source port #, dest port #; length (in bytes of the UDP segment, including header), checksum; then application data (message)

UDP checksum
Goal: detect "errors" (e.g., flipped bits) in the transmitted segment
Sender:
• treat segment contents as a sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts the checksum value into the UDP checksum field
Receiver:
• compute the checksum of the received segment
• check if the computed checksum equals the checksum field value:
– NO → error detected
– YES → no error detected.
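To make the "connectionless, no handshake" point concrete, here is a minimal loopback sketch using Java's DatagramSocket: the sender just wraps the application bytes in a datagram and hands it to IP, with no connection setup on either side. The class and method names are illustrative, not from the slides.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Minimal sketch of UDP's connectionless model over loopback: no handshake,
// no connection state -- each datagram is handled independently.
public class UdpDemo {
    public static String roundTrip(String msg) throws Exception {
        try (DatagramSocket receiver = new DatagramSocket(0);   // OS picks a free port
             DatagramSocket sender = new DatagramSocket()) {
            byte[] data = msg.getBytes("US-ASCII");
            // No connect() call: just address the datagram and send it.
            sender.send(new DatagramPacket(data, data.length,
                    InetAddress.getLoopbackAddress(), receiver.getLocalPort()));

            byte[] buf = new byte[1500];
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            receiver.receive(in);                               // blocks until the datagram arrives
            return new String(in.getData(), 0, in.getLength(), "US-ASCII");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));                 // prints "hello"
    }
}
```

Note what is missing compared with a TCP client: no accept(), no three-way handshake, and nothing stops the datagram from being lost or reordered.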
• If an error is detected: throw the segment away, OR pass it to the application with a warning

Internet Checksum Example
• Note: when adding numbers, a carryout from the most significant bit needs to be added back into the result
• Example: add two 16-bit integers

              1110011001100110
              1101010101010101
              ----------------
  wraparound  1 1011101110111011
  sum           1011101110111100
  checksum      0100010001000011

Reliable Data Transfer
• See the book for state machines and full explanations
• Case 1: underlying channel completely reliable
– Just have a sender and a receiver.
• Case 2: bit errors (but no packet loss)
– How do you know? The receiver has to acknowledge (ACK) or negatively acknowledge (NAK)
• A NAK makes the sender resend the packet
• NAK based on the checksum (we'll talk about timers later)
– Reliable transfer based on retransmission = ARQ (Automatic Repeat reQuest) protocol
– 3 capabilities: error detection, receiver feedback, retransmission
– Stop-and-wait protocol
• What if the ACK or NAK packet is corrupted?
– Just resend the old packet.
• Duplicate packets: but how does the receiver know it is the same packet and not the next one?
– Add a sequence number!
• Do we really need a NAK?
– No — just ACK the last packet that was received.
• Case 3: bit errors and packet loss
– What if a packet gets lost?
• Set a timer on each packet sent. When the timer runs out, send the packet again.
• We can already handle the duplicate packets this may create.
– How long should the timer be? At least as long as an RTT

Performance of rdt3.0
• rdt3.0 works, but performance stinks
• Example: 1 Gbps link, 15 ms propagation delay, 8000-bit packet:

  d_trans = L/R = 8000 bits / 10^9 bps = 8 microseconds

• U_sender: utilization — fraction of time the sender is busy sending

  U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

• 1 KB packet every 30 msec → 33 kB/sec (264 kbps) throughput over a 1 Gbps link
• The network protocol limits use of the physical resources!
• Pipelining will fix it!
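The one's-complement addition in the checksum example above can be coded directly: add the 16-bit words, fold any carry out of the top bit back in (the "wraparound"), then invert. The class name is mine, and only the two words from the example are used.

```java
// One's-complement 16-bit checksum, matching the worked example above.
public class InternetChecksum {
    public static int checksum(int... words) {
        int sum = 0;
        for (int w : words) {
            sum += w & 0xFFFF;
            if ((sum & 0x10000) != 0)        // carryout from the most significant bit...
                sum = (sum & 0xFFFF) + 1;    // ...is added back to the result
        }
        return ~sum & 0xFFFF;                // one's complement of the sum
    }

    public static void main(String[] args) {
        int cs = checksum(0b1110011001100110, 0b1101010101010101);
        // prints 0100010001000011, the checksum from the example
        System.out.println(String.format("%16s",
                Integer.toBinaryString(cs)).replace(' ', '0'));
    }
}
```

The receiver runs the same sum over the received words plus the checksum field; an all-ones result means no error was detected.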
– Make sure your sequence numbers are large enough
– Buffering packets
• Sender buffers packets that have been transmitted but not yet acknowledged
• Receiver buffers correctly received packets
• Two basic approaches: Go-Back-N and selective repeat

Pipelined protocols
Pipelining: sender allows multiple "in-flight", yet-to-be-acknowledged packets
– range of sequence numbers must be increased
– buffering at sender and/or receiver
• Two generic forms of pipelined protocols: Go-Back-N and selective repeat

Go-Back-N (sliding window)
Sender:
• k-bit seq # in packet header
• "window" of up to N consecutive unACKed packets allowed
• ACK(n): ACKs all packets up to and including seq # n — "cumulative ACK"
– may receive duplicate ACKs (see receiver)
• timer for each in-flight packet
• timeout(n): retransmit packet n and all higher-seq-# packets in the window
Receiver:
• ACK-only: always send an ACK for a correctly received packet with the highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum
• out-of-order packet:
– discard (don't buffer) → no receiver buffering!
– re-ACK the packet with the highest in-order seq #
• Problems with GBN?
– It can spit out a lot of needless packets onto the wire.
• A single error will really do some damage. A wire with lots of errors? Lots of needless duplicate packets

Selective Repeat
• receiver individually acknowledges all correctly received packets
– buffers packets, as needed, for eventual in-order delivery to the upper layer
• sender only resends packets for which an ACK was not received
– sender keeps a timer for each unACKed packet
• sender window
– N consecutive seq #'s
– again limits seq #s of sent, unACKed packets
– The sender and receiver windows will not always coincide!
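The GBN sender rules above (window of N, cumulative ACK advances the base, timeout retransmits everything in flight) can be sketched as a tiny state machine. This is a toy model of my own, tracking only sequence numbers as plain ints; a real protocol wraps them mod 2^k.

```java
import java.util.ArrayList;
import java.util.List;

// Toy Go-Back-N sender state: cumulative ACKs advance the window base,
// and a timeout retransmits every packet still unACKed.
public class GbnSender {
    final int N;          // window size
    int base = 0;         // oldest unACKed seq #
    int nextSeqNum = 0;   // next seq # to use

    GbnSender(int n) { N = n; }

    // Try to send one packet; returns false if the window is full.
    boolean send() {
        if (nextSeqNum >= base + N) return false;
        nextSeqNum++;
        return true;
    }

    // Cumulative ACK(n): everything up to and including n is now ACKed.
    void onAck(int n) { base = Math.max(base, n + 1); }

    // Timeout: resend packet base and all higher seq #s in the window.
    List<Integer> onTimeout() {
        List<Integer> resend = new ArrayList<>();
        for (int s = base; s < nextSeqNum; s++) resend.add(s);
        return resend;
    }
}
```

Note how one lost packet makes onTimeout() resend every later packet too — exactly the "needless duplicate packets" problem the slide points out, and what selective repeat avoids.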
• If your sequence numbers aren't big enough, you won't know which is which

Selective repeat: sender, receiver windows
(see the sender/receiver window figure in the book)

Selective repeat
Sender:
• data from above: if the next available seq # is in the window, send the packet
• timeout(n): resend packet n, restart its timer
• ACK(n) in [sendbase, sendbase+N]:
– mark packet n as received
– if n is the smallest unACKed packet, advance the window base to the next unACKed seq #
Receiver:
• packet n in [rcvbase, rcvbase+N-1]:
– send ACK(n)
– out-of-order: buffer
– in-order: deliver (also deliver buffered in-order packets), advance the window to the next not-yet-received packet
• packet n in [rcvbase-N, rcvbase-1]: ACK(n)
• otherwise: ignore

Pipelining Protocols — Summary
Go-Back-N:
• sender: up to N unACKed packets in pipeline
• receiver: only sends cumulative ACKs
– doesn't ACK a packet if there's a gap
• sender: has a timer for the oldest unACKed packet
– if the timer expires: retransmit all unACKed packets
Selective Repeat:
• sender: up to N unACKed packets in pipeline
• receiver: ACKs individual packets
• sender: maintains a timer for each unACKed packet
– if a timer expires: retransmit only that unACKed packet

Reliable Data Transfer Mechanisms
• See the table on p. 242
– Checksum
– Timer
– Sequence number
– Acknowledgement
– Negative acknowledgement
– Window, pipelining

TCP
• Read 3.5 to the end
• Point to point
– Single sender, single receiver
– Multicasting (4.7) will not work with TCP
• 3-way handshake
– SYN
– SYN-ACK
– ACK
• TCP sets aside a send buffer
– where the application message data gets put
• TCP takes chunks of data from this buffer and sends segments

Vocabulary
• RTT = round-trip time
• MSS = maximum segment size
– Maximum amount of data that TCP can grab and place into a segment
– This is application-layer data — it does not include TCP headers, etc.
• MTU = maximum transmission unit
– The largest link-layer frame that can be sent by the local sending host
– This has a lot of bearing on the MSS
– Common MSS values: 1,460, 536, and 512 bytes

TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set the window size
• send & receive buffers on both sides (application writes into the send buffer; application reads from the receive buffer)
• full-duplex data:
– bi-directional data flow in the same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) initializes sender and receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver

TCP segment structure
• 32-bit rows: source port #, dest port #; sequence number; acknowledgement number; header length, unused bits, flag bits (U A P R S F), receive window; checksum, urgent data pointer; options (variable length); application data (variable length)
• Sequence/ACK numbers count bytes of data (not segments!)
• Receive window: # bytes the receiver is willing to accept
• URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection establishment (setup, teardown commands)
• Internet checksum (as in UDP)

TCP seq. #'s and ACKs
Seq. #'s:
– byte-stream "number" of the first byte in the segment's data
ACKs:
– seq # of the next byte expected from the other side
– cumulative ACK
Q: how does the receiver handle out-of-order segments?
A: the TCP spec doesn't say — it's up to the implementer
• Simple telnet scenario: the user on Host A types 'C'; Host B ACKs receipt of 'C' and echoes back 'C'; Host A ACKs receipt of the echoed 'C'

TCP Round Trip Time and Timeout
Q: how to set the TCP timeout value?
• longer than RTT
– but RTT varies
• too short: premature timeout
– unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; we want the estimated RTT to be "smoother"
– average several recent measurements, not just the current SampleRTT

  EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• Exponential weighted moving average
• influence of a past sample decreases exponentially fast
• typical value: α = 0.125

Example RTT estimation: SampleRTT and EstimatedRTT (RTT in milliseconds vs. time in seconds), gaia.cs.umass.edu to fantasia.eurecom.fr — see the plot in the slides

Setting the timeout
• EstimatedRTT plus a "safety margin"
– large variation in EstimatedRTT → larger safety margin
• first estimate how much SampleRTT deviates from EstimatedRTT:

  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|    (typically, β = 0.25)

Then set the timeout interval:

  TimeoutInterval = EstimatedRTT + 4·DevRTT

TCP reliable data transfer
• TCP creates an rdt service on top of IP's unreliable service
• pipelined segments
• cumulative ACKs
• TCP uses a single retransmission timer
• retransmissions are triggered by:
– timeout events
– duplicate ACKs
• initially consider a simplified TCP sender:
– ignore duplicate ACKs
– ignore flow control, congestion control

TCP sender events
Data received from app:
• create segment with seq #
• seq # is the byte-stream number of the first data byte in the segment
• start timer if not already running (think of the timer as being for the oldest unACKed segment)
• expiration interval: TimeoutInterval
Timeout:
• retransmit the segment that caused the timeout
• restart timer
ACK received:
• if it acknowledges previously unACKed segments
– update what is known to be ACKed
– start timer if there are outstanding segments

NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
  switch(event)

  event: data received
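The two EWMA formulas and the timeout rule above translate almost line-for-line into code. Seeding EstimatedRTT from the first sample is a simplifying assumption of this sketch (real stacks initialize SRTT/RTTVAR per RFC 6298); the class name is mine.

```java
// EWMA RTT estimator from the slides: alpha = 0.125, beta = 0.25,
// TimeoutInterval = EstimatedRTT + 4*DevRTT. Times are in milliseconds.
public class RttEstimator {
    static final double ALPHA = 0.125, BETA = 0.25;
    public double estimatedRtt = -1;   // -1 means "no sample yet"
    public double devRtt = 0;

    public void sample(double sampleRtt) {
        if (estimatedRtt < 0) {        // first sample seeds the estimate (assumption)
            estimatedRtt = sampleRtt;
            return;
        }
        estimatedRtt = (1 - ALPHA) * estimatedRtt + ALPHA * sampleRtt;
        devRtt = (1 - BETA) * devRtt + BETA * Math.abs(sampleRtt - estimatedRtt);
    }

    public double timeoutInterval() { return estimatedRtt + 4 * devRtt; }
}
```

For example, after samples of 100 ms and then 200 ms, EstimatedRTT = 0.875·100 + 0.125·200 = 112.5 ms, DevRTT = 0.25·|200 − 112.5| = 21.875 ms, so TimeoutInterval = 112.5 + 4·21.875 = 200 ms.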
from application above (simplified):
    create TCP segment with sequence number NextSeqNum
    if (timer currently not running)
      start timer
    pass segment to IP
    NextSeqNum = NextSeqNum + length(data)

  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer

  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }
}

Comments (from the slide):
• SendBase - 1: last cumulatively ACKed byte
• Example: SendBase - 1 = 71; y = 73, so the receiver wants 73+; y > SendBase, so new data is ACKed

TCP ACK generation [RFC 1122, RFC 2581]
Event at receiver → TCP receiver action:
• Arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for the next segment; if no next segment, send ACK
• Arrival of in-order segment with expected seq #; one other segment has an ACK pending → immediately send a single cumulative ACK, ACKing both in-order segments
• Arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send a duplicate ACK, indicating the seq # of the next expected byte
• Arrival of a segment that partially or completely fills a gap → immediately send an ACK, provided the segment starts at the lower end of the gap

Fast Retransmit
• timeout period often relatively long:
– long delay before resending a lost packet
• detect lost segments via duplicate ACKs.
– sender often sends many segments back-to-back
– if a segment is lost, there will likely be many duplicate ACKs for that segment
• If the sender receives 3 duplicate ACKs for the same data, it assumes that the segment after the ACKed data was lost:
– fast retransmit: resend the segment before the timer expires

Fast retransmit algorithm:

event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {                       /* a duplicate ACK for an already-ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3)
      resend segment with sequence number y    /* fast retransmit */
  }

Go-Back-N or Selective Repeat?
• The book says it's sort of both. To me it mostly looks like GBN
– Out-of-order segments are not individually ACKed
• However
– Many TCP implementations will buffer out-of-order segments
– TCP will also usually retransmit only a single segment rather than all of them

Flow Control (NOT congestion control)
• TCP creates a receive buffer
– Data is put into the receive buffer once it has been received correctly and in order
– The application reads from the receive buffer
• Sometimes not right away.
• Flow control tries not to overflow this receive buffer
• Each sender maintains a variable called the receive window
– What if the receive window goes to 0?
– In this case, the sending host is required to keep sending segments with 1 data byte
• What happens in UDP when the UDP receive buffer overflows?
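The dup-ACK bookkeeping in the fast retransmit algorithm above can be sketched as a small class (names are mine): a new ACK advances SendBase and resets the counter, while the third duplicate ACK for the same byte triggers retransmission before the timer fires.

```java
// Sketch of fast-retransmit bookkeeping: returns the seq # to resend
// immediately, or -1 if no fast retransmit is triggered.
public class FastRetransmit {
    public int sendBase = 0;
    public int dupAcks = 0;

    public int onAck(int y) {
        if (y > sendBase) {          // new data ACKed
            sendBase = y;
            dupAcks = 0;
            return -1;
        }
        dupAcks++;                   // duplicate ACK for already-ACKed data
        if (dupAcks == 3) return y;  // third dup ACK -> fast retransmit
        return -1;
    }
}
```

A real sender would also halve cwnd here (see the congestion control section below); this sketch isolates just the loss-detection logic.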
TCP Flow Control
• flow control: the sender won't overflow the receiver's buffer by transmitting too much, too fast
• the receive side of a TCP connection has a receive buffer: IP datagrams arrive into it (TCP data in buffer, plus currently unused buffer space), and the application process reads from it
• speed-matching service: matching the send rate to the receiving application's drain rate
– the app process may be slow at reading from the buffer

TCP Connection Management
Recall: TCP sender and receiver establish a "connection" before exchanging data segments
• initialize TCP variables:
– seq. #s
– buffers, flow control info (e.g. RcvWindow)
• client: connection initiator
  Socket clientSocket = new Socket("hostname", "port number");
• server: contacted by client
  Socket connectionSocket = welcomeSocket.accept();

Three-way handshake:
Step 1: client host sends TCP SYN segment to server
– specifies initial seq #
– no data
Step 2: server host receives SYN, replies with SYNACK segment
– server allocates buffers
– specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment, which may contain data

Closing a connection:
Step 1: client closes socket (clientSocket.close();) — the client end system sends a TCP FIN control segment to the server
Step 2: server receives FIN, replies with ACK. Closes the connection, sends its own FIN.
Step 3: client receives FIN, replies with ACK
– enters "timed wait" — will respond with ACK to received FINs
Step 4: server receives ACK. Connection closed.
Note: with a small modification, this can handle simultaneous FINs.
TCP Connection Management (cont.)
• TCP server lifecycle and TCP client lifecycle (state diagrams in the slides)

SYN Flood Attack
• Bad guy sends a bunch of TCP SYN segments
• The server opens up buffers for each of these segments
• Resources all become allocated to half-open TCP connections
– This is called a SYN flood attack
• SYN cookies
– The server allocates resources only upon receipt of the ACK (third part of the handshake) segment, rather than the SYN segment
– It knows the ACK is legitimate because the sequence number the server sent out was a special number (a complex hash function of the source and destination IP and port plus the server's secret number)
– p. 269

How does nmap work?
• To find out what is on a port, nmap sends a TCP SYN segment to that port
– If the port responds with a SYNACK, it labels the port open
– If the response is a TCP RST segment, it means the port is not blocked, but it is closed
– If there is no response, the port is blocked by a firewall

Congestion Control Principles
• Typical cause of congestion:
– Overflowing of router buffers as the network becomes congested

Scenario 1
• Scenario 1: two senders, a router with infinite buffers
– Always maximizing throughput looks good with throughput alone
(left plot) — but maximizing throughput is bad when looking at delays (right plot)

Scenario 2
• Scenario 2: routers with finite buffers & retransmission
– If you are constantly resending packets, throughput is even lower, since a % of the packets are retransmissions
– With these large delays, you may send needless retransmissions, wasting precious router resources

Scenario 3
• Scenario 3: multiple hops
– When a packet is dropped along a path, the transmission capacity used at each of the upstream links ends up being wasted

Approaches towards congestion control
Two broad approaches:
End-to-end congestion control:
• no explicit feedback from the network
• congestion inferred from end-system observed loss and delay
• approach taken by TCP
Network-assisted congestion control:
• routers provide feedback to end systems
– a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– or an explicit rate the sender should send at

TCP congestion control
Goal: the TCP sender should transmit as fast as possible, but without congesting the network
Q: how to find the rate just below the congestion level?
Decentralized: each TCP sender sets its own rate, based on implicit feedback:
• ACK: segment received (a good thing!), network not congested, so increase the sending rate
• lost segment: assume loss is due to a congested network, so decrease the sending rate

TCP congestion control: bandwidth probing
• "probing for bandwidth": increase the transmission rate on receipt of ACKs until loss eventually occurs, then decrease the transmission rate
• continue to increase on ACK, decrease on loss (since available bandwidth changes, depending on other connections in the network)
• the sending rate over time shows TCP's "sawtooth" behavior: the rate climbs while ACKs are received and drops at each loss
Q: how fast to increase/decrease?
(details to follow)

TCP Congestion Control: details
• sender limits its rate by limiting the number of unACKed bytes "in the pipeline":

  LastByteSent - LastByteAcked ≤ cwnd

– cwnd differs from rwnd (how, why?)
– sender limited by min(cwnd, rwnd)
• roughly:

  rate ≈ cwnd / RTT  bytes/sec

• cwnd is dynamic — a function of perceived network congestion

TCP Congestion Control: more details
Segment loss event: reduce cwnd
• timeout: no response from receiver
– cut cwnd to 1
• 3 duplicate ACKs: at least some segments are getting through (recall fast retransmit)
– cut cwnd in half — less aggressive than on a timeout
ACK received: increase cwnd
• slow-start phase: increase exponentially fast (despite the name) at connection start, or following a timeout
• congestion avoidance: increase linearly

TCP Slow Start
• when the connection begins, cwnd = 1 MSS
– example: MSS = 500 bytes & RTT = 200 msec
– initial rate = 20 kbps
• available bandwidth may be >> MSS/RTT
– desirable to quickly ramp up to a respectable rate
• increase the rate exponentially until the first loss event or until a threshold is reached
– double cwnd every RTT
– done by incrementing cwnd by 1 MSS for every ACK received

Transitioning into/out of slow start
• ssthresh: cwnd threshold maintained by TCP
• on a loss event: set ssthresh to cwnd/2
– remember (half of) the TCP rate when congestion last occurred
• when cwnd >= ssthresh: transition from slow start to the congestion avoidance phase
FSM fragment (slow start state; initially cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• cwnd > ssthresh: move to congestion avoidance

TCP: congestion avoidance
• when cwnd > ssthresh, grow cwnd linearly
– increase cwnd by 1 MSS per
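The slow-start and congestion-avoidance rules above can be sketched per RTT. This toy model (my own names; cwnd measured in MSS units, not bytes) doubles cwnd each RTT below ssthresh, grows it by 1 MSS per RTT above, and applies the two loss reactions from the summary.

```java
// Per-RTT sketch of TCP's cwnd rules: slow start below ssthresh,
// congestion avoidance above it. Units are MSS, not bytes.
public class CongestionWindow {
    public double cwnd = 1;       // start at 1 MSS
    public double ssthresh = 64;  // illustrative initial threshold

    // One RTT's worth of ACKs arrived with no loss.
    public void onRttOfAcks() {
        if (cwnd < ssthresh) cwnd *= 2;   // slow start: doubles each RTT
        else cwnd += 1;                   // congestion avoidance: +1 MSS per RTT
    }

    // Timeout: remember half the rate, restart from 1 MSS.
    public void onTimeout() {
        ssthresh = cwnd / 2;
        cwnd = 1;
    }

    // Triple duplicate ACK: halve cwnd (Reno-style), less drastic than a timeout.
    public void onTripleDupAck() {
        ssthresh = cwnd / 2;
        cwnd = ssthresh;
    }
}
```

Running onRttOfAcks() repeatedly from cwnd = 1 with ssthresh = 8 gives 2, 4, 8, 9, 10, … — the exponential-then-linear shape of the sawtooth.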
RTT
– approach possible congestion more slowly than in slow start
– implementation: cwnd = cwnd + MSS·(MSS/cwnd) for each ACK received

AIMD
• ACKs: increase cwnd by 1 MSS per RTT: additive increase
• loss: cut cwnd in half (for loss detected other than by timeout): multiplicative decrease
• AIMD: Additive Increase, Multiplicative Decrease

Popular "flavors" of TCP
• cwnd (in segments) vs. transmission round for TCP Tahoe and TCP Reno, with ssthresh marked — see the figure in the slides

Summary: TCP Congestion Control
• when cwnd < ssthresh, the sender is in the slow-start phase; the window grows exponentially.
• when cwnd >= ssthresh, the sender is in the congestion-avoidance phase; the window grows linearly.
• when a triple duplicate ACK occurs, ssthresh is set to cwnd/2 and cwnd is set to ~ssthresh.
• when a timeout occurs, ssthresh is set to cwnd/2 and cwnd is set to 1 MSS.

Fairness (more)
Fairness and UDP
• multimedia apps often do not use TCP
– they do not want their rate throttled by congestion control
• instead they use UDP:
– pump audio/video at a constant rate, tolerate packet loss
Fairness and parallel TCP connections
• nothing prevents an app from opening parallel connections between 2 hosts
• web browsers do this
• example: a link of rate R supporting 9 connections:
– a new app asks for 1 TCP connection, gets rate R/10
– a new app asks for 11 TCP connections, gets R/2!