User Datagram Protocol (UDP)
• Thin wrapper around IP services
• Service model
  – Unreliable, unordered datagram service
  – Addresses multiplexing of multiple connections

UDP Header Format
(Figure: header layout with 16-bit Source Port, Destination Port, UDP Length and UDP Checksum fields)
• Length includes the 8-byte header and the data
• Multiplexing
  – 16-bit port numbers (some are "well-known")
• Checksum
  – Validates the header
  – Optional in IPv4, mandatory in IPv6
  – Uses the IP checksum algorithm
  – Computed over the header, the data and a "pseudo header": Source IP Address, Destination IP Address, a zero byte, protocol number 17 (UDP), and UDP Length

Transmission Control Protocol (TCP)
• Guaranteed delivery
  – Messages delivered in the order they were sent
  – Messages delivered at most once
• No limit on message size
• Synchronization between sender and receiver
• Multiple connections per host
• Flow control

TCP
• Connection oriented
  – Explicit setup and teardown required
• Byte stream abstraction
  – No boundaries in the data
  – App writes bytes, TCP sends segments, app receives bytes
• Full duplex
  – Data flows in both directions simultaneously
  – Point-to-point connection
• Implements congestion control
  – Flow control: receiver controls sender rate
  – Congestion control: network indirectly controls sender rate

TCP vs. Direct Link
• Explicit connection setup required
• RTT varies, depending on destination and network conditions
  => Requires an adaptive approach to retransmission
• Packets may be
  – Delayed
  – Reordered
  – Late

TCP vs. Direct Link
• Peer capabilities vary
  – Minimum link speed on the route
  – Buffering capacity at the destination
  => Requires an adaptive approach to window sizes
• Network capacity varies
  – Other traffic competes for most links
  => Requires a global congestion control strategy

TCP: Connection Stages
1. Connection setup
  – 3-way handshake
2. Data transport: sender writes data, and TCP…
  – Breaks data into segments
  – Sends segments in IP packets
  – Retransmits, reorders and removes duplicates as necessary
  – Delivers data to the receiver
3. Teardown
  – 4-step exchange

TCP Segment Header Format
(Figure: header layout with Source Port, Destination Port, Sequence Number, ACK Sequence Number, Header Length, Flags, Advertised Window, TCP Checksum, Urgent Pointer and Options fields)
• 16-bit source and destination ports
• 32-bit send and ACK sequence numbers
• 4-bit header length (unit = 32 bits)
  – Minimum 5 (20 bytes)
  – Used as the offset to the first data byte
• 6 × 1-bit flags
  – URG: segment contains urgent data
  – ACK: ACK sequence number is valid
  – PSH: do not delay delivery of data
  – RST: reset connection (reject or abnormal termination)
  – SYN: synchronize segment for setup
  – FIN: final segment for teardown
• Pseudo header (prepended for the checksum): Source IP Address, Destination IP Address, a zero byte, protocol number 6 (TCP), and TCP Segment Length
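Both the UDP and the TCP checksums apply the IP checksum algorithm to the segment prefixed by the pseudo header described above. As an illustration, the Python sketch below applies it to a UDP segment; the function names, addresses and ports are purely illustrative and not taken from any particular stack.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum over 16-bit words (the IP checksum algorithm)."""
    if len(data) % 2:                 # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over the pseudo header (src IP, dst IP, zero byte, protocol 17,
    UDP length) followed by the UDP header and data, as described on the slide."""
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    return internet_checksum(pseudo + udp_segment)

# Hypothetical example: UDP header with the checksum field set to 0, plus 5 bytes of data
payload = b"hello"
header = struct.pack("!HHHH", 12345, 53, 8 + len(payload), 0)   # ports, length, checksum=0
print(hex(udp_checksum(b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02", header + payload)))
```

The TCP checksum is computed the same way, with protocol number 6 and the TCP segment length in the pseudo header.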
TCP Segment Header (cont.)
• 16-bit advertised window
  – Space remaining in the receive window
• 16-bit checksum
  – Uses the IP checksum algorithm
  – Computed over the header, the data and the pseudo header
• 16-bit urgent data pointer
  – Valid only if URG = 1
  – Index of the last byte of urgent data in the segment

TCP Options
• Negotiate maximum segment size (MSS)
  – Each host suggests a value
  – The minimum of the two values is chosen
  – Prevents IP fragmentation over the first and last hops
• Packet timestamp
  – Allows RTT calculation for retransmitted packets
  – Extends the sequence number space for identification of stray packets
• Negotiate an advertised window scaling factor
  – Allows larger windows: 64 KB is too small for routes with large bandwidth-delay products

TCP: Data Transport
• Data is broken into segments
  – Limited by the maximum segment size (MSS)
  – Negotiable during connection setup
  – Typically set to the MTU of the directly connected network minus the size of the TCP and IP headers
• Three events cause a segment to be sent
  – At least MSS bytes of data are ready to be sent
  – Explicit PUSH operation by the application
  – Periodic timeout

TCP Byte Stream
(Figure: the sending application writes bytes into the TCP send buffer; TCP transmits segments; the receiving TCP places them in the receive buffer, from which the receiving application reads bytes)

TCP SNs and ACKs
• Sequence numbers
  – Count bytes, not packets
  – The initial SN is chosen to avoid insertion of segments from old connections
• ACKs
  – Carry the SN of the next byte expected from the other side
  – Cumulative ACK
• Unlike pure GBN, the TCP spec doesn't say what to do with premature (out-of-order) segments; it is up to the implementation
(Figure: simple telnet scenario. Host A's user types 'C'; Host B ACKs receipt of 'C' and echoes it back; Host A ACKs receipt of the echoed 'C'; time flows down)

TCP ACK Rules
• In-order segment arrival, no gaps, everything else already ACKed → delayed ACK: wait up to 500 ms for the next segment; if none arrives, send the ACK
• In-order segment arrival, no gaps, one delayed ACK pending → immediately send a single cumulative ACK
• Out-of-order segment arrival, higher-than-expected seq. #, gap detected → send a duplicate ACK indicating the seq. # of the next expected byte
• Arrival of a segment that partially or completely fills a gap → immediate ACK if the segment starts at the lower end of the gap

TCP: Retransmission Scenarios
(Figures: lost ACK scenario, and premature timeout with cumulative ACKs. Host A retransmits after a timeout when a segment or its ACK is lost, or when the timeout fires too early)

TCP: Retransmission and Timeouts
(Figure: round-trip time (RTT), retransmission timeout (RTO) and guard band around the Data/ACK exchange between Hosts A and B)
• TCP uses an adaptive retransmission timeout value
• The network is dynamic (congestion, changes in routing), so the RTT cannot be treated as static
• The RTO value is important
  – Too big: wait too long to retransmit a packet
  – Too small: unnecessarily retransmit packets

TCP: Retransmission and Timeouts (original algorithm)
1. EstimatedRTT = α · EstimatedRTT + (1 - α) · SampleRTT
2. RTO = 2 · EstimatedRTT
• Characteristics of the original algorithm
  – The standard deviation is implicitly assumed to be bounded by the RTT
  – But if utilization = 75%, there could be a factor of 16 between "typical" (mean ± 2 std. dev.) short and long RTTs

TCP: Retransmission and Timeouts (Jacobson/Karels alg.)
Newer algorithm also estimates the standard deviation of the RTT:
1. Diff = SampleRTT - EstimatedRTT
2. EstimatedRTT = EstimatedRTT + δ · Diff (for some 0 < δ < 1)
3. Deviation = Deviation + δ · (|Diff| - Deviation)
4. RTO = μ · EstimatedRTT + φ · Deviation, with μ ≈ 1 and φ ≈ 4
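To make the Jacobson/Karels update concrete, here is a minimal Python sketch of the estimator as written on the slide. The gain δ = 0.125 and the weights μ ≈ 1, φ ≈ 4 follow the slide; the function structure and the sample values are illustrative assumptions.

```python
def make_rto_estimator(delta: float = 0.125, mu: float = 1.0, phi: float = 4.0):
    """Jacobson/Karels RTO estimation as written on the slide.
    delta is the gain (0 < delta < 1); mu and phi weight the two terms of the RTO."""
    estimated_rtt = None
    deviation = 0.0

    def update(sample_rtt: float) -> float:
        nonlocal estimated_rtt, deviation
        if estimated_rtt is None:          # the first sample just initializes the estimate
            estimated_rtt = sample_rtt
        diff = sample_rtt - estimated_rtt
        estimated_rtt += delta * diff
        deviation += delta * (abs(diff) - deviation)
        return mu * estimated_rtt + phi * deviation     # RTO

    return update

# Example: feed in a few RTT samples (in seconds) and watch the RTO adapt
rto = make_rto_estimator()
for sample in (0.100, 0.110, 0.300, 0.105):
    print(f"sample = {sample:.3f} s   RTO = {rto(sample):.3f} s")
```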
TCP: Retransmission and Timeouts (Karn's Alg.)
(Figure: a retransmission makes the RTT sample ambiguous; measuring from either the original transmission or the retransmission gives a wrong RTT sample)
• Problem: how do we estimate the RTT of retransmitted packets?
• Solution: don't! Ignore RTT samples from retransmitted segments
• Also: double the RTO on retransmission

TCP Sliding Window Protocol – Sender Side
• LastByteAcked <= LastByteSent
• LastByteSent <= LastByteWritten
• Buffer bytes between LastByteAcked and LastByteWritten
(Figure: send buffer of maximum size; the advertised window spans from the first unacknowledged byte to the last byte sent, with data available but outside the window beyond it)

TCP Sliding Window Protocol – Receiver Side
• LastByteRead < NextByteExpected
• NextByteExpected <= LastByteRcvd + 1
• Buffer bytes between NextByteRead and LastByteRcvd
(Figure: receive buffer of maximum size with the advertised window, buffered out-of-order data, the next byte expected (the ACK value) and the next byte to be read by the application)

TCP Flow Control
• Receiving side
  – Receive buffer size = MaxRcvBuffer
  – LastByteRcvd - LastByteRead <= MaxRcvBuffer
  – AdvertisedWindow = MaxRcvBuffer - (NextByteExpected - NextByteRead)
  – Shrinks as data arrives and grows as the application consumes data
• Sending side
  – Send buffer size = MaxSendBuffer
  – LastByteSent - LastByteAcked <= AdvertisedWindow
  – EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
  – EffectiveWindow > 0 is required to send data
  – LastByteWritten - LastByteAcked <= MaxSendBuffer
  – Block the sender if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer, where y is the number of bytes the application is trying to write

TCP Flow Control
• Problem: slow receiver application
  – Advertised window goes to 0
  – Sender cannot send more data
  – Receiver may not spontaneously generate an update, or the update may be lost
  – Sender gets stuck
• Solution
  – Sender periodically sends a 1-byte segment, ignoring the advertised window of 0
  – Eventually the window opens
  – Sender learns of the opening from the next ACK of the 1-byte segment

TCP Flow Control
• Problem: application delivers tiny pieces of data to TCP
  – Example: telnet in character mode
  – Each piece is sent as a segment and returned as an ACK
  – Very inefficient
• Solution
  – Delay transmission to accumulate more data
  – Nagle's algorithm
    • Send the first piece of data
    • Accumulate data until the first piece is ACK'ed
    • Send the accumulated data and restart accumulation
  – Not ideal for some traffic (e.g. mouse motion)

TCP Flow Control
• Problem: slow application reads data in tiny pieces
  – Receiver advertises a tiny window
  – Sender fills the tiny window
  – Known as silly window syndrome
• Solution
  – Receiver advertises a window opening only when an MSS or ½ of the buffer is available
  – Sender delays sending until the window is an MSS or ½ of the receiver's buffer (estimated)

TCP Bit Allocation Limitations
• Sequence numbers vs. packet lifetime
  – IP packets are assumed to live less than 60 seconds
  – Can we send 2^32 bytes in 60 seconds?
  – That is approx. 573 Mbps: less than an STS-12 line
• Advertised window vs. delay-bandwidth product
  – Only 16 bits for the advertised window
  – Coast-to-coast RTT = 100 ms
  – Adequate for only 5.24 Mbps!

TCP Sequence Numbers – 32-bit Wrap-Around Times
• T1 (1.5 Mbps): 6.4 hours
• Ethernet (10 Mbps): 57 minutes
• T3 (45 Mbps): 13 minutes
• FDDI (100 Mbps): 6 minutes
• STS-3 (155 Mbps): 4 minutes
• STS-12 (622 Mbps): 55 seconds
• STS-24 (1.2 Gbps): 28 seconds
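The sliding-window and flow-control bookkeeping above translates directly into a few lines of arithmetic. In the sketch below the pointer names (LastByteAcked, NextByteExpected, and so on) mirror the slides, while the classes and the example numbers are only illustrative.

```python
class ReceiveSide:
    """Receive-side window bookkeeping from the flow-control slide."""
    def __init__(self, max_rcv_buffer: int):
        self.max_rcv_buffer = max_rcv_buffer
        self.next_byte_expected = 0   # next in-order byte we are waiting for (ACK value)
        self.next_byte_read = 0       # next byte the application will read

    def advertised_window(self) -> int:
        # AdvertisedWindow = MaxRcvBuffer - (NextByteExpected - NextByteRead)
        return self.max_rcv_buffer - (self.next_byte_expected - self.next_byte_read)

class SendSide:
    """Send-side window bookkeeping from the flow-control slide."""
    def __init__(self):
        self.last_byte_acked = 0
        self.last_byte_sent = 0

    def effective_window(self, advertised_window: int) -> int:
        # EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
        return advertised_window - (self.last_byte_sent - self.last_byte_acked)

# Example: receiver holds 3000 unread bytes of a 16 KB buffer, sender has 2000 bytes in flight
rcv = ReceiveSide(max_rcv_buffer=16_384)
rcv.next_byte_expected, rcv.next_byte_read = 5000, 2000
snd = SendSide()
snd.last_byte_acked, snd.last_byte_sent = 5000, 7000
adv = rcv.advertised_window()          # 16384 - 3000 = 13384
print(adv, snd.effective_window(adv))  # the sender may put 11384 more bytes in flight
```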
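The bit-allocation numbers above (wrap-around times and the 16-bit window limit) can be reproduced with a few lines of arithmetic. This sketch simply recomputes the table's figures from the link rates; the link names and rates are taken from the table.

```python
SEQ_SPACE = 2 ** 32          # bytes before the 32-bit sequence number wraps around

links_mbps = {"T1": 1.5, "Ethernet": 10, "T3": 45, "FDDI": 100,
              "STS-3": 155, "STS-12": 622, "STS-24": 1200}

for name, mbps in links_mbps.items():
    seconds = SEQ_SPACE * 8 / (mbps * 1e6)   # bytes -> bits, divided by bits per second
    print(f"{name:8s} {mbps:7.1f} Mbps  wraps in {seconds / 60:6.1f} minutes")

# Sanity checks from the slide: 2^32 bytes in 60 seconds is about 573 Mbps,
# and a 16-bit window over a 100 ms RTT limits throughput to about 5.24 Mbps.
print(SEQ_SPACE * 8 / 60 / 1e6)          # ~572.7 Mbps
print(2 ** 16 * 8 / 0.100 / 1e6)         # ~5.24 Mbps
```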
TCP Connection Establishment: 3-Way Handshake
• Exchange initial sequence numbers (j, k)
• Message types
  – Synchronize (SYN)
  – Acknowledge (ACK): cumulative!
• Passive open
  – Server listens for a connection from the client
• Active open
  – Client initiates the connection to the server
(Figure: client and server exchange; the server listens, and time flows down)

TCP: Connection Termination
• Message types
  – Finished (FIN)
  – Acknowledge (ACK)
• Active close
  – Sends no more data
• Passive close
  – Accepts no more data
• The connection can be half closed (one-way)
(Figure: client and server FIN/ACK exchange; time flows down)

TCP State Descriptions
• CLOSED: disconnected
• LISTEN: waiting for an incoming connection
• SYN_RCVD: connection request received
• SYN_SENT: connection request sent
• ESTABLISHED: connection ready for data transport
• CLOSE_WAIT: connection closed by peer
• LAST_ACK: connection closed by peer, closed locally, awaiting ACK
• FIN_WAIT_1: connection closed locally
• FIN_WAIT_2: connection closed locally and ACK'd
• CLOSING: connection closed by both sides simultaneously
• TIME_WAIT: wait for the network to discard related packets

TCP State Transition Diagram
(Figure, repeated across several slides with different paths highlighted: the states CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, CLOSE_WAIT and LAST_ACK, connected by transitions such as active open/SYN, passive open, SYN/SYN+ACK, SYN+ACK/ACK, Close/FIN, FIN/ACK, FIN+ACK/ACK, ACK, and the TIME_WAIT timeout back to CLOSED)
• Questions
  – State transitions
    • Describe the path taken by a server under normal conditions
    • Describe the path taken by a client under normal conditions
    • Describe the path taken assuming the client closes the connection first
  – TIME_WAIT state
    • What purpose does this state serve?
    • Prove that at least one side of a connection enters this state
    • Explain how both sides might enter this state

Congestion Control & Avoidance
(Figure: hosts H1 at 10 Mb/s, and H2 and H3 at 100 Mb/s, feed router R1 whose outgoing link is 1.5 Mb/s; cumulative arrivals A1(t) and A2(t), departures D(t) and queue backlog X(t) are plotted over time)

TCP Congestion Control
• Basic idea: control the rate through the window size
  – Average rate ≤ window / RTT
  – Crude
• Add the notion of a congestion window
  – The effective window is the minimum of the advertised window (flow control) and the congestion window (congestion control)
• Ideal steady state: self-clocking
(Figure: source and destination; ACKs returning at the bottleneck rate clock out new segments)

Slow Start
• Start-up phase: quickly find the correct rate
• Objective: determine the available capacity
• Idea
  – Begin with cwnd = 1 packet
  – Increment cwnd by 1 packet for each ACK
  – Meaning: cwnd doubles every RTT!
• Steady state ("congestion avoidance"): gently try to increase the rate, back off quickly when congestion is detected
• The phases are determined by the value of the variable ssthresh

Slow Start Implementation
• When starting, or restarting after a timeout, cwnd = 1
• On each ACK for a new segment, cwnd += segSize

Slow Start Trace
(Figure: each dot is a 512 B packet sent; the y-axis is the sequence number, the x-axis is time, and the straight line is 20 KBps of available bandwidth)
• Without slow start: ~7 KBps; with slow start: ~19 KBps
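The slow-start rule on the implementation slide (cwnd = 1 on start or timeout, cwnd += segSize per new ACK) is sketched below, with a tiny loop showing why per-ACK increments double the window every RTT. cwnd is kept in bytes and the 512 B segment size matches the trace; the loop itself is only an illustration.

```python
SEG_SIZE = 512                         # bytes, matching the 512 B packets in the trace

cwnd = SEG_SIZE                        # starting (or restarting after a timeout): 1 segment

# Each ACK for a new segment opens the window by one segment (cwnd += segSize),
# so one RTT's worth of ACKs doubles cwnd. Simulate a few round trips:
for rtt in range(1, 5):
    acks_this_rtt = cwnd // SEG_SIZE   # one ACK per segment sent during this RTT
    for _ in range(acks_this_rtt):
        cwnd += SEG_SIZE
    print(f"after RTT {rtt}: cwnd = {cwnd // SEG_SIZE} segments")
# prints 2, 4, 8, 16 segments: exponential growth until ssthresh or a loss
```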
Congestion Is Good?
• Empty buffers => low delay, low utilization
• Full buffers => good utilization, but high delay and potential loss
• Real question: how much congestion is too much?

Host Solutions
• Q: How does the source determine whether or not the network is congested?
• A: A timeout signals packet loss
  – Packet loss is rarely due to transmission errors (on wired networks)
  – A lost packet therefore implies congestion!

Congestion Avoidance
• Control vs. avoidance
  – Control: minimize the impact of congestion when it occurs
  – Avoidance: avoid producing congestion in the first place
• In terms of the operating point
(Figure: power vs. load; the idealized power curve peaks at the optimal load, with "avoidance" operating below the knee and "control" near the cliff)

How to Get to Steady State?
• If the link is overused => packet loss => decrease the rate
• Why increase at all?
  – Must probe all the time in order not to leave "dead" bandwidth; the only indication is dropped packets
• Slow start: multiplicative increase
• Timeout: decrease to 1!
• Symmetric multiplicative increase and decrease: strong oscillation, poor throughput. The "rush-hour effect."

Rush Hour Effect
• Easy to drive the network into saturation, but difficult for the network to recover
• Analogy to rush-hour traffic
(Figure: cumulative arrivals and departures at the queue, and the resulting queue size over time)

Additive Increase / Multiplicative Decrease
• Algorithm
  – Increment cwnd by one packet per RTT (linear increase)
  – Divide CongestionWindow by two whenever a timeout occurs (multiplicative decrease)
• AIMD: additive increase, multiplicative decrease
  – Increase the window by 1 per RTT
  – Decrease the window by a factor of 2 on a loss event

Why AIMD?
• Fairness goal: if N TCP sessions share the same bottleneck link, each should get 1/N of the link capacity
• Model: two sessions compete for bandwidth R at a bottleneck router
(Figure: connection 1 throughput vs. connection 2 throughput; the full utilization line separates underutilized from overutilized regions, each further split into regions unfair to one connection; the desired region is around the fair, fully utilized point)
• Model assumptions
  – Sessions know if the link is overused (losses)
  – Sessions don't know their relative rates
  – Simplification: sessions respond simultaneously and in the same direction (both increase or both decrease)

AIMD Convergence
• Additive increase: up at a 45° angle (both connections add 1)
• Multiplicative decrease: down toward the origin
(Figure: the trajectory alternates between the 45° increase and the scaling decrease, converging to the fair point on the full utilization line)

Congestion Avoidance: Typical Trace
• Trace: sawtooth behavior
(Figure: congestion window in KB, roughly 10 to 70 KB, over 1.0 to 10.0 seconds)

TCP Congestion Avoidance
• When a new segment is ACKed, the sender does the following:
  – If (cwnd < ssthresh) cwnd += segSize
  – else cwnd += segSize × (segSize / cwnd)
  – (What happens when an ACK arrives for x new segments?)
• On timeout:
  – ssthresh := cwnd / 2
  – cwnd := 1 (i.e., slow start)

Fast Retransmit and Fast Recovery
• Problem: crude TCP timeouts lead to idle periods, and slow start is not fast
• Fast retransmit: use duplicate ACKs to trigger retransmission
• Fast recovery: skip slow start, go directly to half the last successful cwnd (called ssthresh)
(Figure: the sender transmits packets 1 to 6; packet 3 is lost, so packets 4, 5 and 6 each elicit a duplicate ACK 2; after packet 3 is retransmitted, the receiver returns a cumulative ACK 6)

TCP Congestion Control: Summary
• Maintain a threshold window size ("last good estimate")
• Threshold value
  – Initially set to the maximum window size
  – Set to 1/2 of the current window on a timeout or 3 duplicate ACKs
• The congestion window drops to 1 on a timeout, and drops by half on 3 duplicate ACKs
• When the congestion window is smaller than the threshold
  – Double the window for each window ACK'd (multiplicative increase)
• When the congestion window is larger than the threshold
  – Increase the window by one MSS for each window ACK'd
• Try to avoid timeouts by using fast retransmit
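The per-ACK and timeout rules from the congestion-avoidance and summary slides above translate almost line for line into code. In this sketch cwnd and ssthresh are kept in bytes; the congestion-avoidance increment is written as segSize × (segSize / cwnd), the usual per-ACK form of "one segment per RTT", and the handler names are illustrative.

```python
def on_new_ack(state: dict, seg_size: int) -> None:
    """Window update when a new segment is ACKed (congestion-avoidance slide)."""
    if state["cwnd"] < state["ssthresh"]:
        state["cwnd"] += seg_size                               # slow start: exponential
    else:
        state["cwnd"] += seg_size * seg_size // state["cwnd"]   # ~ one segment per RTT

def on_timeout(state: dict, seg_size: int) -> None:
    """Timeout: remember half the window as the threshold, then slow start again."""
    state["ssthresh"] = state["cwnd"] // 2
    state["cwnd"] = seg_size

def on_triple_dup_ack(state: dict) -> None:
    """Fast recovery (summary slide): halve the window instead of collapsing to one segment."""
    state["ssthresh"] = state["cwnd"] // 2
    state["cwnd"] = state["ssthresh"]

# Example: start in slow start with a threshold of 8 segments
state = {"cwnd": 512, "ssthresh": 8 * 512}
for _ in range(20):
    on_new_ack(state, 512)
print(state)           # grows quickly up to ssthresh, then roughly linearly
on_timeout(state, 512)
print(state)           # back to one segment, threshold halved
```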
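The AIMD convergence argument (increase along the 45° line, decrease toward the origin) can also be checked numerically. The toy simulation below assumes two synchronized flows sharing a link of capacity R, as in the model above, and shows their rates drawing together; it is only an illustration of the phase-plane picture, not a model of real TCP.

```python
def aimd_step(x1: float, x2: float, capacity: float) -> tuple[float, float]:
    """One synchronized AIMD round: additive increase, multiplicative decrease on overload."""
    x1, x2 = x1 + 1, x2 + 1                 # both connections add 1 per RTT
    if x1 + x2 > capacity:                  # shared loss signal at the bottleneck
        x1, x2 = x1 / 2, x2 / 2             # both halve
    return x1, x2

R = 100.0
x1, x2 = 10.0, 80.0                         # start far from the fair share
for _ in range(200):
    x1, x2 = aimd_step(x1, x2, R)
print(round(x1, 1), round(x2, 1), "difference:", round(abs(x1 - x2), 3))
# the two rates become nearly equal: each loss/decrease cycle halves their difference
```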
TCP Congestion Window Trace
(Figure: TCP Reno congestion window over time, showing slow-start periods, additive increase up to the threshold, fast retransmissions and timeouts, with the window oscillating between W/2 and W)

TCP Dynamics: Rate
• Sending rate: CongWin · MSS / RTT
• Assume a fixed RTT
• Actual sending rate: between (1/2) · W · MSS / RTT and W · MSS / RTT
• Average: (3/4) · W · MSS / RTT

TCP Dynamics: Loss
• Loss rate (TCP Reno): consider one cycle of the sawtooth
(Figure: the window grows linearly from W/2 to W, then is halved)
• Total packets sent per cycle: about (3/8) · W² = O(W²)
• One packet loss per cycle
• Loss probability: p = O(1/W²), or W = O(1/√p)

Congestion Avoidance
• TCP's strategy: increase the load until congestion occurs, then back off
• Alternative strategy
  – Predict when congestion is about to happen and reduce the rate just before packets start being discarded
• Two possibilities
  – Some help from the network: DECbit, RED
  – Host-centric: TCP Vegas
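The loss-rate analysis above leads to the familiar inverse-square-root relation between loss probability and throughput. The sketch below simply evaluates the cycle model from the slide (the window sawtooths between W/2 and W, with one loss per cycle) and compares it with the resulting closed form √(3/2) · MSS / (RTT · √p); the specific W, MSS and RTT values are arbitrary examples.

```python
import math

def reno_cycle(W: int, mss: int, rtt: float) -> tuple[float, float]:
    """One steady-state cycle: the window climbs from W/2 to W, one packet lost per cycle."""
    packets_per_cycle = 3 * W * W / 8        # ~ (3/8) W^2 packets per cycle
    loss_rate = 1 / packets_per_cycle        # p = 8 / (3 W^2)  =>  W = O(1/sqrt(p))
    avg_rate = 0.75 * W * mss / rtt          # average of the W/2 and W windows, bytes/second
    return loss_rate, avg_rate

p, rate = reno_cycle(W=32, mss=1460, rtt=0.100)
print(f"loss rate ~ {p:.5f}   average rate ~ {rate * 8 / 1e6:.2f} Mbps")

# The same throughput from the closed form sqrt(3/2) * MSS / (RTT * sqrt(p))
print(f"{math.sqrt(1.5) * 1460 / (0.100 * math.sqrt(p)) * 8 / 1e6:.2f} Mbps")
```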