Computer Networks: An Open Source Approach
Chapter 5: Transport Layer

Content
- 5.1 General Issues
- 5.2 UDP: Unreliable Connectionless Transfer
  - Port-Multiplexing
- 5.3 TCP: Reliable Connection-Oriented Transfer
  - Port-Multiplexing, Reliability, Flow/Congestion Control
  - Connection Management
  - Reliability
  - Flow Control
  - Performance Enhancements
- 5.4 Socket Programming Interface
- 5.5 Real-time Transport (RTP & RTCP)
- 5.6 Summary

5.1 General Issues
- End-to-End Communication Channel
- Data Integrity
- Flow Control
- Socket Programming Interface

5.1 General Issues: End-to-End Communication Channel
- Port-Multiplexing
- Port: communication end point
[Figure: two LAN hosts, each running AP1 and AP2 over MAC, share a multi-access channel (node-to-node channel, condensed delay distribution); two IP hosts, each running AP1 and AP2 over TCP/UDP and IP, communicate across an IP network (end-to-end channel, loose delay distribution).]

General Issues: Direct-Linked vs. End-to-End

                     Direct-Linked Protocol Layer          End-to-End Protocol Layer
Based on services of physical layer                        internetworking layer
Addressing           node-to-node channel within a link    process-to-process channel between hosts
                     (MAC address)                         (port number)
Error checking       per-frame                             per-segment
Data reliability     per-link                              per-flow
Flow control         per-link                              per-flow
Channel delay        condensed distribution                loose distribution

Note: per-frame integrity, using Ethernet as an example:
- Collision: can be detected and the frame retransmitted
- CRC/alignment error: recovery can only rely on upper-layer protocols

Open Source Implementation 5.1: An Incoming Packet in the Transport Layer
[Figure: Linux call graph from the network layer up to the application layer. Functions shown, listed from the network layer upward:]
- Network layer: ip_local_deliver_finish
- RAW: raw_v4_input(skb), sk=__raw_v4_lookup(skb), raw_rcv(sk,skb), raw_rcv_skb(sk,skb)
- UDP: udp_rcv(skb), sk=udp_v4_lookup(skb), udp_queue_rcv_skb(sk,skb)
- TCP: tcp_v4_rcv(skb), sk=inet_lookup(skb), tcp_v4_do_rcv(sk,skb), tcp_rcv_established, tcp_data_queue(sk,skb)
- Queuing: socket_queue_rcv_skb(sk,skb), __skb_queue_tail(sk->sk_receive_queue, skb), sk_receive_queue, sk->sk_data_ready
- Receive side: raw_recvmsg(sk,buf), udp_recvmsg(sk,buf), tcp_recvmsg(sk,buf), skb_recvdatagram, skb=skb_dequeue, skb_copy_datagram_iovec
- System calls: read (sys_read, vfs_read, do_sync_read, sock_aio_read, do_sock_read, sock_recvmsg, __sock_recvmsg, sock_common_recvmsg) and recvfrom (sys_socketcall, sys_recvfrom, sock_recvmsg)

Copyright Reserved 2009

Open Source Implementation 5.1: An Outgoing Packet in the Transport Layer
[Figure: Linux call graph from the application layer down to the network layer. Functions shown, listed from the application downward:]
- System calls: write (sys_write, vfs_write, do_sync_write, sock_aio_write, do_sock_write, sock_sendmsg, inet_sendmsg) and sendto (sys_socketcall, sys_sendto, sock_sendmsg, inet_sendmsg)
- RAW: raw_sendmsg(sk,buf)
- UDP: udp_sendmsg(sk,buf), ip_append_data(sk,buf), skb=sock_alloc_send_skb(sk), skb=sock_wmalloc(sk), ip_generic_getfrag, udp_push_pending_frames, ip_push_pending_frames
- TCP: tcp_sendmsg(sk,buf), skb_queue_tail(&sk->sk_write_queue,skb), sk_write_queue, tcp_push, __tcp_push_pending_frames, tcp_write_xmit, tcp_transmit_skb, ip_queue_xmit
- Network layer: dst_output, skb->dst->output, ip_output

5.2 UDP: Unreliable Connectionless Transfers
- Objectives
- Header Format
- Unicast Real-time Applications Using UDP

5.2 UDP: For Unreliable Connectionless Transfers
- Objectives
  - Port-Multiplexing
  - Per-Segment Error Checking: Checksum
- Header Format

   0               15 16              31
  +------------------+------------------+
  | source port      | destination port |
  | number           | number           |
  +------------------+------------------+  8 bytes
  | UDP length       | UDP checksum     |
  |                  | (optional)       |
  +------------------+------------------+
  | data (if any)                       |
  +-------------------------------------+

- Carrying Unicast/Multicast Real-Time Traffic
  - Retransmission is meaningless: no per-flow integrity needed
  - Bit rate is determined by the codec used: no flow control needed
[Figure: port-multiplexing. AP1 and AP2 on IP host 1 and IP host 2 communicate through the transport layer across IP networks.]

Open Source Implementation 5.2: UDP and TCP Checksum

```c
th->check = tcp_v4_check(len, inet->saddr, inet->daddr,
                         csum_partial((char *)th, th->doff << 2, skb->csum));
```

Checksum in TCP/IP, in Linux 2.6:
- ip_send_check(iph): covers the IP header
- tcp_v4_check(T, lenT, SA, DA, csum): adds the pseudo header
- csum=csum_partial(T,lenT,csum): covers the TCP/UDP header
- csum=csum_partial(D,lenD,0): covers the application data
[Figure: packet layout with pseudo header, IP header, TCP/UDP header, and application data.]

5.3 TCP: Reliable Connection-Oriented Transfers
- Objectives
- Connection Management
- Per-Flow Data Integrity
- Per-Flow Flow/Congestion Control
- Performance Problems and Enhancements

5.3 TCP: For Reliable Connection-Oriented Transfers
- Objectives
  - Port-Multiplexing: same as UDP
  - Per-Flow Reliability: per-segment checksum & per-flow ACKs
    - Stateful (Ch. 1), so it requires connection management
  - Connection Management: connection establishment/disconnection & state transitions
  - Per-Flow Flow/Congestion Control
  - Performance: interactive vs. bulk-data transfers

TCP Connection Management
Establishment/Termination: 3-Way Handshake Protocol
- Establishment
  1. client -> server: SYN (seq=x)
  2. server -> client: SYN (seq=y), ACK of SYN (ack=x+1)
  3. client -> server: ACK of SYN (seq=x+1, ack=y+1)
- Termination
  1. client -> server: FIN
  2. server -> client: ACK of FIN
  3. server -> client: FIN
  4. client -> server: ACK of FIN

TCP State Transition Diagram
[Figure: state transition diagram with states CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED (data transfer state), FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT (active close), CLOSE_WAIT, and LAST_ACK (passive close). Each edge is labeled with the triggering event (app call, received segment, or timeout) and the segment sent in response. Separate line styles mark the client path and the server path; simultaneous open and simultaneous close are also shown.]

State Transitions: Establishment
- Client: CLOSED -> SYN_SENT (sends SYN, seq=x) -> ESTABLISHED (receives SYN seq=y with ack=x+1, sends ACK with ack=y+1)
- Server: LISTEN -> SYN_RCVD (receives SYN, sends SYN and ACK) -> ESTABLISHED (receives ACK of SYN)

State Transitions: Termination
- Client: ESTABLISHED -> FIN_WAIT_1 (sends FIN) -> FIN_WAIT_2 (receives ACK of FIN) -> TIME_WAIT (receives FIN, sends ACK of FIN) -> CLOSED (after 2MSL timeout)
- Server: ESTABLISHED -> CLOSE_WAIT (receives FIN, sends ACK of FIN) -> LAST_ACK (sends FIN) -> CLOSED (receives ACK of FIN)

State Transitions: Simultaneous Open/Close
- (a) State transitions in simultaneous open: both ends send SYN, pass through SYN_SENT and SYN_RCVD, and reach ESTABLISHED after exchanging SYN and ACK of SYN.
- (b) State transitions in simultaneous close: both ends send FIN and go ESTABLISHED -> FIN_WAIT_1 -> CLOSING -> TIME_WAIT -> CLOSED (after 2MSL timeout).

State Transitions: Loss in Establishment
- (a) SYN sent by the client is lost: the client times out in SYN_SENT and returns to CLOSED; the server stays in LISTEN.
- (b) SYN sent by the server is lost: the client times out in SYN_SENT and returns to CLOSED; the server times out in SYN_RCVD, sends RST, and returns to CLOSED.
- (c) ACK of SYN sent by the client is lost: the client is in ESTABLISHED; the server times out in SYN_RCVD, sends RST, and returns to CLOSED.

TCP State Transition Implementation
- In the "sock" structure
- State names
[Figure: source listing not recoverable.]

Reliability of Data Transfers
Definition: Data Reliability vs. Data Integrity
- Data Integrity: successfully received packets are exactly the same as they were transmitted.
- Data Reliability: every transmitted packet is successfully received and is exactly the same as the original transmitted one.
- TCP
  - Per-Segment Integrity: checksum
  - Per-Flow Reliability: sequence number & ACK

Per-Flow Data Reliability: Sequence Number & Acknowledgement
- ACK every successfully received data segment
- Segment sent but not ACKed?
  - Dropped by some intermediate router
  - Dropped by the receiver
    - Insufficient buffer
    - Forced drop
    - Wrong checksum
- Retransmitting lost packets
  - When to retransmit?
  - Which packet to retransmit?
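The receiver side of the sequence-number-and-ACK scheme above can be sketched as follows. This is a minimal illustration, not the Linux implementation: the names `struct receiver`, `receiver_init`, `receiver_on_data`, and the limit `MAX_OOO` are hypothetical, and sequence-number wraparound and partially overlapping segments are ignored. The receiver always returns a cumulative ACK, meaning the next in-order byte it expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OOO 8   /* hypothetical limit on buffered out-of-order segments */

struct seg { uint32_t seq; uint32_t len; };

struct receiver {
    uint32_t   rcv_nxt;        /* next in-order byte expected */
    struct seg ooo[MAX_OOO];   /* out-of-order segments awaiting resequencing */
    size_t     n_ooo;
};

void receiver_init(struct receiver *r, uint32_t initial_seq)
{
    assert(r != NULL);
    r->rcv_nxt = initial_seq;
    r->n_ooo = 0;
}

/* Pull buffered segments that have become in-order; discard stale ones. */
static void absorb_buffered(struct receiver *r)
{
    size_t i = 0;
    while (i < r->n_ooo) {
        if (r->ooo[i].seq > r->rcv_nxt) { i++; continue; }
        if (r->ooo[i].seq == r->rcv_nxt)
            r->rcv_nxt += r->ooo[i].len;
        r->ooo[i] = r->ooo[--r->n_ooo];   /* remove entry i */
        i = 0;                            /* rescan from the start */
    }
}

/* Process one data segment and return the cumulative ACK number to send. */
uint32_t receiver_on_data(struct receiver *r, uint32_t seq, uint32_t len)
{
    assert(r != NULL);
    if (seq == r->rcv_nxt) {
        r->rcv_nxt += len;                /* in order: advance the window */
        absorb_buffered(r);
    } else if (seq > r->rcv_nxt && r->n_ooo < MAX_OOO) {
        r->ooo[r->n_ooo].seq = seq;       /* ahead of a hole: buffer it */
        r->ooo[r->n_ooo].len = len;
        r->n_ooo++;
    }
    /* seq < rcv_nxt: duplicate data, dropped; the ACK is simply repeated */
    return r->rcv_nxt;
}
```

With `rcv_nxt` starting at 100, a segment (Seq=150, Len=30) that arrives first is answered with Ack=100, and the later arrival of (Seq=100, Len=50) is answered with Ack=180, since the buffered segment fills in behind it.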
Data Reliability: Cumulative ACK (1/2)
[Figure: client/server time lines for two cases.]
- (a) Packet loss: the client sends DATA(Seq=100, Len=50), which is lost, then DATA(Seq=150, Len=30). The server replies ACK(Ack=100). After the timeout the client retransmits DATA(Seq=100, Len=50), and the server replies ACK(Ack=180).
- (b) Delay: the client sends DATA(Seq=100, Len=50) and DATA(Seq=150, Len=30); the server replies ACK(Ack=150) and ACK(Ack=180), but the ACKs are delayed past the timeout. The client retransmits DATA(Seq=100, Len=50); the server sees duplicate data, drops it, and replies ACK(Ack=180).

Data Reliability: Cumulative ACK (2/2)
[Figure: client/server time lines for two cases.]
- (c) ACK loss: the client sends DATA(Seq=100, Len=50) and DATA(Seq=150, Len=30). ACK(Ack=150) is lost, but ACK(Ack=180) arrives before the timeout and covers both segments.
- (d) Out-of-sequence: DATA(Seq=150, Len=30) arrives before DATA(Seq=100, Len=50). The server replies ACK(Ack=100) and then ACK(Ack=180).

Pseudo Code of Sliding Window in the Sender
- SWS: send window size.
- n: current sequence number, i.e., the next packet to be transmitted.
- LAR: last acknowledgment received.

  if the sender has data to send
      transmit up to SWS packets ahead of the latest acknowledgment LAR,
      i.e., it may transmit packet number n as long as n < LAR + SWS
  endif
  if an ACK arrives
      set LAR to the ack number if the ack number > LAR
  endif

Per-Flow Flow/Congestion Control: Sliding Window
- Receiver: should maintain an out-of-order queue to resequence data before returning it to the application.
- Sender: should maintain a retransmission queue in case of packet loss.
- TCP window size = min(RWND, CWND)
[Figure: the receiving byte stream holds 2 to 4; the network pipe (size 4 to 8 segments) carries DATA 5 to 7 toward the receiver and ACKs (Ack=5, Ack=6) back; on the sending stream, SND_UNA and SND_NXT delimit the window, which slides over segments that are sent and ACKed, in flight, and still to be sent when the window moves.]

Sliding Window: Normal Case (1/2)
[Figure: (a) original condition; (b) sender receives ACK(Ack=5). The window slides by one segment for each new ACK, and one new DATA segment enters the network pipe.]

Sliding Window: Normal Case (2/2)
[Figure: (c) sender receives ACK(Ack=6); (d) sender receives ACK(Ack=7).]

Sliding Window: Out-of-Sequence (1/2)
[Figure: (a) original condition, with the receiver holding 2, 3, 5, 6 and returning ACK(Ack=4) twice; (b) sender receives ACK(Ack=4) of DATA 5. The window does not slide on a duplicate ACK.]

Sliding Window: Out-of-Sequence (2/2)
[Figure: (c) sender receives ACK(Ack=4) of DATA 6; (d) sender receives ACK(Ack=7), after which the window slides and DATA 9 to 11 are sent.]

Per-Flow Flow/Congestion Control: Opening & Shrinking of Window Size
- Close, Shrink, Open
- TCP window size = min(RWND, CWND)

Retransmitting Lost Packets
- Retransmit which packet?
  - Fast Retransmit
  - Towards better accuracy: TCP SACK option
  - Further refinement: FACK (based on SACK)
- When to retransmit?
  - Fast Retransmit: same as above
  - Retransmission Timeout (RTO)
    - Round-Trip Time (RTT) measurement
    - Tradeoff: RTT vs. RTO
    - Karn's Algorithm
    - Towards better RTO: TCP Timestamp option

Retransmit Which Packet? Fast Retransmit
- Causes of duplicate ACKs
  - Packet loss
  - Packet reordering (Internet route change)
- TCP receiver: ACK the first "hole"
- Triple Duplicate ACKs (TDA): 4 same ACKs (ACK field = X)
- TCP sender
  - Infer TDA as congestion
  - Retransmit the packet with SeqNum = X
  - Halve its sending rate
[Figure: DATA 2, 3, 6, 7, 8 reach the receiver over time, which returns ACKs 3, 4, 4, 4, 4.]

When to Retransmit? Retransmission TimeOut (RTO)
- Round-Trip Time (RTT) measurement vs. RTO
  - RTT: varies dramatically
  - Smoothed RTT (SRTT): exponential weighted moving average
  - Mdev: mean deviation of RTT
  - RTO = SRTT + 4 * Mdev
- Karn's Algorithm: do not update the RTT estimate (and hence RTO) from a segment that was retransmitted, since its ACK is ambiguous

TCP RTT Estimator
- Fast estimator by Van Jacobson, '88 & '90
  - srtt (smoothed RTT) is kept as 8 times the RTT
  - mdev is kept as 4 times the real mean deviation
- In tcp_input.c from Linux 2.6: exponential weighted moving average
[Figure: source listing not recoverable.]

Per-Flow Flow/Congestion Control
- How fast to send?
  - Fast sender vs. slow receiver. How to know?
    - Feedback: RWND (receiver advertised window) in the ACK by the receiver
  - Fast sender vs. congested network. How to know?
    - Feedback: loss events by the network
    - Re-adjust CWND (congestion window)
- How fast?
  - Satisfy both: min(RWND, CWND)

TCP Tahoe Congestion Control
[Figure: state diagram, summarized below.]
- start -> Slow start, with cwnd=1
- Slow start: on each ACK, cwnd=cwnd+1 and send a packet
- Slow start -> Congestion avoidance when cwnd ≥ ssth
- Congestion avoidance: on each ACK, cwnd=cwnd+1/cwnd and send a data packet
- Slow start or Congestion avoidance -> Fast retransmit on ≥ 3 duplicate ACKs: send the missing packet, ssth=cwnd/2, cwnd=1, then return to Slow start
- Slow start or Congestion avoidance -> Retransmission timeout on timeout: cwnd=1; back to Slow start once all are ACKed

Slow Start & Congestion Avoidance
[Figure: source/destination time lines for Slow Start and for Congestion Avoidance.]

An Example: TCP Tahoe (1/2)
- (1) cwnd=8, awnd=8: Sender sent segments 31-38.
- (2) cwnd=8, awnd=8: Receiver replied seven duplicate ACKs of segment 30.

An Example: TCP Tahoe (2/2)
- (3) cwnd=1, awnd=8: Sender received three duplicate ACKs and cwnd is changed to 1 packet. The lost segment 31 is retransmitted. Sender exited fast retransmit and entered the slow start state.
- (4) cwnd=1, awnd=8: Receiver replied the ACK of segment 38 when it received the retransmitted segment 31.
- (5) cwnd=1, awnd=1: Sender sent segment 39.

TCP Reno Congestion Control (RFC 2581)
[Figure: state diagram, summarized below.]
- start -> Slow start; on each ACK, cwnd=cwnd+1 and send a packet
- Slow start -> Congestion avoidance when cwnd ≥ ssth; on each ACK, cwnd=cwnd+1/cwnd and send a data packet
- On ≥ 3 duplicate ACKs (ACK = x) -> Fast retransmit: ssth=cwnd/2, cwnd=ssth, send the missing packet, then enter Fast recovery
- Fast recovery: on each duplicate ACK, cwnd=cwnd+1 and send a data packet
- Fast recovery -> Congestion avoidance on a non-duplicate ACK (ACK > x): cwnd=ssth
- Any state -> Retransmission timeout on timeout: cwnd=1; back to Slow start once all are ACKed

An Example: TCP Reno
- Sender received three duplicate ACKs and cwnd is changed to (8/2)+3 packets. The lost segment 31 is retransmitted. Sender exited fast retransmit and entered the fast recovery state.
- Receiver replied the ACK of segment 38 when it received the retransmitted segment 31.
- Sender exited fast recovery and entered the congestion avoidance state. Cwnd is changed to 4 segments.
[Figure: the three steps above continue from steps (1) and (2) of the Tahoe example. Window values shown: (3) cwnd=7, awnd=8; (4) cwnd=11, awnd=8->11; (5) cwnd=4, awnd=3->4.]

Open Source Implementation 5.4: TCP Slow Start and Congestion Avoidance

```c
/* Excerpt; tp points to the connection's TCP state. */
if (tp->snd_cwnd <= tp->snd_ssthresh) {
    /* Slow start: grow cwnd by one segment per ACK. */
    if (tp->snd_cwnd < tp->snd_cwnd_clamp)
        tp->snd_cwnd++;
} else {
    /* Congestion avoidance: grow cwnd by one segment per cwnd ACKs. */
    if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
        if (tp->snd_cwnd < tp->snd_cwnd_clamp)
            tp->snd_cwnd++;
        tp->snd_cwnd_cnt = 0;
    } else
        tp->snd_cwnd_cnt++;
}
```

Principle in Action: TCP Congestion Control Behaviors
[Figure: cwnd over time, annotated with slow-start, congestion avoidance, triple-duplicate ACKs, fast retransmit, fast recovery, ssth reset, and the pipe limit.]

TCP Header Format

  bits 0-15                                 bits 16-31
  source port number                        destination port number
  32-bit sequence number
  32-bit acknowledgement number
  header length (4 bits), 6-bit reserved,   window size
  flags U A P R S F
  TCP checksum                              urgent pointer
  options (if any)
  data

  The fixed part of the header, up to and including the urgent pointer, is 20 bytes.

TCP Options

  Option                 kind   len   Payload
  End of option list     0      -     -
  No operation           1      -     -
  Maximum segment size   2      4     Maximum segment size (MSS), 2 bytes
  Window scale factor    3      3     shift count, 1 byte
  Timestamp              8      10    timestamp value (4 bytes), timestamp echo reply (4 bytes)

TCP Options
- End of Option List, No Operation
  - As the names suggest
  - Pad fields to a multiple of 4 bytes
- Maximum Segment Size
  - Negotiates the maximum transfer unit during the 3-way handshake
- Layout: End of option list is 1 byte (kind=0); No operation is 1 byte (kind=1); Maximum segment size is 1 byte kind=2, 1 byte len=4, 2 bytes MSS

TCP Options (Window Scale Factor, RFC 1323)
- Issue: the window is too small in Gigabit networks, which limits throughput
- Solution: negotiate a shifting factor for the window
  - Negotiated during the 3-way handshake: SYN with the window scale option, then SYN+ACK with the window scale option
  - Shift up to 14 bits (from 2^16 to 2^16 x 2^14)
- Window scale factor layout: 1 byte kind=3, 1 byte len=3, 1 byte shift count
- When this option is not used: Linux does not advertise a window over 2^15, to avoid problems with other stacks that treat the window as a signed value (include/net/tcp.h)

TCP Options: Timestamp
- Mission 1: improving RTT measurement
  - Receiver: copies and echoes the timestamp
  - Sender: always updates RTT when seeing a timestamp
- Mission 2: protecting against wrapped sequence numbers
  - Delayed ACK
  - Avoid accepting old segments in high-speed networks
- How to enable the timestamp option?
  - 3-way handshake: timestamp in SYN, timestamp in its ACK
- Timestamp layout: 1 byte kind=8, 1 byte len=10, 4 bytes timestamp value, 4 bytes timestamp echo reply

TCP Timer Management in Linux
- Retransmit Timer: to start retransmitting
- Persist Timer: to prevent deadlocks
- Keepalive Timer (non-standard): to clean up redundant TCP states

Functions of All TCP Timers
- connection timer: To establish a new TCP connection, a SYN segment is sent. If no response to the SYN segment is received within the connection timeout, the connection is aborted.
- retransmission timer: TCP retransmits the data if the data is not acknowledged before this timer expires.
- delayed ACK timer: The receiver must wait until the delayed ACK timeout to send the ACK. If there is data to send during this period, it piggybacks the ACK on the data.
- persist timer: A deadlock problem is solved by the sender sending periodic probes after the persist timer expires.
- keepalive timer: If the connection is idle for a few hours, the keepalive timeout expires and TCP sends probes. If no response is received, TCP assumes that the other end has crashed.
- FIN_WAIT_2 timer: This timer avoids leaving a connection in the FIN_WAIT_2 state forever if the other end has crashed.
- TIME_WAIT timer: The timer is used in the TIME_WAIT state to enter the CLOSED state.
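The retransmission timer in the list above fires after one RTO. The following sketch shows how an RTO could be derived from RTT samples using the scaled form described on the RTT Estimator slide (srtt kept as 8 times the RTT, mdev kept as 4 times the mean deviation, RTO = SRTT + 4 * Mdev). It is modeled on Jacobson's estimator and is not the kernel's code: the names `struct rtt_est`, `rtt_init`, `rtt_sample`, and `rtt_rto` are hypothetical, and seeding the deviation with half of the first sample is an assumption borrowed from common practice.

```c
#include <assert.h>
#include <stddef.h>

struct rtt_est {
    long srtt8;   /* smoothed RTT, stored as 8 * SRTT */
    long mdev4;   /* mean deviation, stored as 4 * Mdev */
};

void rtt_init(struct rtt_est *e)
{
    assert(e != NULL);
    e->srtt8 = 0;
    e->mdev4 = 0;
}

/* Feed one RTT measurement m (in timer ticks, m > 0). */
void rtt_sample(struct rtt_est *e, long m)
{
    assert(e != NULL && m > 0);
    if (e->srtt8 == 0) {
        /* First sample (assumption): SRTT = m, Mdev = m / 2. */
        e->srtt8 = m * 8;
        e->mdev4 = m * 2;
        return;
    }
    long err = m - e->srtt8 / 8;      /* error against the current SRTT */
    e->srtt8 += err;                  /* SRTT = 7/8 SRTT + 1/8 m */
    if (err < 0)
        err = -err;
    e->mdev4 += err - e->mdev4 / 4;   /* Mdev = 3/4 Mdev + 1/4 |err| */
}

/* RTO = SRTT + 4 * Mdev; mdev4 already holds 4 * Mdev. */
long rtt_rto(const struct rtt_est *e)
{
    assert(e != NULL);
    return e->srtt8 / 8 + e->mdev4;
}
```

Keeping the state pre-scaled lets both averages be updated with integer additions and divisions by powers of two, which is the point of the scaled representation. Under Karn's algorithm, `rtt_sample` would simply not be called for segments that were retransmitted.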
Open Source Implementation 5.5: TCP Retransmit Timer
- Approximating RTT
  - Linux provides good retransmit timer granularity, just like its other timers
  - BSD-derived UNIX systems have coarse granularity
    - To minimize timer overhead, they check whether data has been ACKed only every 500 ms
    - RTT is over-estimated, so RTO is over-estimated
    - Retransmission is slow when a loss is not recovered by Fast Retransmit

Open Source Implementation 5.6: TCP Persistent (or Probe) Timer
- When the receiver has advertised RWND=0 and the next window update (RWND>0) is lost, a deadlock occurs
  - Sender waits for RWND>0 (window update)
  - Receiver waits for new data
- Solution: the sender emits one byte of data to probe
- Persist timer in tcp_output.c (Linux 2.6)
  - Uses RTO with binary exponential backoff, up to 120 seconds

Open Source Implementation 5.6 (cont.): TCP Keepalive Timer (Non-standard)
- When no data has been exchanged for a long time
  - Connection timeout? Belongs to the application's preference
  - The other end is dead?
- Linux 2.6 implementation (tcp_timer.c)
  - Calls tcp_keepalive_timer() every 75 seconds
    - Initialized by the af_inet init routine
    - Searches every established TCP connection
  - If dead and not rebooted: state cleared after 5 probes
  - If dead and rebooted: state cleared after getting RST

Data Structures of TCP Connections in Linux
- Important variables: include/net/sock.h

Summary: Properties of TCP
- Per-flow reliability through ACKs
- Window-based flow control
- Self-clocking using ACKs

TCP Performance
- Interactive Connections
  - Silly Window Syndrome
- Bulk-Data Transfers
  - ACK Compression Phenomenon
  - Reno's Multiple Packet Loss Problem

TCP Performance Problems & Enhancement
- Interactive Connections
  - Silly Window Syndrome (SWS)
  - Solution: Clark & Nagle
- Bulk Data Transfers
  - ACK Compression Phenomenon
    - Possible solution: paced TCP sender
  - Reno's Multiple Packet Loss Problem
    - Solution: NewReno, SACK, FACK

TCP Performance Problems and Solutions

  Transmission Style       Problem                  Solution
  Interactive connection   Silly window syndrome    Nagle, Clark
  Bulk-data transfer       ACK compression          Zhang
                           Reno's MPL* problem      NewReno, SACK, FACK

  *MPL stands for Multiple-Packet-Loss

Performance of Interactive Connections: Problems & Solutions
- Problem: Silly Window Syndrome (SWS)
  - Sender transmits small packets
  - Receiver advertises small windows
- Solution
  - Sender sends whenever any of the following holds:
    - Data has accumulated to a full-sized segment
    - Data has accumulated to ½ RWND
    - Nagle's algorithm is disabled or not applied
  - Receiver advertises a window whenever either of the following holds:
    - Buffer available for a full-sized segment
    - Buffer available reaches ½ of its buffer space

SWS: Receiver Advertises Small Window
[Figure: client/server exchange starting at RWND=320. Each time the server receives a segment, it sends an ACK and reduces the advertised window: to 80, then 40, then 30. The buffer fill levels shown are 240/320, 220/320, 200/320, with segment sizes 60/80 and 30/40.]

Performance Enhancement of Interactive Connections: Nagle's Algorithm
- Goal: utilize the bandwidth resource efficiently
- TELNET: typing speed vs. available bandwidth
  - When RTT is short (bandwidth may be sufficient)
    - Inter-character spacing > RTT
    - Only one outstanding packet per RTT => efficient
  - When RTT is large (bandwidth may be insufficient)
    - Inter-character spacing < RTT
    - Multiple single-character packets per RTT => inefficient
- Nagle: do not send a small packet until the pipe is clean (keep only one packet in the pipe) => efficient
- The beauty of Nagle
  - When RTT is short, Nagle's algorithm is rarely used
  - When RTT is large, Nagle's algorithm is often used

Performance of Bulk Data Transfers
- Computing the performance through the Bandwidth Delay Product (BDP)
  - Horizontal dimension: delay
  - Vertical dimension: bandwidth
  - Shaded area: packet size
  - BDP = pipe size = bandwidth x RTT

Performance of Bulk Data Transfers: Filling the Network Pipe
- One pipe carries data packets from the TCP sender to the TCP receiver; another carries ACKs back
- Highest performance: the pipe is full
[Figure: TCP sender and TCP receiver connected by a WAN pipe, shown partly filled and completely filled.]

Performance of Bulk Data Transfers: Steps of Filling the Pipe Using Congestion Avoidance
[Figure: steps (1) to (36) of filling the pipe as cwnd grows from 1 to 6, six steps per cwnd value.]

Performance of Bulk Data Transfers: Modeling TCP Throughput
- Given RTT, segment size s, and loss rate p:

  \[ T(t_{RTT}, s, p) = \frac{c \cdot s}{t_{RTT}\sqrt{p}} \]

  where c is a constant value
- Given additional information (maximum window size W_m, number of packets acknowledged per delayed ACK b, and RTO):

  \[ T(t_{RTT}, t_{RTO}, s, p) = \min\left( \frac{W_m \cdot s}{t_{RTT}},\ \frac{s}{t_{RTT}\sqrt{\frac{2bp}{3}} + t_{RTO}\,\min\left(1,\ 3\sqrt{\frac{3bp}{8}}\right) p\,(1 + 32p^2)} \right) \]

Problem of TCP Bulk Data Transfers: ACK-Compression Phenomenon
- Bursty traffic arises with
  - Simultaneous 2-way traffic
  - Asymmetric paths
- No general solution
  - Distributing a window of packets across an RTT may alleviate the phenomenon
[Figure: sender and receiver connected through a slow link; data packets and ACKs keep proper spacing.]

Historical Evolution: Multiple-Packet-Loss Recovery in NewReno, SACK, FACK and Vegas
- Solution (I) to TCP Reno's problem: TCP NewReno
- Solution (II) to TCP Reno's problem: TCP SACK
- Solution (III) to TCP Reno's problem: TCP FACK
- Solution (IV) to TCP Reno's problem: TCP Vegas

Problem of TCP Bulk Data Transfers: Reno's Multiple Packet Loss Problem (1/2)
- (1) cwnd=8, awnd=8: Sender sent segments 31-38.
- (2) cwnd=8, awnd=8: Receiver replied five duplicate ACKs of segment 30.
- (3) cwnd=7, awnd=8: Sender received three duplicate ACKs and cwnd is changed to (8/2)+3 packets. The lost segment 31 is retransmitted.

Problem of TCP Bulk Data Transfers: Reno's Multiple Packet Loss Problem (2/2)
- (4) cwnd=9, awnd=8->9: Receiver replied the ACK of segment 32 when it received the retransmitted segment 31. This is a partial ACK.
- (5) cwnd=4, awnd=7: Sender exited fast recovery and entered the congestion avoidance state when receiving the partial ACK. Cwnd is changed to 4 segments.
- (6) cwnd=4, awnd=7: Sender waited until timeout.

Eliminating MPL Problem (I): TCP NewReno (1/3)
- RFC 2582: extending the fast-recovery phase
  - Remain in fast recovery until all data that was in the pipe before the triple duplicate ACK was detected has been ACKed
- (1) cwnd=8, awnd=8: Sender sent segments 31-38.
- (2) cwnd=8, awnd=8: Receiver replied five duplicate ACKs of segment 30.
- (3) cwnd=7, awnd=8: Sender received three duplicate ACKs and cwnd is changed to (8/2)+3 packets. The lost segment 31 is retransmitted.

Eliminating MPL Problem (I): TCP NewReno (2/3)
- (4) cwnd=9, awnd=8->9: Receiver replied the ACK of segment 32 when it received the retransmitted segment 31. This is a partial ACK.
- (5) cwnd=8, awnd=7->8: Sender received a partial ACK of segment 32 and immediately retransmitted the lost segment 33. Cwnd is changed to 9-2+1.
- (6) cwnd=9, awnd=8->9: Sender received a duplicate ACK and increased cwnd by 1, so segment 41 was sent. Receiver replied a partial ACK and one duplicate ACK of segment 33.
- (7) cwnd=9, awnd=9->8: The partial ACK triggered the sender to retransmit segment 34 and shrink awnd to 8 (41-33). Receiver replied an ACK of segment 33 upon receiving segment 34.

Eliminating MPL Problem (I): TCP NewReno (3/3)
- (8) cwnd=10, awnd=9->10: Upon receiving the duplicate ACK of segment 33, cwnd was advanced by one. Since awnd was smaller than cwnd, two new segments were sent.
- (9) cwnd=11, awnd=10->11: On receiving the duplicate ACK of segment 33, cwnd was advanced by one and thus segment 44 was sent.
- (10) cwnd=11, awnd=10->11: Receiver replied ACKs of segments 40, 42, 43, and 44.
- (11) cwnd=4, awnd=4: Sender exited fast recovery upon receiving the ACK of segment 40. Cwnd and awnd were reset to 4.

Eliminating MPL Problem (II): TCP SACK (1/2)
- Reporting non-contiguous blocks of data
- (1) cwnd=8, awnd=8: Sender received the ACK of segment 30 and sent segments 31-38.
- (2) cwnd=8, awnd=8: Receiver sent five duplicate ACKs of segment 30 with SACK options.
- (3) cwnd=4, awnd=6: Sender received duplicate ACKs and began retransmitting the lost segments reported in the SACK options. Awnd was set to 8-3+1 (three duplicate ACKs and one retransmitted segment).
- SACK options carried by the five duplicate ACKs:

  ACK   SACK blocks
  1     (32,32;  0, 0;  0, 0)
  2     (35,35; 32,32;  0, 0)
  3     (35,36; 32,32;  0, 0)
  4     (35,37; 32,32;  0, 0)
  5     (35,38; 32,32;  0, 0)

Eliminating MPL Problem (II): TCP SACK (2/2)
- (4) cwnd=4, awnd=4: Receiver replied partial ACKs for received retransmitted segments.
- Sender received partial ACKs, reduced awnd by two, and thus retransmitted two lost segments.
- Receiver replied ACKs for received retransmitted segments.
- Sender exited fast recovery after receiving the ACK of segment 38.
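How a receiver could derive the non-contiguous blocks it reports in SACK options can be sketched as below. This is only an illustration in the segment-number notation of the slides (real TCP SACK blocks are byte ranges). The names `struct sack_block` and `build_sack_blocks` are hypothetical, and the sketch lists blocks in ascending order, whereas the options on the slide list the most recently changed block first.

```c
#include <assert.h>
#include <stddef.h>

/* One SACK block: an inclusive range of segment numbers, as on the slide. */
struct sack_block { int first; int last; };

/*
 * received[i] != 0 means segment (base + i) has arrived, where base is the
 * first segment after the cumulative ACK. Writes up to max_blocks contiguous
 * runs into out, in ascending order, and returns how many were written.
 */
size_t build_sack_blocks(const unsigned char *received, size_t n, int base,
                         struct sack_block *out, size_t max_blocks)
{
    assert(received != NULL && out != NULL);
    size_t count = 0;
    size_t i = 0;
    while (i < n && count < max_blocks) {
        if (!received[i]) {           /* a hole: nothing to report here */
            i++;
            continue;
        }
        size_t start = i;             /* start of a contiguous run */
        while (i < n && received[i])
            i++;
        out[count].first = base + (int)start;
        out[count].last  = base + (int)i - 1;
        count++;
    }
    return count;
}
```

For the example on the slides, segments 31, 33, and 34 are lost while 32 and 35-38 arrive, so with `base` 31 the function yields the blocks (32,32) and (35,38), the same pair carried by the fifth duplicate ACK.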
[Figure: remaining steps of the SACK example. Window values shown: (5) cwnd=4, awnd=2->4; (6) cwnd=4, awnd=4; (7) cwnd=4, awnd=4.]

Eliminating MPL Problem (III): TCP FACK (1/2)
- Extension of SACK, with a better estimation of awnd
- (1) cwnd=8, awnd=8: Sender received the ACK of segment 30 and sent segments 31-38.
- (2) cwnd=8, awnd=8: Receiver sent five duplicate ACKs of segment 30 with SACK options.
- (3) cwnd=4, awnd=4: Sender received two duplicate ACKs and began retransmitting the lost segments reported in the SACK options.
- SACK options carried by the five duplicate ACKs:

  ACK   SACK blocks
  1     (32,32;  0, 0;  0, 0)
  2     (35,35; 32,32;  0, 0)
  3     (35,36; 32,32;  0, 0)
  4     (35,37; 32,32;  0, 0)
  5     (35,38; 32,32;  0, 0)

Eliminating MPL Problem (III): TCP FACK (2/2)
- (4) cwnd=4, awnd=4: Sender calculated awnd for the received duplicate ACKs and kept sending the packets allowed.
- (5) cwnd=4, awnd=4: Receiver replied ACKs.
- (6) cwnd=4, awnd=4: Sender exited fast recovery after receiving the ACK of segment 38.

Performance of Bulk Data Transfers
- When RTTs are heterogeneous...
- What have you observed?
Chapter 5: Transport Layer 78 5.4 Socket Programming Interface Programming Interface to Protocol Layers in Linux Accessing End-to-End Protocol Layer Accessing Internetworking Protocol Layer Accessing Direct-Linked Protocol Layer Packet Capturing & Filtering Chapter 5: Transport Layer 79 5.4 Socket Programming Interface Issue: programming interface to protocol layers Socket interface in Linux 2.2.17 kernel Application Socket interface Socket Library net/socket.c net/ipv4/af_inet.c net/ipv4/{tcp*,udp*} net/ipv4/{ip*,icmp*} net/ethernet/eth.c drivers/net/*.{c,h} BSD Socket INET Socket TCP/UDP ICMP IP … ARP ethernet-header builder ethernet NIC Driver Chapter 5: Transport Layer User-space Kernel-space 80 Bridging Applications & End-to-End Protocols socket(domain, type, protocol) INET domain: AF_INET type UDP: SOCK_DGRAM TCP: SOCK_STREAM Protocol: NULL Typical Applications: telnet ftp HTTP Chapter 5: Transport Layer 81 Elementary Socket: TCP Client/Server TCP Server TCP Client obtain a descriptor initiate 3-way handshake socket() connect() write() socket() obtain a descriptor bind() assign IP & port to the socket listen() 1. switch to passive socket 2. 
create connection queue accept() enter ESTABLISHED state blocks until connection from client connection establishment (TCP Three-way handshake) data (r equest) read() process request data (r eply) write() read() close() end-of- life noti fication read() Chapter 5: Transport Layer close() 82 Elementary Socket: UDP Client/Server UDP Server UDP Server obtain a descriptor socket() UDP Server Client Client obtain a descriptor obtain a descriptor UDP socket() obtain a descriptor ent ) data (request) socket() bind() sendto() sendto() recvfrom() blocks until connection from client ) obtain a descriptor assign IP & port to the socket assign IP & port to the socket blocks until connection from client blocks until connection from client process request process request recvfrom() process request recvfrom() data (reply assign IP & port to the socket data (request) data (request) socket() bind() bind() recvfrom() recvfrom() ) data (reply ) data (reply sendto() sendto() close() sendto() close() Chapter 5: Transport Layer 83 Open Source Implementation 5.7: Socket Read/Write Inside out User Space Server Server socket creation socket() bind() listen() send data accept() sys_socketcall sys_socket sys_bind sys_listen sock_create inet_bind inet_listen inet_create write() sys_write sys_accept do_sock_wri te inet_accept sock_ sendmsg tcp_accept wait_for_ connection Client Client socket creation socket() connect() sys_socketcall send data read() sys_read sys_socket sys_connect do_sock_read sock_create inet_stream _connect sock_ recvmsg tcp_v4_ getport sock_commo n_ recvmsg inet_create inet_ sendmsg tcp_ sendmsg tcp_ write_xmit tcp_v4_ connect inet_wait _connect tcp_ recvmsg memcpy_ toiovec Kernel Space Internet Chapter 5: Transport Layer 84 Open Source Implementation 5.7: Socket Read/Write Inside out linux/sched.h struct files_struct count file_lock max_fds max_fdset next_fd fd[0] fd[1] …… fd[255] …… opened Linux socket linux/fs.h struct file f_list f_dentry max_fds f_vfsmnt f_op f_count 
    f_flags, f_mode, f_pos, ……
  linux/dentry.h, struct dentry: d_count, d_flags, d_inode, d_parent, ……
  ipv4/tcp_ipv4.c, struct tcp_func: tcp_close, tcp_v4_connect, tcp_disconnect, tcp_accept, tcp_ioctl, tcp_v4_init_sock, tcp_v4_destroy_sock, tcp_shutdown, tcp_setsockopt, tcp_getsockopt, tcp_sendmsg, tcp_recvmsg, ……
  net/sock.h, struct proto: close, connect, disconnect, accept, ioctl, init, destroy, shutdown, setsockopt, getsockopt, sendmsg, recvmsg, ……
  linux/fs.h, struct inode: ……, union u (struct socket: ……, inode, file, sk, ……), ……
  net/sock.h, struct sock: d_addr, s_addr, dport, sport, bound_dev_if, ……, receive_queue, write_queue, ……, proto, ……, union tp_pinfo (struct tcp_opt: ……, snd_cwnd, ……), ……, sk_filter, ……, socket, ……

Performance Matters: Interrupt and Memory Copy at Socket
  [Figure: Latency in receiving TCP segments in the TCP layer]
  [Figure: Latency in transmitting TCP segments in the TCP layer]

Bridging Applications to Internetworking Protocols in Linux 2.6
  socket(domain, type, protocol)
  Parameters (PACKET)
    domain: PF_PACKET
    type: SOCK_DGRAM
    protocol: NULL
  Kernel functions: net/packet/af_packet.c
  Typical Applications: ping, traceroute

Bridging Applications to Node-to-Node Protocols in Linux 2.6
  socket(domain, type, protocol)
  Parameters (PACKET)
    domain: PF_PACKET
    type: SOCK_RAW
    protocol: ETH_P_IP for Ethernet-encapsulated IP packets, or others in "/usr/include/linux/if_ether.h"
  Complete access to the Ethernet header
  Kernel functions: net/packet/af_packet.c
  Typical Applications:
    Packet sniffers => performance problem!
    Hacking tools

Open Source Implementation 5.8: Bypassing the End-to-End Layer
    #include <stdio.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>        /* htons() */
    #include <linux/if_ether.h>   /* ETH_P_ALL */

    int main() {
        int n;
        int fd;
        char buf[2048];

        /* Packet socket: delivers whole frames, Ethernet header included. */
        if ((fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL))) == -1) {
            printf("fail to open socket\n");
            return 1;
        }
        while (1) {
            n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
            if (n > 0)
                printf("recv %d bytes\n", n);
        }
        return 0;
    }

Open Source Implementation 5.9: Making Myself Promiscuous
    struct ifreq ethreq;                          /* declared in <net/if.h>; sock is an open socket */
    strncpy(ethreq.ifr_name, "eth0", IFNAMSIZ);
    ioctl(sock, SIOCGIFFLAGS, &ethreq);           /* read the current interface flags */
    ethreq.ifr_flags |= IFF_PROMISC;              /* add the promiscuous bit */
    ioctl(sock, SIOCSIFFLAGS, &ethreq);           /* write the flags back */

Packet Sniffers: Packet Capturing & Filtering
  Capture until what header?
  Towards Efficient Packet Filtering: Layered Model
    User-Space Tool: tcpdump
    User-Space Packet Filter: libpcap (portable)
    Kernel-Space Packet Filter: Linux Socket Filter

Open Source Implementation 5.10: Linux Socket Filter
  Linux Socket Filter (net/core/filter.c)
  Similar to BPF (Berkeley Packet Filter)
  [Figure: in user space, two network monitors and rarpd; in the kernel, a filter and buffer for each of them, the protocol stack, and the link-level drivers; the network below]

5.5 Transport Protocols for Streaming
  Issues
  Real-Time Transport Protocol (RTP)
  RTP Control Protocol (RTCP)
  Example: VoIP Gateway Using RTP/RTCP

Issue 1: Multi-homing & Multi-streaming
  Stream Control Transmission Protocol (SCTP)
    Multi-homing
      An SCTP session can be constructed concurrently over multiple connections through different network adapters
      A heartbeat for each connection
    Multi-streaming
      Supports ordered reception for each stream
      Avoids the head-of-line (HOL) blocking of TCP
    A four-way handshake mechanism for security

Issue 2: Smooth Rate Control and TCP-friendliness
  AIMD is not suitable for streaming
  TCP-friendliness: a flow should
    respond to congestion in the transient state
    use no more bandwidth than a TCP flow in the steady state when both experience the same network conditions, such as packet loss ratio and RTT
  Datagram Congestion Control Protocol (DCCP): free selection of a congestion control scheme

Principle in Action: Streaming: TCP or UDP?
  Why not TCP
    loss retransmission mechanism
    continuous rate fluctuation
  Why not UDP
    too simple
    dropped by network devices for security
  Yet these are the only two mature protocols, so
    UDP is used to carry pure audio streaming, such as VoIP
    TCP is used for streaming with a large buffer, at the cost of delay
      OK for one-way applications, e.g. watching clips from YouTube
      Not OK for interactive applications, such as video conferencing

Issue 3: Playback Reconstruction and Path Quality Report
  Issues: Codec Encapsulation & Path Quality Report
    Data-Plane: Video/Voice Codecs
      Video: H.263…
      Voice: G.729…
    Control-Plane: Delay/Jitter/Loss Report
  RFC Standards: RTP & RTCP
    RTP: Data-Plane, Encapsulating the Chosen Codec
    RTCP: Control-Plane, Reporting Delay/Jitter/Loss to Senders

RTP (Real-Time Transport Protocol)
  Objectives
    Eliminating Packet Reordering & Detecting Loss: Sequence Number
    Timestamp
    Synchronization Source Identifier (SSRC)
    Contributing Source Identifier (CSRC)
  Header Format

RTCP (RTP Control Protocol)
  Objectives
    Reporting End-to-End Delay
    Reporting Delay Jitter
    Reporting Loss Rate
  Report to sender for what?
    Switch to a lower-bitrate codec
    The user may get smoother real-time playback

VoIP using RTP: Multiplexing using SSRC
  One RTP session between VoIP gateways
  Many phone calls between branch offices
  Multiplexing using different SSRC IDs within the RTP session
  [Figure: Phone, Public Telephone Network, VoIP Gateway, Internet or private IP network (IP cloud, with a Gatekeeper), VoIP Gateway, Public Telephone Network, Phone]

VoIP using RTP: Codec Encapsulation
  Inside a VoIP gateway codec (compress/decompress)
    Analog signal source
    Analog-to-digital converter: 16 bits, 8 kHz, 128 kbps (16 bits × 8 kHz = 128 kbps)
      The converter assigns 16 bits evenly distributed across the x,y coordinates of the sine
    Compander (A-law or µ-law): 8 bits, 8 kHz, 64 kbps
      The compander compresses the data
    Digital output signal: 64 kbps

Historical Evolution: RTP Implementation Resources
  Sample implementation in RFC 1889: http://rfc.net/rfc1889.txt
  Vat: http://www-nrg.ee.lbl.gov/vat/
  Rtptools: ftp://ftp.cs.columbia.edu/pub/schulzrinne/rtptools/
  NeVoT: http://www.cs.columbia.edu/~hgs/rtp/nevot.html
  RTP Library: http://www.iasi.rm.cnr.it/iasi/netlab/gettingSoftware.html
    By E. A. Mastromartino; offers convenient ways to incorporate RTP functionality into C++ Internet applications

5.6 Summary (1/2)
  Three key features in process-to-process channels
    (1) port-level addressing, (2) reliable packet delivery, (3) flow rate control
    UDP: (1) only; TCP: all of them
  TCP techniques
    three-way handshake
    ACK/retransmission, sliding-window flow control
    various versions of congestion control to retransmit potentially lost packets

5.6 Summary (2/2)
  Real-time transport by RTP/RTCP
    multi-streaming, multi-homing, smooth rate control, TCP-friendliness, playback reconstruction, and path quality reporting
  Socket interfaces to different layers
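
As a companion to the RTP items recapped in the summary, the following is a minimal sketch, not a full RTP stack, of how a receiver might read the header fields named in this chapter (sequence number, timestamp, SSRC) and use the sequence number for loss detection. It assumes the 12-byte fixed header layout of RFC 1889 (kept unchanged in RFC 3550). The names rtp_header, rtp_parse, and rtp_seq_gap are hypothetical; they come neither from the chapter nor from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the fixed RTP header (hypothetical struct for illustration). */
struct rtp_header {
    uint8_t  version;       /* 2 bits, value 2 for RTP */
    uint8_t  padding;       /* 1 bit */
    uint8_t  extension;     /* 1 bit */
    uint8_t  csrc_count;    /* 4 bits: number of CSRC identifiers that follow */
    uint8_t  marker;        /* 1 bit */
    uint8_t  payload_type;  /* 7 bits: identifies the codec */
    uint16_t sequence;      /* used to detect loss and reordering */
    uint32_t timestamp;     /* sampling instant, used for playback timing */
    uint32_t ssrc;          /* synchronization source identifier */
};

/* Read a 32-bit big-endian (network order) value. */
static uint32_t read_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the fixed header from buf. Returns 0 on success, or -1 if the
 * buffer is too short or the version field is not 2. */
int rtp_parse(const uint8_t *buf, size_t len, struct rtp_header *out) {
    assert(buf != NULL && out != NULL);
    if (len < 12)
        return -1;
    out->version      = buf[0] >> 6;
    out->padding      = (buf[0] >> 5) & 1;
    out->extension    = (buf[0] >> 4) & 1;
    out->csrc_count   = buf[0] & 0x0F;
    out->marker       = buf[1] >> 7;
    out->payload_type = buf[1] & 0x7F;
    out->sequence     = (uint16_t)((buf[2] << 8) | buf[3]);
    out->timestamp    = read_be32(buf + 4);
    out->ssrc         = read_be32(buf + 8);
    if (out->version != 2)
        return -1;
    /* Each CSRC identifier occupies 4 bytes after the fixed header. */
    if (len < 12 + 4 * (size_t)out->csrc_count)
        return -1;
    return 0;
}

/* Number of packets missing between two in-order sequence numbers.
 * The cast makes the subtraction wrap correctly at 16 bits. */
uint16_t rtp_seq_gap(uint16_t prev, uint16_t cur) {
    return (uint16_t)(cur - prev - 1);
}
```

For instance, parsing the 12 bytes 80 12 00 2A 00 00 03 E8 DE AD BE EF (hex) yields version 2, payload type 18, sequence number 42, timestamp 1000, and SSRC 0xDEADBEEF, while rtp_seq_gap(65535, 1) reports one missing packet across the wraparound. A gateway multiplexing many calls in one session, as described earlier, would key its per-call state on the ssrc field.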