Reliable Stream Transport Service (TCP)
Chapter 12

We’ve looked at
  Unreliable connectionless packet delivery service
  And the IP protocol that defines it
Now we will examine
  Reliable stream delivery
  And the Transmission Control Protocol that defines it
TCP is presented as a part of TCP/IP
  Is an independent, general purpose protocol
  Can be adapted for use with other delivery systems

Need for Stream Delivery
At low levels, have unreliable packets
  Lost, destroyed, discarded, duplicated, delayed
  Size constraints affect efficient transfer
Applications need to send lots of data
  Unreliability is tedious and annoying
  Programmers must worry about errors
Goal of network protocol research
  General purpose reliable stream delivery method

Properties of the Service
Interface between applications and TCP/IP has five characteristic features:
Stream Orientation
  Sender provides stream of bits divided into bytes
  Receiver is passed exact same sequence
Virtual Circuit Connection
  Service provides illusion of dedicated circuit
  “Call” setup from one application to the other
  Two OSs talk and settle details
  Continue to communicate during transfer
  If error, detect and report to applications
Buffered Transfer
  Application sends stream in whatever size it wants
  May be as small as a single octet
  Protocol software wants efficient transfer
  Small blocks of data: buffer until get enough for a datagram
  Large blocks of data: break into smaller pieces
  Push mechanism
    When transfer needs to happen before buffer is full
    Application invokes a push
    Data generated until then is sent immediately
    At receiving end, is delivered without delay
  Protocol software may divide stream in unexpected ways
Unstructured Stream
  Applications cannot mark record boundaries
  Must agree that stream service will be unstructured
Full Duplex Connection
  Connections allow concurrent transfer both ways
  Appears as two independent streams in opposite directions
  Can terminate one direction without affecting other
  Control information can be
piggybacked on data

Providing Reliability
Want reliable transfer out of unreliable packet delivery system
Most reliable protocols use a single technique
  Positive acknowledgement with retransmission
  Recipient must send ACK message as it gets data
  Sender keeps record of each packet sent
  If timer expires before the ACK arrives, retransmits packet
Figure 12.1
Can also have duplicate packets
  Network delays may cause premature retransmission
  Both packets and ACKs can be duplicated
  Usually solve by assigning sequence numbers
  Receiver must remember which sequence numbers received
  ACKs include the sequence numbers as well

Sliding Windows
Sending one packet and waiting for ACK wastes time
  Full duplex circuit; have lots of idle time
Sliding window technique used
  More complex form of positive acknowledgement with retransmission
  Uses bandwidth more efficiently
  Sender transmits multiple packets before ACK
  Number of unacknowledged packets limited by window size
Performance depends upon window size
  Size of 1: same as simple positive acknowledgement protocol
  Increase size with goal of sending packets as fast as the network can handle
Conceptually, separate timer for each packet
  Only unacknowledged packets are retransmitted
Receiver has a similar window

TCP Is a Communication Protocol
NOT a piece of software
  TCP is the standard
  Various TCP software implements the standard
Standard includes:
  Format of data and acknowledgements
  Procedures for reliability
  Distinguishing multiple destinations on a machine
  Error recovery procedures
  Initiating and closing a TCP stream transfer
Standard does not include:
  Details of application/TCP interface
  Does not discuss exact procedures to invoke for operations
  Left unspecified for flexibility
    TCP usually implemented in OS
    Can use whatever interface given OS provides
    Single specification for variety of machines
TCP assumes little about underlying system
  Can be used with variety of packet delivery systems (including IP)
  Dialup lines; LAN; high speed fiber; low speed WAN

Ports, Connections, & Endpoints
TCP resides
above IP in the layering scheme
  [Layering diagram, top to bottom: Application; Reliable Stream (TCP) and User Datagram (UDP); Internet (IP); Network Interface]
Multiple applications can communicate concurrently
  Multiplexes and demultiplexes incoming messages
  Uses port numbers (like UDP discussion)
TCP ports more complex
  Using the connection abstraction
  Objects are virtual circuits, not ports
  Connections identified by a pair of endpoints
Endpoint is pair of integers: (host, port)
  host is IP address for a host
  port is TCP port on that host
Pair of endpoints defines connection
  (128.9.0.32, 1184) and (128.10.2.3, 53)
A single TCP port can be shared by multiple connections on the same machine
  (128.2.254.139, 1012) and (128.10.2.3, 53)
  No ambiguity
    Incoming messages associated with connection, not port
    Both endpoints used to identify appropriate connection
Makes things easier for programmers
  Can provide concurrent service without unique ports
  Example: Email
    Multiple computers can send mail concurrently
    Accepting program needs only one TCP port

Passive & Active Opens
TCP is connection-oriented
  Both endpoints must agree to participate
Passive open
  Application at one end tells OS it will accept connection
  OS assigns a TCP port number for its end
Active open
  Done by application wishing to connect
  Tells OS to establish a connection
Two TCP modules communicate
  Establish and verify the connection; then pass data

Segments, Streams, & Sequence Numbers
TCP views the data stream in segments
  Segment contains sequence of octets
  Usually each segment in one IP datagram
Two important problems:
  Efficient transmission
    Good use of available network
  Flow control
    End-to-end problem
    Cannot overflow the receiver’s buffer
Special sliding window protocol used
  Solves both problems
[Figure: current window over octets 1 through 11 of the data stream, with three numbered pointers]
Octets of the data stream are numbered sequentially
  1st pointer: sent and ACKed vs sent and not ACKed
  2nd pointer: end of window
  3rd pointer: boundary between sent and unsent
Receiver maintains a similar window
Full
duplex: software at each end maintains 2 windows
Also allows window size to vary over time
  Each ACK has window advertisement
  Tells how many more octets receiver is willing to accept
Increased advertisement:
  Sender can increase size of sliding window, send more
Decreased advertisement:
  Sender decreases size of sliding window, stops at boundary
Extreme case: receiver sends advertisement of zero, sender stops all transmission
This provides flow control
  Essential in internet environment
Two independent flow problems:
  End-to-end
    Minicomputer communicating with mainframe
  Intermediate systems
    Routers need to control flow, too
    Overloaded router condition is congestion
No explicit congestion control mechanism; uses sliding window
  Good TCP implementation can detect & recover
  Poor implementation can make it worse

TCP Segment Format
Unit of transfer between TCP software on two machines is the segment
Segments are exchanged to:
  Establish connections
  Transfer data
  Send ACKs
    May piggyback on a segment carrying data
  Advertise window size
  Close connections
Figure 12.7
Code Bits field reveals type of segment
  Bit (left to right)    Meaning if bit set to 1
  URG                    Urgent pointer field is valid
  ACK                    Acknowledgement field is valid
  PSH                    Segment requests a push
  RST                    Reset the connection
  SYN                    Synchronize sequence numbers
  FIN                    Sender has reached end of its byte stream

Out of Band Data
Out of band data is sent without waiting for octets already in the stream to be consumed by the receiver
  Ex: to interrupt or abort a program
Use URG bit and URGENT POINTER field
This data is consumed first, regardless of stream position

Maximum Segment Size Option
Not all segments will be of same size
  But, must agree on a maximum size
Uses OPTIONS field
  Can specify MSS (maximum segment size)
  If on same network, may use size such that resulting datagrams match network MTU
  If not, will attempt to discover the minimum MTU along the path
  Or use 536 (default datagram size of 576, minus 40 octets of IP & TCP headers)
Choosing good MSS is difficult
  Too large or too small are both bad
Too small: network utilization is low
  Segments in datagram; datagram in frame
  At least 40 octets of headers
  Small amount of data gives poor utilization
Too large: large IP datagrams
  Probably get fragmented somewhere
  Cannot ACK partial segment
    Must receive all fragments
  More fragments increase probability of losing one
In theory, best MSS is when IP datagrams are as large as possible without being fragmented
Difficult to figure out:
  Most implementations do not have a mechanism for doing so
  Routes can change dynamically
    This may change the MTU of the path
  Optimum size depends on lower level headers
    Segment size must be reduced to account for IP options

Window Scaling Option
WINDOW field is 16 bits
  Limits max window size to 64 Kbytes
  Ok in early networks
  Need more for networks with large delay
Option allows a larger size
  Do not need to know details

Timestamp Option
Used to:
  Help compute delay on underlying network
  Handle “wrap around” sequence numbers
Process:
  Sender: Places timestamp from its clock in message
  Receiver: Copies timestamp field into ACK
  Allows sender to compute elapsed time

TCP Checksum
CHECKSUM contains 16-bit integer
Uses a pseudo header like UDP
  Purpose is just the same
  Verify segment has reached correct destination
Pseudo header layout (32 bits per row; bit positions 0, 8, 16, 31):
  Source IP Address
  Destination IP Address
  Zero | Protocol | TCP Length

ACKs & Retransmission
Hard to refer to datagrams or segments
  Variable length segments
  Retransmitted segments may have more data than original
Instead, use position in stream
  Based on sequence numbers
Cumulative acknowledgement scheme
Receiver collects arriving data octets
  Reconstructs stream of sender
  May have to reorder segments due to out-of-order delivery
  Will have reconstructed zero or more octets
  May have other stream pieces present but out of order
Receiver ACKs longest contiguous prefix
  ACK specifies the next octet expected to be received
Adv: ACKs easy to generate and unambiguous
  Lost ACKs may not force retransmission
Disadv: Only send info about single position in the stream
  Lack of information is inefficient
Imagine window that spans 5000
octets
  Starts with position 101 in the stream
  Sender has sent all data in five segments
  Suppose first segment got lost
  Receiver sends ACK as each segment arrives
  All ACKs specify octet 101 as next expected
  No way to tell sender that all the other data is there
Sender has two choices upon timeout:
  Send all five segments over
  Send only first segment, then wait for ACK to do anything else

Timeout and Retransmission
TCP has a timer for each segment
  If timer goes off before ACK received, retransmit
Different algorithm than other protocols
  Due to internet environment
  Cannot know how quickly ACKs should come
  May span one or many networks
  May encounter router delays
  Must accommodate vast time differences
Figure 12.10

Adaptive Retransmission Algorithm
Used to accommodate varying delays
  Monitors performance of each connection
  Deduces reasonable values for timeouts
  As performance changes, timeout value revised
Must collect data for the algorithm
  Records time each segment sent & when ACK arrives
  Computes elapsed time (sample round trip time)
  Get new sample; adjust average round trip time for the connection
RTT stored as weighted average (usually)
  New round trip samples change the average slowly
  Example:
    RTT = (a * Old_RTT) + ((1 - a) * New_Round_Trip_Sample)
    where: a is the constant weighting factor; 0 < a < 1
Choosing a value close to 1:
  Weighted average only changed small amount
  Immune to changes that last a short time
Choosing a value close to 0:
  Weighted average responds quickly to changes in delay
Timeout value is a function of the current RTT
  Early implementations used constant weighting factor, B (B > 1)
    Timeout = B * RTT
Choosing a value for B is hard
  Close to 1
    Timeout close to current RTT
    Detects packet loss quickly
    Any small delay may cause unnecessary retransmissions
  Original specification recommended B = 2
  Will look at better techniques for timeout

Measuring Round Trip Samples
Measuring round trip sample seems trivial
  But, TCP uses cumulative acknowledgement
  ACK refers
to data received, not datagram that carried it
Consider a retransmission:
  Form segment; put in datagram; send; timer expires
  Send again in second datagram
  Get ACK: for which datagram?
  Called acknowledgement ambiguity
Assume ACK belongs to earliest datagram
  Makes estimated round trip time grow
  Incorrect if the original datagram was really lost
  If many lost, estimate grows arbitrarily large
Assume ACK belongs to latest datagram
  Retransmission may be sent just before ACK arrives
  Decreases the timeout time
  Makes things worse; more retransmissions
  Estimate will eventually stabilize
    RTT will be slightly less than ½ of the correct value
    Every segment sent twice even though no loss occurs

Karn’s Algorithm
If associating ACK with earliest or most recent are both wrong, what to do?
  Do not update estimate on retransmitted segments
  Idea known as Karn’s Algorithm
  Avoids ambiguous acknowledgement problem
Simplistic implementation can be a problem
  Get sharp increase in delay; do some retransmissions
  Ignore ACKs for retransmissions; no new estimate
Must also use a timer backoff strategy
  Compute initial timeout with round trip estimate
  If timer expires and causes retransmission, increase the timeout (within a bound)
    Most implementations multiply timeout by 2
  Next segment timed with new timeout
  Continues backoff until a segment is sent without retransmitting
    Computes new round trip estimate
    Resets timeout accordingly
Shown to work well even with high packet loss

High Variance in Delay
Computations do not respond well to wide range of variation in delay
  Variation in RTT
    Proportional to 1/(1 - network load)
Original TCP standard estimated RTT as shown earlier
  Limiting B to 2 can adapt to loads of at most 30%
1989 spec requires estimates of both average RTT and variance
  Must use variance in place of constant B
Approximations are computationally easy
  DIFF = SAMPLE - Old_RTT
  Smoothed_RTT = Old_RTT + d * DIFF
  DEV = Old_DEV + p * (|DIFF| - Old_DEV)
  Timeout = Smoothed_RTT + e * DEV
Where:
  DEV is the estimated mean
deviation
  d is fraction between 0 & 1; controls effect on weighted average
  p is fraction between 0 & 1; controls effect on mean deviation
  e is a factor controlling how much deviation affects round trip timeout
  (Research suggests choosing d and p as inverse powers of 2, scaling the computation by 2^n, and using integer arithmetic, with d = 1/2^3, p = 1/2^2, n = 3, and e = 4)
Figure 12.11
Figure 12.12

Response to Congestion
TCP software must deal with congestion
  Severe delay caused by an overload of datagrams
Congestion occurs at routers
  Routers have finite storage
  When run out of storage, start dropping datagrams
Endpoints do not know where congestion is
  Just see increased delay
  Get timeouts; send more datagrams (retransmissions)
  May cause congestion collapse
TCP must reduce transmission rate
  ICMP source quench messages inform hosts of congestion
  TCP needs to help
  Want to automatically reduce transmission rates when congestion occurs
TCP standard recommends two techniques
  Slow-start
  Multiplicative Decrease

Multiplicative Decrease
TCP must already use receiver’s window size
Keep another window size to use during congestion
  Called congestion window
  At any time, the allowed window is:
    min(receiver_advertisement, congestion_window)
  During non-congestion, both are same
To estimate congestion window size, TCP assumes most datagram loss comes from congestion
Upon segment loss:
  Reduce congestion window by half (minimum of one segment)
  For segments still in window, back off timer exponentially
  Done for every loss; quickly clears router traffic

Slow-start
How to recover when congestion ends?
  If do the reverse (double the congestion window): unstable
  Use slow-start recovery
When starting traffic on connection or after congestion
  Start window at size of single segment
  Increase by one segment every time get an ACK
  Avoids swamping
Not so slow actually:
  log2(N) round trips until can send N segments
One other restriction: congestion avoidance phase
  When congestion window reaches ½ of its original size, increase by 1 segment only if all segments in the window have been ACKed
Overall, known as Additive Increase Multiplicative Decrease (AIMD)
Techniques powerful when combined
  Slow-start increase
  Multiplicative decrease
  Additive increase
  Measurement of variation
  Exponential timer backoff
Improve TCP performance dramatically
  Add very little computational overhead
  Performance improves by factors of 2 to 10

Fast Recovery & Other Modifications
Heuristic used where loss is infrequent
  Uses info from cumulative ACK scheme
  Can resend data before timer expires
  Do not need to know details

Explicit Feedback Mechanisms
Most TCP versions use implicit techniques:
  Timeout and duplicate ACKs to detect loss
  Changes in RTT to detect congestion
Two explicit techniques have been proposed
  Selective Acknowledgement (SACK)
  Explicit Congestion Notification (ECN)
SACK
  Can specify exactly which data has been received and which is missing
  Sender knows which segment(s) to retransmit
  TCP provides two options for SACK
    Do not need to know details
  Does not replace cumulative ACK mechanism
    Nor is it mandatory
ECN
  Used to notify TCP about congestion
  As a TCP segment goes through routers:
    Two bits in IP header used to record congestion
    When segment arrives, receiver knows
    Sender needs to know; receiver uses ACK to tell
  IP header bits: Taken from TOS field
  TCP header bits: Taken from reserved area

Congestion, Tail Drop, and TCP
Protocols are layered
  Layers operate in isolation
  TCP at source/destination cannot interact with lower layer elements along the path
  TCP does not know condition of network
  TCP does not notify lower layers before
transferring data
Policies used by routers can affect TCP
  Both a single connection and aggregate of all connections
Example:
  Router delays some datagrams more than others
  TCP backs off retransmission timer
  If delay exceeds timer, TCP assumes congestion
Layers are defined independently, but they interact
  Thus, try to define mechanisms in one layer to work well with protocols in others
Important interaction between TCP and IP
  Router is overrun and begins to drop datagrams
  Early router software used tail-drop policy
    If input queue is full when datagram arrives, drop it
Interesting effect on TCP
  If segments are from a single TCP connection:
    TCP enters slow-start until it begins receiving ACKs
  If segments are from multiple TCP connections:
    All N instances of TCP enter slow-start at same time
    Causes global synchronization

Random Early Detection
Routers need to avoid global synchronization
Use scheme to avoid tail-drop when possible
  Called Random Early Detection (RED)
  (or Random Early Discard or Random Early Drop)
Uses two markers in queue: Tmin and Tmax
Three rules:
  If queue contains fewer than Tmin datagrams, add new one
  If queue contains more than Tmax datagrams, discard new one
  If queue contains between Tmin and Tmax datagrams, randomly discard the datagram with probability p
Randomness keeps from waiting for overflow
  Router slowly and randomly drops datagrams as congestion increases
  Keeps from putting all TCP connections in slow-start
Key is in choice of the thresholds and p
  Tmin must be large enough to utilize output link
  Tmax must exceed Tmin by more than the typical increase in queue size during one round trip time
Discard probability is most complex choice
  Does not use a constant; computed for each datagram
  Can vary probability from 0 (queue size at Tmin) to 1 (queue size at Tmax) in a linear fashion
Linear scheme forms the basis of probability p
  Must avoid overreacting to bursty traffic
If short burst
  Do not drop datagrams because queue will not overflow
  But, cannot postpone discard indefinitely
Long burst
  Will overflow queue and start tail-drop
Use weighted average technique
  Does not use actual queue size at any instant
  Compute weighted average queue size
  Update each time a datagram arrives
    Avg = (1 - g) * Old_avg + g * Current_queue_size
    where g is a value between 0 and 1
Some details glossed over
  Computations very efficient if:
    Choose constants as powers of 2
    Use integer arithmetic
  Measurement of queue size
    Time required to forward datagram proportional to size
    Measure queue size in octets versus datagrams
  Affects type of traffic dropped
    Discard probability proportional to amount of data
    Not based on number of segments
    Smaller datagrams: less probability of being dropped
    Good for ACKs, remote login traffic, etc.
Analysis and simulation show RED works

Establishing a TCP Connection
Use a 3-way handshake
  Is both necessary and sufficient for correct synchronization
  Also uses rule that additional requests for connection are ignored if connection established
  Can initiate connection from both ends simultaneously

Figure 12.13 The sequence of messages in a three-way handshake. Time proceeds down the page; diagonal lines represent segments sent between sites. SYN segments carry initial sequence number information.
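The exchange in Figure 12.13 can be sketched in a few lines. This is a minimal illustration, not how a TCP implementation is structured: `Segment` and `three_way_handshake` are hypothetical names, and `x` and `y` stand for the initial sequence numbers each side picks. The 32-bit wraparound and the rule that each acknowledgement names the other side's initial sequence number plus one come from the TCP specification rather than from these slides.

```python
from dataclasses import dataclass

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits and wrap around


@dataclass
class Segment:
    """Hypothetical, stripped-down segment: only the fields the handshake uses."""
    syn: bool = False
    ack: bool = False
    seq: int = 0
    ack_no: int = 0


def three_way_handshake(x, y):
    """Return the three segments exchanged when the initiator picks
    initial sequence number x and the responder picks y."""
    # 1. Initiator sends SYN carrying its initial sequence number.
    first = Segment(syn=True, seq=x)
    # 2. Responder sends its own SYN and acknowledges the initiator's.
    second = Segment(syn=True, ack=True, seq=y,
                     ack_no=(first.seq + 1) % SEQ_MOD)
    # 3. Initiator acknowledges the responder's SYN.
    third = Segment(ack=True, seq=(x + 1) % SEQ_MOD,
                    ack_no=(second.seq + 1) % SEQ_MOD)
    return [first, second, third]


if __name__ == "__main__":
    for seg in three_way_handshake(x=1000, y=5000):
        print(seg)
```

After the third segment both sides know and have confirmed each other's starting sequence number, which is the agreement the handshake exists to establish.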
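Returning to Random Early Detection, the three rules and the weighted average queue size can be combined in a short sketch. Several things here are assumptions made for illustration: the class name `RedQueue`, measuring the queue in datagrams rather than octets, applying the Tmin/Tmax thresholds to the weighted average, the default `g = 0.25`, and the injectable `rand` function. The linear probability follows the 0-to-1 scheme given in these slides; deployed RED variants are more elaborate.

```python
import random


class RedQueue:
    """Simplified Random Early Detection queue (illustrative only)."""

    def __init__(self, t_min, t_max, g=0.25, rand=random.random):
        if not t_min < t_max:
            raise ValueError("t_min must be smaller than t_max")
        self.t_min = t_min
        self.t_max = t_max
        self.g = g            # weight given to the current queue size
        self.rand = rand      # source of random numbers in [0, 1)
        self.avg = 0.0        # weighted average queue size
        self.queue = []

    def drop_probability(self):
        """Linear ramp from 0 at t_min to 1 at t_max."""
        if self.avg < self.t_min:
            return 0.0
        if self.avg > self.t_max:
            return 1.0
        return (self.avg - self.t_min) / (self.t_max - self.t_min)

    def arrive(self, datagram):
        """Process one arriving datagram; return True if it was queued."""
        # Avg = (1 - g) * Old_avg + g * Current_queue_size
        self.avg = (1 - self.g) * self.avg + self.g * len(self.queue)
        if self.rand() < self.drop_probability():
            return False      # discarded early
        self.queue.append(datagram)
        return True


if __name__ == "__main__":
    q = RedQueue(t_min=5, t_max=15)
    accepted = sum(q.arrive(i) for i in range(100))
    print("accepted", accepted, "of 100; average queue size", round(q.avg, 2))
```

Because nothing is ever removed from the queue in the demo, the average keeps climbing and the drop probability rises with it, which is the gradual behaviour RED aims for instead of a sudden tail-drop.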
Initial Sequence Numbers
3-way handshake accomplishes 2 functions
  Guarantees both sides ready to transfer data
  Sets up agreement on initial sequence numbers
Each machine can choose initial number at random
  Cannot start at 1 each time
Numbers set in three messages
  First machine: sends x
  Second machine: records x, sends y and ACKs x
  First machine: ACKs y
Possible to send data with handshake segments
  Included with the initial sequence numbers
  TCP software must buffer until handshake done
  Once connection established, can release the data to the application program quickly

Closing a TCP Connection
Close operation used to terminate gracefully
Connections are full duplex
  When application tells TCP it is done, TCP closes the connection in one direction
  Sending TCP sends remaining data
  Waits for receiver ACK
  Sends segment with FIN bit set
Receiver ACKs the FIN segment and informs its application that data is done
  Can still send data in opposite direction
When both directions closed, TCP deletes its record of the connection
Modified 3-way handshake is used to close

Figure 12.14 The modified three-way handshake used to close connections. The site that receives the first FIN segment acknowledges it immediately, and then delays before sending the second FIN segment.
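The one-direction-at-a-time close can be modelled with a few lines of bookkeeping. This is only a sketch of the behaviour described above: the class `DuplexConnection`, the site labels "A" and "B", and the log strings are hypothetical, and the FIN and ACK are recorded as events rather than sent anywhere.

```python
class DuplexConnection:
    """Tracks which directions of a full duplex connection are still open."""

    def __init__(self):
        self.open_directions = {("A", "B"), ("B", "A")}
        self.log = []

    def can_send(self, site):
        """True while the direction leaving this site is still open."""
        return any(src == site for src, _ in self.open_directions)

    def close(self, site):
        """The application at this site tells TCP it has no more data."""
        peer = "B" if site == "A" else "A"
        if (site, peer) not in self.open_directions:
            raise ValueError(site + " has already closed its direction")
        self.log.append(site + " sends FIN")
        self.log.append(peer + " ACKs FIN and tells its application data is done")
        self.open_directions.discard((site, peer))

    def is_deleted(self):
        """TCP deletes its record once both directions are closed."""
        return not self.open_directions


if __name__ == "__main__":
    conn = DuplexConnection()
    conn.close("A")
    print("B can still send:", conn.can_send("B"))
    conn.close("B")
    print("record deleted:", conn.is_deleted())
    print("\n".join(conn.log))
```

The delay in Figure 12.14 corresponds to the gap between the two `close` calls: the second site keeps its direction open until its own application decides it is finished.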
TCP Connection Reset
Close operation used for normal shutdown
Sometimes abnormal conditions arise
  Force the connection to be broken
TCP has a reset for such conditions
  One side sends segment with RST bit set
  Other side responds immediately by aborting connection
  TCP informs application that connection was reset
  Transfer in both directions ceases immediately

TCP State Machine
Operation of TCP can be explained with a theoretical model called a finite state machine
  Circles represent states
  Arrows represent transitions between them
Figure 12.15 (A, B)

Forcing Data Delivery
Data stream usually buffered
  Accumulate enough octets for efficient transfer
May need to send data before get a lot
  Example: interactive terminal keystrokes
Push operation forces delivery of octets
  Also sets PSH bit in segment code field
  Causes delivery of data to destination application

Reserved TCP Port Numbers
Combines static and dynamic port binding
  Like UDP
Many of the port numbers are the same for services accessible by both TCP and UDP
  See Figure 12.16

TCP Performance
TCP is a complex protocol
  Handles wide variety of underlying technologies
  Generality does not hinder TCP performance
Research done at Berkeley
  Shows that same TCP that gives efficient internet operation can sustain 8 Mbps throughput between two stations on 10 Mbps Ethernet
Cray Research: TCP throughput approaching 1 Gbps

Silly Window Syndrome
TCP can have serious performance problem
  Caused when sender & receiver operate at different speeds
If receiver reads data one octet at a time
  Sender quickly fills buffer
  Must wait for window advertisement
  Gets advertisement for one octet
  Results in many small segments
  Inefficient use of bandwidth and lots of overhead
If sender sends data one octet at a time
  Ends up with same problem
Known as silly window syndrome
  Early TCP implementations exhibited the problem
  Each ACK advertises small amount of space
  Causes each segment to carry a small amount of data

Avoiding Silly Window Syndrome
TCP specs
include heuristics to avoid SWS
  On sender, avoids sending small data amounts
  On receiver, avoids sending small advertisements
  TCP software should contain both

Receive-side silly window avoidance
Receiver maintains currently available window
  Delays advertising until can advance window a “significant” amount
  Significant means the minimum of:
    ½ of the receiver’s buffer, or
    Number of octets in a maximum-sized segment
Summary of technique:
  Before sending an updated advertisement after advertising a zero window
  Wait for space of at least 50% of total buffer or one maximum-sized segment
Two approaches for implementation
  ACK each arriving segment, but do not advertise until allowed
  Delay sending ACK if window too small to advertise
Standard recommends using delayed ACKs
  Adv: delayed ACKs decrease traffic, increase throughput
    One ACK for all data received during delay
    May get outgoing data segment to piggyback on
    If data read quickly, ACK and advertisement can go in one segment
  Disadv: May get retransmissions if delay too long
    Bad round trip time estimates
  Cannot delay more than 500 ms
  Recommend receiver ACK every other data segment

Send-side silly window avoidance
Goal is to avoid sending small segments
Use clumping
  Delay sending until get reasonable amount of data
  How long should TCP wait?
    Too long: application has large delays
      Cannot know when application will send more data
    Not long enough: get small segments
  Fixed delay not optimal for all applications
Uses an adaptive algorithm
  Delay depends on current internet performance
  Does not compute delays
  Uses arrival of ACK to trigger transmission of additional packets
Heuristic:
  Application generates more data to send
  Buffer if previous data sent but not ACKed
  Wait until get enough for maximum-sized segment
  If waiting when ACK arrives, send all data in buffer
  Apply rule even when push operation requested
If application fast compared to network
  Successive segments have many octets
If application slow compared to network
  Small segments get sent without long delay
Known as the Nagle algorithm
  Elegant due to little computational overhead
  Adapts to arbitrary combinations of:
    network delay
    maximum segment size
    application speed
  But does not lower throughput in normal cases

Summary
TCP defines reliable stream delivery service
  Full duplex connection
  Exchange large volumes of data efficiently
  Sliding window gives efficient network use
Few assumptions of underlying network
  Flexible for wide variety of delivery systems
Has flow control
  Flexible for systems with differing speeds
Basic unit of transfer is a segment
  Pass data or control information
  Permits piggyback of ACKs
Flow control
  Implemented by receiver advertisements
Urgent facility supports out-of-band messages
Push mechanism forces delivery
TCP standard specifies
  Exponential backoff for retransmission timers
  Congestion avoidance algorithms
    Slow-start
    Multiplicative decrease
    Additive increase
Uses heuristics to avoid small packets
Recommends using RED versus tail-drop
  Avoids TCP synchronization
  Improves throughput
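The send-side heuristic known as the Nagle algorithm can be sketched as a small state machine. The class `NagleSender` and its method names are hypothetical, and the sketch assumes one simplification that real TCP does not make: a single ACK acknowledges everything outstanding.

```python
class NagleSender:
    """Illustrative model of send-side silly window avoidance."""

    def __init__(self, mss):
        self.mss = mss          # maximum segment size in octets
        self.buffer = b""       # data accepted from the application, not yet sent
        self.unacked = False    # True while sent data is awaiting an ACK
        self.sent = []          # segments handed to the network, in order

    def _transmit(self, data):
        self.sent.append(data)
        self.unacked = True

    def _send_full_segments(self):
        while len(self.buffer) >= self.mss:
            self._transmit(self.buffer[:self.mss])
            self.buffer = self.buffer[self.mss:]

    def _flush(self):
        self._send_full_segments()
        if self.buffer:
            self._transmit(self.buffer)
            self.buffer = b""

    def write(self, data):
        """The application generates more data to send."""
        self.buffer += data
        if self.unacked:
            # Previous data not yet ACKed: wait for a maximum-sized segment.
            self._send_full_segments()
        else:
            # Nothing outstanding: send at once, even if the segment is small.
            self._flush()

    def on_ack(self):
        """An ACK arrives: send whatever has accumulated in the buffer."""
        self.unacked = False
        if self.buffer:
            self._flush()


if __name__ == "__main__":
    s = NagleSender(mss=4)
    for ch in (b"a", b"b", b"c"):
        s.write(ch)
    s.on_ack()
    print(s.sent)
```

A slow application sees each small write go out as soon as the previous one is acknowledged, while a fast application fills whole segments; no timer or delay computation is needed.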
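The retransmission timer pieces covered earlier (the DIFF, Smoothed_RTT, DEV and Timeout formulas, Karn's algorithm, and timer backoff) fit together as sketched here. The class `RttEstimator`, the choice to start the deviation at zero, and the 64-second ceiling on backoff are assumptions made for illustration; the constants d = 1/8, p = 1/4 and e = 4 are the values suggested in the slides, and floating point is used here instead of the scaled integer arithmetic they recommend.

```python
class RttEstimator:
    """Adaptive retransmission timeout with Karn's algorithm and backoff."""

    def __init__(self, first_sample, d=1 / 8, p=1 / 4, e=4, max_timeout=64.0):
        self.rtt = first_sample      # smoothed round trip time
        self.dev = 0.0               # estimated mean deviation (assumed start)
        self.d = d
        self.p = p
        self.e = e
        self.max_timeout = max_timeout   # assumed bound on backoff
        self.timeout = self.rtt + self.e * self.dev

    def on_ack(self, sample, retransmitted):
        """Record an ACK; sample is the measured round trip time."""
        if retransmitted:
            # Karn's algorithm: the sample is ambiguous, so ignore it
            # and keep the backed-off timeout.
            return
        diff = sample - self.rtt                       # DIFF
        self.rtt += self.d * diff                      # Smoothed_RTT
        self.dev += self.p * (abs(diff) - self.dev)    # DEV
        self.timeout = self.rtt + self.e * self.dev    # Timeout

    def on_timer_expired(self):
        """Timer backoff: double the timeout, within a bound."""
        self.timeout = min(self.timeout * 2, self.max_timeout)


if __name__ == "__main__":
    est = RttEstimator(first_sample=1.0)
    est.on_ack(2.0, retransmitted=False)
    print("timeout after a slow sample:", est.timeout)
    est.on_timer_expired()
    print("timeout after one backoff:", est.timeout)
```

The backed-off value stays in force until an ACK arrives for a segment that was sent only once, at which point the timeout is recomputed from the fresh estimate.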