Chapter 12

Reliable Stream
Transport Service (TCP)
We’ve looked at
Unreliable connectionless packet delivery
And the IP protocol that defines it
Now we will examine
Reliable stream delivery
And the Transmission Control Protocol that
defines it
TCP is presented as a part of TCP/IP
Is independent, general purpose protocol
Can be adapted for use with other delivery systems
Need for Stream Delivery
At low levels, have unreliable packets
Lost, destroyed, discarded, duplicated, delayed
Size constraints affect efficient transfer
Applications need to send lots of data
Unreliability is tedious and annoying
Programmers must worry about errors
Goal of network protocol research
General purpose reliable stream delivery
Properties of the Service
Interface between applications and
TCP/IP has five characteristic features:
Stream Orientation
Sender provides stream of bits divided into bytes
Receiver is passed exact same sequence
Virtual Circuit Connection
Service provides illusion of dedicated circuit
“Call” setup from one application to the other
Two OSs talk and settle details
Continue to communicate during transfer
If error, detect and report to applications
Buffered Transfer
Application sends stream in whatever size pieces it wants
May be as small as a single octet
Protocol software wants efficient transfer
Small blocks of data: buffer until get enough for a datagram
Large blocks of data: break into smaller pieces
Push mechanism
When transfer needs to happen before buffer is full
Application invokes a push
Data generated until then is sent immediately
At receiving end, is delivered without delay
Protocol software may divide stream in unexpected ways
Unstructured Stream
Applications cannot mark record boundaries
Must agree that stream service will be unstructured
Full Duplex Connection
Connections allow concurrent transfer both ways
Appears as two independent streams in opposite directions
Can terminate one direction without affecting other
Control information can be piggybacked on data
Providing Reliability
Want reliable transfer out of unreliable
packet delivery system
Most reliable protocols use a single technique
Positive acknowledgement with retransmission
Recipient must send ACK message as it gets data
Sender keeps record of each packet sent
If timer expires for an ACK, retransmits packet
Figure 12.1
Can also have duplicate packets
Network delays may cause premature retransmission
Both packets and ACKs can be duplicated
Usually solve by assigning sequence numbers
Receiver must remember which sequence numbers it has received
ACKs include the sequence numbers as well
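A minimal sketch of positive acknowledgement with retransmission, assuming a simulated channel in which chosen transmissions (or their ACKs) are dropped. The function name and the `lost_data` / `lost_acks` parameters are hypothetical, for illustration only.

```python
def stop_and_wait(packets, lost_data=(), lost_acks=()):
    """Deliver packets one at a time, retransmitting until each is ACKed.

    lost_data / lost_acks: transmission indices (0-based, counted across
    all sends) whose data packet or returning ACK the network drops.
    Returns (delivered payloads, total transmissions).
    """
    delivered = []
    expected_seq = 0          # receiver: next sequence number it will accept
    transmissions = 0
    for seq, payload in enumerate(packets):
        acked = False
        while not acked:
            n = transmissions
            transmissions += 1
            if n in lost_data:
                continue      # timer expires with no ACK: retransmit
            if seq == expected_seq:
                delivered.append(payload)   # new data
                expected_seq += 1
            # otherwise a duplicate: discarded, but still acknowledged
            if n in lost_acks:
                continue      # ACK lost, timer expires: retransmit
            acked = True
    return delivered, transmissions
```

With one lost packet and one lost ACK, the receiver still sees each payload exactly once because it checks sequence numbers before accepting data.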
Sliding Windows
Sending one packet and waiting for ACK
wastes time
Full duplex circuit; have lots of idle time
Sliding window technique used
More complex form of positive ack & retrans
Use bandwidth more efficiently
Sender transmits multiple packets before ACK
Number of unacknowledged packets limited
by window size
Performance depends upon window size
Size of 1: same as simple positive ack protocol
Increase size with goal of sending packets as fast as
the network can handle
Conceptually, separate timer for each packet
Only unack’ed packets are retransmitted
Receiver has a similar window
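A small sketch of the sender side of a sliding window, assuming ACKs arrive in order and each ACK covers everything up to the acknowledged packet. The class and method names are hypothetical.

```python
class SlidingWindowSender:
    def __init__(self, window_size):
        self.window_size = window_size
        self.base = 0        # oldest unacknowledged packet
        self.next_seq = 0    # next packet to transmit

    def can_send(self):
        # Number of unacknowledged packets is limited by the window size.
        return self.next_seq - self.base < self.window_size

    def send(self):
        if not self.can_send():
            raise RuntimeError("window full: wait for an ACK")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, seq):
        # Slide the window past everything up to and including seq.
        if self.base <= seq < self.next_seq:
            self.base = seq + 1

    def unacked(self):
        # Packets that would be retransmitted if their timers expired.
        return list(range(self.base, self.next_seq))
```

With `window_size=1` this behaves like the simple positive acknowledgement protocol: each send must wait for the previous ACK.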
TCP is a communication protocol
NOT a piece of software
TCP is the standard
Various TCP software implements the standard
Standard includes:
Format of data and acknowledgments
Procedures for reliability
Distinguish multiple destinations on a machine
Error recovery procedures
Initiation and closing a TCP stream transfer
Standard does not include:
Details of application/TCP interface
Does not discuss exact procedures to invoke for operations
Not specified for flexibility
TCP usually implemented in OS
Can use whatever interface given OS provides
Single specification for variety of machines
TCP assumes little about underlying communication system
Can be used with variety of packet delivery
systems (including IP)
Dialup lines; LAN; high speed fiber; low speed WAN
Ports, Connections, & Endpoints
TCP resides above IP in the layering
Reliable Stream (TCP)
User Datagram (UDP)
Internet (IP)
Network Interface
Multiple applications can communicate
Multiplexes and demultiplexes incoming msgs
Uses port numbers (like UDP discussion)
TCP ports more complex
Using the connection abstraction
Objects are virtual circuits, not ports
Connections identified by a pair of endpoints
Endpoint is pair of integers: (host, port)
host is IP address for a host
port is TCP port on that host
Pair of endpoints defines connection
(, 1184) and (, 53)
A single TCP port can be shared by multiple
connections on the same machine
(, 1012) and (, 53)
No ambiguity
Incoming messages associated with connection, not port
Both endpoints used to identify appropriate connection
Makes things easier for programmers
Can provide concurrent service without unique ports
Example: Email
Multiple computers can send mail concurrently
Accepting program needs only one TCP port
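A sketch of demultiplexing by connection rather than by port. The addresses below are hypothetical (taken from the 192.0.2.0/24 documentation range), and the table and function names are illustrative only.

```python
# One server endpoint shared by several connections.
SERVER = ("192.0.2.10", 25)

# Each connection is identified by a pair of endpoints (host, port).
connections = {
    (SERVER, ("192.0.2.20", 1184)): "mail session A",
    (SERVER, ("192.0.2.30", 1184)): "mail session B",
    (SERVER, ("192.0.2.20", 1185)): "mail session C",
}

def demultiplex(dst_endpoint, src_endpoint):
    """Return the connection an arriving segment belongs to, or None."""
    return connections.get((dst_endpoint, src_endpoint))
```

All three sessions use the same server port, yet there is no ambiguity because both endpoints are part of the lookup key.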
Passive & Active Opens
TCP is connection-oriented
Both endpoints must agree to participate
Passive open
Application at one end tells OS it will accept connection
OS assigns a TCP port number for its end
Active open
Done by application wishing to connect
Tells OS to establish a connection
Two TCP modules communicate
Establish and verify the connection; then pass data
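One way to see passive and active opens is through the socket interface most operating systems provide. This is a sketch using Python's standard `socket` module on the loopback address; the function name is hypothetical, and the TCP standard itself does not specify this interface.

```python
import socket
import threading

def passive_then_active(message=b"hello"):
    # Passive open: tell the OS we will accept a connection.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS assign a port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []

    def serve():
        conn, _peer = listener.accept()
        conn.settimeout(5)
        with conn:
            chunks = []
            while True:
                data = conn.recv(1024)
                if not data:          # peer closed its sending direction
                    break
                chunks.append(data)
            received.append(b"".join(chunks))

    worker = threading.Thread(target=serve)
    worker.start()

    # Active open: ask the OS to establish the connection.
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(message)
    worker.join()
    listener.close()
    return received[0]
```

The receiver reads until end of stream because TCP may deliver the bytes in pieces of any size.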
Segments, Streams, & Sequence Numbers
TCP views the data stream in segments
Segment contains sequence of octets
Usually each segment in one IP datagram
Two important problems:
Efficient transmission
Good use of available network
Flow control
End-to-end problem
Cannot overflow the receiver’s buffer
Special sliding window protocol used
Solves both problems
Figure: current window over the numbered octets of the data stream
Octets of the data stream are numbered sequentially
1st pointer: sent and ACKed vs sent and not ACKed
2nd pointer: end of window
3rd pointer: boundary between sent and unsent
Receiver maintains a similar window
Full duplex: SW at each end maintains 2 windows
Also allows window size to vary over time
Each ACK has window advertisement
Tells how many more octets willing to accept
Increased advertisement:
Sender can increase size of sliding window, send more
Decreased advertisement:
Sender decreases size of sliding window, stop at boundary
Extreme case: sends advertisement of zero, stops all
This provides flow control
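A sketch of how a window advertisement bounds what the sender may transmit. The function and parameter names are hypothetical; positions are octet numbers in the stream.

```python
def usable_window(ack_next, next_to_send, advertised):
    """Octets the sender may still transmit.

    ack_next:     next octet the receiver expects (from the latest ACK)
    next_to_send: boundary between sent and unsent octets
    advertised:   window advertisement carried in the latest ACK
    """
    right_edge = ack_next + advertised        # end of the window
    return max(0, right_edge - next_to_send)  # stop at the boundary
```

An advertisement of zero yields a usable window of zero, which stops all transmission until a larger advertisement arrives.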
Essential in internet environment
Two independent flow problems:
Minicomputer communicating with mainframe
Intermediate systems
Routers need to control flow, too
Overloaded router condition is congestion
No explicit congestion control mechanism; uses sliding window
Good TCP implementation can detect & recover
Poor implementation can make it worse
TCP Segment Format
Unit of TCP/IP sw transfer is segment
Establish connections
Transfer data
Send ACKs
May piggyback on a segment carrying data
Advertise window size
Close connections
Figure 12.7
Code Bits field reveals type of segment
Bit (left to right)   Meaning if bit set to 1
URG                   Urgent pointer field is valid
ACK                   Acknowledgement field is valid
PSH                   Segment requests a push
RST                   Reset the connection
SYN                   Synchronize sequence numbers
FIN                   Sender has reached end of its byte stream
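A sketch of decoding the six code bits, assuming the field has been extracted as an integer with URG as the leftmost (most significant) bit. The table and function names are hypothetical.

```python
# Masks for the 6-bit CODE BITS field, listed left to right.
CODE_BITS = [
    (0x20, "URG"),
    (0x10, "ACK"),
    (0x08, "PSH"),
    (0x04, "RST"),
    (0x02, "SYN"),
    (0x01, "FIN"),
]

def decode_code_bits(value):
    """Return the names of the bits set in a code-bits value."""
    return [name for mask, name in CODE_BITS if value & mask]
```

For example, the second segment of a connection setup has both SYN and ACK set, which is the value 0x12.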
Out of Band Data
Out of Band
Data sent without waiting for octets in the stream
to be consumed by the receiver
Ex: to interrupt or abort a program
Use urgent bit and URGENT POINTER field
This data is consumed first, regardless of stream
Maximum Segment Size Option
Not all segments will be of same size
But, must agree on a maximum size
Uses OPTIONS field
Can specify MSS (maximum segment size)
If on same network, may use size such that
resulting datagrams match network MTU
If not, will attempt to discover the minimum MTU
along the path
Or use 536 (default datagram size of 576, minus IP & TCP headers)
Choosing good MSS is difficult
Too large or too small are both bad
Too small: network utilization is low
Segments in datagram; datagram in frame
At least 40 octets of headers
Small amount of data gives poor utilization
Too large: large IP datagrams
Probably get fragmented somewhere
Cannot ACK partial segment
Must receive all fragments
More fragments increases probability of losing one
In theory, best MSS is when IP datagrams
are as large as possible without being fragmented
Difficult to figure out:
Most implementations do not have a mechanism
for doing so
Routes can change dynamically
This may change the MTU of the path
Optimum size depends on lower level headers
Segment size must be reduced to account for IP options
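A quick arithmetic sketch of why small segments waste the network, assuming 20-octet IP and TCP headers with no options and ignoring frame headers. The names are hypothetical.

```python
HEADER_OCTETS = 40   # 20-octet IP header + 20-octet TCP header, no options

# Default MSS: default datagram size minus the two headers.
DEFAULT_MSS = 576 - HEADER_OCTETS

def payload_fraction(data_octets):
    """Fraction of each datagram that is application data."""
    return data_octets / (data_octets + HEADER_OCTETS)
```

A segment carrying a single octet is about 2% data, while a segment of the default MSS is about 93% data.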
Window Scaling Option
WINDOW field is 16 bits
Limits max window size to 64 Kbytes
Ok in early networks
Need more for networks with large delay
Option allows a larger size
Do not need to know details….
Timestamp Option
Used to:
Help compute delay on underlying network
Handle “wrap around” sequence numbers
Places timestamp from its clock in message
Copies timestamp field into ack
Allows sender to compute elapsed time
TCP Checksum
CHECKSUM contains 16-bit integer
Uses a pseudo header like UDP
Purpose is just the same
Verify segment has reached correct destination
Source IP Address
Destination IP Address
TCP Length
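A sketch of the checksum computation over the pseudo header and the segment, assuming IPv4 addresses and the standard 16-bit one's complement sum. The function names and the addresses in any call are hypothetical; the segment is passed as raw bytes with its CHECKSUM field set to zero when computing.

```python
import socket
import struct

def ones_complement_sum(data):
    if len(data) % 2:
        data += b"\x00"          # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return total

def tcp_checksum(src_ip, dst_ip, segment):
    # Pseudo header: source address, destination address,
    # a zero octet, protocol number 6 (TCP), and the TCP length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(segment)))
    return (~ones_complement_sum(pseudo + segment)) & 0xFFFF
```

A receiver can run the same function over a segment with the checksum filled in; a result of zero means the segment and the addresses it was delivered with are consistent.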
ACKs & Retransmission
Hard to refer to datagrams or segments
Variable length segments
Retransmitted segments may have more data
than original
Instead, use position in stream
Based on sequence numbers
Cumulative acknowledgement scheme
Receiver collects arriving data octets
Reconstructs stream of sender
May have to reorder segments due to out-of-order delivery
Will have reconstructed zero or more octets
May have other stream pieces present but out of order
Receiver ACKs longest contiguous prefix
ACK specifies the next octet expected to be received
ACKs easy to generate and unambiguous
Lost ACKs may not force retransmission
Only send info about single position in the stream
Lack of information is inefficient
Imagine window that spans 5000 octets
Starts with position 101 in the stream
Sender has sent all data in five segments
Suppose first segment got lost
Receiver sends ACK as each segment arrives
All ACKs specify octet 101 as next expected
No way to tell sender that all the other data is there
Sender has two choices upon timeout:
Send all five segments over
Send only first segment, then wait for ACK to do
anything else
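The five-segment example above can be sketched as a small simulation, assuming equal-sized segments of 1000 octets. The function name and parameters are hypothetical.

```python
def cumulative_acks(start, segment_size, count, lost=()):
    """Return the ACK the receiver sends after each arriving segment.

    start: stream position of the first octet
    lost:  0-based indices of segments that never arrive
    """
    next_expected = start
    buffered = set()     # starting positions of segments held out of order
    acks = []
    for i in range(count):
        if i in lost:
            continue
        buffered.add(start + i * segment_size)
        # Extend the longest contiguous prefix as far as possible.
        while next_expected in buffered:
            buffered.remove(next_expected)
            next_expected += segment_size
        acks.append(next_expected)   # next octet expected
    return acks
```

When the first segment is lost, every ACK repeats position 101, so the sender cannot tell that the other four segments arrived.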
Timeout and Retransmission
TCP has a timer for each segment
If timer goes off before ACK received – retrans
Different algorithm than other protocols
Due to internet environment
Cannot know how quickly ACKs should come
May span one or many networks
May encounter router delays
Must accommodate vast time differences
Figure 12.10
Adaptive Retransmission Algorithm
Used to accommodate varying delays
Monitors performance of each connection
Deduces reasonable values for timeouts
As performance changes, timeout value revised
Must collect data for the algorithm
Records time each segment sent & when ACK arrives
Computes elapsed time (sample round trip time)
Get new sample; adjust average round trip time for the connection
RTT stored as weighted average (usually)
New round trip samples change the average slowly
RTT = (a * Old_RTT) + ((1-a) * New_Round_Trip_Sample)
a is the constant weighting factor; 0 < a < 1
Choosing a value close to 1:
Weighted average only changed small amount
Immune to changes that last a short time
Choosing a value close to 0:
Weighted average responds quickly to changes in delay
Timeout value is a function of the current RTT
Early implementations used constant weighting
factor, B (B > 1)
Timeout = B * RTT
Choosing a value for B is hard
Close to 1
Timeout close to current RTT
Detects packet loss quickly
Any small delay may cause unnecessary retransmissions
Original specification recommended B=2
Will look at better techniques for timeout
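The weighted average and the early timeout rule can be sketched directly from the formulas above. The default value of `a` is an assumption chosen for illustration; `b=2` follows the original recommendation.

```python
def update_rtt(old_rtt, sample, a=0.875):
    """Weighted average: a close to 1 changes the estimate slowly."""
    return a * old_rtt + (1 - a) * sample

def timeout(rtt, b=2.0):
    """Early implementations: timeout is a constant multiple of RTT."""
    return b * rtt
```

For example, with `a=0.5` an old estimate of 100 ms and a new sample of 200 ms give an estimate of 150 ms and a timeout of 300 ms.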
Measuring Round Trip Samples
Measuring round trip sample seems trivial
But, TCP uses cumulative acknowledgement
ACK refers to data received, not datagram that
carried it
Consider a retransmission:
Form segment; put in datagram; send; timer expires
Send again in second datagram
Get ACK: for which datagram?
Called acknowledgement ambiguity
Assume ACK belongs to earliest datagram
Make estimated round trip time grow
Incorrect if the original datagram was really lost
If many lost, estimate grows arbitrarily large
Assume ACK belongs to latest datagram
Send retransmission just before ACK arrives
Decreases the timeout time
Makes things worse; more retransmissions
Estimate will eventually stabilize
RTT will be slightly less than ½ of the correct value
Every segment sent twice even though no loss occurs
Karn’s Algorithm
If associating ACK with earliest or most
recent are both wrong…what to do?
Do not update on retransmitted segments
Idea known as Karn’s Algorithm
Avoids ambiguous acknowledgement problem
Simplistic implementation can be a problem
Get sharp increase in delay; do some retransmissions
Ignore ACKs for retransmissions; no new estimate
Must also use a timer backoff strategy
Compute initial timeout with round trip estimate
If timer expires and causes retransmission,
increase the timeout (within a bound)
Most implementations multiply timeout by 2
Next segment timed with new timeout
Continues backoff until send segment without retransmission
Computes new round trip estimate
Resets timeout accordingly
Shown to work well even with high packet loss
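A sketch of Karn's algorithm combined with timer backoff, assuming the simple `Timeout = B * RTT` rule and a doubling backoff. The class name, the defaults, and the `max_timeout` bound are hypothetical.

```python
class KarnTimer:
    def __init__(self, initial_rtt, a=0.875, b=2.0, max_timeout=64.0):
        self.rtt = initial_rtt
        self.a = a
        self.b = b
        self.max_timeout = max_timeout
        self.timeout = b * initial_rtt

    def on_timer_expired(self):
        # Retransmission: back off the timer (within a bound),
        # but leave the round trip estimate alone.
        self.timeout = min(self.timeout * 2, self.max_timeout)

    def on_ack(self, sample, was_retransmitted):
        if was_retransmitted:
            return   # ambiguous ACK: no new estimate, keep backed-off timeout
        # Valid sample: update the estimate and reset the timeout.
        self.rtt = self.a * self.rtt + (1 - self.a) * sample
        self.timeout = self.b * self.rtt
```

The backed-off timeout stays in force until a segment is acknowledged without having been retransmitted.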
High Variance in Delay
Computations do not respond well to wide
range of variation in delay
Variation in RTT
Proportional to 1/(1-network load)
Original TCP standard estimated RTT as
shown earlier
Limiting B to 2 can adapt to loads of at most 30%
1989 spec requires estimates of both average
RTT and variance
Must use variance in place of constant B
Approximations are computationally easy
DIFF = Sample - Old_RTT
Smoothed_RTT = Old_RTT + d * DIFF
DEV = Old_DEV + p * (|DIFF| - Old_DEV)
Timeout = Smoothed_RTT + e * DEV
DIFF is the difference between the new round trip sample and the old estimate
DEV is the estimated mean deviation
d is fraction between 0 & 1; controls effect on weighted average
p is fraction between 0 & 1; controls effect on mean deviation
e is a factor controlling how much deviation affects round trip timeout
(Research suggests d and p be inverse powers of 2; scale by 2^n,
use integer arithmetic, and:
d = 1/2^3, p = 1/2^2, n = 3, and e = 4)
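A floating-point sketch of the timeout computation using mean deviation, with the suggested constants as defaults. The function name is hypothetical, and the scaled integer arithmetic mentioned above is not shown.

```python
def update_timeout(old_rtt, old_dev, sample, d=1/8, p=1/4, e=4):
    """Return (smoothed RTT, mean deviation, timeout) after one sample."""
    diff = sample - old_rtt
    smoothed_rtt = old_rtt + d * diff
    dev = old_dev + p * (abs(diff) - old_dev)
    return smoothed_rtt, dev, smoothed_rtt + e * dev
```

With an old estimate of 100, an old deviation of 10, and a sample of 140, the new estimate is 105, the deviation is 17.5, and the timeout is 175.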
Figure 12.11
Figure 12.12
Response to Congestion
TCP software must deal with congestion
Severe delay caused by an overload of datagrams
Congestion occurs at routers
Routers have finite storage
When run out of storage, start dropping datagrams
Endpoints do not know where congestion is
Just see increased delay
Get timeouts; send more datagrams (retrans)
May cause congestion collapse
TCP must reduce transmission rate
ICMP source quench messages inform hosts of congestion
TCP needs to help
Want to automatically reduce transmission rates
when congestion occurs
TCP standard recommends two techniques
Slow-start and multiplicative decrease
Multiplicative Decrease
TCP must already use receiver’s window size
Keep another window size to use during congestion
Called congestion window
At any time, the allowed window is:
min(receiver_advertisement, congestion_window)
During non-congestion, both are same
To estimate congestion window size, TCP assumes
most datagram loss comes from congestion
Upon segment loss:
Reduce congestion window by half (min of one segment)
For segments still in window, backoff timer exponentially
Does this for every loss; quickly clears router traffic
How to recover when congestion ends?
If do the reverse (double the congestion window): unstable
Use slow-start recovery
When starting traffic on connection or after congestion
Start window at size of single segment
Increase by one segment every time get an ACK
Avoids swamping
Not so slow actually:
log2(N) round trips until can send N segments
One other restriction – congestion avoidance phase
When congestion window reaches ½ of its original size before congestion, increase
by 1 segment only if all segments in the window have been ACKed
Overall, known as Additive Increase Multiplicative
Decrease (AIMD)
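A simplified sketch of how slow-start, multiplicative decrease, and additive increase fit together. It assumes the window is counted in whole segments and updated once per round trip, which real implementations do not do (they count octets and react to individual ACKs). The class and method names are hypothetical.

```python
class CongestionWindow:
    def __init__(self):
        self.cwnd = 1          # slow-start begins at a single segment
        self.threshold = None  # half the window in use at the last loss

    def allowed_window(self, receiver_advertisement):
        return min(receiver_advertisement, self.cwnd)

    def on_window_acked(self):
        """Call once per round trip in which the whole window was ACKed."""
        if self.threshold is None or self.cwnd < self.threshold:
            # Slow-start: one extra segment per ACK doubles the window.
            self.cwnd *= 2
            if self.threshold is not None:
                self.cwnd = min(self.cwnd, self.threshold)
        else:
            self.cwnd += 1     # congestion avoidance: additive increase

    def on_loss(self):
        self.threshold = max(self.cwnd // 2, 1)   # multiplicative decrease
        self.cwnd = 1                             # recover with slow-start
```

After four clean round trips the window reaches 16 segments, which matches the log2(N) growth noted above.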
Techniques powerful when combined
Slow-start increase
Multiplicative decrease
Additive Increase
Measurement of variation
Exponential timer backoff
Improve TCP performance dramatically
Add very little computational overhead
Performance improves by factors of 2 to 10
Fast Recovery & Other Modifications
Heuristic used where loss is infrequent
Uses info from cumulative ack scheme
Can resend data before timer expires
Do not need to know details…
Explicit Feedback Mechanisms
Most TCP versions use implicit techniques:
Timeout and duplicate ACKs to detect loss
Changes in RTT to detect congestion
Two explicit techniques have been proposed
Selective Acknowledgement (SACK)
Explicit Congestion Notification (ECN)
SACK can specify exactly which data has been
received and which is missing
Sender knows which segment(s) to retransmit
TCP provides two options for SACK
**Do not need to know details**
Does not replace cumulative ack mechanism
Nor is it mandatory
ECN is used to notify TCP about congestion
As a TCP segment goes through routers:
Two bits in IP header used to record congestion
When segment arrives, receiver knows
Sender needs to know; receiver uses ACK to tell
IP header bits:
Taken from TOS field
TCP header bits:
Taken from reserved area
Congestion, Tail Drop, and TCP
Protocols are layered
Layers operate in isolation
TCP at source/destination cannot interact with
lower layer elements along the path
TCP does not know condition of network
TCP does not notify lower layers before transferring data
Policies used by routers can affect TCP
Both a single connection and aggregate of all connections
Router delays some datagrams more than others
TCP backs off retransmission timer
If delay exceeds timer, TCP assumes congestion
Layers are defined independently, but
they interact
Thus, try to define mechanisms in one
layer to work well with protocols in others
Important interaction between TCP and IP
Router is overrun and begins to drop datagrams
Early router software used tail-drop policy
If input queue is full when datagram arrives, drop it
Interesting effect on TCP
If segments are from a single TCP connection:
TCP enters slow-start until begin receiving ACKs
If segments are from multiple TCP connections:
All N instances of TCP enter slow-start at same time
Causes global synchronization
Random Early Detection
Routers need to avoid global synchronization
Use scheme to avoid tail-drop when possible
Called Random Early Detection (RED)
(or Random Early Discard or Random Early Drop)
Uses two markers in queue: Tmin and Tmax
Three rules:
If queue contains fewer than Tmin datagrams, add new one
If queue contains more than Tmax datagrams, discard new one
If queue contains between Tmin and Tmax datagrams, randomly
discard the datagram with probability p
Randomness keeps from waiting for overflow
Router slowly and randomly drops datagrams as
congestion increases
Keeps from putting all TCP connection in slow-start
Key is in choice of the thresholds and p
Tmin must be large enough to utilize output link
Tmax must be larger than typical increase in queue size
during round trip time
Discard probability is most complex choice
Does not use a constant; computed for each datagram
Can vary probability from 0 (Tmin queue size) to 1 (Tmax queue
size) in a linear fashion
Linear scheme forms the basis of probability p
Must avoid overreacting to bursty traffic
If short burst
Do not drop datagrams because queue will not overflow
But, cannot postpone discard indefinitely
Long burst
Will overflow queue and start tail-drop
Use weighted average technique
Does not use actual queue size at any instant
Compute weighted average queue size
Update each time a datagram arrives
Avg = (1 – g) * Old_avg + g * Current_queue_size
g is a value between 0 and 1
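A sketch of the RED decision using the weighted average queue size and a linear discard probability between the two thresholds. The function names, the default `g`, and the injectable `rng` parameter are assumptions for illustration.

```python
import random

def update_average(old_avg, current_queue_size, g=0.002):
    """Weighted average queue size, updated as each datagram arrives."""
    return (1 - g) * old_avg + g * current_queue_size

def drop_probability(avg, t_min, t_max):
    """Linear from 0 at t_min to 1 at t_max."""
    if avg < t_min:
        return 0.0
    if avg >= t_max:
        return 1.0
    return (avg - t_min) / (t_max - t_min)

def red_should_drop(avg, t_min, t_max, rng=random.random):
    # rng() is uniform in [0, 1), so p = 0 never drops and p = 1 always does.
    return rng() < drop_probability(avg, t_min, t_max)
```

Because the decision uses the average rather than the instantaneous queue size, a short burst barely moves the probability.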
Some details glossed over
Computations very efficient if:
Choose constants as powers of 2
Use integer arithmetic
Measurement of queue size
Time required to forward datagram proportional to size
Measure queue size in octets versus datagrams
Affects type of traffic dropped
Discard probability proportional to amount of data
Not based on number of segments
Smaller datagrams: less probability of being dropped
Good for ACKs, remote login traffic, etc.
Analysis and simulation show RED works
Establishing a TCP Connection
Use a 3-way handshake
Is both necessary and sufficient for correct synchronization
Also uses rule that additional requests for connection
are ignored if connection established
Can initiate connection from both ends
Figure 12.13 The sequence of messages in a three-way handshake. Time
proceeds down the page; diagonal lines represent segments sent
between sites. SYN segments carry initial sequence number
Initial Sequence Numbers
3-way handshake accomplishes 2 functions
Guarantees both sides ready to transfer data
Sets up agreement on initial sequence numbers
Each machine can choose initial number at random
Cannot start at 1 each time
Numbers set in three messages
First machine: sends x
Second machine: records x, sends y and ACKs x
First machine: ACKs y
Possible to send data with handshake segments
Included with the initial sequence numbers
TCP software must buffer until handshake done
Once connection established, can release the data to
the application program quickly
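A sketch of the three messages and the sequence numbers they carry, assuming 32-bit sequence numbers and that each ACK names the next number expected (so x is acknowledged as x + 1). The function name and the dictionary layout are hypothetical.

```python
import random

MODULUS = 2 ** 32   # sequence numbers wrap around at 32 bits

def three_way_handshake(rng=random):
    x = rng.randrange(MODULUS)   # first machine's initial sequence number
    y = rng.randrange(MODULUS)   # second machine's initial sequence number
    return [
        {"from": "A", "flags": ["SYN"], "seq": x},
        {"from": "B", "flags": ["SYN", "ACK"], "seq": y,
         "ack": (x + 1) % MODULUS},
        {"from": "A", "flags": ["ACK"], "seq": (x + 1) % MODULUS,
         "ack": (y + 1) % MODULUS},
    ]
```

After the third message both sides know the other's initial sequence number and know that it was received.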
Closing a TCP Connection
Close operation used to terminate gracefully
Connections are full duplex
When application tells TCP it is done, TCP closes
the connection in one direction
Sending TCP sends remaining data
Waits for receiver ACK
Sends segment with FIN bit set
Receiver ACKs the FIN segment and informs its
application that data is done
Can still send data in opposite direction
When both directions closed, TCP deletes its
record of the connection
Modified 3-way handshake is used to close
Figure 12.14 The modified three-way handshake used to close connections.
The site that receives the first FIN segment acknowledges it
immediately, and then delays before sending the second FIN
TCP Connection Reset
Close operation used for normal shutdown
Sometimes abnormal conditions arise
Force the connection to be broken
TCP has a reset for such conditions
One side sends segment with RST bit set
Other side responds immediately by aborting
TCP informs application that connection was reset
Transfer in both directions ceases immediately
TCP State Machine
Operation of TCP can be explained with a
theoretical model called finite state machine
Circles represent states
Arrows represent transitions between them
Figure 12.15
Forcing Data Delivery
Data stream usually buffered
Accumulate enough octets for efficient transfer
May need to send data before get a lot
Example: interactive terminal keystrokes
Push operation forces delivery of octets
Also sets PSH bit in segment code field
Causes delivery of data to destination application
Reserved TCP Port Numbers
Combines static and dynamic port binding
Like UDP
Many of the port numbers are the same for
services accessible by both TCP and UDP
See Figure 12.16
Figure 12.16
TCP Performance
TCP is complex protocol
Handles wide variety of underlying technologies
Generality does not hinder TCP performance
Research done at Berkeley
Shows that same TCP that gives efficient internet
operation can sustain 8 Mbps throughput between two
stations on 10 Mbps Ethernet
Cray Research: TCP throughput approaching 1 Gbps
Silly Window Syndrome
TCP can have serious performance problem
Caused when sender & receiver operate at
different speeds
If receiver reads data one octet at a time
Sender quickly fills buffer
Must wait for window advertisement
Gets advertisement for one octet
Results in many small segments
Inefficient use of bandwidth and lots of overhead
If sender sends data one octet at a time
Ends up with same problem
Known as silly window syndrome
Early TCP implementations exhibited the problem
Each ACK advertises small amount of space
Causes each segment to carry a small amount of data
Avoiding Silly Window Syndrome
TCP specs include heuristics to avoid SWS
On sender, avoids sending small data amounts
On receiver, avoids sending small advertisements
TCP software should contain both
Receive-side silly window avoidance
Receiver maintains currently available window
Delays advertising until can advance window
a “significant” amount
Minimum of ½ of the receiver’s buffer, or
Number of octets in a maximum-sized segment
Summary of technique:
Before sending an updated advertisement after
advertising a zero window
Wait for space
50% of total buffer or maximum sized segment
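The receive-side rule can be sketched as a single check, assuming the receiver has just advertised a zero window. The function and parameter names are hypothetical.

```python
def next_advertisement(free_space, buffer_size, mss):
    """Window to advertise after a zero-window advertisement.

    Keep advertising zero until the free space reaches the smaller of
    half the buffer and one maximum-sized segment.
    """
    threshold = min(buffer_size // 2, mss)
    return free_space if free_space >= threshold else 0
```

If the application reads one octet at a time, the advertisement stays at zero until a worthwhile amount of space has accumulated.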
Two approaches for implementation
ACK each arriving segment, but do not advertise until window reaches the heuristic threshold
Delay sending ACK if window too small to advertise
Standard recommends using delayed ACKs
Advantage: delayed ACKs decrease traffic, increase throughput
One ACK for all data received during delay
May get outgoing data segment to piggyback on
If data read quickly, ACK and adv can go in one segment
May get retransmissions if delay too long
Bad round trip time estimates
Cannot delay more than 500 ms
Recommend receiver ACK every other data segment
Send-side silly window avoidance
Goal is to avoid sending small segments
Use clumping
Delay sending until get reasonable amount of data
How long should TCP wait?
Too long: application has large delays
Cannot know when application will send more data
Not long enough: get small segments
Fixed delay not optimal for all applications
Uses an adaptive algorithm
Delay depends on current internet performance
Does not compute delays
Uses arrival of ACK to trigger transmission of
additional packets
Application generates more data to send
Buffer if previous data sent but not ACKed
Wait until get enough for maximum-sized segment
If waiting when ACK arrives, send all data in buffer
Apply rule even when push operation requested
If application fast compared to network
Successive segments have many octets
If application slow compared to network
Small segments get sent without long delay
Known as the Nagle algorithm
Elegant due to little computational overhead
Adapts to arbitrary combinations of:
network delay
maximum segment size
application speed
But does not lower throughput in normal cases
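A sketch of the send-side rule, assuming a single ACK acknowledges everything outstanding and that segments are simply appended to a list instead of being transmitted. The class and method names are hypothetical.

```python
class NagleSender:
    def __init__(self, mss):
        self.mss = mss
        self.buffer = b""
        self.awaiting_ack = False   # True while sent data is not yet ACKed
        self.sent = []              # segments handed to the network

    def _transmit(self, count):
        self.sent.append(self.buffer[:count])
        self.buffer = self.buffer[count:]
        self.awaiting_ack = True

    def _flush(self):
        while self.buffer:
            self._transmit(min(self.mss, len(self.buffer)))

    def write(self, data):
        self.buffer += data
        if not self.awaiting_ack:
            self._flush()           # nothing outstanding: send at once
        else:
            # Previous data not ACKed: send only maximum-sized segments.
            while len(self.buffer) >= self.mss:
                self._transmit(self.mss)

    def on_ack(self):
        self.awaiting_ack = False
        self._flush()               # ACK arrived: send all buffered data
```

No timer is involved: the arrival of an ACK is the only trigger for sending a partial segment, so the delay adapts to the network on its own.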
Summary
TCP defines reliable stream delivery service
Full duplex connection
Exchange large volumes of data efficiently
Sliding window gives efficient network use
Few assumptions of underlying network
Flexible for wide variety of delivery systems
Has flow control
Flexible for systems with differing speeds
Basic unit of transfer is a segment
Pass data or control information
Permits piggyback of ACKs
Flow control
Implemented by receiver advertisements
Urgent facility supports out-of-band messages
Push mechanism forces delivery
TCP standard specifies
Exponential backoff for retransmission timers
Congestion avoidance algorithms
Multiplicative decrease
Additive increase
Uses heuristics to avoid small packets
Recommends using RED versus tail-drop
Avoids TCP synchronization
Improves throughput