Chapter 12

Reliable Stream
Transport Service (TCP)
We’ve looked at


Unreliable connectionless packet delivery
service
And the IP protocol that defines it
Now we will examine


Reliable stream delivery
And the Transmission Control Protocol that
defines it
TCP is presented as a part of TCP/IP


But it is an independent, general-purpose protocol
Can be adapted for use with other delivery
systems
Need for Stream Delivery
At low levels, have unreliable packets


Lost, destroyed, discarded, duplicated, delayed
Size constraints affect efficient transfer
Applications need to send lots of data


Unreliability is tedious and annoying
Programmers must worry about errors
Goal of network protocol research

General purpose reliable stream delivery
method
Properties of the Service
Interface between applications and
TCP/IP has five characteristic features:

Stream Orientation
Sender provides stream of bits divided into bytes
Receiver is passed exact same sequence

Virtual Circuit Connection
Service provides illusion of dedicated circuit
“Call” setup from one application to the other
Two OSs talk and settle details

Continue to communicate during transfer
If error, detect and report to applications

Buffered Transfer
Applications send the stream in whatever sizes they want

May be as small as a single octet
Protocol software wants efficient transfer


Small blocks of data: buffer until get enough for a datagram
Large blocks of data: break into smaller pieces
Push mechanism


When transfer needs to happen before buffer is full
Application invokes a push
Data generated until then is sent immediately
At receiving end, is delivered without delay
Protocol software may divide stream in unexpected ways

Unstructured Stream
Applications cannot mark record boundaries
Must agree that stream service will be unstructured

Full Duplex Connection
Connections allow concurrent transfer both ways
Appears as two independent streams in opposite
directions
Can terminate one direction without affecting other
Control information can be piggybacked on data
Providing Reliability
Want reliable transfer out of unreliable
packet delivery system


Most reliable protocols use a single technique
Positive acknowledgement with retransmission
Recipient must send ACK message as it gets data
Sender keeps record of each packet sent
If timer expires before the ACK arrives, sender retransmits packet
Figure 12.1

Can also have duplicate packets
Network delays may cause premature retransmission
Both packets and ACKs can be duplicated
Usually solve by assigning sequence numbers


Receiver must remember which sequence numbers
received
ACKs include the sequence numbers as well
Sliding Windows
Sending one packet and waiting for ACK
wastes time

Full duplex circuit; have lots of idle time
Sliding window technique used



More complex form of positive ack & retrans
Use bandwidth more efficiently
Sender transmits multiple packets before ACK


Number of unacknowledged packets limited
by window size
Performance depends upon window size
Size of 1: same as simple positive ack protocol
Increase size with goal of sending packets as fast as
the network can handle

Conceptually, separate timer for each packet
Only unack’ed packets are retransmitted
Receiver has a similar window
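A minimal sketch of the sliding window idea, assuming a loss-free network in which ACKs return in order. The function name `sliding_window_send` and the event tuples are illustrative only; they are not part of any TCP implementation.

```python
from collections import deque

def sliding_window_send(num_packets, window_size):
    """Return the send/ack event order for a loss-free transfer (illustrative)."""
    events = []
    unacked = deque()   # sequence numbers sent but not yet acknowledged
    next_seq = 0
    while next_seq < num_packets or unacked:
        # Fill the window: keep sending until window_size packets are outstanding.
        while next_seq < num_packets and len(unacked) < window_size:
            events.append(("send", next_seq))
            unacked.append(next_seq)
            next_seq += 1
        # The oldest outstanding packet is acknowledged, so the window slides by one.
        events.append(("ack", unacked.popleft()))
    return events
```

With `window_size=1` the events alternate send/ack, which is the simple positive acknowledgement protocol; a larger window lets several sends precede the first ACK.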
TCP
Is a communication protocol

NOT a piece of software
TCP is the standard

Various TCP software implements the standard
Standard includes:





Format of data and acknowledgments
Procedures for reliability
Distinguish multiple destinations on a machine
Error recovery procedures
Initiation and closing a TCP stream transfer
Standard does not include:

Details of application/TCP interface
Does not discuss exact procedures to invoke for operations
Left unspecified for flexibility



TCP usually implemented in OS
Can use whatever interface given OS provides
Single specification for variety of machines
TCP assumes little about underlying
system

Can be used with variety of packet delivery
systems (including IP)
Dialup lines; LAN; high speed fiber; low speed WAN
Ports, Connections, &
Endpoints
TCP resides above IP in the layering
scheme
Application
Reliable Stream (TCP)
User Datagram (UDP)
Internet (IP)
Network Interface
Multiple applications can communicate
concurrently



Multiplexes and demultiplexes incoming msgs
Uses port numbers (like UDP discussion)
TCP ports more complex
Using the connection abstraction
Objects are virtual circuits, not ports
Connections identified by a pair of endpoints



Endpoint is pair of integers: (host, port)
host is IP address for a host
port is TCP port on that host
Pair of endpoints defines connection
(128.9.0.32, 1184) and (128.10.2.3, 53)
A single TCP port can be shared by multiple
connections on the same machine
(128.2.254.139, 1012) and (128.10.2.3, 53)



No ambiguity
Incoming messages associated with connection, not port
Both endpoints used to identify appropriate connection
Makes things easier for programmers


Can provide concurrent service without unique ports
Example: Email
Multiple computers can send mail concurrently
Accepting program needs only one TCP port
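The idea that incoming segments are matched to a connection by both endpoints, not by the local port alone, can be sketched with a table keyed on the endpoint pair. The addresses are the ones used above; the table, the function names, and the connection labels are hypothetical.

```python
# Connection table keyed by (local endpoint, remote endpoint); illustrative only.
connections = {}

def register(local, remote, label):
    """Record a connection identified by its pair of (host, port) endpoints."""
    connections[(local, remote)] = label

def demultiplex(local, remote):
    """Find the connection an incoming segment belongs to, or None."""
    return connections.get((local, remote))

server = ("128.10.2.3", 53)
# Two connections share the same local port 53 without ambiguity.
register(server, ("128.9.0.32", 1184), "connection A")
register(server, ("128.2.254.139", 1012), "connection B")
```

Because the key contains both endpoints, a lookup for a segment from (128.9.0.32, 1184) never collides with one from (128.2.254.139, 1012), even though both arrive at port 53.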
Passive & Active Opens
TCP is connection-oriented


Both endpoints must agree to participate
Passive open
Application at one end tells OS it will accept connection
OS assigns a TCP port number for its end

Active open
Done by application wishing to connect
Tells OS to establish a connection

Two TCP modules communicate
Establish and verify the connection; then pass data
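As one possible illustration, the Python `socket` module exposes a passive open as `bind`/`listen`/`accept` and an active open as a connect call. The sketch below assumes a loopback interface is available; the function name `loopback_exchange` is hypothetical, and the TCP standard itself does not prescribe this interface.

```python
import socket

def loopback_exchange(payload: bytes) -> bytes:
    """Send payload over a loopback TCP connection and return what arrives."""
    # Passive open: tell the OS we will accept a connection.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))   # port 0 asks the OS to assign a port
        listener.listen(1)
        port = listener.getsockname()[1]
        # Active open: the other application asks its OS to establish the connection.
        client = socket.create_connection(("127.0.0.1", port))
        server_side, _peer = listener.accept()
        try:
            client.sendall(payload)
            client.shutdown(socket.SHUT_WR)   # signal the end of the client's stream
            chunks = []
            while True:
                chunk = server_side.recv(4096)
                if not chunk:                 # empty read: the sender has finished
                    break
                chunks.append(chunk)
            return b"".join(chunks)
        finally:
            client.close()
            server_side.close()
    finally:
        listener.close()
```

The receiving loop reads until end of stream because TCP may deliver the octets in pieces of any size.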
Segments, Streams, & Sequence
Numbers
TCP views the data stream in segments


Segment contains sequence of octets
Usually each segment in one IP datagram
Two important problems:

Efficient transmission
Good use of available network

Flow control
End-to-end problem
Cannot overflow the receiver’s buffer
Special sliding window protocol used

Solves both problems
[Figure: octets of the data stream numbered 1, 2, 3, ..., 11, ... with the current window and pointers 1, 2, and 3 marked]
Octets of the data stream are numbered sequentially
1st pointer: sent and ACKed vs sent and not ACKed
2nd pointer: end of window
3rd pointer: boundary between sent and unsent

Receiver maintains a similar window
Full duplex: TCP software at each end maintains 2 windows (one for data sent, one for data received)

Also allows window size to vary over time
Each ACK has window advertisement
Tells how many more octets willing to accept
Increased advertisement:

Sender can increase size of sliding window, send more
Decreased advertisement:


Sender decreases size of sliding window, stop at boundary
Extreme case: sends advertisement of zero, stops all
This provides flow control

Essential in internet environment
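A small sketch of how a window advertisement bounds what the sender may transmit. It assumes the ACK number names the next octet the receiver expects and that the advertisement counts octets from that position; the function and parameter names are illustrative.

```python
def sendable_octets(ack_number, advertisement, next_to_send):
    """Octets the sender may still transmit under the current advertisement.

    ack_number    -- next octet the receiver expects (left edge of the window)
    advertisement -- octets the receiver is willing to accept from ack_number on
    next_to_send  -- sequence number of the first octet not yet sent
    """
    right_edge = ack_number + advertisement   # first octet beyond the window
    return max(0, right_edge - next_to_send)
```

An advertisement of zero yields zero sendable octets, which is the extreme case that stops all transmission.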
Two independent flow problems:


End-to-end
Minicomputer communicating with mainframe
Intermediate systems
Routers need to control flow, too
Overloaded router condition is congestion
No explicit congestion control mechanism; uses sliding
window
Good TCP implementation can detect & recover
Poor implementation can make it worse
TCP Segment Format
Unit of transfer between TCP software on two machines is the segment
Segments are exchanged to:



Establish connections
Transfer data
Send ACKs
May piggyback on a segment carrying data


Advertise window size
Close connections
Figure 12.7
Code Bits field reveals type of segment
Bit (left to right)   Meaning if bit set to 1
URG                   Urgent pointer field is valid
ACK                   Acknowledgement field is valid
PSH                   Segment requests a push
RST                   Reset the connection
SYN                   Synchronize sequence numbers
FIN                   Sender has reached end of its byte stream
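A short sketch of decoding the six code bits, assuming they occupy the low-order six bits of an integer in the left-to-right order listed above (URG is the highest of the six). The names `CODE_BITS` and `decode_code_bits` are illustrative.

```python
# Masks follow the left-to-right order of the code bits: URG ACK PSH RST SYN FIN.
CODE_BITS = [
    ("URG", 0x20),
    ("ACK", 0x10),
    ("PSH", 0x08),
    ("RST", 0x04),
    ("SYN", 0x02),
    ("FIN", 0x01),
]

def decode_code_bits(value):
    """Return the names of the code bits that are set in value."""
    return [name for name, mask in CODE_BITS if value & mask]
```

For example, a segment answering a connection request carries both SYN and ACK, so its code bits decode to `["ACK", "SYN"]`.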
Out of Band Data
Out of Band



Data sent without waiting for octets in the stream
to be consumed by the receiver
Ex: to interrupt or abort a program
Use urgent bit and URGENT POINTER field
This data is consumed first, regardless of stream
position
Maximum Segment Size Option
Not all segments will be of same size

But, must agree on a maximum size
Uses OPTIONS field



Can specify MSS (maximum segment size)
If on same network, may use size such that
resulting datagrams match network MTU
If not, will attempt to discover the minimum MTU
along the path
Or use 536 (default datagram size of 576 octets, minus 40 octets of
IP & TCP headers)
Choosing good MSS is difficult

Too large or too small are both bad
Too small: network utilization is low



Segments in datagram; datagram in frame
At least 40 octets of headers
Small amount of data gives poor utilization
Too large: large IP datagrams




Probably get fragmented somewhere
Cannot ACK partial segment
Must receive all fragments
More fragments increases probability of losing one
In theory, best MSS is when IP datagrams
are as large as possible without being
fragmented
Difficult to figure out:


Most implementations do not have a mechanism
for doing so
Routes can change dynamically
This may change the MTU of the path

Optimum size depends on lower level headers
Segment size must be reduced to account for IP
options
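The arithmetic behind these slides can be sketched as follows, assuming 20-octet IP and TCP headers when no options are present. The function names are illustrative, and the 1500-octet MTU in the usage note is an assumed example value, not one taken from the slides.

```python
IP_HEADER = 20    # octets, without IP options
TCP_HEADER = 20   # octets, without TCP options

def mss_for_mtu(mtu, ip_options=0, tcp_options=0):
    """Largest segment payload that fits one datagram of size mtu unfragmented."""
    return mtu - IP_HEADER - TCP_HEADER - ip_options - tcp_options

def utilization(data_octets):
    """Fraction of each datagram that is data rather than IP and TCP headers."""
    return data_octets / (data_octets + IP_HEADER + TCP_HEADER)
```

`mss_for_mtu(576)` gives the default of 536, and an assumed MTU of 1500 would give 1460. `utilization(1)` shows why tiny segments waste the network: one octet of data rides on 40 octets of headers.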
Window Scaling Option
WINDOW field is 16 bits



Limits max window size to 64 Kbytes
Ok in early networks
Need more for networks with large delay
Option allows a larger size

Do not need to know details….
Timestamp Option
Used to:


Help compute delay on underlying network
Handle “wrap around” sequence numbers
Process:

Sender:
Places timestamp from its clock in message

Receiver:
Copies timestamp field into ack

Allows sender to compute elapsed time
TCP Checksum
CHECKSUM contains 16-bit integer


Uses a pseudo header like UDP
Purpose is just the same
Verify segment has reached correct destination
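Pseudo header layout (bit positions 0, 8, 16, and 31 mark the field boundaries):
Bits 0-31: Source IP Address
Bits 0-31: Destination IP Address
Bits 0-7: Zero; bits 8-15: Protocol; bits 16-31: TCP Length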
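A sketch of the checksum computation over the pseudo header plus the segment, assuming IPv4 addresses, protocol number 6 for TCP, and the 16-bit one's complement sum used throughout TCP/IP. The function names are illustrative.

```python
import socket
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum of data, padded with a zero octet if odd."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return total

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum of a TCP segment, including the pseudo header."""
    pseudo = (
        socket.inet_aton(src_ip)                     # Source IP Address
        + socket.inet_aton(dst_ip)                   # Destination IP Address
        + struct.pack("!BBH", 0, 6, len(segment))    # Zero, Protocol, TCP Length
    )
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF
```

The sender computes the value with the CHECKSUM field set to zero and stores the result there. A receiver that repeats the computation over the filled-in segment gets zero only if the segment, and the addresses in the pseudo header, are intact.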
ACKs & Retransmission
Hard to refer to datagrams or segments


Variable length segments
Retransmitted segments may have more data
than original
Instead, use position in stream

Based on sequence numbers
Cumulative acknowledgement scheme

Receiver collects arriving data octets
Reconstructs stream of sender
May have to reorder segments due to delivery



Will have reconstructed zero or more octets
May have other stream pieces present but out of order
Receiver ACKs longest contiguous prefix
ACK specifies the next octet expected to be received

Adv:
ACKs easy to generate and unambiguous
Lost ACKs may not force retransmission

Disadv:
Only send info about single position in the stream
Lack of information is inefficient

Imagine window that spans 5000 octets
Starts with position 101 in the stream

Sender has sent all data in five segments
Suppose first segment got lost

Receiver sends ACK as each segment arrives
All ACKs specify octet 101 as next expected
No way to tell sender that all the other data is there

Sender has two choices upon timeout:
Send all five segments over
Send only first segment, then wait for ACK to do
anything else
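The example above can be reproduced with a small sketch. It assumes the 5000 octets were sent as five segments of 1000 octets each, a split the slides do not state; `cumulative_ack` is an illustrative name.

```python
def cumulative_ack(first_octet, received):
    """Return the next octet expected, given received (start, length) segments."""
    next_expected = first_octet
    for start, length in sorted(received):
        if start > next_expected:
            break   # gap in the stream: later data cannot be acknowledged yet
        next_expected = max(next_expected, start + length)
    return next_expected

# Segments 2 through 5 arrived; the first segment (octets 101 to 1100) was lost.
arrived = [(1101, 1000), (2101, 1000), (3101, 1000), (4101, 1000)]
```

`cumulative_ack(101, arrived)` stays at 101 however many later segments arrive. Once the first segment is retransmitted and received, the ACK jumps straight to 5101.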
Timeout and Retransmission
TCP has a timer for each segment

If timer goes off before ACK received – retrans
Different algorithm than other protocols


Due to internet environment
Cannot know how quickly ACKs should come
May span one or many networks
May encounter router delays
Must accommodate vast time differences
Figure 12.10
Adaptive Retransmission Algorithm


Used to accommodate varying delays
Monitors performance of each connection
Deduces reasonable values for timeouts
As performance changes, timeout value revised

Must collect data for the algorithm
Records time each segment sent & when ACK arrives
Computes elapsed time (sample round trip time)
Get new sample; adjust average round trip time for the
connection
RTT stored as weighted average (usually)

New round trip samples change the average slowly

Example:
RTT = (a * Old_RTT) + ((1 - a) * New_Round_Trip_Sample)
where:
a is the constant weighting factor; 0 < a < 1
Choosing a value close to 1:


Weighted average only changed small amount
Immune to changes that last a short time
Choosing a value close to 0:

Weighted average responds quickly to changes in delay


Timeout value is a function of the current RTT
Early implementations used constant weighting
factor, B (B > 1)
Timeout = B * RTT
Choosing a value for B is hard



Close to 1
Timeout close to current RTT
Detects packet loss quickly
Any small delay may cause unnecessary retransmissions
Original specification recommended B=2
Will look at better techniques for timeout
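A minimal sketch of the weighted average and the early timeout rule. The default `a=0.875` is an assumed value chosen only for illustration; `b=2.0` follows the recommendation quoted above.

```python
def update_rtt(old_rtt, sample, a=0.875):
    """Weighted average: RTT = a * Old_RTT + (1 - a) * New_Round_Trip_Sample."""
    return a * old_rtt + (1 - a) * sample

def timeout(rtt, b=2.0):
    """Early rule: Timeout = B * RTT, with B > 1."""
    return b * rtt
```

With `a` close to 1, a single sample of 200 against an old estimate of 100 moves the average only to 112.5. With `a=0.5` the same sample moves it to 150.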
Measuring Round Trip Samples
Measuring round trip sample seems trivial



But, TCP uses cumulative acknowledgement
ACK refers to data received, not datagram that
carried it
Consider a retransmission:
Form segment; put in datagram; send; timer expires
Send again in second datagram
Get ACK: for which datagram?

Called acknowledgement ambiguity
Assume ACK belongs to earliest datagram



Make estimated round trip time grow
Incorrect if the original datagram was really lost
If many lost, estimate grows arbitrarily large
Assume ACK belongs to latest datagram


Send retransmission just before ACK arrives
Decreases the timeout time
Makes things worse; more retransmissions

Estimate will eventually stabilize
RTT will be slightly less than ½ of the correct value
Every segment sent twice even though no loss occurs
Karn’s Algorithm
If associating ACK with earliest or most
recent are both wrong…what to do?
Do not update on retransmitted segments



Idea known as Karn’s Algorithm
Avoids ambiguous acknowledgement problem
Simplistic implementation can be a problem
Get sharp increase in delay; do some retransmissions
Ignore ACKs for retransmissions; no new estimate
Must also use a timer backoff strategy


Compute initial timeout with round trip estimate
If timer expires and causes retransmission,
increase the timeout (within a bound)
Most implementations multiply timeout by 2


Next segment timed with new timeout
Continues backoff until send segment without
retransmitting
Computes new round trip estimate
Resets timeout accordingly

Shown to work well even with high packet loss
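A sketch of Karn's algorithm combined with timer backoff. It assumes the weighted-average estimator shown earlier, doubling on each expiry, and an arbitrary upper bound; the class name, the defaults `a=0.875` and `b=2.0`, and `max_timeout` are illustrative assumptions.

```python
class KarnTimer:
    """Round trip estimator that ignores samples from retransmitted segments."""

    def __init__(self, initial_rtt, a=0.875, b=2.0, max_timeout=64.0):
        self.rtt = initial_rtt
        self.a = a
        self.b = b
        self.max_timeout = max_timeout        # bound on the backed-off timeout
        self.timeout = b * initial_rtt        # initial timeout from the estimate

    def on_timer_expired(self):
        """Timer backoff: double the timeout, within the bound."""
        self.timeout = min(self.timeout * 2, self.max_timeout)

    def on_ack(self, sample, was_retransmitted):
        """Update the estimate only for segments that were sent exactly once."""
        if was_retransmitted:
            return   # ambiguous ACK: keep the estimate and the backed-off timeout
        self.rtt = self.a * self.rtt + (1 - self.a) * sample
        self.timeout = self.b * self.rtt      # reset timeout from the new estimate
```

The backed-off timeout stays in force until a segment is acknowledged without retransmission, at which point the estimate and the timeout are recomputed.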
High Variance in Delay
Computations do not respond well to wide
range of variation in delay

Variation in RTT
Proportional to 1/(1-network load)

Original TCP standard estimated RTT as
shown earlier
Limiting B to 2 can adapt to loads of at most 30%

1989 spec requires estimates of both average
RTT and variance
Must use variance in place of constant B
Approximations are computationally easy
DIFF = SAMPLE - Old_RTT
Smoothed_RTT = Old_RTT + d * DIFF
DEV = Old_DEV + p * (|DIFF| - Old_DEV)
Timeout = Smoothed_RTT + e * DEV
Where:
DEV is the estimated mean deviation
d is a fraction between 0 & 1; controls effect of new sample on weighted average
p is a fraction between 0 & 1; controls effect of new sample on mean deviation
e is a factor controlling how much the deviation affects the round trip timeout
(Research suggests choosing d and p as inverse powers of 2, scaling the computation by 2^n,
and using integer arithmetic, with:
d = 1/(2^3), p = 1/(2^2), n = 3, and e = 4)
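The four equations translate directly into code. This sketch uses floating point for clarity rather than the scaled integer arithmetic mentioned above; the function name is illustrative and the defaults are the suggested constants.

```python
def update_estimates(old_rtt, old_dev, sample, d=1/8, p=1/4, e=4):
    """Return (Smoothed_RTT, DEV, Timeout) after one new round trip sample."""
    diff = sample - old_rtt                    # DIFF = SAMPLE - Old_RTT
    smoothed_rtt = old_rtt + d * diff          # Smoothed_RTT = Old_RTT + d * DIFF
    dev = old_dev + p * (abs(diff) - old_dev)  # DEV = Old_DEV + p * (|DIFF| - Old_DEV)
    timeout = smoothed_rtt + e * dev           # Timeout = Smoothed_RTT + e * DEV
    return smoothed_rtt, dev, timeout
```

A sample far from the current estimate raises DEV sharply, so the timeout widens much more than the smoothed RTT does.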
Figure 12.11
Figure 12.12
Response to Congestion
TCP software must deal with congestion



Severe delay caused by an overload of datagrams
Congestion occurs at routers
Routers have finite storage
When run out of storage, start dropping datagrams
Endpoints do not know where congestion is



Just see increased delay
Get timeouts; send more datagrams (retrans)
May cause congestion collapse
TCP must reduce transmission rate


ICMP source quench messages inform hosts of
congestion
TCP needs to help
Want to automatically reduce transmission rates
when congestion occurs

TCP standard recommends two techniques
Slow-start
Multiplicative Decrease

Multiplicative Decrease
TCP must already use receiver’s window size
Keep another window size to use during congestion
Called congestion window
At any time, the allowed window is:
min(receiver_advertisement, congestion_window)
During non-congestion, both are same
To estimate congestion window size, TCP assumes
most datagram loss comes from congestion
Upon segment loss:


Reduce congestion window by half (min of one segment)
For segments still in window, backoff timer exponentially
Does this for every loss; quickly clears router traffic
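A sketch of the two rules on this slide, with window sizes measured in octets. The function names are illustrative.

```python
def allowed_window(receiver_advertisement, congestion_window):
    """The sender may use only the smaller of the two windows."""
    return min(receiver_advertisement, congestion_window)

def on_segment_loss(congestion_window, segment_size):
    """Multiplicative decrease: halve the window, but keep at least one segment."""
    return max(congestion_window // 2, segment_size)
```

Repeated losses shrink the window geometrically: 8000, 4000, 2000, 1000, and then it stays at one 1000-octet segment.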

Slow-start
How recover when congestion ends?
If do reverse (2x congestion window) - unstable
Use slow-start recovery



When starting traffic on connection or after congestion
Start window at size of single segment
Increase by one segment every time get an ACK
Avoids swamping
Not so slow actually:

log2(N) round trips until can send N segments
One other restriction – congestion avoidance phase

When congestion window reaches ½ of its original size (before congestion), increase
by 1 segment only if all segments in the window have been ACKed
Overall, known as Additive Increase Multiplicative
Decrease (AIMD)
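A simplified sketch of window growth, counted in whole segments and whole round trips, assuming no loss. Adding one segment per ACK doubles the window each round trip during slow-start; at or beyond the threshold (half the pre-congestion window) the window grows by one segment per round trip. The function name and the `threshold` parameter are illustrative.

```python
def round_trips_to_reach(target_segments, threshold=None):
    """Round trips needed for the congestion window to reach target_segments."""
    cwnd = 1      # slow-start begins with a single segment
    trips = 0
    while cwnd < target_segments:
        if threshold is not None and cwnd >= threshold:
            cwnd += 1    # congestion avoidance: one segment per fully ACKed window
        else:
            cwnd *= 2    # slow-start: one segment per ACK doubles the window
        trips += 1
    return trips
```

Without a threshold, reaching 16 segments takes 4 round trips, matching log2(16). With a threshold of 8, the last 8 segments are added one round trip at a time, for 11 in total.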

Techniques powerful when combined
Slow-start increase
Multiplicative decrease
Additive Increase
Measurement of variation
Exponential timer backoff

Improve TCP performance dramatically
Add very little computational overhead
Performance improves by factors of 2 to 10
Fast Recovery & Other
Modifications
Heuristic used where loss is infrequent


Uses info from cumulative ack scheme
Can resend data before timer expires
Do not need to know details…
Explicit Feedback Mechanisms
Most TCP versions use implicit techniques:


Timeout and duplicate ACKs to detect loss
Changes in RTT to detect congestion
Two explicit techniques have been proposed


Selective Acknowledgement (SACK)
Explicit Congestion Notification (ECN)
SACK





Can specify exactly which data has been
received and which is missing
Sender knows which segment(s) to retransmit
TCP provides two options for SACK
**Do not need to know details**
Does not replace cumulative ack mechanism
Nor is it mandatory
ECN


Used to notify TCP about congestion
As a TCP segment goes through routers:
Two bits in IP header used to record congestion
When segment arrives, receiver knows
Sender needs to know; receiver uses ACK to tell

IP header bits:
Taken from TOS field

TCP header bits:
Taken from reserved area
Congestion, Tail Drop, and TCP
Protocols are layered


Layers operate in isolation
TCP at source/destination cannot interact with
lower layer elements along the path
TCP does not know condition of network
TCP does not notify lower layers before transferring data

Policies used by routers can affect TCP
Both a single connection and aggregate of all
connections
Example:

Router delays some datagrams more than others
TCP backs off retransmission timer
If delay exceeds timer, TCP assumes congestion
Layers are defined independently, but
they interact
Thus, try to define mechanisms in one
layer to work well with protocols in others
Important interaction between TCP and IP


Router is overrun and begins to drop datagrams
Early router software used tail-drop policy
If input queue is full when datagram arrives, drop it

Interesting effect on TCP
If segments are from a single TCP connection:

TCP enters slow-start until begin receiving ACKs
If segments are from multiple TCP connections:


All N instances of TCP enter slow-start at same time
Causes global synchronization
Random Early Detection
Routers need to avoid global synchronization


Use scheme to avoid tail-drop when possible
Called Random Early Detection (RED)
(or Random Early Discard or Random Early Drop)
Uses two markers in queue: Tmin and Tmax
Three rules:



If queue contains fewer than Tmin datagrams, add new one
If queue contains more than Tmax datagrams, discard new one
If queue contains between Tmin and Tmax datagrams, randomly
discard the datagram with probability p

Randomness keeps from waiting for overflow
Router slowly and randomly drops datagrams as
congestion increases
Keeps from putting all TCP connections in slow-start

Key is in choice of the thresholds and p
Tmin must be large enough to utilize output link
Tmax must be larger than typical increase in queue size
during round trip time
Discard probability is most complex choice


Does not use a constant; computed for each datagram
Can vary probability from 0 (Tmin queue size) to 1 (Tmax queue
size) in a linear fashion


Linear scheme forms the basis of probability p
Must avoid overreacting to bursty traffic
If short burst

Do not drop datagrams because queue will not overflow
But, cannot postpone discard indefinitely
Long burst

Will overflow queue and start tail-drop
Use weighted average technique
Does not use actual queue size at any instant
Compute weighted average queue size
Update each time a datagram arrives
Avg = (1 - g) * Old_avg + g * Current_queue_size
where
g is a value between 0 and 1
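A sketch of the RED decision using the weighted average queue size and the purely linear probability described above. The slides note that the linear scheme is only the basis of p, so this is a simplification; the function names and the default `g=0.002` are assumptions for illustration.

```python
import random

def update_average(old_avg, current_queue_size, g=0.002):
    """Avg = (1 - g) * Old_avg + g * Current_queue_size."""
    return (1 - g) * old_avg + g * current_queue_size

def drop_probability(avg, t_min, t_max):
    """Linear ramp from 0 at Tmin to 1 at Tmax."""
    if avg < t_min:
        return 0.0          # queue is short: always add the new datagram
    if avg > t_max:
        return 1.0          # queue is long: always discard the new datagram
    return (avg - t_min) / (t_max - t_min)

def red_should_drop(avg, t_min, t_max, rng=random):
    """Randomly decide whether to discard an arriving datagram."""
    return rng.random() < drop_probability(avg, t_min, t_max)
```

Because the decision uses the average rather than the instantaneous queue size, a short burst raises the drop probability only slightly, while a sustained burst pushes it toward 1.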


Some details glossed over
Computations very efficient if:


Choose constants as powers of 2
Use integer arithmetic
Measurement of queue size




Time required to forward datagram proportional to size
Measure queue size in octets versus datagrams
Affects type of traffic dropped
Discard probability proportional to amount of data
Not based on number of segments
Smaller datagrams: less probability of being dropped
Good for ACKs, remote login traffic, etc.
Analysis and simulation show RED works
Establishing a TCP Connection
Use a 3-way handshake

Is both necessary and sufficient for correct
synchronization
Also uses rule that additional requests for connection
are ignored if connection established

Can initiate connection from both ends
simultaneously
Figure 12.13 The sequence of messages in a three-way handshake. Time
proceeds down the page; diagonal lines represent segments sent
between sites. SYN segments carry initial sequence number
information.
Initial Sequence Numbers
3-way handshake accomplishes 2 functions


Guarantees both sides ready to transfer data
Sets up agreement on initial sequence numbers
Each machine can choose initial number at random
Cannot start at 1 each time
Numbers set in three messages



First machine: sends x
Second machine: records x, sends y and ACKs x
First machine: ACKs y

Possible to send data with handshake segments
Included with the initial sequence numbers
TCP software must buffer until handshake done
Once connection established, can release the data to
the application program quickly
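The three messages can be sketched as a list of segments. The sketch assumes 32-bit sequence numbers and that each ACK names the next octet expected, so x is acknowledged as x + 1 and y as y + 1. The dictionary layout and the machine names A and B are illustrative.

```python
import random

SEQ_SPACE = 2 ** 32   # sequence numbers are 32-bit values

def three_way_handshake(rng=random):
    """Return the three handshake segments exchanged between machines A and B."""
    x = rng.randrange(SEQ_SPACE)   # A chooses its initial sequence number
    y = rng.randrange(SEQ_SPACE)   # B chooses its own, independently
    return [
        ("A->B", {"SYN": True, "seq": x}),
        ("B->A", {"SYN": True, "ACK": True, "seq": y, "ack": (x + 1) % SEQ_SPACE}),
        ("A->B", {"ACK": True, "ack": (y + 1) % SEQ_SPACE}),
    ]
```

After the third segment each side knows the other's initial sequence number, and each has had its own number acknowledged.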
Closing a TCP Connection
Close operation used to terminate gracefully


Connections are full duplex
When application tells TCP it is done, TCP closes
the connection in one direction
Sending TCP sends remaining data
Waits for receiver ACK
Sends segment with FIN bit set
Receiver ACKs the FIN segment and informs its
application that data is done



Can still send data in opposite direction
When both directions closed, TCP deletes its
record of the connection
Modified 3-way handshake is used to close
Figure 12.14 The modified three-way handshake used to close connections.
The site that receives the first FIN segment acknowledges it
immediately, and then delays before sending the second FIN
segment.
TCP Connection Reset
Close operation used for normal shutdown
Sometimes abnormal conditions arise


Force the connection to be broken
TCP has a reset for such conditions
One side sends segment with RST bit set
Other side responds immediately by aborting
connection
TCP informs application that connection was reset
Transfer in both directions ceases immediately
TCP State Machine
Operation of TCP can be explained with a
theoretical model called finite state
machine


Circles represent states
Arrows represent transitions between them
Figure 12.15
Forcing Data Delivery
Data stream usually buffered


Accumulate enough octets for efficient transfer
May need to send data before get a lot
Example: interactive terminal keystrokes

Push operation forces delivery of octets
Also sets PSH bit in segment code field
Causes delivery of data to destination application
Reserved TCP Port Numbers
Combines static and dynamic port binding



Like UDP
Many of the port numbers are the same for
services accessible by both TCP and UDP
See Figure 12.16
Figure 12.16
TCP Performance
TCP is complex protocol

Handles wide variety of underlying technologies
Generality does not hinder TCP performance

Research done at Berkeley
Shows that same TCP that gives efficient internet
operation can sustain 8 Mbps throughput between two
stations on 10 Mbps Ethernet

Cray Research: TCP throughput approaching 1 Gbps
Silly Window Syndrome
TCP can have serious performance problem


Caused when sender & receiver operate at
different speeds
If receiver reads data one octet at a time
Sender quickly fills buffer
Must wait for window advertisement

Gets advertisement for one octet
Results in many small segments

Inefficient use of bandwidth and lots of overhead

If sender sends data one octet at a time
Ends up with same problem

Known as silly window syndrome
Early TCP implementations exhibited the problem
Each ACK advertises small amount of space
Causes each segment to carry a small amount of data
Avoiding Silly Window Syndrome
TCP specs include heuristics to avoid SWS


On sender, avoids sending small data amounts
On receiver, avoids sending small advertisements
TCP software should contain both
Receive-side silly window avoidance


Receiver maintains currently available window
Delays advertising until can advance window
a “significant” amount
Minimum of ½ of the receiver’s buffer, or
Number of octets in a maximum-sized segment

Summary of technique:
Before sending an updated advertisement after
advertising a zero window
Wait for space

50% of total buffer or maximum sized segment
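The receive-side rule reduces to a single comparison. The sketch assumes all sizes are in octets; `may_advertise` is an illustrative name.

```python
def may_advertise(free_space, buffer_size, mss):
    """After a zero window, advertise only once a significant amount is free.

    Significant means the minimum of half the receiver's buffer and one
    maximum-sized segment.
    """
    return free_space >= min(buffer_size // 2, mss)
```

With an 8192-octet buffer and a 1460-octet maximum segment, freeing a single octet does not trigger an advertisement, but freeing 1460 octets does.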

Two approaches for implementation
ACK each arriving segment, but do not advertise until
allowed
Delay sending ACK if window too small to advertise

Standard recommends using delayed ACKs
Adv: delayed ACKs decrease traffic, increase thruput



One ACK for all data received during delay
May get outgoing data segment to piggyback on
If data read quickly, ACK and adv can go in one segment
Disadv:




May get retransmissions if delay too long
Bad round trip time estimates
Cannot delay more than 500 ms
Recommend receiver ACK every other data segment
Send-side silly window avoidance


Goal is to avoid sending small segments
Use clumping
Delay sending until get reasonable amount of data

How long should TCP wait?
Too long: application has large delays

Cannot know when application will send more data
Not long enough: get small segments
Fixed delay not optimal for all applications

Uses an adaptive algorithm
Delay depends on current internet performance

Does not compute delays
Uses arrival of ACK to trigger transmission of
additional packets
Heuristic:





Application generates more data to send
Buffer if previous data sent but not ACKed
Wait until get enough for maximum-sized segment
If waiting when ACK arrives, send all data in buffer
Apply rule even when push operation requested
If application fast compared to network
Successive segments have many octets

If application slow compared to network
Small segments get sent without long delay

Known as the Nagle algorithm
Elegant due to little computational overhead
Adapts to arbitrary combinations of:



network delay
maximum segment size
application speed
But does not lower throughput in normal cases
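A simplified sketch of the heuristic, assuming that one ACK covers all outstanding data and ignoring window limits. The class name `NagleSender` and its methods are hypothetical; `sent` records each transmitted segment for inspection.

```python
class NagleSender:
    """Buffers small writes while earlier data is still unacknowledged."""

    def __init__(self, mss):
        self.mss = mss
        self.buffer = b""
        self.unacked = False   # True while sent data awaits an ACK
        self.sent = []         # segments transmitted, in order

    def _transmit(self, data):
        self.sent.append(data)
        self.unacked = True

    def _flush(self):
        # Send everything buffered, in segments of at most mss octets.
        while self.buffer:
            self._transmit(self.buffer[:self.mss])
            self.buffer = self.buffer[self.mss:]

    def write(self, data):
        """Application generates more data to send."""
        self.buffer += data
        if not self.unacked:
            self._flush()      # nothing outstanding: send without delay
        else:
            # Previous data not yet ACKed: send only maximum-sized segments.
            while len(self.buffer) >= self.mss:
                self._transmit(self.buffer[:self.mss])
                self.buffer = self.buffer[self.mss:]

    def on_ack(self):
        """ACK arrives: send all data that was waiting in the buffer."""
        self.unacked = False
        self._flush()
```

No timer appears anywhere: the arrival of an ACK is the only event that releases buffered small data, which is why the delay adapts to network speed.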
Summary
TCP defines reliable stream delivery service


Full duplex connection
Exchange large volumes of data efficiently
Sliding window gives efficient network use
Few assumptions of underlying network

Flexible for wide variety of delivery systems
Has flow control

Flexible for systems with differing speeds
Basic unit of transfer is a segment


Pass data or control information
Permits piggyback of ACKs
Flow control

Implemented by receiver advertisements
Urgent facility supports out-of-band messages
Push mechanism forces delivery
TCP standard specifies


Exponential backoff for retransmission timers
Congestion avoidance algorithms
Slow-start
Multiplicative decrease
Additive increase


Uses heuristics to avoid small packets
Recommends using RED versus tail-drop
Avoids TCP synchronization
Improves throughput