COMS/CSEE 4140 Networking Laboratory Lecture 06 Salman Abdul Baset

Spring 2008
Announcements
Lab 4 (5-7) due next week before your lab slot
Prelab 5 due next week
There will be a Lab 5 next week
Midterm (March 10th, duration ~1.5 hours)
Assignment 2 issues
  aslookup compilation?
  ISP name: nslookup or whois for IP address
Lab 4 (count-to-infinity issues)
2
Agenda
Autonomous Systems (AS)
 Policy vs. distance based routing
 Border gateway protocol (BGP)
 Transmission control protocol (TCP)

3
Autonomous Systems Terminology
Local traffic = traffic with source or destination in the AS
Transit traffic = traffic that passes through the AS
Stub AS = has a connection to only one AS and carries only local traffic
Multihomed AS = has connections to more than one AS but does not carry transit traffic
Transit AS = has connections to more than one AS and carries transit traffic
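The three AS categories above can be sketched as a small decision function. This is only an illustration; the function name `classify_as` and its two parameters are hypothetical and do not come from the lecture.

```python
def classify_as(num_neighbor_ases: int, carries_transit: bool) -> str:
    """Classify an AS using the three categories from the terminology slide.

    num_neighbor_ases: number of other ASes this AS is connected to.
    carries_transit:   True if the AS forwards traffic that neither
                       starts nor ends inside it.
    """
    if num_neighbor_ases < 1:
        raise ValueError("an AS needs at least one neighbor AS to be classified")
    if num_neighbor_ases == 1:
        # A single connection leaves nothing to transit between.
        return "stub"
    return "transit" if carries_transit else "multihomed"


if __name__ == "__main__":
    print(classify_as(1, False))  # stub
    print(classify_as(2, False))  # multihomed
    print(classify_as(3, True))   # transit
```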

4
Stub and Transit Networks
[Figure: topology of AS 1, AS 2, AS 3, AS 4, and AS 5]
AS 1, AS 2, and AS 5 are stub networks
AS 2 is a multihomed stub network
AS 3 and AS 4 are transit networks
5
Selective Transit
Example:
[Figure: AS 1, AS 2, AS 3, and AS 4]
Transit AS 3 carries traffic between AS 1 and AS 4 and between AS 2 and AS 4
But AS 3 does not carry traffic between AS 1 and AS 2
The example shows a routing policy.
6
Customer/Provider
[Figure: ASes joined by customer/provider links]
A stub network typically obtains access to the Internet through a transit network.
A transit network that is a provider may itself be a customer of another network.
The customer pays the provider for service.
7
Customer/Provider and Peers
[Figure: AS 1, AS 2, and AS 3 joined by peer links; the ASes below them are attached by customer/provider links]
Transit networks can have a peer relationship
Peers provide transit between their respective customers
Peers do not provide transit between peers
Peers normally do not pay each other for service
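The peering rules above can be read as an export filter: whether a route is passed on depends on who it was learned from and who would receive it. The sketch below is a simplified illustration. The names `should_export`, `CUSTOMER`, `PEER`, and `PROVIDER` are hypothetical, and treating provider-learned routes like peer-learned routes is an assumption based on common practice, not something the slide states.

```python
CUSTOMER, PEER, PROVIDER = "customer", "peer", "provider"


def should_export(learned_from: str, neighbor: str) -> bool:
    """Decide whether a route learned from one neighbor type is
    announced to another neighbor type."""
    # Routes learned from a paying customer are announced to everyone.
    if learned_from == CUSTOMER:
        return True
    # Routes learned from a peer (or, by assumption, a provider) are only
    # announced to customers, so no transit is given between peers.
    return neighbor == CUSTOMER


if __name__ == "__main__":
    print(should_export(CUSTOMER, PEER))  # True: transit for own customers
    print(should_export(PEER, PEER))      # False: no transit between peers
    print(should_export(PEER, CUSTOMER))  # True: customers reach the peer's customers
```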
8
Shortcuts through peering
[Figure: AS 1, AS 2, and AS 3 joined by peer links, with an additional peer link between lower-level ASes; the remaining links are customer/provider]
Note that peering reduces upstream traffic
Delays can be reduced through peering
But: Peering may not generate revenue
9
ASNs already assigned
Source: http://www.potaroo.net/tools/asn32/
private ASNs: 64512 – 65534
10
ASNs in use
11
ASN projections
12
ARDs versus ASes
Autonomous Routing Domains Don’t Always Need BGP or an ASN
[Figure: Qwest connected to Yale University (130.132.0.0/16)]
Qwest: nail up routes for 130.132.0.0/16 pointing to Yale
Yale: nail up default routes (0.0.0.0/0) pointing to Qwest
Static routing is the most common way of connecting an autonomous routing domain to the Internet.
This helps explain why BGP is a mystery to many …
13
ASNs Can Be “Shared” (RFC 2270)
[Figure: AS 701 (UUNet) with three customers that all use AS 7046: Crestar Bank, NJIT, and Hood College; prefix shown: 128.235.0.0/16]
ASN 7046 is assigned to UUNet. It is used by customers that are single-homed to UUNet but need BGP for some reason (load balancing, etc.) [RFC 2270]
14
ARDs and ASes: Summary

Most ARDs have no ASN (statically routed at
Internet edge)

Some unrelated ARDs share the same ASN (RFC
2270)

Some ARDs are implemented with multiple ASNs
(example: Worldcom)
ASes are just an implementation detail of Inter-domain routing
15
Agenda
Autonomous Systems (AS)
 Policy vs. distance based routing
 Border gateway protocol (BGP)
 Transmission control protocol (TCP)

16
Why not minimize “AS hop count”?
[Figure: National ISP1 and National ISP2 above Regional ISP1, ISP2, and ISP3, which serve Cust1, Cust2, and Cust3; one candidate path is marked YES and another NO]
Shortest path routing is not compatible with commercial relations
17
Customer versus Provider
provider
provider
customer
IP traffic
customer
Customer pays provider for access to the Internet
18
The “Peering” Relationship
[Figure legend: peer-peer and provider-customer links; arrows mark traffic that is allowed and traffic that is NOT allowed]
Peers provide transit between their respective customers
Peers do not provide transit between peers
Peers (often) do not exchange $$$
19
Peering Provides Shortcuts
Peering also allows connectivity between the customers of “Tier 1” providers.
[Figure legend: peer-peer and provider-customer links]
Peering Wars
Peer
  Reduces upstream transit costs
  Can increase end-to-end performance
  May be the only way to connect your customers to some part of the Internet (“Tier 1”)
Don’t Peer
  You would rather have customers
  Peers are usually your competition
  Peering relationships may require periodic renegotiation
Peering struggles are by far the most contentious issues in the ISP world!
Peering agreements are often confidential.
21
Agenda
Autonomous Systems (AS)
 Policy vs. distance based routing
 Border gateway protocol (BGP)
 Transmission control protocol (TCP)

22
The Gang of Four
              IGP            EGP
Link State    OSPF, IS-IS
Vectoring     RIP            BGP
23
BGP Overview







BGP = Border Gateway Protocol v4, RFC 1771 (~60 pages).
Note: In the context of BGP, a gateway is simply an IP router that connects autonomous systems.
Interdomain routing protocol for routing between autonomous systems.
Uses TCP to establish a BGP session and to send routing messages over the BGP session.
Updates are incremental: only new or changed routes are sent.
BGP is a path vector protocol. Routing messages in BGP contain complete routes.
Network administrators can specify routing policies.
24
BGP Policy-based Routing

Each node is assigned an AS number (ASN)

BGP’s goal is to find any AS-path (not an optimal
one). Since the internals of the AS are never
revealed, finding an optimal path is not feasible.

Network administrator sets BGP’s policies to
determine the best path to reach a destination
network.
25
The Border Gateway Protocol (BGP)
BGP = RFC 1771
  + “optional” extensions: RFC 1997 (communities), RFC 2439 (damping), RFC 2796 (reflection), RFC 3065 (confederation), …
  + routing policy configuration languages (vendor-specific)
  + Current Best Practices in management of Interdomain Routing
BGP was not DESIGNED. It EVOLVED.
26
BGP Route Processing
Open ended programming. Constrained only by vendor configuration language.
Processing pipeline shown in the figure:
  1. Receive BGP updates
  2. Apply import policies (apply policy = filter routes & tweak attributes)
  3. Best route selection, based on attribute values
  4. Best route table; install forwarding entries for best routes in the IP forwarding table
  5. Apply export policies (apply policy = filter routes & tweak attributes)
  6. Transmit BGP updates
27
BGP Attributes
Value   Code                       Reference
-----   ------------------------   ---------
1       ORIGIN                     [RFC1771]
2       AS_PATH                    [RFC1771]
3       NEXT_HOP                   [RFC1771]
4       MULTI_EXIT_DISC            [RFC1771]
5       LOCAL_PREF                 [RFC1771]
6       ATOMIC_AGGREGATE           [RFC1771]
7       AGGREGATOR                 [RFC1771]
8       COMMUNITY                  [RFC1997]
9       ORIGINATOR_ID              [RFC2796]
10      CLUSTER_LIST               [RFC2796]
11      DPA                        [Chen]
12      ADVERTISER                 [RFC1863]
13      RCID_PATH / CLUSTER_ID     [RFC1863]
14      MP_REACH_NLRI              [RFC2283]
15      MP_UNREACH_NLRI            [RFC2283]
16      EXTENDED COMMUNITIES       [Rosen]
...
255     reserved for development

From IANA: http://www.iana.org/assignments/bgp-parameters
[Slide annotation: a subset of the rows is marked "most important attributes"]
Not all attributes need to be present in every announcement
28
LOCAL_PREF Attribute
Forces outbound traffic to take primary link, unless link is down.
NEXT_HOP Attribute


EGP: IP address used to reach the advertising router
IGP: next-hop address is carried into local AS
30
AS_PATH Attribute

Used to detect routing loops and find shortest paths
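Both uses of AS_PATH can be shown in a few lines. The sketch below is illustrative only: the helper names `accept`, `advertise`, and `shortest` are hypothetical, and real BGP speakers apply many more checks before path length is compared.

```python
def accept(as_path, local_asn):
    """Loop detection: reject a route whose AS_PATH already contains our ASN."""
    return local_asn not in as_path


def advertise(as_path, local_asn, prepend=1):
    """Prepend our own ASN (one or more times) before passing the route on."""
    return [local_asn] * prepend + list(as_path)


def shortest(paths):
    """Pick the route with the fewest AS hops."""
    return min(paths, key=len)


if __name__ == "__main__":
    print(accept([3, 2, 1], local_asn=2))   # False: AS 2 is already on the path
    print(advertise([1], local_asn=2))      # [2, 1]
    print(shortest([[3, 2, 1], [4, 1]]))    # [4, 1]
```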
31
Shedding Inbound Traffic with ASPATH Prepending
[Figure: customer AS 2 (192.0.2.0/24) connects to provider AS 1 over a primary and a backup link]
On the primary link, AS 2 announces 192.0.2.0/24 with ASPATH = 2
On the backup link, AS 2 announces 192.0.2.0/24 with ASPATH = 2 2 2
Prepending will (usually) force inbound traffic from AS 1 to take the primary link
Yes, this is a Glorious Hack …
32
… But Padding Does Not Always
Work
[Figure: customer AS 2 (192.0.2.0/24) has a primary link to provider AS 1 and a backup link to provider AS 3]
On the primary link, AS 2 announces 192.0.2.0/24 with ASPATH = 2
On the backup link, AS 2 announces 192.0.2.0/24 with ASPATH = 2 2 2 2 2 2 2 2 2 2 2 2 2
AS 3 will send traffic on the “backup” link because it prefers customer routes, and local preference is considered before ASPATH length!
Padding in this way is often used as a form of load balancing
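Why padding fails here can be seen by writing the comparison order down. The sketch below models only the two decision steps named on the slide (local preference first, then ASPATH length). The `Route` class, the `best_route` function, and the local preference values 100 and 90 are illustrative assumptions; so is the idea that AS 3 learns the alternative route through AS 1 with a lower local preference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    via: str          # label for the neighbor the route was learned from
    local_pref: int   # higher wins
    as_path: tuple    # shorter wins, but only among equal local_pref


def best_route(routes):
    """Prefer the highest local preference, then the shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


if __name__ == "__main__":
    # View from AS 3: the padded route arrives directly from customer AS 2,
    # the short one arrives through AS 1 (assumed lower local preference).
    padded = Route(via="customer AS 2 (backup link)", local_pref=100, as_path=(2,) * 13)
    short = Route(via="AS 1", local_pref=90, as_path=(1, 2))
    print(best_route([padded, short]).via)  # customer AS 2 (backup link)
```

With equal local preference the same function falls back to path length, which is the case where prepending does work.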
COMMUNITY Attribute to the Rescue!
[Figure: customer AS 2 (192.0.2.0/24) has a primary link to provider AS 1 and a backup link to provider AS 3]
On the primary link, AS 2 announces 192.0.2.0/24 with ASPATH = 2
On the backup link, AS 2 announces 192.0.2.0/24 with ASPATH = 2 and COMMUNITY = 3:70
AS 3: normal customer local pref is 100, peer local pref is 90
Customer import policy at AS 3:
  If 3:90 in COMMUNITY then set local preference to 90
  If 3:80 in COMMUNITY then set local preference to 80
  If 3:70 in COMMUNITY then set local preference to 70
34
BGP Issues - What is a BGP Wedgie?
[Figure labels: ¾ wedgie, full wedgie]
BGP policies make sense locally
Interaction of local policies allows multiple stable routings
Some routings are consistent with intended policies, and some are not
If an unintended routing is installed (BGP is “wedged”), then manual intervention is needed to change to an intended routing
When an unintended routing is installed, no single group of network operators has enough knowledge to debug the problem
35
YouTube blocking
Pakistan blocks YouTube
How? (according to BBC)
  Advertise a shorter route to reach YouTube
  The incorrect short route gets propagated
  Seen by two thirds of the Internet
  Traffic to YouTube goes through Pakistan
  Since Pakistan blocked YouTube, all traffic reaches a dead end!
36
Dynamic Routing Protocols:
Summary

Dynamic routing protocols: RIP, OSPF, BGP
RIP uses a distance vector algorithm and converges slowly (the count-to-infinity problem)
OSPF uses a link state algorithm and converges fast, but it is more complicated than RIP.
Both RIP and OSPF find lowest-cost paths.
BGP uses a path vector algorithm; its path selection algorithm is complicated and is influenced by policies.
BGP has its own problems; see "BGP Wedgies" by Tim Griffin
37
More Readings (Optional)
BGP Wedgies: Bad Routing Policy Interactions that
Cannot be Debugged
JI’s Intro to interdomain routing.
"Interdomain Setting of PlanetLab Nodes."
PlanetLab Meeting, May 14, 2004.
Understanding the Border Gateway Protocol (BGP)
ICNP 2002 Tutorial Session
38
Agenda
Autonomous Systems (AS)
 Policy vs. distance based routing
 Border gateway protocol (BGP)
 Transmission control protocol (TCP)

39
Transmission Control Protocol (RFC)

Reliable and in-order byte-stream service
TCP format
 Connection establishment
 Flow control
 Reaction to congestion
 Packet corruption

40
TCP Format
• TCP segments have a 20 byte header with >= 0 bytes of data.
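Packet layout: IP header (20 bytes) | TCP header (20 bytes) | TCP data

TCP header (32 bits per row; bits 0-15 on the left, bits 16-31 on the right):
  Source Port Number                      | Destination Port Number
  Sequence number (32 bits)
  Acknowledgement number (32 bits)
  header length | 0 | Flags               | window size
  TCP checksum                            | urgent pointer
  Options (if any)
  DATA
The first five rows form the fixed 20-byte header.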
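As a rough illustration of the layout, the fixed 20-byte header can be unpacked with Python's `struct` module. The function name `parse_tcp_header` and the dictionary keys are hypothetical; the sketch ignores options and does not verify the checksum.

```python
import struct

# Flag bits in the low 6 bits of the 16-bit "header length / flags" word.
FLAG_BITS = {"URG": 0x20, "ACK": 0x10, "PSH": 0x08, "RST": 0x04, "SYN": 0x02, "FIN": 0x01}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed part of a TCP header (network byte order)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        # The top 4 bits give the header length in 32-bit words.
        "header_len": (off_flags >> 12) * 4,
        "flags": {name: bool(off_flags & bit) for name, bit in FLAG_BITS.items()},
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
    }


if __name__ == "__main__":
    # A hand-built SYN segment: header length 5 words, SYN flag set.
    syn = struct.pack("!HHIIHHHH", 1121, 23, 1031880193, 0, (5 << 12) | 0x02, 16384, 0, 0)
    print(parse_tcp_header(syn))
```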
41
TCP header fields

Sequence Number (SeqNo):
  Sequence number is 32 bits long. So the range of SeqNo is
    0 <= SeqNo <= 2^32 - 1 ≈ 4.3 Gbyte
  Each sequence number identifies a byte in the byte stream
  Initial Sequence Number (ISN) of a connection is set during connection establishment
  Q: What are possible requirements for ISN?
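Because the sequence number space is finite, arithmetic on sequence numbers wraps around at 2^32. The sketch below shows one way to handle this. The helper names `seq_add` and `seq_lt` are hypothetical, and the "less than half the space ahead" comparison is a common convention assumed here, not something taken from the slide.

```python
MOD = 2 ** 32  # size of the 32-bit sequence number space


def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n bytes, wrapping at 2^32."""
    return (seq + n) % MOD


def seq_lt(a: int, b: int) -> bool:
    """True if a comes before b, assuming they are less than 2^31 apart."""
    return a != b and (b - a) % MOD < MOD // 2


if __name__ == "__main__":
    print(seq_add(2 ** 32 - 1, 1))   # 0: wraps around
    print(seq_lt(2 ** 32 - 10, 5))   # True: 5 is 15 bytes after the wrap
```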
42
TCP header fields

Acknowledgement Number (AckNo):
  Acknowledgements are piggybacked, i.e., a segment from A -> B can contain an acknowledgement for data sent in the B -> A direction
  Q: Why is piggybacking good?
  A host uses the AckNo field to send acknowledgements. (If a host sends an AckNo in a segment it sets the “ACK flag”)
  The AckNo contains the next SeqNo that a host wants to receive
  Example: The acknowledgement for a segment with sequence numbers 0-1500 is AckNo=1501
43
TCP header fields

Acknowledgement Number (cont’d)
  TCP uses the sliding window flow control protocol (see CS 457) to regulate the flow of traffic from sender to receiver
  TCP uses the following variation of sliding window:
    no NACKs (Negative ACKnowledgement)
    only cumulative ACKs
Example:
  Assume: Sender sends two segments with “1..1500” and “1501..3000”, but receiver only gets the second segment.
  In this case, the receiver cannot acknowledge the second packet. It can only send AckNo=1
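The cumulative ACK rule in the example can be sketched as follows. The function name `cumulative_ack` and the representation of received data as inclusive `(first_byte, last_byte)` pairs are assumptions made for illustration.

```python
def cumulative_ack(next_expected: int, received) -> int:
    """Return the AckNo a receiver sends under cumulative ACKs.

    next_expected: first byte the receiver is still waiting for.
    received:      iterable of (first_byte, last_byte) ranges, inclusive.
    """
    ack = next_expected
    for first, last in sorted(received):
        if first > ack:
            break  # gap: nothing beyond it can be acknowledged
        ack = max(ack, last + 1)
    return ack


if __name__ == "__main__":
    # Only the second segment arrived: the receiver is stuck at AckNo=1.
    print(cumulative_ack(1, [(1501, 3000)]))             # 1
    # Once the first segment arrives, both are acknowledged at once.
    print(cumulative_ack(1, [(1501, 3000), (1, 1500)]))  # 3001
```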
44
TCP header fields

Header Length (4 bits):
  Length of header in 32-bit words
  Note that the TCP header has variable length (with a minimum of 20 bytes)
45
TCP header fields

Flag bits:
  URG: Urgent pointer is valid
    If the bit is set, the following bytes contain an urgent message in the range:
    SeqNo <= urgent message <= SeqNo+urgent pointer
  ACK: Acknowledgement Number is valid
  PSH: PUSH Flag
    Notification from sender to the receiver that the receiver should pass all data that it has to the application.
    Normally set by sender when the sender’s buffer is empty
46
TCP header fields

Flag bits:
  RST: Reset the connection
    The flag causes the receiver to reset the connection
    Receiver of a RST terminates the connection and notifies the higher layer application about the reset
  SYN: Synchronize sequence numbers
    Sent in the first packet when initiating a connection
  FIN: Sender is finished with sending
    Used for closing a connection
    Both sides of a connection must send a FIN
47
TCP header fields

Window Size:
  Each side of the connection advertises the window size
  Window size is the maximum number of bytes that a receiver can accept.
  Maximum window size is 2^16 - 1 = 65535 bytes
TCP Checksum:
  TCP checksum covers both TCP header and TCP data (also covers some parts of the IP header)
  16-bit one’s complement
Urgent Pointer:
  Only valid if URG flag is set
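The 16-bit one's complement checksum can be sketched in a few lines. The function name is hypothetical, and the input is assumed to already contain whatever bytes the checksum is meant to cover (for TCP this includes parts of the IP header as well as the TCP header and data).

```python
def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    words = bytes.fromhex("0001f203f4f5f6f7")
    csum = ones_complement_checksum(words)
    print(hex(csum))  # 0x220d
    # A receiver that sums the data together with the checksum gets 0.
    print(ones_complement_checksum(words + csum.to_bytes(2, "big")))  # 0
```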
48
TCP header fields

Options:
  End of Options:        kind=0 (1 byte)
  NOP (no operation):    kind=1 (1 byte)
  Maximum Segment Size:  kind=2 (1 byte) | len=4 (1 byte) | maximum segment size (2 bytes)
  Window Scale Factor:   kind=3 (1 byte) | len=3 (1 byte) | shift count (1 byte)
  Timestamp:             kind=8 (1 byte) | len=10 (1 byte) | timestamp value (4 bytes) | timestamp echo reply (4 bytes)
49
TCP header fields

Options:
  NOP is used to pad the TCP header to a multiple of 4 bytes
  Maximum Segment Size
  Window Scale Option
    Increases the TCP window from 16 to 32 bits, i.e., the window size is interpreted differently
    Q: What is the different interpretation?
    This option can only be used in the SYN segment (first segment) during connection establishment
  Timestamp Option
    Can be used for roundtrip measurements
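The kind/length/value encoding of the options can be walked with a short loop. This is a minimal sketch: `parse_options` and `scaled_window` are hypothetical names, unknown option kinds are returned undecoded, and `scaled_window` assumes the usual reading of the window scale option as a left shift of the 16-bit window field.

```python
def parse_options(raw: bytes) -> list:
    """Decode a TCP options field into (name, value) pairs."""
    opts, i = [], 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                      # End of Options: stop parsing
            opts.append(("EOL", None))
            break
        if kind == 1:                      # NOP: single padding byte
            opts.append(("NOP", None))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("option is missing its length byte")
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        value = raw[i + 2:i + length]
        if kind == 2 and length == 4:
            opts.append(("MSS", int.from_bytes(value, "big")))
        elif kind == 3 and length == 3:
            opts.append(("WSCALE", value[0]))
        elif kind == 8 and length == 10:
            opts.append(("TIMESTAMP", (int.from_bytes(value[:4], "big"),
                                       int.from_bytes(value[4:], "big"))))
        else:
            opts.append((kind, value))     # unknown option, left undecoded
        i += length
    return opts


def scaled_window(window_field: int, shift_count: int) -> int:
    """Effective window when a window scale shift count is in use."""
    return window_field << shift_count


if __name__ == "__main__":
    print(parse_options(bytes.fromhex("020405b401030307")))
    print(scaled_window(65535, 7))
```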
50
Three-Way Handshake
aida.poly.edu -> mng.poly.edu: S 1031880193:1031880193(0) win 16384 <mss 1460, ...>
mng.poly.edu -> aida.poly.edu: S 172488586:172488586(0) ack 1031880194 win 8760 <mss 1460>
aida.poly.edu -> mng.poly.edu: ack 172488587 win 17520
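The sequence and acknowledgement numbers in a three-way handshake follow a fixed pattern: each side acknowledges the other's initial sequence number plus one. The sketch below generates the three segments. The function name and the dictionary layout are hypothetical; the example ISNs are the ones that appear in the aida/mng trace.

```python
MOD = 2 ** 32


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three handshake segments as simple dictionaries."""
    syn = {"flags": "SYN", "seq": client_isn}
    syn_ack = {"flags": "SYN,ACK", "seq": server_isn,
               "ack": (client_isn + 1) % MOD}   # SYN consumes one sequence number
    ack = {"flags": "ACK", "ack": (server_isn + 1) % MOD}
    return [syn, syn_ack, ack]


if __name__ == "__main__":
    for segment in three_way_handshake(1031880193, 172488586):
        print(segment)
```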
51
Why is a Two-Way Handshake not
enough?
[Figure: exchange between aida.poly.edu and mng.poly.edu with only a two-way handshake]
aida -> mng (red line, a delayed duplicate packet): S 1031880193:1031880193(0) win 16384 <mss 1460, ...>
aida -> mng (will be discarded as a duplicate SYN): S 15322112354:15322112354(0) win 16384 <mss 1460, ...>
mng -> aida: S 172488586:172488586(0) win 8760 <mss 1460>
When aida initiates the data transfer (starting with SeqNo=15322112355), mng will reject all data.
TCP Connection Termination
mng.poly.edu -> aida.poly.edu: F 172488734:172488734(0) ack 1031880221 win 8733
aida.poly.edu -> mng.poly.edu: . ack 172488735 win 17484
aida.poly.edu -> mng.poly.edu: F 1031880221:1031880221(0) ack 172488735 win 17520
mng.poly.edu -> aida.poly.edu: . ack 1031880222 win 8733
53
Connection termination with
tcpdump
aida issues
an "telnet mng"
aida.poly.edu
mng.poly.edu
1 mng.poly.edu.telnet > aida.poly.edu.1121: F 172488734:172488734(0)
ack 1031880221 win 8733
2 aida.poly.edu.1121 > mng.poly.edu.telnet: . ack 172488735 win 17484
3 aida.poly.edu.1121 > mng.poly.edu.telnet: F 1031880221:1031880221(0)
ack 172488735 win 17520
4 mng.poly.edu.telnet > aida.poly.edu.1121: . ack 1031880222 win 8733
54
TCP States in “Normal” Connection
Lifetime
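Active side                                             Passive side
                                                        LISTEN (passive open)
SYN_SENT (active open)
             ---- SYN (SeqNo = x) ------------------->  SYN_RCVD
             <--- SYN (SeqNo = y, AckNo = x+1) -------
ESTABLISHED  ---- (AckNo = y+1) --------------------->  ESTABLISHED

FIN_WAIT_1 (active close)
             ---- FIN (SeqNo = m) ------------------->  CLOSE_WAIT (passive close)
FIN_WAIT_2   <--- (AckNo = m+1) ----------------------
TIME_WAIT    <--- FIN (SeqNo = n) --------------------  LAST_ACK
             ---- (AckNo = n+1) --------------------->  CLOSED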
55
TCP State Transition Diagram
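Opening A Connection
  CLOSED -> LISTEN: passive open (send: nothing)
  CLOSED -> SYN SENT: active open (send: SYN)
  LISTEN -> SYN RCVD: recv: SYN (send: SYN, ACK)
  LISTEN -> SYN SENT: application sends data (send: SYN)
  SYN RCVD -> LISTEN: recv: RST
  SYN RCVD -> ESTABLISHED: recv: ACK (send: nothing)
  SYN RCVD -> FIN_WAIT_1: send: FIN
  SYN SENT -> SYN RCVD: simultaneous open, recv: SYN (send: SYN, ACK)
  SYN SENT -> ESTABLISHED: recv: SYN, ACK (send: ACK)
  SYN SENT -> CLOSED: close or timeout
  ESTABLISHED: left on recv: FIN or send: FIN (see Closing A Connection)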
56
TCP State Transition Diagram
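Closing A Connection
Active close:
  ESTABLISHED -> FIN_WAIT_1: active close, application issues close() (send: FIN)
  FIN_WAIT_1 -> FIN_WAIT_2: recv: ACK (send: nothing)
  FIN_WAIT_1 -> CLOSING: recv: FIN (send: ACK)
  FIN_WAIT_1 -> TIME_WAIT: recv: FIN, ACK (send: ACK)
  FIN_WAIT_2 -> TIME_WAIT: recv: FIN (send: ACK)
  CLOSING -> TIME_WAIT: recv: ACK (send: nothing)
  TIME_WAIT -> CLOSED: timeout (2 MSL)
Passive close:
  ESTABLISHED -> CLOSE_WAIT: recv: FIN (send: ACK)
  CLOSE_WAIT -> LAST_ACK: application closes (send: FIN)
  LAST_ACK -> CLOSED: recv: ACK (send: nothing)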
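The state transition diagram maps naturally onto a lookup table keyed by (state, event). The sketch below covers only the common transitions and uses made-up event labels such as "recv SYN,ACK" and "close"; the table name `TRANSITIONS` and the function `run` are hypothetical, and segments sent on each transition are not modeled.

```python
TRANSITIONS = {
    ("CLOSED", "passive open"): "LISTEN",
    ("CLOSED", "active open"): "SYN_SENT",
    ("LISTEN", "recv SYN"): "SYN_RCVD",
    ("SYN_SENT", "recv SYN,ACK"): "ESTABLISHED",
    ("SYN_SENT", "recv SYN"): "SYN_RCVD",        # simultaneous open
    ("SYN_RCVD", "recv ACK"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # active close
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",   # passive close
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv FIN"): "CLOSING",
    ("FIN_WAIT_1", "recv FIN,ACK"): "TIME_WAIT",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",
    ("CLOSING", "recv ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    ("CLOSE_WAIT", "close"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def run(events, state="CLOSED"):
    """Feed a list of events through the table and return the final state."""
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"no transition from {state} on {event!r}")
    return state


if __name__ == "__main__":
    client = ["active open", "recv SYN,ACK", "close", "recv ACK", "recv FIN"]
    server = ["passive open", "recv SYN", "recv ACK", "recv FIN", "close", "recv ACK"]
    print(run(client))  # TIME_WAIT
    print(run(server))  # CLOSED
```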
57
2MSL Wait State
2MSL Wait State = TIME_WAIT
When TCP does an active close and sends the final ACK, the connection must stay in the TIME_WAIT state for twice the maximum segment lifetime.
  2MSL = 2 * Maximum Segment Lifetime
Why?
  TCP is given a chance to resend the final ACK. (If the final ACK is lost, the server will time out after sending the FIN segment and resend the FIN.)
The MSL is set to 2 minutes, 1 minute, or 30 seconds.
58
Rules for sending Acknowledgments

TCP has rules that influence the transmission of acknowledgments
Rule 1: Delayed Acknowledgments
  Goal: Avoid sending ACK segments that do not carry data
  Implementation: Delay the transmission of (some) ACKs
Rule 2: Nagle’s rule
  Goal: Reduce transmission of small segments
  Implementation: A sender cannot send multiple segments with a 1-byte payload (i.e., it must wait for an ACK)
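Nagle's rule can be sketched as a sender that keeps at most one small segment outstanding and buffers anything typed in the meantime. The class below is a simplified illustration with hypothetical names; it ignores segment size limits and the case where a full-sized segment may be sent immediately.

```python
class NagleSender:
    """Toy sender: at most one unacknowledged small segment at a time."""

    def __init__(self):
        self.buffer = b""     # data written but not yet sent
        self.unacked = False  # True while a segment awaits its ACK
        self.sent = []        # payloads of the segments sent so far

    def write(self, data: bytes) -> None:
        self.buffer += data
        self._try_send()

    def on_ack(self) -> None:
        self.unacked = False
        self._try_send()

    def _try_send(self) -> None:
        # Send only when there is data and nothing is outstanding.
        if self.buffer and not self.unacked:
            self.sent.append(self.buffer)
            self.buffer = b""
            self.unacked = True


if __name__ == "__main__":
    s = NagleSender()
    for ch in (b"a", b"b", b"c"):
        s.write(ch)   # "a" goes out at once, "b" and "c" wait
    s.on_ack()        # the ACK releases the buffered bytes together
    print(s.sent)     # [b'a', b'bc']
```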
59
Delayed Acknowledgement

TCP delays transmission of ACKs for up to 200ms
Goal: Avoid sending ACK packets that do not carry data.
  The hope is that, within the delay, the receiver will have data ready to be sent to the sender. Then, the ACK can be piggybacked on a data segment
In the example:
  Delayed ACK explains why the “ACK of character” and the “echo of character” are sent in the same segment
  The duration of delayed ACKs can be observed in the example when Argon sends ACKs
Exceptions:
  ACK should be sent for every second full sized segment
  Delayed ACK is not used when packets arrive out of order
60
Observing Delayed Acknowledgements
• Remote terminal applications (e.g., Telnet) send characters to a server. The server interprets the character and sends the output at the server to the client.
• For each character typed, you see three packets:
  1. Client -> Server: Send typed character
  2. Server -> Client: Echo of character (or user output) and acknowledgement for first packet
  3. Client -> Server: Acknowledgement for second packet
61
Observing Delayed Acknowledgements
Telnet session from Argon to Neon
This is the output of typing 3 (three) characters:
Time 44.062449: Argon -> Neon: Push, SeqNo 0:1(1), AckNo 1
Time 44.063317: Neon -> Argon: Push, SeqNo 1:2(1), AckNo 1
Time 44.182705: Argon -> Neon: No Data, AckNo 2

Time 48.946471: Argon -> Neon: Push, SeqNo 1:2(1), AckNo 2
Time 48.947326: Neon -> Argon: Push, SeqNo 2:3(1), AckNo 2
Time 48.982786: Argon -> Neon: No Data, AckNo 3

Time 55.116581: Argon -> Neon: Push, SeqNo 2:3(1), AckNo 3
Time 55.117497: Neon -> Argon: Push, SeqNo 3:4(1), AckNo 3
Time 55.183694: Argon -> Neon: No Data, AckNo 4
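The ACK delay at Argon can be read off the timestamps in the trace: it is the gap between the time Neon's echo is logged and the time Argon's data-less ACK is logged. The short script below does that subtraction; the variable and function names are hypothetical, and the timestamps are copied from the trace.

```python
# (time of Neon's echo, time of Argon's pure ACK), in seconds, from the trace
ECHO_AND_ACK_TIMES = [
    (44.063317, 44.182705),
    (48.947326, 48.982786),
    (55.117497, 55.183694),
]


def ack_delays_ms(pairs):
    """Delay between each echo and the ACK that follows it, in milliseconds."""
    return [round((ack - echo) * 1000, 3) for echo, ack in pairs]


if __name__ == "__main__":
    for delay in ack_delays_ms(ECHO_AND_ACK_TIMES):
        print(f"{delay:.3f} ms")
```

All three delays stay below the 200 ms bound, and they differ from each other, which is consistent with ACKs being held back by a timer rather than sent immediately.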
62
Why 3 segments per character?
We would expect four segments per character:
  1. character
  2. ACK of character
  3. echo of character
  4. ACK of echoed character
But we only see three segments per character:
  1. character
  2. ACK and echo of character
  3. ACK of echoed character
This is due to delayed acknowledgements
63
Observing Nagle’s Rule
Telnet session between argon.cs.virginia.edu and tenet.cs.berkeley.edu (3000 miles apart)
This is the output of typing 7 characters:
Time 16.401963: Argon -> Tenet: Push, SeqNo 1:2(1), AckNo 2
Time 16.481929: Tenet -> Argon: Push, SeqNo 2:3(1), AckNo 2

Time 16.482154: Argon -> Tenet: Push, SeqNo 2:3(1), AckNo 3
Time 16.559447: Tenet -> Argon: Push, SeqNo 3:4(1), AckNo 3

Time 16.559684: Argon -> Tenet: Push, SeqNo 3:4(1), AckNo 4
Time 16.640508: Tenet -> Argon: Push, SeqNo 4:5(1), AckNo 4

Time 16.640761: Argon -> Tenet: Push, SeqNo 4:8(4), AckNo 5
Time 16.728402: Tenet -> Argon: Push, SeqNo 5:9(4), AckNo 8
64
Observing Nagle’s Rule
Segment sequence shown in the figure:
  char1
  ACK of char1 + echo of char1
  ACK + char2
  ACK + echo of char2
  ACK + char3
  ACK + echo of char3
  ACK + char4-7
  ACK + echo of char4-7
Observation: Transmission of segments follows a different pattern, i.e., there are only two segments per character typed
Delayed acknowledgment does not kick in at Argon
The reason is that there is always data at Argon ready to be sent when the ACK arrives
Why is Argon not sending the data (typed character) as soon as it is available?
65
Resetting Connections
Resetting connections is done by setting the RST flag
When is the RST flag set?
  Connection request arrives and no server process is waiting on the destination port
  Abort (terminate) a connection
    Causes the receiver to throw away buffered data.
    Receiver does not acknowledge the RST segment
66
TCP Congestion Control

TCP has a mechanism for congestion control.
The mechanism is implemented at the sender

The window size at the sender is set as follows:
Send Window = MIN (flow control window, congestion
window)
where


flow control window is advertised by the receiver
congestion window is adjusted based on feedback
from the network
67
TCP Congestion Control

TCP congestion control is governed by two parameters:
  Congestion Window (cwnd)
  Slow-start threshold value (ssthresh); initial value is 2^16 - 1
Congestion control works in two modes:
  slow start (cwnd < ssthresh)
  congestion avoidance (cwnd ≥ ssthresh)
68
Slow Start

Initial value: Set cwnd = 1
  Note: Unit is a segment size. TCP actually is based on bytes and increments by 1 MSS (maximum segment size)
The receiver sends an acknowledgement (ACK) for each segment
  Note: Generally, a TCP receiver sends an ACK for every other segment.
Each time an ACK is received by the sender, the congestion window is increased by 1 segment:
  cwnd = cwnd + 1
  If an ACK acknowledges two segments, cwnd is still increased by only 1 segment.
  Even if an ACK acknowledges a segment that is smaller than MSS bytes long, cwnd is increased by 1.
Does Slow Start increment slowly? Not really. In fact, the increase of cwnd is exponential
69
Slow Start Example

The congestion window size grows very rapidly
  For every ACK, we increase cwnd by 1 irrespective of the number of segments ACK’ed
  TCP slows down the increase of cwnd when cwnd > ssthresh
Exchange shown in the figure:
  cwnd = 1: sender sends segment 1; receiver returns ACK for segment 1
  cwnd = 2: sender sends segments 2 and 3; receiver returns ACKs for segments 2 and 3
  cwnd = 4: sender sends segments 4, 5, and 6; receiver returns ACKs for segments 4, 5, and 6
  cwnd = 7
70
Congestion Avoidance

Congestion avoidance phase is started if cwnd
has reached the slow-start threshold value

If cwnd ≥ ssthresh then each time an ACK is
received, increment cwnd as follows:


cwnd = cwnd + 1/ cwnd
So cwnd is increased by one only if all cwnd
segments have been acknowledged.
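The two growth modes can be simulated round by round. In the sketch below, slow start adds one segment per ACK, and congestion avoidance adds one segment only after a full window of ACKs has arrived. Counting ACKs this way is an assumption chosen to keep cwnd an integer; it approximates the per-ACK rule cwnd = cwnd + 1/cwnd. The function names are hypothetical, and losses are not modeled.

```python
def on_ack(cwnd: int, ssthresh: int, ca_acks: int):
    """Process one ACK; return the new (cwnd, ca_acks) pair."""
    if cwnd < ssthresh:
        return cwnd + 1, ca_acks          # slow start: +1 segment per ACK
    ca_acks += 1
    if ca_acks >= cwnd:                   # a whole window has been ACKed
        return cwnd + 1, 0                # congestion avoidance: +1 per round trip
    return cwnd, ca_acks


def simulate(rounds: int, ssthresh: int):
    """Return cwnd at the start of each round trip, assuming no loss."""
    cwnd, ca_acks, trace = 1, 0, []
    for _ in range(rounds):
        trace.append(cwnd)
        for _ in range(cwnd):             # one ACK per segment sent this round
            cwnd, ca_acks = on_ack(cwnd, ssthresh, ca_acks)
    return trace


if __name__ == "__main__":
    print(simulate(6, ssthresh=8))  # [1, 2, 4, 8, 9, 10]
```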
71
Example of Slow Start/Congestion
Avoidance
Assume that ssthresh = 8
[Figure: cwnd (in segments) plotted against roundtrip times, t=0 to t=6, with ssthresh marked at 8]
cwnd takes the values 1, 2, 4, 8 and then, after reaching ssthresh, 9 and 10
72
Responses to Congestion
So, TCP assumes there is congestion if it detects a packet loss
A TCP sender can detect lost packets via:
  Timeout of a retransmission timer
  Receipt of a duplicate ACK
TCP interprets a timeout as a binary congestion signal. When a timeout occurs, the sender performs:
  ssthresh is set to half the current size of the congestion window: ssthresh = cwnd / 2
  cwnd is reset to one: cwnd = 1
  and slow-start is entered
73
Fast Retransmit

If three or more duplicate ACKs are received in a row, the TCP sender believes that a segment has been lost.
Then TCP performs a retransmission of what seems to be the missing segment, without waiting for a timeout to happen.
Exchange shown in the figure (1K segments):
  sender: SeqNo=0       receiver: AckNo=1024
  sender: SeqNo=1024    (lost)
  sender: SeqNo=2048    receiver: AckNo=1024 (1. duplicate)
  sender: SeqNo=3072    receiver: AckNo=1024 (2. duplicate)
  sender: SeqNo=4096    receiver: AckNo=1024 (3. duplicate)
  sender: SeqNo=1024    (retransmission)
  sender: SeqNo=5120
74
Fast Recovery

Fast recovery avoids slow start after a fast retransmit
Intuition: Duplicate ACKs indicate that data is getting through
After three duplicate ACKs:
  Retransmit the packet that is presumed lost
  Set ssthresh = cwnd/2
  Set cwnd = cwnd+3
  (note the order of operations)
Increment cwnd by one for each additional duplicate ACK
When an ACK arrives that acknowledges “new data” (here: AckNo=6144):
  Set cwnd = ssthresh
  Enter congestion avoidance
Exchange shown in the figure (1K segments), starting with cwnd=12, ssthresh=5:
  sender: SeqNo=0       receiver: AckNo=1024                      cwnd=12, ssthresh=5
  sender: SeqNo=1024    (lost)
  sender: SeqNo=2048    receiver: AckNo=1024 (1. duplicate)       cwnd=12, ssthresh=5
  sender: SeqNo=3072    receiver: AckNo=1024 (2. duplicate)       cwnd=12, ssthresh=5
  sender: SeqNo=4096    receiver: AckNo=1024 (3. duplicate)       cwnd=15, ssthresh=6
  sender: SeqNo=1024    (retransmission)
  sender: SeqNo=5120    receiver: AckNo=6144 (ACK for new data)   cwnd=6, ssthresh=6
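The duplicate ACK handling above can be traced with a small sender model. The sketch follows the rule as written on the slide (ssthresh = cwnd/2, then cwnd = cwnd+3); some other descriptions of fast recovery set cwnd to ssthresh + 3 instead. The class name and attributes are hypothetical, window growth on ordinary ACKs is left out, and sequence number wraparound is ignored.

```python
class FastRecoverySender:
    """Toy model of fast retransmit and fast recovery, in segments."""

    def __init__(self, cwnd: int, ssthresh: int, last_ack: int):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.last_ack = last_ack     # highest AckNo seen so far
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmits = []        # SeqNo of each fast retransmission

    def on_ack(self, ack_no: int) -> None:
        if ack_no == self.last_ack:
            self.dup_acks += 1
            if self.dup_acks == 3:
                # Fast retransmit: resend the segment the receiver is waiting for.
                self.retransmits.append(ack_no)
                self.ssthresh = self.cwnd // 2
                self.cwnd += 3
                self.in_recovery = True
            elif self.dup_acks > 3:
                self.cwnd += 1       # each further duplicate inflates the window
        elif ack_no > self.last_ack:
            if self.in_recovery:
                self.cwnd = self.ssthresh   # deflate and continue in congestion avoidance
                self.in_recovery = False
            self.last_ack = ack_no
            self.dup_acks = 0


if __name__ == "__main__":
    s = FastRecoverySender(cwnd=12, ssthresh=5, last_ack=1024)
    for _ in range(3):
        s.on_ack(1024)                       # three duplicate ACKs
    print(s.cwnd, s.ssthresh, s.retransmits)  # 15 6 [1024]
    s.on_ack(5120 + 1024)                    # ACK covering the segment with SeqNo=5120
    print(s.cwnd, s.ssthresh)                # 6 6
```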
75
Flavors of TCP Congestion Control

TCP Tahoe (1988, 4.3BSD Tahoe)
  Slow Start
  Congestion Avoidance
  Fast Retransmit
TCP Reno (1990, 4.3BSD Reno)
  Fast Recovery
New Reno (1996)
SACK (1996)
RED (Floyd and Jacobson 1993)
76
SACK

SACK = Selective acknowledgment
Issue: Reno and New Reno retransmit at most 1 lost packet per round trip time
Selective acknowledgments: The receiver can acknowledge non-contiguous blocks of data (SACK 0-1023, 1024-2047)
Multiple blocks can be sent in a single segment.
TCP SACK:
  Enters fast recovery upon 3 duplicate ACKs
  Sender keeps track of SACKs and infers if segments are lost. Sender retransmits the next segment from the list of segments that are deemed lost.
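How a sender might infer missing data from SACK blocks can be sketched as a gap search. The function name `missing_ranges` and the representation of blocks as inclusive `(first_byte, last_byte)` pairs are assumptions for illustration; a real sender applies further rules before declaring a segment lost.

```python
def missing_ranges(cum_ack: int, sack_blocks):
    """Return the byte ranges not yet covered by the cumulative ACK or any SACK block.

    cum_ack:     next byte the receiver expects (the ordinary AckNo).
    sack_blocks: iterable of (first_byte, last_byte) ranges, inclusive.
    """
    holes, next_needed = [], cum_ack
    for first, last in sorted(sack_blocks):
        if first > next_needed:
            holes.append((next_needed, first - 1))  # gap before this block
        next_needed = max(next_needed, last + 1)
    return holes


if __name__ == "__main__":
    # Receiver has bytes 0-1023, 2048-3071, and 4096-5119.
    print(missing_ranges(1024, [(2048, 3071), (4096, 5119)]))
    # [(1024, 2047), (3072, 4095)]
```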
77
TCP in Linux

Congestion control algorithm is pluggable
  /proc/sys/net/ipv4/tcp_congestion_control
TCP read and write buffer sizes
  /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem
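These settings are plain text files and can be read from a script. The sketch below uses hypothetical helper names (`read_setting`, `parse_mem`), returns None when a file is not available (for example on a non-Linux system), and relies on the buffer size files holding three integers: minimum, default, and maximum size in bytes.

```python
from pathlib import Path

PROC_IPV4 = Path("/proc/sys/net/ipv4")


def read_setting(name: str, base=PROC_IPV4):
    """Return the contents of a settings file, or None if it cannot be read."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None


def parse_mem(text: str):
    """Split a buffer size setting into (min, default, max) bytes."""
    low, default, high = (int(v) for v in text.split())
    return low, default, high


if __name__ == "__main__":
    print("congestion control:", read_setting("tcp_congestion_control"))
    for name in ("tcp_rmem", "tcp_wmem"):
        raw = read_setting(name)
        print(name, parse_mem(raw) if raw else "not available")
```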
78
Midterm questions
ARP, ICMP, UDP, TCP, RIP, OSPF, BGP
 Compare and contrast design principles in
protocols.
 Fragmentation

79