A Map of the Networking Code in Linux Kernel 2.4.20

DRAFT – Redistribution of this material is forbidden – DRAFT
Draft 5
29 April 2003
Miguel Rio
Department of Physics and Astronomy
University College London
Gower Street
London WC1E 6BT
UK
E-mail: M.Rio@cs.ucl.ac.uk
Web: http://www.cs.ucl.ac.uk/staff/M.Rio/
Tom Kelly
LCE
Cambridge University Engineering Department
Trumpington Street
Cambridge CB2 1PZ
UK
E-mail: ctk21@cam.ac.uk
Web: http://www-lce.eng.cam.ac.uk/~ctk21/
Mathieu Goutelle
LIP Laboratory, INRIA/ReSO Team
ENS Lyon
46, allée d'Italie
69 364 Lyon Cedex 07
France
E-mail: Mathieu.Goutelle@ecl2002.ec-lyon.fr
Web: http://www.ens-lyon.fr/~mgoutell/
Gareth Fairey
Department of Physics and Astronomy
University of Manchester
Oxford Road
Manchester M13 9PL
UK
E-mail: garethf@hep.man.ac.uk
Web: http://www.hep.man.ac.uk/~garethf/
Jean-Philippe Martin-Flatin
IT Division
CERN
1211 Geneva 23
Switzerland
E-mail: jp.martin-flatin@ieee.org
Web: http://cern.ch/jpmf/
Richard Hughes-Jones
Department of Physics and Astronomy
University of Manchester
Oxford Road
Manchester M13 9PL
UK
E-mail: R.Hughes-Jones@man.ac.uk
Web: http://www.hep.man.ac.uk/~rich/
Abstract (JP)
In this technical report, we describe the structure and organization of the
networking code of Linux kernel 2.4.20. This release is the first of the 2.4 branch
to support network interrupt mitigation via a mechanism known as NAPI. We
describe the main data structures, the sub-IP layer, the IP layer, and two
transport layers: TCP and UDP. Neither IPv6 nor SCTP is considered here.
1 Introduction (JP)
When we investigated the performance of gigabit networks and end-hosts, we soon realized that
some losses occurred in end-hosts, and that it was not clear at all where these losses happened. To
get a better understanding of packet losses and buffer overflows, we gradually built a picture of
how the networking code of the Linux kernel works, and instrumented parts of the code where we
suspected that losses could happen unnoticed.
This report documents our understanding of how the networking code works in Linux kernel
2.4.20 [1]. We selected release 2.4.20 because, to date, it is the latest stable release of the Linux
kernel (2.6 has not been released yet), and because it is the first sub-release of the 2.4 tree to
support NAPI (New Application Programming Interface [4]), which supports network interrupt
mitigation and thereby introduces a major change in the way data is buffered between the
high-level TCP layer in the kernel and the low-level device driver that directly interfaces to the
NIC (Network Interface Card). Until 2.4.20 was released, NAPI was one of the main novelties of
the development branch 2.5 and was only expected to appear in 2.6; it was not supported by the
2.4 branch up to and including 2.4.19. For more information on NAPI and the new networking
features expected to appear in Linux kernel 2.6, see Cooperstein's online tutorial [5].
In this document, we describe the paths through the kernel followed by IP (Internet Protocol)
packets when they are received and transmitted by the host. Other protocols such as X.25 are
not considered here. In the lower layers, often known as the sub-IP layers, we concentrate on the
Ethernet protocol and ignore other protocols such as ATM. Finally, in the IP code, we describe
only the IPv4 code and leave IPv6 for future work. Note that the IPv6 code is not vastly different
from the IPv4 code as far as networking is concerned (larger address space, no packet
fragmentation, etc.).
The reader of this report is expected to be familiar with IP networking. For a primer on the
internals of the Internet Protocol (IP) and Transmission Control Protocol (TCP), see Stevens [6]
and Wright and Stevens [7]. Linux kernel 2.4.20 implements a variant of TCP known as
NewReno (Tom: Reno or NewReno?), with the congestion control algorithm specified in RFC
2581 [2], and the selective acknowledgment (SACK) option, which is specified in RFCs 2018 and
2883. The classic introductory books to the Linux kernel are Bovet and Cesati [10], updated a few
months ago, and Crowcroft and Phillips [3]. For Linux device drivers, see Rubini et al. [11].
In the rest of this report, we follow a bottom-up approach to investigate the Linux kernel. A brief
introduction to the most relevant data structures is given in section 3. In section 4, the sub-IP
layer is described. In section 5, we investigate the IP layer (network layer). TCP is studied in
section 6, and UDP in section 7. Finally, we present some concluding remarks in section 8.
2 Networking Code: The Big Picture (JP + Tom)
Figure 1 depicts where the networking code is located in the Linux kernel. Most of the code is in
net/ipv4. The rest of the relevant code is in net/core and net/sched. The header files can
be found in include/linux and include/net.
Rich: There should be a brief comment on what is in each bold section, e.g.:
all the include files specific to the IP stack are in include/net; those that are related to the general
part of the kernel or are specific to Linux are in include/linux.
Rich: The same for net/core etc.
/
    arch
    drivers
    fs
    include
        asm-*
        linux
        math-emu
        net
        pcmcia
        scsi
        video
    init
    ipc
    kernel
    lib
    mm
    net
        802
        bridge
        core
        ipv4
        ipv6
        sched
        wanrouter
        x25
        …
    scripts
Figure 1 – Networking code in the Linux kernel tree
The networking code of the kernel is sprinkled with netfilter hooks [15] where developers can
hang their own code and analyze or change packets. These are marked as “HOOK” in the
diagrams presented in this document.
Miguel: It would be nice to have a section on socket layer and interactions with the user space
(JP: what do you mean by that?)
JP: Include high-level explanations of what tx_ring, rx_ring, qdisc, etc. are meant for and how
they operate.
Rich: I like figs 2 & 3, but we need some text, and can we put boxes for the IP processing and
show the kernel-user interface for UDP packets?
Figures 2 and 3 present an overview of the packet flows through the kernel. They indicate the
areas where the hardware and driver code operate, the role of the kernel protocol stack, and the
kernel-application interface.
[Diagram: the NIC hardware DMAs incoming packets into the rx_ring; the device driver's
interrupt handler schedules a softirq; the softirq runs the packet through the IP firewall, IP
routing and tcp_v4_rcv; the segment then passes through the socket backlog (recv_backlog)
into the TCP receive buffer in the kernel, from which the application reads it across the
kernel/user boundary.]
Figure 2: Handling of an incoming TCP segment
[Diagram: the application writes into the socket send buffer across the kernel/user boundary
(send_msg); TCP processing, IP checksum, IP routing and IP filtering follow; qdisc_run /
qdisc_restart and net_tx_action hand the packet to dev_xmit; the device driver places the
descriptor in the tx_ring and the NIC hardware DMAs the data out; a softirq later frees
transmitted packets from the completion queue.]
Figure 3: Handling of an outgoing TCP segment
Rich: Tom's section describes the driver code, and thus I think it might be better in section 4.
3 General Data Structures (Miguel + Tom + JP)
The networking part of the kernel uses mainly two data structures: one to keep the state of a
connection called sock (for “socket”), and another to keep the status of both incoming and
outgoing packets called sk_buff (for “socket buffer”). Both of them are described in this section.
We also include a brief description of tcp_opt, a structure that is part of the sock structure and is
used to maintain the TCP connection state. The details of TCP will be presented in section 6.
3.1 Socket buffers (JP)
The sk_buff data structure is defined in include/linux/skbuff.h.
When a packet is processed by the kernel, coming either from user space or from the network
card, one of these data structures is created. Changing a field in a packet is achieved by updating
a field of this data structure. In the networking code, virtually every function is invoked with an
sk_buff (the variable is usually called skb) passed as a parameter.
The first two fields are pointers to the next and previous sk_buff’s in the linked list (packets are
frequently stored in linked lists or queues).
JP: We have forgotten to define sk_buff_head.
The socket that owns the packet is stored in sk (note that if the packet comes from the network,
the socket owner will be known only at a later stage).
The time of arrival is stored in a timestamp called stamp. The dev field stores the device from
which the packet arrived, if the packet is being received; when the device to be used for
transmission becomes known (for example, by inspection of the routing table), the dev field is
updated accordingly.
struct sk_buff {
	/* These two members must be first. */
	struct sk_buff		*next;		/* Next buffer in list			*/
	struct sk_buff		*prev;		/* Previous buffer in list		*/

	struct sk_buff_head	*list;		/* List we are on			*/
	struct sock		*sk;		/* Socket we are owned by		*/
	struct timeval		stamp;		/* Time we arrived			*/
	struct net_device	*dev;		/* Device we arrived on/are leaving by	*/
The transport section is a union that points to the corresponding transport layer structure (TCP,
UDP, ICMP, etc).
	/* Transport layer header */
	union {
		struct tcphdr	*th;
		struct udphdr	*uh;
		struct icmphdr	*icmph;
		struct igmphdr	*igmph;
		struct iphdr	*ipiph;
		struct spxhdr	*spxh;
		unsigned char	*raw;
	} h;
The network layer header points to the corresponding data structures (IPv4, IPv6, ARP, raw, etc).
	/* Network layer header */
	union {
		struct iphdr	*iph;
		struct ipv6hdr	*ipv6h;
		struct arphdr	*arph;
		struct ipxhdr	*ipxh;
		unsigned char	*raw;
	} nh;
The link-layer header is stored in a final union called mac. Only the special case of Ethernet is
included explicitly; other technologies use the raw field with appropriate casts.
	/* Link layer header */
	union {
		struct ethhdr	*ethernet;
		unsigned char	*raw;
	} mac;

	struct dst_entry	*dst;
Other information about the packet, such as its length, data length, checksum and packet type,
is stored in the structure as shown below.
	char		cb[48];
	unsigned int	len;		/* Length of actual data		*/
	unsigned int	data_len;	/* JP: This field is not defined!	*/
	unsigned int	csum;		/* Checksum				*/
	unsigned char	__unused,	/* Dead field, may be reused		*/
			cloned,		/* head may be cloned (check refcnt to be sure) */
			pkt_type,	/* Packet class				*/
			ip_summed;	/* Driver fed us an IP checksum		*/
	__u32		priority;	/* Packet queueing priority		*/
	atomic_t	users;		/* User count - see datagram.c,tcp.c	*/
	unsigned short	protocol;	/* Packet protocol from driver		*/
	unsigned short	security;	/* Security level of packet		*/
	unsigned int	truesize;	/* Buffer size				*/
	unsigned char	*head;		/* Head of buffer			*/
	unsigned char	*data;		/* Data head pointer			*/
	unsigned char	*tail;		/* Tail pointer				*/
	unsigned char	*end;		/* End pointer				*/
3.2 sock (Miguel)
The sock data structure keeps data about a specific TCP connection (e.g. TCP state) or virtual
UDP connection. When a socket is created in user space, a sock structure is allocated.
The first fields contain the source and destination addresses and ports of the socket pair.
struct sock {
	/* Socket demultiplex comparisons on incoming packets. */
	__u32		daddr;		/* Foreign IPv4 addr		*/
	__u32		rcv_saddr;	/* Bound local IPv4 addr	*/
	__u16		dport;		/* Destination port		*/
	unsigned short	num;		/* Local port			*/
	int		bound_dev_if;	/* Bound device index if != 0	*/
Among many other fields, the sock structure contains protocol-specific information. These fields
contain state information about each layer.
	union {
		struct ipv6_pinfo	af_inet6;
	} net_pinfo;

	union {
		struct tcp_opt		af_tcp;
		struct raw_opt		tp_raw4;
		struct raw6_opt		tp_raw;
		struct spx_opt		af_spx;
	} tp_pinfo;
};
3.3 TCP options (Miguel)
One of the main components of the sock structure is the TCP option field (tcp_opt). Both IP and
UDP are stateless protocols with a minimal need to store information about their connections.
TCP, however, needs to store a large set of variables. These variables are stored in the fields of
the tcp_opt structure; only the most relevant fields are shown below (the comments are
self-explanatory).
JP: This section seems fairly useless to me. We simply cut & paste the structure definition from a
header file…
Rich: True, it's cut and paste, but useful to have to hand, AND we should reference it from the
TCP section.
struct tcp_opt {
	int	tcp_header_len;	/* Bytes of tcp header to send		*/
	__u32	rcv_nxt;	/* What we want to receive next		*/
	__u32	snd_nxt;	/* Next sequence we send		*/
	__u32	snd_una;	/* First byte we want an ack for	*/
	__u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
	__u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
	__u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */

	/* Delayed ACK control data */
	struct {
		__u8	pending;	/* ACK is pending			*/
		__u8	quick;		/* Scheduled number of quick acks	*/
		__u8	pingpong;	/* The session is interactive		*/
		__u8	blocked;	/* Delayed ACK was blocked by socket lock */
		__u32	ato;		/* Predicted tick of soft clock		*/
		unsigned long timeout;	/* Currently scheduled timeout		*/
		__u32	lrcvtime;	/* timestamp of last received data packet */
		__u16	last_seg_size;	/* Size of last incoming segment	*/
		__u16	rcv_mss;	/* MSS used for delayed ACK decisions	*/
	} ack;

	/* Data for direct copy to user */
	struct {
		struct sk_buff_head	prequeue;
		struct task_struct	*task;
		struct iovec		*iov;
		int			memory;
		int			len;
	} ucopy;

	__u32	snd_wl1;	/* Sequence for window update		*/
	__u32	snd_wnd;	/* The window we expect to receive	*/
	__u32	max_window;	/* Maximal window ever seen from peer	*/
	__u32	pmtu_cookie;	/* Last pmtu seen by socket		*/
	__u16	mss_cache;	/* Cached effective mss, not including SACKS */
	__u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
	__u16	ext_header_len;	/* Network protocol overhead (IP/IPv6 options) */
	__u8	ca_state;	/* State of fast-retransmit machine	*/
	__u8	retransmits;	/* Number of unrecovered RTO timeouts	*/
	__u8	reordering;	/* Packet reordering metric		*/
	__u8	queue_shrunk;	/* Write queue has been shrunk recently	*/
	__u8	defer_accept;	/* User waits for some data after accept() */

	/* RTT measurement */
	__u8	backoff;	/* backoff				*/
	__u32	srtt;		/* smoothed round trip time << 3	*/
	__u32	mdev;		/* medium deviation			*/
	__u32	mdev_max;	/* maximal mdev for the last rtt period	*/
	__u32	rttvar;		/* smoothed mdev_max			*/
	__u32	rtt_seq;	/* sequence number to update rttvar	*/
	__u32	rto;		/* retransmit timeout			*/

	__u32	packets_out;	/* Packets which are "in flight"	*/
	__u32	left_out;	/* Packets which left the network	*/
	__u32	retrans_out;	/* Retransmitted packets out		*/

	/*
	 * Slow start and congestion control (see also Nagle, and Karn & Partridge)
	 */
	__u32	snd_ssthresh;	/* Slow start size threshold		*/
	__u32	snd_cwnd;	/* Sending congestion window		*/
	__u16	snd_cwnd_cnt;	/* Linear increase counter		*/
	__u16	snd_cwnd_clamp;	/* Do not allow snd_cwnd to grow above this */
	__u32	snd_cwnd_used;
	__u32	snd_cwnd_stamp;

	/* Two commonly used timers in both sender and receiver paths. */
	unsigned long		timeout;
	struct timer_list	retransmit_timer;	/* Resend (no ack)	*/
	struct timer_list	delack_timer;		/* Ack delay		*/

	struct sk_buff_head	out_of_order_queue;	/* Out of order segments go here */

	struct tcp_func	*af_specific;	/* Operations which are AF_INET{4,6} specific */
	struct sk_buff	*send_head;	/* Front of stuff to transmit	*/
	struct page	*sndmsg_page;	/* Cached page for sendmsg	*/
	u32		sndmsg_off;	/* Cached offset for sendmsg	*/

	__u32	rcv_wnd;	/* Current receiver window		*/
	__u32	rcv_wup;	/* rcv_nxt on last window update sent	*/
	__u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
	__u32	pushed_seq;	/* Last pushed seq, required to talk to windows */
	__u32	copied_seq;	/* Head of yet unread data		*/

	/*
	 * Options received (usually on last packet, some only on SYN packets).
	 */
	char	tstamp_ok,	/* TIMESTAMP seen on SYN packet		*/
		wscale_ok,	/* Wscale seen on SYN packet		*/
		sack_ok;	/* SACK seen on SYN packet		*/
	char	saw_tstamp;	/* Saw TIMESTAMP on last packet		*/
	__u8	snd_wscale;	/* Window scaling received from sender	*/
	__u8	rcv_wscale;	/* Window scaling to send to receiver	*/
	__u8	nonagle;	/* Disable Nagle algorithm?		*/
	__u8	keepalive_probes; /* num of allowed keep alive probes	*/

	/* PAWS/RTTM data */
	__u32	rcv_tsval;	/* Time stamp value			*/
	__u32	rcv_tsecr;	/* Time stamp echo reply		*/
	__u32	ts_recent;	/* Time stamp to echo next		*/
	long	ts_recent_stamp;/* Time we stored ts_recent (for aging)	*/

	/* SACKs data */
	__u16	user_mss;	/* mss requested by user in ioctl	*/
	__u8	dsack;		/* D-SACK is scheduled			*/
	__u8	eff_sacks;	/* Size of SACK array to send with next packet */
	struct tcp_sack_block duplicate_sack[1];	/* D-SACK block		*/
	struct tcp_sack_block selective_acks[4];	/* The SACKS themselves	*/

	__u32	window_clamp;	/* Maximal window to advertise		*/
	__u32	rcv_ssthresh;	/* Current window clamp			*/
	__u8	probes_out;	/* unanswered 0 window probes		*/
	__u8	num_sacks;	/* Number of SACK blocks		*/
	__u16	advmss;		/* Advertised MSS			*/

	__u8	syn_retries;	/* num of allowed syn retries		*/
	__u8	ecn_flags;	/* ECN status bits			*/
	__u16	prior_ssthresh;	/* ssthresh saved at recovery start	*/
	__u32	lost_out;	/* Lost packets				*/
	__u32	sacked_out;	/* SACK'd packets			*/
	__u32	fackets_out;	/* FACK'd packets			*/
	__u32	high_seq;	/* snd_nxt at onset of congestion	*/

	__u32	retrans_stamp;	/* Timestamp of the last retransmit,
				 * also used in SYN-SENT to remember stamp of
				 * the first SYN. */
	__u32	undo_marker;	/* tracking retrans started here	*/
	int	undo_retrans;	/* number of undoable retransmissions	*/
	__u32	urg_seq;	/* Seq of received urgent pointer	*/
	__u16	urg_data;	/* Saved octet of OOB data and control flags */
	__u8	pending;	/* Scheduled timer event		*/
	__u8	urg_mode;	/* In urgent mode			*/
	__u32	snd_up;		/* Urgent pointer			*/
};
3.4 Memory management (Tom)
Rich: Both of these sections are useful but don't really fit under the structures section.
JP: This is raw material from Tom (file “memory”).
JP: We need to merge this with the rest of this section.
JP: All of this section is for Linux 2.4.19. Check that it’s still valid for 2.4.20.
alloc_skb [net/core/skbuff.c - 164]
-----------------------------------
- get skb from pool head with skb_head_from_pool [net/core/skbuff.c]
- allocate data area with kmalloc from slab
- setup data pointers and skbuff state

__kfree_skb [net/core/skbuff.c - 310]
-------------------------------------
- release dst descriptor with dst_release
- call skb->destructor if present
- skb_headerinit to clean state of skbuff
- kfree_skbmem calls skb_release_data and skb_head_to_pool
3.5 Scheduler (Tom)
JP: This is raw material from Tom (file “scheduler”).
JP: We need to merge this with the rest of this section.
JP: All of this section is for Linux 2.4.19. Check that it’s still valid for 2.4.20.
Useful constants
================

HZ [include/asm/param.h]
------------------------
For i386:
#define HZ 100

tick [kernel/timer.c - 32]
--------------------------
tick is measured in microseconds and is the timer interrupt period:
long tick = (1000000 + HZ/2) / HZ;
So for i386, tick = 10000 (i.e. 10 ms).

jiffies [kernel/timer.c - 68]
-----------------------------
jiffies counts the number of ticks since boot; it is updated in do_timer
[kernel/timer.c].
12
present
skb-
DRAFT – Redistribution of this material is forbidden – DRAFT
do_timer [kernel/timer.c]
-------------------------
Called from do_timer_interrupt [arch/i386/kernel/time.c], which itself is
called from timer_interrupt. Timer interrupts are scheduled depending on the
hardware (for example, see [arch/i386/kernel/apic.c]) to occur about once per
tick, i.e. every 1/HZ seconds.
schedule [kernel/sched.c]
-------------------------
Never called from an interrupt.
- updates the previous process and adds it to the end of the runqueue, or removes it
- goes through the runqueue looking for a process to schedule using:
  + can_schedule(p, this_cpu) [kernel/sched.c]
  + goodness(p, this_cpu, prev->active_mm) [kernel/sched.c]
- checks if it has a process, otherwise updates counters
- commits to scheduling that process
- updates kstat.context_swtch (exported in /proc/stat, see fs/proc/proc_misc.c)
- prepare_to_switch() (no-op except on sparc)
- switches tlb and mm using switch_mm [include/asm*/mmu_context.h], mmdrop,
  enter_lazy_tlb
- switch_to() [include/asm*/system.h] (jump to __switch_to()
  [arch/*/kernel/process.c])
try_to_wake_up & wake_up_process [kernel/sched.c - 351 & 372]
-------------------------------------------------------------
wake_up_process is a wrapper around try_to_wake_up(p, 0).
try_to_wake_up:
- marks the process running and puts the task on the runqueue
- if the second arg is non-zero, calls reschedule_idle
- returns with success or not

reschedule_idle [kernel/sched.c - 212]
--------------------------------------
The SMP logic is complex. UP logic:
- compare the current task with the scheduled task using preemption_goodness
- if preemption_goodness is high, set need_resched; the current process will
  be switched out at the next schedule call
When are softirqs scheduled?
----------------------------
- in do_irq [arch/i386/irq.c], do_softirq is called to handle any pending
  software interrupts
- they can also be scheduled through the ksoftirqd process and the scheduler
  via a timeout:
  + whenever interrupt processing occurs while do_softirq is already running
    pending events
  + whenever a softirq is raised from outside a bh or irq
- you can see how often the softirq is raised from interrupt or ksoftirqd
  context, because it is only attributed to ksoftirqd if it is scheduled via
  the scheduler

do_softirq processes softirqs in the following order:
HI_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, TASKLET_SOFTIRQ
4 Sub-IP Layers (Mathieu)
This chapter describes the reception and handling of packets by the hardware and the NIC driver.
This corresponds to the link in the lower layers. The driver and the IP layer are tightly bound with
the driver using methods from the kernel and the IP layer. (*** Have I got this correct?)
***Need to distinguish between :
IP processing
---------------Q management by the kernel routines
-------------------------Driver operations
-------------Hardware operation on the data
*** should state clearly what bit of code sets up an entity eg a data area or a ring buffer.
Also which bit of HW or software “owns” or modifies the entities
4.1 Packet Reception
Miguel: Needs relevant files
Miguel: Needs some reordering to maybe follow the same structure
JP: Explain the point of tx_ring and rx_ring and how they operate
[Diagram: with the old API, the NIC raises an interrupt; the interrupt handler calls netif_rx,
which enqueues the packet (dropping it if the backlog is in the throttle state) and schedules the
rx softirq; the softirq (net_rx_action) dequeues the packets and hands them to ip_rcv. On the
transmitter side, packets are enqueued (dev_queue_xmit) and qdisc_restart feeds
hard_start_xmit while the queue is not empty and the device is not stopped.]
Figure 4 - Packet reception with the old API, until Linux kernel 2.4.19
Do we need a diagram to separate out the data flow and control information? E.g.:
[Diagram: the NIC's DMA engine moves the data packet from NIC memory into a ring buffer in
kernel memory, freeing and updating a descriptor; the interrupt generator raises an irq; the
interrupt handler calls netif_rx(), which enqueues the packet in the per-CPU backlog queue
(dropping it if in the throttle state) and schedules the softirq; the rx softirq (net_rx_action())
passes the packet to ip_rcv().]
[Diagram: with NAPI, the NIC raises an interrupt; the interrupt handler calls
netif_rx_schedule, which enqueues a reference to the device in the CPU's poll list and
schedules the rx softirq; the softirq (net_rx_action) calls dev->poll to pull packets from the
device and hands them to ip_rcv. On the transmitter side, qdisc_restart feeds
hard_start_xmit while the queue is not empty and the device is not stopped; the device is
stopped when the ring is full or the link fails.]
Figure 5 - Packet reception with Linux kernel 2.4.20: the new API (NAPI)
As well as containing data for the higher layers, the packets are associated with descriptors that
provide information on the physical location of the data, the length of the data, and other
control and status information. Usually the NIC driver sets up the packet descriptors and
organises them as ring buffers when the driver is loaded. Separate ring buffers are used by the
NIC's DMA engine to transfer packets to and from main memory.
Figure 4 and Figure 5 show the data flows that occur when a packet is received; the host
performs the following steps:
4.1.1 Step 1
When a packet is received by the NIC, it is put into kernel memory by the card's DMA engine.
The engine uses a list of packet descriptors that point to available areas of kernel memory where
the packet may be placed. These descriptors are held in rx_ring, a ring buffer in kernel memory.
Each available data area must be large enough to hold the maximum size of packet that the
particular interface can receive. This maximum size is specified by maxMTU. The size of the ring
is driver- and hardware-dependent. (Older cards instead use the programmed I/O (PIO) scheme,
in which the host CPU transfers the data from the card into host memory.)
During subsequent processing in the network stack, the packet data remains at the same kernel
memory location; no extra copies are involved.
Rich: Which bit of code updates the descriptor??
4.1.2 Step 2
The card interrupts the CPU, which then jumps to the driver ISR code. Here arise some
differences between the old network subsystem (in kernels up to and including 2.4.19) and NAPI.
Figure 4 shows the routines called by network stacks prior to 2.4.20: the interrupt handler calls
the netif_rx() kernel function (net/core/dev.c, line 1215). netif_rx() enqueues the received
packet in the interrupted CPU's backlog queue and schedules a softirq1, which is responsible for
further processing of the packet (e.g. TCP/IP processing). The backlog queue is 300 packets long
(/proc/sys/net/core/netdev_max_backlog). If the backlog queue becomes full, it enters the
throttle state and waits until it is completely empty before re-entering the normal state and
allowing further packets to be enqueued (netif_rx() in net/core/dev.c). While the backlog is in
the throttle state, netif_rx() drops the incoming packets.
Rich: what happens to the packet descriptors when a packet is enqueued??
Rich comment: Dropping a packet if there IS space seems a perverse and dumb thing to do - can
we say WHY it was made like this?
Backlog statistics are available from /proc/net/softnet_stat. The format of the output is
defined in net/core/dev.c, lines 1804 onward. There is one line per CPU. The columns have
the following meanings:
Rich: need a bit more detail for some of the counters
1. the packet count
2. the drop count
3. the time squeeze counter, i.e. the number of times the softirq took too long to handle the
packets from the device (e.g. the backlog). When the budget of the softirq (the number of
packets it can dequeue, which depends on the device; max = 300) reaches zero, or when its
execution time lasts more than one jiffy (10 ms), the softirq stops dequeuing packets,
increments the time squeeze counter of the CPU and reschedules itself for later execution.
4. the number of times the backlog entered the throttle state
1 A softirq (software interrupt request) is a kind of kernel thread [12] [13].
5. the number of hits in fast routes
6. the number of successes
7. the number of defers
8. the number of defers out
9. the right-most column indicates either latency reduction in fast routes or CPU collision,
depending on a #ifdef flag
An example of backlog stats is shown below:
$ cat /proc/net/softnet_stat
94d449be 00009e0e 000003cd 0000099f 0000000e 00000000 00000000 00000000 00000000
000001da 00000000 00000000 0000005b 00000000 00000000 00000000 00000000 00000000
000002ca 00000000 00000000 00000b5a 00000000 00000000 00000000 00000000 00000000
000001fe 00000000 00000000 00000010 00000000 00000000 00000000 00000000 00000000
NAPI drivers act differently: the interrupt handler calls netif_rx_schedule()
(include/linux/netdevice.h, line 738). Instead of putting the packet in the backlog, it puts a
reference to the device in a queue attached to the interrupted CPU (softnet_data->poll_list;
include/linux/netdevice.h, line 496). A softirq is then scheduled, just as in the previous case.
To ensure backward compatibility, in NAPI the backlog itself is treated as a device that handles
all the incoming packets from non-NAPI drivers, and it can be enqueued on the poll list just like
any other NIC device. netif_rx() is rewritten to enqueue the backlog into the poll_list of the
CPU after having enqueued the packet into the backlog.
*** But the diagram and other text do not discuss this further – needs clarification.
4.1.3 Step 3
When the softirq is scheduled, it executes net_rx_action() (net/core/dev.c, line 1558). Since
the previous step differs between the older network subsystem and NAPI, this one does too.
For kernel versions up to 2.4.19, net_rx_action() pulls all the packets from the backlog queue
and, for each of them, calls the ip_rcv() procedure (net/ipv4/ip_input.c, line 379), or the
corresponding routine for other packet types (ARP, BOOTP, etc.).
Rich: but step 2 above said that netif_rx() put the packets on the backlog queue.
For NAPI, the CPU polls the devices present in its poll_list to get all the received packets from
their rx_ring or from the backlog. The poll method of the backlog (process_backlog;
net/core/dev.c, line 1496), or of any device, calls netif_receive_skb() (net/core/dev.c,
line 1415) for each received packet, which in turn calls ip_rcv().
Rich: hmm I admit to being lost here !!
Rich: Should we give the advantages of NAPI – a 2-line summary from the NAPI paper? - might
be good.
4.2 Packet Transmission
[Diagram: for transmission, the IP layer enqueues each packet on the device's qdisc via
dev_queue_xmit (dropping it if the queue is full); qdisc_restart then loops while the qdisc is
not empty and the device is not stopped, calling hard_start_xmit; the driver stops the queue
when the tx_ring is full or the link fails, and restarts it when the NIC signals, via an interrupt,
that descriptors are free again.]
Figure 6 - Transmission of a packet
All the IP packets are built using the arp_constructor() method. Each packet contains a dst
field, which holds the destination computed by the routing algorithm. The dst field provides an
output method, which is dev_queue_xmit() for IP packets.
The kernel provides many queueing disciplines between the kernel and the driver; they are
intended to provide QoS support. (Rich: are these generic or just for networking?) The default
queueing discipline, or qdisc, is a FIFO queue with a default length of 100 packets
(ether_setup(): dev->tx_queue_len; drivers/net/net_init.c, line 405).
4.2.1 Step 1
For each packet to be transmitted from the IP layer, the dev_queue_xmit() procedure
(net/core/dev.c, line 991) is called. It queues the packet in the qdisc associated with the output
interface (as determined by the routing). Then, if the device is not stopped (e.g. due to link
failure or the tx_ring being full), all packets present in the qdisc are handled by
qdisc_restart() (net/sched/sch_generic.c, line 77).
Rich: is this called for each packet?
4.2.2 Step 2
The hard_start_xmit() virtual method is then called. This method is implemented in the driver
code. The packet descriptor ?? is placed in the tx_ring and the driver tells the NIC that there are
some packets to send.
4.2.3 Step 3
Once the card has sent a packet or a group of packets, it signals to the CPU that the
packets have been sent out. The CPU uses this information (net_tx_action(); net/core/dev.c, line
1326) to put the packets into a completion_queue and to schedule a softirq for later deallocating
the memory associated with these packets. This communication between the card and the CPU is
card- and driver-dependent.
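The split between interrupt-time queueing and softirq-time freeing can be sketched in user space as follows. This is a simplified model, not kernel code: the _model suffixed names are illustrative, the skb is reduced to its link field, and the per-CPU nature of the completion queue is ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy skb with only the link needed for the completion queue. */
struct skb { struct skb *next; };

static struct skb *completion_queue;   /* per-CPU in the kernel */

/* Model of dev_kfree_skb_irq(): in interrupt context the driver only
 * links the skb onto the completion queue (and raises NET_TX_SOFTIRQ);
 * the actual free happens later, outside the interrupt handler. */
static void dev_kfree_skb_irq_model(struct skb *skb)
{
    skb->next = completion_queue;
    completion_queue = skb;
}

/* Model of the freeing half of net_tx_action(): drain the completion
 * queue and release every buffer.  Returns the number freed. */
static int net_tx_action_model(void)
{
    int freed = 0;
    while (completion_queue) {
        struct skb *skb = completion_queue;
        completion_queue = skb->next;
        free(skb);
        freed++;
    }
    return freed;
}
```

The point of the two-stage design is that freeing memory is too expensive to do inside the interrupt handler; the handler only links the buffer and defers the work.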
4.3 Commands for monitoring and controlling the input and output
network queues
'ifconfig' can be used to override the length of the output packet queue using the 'txqueuelen'
option.
You can't get stats for the default input packet queue qdisc. The trick is to replace it with the same
FIFO queue using the 'tc' command:
- to replace the default qdisc: tc qdisc add dev eth0 root pfifo limit 100
- to get stats from this qdisc: tc -s -d qdisc show dev eth0
- to recover the default state: tc qdisc del dev eth0 root
Rich: I found that this chapter was much clearer after reading chapter 3. As noted at the start of
this chapter, the scope of the description should be stated, along with its context within the kernel
and the IP layer.
Rich: should mention interrupt coalescence in this chapter.
4.4 Device Drivers (Tom)
JP: This is raw material from Tom (file “device-notes”).
JP: We need to merge this with the rest of this section.
JP: All of this section is for Linux 2.4.19. Check that it’s still valid for 2.4.20.
dev_open [net/core/dev.c - 682]
------------------------------
Prepares an interface for use.
- sets up device in netdev chain
- calls dev_activate [net/sched/sch_generic.c]
- calls dev->open() on the device driver
dev_activate [net/sched/sch_generic.c - 432]
--------------------------------------------
Creates qdisc for device (default is pfifo_fast [net/sched/sch_generic.c]).
- starts any exterior watchdog timer for the device
dev_queue_xmit [net/core/dev.c - 976]
-------------------------------------
Queues a buffer for transmission to a network device queue (on a qdisc).
The function can be called from an interrupt.
- checks for hardware checksumming; if not supported, computes the checksum
- grab device outgoing queue:
+ enqueue on qdisc [net/sched/]
- checks against dev->tx_queue_len and qdisc queueing policy
- packet can be dropped here e.g. pfifo_fast_enqueue:
o checks tx_queue_len; on failure, qdisc->stats.drops++,
kfree_skb is called and NET_XMIT_DROP is returned
WARNING: this queue is not the same as that reported by ifconfig;
it is the qdisc queue and can only be inspected with /sbin/tc
+ qdisc_run [include/net/pkt_sched.h]
- checks whether the device netif queue is locked; if so, waits,
otherwise kicks the device with qdisc_restart to get it to
send packets
- qdisc_restart does locking and actually kicks device driver
through dev->hard_start_xmit
netif_rx [net/core/dev.c - 1215]
-------------------------------
Receives a packet from the device driver and queues it for the upper levels
to process. Has limited congestion control on the backlog queue; most
drivers seem to ignore this.
- gets backlog queue for this CPU
- place the packet on this CPU
- raise a softirq for this cpu with cpu_raise_softirq
- return congestion level
- stats are kept in the netdev_rx_stat array of cpus for drops and throttles.
* these statistics are returned in /proc/net/softnet_stat
net_tx_action [net/core/dev.c - 1326]
-------------------------------------
Soft IRQ for send that is set up from net_dev_init as NET_TX_SOFTIRQ.
It is scheduled from:
- netif_schedule [include/linux/netdevice.h]
* called from within net_tx_action to reschedule when the lock couldn't be taken
* called from qdisc_restart if the device lock is held
- dev_kfree_skb_irq (called from drivers) adds skbs to the completion
queue to be deallocated by net_tx_action. Executed in interrupt context.
Essentially has two functions: to clear the completion queue when a
TX interrupt is raised, and to kick the output_queue if there is data on the
outgoing qdisc.
- checks the cpu's completion queue and frees all skbs there
- checks the cpu's output queue and tries to send packets through qdisc_run;
if busy, reschedules with netif_schedule
net_rx_action [net/core/dev.c - 1420]
-------------------------------------
Soft IRQ for receive that is set up from net_dev_init as NET_RX_SOFTIRQ.
It is scheduled from:
- net_rx_action itself, when the processing is time-squeezed [net/core/dev.c]
(i.e. it uses up its timeslice or handles more than the backlog queue length)
- netif_rx to schedule after a receive [net/core/dev.c]
Essentially breaks the receive processing out of the IRQ and allows
scheduling of receive-side processing:
- takes item off the input_pkt_queue of this cpu
- checks for fastroute
- tries to find higher level protocols that want all packets
- tries diverts and bridging
- looks for the specific protocol handler for this type of packet
- throttling code: re-enables the device if the backlog queue has cleared
- break the receive processing if it takes too long or we have handled
netdev_max_backlog of them
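The bounding logic of net_rx_action can be sketched as follows. This is an illustrative user-space model (the _model suffix marks it as such): it keeps only the work-count bound from the description above, while the kernel also breaks out after one jiffy of processing.

```c
#include <assert.h>

#define NETDEV_MAX_BACKLOG 300   /* default netdev_max_backlog */

/* Model of the net_rx_action() bounding logic: deliver packets from the
 * input queue, but break out once netdev_max_backlog packets have been
 * handled, asking for the softirq to be rescheduled if work remains. */
static int net_rx_action_model(int *queue_len, int *resched)
{
    int work = 0;
    while (*queue_len > 0) {
        (*queue_len)--;                       /* deliver one packet upward */
        if (++work >= NETDEV_MAX_BACKLOG) {
            *resched = (*queue_len > 0);      /* work squeeze: come back later */
            break;
        }
    }
    return work;
}
```

The bound matters for fairness: without it, a flooded interface could keep the softirq running and starve everything else on that CPU.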
netif_stop_queue & netif_start_queue [include/linux/netdevice.h]
----------------------------------------------------------------
netif_stop_queue stops the device outgoing queue if it fills for
some reason. It is generally called from:
- device drivers doing dev->hard_start_xmit with no more descriptors
It prevents many routines from enqueuing more packets, through
netif_queue_stopped, which is checked in:
- dev_queue_xmit [net/core/dev.c] when trying to send a packet
- qdisc_run [include/net/pkt_sched.h], before doing qdisc_restart
netif_start_queue starts the queue again and is generally called from:
- device drivers when they free tx descriptors (on interrupt from tx)
GENERIC DEVICE DRIVER ROUTINES
==============================
init_netdev [drivers/net/net_init.c - 129]
------------------------------------------
Allocates the name and the device structure.
- Calls netdev_boot_setup_check
- Does the device specific setup by calling the passed in setup function
- registers the device in the kernel
- returns the net_device structure to the caller
init_etherdev [drivers/net/net_init.c - 210]
--------------------------------------------
Calls init_netdev with ether_setup as the setup function.
ether_setup [drivers/net/net_init.c - 405]
------------------------------------------
Sets up the device structure with Ethernet generic functions, e.g.:
<snippet>
dev->change_mtu         = eth_change_mtu;
dev->hard_header        = eth_header;
dev->rebuild_header     = eth_rebuild_header;
dev->set_mac_address    = eth_mac_addr;
dev->hard_header_cache  = eth_header_cache;
dev->header_cache_update= eth_header_cache_update;
dev->hard_header_parse  = eth_header_parse;
dev->type               = ARPHRD_ETHER;
dev->hard_header_len    = ETH_HLEN;
dev->mtu                = 1500; /* eth_mtu */
dev->addr_len           = ETH_ALEN;
dev->tx_queue_len       = 100;
dev->flags              = IFF_BROADCAST|IFF_MULTICAST;
</snippet>
SK98LN Driver - as found in 2.4.19
==================================
This driver looks like it is ported from another operating system.
It winds and twists and doesn't look that clean...
The driver uses the transmit complete interrupt for every packet
by default. See the #define's USE_TX_COMPLETE and USE_INT_MOD in skge.c
Also the driver copies small incoming frames on a receive interrupt
to try to save memory; but this happens exactly when the system is
loaded and the bus and CPU are also scarce. See SK_COPY_THRESHOLD.
Useful structs and typedefs in [drivers/net/sk98ln/h/skdrv1st.h]
----------------------------------------------------------------
typedef struct s_AC SK_AC;
For struct s_AC definition see [drivers/net/sk98ln/h/skdrv2nd.h]
Useful structs and typedefs in [drivers/net/sk98ln/h/skdrv2nd.h]
----------------------------------------------------------------
typedef struct s_DevNet DEV_NET;
struct s_DevNet {
    int    PortNr;
    int    NetNr;
    int    Mtu;
    int    Up;
    SK_AC *pAC;
};
skge_probe [drivers/net/sk98ln/skge.c - 405]
--------------------------------------------
Probes and sets up the card.
- checks if probed already; checks PCI, prints version
- create_proc_entry (see [drivers/net/sk98lin/skproc.c] for what it provides)
- allocates the etherdev structure
- allocates private s_AC structure in dev->priv->pAC
- sets up the function pointers:
<snippet>
dev->open               = &SkGeOpen;
dev->stop               = &SkGeClose;
dev->hard_start_xmit    = &SkGeXmit;
dev->get_stats          = &SkGeStats;
dev->set_multicast_list = &SkGeSetRxMode;
dev->set_mac_address    = &SkGeSetMacAddr;
dev->do_ioctl           = &SkGeIoctl;
dev->change_mtu         = &SkGeChangeMtu;
</snippet>
- sets up the proc link backs
- maps the regs into kernel space with ioremap
- calls SkGeBoardInit [drivers/net/sk98ln/skge.c]
- sets the mac address
- checks for multiple ports on the board
SkGeBoardInit [drivers/net/sk98ln/skge.c - 839]
-----------------------------------------------
Sets up descriptor ring, allocates IRQ and examines config settings.
- sets up tx and rx descriptors
- Does the following twice with two different levels (?!)
* resets the board with SkGeInit
* sets up the other modules:
<snippet>
SkI2cInit( pAC, pAC->IoBase, 0);
SkEventInit(pAC, pAC->IoBase, 0);
SkPnmiInit( pAC, pAC->IoBase, 0);
SkAddrInit( pAC, pAC->IoBase, 0);
SkRlmtInit( pAC, pAC->IoBase, 0);
SkTimerInit(pAC, pAC->IoBase, 0);
</snippet>
- sets up the interrupt service routine (SkGeIsr or SkGeIsrOnePort)
using request_irq
- Allocates memory for board with BoardAllocMem
- Sets receive flags with SkCsSetReceiveFlags
- does BoardInitMem and SetQueueSizes
- outputs some product and config settings then SkGeYellowLED
SkGeIsrOnePort and SkGeIsr [drivers/net/sk98ln/skge.c]
------------------------------------------------------
Interrupt service routines.
Mainly demux the interrupts received to the functions:
- ReceiveIrq
- FreeTxDescriptors (seems to depend on USE_TX_COMPLETE config option)
- ClearTxIrq
- ClearAndStartRx
- SkGeSirqIsr special IRQ (!?)
There are a bunch of counters here of the type:
SK_PNMI_CNT_RX_INTR, SK_PNMI_CNT_TX_INTR, ClearTxIrq
Where are they exposed?
ReceiveIRQ [drivers/net/sk98ln/skge.c]
--------------------------------------
Called when a Receive IRQ is set. Walks the receive descriptor ring
and sends up the frames that have completed.
- circles through ring
- exits via FillRxRing if the descriptor not ready
- checks length and STF, EOF jumping to rx_failed if bad
* rx_failed does not store error
* releases DMA and unmaps a page
* patches up ring buffer...
- checks framestats for RX_CTRL_STAT_VALID and XMR_FS_ANY_ERR | XMR_FS_2L_VLAN
* patches up DMA and buffer...
* no logging here...
- if short frame then copy data to reduce memory waste (!?)
* allocates new skb
* uses eth_copy_and_sum
* ReQueueRxBuffer with this frame
# THIS IS OUTRAGEOUS: short frames arrive exactly when there is most
# bus and CPU load; this does not look like a good idea!
- otherwise:
* release the DMA mapping
* set the length in the skb
* checks the checksums from the card for the IP and TCP frames...
- checks if this frame is for a protocol we handle; if so:
* update counter SK_PNMI_CNT_RX_OCTETS_DELIVERED
* pass to netif_rx for receive at the next layer
No congestion management is performed with the return value from netif_rx
- otherwise:
* drop frame if not active port
* check if the packet is for rlmt; if so, put it in the event queue
- when RXD is empty call FillRxRing and ClearAndStartRx
FillRxRing & FillRxDescriptor [device/net/sk98ln/skge.c]
--------------------------------------------------------
FillRxRing simply calls FillRxDescriptor for each
descriptor in the ring to refresh the available DMA rings.
FillRxDescriptor allocates a new receive buffer by:
- alloc_skb of the size pAC->RxBufSize
- skb_reserve to align IP Frames
- updates the Rx ring pointers
- sets up the dma mapping with pci_map_page
- sets the control settings on this descriptor
SkGeXmit [device/net/sk98ln/skge.c - 1726]
------------------------------------------
Places the frame in the tx descriptor ring.
- calls XmitFrame to do the work...
- calls netif_stop_queue if transmitter out of resources (not logged)
[include/linux/netdevice.h]
XmitFrame [device/net/sk98ln/skge.c - 1777]
-------------------------------------------
Logically: lock buffer, allocate descriptor, link buffer to descriptor,
set control bits on descriptor.
- see if ring free, with retry through FreeTxDescriptors
- advance tx descriptor pointers
- set up DMA mapping to the payload with pci_map_page
- set up the control words on the TX descriptors
- try to kick the frame out by setting the Q start on the card
- if the TX descriptors are full, return signal to stop netif queue
5 Network Layer (Miguel)
The main files implementing the network layer are:
- ip_input.c – processing of the packets arriving at the host
- ip_output.c – processing of the packets leaving the host
- ip_forward.c – processing of the packets being routed by the host
Other files include:
- ip_fragment.c – IP packet fragmentation
- ip_options.c – IP options
- ipmr.c – multicast
- ipip.c – IP over IP
5.1 The Data Path through the Network Layer (Miguel)
Figure 7 - Network layer data path
5.1.1 Unicast
Figure 7 describes the paths that a packet traverses inside the Linux IP layer. Packet reception
from the network is shown on the left hand side, and packets to be transmitted flow down the right
hand side of the diagram. If the packet has reached the host from the network, it has passed
through the functions already described in chapter \ref{} until it reaches net_rx_action(), which
passes the packet to ip_rcv(). After passing the first netfilter hook (see Section \ref{} *** which
reference was intended here?), the packet reaches ip_rcv_finish(), which verifies whether the
packet is for local delivery. If it is addressed to this host, the packet is given to
ip_local_delivery(), which in turn will give it to the appropriate transport layer function (TCP,
UDP, etc.).
A packet can also reach the IP layer coming from the upper layers (e.g. delivered by TCP or
UDP, or coming directly to the IP layer from some applications). The first function to process the
packet is ip_queue_xmit(), which passes the packet to the output part through ip_output().
In the output part the last changes to the packet are made in ***, and the function
dev_queue_xmit() is called, which enqueues the packet in the output queue. It also tries to run
the network scheduler mechanism by calling qdisc_run(). This pointer will point to different
functions depending on the qdisc installed. By default a FIFO scheduler is installed, but this
can be changed with the tc utility, as described above.
The scheduling functions (qdisc_restart() and dev_queue_xmit_init()) run independently from the
rest of the IP code. We should say how this is done.
Rich: The above is not too clear. More detail is needed, e.g. what happens if the queue is full?
How is the error returned to the user (if it is)?
A lot of routines have been mentioned – what does each of them really do?
5.1.2 Routing
If the packet has the IP address of another host as its destination, our host is acting
as a router (a frequent scenario in small networks). If the host is configured to perform forwarding
(this can be seen and set through the value of the /proc/sys/net/ipv4/ip_forward variable), the
packet then has to be processed by a set of complex but very efficient functions.
The route is calculated by calling ip_route_input(), which (if a fast hash entry does not exist) calls
ip_route_input_slow(). ip_route_input_slow() calls the FIB (Forwarding Information Base) set of
functions in the fib*.c files. The FIB structure is quite complex (see for example \cite{} *** ??).
Miguel: Maybe we need a picture and a better explanation of this.
Rich: my suggestion is to keep it simple as we don’t do much routing in our hosts.
If the packet is a multicast packet, the function that calculates the set of devices on which to
transmit the packet (in this case the IP destination is unchanged) is ip_route_input_mc().
After the route is calculated, .... inserts the new IP destination in the IP packet and the output
device in the sk_buff structure. The packet is then passed to the forwarding functions
(ip_forward() and ip_forward_finish()), which send it to the output components. This is further
discussed in Section ***.
Miguel: This section needs sections on: ARP, ICMP, multicast (both IGMP and multicast
routing), packet fragmentation and IPv6. None of these is very complicated.
5.1.3 Multicast
When a host joins an IP multicast group, the corresponding level 2 (MAC) multicast address is
loaded into the NIC, so that this set of multicast packets is in fact given to the kernel and not
dropped by the NIC. A MAC multicast address corresponds to several IP multicast groups, so
further checking is required to establish that the packet is for this host.
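The IP-to-MAC multicast mapping can be sketched as follows. This models the kernel's ip_eth_mc_map() behaviour; the _model name is illustrative, and the group address is assumed to be in host byte order here for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Model of IPv4-to-Ethernet multicast mapping: the low 23 bits of the
 * group address are placed after the fixed prefix 01:00:5e.  Since
 * 5 bits of the 28-bit group id are discarded, 32 different IP groups
 * share each MAC address, which is why the IP layer must re-check the
 * destination even when the NIC accepted the frame. */
static void ip_eth_mc_map_model(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;   /* top bit of the 23 is dropped */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

For example, 224.1.1.1 and 239.129.1.1 both map to 01:00:5e:01:01:01, so a host subscribed to one of them will receive frames for the other and must filter them at the IP layer.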
5.2 Raw material from Tom (file “ip”)
JP: We need to merge this with the rest of this section.
JP: All of this section is for Linux 2.4.19. Check that it’s still valid for 2.4.20.
tcp_transmit_skb [net/tcp_output.c] calls tp->af_specific->queue_xmit
which resolves to ip_queue_xmit [net/ip_output.c] performs routing
lookup and passes it to dst->output
which resolves to ip_output [net/ip_output.c] and calls ip_finish_output
which looks in the hh_cache or dst->neighbour to output
packets; the device header is placed in neigh_connected_output,
and then neigh->ops->queue_xmit is called, which resolves to
dev_queue_xmit through the arp lookup.
6 TCP (all?)
Rich: first notes from the UCL DataTAG meeting:
We need a list of when and how the following actions occur.
When does TCP get out of the slow start phase into the congestion avoidance phase?
- txqueuelen exceeded
- ssthresh exceeded
- ACKs lost in slow start?
When is cwin-moderate() called [and what does it do]
What changes ssthresh
What causes loss of throughput / reduction in cwnd?
- Receiving a triple ACK
- A timeout occurring
- The TX queue being full
- Loss of CPU time / power available to the networking code {time-squeeze}
What are the conditions for losing an incoming packet?
- Receive ring buffer is full – which counter tells us this – ifconfig error or overrun?
- No free space in queues from IP to transport – ipdiscards?
- Checksum incorrect – which counters?
What causes the congestion window undo code to be called? [*** is this cwin-moderate() or
tcp_undo_cwnd()? my notes are not clear]
- Separate for send and receive?
- Modem optimisation algorithms
- If a timeout occurs and THEN you get an ACK
Other reductions – when / how do they occur?
*** I suggest that the discussion of the data flow and routines is given first, followed by the lists
of conditions given above.
This section describes what is possibly the most complex part of the Linux networking code: the
implementation of TCP.
The main files that handle TCP are:
- tcp_input.c – the biggest file; it deals with incoming packets from the network
- tcp_output.c – deals with sending packets to the network
- tcp.c – general TCP code
- tcp_ipv4.c – IPv4-specific TCP code
- tcp_timer.c – timer management
- tcp.h – TCP definitions
Figure 8 and Figure 9 depict the TCP data path. Input processing is described in Figure 8 and
output processing is illustrated by Figure 9.
Miguel: These two figures should be on opposite pages
Rich: The diagrams are very useful but not too clear. Do you need an offer to re-draw them?
Figure 8 – TCP: input processing
Figure 9 – TCP: output processing
6.1 TCP Input (Miguel)
Mainly implemented in ????????????/tcp_input.c.
This is the main part of the TCP implementation. It deals with the reception of TCP packets.
The sender and receiver code is tightly coupled, since an entity can be both at the same time.
Incoming packets are made available to the TCP routines from the IP layer by ip_local_deliver(),
shown on the left side of Figure 8. This routine gives the packet to the function pointed to by
ipproto->handler. *** can you refer to the struct in Section 2? For the IPv4 protocol stack this is
tcp_v4_rcv(), which calls tcp_v4_do_rcv(). The function tcp_v4_do_rcv() in turn calls a further
function depending on the TCP state of the connection.
If the connection is established (state is TCP_ESTABLISHED), it calls tcp_rcv_established().
This is the main case that we will examine from now on. If the state is TIME_WAIT, it calls
tcp_timewait_process(); all other states are processed by tcp_rcv_state_process(). For example,
this function will call tcp_rcv_synsent_state_process() if the state is SYN_SENT.
For some TCP states (e.g. **call setup??), tcp_rcv_state_process() and tcp_timewait_process()
have to initialise the TCP structures. They call tcp_init_buffer_space() and tcp_init_metrics().
tcp_init_metrics() initialises the congestion window by calling tcp_init_cwnd().
The following sub-sections describe the action of the functions shown in Figure 8 and Figure 9.
6.1.1 tcp_rcv_established() (???)
tcp_rcv_established() has two modes of operation: fast path and slow path. We first follow the
slow path, since it is clearer, and leave the fast path for the end (note that in the code the fast path
is dealt with first).
6.1.2 tcp_rcv_established() - Slow path (???)
The slow path code follows the 7 steps of RFC *** (labelled STEP xx in the code), with some
other operations:
The checksum is calculated with tcp_checksum_complete_user(). If it is not correct, the packet is
discarded.
The PAWS (Protection Against Wrapped Sequence numbers [14]) check is done with
tcp_paws_discard().
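The core of the PAWS test can be sketched as follows. This is a simplified model with an illustrative name: the real tcp_paws_discard() also lets segments through when ts_recent is stale, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the core PAWS check: a segment whose timestamp is older than
 * the last timestamp recorded in ts_recent is rejected.  Signed 32-bit
 * arithmetic on the difference makes the comparison correct even when
 * the timestamp clock wraps around. */
static int paws_reject_model(uint32_t ts_val, uint32_t ts_recent)
{
    return (int32_t)(ts_recent - ts_val) > 0;   /* ts_val is older */
}
```

The signed subtraction is the whole trick: a timestamp just before the 32-bit wrap compares as older than one just after it, which a plain unsigned comparison would get wrong.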
STEP 1 - The sequence number of the packet is checked. If it is not in sequence, the receiver
sends a DUPACK with tcp_send_dupack(). tcp_send_dupack() may have to set a SACK
(tcp_dsack_set()), but it finishes by calling tcp_send_ack().
STEP 2 - Check the RST (connection reset) bit (th->rst); if it is set, call tcp_reset().
STEP 3 - Check security and precedence (this is not implemented)
STEP 4 - Check the SYN bit. If it is set, call tcp_reset()... *** more?
Calculate an estimate of the RTT (RTTM) by calling tcp_replace_ts_recent().
STEP 5 - Check the ACK field. If it is set, the packet brings an ACK and tcp_ack() is called
(more details in Section *** below).
STEP 6 - Check the URG (urgent) bit. If it is set, call tcp_urg(). *** more?
STEP 7 - Process data on the packet. This is done by calling tcp_data_queue() (more details in
Section 6.1.3 below).
Check if there is data to send by calling tcp_data_snd_check(). This function calls
tcp_write_xmit() in the TCP output code.
Finally, check if there are ACKs to send with tcp_ack_snd_check(). This may result in sending an
ACK straight away with tcp_send_ack() or scheduling a delayed ACK with
tcp_send_delayed_ack(). The delayed ACK is stored in tcp->ack.pending .
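As a rough illustration, the ordering of the slow-path checks can be sketched as follows. This is a user-space model with illustrative names (slow_path_model, struct seg); each real step does far more work than the single flag test shown here.

```c
#include <assert.h>

/* Labels for the first action taken on a segment; illustrative only. */
enum action { DISCARD, SEND_DUPACK, RESET, PROCESS };

/* Toy segment holding just the flags the model looks at. */
struct seg { int csum_ok, in_seq, rst, syn; };

/* Much-simplified model of the slow-path ordering in
 * tcp_rcv_established(): checksum first, then the numbered steps
 * in the order the text describes them. */
static enum action slow_path_model(const struct seg *s)
{
    if (!s->csum_ok) return DISCARD;       /* bad checksum */
    if (!s->in_seq)  return SEND_DUPACK;   /* STEP 1: out of sequence */
    if (s->rst)      return RESET;         /* STEP 2: tcp_reset() */
    if (s->syn)      return RESET;         /* STEP 4: unexpected SYN */
    return PROCESS;                        /* STEPs 5-7: ACK, URG, data */
}
```

The ordering matters: a segment that fails the checksum or the sequence check never reaches the RST or data-processing steps, so a corrupted RST cannot tear down the connection.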
6.1.3 tcp_data_queue() & tcp_event_data_recv()(???)
tcp_data_queue() is the function responsible for giving the data to the user. If the packet arrived
in order (all previous packets having already arrived), it copies the data to tp->ucopy.iov
(skb_copy_datagram_iovec(skb, 0, tp->ucopy.iov, chunk)).
Rich: can you reference the struct in section2
If the packet did not arrive in order, it is put in the out-of-order queue with tcp_ofo_queue().
If a gap in the queue is filled, section 4.2 of RFC 2581 [2] says that we should send an ACK
immediately (tp->ack.pingpong = 0 and tcp_ack_snd_check() will send the ACK now).
The arrival of a packet has several consequences. These are dealt by calling
tcp_event_data_recv(). This first schedules an ACK with tcp_schedule_ack(), and then estimates
the MSS (Maximum Segment Size) with tcp_measure_rcv_mss().
In certain conditions (e.g. if we are in slow start), the receiver TCP should be in quickack mode,
where ACKs are sent immediately. If this is the case, tcp_event_data_recv() switches this on
with tcp_incr_quickack(). It may also have to increase the advertised window with
tcp_grow_window().
Finally, tcp_data_queue() checks if the FIN bit is set; if so, tcp_fin() is called.
6.1.4 tcp_ack() (???)
Every time an ACK is received, tcp_ack() is called. The first thing it does is to check whether the
ACK acknowledges data beyond what has been sent (goto uninteresting_ack) or is older than
previous ACKs (goto old_ack); in either case it can probably be ignored.
If everything is normal, it updates the sender's window with tcp_ack_update_window()
and/or tcp_update_wl(). An ACK may be considered “normal” if it acknowledges the next section
of contiguous data starting from the pointer to the last fully acknowledged block of data.
Miguel: When is everything normal ?
If the ACK is dubious (e.g. there was nothing unacknowledged on the retransmission queue, or
???), enter fast retransmit with tcp_fastretrans_alert() (see Section *** below).
Miguel: In what exact conditions is fast_retransmit() called ?
If (???), enter slow start/congestion avoidance with tcp_cong_avoid(). This function implements
both the exponential increase of slow start and the linear increase of congestion avoidance.
Miguel: Under which conditions is tcp_cong_avoid() entered ?
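The two growth regimes can be sketched as follows. This is a simplified Reno-style model, not the kernel function: the struct and field names mirror struct tcp_opt, but cong_avoid_model is an illustrative name and details (window clamping, units) are omitted.

```c
#include <assert.h>

/* The three congestion-control fields the model needs. */
struct cong { unsigned snd_cwnd, snd_ssthresh, snd_cwnd_cnt; };

/* Sketch of the logic in tcp_cong_avoid(): below ssthresh the window
 * grows by one segment per ACK (exponential per RTT: slow start);
 * at or above it, by one segment per window's worth of ACKs (linear
 * congestion avoidance), counted in snd_cwnd_cnt. */
static void cong_avoid_model(struct cong *tp)
{
    if (tp->snd_cwnd < tp->snd_ssthresh) {
        tp->snd_cwnd++;                    /* slow start */
    } else if (++tp->snd_cwnd_cnt >= tp->snd_cwnd) {
        tp->snd_cwnd++;                    /* one increment per cwnd ACKs */
        tp->snd_cwnd_cnt = 0;
    }
}
```

Calling this once per incoming ACK reproduces the familiar shape: the window doubles each round-trip until it reaches ssthresh, then grows by one segment per round-trip.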
Note that tcp_ack() should not be confused with tcp_send_ack(), which is called by the "receiver"
to send ACKs via tcp_write_xmit().
6.1.5 tcp_fast_retransmit() (???)
Under certain conditions, tcp_fastretrans_alert() is called by tcp_ack() (it is only called from
tcp_ack()). To understand these conditions we have to go through the Linux
NewReno/SACK/FACK/ECN state machine. What follows is practically a copy of a comment in
tcp_input.c. Note that this “ACK state machine” has nothing to do with the TCP state machine,
whose state is usually *** clarify? TCP_ESTABLISHED.
The Linux state machine can be in any of the following states:
- Open: normal state, no dubious events, fast path.
- Disorder: in all respects it is "Open", but requires a bit more attention. It is entered when
we see some SACKs or DUPACKs. It is split off from "Open" mainly to move some processing
from the fast path to the slow one.
- CWR: CWND was reduced due to some congestion notification event, which can be ECN,
ICMP source quench, 3 duplicate ACKs ??, or local device congestion.
- Recovery: CWND was reduced, but now we are fast-retransmitting.
- Loss: CWND was reduced due to an RTO timeout or SACK reneging.
This state is kept in tp->ca_state as TCP_CA_Open, TCP_CA_Disorder, TCP_CA_CWR,
TCP_CA_Recovery or TCP_CA_Loss.
tcp_fastretrans_alert() is entered if the state is not "Open" when an ACK is received, or when
"strange" ACKs are received (SACK, DUPACK, ECN ECE).
Miguel: In some non-"Open" conditions it is not entered... check this.
6.2 Fast path (???)
The fast path is entered under certain conditions. For example, a receiver usually enters the fast
path, since TCP processing on the receiver side is much simpler. However, this is not the only
case.
Miguel: When is the fast path entered, and which steps does it follow?
6.3 SACKs (Tom)
The Linux TCP implementation fully implements SACKs (selective ACKs) \cite{}. The
connection's SACK capabilities are stored in the tp->sack_ok field (FACKs are enabled if bit 2
is 1, and DSACKs (duplicate SACKs) are enabled if bit 3 is 1).
SACKs occupy an unexpectedly large part of the TCP implementation. More than a dozen
functions and significant parts of other functions are dedicated to implementing SACKs.
Miguel: SACKs need more work...
Miguel: What are FACKs ?
6.4 Quick ACKs (Tom)
At certain times, the receiver enters quickack mode, in which delayed ACKs are disabled. One
example is in slow start, when delaying ACKs would slow down the slow start considerably.
tcp_enter_quick_ack_mode() is called by tcp_rcv_synsent_state_process(), since at the beginning
of the connection this should be the state.
Miguel: When does the receiver enter quickack mode ?
6.5 Timeouts (???)
Timeouts are vital for the correct behavior of the TCP functions. They are used, for example, to
infer a packet loss in the network. Here we trace the events on registering and triggering of the
RETRANSMIT timer. These can be seen in Figure 10 and Figure 11.
Figure 10 - Scheduling a timeout
The retransmit timer is set when a packet is sent (this will be seen in more detail in Section
\ref{}). tcp_push_pending_frames() calls tcp_check_probe_timer(), which may call
tcp_reset_xmit_timer(). This schedules a software interrupt (software interrupts are handled by
non-networking parts of the kernel).
When the timeout expires, a software interrupt is generated; it calls timer_bh(), which calls
run_timer_list(). This calls timer->function, which will, in this case, point to
tcp_write_timer(). This calls tcp_retransmit_timer(), which finally calls tcp_enter_loss().
The state of the Linux state machine will be set to TCP_CA_Loss, and tcp_fastretrans_alert()
will schedule the retransmission of the packet.
Miguel: Why is fast retransmit called here ?
Figure 11 - ??????????????
6.6 ECN (Tom)
The ECN (Explicit Congestion Notification) code is not difficult to trace.
Almost all of the code is in tcp_ecn.h in the include/net directory. It contains the code to
receive and send the several ECN packet types.
On the packet input processing side there are some references:
- tcp_ack() calls TCP_ECN_rcv_ecn_echo() to possibly process an ECN packet.
Miguel: We need some more detail about ECN.
6.7 TCP output (???)
This part of the code (mainly tcp_output.c) deals with packets going out of the host, and includes
both data packets from the "sender" and ACKs from the "receiver".
This is negotiated in the first SYN packets (like any other TCP option) (see tcp_init_metrics()).
Rich: what is??
Miguel: The TCP section might gain from a section on connection establishment and release.
6.8 Raw material from Tom (file “networking-notes”)
JP: We need to merge this with the rest of this section.
JP: All of this section is for Linux 2.4.19. Check that it’s still valid for 2.4.20.
TCP CONGESTION AVOIDANCE ROUTINES
=================================
Linux NewReno/SACK/FACK/ECN state machine at [net/ipv4/tcp_input.c - 1097]
enum tcp_ca_state [include/linux/tcp.h - 136]
---------------------------------------------
TCP_CA_Open = 0      Normal state, no dubious events, fast path.
TCP_CA_Disorder = 1  In all respects it is "Open", but requires a bit
                     more attention. It is entered when we see some
                     SACKs or dupacks. It is split off from "Open"
                     mainly to move some processing from the fast path
                     to the slow one.
TCP_CA_CWR = 2       CWND was reduced due to some Congestion
                     Notification event: ECN, ICMP source quench, or
                     local device congestion.
TCP_CA_Recovery = 3  CWND was reduced, we are fast-retransmitting.
TCP_CA_Loss = 4      CWND was reduced due to RTO timeout or SACK reneging.
tcp_enter_cwr [include/net/tcp.h - 1139]
----------------------------------------
Called on device congestion or ECE.
- clears prior_ssthresh and if state is not TCP_CA_CWR it enters
that state with tp->ca_state = TCP_CA_CWR
- helper function __tcp_enter_cwr called which:
{
tp->undo_marker = 0;
tp->snd_ssthresh = tcp_recalc_ssthresh(tp); [include/net/tcp.h]
tp->snd_cwnd = min(tp->snd_cwnd,
tcp_packets_in_flight(tp) + 1U); [include/net/tcp.h]
tp->snd_cwnd_cnt = 0;
tp->high_seq = tp->snd_nxt;
tp->snd_cwnd_stamp = tcp_time_stamp;
tp->TCP_ECN_queue_cwr(tp); [queues ECN CWR in tp->ecn_flags, include/net/tcp_ecn.h]
}
- the variable tp->ca_state signals responses and is used in:
(*) denotes an explicit check for TCP_CA_CWR
+ [include/net/tcp.h] tcp_current_ssthresh
+ [net/ipv4/tcp.c] tcp_disconnect, tcp_getsockopt for TCP_INFO
+ [net/ipv4/tcp_diag.c] tcpdiag_fill
+ [net/ipv4/tcp_input.c] tcp_update_metrics, tcp_update_reordering,
tcp_sacktag_write_queue, tcp_enter_loss, tcp_try_undo_recovery,
tcp_try_undo_loss, tcp_try_to_open (* - calls tcp_cwnd_down as a result),
tcp_fastretrans_alert (*), tcp_clean_rtx_queue, tcp_ack_is_dubious,
tcp_may_raise_cwnd, tcp_cwnd_application_limited
+ [net/ipv4/tcp_minisocks.c] tcp_create_openreq_child
+ [net/ipv4/tcp_output.c] tcp_simple_retransmit,
tcp_xmit_retransmit_queue
+ [net/ipv4/tcp_timer.c] tcp_retransmit_timer
More on the Linux NewReno/SACK/FACK/ECN state machine at:
[net/ipv4/tcp_input.c - 1097]
tcp_complete_cwr [net/ipv4/tcp_input.c - 1444]
----------------------------------------------
{
tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
tp->snd_cwnd_stamp = tcp_time_stamp;
}
Called from tcp_fastretrans_alert: when an ACK beyond the end of the data is
received while in TCP_CA_CWR state, and again when leaving the loss recovery
stage.
Since snd_ssthresh is half of snd_cwnd, the actual reduction takes place here...
tcp_cwnd_down [net/ipv4/tcp_input.c - 1295]
-------------------------------------------
Called only from tcp_try_to_open [net/ipv4/tcp_input.c] when
tp->ca_state == TCP_CA_CWR
and tcp_fastretrans_alert [net/ipv4/tcp_input.c] when loss detected
and state machine changes.
It is *clever*: the cwnd reduction is applied over the whole round-trip
time by using snd_cwnd_cnt to count which ACK this is...
* TODO Understand how this pacing functions....
* Why does it set decr to tp->snd_cwnd_cnt + 1; ? looks like stretch acks..
tcp_cong_avoid [net/ipv4/tcp_input.c - 1699]
--------------------------------------------
Called from tcp_ack to open the congestion window in slow start and
congestion avoidance;
does standard thing of advancing snd_cwnd_cnt for window increase
tp->snd_cwnd [include/net/sock.h - 324]
---------------------------------------
The congestion window is opened in: tcp_cong_avoid [net/ipv4/tcp_input.c]
and is altered for a cwr through the rate halving tcp_cwnd_down and
the completion with tcp_complete_cwr.
Touched in:
+ [net/ipv4/tcp.c] tcp_disconnect, tcp_getsockopt
+ [net/ipv4/tcp_diag.c] tcpdiag_fill
+ [net/ipv4/tcp_input.c]
tcp_update_metrics - called when the TCP connection finishes, to update the route cache etc.
tcp_init_metrics - called at start of connection to get route cache stuff
tcp_enter_loss - called on loss detection
tcp_moderate_cwnd - called to prevent bursts by stretch acks
tcp_cwnd_down - rate halving code
tcp_undo_cwr - undo congestion window change...
tcp_complete_cwr - complete congestion window reduction
tcp_cong_avoid - open window
tcp_may_raise_cwnd - checks that there is no ECE, that we are not in recovery
or CWR, and whether snd_cwnd < snd_ssthresh
tcp_ack - to advance the congestion window through tcp_cong_avoid and
tcp_may_raise_cwnd
tcp_cwnd_application_limited - adjust cwnd if the application can't fill it
tcp_new_space - increases send socket size but checks that the cwnd isn't full
tcp_data_snd_check - try to send a data packet but checks window
+ [net/ipv4/tcp_ipv4.c]
tcp_v4_init_sock - initialises socket (may be overwritten in tcp_init_metrics)
+ [net/ipv4/tcp_minisocks.c]
tcp_create_openreq_child - initialises on listen side
+ [net/ipv4/tcp_output.c]
tcp_cwnd_restart - reset CWND after idle period longer than RTO
tcp_xmit_retransmit_queue - called on retransmit timeout and when
initially retransmitted data is acknowledged
tp->snd_cwnd_cnt [include/net/sock.h - 325]
-------------------------------------------
linear increase counter, also used to do rate halving in
tcp_cwnd_down [net/ipv4/tcp_input.c]
Touched in:
+ [include/net/tcp.h] tcp_enter_cwr - resets for rate halving
+ [net/ipv4/tcp_minisocks.c] tcp_create_openreq_child
+ [net/ipv4/tcp.c] tcp_disconnect
+ [net/ipv4/tcp_input.c] tcp_enter_loss - enters loss state
tcp_cwnd_down - for pacing on acks
tcp_fastretrans_alert - entering TCP_CA_Recovery (dupack?)
tcp_cong_avoid - to increase the window
TCP TRANSMIT ROUTINES
=====================
tcp_transmit_skb [net/ipv4/tcp_output.c - 189]
----------------------------------------------
Actually transmits TCP packets queued by tcp_do_sendmsg(); used for
both initial transmissions and later retransmissions.
SKBs seen here are headerless. This function builds the TCP header and
passes the packet down to IP, which builds its own header; the packet is
then handed to the device.
- checks sysctl flags for timestamps, window_scaling and SACK
- build TCP header and checksum
- sets SYN packets
- sets ECN flags
- clears ack event in socket
- clears data sent in TCB
- increments TCP stats through TCP_INC_STATS(TcpOutSegs)
- calls queue_xmit on the device, presumed to link to dev_queue_xmit
[net/core/dev.c]
- if no error, return; otherwise call tcp_enter_cwr [include/net/tcp.h]
TCP RECEIVE ROUTINES
====================
tcp_rcv_established [net/ipv4/tcp_input.c - 3210]
-------------------------------------------------
Called from tcp_v4_do_rcv [net/ipv4/tcp_ipv4.c] when the socket is in the
established state. Main entry point into the TCP code.
Split into standard fast and slow path...
Slow path:
- check TCP csum
- check PAWS
- check reset against sequence and do reset
- check timestamp
- check for bad syn
- do ack in tcp_ack
- do tcp_urg
- do tcp_data_queue for data path
- tcp_data_snd_check try to send data
- tcp_ack_snd_check try to send ack if needed
tcp_ack [net/ipv4/tcp_input.c - 1896]
-------------------------------------
Deals with incoming ACKs.
Called from:
tcp_rcv_established - on fast path as sender and receiver
- on slow path before urg, data, data_snd, ack_snd
tcp_rcv_synsent_state_process - to initialise the connection
tcp_rcv_state_process - checks acceptability of acks in various states
The flag field can be FLAG_DATA, FLAG_SLOWPATH, 0
In fastpath (i.e. not FLAG_SLOWPATH):
- check the ack is interesting and after snd_una
- sets snd_wl1 to seq of ack recved using tcp_update_wl [include/net/tcp.h]
- sets the FLAG_WIN_UPDATE flag
- increments the TCPHPAcks counter (i.e. fast path acks)
In slowpath:
- does this sequence number contain data
- calls tcp_ack_update_window to update the send window
- calls tcp_sacktag_write_queue for SACK
- calls TCP_ECN_rcv_ecn_echo if ECE
(Merge with slowpath)
- take timestamp in rcv_tstamp
- clean retransmit queue tcp_clean_rtx_queue (this updates the rtt and
sets the rtt timer)
- check for dubious acks with tcp_ack_is_dubious if so:
+ checks if the cwnd can be advanced and calls tcp_cong_avoid
+ tcp_fastretrans_alert to check for dups
- if forward progress calls dst_confirm [include/net/dst.h]
to set neigh->confirmed...
tcp_ack_is_dubious [net/ipv4/tcp_input.c - 1837]
------------------------------------------------
Splits dubious ACKs from those which are expected, e.g. dups etc.
- dubious ACKs are all those which have none of
(FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) set, or which have
(FLAG_DATA_SACKED|FLAG_ECE) set, or which arrive when not in state TCP_CA_Open
Only called from tcp_ack.
tcp_data_queue [net/ipv4/tcp_input.c - 2581]
-------------------------------------------
Called to queue data for the receiver from:
- tcp_rcv_established
- tcp_rcv_state_process
Order of processing:
- Checks ECN CWR with TCP_ECN_accept_cwr
- Does some DSACK processing
- Delivers in sequence packets to the receiver if process running
- otherwise queue it on the receive queue tail
- calls tcp_event_data_recv to schedule an ack, check ECE, grow the
window, and measure the RTT
- checks for fin with tcp_fin
- checks if there is data on the out_of_order_queue
+ if so then call tcp_ofo_queue to place data on receive queue
- check if this data changes the SACK advertisement with tcp_sack_remove
- see if we can re-enter fast path with tcp_fast_path_check
- If not in sequence then see if a false retransmit:
+ if so, log it in STATS(DelayedACKLost)
+ send dsack with tcp_dsack_set
+ enter quick ack with tcp_enter_quickack_mode
+ schedule ack with tcp_schedule_ack
- Here the packet is an out of order packet filling a hole or retrans
after a hole so enter quickack mode with tcp_enter_quickack_mode
- if partial packet (i.e. seq < rcv_nxt < end_seq) then generate tcp_dsack_set
- check the ECN CE
- compare sk->rmem_alloc with sk->rcvbuf and tcp_rmem_schedule
+ if scheduled call tcp_prune_queue to prune receive and ofo queues
(EXPENSIVE)
- disable header prediction
- tcp_schedule_ack
- tcp_set_owner_r: updates the destructor of an skbuff and the sk->rmem_alloc
- If initial out of order then create sack block and out_of_order_queue
- if not then try to place at end of last hole.
- otherwise do a linear search to find the place to insert in the
out_of_order_queue (EXPENSIVE)
- if doing SACK then make a sack block with tcp_sack_new_ofo_skb
tcp_ofo_queue [net/ipv4/tcp_input.c - 2538]
-------------------------------------------
Checks to see if we can put data from the out_of_order_queue into the
receive queue.
tcp_sacktag_write_queue [net/ipv4/tcp_input.c - 760]
----------------------------------------------------
Called in tcp_ack when not a pure forward advance, and also in tcp_ack
when it is an old_ack (presumably for DSACK code).
[NB: TCP_SKB_CB(skb)->ack_seq is set to tp->snd_nxt when a frame is
retransmitted in tcp_retransmit_skb [tcp_output.c] ]
- Zeros fackets_out if sacked_out is zero (why?!)
- for each sack block do:
+ if first block check for D-SACK:
* it is a D-SACK if the start_seq of the SACK block is before the
acknowledgement of the packet or if there is more than one SACK block
and the first SACK block lies within the range of the second SACK block.
* if it is a D-SACK and end_seq is less than or equal to prior_snd_una and
after the undo_marker, then remove a possible undo_retrans
* After checking for D-SACK if the ack is old then jump out.
+ if the end_seq is greater than the high_seq (snd_nxt at onset of
congestion) then there is new congestion so flag it.
+ for retransmit queue do:
* jump out if end_seq is less than or equal to the current skb seq
* increment fack_count (fack_count zeroed for each SACK block loop)
* calculate if the packet lies within the SACK block
* if the SACK is a D-SACK and in the block and has been retransmitted
after the undo_marker then decrement undo_retrans
* if the packet has been acked:
- and it is a retransmission and there is a D-SACK which it lies in
and it was previously sacked then there is reordering (retrans frame
got there later), update the reordering.
- else if not a retransmission, not SACKED, and such that the
fack_count is less than the previous fackets_out, it must be in a hole;
update the reordering.
- jump out to next SACK block
* if the packet not acked and a SACKED retransmission with the SACK
block beyond the packet and no lost_retrans detected then flag
lost_retrans.
* if the packet not in the SACK block then not interesting go
to next packet in retransmit queue.
* if the packet is in the SACK block and not SACK_ACKED then:
- if the packet is a TCPCB_SACKED_RETRANS:
+ and marked as lost, decrement lost_out and retrans_out; the
retransmission is probably still in flight.
- else
+ if new SACK for not retransmitted frame that is in a hole,
it is reordering, update reordering count.
+ if marked as lost then remove TCPCB_LOST and decrement lost_out.
- mark as TCPCB_SACKED_ACKED and flag that FLAG_DATA_SACKED
- increment sacked_out
- update fackets_out if we have more fack_count
* else if the packet is already SACKED_ACKED: if it is a D-SACK and
retransmitted, then update the reordering count.
* D-SACK code to clear TCPCB_SACKED_RETRANS and decrement retrans_out
- after trawling retransmission queue; if a lost_retrans could be possible
check for it if in state TCP_CA_Recovery. Here lost_retrans is set to the
highest end_seq of a SACK block which has an end_seq beyond a retransmitted
packet.
+ for the retrans queue
* if the seq is after lost_retrans then above the highest
retransmitted packet with a SACK block beyond it; no retransmission loss
possible.
* if the packet is acknowledged then continue to next packet
* if the packet is a SACKED_RETRANS and sack block is after it and
there is (FACK enabled or the lost_retrans is below the reordering
threshold beyond this packet) then
- clear TCPCB_SACKED_RETRANS and decrement retrans_out
- if not lost or SACK_ACKED, increment lost_out, move to TCPCB_LOST and
flag data lost.
- resync left out to sacked_out + lost_out
- if the reordering threshold is less than tp->fackets_out and the state is
TCP_CA_Loss update the reordering threshold through tcp_update_reordering.
TCP REORDERING DETECTION
========================
- tcp_sacktag_write_queue uses the SACK blocks to detect reordering
through segments that make it even though we retransmitted things.
- tcp_check_reno_reordering attempts to work out how many holes there are;
used by Reno to emulate SACK for the rest of the code.
- when a partial ack is received, see if there was reordering; if so,
update the segment.
tcp_update_reordering [net/ipv4/tcp_input.c - 686]
--------------------------------------------------
Called from tcp_check_reno_reordering, tcp_sacktag_write_queue, and
tcp_try_undo_partial.
- if the metric passed in is greater than the current amount of reordering:
+ set reordering to the metric, upper-bounded by TCP_MAX_REORDERING
+ work out what detected the reordering
+ disable FACK; it can't be used if there is reordering
TCP CONGESTION WINDOW UNDO
==========================
Two state variables exist in the socket:
__u32 undo_marker;  /* tracking retrans started here. */
int   undo_retrans; /* number of retransmissions we can not undo. */
[NB: the comment in sock.h is WRONG]
All undo functions are performed in tcp_try_undo_XXXX functions:
tcp_try_undo_recovery, tcp_try_undo_dsack, tcp_try_undo_partial,
tcp_try_undo_loss.
All undo functions are called from tcp_fastretrans_alert in two places:
1) if the state is not open and the high_seq is acknowledged then in:
- TCP_CA_Loss: try a tcp_try_undo_recovery
- TCP_CA_Disorder: try a tcp_try_undo_dsack
- TCP_CA_Recovery: try a tcp_try_undo_recovery
2) then again depending on state:
- TCP_CA_Recovery: try tcp_try_undo_partial if there is new data acked
- TCP_CA_Loss: try tcp_try_undo_loss
- TCP_CA_Disorder: try tcp_try_undo_dsack
- tcp_time_to_recover passes then try tcp_try_to_open
tcp_may_undo [net/ipv4/tcp_input.c - 1350]
------------------------------------------
True if the undo marker is valid and (there are no remaining retransmissions
to undo or tcp_packet_delayed holds).
[NB: tcp_packet_delayed(tp) is true if nothing was retransmitted or the
returned timestamp is less than the timestamp of the first retransmission]
tcp_undo_cwr [net/ipv4/tcp_input.c - 1334]
------------------------------------------
The second parameter sets whether ssthresh is set back to prior_ssthresh.
- If there is a prior_ssthresh then
+ set snd_cwnd to max(tp->snd_cwnd, tp->snd_ssthresh << 1)
+ if undo is set then withdraw cwr and set ssthresh to prior_ssthresh
- else
+ set snd_cwnd to max of snd_cwnd or snd_ssthresh
- then tcp_moderate_cwnd to prevent bursting
- set the tp->snd_cwnd_stamp
tcp_try_undo_recovery [net/ipv4/tcp_input.c - 1357]
---------------------------------------------------
Used to undo from TCP_CA_Loss and TCP_CA_Recovery when high_seq is
acknowledged.
- Check if we can undo with tcp_may_undo:
+ call tcp_undo_cwr with ssthresh resetting
+ Increment TCPLossUndo or TCPFullUndo flags
+ set undo_marker to 0
- if the last acknowledgement acknowledges the high_seq and we are reno
+ moderate_cwnd (hold state until something above high_seq acked)
- set tp->ca_state to TCP_CA_Open
tcp_try_undo_dsack [net/ipv4/tcp_input.c - 1383]
------------------------------------------------
Used to undo from a TCP_CA_Disorder state
- Check using undo_marker and tp->undo_retrans flags
+ tcp_undo_cwr(tp,1) and tp->undo_marker = 0
+ Increment TCPDSACKUndo flag
tcp_try_undo_partial [net/ipv4/tcp_input.c - 1395]
--------------------------------------------------
Used when a partial ack is received in TCP_CA_Recovery state
- Use Hoe's retransmit if the fackets_out are greater than reordering
threshold.
- if tcp_may_undo then hole filled with delayed packet rather than retransmit
+ tcp_update_reordering based on fackets and acked, set as TS based
reordering detection
+ tcp_undo_cwr(tp,0) and increment TCPPartialUndo
+ failed = 0
- return failed
tcp_try_undo_loss [net/ipv4/tcp_input.c - 1423]
-----------------------------------------------
Used when a partial ack is received in TCP_CA_Loss state
- Check tcp_may_undo
+ tag all of the retransmit queue as not lost (looks dodgy to me)
+ tp->lost_out = 0; tp->left_out = tp->sacked_out;
+ increment TCPLossUndo
+ set tp->retransmits = 0 and tp->undo_marker = 0
+ if not reno then set state TCP_CA_Open
7 UDP (???)
This section reviews the UDP part of the networking code in the Linux kernel. This is a
significantly simpler piece of code than the TCP part. The absence of reliable delivery and
congestion control allows for a very simple design.
Most of the UDP code is located in one file:
net/ipv4/udp.c
The UDP layer is depicted in Figure 12. When a packet arrives from the IP layer through
ip_local_deliver(), it is passed to udp_rcv() (the equivalent of tcp_v4_rcv() in the TCP
code). udp_rcv() puts the packet in the socket queue for the user application with
sock_queue_rcv_skb(). This is the end of the delivery of the packet.
When the user reads the packet, e.g. with the recvmsg() system call, inet_recvmsg() is
called, which in this case calls udp_recvmsg(), which in turn calls skb_recv_datagram().
skb_recv_datagram() gets the packets from the queue and fills the data structure that will
be read in user space.
*** what happens when the incoming ring buffer is full?
When a packet arrives from the user, the process is simpler. inet_sendmsg() calls
udp_sendmsg(), which builds the UDP datagram with information taken from the sk structure
(this information was put there when the socket was created and bound to the address).
After the UDP datagram is built, it is passed to ip_build_xmit(), which builds the IP
packet with the possible help of ip_build_xmit_slow().
*** what happens when the outgoing (ring) buffer is full?? (in some experiments we have
done, it seems that the user call blocks and no packets are lost at the transmitter –
comments on this mechanism?)
Miguel: Why do UDP and TCP use a different function at the IP layer ?
Once the IP packet has been built, it is passed to ip_output() which, as was seen in section ...,
finalizes the delivery of the packet to the lower layers.
Figure 12 - UDP
8 Conclusion (JP)
To be written
Acknowledgments (all)
We would like to thank Marc Herbert, Richard Hughes-Jones, Éric Lemoine and Yee-Ting Li for
their useful feedback and gratefully acknowledge their help. Part of this research was funded by
the IST Program of the European Union (DataTAG project, grant IST-2001-32459).
Acronyms (JP)
ECN    Explicit Congestion Notification
IP     Internet Protocol
MSS    Maximum Segment Size
NAPI   New Application Programming Interface
NIC    Network Interface Card
PAWS   Protect Against Wrapped Sequence numbers
SCTP   Stream Control Transmission Protocol
TCP    Transmission Control Protocol
UDP    User Datagram Protocol
References (JP)
[1] Linux kernel 2.4.20. Available from The Linux Kernel Archives at:
http://www.kernel.org/pub/linux/kernel/v2.4/patch-2.4.20.bz2
[2] M. Allman, V. Paxson and W. Stevens, RFC 2581, TCP Congestion Control, IETF,
April 1999.
[3] J. Crowcroft and I. Phillips, TCP/IP & Linux Protocol Implementation: Systems
Code for the Linux Internet, Wiley, 2002.
[4] J.H. Salim, R. Olsson and A. Kuznetsov, "Beyond Softnet". In Proc. Linux 2.5
Kernel Developers Summit, San Jose, CA, USA, March 2001. Available at
<http://www.cyberus.ca/~hadi/usenix-paper.tgz>.
[5] J. Cooperstein, Linux Kernel 2.6 – New Features III: Networking. Axian,
January 2003. Available at <http://www.axian.com/pdfs/linux_talk3.pdf>.
[6] W.R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley,
1994.
[7] G.R. Wright and W.R. Stevens, TCP/IP Illustrated, Volume 2: The
Implementation, Addison-Wesley, 1995.
[8] M. Mathis, J. Mahdavi, S. Floyd and A. Romanow, RFC 2018, TCP Selective
Acknowledgment Options, IETF, October 1996.
[9] S. Floyd, J. Mahdavi, M. Mathis and M. Podolsky, RFC 2883, An Extension to
the Selective Acknowledgement (SACK) Option for TCP, IETF, July 2000.
[10] D.P. Bovet and M. Cesati, Understanding the Linux Kernel, 2nd Edition,
O'Reilly, 2002.
[11] A. Rubini and J. Corbet, Linux Device Drivers, 2nd Edition, O'Reilly, 2001.
[12] http://tldp.org/HOWTO/KernelAnalysis-HOWTO-5.html
[13] http://www.netfilter.org/unreliable-guides/kernel-hacking/lk-hacking-guide.html
[14] V. Jacobson, R. Braden and D. Borman, RFC 1323, TCP Extensions for High
Performance, IETF, May 1992.
[15] http://www.netfilter.org/
Biographies (all)
Miguel Rio is a Research Fellow in the Network Systems Centre of Excellence at University
College London, UK, where he works on Performance Evaluation of High Speed Networks in the
DataTAG project. He holds a PhD from the University of Kent at Canterbury, as well as MSc and
BSc degrees from the University of Minho, Portugal. His research interests include
Programmable Networks, Quality of Service, Multicast and Protocols for Reliable Transfers in
High-Speed Networks.
Tom Kelly received a Mathematics degree from the University of Oxford in July 1999. Since
October 2000 he has been studying for a PhD with the Laboratory for Communication
Engineering in the Cambridge University Engineering Department. His research interests include
middleware, networking, distributed systems and computer architecture. He held a
research internship at AT&T's Center for Internet Research at ICSI (now the ICSI
Center for Internet Research) during the autumn of 2001. Tom was a resident IPAM
research fellow for a workshop
on Large Scale Communication Networks held at UCLA during Spring 2002. During winter
2002/03 he worked for CERN on the EU DataTAG project implementing the Scalable TCP
proposal for high-speed wide area data transfer. Throughout his PhD he has been funded through
an Industrial Fellowship from the Royal Commission for the Exhibition of 1851 and AT&T
Research Labs.
Mathieu Goutelle is a Research Engineer in the INRIA RESO team
(http://www.inria.fr/recherche/equipes/reso.en.html) of the LIP Laboratory at ENS Lyon
(http://www.ens-lyon.fr/LIP/). He is a member of the DataTAG Project and currently works on
the behavior of TCP over a DiffServ-enabled gigabit network. In 2002, he graduated as a
generalist engineer (equiv. to an MSc in electrical and mechanical engineering) from Ecole
Centrale de Lyon (http://www.ec-lyon.fr/) in Lyon, France.
Gareth Fairey is a Research Associate in the HEP group at the University of Manchester where
he works on high throughput issues related to Grids. He holds a BA and DPhil from the
University of Oxford, both in Mathematics. The focus of his current research is on achieving
sustained high throughput using TCP across Gigabit Ethernet WANs, including an
implementation of High-Speed TCP as proposed by Sally Floyd.
J.P. Martin-Flatin is Technical Manager of the DataTAG Project at CERN, where he
works on fast network protocols for the gigabit networks that underlie Grids and
on application and network
monitoring. Prior to that, he was a principal MTS with AT&T Labs Research in Florham Park,
NJ, USA, where he worked on large-scale distributed network management, information
modeling and Web-based management. He holds a Ph.D. degree in computer science from the
Swiss Federal Institute of Technology in Lausanne (EPFL). His research interests span distributed
systems, software engineering and IP networking. J.P. is the author of a book, Web-Based
Management of IP Networks and Systems, published in 2002. He is a senior member of the IEEE
and a member of the ACM and GGF. He is a co-chair of the GGF Data Transport Research
Group, a member of the IRTF Network Management Research Group and a member of the
DMTF Architecture Working Group. He was a co-chair of PFLDnet 2003.
Richard Hughes-Jones leads e-science and Trigger and Data Acquisition development in the
Particle Physics group at Manchester University. He has a PhD in Particle Physics and has
worked on Data Acquisition and Network projects for over 20 years, including evaluating and
field-testing OSI transport protocols and products. He is secretary of the Particle Physics
Network Coordinating Group which has the remit to support networking for PPARC funded
researchers. Within the UK GridPP project he is deputy leader of the network workgroup and is
active in the DataGrid networking workpackage (WP7). He is also responsible for the High
Throughput investigations in the UK e-Science MB-NG project to investigate QoS and various
traffic engineering techniques including MPLS. He is a member of the Global Grid Forum and is
co-chair of the Network Measurements Working Group. He was a member of the Program
Committee of the 2003 PFLDnet workshop, and is a member of the UKLIGHT Technical
Committee. His current interests are in the areas of realtime computing and networking including
the performance of transport protocols over LANs, MANs and WANs, network management and
modeling of Gigabit Ethernet components.