Utilizing NIC’s enhancements
A look at how driver software
needs to change when using
newer features of our hardware
‘theory’ versus ‘practice’
• The engineering designs one encounters
in computer hardware components can be
observed to undergo an ‘evolution’ during
successive iterations, from a scheme that
embodies simplicity, purity, and symmetry
at the outset, based upon what designers
think will be the device’s likely uses, to a
conglomeration of disparate ‘add-ons’ as
actual practices dictate accommodations
‘backward compatibility’
• An historically important consideration in
the marketing of computer hardware has
been the need to maintain past functions
in a ‘transparent’ manner – i.e., no change
is needed to run older software on newer
equipment, while offering enhancements
as ‘options’ that can be selectively enabled
Example: Intel’s x86
• The current generation of Intel CPUs will
still execute all of the software written for
PCs a quarter-century ago – based on a
small set of 16-bit registers, a restricted
set of instructions, and a one-megabyte
memory-space – but is able, as an option,
to use more and larger registers (64-bit),
richer instruction-sets, and more memory
Gigabit NICs
• Intel’s network controller designs exhibit
this same kind of ‘evolution’ over time
• The ‘Legacy’ descriptor-formats are just
one example of keeping prior-generation
functionality: it’s simple, it’s ‘pure’ (i.e., not
tied to any specific network-protocols, but
emphasizing ‘mechanism’, not ‘policy’)
• But now alternatives exist -- as options!
‘Legacy’ RX-Descriptors
Descriptor layout (fields listed from the lowest address):
   Base-address (64-bits)
   Packet length | Packet checksum | status | errors | VLAN tag
• The device-driver initializes the ‘base-address’ field
with the physical address of a packet-buffer, and the
network hardware does not ever modify it
• The network controller later will ‘write-back’ values
into all the other fields when it has finished transferring
a received packet’s data into that packet-buffer
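The legacy layout can be written down as a C struct. This is only a sketch: the type name `RX_DESC_LEGACY` and the helper `rx_desc_legacy_init()` are hypothetical, the field order is assumed from the diagram, and the 16-byte size matches the `16 * N_RX_DESC` arithmetic used later in the interrupt handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name; field order assumed from the descriptor diagram. */
typedef struct {
    uint64_t base_address;   /* written by the driver, never changed by the NIC */
    uint16_t packet_length;  /* this and the fields below are written back by the NIC */
    uint16_t packet_chksum;
    uint8_t  desc_status;
    uint8_t  desc_errors;
    uint16_t vlan_tag;
} RX_DESC_LEGACY;

static_assert(sizeof(RX_DESC_LEGACY) == 16, "a descriptor occupies 16 bytes");
static_assert(offsetof(RX_DESC_LEGACY, desc_status) == 12,
              "the status byte follows the length and checksum");

/* Prepare a descriptor for the NIC: record the buffer's physical address
   and clear the write-back fields so stale status is not mistaken for new. */
static void rx_desc_legacy_init(RX_DESC_LEGACY *d, uint64_t buffer_phys)
{
    d->base_address  = buffer_phys;
    d->packet_length = 0;
    d->packet_chksum = 0;
    d->desc_status   = 0;
    d->desc_errors   = 0;
    d->vlan_tag      = 0;
}
```

Because the NIC leaves `base_address` alone in this format, a driver can read the buffer address back out of the descriptor at any time.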
RxDesc Status-field
bit:    7     6      5       4     3     2     1    0
       PIF  IPCS  TCPCS  UDPCS    VP  IXSM   EOP   DD
DD = Descriptor Done (1=yes, 0=no) shows if NIC is finished with descriptor
EOP = End Of Packet (1=yes, 0=no) shows if this descriptor is the packet’s last
IXSM = Ignore Checksum Indications (1=yes, 0=no)
VP = VLAN Packet match (1=yes, 0=no)
UDPCS = UDP Checksum calculated on packet (1=yes, 0=no)
TCPCS = TCP Checksum calculated on packet (1=yes, 0=no)
IPCS = IPv4 Checksum calculated on packet (1=yes, 0=no)
PIF = Passed In-exact Filter (1=yes, 0=no) shows if software must check
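A driver usually tests these flags with bit masks. The sketch below assumes the bit positions listed above; the `RXD_STAT_*` names and both helper functions are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks for the legacy Rx status byte (hypothetical names). */
enum {
    RXD_STAT_DD    = 1 << 0,  /* Descriptor Done */
    RXD_STAT_EOP   = 1 << 1,  /* End Of Packet */
    RXD_STAT_IXSM  = 1 << 2,  /* Ignore Checksum Indications */
    RXD_STAT_VP    = 1 << 3,  /* VLAN Packet match */
    RXD_STAT_UDPCS = 1 << 4,  /* UDP Checksum calculated */
    RXD_STAT_TCPCS = 1 << 5,  /* TCP Checksum calculated */
    RXD_STAT_IPCS  = 1 << 6,  /* IPv4 Checksum calculated */
    RXD_STAT_PIF   = 1 << 7   /* Passed In-exact Filter */
};

static_assert(RXD_STAT_PIF == 0x80, "the eight flags fill exactly one byte");

/* Nonzero when the NIC is done with the descriptor and it ends a packet. */
static int rx_packet_complete(uint8_t status)
{
    uint8_t need = RXD_STAT_DD | RXD_STAT_EOP;
    return (status & need) == need;
}

/* Nonzero when the TCP checksum flag may be used: IXSM must be clear. */
static int rx_tcp_checksum_checked(uint8_t status)
{
    return !(status & RXD_STAT_IXSM) && (status & RXD_STAT_TCPCS);
}
```

Checking IXSM first reflects its meaning: when it is set, the checksum indications are to be ignored.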
RxDesc Errors-field
bit:    7     6     5       4          3        2    1    0
       RXE   IPE  TCPE  reserved  reserved   SEQ   SE   CE
                          (=0)      (=0)
CE = CRC Error or Alignment Error (check statistics registers to differentiate)
TCPE = TCP/UDP Checksum Error
IPE = IPv4 Checksum Error
RXE = Rx Data Error
These bits are relevant only while NIC is operating in ‘SerDes’ mode:
SE = Symbol Error
SEQ = Sequence Error
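Since SE and SEQ only matter in ‘SerDes’ mode, error checking can mask them out otherwise. The following is a sketch under that reading of the slide; the `RXD_ERR_*` names and the `serdes_mode` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks for the legacy Rx errors byte (hypothetical names). */
enum {
    RXD_ERR_CE   = 1 << 0,  /* CRC Error or Alignment Error */
    RXD_ERR_SE   = 1 << 1,  /* Symbol Error (SerDes mode only) */
    RXD_ERR_SEQ  = 1 << 2,  /* Sequence Error (SerDes mode only) */
    RXD_ERR_TCPE = 1 << 5,  /* TCP/UDP Checksum Error */
    RXD_ERR_IPE  = 1 << 6,  /* IPv4 Checksum Error */
    RXD_ERR_RXE  = 1 << 7   /* Rx Data Error */
};

static_assert((RXD_ERR_CE | RXD_ERR_SE | RXD_ERR_SEQ |
               RXD_ERR_TCPE | RXD_ERR_IPE | RXD_ERR_RXE) == 0xE7,
              "bits 3 and 4 are reserved");

/* Return only the error bits that are meaningful in the current mode. */
static uint8_t rx_relevant_errors(uint8_t errors, int serdes_mode)
{
    uint8_t mask = RXD_ERR_CE | RXD_ERR_TCPE | RXD_ERR_IPE | RXD_ERR_RXE;

    if (serdes_mode)
        mask |= RXD_ERR_SE | RXD_ERR_SEQ;
    return errors & mask;
}
```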
‘Extended’ RX-Descriptors
CPU writes this, NIC reads it:
   Base-address (64-bits)
   reserved (=0)
NIC writes this, CPU reads it (fields listed from the lowest address):
   MRQ (multiple receive queues) | IP identification | Packet checksum
   Extended status | Extended errors | Packet length | VLAN tag
• The device-driver initializes the ‘base-address’ field
with the physical address of a packet-buffer, and
it initializes the ‘reserved’ field with a zero-value…
… the network hardware will later modify both fields
• The network controller will ‘write-back’ the
values for these fields when it has transferred
a received packet’s data into the packet-buffer
An alternative option
CPU writes this, NIC reads it:
   Base-address (64-bits)
   reserved (=0)
NIC writes this, CPU reads it (fields listed from the lowest address):
   MRQ (multiple receive queues) | RSS Hash (Receive Side Scaling)
   Extended status | Extended errors | Packet length | VLAN tag
• ‘Receive Side Scaling’ refers to an optional capability in the
network controller to assist with routing of network packets
to various CPUs within a modern multiprocessor system
(See Section 3.2.13 in Intel’s Software Developer’s Manual)
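The two write-back variants differ only in what follows the MRQ field, so one way to model them is a single struct with an anonymous union. This is a sketch, not the course's own typedef: the name `RX_DESC_WRITEBACK` and the helper `rx_flow_key()` are hypothetical, C11 anonymous members are assumed, and the bitfield packing relies on the compiler (checked here only by size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical combined view of both write-back variants. */
typedef struct {
    uint32_t mrq;                        /* multiple receive queues */
    union {
        struct {                         /* variant without RSS */
            uint16_t ip_identification;
            uint16_t packet_chksum;
        };
        uint32_t rss_hash;               /* variant with Receive Side Scaling */
    };
    uint32_t desc_status : 20;
    uint32_t desc_errors : 12;
    uint16_t packet_length;
    uint16_t vlan_tag;
} RX_DESC_WRITEBACK;

static_assert(sizeof(RX_DESC_WRITEBACK) == 16, "still a 16-byte descriptor");
static_assert(offsetof(RX_DESC_WRITEBACK, rss_hash) == 4,
              "the RSS hash overlays the IP-identification/checksum pair");

/* Return the per-packet value the driver should look at; whether RSS is
   in use is assumed to be known from how the driver configured the NIC. */
static uint32_t rx_flow_key(const RX_DESC_WRITEBACK *d, int rss_enabled)
{
    return rss_enabled ? d->rss_hash : d->ip_identification;
}
```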
Extended Rx-Status (20-bits)
bits 19..16   bit 15   bits 14..11   bit 10   bit 9   bit 8
     0         ACK          0         UDPV    IPIV      0

bit:    7     6      5       4     3     2     1    0
       PIF  IPCS  TCPCS  UDPCS    VP  IXSM   EOP   DD

These ‘extra’ status-bits provide
additional hardware support to
driver software for processing
ethernet packets that conform to
standard TCP/IP network protocols
(with possibilities for future expansion)
ACK = TCP ACK-Packet identification
UDPV = Valid UDP checksum
IPIV = Valid IP Identification
Bits 7..0 have the
same meanings as in a
‘Legacy’ Rx-Status byte
DD = Descriptor Done
EOP = End Of Packet
IXSM = Ignore Checksum Indications
VP = VLAN Packet match
UDPCS = UDP Checksum calculated
TCPCS = TCP Checksum calculated
IPCS = IPv4 Checksum calculated
PIF = Passed In-exact Filter
Extended Rx-Errors (12 bits)
bit:   11    10     9    8   7    6    5    4   3   2   1   0
      RXE   IPE  TCPE    0   0  SEQ   SE   CE   0   0   0   0
Bits 11..4 have the same meanings,
and they occupy the same arrangement,
as in the ‘Legacy’ Rx-Errors byte
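The 20 status bits and 12 error bits together fill one 32-bit word, so they can also be separated with shifts and masks. The sketch below assumes the status bits sit in the low 20 bits of that word (as the ‘bitfield.c’ comments later suggest); all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RXD_EXT_STATUS_BITS 20
#define RXD_EXT_ERRORS_BITS 12

static_assert(RXD_EXT_STATUS_BITS + RXD_EXT_ERRORS_BITS == 32,
              "status and errors share one 32-bit word");

/* Low 20 bits: the extended status. */
static uint32_t rxd_ext_status(uint32_t staterr)
{
    return staterr & ((1u << RXD_EXT_STATUS_BITS) - 1);
}

/* High 12 bits: the extended errors. */
static uint32_t rxd_ext_errors(uint32_t staterr)
{
    return staterr >> RXD_EXT_STATUS_BITS;
}

/* Status bits 7..0 match the legacy status byte. */
static uint8_t rxd_legacy_status(uint32_t staterr)
{
    return (uint8_t)(rxd_ext_status(staterr) & 0xFF);
}

/* Error bits 11..4 match the legacy errors byte. */
static uint8_t rxd_legacy_errors(uint32_t staterr)
{
    return (uint8_t)(rxd_ext_errors(staterr) >> 4);
}
```

With these helpers, code written against the legacy status and errors bytes can keep its existing bit masks.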
Main device-driver changes
• If we want to utilize the NIC’s ‘Extended’
Receive Descriptor format, we will need
several significant changes in our driver
source-code and data-types:
• Our module’s initialization of ‘base_address’ fields
• Our new need for programming register RFCTL
• Our ‘typedef’ for the ‘RX_DESCRIPTOR’ structs
• Our ‘get_info_rx()’ function for ‘/proc/nicrx’ display
• Our interrupt-handler’s treatment of ‘rxring’ entries
Use of C language ‘union’
• Each Receive-Descriptor now has a ‘dual’
identity, as far as the NIC is concerned:
– one layout during its ‘fetch’ from memory
– another layout during ‘write-back’ to memory
• The C language provides a special ‘type’
construction for accommodating this kind
of programming situation: it’s known as a
‘union’, and it requires a special syntax
‘Bitfields’ in C
• Some of the fields in the ‘Extended’ RX
Descriptor do not align with the CPU’s
natural 8-bit, 16-bit and 32-bit data-sizes:
   Extended status (20-bits) | Extended errors (12-bits)
• The C language provides ‘bitfields’ for a
situation like this (but the way bitfields get
laid out in memory is left to each compiler)
Syntax for Rx-Descriptors
typedef struct
{
    unsigned long long  base_address;
    unsigned long long  reserved;
} RX_DESC_FETCH;

typedef struct
{
    unsigned int    mrq;
    unsigned short  ip_identification;
    unsigned short  packet_chksum;
    unsigned int    desc_status:20;
    unsigned int    desc_errors:12;
    unsigned short  packet_length;
    unsigned short  vlan_tag;
} RX_DESC_STORE;

typedef union
{
    RX_DESC_FETCH   rxf;
    RX_DESC_STORE   rxs;
} RX_DESCRIPTOR;
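A small experiment makes the ‘dual identity’ concrete: both views share the same 16 bytes, so a write-back destroys the buffer address. In this sketch the typedefs repeat the ones on the slide, while `simulate_write_back()` is a hypothetical stand-in for what the NIC does; the size checks assume a compiler that packs the two bitfields into one 32-bit word, as gcc does.

```c
#include <assert.h>

typedef struct {
    unsigned long long base_address;
    unsigned long long reserved;
} RX_DESC_FETCH;

typedef struct {
    unsigned int   mrq;
    unsigned short ip_identification;
    unsigned short packet_chksum;
    unsigned int   desc_status:20;
    unsigned int   desc_errors:12;
    unsigned short packet_length;
    unsigned short vlan_tag;
} RX_DESC_STORE;

typedef union {
    RX_DESC_FETCH rxf;   /* layout the NIC fetches */
    RX_DESC_STORE rxs;   /* layout the NIC writes back */
} RX_DESCRIPTOR;

static_assert(sizeof(RX_DESC_FETCH) == 16, "fetch view is 16 bytes");
static_assert(sizeof(RX_DESC_STORE) == 16, "store view is 16 bytes");
static_assert(sizeof(RX_DESCRIPTOR) == 16, "the union adds no extra space");

/* Hypothetical stand-in for the NIC's write-back of a received packet. */
static void simulate_write_back(RX_DESCRIPTOR *d, unsigned short length)
{
    d->rxs.mrq = 0;
    d->rxs.ip_identification = 0x1234;
    d->rxs.packet_chksum = 0;
    d->rxs.desc_status = 0x3;        /* DD and EOP */
    d->rxs.desc_errors = 0;
    d->rxs.packet_length = length;
    d->rxs.vlan_tag = 0;
}
```

After the simulated write-back, `d.rxf.base_address` no longer holds the value the driver stored there.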
RFCTL (0x5008)
The Receive Filter Control register
bits 31..16   reserved (=0)
bit  15       EXTEN
bit  14       IPFRSP_DIS
bit  13       ACKD_DIS
bit  12       ACKDIS
bit  11       IPv6XSUM_DIS
bit  10       IPv6_DIS
bits 9..8     NFS_VER
bit  7        NFSR_DIS
bit  6        NFSW_DIS
bits 5..1     iSCSI_DWC
bit  0        iSCSI_DIS
EXTEN (bit 15) = Extended Status Enable (1=yes, 0=no)
This enables the NIC to write-back the ‘Extended Status’
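Enabling the extended format comes down to setting bit 15 while leaving the other RFCTL bits alone. The sketch below keeps the bit manipulation in a plain function so it can be tried outside the kernel; the macro names follow the `E1000_ICR` / `E1000_RDT` pattern used in these slides but are assumptions, as is the helper's name.

```c
#include <assert.h>
#include <stdint.h>

#define E1000_RFCTL        0x5008       /* register offset shown on the slide */
#define E1000_RFCTL_EXTEN  (1u << 15)   /* Extended Status Enable */

static_assert(E1000_RFCTL_EXTEN == 0x8000u, "EXTEN is bit 15");

/* Return the RFCTL value with EXTEN set and all other bits unchanged. */
static uint32_t rfctl_with_extended_status(uint32_t rfctl)
{
    return rfctl | E1000_RFCTL_EXTEN;
}

/*
 * Inside the driver this would wrap a read-modify-write of the register,
 * along the lines of:
 *
 *     iowrite32( rfctl_with_extended_status( ioread32( io + E1000_RFCTL ) ),
 *                io + E1000_RFCTL );
 *
 * where 'io' is the mapped register base used elsewhere in the driver.
 */
```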
Modifying ‘my_read()’
• To implement use of ‘Extended’ Receive
Descriptors in our most recent character-mode
device-driver (i.e., ‘zerocopy.c’), we
need some changes in the ‘read()’ method
• Most obvious example: a packet-buffer’s
memory address can no longer be obtained
from an Rx-Descriptor’s ‘base_address’
(which now gets ‘overwritten’ by the NIC)
For our pseudo-file’s sake…
• Also our driver’s ‘read()’ function shouldn’t
prepare a current rx-descriptor for reuse,
as it did in earlier drivers, since that would
destroy all of the useful information which
the NIC has just written into that descriptor
• Instead, the preparation of a descriptor for
reuse in a future packet-receive operation
should be deferred, at least temporarily
OK, but then when?
• We can reassign the duty to ‘refresh’ some
Rx-Descriptors for reuse to our driver’s
Interrupt Service Routine; specifically, at
the point in time when an ‘RXDMT0’ event
is signaled (Rx-Descriptor Min-Threshold)
• It might be best to create a ‘bottom half’ to
take care of those re-initializations, but we
haven’t yet done that in our new prototype
Handling ‘RXDMT0’ interrupts
irqreturn_t my_isr( int irq, void *dev_id )
{
    int  intr_cause = ioread32( io + E1000_ICR );

    if ( intr_cause & (1<<4) )   // RXDMT0: Rx-Descriptors Low
    {
        // the packet-buffers follow the descriptor-ring in memory
        unsigned int  rx_buf = virt_to_phys( rxring ) + 16 * N_RX_DESC;
        unsigned int  rxtail = ioread32( io + E1000_RDT ), i, ba;

        // prepare the next eight Rx-Descriptors for ‘reuse’ by the NIC
        for (i = 0; i < 8; i++)
        {
            ba = rx_buf + rxtail * RX_BUFSIZ;
            // use the ‘fetch’ view of the RX_DESCRIPTOR union
            rxring[ rxtail ].rxf.base_address = ba;
            rxring[ rxtail ].rxf.reserved = 0LL;
            rxtail = (1 + rxtail) % N_RX_DESC;
        }
        // now give the NIC ‘ownership’ of these reinitialized descriptors
        iowrite32( rxtail, io + E1000_RDT );
    }
    return IRQ_HANDLED;
}
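The refresh loop can be tried in user space by leaving out the register accesses. This is a sketch only: `refresh_rx_descriptors()` is a hypothetical helper, and the values chosen for `N_RX_DESC` and `RX_BUFSIZ` are assumptions rather than the driver's actual settings.

```c
#include <assert.h>

#define N_RX_DESC  16     /* assumed ring size */
#define RX_BUFSIZ  2048   /* assumed size of each packet-buffer */

typedef struct {
    unsigned long long base_address;
    unsigned long long reserved;
} RX_DESC_FETCH;

static_assert(sizeof(RX_DESC_FETCH) == 16, "matches the 16-byte descriptor size");

/* Re-arm 'count' descriptors starting at index 'tail', wrapping at the end
   of the ring, and return the new tail value to hand back to the NIC. */
static unsigned int refresh_rx_descriptors(RX_DESC_FETCH *ring,
                                           unsigned long long rx_buf,
                                           unsigned int tail,
                                           unsigned int count)
{
    unsigned int i;

    for (i = 0; i < count; i++) {
        /* the buffer address comes from the ring index, not from the
           descriptor, whose old contents the NIC has overwritten */
        ring[tail].base_address = rx_buf + (unsigned long long)tail * RX_BUFSIZ;
        ring[tail].reserved = 0;
        tail = (tail + 1) % N_RX_DESC;
    }
    return tail;
}
```

Starting from tail 12 in a 16-entry ring, eight refreshes cover entries 12 through 15 and 0 through 3, so the new tail is 4.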
‘extended.c’
• Here’s our revision of ‘zerocopy.c’, aimed
at showing how we can incorporate use of
the NIC’s ‘Extended’ Receive Descriptors
• It appears to function exactly as before,
until a user attempts to view the driver’s
Receive-Descriptor queue:
$ cat /proc/nicrx
• Then we are shown descriptors having two
distinct formats (i.e., FETCH and STORE)
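A display routine has to decide which of the two formats each descriptor currently holds. One possible approach, sketched below, is to test the DD bit: this assumes, as the typedefs suggest, that the status bits overlay the zeroed ‘reserved’ field, so DD reads as 0 until the NIC writes the descriptor back. The helper names and output strings are hypothetical and are not taken from ‘extended.c’.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    unsigned long long base_address;
    unsigned long long reserved;
} RX_DESC_FETCH;

typedef struct {
    unsigned int   mrq;
    unsigned short ip_identification;
    unsigned short packet_chksum;
    unsigned int   desc_status:20;
    unsigned int   desc_errors:12;
    unsigned short packet_length;
    unsigned short vlan_tag;
} RX_DESC_STORE;

typedef union {
    RX_DESC_FETCH rxf;
    RX_DESC_STORE rxs;
} RX_DESCRIPTOR;

static_assert(sizeof(RX_DESCRIPTOR) == 16, "both views share 16 bytes");

/* Nonzero once the NIC has written the descriptor back (DD bit set). */
static int rx_desc_written_back(const RX_DESCRIPTOR *d)
{
    return d->rxs.desc_status & 1;
}

/* Format one descriptor in whichever layout it currently holds. */
static int format_rx_desc(char *buf, size_t len, const RX_DESCRIPTOR *d)
{
    if (rx_desc_written_back(d))
        return snprintf(buf, len, "STORE status=%05X errors=%03X length=%u",
                        (unsigned)d->rxs.desc_status,
                        (unsigned)d->rxs.desc_errors,
                        (unsigned)d->rxs.packet_length);
    return snprintf(buf, len, "FETCH base_address=%016llX",
                    d->rxf.base_address);
}
```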
Demo: ‘bitfield.c’
• Because the manner in which ‘bitfields’ are
handled in the C language varies with the
particular C-compiler being used, we have
created a short demo-program that shows
us how our GNU C-compiler ‘gcc’ handles
the layout of bitfields within a C data-item
typedef struct
{
    unsigned int  desc_status:20;  // bits 0..19
    unsigned int  desc_errors:12;  // bits 20..31
} RXD_ELT;
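One way to check the layout, in the spirit of such a demo, is to copy the struct's bytes into a 32-bit integer and look at which bits were set. This sketch is not the course's ‘bitfield.c’; the helper `rxd_elt_raw()` is hypothetical, and the results noted afterwards assume gcc on a little-endian machine such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned int desc_status:20;  // bits 0..19
    unsigned int desc_errors:12;  // bits 20..31
} RXD_ELT;

static_assert(sizeof(RXD_ELT) == sizeof(uint32_t),
              "both bitfields share one 32-bit word");

/* Store the two values through the bitfields, then return the raw word. */
static uint32_t rxd_elt_raw(unsigned int status, unsigned int errors)
{
    RXD_ELT  elt;
    uint32_t raw;

    elt.desc_status = status;
    elt.desc_errors = errors;
    memcpy(&raw, &elt, sizeof raw);   /* inspect the bytes without casts */
    return raw;
}
```

With gcc on x86, `rxd_elt_raw(0xFFFFF, 0)` yields `0x000FFFFF` and `rxd_elt_raw(0, 0xFFF)` yields `0xFFF00000`, which agrees with the bit ranges in the comments; another compiler or a big-endian target may arrange the fields differently.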