Utilizing NIC’s enhancements
A look at how driver software needs to change when using newer features of our hardware

‘theory’ versus ‘practice’
• The engineering designs one encounters in computer hardware components tend to ‘evolve’ over successive iterations: they begin as a scheme that embodies simplicity, purity, and symmetry, based upon what designers think will be the device’s likely uses, and end up as a conglomeration of disparate ‘add-ons’ as actual practice dictates accommodations

‘backward compatibility’
• A historically important consideration in the marketing of computer hardware has been the need to maintain past functions in a ‘transparent’ manner (i.e., no change is needed to run older software on newer equipment), while offering enhancements as ‘options’ that can be selectively enabled

Example: Intel’s x86
• The current generation of Intel CPUs will still execute all of the software written for PCs a quarter-century ago (based on a small set of 16-bit registers, a restricted set of instructions, and a one-megabyte memory-space), but is able, as an option, to use more and larger registers (64 bits), richer instruction-sets, and more memory

Gigabit NICs
• Intel’s network controller designs exhibit this same kind of ‘evolution’ over time
• The ‘Legacy’ descriptor-format is just one example of keeping prior-generation functionality: it’s simple, and it’s ‘pure’ (i.e., not tied to any specific network-protocols, emphasizing ‘mechanism’, not ‘policy’)
• But now alternatives exist, as options!

‘Legacy’ RX-Descriptors
    Base-address (64 bits)
    Packet length | Packet checksum | status | errors | VLAN tag
• The device-driver initializes the ‘base-address’ field with the physical address of a packet-buffer, and the network hardware does not ever modify it
• The network controller later will ‘write-back’ values into all the other fields when it has finished transferring a received packet’s data into that packet-buffer

RxDesc Status-field
      7      6       5       4      3     2      1     0
     PIF   IPCS   TCPCS   UDPCS    VP   IXSM   EOP    DD

    DD    = Descriptor Done (1=yes, 0=no) shows if the NIC is finished with the descriptor
    EOP   = End Of Packet (1=yes, 0=no) shows if this descriptor is logically the packet’s last
    IXSM  = Ignore Checksum Indications (1=yes, 0=no)
    VP    = VLAN Packet match (1=yes, 0=no)
    UDPCS = UDP Checksum calculated on packet (1=yes, 0=no)
    TCPCS = TCP Checksum calculated on packet (1=yes, 0=no)
    IPCS  = IPv4 Checksum calculated on packet (1=yes, 0=no)
    PIF   = Passed In-exact Filter (1=yes, 0=no) shows if software must check

RxDesc Errors-field
      7      6      5        4          3        2     1     0
     RXE    IPE   TCPE   reserved   reserved    SEQ    SE    CE
                           (=0)       (=0)

    CE   = CRC Error or Alignment Error (check statistics registers to differentiate)
    TCPE = TCP/UDP Checksum Error
    IPE  = IPv4 Checksum Error
  These bits are relevant only while the NIC is operating in ‘SerDes’ mode:
    SE   = Symbol Error
    SEQ  = Sequence Error
    RXE  = Rx Data Error

‘Extended’ RX-Descriptors
  CPU writes this, NIC reads it:
    Base-address (64 bits)
    reserved (=0)
  NIC writes this, CPU reads it:
    MRQ (multiple receive queues) | IP identification | Packet checksum
    Extended status | Extended errors | Packet length | VLAN tag
• The device-driver initializes the ‘base-address’ field with the physical address of a packet-buffer, and it initializes the ‘reserved’ field with a zero-value; the network hardware will later modify both fields
• The network controller will ‘write-back’ the values for its fields when it has transferred a received packet’s data into the packet-buffer

An alternative option
  CPU writes this, NIC reads it:
    Base-address (64 bits)
    reserved (=0)
  NIC writes this, CPU reads it:
    MRQ (multiple receive queues) | RSS Hash (Receive Side Scaling)
    Extended status | Extended errors | Packet length | VLAN tag
• ‘Receive Side Scaling’ refers to an optional capability in the network controller to assist with routing of network packets to various CPUs within a modern multiprocessor system (See Section 3.2.13 in Intel’s Software Developer’s Manual)

Extended Rx-Status (20 bits)
    bits 19..16   0
    bit  15       ACK
    bits 14..11   0
    bit  10       UDPV
    bit   9       IPIV
    bit   8       0
    bits  7..0    PIF  IPCS  TCPCS  UDPCS  VP  IXSM  EOP  DD

• The ‘extra’ status-bits provide additional hardware support to driver software for processing ethernet packets that conform to standard TCP/IP network protocols (with possibilities for future expansion)
    ACK  = TCP ACK-Packet identification
    UDPV = Valid UDP checksum
    IPIV = Valid IP Identification
• The low eight bits have the same meanings as in a ‘Legacy’ Rx-Status byte
    DD    = Descriptor Done
    EOP   = End Of Packet
    IXSM  = Ignore Checksum Indications
    VP    = VLAN Packet match
    UDPCS = UDP Checksum calculated
    TCPCS = TCP Checksum calculated
    IPCS  = IPv4 Checksum calculated
    PIF   = Passed In-exact Filter

Extended Rx-Errors (12 bits)
     11    10     9     8    7     6     5    4    3    2    1    0
    RXE   IPE   TCPE    0    0    SEQ   SE   CE    0    0    0    0

• Bits 11..4 have the same meanings, and they occupy the same arrangement, as in the ‘Legacy’ Rx-Errors byte

Main device-driver changes
• If we want to utilize the NIC’s ‘Extended’ Receive Descriptor format, we will need several significant changes in our driver source-code and data-types:
  – Our module’s initialization of ‘base_address’ fields
  – Our new need for programming register RFCTL
  – Our ‘typedef’ for the ‘RX_DESCRIPTOR’ structs
  – Our ‘get_info_rx()’ function for ‘/proc/nicrx’ display
  – Our interrupt-handler’s treatment of ‘rxring’ entries

Use of C language ‘union’
• Each Receive-Descriptor now has a ‘dual’ identity, as far as the NIC is concerned:
  – one layout during its ‘fetch’ from memory
  – another layout during ‘write-back’ to memory
• The C language provides a special ‘type’ construction for accommodating this kind of programming situation: it’s known as a union, and it requires a special syntax

‘Bitfields’ in C
• Some of the fields in the ‘Extended’ RX Descriptor do not align with the CPU’s natural 8-bit, 16-bit and 32-bit data-sizes
    Extended status   20 bits
    Extended errors   12 bits
• The C language provides ‘bitfields’ for a situation like this (but their layout in memory is not ‘standardized’; it is left to each compiler)

Syntax for Rx-Descriptors

    typedef struct {
        unsigned long long  base_address;
        unsigned long long  reserved;
    } RX_DESC_FETCH;

    typedef struct {
        unsigned int    mrq;
        unsigned short  ip_identification;
        unsigned short  packet_chksum;
        unsigned int    desc_status:20;
        unsigned int    desc_errors:12;
        unsigned short  packet_length;
        unsigned short  vlan_tag;
    } RX_DESC_STORE;

    typedef union {
        RX_DESC_FETCH   rxf;
        RX_DESC_STORE   rxs;
    } RX_DESCRIPTOR;

RFCTL (0x5008)
The Receive Filter Control register
    bits 31..16   reserved (=0)
    bit  15       EXTEN
    bit  14       IPFRSP_DIS
    bit  13       ACKD_DIS
    bit  12       ACKDIS
    bit  11       IPv6XSUM_DIS
    bit  10       IPv6_DIS
    bits  9..8    NFS_VER
    bit   7       NFSR_DIS
    bit   6       NFSW_DIS
    bits  5..1    iSCSI_DWC
    bit   0       iSCSI_DIS

    EXTEN (bit 15) = Extended Status Enable (1=yes, 0=no)
    This enables the NIC to write-back the ‘Extended Status’

Modifying ‘my_read()’
• To implement use of ‘Extended’ Receive Descriptors in our most recent character-mode device-driver (i.e., ‘zerocopy.c’), we need some changes in the ‘read()’ method
• Most obvious example: a packet-buffer’s memory address can no longer be gotten from an Rx-Descriptor’s ‘base_address’ (which now gets ‘overwritten’ by the NIC)

For our pseudo-file’s sake…
• Also our driver’s ‘read()’ function shouldn’t prepare a current rx-descriptor for reuse, as it did in earlier drivers, since that would destroy all of the useful information which the NIC has just written into that descriptor
• Instead, the preparation of a descriptor for reuse in a future packet-receive operation should be deferred, at least temporarily

OK, but then when?
• We can reassign the duty to ‘refresh’ some Rx-Descriptors for reuse to our driver’s Interrupt Service Routine; specifically, at the point in time when an ‘RXDMT0’ event is signaled (Rx-Descriptor Min-Threshold)
• It might be best to create a ‘bottom half’ to take care of those re-initializations, but we haven’t yet done that in our new prototype

Handling ‘RXDMT0’ interrupts

    irqreturn_t my_isr( int irq, void *dev_id )
    {
        int intr_cause = ioread32( io + E1000_ICR );

        if ( intr_cause & (1<<4) )  // RXDMT0: Rx-Descriptors Low
        {
            // the packet-buffers begin just past the ring of 16-byte descriptors
            // (64-bit variables, so a physical address cannot get truncated)
            unsigned long long rx_buf = virt_to_phys( rxring ) + 16 * N_RX_DESC;
            unsigned long long ba;
            unsigned int rxtail = ioread32( io + E1000_RDT ), i;

            // prepare the next eight Rx-Descriptors for ‘reuse’ by the NIC
            // (fields are accessed through the union’s ‘fetch’ layout)
            for (i = 0; i < 8; i++)
            {
                ba = rx_buf + rxtail * RX_BUFSIZ;
                rxring[ rxtail ].rxf.base_address = ba;
                rxring[ rxtail ].rxf.reserved = 0LL;
                rxtail = (1 + rxtail) % N_RX_DESC;
            }

            // now give the NIC ‘ownership’ of these reinitialized descriptors
            iowrite32( rxtail, io + E1000_RDT );
        }

        return IRQ_HANDLED;
    }

‘extended.c’
• Here’s our revision of ‘zerocopy.c’, aimed at showing how we can incorporate use of the NIC’s ‘Extended’ Receive Descriptors
• It appears to function exactly as before, until a user attempts to view the driver’s Receive-Descriptor queue:
    $ cat /proc/nicrx
• Then we are shown descriptors having two distinct formats (i.e., FETCH and STORE)

Demo: ‘bitfield.c’
• Because the manner in which ‘bitfields’ are handled in the C language varies with the particular C-compiler being used, we have created a short demo-program that shows us how our GNU C-compiler ‘gcc’ handles the layout of bitfields within a C data-item

    typedef struct {
        unsigned int    desc_status:20;     // bits 0..19
        unsigned int    desc_errors:12;     // bits 20..31
    } RXD_ELT;
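
A minimal sketch of what such a layout check could look like, assuming a user-space build with ‘gcc’. It is not the course’s actual ‘bitfield.c’. The helper names ‘rxd_image()’, ‘rxd_status()’, ‘rxd_errors()’ and ‘show_layout()’ are hypothetical, introduced only for illustration. The idea is to store known values in the two bitfields, copy the struct’s bytes into a 32-bit integer, and see where each field landed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same two bitfields as the RXD_ELT typedef on the slide. */
typedef struct {
    unsigned int desc_status:20;    /* intended for bits 0..19  */
    unsigned int desc_errors:12;    /* intended for bits 20..31 */
} RXD_ELT;

/* Hypothetical helper: return the raw 32-bit image that the compiler
   builds when the two bitfields are given the specified values. */
uint32_t rxd_image(unsigned int status, unsigned int errors)
{
    RXD_ELT elt;
    uint32_t raw;

    /* this sketch assumes both bitfields share one 32-bit storage unit */
    assert(sizeof elt == sizeof raw);

    memset(&elt, 0, sizeof elt);
    elt.desc_status = status;
    elt.desc_errors = errors;
    memcpy(&raw, &elt, sizeof raw);     /* copy the bytes, no pointer casts */
    return raw;
}

/* Hypothetical helpers: extract the same fields using masks and shifts,
   which does not depend on how the compiler arranges its bitfields. */
unsigned int rxd_status(uint32_t raw) { return raw & 0xFFFFFu; }
unsigned int rxd_errors(uint32_t raw) { return (raw >> 20) & 0xFFFu; }

/* Hypothetical helper: print the images produced when each field, in turn,
   is filled with all ones, so its position within the dword is visible. */
void show_layout(void)
{
    printf("sizeof( RXD_ELT ) = %zu bytes\n", sizeof(RXD_ELT));
    printf("status=0xFFFFF errors=0x000 -> image 0x%08X\n",
           (unsigned)rxd_image(0xFFFFF, 0));
    printf("status=0x00000 errors=0xFFF -> image 0x%08X\n",
           (unsigned)rxd_image(0, 0xFFF));
}
```

Calling ‘show_layout()’ from a ‘main()’ function prints the two images. If the first image shows its one-bits in the low twenty bit-positions, then the compiler has placed ‘desc_status’ in bits 0..19, as the comments in the typedef assume. If a compiler arranged its bitfields differently, the mask-and-shift helpers would be a safer way to read the NIC’s write-back values.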