What’s needed to transmit?
A look at the minimum steps required for programming our 82573L nic to send packets

Typical NIC hardware
[Figure: the CPU and main memory (holding a packet) share the system bus with the nic; inside the nic, a TX FIFO buffer and an RX FIFO buffer connect through a transceiver to the LAN cable]

Quotation
Many companies do an excellent job of providing information to help customers use their products... but in the end there's no substitute for real-life experiments: putting together the hardware, writing the program code, and watching what happens when the code executes. Then when the result isn't as expected -- as it often isn't -- it means trying something else or searching the documentation for clues.
-- Jan Axelson, author, Lakeview Research (1998)

Thanks, Intel!☻
• Intel Corporation has kindly posted details online for programming its family of gigabit Ethernet controllers, which includes our 82573L

Our ‘nictx.c’ module
• We’ve created an LKM (Loadable Kernel Module) with minimal functionality: just enough to be sure we know how to ‘transmit’ a raw Ethernet packet
• But we do this in a forward-looking way, so that our source-code can later be turned into a Linux character-mode device-driver (once we’ve also seen how to write code which allows our nic to ‘receive’ packets)

Access to PRO1000 registers
• Device registers are hardware-mapped to a range of addresses in physical memory
• We obtain the location (and the length) of this memory-range from a BAR (Base Address Register) in the nic device’s PCI Configuration Space
• Then we ask the Linux kernel to set up an I/O ‘remapping’ of this memory-range to ‘virtual’ addresses within kernel-space

Tx-Desc Ring-Buffer
• The ring is a circular buffer of 16-byte descriptors in RAM; its size in bytes (TDLEN) must be at least 128 and a multiple of 128
• TDBA holds the ring’s base-address; TDH (head) and TDT (tail) hold descriptor indices within the ring
• Descriptors from the head up to (but not including) the tail are owned by hardware (the nic); all others are owned by software (the cpu)
• In the pictured example the ring spans byte offsets 0x00 to 0x80 (eight descriptors), with the head at offset 0x20 and the tail at offset 0x60

How ‘transmit’ works
[Figure: a list of buffer-descriptors (descriptor0 to descriptor3) in random access memory, each pointing to its own packet buffer (Buffer0 to Buffer3)]
• We set up each data-packet that we want to be transmitted in a ‘Buffer’ area in RAM
• We also create a list of
buffer-descriptors and inform the NIC of its location and size
• Then, when ready, we tell the NIC to ‘Go!’ (i.e., start transmitting), and ask it to let us know when these transmissions are ‘Done’

Allocating kernel-memory
• Our 82573L device-driver will need a segment of contiguous physical memory which is cache-aligned and non-pageable
• Such a memory-block can be allocated with the kernel’s ‘kzalloc()’ function, which also zeroes it (and it can later be deallocated with ‘kfree()’)
• You should use the ‘GFP_KERNEL’ flag (and we also used the ‘GFP_DMA’ flag)

NIC registers (for transmit)
    enum {
        E1000_CTRL   = 0x0000,   // Device Control
        E1000_STATUS = 0x0008,   // Device Status
        E1000_TCTL   = 0x0400,   // Transmit Control
        E1000_TDBAL  = 0x3800,   // Tx-Descriptor Base-Address Low
        E1000_TDBAH  = 0x3804,   // Tx-Descriptor Base-Address High
        E1000_TDLEN  = 0x3808,   // Tx-Descriptor queue Length
        E1000_TDH    = 0x3810,   // Tx-Descriptor Head
        E1000_TDT    = 0x3818,   // Tx-Descriptor Tail
        E1000_TXDCTL = 0x3828,   // Tx-Descriptor Control
        E1000_RA     = 0x5400,   // Receive-address Array
    };

Device Control (0x0000)
    Bit(s)   Field      Meaning
    31       PHYRST     PHY Reset
    30       VME        VLAN Mode Enable
    28       TFCE       Tx Flow-Control Enable
    27       RFCE       Rx Flow-Control Enable
    26       RST        Device Reset
    20       ADVD3WUC   Advertise Cold Wake Up Capability
    19       D/UD       Dock/Undock status
    12       FRCDPLX    Force Duplex
    11       FRCSPD     Force Speed
    9..8     SPEED      00=10Mbps, 01=100Mbps, 10=1000Mbps, 11=reserved
    6        SLU        Set Link Up
    2        GIOMD      GIO Master Disable
    0        FD         Full-Duplex
    All other bits are reserved; write them with the values given in Intel’s manual.

82573L Device Status (0x0008)
    Bit(s)   Field           Meaning
    31       ?               some undocumented functionality?
    19       GIO Master EN   GIO Master Enable status
    10       PHYRA           PHY Reset Asserted
    9..8     ASDV            Auto-negotiation Speed Detection Value
    7..6     SPEED           00=10Mbps, 01=100Mbps, 10=1000Mbps, 11=reserved
    4        TXOFF           Transmission Paused
    3..2     Function ID
    1        LU              Link Up
    0        FD              Full-Duplex
    All other bits are shown as 0.

82573L Transmit Control (0x0400)
    Bit(s)   Field      Meaning
    31..29   reserved   =0
    28       MULR       Multiple Request Support
    27..26   TXCSCMT    TxDescriptor Minimum Threshold
    25       UNORTX     Underrun No Re-Transmit
    24       RTLC       Retransmit on Late Collision
    23       reserved   =0
    22       SWXOFF     Software XOFF Transmission
    21..12   COLD       Collision Distance (=0x3F)
    11..4    CT         Collision Threshold (=0xF)
    3        PSP        Pad Short Packets
    2        reserved   =0
    1        EN         Transmit Enable
    0        reserved   =0

82573L Tx-Descriptor Control (0x3828)
    Bit(s)   Field      Meaning
    31..25   reserved   =0
    24       GRAN       Granularity (0=cache lines, 1=descriptors)
    23..22   reserved   =0
    21..16   WTHRESH    Writeback Threshold
    15..14   reserved   =0
    13..8    HTHRESH    Host Threshold
    7..6     reserved   =0
    5..0     PTHRESH    Prefetch Threshold

“This register controls the fetching and write back of transmit descriptors. The three threshold values are used to determine when descriptors are read from, and written to, host memory. Their values can be in units of cache lines or of descriptors (each descriptor is 16 bytes), based on the value of the GRAN bit (0=cache lines, 1=descriptors).
When GRAN = 1, all descriptors are written back (even if not requested).” --Intel manual
Recommended for 82573: 0x01010000 (GRAN=1, WTHRESH=1)

An observation
• We notice that the 82573L device retains the values in many of its internal registers
• This reduces the programming steps needed to operate our nic on the anchor cluster machines, since Intel’s own Linux device driver (‘e1000e.ko’) has already initialized many nic registers
• But we MAY need to bring ‘eth1’ down!

Using ‘/sbin/ifconfig’
• You can use the ‘/sbin/ifconfig’ command to find out whether the ‘eth1’ interface has been brought ‘down’ (an interface that is still operating shows the ‘UP’ flag in this output):
      $ /sbin/ifconfig eth1
• If it is still operating, you can turn it off with the (privileged) command:
      $ sudo /sbin/ifconfig eth1 down

Programming steps
1) Detect the presence of the 82573L network controller (VENDOR_ID, DEVICE_ID)
2) Obtain the physical address-range where the nic’s device-registers are mapped
3) Ask the kernel to map this address-range into the kernel’s virtual address-space
4) Copy the network controller’s MAC-address into a 6-byte array for future access
5) Allocate a block of kernel memory large enough for our descriptors and buffers
6) Ensure that the network controller’s ‘Bus Master’ capability has been enabled
7) Select our desired configuration-options for the DEVICE CONTROL register
8) Perform a nic ‘reset’ operation (by setting bit 26, the RST bit, which clears itself), then delay until the reset completes
9) Select our desired configuration-options for the TRANSMIT CONTROL register
10) Initialize our array of Transmit Descriptors with the physical addresses of the buffers
11) Initialize the Transmit Engine’s registers (for the Tx-Descriptor Queue and Control)
12) Set up the buffer-contents for an Ethernet packet we want to be transmitted
13) Enable the Transmit Engine (the EN bit in TRANSMIT CONTROL)
14) Give ‘ownership’ of a Tx-Descriptor to the network controller (by advancing TDT past it)
15) Install our ‘/proc/nictx’ pseudo-file (for user-diagnostic purposes)

Legacy Tx-Descriptor Layout
    Offset   Contents (bit 31 on the left, bit 0 on the right)
    0x0      Buffer-Address low (bits 31..0)
    0x4      Buffer-Address high (bits 63..32)
    0x8      CMD [31..24] | CSO [23..16] | Packet Length, in bytes [15..0]
    0xC      special [31..16] | CSS [15..8] | reserved=0 [7..4] | status [3..0]

    Buffer-Address = the packet-buffer’s 64-bit address in physical memory
    Packet-Length  = number of bytes in the data-packet to be transmitted
    CMD            = Command-field
    CSO/CSS        = Checksum Offset/Start (in bytes)
    STA            = Status-field

Suggested C syntax
    typedef struct {
        unsigned long long  base_address;    // packet-buffer's physical address
        unsigned short      packet_length;   // bytes to transmit
        unsigned char       cksum_offset;    // CSO
        unsigned char       desc_command;    // CMD
        unsigned char       desc_status;     // STA (bits 3..0; bits 7..4 reserved)
        unsigned char       cksum_origin;    // CSS
        unsigned short      special_info;    // special
    } TX_DESCRIPTOR;   // 16 bytes; this field order matches the layout above on little-endian x86

TxDesc Command-field
    Bit:   7     6     5      4            3    2    1      0
           IDE   VLE   DEXT   reserved=0   RS   IC   IFCS   EOP

    EOP  = End Of Packet (1=yes, 0=no)
    IFCS = Insert Frame CheckSum (1=yes, 0=no), provided EOP is set
    IC   = Insert CheckSum (1=yes, 0=no) as indicated by the CSO/CSS fields
    RS   = Report Status (1=yes, 0=no)
    DEXT = Descriptor Extension (1=yes, 0=no); use ‘0’ for Legacy-Mode
    VLE  = VLAN-Packet Enable (1=yes, 0=no), provided EOP is set
    IDE  = Interrupt-Delay Enable (1=yes, 0=no)

TxDesc Status field
    Bit:   3            2    1    0
           reserved=0   LC   EC   DD

    DD = Descriptor Done: this bit is written back after the NIC processes the descriptor, provided the descriptor’s RS-bit (Report Status) was set
    EC = Excess Collisions: indicates that the packet experienced more than the maximum number of collisions (as defined by the TCTL.CT field) and therefore was not transmitted. (This bit is meaningful only in HALF-DUPLEX mode.)
    LC = Late Collision: indicates that a late collision occurred while operating in HALF-DUPLEX mode. Note that the collision window size depends on the SPEED: 64 bytes for 10/100 Mbps, or 512 bytes for 1000 Mbps.
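The descriptor layout and the head/tail ownership rule can be tried out in ordinary user-space C before moving into the kernel module. The sketch below is hypothetical, not code from ‘nictx.c’: the ring size (8 descriptors), the `TX_RING` type, and the helpers `tx_ring_init()`, `tx_ring_queue()` and `tx_desc_done()` are illustrative assumptions, and a plain `tail` field stands in for the nic’s TDT register. It uses fixed-width types so the 16-byte layout does not depend on the compiler, and it assumes a little-endian host such as x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Legacy Tx-descriptor, fields in the order of the layout table
   (assumes a little-endian host such as x86). */
typedef struct {
    uint64_t base_address;   /* physical address of the packet buffer */
    uint16_t packet_length;  /* bytes to transmit (the CRC is not counted) */
    uint8_t  cksum_offset;   /* CSO */
    uint8_t  desc_command;   /* CMD */
    uint8_t  desc_status;    /* STA: DD, EC, LC in bits 0..2 */
    uint8_t  cksum_origin;   /* CSS */
    uint16_t special_info;   /* special */
} TX_DESCRIPTOR;

_Static_assert(sizeof(TX_DESCRIPTOR) == 16, "a Tx-descriptor is 16 bytes");
_Static_assert(offsetof(TX_DESCRIPTOR, desc_command) == 11, "CMD is byte 11");

enum { DD = (1 << 0) };                                   /* status bit */
enum { EOP = (1 << 0), IFCS = (1 << 1), RS = (1 << 3) };  /* command bits */

/* Hypothetical ring size: 8 descriptors * 16 bytes = 128 bytes,
   the smallest legal TDLEN. */
#define N_TX_DESC 8
enum { TX_RING_BYTES = N_TX_DESC * sizeof(TX_DESCRIPTOR) };

typedef struct {
    TX_DESCRIPTOR desc[N_TX_DESC];
    unsigned tail;  /* software copy of the value that would go into TDT */
} TX_RING;

/* Zero the ring and point descriptor i at buffer i. buf_phys[] stands in
   for the physical buffer addresses the driver would compute. */
void tx_ring_init(TX_RING *ring, const uint64_t buf_phys[N_TX_DESC])
{
    memset(ring, 0, sizeof(*ring));
    for (unsigned i = 0; i < N_TX_DESC; i++)
        ring->desc[i].base_address = buf_phys[i];
}

/* Fill the descriptor at the tail for a one-buffer packet and advance the
   tail modulo the ring size. Returns the new tail; writing that value to
   TDT is what would hand the descriptor to the nic. This sketch does not
   track TDH, so it does not detect a full ring. */
unsigned tx_ring_queue(TX_RING *ring, uint16_t len)
{
    assert(len > 0);
    TX_DESCRIPTOR *d = &ring->desc[ring->tail];
    d->packet_length = len;
    d->desc_command = EOP | IFCS | RS;  /* last buffer, append CRC, report status */
    d->desc_status = 0;                 /* the nic writes DD back when done */
    ring->tail = (ring->tail + 1) % N_TX_DESC;
    return ring->tail;
}

/* Nonzero once the nic has written back DD for descriptor i. */
int tx_desc_done(const TX_RING *ring, unsigned i)
{
    return (ring->desc[i].desc_status & DD) != 0;
}
```

In a driver, the same ring would presumably live in the kzalloc()’d block, TDBAL/TDBAH would receive its physical address, TDLEN would receive `TX_RING_BYTES`, and each value returned by `tx_ring_queue()` would be written to TDT. Setting RS is what makes the nic write DD back, so `tx_desc_done()` only becomes true for descriptors queued with RS.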
Bit-mask definitions
    enum {
        // Status-field bits
        DD   = (1<<0),   // Descriptor Done
        EC   = (1<<1),   // Excess Collisions
        LC   = (1<<2),   // Late Collision

        // Command-field bits
        EOP  = (1<<0),   // End Of Packet
        IFCS = (1<<1),   // Insert Frame CheckSum
        IC   = (1<<2),   // Insert CheckSum as per CSO/CSS
        RS   = (1<<3),   // Report Status
        DEXT = (1<<5),   // Descriptor Extension
        VLE  = (1<<6),   // VLAN packet
        IDE  = (1<<7)    // Interrupt-Delay Enable
    };

Ethernet packet layout
• Total size normally varies from 64 bytes up to 1518 bytes, including the CRC (unless ‘jumbo’ packets and/or ‘undersized’ packets are enabled)
• The NIC expects a 14-byte packet ‘header’, and it appends a 4-byte CRC checksum (when the descriptor’s IFCS bit is set)
    Offset   Field
    0        destination MAC address (6 bytes)
    6        source MAC address (6 bytes)
    12       Type/length (2 bytes)
    14       the packet’s data ‘payload’ (usually from 46 to 1500 bytes)
    end      Cyclic Redundancy Checksum (4 bytes)

In-class exercises
• Modify the code in our ‘nictx.c’ module so that it transmits more than just one raw packet when you install it into the kernel
• Can you also modify the ‘module_exit()’ function so that it transmits a packet before it disables the ‘Transmit Engine’?
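The Ethernet packet layout can be filled in with a few memcpy() calls, which is also a starting point for the first exercise. This is a hypothetical helper, not part of ‘nictx.c’: the name `build_frame()`, the macro names, and the arbitrary type value 0x8888 used in the checks are illustrative assumptions. It zero-pads short frames to 60 bytes so that, once the nic appends its 4-byte CRC, the frame reaches the 64-byte minimum; with the PSP (Pad Short Packets) bit of TRANSMIT CONTROL set, the nic would do that padding itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN        6     /* bytes in a MAC address */
#define ETH_HDR_LEN     14    /* destination + source + type/length */
#define ETH_MIN_NOCRC   60    /* 64-byte minimum frame minus the 4-byte CRC */
#define ETH_MAX_PAYLOAD 1500

/* Hypothetical helper: write a raw Ethernet frame (without its CRC) into buf.
   Returns the value for the descriptor's Packet Length field, or 0 if the
   payload is too large. buf must hold ETH_HDR_LEN + ETH_MAX_PAYLOAD bytes. */
size_t build_frame(uint8_t *buf,
                   const uint8_t dst[ETH_ALEN],
                   const uint8_t src[ETH_ALEN],
                   uint16_t type,
                   const void *payload, size_t payload_len)
{
    assert(buf != NULL && dst != NULL && src != NULL);
    if (payload_len > ETH_MAX_PAYLOAD)
        return 0;

    memcpy(buf, dst, ETH_ALEN);             /* offset 0: destination MAC */
    memcpy(buf + ETH_ALEN, src, ETH_ALEN);  /* offset 6: source MAC */
    buf[12] = (uint8_t)(type >> 8);         /* offset 12: type/length, */
    buf[13] = (uint8_t)(type & 0xFF);       /* in big-endian (network) order */
    if (payload_len > 0)
        memcpy(buf + ETH_HDR_LEN, payload, payload_len);  /* offset 14 */

    size_t len = ETH_HDR_LEN + payload_len;
    if (len < ETH_MIN_NOCRC) {              /* zero-pad short frames */
        memset(buf + len, 0, ETH_MIN_NOCRC - len);
        len = ETH_MIN_NOCRC;
    }
    return len;                             /* the nic appends the CRC (IFCS) */
}
```

A 5-byte payload, for example, yields a 60-byte buffer; placed in a descriptor with IFCS set, it goes onto the wire as a 64-byte frame.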