Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards

Richard Hughes-Jones (1), Peter Clarke (2), Steven Dallison (1)
(1) Dept. of Physics and Astronomy, The University of Manchester, Oxford Rd., Manchester M13 9PL, UK
(2) National e-Science Centre, The University of Edinburgh, 15 S. College St., Edinburgh EH8 9AA, UK

Abstract

System administrators often assume that just by plugging in a Gigabit Ethernet Interface, the system will deliver line rate performance; sadly this is not often the case. The behaviour of various 1 and 10 Gigabit Ethernet Network Interface cards has been investigated using server quality motherboards. The latency, throughput, and the activity on the PCI buses and Gigabit Ethernet links were chosen as performance indicators. The tests were performed using two PCs connected back-to-back and sending UDP/IP frames from one to the other. This paper shows the importance of having a good combination of memory and peripheral bus chipset, Ethernet Interface, CPU power, good driver and operating system designs and proper configuration to achieve and sustain gigabit transfer rates. With these considerations taken into account, and suitably designed hardware, transfers can operate at gigabit and multi-gigabit speeds. Some recommendations are given for high performance data servers.

Key words: Gigabit Networking, Gigabit Ethernet, Network Interface Card, UDP, PCI bus, PCI-X bus.

1 Introduction

With the increased availability and decrease in cost of Gigabit Ethernet using twisted pair cabling of suitable quality, system suppliers and integrators are offering Gigabit Ethernet and associated switches as the preferred interconnect between disk servers and PC compute farms, as well as the most common campus or departmental backbones. With the excellent publicity that ‘Gigabit Ethernet is just Ethernet’, users and administrators are now expecting gigabit performance just by purchasing gigabit components. In addition, 10 Gigabit Ethernet PCI-X (Peripheral Component Interconnect - Extended) cards such as the Intel PRO/10GbE LR Server Adapter [1] and the S2io Xframe [2] are emerging into the market. Although initially targeted at the server market, there is considerable interest in providing access to data at multi-gigabit rates for the emerging high performance Grid applications.

In order to perform sustained data transfers over the network at gigabit speeds, it is essential to study the behaviour and performance of the end-system compute platform and the Network Interface Cards (NICs). In general, data must be transferred between system memory and the NIC and then placed on the network. For this reason, the operation and interactions of the memory, CPU, memory bus, the bridge to the input-output bus (often referred to as the "chipset"), the network interface card and the input-output bus itself (in this case PCI / PCI-X) are of great importance. The design and implementation of the software drivers, protocol stack, and operating system are also vital to good performance. For most data transfers, the information must be read from and stored on permanent storage, thus the performance of the storage sub-systems and their interaction with the computer buses is equally important.

The work reported here forms part of an ongoing evaluation programme for systems and Gigabit Ethernet components for high performance networking for Grid computing and the Trigger/Data Acquisition in the ATLAS [8] experiment at the Large Hadron Collider being built at CERN, Switzerland.
The paper first describes the methodology of the tests performed (Section 2) and then details the equipment used (Section 3). For the 1 Gigabit Ethernet NICs, the results, analysis and discussion of the tests performed are presented in Sections 4, 5 and 6 (one for each motherboard used). Within each of these sections, subsections describe the results for each NIC that was tested. Section 7 describes tests made with the Intel 10 Gigabit Ethernet adaptor on both Intel 32 and 64 bit architectures. Related work is discussed in Section 8. Finally, the paper ends with concluding remarks.

2 Methodology and Tests Performed

For each combination of motherboard, NIC and Linux kernel, three sets of measurements were made using two PCs with the NICs directly connected together with suitable fibre or copper crossover cables. UDP/IP frames were chosen for the tests as they are processed in a similar manner to TCP/IP frames, but are not subject to the flow control and congestion avoidance algorithms defined in the TCP protocol and thus do not distort the base-level performance. The packet lengths given are those of the user payload. The following subsections describe the measurements made of the latency, throughput, and the activity on the PCI buses.

2.1 Latency

Round trip times were measured using request-response UDP frames. One system sent a UDP packet requesting that a response of the required length be sent back by the remote end. The unused portions of the UDP packets were filled with random values to prevent any data compression. Each test involved measuring many (~1000) request-response singletons. The individual request-response times were measured by using the CPU cycle counter on the Pentium [6], [20], and the minimum, average and maximum times were computed. This approach is in agreement with the recommendations of the IP Performance Metrics Working Group and the Global Grid Forum (GGF) [7].

The round-trip latency was measured and plotted as a function of the frame size of the response. The slope of this graph is given by the sum of the inverse data transfer rates for each step of the end-to-end path [10]. Figure 2.1 shows a table giving the slopes expected for various PCI-Ethernet-PCI transfers given the different PCI bus widths and speeds. The intercept gives the sum of the propagation delays in the hardware components and the end system processing times.

Histograms were also made of the singleton request-response measurements. These histograms show any variations in the round-trip latencies, some of which may be caused by other activity in the PCs, which was kept to a minimum during the tests. In general, for the Gigabit Ethernet systems under test, the Full Width at Half Maximum (FWHM) of the latency distribution was ~2 µs with very few events in the tail (see Figure 6.3), which indicates that the method used to measure the times was precise. These latency tests also indicate any interrupt coalescence in the NICs and provide information on protocol stack performance and buffer management.

2.2 UDP Throughput

The UDPmon [11] tool was used to transmit streams of UDP packets at regular, carefully controlled intervals and the throughput was measured at the receiver. Figure 2.2 shows how the messages and the stream of UDP data packets are exchanged by UDPmon. On an unloaded network, UDPmon will estimate the capacity of the link with the smallest bandwidth on the path between the two end systems. On a loaded network, the tool gives an estimate of the available bandwidth [7].
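To make the transmit-spacing technique concrete, the listing below is a minimal sketch of a paced UDP sender in the spirit of these tests. It is not the UDPmon source: the receiver address, port, packet count and busy-wait pacing loop are illustrative assumptions, and the control and statistics exchange of Figure 2.2 is omitted.

```c
/* Minimal sketch of a paced UDP sender (illustrative only, not UDPmon).
 * Build (Linux):  gcc -O2 -o paced_send paced_send.c   (add -lrt on old glibc)
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)            /* monotonic clock in microseconds */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const int payload = 1472;         /* user payload in bytes (1500 byte MTU) */
    const double spacing_us = 20.0;   /* requested inter-packet spacing        */
    const int npkts = 10000;

    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(14196);                        /* assumed test port   */
    inet_pton(AF_INET, "192.168.10.2", &dst.sin_addr);  /* assumed receiver    */

    char *buf = malloc(payload);
    for (int i = 0; i < payload; i++)
        buf[i] = (char)rand();        /* random fill, as in the tests above,
                                         to defeat any data compression        */

    double t0 = now_us();
    for (int i = 0; i < npkts; i++) {
        sendto(s, buf, payload, 0, (struct sockaddr *)&dst, sizeof(dst));
        /* busy-wait until the next transmit time: the standard sleep calls
         * are far too coarse for microsecond inter-packet spacings           */
        while (now_us() < t0 + (i + 1) * spacing_us)
            ;
    }
    double elapsed = now_us() - t0;
    printf("sent %d x %d byte packets in %.0f us -> %.1f Mbit/s of user data\n",
           npkts, payload, elapsed, npkts * payload * 8.0 / elapsed);

    free(buf);
    close(s);
    return 0;
}
```

A busy-wait on a high resolution clock is used for the pacing because the granularity of the ordinary sleep calls is much coarser than the microsecond inter-packet spacings of interest; the receiving side then records the arrival times, losses and inter-packet spacings.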
These estimates of the capacity and available bandwidth are indicated by the flat portions of the throughput curves. In these tests, a series of user payloads from 64 to 1472 bytes were selected (allowing for 20 bytes of IP and 8 bytes of UDP headers, 1472 bytes is the maximum user payload for an Ethernet interface with a 1500 byte Maximum Transfer Unit (MTU)), and for each packet size the frame transmit spacing was varied. For each point, the following information was recorded: the time to send and the time to receive the frames; the number of packets received, the number lost, and the number out of order; the distribution of the lost packets; the received inter-packet spacing; the CPU load and the number of interrupts for both transmitting and receiving systems.

The "wire" throughput rates include an extra 66 bytes of overhead (12 bytes of inter-packet gap, 8 bytes for the preamble and Start Frame Delimiter, 18 bytes for the Ethernet frame header and CRC, and 28 bytes of IP and UDP headers) and were plotted as a function of the frame transmit spacing. On the right hand side of the plots, the curves show a 1/t behaviour, where the delay between sending successive packets is the most important factor. When the frame transmit spacing is such that the data rate would be greater than the available bandwidth, one would expect the curves to be flat, which is often observed to be the case. As the packet size is reduced, processing and transfer overheads become more important and this decreases the achievable data transfer rate.

Transfer Element                              | Inverse data transfer rate (µs/byte) | Expected slope (µs/byte)
32 bit 33 MHz PCI                             | 0.0075                               |
64 bit 33 MHz PCI                             | 0.00375                              |
64 bit 66 MHz PCI                             | 0.00188                              |
64 bit 133 MHz PCI-X                          | 0.00094                              |
Gigabit Ethernet                              | 0.008                                |
10 Gigabit Ethernet                           | 0.0008                               |
32 bit 33 MHz PCI with Gigabit Ethernet       |                                      | 0.023
64 bit 33 MHz PCI with Gigabit Ethernet       |                                      | 0.0155
64 bit 66 MHz PCI with Gigabit Ethernet       |                                      | 0.0118
64 bit 133 MHz PCI-X with 10 Gigabit Ethernet |                                      | 0.0027

Figure 2.1 Table of the slopes expected for various PCI-Ethernet-PCI transfers.

Figure 2.2 Timing diagram of the messages exchanged by UDPmon: the requester starts and zeroes the statistics, sends data frames of a given size at regular intervals, and then collects the remote statistics and signals the end of the test.

2.3 Activity on the PCI Buses and the Gigabit Ethernet Link

The activity on the PCI bus of the sender and receiver nodes and the passage of the Gigabit Ethernet frames on the fibre were measured using a Logic Analyser to record traces of the individual signals on the PCI/PCI-X buses, and a specialised Gigabit Ethernet Probe [12] to determine what was on the network, as shown in Figure 2.3.

Figure 2.3 Test arrangement for examining the PCI buses and the Gigabit Ethernet link. Possible bottlenecks are: CPU power, memory bandwidth, chipset/PCI bus, NIC.

3 Hardware and Operating Systems

Both copper and fibre NICs were tested and are listed in Figure 3.1 together with the chipset and the Linux driver version. Details of the systems and motherboards used in the tests are given with the summary of the measurements in Figure 9.1.
As an example of a typical server quality board, the SuperMicro [3] P4DP6 motherboard has dual 2.2 GHz Xeon processors, a 400 MHz front-side bus and four independent 64 bit PCI / PCI-X buses with speeds selectable from 66, 100 or 133 MHz, as shown in the block diagram of Figure 3.2; the "slow" devices and the EIDE controllers are connected via another bus. The P4DP8-G2 used in the DataTAG [4] PCs is similar, with the on-board 10/100 LAN controller replaced by the Intel 82546EB dual port Gigabit Ethernet controller connected to PCI Slot 5.

As most of the large Grid projects in High Energy Physics or Astronomy are currently based on Linux PCs, this platform was selected for making the evaluations. RedHat 7.1, 7.2 and 8.0 distributions of Linux were used with various kernels including 2.4.14, 2.4.19, 2.4.20 and 2.4.22. However, the techniques and methodology are applicable to other platforms. Care was taken to obtain the latest drivers for the Linux kernel in use. Normally no modifications were made to the IP stacks nor the drivers, but in some tests the DataTAG altAIMD kernel [28] was used. Drivers were loaded with the default parameters except where stated.

The DataTAG altAIMD kernel incorporates several experimental TCP stacks that modify the sender side congestion window update algorithms and can work with any TCP receiver. The version used in these tests was a patch for the Linux 2.4.20 kernel which allowed dynamic switching, using a parameter in the kernel "/proc" database, between the following stacks: standard Linux TCP, HighSpeed TCP, and Scalable TCP 0.5.1. In addition, the following modifications were incorporated and applied to all the TCP stacks: Web100 instrumentation [29]; an on/off toggle for Limited Slow Start; Appropriate Byte Counting (ABC) as described in RFC 3465; the ability to enable or disable the functionality of the moderate_cwnd() procedure that adjusts the congestion window in the presence of "dubious" ACKs; and the Txqueuelen feedback/moderation toggle, which enables TCP to ignore or moderate congestion feedback when the transmit queue becomes full.

The Xeon processor has two "logical" processors, each with its own independent machine state, data registers, control registers and interrupt controller (APIC). However, they share the core resources including the execution engine, system bus interface and level-2 cache. This hyper-threading technology allows the core to execute two or more separate code streams concurrently, using out-of-order instruction scheduling to maximise the use of the execution units [5][19].

Manufacturer | Model                       | Chipset        | Linux kernel   | Driver version
Alteon       | ACENIC                      | Tigon II rev 6 | 2.4.14         | acenic v0.83, firmware 12.4.11
SysKonnect   | SK9843                      |                | 2.4.14         | sk98lin v4.06
SysKonnect   | SK9843                      |                | 2.4.19-SMP     | sk98lin v6.0 β 04
SysKonnect   | SK9843                      |                | 2.4.20-SMP     | sk98lin v6.0 β 04
Intel        | PRO/1000 XT                 | 82544          | 2.4.14         | e1000
Intel        | PRO/1000 XT                 | 82544          | 2.4.19-SMP     | e1000 4.4.12
Intel        | PRO/1000 XT                 | 82544          | 2.4.20-SMP     | e1000 4.4.12
Intel        | On board                    | 82546 EB       | 2.4.20-SMP     | e1000 4.4.12
Intel        | PRO/10GbE LR Server Adapter |                | 2.4.20, 2.4.22 | ixgb 1.0.45

Figure 3.1 NICs used in the tests.

Figure 3.2 Block diagram of the SuperMicro P4DP6 / P4DP8 motherboards.

4 Measurements made on the SuperMicro 370DLE

The SuperMicro 370DLE motherboard is somewhat old now, but has the advantage of having both 32 bit and 64 bit PCI slots, with the bus speed jumper-selectable at 33 or 66 MHz. The board uses the ServerWorks III LE chipset and one PIII 800 MHz CPU. RedHat 7.1 Linux was used with the 2.4.14 kernel for the tests.
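To see how the expected slopes of Figure 2.1 arise for the two bus configurations on this board, the arithmetic can be spelled out (a worked example using the bus widths and clock rates given above, not an additional measurement):

  32 bit 33 MHz PCI: 1 / (4 bytes x 33 MHz) ≈ 0.0075 µs/byte
  64 bit 66 MHz PCI: 1 / (8 bytes x 66 MHz) ≈ 0.0019 µs/byte
  Gigabit Ethernet:  1 / (125 Mbyte/s)      = 0.008 µs/byte

A request-response frame crosses the sending PCI bus, the Ethernet link and the receiving PCI bus, so the expected slope is the sum of the three terms: 2 x 0.0075 + 0.008 ≈ 0.023 µs/byte for the 32 bit 33 MHz slots and 2 x 0.0019 + 0.008 ≈ 0.0118 µs/byte for the 64 bit 66 MHz slots, as listed in Figure 2.1.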
4.1 SysKonnect NIC

4.1.1 Request-Response Latency

The round trip latency, shown in Figure 4.1 as a function of the packet size, is a smooth function for both 33 MHz and 66 MHz PCI buses, indicating that the driver-NIC buffer management works well. The clear step increase in latency at 1500 bytes is due to the need to send a second partially filled packet, and the smaller slope for the second packet is due to the overlapping of the incoming second frame and the data from the first frame being transferred over the PCI bus. These two effects are common to all the gigabit NICs tested. The slope observed for the 32 bit PCI bus (0.0286 µs/byte) is in reasonable agreement with that expected, the increase being consistent with some memory-memory copies at both ends. However the slope of 0.023 µs/byte measured for the 64 bit PCI bus is much larger than the 0.0118 µs/byte expected, even allowing for memory copies. Measurements with the logic analyser, discussed in Section 4.1.3, confirmed that the PCI data transfers did operate at 64 bit 66 MHz with no wait states. It is possible that the extra time per byte could be due to some movement of data within the hardware interface itself. The small intercept of 62 or 56 µs, depending on bus speed, suggests that the NIC interrupts the CPU for each packet sent and received. This was confirmed by the UDPmon throughput data and the PCI measurements.

4.1.2 UDP Throughput

The SysKonnect NIC managed 584 Mbit/s throughput for full 1500 byte frames on the 32 bit PCI bus (at a spacing of 20-21 µs) and 720 Mbit/s (at a spacing of 17 µs) on the 64 bit 66 MHz PCI bus, as shown in Figure 4.2. With inter-packet spacing less than these values, there is packet loss, traced to loss in the receiving IP layer when no buffers for the higher layers are available. However, the receiving CPU is in kernel mode for 95-100% of the time when the frames were spaced at 18 µs or less, as shown in the lower graph of Figure 4.2, so a lack of CPU power could explain the packet loss.

Figure 4.1 UDP request-response latency as a function of the packet size for the SysKonnect SK9843 NIC on the 32 bit 33 MHz and 64 bit 66 MHz PCI buses of the SuperMicro 370DLE (linear fits: 32 bit 33 MHz, y = 0.0286x + 61.79 and y = 0.0188x + 89.756; 64 bit 66 MHz, y = 0.0231x + 56.088 and y = 0.0142x + 81.975).

Figure 4.2 UDP throughput as a function of the packet size for the SysKonnect SK9843 NIC on the 32 bit 33 MHz and 64 bit 66 MHz PCI buses of the SuperMicro 370DLE, together with the percentage of time the receiving CPU was in kernel mode.

4.1.3 PCI Activity

Figure 4.3 shows the signals on the sending and receiving PCI buses when frames with 1400 bytes of user data were sent.
At time t, the dark portions on the send PCI show the setup of the Control and Status Registers (CSRs) on the NIC, which then moves the data over the PCI bus, indicated by the assertion of the PCI signals for a long period (~3 µs). The Ethernet frame is transferred over the Gigabit Ethernet as shown by the lower two signals at time X to O, and then on to the receive PCI bus at a time just after O. The frame exists on the Ethernet medium for 11.6 µs. The activity seen on the sending and receiving PCI buses after the data transfers is due to the driver updating the CSRs on the NIC after the frame has been sent or received. These actions are performed after the NIC has generated an interrupt; for the SysKonnect card, this happens for each frame.

Figure 4.3 also shows that the propagation delay of the frame from the first word on the sending PCI bus to the first word on the receiving PCI was ~26 µs. Including the times to set up the sending CSRs and the interrupt processing on the receiver, the total delay from the software sending the packet to receiving it was 36 µs. Using the intercept of 56 µs from the latency measurements, one can estimate the IP stack and application processing times to be ~10 µs on each 800 MHz CPU. The traces show that data movement over PCI and Ethernet do not overlap in time, indicating that the NICs operate in a store-and-forward mode using buffering on the NIC for both transmit and receive. There is no indication of multiple Ethernet frames being stored or queued on the NIC.

Figure 4.3 The traces of the signals on the send and receive PCI buses for the SysKonnect NIC on the 64 bit 66 MHz bus of the SuperMicro 370DLE motherboard. The bottom 3 signals are from the Gigabit Ethernet Probe card and show the presence of the frame on the Gigabit Ethernet fibre.

The upper plot in Figure 4.4 shows the same signals corresponding to packets being generated at a transmit spacing of 20 µs, and the lower plot has the transmit spacing set to 10 µs. In this case, the packets are transmitted back-to-back on the wire with a spacing of 11.7 µs, i.e. full wire speed. This shows that an 800 MHz CPU is capable of transmitting large frames at gigabit wire speed.

The time required for the Direct Memory Access (DMA) transfers over the PCI bus scales with PCI bus width and speed as expected, but the time required to set up the CSRs of the NIC is almost independent of these parameters. It depends instead on how quickly the NIC can deal with the CSR accesses internally. Together with CPU load, these considerations explain why using 66 MHz PCI buses gives increased performance, but not by a factor of two (see Figure 4.2). Another issue to consider is the number of CSR accesses that a NIC requires to transmit data, receive data, determine the error status, and update the CSRs when a packet has been sent or received. Clearly, for high throughput, the number of CSR accesses should be minimised.

Figure 4.4 Signals on the send and receive PCI buses and the Gigabit Ethernet Probe card for different packet spacings. Upper: signals corresponding to packets being generated at a transmit spacing of 20 µs. Lower: signals with the transmit spacing set to 10 µs, showing the frames back-to-back on the Ethernet fibre.
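For reference, the ~11.7 µs back-to-back spacing seen on the fibre in the lower plot of Figure 4.4 follows directly from the length of the frame on the wire (a simple check using the overhead bytes listed in Section 2.2):

  1400 (user data) + 8 (UDP) + 20 (IP) + 18 (Ethernet header and CRC) + 8 (preamble and SFD) + 12 (inter-packet gap) = 1466 bytes

At 1 Gbit/s each byte takes 8 ns, so a 1400 byte packet occupies the medium for 1466 x 8 ns ≈ 11.7 µs, in agreement with the measured spacing.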
4.2 Intel PRO/1000 XT NIC

4.2.1 UDP Request-Response Latency

Figure 4.5 shows the round trip latency as a function of the packet size for the Intel PRO/1000 XT Server NIC connected to the 64 bit 66 MHz PCI bus of the SuperMicro 370DLE motherboard. The graph shows smooth behaviour as a function of packet size, and the observed slope of 0.018 µs/byte is in reasonable agreement with the 0.0118 µs/byte expected. The large intercept of ~168 µs suggests that the driver enables interrupt coalescence, which was confirmed by the throughput tests that showed one interrupt for approximately every 33 packets sent (~400 µs) and one interrupt for every 10 packets received (~120 µs).

4.2.2 UDP Throughput

Figure 4.6 shows that the maximum throughput for full 1472 byte frames (1500 byte MTU) on the 64 bit 66 MHz PCI bus was 910 Mbit/s. The lower graph shows the corresponding packet loss. This behaviour of losing frames at wire rate is typical of most NIC-motherboard combinations. It was traced to "indiscards", i.e. loss in the receiving IP layer when no buffers for the higher layers were available. The receiving CPU was loaded at between 85 and 100% when the frames were spaced at 13 µs or less, which suggests that a lack of CPU power may be the cause of the lost packets.

4.2.3 PCI Activity

Figure 4.7 shows the PCI signals for a 64 byte request from the sending PC followed by a 1400 byte response. Prior to sending the request frame, there is a CSR setup time of 1.75 µs followed by 0.25 µs of DMA data transfer. The default interrupt coalescence of 64 units was enabled. This gave a delay of ~70 µs between the PCI transfer and the update of the CSRs. This delay is seen on both send and receive actions.

Figure 4.5 Latency as a function of the packet size for the Intel PRO/1000 XT on the SuperMicro 370DLE 64 bit 66 MHz PCI bus (linear fit: y = 0.0187x + 167.86).

Figure 4.8 shows the PCI signals for 1400 byte frames sent every 11 µs. The PCI bus is occupied for ~4.7 µs on sending, which corresponds to ~43% usage; for receiving, it takes only ~3.25 µs to transfer the frame, ~30% usage. Figure 4.9 shows the transfers on a longer time scale. There are regular gaps in the frame transmission, approximately every 900 µs. Transfers on the sending PCI stop first, followed by those on the receiving PCI. Symmetric flow control was enabled, which suggests that Ethernet pause frames were used to prevent buffer overflow and subsequent frame loss. This behaviour is not unreasonable as one of these frames would occupy the Gigabit Ethernet for ~11.7 µs.

Figure 4.6 UDP throughput and corresponding packet loss as a function of the packet size for the Intel PRO/1000 XT Server NIC on the SuperMicro 370DLE 64 bit 66 MHz PCI bus.
Figure 4.7 The traces of the signals on the send and receive PCI buses for the Intel PRO/1000 XT NIC and the SuperMicro 370DLE motherboard for a 64 byte request with a 1400 byte response, showing the interrupt delay and interrupt processing on each side.

Figure 4.8 The traces of the signals on the send and receive PCI buses for the Intel PRO/1000 XT NIC and the SuperMicro 370DLE motherboard for a stream of 1400 byte packets transmitted with a separation of 11 µs.

Figure 4.9 The traces of the signals on the send and receive PCI buses for the Intel PRO/1000 XT NIC and the SuperMicro 370DLE motherboard for a stream of 1400 byte packets transmitted with a separation of 11 µs, showing pauses in the packet transmission.

5 Measurements made on the IBM DAS board

The motherboards in the IBM DAS compute server were tested. They had dual 1 GHz Pentium III processors with the ServerWorks CNB20LE chipset, and the PCI bus was 64 bit 33 MHz. RedHat 7.1 Linux with the 2.4.14 kernel was used for the tests. Figure 5.1 shows the UDP throughput for the SysKonnect NIC. It performed very well, giving a throughput of 790 Mbit/s, which may be compared to only 620 Mbit/s on the SuperMicro 370DLE; however, the IBM had much more processing power with dual 1 GHz CPUs. The Intel NIC also performed very well on the IBM DAS system, giving a throughput of 930 Mbit/s, which may be compared to the 910 Mbit/s on the SuperMicro 370DLE. The latency and PCI activity were similar to those for the SuperMicro 370DLE.

Figure 5.1 UDP throughput as a function of the packet size for the SysKonnect SK9843 NIC on the 64 bit 33 MHz PCI bus of the IBM DAS.

6 Measurements made on the SuperMicro P4DP6 board

The SuperMicro P4DP6 motherboard used the Intel E7500 (Plumas) chipset and had dual Xeon Prestonia 2.2 GHz CPUs. The arrangement of the 64 bit PCI and PCI-X buses was discussed in Section 3.

6.1 SysKonnect NIC

6.1.1 Request-Response Latency

Figure 6.1 depicts the round trip latency as a function of the packet size. This smooth function indicates that the driver-NIC buffer management works well. The slope for the 64 bit PCI bus of 0.0199 µs/byte is in reasonable agreement with the 0.0117 µs/byte expected. The intercept of 62 µs suggests that there is no interrupt coalescence, which is confirmed by the data from the UDPmon throughput tests that record one interrupt per packet sent and received.

Figure 6.1 UDP request-response latency as a function of the packet size for the SysKonnect NIC using the 64 bit 66 MHz PCI bus on the SuperMicro P4DP6 board (linear fits: y = 0.0199x + 62.406 and y = 0.012x + 78.554).

6.1.2 UDP Throughput

As shown in Figure 6.2, the SysKonnect NIC manages 885 Mbit/s throughput for full 1472 byte packets on the 64 bit 66 MHz PCI bus (at a spacing of 13 µs). At inter-packet spacings less than this value, there is ~20% packet loss for packets over 1200 bytes and between 60 and 70% loss for small packets. High packet loss corresponded to low throughput. These losses were again traced to "indiscards", i.e. loss in the receiving IP layer.
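As an aside, the "indiscards" referred to here are counted by the kernel's IP-layer discard statistic. The sketch below shows one way of watching that counter during a test by reading /proc/net/snmp; it is an illustrative example, not the tool actually used for these measurements.

```c
/* Hedged sketch: print the kernel's IP InDiscards counter from /proc/net/snmp.
 * The "Ip:" statistics appear as a header line of field names followed by a
 * line of values; the two lines are walked in step to find "InDiscards".
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f) { perror("/proc/net/snmp"); return 1; }

    char hdr[1024], val[1024];
    while (fgets(hdr, sizeof(hdr), f)) {
        if (strncmp(hdr, "Ip:", 3) != 0) continue;        /* header line */
        if (!fgets(val, sizeof(val), f)) break;           /* value line  */

        char *hs, *vs;
        char *h = strtok_r(hdr, " \n", &hs);
        char *v = strtok_r(val, " \n", &vs);
        while (h && v) {
            if (strcmp(h, "InDiscards") == 0)
                printf("IP InDiscards = %s\n", v);
            h = strtok_r(NULL, " \n", &hs);
            v = strtok_r(NULL, " \n", &vs);
        }
        break;
    }
    fclose(f);
    return 0;
}
```

Sampling this counter before and after a stream, together with the sender and receiver packet counts, confirms whether frames were dropped in the IP layer rather than on the wire or in the NIC.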
Although the corresponding CPU utilisation on the receiving PC, averaged over the 4 CPUs, was ~20-30% with the 2.4.19 kernel, only one of the CPUs was responsible for the packet processing, and this CPU spent nearly all of its time in kernel mode, as shown by the lower plot in Figure 6.2. The SysKonnect NIC generated an interrupt for each packet, and with this utilisation the system was very close to livelock [18], providing a likely explanation of the observed packet loss. Network studies at the Stanford Linear Accelerator Center (SLAC) [14] also indicate that a processor of at least 1 GHz per Gbit/s is required. With the 2.4.20 kernel and the same hardware, packets over 1400 bytes achieved line speed, and in general the CPU utilisation was less. However there was still packet loss for small packets, which is not understood.

Figure 6.2 UDP throughput, packet loss, and CPU utilisation as a function of the packet size for the SysKonnect NIC using the 64 bit 66 MHz PCI bus on the SuperMicro P4DP6 board.

6.2 Intel PRO/1000 NIC

Figure 6.3 shows the round trip latency as a function of the packet size. The behaviour shows some distinct steps. The observed slope of 0.009 µs/byte is smaller than the 0.0118 µs/byte expected, but making rough allowance for the steps, agreement is good. Figure 6.3 also shows histograms of the round trip times for various packet sizes. There is no variation with packet size, all having a FWHM of 1.5 µs and no significant tail. This confirms that the timing method of using the CPU cycle counter gives good precision and does not introduce bias due to other activity in the end systems.

6.2.1 UDP Throughput

Figure 6.4 shows that the Intel NIC performed very well, giving a throughput of 950 Mbit/s with no packet loss. The corresponding average CPU utilisation on the receiving PC was ~20% for packets greater than 1000 bytes and 30-40% for smaller packets.

Figure 6.3 Latency as a function of the packet size, and histograms of the request-response latency for 64, 512, 1024 and 1400 byte packet sizes, for the Intel PRO/1000 XT NIC on the 64 bit 66 MHz PCI bus of the SuperMicro P4DP6 motherboard (linear fits: y = 0.0093x + 194.67 and y = 0.0149x + 201.75).
Figure 6.4 UDP throughput as a function of the packet size for the Intel PRO/1000 XT NIC on the SuperMicro P4DP6 64 bit 66 MHz PCI bus. During these tests no packet loss was observed.

Figure 6.5 The signals on the send and receive PCI buses for the Intel PRO/1000 NIC and the SuperMicro P4DP6 motherboard. The lower picture shows the PCI transactions to send the frame in more detail, including the assertion of PCI STOP.

6.2.2 PCI Activity

Figure 6.5 shows the PCI activity when sending a 1400 byte packet with the Intel PRO/1000 XT NICs. The DMA transfers are clean, but there are several PCI STOP signals that occur when accessing the NIC CSRs. The assertion of these STOP signals delays the completion of the PCI transaction, as shown in more detail in the lower picture. The PCI bus is occupied for ~2 µs just setting up the CSRs for sending. For 1400 byte packets at wire speed, the PCI bus is occupied for ~5 µs on sending, ~50% usage, while for receiving it takes only ~2.9 µs to transfer the frame over the PCI bus, ~25% usage.

7 10 Gigabit Ethernet Performance with Various Motherboards

For the tests described in this section, two similar systems were connected back to back.

7.1 Latency

Figure 7.1 shows the request-response latency as a function of the message size, and histograms of the latency, for the Intel 10 Gbit NIC in the DataTAG SuperMicro P4DP8-G2 PCs. For two 133 MHz PCI-X buses and a 10 Gigabit Ethernet interconnect, the expected slope is 0.00268 µs/byte, compared with the measured 0.0034 µs/byte. The extra time could be due to memory-memory copies, as it corresponds to a transfer rate of 2.7 Gbyte/s in each host. The histograms show narrow peaks (FWHM ~2 µs), indicating good stability; however there is a second smaller peak 4-6 µs away from the main peak. The request-response latency was also measured for the CERN OpenLab HP RX2600 dual Itanium 64 bit systems. The measured slope was 0.0033 µs/byte, close to that expected, and the histograms were similar to those for the Xeon processor.

7.2 Throughput

Figure 7.2 shows the UDP throughput as a function of the spacing between the packets for different packet sizes. In all cases the MTU was set to 16,114 bytes, giving a maximum length of 16,080 bytes for the user data in the UDP packet. The top plot is for the DataTAG SuperMicro PCs, showing a wire rate throughput of 2.9 Gbit/s; the middle plot is for the HP Itanium PCs with a wire rate of 5.7 Gbit/s; and the bottom plot is for the SLAC Dell PCs, giving a wire rate of 5.4 Gbit/s. The PCI-X Maximum Memory Read Byte Count (mmrbc) parameter [13] [27] was set to 4096 bytes for the SLAC Dell and HP Itanium measurements; the DataTAG SuperMicro measurements used the default mmrbc of 512 bytes. The difference between the top and bottom plots therefore includes the effects of both CPU power and the PCI-X mmrbc parameter setting. In general the plots indicate a well behaved hardware and driver network sub-system with few unexpected changes or lost packets. The plots also show that high throughput can only be achieved by using jumbo frames. When 1472 byte frames (1500 byte MTU) are used, the throughput is 2 Gbit/s or less, and some packets are lost. In this case it appears that CPU usage is the limiting factor.
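A rough illustration of why the frame size matters so much (an estimate, not a measurement from these tests): at 2 Gbit/s with standard 1500 byte MTU frames, each frame occupies 1538 bytes on the wire, so the receiver must process about 2 x 10^9 / (1538 x 8) ≈ 160,000 frames per second, i.e. one frame every ~6 µs, and the fixed per-packet costs (interrupts, CSR accesses and protocol processing) dominate. With the 16,114 byte MTU used here, the same data rate corresponds to only ~15,000 frames per second, an order of magnitude less per-packet work for the CPUs.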
In terms of throughput performance, the 1 GHz 64 bit Itaniums have a clear advantage when compared to the 3 GHz Xeon, but further investigation is required to determine whether this difference in performance is due to the processor characteristics, the memory interconnect or the chipset design. It is also possible that the improvements in the kernel from 2.4.20 to 2.5.72 play an important role.

7.3 Effect of tuning the PCI-X Bus

The activity on the PCI-X bus and the UDP memory to memory throughput were investigated as a function of the value loaded into the mmrbc register of the Intel PRO/10GbE LR Server Adapter. The mmrbc forms part of the PCI-X Command Register and sets the maximum byte count the PCI-X device may use when initiating a data transfer sequence with one of the burst read commands [13].

Figure 7.3 shows traces of the signals on the PCI-X bus on the DataTAG SuperMicro PC when a packet with 16,080 bytes of user data is moved from memory to the NIC, for different values of the mmrbc parameter. The PCI-X bus operated at 133 MHz. The sequence starts with two accesses to the CSRs on the NIC and then the DMA data transfer begins. For illustration, the CSR access, the length of a PCI-X data transfer segment and the length of the whole packet (16,114 bytes including headers) are indicated on the traces for the case where the mmrbc was set to 2048 bytes. Note that after transferring a segment 2048 bytes long, there is a ~870 ns pause in the data flow over the PCI-X bus, which allows re-arbitration for use of the bus.

Figure 7.1 The request-response latency (average and minimum times) as a function of message size, with linear fit y = 0.0032x + 44.098, and histograms of the latency for 64, 512 and 1400 byte packets, for the DataTAG SuperMicro P4DP8-G2 dual 2.2 GHz Xeon systems.

Figure 7.2 UDP throughput as a function of the spacing between the packets for different packet sizes. Top: SuperMicro dual 2.2 GHz Xeon, kernel 2.4.21 DataTAG altAIMD, mmrbc = 512 bytes; Middle: HP dual 1 GHz Itanium, kernel 2.5.72, mmrbc = 4096 bytes; Bottom: Dell dual 3.06 GHz Xeon, kernel 2.4.20 DataTAG altAIMD, mmrbc = 4096 bytes.

The plots in Figure 7.3 clearly show the expected decrease in the time taken for the data transfers as the amount of data that may be transmitted in one PCI-X sequence is increased. The variation of the throughput for UDP packets as a function of mmrbc is shown in Figure 7.4. The dashed lines represent the theoretical maximum throughput for PCI-X.
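The mmrbc value itself lives in bits 3:2 of the PCI-X Command Register in the adapter's configuration space, encoded as 0 to 3 for 512, 1024, 2048 and 4096 bytes [13]. As a hedged illustration of where the parameter sits, the sketch below walks the PCI capability list and reports the current setting; the configuration-space path is an assumption to be replaced with that of the adapter under test (see lspci), and the program must be run as root to read beyond the first 64 bytes of configuration space.

```c
/* Hedged sketch: report the PCI-X Maximum Memory Read Byte Count (mmrbc)
 * of a NIC by walking the capability list in its PCI configuration space.
 * The device path below is illustrative only.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PCI_CAPABILITY_LIST 0x34   /* offset of the first capability pointer */
#define PCI_CAP_ID_PCIX     0x07   /* capability ID assigned to PCI-X        */

int main(void)
{
    /* hypothetical device - substitute the real bus:device.function */
    const char *cfg = "/sys/bus/pci/devices/0000:02:01.0/config";
    int fd = open(cfg, O_RDONLY);
    if (fd < 0) { perror(cfg); return 1; }

    uint8_t ptr = 0;
    pread(fd, &ptr, 1, PCI_CAPABILITY_LIST);
    while (ptr) {
        uint8_t id = 0, next = 0;
        pread(fd, &id,   1, ptr);          /* capability ID            */
        pread(fd, &next, 1, ptr + 1);      /* next capability offset   */
        if (id == PCI_CAP_ID_PCIX) {
            uint16_t cmd = 0;              /* PCI-X Command Register   */
            pread(fd, &cmd, 2, ptr + 2);   /* little-endian, as on x86 */
            int code = (cmd >> 2) & 0x3;   /* bits 3:2 = mmrbc code    */
            printf("PCI-X mmrbc = %d bytes\n", 512 << code);
            break;
        }
        ptr = next;
    }
    close(fd);
    return 0;
}
```

In the tests described in Section 7.3 the value written to this register was varied; the sketch simply shows that mmrbc is an ordinary configuration-space field of the NIC rather than a host chipset setting.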
Turning to the measured throughput in Figure 7.4: for the DataTAG SuperMicro systems, which have dual 2.2 GHz Xeon CPUs and a 400 MHz memory bus, the throughput plateaus at 4.0 Gbit/s. The purple diamonds show the PCI-X throughput calculated using data from the logic analyser traces. For the HP RX2600 system with dual Itanium 2 CPUs, the throughput climbs smoothly from 3.2 Gbit/s to 5.7 Gbit/s, tracking the predicted rate curve (purple line). Further work is needed to determine whether CPU power, memory speed or the chipset is the limiting factor.

Figure 7.3 Traces of the PCI-X signals showing movement of a 16,080 byte packet from memory to the NIC as the Maximum Memory Read Byte Count is increased from 512 to 4096 bytes; the CSR accesses, a 2048 byte PCI-X data transfer segment, the transfer of the whole 16,114 byte packet and the interrupt and CSR update are indicated.

Figure 7.4 The variation of UDP throughput as a function of mmrbc, together with the theoretical maximum PCI-X throughput and that calculated from the PCI-X activity traces. Left: throughput for the DataTAG SuperMicro dual 2.2 GHz Xeon. Right: throughput for the HP RX2600 dual 64 bit 1 GHz Itanium 2.

8 Related Work

One Gigabit Ethernet NICs have been available for some time and many groups have evaluated their performance in various ways. Competitive performance evaluations [26], zero-copy operating system bypass concepts [24], and detailed NIC tests using TCP [9] have been reported. However, few included an investigation of the role and limitations of the PCI / PCI-X bus [25]. The emergence of the 10 Gigabit Ethernet standard provided a cost effective method to upgrade the performance of network backbones, and router and switch manufacturers are keen to demonstrate performance and inter-operability. Photonics researchers study the performance of 10 gigabit transceivers [22] as well as the stability of Dense Wave Division Multiplexing optical filter components for use in real-time applications [21]. Ten Gigabit Ethernet PCI-X cards [1], [2] are becoming available, and network researchers have studied their performance in moving bulk data with TCP and the effect of MTU size and PCI-X burst size on LANs [16] and transatlantic networks [17]. Other researchers [23] have tested transatlantic UDP and TCP flows.

Recently, other tests on Gigabit Ethernet NICs have been reported [9]. These used TCP/IP and long message sizes, up to 1 Gbyte, for the memory to memory transfers. It is clear that such results will include the effects of the TCP protocol and its implementation, as well as the performance of the NICs and motherboards. By focusing on simple UDP transfers to avoid the effects of reliable network protocols, measuring the individual CPU loads, and visualising the actual operation of the PCI bus with a logic analyser when sending commands and data to and from the NIC, the measurements presented in this paper complement and extend the understanding of the work of other researchers.

9 Conclusion

This paper has reported on some of the tests made on a variety of high performance server-type motherboards with PCI or PCI-X input/output buses and with NICs from different manufacturers. Figure 9.1 gives a table summarising the maximum throughput obtained for the NICs tested.
Details of tests on the older NICs may be found in [15]. The Intel PRO/1000 XT card gave a consistently high throughput of between 910 and 950 Mbit/s, improving when used on the more powerful modern motherboards, and the on-board Intel controller operated at gigabit line speed when one port was used. The older SysKonnect card gave throughputs between 720 and 885 Mbit/s, again improving with more modern motherboards. The SysKonnect card used on the SuperMicro P4DP8-G2 operated at line speed. In these tests, all cards were stable in operation, with several thousand Gbytes of data being transferred. In general, the SysKonnect had a larger CPU load and gave more packet loss, probably due to the impact of not using interrupt coalescence.

Inspection of the signals on the PCI buses and the Gigabit Ethernet media has shown that a PC with an 800 MHz CPU can generate Gigabit Ethernet frames back-to-back at line speed. However, much more processing power is required on the receiving system to prevent packet loss. The design and implementation of the kernel and drivers are very important to ensure high throughput while minimising CPU load; in some cases, an increase in throughput of ~12% was observed. Better use of the multiple CPUs (e.g. by better scheduling) could give an advantage. Timing information derived from the PCI signals and the round trip latency allowed the processing time required for the IP stack and test application to be estimated as 10 µs per node for the send and receive of a packet.

A 32 bit 33 MHz PCI bus has almost 100% occupancy and the tests indicate a maximum throughput of ~670 Mbit/s. A 64 bit 33 MHz PCI bus shows 82% usage on sending when operating with interrupt coalescence and delivering 930 Mbit/s. In both of these cases, involving a disk sub-system operating on the same PCI bus would seriously decrease performance: the data would have to traverse the bus twice and there would be extra control information for the disk controller. It is worth noting that even on a well designed system, the network data must cross the CPU-memory bus three times: once during the DMA to or from the NIC, and twice for the kernel to user space memory-memory copy by the CPU. Thus the achievable bandwidth of the memory sub-system should be at least three times the required network bandwidth.
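As a rough illustration of this requirement (an estimate, not a measurement from these tests): a single 1 Gbit/s stream corresponds to about 125 Mbyte/s, so each end host needs at least 3 x 125 ≈ 375 Mbyte/s of memory bandwidth for networking alone, while a 10 Gbit/s stream needs at least 3 x 1250 = 3750 Mbyte/s. The latter is already comparable to the roughly 3.2 Gbyte/s of a 64 bit, 400 MHz front-side bus such as that of the Xeon systems used here, but sits more comfortably within the 6.4 Gbyte/s quoted for the Itanium system in Figure 9.1.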
Motherboard         | PCI bus              | OS / kernel                        | CPU                        | System bus                         | Chipset              | Alteon AceNIC | SysKonnect SK-9843  | Intel PRO/1000 XT       | Intel PRO/10GbE LR
SuperMicro 370DLE   | PCI 32 bit 33 MHz    | RedHat 7.1, kernel 2.4.14          | PIII 800 MHz               |                                    | ServerWorks III LE   | 674 Mbit/s    | 584 Mbit/s (0-0 µs) |                         |
SuperMicro 370DLE   | PCI 64 bit 66 MHz    | RedHat 7.1, kernel 2.4.14          | PIII 800 MHz               |                                    | ServerWorks III LE   | 930 Mbit/s    | 720 Mbit/s (0-0 µs) | 910 Mbit/s (400-120 µs) |
IBM DAS             | PCI 64 bit 33 MHz    | RedHat 7.1, kernel 2.4.14          | Dual PIII 1 GHz            |                                    | ServerWorks CNB20LE  |               | 790 Mbit/s (0-0 µs) | 930 Mbit/s (400-120 µs) |
SuperMicro P4DP6    | PCI 64 bit 66 MHz    | RedHat 7.2, kernel 2.4.19-SMP      | Dual Xeon 2.2 GHz          | 400 MHz                            | Intel E7500 (Plumas) |               | 885 Mbit/s (0-0 µs) | 950 Mbit/s (70-70 µs)   |
SuperMicro P4DP6    | PCI 64 bit 66 MHz    | RedHat 7.2, 2.4.20 DataTAG altAIMD | Dual Xeon 2.2 GHz          | 400 MHz                            | Intel E7500 (Plumas) |               | 995 Mbit/s (0-0 µs) | 995 Mbit/s (70-70 µs)   |
SuperMicro P4DP8-G2 | PCI 64 bit 66 MHz    | RedHat 7.2, kernel 2.4.19-SMP      | Dual Xeon 2.2 GHz          | 400 MHz                            | Intel E7500 (Plumas) |               | 990 Mbit/s (0-0 µs) | 995 Mbit/s (70-70 µs)   | 4.0 Gbit/s
HP RX2600 Itanium   | PCI-X 64 bit 133 MHz | kernel 2.5.72                      | Dual Itanium 2 1 GHz 64 bit| 200 MHz dual pumped, BW = 6.4 GB/s | HP zx1               |               |                     |                         | 5.7 Gbit/s
Dell PowerEdge 2650 | PCI-X 64 bit 133 MHz | 2.4.20 DataTAG altAIMD             | Dual Xeon 3.06 GHz         | 533 MHz                            |                      |               |                     |                         | 5.4 Gbit/s

Figure 9.1 Table of the maximum throughput measured for 1472 byte UDP packets and the transmit-receive interrupt coalescence times in microseconds (shown in parentheses). Details of the motherboards and Operating Systems studied are also given.

The latency, throughput and PCI-X activity measurements indicate that the Intel 10 Gigabit Ethernet card and driver are well behaved and integrate well with the network sub-system for both 32 and 64 bit architectures. The plots also show that high throughput, 5.4 Gbit/s for the Xeon and 5.7 Gbit/s for the Itanium systems with the mmrbc parameter set to 4096 bytes, can only be achieved by using jumbo frames. With 10 Gigabit Ethernet, when standard 1500 byte MTU frames are used, the throughput is 2 Gbit/s or less, and some packets are lost. It appears that CPU usage is very important and may be the limiting factor, but further work is needed to investigate this and the effect of interrupt coalescence. It is also important to note the large increase in throughput that may be achieved by correctly setting the PCI-X mmrbc parameter.

In the future, it would be interesting to study in detail the emerging 10 Gigabit Ethernet NICs, the impact of 64-bit architectures, the impact of PCI-Express, and the performance of the latest Linux kernels (e.g. version 2.6, which includes driver, kernel and scheduling enhancements). The techniques described in this paper have been applied to the operation of RAID controllers and disk sub-systems; the interaction between the network and storage sub-systems should also be investigated in the future.

10 Acknowledgements

We would like to thank members of the European DataGrid and DataTAG projects, the UK e-Science MB-NG project, and the SLAC network group for their support and help in making test systems available, especially:
Boston Ltd & SuperMicro, for the supply of the SuperMicro motherboards and NICs: Dev Tyagi (SuperMicro), Alex Gili-Ross (Boston)
Brunel University: Peter van Santen
CERN OpenLab: Sverre Jarp and Glen Hisdal
CERN DataTAG: Sylvain Ravot, Olivier Martin and Elise Guyot
SLAC: Les Cottrell, Connie Logg, Gary Buhrmaster
SURFnet & Universiteit van Amsterdam: Pieter de Boer, Cees de Laat, Antony Antony
We would also like to acknowledge funding from both PPARC and EPSRC to work on these projects.
11 References

[1] Intel, PRO/10GbE LR Server Adapter User Guide, http://support.intel.com/support/network/adapter/pro10gbe/pro10gbelr/sb/CS-008392.htm
[2] S2io, S2io Xframe 10 Gigabit Ethernet PCI-X Server/Storage Adapter Data Sheet, http://www.s2io.com/products/XframeDatasheet.pdf
[3] SuperMicro motherboard reference material, http://www.supermicro.com/PRODUCT/MotherBoards/E7500/P4DP6.htm
[4] DataTAG, http://datatag.web.cern.ch/datatag
[5] Intel, Hyper-threading Technology on the Intel Xeon Processor Family for Servers, White Paper, http://www.intel.com/eBusiness/pdf/prod/server/xeon/wp020901.pdf
[6] M. Boosten, R.W. Dobinson and P. van der Stok, "MESH Messaging and ScHeduling for Fine Grain Parallel Processing on Commodity Platforms", in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Las Vegas, Nevada, June 1999.
[7] B. Lowekamp, B. Tierney, L. Cottrell, R. Hughes-Jones, T. Kielmann and M. Swany, A Hierarchy of Network Performance Characteristics for Grid Applications and Services, GGF Network Measurements Working Group draft, work in progress, May 2004, http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-04.pdf
[8] ATLAS HLT/DAQ/DCS Group, ATLAS High-Level Trigger, Data Acquisition and Controls, Technical Design Report CERN/LHCC/2003-022, ATLAS TDR 016, June 2003.
[9] P. Gray and A. Betz, "Performance Evaluation of Copper-based Gigabit Ethernet Interfaces", in Proceedings of the 27th Annual IEEE Conference on Local Computer Networks (LCN'02), Tampa, Florida, November 2002.
[10] R. Hughes-Jones and F. Saka, Investigation of the Performance of 100 Mbit and Gigabit Ethernet Components Using Raw Ethernet Frames, Technical Report ATL-COM-DAQ-2000-014, March 2000, http://www.hep.man.ac.uk/~rich/atlas/atlas_net_note_draft5.pdf
[11] R. Hughes-Jones, UDPmon: a Tool for Investigating Network Performance, http://www.hep.man.ac.uk/~rich/net
[12] Details of the Gigabit Ethernet Probe Card, http://www.hep.man.ac.uk/~scott/projects/atlas/probe/probehome.html
[13] PCI Special Interest Group, PCI-X Addendum to the PCI Local Bus Specification, Revision 1.0a, July 2000.
[14] SLAC network tests on bulk throughput, http://www-iepm.slac.stanford.edu/monitoring/bulk/ ; SC2001 high throughput measurements, http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2001
[15] R. Hughes-Jones, S. Dallison, G. Fairey, P. Clarke and I. Bridge, "Performance Measurements on Gigabit Ethernet NICs and Server Quality Motherboards", in Proceedings of the 1st International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2003), Geneva, Switzerland, February 2003.
[16] W.C. Feng, J. Hurwitz, H. Newman et al., "Optimizing 10-Gigabit Ethernet for Networks of Workstations, Clusters, and Grids: a Case Study", in Proceedings of the 15th International Conference for High Performance Networking and Communications (SC 2003), Phoenix, AZ, November 2003.
[17] J. Hurwitz and W. Feng, "Initial End-to-End Performance Evaluation of 10-Gigabit Ethernet", in Proceedings of the 11th Symposium on High Performance Interconnects (HOT-I 2003), Stanford, California, August 2003.
[18] J.C. Mogul and K.K. Ramakrishnan, "Eliminating receive livelock in an interrupt-driven kernel", in Proceedings of the 1996 Usenix Technical Conference, pages 99-111, 1996.
[19] Intel, IA-32 Intel Architecture Software Developer's Manual Vol 1: Basic Architecture, http://www.intel.com/design/pentium4/manuals/245471.htm
[20] F. Saka, "Ethernet for the ATLAS Second Level Trigger", Chapter 6, Thesis, Physics Department, Royal Holloway College, University of London, 2001, http://www.hep.ucl.ac.uk/~fs/thesis.pdf
[21] R. McCool, "Fibre Developments at Jodrell Bank Observatory", in Proceedings of the 2nd Annual e-VLBI (electronic Very Long Baseline Interferometry) Workshop, Dwingeloo, The Netherlands, May 2003, http://www.jive.nl/evlbi_ws/presentations/mccool.pdf
[22] Variations in 10 Gigabit Ethernet Laser Transmitter Testing Using Reference Receivers, Tektronix Application Note, 2003, http://www.tek.com/Measurement/App_Notes/2G_16691/eng/2GW_16691_0.pdf
[23] C. Meirosu, "The International Grid Testbed: a 10 Gigabit Ethernet success story", talk given at the 1st International Grid Networking Workshop (GNEW 2004), Geneva, Switzerland, March 2004, http://datatag.web.cern.ch/datatag/gnew2004/slides/meirosu.ppt
[24] P. Shivam, P. Wyckoff and D. Panda, "EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing", in Proceedings of the 13th International Conference for High Performance Networking and Communications (SC2001), Denver, CO, November 2001.
[25] M.L. Loeb, A.J. Rindos, W.G. Holland and S.P. Woolet, "Gigabit Ethernet PCI adapter performance", IEEE Network, Vol. 15, No. 2, pp. 42-47, March-April 2001.
[26] R. Kovac, "Getting 'Giggy' with it", Network World Fusion, June 2001, http://www.nwfusion.com/reviews/2001/0618rev.html
[27] T. Shanley, PCI-X System Architecture, Addison-Wesley, 2001.
[28] altAIMD kernel, http://www.hep.ucl.ac.uk/~ytl/tcpip/linux/altaimd/
[29] Web100 Project home page, http://www.web100.org/

Biographies:

Richard Hughes-Jones leads the e-science and Trigger and Data Acquisition development in the Particle Physics group at Manchester University. He has a PhD in Particle Physics and has worked on Data Acquisition and Network projects. He is a member of the Trigger/DAQ group of the ATLAS experiment in the LHC programme, focusing on Gigabit Ethernet and protocol performance. He is also responsible for the high performance, high throughput network investigations in the European Union DataGrid and DataTAG projects, the UK e-Science MB-NG project, and the UK GridPP project. He is secretary of the Particle Physics Network Coordinating Group, which supports networking for UK PPARC researchers. He is a co-chair of the Network Measurements Working Group of the GGF, a co-chair of PFLDnet 2005, and a member of the UKLight Technical Committee.

Peter Clarke is deputy director of the National e-Science Centre at Edinburgh. He gained his PhD in Particle Physics from Oxford University. His experience encompasses both experimental Particle Physics and high performance global computing and networks for science. He worked on the ATLAS experiment at the CERN Large Hadron Collider, and more recently in the computing sector of the LHC project. In the field of networking he has been a principal member of the DataGrid, DataTAG and MB-NG projects, which address Grid and network issues, including QoS, high performance data transport and optical networking. He is a member of the GGF Steering Committee and co-Director of the Data Area within the GGF.

Stephen Dallison has a PhD in Particle Physics and took up a postdoctoral post in the Network Team of the HEP group at the University of Manchester in May 2002. He principally works on the MB-NG project but also contributes to the EU DataTAG project.
As part of the High Throughput working group, Stephen has worked on network performance tests between sites in the UK, Europe and the United States, involving advanced networking hardware and novel transport-layer alternatives to TCP/IP.