File transfer experiments on the CHEETAH network

Xuan Zheng and Malathi Veeraraghavan

July 27, 2005

1. Experimental configuration

We used the following hosts in these experiments: zelda3 in Atlanta, zelda4 at ORNL, wukong at MCNC, and the orbitty compute-0-0 node at NCSU. Table 1 shows the important hardware and software configuration-related and TCP parameters on these hosts.

Table 1  Hardware and software configuration-related and TCP parameters on the experimental hosts

Parameter | compute-0-0 (1) | zelda3 (2) | zelda4 | wukong
CPU | Dual 2.4 GHz Xeon | Dual 2.8 GHz Xeon | Dual 2.8 GHz Xeon | Single 2.8 GHz Xeon
Memory size | 2 GB | 2 GB | 2 GB | 1 GB
Disk system | 3 x 10000 rpm SCSI RAID-0 | 2 x 10000 rpm SCSI RAID-0 | 2 x 10000 rpm SCSI RAID-0 | 2 x 15000 rpm SCSI RAID-0
Kernel version | 2.6.11.7-1.smp.x86.i686.cmo | 2.4.21-4.ELsmp | 2.4.21-4.ELsmp | 2.6.9-1.667smp
File system | xfs | ext3 | ext3 | ext3
rmem_default (bytes) | 8388608 | 8388608 | 8388608 | 8388608
rmem_max (bytes) | 16777216 | 16777216 | 16777216 | 16777216
wmem_default (bytes) | 4096000 | 8388608 | 8388608 | 4096000
wmem_max (bytes) | 16777216 | 16777216 | 16777216 | 16777216
tcp_rmem (bytes; min/default/max) | 4096 / 8388608 / 16777216 | 4096 / 8388608 / 16777216 | 4096 / 8388608 / 16777216 | 4096 / 8388608 / 16777216
tcp_wmem (bytes; min/default/max) | 4096 / 4096000 / 16777216 | 4096 / 8388608 / 16777216 | 4096 / 8388608 / 16777216 | 4096 / 4096000 / 16777216

(1) For the other orbitty compute nodes, a few sample experiments were conducted and the results were quite close, so our experiments focused on compute-0-0 only.
(2) For the same reason, we focused on zelda3 and did not run the full set of experiments on zelda1/zelda2.

The rmem_default, rmem_max, wmem_default and wmem_max variables are in /proc/sys/net/core on Linux systems. They are the default and maximum socket buffer sizes on the receive ('r') and send ('w', for write) sides. The tcp_rmem and tcp_wmem variables, located in /proc/sys/net/ipv4, apply specifically to TCP and override the generic rmem and wmem socket values in the core directory if the former are smaller. The default value can be overridden by an application, as long as the requested value is less than the maximum, or changed system-wide with a command such as sysctl -w net.ipv4.tcp_rmem="min default max", where the middle value is the default receive buffer size in bytes. (A short sketch of how an application requests its own socket buffer sizes is given at the end of this section.)

All machines are equipped with secondary optical GbE NICs. All experimental connections were set up through the CHEETAH network at a circuit rate of 1 Gbps; thus, each end-to-end connection is a dedicated 1 Gbps circuit with no intervening packet switches. Although all orbitty machines have RAID-0 arrays as well as local disks, our file transfers were to file systems located on the RAID-0 array rather than to the local disk. For example, on orbitty compute-0-0, the /scratch partition is on the RAID-0 array.
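To illustrate how an application overrides the default socket buffer sizes discussed above, the following minimal C sketch (not the code of any of the transfer tools we tested) requests 16 MB send and receive buffers with setsockopt() and reads back the values the kernel actually grants. The 16 MB request size is an illustrative choice matching the rmem_max/wmem_max limits in Table 1.

/* Sketch only: request large TCP socket buffers and read back what the
 * kernel grants.  The 16 MB request matches rmem_max/wmem_max in Table 1. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock, req, rcv, snd;
    socklen_t len;

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    req = 16 * 1024 * 1024;   /* requested buffer size in bytes */
    len = sizeof(req);

    /* Requests larger than rmem_max/wmem_max are clamped by the kernel. */
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &req, len) < 0)
        perror("SO_RCVBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &req, len) < 0)
        perror("SO_SNDBUF");

    /* Linux reports roughly twice the requested value here, because the
     * same counter also accounts for internal bookkeeping overhead. */
    rcv = snd = 0;
    len = sizeof(rcv);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
    len = sizeof(snd);
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &snd, &len);
    printf("granted: SO_RCVBUF=%d bytes, SO_SNDBUF=%d bytes\n", rcv, snd);

    close(sock);
    return 0;
}

Buffer sizes set this way apply only to that socket; the /proc (sysctl) settings in Table 1 change the system-wide defaults and the upper limits to which such requests are clamped.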
2. Experiments

Large file transfers were carried out between all pairs of machines, in both directions, using several application tools. RTTs between each pair of machines were measured using ping; the measured values ranged from about 1 ms to 32 ms, with an RTT of 13.7 ms between zelda4 and wukong. The iperf measurements shown in Table 2 are memory-to-memory transfers using TCP or UDP. For disk-to-disk transfers, we used a fairly large file, 1.3 GB in size (named test.iso), and transferred it using FTP, SFTP (SecureFX), SABUL, Hurricane, BBCP, and FRTPv1. The results are shown in Table 2; all throughput values are in Mbps. Each experiment was repeated at least five times and the mean value is listed in Table 2. Due to time constraints, we have not yet gathered other statistics, such as the standard deviation and confidence intervals, to verify that this was a sufficient number of test runs. We do not, however, expect great variability, because the circuit is dedicated and the computers were also dedicated to the file transfer during the tests (no other significant tasks were running on these hosts at the time of the experiments).

In all TCP transfers (FTP, SFTP and BBCP), we started each experiment by setting the TCP buffer size (sender and receiver) equal to the bandwidth-delay product. For example, for the TCP transfers between zelda4 and wukong, we set the default TCP buffer size to 1 Gbps x 13.7 ms = 13.7 Mbit = 1712500 bytes. We then increased the TCP buffer size by approximately 1 MB in each round until we observed the optimal TCP performance; the optimal TCP buffer sizes are shown in Table 1. It is as yet unclear to us why the bandwidth-delay-product value of 1712500 bytes, which the TCP tuning guide at http://www-didc.lbl.gov/TCP-tuning/TCP-tuning.html suggests as the buffer size, was insufficient to achieve optimal performance. See the next steps. (A small sketch of this bandwidth-delay-product calculation and buffer-size sweep is given at the end of this section.)

Table 2  Experimental results (all throughput values in Mbps; each row gives one sender-to-receiver direction; the iperf columns are memory-to-memory transfers, the remaining columns are disk-to-disk transfers of the 1.3 GB test file)

Sender -> receiver | iperf TCP | iperf UDP | FTP | SFTP | SABUL | Hurricane | BBCP | FRTPv1 (3)
zelda4 -> compute-0-0 | 938 | 888 | 752 | 25 | 640 | 524 | N/A | 629
compute-0-0 -> zelda4 | 924 | 913 | 552 | 18.8 | 770 | 545 | 500 | 600
zelda4 -> zelda3 | 938 | 957 | 585 | 34.9 | 488 | 537 | 607 | 853
zelda3 -> zelda4 | 938 | 957 | 585 | 35.1 | 624 | 422 | 657 | 610
zelda4 -> wukong | 931 | 646 | 702 | 17.3 | 470 | 456 | N/A | 510
wukong -> zelda4 | 900 | 830 | 458 | 17.9 | 404 | 368 | N/A | 664
zelda3 -> compute-0-0 | 933 | 800 | 878 | 41 | 638 | 530 | N/A | 644
compute-0-0 -> zelda3 | 934 | 913 | 479 | 18.7 | 848 | 542 | 611 | 787
zelda3 -> wukong | 934 | 653 | 702 | 26.3 | 463 | 264 | 425 | 515
wukong -> zelda3 | N/A | 727 | 722* | 24.3 | 610 | 282 | 379 | 620
wukong -> compute-0-0 | N/A | 750 | N/A | 26.4 | 520 | N/A | 87 | 368
compute-0-0 -> wukong | 933 | 645 | 620 | 130 | 479 | N/A | 513 | 388

(3) The FRTPv1 used here is our modified SABUL implementation in which rate control is disabled and the sender busy-waits to hold a fixed sending rate; it is a user-space implementation. To obtain the optimal performance, we needed to experiment with the sending rate: losses are incurred at this rate due to receive-buffer overflows, but it is the rate at which goodput is maximum.
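As a concrete illustration of the buffer-size sweep described in this section, the following small C sketch computes the bandwidth-delay product for an assumed 1 Gbps circuit and the 13.7 ms zelda4-wukong RTT, and prints candidate buffer sizes stepped in roughly 1 MB increments. The step size and the number of rounds are illustrative assumptions, not a record of the exact sweep used in each experiment.

/* Sketch: bandwidth-delay product (BDP) and a simple buffer-size sweep.
 * For a 1 Gbps circuit and 13.7 ms RTT:
 *   BDP = 1e9 bit/s * 0.0137 s = 1.37e7 bit = 1712500 bytes,
 * the starting value used for the zelda4-wukong TCP transfers. */
#include <stdio.h>

static long bdp_bytes(double rate_bps, double rtt_ms)
{
    return (long)(rate_bps * (rtt_ms / 1000.0) / 8.0);
}

int main(void)
{
    const double rate_bps = 1e9;     /* 1 Gbps circuit rate */
    const double rtt_ms = 13.7;      /* zelda4-wukong RTT from ping */
    const long step = 1048576;       /* ~1 MB increase per round (assumed) */
    const int rounds = 8;            /* hypothetical number of rounds */
    long buf;
    int i;

    buf = bdp_bytes(rate_bps, rtt_ms);
    printf("BDP = %ld bytes\n", buf);    /* 1712500 for these inputs */

    /* Candidate sender/receiver buffer sizes for the sweep. */
    for (i = 0; i < rounds; i++) {
        printf("round %d: TCP buffer = %ld bytes\n", i + 1, buf);
        buf += step;
    }
    return 0;
}

Each candidate size would then be applied (via the sysctl settings of Table 1 or a setsockopt() call as sketched at the end of Section 1) and the transfer rerun until the throughput stops improving.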
3. Observations

We make a few preliminary observations from the above data:

1. FTP over TCP appears to be the best among the different file-transfer applications we tested. The best result for a disk-to-disk transfer (~870 Mbps) was seen in the transfer from zelda3 to orbitty compute-0-0. Further, for the disk-to-disk transfer from ORNL to the NCSU Centaur Lab, FTP over TCP gave the best throughput, 752 Mbps.

2. Linux kernel 2.6 seems to have a better TCP implementation than Linux kernel 2.4. The web site http://www-didc.lbl.gov/TCP-tuning/linux.html notes that Linux kernel 2.6 implements BIC-TCP as its default congestion-control algorithm [1]. When a kernel 2.6 host is the receiver, TCP works very well even with a small TCP receive buffer, whereas when a kernel 2.4 host is the receiver, we need larger TCP send and receive buffers to obtain optimal throughput. This difference could potentially explain the difference in throughput between the two directions of the zelda4-to-compute-0-0 FTP transfer.

3. The RAID-0 array on compute-0-0 comprises three 10000 rpm SCSI disks instead of the two on the zelda hosts, and compute-0-0 uses the XFS file system instead of ext3. This could be another reason why we obtained better performance when writing the file from the zelda hosts to compute-0-0 than when writing the file from compute-0-0 to the zelda hosts.

4. We speculate that the poor performance of SFTP is caused by its use of encryption for security and integrity. We plan to run a test to verify this.

5. At the large file size used, UDP-based, user-space protocol implementations such as SABUL, Hurricane and FRTPv1 do not outperform FTP over TCP; in fact, they under-perform it. The kernel-based implementation of TCP is likely a reason for this observation, though this needs to be verified. At smaller file sizes, UDP-based implementations are likely to outperform FTP over TCP; this again needs to be verified.

6. Our experience with BBCP shows it to be unstable. [What exactly does this mean?]

7. Our host wukong appears to have problems when used as the sender in a large TCP transfer. The software often stalls toward the end of the transfer, which makes the total throughput very poor. We need to determine the reason for this behavior.

8. The optimal TCP buffer sizes in Table 1 work well in most cases, but not in all. We believe that we should be able to set the TCP buffer values predictably in the CHEETAH environment when the hosts are not multitasking. Further work is required to automate this computation of the optimal TCP buffer values for any pair of non-multitasking hosts on the CHEETAH network.

4. Next steps

1. Upgrade zelda4/5 to kernel 2.6, add a third disk, and repeat the experiments between zelda4/5 and the orbitty compute nodes. (xuan)
2. Test file transfers between zelda5 and orbitty compute-0-0 through zelda4 acting as a router. The router function of zelda4 has been successfully tested between zelda5 and the orbitty compute nodes; further experiments are needed to characterize the transfer performance on this path. From the few experiments conducted so far, FTP-over-TCP performance is quite close to the throughput values we obtained between zelda4 and orbitty compute-0-0. (xuan)
3. Given the predictability of the CHEETAH network, if the hosts involved in a file transfer are not multitasking, we should be able to model, and hence automate, the computation of the TCP parameter values needed for optimal throughput. To achieve this goal, we need to understand the Linux TCP implementation in more depth and determine how to set all TCP parameters, not just the TCP buffer size. Further work is required to automate the selection of the optimal TCP buffer values for any pair of non-multitasking hosts on the CHEETAH network. (anant)
4. Disable encryption for data security and integrity in SecureFX and obtain throughput values. (xuan)
5. Study the impact of multitasking on FTP over TCP as well as on UDP-based, user-space protocols such as Hurricane, SABUL, and FRTPv1. (anant)
6. Verify whether UDP-based protocol implementations, even user-space ones, outperform FTP over TCP for small files. (anant)
7. Understand the cause of wukong's behavior when it is the sender in a large file transfer. (xuan)
8. Download and install IOZONE on the zelda hosts. (xuan)
9. Determine why "cat /proc/cpuinfo" shows four CPUs on the zelda hosts but only two CPUs on wukong.

5. References

[1] Lisong Xu, Khaled Harfoush, and Injong Rhee, "Binary Increase Congestion Control for Fast, Long Distance Networks," in Proceedings of IEEE INFOCOM, March 2004.