File transfer experiments on the CHEETAH network
Xuan Zheng and Malathi Veeraraghavan
July 27, 2005
1. Experimental configuration
We used the following hosts in these experiments: zelda3 in Atlanta, zelda4 in ORNL, wukong in MCNC,
and orbitty compute-0-0 node in NCSU. Table 1 shows the important hardware and software
configuration-related and TCP parameters on these hosts.
Table 1 Hardware and software configuration-related and TCP parameters on experimental hosts
Parameter               compute-0-0 [1]              zelda3 [2]               zelda4                   wukong
CPU                     Dual 2.4GHz Xeon             Dual 2.8GHz Xeon         Dual 2.8GHz Xeon         Single 2.8GHz Xeon
memory size             2GB                          2GB                      2GB                      1GB
disk system             3x10000rpm SCSI RAID-0       2x10000rpm SCSI RAID-0   2x10000rpm SCSI RAID-0   2x15000rpm SCSI RAID-0
kernel version          2.6.11.7-1.smp.x86.i686.cmo  2.4.21-4.ELsmp           2.4.21-4.ELsmp           2.6.9-1.667smp
file system             xfs                          ext3                     ext3                     ext3
rmem_default (bytes)    8388608                      8388608                  8388608                  8388608
rmem_max (bytes)        16777216                     16777216                 16777216                 16777216
wmem_default (bytes)    4096000                      8388608                  8388608                  4096000
wmem_max (bytes)        16777216                     16777216                 16777216                 16777216
tcp_rmem (bytes;        4096 8388608 16777216        4096 8388608 16777216    4096 8388608 16777216    4096 8388608 16777216
 min, default, max)
tcp_wmem (bytes;        4096 4096000 16777216        4096 8388608 16777216    4096 8388608 16777216    4096 4096000 16777216
 min, default, max)
The rmem_default, rmem_max, wmem_default and wmem_max variables live in /proc/sys/net/core on
Linux systems. They set the default and maximum socket buffer sizes on the receive ('r') and write ('w')
sides. The tcp_rmem and tcp_wmem variables are specific to TCP and take precedence over the generic core
socket values for TCP connections; they are found in the /proc/sys/net/ipv4 directory. An application can
override the default value, as long as the value it requests is less than the maximum; an administrator can
change the system-wide settings with, for example, sysctl -w net.ipv4.tcp_rmem="min default max", where
the three values are the number of bytes desired for the receive memory size.
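The per-application override described above can be seen with a short Python sketch (a hypothetical illustration, not part of our experiments): the application requests a receive buffer with setsockopt(), and the kernel grants a size capped by rmem_max (on Linux the granted value is also doubled internally to cover bookkeeping overhead).

```python
import socket

# Request a 64 KB receive buffer; the kernel caps the request at
# net.core.rmem_max (Linux also doubles the granted value to cover
# internal bookkeeping overhead).
requested = 64 * 1024

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# Read back the effective buffer size actually granted by the kernel.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective)
sock.close()
```

Reading the value back with getsockopt() is the only reliable way to see what the kernel actually granted, since a request above the configured maximum is silently clamped.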
[1] For the other compute nodes, a few sample experiments were conducted and the results were quite close,
so our experiments focused on compute-0-0 only.
[2] For the same reason as for the orbitty compute nodes, we focused on zelda3 and ignored zelda1/2.
All machines are equipped with secondary optical GbE NICs. All experimental connections were set up
through the CHEETAH network at a circuit rate of 1Gbps. Thus, the end-to-end connection is a dedicated
1Gbps circuit with no intervening packet switches.
As all orbitty machines have both a RAID-0 array and local disks, we note that our file transfers were to
file systems located on the RAID-0 array rather than to local disk. For example, on orbitty compute-0-0, the
/scratch partition is on the RAID-0 array.
2. Experiments
Large file transfers were carried out between all pairs of machines in both directions using different
application tools. RTTs between each pair of machines were measured using ping. The iperf measurements
shown in Table 2 are for memory-to-memory transfers using TCP or UDP. For disk-to-disk transfers, we used a
fairly large file, 1.3GB in size (named test.iso), and transferred this file using FTP, SFTP (SecureFX), SABUL,
Hurricane, BBCP, and FRTPv1. We show the results in Table 2. All throughput results are in Mbps.
Each experiment was repeated at least five times and the mean value is listed in Table 2. Due to time
constraints, we have not yet gathered other statistics, such as standard deviations and confidence intervals, to
verify whether this was a sufficient number of test runs. We do not, however, expect great variability because the
circuit is dedicated and the computers were also dedicated to the file transfer during the tests (no other
significant tasks were running on these hosts at the time of the experiments).
In all TCP transfers (FTP, SFTP and BBCP), we started the experiments by setting the TCP buffer size
(sender and receiver) equal to the bandwidth-delay product. For example, for the TCP transfers between zelda4
and wukong, we set the default TCP buffer size to 1 Gbps x 13.7 ms = 13.7 Mbit = 1,712,500 bytes. We then
increased the TCP buffer size by approximately 1MB in each round until we observed the optimal TCP
performance. The optimal TCP buffer sizes are shown in Table 1. It is as yet unclear to us why the
bandwidth-delay-product value of 1,712,500 bytes was insufficient to achieve optimal performance, given the
TCP tuning guide at http://www-didc.lbl.gov/TCP-tuning/TCP-tuning.html. See next steps.
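The starting buffer size above is just the bandwidth-delay product of the dedicated circuit; the arithmetic for the zelda4-wukong path can be checked as follows (a worked check only, using integer microseconds to avoid floating-point rounding):

```python
# Bandwidth-delay product for the zelda4-wukong path:
# a 1 Gbps circuit with a measured 13.7 ms RTT.
rate_bps = 1_000_000_000           # dedicated circuit rate, bits/s
rtt_us = 13_700                    # round-trip time, microseconds

bdp_bits = rate_bps * rtt_us // 1_000_000   # bits "in flight" on the path
bdp_bytes = bdp_bits // 8                   # 13.7 Mbit = 1,712,500 bytes
print(bdp_bytes)                            # -> 1712500
```

A TCP sender needs at least this many bytes of unacknowledged data buffered to keep a 1 Gbps pipe full over a 13.7 ms round trip, which is why the buffer search in these experiments starts from this value.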
Table 2 Experimental results
(The iperf columns are memory-to-memory transfers; the remaining columns are disk-to-disk transfers of the
1.3GB test.iso file. All throughputs are in Mbps. Measured RTTs (ms): 8.75, 1, 32 and 13.7; the
zelda4-wukong RTT is 13.7 ms.)

Sender -> receiver        iperf TCP  iperf UDP   FTP    SFTP   SABUL  Hurricane  BBCP  FRTPv1 [3]
zelda4 -> compute-0-0        938        888       752     25     640     524      N/A     629
compute-0-0 -> zelda4        924        913       552     18.8   770     545      500     600
zelda4 -> zelda3             938        957       585     34.9   488     537      607     853
zelda3 -> zelda4             938        957       585     35.1   624     422      657     610
zelda4 -> wukong             931        646       702     17.3   470     456      N/A     510
wukong -> zelda4             900        830       458     17.9   404     368      N/A     664
zelda3 -> compute-0-0        933        800       878     41     638     530      N/A     644
compute-0-0 -> zelda3        934        913       479     18.7   848     542      611     787
zelda3 -> wukong             934        653       702     26.3   463     264      425     515
wukong -> zelda3             N/A        727       722*    24.3   610     282      379     620
wukong -> compute-0-0        N/A        750       N/A     26.4   520     N/A      87      368
compute-0-0 -> wukong        933        645       620    130     479     N/A      513     388
[3] FRTPv1 as used here is our modified SABUL implementation in which rate control is disabled and the sender
busy-waits to fix the sending rate. It is a user-space implementation. To obtain the optimal performance, we
needed to experiment with the sending rate: losses are incurred at this rate due to receive-buffer overflows,
but it is the rate at which goodput is maximum.
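The busy-wait pacing idea described in this footnote can be sketched in a few lines. This is a hypothetical illustration only (the function name and parameters are ours, and there is no loss recovery), not the actual FRTPv1/SABUL code:

```python
import socket
import time

def paced_send(chunks, rate_bps, dest):
    """Send UDP datagrams at a fixed rate, busy-waiting between sends.

    Illustrative sketch of busy-wait pacing: the sender spins on a
    high-resolution clock instead of sleeping, so the inter-packet
    gap is not subject to coarse OS timer granularity.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    next_send = time.perf_counter()
    for chunk in chunks:
        # Inter-packet gap that realizes the target rate for this chunk.
        gap = len(chunk) * 8 / rate_bps
        while time.perf_counter() < next_send:
            pass  # busy-wait: burns a CPU, but holds the send rate steady
        sock.sendto(chunk, dest)
        next_send += gap
    sock.close()
```

Busy-waiting pins one CPU at 100%, which is acceptable on the dedicated, non-multitasking hosts used in these experiments; the trade-off is precise pacing without kernel timer jitter.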
3. Observations
We make a few preliminary observations from the above data:
1. FTP over TCP appears to be the best among the different file-transfer applications we tested in our
experiments. The best result for a disk-to-disk transfer (878 Mbps) was seen in the transfer from
zelda3 to orbitty compute-0-0. Further, for the disk-to-disk transfer from ORNL to the NCSU Centaur Lab,
FTP over TCP again gave the best throughput, 752 Mbps.
2. Linux kernel 2.6 seems to have a better TCP implementation than Linux kernel 2.4. The web site
http://www-didc.lbl.gov/TCP-tuning/linux.html notes that Linux kernel 2.6 implements BIC-TCP by
default [ref]. When a kernel 2.6 host is the receiver, TCP works very well even with a small TCP
receive buffer, whereas when a kernel 2.4 host is the receiver, we need larger TCP send and receive
buffers to obtain optimal throughput. This difference could potentially explain the throughput
asymmetry between the two directions of the zelda4 to compute-0-0 FTP transfer.
3. We realized that the RAID-0 array on compute-0-0 includes three 10,000rpm SCSI disks versus two on
the zelda hosts. Further, compute-0-0 uses the XFS file system instead of the ext3 file system on the
zelda hosts. This could be another reason we saw better performance when writing the file from the
zelda hosts to compute-0-0 than when writing the file from compute-0-0 to the zelda hosts.
4. We speculate that the poor performance of SFTP is due to its use of encryption for security and
integrity. We plan to run a test to verify this.
5. At the large file size used, UDP-based user-space protocols such as SABUL, Hurricane and FRTPv1.0
do not outperform FTP over TCP; in fact, they under-perform it. The kernel-based implementation of
TCP is a likely reason for this observation, though this needs to be verified. At smaller file sizes,
UDP-based implementations are likely to outperform FTP over TCP. This again needs to be verified.
6. Our experience with BBCP shows it to be unstable; several BBCP transfers could not be completed
(the N/A entries in Table 2).
7. Our host wukong appears to have problems when used as a sender in a large TCP transfer. The
software often stalls toward the end of the transfer, which makes the total throughput very poor. We
need to determine the reason for this behavior.
8. The optimal TCP buffer size in Table 1 works well in most cases, but not in all cases. We believe that
we should be able to set the values of the TCP buffers predictably in the CHEETAH environment
when the hosts are not multitasking. Further work is required to automate this computation of the
optimal TCP buffer values for any pair of non-multitasking hosts on the CHEETAH network.
4. Next steps
1. Upgrade zelda4/5 to kernel 2.6 and add a third disk, then repeat the experiments between zelda4/5
and the orbitty compute nodes. (xuan)
2. Test file transfers between zelda5 and orbitty compute-0-0 through zelda4 (acting as a router). The
router function of zelda4 has been successfully tested between zelda5 and the orbitty compute nodes;
further experiments are needed to characterize the transfer performance between zelda5 and the orbitty
compute nodes. From the few experiments conducted so far, FTP-over-TCP performance is quite close to
the throughput we obtained between zelda4 and orbitty compute-0-0. (xuan)
3. Given the predictability of the CHEETAH network, if the hosts involved in a file transfer are not
multitasking, we should be able to model, and hence automate, the computation of the TCP parameter
values needed for optimal throughput. To achieve this goal, we need to understand the Linux TCP
implementation in more depth and determine how to set all TCP parameters (not just the TCP buffer
size) for any pair of non-multitasking hosts on the CHEETAH network. (anant)
4. Disable encryption for data security and integrity in SecureFX and obtain throughput values. (xuan)
5. Study the impact of multitasking on FTP over TCP as well as on UDP-based user-space protocols
such as Hurricane, SABUL, and FRTPv1.0. (anant)
6. Verify whether UDP-based protocol implementations, even if user-space based, outperform FTP over
TCP for small files. (anant)
7. Understand the cause of wukong's behavior when it is a sender in a large file transfer. (xuan)
8. Download and install IOZONE on the zelda hosts. (xuan)
9. Determine why "cat /proc/cpuinfo" shows four CPUs on the zelda hosts and two CPUs on wukong.
5. References
Lisong Xu, Khaled Harfoush, and Injong Rhee, "Binary Increase Congestion Control for Fast, Long
Distance Networks," in Proc. IEEE INFOCOM, March 2004.