High Speed Physics Data Transfers using UltraLight Julian Bunn (thanks to Yang Xia and others for material in this talk) UltraLight Collaboration Meeting October 2005 Disk to Disk (Newisys) 2004 System VendorNewisys 4300 AMD Opteron Enterprise Server with 3 AMD-8131 CPUQuad Opteron 848 2.2GHz Memory16GB PC2700 DDR ECC Network Interface S2io 10GE in a 64-bit/133MHz PCIX slot. Raid Controller 3 x Supermicro Marvell SATA controller Hard Drives 24 x 250GB WDC 7200rpm SATA OSWin2K3 AMD64, Service Pack 1, v.1185 550 MBytes/sec Tests with rootd • Physics analysis files are typically ROOT format • Would like to serve these files over the network as quickly as possible. • At least three possibilities: – Use rootd – Use Clarens – Use Web server • Use of rootd is simple: – On client, use “123.456.789.012:/dir/root.file” – On server, run “rootd” rootd On server: [root@dhcp-116-157 rootdata]# ./rootd -p 5000 -f -noauth main: running in foreground mode: sending output to stderr ROOTD_PORT=5000 On client, add following to .rootrc (corrects issue in current root): XNet.ConnectDomainAllowRE: * +Plugin.TFile: ^root: TNetFile Core "TNetFile(const char*,Option_t*,const char*,Int_t,Int_t)" In the C code, access the files like this: TChain* ch = new TChain("Analysis"); ch->Add("root://10.1.1.1:5000/../raid/rootdata/zpr200gev.mumu.root"); ch->Add("root://10.1.1.1:5000/../raid/rootdata/zpr500gev.mumu.root"); Rootd (measure performance) Compression makes a big difference: Root file is 282 MBytes, but Root object data amounts to 655 MBytes! Thus the physics data rate to application is twice the reported network rate (for this test ~22 MBytes/sec) Application: Int_t nbytes = 0, nb = 0; TStopwatch s; for (Long64_t jentry=0; jentry<nentries;jentry++){ Long64_t ientry = LoadTree(jentry); Real time 0:00:14, CP time 12.790 655167999 Bytes if (ientry < 0) break; nb = fChain->GetEntry(jentry); nbytes += nb; } s.Stop(); Rootd: rd=2.81415e+08, wr=0, rx=478531, s.Print(); Long64_t fileBytes = gFile->GetBytesRead(); tx=2.81671e+08 Double_t mbSec = (Double_t) (fileBytes/1024/1024); mbSec /= s.RealTime(); cout << nbytes << " Bytes (uncompressed) " << fileBytes << " Bytes (in file) " << mbSec << " MBytes/sec" << endl; Tests with Clarens/Root • Using Dimitri’s analysis (Root files containing Higgs -> muon data at various energies) • Root client requests objects from files of size a few hundred MBytes • In this analysis, not all the objects from the file are read, so care in computing the network data rate is required • Clarens serves data to Root client at approx. 60 MBytes/sec • Compare with using wget pull of Root file from Clarens/Apache: 125 MBytes/sec cold cache, 258 MBytes/sec warm cache Tests with gridftp • Gridftp may work well, if you can manage to install it and work with security constraints • Michael Thomas experience: – Installed on laptop successfully, but needed Grid certificate for host, and reverse DNS lookup. Didn’t have, so couldn’t use – Installed on osg-discovery.caltech.edu successfully, but could not use for testing since production machine – Attempted install on UltraLight dual core Opterons at Caltech, but no host certificates, no reverse lookup, no support for x86_64 • Summary: installation/deployment constraints severely restrict usefulness of gridftp Tests with bbftp • • • • • bbftp supported by IN2P3 Time difference makes support less interactive than for bbcp Operates with an ftp-like client/server setup Tested bbftp v3.2.0 between LAN Opterons Example localhost copy: bbftp -e 'put /tmp/julian/example.session /tmp/julian/junk.dat' localhost -u root • Some problems: • • • Segmentation faults when using IP numbers rather than names … x86_64 issue? Transfer fails with reported routing error, but routes are OK By default, files are copied to temporary location on target machine, then copied to correct location. This is not what is wanted when targetting a high speed RAID array! [Can be avoided with “setoption notmpfile”] Sending files to /dev/null did not seem to work: • >> USER root PASS << bbftpd version 3.2.0 : OK >> COMMAND : setoption notmpfile << OK >> COMMAND : put OneGB.dat /dev/null BBFTP-ERROR-00100 : Disk quota excedeed or No Space left on device << Disk quota excedeed or No Space left on device bbcp • • • • http://www.slac.stanford.edu/~abh/bbcp/ Developed as tool for BaBar file transfers The work of Andy Hanushevsky (SLAC) Peer to Peer architecture – third party transfers • Simple to install: just need bbcp executable in path on remote machine(s) • Works with all standard methods of authentication Tests with bbcp The goal is to transfer data files at 10 Gbits/sec in the WAN We use Opteron systems with two CPUs each dual core, 8GB or 16GB RAM, s2io 10Gbit NICs, RHEL 2.6 kernel We use a stepwise approach, starting with the easiest data transfers: 1) Memory to bit bucket (/dev/zero to /dev/null) 2) Ramdisk to bit bucket (/mnt/rd to /dev/null) 3) Ramdisk to Ramdisk (/mnt/rd to /mnt/rd) 4) Disk to bit bucket (/disk/file to /dev/null) 5) Disk to Ramdisk 6) Disk to Disk bbcp LAN Rates • Goal: bbcp rates should match or exceed iperf rates • Single bbcp process: a) 1 stream max rate = 523 MBytes/sec b) 2 streams max rate = 522 MBytes/sec c) 4 streams max rate = 473 MBytes/sec d) 8 streams max rate = 460 MBytes/sec e) 16 streams max rate = 440 MBytes/sec f) 32 streams max rate = 417 MBytes/sec • 3 simultaneous bbcp processes: P 1) bbcp: At 050922 08:58:14 copy 99% complete; 348432.0 KB/s P 2) bbcp: At 050922 08:58:15 copy 54% complete; 192539.5 KB/s P 3) bbcp: At 050922 08:58:15 copy 30% complete; 194359.9 KB/s Aggregate utlization of 735 MByte/sec (~6 Gbits/sec). Conclusion: bbcp can match iperf in the LAN. Use one or two streams, and several bbcp processes (if you can) bbcp WAN rates Memory To Memory (sender has FAST & Web100) 785 MBytes/sec Performance Killers 1) Make sure you're using the right interface! Check with ifconfig 2) Do a cat /proc/sys/net/ipv4/tcp_rmem and make sure the numbers are big, like 1610612736 1610612736 1610612736 3) Tune the interface if not, using: /usr/local/src/s2io//s2io_perf.sh 4) Flush existing routes # sysctl -w net.ipv4.route.flush=1 5) Sometimes a route has to be configured manually, and added to /etc/sysconfig/networks-scripts/route-ethX for the future 6) Sometimes commands like sysctl and ifconfig are not in the PATH 7) Check route is OK with traceroute in both directions 8) Check machine reachable with ping 9) Sometimes 10Gbit adapter does not have 9000 MTU ... But instead has default of 1500 10) If in doubt, reboot 11) If still in doubt, rebuild your application, and goto 10) Ramdisks & SHC • % • • Avoid disk I/O by using ramdisks – it works mount -t ramfs none /mnt/rd Allows physics data files to be placed in system RAM Finesses the new Bandwidth Challenge “rule” disallowing iperf/artificial data • In CACR’s new “Shared Heterogeneous Cluster” (>80 dual Opteron HP nodes) we intend to populate ramdisks on all nodes with Root files, and transfer them using bbcp to nodes in the Caltech booth at SC2005 • The SHC is connected to the WAN via a Black Diamond switch, with two bonded 10Gbit links to Caltech’s UltraLight Force10. SC2005 Bandwidth Challenge The Caltech-CERN-Florida-FNAL-Michigan-Manchester-SLAC entry will demonstrate high speed transfers of physics data between host labs and collaborating institutes in the USA and worldwide. Caltech and FNAL are major participants in the CMS collaboration at CERN’s Large Hadron Collider (LHC). SLAC is the host of the BaBar collaboration. Using state of the art WAN infrastructure and Grid-based Web Services based on the LHC Tiered Architecture, our demonstration will show real-time particle event analysis requiring transfers of Terabyte-scale datasets. We propose to saturate at least fifteen lambdas at Seattle, full duplex (potentially over 300 Gbps of scientific data). The lambdas will carry traffic between SLAC, Caltech and other partner Grid Service sites including UKlight, UERJ, FNAL and AARnet. We will monitor the WAN performance using Caltech's MonALISA agentbased system. The analysis software will use a suite of Grid-enabled Analysis tools developed at Caltech and University of Florida. There will be a realistic mixture of streams: those due to the transfer of the TeraByte event datasets, and those due to a set of background flows of varied character absorbing the remaining capacity. The intention is to simulate the environment in which distributed physics analysis will be carried out at the LHC. We expect to easily beat our SC2004 record of ~100Gbits/sec (roughly equivalent to downloading 1000 DVDs in less than an hour). Summary • Seeking fastest ways of moving physics data in the 10 Gbps WAN • Disk to Disk WAN record held by Newisys machines in 2004: >500MBytes/sec • Root files can be served to Root clients at decent rates (> 60Mbytes/sec). Root compression helps by factor >2 • Root files can be served by rootd, xrootd, Clarens, and vanilla Web servers • For file transfers, bbftp and gridftp hard to deploy and test • bbcp easy to deploy, well supported, and can match iperf speeds in the LAN (~7Gbits/sec) and the WAN (~6.3Gbits/sec) for memory to memory data transfers • Optimistically, bbcp should be able to copy disk resident files in the WAN at the same speeds, given: – Powerful servers – Fast disks • Although we are not there yet, we are aiming to be by SC2005!