A Study of Applications for Optical Circuit-Switched Networks
Xiuduan Fang
May 1, 2006
Supported by NSF ITR-0312376, NSF EIN-0335190, and DOE DE-FG02-04ER25640 grants

Outline
- Introduction
- CHEETAH Background
  ― CHEETAH concept and network
  ― CHEETAH end-host software
- Analytical Models of GMPLS Networks
- Application (App) I: Web Transfer App
- App II: Parallel File Transfers
- Summary and Conclusions

Introduction
- Many optical connection-oriented (CO) testbeds
  ― Use Generalized Multiprotocol Label Switching (GMPLS)
  ― E.g., CANARIE's CA*net 4, UKLight, and CHEETAH
  ― Primarily designed for e-Science apps: immediate request, call blocking
- Motivation: extend these GMPLS networks to millions of users
- Problem statement
  ― What apps are well served by GMPLS networks?
  ― How should apps be designed to use GMPLS networks efficiently?

Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH)
- Designed as an "add-on" service to the Internet; leverages the services of the Internet
- [Figure: CHEETAH concept. Each end host has two NICs: NIC I connects through IP routers to the packet-switched Internet, and NIC II connects through an Ethernet-SONET gateway to the optical circuit-switched CHEETAH network.]

CHEETAH Network
- [Figure: CHEETAH network topology. Sycamore SN16000 switches (including those at MCNC, NC and Atlanta, GA) are interconnected by an OC-192 lambda; end hosts and sites attach through direct fibers, VLANs, and MPLS tunnels. Named in the figure: HOPI Force10 switches (NYC, WASH), UVa mvstu6, an NCSU M20, a CUNY Foundry switch and CUNY host, the Abilene T640 (WASH), Centuar, a FastIron FESX448, zelda1 through zelda5, a Catalyst 4948, a Catalyst 7600 at MCNC, wukong, and ORNL, TN.]

CHEETAH End-host Software
- [Figure: each end host runs an application over TCP/IP on NIC 1 toward the Internet and over C-TCP on NIC 2 toward the CHEETAH network; the CHEETAH end-host software consists of an OCS client, a routing-decision module, and an RSVP-TE client.]
- OCS: Optical Connectivity Service; RD: routing decision; RSVP-TE: ReSerVation Protocol-Traffic Engineering; C-TCP: Circuit-TCP

Outline (next section: Analytical Models of GMPLS Networks)

Analytical Models of GMPLS Networks
- Problem: what apps are suitable for GMPLS networks?
- Measures of suitability
  ― Call-blocking probability, Pb
  ― Link utilization, U
- App properties
  ― Per-circuit BW
  ― Call-holding time, 1/μ
- Assumptions
  ― Call arrival rate, λ (Poisson process)
  ― Single link
  ― Single class: all apps are of the same type
  ― A link of capacity C; m circuits; per-circuit BW = C/m
  ― m is a measure of high-throughput vs. moderate-throughput apps; for high-throughput apps (e.g., e-Science), m is small
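For reference, the single-link, single-class model above is the classical Erlang loss system. A minimal statement of the Erlang-B formula used in the analysis that follows, together with the utilization relation assumed here (this relation is inferred from the quoted numerical results, e.g., m = 10 at U = 80% giving Pb = 23.62%, and is not copied from the slides):

\rho = \frac{\lambda}{\mu}, \qquad
P_b = \frac{\rho^{m}/m!}{\sum_{j=0}^{m} \rho^{j}/j!}, \qquad
U = \frac{\rho\,(1 - P_b)}{m}

Here m is the number of circuits on the link and C/m is the per-circuit rate.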
BW Sharing Models
- Two kinds of apps, distinguished by whether 1/μ depends on the per-circuit rate C/m
  ― 1/μ is independent of C/m
  ― 1/μ is dependent on C/m (e.g., file transfers, where the holding time is the file size divided by C/m)
- [Figure: in both cases, N end hosts (1 through N) share a link L of capacity C.]
- Call blocking is computed with the Erlang-B formula
- File-size distribution: shape parameter α, scale parameter k; a crossover file size is also defined

Numerical Results: 1/μ is independent of C/m
- Two equations, four variables
- Fix U and m, compute Pb and λ
- [Plot: results for fixed U; e.g., m = 10 gives Pb = 23.62%.]
- Conclusions: to get high U
  ― Small m (~10): high Pb, thus book-ahead or call queuing
  ― Large m (~1000): high call-arrival rate λ, thus large N
  ― Intermediate m (~100): a large 1/μ is preferred

Numerical Results: 1/μ is dependent on C/m
- [Plot: results for α = 1.1, k = 1.25 MB.]
- Conclusions: to get high U
  ― Small m (~10): high Pb, thus book-ahead or call queuing
  ― As m increases, N does not increase
  ― m = 100: to get U > 80% and Pb < 5%, file sizes between 6 MB and 29 MB are needed, i.e., 0.5 s ≤ 1/μ ≤ 2.3 s

Conclusions for Analysis
- Ideal apps require BW on the order of one-hundredth of the link capacity as the per-circuit rate
- Apps where 1/μ is independent of C/m: a long call-holding time is preferred
- Apps where 1/μ is dependent on C/m: a short call-holding time is needed

Outline (next section: Application I: Web Transfer App)

APP I: Web Transfer App on CHEETAH
- Why web transfer?
  ― Web-based apps are ubiquitous
  ― Based on the previous analysis, m = 100 is suitable for CHEETAH
- Consists of a software package, WebFT
  ― Leverages CGI for deployment without modifying web-client or web-server software
  ― Integrated with the CHEETAH end-host software APIs to allow use of the CHEETAH network in a mode transparent to users

WebFT Architecture
- [Figure: on the web-server side, a web server (e.g., Apache) invokes CGI scripts (download.cgi and redirection.cgi) that drive the WebFT sender; on the web-client side, a web browser (e.g., Mozilla) issues the URL request and receives the response, and the WebFT receiver accepts the data. Both sides use the CHEETAH end-host software APIs and daemons (OCS, RD, RSVP-TE, C-TCP). Control messages travel over the Internet; data transfers use a circuit.]
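To make the sender side concrete, below is a minimal C sketch of the control flow that download.cgi drives; the full flow chart appears in a later slide ("The Flow Chart for the WebFT Sender"). All function names here (ocs_can_reach, rd_request_circuit, rsvp_setup_circuit, ctcp_send_file, rsvp_release_circuit) are hypothetical stand-ins for the CHEETAH OCS, RD, RSVP-TE, and C-TCP APIs; they are not the actual API signatures.

/* Hypothetical sketch of the WebFT sender logic.  The prototypes below
 * are illustrative stand-ins for the CHEETAH end-host APIs. */
int ocs_can_reach(const char *client_ip);                     /* OCS: is the client on CHEETAH?  */
int rd_request_circuit(const char *client_ip, int rate_mbps); /* RD: should a circuit be used?   */
int rsvp_setup_circuit(const char *client_ip, int rate_mbps); /* RSVP-TE: signal the circuit     */
int ctcp_send_file(const char *path, const char *client_ip);  /* C-TCP: send over the circuit    */
void rsvp_release_circuit(void);                              /* RSVP-TE: release the circuit    */

/* Returns 0 on success; nonzero is the flow chart's "Return Failure" branch. */
int webft_send(const char *path, const char *client_ip, int rate_mbps)
{
    if (!ocs_can_reach(client_ip))                      /* client not reachable via CHEETAH */
        return 1;
    if (!rd_request_circuit(client_ip, rate_mbps))      /* routing decision declines        */
        return 1;
    if (rsvp_setup_circuit(client_ip, rate_mbps) != 0)  /* circuit setup failed             */
        return 1;
    int rc = ctcp_send_file(path, client_ip);           /* send the file via C-TCP          */
    rsvp_release_circuit();                             /* always release the circuit       */
    return rc;
}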
Experimental Testbed for WebFT
- [Figure: end hosts zelda3 and wukong each connect NIC I to the Internet through IP routers and NIC II to the CHEETAH network through Sycamore SN16000 switches at MCNC, NC (NCSU) and Atlanta, GA.]
- zelda3 and wukong: Dell machines running Linux FC3, with ext2/3 file systems and RAID-0 SCSI disks
- RTT between them: 24.7 ms on the Internet path and 8.6 ms on the CHEETAH circuit
- Apache HTTP Server 2.0 is loaded on zelda3

Experimental Results for WebFT
- [Figure: the web page used to test WebFT.]
- Test parameters: Test.rm, 1.6 GB; circuit rate, 1 Gbps
- Test results: throughput 680 Mbps; delay 19 s

Outline (next section: App II: Parallel File Transfers)

APP II: Parallel File Transfers on CHEETAH
- Motivation: e-Science projects need to share large volumes of data (TB or PB)
- Goal: achieve multi-Gb/s throughput
- Two factors limit throughput
  ― TCP's congestion-control algorithm
  ― End-host limitations
- Solutions to relieve end-host limitations
  ― Single-host solution
  ― Cluster solution, which has two variations: the general case (non-split source file) and the special case (split source file)

General-Case Cluster Solution
- [Figure: the original source file is split across hosts 1 through n, each host i transfers its piece to a corresponding host i', and the pieces are assembled at the original sink.]

Software Tools: GridFTP and PVFS2
- GridFTP: a data-transfer protocol for the Grid
  ― Extends FTP by adding features for partial file transfer, multi-streaming, and striping
  ― We mainly use the GridFTP striped-transfer feature
- PVFS: Parallel Virtual File System
  ― An open-source implementation of a parallel file system
  ― Stripes a file across multiple I/O servers, like RAID-0
  ― A second version: PVFS2

GridFTP Striped Transfer
- [Figure: globus-url-copy contacts a sending front end and a receiving front end; the sending data nodes S1 through Sn initiate data connections to the receiving data nodes R1 through Rn; each data node holds every n-th block (block 1, block n+1, ...) of a file stored in a parallel file system on each side.]

General-Case Cluster Solution: Design Steps
- Splitting & assembling
  ― GridFTP partial file transfer. Cons: wastes disk space; performance overhead
  ― Socket program. Pros: avoids wasting disk space
  ― pvfs2-cp. Pros: avoids wasting disk space. Cons: performance overhead
- Transferring
  ― GridFTP partial file transfer. Cons: many independent transfers, incurring much overhead to set up and release connections
  ― GridFTP striped transfer. Pros: a single file transfer
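The block layout shown in the striped-transfer figure is a simple round-robin striping, and keeping the PVFS2 layout and the GridFTP layout identical is what the implementation slides that follow are about. A minimal C sketch, assuming plain round-robin striping with a fixed stripe size (the function below is illustrative; the actual PVFS2 and GridFTP layout code is not shown in the slides):

/* Sketch: round-robin striping, as in the striped-transfer figure.
 * Assumption: stripe i lands on node (i % num_nodes); contention is
 * avoided only if both PVFS2 and GridFTP agree on the stripe size and
 * the node order (see the "Four Conditions" slide near the end). */
#include <stdio.h>

/* Which data node (0 .. num_nodes-1) holds the stripe containing `offset`? */
static int node_for_offset(long long offset, long long stripe_size, int num_nodes)
{
    long long stripe_index = offset / stripe_size;
    return (int)(stripe_index % num_nodes);
}

int main(void)
{
    long long stripe_size = 64 * 1024;   /* PVFS2's default stripe size (64 KB) */
    int num_nodes = 5;
    for (long long off = 0; off < 5 * stripe_size; off += stripe_size)
        printf("offset %lld -> node %d\n", off, node_for_offset(off, stripe_size, num_nodes));
    return 0;
}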
General-Case Cluster Solution: Implementation
- To get high throughput, data nodes must be responsible for the data blocks on their own local disks
  ― i.e., make PVFS2 and GridFTP use the same stripe pattern
- Problems
  ― PVFS2 1.0.1 does not provide a utility to inspect the data distribution
  ― Data connections between sending and receiving nodes are random
- [Figure: blocks 1, n+1, ... striped across PVFS2 data nodes S1 through Sn on the sending side and R1 through Rn on the receiving side.]

Random Data Connections
- [Figure, shown on two slides: random matchings between sending data nodes S1 through Sn and receiving data nodes R1 through Rn, illustrating that the matching is not deterministic.]

Implementation - Modifications to PVFS2
- Goal: know a priori how a file is striped in PVFS2
- Used the strace command to trace the system calls made by pvfs2-cp
  ― pvfs2-fs-dump gives the (non-deterministic) I/O-server order of the file distribution
  ― pvfs2-cp ignores the -s option for configuring the stripe size
- Modified the PVFS2 code
  ― For load balance, PVFS2 stripes files starting at a random server: jitter = (rand() % num_io_servers);
  ― Set jitter = -1 to get a fixed order of data distribution
  ― Changed the default stripe size (original: 64 KB)

Implementation - Modifications to GridFTP
- Goal: use a deterministic matching sequence between sending and receiving data nodes
- Method: modify the implementation of the SPAS and SPOR commands
  ― SPAS: sort the list of host-port pairs for the receiving data nodes in IP-address order
  ― SPOR: request that the sending data nodes initiate data connections sequentially to the receiving data nodes
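A compact sketch of the two modifications just described. The PVFS2 fragment only restates the one-line change quoted on the slide; the SPAS-side sorting is a hypothetical illustration (struct hostport, by_ip, and sort_spas_list are invented names, not the GridFTP source), meant to show "sort the receiving host-port list by IP address" concretely.

/* --- PVFS2: fixed starting server instead of a random one -------------- */
/* Original (striping starts at a random I/O server for load balance):     */
/*     jitter = (rand() % num_io_servers);                                 */
/* Modified, per the slide, so the data-distribution order is fixed:       */
/*     jitter = -1;                                                        */

/* --- GridFTP: deterministic SPAS ordering (hypothetical illustration) -- */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>

struct hostport { char host[64]; int port; };   /* not the real GridFTP type */

static int by_ip(const void *a, const void *b)
{
    struct in_addr ia, ib;
    inet_aton(((const struct hostport *)a)->host, &ia);
    inet_aton(((const struct hostport *)b)->host, &ib);
    uint32_t x = ntohl(ia.s_addr), y = ntohl(ib.s_addr);
    return (x > y) - (x < y);
}

/* Sort the receiving data nodes' host-port pairs into IP-address order,
 * so that SPOR can hand them to the sending nodes in a fixed sequence. */
static void sort_spas_list(struct hostport *list, size_t n)
{
    qsort(list, n, sizeof list[0], by_ip);
}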
Experimental Results
- Conducted on a 22-node cluster, sunfire
- The modifications reduced network-and-disk contention
- Performance of the PVFS2 implementation was poor

Summary and Conclusions
- Analytical models of GMPLS networks
  ― Ideal apps require BW on the order of one-hundredth of the link capacity as the per-circuit rate
- Application I: Web Transfer Application
  ― Provided deterministic data services to CHEETAH clients on dedicated end-to-end circuits
  ― No modifications to web-client or web-server software, by leveraging CGI
- Application II: Parallel File Transfers
  ― Implemented a general-case cluster solution using PVFS2 and the GridFTP striped transfer
  ― Modified PVFS2 and GridFTP code to reduce network-and-disk contention

Publication List
- M. Veeraraghavan, X. Fang, and X. Zheng, "On the suitability of applications for GMPLS networks," submitted to IEEE Globecom 2006.
- X. Fang, X. Zheng, and M. Veeraraghavan, "Improving web performance through new networking technologies," IEEE ICIW'06, February 23-25, 2006, Guadeloupe, French Caribbean.

Future Work
- Analytical models of GMPLS networks
  ― Multi-class models
  ― Multiple links and network models
- Application I: Web Transfer Application
  ― Design a Web partial CO transfer to enable non-CHEETAH hosts to use CHEETAH
  ― Connect multiple CO networks to further reduce RTT
- Application II: Parallel File Transfers
  ― Test the general-case cluster solution on CHEETAH
  ― Work on PVFS2, or try GPFS, to get high I/O throughput

A Classification of Networks that Reflects Sharing Modes
- [Figure: classification of networks by sharing mode.]

The Flow Chart for the WebFT Sender
- Can the client be reached via the CHEETAH network (OCS)? If not, return failure
- Request a CHEETAH circuit (Routing Decision); if declined, return failure
- Set up a circuit (RSVP-TE client); if setup fails, return failure
- Send the file via C-TCP
- Release the circuit (RSVP-TE client) and return success

The WebFT Receiver
- Integrates with the CHEETAH end-host software modules, similar to the WebFT sender
- Runs as a daemon in the background on the client host to avoid manual intervention
- Also provides the WebFT sender with a desired circuit rate

Experimental Results for WebFT
- [Figure: additional WebFT measurement results.]

PVFS2 Architecture
- [Figure: PVFS2 architecture.]

Experimental Configuration
- Configuration of PVFS2 I/O servers
  ― The 1st PVFS2 file system: sunfire1 through sunfire5
  ― The 2nd PVFS2 file system: sunfire10, and sunfire6 through sunfire9
- Configuration of GridFTP servers
  ― Sending front end: sunfire1, with data nodes sunfire1 through sunfire5
  ― Receiving front end: sunfire10, with data nodes sunfire10 and sunfire6 through sunfire9
- GridFTP striped transfer:
  globus-url-copy -vb -dbg -stripe ftp://sunfire1:50001/pvfs2/test_1G ftp://sunfire10:50002/pvfs2/test_1G1 2>dbg1.txt

Four Conditions to Avoid Unnecessary Network-and-Disk Contention
- Know a priori how data are striped in PVFS2
- PVFS2 I/O servers and GridFTP servers run on the same hosts
- GridFTP stripes data across data nodes in the same sequence as PVFS2 does across PVFS2 I/O servers
- GridFTP and PVFS2 have the same stripe size

The Specific Cluster Solution for TSI
- [Figure: the orbitty cluster at NCSU (controller-0 "rudi", controller-1 "orbitty", compute0-0 through compute0-19, disk nodes disk-0-0 through disk-4-0, Dell 5424 and 5224 switches) connected over CHEETAH to the zelda hosts (zelda1 through zelda5) at ORNL, which attach over a LAN to the X1E and an X1E monitoring host.]

Numerical Results: 1/μ is dependent on C/m
- [Plot: additional numerical results.]
- Conclusion: large m (~1000) does not increase N
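To complement the numerical-results slides, here is a small, self-contained C sketch of the standard Erlang-B recursion. It assumes the utilization definition U = rho*(1 - Pb)/m, which is an inference from the quoted numbers rather than a formula taken from the slides; with m = 10 and the offered load chosen so that U is 80%, it gives Pb of about 23.6%, consistent with the Pb = 23.62% point quoted earlier.

/* Sketch: reproduce the single-link Erlang-B numbers.
 * Given m circuits and a target utilization U, find the offered load rho
 * (by bisection) such that U = rho*(1 - Pb)/m, where Pb is the Erlang-B
 * blocking probability, then report Pb. */
#include <stdio.h>

/* Erlang-B blocking probability via the standard recursion. */
static double erlang_b(double rho, int m)
{
    double e = 1.0;                      /* E(0) = 1 */
    for (int j = 1; j <= m; j++)
        e = rho * e / (j + rho * e);     /* E(j) from E(j-1) */
    return e;
}

int main(void)
{
    int m = 10;
    double target_u = 0.80;
    double lo = 0.0, hi = 1000.0;        /* bracket for the offered load rho */

    for (int i = 0; i < 100; i++) {      /* bisection: U is increasing in rho */
        double rho = 0.5 * (lo + hi);
        double u = rho * (1.0 - erlang_b(rho, m)) / m;
        if (u < target_u) lo = rho; else hi = rho;
    }
    double rho = 0.5 * (lo + hi);
    printf("m = %d, U = %.0f%% -> rho = %.2f, Pb = %.2f%%\n",
           m, 100.0 * target_u, rho, 100.0 * erlang_b(rho, m));
    /* For m = 10 and U = 80% this gives Pb of roughly 23.6%. */
    return 0;
}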