Experiences Using Cloud Computing for a Scientific Workflow Application
Jens Vöckler, Gideon Juve, Ewa Deelman, Mats Rynge, G. Bruce Berriman
ScienceCloud'11, 2011-06-08
Funded by NSF grant OCI-0910812

This Talk
An experience report on cloud computing.
- FutureGrid
  - Hardware
  - Middlewares
- Pegasus-WMS
- Periodograms
- Experiments
  - Periodogram I
  - Comparison of clouds using periodograms
  - Periodogram II

What is FutureGrid
- Something different for everyone:
  - 6 centers across the nation
  - Nimbus, Eucalyptus, Moab "bare metal"
  - Test bed for cloud computing (this talk)
- Start here: http://www.futuregrid.org/

What Comprises FutureGrid
  Resource      Type       Cores   Host CPU + RAM
  IU india      iDataPlex  1,416   8 x 2.9 GHz Xeon @ 24 GB
  IU xray       Cray XT5m    672   2 x 2.6 GHz 285 (E6) @ 8 GB
  UofC hotel    iDataPlex    672   8 x 2.9 GHz Xeon @ 24 GB
  UCSD sierra   iDataPlex    672   8 x 2.5 GHz Xeon @ 32 GB
  UFl foxtrot   iDataPlex    256   8 x 2.3 GHz Xeon @ 24 GB
  TACC alamo    PowerEdge  1,016   8 x 2.6 GHz Xeon @ 12 GB
  TOTALS                   4,704
Proposed: 16 x (192 GB + 12 TB / node) cluster; 8-node GPU-enhanced cluster.

Middlewares in FG
  Resource      Type       Cores   Moab          Eucalyptus   Nimbus
  IU india      iDataPlex  1,416   832 (59%)     400 (28%)    -
  IU xray       Cray XT5m    672   672 (100%)    -            -
  UofC hotel    iDataPlex    672   336 (50%)     -            336 (50%)
  UCSD sierra   iDataPlex    672   312 (46%)     120 (18%)    160 (24%)
  UFl foxtrot   iDataPlex    256   -             -            248 (97%)
  TACC alamo    PowerEdge  1,016   896 (88%)     -            120 (12%)
  TOTALS                   4,704   3,048 (65%)   520 (11%)    744 (18%)
Available resources as of 2011-06-06.

Pegasus WMS I
- Automating computational pipelines
- Funded by NSF/OCI; a collaboration with the Condor group at UW Madison
- Automates data management
- Captures provenance information
- Used by a number of domains, across a variety of applications
- Scalability:
  - Handle large data (kB…TB), and
  - Many computations (1…10^6 tasks)

Pegasus WMS II
- Reliability
  - Retry computations from the point of failure
- Construction of complex workflows
  - Based on computational blocks
  - Portable, reusable workflow descriptions (sketch below)
- Can run purely locally, or distributed among institutions
  - Laptop, campus cluster, grid, cloud

How Pegasus Uses FutureGrid
- Focus on Eucalyptus and Nimbus
  - No Moab "bare metal" at this point
- During experiments in Nov. 2010:
  - 544 Nimbus cores
  - 744 Eucalyptus cores
  - 1,288 total potential cores in 5 clouds across 4 clusters
- Actually used 300 physical cores (max)
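To make the "portable, reusable workflow description" mentioned above concrete, here is a minimal sketch of describing a single periodogram task as an abstract Pegasus workflow (DAX) with the Pegasus DAX3 Python API. The file names, transformation name, and algorithm argument are illustrative assumptions, not the actual workflow generator used for these experiments.

```python
# Minimal sketch (assumed file and argument names) of one periodogram
# task expressed as an abstract Pegasus workflow (DAX).
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("periodogram-sketch")

# Hypothetical input light curve and output periodogram files.
lightcurve = File("lightcurve-000001.tbl")
result = File("periodogram-000001.tbl")

# One compute task; "periodogram" names an executable in the
# transformation catalog, and the arguments are placeholders.
job = Job(name="periodogram")
job.addArguments("--algorithm", "plavchan", lightcurve, result)
job.uses(lightcurve, link=Link.INPUT)
job.uses(result, link=Link.OUTPUT, transfer=True)
dax.addJob(job)

# Write the abstract workflow; Pegasus later plans it onto a concrete
# site (laptop, campus cluster, grid, or cloud).
with open("periodogram.dax", "w") as fh:
    dax.writeXML(fh)
```

The DAX itself stays resource-independent; the mapping to a particular cloud happens at planning time, which is what lets the same description run on the FutureGrid, Magellan, and EC2 resources compared later in this talk.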
Pegasus FG Interaction
  (figure: how Pegasus WMS interacts with the FutureGrid clouds)

Periodograms
- Find extra-solar planets by:
  - Wobbles in the radial velocity of the star, or
  - Dips in the star's intensity
  (figure: star-planet wobble causing red/blue shift, and brightness-over-time light curve dipping during a planet transit)

Kepler Workflow
- 210k light curves released in July 2010
- Apply 3 algorithms to each curve, 3 times, with 3 different parameter sets
- Run the entire data set
- This talk's experiments:
  - 1 algorithm, 1 parameter set, 1 run
  - Either partial or full data set

Pegasus Periodograms
- 1st experiment is a "ramp-up"
  - 16k light curves, 33k computations (every light curve twice)
  - Try to see where things trip
  - Already found places needing adjustments
- 2nd experiment
  - Also 16k light curves
  - Across 3 comparable infrastructures
  - Testing hypothesized tunings
- 3rd experiment runs the full set

Periodogram Workflow
  (figure: structure of the periodogram workflow)

Excerpt: Jobs over Time
  (figure: jobs over time during a run)

Hosts, Tasks, and Duration (I)
  (chart: per-cloud requested, available, and actual hosts, jobs, tasks, and cumulative duration in hours for Eucalyptus india, Eucalyptus sierra, Nimbus sierra, Nimbus foxtrot, and Nimbus hotel; roughly 33k tasks in total)

Resource- and Job States (I)
  (figure: resource and job states over time for experiment I)

Cloud Comparison
- Compare academic and commercial clouds:
  - NERSC's Magellan cloud (Eucalyptus)
  - Amazon's cloud (EC2), and
  - FutureGrid's sierra cloud (Eucalyptus)
- Constrained node and core selection, because AWS costs $$:
  - 6 nodes, 8 cores each node
  - 1 Condor slot / physical CPU

Cloud Comparison II
  Site        CPU           RAM (swap)   Walltime   Cum. Dur.   Speed-Up
  Magellan    8 x 2.6 GHz   19 (0) GB    5.2 h      226.6 h     43.6
  Amazon      8 x 2.3 GHz   7 (0) GB     7.2 h      295.8 h     41.1
  FutureGrid  8 x 2.5 GHz   29 (½) GB    5.7 h      248.0 h     43.5
- Given 48 physical cores, a speed-up of ≈ 43 is considered pretty good
- AWS cost ≈ $31 (arithmetic sketch at the end of the deck)
  - 7.2 h x 6 x c1.large ≈ $29
  - 1.8 GB in + 9.9 GB out ≈ $2

Scaling Up I
- Workflow optimizations
  - Pegasus clustering ✔
  - Compress file transfers
- Submit-host Unix settings
  - Increase the open file-descriptor limit
  - Increase the firewall's open port range
- Submit-host Condor DAGMan settings
  - Idle job limit ✔

Scaling Up II
- Submit-host Condor settings
  - Socket cache size increase
  - File descriptors and ports per daemon
- Remote VM Condor settings
  - Use the condor_shared_port daemon
  - Use CCB for private networks
  - TCP for collector call-backs
  - Tune Condor job slots

Hosts, Tasks, and Duration (II)
  (chart: per-cloud requested and actual hosts, jobs, tasks, and cumulative duration in hours for Eucalyptus india, Eucalyptus sierra, Nimbus sierra, Nimbus foxtrot, and Nimbus hotel; roughly 210k tasks in total)

Resource- and Job States (II)
  (figure: resource and job states over time for experiment II)

Loose Ends
- Saturate the requested resources
  - Requires better monitoring ✔
- Clustering
- Better submit-host tuning
- Better data staging

Acknowledgements
- Funded by NSF grant OCI-0910812
- Ewa Deelman, Gideon Juve, Mats Rynge, Bruce Berriman
- FG help desk ;-)
- http://pegasus.isi.edu/
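Appendix-style note: the Cloud Comparison II slide quotes an AWS cost of roughly $31. The sketch below redoes that arithmetic as a back-of-the-envelope check. Only the node count, walltime, and data volumes come from the talk; the per-hour and per-GB prices are assumptions (2011-era EC2 list prices chosen to reproduce the quoted ~$29 and ~$2).

```python
# Back-of-the-envelope reproduction of the "AWS cost ~ $31" estimate.
# Prices below are assumed 2011-era EC2 list prices, not values from the talk.

NODES = 6                    # 6 EC2 nodes, 8 cores each (from the slide)
WALLTIME_H = 7.2             # measured walltime on Amazon (from the slide)
PRICE_PER_NODE_HOUR = 0.68   # assumed hourly rate (USD), chosen to match ~ $29

GB_IN, GB_OUT = 1.8, 9.9                 # data in/out of EC2 (from the slide)
PRICE_GB_IN, PRICE_GB_OUT = 0.10, 0.15   # assumed 2011 transfer rates (USD/GB)

compute = NODES * WALLTIME_H * PRICE_PER_NODE_HOUR
transfer = GB_IN * PRICE_GB_IN + GB_OUT * PRICE_GB_OUT

print(f"compute  ~ ${compute:.2f}")             # ~ $29
print(f"transfer ~ ${transfer:.2f}")            # ~ $2
print(f"total    ~ ${compute + transfer:.2f}")  # ~ $31
```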