Magellan Project: Clouds for Science?
Shane Canon
Lawrence Berkeley National Lab
CASC Leap Day (2012), Arlington, VA

Magellan Project
• Mission
– Determine the appropriate role of private cloud computing for DOE/SC midrange workloads
– Approximately two years in duration, with established milestones
• Approach
– Deploy a test bed to investigate the use of cloud computing for mid-range scientific computing
– Evaluate the effectiveness of cloud computing models for a wide spectrum of DOE/SC applications

Plenty of Hype
[Figure: Gartner's 2010 Emerging Technologies Hype Cycle, highlighting Cloud Computing]

Magellan Research Agenda and Lines of Inquiry
• Are the open-source cloud software stacks ready for DOE HPC science?
• Can DOE cyber-security requirements be met within a cloud?
• Are the new cloud programming models useful for scientific computing?
• Can DOE HPC applications run efficiently in the cloud? Which applications are suitable for clouds?
• How usable are cloud environments for scientific applications?
• When is it cost effective to run DOE HPC science in a cloud?

What is a Cloud?
According to the National Institute of Standards & Technology (NIST) definition:
• Resource pooling. Computing resources are pooled to serve multiple consumers.
• Broad network access. Capabilities are available over the network.
• Measured service. Resource usage is monitored and reported for transparency.
• Rapid elasticity. Capabilities can be rapidly scaled out and in (pay-as-you-go).
• On-demand self-service. Consumers can provision capabilities automatically.

Magellan Timeline
[Figure: project timeline spanning 2010–2011]

Understanding User Requirements and Desires
[Figure: bar chart of the share of surveyed users (0–90%) interested in each capability]
• User interfaces/Science Gateways: use of clouds to host science gateways and/or access to cloud resources through science…
• Hadoop File System
• MapReduce programming model/Hadoop
• Cost associativity, i.e., I can get 10 CPUs for 1 hr now or 2 CPUs for 5 hrs at the same cost (see the sketch below)
• Easier to acquire/operate than a local cluster
• Exclusive access to the computing resources/ability to schedule independently of other groups/users
• Ability to control groups/users
• Ability to share setup of software or experiments with collaborators
• Ability to control software environments specific to my application
• Access to on-demand (commercial) paid resources closer to deadlines
• Access to additional resources
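Cost associativity is just linear pricing: the bill depends only on total CPU-hours consumed, so bursting wide costs no more than running narrow. A minimal sketch of that arithmetic in Python; the $0.10 per CPU-hour rate is a made-up illustration, not a Magellan or vendor price:

```python
def cloud_cost(cpus, hours, rate_per_cpu_hour=0.10):
    """Cost under linear, 'cost associative' pricing: it depends only on
    total CPU-hours, not on how they are split across machines and time."""
    return cpus * hours * rate_per_cpu_hour

# 10 CPUs for 1 hour and 2 CPUs for 5 hours are both 10 CPU-hours,
# so they cost the same -- the wide burst simply finishes five times sooner.
assert cloud_cost(10, 1) == cloud_cost(2, 5)
print(cloud_cost(10, 1))  # 1.0 dollars at the assumed rate
```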
Program Office
[Figure: pie chart of the breakdown by DOE program office]
• Advanced Scientific Computing Research: 17%
• High Energy Physics: 20%
• Biological and Environmental Research: 9%
• Nuclear Physics: 13%
• Basic Energy Sciences – Chemical Sciences: 10%
• Advanced Networking Initiative (ANI) Project: 3%
• Fusion Energy Sciences: 10%
• Other: 14%

Magellan was architected for flexibility and to support research
[Figure: Magellan system diagram]
• Compute servers: 720 nodes, dual quad-core 2.66 GHz Nehalem, 24 GB RAM, 500 GB disk each
• Totals: 5760 cores, 40 TF peak, 21 TB memory, 400 TB disk
• IO nodes (9), management nodes (2), gateway nodes (27)
• Big memory servers: 2 servers, 1 TB memory each
• Flash storage servers: 10 compute/storage nodes, 8 TB high-performance flash, 20 GB/s bandwidth
• QDR InfiniBand interconnect with aggregation switch
• Global storage (GPFS): 1 PB
• Archival storage
• ESnet 10 Gb/s router; ANI 100 Gb/s

Application Performance
[Figure: runtime relative to Magellan (non-VM) for GAMESS, GTC, IMPACT, fvCAM, and MAESTRO256 across Carver, Franklin, Lawrencium, EC2-Beta-Opt, Amazon CC, and Amazon EC2]

Application Performance
[Figure: runtime relative to Carver for MILC and PARATEC across Carver, Franklin, Lawrencium, Amazon CC, EC2-Beta-Opt, and Amazon EC2]

Application Scaling
[Figure: scaling results for PARATEC]

Application Scaling
[Figure: scaling results for MILC]

Cost of NERSC in the Cloud

Component                        Cost
Compute Systems (1.38B hours)    $189,200,000
HPSS (17 PB)                     $8,200,000
File Systems (2 PB)              $2,600,000
Total (Annual Cost)              $200,000,000

• Assumes 85% utilization and zero growth in HPSS and file system data.
• Does not include the 2x–10x performance impact that has been measured.
• This still covers only about 65% of what NERSC's $55M annual budget provides: no consulting staff, no administration, no support.
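The table above implies a cloud price of roughly $0.14 per compute hour, and a cloud bill several times NERSC's actual budget. A quick back-of-the-envelope check in Python, using only the figures on the slide:

```python
# Sanity check of the cost slide, using only the numbers shown there.
compute_cost = 189_200_000       # $ to buy 1.38B compute hours in the cloud
compute_hours = 1.38e9
total_cloud_cost = 200_000_000   # $ annual total from the table
nersc_budget = 55_000_000        # NERSC's annual budget, per the slide

print(f"Implied cloud rate: ${compute_cost / compute_hours:.3f} per hour")
# -> about $0.137 per compute hour

print(f"Cloud total vs. NERSC budget: {total_cloud_cost / nersc_budget:.1f}x")
# -> about 3.6x, in line with the report's 3-7x cost finding
```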
Where are (commercial) clouds effective?
• Individual projects with high-burst needs
– Avoid paying for idle hardware
– Access to larger scale (elasticity)
– Alternative: pool with other users (the condo model)
• High-throughput applications with modest data needs
– Bioinformatics
– Monte Carlo simulations
– Cost issues still apply
• Infrastructure-challenged sites
– Facilities cost >> IT costs
– Consider the long-term costs
• Undetermined or volatile needs
– Use clouds to baseline requirements, then build in-house

Workload Comparison
• Typical load average: ~30% for traditional enterprise IT vs. ~90% for HPC centers
• Computational needs: enterprise IT has bounded requirements, sufficient to meet customer demand or transaction rates; HPC centers face virtually unbounded requirements, since scientists always have larger, more complicated problems to simulate or analyze
• Scaling approach: enterprise IT scales in, consolidating within a node using virtualization; HPC applications scale out, running in parallel across multiple nodes

Magellan Final Report
• Final report released in December 2011
• Jointly written with ANL
• Available at the ASCR website
• Comprehensive (170 pages)
– Findings/recommendations
– User experiences
– Performance benchmarking
– Programming models (MapReduce)
– Cost analysis

Key Findings
• Cloud approaches provide many useful benefits, such as customized environments and access to surge capacity.
• Cloud computing can require significant initial effort and skill to port applications to these new models.
• Significant gaps and challenges remain in managing virtual environments, workflows, data, cyber security, etc.
• The key economic benefit of clouds comes from consolidating resources across a broad community, which yields higher utilization, economies of scale, and operational efficiency. DOE already achieves this with facilities like NERSC and the LCFs.
• Cost analysis shows that DOE centers are cost competitive, typically 3–7x less expensive than commercial cloud providers.

What Now?
• The Magellan hardware has been folded into an existing cluster (Carver)
• An 80-node Hadoop (MapReduce) cluster continues to operate for data-intensive computing
• NERSC is looking at ways to incorporate lessons from Magellan
– Flexible environments
– Flexible scheduling
• DOE/ASCR has not announced any plans to fund a production cloud

Is an HPC Center a Cloud? (against the NIST definition of cloud computing)
• Resource pooling: yes
• Broad network access: yes
• Measured service: yes
• Rapid elasticity (usage can grow/shrink; pay-as-you-go): no
• On-demand self-service: no
– Users cannot demand (or pay for) more service than their allocation allows
– Jobs often wait for hours or days in queues

Conclusions
• The Magellan project accomplished what it set out to do
– Deployed a testbed to study the efficacy of clouds for DOE science
– Engaged with users to evaluate new models
– Conducted extensive benchmarking studies
– Analyzed the cost of clouds versus DOE centers
– Released a comprehensive report that has garnered broad recognition and kudos

Acknowledgements
• Lavanya Ramakrishnan
• Iwona Sakrejda
• Tina Declerck
• Others
– Keith Jackson
– Nick Wright
– John Shalf
– Krishna Muriki (not pictured)
US Department of Energy, DE-AC02-05CH11231

Thank you!
Email: scanon@lbl.gov
Report at http://science.energy.gov/ascr/ or http://www.nersc.gov/