Magellan Project: Clouds for Science?

Shane Canon
Lawrence Berkeley National Lab
CASC
Leap Day (2012)
Arlington, VA
Magellan Project
• Mission
– Determine the appropriate role for private cloud
computing for DOE/SC midrange workloads
– Approximately two years in duration with
established milestones
• Approach
– Deploy a test bed to investigate the use of cloud
computing for mid-range scientific computing
– Evaluate the effectiveness of cloud computing
models for a wide spectrum of DOE/SC
applications
Plenty of Hype

[Figure: Gartner's 2010 Emerging Technologies Hype Cycle, highlighting Cloud Computing]
Magellan Research Agenda and
Lines of Inquiry
• Are the open source cloud software stacks
ready for DOE HPC science?
• Can DOE cyber security requirements be
met within a cloud?
• Are the new cloud programming models
useful for scientific computing?
• Can DOE HPC applications run efficiently
in the cloud? What applications are
suitable for clouds?
• How usable are cloud environments for
scientific applications?
• When is it cost effective to run DOE HPC
science in a cloud?
What is a Cloud?
Definition
According to the National Institute of Standards &
Technology (NIST)…
• Resource pooling. Computing resources are pooled
to serve multiple consumers.
• Broad network access. Capabilities are available over
the network.
• Measured Service. Resource usage is monitored and
reported for transparency.
• Rapid elasticity. Capabilities can be rapidly scaled
out and in (pay-as-you-go).
• On-demand self-service. Consumers can provision
capabilities automatically.
Magellan Timeline

[Figure: project timeline spanning 2010–2011]
Understanding User Requirements and Desires

[Figure: survey of user-requested capabilities, by percentage of respondents (0–90%)]
• User interfaces/Science Gateways: use of clouds to host science gateways and/or access to cloud resources through science…
• Hadoop File System
• MapReduce Programming Model/Hadoop
• Cost associativity? (i.e., I can get 10 CPUs for 1 hr now or 2 CPUs for 5 hrs at the same cost)
• Easier to acquire/operate than a local cluster
• Exclusive access to the computing resources/ability to schedule independently of other groups/users
• Ability to control groups/users
• Ability to share setup of software or experiments with collaborators
• Ability to control software environments specific to my application
• Access to on-demand (commercial) paid resources closer to deadlines
• Access to additional resources
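The "cost associativity" item can be made concrete with a little arithmetic. In this sketch the $0.10 per core-hour rate is an assumed example, not a real provider price; under pay-as-you-go pricing, total cost depends only on core-hours consumed, not on how they are spread across nodes and wall-clock time.

```python
# Hypothetical illustration of cost associativity under pay-as-you-go pricing.
# The rate below is an assumed example value, not a quoted price.
RATE_PER_CORE_HOUR = 0.10  # USD, illustrative only

def job_cost(cores: int, hours: float, rate: float = RATE_PER_CORE_HOUR) -> float:
    """Cost of running `cores` CPUs for `hours` wall-clock hours."""
    return cores * hours * rate

# 10 CPUs for 1 hour costs the same as 2 CPUs for 5 hours:
wide = job_cost(cores=10, hours=1)
narrow = job_cost(cores=2, hours=5)
assert wide == narrow == 1.0
```

This is exactly the property a fixed-size local cluster cannot offer: there, running 10-wide for 1 hour requires owning 10 nodes even if they sit idle afterward.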
Program Office

[Figure: pie chart of Magellan usage by DOE program office]
• Advanced Scientific Computing Research: 17%
• High Energy Physics: 20%
• Biological and Environmental Research: 9%
• Nuclear Physics: 13%
• Basic Energy Sciences – Chemical Sciences: 10%
• Advanced Networking Initiative (ANI) Project: 3%
• Fusion Energy Sciences: 10%
• Other: 14%
Magellan was architected for flexibility and to support research

[Figure: system architecture diagram]
• Compute Servers: 720 nodes, dual quad-core Nehalem 2.66 GHz, 24 GB RAM, 500 GB disk per node
• Totals: 5,760 cores, 40 TF peak, 21 TB memory, 400 TB disk
• Big Memory Servers: 2 servers, 1 TB memory each
• Flash Storage Servers: 10 compute/storage nodes, 8 TB high-performance flash, 20 GB/s bandwidth
• Global Storage (GPFS): 1 PB
• I/O Nodes (9), Management Nodes (2), Gateway Nodes (27)
• Archival Storage
• Interconnect: QDR InfiniBand with aggregation switch
• External networking: ESnet at 10 Gb/s via router; ANI at 100 Gb/s
Application Performance

[Figure: bar chart of runtime relative to Magellan (non-VM) for GAMESS, GTC, IMPACT, fvCAM, and MAESTRO256 on Carver, Franklin, Lawrencium, EC2-Beta-Opt, Amazon CC, and Amazon EC2; y-axis 0–18]
Application Performance

[Figure: bar chart of runtime relative to Carver for MILC and PARATEC on Carver, Franklin, Lawrencium, Amazon CC, EC2-Beta-Opt, and Amazon EC2; y-axis 0–60]
Application Scaling

[Figure: scaling results for PARATEC]
Application Scaling

[Figure: scaling results for MILC]
Cost of NERSC in the Cloud

Component                        Cost
Compute Systems (1.38B hours)    $189,200,000
HPSS (17 PB)                     $8,200,000
File Systems (2 PB)              $2,600,000
Total (Annual Cost)              $200,000,000

Assumes 85% utilization and zero growth in HPSS and File System data.
Doesn't include the 2x–10x performance impact that has been measured.
This still only captures about 65% of NERSC's $55M annual budget.
No consulting staff, no administration, no support.
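The table's arithmetic can be checked directly. The per-core-hour rate below is derived from the slide's figures, not a quoted cloud price:

```python
# Sanity-check the cost table above. All dollar figures and the 1.38B-hour
# total come from the slide; the hourly rate is derived, not quoted.
compute_cost  = 189_200_000      # USD, Compute Systems (1.38B hours)
hpss_cost     = 8_200_000        # USD, HPSS (17 PB)
fs_cost       = 2_600_000        # USD, File Systems (2 PB)
compute_hours = 1_380_000_000

total = compute_cost + hpss_cost + fs_cost
rate_per_hour = compute_cost / compute_hours

assert total == 200_000_000               # matches the table's annual total
assert round(rate_per_hour, 3) == 0.137   # implied ~$0.14 per compute hour
```

Note that this implied rate excludes the measured 2x–10x performance penalty, which would raise the effective price per useful compute hour accordingly.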
Where are (commercial) clouds effective?
• Individual projects with high-burst needs.
– Avoid paying for idle hardware
– Access to larger scale (elasticity)
– Alternative: Pool with other users (condo model)
• High-Throughput Applications with modest data needs
– Bioinformatics
– Monte-Carlo simulations
– Cost issues still apply
• Infrastructure-Challenged Sites
– Facilities cost >> IT costs
– Consider the long-term costs
• Undetermined or Volatile Needs
– Use Clouds to baseline requirements and build in-house
Workload Comparison

• Typical Load Average
– Traditional Enterprise IT: 30%*
– HPC Centers: 90%
• Computational Needs
– Traditional Enterprise IT: bounded computing requirements, sufficient to meet customer demand or transaction rates.
– HPC Centers: virtually unbounded requirements; scientists always have larger, more complicated problems to simulate or analyze.
• Scaling Approach
– Traditional Enterprise IT: scale-in, with emphasis on consolidating in a node using virtualization.
– HPC Centers: scale-out; applications run in parallel across multiple nodes.
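The load-average gap above is what drives the economics: a mostly fixed annual cost is spread over however many core-hours actually get used. A sketch, assuming a hypothetical cluster with a $1M annual cost and 1,000 cores (only the 30% and 90% load averages come from the slide; the other figures are invented for illustration):

```python
# Effective cost per *used* core-hour as a function of utilization.
# ANNUAL_COST and CORES are assumed example values, not real figures.
ANNUAL_COST = 1_000_000   # USD per year, hypothetical fixed cost
CORES = 1000              # hypothetical cluster size
HOURS_PER_YEAR = 8760

def cost_per_used_core_hour(utilization: float) -> float:
    used_hours = CORES * HOURS_PER_YEAR * utilization
    return ANNUAL_COST / used_hours

enterprise = cost_per_used_core_hour(0.30)  # typical enterprise load average
hpc        = cost_per_used_core_hour(0.90)  # typical HPC-center load average

# At 3x the utilization, each used core-hour costs one third as much.
assert abs(enterprise / hpc - 3.0) < 1e-9
```

This is why consolidation into highly utilized centers (the "condo model" above) captures most of the economic benefit usually attributed to clouds.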
Magellan Final Report

• Final Report released in December 2011
• Jointly written with ANL
• Available at the ASCR website
• Comprehensive (170 pages)
– Findings/Recommendations
– User Experiences
– Performance Benchmarking
– Programming Models (MapReduce)
– Cost Analysis
Key Findings
• Cloud approaches provide many useful benefits such as
customized environments and access to surge capacity.
• Cloud computing can require significant initial effort and skills
in order to port applications to these new models.
• Significant gaps and challenges exist in the areas of managing
virtual environments, workflows, data, cyber-security, etc.
• The key economic benefit of clouds comes from the
consolidation of resources across a broad community, which
results in higher utilization, economies of scale, and
operational efficiency. DOE already achieves this with facilities
like NERSC and the LCFs.
• Cost analysis shows that DOE centers are cost competitive,
typically 3–7x less expensive, when compared to commercial
cloud providers.
What Now?
• Magellan Hardware has been folded into an
existing Cluster (Carver)
• Continue to operate 80 node Hadoop
(MapReduce) cluster for data intensive
computing
• NERSC is looking at ways to incorporate
some of the lessons from Magellan
– Flexible environments
– Flexible scheduling
• DOE/ASCR has not announced any plans to
fund a production cloud
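The Hadoop cluster mentioned above runs the MapReduce model studied in the report. As a minimal sketch of that model, here is the canonical word-count pattern in plain Python, with no Hadoop dependency; the map, shuffle, and reduce phases are simulated in-process, and all function names are illustrative:

```python
# Minimal in-process sketch of the MapReduce programming model (word count).
# Hadoop would distribute these phases across nodes; here they run locally.
from collections import defaultdict

def map_phase(record: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Reduce: combine all counts emitted for one key."""
    return key, sum(values)

def mapreduce(records):
    # Shuffle: group intermediate pairs by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = mapreduce(["clouds for science", "science at scale"])
assert counts["science"] == 2
assert counts["clouds"] == 1
```

The appeal for the high-throughput, data-intensive workloads named earlier is that the user writes only the two phase functions; partitioning, scheduling, and fault tolerance are the framework's problem.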
Is an HPC Center a Cloud?

From the NIST definition of Cloud Computing, how do HPC centers measure up?
• Resource pooling. ✓
• Broad network access. ✓
• Measured Service. ✓
• Rapid elasticity. ✓
– Usage can grow/shrink; pay-as-you-go.
• On-demand self-service. ✗
– Users cannot demand (or pay for) more service than their allocation allows
– Jobs often wait for hours or days in queues
Conclusions

• The Magellan project accomplished what it set out to do:
– Deployed a test bed to study the efficacy of clouds for DOE science
– Engaged with users to evaluate new models
– Conducted extensive benchmarking studies
– Analyzed the cost of clouds versus DOE centers
– Released a comprehensive report that has garnered broad recognition and kudos
Acknowledgements

• Lavanya Ramakrishnan
• Iwona Sakrejda
• Tina Declerck
• Others
– Keith Jackson
– Nick Wright
– John Shalf
– Krishna Muriki (not pictured)

US Department of Energy
DE-AC02-05CH11231
Thank you!

Email: scanon@lbl.gov

Report available at http://science.energy.gov/ascr/ or http://www.nersc.gov/