DOE Perspective on Cyberinfrastructure - LBNL
Gary Jung
Manager, High Performance Computing Services
Lawrence Berkeley National Laboratory
Educause CCI Working Group Meeting
November 5, 2009
Midrange Computing
• DOE ASCR hosted a workshop in Oct 2008 to assess the role of
midrange computing in the Office of Science; the workshop found that
this class of computation plays an increasingly important role in
enabling Office of Science research.
• Although it is not part of ASCR's mission, midrange computing and
the associated data management play a vital and growing role in
advancing science in disciplines where capacity is as important as
capability.
• Demand for midrange computing services is…
  o growing rapidly at many sites (>30% growth annually at LBNL; see the doubling-time sketch after this list)
  o the direct expression of a broad scientific need
• Midrange computing is a necessary adjunct to leadership-class
facilities
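
To put the >30% annual growth figure in perspective, here is a small compounding calculation. Only the growth rate comes from the slide; treating it as a steady rate is an assumption for illustration.

```python
import math

# Assumes the >30% annual growth quoted for LBNL holds steady
# (an illustration, not a projection from the talk).
annual_growth = 0.30

# Doubling time under compound growth: ln(2) / ln(1 + r)
doubling_time_years = math.log(2) / math.log(1 + annual_growth)
print(f"Doubling time at 30%/yr: {doubling_time_years:.1f} years")  # ~2.6 years

# Demand relative to today after n years of compounding
for years in (1, 3, 5):
    factor = (1 + annual_growth) ** years
    print(f"{years} years -> {factor:.2f}x today's demand")
```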
Berkeley Lab Computing
• Gap between desktop and National Centers
• Midrange Computing Working Group 2001
• Cluster support program started in 2002
  o Services for PI-owned clusters include: pre-purchase consulting,
    development of specs and RFPs, facilities planning, installation and
    configuration, ongoing cluster support, user services consulting,
    cybersecurity, and computer room colocation
• Currently 32 clusters in production, over 1400 nodes, 6500
processor cores
• Funding: the institution covers infrastructure costs and technical
  development; researchers pay for the cluster and the incremental
  cost of support.
Cluster Support Phase II: Perceus Metacluster
• All clusters interconnected into shared cluster infrastructure
  o Permits sharing of resources and storage
     Global home file system
  o One ‘super master’ node, used to boot nodes across all clusters
     Multiple system images supported
  o One master job scheduler, submitting to all clusters
  o Simplifies provisioning new systems and ongoing support
• Metacluster model made possible by Perceus software
  o Successor to Warewulf (http://www.perceus.org)
  o Can run jobs across clusters, recapturing stranded capacity (illustrative sketch below)
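
As a rough illustration of the "one master scheduler, many clusters" idea, the toy dispatcher below places each job on whichever cluster currently has enough idle nodes, which is how otherwise stranded capacity gets recaptured. This is a sketch only: it is not Perceus or the actual LBNL scheduler, and the cluster names and node counts are made up.

```python
from dataclasses import dataclass

# Toy model of the metacluster scheduling idea: one dispatcher, many
# PI-owned clusters. Hypothetical names and sizes, not real LBNL systems.

@dataclass
class Cluster:
    name: str
    total_nodes: int
    busy_nodes: int = 0

    @property
    def idle_nodes(self) -> int:
        return self.total_nodes - self.busy_nodes


def dispatch(job_nodes: int, clusters: list) -> str:
    """Place a job on the cluster with the most idle nodes that can fit it."""
    candidates = [c for c in clusters if c.idle_nodes >= job_nodes]
    if not candidates:
        return "queued"  # no cluster can hold the job right now
    best = max(candidates, key=lambda c: c.idle_nodes)
    best.busy_nodes += job_nodes
    return best.name


if __name__ == "__main__":
    metacluster = [
        Cluster("geochem", total_nodes=64, busy_nodes=60),   # nearly full
        Cluster("nanosci", total_nodes=128, busy_nodes=40),  # mostly idle
    ]
    print(dispatch(16, metacluster))  # -> nanosci: idle capacity recaptured
```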
Laboratory-Wide Cluster - Drivers
“Computation lets us understand everything we do.”
– LBNL Acting Lab Director Paul Alivisatos
38% of scientists depend on cluster computing for research.
69% of scientists are interested in cycles on a Lab-owned cluster.
  o early-career scientists are twice as likely to be ‘very interested’ as
    their later-career peers
Why do scientists at LBNL need midrange computing resources?
o ‘on ramp’ activities in preparation for running at supercomputing
centers (development, debugging, benchmarking, optimization)
o scientific inquiry not connected with ‘on ramp’ activities
Laboratory-Wide Cluster “Lawrencium”
• Overhead funded program
  o Capital equipment dollars shifted from business computing
  o Overhead funded staffing - 2 FTE
• Production in Fall 2008
• General purpose Linux cluster suitable for a wide range of applications
  o 198 nodes, 1584 cores, DDR InfiniBand interconnect
  o 40TB NFS home directory storage; 100TB Lustre parallel scratch
  o Commercial job scheduler and banking system
  o #500 on the Nov 2008 Top500
• Open to all LBNL PIs and collaborators on their project
• Users are required to complete a survey when applying for accounts
  and later provide feedback on science results
• No user allocations at this time. This has been successful to date.
Networking - LBLNet
• Peer at 10GbE with ESnet
• 10GbE at core. Moving to 10GbE to the buildings
• Goal is sustained high-speed data flows with cybersecurity
• Network-based IDS approach - traffic is innocent until proven guilty (illustrative sketch below)
  o Reactive firewall
  o Does not impede data flow; no stateful firewall
  o Bro cluster allows us to scale our IDS to 10GbE
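
A minimal sketch of the "innocent until proven guilty" reactive approach: traffic is never held up by an inline stateful firewall; a block rule is installed only after the IDS flags a host. This is illustrative Python only, not the real Bro-to-LBLNet plumbing; the alert format here is invented, and the iptables call simply stands in for whatever blocking mechanism the reactive firewall actually uses.

```python
import subprocess

def handle_alerts(alerts):
    """React to IDS verdicts after the fact, rather than inspecting every
    flow inline. Each alert is a dict like {"src": ip, "verdict": str}
    (a hypothetical format, not Bro's actual output)."""
    blocked = set()
    for alert in alerts:
        if alert["verdict"] == "malicious" and alert["src"] not in blocked:
            # Install a drop rule only once a host is proven guilty.
            subprocess.run(
                ["iptables", "-A", "INPUT", "-s", alert["src"], "-j", "DROP"],
                check=True,
            )
            blocked.add(alert["src"])
    return blocked

# Example with hypothetical IDS output (call commented out: needs root/iptables):
sample = [
    {"src": "203.0.113.7", "verdict": "malicious"},
    {"src": "198.51.100.4", "verdict": "benign"},
]
# handle_alerts(sample)  # would block only 203.0.113.7
```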
Communications and Governance
• General announcements at IT council
• Steering committees used for scientific computing
  o Small group of stakeholders, technical experts, decision makers
  o Helps to validate and communicate decisions
  o Accountability
Challenges
• Funding (past)
  o Difficult for IT to shift funding from other areas of computing to
    support for science
  o Recharge can constrain adoption. Full cost recovery definitely will.
• New Technology (ongoing)
• Facilities (current)
  o Computer room is approaching capacity despite upgrades
     Environmental monitoring
     Plenum in ceiling converted to hot air return
     Tricks to boost underfloor pressure
     Water-cooled doors
  o Underway
     DCIE measurement in process (see the worked example below)
     Tower and heat exchanger replacement
     Data center container investigation
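
DCIE (Data Center infrastructure Efficiency) is simply IT equipment power divided by total facility power. The slide only says the measurement is in progress, so the wattages below are placeholder numbers for illustration.

```python
def dcie(it_power_kw: float, total_facility_power_kw: float) -> float:
    """DCIE = IT equipment power / total facility power (often quoted as a %)."""
    return it_power_kw / total_facility_power_kw

# Hypothetical numbers, not LBNL measurements:
it_load = 450.0         # kW drawn by servers, storage, network gear
facility_total = 600.0  # kW including cooling, UPS losses, lighting

print(f"DCIE = {dcie(it_load, facility_total):.0%}")   # 75%
print(f"PUE  = {facility_total / it_load:.2f}")        # 1.33, the reciprocal
```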
Next Steps
• Opportunities presented by cloud computing
  o Amazon investigation earlier this year. Others ongoing
     Latency-sensitive applications ran poorly, as expected
     Performance dependent on specific use case
     Data migration. Economics of storing vs. moving (see the sketch after this list)
     Certain LBNL factors favor costs for build instead of buy
• Large storage and computation for data analysis
• GPU investigation
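
The "storing vs. moving" question comes down to simple arithmetic over rates and access patterns. The sketch below shows the comparison with placeholder prices and dataset size; none of the dollar figures come from the Amazon investigation.

```python
# Rough cost comparison: keep analysis data in the cloud vs. repeatedly
# moving it out. All prices and sizes are placeholders, not LBNL or AWS figures.

dataset_tb = 50
months = 12

storage_per_tb_month = 100.0   # $/TB-month to keep data in cloud storage
transfer_per_tb = 90.0         # $/TB to move data out over the network
transfers_per_year = 4         # how often results come back for local analysis

store_cost = dataset_tb * storage_per_tb_month * months
move_cost = dataset_tb * transfer_per_tb * transfers_per_year

print(f"Store in cloud for a year: ${store_cost:,.0f}")
print(f"Move out {transfers_per_year}x/year:      ${move_cost:,.0f}")
# Which side wins is driven entirely by these rates and access patterns,
# which is why the slide frames it as an economics question.
```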
Points of Collaboration
• UC Berkeley HPCC
  o Recent high-profile joint projects between UCB and LBNL encourage
    close collaboration
  o 25-30% of scientists have dual appointments
  o UC Berkeley's proximity to LBNL facilitates the use of cluster
    services
• University of California Shared Research Computing Services
pilot (SRCS)
  o LBNL and SDSC joint pilot for the ten UC campuses
  o Two 272-node clusters located at UC Berkeley and SDSC
  o Shared computing is more cost-effective
  o Dedicated CENIC L3 connecting network for integration
  o Pilot consists of 24 research projects