DOE Perspective on Cyberinfrastructure - LBNL
Gary Jung
Manager, High Performance Computing Services
Lawrence Berkeley National Laboratory
Educause CCI Working Group Meeting
November 5, 2009

Midrange Computing
• DOE ASCR hosted a workshop in October 2008 to assess the role of midrange computing in the Office of Science; it found that this class of computing plays an increasingly important role in enabling Office of Science research.
• Although it is not part of ASCR's mission, midrange computing and the associated data management play a vital and growing role in advancing science in disciplines where capacity is as important as capability.
• Demand for midrange computing services is:
  o growing rapidly at many sites (>30% annual growth at LBNL)
  o the direct expression of a broad scientific need
• Midrange computing is a necessary adjunct to leadership-class facilities

Berkeley Lab Computing
• Gap between the desktop and the national centers
• Midrange Computing Working Group formed in 2001
• Cluster support program started in 2002
  o Services for PI-owned clusters include: pre-purchase consulting, development of specs and RFPs, facilities planning, installation and configuration, ongoing cluster support, user services consulting, cybersecurity, and computer room colocation
• Currently 32 clusters in production, over 1400 nodes, 6500 processor cores
• Funding: the institution covers infrastructure costs and technical development; researchers pay for the cluster and the incremental cost of support

Cluster Support Phase II: Perceus Metacluster
• All clusters interconnected into a shared cluster infrastructure
  o Permits sharing of resources and storage
  o Global home file system
  o One 'super master' node used to boot nodes across all clusters; multiple system images supported
  o One master job scheduler submitting to all clusters
  o Simplifies provisioning of new systems and ongoing support
• Metacluster model made possible by Perceus software
  o Successor to Warewulf (http://www.perceus.org)
  o Can run jobs across clusters, recapturing stranded capacity (a toy routing sketch follows the Lawrencium slide below)

Laboratory-Wide Cluster - Drivers
"Computation lets us understand everything we do." – LBNL Acting Lab Director Paul Alivisatos
• 38% of scientists depend on cluster computing for research
• 69% of scientists are interested in cycles on a Lab-owned cluster
  o Early-career scientists are twice as likely to be 'very interested' as later-career peers
• Why do scientists at LBNL need midrange computing resources?
  o 'On ramp' activities in preparation for running at supercomputing centers (development, debugging, benchmarking, optimization)
  o Scientific inquiry not connected with 'on ramp' activities

Laboratory-Wide Cluster "Lawrencium"
• Overhead-funded program
  o Capital equipment dollars shifted from business computing
  o Overhead-funded staffing: 2 FTE
• Production in Fall 2008
• General-purpose Linux cluster suitable for a wide range of applications
  o 198 nodes, 1584 cores, DDR InfiniBand interconnect
  o 40TB NFS home directory storage; 100TB Lustre parallel scratch
  o Commercial job scheduler and banking system
  o #500 on the Nov 2008 Top500
• Open to all LBNL PIs and collaborators on their projects
• Users are required to complete a survey when applying for accounts and later to provide feedback on science results
• No user allocations at this time; this has been successful to date
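To make the Perceus metacluster idea above more concrete, here is a minimal sketch of what "one master scheduler submitting to all clusters" can buy: a job is routed to whichever interconnected cluster has enough idle nodes, so capacity stranded on one PI-owned cluster can be recaptured by another group's work. The cluster names, node counts, and the `pick_cluster` helper are hypothetical illustrations, not the actual Perceus or scheduler configuration at LBNL.

```python
# Hypothetical illustration of "one master scheduler, submitting to all clusters":
# jobs are routed to whichever interconnected cluster has enough idle nodes,
# so capacity stranded on a PI-owned cluster can be recaptured.
# Cluster names and numbers are made up; this is not the LBNL configuration.

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    total_nodes: int
    busy_nodes: int

    @property
    def idle_nodes(self) -> int:
        return self.total_nodes - self.busy_nodes

def pick_cluster(clusters, nodes_needed):
    """Return the cluster with the most idle nodes that can fit the job, or None."""
    candidates = [c for c in clusters if c.idle_nodes >= nodes_needed]
    return max(candidates, key=lambda c: c.idle_nodes, default=None)

if __name__ == "__main__":
    metacluster = [
        Cluster("pi_cluster_a", total_nodes=64, busy_nodes=60),
        Cluster("pi_cluster_b", total_nodes=32, busy_nodes=5),
        Cluster("lawrencium", total_nodes=198, busy_nodes=190),
    ]
    target = pick_cluster(metacluster, nodes_needed=16)
    print(f"route 16-node job to: {target.name if target else 'queue (no idle capacity)'}")
```

Without the shared infrastructure, the 16-node job in this toy example would wait on its home cluster; with a single scheduling point it lands on the cluster that happens to have idle nodes.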
Networking - LBLNet
• Peer at 10GbE with ESnet
• 10GbE at the core; moving to 10GbE to the buildings
• Goal is sustained high-speed data flows with cybersecurity
• Network-based IDS approach: traffic is innocent until proven guilty
  o Reactive firewall
  o Does not impede data flow; no stateful firewall
  o Bro cluster allows us to scale the IDS to 10GbE (a toy sketch of the reactive-block pattern appears at the end of this deck)

Communications and Governance
• General announcements at the IT Council
• Steering committees used for scientific computing
  o Small group of stakeholders, technical experts, and decision makers
  o Helps to validate and communicate decisions
  o Accountability

Challenges
• Funding (past)
  o Difficult for IT to shift funding from other areas of computing to support for science
  o Recharge can constrain adoption; full cost recovery definitely will
• New technology (ongoing)
• Facilities (current)
  o Computer room is approaching capacity despite upgrades:
    - Environmental monitoring
    - Plenum in ceiling converted to a hot-air return
    - Tricks to boost underfloor pressure
    - Water-cooled doors
  o Underway:
    - DCiE (data center infrastructure efficiency) measurement in process
    - Tower and heat exchanger replacement
    - Data center container investigation

Next Steps
• Opportunities presented by cloud computing
  o Amazon investigation earlier this year; others ongoing
    - Latency-sensitive applications ran poorly, as expected
    - Performance was dependent on the specific use case
    - Data migration: economics of storing vs. moving (a rough break-even sketch appears at the end of this deck)
    - Certain LBNL factors favor the cost of building instead of buying
• Large storage and computation for data analysis
• GPU investigation

Points of Collaboration
• UC Berkeley HPCC
  o Recent high-profile joint projects between UCB and LBNL encourage close collaboration
  o 25-30% of scientists have dual appointments
  o UC Berkeley's proximity to LBNL facilitates the use of cluster services
• University of California Shared Research Computing Services pilot (SRCS)
  o LBNL and SDSC joint pilot for the ten UC campuses
  o Two 272-node clusters located at UC Berkeley and SDSC
  o Shared computing is more cost-effective
  o Dedicated CENIC L3 network connecting the sites for integration
  o Pilot consists of 24 research projects
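As an aside on the LBLNet slide's "reactive firewall" point: the sketch below illustrates the general pattern of letting traffic flow by default and blocking a host only after the IDS has flagged it repeatedly, rather than forcing all flows through an inline stateful firewall. It is a hypothetical illustration only; the alert format, the `block_host` helper, and the threshold are invented and do not reflect Bro's actual API or LBLNet's policy.

```python
# Hypothetical sketch of a "traffic is innocent until proven guilty" reactive block:
# packets are never held up by an inline stateful firewall; instead, when the IDS
# reports enough suspicious events from a host, a block rule is pushed afterward.
# Alert format, threshold, and block mechanism are invented for illustration.

from collections import Counter

SUSPICION_THRESHOLD = 3          # assumed policy knob, not a real LBLNet setting
suspicion_counts = Counter()
blocked_hosts = set()

def block_host(ip: str) -> None:
    """Stand-in for pushing a drop rule out to the border router or ACL."""
    blocked_hosts.add(ip)
    print(f"BLOCK {ip} (reactive rule pushed)")

def handle_ids_alert(ip: str, signature: str) -> None:
    """Called once per IDS alert; traffic keeps flowing until a block is issued."""
    if ip in blocked_hosts:
        return
    suspicion_counts[ip] += 1
    print(f"alert: {signature} from {ip} ({suspicion_counts[ip]}/{SUSPICION_THRESHOLD})")
    if suspicion_counts[ip] >= SUSPICION_THRESHOLD:
        block_host(ip)

if __name__ == "__main__":
    for sig in ["ssh-bruteforce", "ssh-bruteforce", "portscan"]:
        handle_ids_alert("192.0.2.10", sig)
```

The point of the pattern is the ordering: detection happens out of band on the Bro cluster, and enforcement is applied only after the evidence accumulates, so well-behaved high-speed data flows are never delayed.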
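To unpack the "economics of storing vs. moving" bullet under Next Steps: a rough break-even sketch comparing keeping a dataset resident in cloud storage against repeatedly moving it in and out for each analysis campaign. All rates and sizes below are placeholder assumptions chosen for illustration, not 2009 Amazon pricing or LBNL cost figures.

```python
# Rough break-even sketch for "economics of storing vs moving" a dataset in the cloud.
# All rates below are placeholder assumptions, not actual AWS pricing or LBNL costs.

STORAGE_PER_GB_MONTH = 0.15   # assumed $/GB-month to keep data in cloud storage
TRANSFER_PER_GB      = 0.10   # assumed $/GB to move data in or out per analysis run

def monthly_cost_store(dataset_gb: float) -> float:
    """Keep the dataset resident in cloud storage for the whole month."""
    return dataset_gb * STORAGE_PER_GB_MONTH

def monthly_cost_move(dataset_gb: float, runs_per_month: int) -> float:
    """Upload before each analysis run and delete afterward (transfer-dominated)."""
    return dataset_gb * TRANSFER_PER_GB * runs_per_month

if __name__ == "__main__":
    dataset_gb = 5000          # assumed 5 TB analysis dataset
    for runs in (1, 2, 4):
        store = monthly_cost_store(dataset_gb)
        move = monthly_cost_move(dataset_gb, runs)
        cheaper = "store" if store < move else "move"
        print(f"{runs} run(s)/month: store=${store:,.0f}  move=${move:,.0f}  -> {cheaper}")
```

With these made-up rates the crossover comes quickly: a dataset touched more than once or twice a month is cheaper to leave in place than to keep shipping, which is the kind of comparison behind the "build instead of buy" observation on the Next Steps slide.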