TeraGrid: National CyberInfrastructure for Scientific Research
Philip Blood, Senior Scientific Specialist, Pittsburgh Supercomputing Center
April 23, 2010

What is the TeraGrid?
The TeraGrid (TG) is the world's largest open scientific discovery infrastructure, providing:
• Computational resources
• Data storage, access, and management
• Visualization systems
• Specialized gateways for scientific domains
• Centralized user services, allocations, and usage tracking
• All connected via high-performance networks

TeraGrid Governance
• 11 Resource Providers (RPs) funded under individual agreements with NSF
  – Mostly different start and end dates, goals, and funding models
• 1 Coordinating Body – the Grid Integration Group (GIG)
  – University of Chicago/Argonne
  – Subcontracts to all RPs and six other universities
  – ~10 Area Directors lead coordinated work across TG
  – ~18 Working Groups with members from many RPs work on day-to-day issues
  – RATs formed to handle short-term issues
• The TeraGrid Forum sets policies and is responsible for the overall TeraGrid
  – Each RP and the GIG votes in the TG Forum
Slide courtesy of Dan Katz

Who Uses TeraGrid (2008)

TeraGrid Objectives
• DEEP Science: Enabling Petascale Science
  – Make science more productive through an integrated set of advanced computational resources
  – Address key challenges prioritized by users
• WIDE Impact: Empowering Communities
  – Bring TeraGrid capabilities to the broad science community
  – Partnerships with community leaders; "Science Gateways" to make access easier
• OPEN Infrastructure, OPEN Partnership
  – Provide a coordinated, general-purpose, reliable set of services
  – Free and open to the U.S. scientific research community and their international partners

Introduction to the TeraGrid
• TG Portal and Documentation
• Compute & Visualization Resources
  – More than 1.5 petaflops of computing power
• Data Resources
  – Allocations of data storage facilities can be obtained
  – Over 100 Scientific Data Collections made available to communities
• Science Gateways
• How to Apply for TG Services & Resources
• User Support & Successes
  – Central point of contact for support of all systems
  – Personal User Support Contact
  – Advanced Support for TeraGrid Applications (ASTA)
  – Education and training events and resources

TeraGrid User Portal
Web-based single point of contact for:
• Access to your TeraGrid accounts and allocated resources
• Interfaces for data management, data collections, and other user tasks and resources
• Access to the TeraGrid Knowledge Base, Help Desk, and online training
TeraGrid User Portal: portal.teragrid.org
Many features (certain resources, documentation, training, consulting, allocations) do not require a portal account!

portal.teragrid.org → Documentation
Find Information about TeraGrid: www.teragrid.org
• Click the "Knowledge Base" link for quick answers to technical questions
• Click the "User Info" link to go to www.teragrid.org → User Support
• Science Highlights
• News and press releases
• Education, outreach and training events and resources

portal.teragrid.org → Resources
Resources by Category
• Shows the status of systems currently in production
• Click on names for more info on each resource

www.teragrid.org → User Support → Resources
Resources by Site
• Complete listing of TG resources (including those not yet running)
• Scroll through the list to see details on each resource
• Click on icons to go to user guides for each resource

A few examples of different types of TG resources...
Slide courtesy of Dan Katz

Massively Parallel Resources
• Ranger@TACC
  – First NSF "Track 2" HPC system
  – 504 TF
  – 15,744 quad-core AMD Opteron processors
  – 123 TB memory, 1.7 PB disk
• Kraken@NICS (UT/ORNL)
  – Second NSF "Track 2" HPC system
  – 1 PF Cray XT5 system
  – 16,512 compute sockets, 99,072 cores
  – 129 TB memory, 3.3 PB disk
• Blue Waters@NCSA
  – NSF Track 1
  – 10 PF peak
  – Coming in 2011
Slide courtesy of Dan Katz

Shared Memory Resources
• Pople@PSC
  – SGI Altix system
  – 768 Itanium 2 cores
  – 1.5 TB global shared memory
  – Primarily for large shared-memory and hybrid applications
• Nautilus@NICS
  – SGI UltraViolet
  – 1024 cores (Intel Nehalem)
  – 16 GPUs
  – 4 TB global shared memory
  – 1 PB file system
  – Visualization and analysis
• Ember@NCSA (coming in September)
  – SGI UltraViolet (1536 Nehalem cores)

Visualization & Analysis Resources
• Longhorn@TACC
  – Dell/NVIDIA Visualization and Data Analysis Cluster
  – A hybrid CPU/GPU system
  – Designed for remote, interactive visualization and data analysis, but it also supports production, compute-intensive calculations on both the CPUs and GPUs via off-hour queues
• TeraDRE@Purdue
  – Subcluster featuring NVIDIA GeForce 6600GT GPUs
  – Used for rendering graphics with Maya, POV-Ray, and Blender (among others)
• Spur@TACC
  – Sun Visualization Cluster
  – 128 compute cores / 32 NVIDIA FX5600 GPUs
  – Intended for serial and parallel visualization applications that take advantage of large per-node memory, multiple computing cores, and multiple graphics processors

Other Specialized TG Resources
• Data-Intensive Computing: Dash@SDSC
  – 64 Intel Nehalem compute nodes (512 cores)
  – 4 I/O nodes (32 cores)
  – vSMP (virtual shared memory): aggregates memory across 16 nodes, allowing applications to address 768 GB
  – 4 TB of flash memory (fast file I/O subsystem or fast virtual memory)
• Heterogeneous CPU/GPU Computing: Lincoln@NCSA
  – 192 Dell PowerEdge 1950 nodes (1536 cores)
  – 96 NVIDIA Tesla S1070
• High-Throughput Computing: Condor Pool@Purdue
  – Pool of more than 27,000 processors
  – Various architectures and operating systems
  – Excellent for parameter sweeps and serial applications

Data Storage Resources
• Global File System
  – GPFS-WAN: 700 TB disk storage at SDSC, historically mounted at a few TG sites; licensing issues prevent further use
  – Data Capacitor (Lustre-WAN): mounted on a growing number of TG systems; 535 TB storage at IU, including databases; ongoing work to improve performance and the authentication infrastructure; another Lustre-WAN implementation is being built by PSC
  – pNFS is a possible path for global file systems, but it is far from being viable for production
• Data Collections
  – Allocable storage at SDSC and IU (files, databases) for collections used by communities
• Tape Storage
  – Allocable resources available at IU, NCAR, NCSA, SDSC
  – Most sites provide "free" archival tape storage with compute allocations
  – Access is generally through GridFTP (through the portal or the command line); a command-line transfer sketch follows at the end of this section
Adapted from slide by Dan Katz

portal.teragrid.org → Resources → Data Collections
Data collections represent permanent data storage that is organized, searchable, and available to a wide audience, either for a collaborative group or for the scientific public in general.
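As noted under Tape Storage above, archive and collection access is generally through GridFTP. As an illustration only, here is a minimal sketch of driving such a transfer from a script. It assumes the Globus Toolkit's globus-url-copy client is installed and that a valid grid proxy already exists; the hostname and paths are hypothetical placeholders, not actual TeraGrid endpoints.

```python
# Minimal GridFTP transfer sketch (illustrative; hostname and paths are hypothetical).
import subprocess

# Pull a result file from a (hypothetical) RP GridFTP server to local disk.
src = "gsiftp://gridftp.example-rp.teragrid.org/~/results/output.tar"
dst = "file:///home/username/output.tar"

# -vb reports transfer performance; -p 4 opens four parallel TCP streams,
# which usually helps on wide-area links such as the TeraGrid backbone.
subprocess.run(["globus-url-copy", "-vb", "-p", "4", src, dst], check=True)
```

Third-party transfers work the same way with gsiftp:// URLs on both sides, so data moves directly between two servers instead of being staged through the submitting machine; the TGUP Data Mover described later uses that mechanism.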
What is a Science Gateway?
• A Science Gateway
  – Enables scientific communities of users with a common scientific goal
  – Uses high-performance computing
  – Has a common interface
  – Leverages community investment
• Three common forms:
  – Web-based portals
  – Application programs running on users' machines but accessing services in TeraGrid
  – Coordinated access points enabling users to move seamlessly between TeraGrid and other grids

How can a Gateway help?
• Make science more productive
  – Researchers use the same tools
  – Complex workflows
  – Common data formats
  – Data sharing
• Bring TeraGrid capabilities to the broad science community
  – Lots of disk space
  – Lots of compute resources
  – Powerful analysis capabilities
  – A nice interface to information

Gateway Highlight: nanoHUB Harnesses TeraGrid for Education
• Nanotechnology education
• Used in dozens of courses at many universities
• Teaching materials
• Collaboration space
• Research seminars
• Modeling tools
• Access to cutting-edge research software

Gateway Highlight: SCEC Produces Hazard Map
• PSHA hazard map for California using the newly released Earthquake Rupture Forecast (UCERF 2.0), calculated using the SCEC Science Gateway
• Warm colors indicate regions with a high probability of experiencing strong ground motion in the next 50 years
• High-resolution map, significant CPU use

How can I build a gateway?
• Web information available: www.teragrid.org/programs/sci_gateways
  – How to turn your project into a science gateway
  – Details about current gateways
  – Link to writing a winning gateway proposal for a TeraGrid allocation
• Download code and instructions
  – "Building a simple gateway" tutorial (a toy sketch of the basic idea follows at the end of this section)
• Talk to us
  – Biweekly telecons to get advice from others
  – Potential assistance from TeraGrid staff
  – Nancy Wilkins-Diehr, wilkinsn@sdsc.edu
  – Vickie Lynch, lynchve@ornl.gov

Some Current Science Gateways
• Biology and Biomedicine Science Gateway
• Open Life Sciences Gateway
• The Telescience Project
• Grid Analysis Environment (GAE)
• Neutron Science Instrument Gateway
• TeraGrid Visualization Gateway, ANL
• BIRN
• Open Science Grid (OSG)
• Special PRiority and Urgent Computing Environment (SPRUCE)
• National Virtual Observatory (NVO)
• Linked Environments for Atmospheric Discovery (LEAD)
• Computational Chemistry Grid (GridChem)
• Computational Science and Engineering Online (CSE-Online)
• GEON (GEOsciences Network)
• Network for Earthquake Engineering Simulation (NEES)
• SCEC Earthworks Project
• Network for Computational Nanotechnology and nanoHUB
• GIScience Gateway (GISolve)
• Gridblast Bioinformatics Gateway
• Earth Systems Grid
• Astrophysical Data Repository (Cornell)
Slide courtesy of Nancy Wilkins-Diehr

portal.teragrid.org → Resources → Science Gateways
Explore current TG science gateways from the TG User Portal
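To make the "common interface" idea above concrete, here is a toy sketch (not the code or tutorial distributed by TeraGrid) of the core step most web-portal gateways perform: turning a few form-style parameters into a batch job script. The function name, application binary, queue, and node geometry are hypothetical placeholders.

```python
# Toy gateway sketch: translate web-form parameters into a PBS batch script.
# All names here (make_batch_script, community_app, queue "normal") are illustrative.
def make_batch_script(job_name: str, cores: int, hours: int, input_file: str) -> str:
    # Assume 16-core nodes for this sketch; a real gateway would query the RP.
    nodes = max(1, cores // 16)
    return f"""#!/bin/bash
#PBS -N {job_name}
#PBS -l nodes={nodes}:ppn=16
#PBS -l walltime={hours}:00:00
#PBS -q normal
cd $PBS_O_WORKDIR
mpirun -np {cores} ./community_app {input_file}
"""

# What a gateway user might type into a web form:
print(make_batch_script("nanowire-sweep", cores=64, hours=6, input_file="run01.in"))
```

The gateway's value is that the user never sees the script: the same form drives runs on whichever TeraGrid system the community has an allocation on.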
How One Uses TeraGrid
[Diagram: 1. Get an allocation for your project via POPS; the allocation PI adds users. 2. Use TeraGrid resources (compute services – HPC, HTC, CPUs, GPUs, VMs – plus visualization and data services at the RPs) through the User Portal, Science Gateways, or the command line, all tied together by TeraGrid infrastructure for accounting, networking, and authorization.]
Slide courtesy of Dan Katz

How to Get Started on the TeraGrid
Two methods:
• Direct
  – PIs on allocations must be researchers at US institutions (postdocs can be PIs, but not graduate students)
  – Decide which systems you wish to apply for: read the resource descriptions on www.teragrid.org, and send email to help@teragrid.org with questions
  – Create a login on the proposal system (POPS)
  – Apply for a Startup allocation on POPS
    • The total allocation on all resources must not exceed 200K SUs (core-hours)
    • Each machine has an individual limit for Startup allocations (check the resource description)
• Campus Champion
  – Contact your Campus Champion and discuss your computing needs
  – Send them your contact information and you'll be added to the CC account
  – Experiment with various TG systems to discover the best ones for you
  – Apply for your own Startup account via the "Direct" method

Proposal System
https://pops-submit.teragrid.org
[POPS welcome page: a "What's New" area, references to the guide, policies, and resources, a how-to-use summary, and deadline dates.]

You can get a lot for a little!
By submitting an abstract and your CV and filling out a form, you get:
• A Startup allocation
  – Up to 200,000 SUs (core-hours) on TG systems for one year
  – That is the equivalent of 8,333 days (22.8 years) of processing time on a single core! (A small worked example of SU accounting follows below.)
• Access to consulting from TeraGrid personnel regarding your computational challenges
• The opportunity to apply for Advanced Support
  – Requires an additional one-page justification of your need for advanced support
  – Can be done together with your Startup request, or at any time after that
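A small worked example of the allocation arithmetic quoted above, treating 1 SU as 1 core-hour as the slide does; the 256-core, 12-hour job is purely an illustrative figure, not a recommended size.

```python
# Startup-allocation arithmetic (1 SU = 1 core-hour, per the slide above).
STARTUP_SUS = 200_000

# Equivalent single-core time, as quoted in the slide.
days = STARTUP_SUS / 24          # ~8333 days
years = days / 365               # ~22.8 years
print(f"{days:.0f} days, {years:.1f} years on a single core")

# Common charging model: SUs consumed = cores x wallclock hours.
cores, hours = 256, 12           # hypothetical job size
job_cost = cores * hours         # 3072 SUs
print(f"one {cores}-core, {hours}-hour job costs {job_cost} SUs "
      f"({STARTUP_SUS // job_cost} such jobs fit in a Startup allocation)")
```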
Access to resources
• Terminal: ssh, gsissh
• Portal: TeraGrid User Portal, Gateways
  – Once logged in to the portal, click on "Login"
• Also, single sign-on (SSO) from the command line
Slide courtesy of Dan Katz

TGUP Data Mover
• Drag-and-drop Java applet in the user portal
  – Uses GridFTP, third-party transfers, RESTful services, etc.
Slide courtesy of Dan Katz

Need Help?
• First, try searching the Knowledge Base or other documentation
• Submit a ticket
  – Send an email to help@teragrid.org
  – Use the TeraGrid User Portal "Consulting" tab
• You can also call the TeraGrid Help Desk 24/7: 1-866-907-2383

A User Experience: TeraGrid Support Enabling New Science
• Jan. 2004: DL_POLY, ~13,000 atoms; NAMD: 740,000 atoms (60x larger!)
• TeraGrid to the rescue – Fall 2004: granted allocations at PSC, NCSA, SDSC
Where to Run? EVERYWHERE
• Minimize/pre-equilibrate on the ANL IA-64 cluster (high availability/long queue time)
• Smaller simulations (350,000 atoms) on the NCSA IA-64s (and later Lonestar)
• Large simulations (740,000 atoms) on highly scalable systems: the PSC XT3 and SDSC DataStar
• TeraGrid infrastructure was critical for
  – Archiving data
  – Moving data between sites
  – Analyzing data on the TeraGrid
Result: opened up a new phenomenon to investigation through simulation
Membrane Remodeling
Blood, P.D. and Voth, G.A., Proc. Natl. Acad. Sci. 103, 15068 (2006).

Personalized User Support: Critical to Success
• Contacted by TG Support to determine needs
• Worked closely with TG Support on the (then) new XT3 architecture to keep runs going during the initial stages of machine operation
• TG worked closely with the application developers to find problems with the code and improve performance
• The same pattern is established throughout TeraGrid for User Support
• Special advanced support can be obtained for deeper needs (improving codes, workflows, etc.) by applying in POPS (and providing a description of needs)

Applying for Advanced Support
• Go to teragrid.org → Help & Support
• Look at the criteria and write a one-page justification
• Submit with your regular proposal, or as a supplement later on

Advanced Support: Improving Parallel Performance of Protein Folding Simulations
The UNRES molecular dynamics (MD) code utilizes a carefully derived mesoscopic protein force field to study and predict protein folding pathways by means of molecular dynamics simulations.
http://www.chem.cornell.edu/has/

Load Imbalance Detection in UNRES
• Looking only at time spent in the important MD phase, multiple causes of load imbalance were observed, as well as a serial bottleneck (a minimal per-rank timing sketch illustrating the idea follows at the end of this section)
• In this case, the developers were unaware that the chosen algorithm would create load imbalance
• They re-examined the available algorithms and found one with much better load balance – also faster in serial!
• The serial function causing the bottleneck was also parallelized

Major Serial Bottleneck and Load Imbalance in UNRES Eliminated
• After looking at the performance profiling done by PSC, the developers discovered that they could use an algorithm with much better load balance and faster serial performance
• The code now runs 4x faster!
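The profiling PSC performed for UNRES is not reproduced here, but the underlying measurement is simple to sketch: time the phase of interest on every MPI rank and compare the slowest rank to the mean. A minimal illustration using mpi4py follows, with a placeholder standing in for the timed MD phase; it is not the actual UNRES or PSC tooling.

```python
# Per-rank timing sketch for spotting load imbalance (illustrative only).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def compute_forces():
    # Stand-in for the expensive phase being profiled (hypothetical).
    pass

t0 = MPI.Wtime()
compute_forces()
elapsed = MPI.Wtime() - t0

# Gather every rank's time for this phase on rank 0 and compare.
times = comm.gather(elapsed, root=0)
if rank == 0:
    mean_t = sum(times) / len(times)
    max_t = max(times)
    ratio = max_t / mean_t if mean_t > 0 else float("inf")
    print(f"mean {mean_t:.3f}s  max {max_t:.3f}s  imbalance (max/mean) {ratio:.2f}")
```

An imbalance ratio of, say, 1.5 means the slowest rank takes 50% longer than average, so the faster ranks sit idle at each synchronization point; recovering that wasted time is exactly what switching UNRES to a better-balanced algorithm achieved.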
TG App: Predicting Storms
• Hurricanes and tornadoes cause massive loss of life and damage to property
• TeraGrid supported the spring 2007 NOAA and University of Oklahoma Hazardous Weather Testbed
  – Major goal: assess how well ensemble forecasting predicts thunderstorms, including the supercells that spawn tornadoes
  – Nightly reservation at PSC, spawning jobs at NCSA as needed for details
  – Input, output, and intermediate data transfers
  – Delivers "better than real time" prediction
  – Used 675,000 CPU hours for the season
  – Used 312 TB of HPSS storage at PSC
Slide courtesy of Dennis Gannon, ex-IU, and the LEAD Collaboration

App: GridChem
• Different licensed applications with different queues
• Will be scheduled for workflows
Slide courtesy of Joohyun Kim

Apps: Genius and Materials
• Fully atomistic simulations of clay-polymer nanocomposites (LAMMPS on TeraGrid)
• Modeling blood flow before (during?) surgery (HemeLB on LONI)
Why cross-site / distributed runs?
1. Rapid turnaround: conglomeration of idle processors to run a single large job
2. Run big-compute and big-memory jobs not possible on a single machine
Slide courtesy of Steven Manos and Peter Coveney

TeraGrid Annual Conference
• Showcases the capabilities, achievements, and impact of TeraGrid in research
• Presentations, demos, posters, visualizations
• Tutorials, training, and peer support
• Student competitions and volunteer opportunities
• www.teragrid.org/tg10

Campus Champions Program
• Source of local, regional, and national high-performance computing and cyberinfrastructure information at the home campus
• Source of information about TeraGrid resources and services that will benefit their campus
• Source of startup accounts to quickly get researchers and educators using their allocation of time on TeraGrid resources
• Direct access to TeraGrid staff
www.teragrid.org/web/eot/campus_champions

TeraGrid HPC Education and Training
• Workshops, institutes, and seminars on high-performance scientific computing
• Hands-on tutorials on porting and optimizing code for TeraGrid systems
• Online self-paced tutorials
• High-impact educational and visual materials suitable for K–12, undergraduate, and graduate classes
www.teragrid.org/web/eot/workshops

HPC University
• A virtual organization to advance researchers' HPC skills
  – Catalog of live and self-paced training
  – Schedule series of training courses
  – Gap analysis of materials to drive development
• Work with educators to enhance the curriculum
  – Search the catalog of HPC resources
  – Schedule workshops for curricular development
  – Leverage the good work of others
• Offer student research experiences
  – Enroll in HPC internship opportunities
  – Offer student competitions
• Publish science and education impact
  – Promote via TeraGrid Science Highlights and iSGTW
  – Publish education resources to NSDL-CSERD
http://hpcuniv.org/

Sampling of Training Topics Offered
• HPC Computing
  – Introduction to Parallel Computing
  – Toward Multicore Petascale Applications
  – Scaling Workshop: Scaling to Petaflops
  – Effective Use of Multi-core Technology
  – TeraGrid-Wide BlueGene Applications
• Domain-specific Sessions
  – Petascale Computing in the Biosciences
• Visualization
  – Introduction to Scientific Visualization
  – Remote/Collaborative TeraScale Visualization on the TeraGrid
• Other Topics
  – Rocks Linux Cluster Workshop
  – LCI International Conference on HPC Clustered Computing
• Over 30 online asynchronous tutorials

Broaden Awareness through CI Days
• Work with campuses to develop leadership in promoting CI to accelerate scientific discovery
• Catalyze campus-wide and regional discussions and planning
• A collaboration of Open Science Grid, Internet2, National LambdaRail, EDUCAUSE, the Minority Serving Institution Cyberinfrastructure Empowerment Coalition, TeraGrid, and local and regional organizations
• Identify Campus Champions
http://www.cidays.org

For More Information
YOUR Campus Champion: Levent Yilmaz, slyilmaz at pitt.edu
www.teragrid.org
www.nsf.gov/oci/
http://cidays.org
help@teragrid.org

Questions?