TeraGrid
National CyberInfrastructure for
Scientific Research
Philip Blood
Senior Scientific Specialist
Pittsburgh Supercomputing Center
April 23, 2010
What is the TeraGrid?
The TeraGrid (TG) is the world’s largest open scientific discovery infrastructure, providing:
• Computational resources
• Data storage, access, and management
• Visualization systems
• Specialized gateways for scientific domains
• Centralized user services, allocations, and usage tracking
• All connected via high-performance networks
TeraGrid Governance
• 11 Resource Providers (RPs) funded under individual agreements with NSF
– Mostly different: start and end dates, goals, and funding models
• 1 Coordinating Body – Grid Integration Group (GIG)
– University of Chicago/Argonne
– Subcontracts to all RPs and six other universities
– ~10 Area Directors lead coordinated work across TG
– ~18 Working Groups with members from many RPs work on day-to-day issues
– RATs formed to handle short-term issues
• TeraGrid Forum sets policies and is responsible for the overall TeraGrid
– Each RP and the GIG votes in the TG Forum
Slide courtesy of Dan Katz
Who Uses TeraGrid (2008)
TeraGrid Objectives
• DEEP Science: Enabling Petascale Science
– Make science more productive through integrated set of advanced
computational resources
• Address key challenges prioritized by users
• WIDE Impact: Empowering Communities
– Bring TeraGrid capabilities to the broad science community
• Partnerships with community leaders, “Science Gateways” to make
access easier
• OPEN Infrastructure, OPEN Partnership
– Provide a coordinated, general purpose, reliable set of services
• Free and open to U.S. scientific research community and their
international partners
Introduction to the TeraGrid
• TG Portal and Documentation
• Compute & Visualization Resources
– More than 1.5 petaflops of computing power
• Data Resources
– Can obtain allocations of data storage facilities
– Over 100 Scientific Data Collections made available to communities
• Science Gateways
• How to Apply for TG Services & Resources
• User Support & Successes
– Central point of contact for support of all systems
– Personal user support contact
– Advanced Support for TeraGrid Applications (ASTA)
– Education and training events and resources
TeraGrid User Portal
Web-based single point of contact for:
• Access to your TeraGrid accounts and allocated
resources
• Interfaces for data management, data
collections, and other user tasks and resources
• Access to TeraGrid Knowledge Base, Help
Desk, and online training
TeraGrid User Portal:
portal.teragrid.org
Many features (certain resources, documentation, training,
consulting, allocations) do not require a portal account!
portal.teragrid.org Documentation
Find Information about TeraGrid
www.teragrid.org
• Click “Knowledge Base” link for
quick answers to technical
questions
• Click “User Info” link to go to
www.teragrid.org → User Support
• Science Highlights
• News and press releases
• Education, outreach and training
events and resources
portal.teragrid.orgResources
Resources by Category
• Shows status of
systems currently in
production
• Click on names for
more info on each
resource
www.teragrid.orgUser SupportResources
Resources by Site
• Complete listing of
TG resources
(including those not
yet running)
• Scroll through list to
see details on each
resource
• Click on icons to go
to user guides for each
resource
A few examples of different types of TG
resources...
Slide courtesy of Dan Katz
Massively Parallel Resources
• Ranger@TACC
– First NSF ‘Track2’ HPC system
– 504 TF
– 15,744 Quad-Core AMD
Opteron processors
– 123 TB memory, 1.7 PB disk
• Kraken@NICS (UT/ORNL)
– Second NSF ‘Track2’ HPC system
– 1 PF Cray XT5 system
– 16,512 compute sockets, 99,072 cores
– 129 TB memory, 3.3 PB disk
• Blue Waters@NCSA
– NSF Track 1
– 10 PF peak
– Coming in 2011
Slide courtesy of Dan Katz
Shared Memory Resources
• Pople@PSC:
– SGI Altix system
– 768 Itanium 2 cores
– 1.5 TB global shared memory
– Primarily for large shared memory and hybrid applications
• Nautilus@NICS
– SGI UltraViolet
– 1024 cores (Intel Nehalem)
– 16 GPUs
– 4 TB global shared memory
– 1 PB file system
– Visualization and analysis
• Ember@NCSA (coming in September)
– SGI UltraViolet (1536 Nehalem cores)
Visualization & Analysis Resources
• Longhorn@TACC:
– Dell/NVIDIA Visualization and Data Analysis Cluster
– A hybrid CPU/GPU system
– Designed for remote, interactive visualization and data analysis, but it
also supports production, compute-intensive calculations on both the
CPUs and GPUs via off-hour queues
• TeraDRE@Purdue:
– Subcluster featuring NVIDIA GeForce 6600GT GPUs
– Used for rendering graphics with Maya, POV-ray, and Blender (among
others)
• Spur@TACC:
– Sun Visualization Cluster
– 128 compute cores / 32 NVIDIA FX5600 GPUs
– Spur is intended for serial and parallel visualization applications that take
advantage of large per-node memory, multiple computing cores, and
multiple graphics processors.
Other Specialized TG Resources
Data-Intensive Computing
• Dash@SDSC:
– 64 Intel Nehalem compute nodes (512 cores)
– 4 I/O nodes (32 cores)
– vSMP (virtual shared memory)
• Aggregates memory across 16 nodes
• Allows applications to address 768 GB
– 4 TB of Flash memory
• (fast file I/O subsystem or fast virtual memory)
Heterogeneous CPU/GPU Computing
• Lincoln@NCSA:
– 192 Dell PowerEdge 1950 nodes (1536 cores)
– 96 NVIDIA Tesla S1070
High Throughput Computing
• Condor Pool@Purdue
– Pool of more than 27,000 processors
– Various Architectures and OS
– Excellent for parameter sweeps, serial applications
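For sweeps like this, a Condor run is described in a submit file that queues one job per parameter value. Below is a minimal sketch, not TeraGrid-specific code: the executable my_sim and its --alpha flag are hypothetical stand-ins, and it assumes HTCondor’s condor_submit is available on the pool’s submit host.

```python
# Generate an HTCondor submit description for a parameter sweep.
# "my_sim" and its "--alpha" flag are hypothetical stand-ins.
params = [0.1, 0.2, 0.5, 1.0, 2.0]

lines = [
    "universe   = vanilla",
    "executable = my_sim",
    "log        = sweep.log",
]
for p in params:
    lines += [
        f"arguments = --alpha {p}",
        f"output    = out.alpha_{p}.txt",
        f"error     = err.alpha_{p}.txt",
        "queue",    # queue one job with the settings above
    ]

with open("sweep.sub", "w") as f:
    f.write("\n".join(lines) + "\n")
# Submit the whole sweep with: condor_submit sweep.sub
```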
Data Storage Resources
• Global File System
– GPFS-WAN
• 700 TB disk storage at SDSC, historically mounted at a few TG sites
• Licensing issues prevent further use
– Data Capacitor (Lustre-WAN)
• Mounted on a growing number of TG systems
• 535 TB storage at IU, including databases
• Ongoing work to improve performance and authentication infrastructure
• Another Lustre-WAN implementation being built by PSC
– pNFS is a possible path for global file systems, but is far from being viable for production
• Data Collections
– Allocable storage at SDSC and IU (files, databases) for collections used by
communities
• Tape Storage
– Allocable resources available at IU, NCAR, NCSA, SDSC
– Most sites provide “free” archival tape storage with compute allocations
• Access is generally through GridFTP (through portal or command-line)
Adapted from slide by Dan Katz
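Since archive and bulk data access generally goes through GridFTP, transfers can also be scripted. A minimal sketch, assuming the Globus globus-url-copy client and a valid grid proxy (e.g., from myproxy-logon) are in place; the host name and paths are hypothetical:

```python
import subprocess

# Hypothetical endpoints; substitute a real TG GridFTP host and paths.
SRC = "file:///home/user/results/trajectory.dcd"
DST = "gsiftp://gridftp.example-rp.teragrid.org/archive/user/trajectory.dcd"

def gridftp_copy(src, dst, streams=4):
    """Copy one file with globus-url-copy, using parallel TCP streams."""
    subprocess.check_call(["globus-url-copy", "-vb", "-p", str(streams), src, dst])

if __name__ == "__main__":
    # A gsiftp:// source paired with a gsiftp:// destination would instead
    # request a third-party transfer between two GridFTP servers.
    gridftp_copy(SRC, DST)
```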
portal.teragrid.orgResources
Data Collections
Data collections represent permanent data storage that is organized, searchable, and available to a wide audience, either for a collaborative group or for the scientific public in general.
What is a Science Gateway?
• A Science Gateway
– Enables scientific communities of users
with a common scientific goal
– Uses high performance computing
– Has a common interface
– Leverages community investment
• Three common forms:
– Web-based portals
– Application programs running on users'
machines but accessing services in
TeraGrid
– Coordinated access points enabling
users to move seamlessly between
TeraGrid and other grids
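To make the “common interface” idea concrete, here is a hedged sketch of the pattern a gateway back end follows: turn a few form parameters into a batch job. The executable, queue layout, and 8-cores-per-node figure are all hypothetical, not actual TeraGrid gateway code.

```python
import subprocess
import tempfile

def submit_gateway_job(input_file, cores=16, walltime="01:00:00"):
    """Build a PBS batch script from web-form parameters and submit it.

    A real gateway also handles authentication, data staging, and job
    monitoring; this shows only the submission step. Assumes 8-core nodes.
    """
    script = f"""#!/bin/bash
#PBS -l nodes={max(1, cores // 8)}:ppn=8
#PBS -l walltime={walltime}
cd $PBS_O_WORKDIR
./community_app {input_file}   # hypothetical community executable
"""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        path = f.name
    # qsub prints the new job ID on stdout.
    return subprocess.check_output(["qsub", path]).decode().strip()
```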
How can a Gateway help?
• Make science more productive
– Researchers use same tools
– Complex workflows
– Common data formats
– Data sharing
• Bring TeraGrid capabilities to the broad science community
– Lots of disk space
– Lots of compute resources
– Powerful analysis capabilities
– A nice interface to information
Gateway Highlight
nanoHUB Harnesses TeraGrid for Education
• Nanotechnology education
• Used in dozens of courses
at many universities
• Teaching materials
• Collaboration space
• Research seminars
• Modeling tools
• Access to cutting edge
research software
Gateway Highlight
SCEC Produces Hazard Map
• PSHA hazard map for
California using newly
released Earthquake Rupture
Forecast (UCERF2.0)
calculated using SCEC
Science Gateway
• Warm colors indicate regions
with a high probability of
experiencing strong ground
motion in the next 50 years.
• High resolution map,
significant CPU use
How can I build a gateway?
• Web information available:
www.teragrid.org/programs/sci_gateways
– How to turn your project into a science gateway
– Details about current gateways
– Guidance on writing a winning gateway proposal for a
TeraGrid allocation
• Download code and instructions
– Building a simple gateway tutorial
• Talk to us
– Biweekly telecons to get advice from others
– Potential assistance from TeraGrid staff
– Nancy Wilkins-Diehr, wilkinsn@sdsc.edu
– Vickie Lynch, lynchve@ornl.gov
Some Current Science Gateways
• Biology and Biomedicine Science Gateway
• Open Life Sciences Gateway
• The Telescience Project
• Grid Analysis Environment (GAE)
• Neutron Science Instrument Gateway
• TeraGrid Visualization Gateway, ANL
• BIRN
• Open Science Grid (OSG)
• Special PRiority and Urgent Computing
Environment (SPRUCE)
• National Virtual Observatory (NVO)
• Linked Environments for Atmospheric
Discovery (LEAD)
• Computational Chemistry Grid (GridChem)
• Computational Science and Engineering
Online (CSE-Online)
• GEON (GEOsciences Network)
• Network for Earthquake Engineering
Simulation (NEES)
• SCEC Earthworks Project
• Network for Computational Nanotechnology
and nanoHUB
• GIScience Gateway (GISolve)
• Gridblast Bioinformatics Gateway
• Earth Systems Grid
• Astrophysical Data Repository (Cornell)
Slide courtesy of Nancy Wilkins-Diehr
portal.teragrid.orgResources
Science Gateways
Explore current TG
science gateways from
the TG User Portal
How One Uses TeraGrid
[Diagram: 1. Get an allocation for your project through POPS. 2. The PI adds users to the allocation. 3. Use TeraGrid resources at the Resource Providers – compute services (HPC, HTC, CPUs, GPUs, VMs), visualization services, and data services – accessed via the User Portal, Science Gateways, or the command line, all tied together by TeraGrid infrastructure (accounting, network, authorization).]
Slide courtesy of Dan Katz
How to Get Started on the TeraGrid
Two Methods:
• Direct
– PIs on allocations must be researchers at U.S. institutions
• postdocs can be PIs, but not graduate students
– Decide which systems you wish to apply for
• Read resource descriptions on www.teragrid.org
• Send email to help@teragrid.org with questions
– Create a login on the proposal system (POPS)
– Apply for a Startup allocation on POPS
• Total allocation on all resources must not exceed 200K SUs (core-hours)
• Each machine has an individual limit for Startup allocations (check resource
description)
• Campus Champion
– Contact your Campus Champion and discuss your computing needs
– Send your contact information and you’ll be added to the CC account
– Experiment with various TG systems to discover the best ones for you
– Apply for your own Startup account via the “Direct” method
Proposal System
https://pops-submit.teragrid.org
[Screenshot of the POPS front page: a “What’s New” area, welcome text with references to guides, policies, and resources, a how-to-use summary, and deadline dates.]
You can get a lot for a little!
By submitting an abstract, your CV, and filling out a form,
you get:
• A Startup allocation
– Up to 200,000 SUs (core hours) on TG systems for one year
– That is the equivalent of 8,333 days (22.8 years) of processing time on a single core (see the quick check below)
• Access to consulting from TeraGrid personnel regarding
your computational challenges
• Opportunity to apply for Advanced Support
– Requires additional 1 page justification of your need for advanced
support
– Can be done together with your Startup request, or at any time after that
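A quick check of that arithmetic in plain Python:

```python
# 200,000 core-hours expressed as single-core wall time.
sus = 200_000          # service units, i.e., core-hours
days = sus / 24        # hours -> days
years = days / 365.25  # days -> years
print(f"{days:,.0f} days, {years:.1f} years")  # -> 8,333 days, 22.8 years
```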
Access to resources
• Terminal: ssh, gsissh
• Portal: TeraGrid user
portal, Gateways
– Once logged in to
portal, click on “Login”
• Also, SSO from
command-line
Slide courtesy of Dan Katz
TGUP Data Mover
• Drag-and-drop Java applet in the user portal
– Uses GridFTP, 3rd-party transfers, RESTful services, etc.
Slide courtesy of Dan Katz
Need Help?
• First, try searching the Knowledge Base or
other Documentation
• Submit a ticket
– Send an email to help@teragrid.org
– Use the TeraGrid User Portal ‘Consulting’ tab
• Can also call TeraGrid Help Desk 24/7:
1-866-907-2383
A User Experience: TeraGrid
Support Enabling New Science
Jan. 2004: DL_POLY, ~13,000 atoms
NAMD: 740,000 atoms – 60× larger!
TeraGrid to the Rescue
Fall 2004: Granted allocations at PSC, NCSA, SDSC
Where to Run?
• EVERYWHERE
• Minimize/pre-equilibrate on ANL IA-64 (high availability/long queue time)
• Smaller simulations (350,000 atoms) on NCSA IA-64 systems (and later Lonestar)
• Large simulations (740,000 atoms) on highly scalable systems: PSC XT3 and SDSC DataStar
• TeraGrid infrastructure critical
– Archiving data
– Moving data between sites
– Analyzing data on the TeraGrid
Result: opened a new phenomenon – membrane remodeling – to investigation through simulation
Blood, P.D. and Voth, G.A. Proc. Natl. Acad. Sci. 103, 15068 (2006).
Personalized User Support:
Critical to Success
• Contacted by TG Support to determine needs
• Worked closely with TG Support on (then) new XT3
architecture to keep runs going during initial stages of
machine operation
• TG worked closely with application developers to find
problems with the code and improve performance
• Same pattern established throughout TeraGrid for User
Support
• Special advanced support can be obtained for deeper needs
(improving codes/workflows/etc.) by applying in POPS (and
providing description of needs)
Applying for Advanced Support
• Go to teragrid.org → Help & Support
• Look at criteria
and write 1
page
justification
• Submit with
your regular
proposal, or as
a supplement
later on
Advanced Support: Improving Parallel
Performance of Protein Folding
Simulations
The UNRES molecular dynamics (MD) code utilizes a carefully-derived mesoscopic protein
force field to study and predict protein folding pathways by means of molecular dynamics
simulations.
http://www.chem.cornell.edu/has/
Load Imbalance Detection in UNRES
Profiling focused only on time spent in the important MD phase, revealing multiple causes of load imbalance as well as a serial bottleneck.
• In this case, the developers were unaware that the chosen algorithm would create load imbalance
• Reexamined available algorithms and found one with much better load balance – also faster in serial!
• Also parallelized the serial function causing the bottleneck
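Detecting imbalance like this does not require a full profiler: timing the suspect phase on each rank is often enough. A minimal sketch with mpi4py, where compute_md_forces is a dummy stand-in for the real force evaluation:

```python
import random
import time

from mpi4py import MPI

def compute_md_forces():
    # Dummy workload standing in for the real MD force phase;
    # the random factor mimics uneven work across ranks.
    time.sleep(0.01 * (1 + random.random()))

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Time only the phase under suspicion, on every rank.
t0 = MPI.Wtime()
compute_md_forces()
elapsed = MPI.Wtime() - t0

times = comm.gather(elapsed, root=0)
if rank == 0:
    avg = sum(times) / len(times)
    # 1.0 means perfect balance; the slowest rank sets the overall pace.
    print(f"max/avg time per rank: {max(times) / avg:.2f}")
# Run with e.g.: mpirun -np 8 python imbalance.py
```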
Major Serial Bottleneck and Load
Imbalance in UNRES Eliminated
• After looking at the performance profiling done by the PSC,
developers discovered that they could use an algorithm with
much better load balance and faster serial performance.
• Code now runs 4x faster!
TG App: Predicting
storms
• Hurricanes and tornadoes cause massive
loss of life and damage to property
• TeraGrid supported spring 2007 NOAA
and University of Oklahoma Hazardous
Weather Testbed
– Major Goal: assess how well ensemble
forecasting predicts thunderstorms,
including the supercells that spawn
tornadoes
– Nightly reservation at PSC, spawning jobs
at NCSA as needed for details
– Input, output, and intermediate data
transfers
– Delivers “better than real time” prediction
– Used 675,000 CPU hours for the season
– Used 312 TB on HPSS storage at PSC
Slide courtesy of Dennis Gannon, ex-IU, and LEAD Collaboration
App: GridChem
• Different licensed applications with different queues
• Will be scheduled for workflows
Slide courtesy of Joohyun Kim
Apps: Genius and Materials
• Genius: modeling blood flow before (during?) surgery – HemeLB on LONI
• Materials: fully atomistic simulations of clay-polymer nanocomposites – LAMMPS on TeraGrid
Why cross-site / distributed runs?
1. Rapid turnaround: conglomeration of idle processors to run a single large job
2. Run big-compute and big-memory jobs not possible on a single machine
Slide courtesy of Steven Manos and Peter Coveney
TeraGrid Annual Conference
• Showcases capabilities, achievements and impact of
TeraGrid in research
• Presentations, demos, posters, visualizations
• Tutorials, training and peer support
• Student competitions and volunteer opportunities
• www.teragrid.org/tg10
Campus Champions Program
• Source of local, regional and national high-performance computing and cyberinfrastructure information on their home campus
• Source of information about TeraGrid
resources and services that will benefit
their campus
• Source of startup accounts to quickly get
researchers and educators using their
allocation of time on the TeraGrid
resources
• Direct access to TeraGrid staff
www.teragrid.org/web/eot/campus_champions
TeraGrid HPC Education and Training
• Workshops, institutes and seminars on high-performance scientific computing
• Hands-on tutorials on
porting and optimizing
code for the TeraGrid
systems
• On-line self-paced tutorials
• High-impact educational
and visual materials
suitable for K–12,
undergraduate and
graduate classes
www.teragrid.org/web/eot/workshops
HPC University
• Virtual Organization to advance researchers’ HPC skills
– Catalog of live and self-paced training
– Schedule series of training courses
– Gap analysis of materials to drive development
• Work with educators to enhance the curriculum
– Search catalog of HPC resources
– Schedule workshops for curricular development
– Leverage good work of others
• Offer Student Research Experiences
– Enroll in HPC internship opportunities
– Offer Student Competitions
• Publish Science and Education Impact
– Promote via TeraGrid Science Highlights, iSGTW
– Publish education resources to NSDL-CSERD
http://hpcuniv.org/
Sampling of Training Topics Offered
• HPC Computing
– Introduction to Parallel Computing
– Toward Multicore Petascale Applications
– Scaling Workshop - Scaling to Petaflops
– Effective Use of Multi-core Technology
– TeraGrid - Wide BlueGene Applications
• Domain-specific Sessions
– Petascale Computing in the Biosciences
• Visualization
– Introduction to Scientific Visualization
– Remote/Collaborative TeraScale Visualization on the TeraGrid
• Other Topics
– Rocks Linux Cluster Workshop
– LCI International Conference on HPC Clustered Computing
• Over 30 on-line asynchronous tutorials
Broaden Awareness through CI Days
• Work with campuses to develop leadership in promoting CI to
accelerate scientific discovery
• Catalyze campus-wide and regional discussions and planning
• Collaboration of Open Science Grid, Internet2, National LambdaRail,
EDUCAUSE, Minority Serving Institution Cyberinfrastructure
Empowerment Coalition, TeraGrid, and local & regional organizations
• Identify Campus Champions
http://www.cidays.org
For More Information
YOUR Campus Champion
Levent Yilmaz slyilmaz at pitt.edu
www.teragrid.org
www.nsf.gov/oci/
http://cidays.org
help@teragrid.org
Questions?