GridLab WP2: CGAT
Cactus Grid Application Toolkit
Gabrielle Allen
GridLab/Cactus
Max Planck Institute for
Gravitational Physics (AEI)
WP2: CGAT
Making use of GAT within the Cactus framework
Grid-enabling applications using Cactus
Devising and implementing scenarios
Testing GridLab services and tools on multiple
testbeds, with “real” applications
User requirements
Interacting with and disseminating to the different groups using
Cactus, to make them aware of the Grid and GridLab
Cactus: www.cactuscode.org
… a framework for HPC applications
o Open source
o Modular (flesh and thorns)
o Portable
o Collaborative
o Provides parallelism, IO, toolkits, …
o Generic applications
o Nothing to do with the Grid, but by design very well suited for use on the Grid …
o … and our main users (e.g. Denis) want/need the services the Grid will provide
Cactus User Community
Using and Developing Physics Thorns
Numerical Relativity:
AEI, Southampton, Goddard, Tuebingen, Wash U, Penn State, TAC, Thessaloniki, SISSA, LSU, Portsmouth, UNAM, Pittsburgh, Austin, Brownsville, EU Astrophysics Network, New EU Astrophysics Network ???

Other Applications:
RIKEN, Chemical Engineering (U.Kansas), Climate Modeling (Utrecht, NASA, +), CFD (KISTI, LSU), Bio-Informatics (Canada), Early Universe (LBL), Astrophysics (Zeus), Crack Prop. (Cornell)
Numerical Relativity
Black Hole simulations using the Cactus framework
(Typical: 50 GB, 600 TeraFlops, 1 TB output, 50 hrs, 15000 SUs)
Simulations performed at NERSC/NCSA by the AEI numerical relativity group
Visualization by Werner Benger, ZIB
Grid-Cactus Development
[Diagram] Projects feeding into the Cactus Development Team (adding needed infrastructure):
GridLab (GAT, services, scenarios, implementation)
GrADS project (also using Cactus)
MetaCactus (DFG proposal)
TeraGrid (distributed runs, Visapult)
GriKSL (data/visualization)
NumRel/EU users (ideas and testing)
ASC project (ending this year)
WP2: CGAT
Cactus/GAT Integration
[Diagram] The Cactus flesh connects thorns (physics and computational infrastructure modules). The CGAT thorn contains the Cactus GAT wrappers, the GAT library, additional functionality, and build-system support; through the GAT library, Cactus talks to GridLab services.
Grid-enabled Cactus Apps
The generic Cactus framework provides, e.g.:
Checkpointing
Portability
Flexible make system
Switchable parallel layers
Steering/control API and interfaces
Socket layer
Integration of Grid services with the GAT then means that all Cactus applications are trivially grid-enabled.
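The flexible make system is part of what makes per-machine deployment tractable; a typical Cactus build looks like this (the configuration name and MPI option here are illustrative):

    # create a configuration, selecting the MPI layer, then build it
    gmake blackhole-config MPI=MPICH
    gmake blackhole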
What do our users want?
Larger computational resources
  Memory/CPU
Faster throughput
  Cleverer scheduling, configurable scheduling, co-scheduling, exploitation of unused cycles
Easier use of resources
  Portals, grid application frameworks, information services, mobile devices
Remote interaction with simulations and data
  Notification, steering, visualization, data management
Collaborative tools
  Notification, visualization, video conferencing, portals
Dynamic applications, new scenarios
  Grid application frameworks connecting to services
Application Scenarios
Dynamic Staging: move to faster/cheaper/bigger machine
Dynamic Load Balancing: inhomogeneous loads, multiple grids
Multiple Universe: create clone to investigate steered parameter
Automatic Convergence Testing: from initial data or initiated during simulation
Look Ahead: spawn off and run coarser resolution to predict likely future
Spawn Independent/Asynchronous Tasks: send to cheaper machine, main simulation carries on
Application Profiling: best machine/queue; choose resolution parameters based on queue
Intelligent Parameter Surveys: farm out to different machines
Portal: user/virtual organisation interface to the grid
Make use of: running with management tools such as Condor, Entropia, etc.; scripting thorns (management, launching new jobs, etc.); dynamic use of e.g. MDS for finding available resources
Motivation for GAT
Why do applications need a framework for using the Grid?
We (application developers) need a layer between
applications and grid infrastructure:
Higher level than existing grid APIs; hide complexity, abstract grid
functionality through application-oriented APIs
Insulate against rapid evolution of grid infrastructure
Choose between different grid infrastructures
Make it possible for grid developers to develop new infrastructures
Make it possible for application developers to use and develop for the
grid independent of the state of deployment of the grid infrastructure
SC2002, Baltimore
Varied applications deployed on the GGTC testbed:
Cactus Black Hole Simulations
ASC Portal
Smith-Waterman
Nimrod-G
Task Farming scenario
Visapult
Highlights:
GGTC won 2 of the 3 HPC Awards
Won the Bandwidth Challenge (with the Visapult/LBL group)
$2000 prize money to the UNICEF children's fund
Global Grid Testbed
Collaboration (GGTC)
Driven by GGF APPS and GridLab testbed and applications
Whole testbed constructed very swiftly (few weeks)
5 continents: North America, Europe, Asia, Africa, Australia
Over 14 countries, including:
China, Japan, Singapore, South Korea, Egypt, Australia, Canada, Germany, UK,
the Netherlands, the Czech Republic, Hungary, Poland, USA
About 70 machines, with thousands of processors (~7500)
Many hardware types, including PS2, IA32, IA64, MIPS, IBM Power,
Alpha, Hitachi/PPC, Sparc
Many OSs, including Linux, Irix, AIX, OSF, Tru64, Solaris, Hitachi
Many different organizations (big centers/individuals)
All ran same Grid infrastructure! (Globus)
User Portal
Access to Grid: MyProxy/GRAM/MDS/GridFTP/GSI-SOAP
Start jobs: GRAM, GRMS (OGSA)
Move/browse files: GridFTP
Track and monitor announced jobs
Connect to simulation web interfaces for steering and viz
New framework based on portlets: www.gridsphere.org
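Underneath, the portal drives standard Grid services; the same actions can be exercised by hand with the Globus command-line tools (hostnames and paths here are placeholders):

    # obtain a GSI proxy credential
    grid-proxy-init
    # start a job via GRAM on a remote gatekeeper
    globus-job-run gram.example.org /bin/hostname
    # move a file via GridFTP
    globus-url-copy file:///tmp/input.par gsiftp://gram.example.org/tmp/input.par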
Notification
[Diagram] Applications running on the testbed report to the portal server, which relays notifications to users through an SMS server and a mail server.
Remote Data
[Diagram] A simulation writes data through the IOStreamedHDF5 thorn; a remote data server, providing hyperslabbing and downsampling, serves it via HDF5 stream and GridFTP virtual file drivers (VFDs) to visualization tools (OpenDX, Amira, …).
Bandwidth Challenge:
Highest Performing Application
Distributed simulations using
Cactus, Globus and Visapult
With John Shalf/LBL and others
16.8 Gigabits/second
scinet.supercomp.org/bwc
Six sites: USA/Netherlands/Czech Republic
Task Farming on the Grid
TFM (task farm manager) implemented in Cactus
GAT (GRAM, GRMS) used for starting remote TFMs
Remote TFMs fork/exec the individual tasks
Designed for the Grid
Tasks can be anything
[Diagram] A master TFM starts TFMs on remote resources; each forks/execs its tasks.
Task Farming Motivation
Requested by local physics group
Parameter surveys, e.g. looking for critical phenomena in
gravitational wave collapse by varying the amplitude, or testing different
formalisms of the Einstein equations for evolving the same initial data
Scenario is inherently quite robust and fault tolerant
Good migration path to the Grid
Start easy (not too much Grid!), task farm across local
homogeneous workstations and on single supercomputers.
Use public keys first, then test standard Grid infrastructure
Use of GAT then means users can start testing GridLab services
(should still work for them if services not ready)
CGAT team can then test real physics runs using wider Grid and
GridLab services.
Task Farming on the Grid
[Diagram] The task farming infrastructure separates a generic part (the TFM) from the application-specific part (the tasks).
Grid-xclock
Simple application for testing and debugging.
xclock is a standard X utility; it runs on any machine with X installed.
Requires:
o xclock binary
o X libraries
o To display remotely, need to open outgoing ports from the machine it is running on to the machine displaying
Grid-Black Holes
Task farm small Cactus black hole simulations across the testbed
Parameter survey: black hole corotation parameter
Results steer a large production black hole simulation
Now push to bring this to the physics userbase and incorporate GridLab services
Requires:
o Black hole binary
o C/Fortran/MPI libraries
o How to run MPI jobs on a known set of nodes
o To contact the steering server, need to open outgoing ports from the machine it is running on to the server
What we did …
Need a Cactus black hole (MPI/Fortran) binary on
each machine
Login interactively to each machine (gsissh)
Set up standard user environment
(paths, scratch space, …)
Install Cactus and utilities in standard location
(e.g. $HOME/binary/cactus_blackhole)
Test executable runs in usual login environment
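A sketch of the manual per-machine setup (the hostname and parameter file are placeholders):

    # copy over the pre-built executable and log in with GSI credentials
    gsiscp cactus_blackhole supercomputer.example.org:binary/cactus_blackhole
    gsissh supercomputer.example.org
    # on the machine: check the executable runs in the usual login environment
    $HOME/binary/cactus_blackhole smalltest.par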
Testbed Problems
Organization
People working in the testbed collaboration not always in close
contact with local administrators/policy makers
General coordination and status reporting of 70 machines
Accounts
Local policies for creating accounts differ
Basically no way to create limited access/use accounts for us
Different resources available: e.g. file spaces, inodes
Lack of access via gsissh a big problem with many machines,
requiring lots of coordination with administrators
Really need group accounts for such an endeavor (e.g. CAS)
Needed some gymnastics with gridmap files (existing accounts)
Testbed Problems
Machines
Resources at main centers usually well documented
(although Grid software, installations and support usually not
documented)
Other resources not usually documented, need to find compilers,
scratch space etc.
Local changes to “standard” queuing systems, etc.
Setting up user environment
A few machines have rather strange setups
Firewalls
Many machines firewalled in different ways.
Need a lot of lobbying at big centers to open needed ports
Often ports only opened to specific addresses (hard for demoing
in Baltimore)
Testbed Problems
Application Installation
MPI is sometimes hard to use (many different implementations, LAM,
MPICH, ScaliMPI, Native, …)
Even with very portable applications initial compilation and testing can
be very time consuming
Need robust tools to help with this, e.g. GridMake (AEI)
Grid Installations
Not well (or at all) documented
Different versions and patches
Local tweaks to installations
Firewalls can change even daily
Functioning of software can change even daily!!
Incomplete installations (e.g. no gsissh)
Certificates
Various problems with all the different machine and user certificates
Testbed Problems
Globus Infrastructure
Main problems with Globus are with deployment
Proxy delegation
Start a run, get a limited proxy which can’t be used to start another
run
Setting user environment for deploying applications
MPI runs set up different environments on different processors?
Xclock not on standard path
X libraries not on standard library path
Deployment of Applications
To run any application need correct user environment
Path to any executables
Home directory and other directories
Location of needed libraries, X, C, Fortran, MPI
Could be many others depending on the application
Note that machines typically have multiple compilers, MPI
installations … have to use correct ones for a given executable
In usual interactive use of machines many of these are set in
e.g. user’s .cshrc
Globus user environment
Starting jobs with Globus only provides a minimal environment
Rationale is that resources are not used interactively, correct
environment should be passed in from outside
RSL syntax provides a way to pass in the requested environment
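A sketch of passing environment settings through GT2 RSL (the gatekeeper, paths, and values are placeholders):

    globusrun -r gatekeeper.example.org \
      '&(executable=/home/user/binary/cactus_blackhole)(arguments="bh.par")(environment=(LD_LIBRARY_PATH "/opt/mpich/lib")(SCRATCH "/scratch/user"))'

The hard part, as the next slide discusses, is knowing which values to pass for a given machine.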
Deployment of Applications
For our current use of resources this is a real problem.
Even though you can pass in the user environment, how do you get
the correct values for a given machine? (It isn't in MDS now.)
How do you get the correct executable in the first place?
Could provide statically linked executables (executable
repository) but still need to provide them at least for each
machine, each OS version, each MPI/F90 combination.
Applications will need to provide a list of which variables need
to be set to be run (standard way to specify this?)
Do we need a Grid equivalent of “modules” functionality
(module load gnu, module load mpi-mpich)
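For comparison, the interactive “modules” workflow (module names are site-specific examples):

    # list what the site provides, then select a compiler/MPI pairing
    module avail
    module load gnu
    module load mpi-mpich
    # paths and library variables are now set consistently
    which mpicc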
Deployment of Applications
Frustrating right now, because user environment is usually
correctly set for interactive use, but how can we make use of
this in a grid environment?
Use globusrun to invoke the correct interactive shell on any
machine? E.g. run “csh -csh”
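A sketch of the idea: use GRAM to run the user's shell and dump the environment it sets up, for comparison with an interactive gsissh login (the gatekeeper is a placeholder, and since argv[0] tricks like “-csh” cannot be expressed in RSL, this is only an approximation):

    globusrun -o -r gatekeeper.example.org \
      '&(executable=/bin/csh)(arguments="-c" "printenv")'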
In practice this worked
Around 35 machines worked for grid-xclock
Only 7 machines worked for grid-blackholes (MPI/Fortran)
Currently investigating why it didn't fully work by comparing the
environment obtained on a machine when entering in different ways
Machines not consistently set up?
Environment passed in an inconsistent manner to all processors?
MPI/Fortran Applications
Require many more details about environment
Location of MPI/Fortran libraries for a particular compiler and
MPI implementation
Problems with interpretation of RSL keywords on
some machines
Wanted to be given a set of processors on which the TFM would
start different MPI tasks
E.g. jobtype=“single”, count=4 would sometimes start up 4
copies of the TFM instead of a single TFM in control of 4
processors
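The intended request in GT2 RSL (the gatekeeper and executable path are placeholders): one TFM process with four processors allocated to it; jobtype=multiple would instead start count copies of the executable:

    globusrun -r gatekeeper.example.org \
      '&(jobtype=single)(count=4)(executable=/home/user/binary/tfm)'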
How can you tell which processors you were actually
allocated?
On clusters the TFM typically needs this information in order to
start MPI runs with a machines file
Lessons Learnt from SC2002
Need to really think about the design of scenarios for the Grid
(firewalls, NAT/internal cluster nodes, environment)
Need more communication of requirements and problems with
infrastructure developers (GridFTP, Globus, RSL)
Real testbeds and real applications are crucial! (70 GGTC
testbed machines, 35 “worked” with grid-xclock, 7 “worked”
with grid-blackholes [Fortran/MPI])
Need to think more about compute resources
General machine setup (environment)
Deployment of Grid software
Intermachine connectivity (firewalls, NAT, IPv6?)
Need reliable Grid tools: Testbed tests/status, gridmake (AEI),
grid debuggers, grid profilers.
Summary
Lots of problems with running real applications on today's
machines with today's Grid infrastructure
This is what GridLab is addressing
Co-development of applications and infrastructure on a real testbed
GAT will allow us to develop our applications ready for the Grid
Applications can still run as they do today, but can test/make use
of (anyone's) services as they are ready
Allows us to simultaneously work with our resources to also make
them ready for the Grid