GridLab WP2: CGAT
Cactus Grid Application Toolkit
Gabrielle Allen, GridLab/Cactus
Max Planck Institute for Gravitational Physics (AEI)
GGF7, 2003

WP2: CGAT
- Making use of the GAT within the Cactus framework
- Grid-enabling applications using Cactus
- Devising and implementing scenarios
- Testing GridLab services and tools on multiple testbeds, with "real" applications
- Gathering user requirements
- Interacting with, and disseminating to, the different groups using Cactus to make them aware of the Grid and GridLab

Cactus: www.cactuscode.org
… a framework for HPC applications
- Open source
- Modular (flesh and thorns)
- Portable
- Collaborative
- Provides parallelism, IO, toolkits, …
- Generic applications
- Nothing to do with the Grid, but by design very well suited for use on the Grid …
- … and our main users (e.g. Denis) want/need the services the Grid will provide

Cactus User Community
- Using and developing physics thorns for numerical relativity: AEI, Southampton, Goddard, Tuebingen, Wash U, Penn State, TAC, Thessaloniki, SISSA, LSU, Portsmouth, UNAM, Pittsburgh, Austin, Brownsville, the EU Astrophysics Network, and a new EU Astrophysics Network (?)
- Other applications: RIKEN, Chemical Engineering (U. Kansas), Climate Modeling (Utrecht, NASA, +), CFD (KISTI, LSU), Bio-Informatics (Canada), Early Universe (LBL), Astrophysics (Zeus), Crack Propagation (Cornell)

Numerical Relativity
- Black hole simulations using the Cactus framework (typical: 50GB, 600 TeraFlops, 1TB output, 50hrs, 15000 SUs)
- Simulations performed at NERSC/NCSA by the AEI numerical relativity group
- Visualization by Werner Benger, ZIB

Grid-Cactus Development
- GrADS project (also using Cactus)
- GridLab (GAT, services, scenarios, implementation)
- Cactus Development Team (adding needed infrastructure)
- MetaCactus (DFG proposal)
- TeraGrid (distributed runs, Visapult)
- GriKSL (data/visualization)
- NumRel/EU users (ideas and testing)
- ASC project (ending this year)

WP2: CGAT -- Cactus/GAT Integration
[Diagram: physics and computational infrastructure thorns plug into the Cactus flesh; a CGAT thorn, containing the GAT library, the Cactus GAT wrappers, additional functionality and a build system, connects the application to the GridLab services. The wrapper idea is sketched in code below.]
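As an illustration only, here is a minimal C sketch of the wrapper idea on this slide: the application makes one abstract, application-oriented call, and the wrapper binds it to whatever infrastructure is actually deployed. All names here (cgat_file_copy, gridftp_available, GRIDFTP_URL) are hypothetical and are not the real GAT or CGAT API; globus-url-copy is the standard GridFTP client.

    /* Hypothetical sketch of a CGAT-style wrapper; not the actual GAT API. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Crude stand-in for a real capability probe: is GridFTP usable here? */
    static int gridftp_available(void)
    {
        return getenv("GRIDFTP_URL") != NULL;   /* hypothetical convention */
    }

    /* Application-oriented call: "copy this file", with no Grid details. */
    int cgat_file_copy(const char *src_path, const char *dst)
    {
        char cmd[1024];
        if (gridftp_available())
            /* Grid path: delegate to the deployed GridFTP client. */
            snprintf(cmd, sizeof cmd, "globus-url-copy file://%s %s",
                     src_path, dst);
        else
            /* Fallback: no Grid at all; the destination is a local path,
               so the application still works exactly as it does today. */
            snprintf(cmd, sizeof cmd, "cp %s %s", src_path, dst);
        return system(cmd);
    }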
Grid-enabled Cactus Apps
- The generic Cactus framework already provides, e.g.:
  - checkpointing
  - portability
  - a flexible make system
  - switchable parallel layers
  - a steering/control API and interfaces
  - a socket layer
- Integration of Grid services through the GAT means that all Cactus applications are trivially grid-enabled

What do our users want?
- Larger computational resources: memory/CPU
- Faster throughput: cleverer scheduling, configurable scheduling, co-scheduling, exploitation of unused cycles
- Easier use of resources: portals, grid application frameworks, information services, mobile devices
- Remote interaction with simulations and data: notification, steering, visualization, data management
- Collaborative tools: notification, visualization, video conferencing, portals
- Dynamic applications and new scenarios: grid application frameworks connecting to services

Application Scenarios
- Dynamic staging: move to a faster/cheaper/bigger machine
- Dynamic load balancing: inhomogeneous loads, multiple grids
- Multiple universe: create a clone to investigate a steered parameter
- Portal: user/virtual organisation interface to the grid
- Automatic convergence testing: from initial data, or initiated during a simulation
- Intelligent parameter surveys: farm out to different machines
- Look ahead: spawn off a coarser-resolution run to predict the likely future
- Spawn independent/asynchronous tasks: send to a cheaper machine while the main simulation carries on
- Application profiling: best machine/queue; choose resolution parameters based on the queue
- Running with management tools such as Condor, Entropia, etc.
- Scripting thorns (management, launching new jobs, etc.)
- Dynamic use of e.g. MDS for finding available resources

Motivation for GAT
Why do applications need a framework for using the Grid? We (application developers) need a layer between applications and grid infrastructure which will:
- sit at a higher level than existing grid APIs, hiding complexity and abstracting grid functionality through application-oriented APIs
- insulate applications against the rapid evolution of grid infrastructure
- allow a choice between different grid infrastructures
- make it possible for grid developers to develop new infrastructures
- make it possible for application developers to use and develop for the grid independently of the state of deployment of the grid infrastructure

SC2002, Baltimore
- Varied applications deployed on the GGTC testbed:
  - Cactus black hole simulations
  - ASC Portal
  - Smith-Waterman
  - Nimrod-G
  - task farming scenario
  - Visapult
- Highlights:
  - GGTC won 2 of the 3 HPC Awards
  - Won the Bandwidth Challenge (with the Visapult/LBL group)
  - $2000 prize money donated to the UNICEF children's fund

Global Grid Testbed Collaboration (GGTC)
- Driven by GGF APPS and the GridLab testbed and applications
- Whole testbed constructed very swiftly (a few weeks)
- 5 continents: North America, Europe, Asia, Africa, Australia
- Over 14 countries, including China, Japan, Singapore, S. Korea, Egypt, Australia, Canada, Germany, UK, Netherlands, Czech Republic, Hungary, Poland, USA
- About 70 machines, with thousands of processors (~7500)
- Many hardware types, including PS2, IA32, IA64, MIPS, IBM Power, Alpha, Hitachi/PPC, Sparc
- Many OSs, including Linux, Irix, AIX, OSF, Tru64, Solaris, Hitachi
- Many different organizations (big centers and individuals)
- All ran the same Grid infrastructure! (Globus)

Global Grid Testbed Collaboration
[Figure: world map of the GGTC testbed sites.]

User Portal
- Access to the Grid through MyProxy/GRAM/MDS/GridFTP/GSI-SOAP
- Start jobs: GRAM, GRMS (OGSA)
- Move/browse files: GridFTP
- Track and monitor announced jobs
- Connect to simulation web interfaces for steering and visualization
- New framework based on portlets: www.gridsphere.org

Notification
[Diagram: applications running on the testbed announce themselves to the portal server, which notifies users via mail and SMS servers.]

Remote Data Visualization
[Diagram: a simulation writes data through IOStreamedHDF5 to a remote data server (via GridFTP), which provides hyperslabbing and downsampling; visualization tools such as OpenDX and Amira read the data back through the HDF5 stream and GridFTP VFDs.]

Bandwidth Challenge: Highest Performing Application
- Distributed simulations using Cactus, Globus and Visapult
- With John Shalf/LBL and others
- 16.8 Gigabits/second (scinet.supercomp.org/bwc)
- Six sites: USA/Dutch/Czech

Task Farming on the Grid
[Diagram: a root Task Farm Manager (TFM) spawns further TFMs across resources.]
- TFM implemented in Cactus
- The GAT (GRAM, GRMS) is used for starting remote TFMs; local tasks are started with fork/exec (a minimal sketch follows below)
- Designed for the Grid
- Tasks can be anything
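For illustration, a minimal local task-farm sketch in C, assuming a hypothetical task executable "./task" that takes one survey parameter; the real TFM dispatches remote tasks through the GAT (GRAM/GRMS), but on a single machine the mechanism reduces to fork/exec as this slide says.

    /* Minimal sketch of local task farming via fork/exec (illustrative only). */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical parameter survey: one task per amplitude value. */
        const char *amplitude[] = { "0.1", "0.2", "0.3", "0.4" };
        int n = sizeof amplitude / sizeof amplitude[0];

        for (int i = 0; i < n; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Child: the task can be any executable at all. */
                execlp("./task", "task", amplitude[i], (char *)NULL);
                _exit(127);                 /* reached only if exec failed */
            }
        }
        while (wait(NULL) > 0)              /* reap all finished tasks */
            ;
        return 0;
    }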
Task Farming Motivation
- Requested by our local physics group for parameter surveys, e.g.:
  - looking for critical phenomena in gravitational wave collapse by varying the amplitude
  - testing different formalisms of the Einstein equations for evolving the same initial data
- The scenario is inherently quite robust and fault tolerant
- It gives a good migration path to the Grid:
  - Start easy (not too much Grid!): task farm across local homogeneous workstations and on single supercomputers; use public keys first, then test standard Grid infrastructure
  - Using the GAT then means users can start testing GridLab services (everything should still work for them if the services are not ready)
  - The CGAT team can then test real physics runs using the wider Grid and GridLab services

Task Farming on the Grid
[Diagram: the task farming infrastructure separates a generic part from an application-specific part.]

Grid-xclock
- Simple application for testing and debugging
- xclock is a standard X utility, run on any machine with X installed
- Requires:
  - the xclock binary
  - the X libraries
  - to display remotely, open outgoing ports from the machine it is running on to the machine displaying
- (A sample remote launch is sketched below.)
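A hedged example of what such a launch looked like with GT2-era tooling (the host names are placeholders, and the exact xclock path varies per machine, which is precisely the deployment problem discussed later):

    # Run xclock on a remote gatekeeper, displaying back to the local desktop.
    globus-job-run remote.site.org \
        -env DISPLAY=mydesktop.aei.mpg.de:0 \
        /usr/X11R6/bin/xclock
    # This works only if outgoing X traffic is allowed from the remote site
    # and xclock/the X libraries are found in the minimal Globus environment.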
Grid-Black Holes
- Task farm small Cactus black hole simulations across the testbed
- Parameter survey over the black hole corotation parameter; the results steer a large production black hole simulation
- Requires:
  - the black hole binary
  - C/Fortran/MPI libraries
  - a way to run MPI jobs on a known set of nodes
  - to contact the steering server, open outgoing ports from the machine it is running on to the server
- Now pushing to bring this to the physics user base and to incorporate GridLab services

What we did …
- Needed a Cactus black hole (MPI/Fortran) binary on each machine
- Logged in interactively to each machine (gsissh)
- Set up a standard user environment (paths, scratch space, …)
- Installed Cactus and utilities in a standard location (e.g. $HOME/binary/cactus_blackhole)
- Tested that the executable runs in the usual login environment

Testbed Problems
Organization
- People working in the testbed collaboration were not always in close contact with local administrators/policy makers
- General coordination and status reporting for 70 machines
Accounts
- Local policies for creating accounts differ
- Basically no way to create limited access/use accounts for us
- Different resources available, e.g. file spaces, inodes
- Lack of access via gsissh was a big problem with many machines, requiring lots of coordination with administrators
- Really need group accounts for such an endeavor (e.g. CAS)
- Needed some gymnastics with gridmap files (existing accounts)

Testbed Problems
Machines
- Resources at the main centers are usually well documented (although Grid software, installations and support usually are not)
- Other resources are usually undocumented: need to find compilers, scratch space, etc.
- Local changes to "standard" queuing systems, etc.
Setting up the user environment
- A few machines have rather strange setups
Firewalls
- Many machines are firewalled in different ways; a lot of lobbying was needed at the big centers to open the required ports
- Often ports were only opened to specific addresses (hard for demoing in Baltimore)

Testbed Problems
Application installation
- MPI is sometimes hard to use (many different implementations: LAM, MPICH, ScaliMPI, native, …)
- Even with very portable applications, initial compilation and testing can be very time consuming
- Need robust tools to help with this, e.g. GridMake (AEI)
Grid installations
- Not well (or at all) documented
- Different versions and patches; local tweaks to installations
- Firewalls can change even daily
- The functioning of the software can change even daily!!
- Incomplete installations (e.g. no gsissh)
Certificates
- Various problems with all the different machine and user certificates

Testbed Problems
Globus infrastructure
- The main problems with Globus are with deployment
- Proxy delegation: start a run, get a limited proxy which can't be used to start another run
- Setting the user environment for deploying applications:
  - MPI runs set up different environments on different processors?
  - xclock not on the standard path
  - X libraries not on the standard library path

Deployment of Applications
- To run any application you need the correct user environment:
  - the path to any executables
  - the home directory and other directories
  - the location of needed libraries: X, C, Fortran, MPI
  - possibly many others, depending on the application
- Note that machines typically have multiple compilers and MPI installations … you have to use the correct ones for a given executable
- In the usual interactive use of machines many of these are set in e.g. the user's .cshrc
- Globus user environment:
  - Starting jobs with Globus provides only a minimal environment
  - The rationale is that resources are not used interactively, so the correct environment should be passed in from outside
  - RSL syntax provides a way to pass in the requested environment (see the example below)
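A hedged RSL sketch of passing an environment in with a job request (all paths, the parameter file name, and the library locations are illustrative; finding the correct values for each machine is exactly the problem discussed next):

    & (executable = $(HOME)/binary/cactus_blackhole)
      (arguments = blackhole.par)
      (count = 4)
      (jobtype = single)
      (environment = (LD_LIBRARY_PATH "/opt/mpich/lib:/opt/intel/lib")
                     (PATH "/usr/bin:/bin"))

Submitted with, e.g., globusrun -o -r remote.site.org -f blackhole.rsl. Here jobtype=single with count=4 is meant to request 4 processors while starting a single process; as described below, some machines interpreted this differently.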
Deployment of Applications
- For our current use of resources this is a real problem: even though you can pass in the user environment, how do you get the correct values for a given machine? (They aren't in MDS at the moment.)
- How do you get the correct executable in the first place? We could provide statically linked executables (an executable repository), but would still need to provide one for at least each machine, each OS version, and each MPI/F90 combination
- Applications will need to provide a list of which variables must be set for them to run (is there a standard way to specify this?)
- Do we need a Grid equivalent of the "modules" functionality (module load gnu, module load mpi-mpich)?

Deployment of Applications
- Frustrating right now, because the user environment is usually correctly set for interactive use … how can we make use of this in a grid environment?
- Use globusrun to invoke the correct interactive shell on any machine? E.g. run "csh -csh"
- In practice this worked:
  - around 35 machines worked for grid-xclock
  - only 7 machines worked for grid-blackholes (MPI/Fortran)
- Currently investigating why it didn't fully work, by comparing the environment obtained on a machine when entering in different ways:
  - machines not consistently set up?
  - environment passed in an inconsistent manner to all processors?

MPI/Fortran Applications
- Require many more details about the environment: the location of the MPI and Fortran libraries for a particular compiler and MPI implementation
- Problems with the interpretation of RSL keywords on some machines:
  - we wanted to be given a set of processors on which the TFM would start different MPI tasks
  - e.g. jobtype="single", count=4 would sometimes start up 4 copies of the TFM instead of a single TFM in control of 4 processors
- How can you tell which processors you were actually allocated? On clusters the TFM typically needs this information in order to start MPI runs with a machines file

Lessons Learnt from SC2002
- Need to really think about the design of scenarios for the Grid (firewalls, NAT/internal cluster nodes, environment)
- Need more communication of requirements and problems with infrastructure developers (GridFTP, Globus, RSL)
- Real testbeds and real applications are crucial! (Of the 70 GGTC testbed machines, 35 "worked" with grid-xclock and 7 "worked" with grid-blackholes [Fortran/MPI])
- Need to think more about compute resources:
  - general machine setup (environment)
  - deployment of Grid software
  - inter-machine connectivity (firewalls, NAT, IPv6?)
- Need reliable Grid tools: testbed tests/status, GridMake (AEI), grid debuggers, grid profilers

Summary
- Lots of problems with running real applications on today's machines with today's Grid infrastructure
- This is what GridLab is addressing: co-development of applications and infrastructure on a real testbed
- The GAT will allow us to develop our applications ready for the Grid:
  - applications can still run as they do today, but can test and make use of (anyone's) services as they become ready
  - this allows us to simultaneously work with our resources to make them, too, ready for the Grid