How to Execute 1 Million Jobs on the Teragrid
Jeffrey P. Gardner (PSC)
Edward Walker (TACC)
Miron Livny (U. Wisconsin)
Todd Tannenbaum (U. Wisconsin)
And many others!
Astronomy is increasingly done using large surveys containing hundreds of millions of objects.
Analyzing large astronomical datasets frequently means performing the same analysis task on >100,000 objects.
Each object may take several hours of computing.
The amount of computing time required may vary, sometimes dramatically, from object to object.
In theory, PBS should provide the answer.
Submit 100,000 single-processor PBS jobs
In practice, this does not work.
Teragrid nodes are multiprocessor, but PBS runs only 1 job per node, so single-processor jobs waste the remaining CPUs.
Teragrid machines frequently restrict the number of jobs a single user may run.
Chad might get really mad if I submitted 100,000 PBS jobs!
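For illustration only (the script, variable, and file names below are hypothetical), the brute-force approach amounts to one tiny PBS job per object, submitted in a loop from the login node:

# analyze_one.pbs -- hypothetical single-object PBS job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
./analyze_object $OBJECT_ID

# Submit one job per object (tcsh); this is exactly what queue limits forbid
foreach id ( `cat object_list.txt` )
    qsub -v OBJECT_ID=$id analyze_one.pbs
end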
We could submit a single job that uses many processors.
Now we have a reasonable number of PBS jobs (Chad will now be happy).
Scheduling priority would reflect our actual resource usage.
This still has problems.
Each work unit takes a different amount of time to run, so processors that finish early sit idle: we are using resources inefficiently.
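A sketch of that alternative, with hypothetical node counts and program names: one large PBS job statically divides the object list among its processors, so processors whose objects happen to finish early sit idle until the slowest slice completes.

# analyze_all.pbs -- hypothetical single large PBS job
#PBS -l nodes=64:ppn=2
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
# Static partitioning: each of the 128 processors works through a fixed
# slice of the object list, however long its objects turn out to take.
mpirun -np 128 ./analyze_my_slice object_list.txt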
The Real Solution: Condor+GridShell
The real solution is to submit one large PBS job, then use a private scheduler (Condor) to manage serial work units within each PBS job.
We can even use GridShell to submit large PBS jobs to multiple Teragrid machines, then farm out serial work units as resources become available.
Vocabulary:
JOB (n): a thing that is submitted via Globus or PBS
WORK UNIT (n): an independent unit of work (usually serial), such as the analysis of a single astronomical object
Condor Overview
Condor was first designed as a CPU cycle harvester for workstations sitting on people’s desks.
Condor is designed to schedule large numbers of jobs across a distributed, heterogeneous, and dynamic set of computational resources.
1. User writes a simple Condor submit script:
# my_job.submit:
# A simple Condor submit script
Universe = vanilla
Executable = my_program
Queue
2. User submits the job:
% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
3. User watches job run:
% condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 Jeff 6/16 06:52 0+00:01:21 R 0 0.0 my_program
1 jobs; 0 idle, 1 running, 0 held
%
4. Job completes. User is happy.
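The same submit-file mechanism scales to the survey use case: Condor expands the $(Process) macro from 0 to N-1, so a single Queue statement enqueues one work unit per object. A minimal sketch, assuming my_program takes an object index as its argument (file names here are illustrative):

# my_survey.submit: one Condor work unit per astronomical object
Universe   = vanilla
Executable = my_program
Arguments  = $(Process)
Output     = obj_$(Process).out
Error      = obj_$(Process).err
Log        = survey.log
Queue 100000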
Condor user experience is simple
Condor is flexible
Resources can be any mix of architectures
Resources do not need a common filesystem
Resources do not need common user accounting
Condor is dynamic
Resources can disappear and reappear
Condor is fault-tolerant
Jobs are automatically migrated to new resources if existing ones become unavailable.
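These properties show up directly in the submit file. For example (attribute values are illustrative), a job can restrict itself to particular architectures and let Condor transfer its input and output files, so no shared filesystem or common account space is required:

# Submit-file fragment using Condor matchmaking and file transfer
Requirements            = (Arch == "X86_64" || Arch == "INTEL") && (OpSys == "LINUX")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = object_12345.dat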
condor_startd (runs on execution node): advertises the specs and availability of the execution node via ClassAds; starts jobs on the execution node.
condor_schedd (runs on submit node): handles job submission and tracks job status.
condor_collector (runs on central manager): collects system information from the execution nodes.
condor_negotiator (runs on central manager): matches schedd jobs to machines.
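Which daemons run where is controlled by each host's Condor configuration. A minimal sketch, with a placeholder host name:

# condor_config.local on the central manager (illustrative)
CONDOR_HOST = manager.example.edu
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

# condor_config.local on a submit node
DAEMON_LIST = MASTER, SCHEDD

# condor_config.local on an execution node
DAEMON_LIST = MASTER, STARTD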
[Diagram: Condor matchmaking across a Submission Machine (schedd), Central Manager (negotiator, collector), and Execution Machine (startd)]
1. The startd sends system specifications (ClassAds) and system status to the collector.
2. The user submits a Condor job; the schedd sends job info to the negotiator.
3. The negotiator uses information from the collector to match schedd jobs to available startds.
4. The schedd sends the job to the startd on the assigned execution node.
“Personal” Condor on a Teragrid Platform
Condor daemons can be run as a normal user.
Condor's "GlideIn" capability can launch condor_startd's on the nodes of an LSF or PBS job.
[Diagram: a personal Condor pool running with normal user permissions. The submission machine (which could be the login node) runs the schedd, the central manager runs the negotiator and collector, and each execution PE inside the PBS GlideIn job runs a startd.]
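Conceptually, a GlideIn PBS job is just a batch job whose payload starts user-level condor_startd's that report back to the user's own collector. A rough sketch of the idea (the wrapper script, paths, and node count are hypothetical; the real GlideIn machinery automates these details):

#!/bin/tcsh
# glidein.pbs -- hypothetical sketch of what a GlideIn PBS job amounts to
#PBS -l nodes=16:ppn=2
#PBS -l walltime=12:00:00

# start_startd.csh is a hypothetical wrapper: it points CONDOR_CONFIG at a
# private configuration whose COLLECTOR_HOST names the user's own central
# manager on the login node, then runs condor_master -f, which in turn
# starts a condor_startd that joins the personal pool.
foreach node ( `sort -u $PBS_NODEFILE` )
    ssh $node $HOME/glidein/start_startd.csh &
end
wait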
GridShell Overview
GridShell allows users to interact with distributed grid computing resources from a simple shell-like interface.
GridShell extends TCSH version 6.12 to incorporate grid-enabled features:
parallel inter-script message-passing and synchronization
output redirection to remote files
parametric sweeps
Redirecting the standard output of a command to a remote file location using GridFTP:
a.out > gsiftp://tg-login.ncsa.teragrid.org/data
Message passing between 2 parallel tasks:
if ( $_GRID_TASKID == 0 ) then
    echo "hello" > task_1
else
    set msg=`cat < task_0`
endif
Executing 256 instances of a job:
a.out on 256 procs
Use GridShell to launch Condor GlideIn jobs at multiple grid sites
All Condor GlideIn jobs report back to a central collector
This converts the entire Teragrid into your own personal Condor pool!
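Once the GlideIn'd startd's have reported in, the pool behaves like any other Condor pool from the login node, and the work units are submitted against it (output omitted; my_survey.submit is the hypothetical submit file sketched earlier):

% condor_status          # list the Teragrid processors that have joined the pool
% condor_submit my_survey.submit
% condor_q               # watch the work units match and run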
Merging GridShell with Condor
[Diagram sequence across three Teragrid sites (PSC, SDSC, NCSA), each with a login node running a GridShell event monitor; the PSC login node also runs the Condor negotiator, schedd, and collector, and the PBS GlideIn job at each site runs condor_startd's on its nodes.]
1. The user starts a GridShell session at PSC.
2. The GridShell session starts event monitors on the remote login nodes via Globus.
3. The local event monitor starts the Condor daemons (negotiator, schedd, collector) on the login node.
4. All event monitors submit Condor GlideIn PBS jobs.
5. The Condor startd's tell the collector that they have started.
6. The Condor schedd distributes independent work units to the compute nodes.
Using GridShell coupled with Condor, one can easily harness the power of the Teragrid to process large numbers of independent work units.
Scheduling can be done dynamically from a central Condor queue to multiple grid sites as clusters of processors become available.
All of this fits within the existing Teragrid software.