How to Execute 1 Million Jobs on the Teragrid
Jeffrey P. Gardner (PSC)
Edward Walker (TACC)
Miron Livny (U. Wisconsin)
Todd Tannenbaum (U. Wisconsin)
And many others!
Astronomy is increasingly done using large surveys containing hundreds of millions of objects.
Analyzing large astronomical datasets frequently means performing the same analysis task on >100,000 objects.
Each object may take several hours of computing.
The amount of computing time required may vary, sometimes dramatically, from object to object.
In theory, PBS should provide the answer.
Submit 100,000 single-processor PBS jobs
In practice, this does not work.
Teragrid nodes are multiprocessor, but PBS runs only 1 job per node, so single-processor jobs waste the remaining CPUs.
Teragrid machines frequently restrict the number of jobs a single user may run.
Chad might get really mad if I submitted 100,000 PBS jobs!
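For illustration only (the script, variable, and file names below are hypothetical), the brute-force approach amounts to one tiny PBS job per object, submitted in a loop from the login node:

# analyze_one.pbs -- hypothetical single-object PBS job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
./analyze_object $OBJECT_ID

# Submit one job per object (tcsh); this is exactly what queue limits forbid
foreach id ( `cat object_list.txt` )
    qsub -v OBJECT_ID=$id analyze_one.pbs
end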
We could submit a single job that uses many processors.
Now we have a reasonable number of PBS jobs (Chad will now be happy).
Scheduling priority would reflect our actual resource usage.
This still has problems.
Each work unit takes a different amount of time to run, so processors that finish early sit idle: we are using resources inefficiently.
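A sketch of that alternative, with hypothetical node counts and program names: one large PBS job statically divides the object list among its processors, so processors whose objects happen to finish early sit idle until the slowest slice completes.

# analyze_all.pbs -- hypothetical single large PBS job
#PBS -l nodes=64:ppn=2
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
# Static partitioning: each of the 128 processors works through a fixed
# slice of the object list, however long its objects turn out to take.
mpirun -np 128 ./analyze_my_slice object_list.txt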
The Real Solution: Condor+GridShell
The real solution is to submit one large PBS job, then use a private scheduler (Condor) to manage serial work units within each PBS job.
We can even use GridShell to submit large PBS jobs to multiple Teragrid machines, then farm out serial work units as resources become available.
Vocabulary:
JOB (n): a thing that is submitted via Globus or PBS
WORK UNIT (n): an independent unit of work (usually serial), such as the analysis of a single astronomical object
Condor Overview
Condor was first designed as a CPU cycle harvester for workstations sitting on people’s desks.
Condor is designed to schedule large numbers of jobs across a distributed, heterogeneous, and dynamic set of computational resources.
1. User writes a simple Condor submit script:
# my_job.submit:
# A simple Condor submit script
Universe = vanilla
Executable = my_program
Queue
2. User submits the job:
% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
3. User watches job run:
% condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 Jeff 6/16 06:52 0+00:01:21 R 0 0.0 my_program
1 jobs; 0 idle, 1 running, 0 held
%
4. Job completes. User is happy.
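The same submit-file mechanism scales to the survey use case: Condor expands the $(Process) macro from 0 to N-1, so a single Queue statement enqueues one work unit per object. A minimal sketch, assuming my_program takes an object index as its argument (file names here are illustrative):

# my_survey.submit: one Condor work unit per astronomical object
Universe   = vanilla
Executable = my_program
Arguments  = $(Process)
Output     = obj_$(Process).out
Error      = obj_$(Process).err
Log        = survey.log
Queue 100000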
Condor user experience is simple
Condor is flexible
Resources can be any mix of architectures
Resources do not need a common filesystem
Resources do not need common user accounting
Condor is dynamic
Resources can disappear and reappear
Condor is fault-tolerant
Jobs are automatically migrated to new resources if existing ones become unavailable.
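These properties show up directly in the submit file. For example (attribute values are illustrative), a job can restrict itself to particular architectures and let Condor transfer its input and output files, so no shared filesystem or common account space is required:

# Submit-file fragment using Condor matchmaking and file transfer
Requirements            = (Arch == "X86_64" || Arch == "INTEL") && (OpSys == "LINUX")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = object_12345.dat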
condor_startd (runs on execution node): advertises the specs and availability of the execution node via ClassAds; starts jobs on the execution node.
condor_schedd (runs on submit node): handles job submission and tracks job status.
condor_collector (runs on central manager): collects system information from the execution nodes.
condor_negotiator (runs on central manager): matches schedd jobs to machines.
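Which daemons run where is controlled by each host's Condor configuration. A minimal sketch, with a placeholder host name:

# condor_config.local on the central manager (illustrative)
CONDOR_HOST = manager.example.edu
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

# condor_config.local on a submit node
DAEMON_LIST = MASTER, SCHEDD

# condor_config.local on an execution node
DAEMON_LIST = MASTER, STARTD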
[Diagram: Condor matchmaking across a Submission Machine (schedd), Central Manager (negotiator, collector), and Execution Machine (startd)]
1. The startd sends system specifications (ClassAds) and system status to the collector.
2. The user submits a Condor job; the schedd sends job info to the negotiator.
3. The negotiator uses information from the collector to match schedd jobs to available startds.
4. The schedd sends the job to the startd on the assigned execution node.
“Personal” Condor on a Teragrid Platform
Condor daemons can be run as a normal user.
Condor's "GlideIn" capability can launch condor_startd's on the nodes of an LSF or PBS job.
[Diagram: a personal Condor pool running with normal user permissions. The submission machine (which could be the login node) runs the schedd, the central manager runs the negotiator and collector, and each execution PE inside the PBS GlideIn job runs a startd.]
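Conceptually, a GlideIn PBS job is just a batch job whose payload starts user-level condor_startd's that report back to the user's own collector. A rough sketch of the idea (the wrapper script, paths, and node count are hypothetical; the real GlideIn machinery automates these details):

#!/bin/tcsh
# glidein.pbs -- hypothetical sketch of what a GlideIn PBS job amounts to
#PBS -l nodes=16:ppn=2
#PBS -l walltime=12:00:00

# start_startd.csh is a hypothetical wrapper: it points CONDOR_CONFIG at a
# private configuration whose COLLECTOR_HOST names the user's own central
# manager on the login node, then runs condor_master -f, which in turn
# starts a condor_startd that joins the personal pool.
foreach node ( `sort -u $PBS_NODEFILE` )
    ssh $node $HOME/glidein/start_startd.csh &
end
wait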
GridShell Overview
GridShell allows users to interact with distributed grid computing resources from a simple shell-like interface.
GridShell extends TCSH version 6.12 to incorporate grid-enabled features:
parallel inter-script message-passing and synchronization
output redirection to remote files
parametric sweeps
Redirecting the standard output of a command to a remote file location using GridFTP:
a.out > gsiftp://tg-login.ncsa.teragrid.org/data
Message passing between 2 parallel tasks:
if ( $_GRID_TASKID == 0 ) then
    echo "hello" > task_1
else
    set msg=`cat < task_0`
endif
Executing 256 instances of a job:
a.out on 256 procs
Use GridShell to launch Condor GlideIn jobs at multiple grid sites
All Condor GlideIn jobs report back to a central collector
This converts the entire Teragrid into your own personal Condor pool!
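Once the GlideIn'd startd's have reported in, the pool behaves like any other Condor pool from the login node, and the work units are submitted against it (output omitted; my_survey.submit is the hypothetical submit file sketched earlier):

% condor_status          # list the Teragrid processors that have joined the pool
% condor_submit my_survey.submit
% condor_q               # watch the work units match and run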
Merging GridShell with Condor
[Diagram sequence across three Teragrid sites (PSC, SDSC, NCSA), each with a login node running a GridShell event monitor; the PSC login node also runs the Condor negotiator, schedd, and collector, and the PBS GlideIn job at each site runs condor_startd's on its nodes.]
1. The user starts a GridShell session at PSC.
2. The GridShell session starts event monitors on the remote login nodes via Globus.
3. The local event monitor starts the Condor daemons (negotiator, schedd, collector) on the login node.
4. All event monitors submit Condor GlideIn PBS jobs.
5. The Condor startd's tell the collector that they have started.
6. The Condor schedd distributes independent work units to the compute nodes.
Using GridShell coupled with Condor, one can easily harness the power of the Teragrid to process large numbers of independent work units.
Scheduling can be done dynamically from a central Condor queue to multiple grid sites as clusters of processors become available.
All of this fits within the existing Teragrid software.