Condor COD: Computing on Demand
Condor Week 5/5/2003
Derek Wright
Computer Sciences Department, UW-Madison
Lawrence Berkeley National Laboratory (LBNL)
wright@cs.wisc.edu
http://www.cs.wisc.edu/condor
http://sdm.lbl.gov
What problem are we trying to solve?
› Some people want to run interactive, yet compute-intensive applications
› Jobs that take lots of compute power over a relatively short period of time
› They want to use batch computing resources, but need them right away
› Ideally, when they’re not in use, the resources would go back to the batch system
Some example applications:
› A distributed build/compilation of a large software system
› A very complex spreadsheet that takes a lot of cycles when you press “recalculate”
› High-energy physics (HEP) “analysis” jobs
› Visualization tools for data-mining, rendering graphics, etc.
Example application for COD
[Figure: a controller application on the user’s workstation drives on-demand workers in the compute farm and displays the data they send back; the remaining nodes are idle or running batch jobs.]
What’s the Condor solution?
› Condor COD: “Computing on Demand”
   Use Condor to manage the batch resources when they’re not in use by the interactive jobs
   Allow the interactive jobs to come in with high priority and run instead of the batch job on any given resource
Why did we have to change Condor for that?
› Doesn’t Condor already notice when an interactive job starts on a CPU?
› Doesn’t Condor already provide checkpointing when that happens?
› Can’t I configure Condor to run whatever jobs I want with a higher priority on my own machines?
Well, yes…
But that’s not good enough…
› Not all jobs can be checkpointed, and even those that can take some time…
› We want this to be instantaneous, not waiting for the batch system to schedule tasks…
› You can configure Condor to run higher-priority jobs, but the other jobs are kicked off the machine…
What’s new about COD?
› “Checkpoint to swap space” (a conceptual sketch follows below)
   When a high-priority COD job appears, the lower-priority batch job is suspended
   The COD job can run right away, while the batch job is suspended
   Batch jobs (even those that can’t checkpoint) can resume instantly once there are no more active COD jobs
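Conceptually, the suspension is ordinary process control rather than a real checkpoint: the batch job is stopped in place and its memory can be paged out to swap. A minimal sketch of the same effect done by hand (the PID is illustrative; Condor’s startd/starter performs the equivalent for the batch job it manages):

% kill -STOP 12345   # batch job stops consuming CPU; its pages can go to swap
% kill -CONT 12345   # batch job resumes exactly where it left off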
But wait, there’s more…
› The condor_startd can now manage multiple “claims” on each resource
   If any COD claim becomes active, the regular Condor claim is automatically suspended
   Without an active COD claim, the regular claim resumes
› There is a new command-line tool to request, activate, suspend, resume and release these claims
› There’s even a C++ object to do all of that, if you really want it…
COD claim-management commands
› Request: authorizes the user and returns a unique claim ID for future commands
› Activate: spawns an application on a given COD claim, with various options to define the application, job ID, etc.
   Suspends any regular Condor job
   You can have multiple COD claims on a single resource, and they can all be running simultaneously
COD commands (cont’d)
› Suspend:
   Given COD claim is suspended
   If there are no more active COD claims, a regular Condor batch job can now run
› Resume: Given COD claim is resumed, suspending the Condor batch job (if any)
› Deactivate: Kill the application but hold onto the COD claim
› Release: Get rid of the COD claim itself (the full lifecycle is sketched below)
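Putting the commands together, a typical claim lifecycle looks roughly like this (the keyword and claim-ID placeholder follow the examples later in this talk; the condor_cod_deactivate spelling is assumed by analogy with the other tools):

% condor_cod_request -name c1.cluster.org -classad c1.out   # claim the node, get a claim ID
% condor_cod_activate -keyword fractgen -id "<claim-id>"     # spawn the COD application
% condor_cod_suspend -id "<claim-id>"                        # park it; batch work can run again
% condor_cod_resume -id "<claim-id>"                         # burst of on-demand computation
% condor_cod_deactivate -id "<claim-id>"                     # kill the app, keep the claim (name assumed)
% condor_cod_release -id "<claim-id>"                        # give the node back for good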
COD command protocol
› All commands use ClassAds (a rough example of the exchange follows)
   Allows for a flexible protocol
   Excellent error propagation
   Can use existing ClassAd technology
› Similar to the existing Condor protocol
   Separation of claiming from activation, so you can have hot spares, etc.
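As a rough illustration (not the actual wire format; the attribute names below are assumptions), each command amounts to a small ClassAd sent to the startd, which answers with a ClassAd carrying the result or an error:

# command ad sent by condor_cod_request (illustrative)
Command = "CA_REQUEST_CLAIM"
Owner   = "wright"

# reply ad written to c1.out (illustrative)
Result  = "Success"
ClaimId = "<128.105.143.14:55642>#1051656208#2"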
How does all of that solve the problem?
› The interactive COD application starts up and goes out to claim some compute nodes (for example, with a loop like the one sketched below)
› Once the helper applications are in place and ready, these COD claims are suspended, allowing batch jobs to run
› When the interactive application has work, it can instantly suspend the batch jobs and resume the COD applications to perform the computations
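For instance, the “claim some compute nodes” step could be scripted as a simple loop over the cluster hosts, using the same condor_cod_request invocation shown in Step 3 (the host names are the ones from the example figures):

% for host in c1 c6 c14 c17; do
>   condor_cod_request -name $host.cluster.org -classad $host.out
> done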
Step 1: Initial state
[Figure: the user’s workstation shows an idle shell prompt; the compute farm nodes are either idle or running batch jobs.]
Step 2: Application spawned
[Figure: the user runs “% fractal-gen -n 4” on the workstation, spawning the controller application; the compute farm is still only idle nodes and batch jobs.]
Step 3: Compute node setup
[Figure: the controller claims and initializes 4 compute nodes for rendering, gets replies from c1.cluster.org, c6.cluster.org, c14.cluster.org and c17.cluster.org, and reports “SUCCESS!”; those nodes become on-demand workers while the rest stay idle or keep running batch jobs.]
Step 3: Commands used
% condor_cod_request -name c1.cluster.org \
    -classad c1.out
Successfully sent CA_REQUEST_CLAIM to startd
  at <128.105.143.14:55642>
Result ClassAd written to c1.out
ID of new claim is:
  "<128.105.143.14:55642>#1051656208#2"

% condor_cod_activate -keyword fractgen \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_ACTIVATE_CLAIM to startd
  at <128.105.143.14:55642>
% …
Step 4: “Checkpoint” to swap
[Figure: the interactive tool prompts “SELECT FRACTAL TYPE <Mandelbrot>” and waits for more user input; the on-demand worker on each claimed node is suspended, and the rest of the farm is idle or running batch jobs.]
Step 4: Commands used
% condor_cod_suspend \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_SUSPEND_CLAIM to startd
  at <128.105.143.14:55642>
% …
› The rendering application on each COD node is suspended while the interactive tool waits for input
› The resources are now available for batch Condor jobs
Step 5: Batch jobs can run
[Figure: the user is still specifying parameters (max_iterations: 400000; TL: -0.65865, -0.56254; BR: -0.45865, -0.71254); meanwhile, jobs from the batch queue run on the claimed nodes and the idle nodes.]
Step 6: Computation burst
[Figure: the user clicks <RENDER>; the on-demand workers resume as interactive jobs, the batch jobs on those nodes are suspended, and the remaining nodes stay idle or keep running batch jobs.]
Step 6: Commands used
% condor_cod_resume \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RESUME_CLAIM to startd
  at <128.105.143.14:55642>
% …
› Batch Condor jobs on the COD nodes are suspended
› All COD rendering applications are resumed on each node
Step 7: Results produced
[Figure: the on-demand workers send data back to the display on the user’s workstation; the batch jobs on those nodes remain suspended, and the rest of the farm is idle or running batch jobs.]
Step 8: User input while batch work resumes
[Figure: the user enters zoom box coordinates (TL = -0.60301, -0.61087; BR = -0.58037, -0.62785); the on-demand workers are suspended again and batch jobs run across the farm.]
Step 9: Computation burst #2
[Figure: the user clicks <RENDER> again; the on-demand workers resume, the batch jobs on those nodes are suspended, and data flows back to the display.]
Step 10: Clean-up
[Figure: the user confirms “REALLY QUIT? Y/N”; the controller releases the compute nodes and reports “4 nodes terminated successfully!”; the farm returns to idle nodes and batch jobs.]
Step 10: Commands used
% condor_cod_release \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RELEASE_CLAIM to startd
  at <128.105.143.14:55642>
State of claim when it was released: "Running"
% …
› The jobs are cleaned up, claims released, and resources returned to the batch system
Other changes for COD:
› The condor_starter has been modified so that it can run jobs without communicating with a condor_shadow
   All the great job-control features of the starter, without a shadow
   The starter can write its own UserLog
   Other useful features for COD
condor_status -cod
› New “-cod” option to condor_status to view COD claims in a Condor pool:

Name         ID    ClaimState  TimeInState  RemoteUser  JobId  Keyword
astro.cs.wi  COD1  Idle        0+00:00:04   wright
chopin.cs.w  COD1  Running     0+00:02:05   wright      3.0    fractgen
chopin.cs.w  COD2  Suspended   0+00:10:21   wright      4.0    fractgen

              Total  Idle  Running  Suspended  Vacating  Killing
 INTEL/LINUX      3     1        1          1         0        0
       Total      3     1        1          1         0        0
What else could I use all these new features for?
› Short-running system administration tasks that need quick access but don’t want to disturb the jobs in your batch system
› A “Grid Shell”
   A condor_starter that doesn’t need a condor_shadow is a powerful job-management environment that can monitor a job running under a “hostile” batch system on the grid
Future work
› More ways to tell COD about your application
   For now, you define the important attributes in your condor_config file and pre-stage the executables (a rough sketch follows)
› Ability to transfer files to and from a COD job at a remote machine
   We’ve already got the functionality in Condor, so why rely on a shared filesystem or pre-staging?
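As an illustration of what “define the important attributes in your condor_config file” might look like for the fractgen keyword used earlier (the attribute names and paths here are assumptions for the sketch, not the documented syntax; see the COD section of the Condor manual for the authoritative list):

# condor_config entries for a COD application with keyword "fractgen" (illustrative)
FRACTGEN_CMD  = /usr/local/bin/fractgen-worker
FRACTGEN_ARGS = --listen-for-work
FRACTGEN_IWD  = /scratch/cod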
More future work
› Accounting for COD jobs
› Working with some real-world applications and integrating these new COD features
   Would the real users please stand up?
› Better “Grid Shell” support
   This is really a separate-yet-related area of work…
How do you use COD?
› Upgrade to Condor version 6.5.3 or greater… COD is already included
› There will be a new section in the Condor manual (coming soon)
› If you need more help, ask the ever-helpful condor-admin@cs.wisc.edu
› Find me at the BoF on Wednesday, 9am to Noon (room TBA)