Eager, Lazy, and Just-in-Time Planning Edinburgh Workshop Oct 2003

Condor Project
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor
Planning vs. Scheduling
› Can you control the resources?
Yes? Scheduling.
No? Planning.
› Planning is a ‘client’ operation.
The question of When
› Many open questions in planning.
› An important consideration: when the planning occurs.
[Figure: time axis with Eager, Lazy, and Just-in-Time marked along it]
Eager Example
› First Pass of EDG
[Figure: diagram with Resource Broker (RB), DAGMan, Condor-G, Globus, Site Scheduler, and Fabric]
Eager Condor-G Submit File
universe = globus
globussite = beak.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = ….
output = ….
log = …
EDG Resource Broker Gets Lazy…
› Addition of a DAGMan callout
› DAGMan is given a command (script) to run immediately before submission of a job to Condor-G
▪ (different from a PRE script on a node)
› The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph
› This allows changes to be made to the submit file (e.g. changing the globussite attribute) at the last minute
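The last-minute rewrite that such a helper performs can be sketched as below. This is a minimal sketch, not the EDG Resource Broker's actual helper: the function names `choose_site` and `rewrite_globussite`, the "shortest queue" selection rule, and the way the helper receives the submit file text are all assumptions made for illustration.

```python
import re

# Matches a whole "globussite = ..." line; [ \t] keeps the match on one line.
_GLOBUSSITE_LINE = re.compile(r"^[ \t]*globussite[ \t]*=.*$",
                              re.IGNORECASE | re.MULTILINE)

def choose_site(queue_lengths):
    """Hypothetical broker decision: pick the site with the shortest queue."""
    return min(queue_lengths, key=queue_lengths.get)

def rewrite_globussite(submit_text, site):
    """Return the submit file text with its globussite line rebound to `site`."""
    new_text, count = _GLOBUSSITE_LINE.subn(
        lambda match: "globussite = " + site, submit_text)
    if count == 0:
        # The file carried no binding yet, so append one.
        new_text = submit_text.rstrip("\n") + "\nglobussite = " + site + "\n"
    return new_text
```

A real helper would read the copy of the submit file it is handed, call something like `rewrite_globussite`, and write the result back before the node is submitted. The point of the sketch is only that the binding to a site is decided at submission time rather than when the graph was built.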
Eager Example
› First Pass of EDG
[Figure: diagram with Resource Broker (RB), DAGMan, callout, Condor-G, Globus, Site Scheduler, and Fabric]
Moving Condor-G to Just-In-Time
› Delay the binding of the task (job) to the resource until the resource is ready.
› Need to know when the resource is ready.
› One way: the unimplemented Globus 1.1 “queue wait time” estimate
▪ Not really just-in-time, because of lies, lies, lies…
› Another way… the Condor-G GlideIn mechanism.
How It Works
[Figure, built up over seven slides: 600 Condor jobs, Condor-G (Schedd, Collector), and a Globus Resource (LSF). Elements added in successive builds: GlideIn jobs, GridManager, JobManager, Startd, User Job.]
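The late binding that GlideIn provides can be sketched as a toy model. This is a minimal sketch under stated assumptions, not Condor code: the classes `Collector` and `Schedd` and the function `glidein_starts` are simplified stand-ins invented for illustration, and real matchmaking involves more than pairing the next idle job with the next ready startd.

```python
from collections import deque

class Collector:
    """Toy stand-in for the collector: records which startds are ready now."""
    def __init__(self):
        self.ready = deque()

    def advertise(self, startd_name):
        self.ready.append(startd_name)

class Schedd:
    """Toy stand-in for the schedd: holds user jobs not yet bound to a resource."""
    def __init__(self, jobs):
        self.idle = deque(jobs)
        self.running = {}  # job -> startd it was bound to

    def match(self, collector):
        # Bind jobs only to startds that have already announced themselves.
        while self.idle and collector.ready:
            startd = collector.ready.popleft()
            job = self.idle.popleft()
            self.running[job] = startd

def glidein_starts(collector, startd_name):
    """Called when the site scheduler finally runs a glidein job:
    the startd it launches announces itself to the collector."""
    collector.advertise(startd_name)
```

In this model no user job is bound while the glideins are still waiting in the remote queue. A job is bound only after `glidein_starts` has been called, which is the moment the resource is known to be ready.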
A Just-in-time Submit
executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
# job describes the "power"
rank = MFlops * 10000 + Memory
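How requirements and rank pick a resource can be sketched in a few lines. This is a minimal sketch, not Condor's matchmaker: machine descriptions are plain dictionaries, the attribute values used in any example are hypothetical, and `best_match` is an illustrative name.

```python
def requirements(machine):
    """Mirror of the submit file's requirements expression."""
    return machine["Arch"] in ("Intel/Linux", "Sparc/Solaris")

def rank(machine):
    """Mirror of the submit file's rank expression: higher is preferred."""
    return machine["MFlops"] * 10000 + machine["Memory"]

def best_match(machines):
    """Among machines that satisfy requirements, return the highest-ranked one.

    Returns None when no machine qualifies.
    """
    candidates = [m for m in machines if requirements(m)]
    return max(candidates, key=rank, default=None)
```

The job never names a site. It states what is acceptable (requirements) and what is preferred (rank), and the choice is made against whichever resources are available at match time.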
Another Just-in-time Submit
executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
rank = sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
+dataset = "search_space_id_0133313"
Lots of Tradeoffs…
› Just-in-Time
▪ Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
▪ Con: Coordinating multiple resources is harder.
› Eager
▪ Pro: Easier to coordinate multiple resources.
▪ Con: Hard to scale… how to know about all the resources in advance?
▪ Con: Plan falls apart if assumptions change.
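The "plan falls apart" tradeoff can be illustrated with a toy comparison. This is a minimal sketch with hypothetical task and site names. The round-robin eager plan and the "first name in sorted order" just-in-time choice are arbitrary placeholders, not policies taken from Condor or EDG.

```python
from itertools import cycle

def eager_plan(tasks, resources):
    """Bind every task up front, round-robin over the resources known now."""
    return dict(zip(tasks, cycle(resources)))

def stale_bindings(plan, available):
    """Tasks whose planned resource is no longer available."""
    return sorted(task for task, res in plan.items() if res not in available)

def just_in_time_bind(task, available):
    """Bind one task at dispatch time, using only what is available right now.

    Returns None when nothing is ready, so the task simply stays unbound.
    """
    if not available:
        return None
    return min(available)  # placeholder choice, deterministic for the example
```

If a site disappears after `eager_plan` has run, every task bound to it must be replanned. `just_in_time_bind` never sees the missing site at all, but it also makes each decision in isolation, which is why coordinating several resources is harder in that style.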
Some observations
› A complete separation of task from resource is difficult.
▪ Lots and lots of structured data required.
▪ But this separation is required in order to achieve Just-In-Time planning.
› Grid protocols that do not separate task from resource cannot realistically live on the grid.
▪ Virtualization can help.
Plan for failure
› Much effort goes into how to create a plan.
› How about a plan for when things fail?
Job Failure Policy Expressions
› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
› Can be used to describe a successful run, or what to do in the face of failure.
on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>
Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if the job exits with a signal:
on_exit_remove = ExitBySignal == False
› Place on hold if the job exits with nonzero status or ran for less than an hour:
on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
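What these three expressions decide can be sketched by evaluating them against a job described as a dictionary. This is a minimal sketch, not Condor's evaluator. It reads "nonzero status" as `ExitCode != 0`, it assumes the hold expression is checked before the remove expression, and the function `after_exit` with its "hold" / "remove" / "requeue" results is an illustrative invention.

```python
def on_exit_remove(job):
    """Leave the queue only if the job was not killed by a signal."""
    return not job["ExitBySignal"]

def on_exit_hold(job):
    """Hold on a nonzero exit status, or if the run lasted under an hour."""
    bad_status = (not job["ExitBySignal"]) and job["ExitCode"] != 0
    too_short = (job["ServerStartTime"] - job["JobStartDate"]) < 3600
    return bad_status or too_short

def periodic_hold(job):
    """Hold if more than half of the wall-clock time was spent suspended."""
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0

def after_exit(job):
    """Combine the two exit policies (assumed order: hold first, then remove)."""
    if on_exit_hold(job):
        return "hold"
    if on_exit_remove(job):
        return "remove"
    return "requeue"  # stays in the queue and is rescheduled
```

Under these assumptions, a job that runs for two hours and exits with status 0 is removed, the same job killed by a signal is requeued, and a job that exits with status 1 or finishes within the first hour is held for inspection.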
Thank you!
http://www.cs.wisc.edu/condor
tannenba@cs.wisc.edu