Eager, Lazy, and Just-in-Time Planning
Edinburgh Workshop, Oct 2003
Condor Project, Computer Sciences Department, University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Planning vs. Scheduling
› Can you control the resources? Yes? Scheduling. No? Planning.
› Planning is a "client" operation.

The question of When
› Planning raises lots of open questions.
› An important consideration: when the planning occurs.
[Timeline diagram: Time, with Eager, Lazy, Just-in-Time]

Eager Example
› First Pass of EDG
[Diagram labels: Globus Resource Broker, RB, DAGMan, Condor-G, Site Scheduler, Fabric]

Eager Condor-G Submit File
    universe = globus
    globussite = beak.cs.wisc.edu/jobmanager-lsf
    executable = find_particle
    arguments = ….
    output = ….
    log = …

EDG Resource Broker Gets Lazy…
› Addition of a DAGMan callout.
› DAGMan is given a command (script) to run immediately before it submits a job to Condor-G (this is different from a PRE script on a node).
› The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph.
› This allows changes to be made to the submit file (e.g. changing the globussite attribute) at the last minute.

Eager Example
› First Pass of EDG
[Diagram labels: Globus Resource Broker, DAGMan, RB, Site Scheduler, callout, Condor-G, Fabric]

Moving Condor-G to Just-In-Time
› Delay the binding of the task (job) to the resource until the resource is ready.
› Need to know when the resource is ready.
    One way: the unimplemented Globus 1.1 "queue wait time" estimate.
    Not really just-in-time, because of lies, lies, lies…
› Another way… the Condor-G GlideIn mechanism.
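
The idea of delaying the binding until a resource is ready can be illustrated with a toy sketch. This is not Condor-G or GlideIn code: the `Resource` and `Job` classes, their fields, and the `bind_when_ready` function are hypothetical names invented for this illustration, and the first-match policy is an assumption chosen for brevity. The only point is that no job names a site at submit time, and the choice is made when a resource reports in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Resource:
    """A resource that has just reported itself ready (hypothetical fields)."""
    name: str
    arch: str


@dataclass
class Job:
    """A queued task that is not yet tied to any site."""
    name: str
    requirements: Callable[[Resource], bool]  # True if the resource is acceptable


def bind_when_ready(queue: List[Job], resource: Resource) -> Optional[Job]:
    """Bind the first queued job that accepts `resource`, or None if none match.

    The decision is made only when the resource announces that it is ready,
    never at submit time.
    """
    for job in queue:
        if job.requirements(resource):
            queue.remove(job)  # safe: we return immediately after mutating
            return job
    return None


# Two jobs wait in the queue; neither names a site.
queue = [
    Job("needs_sparc", lambda r: r.arch == "Sparc/Solaris"),
    Job("any_intel", lambda r: r.arch == "Intel/Linux"),
]

# A resource shows up; only now is a job bound to it.
ready = Resource("node-a", "Intel/Linux")
chosen = bind_when_ready(queue, ready)
```

In this sketch the job that cannot use the resource simply stays queued until a suitable resource appears, which is the property an eager plan gives up.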

How It Works
[Diagram built up over seven slides. Labels in order of appearance: 600 Condor jobs, Condor-G, Globus Resource, Schedd, LSF, Collector, GlideIn jobs, GridManager, JobManager, Startd, User Job]

A Just-in-time Submit
    executable = find_particle
    requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
    # job describes the "power"
    rank = MFlops * 10000 + Memory

Another Just-in-time Submit
    executable = find_particle
    requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
    rank = sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
    +dataset = search_space_id_0133313

Lots of Tradeoffs…
› Just-in-Time
    Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
    Con: Coordination of multiple resources.
› Eager
    Pro: Easier to coordinate multiple resources.
    Con: Hard to scale… how to know about all the resources in advance?
    Con: Plan falls apart if assumptions change.

Some observations
› A complete separation of task from resource is difficult.
    Lots and lots of structured data required.
    But this separation is required in order to achieve Just-In-Time planning.
› Grid protocols that do not separate task from resource cannot realistically live on the grid.
    Virtualization can help.

Plan for failure
› Much effort goes into how to create a plan.
› How about a plan for when things fail?

Job Failure Policy Expressions
› Condor/Condor-G has been augmented so users can supply job failure policy expressions in the submit file.
› These can be used to describe a successful run, or what to do in the face of failure.
    on_exit_remove = <expression>
    on_exit_hold = <expression>
    periodic_remove = <expression>
    periodic_hold = <expression>

Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if the job exits with a signal:
    on_exit_remove = ExitBySignal == False
› Place on hold if the job exits with nonzero status or ran for less than an hour:
    on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
    periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)

Thank you!
http://www.cs.wisc.edu/condor
tannenba@cs.wisc.edu