Condor COD (Computing On Demand)
Condor Week, 5/5/2003
Derek Wright
Computer Sciences Department, UW-Madison
Lawrence Berkeley National Labs (LBNL)
wright@cs.wisc.edu
http://www.cs.wisc.edu/condor
http://sdm.lbl.gov

What problem are we trying to solve?
› Some people want to run interactive, yet compute-intensive applications
› Jobs that take lots of compute power over a relatively short period of time
› They want to use batch computing resources, but need them right away
› Ideally, when they’re not in use, resources would go back to the batch system

Some example applications:
› A distributed build/compilation of a large software system
› A very complex spreadsheet that takes a lot of cycles when you press “recalculate”
› High-energy physics (HEP) “analysis” jobs
› Visualization tools for data-mining, rendering graphics, etc.

Example application for COD
[Diagram: a controller application on the user’s workstation sends data to on-demand workers in the compute farm and displays the results, while the farm’s idle nodes keep running batch jobs.]

What’s the Condor solution?
› Condor COD: “Computing on Demand”
  - Use Condor to manage the batch resources when they’re not in use by the interactive jobs
  - Allow the interactive jobs to come in with high priority and run instead of the batch job on any given resource

Why did we have to change Condor for that?
› Doesn’t Condor already notice when an interactive job starts on a CPU?
› Doesn’t Condor already provide checkpointing when that happens?
› Can’t I configure Condor to run whatever jobs I want with a higher priority on my own machines?

Well, yes… But that’s not good enough…
› Not all jobs can be checkpointed, and even those that can take some time…
› We want this to be instantaneous, not waiting for the batch system to schedule tasks…
› You can configure Condor to run higher priority jobs, but the other jobs are kicked off the machine…
What’s new about COD?
› “Checkpoint to swap space”
  - When a high-priority COD job appears, the lower-priority batch job is suspended
  - The COD job can run right away, while the batch job is suspended
  - Batch jobs (even those that can’t checkpoint) can resume instantly once there are no more active COD jobs

But wait, there’s more…
› The condor_startd can now manage multiple “claims” on each resource
  - If any COD claim becomes active, the regular Condor claim is automatically suspended
  - Without an active COD claim, the regular claim resumes
› There is a new command-line tool to request, activate, suspend, resume and release these claims
› There’s even a C++ object to do all of that, if you really want it…

COD claim-management commands
› Request: authorizes the user and returns a unique claim ID for future commands
› Activate: spawns an application on a given COD claim, with various options to define the application, job ID, etc.
  - Suspends any regular Condor job
  - You can have multiple COD claims on a single resource, and they can all be running simultaneously

COD commands (cont’d)
› Suspend: Given COD claim is suspended
  - If there are no more active COD claims, a regular Condor batch job can now run
› Resume: Given COD claim is resumed, suspending the Condor batch job (if any)
› Deactivate: Kill the application but hold onto the COD claim
› Release: Get rid of the COD claim itself
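Putting those commands together, the life cycle of a single COD claim looks roughly like the transcript below. This is only a minimal sketch: the host name is a placeholder, and "<claim-id>" stands for the claim ID that condor_cod_request returns. The fractal example in the following slides shows the same commands with real output.

% condor_cod_request -name node01.example.org -classad node01.out
  (authorize the user and record the new claim ID)
% condor_cod_activate -keyword fractgen -id "<claim-id>"
  (spawn the application; any regular batch job on the node is suspended)
% condor_cod_suspend -id "<claim-id>"
  (park the COD application; the batch job resumes)
% condor_cod_resume -id "<claim-id>"
  (computation burst: the batch job is suspended again)
% condor_cod_release -id "<claim-id>"
  (give the node back to the batch system)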
COD command protocol
› All commands use ClassAds
  - Allows for a flexible protocol
  - Excellent error propagation
  - Can use existing ClassAd technology
› Similar to existing Condor protocol
  - Separation of claiming from activation, so you can have hot-spares, etc.

How does all of that solve the problem?
› The interactive COD application starts up and goes out to claim some compute nodes
› Once the helper applications are in place and ready, these COD claims are suspended, allowing batch jobs to run
› When the interactive application has work, it can instantly suspend the batch jobs and resume the COD applications to perform the computations

Step 1: Initial state
[Diagram: the user’s workstation sits at an idle shell prompt; every node in the compute farm is running batch jobs.]

Step 2: Application spawned
[Diagram: the user runs "% fractal-gen -n 4" on the workstation, spawning the controller application; the compute farm’s nodes are still running batch jobs.]

Step 3: Compute node setup
[Diagram: the controller claims four on-demand workers in the compute farm while the remaining idle nodes keep running batch jobs. Workstation output: "Claiming and initializing [4] compute nodes for rendering… Got reply from: c1.cluster.org c6.cluster.org c14.cluster.org c17.cluster.org SUCCESS!"]

Step 3: Commands used

% condor_cod_request -name c1.cluster.org \
    -classad c1.out
Successfully sent CA_REQUEST_CLAIM to startd at <128.105.143.14:55642>
Result ClassAd written to c1.out
ID of new claim is: "<128.105.143.14:55642>#1051656208#2"

% condor_cod_activate -keyword fractgen \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_ACTIVATE_CLAIM to startd at <128.105.143.14:55642>
% …

Step 4: “Checkpoint” to swap
[Diagram: the on-demand workers are now suspended and the idle nodes keep running batch jobs while the workstation prompts "SELECT FRACTAL TYPE <Mandelbrot>" (more user input…).]

Step 4: Commands used

% condor_cod_suspend \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_SUSPEND_CLAIM to startd at <128.105.143.14:55642>
% …

› Rendering application on each COD node is suspended while the interactive tool waits for input
› The resources are now available for batch Condor jobs

Step 5: Batch jobs can run
[Diagram: with the COD claims suspended, batch jobs from the batch queue run on all nodes of the compute farm while the workstation prompts "SPECIFY PARAMETERS  max_iterations: 400000  TL: -0.65865, -0.56254  BR: -0.45865, -0.71254" (more user input…).]

Step 6: Computation burst
[Diagram: the user clicks RENDER ("CLICK <RENDER> TO VIEW YOUR FRACTAL…"); the interactive on-demand workers resume, the batch jobs on those nodes are suspended, and the other idle nodes keep running batch jobs.]

Step 6: Commands used

% condor_cod_resume \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RESUME_CLAIM to startd at <128.105.143.14:55642>
% …

› Batch Condor jobs on COD nodes are suspended
› All COD rendering applications are resumed on each node (a small script for driving this suspend/resume cycle across many claims is sketched after the condor_status slide below)

Step 7: Results produced
[Diagram: the on-demand workers send data back to the workstation, which displays the rendered fractal; the batch jobs on those nodes remain suspended.]

Step 8: User input while batch work resumes
[Diagram: the workers are suspended again and batch jobs resume while the workstation prompts "ZOOM BOX COORDINATES: TL = -0.60301, -0.61087  BR = -0.58037, -0.62785".]

Step 9: Computation burst #2
[Diagram: the user clicks RENDER again; the on-demand workers resume, send data to the workstation display, and the batch jobs on those nodes are suspended.]

Step 10: Clean-up
[Diagram: the user quits ("REALLY QUIT? Y/N  Releasing compute nodes… 4 nodes terminated successfully!"); every node in the compute farm goes back to running batch jobs.]

Step 10: Commands used

% condor_cod_release \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RELEASE_CLAIM to startd at <128.105.143.14:55642>
State of claim when it was released: "Running"
% …

› The jobs are cleaned up, claims released, and resources returned to the batch system

Other changes for COD:
› The condor_starter has been modified so that it can run jobs without communicating with a condor_shadow
  - All the great job control features of the starter without a shadow
  - Starter can write its own UserLog
  - Other useful features for COD

condor_status -cod
› New "-cod" option to condor_status to view COD claims in a Condor pool:

Name         ID    ClaimState  TimeInState  RemoteUser  JobId  Keyword
astro.cs.wi  COD1  Idle        0+00:00:04   wright
chopin.cs.w  COD1  Running     0+00:02:05   wright      3.0    fractgen
chopin.cs.w  COD2  Suspended   0+00:10:21   wright      4.0    fractgen

             Total  Idle  Running  Suspended  Vacating  Killing
INTEL/LINUX      3     1        1          1         0        0
      Total      3     1        1          1         0        0
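The suspend/resume cycle from Steps 4 and 6 is easy to drive from a script once there are many claims. Below is a minimal, hypothetical wrapper (not part of Condor) that assumes the claim IDs returned by condor_cod_request have been saved, one per line, in a file named claims.txt:

#!/bin/sh
# burst.sh -- hypothetical helper, not shipped with Condor.
# Suspend or resume every COD claim listed in claims.txt
# (one claim ID per line, recorded when the claims were requested).
# Usage:  ./burst.sh suspend     or     ./burst.sh resume
action=$1
case "$action" in
    suspend|resume) ;;
    *) echo "usage: $0 suspend|resume" >&2; exit 1 ;;
esac
while read claim; do
    condor_cod_${action} -id "$claim"
done < claims.txt

A controller like the fractal example would run "./burst.sh resume" just before a computation burst and "./burst.sh suspend" as soon as it goes back to waiting for user input.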
What else could I use all these new features for?
› Short-running system administration tasks that need quick access but don’t want to disturb the jobs in your batch system
› A “Grid Shell”
  - A condor_starter that doesn’t need a condor_shadow is a powerful job management environment that can monitor a job running under a “hostile” batch system on the grid

Future work
› More ways to tell COD about your application
  - For now, you define important attributes in your condor_config file and pre-stage the executables (a rough sketch appears after the last slide)
› Ability to transfer files to and from a COD job at a remote machine
  - We’ve already got the functionality in Condor, so why rely on a shared filesystem or pre-staging?

More future work
› Accounting for COD jobs
› Working with some real-world applications and integrating these new COD features
  - Would the real users please stand up?
› Better “Grid Shell” support
  - This is really a separate-yet-related area of work…

How do you use COD?
› Upgrade to Condor version 6.5.3 or greater… COD is already included
› There will be a new section in the Condor manual (coming soon)
› If you need more help, ask the ever-helpful condor-admin@cs.wisc.edu
› Find me at the BoF on Wednesday, 9am to noon (room TBA)
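As mentioned on the “Future work” slide, today a COD application is described by attributes in condor_config plus a pre-staged executable on each node. The fragment below is a rough, hypothetical sketch of what that might look like for the fractgen keyword used in the example; the attribute names and paths are illustrative assumptions, not documented syntax, so consult the COD section of the Condor manual for the real ones.

# Hypothetical sketch only: pre-defining a COD application for the
# "fractgen" keyword in condor_config.  The attribute names and paths
# below are illustrative assumptions; see the COD section of the
# Condor manual for the actual configuration syntax.
FRACTGEN_CMD   = /usr/local/bin/fractgen-worker
FRACTGEN_ARGS  = -worker
FRACTGEN_OWNER = wright
FRACTGEN_IWD   = /scratch/fractgen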