Condor
Tugba Taskaya-Temizel
6 March 2006

What is Condor Technology?

Condor is a high-throughput distributed batch computing system that provides facilities such as job management, scheduling policy, priority scheme, resource monitoring and management (Thain, et al. 2005). It offers the following features:

1. ClassAds: A framework for matching resources with the specified job descriptions.
2. Job Checkpoint and Migration: For some applications, it is possible to resume the application from its last state using a checkpoint file. This provides a means of fault tolerance. For example, if a machine fails, the job can be safely transferred to another machine.
3. Remote System Calls: Condor supports I/O-related jobs (processes, executables) that read input files and generate output files. With this mechanism, the job's files are handled automatically on the remote machines, so you need neither transfer the files by hand nor rely on a shared file system.

Condor in our Department

There are 30 machines in our departmental Condor pool, of which 19 are Linux based (concorde01-concorde06, tornado01-tornado13) and 11 run the NT operating system. The total number of CPUs is 107.

To connect to one of the Condor machines, type:

    telnet concorde03.mcs.surrey.ac.uk

To inspect the Condor pool, run:

    condor_status

The output will look like:

    Name           OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime
    vm1@concorde0  LINUX  INTEL  Owner      Idle      1.000   251  0+00:30:56
    vm2@concorde0  LINUX  INTEL  Owner      Idle      1.000   251  0+21:45:42
    vm3@concorde0  LINUX  INTEL  Claimed    Busy      0.320   251  0+00:14:40
    vm4@concorde0  LINUX  INTEL  Claimed    Busy      0.650   251  0+09:48:20
    vm1@concorde0  LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:13
    vm2@concorde0  LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:05

To see only the available machines, run:

    condor_status -available

How to Run a Job in Condor

The jobs that run in the Condor environment are background jobs.
Hence, they do not accept any input from the user while they run. Choose an appropriate universe according to the type of your application. A universe is an execution environment in Condor. Condor provides many universes, such as 'Standard', 'Vanilla', 'PVM', 'MPI', 'Globus', 'Java' and 'Scheduler'. The universe type should be specified in the job's submit description file (called the ClassAd file in this tutorial).

How to Run a Job in Condor: Standard Universe

The standard universe provides a checkpoint mechanism that saves the last state of the job. This is useful when long-running jobs have to migrate to another machine.

Create a directory such as $HOME/gt3/samples/condor.

Create a file called counter.c and write the following lines to it:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int i;
        /* Print every integer from argv[1] up to, but not including, argv[2]. */
        for (i = atoi(argv[1]); i < atoi(argv[2]); i++) {
            printf("%d \n", i);
        }
        return 0;
    }

Compile the file and link it with Condor:

    condor_compile cc counter.c -o counter

Once the program is linked, create a ClassAd file to execute the job. Create a file with the extension 'cmd', such as 'standardunitest.cmd', and write the following lines:

    Executable = counter

    Arguments  = 1 30
    Output     = counterc1.out
    Log        = counterc1.log
    Queue 1

    Arguments  = 30 60
    Output     = counterc2.out
    Queue 1

The same file can also restrict the jobs to specific machines. Create 'standardunitest.cmd' as before.
Then, write the following lines. In this variant, the Requirements expression restricts the jobs to the named machines on concorde01 to concorde06 (the trailing backslashes continue the expression across lines):

    Executable   = counter
    Requirements = (Name == "vm1@concorde01.mcs.surrey.ac.uk" || \
                    Name == "vm2@concorde01.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde01.mcs.surrey.ac.uk" || \
                    Name == "vm4@concorde01.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde02.mcs.surrey.ac.uk" || \
                    Name == "vm2@concorde02.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde02.mcs.surrey.ac.uk" || \
                    Name == "vm4@concorde02.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde03.mcs.surrey.ac.uk" || \
                    Name == "vm2@concorde03.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde03.mcs.surrey.ac.uk" || \
                    Name == "vm4@concorde03.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde04.mcs.surrey.ac.uk" || \
                    Name == "vm2@concorde04.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde04.mcs.surrey.ac.uk" || \
                    Name == "vm4@concorde04.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde05.mcs.surrey.ac.uk" || \
                    Name == "vm2@concorde05.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde05.mcs.surrey.ac.uk" || \
                    Name == "vm4@concorde05.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde06.mcs.surrey.ac.uk" || \
                    Name == "vm2@concorde06.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde06.mcs.surrey.ac.uk" || \
                    Name == "vm4@concorde06.mcs.surrey.ac.uk")

    Arguments    = 1 30
    Output       = counterc1.out
    Log          = counterc1.log
    Queue 1

    Arguments    = 30 60
    Output       = counterc2.out
    Queue 1

To submit the job to the Condor pool, run the following command:

    condor_submit standardunitest.cmd

The output will be:

    Submitting job(s)..
    Logging submit event(s)..
    2 job(s) submitted to cluster 92.

To inspect your job, run:

    condor_q

This will display your jobs.
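The cluster number reported by condor_submit is what later commands such as condor_q and condor_rm refer to. As a minimal sketch only, the shell function below pulls that number out of the summary line. The name cluster_id is a hypothetical helper, not part of Condor, and the sample text is simply the output shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a Condor command): read condor_submit output on
# standard input and print the cluster number from its summary line,
# e.g. "2 job(s) submitted to cluster 92."
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p'
}

# Sample text copied from the tutorial output, so no Condor pool is needed.
sample='Submitting job(s)..
Logging submit event(s)..
2 job(s) submitted to cluster 92.'

printf '%s\n' "$sample" | cluster_id
```

In a live session the function would presumably be fed from the real command, for example cluster=$(condor_submit standardunitest.cmd | cluster_id), assuming the summary line has the form shown in this tutorial.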
When condor_q is first run, your jobs are expected to be idle:

    -- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
     ID     OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
     92.0   css1tt  3/3 14:52    0+00:00:00  I   0    3.4   counter 1 30
     92.1   css1tt  3/3 14:52    0+00:00:00  I   0    3.4   counter 30 60

    2 jobs; 2 idle, 0 running, 0 held

After a couple of minutes, when you run 'condor_q' again, you should see the following:

    -- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
     ID     OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD

    0 jobs; 0 idle, 0 running, 0 held

How to Run a Job in Condor: Java Universe

A Java program can be run on any machine with a JVM. Unlike in the standard universe, the jobs cannot be suspended and moved to another machine. However, in the case of a failure, the jobs can be restarted on another machine.

Create a Java file called 'Counter.java' and write the following lines to it:

    public class Counter {
        public static void main(String[] args) {
            // Print every integer from args[0] up to, but not including, args[1].
            int startt = Integer.parseInt(args[0]);
            int stopp = Integer.parseInt(args[1]);
            for (int i = startt; i < stopp; i++) {
                System.out.println(i);
            }
        }
    }

Then, compile the program:

    javac Counter.java

Next, create a submit description file. Recall that the file should have the extension 'cmd', for example 'javaunitest.cmd'.
Add the following lines to the file:

    universe   = java
    executable = Counter.class
    log        = counter.log

    arguments  = Counter 1 30
    output     = counter1.output
    error      = counter1.error
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

    arguments  = Counter 30 60
    output     = counter2.output
    error      = counter2.error
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

To submit the jobs to Condor, run:

    condor_submit javaunitest.cmd

To inspect their status, type:

    condor_q

The output of the command will look like:

    -- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
     ID     OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
     98.0   css1tt  3/3 15:58    0+00:00:00  I   0    0.0   java Counter 1 30
     98.1   css1tt  3/3 15:58    0+00:00:00  I   0    0.0   java Counter 30 60

    2 jobs; 2 idle, 0 running, 0 held

If you notice a problem with a job, remove it from the Condor pool by calling condor_rm with the job's ID:

    condor_rm ID

How to Run a Job in Condor: Vanilla Universe

Some applications, such as shell scripts, cannot be run in the standard or Java universes. Shell scripts can be used to call external applications such as Matlab.

Create a file called 'count.m' and write the following lines to it:

    function count(startt, stopp)
    for i=startt:stopp-1
        i
    end

To run the Matlab program, we must start the Matlab application and then call our function from it. A script does both. Create a file called 'runmatlab.sh':

    #!/bin/sh
    echo "Number of arguments: $#"
    matlab -r "addpath /user/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial/; count($1,$2);quit;"

As a final step, prepare the description file. Create a file with the extension '.cmd', such as 'matlabtest.cmd', with the following contents:
    Universe     = vanilla
    executable   = /a/filer2/home/filer2/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial/runmatlab.sh
    Initialdir   = /a/filer2/home/filer2/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial
    Requirements = Memory>=20 && Arch == "INTEL" && OpSys == "LINUX"
    Getenv       = True
    Log          = matlabpro.log

    # main matlab file to execute
    Arguments    = 1 30
    Output       = matlab1.out
    Error        = matlab1.err
    transfer_input_files    = count.m
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Queue 1

    # main matlab file to execute
    Arguments    = 30 60
    Output       = matlab2.out
    Error        = matlab2.err
    transfer_input_files    = count.m
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Queue 1

To submit the job, run:

    condor_submit matlabtest.cmd

To see the output of your job, call:

    more matlab1.out
    more matlab2.out

EXERCISE: Submit the Matlab and Java counter programs together using the same description file. Both programs should be specified in the vanilla universe.
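The description files in this tutorial repeat the same Arguments / Output / Queue group for every range of numbers. As a rough sketch, and not part of the original material, the shell function below generates such a file for the compiled counter program, one job per sub-range. The function name make_submit and the file names counter.log and counterN.out are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a submit description file for the 'counter'
# executable, splitting [START, STOP) into sub-ranges of at most STEP numbers.
# Usage: make_submit START STOP STEP
make_submit() {
    start=$1
    stop=$2
    step=$3
    # A non-positive step would never reach STOP, so refuse it.
    [ "$step" -gt 0 ] || return 1

    echo "Executable = counter"
    echo "Log = counter.log"      # assumed log file name

    n=1
    lo=$start
    while [ "$lo" -lt "$stop" ]; do
        hi=$((lo + step))
        # Clamp the last sub-range so it never runs past STOP.
        if [ "$hi" -gt "$stop" ]; then
            hi=$stop
        fi
        echo "Arguments = $lo $hi"
        echo "Output = counter$n.out"   # assumed output file name
        echo "Queue 1"
        n=$((n + 1))
        lo=$hi
    done
}

# Two jobs: one for the range 0 30 and one for the range 30 60.
make_submit 0 60 30
```

Redirecting the output to a file, for example make_submit 0 60 30 > counter.cmd, gives a file that could then be passed to condor_submit. Each job gets its own Output file so that the results of the sub-ranges do not overwrite one another.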