tutorial7 - Department of Computing

Condor
Tugba Taskaya-Temizel
6 March 2006
What is Condor Technology?
Condor is a high-throughput distributed batch computing system that
provides job management, scheduling policies, a priority scheme, and
resource monitoring and management (Thain, et al. 2005). It offers the
following features:
1. ClassAds: A framework for matching the available resources against the
requirements stated in job descriptions.
2. Job Checkpoint and Migration: Some applications can be resumed from their
last saved state using a checkpoint file. This provides a means of fault
tolerance. For example, if a machine fails, the job can be safely transferred
to another machine and continue from the checkpoint.
3. Remote System Calls: Condor supports I/O-related jobs (processes,
executables) that read input files and generate output files. The job's file
I/O is redirected to the machine from which it was submitted, so you do not
need to transfer the files to the remote machine yourself or have a shared
file system.
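The matching idea behind ClassAds can be illustrated with a small Python sketch. This is only a toy model, not the real ClassAd language: the machine entries below are hypothetical, the names `MACHINES`, `job_requirements` and `match` are made up for this example, and real matchmaking is two-sided (machines can state requirements about jobs as well).

```python
# Toy matchmaking sketch; real ClassAds use their own expression language.

# Hypothetical machine descriptions, loosely modelled on condor_status columns.
MACHINES = [
    {"Name": "vm1@concorde01", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 251, "State": "Owner"},
    {"Name": "vm3@concorde01", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 251, "State": "Unclaimed"},
    {"Name": "vm1@nt01", "OpSys": "WINNT51", "Arch": "INTEL", "Memory": 512, "State": "Unclaimed"},
    {"Name": "vm1@small01", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 16, "State": "Unclaimed"},
]


def job_requirements(ad):
    """Python analogue of: Memory >= 20 && Arch == "INTEL" && OpSys == "LINUX"."""
    return ad["Memory"] >= 20 and ad["Arch"] == "INTEL" and ad["OpSys"] == "LINUX"


def match(machines, requirements):
    """Return the names of unclaimed machines that satisfy the job's requirements."""
    return [m["Name"] for m in machines
            if m["State"] == "Unclaimed" and requirements(m)]


if __name__ == "__main__":
    print(match(MACHINES, job_requirements))
```

Only the machine that is both unclaimed and satisfies every condition is returned; the others are filtered out by state, operating system, or memory.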
Condor in our Department
Our departmental Condor pool contains 30 machines: 19 run Linux
(concorde01-concorde06, tornado01-13) and 11 run Windows NT. Together they provide
107 CPUs. To connect to one of the Condor machines, type:
telnet concorde03.mcs.surrey.ac.uk


In order to inspect the Condor pool, you can run:
condor_status
The output will be:
Name          OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime
vm1@concorde0 LINUX  INTEL  Owner      Idle      1.000   251  0+00:30:56
vm2@concorde0 LINUX  INTEL  Owner      Idle      1.000   251  0+21:45:42
vm3@concorde0 LINUX  INTEL  Claimed    Busy      0.320   251  0+00:14:40
vm4@concorde0 LINUX  INTEL  Claimed    Busy      0.650   251  0+09:48:20
vm1@concorde0 LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:13
vm2@concorde0 LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:05

In order to see the available machines, you can call:
condor_status -available
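If you want to summarise the pool from a script, output of this shape can be split into columns. The following is a minimal sketch, assuming whitespace-separated columns like those above; `SAMPLE`, `parse_status` and `count_by_state` are names invented for this example, and the exact condor_status layout may differ between Condor versions.

```python
# Sample text in the same column layout as the condor_status output above.
SAMPLE = """\
Name          OpSys Arch  State     Activity LoadAv Mem ActvtyTime
vm1@concorde0 LINUX INTEL Owner     Idle     1.000  251 0+00:30:56
vm3@concorde0 LINUX INTEL Claimed   Busy     0.320  251 0+00:14:40
vm1@concorde0 LINUX INTEL Unclaimed Idle     0.000  251 0+02:50:13
"""


def parse_status(text):
    """Turn the table into a list of dicts keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def count_by_state(rows):
    """Count how many virtual machines are in each State."""
    counts = {}
    for row in rows:
        counts[row["State"]] = counts.get(row["State"], 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_state(parse_status(SAMPLE)))
```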
How to Run a Job in Condor


Jobs that run in the Condor environment are
background (batch) jobs, so they cannot accept
interactive input from the user while they run.
Choose a universe that suits the type of your
application.

A universe is an execution environment in Condor.
Condor provides several universes, including ’Standard’,
’Vanilla’, ’PVM’, ’MPI’, ’Globus’, ’Java’ and ’Scheduler’.
The universe is specified in the job's submit description file.
Standard Universe


The standard universe provides a checkpoint
mechanism that saves the current state of a job.
This is useful when a long-running job has to
migrate to another machine, because it can continue
from the checkpoint instead of starting again.
Create a working directory such as
$HOME/gt3/samples/condor.

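Create a file called counter.c. Write the following lines to the file:
#include <stdio.h>
#include <stdlib.h>   /* atoi */

/* Print the integers from argv[1] (inclusive) to argv[2] (exclusive). */
int main(int argc, char *argv[]) {
    int i;
    if (argc < 3) {
        fprintf(stderr, "usage: %s start stop\n", argv[0]);
        return 1;
    }
    for (i = atoi(argv[1]); i < atoi(argv[2]); i++) {
        printf("%d\n", i);
    }
    return 0;
}
Compile the file and link it against the Condor libraries:
condor_compile cc counter.c -o counter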
Once the program is linked, create a submit description file to execute
the job. Create a file with the extension ’cmd’, such as
’standardunitest.cmd’. Then, write the following lines:
Universe = standard
Executable = counter
Arguments = 1 30
Output = counterc1.out
Log = counterc1.log
Queue 1

Arguments = 30 60
Output = counterc2.out
Queue 1
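When the range has to be split into more than two pieces, the repeated Arguments/Output/Queue stanzas can be generated rather than typed by hand. The following Python sketch is one possible way to do it; the function name `build_submit` and its parameters are made up for this example, and it only produces the same keywords used in the file above.

```python
def build_submit(executable, boundaries, prefix):
    """Return submit-file text with one queued job per consecutive pair of boundaries.

    boundaries=[1, 30, 60] yields the jobs "1 30" and "30 60", as in the example above.
    """
    lines = [f"Executable = {executable}", f"Log = {prefix}1.log", ""]
    pairs = zip(boundaries, boundaries[1:])
    for n, (start, stop) in enumerate(pairs, start=1):
        lines += [
            f"Arguments = {start} {stop}",
            f"Output = {prefix}{n}.out",
            "Queue 1",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Reproduces the two-job file shown above.
    print(build_submit("counter", [1, 30, 60], "counterc"))
```

Passing a longer list such as [1, 20, 40, 60] would produce three jobs with outputs counterc1.out to counterc3.out.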
To restrict the job to particular machines, add a Requirements expression to the file. The following version of
’standardunitest.cmd’ limits execution to the virtual machines on concorde01 to concorde06:

Universe = standard
Executable = counter
Requirements = (Name == "vm1@concorde01.mcs.surrey.ac.uk" || Name == "vm2@concorde01.mcs.surrey.ac.uk" || \
 Name == "vm3@concorde01.mcs.surrey.ac.uk" || Name == "vm4@concorde01.mcs.surrey.ac.uk" || \
 Name == "vm1@concorde02.mcs.surrey.ac.uk" || Name == "vm2@concorde02.mcs.surrey.ac.uk" || \
 Name == "vm3@concorde02.mcs.surrey.ac.uk" || Name == "vm4@concorde02.mcs.surrey.ac.uk" || \
 Name == "vm1@concorde03.mcs.surrey.ac.uk" || Name == "vm2@concorde03.mcs.surrey.ac.uk" || \
 Name == "vm3@concorde03.mcs.surrey.ac.uk" || Name == "vm4@concorde03.mcs.surrey.ac.uk" || \
 Name == "vm1@concorde04.mcs.surrey.ac.uk" || Name == "vm2@concorde04.mcs.surrey.ac.uk" || \
 Name == "vm3@concorde04.mcs.surrey.ac.uk" || Name == "vm4@concorde04.mcs.surrey.ac.uk" || \
 Name == "vm1@concorde05.mcs.surrey.ac.uk" || Name == "vm2@concorde05.mcs.surrey.ac.uk" || \
 Name == "vm3@concorde05.mcs.surrey.ac.uk" || Name == "vm4@concorde05.mcs.surrey.ac.uk" || \
 Name == "vm1@concorde06.mcs.surrey.ac.uk" || Name == "vm2@concorde06.mcs.surrey.ac.uk" || \
 Name == "vm3@concorde06.mcs.surrey.ac.uk" || Name == "vm4@concorde06.mcs.surrey.ac.uk")
Arguments = 1 30
Output = counterc1.out
Log = counterc1.log
Queue 1
Arguments = 30 60
Output = counterc2.out
Queue 1
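A Requirements expression that lists every virtual machine by name is long and easy to mistype. As a rough sketch, it could be generated with a few lines of Python; `build_requirements` and its parameters are hypothetical names for this example, and it assumes four virtual machines (vm1 to vm4) per host, as in the file above.

```python
def build_requirements(hosts, slots_per_host, domain):
    """Build a Requirements line that accepts any listed virtual machine."""
    names = [
        f'Name == "vm{slot}@{host}.{domain}"'
        for host in hosts
        for slot in range(1, slots_per_host + 1)
    ]
    return "Requirements = (" + " || ".join(names) + ")"


if __name__ == "__main__":
    # concorde01 ... concorde06, four virtual machines each.
    hosts = [f"concorde{n:02d}" for n in range(1, 7)]
    print(build_requirements(hosts, 4, "mcs.surrey.ac.uk"))
```

The result is a single line that can be pasted into the submit description file.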

To submit the job to the Condor pool, run the following command:
condor_submit standardunitest.cmd

The output will be:
Submitting job(s)..
Logging submit event(s)..
2 job(s) submitted to cluster 92.


To inspect your job, run:
condor_q
This lists your jobs in the queue. At first, you should expect to see that they are idle (state I):
-- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
92.0 css1tt 3/3 14:52 0+00:00:00 I 0 3.4 counter 1 30
92.1 css1tt 3/3 14:52 0+00:00:00 I 0 3.4 counter 30 60
2 jobs; 2 idle, 0 running, 0 held
After a couple of minutes, run ’condor_q’ again. Once both jobs have finished and left the queue, you
should see the following:

-- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
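If you want a script to tell when the queue has emptied, the summary line at the end of the condor_q output can be parsed. The following Python sketch assumes the summary has exactly the form shown above ("N jobs; N idle, N running, N held"); the names `parse_summary` and `queue_is_empty` are invented for this example, and other Condor versions may print a different summary.

```python
import re

# Matches the summary line printed at the end of the condor_q output above.
SUMMARY_RE = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(text):
    """Extract the job counts from condor_q output, or None if no summary is found."""
    m = SUMMARY_RE.search(text)
    if m is None:
        return None
    total, idle, running, held = map(int, m.groups())
    return {"jobs": total, "idle": idle, "running": running, "held": held}


def queue_is_empty(text):
    """True when the summary reports zero jobs left in the queue."""
    summary = parse_summary(text)
    return summary is not None and summary["jobs"] == 0


if __name__ == "__main__":
    print(parse_summary("2 jobs; 2 idle, 0 running, 0 held"))
    print(queue_is_empty("0 jobs; 0 idle, 0 running, 0 held"))
```

An empty queue only shows that the jobs have left it; check the output and log files to confirm that they completed successfully.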
Java Universe
A Java program can be run on any machine in the pool that has a JVM. Unlike in the standard universe, the jobs cannot be
checkpointed and moved to another machine. However, in the case of a failure, the jobs can be restarted from the beginning
on another machine.

Create a Java file called ’Counter.java’ and write the following lines to the file:
public class Counter {
    // Print the integers from args[0] (inclusive) to args[1] (exclusive).
    public static void main(String[] args) {
        int startt = Integer.parseInt(args[0]);
        int stopp = Integer.parseInt(args[1]);
        for (int i = startt; i < stopp; i++) {
            System.out.println(i);
        }
    }
}

Then, compile the program:
javac Counter.java

Next, create a submit description file. As before, give the file the extension
’cmd’, for example ’javaunitest.cmd’. Add the following lines to the file:
universe = java
executable= Counter.class
log= counter.log
arguments = Counter 1 30
output = counter1.output
error = counter1.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
arguments = Counter 30 60
output = counter2.output
error = counter2.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
To submit the jobs to Condor, run
condor_submit javaunitest.cmd
To inspect their status, type:
condor_q
The output of the command will look like:
-- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
98.0 css1tt 3/3 15:58 0+00:00:00 I 0 0.0 java Counter 1 30
98.1 css1tt 3/3 15:58 0+00:00:00 I 0 0.0 java Counter 30 60
2 jobs; 2 idle, 0 running, 0 held
If you notice a problem with a job, remove it from the Condor queue.
To do so, call condor_rm with the job's ID as shown by condor_q (for example, 98.0):
condor_rm ID

Vanilla Universe

Some applications, such as shell scripts, cannot be run in the standard or Java
universe; they run in the vanilla universe instead. Shell scripts can in turn be
used to call external applications such as Matlab.

Create a file, called ’count.m’ and write the following lines to the file:
function count(startt, stopp)
for i=startt:stopp-1
i
end
To run the Matlab program, we have to start the Matlab application and then
call our function from it. To do this, we write a wrapper script. Create a file
called ’runmatlab.sh’:

#!/bin/sh
echo "Number of arguments: $#"
matlab -r "addpath /user/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial/; count($1,$2);quit;"
As a final step, prepare the submit description file. Create a file with the extension ’cmd’, such as
’matlabtest.cmd’:
Universe = vanilla
executable = /a/filer2/home/filer2/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial/runmatlab.sh
Initialdir = /a/filer2/home/filer2/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial
Requirements = Memory>=20 && Arch == "INTEL" && OpSys == "LINUX"
Getenv = True
Log = matlabpro.log
# main matlab file to execute
Arguments = 1 30
Output = matlab1.out
Error = matlab1.err
transfer_input_files = count.m
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Queue 1
# main matlab file to execute
Arguments = 30 60
Output = matlab2.out
Error = matlab2.err
transfer_input_files = count.m
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Queue 1


To submit the job, run:
condor_submit matlabtest.cmd
To see the output of your job, call:
more matlab1.out
more matlab2.out
EXERCISE:
Submit the matlab and java counter programs together
using the same description file. Both programs should be
specified in the vanilla universe.