Worked Condor Example – Simple MATLAB Add

Description

This worked example shows how to take a MATLAB function and compile it into an executable, which can then be run through Condor either as a single job or as a parameter sweep job.

Features

In this example you will see how to:
• Compile and use a MATLAB function
• Write a Condor submission script
• Provide arguments to Condor jobs
• Produce a parameter sweep of jobs with different command-line arguments

addNum.m and addNum.exe

addNum.m is a short MATLAB function which takes two numbers as parameters and prints their sum.

% Function to sum two numbers
% Takes two numbers and prints the result
function addNum(a, b)
% Command-line arguments arrive as strings, so convert them to numbers.
% The comma after str2num (instead of a semicolon) echoes the converted
% value, which is why "a =" and "b =" appear in the output below.
if (ischar(a)), a = str2num(a), end;
if (ischar(b)), b = str2num(b), end;
fprintf(1, '%d + %d = %d\n', a, b, a+b);
exit;

You need to compile this code in MATLAB to make an executable:

mcc -m addNum.m

This should give you a binary executable, addNum.exe. Test this binary before running it through Condor, because it is almost impossible to debug your code once it is running on Condor.

> addNum.exe 1 2
a =
1
b =
2
1 + 2 = 3

Once your program is running successfully, you are ready to run it on Condor.

Getting your files onto Condor1

The main server for submitting jobs to Condor is condor1.ncl.ac.uk. You will require a University Unix account to access this computer.

Note: condor1.ncl.ac.uk is only accessible from inside the University. You can access it remotely using SSH tunneling or RAS, but that is not discussed here.

You will need an SSH file transfer program to copy your files onto condor1. On Windows you can use Secure Shell from a standard Windows install, or PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/) on other Windows computers. On Linux or Mac OS X you can use the scp command. Full details on how to use these commands can be found elsewhere. You need to copy addNum.exe onto condor1.

Running your MATLAB function through Condor

Once you have addNum.exe on condor1, you need to provide a submission file to run it through Condor. Create the file submit.condor using an editor (you can use nano):

# Run the command
# addNum.exe 1 2
# on a Windows 7 computer
# Name of the executable to run
Executable = addNum.exe
# The arguments to give to the executable
Arguments = 1 2
# The Condor universe to run the job in. By default this should be
# vanilla
Universe = vanilla
# The requirements for the computer you want
# OpSys == "WINNT61" will require a Windows 7 computer
# Arch == "INTEL" will require an Intel based computer
Requirements = OpSys == "WINNT61" && Arch == "INTEL"
# Set Priority high to allow the system to wake up computers
Priority=high
# File to write any errors to
Error = m_condor.err
# File to write program output to
Output = m_condor.out
# File for Condor to log information to
Log = m_condor.log
# Set this unless all the files you require are already on
# the computer
should_transfer_files = YES
# Files are sent back when job finishes
when_to_transfer_output = ON_EXIT
# Tell Condor to submit this job
Queue
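
Before submitting, it can be worth confirming that the file named on the Executable line really is present under exactly that name, since file names on condor1 are case sensitive (AddNum.exe and addNum.exe are different files there). The following is a minimal sketch of such a check; the helper names submit_executable and check_executable are hypothetical and are not part of Condor.

```shell
#!/bin/sh
# Hypothetical helper (not a Condor command): print the value of the
# "Executable = ..." line in a submit file. Comment lines are ignored
# because their first field is not just the word "executable".
submit_executable() {
    awk -F= 'tolower($1) ~ /^[ \t]*executable[ \t]*$/ {
        v = $2
        gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim surrounding whitespace
        print v
        exit
    }' "$1"
}

# Hypothetical helper: succeed only if that executable exists, relative
# to the directory this is run from.
check_executable() {
    exe=$(submit_executable "$1")
    if [ -n "$exe" ] && [ -f "$exe" ]; then
        echo "found $exe"
    else
        echo "missing executable: $exe" >&2
        return 1
    fi
}
```

Run from the directory that holds the submit file, `check_executable submit.condor` reports whether addNum.exe is in place.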

You are now ready to submit your job:

[nasm3@condor1 ~/simpleAdd]$ condor_submit submit.condor
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2620.
[nasm3@condor1 ~/simpleAdd]$

This submits your job, which is added to the Condor “queue” where it waits for an appropriate computer to run on. You can look at the current Condor queue to see the status of your job:

[nasm3@condor1 ~/simpleAdd]$ condor_q
-- Quill: quill@condor1.ncl.ac.uk : <localhost:5432> : quill : 2011-04-14 10:22:10+01
 ID      OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
2620.0   nasm3  4/14 10:20   0+00:00:00 I  0   0.2  addNum.exe 1 2

1 jobs; 1 idle, 0 running, 0 held
[nasm3@condor1 ~/simpleAdd]$
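
If you only want the job counts rather than the whole listing, the summary line can be picked out of the condor_q output. This is a rough sketch that assumes the summary line looks exactly like the one shown above ("1 jobs; 1 idle, 0 running, 0 held"); other Condor versions may print it differently. The function name queue_counts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read condor_q output on standard input and print
# the counts from its summary line. Assumes the summary format shown in
# this document; prints nothing if no such line is found.
queue_counts() {
    sed -n 's/^\([0-9][0-9]*\) jobs; \([0-9][0-9]*\) idle, \([0-9][0-9]*\) running, \([0-9][0-9]*\) held.*/total=\1 idle=\2 running=\3 held=\4/p'
}
```

On condor1 this would be used as `condor_q | queue_counts`.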

Note: Sometimes it takes a few seconds for jobs to appear in the queue. If your job doesn’t appear, run condor_q again. If you get a lot of output from this command, you can restrict it by putting your username after the condor_q command (for example, condor_q nasm3).

After a while your Condor job will run. Once it has finished running it will disappear from the condor_q output. To look at your output file use the more command:

[nasm3@condor1 ~/simpleAdd]$ more m_condor.out
a =
1
b =
2
1 + 2 = 3
[nasm3@condor1 ~/simpleAdd]$

You should also check that the error file is empty:

[nasm3@condor1 ~/simpleAdd]$ more m_condor.err
[nasm3@condor1 ~/simpleAdd]$
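
The two manual checks above (output present, error file empty) can also be scripted. The sketch below is only an illustration: job_ok is a hypothetical helper, and it assumes the last line of the output file is the "x + y = z" line printed by addNum. Because the program runs on Windows, the output may have Windows line endings, so carriage returns are stripped before matching.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a finished job looks healthy, i.e. its
# error file is empty and its output file ends with an "x + y = z" line.
job_ok() {
    out=$1
    err=$2
    # -s is true when the file exists and is not empty.
    if [ -s "$err" ]; then
        echo "errors reported in $err" >&2
        return 1
    fi
    # Drop any carriage returns, then check the shape of the last line.
    tail -n 1 "$out" | tr -d '\r' | grep -Eq '^-?[0-9]+ \+ -?[0-9]+ = -?[0-9]+$'
}
```

For the single job above this would be called as `job_ok m_condor.out m_condor.err`.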

Submitting a parameter sweep of jobs to Condor

Condor allows you to submit multiple jobs at the same time, and each job can work differently from the others. Jobs are numbered 0 to n-1, where n is the number of jobs that you ask Condor to run. You tell Condor how many jobs to run by specifying a number after the Queue statement in the submit script (Queue 10 in the submission script below).

Condor provides the $(Process) variable, which holds the job number of each job. So, for example, in the submission script below the Arguments line will be “100 0” for the first job and “100 1” for the second job. You also need to use the $(Process) variable in your output and error filenames; otherwise every job writes to the same files and you will only see the output of the last one to run.

The following submission script provides a simple parameter sweep where the addNum function is run ten times, with each instance adding 100 to its $(Process) number. Create this as a file called submit.multiple.condor:

# Run the command
# addNum.exe 100 $(Process)
# ten times on Windows 7 computers
# Name of the executable to run
Executable = addNum.exe
# The arguments to the program. Note that the second argument is the
# Process number provided by Condor. Thus each job gets a different
# second parameter.
Arguments = 100 $(Process)
# The Condor universe to run the job in. By default this should be
# vanilla
Universe = vanilla
# The requirements for the computer you want
# OpSys == "WINNT61" will require a Windows 7 computer
# Arch == "INTEL" will require an Intel based computer
Requirements = OpSys == "WINNT61" && Arch == "INTEL"
# Set Priority high to allow the system to wake up computers
Priority=high
# File to write any errors to. Note each job will get a
# different file
Error = m_condor.$(Process).err
# File to write any output to. Note each job will get a
# different file
Output = m_condor.$(Process).out
# File for Condor to log information to. Note each job will get a
# different file
Log = m_condor.$(Process).log
# Set this unless all the files you require are already on
# the computer
should_transfer_files = YES
# Files are sent back when job finishes
when_to_transfer_output = ON_EXIT
# Tell Condor to submit this job 10 times
Queue 10
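
To make the $(Process) substitution concrete, the loop below prints what each job would receive for a given Queue count. It does not talk to Condor at all; it only mimics the substitution described above, and the function name expand_sweep is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration: list the arguments and output file that each
# job in "Queue N" gets once $(Process) is replaced by the job number.
expand_sweep() {
    n=$1
    p=0
    # Jobs are numbered 0 to n-1.
    while [ "$p" -lt "$n" ]; do
        echo "job $p: Arguments = 100 $p, Output = m_condor.$p.out"
        p=$((p + 1))
    done
}

expand_sweep 3
```

With a count of 3 this lists jobs 0, 1 and 2, each with its own argument pair and output file; with 10 it matches the ten jobs created by the script above.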

Once you have this file ready you can submit your parameter sweep to Condor using the condor_submit command:

[nasm3@condor1 ~/simpleAdd]$ condor_submit submit.multiple.condor
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 2623.
[nasm3@condor1 ~/simpleAdd]$

Note that it now tells you that 10 jobs have been submitted. You can check this, and that each job has a different set of arguments, using the condor_q command:

[nasm3@condor1 ~/simpleAdd]$ condor_q
-- Quill: quill@condor1.ncl.ac.uk : <localhost:5432> : quill : 2011-04-14 14:29:18+01
 ID      OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
2623.0   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 0
2623.1   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 1
2623.2   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 2
2623.3   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 3
2623.4   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 4
2623.5   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 5
2623.6   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 6
2623.7   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 7
2623.8   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 8
2623.9   nasm3  4/14 14:25   0+00:00:00 I  0   0.2  addNum.exe 100 9

10 jobs; 10 idle, 0 running, 0 held
[nasm3@condor1 ~/simpleAdd]$

Once each of these jobs has completed (it has disappeared from the condor_q output) you can look at its output file to see the result. For example, job 5, which ran “addNum.exe 100 5”, gives the result:

[nasm3@condor1 ~/simpleAdd]$ more m_condor.5.out
a =
100
b =
5
100 + 5 = 105
[nasm3@condor1 ~/simpleAdd]$
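
Rather than opening each of the ten output files in turn, the results can be gathered in one pass. The following is a sketch only: collect_sweep is a hypothetical helper, and it assumes the m_condor.N.out naming used in the submission script and that the final line of each file holds the sum.

```shell
#!/bin/sh
# Hypothetical helper: print the final line of every sweep output file in
# job order, and report any job whose output file is missing.
# Usage: collect_sweep DIRECTORY NUMBER_OF_JOBS
collect_sweep() {
    dir=$1
    n=$2
    p=0
    status=0
    while [ "$p" -lt "$n" ]; do
        f="$dir/m_condor.$p.out"
        if [ -f "$f" ]; then
            # Strip carriage returns left by Windows line endings.
            printf 'job %s: %s\n' "$p" "$(tail -n 1 "$f" | tr -d '\r')"
        else
            echo "job $p: no output file" >&2
            status=1
        fi
        p=$((p + 1))
    done
    return "$status"
}
```

For the sweep above, `collect_sweep . 10` run in ~/simpleAdd would print one line per job, and it returns a non-zero status if any job has not yet produced its file.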

You can now use Condor with other executables and/or MATLAB functions to run parameter sweeps in which each run is controlled by a number passed to the program as a command-line argument.