LSF for Users
ZORRO HPC

What is LSF?

LSF (Load Sharing Facility) is a batch management subsystem for multi-host, multi-vendor complexes, with the capability to manage computing resources across multiple platforms.

LSF runs on the AU-HPC cluster.
-----------------------------------------------------------------------------
Documentation: /app/docs/LSF/7.0/*.pdf
Hardware description: http://www.american.edu/hpc
At a command line enter: man lsfintro

To be able to access LSF

This has been added to your login processing:

    . /opt/lsf/conf/profile.lsf        (sh users)
or
    source /opt/lsf/conf/cshrc.lsf     (csh users)

These commands are executed before you receive a command prompt. There is no need for you to add anything to your login files in order to use LSF.

These commands define the LSF environment:
LSF_SERVERDIR, LSF_BINDIR, LSF_LIBDIR, XLSF_UIDDIR, LSF_ENVDIR, PATH, MANPATH
------------------------------------------------------------------
Check: env | grep -i lsf

Essential Commands for Users

• bhosts
• bqueues
• bsub
• bjobs
• bhist
• bpeek
• bmod
• bbot/btop
• bswitch
• bstop/bresume
• bkill

Essential Commands Purpose

• bhosts - information about available hosts (see also lshosts)
• bqueues - information about available queues
• bsub - submit jobs to batch subsystem
• bjobs - list jobs in the batch subsystem
• bhist - displays historical information about user's jobs
• bpeek - displays stdout and stderr of user's unfinished job
• bmod - modifies job submission options for user's job
• bbot/btop - moves a pending job relative to user's last/first job in a queue
• bswitch - switches user's unfinished jobs from one queue to another
• bstop/bresume - suspends/resumes user's unfinished jobs
• bkill - kill, suspend or resume user's jobs

Essential Commands: bhosts

bhosts [-w | -l] [-R "res_req"] [host_name | host_group]
    Displays information about hosts/platforms

lshosts [-w | -l] [-R "res_req"] [host_name | cluster_name]
lshosts -s [shared_resource_name ...]
    Displays hosts and their static resource information

[root@hpchead ~]$ lshosts
HOST_NAME type   model    cpuf ncpus maxmem maxswp server RESOURCES
hpchead   X86_64 Intel_EM 60.0    12 24097M  2015M    Yes (mpich2 mg)
node15    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node14    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node13    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node12    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node11    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node10    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node09    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node08    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node07    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node06    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node05    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node04    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node03    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node02    X86_64 Intel_EM 60.0    12 24097M  2000M    Yes (cuda mpich2)
node01    X86_64 Intel_EM 60.0    12 24094M  2000M    Yes (cuda mpich2)

Essential Commands: bqueues

bqueues [-w | -l | -r] [-m host_name | -m all] [-u user_name | -u all] [queue_name ...]
    Displays information about queues. By default, returns the following information about all queues: queue name, queue priority, queue status, job slot statistics, and job state statistics.
[root@hpchead]$ bqueues
QUEUE_NAME      PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
dynamic_provisi   60  Open:Active    -     -     -     -      0     0    0     0
owners            43  Open:Active    -     -     -     -      0     0    0     0
priority          43  Open:Active    -     -     -     -      0     0    0     0
night             40  Open:Inact     -     -     -     -      0     0    0     0
chkpnt_rerun_qu   40  Open:Active    -     -     -     -      0     0    0     0
short             35  Open:Active    -     -     -     -      0     0    0     0
license           33  Open:Active    -     -     -     -      0     0    0     0
normal            30  Open:Active    -     -     -     -      0     0    0     0
hpc_linux         30  Open:Active    -     -     -     -      0     0    0     0
hpc_linux_tv      30  Open:Active    -     -     -     -      0     0    0     0
idle              20  Open:Active    -     -     -     -      0     0    0     0

Essential Commands: bsub

bsub [options] command [cmd_args]
    Submits a job for batch execution

OPTION LIST

-B
    Sends mail at dispatch and initiation times
-H
    Holds job in PSUSP and waits for bresume
-I | -Ip | -Is
    Submits as batch interactive
-K
    Submits job and blocks the command line, with status updates, until the job finishes
-N
    Sends job report by e-mail (use only with -I | -Is | -Ip or -o)
-r
    Reruns job on another host if the execution host terminates
-x
    Exclusive execution mode
-a esub_parameters
    Specifies parallel job launcher (PJL) to be used
-b [[month:]day:]hour:minute
    Dispatch date/time
-C core_limit
    Limits size of core dumps (-C 0 recommended?)
-c [hours:]minutes[/host_name | /host_model]
    CPU time limit
-D data_limit
    Per-process data segment size limit
-e err_file
    File to use as stderr
-E "pre_exec_command [arguments ...]"
    Pre-exec command invoked before batch stream command processing
-ext[sched] "external_scheduler_options"
    N/A
-f "local_file operator [remote_file]" ...
    Files to be copied between local/remote systems
-F file_limit
    Per-process file size limit
-g job_group_name
    Submits job to a job group
-G user_group
    Associates job with a specific group
-i input_file | -is input_file
    Specifies stdin for job
-J job_name | -J "job_name[index_list]%job_slot_limit"
    Specifies job name
-k "checkpoint_dir [checkpoint_period][method=method_name]"
    Makes a job checkpointable and specifies checkpoint directory
-L login_shell
    Uses login_shell for runtime environment
-m "host_name[@cluster_name][+[pref_level]] | host_group[+[pref_level]] ..."
    Selects and ranks hosts/groups on which to run
-M mem_limit
    Sets per-process memory limit
-n min_proc[,max_proc]
    Sets min/max number of processors required to run job
-o out_file
    Specifies stdout
-P project_name
    Specifies project name
-p process_limit
    Limits total number of processes
-q queue_name
    Specifies queue for job (default provided by system)
-R "res_req"
    Specifies resource requirements
-sla service_class_name
    Specifies service class for job
-sp priority
    Specifies priority among user's jobs
-S stack_limit
    Sets per-process stack limit
-t [[month:]day:]hour:minute
    Specifies job termination date
-T thread_limit
    Sets limit on number of concurrent threads
-U reservation_ID
    Uses a reservation created with the brsvadd command
-u mail_user
    Mail-to address
-v swap_limit
    Sets total process virtual memory limit
-w 'dependency_expression'
    Defines dependencies to be met before job initiation
-wa '[signal | command | CHKPNT]'
    Specifies action to be taken before job control step occurs
-wt '[hours:]minutes'
    Specifies time interval before job control occurs to send warning signal
-W [hours:]minutes[/host_name | /host_model]
    Specifies run time limit for job
-Zs
    Spools command file and runs from there

The Importance of Being <

LSF usage differs from that of other job schedulers:

    bsub a.out
    bsub -n 2 a.out
    bsub myscript
    bsub -q queuename a.out
    bsub -i infile -o outfile -e errfile a.out
    bsub < myscript

With "bsub < myscript", LSF reads the script from standard input and honors its #BSUB directives; with "bsub myscript", the script is run as an ordinary command and its #BSUB lines are treated as comments.

LSF Job Submission

    bsub < jobfile

* By default, the job output is sent by mail.

Each LSF job runs in a queue. If you don't give LSF a queue name, your job will go to the default <normal> queue.

Each LSF job will be dispatched to a compute node. If you don't specify the node, LSF will choose one for you. To find the name of the execution host and the current status of the job, use the bjobs command:

[root@hpchead ~]$ bjobs 103
JOBID USER STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
103   User DONE normal hpchead   hpchead   hostname Jun  7 11:38

This job executed on hpchead, the same host from which it was submitted. Unless told otherwise, LSF will choose an execution host with the same architecture as the submission host. If more than one host meets that criterion, LSF will choose the most powerful host with the lightest load.

LSF Job Submission
LSF output/error logs

By default, LSF will send you email containing the standard output (stdout) and standard error (stderr) from your job, as well as some basic information about the execution of the job. If your program produces additional output files, they are separate and are not included in this email.

To save your job's output in a file instead of receiving it in email, use the -o option on the bsub command:

    bsub -o my_output < myjob

You can also put stdout and stderr in different files if you wish:

    bsub -o my_out -e my_err < myjob

To make it easier to keep track of the output from multiple runs of the same program, you can use the special %J variable in your file names.
LSF will substitute the job number for the %J variable:

    bsub -o out.%J -e err.%J < myjob

LSF Job Submission

Submit job at specific time:
To force your job to begin at a specific time, use the -b option on the bsub command:

    bsub -b 11:00 job01

* Tells LSF to start your job at 11:00 a.m. If the current time is after 11:00 a.m., the job will be held until the next day.

    bsub -b 2:15:23:15 job01

* Tells LSF to start the job at 11:15 p.m. on February 15.

Submit job to specific host:
If you want your job to run on a specific host, use the -m option:

    bsub -m node01 < myjob

Sample LSF script
Serial Job

    bsub < serial.lsf

#!/bin/bash
# enable your environment, which will use the .bashrc configuration in your home directory
#BSUB -L /bin/bash
# the name of your job as shown in the queue system
#BSUB -J FortranJob
# the following BSUB line specifies the queue that you will use
#BSUB -q normal
# the system output and error message output; %J is replaced by your job ID
#BSUB -o %J.out
#BSUB -e %J.err
# the number of CPUs that you request (Attention: each node has 2 CPUs)
#BSUB -n 1

# Fortran example
pgf90 -o samp_f -Mextend samp.f
./samp_f

# C example
pgcc -o samp_c samp.c
./samp_c

# C++ example
pgCC --no_auto_instantiation -o samp_cc samp.cc
./samp_cc

Sample LSF script
MPI Job

    bsub < mpi.lsf

#!/bin/bash
# enable your environment, which will use the .bashrc configuration in your home directory
#BSUB -L /bin/bash
# LSF batch script to run the test MPI code
#
#BSUB -P 93300070          # Project 93300070
#BSUB -a mpich_gm          # select the mpich-gm esub (parallel job launcher)
#BSUB -x                   # exclusive use of node (not shared)
#BSUB -n 2                 # number of total tasks
#BSUB -R "span[ptile=1]"   # run 1 task per node
#BSUB -J mpilsf.test       # job name
#BSUB -o mpilsf.out        # output filename
#BSUB -e mpilsf.err        # error filename
#BSUB -q normal            # queue

# Fortran example
mpif90 -o mpi_samp_f mpisamp.f
mpirun.lsf ./mpi_samp_f

# C example
mpicc -o mpi_samp_c mpisamp.c
mpirun.lsf ./mpi_samp_c

# C++ example
mpicxx -o mpi_samp_cc mpisamp.cc
mpirun.lsf ./mpi_samp_cc
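The job-file pattern used by the sample scripts above can be tried end to end with a much smaller job. The following is only a sketch: the file name hello.lsf, the job name HelloJob and the hostname payload are illustrative choices, not part of the cluster setup. It writes a job file, counts the #BSUB directives LSF would read from standard input, and submits the file only if bsub is actually on the PATH.

```shell
#!/bin/bash
# Sketch: write a minimal job file and submit it through stdin so that LSF
# reads the #BSUB directives. File name, job name and payload are hypothetical.

jobfile="hello.lsf"

# The quoted 'EOF' keeps %J and everything else in the job file literal.
cat > "$jobfile" <<'EOF'
#!/bin/bash
#BSUB -J HelloJob
#BSUB -q normal
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -n 1
hostname
EOF

# Count the directive lines that LSF parses when the file arrives on stdin.
directives=$(grep -c '^#BSUB' "$jobfile")
echo "$jobfile contains $directives #BSUB directives"

if command -v bsub >/dev/null 2>&1; then
    bsub < "$jobfile"    # redirect, so the #BSUB lines are honored
else
    echo "bsub not found; on the cluster run: bsub < $jobfile"
fi
```

On a machine without LSF the script only creates the file and prints the command to run, which makes it safe to test before logging in to the cluster.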
Sample LSF script
Matlab Job

    bsub < matlab.lsf

#!/bin/bash
# enable your environment, which will use the .bashrc configuration in your home directory
#BSUB -L /bin/bash
# the name of your job as shown in the queue system
#BSUB -J MatlabJob
# the following BSUB line specifies the queue that you will use
#BSUB -q normal
# the system output and error message output; %J is replaced by your job ID
#BSUB -o %J.out
#BSUB -e %J.err
# the number of CPUs that you request (Attention: each node has 2 CPUs)
#BSUB -n 1
# send an email notification when the job finishes
#BSUB -u user@american.edu
#BSUB -N

# enter your working directory
cd /home/username/matlab

# your matlab code
matlab -nodisplay -r myplot

LSF - Running Matlab in batch

To submit a batch Matlab job, first prepare a file with your Matlab commands, say "program_file.m". Then issue the command:

    bsub -q normal matlab -nodisplay -nojvm -nosplash -r program_file -logfile output_file.txt

NB: if you intend to use Java programs, do not include the -nojvm flag. Note that the suffix ".m" is omitted from the command file name. This submits a batch job to the batch queue, taking input from the file program_file.m and placing text output in output_file.txt.

LSF - Running Mathematica in batch

To submit a batch Mathematica job, first prepare a file with your Mathematica commands, say "test.m". Then issue the command:

    bsub -q normal "math < test.m > test-out.txt"

* Note that the suffix ".m" is included in the command file name. This submits a batch job to the batch queue, taking input from the file test.m and placing text output in test-out.txt.

Saving graphical output is somewhat trickier.
To illustrate the simplest approach, here is a sample Mathematica job:

    AppendTo[$Echo, "stdout"]
    3+5
    Integrate[Exp[-x^2],{x,-Infinity,Infinity}]
    FactorInteger[120]
    sc = Plot[{Sin[x], Cos[x]}, {x, 0, 2*Pi},
      PlotStyle -> {{RGBColor[1, 0, 0], Thickness[0.01]},
                    {RGBColor[0, 1, 0], Thickness[0.01]}}]
    Export["sc.m",sc,"TEXT"]
    abc=Table[Plot[x^n,{x,0,1}], {n, 1, 3}]
    Do[Export["abc"<>ToString[n]<>".m",abc[[n]],"TEXT"],{n,1,3}]
    7+9
    Quit

If you are connecting to HPCHEAD.american.edu from Linux or a Mac with X11 installed, simply use ssh -Y. If you are connecting from Windows, you must use Cygwin, Xming or X-Win32 to run an X server in order to do the same.

LSF - Running R in batch

To submit a batch R job, first prepare a file with your R commands, say "program_file.R". Then issue the command:

    bsub -q normal R CMD BATCH program_file.R output_file.txt

This command submits a batch job to the batch queue, taking input from the file program_file.R and placing text output in output_file.txt.

Graphical output is saved to a PDF file via the "pdf" command within R, for example:

    pdf("graphs.pdf")       # create graphical output file
    X=rnorm(100)            # generate 100 N(0,1) variates
    Y=rexp(100)             # generate 100 Exp(1) variates
    c(mean(X),mean(Y))      # mean of both samples
    hist(X)                 # plot N(0,1) histogram
    hist(Y)                 # plot Exp(1) histogram
    dev.off()               # close the file

Both histograms are saved to the same PDF file (one graph per page).
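When several R scripts need to be submitted, it can help to generate the bsub command line rather than type it each time. The sketch below is one possible way to do that; r_batch_cmd is a hypothetical helper (not part of LSF or R), and the "_output.txt" naming and the added "-o %J.out" option are assumptions made for illustration. It only prints the command, so nothing is submitted by running it.

```shell
#!/bin/bash
# Sketch: build the bsub command line for an R batch job (dry run only).
# r_batch_cmd is a hypothetical helper, not an LSF or R command.

r_batch_cmd() {
    local script="$1"                      # e.g. program_file.R
    local queue="${2:-normal}"             # queue name, "normal" if omitted
    local out="${script%.R}_output.txt"    # program_file.R -> program_file_output.txt
    # %J is passed as a plain argument; LSF replaces it with the job number.
    printf 'bsub -q %s -o %s R CMD BATCH %s %s\n' "$queue" '%J.out' "$script" "$out"
}

# Print the command for one script in the default queue.
cmd=$(r_batch_cmd program_file.R)
echo "$cmd"

# Print the command for the same script in another queue.
r_batch_cmd program_file.R short
```

The printed line can be pasted at the prompt on the cluster once it looks right.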
Essential Commands: bjobs

bjobs - displays information about LSF jobs

    bjobs -u user_name
    bjobs -u all
    bjobs -l
    bjobs -r
    bjobs -s
    bjobs -q queue_name

Essential Commands: bhist

bhist - displays historical information about jobs

    bhist -J job_name
    bhist -C start_time,end_time
    bhist -D start_time,end_time
    bhist -S start_time,end_time
    bhist -T start_time,end_time

Essential Commands: bpeek

bpeek - displays stdout and stderr of user's selected, unfinished job

    bpeek -f    uses 'tail -f' to display output instead of 'cat'
    bpeek [-q queue_name | -m host_name | -J job_name | job_ID | "job_ID[index_list]"]

Essential Commands: bmod

bmod - modifies job submission options of a job

    bmod [bsub options] [job_ID | "job_ID[index]"]
    bmod -g job_group_name | -gn [job_ID]
    bmod [-sla service_class_name | -slan] [job_ID]
    bmod [-h | -V]

Essential Commands: bbot, btop

bbot - moves a pending job relative to the last job in the queue

    bbot job_ID | "job_ID[index_list]" [position]
    bbot [-h | -V]

btop - moves a pending job relative to the first job in the queue

    btop job_ID | "job_ID[index_list]" [position]
    btop [-h | -V]

Essential Commands: bswitch

bswitch - switches unfinished jobs from one queue to another

    bswitch [-J job_name] [-m host_name | -m host_group] [-q queue_name] [-u user_name | -u user_group | -u all] destination_queue [0]
    bswitch destination_queue [job_ID | "job_ID[index_list]"] ...
    bswitch [-h | -V]

Essential Commands: bstop/bresume

bstop - suspends unfinished jobs

    bstop [-a] [-d] [-g job_group_name | -sla service_class_name] [-J job_name] [-m host_name | -m host_group] [-q queue_name] [-u user_name | -u user_group | -u all] [0] [job_ID | "job_ID[index]"] ...
    bstop [-h | -V]

bresume - resumes one or more suspended jobs

    bresume [-g job_group_name] [-J job_name] [-m host_name] [-q queue_name] [-u user_name | -u user_group | -u all] [0]
    bresume [job_ID | "job_ID[index_list]"] ...
    bresume [-h | -V]

Essential Commands: bkill

bkill - sends signals to kill, suspend, or resume unfinished jobs

    bkill [-l] [-g job_group_name | -sla service_class_name] [-J job_name] [-m host_name | -m host_group] [-q queue_name] [-r | -s (signal_value | signal_name)] [-u user_name | -u user_group | -u all] [job_ID ... | 0 | "job_ID[index]" ...]
    bkill [-h | -V]
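Most of the job-control commands above take a job ID, so a common pattern is to capture the ID that bsub prints at submission and reuse it. The sketch below assumes the usual bsub message of the form "Job <103> is submitted to default queue <normal>."; extract_job_id is a hypothetical helper, and the sample message is hard-coded so the parsing can be tried without LSF installed.

```shell
#!/bin/bash
# Sketch: pull the job ID out of a bsub submission message and reuse it.
# extract_job_id is a hypothetical helper; the message format is assumed to be
# "Job <ID> is submitted to ... queue <NAME>."

extract_job_id() {
    # Print the digits inside the first <...> after the leading word "Job".
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# On the cluster this would be: msg=$(bsub < myjob)
msg='Job <103> is submitted to default queue <normal>.'
job_id=$(extract_job_id "$msg")
echo "job id: $job_id"

if command -v bjobs >/dev/null 2>&1; then
    bjobs "$job_id"      # current status of the job
    bpeek "$job_id"      # stdout/stderr so far (unfinished jobs only)
    # bkill "$job_id"    # uncomment to kill the job
else
    echo "LSF commands not found; on the cluster run: bjobs $job_id"
fi
```

If the message does not match the assumed format, extract_job_id prints nothing, so an empty job_id is a signal to check the submission output by hand.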