Portable Batch System: Definition and 3 Primary Roles

• Definition: PBS is a distributed workload management system. It handles the management and monitoring of the computational workload on a set of computers.
• Queuing: Users submit tasks or “jobs” to the resource management system, where they are queued until the system is ready to run them.
• Scheduling: The process of selecting which jobs to run, when, and where, according to a predetermined policy. It aims to balance competing needs and goals on the system(s) to maximize efficient use of resources.
• Monitoring: Tracking and reserving system resources and enforcing usage policy. This includes both software enforcement of usage limits and user or administrator monitoring of scheduling policies.
Submitting jobs to PBS: qsub command
• The qsub command is used to submit a batch job to PBS. It is executed on aluf (the login node). Submitting a PBS job specifies a task, requests resources and sets job attributes, all of which can be defined in an executable script file.
Recommended syntax of the qsub command:
> qsub [options] scriptfile
• PBS script files (PBS shell scripts, see the next page) should be created in the user’s directory
• To obtain detailed information about qsub options, use the command:
> man qsub
• Job identifier (JOB_ID): upon successful submission of a batch job, PBS returns a job identifier in the following format:
sequence_number.server_name
for example: 12345.aluf01
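The two parts of a job identifier can be separated with ordinary shell parameter expansion, which is handy when a later step needs only the number. This is a minimal sketch: the variable names are illustrative, and the identifier is the example value from the format description, not real qsub output.

```shell
#!/bin/sh
# Example identifier in the sequence_number.server_name format.
# In practice it could be captured with: job_id=$(qsub scriptfile)
job_id="12345.aluf01"

# Remove everything from the first dot onward: the sequence number.
seq_num=${job_id%%.*}
# Remove everything up to and including the first dot: the server name.
server=${job_id#*.}

echo "sequence number: $seq_num"
echo "server name: $server"
```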
ALUF Queues Description
• all_q - default routing queue; routes jobs to the appropriate destination queue according to the wall time and number of CPUs (ncpus) requested in the PBS script
• multicore - parallel jobs, up to 4 CPUs, time limit 24 hours
• short - serial jobs (1 CPU), time limit 3 hours
• main - serial jobs (1 CPU), time limit 24 hours
• long - serial jobs (1 CPU), time limit 72 hours
For detailed, up-to-date information on queue limits, type:
> qstat -fQ queue_name
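The routing idea behind all_q can be illustrated with a small shell function that maps a request onto the queue limits listed above. This is only a sketch of the decision logic, not the server's actual routing code: the function name pick_queue is hypothetical, walltime is given in whole hours, and it is an assumption that multicore is meant for requests of 2 to 4 CPUs.

```shell
#!/bin/sh
# Hypothetical helper: pick_queue NCPUS HOURS
# Prints the queue whose listed limits fit the request, or "none".
# Illustration only; the real routing is performed by all_q on the server.
pick_queue() {
    ncpus=$1
    hours=$2
    if [ "$ncpus" -gt 4 ]; then
        echo "none"                      # no listed queue accepts more than 4 CPUs
    elif [ "$ncpus" -gt 1 ]; then
        # Assumption: 2 to 4 CPUs means a multicore job (limit 24 hours).
        if [ "$hours" -le 24 ]; then echo "multicore"; else echo "none"; fi
    elif [ "$hours" -le 3 ]; then
        echo "short"
    elif [ "$hours" -le 24 ]; then
        echo "main"
    elif [ "$hours" -le 72 ]; then
        echo "long"
    else
        echo "none"
    fi
}

pick_queue 1 2     # serial job, 2 hours
pick_queue 1 48    # serial job, 48 hours
pick_queue 4 24    # 4-CPU job, 24 hours
```

The current limits on the cluster may differ from the values hard-coded here, so the qstat -fQ output remains the authoritative source.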
The PBS shell script sections
• Shell specification: #!/bin/sh
• PBS directives: used to request resources or set
attributes. A directive begins with the default
string “#PBS”.
• Tasks (programs or commands)
- environment definitions
- I/O specifications
- executable specifications
NB! Other lines starting with # are comments
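The difference between directives and ordinary comments can be checked with grep. The sketch below writes a small sample script to a temporary file (its contents are illustrative) and counts the two kinds of lines; it does not contact PBS.

```shell
#!/bin/sh
# Write a small sample script to a temporary file (contents are illustrative).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
#PBS -N job_name
#PBS -l walltime=01:00:00
# an ordinary comment
echo "task"
EOF

# Lines starting with "#PBS" are directives.
directives=$(grep -c '^#PBS' "$script")
# Other lines starting with "#" (excluding the "#!" shell line) are comments.
comments=$(grep '^#' "$script" | grep -vc -e '^#PBS' -e '^#!')

echo "directives: $directives"
echo "comments: $comments"
rm -f "$script"
```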
PBS script example for multicore user code
#!/bin/sh
#PBS -N job_name
#PBS -q queue_name
#PBS -M user@technion.ac.il
#PBS -l select=1:ncpus=4:mem=8gb
#PBS -l walltime=24:00:00
PBS_O_WORKDIR=$HOME/mydir
cd $PBS_O_WORKDIR
./program.exe < input.file > output.file
For other examples, see
http://tx.technion.ac.il/doc/aluf/PBS-scripts/
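Before submitting, it can help to compare the requested walltime with a queue's time limit. The following sketch converts HH:MM:SS strings to seconds with awk and compares them; the helper name walltime_to_seconds is hypothetical, and the 72-hour limit is taken from the queue description of the long queue.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS walltime string to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

requested=$(walltime_to_seconds "24:00:00")   # value from the example script
limit=$(walltime_to_seconds "72:00:00")       # listed limit of the long queue

if [ "$requested" -le "$limit" ]; then
    echo "request fits within the limit"
else
    echo "request exceeds the limit"
fi
```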
Checking job/queue status: qstat command
• The qstat command is used to request the status of batch jobs and queues
• Detailed information: > man qstat
• qstat output structure (see on Tamnun)
• Useful commands
> qstat -a                  jobs of all users in all queues (default)
> qstat -1n                 all jobs in the system with node names
> qstat -1nu username       all of the user’s jobs with node names
> qstat -f JOB_ID           extended output for the job
> qstat -Q                  list of all queues in the system
> qstat -Qf queue_name      extended queue details
> qstat -1Gn queue_name     all jobs in the queue with node names
Removing job from a queue: qdel command
• The qdel command is used to delete queued or running jobs. The job's running processes are killed. A PBS job may be deleted by its owner or by the administrator
• Detailed information: > man qdel
• Useful commands
> qdel JOB_ID               deletes the job from a queue
> qdel -W force JOB_ID      force-deletes the job
Checking job results and troubleshooting
• Save the JOB_ID for further inspection
• Check the error and output files: job_name.eJOB_ID; job_name.oJOB_ID
• Inspect the job’s details (searching the logs of the past N days):
> ssh aluf01
> tracejob [-n N] JOB_ID
• Running an interactive batch job:
> qsub -I pbs_script
The job is sent to an execution node, the PBS directives are executed, shell control is passed to the user, and the job awaits the user’s commands
• Checking a job on an execution node:
> ssh node_name             node_name is aluf01, aluf02 or aluf03
> hostname
> top                       inside top, press u and enter a user name to show that user’s processes; press 1 to show per-CPU usage
> kill -9 PID               kills the job’s process on the node
> ls -rtl /gtmp             check the files under the user’s ownership
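The names of the error and output files can be derived from the job name and the job identifier, which makes the check easy to script. This is a minimal sketch, assuming the suffix is the numeric sequence-number part of the identifier (the usual PBS default); the job name and identifier are example values, and the variable names are illustrative.

```shell
#!/bin/sh
# Example values; replace with the real job name and saved JOB_ID.
job_name="job_name"
job_id="12345.aluf01"

# Assumption: the file suffix is the sequence number before the first dot.
seq_num=${job_id%%.*}
out_file="${job_name}.o${seq_num}"
err_file="${job_name}.e${seq_num}"

# Report whether each file exists and whether it has any content.
for f in "$out_file" "$err_file"; do
    if [ -s "$f" ]; then
        echo "$f: has content"
    elif [ -e "$f" ]; then
        echo "$f: empty"
    else
        echo "$f: not found"
    fi
done
```

An error file with content is usually the first place to look when a job ends unexpectedly.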