Advanced Scheduler Tips and Tricks
Matthew Scholz

Why are you here?
• “My job isn’t starting quickly”
• “I want to do MORE work on the system”
• “My job keeps dying on the cluster”

Overview
• Scheduling Priorities
• Cluster Resources
• Better Resource Requesting
• Array Jobs
• Long jobs (BLCR)

Scheduling Priorities
• Jobs that use more resources get higher priority (because these are hard to schedule)
• Smaller jobs are “backfilled” to fit in the holes created by the bigger jobs
• Eligible jobs acquire more priority as they sit in the queue
• Jobs can be in three basic states:
  – Blocked, eligible or running

Cluster Resources
• #PBS -l nodes=1:ppn=4,walltime=4:00:01,mem=4gb
• The resources you request affect your time in the queue
• Factors
  – Priority (number of cores requested)
  – Available Resources (number of cores available)
  – RAM/Core

Cluster Resources 2
Year  Name     Description                                 ppn  Memory  Nodes  Total Cores
2007  intel07  Quad-core 2.3GHz Intel Xeon E5345           8    8GB     124    992
2009  amd09    Sun Fire X4600 (Fat Node) AMD Opteron 8384  32   256GB   3      96
2010  gfx10    NVIDIA CUDA Node (feature=GBE)              8    18GB    32     256
2010  intel10  Intel Xeon E5620 (2.40 GHz)                 8    24GB    192    1536
2011  intel11  Intel Xeon 2.66 GHz E7-8837                 32   512GB   2      64
                                                           32   1TB     1      32
                                                           64   2TB     2      128
2014  intel14  Intel Xeon E5-2670 v2 (2.6 GHz)             20   64GB    128    2560
                                                           20   256GB   24     480
               2 NVIDIA K20 GPUs (feature=gpgpu)           20   128GB   40     800
               2 Xeon Phi 5110P (feature=phi)              20   128GB   28     560
Total*                                                                  576    7504
* Does not include Condor Cluster

Cluster Resources 3
• Buy-in Priority
  – Investigators have helped make the cluster larger by purchasing some of the nodes.
  – These nodes are “reserved”
• Buy-in use = 1 week
• Non-buy-in use = 4 hours

QUESTIONS?

Better Resource Requests
• RAM/Core vs. RAM/Node
  – When requesting memory with -l mem=XGB, remember that it is divided PER core.
  – E.g. ppn=4,mem=4GB == 1GB/core
  – Each node can accommodate requests up to the total RAM of the machine.
  – Best to target the AVERAGE RAM/core of the nodes for the best shot at starting quickly.

Better Resource Requests
• Walltime:
  – <= 4 hours: more machines available
  – Up to 1 week walltime allowed
• Feature=GBE
  – ~320 cores available, not on InfiniBand (high-speed interconnect)
  – If nodes=1, you can also request feature=gbe

QUESTIONS?

Array Jobs
• Pleasantly parallel workflows:
  – I need to sort 50 files, and generate 50 new files
  – (NOT: I need to sort 50 files into 1 large file)
  – Jobs are independent of each other, but have the same behavior
• #PBS -t 1-40
• (OR)
• #PBS -t 2,4,8

Array Jobs
• What does this DO?
• -t 1-2 submits _2_ jobs
• Each job is identical, EXCEPT
  – When the job starts, an environment variable is set:
    • $PBS_ARRAYID=1 (-t 1)
    • $PBS_ARRAYID=2 (-t 2)
    • Etc.
  – Workflows can be modified to take in the variable and run different work per job

QUESTIONS?

Long jobs (BLCR)
• Berkeley Lab Checkpoint/Restart
  – Wrapper around a program.
  – Can save the entire program state (checkpoint) for restart later
  – We have a powertool! (longjob)
• Uses:
  – Jobs that need to run > 1 week
  – Jobs that are taking too long to start (can be run in 4 hour chunks)

BLCR Example (commands)
    cd
    mkdir examples
    cd examples
    module load powertools
    getexample velvet_blcr
    cd velvet_blcr
    nano velveth_blcr.qsub
• Command line tools to make a new examples dir in your homedir, and grab the example
• Nano to edit (you can use your editor of choice)

Break down on commands
• Must have powertools module loaded in script
• Use $PBS_O_WORKDIR
• Important variables:
• BLCR_WAIT_SEC: how long to run before beginning checkpoint
  – MUST be less than walltime (enough to allow save)
• BLCR_OUTPUT: name of output file

BLCR continued
    # If the checkpoint file does not exist, this is the first run
    if [ ! -f checkfile.blcr ]
    then
        # Make a per-job working directory under the submission directory
        WORK="${PBS_O_WORKDIR}/${PBS_JOBID}"
        mkdir -p "${WORK}"
        # Run main simulation program
        cd "${WORK}"
    fi
• If statement, to see if it is the first time running: if so, make a new directory
• For safety

BLCR
• longjob command
  – ONE command; if you are running a pipeline, put it all into a separate script file
  – Remember that the WORK will be done in a new subdir (no relative paths)
• Remember: It takes time to save memory footprint

Last chance: QUESTIONS?
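
Worked example: choosing BLCR_WAIT_SEC
The rule above says BLCR_WAIT_SEC MUST be less than the walltime, with enough time left to save the checkpoint. Below is a minimal sketch of that arithmetic in plain shell. The helper names (walltime_to_sec, blcr_wait_sec) and the 10-minute save margin are assumptions made for illustration; they are not part of powertools or longjob, and the right margin depends on how large your job's memory footprint is.

```shell
#!/bin/sh
# Hypothetical helper: convert a walltime string HH:MM:SS to seconds.
walltime_to_sec() {
    wt_old_ifs=$IFS
    IFS=:
    # Split the argument on ":" into $1 (hours), $2 (minutes), $3 (seconds)
    set -- $1
    IFS=$wt_old_ifs
    # Strip one leading zero so values like "08" are not read as octal
    wt_h=${1#0}; wt_m=${2#0}; wt_s=${3#0}
    echo $(( ${wt_h:-0} * 3600 + ${wt_m:-0} * 60 + ${wt_s:-0} ))
}

# Hypothetical helper: seconds to run before the checkpoint begins,
# leaving $2 seconds of the walltime ($1) free for the save.
blcr_wait_sec() {
    bw_total=$(walltime_to_sec "$1")
    bw_margin=$2
    if [ "$bw_margin" -ge "$bw_total" ]; then
        echo "margin must be smaller than the walltime" >&2
        return 1
    fi
    echo $(( bw_total - bw_margin ))
}

# Example: 4-hour walltime, 600 s (assumed margin) reserved for the save
BLCR_WAIT_SEC=$(blcr_wait_sec 04:00:00 600)
export BLCR_WAIT_SEC
echo "BLCR_WAIT_SEC=${BLCR_WAIT_SEC}"   # prints BLCR_WAIT_SEC=13800
```

If the checkpoint regularly fails to finish before the job is killed, increase the margin rather than the walltime, so the job still fits in the 4 hour window.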