Shared Computing Cluster – Advanced Usage
Research Computing Services

Outline
› SCC design overview
› SCC login nodes
› Interactive jobs
› Batch jobs
› Multithreaded jobs
› MPI jobs
› Job arrays
› Job dependencies
› Interactive graphics jobs
› Job monitoring
› Job analysis
› Code optimization
› qacct, acctool and other useful commands

SCC design overview
[Photo: rear view of the cluster – Ethernet and Infiniband switches, compute nodes, server cabinets]

SCC compute nodes
There are hundreds of nodes. Each has its own properties and designation:
› Processor and CPU architecture
› Number of cores
› Memory
› Network connection
SCC on the web: http://www.bu.edu/tech/support/research/computing-resources/tech-summary/

[Diagram: chassis and cabinets grouped by node type]
› MPI-only nodes: 16 cores and 128 GB per node; 1 Gbps Ethernet and FDR Infiniband
› Shared nodes: 16 cores and 128 GB per node; 1p and omp jobs; 1 Gbps Ethernet
› Shared nodes: 16 cores and 256 GB per node; 1p and omp jobs; 10 Gbps Ethernet

To get information about each node, execute qhost.

qhost – show the status of each host
scc2 ~> qhost
HOSTNAME  ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
-----------------------------------------------------------------
global    -          -     -      -       -       -       -
geo       linux-x64  12    0.34   94.4G   16.1G   8.0G    2.2G
scc-aa1   linux-x64  16    9.78   126.0G  2.6G    8.0G    19.3M
scc-aa2   linux-x64  16    16.11  126.0G  5.1G    8.0G    0.0
scc-aa3   linux-x64  16    16.13  126.0G  5.1G    8.0G    0.0
scc-aa4   linux-x64  16    16.03  126.0G  5.3G    8.0G    2.7M
scc-aa5   linux-x64  16    16.01  126.0G  2.1G    8.0G    18.7M
scc-aa6   linux-x64  16    16.00  126.0G  4.9G    8.0G    2.4M
scc-aa7   linux-x64  16    16.01  126.0G  5.0G    8.0G    16.9M
scc-aa8   linux-x64  16    16.03  126.0G  5.7G    8.0G    18.5M
scc-ab1   linux-x64  16    16.00  126.0G  58.2G   8.0G    89.7M

qhost -F – detailed information about each node
scc2 ~> qhost -F
HOSTNAME  ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
-----------------------------------------------------------------
scc-aa1   linux-x64  16    0.00   126.0G  1.6G    8.0G    19.3M
   ...
   hl:arch=linux-x64
   hl:num_proc=16.000000
   hl:mem_total=125.997G
   hl:swap_total=8.000G
   hl:virtual_total=133.997G
   hl:scratch_free=840.000G
   ...
   hf:cpu_arch=sandybridge
   hf:cpu_type=E5-2670
   hf:eth_speed=1.000000
   hf:ib_speed=56.000000

qhost -j – print all the jobs running on each host
scc2 ~> qhost -j
HOSTNAME  ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
-----------------------------------------------------------------
scc-aa2   linux-x64  16    16.05  126.0G  5.2G    8.0G    0.0
   job-ID   prior    name   user   state  submit/start at      queue       master  ja-task-ID
   -------------------------------------------------------------------------------------------
   5299960  0.30000  cu_pt  bmatt  r      01/17/2015 18:25:53  a128@scc-a  MASTER
                                                               a128@scc-a  SLAVE
                                                               a128@scc-a  SLAVE
                                                               a128@scc-a  SLAVE
                                                               a128@scc-a  SLAVE
                                                               a128@scc-a  SLAVE
                                                               a128@scc-a  SLAVE
                                                               a128@scc-a  SLAVE

qhost -q – show information about the queues on each host
scc2 ~> qhost -q
HOSTNAME  ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
-----------------------------------------------------------------
global    -          -     -      -       -       -       -
geo       linux-x64  12    0.30   94.4G   16.1G   8.0G    2.2G
scc-aa1   linux-x64  16    15.15  126.0G  2.6G    8.0G    19.3M
   a     BP  0/16/16
   as    BP  0/0/16
   a128  BP  0/0/16
scc-aa2   linux-x64  16    16.15  126.0G  5.1G    8.0G    0.0
   a     BP  0/0/16
   as    BP  0/0/16
   a128  BP  0/16/16

Service models – Shared and Buy-In
› Shared (~55%): paid for by BU and university-wide grants; free to the entire BU Research Computing community.
› Buy-In (~45%): purchased by individual faculty or research groups through the Buy-In program, with priority access for the purchaser.
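qhost also accepts the -l option to filter hosts by a resource. A small sketch (the values are only examples; the exact attribute names on the SCC are the ones listed by qhost -F):

# List hosts that can satisfy a mem_total=252G request (i.e., the large-memory nodes)
scc2 ~> qhost -l mem_total=252G

# Show the queue instances on a single node
scc2 ~> qhost -q -h scc-aa1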
SCC basic organization
[Diagram: public network → login nodes SCC1, SCC2, SCC3, SCC4 (VPN only) → private network → file storage and compute nodes]
More than 350 compute nodes with ~6,300 CPUs and 232 GPUs.

SCC login node rules
Login nodes are designed for light work:
› Text editing
› Light debugging
› Program compilation
› File transfer
There are 4 login nodes with 12 cores each – and more than 1,400 SCC users.

To ensure an effective and smooth experience for everyone, users should NOT:
› execute a program on a login node that runs longer than 10–15 minutes;
› execute parallel programs on a login node.

3 types of jobs
› Batch job – execution of a program without manual intervention
› Interactive job – an interactive shell: run GUI applications, debug code, benchmark serial and parallel code performance, …
› Interactive graphics job (new)

Interactive jobs: qsh, qlogin, qrsh
                                                             qsh    qlogin / qrsh
X-forwarding is required                                      ✓          —
Session is opened in a separate window                        ✓          —
Allows a graphics window to be opened by a program            ✓          ✓
Current environment variables can be passed to the session    ✓          —
Batch-system environment variables ($NSLOTS, etc.) are set    ✓          —

Request an interactive job
scc2 ~> pwd
/projectnb/krcs
scc2 ~> qsh
Your job 5300277 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 5300277 has been successfully scheduled.

# In the new window the current directory is preserved:
scc2 ~> pwd
/projectnb/krcs

Request an interactive job with additional options
scc2 ~> module load R
scc2 ~> qsh -pe omp 4 -l mem_total=252G -V
Your job 5300277 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 5300277 has been successfully scheduled.

# In the new window the requested slots and the current environment are available:
scc2 ~> echo $NSLOTS
4
scc2 ~> module list
Currently Loaded Modulefiles:
  1) pgi/13.5   2) R/R-3.1.1

qsh -now n
When the cluster is busy, or when a number of additional options are added to an interactive job request, the scheduler may not be able to satisfy the request immediately:

scc2 ~> qsh -P krcs -pe omp 16
Your job 5300273 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ....
Your "qsh" request could not be scheduled, try again later.

Add the "-now n" option to your interactive job request to place it in the pending queue instead:

scc2 ~> qsh -P krcs -pe omp 16 -now n
Your job 5300277 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 5300277 has been successfully scheduled.

qrsh
scc2 ~> qrsh -pe omp 4 -V
RSA host key for IP address '192.168.18.180' not in list of known hosts.
Last login: Fri Jan 16 16:50:34 2015 from scc4p.scc.bu.edu
scc-pi4 ~> pwd
/usr1/scv/koleinik
scc-pi4 ~> echo $NSLOTS

scc-pi4 ~> module list
Currently Loaded Modulefiles:
  1) pgi/13.5

Jobs started with the qrsh command:
› do not require X-forwarding;
› start in the same window;
› set the current directory to the home directory;
› do not pass the current environment variables (even with -V).
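Interactive sessions are subject to the same 12-hour default run-time limit as batch jobs (discussed later), so it can be worth requesting a longer limit up front. A minimal sketch combining the options shown above (the project name and values are only examples):

scc2 ~> qsh -P krcs -pe omp 4 -l h_rt=24:00:00 -now n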
Submitting batch jobs
# Submit a (binary) program
scc2 ~> qsub -b y printenv
Your job 5300301 ("printenv") has been submitted

# Submit a program using a script
scc-pi4 ~> qsub myScript.sh
Your job 5300302 ("myScript") has been submitted

Check the job: qstat
scc2 ~> qstat -u koleinik
job-ID   prior    name  user      state  submit/start at      queue                     slots  ja-task-ID
----------------------------------------------------------------------------------------------------------
5260168  0.11732  a1    koleinik  r      01/22/2015 14:59:22  budge@scc-jc1.scc.bu.edu  4

# Check only running jobs
scc-pi4 ~> qstat -u koleinik -s r

# Check the resources requested for each job
scc-pi4 ~> qstat -u koleinik -r

Parallelization on the SCC
2 types of parallelization:
› multithreaded/OpenMP – uses some or all cores on one node
› MPI – uses multiple cores, possibly across a number of nodes

Multithreaded parallelization on the SCC
C, C++, FORTRAN, R, Python, etc. allow for multithreaded parallelization. This normally requires adding some special directives within the code. There are also a number of applications that will parallelize if the appropriate option is given on the command line.

OpenMP parallelization, using C (compile with an OpenMP flag such as gcc -fopenmp):
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int threads = omp_get_num_threads();
        int id = omp_get_thread_num();
        printf(" hello from thread %d out of %d threads!\n", id, threads);
    }
    return 0;
}

Multithreaded parallelization, using R (the foreach and doMC packages):
library(foreach)
library(doMC)
registerDoMC(nCores)   # nCores = number of requested slots

# Execute sampling and analysis in parallel
results <- foreach(i = 1:nSim, .combine = rbind) %dopar% {
    perm <- sample(D, replace = FALSE)
    mdl  <- lm(perm ~ M)
    c(i, coef(mdl))
}

Batch script options to submit a multithreaded program
# Request 8 cores. This number can be up to 16.
#$ -pe omp 8
#
# For OpenMP C or FORTRAN code you need to set an environment variable:
export OMP_NUM_THREADS=$NSLOTS
./program arg1 arg2

MPI parallelization on the SCC
Batch script options to submit an MPI program
# Request 32 cores. This number should be a multiple of 16.
#$ -pe mpi_16_tasks_per_node 32
#
mpirun -np 32 ./program arg1 arg2

Check which nodes the program runs on (expanded view for MPI jobs)
scc2 ~> qstat -u <user_name> -g t
job-ID   prior    name     user  state  submit/start at      queue                          master  ja-task-ID
----------------------------------------------------------------------------------------------------------------
5348232  0.24921  program  user  r      01/23/2015 06:52:00  straub-mpi@scc-nb3.scc.bu.edu  MASTER
                                                             straub-mpi@scc-nb3.scc.bu.edu  SLAVE
                                                             straub-mpi@scc-nb3.scc.bu.edu  SLAVE
                                                             straub-mpi@scc-nb3.scc.bu.edu  SLAVE
                                                             straub-mpi@scc-nb3.scc.bu.edu  SLAVE
5348232  0.24921  program  user  r      01/23/2015 06:52:00  straub-mpi@scc-nb7.scc.bu.edu  SLAVE
                                                             straub-mpi@scc-nb7.scc.bu.edu  SLAVE
. . .
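For reference, the fragments above can be assembled into one complete MPI batch script. This is only a sketch: the project, module, and program names as well as the limits are illustrative.

#!/bin/bash
#$ -N mpi_test                     # job name
#$ -j y                            # merge stdout and stderr into one file
#$ -P krcs                         # project to charge (illustrative)
#$ -pe mpi_16_tasks_per_node 32    # 32 MPI tasks, 16 per node (2 full nodes)
#$ -l h_rt=24:00:00                # hard run-time limit
#$ -m ae                           # email when the job ends or is aborted

module load openmpi                # assumed module name; check "module avail"
mpirun -np $NSLOTS ./program arg1 arg2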
MPI parallelization on the SCC
Possible choices for the number of processors on the SCC:
› 4 tasks per node:  4, 8, 12, …
› 8 tasks per node:  16, 24, 32, …
› 12 tasks per node: 12, 24, 36, …
› 16 tasks per node: 16, 32, 48, …

Bad choice (the 4-tasks-per-node environment should be used only for a very small number of tasks):
#$ -pe mpi_4_tasks_per_node 12
Better:
#$ -pe mpi_12_tasks_per_node 12

Array jobs
An array job executes independent copies of the same job script. The number of tasks to be executed is set using the -t option of the qsub command, e.g.:

scc2 ~> qsub -t 1-10 myscript.sh

The above command submits an array job consisting of 10 tasks, numbered from 1 to 10. The batch system sets the SGE_TASK_ID environment variable, which can be used inside the script to pass the task ID to the program:

#!/bin/bash
#$ -N myjob
#$ -j y
Rscript myRfile.R $SGE_TASK_ID

Where will my job execute? How long will it wait in the queue?
It depends on:
› the type of the application;
› the additional resources requested;
› what other users do.

There are a number of queues defined on the SCC. Various types of jobs are assigned to different queues, and jobs in a particular queue can execute only on designated nodes.

Check the status of the queues
scc2 ~> qstat -g c
CLUSTER QUEUE  CQLOAD  USED  RES  AVAIL  TOTAL  aoACDS  cdsuE
--------------------------------------------------------------
a              0.89    128   0    448    576    0       0
a128           0.89    384   0    192    576    0       0
as             0.89    0     0    576    576    0       0
b              0.96    407   0    9      416    0       0
bioinfo        0.00    0     0    48     48     0       0
bioinfo-pub    0.00    0     0    48     48     0       0

Queues on the SCC
› a* queues – MPI jobs
› b* queues – 1p and omp jobs
› c* queues – large-memory jobs

Get information about a queue
scc2 ~> qconf -sq a
hostlist   @aa @ab @ac @ad @ae
qtype      BATCH
pe_list    mpi_16_tasks_per_node_a
h_rt       120:00:00

Get information about the various resource options for the qsub command
scc2 ~> qconf -sc
#name          shortcut  type
#--------------------------------
cpu_arch       cpu_a     RESTRING
cpu_type       cpu_t     RESTRING
eth_speed      eth_sp    INT
scratch_free   scratch   MEMORY
…

Why my job failed…
Batch script syntax: I submitted a job and it's hung in the queue…
Possible cause: check whether the script has CR symbols at the end of the lines:
cat -A script_file
You should NOT see ^M characters there. If you do, run:
dos2unix script_file

I submitted a job and it failed… Why?
Add the option "-m ae" to the batch script (or to the qsub command): an email will be sent at the end of the job and if the job is aborted.

Job 5300308 (printenv) Complete
User             = koleinik
Queue            = linga@scc-kb5.scc.bu.edu
Host             = scc-kb5.scc.bu.edu
Start Time       = 01/17/2015 23:31:44
End Time         = 01/17/2015 23:31:44
User Time        = 00:00:00
System Time      = 00:00:00
Wallclock Time   = 00:00:00
CPU              = 00:00:00
Max vmem         = NA
Exit Status      = 0

Job 9022506 (myJob) Aborted
Exit Status      = 137
Signal           = KILL
User             = koleinik
Queue            = b@scc-bc3.scc.bu.edu
Host             = scc-bc3.scc.bu.edu
Start Time       = 08/18/2014 15:58:55
End Time         = 08/19/2014 03:58:56
CPU              = 11:58:33
Max vmem         = 4.324G
failed assumedly after job because:
job 9022506.1 died through signal KILL (9)

The default time limit for interactive and non-interactive jobs on the SCC is 12 hours. Make sure you request enough time for your application to complete:
#$ -l h_rt=48:00:00

Dear Admins: I submitted a job and it takes longer than I expected. Is it possible to extend the time limit?
Unfortunately, no… The SCC batch system does not allow the time limit to be altered, even by the systems administrators.
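Because the limit cannot be changed after submission, request enough time and ask for notifications up front. A minimal sketch of the relevant batch-script lines (the address and limit are illustrative):

#$ -l h_rt=48:00:00     # request 48 hours up front; the default is 12 hours
#$ -m ae                # send email when the job ends or is aborted
#$ -M username@bu.edu   # illustrative address for the notification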
Memory
Job 1864070 (myBigJob) Complete
User             = koleinik
Queue            = linga@scc-kb8.scc.bu.edu
Host             = scc-kb8.scc.bu.edu
Start Time       = 10/19/2014 15:17:22
End Time         = 10/19/2014 15:46:14
User Time        = 00:14:51
System Time      = 00:06:59
Wallclock Time   = 00:28:52
CPU              = 00:27:43
Max vmem         = 207.393G
Exit Status      = 137

A number of nodes have only 3 GB of memory per slot, so by default a 1p job should not use more than 3–4 GB of memory. If the program needs more memory, it should request additional resources.

scc2 ~> qhost -h scc-kb8
HOSTNAME  ARCH       NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------
global    -          -     -     -       -       -       -
scc-kb8   linux-x64  64    4.03  252.2G  8.6G    8.0G    36.8M

Memory per slot
Currently, the SCC has nodes with:
› 16 cores & 128 GB = 8 GB per slot
› 16 cores & 256 GB = 16 GB per slot
› 12 cores & 48 GB  = 4 GB per slot
› 8 cores & 24 GB   = 3 GB per slot
› 8 cores & 96 GB   = 12 GB per slot
› 64 cores & 256 GB = 4 GB per slot   }
› 64 cores & 512 GB = 8 GB per slot   }  available only to Med. Campus users

Memory example: a single-processor job needs 10 GB of memory.
# Request a node with at least 12 GB per slot
#$ -l mem_total=94G

Memory example: a single-processor job needs 50 GB of memory.
# Request a large-memory node (16 GB of memory per slot)
#$ -l mem_total=252G
# Request a few slots
#$ -pe omp 3
* Projects that can run on LinGA nodes might need some additional options.

Valgrind – a memory mismanagement detector
scc2 val> valgrind --tool=memcheck --leak-check=yes ./mytest
==63349== Memcheck, a memory error detector
==63349== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==63349== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==63349== Command: ./mytest
==63349==
String = tutorial, Address = 85733440
String = tutorial from SCC, Address = 85733536
==63349==
==63349== HEAP SUMMARY:
==63349==     in use at exit: 0 bytes in 0 blocks
==63349==   total heap usage: 2 allocs, 2 frees, 271 bytes allocated
==63349==
==63349== All heap blocks were freed -- no leaks are possible
==63349==
==63349== For counts of detected and suppressed errors, rerun with: -v
==63349== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

Jobs using more than 1 CPU
Job 1864070 (myParJob) Complete
User             = koleinik
Queue            = budge@scc-hb2.scc.bu.edu
Host             = scc-hb2.scc.bu.edu
Start Time       = 11/29/2014 00:48:27
End Time         = 11/29/2014 01:33:35
User Time        = 02:24:13
System Time      = 00:09:07
Wallclock Time   = 00:45:08
CPU              = 02:38:59
Max vmem         = 78.527G
Exit Status      = 137

Some applications try to detect the number of cores and parallelize when possible; one common example is MATLAB. Always read the documentation and the available options of your application, and either disable parallelization or request additional cores. If the program does not allow you to control the number of cores it uses, request the whole node.

Example: by default MATLAB will use up to 12 CPUs.
# Start MATLAB using the single-thread option:
matlab -nodisplay -singleCompThread -r "n=4, rand(n), exit"

Example: running the MATLAB Parallel Computing Toolbox.
# Request 4 cores:
#$ -pe omp 4
matlab -nodisplay -r "matlabpool open 4, s=0; parfor i=1:n, s=s+i; end, matlabpool close, s, exit"
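For reference, a complete batch script wrapping the Parallel Computing Toolbox example above might look like the sketch below. The module name, loop bound, and run-time limit are assumptions, and newer MATLAB releases use parpool instead of matlabpool.

#!/bin/bash
#$ -N matlab_par
#$ -j y
#$ -pe omp 4            # request the same number of slots the worker pool will use
#$ -l h_rt=12:00:00

module load matlab      # assumed module name; check "module avail"
matlab -nodisplay -r "matlabpool open 4, s=0; parfor i=1:100, s=s+i; end, matlabpool close, s, exit"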
My job used to run fine and now it fails… Why?
Check your disk usage!
› To check the disk usage in your home directory, use: quota
› To check the disk usage of a project, use: pquota -u project_name

I submitted a job and it failed… Why?
We are always happy to help! Please email us at help@scc.bu.edu and include:
1. The job ID
2. Your working directory
3. A brief description of the problem

How can I retrieve information about a job I ran recently?
scc2 ~> qacct -d 7 -o koleinik -j

qacct – report and account for SCC usage:
  -b  begin time (MMDDhhmm)
  -d  days
  -e  end time (MMDDhhmm)
  -h  host name
  -j  job ID
  -o  owner
  -q  queue
  -t  task ID range
  -P  project

qacct output
qname        b
hostname     scc-bd3.scc.bu.edu
group        scv
owner        koleinik
project      krcs
department   defaultdepartment
jobname      ibd_check
jobnumber    5060718
taskid       4

qacct output (continued)
granted_pe    NONE
slots         1
failed        0          # indicates a problem
exit_status   0          # exit status of the job script
ru_wallclock  231        # time (in seconds)
ru_utime      171.622
ru_stime      16.445
ru_maxrss     14613128   # maximum resident set size (in kilobytes)
...
ru_inblock    4427096    # block input operations
ru_oublock    308408     # block output operations
...
maxvmem       14.003G    # maximum virtual memory usage

Job analysis
I submitted my job and the program did not run. My job is now in Eqw status…
scc2 advanced> cat -v myScript.sh
#!/bin/bash^M
^M
#Give my job a name^M
#$ -N myProgram^M
#^M
./myProgram^M

scc2 advanced> dos2unix myScript.sh

If a text file was created or edited outside of the SCC, make sure it is converted to the proper format!

I submitted my job. How can I monitor it?
# Get the host name
scc2 ~> qstat -u <userID>
job-ID   prior    name      user      state  submit/start at      queue                     slots
----------------------------------------------------------------------------------------------------
5288392  0.11772  myScript  koleinik  r      01/17/2015 08:48:15  linga@scc-ka6.scc.bu.edu  1

# Login to the host
scc2 ~> ssh scc-ka6

top
scc-ka6 ~> top -u <userID>
  PID USER      PR  NI  VIRT   RES   SHR   S  %CPU   %MEM  TIME+    COMMAND
24793 koleinik  20   0  2556m  1.2g  5656  R  533.3   0.9  0:06.87  python

PID     – process ID
PR      – priority of the task
VIRT    – total amount of virtual memory used
RES     – non-swapped physical memory the task has used (RES = CODE + DATA)
SHR     – shared memory used by the task (memory that could potentially be shared with other tasks)
S       – process status: 'D' = uninterruptible sleep, 'R' = running, 'S' = sleeping, 'T' = traced or stopped, 'Z' = zombie
%CPU    – CPU usage
%MEM    – currently used share of the available physical memory
TIME+   – CPU time
COMMAND – command/program used to start the task

This job was submitted requesting only 1 slot, but it is using more than 5 CPUs (%CPU ≈ 533). It will be aborted by the process reaper.

scc-ka6 ~> top -u <userID>
  PID USER      PR  NI  VIRT   RES   SHR   S  %CPU   %MEM  TIME+      COMMAND
46746 koleinik  20   0  975m   911m  2396  R  100.0   0.7  238:08.88  R
46748 koleinik  20   0  853m   789m  2412  R  100.0   0.6  238:07.88  R
46749 koleinik  20   0  1000m  936m  2396  R  100.0   0.7  238:07.84  R
46750 koleinik  20   0  1199m  1.1g  2396  R  100.0   0.9  238:07.36  R
46747 koleinik  20   0  857m   793m  2412  R   99.7   0.6  238:07.20  R
46703 koleinik  20   0  9196   1424  1180  S    0.0   0.0    0:00.01  5300788
46727 koleinik  20   0  410m   301m  3864  S    0.0   0.2    0:05.11  R

This job was submitted requesting only 1 slot, but it is running several R processes, each using a full CPU. It will be aborted by the process reaper.

scc-ka6 ~> top -u <userID>
  PID USER      PR  NI  VIRT   RES  SHR  S  %CPU  %MEM  TIME+    COMMAND
 8012 koleinik  20   0  24.3g  23g  16g  R  99.8  25.8  2:48.89  R

This job was submitted requesting only 1 slot, but it is using 25% of all the memory available on the machine. It might fail due to memory problems (especially if other jobs on this machine also use a lot of memory).
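Putting the monitoring steps together, a small sketch (the host name is taken from the example above; top -b -n 1 prints a single batch-mode snapshot):

scc2 ~> qstat -u $USER            # note the execution host in the "queue" column
scc2 ~> ssh scc-ka6               # log in to that compute node
scc-ka6 ~> top -b -n 1 -u $USER   # one snapshot of your processes on that node
scc-ka6 ~> exit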
qstat
The qstat command has many options!
qstat -u <userID>        # list all of the user's jobs in the queue
qstat -u <userID> -r     # check the resources requested for each job
qstat -u <userID> -g t   # display each task on a separate line

qstat -j <jobID>         # display full information about the job
job_number:          5270164
. . .
sge_o_host:          scc1
. . .
hard resource_list:  h_rt=2592000          # time in seconds
. . .
usage    1:          cpu=9:04:39:31, mem=163439.96226 GBs, io=0.21693, vmem=45.272G, maxvmem=46.359G
. . .

Program optimization
My program runs too slow… Why?
Before you look into parallelizing your code, optimize it. There are a number of well-known techniques in every language, and there are also some specifics to running code on the cluster.

1. Input/Output
› Reduce the number of I/O operations to the home directory and the project space where possible;
› Group smaller I/O statements into larger ones where possible;
› Utilize the local /scratch space (see the sketch at the end of this section);
› Optimize the seek pattern to reduce the amount of time spent waiting for disk seeks;
› If possible, read and write numerical data in a binary format.

2. Other tips
› Many languages allow operations on whole vectors/matrices;
› Pre-allocate arrays before accessing them within loops;
› Reuse variables when possible and delete those that are no longer needed;
› Access array elements according to the storage order of the language (FORTRAN, MATLAB, R – by columns; C, C++ – by rows).

3. Email SCC
The members of our group will be happy to assist you with tips on improving the performance of your code in your specific language.

How many SUs have I used?
› acctool
# My project(s) total usage on all hosts yesterday (short form):
% acctool y

# My project(s) total usage on shared nodes for the past month:
% acctool -host shared -b 1/01/15 y

# My balance for the project scv:
% acctool -p scv -balance -b 1/01/15 y
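Returning to the I/O tips above, here is a minimal sketch of staging data through node-local scratch space inside a batch script. All paths, directory layout, and file names are hypothetical; check the SCC documentation for the actual local-scratch location.

#!/bin/bash
#$ -N scratch_io
#$ -l h_rt=12:00:00

WORKDIR=/scratch/$USER/$JOB_ID                  # hypothetical node-local scratch layout
mkdir -p $WORKDIR
cp /projectnb/myproject/input.dat $WORKDIR/     # stage the input once
cd $WORKDIR
./myProgram input.dat > output.dat              # heavy I/O happens on the local disk
cp output.dat /projectnb/myproject/results/     # copy the results back once
rm -rf $WORKDIR                                 # clean up the scratch space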