Using ITaP clusters for large-scale statistical analysis with R
Doug Crabill, Purdue University

Topics
• Running multiple R jobs on departmental Linux servers serially, and in parallel
• Cluster concepts and terms
• Using the cluster to run the same R program many times
• Using the cluster to run the same R program many times with different parameters
• Running jobs on an entire cluster node

Invoking R in batch mode
• R CMD BATCH vs Rscript
•   Rscript t.R > t.out
•   Rscript t.R > t.out &           # run in background
•   nohup Rscript t.R > t.out &     # run in background, even after you log out
• You can launch several such jobs with "&" simultaneously, but it is best to stick with between 2 and 8 jobs per departmental server.
• Invoking jobs manually doesn't scale well when running dozens of jobs.

Launching several R jobs serially
• Create a file like "run.sh" that contains several Rscript invocations. The first job will run until it completes, then the second job will run, and so on:
  Rscript t.R > tout.001
  Rscript t.R > tout.002
  Rscript t.R > tout.003
  Rscript t.R > tout.004
  Rscript t.R > tout.005
  Rscript t.R > tout.006
  Rscript t.R > tout.007
  Rscript t.R > tout.008
• Do NOT use "&" at the end of these lines or it could crash the server.

Creating the "run.sh" script programmatically
• Extra cool points for creating the "run.sh" script using R:
  > sprintf("Rscript t.R > tout.%03d", 1:8)
  [1] "Rscript t.R > tout.001" "Rscript t.R > tout.002" "Rscript t.R > tout.003"
  [4] "Rscript t.R > tout.004" "Rscript t.R > tout.005" "Rscript t.R > tout.006"
  [7] "Rscript t.R > tout.007" "Rscript t.R > tout.008"
  > write(sprintf("Rscript t.R > tout.%03d", 1:8), "run.sh")

Invoking "run.sh"
• sh run.sh                                      # run every job in run.sh one at a time
• nohup sh run.sh &                              # run every job in run.sh one at a time, and keep running even after logout
• nohup xargs -d '\n' -n1 -P4 sh -c < run.sh &   # run every job in run.sh, keeping 4 jobs running simultaneously, and keep running even after logout

Supercomputers and clusters
• Supercomputer = a collection of computer nodes in a cluster managed by a scheduler
• Node = one computer in a cluster (dozens to hundreds per cluster)
• Core = one CPU core in a node (often 8 to 48 cores per node)
• Front end = one or more computers used for launching jobs on the cluster
• PBS / Torque is the scheduling software. PBS is like the maître d’, seating groups of various sizes for varying times at available tables, with a waiting list, a bar, reservations, and bad customers that spoil the party.

ITaP / RCAC clusters
• Conte – was the fastest supercomputer on any academic campus in the world when it was built in June 2013. Has Intel Phi coprocessors
• Carter – has NVIDIA GPU-accelerated nodes
• Hansen, Rossmann, Coates
• Scholar – uses part of Carter, for instructional use
• Hathi – Hadoop
• Radon – accessible by all researchers on campus
• Anecdotes… <queue patriotic music>

More info on the radon cluster
• https://www.rcac.purdue.edu/ then select Computation -> Radon
  Sub-Cluster: Radon-D
  Number of Nodes: 30
  Processors per Node: Two 2.33 GHz Quad-Core Intel E5410
  Cores per Node: 8
  Memory per Node: 16 GB
  Interconnect: 1 GigE
  Theoretical Peak TeraFLOPS: 58.2
• Read the User's Guide on the left sidebar

Logging into radon
• Get an account on the RCAC website previously mentioned, or ask me
• Make an SSH connection to radon.rcac.purdue.edu
• From Linux (or a Mac terminal), type this to log into one of the cluster front ends (as user dgc):
  ssh -X radon.rcac.purdue.edu -l dgc
• Do not run big jobs on the front ends!
  They are only used for submitting jobs to the cluster and for light testing and debugging.

File storage on radon
• Home directory quota is ~10GB (type "myquota")
• Can be increased to 100GB via the Boiler Backpack settings at http://www.purdue.edu/boilerbackpack
• Scratch storage is around 1TB per user. The scratch directory differs per user and is accessible via the $RCAC_SCRATCH environment variable:
  radon-fe01 ~ $ cd $RCAC_SCRATCH
  radon-fe01 /scratch/radon/d/dgc $
• All nodes can see all files in home and scratch

Software on radon
• The list of applications installed on radon can be found in the User's Guide previously mentioned
• The module command is used to "load" software packages for use by the current login session:
• module avail    # See the list of applications available
• module load r   # Add "R"
• The "module load r" command must be included as part of every R job to be run on the cluster

PBS scheduler commands
• qstat             # See the list of jobs in the queue
• qstat -u dgc      # See the list of jobs in the queue submitted by dgc
• qsub jobname.sh   # Submit jobname.sh to run on the cluster
• qdel JOBIDNUMBER  # Delete a previously submitted job from the queue

Simple qsub submission file
• qsub accepts options on its command line, or as embedded "#PBS" comments that are ignored by the shell but honored by qsub.
  radon-fe01 ~/cluster $ cat myjob.sh
  #!/bin/sh -l
  #PBS -l nodes=1:ppn=1
  #PBS -l walltime=00:10:00
  cd $PBS_O_WORKDIR
  /bin/hostname
  radon-fe01 ~/cluster $ qsub myjob.sh
  683369.radon-adm.rcac.purdue.edu
• The JOBID of this particular job is 683369

Viewing status and the results
• Use qstat or qstat -u dgc to check job status
• Output of job 683369 goes to myjob.sh.o683369
• Errors from job 683369 go to myjob.sh.e683369
• It is inconvenient to collect results from a dynamically named file like myjob.sh.o683369. It is better to write output to filenames of your choosing, either directly from your R program or by redirecting output to a file in your job submission file.

Our first R job submission
• Say we want to run the R program t.R on radon:
  summary(1 + rgeom(10^7, 1/1000))
• Create R1.sh with contents:
  #!/bin/sh -l
  #PBS -l nodes=1:ppn=1
  #PBS -l walltime=00:10:00
  cd $PBS_O_WORKDIR
  module add r
  Rscript t.R > out1
• Submit using qsub R1.sh

Let's do that 100 times
• Using our "R1.sh" file as a template, create files prog001.sh through prog100.sh, changing the output file for each job to out.NNN. In R:
  s <- scan("R1.sh", what='c', sep="\n")
  sapply(1:100, function(i) { s[6] = sprintf("Rscript t.R > out.%03d", i); write(s, sprintf("prog%03d.sh", i)); })
  write(sprintf("qsub prog%03d.sh", 1:100), "runall.sh")
• Submit all 100 jobs by typing sh -x runall.sh
• Generating the files using bash instead (all on one line):
  for i in `seq -w 1 100`; do (head -5 R1.sh; echo "Rscript t.R > out.$i") > prog$i.sh; echo "qsub prog$i.sh"; done > runall.sh

Coupon collector problem
• I want to solve the coupon collector problem with large parameters, but it will take much too long on a single computer (around 2.5 days):
  sum(sapply(1:10000, function(y) {mean(1 + rgeom(10^8, y/10000))}))
• The obvious approach is to break it into 10,000 smaller R jobs and submit them to the cluster.
• Better to break it into 250 jobs, each operating on 40 numbers.
• Create an R script that accepts command line arguments so one job can process many numbers at a time. Estimate walltime carefully!
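One way to take "Estimate walltime carefully!" seriously is to time a scaled-down pilot run and extrapolate. A minimal sketch in R, where the 10^5 pilot replication count and the 3x safety margin are illustrative assumptions rather than part of the original workflow:

  # Pilot: time a single parameter value at a reduced replication count
  pilot <- system.time(mean(1 + rgeom(10^5, 1/10000)))["elapsed"]
  # Scale up to 10^8 replications and 40 parameter values per job,
  # then pad with a 3x safety margin before filling in "#PBS -l walltime"
  est.seconds <- pilot * (10^8 / 10^5) * 40 * 3
  cat("Request at least", ceiling(est.seconds / 60), "minutes of walltime\n")

Round up generously; the scheduler kills a job that exceeds its requested walltime.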
Coupon collector R code
• t2.R reads its command line arguments into "args" and processes each one:
  args <- commandArgs(TRUE)
  sapply(as.integer(args), function(y) {mean(1 + rgeom(10^8, y/10000))})
• Can test via: Rscript t2.R 100 125 200   # Change reps from 10^8 to 10^5 for testing
• Generate 250 scripts with 40 arguments each:
  s <- scan("R2.sh", what='c', sep="\n")
  sapply(1:250, function(y) { s[6] = sprintf("Rscript t2.R %s > out.%03d", paste((y*40-39):(y*40), collapse=" "), y); write(s, sprintf("prog%03d.sh", y)); })
  write(sprintf("qsub prog%03d.sh", 1:250), "runall.sh")

Coupon collector results
• Output is in the files out.001 through out.250:
  radon-fe00 ~/cluster/R2done $ cat out.001
  [1] 9999.8856 5000.4830 3333.0443 2499.8564 1999.7819 1666.2517 1428.6594
  [8] 1249.9841 1110.9790 1000.0408 909.1430 833.3409 769.1818 714.2486
  [15] 666.6413 624.9357 588.3044 555.5487 526.3795 500.0021 476.2695
  [22] 454.5702 434.7949 416.6470 399.9255 384.5739 370.3412 357.1366
  [29] 344.8375 333.2978 322.5507 312.5258 303.0307 294.1573 285.7368
  [36] 277.8168 270.2709 263.1612 256.3872 249.9905
• It's hard to read 250 files with that stupid leading index column. UNIX tricks to the rescue!
  sum(scan(pipe("cat out* | colrm 1 5")))        # works for small indexes only
  sum(scan(pipe("cat out* | sed -e 's/.*]//'")))  # works for all index sizes
• Cha-ching!

Using all cores on a single node
• When your job runs on a single core of a node shared with strangers, some of them may misbehave and use too much RAM or CPU. The solution is to request entire nodes and fill them with just your jobs, so you never share a node with anyone else.
• The job submission file should include:
  #PBS -l nodes=1:ppn=8
• This forces PBS to schedule a node exclusively for you. If the job runs a single R job, you are using just one core! You must use xargs or a similar trick to launch 8 simultaneous R jobs, and submit only 1/8th as many jobs.

All cores example one
  #!/bin/sh -l
  #PBS -l nodes=1:ppn=8
  #PBS -l walltime=00:30:00
  cd $PBS_O_WORKDIR
  module add r
  Rscript t3.R >out1 &
  Rscript t3.R >out2 &
  Rscript t3.R >out3 &
  Rscript t3.R >out4 &
  Rscript t3.R >out5 &
  Rscript t3.R >out6 &
  Rscript t3.R >out7 &
  Rscript t3.R >out8 &
  wait

All cores example two
  #!/bin/sh -l
  #PBS -l nodes=1:ppn=8
  #PBS -l walltime=00:30:00
  cd $PBS_O_WORKDIR
  module add r
  xargs -d '\n' -n1 -P8 sh -c < batch1.sh
• Where batch1.sh contains (could be more than 8 lines! See the sketch after the Thanks slide for generating a longer one):
  Rscript t3.R >out1
  Rscript t3.R >out2
  Rscript t3.R >out3
  Rscript t3.R >out4
  Rscript t3.R >out5
  Rscript t3.R >out6
  Rscript t3.R >out7
  Rscript t3.R >out8

Thanks!
• Thanks to Prof. Mark Daniel Ward for all his help with the examples used in this talk!
• URLs for these notes:
  http://www.stat.purdue.edu/~dgc/cluster.pptx
  http://www.stat.purdue.edu/~dgc/cluster.pdf (copy and paste works poorly with PDF!)
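Postscript to "All cores example two": if batch1.sh needs many more than 8 lines, it can be generated the same way run.sh was generated earlier in these notes. A minimal sketch in R, where the 64 output files are purely an illustrative assumption:

  # Hypothetical example: 64 Rscript invocations; xargs -P8 keeps 8 of them
  # running at once on the exclusively scheduled node until all 64 finish
  write(sprintf("Rscript t3.R > out%02d", 1:64), "batch1.sh")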