Using ITaP clusters for large scale statistical analysis with R
Doug Crabill
Purdue University
Topics
• Running multiple R jobs on departmental Linux servers, serially and in parallel
• Cluster concepts and terms
• Use the cluster to run the same R program many times
• Use the cluster to run the same R program many times with different parameters
• Running jobs on an entire cluster node
Invoking R in batch mode
• R CMD BATCH vs Rscript
• Rscript t.R > t.out
• Rscript t.R > t.out &              # run in background
• nohup Rscript t.R > t.out &        # run in background, even if you log out
• Can launch several such jobs with &'s simultaneously, but best to stick with between 2 and 8 jobs per departmental server.
• Invoking manually doesn't scale well for running dozens of jobs
Launching several R jobs serially
• Create a file like "run.sh" that contains several Rscript invocations. The first job will run until it completes, then the second job will run, and so on:
Rscript t.R > tout.001
Rscript t.R > tout.002
Rscript t.R > tout.003
Rscript t.R > tout.004
Rscript t.R > tout.005
Rscript t.R > tout.006
Rscript t.R > tout.007
Rscript t.R > tout.008
• Do NOT use "&" at the end of these lines or it could crash the server
Creating the "run.sh" script programmatically
• Extra cool points for creating the "run.sh" script using R:
> sprintf("Rscript t.R > tout.%03d", 1:8)
[1] "Rscript t.R > tout.001" "Rscript t.R > tout.002" "Rscript t.R > tout.003"
[4] "Rscript t.R > tout.004" "Rscript t.R > tout.005" "Rscript t.R > tout.006"
[7] "Rscript t.R > tout.007" "Rscript t.R > tout.008"
> write(sprintf("Rscript t.R > tout.%03d", 1:8), "run.sh")
Invoking “run.sh”
• sh run.sh                 # run every job in run.sh one at a time
• nohup sh run.sh &         # run every job in run.sh one at a time, and keep running even if you log out
• nohup xargs -d '\n' -n1 -P4 sh -c < run.sh &   # run every job in run.sh, keeping 4 jobs running simultaneously, and keep running even if you log out
Supercomputers and clusters
• Supercomputer = a collection of computer nodes in a cluster managed by a scheduler
• Node = one computer in a cluster (dozens to hundreds per cluster)
• Core = one CPU core in a node (often 8 to 48 cores per node)
• Front end = one or more computers used for launching jobs on the cluster
• PBS / Torque is the scheduling software. PBS is like the maître d', seating groups of various sizes for varying times at available tables, with a waiting list, a bar, reservations, and bad customers that spoil the party
ITaP / RCAC clusters
• Conte – was the fastest supercomputer on any academic campus in the world when it was built in June 2013. Has Intel Xeon Phi coprocessors
• Carter – has NVIDIA GPU-accelerated nodes
• Hansen, Rossmann, Coates
• Scholar – uses part of Carter, for instructional use
• Hathi – Hadoop
• Radon – accessible by all researchers on campus
• Anecdotes… <cue patriotic music>
More info on radon cluster
• https://www.rcac.purdue.edu/ then select Computation -> Radon
• Radon-D sub-cluster specifications:
  Number of Nodes: 30
  Processors per Node: Two 2.33 GHz Quad-Core Intel E5410
  Cores per Node: 8
  Memory per Node: 16 GB
  Interconnect: 1 GigE
  Theoretical Peak TeraFLOPS: 58.2
• Read the User's Guide on the left sidebar
Logging into radon
• Get an account via the RCAC website previously mentioned, or ask me
• Make an SSH connection to radon.rcac.purdue.edu
• From Linux (or a Mac terminal), type this to log into one of the cluster front ends (as user dgc):
• ssh -X radon.rcac.purdue.edu -l dgc
• Do not run big jobs on the front ends! They are only for submitting jobs to the cluster and for light testing and debugging
File storage on radon
• Home directory quota is ~10GB (type "myquota")
• Can be increased to 100GB via Boiler Backpack settings at http://www.purdue.edu/boilerbackpack
• Scratch storage is around 1TB per user. The scratch directory differs per user and is accessible via the $RCAC_SCRATCH environment variable:
radon-fe01 ~ $ cd $RCAC_SCRATCH
radon-fe01 /scratch/radon/d/dgc $
• All nodes can see all files in home and scratch
Software on radon
• The list of applications installed on radon can be found in the User's Guide previously mentioned
• The module command is used to "load" software packages for use by the current login session
• module avail       # See the list of available applications
• module load r      # Add "R"
• The "module load r" command must be included in every R job run on the cluster
PBS scheduler commands
• qstat              # See list of jobs in the queue
• qstat -u dgc       # See list of jobs in the queue submitted by dgc
• qsub jobname.sh    # Submit jobname.sh to run on the cluster
• qdel JOBIDNUMBER   # Delete a previously submitted job from the queue
Simple qsub submission file
• qsub accepts options on the command line, or as #PBS comments embedded in the script; these are ignored by the shell but honored by qsub.
radon-fe01 ~/cluster $ cat myjob.sh
#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
/bin/hostname
radon-fe01 ~/cluster $ qsub myjob.sh
683369.radon-adm.rcac.purdue.edu
• The JOBID of this particular job is 683369
Viewing status and the results
• Use qstat or qstat -u dgc to check job status
• Output of job 683369 goes to myjob.sh.o683369
• Errors from job 683369 go to myjob.sh.e683369
• It is inconvenient to collect results from a dynamically named file like myjob.sh.o683369. Best to write output to a filename of your choosing, either by writing directly to that file in your R program or by redirecting the output to a file in your job submission file (as sketched below)
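• A minimal sketch of the first option, writing results to a file whose name we pick from inside R (the name "result.001" is just an illustration, not from the talk):
res <- summary(1 + rgeom(10^7, 1/1000))      # same computation as t.R on the next slide
capture.output(res, file = "result.001")     # results land in result.001, not in myjob.sh.oJOBID
• The second option is exactly what the R1.sh example on the next slide does with "Rscript t.R > out1".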
Our first R job submission
• Say we want to run the R program t.R on radon:
summary(1 + rgeom(10^7, 1/1000 ))
• Create R1.sh with contents:
#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
module add r
Rscript t.R > out1
• Submit using qsub R1.sh
Let’s do that 100 times
• Using our "R1.sh" file as a template, create files prog001.sh through prog100.sh, changing the output file for each job to out.NNN. In R:
s<-scan("R1.sh", what='c', sep="\n")
sapply(1:100, function(i) { s[6]=sprintf("Rscript t.R > out.%03d", i);
write(s, sprintf("prog%03d.sh", i)); })
write(sprintf("qsub prog%03d.sh", 1:100), "runall.sh")
• Submit all 100 jobs by typing sh -x runall.sh; a quick check that all 100 output files arrived is sketched at the end of this slide
• Generating the files using bash instead (all on one line; head -5 copies everything in R1.sh except its Rscript line):
for i in `seq -w 1 100`; do (head -5 R1.sh; echo "Rscript t.R > out.$i") > prog$i.sh; echo "qsub prog$i.sh"; done > runall.sh
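• Once the queue drains, a one-line sketch (not from the talk) to confirm that every job wrote its output file:
sum(file.exists(sprintf("out.%03d", 1:100))) == 100   # TRUE when all 100 out.NNN files exist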
Coupon collector problem
• I want to solve the coupon collector problem with large parameters, but it will take much too long on a single computer (around 2.5 days):
sum(sapply(1:10000, function(y) {mean(1 + rgeom(10^8, y/10000))}))
• The obvious approach is to break it into 10,000 smaller R jobs and submit them to the cluster.
• Better to break it into 250 jobs, each operating on 40 numbers.
• Create an R script that accepts command line arguments so it can process many numbers at a time. Estimate the walltime carefully (a rough timing sketch follows)!
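• One rough way to estimate walltime (a sketch, not from the talk): time a single term at a reduced size and scale up. The test value y = 5000 and the scale factors are assumptions for illustration:
one <- system.time(mean(1 + rgeom(10^5, 5000/10000)))["elapsed"]   # 10^5 reps instead of 10^8
one * 1000 * 40 / 60   # x1000 for the full replication count, x40 numbers per job, in minutes
• Pad the estimate generously; PBS kills jobs that run past their requested walltime.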
Coupon collector R code
• t2.R reads its command line arguments into "args" and processes each one:
args <- commandArgs(TRUE)
sapply(as.integer(args), function(y) {mean(1 + rgeom(10^8, y/10000))})
• Can test via:
Rscript t2.R 100 125 200 # Change reps from 10^8 to 10^5 for test
• Generate 250 scripts with 40 arguments each
s <- scan("R2.sh", what='c', sep="\n")
sapply(1:250, function(y) {
  s[6] <- sprintf("Rscript t2.R %s > out.%03d",
                  paste((y*40-39):(y*40), collapse=" "), y)
  write(s, sprintf("prog%03d.sh", y))
})
write(sprintf("qsub prog%03d.sh", 1:250), "runall.sh")
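• A quick sanity check (not from the talk) that the 250 blocks of 40 arguments tile 1 through 10000 exactly once:
idx <- unlist(lapply(1:250, function(y) (y*40 - 39):(y*40)))   # same index arithmetic as above
all(sort(idx) == 1:10000)   # should print TRUE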
Coupon collector results
• Output is in the files out.001 through out.250:
radon-fe00 ~/cluster/R2done $ cat out.001
[1] 9999.8856 5000.4830 3333.0443 2499.8564 1999.7819 1666.2517 1428.6594
[8] 1249.9841 1110.9790 1000.0408 909.1430 833.3409 769.1818 714.2486
[15] 666.6413 624.9357 588.3044 555.5487 526.3795 500.0021 476.2695
[22] 454.5702 434.7949 416.6470 399.9255 384.5739 370.3412 357.1366
[29] 344.8375 333.2978 322.5507 312.5258 303.0307 294.1573 285.7368
[36] 277.8168 270.2709 263.1612 256.3872 249.9905
• It's hard to read 250 files with that stupid leading column. UNIX tricks to the rescue (a pure-R alternative is sketched below)!
sum(scan(pipe("cat out* | colrm 1 5")))            # works for small indexes only
sum(scan(pipe("cat out* | sed -e 's/.*]//'")))     # works for all index sizes
• Cha-ching!
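• A pure-R alternative (a sketch, assuming the out.001 through out.250 files sit in the current directory): strip the leading "[k]" index that auto-printing added, then sum every value.
files <- list.files(pattern = "^out\\.[0-9]+$")
vals <- unlist(lapply(files, function(f) {
  txt <- sub(".*\\]", "", readLines(f))     # drop the "[k]" prefix on each line
  txt <- txt[nzchar(trimws(txt))]           # skip any blank lines
  as.numeric(unlist(strsplit(trimws(txt), "\\s+")))
}))
sum(vals)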
Using all cores on a single node
• When running your job on a single core of a node shared with strangers, some of them may misbehave and use too much RAM or CPU. The solution is to request entire nodes and fill them with only your jobs, so you never share a node with anyone else.
• Job submission file should include:
#PBS -l nodes=1:ppn=8
• This forces PBS to schedule a node exclusively for you. If you run a single R job on it, you are using just one core! Use xargs or a similar trick to launch 8 simultaneous R jobs, and submit only 1/8th as many jobs.
All cores example one
#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
module add r
Rscript t3.R >out1 &
Rscript t3.R >out2 &
Rscript t3.R >out3 &
Rscript t3.R >out4 &
Rscript t3.R >out5 &
Rscript t3.R >out6 &
Rscript t3.R >out7 &
Rscript t3.R >out8 &
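# wait for all eight background Rscript jobs to finish before the PBS job exits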
wait
All cores example two
#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
module add r
xargs -d '\n' -n1 -P8 sh -c < batch1.sh
• Where batch1.sh contains (could be > 8 lines!):
Rscript t3.R >out1
Rscript t3.R >out2
Rscript t3.R >out3
Rscript t3.R >out4
Rscript t3.R >out5
Rscript t3.R >out6
Rscript t3.R >out7
Rscript t3.R >out8
Thanks!
• Thanks to Prof. Mark Daniel Ward for all his help with the examples used in this talk!
• URL for these notes:
• http://www.stat.purdue.edu/~dgc/cluster.pptx
• http://www.stat.purdue.edu/~dgc/cluster.pdf
(copy and paste works poorly with PDF!)