Cluster Computing Basics

GENE 760:
Genomic Methods for Genetic Analysis
Course Organizer:
Jim Noonan: [email protected]
Richard Sarro: [email protected]
Rob Amezquita: [email protected]
GENE 760:
• Intro to analyses of genomic datasets
• High throughput sequencing applications
- ChIP-seq
- RNA-seq
- Whole exome and whole-genome sequencing
- Metagenomics
- Big functional genomics datasets
• You do not need to have prior programming experience
• You will learn how to:
- Work with massive datasets in a Linux HPC environment
- Write your own scripts in Perl and R to parse files, run
pipelines, do basic statistical analyses
- Interpret genomics data to gain biological insights
Information we need from you
In a single email to [email protected]:
• Your name
• Your netID
• Your Grad School year
• Are you taking the course for credit or are you an auditor?
• Your level of experience with:
- Working in a UNIX/Linux environment
- High performance computing
- Perl scripting
- R
- Any other programming language
- High-throughput sequencing apps or data
Cluster Computing Basics
R D Bjornson
N J Carriero
Accessing Louise—You
Run a program on your computer (“local”) to log in to louise (“remote”) over a network connection.
The local computer must be on the Yale network:
- A computer at Yale
- Via VPN software
- Via a login to a computer at Yale that allows external access, then log in from there to louise.
The login program run on the local computer must support the secure shell protocol:
- Linux: ssh
- Mac OS X: Use Terminal or X11/xterm to create a command line session (“shell”), then ssh.
- Windows: PuTTY + ssh or Cygwin (and then proceed as if you are using Linux).
ssh [email protected]
On first login, if prompted for a passphrase for an ssh key, just press “enter”. In general, unless you
know what you are doing, leave ssh-related files alone (and do not change the permissions on your
home directory!).
Running GUIs involves understanding and using X11, which is built into Linux, distributed but not
installed by default with Mac OS X, and a third-party add-on for Windows (e.g., Cygwin).
Accessing Louise—Your Data
• Use scp or sftp (part of the ssh program
suite) to copy files from local to remote and
back. (FileZilla offers a GUI.)
• rsync can be useful for keeping a local and
remote file hierarchy in sync.
• wget will allow you to retrieve a file via a URL
from the command line. Useful for fetching
reference files from repository sites.
Don’t Panic!
man: Describes how to use a command.
man man
help: Information about frequently used “shell” commands.
info: New and improved (?) man; may provide more detail.
locate: Find the location of a file (in common system directories).
which: Use to determine which version of a program will
be used by default.
Note: User interface is hunt-and-peck not point-and-click!
Cluster Organization
Login nodes
– Virtualized
– Light use only
Compute nodes
– Multicore, ~4GB DRAM per core. Parallel or concurrent execution is
relatively easy using the cores of one node. More work to use the
cores on multiple nodes. But in either case do not assume this will
happen automatically.
– Shared vs dedicated
File systems
– Cluster wide (default), accessible over network
– Local to node (direct connection)
Cluster Organization (Louise)
– 300+ users
– Storage system
– 90 compute nodes for general use
– Processor cores: 4 to 64 per compute node
– Don’t loiter in the lobby!
Resource Management
Need to explicitly allocate resources for computing
– Interactive. For development; using interactive programs such as
MATLAB®, python or R; and/or graphics-rich tools (X11).
– Batch
qsub registers a request for resources:
qsub -I -X -l nodes=1:ppn=8 -q default
(-I: interactive; -X: X11 forwarding, also use ssh -Y when you log in
from your local computer)
qsub FileWithOptionsAndCommands
qstat provides information about requests:
qstat -1 -n -u njc2
Editor (emacs vs vi/vim)
emacs makes it possible to work directly with files 10s to 100s of MB,
explore binary files, capture shell transcripts and review them,
interactively navigate the file hierarchy, review file differences, etc.
Binary vs ASCII files.
file: Basic command to determine the kind of file.
od -c: Displays content byte by byte, permitting a detailed
examination; useful especially when dealing with DOS/Unix/Mac OS X
end-of-line conflicts or looking for file corruption. Often used in a “shell
pipe” with head. Btw, do not use a “wysiwyg” editor such as Word or
Wordpad for technical work, especially data preparation or code.
ls , cd , mkdir: List directory contents, change directories, make a
new directory. File hierarchy == tree of directories. A “path” is a series of
nested directories written this way /dir0/dir1/dir2/file.
head , less , tail: See a couple of lines in an ASCII file. head and
tail can be used to extract a small sample, e.g. to see the format of data
in the file or to create test input (but this kind of sample is generally not
representative). Often used with “pipes”. Use less to browse files (by
line number or percentage).
split: One way to cope with large files (but virtual splitting can be more
efficient: split will, at least temporarily, double the amount of file space used).
awk: Swiss army knife. Can do head/tail/split and much more:
awk 'NR%1009 == 13{print $0}' fullDataSet > sampleDataSet
python: An excellent interactive general purpose text processing and
analysis environment (increasingly popular, but perl has a large lead).
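The awk one-liner above keeps every record whose line number leaves remainder 13 when divided by 1009. A rough Python equivalent is sketched below; the function name and the in-memory stand-in for fullDataSet are illustrative assumptions, not part of the course material.

```python
import io


def systematic_sample(lines, step=1009, offset=13):
    """Yield records whose 1-based line number n satisfies n % step == offset."""
    for line_number, line in enumerate(lines, start=1):
        if line_number % step == offset:
            yield line


if __name__ == "__main__":
    # Hypothetical stand-in for fullDataSet: 5000 numbered records in memory.
    full_data_set = io.StringIO("".join(f"record{n}\n" for n in range(1, 5001)))
    for record in systematic_sample(full_data_set):
        print(record, end="")
```

Because the generator reads one line at a time, the same function should work on an open file of any size without loading it into memory.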
Tools: bash scripting, redirection and pipes
When you log into a computer you are connected to a program. This
program accepts the text you type and does “something” with it. If,
for example, you type “ls”, the program first determines that “ls” is
not something that it directly understands, so it next looks for
another program on the computer called “ls”. If it finds it, it runs
that program on your behalf and then reports the output. If it does
not find it, it reports an error to that effect.
This class of program is generally referred to as “command shells”. It
should be clear that the shell plays a critical role in the use of a
cluster computer, and yet most users give the shell little or no
thought. This generally comes back to haunt them in the form of
subtle bugs that they are ill equipped to diagnose and correct, as
well as missed opportunities to streamline workflow.
Tools: bash
Consider a sequence of commands given to the
bash shell (the default shell) :
gunzip data.gz
awk '/chr13/{print $0}' data > chr13Records
gzip data
myProgram -i chr13Records -o chr13Filtered
rm chr13Records
sort -k 2,2n < chr13Filtered > chr13Sorted
rm chr13Filtered
Note: stdin, stdout, stderr
Tools: bash
An alternative using bash “pipes”:
gunzip -c data.gz \
  | awk '/chr13/{print $0}' \
  | myProgram -i - -o - \
  | sort -k 2,2n > chr13Sorted
Three advantages:
– Less file system IO (extremely important in a cluster environment)
– Less clean up (an issue when this sort of processing is
done 100s or 1000s of times)
– Better use of multicore machines (gunzip, awk, and
myProgram can run concurrently).
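The same "no intermediate files" idea carries over to scripts. The sketch below chains Python generators so that records flow from the compressed file to the sort without a decompressed copy ever being written. The function names and sample records are hypothetical, and the myProgram stage is left out because its behavior is not described here. Unlike a shell pipe, these stages run in a single process, so this illustrates the IO saving but not the concurrency.

```python
import gzip
import os
import tempfile


def read_records(path):
    """Stream lines from a gzip file without writing a decompressed copy."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield line.rstrip("\n")


def keep_matching(records, pattern="chr13"):
    """Pass along only records containing the pattern, like awk '/chr13/'."""
    return (record for record in records if pattern in record)


def sort_by_second_field(records):
    """Numeric sort on field 2, like sort -k 2,2n; this stage must see all records."""
    return sorted(records, key=lambda record: float(record.split()[1]))


if __name__ == "__main__":
    # Hypothetical three-record input written to a temporary directory.
    with tempfile.TemporaryDirectory() as work_dir:
        path = os.path.join(work_dir, "data.gz")
        with gzip.open(path, "wt") as handle:
            handle.write("chr13 300 a\nchr2 5 b\nchr13 20 c\n")
        for record in sort_by_second_field(keep_matching(read_records(path))):
            print(record)
```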
Tools: bash
Now suppose we have 100 data sets: dataSet00.gz ...
A few notes about file naming:
• When working with a large number of files, it is easy to lose
track of files or accidentally overwrite some, so choose a
clear and informative scheme and stick to it. If >> 1000, use
additional levels of directories.
• 0- vs 1- based indexing is a subtle point that you need to
get comfortable with (you don’t have to use it yourself, but
you will run into it sooner or later).
• Padding with leading 0’s compensates for dumb file sorting.
How can we easily process all of these sets?
Tools: bash
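for f in dataSet*.gz
do
  # Stream each data set through the pipeline; one sorted output file per input
  gunzip -c "$f" | awk '/chr13/{print $0}' \
    | myProgram -i - -o - | sort -k 2,2n > "chr13Sorted_${f/.gz/}"
done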
Note: You can use an editor to create a file that
contains a complex command or a command
sequence and then have bash execute that
file: source CommandFile
That may take a while, how can we use multiple processors to do it faster? SimpleQueue!
Produce a list of tasks to be executed (essentially the same loop as before modified to
display the commands to be executed rather than actually execute them):
for f in dataSet*.gz
do
  # \$0 is escaped so that awk, not this shell, expands it when the task runs
  echo "cd $(pwd) && ( gunzip -c $f | awk '/chr13/{print \$0}' | \
myProgram -i - -o - | sort -k 2,2n > chr13Sorted_${f/.gz/} ) \
>${f}.out 2>${f}.err"
done > Tasks
Create a batch script that directs the resource manager to allocate compute nodes and
then uses the allocated nodes to work through the list of tasks: default 4.6 njc2 dataExtraction Tasks > sqBatchScript
Submit (can “|” to qsub too): qsub sqBatchScript
Check output files and status information (SimpleQueue collects a great deal).
cd ... && blast ds 00
cd ... && blast ds 01
cd ... && blast ds 02
cd ... && blast ds 03
cd ... && blast ds 04
cd ... && blast ds 05
cd ... && blast ds 06
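A task list like the one above can also be generated from a script, which makes quoting easier to control. This is a minimal sketch, not the SimpleQueue tooling itself: build_task and write_tasks are hypothetical helper names, /path/to/run is a placeholder directory, and the pipeline text is copied from the bash loop.

```python
import shlex


def build_task(data_file, work_dir):
    """Return one shell command line that processes a single data set."""
    stem = data_file[:-len(".gz")] if data_file.endswith(".gz") else data_file
    pipeline = (
        f"gunzip -c {shlex.quote(data_file)} | awk '/chr13/{{print $0}}' | "
        f"myProgram -i - -o - | sort -k 2,2n > {shlex.quote('chr13Sorted_' + stem)}"
    )
    out_file = shlex.quote(data_file + ".out")
    err_file = shlex.quote(data_file + ".err")
    return f"cd {shlex.quote(work_dir)} && ( {pipeline} ) >{out_file} 2>{err_file}"


def write_tasks(data_files, work_dir, task_path):
    """Write one task per line, in sorted file order."""
    with open(task_path, "w") as handle:
        for data_file in sorted(data_files):
            handle.write(build_task(data_file, work_dir) + "\n")


if __name__ == "__main__":
    # Hypothetical file names following the dataSetNN.gz scheme.
    for name in [f"dataSet{n:02d}.gz" for n in range(3)]:
        print(build_task(name, "/path/to/run"))
```

Each task is a single line that changes to the working directory first, so it can run on any node regardless of where the scheduler starts it.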
Aside: Random Number Generation
If you run a code that depends on random
numbers, you must take care to ensure it does
what you expect when you run it several times,
perhaps concurrently on different nodes.
On the one hand, in general you will want each
instance to see different random numbers. This
may not happen by default.
On the other, you would like to be able to
reproduce your results. Different but not too
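One way to reconcile the two goals is to derive each instance's seed from a recorded base seed plus a task index. The sketch below assumes Python's standard random module; the helper name task_rng and the base seed value are hypothetical. Streams seeded this way are distinct and repeatable, but that alone does not guarantee statistical independence, so treat it as a starting point.

```python
import random


def task_rng(base_seed, task_index):
    """Return a generator whose stream depends on both the run and the task."""
    return random.Random(f"{base_seed}:{task_index}")


if __name__ == "__main__":
    base_seed = 760  # hypothetical value; record it alongside your results
    for task_index in range(3):
        rng = task_rng(base_seed, task_index)
        print(task_index, [rng.randint(0, 99) for _ in range(4)])
```

Rerunning with the same base seed reproduces every task's numbers; changing the base seed gives a fresh set for all tasks.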
Parallelism: Pre-packaged
Thread based: Fairly common ("easy"-ish).
Thread-based parallelism can only make use
of the cores on one node.
Message passing based (MPI, PVM, …): Less
common in bioinformatics. A message passing
program can make use of the aggregate
resources of many nodes.
“make” based: Illumina and one or two others.
Limited to the cores of one node.
Parallelism: Pre-packaged
If you are using a 3rd party program, it is important to know which kind
of parallelism is used and to invoke the program appropriately.
If threaded:
Run on a dedicated node!
Check docs for a number of threads parameter.
If MP, typically need to set up a special execution environment in order
to run the program using the resources allocated. Unfortunately,
this tends to vary with MP implementations and so has to be
addressed on a case by case basis (ask RDB or NJC).
If “make”, invoke like this:
make -j N MakeTarget > make.out 2> make.err
where N is the number of cores to use.
Do It Yourself: Owner computes
It is possible to write your own parallel programs.
One strategy that RDB and NJC often use:
• Imagine that you run multiple copies of a sequential version.
• At some point, the copies will enter a period of execution in which the
work can be split up into independent tasks. Add a check to decide which
copy “owns” (and should execute) a given task—all other copies will skip
this task.
• Each copy records the tasks it did. When it exits the period of execution
that was split up, it exchanges with all other copies the results of the tasks
it did. At this point all the copies know all the results and will continue to
execute as if they had each done all of the work themselves.
The devil is in the details—especially the mechanisms used to settle
ownership and to exchange task results. Ask us for help; just keep in mind
that this kind of parallelism is an option and need not be terribly complex.
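The strategy above can be sketched in a few lines. Everything here is an illustrative assumption: ownership is settled by a simple modulo rule, and the copies are simulated one after another in a single process, with exchange standing in for whatever mechanism (message passing, shared files) the real copies would use.

```python
def owns(task_index, rank, n_copies):
    """Ownership rule (an assumption): task i belongs to copy i mod n_copies."""
    return task_index % n_copies == rank


def run_copy(rank, n_copies, tasks, work):
    """One copy's share of the split-up period: execute only the tasks it owns."""
    return {i: work(task) for i, task in enumerate(tasks) if owns(i, rank, n_copies)}


def exchange(partial_results):
    """Stand-in for the exchange step: merge every copy's results into one table."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


if __name__ == "__main__":
    tasks = list(range(10))
    n_copies = 3
    # The copies are simulated sequentially here; on a cluster each would be a process.
    partials = [run_copy(rank, n_copies, tasks, lambda x: x * x) for rank in range(n_copies)]
    all_results = exchange(partials)
    print([all_results[i] for i in range(len(tasks))])
```

After the exchange, every copy holds the full result table and can continue as if it had done all the work itself.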
Software as an Experimental System
Start with “small” input sets and/or run parameters and systematically alter
these to study how CPU time, memory use and IO activity vary from run to run.
Non-invasive tools:
top: May need a separate login to the allocated node (use intra-cluster ssh).
time command:
/usr/bin/time -v prog a0 a1 a2 > outFile 2> errFile
Output from time will be appended to “errFile”. Note: use the full path;
this is an instance where it is important to understand how the shell
interprets commands (bash has its own time keyword, which does not accept -v).
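A similar non-invasive measurement can be scripted, which helps when the same program is run repeatedly with varying inputs. The sketch below assumes a Unix system (the resource module is not available on Windows). The helper name run_and_measure is hypothetical, and the child command is just a tiny Python one-liner used as a stand-in for a real program.

```python
import resource
import subprocess
import sys


def run_and_measure(command):
    """Run a child program and report its CPU time and the children's peak memory."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(command, check=False)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "user_seconds": after.ru_utime - before.ru_utime,
        "system_seconds": after.ru_stime - before.ru_stime,
        # Peak over all children waited for so far: kilobytes on Linux, bytes on macOS.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the program and arguments under study.
    for n in (10**5, 10**6):
        print(n, run_and_measure([sys.executable, "-c", f"sum(range({n}))"]))
```

Collecting these numbers for a series of input sizes gives the raw data for the scaling study described above.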
Software as an Experimental System
If you are in a position to modify code, you can get
much more accurate and detailed information.
Ditto with profiling:
Compile time option plus post processing for C,
C++, Fortran, …
Available as a runtime facility in various scripting
systems (python, perl, ruby).
Activating profiling often significantly increases
run time, placing a premium on the importance
of well designed small test cases.
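As an illustration of a runtime profiling facility, the sketch below uses cProfile and pstats from the Python standard library. The workload function count_close_pairs and the helper profile_call are hypothetical names chosen for this example, and the input is kept deliberately small because profiling slows execution.

```python
import cProfile
import io
import pstats


def count_close_pairs(values, tolerance):
    """Hypothetical workload: all-to-all comparison of a list of numbers."""
    count = 0
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if abs(a - b) <= tolerance:
                count += 1
    return count


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(5)
    return result, report.getvalue()


if __name__ == "__main__":
    result, report = profile_call(count_close_pairs, list(range(200)), 1)
    print(result)
    print(report)
```

The report lists call counts and cumulative time per function, which is usually enough to see where a small test case spends its time.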
Scaling Considerations
Consider the time (in arbitrary “operation” units) to
process N records, if doing:
A record by record transform => Time(N)
An all to all comparison => Time(N^2)
An exploration of subsets => Time(2^N)
An exploration of orderings => Time(N!)
One naturally tends to focus on run time, but
memory and IO (amount as well as rate) matter too.
Scaling Considerations
What N corresponds to about 1 CPU second?
Time(N) => 1,000,000,000
Time(N^2) => 30,000
Time(2^N) => 30
Time(N!) => 13
What model applies clearly matters!
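The figures above can be checked with a short calculation. The sketch assumes a rate of about 10^9 operations per CPU second, which is what the Time(N) row implies, and finds the smallest N at which each model reaches one second's worth of operations. The function and dictionary names are illustrative.

```python
import math

OPS_PER_SECOND = 10**9  # assumed rate, inferred from the Time(N) figure

COST_MODELS = {
    "N": lambda n: n,
    "N^2": lambda n: n ** 2,
    "2^N": lambda n: 2 ** n,
    "N!": math.factorial,
}


def smallest_n_reaching(cost, budget=OPS_PER_SECOND):
    """Smallest N with cost(N) >= budget, found by doubling then bisection."""
    hi = 1
    while cost(hi) < budget:
        hi *= 2
    lo = hi // 2
    # Invariant: cost(lo) < budget <= cost(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) < budget:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for name, cost in COST_MODELS.items():
        print(f"Time({name}): N = {smallest_n_reaching(cost):,}")
```

Under that assumption the thresholds come out as 1,000,000,000, 31,623, 30 and 13; the 30,000 above is the second value rounded.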
Scaling Considerations
It matters when determining how big a problem
is feasible. Suppose we double the input size:
Time(2*1,000,000,000) => ~2 s
Time((2*30,000)^2) => ~4 s
Time(2^(2*30)) => ~1,000,000,000 s
Time((2*13)!) => ~10^17 s (roughly ten billion years)
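The doubling arithmetic can be reproduced directly. As before, the rate of 10^9 operations per second is an assumption inferred from the earlier figures, and the names below are illustrative.

```python
import math

OPS_PER_SECOND = 10**9  # assumed rate, as in the one-second figures
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Operation counts after doubling each one-second problem size.
DOUBLED_OPERATIONS = {
    "N": 2 * 10**9,
    "N^2": (2 * 30_000) ** 2,
    "2^N": 2 ** (2 * 30),
    "N!": math.factorial(2 * 13),
}


def seconds_for(operations):
    """Convert an operation count into seconds at the assumed rate."""
    return operations / OPS_PER_SECOND


if __name__ == "__main__":
    for name, operations in DOUBLED_OPERATIONS.items():
        seconds = seconds_for(operations)
        print(f"Time({name}): {seconds:.3g} s = {seconds / SECONDS_PER_YEAR:.3g} years")
```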
Scaling Considerations
It matters when verifying code behavior. If you
have a code that you believe follows a Time(N)
model, but empirically behaves like Time(N^2),
then you may have a bug.
For example, code that maintains a list of values
can easily degenerate to Time(N^2) if one is
careless with the operations that maintain the list.
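A small experiment makes the point concrete. Both hypothetical functions below remove duplicates while preserving order. The first checks membership by scanning a list, the second uses a set. Counting basic steps instead of timing keeps the comparison deterministic: doubling the input roughly quadruples the work for the list version but only doubles it for the set version.

```python
def unique_with_list(values):
    """Keep first occurrences; each lookup scans the list, so work grows like N^2."""
    seen, comparisons = [], 0
    for value in values:
        found = False
        for kept in seen:
            comparisons += 1
            if kept == value:
                found = True
                break
        if not found:
            seen.append(value)
    return seen, comparisons


def unique_with_set(values):
    """Same result; set lookups take roughly constant time, so work grows like N."""
    seen, ordered, lookups = set(), [], 0
    for value in values:
        lookups += 1
        if value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered, lookups


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        _, list_steps = unique_with_list(range(n))
        _, set_steps = unique_with_set(range(n))
        print(n, list_steps, set_steps)
```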
Other Performance Considerations
Memory hierarchy:
Do as much as you can with one record before moving on to the next.
Physical vs Virtual Memory:
When chunking work, size to fit in physical memory.
Local vs remote IO:
If you cannot eliminate temporary IO via bash pipes or named pipes, at
least write to a local file system (but clean up!).
Bulk IO vs character IO:
Mostly done for you, but avoid IO operations that read or write one byte
or character at a time.
Data IO vs metadata operations:
Metadata operations are much more expensive than normal data IO.
Avoid them. E.g., don’t use a series of specially named empty files to
indicate progress; write progress updates to a log file instead.
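The log-file approach to progress tracking might look like the sketch below. The helper names, the tab-separated record layout and the file name progress.log are assumptions made for illustration. When many jobs run concurrently, giving each job its own log is one way to avoid interleaved lines.

```python
import os
import tempfile
import time


def log_progress(log_handle, task_name, status):
    """Append one timestamped, tab-separated progress record and flush it."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    log_handle.write(f"{stamp}\t{task_name}\t{status}\n")
    log_handle.flush()


def read_progress(log_path):
    """Return (task_name, status) pairs in the order they were logged."""
    with open(log_path) as handle:
        return [tuple(line.rstrip("\n").split("\t")[1:]) for line in handle]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        log_path = os.path.join(work_dir, "progress.log")
        with open(log_path, "a") as log_handle:
            for n in range(3):
                log_progress(log_handle, f"dataSet{n:02d}", "done")
        print(read_progress(log_path))
```

However many tasks complete, only one file is created, so the file system sees ordinary appends rather than a stream of file creations.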