Cluster Computing Basics

R D Bjornson
N J Carriero
CS Dept, Keck HPC Resource, YCGA
Don’t Panic!
man: Describes how to use a command.
man man
help: Information about frequently used “shell” built-in commands.
info: New and improved (?) man; may provide more detail.
locate: Find the location of a file (in common system directories).
which: Use to determine which version of a program will
be used by default.
Note: User interface is hunt-and-peck not point-and-click!
Accessing Louise
Run a program on your computer (“local”) to log in to louise (“remote”) over a
network connection.
The local computer must be on the Yale network:
– A computer at Yale
– Via VPN software
– Via a login to a computer at Yale that allows external access, then login from there to louise.
The login program must support the secure shell protocol.
– Linux: ssh
– Mac OS X: Use terminal or X11/xterm to create a command line session (“shell”), then use ssh.
– Windows: Putty + ssh or cygwin (and then pretend as if you are using Linux).
ssh [email protected]
On first log in, if prompted for a passphrase for an ssh key, just press “enter”. In
general, unless you know what you are doing, leave ssh-related files alone (and do
not change the permission on your home directory!).
Running GUIs involves understanding and using X11, which is built into Linux,
distributed but not installed by default with Mac OS X, and a third-party add-on for
Windows (e.g., cygwin).
Accessing Louise
• Use scp or sftp (part of the ssh program
suite) to copy files from local to remote and back.
• rsync can be useful for keeping a local and
remote file hierarchy in sync.
• wget will allow you to retrieve a file via a URL
from the command line. Useful for fetching
reference files from repository sites.
Cluster Organization
Login nodes
– Virtualized
– Light use only
Compute nodes
– Multicore, ~4GB DRAM per core. Parallel or concurrent execution is
relatively easy using the cores of one node. More work to use the
cores on multiple nodes. But in either case do not assume this will
happen automatically.
– Shared vs dedicated
File systems
– Cluster wide (default), accessible over network
– Local to node (direct connection)
Cluster Organization (Louise)
– 300+ users.
– Storage system.
– 90 compute nodes for general use.
– Processor cores: 4 to 64 per compute node.
– Don’t loiter in the lobby!
Resource Management
Need to explicitly allocate resources for computing
– Interactive. For development; using interactive
programs such as MATLAB®, python or R; and/or
graphic rich tools (X11 forwarding)
– Batch
– qsub registers a request for resources (for X11
forwarding also use ssh -Y for initial login)
qsub -X -I -l nodes=1:ppn=8 -q default
qsub FileWithOptionsAndCommands
– qstat provides information about requests
qstat -1 -n -u njc2
Editor (emacs vs vi and vim)
emacs makes it possible to work directly with files 10s to 100s of MB,
explore binary files, capture shell transcripts and review them,
interactively navigate the file hierarchy, review file differences, etc.
Binary vs ASCII files.
file: Basic command to determine the kind of file.
od -c: Displays content byte by byte, permitting a detailed
examination—useful especially when dealing with DOS/Unix/Mac OS X
end of line conflicts or looking for file corruption. Often used in a “pipe”
with head. Btw, do not use a “wysiwyg” editor such as Word or Wordpad
for technical work, especially data preparation or code development.
ls, cd, mkdir: List directory contents,
change directories, make a new directory. File
hierarchy = tree of directories.
– A “path” is a series of nested directories written this
way /dir0/dir1/dir2/file.
– When you login, you start work in your “home”
directory (aka ~).
– When bash looks up a command for you, it searches in
all paths listed in the “PATH environment variable”.
export PATH=/my/new/program/Directory:$PATH
– Look in “/usr/local/cluster/software/installation” for
programs of interest.
head, less, tail: See a couple of lines in an ASCII file. head
and tail can be used to extract a small sample, e.g. to see the
format of data in the file or to create test input (but this kind of
sample is generally not representative). Often used with pipes. Use
less to browse files (by line number or percentage).
split: One way to cope with large files (but virtual splitting can be
more efficient: split will, at least temporarily, double the amount
of file space used).
awk: Swiss army knife. Can do head/tail/split and much more:
awk 'NR%1000 == 13{print $0}' fullDataSet > sampleDataSet
python: An excellent general purpose text processing and analysis
environment (increasingly popular, but perl has a large lead).
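For comparison, the awk sampling one-liner above can be sketched in python. This is only an illustration of the same idea; the function names sample_records and write_sample are hypothetical, and the file names are the ones used in the awk example.

```python
def sample_records(lines, step=1000, offset=13):
    """Yield each line whose 1-based line number N satisfies N % step == offset.

    With the defaults this mirrors: awk 'NR%1000 == 13{print $0}'
    """
    for line_number, line in enumerate(lines, start=1):
        if line_number % step == offset:
            yield line


def write_sample(source_path, sample_path, step=1000, offset=13):
    """Stream source_path once and write the sampled lines to sample_path."""
    with open(source_path) as source, open(sample_path, "w") as sample:
        sample.writelines(sample_records(source, step, offset))
```

Calling write_sample("fullDataSet", "sampleDataSet") would stream the file line by line, so memory use stays small even for very large inputs.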
Tools: bash scripting, redirection and pipes
When you log into a computer you are connected to a program. This
program accepts the text you type and does “something” with it. If,
for example, you type “ls”, the program first determines that “ls” is
not something that it directly understands, so it next looks for
another program on the computer called “ls” in one of the
directories in PATH. If it finds it, it runs that program on your behalf
and then reports the output. If it does not find it, it reports an error
to that effect.
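The lookup described above can be sketched in python. This is a simplified model of what a shell does, not bash's actual implementation (it ignores built-ins, aliases and hashing); the function name find_in_path is hypothetical.

```python
import os


def find_in_path(command, search_path=None):
    """Return the first executable named `command` found in the PATH
    directories, searched left to right, or None if there is none."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or ".", command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_in_path("ls"))
```

Because the search stops at the first match, the order of directories in PATH decides which version of a program runs; this is the question that which answers.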
Programs of this class are generally referred to as “command shells”. It
should be clear that the shell plays a critical role in the use of a
cluster computer, and yet most users give the shell little or no
thought. This generally comes back to haunt them in the form of
subtle bugs that they are ill equipped to diagnose and correct, as
well as missed opportunities to streamline workflow.
Tools: bash
Consider a sequence of commands given to the
bash shell (the default shell) :
gunzip data.gz
awk '/chr13/{print $0}' data > chr13Records
gzip data
myProgram -i chr13Records -o chr13Filtered
rm chr13Records
sort -k 2,2n < chr13Filtered > chr13Sorted
rm chr13Filtered
Note: stdin, stdout, stderr
Tools: bash
An alternative using bash pipes (“|”):
gunzip -c data.gz | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted
Three advantages:
– Less file system IO (extremely important in a
cluster setting)
– Less clean up (an issue when this sort of
processing is done 100s or 1000s of times)
– Better use of multicore machines (gunzip, awk,
and myProgram can run concurrently).
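The same streaming idea can be sketched in python: read the compressed file directly, filter, and sort, without writing any intermediate files. This is an assumption-laden illustration, not the pipeline itself: the myProgram step is left out because its behavior is not described here, and the function name chr13_sorted is hypothetical.

```python
import gzip


def chr13_sorted(gz_path):
    """Return the lines of a gzip file that contain 'chr13', sorted
    numerically on the second whitespace-separated field.

    Roughly: gunzip -c data.gz | awk '/chr13/{print $0}' | sort -k 2,2n
    """
    with gzip.open(gz_path, "rt") as records:
        # Decompress and filter on the fly; no uncompressed copy is written.
        matches = [line for line in records if "chr13" in line]
    return sorted(matches, key=lambda line: float(line.split()[1]))
```

Unlike the shell pipeline, this sketch runs in a single process, so it gains the reduced file system IO and clean up but not the concurrency of separate pipeline stages.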
Tools: bash
Now suppose we have 100 data sets: dataSet00.gz ... dataSet99.gz
A few notes about file naming:
• When working with a large number of files, it is easy to lose
track of files or accidentally overwrite some, so choose a
clear and informative scheme and stick to it. If >> 1000, use
additional levels of directories.
• 0- vs 1- based indexing is a subtle point that you need to
get comfortable with (you don’t have to use it yourself, but
you will run into it sooner or later).
• Padding with leading 0’s compensates for dumb file sorting.
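A small python sketch of the padding point, assuming the 0-based dataSetNN.gz naming used above (the helper name dataset_names is hypothetical):

```python
def dataset_names(count, width=2):
    """Generate 0-based, zero-padded names: dataSet00.gz, dataSet01.gz, ..."""
    return [f"dataSet{index:0{width}d}.gz" for index in range(count)]


padded = dataset_names(100)
unpadded = [f"dataSet{index}.gz" for index in range(100)]

# Plain text sorting keeps padded names in numeric order ...
print(sorted(padded) == padded)
# ... but places dataSet10.gz before dataSet2.gz when names are not padded.
print(sorted(unpadded)[:4])
```

Choose the padding width from the largest index expected, so names never need to be changed after the fact.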
How can we easily process all of these sets?
Tools: bash
for f in dataSet*.gz
do
  gunzip -c "$f" | awk '/chr13/{print $0}' |
    myProgram -i - -o - | sort -k 2,2n > "chr13Sorted_${f/.gz/}"
done
Note: You can use an editor to create a file that contains a
complex command or a command sequence and then
have bash execute that file as if you typed it in directly:
source CommandFile
You can also turn that file (“script”) into something that
you can run like any other program.
That may take a while, how can we use multiple processors to do it faster?
Simple queue:
Produce a list of tasks to be executed (essentially the same loop as before modified to
display the commands to be executed rather than actually execute them).
for f in dataSet*.gz
do
  echo "cd $(pwd) && ( gunzip -c $f | awk '/chr13/{print \$0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted_${f/.gz/} ) > ${f}.out 2> ${f}.err"
done > Tasks
Create a batch script that directs the resource manager to allocate compute nodes and
then uses the allocated nodes to work through the list of tasks (can “|” to qsub).
default 4.6 njc2 dataExtraction Tasks
Check output files and status information (Simple Queue collects a great deal).
cd ... && blast ds 00
cd ... && blast ds 01
cd ... && blast ds 02
cd ... && blast ds 03
cd ... && blast ds 04
cd ... && blast ds 05
cd ... && blast ds 06
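To make the idea of working through a task list concrete, here is a minimal python sketch that runs one shell command per task with a fixed number of workers. It is not Simple Queue and it only uses the cores of one node; the function name run_tasks and the default worker count are assumptions for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, workers=4):
    """Run each task (one shell command line) with at most `workers`
    running at once; return the exit codes in task order."""

    def run_one(task):
        # Each task is handed to the shell, just like a line typed at the prompt.
        return subprocess.run(task, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, tasks))


def run_task_file(path, workers=4):
    """Read a file with one task per line (blank lines skipped) and run it."""
    with open(path) as handle:
        tasks = [line.strip() for line in handle if line.strip()]
    return run_tasks(tasks, workers)
```

Threads are enough here because the real work happens in the child processes. A nonzero exit code marks a task whose .err file deserves a look.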
Aside: Random Number Generation
If you run a code that depends on random
numbers, you must take care to ensure it does
what you expect when you run it several times,
perhaps concurrently on different nodes.
On the one hand, in general you will want each
instance to see different random numbers. This
may not happen by default.
On the other, you would like to be able to
reproduce your results. Different but not too
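One common way to get both properties, sketched in python under stated assumptions: derive each instance's seed from a recorded base seed plus the task index. The names BASE_SEED and task_rng are hypothetical, and other languages and libraries have their own seeding rules that need to be checked.

```python
import random

BASE_SEED = 12345  # hypothetical value; record it alongside your results


def task_rng(task_index, base_seed=BASE_SEED):
    """Return a generator whose stream depends only on (base_seed, task_index).

    Different tasks see different numbers, yet rerunning a task with the
    same base seed reproduces exactly the same numbers.
    """
    return random.Random(f"{base_seed}-{task_index}")


if __name__ == "__main__":
    for task_index in range(3):
        rng = task_rng(task_index)
        print(task_index, [round(rng.random(), 3) for _ in range(3)])
```

The important habit is that nothing is seeded implicitly from the clock or process id, so concurrent instances neither collide nor become irreproducible.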
Parallelism: Pre-packaged
Thread based: Fairly common ("easy"-ish).
Thread-based parallelism can only make use
of the cores on one node.
Message passing based (MPI, PVM, …): Less
common in bioinformatics. A message passing
program can make use of the aggregate
resources of many nodes.
“make” based: Illumina and one or two others.
Limited to the cores of one node.
Parallelism: Pre-packaged
If you are using a 3rd party program, it is important to know which kind
of parallelism is used and to invoke the program appropriately.
If threaded:
Run on a dedicated node!
Check docs for a number of threads parameter.
If MP, typically need to set up a special execution environment in order
to run the program using the resources allocated. Unfortunately,
this tends to be MPI-implementation specific and so has to be
addressed on a case by case basis (ask RDB or NJC).
If “make”, invoke like this:
make -j N MakeTarget > make.out 2> make.err
where N is the number of cores to use.
Do It Yourself: Owner computes
It is possible to write your own parallel programs.
One strategy that RDB and NJC often use:
• Imagine that you run multiple copies of a sequential version.
• At some point, the copies will enter a period of execution in which the
work can be split up into independent tasks. Add a check to decide which
copy “owns” (and should execute) a given task—all other copies will skip
this task.
• Each copy records the tasks it did. When it exits the period of execution
that was split up, it exchanges with all other copies the results of the tasks
it did. At this point all the copies know all the results and will continue to
execute as if they had each done all of the work themselves.
The devil is in the details—especially the mechanisms used to settle
ownership and to exchange task results. Ask us for help; just keep in mind
that this kind of parallelism is an option and need not be terribly complex.
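The strategy above can be sketched in python. This is a single-process simulation under simplifying assumptions, not the authors' implementation: ownership is settled by task index modulo the number of copies, and the exchange step is modeled by merging dictionaries, where a real program would use files or message passing. All function names are hypothetical.

```python
def owner_of(task_index, num_copies):
    """Ownership rule: every copy can compute this without communicating."""
    return task_index % num_copies


def run_copy(copy_id, num_copies, tasks, work):
    """One copy executes only the tasks it owns and records their results."""
    return {
        index: work(task)
        for index, task in enumerate(tasks)
        if owner_of(index, num_copies) == copy_id
    }


def exchange(partial_results):
    """Combine every copy's results so that each copy ends up knowing all
    of them, in task order."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return [merged[index] for index in sorted(merged)]


if __name__ == "__main__":
    tasks = list(range(10))
    partials = [run_copy(c, 3, tasks, lambda x: x * x) for c in range(3)]
    print(exchange(partials))
```

Round-robin ownership works well when tasks cost about the same; uneven tasks call for a different rule.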
Software as an Experimental System
Start with “small” input sets and/or run parameters and systematically alter
these to study how CPU time, memory use and IO activity vary from run to run.
Non-invasive tools:
top: May need a separate log in to the allocated node (use intra-cluster ssh).
time command:
/usr/bin/time -v prog a0 a1 a2 > outFile 2> errFile
Output from time will be appended to “errFile”. Note: use the full path;
this is an instance where it is important to understand how the shell
resolves commands (bash has its own built-in time, which does not accept -v).
Software as an Experimental System
If you are in a position to modify code, you can get
much more accurate and detailed information.
Ditto with profiling:
Compile time option plus post processing for C,
C++, Fortran, …
Available as a runtime facility in various scripting
systems (python, perl, ruby).
Activating profiling often significantly increases
run time, placing a premium on the importance
of well designed small test cases.
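As an example of a runtime profiling facility, here is a minimal python sketch using the standard cProfile and pstats modules. The function slow_sum is a hypothetical stand-in for real work, and profile_call is an illustrative helper, not an established API.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Stand-in for the code under study."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under the profiler; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries only
    return result, report.getvalue()


if __name__ == "__main__":
    value, text = profile_call(slow_sum, 100_000)
    print(text)
```

Start with a small n: profiling adds overhead to every function call, so a test case that normally runs in seconds can take much longer.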
Scaling Considerations
Consider the time (in arbitrary “operation” units) to
process N records, if doing:
A record by record transform => Time(N)
An all to all comparison => Time(N^2)
An exploration of subsets => Time(2^N)
An exploration of orderings => Time(N!)
One naturally tends to focus on run time, but
memory and IO (amount as well as rate) matter too.
Scaling Considerations
What N corresponds to about 1 CPU second?
Time(N) => 1,000,000,000
Time(N^2) => 30,000
Time(2^N) => 30
Time(N!) => 13
What model applies clearly matters!
Scaling Considerations
It matters when determining how big a problem
is feasible. Suppose we double the input size:
Time(2*1,000,000,000) => ~ 2 s
Time((2*30,000)^2) => ~4 s
Time(2^(2*30)) => 1,000,000,000 s (> 30 years)
Time((2*13)!) => 10^16 s (roughly a billion years)
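The doubling arithmetic can be checked with a short python sketch. It assumes, as above, that the listed N takes about one CPU second under each model, so the time for 2N is the ratio of the two operation counts; the names MODELS and seconds_after_doubling are hypothetical.

```python
import math

# model name -> (operation count as a function of N, N that takes about 1 s)
MODELS = {
    "N": (lambda n: n, 1_000_000_000),
    "N^2": (lambda n: n ** 2, 30_000),
    "2^N": (lambda n: 2 ** n, 30),
    "N!": (math.factorial, 13),
}

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_after_doubling(cost, n):
    """Estimated run time in seconds for input 2n, if input n takes one second."""
    return cost(2 * n) // cost(n)


if __name__ == "__main__":
    for name, (cost, n) in MODELS.items():
        seconds = seconds_after_doubling(cost, n)
        years = seconds / SECONDS_PER_YEAR
        print(f"Time({name}): {seconds:.3g} s ({years:.3g} years)")
```

The exact ratios differ somewhat from the rounded figures above, but the conclusion is the same: doubling the input is harmless for the first two models and hopeless for the last two.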
Scaling Considerations
It matters when verifying code behavior. If you
have a code that you believe follows a Time(N)
model, but empirically behaves like Time(N^2),
then you may have a bug.
For example, code that maintains a list of values
can easily degenerate to Time(N^2) if one is
careless with the operations that maintain the list.
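A small python illustration of how this can happen, using removal of duplicates as a hypothetical example: both functions return the same answer, but the first scans a growing list for every value.

```python
def unique_quadratic(values):
    """Keep first occurrences. The `in` test scans the whole list each
    time, so N values cost on the order of N^2 comparisons."""
    seen = []
    for value in values:
        if value not in seen:  # linear scan of a list
            seen.append(value)
    return seen


def unique_linear(values):
    """Same result, but membership is checked in a set, which takes
    roughly constant time per lookup, so the total is on the order of N."""
    seen = set()
    ordered = []
    for value in values:
        if value not in seen:  # hash lookup
            seen.add(value)
            ordered.append(value)
    return ordered
```

Timing both on inputs of 1,000, 10,000 and 100,000 distinct values is a quick way to see which model each one follows.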
Other Performance Considerations
Memory hierarchy:
Do as much as you can with one record before moving on to the next.
Physical vs Virtual Memory:
When chunking work, size to fit in physical memory.
Local vs remote IO:
If you cannot eliminate temporary IO via bash pipes or named pipes, at
least write to a local file system (but clean up!).
Bulk IO vs character IO:
Mostly done for you, but avoid IO operations that read or write one byte
or character at a time.
Data IO vs metadata operations:
Metadata operations are much more expensive than normal data IO.
Avoid them. E.g., don’t use a series of specially named empty files to
indicate progress, write to a log file instead.
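The log file suggestion can be sketched in python as follows. The function name process_records and the log line format are assumptions for illustration.

```python
def process_records(records, log_path, work):
    """Apply `work` to each record and note progress in one log file.

    The log is opened once and appended to, so progress tracking costs
    ordinary data writes rather than one file creation per record.
    """
    results = []
    with open(log_path, "a") as log:
        for index, record in enumerate(records):
            results.append(work(record))
            log.write(f"done {index}\n")
    return results
```

Writes are buffered, so the log may lag slightly behind the actual progress; calling log.flush() every so often is a reasonable compromise if the log is being watched while the job runs.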