High Performance Computing
John Zaitseff
September 2014
High Performance Computing architecture
Massively Parallel Distributed Computational Cluster
• Many individual servers (“nodes”): dozens to thousands
• Multiple processors per node: between 8 and 64 cores
• Interconnected by fast networks
• Almost always run Linux
– In our case: Rocks Linux Distribution on top of CentOS 6.x
The Trentino cluster
Image credit: John Zaitseff, UNSW
High Performance Computing architecture
[Diagram: cluster layout. Labels: Internet, Head Node, Storage Node, Internal Network Switch, Compute Nodes 1 to n, and Chassis 1 to m, where chassis m holds Compute Nodes m-1 to m-n.]
The Newton cluster: newton.mech.unsw.edu.au
• 10 × Dell R415 server nodes
– Head node: newton
– Compute nodes: newton01 to newton09
• 160 × AMD Opteron 4386 3.1GHz processor cores
– Two physical processors per node
– Eight CPU cores per processor
– Only four floating-point units per processor
• 320 GB of main memory (32 GB per node)
• 12 TB of storage: 6 × 3 TB drives in RAID 6
• 1Gb Ethernet network interconnect
http://cfdlab.unsw.wikispaces.net/
The Newton cluster
Image credit: John Zaitseff, UNSW
The Trentino cluster: trentino.mech.unsw.edu.au
• 16 × Dell R815 server nodes
– Head node: trentino
– Compute nodes: trentino01 to trentino15
• 1024 × AMD Opteron 6272 2.1GHz processor cores
– Four physical processors per node
– Sixteen CPU cores per processor
– Only eight floating-point units per processor
• 2048 GB of main memory (128 GB per node)
• 30 TB of storage: 12 × 3 TB drives in RAID 6
• 4×1Gb Ethernet network interconnect
http://cfdlab.unsw.wikispaces.net/
The back of the Trentino cluster
Image credit: John Zaitseff, UNSW
The Leonardi cluster: leonardi.eng.unsw.edu.au
• 7 × HP BladeSystem c7000 blade enclosures
• 1 × HP ProLiant DL385 G7 server: leonardi
• 56 × HP BL685c G7 compute nodes
– Compute nodes: ec01b01 to ec07b08
• 2944 × AMD Opteron 6174 2.2GHz and Opteron 6276 2.3GHz processor cores
– Four physical processors per node
– Twelve or sixteen CPU cores per processor
• 5888 GB of main memory (96 or 128 GB per node)
• 95 TB of storage: 60 × 2 TB drives in RAID 60
• 2×10Gb Ethernet network interconnect
http://leonardi.unsw.wikispaces.net/
Nodes in the Leonardi cluster
Image credit: John Zaitseff, UNSW
The Raijin cluster: raijin.nci.org.au
• 3592 × Fujitsu blade server nodes
• Multiple login nodes
• Multiple management nodes
• 57,472 × Intel Xeon E5-2670 2.60GHz processor cores
• 160 TB of main memory
• 10 PB of storage using the Lustre distributed file system
• InfiniBand FDR network interconnect (14Gb per lane, 56Gb per 4× link)
Image credit: National Computational Infrastructure
http://nci.org.au/nci-systems/national-facility/peak-system/raijin/
High Performance Computing architecture
[Diagram: the same cluster layout (Internet, Head Node, Storage Node, Internal Network Switch, compute nodes and chassis), now annotated “Do not run your jobs here!”]
Connecting to an HPC system
• Use the Secure Shell protocol (SSH)
– Under Linux: ssh username@hpcsystemname
– Under Windows: PuTTY (Start » All Programs » PuTTY » PuTTY)
– Can install Cygwin: “that Linux feeling under Windows”
• Command line prompt
– Will look something like: z9693022@newton:~ $
– May be different in different systems; may be customised
• Try it now: PuTTY, Host name newton.mech.unsw.edu.au
– RSA2 fingerprint: 69:7e:64:75:57:67:ad:4c:21:8e:90:7d:8e:97:70:ce
– User name: your zID; Password: your zPass
• To exit: exit
Simple Linux commands
• List files in a directory: ls [pathname ...]
– [] indicates optional parameters, ... indicates one or more parameters
– Italic fixed-width font indicates replaceable parameters
• To show the current directory: pwd
• To change directories: cd directory
– ~ is the home directory
– .. is the directory above the current one
– ~user is the home directory of user user
• Try it now:
cd ~z9693022/src/trader-7.6
ls                 # List files in current directory
cd src
pwd; ls            # More than one command at a time!
cd ..; pwd         # You don’t have to enter the comments...
Directories and files: paths and pathnames
• Files and directories are organised into a hierarchical tree structure
• The top of the tree is called the root directory (or simply root), and is denoted as / (slash)
• The root directory contains directories, which in turn contain files and directories of their own:
/
    bin
    etc
    home
        z9693022
    share
        apps
            bin
            Modules
            ansys
                14.5
                15.0
            matlab
    usr
        share
        local
            bin
Absolute pathnames
• Any file or directory can be represented as an absolute pathname:
– gives the full name of the file or directory
– starts with the root “/”
– lists each directory along the way
– has a “/” to separate each path (or pathname) component
• For example: the directory /share/apps/ansys/15.0
[Diagram: the directory tree shown under “Directories and files: paths and pathnames”]
Relative pathnames
• Second way of denoting a file or directory (a pathname)
• Relative to the current working directory
• Does not start with the root directory “/”
• Path components are still separated with slashes “/”
• Current directory is denoted by “.” (dot)
• Going up a level is denoted by “..” (dot-dot)
• Often just contains a filename with no directories listed
• Examples: Assume current directory is /home/z9693022/src/trader-7.6:
README               → /home/z9693022/src/trader-7.6/README
src/trader.c         → /home/z9693022/src/trader-7.6/src/trader.c
../trader-7.6.tar.xz → /home/z9693022/src/trader-7.6.tar.xz
src/.././README      → /home/z9693022/src/trader-7.6/README
./README             → /home/z9693022/src/trader-7.6/README
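The resolution rules above can be checked without access to the cluster. The following is a minimal sketch, assuming a throwaway directory created with mktemp as a stand-in for /home/z9693022/src; the helper name abs_dir is hypothetical and not part of the tutorial.

```shell
#!/bin/bash
# Build a throwaway tree shaped like the one in the examples
# (the temporary directory is a hypothetical stand-in for /home/z9693022/src).
base=$(mktemp -d)
mkdir -p "$base/trader-7.6/src"
touch "$base/trader-7.6/README" "$base/trader-7.6/src/trader.c"

cd "$base/trader-7.6" || exit 1

# Resolve a relative directory to an absolute one: change into it inside
# a subshell (so the caller's directory is untouched) and print where we are.
abs_dir () { (cd "$1" && pwd -P); }

here=$(abs_dir .)            # the current directory
parent=$(abs_dir ..)         # one level up
roundtrip=$(abs_dir src/..)  # down into src, then back up

echo ".      -> $here"
echo "..     -> $parent"
echo "src/.. -> $roundtrip"
```

Both “.” and “src/..” should print the same absolute pathname, and “..” should print its parent.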
Important directories
• Home directory: /home/user (e.g., /home/z9693022)
• Scratch directory for temporary files: /share/scratch/user (but not available on Newton!)
• Binary directories for utility programs:
– /bin: for essential utilities
– /usr/bin: for other utilities and some applications
– /usr/local/bin: for local utilities and applications
– /home/user/bin: for your own utilities
• On our clusters, applications: /share/apps
• On our clusters, module files: /share/apps/Modules
• Note synonyms: path, pathname, filename
More with pathnames
• To change directories: cd dir
• To change to your home directory: cd ~ or cd $HOME or cd (by itself)
• To get current working directory: pwd
• To show the directory tree structure: tree, tree -d (directories only)
• To view a file page by page: less filename, “q” to quit, “h” for help
• Try it now:
cd /home/z9693022/src/trader-7.6
tree -d
less README
less src/trader.c
cd src; pwd
less README
less ../README     # Different from README!
Getting help
• Many commands have a myriad of command line options
• For a brief summary of command line options, try command --help
• For a full explanation, try man command
• For some commands, try pinfo command
• To search for a keyword in the manual: man -k keyword
• Remember, “Google is your friend”
• Try it now:
ls --help
cd --help          # Does this work?
man ls             # See “See Also” section at end
pinfo coreutils    # “q” to quit
man less           # 1571 lines!
man cd             # What is “BASH_BUILTINS”?
The Bourne Again (Bash) shell
• Official manual page entry:
Bash is an sh-compatible command language interpreter that executes commands read from the standard input or from a file. Bash also incorporates useful features from the Korn and C shells (ksh and csh).
Bash is intended to be a conformant implementation of the Shell and Utilities portion of the IEEE POSIX specification (IEEE Standard 1003.1). Bash can be configured to be POSIX-conformant by default.
• Interprets your typed commands and executes them
• Just another Linux program: nothing special about it!
• Started by the system when you log in
• You can then start another shell, if you like (e.g., ksh, tcsh, even python)
• You can start a subshell by running bash
• To exit a subshell (or the main shell): exit
Some features of Bash
• Powerful command line facilities (shortcuts):
– Tab completion (press the TAB key to complete commands and pathnames, TAB TAB to list all possibilities)
– Command line editing: try ↑ (Up-Arrow) to recall previous commands, CTRL-R (C-R or ^R) to search for previous commands, ← and → to move along current command line
• A full programming and scripting language:
– Variables and arrays
– Loops (for; while; until), control statements (if ... then ... else; case)
– Functions and coprocesses
– Text processing (“expansion” and “parameter substitution”)
– Simple arithmetic calculations
– Input/output redirection (e.g., redirect output to different files)
– Much, much more! (The man page runs to over 5,300 lines)
Trying out some features of Bash
• Try it now:
– cd ~z9693022/src/trader-7.6/src
– Type “less”, then space, but do not press ENTER yet
– Press TAB once: nothing appears
– Press TAB a second time: all relevant completions appear
– Type “f”, then press TAB: the filename is completed to “fileio.”
– Press TAB TAB again: two files are listed
– Type “h” to select the second file, then press ENTER (and “q” to quit)
• Try it now:
– Press CTRL-R, then type “ls” (but do not press ENTER): previous commands with “ls” in them are listed
– Press CTRL-R again a few times: will even list “pinfo coreutils”
– Press ENTER when you get to the command you wish to execute
– Press CTRL-C if you do not wish to execute any command
Listing files and directories
• Already know the ls command: List directory contents
• In full: ls [options] [pathname ...]
• Some options:
– “-a” for all files (including those starting with “.”)
– “-l” for long (detailed) listing
– Options sometimes can be combined: “-alF”
• Try it now: ls -laF or dir (an alias to “ls -laF”); ll (“ls -lF”)
• Example of a line in a long listing:
-rw-r--r-- 1 z9693022 unsw 1266 May 24 07:59 README
• The columns of information are: file permissions, number of links (usually 1 for files, 2 or more for directories), file owner, group owner, size in bytes (here, 1266), date last modified, the actual filename (README), with perhaps a trailing “*” for executable files and “/” for directories.
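The column layout described above can be explored by splitting the sample line into its fields. This is a minimal sketch that works on the sample text itself rather than on a real file; the variable names are chosen for illustration only.

```shell
#!/bin/bash
# The sample line from the long listing above, held in a variable so that
# the example does not depend on any real file.
line='-rw-r--r-- 1 z9693022 unsw 1266 May 24 07:59 README'

# "read" splits the line on whitespace, assigning one variable per column.
read -r perms links owner group size month day time name <<< "$line"

echo "owner=$owner group=$group size=$size name=$name"

# awk numbers the same columns from 1, so the size is field 5.
size_from_awk=$(echo "$line" | awk '{print $5}')
echo "size according to awk: $size_from_awk"
```

Counting columns this way is useful before passing a field number to tools such as awk or sort.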
File and directory patterns
• The Bash shell interprets certain characters in the command line by replacing them with matching pathnames
• Called pathname expansion, pattern matching, wildcards or globbing
• For existing pathnames: “*” matches any string, “?” matches any single character, “[...]” matches any one of the enclosed characters
• Try it now:
cd ~z9693022/src/trader-7.6/src; echo 1 2 3
echo *c            # All filenames ending in “c”: “.” is not special
echo ????.c        # All filenames six characters long (4 + “.c”)
echo M*m           # All filenames starting with “M” and ending with “m”
echo [it]*         # All filenames starting with either “i” or “t”
echo ../lib/uni*   # All filenames in ../lib starting with “uni”
echo ../*/*.c
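The patterns above can also be tried in a scratch directory. The sketch below assumes a handful of hypothetical filenames (they are not the real contents of trader-7.6/src) and stores each expansion in a Bash array so that the matches can be inspected.

```shell
#!/bin/bash
# Work in a scratch directory populated with hypothetical filenames.
work=$(mktemp -d)
cd "$work" || exit 1
touch help.c intf.c main.c trader.c Makefile.am Makefile.in README

ends_in_c=( *c )         # "*" matches any string, including the "."
four_plus_c=( ????.c )   # exactly four characters, then ".c"
m_to_m=( M*m )           # starts with "M", ends with "m"
i_or_t=( [it]* )         # starts with either "i" or "t"

echo "*c     : ${ends_in_c[*]}"
echo "????.c : ${four_plus_c[*]}"
echo "M*m    : ${m_to_m[*]}"
echo "[it]*  : ${i_or_t[*]}"
```

Note that trader.c is matched by “*c” but not by “????.c”, because its base name is longer than four characters.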
More file and directory patterns
• Glob patterns “*”, “?” and “[...]” only match existing pathnames
• Brace patterns work even for pathnames that do not exist: “{alt1,alt2,...}” lists alternatives, “{n..m}” lists all numbers from n to m, “{n..m..s}” in steps of s
• Technically called brace expansion
• Try it now:
ls test-*                  # “No such file or directory”
echo test-*                # What happens?
echo test-{one,two,three}
echo newdir/{one,two,three}
echo test-{1..100}
echo test-{001..100}       # Zero-padding
echo test-{1..100..3}      # By steps of three
echo test-{100..1..-3}     # By steps of negative three
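A shorter version of the same experiment captures each expansion in a variable. This is a sketch that assumes Bash 4 or later (zero-padding and step values are not available in older versions) and the default shell options, under which an unmatched glob is left unchanged.

```shell
#!/bin/bash
# Brace expansion generates words whether or not matching files exist.
alternatives=$(echo test-{one,two,three})
range=$(echo test-{1..5})
padded=$(echo test-{08..10})      # zero-padded to the width of "08"
stepped=$(echo test-{1..10..3})   # 1, 4, 7, 10
down=$(echo test-{10..1..-3})     # 10, 7, 4, 1

# A glob, by contrast, only matches existing pathnames. In an empty
# directory nothing matches, so the pattern itself is passed through.
cd "$(mktemp -d)" || exit 1
unmatched=$(echo test-*)

printf '%s\n' "$alternatives" "$range" "$padded" "$stepped" "$down" "$unmatched"
```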
Naming files and directories
• Linux allows any characters in filenames except “/” and the NUL byte
• You may create filenames with “weird” characters in them:
– spaces and tabs
– starting with “-”: conflicts with command line options
– question marks “?”, asterisks “*”, brackets and braces
– other characters with special meanings: “!”, “$”, “&”, “#”, “"”, etc.
• Just because you can does not mean you should!
• To match such files: use the glob characters “*” and “?”
• Linux file systems are case-sensitive: README.TXT is different from readme.txt, which is different from Readme.txt and ReadMe.txt!
• File type suffixes (e.g., “.txt”) are optional but recommended
• Filenames starting with “.” are usually hidden from globs and ls output.
• Recommendation: Use “a” to “z”, “A” to “Z”, “0” to “9”, “-”, “_” and “.” only.
Managing directories
• To create a directory: mkdir dir ...
• To create parent directories as well: mkdir -p dir ...
• To remove an empty directory: rmdir dir ...
• Try it now:
cd ~; ls
mkdir gsoe9400/dir{1,2,3}                    # Why does this fail?
mkdir -p gsoe9400/dir{1,2,3,99} gsoe9400/x
ls gsoe9400
rmdir gsoe9400/dir?
ls gsoe9400                                  # Should list dir99 and x only
rmdir gsoe9400/*                             # Be careful...
Managing files
• To output the contents of one or more files: cat filename ...
• To view one or more files page by page: less filename ...
• To copy one file: cp source destination
• To copy one or more files to a directory: cp filename ... dir
• To preserve the “last modified” time-stamp: cp -p
• To copy recursively: cp -pr source destination
• To move one or more files to a different directory: mv filename ... dir
• To rename a file: mv oldname newname
• To remove files: rm filename ...
• Recommendation: use “ls filename ...” before rm or mv: what happens if you accidentally type “rm *”? or “rm * .c”? (note the space!)
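The effect of the stray space can be previewed safely by putting echo in front of the command, so that the shell expands the pattern but nothing is deleted. The sketch below assumes a scratch directory with three hypothetical files.

```shell
#!/bin/bash
# Scratch directory with hypothetical files; nothing here is ever removed.
work=$(mktemp -d)
cd "$work" || exit 1
touch a.c b.c notes.txt

# "echo" shows the command line as rm would receive it.
intended=$(echo rm *.c)     # only the C source files
mistyped=$(echo rm * .c)    # every file, plus a non-existent ".c"

echo "$intended"
echo "$mistyped"
```

With the space, “*” expands to every file in the directory, which is why checking the pattern with ls or echo first is worthwhile.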
Managing files and directories, continued
• To copy whole directory trees: cp -pr filename ... destination
• To copy to and from another Linux system (e.g., from Leonardi to Trentino), use Secure Copy: scp [-p -r] source ... destination
– Either source or destination (but not both) can contain a remote system identifier followed by a colon: [user@]system:
• Can also use rsync or insync: insync [-d] source destination
• Examples:
cp -pr ~z9693022/src/trader-7.6 .
scp -p ~/file1.txt leonardi:file2.txt
scp -p john@zap.org.au:src/README .
mkdir dir1; insync ~/orig dir1
insync /share/scratch/$USER/data1 $HOME/data1
insync leonardi:/share/scratch/$USER/data2 .
Managing files and directories, continued
• Try it now:
cd ~/gsoe9400
cp -pr ~z9693022/src/trader-7.6 .; ls
cd trader-7.6; pwd
cat build-aux/bootstrap
ls */*.c
rm */*.c; ls */*.c         # What is the output of ls?
insync ~z9693022/src/trader-7.6 .
mkdir ../new; cp src/trader.c ../new
cd ../new; ls
mv trader.c new.c; rm new.c
cp -p ../trader-7.6/src/trader.* .
cp trader.c new.c
ls -l trader.c new.c       # What is the difference between these files?
Transferring files
• To copy files to another Linux system: use scp, rsync or insync
• To copy files to and from a Windows machine: use WinSCP or scp, rsync or insync under Cygwin
• Try it now:
– Start WinSCP (Start » All Programs » WinSCP » WinSCP)
o Host name newton.mech.unsw.edu.au
o RSA2 fingerprint: 69:7e:64:75:57:67:ad:4c:21:8e:90:7d:8e:97:70:ce
o User name: your zID; Password: your zPass
– Copy ~/gsoe9400/new/new.c to the Windows desktop
– Rename it to newnew.c (using the usual Windows right-click or F2)
– Copy it back
– Under PuTTY: ls newnew.c
More Linux commands
• What machine am I on? hostname
• What is the date and time? date
• Who is logged in? who
• But who is user z1234567? finger [username ...]
• What is the user name for someone? finger part-of-name
• Which files contain a particular string? grep 'pattern' filename ...
• What is the difference between two files? diff [-u] file1 file2
• How do I rename multiple files at once? rename or prename
• Where is a file named filename? find dir ... -name filename
• How big is a file or directory? du -h [filename ...]
• How much space is available in a directory? df -h [dir ...]
• How much disk quota do I have? quota -s
– “Blocks” is how many disk blocks you are using, in chunks of 1 kB
– On Newton: “limit” is 10240M = 10 GB
Redirecting input and output
• The terminal is treated as just another file (/dev/tty); use CTRL-D to signify the end of file
• Other special files: /dev/null (an empty file), /dev/zero (an infinite number of binary zeros; can use up your quota in a hurry!)
• Input and output from a program can be redirected to a file or even piped to another program
• To redirect output to filename, use “>filename”
• To append output to filename, use “>>filename”
• To redirect input from filename, use “<filename”
• To connect the output from one program to the input of another (pipes), use “program1 | program2”
• Multiple pipes are allowed: “program1 | program2 | ... | programn”
• Many utility programs are designed to be used in this way, as filters
• Output can be substituted into a command line: $(commandline)
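The four mechanisms above (output redirection, appending, input redirection and pipes, plus command substitution) fit into a few lines. This is a minimal sketch using a scratch directory and a hypothetical file named fruit.txt.

```shell
#!/bin/bash
work=$(mktemp -d)
cd "$work" || exit 1

printf 'pear\napple\n' > fruit.txt    # ">" creates (or overwrites) the file
printf 'cherry\n' >> fruit.txt        # ">>" appends to it

# "<" feeds the file to wc as standard input; $( ) captures the output.
line_count=$(wc -l < fruit.txt)

# A pipe connects sort's output to head's input: both act as filters.
first=$(sort < fruit.txt | head -n 1)

echo "$line_count lines; first alphabetically: $first"
```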
Redirecting input and output, continued
• Try it now:
cd ~/gsoe9400/trader-7.6
ls > ../dir-list1
cat ../dir-list1
cat ../dir-list1 | wc -l                   # How many lines in ../dir-list1?
ls ~/gsoe9400/trader-7.6 | wc -l           # Same as above
rm ../dir-list1
ls -l | grep May                           # Which files were last modified in May?
ls -l | grep May | sort -nk5               # Same, but sort by file size (5th field)
who | awk '{print $1}'                     # Just list first field of “who” output
finger $(who | awk '{print $1}')           # Full details of who is logged in
finger $(who | awk '{print $1}') | less    # One page at a time
Simple scripting
• Shell scripts are just files containing a list of commands to be executed
• First line (“magic identifier”) must be #!/bin/bash
• Comments are introduced with “#”
• The script file must be made executable: chmod a+x filename
• Variables:
– To set a variable, use varname=value (no spaces!)
– To use a variable, use $varname or ${varname}
– Variable names start with a letter or “_”, and may contain letters, numbers and “_”
– Variable names are case-sensitive (as with most things Linux)
• Functions (parameters are accessed using $1, $2, ...):
funcname () {
    body of function
}
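As a concrete illustration of variables and functions, the sketch below defines one of each. The names unit and describe_limit are hypothetical; the 10 GB figure is the Newton quota limit mentioned earlier in the tutorial.

```shell
#!/bin/bash
# A variable: note that there are no spaces around "=".
unit="GB"

# A function: $1 and $2 are its first and second parameters.
describe_limit () {
    echo "Limit for $1 is $2 ${unit}"
}

# Call the function like any other command and capture its output.
message=$(describe_limit newton 10)
echo "$message"
```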
Simple scripting, continued
• For loops:
for varname in list ...; do
    process using ${varname}
done
• Control statements (multiple “elif” allowed; “elif” and “else” clauses are optional):
if [ comparison ]; then
    if-true statements
elif [ second-comparison ]; then
    if-second-true statements
else
    if-false statements
fi
• Example of comparisons: string1 = string2 (is equal)
– See the manual page for test (“man test”) for more information
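The loop and control statement templates above can be combined into a small runnable sketch. The function name classify and the filenames in the list are hypothetical, chosen only to exercise each branch.

```shell
#!/bin/bash
# Classify a filename using string comparisons ("=" tests for equality).
classify () {
    if [ "$1" = "README" ]; then
        echo "documentation"
    elif [ "$1" = "Makefile" ]; then
        echo "build file"
    else
        echo "other"
    fi
}

# Loop over a list of words, building up a summary string.
summary=""
for name in README Makefile trader.c; do
    summary="${summary}${name}=$(classify "$name"); "
done
echo "$summary"
```

Quoting "$1" and "$name" keeps the comparison well-formed even if a filename contains spaces.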
Simple scripting, continued
• While loops:
while [ comparison ]; do
    while-true statements
done
• Until loops:
until [ comparison ]; do
    while-false statements
done
• Many, many other programming features available!
• Read the manual page: man bash
• Some books:
– Cameron Newham, Learning the bash Shell, 3rd Edition, O’Reilly Media, March 2005. ISBN 9780596009656, 9780596158965
– William E. Shotts Jr., The Linux Command Line, No Starch Press, January 2012. ISBN 9781593273897, 9781593274269
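The while and until templates shown on this slide can be seen side by side in a short sketch. It uses the numeric comparisons -lt (less than) and -eq (equal) documented in “man test”; the variable names are illustrative only.

```shell
#!/bin/bash
# "while" repeats for as long as the comparison is true.
count=0
while [ "$count" -lt 3 ]; do
    count=$(( count + 1 ))
done

# "until" repeats for as long as the comparison is false.
remaining=3
until [ "$remaining" -eq 0 ]; do
    remaining=$(( remaining - 1 ))
done

echo "count=$count remaining=$remaining"
```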
Editing files under Linux
• Use an editor to edit text files
• Many choices, leading to “religious wars”!
• Some options: GNU Emacs, Vim, Nano
• Nano is very simple to use: nano filename
– CTRL-X to exit (you will be asked to save any changes)
• GNU Emacs and Vim are highly customisable and programmable
– For example, see the file ~z9693022/.emacs
– Debra Cameron et al., Learning GNU Emacs, 3rd Edition, O’Reilly Media, December 2004. ISBN 9780596006488, 9780596104184
– Arnold Robbins et al., Learning the vi and Vim Editors, 7th Edition, O’Reilly Media, July 2008. ISBN 9780596529833, 9780596159351
• Try it now:
cd ~/gsoe9400; nano script1
Creating a simple script file
• Try it now, continued: Enter the following text:
#!/bin/bash
# How much disk quota am I using?
#   (We want only the last line of "quota" output:
#   use the "tail" utility)
blocks_used=$(quota | tail -n 1 | awk '{print $1}')
blocks_limit=$(quota | tail -n 1 | awk '{print $3}')
percent=$(( ${blocks_used} * 100 / ${blocks_limit} ))
echo "I am using ${blocks_used} blocks (${percent}%)"
• Save the file and exit the editor, then:
chmod a+x ./script1
./script1          # Execute the script! (Note the use of “./”)
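The percentage calculation in script1 can be tried on a machine without quotas by feeding it sample numbers. The sketch below assumes hypothetical block counts (1 kB blocks, with a limit of 10240M as on Newton) and a hypothetical helper named percent_used.

```shell
#!/bin/bash
# Integer percentage of blocks used: $1 is blocks used, $2 is the limit.
# Bash arithmetic is integer-only, so the result is rounded down.
percent_used () {
    echo $(( $1 * 100 / $2 ))
}

# Sample values: 2.5 GB used out of a 10 GB limit, in 1 kB blocks.
quarter=$(percent_used 2621440 10485760)
third=$(percent_used 1 3)

echo "Using ${quarter}% of quota"
echo "One third rounds down to ${third}%"
```

Multiplying by 100 before dividing matters: dividing first would give zero for any usage below the limit.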
Creating a script with loops
• Try it now:
– Create and run the file script2, containing the following. What is the output? (Hint: remember “chmod a+x ./script2; ./script2”)
#!/bin/bash
module load matlab/2014a
for n in {01..10}; do
    echo "n = $n;" >script${n}.m
    echo "sqrtn = sqrt(n);" >>script${n}.m
    echo "save('data${n}.txt', 'sqrtn', '-ascii');" \
        >>script${n}.m
    echo "quit" >>script${n}.m
    matlab -nojvm -r script${n} >/dev/null
    cat data${n}.txt
done
Applications on the cluster
• Applications are managed using the module system
• Applications are stored in /share/apps
• Module files are stored in /share/apps/Modules
• Module files set shell environment variables such as PATH
• PATH controls where applications are searched (the search path)
– Try it now: echo $PATH
• To see all available applications: module avail
• To see currently loaded applications: module list
• To load an application: module load application[/version]
• To unload an application: module unload application[/version]
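The role of PATH can be demonstrated without the module system. The sketch below is not how module load is implemented; it only imitates its visible effect by prepending a directory to PATH. The directory and the program name hello-demo are hypothetical.

```shell
#!/bin/bash
# Create a tiny executable in a scratch directory, standing in for an
# application installed somewhere under /share/apps.
appdir=$(mktemp -d)
printf '#!/bin/bash\necho "hello from hello-demo"\n' > "$appdir/hello-demo"
chmod a+x "$appdir/hello-demo"

# Before the directory is on the search path, the shell cannot find it.
before=$(command -v hello-demo || echo "not found")

# Prepending the directory to PATH makes the program available by name.
PATH="$appdir:$PATH"
after=$(command -v hello-demo)
output=$(hello-demo)

echo "before: $before"
echo "after:  $after"
echo "output: $output"
```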
Submitting jobs to the cluster
• So far, everything has been run on the head node: a very bad idea!
• To submit a job to the cluster compute nodes:
– Create a shell script file as per normal
– Add #PBS directives as required directly after “#!/bin/bash”
– Add “cd $PBS_O_WORKDIR”
– Execute qsub ./scriptfile
– Wait for the job to run, checking its status as required
• Common #PBS directives (“man qsub” for full details):
– #PBS -N scriptname             Set a name for the script
– #PBS -M email                  Send notifications to an email address
– #PBS -m abe                    What notifications to send
– #PBS -l walltime=hh:mm:ss      How much time is required
– #PBS -l vmem=sizegb            How much memory is required (GB)
– #PBS -l nodes=1:ppn=n          Request n processors on one node
– #PBS -q queuename              Which queue to submit to
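Putting the steps above together, a complete job script might look like the one written out by the sketch below. The job name and resource values are illustrative assumptions, not recommendations for any particular cluster, and the qsub call is left as a comment because it needs a system with PBS installed.

```shell
#!/bin/bash
# Write a small job script to a temporary file. The quoted 'EOF' stops
# the shell from expanding $PBS_O_WORKDIR and $(hostname) at this point.
jobfile=$(mktemp)
cat > "$jobfile" <<'EOF'
#!/bin/bash
#PBS -N example
#PBS -m abe
#PBS -l walltime=00:10:00
#PBS -l vmem=2gb
#PBS -l nodes=1:ppn=1

cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

# Sanity check: count the directives before submitting.
directive_count=$(grep -c '^#PBS' "$jobfile")
echo "Wrote $jobfile with $directive_count #PBS directives"

# On a cluster with PBS installed, the file would then be submitted with:
#   qsub "$jobfile"
```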
Checking your job status
• Submit your jobs using “qsub”
• You will be given a job number in the form jobnumber.systemname
• Check job status: qstat [jobnumber]
• Another way: showq
• Yet another way: pestat or pestat | less -S
– Use ← and → keys to scroll left and right (or expand your terminal!)
• Show which nodes are reserved: showres -n | less -S
• Get overall information about the cluster: visit http://systemname/ganglia/
– e.g., http://newton.mech.unsw.edu.au/ganglia/
– Currently only available within UNSW
• Try it now: view the Ganglia page for the Newton cluster.
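Since the job identifier has the form jobnumber.systemname, a script can split it with Bash parameter expansion. The sketch below uses a made-up identifier; in practice the value would come from the output of qsub.

```shell
#!/bin/bash
# Hypothetical identifier, in the form jobnumber.systemname.
jobid="12345.newton.mech.unsw.edu.au"

# "%%.*" removes the longest suffix starting at a ".", leaving the number;
# "#*." removes the shortest prefix ending in a ".", leaving the system name.
jobnumber=${jobid%%.*}
systemname=${jobid#*.}

echo "job number:  $jobnumber"
echo "system name: $systemname"
```

The extracted number can then be passed to commands such as qstat.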
Managing your jobs
• To see which nodes exist on the cluster: rocks list host or pestat
• To see jobs belonging to you: qstat | grep $USER
• To see when your job will start: showstart jobnumber
• For more detailed information: checkjob jobnumber
• To delete a queued job (whether running or not): qdel jobnumber ...
• To place a job on hold: qhold jobnumber ...
• To release a job currently on hold: qrls jobnumber ...
• To rerun a job (kill it and then restart it): qrerun jobnumber ...
• To move a job from one queue to another: qmove destqueue jobnumber ...
Submitting and checking a job
• Try it now:
– Create and change to the directory ~/gsoe9400/job1:
mkdir ~/gsoe9400/job1; cd ~/gsoe9400/job1
– Copy the previously created script file script2:
cp ../script2 job1
– Edit the file job1 and add the following lines just after “#!/bin/bash”:
#PBS -N job1
#PBS -M J.Zaitseff@unsw.edu.au
#PBS -m abe
#PBS -l walltime=00:10:00
#PBS -l vmem=2gb
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR
(Do not replace the email address: it is used to assess you for this class!)
– Submit the script: qsub ./job1
Conclusion
You have begun your journey to using High Performance Computing clusters effectively. Well done!
John Zaitseff
J.Zaitseff@unsw.edu.au
Available for consultations on Tuesdays 9:30am–4pm by appointment only.
Image credit: John Zaitseff, UNSW
http://www.engineering.unsw.edu.au/hpc