HPCC: Training Session II (Advanced)
Srirangam Addepalli
Huijun Zhu
High Performance Computing Center
Introduction and Outline
In this session:
1. Compiling and running serial and MPI programs
2. Debugging serial and parallel code
3. Profiling serial and parallel code
4. Features of Sun Grid Engine and local setup
5. Features of the shell
6. Shell commands
7. Additional SGE options
8. Applications of interest
9. Questions and contacts
Compiler Optimizations
Intel compiler optimization flags:

Windows     Linux     Description
/Od         -O0       No optimization
/O1         -O1       Optimize for size
/O2         -O2       Optimize for speed and enable some optimizations
/O3         -O3       Enable all optimizations as in O2, plus intensive loop optimizations
/arch:SSE3  -msse3    Enable SSE3, SSE2 and SSE instruction-set optimizations, including for non-Intel processors

-prof_gen   Compile the program and instrument it for a profile-generating run.
-prof_use   May only be used after running a program that was previously compiled using -prof_gen. Uses the profile information during each step of the compilation process.
Compile Optimization
for (int i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];

icc -unroll=8 unroll.c

for (int i = 0; i < 1000; i += 8) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
    a[i+4] = b[i+4] + c[i+4];
    a[i+5] = b[i+5] + c[i+5];
    a[i+6] = b[i+6] + c[i+6];
    a[i+7] = b[i+7] + c[i+7];
}
Compiling Code with Debugging
icc -unroll=8 unroll.c
icc -g -unroll=8 unroll.c -o test.exe
icc -debug -unroll=8 unroll.c -o test.exe
idb ./test.exe
Compiling Code with Profiling
icc -unroll=8 unroll.c
icc -g -c -prof_gen -unroll=8 unroll.c    # generates object files
icc -g -prof_use -unroll=8 unroll.o       # generates the executable
Use - Thread Profiler
    - VTune Performance Analyzer
Using GCC:
cp /lustre/work/apps/examples/Advanced/TimeWaste.c .
gcc TimeWaste.c -pg -o TimeWaste -O2 -lc
gprof TimeWaste gmon.out $options
- Flat profile
- Call graph
- Annotated source
(The -g and -pg compile options must be used.)
Parallel Code
Compile with the -g option to enable debugging.
cp /lustre/work/apps/examples/Advanced/pimpi.c .
mpicc -g pimpi.c
mpirun -np 4 -dbg=gdb a.out
To run cpi with 2 processes, where the second process is run under the debugger, the session would look something like:
mpirun -np 2 cpi -p4norem
waiting for process on host compute-19-15.local:
/lustre/home/saddepal/cpi compute-19-15.local 38357 -p4amslave
on the first machine, and on the second:
% gdb cpi
Intel Parallel Studio is really good. We currently do not have a license, but users can request a free license.
(Show idb)
Resources: Hrothgar / Janus
Hrothgar Cluster:
- 7680 cores, 640 nodes (12 cores/node)
- Intel(R) Xeon(R) @ 2.8 GHz
- 24 GB of memory per node
- DDR InfiniBand for MPI communication and storage
- 80 TB of parallel Lustre file system
- 86.2 Tflop peak performance
- 1024 cores for serial jobs
- Top 500 list: #109 in the world; top 10 academic institutions in the USA
- 320 cores for community cluster
JANUS Windows Cluster:
- 40 cores, 5 nodes (8 cores/node)
- 64 GB memory
- Visual Studio with Intel Fortran
- 700 GB storage
File System
Hrothgar has multiple file systems available for your use. There are Lustre parallel systems (physically all the same) and a physical-disk scratch system on each compute node.
$HOME is backed up and persistent, with quotas.
$WORK is not backed up but is persistent, with quotas.
$SCRATCH and /state/partition1 are not backed up, are not persistent, and have no quotas.
By "not persistent" we mean the area will be cleared based on earliest last-access time. Lustre and physical disk have different performance characteristics: Lustre has much higher bandwidth and much higher latency, so it is better for large reads and writes; disk is better for small reads and writes, particularly when they are intermixed. Backups of $HOME are taken at night. Removed files are gone forever, but can be restored from the last backup if on $HOME. Critical files that cannot be replicated should be backed up on your personal system.
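Before relying on these areas, it is worth checking where the variables point and how much space you are using. A minimal sketch ($WORK and $SCRATCH are set by the cluster's login environment and may be empty elsewhere):

```shell
echo "HOME=$HOME"           # backed up, persistent, quota'd
echo "WORK=$WORK"           # persistent and quota'd, but not backed up
echo "SCRATCH=$SCRATCH"     # neither persistent nor backed up
du -sh "$HOME" 2>/dev/null || true   # rough usage of your home area
```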
File System -2

File system   Backed up   Size
$HOME         Yes         7 TB
$WORK         No          22 TB
$SCRATCH      No          43 TB
Let's compile and run an MPI program.
mkdir mpi
cd mpi
cp /lustre/work/apps/examples/mpi/* .
mpicc cpi.c, or
mpif77 fpi.f
qsub mpi.sh
$ echo $MCMD
$ echo $MFIL
For mvapich2, those are mpirun_rsh and hostfile. These are SGE variables:
$NSLOTS is the number of cores.
$SGE_CWD_PATH is the submit directory.
#$ -V                        # export env to job: needed
#$ -cwd                      # change to submit dir
#$ -j y                      # ignore -e, merge
#$ -S /bin/bash              # the job shell
#$ -N mpi                    # the name, qstat label
#$ -o $JOB_NAME.o$JOB_ID     # the job output file
#$ -e $JOB_NAME.e$JOB_ID     # the job error file
#$ -q development            # the queue, dev or normal
#$ -pe fill 8                # the parallel environment and core count
cmd="$MCMD -np $NSLOTS -$MFIL \
  $SGE_CWD_PATH/machinefile.$JOB_ID \
  ./a.out"
echo cmd=$cmd                # this expands and prints $cmd
$cmd                         # runs $cmd
Sun Grid Engine
◮ There are two queues: normal (48 hrs) and serial (120 hrs).
◮ When the time is up, your job is killed, but a signal is sent 2 minutes earlier so you can prepare for restart, and requeue if you want.
◮ serial supports both parallel and serial jobs (normal only parallel).
◮ There are no queue limits per user, so we request that you do not submit, say, 410 one-node jobs if you are the first person on at a time when the system is empty.
◮ We ask that you either use (1) all the cores or (2) all the memory on each node you request, so the minimum core (pe) count is 12, and increments are 12. If you are not using all the cores, request pe 12 anyway, but don't use $NSLOTS.
◮ The interactive script /lustre/work/apps/bin/qstata.sh will show all hosts with running jobs on the system (it's 724 lines of output).
Sun Grid Engine: Restart
SGE does not auto-restart a timed-out job; you have to tell it to in your command file. The SGE #$ headers are unchanged and are omitted here.
2 minutes before the job is killed, SGE will send a USR1 signal to the job. The "trap" command intercepts the signal, runs a script called "restart.sh", and exits with code 99. That exit code 99 tells SGE to requeue the job.
trap "$SGE_CWD_PATH/restart.sh; exit 99" USR1
echo MASTER=$HOSTNAME        # optional, prints run node
echo RESTARTED=$RESTARTED    # optional, 1st run or not
./a.out                      # these 5 lines run the job,
ec=$?                        # capture its exit code, and
if [ $ec -eq 0 ]; then       # let the job finish when
  exit $ec                   # it's done by exiting with
fi                           # code 0 (normal end)
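The trap mechanism can be tried outside SGE. A self-contained sketch (demo.sh stands in for the command file, and we deliver USR1 by hand instead of waiting for SGE's 2-minute warning):

```shell
cat > demo.sh <<'EOF'
#!/bin/bash
# exit 99 is the status that tells SGE to requeue the job
trap "echo 'got USR1, saving state'; exit 99" USR1
sleep 30 &       # stand-in for the real work
wait $!          # wait is interrupted when the signal arrives
EOF
bash demo.sh &
pid=$!
sleep 1                    # give the script time to install the trap
kill -USR1 $pid            # simulate SGE's warning signal
wait $pid || status=$?     # collect the script's exit status
echo "exit status: $status"
```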
Serial Jobs
Most serial jobs won't use 16 GB of memory, so run 8 at once; use "-pe fill 8" for this. This example assumes you run in 8 subdirectories of your submit directory, but you don't have to if files don't conflict.
ssh $HOSTNAME "cd $PWD/r1;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r2;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r3;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r4;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r5;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r6;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r7;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r8;$HOME/bin/mys <i.dat 1>out 2>&1" &
r=8                       # these lines count
while [ "$r" -ge 1 ]      # the ssh processes
do                        # and sleep the master while they
  sleep 60                # are running so the job won't die
  r=`ps ux | grep $HOSTNAME | grep -v grep | wc -l`
done
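When the background processes are direct children of the script (started without ssh), bash's built-in wait is a simpler alternative to the polling loop above. A minimal sketch (the sleeps stand in for the real runs, and the subdirectory names are illustrative):

```shell
for d in r1 r2 r3 r4; do
    ( sleep 1; echo "finished $d" ) &   # stand-in for the real command
done
wait                    # blocks until every background job exits
echo "all runs complete"
```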
Application Checkpoint
Application checkpointing is supported for serial jobs.
cr_run – to run the application:
./a.out should be replaced with cr_run ./a.out
cr_checkpoint --term PID
(where PID is the process ID of the job)
Serial Jobs Auto Checkpoint
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N GeorgeGains
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q kharecc
export tmpdir=$SGE_CKPT_DIR/ckpt.$JOB_ID
export currcpr=`cat $tmpdir/currcpr`
export ckptfile=$tmpdir/context_$JOB_ID.$currcpr
if [ "$RESTARTED" = "1" ] && [ -e "$tmpdir" ]; then
  echo "Restarting from $ckptfile" >> /tmp/restart.log
  /usr/bin/cr_restart $ckptfile
else
  /usr/bin/cr_run a.out
fi
Applications -- NWChem
/lustre/work/apps/examples/nwchem/nwchem.sh is an NWChem script almost like the MPI command file. There is a defined variable, your Lustre scratch area, called $SCRATCH. If NWChem is started in that directory with no scratch defined in the .nw file, it will use that area for scratch, as intended. Here the machinefile and input file remain in the submit directory. This requires +intel, +mvapich1, and +nwchem-5.1 in .soft. You can check before submitting; a correct setup will show some files:
run="$MCMD -np $NSLOTS -$MFIL \
  $SGE_CWD_PATH/machinefile.$JOB_ID \
  $NWCHEM_TOP/bin/LINUX64/nwchem"
echo run=$run
$run
Applications -- Matlab
/lustre/work/apps/examples/matlab/matlab.sh is a Matlab script with a small test job, new4.m. It requires any compiler/MPI (except if using MEX), and +matlab in .soft.
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N testjob
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q development
#$ -pe fill 8
matlab -nodisplay -nojvm -r new4
Interactive Jobs
qlogin -q "qname" "resource requirements"
e.g.: qlogin -q HiMem -pe mpi 8
qlogin -q normal -pe mpi 16
Disadvantages: will not be run in batch mode; if the terminal closes, your job is killed.
Advantage: can debug jobs more easily.
Array Jobs
#$ -S /bin/bash
~/programs/program -i ~/data/input -o ~/results/output
Now, let's complicate things. Assume you have input files input.1, input.2, ..., input.10000, and you want the output to be placed in files with a similar numbering scheme. You could use Perl to generate 10000 shell scripts, submit them, and then clean up the mess later. Or, you could use an array job. The modification to the previous shell script is simple:
#$ -S /bin/bash
# Tell the SGE that this is an array job, with "tasks" to be numbered 1 to 10000
#$ -t 1-10000
# When a single command in the array job is sent to a compute node,
# its task number is stored in the variable SGE_TASK_ID,
# so we can use the value of that variable to get the results we want:
~/programs/program -i ~/data/input.$SGE_TASK_ID -o ~/results/output.$SGE_TASK_ID
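What SGE does per task can be simulated locally: it sets SGE_TASK_ID and runs the script once per value. A minimal sketch of how the per-task file names come out:

```shell
for SGE_TASK_ID in 1 2 3; do      # SGE would use 1..10000, one task each
    input=input.$SGE_TASK_ID      # the per-task names the script sees
    output=output.$SGE_TASK_ID
    echo "$input -> $output"
done
```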
Appl Guides / Questions
- Parallel Matlab
- Matlab toolboxes
Questions?
• Support
• Email: [email protected]
• Phone: 806-742-4378