HPCC: Training Session II (Advanced)
Srirangam Addepalli
Huijun Zhu
High Performance Computing Center
January 2011
Introduction and Outline
In this session
1. Compiling and running serial and MPI programs
2. Debugging serial and parallel code
3. Profiling serial and parallel code
4. Features of Sun Grid Engine and local setup
5. Features of the shell
6. Shell commands
7. Additional SGE options
8. Applications of interest
9. Questions and contacts
Compiler Optimizations
Compiler optimization flags (Windows / Linux):

Windows      Linux        Comment
/Od          -O0          No optimization
/O1          -O1          Optimize for size
/O2          -O2          Optimize for speed and enable some optimizations
/O3          -O3          Enable all -O2 optimizations plus intensive loop optimizations
/QxO         -xO          Enable SSE3, SSE2 and SSE instruction-set optimizations for non-Intel processors
/Qprof-gen   -prof_gen    Compile the program and instrument it for a profile-generating run
/Qprof-use   -prof_use    May only be used after running a program previously compiled with
                          -prof_gen; uses the profile information during each step of compilation
Compile Optimization
UNROLL

for (int i = 0; i < 1000; i++)
{
    a[i] = b[i] + c[i];
}

icc -unroll=8 unroll.c

With unrolling by a factor of 8, the compiler effectively generates:

for (int i = 0; i < 1000; i += 8)
{
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
    a[i+4] = b[i+4] + c[i+4];
    a[i+5] = b[i+5] + c[i+5];
    a[i+6] = b[i+6] + c[i+6];
    a[i+7] = b[i+7] + c[i+7];
}
Debugging
Compiling code with debugging:

icc -unroll=8 unroll.c

Debug:
icc -g -unroll=8 unroll.c -o test.exe
icc -debug -unroll=8 unroll.c -o test.exe
idb ./test.exe
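For reference, a minimal gdb session on the debug build might look like the sketch below; gdb is used later in this session, and the commands shown are ordinary gdb commands with an illustrative breakpoint:

gdb ./test.exe
# inside the debugger (illustrative commands):
#   break main     stop at the start of main
#   run            start the program
#   next           step over one source line
#   print i        inspect the loop variable
#   quit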
Profiling
Compiling code with profiling:

icc -unroll=8 unroll.c

Profile:
icc -g -c -prof_gen -unroll=8 unroll.c   # generates instrumented object files
icc -g -prof_use -unroll=8 unroll.o      # generates the executable using the profile data

Use the Thread Profiler or the VTune performance analyzer to examine the results.

Using GCC:
cp /lustre/work/apps/examples/Advanced/TimeWaste.c .
gcc TimeWaste.c -pg -o TimeWaste -O2 -lc
./TimeWaste
gprof TimeWaste gmon.out $options

-p   flat profile
-q   call graph
-A   annotated source (the -g and -pg compile options must be used)
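For example, to get just the flat profile from the run above, $options can be replaced by -p and the output saved to a file (the file name is arbitrary):

gprof -p TimeWaste gmon.out > flat_profile.txt
less flat_profile.txt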
Parallel Code
Compile with the -g option to enable debugging.

1. cp /lustre/work/apps/examples/Advanced/pimpi.c .
   mpicc -g pimpi.c

2. mpirun -np 4 -dbg=gdb a.out

To run cpi with 2 processes, where the second process is run under the
debugger, the session would look something like this:

mpirun -np 2 cpi -p4norem
waiting for process on host compute-19-15.local:
/lustre/home/saddepal/cpi compute-19-15.local 38357 -p4amslave

on the first machine, and then the second process is started by hand under
the debugger:

% gdb cpi

3. Intel Parallel Studio is very good. We currently do not have a license,
   but users can request a free license.
   (Show idb)
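Another common approach, sketched below on the assumption that a.out was built with mpicc -g as above, is to start the MPI job normally and attach gdb to one running rank by its process ID:

mpirun -np 4 ./a.out &
# on the node where the rank of interest is running, find its PID:
ps -u $USER | grep a.out
# attach the debugger to that rank (replace <pid> with the number found):
gdb ./a.out <pid>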
Resources: Hrothgar/ Janus
Hrothgar Cluster:
 7680 cores, 640 nodes (12 cores/node)
 Intel(R) Xeon(R) @ 2.8 GHz
 24 GB of memory per node
 DDR InfiniBand for MPI communication and storage
 80 TB of parallel Lustre file system
 86.2 Tflops peak performance
 1024 cores for serial jobs
 Top 500 list: 109 in the world; top 10 among academic institutions in the USA
 320 cores for the community cluster

JANUS Windows Cluster:
 40 cores, 5 nodes (8 cores/node)
 64 GB memory
 Visual Studio with Intel Fortran
 700 GB storage
File System
Hrothgar has multiple file systems available for your use. There are three
Lustre parallel file systems (physically all the same) and a physical-disk
scratch area on each compute node.

$HOME is backed up, persistent, with quotas.
$WORK is not backed up, is persistent, with quotas.
$SCRATCH and /state/partition1 are not backed up, are not persistent, and
have no quotas.

By "not persistent" we mean the space will be cleared based on earliest last
access time. Lustre and physical disk have different performance
characteristics. Lustre has much higher bandwidth and much higher latency,
so it is better for large reads and writes. Disk is better for small reads
and writes, particularly when they are intermixed. Backups of $HOME are
taken at night. Removed files are gone forever but can be restored from the
last backup if they were on $HOME. Critical files that cannot be replicated
should be backed up on your personal system.
File System -2
Location                    Alias      Quota    Backed up   Size
/lustre/home/eraiderid      $HOME      100 GB   yes         7 TB
/lustre/work/eraiderid      $WORK      500 GB   no          22 TB
/lustre/scratch/eraiderid   $SCRATCH   none     no          43 TB
/state/partition1/          -          none     no          -
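To check your usage against these quotas, the Lustre client tools can report per-user numbers; a minimal sketch, assuming the lfs utility is in your path:

lfs quota -u $USER /lustre/home
lfs quota -u $USER /lustre/work
df -h /lustre/scratch     # overall capacity, not a per-user quota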
MPI Run
Let’s compile and run an MPI program.
mkdir mpi
cd mpi
cp /lustre/work/apps/examples/mpi/* .
mpicc cpi.c, or
mpif77 fpi.f
qsub mpi.sh
$ echo $MCMD
mpirun
$ echo $MFIL
machinefile
For mvapich2, those are mpirun_rsh and hostfile. These are SGE variables:
$NSLOTS is the number of cores.
$SGE_CWD_PATH is the submit directory.
MPI.sh
#!/bin/bash
#$ -V
# export the environment to the job: needed
#$ -cwd
# change to the submit directory
#$ -j y
# merge stderr into stdout (the -e file is ignored)
#$ -S /bin/bash
# the job shell
#$ -N mpi
# the job name, the qstat label
#$ -o $JOB_NAME.o$JOB_ID
# the job output file
#$ -e $JOB_NAME.e$JOB_ID
# the job error file
#$ -q development
# the queue: development or normal
#$ -pe fill 8
# core count: 8, 16, 32, etc.
cmd="$MCMD -np $NSLOTS -$MFIL \
$SGE_CWD_PATH/machinefile.$JOB_ID \
$SGE_CWD_PATH/a.out"
echo cmd=$cmd   # this expands and prints $cmd
$cmd            # runs $cmd
Sun Grid Engine
◮ There are two queues: normal (48 hrs) and serial (120 hrs).
◮ When the time is up, your job is killed, but a signal is sent 2 minutes
earlier so you can prepare for restart, and requeue if you want.
◮ The serial queue supports both parallel and serial jobs (normal supports only parallel jobs).
◮ There are no queue limits per user, so we request that you do not
submit, say, 410 one-node jobs if you are the first person on at a
restart.
◮ We ask that you either use (1) all the cores or (2) all the memory on
each node you request, so the minimum core (pe) count is 12, and
increments are 12. If you are not using all the cores, request pe 12
anyway, but don’t use $NSLOTS.
◮ Interactive script /lustre/work/apps/bin/qstata.sh will show
all hosts with running jobs on the system (it’s 724 lines of output).
Sun Grid Engine : Restart
SGE does not auto-restart a timed-out job; you have to tell it to in your
command file. The SGE #$ headers are unchanged and are omitted here.

Two minutes before the job is killed, SGE sends a USR1 signal to the job.
The "trap" command intercepts the signal, runs a script called "restart.sh",
and exits with code 99. Exit code 99 tells SGE to requeue the job.

trap "$SGE_CWD_PATH/restart.sh; exit 99" USR1
echo MASTER=$HOSTNAME        # optional, prints the run node
echo RESTARTED=$RESTARTED    # optional, first run or not
$SGE_CWD_PATH/myjob.sh
ec=$?
# These lines let the job finish normally when it exits with code 0.
if [ $ec -eq 0 ]; then
    echo "COMPLETED"
fi
exit $ec                     # normal end
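To see the trap mechanism by itself before wiring it into a job script, the stand-alone sketch below (not one of the site scripts) reacts to USR1 in the same way; send the signal from another terminal with kill -USR1 <pid>:

#!/bin/bash
# Stand-alone demonstration of trapping USR1 (illustrative only)
trap 'echo "got USR1, saving state"; exit 99' USR1
sleep 300 &    # background the sleep so the signal interrupts the wait builtin
wait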
Serial Jobs
Most serial jobs won't use 16 GB of memory, so run 8 at once.
Use "-pe fill 8" for this.
This example assumes you run in 8 subdirectories of your submit directory,
but you don't have to if the files don't conflict.

ssh $HOSTNAME "cd $PWD/r1;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r2;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r3;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r4;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r5;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r6;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r7;$HOME/bin/mys <i.dat 1>out 2>&1" &
ssh $HOSTNAME "cd $PWD/r8;$HOME/bin/mys <i.dat 1>out 2>&1" &

# These lines count the ssh processes and sleep the master shell while
# they are running, so the job won't die before they finish.
r=8
while [ "$r" -ge 1 ]
do
    sleep 60
    r=$(ps ux | grep $HOSTNAME | grep -v grep | wc -l)
done
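A simpler alternative to the counting loop, since the ssh commands are backgrounded children of this same shell, is the bash wait builtin, which blocks until all of them have exited:

wait                           # returns once every backgrounded ssh command has finished
echo "all 8 runs finished"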
Application Checkpoint
Application checkpointing is supported for serial jobs.

cr_run        - to run the application:
                ./a.out should be replaced with cr_run ./a.out
cr_checkpoint --term PID
cr_restart contextfile.PID
(where PID is the process ID of the job)
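Putting the three commands together, an end-to-end sketch might look like the following; the context file name follows the placeholder used above, so substitute the actual file that cr_checkpoint writes on your system:

cr_run ./a.out &              # start the application under checkpoint control
PID=$!                        # remember its process ID
cr_checkpoint --term $PID     # write a context file and terminate the process
cr_restart contextfile.$PID   # later, resume from the context file written above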
Serial Jobs Auto Checkpoint
#!/bin/bash
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N GeorgeGains
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q kharecc
export tmpdir=$SGE_CKPT_DIR/ckpt.$JOB_ID
export currcpr=`cat $tmpdir/currcpr`
export ckptfile=$tmpdir/context_$JOB_ID.$currcpr
if [ "$RESTARTED" = "1" ] && [ -e "$tmpdir" ]; then
    echo "Restarting from $ckptfile" >> /tmp/restart.log
    /usr/bin/cr_restart $ckptfile
else
    /usr/bin/cr_run ./a.out
fi
Applications -- NWChem
NWChem
/lustre/work/apps/examples/nwchem/nwchem.sh is an NWChem script almost like the MPI
command file. There is a defined variable for your Lustre scratch area, called $SCRATCH.
If NWChem is started in that directory with no scratch directory defined in the .nw file,
it will use that area for scratch, as intended. Here the machinefile and input file remain
in the submit directory. This requires +intel, +mvapich1, and +nwchem-5.1 in .soft. You
can check before submitting; if the setup is correct, this will show some files:
ls $NWCHEM_TOP
INFILE="siosi3.nw"
cd $SCRATCH
run="$MCMD -np $NSLOTS -$MFIL \
$SGE_CWD_PATH/machinefile.$JOB_ID \
$NWCHEM_TOP/bin/LINUX64/nwchem \
${SGE_CWD_PATH}/$INFILE"
echo run=$run
$run
Applications -- Matlab
/lustre/work/apps/examples/matlab/matlab.sh is a MATLAB script with a small
test job, new4.m. It requires any compiler/MPI (except if using MEX), and
+matlab in .soft.
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N testjob
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q development
#$ -pe fill 8
matlab -nodisplay -nojvm -r new4
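Submission is the same as for any other SGE script; assuming the script above is saved as matlab.sh in the directory that contains new4.m:

qsub matlab.sh
# when the job finishes, the MATLAB console output is in testjob.o<jobid>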
Interactive Jobs
• qlogin -q "qname" "resource requirements"
• e.g.: qlogin -q HiMem -pe mpi 8
•       qlogin -q normal -pe mpi 16
• Disadvantage: the job does not run in batch mode; if the terminal
  closes, your job is killed.
• Advantage: jobs can be debugged more easily.
Array Jobs
#!/bin/sh
#$ -S /bin/bash
~/programs/program -i ~/data/input -o ~/results/output

Now, let's complicate things. Assume you have input files input.1, input.2,
..., input.10000, and you want the output to be placed in files with a
similar numbering scheme. You could use perl to generate 10000 shell
scripts, submit them, then clean up the mess later. Or, you could use an
array job. The modification to the previous shell script is simple:

#!/bin/sh
#$ -S /bin/bash
# Tell SGE that this is an array job, with "tasks" numbered 1 to 10000
#$ -t 1-10000
# When a single command in the array job is sent to a compute node,
# its task number is stored in the variable SGE_TASK_ID,
# so we can use the value of that variable to get the results we want:
~/programs/program -i ~/data/input.$SGE_TASK_ID -o ~/results/output.$SGE_TASK_ID
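Submitting the array job is a single qsub; SGE then schedules the tasks on its own. A sketch, assuming the script above is saved as arrayjob.sh (a placeholder name):

qsub arrayjob.sh    # one submission creates all 10000 tasks under a single job ID
qstat -u $USER      # waiting and running tasks are listed with their task numbers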
Appl Guides/Questions
 Parallel MATLAB
 MATLAB toolboxes
 Questions?

• Support
• Email: hpccsupport@ttu.edu
• Phone: 806-742-4378