HPCC: Training Session II (Advanced)
Srirangam Addepalli
Huijun Zhu
High Performance Computing Center
Jan-2011

Introduction and Outline
In this session:
1. Compiling and Running Serial and MPI Programs
2. Debugging serial and parallel code
3. Profiling serial and parallel code
4. Features of Sun Grid Engine and Local Setup
5. Features of Shell
6. Shell Commands
7. Additional SGE options
8. Applications of interest
9. Questions and Contacts

Compiler Optimizations
Compiler optimizations:

Windows      Linux       Comment
/Od          -O0         No optimization
/O1          -O1         Optimize for size
/O2          -O2         Optimize for speed and enable some optimization
/O3          -O3         Enable all optimizations as O2, and intensive loop optimizations
/QxO         -xO         Enables SSE3, SSE2 and SSE instruction set optimizations for non-Intel CPUs
/Qprof-gen   -prof_gen   Compile the program and instrument it for a profile-generating run.
/Qprof-use   -prof_use   May only be used after running a program that was previously compiled using prof_gen. Uses profile information during each step of the compilation process.
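To compare the Linux optimization levels from the table above on your own code, a small loop over -O0 through -O3 can be handy. The following is a minimal sketch, not part of the HPCC setup: the variables CC, SRC and DRY_RUN and the function compile_levels are illustrative assumptions, and by default the script only prints the commands instead of running them.

```shell
#!/bin/bash
# Hypothetical helper: print (or run) one compile command per optimization level.
# CC, SRC and DRY_RUN are assumptions for this sketch, not HPCC-defined variables.
CC=${CC:-gcc}            # set CC=icc to use the Intel compiler instead
SRC=${SRC:-unroll.c}     # source file to build
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = actually compile

compile_levels() {
    local level cmd
    for level in 0 1 2 3; do
        # One executable per level, e.g. unroll.O2 for -O2
        cmd="$CC -O$level $SRC -o ${SRC%.c}.O$level"
        if [ "$DRY_RUN" = 1 ]; then
            echo "$cmd"
        else
            # Unquoted on purpose so the string splits into words;
            # this assumes paths without spaces.
            $cmd || return 1
        fi
    done
}

compile_levels
```

Running it with DRY_RUN=0 builds four executables that can then be timed against each other.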
Compile Optimization: UNROLL

    for(int i=0;i<1000;i++)
    {
        a[i] = b[i] + c[i];
    }

    icc -unroll=8 unroll.c

    for(int i=0;i<1000;i+=8)
    {
        a[i]   = b[i]   + c[i];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
        a[i+4] = b[i+4] + c[i+4];
        a[i+5] = b[i+5] + c[i+5];
        a[i+6] = b[i+6] + c[i+6];
        a[i+7] = b[i+7] + c[i+7];
    }

Debugging
Compiling code with debug and profiling

    icc -unroll=8 unroll.c

Debug:

    icc -g -unroll=8 unroll.c -o test.exe
    icc -debug -unroll=8 unroll.c -o test.exe
    idb test.exe

Profiling
Compiling code with profiling

    icc -unroll=8 unroll.c

Profile:

    icc -g -c -prof_gen -unroll=8 unroll.c    (generates object files)
    icc -g -prof_use -unroll=8 unroll.o       (generates executable files)

Use
- Thread Profiler
- VTune Performance Analyzer

Using GCC:

    cp /lustre/work/apps/examples/Advanced/TimeWaste.c .
    gcc TimeWaste.c -pg -o TimeWaste -O2 -lc
    ./TimeWaste
    gprof TimeWaste gmon.out $options

$options:
    -p    Flat profile
    -q    Call graph
    -A    Annotated source (-g -pg compile options must be used)

Parallel Code
Compile with the -g option to enable debugging.
1.  cp /lustre/work/apps/examples/Advanced/pimpi.c .
    mpicc -g pimpi.c
2.  mpirun -np 4 -dbg=gdb a.out
    To run cpi with 2 processes, where the second process is run under the debugger, the session would look something like

        mpirun -np 2 cpi -p4norem
        waiting for process on host compute-19-15.local:
        /lustre/home/saddepal/cpi compute-19-15.local 38357 -p4amslave

    on the first machine and

        % gdb cpi

3.  Intel Parallel Studio is really good. We currently do not have a license, but users can request a free license. (Show idb)

Resources: Hrothgar / Janus
Hrothgar Cluster:
• 7680 cores, 640 nodes (12 cores/node)
• Intel(R) Xeon(R) @ 2.8 GHz
• 24 GB of memory per node
• DDR InfiniBand for MPI communication & storage
• 80 TB of parallel Lustre file system
• 86.2 Tflop peak performance
• 1024 cores for serial jobs
• Top 500 list: 109 in the world. Top 10 academic institutions in USA.
• 320 cores for community cluster
JANUS Windows Cluster:
• 40 cores
• 5 nodes (8 cores/node)
• 64 GB memory
• Visual Studio with Intel Fortran
• 700 GB storage

File System
Hrothgar has multiple file systems available for your use. There are three Lustre parallel file systems (physically all the same) and a physical disk scratch system on each compute node.
• $HOME is backed up, persistent, with quotas.
• $WORK is not backed up, is persistent, with quotas.
• $SCRATCH and /state/partition1 are not backed up, are not persistent, without quotas.
By "not persistent" we mean files will be cleared based on earliest last access time.
Lustre and physical disk have different performance characteristics. Lustre has much higher bandwidth and much higher latency, so it is better for large reads and writes. Disk is better for small reads and writes, particularly when they are intermixed.
Backups of $HOME are taken at night. Removed files cannot be recovered unless they were on $HOME, in which case they can be restored from the last backup. Critical files that cannot be replicated should be backed up on your personal system.

File System - 2

Location                     Quota
/lustre/home/eraiderid       100GB
/lustre/work/eraiderid       500GB
/lustre/scratch/eraiderid    none
/state/partition1/           none

Location                     Alias       Backed up   Size
/lustre/home/eraiderid       $HOME       yes         7 TB
/lustre/work/eraiderid       $WORK       no          22 TB
/lustre/scratch/eraiderid    $SCRATCH    no          43 TB

MPI Run
Let's compile and run an MPI program.

    mkdir mpi
    cd mpi
    cp /lustre/work/apps/examples/mpi/* .
    mpicc cpi.c        (or: mpif77 fpi.f)
    qsub mpi.sh

    $ echo $MCMD
    mpirun
    $ echo $MFIL
    machinefile

For mvapich2, those are mpirun_rsh and hostfile.
These are SGE variables:
$NSLOTS is the number of cores.
$SGE_CWD_PATH is the submit directory.
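To see how $MCMD and $MFIL combine with the SGE variables into a launch line, the sketch below assembles the command as a string for either MPI stack. It is only an illustration: launch_line is a hypothetical helper, and the fallback values (8 slots, job id 0, the current directory) are stand-ins so the sketch also runs outside an SGE job. The machine file name machinefile.$JOB_ID follows the convention used by the job scripts in this session.

```shell
#!/bin/bash
# Sketch: show how the launch line expands for the two MPI stacks.
# launch_line is a hypothetical helper, not an HPCC-provided command.
launch_line() {
    local mcmd=${MCMD:-mpirun}           # launcher: mpirun or mpirun_rsh
    local mfil=${MFIL:-machinefile}      # option name: machinefile or hostfile
    local nslots=${NSLOTS:-8}            # stand-in when not running under SGE
    local cwd=${SGE_CWD_PATH:-$PWD}      # submit directory
    local jobid=${JOB_ID:-0}             # stand-in job id
    echo "$mcmd -np $nslots -$mfil $cwd/machinefile.$jobid $cwd/a.out"
}

launch_line                                  # uses whatever the environment defines
MCMD=mpirun_rsh MFIL=hostfile launch_line    # the mvapich2 form
```

Printing the line before running it, as the job scripts do with echo, makes it easy to spot an unset variable in the job output file.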
MPI.sh

    #!/bin/bash
    #$ -V                      # export env to job: needed
    #$ -cwd                    # change to submit dir
    #$ -j y                    # ignore -e, merge
    #$ -S /bin/bash            # the job shell
    #$ -N mpi                  # the name, qstat label
    #$ -o $JOB_NAME.o$JOB_ID   # the job output file
    #$ -e $JOB_NAME.e$JOB_ID   # the job error file
    #$ -q development          # the queue, dev or normal
    #$ -pe fill 8              # 8,16,32,etc
    cmd="$MCMD -np $NSLOTS -$MFIL \
    $SGE_CWD_PATH/machinefile.$JOB_ID \
    $SGE_CWD_PATH/a.out"
    echo cmd=$cmd              # this expands and prints $cmd
    $cmd                       # this runs $cmd

Sun Grid Engine
◮ There are two queues: normal (48 hrs) and serial (120 hrs).
◮ When the time is up, your job is killed, but a signal is sent 2 minutes earlier so you can prepare for restart, and requeue if you want.
◮ serial supports both parallel and serial jobs (normal supports only parallel jobs).
◮ There are no queue limits per user, so we request that you do not submit, say, 410 one-node jobs if you are the first person on at a restart.
◮ We ask that you either use (1) all the cores or (2) all the memory on each node you request, so the minimum core (pe) count is 12, and increments are 12. If you are not using all the cores, request pe 12 anyway, but don't use $NSLOTS.
◮ Interactive script /lustre/work/apps/bin/qstata.sh will show all hosts with running jobs on the system (it's 724 lines of output).

Sun Grid Engine: Restart
SGE does not auto-restart a timed-out job. You have to tell it to in your command file. The SGE headers #$ are unchanged and are omitted. 2 minutes before the job is killed, SGE will send a "usr1" signal to the job. The "trap" command intercepts the signal, runs a script called "restart.sh", and exits with exit code 99. That exit code 99 tells SGE to requeue the job.

    trap "$SGE_CWD_PATH/restart.sh;exit 99" usr1
    echo MASTER=$HOSTNAME        # optional, prints run node
    echo RESTARTED=$RESTARTED    # optional, 1st run or not
    $SGE_CWD_PATH/myjob.sh
    ec=$?
                                 # these 5 lines (from the exit-code capture on)
    if [ "$ec" == 0 ]; then      # let the job finish when
        echo "COMPLETED"         # it's done by sending an
    fi                           # exit code 0
    exit $ec                     # normal end

Serial Jobs
Most serial jobs won't use 16 GB of memory, so run 8 at once. Use "-pe fill 8" for this. This example assumes you run in 8 subdirectories of your submit directory, but you don't have to if files don't conflict.

    ssh $HOSTNAME "cd $PWD/r1;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r2;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r3;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r4;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r5;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r6;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r7;$HOME/bin/mys <i.dat 1>out 2>&1" &
    ssh $HOSTNAME "cd $PWD/r8;$HOME/bin/mys <i.dat 1>out 2>&1" &
    r=8                      # these lines count
    while [ "$r" -ge 1 ]     # the ssh processes
    do                       # and sleep the master while they
        sleep 60             # are running so the job won't die
        r=`ps ux | grep $HOSTNAME | grep -v grep | wc -l`
    done

Application Checkpoint
Application checkpoint is supported for serial jobs.
• cr_run: runs the application. ./a.out should be replaced with cr_run ./a.out
• cr_checkpoint --term PID
• cr_restart contextfile.pid (where PID is the process id of the job)

Serial Jobs Auto Checkpoint

    #!/bin/bash
    #$ -V
    #$ -cwd
    #$ -S /bin/bash
    #$ -N GeorgeGains
    #$ -o $JOB_NAME.o$JOB_ID
    #$ -e $JOB_NAME.e$JOB_ID
    #$ -q kharecc
    export tmpdir=$SGE_CKPT_DIR/ckpt.$JOB_ID
    export currcpr=`cat $tmpdir/currcpr`
    export ckptfile=$tmpdir/context_$JOB_ID.$currcpr
    # restart from the checkpoint if this is a requeued run, else start fresh
    if [ "$RESTARTED" = 1 ] && [ -e "$tmpdir" ]; then
        echo "Restarting from $ckptfile" >> /tmp/restart.log
        /usr/bin/cr_restart $ckptfile
    else
        /usr/bin/cr_run a.out
    fi

Applications -- NWChem
/lustre/work/apps/examples/nwchem/nwchem.sh is an NWChem script almost like the MPI command file. There is a defined variable, your Lustre scratch area, called $SCRATCH.
If NWChem is started in that directory with no scratch defined in .nw, it will use that area for scratch, as intended. Here the machinefile and input file remain in the submit directory. This requires +intel, +mvapich1, and +nwchem-5.1 in .soft. You can check before submitting; a correct setup will show some files:

    ls $NWCHEM_TOP

    INFILE="siosi3.nw"
    cd $SCRATCH
    run="$MCMD -np $NSLOTS -$MFIL \
    $SGE_CWD_PATH/machinefile.$JOB_ID \
    $NWCHEM_TOP/bin/LINUX64/nwchem \
    ${SGE_CWD_PATH}/$INFILE"
    echo run=$run
    $run

Applications -- Matlab
• /lustre/work/apps/examples/matlab/matlab.sh is a Matlab script with a small test job, new4.m. It requires any compiler/MPI (except if using MEX), and +matlab in .soft.

    #!/bin/sh
    #$ -V
    #$ -cwd
    #$ -S /bin/bash
    #$ -N testjob
    #$ -o $JOB_NAME.o$JOB_ID
    #$ -e $JOB_NAME.e$JOB_ID
    #$ -q development
    #$ -pe fill 8
    matlab -nodisplay -nojvm -r new4

Interactive Jobs
• qlogin -q "qname" "resource requirements"
• e.g.: qlogin -q HiMem -pe mpi 8
•       qlogin -q normal -pe mpi 16
• Disadvantage: the job does not run in batch mode. If the terminal closes, your job is killed.
• Advantage: you can debug jobs more easily.

Array Jobs

    #!/bin/sh
    #$ -S /bin/bash
    ~/programs/program -i ~/data/input -o ~/results/output

Now, let's complicate things. Assume you have input files input.1, input.2, ..., input.10000, and you want the output to be placed in files with a similar numbering scheme. You could use perl to generate 10000 shell scripts, submit them, then clean up the mess later. Or, you could use an array job.
The modification to the previous shell script is simple:

    #!/bin/sh
    #$ -S /bin/bash
    # Tell the SGE that this is an array job, with "tasks" to be numbered 1 to 10000
    #$ -t 1-10000
    # When a single command in the array job is sent to a compute node,
    # its task number is stored in the variable SGE_TASK_ID,
    # so we can use the value of that variable to get the results we want:
    ~/programs/program -i ~/data/input.$SGE_TASK_ID -o ~/results/output.$SGE_TASK_ID

Appl Guides/Questions
• Parallel Matlab
• Matlab toolboxes

Questions?
• Support
• Email: hpccsupport@ttu.edu
• Phone: 806-742-4378
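As a follow-up to the array job slide: before submitting 10000 tasks, it can help to check the file-name mapping locally. The sketch below is an illustration only. task_paths is a hypothetical helper, and the default task number of 1 is an assumption that lets the script run outside SGE, where SGE_TASK_ID is not set by the scheduler.

```shell
#!/bin/bash
# Sketch: derive the per-task input and output names from SGE_TASK_ID.
# task_paths is a hypothetical helper, not an SGE command.
task_paths() {
    local id=${SGE_TASK_ID:-1}    # stand-in value when run outside an array job
    echo "$HOME/data/input.$id $HOME/results/output.$id"
}

# Split the two names (assumes $HOME contains no spaces)
read -r infile outfile <<< "$(task_paths)"
echo "task ${SGE_TASK_ID:-1}: $infile -> $outfile"

# Inside the array job the program call would then read:
#   ~/programs/program -i "$infile" -o "$outfile"
```

Running it with, for example, SGE_TASK_ID=7 set in the shell shows the names task 7 would use without going through the queue.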