Parallel Programming Orientation
Agenda
 Parallel jobs
» Paradigms to parallelize algorithms
» Profiling and compiler optimization
» Implementations to parallelize code
• OpenMP
• MPI
 Queuing
» Job queuing
» Integrating Parallel Programs
 Questions and Answers
Traditional “Ad-Hoc” Linux Cluster
 Full Linux install to disk, loaded into memory
» Manual & slow; 5-30 minutes
 Full set of disparate daemons, services, user/password, & host access setup
 Basic parallel shell with complex glue scripts to run jobs
 Monitoring & management added as isolated tools
(Diagram: Master Node connected to compute nodes over the Interconnection Network and to the Internet or Internal Network)
Cluster Virtualization Architecture Realized
 Minimal in-memory OS with a single daemon, rapidly deployed in seconds; no disk required
» Less than 20 seconds
 Virtual, unified process space enables intuitive single sign-on and job submission
» Effortless job migration to nodes
 Monitor & manage efficiently from the Master
» Single System Install
» Single Process Space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which is inherently more reliable
(Diagram: Master Node with optional disks, connected to compute nodes over the Interconnection Network and to the Internet or Internal Network)
Just a Primer
 Only a brief introduction is provided here. Many other
in-depth tutorials are available on the web and in
published sources.
» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
» https://computing.llnl.gov/?set=training&page=index
Parallel Code Primer
 Paradigms for writing parallel programs depend upon the
application
» SIMD (single-instruction multiple-data)
» MIMD (multiple-instruction multiple-data)
» MISD (multiple-instruction single-data)
 SIMD will be presented here as it is a commonly used template
» A single application source is compiled to perform operations on
different sets of data (Single Program Multiple Data (SPMD) model)
» The data is read by the different threads or passed between threads via
messages (hence MPI = message passing interface)
• Contrast this with shared memory or OpenMP, where data is accessed locally via
memory
• Optimizations in the MPI implementation can perform localhost
optimization; however, the program is still written using a message
passing construct
Explicitly Parallel Programs
 Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
 Most distributed parallel programs are now written
using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of
MPICH and OpenMPI
Example Code
 Calculate π through numerical integration
» Compute π by integrating f(x) = 4/(1 + x**2) from 0 to 1
• Function is the derivative of arctan(x)
• See source code
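» A rough serial sketch of the integration (the actual cpi-serial.c used in the timings may differ in its details):

/* Midpoint-rule integration of 4/(1 + x^2) over [0,1]; illustrative sketch only */
#include <stdio.h>
#include <math.h>

double f(double a) { return 4.0 / (1.0 + a * a); }

int main(void)
{
    double PI25DT = 3.141592653589793238462643;
    int    n = 400000000;            /* number of intervals, as in the timing runs */
    double h = 1.0 / (double)n;      /* width of each interval */
    double sum = 0.0, x;
    int    i;

    printf("Process 0\n");
    for (i = 1; i <= n; i++) {
        x = h * ((double)i - 0.5);   /* midpoint of interval i */
        sum += f(x);
    }
    printf("pi is approximately %.16f, Error is %.16f\n",
           h * sum, fabs(h * sum - PI25DT));
    return 0;
}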
Compiling and running code
set n = 400,000,000

$ gcc -o cpi-serial cpi-serial.c
$ time ./cpi-serial
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.009s
user    0m11.007s
sys     0m0.000s

$ gcc -g -pg -o cpi-serial_prof cpi-serial.c
$ time ./cpi-serial_prof
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.012s
user    0m11.010s
sys     0m0.000s

$ ls -ltra
total 40
…
-rw-rw-r-- 1 jhan jhan 586 Mar  7 14:45 gmon.out
Profiling
 -g flag includes debugging information in the binary; useful for gdb tracing of an
application and for profiling
 -pg flag generates code to write profile information
$ gprof cpi-serial_prof gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
74.48      2.85      2.85                             main
23.95      3.77      0.92 400000000    2.29     2.29  f
 2.37      3.86      0.09                             frame_dummy

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.26% of 3.86 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     97.7    2.85    0.92                 main [1]
                0.92    0.00 400000000/400000000     f [2]
-----------------------------------------------
                0.92    0.00 400000000/400000000     main [1]
[2]     23.8    0.92    0.00 400000000         f [2]
-----------------------------------------------
                                                 <spontaneous>
[3]      2.3    0.09    0.00                 frame_dummy [3]
-----------------------------------------------
 -A flag to gprof shows calls next to source code
Profiling Tips
 Code should be profiled using realistic data set
» Contrast the call graphs of n=100 versus n=400,000,000
 Profiling can give tips about where to optimize the
current algorithm, but it can’t suggest alternative
(better) algorithms
» e.g. Monte Carlo algorithm to calculate π
 Amdahl’s Law
» The speedup parallelization achieves is limited by the serial
part of the code
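» As a rough illustration (numbers chosen for the example, not measured from this code): if a fraction P of the runtime can be parallelized, the speedup on N processors is roughly 1 / ((1 - P) + P/N). With P = 0.95 and N = 8 that is about 5.9x, and no number of processors can push it past 1 / (1 - P) = 20x.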
OpenMP Introduction
 Parallelization using shared memory in a single
machine
» A portion of the code is forked on the machine to parallelize it
» i.e. not distributed parallelization
 Done using pragmas in the source code. Compiler
must support OpenMP (gcc 4, Intel, etc.)
» $ gcc -fopenmp -o cpi-openmp cpi-openmp.c
» See Source Code
 Profiling can add overhead to the resulting executable
 ‘time’ can be used to measure improvement
 Runtime selection of the number of threads using
OMP_NUM_THREADS environment variable
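» A minimal sketch of how the same loop might be parallelized with an OpenMP pragma (the actual cpi-openmp.c may differ):

/* Illustrative OpenMP version: the parallel-for pragma splits the loop across
   threads and the reduction clause combines the per-thread partial sums */
#include <stdio.h>
#include <math.h>

double f(double a) { return 4.0 / (1.0 + a * a); }

int main(void)
{
    double PI25DT = 3.141592653589793238462643;
    int    n = 400000000;
    double h = 1.0 / (double)n;
    double sum = 0.0;
    int    i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += f(x);
    }
    printf("pi is approximately %.16f, Error is %.16f\n",
           h * sum, fabs(h * sum - PI25DT));
    return 0;
}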
Scaling with OpenMP
$ time OMP_NUM_THREADS=1 ./cpi-openmp
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ time OMP_NUM_THREADS=2 ./cpi-openmp
Process 0
Process 1
pi is approximately 3.1415926535900218, Error is 0.0000000000002287

real    0m5.295s
user    0m11.297s
sys     0m0.000s

$ time OMP_NUM_THREADS=4 ./cpi-openmp
…
real    0m2.650s
user    0m10.586s
sys     0m0.003s

$ time OMP_NUM_THREADS=8 ./cpi-openmp
…
real    0m1.416s
user    0m11.294s
sys     0m0.001s
Scaling with OpenMP
(Chart: OpenMP Speedup - measured speedup plotted against ideal speedup as a function of the number of processors, 1 to 8)
 Code is easy to parallelize
 Good scaling is seen up to 8 processors; a kink in the curve is expected
Role of the Compiler
 OpenMP parallelization requires compiler support (gcc 4, Intel, etc.); it is done using
pragmas in the source code
» $ gcc -fopenmp -o cpi-openmp cpi-openmp.c
 Compiler choice and optimization flags strongly affect the resulting runtime (see the
GCC versus Intel timings that follow)
 ‘time’ can be used to measure improvement
 Runtime selection of the number of threads using the
OMP_NUM_THREADS environment variable
GCC versus Intel C
$ time OMP_NUM_THREADS=1 ./cpi-openmp
…
real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m3.154s
user    0m3.143s
sys     0m0.011s

$ time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
…
real    0m0.399s
user    0m3.181s
sys     0m0.001s

$ icc -openmp -o cpi-openmp-icc cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-icc
Process 0
pi is approximately 3.1415926535899272, Error is 0.0000000000001341

real    0m1.618s
user    0m1.575s
sys     0m0.000s

$ time OMP_NUM_THREADS=8 ./cpi-openmp-icc
…
real    0m0.204s
user    0m1.584s
sys     0m0.001s
Compiler Timings
OpenMP Summary
 OpenMP provides a mechanism to parallelize within a
single machine
 Shared memory and variables are handled
“automatically”
 With an appropriate compiler, OpenMP can provide
significant speedups
 Coupled with large core count SMP machines, OpenMP
could be all of the parallelization required
» GPU programming is similar to the OpenMP model
Running MPI Code
 Binaries are executed simultaneously
» on the same machine or different machines
 After the binaries start running, the
MPI_COMM_WORLD communicator is established
 Any data to be transferred must be explicitly
determined by the programmer
 Hooks exist for a number of languages
» E.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)
Example MPI Source
 cpi.c calculates π using MPI in C

/* cpi.c -- compute pi by integrating f(x) = 4/(1 + x**2) */
#include "mpi.h"     /* system include file that defines the MPI functions */
#include <stdio.h>
#include <math.h>

double f( double );
double f( double a )
{
    return (4.0 / (1.0 + a*a));
}

int main( int argc, char *argv[] )
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);                            /* initialize the MPI execution environment */
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);          /* number of processes in the communicator group */
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);              /* rank of the calling process in the communicator */
    MPI_Get_processor_name(processor_name,&namelen);  /* name of the processor this rank runs on */

    fprintf(stderr,"Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done)
    {
        if (myid == 0)      /* actions differ based on rank: only the "master" (rank 0) sets n */
        {
            /*
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
            */
            if (n==0) n=100; else n=0;
            startwtime = MPI_Wtime();
        }
        /* broadcast the value of n (one MPI_INT) from rank 0 to all other processes in MPI_COMM_WORLD */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else
        {
            h   = 1.0 / (double) n;
            sum = 0.0;
            /* each worker does the loop on increments based on its rank
               (versus dividing the range, which risks an off-by-one error) */
            for (i = myid + 1; i <= n; i += numprocs)
            {
                x = h * ((double)i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;
            /* sum (built-in MPI_SUM) the MPI_DOUBLE value &mypi from all processes
               of the group into the single value &pi at rank 0 */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE,
                       MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)   /* only rank 0 outputs the value of pi and the wall clock time */
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }
    MPI_Finalize();          /* terminate MPI execution */
    return 0;
}
Other Common MPI Functions
 MPI_Send, MPI_Recv
» Blocking send and receive between two specific ranks
 MPI_Isend, MPI_Irecv
» Non-blocking send and receive between two specific ranks
 man pages exist for the MPI functions
 Poorly written programs can suffer from poor communication efficiency
(e.g. stair-stepping) or lose data if the system buffer fills before a blocking
send or receive is posted to match a non-blocking receive or send
 Use care when creating temporary files: multiple ranks may be running on the
same host and overwrite the same temporary file
(include the rank in the file name, in a unique temporary directory per simulation)
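» A minimal sketch (not taken from this deck's source files) of a blocking point-to-point transfer between two ranks; the value 42 is arbitrary:

/* Rank 0 sends one integer to rank 1, which blocks in MPI_Recv until it arrives.
   Run with at least 2 processes, e.g. mpirun -np 2 */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* dest=1, tag=0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* src=0, tag=0 */
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}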
Compiling MPICH programs
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /usr/lib64/MPICH
» Environment variables can be used to set the compiler:
• CC, CPP, FC, F90
» Command line options to set the compiler:
• -cc=, -cxx=, -fc=, -f90=
» GNU, PGI, and Intel compilers are supported
Running MPICH programs
 mpirun is used to launch MPICH programs
 Dynamic allocation can be done when using the -np flag
 Mapping is also supported when using the -map flag
 If Infiniband is installed, the interconnect fabric can be
chosen using the -machine flag:
» -machine p4
» -machine vapi
Scaling with MPI
$ which mpicc
/usr/bin/mpicc
$ mpicc -show -o cpi-mpi cpi-mpi.c
gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o cpi-mpi cpi-mpi.c -lmpi -lbproc
$ mpicc -o cpi-mpi cpi-mpi.c

$ time mpirun -np 1 ./cpi-mpi
Process 0 on scyld.localdomain
…
real    0m11.198s
user    0m11.187s
sys     0m0.010s

$ time mpirun -np 2 ./cpi-mpi
Process 0 on scyld.localdomain
Process 1 on n0
…
real    0m6.486s
user    0m5.510s
sys     0m0.009s

$ time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi
…
real    0m1.283s
user    0m1.381s
sys     0m0.016s
Environment Variable Options
 Additional environment variable control:
» NP — The number of processes requested, but not the number of
processors. As in the example earlier in this section, NP=4 ./a.out will run
the MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs
available to the current user. Similar to the example above, ALL_CPUS=1
./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES — Set the number of processes to the number of nodes
available to the current user. Similar to the ALL_CPUS variable, but you get
a maximum of one CPU per node. This is useful for running a job per node
instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging
purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node
assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node
listed will be the first process (MPI Rank 0) and so on.
Compiling and Running OpenMPI
programs
 The env-modules package allows users to change their
environment variables according to predefined files
» module avail
» module load openmpi/gnu
» GNU, PGI, and Intel compilers are supported
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /opt/scyld/openmpi
 mpirun is used to run code
 Interconnect can be selected at runtime
» -mca btl openib,tcp,sm,self
» -mca btl udapl,tcp,sm,self
Compiling and Running OpenMPI
programs
What env-modules does:
 Set user environment prior to compiling
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
 mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and
link in the correct MPI libraries from /opt/scyld/openmpi
» Environment variables can be used to set the compiler:
• OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
 Prior to running, PATH and LD_LIBRARY_PATH should be set
» module load openmpi/gnu
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
OR:
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
export MANPATH=/opt/scyld/openmpi/gnu/share/man
export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:${LD_LIBRARY_PATH}
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
Scaling with MPI Implementations
(Chart: "Scaling with MPI for gcc" - speedup relative to GCC vs. number of processors, up to 16. Series: Ideal Speedup 1:1; MPICC - GCC - P4; MPICC - GCC - VAPI; OpenMPI - GCC - tcp,sm,self; OpenMPI - GCC - openib,sm,self)
 Infiniband allows wider scaling
 Performance difference between MPICH and OpenMPI is visible
 A little artificial because only two physical machines were used
Scaling with MPI Implementations
(Chart: "Compiler Comparison" - speedup relative to GCC vs. number of processors, up to 16. Series: Ideal Speedup 1:1; MPICC - GCC - P4; MPICC - GCC - VAPI; OpenMPI - GCC - tcp,sm,self; OpenMPI - GCC - openib,sm,self; Ideal Speedup 3.49:1; MPICC - GCC -O3 - P4; MPICC - GCC -O3 - VAPI; MPICC - ICC - P4; MPICC - ICC - VAPI; OpenMPI - GCC -O3 - tcp,sm,self; OpenMPI - GCC -O3 - openib,sm,self; OpenMPI - ICC - tcp,sm,self; OpenMPI - ICC - openib,sm,self)
 Larger problems would allow continued scaling
MPI Summary
 MPI provides a mechanism to parallelize in a
distributed fashion
» Localhost optimization is done on a shared memory
machine
 Shared variables are explicitly handled by the
developer
 Tradeoff between CPU and I/O can determine the
performance characteristics
 Hybrid programming models are possible
» MPI code with OpenMP sections
» MPI code with GPU calls
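» A rough sketch of the hybrid MPI + OpenMP structure (illustrative only, not from this deck's sources); it would be compiled with the MPI wrapper plus the OpenMP flag (e.g. mpicc -fopenmp):

/* Each MPI rank takes a strided share of the loop; an OpenMP pragma then splits
   that share across the cores of the node the rank runs on. No MPI calls are
   made inside the threaded region. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int    myid, numprocs, i, n = 400000000;
    double h, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    h = 1.0 / (double)n;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    /* combine the per-rank partial results at rank 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}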
Queuing
 How are resources allocated among multiple users
and/or groups?
» Statically by using bpctl user and group permissions
» ClusterWare supports a variety of queuing packages
• TaskMaster (advanced MOAB policy-based scheduler integrated with
ClusterWare)
• Torque
• SGE
Interacting with Torque
 To submit a job:
» qsub script.sh
• Example script.sh:
#!/bin/sh
#PBS -j oe
#PBS -l nodes=4
cd $PBS_O_WORKDIR
hostname
• qsub does not accept arguments for script.sh. All executable
arguments must be included in the script itself
» Administrators can create a ‘qapp’ script that takes user
arguments, creates script.sh with the user arguments
embedded, and runs ‘qsub script.sh’
Interacting with Torque
 Other commands
» qstat – Status of queue server and jobs
» qdel – Remove a job from the queue
» qhold, qrls – Hold and release a job in the queue
» qmgr – Administrator command to configure pbs_server
» /var/spool/torque/server_name: should match hostname of the head
node
» /var/spool/torque/mom_priv/config: file to configure pbs_mom
• ‘$usecp *:/home /home’ indicates that pbs_mom should use ‘cp’
rather than ‘rcp’ or ‘scp’ to relocate the stdout and stderr files at the end of
execution
» pbsnodes – Administrator command to monitor the status of the
resources
» qalter – Administrator command to modify the parameters of a
particular job (e.g. requested time)
Other options to qsub
 Options that can be included in a script (with the #PBS
directive) or on the qsub command line
» Join output and error files: #PBS -j oe
» Request resources: #PBS -l nodes=2:ppn=2
» Request walltime: #PBS -l walltime=24:00:00
» Define a job name: #PBS -N jobname
» Send mail at job events: #PBS -m be
» Assign job to an account: #PBS -A account
» Export current environment variables: #PBS -V
 To start an interactive queue job use:
» qsub -I for Torque
» qrsh for SGE
Queue script case studies
 qapp script:
» Be careful about escaping special characters in the redirect section (\$, \', \")

#!/bin/bash
#Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
  echo "Not enough arguments"
  exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
  qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
  /bin/rm -f app.sh
fi
Queue script case studies
 Using local scratch:
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/bin/mkdir -p $tmpdir
rsync -a ./ $tmpdir
cd $tmpdir
$pathto/app arg1 arg2
cd $PBS_O_WORKDIR
rsync -a $tmpdir/ .
/bin/rm -fr $tmpdir
Queue script case studies
 Using local scratch for MPICH parallel jobs:
» pbsdsh is a Torque command

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
mpirun -machine vapi $pathto/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Queue script case studies
 Using local scratch for OpenMPI parallel jobs:
» Do a 'module load openmpi/gnu' prior to running qsub
» OR explicitly include a 'module load openmpi/gnu' in the script itself

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Other considerations
 A queue script need not be a single command
» Multiple steps can be performed from a single script
• Guaranteed resources
• Jobs should typically be a minimum of 2 minutes
» Pre-processing and post-processing can be done from the
same script using the local scratch space
» If configured, it is possible to submit additional jobs from a
running queued job
 To remove multiple jobs from the queue:
» qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel
Integrating Parallel Programs
 The scheduler only keeps track of available resources
 It does not monitor how the resources are actually used
 Onus is on the user to request and use the correct
resources
» OpenMP: be sure to request multiple processors on the same
machine
• Torque: #PBS -l nodes=1:ppn=x
• SGE: Correct PE (parallel environment) submission
» MPI: be sure to use the machines that have been assigned by
the queue system
• Torque: MPICH and OpenMPI mpirun will do the correct thing.
$PBS_NODEFILE contains a list of assigned hosts
• SGE: $PE_HOSTFILE contains a list of assigned hosts. OpenMPI’s
mpirun may need to be recompiled
Integrating Parallel Programs
 Be careful about task pinning (taskset)
» Different jobs may assume the same CPU set, resulting in
oversubscription of some cores while other cores sit idle
» In a shared environment, not using task pinning can be easier
at a slight trade-off in performance
 Make sure that the same MPI implementation and
compiler combination is used to run the code as was
used to compile and link
Questions??