Parallel Concepts: An Introduction

The Goal of Parallelization
• Reduction of the elapsed time of a program
• Reduction in the turnaround time of jobs
[Figure: cpu time and elapsed time on 1 processor vs. 4 processors; the parallel run finishes earlier but adds communication overhead on top of the cpu time]
• Overhead:
  – total increase in cpu time
  – communication
  – synchronization
  – additional work in the algorithm
  – the non-parallel part of the program (one processor works, the others spin idle)
• Overhead vs. elapsed time is better expressed as Speedup and Efficiency

Speedup and Efficiency
Both measure the parallelization properties of a program.
• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:
    S(p) = T(1)/T(p)
    E(p) = S(p)/p
• For ideal parallel speedup we get:
    T(p) = T(1)/p
    S(p) = T(1)/T(p) = p
    E(p) = S(p)/p = 1 (or 100%)
[Figure: Speedup and Efficiency vs. number of processors, showing the ideal curve together with super-linear, saturation and disaster behaviour]
• Scalable programs remain efficient for a large number of processors

Amdahl's Law
This rule states the following for parallel programs:
  the non-parallel fraction of the code (i.e. the overhead) imposes the upper limit on the scalability of the code.
• The non-parallel (serial) fraction s of the program includes the communication and synchronization overhead:
    (1)  1 = s + f                  ! the program has a serial and a parallel fraction
    (2)  T(1) = T(parallel) + T(serial)
              = T(1)*(f + s)
              = T(1)*(f + (1-f))
    (3)  T(p) = T(1)*(f/p + (1-f))
    (4)  S(p) = T(1)/T(p)
              = 1/(f/p + 1-f)
              < 1/(1-f)             ! for p -> infinity
• Thus, the maximum parallel Speedup S(p) for a program with parallel fraction f is:
    (5)  S(p) < 1/(1-f)
  (a short numeric illustration of this bound is given after "Other Impediments to Scalability" below)

The Practicality of Parallel Computing
[Figure: Speedup vs. percentage of parallel code for p = 2, 4 and 8 processors; the best hand-tuned codes of the 1970s, 1980s and 1990s reached the ~99% parallel range. Source: David J. Kuck, High Performance Computing, Oxford Univ. Press, 1996]
• In practice, making programs parallel is not as difficult as it may seem from Amdahl's law
• It is clear, however, that a program has to spend a significant portion (most) of its run time in the parallel region

Fine-Grained vs. Coarse-Grained
• Fine-grain parallelism (typically loop level)
  – can be done incrementally, one loop at a time
  – does not require deep knowledge of the code
  – a lot of loops have to be parallel for a decent speedup
  – potentially many synchronization points (at the end of each parallel loop)
• Coarse-grain parallelism
  – makes larger loops parallel at a higher call-tree level, potentially enclosing many small loops
  – more code is parallel at once
  – fewer synchronization points, reducing overhead
  – requires deeper knowledge of the code
[Figure: call tree rooted at MAIN; coarse-grained parallelism is applied near the top of the tree (routines A-D), fine-grained parallelism in the leaf-level loops p, q, r, s, t]
  (an OpenMP sketch contrasting the two approaches is given after "Other Impediments to Scalability" below)

Other Impediments to Scalability
Load imbalance:
• the time to complete a parallel execution of a code segment is determined by the longest-running thread
• an unequal work load distribution leaves some processors idle while others work too much
• with coarse-grain parallelization, more opportunities for load imbalance exist
[Figure: four threads p0-p3 finishing at different times between the start and the finish of the elapsed time]
Too many synchronization points:
• the compiler puts synchronization points at the start and exit of each parallel region
• if too many small loops have been made parallel, the synchronization overhead will compromise scalability
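To make the Amdahl bound (5) concrete, the short C sketch below tabulates S(p) = 1/(f/p + 1-f) for a few parallel fractions and processor counts; the particular values of f and p are illustrative choices, not taken from the course material. For f = 0.99, for example, the speedup on 32 processors is already below 25, which is why a code must be well over 99% parallel before a large machine pays off.

#include <stdio.h>

/* Amdahl's law: S(p) = 1/(f/p + (1-f)), with f the parallel fraction. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f / p + (1.0 - f));
}

int main(void)
{
    const double f[] = { 0.80, 0.95, 0.99 };   /* example parallel fractions */
    const int    p[] = { 2, 8, 32, 128 };      /* example processor counts   */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("f = %.2f  p = %3d  S(p) = %6.2f  (limit 1/(1-f) = %5.1f)\n",
                   f[i], p[j], amdahl_speedup(f[i], p[j]), 1.0 / (1.0 - f[i]));
    return 0;
}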
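As a companion to the fine-grained vs. coarse-grained discussion above, the sketch below contrasts the two styles in OpenMP directive notation; the arrays a and b and the loop bodies are invented for illustration only. The fine-grained version opens a parallel region for every loop, whereas the coarse-grained version encloses both loops in a single region so the same team of threads is reused and region start-up cost is paid only once.

#include <omp.h>

#define N 1000000

static double a[N], b[N];   /* illustrative data only */

/* Fine-grained: each loop is its own parallel region,
   with a fork/join and an implicit barrier per loop.   */
void fine_grained(void)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        b[i] = a[i] + 1.0;
}

/* Coarse-grained: one enclosing parallel region; the work-sharing
   loops inside it reuse the same team of threads.                 */
void coarse_grained(void)
{
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = 0.5 * i;

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a[i] + 1.0;
    }
}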
Parallel Programming Models

Classification of programming models:

  Feature           Originally sequential         Data parallel            Message passing      Shared variable
  Examples          Autoparallelising compilers   F90, HPF                 MPI, PVM             OpenMP, pthreads
  Control flow      Single                        Single                   Multiple             Multiple
  Address space     Single                        Single                   Multiple             Single
  Communication     Implicit                      Implicit                 Explicit             Implicit
  Synchronisation   Implicit                      Implicit                 Implicit/explicit    Explicit
  Data allocation   Implicit                      Implicit/semi-explicit   Explicit             Implicit

• Control flow - the number of explicit threads of execution
• Address space - access to global data from multiple threads
• Communication - data transfer as part of the language or of a library
• Synchronisation - the mechanism that regulates access to data
• Data allocation - control of the distribution of the data to the execution threads

Computing π with DPL

  π = ∫₀¹ 4/(1+x²) dx  ≈  Σ (i = 0 .. N-1)  4 / (N·(1 + ((i+0.5)/N)²))

PROGRAM PIPROG
  INTEGER, PARAMETER :: N = 1000000
  REAL (KIND=8) :: PI, W = 1.0D0/N
  INTEGER :: I
  ! implied-DO array constructor; midpoints at (I-0.5)*W for I = 1..N
  PI = SUM( (/ ( 4.0*W / (1.0 + ((I-0.5)*W)**2), I = 1, N ) /) )
  PRINT *, PI
END PROGRAM PIPROG

Notes:
  – essentially sequential form
  – automatic detection of parallelism
  – automatic work sharing
  – all variables shared by default
  – number of processors specified outside of the code
  – compile with: f90 -apo -O3 -mips4 -mplist

Computing π with Shared Memory

  π = ∫₀¹ 4/(1+x²) dx  ≈  Σ (i = 0 .. N-1)  4 / (N·(1 + ((i+0.5)/N)²))

#include <stdio.h>
#define n 1000000

int main(void)
{
  double pi, l, ls = 0.0, w = 1.0/n;
  int i;

  #pragma omp parallel private(i,l) reduction(+:ls)
  {
    #pragma omp for
    for (i = 0; i < n; i++) {
      l   = (i + 0.5) * w;
      ls += 4.0 / (1.0 + l * l);
    }
  }
  pi = ls * w;   /* the reduced sum is available after the parallel region */
  printf("pi is %f\n", pi);
  return 0;
}

Notes:
  – essentially sequential form
  – automatic work sharing
  – all variables shared by default
  – directives request the parallel work distribution
  – number of processors specified outside of the code

Computing π with Message Passing

  π = ∫₀¹ 4/(1+x²) dx  ≈  Σ (i = 0 .. N-1)  4 / (N·(1 + ((i+0.5)/N)²))

#include <stdio.h>
#include <mpi.h>
#define N 1000000

int main(int argc, char **argv)
{
  double pi, l, ls = 0.0, w = 1.0/N;
  int i, mid, nth;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &mid);
  MPI_Comm_size(MPI_COMM_WORLD, &nth);

  /* cyclic distribution of the N rectangles over the nth processes */
  for (i = mid; i < N; i += nth) {
    l   = (i + 0.5) * w;
    ls += 4.0 / (1.0 + l * l);
  }

  MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (mid == 0) printf("pi is %f\n", pi * w);
  MPI_Finalize();
  return 0;
}

Notes:
  – process identification comes first
  – explicit work sharing
  – all variables are private
  – explicit data exchange (reduce)
  – all code is parallel
  – number of processors is specified outside of the code

Computing π with POSIX Threads
  (see the sketch after "Comparing Parallel Paradigms" below)

Comparing Parallel Paradigms
• Automatic parallelization combined with explicit shared-memory programming (compiler directives) is used on machines with global memory:
  – Symmetric Multi-Processors, CC-NUMA, PVP
  – these methods are collectively known as Shared Memory Programming (SMP)
  – the SMP programming model works at loop level and at coarse-level parallelism:
    • coarse-level parallelism has to be specified explicitly
    • loop-level parallelism can be found by the compiler (implicitly)
• Explicit Message Passing methods are necessary on machines that have no global memory addressability:
  – clusters of all sorts, NOW & COW
  – Message Passing methods require coarse-level parallelism to be scalable
• Choosing a programming model is largely a matter of the application, personal preference and the target machine; it has nothing to do with scalability.
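A POSIX-threads counterpart to the shared-memory and message-passing π programs above can be sketched as follows; the thread count NTHREADS and the mutex-protected accumulation of the partial sums are assumptions of this sketch, not taken from the course code. Work is distributed cyclically over the threads, exactly as in the MPI version, but all threads share one address space, so only the final sum needs protection.

#include <stdio.h>
#include <pthread.h>

#define N        1000000
#define NTHREADS 4                      /* assumed thread count */

static double w = 1.0 / N;
static double pi_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread integrates a cyclic slice of the rectangles and then
   adds its partial sum to the shared total under a mutex.          */
static void *work(void *arg)
{
    long id = (long)arg;
    double l, ls = 0.0;

    for (long i = id; i < N; i += NTHREADS) {
        l   = (i + 0.5) * w;
        ls += 4.0 / (1.0 + l * l);
    }
    pthread_mutex_lock(&lock);
    pi_sum += ls;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, work, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("pi is %f\n", pi_sum * w);
    return 0;
}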
Scalability limitations:
• communication overhead
• process synchronization
• scalability is mainly a function of the hardware and of (your) implementation of the parallelism

Summary
• The serial part, or the communication overhead, of the code limits its scalability (Amdahl's Law)
• Programs have to be >99% parallel to use large (>30 processor) machines efficiently
• Several programming models are in use today:
  – Shared Memory programming (SMP), with automatic compiler parallelization, data-parallel and explicit shared-memory models
  – the Message Passing model
• Choosing a programming model is largely a matter of the application, personal choice and the target machine; it has nothing to do with scalability.
  – don't confuse the algorithm with its implementation
• Machines with a global address space can run applications based on both the SMP and the Message Passing programming models