Parallel Concepts: An Introduction

The Goal of Parallelization
• Reduction of the elapsed time of a program
• Reduction in the turnaround time of jobs
[Figure: cpu time and elapsed time on 1 processor vs. 4 processors; the parallel run finishes earlier but adds communication overhead on top of the cpu time]
• Overhead:
  – total increase in cpu time
  – communication
  – synchronization
  – additional work in the algorithm
  – the non-parallel part of the program (one processor works, the others spin idle)
• Overhead vs. elapsed time is better expressed as Speedup and Efficiency

Speedup and Efficiency
Both measure the parallelization properties of a program.
• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:
    S(p) = T(1)/T(p)
    E(p) = S(p)/p
• For ideal parallel speedup we get:
    T(p) = T(1)/p
    S(p) = T(1)/T(p) = p
    E(p) = S(p)/p = 1 (or 100%)
[Figure: Speedup and Efficiency vs. number of processors, showing the ideal curve together with super-linear, saturation and disaster behaviour]
• Scalable programs remain efficient for a large number of processors

Amdahl's Law
This rule states the following for parallel programs:
  the non-parallel fraction of the code (i.e. the overhead) imposes the upper limit on the scalability of the code.
• The non-parallel (serial) fraction s of the program includes the communication and synchronization overhead:
    (1)  1 = s + f                  ! the program has a serial and a parallel fraction
    (2)  T(1) = T(parallel) + T(serial)
              = T(1)*(f + s)
              = T(1)*(f + (1-f))
    (3)  T(p) = T(1)*(f/p + (1-f))
    (4)  S(p) = T(1)/T(p)
              = 1/(f/p + 1-f)
              < 1/(1-f)             ! for p -> infinity
• Thus, the maximum parallel Speedup S(p) for a program with parallel fraction f is:
    (5)  S(p) < 1/(1-f)
  (a short numeric illustration of this bound is given after "Other Impediments to Scalability" below)

The Practicality of Parallel Computing
[Figure: Speedup vs. percentage of parallel code for p = 2, 4 and 8 processors; the best hand-tuned codes of the 1970s, 1980s and 1990s reached the ~99% parallel range. Source: David J. Kuck, High Performance Computing, Oxford Univ. Press, 1996]
• In practice, making programs parallel is not as difficult as it may seem from Amdahl's law
• It is clear, however, that a program has to spend a significant portion (most) of its run time in the parallel region

Fine-Grained vs. Coarse-Grained
• Fine-grain parallelism (typically loop level)
  – can be done incrementally, one loop at a time
  – does not require deep knowledge of the code
  – a lot of loops have to be parallel for a decent speedup
  – potentially many synchronization points (at the end of each parallel loop)
• Coarse-grain parallelism
  – makes larger loops parallel at a higher call-tree level, potentially enclosing many small loops
  – more code is parallel at once
  – fewer synchronization points, reducing overhead
  – requires deeper knowledge of the code
[Figure: call tree rooted at MAIN; coarse-grained parallelism is applied near the top of the tree (routines A-D), fine-grained parallelism in the leaf-level loops p, q, r, s, t]
  (an OpenMP sketch contrasting the two approaches is given after "Other Impediments to Scalability" below)

Other Impediments to Scalability
Load imbalance:
• the time to complete a parallel execution of a code segment is determined by the longest-running thread
• an unequal work load distribution leaves some processors idle while others work too much
• with coarse-grain parallelization, more opportunities for load imbalance exist
[Figure: four threads p0-p3 finishing at different times between the start and the finish of the elapsed time]
Too many synchronization points:
• the compiler puts synchronization points at the start and exit of each parallel region
• if too many small loops have been made parallel, the synchronization overhead will compromise scalability
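To make the Amdahl bound (5) concrete, the short C sketch below tabulates S(p) = 1/(f/p + 1-f) for a few parallel fractions and processor counts; the particular values of f and p are illustrative choices, not taken from the course material. For f = 0.99, for example, the speedup on 32 processors is already below 25, which is why a code must be well over 99% parallel before a large machine pays off.

#include <stdio.h>

/* Amdahl's law: S(p) = 1/(f/p + (1-f)), with f the parallel fraction. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f / p + (1.0 - f));
}

int main(void)
{
    const double f[] = { 0.80, 0.95, 0.99 };   /* example parallel fractions */
    const int    p[] = { 2, 8, 32, 128 };      /* example processor counts   */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("f = %.2f  p = %3d  S(p) = %6.2f  (limit 1/(1-f) = %5.1f)\n",
                   f[i], p[j], amdahl_speedup(f[i], p[j]), 1.0 / (1.0 - f[i]));
    return 0;
}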
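As a companion to the fine-grained vs. coarse-grained discussion above, the sketch below contrasts the two styles in OpenMP directive notation; the arrays a and b and the loop bodies are invented for illustration only. The fine-grained version opens a parallel region for every loop, whereas the coarse-grained version encloses both loops in a single region so the same team of threads is reused and region start-up cost is paid only once.

#include <omp.h>

#define N 1000000

static double a[N], b[N];   /* illustrative data only */

/* Fine-grained: each loop is its own parallel region,
   with a fork/join and an implicit barrier per loop.   */
void fine_grained(void)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        b[i] = a[i] + 1.0;
}

/* Coarse-grained: one enclosing parallel region; the work-sharing
   loops inside it reuse the same team of threads.                 */
void coarse_grained(void)
{
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = 0.5 * i;

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a[i] + 1.0;
    }
}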
Parallel Programming Models

Classification of programming models:

  Feature           Originally sequential         Data parallel            Message passing      Shared variable
  Examples          Autoparallelising compilers   F90, HPF                 MPI, PVM             OpenMP, pthreads
  Control flow      Single                        Single                   Multiple             Multiple
  Address space     Single                        Single                   Multiple             Single
  Communication     Implicit                      Implicit                 Explicit             Implicit
  Synchronisation   Implicit                      Implicit                 Implicit/explicit    Explicit
  Data allocation   Implicit                      Implicit/semi-explicit   Explicit             Implicit

• Control flow - the number of explicit threads of execution
• Address space - access to global data from multiple threads
• Communication - data transfer as part of the language or of a library
• Synchronisation - the mechanism that regulates access to data
• Data allocation - control of the distribution of the data to the execution threads

Computing π with DPL

  π = ∫₀¹ 4/(1+x²) dx  ≈  Σ (i = 0 .. N-1)  4 / (N·(1 + ((i+0.5)/N)²))

PROGRAM PIPROG
  INTEGER, PARAMETER :: N = 1000000
  REAL (KIND=8) :: PI, W = 1.0D0/N
  INTEGER :: I
  ! implied-DO array constructor; midpoints at (I-0.5)*W for I = 1..N
  PI = SUM( (/ ( 4.0*W / (1.0 + ((I-0.5)*W)**2), I = 1, N ) /) )
  PRINT *, PI
END PROGRAM PIPROG

Notes:
  – essentially sequential form
  – automatic detection of parallelism
  – automatic work sharing
  – all variables shared by default
  – number of processors specified outside of the code
  – compile with: f90 -apo -O3 -mips4 -mplist

Computing π with Shared Memory

  π = ∫₀¹ 4/(1+x²) dx  ≈  Σ (i = 0 .. N-1)  4 / (N·(1 + ((i+0.5)/N)²))

#include <stdio.h>
#define n 1000000

int main(void)
{
  double pi, l, ls = 0.0, w = 1.0/n;
  int i;

  #pragma omp parallel private(i,l) reduction(+:ls)
  {
    #pragma omp for
    for (i = 0; i < n; i++) {
      l   = (i + 0.5) * w;
      ls += 4.0 / (1.0 + l * l);
    }
  }
  pi = ls * w;   /* the reduced sum is available after the parallel region */
  printf("pi is %f\n", pi);
  return 0;
}

Notes:
  – essentially sequential form
  – automatic work sharing
  – all variables shared by default
  – directives request the parallel work distribution
  – number of processors specified outside of the code

Computing π with Message Passing

  π = ∫₀¹ 4/(1+x²) dx  ≈  Σ (i = 0 .. N-1)  4 / (N·(1 + ((i+0.5)/N)²))

#include <stdio.h>
#include <mpi.h>
#define N 1000000

int main(int argc, char **argv)
{
  double pi, l, ls = 0.0, w = 1.0/N;
  int i, mid, nth;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &mid);
  MPI_Comm_size(MPI_COMM_WORLD, &nth);

  /* cyclic distribution of the N rectangles over the nth processes */
  for (i = mid; i < N; i += nth) {
    l   = (i + 0.5) * w;
    ls += 4.0 / (1.0 + l * l);
  }

  MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (mid == 0) printf("pi is %f\n", pi * w);
  MPI_Finalize();
  return 0;
}

Notes:
  – process identification comes first
  – explicit work sharing
  – all variables are private
  – explicit data exchange (reduce)
  – all code is parallel
  – number of processors is specified outside of the code

Computing π with POSIX Threads
  (see the sketch after "Comparing Parallel Paradigms" below)

Comparing Parallel Paradigms
• Automatic parallelization combined with explicit shared-memory programming (compiler directives) is used on machines with global memory:
  – Symmetric Multi-Processors, CC-NUMA, PVP
  – these methods are collectively known as Shared Memory Programming (SMP)
  – the SMP programming model works at loop level and at coarse-level parallelism:
    • coarse-level parallelism has to be specified explicitly
    • loop-level parallelism can be found by the compiler (implicitly)
• Explicit Message Passing methods are necessary on machines that have no global memory addressability:
  – clusters of all sorts, NOW & COW
  – Message Passing methods require coarse-level parallelism to be scalable
• Choosing a programming model is largely a matter of the application, personal preference and the target machine; it has nothing to do with scalability.
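A POSIX-threads counterpart to the shared-memory and message-passing π programs above can be sketched as follows; the thread count NTHREADS and the mutex-protected accumulation of the partial sums are assumptions of this sketch, not taken from the course code. Work is distributed cyclically over the threads, exactly as in the MPI version, but all threads share one address space, so only the final sum needs protection.

#include <stdio.h>
#include <pthread.h>

#define N        1000000
#define NTHREADS 4                      /* assumed thread count */

static double w = 1.0 / N;
static double pi_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread integrates a cyclic slice of the rectangles and then
   adds its partial sum to the shared total under a mutex.          */
static void *work(void *arg)
{
    long id = (long)arg;
    double l, ls = 0.0;

    for (long i = id; i < N; i += NTHREADS) {
        l   = (i + 0.5) * w;
        ls += 4.0 / (1.0 + l * l);
    }
    pthread_mutex_lock(&lock);
    pi_sum += ls;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, work, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("pi is %f\n", pi_sum * w);
    return 0;
}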
Scalability limitations:
• communication overhead
• process synchronization
• scalability is mainly a function of the hardware and of (your) implementation of the parallelism

Summary
• The serial part, or the communication overhead, of the code limits its scalability (Amdahl's Law)
• Programs have to be >99% parallel to use large (>30 processor) machines efficiently
• Several programming models are in use today:
  – Shared Memory programming (SMP), with automatic compiler parallelization, data-parallel and explicit shared-memory models
  – the Message Passing model
• Choosing a programming model is largely a matter of the application, personal choice and the target machine; it has nothing to do with scalability.
  – don't confuse the algorithm with its implementation
• Machines with a global address space can run applications based on both the SMP and the Message Passing programming models