Parallel Programming
Sathish S. Vadhiyar

Motivations of Parallel Computing

Parallel machine: a computer system with more than one processor.

Motivations:
• Faster execution time due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• A certain class of algorithms lend themselves naturally to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – about 40%
• Memory access time improvement in the past decade – about 10%

Parallel Programming and Challenges

Recall the advantages and motivation of parallelism. But parallel programs incur overheads not seen in sequential programs:
• Communication delay
• Idling
• Synchronization

[Figure: execution timeline for two processes P0 and P1 showing computation, communication, synchronization and idle time.]

How do we evaluate a parallel program?
• Execution time, Tp
• Speedup, S: S(p, n) = T(1, n) / T(p, n). Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup).
• Efficiency, E: E(p, n) = S(p, n) / p. Usually E(p, n) < 1; sometimes greater than 1.
• Scalability – limitations in parallel computing, and their relation to n and p.

[Figure: ideal vs. practical speedup S and efficiency E as functions of p.]

Limitations on speedup – Amdahl's law

Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. It expresses the overall speedup in terms of the fractions of the computation time that do and do not benefit from the enhancement, and it places a limit on the speedup due to parallelism. With a serial fraction fs and a parallel fraction fp executed on P processors:

  Speedup = 1 / (fs + fp/P)

Amdahl's law illustration

Equivalently, with serial fraction s: S = 1 / (s + (1 - s)/p).

[Figure: efficiency versus number of processors under Amdahl's law.]
Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html and http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

Amdahl's law analysis

  f      P=1   P=4    P=8    P=16    P=32
  1.00   1.0   4.00   8.00   16.00   32.00
  0.99   1.0   3.88   7.48   13.91   24.43
  0.98   1.0   3.77   7.02   12.31   19.75
  0.96   1.0   3.57   6.25   10.00   14.29

• For the same parallel fraction, the speedup keeps moving further away from the processor count as P grows.
• Thus Amdahl's law is a bit depressing for parallel programming.
• In practice, the parallel portion of the work has to be large enough to match a given number of processors.

Gustafson's Law

Amdahl's law keeps the parallel work fixed. Gustafson's law instead keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel vs. sequential work) to match that time:
• For a particular number of processors, find the problem size for which the parallel time equals the constant time.
• For that problem size, find the sequential time and the corresponding speedup.
The resulting speedup is called scaled speedup, also known as weak speedup.

Metrics (contd.)

Table 5.1: Efficiency as a function of n and p.

          N=64    N=192   N=512
  P=1     1.0     1.0     1.0
  P=4     0.80    0.92    0.97
  P=8     0.57    0.80    0.91
  P=16    0.33    0.60    0.80
  P=32    –       –       –

Scalability
• Efficiency decreases with increasing P and increases with increasing N.
• Scalability describes how effectively the parallel algorithm can use an increasing number of processors.
• It asks how the amount of computation performed must scale with P to keep E constant; this function of computation in terms of P is called the isoefficiency function.
• An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable.
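To make these metrics concrete, here is a minimal, illustrative C sketch (not part of the original slides) that reproduces the Amdahl's law table above: it evaluates S = 1 / (s + (1 - s)/p) and the corresponding efficiency E = S/p for the same parallel fractions and processor counts.

#include <stdio.h>

/* Amdahl speedup for a parallel fraction f_par run on p processors. */
static double amdahl_speedup(double f_par, int p)
{
    double f_seq = 1.0 - f_par;           /* serial fraction s        */
    return 1.0 / (f_seq + f_par / p);     /* S = 1 / (s + (1 - s)/p)  */
}

int main(void)
{
    const double fracs[] = { 1.00, 0.99, 0.98, 0.96 };
    const int    procs[] = { 1, 4, 8, 16, 32 };

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 5; j++) {
            double s = amdahl_speedup(fracs[i], procs[j]);
            printf("f=%.2f  P=%2d  S=%6.2f  E=%.2f\n",
                   fracs[i], procs[j], s, s / procs[j]);
        }
    }
    return 0;
}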
Parallel Program Models
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Programming Paradigms
• Shared memory model – threads, OpenMP, CUDA
• Message passing model – MPI

PARALLELIZATION

Parallelizing a Program

Given a sequential program/algorithm, how do we go about producing a parallel version? There are four steps in program parallelization:
1. Decomposition – identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
2. Assignment – grouping the tasks into processes with the best load balancing
3. Orchestration – reducing synchronization and communication costs
4. Mapping – mapping of processes to processors (if possible)

Steps in Creating a Parallel Program

[Figure: sequential computation → (decomposition) tasks → (assignment) processes p0–p3 → (orchestration) parallel program → (mapping) processors P0–P3.]

Orchestration

Goals:
• Structuring communication
• Synchronization

Challenges:
• Organizing data structures – packing
• Small or large messages?
• How to organize communication and synchronization?

Orchestration techniques:
• Maximizing data locality
  – Minimizing the volume of data exchange, e.g. not communicating intermediate results of a dot product
  – Minimizing the frequency of interactions – packing
• Minimizing contention and hot spots
  – Do not have every process use the same communication pattern with the other processes at the same time
• Overlapping computations with interactions
  – Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2); initiate the communication for type 1 and perform type 2 while the communication is in progress
• Replicating data or computations
  – Balance the extra computation or storage cost against the gain due to less communication

Mapping

Which process runs on which particular processor? The choice can depend on the network topology and the communication pattern of the processes, and on processor speeds in the case of heterogeneous systems.

Static mapping:
• Mapping based on data partitioning – applicable to dense matrix computations
  – Block distribution: 0 0 0 1 1 1 2 2 2
  – Block-cyclic distribution: 0 1 2 0 1 2 0 1 2
• Graph-partitioning-based mapping – applicable to sparse matrix computations
• Mapping based on task partitioning – based on the task dependency graph; in general the problem is NP-complete
  [Figure: a task dependency graph whose nodes are assigned to processes 0–7.]

Dynamic mapping:
• A process/global memory can hold a set of tasks
• Distribute some tasks to all processes
• Once a process completes its tasks, it asks the coordinator process for more tasks
• Referred to as self-scheduling or work-stealing (a small sketch follows below)
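As a software-level illustration of self-scheduling (an addition, not from the slides), the OpenMP sketch below hands loop iterations out in small chunks, so a thread that finishes its chunk early simply claims more work instead of idling; process_task() is a hypothetical placeholder for an irregular unit of work.

#include <omp.h>
#include <stdio.h>

#define NTASKS 1000

/* Hypothetical, possibly irregular unit of work. */
static void process_task(int i) { (void)i; }

int main(void)
{
    /* schedule(dynamic, 4): idle threads claim the next chunk of 4
       iterations on demand, i.e. the loop is self-scheduled. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < NTASKS; i++)
        process_task(i);

    printf("all %d tasks processed\n", NTASKS);
    return 0;
}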
High-level Goals

Table 2.1: Steps in the parallelization process and their goals (step, whether it is architecture-dependent, and its major performance goals).
• Decomposition (mostly not architecture-dependent): expose enough concurrency, but not too much.
• Assignment (mostly not architecture-dependent): balance the workload; reduce communication volume.
• Orchestration (architecture-dependent): reduce non-inherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early.
• Mapping (architecture-dependent): put related processes on the same processor if necessary; exploit locality in the network topology.

PARALLEL ARCHITECTURE

Classification of Architectures – Flynn's classification

In terms of parallelism in the instruction and data streams:
• Single Instruction Single Data (SISD): serial computers
• Single Instruction Multiple Data (SIMD): vector processors and processor arrays; examples: CM-2, Cray-90, Cray YMP, Hitachi 3600
• Multiple Instruction Single Data (MISD): not popular
• Multiple Instruction Multiple Data (MIMD): the most popular – IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Based on Memory

Shared memory, of two types – UMA and NUMA. Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Shared Memory vs Message Passing

Shared memory machine: the n processors share a physical address space, and communication can be done through this shared memory. The alternative is sometimes referred to as a message passing machine or a distributed memory machine.

[Figure: processors P sharing a main memory through an interconnect, versus processor–memory (P–M) pairs connected only through an interconnect.]

Shared Memory Machines

The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time. Terms: NUMA (Non-Uniform Memory Access) vs. UMA (Uniform Memory Access) architecture.

Classification of Architectures – Based on Memory (contd.)

Distributed memory.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Recently, multi-cores. Yet another classification: MPPs, NOW (Berkeley), COW, computational Grids.

Parallel Architecture: Interconnection Networks

An interconnection network is defined by switches, links and interfaces:
• Switches – provide the mapping between input and output ports, buffering, routing, etc.
• Interfaces – connect the nodes with the network

Parallel Architecture: Interconnections

• Indirect interconnects: nodes are connected to the interconnection medium, not directly to each other – shared bus, multiple bus, crossbar, MIN (multistage interconnection network)
• Direct interconnects: nodes are connected directly to each other – topologies: linear, ring, star, mesh, torus, hypercube
• Routing techniques: how the route taken by a message from source to destination is decided

Network topologies:
• Static – point-to-point communication links among processing nodes
• Dynamic – communication links are formed dynamically by switches

Interconnection Networks

Static:
• Bus – SGI Challenge
• Completely connected
• Star
• Linear array, ring (1-D torus)
• Mesh – Intel ASCI Red (2-D), Cray T3E (3-D); 2-D torus
• k-d mesh: d dimensions with k nodes in each dimension
• Hypercubes – a k-d mesh with k = 2 and d = log p – e.g. many MIMD machines (a small sketch follows below)
• Trees – our campus network

Dynamic – communication links formed dynamically by switches:
• Crossbar – Cray X series – non-blocking network
• Multistage – SP2 – blocking network

For more details, and an evaluation of these topologies, refer to the book by Grama et al.
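To make the hypercube entry above concrete (an illustrative addition, not from the slides): in a d-dimensional hypercube with p = 2^d nodes, node i is directly linked to the d nodes whose labels differ from i in exactly one bit, so the diameter is d = log2(p). The C sketch below enumerates these neighbours.

#include <stdio.h>

int main(void)
{
    const int d = 3;            /* 3-dimensional hypercube, p = 2^3 = 8 nodes */
    const int p = 1 << d;

    for (int node = 0; node < p; node++) {
        printf("node %d neighbours:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));   /* flip one bit per dimension */
        printf("\n");
    }
    return 0;
}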
Indirect Interconnects

[Figure: shared bus, multiple bus, a crossbar switch built from 2x2 crossbars, and a multistage interconnection network.]

Direct Interconnect Topologies

[Figure: star, linear array, ring, 2-D mesh, torus, and hypercube (binary n-cube) for n = 2 and n = 3.]

Evaluating Interconnection Topologies

Diameter – the maximum distance between any two processing nodes:
• Completely connected – 1
• Star – 2
• Ring – ⌊p/2⌋
• Hypercube – log p

Connectivity – the multiplicity of paths between two nodes; measured as the minimum number of arcs that must be removed from the network to break it into two disconnected networks:
• Linear array – 1
• Ring – 2
• 2-D mesh – 2
• 2-D mesh with wraparound – 4
• d-dimensional hypercube – d

Bisection width – the minimum number of links that must be removed from the network to partition it into two equal halves:
• Ring – 2
• p-node 2-D mesh – √p
• Tree – 1
• Star – 1
• Completely connected – p²/4
• Hypercube – p/2

Other link and network measures:
• Channel width – the number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between two nodes
• Channel rate – the performance of a single physical wire
• Channel bandwidth – channel rate times channel width
• Bisection bandwidth – the maximum volume of communication allowed between the two halves of the network, i.e. bisection width times channel bandwidth

Shared Memory Architecture: Caches

[Figure: P1 and P2 both read X (= 0) into their caches; P1 then writes X = 1; a later read of X on P2 is a cache hit that returns the stale value 0 – wrong data!]

Cache Coherence Problem

If each processor in a shared memory multiprocessor machine has a data cache, there is a potential data consistency problem – the cache coherence problem: a shared variable is modified while copies of it sit in private caches. The objective is that processes should not read stale data.

Cache Coherence Protocols (solutions)
• Write update – propagate the cache line to the other processors on every write by a processor
• Write invalidate – each processor gets the updated cache line whenever it reads stale data
Which is better?

Invalidation Based Cache Coherence

[Figure: the same scenario as above, but P1's write of X = 1 invalidates P2's cached copy, so P2's next read of X fetches the new value 1.]

Cache Coherence Using Invalidate Protocols

Three states are associated with data items:
• Shared – a variable shared by two caches
• Invalid – another processor (say P0) has updated the data item
• Dirty – the state of the data item in P0

Implementations:
• Snoopy, for bus-based architectures – a shared bus interconnect where all cache controllers monitor all bus activity. Since there is only one operation on the bus at a time, the cache controllers can be built to take corrective action and enforce coherence in the caches; memory operations are propagated over the bus and snooped.
• Directory-based – instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors. A central directory maintains the states of the cache blocks and the associated processors, implemented with presence bits.
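Finally, one software-observable consequence of invalidation-based coherence can be demonstrated directly (an illustrative addition, not from the slides): when two threads repeatedly write variables that happen to sit in the same cache line, every write invalidates the other core's copy, and the resulting coherence traffic slows both threads down ("false sharing"); padding each counter onto its own line removes the ping-pong. The sketch assumes 64-byte cache lines and compilation with gcc -O2 -fopenmp.

#include <omp.h>
#include <stdio.h>

#define N 100000000L

static volatile long same_line[2];                 /* both counters likely share one cache line */
struct padded { _Alignas(64) volatile long c; };   /* one counter per 64-byte cache line        */
static struct padded own_line[2];

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < N; i++)
            same_line[id]++;          /* each write invalidates the other core's copy */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < N; i++)
            own_line[id].c++;         /* no cache line is shared between the threads  */
    }
    double t2 = omp_get_wtime();

    printf("same cache line: %.2f s, padded: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}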