Parallel Computing Final Exam Review

1 What is Parallel computing?
Parallel computing involves performing computational tasks in parallel using more than one processor. Example in real life with related principles - book shelving in a library:
- A single worker.
- P workers, each stacking n/p books, but with an arbitration problem (many workers try to stack the next book in the same shelf).
- P workers, each stacking n/p books, without an arbitration problem (each worker works on a different set of shelves).

2 Important Issues in parallel computing
Task/Program Partitioning. How to split a single task among the processors so that each processor performs the same amount of work, and all processors work collectively to complete the task.
Data Partitioning. How to split the data evenly among the processors in such a way that processor interaction is minimized.
Communication/Arbitration. How we allow communication among different processors and how we arbitrate communication-related conflicts.

3 Challenges
1. Design of parallel computers so that we resolve the above issues.
2. Design, analysis and evaluation of parallel algorithms run on these machines.
3. Portability and scalability issues related to parallel programs and algorithms.
4. Tools and libraries used in such systems.
4 Units of Measure in HPC
• High Performance Computing (HPC) units are:
—Flop: floating point operation
—Flops/s: floating point operations per second
—Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
Mega  Mflop/s = 10^6 flop/sec   Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga  Gflop/s = 10^9 flop/sec   Gbyte = 2^30 ~ 10^9 bytes
Tera  Tflop/s = 10^12 flop/sec  Tbyte = 2^40 ~ 10^12 bytes
Peta  Pflop/s = 10^15 flop/sec  Pbyte = 2^50 ~ 10^15 bytes
Exa   Eflop/s = 10^18 flop/sec  Ebyte = 2^60 ~ 10^18 bytes
Zetta Zflop/s = 10^21 flop/sec  Zbyte = 2^70 ~ 10^21 bytes
Yotta Yflop/s = 10^24 flop/sec  Ybyte = 2^80 ~ 10^24 bytes
• See www.top500.org for the current list of the fastest machines.

5 What is a parallel computer?
A parallel computer is a collection of processors that cooperatively solve computationally intensive problems faster than other computers. Parallel algorithms allow the efficient programming of parallel computers; this way the waste of computational resources can be avoided.
Parallel computer vs. Supercomputer: a supercomputer refers to a general-purpose computer that can solve computationally intensive problems faster than traditional computers. A supercomputer may or may not be a parallel computer.

6 Flynn’s taxonomy of computer architectures (control mechanism)
Depending on the execution and data streams, computer architectures can be distinguished into the following groups.
(1) SISD (Single Instruction Single Data): This is a sequential computer.
(2) SIMD (Single Instruction Multiple Data): This is a parallel machine like the TM CM-200. SIMD machines are suited for data-parallel programs, where the same set of instructions is executed on a large data set. Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines.
(3) MISD (Multiple Instructions Single Data): Some consider a systolic array a member of this group.
(4) MIMD (Multiple Instructions Multiple Data): All other parallel machines. A MIMD architecture can be MPMD or SPMD. In a Multiple Program Multiple Data organization each processor executes its own program, as opposed to a single program that is executed by all processors in a Single Program Multiple Data organization. Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
Note: Some consider the CM-5 a combination of MIMD and SIMD, as it contains control hardware that allows it to operate in a SIMD mode.

7 SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).

8 Taxonomy based on Address-Space Organization (memory distribution)
Message-Passing Architecture: In a distributed memory machine each processor has its own memory. Each processor can access its own memory faster than it can access the memory of a remote processor (NUMA, for Non-Uniform Memory Access). This architecture is also known as a message-passing architecture, and such machines are commonly referred to as multicomputers. Examples: Cray T3D/T3E, IBM SP1/SP2, workstation clusters.
Shared-Address-Space Architecture: Provides hardware support for read/write to a shared address space. Machines built this way are often called multiprocessors. (1) A shared memory machine has a single address space shared by all processors (UMA, for Uniform Memory Access). The time taken by a processor to access any memory word in the system is identical. Examples: SGI Power Challenge, SMP machines. (2) A distributed shared memory system is a hybrid of the two previous ones: a global address space is shared among the processors but is physically distributed among them. Example: SGI Origin 2000.
Note: The existence of caches in shared-memory parallel machines causes cache coherence problems when a cached variable is modified by a processor and the shared variable is requested by another processor.
cc-NUMA stands for cache-coherent NUMA architectures (Origin 2000).

9 NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

10 Message Passing vs. Shared Address Space Platforms
Message passing requires little hardware support, other than a network. Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

11 Taxonomy based on processor granularity
Granularity sometimes refers to the power of individual processors; sometimes it is also used to denote the degree of parallelism. (1) A coarse-grained architecture consists of (usually few) powerful processors (e.g. old Cray machines). (2) A fine-grained architecture consists of (usually many) inexpensive processors (e.g. TM CM-200, CM-2). (3) A medium-grained architecture is in between the two (e.g. CM-5).
Process granularity refers to the amount of computation assigned to a particular processor of a parallel machine for a given parallel program. It also refers, within a single program, to the amount of computation performed before communication is issued. If the amount of computation is small (low degree of concurrency), a process is fine-grained; otherwise granularity is coarse.

12 Taxonomy based on processor synchronization
(1) In a fully synchronous system a global clock is used to synchronize all operations performed by the processors. (2) An asynchronous system lacks any synchronization facilities; processor synchronization needs to be explicit in a user’s program. (3) A bulk-synchronous system comes in between a fully synchronous and an asynchronous system: synchronization of processors is required only at certain parts of the execution of a parallel program.
13 Physical Organization of Parallel Platforms - ideal architecture (PRAM)
The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a parallel computer. A PRAM consists of a collection of (sequential) processors that can synchronously access a global shared memory in unit time. Each processor can thus access the shared memory as fast (and efficiently) as it can access its own local memory.
The main advantage of the PRAM is its simplicity in capturing parallelism and abstracting away communication and synchronization issues related to parallel computing. Processors are considered to be in abundance and unlimited in number; the resulting PRAM algorithms thus exhibit unlimited parallelism (the number of processors used is a function of problem size).
The abstraction offered by the PRAM, a fully synchronous collection of processors plus a shared memory, makes it popular for parallel algorithm design. It is, however, this abstraction that also makes the PRAM unrealistic from a practical point of view: full synchronization as offered by the PRAM is too expensive and time-demanding on parallel machines currently in use; remote memory (i.e. shared memory) access is considerably more expensive on real machines than local memory access; and UMA machines with unlimited parallelism are difficult to build.

14 Four Subclasses of PRAM
Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants. ER (Exclusive Read) or EW (Exclusive Write) PRAMs do not allow concurrent access of the shared memory; it is allowed, however, for CR (Concurrent Read) or CW (Concurrent Write) PRAMs. Combining the rules for read and write access, there are four PRAM variants:
EREW: Access to a memory location is exclusive; no concurrent read or write operations are allowed. This is the weakest PRAM model.
CREW: Multiple read accesses to a memory location are allowed; multiple write accesses to a memory location are serialized.
ERCW: Multiple write accesses to a memory location are allowed; multiple read accesses are serialized. Can simulate an EREW PRAM.
CRCW: Allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model; it can simulate both the EREW PRAM and the CREW PRAM.

15 Resolve concurrent write access
(1) In the arbitrary PRAM, if multiple processors write into a single shared memory cell, an arbitrary processor succeeds in writing into this cell.
(2) In the common PRAM, processors must write the same value into the shared memory cell.
(3) In the priority PRAM, the processor with the highest priority (smallest or largest indexed processor) succeeds in writing.
(4) In the combining PRAM, if more than one processor writes into the same memory cell, the result written into it depends on the combining operator: if it is the sum operator, the sum of the values is written; if it is the maximum operator, the maximum is written.
Note: An algorithm designed for the common PRAM can be executed on a priority or arbitrary PRAM and exhibit similar complexity. The same holds for an arbitrary PRAM algorithm when run on a priority PRAM.

16 Interconnection Networks for Parallel Computers
Interconnection networks carry data between processors and to memory. Interconnects are made of switches and links (wires, fiber) and are classified as static or dynamic.
Static networks: consist of point-to-point communication links among processing nodes; also referred to as direct networks.
Dynamic networks: built using switches (switching elements) and links; communication links are connected to one another dynamically by the switches to establish paths among processing nodes and memory banks.

17 Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
18 Network Topologies: Bus-Based Networks
The simplest network: a shared medium (bus) common to all nodes. The distance between any two nodes in the network is constant (O(1)). Ideal for broadcasting information among nodes. Scalable in terms of cost, but not scalable in terms of performance: the bounded bandwidth of a bus places limitations on the overall performance as the number of nodes increases. Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.

19 Network Topologies: Crossbar Networks
Employs a grid of switches (switching nodes) to connect p processors to b memory banks. Nonblocking network: the connection of a processing node to a memory bank does not block the connection of any other processing node to other memory banks. The total number of switching nodes required is Θ(pb) (it is reasonable to assume b ≥ p). Scalable in terms of performance, but not scalable in terms of cost. Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.

20 Network Topologies: Multistage Networks
An intermediate class of networks between bus-based and crossbar networks. Blocking networks: access to a memory bank by a processor may disallow access to another memory bank by another processor. More scalable than the bus-based network in terms of performance, and more scalable than the crossbar network in terms of cost.
The schematic of a typical multistage interconnection network.
21 Network Topologies: Multistage Omega Network
An Omega network consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks). Each stage consists of an interconnection pattern that connects p inputs and p outputs, the perfect shuffle (left rotation): input i is connected to output j, where
j = 2i            for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p    for p/2 ≤ i ≤ p − 1
Each switch has two connection modes:
Pass-through connection: the inputs are sent straight through to the outputs.
Cross-over connection: the inputs to the switching node are crossed over and then sent out.
The network has (p/2) log p switching nodes.

22 Network Topologies: Multistage Omega Network
A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated: a complete Omega network connecting eight inputs and eight outputs. An Omega network has (p/2) log p switching nodes, and the cost of such a network grows as Θ(p log p).

23 Network Topologies: Multistage Omega Network - Routing
Let s be the binary representation of the source and d be that of the destination processor. The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the data is routed in pass-through mode by the switch; otherwise, it switches to crossover. This process is repeated, with the next bit, for each of the log p switching stages. Note that this is not a non-blocking switch.

24 Network Topologies: Multistage Omega Network - Routing
An example of blocking in an Omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

25 Network Topologies - Fixed Connection Networks (static)
Completely-connected network, star-connected network, linear array, 2d-array (2d-mesh or mesh), 3d-mesh, Complete Binary Tree (CBT), 2d-Mesh of Trees, hypercube.

26 Evaluating Static Interconnection Networks
One can view an interconnection network as a graph whose nodes correspond to processors and whose edges correspond to links connecting neighboring processors.
The properties of these interconnection networks can be described in terms of a number of criteria.
(1) Set of processor nodes V. The cardinality of V is the number of processors p (also denoted by n).
(2) Set of edges E linking the processors. An edge e = (u, v) is represented by a pair (u, v) of nodes. If the graph G = (V, E) is directed, there is a unidirectional link from processor u to v; if the graph is undirected, the link is bidirectional. In almost all networks considered in this course communication links will be bidirectional; the exceptions will be clearly distinguished.
(3) The degree d_u of node u is the number of links containing u as an endpoint. If graph G is directed we distinguish between the out-degree of u (number of pairs (u, v) ∈ E, for any v ∈ V) and, similarly, the in-degree of u. The degree d of graph G is the maximum of the degrees of its nodes, i.e. d = max_u d_u.

27 Evaluating Static Interconnection Networks (cont.)
(4) The diameter D of graph G is the maximum of the lengths of the shortest paths linking any two nodes of G. A shortest path between u and v is the path of minimal length linking u and v; we denote its length by d_uv. Then D = max_{u,v} d_uv. The diameter of a graph G denotes the maximum delay (in terms of number of links traversed) incurred when a packet is transmitted from one node to the other of the pair that contributes to D (i.e. from u to v or the other way around, if D = d_uv). Of course such a delay holds only if messages follow shortest paths (the case for most routing algorithms).
(5) Latency is the total time to send a message, including software overhead. Message latency is the time to send a zero-length message.
(6) Bandwidth is the number of bits transmitted in unit time.
(7) Bisection width is the number of links that need to be removed from G to split the nodes into two sets of about the same size (±1).
28 Network Topologies - Fixed Connection Networks (static)
Completely-connected network: Each node has a direct communication link to every other node in the network. Ideal in the sense that a node can send a message to another node in a single step. The static counterpart of crossbar switching networks; nonblocking.
Star-connected network: One processor acts as the central processor, and every other processor has a communication link connecting it to this central processor. Similar to the bus-based network: the central processor is the bottleneck.

29 Network Topologies: Completely Connected and Star Connected Networks
(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.

30 Network Topologies - Fixed Connection Networks (static)
Linear array: Each node (except the two nodes at the ends) has two neighbors, one each to its left and right. Extension: ring or 1-D torus (linear array with wraparound).
2d-array or 2d-mesh or mesh: Extension of the linear array to two dimensions. The processors are ordered to form a 2-dimensional structure (square) so that each processor is connected to its four neighbors (north, south, east, west) except perhaps for the processors on the boundary. Each dimension has √p nodes, with a node identified by a two-tuple (i, j).
3d-mesh: A generalization of the 2d-mesh in three dimensions. Exercise: find the characteristics of this network and its generalization in k dimensions (k > 2).
Complete Binary Tree (CBT): There is only one path between any pair of nodes. Static tree network: a processing element at each node of the tree. Dynamic tree network: nodes at intermediate levels are switching nodes and the leaf nodes are processing elements. There is a communication bottleneck at the higher levels of the tree; a solution is to increase the number of communication links and switching nodes closer to the root.
31 Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with wraparound link.

32 Network Topologies: Two- and Three-Dimensional Meshes
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

33 Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

34 Network Topologies: Fat Trees
A fat tree network of 16 processing nodes.

35 Evaluating Network Topologies
Linear array: |V| = N, |E| = N − 1, d = 2, D = N − 1, bw = 1 (bisection width).
2d-array or 2d-mesh or mesh: For |V| = N, we have a √N × √N mesh structure, with |E| ≤ 2N = O(N), d = 4, D = 2√N − 2, bw = √N.
3d-mesh: Exercise: find the characteristics of this network and its generalization in k dimensions (k > 2).
Complete Binary Tree (CBT) on N = 2^n leaves: For a complete binary tree on N leaves, we define the level of a node to be its distance from the root. The root is at level 0, and the number of nodes at level i is 2^i. Then |V| = 2N − 1, |E| = 2N − 2 = O(N), d = 3, D = 2 lg N, bw = 1.

36 Network Topologies - Fixed Connection Networks (static)
2d-Mesh of Trees: An N²-leaf 2d-MOT consists of N² nodes ordered as in a 2d-array N × N (but without the links). The N rows and N columns of the 2d-MOT form N row CBTs and N column CBTs respectively. For such a network, |V| = N² + 2N(N − 1), |E| = O(N²), d = 3, D = 4 lg N, bw = N. The 2d-MOT possesses an interesting decomposition property: if the 2N roots of the CBTs are removed, we get four N/2 × N/2 2d-MOTs.
The 2d-Mesh of Trees (2d-MOT) combines the advantages of 2d-meshes and binary trees: a 2d-mesh has large bisection width but large diameter (√N), while a binary tree on N leaves has small diameter but small bisection width. The 2d-MOT has both small diameter and large bisection width. A 3d-MOT can be defined similarly.
37 Network Topologies - Fixed Connection Networks (static)
Hypercube: The hypercube is the major representative of a class of networks called hypercubic networks. Other such networks are the butterfly, the shuffle-exchange graph, the de Bruijn graph, cube-connected cycles, etc.
Each vertex of an n-dimensional hypercube is represented by a binary string of length n; therefore there are |V| = 2^n = N vertices in such a hypercube. Two vertices are connected by an edge if their strings differ in exactly one bit position.
Let u = u1 u2 ... ui ... un. An edge is a dimension-i edge if it links two nodes that differ in the i-th bit position. This way vertex u is connected to vertex u^i = u1 u2 ... ūi ... un with a dimension-i edge. Therefore |E| = N lg N / 2 and d = lg N = n. The hypercube is the first network examined so far whose degree is not a constant but a (very slowly growing) function of N.
The diameter of the hypercube is D = lg N. A path from node u to node v can be determined by correcting the bits of u to agree with those of v, starting from dimension 1 in a “left-to-right” fashion.
The bisection width of the hypercube is bw = N/2. This is a result of the following property of the hypercube: if all edges of dimension i are removed from an n-dimensional hypercube, we get two hypercubes, each one of dimension n − 1.

38 Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.

39 Evaluating Static Interconnection Networks

Network                   Diameter          Bisection width   Arc connectivity   Cost (no. of links)
Completely-connected      1                 p^2/4             p − 1              p(p − 1)/2
Star                      2                 1                 1                  p − 1
Complete binary tree      2 lg((p + 1)/2)   1                 1                  p − 1
Linear array              p − 1             1                 1                  p − 1
2-D mesh, no wraparound   2(√p − 1)         √p                2                  2(p − √p)
2-D wraparound mesh       2⌊√p/2⌋           2√p               4                  2p
Hypercube                 lg p              p/2               lg p               (p lg p)/2
Wraparound k-ary d-cube   d⌊k/2⌋            2k^(d−1)          2d                 dp

40 Performance Metrics for Parallel Systems
Number of processing elements: p.
Execution time. Parallel runtime: the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution.
Ts: serial runtime. Tp: parallel runtime.
Total parallel overhead T0: the total time collectively spent by all the processing elements, minus the running time required by the fastest known sequential algorithm for solving the same problem on a single processing element: T0 = pTp − Ts.

41 Performance Metrics for Parallel Systems
Speedup S: The ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processing elements: S = Ts(best)/Tp.
Example: adding n numbers: Tp = Θ(log n), Ts = Θ(n), S = Θ(n / log n).
Theoretically, speedup can never exceed the number of processing elements p (S ≤ p). Proof: assume a speedup greater than p; then each processing element spends less than time Ts/p solving the problem. In this case, a single processing element could emulate the p processing elements and solve the problem in fewer than Ts units of time. This is a contradiction, because speedup, by definition, is computed with respect to the best sequential algorithm.
Superlinear speedup: In practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by the serial algorithm is greater than that of its parallel formulation, or due to hardware features that put the serial implementation at a disadvantage.

42 Example for Superlinear speedup
Example 1: superlinear effects from caches. With a problem instance of size A and a 64KB cache, the cache hit rate is 80%. Assume a latency to cache of 2 ns and a latency to DRAM of 100 ns; then the memory access time is 2*0.8 + 100*0.2 = 21.6 ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS.
With a problem instance of size A/2 (per processor, on two processors) and a 64KB cache, the cache hit rate is higher, i.e. 90%; 8% of the accesses go to local DRAM and the remaining 2% to remote DRAM with a latency of 400 ns, so the memory access time is 2*0.9 + 100*0.08 + 400*0.02 = 17.8 ns.
The corresponding execution rate at each processor is 56.18 MFLOPS, and for two processors the total processing rate is 112.36 MFLOPS. The speedup is then 112.36/46.3 = 2.43!

43 Example for Superlinear speedup
Example 2: superlinear effects due to exploratory decomposition. Consider exploring the leaf nodes of an unstructured tree. Each leaf has a label associated with it, and the objective is to find a node with a specified label, say ‘S’; the solution node is the rightmost leaf in the tree. A serial formulation of this problem based on depth-first tree traversal explores the entire tree, i.e. all 14 nodes, taking 14 units of time. Now consider a parallel formulation in which the left subtree is explored by processing element 0 and the right subtree by processing element 1. The total work done by the parallel algorithm is only 9 nodes and the corresponding parallel time is 5 units. The speedup is then 14/5 = 2.8.

44 Performance Metrics for Parallel Systems (cont.)
Efficiency E: The ratio of speedup to the number of processing elements, E = S/p. A measure of the fraction of time for which a processing element is usefully employed. Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n / log n), E = Θ(1 / log n).
Cost (also called work or processor-time product) W: The product of parallel runtime and the number of processing elements used, W = Tp * p. Example: adding n numbers on n processing elements: W = Θ(n log n).
Cost-optimal: a parallel algorithm is cost-optimal if the cost of solving a problem on a parallel computer has the same asymptotic growth (in Θ terms), as a function of the input size, as the fastest-known sequential algorithm on a single processing element.
Problem size W2: The number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element; W2 = Ts of the fastest known algorithm to solve the problem on a sequential computer.
45 Parallel vs Sequential Computing: Amdahl’s Law
Theorem 0.1 (Amdahl’s Law): Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is
S ≤ 1 / (f + (1 − f)/p).
Proof: Let T be the sequential running time for the named computation. fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of at most (1 − f)T/p. This way the running time of the parallel program on p processors is the sum of the execution times of the sequential and parallel components, that is, fT + (1 − f)T/p. The maximum obtainable speedup is therefore S ≤ T / (fT + (1 − f)T/p), and the result is proven.

46 Amdahl’s Law
Amdahl used this observation to advocate the building of even more powerful sequential machines, arguing that one cannot gain much by using parallel machines. For example, if f = 10%, then S ≤ 10 as p → ∞.
The underlying assumption in Amdahl’s Law is that the sequential component of a program is a constant fraction of the whole program. In many instances, as problem size increases, the fraction of computation that is inherently sequential decreases. In many cases even a speedup of 10 is quite significant by itself. In addition, Amdahl’s Law is based on the premise that parallel computing always tries to minimize parallel time. In some cases a parallel computer is used instead to increase the problem size that can be solved in a fixed amount of time; in weather prediction, for example, this would increase the accuracy of, say, a three-day forecast, or would allow a more accurate five-day forecast.

47 Parallel vs Sequential Computing: Gustafson’s Law
Theorem 0.2 (Gustafson’s Law): Let the execution time of a parallel algorithm on p processors consist of a sequential segment fT and a parallel segment (1 − f)T, where the sequential segment is constant. The scaled speedup of the algorithm is then
S = (fT + (1 − f)Tp) / (fT + (1 − f)T) = f + (1 − f)p.
For f = 0.05 and p = 20, we get S = 19.05, whereas Amdahl’s Law gives S ≤ 10.26. (A single processor would need time fT + (1 − f)Tp = T(f + (1 − f)p) to do the scaled work that the p processors complete in time T.)
Amdahl’s Law assumes that the problem size is fixed when it deals with scalability; Gustafson’s Law assumes that the running time is fixed.

48 Brent’s Scheduling Principle (Emulations)
Suppose we have an unlimited-parallelism efficient parallel algorithm, i.e. an algorithm that runs on zillions of processors. In practice zillions of processors may not be available; suppose we have only p processors. A question that arises is what we can do to “run” the efficient zillion-processor algorithm on our limited machine. One answer is emulation: simulate the zillion-processor algorithm on the p-processor machine.
Theorem 0.3 (Brent’s Principle): Let a parallel algorithm require m operations in total and run in parallel time t. Then running this algorithm on a machine with only p processors requires time at most m/p + t.
Proof: Let m_i be the number of computational operations at the i-th step, so that sum_{i=1..t} m_i = m. If we assign the p processors on the i-th step to work on these m_i operations, they conclude in time ceil(m_i / p) ≤ m_i / p + 1. Thus the total running time on p processors is at most
sum_{i=1..t} ceil(m_i / p) ≤ sum_{i=1..t} (m_i / p + 1) = m/p + t.

49 The Message Passing Interface (MPI): Introduction
The Message-Passing Interface (MPI) is an attempt to create a standard that allows tasks executing on multiple processors to communicate through standardized communication primitives. It defines a standard library for message passing that one can use to develop message-passing programs in C or Fortran. The MPI standard defines both the syntax and the semantics of this functional interface to message passing.
MPI comes in a variety of flavors, freely available such as LAM-MPI and MPICH, and also commercial versions such as Critical Software’s WMPI.
It supports message passing on a variety of platforms, from Linux-based or Windows-based PCs to supercomputers and multiprocessor systems. After the introduction of MPI, whose functionality includes a set of 125 functions, a revision of the standard took place that added C++ support, external memory accessibility, and also Remote Memory Access (similar to BSP’s put and get capability) to the standard. The resulting standard is known as MPI-2 and has grown to almost 241 functions.

50 The Message Passing Interface: A minimum set
A minimum set of MPI functions is described below. MPI functions use the prefix MPI_, and after the prefix the remaining keyword starts with a capital letter. A brief explanation of the primitives can be found in the textbook (beginning page 242); a more elaborate presentation is available in the optional book.

Function Class                 Standard   Function        Operation
Initialization and Termination MPI        MPI_Init        Start of SPMD code
                               MPI        MPI_Finalize    End of SPMD code
Abnormal Stop                  MPI        MPI_Abort       One process halts all
Process Control                MPI        MPI_Comm_size   Number of processes
                               MPI        MPI_Comm_rank   Identifier of calling process
                               MPI        MPI_Wtime       Local (wall-clock) time
Message Passing                MPI        MPI_Send        Send message to remote proc.
                               MPI        MPI_Recv        Receive mess. from remote proc.

51 General MPI program
#include <mpi.h>
...
int main(int argc, char* argv[]) {
    ...
    MPI_Init(&argc, &argv); /* no MPI functions called before this */
    ...
    MPI_Finalize(); /* no MPI functions called after this */
    ...
    return 0;
}

52 MPI and MPI-2 Initialization and Termination
#include <mpi.h>
int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);
Multiple processes from the same source are created by issuing the function MPI_Init, and these processes are safely terminated by issuing MPI_Finalize. The arguments of MPI_Init are the addresses of main’s parameters argc and argv; the MPI implementation removes the command line arguments that it used/processed itself.
Thus command line processing should only be performed in the program after the execution of this function call. On success MPI_Init returns MPI_SUCCESS; otherwise an implementation-dependent error code is returned. Definitions are available in <mpi.h>.

53 MPI and MPI-2 Abort
int MPI_Abort(MPI_Comm comm, int errcode);
MPI_Abort is used to abort an MPI program: one process can halt all the others. The first argument is a communicator and the second is an integer error code. A communicator is a collection of processes that can send messages to each other. The default communicator is MPI_COMM_WORLD, which consists of all the processes running when program execution begins.

54 MPI and MPI-2 Communicators; Process Control
Under MPI, a communication domain is a set of processors that are allowed to communicate with each other. Information about such a domain is stored in a communicator that uniquely identifies the processors that participate in a communication operation. The default communication domain contains all the processors of a parallel execution; it is called MPI_COMM_WORLD. By using multiple communicators between possibly overlapping groups of processors, we make sure that messages do not interfere with each other.
#include <mpi.h>
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
Thus
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
return the number of processors nprocs and the processor id pid of the calling processor.

55 MPI and MPI-2 Communicators; Process Control: Example
A hello world! program in MPI is the following one.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int nprocs, mypid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);
    printf("Hello world from process %d of total %d\n", mypid, nprocs);
    MPI_Finalize();
    return 0;
}

56 Submit job to Grid Engine
(1) Create the MPI program, e.g. example.c.
(2) Compile the program: mpicc -O2 example.c -o example
(3) Create a submit script (e.g. with vi submit) containing:
#!/bin/bash
#$ -S /bin/sh
#$ -N example
#$ -q default
#$ -pe lammpi 4
#$ -cwd
mpirun C ./example
(4) Run chmod a+x submit
(5) Run qsub submit
(6) Run qstat to check the status of your program
(2)- compiled the program, using mpicc -O2 example.c -o example
(3)- created a submit script, e.g. with vi submit
(4)- put the following into submit:
#!/bin/bash
#$ -S /bin/sh
#$ -N example
#$ -q default
#$ -pe lammpi 4
#$ -cwd
mpirun C ./example
(5)- then ran chmod a+x submit
(6)- then ran qsub submit
(7)- ran qstat to check the status of the program

57 MPI Message-Passing primitives
#include <mpi.h>
/* Blocking send and receive */
int MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Status *stat);
/* Non-blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);
buf - initial address of send/receive buffer
count - number of elements in send buffer (nonnegative integer), or maximum number of elements in receive buffer
dtype - datatype of each send/receive buffer element (handle)
dest, src - rank of destination/source (integer). Wild-card: MPI_ANY_SOURCE for recv only; no wildcard for dest.
tag - message tag (integer), range 0...32767. Wild-card: MPI_ANY_TAG for recv only; send must specify a tag.
comm - communicator (handle)
stat - status object (Status), which returns the source and tag of the message that was actually received; it can be the MPI constant
MPI_STATUS_IGNORE if the return status is not desired.
Fields of the status struct: status->MPI_SOURCE, status->MPI_TAG, status->MPI_ERROR.

58 data type correspondence between MPI and C
MPI_CHAR --> signed char, MPI_SHORT --> signed short int, MPI_INT --> signed int, MPI_LONG --> signed long int, MPI_UNSIGNED_CHAR --> unsigned char, MPI_UNSIGNED_SHORT --> unsigned short int, MPI_UNSIGNED --> unsigned int, MPI_UNSIGNED_LONG --> unsigned long int, MPI_FLOAT --> float, MPI_DOUBLE --> double, MPI_LONG_DOUBLE --> long double; MPI_BYTE and MPI_PACKED have no C equivalent.

59 MPI Message-Passing primitives
The MPI_Send and MPI_Recv functions are blocking, that is, they do not return unless it is safe to modify or use the contents of the send/receive buffer. MPI also provides non-blocking send and receive primitives. These are MPI_Isend and MPI_Irecv, where the I stands for Immediate. These functions allow a process to post that it wants to send to or receive from another process, and then allow the process to call a function (e.g. MPI_Wait) to complete the send-receive pair. Non-blocking send-receives allow for the overlapping of computation and communication. Thus MPI_Wait plays the role of synchronizer: the send/receive calls are only advisories, and communication is only effected at the MPI_Wait.

60 MPI Message-Passing primitives: tag
A tag is simply an integer argument that is passed to a communication function and that can be used to identify a message. For example, in MPI if process A sends a message to process B, then in order for B to receive the message, the tag used in A's call to MPI_Send must be the same as the tag used in B's call to MPI_Recv. Thus, if the characteristics of two messages sent by A to B are identical (i.e., same count and datatype), then A and B can distinguish between the two by using different tags. For example, suppose A is sending two floats, x and y, to B.
Then the processes can be sure that the values are received correctly, regardless of the order in which A sends and B receives, provided different tags are used: /* Assume system provides some buffering */ if (my_rank == A) { tag = 0; MPI_Send(&x, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD); ... tag = 1; MPI_Send(&y, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD); } else if (my_rank == B) { tag = 1; MPI_Recv(&y, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status); ... tag = 0; MPI_Recv(&x, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status); } 61 MPI Message-Passing primitives: Communicators Now if one message from process A to process B is being sent by the library, and another, with identical characteristics, is being sent by the user's code, unless the library developer insists that user programs refrain from using certain tag values, this approach cannot be made to work. Clearly, partitioning the set of possible tags is at best an inconvenience: if one wishes to modify an existing user code so that it can use a library that partitions tags, each message passing function in the entire user code must be checked. The solution that was ultimately decided on was the communicator. Formally, a communicator is a pair of objects: the first is a group or ordered collection of processes, and the second is a context, which can be viewed as a unique, system-defined tag. Every communication function in MPI takes a communicator argument, and a communication can succeed only if all the processes participating in the communication use the same communicator argument. Thus, a library can either require that its functions be passed a unique library-specific communicator, or its functions can create their own unique communicator. In either case, it is straightforward for the library designer and the user to make certain that their messages are not confused. 
62 For example, suppose now that the user's code is sending a float, x, from process A to process B, while the library is sending a float, y, from A to B: /* Assume system provides some buffering */ void User_function(int my_rank, float* x) { MPI_Status status; if (my_rank == A) { /* MPI_COMM_WORLD is pre-defined in MPI */ MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD); } else if (my_rank == B) { MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status); } ... } void Library_function(float* y) { MPI_Comm library_comm; MPI_Status status; int my_rank; /* Create a communicator with the same group */ /* as MPI_COMM_WORLD, but a different context */ MPI_Comm_dup(MPI_COMM_WORLD, &library_comm); /* Get process rank in new communicator */ MPI_Comm_rank(library_comm, &my_rank); if (my_rank == A) MPI_Send(y, 1, MPI_FLOAT, B, 0, library_comm); else if (my_rank == B) { MPI_Recv(y, 1, MPI_FLOAT, A, 0, library_comm, &status); } ... } int main(int argc, char* argv[]) { ... if (my_rank == A) { User_function(A, &x); ... Library_function(&y); } else if (my_rank == B) { Library_function(&y); ... User_function(B, &x); } ... } 63 MPI Message-Passing primitives: User-defined datatypes The second main innovation in MPI, user-defined datatypes, allows programmers to exploit this power, and as a consequence, to create messages consisting of logically unified sets of data rather than only physically contiguous blocks of data. Loosely, an MPI datatype is a sequence of displacements in memory together with a collection of basic datatypes (e.g., int, float, double, and char). Thus, an MPI-datatype specifies the layout in memory of data to be collected into a single message or data to be distributed from a single message. For example, suppose we specify a sparse matrix entry with the following definition. typedef struct { double entry; int row, col; } mat_entry_t; MPI provides functions for creating a variable that stores the layout in memory of a variable of type mat_entry_t. 
One does this by first defining an MPI datatype
MPI_Datatype mat_entry_mpi_t;
to be used in communication functions, and then calling various MPI functions to initialize mat_entry_mpi_t so that it contains the required layout. Then, if we define
mat_entry_t x;
we can send x by simply calling
MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag, comm);
and we can receive x with a similar call to MPI_Recv.

64 MPI: An example with the blocking operations
#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be a multiple of nprocs to avoid problems.
// Parallel sum of 1, 2, 3, ..., N
int main(int argc, char **argv) {
int pid, nprocs, i, j;
int sum, start, end, total;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
sum = 0; total = 0;
start = (N/nprocs)*pid + 1; // Each processor sums its own range
end = (N/nprocs)*(pid+1);
for (i = start; i <= end; i++) sum += i;
if (pid != 0) {
MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
} else {
for (j = 1; j < nprocs; j++) {
MPI_Recv(&total, 1, MPI_INT, j, 1, MPI_COMM_WORLD, &status);
sum = sum + total;
}
}
if (pid == 0) printf(" The sum from 1 to %d is %d \n", N, sum);
MPI_Finalize();
}
// Note: Program neither compiled nor run!

65 Non-Blocking send and Receive
/* Non-blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);
MPI_Wait waits for an MPI send or receive to complete. Input parameter: request (handle). Output parameter: status object (Status), which may be MPI_STATUS_IGNORE.

66 MPI: An example with the nonblocking operations
#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be a multiple of nprocs to avoid problems.
// Parallel sum of 1, 2, 3, ..., N
int main(int argc, char **argv) {
int pid, nprocs, i, j;
int sum, start, end, total;
MPI_Status status;
MPI_Request request;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
sum = 0; total = 0;
start = (N/nprocs)*pid + 1; // Each processor sums its own range
end = (N/nprocs)*(pid+1);
for (i = start; i <= end; i++) sum += i;
if (pid != 0) {
// MPI_Send(&sum,1,MPI_INT,0,1,MPI_COMM_WORLD);
MPI_Isend(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &request);
MPI_Wait(&request, &status);
} else {
for (j = 1; j < nprocs; j++) {
MPI_Recv(&total, 1, MPI_INT, j, 1, MPI_COMM_WORLD, &status);
sum = sum + total;
}
}
if (pid == 0) printf(" The sum from 1 to %d is %d \n", N, sum);
MPI_Finalize();
}
// Note: Program neither compiled nor run!

67 MPI Basic Collective Operations
One simple collective operation:
int MPI_Bcast(void *message, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
The routine MPI_Bcast sends data from one process to all others.

68 MPI_Bcast
[Figure: process 0 holds the data and writes it to processes 1, 2 and 3; each receiving process blocks until the broadcast data is present, then unblocks.]

69 Simple Program that Demonstrates MPI_Bcast:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
int k, id, p, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (id == 0) k = 20; else k = 10;
for (p = 0; p < size; p++) {
if (id == p) printf("Process %d: k= %d before\n", id, k);
}
// note: MPI_Bcast must be placed where all processes execute it.
MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
for (p = 0; p < size; p++) {
if (id == p) printf("Process %d: k= %d after\n", id, k);
}
MPI_Finalize();
return 0;
}

70 Simple Program that Demonstrates MPI_Bcast:
The output would look like:
Process 0: k= 20 before
Process 0: k= 20 after
Process 3: k= 10 before
Process 3: k= 20 after
Process 2: k= 10 before
Process 2: k= 20 after
Process 1: k= 10 before
Process 1: k= 20 after

71 Parallel Algorithm Assumptions
Convention: In this subject we name processors arbitrarily either 0, 1, . . . , p − 1 or 1, 2, . . . , p. The input to a particular problem resides in the cells of the shared memory. We assume, in order to simplify the exposition of our algorithms, that a cell is wide enough (in bits or bytes) to accommodate a single instance of the input (e.g. a key or a floating point number). If the input is of size n, the first n cells numbered 0, . . . , n − 1 store the input. We assume that the number of processors of the PRAM is n or a polynomial function of the size n of the input. Processor indices are 0, 1, . . . , n − 1.

72 PRAM Algorithm: Matrix Multiplication
A simple algorithm for multiplying two n × n matrices on a CREW PRAM with time complexity T = O(lg n) and P = n^3 follows. For convenience, processors are indexed as triples (i, j, k), where i, j, k = 1, . . . , n. In the first step processor (i, j, k) concurrently reads aij and bjk and performs the multiplication aij·bjk. In the following steps, for all i, k the results (i, ∗, k) are combined, using the parallel sum algorithm, to form cik = Σj aij·bjk. After lg n steps, the result cik is thus computed. The same algorithm also works on the EREW PRAM with the same time and processor complexity. Only the first step of the CREW algorithm needs to be changed. We avoid concurrency by broadcasting element aij to processors (i, j, ∗) using the broadcasting algorithm of the EREW PRAM in O(lg n) steps. Similarly, bjk is broadcast to processors (∗, j, k).
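The CREW scheme above can be checked with a small sequential model. The sketch below (plain C written for these notes; `pram_matmul` and the 4×4 size are my choices, not from the slides) forms one product per (i, j, k) triple as in step 1, then sums the products over j in lg n pairwise rounds, mimicking the parallel sum tree.

```c
#include <assert.h>

#define M 4  /* matrix dimension n; a power of two so the lg n rounds split evenly */

/* Step 1: "processor" (i,j,k) computes a[i][j]*b[j][k].
   Later steps: for each fixed (i,k), the M products are summed in pairwise
   rounds with shrinking stride, as the parallel sum algorithm does. */
void pram_matmul(int a[M][M], int b[M][M], int c[M][M]) {
    static int prod[M][M][M];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < M; k++)
                prod[i][j][k] = a[i][j] * b[j][k];   /* one product per triple */
    for (int s = M / 2; s >= 1; s /= 2)              /* lg M combining rounds */
        for (int i = 0; i < M; i++)
            for (int k = 0; k < M; k++)
                for (int j = 0; j < s; j++)
                    prod[i][j][k] += prod[i][j + s][k];
    for (int i = 0; i < M; i++)
        for (int k = 0; k < M; k++)
            c[i][k] = prod[i][0][k];                 /* c_ik = sum_j a_ij*b_jk */
}
```

On an actual CREW PRAM the loops over (i, j, k) inside each stage collapse into one parallel step, which is where the T = O(lg n), P = n^3 bounds quoted above come from.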
The above algorithm also shows how an n-processor EREW PRAM can simulate an n-processor CREW PRAM with an O(lg n) slowdown.

73 Matrix Multiplication
Step / CREW cost / EREW cost:
1. aij to all (i,j,*) procs: O(1) / O(lg n); bjk to all (*,j,k) procs: O(1) / O(lg n)
2. aij*bjk at proc (i,j,k): O(1) / O(1)
3. parallel sum of aij*bjk over j at procs (i,*,k), n procs participate: O(lg n) / O(lg n)
4. cik = Σj aij*bjk: O(1) / O(1)
Totals: T = O(lg n), P = O(n^3), W = O(n^3 lg n), W2 = O(n^3).

74 PRAM Algorithm: Logical AND operation
Problem. Let X1, . . . , Xn be binary/boolean values. Find X = X1 ∧ X2 ∧ . . . ∧ Xn.
The sequential problem accepts a P = 1, T = O(n), W = O(n) direct solution. An EREW PRAM algorithm for this problem works the same way as the PARALLEL SUM algorithm and its performance is P = O(n), T = O(lg n), W = O(n lg n), along with the improvements in P and W mentioned for the PARALLEL SUM algorithm. In the remainder we will investigate a CRCW PRAM algorithm.
Let binary value Xi reside in shared memory location i. We can find X = X1 ∧ X2 ∧ . . . ∧ Xn in constant time on a CRCW PRAM. Processor 1 first writes a 1 in shared memory cell 0. If Xi = 0, processor i writes a 0 in memory cell 0. The result X is then stored in this memory cell. The result stored in cell 0 is 1 (TRUE) unless some processor writes a 0 in cell 0; then one of the Xi is 0 (FALSE) and the result X should be FALSE, as it is.

75 Logical AND operation
begin Logical AND (X1 . . . Xn)
1. Processor 1 writes 1 into cell 0.
2. if Xi = 0, processor i writes 0 into cell 0.
end Logical AND
Exercise: Give an O(1) CRCW algorithm for LOGICAL OR.

76 Parallel Operations with Multiple Outputs – Parallel Prefix
Problem definition: Given a set of n values x0, x1, . . . , xn−1 and an associative operator, say +, the parallel prefix problem is to compute the following n results/"sums":
0: x0,
1: x0 + x1,
2: x0 + x1 + x2,
...
n − 1: x0 + x1 + . . . + xn−1.
Parallel prefix is also called prefix sums or scan.
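The doubling idea behind the EREW solution, each cell repeatedly absorbing the cell one, two, four, ... positions to its left, can be played out sequentially to check that it really produces every prefix sum in lg n phases. The C model below is my sketch, not code from the notes; the inner loop plays all "processors" of one phase of parallel prefix.

```c
#include <assert.h>
#include <string.h>

/* Prefix sums by doubling. In the phase with offset d, every cell j >= d
   adds in the old value of cell j-d; after that phase m[j] holds the sum
   of the last 2*d inputs ending at j. After ceil(lg n) phases,
   m[j] = x0 + ... + xj for every j. */
void prefix_by_doubling(int *m, int n) {
    int old[1024];                         /* phase snapshot; assumes n <= 1024 */
    for (int d = 1; d < n; d *= 2) {
        memcpy(old, m, n * sizeof(int));   /* EREW discipline: reads precede writes */
        for (int j = d; j < n; j++)        /* on the PRAM, one processor per j */
            m[j] = old[j - d] + old[j];
    }
}
```

With n processors each phase is one parallel step, so after lg n steps the array holds all n sums at once, which is exactly the claim about parallel prefix.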
It has many uses in parallel computing such as in load-balancing the work assigned to processors and compacting data structures such as arrays. We shall prove that computing ALL THE SUMS is no more difficult than computing the single sum x0 + . . . + xn−1.

77 Parallel Prefix Algorithm 1: divide-and-conquer
[Figure: a parallel prefix "box" for 8 inputs x0, . . . , x7 is built recursively from two parallel prefix boxes for 4 inputs each: Box 1 handles x0...x3 and Box 2 handles x4...x7. The rightmost output of Box 1 (x0+...+x3) is combined with each output of Box 2, yielding the outputs x0, x0+x1, . . . , x0+...+x7.]

78 Parallel Prefix Algorithm 2:
An algorithm for parallel prefix on an EREW PRAM requires lg n phases. In phase i, processor j reads the contents of cells j and j − 2^(i−1) (if the latter exists), combines them and stores the result in cell j. The EREW PRAM algorithm that solves the parallel prefix problem has performance P = O(n), T = O(lg n), and W = O(n lg n), W2 = O(n).

79 Parallel Prefix Algorithm 2: Example
When we write x1 + . . . + x5 we mean x1 + x2 + x3 + x4 + x5. Starting from x1, x2, . . . , x8:
After step 1: x1, x1+x2, x2+x3, x3+x4, x4+x5, x5+x6, x6+x7, x7+x8.
After step 2: x1, x1+x2, x1+x2+x3, x1+...+x4, x2+...+x5, x3+...+x6, x4+...+x7, x5+...+x8.
After step 3 (finally): x1, x1+x2, x1+...+x3, x1+...+x4, x1+...+x5, x1+...+x6, x1+...+x7, x1+...+x8.

80 Parallel Prefix Algorithm 2: Example 2
We write [1:2] to denote x1+x2, and in general [i:j] to denote xi + ... + xj; [i:i] is xi, NOT xi+xi.
[1:2][3:4] = [1:2]+[3:4] = (x1+x2) + (x3+x4) = x1+x2+x3+x4. A * indicates that the value above remains the same in subsequent steps.
Start:  [1:1] [2:2] [3:3] [4:4] [5:5] [6:6] [7:7] [8:8]
Step 1: [1:1] [1:2] [2:3] [3:4] [4:5] [5:6] [6:7] [7:8]
Step 2: *     *     [1:3] [1:4] [2:5] [3:6] [4:7] [5:8]
Step 3: *     *     *     *     [1:5] [1:6] [1:7] [1:8]
i.e. x1, x1+x2, x1+x2+x3, x1+...+x4, x1+...+x5, x1+...+x6, x1+...+x7, x1+...+x8.

81 Parallel Prefix Algorithm 2:
// We write [1:2] to denote X[1]+X[2]
// [i:j] to denote X[i]+X[i+1]+...+X[j]
// [i:i] is X[i], NOT X[i]+X[i]
// [1:2][3:4]=[1:2]+[3:4]=(X[1]+X[2])+(X[3]+X[4])=X[1]+X[2]+X[3]+X[4]
// Input : M[j] = X[j] = [j:j] for j=1,...,n.
// Output: M[j] = X[1]+...+X[j] = [1:j] for j=1,...,n.
ParallelPrefix(n)
1. d=1;                // d is the current offset; M[j] = [j-d+1:j] holds, i.e. M[j]=[j:j]
2. while (d < n) {
3.   j=pid();
4.   if (j-d > 0) {
5.     a=M[j];         // Before this step M[j] = [j-d+1:j]
6.     b=M[j-d];       // Before this step M[j-d] = [j-2d+1:j-d]
7.     M[j]=a+b;       // After this step M[j] = [j-2d+1:j]
8.   }
9.   d=d*2;
}
At line 6, memory location j − d is read provided that j − d ≥ 1. This is true for the first tj = lg(j − 1) + 1 phases. In later phases the test of line 4 fails and lines 5-8 are not executed.

82 Parallel Prefix Algorithm based on Complete Binary Tree
Consider the following variation of parallel prefix on n inputs that works on a complete binary tree with n leaves (assume n is a power of two).
Action by nodes:
1. Non-leaf: If it receives l and r from its left and right children, it computes l + r and sends it up, and sends l down to its right child.
2. Root: Step [1] except nothing is sent up.
3. Non-leaf: If it gets p from its parent, it transmits p to its left and right children.
4. Leaf: If it holds l and receives p from its parent, it sets l = p + l (in this order) [note p is the left argument, l is the right one; order matters].

83 Parallel Prefix Algorithm based on Complete Binary Tree: Example
[Figure: a complete binary tree over the leaves x1, . . . , x8. On the way up each internal node computes the sum of its children (x1+x2, x3+x4, x5+x6, x7+x8, then x1+...+x4 and x5+...+x8, then x1+...+x8 at the root). On the way down each node forwards to its right child the sum received from its left child. After receiving, the leaves hold x1, x1+x2, x1+...+x3, x1+...+x4, x1+...+x5, x1+...+x6, x1+...+x7, x1+...+x8.]

84 PRAM Algorithms: Maximum finding
Problem. Let X1, . . . , XN be N keys. Find X = max{X1, X2, . . . , XN}.
The sequential problem accepts a P = 1, T = O(N), W = O(N) direct solution. An EREW PRAM algorithm for this problem works the same way as the PARALLEL SUM algorithm and its performance is P = O(N), T = O(lg N), W = O(N lg N), W2 = O(N), along with the improvements in P and W mentioned for the PARALLEL SUM algorithm. In the remainder we will investigate CRCW PRAM algorithms.
Let key Xi reside in the local memory of processor i. The CRCW PRAM algorithm MAX1 to be presented has performance T = O(1), P = O(N^2), and work W2 = W = O(N^2). The second algorithm to be presented in the following pages utilizes what is called a doubly-logarithmic depth tree and achieves T = O(lglg N), P = O(N) and W = W2 = O(N lglg N). The third algorithm is a combination of the EREW PRAM algorithm and the CRCW doubly-logarithmic depth tree-based algorithm and requires T = O(lglg N), P = O(N) and W2 = O(N).

85 PRAM Algorithm Maximum Finding
begin Max1 (X1 . . . XN)
1. in proc (i, j): if Xi ≥ Xj then xij = 1;
2. else xij = 0;
3. Yi = xi1 ∧ . . . ∧ xiN;
4. Processor i reads Yi;
5. if Yi = 1, processor i writes i into cell 0.
end Max1
In the algorithm, we rename processors so that pair (i, j) refers to processor j × N + i. Variable Yi is equal to 1 if and only if Xi is the maximum. The CRCW PRAM algorithm MAX1 has performance T = O(1), P = O(N^2), and work W2 = W = O(N^2).

86 Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design
Under the PRAM model, synchronization is ignored and thus is seen as free, as PRAM processors work synchronously. The model also ignores communication, as in the PRAM the cost of accessing the shared memory is as small as the cost of accessing the local registers of the PRAM. In reality, the exchange of data can significantly impact the efficiency of parallel programs by introducing interaction delays during their execution. It takes roughly ts + m·tw time for a simple exchange of an m-word message between two processes running on different nodes of an interconnection network with cut-through routing.
ts: latency, or the startup time for the data transfer.
tw: per-word transfer time, which is inversely proportional to the available bandwidth between the nodes.

87 Basic Communication Operations – One-to-all broadcast and all-to-one reduction
Assume that p processes participate in the operation and the data to be broadcast or reduced contains m words. A one-to-all broadcast or all-to-one reduction procedure involves log p point-to-point simple message transfers, each at a time cost of ts + m·tw. Therefore, the total time taken by the procedure is
T = (ts + m·tw) log p
This is true for all the interconnection networks considered.

88 All-to-all Broadcast and Reduction
Linear Array and Ring: p different messages circulate in the p-node ensemble. If communication is performed circularly in a single direction, then each node receives all (p−1) pieces of information from all other nodes in (p−1) steps. So the total time is:
T = (ts + m·tw)(p−1)
2-D Mesh: Based on the linear array algorithm, treating the rows and columns of the mesh as linear arrays.
Two phases:
Phase one: each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, all nodes collect the √p messages corresponding to the √p nodes of their respective rows. Each node consolidates this information into a single message of size m√p. The time for this phase is:
T1 = (ts + m·tw)(√p − 1)
Phase two: columnwise all-to-all broadcast of the consolidated messages. By the end of this phase, each node obtains all p pieces of m-word data that originally resided on different nodes. The time for this phase is:
T2 = (ts + m√p·tw)(√p − 1)
The time for the entire all-to-all broadcast on a p-node two-dimensional square mesh is the sum of the times spent in the individual phases:
T = 2ts(√p − 1) + m·tw(p − 1)
Hypercube: the i-th of the log p steps exchanges messages of size 2^(i−1)·m, so
T = Σ (i=1..log p) (ts + 2^(i−1)·tw·m) = ts log p + tw·m(p − 1)

89 Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design
As an example of how traditional PRAM algorithm design differs from architecture independent parallel algorithm design, an example algorithm for broadcasting in a parallel machine is introduced.
Problem: In a parallel machine with p processors numbered 0, . . . , p − 1, one of them, say processor 0, holds a one-word message. The problem of broadcasting involves the dissemination of this message to the local memory of the remaining p − 1 processors.
The performance of a well-known exclusive PRAM algorithm for broadcasting is analyzed below in two ways, under the assumption that no concurrent operations are allowed. One follows the traditional (PRAM) analysis that minimizes parallel running time. The other takes into consideration the issues of communication and synchronization. This leads to a modification of the PRAM-based algorithm to derive an architecture independent algorithm for broadcasting whose performance is consistent with observations of broadcasting operations on real parallel machines.

90 Broadcasting: MPI Algorithm 1
Algorithm. Without loss of generality let us assume that p is a power of two.
The message is broadcast in lg p rounds of communication by binary replication. In round i = 1, . . . , lg p, each processor j with index j < 2^(i−1) sends the message it currently holds to processor j + 2^(i−1) (on a shared memory system, this may mean copying information into a cell read by this processor). The number of processors with the message at the end of round i is thus 2^i.
Analysis of Algorithm. Under the PRAM model the algorithm requires lg p communication rounds and as many parallel steps to complete. This cost, however, ignores synchronization, which is free, as PRAM processors work synchronously. It also ignores communication, as in the PRAM the cost of accessing the shared memory is as small as the cost of accessing the local registers of the PRAM.

91 Broadcasting: MPI Algorithm 1
Under the MPI cost model each communication round is assigned a cost of max{ts, tw · 1}, as each processor in each round sends or receives at most one message containing the one-word message. The cost of the algorithm is lg p · max{ts, tw · 1}, as there are lg p rounds of communication. As the information communicated by any processor is small in size, it is likely that latency issues prevail in the transmission time (i.e. the bandwidth-based cost tw · 1 is insignificant compared to the latency/synchronization-reflecting term ts). In high latency machines the dominant term would be ts lg p rather than tw lg p. Even though each communication round lasts for at least ts time units, only a small fraction tw of it is used for actual communication. The remainder is wasted. It then makes sense to increase communication round utilization so that each processor sends the one-word message to as many processors as it can accommodate within a round. The total time is: lg p · (ts + tw)

92 Broadcasting: MPI Algorithm 2
Input: p processors numbered 0 . . . p − 1. Processor 0 holds a message of length equal to one word.
Output: The problem of broadcasting involves the dissemination of this message to the remaining p − 1 processors.
Algorithm 2. In one superstep, processor 0 sends the message to be broadcast to processors 1, . . . , p − 1 in turn (a "sequential"-looking algorithm).
Analysis of Algorithm 2. The communication time of Algorithm 2 is 1 · max{ts, (p − 1) · tw} (in a single superstep, the message is replicated p − 1 times by processor 0). The total time is ts + (p − 1)·tw.

93 Broadcasting: MPI Algorithm 3
Both Algorithm 1 and Algorithm 2 can be viewed as extreme cases of an architecture independent Algorithm 3. The main observation is that up to L/g words can be sent in a superstep at a cost of ts. It then makes sense for each processor to send L/g messages to other processors. Let k − 1 be the number of messages a processor sends to other processors in a broadcasting step. The number of processors with the message at the end of a broadcasting superstep is then k times larger than at its start. We call k the degree of replication of the broadcast operation.
In each round, every processor sends the message to k − 1 other processors. In round i = 0, 1, . . ., each processor j with index j < k^i sends the message to k − 1 distinct processors numbered j + k^i · l, where l = 1, . . . , k − 1. At the end of round i (the (i+1)-st overall round), the message is broadcast to k^i · (k − 1) + k^i = k^(i+1) processors. The number of rounds required is the minimum integer r such that k^r ≥ p. The number of rounds necessary for full dissemination is thus decreased to log_k p, and the total cost becomes log_k p · max{ts, (k − 1)·tw}.
At the end of each superstep the number of processors possessing the message is k times that of the previous superstep. During each superstep each processor sends the message to exactly k − 1 other processors. Algorithm 3 consists of a number of rounds between 1 (when it becomes Algorithm 2) and lg p (when it becomes Algorithm 1).
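The round count and cost of Algorithm 3 are easy to sanity-check in a few lines of C (an illustrative sketch; `bcast_rounds`, `bcast_cost` and the parameter values are mine, not from the notes). Each round multiplies the number of informed processors by k, so the smallest r with k^r ≥ p rounds suffice, and each round costs max{ts, (k − 1)·tw}.

```c
#include <assert.h>

/* Smallest r with k^r >= p: rounds of k-fold replication needed until all
   p processors hold the message (k is the degree of replication;
   k = 2 recovers Algorithm 1, k = p recovers Algorithm 2). */
int bcast_rounds(int p, int k) {
    int r = 0;
    long covered = 1;                 /* processors currently holding the message */
    while (covered < p) { covered *= k; r++; }
    return r;
}

/* Total cost under the notes' model: number of rounds times max{ts, (k-1)*tw}. */
double bcast_cost(int p, int k, double ts, double tw) {
    double per_round = (k - 1) * tw > ts ? (k - 1) * tw : ts;
    return bcast_rounds(p, k) * per_round;
}
```

For p = 8 on a latency-dominated machine (say ts = 10, tw = 1), k = 2 costs 3·10 = 30 while k = 8 costs 1·10 = 10, illustrating why a larger degree of replication pays off when ts dominates.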
The total time is: log_k p · (ts + (k − 1)·tw)

94 Broadcasting: MPI Algorithm 3
Broadcast(0, p, k)
1. my_pid = pid(); mask_pid = 1;
2. while (mask_pid < p) {
3.   if (my_pid < mask_pid)
4.     for (i = 1, j = mask_pid; i < k; i++, j += mask_pid) {
5.       target_pid = my_pid + j;
6.       if (target_pid < p) mpi_put(target_pid, &M, &M, 0, sizeof(M)); (or mpi_send...)
7.     }
8.   else if ((my_pid >= mask_pid) and (my_pid < k * mask_pid))
9.     mpi_get() or mpi_recv...
10.  mask_pid = mask_pid * k;
11. }

95 Broadcasting n > p words: Algorithm 4
Now suppose that the message to be broadcast is not a single word but of size n > p. Algorithm 4 may be a better choice than the previous algorithms, in which one of the processors sends or receives substantially more than n words of information. (n·tw >> ts)
There is a broadcasting algorithm, call it Algorithm 4, that requires only two communication rounds and is optimal (for the communication model abstracted by ts and tw) in terms of the amount of information (up to a constant) each processor sends or receives.
Algorithm 4. Two-phase broadcasting. The idea is to split the message into p pieces, have processor 0 send piece i to processor i in the first round, and in the second round have processor i replicate the i-th piece p − 1 times by sending one copy to each of the remaining p − 1 processors (see attached figure). The total time is p one-to-one transfers plus one all-to-all broadcast:
(ts + (n/p)·tw)(p − 1) + (ts + (n/p)·tw)(p − 1) = 2(ts + (n/p)·tw)(p − 1)

96

97 Matrix Computations
SPMD program design stipulates that processors execute a single program on different pieces of data. For matrix related computations it makes sense to distribute a matrix evenly among the p processors of a parallel computer. Such a distribution should also take into consideration the storage of the matrix by, say, the compiler, so that locality issues are taken into account (filling cache lines efficiently to speed up computation). There are various ways to divide a matrix. Some of the most common ones are described below.
One way to distribute a matrix is by using block distributions. Split an array into blocks of size n/p1 × n/p2 so that p = p1 × p2, and assign the i-th block to processor i. This distribution is suitable as long as the amount of work for different elements of the matrix is the same. The most common block distributions are:
• column-wise (block) distribution. Split the matrix into p column stripes so that n/p consecutive columns form the i-th stripe, stored in processor i. This is p1 = 1 and p2 = p.
• row-wise (block) distribution. Split the matrix into p row stripes so that n/p consecutive rows form the i-th stripe, stored in processor i. This is p1 = p and p2 = 1.
• block or square distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size n/√p × n/√p, and block i is stored in processor i.
There are certain cases (e.g. LU decomposition, Cholesky factorization) where the amount of work differs for different elements of a matrix. For these cases block distributions are not suitable.

98 Matrix block distributions

99 Matrix-Vector Multiplication
Sequential algorithm: the running time is O(n^2), i.e. n^2 multiplications and additions.
MAT_VECT(A, x, y) {
for i = 0 to n-1 do {
y[i] = 0;
for j = 0 to n-1 do
y[i] = y[i] + A[i][j]*x[j];
}
}

100 Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Assume p = n (p – no. of processors). Steps:
Step 1: Initial partition of matrix and vector. Matrix distribution: each process gets one complete row of the matrix. Vector distribution: the n×1 vector is distributed such that each process owns one of its elements.
Step 2: All-to-all broadcast. Every process has one element of the vector, but every process needs the entire vector.
Step 3: Computation. Process Pi computes y[i] = Σ (j=0..n−1) A[i,j]·x[j].
Running time: the all-to-all broadcast takes θ(n) on any architecture; the multiplication of a single row of A with vector x is θ(n). Total running time is θ(n).
Total work is θ(n^2) – cost-optimal 101

Matrix-Vector Multiplication: Rowwise 1-D Partitioning 102

Matrix-Vector Multiplication: Rowwise 1-D Partitioning Assume p < n (p – no. of processors). Three steps:
Step 1: initial partition of matrix and vector. Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
Step 2: all-to-all broadcast among p processes, involving messages of size n/p.
Step 3: computation. Each process multiplies its n/p rows of the matrix with the vector x to produce n/p elements of the result vector.
Running time:
All-to-all broadcast: T = (ts + (n/p)tw)(p−1) on any architecture; T = ts log p + (n/p)tw(p−1) on a hypercube.
Computation: T = n · (n/p) = θ(n^2/p).
Total running time: T = θ(n^2/p + ts log p + n tw).
Total work: W = θ(n^2 + ts p log p + n p tw) – cost-optimal 103

Matrix-Vector Multiplication: Columnwise 1-D Partitioning Similar to rowwise 1-D partitioning. 104

Matrix-Vector Multiplication: 2-D Partitioning Assume p = n^2. Steps:
Step 1: initial partitioning. Each process gets one element of the matrix. The vector is distributed only among the processes on the diagonal, each of which owns one element.
Step 2: columnwise one-to-all broadcast. The ith element of the vector should be available to every process in the ith column of the process mesh, so this step consists of n simultaneous one-to-all broadcast operations, one in each column of processes.
Step 3: computation. Each process multiplies its matrix element with the corresponding element of x.
Step 4: all-to-one reduction of partial results. The products computed for each row must be added, leaving the sums in the last column of processes.
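The four steps of the p = n² formulation can likewise be traced with a sequential simulation (a sketch only; nested lists play the role of the n × n process mesh):

```python
# Sketch of the 2-D (p = n^2) formulation: x starts on the diagonal
# processes, is broadcast down each column, each process forms one
# product, and a rowwise reduction leaves y[i] in the last column.

def matvec_2d(A, x):
    n = len(x)
    # Step 2: columnwise one-to-all broadcast -- x[j], held by diagonal
    # process P(j,j), becomes available to every process in column j.
    col_x = [[x[j] for j in range(n)] for _ in range(n)]
    # Step 3: each process P(i,j) forms the single product A[i][j] * x[j].
    prod = [[A[i][j] * col_x[i][j] for j in range(n)] for i in range(n)]
    # Step 4: all-to-one reduction along each row yields y[i].
    return [sum(prod[i]) for i in range(n)]

print(matvec_2d([[1, 2], [3, 4]], [1, 1]))  # prints [3, 7]
```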
Running time:
One-to-all broadcast: θ(log n)
Computation in each process: θ(1)
All-to-one reduction: θ(log n)
Total running time: θ(log n)
Total work: θ(n^2 log n) – not cost-optimal 105

Matrix-Vector Multiplication: 2-D Partitioning 106

Matrix-Vector Multiplication: 2-D Partitioning Assume p < n^2. Steps:
Step 1: initial partitioning. Each process gets an (n/√p) × (n/√p) block of the matrix. The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.
Step 2: columnwise one-to-all broadcast. The ith group of n/√p vector elements should be available to every process in the ith column of processes, so this step consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.
Step 3: computation. Each process multiplies its (n/√p) × (n/√p) block with the corresponding n/√p elements of x.
Step 4: all-to-one reduction of partial results along each row, leaving the sums in the last column of processes.
Running time:
Columnwise one-to-all broadcast: T = (ts + (n/√p)tw) log √p on any architecture
Computation in each process: T = (n/√p) · (n/√p) = n^2/p
All-to-one reduction: T = (ts + (n/√p)tw) log √p on any architecture
Total running time: T = n^2/p + 2(ts + (n/√p)tw) log √p = n^2/p + ts log p + (n/√p)tw log p 107

Matrix-Vector Multiplication: 1-D Partitioning vs. 2-D Partitioning Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes. If the number of processes is greater than n, then 1-D partitioning cannot be used at all; even when the number of processes is less than or equal to n, 2-D partitioning is preferable. 108

Matrix Distributions: Block Cyclic In block cyclic distributions the rows (similarly for columns) are split into q groups of n/q consecutive rows per group, where potentially q > p, and the i-th group is assigned to a processor in a cyclic fashion.
• column-cyclic distribution. This is a one-dimensional cyclic distribution.
Split the matrix into q column stripes so that n/q consecutive columns form the i-th stripe, which is stored on processor i % p. The symbol % is the mod (remainder of the division) operator. Usually q > p. Sometimes the term wrapped-around column distribution is used for the case where n/q = 1, i.e. q = n.
• row-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q row stripes so that n/q consecutive rows form the i-th stripe, which is stored on processor i % p. Usually q > p. Sometimes the term wrapped-around row distribution is used for the case where n/q = 1, i.e. q = n.
• scattered distribution. Let p = qi · qj processors be divided into qj groups, each group Pj consisting of qi processors. Particularly, Pj = {j·qi + l | 0 ≤ l ≤ qi − 1}. Processor j·qi + l is called the l-th processor of group Pj. This way matrix element (i, j), 0 ≤ i, j < n, is assigned to the (i mod qi)-th processor of group P(j mod qj). A scattered distribution refers to the special case qi = qj = √p. 109

Block cyclic distributions 110

Scattered Distribution 111

Matrix Multiplication – Serial algorithm 112

Matrix Multiplication The algorithm for matrix multiplication presented below was presented in the seminal work of Valiant. It works for p ≤ n^2. Three steps:
Initial partitioning: Matrices A and B are partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size n/√p × n/√p each. These blocks are mapped onto a √p × √p logical mesh of processes. The processes are labeled from P0,0 to P√p−1,√p−1.
All-to-all broadcasting: Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processes, and an all-to-all broadcast of matrix B's blocks is performed in each column.
Computation: After Pi,j acquires Ai,0, Ai,1, …, Ai,√p−1 and B0,j, B1,j, …, B√p−1,j, it performs the submatrix multiplication and addition steps of lines 7 and 8 in Alg 8.3.
Running time:
All-to-all broadcasts: T = (ts + (n^2/p)tw)(√p − 1) on any architecture; T = ts log √p + (n^2/p)tw(√p − 1) on a hypercube.
Computation: T = √p · (n/√p)^3 = n^3/p. 113

Matrix Multiplication The input matrices A and B are divided into p block-submatrices, each one of dimension m × m, where m = n/√p. We call this distribution of the input among the processors block distribution. This way, element A(i, j), 0 ≤ i < n, 0 ≤ j < n, belongs to the (j/m)·√p + (i/m)-th block, which is subsequently assigned to the memory of the same-numbered processor. Let Ai (respectively, Bi) denote the i-th block of A (respectively, B) stored in processor i. With these conventions the algorithm can be described in Figure 1. The following proposition describes the performance of the aforementioned algorithm. 114

Matrix Multiplication 115

Sorting Algorithm Rearranging a list of numbers into increasing (more precisely, nondecreasing) order. 116

Potential Speedup O(n log n) is optimal for any sequential sorting algorithm that does not use special properties of the numbers. The best we can expect from a parallel sorting algorithm using n processors is therefore
Optimal parallel time complexity = O(n log n)/n = O(log n)
This has been obtained, but the constant hidden in the order notation is extremely large. 117

Compare-and-Exchange Sorting Algorithms: Compare and Exchange Compare-and-exchange operations form the basis of several, if not most, classical sequential sorting algorithms. Two numbers, say A and B, are compared. If A > B, A and B are exchanged, i.e.:
if (A > B) {
  temp = A;
  A = B;
  B = temp;
}
118

Message Passing Method P1 sends A to P2 and P2 sends B to P1. Then both processes perform compare operations.
P1 keeps the larger of A and B and P2 keeps the smaller of A and B: 119

Merging Two Sublists 120

Bubble Sort First, the largest number is moved to the end of the list by a series of compares and exchanges, starting at the opposite end. The actions are repeated with subsequent numbers, stopping just before the previously positioned number. In this way, the larger numbers move ("bubble") toward one end. 121

Bubble Sort 122

Time Complexity The number of compare-and-exchange operations is (n−1) + (n−2) + … + 1 = n(n−1)/2, which indicates a time complexity of O(n^2), given that a single compare-and-exchange operation has constant complexity, O(1). 123

Parallel Bubble Sort An iteration can start before the previous iteration has finished, as long as it does not overtake the previous bubbling action: 124

Odd-Even (Transposition) Sort Variation of bubble sort. Operates in two alternating phases, an even phase and an odd phase.
Even phase: even-numbered processes exchange numbers with their right neighbor.
Odd phase: odd-numbered processes exchange numbers with their right neighbor. 125

Sequential Odd-Even Transposition Sort 126

Parallel Odd-Even Transposition Sort Sorting eight numbers 127

Parallel Odd-Even Transposition Sort Consider the one-item-per-processor case. There are n iterations; in each iteration, each processor does one compare-exchange, and these can all be done in parallel. The parallel run time of this formulation is θ(n). This is cost-optimal with respect to the base serial algorithm but not the optimal serial algorithm. 128

Parallel Odd-Even Transposition Sort 129

Parallel Odd-Even Transposition Sort Consider a block of n/p elements per processor. The first step is a local sort. In each subsequent step, the compare-exchange operation is replaced by the compare-split operation. There are p phases, with each phase performing θ(n/p) compares and θ(n/p) communication. The parallel run time of the formulation is
Tp = θ((n/p) log(n/p)) + θ(n) + θ(n)
(local sort + comparisons + communication). The parallel formulation is cost-optimal for p = O(log n).
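The one-item-per-processor formulation can be written down directly as a sequential sketch, with the inner loop standing in for the compare-exchanges that would all run in parallel within one phase:

```python
# Odd-even transposition sort, simulated sequentially: n phases
# alternating between even pairs (0,1), (2,3), ... and odd pairs
# (1,2), (3,4), ...; within a phase all compare-exchanges are
# independent and could run in parallel.

def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1   # even phase, then odd phase
        for i in range(start, n - 1, 2):     # independent pairs
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 3, 8, 1, 7, 2]))  # prints [1, 2, 3, 5, 7, 8]
```

Since each of the n phases takes constant time per pair when done in parallel, this matches the θ(n) parallel run time stated above.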
130 Quicksort Very popular sequential sorting algorithm that performs well, with an average sequential time complexity of O(n log n). First, the list is divided into two sublists, with all numbers in one sublist arranged to be smaller than all numbers in the other sublist. This is achieved by first selecting one number, called a pivot, against which every other number is compared. If the number is less than the pivot, it is placed in one sublist; otherwise, it is placed in the other sublist. The pivot could be any number in the list, but often the first number is chosen. The pivot itself could be placed in one sublist, or it could be separated and placed in its final position. 131

Quicksort 132

Quicksort Example of the quicksort algorithm sorting a sequence of size n = 8. 133

Parallel Quicksort Let's start with recursive decomposition – the list is partitioned by a single process and then each of the subproblems is handled by a different processor. The time for this algorithm is lower-bounded by Ω(n)! It is not cost-optimal, as the process-time product is Ω(n^2). Can we parallelize the partitioning step? In particular, if we can use n processors to partition a list of length n around a pivot in O(1) time, we have a winner. This is difficult to do on real machines, though. 134

Parallel Quicksort Using tree allocation of processes 135

Parallel Quicksort With the pivot being withheld in processes: 136

Analysis Fundamental problem with all tree constructions – the initial division is done by a single processor, which will seriously limit speed. The tree in quicksort will not, in general, be perfectly balanced. Pivot selection is very important to make quicksort operate fast. 137

Parallelizing Quicksort: PRAM Formulation We assume a CRCW (concurrent read, concurrent write) PRAM in which concurrent writes result in an arbitrary write succeeding. The formulation works by creating pools of processors. Every processor is assigned to the same pool initially and has one element.
Each processor attempts to write its element to a common location (for the pool). Each processor then tries to read back the location. If the value read back is greater than the processor's value, it assigns itself to the 'left' pool; else, it assigns itself to the 'right' pool. Each pool performs this operation recursively. Note that the algorithm generates a tree of pivots. The depth of the tree determines the parallel runtime; its expected value is O(log n). 138

Parallelizing Quicksort: PRAM Formulation A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If pivot selection is optimal, then the height of the tree is θ(log n), which is also the number of iterations. 139

Parallelizing Quicksort: PRAM Formulation The execution of the PRAM algorithm on the array shown in (a). 140

End Thank you! 141