Algorithmic Frameworks Chapter 14 (starting with 14.2 - Parallel Algorithms) Parallel Algorithms: Models for parallel machines and parallel processing; Approaches to the design of parallel algorithms; Speedup and efficiency of parallel algorithms; A class of problems NC; Some parallel algorithms (not all in the textbook) 2 What is a Parallel Algorithm? Imagine you needed to find a lost child in the woods. Even in a small area, searching by yourself would be very time consuming. Now if you gathered some friends and family to help you, you could cover the woods much faster. 3 Parallel Architectures Single Instruction Stream, Multiple Data Stream (SIMD): one global control unit connected to each processor. Multiple Instruction Stream, Multiple Data Stream (MIMD): each processor has a local control unit. 4 Architecture (continued) Shared-Address-Space: each processor has access to main memory; processors may be given a small private memory for local variables. Message-Passing: each processor is given its own block of memory; processors communicate by passing messages directly instead of modifying memory locations. 5 Interconnection Networks Static: processors are hard-wired to one another; examples include completely connected, star-connected, and bounded-degree (degree 4) networks. Dynamic: processors are connected to a series of switches. 6 There are many different kinds of parallel machines – this is only one type. A parallel computer must be capable of working on one task even though many individual computers are tied together. Lucidor is a distributed-memory computer (a cluster) from HP. It consists of 74 HP servers and 16 HP workstations. Peak: 7.2 GFLOPs (a GFLOP is one billion floating-point operations per second). Source: http://www.pdc.kth.se/compresc/machines/lucidor.html 7 ASCI White (10 teraops/sec) • Third in ASCI series • IBM, delivered in 2000 8 Some Other Parallel Machines Blue Gene, Earth Simulator 9 Lemieux, Lightning 10 Connection Machines - SIMDs CM-2, CM-5 11 Example: Finding the Largest Key in an Array In order to find the largest key in an array of size n, at least n-1 comparisons must be done. A parallel version of this algorithm will still perform the same number of comparisons, but by doing them in parallel it will finish sooner. 12 Example: Finding the Largest Key in an Array Assume that n is a power of 2 and we have n/2 processors executing the algorithm in parallel. Each processor reads two array elements into local variables called first and second. It then writes the larger value into the first of the array slots that it has to read. It takes lg n steps for the largest key to be placed in S[1]. 13-15 [Figures: trace of the algorithm on the array 3 12 7 6 8 11 19 13 with processors P1–P4; each step halves the number of active processors, and after lg 8 = 3 steps the largest key, 19, is in S[1].] (A code sketch of this reduction appears after this group of slides.) 16 Approaches to the Design of Parallel Algorithms Modify an existing sequential algorithm, exploiting those parts of the algorithm that are naturally parallelizable. Design a completely new parallel algorithm that may have no natural sequential analog.
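To make the largest-key example (slides 12–15) concrete, here is a small C++ sketch that simulates the n/2-processor tournament sequentially; the inner loop stands in for the processors that would all run in the same step on the parallel machine. The function name and the pairing of elements half an array apart (rather than adjacent slots, as in the slides) are my own illustrative choices.

    #include <iostream>
    #include <vector>

    // Simulate the PRAM "largest key" tournament: in each round, processor i
    // compares S[i] with S[i + stride] and writes the larger value back into S[i].
    // After lg n rounds the maximum sits in S[0] (S[1] in the slides' 1-based indexing).
    int parallelMax(std::vector<int> S) {            // n is assumed to be a power of 2
        int n = (int)S.size();
        for (int stride = n / 2; stride >= 1; stride /= 2) {
            // each iteration of this inner loop is one of the n/2, n/4, ...
            // processors; on a real parallel machine they run in the same step
            for (int i = 0; i < stride; ++i)
                if (S[i + stride] > S[i]) S[i] = S[i + stride];
        }
        return S[0];
    }

    int main() {
        std::vector<int> S = {3, 12, 7, 6, 8, 11, 19, 13};   // the array from slide 13
        std::cout << parallelMax(S) << "\n";                 // prints 19 after lg 8 = 3 rounds
    }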
Brute-force methods for parallel processing: use an existing sequential algorithm, but with each processor using different initial conditions; use the compiler to optimize the sequential algorithm; use an advanced CPU to optimize the code. 17 Parallelism Some of the next slides are due to Rebecca Hartman-Baker at Oak Ridge National Labs. She permits their use for non-commercial, educational use only. 18 Parallelization Concepts When performing a task, some subtasks depend on one another, while others do not. Example: Preparing dinner. Salad prep is independent of lasagna baking; lasagna must be assembled before baking. Likewise, in solving problems, some tasks are independent of one another. 19 Serial vs. Parallel Serial: tasks must be performed in sequence. Parallel: tasks can be performed independently in any order. 20 Serial vs. Parallel: Example Example: Preparing dinner. Serial tasks: making sauce, assembling lasagna, baking lasagna; washing lettuce, cutting vegetables, assembling salad. Parallel tasks: making lasagna, making salad, setting table. 21 Serial vs. Parallel: Example Could have several chefs, each performing one parallel task. This is the concept behind parallel computing. 22 SPMD vs. MPMD SPMD: Write a single program that will perform the same operation on multiple sets of data. Multiple chefs baking many lasagnas; rendering different frames of a movie. MPMD: Write different programs to perform different operations on multiple sets of data. Multiple chefs preparing a four-course dinner; rendering different parts of a movie frame. Can also write a hybrid program in which some processes perform the same task. 23 Multilevel Parallelism There may be subtasks within a task that can be performed in parallel. Salad and lasagna making are parallel tasks; chopping the vegetables for the salad consists of parallelizable subtasks. We can use subgroups of processors (chefs) to perform parallelizable subtasks. 24 Discussion: Jigsaw Puzzle* Suppose we want to do a 5000-piece jigsaw puzzle. Time for one person to complete the puzzle: n hours. How can we decrease wall time to completion? * Thanks to Henry Neeman for example 25 Discuss: Jigsaw Puzzle Add another person at the table: effect on wall time, communication, resource contention. Add p people at the table: effect on wall time, communication, resource contention. 26 Discussion: Jigsaw Puzzle What about: p people, p tables, 5000/p pieces each? What about: one person works on river, one works on sky, one works on mountain, etc.? 27 Parallel Algorithm Design: PCAM Partition: Decompose problem into fine-grained tasks to maximize potential parallelism. Communication: Determine communication pattern among tasks. Agglomeration: Combine into coarser-grained tasks, if necessary, to reduce communication requirements or other costs. Mapping: Assign tasks to processors, subject to tradeoff between communication cost and concurrency. 28 Example: Computing Pi We want to compute π. One method: the method of darts.* The ratio of the area of the inscribed circle to the area of the square is proportional to π. 29 Method of Darts Imagine a dartboard with a circle of radius R inscribed in a square. Area of circle = πR². Area of square = (2R)² = 4R². Area of circle / Area of square = πR² / 4R² = π/4. 30 Method of Darts So, the ratio of the areas is π/4. How to find the areas? (A minimal code sketch of the dart-throwing simulation is given below; the next slides walk through the same idea step by step.)
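As a preview of the simulation described on the following slides, here is a minimal serial sketch of the method of darts in C++; the sample count, seed, and use of <random> are illustrative assumptions rather than anything prescribed by the slides.

    #include <cstdio>
    #include <random>

    // Method of darts: throw random points at the square [-R, R] x [-R, R] and
    // count how many land inside the inscribed circle of radius R.
    // The ratio of hits to throws approximates pi/4, so pi ~= 4 * hits / throws.
    int main() {
        const double R = 1.0;                 // any R works; only the ratio matters
        const long throws = 4000000;          // more darts -> better approximation
        std::mt19937_64 gen(12345);
        std::uniform_real_distribution<double> coord(-R, R);

        long hits = 0;
        for (long t = 0; t < throws; ++t) {
            double x = coord(gen), y = coord(gen);
            if (x * x + y * y <= R * R)       // the test x^2 + y^2 <= R^2 from slide 32
                ++hits;
        }
        std::printf("pi is approximately %f\n", 4.0 * hits / throws);
    }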
Suppose we threw darts (completely randomly) at the dartboard. Could count the number of darts landing in the circle and the total number of darts landing in the square. The ratio of these numbers gives an approximation to the ratio of the areas. The quality of the approximation increases with the number of darts: π ≈ 4 × (# darts inside circle) / (# darts thrown). 31 Method of Darts How in the world do we simulate this experiment on a computer? Decide on a length R. Generate pairs of random numbers (x, y) such that -R ≤ x, y ≤ R. If (x, y) is within the circle (i.e., if x² + y² ≤ R²), add one to the tally for inside the circle. Lastly, find the ratio. 32 500 Fastest Supercomputers http://www.top500.org/lists/2008/11 Some of these are MIMDs and some are SIMDs (and some we don't know for sure because of proprietary information, but timings often indicate which is which). The Top500 list is updated twice yearly (Nov and June) by researchers at the University of Mannheim, the University of Tennessee and the U.S. Energy Department's Lawrence Berkeley National Laboratory. It ranks computers by how many trillions of calculations per second, or teraflops, they can perform using linear-algebra calculations in a test called Linpack. 33 Parallel Programming Paradigms and Machines Two primary programming paradigms: SPMD (single program, multiple data) and MPMD (multiple programs, multiple data). There are no PRAM machines, but parallel machines are often classified into the two groups SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data), which clearly mimic the programming paradigms. A network of distributed computers all working on the same problem models a MIMD machine. 34 Speedup and Efficiency of Parallel Algorithms Let T*(n) be the time complexity of a sequential algorithm to solve a problem P of input size n. Let Tp(n) be the time complexity of a parallel algorithm that solves P on a parallel computer with p processors. Speedup: Sp(n) = T*(n) / Tp(n). <Claimed by some; disputed by others:> Sp(n) <= p. Best possible: Sp(n) = p, attained when Tp(n) = T*(n)/p. 35 Speedup and Efficiency of Parallel Algorithms Let T*(n) be the time complexity of a sequential algorithm to solve a problem P of input size n. Let Tp(n) be the time complexity of a parallel algorithm that solves P on a parallel computer with p processors. Efficiency: Ep(n) = T1(n) / (p · Tp(n)), where T1(n) is the time when the parallel algorithm runs on 1 processor. Best possible: Ep(n) = 1. 36 A Class of Problems NC The class NC consists of problems that can be solved by a parallel algorithm using a polynomially bounded number of processors p(n), i.e., p(n) ∈ O(n^k) for problem size n and some constant k, with the number of time steps bounded by a polynomial in the logarithm of the problem size n, i.e., T(n) ∈ O((log n)^m) for some constant m. Theorem: NC ⊆ P. 37 Computing the Complexity of Parallel Algorithms Unlike sequential algorithms, we do not have one kind of machine on which we must do our algorithm development and our complexity analysis. Moreover, the complexity is predominately a function of the input size. The following will need to be considered for parallel machines besides input size: What is the type of machine – SIMD?, MIMD?, others? How many processors? How can data be exchanged between processors – shared memory?, interconnection network? 38 SIMD Machines SIMD machines are coming in (again) as dual and quad processors. In the past, there have been several SIMD machines: the Connection Machine - 64K processors. ILLIAC was the name given to a series of supercomputers built at the University of Illinois at Urbana-Champaign.
In all, five computers were built in this series between 1951 and 1974. The design of the ILLIAC IV had 256 processors. Today: ClearSpeed: 192 high-performance processing elements, each with a dual 32- and 64-bit FPU and 6 Kbytes of high-bandwidth memory. 39 PRAMs A Simple Model for Parallel Processing Parallel Random Access Machine (PRAM) model: a number of processors all can access a large shared memory. All processors are synchronized. However, processors can run different programs. Each processor has a unique id, pid, and may be instructed to do different things depending on its pid (if pid < 200, do this, else do that). 41 The PRAM Model Parallel Random Access Machine: a theoretical model for parallel machines. p processors with uniform access to a large memory bank. UMA (uniform memory access) – equal memory access time for any processor to any address. 42 PRAM Models PRAM models vary according to how they handle write conflicts. The models differ in how fast they can solve various problems. Concurrent Read Exclusive Write (CREW): only one processor is allowed to write to one particular memory cell at any one step. Concurrent Read Concurrent Write (CRCW). An algorithm that works correctly for CREW will also work correctly for CRCW, but not vice versa. 43 Summary of Memory Protocols Exclusive-Read Exclusive-Write; Exclusive-Read Concurrent-Write; Concurrent-Read Exclusive-Write; Concurrent-Read Concurrent-Write. If concurrent write is allowed we must decide which value to accept. 44 Assumptions There is no limit on the number of processors in the machine. Any memory location is uniformly accessible from any processor. There is no limit on the amount of shared memory in the system. Resource contention is absent. The programs written on these machines can be written for MIMDs or SIMDs. 45 More Details on the PRAM Model Memory size is infinite, and the number of processors is unbounded. No direct communication between processors; they communicate via the memory. Every processor accesses any memory location in 1 cycle. Typically all processors execute the same algorithm in a synchronous fashion (READ phase, COMPUTE phase, WRITE phase), although each processor can run a different program. Some subset of the processors can stay idle (e.g., even-numbered processors may not work while odd processors do, and conversely). [Figure: processors P1, P2, P3, …, PN all connected to a shared memory.] 46 PRAM CW? What ends up being stored when multiple writes occur? Priority CW: processors are assigned priorities and the top-priority processor is the one that counts for each group write. Fail common CW: if values are not equal, no change. Collision common CW: if values are not equal, write a "failure value". Fail-safe common CW: if values are not equal, then the algorithm aborts. Random CW: non-deterministic choice of the value written. Combining CW: write the sum, average, max, min, etc. of the values. Etc.
The above means that when you write an algorithm for a CW PRAM you may assume any one of the above resolution rules, even different ones at different points in the algorithm. It doesn't correspond to any hardware in existence and is just a logical/algorithmic notion that could be implemented in software. In fact, most algorithms end up not needing CW. 47 PRAM Example 1 Problem: We have a linked list of length n. For each element i, compute its distance to the end of the list: d[i] = 0 if next[i] = NIL, d[i] = d[next[i]] + 1 otherwise. The sequential algorithm runs in O(n). We can define a PRAM algorithm in O(log n): associate one processor to each element of the list; at each iteration, split the list in two, with odd-placed and even-placed elements in different lists. The list size is divided by 2 at each step, hence O(log n). 48 PRAM Example 1 Principle: Look at the next element, add its d value to yours, and point to the next element's next element. [Figure: trace of d for a list of 6 elements — after initialization 1 1 1 1 1 0, then 2 2 2 2 1 0, then 4 4 3 2 1 0, and finally 5 4 3 2 1 0.] The size of each list is divided by 2 at each step, hence the O(log n) complexity. 49 PRAM Example 1 Algorithm: forall i: if next[i] == NIL then d[i] ← 0 else d[i] ← 1; while there is an i such that next[i] ≠ NIL: forall i: if next[i] ≠ NIL then { d[i] ← d[i] + d[next[i]]; next[i] ← next[next[i]] }. What about the correctness of this algorithm? 50 forall Loop At each step, the updates must be synchronized so that pointers point to the right things: next[i] ← next[next[i]]. This is ensured by the semantics of forall: "forall i: A[i] = B[i]" really means "forall i: tmp[i] = B[i]" followed by "forall i: A[i] = tmp[i]". Nobody really writes it out, but one mustn't forget that this is really what happens underneath. 51 while Condition "while there is an i such that next[i] ≠ NULL" — how can one do such a global test on a PRAM? It cannot be done in constant time unless the PRAM is CRCW: at the end of each step, each processor could write TRUE or FALSE to the same memory location depending on whether next[i] equals NULL, and one can then take the AND of all values (to resolve the concurrent writes). On a CREW PRAM, one needs O(log n) steps for a global test like the above. In this case, one can just rewrite the while loop as a for loop, because we have analyzed the way the iterations go: for step = 1 to log n. 52 What Type of PRAM? The previous algorithm does not require a CW machine, but tmp[i] ← d[i] + d[next[i]] requires concurrent reads by processors i and j such that j = next[i]. Solution: split it into two instructions: tmp2[i] ← d[i]; tmp[i] ← tmp2[i] + d[next[i]] (note that the above are technically in two different forall loops). Now we have an execution that works on an EREW PRAM, which is the most restrictive type. 53 Final Algorithm on an EREW PRAM forall i: if next[i] == NIL then d[i] ← 0 else d[i] ← 1 — O(1). for step = 1 to log n: forall i: if next[i] ≠ NIL then { tmp[i] ← d[i]; d[i] ← tmp[i] + d[next[i]]; next[i] ← next[next[i]] } — each forall step is O(1) and the loop runs O(log n) times, so the whole algorithm is O(log n). Conclusion: One can compute the distance of every element to the end of a list of size n in time O(log n) on any PRAM. 54 Are All PRAMs Equivalent? Consider the following problem: given an array of n elements e_i, i = 1..n, all distinct, find whether some element e is in the array. On a CREW PRAM, there is an algorithm that works in time O(1) on n processors: initialize a boolean to FALSE; each processor i reads e_i and e and compares them; if equal, it writes TRUE into the boolean (only one processor will write, so we're OK for CREW). 55
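Returning to PRAM Example 1 (slides 47–53), the following C++ sketch simulates the pointer-jumping algorithm sequentially; the temporary copies make the double buffering implied by the forall semantics explicit, and the while test plays the role of the global termination test discussed on slide 51. The NIL encoding and function name are mine.

    #include <iostream>
    #include <vector>

    const int NIL = -1;

    // List ranking by pointer jumping: d[i] becomes the distance from element i
    // to the end of the list.  Each while-iteration corresponds to one
    // synchronous PRAM step executed by all "processors" i at once.
    std::vector<int> listRank(std::vector<int> next) {
        int n = (int)next.size();
        std::vector<int> d(n);
        for (int i = 0; i < n; ++i)                      // forall i (initialization)
            d[i] = (next[i] == NIL) ? 0 : 1;

        bool active = true;
        while (active) {                                 // O(log n) iterations
            active = false;
            std::vector<int> newD = d, newNext = next;   // tmp copies: every forall reads old values
            for (int i = 0; i < n; ++i)                  // forall i
                if (next[i] != NIL) {
                    newD[i]    = d[i] + d[next[i]];
                    newNext[i] = next[next[i]];
                    active = true;
                }
            d = newD;  next = newNext;
        }
        return d;
    }

    int main() {
        std::vector<int> next = {1, 2, 3, 4, NIL};       // list 0 -> 1 -> 2 -> 3 -> 4
        for (int x : listRank(next)) std::cout << x << ' ';   // prints 4 3 2 1 0
        std::cout << '\n';
    }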
Are All PRAMs Equivalent? (cont.) On an EREW PRAM, one cannot do better than log n: each processor must read e separately. At worst, a complexity of O(n), with sequential reads; at best, a complexity of O(log n), with a series of "doublings" of the value at each step so that eventually everybody has a copy (just like a broadcast in a binary tree, or in fact a k-ary tree for some constant k). Generally, "diffusion of information" to n processors on an EREW PRAM takes O(log n). Conclusion: CREW PRAMs are more powerful than EREW PRAMs. 56 This is a Typical Question for Various Parallel Models Is model A more powerful than model B? Basically, you are asking if one can simulate the other. Whether or not the model maps to a "real" machine is another question of interest. Often a research group tries to build a machine that has the characteristics of a given model. 57 Simulation Theorem Simulation theorem: Any algorithm running on a CRCW PRAM with p processors cannot be more than O(log p) times faster than the best algorithm on an EREW PRAM with p processors for the same problem. Proof: We will "simulate" the concurrent writes. When Pi writes value xi to address li, one replaces the write by an (exclusive) write of (li, xi) to A[i], where A is some auxiliary array with one slot per processor. Then one sorts array A by the first component of its content. Processor i of the EREW PRAM looks at A[i] and A[i-1]; if their first components (the addresses) differ, or if i = 0, processor i performs the write recorded in A[i]. Since A is sorted according to the first component, the writing is exclusive. 58 Proof (continued) Picking one processor for each competing write: [Figure: six processors P0–P5 attempt concurrent writes of (address, value) pairs such as (8,12), (29,43), and (92,26); the pairs are stored in A and sorted by address, after which only the first processor in each run of equal addresses (here P0, P2, and P5) actually writes.] 59 Proof (continued) Note that we said that we just sort array A. If we have an algorithm that sorts p elements with O(p) processors in O(log p) time, we're set. It turns out there is such an algorithm: Cole's Algorithm, basically a merge sort in which lists are merged in constant time! It's beautiful, but we don't really have time for it, and it's rather complicated. Therefore, the proof is complete. 60 Brent's Theorem Theorem: Let A be an algorithm with m operations that runs in time t on some PRAM (with some number of processors). It is possible to simulate A in time O(t + m/p) on a PRAM of the same type with p processors. Example: maximum of n elements on an EREW PRAM. Clearly this can be done in O(log n) with O(n) processors: compute a series of pair-wise maxima; the first step requires n/2 processors. What happens if we have fewer processors? By the theorem, with p processors one can simulate the same algorithm in time O(log n + n/p). If p = n / log n, then we can simulate the same algorithm in O(log n + log n) = O(log n) time, which has the same complexity! This theorem is useful for bounding the number of processors needed to still achieve a given complexity.
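As a concrete reading of Brent's theorem applied to the maximum example above, this C++ sketch simulates p processors: each first scans a block of roughly n/p elements, and the p partial maxima are then combined pairwise in about log p rounds, for O(n/p + log p) simulated steps. The block partitioning and names are illustrative choices of my own.

    #include <algorithm>
    #include <iostream>
    #include <limits>
    #include <vector>

    // Maximum of n values with p (simulated) processors, Brent style:
    // phase 1: processor j scans its own block of about n/p elements  -> O(n/p) steps
    // phase 2: pairwise tournament over the p partial maxima           -> O(log p) steps
    int maxWithP(const std::vector<int>& a, int p) {
        int n = (int)a.size();
        std::vector<int> partial(p, std::numeric_limits<int>::min());
        int block = (n + p - 1) / p;
        for (int j = 0; j < p; ++j)                       // phase 1 (each j is a processor)
            for (int i = j * block; i < std::min(n, (j + 1) * block); ++i)
                partial[j] = std::max(partial[j], a[i]);
        for (int stride = 1; stride < p; stride *= 2)     // phase 2: about log p rounds
            for (int j = 0; j + stride < p; j += 2 * stride)
                partial[j] = std::max(partial[j], partial[j + stride]);
        return partial[0];
    }

    int main() {
        std::vector<int> a = {5, 17, 3, 42, 8, 29, 11, 2, 35, 7};
        std::cout << maxWithP(a, 4) << "\n";              // prints 42 using p = 4 "processors"
    }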
61 Example: Merge Sort [Figure: parallel merge sort of 8 keys — in the first round four processors (P1–P4) each merge a pair of keys, in the second round two processors merge the runs of length 2, and in the final round one processor merges the two runs of length 4 into the fully sorted list.] 62 Merge Sort Analysis Number of comparisons along the critical path: 1 + 3 + … + (2^i − 1) + … + (n − 1) = ∑ i=1..lg n (2^i − 1) = 2n − 2 − lg n = Θ(n). We have improved from n lg n to n simply by applying the old algorithm in parallel; by altering the algorithm we can further improve merge sort to Θ((lg n)²). 63 O(1) Sorting Algorithm We assume a CRCW PRAM where concurrent write is handled with addition. for(int i=1; i<=n; i++) { for(int j=1; j<=n; j++) { if (X[i] > X[j]) processor Pij stores 1 in memory location m_i; else processor Pij stores 0 in memory location m_i; } } (The two loops simply enumerate the n² processors Pij; on the CRCW PRAM they all execute in a single parallel step, and the concurrent writes to each m_i are combined by addition, so m_i ends up holding the number of keys smaller than X[i].) 64 O(1) Sorting Algorithm with n² Processors Unsorted list {4, 2, 1, 3}; note that each Pij is a different processor. P11(4,4) P12(4,2) P13(4,1) P14(4,3): 0+1+1+1 = 3, so 4 is in position 3 when sorted. P21(2,4) P22(2,2) P23(2,1) P24(2,3): 0+0+1+0 = 1, so 2 is in position 1 when sorted. P31(1,4) P32(1,2) P33(1,1) P34(1,3): 0+0+0+0 = 0, so 1 is in position 0 when sorted. P41(3,4) P42(3,2) P43(3,1) P44(3,3): 0+1+1+0 = 2, so 3 is in position 2 when sorted. 65 MIMDs Embarrassingly Parallel Problem In parallel computing, an embarrassingly parallel problem is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks. Embarrassingly parallel problems are ideally suited to distributed computing and are also easy to perform on server farms which do not have any of the special infrastructure used in a true supercomputer cluster. They are thus well suited to large, internet-based distributed platforms. 67 A common example of an embarrassingly parallel problem lies within graphics processing units (GPUs) for tasks such as 3D projection, where each pixel on the screen may be rendered independently. Some other examples of embarrassingly parallel problems include: the Mandelbrot set and other fractal calculations, where each point can be calculated independently; rendering of computer graphics (in ray tracing, each pixel may be rendered independently; in computer animation, each frame may be rendered independently). 68 Brute-force searches in cryptography. BLAST searches in bioinformatics. Large-scale face recognition that involves comparing thousands of input faces with a similarly large number of faces. Computer simulations comparing many independent scenarios, such as climate models. Genetic algorithms and other evolutionary computation metaheuristics. Ensemble calculations of numerical weather prediction. Event simulation and reconstruction in particle physics. 69 A Few Other Current Applications that Utilize Parallel Architectures - Some MIMDs and Some SIMDs Computer graphics processing. Video encoding. Accurate weather forecasting. Often in a dynamic programming algorithm a given row or diagonal can be computed simultaneously; this makes many dynamic programming algorithms amenable to parallel architectures. 70 Atmospheric Model Typically, these points are located on a rectangular latitude-longitude grid of size 15-30, in the range 50-500. A vector of values is maintained at each grid point, representing quantities such as pressure, temperature, wind velocity, and humidity. The three-dimensional grid is used to represent the state of the atmosphere. Values maintained at each grid point represent quantities such as pressure and temperature.
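The O(1) sorting algorithm on slides 63–65 assumes n² processors and combining (additive) concurrent writes; the C++ sketch below reproduces its effect sequentially, computing each key's rank and then placing it. On the CRCW PRAM all the comparisons happen in one step; here the two loops merely enumerate the processors. Names are mine, and distinct keys are assumed, as in the slide example.

    #include <iostream>
    #include <vector>

    // CRCW "enumeration" sort: processor P(i,j) writes 1 to m[i] when X[i] > X[j];
    // the combining write adds the 1s, so m[i] ends up being the number of keys
    // smaller than X[i], i.e. the final position of X[i] in the sorted output.
    std::vector<int> crcwSort(const std::vector<int>& X) {
        int n = (int)X.size();
        std::vector<int> m(n, 0), out(n);
        for (int i = 0; i < n; ++i)          // on the PRAM these two loops are
            for (int j = 0; j < n; ++j)      // n*n processors running concurrently
                if (X[i] > X[j]) m[i] += 1;  // combining concurrent write = sum
        for (int i = 0; i < n; ++i)
            out[m[i]] = X[i];                // exclusive writes: one processor per key
        return out;
    }

    int main() {
        for (int v : crcwSort({4, 2, 1, 3})) std::cout << v << ' ';   // prints 1 2 3 4
        std::cout << '\n';
    }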
71 Parallelization Concepts When performing task, some subtasks depend on one another, while others do not Example: Preparing dinner Salad prep independent of lasagna baking Lasagna must be assembled before baking Likewise, in solving problems, some tasks are independent of one another 72 Parallel Algorithm Design: PCAM Partition: Decompose problem into fine- grained tasks to maximize potential parallelism Communication: Determine communication pattern among tasks Agglomeration: Combine into coarser-grained tasks, if necessary, to reduce communication requirements or other costs Mapping: Assign tasks to processors, subject to tradeoff between communication cost and concurrency 73 Parallelization Strategies for MIMD Models What tasks are independent of each other? What tasks must be performed sequentially? Use PCAM parallel algorithm design strategy due to Ian Foster. PCAM = Partition, Communicate, Agglomerate, and Map. Note: Some of the next slides on this topic are adapted and modified from Randy H. Katz’s course. 74 Program Design Methodology for MIMDs 75 Ian Foster’s Scheme Partitioning. The computation that is to be performed and the data operated on by this computation are decomposed into small tasks. Practical issues such as the number of processors in the target computer are ignored, and attention is focused on recognizing opportunities for parallel execution. Communication. The communication required to coordinate task execution is determined, and appropriate communication structures and algorithms are defined. Agglomeration. The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are combined into larger tasks to improve performance or to reduce development costs.76 Mapping. Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs. Mapping can be specified statically or determined at runtime by load-balancing algorithms. Looking at these steps for the dart throwing problem … 77 Partition or Decompose “Decompose problem into fine-grained tasks to maximize potential parallelism” Finest grained task: throw of one dart Each throw independent of all others If we had huge computer, could assign one throw to each processor 78 Communication “Determine communication pattern among tasks” Each processor throws dart(s) then sends results back to manager process 79 Agglomeration “Combine into coarser-grained tasks, if necessary, to reduce communication requirements or other costs” To get good value of , must use millions of darts We don’t have millions of processors available Furthermore, communication between manager and millions of worker processors would be very expensive Solution: divide up number of dart throws evenly between processors, so each processor does a share of work 80 Mapping “Assign tasks to processors, subject to tradeoff between communication cost and concurrency” Assign role of “manager” to processor 0 Processor 0 will receive tallies from all the other processors, and will compute final value of Every processor, including manager, will perform equal share of dart throws 81 Program Partitioning The partitioning stage of a design is intended to expose opportunities for parallel execution. Thus, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem. 
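Tying together the PCAM walk-through of the dart-throwing problem (slides 77–81), here is a hedged C++ sketch that uses std::thread in place of separate processors: each worker performs an equal share of the throws and the main thread plays the manager, combining the tallies. The slides describe a message-passing manager/worker scheme; shared-memory threads, the worker count, and the seeds are stand-ins of my own.

    #include <cstdio>
    #include <random>
    #include <thread>
    #include <vector>

    // Each "worker processor" throws its share of darts and records its own tally;
    // the manager (the main thread) then sums the tallies and computes the estimate.
    long throwDarts(long throws, unsigned seed) {
        std::mt19937_64 gen(seed);
        std::uniform_real_distribution<double> coord(-1.0, 1.0);
        long hits = 0;
        for (long t = 0; t < throws; ++t) {
            double x = coord(gen), y = coord(gen);
            if (x * x + y * y <= 1.0) ++hits;
        }
        return hits;
    }

    int main() {
        const int workers = 4;                        // the "processors"
        const long totalThrows = 8000000;
        const long share = totalThrows / workers;     // agglomeration: equal share each

        std::vector<long> tally(workers, 0);
        std::vector<std::thread> pool;
        for (int w = 0; w < workers; ++w)             // mapping: one task per processor
            pool.emplace_back([&tally, w, share] { tally[w] = throwDarts(share, 1000 + w); });
        for (auto& t : pool) t.join();

        long hits = 0;                                // communication: workers -> manager
        for (long h : tally) hits += h;
        std::printf("pi is approximately %f\n", 4.0 * hits / totalThrows);
    }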
Just as fine sand is more easily poured than a pile of bricks, a fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms. In later design stages, evaluation of communication requirements, the target architecture, or software engineering issues may lead us to forego opportunities for parallel execution identified at this stage. We then revisit the original partition and agglomerate tasks to increase their size, or granularity. However, in this first stage we wish to avoid prejudging alternative partitioning strategies. 82 Good Partitioning A good partition divides into small pieces both the computation associated with a problem and the data on which this computation operates. When designing a partition, programmers most commonly first focus on the data associated with a problem, then determine an appropriate partition for the data, and finally work out how to associate computation with data. This partitioning technique is termed domain decomposition. The alternative approach---first decomposing the computation to be performed and then dealing with the data---is termed functional decomposition. These are complementary techniques which may be applied to different components of a single problem or even applied to the same problem to obtain alternative parallel algorithms. 83 In this first stage of a design, we seek to avoid replicating computation and data. That is, we seek to define tasks that partition both computation and data into disjoint sets. Like granularity, this is an aspect of the design that we may revisit later. It can be worthwhile replicating either computation or data if doing so allows us to reduce communication requirements. 84 Domain Decomposition Domain decompositions for a problem involving a three-dimensional grid. One-, two-, and three-dimensional decompositions are possible; in each case, data associated with a single task are shaded. A three-dimensional decomposition offers the greatest flexibility and is adopted in the early stages of a design. 85 Functional decomposition in a computer model of climate. Each model component can be thought of as a separate task, to be parallelized by domain decomposition. Arrows represent exchanges of data between components during computation: The atmosphere model generates wind velocity data that are used by the ocean model The ocean model generates sea surface temperature data that are used by the atmosphere model, and so on. 86 Communication The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design. We use a variety of examples to show how communication requirements are identified and how channel structures and communication operations are introduced to satisfy these requirements. For clarity in exposition, we categorize communication patterns along four loosely orthogonal axes: local/global, structured/unstructured, static/dynamic, and synchronous/asynchronous. 87 Types of Communication –Rather Obvious Definitions In local communication, each task communicates with a small set of other tasks (its ‘neighbors''); in contrast, global communication requires each task to communicate with many tasks. 
In structured communication, a task and its neighbors form a regular structure, such as a tree or grid; in contrast, unstructured communication networks may be arbitrary graphs. In static communication, the identity of communication partners does not change over time; in contrast, the identity of communication partners in dynamic communication structures may be determined by data computed at runtime and may be highly variable. In synchronous communication, producers and consumers execute in a coordinated fashion, with producer/consumer pairs cooperating in data transfer operations; in contrast, asynchronous communication may require that a consumer 88 obtain data without the cooperation of the producer. Local Communication • Task and channel structure for a two-dimensional finite difference computation with five-point update stencil. • In this simple fine-grained formulation, each task encapsulates a single element of a two-dimensional grid and must both send its value to four neighbors and receive values from four neighbors. • Only the channels used by the shaded task are shown. 89 Global Communication A centralized summation algorithm that uses a central manager task (S) to sum N numbers distributed among N tasks. Here, N=8 , and each of the 8 channels is labeled with the number of the step in which they are used. 90 Distributing Communication and Computation We first consider the problem of distributing a computation and communication associated with a summation. We can distribute the summation of the N numbers by making each task i , 0<i<N-1 , compute the sum: •A summation algorithm that connects N tasks in an array in order to sum N numbers distributed among these tasks. •Each channel is labeled with the number of the step in which it is used and the value that is communicated on it. 91 Divide and Conquer Tree structure for divide-and-conquer summation algorithm with N=8 . The N numbers located in the tasks at the bottom of the diagram are communicated to the tasks in the row immediately above; these each perform an addition and then forward the result to the next level. The complete sum is available at the root of the tree after log N steps. 92 Unstructured and Dynamic Communication Example of a problem requiring unstructured communication. In this finite element mesh generated for an assembly part, each vertex is a grid point. An edge connecting two vertices represents a data dependency that will require communication if the vertices are located in different tasks. Note that different vertices have varying numbers of neighbors. (Image courtesy of M. S. Shephard.) 93 Asynchronous Communication Using separate ``data tasks'' to service read and write requests on a distributed data structure. In this figure, four computation tasks (C) generate read and write requests to eight data items distributed among four data tasks (D). Solid lines represent requests; dashed lines represent replies. One compute task and one data task could be placed on each of four processors so as to distribute computation and data equitably. 94 Agglomeration In the third stage, agglomeration, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. 
We also determine whether it is worthwhile to replicate data and/or computation. 95 Examples of Agglomeration In (a), the size of tasks is increased by reducing the dimension of the decomposition from three to two. In (b), adjacent tasks are combined to yield a threedimensional decomposition of higher granularity. In (c), subtrees in a divide-andconquer structure are coalesced. In (d), nodes in a tree algorithm are combined. 96 Increasing Granularity In the partitioning phase of the design process, efforts are focused on defining as many tasks as possible. This forces us to consider a wide range of opportunities for parallel execution. Note that defining a large number of fine-grained tasks does not necessarily produce an efficient parallel algorithm. One critical issue influencing parallel performance is communication costs. On most parallel computers, computing must stop in order to send and receive messages. Thus, we can improve performance by reducing the amount of time spent communicating. 97 Increasing Granularity Clearly, performance improvement can be achieved by sending less data. Perhaps less obviously, it can also be achieved by using fewer messages, even if we send the same amount of data. This is because each communication incurs not only a cost proportional to the amount of data transferred but also a fixed startup cost. 98 The figure shows fine- and coarse-grained two-dimensional partitions of this problem. In each case, a single task is exploded to show its outgoing messages (dark shading) and incoming messages (light shading). In (a), a computation on an 8x8 grid is partitioned into 64 tasks, each responsible for a single point, while in (b) the same computation is partitioned into 2x2=4 tasks, each responsible for 16 points. In (a), 64x4=256 communications are required, 4 per task; these transfer a total of 256 data values. In (b), only 4x4=16 communications are required, and only 99 16x4=64 data values are transferred. Mapping In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling. In these computers, a set of tasks and associated communication requirements is a sufficient specification for a parallel algorithm; operating system or hardware mechanisms can be relied upon to schedule executable tasks to available processors. Unfortunately, general-purpose mapping mechanisms have yet to be developed for scalable parallel computers. In general, mapping remains a difficult problem that must be explicitly addressed when designing parallel algorithms. 100 Mapping The goal in developing mapping algorithms is normally to minimize total execution time. We use two strategies to achieve this goal: We place tasks that are able to execute concurrently on different processors, so as to enhance concurrency. We place tasks that communicate frequently on the same processor, so as to increase locality. The mapping problem is known to be NP - complete. However, considerable knowledge has been gained on specialized strategies and heuristics and the classes of problem for which they are effective. Next, we provide a rough classification of problems and present some representative techniques. 101 Mapping Example Mapping in a grid problem in which each task performs the same amount of computation and communicates only with its four neighbors. The heavy dashed lines delineate processor boundaries. 
The grid and associated computation is partitioned to give each processor the same amount of computation and to minimize off-processor communication. 102 Load-Balancing Algorithms A wide variety of both general-purpose and applicationspecific load-balancing techniques have been proposed for use in parallel algorithms based on domain decomposition techniques. We review several representative approaches here namely recursive bisection methods, local algorithms, probabilistic methods, and cyclic mappings. These techniques are all intended to agglomerate finegrained tasks defined in an initial partition to yield one coarse-grained task per processor. Alternatively, we can think of them as partitioning our computational domain to yield one sub-domain for each processor. For this reason, they are often referred to as partitioning algorithms. 103 Graph Partitioning Problem Load balancing in a grid problem. Variable numbers of grid points are placed on each processor so as to compensate for load imbalances. This sort of load distribution may arise if a local loadbalancing scheme is used in which tasks exchange load information with neighbors and transfer grid points when load imbalances are detected. 104 Task Scheduling Algorithms Manager/worker load-balancing structure. Workers repeatedly request and process problem descriptions; the manager maintains a pool of problem descriptions ( p) and responds to requests from workers. 105 SIMD, Associative, and Multi-Associative Computing SIMDs vs MIMDs SIMDs keep coming into and out of favor over the years. They are beginning to rise in interest today due to graphics cards and multi-processor core machines that use SIMD architectures. If anyone has taken an operating system course and learned about concurrency programming, they can program MIMSs. But, MIMDs spend a lot of time on communicating between processors, load balancing etc. Many people don't understand that SIMDs are really simple to program - but they need a different 107 mindset. SIMDs vs MIMDs Most SIMD programs are very similar to those for SISD machines, except for the way the data is stored. Many people, frankly, can't see how such simple programs can do so much. Example: The air traffic control problem and history. Associative computing uses a modified SIMD model to solve problems in a very straight forward manner. 108 Associative Computing References Note: Below KSU papers are available on the website: http://www.cs.kent.edu/~parallel/ (Click on the link to “papers”) 1. Maher Atwah, Johnnie Baker, and Selim Akl, An Associative Implementation of Classical Convex Hull Algorithms, Proc of the IASTED International 2. Conference on Parallel and Distributed Computing and Systems, 1996, 435-438 Johnnie Baker and Mingxian Jin, Simulation of Enhanced Meshes with MASC, a MSIMD Model, Proc. of the Eleventh IASTED International Conference on Parallel and Distributed Computing and Systems, Nov. 1999, 511-516. 109 Associative Computing References 3. Mingxian Jin, Johnnie Baker, and Kenneth Batcher, Timings for Associative Operations on the MASC Model, 4. 5. Proc. of the 15th International Parallel and Distributed Processing Symposium, (Workshop on Massively Parallel Processing, San Francisco, April 2001.) Jerry Potter, Johnnie Baker, Stephen Scott, Arvind Bansal, Chokchai Leangsuksun, and Chandra Asthagiri, An Associative Computing Paradigm, Special Issue on Associative Processing, IEEE Computer, 27(11):19-25, Nov. 1994. (Note: MASC is called ‘ASC’ in this article.) 
Jerry Potter, Associative Computing - A Programming Paradigm for Massively Parallel Computers, Plenum Publishing Company, 1992. 110 Associative Computers Associative Computer: A SIMD computer with a few additional features supported in hardware. These additional features can be supported (less efficiently) in traditional SIMDs in software. This model, however, gives a real sense of how problems can be solved with SIMD machines. The name “associative” is due to its ability to locate items in the memory of PEs by content rather than location. Traditionally, such memories were called associative memories. 111 Associative Models The ASC model (for ASsociative Computing) gives a list of the properties assumed for an associative computer. The MASC (for Multiple ASC) Model Supports multiple SIMD (or MSIMD) computation. Allows model to have more than one Instruction Stream (IS) The IS corresponds to the control unit of a SIMD. ASC is the MASC model with only one IS. The one IS version of the MASC model is sufficiently important to have its own name. 112 ASC & MASC are KSU Models Several professors and their graduate students at Kent State University have worked on models The STARAN and the ASPRO computers fully support the ASC model in hardware. The MPP computer supports ASC, partly in hardware and partly in software. Prof. Batcher was chief architect or consultant for STARAN and MPP. Dr. Potter developed a language for ASC Dr. Baker works on algorithms for models and architectures to support models Dr. Walker is working with a hardware design to support the ASC and MASC models. Dr. Batcher and Dr. Potter are currently not actively working on ASC/MASC models but still provide advice. 113 Motivation The STARAN Computer (Goodyear Aerospace, early 1970’s) and later the ASPRO provided an architectural model for associative computing embodied in the ASC model. ASC extends the data parallel programming style to a complete computational model. ASC provides a practical model that supports massive parallelism. MASC provides a hybrid data-parallel, control parallel model that supports associative programming. Descriptions of these models allow them to be compared to other parallel models 114 The ASC Model C E L L Cells Memory PE N E T W O R K IS Memory PE Memory PE 115 Basic Properties of ASC Instruction Stream The IS has a copy of the program and can broadcast instructions to cells in unit time Cell Properties Each cell consists of a PE and its local memory All cells listen to the IS A cell can be active, inactive, or idle Inactive cells listen but do not execute IS commands until reactivated Idle cells contain no essential data and are available for reassignment Active cells execute IS commands synchronously 116 Basic Properties of ASC Responder Processing The IS can detect if a data test is satisfied by any of its responder cells in constant time (i.e., any-responders property). The IS can select an arbitrary responder in constant time (i.e., pick-one property). 
117 Basic Properties of ASC Constant Time Global Operations (across PEs) Logical OR and AND of binary values. Maximum and minimum of numbers. Associative searches. Communications There are at least two real or virtual networks: the PE communications (or cell) network, and the IS broadcast/reduction network (which could be implemented as two separate networks). 118 Basic Properties of ASC The PE communications network is normally supported by an interconnection network, e.g., a 2D mesh. The broadcast/reduction network(s) are normally supported by a broadcast and a reduction network (sometimes combined). See the posted paper by Jin, Baker, & Batcher (listed in the associative references). Control Features The PEs, the IS, and the networks all operate synchronously, using the same clock. 119 Non-SIMD Properties of ASC Observation: The ASC properties that are unusual for SIMDs are the constant-time operations: constant-time responder processing (any-responders?, pick-one) and constant-time global operations (logical OR and AND of binary values, maximum and minimum value of numbers, associative searches). These timings are justified by implementations using a resolver in the paper by Jin, Baker, & Batcher (listed in the associative references and posted). 120 Typical Data Structure for ASC Model [Figure: a tabular data structure with fields Make, Color, Year, Model, Price, On-lot, and Busy-idle; each of the PEs PE1–PE7 holds one record (e.g., PE1 holds a red 1994 Dodge), and the IS is attached to all of the PEs.] Make, Color, etc. are fields the programmer establishes. Various data types are supported. Some examples will show string data, but strings are not supported in the ASC simulator. 121 The Associative Search [Figure: the same table of cars.] The IS asks for all cars that are red and on the lot. PE1 and PE7 respond by setting a mask bit in their PE. 122 [Figure: the MASC model — an array of memory/PE cells connected by a PE interconnection network, several Instruction Streams (ISs), and an IS network.] MASC Model Basic Components An array of cells, each consisting of a PE and its local memory. A PE interconnection network between the cells. One or more Instruction Streams (ISs). An IS network. MASC is an MSIMD model that supports both data and control parallelism and associative programming. 123 MASC Basic Properties Each cell can listen to only one IS. Cells can switch ISs in unit time, based on the results of a data test. Each IS and the cells listening to it follow the rules of the ASC model. Control Features: The PEs, ISs, and networks all operate synchronously, using the same clock. Restricted job control parallelism is used to coordinate the interaction of the multiple ISs. 124 Characteristics of Associative Programming Consistent use of the style of programming called data-parallel programming. Consistent use of global associative searching and responder processing. Usually, frequent use of the constant-time global reduction operations: AND, OR, MAX, MIN. Broadcast of data using the IS bus allows the use of the PE network to be restricted to parallel data movement.
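To illustrate the associative search and responder processing on the car table (slides 120–122), the following C++ sketch stores a few of the fields as parallel arrays (one slot per PE), sets mask bits with a content search, and then applies any-responder, pick-one, and a global minimum over the responders. It is a sequential simulation, not ASC code; the prices and any details not shown on the slide are invented for the example.

    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // One table entry per PE (PE0..PE3 here); each vector is one field of the table.
        std::vector<std::string> make  = {"Dodge", "Ford", "Ford",  "Subaru"};
        std::vector<std::string> color = {"red",   "blue", "white", "red"};
        std::vector<int>  price = {8000, 12000, 15000, 11000};   // invented for the example
        std::vector<bool> onLot = {true, true, false, true};
        const int PEs = 4;

        // Associative search: every PE tests its own record in a single (simulated)
        // step and records the result in its mask bit.
        std::vector<bool> mask(PEs, false);
        for (int pe = 0; pe < PEs; ++pe)
            mask[pe] = (color[pe] == "red" && onLot[pe]);

        // Responder processing and constant-time-style global operations on responders.
        bool anyResponder = false;
        int  pickedPE = -1, minPrice = -1;
        for (int pe = 0; pe < PEs; ++pe)
            if (mask[pe]) {
                anyResponder = true;
                if (pickedPE < 0) pickedPE = pe;                                // pick-one
                if (minPrice < 0 || price[pe] < minPrice) minPrice = price[pe]; // global MIN
            }

        std::cout << "any responder: " << anyResponder
                  << ", picked PE: " << pickedPE
                  << ", cheapest red car on the lot: " << minPrice << "\n";
    }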
125 Characteristics of Associative Programming Tabular representation of data – think 2D arrays. Use of searching instead of sorting. Use of searching instead of pointers. Use of searching instead of the ordering provided by linked lists, stacks, queues. Promotes a highly intuitive programming style that leads to high productivity. Uses structure codes (i.e., numeric representations) to represent data structures such as trees, graphs, embedded lists, and matrices. Examples of the above are given in the Nov. 1994 IEEE Computer article; also see the "Associative Computing" book by Potter. 126 Languages Designed for the ASC Professor Potter has created several languages for the ASC model. ASC is a C-like language designed for the ASC model. ACE is a higher-level language than ASC that uses natural-language syntax, e.g., plurals and pronouns. Anglish is an ACE variant that uses an English-like grammar (e.g., "their", "its"). An OOP version of ASC for the MASC was discussed (by Potter and his students), but never designed. Language references: ASC Primer – copy available on the parallel lab website www.cs.kent.edu/~parallel/; "Associative Computing" book by Potter – some features in this book were never fully implemented in the ASC compiler. 127 Algorithms and Programs Implemented in ASC A wide range of algorithms implemented in ASC without the use of the PE network: Graph algorithms (minimal spanning tree, shortest path, connected components). Computational geometry algorithms (convex hull algorithms – Jarvis March, Quickhull, Graham Scan, etc.; dynamic hull algorithms). 128 ASC Algorithms and Programs (not requiring PE network) String matching algorithms: all exact substring matches, all exact matches with "don't care" (i.e., wild card) characters. Algorithms for NP-complete problems: traveling salesperson, 2-D knapsack. Data base management software: associative data base, relational data base. 129 ASC Algorithms and Programs (not requiring a PE network) A two-pass compiler for ASC – this compiler uses ASC parallelism (first pass, optimization phase). Two rule-based inference engines for AI: an expert system OPS-5 interpreter and PPL (Parallel Production Language) interpreter. A context-sensitive language interpreter (OPS-5 variables force context sensitivity). An associative PROLOG interpreter. 130 Associative Algorithms & Programs (using a network) There are numerous associative programs that use a PE network: a 2-D knapsack ASC algorithm using a 1-D mesh; image processing algorithms using a 1-D mesh; FFT (Fast Fourier Transform) using 1-D nearest neighbor & Flip networks; matrix multiplication using a 1-D mesh; an air traffic control program (using a Flip network connecting PEs to memory), demonstrated using live data at Knoxville in the mid-70s. All but the first were developed in assembler at Goodyear Aerospace. 131 The MST Problem The MST problem assumes the weights are positive and the graph is connected, and seeks to find the minimal spanning tree, i.e. a subgraph that is a tree, that includes all nodes (i.e. it spans), and where the sum of the weights on the arcs of the subgraph is the smallest possible weight (i.e. it is minimal). Note: The solution may not be unique. 132 An Example [Figure: a weighted graph on the nodes A–I that serves as the running example for the next slides.] As we will see, the algorithm is simple. The ASC program is quite easy to write.
A SISD solution is a bit messy because of the data structures needed to hold the data for the problem. 133 An Example – Step 0 We will maintain three sets of nodes whose membership will change during the run. The first, V1, will be the nodes selected to be in the tree. The second, V2, will be candidates at the current step to be added to V1. The third, V3, will be the nodes not considered yet. 134 An Example – Step 0 V1 nodes will be shown in red, with their selected edges in red also. V2 nodes will be shown in light blue, with their candidate edges in light blue also. V3 nodes and edges will remain white. (Each of the following step slides shows the current coloring of the example graph.) 135 An Example – Step 1 Select an arbitrary node to place in V1, say A. Put into V2 all nodes incident with A. 136 An Example – Step 2 Choose the edge with the smallest weight and put its node, B, into V1. Mark that edge with red also. Retain the other edge-node combinations in the "to be considered" list. 137 An Example – Step 3 Add all the nodes incident to B to the "to be considered" list. However, note that AG has weight 3 and BG has weight 6, so there is no sense in including BG in the list. 138 An Example – Step 4 Take the light-blue node whose candidate edge has the smallest weight and add it to V1. Note that the nodes and edges in red are forming a subgraph which is a tree. 139 An Example – Step 5 Update the candidate nodes and edges by including all that are incident to the nodes in V1 (colored red). 140 An Example – Step 6 Select I, as its edge is minimal. Mark the node and edge as red. 141 An Example – Step 7 Add the new candidate edges. Note that IF has weight 5 while AF has weight 7; thus, we drop AF from consideration at this time. 142 An Example – after several more passes, C is added & we have … Note that when CH is added, GH is dropped as CH has less weight. Candidate edge BC is also dropped since it would form a back edge between two nodes already in the MST. When there are no more nodes to be considered, i.e. no more in V3, we obtain the final solution. 143 An Example – the final solution The subgraph is clearly a tree – no cycles and connected. The tree spans – i.e. all nodes are included. While not obvious, it can be shown that this algorithm always produces a minimal spanning tree. The algorithm is known as Prim's Algorithm for MST. 144 The ASC Program vs a SISD Solution in, say, C, C++, or Java First, think about how you would write the program in C or C++. The usual solution uses some way of maintaining the sets as lists using pointers or references. See the solutions to MST in the Algorithms texts by Baase listed in the posted references. In ASC, pointers and references are not even supported, as they are not needed and their use is likely to result in inefficient SIMD algorithms. The implementation of MST in ASC basically follows the outline that was provided for the problem, but first we need to learn something about the language ASC. 145 ASC-MST Algorithm Preliminaries Next, a "data structure" level presentation of Prim's algorithm for the MST is given.
The data structure used is illustrated in the next two slides. This example is from the Nov. 1994 IEEE Computer paper cited in the references. There are two types of variables for the ASC model, namely the parallel variables (i.e., the ones for the PEs) and the scalar variables (i.e., the ones used by the IS). Scalar variables are essentially global variables; each could be replaced with a parallel variable that has the scalar value stored in every entry. 146 ASC-MST Algorithm Preliminaries (cont.) In order to distinguish between them here, the parallel variable names end with a "$" symbol. Each step in this algorithm takes constant time. One MST edge is selected during each pass through the loop in this algorithm. Since a spanning tree has n-1 edges, the running time of this algorithm is O(n) and its cost is O(n²). Cost is defined as (running time) × (number of processors). Since the sequential running time of the Prim MST algorithm is O(n²) and is time optimal, this parallel implementation is cost optimal. 147 Graph Used for Data Structure [Figure 6 in [Potter, Baker, et al.]: a weighted graph on the nodes a–f; its edge weights (e.g., a-b = 2, a-c = 8, b-c = 7, b-d = 4, b-e = 3, c-e = 6, c-f = 9, d-e = 3) appear as the a$–f$ fields in the table on the next slide.] 148 Data Structure for MST Algorithm [Figure: the tabular data structure — one PE per node a–f; each PE holds that node's row of the weight matrix in the fields a$ … f$ (with ∞ for missing edges), plus the fields mask$, candidate$, parent$, and current_best$; the IS holds the scalar variables root and next-node.] 149 Algorithm: ASC-MST-PRIM Initially assign any node to root. All processors set candidate$ to "wait" and current_best$ to ∞; the candidate$ field for the root node is set to "no". All processors whose distance d from their node to the root node is finite do: set their candidate$ field to "yes"; set their parent$ field to root; set current_best$ = d. 150 Algorithm: ASC-MST-PRIM (cont. 2/3) While the candidate$ field of some processor is "yes": restrict the active processors to those whose candidate$ field is "yes" and (for these processors) do: compute the minimum value x of current_best$; restrict the active processors to those with current_best$ = x and do: pick an active processor, say node y; set the candidate$ value of node y to "no"; set the scalar variable next-node to y. 151 Algorithm: ASC-MST-PRIM (cont. 3/3) If the value z in the next_node column of a processor is less than its current_best$ value, then set current_best$ to z and set parent$ to next_node. For all processors, if candidate$ is "waiting" and the distance of its node from next_node y is finite, then set candidate$ to "yes", set current_best$ to the distance of its node from y, and set parent$ to y. 152 Algorithm: ASC-MST-PRIM(root) Initialize candidate$ to "waiting". If there are any finite values in root's field, set candidate$ to "yes" for them. Set parent$ to root. Set current_best$ to the values in root's field. Set root's candidate$ field to "no". Loop while some candidate$ contains "yes": for them, restrict mask$ to mindex(current_best$); set next_node to a node identified in the preceding step; set its candidate$ to "no"; if the value in their next_node field is less than current_best$, then set current_best$ to the value in the next_node field and set parent$ to next_node; if candidate$ is "waiting" and the value in its next_node field is finite, set candidate$ to "yes", set parent$ to next_node, and set current_best$ to the value in the next_node field. 153 Comments on ASC-MST Algorithm The three preceding slides are from Figure 6 in [Potter, Baker, et al., IEEE Computer, Nov. 1994]. The preceding slide gives a compact, data-structure-level pseudo-code description of this algorithm. The pseudo-code illustrates Potter's use of pronouns (e.g., them, its) and possessive nouns.
The mindex function returns the index of a processor holding the minimal value. This MST pseudo-code is much shorter and simpler than data-structure level sequential MST pseudocodes. 154 Tracing 1st Pass of MST Algorithm on Figure 6 [Figure: the tabular data structure of slide 149 as the first pass is traced on the graph of Figure 6 — one PE per node a–f with its row of edge weights (a$ … f$) and the mask$, candidate$, parent$, and current_best$ fields, together with the IS scalars root and next-node.] 155
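To make the data-parallel flavor of ASC-MST-PRIM concrete, here is a sequential C++ simulation of the tabular algorithm: candidate$, parent$, and current_best$ become ordinary arrays with one slot per PE, and each pass performs the same constant-time steps described above, selecting one MST edge per pass. The weight matrix follows the graph of Figure 6 as reconstructed from the table on slide 149; INF encodes the ∞ entries, and the variable names are mine, so treat this as an illustrative sketch rather than the ASC program itself.

    #include <iostream>
    #include <vector>

    const int INF = 1000000;                   // stands for "no edge" (the ∞ entries)

    int main() {
        // Weight matrix reconstructed from Figure 6 / slide 149 (nodes a..f = 0..5).
        std::vector<std::vector<int>> w = {
            {INF,   2,   8, INF, INF, INF},    // a
            {  2, INF,   7,   4,   3, INF},    // b
            {  8,   7, INF, INF,   6,   9},    // c
            {INF,   4, INF, INF,   3, INF},    // d
            {INF,   3,   6,   3, INF, INF},    // e
            {INF, INF,   9, INF, INF, INF}     // f
        };
        int n = (int)w.size(), root = 0;

        // One "PE" per node: the parallel variables of the ASC program.
        std::vector<int> parent(n, root), best(n);     // parent$, current_best$
        std::vector<char> cand(n, 'w');                // candidate$: w(ait), y(es), n(o)
        for (int i = 0; i < n; ++i) {
            best[i] = w[root][i];
            if (best[i] < INF) cand[i] = 'y';          // finite distance from root
        }
        cand[root] = 'n';

        for (int pass = 1; pass < n; ++pass) {         // n-1 passes, one MST edge each
            int next = -1;                             // mindex(current_best$) over cand == 'y'
            for (int i = 0; i < n; ++i)
                if (cand[i] == 'y' && (next < 0 || best[i] < best[next])) next = i;
            if (next < 0) break;
            cand[next] = 'n';
            std::cout << "edge (" << char('a' + parent[next]) << ","
                      << char('a' + next) << ") weight " << best[next] << "\n";

            for (int i = 0; i < n; ++i) {              // all PEs update in one parallel step
                if (cand[i] == 'y' && w[next][i] < best[i]) { best[i] = w[next][i]; parent[i] = next; }
                if (cand[i] == 'w' && w[next][i] < INF)     { cand[i] = 'y'; best[i] = w[next][i]; parent[i] = next; }
            }
        }
    }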