3 INTRODUCTION TO PARALLEL ALGORITHMS

3.1 Performance measures

There are a number of different ways to characterize the performance of both parallel computers and parallel algorithms. Usually, the peak performance of a machine is expressed in units of millions of instructions executed per second (MIPS) or millions of floating point operations executed per second (MFLOPS). In practice, however, the realizable performance is clearly a function of the match between the algorithms and the architecture.

3.1.1 Speed-up, efficiency, and redundancy

Perhaps the best known measure of the effectiveness of parallel algorithms is the speed-up ratio S_p with respect to the best serial algorithm: the ratio of the time required to solve a given problem by the best known serial method to the time required to solve the same problem by a parallel algorithm using p processors. Thus, if T(N) denotes the time required by the best known serial algorithm to solve a problem of size N, and T_p(N) denotes the time taken by a parallel algorithm using p processors to solve a problem of the same size, then

    S_p = T(N) / T_p(N).

Clearly, S_p is a function of N, the size of the problem, and p, the number of processors. For notational simplicity, we often do not denote this dependence explicitly. Another related measure is the processor efficiency E_p, which is the ratio of the speed-up to the number of processors:

    E_p = S_p(N) / p.

Not all serial algorithms admit speed-up through a parallel implementation, but every parallel algorithm can be implemented serially in a straightforward way. Thus, since a parallel algorithm using p > 1 processors should perform at least as fast as a serial algorithm, it is desired that S_p >= 1. In the above definition of speed-up, the parallel algorithm may be quite different from the serial algorithm.
It is conceivable that in many cases one might be interested in comparing the performance of the same algorithm on a serial machine and on a given parallel processor. In such cases, the speed-up with respect to a given algorithm is simply defined as

    S*_p = T_1(N) / T_p(N),

where T_i(N) is the time needed to execute the algorithm on i processors. Likewise, the processor efficiency in this case is

    E*_p = S*_p(N) / p.

Clearly, S*_p >= 1. Further, one step of a parallel algorithm using p processors takes at most p steps when implemented serially. Thus T_1(N) <= p * T_p(N). This in turn implies that 1 <= S*_p <= p and 0 < E*_p <= 1. Since T(N) <= T_1(N), it readily follows that S_p <= S*_p.

Any parallel algorithm that asymptotically attains linear speed-up, that is, S_p(N) = Theta(p), is said to have optimal speed-up. It is conceivable that an algorithm with optimal speed-up may have very low processor efficiency, and likewise an algorithm with good processor efficiency may not have large speed-up. In other words, a good serial algorithm may be a bad parallel algorithm, and a throwaway serial algorithm may end up being a reasonable parallel algorithm.

Another factor that reflects the performance of a parallel algorithm is the total number of scalar operations it requires. In the case of serial algorithms, the serial time complexity function itself denotes the number of scalar operations. But it is common practice to design parallel algorithms that introduce extra, redundant scalar computations to achieve speed-up. The redundancy factor R_p of a p-processor algorithm is defined as the ratio of the total scalar operations performed by the parallel algorithm to the total scalar operations of a serial algorithm. Clearly, R_p >= 1.

Consider the problem of adding 8 numbers x1, x2, ..., x8.
The serial algorithm to solve the problem has the following form:

1st step: y1 = x1 + x2
2nd step: y2 = y1 + x3
3rd step: y3 = y2 + x4
4th step: y4 = y3 + x5
5th step: y5 = y4 + x6
6th step: y6 = y5 + x7
7th step: y7 = y6 + x8

One of the following parallel algorithms can be used to solve the same problem.

1st parallel algorithm

1st parallel step: y1 = x1 + x2   y2 = x3 + x4   y3 = x5 + x6   y4 = x7 + x8
2nd parallel step: z1 = y1 + y2   z2 = y3 + y4
3rd parallel step: w1 = z1 + z2

2nd parallel algorithm

1st parallel step: y1 = x1 + x2   y2 = x3 + x4
2nd parallel step: y3 = x5 + x6   y4 = x7 + x8
3rd parallel step: z1 = y1 + y2   z2 = y3 + y4
4th parallel step: w1 = z1 + z2

Let us calculate the performance measures described above for these parallel algorithms. For the 1st parallel algorithm:

    p = 4, T = 7, T_4 = 3, S_4 = 7/3 ≈ 2.33, E_4 = 2.33/4 ≈ 0.58.

For the 2nd parallel algorithm:

    p = 4, T = 7, T_4 = 4, S_4 = 7/4 = 1.75, E_4 = 1.75/4 ≈ 0.44.

It is obvious that the 1st parallel algorithm is the better one. The number of parallel steps is called the height of a parallel algorithm; the maximal number of operations in a parallel step is called its weight. If there are several parallel algorithms, the one having minimal height is the best. Usually this algorithm has the maximal weight. To design the best parallel algorithm, in each step we try to find the maximal number of independent operations and then carry them out simultaneously.

3.2 Basic arithmetic operations

Parallelism in problem solving can be achieved at at least three different levels: the procedure level, the expression level, and the basic arithmetic operations level. At the first two levels, the time complexity function merely counts the number of basic (arithmetic or comparison) operations. For example, at these two levels, addition of two integers is taken as a unit operation, independent of the size of the integers.
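The fan-in scheme of the 1st parallel algorithm can be sketched in plain Python (no real parallelism; each iteration of the loop stands in for one parallel step in which all pairs are added simultaneously):

```python
def fan_in_sum(values):
    """Sum a non-empty list by pairwise fan-in, returning (total, height),
    where height is the number of parallel steps used."""
    height = 0
    while len(values) > 1:
        # all pairs in this level are independent, so on a parallel
        # machine they would be added in a single parallel step
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an odd leftover waits for the next level
            pairs.append(values[-1])
        values = pairs
        height += 1
    return values[0], height

# 8 numbers: serial time T = 7, parallel height T_4 = 3, so S_4 = 7/3
total, height = fan_in_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

For 8 inputs the function reports a height of 3, matching the 1st parallel algorithm above.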
But it is common knowledge that it takes more time to add two large integers than to add two small ones. Thus, for the overall speed-up, it is necessary to use the best possible algorithms for these basic arithmetic operations. Below we analyse the number of bit operations, such as AND and OR, needed to perform the basic arithmetic operations in parallel. Since this is the lowest level at which parallelism can be achieved, it is called bit-level or micro-level parallelism.

3.2.1 Analysis of parallel addition

Addition is the simplest of the basic arithmetic operations. Despite its simplicity, the development of an "optimal" parallel algorithm is far from simple.

Brent's algorithm

Let a = a_N a_{N-1} ... a_2 a_1 and b = b_N b_{N-1} ... b_2 b_1 be two integers in binary to be added, and let s = a + b, where s = s_N s_{N-1} ... s_2 s_1. It is well known that

    s_i = a_i XOR b_i XOR c_{i-1},

where c_0 = 0, and for i = 1 to N,

    c_i = g_i OR (p_i AND c_{i-1}),   p_i = a_i OR b_i,   g_i = a_i AND b_i,

where XOR, OR, and AND are the logical exclusive-OR, inclusive-OR, and AND operations, respectively. c_i is called the carry from the ith bit position; p_i and g_i are the carry propagate condition and the carry generate condition, respectively. Clearly, we need c_0 through c_{N-1} for computing s. Since c_0 = 0, we see that c_1 = g_1, c_2 = g_2 OR (p_2 AND g_1), and in general

    c_i = g_i OR (p_i AND (g_{i-1} OR (p_{i-1} AND (... (p_2 AND g_1) ...)))).

The propagate bits p_i and the generate bits g_i can all be computed in parallel in unit time using a linear array of N AND gates and N OR gates with two inputs. Once all the carry bits are known, the s_i can be computed in just two more steps. Thus, if T_A(N) is the time required to add two N-bit binary numbers and T_C(N-1) is the time required to compute the N-1 carry bits c_1 through c_N-1 (that is, the height of the carry computation), then T_A(N) = T_C(N-1) + 3. The carry bits can be computed in log N units of time, so T_A(N) = log N + 3.

3.2.2 Parallel multiplication

We first analyse the complexity of the grade-school algorithm for parallel multiplication of two numbers.
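The propagate/generate recurrence above can be sketched in plain Python. The carry loop is written serially here; on a parallel machine the carries are obtained by a parallel prefix computation in O(log N) steps:

```python
def add_with_carries(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first)
    using the propagate/generate carry recurrence."""
    n = len(a_bits)
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate: p_i = a_i OR b_i
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate:  g_i = a_i AND b_i
    c = [0] * (n + 1)                             # c[0] = 0
    for i in range(1, n + 1):                     # serial here; O(log N) in parallel
        c[i] = g[i - 1] | (p[i - 1] & c[i - 1])   # c_i = g_i OR (p_i AND c_{i-1})
    s = [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(n)]
    return s + [c[n]]                             # sum bits plus final carry
```

For example, adding 1101 (13) and 0110 (6), given LSB-first, yields the bits of 19.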
Grade-school algorithm

The following example illustrates this algorithm. Let a = 1101 and b = 1101. Then the shifted partial products are x1 = 0001101, x2 = 0000000, x3 = 0110100, x4 = 1101000. Clearly, x1 + x2 = 0001101 and x3 + x4 = 10011100 can be computed in parallel, and the final result is obtained in one more addition: s = (x1 + x2) + (x3 + x4) = 10101001.

The first step (forming the N shifted partial products) can be done in one unit of time. The associative fan-in algorithm takes O(log N) units of time to find the sum of two 2N-bit integers, and the fan-in tree over the N partial products has log N levels; therefore the second step takes O(log^2 N) units of time. So the overall complexity (height) of the grade-school algorithm for the product of two N-bit integers is O(log^2 N).

Ofman-Wallace algorithm

By cleverly exploiting the properties of the full adder, Ofman and Wallace independently developed a faster algorithm for multiplying two N-bit integers. The algorithm is based on the following result. Let x, y, and z be three N-bit integers. Then there exist two (N+1)-bit integers a = a_{N+1} a_N ... a_2 a_1 and b = b_{N+1} b_N ... b_2 b_1 such that x + y + z = a + b, where a_{N+1} = 0 and b_1 = 0. a and b can be obtained from x, y, and z in parallel in three steps using the formulae

    a_i = x_i XOR y_i XOR z_i,
    b_{i+1} = (x_i AND y_i) OR ((x_i XOR y_i) AND z_i),   for 1 <= i <= N.

Since the computations of a_i and b_{i+1} can be done in parallel for all i, a and b can be obtained in three steps. As an example, let x = 1011, y = 0111, and z = 1101. Then it can easily be verified that a = 00001 and b = 11110 (indeed, 11 + 7 + 13 = 1 + 30 = 31). Using this reduction, the product of two N-bit integers can be obtained in O(log N) steps.

Karatsuba-Ofman algorithm

We now describe an algorithm based on the divide-and-conquer strategy. Let a and b be N-bit integers. First, rewrite a and b as

    a = A1 * 2^(N/2) + A0,
    b = B1 * 2^(N/2) + B0,

where the A_i and B_i are N/2-bit integers: A0 and B0 are the remainders, and A1 and B1 the quotients, when a and b are divided by 2^(N/2). Define

    r0 = A0 * B0,
    r1 = (A1 + A0) * (B1 + B0),
    r2 = A1 * B1.

Then a * b = r2 * 2^N + (r1 - r2 - r0) * 2^(N/2) + r0.
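The divide-and-conquer recurrence above translates into a short recursive sketch (serial Python; in the parallel version the three sub-products r0, r1, r2 are computed concurrently):

```python
def karatsuba(a, b):
    """Multiply non-negative integers a and b by the Karatsuba-Ofman scheme."""
    if a < 16 or b < 16:                 # base case: small operands, multiply directly
        return a * b
    n = max(a.bit_length(), b.bit_length())
    half = n // 2
    a1, a0 = a >> half, a & ((1 << half) - 1)   # a = A1 * 2^half + A0
    b1, b0 = b >> half, b & ((1 << half) - 1)   # b = B1 * 2^half + B0
    r0 = karatsuba(a0, b0)
    r2 = karatsuba(a1, b1)
    r1 = karatsuba(a1 + a0, b1 + b0)     # only three recursive multiplications
    return (r2 << (2 * half)) + ((r1 - r2 - r0) << half) + r0
```

Note that only three multiplications are performed per level; the fourth product A1*B0 + A0*B1 is recovered as r1 - r2 - r0.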
Thus the product is obtained by performing three multiplications of (N/2)-bit integers (the recursive step) and a total of six additions/subtractions. Each addition/subtraction can be done in O(log N) steps, so the overall parallel complexity of the algorithm is O(log^2 N).

4 PRAM ALGORITHMS

4.1 A model of serial computation

The random access machine (RAM) is a model of a serial computer. A RAM consists of a memory, a read-only input tape, a write-only output tape, and a program. The program is not stored in memory and cannot be modified. The input tape contains a sequence of integers. Every time an input value is read, the input head advances one square; likewise, the output head advances after every write. Memory consists of an unbounded sequence of registers, designated r0, r1, r2, ... . Each register can hold a single integer. Register r0 is the accumulator, where computations are performed. The allowed RAM operations include load, store, read, write, add, subtract, multiply, divide, test, jump, and halt.

4.2 PRAM model of parallel computation

A PRAM consists of a control unit, a global memory, and an unbounded set of processors, each with its own private memory. Although active processors execute identical instructions, every processor has a unique index, and the value of a processor's index can be used to enable or disable the processor or to influence which memory location is accessed. A PRAM computation begins with the input stored in global memory and a single active processing element. During each step of the computation an active, enabled processor may read a value from a single private or global memory location, perform a single RAM operation, and write into one local or global memory location. Alternatively, during a computation step a processor may activate another processor. All active, enabled processors must execute the same instruction, though on different memory locations. The computation terminates when the last processor halts.
[Figure: the PRAM model — a control unit and processors P1, P2, ..., Pn, each with a private memory, connected through an interconnection network to a shared global memory.]

Various PRAM models differ in how they handle read or write conflicts, i.e., when two or more processors attempt to read from, or write to, the same global memory location. Most of the results in the research literature have been based upon one of the following models:

1. EREW (Exclusive Read Exclusive Write): read and write conflicts are not allowed.
2. CREW (Concurrent Read Exclusive Write): concurrent reads are allowed; multiple processors may read from the same global memory location during the same instruction step. Write conflicts are not allowed.
3. CRCW (Concurrent Read Concurrent Write): both concurrent reading and concurrent writing are allowed. A variety of CRCW models exist with different policies for handling concurrent writes to the same global address. We list three of them:
   a. COMMON. All processors concurrently writing into the same global address must be writing the same value.
   b. ARBITRARY. If multiple processors concurrently write to the same global address, one of the competing processors is arbitrarily chosen as the "winner," and its value is written into the location.
   c. PRIORITY. If multiple processors concurrently write to the same global address, the processor with the lowest index succeeds in writing its value into the memory location.

The EREW PRAM model is the weakest. Clearly a CREW PRAM can execute any EREW PRAM algorithm in the same amount of time; the concurrent read facility is simply not used. Similarly, a CRCW PRAM can execute any CREW PRAM algorithm in the same amount of time. The PRIORITY PRAM model is the strongest. Any algorithm designed for the COMMON PRAM model will execute with the same complexity on the ARBITRARY and PRIORITY PRAM models as well, for if all processors writing to the same location write the same value, choosing any one of them as the winner produces the same result.
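The three concurrent-write policies can be illustrated with a small plain-Python helper (a hypothetical simulation, not part of any PRAM formalism): each pending write is a (processor index, value) pair aimed at one global memory cell.

```python
def resolve_writes(pending, policy):
    """Resolve a list of concurrent (processor_index, value) writes to one
    global memory cell under the COMMON, ARBITRARY, or PRIORITY policy."""
    if not pending:
        return None
    if policy == "COMMON":
        values = {v for _, v in pending}
        assert len(values) == 1, "COMMON model: all writers must agree"
        return values.pop()
    if policy == "ARBITRARY":
        return pending[0][1]          # any competing value may win
    if policy == "PRIORITY":
        return min(pending)[1]        # lowest processor index wins
    raise ValueError(policy)

writes = [(3, 9), (1, 7), (2, 5)]
resolve_writes(writes, "PRIORITY")    # processor 1 wins with value 7
```

Under PRIORITY the result is deterministic; under ARBITRARY any of the competing values is acceptable, which is exactly why COMMON algorithms carry over unchanged.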
Likewise, if an algorithm executes correctly when an arbitrary processor is chosen as the "winner," the processor with the lowest index is as reasonable an alternative as any other. Hence any algorithm designed for the ARBITRARY PRAM model will execute with the same time complexity on the PRIORITY PRAM.

4.3 PRAM algorithms

PRAM algorithms have two phases. In the first phase a sufficient number of processors are activated, and in the second phase these activated processors perform the computation in parallel.

4.3.1 Parallel reduction on the EREW PRAM

Given a set of n values a1, a2, ..., an and an associative binary operator +, reduction is the process of computing a1 + a2 + ... + an. Parallel summation is an example of a reduction operation. [Figure: realization of the reduction algorithm for 4+3+8+2+9+1+0+5+6+3, giving the total 41.]

4.3.2 Prefix sums on the EREW PRAM

Given a set of n values a1, a2, ..., an and an associative binary operator +, the prefix sums problem is to compute the n quantities

    a1, a1 + a2, a1 + a2 + a3, ..., a1 + a2 + ... + an.

For example, given the operation + and the integers 3, 1, 0, 4, and 2, the prefix sums of the integers are 3, 4, 4, 8, 10. [Figure: realization of prefix sums on the EREW PRAM for 4, 3, 8, 2, 9, 1, 0, 5, 6, and 3, producing 4, 7, 15, 17, 26, 27, 27, 32, 38, 41.]

4.3.3 List ranking

Consider the problem of finding, for each of n elements on a linked list, the suffix sums of the last i elements on the list, where i = 1, ..., n. The suffix sums problem is a variant of the prefix sums problem, in which the array is replaced by a linked list and the sums are computed from the end rather than from the beginning. When the values are 0 and 1 (so each element's sum is its distance from the end of the list), the problem is called the list ranking problem. If we associate a processor with every list element and jump pointers in parallel, the distance to the end of the list is cut in half by the instruction next[i] := next[next[i]].
Hence a logarithmic number of pointer-jumping steps are sufficient to collapse the list so that every list element points to the last element. [Figure: pointer jumping over the list 4, 3, 8, 2, 9, 1, 0, 5, 6, 3.]

Preorder tree traversal

Sometimes it is appropriate to reduce a complicated-looking problem to a simpler one for which a fast parallel algorithm is already known. The problem of numbering the vertices of a rooted tree in preorder (depth-first search order) is a case in point. Note that a preorder tree traversal algorithm visits the nodes of the given tree according to the principle root-left-right. The algorithm works in the following way. In step one the algorithm constructs a singly-linked list. Each vertex of the singly-linked list corresponds to a downward or upward edge traversal of the tree. In step two the algorithm assigns weights to the vertices of the newly created singly-linked list. In a preorder traversal, a vertex is labelled as soon as it is encountered via a downward edge traversal; hence every list element corresponding to a downward edge gets the weight 1, meaning that the node count is incremented when this edge is traversed. List elements corresponding to upward edges get the weight 0, because the node count does not increase when the preorder traversal works its way back up the tree through previously labelled nodes. In the third step the algorithm computes, for each element of the singly-linked list, the rank of that list element. In step four the processors associated with downward edges use the ranks they have computed to assign preorder traversal numbers.

Merging two sorted lists

Many PRAM algorithms achieve low time complexity by performing more operations than an optimal RAM algorithm. The problem of merging two sorted lists is another example.
[Figure: preorder tree traversal of a tree with vertices A through H — (a) the tree; (b) the singly-linked list of downward and upward edge traversals; (c) the weights, 1 for downward edges and 0 for upward edges, and the resulting ranks; (d) the preorder numbers assigned to the vertices.]

The parallel merging algorithm assigns one processor to each list element, so altogether there are 2n processors, each responsible for a particular entry. Every processor finds the position of its own element in the other list using binary search. Because an element's index in its own list is already known, its place in the merged list is obtained by adding the two indices. All the elements can then be inserted into the merged list by their processors in constant time.

Graph coloring

Determining whether the vertices of a graph can be colored with c colors so that no two adjacent vertices are assigned the same color is called the graph coloring problem. To solve the problem quickly, we can create a processor for every possible coloring of the graph; each processor then checks whether the coloring it represents is valid. [Figure: a 3-vertex graph and its 2^3 = 8 candidate colorings 0,0,0 through 1,1,1; the candidate array is initialized to all 1s, and after checking only the valid colorings remain 1.]

Assume that the graph has n vertices. Given the n x n adjacency matrix and a positive constant c, a processor is created for every possible coloring of the graph. Each processor initially sets its value in the n-dimensional candidate array to 1. It then determines whether, for the particular assignment of colors to vertices it represents, two adjacent vertices have been given the same color: A[j,k] = 1 means that vertices j and k are adjacent, and i_j = i_k means that vertices j and k have the same color. If a processor detects an invalid coloring, it sets its value in the candidate array to 0.
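The merge-by-binary-search idea can be sketched in plain Python (the two loops stand in for the 2n processors working concurrently; for simplicity this sketch assumes no value appears in both lists):

```python
import bisect

def parallel_style_merge(a, b):
    """Merge two sorted lists with no duplicate values across the lists.
    Each element's final position is its own index plus the position found
    by binary search in the other list."""
    merged = [None] * (len(a) + len(b))
    for i, x in enumerate(a):                      # one "processor" per a-element
        merged[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):                      # one "processor" per b-element
        merged[j + bisect.bisect_left(a, y)] = y
    return merged

parallel_style_merge([1, 3, 5], [2, 4, 6])
```

Each element is placed independently, so on a CREW PRAM the whole merge takes only the O(log n) time of one binary search.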
After n^2 comparisons, if a processor's element in the candidate array is still 1, then the coloring it represents is valid. By summing over all c^n elements of the candidate array, it can be determined whether there exists a valid coloring. The CREW PRAM algorithm for graph coloring appears below.
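The CREW PRAM pseudocode itself is not reproduced in this excerpt; the following is a minimal serial sketch of the same brute-force idea, with one candidate entry per possible coloring (on the PRAM each entry has its own processor, and the final sum is a parallel reduction):

```python
from itertools import product

def has_valid_coloring(adj, c):
    """adj: n x n adjacency matrix as lists of 0/1; c: number of colors.
    Returns True if the graph has a valid c-coloring."""
    n = len(adj)
    candidate = {}
    for coloring in product(range(c), repeat=n):   # c**n candidate colorings
        ok = 1
        for j in range(n):                          # the n*n comparisons
            for k in range(n):
                if adj[j][k] == 1 and coloring[j] == coloring[k]:
                    ok = 0                          # adjacent vertices share a color
        candidate[coloring] = ok
    return sum(candidate.values()) > 0              # reduction over all entries

# a triangle is not 2-colorable but is 3-colorable
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

The c^n processors make this fast on a PRAM but, as with merging, at the cost of far more total work than a good serial algorithm performs.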