Chapter 2: Potential for Parallel Computation
Fundamentals of Parallel Processing
G. Alaghband

Main Topics
• Prefix Algorithms
• Speedup and Efficiency
• Amdahl's Law

Examples of Parallel Programming Design
• Sequential/Parallel Add
• Sum Prefix Algorithm
• Parameters of Parallel Algorithms
• Generalized Prefix Algorithm
• Divide and Conquer
• Upper/Lower Algorithm
• Size and Depth of Upper/Lower Algorithm
• Odd/Even Algorithm
• Size and Depth of Odd/Even Algorithm
• A Parallel Prefix Algorithm with Small Size and Depth
• Size and Depth Analysis

A Simple Algorithm: Adding Numbers

Assume a vector of numbers in V[1:N].

Sequential add:
    S := V[1];
    for i := 2 step 1 until N
        S := S + V[i];

[Figure: data dependence graph for sequential summation.]
[Figure: data dependence graph for parallel (pairwise) summation.]

A Slightly More Complicated Algorithm: Sum Prefix

    for i := 2 step 1 until N
        V[i] := V[i-1] + V[i];

Each resulting term V'[i] is the sum of all numbers in V[1:i], 1 ≤ i ≤ N.

[Figure: dependence graph for sequential prefix.]

Many applications:
• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms
• Assigning space in a farmers market
• Allocating memory to parallel threads
• Allocating memory buffers for communication channels

Questions to contemplate and try to answer as we study this section:
• Do sequential computations involve different amounts of parallelism?
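The two loops above can be sketched in runnable Python (an illustrative translation of the Algol-style pseudocode, not from the slides; 0-based indexing replaces the slides' V[1:N]):

```python
# Sequential add: S accumulates one element per step, N - 1 additions total.
def sequential_sum(v):
    s = v[0]
    for x in v[1:]:
        s = s + x
    return s

# Sequential sum prefix: after the loop, v[i] holds the sum of the first
# i + 1 elements (the slides' V'[i] = V[1] + ... + V[i]), computed in place.
def sum_prefix(v):
    v = list(v)
    for i in range(1, len(v)):
        v[i] = v[i - 1] + v[i]
    return v
```

Both loops perform N - 1 additions; the difference is that the prefix loop keeps every partial sum instead of only the last one.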
• What can be done in parallel in an arbitrary algorithm?

PARAMETERS OF PARALLEL ALGORITHMS

SIZE: the total number of operations.
DEPTH: the number of operations on the longest chain from any input to any output.

Examples:
• Sequential sum of N inputs: SIZE = N - 1, DEPTH = N - 1
• Parallel sum of N inputs (pairwise summation): SIZE = N - 1, DEPTH = log2 N
• Sequential sum prefix of N inputs: SIZE = N - 1, DEPTH = N - 1

The Generalized Prefix Problem

A simply stated problem having several different algorithms is the generalized prefix problem: given an associative operator ⊕ and N variables V1, V2, ..., VN, form the N results

    V1, V1⊕V2, V1⊕V2⊕V3, ..., V1⊕V2⊕V3⊕...⊕VN.

There are several different algorithms to solve this problem, each with different characteristics.

Divide and Conquer

A general technique for constructing non-trivial parallel algorithms is divide and conquer. The idea is to split a problem into two smaller problems whose solutions can be simply combined to solve the larger problem; the splitting continues recursively until the problems are so small that they are easy to solve.

In this case we split the prefix problem on V1, V2, ..., VN into two problems: prefix on V1, V2, ..., VN/2 and prefix on VN/2+1, VN/2+2, ..., VN. That is, we split the inputs to the prefix computation into a lower half and an upper half, and solve the problem separately on each half.

The Upper/Lower Construction

The solutions to the two half-size problems are combined by the construction shown in the figure. Recall that the ceiling of X, ⌈X⌉, is the least integer ≥ X, and the floor of X, ⌊X⌋, is the greatest integer ≤ X.
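The SIZE and DEPTH numbers quoted above for the pairwise parallel sum can be confirmed by counting operations and rounds in a short sketch (illustrative Python, not from the slides):

```python
# Pairwise (tree) summation: each round combines adjacent pairs, halving the
# number of partial sums. Counting additions gives SIZE; rounds give DEPTH.
def pairwise_sum(v):
    ops = rounds = 0
    while len(v) > 1:
        nxt = [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)]
        ops += len(v) // 2
        if len(v) % 2:          # an odd leftover element carries over unchanged
            nxt.append(v[-1])
        v = nxt
        rounds += 1
    return v[0], ops, rounds

total, size, depth = pairwise_sum(list(range(1, 17)))   # N = 16
# SIZE = N - 1 = 15 additions and DEPTH = log2 N = 4 rounds
```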
Recursively applying the Upper/Lower construction eventually results in prefix computations on no more than 2 inputs, which are trivial. For example, for 4 inputs we obtain Pul(4): N = 4, Size = 4, Depth = 2.

A larger example of the parallel prefix resulting from the recursive Upper/Lower construction is Pul(8): N = 8, Size = 12, Depth = 3.

Finally, Pul(16): N = 16, Size = 32, Depth = 4.

Having developed a way to produce a prefix algorithm that allows parallel operations, we should now characterize it in terms of its size and depth.

The depth of the algorithm is trivial to analyze. The construction must be repeated log2 N times to reduce everything to two inputs, and for each application of the construction, the path from the rightmost input to the rightmost output passes through one more operation. Therefore

    Depth = log2 N.

Analysis of the Size of Upper/Lower

Assume N is a power of 2 (the easiest case to analyze).

Theorem: Let s(N) = Size(Pul(N)). Then for N = 2^k a power of 2,

    s(2^k) = k·2^(k-1) = (N/2)·log2 N.

Proof (by induction on k): The initial condition for k = 1 is

    s(2^1) = Size(Pul(2)) = 1 = 1·2^(1-1).

Assume the result is true for k = i, i.e., s(2^i) = i·2^(i-1), and prove it for k = i + 1. The size of Pul(2^(i+1)) is related to s(2^i) by counting the operations in the recursive construction: two half-size prefixes, plus the 2^i operations that apply the last output of the lower half to every output of the upper half:

    s(2^(i+1)) = 2·s(2^i) + 2^i = 2(i·2^(i-1)) + 2^i = (i + 1)·2^i = (i + 1)·2^((i+1)-1).

Thus if the result holds for k = i, it holds for k = i + 1.
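The Upper/Lower construction and the operation count used in the proof can be checked with a small recursive sketch (illustrative Python, not from the slides; `pul` is a hypothetical helper name):

```python
# Upper/Lower prefix construction Pul(N): solve each half recursively, then
# add the last output of the lower half to every output of the upper half.
# Returns (prefix results, operation count).
def pul(v):
    n = len(v)
    if n == 1:
        return list(v), 0
    half = n // 2
    lo, c1 = pul(v[:half])
    hi, c2 = pul(v[half:])
    out = lo + [lo[-1] + x for x in hi]   # N/2 combining operations
    return out, c1 + c2 + len(hi)

res, size = pul(list(range(1, 17)))       # N = 16
# size = (N/2) * log2(N) = 8 * 4 = 32, matching the slide's Pul(16)
```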
The result follows for arbitrary integer k by finite induction, and k·2^(k-1) = (N/2)·log2 N since N = 2^k and k = log2 N.

If we have unlimited processors (arithmetic units) available, then the minimum-depth algorithm finishes soonest, and the Upper/Lower construction gives an algorithm with minimum depth. If the number of processors is limited, then we have to keep the size small: the Odd/Even algorithm.

The Odd/Even Construction, Poe(N)

A parallel prefix with larger depth but smaller size:
• Divide the inputs into sets with odd and even index values.
• Combine each odd input with the next higher even input.
• Do the parallel prefix on the new combined even inputs, then combine each even output with the next higher odd input at the output.
Recursive application of the odd/even construction continues until a prefix of 2 inputs is reached.

Examples: The odd/even construction for 4 inputs is presented first. Poe(4): N = 4, Size = 4, Depth = 2. Note: Poe(4) is a special case; it is equivalent to Pul(4). Notice that the longest path is only one more than in the 2-input case.

The odd/even construction for 8 inputs shows the recursive construction. Poe(8): N = 8, Size = 11, Depth = 4.

Size and Depth

The size and depth analysis of the Odd/Even algorithm is simple for N a power of 2. The size of the Odd/Even algorithm is less than the size of Upper/Lower, but its depth is greater (about twice).
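The Odd/Even construction described above can also be sketched recursively; counting its operations reproduces the Size = 11 quoted for Poe(8) (illustrative Python, not from the slides; `poe` is a hypothetical helper name):

```python
# Odd/Even prefix construction Poe(N): combine adjacent input pairs, do the
# prefix on the N/2 pair sums, then fill in the remaining outputs.
# Returns (prefix results, operation count). N is assumed a power of 2.
def poe(v):
    n = len(v)
    if n == 2:
        return [v[0], v[0] + v[1]], 1
    pairs = [v[i] + v[i + 1] for i in range(0, n, 2)]   # N/2 input combines
    pre, c = poe(pairs)                                  # prefix on the pairs
    out, extra = [v[0]], 0
    for i in range(1, n):
        if i % 2 == 1:
            out.append(pre[i // 2])            # pair prefix is already the answer
        else:
            out.append(pre[i // 2 - 1] + v[i])  # N/2 - 1 output combines
            extra += 1
    return out, c + n // 2 + extra

res, size = poe(list(range(1, 9)))    # N = 8
# size = 11, matching the slide's Poe(8); the count follows the recurrence
# s(N) = s(N/2) + N - 1
```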
The sequential algorithm is very deep; Odd/Even is about twice as deep as Upper/Lower, but both are much shallower than the sequential case. The size of the sequential algorithm is smallest; the size of Upper/Lower grows faster with N than the size of Odd/Even, and the size of Odd/Even is less than twice the size of the sequential algorithm.

It is possible to find a parallel prefix algorithm with minimum depth that also has a size proportional to N instead of N·log2 N.

A Parallel Algorithm with Small Depth and Size

Reference: Ladner, R. E. and Fischer, M. J., "Parallel Prefix Computation," JACM, vol. 27, no. 4, pp. 831-838, Oct. 1980.

By combining the two methods (Upper/Lower and Odd/Even), we can define a set of prefix algorithms Pj(N). For j ≥ 1, Pj(N) is defined by the Odd/Even construction using Pj-1(N/2), continuing until we get to P0. P0 is the subject of our interest!

P0(N) is defined differently, using the Upper/Lower construction with P1 and P0 of fewer inputs:
• P0(N/2) will do Upper/Lower with P1 and P0 of N/4.
• P1(N/2) will do Odd/Even with P0 of N/4.

To get P0(N), only P0 and P1 are needed. For example:
To get P0(16) we need P1(8) and P0(8).
To get P1(8) we need P0(4).
To get P0(8) we need P1(4) and P0(4).
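The expansion just listed (P0(16) needs P1(8) and P0(8), and so on) can be generated mechanically from the two construction rules; a sketch, assuming my reading of those rules (not code from the slides):

```python
# Which sub-networks does Pj(N) need? Per the construction rules above:
# for j >= 1, Pj(N) is built by Odd/Even from P(j-1)(N/2); P0(N) is built
# by Upper/Lower from P1(N/2) and P0(N/2). Recursion stops at 2 inputs.
def needed(j, n, seen=None):
    seen = set() if seen is None else seen
    if n <= 2:
        return seen
    subs = [(1, n // 2), (0, n // 2)] if j == 0 else [(j - 1, n // 2)]
    for sj, sn in subs:
        if (sj, sn) not in seen:
            seen.add((sj, sn))
            needed(sj, sn, seen)
    return seen

deps = needed(0, 16)
# Matches the slide's expansion: P1(8), P0(8), then P1(4) and P0(4)
```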
Note that P1(4) happens to be the same as P0(4).

Depth of P0

The depth of P0 can be seen by realizing that although P1 has depth log2 N + 1 for N inputs, it is applied only to half the inputs (N/2) at its first application. The longest path from any input to the highest-numbered output of P1(N/2) is only k - 1. This value is added to all outputs of P0(N/2), adding one more to the depth, giving k. Thus the longest path from any input to the highest-numbered output of P0(N) is k = log2 N: the depth grows by one for each doubling of N.

[Figure: the highlighted chain goes through only 3 operations.]

Size of P0

[Figure: size recurrences for P1 (the Odd/Even step) and P0 (the Upper/Lower step).]

Theorem: If N = 2^k, then

    S0(N) = 4N - F(2+k) - 2F(3+k) + 1
    S1(N) = 3N - F(1+k) - 2F(2+k),

where F(m) is the m-th Fibonacci number. Recall: F(0) = 0, F(1) = 1, and F(m) = F(m-1) + F(m-2) for m ≥ 2, giving 0, 1, 1, 2, 3, 5, 8, ...

Two problems remain in understanding Ladner and Fischer's P0.

1. What happens if N is not a power of 2? In this case there is an upper bound: for N ≥ 1, Sj(N) < 4N, so P0(N) is no more than a few times larger than the sequential algorithm.

2.
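The theorem's closed forms can be checked numerically against size recurrences for the two construction steps. The recurrences below (S1(N) = S0(N/2) + N - 1 for the Odd/Even step, S0(N) = S1(N/2) + S0(N/2) + N/2 for the Upper/Lower step) are my reading of the constructions, not formulas quoted from the slides:

```python
def fib(m):                       # F(0) = 0, F(1) = 1, F(m) = F(m-1) + F(m-2)
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a

def s0_closed(k):                 # S0(2^k) = 4N - F(k+2) - 2F(k+3) + 1
    n = 2 ** k
    return 4 * n - fib(k + 2) - 2 * fib(k + 3) + 1

def s1_closed(k):                 # S1(2^k) = 3N - F(k+1) - 2F(k+2)
    n = 2 ** k
    return 3 * n - fib(k + 1) - 2 * fib(k + 2)

def s0_rec(k):                    # Upper/Lower step; base: prefix of 2 = 1 op
    return 1 if k == 1 else s1_rec(k - 1) + s0_rec(k - 1) + 2 ** (k - 1)

def s1_rec(k):                    # Odd/Even step
    return 1 if k == 1 else s0_rec(k - 1) + 2 ** k - 1

ok = all(s0_closed(k) == s0_rec(k) and s1_closed(k) == s1_rec(k)
         for k in range(1, 12))
bound = all(s0_closed(k) < 4 * 2 ** k for k in range(1, 12))
```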
Some insight into the behavior of Fibonacci numbers would help, as most people have little intuition about how Fibonacci numbers behave. We can get this insight from the asymptotic formula for large m,

    F(m) ≈ φ^m / √5, where φ = (1 + √5)/2 ≈ 1.618,

which shows that the size of P0(N) is less than 4N.

Speedup and Efficiency
• Speedup and Efficiency of Parallel Algorithms
• Arithmetic Expression Evaluation
• Vector and Matrix Algorithms

Speedup and Efficiency of Algorithms

For any given computation (algorithm), let TP be the time to perform the computation with P processors (arithmetic units, or PEs). We assume that any P independent operations can be done simultaneously.

Note: the depth of an algorithm is T∞, the minimum execution time.

The speedup with P processors is SP = T1/TP, and the efficiency is EP = SP/P = T1/(P·TP).

These numbers, SP and EP, refer to an algorithm and not to a machine; similar numbers can be defined for specific hardware. The time T1 can be chosen in different ways: to evaluate how good an algorithm is, it should be the time for the BEST sequential algorithm.

The Minimum Number of Processors Giving the Maximum Speedup

Let P̂ be the minimum number of processors such that TP = T∞, i.e., P̂ = min { P | TP = T∞ }. Then TP̂, SP̂, and EP̂ are the best known time, speedup, and efficiency, respectively.

What are the T1 values for Upper/Lower and Odd/Even?
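With these definitions, the numbers for the pairwise parallel sum are easy to compute (an illustrative sketch, not from the slides, taking T1 = N - 1 for the best sequential sum and TP = log2 N with P = N/2 processors):

```python
from math import log2

# Speedup and efficiency of the pairwise parallel sum of N numbers.
def sum_speedup(n):
    t1 = n - 1          # best sequential algorithm: N - 1 additions
    tp = log2(n)        # depth of the pairwise summation tree
    p = n // 2          # processors needed for the first (widest) round
    sp = t1 / tp
    ep = sp / p
    return sp, ep

sp, ep = sum_speedup(1024)
# SP = 1023/10 = 102.3 but EP = SP/512, roughly 0.2: large speedup is bought
# with many mostly-idle processors
```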
If we have one processor, we should use the smaller sequential algorithm, so T1 = N - 1. This makes the efficiency of Upper/Lower

    EP = T1 / (P·TP) = (N - 1) / (P·log2 N),

which, with P = N/2 processors, is about 2/log2 N. Thus the efficiency of Upper/Lower really does decrease as the problem size grows.

Evaluation of Arithmetic Expressions

Most problems are not so simple that the best sequential algorithm is known, to say nothing of the best parallel algorithm. Arithmetic expression evaluation is a case in which general results are known.

Definition: An atom is a constant or variable appearing in an expression. Let E<N> be an expression in +, -, ×, /, (, ) having N atoms. The minimum time for sequential evaluation of any such expression is N - 1 steps. Parallel evaluation of E<N> also takes a minimum amount of time, as shown by the next lemma.

Lemma 1: For any number P of processors, the time to evaluate an expression in N atoms satisfies

    TP(E<N>) ≥ ⌈log2 N⌉.

Proof: The proof is based on the fact that all atoms are combined into one result. Since there is a single result and all operations are dyadic (they take 2 operands), there is only one result at the last step and no more than 2 intermediate results at the next-to-last step. In general, there are at most 2^l intermediate values l steps before the end. Since all N atoms must enter the computation, N ≤ 2^(number of steps), and the bound follows. QED

Expressions can be transformed by mathematical operations into more parallel forms. Using the associative, commutative, and distributive laws, we can reduce the height of an expression tree. Consider the expression:

    E<8> = A + B(CDE + F + G) + H
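Lemma 1's lower bound ⌈log2 N⌉ is achieved by a balanced association of the atoms; comparing it with left-to-right evaluation for N = 8 atoms (an illustrative sketch, not the slides' tree for E<8>):

```python
# An expression tree as nested pairs; depth = length of the longest
# operator chain from a leaf to the root.
def depth(t):
    return 0 if isinstance(t, str) else 1 + max(depth(t[0]), depth(t[1]))

atoms = list("ABCDEFGH")            # N = 8 atoms

# Left-to-right association (((...(A.B).C)...).H): depth N - 1 = 7.
linear = atoms[0]
for a in atoms[1:]:
    linear = (linear, a)

# Balanced association (pair up repeatedly): depth ceil(log2 N) = 3,
# meeting Lemma 1's lower bound. Assumes N is a power of 2.
level = atoms
while len(level) > 1:
    level = [(level[i], level[i + 1]) for i in range(0, len(level), 2)]
balanced = level[0]
```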
Including the distributive law, we can get an even smaller depth, but the number of operations will increase:

    E<8> = A + B(CDE + F + G) + H

Using associativity and commutativity, the evaluation time of an expression is bounded. In addition to the lower bound on the evaluation time of an arithmetic expression, we can also get an upper bound when associativity and commutativity are used to put the expression into the most parallel form possible.

Theorem 1: If E<N,d> is an arithmetic expression in N atoms with depth d of parenthesis nesting, then using commutativity and associativity only, E<N,d> can be transformed so that its evaluation time is bounded above by roughly log2 N plus a term proportional to the nesting depth d.

Reference: J. L. Baer and D. P. Bovet, "Compilation of Arithmetic Expressions for Parallel Computation," Proc. IFIP Congress 1968, North Holland, Amsterdam, pp. 340-346.

If distributivity is also used, the upper bound is independent of parenthesis nesting. But the size of the computation may increase, so the bound on P will also increase.

Theorem: An expression E<N> in N atoms can be transformed by associativity, commutativity, and distributivity so that its evaluation time is bounded above by a constant multiple of log2 N.

Note: The time bound applies only to the transformed expression; the transformation itself takes on the order of N·log2 N steps. For computations larger than single expressions, we must look at specific cases.

Non-Associativity of Floating-Point Accumulation
• It originates in the limited precision and range of the IEEE floating-point representation.
• Consider a hypothetical numerical representation with a precision of 32 "digits", showing two very large numbers added to a very small number; the large numbers are conceived to be at the extremes of the representation range.
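The non-associativity illustrated with the hypothetical 32-digit format shows up the same way in IEEE double precision; a minimal demonstration (not from the slides), using 2^53, the point where consecutive doubles are 2 apart:

```python
# IEEE double precision: above 2**53, consecutive doubles differ by 2, so
# adding 1.0 to 2**53 rounds back down and the small addend is lost.
a = 2.0 ** 53

left = (a + 1.0) + 1.0       # each 1.0 is absorbed by rounding
right = a + (1.0 + 1.0)      # 1.0 + 1.0 = 2.0 survives: exactly representable

# left == a while right == a + 2: (x + y) + z != x + (y + z) in general,
# which is why parallel (reordered) accumulation can change the result.
```

This is exactly the hazard for parallel prefix and pairwise summation: they reassociate the additions, so floating-point results can differ from the sequential order.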
[Figure: worked example of non-associative accumulation in the hypothetical 32-digit representation.]

Amdahl's Law

Let T(P) be the execution time with hardware parallelism P. Let S be the time spent doing the sequential part of the work, and let Q be the time to do the parallel part of the work sequentially; i.e., S and Q are the sequential and parallel amounts of work measured by time on one processor. The total time with P processors is

    T(P) = S + Q/P.

Expressing this in terms of the fraction of serial work, f = S/(S + Q), Amdahl's law states that

    Speedup:    SP = T(1)/T(P) = (S + Q)/(S + Q/P) = 1/(f + (1 - f)/P)
    Efficiency: EP = SP/P = 1/(fP + 1 - f)

Consequences:
• A very small amount of unparallelized code can have a very large effect on efficiency if the amount of parallelism is large.
• A fast vector processor must also have a fast scalar processor in order to achieve a sizeable fraction of its peak performance.
• Effort in parallelizing a small fraction of code that is currently executed sequentially may pay off in large performance gains.
• Hardware that allows even a small fraction of new things to be done in parallel may be considerably more efficient.

Although Amdahl's law is a simple performance model, it need not be taken simplistically. The behavior of the sequential fraction, f, for example, can be quite important. System sizes, especially the number P of processors, are often increased for the purpose of running larger problems. Increasing the problem size often does not increase the absolute amount of sequential work significantly. In this case, f is a decreasing function of problem size, and if problem size is increased with P, the somewhat pessimistic implications of the equations look much more favorable.
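The speedup and efficiency formulas above are easy to explore numerically; a sketch (not from the slides):

```python
# Amdahl's law: serial fraction f, P processors.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

def amdahl_efficiency(f, p):
    return amdahl_speedup(f, p) / p

s_perfect = amdahl_speedup(0.0, 64)         # 64.0: no serial bottleneck
s_limited = amdahl_speedup(0.05, 10 ** 6)   # with f = 5%, speedup saturates
                                            # near 1/f = 20 no matter how
                                            # many processors are added
```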
See Problem 2.16 for a specific example. The behavior of performance as both problem and system size increase is called scalability.