Lecture 4: Analytical Modeling of Parallel Programs
Parallel Computing, Fall 2008

Performance Metrics for Parallel Systems

Number of processing elements: p.

Execution time. The parallel runtime Tp is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. Ts denotes the serial runtime.

Total parallel overhead T0: the total time collectively spent by all the processing elements (pTp) minus the running time required by the fastest known sequential algorithm for solving the same problem on a single processing element (Ts), i.e. T0 = pTp − Ts.

Speedup S: the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processing elements: S = Ts(best)/Tp.
Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), so S = Θ(n/log n).

Theoretically, speedup can never exceed the number of processing elements p (S ≤ p).
Proof: Assume the speedup is greater than p. Then each processing element spends less than Ts/p time solving the problem, so a single processing element could emulate the p processing elements and solve the problem in fewer than Ts units of time. This contradicts the fact that speedup is, by definition, computed with respect to the best sequential algorithm.

Superlinear speedup: in practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by the serial algorithm is greater than that performed by its parallel formulation, or when hardware features put the serial implementation at a disadvantage.

Examples of Superlinear Speedup

Example 1: superlinear effects from caches. For a problem instance of size A and a 64 KB cache, the cache hit rate is 80%. Assuming a cache latency of 2 ns and a DRAM latency of 100 ns, the average memory access time is 2 × 0.8 + 100 × 0.2 = 21.6 ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS. With two processing elements, each works on a problem instance of size A/2 and, with the same 64 KB cache, the hit rate rises to 90%; of the remaining accesses, 8% are served from local DRAM and the other 2% from remote DRAM with a latency of 400 ns. The average memory access time is then 2 × 0.9 + 100 × 0.08 + 400 × 0.02 = 17.8 ns, giving an execution rate of 56.18 MFLOPS per processor and a total processing rate of 112.36 MFLOPS on two processors. The speedup is then 112.36/46.3 = 2.43, greater than 2.

Example 2: superlinear effects due to exploratory decomposition. Consider exploring the leaf nodes of an unstructured tree. Each leaf has a label associated with it, and the objective is to find a node with a specified label, say 'S'; the solution node is the rightmost leaf in the tree. A serial formulation based on depth-first traversal explores the entire tree, i.e. all 14 nodes, taking 14 units of time. In a parallel formulation, processing element 0 explores the left subtree while processing element 1 explores the right subtree. The total work done by the parallel algorithm is only 9 nodes and the corresponding parallel time is 5 units, so the speedup is 14/5 = 2.8.
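To make the arithmetic of Example 1 concrete, here is a minimal Python sketch reproducing the figures above. The helper names (access_time, mflops) are illustrative and not part of the lecture; it assumes, as stated, one FLOP per memory access and latencies in nanoseconds.

```python
# A minimal sketch of the Example 1 arithmetic (latencies in ns, one FLOP per access).

def access_time(mix):
    """Average memory access time: hit-rate-weighted sum of latencies."""
    return sum(fraction * latency for fraction, latency in mix)

def mflops(avg_ns):
    """Processing rate in MFLOPS, assuming one FLOP per memory access."""
    return 1e3 / avg_ns  # 1 FLOP every avg_ns ns  ->  1000/avg_ns MFLOPS

# One processor, problem size A: 80% cache hits (2 ns), 20% DRAM (100 ns).
serial_rate = mflops(access_time([(0.80, 2), (0.20, 100)]))                  # ~46.3 MFLOPS

# Two processors, size A/2 each: 90% cache, 8% local DRAM, 2% remote DRAM (400 ns).
per_processor = mflops(access_time([(0.90, 2), (0.08, 100), (0.02, 400)]))   # ~56.18 MFLOPS
parallel_rate = 2 * per_processor                                            # ~112.36 MFLOPS

print(f"speedup = {parallel_rate / serial_rate:.2f}")  # ~2.43 > 2: superlinear
```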
Performance Metrics for Parallel Systems (cont.)

Efficiency E: the ratio of speedup to the number of processing elements, E = S/p. It measures the fraction of time for which a processing element is usefully employed.
Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n), so E = Θ(1/log n).

Cost W (also called work or processor-time product): the product of the parallel runtime and the number of processing elements used, W = Tp × p.
Example: adding n numbers on n processing elements: W = Θ(n log n).
A parallel system is cost-optimal if the cost of solving a problem on the parallel computer has the same asymptotic growth (in Θ terms), as a function of the input size, as the fastest known sequential algorithm on a single processing element.

Problem size W2: the number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element; that is, W2 = Ts of the fastest known sequential algorithm for the problem.

Parallel vs. Sequential Computing: Amdahl's Law

Theorem 0.1 (Amdahl's Law). Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is
S ≤ 1/(f + (1 − f)/p).
Proof. Let T be the sequential running time of the computation. Then fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of at most (1 − f)T/p. The running time of the parallel program on p processors is thus the sum of the sequential and parallel components, fT + (1 − f)T/p. The maximum obtainable speedup is therefore
S ≤ T/(fT + (1 − f)T/p),
and the result follows.

Amdahl used this observation to advocate building ever more powerful sequential machines, arguing that one cannot gain much by using parallel machines. For example, if f = 10%, then S ≤ 10 as p → ∞. The underlying assumption in Amdahl's Law is that the sequential component is a constant fraction of the whole program. In many instances, however, the fraction of the computation that is inherently sequential decreases as the problem size increases. Moreover, in many cases even a speedup of 10 is quite significant by itself. In addition, Amdahl's Law is based on the premise that parallel computing always tries to minimize the parallel running time; in some cases a parallel computer is instead used to increase the problem size that can be solved in a fixed amount of time. In weather prediction, for example, this would increase the accuracy of, say, a three-day forecast, or would allow a more accurate five-day forecast.

Parallel vs. Sequential Computing: Gustafson's Law

Theorem 0.2 (Gustafson's Law). Let the execution time of a parallel algorithm on p processors consist of a sequential segment fT and a parallel segment (1 − f)T, where the sequential segment is constant. The scaled speedup of the algorithm is then
S = (fT + (1 − f)Tp)/(fT + (1 − f)T) = f + (1 − f)p.
For f = 0.05 and p = 20 this gives S = 19.05, whereas Amdahl's Law gives S ≤ 10.26.
[Figure: on p processors the running time is T = fT + (1 − f)T; the same work run on 1 processor would take fT + (1 − f)Tp = T(f + (1 − f)p).]

Amdahl's Law assumes that the problem size is fixed when it deals with scalability; Gustafson's Law assumes that the running time is fixed.
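The contrast between the two laws is easy to check numerically. The sketch below (function names are illustrative, not from the lecture) evaluates both formulas for the f = 0.05, p = 20 case quoted above.

```python
# A minimal sketch comparing Amdahl's and Gustafson's speedup formulas.

def amdahl_speedup(f, p):
    """Upper bound on speedup for a fixed problem size (Amdahl's Law)."""
    return 1.0 / (f + (1.0 - f) / p)

def gustafson_speedup(f, p):
    """Scaled speedup for a fixed running time (Gustafson's Law)."""
    return f + (1.0 - f) * p

f, p = 0.05, 20
print(f"Amdahl:    S <= {amdahl_speedup(f, p):.2f}")     # ~10.26
print(f"Gustafson: S  = {gustafson_speedup(f, p):.2f}")  # 19.05
```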
Brent's Scheduling Principle (Emulations)

Suppose we have an efficient parallel algorithm with unlimited parallelism, i.e. an algorithm that runs on zillions of processors. In practice, zillions of processors may not be available; suppose we have only p processors. A natural question is how to "run" the efficient zillion-processor algorithm on our limited machine. One answer is emulation: simulate the zillion-processor algorithm on the p-processor machine.

Theorem 0.3 (Brent's Principle). Let a parallel algorithm perform m operations in total and run in parallel time t. Then this algorithm can be run on a machine with only p processors in time at most m/p + t.
Proof: Let m_i be the number of computational operations performed at the i-th step, so that Σ_{i=1..t} m_i = m. If we assign the p processors on the i-th step to work on these m_i operations, they finish the step in time ⌈m_i/p⌉ ≤ m_i/p + 1. Thus the total running time on p processors is at most
Σ_{i=1..t} ⌈m_i/p⌉ ≤ Σ_{i=1..t} (m_i/p + 1) = m/p + t.
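As a small numerical sanity check of Brent's bound, the following sketch emulates a t-step, m-operation algorithm on p processors; the per-step operation counts are made up for illustration.

```python
# A minimal numerical check of Brent's bound: emulated time <= m/p + t.
import math
import random

def emulated_time(step_ops, p):
    """Steps are emulated one after another; step i costs ceil(m_i / p)."""
    return sum(math.ceil(m_i / p) for m_i in step_ops)

random.seed(0)
step_ops = [random.randint(1, 1000) for _ in range(50)]  # m_i for t = 50 steps (made up)
m, t, p = sum(step_ops), len(step_ops), 8

assert emulated_time(step_ops, p) <= m / p + t
print(f"emulated time = {emulated_time(step_ops, p)}, bound m/p + t = {m / p + t:.1f}")
```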