BSS 797: Principles of Parallel Computing
Lecture 5: Parallel Performance Measurement

Speedup

Let T(1, N) be the time required for the best serial algorithm to solve a problem of size N on 1 processor, and let T(P, N) be the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Speedup is defined as

    S(P, N) = T(1, N) / T(P, N)

Remarks:
1. Normally, S(P, N) <= P; ideally, S(P, N) = P; rarely, S(P, N) > P (superlinear speedup).
2. Linear speedup: S(P, N) = c P, where c is independent of N and P.
3. Algorithms with S(P, N) = c P are scalable.

Parallel efficiency

With T(1, N) and T(P, N) as above, parallel efficiency is defined as

    E(P, N) = T(1, N) / [P T(P, N)] = S(P, N) / P

Remarks:
1. Normally, E(P, N) <= 1; ideally, E(P, N) = 1; rarely, E(P, N) > 1. Commonly E(P, N) ~ 0.6, though this is problem-dependent.
2. Linear speedup corresponds to E(P, N) = c, where c is independent of N and P.
3. Algorithms with E(P, N) = c are scalable.

Load imbalance ratio I(P, N)

Suppose processor i spends time t_i doing useful work, and let

    t_max = max_i t_i,    t_avg = (1/P) \sum_{i=0}^{P-1} t_i .

The total time spent computing and communicating is \sum_{i=0}^{P-1} t_i, while the time the system is occupied (computing, communicating, or idling) is P t_max. The load imbalance ratio is

    I(P, N) = (P t_max - \sum_{i=0}^{P-1} t_i) / \sum_{i=0}^{P-1} t_i = t_max / t_avg - 1

Remarks:
1. t_avg * I(P, N) = t_max - t_avg is the per-processor wasted time.
2. If t_max = t_avg, then t_i = t_avg for every i, so I(P, N) = 0 and the load is fully balanced.
3. One slow processor (the one attaining t_max) can hold up the entire team; this is one reason the master-slave scheme is usually avoided.

Overhead h(P, N)

Overhead can be defined through E(P, N) = 1 / (1 + h(P, N)), so that

    h(P, N) = 1/E(P, N) - 1 = P/S(P, N) - 1

Remarks:
1. h(P, N) --> \infty as E(P, N) --> 0, and h(P, N) --> 0 as E(P, N) --> 1.
2. h(P, N) results from communication and load imbalance.

Amdahl's ``law''

Suppose a fraction f of an algorithm for a problem of size N is inherently serial and the remainder is perfectly parallel, and let T(1, N) = \tau. Then

    T(P, N) = f \tau + (1 - f) \tau / P,

so

    S(P, N) = 1 / (f + (1 - f)/P).

As P --> \infty, the speedup is bounded by 1/f: the maximum possible speedup is finite no matter how many processors are used. For example, a serial fraction f = 0.05 caps the speedup at 1/f = 20.

Granularity

The size of the subproblems allocated to individual processors is called the granularity of the decomposition.

Remarks:
1. Granularity is usually determined by the problem size N and the machine size P.
2. Decreasing granularity usually increases communication and decreases load imbalance.
3. Increasing granularity usually decreases communication and increases load imbalance.

Scalability

A scalable algorithm is one whose parallel efficiency remains bounded from below, i.e., E(P, N) >= E0 > 0, as the number of processors P --> \infty at fixed problem size.

A quasi-scalable algorithm is one whose parallel efficiency remains bounded from below, i.e., E(P, N) >= E0 > 0, for Pmin < P < Pmax at fixed problem size. The interval Pmin < P < Pmax is called the scaling zone.

Remarks:
1. Truly scalable algorithms are rare; quasi-scalable ones are common.
2. A quasi-scalable algorithm is usually regarded as scalable.
3. At fixed problem size N, E(P, N) decreases monotonically as P increases.
4. The goal is to maximize both the scaling zone Pmin < P < Pmax and the efficiency floor E0.

The guiding principle is to minimize overhead. In practice, we
1. minimize the communication-to-computation ratio;
2. minimize load imbalance;
3. maximize the scaling zone.

Short numerical sketches illustrating these definitions follow.
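First, a minimal sketch of the speedup and efficiency definitions, assuming wall-clock times have already been measured. The function names and all timing values are illustrative, not from the lecture.

    # Speedup and parallel efficiency from measured wall-clock times.
    # All timing numbers below are invented for illustration.

    def speedup(t_serial, t_parallel):
        # S(P, N) = T(1, N) / T(P, N)
        return t_serial / t_parallel

    def efficiency(t_serial, t_parallel, p):
        # E(P, N) = S(P, N) / P
        return speedup(t_serial, t_parallel) / p

    t1 = 120.0                               # T(1, N): best serial time, seconds
    timings = {2: 63.0, 4: 34.0, 8: 20.0}    # T(P, N) for several processor counts

    for p, tp in timings.items():
        print(f"P={p}: S={speedup(t1, tp):.2f}, E={efficiency(t1, tp, p):.2f}")

Here E falls from 0.95 at P = 2 to 0.75 at P = 8, the usual pattern of efficiency decaying as processors are added.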
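The load imbalance ratio can be computed directly from the per-processor useful-work times t_i. A sketch, with invented times for P = 4 processors:

    # Load imbalance ratio I(P, N) = t_max / t_avg - 1
    #                              = (P * t_max - sum(t_i)) / sum(t_i)

    def load_imbalance(times):
        t_max = max(times)
        t_avg = sum(times) / len(times)
        return t_max / t_avg - 1.0

    ti = [9.5, 10.0, 10.2, 14.0]    # useful-work time per processor (invented)
    t_avg = sum(ti) / len(ti)
    I = load_imbalance(ti)

    print(f"I(P, N) = {I:.3f}")
    print(f"per-processor wasted time = {t_avg * I:.3f}")   # equals t_max - t_avg

The single slow processor drives I well above zero, illustrating remark 3: one straggler determines t_max for the whole team.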
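Overhead follows from efficiency in one line; the sample E values below are arbitrary.

    # Overhead h(P, N) = 1/E(P, N) - 1;
    # it diverges as E --> 0 and vanishes as E --> 1.

    def overhead(e):
        return 1.0 / e - 1.0

    for e in (0.9, 0.6, 0.3):
        print(f"E={e}: h={overhead(e):.2f}")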
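Amdahl's law is easy to tabulate. The sketch below assumes a serial fraction f = 0.05 (chosen arbitrarily) and prints S(P, N) against its bound 1/f.

    # Amdahl's law: S(P, N) = 1 / (f + (1 - f)/P), bounded above by 1/f.

    def amdahl_speedup(f, p):
        return 1.0 / (f + (1.0 - f) / p)

    f = 0.05    # assumed inherently serial fraction (illustrative)
    for p in (1, 4, 16, 64, 1024):
        print(f"P={p:5d}: S={amdahl_speedup(f, p):.2f}")
    print(f"bound 1/f = {1.0 / f:.0f}")   # never exceeded, however large P grows

Even at P = 1024 the speedup is only about 19.6, already pressing against the bound of 20.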
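Finally, a sketch of how one might read a scaling zone off measured efficiencies: given E(P, N) at several processor counts and a floor E0, keep the counts whose efficiency stays at or above the floor. The measured values and the floor E0 = 0.5 are invented.

    # Extract the quasi-scalable range {P : E(P, N) >= E0} from measurements.

    def scaling_zone(eff_by_p, e0):
        return [p for p, e in sorted(eff_by_p.items()) if e >= e0]

    measured = {1: 1.00, 2: 0.95, 4: 0.85, 8: 0.72, 16: 0.55, 32: 0.35}
    print("scaling zone:", scaling_zone(measured, e0=0.5))   # [1, 2, 4, 8, 16]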