Evaluating Parallel Programs Cluster Computing, UNC-Charlotte, B. Wilkinson. 2a.1 Sequential execution time, ts: Estimate by counting computational steps of best sequential algorithm. Parallel execution time, tp: In addition to number of computational steps, tcomp, need to estimate communication overhead, tcomm: tp = tcomp + tcomm 2a.2 Computational Time Count number of computational steps. When more than one process executed simultaneously, count computational steps of most complex process. Generally, function of n and p, i.e. tcomp = f (n, p) Often break down computation time into parts. Then tcomp = tcomp1 + tcomp2 + tcomp3 + … Analysis usually done assuming that all processors are same and operating at same speed. 2a.3 Communication Time Many factors, including network structure. As a first approximation, use tcomm = tstartup + ntdata tstartup -- startup time, essentially time to send a message with no data. Assumed to be constant. tdata -- transmission time to send one data word, also assumed constant, and there are n data words. 2a.4 Idealized Communication Time Startup time Number of data items (n) The equation to compute the communication time ignore the fact that the source and destination may not be directly linked in a real system so that the message may pass through intermediate nodes. It is also assumed that the overhead incurred by including information other than data in the packet is constant and can be part of startup time. 2a.5 Final communication time, tcomm Summation of communication times of all sequential messages from one process, i.e. tcomm = tcomm1 + tcomm2 + tcomm3 + … Communication patterns of all processes assumed same and take place together so that only one process need be considered. Both tstartup and tdata, measured in units of one computational step, so that can add tcomp and tcomm together to obtain parallel execution time, tp. 2a.6 Communication Time of Broadcast/Gather • If broadcast is done through single shared wire for Ethernet, the time complexity is O(1) for single data item and O(w) if w data items. • If binary tree is used as the underlying network structure and 1-to-N fan-out broadcast is used, then what about communication cost for p final destinations (leaf nodes) using w messages? – We assume the left and right child will receive the message from their parent in a sequential way. However, at each level, different parent nodes will send out the message at the same time. 2a.7 1-to-N fan-out Broadcast • tcomm = 2 (log p) (tstartup + wtdata ) • It depends on number of levels and number of nodes at each level. • For a binary tree and p final destinations at the leave level. 2a.8 Benchmark Factors With ts, tcomp, and tcomm, can establish speedup factor and computation/communication ratio for a particular algorithm/implementation: Both functions of number of processors, p, and number of data elements, n. 2a.9 Factors give indication of scalability of parallel solution with increasing number of processors and problem size. Computation/communication ratio will highlight effect of communication with increasing problem size and system size. We wish to have dominant factor in computation instead of communication, as n increases, communication can be ignored and adding more processors can improve the performance. 2a.10 Example • Adding n numbers using two computers, each adding n/2 numbers each. • Numbers initially held in one computer. Computer 1 Computer 2 Send n/2 numbers Add up n/2 numbers tcomm = tstartup +(n/2)tdata tcomp = n/2 Send result back tcomm = tstartup + tdata Add partial sums tcomp = 1 2a.11 Overall tcomm = 2tstartup +(n/2 + 1)tdata = O(n) tcomp = n/2 + 1 = O(n) Computation/Communication ratio = O(1) 2a.12 Another problem Computation time complexity = O(n2) Communication time complexity = O(n) Computation/Communication ratio = O(n) 2a.13 Cost Cost = (execution time ) x (number of processors) Cost of sequential computation = ts Cost of parallel computation = tp x p Cost-optimal algorithm When parallel computation cost is proportional to sequential computation: Cost = tp x p = k x ts k is a constant 2a.14 Example Suppose ts = O(n log n) for the best sequential program where n = number of data item p = number of processors For cost optimality if tp = O(n log n) / p = O(n/p log n) Not cost optimal if tp = O(n^2/p ) A parallel algorithm is cost-optimal if parallel time complexity times the number of processors equals the sequential time complexity. 2a.15 Evaluating programs • Measuring the execution time • Time-complexity analysis gives an insight into the parallel algorithm and is useful in comparing different algorithms. We want to know how the algorithm actually performs in a real system. • We can measure the elapsed time between two points in the code in seconds. – System calls, such as clock(), time(), or gettimeofday() or MPI_Wtime() – Example: L1: time(&t1); . . L2: time(&t2); elapsed_time = difftime(t2, t1); 2a.16 Communication Time by the Ping-Pong Method • Point-to-point communication time of a specific system can be found using the ping-pong method. • One process p0 sends a message to another process, say p1. Immediately upon receiving the message, p1 sends the message back to p0. The time is divided by two to obtain an estimate of the time of one-way communication. For example, at p0: time(&t1); send(&x, p1); recv(&x, p1); time(&t2); elapsed_time = 0.5* difftime(t2, t1); 2a.17 Profilling • A profile of a program is a histogram or graph showing the time spent on different part of the program. Showing the number of times certain source code are executed. • It can help to identify certain hot spot places in a program visited many times during the execution. These places could be optimized first. 2a.18 Program Profile Histogram Statement number of region of program 2a.19