A General-Purpose Model for Heterogeneous Computation

by Tiffani L. Williams
B.S. Marquette University, 1994

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the School of Electrical Engineering and Computer Science in the College of Engineering and Computer Science at the University of Central Florida

Orlando, Florida
Fall Term 2000

Major Professor: Rebecca Parsons

© 2000 Tiffani L. Williams

Abstract

Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts, and communication over slow network links should be minimized. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs.

We propose the k-Heterogeneous Bulk Synchronous Parallel (HBSPk) model, an extension of the BSP model of parallel computation, as a framework for developing applications for heterogeneous parallel environments. The BSP model provides guidance on designing applications for good performance on homogeneous parallel machines. However, it is not appropriate for modeling heterogeneous computation since it assumes that all processors are identical and limits its view of the homogeneous communication network to one layer. The HBSPk model extends BSP hierarchically to address k-level heterogeneous parallel systems. Under HBSPk, improved performance results from exploiting the speeds of the underlying heterogeneous computing components.

Collective communication algorithms provide the foundation for our investigation of the HBSPk model. Efficient collective communication operations must be available for parallel programs to exhibit good performance on heterogeneous systems. We develop and analyze six collective communication algorithms (gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast) for the HBSPk model. Experimental results demonstrate the improved performance that results from effectively exploiting the heterogeneity of the underlying system. Moreover, the model predicts the performance trends of the collective routines. Improved performance is not a result of programmers having to account for myriad differences in a heterogeneous environment. By hiding the non-uniformity of the underlying system from the application developer, the HBSPk model offers a framework that encourages the design of heterogeneous parallel software.

To my mother.

Acknowledgments

First, I would like to thank my mother, who throughout my life has always been there to offer love, encouragement, and support. Secondly, I give thanks to my brother for showing me the meaning of perseverance. I am grateful to the members of my doctoral committee, who listened to this thesis in its various stages and reacted with patience and incisive suggestions. In particular, I would like to thank my academic advisor, Rebecca Parsons, for offering the time necessary to make my journey successful, and Narsingh Deo for sparking my interest in parallel computation.
The Florida Education Fund (FEF) provided the financial support that made this research a finished product. Through FEF, I have met three great friends, Keith Hunter, Dwayne Nelson, and Larry Davis, two of whom have a decent racquetball game. Without Keith's persistence, I would never have met Mrs. Jacqueline Smith, who from the very first day we spoke has been one of my greatest advocates. I would also like to acknowledge the "members" of the Evolutionary Computing Lab (Marc Smith, Paulius Micikevicious, Grace Yu, Jaren Johnston, Larry Davis, Yinn Wong, Lynda Vidot, Denver Williams, and Bill Allen) for the entertaining, yet scholarly discussions. Lastly, I give thanks to the unsung heroes who actually read this dissertation. Enjoy.

Table of Contents

List of Tables
List of Figures
1 Introduction
2 A Review of Models of Parallelism
2.1 Computational Models
2.1.1 PRAM
2.1.2 Network
2.1.3 Bridging
2.1.4 Summary
2.2 Programming Models
2.2.1 Data-Parallel
2.2.2 Message-Passing
2.2.3 Shared-Memory
2.2.4 Summary
2.3 Heterogeneous Computing
2.3.1 HCGM
2.3.2 Cluster-M
2.3.3 PVM
2.3.4 Summary
3 A Case for BSP
3.1 The BSP model
3.2 BSP Sorting
3.2.1 Experimental Approach
3.2.2 Randomized Sample Sort
3.2.3 Deterministic Sample Sort
3.2.4 Bitonic Sort
3.2.5 Radix Sort
3.3 Summary
4 HBSPk: A Generalization of BSP
4.1 Machine Representation
4.2 Cost Model
4.3 HBSPk Collective Communication Algorithms
4.3.1 Gather
4.3.2 Scatter
4.3.3 Reduction
4.3.4 Prefix Sums
4.3.5 One-to-All Broadcast
4.3.6 All-to-All Broadcast
4.3.7 Summary
5 HBSP1 Collective Communication Performance
5.1 The HBSP Programming Library
5.2 The HBSP1 Model
5.3 Experimental Setup
5.4 Application Performance
5.4.1 Randomized Sample Sort
5.5 Summary
6 Conclusions and Future Work
6.1 Contributions
6.2 Future Research
A Collective Communication Performance Data
List of References

List of Tables

3.1 BSP system parameters.
3.2 Algorithmic and model summaries using 16 processors on the SGI Challenge and the Intel Paragon. Predicted and actual running times are in µs/key. † For radix sort, the largest problem size that could be run on both machines was 4,194,304 keys.
4.1 Definitions of notations.
5.1 The functions that constitute the HBSPlib interface.
5.2 Specification of the nodes in our heterogeneous cluster. ‡ A 2-processor system, where each number is for a single CPU.
5.3 BYTEmark benchmark scores.
5.4 Cluster speed and synchronization costs.
5.5 r_j values.
5.6 Randomized sample sort performance. The factor of improvement is determined by Tu/Tb. The problem size ranges from 10^4 to 10^5 integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
(In Tables A.1 through A.14, the problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively, and Tb is the runtime for balanced workloads. Actual times are averages of 10 runs, and predicted times are model predictions, on clusters comprised of 2, 4, 6, 8, and 10 heterogeneous processors.)
A.1 Actual execution times (in seconds) for gather.
A.2 Predicted execution times (in seconds) for gather.
A.3 Actual execution times (in seconds) for scatter.
A.4 Predicted execution times (in seconds) for scatter.
A.5 Actual execution times (in seconds) for single-value reduction.
A.6 Predicted execution times (in seconds) for single-value reduction.
A.7 Actual execution times (in seconds) for point-wise reduction.
A.8 Predicted execution times (in seconds) for point-wise reduction.
A.9 Actual execution times (in seconds) for prefix sums.
A.10 Predicted execution times (in seconds) for prefix sums.
A.11 Actual execution times (in seconds) for one-to-all broadcast.
A.12 Predicted execution times (in seconds) for one-to-all broadcast.
A.13 Actual execution times (in seconds) for all-to-all broadcast, comparing two algorithms: simultaneous broadcast (SB) and intermediate destination (ID).
A.14 Predicted execution times (in seconds) for all-to-all broadcast, comparing two algorithms: simultaneous broadcast (SB) and intermediate destination (ID).

List of Figures

2.1 The PRAM model.
2.2 Bus, mesh, and hypercube networks.
2.3 The BSP model.
2.4 A superstep.
2.5 HPF data allocation model.
2.6 Messages sent without context are erroneously received.
3.1 Code fragment demonstrating BSMP.
3.2 Predicted and actual execution time per key of randomized sample sort on an SGI Challenge and an Intel Paragon.
3.3 Predicted and actual execution time per key of deterministic sample sort on an SGI Challenge and an Intel Paragon.
3.4 A schematic representation of a bitonic sorting network of size n = 8.
3.5 A bitonic sorting network of size n = 8.
3.6 Predicted and actual execution time per key of bitonic sort on an SGI Challenge and an Intel Paragon.
3.7 Global rank computation.
3.8 Predicted and actual execution time per key of radix sort on an SGI Challenge and an Intel Paragon.
4.1 An HBSP2 cluster.
4.2 Tree representation of the cluster shown in Figure 4.1.
4.3 An HBSP2 prefix sums computation.
4.4 HBSP2 prefix sums.
(In Figures 5.1 through 5.14, the problem size ranges from 100KB to 1000KB of integers, and each data point represents either the average of 10 runs (actual) or the model's prediction (predicted) on clusters comprised of 2, 4, 6, 8, and 10 heterogeneous processors. Where applicable, the improvement factor is (a) Ts/Tf for a slow versus fast root node and (b) Tu/Tb for unbalanced versus balanced workloads.)
5.1 Gather actual performance.
5.2 Gather predicted performance.
5.3 Scatter actual performance.
5.4 Scatter predicted performance.
5.5 Single-value reduction actual performance.
5.6 Single-value reduction predicted performance.
5.7 Point-wise reduction actual performance.
5.8 Point-wise reduction predicted performance.
5.9 Prefix sums actual performance.
5.10 Prefix sums predicted performance.
5.11 One-to-all broadcast actual performance.
5.12 One-to-all broadcast predicted performance.
5.13 All-to-all broadcast actual performance, comparing simultaneous broadcast (SB) and intermediate destination (ID).
5.14 All-to-all broadcast predicted performance, comparing simultaneous broadcast (SB) and intermediate destination (ID).
5.15 HBSP1 randomized sample sort.
6.1 Processor allocation of p = 16 matrix blocks.

CHAPTER 1
Introduction

Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications [KPS93, SDA97]. Such environments consist of a wide range of architecture types such as Pentium PCs, shared-memory multiprocessors, and high-performance workstations. Heterogeneous parallel systems offer considerably more computational power at a lower cost than traditional parallel machines. Additionally, heterogeneous systems provide users with the opportunity to reuse existing computer hardware and to combine different models of computation. Despite these advantages, application developers must contend with myriad differences (such as varying computer speeds, different network protocols, and incompatible data formats) in a heterogeneous environment. Current parallel programs are not written to handle such non-uniformity. Thus, a new approach is necessary to promote the development of efficient applications for heterogeneous parallel systems.

We propose the k-Heterogeneous Bulk Synchronous Parallel (HBSPk) model, an extension of the BSP model of parallel computation [Val90a], as a framework for the development of heterogeneous parallel applications. The BSP model provides guidance on designing applications for good performance on homogeneous parallel machines. Furthermore, experimental results have demonstrated the utility of the model, in terms of portability, efficiency, and predictability, on diverse parallel platforms for a wide variety of non-trivial applications [GLR99, KS99, WG98].
Since the BSP model assumes that all processors have equal computation and communication abilities, it is not appropriate for heterogeneous systems. Instead, it is only suitable for 1-level homogeneous architectures, which consist of some number of identical processors connected by a single communication network. The HBSPk model extends BSP hierarchically to address k-level heterogeneous parallel systems. (The terms level and layer are used interchangeably throughout the text.) Here, k represents the number of network layers present in the heterogeneous environment. Unlike BSP, the HBSPk model describes multiple heterogeneous parallel computers connected by some combination of internal buses, local-area networks, campus-area networks, and wide-area networks. As a result, it can guide the design of applications for traditional parallel systems, heterogeneous or homogeneous clusters [Buy99a, Buy99b], the Internet, and computational grids [FK98]. Furthermore, HBSPk incorporates parameters that reflect the relative computational and communication speeds at each of the k levels.

Performance gains in heterogeneous environments result from effectively exploiting the speeds of the underlying components. As in homogeneous architectures, good algorithmic performance in heterogeneous environments is the result of balanced machine loads. Executing standard parallel algorithms on heterogeneous platforms leads to the slowest processor becoming a bottleneck, which reduces overall system performance. Computation and communication should be minimized on slower processors. On the other hand, faster processors should be used as often as possible. The HBSPk cost model guides the programmer in balancing these objectives to produce efficient heterogeneous programs.

It is imperative that a unifying model for heterogeneous computation emerge to avoid the software development problems that traditional parallel computing currently faces. Frequently, high-performance algorithms and system software are obtained by exploiting architectural features such as the number of processors, memory organization, and communication latency of the underlying parallel machine. However, designing software to accommodate the specifics of one machine often results in inadequate performance on other machines. Hence, the goal of parallel computing is to produce architecture-independent software that takes advantage of a parallel machine's salient characteristics. The HBSPk model seeks to be a general-purpose model for heterogeneous computation.

Most parallel algorithms require processors to exchange data. There are a few common basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel algorithms. Proper implementation of these collective communication operations is vital to the efficient execution of the parallel algorithms that use them. Collective communication for homogeneous parallel environments has been thoroughly researched over the years [BBC94, BGP94, MR95]. Collective operations designed for traditional parallel machines are not adequate for heterogeneous environments. As a result, we design and analyze six collective communication algorithms (gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast) for heterogeneous parallel systems. The intent is not to obtain the best possible algorithms, but rather to point to the potential advantages of using the HBSPk model.
Afterwards, we present a randomized sample sort algorithm based on our HBSPk collective communication operations. We test the effectiveness of our collective operations on a non-dedicated, heterogeneous network of workstations. HBSPlib, a library based on BSPlib [HMS98], provides the foundation for HBSP1 programming. Experimental results demonstrate that our collective algorithms have increased performance on heterogeneous platforms. The experiments also validate that randomized sample sort benefits from using the HBSPk collective communication algorithms. Moreover, the model accurately predicts the performance trends of the communication algorithms. Improved performance is not a result of programmers having to account for differences in a heterogeneous environment. By hiding the non-uniformity of the underlying system from the application developer, the HBSPk model offers a framework that encourages the design of heterogeneous parallel software in an architecture-independent manner.

The ultimate goal of this work is to provide a unifying framework that makes parallel computing a viable option for heterogeneous platforms. As heterogeneous parallel systems seem likely to be the platform of choice in the foreseeable future, we propose the HBSPk model and seek to demonstrate that it can provide a simple programming approach, portable applications, efficient performance, and predictable execution. Our results fall into four categories:

- Model development for heterogeneous computing systems.
- Infrastructure to support HBSPk programming and analysis.
- HBSPk application programming.
- Experimentation examining the effectiveness of deriving portable, efficient, predictable, and scalable algorithms through the formalisms of the model.

The rest of the thesis addresses each of the above contributions. Chapter 2 provides a review of various parallel computational models. Of the models considered, we believe that BSP provides a fundamentally sound approach to parallel programming. Chapter 3 evaluates the utility of BSP in developing efficient sorting applications. The HBSPk model and its associated collective communication algorithms are presented in Chapter 4. The merits of HBSPk are experimentally investigated in Chapter 5. Conclusions and directions for future work are given in Chapter 6.

CHAPTER 2
A Review of Models of Parallelism

The success of sequential computing can be attributed to the Random-Access Machine (RAM) model [CR73], which provides a single, general model of serial computation. The model is accurate for the vast majority of programs. There are a few cases, such as programs that perform extreme amounts of disk I/O, where the model does not accurately reflect program execution. Due to its generality and stability, the RAM model continually supports advancements made in sequential programming. Moreover, these concentrated efforts have allowed the development of software-engineering techniques, algorithmic paradigms, and a robust complexity theory.

Unfortunately, parallel computing has not enjoyed success similar to its sequential counterpart. Parallel computers have made a tremendous impact on the performance of large-scale scientific and engineering applications such as weather forecasting, earthquake prediction, and seismic data analysis, but the effective design and implementation of algorithms for them remains problematic.
Frequently, high-performance algorithms and system software are obtained by exploiting architectural features such as the number of processors, memory organization, and communication latency of the underlying machine. Designing software to accommodate the specifics of one machine often results in inadequate performance on other machines. Thus, the goal of parallel computing is to produce architecture-independent software that takes advantage of a parallel machine's salient characteristics.

Without a universal model of parallel computation, there is no foundation for the development of portable and efficient parallel applications. As a result, numerous models have been developed that attempt to model algorithm execution accurately on existing parallel machines. However, we narrow our focus to two of the most popular approaches. One method is the development of a computational model, an abstraction of a computing machine that guides the high-level design of parallel algorithms as well as provides an estimation of performance. The other approach is to develop a programming model, a set of language constructs that can be used to express an algorithmic concept in a programming language. For example, the programming languages Pascal and C are designed within the imperative model, which consists of constructs such as arrays, control structures, procedures, and recursion.

2.1 Computational Models

A computational model is an abstraction of a computing machine that guides the high-level design of parallel algorithms as well as provides an estimation of performance. In the following subsections, we focus on three classes of computational models (PRAM models, network models, and bridging models) since these models have attracted considerable attention from the research community. For an examination of models not discussed here, we refer the reader to [Akl97], [LMR95], and [MMT95].

Figure 2.1: The PRAM model.

2.1.1 PRAM

The Parallel Random Access Machine (PRAM) [FW78] is the most widely used parallel computational model. The PRAM model consists of p sequential processors sharing a global memory, as shown in Figure 2.1. During each time step, or cycle, each processor executes a RAM instruction or accesses global memory. After each cycle, the processors implicitly synchronize to execute the next instruction.

In the PRAM model, more than one processor can try to read from or write into the same memory location simultaneously. CRCW (concurrent-read, concurrent-write), CREW (concurrent-read, exclusive-write), and EREW (exclusive-read, exclusive-write) PRAMs [FW78] handle simultaneous access of several processors to the same location of global memory. The CRCW PRAM, the most powerful PRAM model, uses a protocol to resolve concurrent writes. Example protocols include arbitration (an arbitrary processor proceeds with the write operation), prioritization (the processor with the highest priority writes the result), and summation (the sum of all quantities is written).

The PRAM model assumes that synchronization and communication are essentially free. However, these overheads can significantly affect algorithm performance since existing parallel machines do not adhere to these assumptions. By ignoring the costs associated with exploiting parallelism, the PRAM is a simple abstraction that allows the designer to expose the maximum possible computational parallelism in a given task. Thus, the PRAM provides a measure of the ideal parallel time complexity.
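To make the PRAM's idealized accounting concrete, consider summing n values with the pairwise, balanced-tree schedule an EREW PRAM would use: each time step halves the number of active values, so the sum is computed in ceil(log2 n) steps, and synchronization and memory accesses cost nothing. The C fragment below is an illustrative simulation of this schedule (it is not code from this dissertation); each iteration of the inner loop stands in for one processor that, on a real PRAM, would execute simultaneously with the others.

    #include <stdio.h>

    /* Illustrative EREW PRAM summation: n values are reduced in
     * ceil(log2 n) lockstep time steps. Each inner-loop iteration
     * models one PRAM processor; on a PRAM, all of them execute
     * the same instruction simultaneously and at unit cost. */
    int pram_sum(int *a, int n)
    {
        for (int stride = 1; stride < n; stride *= 2)      /* one PRAM time step */
            for (int i = 0; i + stride < n; i += 2 * stride)
                a[i] += a[i + stride];                     /* exclusive read/write */
        return a[0];
    }

    int main(void)
    {
        int a[8] = {3, 1, 4, 1, 5, 9, 2, 6};
        printf("sum = %d\n", pram_sum(a, 8));              /* 3 time steps for n = 8 */
        return 0;
    }

Under the PRAM's assumptions, the outer loop is the entire cost: O(log n) time using n/2 processors, the ideal parallel time complexity for summation.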
Many modifications to the PRAM have been proposed that attempt to bring it closer to practical parallel computers. Goodrich [Goo93] and McColl [McC93] survey the PRAM model and its extensions. A brief overview of machine characteristics that have been the focus of efforts to improve the PRAM is given below.

1. Memory Access. The LPRAM (Local-memory PRAM) [ACS90] augments the CREW PRAM by associating with each processor an unlimited amount of local private memory. The QRQW (Queue-read, queue-write) PRAM [GMR94] assumes that simultaneous accesses to the same memory block are inserted into a request queue and served in FIFO order. The cost of a memory access is a function of the queue length.

2. Asynchrony. The Phase PRAM [Gib89] extends the PRAM by allowing asynchronous execution. A computation is divided into phases, and all processors run asynchronously within a phase. An explicit synchronization is performed at the end of a phase.

3. Latency. The BPRAM (Block PRAM) [ACS89], an extension of the LPRAM, addresses communication latency by taking into account the reduced cost of transferring a contiguous block of data. The BPRAM model is defined with two parameters, L (latency or startup time) and b (block size). Although it costs one unit of time to access local memory, accessing a block of b contiguous locations from global memory costs L + b time units.

4. Bandwidth. The DRAM (Distributed RAM) [LM88] eliminates the paradigm of global shared memory and replaces it with only private distributed memory. Additionally, the communication topology of the network is ignored. To address the notion of limited bandwidth, the model proposes a cost function for a non-local memory access based on the maximum possible congestion for a given data partition and execution sequence. The function attempts to provide scheduling incentives that respect limited access to non-local data.

2.1.2 Network

Concurrent with the study of PRAM algorithms, there has been considerable research on network-based models. Figure 2.2 illustrates several different networks. In these models, processors send messages to and receive messages from other processors over a given network. Communication is only allowed between directly connected processors. Other communication is explicitly forwarded through intermediate nodes. In each step, the nodes can communicate with their nearest neighbors and operate on local data. Leighton [Lei93] provides a survey and analysis of these models.

Figure 2.2: Bus, mesh, and hypercube networks.

Many algorithms have been designed to run efficiently on particular network topologies. Examples are parallel prefix (tree) and FFT (butterfly). Although this approach can lead to very fine-tuned algorithms, it has some disadvantages. First, algorithms designed for one network may not perform well on other networks. Hence, to solve a problem on a new machine, it may be necessary to design a completely new algorithm. Second, algorithms that take advantage of a particular network tend to be more complicated than algorithms designed for more abstract models like the PRAM, since they incorporate some of the details of the network.

2.1.3 Bridging

PRAM and network models are simple models that appeal to algorithm designers. However, neither approach facilitates the development of portable and efficient algorithms for a variety of parallel platforms. This has prompted the introduction of bridging models [Val90a, Val93]. Ideally, a bridging model provides a unified abstraction capturing the architectural features that are significant to the performance of parallel programs. An algorithm designed on a bridging model should be readily implementable on a variety of parallel architectures, and its efficiency on the model should be a good reflection of its actual performance.

The Bulk Synchronous Parallel (BSP) model [Val90a] is a bridging model that consists of p processor/memory modules, a communication network, and a mechanism for efficient barrier synchronization of all the processors. Figure 2.3 shows the BSP model. A computation consists of a sequence of supersteps. During a superstep, each processor performs asynchronously some combination of local computation, message transmissions, and message arrivals. Each superstep is followed by a global synchronization of all the processors. A message sent in one superstep is guaranteed to be available to the destination processor at the beginning of the next superstep. A superstep is shown in Figure 2.4.

Figure 2.3: The BSP model.
Figure 2.4: A superstep.

Three parameters characterize the performance of a BSP computer. p represents the number of processors, L measures the minimal number of time steps
This has prompted the introduction of bridging models [Val90a, Val93]. Ideally, a bridging model provides a uni ed abstraction capturing architectural features that are signi cant to the performance of parallel programs. An algorithm designed on a bridging model should be readily implementable on a variety of parallel architectures, and its eciency on the model should be a good re ection on its actual performance. The Bulk Synchronous Parallel (BSP) model [Val90a] is a bridging model that consists of p processor/memory modules, a communication network, and a mechanism for ecient barrier synchronization of all the processors. Figure 2.3 shows the BSP model. A computation consists of a sequence of supersteps. During a superstep, each processor performs asynchronously some combination of local computation, message transmissions, and message arrivals. Each superstep is followed by a global synchronization of all the processors. A message sent in one superstep is guaranteed to be available to the destination processor at the beginning of the next superstep. A superstep is shown in Figure 2.4. Three parameters characterize the performance of a BSP computer. p represents the number of processors, L measures the minimal number of time steps 12 P 0 P 1 M 0 M 1 . . . P p-1 M p-1 COMMUNICATION NETWORK Figure 2.3: The BSP model. Processors Local Computations Global Communications Barrier Synchronization Figure 2.4: A superstep. 13 between successive synchronization operations, and g re ects the minimal time interval between consecutive message transmissions on a per-processor basis. Both g and L are measured in terms of basic computational operations. The time complexity of a superstep in a BSP program is: w + gh + L where w is the maximum number of basic computational operations executed by any processor in the local computation phase, and h is the maximum number of messages sent or received by any processor. The total execution time for the program is the sum of all the superstep times. An approach related to BSP is the LogP model [CKP93, CKP96]. LogP models the performance of point-to-point messages with three parameters: o (computation overhead of handling a message), g (time interval between consecutive message transmissions at a processor), and L (latency for transmitting a single message). The main di erence between the two models is that, under LogP, the scheduling of communication at the single-message level is the responsibility of the application programmer. Under BSP, the underlying system performs that task. Although proponents of LogP argue that their model o ers a more exible style of programming, Goudreau and Rao [GR98] argue that the advantages are largely illusory, since both approaches lead to very similar high-level parallel algorithms. Moreover, cross simulations between the models show that LogP is no more powerful than BSP from an asymptotic point of view [BHP96]. Thus, BSP's simpler programming style is perhaps to be preferred. Other bridging models have been proposed. Candidate Type Architecture (CTA) [AGL98, Sny86] is an early two parameter model (communication cost L and number of processor p) that was the result of a multidisciplinary e ort. Blelloch et al. [BGM95] propose the (d; x)-BSP model as a re nement for BSP 14 that provides more detailed modeling of memory bank contention and delay. LogP-HMM [LMR95] extends the LogP model with a hierarchical memory model characterizing each processor. Each of the models discussed above are distributed memory models. 
However, there are arguments in support of the shared-memory abstraction. The Queuing Shared Memory (QSM) model [GMR97] is one such example.

2.1.4 Summary

The various models show the lack of consensus on a computational model (or models) of parallel computing. However, the proposed models demonstrate that a small set of machine characteristics is important: communication latency, communication overhead, communication bandwidth, execution synchronization, and memory hierarchy. Early computational models used a few parameters to describe the features of parallel machines. Unfortunately, the assumptions made by these models are not consistent with existing parallel machines, and these simplified assumptions led to inaccuracies in predicting the actual running time of algorithms. Recent models attempt to bridge the gap between software and hardware by using more parameters to capture the essential characteristics of parallel machines. Furthermore, these models appear to be a promising approach toward the systematic development of parallel software.

2.2 Programming Models

A programming model is a set of language constructs that can be used to express an algorithmic concept in a programming language. Parallel programming is similar to sequential programming in that there are many different languages to select from to solve a problem. In this section, we restrict our attention to three dominant programming approaches: data-parallel, message-passing, and shared-memory. A good survey on this topic is the paper by Skillicorn and Talia [ST98].

2.2.1 Data-Parallel

The data-parallel model provides constructs for expressing that a statement sequence is to be executed in parallel on different data. Data-parallel languages are attractive because parallelism is expressed not as a set of processors whose interactions are managed by the user, but rather as parallel operations on aggregate data structures. Typically, the programmer must analyze the algorithm to find the parts that can be executed in parallel. The compiler then maps the data-parallel parts onto the underlying hardware.

High-Performance Fortran (HPF) [For93] is a data-parallel language based on Fortran-90. It adds more direct data parallelism by including directives to specify how data structures are allocated to processors, and constructs to carry out data-parallel operations, such as reductions. The directives for data distribution support a two-phase process in which an array is aligned, using the ALIGN directive, relative to a template or another array that has already been distributed. The DISTRIBUTE directive is used to distribute an object (and any other objects that may be aligned with it) onto an abstract processor array. An array distribution can be changed at any point by REDISTRIBUTE and REALIGN. The mapping of abstract processors to physical processors is implementation dependent and is not specified in the language. Data distribution directives are recommendations to an HPF compiler, not instructions. The compiler does not have to obey them if it determines that performance can be improved by ignoring them.

Figure 2.5: HPF data allocation model.
Figure 2.5 illustrates HPF's data allocation model for the integer arrays A, B, and C of sizes 100, 100, and 101, respectively. The ALIGN directive aligns array element A(i) with element B(i-1). Afterwards, the data is distributed onto 4 logical processors. Arrays A and B are distributed in a block fashion, whereas array C is distributed cyclically. (A block distribution evenly divides an array into a number of chunks, or blocks, of consecutive array elements and allocates a block to a node; a cyclic distribution evenly divides an array so that every ith element is allocated to the ith node.) Thus, P1 consists of array elements A(1), A(2), ..., A(25), B(1), B(2), ..., B(24), C(1), C(5), C(9), ..., C(97), C(101); P2 consists of array elements A(26), A(27), ..., A(50), B(25), B(26), ..., B(49), C(2), C(6), C(10), ..., C(98); and so on. The same logical mapping can be used for different physical mappings. For example, each logical node may map to a physical node of a 3x4 mesh; alternatively, all of the logical nodes may map to a uniprocessor. HPF's data allocation model thus allows the same code (with all the directives) to run on both computers without a change.

HPF also offers directives to provide hints for data-dependence analysis. For instance, PURE asserts that a subroutine has no side effects, so that its presence in a loop does not inhibit the loop parallelism. Additionally, the INDEPENDENT directive asserts that a loop has no loop-carried dependence, which allows the compiler to parallelize the loop without any further analysis.

The most attractive feature of the data-parallel approach, as exemplified in HPF, is that the compiler takes on the job of generating communication code. However, program performance depends on how well the compiler handles interprocess communication and synchronization. The relative immaturity of these compilers usually means that they may not produce efficient code in many situations.

2.2.2 Message-Passing

A message-passing program consists of multiple processes having only local memory. Since processes reside in different address spaces, they communicate by sending and receiving messages. Typically, message-passing programming is done by linking with and making calls to libraries that manage the data exchange between processes.

The Message Passing Interface (MPI) library [SOJ96] is the standard message-passing interface for writing parallel applications and libraries. An MPI program is a collection of concurrent communicating tasks belonging to a specified group. Task groups provide contexts through which MPI operations can be restricted to only the members of a particular group. The members of a group are assigned unique contiguous identifiers, called ranks, starting from zero. Since new groups cannot be created from scratch, MPI provides a number of functions to create new groups from existing ones. Point-to-point communications between processes are based on send and receive primitives that support both synchronous and asynchronous communication. MPI also provides primitives for collective communication. Collective operations execute when all tasks in the group call the collective routine with matching parameters. Synchronization, broadcast, scatter/gather, and reductions (min, max, multiply) are examples of collective routines supported by MPI.
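As a concrete illustration of this style (a sketch, not code from this dissertation), the following MPI program combines the two kinds of primitives: the root broadcasts a problem size, every task computes a partial sum over its share of the work, and a reduction combines the partial results at the root.

    #include <stdio.h>
    #include <mpi.h>

    /* Illustrative MPI program: broadcast a parameter, compute a
     * partial result on every task, and combine with a reduction. */
    int main(int argc, char *argv[])
    {
        int rank, size, n = 10000;
        long local = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Collective: every task receives n from the root (rank 0). */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Each task sums its share of 0..n-1, determined by its rank. */
        for (int i = rank; i < n; i += size)
            local += i;

        /* Collective: combine the partial sums at the root. */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %ld\n", total);
        MPI_Finalize();
        return 0;
    }

Every task executes the same program; its rank determines which portion of the work it performs and whether it acts as the root of the collectives.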
One of the advantages of using MPI is that it facilitates the development of portable parallel libraries. An important requirement for achieving this goal is to guarantee a safe communication space in which unrelated messages are separated from one another. MPI introduces a communicator, which binds a communication context to a group of tasks, to achieve a safe communication space. Having a communication context allows library packages written in message-passing systems to protect or mark their messages so that they are not received (incorrectly) by the user's code.

Figure 2.6: Messages sent without context are erroneously received.

Figure 2.6 illustrates the fundamental problem. In this figure, two processes are calling a library routine that also performs message passing. The library and the user's code have both chosen the same tag to mark a message. Without context, messages are received in the wrong order. To solve this problem, a third tag, assigned by the operating system, is needed to distinguish user messages from library messages. Upon entrance to a library routine, for example, the software would determine this third tag and use it for all communications within the library.

The functionality of message-passing libraries is relatively simple and easy to implement. Unlike the data-parallel approach, the programmer must explicitly implement a data distribution scheme and handle all interprocess communication. Consequently, the development of anything but simple programs is quite difficult. Additionally, it is the programmer's responsibility to resolve data dependencies and avoid deadlock and race conditions. Thus, the performance of an application under the message-passing approach often depends upon the ability of the developer. Other message-passing models include p4 [BL92] and PICL [GHP90].

2.2.3 Shared-Memory

The shared-memory model is similar to the data-parallel model in that it has a single address (global naming) space. It is similar to the message-passing model in that it is multithreaded and asynchronous. Communication is done implicitly through shared reads and writes of variables; synchronization, however, is explicit. We base our discussion of shared-memory models on Linda. Other shared-memory models include Orca and SR.

Linda [CG89, Gel85] is a shared-memory language that provides an extension to standard imperative languages. In Linda, point-to-point communication is replaced by a tuple space, which is shared and accessible to all processes. The tuple space is also associative: items are removed from the tuple space using pattern-matching rules rather than by being addressed directly. Thus, tuple space is similar to a cache in that it is addressed associatively. Tuple space is also anonymous; once a tuple has been placed in tuple space, the system does not keep track of its creator.

The Linda communication model contains three communication operations: in, which reads and removes a tuple from the tuple space; rd, which reads a tuple from the tuple space; and out, which adds a tuple to the tuple space. For example, the rd operation

    rd("Florida", ?X, "Orlando")

searches the tuple space for tuples of three elements, with a first element "Florida", a last element "Orlando", and a second element of the same type as the variable X.
A match occurs if a tuple is found with the same number of elements and the types and values of the corresponding elements are the same. If a matching tuple is not found, the issuing processor must wait until a satisfying tuple enters the tuple space. Besides these three basic operations, Linda provides the eval(x) operation, which implicitly creates a new process to evaluate the tuple x and inserts the result in the tuple space.

There is a general feeling that shared-memory programming is easier than message-passing programming [HX98]. One reason is that the shared-memory abstraction is similar to the view of memory in sequential programming. However, for developing new, efficient parallel programs that are loosely synchronous and have regular communication patterns, the shared-variable approach is not necessarily easier than the message-passing one. Moreover, shared-memory programs may be more difficult to debug than message-passing ones. Since processes in a shared-memory program reside in a single address space, accesses to shared data must be protected by synchronization constructs such as locks and critical regions. As a result, subtle synchronization errors can easily occur that are difficult to detect. These problems occur less frequently in a message-passing program, as the processes do not share a single address space.

2.2.4 Summary

From our survey of parallel programming models, two observations emerge. First, the programming models are mostly extensions of C or Fortran, reflecting the programmer's reluctance to learn a completely new language. Second, parallel programming models are evolving toward more high-level approaches, in which the programmer is not responsible for handling all aspects of developing a parallel application. Instead, compilers handle such things as data-dependence detection, communication, synchronization, scheduling, and data mapping. The trend toward higher-level programming models appears to be a good approach since it provides for more robust and portable parallel software. Yet the success of such higher-level models clearly depends upon advances in compiler technology.

2.3 Heterogeneous Computing

Several models exist to support heterogeneous parallel computation. Below, we consider three models (HCGM, Cluster-M, and PVM) for developing applications for heterogeneous machines. The first two are computational models, whereas PVM is a programming model. An overview of heterogeneous models not discussed here appears in [SDA97] and [WWD94].

2.3.1 HCGM

The Heterogeneous Coarse-Grained Multicomputer (HCGM) model [Mor98a] is a generalization of the CGM model [DFR93]. HCGM shares the same spirit as the BSP and LogP models in that it attempts to provide a bridge between the hardware and software layers of a heterogeneous machine. Formally, HCGM models parallel computers consisting of p heterogeneous processors. Since processors have varying computing capabilities, s_i represents the speed of processor P_i. The model assumes that the memory and communication speeds of the processors are proportional to their computational speeds. As a result, faster processors
The performance of an HCGM algorithm is measured in terms of computation time and number of supersteps. A model similar in structure and philosophy is the Heterogeneous Bulk Synchronous Parallel (HBSP) model [WP00]. Both HBSP and HCGM are similar in structure and philosophy. The main di erence is that HCGM is not intended to be an accurate predictor of execution times whereas HBSP attempts to provide the developer with predictable algorithmic performance. Additionally, HBSP provides part of the motivation for the development of the HBSPk model. In the HBSPk model, HBSP is synonymous with HBSP1 . 2.3.2 Cluster-M Cluster-M [EF93, ES93] is a model designed to bridge the gap between software and hardware for heterogeneous computing. Cluster-M consists of three main components: the speci cation module, the representation module, and the mapping module. A program is represented as a Spec graph (a multilevel clustered task graph), where nodes (Spec clusters ) show execution times, and arcs represent the expected amount of data to be transferred between the nodes. Leaf nodes represent a single computation operand. All clusters at a level are independent and may be executed simultaneously. Furthermore, the programmer speci es the manner in which a program is clustered, which may be modi ed during run-time. In the representation module, a heterogeneous suite of computers is represented by a Rep graph (a multilevel partitioning of a system graph), where nodes 24 contain the speeds of arithmetic operations for the associated processor, and arcs express the bandwidth for communications between processors. Given an arbitrary Spec graph containing M task modules, and an arbitrary Rep graph of N processors, the mapping module is a portable heuristic tool responsible for nearoptimal mapping of the two graphs in O(MP ) time, where P = maxfM; N g. Moreover, the mapping module has an interface that can be used with portable network communication tools, such as PVM (see Section 2.2.2), for executing portable parallel software across heterogeneous machines. 2.3.3 PVM PVM (Parallel Virtual Machine) [Sun90] is a message-passing software system that is a byproduct of the Heterogeneous Network Project| a collaborative e ort by researchers at Oak Ridge National Laboratory, the University of Tennessee, and Emory University to facilitate heterogeneous parallel computing. PVM is built around the concept of a virtual machine, which is a dynamic collection of computational resources managed as a single parallel computer. The PVM system consist of two parts: a PVM daemon (called pvmd) that resides on every computer of the virtual machine, and a library of standard interface routines that is linked to the user application. The pvmd daemon oversees the operation of user processes within a PVM application and coordinates inter-machine PVM communications. The PVM library contains subroutine calls that the application programmer embeds in their application code. The library routines interact with the pvmd to provide services such as communication, synchronization, and process management. The pvmd may provide the requested service alone or in cooperation with other pvmds in the heterogeneous system. 25 Application programs that use PVM are composed of several tasks. Each task is responsible for a part of the application's computational workload. By sending and receiving messages, multiple tasks of an application can cooperate to solve a problem in parallel. 
Under PVM, the programmer has the ability to place tasks on specific machines. Such flexibility enables the various tasks of a heterogeneous application to exploit particular strengths of the computational resources. However, it is the programmer's responsibility to understand and explicitly code for any distinctive properties of the heterogeneous system.

2.3.4 Summary

Programming heterogeneous systems is difficult because each application must take advantage of the underlying architectures and adjust for hardware availability. Some HC models rely solely on programmers to handle the complexity of HC systems. As a result, the programmer must hand-parallelize each task specifically for the appropriate target machine; if the configuration changes, parts of the application must be rewritten. Other approaches rely on compilers to automatically handle some of the complexity of tailoring applications for heterogeneous systems. By hiding heterogeneity, the developer does not need to understand all of the characteristics of the HC system, and programs written in this way are potentially mechanically portable. Thus, developing software that executes efficiently and predictably on HC systems requires a model that hides some of the heterogeneity from the programmer while describing the underlying system with accuracy.

CHAPTER 3

A Case for BSP

We believe that the BSP model provides a fundamentally sound approach to parallel programming. The model supports the development of architecture-independent software, which promotes a widespread software industry for parallel computers. Moreover, existing applications do not have to be redeveloped or modified in a non-trivial way when migrated to different machines. Secondly, the BSP model includes a cost model that provides predictable costs of algorithm execution. BSP captures the essential characteristics of parallel machines with only a few parameters; more complex computational models tend to use more parameters, which renders them too tedious for practical use. Additionally, the BSP model can be viewed as a kind of programming methodology. The essence of the BSP approach is the notion of the superstep and the idea that the input/output associated with a superstep is performed as a global operation, involving the whole set of individual sends and receives. Viewed in this way, a BSP program is simply one that proceeds in phases, with the necessary global communication taking place between the phases. Lastly, the BSP model provides practical design goals for architects. According to the model, the routing of h-relations should be efficient (g should be small) and barrier synchronization should be efficient (L should be small). Parallel machines developed with these architectural design goals will be quite suitable for executing BSP algorithms. On the other hand, systems not designed with BSP in mind may not deliver good values of g and L, resulting in inadequate performance of BSP algorithms.

3.1 The BSP model

As discussed in Section 2.1.3, a BSP computer consists of a set of processor/memory modules, a communication network, and a mechanism for efficient barrier synchronization of all the processors. A computation consists of a sequence of supersteps. During a superstep, each processor performs asynchronously some combination of local computation, message transmissions, and message arrivals. A message sent in one superstep is guaranteed to be available to the destination processor at the beginning of the next superstep.
Each superstep is followed by a global synchronization of all the processors.

Three parameters characterize the performance of a BSP computer: p represents the number of processors, L measures the minimal number of time steps between successive synchronization operations, and g reflects the minimal time interval between consecutive message transmissions on a per-processor basis. The values of g and L can be given in absolute times or normalized with respect to processor speed.

The parameters described above allow for cost analysis of programs. Cost prediction can be used in the development of BSP algorithms or to predict the actual performance of a program ported to a new architecture. Consider a BSP program consisting of S supersteps. The time complexity of superstep i in a BSP program is

    w_i + g h_i + L     (3.1)

where w_i is the largest amount of local computation performed by any processor, and h_i is the maximum number of messages sent or received by any processor. (This communication pattern is called an h-relation.) The execution time of the entire program is defined as

    W + gH + LS     (3.2)

where W = \sum_{i=0}^{S-1} w_i and H = \sum_{i=0}^{S-1} h_i.

The above cost model demonstrates which factors are important when designing BSP applications. To minimize execution time, the programmer must attempt to (i) balance the local computation between processors in each superstep, (ii) balance communication between processors to avoid large variations in h_i, and (iii) minimize the number of supersteps. In practice, these objectives can conflict, and trade-offs must be made. The correct trade-offs can be selected by taking into account the g and L parameters of the underlying machine.

The cost model also shows how to predict performance across target architectures. The values W, H, and S can be determined by measuring the amount of local computation, the number of bytes sent, and the total number of supersteps [SHM97]. The values of g and L can then be inserted into the cost formula to predict the performance of programs ported to new parallel computers.

From the point of view of the BSP programmer, there are only two levels of memory locality: either the data is in local memory, or it is in nonlocal memory (at the other processors). There is no concept of network locality, as there would be if the underlying interconnection network were a mesh, hypercube, or fat tree. If the underlying interconnection network does support network locality, this fact will not be exploited by the BSP programmer.

3.2 BSP Sorting

Sequential sorting algorithms have been developed under the Random-Access Machine (RAM) model, an abstraction of the von Neumann model that has guided uniprocessor hardware design for decades. Parallel sorting algorithms have been investigated for many different machines and models; however, unlike sequential computing, parallel computing has no widely accepted model for program development. As a result, efficient parallel programs are often machine-specific. To demonstrate the utility of the model, we develop BSP implementations of four sorting algorithms (randomized sample sort, deterministic sample sort, bitonic sort, and radix sort) that present various computation and communication patterns to a parallel machine. With these applications, we evaluate the utility of BSP in terms of portability, efficiency, and predictability on an SGI Challenge and an Intel Paragon.
The claim that both efficiency and portability can be achieved by using the BSP model is supported by both theoretical and experimental results [Val90a, GLR99, KS99, McC93, Val90b, Val93, WG98]. However, other general-purpose models, such as LogP [CKP96], make similar claims. LogP models the performance of point-to-point messages with three parameters representing software overhead, network latency, and communication bandwidth. Under LogP, the programmer is not constrained by a superstep programming style. Although proponents of LogP argue that it offers a more flexible style of programming, Goudreau and Rao [GR98] argue that the advantages are largely illusory, since both approaches lead to very similar high-level parallel algorithms. In fact, most of the BSP sorting algorithms discussed here are, from a high-level perspective, virtually identical to Dusseau et al.'s LogP implementations [DCS96]. The main difference between the two models is that the scheduling of communication at the single-message level is the responsibility of the application programmer under LogP, while the underlying system performs that task under BSP. We argue that the cost of allowing the underlying system to handle communication scheduling is negligible; thus, the higher-level BSP approach is preferable.

Several experimental studies on the implementation of parallel sorting algorithms have influenced this work. Similar parallel sorting studies are described by Blelloch et al. for a Connection Machine CM-2 [BLM98], by Hightower, Prins, and Reif for a MasPar MP-1 [HPR92], and by Helman, Bader, and JaJa on a Connection Machine CM-5, an IBM SP-2, and a Cray Research T3D [HBJ96]. Of particular relevance is the work of Dusseau et al. [DCS96], which described several sorting approaches on a Connection Machine CM-5 using the LogP cost model [CKP93]. Dusseau et al.'s work is very similar in philosophy to this work in that it advocates the use of a bridging model for the design of portable and efficient code. Experimental results for sorting based directly on the BSP model can be found in the work of Gerbessiotis and Siniolakis for an SGI Power Challenge [GS96], Shumaker and Goudreau for a MasPar MP-2 [SG97], and Hill et al. for a Cray T3E [HJS97]. Juurlink and Wijshoff performed a detailed experimental analysis of three parallel models (BSP, E-BSP, and BPRAM) and validated them on five platforms (Cray T3E, Thinking Machines CM-5, Intel Paragon, MasPar MP-1, and Parsytec GCel) using sorting, matrix multiplication, and all-pairs shortest path algorithms [JW98].

3.2.1 Experimental Approach

The code for the sorting algorithms uses the BSPlib library [HMS98]. BSPlib synthesizes several BSP programming approaches and provides a parallel communication library based around a Single Program Multiple Data (SPMD) model of computation. BSPlib provides a small set of BSP operations and two styles of data communication: Direct Remote Memory Access (DRMA) and Bulk Synchronous Message Passing (BSMP). DRMA reflects one-sided direct remote memory access, while BSMP captures a BSP-oriented message-passing approach. We use the BSMP style of communication in our sorting algorithms.

Figure 3.1 contains a small but representative example that captures the BSMP style of communication. In this code fragment, each processor calls bsp_nprocs and bsp_pid. These functions return to the calling processor the number of processors and its identity, respectively.
Next, processor 0 sends a message with an integer tag (0) and a payload of two doubles (1.4 and 2.3) to all of the other processors. The bsp_send call sends the message. After the barrier synchronization call (bsp_sync), the message can be received at the destination process by first accessing the tag (bsp_get_tag) and then transferring the payload to the destination (bsp_move). The status variable used in the bsp_get_tag call returns the length of the payload. The bsp_move call also serves to flush this message from the input buffer.

The current experiments utilize two platforms:

- An SGI Challenge (a shared-memory platform) with 16 MIPS R4400 processors running IRIX System V.4. A shared-memory implementation of the BSP library developed by Kevin Lang is used.

- An Intel Paragon (a message-passing machine) with 32 i860 XP processors running Paragon OSF/1 Release 1.0.4. The BSPlib implementation used was developed by Travis Terry at the University of Central Florida.

The code is compiled with the -O2 optimization flag.

    int i;
    int p;
    int pid;
    int status;
    int tag;
    double payload[2];
    ...
    p   = bsp_nprocs();   /* no. of processors  */
    pid = bsp_pid();      /* processor identity */

    /* P0 broadcasts message */
    if (pid == 0) {
        tag = 0;
        payload[0] = 1.4;
        payload[1] = 2.3;
        for (i = 1; i < p; i++)
            bsp_send(i, &tag, payload, 2*sizeof(double));
    }
    bsp_sync();

    /* Receive message from P0 */
    if (pid != 0) {
        bsp_get_tag(&status, &tag);
        bsp_move(payload, status);
    }
    ...

Figure 3.1: Code fragment demonstrating BSMP

          SGI                   Intel
 p   g (µs/byte)  L (µs)   g (µs/byte)  L (µs)
 1      0.03         0        0.22        353
 2      0.03        20        0.40        657
 4      0.02        20        0.30       1299
 8      0.03        40        0.32       2505
16      0.04        60        0.38       4990

Table 3.1: BSP system parameters

We consider BSP to model only communication and synchronization; I/O and local computation are not modeled. As a result, none of the experiments include I/O, and local computation is measured as accurately as possible on our platforms. We discuss our method for measuring the amount of local computation later in this section. Timings start when the input data is evenly distributed among the processors. The input data consists of 4K to 8192K uniformly distributed integers, where K = 1024. For local sorting, we use an 11-bit radix sort, which is the fastest sequential sort that we could find.

Table 3.1 shows the values of g and L achieved on the SGI Challenge and the Intel Paragon. The bandwidth parameter g is the time per packet for a sufficiently large superstep with a total-exchange communication pattern; g is based on an h-relation size of 512 bytes. The value of L corresponds to the time needed for the processors to synchronize for an empty superstep (i.e., no computation or communication).

To illustrate cost prediction on the SGI Challenge, we measure the values of W, H, and S on the SGI Challenge for each of the sorting algorithms. The values of g and L can then be inserted into the cost formula to predict the performance of our sorting applications. Performance prediction on the Intel Paragon is slightly different. To predict runtimes on the Paragon, we apply the cost model

    cW + gH + LS     (3.3)

where c reflects the factor used to estimate local computation (or work depth) on the Intel Paragon. We determine the value of c in the following manner. For each problem size, the work depth of the applications is measured on both platforms. Next, we compute the ratio of work depth on the SGI Challenge to work depth on the Intel Paragon; c is the average of these ratios.
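The prediction procedure reduces to a few lines of arithmetic. The sketch below, a self-contained C program, applies cost models (3.2) and (3.3) to the measured quantities for one run; the W, H, S, and c values are those reported for randomized sample sort at n/p = 512K in Table 3.2, and the g and L values are the 16-processor entries of Table 3.1.

    #include <stdio.h>

    /* Predicted running time in seconds: c*W + g*H + L*S (Equation 3.3).
       With c = 1 this is the plain BSP cost W + g*H + L*S (Equation 3.2). */
    static double predict(double c, double W, double g, double H, double L, int S)
    {
        return c * W + g * H + L * S;
    }

    int main(void)
    {
        double n = 16.0 * 512 * 1024;               /* 8M keys on 16 processors */
        double W = 3.29, H = 2229404.0;             /* rand, n/p = 512K         */
        int S = 5;                                  /* supersteps               */
        double c = 1.93;                            /* SGI-to-Paragon factor    */

        /* 16-processor machine parameters from Table 3.1, in seconds. */
        double g_sgi = 0.04e-6, L_sgi = 60e-6;
        double g_par = 0.38e-6, L_par = 4990e-6;

        printf("SGI:     %.2f us/key\n",
               predict(1.0, W, g_sgi, H, L_sgi, S) / n * 1e6);
        printf("Paragon: %.2f us/key\n",
               predict(c, W, g_par, H, L_par, S) / n * 1e6);
        return 0;
    }

Both outputs reproduce the corresponding predicted entries of Table 3.2 (0.40 and 0.86 µs/key).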
Table 3.2 provides data about the performance of the BSP sorting applications using 16 processors on both of our parallel platforms. We give the algorithmic parameters, including the work depth (as measured on the SGI), the sum over all supersteps of the maximum number of bytes sent or received by any processor, and the number of supersteps. We also include the actual running times, the BSP predicted running times, and the c factor. Execution times are given in µs/key; the goal is to observe a constant execution time as we scale the problem size. The error of prediction is given by max{T_actual, T_pred} / min{T_actual, T_pred}, where T_actual and T_pred represent the actual and predicted execution times, respectively.

The data indicates a general trend that, for these applications, efficient use of larger numbers of processors can be achieved by increasing the problem size. This is true not only for these sorting algorithms but for a wide range of important applications. Intuitively, this will occur whenever the computation can be equally balanced among the processors and the communication and synchronization requirements grow more slowly than the computation requirements.

 app      n/p    SGI    SGI     SGI     Intel  Intel   Intel     W        H       S     c
                pred   actual  error    pred   actual  error   (sec)   (bytes)
 rand      4K   1.96    1.71   12.67%   6.82    5.09   25.33%   0.12    148648    5   2.99
 rand    512K   0.40    0.43    6.98%   0.86    0.87    1.25%   3.29   2229404    5   1.93
 dterm     4K   3.98    4.52   11.93%  11.11    5.21   53.14%   0.26    106488    7   2.55
 dterm   512K   0.76    0.74    2.69%   1.37    1.19   12.97%   6.17   4280584    7   1.59
 bitonic   4K   5.98    3.74   37.51%  13.76    4.29   68.83%   0.39    139264   25   1.88
 bitonic 512K   2.63    2.40    8.80%   3.65    2.05   43.69%  21.33  17825792   25   1.11
 radix     4K  10.58   13.17   19.65%  36.90   19.79   46.36%   0.67    318336  161   2.23
 radix   256K†  2.41    2.26    6.22%   5.17    3.18   38.53%   9.41  16833408  161   1.54

Table 3.2: Algorithmic and model summaries using 16 processors on the SGI Challenge and the Intel Paragon. Predicted and actual running times are in µs/key. † For radix sort, the largest problem size that could be run on both machines was 4,194,304 keys.

3.2.2 Randomized Sample Sort

One approach to parallel sorting that is suitable for BSP computing is randomized sample sort. The sequential predecessor of the algorithm is sequential samplesort [FM70], proposed by Frazer and McKellar as a refinement of Hoare's quicksort [Hoa62]. Sequential samplesort uses a random sample of the input keys to select splitters, resulting in greater balance (and therefore a lower expected number of comparisons) than quicksort. The observation that the sampling approach could be useful for splitting keys in a balanced manner over a number of processors appears in the work of Huang and Chow [HC83] and Reif and Valiant [RV87]. Its use was analyzed in a BSP context by Gerbessiotis and Valiant [GV94]. The basic idea behind randomized sample sort in a p-processor system is the following:

1. A set of p - 1 splitter keys is randomly selected. Conceptually, the splitters partition the input data into p buckets.

2. All keys assigned to the ith bucket are sent to the ith processor.

3. Each processor sorts its bucket.

The selection of splitters that define approximately equal-sized buckets is a crucial issue. The standard approach is to randomly select ps keys from the input set, where s is called the oversampling ratio. These keys are sorted, and the keys with ranks s, 2s, 3s, ..., (p - 1)s are selected as the splitters; a small sketch of this step follows below.
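The splitter-selection step can be expressed in a few lines of C. The fragment below is illustrative only: it samples with replacement via the library rand and sorts the sample with qsort rather than the tuned local sort used in our experiments.

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Select p-1 splitters from the n input keys using an oversampling
       ratio s: draw p*s random keys, sort them, and keep the keys at
       the evenly spaced positions s, 2s, ..., (p-1)s.
       Returns 0 on success, -1 on allocation failure. */
    int select_splitters(const int *keys, int n, int p, int s, int *splitters)
    {
        int i, *sample = malloc((size_t)p * s * sizeof *sample);
        if (sample == NULL)
            return -1;

        for (i = 0; i < p * s; i++)     /* random sample, with replacement */
            sample[i] = keys[rand() % n];

        qsort(sample, (size_t)p * s, sizeof *sample, cmp_int);

        for (i = 1; i < p; i++)         /* 0-based index i*s ~ rank i*s    */
            splitters[i - 1] = sample[i * s];

        free(sample);
        return 0;
    }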
By choosing a large enough oversampling ratio, it can be shown that, with high probability, no bucket will contain many more keys than the average [HC83].

Our BSPlib implementation of randomized sample sort is similar to Dusseau et al.'s LogP implementation. Since the sending of keys to the appropriate buckets requires irregular and unbalanced communication that cannot be predicted before run time, Dusseau et al. do not analyze the communication of randomized sample sort. The BSP approach avoids this situation by focusing on the global routing problem.

Analysis of Data. Figure 3.2 shows the actual and predicted performance of randomized sample sort on the SGI Challenge and the Intel Paragon. Among the algorithms considered, randomized sample sort had the best performance across all platforms. From the plots, we see that the time per key decreases with the number of processors. However, as n approaches our largest problem size, the time per key increases slightly on the SGI Challenge. The BSP model improves the accuracy of its predictions as n/p increases. When n/p = 512K, the actual execution times are within 6.98% and 21.17% of those predicted by the BSP model for the SGI Challenge and the Intel Paragon, respectively.

Figure 3.2: Predicted and actual execution time per key of randomized sample sort on an SGI Challenge and an Intel Paragon. Each plot represents the run time on a 2, 4, 8, or 16 processor system. (Plots omitted.)

From a BSP perspective, randomized sample sort is attractive. Of all the ratios considered, the algorithm performed best with an oversampling ratio of 1000. Randomized sample sort uses a constant number of supersteps, independent of both n and p. Also, the algorithm uses only one stage of communication and achieves a utilization of bandwidth that is close to ideal; most of the data makes only a single hop. Moreover, the local sort component can use the best available sequential sorting algorithm.

3.2.3 Deterministic Sample Sort

Deterministic sample sort, analyzed in a BSP context by Gerbessiotis et al. [GS96], is motivated by randomized sample sort. The basic idea behind deterministic sample sort is the following:

1. Each processor sorts its local keys.

2. A set of p - 1 splitter keys is deterministically selected.

3. All keys assigned to the ith bucket are sent to the ith processor.

4. Each processor merges the keys in its bucket.

As in randomized sample sort, the selection of splitters is key to good algorithmic performance. Our approach deterministically selects ps keys from the input set, where s is again the oversampling ratio. These keys are merged, and the keys with ranks s, 2s, ..., (p - 1)s are selected as splitters; one possible sampling scheme is sketched below. Deterministic sample sort also requires irregular and unbalanced communication to send keys to their appropriate buckets. If many keys have the same value, failure to break ties consistently can result in an uneven distribution of keys to buckets. Gerbessiotis et al.'s algorithm bounds the bucket sizes under the assumption that all keys are distinct. Since we allow duplicate keys, their bounds do not hold for our implementation. Dusseau et al. did not implement this algorithm.
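One natural way to realize the deterministic selection is sketched below, under the assumption that each processor contributes s evenly spaced samples from its sorted local run; the dissertation does not fix the exact sample positions, so the indexing here is illustrative.

    /* Each processor contributes s regular samples from its locally
       sorted run of len keys.  The 0-based positions (i+1)*len/(s+1),
       i = 0..s-1, are evenly spaced through the run; the p*s samples
       gathered from all processors are then merged, and the keys with
       ranks s, 2s, ..., (p-1)s become the splitters. */
    void regular_samples(const int *sorted_run, long len, int s, int *sample)
    {
        int i;
        for (i = 0; i < s; i++)
            sample[i] = sorted_run[(i + 1) * len / (s + 1)];
    }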
Analysis of Data. Figure 3.3 shows the experimental results for deterministic sample sort. Deterministic sample sort has the second-best performance across all platforms. Our experiments indicate that it performs better than randomized sample sort for small problems (n/p <= 16K). As with randomized sample sort, increasing the problem size leads to more accurate run-time predictions. For 8 million keys, the actual execution times are within 7.43% and 22.18% of those predicted by the BSP model for the SGI Challenge and the Intel Paragon, respectively.

For deterministic sample sort, we find an oversampling ratio of 1000 to give the best overall performance of the ratios we considered. As with randomized sample sort, it has many positive features in a general-purpose computing context. Both the computation and the communication are balanced. The algorithm uses a constant number of supersteps. There is only one stage of communication, and the bandwidth is used in an efficient manner. Moreover, the computation can leverage efficient sequential sorting algorithms.

Figure 3.3: Predicted and actual execution time per key of deterministic sample sort on an SGI Challenge and an Intel Paragon. Each plot represents the run time on a 2, 4, 8, or 16 processor system. (Plots omitted.)

3.2.4 Bitonic Sort

Bitonic sort, developed by Batcher [Bat68], is one of the first algorithms to attack the parallel sorting problem. The procedure depends upon the keys being ordered as a bitonic sequence. (A bitonic sequence is a sequence of elements that monotonically increases and then monotonically decreases, or a cyclic shift of such a sequence.) Initially, each key is considered a bitonic sequence. Afterwards, lg n merge stages generate the sorted list. During each stage, two bitonic sequences are merged to form a sorted sequence in increasing or decreasing order. The monotonic sequences are ordered such that two neighboring sequences (one monotonically increasing and the other monotonically decreasing) can be combined to form a new bitonic sequence for the next merge stage. For example, Figure 3.4 illustrates that the input (a bitonic sequence) to BM8+ is generated by combining the output of BM4+ (a monotonically increasing sequence) and the output of BM4- (a monotonically decreasing sequence).

Dusseau et al.'s LogP implementation of bitonic sort motivated our approach. We simulate the steps of the algorithm on a butterfly network. Intuitively, one can visualize the communication structure of the procedure as the concatenation of increasingly larger butterflies. The communication structure of the ith merge stage can be represented by n/2^i butterflies, each with 2^i rows and i columns. Each butterfly node compares two keys and selects either the maximum or the minimum key (see Figure 3.5). We employ a data placement so that all comparisons are local. The procedure begins with the data in a blocked layout. Under this layout, the first n/p keys and n/p rows of the butterfly nodes are assigned to the first processor, the second n/p keys and n/p rows are assigned to the second processor, and so on.
Figure 3.4: A schematic representation of a bitonic sorting network of size n = 8. BMk denotes a bitonic merging network of input size k that sorts its input in either monotonically increasing (+) or decreasing (-) order. The last merging network (BM8+) sorts the input. (Diagram omitted.)

Figure 3.5: A bitonic sorting network of size n = 8. Each node compares two keys, as indicated by the edges, and selects either the maximum or the minimum. Shaded and unshaded nodes designate where the minimum and maximum of the two keys are placed, respectively. (Diagram omitted.)

As a result, the first lg(n/p) merge stages are entirely local. Since the purpose of these first stages is to form a monotonically increasing or decreasing sequence of n/p keys on each processor, a local sort replaces these merge stages. For subsequent merge stages, the blocked layout is remapped to a cyclic layout: the first key is assigned to the first processor, the second key to the second processor, and so on. Under this layout, the first i - lg(n/p) columns of the ith merge stage are computed locally, with each processor performing a comparison and conditional swap of pairs of keys. Afterwards, the data is remapped back to a blocked layout, so that the last lg(n/p) steps of the merge stage are local, and a local sort is executed by each processor. The remaps between the blocked and cyclic layouts involve regular and balanced communication; that is, the communication schedule is oblivious to the values of the keys, and each processor receives as much data as it sends. Periodic cyclic-blocked remapping requires n >= p^2 (i.e., at least p elements per processor) to execute compare-exchange operations locally [Ion96].

Under LogP, Dusseau et al. discovered that their approach had degraded performance due to the asynchronous nature of their platform, the CM-5. Once processors reached the remap phase, they were seriously out of synch, increasing the opportunity for contention. To improve performance, they employed a barrier synchronization before each remap phase.

Analysis of Data. Experimental results for bitonic sort on the SGI Challenge and the Intel Paragon are shown in Figure 3.6. In terms of performance, bitonic sort was worse than the sample sorts. Bitonic sort performed best when n/p = 4K. (Of course, for such a small problem size, one should probably elect to use a sequential sort.) On the Intel Paragon, there are significant errors when trying to predict the performance of the algorithm on 16 processors.

Figure 3.6: Predicted and actual execution time per key of bitonic sort on an SGI Challenge and an Intel Paragon. Each plot represents the run time on a 2, 4, 8, or 16 processor system. (Plots omitted.)
For example, when n/p = 512K, the model predicts about 3.65 µs per key, but the measured time per key is 2.05 µs. However, for the same problem size on 8 processors, the prediction error is only 5.23%.

Bitonic sort was originally developed to sort n numbers in O(lg^2 n) parallel time. The original algorithm, however, assumed a growth of computational resources (in this case, comparators) of O(n lg^2 n). The overall work to sort n numbers is therefore O(n lg^2 n), which is asymptotically worse than the other parallel algorithms described here. Since bitonic sort is not based on the best available sequential algorithm and consists of O(lg n) communication phases, it is not surprising that it proved uncompetitive in our experiments.

3.2.5 Radix Sort

The radix sort algorithm [CLR94] relies on the binary representation of the unordered list of keys. Let b denote the number of bits in the binary representation of a key. Radix sort examines the keys r bits at a time, where r is called the radix, and therefore requires ⌈b/r⌉ passes. During pass i, it sorts the keys according to their ith least significant block of r bits.

Consider a BSP formulation of radix sort for n keys, which is virtually identical to Dusseau et al.'s approach. Each pass consists of three phases. First, each processor computes a local histogram containing 2^r buckets by traversing its list of keys and counting the number of occurrences of each of the 2^r digits. Next, the global rank of each key is determined by computing a global histogram from the local histograms. Let g(i, j) be the starting position in the output where the first key with value i on processor j belongs. Each processor determines the global rank of a key with value i by obtaining its g(i, j) value; the collection of g(i, j) values represents the global histogram. Lastly, each key is stored at the correct offset on the destination processor based on its global rank.

The first phase of each pass performs only local computation. However, the other phases require communication. Recall that in the second phase, processors determine the global rank of each key by consulting the appropriate g(i, j) value from the global histogram. The central components of the global rank computation are a multiscan (2^r parallel prefix computations, one for each bucket) and a multicast (multiple broadcasts). Let b(i, j) represent the total number of keys with value i on all processors with an index less than j. After the multiscan, P_j knows the b(i, j) values for 0 <= i < 2^r. Let t(i, j) be the total number of keys with value i on P_j. After the multicast, all processors obtain the b(i, p - 1) and t(i, p - 1) values needed to compute g(i, j). Thus,

    g(i, j) = \sum_{k=0}^{i-1} [b(k, p - 1) + t(k, p - 1)] + b(i, j).

Figure 3.7 presents an example of the g(i, j) computation. In the last phase, the global ranks of the keys are divided equally among the processors. The processor and offset to which a key is sent depend upon its global rank. In our implementation, each processor loops through its set of keys, determines the destination processor and offset of each key, and sends each key and its offset to the appropriate processor. Since the destination of a key depends on the value of the key, this phase requires irregular communication to redistribute the keys.

We consider two ways of implementing the multiscan and multicast. For the following discussion, we concentrate on the multiscan operation, since the multicast communication pattern is the same.
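For reference, the global-rank formula above translates directly into C. The sketch below assumes the multiscan and multicast have already filled in the b and t arrays; the arrays are indexed [bucket][processor].

    /* Compute the starting output position g[i][j] for the first key with
       value i held by processor j, given:
         b[i][j]:   number of keys with value i on processors 0..j-1, and
         b[i][p-1], t[i][p-1]: totals available after the multicast.
       m is the number of buckets (2^r), p the number of processors. */
    void global_ranks(int m, int p, int **b, int **t, int **g)
    {
        int i, j, below = 0;   /* keys in buckets 0..i-1, over all processors */
        for (i = 0; i < m; i++) {
            for (j = 0; j < p; j++)
                g[i][j] = below + b[i][j];
            below += b[i][p - 1] + t[i][p - 1];
        }
    }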
Figure 3.7: Global rank computation. The computation is illustrated with 4 processors and 4 buckets for the values 0-3. Each processor's t(i, j) value is shown inside each bucket. The number outside a bucket reflects the b(i, j) value after the multiscan. After the multicast, g(i, j) reflects the starting position in the output where the first key with value i on processor j belongs. For example, P0 will place its first key with value "0" at position 0, its "1" keys starting at position 7, etc. (Diagram omitted.)

One plausible method of implementing the multiscan is a tree-based approach. Performing the multiscan as a sequence of m = 2^r tree-based parallel prefix computations requires m lg p messages to be sent by P0, the processor that must send the most messages. An alternative and more efficient approach is to pipeline the bucket sizes to the next higher processor. In this case, each processor sends exactly m messages during the multiscan calculation, and making m large allows the overhead associated with filling the pipeline to become arbitrarily small.

Dusseau et al. also use a pipeline-based approach to implement the multiscan operation. However, their multiscan implementation does not run smoothly. The problem arises because P0 only sends data, whereas the other processors receive data, perform an addition, and send data. Since receiving data is usually given priority over sending data, P1 spends most of its time in the early stages receiving instead of sending data. Dusseau et al. correct this problem by delaying the sending rate of P0. Additionally, Dusseau et al. do not analyze the communication of radix sort, since it cannot be predicted at compile time. The BSP approach again avoids this situation by focusing on the global routing problem.

Analysis of Data. Experimental results for radix sort on the SGI Challenge and the Intel Paragon are shown in Figure 3.8. Radix sort provides the worst performance of the parallel algorithms implemented. When n/p = 256K, it is 6 times (3 times) slower than randomized sample sort on a 16-processor SGI Challenge (Intel Paragon). As with the other sorts, the BSP model improves the accuracy of its predictions as n/p increases. When n/p = 512K, the actual execution times are within 9.84% and 38.53% of those predicted by the BSP model for the SGI Challenge and the Intel Paragon, respectively.

Figure 3.8: Predicted and actual execution time per key of radix sort on an SGI Challenge and an Intel Paragon. Each plot represents the run time on a 2, 4, 8, or 16 processor system. (Plots omitted.)

Our experimental evidence indicates that parallel radix sort is perhaps the least appropriate for general-purpose parallel computing. There are several reasons for this. First, the amount of communication is relatively large.
In general, all of the keys can be sent to another processor in each pass. In our experiments, the best execution times occurred with four passes over the keys. Ignoring algorithmic overhead, this radix sort requires four h-relations of (approximately) size n/p, in contrast to the sample sorts, which have only one such relation. Additionally, the overhead associated with the construction of the global histogram is substantial in terms of synchronization. At least p - 1 barrier synchronizations are required to perform the multiscan, and many more are used if there is an attempt to pipeline communication. Similarly, the multicast component requires numerous barrier synchronizations if pipelining is used, as it typically will be for large r.

3.3 Summary

We have described BSP implementations of four sorting algorithms and analyzed their performance. LogP proponents argue that the application programmer should not be constrained by the superstep programming style of BSP. However, the BSP sorting implementations described here are virtually identical to Dusseau et al.'s LogP implementations of randomized sample sort, bitonic sort, and radix sort. (Dusseau et al. did not implement deterministic sample sort.) Moreover, LogP's flexible computational model creates additional burdens on the programmer that are avoided in the BSP model. The main difference between the two models lies in their approach to handling communication: under LogP, the programmer schedules communication, while the runtime system of BSP performs that task. Theoretical evidence supports the claim that LogP's flexible computational model has no speed advantage over the BSP approach for larger problem sizes [BHP96]. Furthermore, there is convincing evidence that compiler/runtime systems are capable of scheduling communication for efficient performance; High-Performance Fortran (HPF) [For93] is one such example. Consequently, we argue that the higher-level BSP approach is preferable.

Concerning the performance of our sorting implementations, our results suggest that BSP programs can efficiently execute a range of sorting applications on both shared-memory and message-passing parallel platforms. Of the sorts discussed here, randomized sample sort is the best overall performer, followed closely by deterministic sample sort. Both bitonic sort and radix sort (the worst performer) appear uncompetitive in comparison to the sample sorts, which indicates that they may not be suitable for general-purpose parallel sorting.

Unfortunately, the accuracy of the BSP model raises some concerns. Our results show that the BSP model does not accurately predict the execution times of the sorting applications. However, we believe that our findings do not reflect negatively on the predictive capabilities of the BSP model. A more reasonable reaction is to observe that BSP is quite useful for predicting performance trends across target architectures of interest. In addition, the BSP cost model can be used as an evaluation tool for finding the most suitable algorithm from a set of alternatives to execute on a parallel architecture. For example, the BSP cost model suggests that randomized sample sort is the best overall performer (of the sorting algorithms discussed here) on both the SGI Challenge and the Intel Paragon. Our experimental results corroborate this claim. In sum, the BSP model seeks to provide a simple programming approach that allows for portable, efficient, and predictable algorithmic performance.
Our experiments demonstrate how increased efficiency and predictability under BSP can often be achieved by increasing the problem size. This is also the case for many other important applications. Thus, the cost of portable parallel computing is that larger problem sizes are needed to achieve the desired level of efficiency and predictability.

CHAPTER 4

HBSPk: A Generalization of BSP

The k-Heterogeneous Bulk Synchronous Parallel (HBSPk) model is a generalization of the BSP model [Val90a] of parallel computation. HBSPk provides parameters that allow the user to tailor the model to the required system. As a result, HBSPk can guide the development of applications for traditional parallel systems, heterogeneous clusters, the Internet, and computational grids [FK98]. In HBSPk, each of these systems can be grouped into clusters based on their ability to communicate with each other. Although the model accommodates a wide range of architecture types, the algorithm designer does not have to manipulate an overwhelming number of parameters. More importantly, HBSPk allows the algorithm designer to think about the collection of heterogeneous computers as a single system.

4.1 Machine Representation

The HBSPk model refers to a class of machines with at most k levels of communication. Thus, HBSP0 machines, or single-processor systems, are the simplest class of machines. The next class of machines are HBSP1 computers, which contain at most one communication network. Examples of HBSP1 computers include single-processor systems (i.e., HBSP0 machines), traditional parallel machines, and heterogeneous workstation clusters. HBSP2 machines extend the HBSP1 class to handle heterogeneous collections of multiprocessor machines or clusters. Figure 4.1 shows an HBSP2 cluster consisting of three HBSP1 machines. In general, HBSPk systems include HBSPk-1 computers as well as machines composed of HBSPk-1 computers. Thus, the machine classes are related by HBSP0 ⊆ HBSP1 ⊆ ... ⊆ HBSPk.

Figure 4.1: An HBSP2 cluster. (Diagram omitted; it shows an SMP, an SGI workstation, and a LAN joined by a communications network.)

An HBSPk machine can be represented by a tree T = (V, E). Each node of T represents a heterogeneous machine, and the height of the tree is k. The root r of T is an HBSPk machine. Let d be the length of the path from the root r to a node x; the level of node x is then k - d. Thus, nodes at level i of T are HBSPi machines. Figure 4.2 shows the tree representation of the HBSP2 machine of Figure 4.1. The root node corresponds to an HBSP2 machine. The components of this machine (a symmetric multiprocessor, an SGI workstation, and a LAN) appear at level 1. Level 0 depicts the individual processors of the symmetric multiprocessor and the LAN.

The indexing scheme of an HBSPk machine is as follows. Machines at level i, 0 <= i <= k, are labeled M_{i,0}, M_{i,1}, ..., M_{i,m_i - 1}, where m_i represents the number of HBSPi machines. Consider machine M_{i,j} of an HBSPk computer, where 0 <= j < m_i. One possible interpretation of M_{i,j} is that it is a cluster with identity j on level i; the nodes of the cluster are the children of M_{i,j}.

Figure 4.2: Tree representation of the cluster shown in Figure 4.1. (Diagram omitted; the root M_{2,0} has children M_{1,0}, M_{1,1}, and M_{1,2} at level 1, with leaves M_{0,0} through M_{0,5} at level 0.)

In Figure 4.2, M_{1,0} is an HBSP1 cluster composed of the nodes M_{0,0}, M_{0,1}, M_{0,2}, and M_{0,3}. M_{2,0} provides an example of a cluster of clusters. The HBSPk model places no restriction on the amount of nesting within clusters.
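The tree representation maps naturally onto a small C data structure. The sketch below is ours, not part of the model's definition, and records only the shape of the tree; the cost parameters introduced in the next section would be attached to the same nodes.

    /* Tree representation of an HBSPk machine.  A level-0 node is a single
       processor; a node at level i > 0 is an HBSPi machine whose children
       are the HBSP(i-1) machines it contains. */
    struct hbsp_node {
        int level;                    /* i in the label M_{i,j}           */
        int index;                    /* j in the label M_{i,j}           */
        int nchildren;                /* m_{i,j}; 0 for a level-0 machine */
        struct hbsp_node **children;  /* the nodes of this cluster        */
    };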
Additionally, we may consider the machines at level i of T to be the coordinator nodes of the machines at level i - 1. In Figure 4.2, M_{1,0}, M_{1,1}, M_{1,2}, and M_{2,0} are examples of coordinator nodes. Coordinator nodes fulfill many roles. They can act as a representative for their cluster during inter-cluster communication. Or, to increase algorithmic performance, they may represent the fastest machine in their subtree (or cluster). Under this assumption, the root node is the fastest node of the entire HBSPk machine.

4.2 Cost Model

Using the definition of an HBSPk machine as a basis, we define the meaning and cost of an HBSPk computation. There are two ways to determine the cost of an HBSPk computation: one approach calculates the various costs directly at each level i, while the other finds them recursively. For expositional purposes, we choose the former approach.

An HBSPk computation consists of some combination of super_i-steps. During a super_i-step, each level i node performs asynchronously some combination of local computation, message transmissions to other level i machines, and message arrivals from its peers. A message sent in one super_i-step is guaranteed to be available to the destination machine at the beginning of the next super_i-step. Each super_i-step is followed by a global synchronization of all the level i computers.

Consider the class of HBSP0 machines. For these single-processor systems, computation proceeds through a series of super_0-steps (or steps); communication and synchronization with other processors is not applicable. Unlike the previous class, HBSP1 machines perform communication. HBSP1 computers proceed through a series of super_1-steps (or supersteps). During a superstep, each HBSP0 machine performs asynchronously some combination of local computation, message arrivals, and message transmissions. Thus, an HBSP1 computation resembles a BSP computation; the main difference is that an HBSP1 algorithm delegates more work to the faster processors. For HBSP2 machines, computation consists of super_1- and super_2-steps. Super_1-steps proceed as described previously. During a super_2-step, the coordinator node of each HBSP1 cluster performs local computation and/or communicates data with the other level 1 coordinator nodes. A barrier synchronization of the coordinators separates each super_2-step. Thus, an HBSP2 computation consists of both intra- and inter-cluster communication.
An HBSPk computer is characterized by the following parameters, which are summarized in Table 4.1:

- m_i, the number of HBSPi machines, labeled M_{i,0}, M_{i,1}, ..., M_{i,m_i - 1}, on level i, where 0 <= i <= k;

- m_{i,j}, the number of children of M_{i,j};

- g, a bandwidth indicator that reflects the speed with which the fastest machine can inject packets into the network;

- r_{i,j}, the speed relative to the fastest machine with which M_{i,j} can inject a packet into the network;

- L_{i,j}, the overhead to perform a barrier synchronization of the machines in the subtree of M_{i,j}; and

- c_{i,j}, the fraction of the problem size that M_{i,j} receives.

 Symbol    Meaning
 M_{i,j}   a machine's identity, where 0 <= i <= k, 0 <= j < m_i
 m_i       number of HBSPi machines on level i
 m_{i,j}   number of children of M_{i,j}
 g         speed with which the fastest machine can inject packets into the network
 r_{i,j}   speed relative to the fastest machine with which M_{i,j} injects packets into the network
 L_{i,j}   overhead to perform a barrier synchronization of the machines in the jth cluster of level i
 c_{i,j}   fraction of the problem size that M_{i,j} receives
 h         size of a heterogeneous h-relation
 h_{i,j}   largest number of packets sent or received by M_{i,j} in a super_i-step
 S_i       number of super_i-steps
 T_i(σ)    execution time of super_i-step σ

Table 4.1: Definitions of notations

We assume that the r_{i,j} value of the fastest machine is normalized to 1. If r_{i,j} = t, then M_{i,j} communicates t times slower than the fastest node. The c_{i,j} parameter adds a load-balancing feature to the model. Specifically, it attempts to provide M_{i,j} with a problem size that is proportional to its computational and communication abilities. Boulet et al. [BDR99] discuss methods for computing c_{i,j} on a network of heterogeneous workstations. In Section 5.3, we present our method of calculating c_{0,j} for an HBSP1 platform; we refer the reader to Boulet et al. for other strategies to compute c_{i,j}. When k >= 2, it is unclear what the value of c_{i,j} should represent. For example, a coordinator node's c_{i,j} value could be the sum of its children's values. Alternatively, a combination of communication and computation costs could factor into a machine's c_{i,j} value. The HBSPk model says nothing about how the c_{i,j} costs should be tabulated; instead, it assumes that such costs have been determined appropriately.

The parameters described above allow for cost analysis of HBSPk programs. First, consider the cost of a super_i-step σ. Let w_i represent the largest amount of local computation performed by an HBSPi machine, and let h_{i,j} be the largest number of messages sent or received by M_{i,j}, where 0 <= j < m_i. The size of the heterogeneous h-relation is h = max{r_{i,j} h_{i,j}}, with a routing cost of gh. Thus, the execution time of super_i-step σ is

    T_i(σ) = w_i + gh + L_{i,j}.     (4.1)

Suppose that S_i is the number of super_i-steps, where 1 <= i <= k. Intuitively, the execution time of an HBSPk algorithm is the sum over all super_i-steps, 1 <= i <= k. This leads to an overall cost of

    \sum_{σ=1}^{S_1} T_1(σ) + \sum_{σ=1}^{S_2} T_2(σ) + ... + \sum_{σ=1}^{S_k} T_k(σ).     (4.2)

As with BSP, the above cost model demonstrates which factors are important when designing HBSPk applications. To minimize execution time, the programmer must attempt to (i) balance the local computation of the HBSPi machines in each super_i-step, (ii) balance the communication between the machines, and (iii) minimize the number of super_i-steps. Balancing these objectives is a nontrivial task. Nevertheless, HBSPk provides guidance on how to design efficient heterogeneous programs.
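The cost of a single super_i-step is easy to evaluate mechanically. The sketch below computes Equation (4.1) for one superstep; summing the returned values over all supersteps at all levels gives Equation (4.2).

    #include <stddef.h>

    /* Cost of one super_i-step (Equation 4.1): T = w + g*h + L, where the
       heterogeneous h-relation is h = max over j of r[j]*hrel[j].  The
       r[j] are relative communication speeds (fastest machine = 1) and
       hrel[j] the packet counts sent or received by machine M_{i,j}. */
    double superstep_cost(double w, double g, double L,
                          const double *r, const double *hrel, size_t m)
    {
        size_t j;
        double h = 0.0;
        for (j = 0; j < m; j++)
            if (r[j] * hrel[j] > h)
                h = r[j] * hrel[j];
        return w + g * h + L;
    }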
4.3 HBSPk Collective Communication Algorithms

Collective communication plays an important role in the development of parallel programs [BBC94, BGP94, MR95]. It simplifies the programming task, facilitates the implementation of efficient communication schemes, and promotes portability. In the following subsections, we design six HBSPk collective communication operations: gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast.

The HBSPk model provides parameters that allow algorithm designers to exploit the heterogeneity of the underlying system. The model promotes our two-fold design strategy for HBSPk collective operations. First, faster machines should be involved in the computation more often than their slower counterparts. Collective operations use specific nodes to collect or distribute data to the other nodes in the system; for faster algorithmic performance, these nodes should be the fastest machines in the system. Secondly, faster machines should receive more data items than slower machines. This principle encourages the use of balanced workloads, where machines receive problem sizes relative to their communication and computational abilities. Partitioning the workload so that nodes receive an equal number of elements works quite well in homogeneous environments. In heterogeneous environments, however, this strategy leads to unbalanced workloads, since faster machines typically sit idle waiting for slower nodes to finish a computation.

In this chapter, we design and analyze collective communication algorithms for HBSP1 and HBSP2 platforms. Our HBSP1 algorithms are based on BSP communication operations; consequently, we discuss the BSP design of each collective operation before presenting its heterogeneous counterpart. In an HBSP1 environment, the number of workstations is m_{1,0} (or m_0). The single coordinator node, M_{1,0}, represents the fastest workstation among the HBSP0 machines; hence, r_{1,0} = 1. L_{1,0} is the cost of synchronizing the cluster of processors. The HBSP2 collective routines use the HBSP1 algorithms as a basis. Unlike HBSP1 platforms, HBSP2 machines contain a two-level communication network. Level 0 consists of m_0 individual workstations. Level 1 represents the m_1 (or m_{2,0}) coordinator nodes for the machines at level 0. Each coordinator, M_{1,j}, requires a cost of L_{1,j} to synchronize the nodes in its cluster. The root of the entire cluster is M_{2,0}, which is the fastest machine; r_{2,0} = 1. Communication can be quite expensive on the higher-level links, so our HBSP2 algorithms communicate minimally on them. We do not specify algorithms for higher-level machines (i.e., k >= 3); however, the approach given here generalizes to such systems.

Throughout the rest of this chapter, let x_{i,j} represent the number of items in M_{i,j}'s possession, where 0 <= i <= 2 and 0 <= j < m_i. Balanced workloads assume x_{i,j} = c_{i,j} n, where n is the total number of items of interest. For notational convenience, the indexes f and s are used to represent the identities of the fastest and slowest nodes, respectively.

4.3.1 Gather

In the gather operation, a single node collects a unique message from each of the other nodes. The BSP gather consists of all processors sending their data to a designated processor; typically, this duty is relegated to P0. For balanced workloads, the processors each send n/p elements to P0, where n denotes the total number of items P0 receives from all the processors. Since n/p < n, the BSP cost of the gather operation is gn + L. A BSPlib-style sketch of this pattern appears below.
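The fragment that follows is illustrative rather than the code used in our experiments: the tag carries the sender's identity, messages are unloaded in arrival order, and P0's own block is assumed to be in place already.

    #include <bsp.h>

    /* BSP gather: every processor ships its local block of doubles to P0
       in one superstep using bulk synchronous message passing. */
    void gather_to_p0(double *local, int nlocal, double *all)
    {
        int tag, status, packets, nbytes, got = 0;

        if (bsp_pid() != 0) {
            tag = bsp_pid();                    /* sender identity as tag  */
            bsp_send(0, &tag, local, nlocal * (int)sizeof(double));
        }
        bsp_sync();

        if (bsp_pid() == 0) {
            bsp_qsize(&packets, &nbytes);       /* messages queued at P0   */
            while (packets-- > 0) {
                bsp_get_tag(&status, &tag);     /* status = payload bytes  */
                bsp_move(all + got, status);    /* unload in arrival order */
                got += status / (int)sizeof(double);
            }
        }
    }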
HBSP1. We extend the above algorithm to accommodate 1-level heterogeneous environments. Instead of each machine sending its data to P0, it sends the data to its coordinator node, M_{1,0}. Suppose that the total number of items M_{1,0} receives is n. The size of the heterogeneous h-relation is max{r_{0,j} x_{0,j}, r_{1,0} n}. Assume that each processor takes approximately the same amount of time to send its data; hence, x_{0,j} = c_{0,j} n. Recall that c_{i,j} is inversely proportional to the speed of M_{i,j}. Consequently, r_{i,j} c_{i,j} < 1, and the gather costs gn + L_{1,0}.

The above cost of the gather operation is efficient, since the fastest processor performs most of the work. If r_{0,j} c_{0,j} > 1, M_{0,j} has a problem size that is too large, and its communication time will dominate the cost of the gather operation. Whenever possible, the fastest processor should handle the most data items. Our results, in fact, demonstrate the importance of balanced workloads: the increase in performance results from M_{1,0} receiving the items faster. The HBSPk model, like BSP, rewards programs with balanced design.

HBSP2. The HBSP2 gather algorithm proceeds as follows. First, each HBSP1 machine performs an HBSP1 gather. Afterwards, each of the level 1 nodes sends its data items to the root, M_{2,0}. Since the problem size is n, x_{2,0} = n. The cost of an HBSP2 gather operation is the sum of the super_1-step and super_2-step times. Since each HBSP1 machine performs a gather operation, the super_1-step cost is the largest time needed for an HBSP1 cluster to finish the operation. Once the level 1 coordinators have the n data items, they send the data to the root. This super_2-step requires g max{r_{1,j} x_{1,j}, r_{2,0} n} + L_{2,0}; assuming balanced workloads, the super_2-step cost is gn + L_{2,0}. Efficient algorithm execution in this environment implies that the problem size must be large enough to outweigh the cost of performing the extra level of communication and synchronization.

4.3.2 Scatter

The scatter operation is the opposite of the gather: a single node sends a unique message to every other node. Under BSP, P0 sends n/p elements to each of the processors. The cost of this operation is gn + L.

HBSP1. The extension of the above algorithm to heterogeneous processors is also similar. The fastest processor, M_{1,0}, is responsible for scattering the data to all of the processors. M_{1,0} distributes c_{0,j} n elements to M_{0,j}. The size of the heterogeneous h-relation is max{r_{0,j} c_{0,j} n, r_{1,0} n}. Assuming that the processors have balanced workloads, r_{0,j} c_{0,j} < 1, which results in a cost of gn + L_{1,0}.

HBSP2. Like the HBSP2 gather, the HBSP2 scatter consists of both super_1- and super_2-steps. In the super_2-step, the root process sends the required data to the level 1 coordinator nodes. This super_2-step requires a cost of g max{r_{1,j} x_{1,j}, r_{2,0} n} + L_{2,0}, which reduces to gn + L_{2,0}. Once the coordinator nodes receive the data from the root, an HBSP1 scatter operation is performed within each cluster. As with the gather operation, the additional level of synchronization is a concern; however, if the problem size is large enough, the effect of the barrier cost on the system can be minimized.

4.3.3 Reduction

Sometimes the gather operation can be combined with a specified arithmetic or logical operation. For example, the values could be gathered and then added together to produce a single value. Such an operation is called a single-value reduction.
In BSP, each processor locally reduces its n/p data items and sends the value to P0. Afterwards, P0 reduces the p elements to produce a single value. The cost of this operation in BSP is O(n/p) + p(1 + g) + L.

HBSP1. The HBSP1 algorithm proceeds as follows. Each processor locally reduces its x_{i,j} elements and sends the value to M_{1,0}. The size of the heterogeneous h-relation is max{r_{0,s} · 1, r_{1,0} m_{1,0}}. M_{1,0} reduces the m_{1,0} values to produce a single value. The total cost of the algorithm is O(x_{0,s}) + g m_{1,0} + L_{1,0}. The computational requirement of a reduction operation that produces a single value is m_{1,0}; if m_{1,0} is small, the benefit of using the fastest processor as the root node is small.

A point-wise reduction is performed on an array of values provided by each processor. Suppose that each processor sends the root node an array called inbuf of size n; point-wise reduction assumes that all arrays are of equal size. The first element of inbuf is at index 0. After the reduction, the root node applies the specified operation to the first element of each input buffer (op(inbuf[0])) and stores the result in outbuf[0]. Similarly, outbuf[1] contains op(inbuf[1]). Valid operations include maximum, minimum, and summation, to name a few. The HBSP1 point-wise reduction proceeds as follows. First, each processor sends its n data items to the fastest node. Afterwards, M_{1,0} performs the reduction on the m_{1,0} n elements. This step requires a heterogeneous h-relation of size max{r_{0,s} n, r_{1,0} m_{1,0} n}. Assuming that r_{0,s} <= r_{1,0} m_{1,0}, the cost of the super_1-step is O(m_{1,0} n) + g m_{1,0} n + L_{1,0}. Here, the benefits of a point-wise reduction are evident in the computational and communication workload given to the fastest processor.

HBSP2. The HBSP2 single-value reduction algorithm proceeds as follows. First, each level 1 coordinator node performs an HBSP1 reduction. The cost of this super_1-step is the largest time needed to perform an HBSP1 reduction. Afterwards, each of the coordinators sends its result to the root, M_{2,0}. The root reduces the m_{2,0} items to a single result. This super_2-step costs O(m_{2,0}) + g max{r_{1,s} · 1, r_{2,0} m_{2,0}} + L_{2,0}. In the HBSP2 algorithm, the computational requirement of a reduction operation that produces a single value is m_{2,0} for the root node; if m_{2,0} is small, the benefit of using the root machine to reduce the values is small.

4.3.4 Prefix Sums

Given a list of n numbers, y_0, y_1, ..., y_{n-1}, all partial sums (i.e., y_0, y_0 + y_1, y_0 + y_1 + y_2, ...) are computed. The prefix calculation can also be defined with associative operations other than addition, for example, subtraction, multiplication, maximum, minimum, and logical operations. Practical areas of application include processor allocation, data compaction, sorting, and polynomial evaluation.

A BSP prefix sums algorithm requires two supersteps. First, each processor computes its local prefix sums, which requires a computation time of O(n/p). Next, the processors send their totals to P0. Since P0 receives p items, the size of the h-relation is p. P0 computes the prefix sums of these p elements. Suppose the prefix sums are labeled s_0, s_1, ..., s_{p-1}. P0 sends s_j to P_{j+1}, where 0 <= j < p - 1; the communication time of this step is gp. Each processor computes the final result by adding the value received from P0 to its local prefix sums. Therefore, the cost of a BSP prefix sums computation is O(n/p) + 2gp + 2L. A BSPlib-style sketch of this scheme appears below.
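The sketch is again illustrative rather than authoritative: MAXP is an assumed compile-time bound on p, and tags are used only to carry sender identities.

    #include <bsp.h>

    #define MAXP 64   /* assumed upper bound on the number of processors */

    /* Two-superstep BSP prefix sums over x[0..nlocal-1] on each processor. */
    void prefix_sums(double *x, int nlocal)
    {
        int j, tag, status, p = bsp_nprocs(), pid = bsp_pid();
        double total, offset = 0.0, sums[MAXP];

        for (j = 1; j < nlocal; j++)              /* superstep 1: local scan */
            x[j] += x[j - 1];
        total = x[nlocal - 1];
        tag = pid;                                /* every total goes to P0  */
        bsp_send(0, &tag, &total, (int)sizeof total);
        bsp_sync();

        if (pid == 0) {                           /* P0 scans the p totals   */
            int packets, nbytes;
            double run = 0.0;
            bsp_qsize(&packets, &nbytes);
            while (packets-- > 0) {
                bsp_get_tag(&status, &tag);       /* tag = sender identity   */
                bsp_move(&sums[tag], status);
            }
            for (j = 0; j < p - 1; j++) {         /* send s_j to P_{j+1}     */
                run += sums[j];
                tag = 0;
                bsp_send(j + 1, &tag, &run, (int)sizeof run);
            }
        }
        bsp_sync();

        if (pid != 0) {                           /* superstep 2: add offset */
            bsp_get_tag(&status, &tag);
            bsp_move(&offset, status);
            for (j = 0; j < nlocal; j++)
                x[j] += offset;
        }
    }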
HBSP2. The HBSP2 single-value reduction algorithm proceeds as follows. First, each level 1 coordinator node performs an HBSP1 reduction. The cost of this super_1-step is the largest time needed to perform an HBSP1 reduction. Afterwards, each of the coordinators sends its result to the root, M_{2,0}. The root reduces the m_{2,0} items to a single result. This super_2-step costs O(m_{2,0}) + g max{r_{1,s} * 1, r_{2,0} m_{2,0}} + L_{2,0}. In the HBSP2 algorithm, the computational requirement of a reduction operation that produces a single value is m_{2,0} for the root node. If m_{2,0} is small, the benefit of using the root machine to reduce the values is small.

4.3.4 Prefix Sums

Given a list of n numbers, y_0, y_1, ..., y_{n-1}, all partial summations (i.e., y_0, y_0 + y_1, y_0 + y_1 + y_2, ...) are computed. The prefix calculation can also be defined with associative operations other than addition, for example, multiplication, maximum, minimum, and logical operations. Practical areas of application include processor allocation, data compaction, sorting, and polynomial evaluation.

A BSP prefix sums algorithm requires two supersteps. First, each processor computes its local prefix sums, which requires a computation time of O(n/p). Next, each processor sends its total sum to P_0. Since P_0 receives p items, the size of the h-relation is p. P_0 computes the prefix sums of these p elements. Suppose the prefix sums are labeled s_0, s_1, ..., s_{p-1}. P_0 sends s_j to P_{j+1}, where 0 <= j < p - 1. The communication time of this step is gp. Each processor computes the final result by adding the value received from P_0 to its local prefix sums. Therefore, the cost of a BSP prefix sums computation is O(n/p) + 2gp + 2L.

HBSP1. Similarly to the BSP algorithm, the HBSP1 prefix sums operation begins with M_{0,j} computing its local prefix sums. This step requires a computation time of O(c_{0,j} n). Afterwards, each processor sends its total sum to M_{1,0}. M_{1,0} computes the prefix sums of the m_{1,0} elements received in the previous step and then sends each processor the appropriate value it needs to add to its local prefix sums. This requires a heterogeneous h-relation of size max{r_{0,s} * 1, r_{1,0} m_{1,0}}. Adding this value to M_{0,j}'s local prefix sums involves c_{0,j} n amount of work. The size of both heterogeneous h-relations in the prefix sums algorithm is max{r_{0,s} * 1, r_{1,0} m_{1,0}}. Again, it is unlikely that r_{0,s} > m_{1,0}. Therefore, the total time of prefix sums on an HBSP1 machine is O(c_{0,j} n) + 2(g m_{1,0} + L_{1,0}), where the amount of local computation is relative to a processor's computational speed.

HBSP2. Figure 4.3 presents an example of an HBSP2 prefix sums computation. The HBSPk prefix sums algorithm begins with M_{0,j} computing its local prefix sums and sending the total sum to the coordinator of its cluster. Each level 1 coordinator node computes the prefix sums of its children's values and sends the total sum to its coordinator, M_{2,0}. This node computes the prefix sums of its m_{2,0} elements. Suppose that its prefix sums are labeled s_0, s_1, ..., s_{m_{2,0}-1}. Prefix sums values are distributed to M_{2,0}'s children as follows. The first child, M_{1,0}, gets the value 0; the second child, M_{1,1}, receives s_0; M_{1,2} obtains s_1; and so on. Each M_{1,j} adds this element to its prefix sums. Similarly to the root node, M_{1,j} sends an appropriate value for its children to add to their prefix sums.

[Figure 4.3: An HBSP2 prefix sums computation. Execution starts with the leaf nodes (or HBSP0 machines) in the top diagram. Here, the nodes send the totals of their prefix sums computations to the coordinators of their clusters. The upward traversal of the computation continues until the root node is reached. The bottom diagram shows the downward execution of the computation. The leaf nodes hold the final result.]

Figure 4.4 presents the algorithm:

1. M_{0,j} computes the prefix sums of its c_{0,j} n elements, where 0 <= j < m_0.
2. M_{0,j} sends the total sum of its elements to the coordinator of its cluster.
3. M_{1,j} computes the prefix sums of the elements received in Step 2, where 0 <= j < m_1.
4. M_{1,j} sends the total sum of its elements to M_{2,0}.
5. M_{2,0} computes the prefix sums of the elements received in Step 4.
6. M_{2,0} sends its jth prefix sum to M_{1,j+1}, where 0 <= j < m_1.
7. M_{1,j} adds the value from Step 6 to its prefix sums.
8. M_{1,j} sends the jth prefix sum to its (j + 1)th child.
9. M_{0,j} adds the value from Step 8 to its prefix sums, where 0 <= j < m_0.

Figure 4.4: HBSP2 prefix sums.

In Steps 1 and 9, each HBSP0 machine computes its local prefix sums. This requires M_{0,j} to perform c_{0,j} n amount of work. Next, M_{0,j} sends the total sum of its prefix sums to its coordinator, M_{1,j}. Since M_{0,j} sends 1 element and M_{1,j} receives m_{1,j} elements, Step 2 requires a communication time of g max{r_{0,j} * 1, r_{1,j} m_{1,j}}. In Steps 3 and 7, each level 1 coordinator computes the prefix sums of the elements it received from its children. This requires m_{1,j} amount of computation. In Step 4, M_{1,j} sends its sum to the root, M_{2,0}. The size of the heterogeneous h-relation is max{r_{1,j} * 1, r_{2,0} m_{2,0}}. Step 5 requires a computation time of m_{2,0}. Steps 6 and 8 require heterogeneous h-relations of size max{r_{1,j} * 1, r_{2,0} m_{2,0}} and max{r_{0,j} * 1, r_{1,j} m_{1,j}}, respectively. Let L_i = max{L_{i,j}}, where 1 <= i <= 2 and 0 <= j < m_i. The cost of the HBSP2 prefix sums algorithm is

O(c_{0,j} n) + 2g(max{r_{0,s}, r_{1,s} m_{1,s}} + max{r_{1,s}, m_{2,0}}) + 2(L_1 + L_2).   (4.3)

Unlike the gather and broadcast operations, the overhead of prefix sums grows with the underlying heterogeneous architecture rather than with the problem size. By increasing the problem size, one can overcome the overheads of the underlying architecture.
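The computation-side of this algorithm is easy to render in code. The sketch below shows the two local pieces shared by the BSP and HBSP variants, with the communication steps elided; the function names are ours.

    /* Local scan: y[i] = y[0] + ... + y[i] after the call. */
    void local_prefix_sums(int *y, int len)
    {
        for (int i = 1; i < len; i++)
            y[i] += y[i - 1];
    }

    /* Fold in the global offset received from the coordinator, i.e., the
       prefix sum of the totals of all lower-ranked processors. */
    void add_offset(int *y, int len, int offset)
    {
        for (int i = 0; i < len; i++)
            y[i] += offset;
    }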
4.3.5 One-to-All Broadcast

The two-phase broadcast is the algorithm of choice for the BSP model. The algorithm's strategy is to spread the broadcast items as equally as possible among the p processors before replicating each item. In the first phase, P_0 sends n/p elements to each of the processors. The second phase consists of each processor sending the n/p elements it received in the previous stage to all of the processors. Therefore, the cost of the two-phase broadcast is 2(gn + L).

HBSP1. The HBSP1 broadcast algorithm uses the above BSP algorithm as its basis. The computation starts at the root (or coordinator) node, and its children execute similarly to the two-phase BSP algorithm. M_{1,0} sends n/m_{1,0} elements to each of its children, M_{0,j}. This phase requires a heterogeneous h-relation of size max{r_{1,0} n, r_{0,s} n/m_{1,0}}. In a typical environment, it is reasonable to assume that m_0 ranges from the tens to the hundreds. As a result, it is quite unlikely that a machine would be m_{1,0} times slower than the fastest machine; if that is the case, it may be more appropriate not to include that machine in the computation. Consequently, the communication time of the first phase reduces to gn. The second phase consists of each processor receiving n elements. (Strictly speaking, each processor receives n - n/m_{1,0} elements; we use n to simplify the notation.) This results in a communication time of g r_{0,s} n. Thus, the complexity of a two-phase broadcast on an HBSP1 machine is gn(1 + r_{0,s}) + 2L_{1,0}. As a point of comparison, the one-phase broadcast (M_{1,0} sends n items to each processor) costs g n m_{1,0} + L_{1,0}, assuming r_{0,s} < m_{1,0}. Clearly, the two-phase approach is the better overall performer.

An interesting conclusion concerning the broadcast operation is that it effectively cannot exploit heterogeneity. Since the slowest processor must receive n items, its cost will dictate the complexity of the algorithm. Partitioning the problem so that M_{0,j} receives c_j n elements during the first phase of the algorithm is ineffective. Although wall-clock performance may improve, theoretically, the resulting speedup is negligible.
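To see when the second phase pays for the extra synchronization, one can compare the two costs directly. This derivation is ours, but it follows immediately from the formulas above:

\[
gn(1 + r_{0,s}) + 2L_{1,0} \;<\; g\,n\,m_{1,0} + L_{1,0}
\quad\Longleftrightarrow\quad
n \;>\; \frac{L_{1,0}}{g\,(m_{1,0} - 1 - r_{0,s})}.
\]

So whenever m_{1,0} > 1 + r_{0,s}, the two-phase broadcast wins once n is large enough to amortize one additional barrier.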
HBSP2. The two-phase approach is the algorithm of choice for HBSP1 machines. Next, we consider broadcasting in an HBSP2 computer. Given that communication is likely to be more expensive in such an environment (i.e., higher-latency links and increased synchronization costs), we investigate whether the two-phase approach is also applicable to HBSP2 machines. The algorithm begins with the root node distributing n items to the level 1 coordinator nodes. M_{2,0} may broadcast the data to its children using either a one-phase or a two-phase approach. Afterwards, each level 1 coordinator node sends the n items to its children using the HBSP1 broadcast algorithm. The total cost of the algorithm is the sum of the super_1- and super_2-steps. Since both approaches utilize the HBSP1 broadcast, we focus our discussion on the behavior of the super_2-steps.

In the one-phase approach, the root node sends n elements to the level 1 machines. The cost of the super_2-step is g max{r_{1,s} n, r_{2,0} n m_{2,0}} + L_{2,0}. Suppose that r_{1,s} > m_{2,0}. The super_2-step cost is then g r_{1,s} n + L_{2,0}; otherwise, it is g n m_{2,0} + L_{2,0}.

Unlike the above algorithm, the two-phase approach requires two super_2-steps. Initially, the root node sends n/m_{2,0} elements to the level 1 coordinators. Each coordinator then broadcasts its n/m_{2,0} elements to its peers. The first super_2-step requires a heterogeneous h-relation of size max{r_{1,s} n/m_{2,0}, r_{2,0} n}. The other super_2-step costs g r_{1,s} n + L_{2,0}. Suppose that r_{1,s} > m_{2,0}. The cost of the super_2-steps is g r_{1,s} n (1/m_{2,0} + 1) + 2L_{2,0}; otherwise, the cost is gn(r_{1,s} + r_{2,0}) + 2L_{2,0}.

4.3.6 All-to-All Broadcast

A generalization of the one-to-all broadcast is the all-to-all broadcast, where all nodes simultaneously initiate a broadcast. A node sends the same data to every other node, but different nodes may broadcast different messages. A straightforward BSP algorithm for all-to-all communication is a single-stage algorithm in which each processor sends its data to the other processors. Suppose that each processor, P_j, sends n/p elements to each of the other processors. This algorithm results in a cost of gn + L, which is the same cost as a one-to-all broadcast. Although this algorithm is susceptible to node contention, it demonstrates the issues that are involved when performing an all-to-all broadcast in a heterogeneous environment.

HBSP1. One approach to designing an HBSP1 all-to-all broadcast is to use the above BSP algorithm as a basis. This simultaneous broadcast algorithm results in the same cost as the one-phase broadcast algorithm, g n m_{1,0} + L_{1,0}. Unfortunately, this algorithm is not able to exploit the heterogeneity of the underlying system.

Another approach for all-to-all communication is the intermediate destination algorithm. Here, each processor sends its message to an intermediate node, which is responsible for broadcasting the data to all processors. Clearly, the intermediate node should be the fastest processor in the system. The algorithm begins with M_{0,j} sending its data items to M_{1,0}. This requires an h-relation of size max{r_{0,s} n/m_{1,0}, r_{1,0} n}. Again, we assume that r_{0,s} < m_{1,0}. The root node collects all of the data from its children and broadcasts it to all of the nodes. The cost of broadcasting the data with a two-phase algorithm is gn(1 + r_{0,s}) + 2L_{1,0}. Overall, this approach requires a cost of gn(2 + r_{0,s}) + 3L_{1,0}.

Unfortunately, as in the one-to-all broadcast, the fundamental difficulty of the all-to-all broadcast is that each node must possess the same number of items. The slowest node will always be a bottleneck, since it must receive all n data items. As a result, it is difficult to partition the problem in such a way as to create balanced workloads among the heterogeneous machines.

HBSP2. Next, we consider performing an all-to-all broadcast on an HBSP2 machine. First, each HBSP0 node sends its data to the coordinator of its cluster. The level 1 coordinators forward the collected data to the root node, M_{2,0}. Once the root receives all n items, it initiates an HBSP2 one-to-all broadcast. Again, the performance of this algorithm is limited since the slowest machine must receive the n items.
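For reference, the intermediate destination cost quoted above is simply the sum of its supersteps; a one-line derivation, with no assumptions beyond the formulas already given:

\[
\underbrace{gn + L_{1,0}}_{\text{collect at } M_{1,0}}
\;+\;
\underbrace{gn(1 + r_{0,s}) + 2L_{1,0}}_{\text{two-phase broadcast}}
\;=\; gn(2 + r_{0,s}) + 3L_{1,0}.
\]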
4.3.7 Summary

The utility of the HBSPk model is demonstrated through the design and analysis of gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast algorithms. Our results indicate that the HBSPk model encourages balanced workloads among the machines, where applicable. For example, a close examination of the broadcast operations demonstrates that it is impossible to avoid unbalanced workloads, since the slowest processor must receive n items. Besides analyzing execution time, the HBSPk model can also be used to determine the penalty associated with using a particular heterogeneous environment. This is certainly true for the prefix sums algorithm, where overhead costs are a result of the underlying architecture and not the problem size.

CHAPTER 5
HBSP1 Collective Communication Performance

This chapter focuses on experimentally validating the performance of the collective communication algorithms presented in Chapter 4. Specifically, we study the effectiveness of the collective routines on a non-dedicated, heterogeneous cluster of workstations. Each of the routines was designed to utilize fast processors and balanced workloads. Theoretical analysis of the algorithms showed that applying these principles leads to good performance on heterogeneous platforms. We design experiments to test whether the predictions made by the model are relevant for HBSP1 platforms.

Additional research has studied the performance of collective algorithms for heterogeneous workstation clusters. The ECO package [LB96], built on top of PVM, automatically analyzes characteristics of heterogeneous networks to develop optimized communication patterns. Bhat, Raghavendra, and Prasanna [BRP99] extend the FNF algorithm [BMP98] and propose several new heuristics for collective operations. Their heuristics consider the effect that communication links with different latencies have on a system. Banikazemi et al. [BSP99] present a model for point-to-point communications in heterogeneous networks of workstations and use it to study the effect of heterogeneity on the performance of collective operations.

5.1 The HBSP Programming Library

The HBSP1 collective communication algorithms are implemented using the HBSP Programming Library (HBSPlib). Table 5.1 lists the functions that constitute the HBSPlib interface. The design of HBSPlib incorporates many of the functions contained in BSPlib [HMS98]. HBSPlib is written on top of PVM [Sun90], a software package that allows a heterogeneous network of parallel and serial computers to appear as a single, concurrent computational resource. The computers compose a virtual machine and communicate by sending messages to each other. We use PVM's pvm_send() function for asynchronous communication to send messages directly between processors. To receive a message, we take advantage of the PVM function pvm_recv().

Thanks to PVM, HBSPlib's implementation of message passing among heterogeneous processors is straightforward. More problematic is the implementation of global synchronization. PVM does provide a function, pvm_barrier(), that implements barrier synchronization. Unfortunately, it is unclear whether a successful return from pvm_barrier() implies that all messages have been cleared from the communication network. As a result, our implementation of global synchronization is somewhat complex, since we need to guarantee that all messages have arrived at their destinations. Therefore, extra packets are used for synchronization purposes. PVM guarantees that message order is preserved, so when a processor calls hbsp_sync(), it sends a special synchronization packet to every other processor. Essentially, this packet tells each processor that the sender has no more messages to send. Next, the processor begins handling the messages that were sent to it. All messages are accounted for once it has processed p - 1 synchronization packets. At that point, the processor calls pvm_barrier().
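A minimal sketch of this strategy appears below. It assumes the standard PVM 3 calls (pvm_initsend, pvm_send, pvm_recv, pvm_bufinfo, pvm_barrier); the tag value and the handling of non-sync messages are hypothetical, and the dissertation's actual implementation may differ.

    #include <pvm3.h>

    #define SYNC_TAG 999  /* hypothetical tag reserved for sync packets */

    void hbsp_sync_sketch(const int *tids, int p, int me, char *group)
    {
        int seen = 0;

        /* Announce to every other processor that this superstep's sends
           are complete. */
        for (int j = 0; j < p; j++) {
            if (j == me) continue;
            pvm_initsend(PvmDataDefault);
            pvm_send(tids[j], SYNC_TAG);
        }

        /* Drain incoming traffic.  Because PVM preserves point-to-point
           message order, a sync packet from processor j certifies that all
           of j's data packets for this superstep have already arrived. */
        while (seen < p - 1) {
            int bufid = pvm_recv(-1, -1);        /* any source, any tag */
            int nbytes, tag, src;
            pvm_bufinfo(bufid, &nbytes, &tag, &src);
            if (tag == SYNC_TAG)
                seen++;
            /* else: enqueue the data message for later hbsp_move() calls */
        }

        pvm_barrier(group, p);  /* now safe: the network is known to be clear */
    }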
 Function             Semantics
 hbsp_begin           Starts the program with the number of processors requested.
 hbsp_end             Called by all processors at the end of the program.
 hbsp_abort           One process halts the entire HBSP computation.
 hbsp_pid             Returns the processor id, in the range of 0 to one less than the number of processors.
 hbsp_time            Returns the time (in seconds) since hbsp_begin was called. The timers on the processors are not synchronized.
 hbsp_nprocs          Returns the number of processors.
 hbsp_sync            The barrier synchronization function call. After the call, all outstanding requests are satisfied.
 hbsp_send            Sends a message to a designated processor.
 hbsp_get_tag         Returns the tag of the first message in the system queue.
 hbsp_qsize           Returns the number of messages in the system queue.
 hbsp_move            Retrieves the first message from the processor's receive buffer.
 hbsp_get_rank        Returns the identity of the processor with the requested rank.
 hbsp_get_speed       Returns the speed of the processor of interest.
 hbsp_cluster_speed   Returns the total speed of the heterogeneous cluster.

Table 5.1: The functions that constitute the HBSPlib interface.

HBSPlib incorporates functions that allow the programmer to take advantage of the heterogeneity of the underlying system. Under HBSPk, faster machines should perform the most work. The primitive hbsp_get_rank(1) returns the identity of the fastest processor; hbsp_get_rank(p) returns the slowest machine's identity, where p is the number of processors. HBSPlib also includes functions to help the programmer distribute the workload based on a machine's ability. The HBSPlib primitive hbsp_get_speed(j) provides the speed of processor j, and hbsp_cluster_speed returns the speed of the entire cluster. Combined, these two functions allow the program to find the value of processor j's c_j parameter. Details related to this calculation are provided in Section 5.3.
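For instance, the c_j calculation could be packaged as below. This is a sketch: Table 5.1 gives only the primitives' semantics, so their return types are assumptions.

    /* c_j = (speed of processor j) / (total cluster speed), so that
       sum_j c_j = 1 and processor j's share of the n data items is c_j * n. */
    extern double hbsp_get_speed(int j);
    extern double hbsp_cluster_speed(void);

    double workload_fraction(int j)
    {
        return hbsp_get_speed(j) / hbsp_cluster_speed();
    }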
5.2 The HBSP1 Model

We find it useful to simplify the notation of the HBSPk model, as described in Chapter 4.2, for this environment. The number of workstations is m_0, or p. The single coordinator node, M_{1,0} or P_f, represents the fastest processor among the HBSP0 processors; P_s refers to the slowest node. To identify the individual processors on level 0, we use the notation P_j to refer to processor M_{0,j}. Let r_f = 1 and r_s = max{r_j}, where 0 <= j < p. Since an HBSP1 machine consists of a single cluster of processors to synchronize, L = L_{1,0}.

5.3 Experimental Setup

Our experimental testbed consisted of a non-dedicated, heterogeneous cluster of SUN and SGI workstations at the University of Central Florida. Table 5.2 lists the specifications of each machine. Each node is connected by a 100 Mbit/s Ethernet connection. Our experiments evaluate the impact of processor speed and workload distribution on the overall performance of an algorithm. In this section, we discuss our method for estimating the costs of the HBSP1 parameters on this platform.

 Host       CPU type         CPU speed (MHz)   Memory (MB)   Data cache (KB)
 aditi*     UltraSPARC II    360               256           16
 chromus    microSPARC II    85                64            8
 dcn_sgi1   MIPS R5000       180               128           32
 dcn_sgi3   MIPS R5000       180               128           32
 gradsun1   TurboSPARC       170               64            16
 gradsun3   TurboSPARC       170               64            16
 gromit     UltraSPARC IIi   333               128           16
 sgi1       MIPS R5000       180               96            32
 sgi3       MIPS R5000       180               96            32
 sgi7       MIPS R5000       200               64            32

Table 5.2: Specification of the nodes in our heterogeneous cluster. (* A 2-processor system; each number is for a single CPU.)

The ranking of the processors is determined by the BYTEmark benchmark [BYT95], which consists of the 10 tests briefly described below.

 Numeric sort. An integer-sorting benchmark.
 String sort. A string-sorting benchmark.
 Bitfield. A bit manipulation package.
 Emulated floating-point. A small software floating-point package.
 Fourier coefficients. A numerical analysis benchmark for calculating series approximations of waveforms.
 Assignment algorithm. A task allocation algorithm.
 Huffman compression. A well-known text and graphics compression algorithm.
 IDEA encryption. A block cipher encryption algorithm.
 Neural net. A back-propagation network simulator.
 LU decomposition. A robust algorithm for solving linear equations.

The BYTEmark benchmark reports both raw and indexed scores for each test. For example, the numeric sort test reports as its raw score the number of arrays it was able to sort per second. The indexed score is the raw score of the system divided by the raw score obtained on the baseline machine, a 90 MHz Pentium XPS/90 with 16 MB of RAM. The indexed score attempts to normalize the raw scores: if a machine has an index score of 2.0, it performed that test twice as fast as a 90 MHz Pentium computer. After running all of the tests, BYTEmark produces two overall figures, an Integer index and a Floating-point index. The Integer index is the geometric mean of the tests that involve only integer processing (Numeric sort, String sort, Bitfield, Emulated floating-point, Assignment algorithm, Huffman compression, and IDEA encryption); the Floating-point index is the geometric mean of the remaining tests. Thus, one can use these results to get a general feel for the performance of the machine in question as compared to a 90 MHz Pentium.

Table 5.3 presents the Integer and Floating-point index scores for each machine in the heterogeneous cluster.

 Machine    Integer index   Floating-point index
 aditi      4.45            3.77
 chromus    0.75            0.59
 dcn_sgi1   2.80            3.73
 dcn_sgi3   2.79            3.67
 gradsun1   1.80            1.41
 gradsun3   1.81            1.42
 gromit     4.89            3.33
 sgi1       2.81            3.60
 sgi3       2.77            3.30
 sgi7       3.13            4.11

Table 5.3: BYTEmark benchmark scores.

Since we consider integer data only, the Integer index scores were used to rank the processors. According to the results, chromus is the slowest node and gromit is the fastest machine in the cluster. This result is surprising, considering that aditi appears faster on paper. Interestingly, aditi narrowly edges out gromit in every test except string sort, where gromit outperforms aditi with a score of 7.63 to 2.40. BYTEmark uses only a single execution thread; consequently, it cannot take advantage of aditi's additional processor. This does not present a problem for our experiments, since our HBSPlib implementation does not use threads. We ran our experiments with both aditi and gromit as the fastest processor, and there was no major difference in the execution times. Therefore, we consider gromit to be the fastest processor in the cluster.

To ensure consistent results, we apply the same processor ordering for each experiment. Table 5.4 shows the ordering. When p = 2, the experiments utilize gromit and chromus. The speed of this configuration is 5.64, which is the sum of each machine's Integer index score.
Each machine's c_j value is based on its Integer index score and the cluster speed. In general, the sum over all processors satisfies c_0 + c_1 + ... + c_{p-1} = 1. When p = 2, gromit's c_j value is 4.89/5.64 (or 0.867), and the c_j value of chromus is 0.133. Therefore, gromit receives 86.7% of the data elements and chromus acquires the remaining 13.3%. When p = 4, the cluster speed is 12.89. The workstations that comprise the cluster are gromit, chromus, aditi, and dcn_sgi1, which receive 37.9%, 5.8%, 34.5%, and 21.7% of the input, respectively.

Table 5.4 also presents the synchronization costs of the clusters comprised of 2, 4, 6, 8, and 10 workstations.

 p    Machines added        Speed   L (microseconds)
 2    gromit, chromus       5.64    9,000
 4    aditi, dcn_sgi1       12.89   15,000
 6    dcn_sgi3, gradsun1    17.48   23,000
 8    gradsun3, sgi1        22.10   30,000
 10   sgi3, sgi7            28.00   37,000

Table 5.4: Cluster speed and synchronization costs.

For example, synchronizing two processors (i.e., gromit and chromus) requires 9,000 microseconds. The value of L corresponds to the time for an empty superstep (i.e., no computation or communication). When p = 4, 15,000 microseconds are needed to synchronize the processors. Compared with the L values of the Intel Paragon and SGI Challenge presented in Table 3.1, the synchronization costs for the heterogeneous cluster are quite high. Several factors contribute to this behavior. Since the cluster is non-dedicated, many other nodes share the network link, which effectively degrades communication performance. Secondly, our implementation of barrier synchronization is not necessarily efficient. Despite the high L values, our collective algorithms outperformed their PVM counterparts. Additional work will focus on the development of a more efficient barrier synchronization primitive.

Table 5.5 shows the r_j values achieved on our heterogeneous cluster. To obtain these values, we measure the time needed for each machine to inject a sufficiently large packet into the network. gromit performed the best, with a score of 0.196 microseconds/byte; processor j's r_j value is relative to this score.

 Machine    r_j
 aditi      1.03
 chromus    4.08
 dcn_sgi1   2.12
 dcn_sgi3   1.95
 gradsun1   2.00
 gradsun3   2.46
 gromit     1.00
 sgi1       1.68
 sgi3       1.20
 sgi7       1.16

Table 5.5: r_j values.

5.4 Application Performance

The input data for each experiment consists of 100 KBytes to 1000 KBytes of uniformly distributed integers. The problem size, n, refers to the largest number of integers possessed by the root. Experimental results are given in terms of an improvement factor. Let T_A and T_B represent the execution times of algorithm A and algorithm B, respectively. The improvement factor of using algorithm B over algorithm A is T_A/T_B.
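For concreteness, take the gather timings reported in Table A.1: at p = 4 and n = 100 KB, T_s = 0.092 s and T_f = 0.042 s, so

\[
\frac{T_s}{T_f} = \frac{0.092}{0.042} \approx 2.2,
\]

i.e., moving the root to the fastest processor alone makes the gather roughly 2.2 times faster.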
The HBSPk model encourages the use of fast processors and balanced workloads. According to the model, applications that embody both of these principles will exhibit good performance. We designed two types of experiments to validate the predictions of the model. The first experiment tests whether processor speed has an impact on algorithmic performance. Let T_s represent the execution time of a collective routine assuming the root node is the slowest processor, P_s; T_f denotes the algorithmic cost of using P_f as the root. For these experiments, each processor has an equal number of data items, since our objective is to monitor the performance of slow versus fast root nodes. Hence, c_j = 1/p. The results demonstrate that using the fastest node as the root often results in significant performance improvement. Our second experiment studies the benefit of using the fastest processor as the root together with balanced workloads. Let T_u be the execution time when the workload is unbalanced; note that T_u = T_f, since each processor j's c_j value is 1/p. T_b denotes the execution time when the workload is balanced. Here, c_j is computed as described in the previous section. In most cases, the results demonstrate that balanced workloads improve the performance of the algorithm.

We also investigate the accuracy of the HBSP1 cost function in predicting execution times. Similarly to BSP, we consider HBSPk to model only communication and synchronization [GLR99]; I/O and local computation are not modeled. As a result, none of our experiments include I/O. Furthermore, local computation for some of our collective routines (i.e., single-value reduction, point-wise reduction, and prefix sums) was measured directly. Our results show that the model is able to predict performance trends, but not specific execution times. The inability of HBSPk to predict specific execution times does not reflect negatively on the model: the accuracy of the cost function depends on the choices made in the implementation of the HBSPlib library, so one source of inaccurate predictions may be shortcomings of the library implementation.

The remainder of this section provides experimental results for each of the collective communication algorithms. Each data point is the average of 10 runs, and the experimental data is given in Appendix A. For each of the experiments, the logic of the algorithms is not changed. Instead, the modifications occur in either root node selection or problem size distribution. In both cases, the performance increase is substantial.

Gather. Figure 5.1 (a) shows the improvement that results if the root node is P_f. As the number of processors increases, so does performance. The improvement factor is steady across all problem sizes; performance reaches its maximum at n = 500 KB. Unfortunately, there is virtually no benefit to distributing the workload based on a processor's computational abilities, except at p = 2. Figure 5.1 (b) displays the results. The problem lies with the estimation of c_j for aditi. Further investigation uncovers that aditi has too many elements to send to the root node, gromit; aditi's workload does not match its abilities. As a result, all processors must wait for aditi to finish sending its items to the root node.

For both experiments, the results at p = 2 are interesting. First, Figure 5.1 (a) shows that it is better for the root node to be the slowest workstation. This seems counterintuitive. In our implementation of gather (as well as the other collective operations), a processor does not send data to itself. When P_s is the root, P_f sends n/p items to it. Similarly, if the fastest processor is the root, P_s sends n/p elements to P_f. T_s < T_f implies that it is more beneficial to have the slow root waiting on data from P_f than to have P_f waiting on data from P_s. It is clear that the root node should be P_f as the number of processors increases: unlike the situation at p = 2, P_f then does not sit idle waiting on data items from P_s. Instead, it handles the messages of the other processors while waiting on the slowest processor's data. Secondly, at p = 2, balanced workloads contribute to increased performance. T_u is the execution time of P_s sending n/p data elements to the fastest processor; T_b is the cost of P_s sending c_s n integers to P_f, where c_s is calculated as described in Section 5.3. Note that c_s n < n/p. In this setting, balanced workloads make a difference (i.e., T_b < T_u), since P_f receives a smaller number of elements from P_s than in the unbalanced case.
Figure 5.2 shows the predicted performance for the gather operation. Although the model under-predicts the improvement factor, it does characterize the performance trends of the algorithm.

[Figure 5.1: Gather actual performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

[Figure 5.2: Gather predicted performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

Scatter. Figure 5.3 (a) plots the increase in performance if the root node is the fastest processor. The improvement factor is steady as the problem size increases. The best improvement occurs when p = 6 and n = 500 KB. When p = 2, T_s/T_f < 1. This is similar to the behavior experienced with the gather operation: T_s < T_f suggests that it is more advantageous for P_s to send data to the fastest processor. As p increases, the results demonstrate that P_f is better suited as the root node. Figure 5.3 (b) compares the performance of unbalanced and balanced workloads. Unlike the gather results, there is a benefit to distributing the problem size based upon a processor's computational abilities. Here, p = 2 had the best performance, with a maximum improvement of 3.62.

[Figure 5.3: Scatter actual performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

Figure 5.4 shows the predicted performance for the scatter operation. The cost model predicts the same performance for both the scatter and gather operations; thus, the graph is identical to Figure 5.2.

[Figure 5.4: Scatter predicted performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

Single-value reduction. Unlike the gather and scatter routines, there is negligible improvement for the single-value reduction operation if the root node is P_f. This is not surprising, considering that the HBSPk model predicted such behavior. Figure 5.5 (a) shows the result. Improvement is insignificant since the amount of data communicated to the root is a single value from each node. Figure 5.5 (b) demonstrates better performance when the workloads are balanced according to processor speed. The predicted performance of the single-value reduction operation is shown in Figure 5.6. For this algorithm, the cost model predicts that performance remains unchanged regardless of the speed of the root. This is to be expected,
since the root performs very little communication and computation. The actual results reflect this behavior. Moreover, the cost function correctly identifies the performance trend of balanced workloads.

[Figure 5.5: Single-value reduction actual performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

[Figure 5.6: Single-value reduction predicted performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

Point-wise reduction. A point-wise reduction of an array of values provides the root node with more work. The HBSPk model predicts that point-wise reduction will result in better performance than single-value reduction. Figure 5.7 plots the increased performance that results assuming that the root node is the fastest processor. Here, performance increases with the number of workstations. Moreover, the improvement is steady as the problem size increases. Figure 5.8 plots the predictions of the cost model. Overall, the performance trends of the algorithm were correctly identified.

[Figure 5.7: Point-wise reduction actual performance. The improvement factor is determined by T_s/T_f. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

[Figure 5.8: Point-wise reduction predicted performance. The improvement factor is determined by T_s/T_f. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

Prefix sums. In the prefix sums algorithm, the problem size refers to the total number of items held by all of the processors, not the root node. Figure 5.9 (a) graphs the improvement factor that results from using P_f as the root instead of P_s. Although the improvement factor is smaller than that of scatter, gather, and point-wise reduction, execution times improve by as much as 24%. This is quite significant, considering that the modifications to the algorithm consist only of selecting either a slow root node or a fast one; the root node in the prefix sums routine performs very little computation and communication. Improved results can be attained if P_f receives more work. Figure 5.9 (b) shows the results.
Here, performance decreases with the number of processors, implying that the algorithm is able to take advantage of balanced workloads efficiently if the number of processors is small. Figure 5.10 presents the predictability results. Similarly to single-value reduction, the cost model predicts that there is no advantage to using a fast root, since the amount of computation and communication it performs is small. Unfortunately, the actual results disagree with the predictions. However, the model does accurately predict the benefit of using balanced workloads.

[Figure 5.9: Prefix sums actual performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

[Figure 5.10: Prefix sums predicted performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

One-to-all broadcast. Figure 5.11 (a) compares the execution time of the algorithm assuming the root node is either P_s or P_f. The plot demonstrates that there is negligible improvement in performance; the HBSPk model predicted this behavior. The broadcast operation takes little advantage of the heterogeneity, since each processor must receive all of the data. In fact, the improvement in performance is a result of P_f distributing n/p integers to each processor during the first phase of the algorithm. Our analysis also applies if processor j receives c_j n elements during phase one of the algorithm; Figure 5.11 (b) corroborates the theoretical results. Figure 5.12 plots the predictions of the cost model, which over-predicts the benefit of using the fastest processor.

[Figure 5.11: One-to-all broadcast actual performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

[Figure 5.12: One-to-all broadcast predicted performance. The improvement factor is determined by (a) T_s/T_f and (b) T_u/T_b. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]
[Figure 5.13: All-to-all broadcast actual performance. Two algorithms are compared: simultaneous broadcast (SB) and intermediate destination (ID). The improvement factor is given for SB versus ID. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

All-to-all broadcast. Figure 5.13 compares the performance of two all-to-all broadcast implementations: simultaneous broadcast (SB) and intermediate destination (ID). Overall, the SB algorithm performed the best; in fact, the ID algorithm was not close to challenging its performance. This is somewhat disappointing, since the SB algorithm is very susceptible to node contention. One possible explanation is that it performs only one superstep, while the ID algorithm performs three supersteps. With the high cost of synchronization in our system, the SB algorithm is not as susceptible to the barrier synchronization cost. Figure 5.14 shows the predictability results. The HBSPk cost function verifies that the SB algorithm is indeed a better performer than the ID algorithm.

[Figure 5.14: All-to-all broadcast predicted performance. Two algorithms are compared: simultaneous broadcast (SB) and intermediate destination (ID). The improvement factor is given for SB versus ID. The problem size ranges from 100 KB to 1000 KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.]

5.4.1 Randomized Sample Sort

Chapter 3.2.2 discusses the merits of randomized sample sort for BSP computing. Here, we extend the algorithm to accommodate a heterogeneous cluster of workstations. Specifically, our objective is to evaluate the performance of the collective operations as part of a larger program. When adapting the randsort algorithm for HBSP1 machines, we change the way in which splitters are chosen. In heterogeneous environments, it is necessary that O(c_j n) keys fall between the splitters s_j and s_{j+1}. Homogeneous environments assume c_{0,j} = 1/m_0, where 0 <= j < m_0. Figure 5.15 presents the algorithm:

1. P_f scatters the n data items to each of its children, P_j, where 0 <= j < p - 1.
2. P_j randomly selects a set of sample keys from its c_j n input keys.
3. P_j sends its sample keys to the fastest node, P_f.
4. P_f sorts the p sample keys. Denote these keys by sample_0, ..., sample_{p-1}, where sample_i is the sample key with rank i in the sorted order. P_f defines p - 1 splitters, s_0, ..., s_{p-2}, where s_j = sample_{ceil((c_0 + ... + c_j) p)}.
5. P_f broadcasts the splitters to each of the processors.
6. All keys assigned to the jth bucket are sent to the jth processor.
7. All processors sort their buckets.

Figure 5.15: HBSP1 randomized sample sort.

The cost of the algorithm is as follows. Step 1 requires a cost of gn + L. In Step 2, each processor performs work proportional to the number of sample keys it selects. Step 3 requires a communication time of g max{r_s, r_f p}; assuming r_s < p, this reduces to gp. P_f sorts the sample keys in O(p lg p) time. Broadcasting the p - 1 splitters requires gp(1 + r_s) + 2L time. Since each processor is expected to receive approximately c_j n keys [Mor98b], Step 6 uses O(c_j n) computation time and g max_j{r_j c_j n} communication time, where 0 <= j < p. Once each processor receives its keys, sorting them requires O(c_j n lg(c_j n)) time. Thus, the total time of the algorithm is O(c_s n lg(c_s n)) + g(n + p + p + p r_s + r_s c_s n) + 5L.
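The heterogeneous splitter selection of Step 4 is the only place the algorithm differs from its BSP counterpart. The sketch below shows one way to implement it; the function names are ours, and qsort/ceil come from the C standard library.

    #include <stdlib.h>
    #include <math.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Step 4 of Figure 5.15: sort the p sample keys and choose p-1
       splitters at the heterogeneous ranks ceil((c_0 + ... + c_j) * p). */
    void choose_splitters(int *sample, int p, const double *c, int *splitter)
    {
        qsort(sample, p, sizeof(int), cmp_int);

        double csum = 0.0;
        for (int j = 0; j < p - 1; j++) {
            csum += c[j];                      /* c_0 + ... + c_j           */
            int rank = (int)ceil(csum * p);    /* rank of the jth splitter  */
            if (rank > p - 1) rank = p - 1;    /* clamp to a valid index    */
            splitter[j] = sample[rank];
        }
    }

With c_j = 1/m_0 for every j, the ranks degenerate to the usual evenly spaced homogeneous choice.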
The previous section experimentally validated that using a fast root node often results in better performance. Assuming a fast root node, we test the performance of randomized sample sort using balanced workloads. Table 5.6 presents the performance of our randomized sample sort implementation.

            p = 2   p = 4   p = 6   p = 8   p = 10
 n = 10^4   2.15    2.29    2.19    2.26    2.25
 n = 10^5   2.76    2.77    2.39    2.96    2.15
 n = 10^6   2.36    2.33    2.13    2.28    1.26

Table 5.6: Randomized sample sort performance. The factor of improvement is determined by T_u/T_b. The problem size ranges from 10^4 to 10^6 integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

The best performance occurs when n = 10^5 integers. Here, the improvement factor reaches a high of 2.96. The use of the scatter operation to allocate data to the processors is quite convenient. However, storage limitations will eventually prevent us from distributing all input data from a single process. In that sense, the scalability of the algorithm is limited.
5.5 Summary

The experimental results demonstrate that significant increases in performance occur if the heterogeneity of the underlying system is taken into consideration. For example, using faster processors often results in better algorithmic performance. The performance of the gather and scatter algorithms shows that there are situations when the root node should be the slowest processor. Balanced workloads also contribute to better overall performance. Overall, the experiments demonstrate that the HBSPk cost model guides the programmer in designing parallel software for good performance on heterogeneous platforms. The algorithms are not fine-tuned for a specific environment. Instead, the performance gains are a result of the cost predictions provided by the model.

CHAPTER 6
Conclusions and Future Work

The HBSPk model offers a framework that makes parallel computing a viable option for heterogeneous platforms. HBSPk extends the BSP model by incorporating parameters that apply to a diverse range of heterogeneous systems such as workstation clusters, the Internet, and computational grids. HBSPk rewards algorithms with balanced design. For heterogeneous systems, this translates to nodes receiving a workload proportional to their computational and communication abilities. The HBSPk parameter c_{i,j} provides the programmer with a way to manage the workload of each machine in the heterogeneous platform. Furthermore, faster machines should be used more often than their slower counterparts. Coordinator nodes provide the user with access to the faster nodes in the system. Therefore, the goal of HBSPk algorithm design is to minimize activity on slower machines while increasing the efficiency of the faster machines in the system.

The utility of the model is demonstrated through the design, analysis, and implementation of six collective communication algorithms: gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast. Our collective communication algorithms are based on two simple design principles. First, the root of a communication operation must be a fast node. Secondly, faster nodes receive more data items than slower nodes. We designed two types of experiments to validate the predictions of the HBSPk model. One experiment measured the importance of root node selection; the other tested the effect of problem size distribution. The results clearly demonstrate that the heterogeneity of a system cannot be ignored. If algorithms for such platforms are designed correctly, the performance benefits are tremendous. HBSPk provides the programmer with a framework in which to design efficient software for heterogeneous platforms. Besides enabling good performance, the model predicts the behavior of our collective routines within a reasonable margin of error.

Not all algorithms benefit from executing on a heterogeneous machine. The broadcast algorithms (one-to-all and all-to-all) show negligible benefit from our two-step approach to designing algorithms. A broadcast requires each machine to possess all of the data elements at the end of the operation. Since the slowest node must receive each element, the performance of the algorithm suffers; there is no way to balance the workload according to processor speed. In general, collective operations that require nodes to possess all of the data items at the end of the operation are unlikely to exploit heterogeneity effectively.

HBSPk offers a single-system image of a heterogeneous platform to the application developer. This view incorporates the salient features (characterized by a few parameters) of the underlying machine. Under HBSPk, improved performance is not a result of programmers having to account for myriad differences in a heterogeneous environment. By hiding the non-uniformity of the underlying system from the application developer, the HBSPk model offers an environment that encourages the design of heterogeneous parallel software in an architecture-independent manner.

6.1 Contributions

Below, we present a more detailed description of the contributions of this work:

- Developed a model of computation for heterogeneous and hierarchically-connected systems.
- Introduced a classification scheme to characterize various types of parallel platforms (HBSP0, HBSP1, ..., HBSPk).
- Designed and analyzed collective communication and sorting algorithms for the HBSPk model.
- Implemented a library to facilitate HBSP1 programming.
- Presented experimental results demonstrating efficient, scalable, and predictable HBSP1 applications.

The HBSPk model is a general model of computation that can be applied to a diverse range of heterogeneous platforms. It defines a programming methodology for designing heterogeneous programs and an associated cost model to analyze the complexity of an algorithm; this model applies to a variety of heterogeneous platforms. Moreover, the cost model allows for predictability of performance. HBSPk provides the designer with parameters that reflect the relative computational and communication speeds at each of the k levels and captures the tradeoffs between communication and computation that are inherent in parallel applications. Improved performance results from effectively exploiting the speeds of the heterogeneous computing components. Furthermore, increased performance comes in an architecture-independent manner.

In HBSPk, machines are grouped hierarchically into clusters based on their ability to communicate with each other. HBSP0, or single-processor, computers are the simplest class of machines, since they do not perform communication. HBSP1 machines group HBSP0 processors together to form a single parallel system that performs communication. In general, the HBSPk model refers to a class of machines with at most k different levels of communication. This characterization allows the model to be adaptable to workstation clusters as well as computational grids.

We designed six collective communication algorithms for heterogeneous computation.
Since these basic patterns of interprocessor communication are frequently used as building blocks in a variety of parallel algorithms, efficient implementations of them are crucial. Each of the collective routines contains a phase where one node is responsible for collecting or distributing information to the other nodes. In these situations, there is a substantial performance gain if the root node is the fastest machine and workloads are balanced across the nodes. Multi-layer architectures can benefit from using coordinator nodes to allow multiple fast nodes to be in use at one time. Of course, one must be cognizant of the high cost of communication and synchronization in such environments. Thus, our algorithms minimize traffic on slower network links.

The HBSP Programming Library (HBSPlib) is a parallel C library of communication routines for the HBSP1 model. Besides providing primitives for process initialization, process enquiry, barrier synchronization, and message passing, HBSPlib incorporates additional functions to address the heterogeneity of the underlying system. These functions include retrieving the identity of the fastest processor and returning the speed of a single processor or an entire cluster. With this information, a programmer is able to give more work to the fastest processor and distribute the workload based on the relative speeds of the heterogeneous processors.

The experimental results validate the predictions of the HBSP1 cost model. The testbed consists of a non-dedicated, heterogeneous cluster of workstations. We use the BYTEmark benchmark to rank the processors as well as to determine the load balancing parameter, c_{0,j}. The experiments corroborate the theoretical claims of the model. First, faster processors, if used appropriately, result in faster execution times. Additionally, balanced workloads result in better overall performance. Overall, the performance of our collective operations is quite impressive. Furthermore, randomized sample sort shows the benefit of using the HBSP1 collective routines. Fundamental changes to the algorithms are not necessary to attain the increase in performance; instead, modifications consist of selecting the root node and distributing the workload.

6.2 Future Research

Based on the lessons learned from the development, implementation, and evaluation of the HBSPk model, the following research extensions and improvements are presented as a follow-on research agenda.

Develop an optimized and scalable HBSP library implementation. Although our prototype showed the performance improvement that results when a processor's load is balanced, HBSPlib could benefit from additional improvements. One area of concern is our barrier synchronization implementation; additional work is needed to reduce the cost of this operation. Although PVM served its purpose in our prototype, we are considering using MPI or Java as a basis for future implementations of HBSPlib.

Extend HBSPlib to accommodate hierarchical architectures. Recent work demonstrates the importance of collective communication operations for hierarchical networks. Husbands and Hoe [HH98] develop MPI-StarT, a system that efficiently implements collective routines for a cluster of SMPs. Bader and JaJa [BJ99] describe a methodology for developing high-performance programs running on clusters of SMP nodes, based on a small kernel (SIMPLE) of collective communication primitives. Kielmann et al. [KHP99] present MagPie, a system that also handles a two-level communication hierarchy. Karonis et al.
[KSF00] develop a topology-aware version of the broadcast operation for good performance. The P-logP model [KBG00], an extension of LogP [CKP93], is used to optimize the performance of wide-area collective operations by determining the optimal tree shape for the communication; moreover, large messages are split into smaller units, resulting in better link utilization. Each of the above efforts considers only the network bandwidth of the underlying heterogeneous environment. However, a machine's computational speed plays an important role in the overall time of a collective operation. The HBSPk model allows the algorithm designer to take advantage of both the communication and the computational abilities of the components in the heterogeneous system.

Design additional HBSPk applications. Investigating the range of applications that can be efficiently handled by HBSPk is an important issue. We have shown that the HBSPk model can guide the development of efficient collective operations. Most of our communication algorithms resulted in increased performance when the heterogeneity of the underlying system is considered. More work must be done in showing the applicability of the model to other problems such as matrix multiplication, minimum spanning tree, and N-body simulation. Other communication routines (i.e., broadcast algorithms) cannot effectively exploit the heterogeneity of the underlying system; further study of problems in this category is also of interest.

Let us consider the difficulty of designing an algorithm for solving the matrix multiplication problem on a heterogeneous network of workstations. Given two n x n matrices A and B, we define the matrix C = A * B as C_{i,j} = sum_{k=0}^{n-1} A_{i,k} B_{k,j} [BBR00]. Here, we discuss a block version of the algorithm [KGG94]. For example, an n x n matrix A can be regarded as a q x q array of blocks A_{i,j}, where 0 <= i, j < q, such that each block is an n/q x n/q submatrix. We can use p homogeneous processors to implement the q x q algorithm by choosing q = sqrt(p). There is then a one-to-one mapping between the p blocks and the p homogeneous processors, and each processor is responsible for updating a distinct C_{i,j} block; a sketch of this block update follows below. Splitting the matrices into p equal-sized blocks will not lead to good performance on heterogeneous platforms. Instead, we must balance the workload of a processor in accordance with its computing power. Therefore, efficient performance occurs by tiling the C matrix into p rectangles of varying sizes (see Figure 6.1).

[Figure 6.1 (panels: Homogeneous Partition, Heterogeneous Partition): Processor allocation of p = 16 matrix blocks. Each processor receives the same block size on a homogeneous cluster. On heterogeneous clusters, processors receive varying block sizes.]
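A minimal sketch of the homogeneous block update just described appears below; the names and the flat row-major layout are our own, and a heterogeneous tiling would simply vary the block extents per processor.

    /* Update one bs x bs block of C in the blocked product C = A * B:
       C_{i,j} = sum_k A_{i,k} B_{k,j} in element form, where (bi, bj) is
       the block assigned to this processor and bs = n / q. */
    void update_block(const double *A, const double *B, double *C,
                      int n, int bs, int bi, int bj)
    {
        for (int i = bi * bs; i < (bi + 1) * bs; i++)
            for (int j = bj * bs; j < (bj + 1) * bs; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }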
Investigate other methods of estimating c_{i,j}. In our experiments, processor j's c_{i,j} value, where i = 0, is based on its computational speed. However, its communication ability may also play a role in achieving balanced workloads. It is not unlikely for a machine's communication and computation abilities to appear on opposite ends of the spectrum. For example, a workstation may perform fast computationally, but its communication ability may not be as strong. In such cases, a reasonable estimation of c_{i,j} considers both performance values in its calculation. When k >= 2, determining the values of each node's c_{i,j} parameter becomes more difficult. One possibility is that its value could be the sum of its children's values. Additional investigation of load-balancing strategies for hierarchical architectures is also required.

Study the benefits of incorporating costs for different types of communication. Currently, the HBSPk model assigns the same cost to both inter- and intra-cluster communication. However, sending messages within a cluster is generally less expensive than communicating outside the cluster. Furthermore, since nodes are not necessarily in the same region, communication costs may vary depending upon the destination's geographic location. Incorporating such costs will increase the number of parameters in the model. One of our goals in developing the HBSPk model was to keep the number of parameters as small as possible. Thus, an empirical study is necessary to determine the benefits of modifying the r_{i,j} parameter.

APPENDIX A
Collective Communication Performance Data

The following tables provide performance numbers for our collective routines. We refer to this data in Chapter 5.4. Specifically, the tables include the actual and predicted runtimes on a heterogeneous cluster comprised of 2, 4, 6, 8, and 10 processors.

                     problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
 p = 2
  Ts        0.074  0.141  0.206  0.273  0.346  0.413  0.475  0.545  0.615  0.683
  Tf (Tu)   0.066  0.128  0.186  0.249  0.308  0.372  0.436  0.491  0.553  0.612
  Tb        0.021  0.038  0.053  0.069  0.087  0.103  0.118  0.135  0.153  0.168
 p = 4
  Ts        0.092  0.185  0.288  0.373  0.461  0.556  0.644  0.743  0.828  0.920
  Tf (Tu)   0.042  0.073  0.144  0.144  0.170  0.215  0.243  0.282  0.313  0.338
  Tb        0.050  0.086  0.140  0.175  0.229  0.276  0.317  0.382  0.392  0.442
 p = 6
  Ts        0.103  0.225  0.302  0.421  0.511  0.613  0.759  0.825  0.911  1.022
  Tf (Tu)   0.032  0.059  0.085  0.105  0.132  0.170  0.187  0.216  0.233  0.268
  Tb        0.040  0.070  0.102  0.137  0.177  0.213  0.239  0.291  0.323  0.353
 p = 8
  Ts        0.171  0.208  0.338  0.426  0.532  0.650  0.752  0.916  0.974  1.079
  Tf (Tu)   0.028  0.053  0.073  0.087  0.109  0.130  0.152  0.171  0.190  0.213
  Tb        0.035  0.145  0.084  0.131  0.141  0.179  0.209  0.234  0.247  0.281
 p = 10
  Ts        0.337  0.257  0.356  0.450  0.581  0.650  1.017  0.863  0.951  1.101
  Tf (Tu)   0.038  0.069  0.169  0.120  0.199  0.213  0.219  0.243  0.271  0.297
  Tb        0.174  0.065  0.152  0.164  0.182  0.190  0.231  0.253  0.265  0.296

Table A.1: Actual execution times (in seconds) for gather. The problem size ranges from 100 KB to 1000 KB of integers. T_s and T_f (T_u) denote the execution time assuming a slow and fast root node, respectively. T_b is the runtime for balanced workloads. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.091  0.173  0.255  0.337  0.419  0.501  0.582  0.664  0.746  0.828
    Tf (Tu) 0.050  0.091  0.132  0.173  0.214  0.255  0.296  0.337  0.387  0.419
    Tb      0.029  0.049  0.069  0.089  0.109  0.129  0.149  0.170  0.190  0.210
  p = 4
    Ts      0.097  0.179  0.261  0.343  0.425  0.507  0.588  0.670  0.752  0.834
    Tf (Tu) 0.035  0.056  0.076  0.097  0.117  0.138  0.158  0.179  0.199  0.220
    Tb      0.035  0.055  0.075  0.095  0.115  0.135  0.155  0.176  0.196  0.216
  p = 6
    Ts      0.105  0.187  0.269  0.351  0.433  0.515  0.596  0.678  0.760  0.842
    Tf (Tu) 0.043  0.063  0.083  0.103  0.123  0.143  0.163  0.183  0.204  0.224
    Tb      0.043  0.063  0.083  0.103  0.123  0.143  0.163  0.184  0.204  0.224
  p = 8
    Ts      0.112  0.194  0.276  0.358  0.440  0.522  0.603  0.685  0.767  0.849
    Tf (Tu) 0.050  0.070  0.090  0.110  0.130  0.150  0.170  0.191  0.211  0.231
    Tb      0.051  0.070  0.090  0.110  0.130  0.150  0.170  0.191  0.211  0.231
  p = 10
    Ts      0.119  0.201  0.283  0.365  0.447  0.529  0.610  0.692  0.774  0.856
    Tf (Tu) 0.057  0.077  0.097  0.117  0.137  0.157  0.177  0.198  0.218  0.238
    Tb      0.057  0.077  0.097  0.117  0.137  0.157  0.177  0.198  0.218  0.238

Table A.2: Predicted execution times (in seconds) for gather. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.070  0.130  0.194  0.257  0.323  0.379  0.441  0.508  0.565  0.631
    Tf (Tu) 0.075  0.141  0.210  0.277  0.340  0.409  0.477  0.549  0.608  0.681
    Tb      0.027  0.044  0.064  0.081  0.097  0.116  0.134  0.153  0.169  0.188
  p = 4
    Ts      0.090  0.171  0.257  0.337  0.458  0.529  0.696  0.814  0.921  0.981
    Tf (Tu) 0.045  0.079  0.146  0.196  0.267  0.313  0.372  0.459  0.485  0.573
    Tb      0.049  0.091  0.120  0.158  0.192  0.239  0.302  0.314  0.374  0.422
  p = 6
    Ts      0.099  0.185  0.267  0.465  0.611  0.665  0.769  0.896  1.012  1.131
    Tf (Tu) 0.042  0.069  0.125  0.217  0.248  0.305  0.344  0.451  0.485  0.539
    Tb      0.041  0.067  0.124  0.150  0.218  0.234  0.276  0.326  0.381  0.422
  p = 8
    Ts      0.104  0.189  0.270  0.357  0.782  0.756  0.917  0.871  1.049  1.111
    Tf (Tu) 0.041  0.061  0.083  0.105  0.130  0.308  0.310  0.373  0.410  0.541
    Tb      0.038  0.059  0.080  0.133  0.159  0.216  0.326  0.261  0.392  0.409
  p = 10
    Ts      0.109  0.238  0.279  0.367  0.454  0.787  0.763  1.067  1.046  1.110
    Tf (Tu) 0.082  0.072  0.095  0.123  0.150  0.186  0.373  0.404  0.424  0.515
    Tb      0.043  0.065  0.087  0.111  0.185  0.208  0.257  0.315  0.326  0.366

Table A.3: Actual execution times (in seconds) for scatter. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.091  0.173  0.255  0.337  0.419  0.501  0.582  0.664  0.746  0.828
    Tf (Tu) 0.050  0.091  0.132  0.173  0.214  0.255  0.296  0.337  0.378  0.419
    Tb      0.029  0.049  0.069  0.089  0.109  0.129  0.149  0.170  0.190  0.210
  p = 4
    Ts      0.097  0.179  0.261  0.343  0.425  0.507  0.588  0.670  0.752  0.834
    Tf (Tu) 0.035  0.056  0.076  0.097  0.117  0.138  0.158  0.179  0.199  0.220
    Tb      0.035  0.055  0.075  0.095  0.115  0.135  0.155  0.176  0.196  0.216
  p = 6
    Ts      0.105  0.187  0.269  0.351  0.433  0.515  0.596  0.678  0.760  0.842
    Tf (Tu) 0.043  0.063  0.083  0.103  0.123  0.143  0.163  0.184  0.204  0.224
    Tb      0.043  0.063  0.083  0.103  0.123  0.143  0.163  0.183  0.203  0.224
  p = 8
    Ts      0.112  0.194  0.276  0.358  0.440  0.522  0.603  0.685  0.767  0.849
    Tf (Tu) 0.050  0.070  0.090  0.110  0.130  0.150  0.170  0.190  0.210  0.231
    Tb      0.050  0.070  0.090  0.110  0.130  0.150  0.170  0.190  0.211  0.230
  p = 10
    Ts      0.119  0.201  0.283  0.365  0.447  0.529  0.610  0.692  0.774  0.856
    Tf (Tu) 0.057  0.077  0.097  0.117  0.137  0.157  0.177  0.197  0.218  0.238
    Tb      0.057  0.077  0.097  0.117  0.137  0.157  0.177  0.198  0.218  0.238

Table A.4: Predicted execution times (in seconds) for scatter. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.016  0.019  0.023  0.026  0.030  0.033  0.036  0.039  0.043  0.047
    Tf (Tu) 0.016  0.020  0.023  0.026  0.030  0.033  0.037  0.040  0.044  0.047
    Tb      0.013  0.015  0.015  0.017  0.018  0.020  0.021  0.023  0.026  0.027
  p = 4
    Ts      0.021  0.023  0.024  0.026  0.028  0.029  0.031  0.033  0.035  0.037
    Tf (Tu) 0.020  0.022  0.024  0.026  0.027  0.029  0.030  0.032  0.040  0.035
    Tb      0.019  0.019  0.019  0.020  0.020  0.021  0.021  0.021  0.023  0.023
  p = 6
    Ts      0.027  0.028  0.030  0.031  0.032  0.033  0.034  0.035  0.036  0.037
    Tf (Tu) 0.025  0.027  0.027  0.028  0.029  0.030  0.031  0.034  0.034  0.035
    Tb      0.024  0.024  0.031  0.024  0.025  0.025  0.026  0.027  0.026  0.027
  p = 8
    Ts      0.034  0.035  0.038  0.037  0.038  0.039  0.039  0.040  0.041  0.042
    Tf (Tu) 0.036  0.031  0.033  0.032  0.033  0.039  0.036  0.035  0.037  0.038
    Tb      0.029  0.029  0.031  0.030  0.030  0.032  0.031  0.032  0.032  0.032
  p = 10
    Ts      0.041  0.042  0.043  0.044  0.044  0.045  0.085  0.046  0.047  0.048
    Tf (Tu) 0.075  0.038  0.037  0.037  0.040  0.040  0.040  0.041  0.010  0.042
    Tb      0.035  0.035  0.037  0.035  0.432  0.037  0.036  0.040  0.036  0.037

Table A.5: Actual execution times (in seconds) for single-value reduction. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.012  0.015  0.018  0.020  0.023  0.026  0.029  0.032  0.035  0.037
    Tf (Tu) 0.012  0.015  0.018  0.020  0.024  0.026  0.029  0.032  0.035  0.038
    Tb      0.010  0.012  0.013  0.015  0.016  0.018  0.019  0.021  0.022  0.023
  p = 4
    Ts      0.016  0.018  0.019  0.020  0.022  0.023  0.025  0.026  0.028  0.029
    Tf (Tu) 0.016  0.018  0.019  0.021  0.022  0.024  0.025  0.026  0.028  0.029
    Tb      0.016  0.016  0.017  0.018  0.019  0.019  0.020  0.021  0.022  0.022
  p = 6
    Ts      0.024  0.025  0.026  0.027  0.028  0.029  0.030  0.031  0.032  0.033
    Tf (Tu) 0.024  0.029  0.031  0.034  0.038  0.040  0.043  0.046  0.049  0.051
    Tb      0.023  0.024  0.024  0.025  0.025  0.026  0.026  0.027  0.027  0.028
  p = 8
    Ts      0.031  0.031  0.032  0.033  0.034  0.034  0.035  0.036  0.036  0.037
    Tf (Tu) 0.031  0.031  0.032  0.033  0.034  0.034  0.035  0.036  0.036  0.037
    Tb      0.030  0.031  0.031  0.031  0.032  0.032  0.032  0.033  0.033  0.034
  p = 10
    Ts      0.038  0.038  0.039  0.039  0.040  0.041  0.041  0.042  0.042  0.043
    Tf (Tu) 0.038  0.038  0.039  0.039  0.040  0.040  0.041  0.042  0.042  0.043
    Tb      0.037  0.038  0.038  0.038  0.038  0.039  0.039  0.039  0.040  0.040

Table A.6: Predicted execution times (in seconds) for single-value reduction. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.096  0.188  0.269  0.370  0.447  0.543  0.630  0.733  0.794  0.911
    Tf      0.073  0.138  0.205  0.275  0.342  0.400  0.466  0.535  0.601  0.669
  p = 4
    Ts      0.120  0.239  0.361  0.474  0.595  0.699  0.808  0.946  1.021  1.175
    Tf      0.047  0.086  0.125  0.169  0.200  0.249  0.279  0.315  0.357  0.394
  p = 6
    Ts      0.137  0.276  0.393  0.524  0.629  0.774  0.873  1.052  1.150  1.256
    Tf      0.039  0.072  0.098  0.132  0.155  0.184  0.233  0.250  0.293  0.323
  p = 8
    Ts      0.274  0.265  0.410  0.527  0.646  0.792  0.890  1.082  1.161  1.301
    Tf      0.038  0.058  0.083  0.109  0.141  0.156  0.187  0.217  0.229  0.261
  p = 10
    Ts      0.472  0.373  0.411  0.592  0.668  0.783  0.964  1.072  1.187  1.357
    Tf      0.053  0.207  0.286  0.154  0.197  0.253  0.258  0.285  0.315  0.334

Table A.7: Actual execution times (in seconds) for point-wise reduction. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf denote the execution time assuming a slow and fast root node, respectively. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.112  0.218  0.317  0.432  0.430  0.630  0.729  0.832  0.935  1.039
    Tf      0.068  0.127  0.186  0.246  0.305  0.364  0.423  0.481  0.540  0.599
  p = 4
    Ts      0.118  0.224  0.323  0.438  0.436  0.636  0.735  0.839  0.941  1.045
    Tf      0.047  0.080  0.112  0.145  0.177  0.210  0.243  0.274  0.306  0.338
  p = 6
    Ts      0.126  0.232  0.331  0.446  0.444  0.644  0.742  0.847  0.949  1.053
    Tf      0.049  0.075  0.101  0.127  0.153  0.178  0.205  0.229  0.255  0.281
  p = 8
    Ts      0.133  0.239  0.338  0.453  0.451  0.651  0.750  0.854  0.956  1.060
    Tf      0.056  0.082  0.108  0.134  0.160  0.185  0.212  0.236  0.262  0.288
  p = 10
    Ts      0.140  0.246  0.345  0.460  0.458  0.658  0.757  0.861  0.963  1.067
    Tf      0.063  0.089  0.115  0.141  0.167  0.192  0.219  0.243  0.269  0.295

Table A.8: Predicted execution times (in seconds) for point-wise reduction. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf denote the execution time assuming a slow and fast root node, respectively.
Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.028  0.045  0.047  0.058  0.068  0.078  0.086  0.096  0.107  0.114
    Tf (Tu) 0.029  0.039  0.048  0.058  0.068  0.078  0.088  0.098  0.107  0.116
    Tb      0.022  0.024  0.027  0.032  0.036  0.040  0.044  0.049  0.053  0.057
  p = 4
    Ts      0.035  0.044  0.045  0.050  0.056  0.063  0.064  0.070  0.075  0.079
    Tf (Tu) 0.034  0.041  0.042  0.048  0.052  0.063  0.062  0.066  0.071  0.077
    Tb      0.028  0.031  0.031  0.032  0.033  0.035  0.036  0.039  0.040  0.043
  p = 6
    Ts      0.045  0.048  0.054  0.059  0.061  0.062  0.066  0.074  0.071  0.074
    Tf (Tu) 0.039  0.042  0.047  0.052  0.051  0.054  0.058  0.061  0.065  0.068
    Tb      0.036  0.036  0.037  0.045  0.041  0.04   0.041  0.042  0.043  0.044
  p = 8
    Ts      0.057  0.065  0.064  0.064  0.067  0.074  0.07   0.077  0.076  0.078
    Tf (Tu) 0.048  0.053  0.052  0.054  0.056  0.063  0.061  0.064  0.067  0.069
    Tb      0.046  0.048  0.045  0.047  0.048  0.048  0.049  0.050  0.050  0.050
  p = 10
    Ts      0.069  0.072  0.074  0.058  0.077  0.078  0.086  0.083  0.091  0.088
    Tf (Tu) 0.056  0.116  0.058  0.074  0.062  0.102  0.067  0.089  0.076  0.072
    Tb      0.215  0.057  0.056  0.058  0.057  0.058  0.056  0.058  0.058  0.058

Table A.9: Actual execution times (in seconds) for prefix sums. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.028  0.039  0.049  0.063  0.071  0.081  0.092  0.102  0.113  0.124
    Tf (Tu) 0.029  0.039  0.050  0.060  0.072  0.081  0.092  0.106  0.115  0.124
    Tb      0.021  0.024  0.027  0.029  0.033  0.035  0.038  0.041  0.044  0.047
  p = 4
    Ts      0.035  0.041  0.046  0.051  0.057  0.062  0.067  0.072  0.078  0.083
    Tf (Tu) 0.035  0.041  0.046  0.051  0.057  0.062  0.067  0.072  0.078  0.083
    Tb      0.031  0.033  0.034  0.036  0.037  0.039  0.040  0.041  0.043  0.044
  p = 6
    Ts      0.050  0.053  0.057  0.060  0.064  0.067  0.070  0.074  0.078  0.081
    Tf (Tu) 0.050  0.053  0.057  0.060  0.064  0.067  0.070  0.074  0.078  0.081
    Tb      0.046  0.047  0.047  0.048  0.048  0.049  0.049  0.050  0.050  0.051
  p = 8
    Ts      0.063  0.066  0.068  0.071  0.073  0.076  0.078  0.081  0.084  0.087
    Tf (Tu) 0.063  0.066  0.068  0.071  0.073  0.076  0.078  0.081  0.084  0.087
    Tb      0.060  0.061  0.061  0.061  0.062  0.062  0.062  0.063  0.063  0.064
  p = 10
    Ts      0.076  0.079  0.080  0.084  0.085  0.087  0.089  0.091  0.093  0.095
    Tf (Tu) 0.076  0.079  0.080  0.084  0.085  0.087  0.089  0.091  0.093  0.095
    Tb      0.074  0.075  0.075  0.075  0.075  0.076  0.076  0.076  0.077  0.077

Table A.10: Predicted execution times (in seconds) for prefix sums. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.137  0.249  0.365  0.492  0.623  0.732  0.877  1.019  1.104  1.252
    Tf (Tu) 0.135  0.263  0.389  0.525  0.655  0.769  0.928  1.046  1.156  1.307
    Tb      0.139  0.276  0.403  0.545  0.663  0.809  0.938  1.079  1.193  1.337
  p = 4
    Ts      0.229  0.441  0.681  0.876  1.104  1.460  1.742  1.989  2.241  2.501
    Tf (Tu) 0.186  0.360  0.624  0.885  1.085  1.317  1.564  1.830  2.083  2.396
    Tb      0.210  0.475  0.739  0.961  1.218  1.488  1.759  2.017  2.270  2.598
  p = 6
    Ts      0.256  0.559  0.770  1.296  1.537  1.981  2.270  2.646  2.967  3.359
    Tf (Tu) 0.204  0.465  0.639  1.224  1.521  1.946  2.165  2.765  2.932  3.392
    Tb      0.222  0.537  0.908  1.193  1.521  1.769  2.159  2.504  2.828  3.155
  p = 8
    Ts      0.323  0.531  0.917  1.220  1.665  2.110  2.470  2.956  4.126  3.787
    Tf (Tu) 0.208  0.402  0.843  1.040  1.490  2.093  2.363  2.815  3.310  3.542
    Tb      0.241  0.469  0.881  1.218  1.788  2.050  2.207  2.823  3.641  3.765
  p = 10
    Ts      1.456  1.769  1.452  1.770  2.310  3.588  3.332  3.877  4.489  5.061
    Tf (Tu) 0.450  0.862  1.266  1.537  2.041  2.435  3.152  3.573  4.212  4.773
    Tb      0.410  1.130  1.134  1.766  1.839  2.676  3.269  3.633  4.476  4.952

Table A.11: Actual execution times (in seconds) for one-to-all broadcast. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    Ts      0.182  0.346  0.510  0.673  0.837  1.001  1.165  1.329  1.493  1.656
    Tf (Tu) 0.141  0.264  0.387  0.510  0.632  0.755  0.878  1.001  1.124  1.247
    Tb      0.120  0.222  0.324  0.426  0.528  0.630  0.732  0.834  0.936  1.038
  p = 4
    Ts      0.194  0.358  0.522  0.685  0.849  1.013  1.177  1.341  1.505  1.668
    Tf (Tu) 0.132  0.235  0.337  0.440  0.542  0.644  0.747  0.849  0.952  1.054
    Tb      0.132  0.234  0.336  0.438  0.540  0.640  0.744  0.846  0.948  1.050
  p = 6
    Ts      0.210  0.374  0.538  0.701  0.865  1.029  1.193  1.357  1.521  1.684
    Tf (Tu) 0.148  0.250  0.352  0.454  0.556  0.658  0.760  0.862  0.964  1.066
    Tb      0.148  0.250  0.352  0.454  0.556  0.658  0.760  0.862  0.964  1.066
  p = 8
    Ts      0.224  0.388  0.552  0.715  0.879  1.043  1.207  1.371  1.535  1.698
    Tf (Tu) 0.162  0.264  0.366  0.468  0.570  0.672  0.774  0.876  0.978  1.080
    Tb      0.162  0.264  0.366  0.468  0.570  0.672  0.774  0.876  0.978  1.080
  p = 10
    Ts      0.238  0.402  0.566  0.729  0.893  1.057  1.221  1.385  1.549  1.712
    Tf (Tu) 0.176  0.278  0.380  0.482  0.584  0.686  0.788  0.890  0.992  1.094
    Tb      0.176  0.278  0.380  0.482  0.584  0.686  0.788  0.890  0.992  1.094

Table A.12: Predicted execution times (in seconds) for one-to-all broadcast. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    SB      0.017  0.024  0.035  0.045  0.057  0.066  0.076  0.088  0.097  0.109
    ID      0.033  0.054  0.070  0.088  0.109  0.127  0.142  0.164  0.182  0.201
  p = 4
    SB      0.028  0.040  0.053  0.076  0.087  0.121  0.130  0.140  0.156  0.228
    ID      0.043  0.058  0.073  0.090  0.118  0.129  0.157  0.183  0.199  0.210
  p = 6
    SB      0.035  0.047  0.069  0.103  0.108  0.128  0.139  0.159  0.176  0.202
    ID      0.059  0.071  0.089  0.114  0.136  0.157  0.240  0.200  0.224  0.245
  p = 8
    SB      0.041  0.124  0.071  0.096  0.115  0.128  0.147  0.613  0.230  0.199
    ID      0.069  0.303  0.103  0.194  0.268  0.172  0.237  0.312  0.269  0.249
  p = 10
    SB      0.057  0.320  0.239  0.244  0.203  0.226  0.282  0.315  0.380  0.383
    ID      0.142  0.367  0.434  1.073  0.555  0.422  0.750  0.512  0.442  0.462

Table A.13: Actual execution times (in seconds) for all-to-all broadcast. Two algorithms are compared: simultaneous broadcast (SB) and intermediate destination (ID). The problem size ranges from 100KB to 1000KB of integers. Each data point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

                        problem size (in KBs)
            100    200    300    400    500    600    700    800    900    1000
  p = 2
    SB      0.029  0.050  0.070  0.091  0.111  0.132  0.152  0.173  0.193  0.214
    ID      0.058  0.098  0.138  0.179  0.219  0.259  0.299  0.339  0.379  0.419
  p = 4
    SB      0.035  0.056  0.076  0.097  0.117  0.138  0.158  0.179  0.199  0.220
    ID      0.070  0.110  0.150  0.191  0.231  0.271  0.311  0.351  0.391  0.431
  p = 6
    SB      0.043  0.064  0.084  0.105  0.125  0.146  0.166  0.187  0.207  0.228
    ID      0.086  0.126  0.166  0.207  0.247  0.287  0.327  0.367  0.407  0.447
  p = 8
    SB      0.050  0.071  0.091  0.112  0.132  0.153  0.173  0.194  0.214  0.235
    ID      0.100  0.140  0.180  0.221  0.261  0.301  0.341  0.381  0.421  0.461
  p = 10
    SB      0.068  0.098  0.129  0.160  0.191  0.221  0.252  0.283  0.313  0.242
    ID      0.114  0.154  0.194  0.235  0.275  0.315  0.355  0.395  0.435  0.475

Table A.14: Predicted execution times (in seconds) for all-to-all broadcast. Two algorithms are compared: simultaneous broadcast (SB) and intermediate destination (ID). The problem size ranges from 100KB to 1000KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.

List of References

[ACS89] A. Aggarwal, A. K. Chandra, and M. Snir. "On communication latency in PRAM computations." In 1st ACM Symposium on Parallel Algorithms and Architectures, pp. 11-21, 1989.
[ACS90] A. Aggarwal, A. K. Chandra, and M. Snir. "Communication complexity of PRAMs." Theoretical Computer Science, March 1990.
[AGL98] Gail A. Alverson, William G. Griswold, Calvin Lin, David Notkin, and Lawrence Snyder. "Abstractions for Portable, Scalable Parallel Programming." IEEE Transactions on Parallel and Distributed Systems, 9(1):71-86, January 1998.
[Akl97] Selim Akl. Parallel Computation: Models and Methods. Prentice Hall, 1997.
[Bat68] K. Batcher. "Sorting networks and their applications." In Proceedings of the AFIPS Spring Joint Computing Conference, pp. 307-314, 1968.
[BBC94] Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. "CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers." In Proceedings of the 8th International Parallel Processing Symposium, pp. 835-844, 1994.
[BBR00] Olivier Beaumont, Vincent Boudet, Fabrice Rastello, and Yves Robert. "Matrix-Matrix Multiplication on Heterogeneous Platforms." Technical Report 2000-24, École Normale Supérieure de Lyon, January 2000.
[BDR99] Pierre Boulet, Jack Dongarra, Fabrice Rastello, Yves Robert, and Frédéric Vivien.
\Algorithmic issues on heterogeneous computing platforms." Parallel Processing Letters, 9(2):197{213, 1999. [BGM95] Guy Blelloch, Phil Gibbons, Yossi Matias, and Marco Zagha. \Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors." In Seventh ACM Symposium on Parallel Algorithms and Architectures, pp. 84{94, June 1995. 124 [BGP94] Mike Barnett, Satya Gupta, David G. Payne, Lance Shuler, Robert van de Geijn, and Jerrell Watts. \Interprocessor Collective Communication Library (Intercom)." Scalable High Performance Computing Conference, pp. 357{364, 1994. [BHP96] Gianfranco Bilardi, Kieran T. Herley, Andrea Pietracaprina, Geppino Pucci, and Paul Spirakis. \BSP vs LogP." In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 25{32, June 1996. [BJ99] David Bader and Joseph JaJa. \SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs)." Journal of Parallel and Distributed Computing, 58(1):92{108, July 1999. [BL92] R. Butler and E. Lusk. \User's guide to the p4 Programming System." Technical Report ANL-92/17, Argonne National Laboratory, 1992. [BLM98] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. \An experimental analysis of parallel sorting algorithms." Theory of Computing Systems, 31(2):135{167, March/April 1998. [BMP98] M. Banikazemi, V. Moorthy, and D. Panda. \Ecient Collective Communication on Heterogeneous Networks Workstations." In International Conference on Parallel Processing, pp. 460{467, 1998. [BRP99] Prashanth Bhat, C.S. Raghavendra, and Viktor Prasanna. \Ecient Collective Communication in Distributed Heterogeneous Systems." In International Conference on Distributed Computing Systems, May 1999. [BSP99] M. Banikazemi, J. Sampathkumar, S. Prabhu, D. Panda, and P. Sadayappan. \Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations." In Heterogeneous Computing Workshop (HCW '99), pp. 125{ 133, April 1999. [Buy99a] Rajkumar Buyya. High Performance Cluster Computing: Architectures and Systems, volume 1. Prentice Hall, 1999. [Buy99b] Rajkumar Buyya. High Performance Cluster Computing: Programming and Applications, volume 2. Prentice Hall, 1999. 125 [BYT95] Byte Magazine. \The BYTEmark benchmark." URL http://www.byte.com/bmark/bmark.htm, 1995. [CG89] N. Carriero and D. Gelernter. \LINDA in context." Communications of the ACM, 32:444{458, 1989. [CKP93] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. \LogP: Towards a Realistic Model of Parallel Computation." In Fourth ACM Symposium on Principles and Practice of Parallel Programming, pp. 1{12, May 1993. [CKP96] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. \LogP: A Practical Model of Parallel Computation." Communications of the ACM, 39(11):78{85, November 1996. [CLR94] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1994. [CR73] S. A. Cook and R.A. Reckhow. \Time Bounded Random Access Machines." Journal of Computer and Systems Sciences, 7:354{375, 1973. [DCS96] Andrea C. Dusseau, David E. Culler, Klaus Erik Schauser, and Richard P. Martin. \Fast Parallel Sorting Under LogP: Experience with the CM-5." IEEE Transactions on Parallel and Distributed Systems, 7(8):791{805, August 1996. [DFR93] F. Dehne, A. 
Fabri, and A. Rau-Chaplin. \Scalable Parallel Computational Geometry for Coarse Multicomputers." In Proc. ACM Symposium on Computational Geometry, pp. 298{307, 1993. [EF93] M. M. Eshaghian and R. F. Freund. \Cluster-M Paradigms for HighOrder Heterogeneous Procedural Speci cation Computing." In Workshop on Heterogeneous Processing, 1993. [ES93] M. M. Eshaghian and M. E. Shaaban. \Cluster-M Parallel Programming Paradigm." International Journal of High Speed Computing, 1993. [FK98] Ian Foster and Carl Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998. 126 [FM70] [For93] [FW78] [Gel85] [GHP90] [Gib89] [GLR99] [GMR94] [GMR97] [Goo93] [GR98] W. D. Frazer and A. C. McKellar. \Samplesort: A Sampling Approach to Minimal Storage Tree Sorting." Journal of the ACM, 17(3):496{507, 1970. High Performance Fortran Forum. \High Performance Fortran Language Speci cation." Scienti c Programming, 2(1{2):1{170, 1993. S. Fortune and J. Wyllie. \Parallelism in Random Access Machines." In Proceedings of the 10th Annual Symposium on Theory of Computing, pp. 114{118, 1978. D. Gelernter. \Generative Communication in Linda." ACM Transactions on Programming Languages and Systems, 7(1):80{112, 1985. G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley. \A user's guide to PICL: A portable instrumented communication library." Technical Report TM-11616, Oak Ridge National Laboratory, 1990. Phillip Gibbons. \A more pratical PRAM model." In 1st ACM Symposium on Parallel Algorithms and Architectures, pp. 158{168, 1989. Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, and Thanasis Tsantilas. \Portable and Ecient Parallel Computing Using the BSP Model." IEEE Transactions on Computers, 48(7):670{689, 1999. Phillip Gibbons, Yossi Matias, and Vijaya Ramachandran. \The QRQW PRAM: Accounting for Contention in Parallel Algorithms." In Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 638{648, January 1994. Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. \Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?" In 9th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 72{83, 1997. M. Goodrich. \Parallel Algorithms Column 1: Models of Computation." SIGACT News, 24:16{21, December 1993. Mark W. Goudreau and Satish B. Rao. \Single Message vs. Batch Communication." In M.T. Heath, A. Ranade, and R.S. Schreiber, editors, Algorithms for Parallel Processing, volume 105 of IMA Volumes in Mathematics and Applications, pp. 61{74. Springer-Verlag, 1998. 127 [GS96] Alexandros V. Gerbessiotis and Constantinos J. Siniolakis. \Deterministic Sorting and Randomized Mean Finding on the BSP Model." In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 223{232, June 1996. [GV94] Alexandros V. Gerbessiotis and Leslie G. Valiant. \Direct BulkSynchronous Parallel Algorithms." Journal of Parallel and Distributed Computing, 22(2):251{267, August 1994. [HBJ96] David R. Helman, David A. Bader, and Joseph JaJa. \Parallel Algorithms for Personalized Communication and Sorting with an Experimental Study." In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 211{222, June 1996. [HC83] J.S. Huang and Y.C. Chow. \Parallel Sorting and Data Partitioning by Sampling." In IEEE Computer Society's Seventh International Computer Software & Applications Conference (COMPSAC'83), pp. 627{631, November 1983. [HH98] P. Husbands and J. C. Hoe. \MPI-StarT: Delivering Network Performance to Numerical Applications." 
In Supercomputing '98, 1998. [HJS97] Jonathan M.D. Hill, Stephen A. Jarvis, Constantinos Siniolakis, and Vasil P. Vasilev. \Portable and Architecture Independent Parallel Performance Tuning Using a Call-Graph Pro ling Tool: A Case Study in Optimising SQL." Technical Report PRG-TR-17-97, Oxford University Computing Laboratory, 1997. [HMS98] Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. \BSPlib: The BSP Programming Library." Parallel Computing, 24(14):1947{1980, 1998. [Hoa62] C.A.R. Hoare. \Quicksort." Computer Journal, 5(1):10{15, 1962. [HPR92] William L. Hightower, Jan F. Prins, and John H. Reif. \Implementations of Randomized Sorting on Large Parallel Machines." In 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 158{167, June 1992. [HX98] Kai Hwang and Zhiwei Xu. Scalable Parallel Computing. McGrawHill, 1998. [Ion96] Mihai Florin Ionescu. \Optimizing Parallel Bitonic Sort.". Master's thesis, University of California at Santa Barbara, 1996. 128 [JW98] [KBG00] [KGG94] [KHP99] [KPS93] [KS99] [KSF00] [LB96] [Lei93] [LM88] Ben Juurlink and Harry Wijsho . \A Quantitative Comparison of Parallel Computation Models." ACM Transactions on Computer Systems, 16(3):271{318, 1998. Thilo Kielmann, Henri Bal, and Sergei Gorlatch. \Bandwidth-ecient Collective Communicaiton for Cluster Wide Area Systems." In 14th International Parallel and Distributed Processing Symposium, pp. 492{499, 2000. V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, 1994. Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal Aske Platt, and Raoul A. F. Bhoedjang. \MPI's Reduction Operations in Clustered Wide Area Systems." In Message Passing Interface Developer's and User's Conference, pp. 43{52, Atlanta, GA, March 1999. A. Khokhar, V. Prasanna, M. Shaaban, and C. Wang. \Heterogeneous computing: Challenges and opportunities." Computer, 26(6):18{27, June 1993. Danny Krizanc and Anton Saarimaki. \Bulk synchronous parallel: practical experience with a model for parallel computing." Parallel Computing, 25(2):159{181, 1999. Nicholas T. Karonis, Bronis R. De Supinski, Ian Foster, and William Gropp. \Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance." In 14th International Parallel and Distributed Processing Symposium, pp. 377{384, 2000. Bruce B. Lowekamp and Adam Beguelin. \ECO: Ecient Collective Operations for Communication on Heterogeneous Networks." In International Parallel Processing Symposium, pp. 399{405, Honolulu, HI, 1996. Tom Leighton. Introduction to Prallel Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1993. Charles Leiserson and Bruce M. Maggs. \Communication-Ecient Parallel Algorithms for Distributed Random-Access Machines." Algorithmica, 3:53{77, 1988. 129 [LMR95] Zhiyong Li, Peter H. Mills, and John H. Reif. \Models and Resource Metrics for Parallel and Distributed Computation." In Proceedings of 28th Annual Hawaii International Conference on System Sciences, January 1995. [McC93] W. F. McColl. \General Purpose Parallel Computing." In A. M. Gibbons and P. Spirakis, editors, Lectures in Parallel Computation, Proceedings 1991 ALCOM Spring School on Parallel Computation, pp. 337{391. Cambridge University Press, 1993. [MMT95] Bruce M. Maggs, Lesley R. Matheson, and Robert E. Tarjan. \Models of Parallel Computation: A Survey and Synthesis." 
In Proceedings of the 28th Hawaii International Conference on System Sciences, volume 2, pp. 61{70. IEEE Press, January 1995. [Mor98a] Pat Morin. \Coarse-Grained Parallel Computing on Heterogeneous Systems." In Proceedings of the 1998 ACM Symposium on Applied Computing, pp. 629{634, 1998. [Mor98b] Pat Morin. \Two Topics in Applied Algorithmics.". Master's thesis, Carleton University, 1998. [MR95] Philip McKinley and David Robinson. \Collective Communication in Wormhole-Routed Massively Parallel Computers." IEEE Computer, 28(12):39{50, December 1995. [RV87] J. H. Reif and L. G. Valiant. \A Logarithmic Time Sort for Linear Size Networks." Journal of the ACM, 34(1):60{76, 1987. [SDA97] Howard J. Siegel, Henry G. Dietz, and John K. Antonio. \Software Support for Heterogeneous Computing." In Allen B. Tucker, editor, The Computer Science and Engineering Handbook, pp. 1886|1909. CRC Press, 1997. [SG97] Gregory Shumaker and Mark W. Goudreau. \Bulk-Synchronous Parallel Computing on the Maspar." In World Multiconference on Systemics, Cybernetics and Informatics, volume 1, pp. 475{481, July 1997. Invited paper. [SHM97] David B. Skillicorn, Jonathon M. D. Hill, and W. F. McColl. \Questions and Answers About BSP." Scienti c Programming, 6(3):249{ 274, 1997. 130 [Sny86] Lawrence Snyder. \Type Architectures, Shared Memory and the Corollary of Modest Potential." Annual Review of Computer Science, pp. 289{318, 1986. [SOJ96] Marc Snir, Steve Otto, Steven Jus-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press, 1996. [ST98] David B. Skillicorn and Domenica Talia. \Models and Languages for Parallel Computation." ACM Computing Surveys, 30(2):123{169, June 1998. [Sun90] V. S. Sunderam. \PVM: A framework for parallel distributed computing." Concurency: Practice and Experience, 2(4):315{349, 1990. [Val90a] Leslie G. Valiant. \A bridging model for parallel computation." Communications of the ACM, 33(8):103{111, 1990. [Val90b] Leslie G. Valiant. \General Purpose Parallel Architectures." In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A: Algorithms and Complexity, chapter 18, pp. 943{971. MIT Press, Cambridge, MA, 1990. [Val93] Leslie G. Valiant. \Why BSP Computers?" In Proceedings of the 7th International Parallel Processing Symposium, pp. 2{5. IEEE Press, April 1993. [WG98] Ti ani L. Williams and Mark W. Goudreau. \An experimental evaluation of BSP sorting algorithms." In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing Systems, pp. 115{118, October 1998. [WP00] Ti ani L. Williams and Rebecca J. Parsons. \The Heterogeneous Bulk Synchronous Parallel Model." In Parallel and Distributed Processing, volume 1800 of Lecture Notes in Computer Science, pp. 102{108. Springer-Verlag, Cancun, Mexico, May 2000. [WWD94] Charles C. Weems, Glen E. Weaver, and Steven G. Dropsho. \Linguistic Support for Heterogeneous Parallel Processing: A Survey and an Approach." In Proceedings of the Heterogeneous Computing Workshop, pp. 81{88, 1994. 131