A General-Purpose Model for
Heterogeneous Computation
by
Tiffani L. Williams
B.S. Marquette University, 1994
A dissertation submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in the School of Electrical Engineering and Computer Science
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida
Fall Term
2000
Major Professor: Rebecca Parsons
© 2000 Tiffani L. Williams
Abstract
Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a
diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts, and communication should be minimized over slow network links. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs.
We propose the k-Heterogeneous Bulk Synchronous Parallel (HBSPk ) model,
which is an extension of the BSP model of parallel computation, as a framework for developing applications for heterogeneous parallel environments. The
BSP model provides guidance on designing applications for good performance
on homogeneous parallel machines. However, it is not appropriate for modeling
heterogeneous computation since it assumes that all processors are identical and
limits its view of the homogeneous communication network to one layer. The
HBSPk model extends BSP hierarchically to address k-level heterogeneous parallel systems. Under HBSPk , improved performance results from exploiting the
speeds of the underlying heterogeneous computing components.
Collective communication algorithms provide the foundation for our investigation of the HBSPk model. Efficient collective communication operations must be available for parallel programs to exhibit good performance on heterogeneous systems. We develop and analyze six collective communication algorithms (gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast) for the HBSPk model. Experimental results demonstrate the improved performance that results from effectively exploiting the heterogeneity of the underlying system. Moreover, the model predicts the performance trends of the collective routines. Improved performance is not a result of programmers having to account for myriad differences in a heterogeneous environment. By hiding the non-uniformity of the underlying system from the application developer, the HBSPk model offers a framework that encourages the design of heterogeneous parallel software.
To my mother.
Acknowledgments
First, I would like to thank my mother, who throughout my life has always been
there to offer love, encouragement, and support. Secondly, I give thanks to my brother for showing me the meaning of perseverance.
I am grateful to the members of my doctoral committee who listened to this
thesis in its various stages and reacted with patience and incisive suggestions.
In particular, I would like to thank my academic advisor, Rebecca Parsons, for
offering the time necessary to make my journey successful and Narsingh Deo for sparking my interest in parallel computation.
The Florida Education Fund (FEF) provided the financial support that made this research a finished product. Through FEF, I have met three great friends, Keith Hunter, Dwayne Nelson, and Larry Davis, two of whom have a decent racquetball game. Without Keith's persistence, I would never have met Mrs. Jacqueline Smith, who from the very first day we spoke has been one of my
greatest advocates.
I would also like to acknowledge the "members" of the Evolutionary Computing Lab (Marc Smith, Paulius Micikevicious, Grace Yu, Jaren Johnston, Larry Davis, Yinn Wong, Lynda Vidot, Denver Williams, and Bill Allen) for the entertaining, yet scholarly discussions.
Lastly, I give thanks to the unsung heroes who actually read this dissertation.
Enjoy.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 A Review of Models of Parallelism . . . . . . . . . . . . . . . . . . . . 6
2.1 Computational Models . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 PRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Bridging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Data-Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Message-Passing . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Shared-Memory . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Heterogeneous Computing . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 HCGM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Cluster-M . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 PVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 A Case for BSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 The BSP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 BSP Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Experimental Approach . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Randomized Sample Sort . . . . . . . . . . . . . . . . . . . 36
3.2.3 Deterministic Sample Sort . . . . . . . . . . . . . . . . . . . 39
3.2.4 Bitonic Sort . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.5 Radix Sort . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 HBSPk: A Generalization of BSP . . . . . . . . . . . . . . . . . . . . 55
4.1 Machine Representation . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 HBSPk Collective Communication Algorithms . . . . . . . . . . . . 61
4.3.1 Gather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.3 Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.4 Prefix Sums . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.5 One-to-All Broadcast . . . . . . . . . . . . . . . . . . . . . 68
4.3.6 All-to-all broadcast . . . . . . . . . . . . . . . . . . . . . . 72
4.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 HBSP1 Collective Communication Performance . . . . . . . . . . . . 75
5.1 The HBSP Programming Library . . . . . . . . . . . . . . . . . . . 76
5.2 The HBSP1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Application Performance . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.1 Randomized Sample Sort . . . . . . . . . . . . . . . . . . . 97
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 100
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
A Collective Communication Performance Data . . . . . . . . . . . . . 108
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
List of Tables
3.1 BSP system parameters . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Algorithmic and model summaries using 16 processors on the SGI
Challenge and the Intel Paragon. Predicted and actual running
times are in μs/key. †For radix sort, the largest problem size that
could be run on both machines was 4,194,304 keys. . . . . . . . . 36
4.1 Definitions of Notations . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 The functions that constitute the HBSPlib interface. . . . . . . . . 77
5.2 Specification of the nodes in our heterogeneous cluster. ‡A 2-processor system, where each number is for a single CPU. . . . . . 79
5.3 BYTEmark benchmark scores. . . . . . . . . . . . . . . . . . . . . 81
5.4 Cluster speed and synchronization costs. . . . . . . . . . . . . . . 83
5.5 rj values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Randomized sample sort performance. Factor of improvement is determined by Tu/Tb. The problem size ranges from 10^4 to 10^5
integers. Each data point represents the average of 10 runs on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . 98
A.1 Actual execution times (in seconds) for gather. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote
the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point
represents the average of 10 runs on a cluster comprised of 2, 4, 6,
8, and 10 heterogeneous processors. . . . . . . . . . . . . . . . . . 110
A.2 Predicted execution times (in seconds) for gather. The problem
size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu)
denote the execution time assuming a slow and fast root node,
respectively. Tb is the runtime for balanced workloads. Each data
point represents the predicted performance on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . 111
A.3 Actual execution times (in seconds) for scatter. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote
the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data point
represents the average of 10 runs on a cluster comprised of 2, 4, 6,
8, and 10 heterogeneous processors. . . . . . . . . . . . . . . . . . 112
A.4 Predicted execution times (in seconds) for scatter. The problem
size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu)
denote the execution time assuming a slow and fast root node,
respectively. Tb is the runtime for balanced workloads. Each data
point represents the predicted performance on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . 113
A.5 Actual execution times (in seconds) for single-value reduction. The
problem size ranges from 100KB to 1000KB of integers. Ts and
Tf (Tu ) denote the execution time assuming a slow and fast root
node, respectively. Tb is the runtime for balanced workloads. Each
data point represents the average of 10 runs on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . 114
A.6 Predicted execution times (in seconds) for single-value reduction.
The problem size ranges from 100KB to 1000KB of integers. Ts
and Tf (Tu) denote the execution time assuming a slow and fast
root node, respectively. Tb is the runtime for balanced workloads.
Each data point represents the predicted performance on a cluster
comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . 115
A.7 Actual execution times (in seconds) for point-wise reduction. The
problem size ranges from 100KB to 1000KB of integers. Ts and
Tf denote the execution time assuming a slow and fast root node,
respectively. Each data point represents the average of 10 runs on
a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors. 116
A.8 Predicted execution times (in seconds) for point-wise reduction.
The problem size ranges from 100KB to 1000KB of integers. Ts and
Tf denote the execution time assuming a slow and fast root node,
respectively. Each data point represents the predicted performance
on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . 117
A.9 Actual execution times (in seconds) for prefix sums. The problem
size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu)
denote the execution time assuming a slow and fast root node,
respectively. Tb is the runtime for balanced workloads. Each data
point represents the average of 10 runs on a cluster comprised of
2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . . . 118
A.10 Predicted execution times (in seconds) for prefix sums. The problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu)
denote the execution time assuming a slow and fast root node, respectively. Tb is the runtime for balanced workloads. Each data
point represents the predicted performance on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . 119
A.11 Actual execution times (in seconds) for one-to-all broadcast. The
problem size ranges from 100KB to 1000KB of integers. Ts and
Tf (Tu ) denote the execution time assuming a slow and fast root
node, respectively. Tb is the runtime for balanced workloads. Each
data point represents the average of 10 runs on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . 120
A.12 Predicted execution times (in seconds) for one-to-all broadcast.
The problem size ranges from 100KB to 1000KB of integers. Ts
and Tf (Tu) denote the execution time assuming a slow and fast
root node, respectively. Tb is the runtime for balanced workloads.
Each data point represents the predicted performance on a cluster
comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . 121
A.13 Actual execution times (in seconds) for all-to-all broadcast. There
are two algorithms compared: simultaneous broadcast (SB) and
intermediate destination (ID). The problem size ranges from 100KB
to 1000KB of integers. Each data point represents the average of
10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous
processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.14 Predicted execution times (in seconds) for all-to-all broadcast.
There are two algorithms compared: simultaneous broadcast (SB)
and intermediate destination (ID). The problem size ranges from
100KB to 1000KB of integers. Each data point represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . 123
List of Figures
2.1 The PRAM model. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Bus, mesh, and hypercube networks. . . . . . . . . . . . . . . . . 11
2.3 The BSP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 A superstep. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 HPF data allocation model. . . . . . . . . . . . . . . . . . . . . . 17
2.6 Messages sent without context are erroneously received. . . . . . . 20
3.1 Code fragment demonstrating BSMP . . . . . . . . . . . . . . . . 33
3.2 Predicted and actual execution time per key of randomized sample sort on an SGI Challenge and an Intel Paragon. Each plot
represents the run time on a 2, 4, 8, or 16 processor system. . . . 38
3.3 Predicted and actual execution time per key of deterministic sample sort on an SGI Challenge and an Intel Paragon. Each plot
represents the run time on a 2, 4, 8, or 16 processor system. . . . 41
3.4 A schematic representation of a bitonic sorting network of size
n = 8. BMk denotes a bitonic merging network of input size
k that sorts the input in either monotonically increasing (+) or
decreasing (-) order. The last merging network (BM8+ ) sorts the
input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 A bitonic sorting network of size n = 8. Each node compares two
keys, as indicated by the edges and selects either the maximum or
the minimum. Shaded and unshaded nodes designate where the
minimum and maximum of two keys is placed, respectively. . . . . 44
3.6 Predicted and actual execution time per key of bitonic sort on an
SGI Challenge and an Intel Paragon. Each plot represents the run
time on a 2, 4, 8, or 16 processor system. . . . . . . . . . . . . . . 46
3.7 Global rank computation. The computation is illustrated with 4 processors and 4 buckets for the values 0-3. Each processor's t(i, j) value is shown inside of each bucket. The number outside of a bucket reflects the b(i, j) value after the multiscan. After the multicast, g(i, j) reflects the starting position in the output where the first key with value i on processor j belongs. For example, P0 will place the first key with value "0" at position 0, the "1" keys starting at position 7, etc. . . . . . . . . . . . . . . . . . . . . . . 49
3.8 Predicted and actual execution time per key of radix sort on an
SGI Challenge and an Intel Paragon. Each plot represents the run
time on a 2, 4, 8, or 16 processor system. . . . . . . . . . . . . . . 51
4.1 An HBSP2 cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Tree representation of the cluster shown in Figure 4.1. . . . . . . . 57
4.3 An HBSP2 prefix sums computation. Execution starts with the leaf nodes (or HBSP0 machines) in the top diagram. Here, the nodes send the total of their prefix sums computation to the coordinator of their cluster. The upward traversal of the computation continues until the root node is reached. The bottom diagram shows the downward execution of the computation. The leaf nodes hold the final result. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 HBSP2 prefix sums . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1 Gather actual performance. The improvement factor is determined
by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100KB to
1000KB of integers. Each data point represents the average of 10
runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous
processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Gather predicted performance. The improvement factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100KB
to 1000KB of integers. Each data point represents the predicted
performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Scatter actual performance. The improvement factor is determined
by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100KB to
1000KB of integers. Each data point represents the average of 10
runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous
processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Scatter predicted performance. The improvement factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100KB
to 1000KB of integers. Each data point represents the predicted
performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Single-value reduction actual performance. The improvement factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges
from 100KB to 1000KB of integers. Each data point represents
the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . 90
5.6 Single-value reduction predicted performance. The improvement
factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges
from 100KB to 1000KB of integers. Each data point represents the
predicted performance on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Point-wise reduction actual performance. The improvement factor
is determined by Ts/Tf. The problem size ranges from 100KB to
1000KB of integers. Each data point represents the average of 10
runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous
processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.8 Point-wise reduction predicted performance. The improvement
factor is determined by Ts/Tf. The problem size ranges from 100KB
to 1000KB of integers. Each data point represents the predicted
performance on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.9 Prefix sums actual performance. The improvement factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100KB
to 1000KB of integers. Each data point represents the average of
10 runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous
processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.10 Prefix sums predicted performance. The improvement factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from
100KB to 1000KB of integers. Each data point represents the
predicted performance on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . 93
5.11 One-to-all broadcast actual performance. The improvement factor
is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from
100KB to 1000KB of integers. Each data point represents the
average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . 94
5.12 One-to-all broadcast predicted performance. The improvement
factor is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges
from 100KB to 1000KB of integers. Each data point represents the
predicted performance on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors. . . . . . . . . . . . . . . . . . . . . . . . 94
5.13 All-to-all broadcast actual performance. There are two algorithms
compared: simultaneous broadcast (SB) and intermediate destination (ID). The improvement factor is given for SB versus ID.
The problem size ranges from 100KB to 1000KB of integers. Each
data point represents the average of 10 runs on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . . . . . . . 95
5.14 All-to-all broadcast predicted performance. There are two algorithms compared: simultaneous broadcast (SB) and intermediate
destination (ID). The improvement factor is given for (a) SB versus
ID. The problem size ranges from 100KB to 1000KB of integers.
Each data point represents the predicted performance on a cluster
comprised of 2, 4, 6, 8, and 10 heterogeneous processors. . . . . . 96
5.15 HBSP1 randomized sample sort. . . . . . . . . . . . . . . . . . . . 97
6.1 Processor allocation of p = 16 matrix blocks. Each processor receives the same block size on a homogeneous cluster. On heterogeneous clusters, processors receive varying block sizes. . . . . . . 107
CHAPTER 1
Introduction
Heterogeneous computing environments are becoming an increasingly popular
platform for executing parallel applications [KPS93, SDA97]. Such environments
consist of a wide range of architecture types such as Pentium PCs, shared-memory
multiprocessors, and high-performance workstations. Heterogeneous parallel systems offer considerably more computational power at a lower cost than traditional parallel machines. Additionally, heterogeneous systems provide users with the opportunity to reuse existing computer hardware and combine different models of computation. Despite these advantages, application developers must contend with myriad differences (such as varying computer speeds, different network protocols, and incompatible data formats) in a heterogeneous environment. Current parallel programs are not written to handle such non-uniformity. Thus, a new approach is necessary to promote the development of efficient applications for
heterogeneous parallel systems.
We propose the k-Heterogeneous Bulk Synchronous Parallel (HBSPk ) model,
which is an extension of the BSP model of parallel computation [Val90a], as
a framework for the development of heterogeneous parallel applications. The
BSP model provides guidance on designing applications for good performance on
homogeneous parallel machines. Furthermore, experimental results have demonstrated the utility of the model (in terms of portability, efficiency, and predictability)
on diverse parallel platforms for a wide variety of non-trivial applications [GLR99,
KS99, WG98]. Since the BSP model assumes that all processors have equal computation and communication abilities, it is not appropriate for heterogeneous
systems. Instead, it is only suitable for 1-level homogeneous architectures, which
consist of some number of identical processors connected by a single communications network.
The HBSPk model extends BSP hierarchically to address k-level¹ heterogeneous parallel systems. Here, k represents the number of network layers present
in the heterogeneous environment. Unlike BSP, the HBSPk model describes multiple heterogeneous parallel computers connected by some combination of internal buses, local-area networks, campus-area networks, and wide-area networks.
As a result, it can guide the design of applications for traditional parallel systems, heterogeneous or homogeneous clusters [Buy99a, Buy99b], the Internet,
and computational grids [FK98]. Furthermore, HBSPk incorporates parameters
that reflect the relative computational and communication speeds at each of the
k levels.
Performance gains in heterogeneous environments result from effectively exploiting the speeds of the underlying components. As with homogeneous architectures, good algorithmic performance in heterogeneous environments is the result of balanced machine loads. Executing standard parallel algorithms on heterogeneous platforms leads to the slowest processor becoming a bottleneck, which reduces overall system performance. Computation and communication should be minimized on slower processors. On the other hand, faster processors should be used as often as possible. The HBSPk cost model guides the programmer in balancing these objectives to produce efficient heterogeneous programs.
¹The terms level and layer will be used interchangeably throughout the text.
It is imperative that a unifying model for heterogeneous computation emerges
to avoid the software development problems that traditional parallel computing
currently faces. Frequently, high-performance algorithms and system software are
obtained by exploiting architectural features such as the number of processors, memory organization, and communication latency of the underlying parallel machine. However, designing software to accommodate the specifics of one machine often
results in inadequate performance on other machines. Hence, the goal of parallel
computing is to produce architecture-independent software that takes advantage
of a parallel machine's salient characteristics. The HBSPk model seeks to be a
general-purpose model for heterogeneous computation.
Most parallel algorithms require processors to exchange data. There are a
few common basic patterns of interprocessor communication that are frequently
used as building blocks in a variety of parallel algorithms. Proper implementation of these collective communication operations is vital to the efficient execution of the parallel algorithms that use them. Collective communication for homogeneous parallel environments has been thoroughly researched over the years [BBC94, BGP94, MR95]. Collective operations designed for traditional parallel machines are not adequate for heterogeneous environments. As a result, we design and analyze six collective communication algorithms (gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast) for heterogeneous parallel systems. The intent is not to obtain the best possible algorithms,
but rather to point to the potential advantages of using the HBSPk model. Afterwards, we present a randomized sample sort algorithm based on our HBSPk
collective communication operations.
We test the effectiveness of our collective operations on a non-dedicated, heterogeneous network of workstations. HBSPlib, a library based on BSPlib [HMS98],
provides the foundation for HBSP1 programming. Experimental results demonstrate that our collective algorithms have increased performance on heterogeneous platforms. The experiments also validate that randomized sample sort
benefits from using the HBSPk collective communication algorithms. Moreover, the model accurately predicts the performance trends of the communication algorithms. Improved performance is not a result of programmers having to account for differences in a heterogeneous environment. By hiding the non-uniformity of the underlying system from the application developer, the HBSPk model offers
a framework that encourages the design of heterogeneous parallel software in an
architecture-independent manner.
The ultimate goal of this work is to provide a unifying framework that makes
parallel computing a viable option for heterogeneous platforms. As heterogeneous
parallel systems seem likely to be the platform of choice in the foreseeable future,
we propose the HBSPk model and seek to demonstrate that it can provide a
simple programming approach, portable applications, efficient performance, and predictable execution. Our results fall into four categories:
- Model development for heterogeneous computing systems.
- Infrastructure to support HBSPk programming and analysis.
- HBSPk application programming.
- Experimentation examining the effectiveness of deriving portable, efficient, predictable, and scalable algorithms through the formalisms of the model.
The rest of the thesis addresses each of the above contributions. Chapter 2
provides a review of various parallel computational models. Of the models considered, we believe that BSP provides a fundamentally sound approach to parallel
programming. Chapter 3 evaluates the utility of BSP in developing efficient sorting applications. The HBSPk model and its associated collective communication
algorithms are presented in Chapter 4. The merits of HBSPk are experimentally
investigated in Chapter 5. Conclusions and directions for future work are given
in Chapter 6.
CHAPTER 2
A Review of Models of Parallelism
The success of sequential computing can be attributed to the Random-Access
Machine (RAM) model [CR73] providing a single, general model of serial computation. The model is accurate for a vast majority of programs. There are a few
cases, such as programs that perform extreme amounts of disk I/O, where the
model does not accurately reflect program execution. Due to its generality and stability, the RAM model continually supports advancements made in sequential programming. Moreover, these concentrated efforts have allowed the development of software-engineering techniques, algorithmic paradigms, and a robust
complexity theory.
Unfortunately, parallel computing has not enjoyed success similar to its sequential counterpart. Parallel computers have made a tremendous impact on the
performance of large-scale scientific and engineering applications such as weather forecasting, earthquake prediction, and seismic data analysis, but the effective design and implementation of algorithms for them remains problematic. Frequently, high-performance algorithms and system software are obtained by exploiting architectural features such as the number of processors, memory organization, and communication latency of the underlying machine. Designing software to accommodate the specifics of one machine often results in inadequate performance on
other machines. Thus, the goal of parallel computing is to produce architecture-
independent software that takes advantage of a parallel machine's salient characteristics.
Without a universal model of parallel computation, there is no foundation
for the development of portable and efficient parallel applications. As a result, numerous models have been developed that attempt to model algorithm execution accurately on existing parallel machines. However, we narrow our focus to two of the most popular approaches. One method is the development of a computational model (an abstraction of a computing machine) that guides the high-level design
of parallel algorithms as well as provides an estimation of performance. The
other approach is to develop a programming model, a set of language constructs
that can be used to express an algorithmic concept in a programming language.
For example, the programming languages Pascal and C are designed within the
imperative model, which consists of constructs such as arrays, control structures,
procedures, and recursion.
2.1 Computational Models
A computational model is an abstraction of a computing machine that guides
the high-level design of parallel algorithms as well as provides an estimation of
performance. In the following subsections, we focus on three classes of computational models (PRAM models, network models, and bridging models) since
these models have attracted considerable attention from the research community.
For an examination of models not discussed here, we refer the reader to [Akl97],
[LMR95], and [MMT95].
Figure 2.1: The PRAM model.
2.1.1 PRAM
The Parallel Random Access Machine (PRAM) [FW78] is the most widely used
parallel computational model. The PRAM model consists of p sequential processors sharing a global memory as shown in Figure 2.1. During each time step
or cycle, each processor executes a RAM instruction or accesses global memory. After each cycle, the processors implicitly synchronize to execute the next
instruction.
In the PRAM model, more than one processor can try to read from or
write into the same memory location simultaneously. CRCW (Concurrent-read,
concurrent-write), CREW (Concurrent-read, exclusive-write), and EREW (Exclusive-read, exclusive-write) PRAMs [FW78] handle simultaneous access of several processors to the same location of global memory. The CRCW PRAM, the most
powerful PRAM model, uses a protocol to resolve concurrent writes. Example
protocols include arbitration (an arbitrary processor proceeds with the write operation), prioritization (the processor with the highest priority writes the result),
and summation (the sum of all quantities is written).
The PRAM model assumes that synchronization and communication are essentially free. However, these overheads can significantly affect algorithm performance since existing parallel machines do not adhere to these assumptions.
By ignoring costs associated with exploiting parallelism, the PRAM is a simple
abstraction which allows the designer to expose the maximum possible computational parallelism in a given task. Thus, the PRAM provides a measure of the
ideal parallel time complexity.
Many modifications to the PRAM have been proposed that attempt to bring it closer to practical parallel computers. Goodrich [Goo93] and McColl [McC93] survey the PRAM model and its extensions. A brief overview of machine characteristics that have been the focus of efforts to improve the PRAM is given
below.
1. Memory Access. The LPRAM (Local-memory PRAM) [ACS90] augments
the CREW PRAM by associating with each processor an unlimited amount
of local private memory. The QRQW (Queue-read, queue-write) PRAM [GMR94]
assumes that simultaneous accesses to the same memory block will be inserted into a request queue and served in a FIFO manner. The cost of a
memory access is a function of the queue length.
2. Asynchrony. The Phase PRAM [Gib89] extends the PRAM by allowing
asynchronous execution. A computation is divided into phases and all processors run asynchronously within a phase. An explicit synchronization is
performed at the end of a phase.
3. Latency. The BPRAM (Block PRAM) [ACS89], an extension of the LPRAM,
addresses communication latency by taking into account the reduced cost
for transferring a contiguous block of data. The BPRAM model is defined with two parameters, L (latency or startup time) and b (block size). Although it costs one unit of time to access local memory, accessing a block
of size b of contiguous locations from global memory costs L + b time units.
4. Bandwidth. The DRAM (Distributed RAM) [LM88] eliminates the paradigm
of global shared memory and replaces it with only private distributed memory. Additionally, the communication topology of the network is ignored.
To address the notion of limited bandwidth, the model proposes a cost
function for a non-local memory access which is based on the maximum
possible congestion for a given data partition and execution sequence. The
function attempts to provide scheduling incentives to respect limited access
to non-local data.
2.1.2 Network
Concurrent with the study of PRAM algorithms, there has been considerable research on network-based models. Figure 2.2 illustrates several different
networks. In these models, processors send messages to and receive messages
from other processors over a given network. Communication is only allowed between directly connected processors. Other communication is explicitly forwarded
through intermediate nodes. In each step, the nodes can communicate with their
nearest neighbors and operate on local data. Leighton [Lei93] provides a survey
and analysis of these models.
Many algorithms have been designed to run efficiently on particular network topologies. Examples are parallel prefix (tree) and FFT (butterfly). Although this approach can lead to very fine-tuned algorithms, it has some disadvantages.
First, algorithms designed for one network may not perform well on other networks. Hence, to solve a problem on a new machine, it may be necessary to design a completely new algorithm. Second, algorithms that take advantage of a particular network tend to be more complicated than algorithms designed for more abstract models like the PRAM since they incorporate some of the details of the network.

Figure 2.2: Bus, mesh, and hypercube networks.
2.1.3 Bridging
PRAM and network models are simple models which appeal to algorithm designers. However, neither approach facilitates the development of portable and
efficient algorithms for a variety of parallel platforms. This has prompted the introduction of bridging models [Val90a, Val93]. Ideally, a bridging model provides a unified abstraction capturing architectural features that are significant to the performance of parallel programs. An algorithm designed on a bridging model should be readily implementable on a variety of parallel architectures, and its efficiency on the model should be a good reflection of its actual performance.
The Bulk Synchronous Parallel (BSP) model [Val90a] is a bridging model
that consists of p processor/memory modules, a communication network, and a
mechanism for efficient barrier synchronization of all the processors. Figure 2.3
shows the BSP model. A computation consists of a sequence of supersteps.
During a superstep, each processor performs asynchronously some combination of
local computation, message transmissions, and message arrivals. Each superstep
is followed by a global synchronization of all the processors. A message sent in
one superstep is guaranteed to be available to the destination processor at the
beginning of the next superstep. A superstep is shown in Figure 2.4.
Figure 2.3: The BSP model.

Figure 2.4: A superstep (local computations, global communications, barrier synchronization).

Three parameters characterize the performance of a BSP computer: p represents the number of processors, L measures the minimal number of time steps between successive synchronization operations, and g reflects the minimal time interval between consecutive message transmissions on a per-processor basis. Both g and L are measured in terms of basic computational operations. The time
complexity of a superstep in a BSP program is:
w + gh + L
where w is the maximum number of basic computational operations executed by
any processor in the local computation phase, and h is the maximum number
of messages sent or received by any processor. The total execution time for the
program is the sum of all the superstep times.
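As a concrete illustration (the numbers here are invented for exposition and are not measurements from this thesis), consider a program with two supersteps running on a BSP machine with g = 4 and L = 100. If superstep 1 performs at most w1 = 10,000 local operations per processor and communicates at most h1 = 500 messages per processor, while superstep 2 has w2 = 20,000 and h2 = 100, the predicted running time is

    T = (w1 + gh1 + L) + (w2 + gh2 + L)
      = (10,000 + 4 · 500 + 100) + (20,000 + 4 · 100 + 100)
      = 12,100 + 20,500 = 32,600 basic operations.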
An approach related to BSP is the LogP model [CKP93, CKP96]. LogP models the performance of point-to-point messages with three parameters: o (computation overhead of handling a message), g (time interval between consecutive
message transmissions at a processor), and L (latency for transmitting a single
message). The main difference between the two models is that, under LogP, the
scheduling of communication at the single-message level is the responsibility of
the application programmer. Under BSP, the underlying system performs that
task. Although proponents of LogP argue that their model offers a more flexible style of programming, Goudreau and Rao [GR98] argue that the advantages
are largely illusory, since both approaches lead to very similar high-level parallel
algorithms. Moreover, cross simulations between the models show that LogP is
no more powerful than BSP from an asymptotic point of view [BHP96]. Thus,
BSP's simpler programming style is perhaps to be preferred.
Other bridging models have been proposed. Candidate Type Architecture
(CTA) [AGL98, Sny86] is an early two parameter model (communication cost
L and number of processors p) that was the result of a multidisciplinary effort. Blelloch et al. [BGM95] propose the (d, x)-BSP model as a refinement of BSP
that provides more detailed modeling of memory bank contention and delay.
LogP-HMM [LMR95] extends the LogP model with a hierarchical memory model
characterizing each processor. All of the models discussed above are distributed-memory models. However, there are arguments in support of the shared-memory
abstraction. The Queuing Shared Memory (QSM) [GMR97] model is one such
example.
2.1.4 Summary
The various models show the lack of consensus for a computational model (or
models) of parallel computing. However, the proposed models demonstrate that
a small set of machine characteristics are important: communication latency,
communication overhead, communication bandwidth, execution synchronization,
and memory hierarchy. Early computational models used a few parameters to
describe the features of parallel machines. Unfortunately, the assumptions made
by these models are not consistent with existing parallel machines. These simplified assumptions led to inaccuracies in predicting the actual running time of
algorithms. Recent models attempt to bridge the gap between software and hardware by using more parameters to capture the essential characteristics of parallel
machines. Furthermore, these models appear to be a promising approach towards
the systematic development of parallel software.
2.2 Programming Models
A programming model is a set of language constructs that can be used to express
an algorithmic concept in a programming language. Parallel programming is
similar to sequential programming in that there are many different languages to
select from to solve a problem. In this section, we restrict our attention to three
dominant programming approaches: data-parallel, message-passing, and shared-memory. A good survey on this topic is the paper by Skillicorn and Talia [ST98].
2.2.1 Data-Parallel
The data-parallel model provides constructs for expressing that a statement sequence is to be executed in parallel on different data. Data-parallel languages are attractive because parallelism is not expressed as a set of processors whose interactions are managed by the user, but rather as parallel operations on aggregate data structures. Typically, the programmer must analyze the algorithms to find the parts which can be executed in parallel. The compiler then maps the data-parallel parts onto the underlying hardware.
High-Performance Fortran (HPF) [For93] is a data-parallel language based on
Fortran-90. It adds more direct data parallelism by including directives to specify how data structures are allocated to processors, and constructs to carry out
data-parallel operations, such as reductions. The directives for data distribution
support a two-phase process in which an array is aligned, using the ALIGN directive, relative to a template or another array that has already been distributed.
The DISTRIBUTE directive is used to distribute an object (and any other objects
that may be aligned with it) onto an abstract processor array. An array distribution can be changed at any point by REDISTRIBUTE and REALIGN. The mapping
of abstract processors to physical processors is implementation dependent and is not specified in the language. Data distribution directives are recommendations
to an HPF compiler, not instructions. The compiler does not have to obey them
if it determines that performance can be improved by ignoring them.
Figure 2.5 illustrates HPF's data allocation model for the integer arrays A, B, and C of sizes 100, 100, and 101, respectively.

Figure 2.5: HPF data allocation model.

The ALIGN directive aligns
array element A(i) with element B(i-1). Afterwards, the data is distributed onto
4 logical processors. Arrays A and B are distributed in a block fashion whereas
array C is distributed cyclically¹. Thus, P1 consists of array elements A(1), A(2),
..., A(25), B(1), B(2), ..., B(24), C(1), C(5), C(9), ..., C(97), C(101); P2 consists
of array elements A(26), A(27), .., A(50), B(25), B(26), ..., B(49), C(2), C(6),
C(10), ..., C(98); etc. The same logical mapping can be used for different physical
mappings. For example, each logical node maps to a physical node in a 3x4 mesh.
However, all of the logical nodes map to a uniprocessor. HPF's data allocation
model results in using the same code (with all the directives) on both computers
without a change.
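To make the two distribution schemes concrete, the short C fragment below (written for this discussion; it is not HPF code and does not appear in the thesis) computes which of p logical processors owns a given element of an n-element array under a block and under a cyclic distribution, using the arrays of Figure 2.5 as a check:

    #include <stdio.h>

    /* Block distribution: the array is cut into p chunks of consecutive
       elements and chunk j is assigned to processor j (0-based). */
    static int block_owner(int index, int n, int p) {
        int chunk = (n + p - 1) / p;   /* ceiling(n / p) elements per chunk */
        return index / chunk;
    }

    /* Cyclic distribution: elements are dealt out round-robin,
       so element i goes to processor i mod p (0-based). */
    static int cyclic_owner(int index, int p) {
        return index % p;
    }

    int main(void) {
        int n = 100, p = 4;
        /* A(26), i.e., 0-based index 25, of the block-distributed array A
           lands on P2, matching A(26:50) in Figure 2.5. */
        printf("A(26) is owned by P%d\n", block_owner(25, n, p) + 1);
        /* C(6), i.e., 0-based index 5, of the cyclically distributed array C
           also lands on P2, matching C(2), C(6), C(10), ... in the text. */
        printf("C(6) is owned by P%d\n", cyclic_owner(5, p) + 1);
        return 0;
    }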
HPF also offers directives to provide hints for data dependence analysis. For instance, PURE asserts that a subroutine has no side effects so that its presence in a loop does not inhibit the loop parallelism. Additionally, the INDEPENDENT directive asserts that the loop has no loop-carried dependence, which allows the
compiler to parallelize the loop without any further analysis.
The most attractive feature of the data-parallel approach as exemplified in HPF is that the compiler takes on the job of generating communication code. However, program performance depends on how well the compiler handles interprocess communication and synchronization. The relative immaturity of these compilers usually means that they may not produce efficient code in many situations.
¹A block distribution evenly divides an array into a number of chunks (or blocks) of consecutive array elements, and allocates a block to a node. A cyclic distribution evenly divides an array so that every ith element is allocated to the ith node.
2.2.2 Message-Passing
A message-passing program consists of multiple processes having only local memory. Since processes reside in different address spaces, they communicate by sending and receiving messages. Typically, message-passing programming is done by
linking with and making calls to libraries which manage the data exchange between processes.
The Message Passing Interface (MPI) library [SOJ96] is the standard message-passing interface for writing parallel applications and libraries. An MPI program is a collection of concurrent communicating tasks belonging to a specified group. Task groups provide contexts through which MPI operations can be restricted to only the members of a particular group. The members of a group are assigned unique contiguous identifiers, called ranks, starting from zero. Since new groups
cannot be created from scratch, MPI provides a number of functions to create
new groups from existing ones. Point-to-point communications between processes are based on send and receive primitives that support both synchronous
and asynchronous communication. MPI also provides primitives for collective
communication. Collective operations execute when all tasks in the group call
the collective routine with matching parameters. Synchronization, broadcast,
scatter/gather, and reductions (min, max, multiply) are examples of collective
routines supported by MPI.
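As a small illustration of these collective operations (a sketch written for this survey, not code from the thesis), the following C program broadcasts a value from rank 0 to every task in the predefined communicator MPI_COMM_WORLD and then reduces one integer per task back to rank 0 with a sum:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, seed = 0, sum = 0, contribution;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank in the group */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of tasks in the group */

        /* Rank 0 chooses a value and broadcasts it to every task in the group. */
        if (rank == 0)
            seed = 42;
        MPI_Bcast(&seed, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Every task contributes seed + rank; the sum arrives at rank 0. */
        contribution = seed + rank;
        MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d tasks = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }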
One of the advantages of using MPI is that it facilitates the development of
portable parallel libraries. An important requirement for achieving this goal is to
guarantee a safe communication space in which unrelated messages are separated
from one another. MPI introduces a communicator, which binds a communication context to a group of tasks, to achieve a safe communication space. Having
a communication context allows library packages written in message-passing systems to protect or mark their messages so that they are not received (incorrectly)
by the user's code. Figure 2.6 illustrates the fundamental problem.

Figure 2.6: Messages sent without context are erroneously received.

In this figure, two processes are calling a library routine that also performs message passing. The library and user's code have both chosen the same tag to mark a message. Without context, messages are received in the wrong order. To solve this problem, a third tag that is assigned by the operating system is needed to distinguish user messages from library messages. Upon entrance to a library routine, for example, the software would determine this third tag and use it for all communications within the library.
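MPI's communicators provide exactly this kind of system-assigned "third tag": a library can duplicate the caller's communicator once and perform all of its communication in the private context of the duplicate. The sketch below is illustrative only; the names lib_t, lib_init, lib_call, and lib_finalize are invented for this example and are not part of MPI or of the thesis.

    #include <mpi.h>

    /* Hypothetical library handle; the type and function names are invented. */
    typedef struct { MPI_Comm comm; } lib_t;

    void lib_init(lib_t *lib, MPI_Comm user_comm) {
        /* The duplicate has the same group of tasks but a new, private context,
           so library traffic can never match the user's sends and receives. */
        MPI_Comm_dup(user_comm, &lib->comm);
    }

    void lib_call(lib_t *lib, int *value) {
        /* All internal communication uses lib->comm, never the user's communicator. */
        MPI_Bcast(value, 1, MPI_INT, 0, lib->comm);
    }

    void lib_finalize(lib_t *lib) {
        MPI_Comm_free(&lib->comm);
    }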
The functionality of message-passing libraries is relatively simple and easy to
implement. Unlike the data-parallel approach, the programmer must explicitly
implement a data distribution scheme and handle all interprocess communication. Consequently, the development of anything but simple programs is quite
difficult. Additionally, it is the programmer's responsibility to resolve data dependencies and avoid deadlock and race conditions. Thus, the performance of an
application under the message-passing approach often depends upon the ability
of the developer.
Other message-passing models include p4 [BL92] and PICL [GHP90].
2.2.3 Shared-Memory
The shared-memory model is similar to the data-parallel model in that it has a
single address (global naming) space. It is similar to the message-passing model
in that it is multithreaded and asynchronous. Communication is done implicitly through shared reads and writes of variables. However, synchronization is
explicit. We base our discussion of shared-memory models on Linda. Other
shared-memory models include Orca and SR.
Linda [CG89, Gel85] is a shared-memory language that provides an extension
to standard imperative languages. In Linda, point-to-point communication is
replaced by a tuple space which is shared and accessible to all processes. The
tuple space is also associative. Items are removed from the tuple space using
pattern-matching rules rather than by being addressed directly. Thus, tuple
space is similar to a cache in that it is addressed associatively. Tuple space is
also anonymous. Once a tuple has been placed in tuple space, the system does
not keep track of its creator.
The Linda communication model contains three communication operations:
in, which reads and removes a tuple from the tuple space; rd, which reads a
tuple from the tuple space; and out, which adds a tuple to the tuple space. For
example, the rd operation
rd("Florida", ?X, "Orlando")
searches the tuple space for tuples of three elements, with a first element "Florida", last element "Orlando", and a second element of the same type as the variable
X. A match occurs if a tuple is found with the same number of elements and
the types and values of the corresponding elements are the same. If a matching
tuple is not found, the issuing processor must wait until a satisfying tuple enters
the tuple space. Besides these three basic operations, Linda provides the eval(x)
operation that implicitly creates a new process to evaluate the tuple x and inserts
the result in the tuple space.
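Putting the four operations together, a schematic C-Linda fragment (illustrative only, following the notation of the rd example above; it is not code from the thesis) might coordinate a producer and a worker as follows:

    /* Producer: spawn a worker process and publish a task tuple. */
    eval("worker", worker());      /* live tuple: worker() runs as a new process */
    out("task", 17);               /* add a ("task", 17) tuple to tuple space */

    /* Worker (inside worker()): block until a task tuple appears, then
       remove it and publish a result tuple. */
    in("task", ?n);                /* ?n is a formal; it receives the value 17 */
    out("result", n, n * n);

    /* Consumer: read the matching result without removing it. */
    rd("result", 17, ?square);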
There is a general feeling that shared-memory programming is easier than
message-passing programming [HX98]. One reason is that the shared-memory
abstraction is similar to the view of memory in sequential programming. However, for developing new, efficient parallel programs that are loosely synchronous and have regular communication patterns, the shared-variable approach is not necessarily easier than the message-passing one. Moreover, shared-memory programs may be more difficult to debug than message-passing ones. Since processes in a shared-memory program reside in a single address space, accesses to shared data must be protected by synchronization constructs such as locks and critical regions. As a result, subtle synchronization errors can easily occur that are difficult to detect. These problems occur less frequently in a message-passing
program, as the processes do not share a single address space.
2.2.4 Summary
From our survey of parallel programming models, two observations appear. First,
the programming models are mostly extensions of C or Fortran depicting the
programmer's reluctance to learn a completely new language. Secondly, parallel
programming models are evolving towards more high-level approaches. Thus, the
programmer is not responsible for handling all aspects of developing a parallel
application. Instead, compilers handle such things as data-dependence detection,
communication, synchronization, scheduling, and data-mapping. The trend towards higher-level programming models appears to be a good approach since it
provides for more robust and portable parallel software. Yet, the success of such
higher-level models clearly depends upon the advances of compiler technology.
2.3 Heterogeneous Computing
Several models exist to support heterogeneous parallel computation. Below, we
consider three models (HCGM, Cluster-M, and PVM) for developing applications for heterogeneous machines. The first two approaches are computational
approaches whereas PVM is a programming model. An overview of heterogeneous
models not discussed here appears in [SDA97] and [WWD94].
2.3.1 HCGM
The Heterogeneous Coarse-Grained Multicomputer (HCGM) model [Mor98a]
is a generalization of the CGM model [DFR93]. HCGM shares the same spirit
as the BSP and LogP models in that it attempts to provide a bridge between
the hardware and software layers of a heterogeneous machine. Formally, HCGM
models parallel computers consisting of p heterogeneous processors. Since processors have varying computing capabilities, si represents the speed of processor
Pi. The model assumes memory and communication speeds of the processors
are proportional to their computational speeds. As a result, faster processors
process and communicate more data. Here, si ≥ 1, and the slowest processor's si value is normalized to 1. The total speed of the parallel machine is the sum of the individual speeds, s = s0 + s1 + ... + sp-1.
The processors are interconnected by a network capable of routing any all-to-all
communication in which the total amount of data exchanged is O(m). The performance of an HCGM algorithm is measured in terms of computation time and
number of supersteps.
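For intuition, consider an illustrative instance (the numbers are invented and do not come from [Mor98a] or from this thesis): a machine with p = 3 processors of speeds s0 = 1, s1 = 2, and s2 = 3 has total speed s = 6. A computation over m = 600 data items that assigns work in proportion to speed gives processor Pi roughly m · si / s items:

    P0: 600 · 1/6 = 100 items,   P1: 600 · 2/6 = 200 items,   P2: 600 · 3/6 = 300 items,

so all three processors finish their local work in about the same time.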
A model similar in structure and philosophy to HCGM is the Heterogeneous Bulk Synchronous Parallel (HBSP) model [WP00]. The main difference is that HCGM is not intended to
be an accurate predictor of execution times whereas HBSP attempts to provide
the developer with predictable algorithmic performance. Additionally, HBSP
provides part of the motivation for the development of the HBSPk model. In the
HBSPk model, HBSP is synonymous with HBSP1 .
2.3.2 Cluster-M
Cluster-M [EF93, ES93] is a model designed to bridge the gap between software
and hardware for heterogeneous computing. Cluster-M consists of three main
components: the specification module, the representation module, and the mapping module. A program is represented as a Spec graph (a multilevel clustered task graph), where nodes (Spec clusters) show execution times, and arcs represent the expected amount of data to be transferred between the nodes. Leaf nodes represent a single computation operand. All clusters at a level are independent and may be executed simultaneously. Furthermore, the programmer specifies the manner in which a program is clustered, which may be modified during run-time.
In the representation module, a heterogeneous suite of computers is represented by a Rep graph (a multilevel partitioning of a system graph), where nodes contain the speeds of arithmetic operations for the associated processor, and arcs express the bandwidth for communications between processors. Given an arbitrary Spec graph containing M task modules, and an arbitrary Rep graph of N processors, the mapping module is a portable heuristic tool responsible for near-optimal mapping of the two graphs in O(MP) time, where P = max{M, N}. Moreover, the mapping module has an interface that can be used with portable network communication tools, such as PVM (see Section 2.2.2), for executing portable parallel software across heterogeneous machines.
2.3.3 PVM
PVM (Parallel Virtual Machine) [Sun90] is a message-passing software system that is a byproduct of the Heterogeneous Network Project, a collaborative effort by researchers at Oak Ridge National Laboratory, the University of Tennessee, and Emory University to facilitate heterogeneous parallel computing. PVM is built around the concept of a virtual machine, which is a dynamic collection of computational resources managed as a single parallel computer.
The PVM system consists of two parts: a PVM daemon (called pvmd) that resides on every computer of the virtual machine, and a library of standard interface routines that is linked to the user application. The pvmd daemon oversees the operation of user processes within a PVM application and coordinates inter-machine PVM communications. The PVM library contains subroutine calls that the application programmer embeds in their application code. The library routines interact with the pvmd to provide services such as communication, synchronization, and process management. The pvmd may provide the requested service alone or in cooperation with other pvmds in the heterogeneous system.
Application programs that use PVM are composed of several tasks. Each task is responsible for a part of the application's computational workload. By sending and receiving messages, multiple tasks of an application can cooperate to solve a problem in parallel. Under PVM, the programmer has the ability to place tasks on specific machines. Such flexibility enables various tasks of a heterogeneous application to exploit particular strengths of the computational resources. However, it is the programmer's responsibility to understand and explicitly code for any distinctive properties of the heterogeneous system.
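To make the task-based structure concrete, the following fragment is a minimal sketch of a PVM master task, written against the PVM 3 C interface rather than the original release cited above; the worker executable name "worker", the number of tasks, and the message tags are illustrative assumptions.

#include "pvm3.h"

int main(void)
{
    int tids[4], data = 42, reply, i;

    pvm_mytid();                        /* enroll this task in the virtual machine */
    pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);
    for (i = 0; i < 4; i++) {
        pvm_initsend(PvmDataDefault);   /* new send buffer, portable encoding */
        pvm_pkint(&data, 1, 1);         /* pack one integer */
        pvm_send(tids[i], 1);           /* send with message tag 1 */
    }
    for (i = 0; i < 4; i++) {
        pvm_recv(-1, 2);                /* receive from any task with tag 2 */
        pvm_upkint(&reply, 1, 1);       /* unpack the worker's reply */
    }
    pvm_exit();                         /* leave the virtual machine */
    return 0;
}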
2.3.4 Summary
Programming heterogeneous systems is difficult because each application must take advantage of the underlying architectures and adjust for hardware availability. Some HC models rely solely on programmers to handle the complexity of HC systems. As a result, the programmer must hand-parallelize each task specifically for the appropriate target machine. If the configuration changes, parts of the application must be rewritten. Other approaches rely on compilers to automatically handle some of the complexity of tailoring applications for heterogeneous systems. By hiding heterogeneity, the developer does not need to understand all of the characteristics of the HC system. Programs written in this way are potentially mechanically portable. Thus, the success of developing software to execute efficiently and predictably on HC systems requires a model that hides some of the heterogeneity from the programmer while describing the underlying system with accuracy.
CHAPTER 3
A Case for BSP
We believe that the BSP model provides a fundamentally sound approach to parallel programming. First, the model supports the development of architecture-independent software, which promotes a widespread software industry for parallel computers. Moreover, existing applications do not have to be redeveloped or modified in a non-trivial way when migrated to different machines. Secondly, the BSP model includes a cost model that provides predictable costs of algorithm execution. BSP captures the essential characteristics of parallel machines with only a few parameters. More complex computational models tend to use more parameters that render them too tedious for practical use. Additionally, the BSP model can be viewed as a kind of programming methodology. The essence of the BSP approach is the notion of the superstep and the idea that the input/output associated with a superstep is performed as a global operation, involving the whole set of individual sends and receives. Viewed in this way, a BSP program is simply one which proceeds in phases, with the necessary global communications taking place between the phases. Lastly, the BSP model provides practical design goals for architects. According to the model, the routing of h-relations should be efficient (g should be small) and barrier synchronization should be efficient (L should be small). Parallel machines developed with these architectural design goals will be quite suitable for executing BSP algorithms. On the other hand, systems not designed with BSP in mind may not deliver good values of g and L, resulting in inadequate performance of BSP algorithms.
3.1 The BSP model
As discussed in Chapter 2.1.3, a BSP computer consists of a set of processor/memory modules, a communication network, and a mechanism for efficient barrier synchronization of all the processors. A computation consists of a sequence of supersteps. During a superstep, each processor asynchronously performs some combination of local computation, message transmissions, and message arrivals. A message sent in one superstep is guaranteed to be available to the destination processor at the beginning of the next superstep. Each superstep is followed by a global synchronization of all the processors.
Three parameters characterize the performance of a BSP computer: p represents the number of processors, L measures the minimal number of time steps between successive synchronization operations, and g reflects the minimal time interval between consecutive message transmissions on a per-processor basis. The values of g and L can be given in absolute times or normalized with respect to processor speed.
The parameters described above allow for cost analysis of programs. Cost prediction can be used in the development of BSP algorithms or to predict the actual performance of a program ported to a new architecture. Consider a BSP program consisting of S supersteps. The time complexity of superstep i in a BSP program is
w_i + g h_i + L    (3.1)
where w_i is the largest amount of local computation performed by any processor, and h_i is the maximum number of messages sent or received by any processor. (This communication pattern is called an h-relation.) The execution time of the entire program is defined as
W + gH + LS    (3.2)
where $W = \sum_{i=0}^{S-1} w_i$ and $H = \sum_{i=0}^{S-1} h_i$.
The above cost model demonstrates what factors are important when designing BSP applications. To minimize execution time, the programmer must attempt to (i) balance the local computation between processors in each superstep, (ii) balance communication between processors to avoid large variations in h_i, and (iii) minimize the number of supersteps. In practice, these objectives can conflict, and trade-offs must be made. The correct trade-offs can be selected by taking into account the g and L parameters of the underlying machine.
The cost model also shows how to predict performance across target architectures. The values W, H, and S can be determined by measuring the amount of local computation, the number of bytes sent, and the total number of supersteps [SHM97]. The values of g and L can then be inserted into the cost formula to predict the performance of programs ported to new parallel computers.
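As a minimal sketch of this prediction step, the following function (not part of BSPlib; the name and unit conventions are our own) simply evaluates Equation 3.2 from measured values of W, H, and S and the machine parameters g and L.

/* Predicted BSP running time from Equation 3.2.
   W: total local computation (seconds), H: total h-relation size (bytes),
   S: number of supersteps, g: seconds per byte, L: seconds per synchronization. */
double bsp_predict(double W, double H, double S, double g, double L)
{
    return W + g * H + L * S;
}

For instance, with the 16-processor SGI parameters of Table 3.1, g and L would be supplied as 0.04e-6 seconds per byte and 60e-6 seconds, respectively.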
From the point of view of the BSP programmer, there are only two levels of
memory locality: either the data is in local memory, or it is in nonlocal memory
(at the other processors). There is no concept of network locality, as would be the
case if the underlying interconnection network is a mesh, hypercube, or fat tree.
If the underlying interconnection network does indeed support network locality,
this fact will not be exploited by the BSP programmer.
3.2 BSP Sorting
Sequential sorting algorithms have been developed under the Random-Access Machine (RAM) model, an abstraction of the von Neumann model that has guided uniprocessor hardware design for decades. Parallel sorting algorithms have been investigated for many different machines and models; however, unlike sequential computing, parallel computing has no widely accepted model for program development. As a result, efficient parallel programs are often machine-specific. To demonstrate the utility of the model, we develop BSP implementations of four sorting algorithms (randomized sample sort, deterministic sample sort, bitonic sort, and radix sort) that present various computation and communication patterns to a parallel machine. With these applications, we evaluate the utility of BSP in terms of portability, efficiency, and predictability on an SGI Challenge and an Intel Paragon.
The claim that both efficiency and portability can be achieved by using the BSP model is supported by both theoretical and experimental results [Val90a, GLR99, KS99, McC93, Val90b, Val93, WG98]. However, other general-purpose models, such as LogP [CKP96], make similar claims. LogP models the performance of point-to-point messages with three parameters representing software overhead, network latency, and communication bandwidth. Under LogP, the programmer is not constrained by a superstep programming style. Although proponents of LogP argue that it offers a more flexible style of programming, Goudreau and Rao [GR98] argue that the advantages are largely illusory, since both approaches lead to very similar high-level parallel algorithms. In fact, most of the BSP sorting algorithms discussed here are, from a high-level perspective, virtually identical to Dusseau et al.'s LogP implementations [DCS96]. The main difference between the two models is that the scheduling of communication under LogP at the single-message level is the responsibility of the application programmer, while the underlying system of BSP performs that task. We argue that the cost of allowing the underlying system to handle communication scheduling is negligible; thus the higher-level BSP approach is preferable.
Several experimental studies on the implementation of parallel sorting algorithms have influenced this work. Similar parallel sorting studies are described by Blelloch et al. for a Connection Machine CM-2 [BLM98], Hightower, Prins, and Reif for a MasPar MP-1 [HPR92], and by Helman, Bader, and JaJa on a Connection Machine CM-5, an IBM SP-2, and a Cray Research T3D [HBJ96]. Of particular relevance is the work of Dusseau et al. [DCS96], which described several sorting approaches on a Connection Machine CM-5 using the LogP cost model [CKP93]. Dusseau et al.'s work is very similar in philosophy to this work in that it advocates the use of a bridging model for the design of portable and efficient code.
Experimental results for sorting based directly on the BSP model can be found in the work of Gerbessiotis and Siniolakis for an SGI Power Challenge [GS96], Shumaker and Goudreau for a MasPar MP-2 [SG97], and Hill et al. for a Cray T3E [HJS97]. Juurlink and Wijshoff performed a detailed experimental analysis of three parallel models (BSP, E-BSP, and BPRAM) and validated them on five platforms (Cray T3E, Thinking Machine CM-5, Intel Paragon, MasPar MP-1, and Parsytec GCel) using sorting, matrix multiplication, and all-pairs shortest path algorithms [JW98].
3.2.1 Experimental Approach
The code for the sorting algorithms uses the BSPlib library [HMS98]. BSPlib synthesizes several BSP programming approaches and provides a parallel communication library based around a Single Program Multiple Data (SPMD) model of computation. BSPlib provides a small set of BSP operations and two styles of data communication, Direct Remote Memory Access (DRMA) and Bulk Synchronous Message Passing (BSMP). DRMA reflects a one-sided direct remote memory access while BSMP captures a BSP-oriented message-passing approach. We use the BSMP style of communication in our sorting algorithms.
Figure 3.1 contains a small, but representative, example that captures the BSMP style of communication. In this code fragment, each processor calls bsp_nprocs and bsp_pid. These functions return to the calling processor the number of processors and its identity, respectively. Next, processor 0 sends a message with an integer tag (0) and a payload with two doubles (1.4 and 2.3) to all of the other processors. The bsp_send call sends the message. After the barrier synchronization call (bsp_sync), the message can be received at the destination process by first accessing the tag (bsp_get_tag) and then transferring the payload to the destination (bsp_move). The status variable used in the bsp_get_tag call returns the length of the payload. The bsp_move call also serves to flush this message from the input buffer.
The current experiments utilize two platforms:
An SGI Challenge (a shared-memory platform) with 16 MIPS R4400 processors running IRIX System V.4. A shared-memory implementation of the BSP library developed by Kevin Lang is used.
int i;
int p;
int pid;
int status;
int tag;
double payload[2];
...
p = bsp_nprocs();      /* no. of processors */
pid = bsp_pid();       /* processor identity */

/* P0 broadcasts message */
if (pid == 0) {
    tag = 0;
    payload[0] = 1.4;
    payload[1] = 2.3;
    for (i = 1; i < p; i++)
        bsp_send(i, &tag, &payload, 2*sizeof(double));
}
bsp_sync();

/* Receive message from P0 */
if (pid != 0) {
    bsp_get_tag(&status, &tag);
    bsp_move(&payload, status);
}
...
Figure 3.1: Code fragment demonstrating BSMP
                 SGI                       Intel
 p      g (us/byte)   L (us)      g (us/byte)   L (us)
 1         0.03          0           0.22         353
 2         0.03         20           0.40         657
 4         0.02         20           0.30        1299
 8         0.03         40           0.32        2505
16         0.04         60           0.38        4990
Table 3.1: BSP system parameters
An Intel Paragon (a message-passing machine) with 32 i860 XP processors running Paragon OSF/1 Release 1.0.4. The BSPlib implementation used was developed by Travis Terry at the University of Central Florida.
The code is compiled with the -O2 optimization flag. We consider BSP to model only communication and synchronization; I/O and local computation are not modeled. As a result, none of the experiments include I/O, and local computation is measured as best as possible on our platforms. We discuss our method for measuring the amount of local computation later in this section. Timings start when the input data is evenly distributed among the processors. The input data consists of 4K to 8192K uniformly distributed integers, where K = 1024. For local sorting, we used an 11-bit radix sort, which is the fastest sort that we could find.
Table 3.1 shows the values for g and L achieved on an SGI Challenge and an Intel Paragon. The bandwidth parameter g is the time per packet for a sufficiently large superstep with a total-exchange communication pattern; g is based on an h-relation size of 512 bytes. The value of L corresponds to the time needed for processors to synchronize for an empty superstep (i.e., no computation or communication). To illustrate cost prediction on the SGI Challenge, we measure the values of W, H, and S on an SGI Challenge for each of the sorting algorithms. The values of g and L can then be inserted into the cost formula to predict the performance of our sorting applications.
Performance prediction on the Intel Paragon is slightly different. To predict runtimes on the Paragon, we apply the following cost model
cW + gH + LS    (3.3)
where c reflects the factor used to estimate local computation (or work depth) on an Intel Paragon. We determine the value of c in the following manner. For each problem size, the work depth of the applications is measured on both platforms. Next, we compute the ratio of work depth on the SGI Challenge to work depth on the Intel Paragon; c reflects the average of these ratios.
Table 3.2 provides data about the performance of the BSP sorting applications using 16 processors on both of our parallel platforms. We give the algorithmic parameters, including work depth (as measured on the SGI), the sum over all supersteps of the maximum number of bytes sent or received by any processor, and the number of supersteps. We also include the actual running times, BSP predicted running times, and the c factor. Execution times are given in us/key. The goal is to observe a constant execution time as we scale the problem size. The error of prediction is given by max{T_actual, T_pred} / min{T_actual, T_pred}, where T_actual and T_pred represent the actual and predicted execution times, respectively.
The data indicates a general trend that for these applications, efficient use of larger numbers of processors can be achieved by increasing the problem size. This is true not only for these sorting algorithms, but for a wide range of important applications. Intuitively, this will occur whenever the computation can be
app      n/p     SGI pred  SGI actual  SGI error  Intel pred  Intel actual  Intel error  SGI W (sec)  H (bytes)    S     c
rand     4K        1.96       1.71      12.67%       6.82        5.09        25.33%         0.12        148648      5   2.99
rand     512K      0.40       0.43       6.98%       0.86        0.87         1.25%         3.29       2229404      5   1.93
dterm    4K        3.98       4.52      11.93%      11.11        5.21        53.14%         0.26        106488      7   2.55
dterm    512K      0.76       0.74       2.69%       1.37        1.19        12.97%         6.17       4280584      7   1.59
bitonic  4K        5.98       3.74      37.51%      13.76        4.29        68.83%         0.39        139264     25   1.88
bitonic  512K      2.63       2.40       8.80%       3.65        2.05        43.69%        21.33      17825792     25   1.11
radix    4K       10.58      13.17      19.65%      36.90       19.79        46.36%         0.67        318336    161   2.23
radix    256K†     2.41       2.26       6.22%       5.17        3.18        38.53%         9.41      16833408    161   1.54
Table 3.2: Algorithmic and model summaries using 16 processors on the SGI Challenge and the Intel Paragon. Predicted and actual running times are in us/key. † For radix sort, the largest problem size that could be run on both machines was 4,194,304 keys.
equally balanced among the processors, and communication and synchronization
requirements grow more slowly than the computation requirements.
3.2.2 Randomized Sample Sort
One approach for parallel sorting that is suitable for BSP computing is randomized sample sort. The sequential predecessor to the algorithm is sequential samplesort [FM70], proposed by Frazer and McKellar as a refinement of Hoare's quicksort [Hoa62]. Sequential samplesort uses a random sample set of input keys to select splitters, resulting in greater balance (and therefore a lower number of expected comparisons) than quicksort. The fact that the sampling approach could be useful for splitting keys in a balanced manner over a number of processors was discussed in the work of Huang and Chow [HC83] and Reif and Valiant [RV87]. Its use was analyzed in a BSP context by Gerbessiotis and Valiant [GV94].
The basic idea behind randomized sample sort in a p-processor system is the following:
1. A set of p - 1 splitter keys is randomly selected. Conceptually, the splitters will partition the input data into p buckets.
2. All keys assigned to the ith bucket are sent to the ith processor.
3. Each processor sorts its bucket.
The selection of splitters that define approximately equal-sized buckets is a crucial issue. The standard approach is to randomly select ps keys from the input set, where s is called the oversampling ratio. These keys are sorted, and the keys with ranks s, 2s, 3s, ..., (p - 1)s are selected as the splitters. By choosing a large enough oversampling ratio, it can be shown with high probability that no bucket will contain many more keys than the average [HC83]. A sketch of this splitter selection appears below.
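The sketch below is our own illustration of the splitter-selection step; the function names are hypothetical, and the sample array is assumed to already hold the ps randomly chosen keys.

#include <stdlib.h>

/* Compare two integer keys for qsort. */
static int cmp_key(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort the p*s sampled keys and keep every s-th one as a splitter. */
void choose_splitters(int *sample, int p, int s, int *splitters)
{
    int i;
    qsort(sample, (size_t)p * s, sizeof(int), cmp_key);
    for (i = 1; i < p; i++)
        splitters[i - 1] = sample[i * s];   /* ranks s, 2s, ..., (p-1)s */
}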
Our BSPlib implementation of randomized sample sort is similar to Dusseau et al.'s LogP implementation. Since the sending of keys to appropriate buckets requires irregular and unbalanced communication that cannot be predicted before run time, Dusseau et al. do not analyze the communication of randomized sample sort. The BSP approach avoids this situation by focusing on the global routing problem.
Analysis of Data. Figure 3.2 shows the actual and predicted performance
for randomized sample sort on an SGI Challenge and an Intel Paragon. Among
the algorithms considered, randomized sample sort had the best performance
across all platforms. From the plots, we see that the time per key decreases with
[Figure 3.2: four panels (SGI Predicted, SGI Actual, Intel Predicted, Intel Actual), each plotting execution time (us/key) against keys per processor (n/p) from 4K to 512K.]
Figure 3.2: Predicted and actual execution time per key of randomized sample
sort on an SGI Challenge and an Intel Paragon. Each plot represents the run
time on a 2, 4, 8, or 16 processor system.
the number of processors. However, as n approaches our largest problem size, the time per key slightly increases on the SGI Challenge. The BSP model improves the accuracy of its predictions as n/p increases. When n/p = 512K, the actual execution times are within 6.98% and 21.17% of that predicted by the BSP model for the SGI Challenge and the Intel Paragon, respectively.
From a BSP perspective, randomized sample sort is attractive. Of all the ratios considered, the algorithm performed best with an oversampling ratio of 1000. Randomized sample sort uses a constant number of supersteps, independent of both n and p. Also, the algorithm uses only one stage of communication, and achieves a utilization of bandwidth that is close to ideal; most of the data only makes a single hop. Moreover, the local sort component can use the best available sequential sorting algorithm.
3.2.3 Deterministic Sample Sort
Deterministic sample sort, analyzed in a BSP context by Gerbessiotis et al. [GS96], is motivated by randomized sample sort. The basic idea behind deterministic sample sort is the following:
1. Each processor sorts its local keys.
2. A set of p - 1 splitter keys is deterministically selected.
3. All keys assigned to the ith bucket are sent to the ith processor.
4. Each processor merges the keys in its bucket.
As in randomized sample sort, the selection of splitters is key to good algorithmic performance. Our approach deterministically selects ps keys from the input set, where s is called the oversampling ratio. These keys are merged, and the keys with ranks s, 2s, ..., (p - 1)s are selected as splitters. Deterministic sample sort also requires irregular and unbalanced communication to send keys to their appropriate bucket.
If many keys have the same value, failure to break ties consistently can result
in an uneven distribution of keys to buckets. Gerbessiotis et al.'s algorithm
bounds the bucket sizes, assuming that all keys are distinct. Since we allow
duplicate keys, their bounds do not hold for our implementation. Dusseau et al.
do not implement this algorithm.
Analysis of Data. Figure 3.3 shows the experimental results for deterministic sample sort. Deterministic sample sort has the second best performance across all platforms. Our experiments indicate that it performs better than randomized sample sort for small problems (n/p <= 16K). As with randomized sample sort, increasing the problem size leads to more accurate run time predictions. For 8 million keys, the actual execution times are within 7.43% and 22.18% of that predicted by the BSP model for the SGI Challenge and the Intel Paragon, respectively.
For deterministic sample sort, we find an oversampling ratio of 1000 to have the best overall performance out of the ratios we considered. As with randomized sample sort, it has many positive features in a general-purpose computing context. Both the computation and the communication are balanced. The algorithm uses a constant number of supersteps. There is only one stage of communication, and the bandwidth is used in an efficient manner. Moreover, the computation can leverage efficient sequential sorting algorithms.
[Figure 3.3: four panels (SGI Predicted, SGI Actual, Intel Predicted, Intel Actual), each plotting execution time (us/key) against keys per processor (n/p) from 4K to 512K.]
Figure 3.3: Predicted and actual execution time per key of deterministic sample
sort on an SGI Challenge and an Intel Paragon. Each plot represents the run
time on a 2, 4, 8, or 16 processor system.
3.2.4 Bitonic Sort
Bitonic sort, developed by Batcher [Bat68], is one of the first algorithms to attack the parallel sorting problem. The procedure depends upon the keys being ordered as a bitonic sequence.¹ Initially, each key is considered a bitonic sequence. Afterwards, lg n merge stages generate the sorted list. During each stage, two bitonic sequences are merged to form a sorted sequence in increasing or decreasing order. The monotonic sequences are ordered such that the two neighboring sequences (one monotonically increasing and the other monotonically decreasing) can be combined to form a new bitonic sequence for the next merge stage. For example, Figure 3.4 illustrates that the input (a bitonic sequence) to BM_8^+ is generated by combining BM_4^+'s output (a monotonically increasing sequence) and BM_4^-'s output (a monotonically decreasing sequence).
Dusseau et al.'s LogP implementation of bitonic sort motivated our approach. We simulate the steps of the algorithm on a butterfly network. Intuitively, one can visualize the communication structure of the procedure as the concatenation of increasingly larger butterflies. The communication structure of the ith merge stage can be represented by n/2^i butterflies, each with 2^i rows and i columns. Each butterfly node compares two keys and selects either the maximum or the minimum key (see Figure 3.5).
We employ a data placement so that all comparisons are local. The procedure begins with the data in a blocked layout. Under this layout, the first n/p keys and n/p rows of the butterfly nodes are assigned to the first processor, the second n/p keys and n/p rows are assigned to the second processor, etc.
¹ A bitonic sequence is a sequence of elements with the property that the sequence monotonically increases and then monotonically decreases, or a cyclic shift of the elements allows the monotonically increasing property to be satisfied.
[Figure 3.4: schematic with three merge stages; stage 1 uses BM_2^+ and BM_2^- networks, stage 2 uses BM_4^+ and BM_4^-, and stage 3 uses BM_8^+ to transform the unsorted list into the sorted list.]
Figure 3.4: A schematic representation of a bitonic sorting network of size n = 8. BM_k denotes a bitonic merging network of input size k that sorts the input in either monotonically increasing (+) or decreasing (-) order. The last merging network (BM_8^+) sorts the input.
[Figure 3.5: the bitonic sorting network of size n = 8, showing the keys held at each of the three merge stages as an unsorted list is transformed into a sorted list.]
Figure 3.5: A bitonic sorting network of size n = 8. Each node compares two keys, as indicated by the edges, and selects either the maximum or the minimum. Shaded and unshaded nodes designate where the minimum and maximum of two keys is placed, respectively.
As a result, the first lg(n/p) merge stages are entirely local. Since the purpose of these first stages is to form a monotonically increasing or decreasing sequence of n/p keys on each processor, a local sort replaces these merge stages.
For subsequent merge stages, the blocked layout is remapped to a cyclic layout; the first key is assigned to the first processor, the second key to the second processor, etc. Under this layout, the first i - lg(n/p) columns of the ith merge stage are computed locally, where each processor performs a comparison and conditional swap of pairs of keys. Afterwards, the data is remapped back to a blocked layout, resulting in the last lg(n/p) steps of the merge stage being local, and a local sort is executed by each processor. The remaps between the blocked and cyclic layouts involve regular and balanced communication, i.e., the communication schedule is oblivious to the values of the keys and each processor receives as much data as it sends. Periodic cyclic-blocked remapping requires n >= p^2 (i.e., at least p elements per processor) to execute compare-exchange operations locally [Ion96].
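The destination of each key under the remap is a simple index computation. The helper below is our own sketch with hypothetical names; it gives the cyclic destination of the key held at global index idx in the blocked layout.

/* Destination processor and local offset of the key with global index idx
   when remapping from a blocked layout to a cyclic layout over p processors. */
void cyclic_dest(long idx, int p, int *dest_pid, long *dest_offset)
{
    *dest_pid    = (int)(idx % p);   /* cyclic: keys are dealt round-robin */
    *dest_offset = idx / p;          /* position within that processor's block */
}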
Under LogP, Dusseau et al. discovered their approach had degraded performance due to the asynchronous nature of their platform, the CM-5. Once processors reached the remap phase, they were seriously out of sync, increasing the opportunity for contention. To improve performance, they employed a barrier synchronization before each remap phase.
Analysis of Data. Experimental results for bitonic sort are shown for an SGI Challenge and an Intel Paragon in Figure 3.6. In terms of performance, the bitonic sort was worse than the sample sorts. Bitonic sort performed the best when n/p = 4K. (Of course, for such a small problem size, one should probably elect to use a sequential sort.) On the Intel Paragon, there are significant errors when trying to predict the performance of the algorithm on 16 processors. For
[Figure 3.6: four panels (SGI Predicted, SGI Actual, Intel Predicted, Intel Actual), each plotting execution time (us/key) against keys per processor (n/p) from 4K to 512K.]
Figure 3.6: Predicted and actual execution time per key of bitonic sort on an
SGI Challenge and an Intel Paragon. Each plot represents the run time on a 2,
4, 8, or 16 processor system.
example, when n/p = 512K, bitonic sort is predicted to require about 3.65 us per key, but the measured time per key is 2.05 us. However, for the same problem size with 8 processors, the prediction error is only 5.23%.
Bitonic sort was originally developed to sort n numbers in O(lg^2 n) parallel time. The original algorithm, however, assumed a growth of computational resources, in this case comparators, that is O(n lg^2 n). The overall work to sort n numbers is therefore O(n lg^2 n), which is asymptotically worse than the other radix-sort-based parallel algorithms described here. Since bitonic sort is not based on the best sequential algorithm available and consists of O(lg n) communication phases, it is not surprising that it proved to be uncompetitive in our experiments.
3.2.5 Radix Sort
The radix sort algorithm [CLR94] relies on the binary representation of the unordered list of keys. Let b denote the number of bits in the binary representation of a key. Radix sort examines the keys to be sorted r bits at a time, where r is defined to be the radix. Radix sort requires ceil(b/r) passes. During pass i, it sorts the keys according to their ith least significant block of r bits.
Consider a BSP formulation, which is virtually identical to Dusseau et al.'s approach, of radix sort for n keys. Each pass consists of three phases. First, each processor computes a local histogram containing 2^r buckets by traversing its list of keys and counting the number of occurrences of each of the 2^r digits. Next, the global rank of each key is determined by computing a global histogram from the local histograms. Let g(i, j) be the starting position in the output where the first key with value i on processor j belongs. Each processor determines the global rank of a key with value i by obtaining its g(i, j) value. The collection of g(i, j) values represents the global histogram. Lastly, each key is stored at the correct offset on the destination processor based on its global rank. A sketch of the local histogram phase appears below.
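The following is a minimal sketch of the local histogram phase of a given pass; the function name and argument list are illustrative rather than part of our implementation.

/* Count, for the given pass, how many local keys fall into each of the 2^r buckets. */
void local_histogram(const unsigned int *keys, int nkeys, int r, int pass, int *hist)
{
    int i;
    int buckets = 1 << r;
    unsigned int mask = (unsigned int)buckets - 1;

    for (i = 0; i < buckets; i++)
        hist[i] = 0;
    for (i = 0; i < nkeys; i++)
        hist[(keys[i] >> (pass * r)) & mask]++;
}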
The first phase of each pass only performs local computation. However, the other phases require communication. Recall that in the second phase, processors determine the global rank of each key by consulting the appropriate g(i, j) value from the global histogram. The central components of the global rank computation are a multiscan (2^r parallel prefix computations, one for each bucket) and a multicast (multiple broadcasts). Let b(i, j) represent the total number of keys with value i on all processors with an index less than j. After the multiscan, P_j will know the b(i, j) values for 0 <= i < 2^r. Let t(i, j) be the total number of keys with value i on P_j. After the multicast, all processors obtain the b(i, p - 1) and t(i, p - 1) values to compute g(i, j). Thus,
$g(i, j) = \sum_{k=0}^{i-1} [b(k, p - 1) + t(k, p - 1)] + b(i, j)$.
Figure 3.7 presents an example of the g(i, j) computation.
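Once the multiscan and multicast have completed, each processor can evaluate the formula above with a single pass over the buckets. The sketch below uses illustrative names: b_j holds b(i, j) for this processor, and total holds b(i, p - 1) + t(i, p - 1) for each bucket.

/* Compute g(i,j) for every bucket i on processor j. */
void global_ranks(const int *b_j, const int *total, int buckets, int *g_j)
{
    int i, running = 0;                 /* running = sum of total[k] for k < i */
    for (i = 0; i < buckets; i++) {
        g_j[i] = running + b_j[i];
        running += total[i];
    }
}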
In the last phase, the global ranks of the keys are divided equally among the processors. The processor and offset to which a key is sent depend upon its global rank. In our implementation, each processor loops through its set of keys, determines the destination processor and offset of each key, and sends each key and its offset to the appropriate processor. Since the destination of a key is dependent on the value of the key, this phase requires irregular communication to redistribute the keys.
We consider two ways of implementing multiscan and multicast. For the following discussion, we concentrate on the multiscan operation since the multicast communication pattern is the same.
[Figure 3.7: the local histograms of processors P0-P3 for buckets 0-3, the resulting global histogram, and the g(i,j) positions produced after the multiscan and multicast.]
Figure 3.7: Global rank computation. The computation is illustrated with 4 processors and 4 buckets for the values 0-3. Each processor's t(i, j) value is shown inside of each bucket. The number outside of a bucket reflects the b(i, j) value after the multiscan. After the multicast, g(i, j) reflects the starting position in the output where the first key with value i on processor j belongs. For example, P_0 will place the first key with value "0" at position 0, the "1" keys starting at position 7, etc.
One plausible method of implementing multiscan is a tree-based approach. Performing the multiscan as a sequence of m = 2^r tree-based parallel prefix computations requires m lg p messages to be sent by P_0, the processor that must send the most messages. An alternative and more efficient approach is to pipeline the bucket sizes to the next higher processor. In this case, each processor sends exactly m messages during the multiscan calculation, and making m large allows the overhead associated with filling the pipeline to become arbitrarily small.
Dusseau et al. also use a pipeline-based approach to implement the multiscan operation. However, their multiscan implementation does not run smoothly. The problem arises because the programmer did not account for the fact that P_0 only sends data whereas the other processors receive data, perform an addition, and send data. Since receiving data is usually given higher priority than sending data, P_1 will spend most of its time, in the early stages, receiving instead of sending data. Dusseau et al. correct this problem by delaying the sending rate of P_0. Additionally, Dusseau et al. do not analyze the communication of radix sort since it cannot be predicted at compile time. The BSP approach again avoids this situation by focusing on the global routing problem.
Analysis of Data. Experimental results for radix sort are shown for an SGI Challenge and an Intel Paragon in Figure 3.8. Radix sort provides the worst performance of the parallel algorithms implemented. When n/p = 256K, it is 6 times (3 times) slower than randomized sample sort on a 16-processor SGI Challenge (Intel Paragon) machine. Similarly to the other sorts, the BSP model improves the accuracy of its predictions as n/p increases. When n/p = 512K, the actual execution times are within 9.84% and 38.53% of that predicted by the BSP model for an SGI Challenge and an Intel Paragon, respectively.
[Figure 3.8: four panels (SGI Predicted, SGI Actual, Intel Predicted, Intel Actual), each plotting execution time (us/key) against keys per processor (n/p) from 4K to 256K.]
Figure 3.8: Predicted and actual execution time per key of radix sort on an SGI
Challenge and an Intel Paragon. Each plot represents the run time on a 2, 4, 8,
or 16 processor system.
Our experimental evidence indicates that parallel radix sort is perhaps the least appropriate for general-purpose parallel computing. There are several reasons for this. First, the amount of communication is relatively large. In general, all the keys can be sent to another processor in each pass. In our experiments, the best execution times occurred when there were four passes over the keys. Ignoring algorithmic overhead, this radix sort requires four h-relations of (approximately) size n/p, in contrast to the sample sorts, which have only one such relation. Additionally, the overhead associated with the construction of the global histogram is substantial in terms of synchronization. At least p - 1 barrier synchronizations will be required to perform the multiscan, and many more will be used if there is an attempt to pipeline communication. Similarly, the multicast component will require numerous barrier synchronizations if pipelining is used, as it typically will be for large r.
3.3 Summary
We have described BSP implementations of four sorting algorithms and analyzed their performance. LogP proponents argue that the application programmer should not be constrained by the superstep programming style of BSP. However, the BSP sorting implementations described here are virtually identical to Dusseau et al.'s LogP implementations of randomized sample sort, bitonic sort, and radix sort. (Dusseau et al. did not implement deterministic sample sort.) Moreover, LogP's flexible computational model creates additional burdens on the programmer that are avoided in the BSP model. The main difference between the two models lies in their approach to handling communication. Under LogP, the programmer schedules communication while the runtime system of BSP performs that task. Theoretical evidence supports the claim that LogP's flexible computational model has no speed advantage over the BSP approach for larger problem sizes [BHP96]. Furthermore, there is convincing evidence that compiler/runtime systems are capable of scheduling communication for efficient performance. High-Performance Fortran (HPF) [For93] is one such example. Consequently, we argue that the higher-level BSP approach is preferable.
Concerning the performance of our sorting implementations, our results suggest that BSP programs can efficiently execute a range of sorting applications on both shared-memory and message-passing parallel platforms. Of the sorts discussed here, randomized sample sort is the best overall performer, followed closely by deterministic sample sort. Both bitonic sort and radix sort (the worst performer) appear to be uncompetitive in comparison to the sample sorts, which indicates they may not be suitable for general-purpose parallel sorting.
Unfortunately, the accuracy of the BSP model raises some concerns. Our results show that the BSP model does not accurately predict the execution times of the sorting applications. However, we believe that our findings do not reflect negatively on the predictive capabilities of the BSP model. A more reasonable reaction is to notice that BSP is quite useful for predicting performance trends across target architectures of interest. In addition, the BSP cost model can be used as an evaluation tool for finding the most suitable algorithm from a set of alternatives to execute on a parallel architecture. For example, the BSP cost model suggests that randomized sample sort is the best overall performer (of the sorting algorithms discussed here) on both an SGI Challenge and an Intel Paragon. Our experimental results corroborate this claim.
In sum, the BSP model seeks to provide a simple programming approach that allows for portable, efficient, and predictable algorithmic performance. Our experiments demonstrate how increased efficiency and predictability under BSP can often be achieved by increasing the problem size. This is also the case for many other important applications. Thus, the cost of portable parallel computing is that larger problem sizes are needed to achieve the desired level of efficiency and predictability.
CHAPTER 4
HBSPk: A Generalization of BSP
The k-Heterogeneous Bulk Synchronous Parallel (HBSPk) model is a generalization of the BSP model [Val90a] of parallel computation. HBSPk provides parameters that allow the user to tailor the model to the required system. As a result, HBSPk can guide the development of applications for traditional parallel systems, heterogeneous clusters, the Internet, and computational grids [FK98]. In HBSPk, each of these systems can be grouped into clusters based on their ability to communicate with each other. Although the model accommodates a wide range of architecture types, the algorithm designer does not have to manipulate an overwhelming number of parameters. More importantly, HBSPk allows the algorithm designer to think about the collection of heterogeneous computers as a single system.
4.1 Machine Representation
The HBSPk model refers to a class of machines with at most k levels of communication. Thus, HBSP0 machines, or single-processor systems, are the simplest class. The next class of machines is HBSP1 computers, which consist of at most one communication network. Examples of HBSP1 computers
[Figure 4.1: an HBSP2 cluster consisting of an SMP, an SGI workstation, and a LAN joined by a communications network.]
Figure 4.1: An HBSP2 cluster.
include single-processor systems (i.e., HBSP0), traditional parallel machines, and heterogeneous workstation clusters. HBSP2 machines extend the HBSP1 class to handle heterogeneous collections of multiprocessor machines or clusters. Figure 4.1 shows an HBSP2 cluster consisting of three HBSP1 machines. In general, HBSPk systems include HBSP(k-1) computers as well as machines composed of HBSP(k-1) computers. Thus, the relationship of the machine classes is HBSP0 ⊆ HBSP1 ⊆ · · · ⊆ HBSPk.
An HBSPk machine can be represented by a tree T = (V, E). Each node of T represents a heterogeneous machine. The height of the tree is k. The root r of T is an HBSPk machine. Let d be the length of the path from the root r to a node x. The level of node x is k - d. Thus, nodes at level i of T are HBSPi machines. Figure 4.2 shows a tree representation of the HBSP2 machine shown in Figure 4.1. The root node corresponds to an HBSP2 machine. The components of this machine (a symmetric multiprocessor, an SGI workstation, and a LAN) are shown at level 1. Level 0 depicts the individual processors of the symmetric multiprocessor and the LAN.
The indexing scheme of an HBSPk machine is as follows. Machines at level i, 0 <= i <= k, are labeled M_{i,0}, M_{i,1}, ..., M_{i,m_i-1}, where m_i represents the number of HBSPi machines. Consider machine M_{i,j} of an HBSPk computer, where 0 <= j < m_i.
[Figure 4.2: the tree with root M_{2,0}; its children M_{1,0}, M_{1,1}, and M_{1,2} at level 1; and leaves M_{0,0}-M_{0,3} under M_{1,0} and M_{0,4}-M_{0,5} under M_{1,2}.]
Figure 4.2: Tree representation of the cluster shown in Figure 4.1.
One possible interpretation of M_{i,j} is that it is a cluster with identity j on level i. The nodes of the cluster are the children of M_{i,j}. In Figure 4.2, M_{1,0} is an HBSP1 cluster composed of the nodes M_{0,0}, M_{0,1}, M_{0,2}, and M_{0,3}. M_{2,0} provides an example of a cluster of clusters. The HBSPk model places no restriction on the amount of nesting within clusters. Additionally, we may consider the machines at level i of T the coordinator nodes of the machines at level i - 1. As shown in Figure 4.2, M_{1,0}, M_{1,1}, M_{1,2}, and M_{2,0} are examples of coordinator nodes. Coordinator nodes fulfill many roles. They can act as a representative for their cluster during inter-cluster communication. Or, to increase algorithmic performance, they may represent the fastest machine in their subtree (or cluster). Under this assumption, the root node is the fastest node of the entire HBSPk machine.
4.2 Cost Model
Using the definition of an HBSPk machine as a basis, we define the meaning and cost of an HBSPk computation. There are two ways in which to determine the cost of an HBSPk computation. One approach calculates the various costs directly at each level i. The other finds them recursively. For expositional purposes, we choose the former approach.
An HBSPk computation consists of some combination of super_i-steps. During a super_i-step, each level i node asynchronously performs some combination of local computation, message transmissions to other level i machines, and message arrivals from its peers. A message sent in one super_i-step is guaranteed to be available to the destination machine at the beginning of the next super_i-step. Each super_i-step is followed by a global synchronization of all the level i computers.
Consider the class of HBSP0 machines. For these single-processor systems, computation proceeds through a series of super_0-steps (or steps). Communication and synchronization with other processors is not applicable. Unlike the previous class, HBSP1 machines perform communication. HBSP1 computers proceed through a series of super_1-steps (or supersteps). During a superstep, each HBSP0 machine asynchronously performs some combination of local computation, message arrivals, and message transmissions. Thus, an HBSP1 computation resembles a BSP computation. The main difference is that an HBSP1 algorithm delegates more work to the faster processors.
For HBSP2 machines, computation consists of super_1- and super_2-steps. Super_1-steps proceed as described previously. During a super_2-step, the coordinator nodes for each HBSP1 cluster perform local computation and/or communicate data with other level 1 coordinator nodes. A barrier synchronization of the coordinators separates each super_2-step. Thus, an HBSP2 computation consists of both intra- and inter-cluster communication.
An HBSPk computer is characterized by the following parameters, which are summarized in Table 4.1:

Symbol    Meaning
M_{i,j}   a machine's identity, where 0 <= i <= k, 0 <= j < m_i
m_i       number of HBSPi machines on level i
m_{i,j}   number of children of M_{i,j}
g         speed with which the fastest machine can inject packets into the network
r_{i,j}   speed relative to the fastest machine for M_{i,j} to inject packets into the network
L_{i,j}   overhead to perform a barrier synchronization of the machines in the jth cluster of level i
c_{i,j}   fraction of the problem size that M_{i,j} receives
h         size of a heterogeneous h-relation
h_{i,j}   largest number of packets sent or received by M_{i,j} in a super_i-step
S_i       number of super_i-steps
T_i(σ)    execution time of super_i-step σ
Table 4.1: Definitions of Notations
m_i, the number of HBSPi machines labeled M_{i,0}, M_{i,1}, ..., M_{i,m_i-1} on level i, where 0 <= i <= k;
m_{i,j}, the number of children of M_{i,j};
g, a bandwidth indicator that reflects the speed with which the fastest machine can inject packets into the network;
r_{i,j}, the speed relative to the fastest machine for M_{i,j} to inject a packet into the network;
L_{i,j}, the overhead to perform a barrier synchronization of the machines in the subtree of M_{i,j}; and
c_{i,j}, the fraction of the problem size that M_{i,j} receives.
We assume that the r_{i,j} value of the fastest machine is normalized to 1. If r_{i,j} = t, then M_{i,j} communicates t times slower than the fastest node. The c_{i,j} parameter adds a load-balancing feature into the model. Specifically, it attempts to provide M_{i,j} with a problem size that is proportional to its computational and communication abilities. Boulet et al. [BDR99] discuss methods for computing c_{i,j} on a network of heterogeneous workstations. In Chapter 5.3, we present our method of calculating c_{0,j} for an HBSP1 platform. We refer the reader to Boulet et al. for other strategies to compute c_{i,j}. When k >= 2, it is unclear what the value of c_{i,j} should represent. For example, a coordinator node's c_{i,j} value could be the sum of its children's values. Alternatively, a combination of communication and computation costs could factor into a machine's c_{i,j} value. The HBSPk model says nothing about how the c_{i,j} costs should be tabulated. Instead, it assumes that such costs have been determined appropriately.
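One way to record an HBSPk machine and its parameters in a program is a simple tree of nodes. The structure below is a sketch of our own devising; its field names are illustrative and are not part of the model.

/* One node of the HBSPk machine tree (levels 0 through k). */
struct hbsp_node {
    int    level;               /* i: 0 for a leaf processor, k for the root     */
    int    index;               /* j: identity of this machine within level i    */
    double r;                   /* r_{i,j}: relative packet-injection speed      */
    double L;                   /* L_{i,j}: barrier cost of this node's cluster  */
    double c;                   /* c_{i,j}: fraction of the problem size         */
    int    nchildren;           /* m_{i,j}: number of children                   */
    struct hbsp_node **child;   /* the HBSP_{i-1} machines this node coordinates */
};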
The parameters described above allow for cost analysis of HBSPk programs. First, consider the cost of a super_i-step σ. Let w_σ represent the largest amount of local computation performed by an HBSPi machine, and let h_{i,j} be the largest number of messages sent or received by M_{i,j}, where 0 <= j < m_i. The size of the heterogeneous h-relation is h = max{r_{i,j} h_{i,j}}, with a routing cost of gh. Thus, the execution time of super_i-step σ is
T_i(σ) = w_σ + gh + L_{i,j}.    (4.1)
Suppose that S_i is the number of super_i-steps, where 1 <= i <= k. Intuitively, the execution time of an HBSPk algorithm is the sum over all of the super_i-steps, where 1 <= i <= k. This leads to an overall cost of
$\sum_{\sigma=1}^{S_1} T_1(\sigma) + \sum_{\sigma=1}^{S_2} T_2(\sigma) + \cdots + \sum_{\sigma=1}^{S_k} T_k(\sigma)$.    (4.2)
Similarly to BSP, the above cost model demonstrates what factors are important when designing HBSPk applications. To minimize execution time, the programmer must attempt to (i) balance the local computation of the HBSPi machines in each super_i-step, (ii) balance the communication between the machines, and (iii) minimize the number of super_i-steps. Balancing these objectives is a nontrivial task. Nevertheless, HBSPk provides guidance on how to design efficient heterogeneous programs.
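As a sketch of how Equations 4.1 and 4.2 combine, the function below (our own, with hypothetical argument conventions) sums the superstep costs level by level. It assumes that, for super_i-step s, w[i][s] holds the largest local computation, h[i][s] the heterogeneous h-relation size, and L[i][s] the barrier cost of the cluster involved, and that S has k+1 entries.

/* Total HBSPk cost: the sum over levels 1..k and their supersteps of w + g*h + L. */
double hbspk_cost(int k, const int *S, double **w, double **h, double **L, double g)
{
    int i, s;
    double total = 0.0;
    for (i = 1; i <= k; i++)
        for (s = 0; s < S[i]; s++)
            total += w[i][s] + g * h[i][s] + L[i][s];
    return total;
}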
4.3 HBSPk Collective Communication Algorithms
Collective communication plays an important role in the development of parallel programs [BBC94, BGP94, MR95]. It simplifies the programming task, facilitates the implementation of efficient communication schemes, and promotes portability. In the following subsections, we design six HBSPk collective communication operations: gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast.
The HBSPk model provides parameters that allow algorithm designers to exploit the heterogeneity of the underlying system. The model promotes our two-fold design strategy for HBSPk collective operations. First, faster machines should be involved in the computation more often than their slower counterparts. Collective operations use specific nodes to collect or distribute data to the other nodes in the system. For faster algorithmic performance, these nodes should be the fastest machines in the system. Secondly, faster machines should receive more data items than slower machines. This principle encourages the use of balanced workloads, where machines receive problem sizes relative to their communication and computational abilities. Partitioning the workload so that nodes receive an equal number of elements works quite well for homogeneous environments. However, this strategy encourages unbalanced workloads in heterogeneous environments since faster machines typically sit idle waiting for slower nodes to finish a computation.
In this chapter, we design and analyze collective communication algorithms for HBSP1 and HBSP2 platforms. Our HBSP1 algorithms are based on BSP communication operations. Consequently, we discuss the BSP design of each collective operation before presenting its heterogeneous counterpart. In an HBSP1 environment, the number of workstations is m_{1,0} (or m_0). The single coordinator node, M_{1,0}, represents the fastest workstation among the HBSP0 machines. Hence, r_{1,0} = 1. L_{1,0} is the cost of synchronizing the cluster of processors.
The HBSP2 collective routines use the HBSP1 algorithms as a basis. Unlike HBSP1 platforms, HBSP2 machines contain a two-level communication network. Level 0 consists of m_0 individual workstations. Level 1 represents the m_1 (or m_{2,0}) coordinator nodes for the machines at level 0. Each coordinator, M_{1,j}, requires a cost of L_{1,j} to synchronize the nodes in its cluster. The root of the entire cluster is M_{2,0}, which is the fastest machine; r_{2,0} = 1. The communication costs can be quite expensive on higher-level links. As a result, our HBSP2 algorithms communicate minimally on these links. We do not specify algorithms for higher-level machines (i.e., k >= 3). However, one can generalize the approach given here for these systems.
Throughout the rest of this chapter, let x_{i,j} represent the number of items in M_{i,j}'s possession, where 0 <= i <= 2 and 0 <= j < m_i. Balanced workloads assume x_{i,j} = c_{i,j} n. The total number of items of interest is n. For notational convenience, the indexes f and s are used to represent the identity of the fastest and slowest nodes, respectively.
4.3.1 Gather
In the gather operation, a single node collects a unique message from each of the other nodes. The BSP gather operation consists of all processors sending their data to a designated processor. Typically, this duty is relegated to P_0. For balanced workloads, the processors send n/p elements to P_0, where n denotes the total number of items P_0 receives from all the processors. Since n/p < n, the BSP cost of the gather operation is gn + L.
HBSP1. We extend the above algorithm to accommodate 1-level heterogeneous environments. Instead of each machine sending its data to P_0, it sends it to the coordinator node, M_{1,0}. Suppose that the total number of items that M_{1,0} receives is n. The size of the heterogeneous h-relation is max{r_{0,j} x_{0,j}, r_{1,0} n}. Assume that each processor takes approximately the same amount of time to send its data. Hence, x_{0,j} = c_{0,j} n. Recall that c_{i,j} is inversely proportional to the speed of M_{i,j}. Consequently, r_{i,j} c_{i,j} < 1. Thus, the gather costs gn + L_{1,0}.
The above cost of the gather operation is efficient since the fastest processor is performing most of the work. If r_{0,j} c_{0,j} > 1, M_{0,j} has a problem size that is too large; its communication time will dominate the cost of the gather operation. Whenever possible, the fastest processor should handle the most data items. Our results, in fact, demonstrate the importance of balanced workloads. The increase in performance is a result of M_{1,0} receiving the items faster. The HBSPk model, as does BSP, rewards programs with balanced design.
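A sketch of the HBSP1 gather in BSPlib's BSMP style is given below. It assumes that BSPlib's header has been included, that the tag size has been set to sizeof(int) elsewhere, that coord identifies the coordinator M_{1,0}, and that bsp_get_tag reports a status of -1 once the incoming queue is empty; the function name is illustrative.

/* Every processor sends its nlocal integers to the coordinator, which copies
   the payloads of all received messages into out after the synchronization. */
void hbsp1_gather(int coord, int *local, int nlocal, int *out)
{
    int tag = bsp_pid();
    bsp_send(coord, &tag, local, nlocal * sizeof(int));
    bsp_sync();
    if (bsp_pid() == coord) {
        int status, src, offset = 0;
        bsp_get_tag(&status, &src);
        while (status != -1) {
            bsp_move(out + offset, status);
            offset += status / sizeof(int);
            bsp_get_tag(&status, &src);
        }
    }
}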
HBSP2. The HBSP2 gather algorithm proceeds as follows. First, each HBSP1 machine performs an HBSP1 gather. Afterwards, each of the level 1 nodes sends its data items to the root, M_{2,0}. Since the problem size is n, x_{2,0} = n.
The cost of an HBSP2 gather operation is the sum of the super_1-step and super_2-step times. Since each HBSP1 machine performs a gather operation, the super_1-step cost is the largest time needed for an HBSP1 cluster to finish the operation. Once the level 1 coordinators have the n data items, they send the data to the root. This super_2-step requires g max{r_{1,j} x_{1,j}, r_{2,0} n} + L_{2,0}. Assuming balanced workloads, the super_2-step cost is gn + L_{2,0}. Efficient algorithm execution in this environment implies that the size of the problem must outweigh the cost of performing the extra level of communication and synchronization.
4.3.2 Scatter
The scatter operation is the opposite of the gather operation. Here, a single node sends a unique message to every other node. Under BSP, P_0 sends n/p elements to each of the processors. The cost of this operation is gn + L.
HBSP1. The extension of the above algorithm for heterogeneous processors is also similar. The fastest processor, M_{1,0}, is responsible for scattering the data to all of the processors. M_{1,0} distributes c_{0,j} n elements to M_{0,j}. The size of the heterogeneous h-relation is max{r_{0,j} c_{0,j} n, r_{1,0} n}. Assume that the processors have balanced workloads, so r_{0,j} c_{0,j} < 1. This results in a cost of gn + L_{1,0}.
HBSP2. Similarly to the HBSP2 gather operation, the HBSP2 scatter consists of both super_1-steps and super_2-steps. In the super_2-step, the root process sends the required data to the level 1 coordinator nodes. A super_2-step requires a cost of g max{r_{1,j} x_{1,j}, r_{2,0} n} + L_{2,0}, which reduces to gn + L_{2,0}. Once the coordinator nodes receive the data from the root, an HBSP1 scatter operation is performed within each cluster. As in the gather operation, the additional level of synchronization is an issue of concern. However, if the problem size is large enough, the effect of the barrier cost on the system can be minimized.
4.3.3 Reduction
Sometimes the gather operation can be combined with a specified arithmetic
or logical operation. For example, the values could be gathered and then added
together to produce a single value. Such an operation is called a single-value
reduction. In BSP, each processor locally reduces its n/p data items and sends
the result to P0. Afterwards, P0 reduces the p values to produce a single value.
The cost of this operation in BSP is O(n/p) + p(1 + g) + L.
HBSP1. The HBSP1 algorithm proceeds as follows. Each processor locally
reduces its x_{i,j} elements and sends the result to M_{1,0}. The size of the
heterogeneous h-relation is max{r_{0,s} · 1, r_{1,0} m_{1,0}}. M_{1,0} then reduces
the m_{1,0} values to produce a single value. The total cost of the algorithm is
O(x_{0,s}) + g m_{1,0} + L_{1,0}. The computational requirement at the root of a
reduction that produces a single value is m_{1,0}; if m_{1,0} is small, the benefit
of using the fastest processor as the root node is small.
A point-wise reduction is performed on an array of values provided by each
processor. Suppose that each processor sends the root node an array called inbuf
of size n; point-wise reduction assumes that all arrays are of equal size, with the
first element of inbuf at index 0. During the reduction, the root node applies the
specified operation to the first element of each input buffer (op(inbuf[0])) and
stores the result in outbuf[0]. Similarly, outbuf[1] contains op(inbuf[1]). Valid
operations include maximum, minimum, and summation, to name a few.
The HBSP1 point-wise reduction operation proceeds as follows. First, each
processor sends its n data items to the fastest node. Afterwards, M_{1,0} performs
the reduction on the m_{1,0} n elements. This step requires a heterogeneous
h-relation of size max{r_{0,s} n, r_{1,0} m_{1,0} n}. Assuming that
r_{0,s} ≤ r_{1,0} m_{1,0}, the cost of the super_1-step is
O(m_{1,0} n) + g m_{1,0} n + L_{1,0}. Here, the benefits of a point-wise reduction
are evident in the computational and communication workload given to the fastest
processor.
HBSP2. The HBSP2 single-value reduction algorithm proceeds as follows.
First, each level 1 coordinator node performs an HBSP1 reduction. The cost
of this super_1-step is the largest time needed by any cluster to perform an HBSP1
reduction. Afterwards, each of the coordinators sends its result to the root,
M_{2,0}, and the root reduces the m_{2,0} items to a single result. This super_2-step
costs O(m_{2,0}) + g max{r_{1,s} · 1, r_{2,0} m_{2,0}} + L_{2,0}.
In the HBSP2 algorithm, the computational requirement at the root of a reduction
that produces a single value is m_{2,0}. If m_{2,0} is small, the benefit of using
the root machine to reduce the values is small.
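The point-wise variant is essentially an element-wise loop at the root. The sketch below shows a summation reduction applied across the buffers gathered from the children; pointwise_reduce and its arguments are illustrative names, not part of HBSPlib.

    /* Sketch of the root-side point-wise reduction: combine m buffers of
     * length n element by element.  Here the operation is summation; any
     * associative operation (maximum, minimum, ...) could be substituted. */
    void pointwise_reduce(int *inbuf[], int m, int n, int *outbuf)
    {
        for (int i = 0; i < n; i++) {
            int acc = inbuf[0][i];
            for (int j = 1; j < m; j++)
                acc += inbuf[j][i];   /* op(inbuf[0][i], ..., inbuf[m-1][i]) */
            outbuf[i] = acc;
        }
    }

The double loop performs m · n operations, which matches the O(m_{1,0} n) computation term charged to the root above.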
4.3.4 Prefix Sums
Given a list of n numbers, y_0, y_1, ..., y_{n−1}, all partial summations (i.e.,
y_0, y_0 + y_1, y_0 + y_1 + y_2, ...) are computed. The prefix calculation can also
be defined with operations other than addition, for example, subtraction,
multiplication, maximum, minimum, and logical operations. Practical areas of
application include processor allocation, data compaction, sorting, and polynomial
evaluation.
A BSP prefix sums algorithm requires two supersteps. First, each processor
computes its local prefix sums, which requires a computation time of O(n/p). Next,
each processor sends its total sum to P0. Since P0 receives p items, the size of
the h-relation is p. P0 computes the prefix sums of these p elements; suppose they
are labeled s_0, s_1, ..., s_{p−1}. P0 then sends s_j to P_{j+1}, where
0 ≤ j < p − 1. The communication time of this step is gp. Each processor computes
the final result by adding the value received from P0 to its local prefix sums.
Therefore, the cost of a BSP prefix sums computation is O(n/p) + 2gp + 2L.
HBSP1. As in the BSP algorithm, the HBSP1 prefix sums operation begins with
M_{0,j} computing its local prefix sums, which requires a computation time of
O(c_{0,j} n). Afterwards, each processor sends its total sum to M_{1,0}. M_{1,0}
computes the prefix sums of the m_{1,0} elements received in the previous step and
then sends each processor the value it needs to add to its local prefix sums. This
requires a heterogeneous h-relation of size max{r_{0,s} · 1, r_{1,0} m_{1,0}}.
Adding this value to M_{0,j}'s local prefix sums involves c_{0,j} n work. The size
of both heterogeneous h-relations in the prefix sums algorithm is
max{r_{0,s} · 1, r_{1,0} m_{1,0}}, and it is unlikely that r_{0,s} > m_{1,0}.
Therefore, the total time of prefix sums on an HBSP1 machine is
O(c_{0,j} n) + 2(g m_{1,0} + L_{1,0}), where the amount of local computation is
relative to a processor's computational speed.
HBSP2. Figure 4.3 presents an example of an HBSP2 prefix sums computation.
The HBSPk prefix sums algorithm begins with M_{0,j} computing its local prefix
sums and sending the total sum to the coordinator of its cluster. Each level 1
coordinator node computes the prefix sums of its children's values and sends the
total sum to its coordinator, M_{2,0}. This node computes the prefix sums of its
m_{2,0} elements; suppose they are labeled s_0, s_1, ..., s_{m_{2,0}−1}. Prefix
sums values are then distributed to M_{2,0}'s children as follows. The first child,
M_{1,0}, gets the value 0; the second child, M_{1,1}, receives s_0; M_{1,2} obtains
s_1; and so on. Each M_{1,j} adds this element to its prefix sums. Like the root
node, each M_{1,j} then sends an appropriate value for its children to add to their
prefix sums. Figure 4.4 presents the algorithm.
In Steps 1 and 9, each HBSP0 machine computes its local prefix sums, which
requires M_{0,j} to perform c_{0,j} n work. Next, M_{0,j} sends the total sum of its
prefix sums to its coordinator, M_{1,j}. Since M_{0,j} sends 1 element and M_{1,j}
receives m_{1,j} elements, Step 2 requires a communication time of
g max{r_{0,j} · 1, r_{1,j} m_{1,j}}. In Steps 3 and 7, each level 1 coordinator
computes the prefix sums of the elements it received from its children, which
requires m_{1,j} computation. In Step 4, M_{1,j} sends its sum to the root,
M_{2,0}; the size of the heterogeneous h-relation is
max{r_{1,j} · 1, r_{2,0} m_{2,0}}. Step 5 requires a computation time of m_{2,0}.
Steps 6 and 8 require heterogeneous h-relations of size
max{r_{1,j} · 1, r_{2,0} m_{2,0}} and max{r_{0,j} · 1, r_{1,j} m_{1,j}},
respectively. Let L_i = max_j{L_{i,j}}, where 1 ≤ i ≤ 2 and 0 ≤ j < m_i. The cost
of the HBSP2 prefix sums algorithm is

    O(c_{0,j} n) + 2g(max{r_{0,s}, r_{1,s} m_{1,s}} + max{r_{1,s}, m_{2,0}}) + 2(L_1 + L_2).    (4.3)
Unlike the gather and broadcast operations, the overhead of prefix sums grows
with the underlying heterogeneous architecture, not the problem size. By
increasing the problem size, one can amortize the overheads of the underlying
architecture.
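The leaf-level work (Steps 1, 2, and 9 of Figure 4.4) amounts to a local scan, the exchange of one total per processor, and a final offset add. A minimal C sketch of that leaf-side logic follows; as before, the header, tag, and hbsp_* argument lists are assumed rather than taken from the library.

    #include "hbsp.h"              /* hypothetical HBSPlib header */
    #define PREFIX_TAG 3           /* illustrative tag value      */

    /* Sketch of the leaf side of the prefix sums algorithm (Steps 1, 2,
     * and 9 of Figure 4.4).  The first child's offset is 0; every other
     * leaf receives the prefix sum of the totals that precede it. */
    void leaf_prefix_sums(int *x, int nlocal, int coordinator)
    {
        /* Step 1: local prefix sums in place. */
        for (int i = 1; i < nlocal; i++)
            x[i] += x[i - 1];

        /* Step 2: send the local total (last prefix sum) upward. */
        int total = (nlocal > 0) ? x[nlocal - 1] : 0;
        hbsp_send(coordinator, PREFIX_TAG, &total, sizeof(int));
        hbsp_sync();

        /* ... coordinator superstep: scan the totals, send offsets ... */
        hbsp_sync();

        /* Step 9: add the offset received from the coordinator. */
        int offset = 0;
        hbsp_move(&offset, sizeof(int));
        for (int i = 0; i < nlocal; i++)
            x[i] += offset;
    }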
4.3.5 One-to-All Broadcast
The two-phase broadcast is the algorithm of choice for the BSP model. The
algorithm's strategy is to spread the broadcast items as equally as possible among
the p processors before replicating each item. In the first phase, P0 sends n/p
elements to each of the processors. The second phase consists of each processor
Figure 4.3: An HBSP2 prefix sums computation. Execution starts with the leaf
nodes (HBSP0 machines) in the top diagram, where each node sends the total of its
prefix sums computation to the coordinator of its cluster. The upward traversal
continues until the root node is reached. The bottom diagram shows the downward
phase of the computation; the leaf nodes hold the final result.
1. M_{0,j} computes the prefix sums of its c_{0,j} n elements, where 0 ≤ j < m_0.
2. M_{0,j} sends the total sum of its elements to the coordinator of its cluster.
3. M_{1,j} computes the prefix sums of the elements received in Step 2, where
   0 ≤ j < m_1.
4. M_{1,j} sends the total sum of its elements to M_{2,0}.
5. M_{2,0} computes the prefix sums of the elements received in Step 4.
6. M_{2,0} sends its jth prefix sum to M_{1,j+1}, where 0 ≤ j < m_1 − 1.
7. M_{1,j} adds the value from Step 6 to its prefix sums.
8. M_{1,j} sends its jth prefix sum to its (j + 1)th child.
9. M_{0,j} adds the value from Step 8 to its prefix sums, where 0 ≤ j < m_0.
Figure 4.4: HBSP2 prefix sums algorithm.
sending the n/p elements it received in the previous stage to all of the
processors. Therefore, the cost of the two-phase broadcast is 2(gn + L).
HBSP1. The HBSP1 broadcast algorithm uses the above BSP algorithm as its
basis. The computation starts at the root (or coordinator) node, and its children
execute much as in the two-phase BSP algorithm. M_{1,0} sends n/m_{1,0} elements
to each of its children, M_{0,j}. This phase requires a heterogeneous h-relation
of size max{r_{1,0} n, r_{0,s} n/m_{1,0}}. In a typical environment, it is
reasonable to assume that m_0 ranges from the tens to the hundreds, so it is quite
unlikely that a machine would be m_{1,0} times slower than the fastest machine; if
one were, it may be more appropriate not to include that machine in the
computation. As a result, the communication time of the first phase reduces to gn.
The second phase consists of each processor receiving n elements (strictly,
n − n/m_{1,0} elements; we use n to simplify the notation). This results in a
communication time of g r_{0,s} n. Thus, the complexity of a two-phase broadcast
on an HBSP1 machine is gn(1 + r_{0,s}) + 2L_{1,0}.
As a point of comparison, the one-phase broadcast (M_{1,0} sends n items to
each processor) costs g n m_{1,0} + L_{1,0}, assuming r_{0,s} < m_{1,0}. Clearly,
the two-phase approach is the better overall performer. An interesting conclusion
is that the broadcast operation effectively cannot exploit heterogeneity: since
the slowest processor must receive n items, its cost dictates the complexity of
the algorithm. Partitioning the problem so that M_{0,j} receives c_j n elements
during the first phase of the algorithm is ineffective. Although wall-clock
performance may improve, the theoretical speedup is negligible.
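A sketch of the two phases on an HBSP1 machine appears below: the coordinator first scatters equal-sized pieces, then every processor re-sends its piece to all others. The header, tag constant, and hbsp_* argument lists remain assumptions.

    #include "hbsp.h"              /* hypothetical HBSPlib header */
    #define BCAST_TAG 4            /* illustrative tag value      */

    /* Sketch of the HBSP1 two-phase broadcast.  Phase 1: the coordinator
     * scatters n/p-sized pieces; phase 2: every processor forwards its
     * piece to all other processors. */
    void hbsp1_broadcast(int *data, int n)
    {
        int p     = hbsp_nprocs();
        int me    = hbsp_pid();
        int root  = hbsp_get_rank(1);
        int piece = n / p;                   /* ignore the remainder for clarity */

        if (me == root)                      /* phase 1: spread the items */
            for (int j = 0; j < p; j++)
                if (j != root)
                    hbsp_send(j, BCAST_TAG, data + j * piece, piece * sizeof(int));
        hbsp_sync();

        if (me != root)                      /* store the received piece */
            hbsp_move(data + me * piece, piece * sizeof(int));

        /* phase 2: replication: each processor sends its piece to everyone */
        for (int j = 0; j < p; j++)
            if (j != me)
                hbsp_send(j, BCAST_TAG, data + me * piece, piece * sizeof(int));
        hbsp_sync();
        /* receivers reassemble the remaining pieces with hbsp_move() */
    }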
HBSP2. The two-phase approach is the algorithm of choice for HBSP1 machines.
Next, we consider broadcasting on an HBSP2 computer. Given that communication is
likely to be more expensive in such an environment (higher-latency links and
increased synchronization costs), we investigate whether the two-phase approach
is also applicable to HBSP2 machines. The algorithm begins with the root node
distributing the n items to the level 1 coordinator nodes; M_{2,0} may broadcast
the data to its children using either a one-phase or a two-phase approach.
Afterwards, each level 1 coordinator node sends the n items to its children using
the HBSP1 broadcast algorithm. The total cost of the algorithm is the sum of the
super_1- and super_2-steps. Since both approaches utilize the HBSP1 broadcast, we
focus our discussion on the behavior of the super_2-steps.
In the one-phase approach, the root node sends n elements to each of the level 1
machines. The cost of the super_2-step is
g max{r_{1,s} n, r_{2,0} n m_{2,0}} + L_{2,0}.
Suppose that r_{1,s} > m_{2,0}. The super_2-step cost is then
g r_{1,s} n + L_{2,0}; otherwise, it is g n m_{2,0} + L_{2,0}.
Unlike the above algorithm, the two-phase approach requires two super_2-steps.
Initially, the root node sends n/m_{2,0} elements to each level 1 coordinator.
Each coordinator then broadcasts its n/m_{2,0} elements to its peers. The first
super_2-step requires a heterogeneous h-relation of size
max{r_{1,s} n/m_{2,0}, r_{2,0} n}; the other super_2-step costs
g r_{1,s} n + L_{2,0}. If r_{1,s} > m_{2,0}, the cost of the super_2-steps is
g r_{1,s} n (1/m_{2,0} + 1) + 2L_{2,0}; otherwise, the cost is
gn(r_{1,s} + r_{2,0}) + 2L_{2,0}.
4.3.6 All-to-All Broadcast
A generalization of the one-to-all broadcast is the all-to-all broadcast, where
all nodes simultaneously initiate a broadcast. A node sends the same data to every
other node, but different nodes may broadcast different messages. A
straightforward BSP algorithm for all-to-all communication is a single-stage
algorithm in which each processor sends its data to all of the other processors.
Suppose that each processor Pj sends n/p elements to each of the other processors.
This algorithm results in a cost of gn + L, which is the same as the cost of a
one-to-all broadcast. Although this algorithm is susceptible to node contention,
it illustrates the issues involved in performing an all-to-all broadcast in a
heterogeneous environment.
HBSP1. One approach to designing an HBSP1 all-to-all broadcast is to use the
above BSP algorithm as a basis. This simultaneous broadcast algorithm results in
the same cost as the one-phase broadcast algorithm, g n m_{1,0} + L_{1,0};
unfortunately, it is not able to exploit the heterogeneity of the underlying
system. Another approach to all-to-all communication is the intermediate
destination algorithm, in which each processor sends its message to an
intermediate node that is responsible for broadcasting the data to all processors.
Clearly, the intermediate node should be the fastest processor in the system. The
algorithm begins with M_{0,j} sending its data items to M_{1,0}, which requires an
h-relation of size max{r_{0,s} n/m_{1,0}, r_{1,0} n}. Again, we assume that
r_{0,s} < m_{1,0}. The root node collects all of the data from its children and
broadcasts it to all of the nodes. The cost of broadcasting the data with a
two-phase algorithm is gn(1 + r_{0,s}) + 2L_{1,0}. Overall, this approach requires
a cost of gn(2 + r_{0,s}) + 3L_{1,0}.
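Written out, the total is simply the gather cost added to the two-phase broadcast cost (taking r_{1,0} = 1 for the coordinator and r_{0,s} < m_{1,0} as above):

\[
T_{\mathrm{ID}}
  = \underbrace{gn + L_{1,0}}_{\text{gather to } M_{1,0}}
  + \underbrace{gn(1 + r_{0,s}) + 2L_{1,0}}_{\text{two-phase broadcast}}
  = gn(2 + r_{0,s}) + 3L_{1,0}.
\]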
Unfortunately, as in the one-to-all broadcast, the fundamental difficulty of the
all-to-all broadcast is that each node must end up with the same number of items.
The slowest node will always be a bottleneck since it must receive all n data
items. As a result, it is difficult to partition the problem in a way that creates
balanced workloads among the heterogeneous machines.
HBSP2. Next, we consider performing an all-to-all broadcast on an HBSP2
machine. First, each HBSP0 node sends its data to the coordinator of its cluster,
and the level 1 coordinators forward the collected data to the root node, M_{2,0}.
Once the root receives all n items, it initiates an HBSP2 one-to-all broadcast.
Again, the performance of this algorithm is limited since the slowest machine must
receive all n items.
4.3.7 Summary
The utility of the HBSPk model is demonstrated through the design and analysis
of gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all
broadcast algorithms. Our results indicate that the HBSPk model encourages
balanced workloads among the machines whenever possible. For example, a close
examination of the broadcast operations shows that unbalanced workloads cannot be
avoided there, since the slowest processor must receive all n items. Besides
analyzing execution time, the HBSPk model can also be used to determine the
penalty associated with using a particular heterogeneous environment. This is
certainly true for the prefix sums algorithm, whose overhead costs are a result of
the underlying architecture and not the problem size.
CHAPTER 5
HBSP1 Collective Communication Performance
This chapter focuses on experimentally validating the performance of the
collective communication algorithms presented in Chapter 4. Specifically, we study
the effectiveness of the collective routines on a non-dedicated, heterogeneous
cluster of workstations. Each of the routines was designed to utilize fast
processors and balanced workloads, and theoretical analysis of the algorithms
showed that applying these principles leads to good performance on heterogeneous
platforms. We design experiments to test whether the predictions made by the model
hold for HBSP1 platforms.
Additional research has studied the performance of collective algorithms for
heterogeneous workstation clusters. The ECO package [LB96], built on top of PVM,
automatically analyzes characteristics of heterogeneous networks to develop
optimized communication patterns. Bhat, Raghavendra, and Prasanna [BRP99] extend
the FNF algorithm [BMP98] and propose several new heuristics for collective
operations; their heuristics consider the effect that communication links with
different latencies have on a system. Banikazemi et al. [BSP99] present a model
for point-to-point communication in heterogeneous networks of workstations and use
it to study the effect of heterogeneity on the performance of collective
operations.
5.1 The HBSP Programming Library
The HBSP1 collective communication algorithms are implemented using the
HBSP Programming Library (HBSPlib). Table 5.1 lists the functions that constitute
the HBSPlib interface. The design of HBSPlib incorporates many of the functions
contained in BSPlib [HMS98]. HBSPlib is written on top of PVM [Sun90], a software
package that allows a heterogeneous network of parallel and serial computers to
appear as a single, concurrent computational resource. The computers compose a
virtual machine and communicate by sending messages to each other. We use PVM's
pvm_send() function for asynchronous communication to send messages directly
between processors; to receive a message, we use the PVM function pvm_recv().
HBSPlib's implementation of message passing among heterogeneous processors is
straightforward, thanks to PVM. More problematic is the implementation of global
synchronization. PVM has a function called pvm_barrier() that implements barrier
synchronization; unfortunately, it is unclear whether a successful return from
pvm_barrier() implies that all messages have been cleared from the communication
network. As a result, our implementation of global synchronization is somewhat
more involved, since we need to guarantee that all messages have arrived at their
destination. Therefore, extra packets are used for synchronization purposes. PVM
guarantees that message order is preserved, so when a processor calls hbsp_sync(),
it sends a special synchronization packet to every other processor. Essentially,
this packet tells each receiving processor that the sender has no more messages to
send. Next, the processor begins handling the messages that were sent to it. All
messages are accounted for once it has processed p − 1 synchronization packets; at
that point, the processor calls pvm_barrier().
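A condensed sketch of this synchronization protocol is shown below. The PVM calls (pvm_initsend, pvm_pkint, pvm_send, pvm_recv, pvm_bufinfo, pvm_barrier) are the standard PVM 3 interface; the tid array, the SYNC_TAG value, the group name, and the message-handling helper are assumptions made for illustration, since the actual HBSPlib internals are not listed here.

    #include <pvm3.h>

    #define SYNC_TAG 9999                    /* illustrative tag value */

    extern int  tid[];                       /* PVM task ids, assumed filled at startup */
    extern void handle_message(int bufid);   /* assumed HBSPlib-internal helper */

    /* Sketch of hbsp_sync(): tell every peer we are done sending, then
     * consume traffic until p-1 synchronization packets have been seen. */
    void hbsp_sync_sketch(int me, int p)
    {
        int dummy = 0, seen = 0;

        for (int j = 0; j < p; j++) {
            if (j == me) continue;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&dummy, 1, 1);
            pvm_send(tid[j], SYNC_TAG);      /* "no more messages from me" */
        }

        while (seen < p - 1) {               /* drain incoming traffic */
            int bufid = pvm_recv(-1, -1);    /* any source, any tag    */
            int bytes, tag, src;
            pvm_bufinfo(bufid, &bytes, &tag, &src);
            if (tag == SYNC_TAG)
                seen++;                      /* a peer has finished sending */
            else
                handle_message(bufid);       /* queue a regular message */
        }

        pvm_barrier("hbsp", p);              /* all messages delivered */
    }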
Function              Semantics
hbsp_begin            Starts the program with the number of processors requested.
hbsp_end              Called by all processors at the end of the program.
hbsp_abort            One process halts the entire HBSP computation.
hbsp_pid              Returns the processor id, in the range 0 to one less than
                      the number of processors.
hbsp_time             Returns the time (in seconds) since hbsp_begin was called.
                      The timers on the processors are not synchronized.
hbsp_nprocs           Returns the number of processors.
hbsp_sync             The barrier synchronization call. After the call, all
                      outstanding requests are satisfied.
hbsp_send             Sends a message to a designated processor.
hbsp_get_tag          Returns the tag of the first message in the system queue.
hbsp_qsize            Returns the number of messages in the system queue.
hbsp_move             Retrieves the first message from the processor's receive
                      buffer.
hbsp_get_rank         Returns the identity of the processor with the requested
                      rank.
hbsp_get_speed        Returns the speed of the processor of interest.
hbsp_cluster_speed    Returns the total speed of the heterogeneous cluster.

Table 5.1: The functions that constitute the HBSPlib interface.
HBSPlib incorporates functions that allow the programmer to take advantage of
the heterogeneity of the underlying system. Under HBSPk, faster machines should
perform the most work. The primitive hbsp_get_rank(1) returns the identity of the
fastest processor; hbsp_get_rank(p) returns the slowest machine's identity, where
p is the number of processors. HBSPlib also includes functions that help the
programmer distribute the workload based on a machine's ability. The HBSPlib
primitive hbsp_get_speed(j) provides the speed of processor j, and
hbsp_cluster_speed returns the speed of the entire cluster. Combined, these two
functions allow the value of processor j's cj parameter to be found. Details of
this calculation are provided in Section 5.3.
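For example, a program can combine these primitives to select the root of a collective operation and to estimate its own share of the data, as sketched below; the return types shown are assumptions, since the header declarations are not reproduced here.

    /* Example use of the heterogeneity-aware HBSPlib primitives.  Return
     * types (int ids, double speeds) are assumed. */
    int my_share(int n)                    /* n = total number of data items */
    {
        int    me      = hbsp_pid();
        int    fastest = hbsp_get_rank(1); /* rank 1 = fastest processor     */
        double c_me    = hbsp_get_speed(me) / hbsp_cluster_speed();

        if (me == fastest) {
            /* this processor will serve as the root of the collective */
        }
        return (int)(c_me * n);            /* roughly c_j * n items for me   */
    }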
5.2 The HBSP1 Model
We find it useful to simplify the notation of the HBSPk model, described in
Section 4.2, for this environment. The number of workstations is m_0, or p. The
single coordinator node, M_{1,0} or Pf, represents the fastest processor among the
HBSP0 processors, and Ps refers to the slowest node. To identify the individual
processors on level 0, we use the notation Pj to refer to processor M_{0,j}. Let
rf = 1 and rs = max{rj}, where 0 ≤ j < p. Since an HBSP1 machine consists of a
single cluster of processors to synchronize, L = L_{1,0}.
5.3 Experimental setup
Our experimental testbed consisted of a non-dedicated, heterogeneous cluster
of SUN and SGI workstations at the University of Central Florida. Table 5.2
Host       CPU type         CPU speed (MHz)   Memory (MB)   Data cache (KB)
aditi†     UltraSPARC II    360               256           16
chromus    microSPARC II    85                64            8
dcn_sgi1   MIPS R5000       180               128           32
dcn_sgi3   MIPS R5000       180               128           32
gradsun1   TurboSPARC       170               64            16
gradsun3   TurboSPARC       170               64            16
gromit     UltraSPARC IIi   333               128           16
sgi1       MIPS R5000       180               96            32
sgi3       MIPS R5000       180               96            32
sgi7       MIPS R5000       200               64            32

Table 5.2: Specification of the nodes in our heterogeneous cluster. † A
two-processor system, where each number is for a single CPU.
lists the specifications of each machine. Each node is connected by a 100 Mbit/s
Ethernet connection. Our experiments evaluate the impact of processor speed and
workload distribution on the overall performance of an algorithm. In this section,
we discuss our method for estimating the costs of the HBSP1 parameters on this
platform.
The ranking of the processors is determined by the BYTEmark benchmark [BYT95],
which consists of the 10 tests briefly described below.
Numeric sort. An integer-sorting benchmark.
String sort. A string-sorting benchmark.
Bitfield. A bit manipulation package.
Emulated floating-point. A small software floating-point package.
Fourier coefficients. A numerical analysis benchmark for calculating series
approximations of waveforms.
Assignment algorithm. A task allocation algorithm.
Huffman compression. A well-known text and graphics compression algorithm.
IDEA encryption. A block cipher encryption algorithm.
Neural net. A back-propagation network simulator.
LU Decomposition. A robust algorithm for solving linear equations.
The BYTEmark benchmark reports both raw and indexed scores for each test. For
example, the numeric sort test reports as its raw score the number of arrays it
was able to sort per second. The indexed score is the raw score of the system
divided by the raw score obtained on the baseline machine, a 90 MHz Pentium XPS/90
with 16 MB of RAM; it attempts to normalize the raw scores. If a machine has an
index score of 2.0, it performed that test twice as fast as the 90 MHz Pentium.
After running all of the tests, BYTEmark produces two overall figures, an
Integer index and a Floating-point index. The Integer index is the geometric mean
of the tests that involve only integer processing: Numeric sort, String sort,
Bitfield, Emulated floating-point, Assignment algorithm, Huffman compression, and
IDEA encryption. The Floating-point index is the geometric mean of the remaining
tests. Thus, one can use these results to get a general feel for the performance
of the machine in question compared to a 90 MHz Pentium.
Table 5.3 presents the Integer and Floating-point index scores for each machine
in the heterogeneous cluster. Since we consider integer data only, the
Machine    Integer Index   Floating-point Index
aditi      4.45            3.77
chromus    0.75            0.59
dcn_sgi1   2.80            3.73
dcn_sgi3   2.79            3.67
gradsun1   1.80            1.41
gradsun3   1.81            1.42
gromit     4.89            3.33
sgi1       2.81            3.60
sgi3       2.77            3.30
sgi7       3.13            4.11

Table 5.3: BYTEmark benchmark scores.
Integer index scores were used to rank the processors. According to the results,
chromus is the slowest node and gromit is the fastest machine in the cluster. This
result is surprising considering that aditi appears faster on paper.
Interestingly, aditi narrowly edges out gromit in every test except string sort,
where gromit outperforms aditi with a score of 7.63 to 2.40.
BYTEmark uses only a single execution thread. Consequently, it cannot take
advantage of aditi's additional processor. This does not present a problem for our
experiments since our HBSPlib implementation does not use threads. We ran our
experiments with both aditi and gromit as the fastest processor, and there was no
major difference in the execution times. Therefore, we consider gromit to be the
fastest processor in the cluster.
To ensure consistent results, we apply the same processor ordering in every
experiment; Table 5.4 shows the ordering. When p = 2, the experiments utilize
gromit and chromus. The speed of this configuration is 5.64, which is the sum of
each machine's Integer index score. Each machine's cj value is based on its
Integer index score and the cluster speed; in general, the cj values sum to 1.
When p = 2, gromit's cj value is 4.89/5.64 (or .867) and the cj value of chromus
is .133. Therefore, gromit receives 86.7% of the data elements and chromus
acquires the remaining 13.3%. When p = 4, the cluster speed is 12.89. The
workstations that comprise the cluster are gromit, chromus, aditi, and dcn_sgi1,
which receive 37.9%, 5.8%, 34.5%, and 21.7% of the input, respectively.
Table 5.4 also presents the synchronization costs of the clusters comprised of
2, 4, 6, 8, and 10 workstations. For example, synchronizing two processors (i.e.,
gromit and chromus) requires 9,000 µs. The value of L corresponds to the time for
an empty superstep (i.e., no computation or communication). When p = 4, 15,000 µs
are needed to synchronize the processors. Compared with the L values of the Intel
Paragon and SGI Challenge presented in Table 3.1, the synchronization costs for
the heterogeneous cluster are quite high. Several factors contribute to this
behavior. First, since the cluster is non-dedicated, many other nodes share the
network link, which degrades communication performance. Secondly, our
implementation of barrier synchronization is not necessarily efficient. Despite
the high L values, our collective algorithms outperformed their PVM counterparts.
Additional work will focus on the development of a more efficient barrier
synchronization primitive.
Table 5.5 shows the rj values achieved on our heterogeneous cluster. To obtain
these values, we measure the time needed for each machine to inject a
p    Machine               Speed   L (µs)
2    gromit, chromus       5.64    9,000
4    aditi, dcn_sgi1       12.89   15,000
6    dcn_sgi3, gradsun1    17.48   23,000
8    gradsun3, sgi1        22.10   30,000
10   sgi3, sgi7            28.00   37,000

Table 5.4: Cluster speed and synchronization costs. Each row adds the listed
machines to the previous configuration.
sufficiently large packet into the network. gromit performed the best, with a
score of 0.196 µs/byte. Processor j's rj value is relative to this score.
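The measurement itself can be as simple as timing one large send and normalizing by the fastest machine's score; the sketch below illustrates the idea with gettimeofday() and the standard PVM call pvm_psend(), where the packet size, tag, and the 0.196 µs/byte normalization constant are taken as stated above or assumed for illustration.

    #include <sys/time.h>
    #include <pvm3.h>

    #define PACKET_BYTES (1 << 20)     /* assumed "sufficiently large" packet */
    #define PROBE_TAG    4242          /* illustrative tag                    */

    /* Sketch of the r_j measurement: time one large injection into the
     * network and express it in microseconds per byte. */
    double measure_us_per_byte(int dest_tid)
    {
        static char packet[PACKET_BYTES];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        pvm_psend(dest_tid, PROBE_TAG, packet, PACKET_BYTES, PVM_BYTE);
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return us / PACKET_BYTES;
    }

    /* r_j = score_j / score_fastest, with the fastest (gromit) at 0.196. */
    double r_value(double us_per_byte) { return us_per_byte / 0.196; }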
5.4 Application Performance
The input data for each experiment consists of 100 KBytes to 1000 KBytes of
uniformly distributed integers. The problem size, n, refers to the largest number
of integers possessed by the root. Experimental results are given in terms of an
improvement factor: if TA and TB represent the execution times of algorithm A and
algorithm B, respectively, then the improvement factor of using algorithm B over
algorithm A is TA/TB.
The HBSPk model encourages the use of fast processors and balanced workloads;
according to the model, applications that embody both of these principles will
exhibit good performance. We designed two types of experiments to validate the
predictions of the model. The first experiment tests whether processor speed has
an impact on algorithmic performance. Let Ts represent the execution time of a
collective routine when the root node is the slowest processor, Ps, and let Tf
denote the cost of using Pf as the root. For these experiments, each
Machine    rj
aditi      1.03
chromus    4.08
dcn_sgi1   2.12
dcn_sgi3   1.95
gradsun1   2.00
gradsun3   2.46
gromit     1.00
sgi1       1.68
sgi3       1.20
sgi7       1.16

Table 5.5: rj values.
processor has an equal number of data items, since our objective is to compare
slow versus fast root nodes; hence, cj = 1/p. The results demonstrate that using
the fastest node as the root often results in a significant performance
improvement.
Our second experiment studies the benefit of combining the fastest processor as
the root with balanced workloads. Let Tu be the execution time when the workload
is unbalanced; note that Tu = Tf, since each processor j's cj value is 1/p. Tb
denotes the execution time when the workload is balanced, with cj computed as
described in the previous section. In most cases, the results demonstrate that
balanced workloads improve the performance of the algorithm.
We also investigate the accuracy of the HBSP1 cost function in predicting
execution times. As with BSP, we consider HBSPk to model only communication and
synchronization [GLR99]; I/O and local computation are not modeled. As a result,
none of our experiments include I/O. Furthermore, local computation for some of
our collective routines (i.e., single-value reduction, point-wise reduction, and
prefix sums) was measured directly. Our results show that the model is able to
predict performance trends, but not specific execution times. The inability of
HBSPk to predict specific execution times does not reflect negatively on the
model; the accuracy of the cost function depends on the choices made in the
implementation of the HBSPlib library. Thus, one source of inaccurate predictions
may be shortcomings of the library implementation.
The remainder of this section provides experimental results for each of the
collective communication algorithms. Each data point is the average of 10 runs,
and the experimental data is given in Appendix A. For each of the experiments, the
logic of the algorithms is not changed; instead, the modifications occur in either
root node selection or problem size distribution. In both cases, the performance
increase is substantial.
Gather. Figure 5.1 (a) shows the improvement that results when the root node is
Pf. As the number of processors increases, so does performance, and the
improvement factor is steady across all problem sizes, reaching its maximum at
n = 500 KB. Unfortunately, there is virtually no benefit to distributing the
workload based on a processor's computational abilities, except at p = 2;
Figure 5.1 (b) displays the results. The problem lies with the estimation of cj
for aditi. Further investigation reveals that aditi has too many elements to send
to the root node, gromit; its workload does not match its abilities. As a result,
everyone must wait for aditi to finish sending its items to the root node.
For both experiments, the results at p = 2 are interesting. First, Figure 5.1 (a)
shows that it is better for the root node to be the slowest workstation, which
seems counterintuitive. In our implementation of gather (as well as the other
collective operations), a processor does not send data to itself. When Ps is the
root, Pf sends n/p items to it. Similarly, if the fastest processor is the root,
Ps sends n/p elements to Pf. Ts < Tf implies that it is more beneficial to have Ps
waiting on data from Pf than to have Pf waiting on data from Ps. It is clear,
however, that the root node should be Pf as the number of processors increases:
unlike the situation at p = 2, Pf does not sit idle waiting on data items from Ps;
instead, it handles the messages of the other processors while waiting on the
slowest processor's data.
Secondly, at p = 2, balanced workloads contribute to increased performance. Tu is
the execution time of Ps sending n/p data elements to the fastest processor, and
Tb is the cost of Ps sending cs n integers to Pf, where cs is calculated as
described in Section 5.3. Note that cs n < n/p. In this setting, balanced
workloads make a difference (i.e., Tb < Tu) since Pf receives a smaller number of
elements from Ps than in the unbalanced case.
Figure 5.2 shows predicted performance for the gather operation. Although
the model under-predicts the improvement factor, it does characterize the performance trends of the algorithm.
Scatter. Figure 5.3 (a) plots the increase in performance when the root node is
the fastest processor. The improvement factor is steady as the problem size
increases, and the best improvement occurs when p = 6 and n = 500 KB. When p = 2,
Ts/Tf < 1, which is similar to the behavior experienced with the gather operation;
Ts < Tf suggests that it is more advantageous for Ps to send data to the fastest
processor. As p increases, the results demonstrate that Pf is better suited as the
root node. Figure 5.3 (b) compares the performance of unbalanced and balanced
workloads. Unlike the gather results, there is a benefit to distributing the
problem size based upon a processor's computational abilities; here, p = 2 had the
best performance, with a maximum improvement of 3.62.
Figure 5.1: Gather actual performance. The improvement factor is determined by
(a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to 1000 KB of
integers. Each data point represents the average of 10 runs on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.2: Gather predicted performance. The improvement factor is determined by
(a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to 1000 KB of
integers. Each data point represents the predicted performance on a cluster
comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.3: Scatter actual performance. The improvement factor is determined by
(a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to 1000 KB of
integers. Each data point represents the average of 10 runs on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.4 shows predicted performance for the scatter operation. The cost
model predicts the same performance for both the scatter and gather operations.
Thus, the graph is identical to Figure 5.2.
Single-value reduction. Unlike the gather and scatter routines, there is
negligible improvement for the single-value reduction operation when the root node
is Pf. This is not surprising, considering that the HBSPk model predicted such
behavior; Figure 5.5 (a) shows the result. Improvement is insignificant since the
amount of data communicated to the root is a single value from each node.
Figure 5.5 (b) demonstrates better performance when the workloads are balanced
according to processor speed.
The predicted performance of the single-value reduction operation is shown
in Figure 5.6. For this algorithm, the cost model predicts that performance
remains unchanged regardless of the speed of the root. This is to be expected
Figure 5.4: Scatter predicted performance. The improvement factor is determined by
(a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to 1000 KB of
integers. Each data point represents the predicted performance on a cluster
comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
since the root performs very little communication and computation. The actual
results reflect this behavior. Moreover, the cost function correctly identifies
the performance trend of balanced workloads.
Point-wise reduction. A point-wise reduction of an array of values gives the root
node more work; the HBSPk model predicts that point-wise reduction will therefore
show better improvement than single-value reduction. Figure 5.7 plots the
increased performance that results when the root node is the fastest processor.
Here, performance increases with the number of workstations, and the improvement
is steady as the problem size increases.
Figure 5.8 plots the predictions of the cost model. Overall, the performance
trends of the algorithm were correctly identified.
Prefix sums. In the prefix sums algorithm, the problem size refers to the total
number of items held by all of the processors, not just the root node.
Figure 5.5: Single-value reduction actual performance. The improvement factor is
determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to
1000 KB of integers. Each data point represents the average of 10 runs on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.6: Single-value reduction predicted performance. The improvement factor
is determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to
1000 KB of integers. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.7: Point-wise reduction actual performance. The improvement factor is
determined by Ts/Tf. The problem size ranges from 100 KB to 1000 KB of integers.
Each data point represents the average of 10 runs on a cluster comprised of 2, 4,
6, 8, and 10 heterogeneous processors.
Figure 5.8: Point-wise reduction predicted performance. The improvement factor is
determined by Ts/Tf. The problem size ranges from 100 KB to 1000 KB of integers.
Each data point represents the predicted performance on a cluster comprised of 2,
4, 6, 8, and 10 heterogeneous processors.
Figure 5.9 (a) graphs the improvement factor that results from using Pf as the
root instead of Ps. Although the improvement factor is smaller than those of
scatter, gather, and point-wise reduction, execution times improve by as much as
24%. This is quite significant considering that the modification to the algorithm
consists only of selecting a fast root node rather than a slow one. The root node
in the prefix sums routine performs very little computation and communication;
improved results can be attained if Pf receives more work. Figure 5.9 (b) shows
the results. Here, performance decreases with the number of processors, which
implies that the algorithm can efficiently take advantage of balanced workloads
when the number of processors is small.
Figure 5.10 presents the predictability results. As with single-value reduction,
the cost model predicts that there is no advantage to using a fast root, since the
amount of computation and communication it performs is small. Unfortunately, the
actual results disagree with this prediction; however, the model does accurately
predict the benefit of using balanced workloads.
One-to-all broadcast. Figure 5.11 (a) compares the execution time of the algorithm
when the root node is either Ps or Pf. The plot demonstrates that there is
negligible improvement in performance, as the HBSPk model predicted. The broadcast
operation takes little advantage of the heterogeneity since each processor must
receive all of the data. In fact, the improvement in performance is a result of Pf
distributing n/p integers to each processor during the first phase of the
algorithm. Our analysis also applies if processor j receives cj n elements during
phase one of the algorithm; Figure 5.11 (b) corroborates the theoretical results.
Figure 5.12 plots the predictions of the cost model, which over-predicts the
benefit of using the fastest processor.
Figure 5.9: Prefix sums actual performance. The improvement factor is determined
by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to 1000 KB of
integers. Each data point represents the average of 10 runs on a cluster comprised
of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.10: Prefix sums predicted performance. The improvement factor is
determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to
1000 KB of integers. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.11: One-to-all broadcast actual performance. The improvement factor is
determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to
1000 KB of integers. Each data point represents the average of 10 runs on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.12: One-to-all broadcast predicted performance. The improvement factor is
determined by (a) Ts/Tf and (b) Tu/Tb. The problem size ranges from 100 KB to
1000 KB of integers. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
Figure 5.13: All-to-all broadcast actual performance. Two algorithms are compared:
simultaneous broadcast (SB) and intermediate destination (ID). The improvement
factor is given for SB versus ID. The problem size ranges from 100 KB to 1000 KB
of integers. Each data point represents the average of 10 runs on a cluster
comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
All-to-all broadcast. Figure 5.13 compares the performance of two all-to-all
broadcast implementations: simultaneous broadcast (SB) and intermediate
destination (ID). Overall, the SB algorithm performed the best; in fact, the ID
algorithm was not close to challenging its performance. This is somewhat
disappointing since the SB algorithm is very susceptible to node contention. One
possible explanation is that it performs only one superstep, while the ID
algorithm performs three supersteps. Given the high cost of synchronization in our
system, the SB algorithm is far less exposed to the barrier synchronization cost.
Figure 5.14 shows the predictability results. The HBSPk cost function verifies
that the SB algorithm is indeed a better performer than the ID algorithm.
Figure 5.14: All-to-all broadcast predicted performance. Two algorithms are
compared: simultaneous broadcast (SB) and intermediate destination (ID). The
improvement factor is given for SB versus ID. The problem size ranges from 100 KB
to 1000 KB of integers. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
1. Pf scatters the n data items to its children, Pj, where 0 ≤ j < p − 1.
2. Pj randomly selects a set of sample keys from its cj n input keys.
3. Pj sends its sample keys to the fastest node, Pf.
4. Pf sorts the p sample keys. Denote these keys by sample_0, ..., sample_{p−1},
   where sample_i is the sample key with rank i in the sorted order. Pf defines
   p − 1 splitters, s_0, ..., s_{p−2}, where s_j = sample_{⌈(Σ_{x=0}^{j} c_x) p⌉}.
5. Pf broadcasts the splitters to each of the processors.
6. All keys assigned to the jth bucket are sent to the jth processor.
7. All processors sort their buckets.
Figure 5.15: HBSP1 randomized sample sort.
5.4.1 Randomized Sample Sort
Section 3.2.2 discusses the merits of randomized sample sort for BSP computing.
Here, we extend the algorithm to accommodate a heterogeneous cluster of
workstations; specifically, our objective is to evaluate the performance of the
collective operations as part of a larger program. When adapting the randsort
algorithm for HBSP1 machines, we change the way in which the splitters are chosen.
In heterogeneous environments, it is necessary that O(cj n) keys fall between the
splitters sj and s_{j+1}, whereas homogeneous environments assume
c_{0,j} = 1/m_0, where 0 ≤ j < m_0. Figure 5.15 presents the algorithm.
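The only heterogeneous change is in Step 4 of Figure 5.15: splitter s_j is taken at sample position ⌈(Σ_{x≤j} c_x) p⌉ instead of at uniform spacing. A small C sketch of that index computation follows; the sample array is assumed to be sorted already, and the function name is illustrative.

    #include <math.h>

    /* Sketch of the heterogeneous splitter choice (Step 4 of Figure 5.15).
     * sample[] holds the p sorted sample keys, c[] the c_j values (summing
     * to 1), and splitter[] receives the p-1 splitters. */
    void choose_splitters(const int *sample, const double *c, int p, int *splitter)
    {
        double prefix = 0.0;
        for (int j = 0; j < p - 1; j++) {
            prefix += c[j];                      /* sum of c_0 .. c_j          */
            int idx = (int)ceil(prefix * p);     /* position ceil((sum c_x)*p) */
            if (idx > p - 1) idx = p - 1;        /* clamp against rounding     */
            splitter[j] = sample[idx];
        }
    }

In the homogeneous case (c_j = 1/p) this reduces to s_j = sample_{j+1}, so the sketch degenerates to the usual evenly spaced splitters.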
The cost of the algorithm is as follows. Step 1 requires a cost of gn + L. In
Step 2, each processor performs O(·) work to select its sample keys. Step 3
requires a communication time of g max{rs, rf p}; assuming rs < p, the
communication time reduces to gp. Pf sorts the sample keys in O(p lg p) time.
Broadcasting
p    n = 10^4   n = 10^5   n = 10^6
2    2.15       2.76       2.36
4    2.29       2.77       2.33
6    2.19       2.39       2.13
8    2.26       2.96       2.28
10   2.25       2.15       1.26

Table 5.6: Randomized sample sort performance. The improvement factor is
determined by Tu/Tb. The problem size ranges from 10^4 to 10^6 integers. Each data
point represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and
10 heterogeneous processors.
the p − 1 splitters requires gp(1 + rs) + 2L time. Since each processor is
expected to receive approximately cj n keys [Mor98b], Step 6 uses O(cj n)
computation time and g max_j{rj cj n} communication time, where 0 ≤ j < p. Once
each processor receives its keys, sorting them requires O(cj n lg(cj n)) time.
Thus, the total time of the algorithm is
O(cs n lg(cs n)) + g(n + p + p + p rs + rs cs n) + 5L.
The previous sections experimentally validated that using a fast root node often
results in better performance. Assuming a fast root node, we test the performance
of randomized sample sort using balanced workloads. Table 5.6 presents the
performance of our randomized sample sort implementation. The best performance
occurs when n = 10^5 integers, where the improvement factor reaches a high of
2.96. The use of the scatter operation to allocate data to the processors is quite
convenient; however, storage limitations will eventually prevent us from
distributing all input data from a single process. In that sense, the scalability
of the algorithm is limited.
5.5 Summary
The experimental results demonstrate that significant increases in performance
occur when the heterogeneity of the underlying system is taken into consideration.
For example, using faster processors often results in better algorithmic
performance, although the gather and scatter results show that there are
situations in which the root node should be the slowest processor. Balanced
workloads also contribute to better overall performance. Overall, the experiments
demonstrate that the HBSPk cost model guides the programmer in designing parallel
software for good performance on heterogeneous platforms. The algorithms are not
fine-tuned for a specific environment; instead, the performance gains are a result
of the cost predictions provided by the model.
CHAPTER 6
Conclusions and Future Work
The HBSPk model offers a framework that makes parallel computing a viable option
for heterogeneous platforms. HBSPk extends the BSP model by incorporating
parameters that apply to a diverse range of heterogeneous systems such as
workstation clusters, the Internet, and computational grids. HBSPk rewards
algorithms with balanced design; for heterogeneous systems, this translates to
nodes receiving a workload proportional to their computational and communication
abilities. The HBSPk parameter c_{i,j} provides the programmer with a way to
manage the workload of each machine in the heterogeneous platform. Furthermore,
faster machines should be used more often than their slower counterparts, and
coordinator nodes provide the user with access to the faster nodes in the system.
Therefore, the goal of HBSPk algorithm design is to minimize activity on slower
machines while increasing the efficiency of the faster machines in the system.
The utility of the model is demonstrated through the design, analysis, and
implementation of six collective communication algorithms: gather, scatter,
reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast. Our
collective communication algorithms are based on two simple design principles:
first, the root of a communication operation should be a fast node; secondly,
faster nodes receive more data items than slower nodes. We designed two types of
experiments to validate the predictions of the HBSPk model. One experiment
measured the
importance of root node selection; the other tested the effect of problem size
distribution. The results clearly demonstrate that the heterogeneity of a system
cannot be ignored: if algorithms for such platforms are designed correctly, the
performance benefits are substantial. HBSPk provides the programmer with a
framework in which to design efficient software for heterogeneous platforms.
Besides enabling good performance, the model predicts the behavior of our
collective routines within a reasonable margin of error.
Not all algorithms benefit from executing on a heterogeneous machine. The
broadcast algorithms (one-to-all and all-to-all) show negligible benefit from our
two-principle approach to algorithm design. A broadcast requires each machine to
possess all of the data elements at the end of the operation; since the slowest
node must receive every element, the performance of the algorithm suffers, and
there is no way to balance the workload according to processor speed. In general,
collective operations that require nodes to possess all of the data items at the
end of the operation are unlikely to effectively exploit heterogeneity.
HBSPk offers a single-system image of a heterogeneous platform to the application
developer. This view incorporates the salient features of the underlying machine,
characterized by a few parameters. Under HBSPk, improved performance is not a
result of programmers having to account for myriad differences in a heterogeneous
environment. By hiding the non-uniformity of the underlying system from the
application developer, the HBSPk model offers an environment that encourages the
design of heterogeneous parallel software in an architecture-independent manner.
6.1 Contributions
Below, we present a more detailed description of the contributions of this
work.
Developed a model of computation for heterogeneous and hierarchically connected
systems.
Introduced a classification scheme to characterize various types of parallel
platforms (HBSP0, HBSP1, ..., HBSPk).
Designed and analyzed collective communication and sorting algorithms for the
HBSPk model.
Implemented a library to facilitate HBSP1 programming.
Presented experimental results demonstrating efficient, scalable, and predictable
HBSP1 applications.
The HBSPk model is a general model of computation that can be applied to a
diverse range of heterogeneous platforms. It defines a programming methodology for
designing heterogeneous programs and an associated cost model for analyzing the
complexity of an algorithm. Moreover, the cost model allows for predictability of
performance. HBSPk provides the designer with parameters that reflect the relative
computational and communication speeds at each of the k levels and captures the
tradeoffs between communication and computation that are inherent in parallel
applications. Improved performance results from effectively exploiting the speeds
of the heterogeneous computing components, and it comes in an
architecture-independent manner.
In HBSPk, machines are grouped hierarchically into clusters based on their
ability to communicate with each other. HBSP0 machines, or single-processor
computers, are the simplest class of machines since they do not perform
communication. HBSP1 machines group HBSP0 processors together to form a single
parallel system that performs communication. In general, the HBSPk model refers to
a class of machines with at most k different levels of communication. This
characterization allows the model to adapt to workstation clusters as well as
computational grids.
We designed six collective communication algorithms for heterogeneous
computation. Since these basic patterns of interprocessor communication are
frequently used as building blocks in a variety of parallel algorithms, their
efficient implementation is crucial. Each of the collective routines contains a
phase in which one node is responsible for collecting information from, or
distributing information to, the other nodes. In these situations, there is a
substantial performance gain if the root node is the fastest machine and workloads
are balanced across the nodes. Multi-layer architectures can benefit from using
coordinator nodes to allow multiple fast nodes to be in use at one time. Of
course, one must be cognizant of the high cost of communication and
synchronization in such environments; thus, our algorithms minimize traffic on
slower network links.
The HBSP Programming Library (HBSPlib) is a parallel C library of communication
routines for the HBSP1 model. Besides providing primitives for process
initialization, process enquiry, barrier synchronization, and message passing,
HBSPlib incorporates additional functions that address the heterogeneity of the
underlying system. These functions include retrieving the identity of the fastest
processor and returning the speed of a single processor or an entire cluster. With
this information, a programmer is able to give more work to the fastest processor
and distribute the workload based on the relative speeds of the heterogeneous
processors.
The experimental results validate the predictions of the HBSP1 cost model. The
testbed consists of a non-dedicated, heterogeneous cluster of workstations. We use
the BYTEmark benchmark to rank the processors and to determine the load-balancing
parameter c_{0,j}. The experiments corroborate the theoretical claims of the
model: faster processors, if used appropriately, result in faster execution times,
and balanced workloads result in better overall performance. Overall, the
performance of our collective operations is quite impressive. Furthermore,
randomized sample sort shows the benefit of using the HBSP1 collective routines.
Fundamental changes to the algorithms are not necessary to attain the increase in
performance; instead, the modifications consist of selecting the root node and
distributing the workload.
6.2 Future Research
Based on the lessons learned from the development, implementation, and evaluation of the HBSPk model, the following research extensions and improvements
are presented as a follow-on research agenda.
Develop an optimized and scalable HBSP library implementation. Although our
prototype showed the performance improvement that results when a processor's load
is balanced, HBSPlib could benefit from additional improvements. One area of
concern is our barrier synchronization implementation; additional work is needed
to reduce the cost of this operation. Although PVM served its purpose in our
prototype, we are considering using MPI or Java as a basis for future
implementations of HBSPlib.
Extend HBSPlib to accommodate hierarchical architectures. Recent work
demonstrates the importance of collective communication operations for
hierarchical networks. Husbands and Hoe [HH98] develop MPI-StarT, a system that
efficiently implements collective routines for a cluster of SMPs. Bader and JaJa
[BJ99] describe a methodology for developing high-performance programs running on
clusters of SMP nodes based on a small kernel (SIMPLE) of collective communication
primitives. Kielmann et al. [KHP99] present MagPie, a system that also handles a
two-level communication hierarchy. Karonis et al. [KSF00] develop a topology-aware
version of the broadcast operation for good performance. The P-logP model [KBG00],
an extension of LogP [CKP93], is used to optimize the performance of wide-area
collective operations by determining the optimal tree shape for the communication;
moreover, large messages are split into smaller units, resulting in better link
utilization. Each of the above efforts considers only the network bandwidth of the
underlying heterogeneous environment. However, a machine's computational speed
also plays an important role in the overall time of a collective operation. The
HBSPk model allows the algorithm designer to take advantage of both the
communication and the computational abilities of the components in the
heterogeneous system.
Design additional HBSPk applications. Investigating the range of applications that can be efficiently handled by HBSPk is an important issue. We
have shown that the HBSPk model can guide the development of efficient
collective operations. Most of our communication algorithms achieve
increased performance when the heterogeneity of the underlying system is considered.
More work must be done to show the applicability of the model
to other problems such as matrix multiplication, minimum spanning tree,
and N-body simulation. Other communication routines (i.e., broadcast
algorithms) cannot effectively exploit the heterogeneity of the underlying
system. Further study of problems in this category is also of interest.
Let us consider the difficulty of designing an algorithm for solving the matrix multiplication problem on a heterogeneous network of workstations.
Given two n × n matrices A and B, we define the matrix C = A × B as

    C_{i,j} = \sum_{k=0}^{n-1} A_{i,k} B_{k,j}

[BBR00]. Here, we discuss a block version of the algorithm [KGG94]. For example, an n × n
matrix A can be regarded as a q × q array of blocks A_{i,j}, where 0 ≤ i, j < q, such that
each block is an (n/q) × (n/q) submatrix. We can use p homogeneous processors to implement
the algorithm by choosing q = √p. There is a one-to-one mapping between the p blocks and the
p homogeneous processors. As a result, each processor is responsible for updating a distinct
C_{i,j} block. Splitting the matrices into p equal-sized blocks will not lead to good
performance on heterogeneous platforms. Instead, we must balance the workload of a processor
in accordance with its computing power. Therefore, efficient performance occurs by tiling the
C matrix into p rectangles of varying sizes (see Figure 6.1).

[Figure 6.1: Processor allocation of p = 16 matrix blocks. Each processor receives the same
block size on a homogeneous cluster (Homogeneous Partition); on heterogeneous clusters,
processors receive varying block sizes (Heterogeneous Partition).]
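To make the block formulation concrete, the sketch below (ours, not taken from an
HBSPk implementation) shows the update a single processor performs in the homogeneous
version; for clarity it indexes into full copies of A and B rather than communicating
blocks. On a heterogeneous platform the same update would run over blocks of unequal
size, chosen in proportion to processor speed as in Figure 6.1.

    /* Sketch: the block update owned by the processor assigned block (bi, bj)
     * in the homogeneous algorithm.  Each block is a b x b submatrix with
     * b = n/q and q = sqrt(p); the processor accumulates
     *     C[bi][bj] += A[bi][k] * B[k][bj]   for k = 0 .. q-1.
     * Matrices are stored row-major as full n x n arrays for simplicity. */
    void update_block(int b, int q, int bi, int bj,
                      const double *A, const double *B, double *C)
    {
        int n = b * q;

        for (int k = 0; k < q; k++)             /* block index along the inner dimension */
            for (int i = 0; i < b; i++)         /* row within the block                  */
                for (int j = 0; j < b; j++) {   /* column within the block               */
                    double sum = 0.0;
                    for (int t = 0; t < b; t++)
                        sum += A[(bi*b + i)*n + (k*b + t)]
                             * B[(k*b + t)*n + (bj*b + j)];
                    C[(bi*b + i)*n + (bj*b + j)] += sum;
                }
    }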
Investigate other methods of estimating c_{i,j}. In our experiments, processor
j's c_{i,j} value, where i = 0, is based on its computational speed. However, its
communication ability may also play a role in achieving balanced workloads.
It is not unlikely for a machine's communication and computation abilities
to appear on opposite ends of the spectrum. For example, a workstation
may perform fast computationally, but its communication ability may not
be as strong. In such cases, a reasonable estimation of c_{i,j} considers both
performance values in its calculation. When k ≥ 2, determining the value
of each node's c_{i,j} parameter becomes more difficult. One possibility is that
its value could be the sum of its children's values. Additional investigation of
load-balancing strategies for hierarchical architectures is also required.
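One purely illustrative way to combine the two measurements, together with the suggested
summing of children's values for internal nodes when k ≥ 2, is sketched below; the weight
and the normalization are our assumptions, not part of the HBSPk definition.

    /* Sketch: estimate a node's c value from both a computation score and a
     * communication score (each normalized so the cluster-wide scores sum to 1),
     * and aggregate estimates for an internal node by summing its children.
     * The weight w is a tunable, illustrative choice. */
    #include <stddef.h>

    double leaf_c(double comp_score, double comm_score, double w)
    {
        return w * comp_score + (1.0 - w) * comm_score;
    }

    double internal_c(const double child_c[], size_t num_children)
    {
        double sum = 0.0;
        for (size_t i = 0; i < num_children; i++)
            sum += child_c[i];
        return sum;
    }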
Study the benefits of incorporating costs for different types of communication. Currently, the HBSPk model provides the same cost for both inter- and intra-cluster
communication. However, sending messages within a cluster is generally less expensive than
communicating outside the cluster. Furthermore, since nodes are not necessarily in the same
region, communication costs may vary depending upon the destination's geographic location.
Incorporating such costs will increase the number of parameters in the model.
One of our goals in developing the HBSPk model was to keep the number
of parameters as small as possible. Thus, an empirical study is necessary
to determine the benefits of modifying the r_{i,j} parameter.
APPENDIX A
Collective Communication Performance Data
The following tables provide performance numbers for our collective routines.
We refer to this data in Section 5.4. Specifically, the tables include the actual
and predicted runtimes on heterogeneous clusters comprised of 2, 4, 6, 8, and
10 processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.074 0.141 0.206 0.273 0.346 0.413 0.475 0.545 0.615 0.683
  Tf (Tu) 0.066 0.128 0.186 0.249 0.308 0.372 0.436 0.491 0.553 0.612
  Tb      0.021 0.038 0.053 0.069 0.087 0.103 0.118 0.135 0.153 0.168
p = 4
  Ts      0.092 0.185 0.288 0.373 0.461 0.556 0.644 0.743 0.828 0.920
  Tf (Tu) 0.042 0.073 0.144 0.144 0.170 0.215 0.243 0.282 0.313 0.338
  Tb      0.050 0.086 0.140 0.175 0.229 0.276 0.317 0.382 0.392 0.442
p = 6
  Ts      0.103 0.225 0.302 0.421 0.511 0.613 0.759 0.825 0.911 1.022
  Tf (Tu) 0.032 0.059 0.085 0.105 0.132 0.170 0.187 0.216 0.233 0.268
  Tb      0.04  0.070 0.102 0.137 0.177 0.213 0.239 0.291 0.323 0.353
p = 8
  Ts      0.171 0.208 0.338 0.426 0.532 0.650 0.752 0.916 0.974 1.079
  Tf (Tu) 0.028 0.053 0.073 0.087 0.109 0.130 0.152 0.171 0.190 0.213
  Tb      0.035 0.145 0.084 0.131 0.141 0.179 0.209 0.234 0.247 0.281
p = 10
  Ts      0.337 0.257 0.356 0.450 0.581 0.650 1.017 0.863 0.951 1.101
  Tf (Tu) 0.038 0.069 0.169 0.120 0.199 0.213 0.219 0.243 0.271 0.297
  Tb      0.174 0.065 0.152 0.164 0.182 0.190 0.231 0.253 0.265 0.296

Table A.1: Actual execution times (in seconds) for gather. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution
time assuming a slow and fast root node, respectively. Tb is the runtime for
balanced workloads. Each data point represents the average of 10 runs on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.091 0.173 0.255 0.337 0.419 0.501 0.582 0.664 0.746 0.828
  Tf (Tu) 0.050 0.091 0.132 0.173 0.214 0.255 0.296 0.337 0.387 0.419
  Tb      0.029 0.049 0.069 0.089 0.109 0.129 0.149 0.170 0.190 0.210
p = 4
  Ts      0.970 0.179 0.261 0.343 0.425 0.507 0.588 0.670 0.752 0.834
  Tf (Tu) 0.035 0.056 0.076 0.097 0.117 0.138 0.158 0.179 0.199 0.220
  Tb      0.035 0.055 0.075 0.095 0.115 0.135 0.155 0.176 0.196 0.216
p = 6
  Ts      0.105 0.187 0.269 0.351 0.433 0.515 0.596 0.678 0.760 0.842
  Tf (Tu) 0.043 0.063 0.083 0.103 0.123 0.143 0.163 0.183 0.204 0.224
  Tb      0.043 0.063 0.083 0.103 0.123 0.143 0.163 0.184 0.204 0.224
p = 8
  Ts      0.112 0.194 0.276 0.358 0.440 0.522 0.603 0.685 0.767 0.849
  Tf (Tu) 0.050 0.070 0.090 0.110 0.130 0.150 0.170 0.191 0.211 0.231
  Tb      0.051 0.070 0.090 0.110 0.130 0.150 0.170 0.191 0.211 0.231
p = 10
  Ts      0.119 0.201 0.283 0.365 0.447 0.529 0.610 0.692 0.774 0.856
  Tf (Tu) 0.057 0.077 0.097 0.117 0.137 0.157 0.177 0.198 0.218 0.238
  Tb      0.057 0.077 0.097 0.117 0.137 0.157 0.177 0.198 0.218 0.238

Table A.2: Predicted execution times (in seconds) for gather. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution
time assuming a slow and fast root node, respectively. Tb is the runtime for
balanced workloads. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.070 0.130 0.194 0.257 0.323 0.379 0.441 0.508 0.565 0.631
  Tf (Tu) 0.075 0.141 0.210 0.277 0.340 0.409 0.477 0.549 0.608 0.681
  Tb      0.027 0.044 0.064 0.081 0.097 0.116 0.134 0.153 0.169 0.188
p = 4
  Ts      0.090 0.171 0.257 0.337 0.458 0.529 0.696 0.814 0.921 0.981
  Tf (Tu) 0.045 0.079 0.146 0.196 0.267 0.313 0.372 0.459 0.485 0.573
  Tb      0.049 0.091 0.120 0.158 0.192 0.239 0.302 .0314 0.374 0.422
p = 6
  Ts      0.099 0.185 0.267 0.465 0.611 0.665 0.769 0.896 1.012 1.131
  Tf (Tu) 0.042 0.069 0.125 0.217 0.248 0.305 0.344 0.451 0.485 0.539
  Tb      0.041 0.067 0.124 0.150 0.218 0.234 0.276 0.326 0.381 0.422
p = 8
  Ts      0.104 0.189 0.270 0.357 0.782 0.756 0.917 0.871 1.049 1.111
  Tf (Tu) 0.041 0.061 0.083 0.105 0.130 0.308 0.310 0.373 0.410 0.541
  Tb      0.038 0.059 0.080 0.133 0.159 0.216 0.326 0.261 0.392 0.409
p = 10
  Ts      0.109 0.238 0.279 0.367 0.454 0.787 0.763 1.067 1.046 1.110
  Tf (Tu) 0.082 0.072 0.095 0.123 0.150 0.186 0.373 0.404 0.424 0.515
  Tb      0.043 0.065 0.087 0.111 0.185 0.208 0.257 0.315 0.326 0.366

Table A.3: Actual execution times (in seconds) for scatter. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution
time assuming a slow and fast root node, respectively. Tb is the runtime for
balanced workloads. Each data point represents the average of 10 runs on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.091 0.173 0.255 0.337 0.419 0.501 0.582 0.664 0.746 0.828
  Tf (Tu) 0.050 0.091 0.132 0.173 0.214 0.255 0.296 0.337 0.378 0.419
  Tb      0.029 0.049 0.069 0.089 0.109 0.129 0.149 0.170 0.190 0.210
p = 4
  Ts      0.097 0.179 0.261 0.343 0.425 0.507 0.588 0.670 0.752 0.834
  Tf (Tu) 0.035 0.056 0.076 0.097 0.117 0.138 0.158 0.179 0.199 0.220
  Tb      0.035 0.055 0.075 0.095 0.115 0.135 0.155 0.176 0.196 0.216
p = 6
  Ts      0.105 0.187 0.269 0.351 0.433 0.515 0.596 0.678 0.760 0.842
  Tf (Tu) 0.043 0.063 0.083 0.103 0.123 0.143 0.163 0.184 0.204 0.224
  Tb      0.043 0.063 0.083 0.103 0.123 0.143 0.163 0.183 0.203 0.224
p = 8
  Ts      0.112 0.194 0.276 0.358 0.440 0.522 0.603 0.685 0.767 0.849
  Tf (Tu) 0.050 0.070 0.090 0.110 0.130 0.150 0.170 0.190 0.210 0.231
  Tb      0.050 0.070 0.090 0.110 0.130 0.150 0.170 0.190 0.211 0.230
p = 10
  Ts      0.119 0.201 0.283 0.365 0.447 0.529 0.610 0.692 0.774 0.856
  Tf (Tu) 0.057 0.077 0.097 0.117 0.137 0.157 0.177 0.197 0.218 0.238
  Tb      0.057 0.077 0.097 0.117 0.137 0.157 0.177 0.198 0.218 0.238

Table A.4: Predicted execution times (in seconds) for scatter. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution
time assuming a slow and fast root node, respectively. Tb is the runtime for
balanced workloads. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.016 0.019 0.023 0.026 0.030 0.033 0.036 0.039 0.043 0.047
  Tf (Tu) 0.016 0.020 0.023 0.026 0.030 0.033 0.037 0.040 0.044 0.047
  Tb      0.013 0.015 0.015 0.017 0.018 0.020 0.021 0.023 0.026 0.027
p = 4
  Ts      0.021 0.023 0.024 0.026 0.028 0.029 0.031 0.033 0.035 0.037
  Tf (Tu) 0.020 0.022 0.024 0.026 0.027 0.029 0.030 0.032 0.040 0.035
  Tb      0.019 0.019 0.019 0.020 0.020 0.021 0.021 0.021 0.023 0.023
p = 6
  Ts      0.027 0.028 0.030 0.031 0.032 0.033 0.034 0.035 0.036 0.037
  Tf (Tu) 0.025 0.027 0.027 0.028 0.029 0.030 0.031 0.034 0.034 0.035
  Tb      0.024 0.024 0.031 0.024 0.025 0.025 0.026 0.027 0.026 0.027
p = 8
  Ts      0.034 0.035 0.038 0.037 0.038 0.039 0.039 0.040 0.041 0.042
  Tf (Tu) 0.036 0.031 0.033 0.032 0.033 0.039 0.036 0.035 0.037 0.038
  Tb      0.029 0.029 0.031 0.030 0.030 0.032 0.031 0.032 0.032 0.032
p = 10
  Ts      0.041 0.042 0.043 0.044 0.044 0.045 0.085 0.046 0.047 0.048
  Tf (Tu) 0.075 0.038 0.037 0.037 0.040 0.040 0.040 0.041 0.010 0.042
  Tb      0.035 0.035 0.037 0.035 0.432 0.037 0.036 0.040 0.036 0.037

Table A.5: Actual execution times (in seconds) for single-value reduction. The
problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote
the execution time assuming a slow and fast root node, respectively. Tb is the
runtime for balanced workloads. Each data point represents the average of 10
runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.012 0.015 0.018 0.020 0.023 0.026 0.029 0.032 0.035 0.037
  Tf (Tu) 0.012 0.015 0.018 0.020 0.024 0.026 0.029 0.032 0.035 0.038
  Tb      0.010 0.012 0.013 0.015 0.016 0.018 0.019 0.021 0.022 0.023
p = 4
  Ts      0.016 0.018 0.019 0.020 0.022 0.023 0.025 0.026 0.028 0.029
  Tf (Tu) 0.016 0.018 0.019 0.021 0.022 0.024 0.025 0.026 0.028 0.029
  Tb      0.016 0.016 0.017 0.018 0.019 0.019 0.020 0.021 0.022 0.022
p = 6
  Ts      0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033
  Tf (Tu) 0.024 0.029 0.031 0.034 0.038 0.040 0.043 0.046 0.049 0.051
  Tb      0.023 0.024 0.024 0.025 0.025 0.026 0.026 0.027 0.027 0.028
p = 8
  Ts      0.031 0.031 0.032 0.033 0.034 0.034 0.035 0.036 0.036 0.037
  Tf (Tu) 0.031 0.031 0.032 0.033 0.034 0.034 0.035 0.036 0.036 0.037
  Tb      0.030 0.031 0.031 0.031 0.032 0.032 0.032 0.033 0.033 0.034
p = 10
  Ts      0.038 0.038 0.039 0.039 0.040 0.041 0.041 0.042 0.042 0.043
  Tf (Tu) 0.038 0.038 0.039 0.039 0.040 0.040 0.041 0.042 0.042 0.043
  Tb      0.037 0.038 0.038 0.038 0.038 0.039 0.039 0.039 0.040 0.040

Table A.6: Predicted execution times (in seconds) for single-value reduction. The
problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the
execution time assuming a slow and fast root node, respectively. Tb is the runtime
for balanced workloads. Each data point represents the predicted performance
on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.096 0.188 0.269 0.370 0.447 0.543 0.630 0.733 0.794 0.911
  Tf      0.073 0.138 0.205 0.275 0.342 0.400 0.466 0.535 0.601 0.669
p = 4
  Ts      0.120 0.239 0.361 0.474 0.595 0.699 0.808 0.946 1.021 1.175
  Tf      0.047 0.086 0.125 0.169 0.200 0.249 0.279 0.315 0.357 0.394
p = 6
  Ts      0.137 0.276 0.393 0.524 0.629 0.774 0.873 1.052 1.150 1.256
  Tf      0.039 0.072 0.098 0.132 0.155 0.184 0.233 0.250 0.293 0.323
p = 8
  Ts      0.274 0.265 0.410 0.527 0.646 0.792 0.890 1.082 1.161 1.301
  Tf      0.038 0.058 0.083 0.109 0.141 0.156 0.187 0.217 0.229 0.261
p = 10
  Ts      0.472 0.373 0.411 0.592 0.668 0.783 0.964 1.072 1.187 1.357
  Tf      0.053 0.207 0.286 0.154 0.197 0.253 0.258 0.285 0.315 0.334

Table A.7: Actual execution times (in seconds) for point-wise reduction. The
problem size ranges from 100KB to 1000KB of integers. Ts and Tf denote the
execution time assuming a slow and fast root node, respectively. Each data point
represents the average of 10 runs on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.012 0.218 0.317 0.432 0.430 0.630 0.729 0.832 0.935 1.039
  Tf      0.068 0.127 0.186 0.246 0.305 0.364 0.423 0.481 0.540 0.599
p = 4
  Ts      0.118 0.224 0.323 0.438 0.436 0.636 0.735 0.839 0.941 1.045
  Tf      0.047 0.080 0.112 0.145 0.177 0.210 0.243 0.274 0.306 0.338
p = 6
  Ts      0.126 0.232 0.331 0.446 0.444 0.644 0.742 0.847 0.949 1.053
  Tf      0.049 0.075 0.101 0.127 0.153 0.178 0.205 0.229 0.255 0.281
p = 8
  Ts      0.133 0.239 0.338 0.453 0.451 0.651 0.750 0.854 0.956 1.060
  Tf      0.056 0.082 0.108 0.134 0.160 0.185 0.212 0.236 0.262 0.288
p = 10
  Ts      0.140 0.246 0.345 0.460 0.458 0.658 0.757 0.861 0.963 1.067
  Tf      0.063 0.089 0.115 0.141 0.167 0.192 0.219 0.243 0.269 0.295

Table A.8: Predicted execution times (in seconds) for point-wise reduction. The
problem size ranges from 100KB to 1000KB of integers. Ts and Tf denote the
execution time assuming a slow and fast root node, respectively. Each data point
represents the predicted performance on a cluster comprised of 2, 4, 6, 8, and 10
heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.028 0.045 0.047 0.058 0.068 0.078 0.086 0.096 0.107 0.114
  Tf (Tu) 0.029 0.039 0.048 0.058 0.068 0.078 0.088 0.098 0.107 0.116
  Tb      0.022 0.024 0.027 0.032 0.036 0.040 0.044 0.049 0.053 0.057
p = 4
  Ts      0.035 0.044 0.045 0.050 0.056 0.063 0.064 0.070 0.075 0.079
  Tf (Tu) 0.034 0.041 0.042 0.048 0.052 0.063 0.062 0.066 0.071 0.077
  Tb      0.028 0.031 0.031 0.032 0.033 0.035 0.036 0.039 0.040 0.043
p = 6
  Ts      0.045 0.048 0.054 0.059 0.061 0.062 0.066 0.074 0.071 0.074
  Tf (Tu) 0.039 0.042 0.047 0.052 0.051 0.054 0.058 0.061 0.065 0.068
  Tb      0.036 0.036 0.037 0.045 0.041 0.04  0.041 0.042 0.043 0.044
p = 8
  Ts      0.057 0.065 0.064 0.064 0.067 0.074 0.07  0.077 0.076 0.078
  Tf (Tu) 0.048 0.053 0.052 0.054 0.056 0.063 0.061 0.064 0.067 0.069
  Tb      0.046 0.048 0.045 0.047 0.048 0.048 0.049 0.050 0.050 0.050
p = 10
  Ts      0.069 0.072 0.074 0.058 0.077 0.078 0.086 0.083 0.091 0.088
  Tf (Tu) 0.056 0.116 0.058 0.074 0.062 0.102 0.067 0.089 0.076 0.072
  Tb      0.215 0.057 0.056 0.058 0.057 0.058 0.056 0.058 0.058 0.058

Table A.9: Actual execution times (in seconds) for prefix sums. The problem size
ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution
time assuming a slow and fast root node, respectively. Tb is the runtime for
balanced workloads. Each data point represents the average of 10 runs on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.028 0.039 0.049 0.063 0.071 0.081 0.092 0.102 0.113 0.124
  Tf (Tu) 0.029 0.039 0.050 0.060 0.072 0.081 0.092 0.106 0.115 0.124
  Tb      0.021 0.024 0.027 0.029 0.033 0.035 0.038 0.041 0.044 0.047
p = 4
  Ts      0.035 0.041 0.046 0.051 0.057 0.062 0.067 0.072 0.078 0.083
  Tf (Tu) 0.035 0.041 0.046 0.051 0.057 0.062 0.067 0.072 0.078 0.083
  Tb      0.031 0.033 0.034 0.036 0.037 0.039 0.040 0.041 0.043 0.044
p = 6
  Ts      0.050 0.053 0.057 0.060 0.064 0.067 0.070 0.074 0.078 0.081
  Tf (Tu) 0.050 0.053 0.057 0.060 0.064 0.067 0.070 0.074 0.078 0.081
  Tb      0.046 0.047 0.047 0.048 0.048 0.049 0.049 0.050 0.050 0.051
p = 8
  Ts      0.063 0.066 0.068 0.071 0.073 0.076 0.078 0.081 0.084 0.087
  Tf (Tu) 0.063 0.066 0.068 0.071 0.073 0.076 0.078 0.081 0.084 0.087
  Tb      0.060 0.061 0.061 0.061 0.062 0.062 0.062 0.063 0.063 0.064
p = 10
  Ts      0.076 0.079 0.080 0.084 0.085 0.087 0.089 0.091 0.093 0.095
  Tf (Tu) 0.076 0.079 0.080 0.084 0.085 0.087 0.089 0.091 0.093 0.095
  Tb      0.074 0.075 0.075 0.075 0.075 0.076 0.076 0.076 0.077 0.077

Table A.10: Predicted execution times (in seconds) for prefix sums. The problem
size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the execution
time assuming a slow and fast root node, respectively. Tb is the runtime for
balanced workloads. Each data point represents the predicted performance on a
cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.137 0.249 0.365 0.492 0.623 0.732 0.877 1.019 1.104 1.252
  Tf (Tu) 0.135 0.263 0.389 0.525 0.655 0.769 0.928 1.046 1.156 1.307
  Tb      0.139 0.276 0.403 0.545 0.663 0.809 0.938 1.079 1.193 1.337
p = 4
  Ts      0.229 0.441 0.681 0.876 1.104 1.460 1.742 1.989 2.241 2.501
  Tf (Tu) 0.186 0.360 0.624 0.885 1.085 1.317 1.564 1.830 2.083 2.396
  Tb      0.210 0.475 0.739 0.961 1.218 1.488 1.759 2.017 2.270 2.598
p = 6
  Ts      0.256 0.559 0.770 1.296 1.537 1.981 2.270 2.646 2.967 3.359
  Tf (Tu) 0.204 0.465 0.639 1.224 1.521 1.946 2.165 2.765 2.932 3.392
  Tb      0.222 0.537 0.908 1.193 1.521 1.769 2.159 2.504 2.828 3.155
p = 8
  Ts      0.323 0.531 0.917 1.220 1.665 2.110 2.470 2.956 4.126 3.787
  Tf (Tu) 0.208 0.402 0.843 1.040 1.490 2.093 2.363 2.815 3.310 3.542
  Tb      0.241 0.469 0.881 1.218 1.788 2.050 2.207 2.823 3.641 3.765
p = 10
  Ts      1.456 1.769 1.452 1.770 2.310 3.588 3.332 3.877 4.489 5.061
  Tf (Tu) 0.450 0.862 1.266 1.537 2.041 2.435 3.152 3.573 4.212 4.773
  Tb      0.410 1.130 1.134 1.766 1.839 2.676 3.269 3.633 4.476 4.952

Table A.11: Actual execution times (in seconds) for one-to-all broadcast. The
problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote
the execution time assuming a slow and fast root node, respectively. Tb is the
runtime for balanced workloads. Each data point represents the average of 10
runs on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  Ts      0.182 0.346 0.510 0.673 0.837 1.001 1.165 1.329 1.493 1.656
  Tf (Tu) 0.141 0.264 0.387 0.510 0.632 0.755 0.878 1.001 1.124 1.247
  Tb      0.120 0.222 0.324 0.426 0.528 0.630 0.732 0.834 0.936 1.038
p = 4
  Ts      0.194 0.358 0.522 0.685 0.849 1.013 1.177 1.341 1.505 1.668
  Tf (Tu) 0.132 0.235 0.337 0.440 0.542 0.644 0.747 0.849 0.952 1.054
  Tb      0.132 0.234 0.336 0.438 0.540 0.640 0.744 0.846 0.948 1.050
p = 6
  Ts      0.210 0.374 0.538 0.701 0.865 1.029 1.193 1.357 1.521 1.684
  Tf (Tu) 0.148 0.250 0.352 0.454 0.556 0.658 0.760 0.862 0.964 1.066
  Tb      0.148 0.250 0.352 0.454 0.556 0.658 0.760 0.862 0.964 1.066
p = 8
  Ts      0.224 0.388 0.552 0.715 0.879 1.043 1.207 1.371 1.535 1.698
  Tf (Tu) 0.162 0.264 0.366 0.468 0.570 0.672 0.774 0.876 0.978 1.080
  Tb      0.162 0.264 0.366 0.468 0.570 0.672 0.774 0.876 0.978 1.080
p = 10
  Ts      0.238 0.402 0.566 0.729 0.893 1.057 1.221 1.385 1.549 1.712
  Tf (Tu) 0.176 0.278 0.380 0.482 0.584 0.686 0.788 0.890 0.992 1.094
  Tb      0.176 0.278 0.380 0.482 0.584 0.686 0.788 0.890 0.992 1.094

Table A.12: Predicted execution times (in seconds) for one-to-all broadcast. The
problem size ranges from 100KB to 1000KB of integers. Ts and Tf (Tu) denote the
execution time assuming a slow and fast root node, respectively. Tb is the runtime
for balanced workloads. Each data point represents the predicted performance
on a cluster comprised of 2, 4, 6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  SB      0.017 0.024 0.035 0.045 0.057 0.066 0.076 0.088 0.097 0.109
  ID      0.033 0.054 0.070 0.088 0.109 0.127 0.142 0.164 0.182 0.201
p = 4
  SB      0.028 0.040 0.053 0.076 0.087 0.121 0.130 0.140 0.156 0.228
  ID      0.043 0.058 0.073 0.090 0.118 0.129 0.157 0.183 0.199 0.210
p = 6
  SB      0.035 0.047 0.069 0.103 0.108 0.128 0.139 0.159 0.176 0.202
  ID      0.059 0.071 0.089 0.114 0.136 0.157 0.240 0.200 0.224 0.245
p = 8
  SB      0.041 0.124 0.071 0.096 0.115 0.128 0.147 0.613 0.230 0.199
  ID      0.069 0.303 0.103 0.194 0.268 0.172 0.237 0.312 0.269 0.249
p = 10
  SB      0.057 0.320 0.239 0.244 0.203 0.226 0.282 0.315 0.380 0.383
  ID      0.142 0.367 0.434 1.073 0.555 0.422 0.750 0.512 0.442 0.462

Table A.13: Actual execution times (in seconds) for all-to-all broadcast. There
are two algorithms compared: simultaneous broadcast (SB) and intermediate
destination (ID). The problem size ranges from 100KB to 1000KB of integers.
Each data point represents the average of 10 runs on a cluster comprised of 2, 4,
6, 8, and 10 heterogeneous processors.
                             problem size (in KB)
            100   200   300   400   500   600   700   800   900  1000
p = 2
  SB      0.029 0.050 0.070 0.091 0.111 0.132 0.152 0.173 0.193 0.214
  ID      0.058 0.098 0.138 0.179 0.219 0.259 0.299 0.339 0.379 0.419
p = 4
  SB      0.035 0.056 0.076 0.097 0.117 0.138 0.158 0.179 0.199 0.220
  ID      0.070 0.110 0.150 0.191 0.231 0.271 0.311 0.351 0.391 0.431
p = 6
  SB      0.043 0.064 0.084 0.105 0.125 0.146 0.166 0.187 0.207 0.228
  ID      0.086 0.126 0.166 0.207 0.247 0.287 0.327 0.367 0.407 0.447
p = 8
  SB      0.050 0.071 0.091 0.112 0.132 0.153 0.173 0.194 0.214 0.235
  ID      0.100 0.140 0.180 0.221 0.261 0.301 0.341 0.381 0.421 0.461
p = 10
  SB      0.068 0.098 0.129 0.160 0.191 0.221 0.252 0.283 0.313 0.242
  ID      0.114 0.154 0.194 0.235 0.275 0.315 0.355 0.395 0.435 0.475

Table A.14: Predicted execution times (in seconds) for all-to-all broadcast. There
are two algorithms compared: simultaneous broadcast (SB) and intermediate
destination (ID). The problem size ranges from 100KB to 1000KB of integers.
Each data point represents the predicted performance on a cluster comprised of
2, 4, 6, 8, and 10 heterogeneous processors.
List of References
[ACS89] A. Aggarwal, A. K. Chandra, and M. Snir. "On communication latency in PRAM computations." In 1st ACM Symposium on Parallel
Algorithms and Architectures, pp. 11–21, 1989.
[ACS90] A. Aggarwal, A. K. Chandra, and M. Snir. "Communication complexity of PRAMs." In J. Theoretical Computer Science, March 1990.
[AGL98] Gail A. Alverson, William G. Griswold, Calvin Lin, David Notkin, and
Lawrence Snyder. "Abstractions for Portable, Scalable Parallel Programming." IEEE Transactions on Parallel and Distributed Systems,
9(1):71–86, January 1998.
[Akl97] Selim Akl. Parallel Computation: Models and Methods. Prentice Hall,
1997.
[Bat68] K. Batcher. "Sorting networks and their applications." In Proceedings
of the AFIPS Spring Joint Computing Conference, pp. 307–314, 1968.
[BBC94] Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo,
Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. "CCL: A
Portable and Tunable Collective Communication Library for Scalable
Parallel Computers." In Proceedings of 8th International Parallel Processing Symposium, pp. 835–844, 1994.
[BBR00] Olivier Beaumont, Vincent Boudet, Fabrice Rastello, and Yves
Robert. "Matrix-Matrix Multiplication on Heterogeneous Platforms."
Technical Report 2000-24, École Normale Supérieure de Lyon, January 2000.
[BDR99] Pierre Boulet, Jack Dongarra, Fabrice Rastello, Yves Robert, and
Frederic Vivien. "Algorithmic issues on heterogeneous computing
platforms." Parallel Processing Letters, 9(2):197–213, 1999.
[BGM95] Guy Blelloch, Phil Gibbons, Yossi Matias, and Marco Zagha. "Accounting for Memory Bank Contention and Delay in High-Bandwidth
Multiprocessors." In Seventh ACM Symposium on Parallel Algorithms
and Architectures, pp. 84–94, June 1995.
[BGP94] Mike Barnett, Satya Gupta, David G. Payne, Lance Shuler, Robert
van de Geijn, and Jerrell Watts. "Interprocessor Collective Communication Library (Intercom)." Scalable High Performance Computing
Conference, pp. 357–364, 1994.
[BHP96] Gianfranco Bilardi, Kieran T. Herley, Andrea Pietracaprina, Geppino
Pucci, and Paul Spirakis. "BSP vs LogP." In Eighth Annual ACM
Symposium on Parallel Algorithms and Architectures, pp. 25–32, June
1996.
[BJ99] David Bader and Joseph JaJa. "SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric
Multiprocessors (SMPs)." Journal of Parallel and Distributed Computing, 58(1):92–108, July 1999.
[BL92] R. Butler and E. Lusk. "User's guide to the p4 Programming System."
Technical Report ANL-92/17, Argonne National Laboratory, 1992.
[BLM98] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton,
S. J. Smith, and M. Zagha. "An experimental analysis of parallel
sorting algorithms." Theory of Computing Systems, 31(2):135–167,
March/April 1998.
[BMP98] M. Banikazemi, V. Moorthy, and D. Panda. "Efficient Collective
Communication on Heterogeneous Networks of Workstations." In International Conference on Parallel Processing, pp. 460–467, 1998.
[BRP99] Prashanth Bhat, C. S. Raghavendra, and Viktor Prasanna. "Efficient
Collective Communication in Distributed Heterogeneous Systems."
In International Conference on Distributed Computing Systems, May
1999.
[BSP99] M. Banikazemi, J. Sampathkumar, S. Prabhu, D. Panda, and P. Sadayappan. "Communication Modeling of Heterogeneous Networks of
Workstations for Performance Characterization of Collective Operations." In Heterogeneous Computing Workshop (HCW '99), pp. 125–133, April 1999.
[Buy99a] Rajkumar Buyya. High Performance Cluster Computing: Architectures and Systems, volume 1. Prentice Hall, 1999.
[Buy99b] Rajkumar Buyya. High Performance Cluster Computing: Programming and Applications, volume 2. Prentice Hall, 1999.
[BYT95] Byte Magazine. "The BYTEmark benchmark." URL
http://www.byte.com/bmark/bmark.htm, 1995.
[CG89] N. Carriero and D. Gelernter. "LINDA in context." Communications
of the ACM, 32:444–458, 1989.
[CKP93] David Culler, Richard Karp, David Patterson, Abhijit Sahay,
Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and
Thorsten von Eicken. "LogP: Towards a Realistic Model of Parallel
Computation." In Fourth ACM Symposium on Principles and Practice of Parallel Programming, pp. 1–12, May 1993.
[CKP96] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay,
Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and
Thorsten von Eicken. "LogP: A Practical Model of Parallel Computation." Communications of the ACM, 39(11):78–85, November 1996.
[CLR94] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms.
MIT Press, 1994.
[CR73] S. A. Cook and R. A. Reckhow. "Time Bounded Random Access Machines." Journal of Computer and Systems Sciences, 7:354–375, 1973.
[DCS96] Andrea C. Dusseau, David E. Culler, Klaus Erik Schauser, and
Richard P. Martin. "Fast Parallel Sorting Under LogP: Experience
with the CM-5." IEEE Transactions on Parallel and Distributed Systems, 7(8):791–805, August 1996.
[DFR93] F. Dehne, A. Fabri, and A. Rau-Chaplin. "Scalable Parallel Computational Geometry for Coarse Multicomputers." In Proc. ACM Symposium on Computational Geometry, pp. 298–307, 1993.
[EF93] M. M. Eshaghian and R. F. Freund. "Cluster-M Paradigms for High-Order Heterogeneous Procedural Specification Computing." In Workshop on Heterogeneous Processing, 1993.
[ES93] M. M. Eshaghian and M. E. Shaaban. "Cluster-M Parallel Programming Paradigm." International Journal of High Speed Computing,
1993.
[FK98] Ian Foster and Carl Kesselman, editors. The Grid: Blueprint for a
New Computing Infrastructure. Morgan Kaufmann, 1998.
[FM70] W. D. Frazer and A. C. McKellar. "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting." Journal of the ACM,
17(3):496–507, 1970.
[For93] High Performance Fortran Forum. "High Performance Fortran Language Specification." Scientific Programming, 2(1–2):1–170, 1993.
[FW78] S. Fortune and J. Wyllie. "Parallelism in Random Access Machines."
In Proceedings of the 10th Annual Symposium on Theory of Computing, pp. 114–118, 1978.
[Gel85] D. Gelernter. "Generative Communication in Linda." ACM Transactions on Programming Languages and Systems, 7(1):80–112, 1985.
[GHP90] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley. "A
user's guide to PICL: A portable instrumented communication library." Technical Report TM-11616, Oak Ridge National Laboratory,
1990.
[Gib89] Phillip Gibbons. "A more practical PRAM model." In 1st ACM Symposium on Parallel Algorithms and Architectures, pp. 158–168, 1989.
[GLR99] Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, and
Thanasis Tsantilas. "Portable and Efficient Parallel Computing Using
the BSP Model." IEEE Transactions on Computers, 48(7):670–689,
1999.
[GMR94] Phillip Gibbons, Yossi Matias, and Vijaya Ramachandran. "The
QRQW PRAM: Accounting for Contention in Parallel Algorithms."
In Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.
638–648, January 1994.
[GMR97] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. "Can a
Shared-Memory Model Serve as a Bridging Model for Parallel Computation?" In 9th Annual ACM Symposium on Parallel Algorithms
and Architectures, pp. 72–83, 1997.
[Goo93] M. Goodrich. "Parallel Algorithms Column 1: Models of Computation." SIGACT News, 24:16–21, December 1993.
[GR98] Mark W. Goudreau and Satish B. Rao. "Single Message vs. Batch
Communication." In M.T. Heath, A. Ranade, and R.S. Schreiber, editors, Algorithms for Parallel Processing, volume 105 of IMA Volumes
in Mathematics and Applications, pp. 61–74. Springer-Verlag, 1998.
[GS96] Alexandros V. Gerbessiotis and Constantinos J. Siniolakis. "Deterministic Sorting and Randomized Mean Finding on the BSP Model."
In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 223–232, June 1996.
[GV94] Alexandros V. Gerbessiotis and Leslie G. Valiant. "Direct Bulk-Synchronous Parallel Algorithms." Journal of Parallel and Distributed
Computing, 22(2):251–267, August 1994.
[HBJ96] David R. Helman, David A. Bader, and Joseph JaJa. "Parallel Algorithms for Personalized Communication and Sorting with an Experimental Study." In Eighth Annual ACM Symposium on Parallel
Algorithms and Architectures, pp. 211–222, June 1996.
[HC83] J. S. Huang and Y. C. Chow. "Parallel Sorting and Data Partitioning
by Sampling." In IEEE Computer Society's Seventh International
Computer Software & Applications Conference (COMPSAC'83), pp.
627–631, November 1983.
[HH98] P. Husbands and J. C. Hoe. "MPI-StarT: Delivering Network Performance to Numerical Applications." In Supercomputing '98, 1998.
[HJS97] Jonathan M. D. Hill, Stephen A. Jarvis, Constantinos Siniolakis, and
Vasil P. Vasilev. "Portable and Architecture Independent Parallel Performance Tuning Using a Call-Graph Profiling Tool: A Case Study in
Optimising SQL." Technical Report PRG-TR-17-97, Oxford University Computing Laboratory, 1997.
[HMS98] Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W.
Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. "BSPlib: The BSP Programming Library."
Parallel Computing, 24(14):1947–1980, 1998.
[Hoa62] C. A. R. Hoare. "Quicksort." Computer Journal, 5(1):10–15, 1962.
[HPR92] William L. Hightower, Jan F. Prins, and John H. Reif. "Implementations of Randomized Sorting on Large Parallel Machines." In 4th
Annual ACM Symposium on Parallel Algorithms and Architectures,
pp. 158–167, June 1992.
[HX98] Kai Hwang and Zhiwei Xu. Scalable Parallel Computing. McGraw-Hill, 1998.
[Ion96] Mihai Florin Ionescu. "Optimizing Parallel Bitonic Sort." Master's
thesis, University of California at Santa Barbara, 1996.
[JW98] Ben Juurlink and Harry Wijshoff. "A Quantitative Comparison of
Parallel Computation Models." ACM Transactions on Computer Systems, 16(3):271–318, 1998.
[KBG00] Thilo Kielmann, Henri Bal, and Sergei Gorlatch. "Bandwidth-efficient
Collective Communication for Clustered Wide Area Systems." In 14th
International Parallel and Distributed Processing Symposium, pp.
492–499, 2000.
[KGG94] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction
to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, 1994.
[KHP99] Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Platt, and
Raoul A. F. Bhoedjang. "MPI's Reduction Operations in Clustered
Wide Area Systems." In Message Passing Interface Developer's and
User's Conference, pp. 43–52, Atlanta, GA, March 1999.
[KPS93] A. Khokhar, V. Prasanna, M. Shaaban, and C. Wang. "Heterogeneous
computing: Challenges and opportunities." Computer, 26(6):18–27,
June 1993.
[KS99] Danny Krizanc and Anton Saarimaki. "Bulk synchronous parallel:
practical experience with a model for parallel computing." Parallel
Computing, 25(2):159–181, 1999.
[KSF00] Nicholas T. Karonis, Bronis R. De Supinski, Ian Foster, and William
Gropp. "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance." In 14th International Parallel and Distributed Processing Symposium, pp. 377–384, 2000.
[LB96] Bruce B. Lowekamp and Adam Beguelin. "ECO: Efficient Collective
Operations for Communication on Heterogeneous Networks." In International Parallel Processing Symposium, pp. 399–405, Honolulu,
HI, 1996.
[Lei93] Tom Leighton. Introduction to Parallel Architectures: Arrays, Trees,
Hypercubes. Morgan Kaufmann, 1993.
[LM88] Charles Leiserson and Bruce M. Maggs. "Communication-Efficient
Parallel Algorithms for Distributed Random-Access Machines." Algorithmica, 3:53–77, 1988.
[LMR95] Zhiyong Li, Peter H. Mills, and John H. Reif. "Models and Resource
Metrics for Parallel and Distributed Computation." In Proceedings
of 28th Annual Hawaii International Conference on System Sciences,
January 1995.
[McC93] W. F. McColl. "General Purpose Parallel Computing." In A. M.
Gibbons and P. Spirakis, editors, Lectures in Parallel Computation,
Proceedings 1991 ALCOM Spring School on Parallel Computation, pp.
337–391. Cambridge University Press, 1993.
[MMT95] Bruce M. Maggs, Lesley R. Matheson, and Robert E. Tarjan. "Models of Parallel Computation: A Survey and Synthesis." In Proceedings
of the 28th Hawaii International Conference on System Sciences, volume 2, pp. 61–70. IEEE Press, January 1995.
[Mor98a] Pat Morin. "Coarse-Grained Parallel Computing on Heterogeneous
Systems." In Proceedings of the 1998 ACM Symposium on Applied
Computing, pp. 629–634, 1998.
[Mor98b] Pat Morin. "Two Topics in Applied Algorithmics." Master's thesis,
Carleton University, 1998.
[MR95] Philip McKinley and David Robinson. "Collective Communication in
Wormhole-Routed Massively Parallel Computers." IEEE Computer,
28(12):39–50, December 1995.
[RV87] J. H. Reif and L. G. Valiant. "A Logarithmic Time Sort for Linear
Size Networks." Journal of the ACM, 34(1):60–76, 1987.
[SDA97] Howard J. Siegel, Henry G. Dietz, and John K. Antonio. "Software
Support for Heterogeneous Computing." In Allen B. Tucker, editor,
The Computer Science and Engineering Handbook, pp. 1886–1909.
CRC Press, 1997.
[SG97] Gregory Shumaker and Mark W. Goudreau. "Bulk-Synchronous Parallel Computing on the Maspar." In World Multiconference on Systemics, Cybernetics and Informatics, volume 1, pp. 475–481, July
1997. Invited paper.
[SHM97] David B. Skillicorn, Jonathon M. D. Hill, and W. F. McColl. "Questions and Answers About BSP." Scientific Programming, 6(3):249–274, 1997.
[Sny86] Lawrence Snyder. "Type Architectures, Shared Memory and the
Corollary of Modest Potential." Annual Review of Computer Science,
pp. 289–318, 1986.
[SOJ96] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack
Dongarra. MPI: The Complete Reference. MIT Press, 1996.
[ST98] David B. Skillicorn and Domenica Talia. "Models and Languages
for Parallel Computation." ACM Computing Surveys, 30(2):123–169,
June 1998.
[Sun90] V. S. Sunderam. "PVM: A framework for parallel distributed computing." Concurrency: Practice and Experience, 2(4):315–349, 1990.
[Val90a] Leslie G. Valiant. "A bridging model for parallel computation." Communications of the ACM, 33(8):103–111, 1990.
[Val90b] Leslie G. Valiant. "General Purpose Parallel Architectures." In J. van
Leeuwen, editor, Handbook of Theoretical Computer Science, volume
A: Algorithms and Complexity, chapter 18, pp. 943–971. MIT Press,
Cambridge, MA, 1990.
[Val93] Leslie G. Valiant. "Why BSP Computers?" In Proceedings of the 7th
International Parallel Processing Symposium, pp. 2–5. IEEE Press,
April 1993.
[WG98] Tiffani L. Williams and Mark W. Goudreau. "An experimental evaluation of BSP sorting algorithms." In Proceedings of the 10th IASTED
International Conference on Parallel and Distributed Computing Systems, pp. 115–118, October 1998.
[WP00] Tiffani L. Williams and Rebecca J. Parsons. "The Heterogeneous Bulk
Synchronous Parallel Model." In Parallel and Distributed Processing, volume 1800 of Lecture Notes in Computer Science, pp. 102–108.
Springer-Verlag, Cancun, Mexico, May 2000.
[WWD94] Charles C. Weems, Glen E. Weaver, and Steven G. Dropsho. "Linguistic Support for Heterogeneous Parallel Processing: A Survey and
an Approach." In Proceedings of the Heterogeneous Computing Workshop, pp. 81–88, 1994.