A Case Study of Shared Memory and Message Passing:
The Triangle Puzzle
by
Kevin A. Lew
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
January 20, 1995
© Massachusetts Institute of Technology, 1995. All Rights Reserved.
Author .........................................................................................................................
        Department of Electrical Engineering and Computer Science
        January 20, 1995

Certified by .................................................................................................................
        Professor M. Frans Kaashoek
        Department of Electrical Engineering and Computer Science
        Thesis Supervisor

Certified by .................................................................................................................
        Kirk Johnson
        Department of Electrical Engineering and Computer Science
        Thesis Co-Supervisor

Accepted by .................................................................................................................
        Professor Frederic R. Morgenthaler
        Chair, Department Committee on Graduate Students
        Department of Electrical Engineering and Computer Science
A Case Study of Shared Memory and Message Passing:
The Triangle Puzzle
by
Kevin A. Lew
Submitted to the Department of Electrical Engineering and Computer Science
on January 20, 1995, in partial fulfillment of the requirements for
the degree of Master of Science
Abstract
This thesis is the first controlled case study that compares shared-memory and message-passing implementations of an application that solves the triangle puzzle and runs on
actual hardware: only the communication interfaces used by the implementations vary; all
other system components remain fixed. The implementations run on the MIT Alewife
machine, a cache-coherent, distributed-shared-memory multiprocessor that efficiently
supports both the shared-memory and message-passing programming models. The goal of
the triangle puzzle is to count the number of solutions to a simple puzzle in which a set of
pegs, arranged in a triangle, is reduced to one peg by jumping over and removing a peg
with another, as in checkers. The shared-memory implementations explore distributing
data structures across processors' memory banks, load distribution, and prefetching. A
single message-passing implementation uses only message passing for interprocessor
communication. By comparing these shared-memory and message-passing implementations, we draw two main conclusions. First, when using distributed shared memory, performing cache coherence actions and decoupling synchronization and data transfer can
make a shared-memory implementation less efficient than the message-passing implementation. For our application, we observe a maximum of 52% performance improvement of
the message-passing implementation over the best shared-memory one that uses both synchronization and data transfer. Second, shared memory offers low-overhead data access
and can perform better than message passing for applications that exhibit low contention.
For our application, we observe a maximum of 14% improvement of a shared-memory
implementation over the message-passing one. Thus, sometimes message passing is better than shared memory. Sometimes the reverse is true. To enable all parallel shared-memory and message-passing applications to perform well, we advocate parallel
machines that efficiently support both communication styles.
Thesis Supervisor: M. Frans Kaashoek
Title: Assistant Professor of Computer Science and Engineering
Thesis Co-Supervisor: Kirk Johnson
Title: Doctoral candidate
Acknowledgments
I wish to thank several people who made this thesis possible. First and foremost, I
thank Prof. M. Frans Kaashoek for introducing me to the problem presented in this thesis
and for caring by holding weekly meetings to stimulate my thinking and provide guidance.
Second, I thank Kirk Johnson for providing me the code that solves the triangle puzzle on
the CM-5, his guidance, and the numerous enlightening discussions as this work evolved.
Next, I wish to thank the members of my research group for providing feedback and an
excellent learning environment: Dawson Engler, Sandeep Gupta, Wilson Hsieh, Max
Poletto, Josh Tauber, and Deborah Wallach. I would also like to thank the members of the
Alewife group for providing a truly unique machine, especially Rajeev Barua, David
Chaiken, David Kranz, John Kubiatowicz, Beng-Hong Lim, Ken Mackenzie, Sramana
Mitra, and Donald Yeung. Thanks also to Kavita Bala, Chris Lefelhocz, and Ken Mackenzie for providing useful feedback on earlier drafts of this thesis. Finally, thanks to my
family for their continuous encouragement.
Table of Contents
1 Introduction ................................................................................................................ 13
2 Background ................................................................................................................ 17
2.1 Architectural Issues .......................................................................................... 17
2.2 Programming Issues......................................................................................... 19
2.3 Performance Issues .......................................................................................... 20
3 Solving the Triangle Puzzle ....................................................................................... 23
3.1 Structure of the Solution.................................................................................. 23
3.2 Parallel Breadth-First Search versus Parallel Depth-First Search................... 25
3.3 Implementation of PBFS ................................................................................. 27
4 Implementations and Performance Results................................................................ 29
4.1 Alewife Hardware Details ............................................................................... 30
4.1.1 Shared-Memory Interface .................................................................... 31
4.1.2 Message-Passing Interface ................................................................... 31
4.2 Shared-Memory Implementations ................................................................... 32
4.2.1 Distributed Transposition Table and Privatized Work Queues ...........33
4.2.2 Improved DTABQ Implementation ..................................................... 35
4.2.3 Load Distribution ................................................................................. 37
4.2.4 Prefetching ........................................................................................... 37
4.3 Message-Passing Implementation ........................................................................ 39
4.4 Comparisons of Shared-Memory and Message-Passing Implementations ......39
4.4.1 When Message Passing Can Perform Better Than Shared Memory ...39
4.4.2 When Shared Memory Can Perform Better Than Message Passing ...41
4.4.3 Improved DTABQ and Message-Passing Implementations ................46
4.5 Hybrid Implementation .................................................................................... 47
4.6 Speedups of the Improved DTABQ and Message-Passing Implementations .48
5 Related Work ............................................................................................................. 51
6 Conclusions ................................................................................................................ 53
Appendix A Supplemental Tables ................................................................................. 55
Bibliography ................................................................................................. 63
List of Figures
Figure 1.1. Initial placement of pegs in the triangle puzzle ........................................... 14
Figure 2.1. Shared-memory machine with one memory bank ....................................... 17
Figure 2.2. Machine with "dance hall" organization ..................................................... 18
Figure 2.3. Machine with processors tightly coupled with memory banks ................... 18
Figure 3.1. Axes of symmetry of the triangle puzzle board .......................................... 24
Figure 3.2. Four data structures used in PBFS .............................................................. 28
Figure 4.1. The Alewife architecture ............................................................................. 30
Figure 4.2. Alewife's shared-memory and message-passing interfaces ........................ 32
Figure 4.3. Performance of the SM implementation, problem size 6............................ 33
Figure 4.4. Distributed transposition table and privatized work queues in the DTABQ
            implementation with two processors ......................................................... 34
Figure 4.5. Performance of SM and DTABQ implementations, problem size 6........... 34
Figure 4.6. Comparison of DTABQ and Improved DTABQ implementations, problem
            size 6 .......................................................................................................... 36
Figure 4.7. Comparison of DTABQ and message-passing implementations, problem
            size 6 .......................................................................................................... 40
Figure 4.8. Comparison of DTABQ, baseline SM, and message-passing implementations,
problem size 5............................................................................................ 44
Figure 4.9. Comparison of DTABQ, baseline SM, and message-passing implementations,
            problem size 6 ............................................................................................ 45
Figure 4.10. Comparison of message-passing and Improved DTABQ implementations,
problem size 6............................................................................................ 46
Figure 4.11. Comparison of DTABQ, hybrid, message-passing, and Improved DTABQ
implementations, problem size 6 .............................................................. 47
Figure 4.12. Speedups of Improved DTABQ and message-passing implementations over
sequential implementation ........................................................................ 48
List of Tables
Table 3.1. Relationship between symmetry and number of positions explored ...............25
Table 3.2. Position statistics............................................................................................. 27
Table 4.1. Time spent executing LimitLESS software for DTABQ and Improved DTABQ
implementations, problem size 6...................................................................... 36
Table 4.2. The percentages of total execution time a processor on the average spends idle
for the DTABQ and Improved DTABQ implementations ............................... 37
Table 4.3. The percentage improvement of prefetching positions from the next queues over
no prefetching in the Improved DTABQ implementation, problem size 6...... 38
Table 4.4. Maximum and average number of moves from any given position in a round in
           problem sizes 5 and 6 ................................................................................... 42
Table 4.5. Ratio of average request rates for work queues for problem size 6 to that for
problem size 5 in the SM implementation during the round with the highest
number of generated extensions ....................................................................... 43
Table A.1. Execution times for shared-memory and message-passing implementations,
problem size 5................................................................................................... 55
Table A.2. Execution times for shared-memory and message-passing implementations,
problem size 6................................................................................................... 57
Table A.3. Position statistics per round for problem size 5.............................................. 58
Table A.4. Position statistics per round for problem size 6.............................................. 59
Table A.5. Position statistics per round for problem size 7.............................................. 60
Table A.6. Request rates and standard deviations for work queue operations, problem
           size 5 ............................................................................................................. 61
Table A.7. Request rates and standard deviations for work queue operations, problem
size 6 ................................................................................................................. 61
Chapter 1
Introduction
Advocates of shared-memory multiprocessors have long debated with supporters of
message-passing multicomputers. The shared-memory camp argues that shared-memory
machines are easy to program; the message-passing camp argues that the cost of hardware
to support shared memory is high and limits scalability and performance. This thesis is
the first controlled case study of shared-memory and message-passing implementations of
an application that solves the triangle puzzle and runs on actual hardware: only the communication interfaces used by the implementations vary; all other system components
(e.g., compiler, processor, cache, interconnection network) remain fixed. We show when
and why the message-passing version outperforms a shared-memory version and vice
versa. Additional overhead due to cache coherence actions and decoupling synchronization and data transfer can make a shared-memory implementation perform worse than the
message-passing implementation (our message-passing implementation performs up to
52% better than the best shared-memory implementation that uses both synchronization
and data transfer). A shared-memory version can outperform the message-passing implementation (by up to 14% for our application) under low contention because shared memory offers low-overhead data access.
Our implementations run on the MIT Alewife multiprocessor [2]. The message-passing implementation was ported from a message-passing implementation that runs on
Thinking Machines' CM-5 family of multicomputers [25]. The original CM-5 implementation written by Kirk Johnson won first place in an Internet newsgroup contest [14], the
goal of which was to solve the triangle puzzle in the shortest time.
Alewife efficiently supports both message-passing and cache-coherent shared-memory
programming models in hardware. In fact, Alewife is the only existing machine of its
class, and is thus a unique platform on which to compare shared-memory and message-passing implementations. Previous research comparing shared-memory and message-passing implementations of the same application has resorted to either using the same
machine to run different simulators [8] or using different machines to run different simulators [19]. The Stanford FLASH multiprocessor [17] also efficiently supports these two
programming models, but has yet to be built.
We chose the triangle puzzle because it is simple enough to understand and solve, yet
exhibits many of the characteristics of complex tree-search problems. For example,
Figure 1.1. Initial placement of pegs in the triangle puzzle.
exploiting the symmetry of the playing board is used both in our implementations and in
game-search problems, such as chess. In addition, solving this problem involves many
parallel programming issues, such as load distribution, data management, and synchronization. Finally, our application is fine-grained and irregular. By "fine-grained," we mean
that the time spent computing between shared-memory accesses or communication events
is short. Although our study is performed using a single fine-grained application, we
expect that our conclusions will hold for other fine-grained applications. Our application
is irregular because its data structures evolve dynamically as the application executes.
The triangle puzzle consists of a triangular board with fifteen holes, each of which may
contain a peg. At the start of the puzzle, every hole except the middle hole of the third row
contains a peg (see Figure 1.1). A move consists of removing a peg by jumping over it
with another peg as in checkers. A solution to the puzzle is a sequence of moves that
leaves only one peg on the board. Counting the number of distinct solutions to the triangle
puzzle is the goal of the triangle puzzle search problem [15]. Because we must find all
solutions, solving this search problem involves an exhaustive search. This search problem
can be extended for puzzles with boards that have more than five holes on each side. We
refer to the number of holes per side as the problem size.
We solved problem sizes 5, 6, and 7 on a 32-node Alewife machine. Problem size 5 has 1,550 solutions. Problem size 6 has 29,235,690,234 solutions. Problem size 7 has
zero solutions. Gittinger, Chikayama, and Kumon [13] show that there are zero solutions
for problem sizes 3N+1 (N > 2). Solving problem size 8 is expected to require several
gigabytes of memory [14]; to the best of our knowledge, no one has solved it yet.
The rest of this thesis is organized as follows. Chapter 2 provides background on the
architectural, programming, and performance issues when using shared-memory and message-passing machines. Chapter 3 describes how we solve the triangle puzzle using parallel tree search. Chapter 4 describes and compares the performance of our shared-memory
and message-passing implementations, describes an implementation that uses both shared
memory and message passing, and closes by presenting the speedups of the message-passing implementation and the best shared-memory implementation. Chapter 5 presents
related work. Finally, Chapter 6 concludes.
Chapter 2
Background
This chapter presents background material on the architectural, programming, and performance issues of shared-memory and message-passing machines.
2.1 Architectural Issues
Shared memory and message passing are two interprocessor communication mechanisms. Multiple processors can share memory (i.e., use a single address space) in two
ways. First, all processors can access a single memory bank by sending messages across a
network, as shown in Figure 2.1. This organization is not scalable because the single
shared-memory bank becomes a bottleneck when the number of processors becomes
large. Second, all processors can access multiple memory banks in a "dance hall" fashion,
as shown in Figure 2.2. Machines with this "dance hall" organization solve the bottleneck
problem, but can be improved by tightly coupling each processor with a memory bank, as
shown in Figure 2.3. Machines with this tightly coupled processor-memory organization
are called distributed-shared-memory machines, and are better than those with the "dance
hall" organization because the wires between processors and memory banks are shorter.
Thus, distributed-shared-memory machines cost less, and a given processor takes less
time to access its tightly coupled memory bank.
Figure 2.1. Shared-memory machine with one memory bank. P = processor, M = memory bank.

Figure 2.2. Machine with "dance hall" organization. P = processor, M = memory bank.

Figure 2.3. Machine with processors tightly coupled with memory banks. P = processor, M = memory bank.
Caching also helps reduce memory access time. If the tightly coupled processor-memory organization is used to implement shared memory, and each processor has a cache,
then every memory reference to address A must perform the following check: if the data
with address A is currently cached, then load it from the cache. Otherwise, if A is a local
address, then load the data from the current processor's memory bank. If none of these
conditions are true, then load the data from remote memory and perform cache coherence
actions. To minimize the time taken by this procedure of checks, it is implemented with
specialized hardware. By using this hardware support, shared-memory accesses can be
performed using a single instruction [16].
To illustrate cache coherence actions, we describe actions that a directory-based cache
coherence protocol executes when a processor references remote data that is not present in
its cache. In a directory-based cache coherence protocol, a section of memory, called the
directory, stores the locations and state of cached copies of each data block. To invalidate
cached copies, the memory system sends messages to each cache that has a copy of the
data. In order to perform a remote read, a processor must send a request to the remote processor, which then responds with the requested data. In addition, the data must be placed
in the requesting processor's cache, and the directory of cached copies in the remote processor must be updated. In order to perform a remote write, a processor sends a request to
the remote processor, which then must invalidate all cached copies of the data.
Some message-passing machines organize processors, memory banks, and the network as shown in Figure 2.3. However, instead of having a single address space for all
processors, each processor has its own address space for its local memory bank. To access
a remote processor's memory bank, a processor must explicitly request data by sending a
message to the remote processor, which then responds with a message containing the
requested data. (Of course, a shared-memory programming model (i.e., a single address
space) can be implemented using this message-passing architecture; messages are used to
keep data values consistent across all processors [5, 6].) Thus, the check described in the
previous paragraph is not performed by hardware in message-passing machines. This is
the main difference between shared-memory and message-passing machines. If this check
is to be performed, it must be implemented entirely in software using multiple instructions.
2.2 Programming Issues
Because shared-memory machines offer programmers a single address space whereas
message-passing machines offer multiple address spaces, shared-memory machines are
conceptually easier to program. For a shared-memory machine with multiple distributed
shared-memory banks, the shared-memory programming model makes programming easier because it frees the programmer from having to manage multiple memory banks and it
hides message passing from the programmer. If a programmer uses message passing, the
programmer is in charge of orchestrating all communication events through explicit sends
and receives. This task can be difficult when communication is complex.
For shared-memory machines, synchronization (e.g., locks and barriers) is usually
provided by various functions in a library. Data transfer is accomplished by memory loads
and stores. On the other hand, when using message passing, synchronization and data
transfer can be coupled. When one processor transfers data to another processor by sending the data in a message, the receiving processor can process the message by atomically
executing a block of code, as in active messages [26]. Because this processing is performed atomically, this code can be thought of as a critical section; it is as if a lock were
acquired, the code executed, and the lock released.
In the following section, we discuss the performance implications of this coupling of
synchronization and data transfer in message passing versus their being decoupled in
shared memory. We assume that all synchronization and data transfers are achieved by
using simple library calls.
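The following C sketch makes this distinction concrete for a simple shared counter. It is only an illustration of the ideas above, not code from our implementations: the spinlock stands in for a generic lock library routine, and am_add_handler stands in for the handler of an active message whose payload carries the update.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag lock;     /* initialized with ATOMIC_FLAG_INIT */
        long        value;
    } shared_counter_t;

    /* Shared-memory style: synchronization (the lock round trips) and the
     * data transfer (the load/store on value) are separate operations. */
    void sm_add(shared_counter_t *c, long delta)
    {
        while (atomic_flag_test_and_set(&c->lock))
            ;                          /* acquire: one round trip */
        c->value += delta;             /* data transfer */
        atomic_flag_clear(&c->lock);   /* release: another round trip */
    }

    /* Message-passing style: the sender places delta in the message itself,
     * and this handler runs atomically on the receiving processor, as with
     * active messages, so the update needs no explicit lock. */
    typedef struct { long delta; } add_msg_t;

    void am_add_handler(shared_counter_t *c, const add_msg_t *m)
    {
        c->value += m->delta;   /* synchronization and data arrive together */
    }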
2.3 Performance Issues
There are certain scenarios in which shared memory can perform worse than message
passing [16]. First, if the grain size of shared data is larger than a cache line, shared memory may perform worse because a data transfer requires multiple coherence actions, which
demand more network bandwidth and increase the latency of the data transfer. It would be
more efficient to send the data using a single message. Even if prefetching were used,
shared memory can still perform worse, partly because issuing load instructions to perform prefetching incurs overhead, and the prefetched data consumes network bandwidth.
Still, the main reason shared memory, even with prefetching, performs worse than message passing when the grain size of shared data is larger than a cache line is that the overhead of cache coherence actions increases the demand on network bandwidth and
increases the latency of data transfer. In addition, if data is immediately consumed after it
is obtained and not re-used, the time used performing cache coherence actions is not well
spent; the main idea behind caching is to reduce latency by exploiting the fact that if data
is accessed once, it is likely to be referenced again.
Second, shared memory can perform worse when communication patterns are known
and regular. The main reason is that shared-memory accesses require two messages: one
message from the requesting processor to the shared-memory bank, and one from the
memory bank back to the requesting processor. On the other hand, message passing is
inherently one-way: a message is sent from one point to another. In addition, in many
cache coherence protocols, for a processor to acquire a cache line that is dirty in another
processor's cache, the data must be transferred through a home or intermediate node
instead of being communicated directly to the requester. For example, in a protocol that
keeps the status of a cached data block with a home node, the home node must update the
status of the data block.
Lastly, shared memory can perform worse than message passing in combining synchronization and data transfer, assuming that sending a message requires very little overhead. As described in the previous section, message-passing machines can bundle
synchronization with data transfer by using mechanisms such as active messages, whereas
shared-memory machines decouple the two operations. By sending a one-way, point-to-point message, a message-passing machine can effectively acquire a lock, transfer data,
and release the lock. On the other hand, in shared memory, this procedure requires the following steps. One round trip is required to obtain the lock: a processor sends a request
message to the memory bank where the lock is located, the lock is obtained, and a
response message is sent back to the requester. Then, data transfer takes place. Finally,
one round trip, analogous to acquiring the lock, is required to release the lock. Because
message passing bundles synchronization and data transfer in a one-way, point-to-point
message, it can perform better than shared memory, which requires multiple round-trip
messages to achieve the same functionality.
In summary, shared memory can perform worse than message passing under three
conditions: (1) when the grain size of shared data is larger than a cache line, (2) when
communication patterns are known and regular, and (3) when combining synchronization
and data transfer. When grain size of shared data is small and communication patterns are
irregular, shared memory can perform better than message passing.
Chapter 3
Solving the Triangle Puzzle
This chapter describes using tree search to solve the triangle puzzle, using a transposition table to eliminate redundant work during tree search, exploiting symmetry to optimize
the search process, and using parallel breadth-first search.
3.1 Structure of the Solution
The generic solution to the triangle puzzle constructs a search tree. Each node in the
search tree represents some placement of pegs into holes that can be derived from the initial placement of pegs through a sequence of moves. Any such placement is called a position. We represent a position as a bit vector, with each bit corresponding to whether a peg
is present in a particular hole. An edge from node A to node B indicates that the position
represented by B can be obtained by making a single move from the position represented
by node A. We call the position represented by node B an extension of the position represented by node A. Thus, the root of the search tree represents the initial position, and
leaves of the tree represent positions from which no more moves can be made. For an initial position with P pegs, a sequence of P-1 moves is required to arrive at a solution to the
triangle puzzle, since each move removes exactly one peg. Therefore, a path from the root
to any leaf at a depth of P-1 in the search tree is a solution. We call such leaves valid
leaves. All other leaves at depths less than P-1 are invalid leaves.
Positions are stored in a transposition table so that extensions of positions that have
already been explored are not explored again. When an extension of a position is found in
the transposition table, the subtree generated from this extension does not need to be
explored again, and we can instead join the extension and the position in the transposition
table (this is also known as folding or fusing positions) [1, 23]. In the triangle puzzle,
because the number of joined positions is large, the size of the search tree is greatly
reduced by using a transposition table. For problem sizes 5, 6, and 7, 66%, 83%, and 90%
of all positions explored are joined, respectively. Because joining occurs, what we refer to
as the search tree is actually a directed acyclic graph (dag).
The transposition table is implemented using a hash table. Similar to an element of an
array, a slot of the transposition table holds a set of positions that map to that slot under a
Figure 3.1. Axes of symmetry of the triangle puzzle board.
hash function. If two positions map to the same slot, the positions are chained together in
the slot.
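The following C sketch illustrates this organization, assuming positions are encoded as bit vectors in machine words. The table size, hash function, and names are illustrative; they are not taken from our implementations.

    #include <stdlib.h>

    #define TABLE_SLOTS 4096                 /* illustrative size */

    typedef unsigned int pos_bits_t;         /* one bit per hole on the board */

    typedef struct entry {
        pos_bits_t    bits;                  /* canonical position */
        struct entry *chain;                 /* next position in the same slot */
    } entry_t;

    static entry_t *table[TABLE_SLOTS];

    static unsigned hash_position(pos_bits_t bits)
    {
        return (bits * 2654435761u) % TABLE_SLOTS;   /* any reasonable hash */
    }

    /* Returns 1 if the position was newly inserted (its subtree must still be
     * explored), and 0 if it was already present (the extension is joined
     * with the existing node instead of being explored again). */
    int table_lookup_or_insert(pos_bits_t bits)
    {
        unsigned slot = hash_position(bits);
        for (entry_t *e = table[slot]; e != NULL; e = e->chain)
            if (e->bits == bits)
                return 0;                    /* already explored: join here */

        entry_t *e = malloc(sizeof *e);      /* new position: chain it in */
        e->bits  = bits;
        e->chain = table[slot];
        table[slot] = e;
        return 1;
    }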
To further reduce the number of positions explored, our implementations also exploit
the symmetry of the triangle puzzle board. Since the board is an equilateral triangle, there
are three axes of symmetry, as shown in Figure 3.1. Thus, up to six positions can be represented in each node of the search dag by performing the appropriate reflections. If only
one reflection about one axis of symmetry is performed, each node in the search dag represents two positions (not necessarily distinct). A node can be uniquely identified by choosing one of the positions as the canonical position. (We interchangeably refer to a node and
a canonical position as a position.) Since a position in our triangle puzzle implementations is represented as a bit vector, we arbitrarily choose the position that lexicographically
precedes the other as the canonical position. For problem sizes 5, 6, and 7, if each node
represents two positions, the number of positions explored is nearly halved, as shown in
Table 3.1.
If each node in the search dag represents six positions, all six positions can be obtained
by making a sequence of five reflections about two axes of symmetry, by alternating
between the axes on each reflection in the sequence. For problem sizes 5 and 6, if each
node represents six positions, the number of positions explored is reduced by nearly a factor of two (exactly the same factor as when each node represents two positions), as shown
in Table 3.1. However, doing the same for problem size 7 cuts the number of positions
explored by a factor of almost six [4].
Clearly, there is a time-space trade-off when exploiting symmetry: if the search procedure exploits symmetry, extra computation is required to perform the reflections, but the
search procedure saves space and computation by reducing the size of the search dag and
avoiding computing positions already explored, respectively.
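The following C sketch illustrates how a canonical position can be chosen for problem size 5 using one reflection. The hole numbering (row by row, left to right), the resulting permutation, and the use of integer comparison for the lexicographic test are an illustrative convention; the actual encoding in our implementations may differ. The same idea extends to the six-way symmetry used for problem size 7 by applying additional reflections.

    typedef unsigned int pos_bits_t;   /* bit i set <=> hole i holds a peg */

    /* REFLECT[i] is the hole that hole i maps to when the board is mirrored
     * about its vertical axis (each row is reversed). */
    static const int REFLECT[15] = {
         0,                   /* row 1 */
         2,  1,               /* row 2 */
         5,  4,  3,           /* row 3 */
         9,  8,  7,  6,       /* row 4 */
        14, 13, 12, 11, 10    /* row 5 */
    };

    static pos_bits_t reflect(pos_bits_t p)
    {
        pos_bits_t r = 0;
        for (int i = 0; i < 15; i++)
            if (p & (1u << i))
                r |= 1u << REFLECT[i];
        return r;
    }

    /* The canonical position is the lexicographically smaller of a position
     * and its mirror image, so both map to the same node of the search dag. */
    pos_bits_t canonical(pos_bits_t p)
    {
        pos_bits_t r = reflect(p);
        return (r < p) ? r : p;
    }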
Problem Size    Symmetry (number of positions per node)    Number of Positions Explored    Reduction Factor Compared to 1 Position Per Node
5               1                                          4,847                           1.000
5               2                                          2,463                           1.968
5               6                                          2,463                           1.968
6               1                                          1,373,269                       1.000
6               2                                          688,349                         1.995
6               6                                          688,349                         1.995
7               1                                          304,262,129                     1.000
7               2                                          152,182,277                     1.999
7               6                                          53,158,132                      5.724

Table 3.1. Relationship between symmetry and number of positions explored.
In addition to using a transposition table and exploiting symmetry, other algorithmic
optimizations may be performed during tree search. These include compression [1], evicting positions from the transposition table when a threshold is reached [4, 18], pre-loading
the transposition table [1], pruning heuristics besides exploiting symmetry [10, 21], and
move ordering [18]. We did not explore the first two optimizations, and we do not employ
the remaining ones. We cannot pre-load the transposition table because information
obtained at one point in the search gives no information about later stages of the search.
Finally, because the triangle puzzle search problem requires finding all solutions and
hence requires an exhaustive search, we do not use any pruning heuristics other than
exploiting symmetry, and we do not perform move ordering. Move ordering helps when
different moves have different effects on the outcome (e.g., a score) of a puzzle or game.
In the triangle puzzle, all sequences of moves are equal in value.
3.2 Parallel Breadth-First Search versus Parallel Depth-First Search
Parallel breadth-first search (PBFS) executes faster and uses less memory than parallel
depth-first search (PDFS) when used to solve the triangle puzzle. Intuitively, PBFS is
more appropriate for the triangle puzzle than PDFS because PBFS inherently performs an
exhaustive search to find all solutions. On the other hand, if we had a search problem
whose objective was to find any solution, PDFS would probably be more suitable. The
performance difference is a result of how the total number of solutions is determined. We
keep a solution counter with each node that stores the number of paths from the root that
can reach this node. PBFS sets the counter in the root of the search dag to "1," then each
node at the next level sums the values of the counters of its parents. The sum of the values
of the counters of valid leaves is the number of solutions to the triangle puzzle.
In contrast, in PDFS, since different processors can be working concurrently on different levels in the dag, PDFS cannot count solutions in the same top-down manner as in
PBFS. The search procedure has two options: it can either "push" counter values down
complete subdags in a manner similar to counting solutions in PBFS, or it can count solutions in a bottom-up fashion after the entire dag is formed. With the first option, the search
procedure can traverse portions of the dag multiple times. For example, suppose one processor is exploring a subtree. Now suppose that another processor generates an extension
that is represented by the root of this subtree. Then the search procedure must "push" the
counter value of this extension down the entire subtree in order to update the number of
paths that can reach all nodes in the subtree. This process is very costly because of the
high number of joins, as noted in Chapter 3.1.
To count solutions in a bottom-up fashion in PDFS, the search procedure sets the
counters of valid leaves to "1," and the counters of invalid leaves to "0." Starting at the
bottom of the search dag and working upwards, nodes at each level of the dag then sum
the values of their children's counters. The value of the counter at the root of the dag is the
number of solutions. Because PDFS that counts solutions in a top-down fashion can
traverse portions of the dag multiple times, whereas PDFS that counts solutions in a bottom-up fashion traverses the dag only twice, the latter performs better. Henceforth,
"PDFS" refers to the latter version.
PBFS executes faster than PDFS because PBFS traverses the search dag once, whereas
PDFS traverses it twice: once to construct the dag, and a second time to count solutions.
In addition, detecting when the dag is completely formed in PDFS incurs additional overhead.
PBFS also requires less memory than PDFS. Since PBFS counts solutions in a top-down fashion, it needs to remember at most only those positions in two successive levels
of the dag.1 On the other hand, because PDFS counts solutions in a bottom-up fashion, it
needs to remember all positions in the dag. Thus, PBFS will require less memory than
PDFS to store positions in the transposition table, and it will require progressively less
1. This is a loose upper bound because once all extensions have been generated from a given position, the position can be discarded.
Problem Size    Total Number of Positions Explored    Largest Number of Positions at Two Successive Levels of the Search Dag
5               2,463                                 356
6               688,349                               44,219
7               53,158,132                            1,815,907

Table 3.2. Position statistics.
memory than PDFS as the problem size increases. This point is illustrated in Table 3.2,
which shows the total number of positions explored and the largest number of positions at
two successive levels for problem sizes 5, 6, and 7, when symmetry is exploited.
We developed an implementation of PDFS that counts solutions in a bottom-up fashion and runs on Thinking Machines' CM-5 family of multicomputers. The performance
of this PDFS implementation for problem size 5 and 64 processors (.046 seconds, averaged over ten runs) is a factor of five worse than that of the CM-5 PBFS implementation
(.009 seconds, averaged over ten runs), thus supporting the argument that PBFS is better
than PDFS for the triangle puzzle. We expect that this is also true for larger problem sizes
because the size of the search dag significantly increases as problem size increases.
Therefore, we only implemented PBFS on Alewife.
3.3 Implementation of PBFS
PBFS uses four data structures: a transposition table, a current queue of positions to
extend at the current level in the search dag, a next queue of positions to extend at the next
level, and a pool of positions (see Figure 3.2). We refer to the current and next queues collectively as work queues. The pool of positions is a chunk of memory used to store positions; to keep the counters associated with positions consistent, the transposition table and
the work queues use pointers to refer to positions in this pool.
PBFS manipulates these data structures in a number of rounds equal to the number of
levels in a complete search dag. Before the first round begins, the initial position is placed
in the current queue. Each round then proceeds as follows. Processors repeatedly remove
positions from the current queue and generate all possible extensions of each position. If
an extension is not already in the transposition table, it is placed in both the transposition
table and the next queue, with the solution counter set to the value of the counter of the
Figure 3.2. Four data structures used in PBFS. Solid arrows indicate the path travelled
by extensions. Dashed arrows indicate that the current queue, next queue, and the
transposition table use pointers to refer to positions in the pool of positions.
position from which this extension was generated. If an extension is already in the transposition table, it is joined with the position in the table by increasing the solution counter
of the position in the table by the value of the counter of the extension, and the extension is
not placed in the next queue. After all positions in the current queue have been extended,
the round ends. Positions in the next queue are placed in the current queue, and the transposition table is cleared. Then the next round starts. After the last round is completed, the
sum of the values of the solution counters associated with positions in the current queue is
the number of solutions.
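The following sequential C sketch summarizes one round; the parallel implementations described in Chapter 4 follow the same structure but operate on distributed or private versions of these data structures. generate_extensions is a placeholder stub, and table_insert_or_join is assumed to behave like the transposition-table sketch in Chapter 3.1, additionally accumulating solution counters when positions are joined.

    #include <stdlib.h>

    typedef unsigned int pos_bits_t;

    typedef struct work_item {
        pos_bits_t        bits;
        unsigned long     count;      /* number of paths from the root */
        struct work_item *next;
    } work_item_t;

    static work_item_t *current_queue, *next_queue;

    /* Placeholder stub: the real code fills `out` with every position
     * reachable from p by one legal jump; 32 is a generous bound. */
    static int generate_extensions(pos_bits_t p, pos_bits_t out[32])
    {
        (void)p; (void)out;
        return 0;
    }

    /* Assumed helper: like table_lookup_or_insert in the earlier sketch,
     * but when the position is already present its solution counter is
     * increased by `count` (a join).  Returns nonzero only for new positions. */
    extern int table_insert_or_join(pos_bits_t bits, unsigned long count);

    static void push(work_item_t **q, pos_bits_t bits, unsigned long count)
    {
        work_item_t *w = malloc(sizeof *w);
        w->bits = bits; w->count = count; w->next = *q; *q = w;
    }

    void pbfs_round(void)
    {
        pos_bits_t ext[32];
        while (current_queue != NULL) {
            work_item_t *w = current_queue;        /* remove a position */
            current_queue = w->next;
            int n = generate_extensions(w->bits, ext);
            for (int i = 0; i < n; i++) {
                /* New extensions inherit the parent's counter and go on the
                 * next queue; known extensions are joined in the table. */
                if (table_insert_or_join(ext[i], w->count))
                    push(&next_queue, ext[i], w->count);
            }
            free(w);
        }
        /* End of the round: the next queue becomes the current queue and
         * the transposition table is cleared (not shown). */
        current_queue = next_queue;
        next_queue = NULL;
    }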
Chapter 4
Implementations and Performance Results
This chapter presents hardware details of Alewife, describes our shared-memory and
message-passing implementations, and compares their performance. The shared-memory
implementations explore distributing the transposition table and privatizing the work
queues, distributing load, and prefetching. After comparing the shared-memory implementations with a message-passing one, we show that a hybrid shared-memory and message-passing implementation is not suitable for this application. We close by comparing
the speedups of the message-passing implementation and the best shared-memory implementation.
Most of the results we present are for problem size 6. The reason we show results
mostly for problem size 6 is that this problem size is large enough to produce significant
changes in execution time as we vary the number of processors, yet not so large that it
takes an excessive amount of time to solve. For example, the best parallel implementation
(our message-passing implementation) takes about 200 times longer to solve problem
size 7 than to solve problem size 6 on 32 processors. In addition, at least 32 processors are
needed to solve problem size 7 because of memory requirements. Thus, for problem
size 7, it would be impossible to draw conclusions about how execution time changes as
the number of processors varies since the present maximum machine size is 32 processors.
The execution time of each implementation is averaged over ten trials. The standard deviation for all experiments is less than .03 seconds (worst case), and is usually much less (.0028 seconds average for all problem size 6 experiments and .000076 seconds average for problem size 5). The execution time of each trial is measured by using a cycle counter. Because setup times vary among different implementations, we do not include
setup time in our measurements, unless otherwise noted. To supplement the graphical presentation of the performance of each implementation, Appendix A numerically presents
the execution times and standard deviations of all implementations. Appendix A also
shows the number of positions explored and the number of joined positions per round for
problem sizes 5, 6, and 7.
As a final note, the total memory space required by the transposition table, summed
over all processors, is held constant for all shared-memory and message-passing implementations for a given problem size (i.e., the number of slots and the maximum number of positions a slot holds are constant for a given problem size). In addition, the pool of
positions is equally partitioned among all processors for all shared-memory and message-passing implementations.

Figure 4.1. The Alewife architecture.
4.1 Alewife Hardware Details
The results presented in the following sections were obtained on a 32-node Alewife
machine at the MIT Laboratory for Computer Science. Alewife is a distributed-shared-memory, cache-coherent multiprocessor. Each processor is tightly coupled with a memory
bank. Each node contains its own memory bank, part of which is used as a portion of the
single shared address space. Each node consists of a Sparcle processor [3] clocked at
20MHz, a 64KB direct-mapped cache with 16-byte cache lines, a communications and
memory management unit (CMMU), a floating-point coprocessor, an Elko-series mesh
routing chip (EMRC) from Caltech, and 8MB of memory [2]. The EMRCs implement a
direct network [24] with a two-dimensional mesh topology using wormhole routing [11].
A mesh-connected SCSI disk array provides I/O. Figure 4.1 shows the Alewife architecture. The CMMU implements shared-memory and message-passing communication
interfaces, which will be described next.
4.1.1 Shared-Memory Interface
Shared memory is accessed by loads and stores. Caches store recently accessed shared
data. These caches are kept coherent by a directory-based protocol called LimitLESS [7].
Directory-based protocols were described in Chapter 2.1. LimitLESS implements the
directory in both hardware and software. Each processor keeps six hardware pointers to
cached data blocks: one pointer consists of a single bit to indicate whether a word is
cached at the local processor and five 32-bit pointers refer to remote processors' caches.
When more pointers are required, a processor is interrupted for software emulation of a
larger directory. We refer to this emulation as "time spent executing LimitLESS software." Because there are six hardware pointers, no time is spent executing LimitLESS
software with six and fewer processors. Of the 8MB of memory per node, 4MB are used
as a piece of global shared memory, 2MB are used for cache coherence directories, 1MB
is user-accessible private memory, and 1MB is used by the operating system. A clean read
miss to a processor's local shared memory bank is satisfied in about 13 cycles, while a
clean read miss to a neighboring processor's memory bank takes about 38 cycles. Cached
32-bit shared-memory loads and stores require two and three cycles, respectively.
4.1.2 Message-Passing Interface
Besides providing a shared-memory interface, Alewife provides a message-passing
interface, which allows programmers to bypass shared-memory hardware and access the
communication network directly. This organization is illustrated in Figure 4.2. When a
processor sends a message to another processor, the receiving processor is interrupted and
then processes the message. This message-passing model provides the functionality of
active messages [26]. The end-to-end latency of an active message, in which delivery
interrupts the receiving processor, is just over 100 cycles. Since the version of the Alewife
software system used in this case study does not offer polling (later versions are expected
to), polling was not used. For further details on how message passing is implemented, the
reader is referred to the papers by Agarwal et al. [2] and Kranz et al. [16].
Figure 4.2. Alewife's shared-memory and message-passing interfaces.
4.2 Shared-Memory Implementations
We developed several shared-memory implementations of PBFS, all of which use
Mellor-Crummey-Scott queue locks [20]. Our baseline shared-memory implementation
(SM) takes the simplest approach by putting the transposition table and the work queues in
the shared-memory bank of one processor. To extend a position in the current queue, a
processor acquires a lock on the current queue, extracts a position, releases the lock, and
computes the position's extension. Similarly, to add an extension onto the next queue, a
processor acquires a lock on the next queue, adds the extension, then releases the lock.
Because the SM implementation centralizes data structures, it performs poorly as the number of processors increases for problem size 6, as shown in Figure 4.3. At 16 and 32 processors, the execution times are about the same.
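The following C sketch illustrates this locking discipline. The spinlock is only a stand-in for the Mellor-Crummey-Scott queue locks actually used, and the queue layout is illustrative.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef unsigned int pos_bits_t;

    typedef struct node {
        pos_bits_t    bits;
        unsigned long count;
        struct node  *next;
    } node_t;

    typedef struct {
        atomic_flag lock;             /* initialized with ATOMIC_FLAG_INIT */
        node_t     *head;
    } locked_queue_t;

    static void acquire(atomic_flag *l) { while (atomic_flag_test_and_set(l)) ; }
    static void release(atomic_flag *l) { atomic_flag_clear(l); }

    /* Remove one position from the shared current queue; contended when
     * many processors poll the single queue. */
    node_t *sm_dequeue(locked_queue_t *q)
    {
        acquire(&q->lock);
        node_t *n = q->head;
        if (n != NULL)
            q->head = n->next;
        release(&q->lock);
        return n;                     /* extensions are computed outside the lock */
    }

    /* Add one extension to the shared next queue. */
    void sm_enqueue(locked_queue_t *q, node_t *n)
    {
        acquire(&q->lock);
        n->next = q->head;
        q->head = n;
        release(&q->lock);
    }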
To attempt to improve the performance of the SM implementation, variations distribute the transposition table across processors' shared-memory banks and privatize the work
queues (giving each processor its own set of work queues), distribute load, and prefetch
positions. Unless otherwise noted, in all shared-memory implementations, there is one
lock per slot of the transposition table, and each processor owns an equal portion of a
shared pool of positions. Mutually exclusive access to a particular position is provided by
the locks on the transposition table and the work queues. The presence of locks on the four main data structures described in Chapter 3.3 and the placement of these data structures will be described on a case-by-case basis.

Figure 4.3. Performance of the SM implementation, problem size 6.
4.2.1 Distributed Transposition Table and Privatized Work Queues
The DTABQ ("D" for distributed, "TAB" for transposition table, and "Q" for work
queues) implementation attempts to improve the performance of the SM implementation
by distributing the transposition table across processors' memory banks and by giving
each processor private work queues that are stored in private memory that is local to that
processor. Distributing the transposition table means that the slots are interleaved across
processors' shared-memory banks, as shown in Figure 4.4. Privatizing the work queues
means that each processor has its own current queue in private memory and next queue in
its shared-memory bank. There are no locks on the current queues, but the locks on the
next queues remain. To distribute work, when a processor generates an extension of a
position, it first determines if the extension exists in the transposition table. If it does not,
the processor computes a destination processor by hashing the bit representation of the
extension. Then, by using a shared-memory data transfer, the processor places the extension in the destination processor's next queue. Figure 4.5 shows that the DTABQ implementation performs better than the SM implementation for problem size 6 and for all
numbers of processors. The largest improvement, 46%, occurs at 32 processors.
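The following C sketch illustrates the distribution step, with plain arrays standing in for data that would live in distributed shared memory; the processor count, queue capacity, hash function, and names are illustrative.

    #include <stdatomic.h>

    #define NPROCS 32
    #define QCAP   4096                      /* illustrative capacity */

    typedef unsigned int pos_bits_t;

    typedef struct {
        atomic_flag   lock;                  /* ATOMIC_FLAG_INIT at start-up */
        unsigned      n;
        pos_bits_t    bits[QCAP];
        unsigned long count[QCAP];
    } next_queue_t;

    /* One next queue per processor; in the real implementation these live in
     * the owning processors' shared-memory banks. */
    static next_queue_t next_queues[NPROCS];

    static int owner_of(pos_bits_t bits)
    {
        return (int)((bits * 2654435761u) % NPROCS);
    }

    /* Called after the caller has checked the shared transposition table and
     * found the extension to be new. */
    void dtabq_distribute(pos_bits_t ext, unsigned long count)
    {
        next_queue_t *q = &next_queues[owner_of(ext)];
        while (atomic_flag_test_and_set(&q->lock))   /* lock on the remote queue */
            ;
        q->bits[q->n]  = ext;                        /* shared-memory data transfer */
        q->count[q->n] = count;
        q->n++;
        atomic_flag_clear(&q->lock);
    }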
Figure 4.4. Distributed transposition table and privatized work queues in the DTABQ
implementation with two processors. Slots of the transposition table are interleaved
across processors' shared-memory banks, with one lock per slot. Each processor has its
own current queue in private memory, and has its next queue in shared memory. Each
processor has an equal portion of the shared pool of positions.
Figure 4.5. Performance of SM and DTABQ implementations, problem size 6. Percentages show how much better the DTABQ implementation is.
4.2.2 Improved DTABQ Implementation
Because the DTABQ implementation shares the transposition table, checking whether
an extension resides in the transposition table can involve accesses to remote memory
banks. To make accesses to the transposition table local and to reduce the amount of synchronization, the DTABQ implementation can be improved by modifying it in the following manner. Each processor only accesses its own partition of the transposition table. In
addition, each processor owns a number of next queues equal to the number of processors.
Instead of checking whether an extension exists in the transposition table before placing
each extension in a next queue as in the DTABQ implementation, each processor unconditionally places each extension in a next queue in a destination processor's set of next
queues. The particular next queue in the set is determined by the number of the processor
that generates the extension. Since only one processor writes to any particular next queue,
no locks are required on these next queues. Call this phase in which extensions are generated and placed in processors' sets of next queues the computation phase. After all extensions have been generated, each processor then determines if extensions in its set of next
queues are in its partition of the transposition table. If an extension is already in the partition, it is joined with the position in the partition. Call this phase the checking phase.
Because extensions are unconditionally placed in other processors' sets of next queues,
accesses to the transposition table during the checking phase are local.2 In fact, because
each processor executes the checking phase on its own set of positions in its next queues
and on its own partition of the transposition table, no locks are required on the slots of the
table. Call this improved implementation the Improved DTABQ implementation.
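The following C sketch outlines the two phases. The per-processor sets of next queues are indexed by sending processor, so each sender writes only its own queue at the destination and no locks are needed; all names, capacities, and the helper local_insert_or_join are illustrative placeholders.

    #define NPROCS 32
    #define QCAP   (1 << 10)          /* illustrative per-queue capacity */

    typedef unsigned int pos_bits_t;

    typedef struct {
        unsigned      n;
        pos_bits_t    bits[QCAP];
        unsigned long count[QCAP];
    } queue_t;

    /* next_sets[dest][src]: queue owned by `dest`, written only by `src`. */
    static queue_t next_sets[NPROCS][NPROCS];

    static int owner_of(pos_bits_t bits) { return (int)(bits % NPROCS); }

    /* Computation phase, run by processor `me`: every extension is deposited
     * unconditionally, with no table lookup and no lock. */
    void deposit_extension(int me, pos_bits_t ext, unsigned long count)
    {
        queue_t *q = &next_sets[owner_of(ext)][me];
        q->bits[q->n]  = ext;
        q->count[q->n] = count;
        q->n++;
    }

    /* Assumed helper: like the earlier transposition-table sketch, but it
     * operates on processor `me`'s private partition, so every access is
     * local and lock-free. */
    extern int local_insert_or_join(int me, pos_bits_t bits, unsigned long count);

    /* Checking phase, run by processor `me` after all extensions exist. */
    void checking_phase(int me)
    {
        for (int src = 0; src < NPROCS; src++) {
            queue_t *q = &next_sets[me][src];
            for (unsigned i = 0; i < q->n; i++) {
                if (local_insert_or_join(me, q->bits[i], q->count[i])) {
                    /* New position: it seeds the next round's work (not shown). */
                }
            }
            q->n = 0;
        }
    }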
Figure 4.6 shows that the Improved DTABQ implementation performs better than the
DTABQ implementation for eight and more processors, but worse for fewer processors.
The reason is that the Improved DTABQ implementation unconditionally places extensions in next queues. Thus, it performs more enqueue operations to the next queues than
the DTABQ implementation does. For problem size 6, the DTABQ implementation performs 113,893 enqueue operations to next queues, whereas the Improved DTABQ implementation performs 688,349 such operations. As the number of processors increases,
however, the parallelism of these enqueue operations also increases so that with eight and
more processors, the Improved DTABQ implementation performs better. In addition,
2. Each slot consists of a counter of positions in the slot and a set of pointers to positions. Thus,
accesses to this counter and the pointers are local, not accesses to the bit representations of positions.
Figure 4.6. Comparison of DTABQ and Improved DTABQ implementations, problem size 6.
There is not enough memory in one processor to execute the Improved DTABQ implementation.
Implementation     Number of Processors    Time spent executing LimitLESS software, averaged over all processors (seconds)    Percent of Execution Time
DTABQ              8                       0.75                                                                                9%
DTABQ              16                      0.70                                                                                13%
DTABQ              32                      0.40                                                                                12%
Improved DTABQ     8                       0                                                                                   0%
Improved DTABQ     16                      0.0001                                                                              0.001%
Improved DTABQ     32                      0.0001                                                                              0.005%

Table 4.1. Time spent executing LimitLESS software for DTABQ and Improved DTABQ implementations, problem size 6.
because of remote accesses to the transposition table, the DTABQ implementation spends
more time executing LimitLESS software, as shown in Table 4.1.
Number of Processors    Percent of execution time spent idle, averaged over all processors (DTABQ)    Percent of execution time spent idle, averaged over all processors (Improved DTABQ)
2                       0.4%                                                                           0.7%
4                       1.3%                                                                           3.1%
8                       1.9%                                                                           4.2%
16                      3.5%                                                                           6.1%
32                      7.0%                                                                           9.7%

Table 4.2. The percentages of total execution time a processor on the average spends idle for the DTABQ and Improved DTABQ implementations. The percentage increases as the number of processors increases because, with a random distribution of the same number of positions among an increasing number of processors, positions are distributed less evenly.
4.2.3 Load Distribution
The DTABQ and Improved DTABQ implementations use a hash function to randomly
distribute positions among processors. This random load distribution policy distributes
work fairly evenly. As shown in Table 4.2, the percentage of total execution time a processor on the average spends idle is small. Thus, employing more sophisticated scheduling techniques, such as self-scheduling [22], is not warranted.
4.2.4 Prefetching
In our experiments with prefetching, the execution time of the Improved DTABQ
implementation can be improved at most 3.8% for problem size 6. We experimented with
prefetching data in three places. First, when generating extensions from positions in the
current queue, processors prefetch one position (its bit representation), its associated solution counter, and a pointer to the next position in the queue. Second, when scanning a slot
of the transposition table to check if an extension already exists, processors prefetch one
position and its solution counter. Lastly, when dequeuing positions from the next queues,
processors prefetch one position, its associated solution counter, and a pointer to the next
position in the queue. Because of the dynamic, fine-grained nature of the Improved
Number of Processors    Percentage Improvement of Prefetching over No Prefetching
2                       1.7%
4                       2.2%
8                       2.7%
16                      3.2%
32                      3.8%

Table 4.3. The percentage improvement of prefetching positions from the next queues over no prefetching in the Improved DTABQ implementation, problem size 6.
DTABQ implementation, we observed that prefetching positions from the current queue
and from slots in the transposition table generally performs worse than without prefetching. We believe that positions prefetched from the current queue and transposition table
are either evicted from direct-mapped caches before they are used or are already present in
the cache. However, prefetching positions from the next queues does improve performance. We believe the reason is that out of the three places in which prefetching is performed, only when positions are prefetched from the next queue do processors access
mostly uncached shared data. Recall that another processor most likely placed a given
position in the next queue and the processor accessing the position has not accessed it
before. On the other hand, a position in the current queue was accessed by the processor
during the checking phase. A position in a slot of the transposition table also can be
accessed multiple times during the checking phase, and thus can be cached. Thus, we
believe that the conditions just mentioned under which prefetching is ineffective usually
do not occur when prefetching positions from the next queues.
Table 4.3 shows the percentage improvement of prefetching positions from the next
queues over no prefetching. As the number of processors increases, the percentage
improvement increases because the benefit of prefetching magnifies as parallelism
increases. From now on, the Improved DTABQ implementation with prefetching refers to
the one that prefetches positions from the next queues.
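The next-queue prefetching can be pictured with the short sketch below. It is a simplified illustration, not the Alewife code: the node layout, the process callback, and the prefetch wrapper (built here on GCC's __builtin_prefetch) are assumptions made for the example.

    typedef unsigned long long position_t;

    typedef struct node {
        position_t   position;    /* bit representation of the position */
        unsigned int solutions;   /* associated solution counter        */
        struct node *next;        /* next position in the queue         */
    } node_t;

    /* Hypothetical wrapper around a non-binding software prefetch. */
    static inline void prefetch(const void *addr)
    {
    #if defined(__GNUC__)
        __builtin_prefetch(addr, 0 /* read */, 1 /* low temporal locality */);
    #else
        (void)addr;
    #endif
    }

    /* Dequeue positions from a next queue, prefetching one node ahead so that the
     * mostly uncached, remotely produced data is in flight while the current node
     * is being processed. */
    void drain_next_queue(node_t *head, void (*process)(node_t *))
    {
        for (node_t *n = head; n != NULL; n = n->next) {
            if (n->next != NULL)
                prefetch(n->next);   /* covers the position, its counter, and the
                                        next pointer in this simplified layout */
            process(n);
        }
    }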
4.3 Message-Passing Implementation
In the message-passing implementation of PBFS, each processor keeps in private
memory its own set of the four main data structures. Thus, the transposition table is not
shared; each processor accesses its own private partition of the transposition table. Each
round in PBFS proceeds as described in Chapter 3.3, except that processors send each
extension and the associated solution counter to another processor using one active message with a data payload of 16 bytes. As in the DTABQ implementation, the destination
processor is determined by hashing the bit representation of the generated extension.
Upon receiving a message, a processor handles the extension in the message in the fashion
described in Chapter 3.3. First, if the extension is not in the transposition table, it and its
associated solution counter are placed in both the transposition table and the next queue.
If the extension is already in the transposition table, it is joined with the existing position,
whose solution counter is increased by the value of the extension's solution counter. A
round ends by passing two barriers. First, a barrier is passed when all extensions have
been generated. Second, another barrier is passed when all extensions have been received
and processed.
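A rough C sketch of the receive-side logic follows. It mirrors the description above but is not the Alewife implementation: the helper names (table_lookup_or_insert, enqueue_next) and the exact layout of the 16-byte payload (an 8-byte position plus an 8-byte solution counter) are assumptions made for the illustration.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t position_t;

    typedef struct entry {
        position_t    position;
        uint64_t      solutions;   /* solution counter                      */
        struct entry *next;        /* chaining within a table slot or queue */
    } entry_t;

    /* Look the position up in this processor's private partition of the
     * transposition table; insert it if absent and report whether it was found.
     * (Assumed helper; details omitted.) */
    extern entry_t *table_lookup_or_insert(position_t p, bool *found);

    /* Append an entry to this processor's next queue.  (Assumed helper.) */
    extern void enqueue_next(entry_t *e);

    /* Handler run on receipt of one active message carrying an extension and its
     * associated solution counter. */
    void handle_extension(position_t extension, uint64_t solutions)
    {
        bool found;
        entry_t *e = table_lookup_or_insert(extension, &found);

        if (!found) {
            /* New position: record its counter; it is now in the transposition
             * table, and it must also be explored in the next round.           */
            e->solutions = solutions;
            enqueue_next(e);
        } else {
            /* Join: the position already exists, so only accumulate the number
             * of ways of reaching it.                                           */
            e->solutions += solutions;
        }
    }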
4.4 Comparisons of Shared-Memory and Message-Passing Implementations
Comparing the shared-memory implementations with the message-passing implementation yields the two conclusions of this case study. First, when using distributed shared
memory, performing cache coherence actions and decoupling synchronization and data
transfer can make a shared-memory implementation less efficient than the message-passing implementation. Second, shared memory offers low-overhead data access and can
perform better than message passing for applications that exhibit low contention.
4.4.1 When Message Passing Can Perform Better Than Shared Memory
Recall from Chapter 2 that performing cache coherence actions and decoupling of synchronization and communication can make a shared-memory implementation less efficient
than the message-passing one. To illustrate this conclusion, we compare the DTABQ and
message-passing implementations. They implement PBFS similarly, and the DTABQ
[Figure 4.7: execution time in seconds versus number of processors.]
Figure 4.7. Comparison of DTABQ and message-passing implementations, problem size 6. The percentages show how much better the message-passing implementation is than the DTABQ implementation.
implementation is the best shared-memory implementation that uses explicit synchronization. Although the Improved DTABQ implementation does not perform locking operations on any of the four main data structures, has local accesses to the transposition table,
and is thus better than the DTABQ implementation for eight and more processors (see
Figure 4.6), we defer comparing the Improved DTABQ and message-passing implementations to Chapter 4.4.3.
Figure 4.7 shows that the DTABQ implementation performs worse than the message-passing implementation for problem size 6 and for four and more processors. The message-passing implementation performs up to 52% better than the DTABQ implementation.
With one and two processors, the DTABQ implementation performs better. For example,
with one processor, the message-passing implementation executes 6.51 seconds longer
than the DTABQ implementation. The message-passing implementation for one and two
processors performs worse for three reasons. First, the message-passing implementation incurs the end-to-end latency of user-level active messages, whereas the DTABQ implementation does not. Based on the performance statistics given in Chapter 4.1, each active message requires about 100 more cycles than a cached shared-memory access. For one processor, since each position requires one active message, the message-passing implementation spends approximately (100 cycles/position)(688,349 positions)(1 second/20,000,000 cycles), or about 3.44 seconds more than the DTABQ implementation in performing data transfer for problem size 6. Second, the remaining 3.07 seconds of the 6.51-second difference between execution times is spent evaluating the values of data sent per message and copying the data from a message to local variables upon receiving the message. These first two overheads do not significantly affect the difference in execution times as the number of processors increases, because their effect decreases as the parallelism of sending messages and shared-memory accesses increases. Third, since there are fewer remote access possibilities for one and two processors than for four or more processors, we observe that the average cache hit ratio in the DTABQ implementation is higher for one and two processors (96.0% and 95.2%, respectively, versus 94.0%-94.6% for more processors), making the overhead of cache coherence actions lower. Since the difference between cache hit ratios is small, this third reason does not contribute significantly to the difference in execution times.
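Written out, the single-processor estimate above is the following unit calculation (assuming, as the text does, one active message per position and a 20,000,000-cycle second):

    \frac{100~\text{cycles}}{\text{position}}
      \times 688{,}349~\text{positions}
      \times \frac{1~\text{second}}{20{,}000{,}000~\text{cycles}}
      \approx 3.44~\text{seconds}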
4.4.2 When Shared Memory Can Perform Better Than Message Passing
By comparing the DTABQ and message-passing implementations, we saw that message passing can perform better than shared memory. By comparing the SM and message-passing implementations, we will observe that shared memory can perform better than message passing.

Two key insights explain why the SM implementation can perform better than the message-passing implementation. The first key insight is that a given position in
problem size 5 has on the average fewer possible moves than a given position in problem
size 6 because there are fewer holes and fewer initial pegs in the puzzle board. Table 4.4
illustrates this point by showing the maximum and average number of moves from any
given position during each round. Recall that all our implementations except the
Improved DTABQ implementation determine whether an extension resides in the transposition table immediately after the extension is generated. If an extension is not in the
transposition table, it is added to the table and to the next queue. Because a given position
in problem size 6 has on the average more possible moves than a given position in problem size 5, the rate of adding extensions to the next queue will be higher in problem size 6
than in problem size 5 for the SM implementation.
The second key insight is that problem size 6 has a higher average of joined positions
per round than problem size 5. In problem size 6, 65.4% of all positions generated per
round are joined positions. In contrast, 46.2% of all positions generated per round are
joined positions in problem size 5. (Table A.3 and Table A.4 show the average percentages per round.)

Round    Problem size 5      Problem size 5      Problem size 6      Problem size 6
         maximum moves       average moves       maximum moves       average moves
1        3                   3.0                 4                   4.0
2        5                   4.3                 9                   6.3
3        7                   4.5                 13                  7.2
4        7                   4.5                 15                  7.5
5        8                   4.6                 15                  8.0
6        8                   4.0                 15                  8.2
7        6                   3.3                 16                  8.1
8        6                   2.6                 16                  7.9
9        4                   1.9                 16                  7.5
10       3                   1.1                 14                  6.9
11       2                   0.5                 13                  6.2
12       1                   0.2                 12                  5.3
13       0                   0                   12                  4.3
14       -                   -                   11                  3.3
15       -                   -                   10                  2.4
16       -                   -                   10                  1.6
17       -                   -                   4                   0.7
18       -                   -                   2                   0.4
19       -                   -                   0                   0

Table 4.4. Maximum and average number of moves from any given position in a round in problem sizes 5 and 6.

As the transposition table fills up with positions, more positions are
joined, and the rate at which positions are added to the next queue slows down. Because
problem size 6 has a higher percentage of joined positions per round than problem size 5
and because in the SM implementation a processor fetches a new position from the current
queue if no more extensions can be generated from a given position, the rate of fetching
positions from the current queue is higher in problem size 6 than in problem size 5.
To illustrate that the rates at which positions are fetched from the current queue and added to the next queue are greater for problem size 6 than problem size 5, we compare the request rates of processors to be serviced by the work queues during the round with the highest number of generated extensions (round 8 for problem size 5 and round 11 for problem size 6) in the SM implementation. When a processor is serviced, it performs a series of operations, which we will refer to as one queue operation. A current queue operation consists of obtaining a lock, obtaining a position from the current queue, and releasing the lock. A next queue operation consists of obtaining a lock, adding a position to the next queue, and releasing the lock. Table 4.5 shows that the average request rate for problem size 6 is 1.67 to 2.77 times the rate for problem size 5, and the ratio of rates generally increases as the number of processors increases. Since a higher request rate implies greater contention, contention for the work queues is definitely higher in problem size 6 than in problem size 5. (Table A.6 and Table A.7 show the actual request rates and the standard deviations of the experiments; ten trials were executed.) We expect higher rates for essentially all rounds because both the percentage of joined positions per round and the maximum and average number of possible moves from a given position are higher in problem size 6 than problem size 5.

Number of     Problem size 6 average current queue      Problem size 6 average next queue
Processors    request rate divided by problem size 5    request rate divided by problem size 5
              request rate (SM)                         request rate (SM)
2             1.92                                      1.74
4             1.87                                      1.67
8             2.18                                      1.99
16            2.55                                      2.10
32            2.77                                      2.32

Table 4.5. Ratio of the average request rates for the work queues for problem size 6 to those for problem size 5 in the SM implementation during the round with the highest number of generated extensions. Since there is no contention when there is only one processor, the ratios are not shown for this machine size.
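The two queue operations counted by these request rates have roughly the shape sketched below in C. The lock and queue types (lock_t, queue_t) and the acquire/release primitives are placeholders, not the actual Alewife library calls.

    typedef struct node {
        /* position and solution counter omitted for brevity */
        struct node *next;
    } node_t;

    typedef struct lock lock_t;        /* assumed spin-lock type    */
    extern void acquire(lock_t *l);    /* assumed lock primitives   */
    extern void release(lock_t *l);

    typedef struct queue {
        lock_t *lock;
        node_t *head;
    } queue_t;

    /* One current queue operation: obtain the lock, obtain a position, release. */
    node_t *current_queue_get(queue_t *q)
    {
        acquire(q->lock);
        node_t *n = q->head;
        if (n != NULL)
            q->head = n->next;
        release(q->lock);
        return n;                      /* NULL means the current queue is empty */
    }

    /* One next queue operation: obtain the lock, add a position, release. */
    void next_queue_put(queue_t *q, node_t *n)
    {
        acquire(q->lock);
        n->next = q->head;
        q->head = n;
        release(q->lock);
    }

A higher request rate means these critical sections are entered more often, so both the lock and the memory bank holding the queue see more traffic, which is the contention effect discussed next.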
Lower contention for the work queues in problem size 5 causes less time spent acquiring locks and lower demand on memory access. These factors contribute to why the SM
implementation performs better than both the DTABQ and message-passing implementations for problem size 5, as shown in Figure 4.8.

[Figure 4.8: execution time in seconds versus number of processors.]
Figure 4.8. Comparison of DTABQ, baseline SM, and message-passing implementations, problem size 5. Percentages show how much better the SM implementation is than the message-passing one.

Recall that the SM implementation has a lock on the current queue, whereas the DTABQ implementation does not have locks on
the multiple current queues. Thus, we should expect the SM implementation to perform
worse, but it does not. The reason is that since less time is spent acquiring locks, other
overheads are relatively more significant. Recall that the DTABQ implementation hashes
the bit representations of extensions in order to distribute work, whereas the SM implementation does not. Because this hashing overhead in the DTABQ implementation is
more significant relative to time spent acquiring a lock on the current queue in the SM
implementation, the DTABQ implementation performs worse. The difference in execution times is not attributed to time spent executing LimitLESS software because the difference between the times the two implementations spend executing LimitLESS software,
averaged over all processors, is insignificant relative to execution time.
The SM implementation performs better than the message-passing implementation for
all machine sizes for the following reason: data access requests are processed by the
underlying hardware in parallel with the executing thread on the processor that services
the requests, and this process significantly affects execution time when demand on memory access is low. In contrast, in the message-passing implementation, data access
requests interrupt the executing thread on the servicing processor, thus making processing
data access requests and running the executing thread sequential. Thus, we arrive at our
second conclusion: shared memory offers low-overhead data access and performs better
than message passing for applications that exhibit low contention.
[Figure 4.9: execution time in seconds versus number of processors.]
Figure 4.9. Comparison of DTABQ, baseline SM, and message-passing implementations, problem size 6. Percentages show how much worse the SM implementation is than the DTABQ and message-passing implementations.
For problem size 6, this conclusion no longer holds: because contention for the work
queues is higher, the demand on memory access is higher. Thus, although shared-memory
accesses are being processed in parallel with executing threads, the parallelism is reduced.
Figure 4.9 shows that the baseline SM implementation performs worse than the DTABQ
and message-passing implementations for problem size 6 for two and more processors.
For one processor, the SM implementation performs better than the message-passing
implementation because the message-passing implementation incurs the end-to-end
latency of user-level active messages and overheads mentioned in the previous section,
whereas the SM implementation does not. Again, the overheads in the message-passing
implementation do not significantly influence the difference in execution times as the
number of processors increases because parallelism of sending messages and shared-memory accesses also increases. For one processor, the performance difference between
the SM and message-passing implementations is not as large as that for the DTABQ and
message-passing implementations because the SM implementation uses a lock on the current queue, whereas the DTABQ implementation does not. The difference in execution
times of the SM and DTABQ implementations is not attributed to executing LimitLESS
software because for all numbers of processors, both implementations spend about the
same time, averaged over all processors, for problem size 6. The SM implementation performs worse than the DTABQ implementation for one processor because more time is
spent locking the current queue in the SM implementation than hashing positions to distribute work in the DTABQ implementation.

[Figure 4.10: execution time in seconds versus number of processors.]
Figure 4.10. Comparison of message-passing and Improved DTABQ implementations, problem size 6. There is not enough memory in one processor (*) to execute the Improved DTABQ implementation. Percentages show how much worse the Improved DTABQ implementation is than the message-passing implementation.

Recall that for problem size 5, the SM implementation performs better with one processor. The reason is that 34.4% of all
explored positions are hashed in the DTABQ implementation for problem size 5, whereas
only 16.5% of all explored positions are hashed in the DTABQ implementation for problem size 6. Thus, for one processor, problem size 5 spends more time hashing in the
DTABQ implementation than locking in the SM implementation, whereas problem size 6
spends more time locking in the SM implementation than hashing in the DTABQ implementation.
4.4.3 Improved DTABQ and Message-Passing Implementations
When compared with the message-passing implementation, the Improved DTABQ
implementation also supports the two conclusions of this thesis. Since it reduces the
amount of synchronization by not using any locks on the four main data structures, the
Improved DTABQ implementation reduces the negative effect of decoupling synchronization and communication. In addition, since data accesses to a processor's set of next
queues execute in parallel with the thread that runs on that processor, the Improved
DTABQ implementation comes within 16% of the performance of the message-passing implementation in the best case and within 24% in the worst case. Figure 4.10 compares the
Improved DTABQ and message-passing implementations.
[Figure 4.11: execution time in seconds versus number of processors.]
Figure 4.11. Comparison of DTABQ, hybrid, message-passing, and Improved DTABQ implementations, problem size 6. Percentages show how much the hybrid implementation is worse than the message-passing implementation. Because the hybrid implementation requires two pools of positions, it needs at least eight processors to have enough memory to execute.
4.5 Hybrid Implementation
For this application, a hybrid implementation (one that shares the transposition table
among processors and uses message passing to distribute extensions) will not perform better than the pure message-passing implementation. Two characteristics of the message-passing
position are sent to the same processor. Second, checking whether these extensions are in
this processor's piece of the transposition table always involves accesses to private memory. Because of this locality in the message-passing implementation, a hybrid implementation will perform worse.
We implemented a hybrid implementation that is the same as the DTABQ implementation, with three exceptions. First, extensions are sent as active messages. Second, generating extensions precedes a phase in which it is determined whether extensions reside in
the transposition table. Third, the pool of positions is managed differently. Each processor has one pool of positions in its private memory and one pool of positions in its portion
of shared memory. Because deadlock may result in the current Alewife implementation
when a message handler accesses shared memory, a processor places received extensions
in its next queue, which refers to positions in the processor's private pool of positions.
[Figure 4.12: speedup versus number of processors (1-64) for the message-passing and Improved DTABQ (with prefetching) implementations on problem sizes 5 and 6, with the ideal speedup shown for reference.]
Figure 4.12. Speedups of Improved DTABQ and message-passing implementations over sequential implementation.
Because the transposition table is shared, if an extension in the next queue does not reside
in the table, it is copied to a shared position and placed in the table and current queue.
Figure 4.11 shows that the hybrid implementation for problem size 6 performs worse than
the message-passing implementation, but performs better than the DTABQ implementation. Figure 4.11 also shows that the hybrid implementation performs worse than the
Improved DTABQ implementation because the Improved DTABQ implementation has
local accesses to the transposition table and has no locks on any of the four main data
structures.
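The division of labor in the hybrid scheme can be summarized with the sketch below. It is a simplification under stated assumptions: the helpers (private_pool_alloc, next_queue_put_private, shared_table_insert_or_join) are invented names, and the join bookkeeping is folded into the shared-table helper.

    #include <stdint.h>

    typedef uint64_t position_t;

    typedef struct priv_entry {           /* lives in the processor's private pool */
        position_t         position;
        uint64_t           solutions;
        struct priv_entry *next;
    } priv_entry_t;

    extern priv_entry_t *private_pool_alloc(void);          /* assumed helper */
    extern void next_queue_put_private(priv_entry_t *e);    /* assumed helper */
    /* Insert into the shared transposition table (copying to a shared position) or
     * join with an existing entry by accumulating the counter.  (Assumed helper.) */
    extern void shared_table_insert_or_join(position_t p, uint64_t solutions);

    /* Message handler: touches only private memory, because a handler that
     * accesses shared memory may deadlock in the current Alewife implementation. */
    void hybrid_handle_extension(position_t extension, uint64_t solutions)
    {
        priv_entry_t *e = private_pool_alloc();
        e->position  = extension;
        e->solutions = solutions;
        next_queue_put_private(e);
    }

    /* Later, outside the handler, the queued extensions are checked against the
     * shared transposition table and scheduled for the following round. */
    void hybrid_process_next_queue(priv_entry_t *head)
    {
        for (priv_entry_t *e = head; e != NULL; e = e->next)
            shared_table_insert_or_join(e->position, e->solutions);
    }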
4.6 Speedups of the Improved DTABQ and Message-Passing Implementations
Figure 4.12 examines the speedups of the Improved DTABQ and message-passing
implementations over a sequential implementation of the breadth-first search algorithm on
one node of the Alewife machine. This sequential implementation executes in 0.137 seconds for problem size 5, and executes in 34.694 seconds for problem size 6 (averaged over
ten trials). Because of the characteristics of the message-passing implementation mentioned in the previous section and because it does not incur the overhead of performing
cache coherence actions, the message-passing implementation generally scales better than
the Improved DTABQ implementation. However, for eight and more processors and
problem size 5, the Improved DTABQ implementation scales better. This fact supports
our second conclusion that with a small problem size and hence lower contention, shared
memory performs better. In addition, both implementations scale better for problem size 6
than problem size 5. The reason is that a larger problem size is more suitable for parallel
solution because there is dramatically more work to perform (688,349 positions are
explored in problem size 6, whereas only 2,463 positions are explored in problem size 5).
The average time per processor spent executing LimitLESS software does not affect the
scalability of the Improved DTABQ implementation. For both problem sizes and eight
and more processors, 2,000 or fewer cycles are spent executing LimitLESS software.
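For concreteness, the 32-processor speedups plotted in Figure 4.12 for problem size 6 can be recomputed from the execution times in Table A.2, taking speedup to be the sequential time divided by the parallel time:

    S_{\text{message passing}}(32) = \frac{34.694}{1.599} \approx 21.7,
    \qquad
    S_{\text{Improved DTABQ w/ prefetching}}(32) = \frac{34.694}{1.984} \approx 17.5

Both fall well short of the ideal speedup of 32, with the message-passing implementation closer, which matches the scaling behavior described above.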
Chapter 5
Related Work
Several papers compare shared-memory and message-passing versions of parallel
applications. Chandra et al. [8] compare shared-memory and message-passing implementations of four parallel applications. However, unlike our comparison, which uses an
actual machine to make measurements, they simulate a distributed, cache-coherent shared-memory machine and a message-passing machine by using different simulators that run on
a CM-5. In addition, their message-passing machine simulator uses Thinking Machines'
CMMD message-passing library, which has questionable efficiency. Moreover, their simulators assume constant network latency and do not account for network contention. Martonosi and Gupta [19] compare shared-memory and message-passing implementations of
a standard cell router called LocusRoute. However, they use different simulators for their
shared-memory and message-passing machines, and run each simulator on a different
machine. Kranz et al. [16] evaluate shared-memory and message-passing microbenchmarks on Alewife, and one of their results supports our first main conclusion.
A noteworthy parallel solution by Hatcher and Quinn [15] solves the triangle puzzle
on different parallel machines that support either shared memory or message passing in
hardware. Their algorithm starts with parallel breadth-first search until each logical processor (a physical processor consists of logical processors) has at least one position to
extend, then switches to parallel depth-first search in which each logical processor is
responsible for independently computing the number of solutions from the positions it
owns. Because of the large number of joins encountered when solving the triangle puzzle,
implementations that use this approach will duplicate work and thus can perform worse
than those that strictly perform parallel breadth-first search and utilize a shared transposition table. Our message-passing implementation in which independent depth-first search
begins after each physical processor has at least one position performs worse than the
PBFS-only, message-passing implementation on the same machine, a 64-node CM-5. For
example, for problem size 5 on 64 processors, performance degrades by a factor of three
(setup time included).
The triangle puzzle has been solved using sequential algorithms. Bischoff [4], who
won second place in the Internet contest mentioned in Chapter 1, uses algorithmic techniques similar to ours in a sequential algorithm that runs on various machines. On a
486DX2-66, his algorithm that solves problem size 6 runs in 4.13 seconds, whereas our
algorithm on the same architecture runs in 7.25 seconds, or 1.8 times as slow. The reason Bischoff's sequential algorithm executes faster is that his algorithm stores all reflections of
positions in a table, whereas we dynamically compute them; there is a time-space tradeoff. We decided to dynamically compute reflections of positions because physical memory per node on Alewife is low (5MB maximum usable memory) and there is no virtual
memory yet.
The triangle puzzle is similar to other tree-search problems, such as the N-queens
problem [9, 12, 23, 27, 28], the Hi-Q puzzle [23], and chess [18]. Many of the search
techniques, such as exploiting symmetry to reduce the search space and using a transposition table, arise in solving these problems.
Chapter 6
Conclusions
We presented parallel shared-memory and message-passing implementations that
solve the triangle puzzle. This thesis is the first controlled case study of message-passing
and shared-memory implementations of an application that runs on actual hardware: only
the communication interfaces used by the implementations vary; all other system components remain fixed. From this case study, we draw two main conclusions. First, when
using distributed shared memory, performing cache coherence actions and decoupling
synchronization and data transfer can make a shared-memory implementation less efficient than the message-passing implementation. Second, shared memory offers low-overhead data access and can perform better than message passing for applications that exhibit
low contention.
To address the shared memory versus message passing debate, we comment on the
performance of our implementations and the ease of programming them. We learned from
this case study that under certain circumstances and for this particular application, a
shared-memory implementation sometimes performs better than the message-passing one.
Under other circumstances, the message-passing implementation performs better. Regarding ease of programming, we found the shared-memory implementations easier to program. One insidious bug (that was fixed) in our message-passing implementation was a
race condition that existed when executing code was interrupted by the code in a message
handler. Thus, we argue for machines that efficiently support both shared memory and
message passing so that all parallel applications can perform well.
Appendix A
Supplemental Tables
Table A.1 and Table A.2 show the execution times for shared-memory and message-passing implementations for problem sizes 5 and 6, respectively. Only execution times
that were graphically presented in Chapter 4 are presented. Table A.3, Table A.4, and
Table A.5 show the number of positions explored and number of joined positions per
round for problem sizes 5, 6, and 7, respectively.
Table A.6 and Table A.7 show request
rates and standard deviations of work queue operations for problem sizes 5 and 6, respectively; the data from these tables was used to derive the data in Table 4.5.
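As a check on that derivation, the two-processor current queue entry of Table 4.5 is the ratio of the corresponding request rates in Table A.7 and Table A.6:

    \frac{9724.42}{5066.86} \approx 1.92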
Implementation       Number of     Execution Time    Standard Deviation
                     Processors    (seconds)         (seconds)
SM                   1             0.162             0.000001
                     2             0.096             0.000159
                     4             0.057             0.000125
                     8             0.037             0.000106
                     16            0.029             0.000184
                     32            0.029             0.000313
DTABQ                1             0.165             0.000002
                     2             0.106             0.000023
                     4             0.069             0.000050
                     8             0.048             0.000065
                     16            0.037             0.000144
                     32            0.033             0.000105
Improved DTABQ       1             0.187             0.000002
with prefetching     2             0.113             0.000009
                     4             0.067             0.000013
                     8             0.044             0.000027
                     16            0.031             0.000068
                     32            0.026             0.000070
Message Passing      1             0.188             0.000001
                     2             0.106             0.000014
                     4             0.062             0.000038
                     8             0.043             0.000075
                     16            0.033             0.000159
                     32            0.030             0.000061

Table A.1. Execution times for shared-memory and message-passing implementations, problem size 5.
Implementation       Number of     Execution Time    Standard Deviation
                     Processors    (seconds)         (seconds)
SM                   1             44.370            0.000012
                     2             24.577            0.002670
                     4             13.260            0.001937
                     8             8.679             0.008489
                     16            6.245             0.015937
                     32            6.165             0.027826
DTABQ                1             41.765            0.000013
                     2             23.046            0.003010
                     4             12.550            0.001806
                     8             8.166             0.004335
                     16            5.251             0.003485
                     32            3.349             0.009677
Improved DTABQ       1             can't execute
                     2             28.626            0.000394
                     4             14.934            0.000358
                     8             7.703             0.000265
                     16            3.934             0.000135
                     32            2.062             0.000090
Improved DTABQ       1             can't execute
with prefetching     2             28.150            0.000565
                     4             14.610            0.000302
                     8             7.497             0.000320
                     16            3.807             0.000141
                     32            1.984             0.000140
Hybrid               1             can't execute
                     2             can't execute
                     4             can't execute
                     8             7.787             0.001534
                     16            4.173             0.000870
                     32            2.371             0.001063
Message Passing      1             48.278            0.000013
                     2             24.191            0.000853
                     4             12.107            0.000599
                     8             6.138             0.000478
                     16            3.071             0.000244
                     32            1.599             0.000230

Table A.2. Execution times for shared-memory and message-passing implementations, problem size 6.
Round    Number of Positions    Number of Joined    Percentage of
         Explored               Positions           Joined Positions
1        2                      1                   50.0%
2        3                      0                   0%
3        13                     2                   15.4%
4        49                     18                  36.7%
5        134                    63                  47.0%
6        319                    187                 58.6%
7        524                    342                 65.3%
8        595                    416                 69.9%
9        462                    323                 69.9%
10       265                    168                 63.4%
11       81                     55                  67.9%
12       14                     8                   57.1%
13       1                      0                   0%
Total    2463 (including        1614                46.2% (average
         initial position)                          per round)

Table A.3. Position statistics per round for problem size 5.
Round    Number of Positions    Number of Joined    Percentage of
         Explored               Positions           Joined Positions
1        2                      1                   50.0%
2        4                      0                   0%
3        25                     2                   8%
4        166                    59                  35.5%
5        805                    399                 50.0%
6        3255                   1927                59.2%
7        10842                  7250                66.9%
8        29037                  21063               72.5%
9        62813                  48320               76.9%
10       108651                 87704               80.7%
11       144681                 121409              83.9%
12       143275                 123753              86.4%
13       102976                 90417               87.8%
14       53927                  47673               88.4%
15       20684                  18236               88.2%
16       5893                   5132                87.1%
17       1180                   1012                85.8%
18       120                    92                  76.7%
19       12                     7                   58.3%
Total    688349 (including      574456              65.4% (average
         initial position)                          per round)

Table A.4. Position statistics per round for problem size 6.
Round    Number of Positions    Number of Joined    Percentage of
         Explored               Positions           Joined Positions
1        2                      1                   50.0%
2        6                      3                   50.0%
3        26                     3                   11.5%
4        234                    107                 45.7%
5        1362                   725                 53.2%
6        7320                   4571                62.4%
7        32694                  22460               68.7%
8        124336                 92116               74.1%
9        395100                 308524              78.1%
10       1057990                861074              81.4%
11       2372103                1994427             84.1%
12       4435444                3826796             86.3%
13       6881092                6058173             88.0%
14       8829772                7897726             89.4%
15       9336725                8452864             90.5%
16       8102692                7401676             91.3%
17       5755731                5289176             91.9%
18       3356026                3094497             92.2%
19       1602445                1479590             92.3%
20       618748                 570793              92.2%
21       191239                 175627              91.8%
22       47219                  43046               91.2%
23       8620                   7762                90.0%
24       1061                   908                 85.6%
25       144                    120                 83.3%
26       0                      0                   0%
Total    53158132 (including    47582765            73.7% (average
         initial position)                          per round)

Table A.5. Position statistics per round for problem size 7.
Number of     Request rate for       Standard deviation for     Request rate for       Standard deviation for
Processors    current queue          current queue request      next queue             next queue request
              (operations/cycle)     rate (operations/cycle)    (operations/cycle)     rate (operations/cycle)
2             5066.86                48.86                      5040.81                139.40
4             5707.26                146.01                     5725.38                518.98
8             6684.33                358.28                     6621.06                1119.82
16            9702.18                916.56                     9382.08                2145.33
32            18249.81               3294.98                    16756.77               5046.32

Table A.6. Request rates and standard deviations for work queue operations, problem size 5.
Number of     Request rate for       Standard deviation for     Request rate for       Standard deviation for
Processors    current queue          current queue request      next queue             next queue request
              (operations/cycle)     rate (operations/cycle)    (operations/cycle)     rate (operations/cycle)
2             9724.42                161.89                     8753.54                166.60
4             10647.27               163.51                     9583.73                180.65
8             14597.15               1535.84                    13157.04               1522.94
16            24781.83               16010.10                   19662.13               4050.93
32            50469.07               38950.69                   38912.12               3363.33

Table A.7. Request rates and standard deviations for work queue operations, problem size 6. The standard deviations for 16 and 32 processors for the request rate for current queue operations are high because the processor whose memory bank contains the work queues has a request rate that is much higher than all other processors. The reason is that all accesses to the work queues for this processor are local. For all other processors, the request rates are about equal to the average shown.
Bibliography
[1] D. Applegate, G. Jacobson, and D. Sleator. "Computer Analysis of Sprouts." Carnegie Mellon University, School of Computer Science, TR 91-144, May 1991.

[2] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. "The MIT Alewife Machine: Architecture and Performance." Submitted for publication, December 1994.

[3] A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, and M. Parkin. "Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors." IEEE Micro, June 1993, pp. 48-61.

[4] M. Bischoff. "Approaches for Solving the Tri-Puzzle." November 1993. Available via anonymous ftp from lucy.ifi.unibas.ch as tri-puzzle/michael/doku.ps.

[5] M. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sandberg. "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer." In Proceedings of the Twenty-first Annual International Symposium on Computer Architecture, Chicago, Illinois, April 1994, pp. 142-153.

[6] J. Carter, J. Bennett, and W. Zwaenepoel. "Implementation and Performance of Munin." In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, Pacific Grove, California, October 1991, pp. 152-164.

[7] D. Chaiken, J. Kubiatowicz, and A. Agarwal. "LimitLESS Directories: A Scalable Cache Coherence Scheme." In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, April 1991, pp. 224-234.

[8] S. Chandra, J. Larus, and A. Rogers. "Where Is Time Spent in Message-Passing and Shared-Memory Implementations?" In Proceedings of Architectural Support for Programming Languages and Operating Systems VI, San Jose, California, October 1994, pp. 61-73.

[9] P.-C. Chen. Heuristic Sampling on Backtrack Trees. Ph.D. thesis, Stanford University, May 1989. (Also available as Stanford Technical Report CS 89-1258.)

[10] L. Crowl, M. Crovella, T. LeBlanc, and M. Scott. "Beyond Data Parallelism: The Advantages of Multiple Parallelizations in Combinatorial Search." University of Rochester, Dept. of Computer Science, TR 451, April 1993.

[11] W. Dally. A VLSI Architecture for Concurrent Data Structures. Kluwer Academic Publishers, 1987.

[12] R. Finkel and U. Manber. "DIB--A Distributed Implementation of Backtracking." ACM Transactions on Programming Languages and Systems, April 1987, Vol. 9, No. 2, pp. 235-256.

[13] J. Gittinger, T. Chikayama, and K. Kumon. Proofs that there are no solutions for the triangle puzzle for size 3N+1. November 1993. Available via anonymous ftp from lucy.ifi.unibas.ch in directories tri-puzzle/gitting and tri-puzzle/jm.

[14] S. Gutzwiller and G. Haechler. "Contest: How to Win a Swiss Toblerone Chocolate!" August 1993, Usenet newsgroup comp.parallel.

[15] P. Hatcher and M. Quinn. Data-Parallel Programming on MIMD Computers. The MIT Press, 1991.

[16] D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, and B.-H. Lim. "Integrating Message-Passing and Shared-Memory: Early Experience." In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, May 1993, pp. 54-63.

[17] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. "The Stanford FLASH Multiprocessor." In Proceedings of the Twenty-first International Symposium on Computer Architecture, Chicago, Illinois, April 1994, pp. 302-313.

[18] B. Kuszmaul. Synchronized MIMD Computing. Ph.D. thesis, MIT Laboratory for Computer Science, May 1994.

[19] M. Martonosi and A. Gupta. "Trade-offs in Message Passing and Shared Memory Implementations of a Standard Cell Router." In Proceedings of the 1989 International Conference on Parallel Processing, Pennsylvania State University Park, Pennsylvania, August 1989, pp. III-88 to III-96.

[20] J. M. Mellor-Crummey and M. L. Scott. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." ACM Transactions on Computer Systems, February 1991, Vol. 9, No. 1, pp. 21-65.

[21] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.

[22] C. Polychronopoulos and D. Kuck. "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers." IEEE Transactions on Computers, December 1987, pp. 1425-1439.

[23] E. Reingold, J. Nievergelt, and N. Deo. Combinatorial Algorithms: Theory and Practice. Prentice Hall, 1977.

[24] C. Seitz. "Concurrent VLSI Architectures." IEEE Transactions on Computers, December 1984, pp. 1247-1265.

[25] Thinking Machines Corporation. Connection Machine CM-5 Technical Summary. November 1993.

[26] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. "Active Messages: A Mechanism for Integrated Communication and Computation." In Proceedings of the 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992, pp. 256-266.

[27] C. K. Yuen and M. D. Feng. "Breadth-First Search in the Eight Queens Problem." ACM SIGPLAN Notices, Vol. 29, No. 9, September 1994, pp. 51-55.

[28] Y. Zhang. Parallel Algorithms for Combinatorial Search Problems. Ph.D. thesis, UC Berkeley, November 1989. (Also available as UC Berkeley Computer Science Technical Report 89/543.)