Distributive Program Parallelization Using a Suggestion Language

Bryan Jacobs, Tongxin Bai, and Chen Ding
{jacobs,kelsey,cding}@cs.rochester.edu
The University of Rochester
Computer Science Department
Rochester, NY 14627
Technical Report TR-952
December 2009
Abstract
Most computing users today have access to clusters of multi-core computers. To fully
utilize a cluster, a programmer must combine two levels of parallelism: shared-memory
parallelism within a machine and distributed memory parallelism across machines. Such
programming is difficult. Either a user has to mix two programming languages in a single
program and use fixed computation and data partitioning between the two, or the user
has to rewrite a program from scratch. Even after careful programming, a program may
still have hidden concurrency bugs. Users who are accustomed to sequential programming
do not find the same level of debugging and performance-analysis support, especially in a
distributed environment.
The paper presents a suggestion-based language that enables a user to parallelize a
sequential program for distributed execution by inserting hints. The hints are safe against
any type of misuse and expressive enough to specify independent, pipelined, and speculative
parallel execution on a cluster of multi-core computers.
The research is supported by the National Science Foundation (Contract No. CNS-0720796, CNS-0834566), an IBM CAS Faculty Fellowship, and a gift from Microsoft Research. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the authors and do not necessarily
reflect the views of the funding organizations.
1 Introduction
Computer users are increasingly interested in parallel programming because they want to
utilize clusters of multi-core processors, which are capable, in theory, of tens or hundreds
of times the performance of a single personal computer. Often they have a program that
takes hours or days to execute, with coarse-grained parallelism that is easy to identify. The
difficulty, however, is safe parallelization. Large computation tasks may execute tens of
difficulty, however, is safe parallelization. Large computation tasks may execute tens of
thousands of lines of code and make extensive use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Although
compiler techniques are effective in exploiting loop parallelism in scientific code written in
Fortran [2, 8, 21, 47], they are not a sufficient solution for C/C++ programs or programs
with input-dependent behavior, where neither the degree nor the granularity of parallelism
is guaranteed or even predictable.
Manual parallel programming is becoming easier. There are ready-to-use parallel constructs in mainstream languages such as Fortran, C, C++, and Java, in threading libraries
such as Microsoft .NET and Intel Threading Building Blocks, and in domain-specific languages
such as Google's MapReduce and Intel Concurrent Collections for C++. Still, writing parallel
code is considerably harder than sequential programming because of non-determinism. A
program may run fine in one execution but generate incorrect results or run into a deadlock in the next execution because of a different thread interleaving. It may acquire new
data races when ported to a machine with a different hardware memory consistency model.
In addition, important types of computations such as mesh refinement, clustering, image
segmentation, and SAT approximation cannot be parallelized without speculation since
conflicts are not known until the computation finishes [26, 27].
This paper extends the suggestion language BOP [16,51,52]. BOP lets a user parallelize
general types of sequential programs by inserting suggestions or hints. In this paper, the
BOP language is extended to have four types of hints, as follows. Based on these hints, the
BOP run-time system divides a sequential program into tasks that can run on a computer
cluster and ensures the correctness of the parallel execution.
• Parallelism hints let a user mark possibly parallel regions (PPR) in a program. BOP
then tries to execute PPR tasks in parallel in a distributed environment using explicit
message passing instead of shared memory.
• Data-checking hints let a user specify private data or use value-based checking to
avoid false dependences in parallel execution.
• Dependence hints include speculative post-wait primitives. They express possible dependences between PPR tasks. The BOP implementation uses speculative synchronization
and communication to satisfy these dependences. Since all BOP tasks logically share
a single address space, it is easy for them to share (and build) dynamically allocated
data structures such as lists and trees.
• Computation reordering hints enable BOP to increase the granularity of parallelism
by allowing successive PPR tasks to be joined into larger units of execution.
BOP is a parallelization language. Unlike parallel constructs in a parallel language, BOP
primitives exist purely as hints. The output of an execution with suggestions is always
identical to that of the original sequential execution without suggestions, independent of
task scheduling and execution interleaving. A user never needs parallel debugging. For more
effective parallelization, a user may need to restructure a program to expose parallelism
and remove conflicts, but the restructuring is done on sequential code using sequential
programming tools such as conventional profilers or debuggers [16].
BOP uses software speculation and guarantees sequential semantics through speculative
execution, run-time monitoring, and dependence checking. Speculation, if it fails, wastes
system resources including electric power. It may even delay useful computations by causing unnecessary contention. There are inherent overheads that limit the scalability of a
speculative system. However, speculation is valuable and may be necessary for parallelizing
existing sequential code and for exploiting speculative parallelism (which is unknown until
computation finishes).
When speaking of programming language design, C++ inventor Bjarne Stroustrup said
“legacy code is code that actually works” and “legacy code breeds more legacy code.” As a
parallelization language, BOP allows legacy sequential code to be parallelized without explicit
parallel programming. As programming hints, it offers new types of expressive power. As a
result, BOP may enable ordinary users to more easily utilize a group of shared-use computers
to improve the performance of existing, sequential software and of new software that reuses
an existing code base.
In the rest of the report, Section 2 gives an overview of the language design. Section 3
presents the parallelism hint and the system for distributed speculation. Section 4 and
Section 5 describe dependence and computation-reordering hints. The dependence hint is
an improvement over our previous design [52]. In Section 4, we describe an improved design,
complete semantics, and a simpler implementation. In Section 6, we use BOP suggestions
to build other speculation constructs. As a non-trivial use of the language, we show a
parallelization technique called time skewing in Section 7. Finally, we discuss related work
and summarize.
2 Overview of the Parallelization Language
A parallelization language must not change the result of a program, so BOP hints must
guarantee correctness. But this is not sufficient. We design the BOP language interface
with three additional goals.
• Incremental. BOP supports parallelization with partial information, either a user examining part of a program or a profiling tool analyzing a few executions. The language
should allow a user or a tool to gradually discover and supply needed information.
• Succinct. For parallelization, a user should not have to mark all dependences in a
program or to mark every dependence individually. The language should enable the
user to express only relevant dependences with a minimal number of hints.
• Efficient. The system support should be efficient and robust in the presence of incorrect, incomplete, or out-of-date hints.
Since BOP guarantees sequential semantics, it cannot use concurrency constructs such
as critical sections, monitors, atomic regions, or transactions. In their place, BOP uses
speculative, serial constructs. Speculation enables out-of-order execution similar to that of
concurrency constructs, at the cost of run-time monitoring and in-order commit. BOP tries to tolerate
these costs with coarse-grained parallelism.
The BOP language aims for both shared-memory and distributed-memory parallelization. Usually distributed programming is more onerous—a user must specify all data sharing
to implement it with message passing. As a speculation system, BOP has to monitor all
data access, so it can detect data sharing automatically and remove this distinction between shared-memory and distributed-memory programming. There is no language-level
difference whether a user programs for one multi-core computer or ten.
The BOP language has four types of hints, shown in Figure 1(a). A parallelism hint
suggests a parallel task. A dependence hint suggests coordination between parallel tasks. A
data-checking hint helps run-time monitoring. A computation-reordering hint allows the
run-time system to create larger tasks. Next we describe these hints, using the parallelization
shown in Figure 1(b,c) as a running example.
3 Parallelism Hints
A bop parallel block suggests that a lexical block is possibly parallel with code after the
block. An instance of a bop parallel block is the basic unit of BOP parallelism. We call each
instance a possibly parallel (PPR) task [16]. PPR tasks may have an iterative structure if
they are spawned one after another, nested if one is spawned inside another, or a mixture
of both.
To see a use of the parallelism hint, consider the code of a generic processing loop in
Figure 1(b). Each iteration de-queues an element from a queue called inputs, processes
the element, and adds the result to a queue called results. In addition, the program
displays the inputs queue before the loop and results after the loop. There is not enough
information for static parallelization. The size of inputs is unknown. The process function
may be too complex to analyze at compile time. It may use an externally compiled library
with no source code available. Still, the code may be parallelizable. This possibility is
expressed in two bop_parallel blocks. The first suggests the function parallelism between
displaying the inputs queue and processing it. The second suggests the loop parallelism
between processing different elements.
PPR is similar to the future construct in Multilisp. A future expresses the parallelism
between the future and its continuation [22]. Other future-style constructs include spawn
in Cilk [17], future in Java, and async in X10 [10] and in the proposed C++200X standard.
PPR expresses possible rather than definite parallelism [16]. Two similar constructs have
been studied, safe future in Java [46] and ordered transactions in C/C++ [45].
(a) The BOP language has four types of hints for marking possible parallelism and dependence:

• Parallelism hint (and implicit dependence hint): bop_parallel { //code block }. Marks a possibly parallel region (PPR) in code, which may be parallelized by spawning a speculative process to execute from the end of the code block, also called spawn_n_skip (sas). Each PPR has my_ppr_index, incremented for each PPR.

• Explicit dependence hints: bop_fill(channel, var) informs the channel to post var at the time of bop_post; bop_post(channel) posts all modified data and marks the channel ready; bop_wait(channel) waits for the channel until it is ready and retrieves its data.

• Computation reordering hint: bop_deferrable { //code block }. Constructs a closure for execution later.

• Correctness checking hints: bop_private(var) marks variable var possibly private; bop_check(var) marks variable var for value-based checking.

(b) A processing loop with two queues, one holding the inputs and the other the results. The queues are displayed before and after the loop:

    queue inputs, results
    inputs.display
    while ( ! inputs.empty ) {
      i = inputs.dequeue
      t = process( i )
      results.enqueue( t )
    }
    results.display

(c) BOP parallelization. Two bop_parallel blocks mark function and loop parallelism. Post-wait ensures correct construction of the dynamically allocated results queue:

    bop_parallel { inputs.display }
    first_iter_id = my_ppr_index
    while ( ! inputs.empty ) {
      i = inputs.dequeue
      bop_parallel {
        t = process( i )
        if ( my_ppr_index > first_iter_id )
          bop_wait( my_ppr_index - 1 )
        results.enqueue( t )
        bop_fill( my_ppr_index, &results.tail, sizeof(void*) )
        bop_fill( my_ppr_index, results.tail, sizeof(qnode) )
        bop_post( my_ppr_index )
      }
    }
    results.display

Figure 1: The parallelization language in (a), illustrated by an example in (b,c)
A future represents a fork in fork-join parallelism. Each future should be paired with
a join. A join point is often explicitly coded by the user, for example, using sync in Cilk,
finish in X10, get in Java and in C++200X. Because of its speculation support, BOP does
not require that a programmer specify a join point. There are three benefits. The first is
safety. It allows partial-knowledge parallelization when the point of a join is not completely
known. The second is speculative parallelization in cases when a join point is known only
after future execution or when future execution ignores some infrequent conflicts. The third
is flexibility. A user can suggest a task with or without a join. To suggest the join point, a
user can add an explicit dependence hint, which we describe in Section 4.2.
3.1 Background on Software Speculative Parallelization
A speculation system deals with three basic problems.
• monitoring data accesses and detecting conflicts — recording shared data accesses by
parallel tasks and checking for dependence violations
• managing speculative states — buffering and maintaining multiple versions of shared
data and directing data accesses to the right version
• recovering from erroneous speculation — preserving a safe execution state and re-executing the program when speculation fails
A speculation task can be implemented by a thread or a process. Most thread-based methods target loop parallelism and rely on a compiler or a user to identify shared data and
re-direct shared data access when needed.
We use a process-based solution [7, 16, 24, 25, 53]. A process can be used to start a
speculation task anywhere in the code, allowing general types of fork-join parallelism. By
using the page protection mechanism, a run-time system can monitor data access without
instrumenting a program. Processes are well isolated from each other's actions. A modern
OS performs copy-on-write and replicates a page on demand. In a speculation system, such
replication removes false dependence conflicts and makes error recovery trivial: we can
simply discard an erroneous speculative task by terminating the process.
Process-based software speculation has been shown effective in safe parallelization of
large tasks in legacy code [16, 52], library calls [16], and map-reduce code [7], in debugging
and race-free implementation of multi-threaded code [7], and in adaptive control of speculation [24]. It supports continuous speculation, which starts a speculation task whenever
a processor becomes available, to tolerate load imbalance caused by irregular task sizes
and uneven hardware speed [53]. It supports parallel memory allocation and deallocation
during speculation [7, 52]. It supports threaded speculation in loops using two techniques
from software transactional memory: logging and undo buffer [40]. It also supports speculative program optimization and its use in parallel program profiling and memory-access
checking [25]. Next we extend the process-based design to enable program parallelization
in a distributed environment.
3.2 Distributed Speculation
The BOP run-time system has three types of processes.
• A control process manages shared system information and serves as the central
point of communication. There is one control task in each BOP execution. We refer
to it as the controller.
• A host management process manages parallel execution within a host and coordinates with other hosts through the control process. There is one management process on
each host. We refer to it as a (host) manager.
• A work process runs one PPR task or a series of PPR tasks on a single processor. Work processes are dynamically created and terminated and may perform computation speculatively. We refer to one as a worker, or a risky worker if its job is speculative.
The first two types are common in distributed systems such as those used by MPI and
Erlang. The problems unique to the BOP design are how to support speculation, including
running a PPR task on an available machine, checking correctness, and maintaining
progress in a distributed environment. To do so, the three types of processes divide the
work and coordinate with each other as follows.
• The control process distributes a group of PPR instances (or jobs) to the management
process of each host.
• On a host, the management process starts one or more work processes based on the
resources available on the host.
• After finishing its assigned PPR instances, a work process reports its results for verification.
• If a finished PPR instance has a conflict, the control process redistributes the PPR
instance and its successors for re-execution. If all finished PPR instances are verified
correct, the control process continues distributing subsequent PPR instances until the
program finishes.
Figure 2 shows the distributed execution of an example program on two hosts. The
program has four PPR tasks separated by gaps. The controller distributes PPR tasks to
hosts. A host manager forks a worker process to execute a PPR task. Inter-PPR gaps are
executed by the controller and every manager.
To start a worker, BOP needs to create a process and initialize it to have the right
starting state. There are two basic ways of reproducing an existing state: one is copying it
from another process, the other is re-computing it from the beginning. On the same host,
copying can be easily done using Unix fork. Across hosts, we use a hybrid scheme we
call skeleton re-execution. Each host manager executes all inter-PPR gaps, which form the
“skeleton.” When it reaches a PPR task, the manager waits for its successful completion
(by some worker), copies the data changes, and skips to continue the skeleton execution at
the next inter-PPR gap. With skeleton re-execution, a manager maintains a local copy of
the program state and uses it to start worker tasks through fork.
An alternative to re-execution is to use remote checkpointing, for example, to use a
system like Condor [30] to implement a remote fork. Simple checkpointing would transfer
the entire program state, which is unnecessary. Incremental checkpointing may alleviate
the problem. Checkpointing is a more general solution than re-execution because it handles
code that cannot be re-executed. In our current prototype, we allow only CPU, memory,
and basic file I/O operations, where re-execution can be supported at the application level.
Checkpointing support can eliminate these restrictions.
(a) A program execution consists of two PPR tasks, t.1 and t.2, separated by gaps g.1, g.2, and g.3:

    gap g.1 → ppr task t.1 → gap g.2 → ppr task t.2 → gap g.3

(b) Distributed parallel execution of t.1 on Host 1 and t.2 on Host 2. Both hosts run gaps g.1, g.2, and g.3. The controller sends job t.1 to Manager A (Host 1) and job t.2 to Manager B (Host 2):

    Manager A (Host 1): run g.1; spawn A.1; wait for t.1; add t.1 result;
                        run g.2; wait for t.2; add t.2 result; run g.3
    Worker A.1:         run t.1; return t.1 result
    Manager B (Host 2): run g.1; spawn B.1; wait for t.1; add t.1 result;
                        run g.2; wait for t.2; add t.2 result; run g.3
    Worker B.1:         skip t.1; run g.2; run t.2; return t.2 result

Figure 2: An example execution by BOP. The input program is a series of PPR tasks
separated by gaps. The controller distributes PPR tasks to hosts. A host manager forks
a worker process to execute a PPR task. Inter-PPR gaps are executed by the controller and
every manager.
3.3 Correctness Checking and Data Update
There are two basic problems in speculation support: correctness checking and data update.
Each problem has two basic solution choices.
• Correctness checking can be done by centralized validation, where checking is centralized in one process (the controller), or distributed validation, where checking work is
divided among hosts.
• Data update can be done by eager update, where changes made by one host are copied
to all other hosts, or lazy update, where changes are communicated only when they
are needed by a remote host.
The problem of data updates is similar to the ones in software distributed shared memory
(S-DSM), while the problem of correctness checking is unique to a speculative parallelization
system. The above choices can be combined to produce four basic designs.
• centralized validation, eager update. This design is similar to shared-memory speculation. With eager update, a worker incurs no network-related delays because all its
data is available on the local host. However, the controller may become a bottleneck,
since it must send all data updates to all hosts.
• distributed validation, eager update. With distributed validation, each host sends its
data updates to all other hosts, which avoids the bottleneck in centralized validation
but still retains the benefit of eager update. However, correctness checking is repeated
on all hosts (in parallel), which increases the total checking and communication cost
due to speculation.
• distributed validation, lazy update. Lazy update incurs less network traffic because it
transfers data to a host only if it is requested by one of its workers. The reduction
of traffic comes at an increase of latency. Instead of accessing data locally, a worker
must wait for data to be fetched from the network, although the latency may be
tolerated by creating more workers to maintain full progress when some workers wait
for data. With distributed validation, the global memory state is distributed instead
of centralized and replicated.
• centralized validation, lazy update. This scheme maintains a centralized global state.
As the sole server of data requests, the controller would inevitably become a bottleneck as the number of lazy-update workers increases. This combination is likely not
a competitive option.
There are hybrids among these basic combinations. For example, we may combine centralized and distributed validation by checking correctness in the controller but sending
data updates from each host, in a way similar to the “migrate-on-read-miss” protocol in a
distributed, cache coherent system [12]. We may also divide program data and use eager
update in one set and lazy update in another.
3.4 Centralized Validation and Eager Data Update
To verify correctness, the controller and manager processes maintain a progress frontier,
which separates the verified part and the speculative part of an execution. A worker is
started by a manager from its progress frontier and given a target PPR instance. A worker
usually executes a series of gaps and a PPR instance. We use the following symbols when
describing the checking algorithm.
• f: progress frontier. f_g is the (global) progress frontier maintained by the controller.
f_i is the progress frontier known by the manager at host i.
• gap(f, i), ppr(f, i): the two parts, gap and PPR, of the execution of PPR instance i
started from progress frontier f.
• R(x), W(x), RW(x): the read, write, and read-write access maps of a gap or a PPR
instance x.
• P(x): the content of the data (pages) modified by a gap or PPR instance x.
• U(i): the effect of PPR instance i, stored as the set of pages modified in i and their
content.
In centralized validation, each worker sends meta-data and data to the controller, which
verifies correctness of PPR instances in order. Below are the steps of correctness checking.
1. A worker executes PPR instance i from progress frontier f. It executes the inter-PPR
gaps from gap f+1 to gap i−1 and then starts recording read and write maps,
separately for gap i first and PPR i next, as it executes them.
2. When a worker finishes a speculative PPR execution, it sends the result, consisting of
RW(gap(f, i)), RW(ppr(f, i)), and P(ppr(f, i)), to the controller.
3. The controller validates worker results in the order of increasing gap and PPR index.
The result of gap(f, i) and ppr(f, i) (i > f+1) is correct if gap(f′, i−1) and ppr(f′, i−1)
have been verified correct for some f′ and:
• there has been no failure between f and i, that is, the last failure index is
less than f in the controller's blacklist.
• the read and write sets of gap(f, i) and ppr(f, i) do not overlap with any update
set U(j) after f. In symbolic terms we require

    [RW(gap(f, i)) ∪ RW(ppr(f, i))] ∩ [U(f+1) ∪ · · · ∪ U(i−1)] = φ
4. If gap(f, i), ppr(f, i) is correct, the controller advances the progress frontier f_g to i,
stores the update set of i, U(i) = P(ppr(f, i)), and sends to all managers the new f_g
and U(i); otherwise, the controller blacklists index i and re-issues the job (f_g, i) and
its subsequent jobs for execution.
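A sketch of this check in C follows, assuming a page-granularity set representation; the pageset_t type and the pageset_overlap helper are hypothetical, not part of the BOP interface.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        const long *pages;   /* sorted page indices */
        size_t n;
    } pageset_t;

    extern bool pageset_overlap(const pageset_t *a, const pageset_t *b);

    /* rw_gap = RW(gap(f,i)) and rw_ppr = RW(ppr(f,i)) from the worker;
     * U[j] is the stored update set of PPR j; last_fail is the most
     * recent blacklisted index. Returns true if (gap, ppr) is correct. */
    bool validate(const pageset_t *rw_gap, const pageset_t *rw_ppr,
                  const pageset_t U[], int f, int i, int last_fail)
    {
        if (last_fail >= f)             /* a failure between f and i      */
            return false;
        for (int j = f + 1; j < i; j++) /* [RW(gap) ∪ RW(ppr)] ∩ U(j) = φ */
            if (pageset_overlap(rw_gap, &U[j]) ||
                pageset_overlap(rw_ppr, &U[j]))
                return false;
        return true;
    }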
A special case is i = f+1, when the execution is non-speculative. A worker executes
only PPR i (no gap) and returns P(ppr(f, f+1)), W(ppr(f, f+1)) for the controller to
update its progress frontier and U(f+1). This case is used by BOP to guarantee progress
and base performance, as discussed in Section 3.6.
Correctness checking in BOP differs from the checking in shared-memory speculative
systems. In particular, BOP does not check for gap-to-gap dependences and does not need
gap updates. This difference is due to the gap execution being repeated on every host.
Because of the asymmetry in gap and PPR executions, they are treated differently in data
collection and access checking.
The incremental procedure in the algorithm for correctness checking guarantees two
transitive properties: speculation succeeds for PPR instances from 1 to n if and only if
there is no dependence between (1) gap(j) and any predecessor ppr(i), and (2) ppr(j) and
any predecessor ppr(i).
3.5 Correctness Checking Hints
By default BOP monitors all global and heap data. Local variables are considered task
private. Compiler analysis can be used to identify whether a local variable (stack data) is
entirely private. If a compiler cannot statically determine all accesses, it can allocate the
variable in the heap for monitoring. Our previous system uses compiler support to identify
global variables for monitoring. BOP also provides a function for a program to explicitly
register global data for monitoring, which allows BOP to be used without compiler support.
For correctness checking, BOP classifies monitored data into three categories, possibly
private, shared, and value checked, as explained in detail in [16]. The checking is primarily
dependence based. There are three types of dependences. BOP checks for flow- and
output-dependence violations. Anti-dependence violations pose no problem because of data
replication [16, 52]. In our running example in Figure 1, the input queue can be displayed in one
PPR task while its elements are being removed by the next PPR tasks, because the queue
is replicated by the later tasks.
Flow and output dependences would normally cause speculation to fail. In two cases,
however, BOP permits these dependences when it can prove that they are harmless. These
two cases are suggested by bop_private and bop_check.

A variable is marked by bop_private if, in a PPR task, its value is assigned before it is
used. Because the first access is a write, the variable does not inherit a value from prior
tasks. Verifying the suggestion requires capturing the first access to a variable, which can
be costly if the variable is an array or a structure. For efficiency we use a compromise.
We insert code at the start of the PPR to write a constant value into all variables that are
marked bop_private. If the suggestion is correct, the additional write adds a small extra cost
but does not change the program semantics. If the suggestion is wrong, the program may
not execute correctly, but the sequential version has the same error, and the error can be
identified using conventional debugging tools. Under this implementation, bop_private is a
directive rather than a hint, unlike other BOP primitives.
The hint bop_check suggests that a variable holds the same value before and after a
PPR task. The variable may take other values in between, but the intermediate values are
not visible outside and do not affect the correctness of later PPR tasks. In the implementation,
BOP records the value of checked variables at the start of a PPR instance and, at the end,
compares the starting value with the ending value. If the two versions are identical, the
variable has the same effect on parallel execution as a read-only variable and does not cause
any dependence violation.
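As an illustration, the two hints might appear in a PPR body as follows, using the bop_private(var) and bop_check(var) forms from Figure 1(a). The variables, the ppr_body function, and the "bop.h" header are hypothetical.

    #include "bop.h"   /* hypothetical header declaring the BOP hints */

    static double scratch[1024];  /* filled from scratch by every PPR task   */
    static long   rng_state;      /* changed and restored within a PPR task  */

    void ppr_body(void)
    {
        bop_private(scratch);  /* claim: written before read in this PPR;
                                  the runtime writes a constant into it at
                                  PPR entry, so a wrong claim surfaces as an
                                  ordinary sequential bug                    */
        bop_check(rng_state);  /* claim: its value at PPR exit equals its
                                  value at PPR entry, so no dependence is
                                  carried to later PPR tasks                 */

        /* ... compute using scratch and rng_state ... */
    }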
3.6 Progress and Performance Guarantee
Speculative execution may be too slow or produce incorrect results. Our previous systems
re-execute speculative work in a sequential, non-speculative task called the understudy, which
guarantees progress and base performance [16, 52]. In the distributed case, we use the
controller for this purpose.
At each progress frontier, the controller executes the next PPR instance non-speculatively
if it has not been finished correctly by a speculative worker. When the controller executes
a PPR instance, it records the write set for use in correctness checking and record keeping.
Except for write access monitoring, the controller execution is the same as a sequential
execution. Therefore, it guarantees that BOP execution takes at most the time required for
the controller to finish the entire execution.
4 Dependence Hints

4.1 Implicit Dependence Hints
In BOP, a program execution becomes a tree of tasks similar to a function invocation
tree. Tasks on every tree path are spawned sequentially. Since the code between two PPRs
is executed sequentially, dependences among inter-PPR code regions are automatically
enforced.
Inter-PPR code is most often used to implement the serial work needed to start parallel
tasks. For example, in Figure 1(c), the input queue is accessed outside the loop PPR, so
queue elements are extracted sequentially. Another example is that in loops, the loop
control code should be placed outside a PPR.

Hence a PPR marks not only parallelism but also dependences. It suggests serial dependence in inter-PPR code, dependence from inter-PPR code to PPR code, and the absence
of dependences from PPR code to inter-PPR code. An example of a PPR to inter-PPR
dependence can be found in Figure 1(b), between the computation of the result t and its
entry into the queue. To express this dependence, we need explicit dependence hints, which
we describe next.
4.2 Explicit Dependence Hints
Post-wait was created by Cytron to express dependences in a do-across loop [13, 34]. We
extend post-wait to mark dependences between PPR tasks. In BOP post-wait, dependent
operations communicate data through a single-use channel. Channels are created on demand
and have the initial status not posted. Communicated data is identified by variable name
and, if it is an array, optionally a range. BOP post-wait has three primitives.
• bop_fill(channel_id, start_addr, num_pages) adds data to a local, not-yet-communicated channel with the identifier channel_id. A bop_fill call is ignored if the channel has been
posted. Only the location of the data is recorded at this point. The same memory
locations may be added by multiple bop_fill operations. The contents of the pages are
taken only when the channel is actually posted.
• bop_post(channel_id) releases the locally constructed channel data for consumption
and stops the channel from accepting new data by changing the status of the channel,
both globally and locally, from not posted to posted. It communicates only the data
that have been modified by the time of the bop_post.
• bop_wait(channel_id) stalls if the channel status is not posted. If and when the channel
status changes, it receives the data into the waiting task.
For the same channel, bop_post operations are atomic and logically do not overlap
with each other. Operations of bop_wait for the same channel may retrieve data in parallel
with each other. Since bop_fill is local, two PPR tasks may create two channels with the
same identifier. However, for the same channel identifier, only one bop_post operation may
succeed.
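To make the single-post semantics concrete, below is a minimal sketch of how a single-use channel might be represented. The channel_t type, the MAX_FILLS bound, and the copy_modified helper are hypothetical; the actual BOP runtime is not shown here.

    #include <stddef.h>

    #define MAX_FILLS 16                 /* arbitrary bound for the sketch  */

    typedef enum { NOT_POSTED, POSTED } chan_status_t;

    typedef struct {
        int id;
        chan_status_t status;            /* one irreversible transition     */
        struct range { void *addr; size_t len; } fills[MAX_FILLS];
        int n_fills;                     /* fills are recorded locally      */
        void *snapshot;                  /* contents captured at bop_post   */
    } channel_t;

    /* hypothetical helper: copy the filled ranges that have been modified
     * so far, returning a buffer to be shipped to waiting tasks */
    extern void *copy_modified(const struct range *fills, int n);

    void bop_fill(channel_t *c, void *addr, size_t len)
    {
        if (c->status == POSTED || c->n_fills == MAX_FILLS)
            return;                      /* fills after a post are ignored  */
        c->fills[c->n_fills++] = (struct range){ addr, len };
    }

    void bop_post(channel_t *c)
    {
        if (c->status == POSTED)
            return;                      /* only one post may succeed       */
        c->snapshot = copy_modified(c->fills, c->n_fills);
        c->status = POSTED;              /* now visible globally; waiters
                                            retrieve the snapshot           */
    }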
Our running example shows a use of post-wait, where PPR tasks assemble their results
into a linked list. In the code in Figure 1(c), post-wait is used to coordinate parallel PPR
instances so they enqueue their results sequentially. The first PPR modifies the queue and posts
the tail. Each subsequent PPR waits for the previous PPR's post, modifies the queue, and
posts the new tail.
Since PPR tasks have an identical address space, the sender determines the placement of
communicated data in the receiver. This property is indispensable for our example, where
one PPR allocates a new queue node and passes it on to the next PPR. The example shows
that BOP post-wait makes parallel construction of dynamic data structures easy. Not
only can a waiting task receive data allocated before it was started, but it can also receive
data allocated in other tasks after the receiver was started.
4.3 Expressiveness and Efficiency
Post-wait is a form of explicit communication similar to send-receive, with the posting PPR
as the sender and the waiting PPR as the receiver. Compared to existing communication
primitives, BOP post-wait has several distinct features.
Deterministic behavior In the base design, a channel accepts a single post. After a
channel is posted, its status is changed and all subsequent bop_post and bop_fill operations
are ignored. This design ensures deterministic behavior. If we were to allow multiple
posts, it would be uncertain how many of the posts had happened at the time of a bop_wait.
In addition, we make bop_fill a task-local operation, which precludes a channel from being
filled by multiple tasks. The design has two benefits. First, a local fill is efficient since it
incurs no network communication. Second, it again avoids non-determinism. If two PPR
tasks could fill a channel, a single post could not guarantee that both tasks had finished
placing data in the channel. If a task depends on two predecessor tasks, the right solution
is to use two channels, one for each predecessor task, and let the dependent task wait for
both channels, as sketched below.
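For example, a PPR with two predecessors might wait as follows. This is a fragment from inside the dependent PPR task; the channel identifiers chan_from_a and chan_from_b are illustrative.

    /* each predecessor filled and posted its own channel */
    bop_wait( chan_from_a );   /* data posted by predecessor a */
    bop_wait( chan_from_b );   /* data posted by predecessor b */
    /* both predecessors' data are now available; proceed      */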
Uncertainty in data identity and access time If we know the exact dependence
relation between two tasks, that is, we know data d needed by task j is produced by task
i, then post-wait can easily express the dependence. Let's relax the requirement. Suppose
we know the time of access but not the location of data, as in our example where a PPR
task creates a node and passes it to the next task. With BOP post-wait, a waiting task can
retrieve the unknown data using a channel identifier. A related case is when we know the
identity of data d and its last write in i but not its first read in j. The current BOP post-wait
cannot express this dependence but can be extended to express it by allowing waiting on
a data address instead of a channel identifier. The flow keyword in ordered transactions
expresses one type of such dependence, where there is only one write in i [45]. If we relax
the requirement further and suppose that the last modification time is unknown, then to
preclude rollback, we have to put the post operation at the end of task i.
Imperfect post-wait matching Post-wait is a suggestion. Problems such as useless, redundant, or mismatched calls do not break a program. In the example in Figure 1(c),
although the post in the last iteration has no matching wait, it does not affect correct
(parallel) execution. The nature of hints makes their use in coding simpler. Incomplete
knowledge of a program can cause three types of errors in specification: under-specification,
where a dependence happens but a user does not mark it; over-specification, where a user
specifies a dependence that does not happen in execution; and incorrect specification, where
the location of the source or the target of a dependence is specified wrongly or left unspecified. The speculative implementation tolerates these errors. It ignores an error if the error does
not affect parallelism or efficiency; otherwise, it can report the error for the user to fix.
Selective communication Not all communication has to use post-wait. In fact, most
often the bulk of data changes in BOP are communicated implicitly when PPR tasks commit.
The difference between post-wait and PPR commit is similar to the difference between
synchronous and asynchronous communication. Post-wait is used when parallel execution
requires direct synchronization. In the example in Figure 1(c), the next PPR must acquire
the tail of the queue before it can append to it (assuming the queue is implemented as
a singly linked list). There are a few interesting subtleties in the example. If more than
three PPR tasks are run in parallel, the third PPR receives only the new node created by
the second PPR but not the first. Appending to a linked list involves modifying the last
two nodes, but only one node is posted. Clearly post-wait does not communicate all queue
updates to all PPRs. Still, the solution is correct because the other (non-essential) pieces will
be combined at commit time. During parallel execution, each PPR works on one link of the
queue. Afterwards, BOP automatically merges the results and obtains the complete queue,
identical to the result obtained from a sequential execution.
Aggregate communication Post-wait lets a user aggregate communication. In the
example, a PPR must communicate two pieces of information to the next PPR: the new tail
pointer and the new tail node. It adds them in two bop_fill calls and communicates them
in one post. Aggregation may also happen due to page-based data protection. If both data
are on the same page, the second bop_fill call is ignored. On the other hand, false sharing
may cause an unexpected conflict, for example, if the PPR modifies a queue node next to the
tail after the post. Such cases will cause parallel execution to roll back. To enable full
parallelism, a user has to either communicate at the end of the PPR (as in this example) or
allocate dependent data on separate memory pages.
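One standard POSIX way to keep dependent data on its own pages under page-granularity monitoring is page-aligned allocation, sketched below; this is a common idiom, not a BOP primitive, and the function name is illustrative.

    #include <stdlib.h>
    #include <unistd.h>

    /* allocate size bytes so that no other object shares the pages */
    void *alloc_on_own_pages(size_t size)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t rounded = (size + (size_t)page - 1) / (size_t)page * (size_t)page;
        void *p = NULL;
        if (posix_memalign(&p, (size_t)page, rounded) != 0)
            return NULL;            /* allocation failed                */
        return p;                   /* page-aligned, page-sized region  */
    }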
Progress frontier A posted channel may be consumed by any number of wait operations. Logically these wait operations all receive the same data. Communication may not
be needed if the waiting task already has the posted data from earlier commits. If a user
accidentally inserts a post without a matching wait or vice versa, the unmatched operations are
eventually dropped when the progress frontier moves past them. The system stores the identifiers
of all channels that have been posted in case a wait operation is later called on one of those
channels. The storage cost is small since the system stores only channel identifiers, not
channel data.
Extension to k-post Suppose, in a more general case, we know a set of u possible program
points that may be the source of a dependence and a set of v possible program points that
may be the sink of the dependence. We would need to synchronize between the last executed
point among the u possible sources and the first executed point among the v possible sinks. First, if some
of the u possible sources happen in a single PPR, we find the latest point (the one that
post-dominates the others in either program or profiling analysis) and reduce the possible
sources to a set of u_p points in u_p PPRs. Similarly we reduce the set of possible sinks to v_p
points in v_p PPRs.

With the single-post design, we need to insert at each source a post operation with a different
channel and insert at each sink u_p wait operations so it waits for all channels, for a total
of u_p post operations and u_p × v_p wait operations. We can extend the design to give a channel a
parameter num_posts, which means that the channel is posted after it
has received exactly num_posts posts. In this example, we create a single channel with num_posts
equal to u_p, insert a post operation at each source, and a wait operation at each sink. The
total number of post-wait operations is u_p + v_p.
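A sketch of the counting extension follows; the kchannel_t type and the mark_posted helper are hypothetical, and atomicity of the counter update is assumed to be provided by the runtime.

    /* A channel that becomes posted only after num_posts posts (k-post). */
    typedef struct {
        int id;
        int posts_needed;        /* the num_posts parameter           */
        int posts_seen;
    } kchannel_t;

    extern void mark_posted(int channel_id);  /* unblock all waiters  */

    void kchannel_post(kchannel_t *c)
    {
        if (++c->posts_seen == c->posts_needed)
            mark_posted(c->id);  /* the u_p-th post publishes the channel */
    }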
4.4 Implementation of Speculative Post-wait
Incorrect use of BOP post and wait may cause a task to consume a stale value of some data.
BOP detects all stale-value accesses as conflicts. There are three types of conflicts. First,
a sender conflict happens if a PPR instance modifies some data that it has posted earlier.
This conflict causes a transmission of stale data. Second, a receiver conflict happens when
a PPR instance accesses some data and then later receives the data through a bop_wait.
This conflict causes a consumption of stale data. Finally, a pairing conflict happens when
some data is received not from the last modifying PPR instance but from an earlier one. In
the first two types, the error is caused by one of the two ends of the communication. In the
third, the error is caused by the actions of a third party. BOP detects and handles these
conflicts as follows.
Sender conflict The simple solution is that if PPR i incurs a sender conflict, we abort
and re-start all speculation of PPR j for j > i. With sufficient bookkeeping, we can identify
and roll back only those PPR instances that are affected by the offending send. The sender
can remove the data from the earlier posted channels and send the newest value in the
latest post. To avoid recurring conflicts, the BOP runtime can delay any post operation
that has caused a sender conflict.
    task 1: write x; fill(1, x); post(1)
    task 2: wait(1); write x
    task 3: wait(1); read x

Figure 3: Illustration of a pairing conflict when using post-wait. The read in task 3 is
incorrect because the received value is not the newest one in sequential semantics.
Receiver conflict and multiple receives A PPR instance must be aborted if it incurs
a receiver conflict. A related case is when a PPR receives x more than once from different
senders. We rank the recentness of received values by the PPR index of their sender. A
larger index means a later sender and a newer value. There are three cases. First, x is
first accessed before the first receive, which means a receiver conflict. Second, x is first
accessed between two receives. If the value from the first receive is less recent than that
of the second receive, a receiver conflict is triggered. If the value from the first receive is
newer, the second receive can be silently dropped without raising a conflict. In the third
case, x is first accessed after two or more receives. BOP keeps the newest value of x and
continues. We call the last two extensions silent drop and reordered receive.
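The rules above can be summarized in a small decision procedure, sketched in C below. The representation is hypothetical: held_sender is the PPR index of the sender of the value currently held (−1 if none), and new_sender is the sender of the value just received.

    #include <stdbool.h>

    typedef enum { KEEP_NEW, SILENT_DROP, RECEIVER_CONFLICT } recv_action_t;

    /* accessed: has this PPR already read or written x?           */
    recv_action_t on_receive(bool accessed, int held_sender, int new_sender)
    {
        if (accessed) {
            if (held_sender < 0)
                return RECEIVER_CONFLICT; /* x used before any receive */
            if (new_sender > held_sender)
                return RECEIVER_CONFLICT; /* the value used was stale  */
            return SILENT_DROP;           /* arriving value is older   */
        }
        /* x not yet accessed: keep the newest value and continue */
        return (new_sender > held_sender) ? KEEP_NEW : SILENT_DROP;
    }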
Pairing conflict A pairing conflict is caused when some data x is communicated incorrectly. The 3-task example in Figure 3 shows such a conflict. Data x is modified in tasks
1 and 2 and read in task 3. The value of x sent by task 1 is the newest for task 2 but not
so for task 3 since it can get a newer value from task 2. In terms of data dependences,
the dependence from task 2 to 3 is violated. To ensure correctness, we need to combine
dependence and post-wait checking.
A pairing conflict is detected at the time of commit, when all post-wait and data-access
information is known and stored in access maps. An access map marks four types of data
access: r, wr, p(i), wa(i), representing read, written, posted, and received (through a wait).
The last two types include the PPR index of the sender. Let ∆i be the access map of PPR
i. If PPR i posted data x, we add p(i) to ∆i(x) (and remove wr since the write is implied).
If PPR i received data x from PPR s, we add wa(s) to ∆i(x).
When checking ppr(f, i), we compare its accesses with those of each predecessor PPR from f+1
to i−1, as in the algorithm in Section 3.4. To handle post-wait, we augment Step
3 of the algorithm as follows. For ease of checking, it builds a cumulative write map Φ
representing the set of write accesses before PPR i. There are four cases, with different
correctness conditions for i and updates to the cumulative map Φ.
1. x is not accessed in PPR i. If ∆i(x) = φ, the check passes; otherwise, continue to the
next case.
2. x is not modified until PPR i. If Φ(x) = φ, set Φ(x) = ∆i(x) − r (Φ is a write map),
and the check passes; otherwise, continue to one of the next two cases.
3. x is modified before i but not posted. If wr ∈ Φ(x) and ∆i(x) ≠ φ, the check fails.
4. x is modified before i and posted. There are two cases depending on whether i received
x from this sender s. If not, p(s) ∈ Φ(x) but wa(s) ∉ ∆i(x), and the check fails. If
so, p(s) ∈ Φ(x) and wa(s) ∈ ∆i(x), and the check passes, but we need to set Φ(x)
carefully. If x is only read in i, leave Φ(x) unchanged. If x is modified and posted in
i, set Φ(x) = {p(i)}. If x is modified but not posted in i, set Φ(x) = {wr}.
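A sketch of this per-datum check in C follows. The access_t representation (bit flags plus sender indices) is hypothetical, and the iteration over all data in the access maps is omitted.

    #include <stdbool.h>

    enum { R = 1, WR = 2 };              /* read and write bits             */

    typedef struct {
        unsigned flags;                  /* combination of R and WR         */
        int post_by;                     /* p(i): poster index, -1 if none  */
        int recv_from;                   /* wa(s): sender index, -1 if none */
    } access_t;

    /* Check one datum x when committing PPR i. delta is ∆i(x); phi is
     * the cumulative write map Φ(x). Returns false on a conflict. */
    bool check_datum(const access_t *delta, access_t *phi, int i)
    {
        bool delta_empty = delta->flags == 0 && delta->post_by < 0
                           && delta->recv_from < 0;
        bool phi_empty   = phi->flags == 0 && phi->post_by < 0;

        if (delta_empty)                  /* case 1: x untouched in i   */
            return true;
        if (phi_empty) {                  /* case 2: first writer       */
            *phi = *delta;
            phi->flags &= ~R;             /* Φ is a write map           */
            return true;
        }
        if (phi->post_by < 0)             /* case 3: written, unposted  */
            return false;
        /* case 4: posted by sender s = phi->post_by                    */
        if (delta->recv_from != phi->post_by)
            return false;                 /* received from wrong PPR    */
        if (delta->post_by == i) {        /* modified and posted in i   */
            phi->flags = 0;
            phi->post_by = i;             /* Φ(x) = {p(i)}              */
        } else if (delta->flags & WR) {   /* modified, not posted       */
            phi->flags = WR;
            phi->post_by = -1;            /* Φ(x) = {wr}                */
        }                                 /* read-only: Φ(x) unchanged  */
        return true;
    }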
As an example, consider the steps of checking task 3 in the earlier example (Figure 3).
At the end of tasks 1 and 2, Cases 2 and 4 are invoked respectively, and the cumulative map
Φ(x) becomes first {p(1)} and then {wr}. Finally, the check at task 3 invokes Case 3 and detects
a conflict. This solution differs from the previous solution, which was based on the concept
of communication paths [52].
Inter-PPR post-wait handling The skeleton execution described in Section 3.2 permits more precise post-wait checking than the previous solution for shared-memory speculation [52] in that it can ignore post-wait constructs in inter-PPR code. A post in gap i is
unnecessary since its data changes are visible to all subsequent execution. A wait in gap
i is either unnecessary, if the matching post comes from another gap, or implies a dependence violation, if the matching post comes from a PPR. Therefore, bop_fill and bop_wait
are ignored, and bop_post sends its channel without any content.
Outdated post As mentioned earlier, data updates in BOP are propagated in two ways,
post-wait and PPR commit. If a post-wait pair spans many PPR instances, it is possible
that the receiver is started after the commit of the sender, so the receiver already possesses
the sent data. The frontier parameter can be used to check this case precisely. If ppr(f, j)
receives a post from PPR i that happened before f (i ≤ f), bop_wait succeeds, but it
nullifies the content of the channel and accepts no new data. We call this special wait a
nullified receive.
Progress guarantee Usually blocking communication implies the risk of entangling a
set of tasks into a deadlock. BOP post-wait is deadlock free because communication flows
in one direction in the increasing order of PPR index. A wait without a matching, earlier
post would stall the receiving task. However, it does not stall the advance of the progress
frontier as explained in Section 3.6. When the progress frontier moves past, the stalling
task is canceled.
5 Computation Reordering and Run-time Task Coarsening
At run-time, BOP may group a series of PPR instances to create a large parallel task. We
call the transformation run-time task coarsening and the aggregate task a BOP task. By
choosing different group sizes, we can create parallel tasks of a desirable length, which is
especially useful in distributive parallelization.
(a) An example loop and the parallel execution of its first 4 PPRs in 2 tasks. Parallelism is limited by post-wait:

    for i in 1 ... n
      bop_parallel {
        // call to a large function
        t = data[ i ].analyze
        if i > 1 bop_wait( i - 1 )
        histogram.add( t )
        bop_fill( ... )
        bop_post( i )
      }
    end for

    bop task 1:                       bop task 2:
      // ppr 1                          // ppr 3
      data[ 1 ].analyze                 data[ 3 ].analyze
      histogram.add                     bop_wait( 2 )
      bop_post( 1 )                     histogram.add
      // ppr 2                          bop_post( 3 )
      data[ 2 ].analyze                 // ppr 4
      bop_wait( 1 )                     data[ 4 ].analyze
      histogram.add                     bop_wait( 3 )
      bop_post( 2 )                     histogram.add
                                        bop_post( 4 )

(b) The same loop with a deferrable block and the parallel execution of its first 4 PPRs in 2 tasks. Delayed post-wait exposes more parallelism:

    for i in 1 ... n
      bop_parallel {
        t = data[ i ].analyze
        bop_deferrable ( t ) {
          if i > 1 bop_wait( i - 1 )
          histogram.add( t )
          bop_fill( ... )
          bop_post( i )
        }
      }
    end for

    bop task 1:                       bop task 2:
      // ppr 1                          // ppr 3
      data[ 1 ].analyze                 data[ 3 ].analyze
      // ppr 2                          // ppr 4
      data[ 2 ].analyze                 data[ 4 ].analyze
      // delayed blocks                 // delayed blocks
      histogram.add                     bop_wait( 2 )
      bop_post( 1 )                     histogram.add
      bop_wait( 1 )                     bop_post( 3 )
      histogram.add                     bop_wait( 3 )
      bop_post( 2 )                     histogram.add
                                        bop_post( 4 )

Figure 4: An example showing the use of a deferrable block in exposing parallelism between
“coarsened” BOP tasks that contain multiple PPR instances.
Computation reordering is necessary to enable this type of automatic task coarsening.
Consider PPR blocking for the example loop in Figure 4(a). Each iteration produces a result
and adds it to a histogram. The suggested parallelization is to run each iteration in parallel
but serialize its update to the histogram using post-wait. An example PPR grouping is
shown below the loop. There are two BOP tasks, each executing 2 PPRs. Because of post-wait,
PPR 3 must wait for PPR 2, and BOP task 2 must stop after its first PPR to wait for
task 1 to finish. Post-wait is necessary to guarantee sequential semantics, but it causes
PPR grouping to lose most of the parallelism. The purpose of computation reordering is to
regain parallelism while preserving sequential semantics.
The computation-reordering hint, bop_deferrable, suggests that a block of code can be
executed later. Figure 4(b) shows its use in our example code, suggesting that the histogram
update can be delayed. The two BOP tasks, shown in the figure below the code, can now
run in parallel. The histogram updates are still done sequentially, but they are delayed to
the end of the BOP tasks and do not stop task 2 in the middle of its execution. With delayed
updates, PPR blocking can increase the amount of synchronization-free, parallel work by
the group size. By selecting the group size at run time, BOP can control the granularity
of parallelism and de-couple program specification from machine implementation: a user
focuses on identifying parallelism, and BOP tailors the parallelism for a machine.
As a form of dependence hint, a deferrable block has a meaning identical to that of a PPR
block. Both mean an absence of dependence between the code block and its succeeding
computation. The uses of the two types of blocks are different enough that we give them
different names in BOP. The implementation issues, however, are similar in many respects,
especially in correctness checking. The main difference is the interaction with the succeeding
block. A PPR block is in a separate process from its succeeding block, but a deferrable block
runs in the same process as its succeeding block. Instead of creating a process, a deferrable
block creates a closure to capture the code and all its variable bindings. The closure is invoked
as late as possible in a BOP task.
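One plausible way to realize such a closure in C is a queued function pointer with a captured environment; the deferred_t type and the bop_defer/run_deferred functions below are illustrative, not the actual BOP interface.

    #include <stdlib.h>

    typedef struct deferred {
        void (*fn)(void *env);           /* the deferred code block        */
        void *env;                       /* captured variable bindings     */
        struct deferred *next;
    } deferred_t;

    static deferred_t *deferred_head;
    static deferred_t **deferred_tail = &deferred_head;

    /* record a closure; called where the bop_deferrable block appears */
    void bop_defer(void (*fn)(void *), void *env)
    {
        deferred_t *d = malloc(sizeof *d);
        d->fn = fn;
        d->env = env;
        d->next = NULL;
        *deferred_tail = d;
        deferred_tail = &d->next;
    }

    /* run the queued closures in program order at the end of a BOP task */
    void run_deferred(void)
    {
        for (deferred_t *d = deferred_head; d != NULL; d = d->next)
            d->fn(d->env);
    }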
Most parallel languages provide an atomic-section construct for use by a parallel loop.
The loop may then be partitioned and executed in blocks of iterations. However, atomic
sections guarantee neither correctness (i.e., the same result as the sequential execution) nor
deterministic output. The computation-reordering hint in BOP aims to provide the same
flexibility while guaranteeing sequential semantics.
6 Language Extensions
BOP hints are basic building blocks that can be combined to build other constructs. As
a result, we can add new suggestions without changing the language implementation. We
demonstrate the potential for language extensions by building two loop-based suggestions
and one data-based suggestion.
When parallelizing a loop, a direct way is to mark a parallel loop. Consider the example
in Figure 5(a). Each iteration has two steps: computing a value and appending it to an
object. Assume that the first step is parallel and the second step has to be serial. To
parallelize the loop, we mark the loop parallel and place the second step in a serial region,
as in Figure 5(b). A second construct is a shared variable declaration as used by Jade [37],
where a compiler is used to identify and manage shared data accesses, as in Figure 5(c). A third
loop-based construct is the pipelined loop designed by Thies et al. [42]. To parallelize the
example, a user adds two pipeline stages and marks the first stage parallel (by specifying
the number of processors p), as in Figure 5(d).
Serial region One can declare a parallel loop and mark dependent operations in a serial
region, as in Figure 5(e). The term “serial region” was used by an early version of the
OpenMP language to mean code that should be executed by one thread. We generalize the
concept by adding a label to each region. Code of the same named region must be serialized
in the sequential order, while code from different named regions may execute in parallel.
The generalized serial region can be implemented by BOP post-wait. Figure 5(f) shows the
code translation for a generic serial region r1. The scheme is akin to token passing, where
a “token” is a channel number indexed by the identity of the host task and is passed by
post-wait. To determine data communication, we may send all data changed in the region
or find ways to identify and send only the dependent data. Since the BOP serial region is a hint,
it is possible that in an execution an iteration executes a serial region multiple times or not
at all.
(a) A loop whose body consists of a parallel step and a sequential step:

    for i in 1 ... n
      r = compute( i )
      s.append( r )
    end for

(b) Parallelization of (a) using a parallel loop and a serial region:

    parallel for i in 1 ... n
      r = compute( i )
      serial r1 {
        s.append( r )
      }
    end for

(c) Parallelization using a parallel loop and a shared variable declaration:

    shared s
    parallel for i in 1 ... n
      r = compute( i )
      s.append( r )
    end for

(d) Parallelization using a pipelined loop:

    for i in 1 ... n
      begin_pipelined_loop
        pipeline( p )
        r = compute( i )
        pipeline
        s.append( r )
      end_pipelined_loop
    end for

(e) General form of a serial region:

    parallel for i in 1 ... n
      // parallel work
      serial r1 {
        // sequential work
      }
      // parallel work
    end for

(f) General implementation of a serial region using BOP:

    bop_post( r1_ids[0] )
    for i in 1 ... n
      bop_parallel {
        // parallel work
        bop_wait( r1_ids[my_ppr - 1] )
        // sequential work
        bop_post( r1_ids[my_ppr] )
        // parallel work
      }
    end for

(g) General form of a pipelined loop (pipeline(p) means that the next stage is parallel):

    for i in 1 ... n
      begin_pipelined_loop
        // stage 1
        pipeline( p )
        // stage 2
        pipeline
        // stage 3
      end_pipelined_loop
    end for

(h) Implementation of (g) by BOP:

    for i in 1 ... n
      bop_parallel {
        bop_wait( <my_ppr-1, s1> )
        // stage 1
        bop_post( <my_ppr, s1> )
        bop_wait( <my_ppr-1, s1> )
        // stage 2
        bop_post( <my_ppr, s2> )
        bop_wait( <my_ppr-1, s3> )
        // stage 3
        bop_post( <my_ppr, s3> )
      }
    end for

Figure 5: Using BOP as the base language to implement composite constructs similar to
the serial region and the pipelined loop
Shared variable A shared variable declaration enables a compiler to identify shared data
accesses [37]. Such a compiler can generate parallel code using BOP. For each PPR, the
compiler identifies the earliest and latest accesses of shared data and places the wait and
post operations as in the implementation of the serial region. If a PPR has no shared data
accesses, it is equivalent to having the first and last accesses at the start of the PPR. If the
accesses in a PPR are not completely known, the conservative solution is to assume that the
first access is at the beginning and the last access is at the end. To improve parallelism, the
compiler may use a different serial region for each shared variable. Since correctness is
guaranteed by the BOP run time, a compiler can generate more aggressive, albeit unsafe, code.
Pipelined loop With a pipelined loop, most of the work done by a user is to divide a
loop body into pipeline stages [42]. By default a stage is sequential in that its execution in
different iterations does not overlap. If the work can run in parallel, a user marks the stage
parallel by specifying the number of processors p. Figure 5(g) shows a 3-stage example,
where stages 1 and 3 are sequential but stage 2 is parallel. Although simple, the pipelined loop
clearly shows the composition of cross-stage and intra-stage parallelism. Its implementation
is also simple and effective, with one process running a sequential stage and p processes
running a parallel stage as specified. All processes are fully active in the steady state.
The same interface can be implemented by BOP hints using the following code transformation. First, we turn a pipelined loop body into a PPR. Then we create a series of channel numbers indexed by the PPR index and the stage index. For each sequential stage, we insert at the front a wait for the same stage from the preceding PPR and at the back a post for itself (this stage and this PPR). The conversion of a parallel stage is the same except that the wait is for the preceding stage in the preceding PPR. Figure 5(h) shows this transformation. The BOP code exploits the same staged parallelism, but the implementation is very different. One process is used per iteration, not per stage. To fully utilize p processors, BOP may need to execute more than p processes in parallel and rely on the multi-tasking support of the operating system.
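For concreteness, the channel pairs <ppr, stage> in Figure 5(h) could be flattened into single integers; the encoding below is our own illustration, not part of the BOP interface:

// hypothetical encoding of a (PPR, stage) pair into one channel number,
// for a pipelined loop with k stages numbered 1 to k
chan( ppr, stage ) = ppr * k + stage

// stage 1 (sequential): wait for the same stage in the preceding PPR
bop_wait( chan( my_ppr - 1, 1 ) )
// stage 1 work
bop_post( chan( my_ppr, 1 ) )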
Potential benefits As the three examples show, BOP can serve as a basis for building other parallel constructs. Compared to the original constructs, BOP-based emulators benefit from the uniformity and correctness guarantees of BOP. Take the pipelined loop as an example. The original construct does not guarantee the same result in parallel executions, but the BOP version adds this guarantee. The original primitive cannot be used within a nested loop or a called function. A target loop cannot have break and continue statements (they must first be converted to if statements). The BOP version removes these restrictions. The original
implementation copies all data changes synchronously at the end of a pipelined loop. BOP
copies data changes incrementally and asynchronously in parallel with loop or post-loop
execution. In the original design, a user must remove all conflicts. With BOP, conflicts do not affect correctness, and occasional unknown conflicts may not affect parallel performance if most of the execution is conflict-free.
Another benefit is clarity: the meaning of an extension is completely defined by its construction. This removes ambiguity in semantics, especially when a construct is used amid complex control flow. Take the example of the serial region. The implementation in Figure 5(f) assumes that a serial region is executed once and only once in each PPR. The
assumption holds for the implementation of the pipelined loop in Figure 5(h). However, in the general case, one cannot guarantee how often a serial region is executed or whether it is executed at all. We call this the unknown-occurrence problem. It includes the multi-occurrence problem, which may cause out-of-order region execution, and the missing-occurrence problem, which may cause indefinite waiting. To solve the multi-occurrence problem, we move the post operation to the end of the PPR. To solve the missing-occurrence problem, we add a post operation at the end of the PPR. In all these cases, the behavior and properties of a serial region are defined by how we choose to implement it with BOP primitives.
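The following sketch shows one possible hardened translation; bop_wait_once is a hypothetical variant of bop_wait that blocks only on its first execution within a PPR:

for i in 1 ... n
  bop_parallel {
    // region r1 may execute zero, one, or many times in this PPR;
    // at each occurrence:
    bop_wait_once( r1_ids[my_ppr - 1] )  // hypothetical: later occurrences skip the wait
    // sequential work
    // ... remainder of the PPR ...
    bop_post( r1_ids[my_ppr] )  // post moved to (or added at) the end of the PPR
  }
end for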
There are implementation limitations to BOP-based extensions. BOP may not be the most efficient solution, for example, for a pipelined loop. However, the previous solutions were not designed for software speculative execution. A BOP implementation may still be the right choice in a speculative context. In addition, by building different constructs from a single basis, a user can combine and mix them, for example post-wait and pipelined loop, when parallelizing a single program.
7 A Demonstration
We demonstrate a use of BOP in parallelizing iterative computations. An iterative algorithm is commonly used in solving equations that have no direct solution. Examples can be found in many types of numerical analysis and in general computational problems that require fixed-point and equilibrium solutions. Figure 6 shows a typical program structure with a 2-nested loop. The inner loop computes on a data array, and the outer loop repeats this computation until the result is good enough. Informally the loop can be said to have two dimensions, with the inner loop being the space dimension and the outer loop being the time dimension. The inner loop is usually parallel, but the outer loop is not, because the same data is used and updated in every time step. In addition, the convergence check, in line C in Figure 6, must be done after one time step and before the next. Without speculation support, a parallel version must use a barrier to ensure that the next iteration waits for the current iteration to finish completely. With BOP, however, the next iteration can start before the current one finishes. At the end, when the converged variable is set, the write access causes a dependence conflict; BOP detects the conflict and aborts the speculation beyond the convergence point. Before the convergence, different time iterations may overlap. The transformation is known as time skewing [39, 48].

while ( not converged ) {
  for i is 1 to N {
    r = compute( data[ i ] )
    s = add_result( r )
  } // end of i-loop
  if good_enough?( s )
C:  converged = true
}

(sequential version)

while ( not converged ) {
  for i is 1 to N {
    bop_parallel {
      r = compute( data[ i ] )
      bop_deferrable {
        if ( i > 1 )
          bop_wait( my_ppr - 1 );
        s = add_result( r )
        bop_fill( my_ppr, cur )
B:      bop_post( my_ppr )
      } // end bop_deferrable
    } // end bop_parallel
  } // end of i-loop
  if good_enough?( s )
C:  converged = true
}

(parallelization with BOP)

Figure 6: Example iterative computation and its parallelization by BOP. There is no barrier at the end of the while-loop, which permits speculative parallelism.
The parallelism from time skewing is data dependent. Time skewing is likely to succeed
if the beginning part of the next iteration accesses different data from the ending part of
the current iteration. The parallelism may be highly dynamic, and BOP execution may
have frequent rollback and restart. The parallelism does not extend beyond two outermost
loop iterations since consecutive iterations as a whole always have at least one conflict.
However, the program does not have to wait for every task to finish after one iteration
and before starting another. The removal of the barrier synchronization makes it possible for the waiting by one task to be overlapped with the execution of subsequent tasks. This is useful in hiding the overhead of serial correctness checking in BOP. It is also
useful in tolerating imbalances in task sizes and variations due to the processor or system
environment.
QT clustering An example of iterative computation is quality threshold (QT) clustering.
The algorithm was invented in 1999 for gene clustering [23]. In each iteration, the algorithm finds the cluster around every node, which contains all nodes within a given distance threshold. It
then picks the largest cluster. The next iteration repeats the process for the remaining nodes
until all nodes are clustered. QT clustering is computationally intensive, but it addresses
two shortcomings of traditional k-means clustering. To use QT clustering, one does not
have to know the number of clusters beforehand. The result is deterministic and does not
depend on initial choices of cluster centroids.
To parallelize QT clustering, we need speculative post-wait to communicate dynamic
data. In each time step, all clusters are compared to find the largest cluster. This is done by
a post-wait region that passes the cluster object global_largest from task to task in sequential order. Since the size of a cluster is not known in advance, we allocate an array of size n, the total number of nodes. While the code posts the entire array, the BOP run-time system posts only the modified pages, so it communicates only a part of the array, with a size proportional to the size of the cluster.
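A sketch of this post-wait region follows; block_largest (the local best cluster) and copy_cluster are hypothetical names reconstructed from the discussion:

// tasks execute in sequential order; each compares its local best cluster
// against the running global best and passes global_largest to the next task
bop_wait( my_ppr - 1 )
if ( block_largest.size > global_largest.size )
  copy_cluster( &global_largest, &block_largest )
bop_post( my_ppr )  // the run-time sends only the pages modified by the copy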
We coded the basic algorithm, where stack variables were private and heap and global data were shared (unless declared bop_private). Since we have not implemented the deferrable block, we manually strip-mined the clustering loop. We inserted three lines of hints: one for parallelism and two for a post-wait pair. To enable time skewing, we structured the code carefully so as to avoid unnecessary dependence conflicts. We found that the difficulty of time skewing is mostly algorithmic. In implementation, the base code is sequential, and BOP preserves its semantics. BOP parallelization did not go much beyond sequential programming and sequential algorithm adaptation. We could use existing tools such as compilers, debuggers, and profilers.
We tested our prototype implementation using two machines. One had 8 Intel Nehalem cores, and the other had 4 Intel Kentsfield cores, all running at 2.66 GHz. We used a large input, so the sequential running time, with no BOP involvement, was 70 seconds. With BOP, the program finished in 7 seconds, a speedup of 10. This showed clearly the benefit of using two machines, since the single-machine improvement would be bounded by 8.
8 Related Work
Distributive parallelization Parallelism may be inferred by a parallelizing compiler.
Powerful compiler techniques such as dependence, array-section, alias, and inter-procedural analysis can extract coarse-grained parallelism in scientific code [1, 2, 8, 18, 19, 21, 47]. Automatic parallelization can be improved by language and tool support. High Performance Fortran (HPF) lets a user suggest data distribution in Fortran code [1, 2]. Cleanly designed
languages can better express and implement parallelism through high-level primitives and
advanced techniques such as regions and lazy evaluation as in ZPL [9] and SoC [38]. A
user can help compiler parallelization using an interactive tool [29]. Like HPF, BOP provides hints for use in a traditional language. While HPF hints rely on the compiler for parallelization, BOP hints express parallelism directly.
A recent approach to distributed parallelization is to automatically translate a Fortran
program from using OpenMP to using MPI [4–6]. Parallelization and communication are
inserted and optimized by the compiler, which uses static analysis for regular loops and
inserts run-time inspectors in irregular loops. With distributed OpenMP, a user can program
distributed-memory machines using a shared-memory parallel language. BOP strives for the
same goal using a suggestion language in which dependences are specified using post-wait
rather than automatically inferred. Post-wait can be used by a compiler or a profiling-based
analyzer, so BOP does not have to rely on complete compiler support for correctness. As a
result, BOP can handle general C/C++ code.
Most automatic techniques target regular computations on arrays and cannot directly
handle irregular computations on dynamic data. One exception is Jade, which uses a
combination of user annotations and type checking to infer program data accesses and
enable dynamic parallelization [37]. Data specification in Jade and dependence hints in
BOP represent very different design choices. Data access is the cause of dependences, and
data specification in Jade can be used by a run-time system to identify all dependences. BOP
allows partial dependence specification: only the frequent and immediate dependences necessary to enable parallelization need be specified. The BOP run-time detects infrequent
dependences and satisfies long-range dependences automatically without any specification.
Jade aims for complete safety and automatically optimized parallelization. BOP tolerates
incomplete program knowledge and supports more direct control in parallelization.
Parallel programming languages Parallel languages are based on either single-program
multiple data (SPMD) [15] or fork-join parallelism. The SPMD model is used in distributed-memory programming, as by MPI, PVM, software distributed shared memory (S-DSM) systems such as TreadMarks [3] and InterWeave [41], and PGAS languages such as UPC [44] and Co-Array Fortran [32]. Data by default is not shared. Shared data can be declared as shared pages in S-DSM and as global arrays in PGAS languages and X10 [10]. SPMD programming requires manual partitioning of computation and placement of data. Although such requirements may be necessary for scalable parallel computing, they make SPMD programming considerably more difficult than sequential programming.
Fork-join is used in shared-memory programming as by OpenMP, Cilk, Java, transactional memory, and many other languages in which parallel programming is done by annotating a sequential program. These languages specify parallel tasks using parallel loops,
regions, or future-style primitives such as spawn in Cilk [17] and async in X10 [10]. Synchronization is done using concurrency constructs such as atomic sections and transactions.
Concurrency constructs are more flexible because they do not restrict a program to sequential semantics. However, the flexibility is also a burden since a programmer must ensure
correctness under all possible program interleavings. They cannot be used by profiling-based
automatic parallelization. For safety, they require a user to have complete knowledge of all program dependences that may impair parallel execution. BOP overcomes these shortcomings with the dependence-based construct of post-wait and with run-time speculation support. It ensures sequential semantics at the cost of requiring monitoring and in-order
commit. It tolerates these costs by using an asynchronous design and by enabling more
aggressive forms of parallelism.
Software speculative parallelization Loop-level software speculation was pioneered by
the LRPD test for Fortran loops [36]. Many improvements have been developed including
array replication and stronger progress guarantees [14, 20]. The support has been extended
to C programs using annotations [11], compiler support [43, 50], and techniques developed
for transactional memory [31, 33, 40]. Speculation support has been developed for Java to
support safe futures [46], return-value speculation [35], and speculative data abstractions [28].
Our implementation uses ideas from [16, 24, 52], which are based on processes instead
of threads. Process-based systems use the virtual memory protection mechanism, which
transparently provides strong isolation, on-demand data replication, and complete abortability. A recent system, Grace, removes concurrency problems such as race conditions and
deadlocks by implementing threads using processes and employing similar monitoring and
protection mechanisms [7].
Previous speculation systems have a limited interface for expressing dependences. They
run on shared-memory systems. We aim to develop a more general programming interface.
For example, in a software implementation, page-level protection raises the problem of false sharing. Dependence hints such as post-wait can be used to tolerate false sharing by serializing access to falsely shared pages. This is especially useful for dense array data. This
report describes support for distributed speculation, task coarsening, and various language
extensions such as serial region and pipelined loop.
Dependence hints Post and wait were coined by Cytron for doacross loop parallelism [13]. They are paired by an identifier, which is usually a data address. BOP post-wait
specifies data for communication. By integrating it with speculation support, we have made
it a hint rather than a directive and used it to construct other speculative parallel constructs
such as sequential regions and pipelined loops. In the context of speculative parallelization,
a signal-wait pair can be used for inter-thread synchronization [49]. Signal-wait takes no parameters; a compiler is used to determine the data transfer and must do so
correctly. In an ordered transaction, the keyword flow specifies that a variable read should
wait until either a new value is produced by the previous transaction or all previous transactions finish [45]. Flow is equivalent to wait with an implicit post (for the first write in the
previous task). Ordered transactions are implemented using hardware speculation support. It is unclear how they cope with errors such as a read before flow. They cannot easily be used to
communicate dynamically allocated data as in our example in Section 2.
9 Summary
We have presented the design of the BOP language consisting of four types of hints: parallelism, explicit dependence, computation reordering, and correctness checking. It provides
a sequential programming model and a hybrid solution to data sharing, where a user marks likely parallel tasks, identifies their immediate data sharing using dependence hints, and leaves the remaining data sharing unspecified. The BOP system implements distributed communication with a mix of user-directed synchronous updates and automatic asynchronous updates.
We have demonstrated the expressiveness of the suggestion language in two ways. First, we show that the hints can be used to implement other language constructs such as serial regions, shared variables, and pipelined loops. Second, we show that dependence hints enable time skewing, which removes the need for barrier synchronization in many types of iterative solvers.
Until recently a user faced with the task of extracting coarse-grained parallelism in
complex code had to choose either a compiler to automate the task or a language extension
to parallelize a program completely by hand. BOP provides a middle ground where the user
manually specifies high-level parallelism and known dependences, and the run-time system
automates distributed parallelization and communication.
Acknowledgments
We wish to thank Chao Zhang and Xiaoming Gu for their help with the implementation
and testing, and Michael Scott, Xipeng Shen, and Kai Shen for the helpful discussions and
for pointing out related work and possible applications.
References
[1] V. S. Adve and J. M. Mellor-Crummey. Using integer sets for data-parallel program
analysis and optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 186–198, 1998.
[2] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A
Dependence-based Approach. Morgan Kaufmann Publishers, Oct. 2001.
[3] C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, and
W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, 1996.
[4] A. Basumallik and R. Eigenmann. Towards automatic translation of OpenMP to MPI.
In International Conference on Supercomputing, pages 189–198, 2005.
[5] A. Basumallik and R. Eigenmann. Optimizing irregular shared-memory applications
for distributed-memory systems. In Proceedings of the ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, pages 119–128, 2006.
[6] A. Basumallik, S.-J. Min, and R. Eigenmann. Programming distributed memory systems
using OpenMP. In Proceedings of the International Parallel and Distributed Processing
Symposium, pages 1–8, 2007.
[7] E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: Safe multithreaded programming
for C/C++. In Proceedings of OOPSLA, 2009.
[8] W. Blume et al. Parallel programming with Polaris. IEEE Computer, 29(12):77–81,
1996.
[9] B. Chamberlain, S.-E. Choi, C. Lewis, C. Lin, L. Snyder, and W. Weathersby. ZPL:
a machine independent programming language for parallel computers. IEEE Transactions on Software Engineering, 26(3):197–211, Mar 2000.
[10] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von
Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of OOPSLA, pages 519–538, 2005.
[11] M. H. Cintra and D. R. Llanos. Design space exploration of a software speculative parallelization scheme. IEEE Transactions on Parallel and Distributed Systems, 16(6):562–
576, 2005.
[12] A. L. Cox and R. J. Fowler. Adaptive cache coherency for detecting migratory shared
data. In Proceedings of the International Symposium on Computer Architecture, pages
98–108, 1993.
[13] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the
1986 International Conference on Parallel Processing, pages 836–844, St. Charles, IL,
Aug. 1986.
[14] F. Dang, H. Yu, and L. Rauchwerger. The R-LRPD test: Speculative parallelization
of partially parallel loops. In Proceedings of the International Parallel and Distributed
Processing Symposium, pages 20–29, Ft. Lauderdale, FL, Apr. 2002.
[15] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing, 7(1):11–24, 1988.
[16] C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. Software behavior
oriented parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 223–234, 2007.
[17] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 212–223, 1998.
[18] J. Gu and Z. Li. Efficient interprocedural array data-flow analysis for automatic program parallelization. IEEE Transactions on Software Engineering, 26(3):244–261, 2000.
[19] M. Gupta and P. Banerjee. PARADIGM: A compiler for automatic data distribution on
multicomputers. In International Conference on Supercomputing, pages 87–96, 1993.
[20] M. Gupta and R. Nim. Techniques for run-time parallelization of loops. In Proceedings
of SC’98, page 12, November 1998.
[21] M. W. Hall, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, and M. S. Lam. Interprocedural parallelization analysis in SUIF. ACM Transactions on Programming Languages
and Systems, 27(4):662–731, 2005.
[22] R. H. Halstead. Multilisp: A language for concurrent symbolic computation. ACM
Transactions on Programming Languages and Systems, 7(4):501–538, Oct. 1985.
[23] L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9(11):1106–1115, 1999.
[24] Y. Jiang and X. Shen. Adaptive software speculation for enhancing the cost-efficiency
of behavior-oriented parallelization. In Proceedings of the International Conference on
Parallel Processing, pages 270–278, 2008.
[25] K. Kelsey, T. Bai, and C. Ding. Fast track: a software system for speculative optimization. In Proceedings of the International Symposium on Code Generation and
Optimization, pages 157–168, 2009.
[26] M. Kulkarni, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. P. Chew.
Optimistic parallelism benefits from data partitioning. In Proceedings of the International Conference on Architectual Support for Programming Languages and Operating
Systems, pages 233–243, 2008.
[27] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew.
Optimistic parallelism requires abstractions. In Proceedings of the ACM SIGPLAN
Conference on Programming Language Design and Implementation, pages 211–222,
2007.
[28] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew.
Optimistic parallelism requires abstractions. Communications of the ACM, 52(9):89–97,
2009.
[29] S.-W. Liao, A. Diwan, R. P. Bosch Jr., A. M. Ghuloum, and M. S. Lam. SUIF Explorer:
An interactive and interprocedural parallelizer. In Proceedings of the ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, pages 37–48, 1999.
[30] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration
of UNIX processes in the Condor distributed processing system. Technical report, U.
Wisconsin-Madison, 1997.
[31] M. Mehrara, J. Hao, P.-C. Hsu, and S. A. Mahlke. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and
Implementation, pages 166–176, 2009.
[32] R. Numrich and J. Reid. Co-Array Fortran for parallel programming. ACM Fortran
Forum, 17(2):1–32, 1998.
[33] C. E. Oancea, A. Mycroft, and T. Harris. A lightweight in-place implementation for
software thread-level speculation. In Proceedings of the ACM Symposium on Parallel
Algorithms and Architectures, pages 223–232, 2009.
[34] J.-K. Peir and R. Cytron. Minimum distance: A method for partitioning recurrences
for multiprocessors. IEEE Transactions on Computers, 38(8):1203–1211, 1989.
[35] C. J. F. Pickett and C. Verbrugge. Software thread level speculation for the Java language and virtual machine environment. In Proceedings of the Workshop on Languages
and Compilers for Parallel Computing, pages 304–318, 2005.
[36] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of
loops with privatization and reduction parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla,
CA, June 1995.
[37] M. C. Rinard and M. S. Lam. The design, implementation, and evaluation of Jade.
ACM Transactions on Programming Languages and Systems, 20(3):483–545, 1998.
[38] S.-B. Scholz. Single assignment c: efficient support for high-level array operations in a
functional setting. Journal of Functional Programming, 13(6):1005–1059, 2003.
[39] Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and
Implementation, pages 215–228, Atlanta, Georgia, May 1999.
[40] M. F. Spear, K. Kelsey, T. Bai, L. Dalessandro, M. L. Scott, C. Ding, and P. Wu.
Fastpath speculative parallelization. In Proceedings of the Workshop on Languages
and Compilers for Parallel Computing, 2009.
[41] C. Tang, D. Chen, S. Dwarkadas, and M. L. Scott. Integrating remote invocation and
distributed shared state. In Proceedings of the International Parallel and Distributed
Processing Symposium, 2004.
[42] W. Thies, V. Chandrasekhar, and S. P. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In Proceedings of the ACM/IEEE
International Symposium on Microarchitecture, pages 356–369, 2007.
[43] C. Tian, M. Feng, V. Nagarajan, and R. Gupta. Copy or discard execution model for
speculative parallelization on multicores. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture, pages 330–341, 2008.
[44] UPC consortium. UPC language specification v1.2. Technical Report LBNL-59208,
Lawrence Berkeley National Lab, 2005.
[45] C. von Praun, L. Ceze, and C. Cascaval. Implicit parallelism with ordered transactions.
In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Mar. 2007.
[46] A. Welc, S. Jagannathan, and A. L. Hosking. Safe futures for Java. In Proceedings of
OOPSLA, pages 439–453, 2005.
[47] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley,
Redwood City, CA, 1996.
[48] D. Wonnacott. Achieving scalable locality with time skewing. International Journal
of Parallel Programming, 30(3), June 2002.
[49] A. Zhai, J. G. Steffan, C. B. Colohan, and T. C. Mowry. Compiler and hardware
support for reducing the synchronization of speculative threads. ACM Transactions
on Architecture and Code Optimization, 5(1):1–33, 2008.
[50] A. Zhai, S. Wang, P.-C. Yew, and G. He. Compiler optimizations for parallelizing
general-purpose applications under thread-level speculation. pages 271–272, New York,
NY, USA, 2008. ACM.
[51] C. Zhang, C. Ding, X. Gu, K. Kelsey, and T. Bai. Continuous speculative program
parallelization in software. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.
[52] C. Zhang, C. Ding, K. Kelsey, T. Bai, X. Gu, and X. Feng. A language of suggestions
for program parallelization. Technical Report URCS #948, Department of Computer
Science, University of Rochester, 2009.
[53] E. Z. Zhang, Y. Jiang, and X. Shen. Does cache sharing on modern CMP matter to the
performance of contemporary multithreaded programs? In Proceedings of the ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.