Distributive Program Parallelization Using a Suggestion Language

Bryan Jacobs, Tongxin Bai, and Chen Ding
{jacobs,kelsey,cding}@cs.rochester.edu
The University of Rochester
Computer Science Department
Rochester, NY 14627

Technical Report TR-952
December 2009

Abstract

Most computing users today have access to clusters of multi-core computers. To fully utilize a cluster, a programmer must combine two levels of parallelism: shared-memory parallelism within a machine and distributed-memory parallelism across machines. Such programming is difficult. Either a user has to mix two programming languages in a single program and use fixed computation and data partitioning between the two, or the user has to rewrite a program from scratch. Even after careful programming, a program may still have hidden concurrency bugs. Users who are accustomed to sequential programming do not find the same level of debugging and performance-analysis support, especially in a distributed environment. This paper presents a suggestion-based language that enables a user to parallelize a sequential program for distributed execution by inserting hints. The hints are safe against any type of misuse and expressive enough to specify independent, pipelined, and speculative parallel execution on a cluster of multi-core computers.

The research is supported by the National Science Foundation (Contract No. CNS-0720796, CNS-0834566), an IBM CAS Faculty Fellowship, and a gift from Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations.

1 Introduction

Computer users are increasingly interested in parallel programming because they want to utilize clusters of multi-core processors, which are capable, in theory, of tens or hundreds of times the performance of a single personal computer. Often they have a program that takes hours or days to execute, with coarse-grained parallelism that is easy to identify. The difficulty, however, is safe parallelization. Large computation tasks may execute tens of thousands of lines of code and make extensive use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Although compiler techniques are effective in exploiting loop parallelism in scientific code written in Fortran [2, 8, 21, 47], they are not a sufficient solution for C/C++ programs or programs with input-dependent behavior, where both the degree and the granularity of parallelism are not guaranteed or even predictable.

Manual parallel programming is becoming easier. There are ready-to-use parallel constructs in mainstream languages such as Fortran, C, C++, and Java, in threading libraries such as Microsoft .NET and Intel Threading Building Blocks, and in domain-specific languages such as Google's MapReduce and Intel Concurrent Collections for C++. Still, writing parallel code is considerably harder than sequential programming because of non-determinism. A program may run fine in one execution but generate incorrect results or run into a deadlock in the next execution because of a different thread interleaving. It may acquire new data races when ported to a machine with a different hardware memory consistency model. In addition, important types of computations such as mesh refinement, clustering, image segmentation, and SAT approximation cannot be parallelized without speculation since conflicts are not known until the computation finishes [26, 27].
This paper extends the suggestion language BOP [16, 51, 52]. BOP lets a user parallelize general types of sequential programs by inserting suggestions or hints. In this paper, the BOP language is extended to have four types of hints, as follows. Based on these hints, the BOP support system divides a sequential program into tasks that can run on a computer cluster and ensures the correctness of parallel execution.

• Parallelism hints let a user mark possibly parallel regions (PPR) in a program. BOP then tries to execute PPR tasks in parallel in a distributed environment using explicit message passing instead of shared memory.

• Data-checking hints let a user specify private data or use value-based checking to avoid false dependences in parallel execution.

• Dependence hints include speculative post-wait primitives. They express possible dependences between PPR tasks. The BOP implementation uses speculative synchronization and communication to satisfy these dependences. Since all BOP tasks logically share a single address space, it is easy for them to share (and build) dynamically allocated data structures such as lists and trees.

• Computation reordering hints enable BOP to increase the granularity of parallelism by allowing successive PPR tasks to be joined into larger units of execution.

BOP is a parallelization language. Unlike parallel constructs in a parallel language, BOP primitives exist purely as hints. The output of an execution with suggestions is always identical to that of the original sequential execution without suggestions, independent of task scheduling and execution interleaving. A user never needs parallel debugging. For more effective parallelization, a user may need to restructure a program to expose parallelism and remove conflicts, but the restructuring is done on sequential code using sequential programming tools such as conventional profilers or debuggers [16].

BOP uses software speculation and guarantees sequential semantics through speculative execution, run-time monitoring, and dependence checking. Speculation, if it fails, wastes system resources including electric power. It may even delay useful computations by causing unnecessary contention. There are inherent overheads that limit the scalability of a speculative system. However, speculation is valuable and may be necessary for parallelizing existing sequential code and for exploiting speculative parallelism (which is unknown until computation finishes). When speaking of programming language design, C++ inventor Bjarne Stroustrup said "legacy code is code that actually works" and "legacy code breeds more legacy code." As a parallelization language, BOP allows legacy sequential code to be parallelized without explicit parallel programming. As programming hints, it offers new types of expressive power. As a result, BOP may enable ordinary users to more easily utilize a group of shared-use computers to improve the performance of existing, sequential software and of new software that reuses an existing code base.

In the rest of the report, Section 2 gives an overview of the language design. Section 3 presents the parallelism hint and the system for distributed speculation. Sections 4 and 5 describe dependence and computation-reordering hints. The dependence hint is an improvement over our previous design [52]. In Section 4, we describe an improved design, complete semantics, and a simpler implementation. In Section 6, we use BOP suggestions to build other speculation constructs.
As a non-trivial use of the language, we show a parallelization technique called time skewing in Section 7. Finally, we discuss related work and summarize.

2 Overview of the Parallelization Language

A parallelization language must not change the result of a program, so BOP hints must guarantee correctness. But this is not sufficient. We design the BOP language interface with three additional goals.

• Incremental. BOP supports parallelization with partial information, either a user examining part of a program or a profiling tool analyzing a few executions. The language should allow a user or a tool to gradually discover and supply needed information.

• Succinct. For parallelization, a user should not have to mark all dependences in a program or to mark every dependence individually. The language should enable the user to express only relevant dependences with a minimal number of hints.

• Efficient. The system support should be efficient and robust in the presence of incorrect, incomplete, or out-of-date hints.

Since BOP guarantees sequential semantics, it cannot use concurrency constructs such as critical sections, monitors, atomic regions, or transactions. In their place, BOP uses speculative, serial constructs. Speculation enables out-of-order execution similar to concurrency constructs, at the cost of run-time monitoring and in-order commit. BOP tries to tolerate these costs with coarse-grained parallelism.

The BOP language aims for both shared-memory and distributed-memory parallelization. Usually distributed programming is more onerous—a user must specify all data sharing to implement it with message passing. As a speculation system, BOP has to monitor all data access, so it can detect data sharing automatically and remove this distinction between shared-memory and distributed-memory programming. There is no language-level difference whether a user programs for one multi-core computer or ten.

The BOP language has four types of hints, shown in Figure 1(a). A parallelism hint suggests a parallel task. A dependence hint suggests coordination between parallel tasks. A data-checking hint helps run-time monitoring. A computation-reordering hint allows the runtime system to create larger tasks. Next we describe these hints, using the parallelization shown in Figure 1(b,c) as a running example.

3 Parallelism Hints

A bop_parallel block suggests that a lexical block is possibly parallel with code after the block. An instance of a bop_parallel block is the basic unit of BOP parallelism. We call each instance a possibly parallel region (PPR) task [16]. PPR tasks may have an iterative structure if they are spawned one after another, nested if one is spawned inside another, or a mixture of both.

To see a use of the parallelism hint, consider the code of a generic processing loop in Figure 1(b). Each iteration de-queues an element from a queue called inputs, processes the element, and adds the result to a queue called results. In addition, the program displays the inputs queue before the loop and results after the loop. There is not enough information for static parallelization. The size of inputs is unknown. The process function may be too complex to analyze at compile time. It may use an externally compiled library with no source code available. Still, the code may be parallelizable. This possibility is expressed in two bop_parallel blocks. The first suggests the function parallelism between displaying the inputs queue and processing it.
The second suggests the loop parallelism between processing different elements.

PPR is similar to the future construct in Multilisp. A future expresses the parallelism between the future and its continuation [22]. Other future-style constructs include spawn in Cilk [17], future in Java, and async in X10 [10] and in the proposed C++200X standard. PPR expresses possible rather than definite parallelism [16]. Two similar constructs have been studied: safe future in Java [46] and ordered transactions in C/C++ [45].

(a) The BOP language has four types of hints for marking possible parallelism and dependence:

• parallelism hint (and implicit dependence hint): bop_parallel { ...code block... } — marks a possibly parallel region (PPR) in code, which may be parallelized by spawning a speculative process to execute from the end of the code block; also called spawn_n_skip (sas). Each PPR has my_ppr_index, incremented for each PPR.

• explicit dependence hints: bop_fill(channel, var) — informs the channel to post var at the time of bop_post; bop_post(channel) — posts all modified data and marks the channel ready; bop_wait(channel) — waits for the channel until it is ready and retrieves its data.

• computation reordering hint: bop_deferrable { ...code block... } — constructs a closure for execution later.

• correctness checking hints: bop_private(var) — marks variable var possibly private; bop_check(var) — marks variable var for value-based checking.

(b) A processing loop with two queues, one holding the inputs and the other the results. The queues are displayed before and after the loop:

    queue inputs, results
    inputs.display
    while ( ! inputs.empty ) {
      i = inputs.dequeue
      t = process( i )
      results.enqueue( t )
    }
    results.display

(c) BOP parallelization. Two bop_parallel blocks mark function and loop parallelism. Post-wait ensures correct construction of the dynamically allocated results queue:

    bop_parallel { inputs.display }
    first_iter_id = my_ppr_id
    while ( ! inputs.empty ) {
      i = inputs.dequeue
      bop_parallel {
        t = process( i )
        if ( my_ppr_id > first_iter_id )
          bop_wait( my_ppr_id - 1 )
        results.enqueue( t )
        bop_fill( my_ppr_id, &results.tail, sizeof(void*) )
        bop_fill( my_ppr_id, results.tail, sizeof(qnode) )
        bop_post( my_ppr_id )
      }
    }
    results.display

Figure 1: The parallelization language in (a), illustrated by an example in (b,c).

A future represents a fork in fork-join parallelism. Each future should be paired with a join. A join point is often explicitly coded by the user, for example, using sync in Cilk, finish in X10, and get in Java and in C++200X. Because of its speculation support, BOP does not require that a programmer specify a join point. There are three benefits. The first is safety. It allows partial-knowledge parallelization when the point of a join is not completely known. The second is speculative parallelization in cases when a join point is known only after future execution or when future execution ignores some infrequent conflicts. The third is flexibility. A user can suggest a task with or without a join. To suggest the join point, a user can add an explicit dependence hint, which we describe in Section 4.2.
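To make the join-free task structure concrete, the sketch below shows the two spawning patterns in the document's pseudo-C notation; has_work, work, prepare, and refine are placeholder functions of our own, not part of BOP.

    /* Iterative PPRs: instances spawned one after another by a loop. */
    while ( has_work() ) {
      bop_parallel {
        work()      /* each instance may run ahead speculatively; no join needed */
      }
    }

    /* Nested PPRs: one instance spawned inside another. */
    bop_parallel {
      prepare()
      bop_parallel {
        refine()    /* a nested possibly parallel region */
      }
    }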
3.1 Background on Software Speculative Parallelization

A speculation system deals with three basic problems:

• monitoring data accesses and detecting conflicts — recording shared data accesses by parallel tasks and checking for dependence violations

• managing speculative states — buffering and maintaining multiple versions of shared data and directing data accesses to the right version

• recovering from erroneous speculation — preserving a safe execution state and re-executing the program when speculation fails

A speculation task can be implemented by a thread or a process. Most thread-based methods target loop parallelism and rely on a compiler or a user to identify shared data and re-direct shared data accesses when needed. We use a process-based solution [7, 16, 24, 25, 53]. A process can be used to start a speculation task anywhere in the code, allowing general types of fork-join parallelism. By using the page protection mechanism, a run-time system can monitor data access without instrumenting a program. Processes are well isolated from each other's actions. A modern OS performs copy-on-write and replicates a page on demand. In a speculation system, such replication removes false dependence conflicts and makes error recovery trivial: we can simply discard an erroneous speculative task by terminating its process.

Process-based software speculation has been shown effective in safe parallelization of large tasks in legacy code [16, 52], library calls [16], and map-reduce code [7], in debugging and race-free implementation of multi-threaded code [7], and in adaptive control of speculation [24]. It supports continuous speculation, which starts a speculation task whenever a processor becomes available, to tolerate load imbalance caused by irregular task sizes and uneven hardware speed [53]. It supports parallel memory allocation and deallocation during speculation [7, 52]. It supports threaded speculation in loops using two techniques from software transactional memory: logging and undo buffers [40]. It also supports speculative program optimization and its use in parallel program profiling and memory-access checking [25]. Next we extend the process-based design to enable program parallelization in a distributed environment.

3.2 Distributed Speculation

The BOP run-time system has three types of processes.

• A control process manages shared system information and serves as the central point of communication. There is one control process in each BOP execution. We refer to it as the controller.

• A host management process manages parallel execution within a host and coordinates with other hosts through the control process. There is one management process on each host. We refer to it as a (host) manager.

• A work process runs one or a series of PPR tasks on a single processor. Work processes are dynamically created and terminated and may perform computation speculatively. We refer to each as a worker, or a risky worker if its job is speculative.

The first two types are common in distributed systems such as those used by MPI and Erlang. The unique problems in the BOP design are how to support speculation, including the ability to run a PPR task on an available machine, check correctness, and maintain progress in a distributed environment. To do so, the three types of processes divide the work and coordinate with each other as follows.

• The control process distributes a group of PPR instances (or jobs) to the management process of each host.

• On a host, the management process starts one or more work processes based on the resources available on the host.
• After finishing its assigned PPR instances, a work process reports its results for verification.

• If a finished PPR instance has a conflict, the control process redistributes the PPR instance and its successors for re-execution. If all finished PPR instances are verified correct, the control process continues distributing subsequent PPR instances until the program finishes.

Figure 2 shows the distributed execution of an example program on two hosts. The program has two PPR tasks separated by gaps. The controller distributes PPR tasks to hosts. A host manager forks a worker process to execute a PPR task. Inter-PPR gaps are executed by the controller and every manager.

(a) A program execution consisting of two PPR tasks, t.1 and t.2, separated by gaps g.1, g.2, and g.3:

    gap g.1 — ppr task t.1 — gap g.2 — ppr task t.2 — gap g.3

(b) Distributed parallel execution of t.1 on Host 1 and t.2 on Host 2. Both hosts run gaps g.1, g.2, and g.3:

    Controller:         issue job t.1 to Host 1; issue job t.2 to Host 2

    Host 1, Manager A:  run g.1; spawn A.1; wait for t.1; add t.1 result;
                        run g.2; wait for t.2; add t.2 result; run g.3
    Host 1, Worker A.1: run t.1; return t.1 result

    Host 2, Manager B:  run g.1; spawn B.1; wait for t.1; add t.1 result;
                        run g.2; wait for t.2; add t.2 result; run g.3
    Host 2, Worker B.1: skip t.1; run g.2; run t.2; return t.2 result

Figure 2: An example execution by BOP. The input program is a series of PPR tasks separated by gaps. The controller distributes PPR tasks to hosts. A host manager forks a worker process to execute a PPR task. Inter-PPR gaps are executed by the controller and every manager.

To start a worker, BOP needs to create a process and initialize it to have the right starting state. There are two basic ways of creating an existing state: one is copying from the existing state, the other is re-computing it from the beginning. On the same host, copying can be easily done using Unix fork. Across hosts, we use a hybrid scheme we call skeleton re-execution. Each host manager executes all inter-PPR gaps, which form the "skeleton." When it reaches a PPR task, the manager waits for its successful completion (by some worker), copies the data changes, and skips to continue the skeleton execution at the next inter-PPR gap. With skeleton re-execution, a manager maintains a local copy of the program state and uses it to start worker tasks through fork.

An alternative to re-execution is to use remote checkpointing, for example, to use a system like Condor [30] to implement a remote fork. Simple checkpointing would transfer the entire program state, which is unnecessary. Incremental checkpointing may alleviate the problem. Checkpointing is a more general solution than re-execution because it handles code that cannot be re-executed. In our current prototype, we allow only CPU, memory, and basic file I/O operations, where re-execution can be supported at the application level. Checkpointing support can eliminate these restrictions.
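To summarize skeleton re-execution, the following sketch outlines a host manager's main loop. It is our illustration: run_gap, assigned_here, fork_worker, wait_for_commit, and apply_update are invented names, not the actual BOP runtime interface.

    /* A host manager under skeleton re-execution (sketch). */
    for ( i = 1; !program_done(); i++ ) {
      run_gap( i )                   /* every manager re-executes gap i        */
      if ( assigned_here( i ) )
        fork_worker( i )             /* local Unix fork; the child runs PPR i  */
      update = wait_for_commit( i )  /* wait until PPR i commits on some host  */
      apply_update( update )         /* copy its data changes and skip PPR i   */
    }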
3.3 Correctness Checking and Data Update

There are two basic problems in speculation support: correctness checking and data update. Each problem has two basic solution choices.

• Correctness checking can be done by centralized validation, where checking is centralized in one process (the controller), or distributed validation, where checking work is divided among hosts.

• Data update can be done by eager update, where changes made by one host are copied to all other hosts, or lazy update, where changes are communicated only when they are needed by a remote host.

The problem of data updates is similar to the ones in software distributed shared memory (S-DSM), while the problem of correctness checking is unique to a speculative parallelization system. The above choices can be combined to produce four basic designs.

• Centralized validation, eager update. This design is similar to shared-memory speculation. With eager update, a worker incurs no network-related delays because all its data is available on the local host. However, the controller may become a bottleneck, since it must send all data updates to all hosts.

• Distributed validation, eager update. With distributed validation, each host sends its data updates to all other hosts, which avoids the bottleneck in centralized validation but still retains the benefit of eager update. However, correctness checking is repeated on all hosts (in parallel), which increases the total checking and communication cost due to speculation.

• Distributed validation, lazy update. Lazy update incurs less network traffic because it transfers data to a host only if it is requested by one of its workers. The reduction of traffic comes at an increase of latency. Instead of accessing data locally, a worker must wait for data to be fetched from the network, although the latency may be tolerated by creating more workers to maintain full progress when some workers wait for data. With distributed validation, the global memory state is distributed instead of centralized and replicated.

• Centralized validation, lazy update. This scheme maintains a centralized global state. As the sole server of data requests, the controller would inevitably become a bottleneck as the number of lazy-update workers increases. This combination is likely not a competitive option.

There are hybrids among these basic combinations. For example, we may combine centralized and distributed validation by checking correctness in the controller but sending data updates from each host, in a way similar to the "migrate-on-read-miss" protocol in a distributed, cache-coherent system [12]. We may also divide program data and use eager update for one set and lazy update for another.

3.4 Centralized Validation and Eager Data Update

To verify correctness, the controller and manager processes maintain a progress frontier, which separates the verified part and the speculative part of an execution. A worker is started by a manager from its progress frontier and given a target PPR instance. A worker usually executes a series of gaps and a PPR instance. We use the following symbols when describing the checking algorithm.

• f: the progress frontier. f_g is the (global) progress frontier maintained by the controller. f_i is the progress frontier known by the manager at host i.

• gap(f, i), ppr(f, i): the two parts, gap and PPR, of the execution of PPR instance i started from progress frontier f.

• R(x), W(x), RW(x): the read, write, and read-write access maps of a gap or a PPR instance x.

• P(x): the content of the data (pages) modified by a gap or PPR instance x.

• U(i): the effect of PPR instance i, stored as the set of pages modified in i and their content.

In centralized validation, each worker sends meta-data and data to the controller, which verifies the correctness of PPR instances in order. Below are the steps of correctness checking.

1. A worker executes PPR instance i from progress frontier f. It executes the inter-PPR gaps from gap f+1 to gap i−1 and then starts recording read and write maps separately, for gap i first and PPR i next, as it executes them.
2. When a worker finishes a speculative PPR execution, it sends the result, consisting of RW(gap(f, i)), RW(ppr(f, i)), and P(ppr(f, i)), to the controller.

3. The controller validates worker results in the order of increasing gap and PPR index. The result of gap(f, i) and ppr(f, i) (i > f+1) is correct if gap(f′, i−1) and ppr(f′, i−1) have been verified correct for some f′ and:

• there has been no failure between f and i, that is, the last failure index (j) is less than f in the controller's blacklist;

• the read and write sets of gap(f, i) and ppr(f, i) do not overlap with any update set U(j) after f. In symbolic terms, we require

    [RW(gap(f, i)) ∪ RW(ppr(f, i))] ∩ ( U(f+1) ∪ ... ∪ U(i−1) ) = ∅

4. If gap(f, i), ppr(f, i) is correct, the controller advances the progress frontier f_g to i, stores the update set of i, U(i) = P(ppr(f, i)), and sends to all managers the new f_g and U(i); otherwise, the controller blacklists index i and re-issues the job (f_g, i) and its subsequent jobs for execution.

A special case is i = f+1, when the execution is non-speculative. A worker executes only PPR i (no gap) and returns P(ppr(f, f+1)) and W(ppr(f, f+1)) for the controller to update its progress frontier and U(f+1). This case is used by BOP to guarantee progress and base performance, as discussed in Section 3.6.

Correctness checking in BOP differs from the checking in shared-memory speculative systems. In particular, BOP does not check for gap-to-gap dependences and does not need gap updates. This difference is due to the gap execution being repeated on every host. Because of the asymmetry in gap and PPR executions, they are treated differently in data collection and access checking. The incremental procedure in the checking algorithm guarantees two transitive properties: speculation succeeds for PPR instances from 1 to n if and only if there is no dependence between (1) gap(j) and any predecessor ppr(i) and (2) ppr(j) and any predecessor ppr(i).
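In code, the check in Step 3 reduces to intersecting the worker's access maps with the update sets committed after its frontier. The sketch below is our illustration: overlaps, last_failure_index, and the set type are invented names, and the update sets U[j] are assumed to be stored by the controller.

    /* Sketch of the controller's check for (gap(f,i), ppr(f,i)). */
    int validate( int f, int i, set_t rw_gap, set_t rw_ppr ) {
      if ( last_failure_index() >= f )      /* a failure between f and i */
        return 0;
      for ( j = f + 1; j < i; j++ )         /* test against U(f+1..i-1)  */
        if ( overlaps( rw_gap, U[j] ) || overlaps( rw_ppr, U[j] ) )
          return 0;                         /* dependence violation      */
      return 1;                             /* correct: frontier -> i    */
    }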
3.5 Correctness Checking Hints

By default BOP monitors all global and heap data. Local variables are considered task private. Compiler analysis can be used to identify whether a local variable (stack data) is entirely private. If a compiler cannot statically determine all accesses to a variable, it can allocate the variable in the heap for monitoring. Our previous system uses compiler support to identify global variables for monitoring. BOP provides a function for a program to explicitly register global data for monitoring, which allows BOP to be used without compiler support.

For correctness checking, BOP classifies monitored data into three categories—possibly private, shared, and value checked—as explained in detail in [16]. The checking is primarily dependence based. There are three types of dependences. BOP checks for flow and output dependence violations. Anti-dependence violations pose no problem because of data replication [16, 52]. In our running example in Figure 1, the input queue can be displayed in a PPR task while its elements are being removed by the next PPR tasks, because the queue is replicated by the later tasks.

Flow and output dependences would normally cause speculation to fail. In two cases, however, BOP permits these dependences when it can prove that they are harmless. These two cases are suggested by bop_private and bop_check.

A variable is marked by bop_private if, in a PPR task, its value is assigned before it is used. Because the first access is a write, the variable does not inherit a value from prior tasks. Verifying the suggestion requires capturing the first access to a variable, which can be costly if the variable is an array or a structure. For efficiency we use a compromise. We insert code at the start of the PPR to write a constant value into all variables that are marked bop_private. If the suggestion is correct, the additional write adds a small extra cost but does not change the program semantics. If the suggestion is wrong, the program may not execute correctly, but the sequential version has the same error, and the error can be identified using conventional debugging tools. Under this implementation, bop_private is a directive rather than a hint, unlike other BOP primitives.

The hint bop_check suggests that a variable holds the same value before and after a PPR task. The variable may take other values in between, but the intermediate values are not visible outside and do not affect the correctness of later PPR tasks. In the implementation, BOP records the value of checked variables at the start of a PPR instance and at the end compares the starting value with the ending value. If the two versions are identical, the variable has the same effect on parallel execution as a read-only variable and does not cause any dependence violation.
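As an illustration, the sketch below shows a typical use of each hint; the variables and the save-restore idiom are our own example, not taken from the BOP sources.

    int scratch     /* always recomputed inside each task            */
    int seed        /* saved and restored around the PPR computation */

    bop_private( scratch )   /* written before read in every PPR     */
    bop_check( seed )        /* same value at PPR start and PPR end  */

    bop_parallel {
      scratch = 0            /* first access is a write              */
      saved = seed
      /* ... the task may change seed internally ... */
      seed = saved           /* restored before the PPR ends         */
    }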
3.6 Progress and Performance Guarantee

Speculative execution may be too slow or produce incorrect results. Our previous systems re-execute speculative work in a sequential, non-speculative task called the understudy, which guarantees progress and base performance [16, 52]. In the distributed case, we use the controller for this purpose. At each progress frontier, the controller executes the next PPR instance non-speculatively if it has not been finished correctly by a speculative worker. When the controller executes a PPR instance, it records the write set for use in correctness checking and record keeping. Except for write-access monitoring, the controller execution is the same as a sequential execution. Therefore, it guarantees that a BOP execution takes at most the time required for the controller to finish the entire execution.

4 Dependence Hints

4.1 Implicit Dependence Hints

In BOP, a program execution becomes a tree of tasks similar to a function invocation tree. Tasks on every tree path are sequentially spawned. Since in between two PPRs the code is executed sequentially, dependences among inter-PPR code regions are automatically enforced. Inter-PPR code is most often used to implement serial work needed to start parallel tasks. For example, in Figure 1(c), the input queue is accessed outside the loop PPR, so queue elements are extracted sequentially. Another example is that in loops, the loop control code should be placed outside a PPR. Hence a PPR marks not only parallelism but also dependences. It suggests serial dependence in inter-PPR code, dependence from inter-PPR code to PPR code, and the absence of dependences from PPR code to inter-PPR code. An example of a PPR to inter-PPR dependence can be found in Figure 1(b), between the computation of the result t and its entry into the queue. To express this dependence, we need explicit dependence hints, which we describe next.

4.2 Explicit Dependence Hints

Post-wait was created by Cytron to express dependences in a do-across loop [13, 34]. We extend post-wait to mark dependences between PPR tasks. In BOP post-wait, dependent operations communicate data through a single-use channel. Channels are created on demand and have the initial status not posted. Communicated data is identified by variable name and, if it is an array, optionally a range. BOP post-wait has three primitives.

• bop_fill(channel_id, start_addr, num_pages) adds data to a local, not yet communicated channel with the identifier channel_id. A bop_fill call is ignored if the channel has been posted. Only the location of the data is recorded at this point. The same memory locations may be added by multiple bop_fill operations. The contents of the pages are taken only when the channel is actually posted.

• bop_post(channel_id) releases the locally constructed channel data for consumption and stops the channel from accepting new data by changing the status of the channel, both globally and locally, from not posted to posted. It communicates only the data that have been modified by the time of the bop_post.

• bop_wait(channel_id) stalls if the channel status is not posted. If and when the channel status changes, it receives the data into the waiting task.

For the same channel, bop_post operations are atomic and logically do not overlap with each other. bop_wait operations for the same channel may retrieve data in parallel with each other. Since bop_fill is local, two PPR tasks may create two channels with the same identifier. However, for the same channel identifier, only one bop_post operation may succeed.

Our running example shows a use of post-wait, where PPR tasks assemble their results into a linked list. In the code in Figure 1(c), post-wait is used to coordinate parallel PPR instances to enqueue their results sequentially. The first PPR modifies the queue and posts the tail. Each subsequent PPR waits for the previous PPR's post, modifies the queue, and posts the new tail. Since PPR tasks have an identical address space, the sender determines the placement of communicated data in the receiver. This property is indispensable for our example, where one PPR allocates a new queue node and passes it on to the next PPR. The example shows that BOP post-wait makes parallel construction of dynamic data structures easy. Not only can a waiting task receive data allocated before it was started, but it can also receive data allocated in other tasks after the receiver was started.

4.3 Expressiveness and Efficiency

Post-wait is a form of explicit communication similar to send-receive, with the posting PPR as the sender and the waiting PPR as the receiver. Compared to existing communication primitives, BOP post-wait has several distinct features.

Deterministic behavior. In the base design, a channel accepts a single post. After a channel is posted, its status is changed and all subsequent bop_post and bop_fill operations are ignored. This design ensures deterministic behavior. If we were to allow multiple posts, it would be uncertain how many of the posts had happened at the time of a bop_wait. In addition, we make bop_fill a task-local operation, which precludes a channel from being filled by multiple tasks. The design has two benefits. First, a local fill is efficient since it incurs no network communication. Second, it again avoids non-determinism. If two PPR tasks could fill a channel, a single post could not guarantee that both tasks had finished placing data in the channel. If a task depends on two predecessor tasks, the right solution is to use two channels, one for each predecessor task, and let the dependent task wait for both channels, as in the sketch below.
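In the two-channel pattern, each predecessor fills and posts its own channel; CH_I and CH_J are illustrative channel identifiers, and the fill arguments follow the style of Figure 1(c).

    /* In predecessor task i: */
    bop_fill( CH_I, &x, sizeof(x) )
    bop_post( CH_I )

    /* In predecessor task j: */
    bop_fill( CH_J, &y, sizeof(y) )
    bop_post( CH_J )

    /* In the dependent task: wait on both channels before using x and y. */
    bop_wait( CH_I )
    bop_wait( CH_J )
    use( x, y )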
Uncertainty in data identity and access time. If we know the exact dependence relation between two tasks, that is, we know that data d needed by task j is produced by task i, then post-wait can easily express the dependence. Let's relax the requirement. Suppose we know the time of access but not the location of the data, as in our example where a PPR task creates a node and passes it to the next task. With BOP post-wait, a waiting task can retrieve the unknown data using a channel identifier. A related case is when we know the identity of data d and its last write in i but not its first read in j. Current BOP post-wait cannot express this dependence but can be extended to express it by allowing waiting on a data address instead of a channel identifier. The flow keyword in ordered transactions expresses one type of such dependence, where there is only one write in i [45]. If we relax the requirement further and suppose the last modification time is unknown, then to preclude rollback we have to put the post operation at the end of task i.

Imperfect post-wait matching. Post-wait is a suggestion. Problems such as useless, redundant, or mismatched calls do not break a program. In the example in Figure 1(c), although the post in the last iteration has no matching wait, it does not affect correct (parallel) execution. The nature of hints makes their use in coding simpler. Incomplete knowledge of a program can cause three types of errors in specification: under-specification, where a dependence happens but a user does not mark it; over-specification, where a user specifies a dependence that does not happen in execution; or incorrect specification, where the location of the source or the target of a dependence is specified wrong or left unspecified. The speculative implementation tolerates these errors. It ignores an error if the error does not affect parallelism or efficiency; otherwise, it can report the error for the user to fix.

Selective communication. Not all communication has to use post-wait. In fact, most often the bulk of data changes in BOP are communicated implicitly when PPR tasks commit. The difference between post-wait and PPR commit is similar to the difference between synchronous and asynchronous communication. Post-wait is used when parallel execution requires direct synchronization. In the example in Figure 1(c), the next PPR must acquire the tail of the queue before it can append to it (assuming the queue is implemented as a singly linked list). There are a few interesting subtleties in the example. If more than three PPR tasks are run in parallel, the third PPR receives only the new node created by the second PPR but not the first. Appending to a linked list involves modifying the last two nodes, but only one node is posted. Clearly post-wait does not communicate all queue updates to all PPRs. Still, the solution is correct because the other (non-essential) pieces will be combined at commit time. During parallel execution, each PPR works on one link of the queue. Afterwards, BOP automatically merges the results and obtains the complete queue, identical to the result obtained from a sequential execution.

Aggregate communication. Post-wait lets a user aggregate communication. In the example, a PPR must communicate two pieces of information to the next PPR: the new tail pointer and the new tail node. It adds them in two bop_fill calls and communicates them in one post. Aggregation may also happen due to page-based data protection. If both data are on the same page, the second bop_fill call is ignored.
On the other hand, false sharing may cause unexpected conflicts, for example, if the PPR modifies a queue node next to the tail after the post. Such cases will cause parallel execution to roll back. To enable full parallelism, a user has to either communicate at the end of the PPR (as in this example) or allocate dependent data on separate memory pages.

Progress frontier. A posted channel may be consumed by any number of wait operations. Logically these wait operations all receive the same data. Communication may not be needed if the waiting task already has the posted data from earlier commits. If a user accidentally inserts a post without a matching wait or vice versa, the unmatched operations are eventually dropped when the progress frontier moves past. The system stores the identifiers of all channels that have been posted in case a wait operation is later called on one of these channels. The storage cost is small since the system stores only channel identifiers, not channel data.

Extension to k-post. Suppose, in a more general case, we know a set of u possible program points that may be the source of a dependence and a set of v possible program points that may be the sink of the dependence. We would need to synchronize between the last executed point of the u possible sources and the first executed point of the v possible sinks. First, if some of the u possible sources happen in a single PPR, we find the latest point (the one that post-dominates the others in either program or profiling analysis) and reduce the possible sources to a set of u_p points in u_p PPRs. Similarly we reduce the set of possible sinks to v_p points in v_p PPRs. With the single-post design, we need to insert at each source a post operation with a different channel and insert at each sink u_p wait operations so it waits for all channels, for a total of u_p post operations and u_p · v_p wait operations. We can extend the design to assign a parameter to a channel, num_posts, which means that the channel is posted only after it has received exactly num_posts posts. In this example, we create a single channel with num_posts equal to u, insert a post operation at each source, and a wait operation at each sink. The total number of post-wait operations is u + v.
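If such an extension were provided, its use might look like the sketch below; the num_posts setup call is hypothetical, since the current system implements only single-post channels.

    bop_channel_num_posts( CH, u )   /* hypothetical: CH completes after u posts */

    /* at each of the u possible source points: */
    bop_fill( CH, &d, sizeof(d) )
    bop_post( CH )                   /* u posts in total */

    /* at each of the v possible sink points: */
    bop_wait( CH )                   /* v waits in total: u + v operations */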
4.4 Implementation of Speculative Post-wait

Incorrect use of BOP post and wait may cause a task to consume a stale value of some data. BOP detects all stale-value accesses as conflicts. There are three types of conflicts. First, a sender conflict happens if a PPR instance modifies some data that it has posted earlier. This conflict causes a transmission of stale data. Second, a receiver conflict happens when a PPR instance accesses some data and then later receives the data through a bop_wait. This conflict causes a consumption of stale data. Finally, a pairing conflict happens when some data is received not from the last modifying PPR instance but from an earlier one. In the first two types, the error is caused by one of the two ends of the communication. In the third, the error is caused by the actions of a third party. BOP detects and handles these conflicts as follows.

Sender conflict. The simple solution is that if PPR i incurs a sender conflict, we abort and re-start all speculation of PPR j for j > i. With sufficient bookkeeping, we can identify and roll back only those PPR instances that are affected by the offending send. The sender can remove the data from the earlier posted channels and send the newest value in the latest post. To avoid recurring conflicts, the BOP runtime can delay any post operation that has caused a sender conflict.

    task 1:  write x;  fill(1, x);  post(1)
    task 2:  wait(1);  write x
    task 3:  wait(1);  read x

Figure 3: Illustration of a pairing conflict when using post-wait. The read in task 3 is incorrect because the received value is not the newest one in sequential semantics.

Receiver conflict and multiple receives. A PPR instance must be aborted if it incurs a receiver conflict. A related case is when a PPR receives x more than once from different senders. We rank the recentness of received values by the PPR index of their sender. A larger index means a later sender and a newer value. There are three cases. First, x is first accessed before the first receive, which means a receiver conflict. Second, x is first accessed between two receives. If the value from the first receive is less recent than that of the second receive, a receiver conflict is triggered. If the value from the first receive is newer, the second receive can be silently dropped without raising a conflict. In the third case, x is first accessed after two or more receives. BOP keeps the newest value of x and continues. We call the last two extensions silent drop and reordered receive.

Pairing conflict. A pairing conflict is caused when some data x is communicated incorrectly. The 3-task example in Figure 3 shows such a conflict. Data x is modified in tasks 1 and 2 and read in task 3. The value of x sent by task 1 is the newest for task 2 but not for task 3, since task 3 can get a newer value from task 2. In terms of data dependences, the dependence from task 2 to task 3 is violated. To ensure correctness, we need to combine dependence and post-wait checking.

A pairing conflict is detected at the time of commit, when all post-wait and data-access information is known and stored in access maps. An access map marks four types of data access: r, wr, p(i), wa(i), representing read, written, posted, and received (through a wait). The last two types include the PPR index of the sender. Let Δ_i be the access map of PPR i. If PPR i posted data x, we add p(i) to Δ_i(x) (and remove wr, since the write is implied). If PPR i received data x from PPR s, we add wa(s) to Δ_i(x). When checking ppr(f, i), we compare its accesses with each predecessor PPR from f+1 to i−1, as in the algorithm in Section 3.4. To handle post-wait, we augment Step 3 of the algorithm as follows. For ease of checking, it builds a cumulative write map Φ representing the set of write accesses before PPR i. There are four cases, with different correctness conditions for i and updates to the cumulative map Φ; a code sketch follows the list.

1. x is not accessed in PPR i. If Δ_i(x) = ∅, the check passes; otherwise, continue to the next case.

2. x is not modified until PPR i. If Φ(x) = ∅, set Φ(x) = Δ_i(x) − r (Φ is a write map), and the check passes; otherwise, continue to one of the next two cases.

3. x is modified before i but not posted. If wr ∈ Φ(x) and Δ_i(x) ≠ ∅, the check fails.

4. x is modified before i and posted. There are two cases depending on whether i received x from this sender s. If not, p(s) ∈ Φ(x) but wa(s) ∉ Δ_i(x), and the check fails. If so, p(s) ∈ Φ(x) and wa(s) ∈ Δ_i(x), and the check passes, but we need to set Φ(x) carefully. If x is only read in i, leave Φ(x) unchanged. If x is modified and posted in i, set Φ(x) = {p(i)}. If x is modified but not posted in i, set Φ(x) = {wr}.
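The four cases can be coded as a per-datum pass at commit time, sketched below with invented helper names; delta stands for Δ_i(x) and phi points to the cumulative write map Φ(x).

    /* Sketch of pairing-conflict checking for one datum x at PPR i. */
    int check_datum( map_t *phi, map_t delta, int i ) {
      if ( is_empty( delta ) )  return PASS;          /* case 1 */
      if ( is_empty( *phi ) ) {                       /* case 2 */
        *phi = remove_r( delta );                     /* phi is a write map */
        return PASS;
      }
      if ( has_wr( *phi ) )     return FAIL;          /* case 3 */
      s = sender_of( *phi );                          /* case 4: p(s) in phi   */
      if ( !has_wa( delta, s ) ) return FAIL;         /* x not received from s */
      if ( modified( delta ) )                        /* update phi carefully  */
        *phi = posted( delta ) ? make_p( i ) : make_wr();
      return PASS;
    }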
As an example, consider the steps of checking task 3 in the earlier example (Figure 3). At the end of tasks 1 and 2, Cases 2 and 4 are invoked respectively, and the cumulative map Φ(x) becomes first {p(1)} and then {wr}. Finally, the check at task 3 invokes Case 3 and detects a conflict. This solution differs from our previous solution, which was based on the concept of communication paths [52].

Inter-PPR post-wait handling. The skeleton execution described in Section 3.2 permits more precise post-wait checking than the previous solution for shared-memory speculation [52] in that it can ignore post-wait constructs in inter-PPR code. A post in gap i is unnecessary since its data changes are visible to all subsequent execution. A wait in gap i is either unnecessary, if the matching post comes from another gap, or implies a dependence violation, if the matching post comes from a PPR. Therefore, bop_fill and bop_wait are ignored, and bop_post sends its channel without any content.

Outdated post. As mentioned earlier, data updates in BOP are propagated in two ways: post-wait and PPR commit. If a post-wait pair spans many PPR instances, it is possible that the receiver is started after the commit of the sender, so the receiver already possesses the sent data. The frontier parameter can be used to check this case precisely. If ppr(f, j) receives a post from PPR i that happens before f (i ≤ f), bop_wait succeeds but nullifies the content of the channel and accepts no new data. We call this special wait a nullified receive.

Progress guarantee. Usually blocking communication implies the risk of entangling a set of tasks into a deadlock. BOP post-wait is deadlock free because communication flows in one direction, in increasing order of PPR index. A wait without a matching, earlier post would stall the receiving task. However, it does not stall the advance of the progress frontier, as explained in Section 3.6. When the progress frontier moves past, the stalling task is canceled.

5 Computation Reordering and Run-time Task Coarsening

At run time, BOP may group a series of PPR instances to create a large parallel task. We call the transformation run-time task coarsening and the aggregate task a BOP task. By choosing different group sizes, we can create parallel tasks of a desirable length, which is especially useful in distributive parallelization.

(a) An example loop and the parallel execution of the first 4 PPRs in 2 tasks. Limited parallelism due to post-wait:

    for i in 1 ... n
      bop_parallel {
        // call to a large function
        t = data[ i ].analyze
        if i > 1
          bop_wait( i - 1 )
        histogram.add( t )
        bop_fill( ... )
        bop_post( i )
      }
    end for

    bop task 1:                  bop task 2:
      // ppr 1                     // ppr 3
      data[ 1 ].analyze            data[ 3 ].analyze
      histogram.add                bop_wait( 2 )
      bop_post( 1 )                histogram.add
      // ppr 2                     bop_post( 3 )
      data[ 2 ].analyze            // ppr 4
      bop_wait( 1 )                data[ 4 ].analyze
      histogram.add                bop_wait( 3 )
      bop_post( 2 )                histogram.add
                                   bop_post( 4 )

(b) The same loop with a deferrable block and the parallel execution of the first 4 PPRs in 2 tasks. More parallelism due to delayed post-wait:

    for i in 1 ... n
      bop_parallel {
        t = data[ i ].analyze
        bop_deferrable ( t ) {
          if i > 1
            bop_wait( i - 1 )
          histogram.add( t )
          bop_fill( ... )
          bop_post( i )
        }
      }
    end for

    bop task 1:                  bop task 2:
      // ppr 1                     // ppr 3
      data[ 1 ].analyze            data[ 3 ].analyze
      // ppr 2                     // ppr 4
      data[ 2 ].analyze            data[ 4 ].analyze
      // delayed blocks            // delayed blocks
      histogram.add                bop_wait( 2 )
      bop_post( 1 )                histogram.add
      bop_wait( 1 )                bop_post( 3 )
      histogram.add                bop_wait( 3 )
      bop_post( 2 )                histogram.add
                                   bop_post( 4 )

Figure 4: An example showing the use of a deferrable block in exposing parallelism between "coarsened" BOP tasks that contain multiple PPR instances.
Computation reordering is necessary to enable this type of automatic task coarsening. Consider PPR blocking for the example loop in Figure 4(a). Each iteration produces a result and adds it to a histogram. The suggested parallelization is to run each iteration in parallel but serialize its update to the histogram using post-wait. An example PPR grouping is shown below the loop. There are two BOP tasks, each executing 2 PPRs. Because of post-wait, PPR 3 must wait for PPR 2, and BOP task 2 must stop after its first PPR to wait for task 1 to finish. Post-wait is necessary to guarantee sequential semantics, but it causes PPR grouping to lose most parallelism. The purpose of computation reordering is to regain parallelism while preserving sequential semantics.

The computation-reordering hint, bop_deferrable, suggests that a block of code can be executed later. Figure 4(b) shows its use in our example code, suggesting that the histogram update can be delayed. The two BOP tasks, shown in the figure below the code, can now run in parallel. The histogram updates are still done sequentially, but they are delayed to the end of the BOP tasks and do not stop task 2 in the middle of its execution. With delayed updates, PPR blocking can increase the amount of synchronization-free, parallel work by the group size. By selecting the group size at run time, BOP can control the granularity of parallelism and de-couple program specification from machine implementation: a user focuses on identifying parallelism, and BOP tailors the parallelism for a machine.

As a form of dependence hint, a deferrable block has the same meaning as a PPR block. Both mean an absence of dependence between the code block and its succeeding computation. The use of the two types of blocks is so different that we give them different names in BOP. The implementation issues, however, are similar in many aspects, especially in correctness checking. The main difference is the interaction with the succeeding block. A PPR block is in a separate process from its succeeding block, but a deferrable block runs in the same process as its succeeding block. Instead of creating a process, a deferrable block creates a closure to include the code and all its variable bindings. The closure is invoked as late as possible in a BOP task.

Most parallel languages provide an atomic-section construct for use by a parallel loop. The loop may then be partitioned and executed in blocks of iterations. However, atomic sections do not guarantee correctness (i.e., the same result as sequential execution) or deterministic output. The computation-reordering hint in BOP aims to provide the same flexibility while guaranteeing sequential semantics.
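One way to realize the closure mechanism is sketched below. This is our illustration: the queue-of-closures representation and the function names are assumptions, not the actual BOP runtime design.

    /* Sketch: each deferrable block is captured as a closure and the
       queued closures run at the end of the BOP task. */
    typedef struct {
      void (*code)( void *env );   /* the deferred code block        */
      void *env;                   /* its captured variable bindings */
    } closure_t;

    closure_t deferred[ MAX_DEFERRED ];
    int n_deferred = 0;

    void defer( void (*code)( void * ), void *env ) {   /* illustrative */
      deferred[ n_deferred ].code = code;
      deferred[ n_deferred ].env = env;
      n_deferred = n_deferred + 1;
    }

    void run_deferred_blocks() {   /* invoked as late as possible in a task */
      for ( k = 0; k < n_deferred; k++ )
        deferred[ k ].code( deferred[ k ].env );
      n_deferred = 0;
    }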
6 Language Extensions

BOP hints are basic building blocks that can be combined to build other constructs. As a result, we can add new suggestions without changing the language implementation. We demonstrate the potential for language extensions by building two loop-based and one data-based suggestion.

When parallelizing a loop, a direct way is to mark a parallel loop. Consider the example in Figure 5(a). Each iteration has two steps: computing a value and appending it to an object. Assume that the first step is parallel and the second step has to be serial. To parallelize the loop, we mark the loop parallel and place the second step in a serial region, as in Figure 5(b). A second construct is a shared variable declaration as used by Jade [37], as in Figure 5(c). A compiler is used to identify and manage shared data accesses. A third loop-based construct is the pipelined loop designed by Thies et al. [42]. To parallelize the example, a user adds two pipeline stages and marks the first stage parallel (by specifying the number of processors p), as in Figure 5(d).

Serial region. One can declare a parallel loop and mark dependent operations in a serial region, as in Figure 5(e). The term "serial region" was used by an early version of the OpenMP language to mean code that should be executed by one thread. We generalize the concept to add a label to each region. Code of the same named region must be serialized in the sequential order, while code from different named regions may execute in parallel. The generalized serial region can be implemented by BOP post-wait. Figure 5(f) shows the code translation for a generic serial region r1. The scheme is akin to token passing, where a "token" is a channel number indexed by the identity of the host task, and it is passed by post-wait. To determine data communication, we may send all data changed in the region or find ways to identify and send only dependent data. Since the BOP serial region is a hint, it is possible in an execution that an iteration executes a serial region multiple times or none at all.

(a) a loop whose body consists of a parallel step and a sequential step:

    for i in 1 ... n
      r = compute( i )
      s.append( r )
    end for

(b) parallelization of (a) using a parallel loop and a serial region:

    parallel for i in 1 ... n
      r = compute( i )
      serial r1 { s.append( r ) }
    end for

(c) parallelization using a parallel loop and a shared variable declaration:

    shared s
    parallel for i in 1 ... n
      r = compute( i )
      s.append( r )
    end for

(d) parallelization using a pipelined loop:

    for i in 1 ... n
      begin_pipelined_loop
        pipeline( p )
        r = compute( i )
        pipeline
        s.append( r )
      end_pipelined_loop
    end for

(e) general form of a serial region:

    parallel for i in 1 ... n
      // parallel work
      serial r1 {
        // sequential work
      }
      // parallel work
    end for

(f) general implementation of a serial region using BOP:

    bop_post( r1_ids[0] )
    for i in 1 ... n
      bop_parallel {
        // parallel work
        bop_wait( r1_ids[my_ppr - 1] )
        // sequential work
        bop_post( r1_ids[my_ppr] )
        // parallel work
      }
    end for

(g) general form of a pipelined loop (pipeline(p) means that the next stage is parallel):

    for i in 1 ... n
      begin_pipelined_loop
        // stage 1
        pipeline( p )
        // stage 2
        pipeline
        // stage 3
      end_pipelined_loop
    end for

(h) implementation by BOP:

    for i in 1 ... n
      bop_parallel {
        bop_wait( <my_ppr-1, s1> )
        // stage 1
        bop_post( <my_ppr, s1> )
        bop_wait( <my_ppr-1, s1> )
        // stage 2
        bop_post( <my_ppr, s2> )
        bop_wait( <my_ppr-1, s3> )
        // stage 3
        bop_post( <my_ppr, s3> )
      }
    end for

Figure 5: Using BOP as the base language to implement composite constructs similar to the serial region and the pipelined loop.

Shared variable. A shared variable declaration enables a compiler to identify shared data accesses [37]. Such a compiler can generate parallel code using BOP. For each PPR, the compiler identifies the earliest and latest accesses of shared data and places the wait and post operations as in the implementation of the serial region. If a PPR has no shared data accesses, it is equivalent to having the first and last accesses at the start of the PPR.
If the accesses in a PPR are not completely known, the conservative solution is to assume that the first access is at the beginning and the last access is at the end. To improve parallelism, the compiler may use a different serial region for each shared variable. Since correctness is guaranteed by the BOP run-time system, a compiler can generate more aggressive, albeit unsafe, code.

Pipelined loop. With a pipelined loop, most of the work done by a user is just to divide a loop body into pipeline stages [42]. By default a stage is sequential, in that its execution in different iterations does not overlap. If the work can run in parallel, a user marks the stage parallel by specifying the number of processors p. Figure 5(g) shows a 3-stage example, where stages 1 and 3 are sequential but stage 2 is parallel. Although simple, the pipelined loop clearly shows the composition of cross-stage and intra-stage parallelism. Its implementation is also simple and effective, with one process running a sequential stage and p processes running a parallel stage as specified. All processes are fully active in the steady state.

The same interface can be implemented by BOP hints using the following code transformation. First, we turn a pipelined loop body into a PPR. Then we create a series of channel numbers indexed by the PPR index and the stage index. For each sequential stage, we insert at the front a wait for the same stage from the preceding PPR and at the back a post for itself (this stage and this PPR). The conversion of a parallel stage is the same except that the wait is for the preceding stage in the preceding PPR. Figure 5(h) shows this transformation. The BOP code exploits the same, staged parallelism, but the implementation is very different. One process is used per iteration, not per stage. To fully utilize p processors, BOP may need to execute more than p processes in parallel and rely on the multi-tasking support of the operating system.

Potential benefits. As the three examples show, BOP can serve as a basis for building other parallel constructs. Compared to the original constructs, BOP-based emulators benefit from the uniformity and correctness guarantees of BOP. Take the pipelined loop as an example. The original construct does not guarantee the same result in parallel executions, but the BOP version adds this guarantee. The original primitive cannot be used within a nested loop or a called function. A target loop cannot have break and continue statements (they must first be converted to if statements). The BOP version removes these restrictions. The original implementation copies all data changes synchronously at the end of a pipelined loop. BOP copies data changes incrementally and asynchronously, in parallel with loop or post-loop execution. In the original design, a user must remove all conflicts. With BOP, conflicts do not affect correctness, and occasional unknown conflicts may not affect parallel performance if most of the execution is conflict free.

Another benefit is clarity—the meaning of an extension is completely defined by its construction. This removes ambiguity in semantics, especially when a construct is used amid complex control flow. Take the example of the serial region. The implementation in Figure 5(f) assumes that a serial region is executed once and only once in each PPR. The assumption holds for the implementation of the pipelined loop in Figure 5(h). However, in a general case, one cannot guarantee how often a serial region is executed or whether it is executed at all.
There are implementation limitations to BOP-based extensions. BOP may not be the most efficient solution, for example, for the pipelined loop. However, the previous solutions were not designed for software speculative execution, and the BOP implementation may still be the right choice in a speculative context. In addition, by building different constructs from a single basis, a user can combine and mix them, for example post-wait and the pipelined loop, when parallelizing a single program.

7 A Demonstration

We demonstrate a use of BOP in parallelizing iterative computations. Iterative algorithms are commonly used to solve equations that have no direct solutions. Examples can be found in many types of numerical analysis and in general computational problems that require fixed-point and equilibrium solutions. Figure 6 shows a typical program structure with a 2-nested loop. The inner loop computes on a data array, and the outer loop repeats this computation until the result is good enough. Informally, the loop can be said to have two dimensions, with the inner loop being the space dimension and the outer loop being the time dimension. The inner loop is usually parallel, but the outer loop is not, because the same data is used and updated in every time step. In addition, the convergence check, at line C in Figure 6, must be done after one time step and before the next. Without speculation support, a parallel version must use a barrier to ensure that the next iteration waits for the current iteration to finish completely. With BOP, however, the next iteration can start before the current one finishes. At the end, when the converged variable is set, the write access causes a dependence conflict; BOP detects the conflict and aborts the speculation beyond the convergence point. Before the convergence, different time iterations may overlap. The transformation is known as time skewing [39, 48].

original sequential code:

    while ( not converged )
        for i is 1 to N
            r = compute( data[ i ] )
            s = add_result( r )
        end for                          // end of i-loop
        if good_enough?( s )
    C:      converged = true

parallelization by BOP:

    while ( not converged )
        for i is 1 to N
            bop_parallel {
                r = compute( data[ i ] )
                bop_deferrable {
                    if ( i > 1 ) bop_wait( my_ppr - 1 )
                    s = add_result( r )
                    bop_fill( my_ppr, cur )
    B:              bop_post( my_ppr )
                }                        // end bop_deferrable
            }                            // end bop_parallel
        end for                          // end of i-loop
        if good_enough?( s )
    C:      converged = true

Figure 6: Example iterative computation and its parallelization by BOP.
There is no barrier at the end of the while-loop, which permits speculative parallelism. The parallelism from time skewing is data dependent. Time skewing is likely to succeed if the beginning part of the next iteration accesses different data from the ending part of the current iteration. The parallelism may be highly dynamic, and the BOP execution may have frequent rollbacks and restarts. The parallelism does not extend beyond two outermost loop iterations, since consecutive iterations as a whole always have at least one conflict. However, the program does not have to wait for every task to finish after one iteration and before starting another. The removal of the barrier synchronization allows the waiting by one task to be overlapped with the execution of subsequent tasks. This is useful in hiding the overhead of serial correctness checking in BOP. It is also useful in tolerating imbalances in task sizes and variations due to the processor or system environment.

QT clustering  An example of iterative computation is quality threshold (QT) clustering. The algorithm was invented in 1999 for gene clustering [23]. In each iteration, the algorithm finds the cluster around every node, which contains all nodes within a given distance, and then picks the largest cluster. The next iteration repeats the process for the remaining nodes until all nodes are clustered. QT clustering is computationally intensive, but it addresses two shortcomings of traditional k-means clustering: one does not have to know the number of clusters beforehand, and the result is deterministic, not depending on initial choices of cluster centroids.

To parallelize QT clustering, we need speculative post-wait to communicate dynamic data. In each time step, all clusters are compared to find the largest cluster. This is done by a post-wait region that passes the cluster object global_largest from task to task in the sequential order (see the sketch at the end of this section). Since the size of a cluster is not known in advance, we allocate an array of size n, the total number of nodes. While the code posts the entire array, the BOP run-time system posts only the modified pages, so it communicates only a part of the array with a size proportional to the size of the cluster.

We coded the basic algorithm, in which stack variables were private while heap and global data were shared (unless declared bop private). Since we have not implemented the deferrable block, we manually strip-mined the clustering loop. We inserted three lines of hints: one for parallelism and two for a post-wait pair. To enable time skewing, we structured the code carefully so as to avoid unnecessary dependence conflicts. We found that the difficulty of time skewing is mostly algorithmic. In implementation, the base code is sequential, and BOP preserves its semantics. BOP parallelization did not go much beyond sequential programming and sequential algorithm adaptation, and we could use existing tools such as compilers, debuggers, and profilers.

We tested our prototype implementation using two machines: one with 8 Intel Nehalem cores and the other with 4 Intel Kentsfield cores, all running at 2.66 GHz. We used a large input; the sequential running time, with no BOP involvement, was 70 seconds. With BOP, the program finished in 7 seconds, a speedup of 10. This clearly showed the benefit of using two machines, since the single-machine improvement would be bounded by 8.
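The sketch below illustrates the post-wait chain used in the cluster comparison. Here cluster_t, chan, find_largest_in_block, and the data-carrying bop_post_data binding are hypothetical names introduced for this example, not the literal BOP interface.

    #define MAX_NODES 4096                    /* illustrative bound on n   */

    typedef struct {                          /* hypothetical cluster record */
        int size;
        int nodes[MAX_NODES];
    } cluster_t;

    extern void bop_wait(int channel);                       /* illustrative */
    extern void bop_post_data(int channel, void *addr,
                              unsigned long bytes);          /* illustrative */
    extern int  chan(int step, int block);    /* channel id per (step, block) */
    extern void find_largest_in_block(int b, cluster_t *out);

    cluster_t global_largest;                 /* shared global datum        */

    /* Body of one PPR task: block b of time step t compares its local
       largest cluster against the running maximum received from its
       predecessor, then forwards the (possibly updated) maximum. */
    void compare_clusters(int t, int b) {
        cluster_t block_largest;
        find_largest_in_block(b, &block_largest);
        bop_wait(chan(t, b - 1));             /* predecessor's running max */
        if (block_largest.size > global_largest.size)
            global_largest = block_largest;
        bop_post_data(chan(t, b), &global_largest,
                      sizeof global_largest); /* forward the running max   */
    }

Although the whole structure is named in the post, only its modified pages are sent by the run-time system, as described above.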
8 Related Work

Distributive parallelization  Parallelism may be inferred by a parallelizing compiler. Powerful compiler techniques such as dependence, array-section, alias, and inter-procedural analysis can extract coarse-grained parallelism in scientific code [1, 2, 8, 18, 19, 21, 47]. Automatic parallelization can be improved by language and tool support. High Performance Fortran (HPF) lets a user suggest data distribution in Fortran code [1, 2]. Cleanly designed languages can better express and implement parallelism through high-level primitives and advanced techniques such as regions and lazy evaluation, as in ZPL [9] and SAC [38]. A user can help compiler parallelization using an interactive tool [29]. Like HPF, BOP provides hints for use in a traditional language. While HPF hints rely on the compiler for parallelization, BOP hints express parallelism directly.

A recent approach to distributed parallelization is to automatically translate a Fortran program from using OpenMP to using MPI [4–6]. Parallelization and communication are inserted and optimized by the compiler, which uses static analysis for regular loops and inserts run-time inspectors in irregular loops. With distributed OpenMP, a user can program distributed-memory machines using a shared-memory parallel language. BOP strives for the same goal using a suggestion language in which dependences are specified using post-wait rather than automatically inferred. Post-wait can be used by a compiler or a profiling-based analyzer, so BOP does not have to rely on complete compiler support for correctness. As a result, BOP can handle general C/C++ code.

Most automatic techniques target regular computations on arrays and cannot directly handle irregular computations on dynamic data. One exception is Jade, which uses a combination of user annotations and type checking to infer program data accesses and enable dynamic parallelization [37]. Data specification in Jade and dependence hints in BOP represent very different design choices. Data access is the cause of dependences, and data specification in Jade can be used by a run-time system to identify all dependences. BOP allows partial dependence specification: only the frequent and immediate dependences necessary to enable parallelization need be specified. The BOP run time detects infrequent dependences and satisfies long-range dependences automatically without any specification. Jade aims for complete safety and automatically optimized parallelization; BOP tolerates incomplete program knowledge and supports more direct control in parallelization.

Parallel programming languages  Parallel languages are based on either single-program multiple-data (SPMD) [15] or fork-join parallelism. The SPMD model is used in distributed-memory programming, as by MPI, PVM, software shared distributed memory (S-DSM) systems such as TreadMarks [3] and InterWeave [41], and PGAS languages such as UPC [44] and Co-array Fortran [32]. Data by default is not shared. Shared data can be declared as shared pages in S-DSM and as global arrays in PGAS languages and X10 [10]. SPMD programming requires manual partitioning of computation and placement of data. Although such requirements may be necessary for scalable parallel computing, they make SPMD programming considerably more difficult than sequential programming. Fork-join is used in shared-memory programming, as by OpenMP, Cilk, Java, transactional memory, and many other languages in which parallel programming is done by annotating a sequential program.
These languages specify parallel tasks using parallel loops, regions, or future-style primitives such as spawn in Cilk [17] and async in X10 [10]. Synchronization is done using concurrency constructs such as atomic sections and transactions. Concurrency constructs are more flexible because they do not restrict a program to sequential semantics. However, the flexibility is also a burden, since a programmer must ensure correctness under all possible thread interleavings. Concurrency constructs cannot be used by profiling-based automatic parallelization, and for safety they require a user to have complete knowledge of all program dependences that may impair parallel execution. BOP overcomes these shortcomings with the dependence-based construct of post-wait and with run-time speculation support. It ensures sequential semantics at the cost of requiring monitoring and in-order commit. It tolerates these costs by using an asynchronous design and by enabling more aggressive forms of parallelism.

Software speculative parallelization  Loop-level software speculation was pioneered by the LRPD test for Fortran loops [36]. Many improvements have been developed, including array replication and stronger progress guarantees [14, 20]. The support has been extended to C programs using annotations [11], compiler support [43, 50], and techniques developed for transactional memory [31, 33, 40]. Speculation support has been developed for Java to support safe futures [46], return-value speculation [35], and speculative data abstractions [28]. Our implementation uses ideas from [16, 24, 52] and is based on processes instead of threads. Process-based systems use the virtual memory protection mechanism, which transparently provides strong isolation, on-demand data replication, and complete abortability. A recent system, Grace, removes concurrency problems such as race conditions and deadlocks by implementing threads using processes and employing similar monitoring and protection mechanisms [7].

Previous speculation systems have a limited interface for expressing dependences, and they run on shared-memory systems. We aim to develop a more general programming interface. For example, in a software implementation, page-level protection raises the problem of false sharing. Dependence hints such as post-wait can be used to tolerate false sharing by serializing access to falsely shared pages, which is especially useful for dense array data. This report describes support for distributed speculation, task coarsening, and various language extensions such as the serial region and the pipelined loop.

Dependence hints  Post and wait were coined by Cytron for do-across loop parallelism [13]. They are paired by an identifier, which is usually a data address. BOP post-wait specifies the data for communication. By integrating it with speculation support, we have made it a hint rather than a directive and used it to construct other speculative parallel constructs such as serial regions and pipelined loops. In the context of speculative parallelization, a signal-wait pair can be used for inter-thread synchronization [49]. Signal-wait did not take parameters, so a compiler was used to determine the data transfer and had to do so correctly. In an ordered transaction, the keyword flow specifies that a variable read should wait until either a new value is produced by the previous transaction or all previous transactions finish [45]. Flow is equivalent to wait with an implicit post (for the first write in the previous task).
Ordered transactions are implemented using hardware speculation support. It is unclear how they cope with errors such as a read before flow. They cannot easily be used to communicate dynamically allocated data as in our example in Section 2.

9 Summary

We have presented the design of the BOP language, consisting of four types of hints: parallelism, explicit dependence, computation reordering, and correctness checking. It provides a sequential programming model and a hybrid solution to data sharing, in which a user marks likely parallel tasks, identifies their immediate data sharing using dependence hints, and leaves the remaining data sharing unspecified. The BOP system implements distributed communication with a mix of user-directed synchronous updates and automatic asynchronous updates.

We have demonstrated the expressiveness of the suggestion language in two ways. First, we showed that the hints can be used to implement other language constructs such as serial regions, shared variables, and pipelined loops. Second, we showed that dependence hints enable time skewing, which removes the need for barrier communication in many types of iterative solvers. Until recently, a user faced with the task of extracting coarse-grained parallelism in complex code had to choose between a compiler to automate the task and a language extension to parallelize a program completely by hand. BOP provides a middle ground where the user manually specifies high-level parallelism and known dependences, and the run-time system automates distributed parallelization and communication.

Acknowledgments

We wish to thank Chao Zhang and Xiaoming Gu for their help with the implementation and testing, and Michael Scott, Xipeng Shen, and Kai Shen for the helpful discussions and for pointing out related work and possible applications.

References

[1] V. S. Adve and J. M. Mellor-Crummey. Using integer sets for data-parallel program analysis and optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 186–198, 1998.

[2] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers, Oct. 2001.

[3] C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, 1996.

[4] A. Basumallik and R. Eigenmann. Towards automatic translation of OpenMP to MPI. In International Conference on Supercomputing, pages 189–198, 2005.

[5] A. Basumallik and R. Eigenmann. Optimizing irregular shared-memory applications for distributed-memory systems. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 119–128, 2006.

[6] A. Basumallik, S.-J. Min, and R. Eigenmann. Programming distributed memory systems using OpenMP. In Proceedings of the International Parallel and Distributed Processing Symposium, pages 1–8, 2007.

[7] E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: Safe multithreaded programming for C/C++. In Proceedings of OOPSLA, 2009.

[8] W. Blume et al. Parallel programming with Polaris. IEEE Computer, 29(12):77–81, 1996.

[9] B. Chamberlain, S.-E. Choi, C. Lewis, C. Lin, L. Snyder, and W. Weathersby. ZPL: A machine independent programming language for parallel computers. IEEE Transactions on Software Engineering, 26(3):197–211, Mar. 2000.

[10] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar.
X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of OOPSLA, pages 519–538, 2005.

[11] M. H. Cintra and D. R. Llanos. Design space exploration of a software speculative parallelization scheme. IEEE Transactions on Parallel and Distributed Systems, 16(6):562–576, 2005.

[12] A. L. Cox and R. J. Fowler. Adaptive cache coherency for detecting migratory shared data. In Proceedings of the International Symposium on Computer Architecture, pages 98–108, 1993.

[13] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, pages 836–844, St. Charles, IL, Aug. 1986.

[14] F. Dang, H. Yu, and L. Rauchwerger. The R-LRPD test: Speculative parallelization of partially parallel loops. In Proceedings of the International Parallel and Distributed Processing Symposium, pages 20–29, Ft. Lauderdale, FL, Apr. 2002.

[15] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing, 7(1):11–24, 1988.

[16] C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. Software behavior oriented parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 223–234, 2007.

[17] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 212–223, 1998.

[18] J. Gu and Z. Li. Efficient interprocedural array data-flow analysis for automatic program parallelization. IEEE Transactions on Software Engineering, 26(3):244–261, 2000.

[19] M. Gupta and P. Banerjee. PARADIGM: A compiler for automatic data distribution on multicomputers. In International Conference on Supercomputing, pages 87–96, 1993.

[20] M. Gupta and R. Nim. Techniques for run-time parallelization of loops. In Proceedings of SC'98, page 12, Nov. 1998.

[21] M. W. Hall, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, and M. S. Lam. Interprocedural parallelization analysis in SUIF. ACM Transactions on Programming Languages and Systems, 27(4):662–731, 2005.

[22] R. H. Halstead. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, Oct. 1985.

[23] L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 1999.

[24] Y. Jiang and X. Shen. Adaptive software speculation for enhancing the cost-efficiency of behavior-oriented parallelization. In Proceedings of the International Conference on Parallel Processing, pages 270–278, 2008.

[25] K. Kelsey, T. Bai, and C. Ding. Fast track: A software system for speculative optimization. In Proceedings of the International Symposium on Code Generation and Optimization, pages 157–168, 2009.

[26] M. Kulkarni, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. P. Chew. Optimistic parallelism benefits from data partitioning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 233–243, 2008.

[27] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 211–222, 2007.

[28] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew.
Optimistic parallelism requires abstractions. Communications of the ACM, 52(9):89–97, 2009.

[29] S.-W. Liao, A. Diwan, R. P. Bosch Jr., A. M. Ghuloum, and M. S. Lam. SUIF Explorer: An interactive and interprocedural parallelizer. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 37–48, 1999.

[30] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical report, U. Wisconsin-Madison, 1997.

[31] M. Mehrara, J. Hao, P.-C. Hsu, and S. A. Mahlke. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 166–176, 2009.

[32] R. Numrich and J. Reid. Co-array Fortran for parallel programming. ACM Fortran Forum, 17(2):1–32, 1998.

[33] C. E. Oancea, A. Mycroft, and T. Harris. A lightweight in-place implementation for software thread-level speculation. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 223–232, 2009.

[34] J.-K. Peir and R. Cytron. Minimum distance: A method for partitioning recurrences for multiprocessors. IEEE Transactions on Computers, 38(8):1203–1211, 1989.

[35] C. J. F. Pickett and C. Verbrugge. Software thread level speculation for the Java language and virtual machine environment. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 304–318, 2005.

[36] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.

[37] M. C. Rinard and M. S. Lam. The design, implementation, and evaluation of Jade. ACM Transactions on Programming Languages and Systems, 20(3):483–545, 1998.

[38] S.-B. Scholz. Single Assignment C: Efficient support for high-level array operations in a functional setting. Journal of Functional Programming, 13(6):1005–1059, 2003.

[39] Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 215–228, Atlanta, GA, May 1999.

[40] M. F. Spear, K. Kelsey, T. Bai, L. Dalessandro, M. L. Scott, C. Ding, and P. Wu. Fastpath speculative parallelization. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, 2009.

[41] C. Tang, D. Chen, S. Dwarkadas, and M. L. Scott. Integrating remote invocation and distributed shared state. In Proceedings of the International Parallel and Distributed Processing Symposium, 2004.

[42] W. Thies, V. Chandrasekhar, and S. P. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture, pages 356–369, 2007.

[43] C. Tian, M. Feng, V. Nagarajan, and R. Gupta. Copy or discard execution model for speculative parallelization on multicores. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture, pages 330–341, 2008.

[44] UPC Consortium. UPC language specification v1.2. Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.

[45] C. von Praun, L. Ceze, and C. Cascaval. Implicit parallelism with ordered transactions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Mar. 2007.
[46] A. Welc, S. Jagannathan, and A. L. Hosking. Safe futures for Java. In Proceedings of OOPSLA, pages 439–453, 2005.

[47] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.

[48] D. Wonnacott. Achieving scalable locality with time skewing. International Journal of Parallel Programming, 30(3), June 2002.

[49] A. Zhai, J. G. Steffan, C. B. Colohan, and T. C. Mowry. Compiler and hardware support for reducing the synchronization of speculative threads. ACM Transactions on Architecture and Code Optimization, 5(1):1–33, 2008.

[50] A. Zhai, S. Wang, P.-C. Yew, and G. He. Compiler optimizations for parallelizing general-purpose applications under thread-level speculation. Pages 271–272, New York, NY, USA, 2008. ACM.

[51] C. Zhang, C. Ding, X. Gu, K. Kelsey, and T. Bai. Continuous speculative program parallelization in software. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.

[52] C. Zhang, C. Ding, K. Kelsey, T. Bai, X. Gu, and X. Feng. A language of suggestions for program parallelization. Technical Report URCS #948, Department of Computer Science, University of Rochester, 2009.

[53] E. Z. Zhang, Y. Jiang, and X. Shen. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.