A Language of Suggestions for Program Parallelization

Chao Zhang‡, Chen Ding†, Kirk Kelsey†, Tongxin Bai†, Xiaoming Gu†, Xiaobing Feng⋆

† Department of Computer Science, University of Rochester
⋆ Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
‡ Intel Beijing Lab

The University of Rochester
Computer Science Department
Rochester, NY 14627

Technical Report 948
July 2009

Abstract

Coarse-grained task parallelism exists in sequential code and can be leveraged to boost the use of chip multi-processors. However, large tasks may execute thousands of lines of code and are often too complex to analyze and manage statically. This report describes a programming system called suggestible parallelization. It consists of a programming interface and a support system. The interface is a small language with three primitives for marking possibly parallel tasks and their possible dependences. The support system is implemented in software and ensures correct parallel execution through speculative parallelization, speculative communication, and speculative memory allocation. It manages parallelism dynamically to tolerate unevenness in task size, inter-task delay, and hardware speed. When evaluated using four full-size benchmark applications, suggestible parallelization obtains up to a 6-times speedup over 10 processors for sequential legacy applications up to 35 thousand lines in size. The overhead of software speculation is not excessively high compared to unprotected parallel execution.

The research is supported by the National Science Foundation (Contract No. CNS-0834566, CNS-0720796), an IBM CAS Faculty Fellowship, and a gift from Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.

1 Introduction

It is generally recognized that although automatic techniques are effective in exploiting loop parallelism (Hall et al., 2005; Blume et al., 1996; Allen and Kennedy, 2001; Wolfe, 1996), they are not sufficient to uncover parallelism among complex tasks in general-purpose code. Explicit parallel programming, however, is not yet a satisfactory solution. On the one hand, there are ready-to-use parallel constructs in mainstream languages such as Fortran, C, C++, and Java and in threading libraries such as Microsoft .NET and Intel TBB. On the other hand, parallel programming is still difficult because of non-determinism: a program may run fine in one execution but generate incorrect results or run into a deadlock in the next because of a difference in thread interleaving.

One possible solution, explored in this report, is to let a user identify parallelism but let an automatic system carry the burden of safe implementation. We present suggestible parallelization, in which a user (or an analysis tool) parallelizes a sequential program by inserting suggestions, or cues. A cue indicates likely rather than definite parallelism. It exists purely as a hint and has no effect on the semantics of the host program: the output of an execution with suggestions is always identical to that of the original sequential execution without suggestions, independent of task scheduling and execution interleaving.

We address the problem of extracting coarse-grained parallelism from legacy sequential code. There are at least two main difficulties.
First, the parallelism among large tasks may depend on external factors such as program input. Second, such parallelism depends on the program implementation. It is often unclear whether a large piece of complex code can be executed in parallel, especially if it uses third-party libraries, including proprietary libraries in binary form. The benefit of parallelization hinges not only on these uncertain dependences but also on the unknown granularity and resource usage. The goal of suggestible parallelization is to relieve a programmer of such concerns and shift the burden to the implementation system. It aims to enable a programmer to parallelize a complex program without complete certainty about its behavior.

We use a form of software speculative parallelization, which executes a program in parallel based on suggestions, monitors for conflicts, and reverts to sequential execution when a conflict is detected. It is based on a system called BOP, which implements speculation tasks using Unix processes because it assumes that part of the program code is too complex to analyze; therefore, all data may be shared, and most memory accesses may reach any shared data (Ding et al., 2007). During execution, BOP determines whether a set of tasks are parallel and whether they have the right size and resource usage to benefit from parallel execution. It has been used to parallelize several pieces of legacy code as well as one usage of a commercial library.

The BOP system has two significant limitations. At the programming level, there is no way to indicate possible dependences between tasks. At the implementation level, it executes one task group at a time and does not start new speculation until the entire group finishes. In this report, we extend the BOP system with three additions.

• The suggestion language: We use a construct from BOP to suggest possible parallelism. We extend Cytron's post/wait construct (Cytron, 1986) to suggest task dependences. These suggestions may be incorrect or incomplete.
• Continuous speculation: We build a new run-time system that supports irregular task sizes and uneven hardware speed by starting a speculation task whenever a processor becomes available.
• Parallel memory allocation: We support parallel memory allocation and deallocation during continuous speculation.

Speculative parallelization has been an active area of research in the past two decades, and related issues are being addressed in a growing body of literature on atomic sections and transactions. Most problems of software speculation have been studied in thread-based solutions (Mehrara et al., 2009; Tian et al., 2008; Kulkarni et al., 2007; Cintra and Llanos, 2005; Welc et al., 2005; Dang et al., 2002; Gupta and Nim, 1998; Rauchwerger and Padua, 1995). The problems are different in a process-based solution. For example, a thread replicates data at the array or object level and uses indirection to access it speculatively; a process replicates the virtual address space and accesses data without indirection. As a result, our design has to use different types of meta-data and different strategies in heap management (memory allocation in particular), correctness checking, and error recovery.

Speculation incurs significant run-time costs. These include not just the effort wasted on failed speculation but also the overhead of monitoring and checking, which is necessary even when all speculation succeeds.
We will evaluate the benefit and cost of our design using full-size benchmark programs and show that the loss in efficiency is not excessive compared to unprotected parallel execution.

2 The Suggestion Language

The suggestion language has three primitives, shown in Figure 1(a). We first describe them using a running example and then discuss the interaction with the speculation system.

2.1 Spawn and Skip

Spawn-and-skip (SAS) encloses a code block, as shown in Figure 1(a). When an execution reaches a SAS block, it spawns a new speculative task. The new task skips ahead and executes the code after the block, while the spawning task executes the code in the block. SAS parallelism cascades in two ways: it is nested if the block code reaches another SAS block and spawns an inner task, or it is iterative (non-overlapping) if the post-block code reaches a SAS block and spawns a peer task.

Although different in name and syntax, SAS is identical in meaning to the construct called a possibly parallel region (PPR). Although a single construct, PPR can express function parallelism, loop parallelism, and their mixes (Ding et al., 2007). In this report, we use SAS and PPR as synonyms: one reflects the operation, the other the intent. As with PPR, we allow only non-overlapping spawns (by suppressing SAS calls inside a SAS block).

    SAS { //code block }         spawn and skip: fork a speculative process that
                                 executes from the end of the code block
    Post(channel, addr, size)    post to channel a total of size bytes
                                 from address addr
    Wait(channel)                wait for data from channel

    (a) The language has three suggestions: SAS for possible parallelism,
        and Post-Wait for possible dependences.

    queue work, result

    work.display
    while (! work.empty) {
        w = work.dequeue
        t = w.process
        result.enqueue( t )
    }
    result.display

    (b) A work loop with two display calls.

    SAS {
        work.display
    }
    chid = 0
    while ( ! work.empty ) {
        w = work.dequeue
        SAS {
            t = w.process
            if ( chid > 0 ) Wait(chid)
            result.enqueue( t )
            Post(chid+1, &result.tail, sizeof(...))
        }
        chid ++
    }
    result.display

    (c) Suggested parallelization with mixed function and loop parallelism.

Figure 1: The suggestion language in (a), illustrated by an example in (b,c).

We demonstrate the suggestion language using the example program in Figure 1(b). The main body is a work loop, which takes a list of items from the work queue, invokes the process function on each item, and stores the result in the result queue. It displays the work and result queues before and after the work loop. The code does not contain enough information for static parallelization: the size of the work queue is unknown; the process call may execute thousands of lines of code with behavior dependent on the work item; and the queue may be implemented with third-party code such as the standard template library.

The example program may be parallelizable. This possibility is expressed in Figure 1(c) using two SAS blocks. The first block indicates the possible (function) parallelism between the pre-loop display and the loop. The second block indicates the possible (loop) parallelism. They compose into mixed parallelism: the pre-loop display may happen in parallel with the loop execution.

SAS is a general primitive. An execution of a SAS block yields the pairwise parallelism between the code block and its continuation. It is composable: an arbitrary degree of parallelism can be obtained by different or repeated SAS block executions. The recursive nature means that parallel tasks are spawned one at a time.
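To make the spawn-and-skip operation concrete, the sketch below shows how a single SAS block maps onto Unix fork in a process-based implementation. It is an illustration only, not the BOP implementation: the helpers setup, the_block, continuation, and sas_commit are hypothetical placeholders, and all monitoring, checking, and recovery machinery is omitted.

    #include <unistd.h>

    /* Hypothetical helpers standing in for real program code and for the
     * speculation run-time; none of these names come from the BOP system. */
    extern void setup(void), the_block(void), continuation(void);
    extern void sas_commit(void);  /* would check correctness and publish changes */

    void example(void)
    {
        setup();                    /* code before the SAS block */

        if (fork() != 0) {          /* spawning task: executes the code block */
            the_block();            /* body of SAS { ... } */
            sas_commit();           /* correctness check and group commit */
            _exit(0);               /* this process's share of the work is done */
        }

        /* new speculative task: skips ahead immediately and executes the
         * continuation, in parallel with the_block() in the parent; fork's
         * copy-on-write gives it a private snapshot of the address space */
        continuation();
    }

Because each spawn is a fork, iterative spawning (a SAS block inside a loop) naturally produces a chain of processes, one per iteration, which is the cascading behavior described above.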
Because of sequential spawning, the code between SAS tasks naturally supports sequential dependences outside SAS blocks, such as incrementing a loop index variable. Dependences inside SAS blocks, however, need different support, as we discuss next.

2.2 Post and Wait

Post and wait are shown with their complete parameters in Figure 1(a). They express possible dependences between SAS tasks. A data dependence requires two actions for support:

• communication: the new value must be transferred from the source of the dependence to the sink
• synchronization: the sink must wait for the source to finish and the new value to arrive

Communication is done through a named channel. SAS tasks are processes and have identical virtual address spaces. The posted data is identified by its address and size. A post is non-blocking, but a wait blocks upon reading an empty channel and provides synchronization by suspending the calling task. After the channel is filled, the waiting task retrieves all data from the channel and then continues execution.
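In a process-based implementation, a channel can be realized with an ordinary Unix pipe: a post writes the bytes into the pipe, and a wait blocks reading them back into the same virtual address in the receiving process, which is valid precisely because all tasks share the same address-space layout. The sketch below is a minimal illustration under that assumption; it omits the filtering of unmodified pages, all conflict checks, and the handling of partial reads and writes that a real implementation would need.

    #include <stddef.h>
    #include <unistd.h>

    /* Sketch: one pipe per channel, created with pipe() before the task
     * processes fork, so every task inherits both ends. */
    enum { MAX_CHANNELS = 64 };
    static int chan_fd[MAX_CHANNELS][2];  /* [c][0] read end, [c][1] write end */

    void post(int c, void *addr, size_t size)
    {
        /* send the address, the length, and the bytes themselves */
        write(chan_fd[c][1], &addr, sizeof addr);
        write(chan_fd[c][1], &size, sizeof size);
        write(chan_fd[c][1], addr, size);
    }

    void wait_on(int c)
    {
        void *addr;
        size_t size;
        /* read() blocks on an empty pipe: this is the synchronization */
        read(chan_fd[c][0], &addr, sizeof addr);
        read(chan_fd[c][0], &size, sizeof size);
        read(chan_fd[c][0], addr, size);   /* deposit the data at the same
                                              virtual address in this task */
    }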
SAS post-wait provides a means of speculative communication. It has two mutually supportive features.

Unrestricted use. The suggestion language places no restrictions on the (mis)use of parallelization hints as far as program correctness is concerned. This is valuable considering the many ways a programmer may make mistakes. A hint may be incorrect: the source, the sink, or the data location of the dependence may be wrong; a post may miss its matching wait or vice versa; or they may cross-talk on the wrong channels. A hint may also be imprecise. For example, a post may send a large array in which only one cell is new. Finally, hints may be incomplete or excessive: they may express too few or too many dependences.

Run-time support. The SAS run-time system has two important properties. First, it guarantees sequential progress (through the use of competitive recovery, to be discussed in Section 2.3). An incorrect or mismatched post or wait is ignored when the front of the recovery execution sweeps past it. Second, the system monitors the data accesses of each task. At a post, the system communicates only memory pages that have been modified, avoiding unnecessary data transfer. (The run-time system also pools multiple posts and delivers them together to a waiting task, effectively aggregating the data transfer.) The monitoring system may also detect errors in suggestions and missed dependences and report them to a user.

With unrestricted use and run-time support, the suggestion language acquires unusual flexibility in expressiveness: not only does an incorrectly annotated program run correctly, but it may run correctly in parallel. Our running example demonstrates this flexibility.

In the example loop, each iteration inserts into the result queue, creating a dependence on the insertions in previous iterations. In Figure 1(c), the dependence is expressed by inserting a wait before the list insertion and a post after it. At run time, they generate the synchronization and communication by which each task first waits for the previous insertion, then receives the new tail of the list, and finally makes its own insertion and sends the result to the next task.

The code in Figure 1(c) uses a channel identifier, chid, for two purposes. One is to avoid waiting in the first iteration. The other is to ensure that communication occurs only between consecutive iterations. Without the channel numbering, the program may lose parallelism, but it will still produce the correct result. With the channel numbering, the suggestions still have a problem: the post in the last iteration is redundant. The redundant post may not be removable if the last iteration of the while-loop is unknown until the loop exits or if there are hidden early exits. Conveniently for a programmer, the extra post is harmless: because of the speculative implementation, it affects neither correctness nor the degree of parallelism.

Suggestions can be occasionally wrong or incomplete, while parallelization is still accomplished if some of them are correct. This feature represents a new type of expressiveness in a parallel programming language. It allows a user to utilize incomplete knowledge about a program and parallelize partial program behavior.

2.3 Implicit Synchronization and Competitive Recovery

A conflict happens in a parallel execution when two concurrent tasks access the same datum and at least one task writes to it. This is known as the Bernstein condition (Allen and Kennedy, 2001). Since conflicts produce errors, they are not allowed in conventional parallel systems. In SAS, conflicts are detected and used to enable implicit synchronization. As a result, errors happen as part of a normal execution. This leads to the use of competitive recovery, which starts error recovery early, assuming an error will happen later.

Implicit synchronization. A parallel task has three key points in time: the start point, the end point, and the use point. Since there can be many use points, most parallel languages let a user specify a synchronization point in code that precedes all use points. The synchronization point can be implicit or explicit. In OpenMP, a barrier synchronization is implied at the end of a parallel loop or section. Other languages use explicit constructs, such as sync in Cilk and get in Java.

For ease of programming, SAS does not require a user to write explicit synchronization. The first use point of a SAS task happens at the first parallel conflict, which is detected automatically by the speculation substrate. Implicit synchronization permits a user to write parallel tasks without complete knowledge of their use points. In the code in Figure 1(c), for example, the last SAS task exits the loop and displays the result queue. Since some of the results may not yet have been added, the last task may make a premature use of the queue and incur a conflict that implicitly delays the display until all results are ready. The synchronization happens automatically, without user intervention.

Sometimes a user knows about use points and can suggest synchronization by inserting posts and waits. In the example loop, properly inserted post-wait pairs may allow existing results to be displayed while new results are being computed. If the user misses some use points, implicit synchronization still provides a safety net for correctness.

Competitive recovery. Speculation may fail in two ways: either a speculative task incurs a conflict, or it takes too long to finish. SAS overcomes both types of failure with an understudy task. The understudy task starts from the last correct state and non-speculatively executes the program alongside the speculative execution. It has no overhead and executes at the same speed as the unmodified sequential code.
If speculation fails because of an error or high overhead, the understudy will still produce the correct result, and produce it as quickly as the would-be sequential run. Figuratively, the understudy task and the speculation tasks are two teams racing to finish the same work, and the faster team wins. In the worst case, when all speculation fails, the program finishes in about the same time as the fastest sequential run. We call the scheme competitive recovery because it guarantees (almost) zero extra slowdown due to recovery.

The speculation-understudy race is a race between parallel and sequential execution, but it is more than a race. Neither the understudy nor the speculation has to execute the entire program. When either one finishes the work, the other is aborted and restarted; the last finish line becomes the new starting line. Thus, competitive recovery is as much a parallel-sequential cooperation as a parallel-sequential competition.

The example code in Figure 1(b,c) may have too little computation for a parallel execution, especially one using heavyweight processes. In that case, competitive recovery will finish executing the loop first and abort all speculation. If the amount of computation is input dependent, the run-time system switches between sequential and parallel modes to produce the fastest speed, without any additional burden on the programmer.

3 Continuous Speculation

3.1 Basic Concepts

A speculation system runs speculative tasks in parallel while ensuring that they produce the same result as the sequential execution. It deals with three basic problems:

• monitoring data accesses and detecting conflicts: recording shared data accesses by parallel tasks and checking for dependence violations
• managing speculative states: buffering and maintaining multiple versions of shared data and directing data accesses to the right version
• recovering from erroneous speculation: preserving a safe execution state and re-executing the program when speculation fails

A speculation task can be implemented by a thread or a process. Most software methods target loop parallelism and use threads. These methods rely on a compiler or a user to identify shared data and redirect shared data accesses when needed. A previous system, BOP, shows that processes are cost-effective for very coarse-grain tasks (Ding et al., 2007). BOP protects the entire address space. It divides program data into four types (shared data, likely private data, value-checked data, and stack data) and manages them differently.

A process has several important advantages when used to implement a general suggestion language. First, a process can be used to start a speculation task anywhere in the code, allowing parallelism from arbitrary SAS blocks to be mixed. Second, a process is easier to monitor. Using the page protection mechanism, a run-time system can detect which pages are read and modified by each task without instrumenting the program. For example, there is no instrumentation for shared data or its access in the parallelized code in Figure 1(c). Third, processes are well isolated from each other's actions. A modern OS performs copy-on-write and replicates a page when it is shared by two tasks. Logically, all tasks have identical but separate address spaces. This leads to easy error recovery: we can simply discard an erroneous task by terminating the task process.
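The page-protection mechanism behind the second advantage works roughly as follows: pages start with no access permission, and the first touch by a task triggers a fault whose handler records the page in the task's access map and then opens the page. The sketch below illustrates the idea; it is not the BOP implementation, record_access is a hypothetical helper, and a real system would additionally distinguish read faults from write faults to build separate read and write maps.

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096
    extern void record_access(void *page);  /* hypothetical: adds the page to
                                               this task's access map */

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        /* round the faulting address down to its page */
        void *page = (void *)((uintptr_t)si->si_addr
                              & ~(uintptr_t)(PAGE_SIZE - 1));
        record_access(page);
        /* open the page: later accesses run at full speed, unmonitored */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void install_monitor(void *region, size_t len)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(region, len, PROT_NONE);  /* force a fault on the first
                                              access to every page */
    }

Only the first access to each page pays the monitoring cost, which is why the per-access overhead stays low for coarse-grained tasks.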
3.2 Managing Speculative Access

For coarse-grain tasks, most of the cost comes from managing data access. A process-based design has three advantages in minimizing the cost of shared data access. First, page protection incurs a monitoring step only at the first access; there is no monitoring of subsequent accesses to the same page. Second, each task has its private copy of the address space, so data access is direct and has no indirection overhead. Finally, we use lazy conflict detection, that is, checking correctness after a task finishes rather than while the task executes, so there is no synchronization delay for shared data accesses.

Although it incurs almost no overhead at the time of a speculative access, a process-based design has several downsides. First, the cost of creating a process is high; however, if a task has a significant length, the cost is amortized. Second, page-level monitoring leads to false sharing and loss of parallelism. The problem can be alleviated by data placement, which is outside the scope of this report. Finally, modified data must be explicitly moved between tasks. Unless one modifies the operating system, the data must be explicitly copied from one address space to another. To reduce the copying cost, we use group commits.

Group commit. As part of the commit, a correct speculation task must explicitly copy its changed data to later tasks. For a group of k concurrent tasks on k processors, each task can copy its data in two ways: either to the next task or to the last task of the group. The first choice incurs a copying cost quadratic in k, because the same data may be copied k − 1 times in the worst case. The second choice has the minimal, linear cost, where each changed datum is copied only once. We call this choice group commit or group update. It is a unique requirement in process-based speculation. We next show how to support group commit in continuous speculation, when new tasks are constantly being started before previous tasks finish.

3.3 Dual-group Activity Window

On a machine with k processors, the hardware is fully utilized if we can always run k active tasks in parallel. Continuous speculation tries to accomplish this by starting the next speculation task whenever a processor becomes available. At each moment of execution, speculative activities are contained in the activity window, which is the set of speculative tasks that have started but not yet completed. Here completion means the time after correctness checking and group commit. In SAS, tasks are spawned one at a time, so there is a total order on their starting times. We speak of "earlier" and "later" tasks based on their relative spawning order.

Conceptually, tasks in an activity window are analogous to instructions in an instruction-reordering buffer (IRB) on a modern processor. Continuous speculation is analogous to instruction lookahead, but the implementation is very different: a task has an unpredictable length, can accumulate speculative state as large as its address space allows, and should commit as part of a group.

A dual-group activity window divides the active tasks into two groups based on their spawning order, in such a way that every task in the first group is earlier than any task in the second group. When a task in either group finishes execution, the next task enters the window into the second group. When all tasks in the first group complete the group commit, the entire group exits, and the second group stops expanding and becomes the first group in the window. The next task starts the new second group, and the same dual-group dynamics continue in the activity window.
For a speculative system that requires group commits, the dual-group activity window is both a necessary and sufficient solution for maximizing hardware utilization. It is obvious that we need groups for group commits. It turns out that two groups are sufficient: when there is not enough active parallelism in the activity window, the second group keeps growing by spawning new tasks to utilize the available hardware.

Continuous speculation is an improvement over the BOP system, which uses batch speculation: the system spawns k tasks on k processors and does not spawn again until the current batch of tasks finishes. Batch speculation is effectively a single-group activity window. It leads to hardware underutilization when the parallelism in the group is unbalanced. Such an imbalance may happen for four reasons.

• Uneven task size. In the extreme case, one task takes much longer than the other tasks, and all k − 1 other processors will be idle waiting.
• Inter-spawn delay. The delay is caused by the time spent executing the code between SAS blocks. With such delay, even tasks of the same size do not finish at the same time, again leaving some processors idle.
• Sequential commit. Virtually all speculative systems check correctness in the sequential order, one task at a time. Later checks always happen at a time when the early tasks are over and their processors are idle. In fact, this makes full utilization impossible with a single-group activity window.
• Asymmetrical hardware. Not all processors have the same speed. For example, a hyperthreaded core runs at a different speed depending on whether it runs one task or two. As a result, identical tasks may take different amounts of time to finish, creating the same effect as uneven task size.

A dual-group window solves these four problems with continuous speculation. The solution makes speculative parallelization more adaptive by tolerating variations in software demand and hardware resources. All four problems become more severe as we use more processors in parallel execution, so continuous speculation is necessary to obtain scalable performance. As we will show in the evaluation, continuous speculation has a higher cost at low processor counts but produces a greater speedup at high processor counts.

Next we discuss four components of the design: controlling the activity window, updating data, checking correctness, and supporting competitive recovery.

Token passing. For a system with p available processors, we reserve one for competitive recovery (as discussed later in this section) and use k = p − 1 processor tokens to maintain k active tasks. The first k tasks in a program form the first group. They start without waiting for a token but finish by passing one on. Each later task waits for a processor token before executing and releases the token after finishing. To divide active tasks into two groups, we use a group token, which is passed by the first group, after its group commit, to the next new task entering the activity window. The new task becomes the first task of the next group.
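In a process-based system, the tokens themselves can be ordinary pipes: a task blocks reading one byte (the token) before it starts and writes a byte back when it finishes. A minimal sketch under that assumption, with the group token and all commit logic omitted:

    #include <unistd.h>

    /* Sketch: k processor tokens held in a single pipe. A task must read a
     * token before executing and returns it when done, so at most k
     * speculation tasks run at any time. The pipe is created before the
     * tasks fork, so every task process inherits it. */
    static int token_pipe[2];

    void init_tokens(int k)             /* k = p - 1 processor tokens */
    {
        pipe(token_pipe);
        for (int i = 0; i < k; i++)
            write(token_pipe[1], "t", 1);
    }

    void acquire_token(void)            /* blocks until a processor frees up */
    {
        char t;
        read(token_pipe[0], &t, 1);
    }

    void release_token(void)
    {
        write(token_pipe[1], "t", 1);
    }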
Variable vs. fixed window size. The activity window can have a fixed or a variable size. Token passing bounds the group size from below by k, the number of processor tokens. It does not impose an upper bound, because it allows the second group to grow to an arbitrary size to make up for the lack of parallelism in the activity window. Conceivably, there is a danger of runaway window growth, as a large group begets an even larger group. To bound the window and group size, we augment the basic control by withholding the processor tokens when the second group exceeds a specific size. Window expansion then stops until the first group finishes and releases the group token. In the evaluation, we found that token passing creates an "elastic" window that naturally contracts and does not enter a runaway expansion; however, the benefit of variable over fixed window size is small in our tests.

Triple updates. Upon successful completion, a speculative task needs to commit its changes to shared data. In a process-based design, this means copying modified pages from the task process to other processes. In continuous speculation, a modified page is copied three times, in a scheme we call triple updates. Consider group g. The first update happens at the group commit and copies the modified pages of all but the last task of g to the last task. This must be done before the understudy process can start after g. The next group, g+1, executes concurrently with g and may have nearly finished when g finishes. Since there is no sure benefit in updating g+1, the second update copies the changes of g to group g+2. Recall that g and g+2 are serialized by the passing of the group token. The second update ensures that g+2 and later speculation have the new data before their execution; this is in fact the earliest point at which the new data of g can be made visible to the remaining program execution. The third update of group g happens at the end of group g+1. As in the first update, the goal is to produce a correct execution state before starting the understudy process after g+1. Because of group commit and the dual-group activity window, the changes of each group are needed to form a correct state for two successive understudy processes, hence the need for the two understudy-directed updates (the first and the third).

In a process-based design, inter-task copying is implemented with communication pipes. We need 3 pipes for each group. Since the activity window has two groups, we need a total of 6 pipes, independent of the size of the activity window.

Dual-map checking. Correctness checking is sequential: after task t passes the check, we check t+1. The speculation is correct if task t+1 incurs no write-read or write-write conflict on any page. (Write-write conflicts may seem harmless with page replication, but page-level monitoring cannot rule out a write-read conflict: after a write, the later task opens the access permission for the page, which allows subsequent reads to the page and opens the possibility of unmonitored write-read conflicts. To ensure correctness, we have to disallow write-write page sharing.) The page accesses of each task are recorded in two access maps, one for reads and one for writes. To be correct, the read and write sets of task t+1 cannot overlap with the write set of any preceding task in the activity window.

In batch speculation, the writes of task t+1 are checked against the union of the write maps up to task t. During the check, the write set of t+1 is merged into the unified write map for use by t+2. The unified map is reset after each group. In continuous speculation, a task executes in parallel with peer tasks from two groups. Each group g must also check against the writes of g-1. We use two unified maps, one for g-1 and one for g. Each task checks against both unified maps and extends the unified map of g. The inter-group check could be delayed to the end of g and accomplished by a group check; however, a group check is too imprecise to support speculative post-wait, which we discuss in Section 3.4.

There are two interesting problems in implementing dual-map checking: how to reuse the maps, and when to reset them for the next use. With map reuse, a total of two unified maps suffice regardless of the size of the activity window.
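In code, the per-task check reduces to bitmap operations over page numbers. The sketch below shows one way to express it, with one bit per page; the representation and names are our illustration, not the actual meta-data of the system.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: dual-map conflict checking. unified[0] accumulates the writes
     * of group g-1 and unified[1] those of the current group g. A task in g
     * passes the check only if neither its read map nor its write map
     * overlaps either unified write map. */
    enum { NPAGES = 1 << 20 };               /* pages covered by each map */
    typedef uint64_t map_t[NPAGES / 64];     /* one bit per page */

    static map_t unified[2];

    static int overlaps(const uint64_t *a, const uint64_t *b)
    {
        for (size_t i = 0; i < NPAGES / 64; i++)
            if (a[i] & b[i])
                return 1;
        return 0;
    }

    /* returns 0 and extends g's unified map if the task passes the check */
    int check_task(const uint64_t *reads, const uint64_t *writes)
    {
        for (int g = 0; g < 2; g++)
            if (overlaps(reads, unified[g]) || overlaps(writes, unified[g]))
                return 1;                    /* conflict: speculation fails */
        for (size_t i = 0; i < NPAGES / 64; i++)
            unified[1][i] |= writes[i];      /* writes become visible to the
                                                checks of later tasks */
        return 0;
    }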
Naively, one might reset the unified map of g-1 as soon as g finishes. However, this would be too early, because the unified map is needed by g-1 in the second update. The correct reset point is after the first task of g+1.

Competitive recovery. In continuous speculation, the understudy task is started after the first task, as illustrated in Figure 2. As the activity window advances, the understudy task is restarted after each task group. An understudy task is always active (after the first task) as long as speculation continues. As a result, it constantly requires a processor, leaving one fewer processor for parallel execution. We refer to this as the "missing processor" problem in continuous speculation.

Figure 2: Example execution of 9 SAS tasks on 3 processors, showing the use of token passing and triple updates. Processor tokens are passed along solid lines with a double-headed arrow; the group token is passed along dotted lines. The hands point to the triple updates by the two groups. Continuous speculation maintains 2 active speculation tasks and 1 understudy task (marked "UNDY") at all times.

In comparison, batch speculation does not suffer from the missing processor problem. In each batch, the first task is not speculative and needs no backup execution; the understudy process starts after the lead process. If all batch tasks were to have the same size, the understudy would hardly run. In continuous speculation, by contrast, all tasks in the activity window are speculative after the first task finishes.

A question is where to draw the finish line for the parallel-sequential race. Since the understudy runs alongside two speculation groups, there are two possible finish lines: the completion of the first group or the completion of both groups. We choose the second, because favoring speculation means that it does not give up the competition as long as there is a chance the parallel execution may finish early. A victory by the understudy, on the other hand, means no performance improvement and no benefit from parallel execution.
An example. Figure 2 shows an example of continuous speculation. There are 9 SAS tasks, represented by vertical bars marked i = 1, ..., 9, and 3 processors (p = 3), so there are k = p − 1 = 2 processor tokens. The first 3 SAS tasks form the first group, A. When a task finishes, it passes a processor token, shown by a line with a double-headed arrow, to a task in the next group, B. When group A finishes, it passes the group token, shown by a dotted line, to start group C. Tasks enter the activity window one at a time and leave the window one group at a time. In the steady state, the window holds two task groups, with two speculation tasks active at all times, as can be seen in the figure. For example, the token-passing mechanism delays task B1 until a processor is available. After finishing, group A copies its speculative changes three times: at the end of A for the understudy of B and C, at the end of B for the understudy of C and D, and at the start of C for its and future speculation. The understudy tasks are marked "UNDY"; an understudy task is always running during continuous speculation.

3.4 Speculative Post and Wait

Post and wait complement each other and also interact with run-time monitoring during speculation and with the commit operation after speculation. The related code is shown in Figure 3 with numbered lines. Run-time monitoring happens in PageFaultHandler, called when a page is first accessed by a speculation task. Commit happens in EndSAS, called at the end of a SAS block.

    Post (channel, addrs)
    1    remove unmodified pages from send list
    2    copy the pages to channel

    Wait (channel)
    3    wait for data from channel
    4    for each page
    5        if an order conflict
    6            drop the page from the channel
    7        else if page has been accessed by me
    8            abort me             // receiver conflict
    9        else accept the page

    PageFaultHandler (page)
    10   if writing to a posted page  // sender conflict
    11       abort next task

    EndSAS
    12   if the last spec in the group
    13       copy pages from group commit
    14           skip page if recved before by a wait

    (a) Implementation of speculative post and wait, including synchronization
        (line 3) and data transfer (lines 1, 2, 4-9), and handling of order,
        receiver, and sender conflicts (lines 5-6, 7-8, and 10-11).

    Task 1        Task 2        Task 3
    Post(1, x)    Wait(2)       Wait(3)
    Post(2, x)    Post(3, x)    Wait(1)

    (b) Example tasks and post/wait dependences. A stale page is transferred
        by Post(1), but Task 3 ignores Wait(1) (conflict detection at line 5)
        and runs correctly.

Figure 3: The implementation of speculative post/wait and an example.

Post(channel, addrs) removes the unmodified pages from the send list (line 1), which not only avoids redundant communication but also makes later conflict checking more accurate, because we know every page sent has been written by the sender. Wait(channel) blocks until the waited-on channel is filled (line 3). For each page, it checks for conflicts before accepting it (lines 5–9).

Conflict handling. Incorrect use of post and wait may create three types of conflicts. The sender conflict happens when a page is modified after it is posted, in which case the receiver sees an intermediate state of the sender. This is detected in the page fault handler (line 10).
We conservatively abort all later speculation (line 11). The receiver conflict happens when a received page has been accessed before the Wait call, which means that the receiver has missed the new data from the sender (recall from line 2 that only modified pages are transferred) and must be aborted. The receiver conflict is checked at line 7. An order conflict happens when multiple post/wait pairs communicate the same page in a conflicting order; for example, a receiver receives two inconsistent versions of a page from two senders. Since only modified pages are transferred, one of the two copies must contain stale data. Order conflicts are checked at the time of receipt (line 5). The receiver tests for the overlap of communication paths. The communication path of a post/wait pair includes the intermediate tasks between the sender task and the receiver task in the spawning order. Two communication paths have a correct overlap if one of two conditions is met: they start at the same sender, or the end of one path is the start of the other. Any other overlap is an order conflict, and the incoming page is dropped. The previously received page may still be incorrect; such errors are detected as part of the correctness checking at commit time.

As a suggestion, either a post or a wait may miss its matching half. Unreceived posts are removed when the changes of the sender task have become visible to everyone in the activity window: if the posting task is t, its result is fully visible when the succeeding task group exits the activity window. Starving waits cause speculation to fail and be aborted by the understudy task.

An example. The three example tasks in Figure 3(b) demonstrate an order conflict: three post/wait pairs communicate the same page x, which is modified by the first two tasks and used by the third. Post-wait serializes the writes by the first two tasks. However, Post(1) is misplaced: it tries to send a stale version of x to the third task. At Wait(1), Task 3 detects an incorrect overlap in the communication paths and ignores the transferred page. When the post-wait actions are considered at commit time, Task 3 incurs no dependence conflict due to x (which would not be the case if its two waits happened in the reverse order).

Such order conflicts may happen in real programs. In pointer-based data structures, different pointers may refer to the same object. In addition, two objects may reside on the same page, and a correct order at the object level may lead to inconsistency at the page level. This example again shows the flexibility of the suggestion language and its implementation in tolerating erroneous and self-contradictory hints.
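The path-overlap test itself is a small piece of logic over spawn-order indices. A sketch, assuming each post carries its sender's index so that the receiver can reconstruct the communication path of every pair it has accepted for a page:

    /* Sketch: order-conflict test for two post/wait pairs that communicate
     * the same page. A communication path runs from sender s to receiver r
     * in spawning order. Per the rule above, overlapping paths are correct
     * only if the two pairs start at the same sender or one path ends where
     * the other starts. */
    typedef struct { int s, r; } path_t;   /* spawn-order indices */

    static int paths_overlap(path_t a, path_t b)
    {
        return a.s < b.r && b.s < a.r;     /* open intervals (s, r) intersect */
    }

    static int correct_overlap(path_t a, path_t b)
    {
        return a.s == b.s || a.r == b.s || b.r == a.s;
    }

    /* nonzero means the incoming page must be dropped (order conflict) */
    int order_conflict(path_t incoming, path_t accepted)
    {
        return paths_overlap(incoming, accepted)
            && !correct_overlap(incoming, accepted);
    }

In the example of Figure 3(b), the pair accepted at Task 3 from Post(3) has path (2, 3) and the incoming pair from Post(1) has path (1, 3); the two intervals intersect, neither condition for a correct overlap holds, and the page is dropped.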
    program   suite            description          source size and language
    bzip2     SPEC INT 2000    data compression     4,244 lines, in C
    hmmer     SPEC INT 2006    gene matching        35,992 lines, in C
    namd      SPEC FP 2006     molecular dynamics   5,315 lines, in C++
    sjeng     SPEC INT 2006    computer chess       13,847 lines, in C

    sequential time, number of SAS tasks, and distribution of task sizes (seconds):

    program             SAS tasks, total time   min    1st qu.  mean   3rd qu.  max
    bzip2               183 tasks, 52 sec.      0.28   0.28     0.29   0.29     0.30
    hmmer (calibrate)   50 tasks, 929 sec.      4.4    8.0      8.8    9.6      14.9
    hmmer (search)      50 tasks, 439 sec.      18.6   18.7     18.8   18.8     19.2
    namd                38 tasks, 1055 sec.     27.7   27.8     27.8   27.8     27.8
    sjeng               15 tasks, 1811 sec.     4.3    51.9     122.8  127.9    419.5

    program   seq. time  num. of       global var    heap pages   heap pages  committed  post/wait  num. of
              (sec)      instructions  size (pages)  allocated    allocated   pages      pages      page
                                                     outside SAS  inside SAS                        faults
    bzip2     53         1.04E+11      74            1,608        0           0          732        984
    hmmer     1,368      9.28E+11      10,704        582,574      1,267,511   0          3,083      11,230
    namd      1,055      2.33E+12      64            11,558       900         34,200     0          34,357
    sjeng     1,811      2.70E+12      880           43,947       0           0          0          1,021

Table 1: Characteristics of the four application benchmarks.

3.5 Dynamic Memory Allocation

In a speculation system, each SAS task may allocate and free memory independently and in parallel with other tasks. A unique problem in a process-based design is to maintain an identical address space across all tasks and to make speculative allocation and deallocation in one task visible to all other tasks. The previous BOP system pre-allocates a fixed amount of space to each task at the start of a batch and aborts speculation if a task attempts to allocate more; the speculative allocation is made visible after the batch finishes. Pre-allocation does not work for continuous speculation, since the number of speculation tasks is not known in advance.

We have developed a SAS allocator building on the designs of Hoard, an allocator for multithreaded code (Berger et al., 2000), and McRT-Malloc, an allocator for transactional memory (Hudson et al., 2006). Like Hoard, the SAS allocator maintains a global heap divided into pages. New pages are allocated from the global heap into per-task local heaps, and nearly empty pages are returned to the global heap. To reduce contention, the SAS allocator allocates and frees a group of pages at a time at the global level. To support speculation and recovery, it delays the freeing of data allocated before a task until the end of the task, as done in McRT-Malloc.

The SAS allocator has a few unique features. Since SAS tasks by default do not share memory, the allocator maintains the meta-data of the global heap in a shared memory segment visible to all processes. To reduce fragmentation, the SAS allocator divides a page into chunks as small as 32 bytes. The per-page meta-data is maintained locally on the page and is not visible to the outside until the owner task finishes.
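The key structural idea, global-heap meta-data in shared memory with per-page meta-data kept on the page itself, might be laid out as follows in C. This is our illustration of the design, not the actual SAS allocator; the field names and the use of an anonymous shared mapping are assumptions.

    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096
    #define MIN_CHUNK 32      /* smallest chunk size, to limit fragmentation */

    /* Global-heap meta-data: placed in a MAP_SHARED mapping, so every
     * forked task process sees the one shared copy. */
    struct global_heap {
        size_t free_pages;    /* pages available to per-task local heaps */
        /* ... free lists, locks, per-size-class information ... */
    };

    /* Per-page meta-data: kept in a header on the heap page itself. The
     * page is copy-on-write, so the header stays private to the owning
     * task and becomes visible to others only when the task commits. */
    struct page_header {
        size_t chunk_size;    /* MIN_CHUNK, 64, 128, ... */
        unsigned used;        /* chunks handed out from this page */
        /* a free bitmap for the chunks follows the header */
    };

    struct global_heap *heap_init(void)
    {
        /* error handling (MAP_FAILED) omitted in this sketch */
        return mmap(NULL, sizeof(struct global_heap),
                    PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }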
4 Evaluation

Testing environment. We use the compiler support of the BOP system (Ding et al., 2007), implemented in the GNU GCC 4.0.1 C compiler. All programs are compiled with "-O3". Batch and continuous speculation are implemented in C and built as a run-time library. We use a Dell workstation with four dual-core Intel 3.40 GHz Xeon processors, for a total of 8 CPUs; each core supports two hyperthreads. The machine has 16MB of cache per core and 4GB of physical memory.

Benchmarks. Our test suite is given in Table 1. It consists of four benchmark programs: hmmer and sjeng from SPEC INT 2006, namd from SPEC FP 2006, and bzip2 from SPEC CPU 2000. We chose them for three reasons. First, they were known to us to have good coarse-grained parallelism; in fact, all of them are over 99.9% parallelizable. Second, they show different characteristics in task size (see Table 1) and inter-task delay, enough for us to evaluate the factors itemized in Section 3.3 and test the scalability of our design. Finally, it turned out that some of them have parallel implementations that do not use speculation, which allowed us to measure the overhead of the speculation system.

The programs are parallelized manually, using the suggestion language and the help of a profiling tool (running on the test input). The transformation includes separating independent computation slices and recomputing their initial conditions for namd; lifting the random number generation (and its dependence) out of the parallel gene-search loop for hmmer; and, for sjeng, removing dependences on four global variables and marking several global variables as checked variables (to enable parallelism across multiple chess boards). Post-wait is used in bzip2 to pass to the next task the location of the compressed data, in hmmer-search to add results to a histogram and update a global score, and in hmmer-calibrate to conduct a significance test over previous results and then update a histogram. Hmmer and namd allocate 5GB and 4MB of data respectively inside SAS tasks. Because it lacks support for post-wait and parallel memory allocation, the original BOP system can parallelize only one of the tests, sjeng. In the evaluation, batch speculation refers to the BOP system augmented with post-wait and dynamic memory allocation.

As Table 1 shows, the applications have one to two static SAS blocks and 15 to 183 dynamic SAS tasks. The parallelism is extremely coarse: each task executes on average between 500 million and 10 billion instructions. The programs use a large amount of data, including a quarter megabyte to 40 megabytes of global data, 4 megabytes to 2 gigabytes of dynamic allocation outside SAS blocks, and up to 5 gigabytes of dynamic allocation inside SAS blocks. Without speculation support, it would be nearly impossible for us to ensure correctness. All programs are written in C/C++ and make heavy use of pointer data and indirection. (One program, namd, was converted into C from C++ using an internal tool.)

Figure 4: Bzip2 tasks have a significant inter-task delay and require post-wait. Execution time is reduced from 53 to 15.8 seconds by batch speculation and to 11.3 seconds by continuous speculation. The non-speculative version is bzip2smp by Isakov.

The size of the source code ranges from over 4 thousand lines in bzip2 to nearly 36 thousand in hmmer. The SAS tasks may execute a large part of a test program, including many loops and functions. Not all code is available in source form: in bzip2, up to 25% of executed instructions are inside the glibc library. For complex programs like these, full address-space protection is important not just for correctness but also for efficiency (to avoid the extra indirection in shared data access, as discussed in Section 3.2).

Continuous vs. batch speculation. Figures 4 to 7 compare the speedup curves of continuous and batch speculation, parameterized by the number of processors p shown on the x-axis. (Because of the missing processor problem, continuous speculation automatically switches to batch speculation when p = 2.) Unless explicitly noted, all speedups are normalized to the sequential time shown in Table 1.

Three programs benefit from continuous speculation. Bzip2 is the only one with a nontrivial inter-task delay, about 16% of the task length, caused by the sequential reading of the input file between compression tasks. The post-wait is also sequential and accounts for 4% of the running time, although the post-wait code runs in parallel with the code for file input. The improvement of continuous speculation over batch increases from 6% at p = 3 to 63% at p = 6 and stays between 39% and 63% for p between 6 and 11. Continuous at p = 5 is faster than batch at p = 11, a savings of 6 processors.
The speedup curve levels off at 7 or more processors because the inter-task delay limits scalability, per Amdahl's law.

Figure 5: Hmmer has uneven task sizes in one SAS block and needs post-wait and dynamic memory allocation. Execution time is reduced from 1548 to 326 seconds by batch and to 270 by continuous speculation. The non-speculative version uses OpenMP.

Figure 6: Namd has perfectly balanced parallelism but needs dynamic memory allocation. Execution time is reduced from 1090 to 197 seconds by batch and to 182 by continuous speculation.

Figure 7: Sjeng has very uneven task sizes. Execution time is reduced from 1924 seconds to 855 by batch and to 666 by continuous speculation. We manually created the non-speculative version.

Two tests, hmmer and sjeng, are improved by continuous speculation because of their uneven task sizes. In sjeng, the size of the 15 SAS tasks ranges from 4.3 seconds to 420 seconds (Table 1). Continuous speculation is 23% (at p = 5) to 76% (at p = 6) faster than batch. Batch speculation shows a parallel anomaly: performance decreased by 26% as the number of processors was increased from 5 to 6. This was because the two longest SAS tasks were assigned to the same batch in the 5-processor run but to two batches in the 6-processor run, increasing the total time from the maximum of the two tasks' durations to their sum.

Compared to sjeng, hmmer has a smaller variance in task size in the calibration phase; the sizes range between 4.4 and 14.9 seconds. Continuous speculation is initially 35% slower than batch because of the missing-processor problem. However, it has better scalability: it catches up with the batch performance at p = 8 and outruns batch by 9% at p = 11, despite using one fewer processor.

Namd has nearly identical task sizes, between 27.7 and 27.8 seconds across all 38 SAS tasks, and no inter-task delay. Both continuous and batch speculation show near-linear scaling. The parallel efficiency ranges from 82% at p = 2 to 71% at p = 8. At p = 9, continuous speculation gains an advantage because of hyperthreading, as we discuss next.

Hyperthreading-induced imbalance. Asymmetrical hardware can cause uneven task sizes even in programs with equally partitioned work. On the test machine, hyperthreading is engaged when p > 8. Some tasks have to share the same physical CPU, leading to asymmetry in task speed in an otherwise homogeneous system.

Figure 8: The window size in three executions of bzip2, for p = 6, 7, 8, where a size greater than 2(p − 1) means that more SAS tasks are needed to utilize p processors.
Batch speculation shows erratic performance after p = 8: a 5% drop in parallel performance in hmmer and a 22% drop in namd, both at p = 9. The performance of continuous speculation increases without any regression. When p is increased from 8 to 11, performance increases by 8% in hmmer and 17% in namd, elevating the overall speedup from 4.6 to 5.0 and from 5.1 to 6.0 respectively. Three hyperthreads effectively add performance roughly equivalent to one physical CPU.

Variable-size activity window. Continuous speculation adds new tasks to the second group of the activity window when the existing tasks are not enough to keep all processors utilized while the first group is running. A variable-size window places no upper bound on the size of the second group. The scheme therefore raises the question whether a larger window begets an even larger window, growing indefinitely whenever there is a need to expand.

We examined the dynamic window sizes during the executions of bzip2, the program in our test set with the most SAS tasks and hence the most activity windows. Figure 8 shows on the x-axis the sequence number of the tasks and on the y-axis the number of tasks in the activity window when task i exits the window (which is the largest window containing i). The window dynamics show that monotonic window growth does not happen. For example, at p = 7 the window size grows from the normal size of 12 to a size of 17 and then retreats equally quickly back to 12. Similar dynamics are observed in all test cases. The experiments show that the variable-size activity window is "elastic": it expands when there is insufficient parallelism and contracts to the normal size when the parallelism is adequate. When we measured performance, however, we found that variable window sizes do not lead to significant improvement except in one program, sjeng: at p = 6 and 7, continuous speculation with variable-size windows is 14% and 6% faster than with fixed-size windows.

Other factors of the design. We manually moved dependent operations outside SAS tasks and found that post-wait produces performance similar to that of SAS tasks without post-wait. With perfectly even parallelism, namd gets none of the benefit of continuous speculation and should fully expose its overhead. The nearly parallel curves in Figure 6 show that the overhead comes entirely from competitive recovery; other costs, such as triple updates and dual-map checking, do not affect performance or scalability. One factor we did not evaluate is the impact of speculation failure, since our test programs are mostly parallel (except for conflicts after the last SAS task). A recent study shows that adaptive spawning is an effective solution (Jiang and Shen, 2008); we plan to consider it in future work.

The cost of speculation. Speculation protects correctness and makes parallel programming easier; the downside is the cost. We obtained non-speculative parallel code for three of the applications: bzip2smp, hand-parallelized by Isakov; hmmer, which we parallelized with OpenMP in ways similar to (Srinivasan et al., 2005); and sjeng, which we parallelized by hand. (Bzip2smp was created in 2005 and is available from http://bzip2smp.sourceforge.net/. Its documentation states that the parallel version is not fully interchangeable with bzip2; it is based on a different sequential version, so the comparison in Figure 4 uses relative speedups.) In terms of the best parallel performance, speculation is slower than the non-speculative code by 23%, 38%, and 28% respectively. There are two significant factors in the difference. One is the cost of speculative memory allocation in hmmer. The other is the data placement of global variables.
To avoid false sharing, our speculation system allocates each global variable on a separate page, which increases the number of TLB misses, especially in programs with many global variables such as sjeng (880 pages of global data, as shown in Table 1).

5 Related Work

Parallel languages. Most languages let a user specify the unit of parallel execution explicitly: threads in C/C++ and Java; futures in parallel Lisp (Halstead, 1985); tasks in MPI (MPI), Co-array Fortran (Numrich and Reid, 1998), and UPC (El-Ghazawi et al., 2005); and parallel loops in OpenMP (OpenMP), ZPL (Lewis et al., 1998), SaC (Grelck and Scholz, 2003), and recently Fortress (Allen et al., 2007). Two of the HPCS languages, X10 (Charles et al., 2005) and Chapel (Chamberlain et al., 2007), support these plus atomic sections. Most are binding annotations and require definite parallelism.

SAS is similar to the future, which was pioneered in Multilisp (Halstead, 1985) to mean the parallel evaluation of the future code block with its continuation. The term future emphasizes the code block (and its later use). Spawn-and-skip (SAS), however, emphasizes the continuation, because the continuation is the target of speculation.

Future constructs in imperative languages include Cilk spawn (Blumofe et al., 1995) and the Java future. Because of the side effects of a future task, a programmer has to use a matching construct, sync in Cilk and get in Java, to specify a synchronization point for each task. These constructs specify both parallelism and program semantics and may be misused; for example, a Java future may lead an execution to deadlock. Speculative support allows one to use a future without its matching synchronization. There are at least three such constructs: the safe future for Java (Welc et al., 2005), the ordered transaction (von Praun et al., 2007), and the possibly parallel region (PPR) for C/C++ (Ding et al., 2007). They are effectively parallelization hints and affect only program performance, not semantics. Data distribution directives in data-parallel languages like HPF are also programming hints, but they express parallelism indirectly rather than directly (Allen and Kennedy, 2001; Adve and Mellor-Crummey, 1998).

SAS augments these future constructs with speculative post-wait to express dependence in addition to parallelism. Post and wait were coined for do-across loop parallelism (Cytron, 1986). They are paired by an identifier, which is usually a data address. They specify only synchronization, not communication, which was sufficient for use by non-speculative threads; speculation, however, may create multiple versions of shared data. Signal-wait pairs were used for inter-thread synchronization in thread-level speculation (TLS) by inserting a signal after a store instruction and a wait before a load instruction (Zhai, 2005); a compiler was used to determine the data transfer. In an ordered transaction, the keyword flow specifies that a variable read should wait until either a new value is produced by the previous transaction or all previous transactions finish (von Praun et al., 2007). Flow is equivalent to wait with an implicit post (for the first write in the previous task). SAS post-wait is similar to do-across in that the two primitives are matched by an identifier. They are speculative, like signal-wait and flow.
Instead of describing dependences, another approach is to infer them, as the Jade language does (Rinard and Lam, 1998). A programmer lists the set of data accessed by each function, and the Jade system derives the ordering relation and the communication between tasks. Our system addresses the problem of the (speculative) implementation of dependences and is complementary to systems like Jade.

Software methods of speculative parallelization

Loop-level software speculation was pioneered by the LRPD test (Rauchwerger and Padua, 1995). It executed a loop in parallel, recorded data accesses in shadow arrays, and checked for dependences after the parallel execution. Later techniques speculatively privatized shared arrays (to allow for false dependences) and combined the marking and checking phases (to guarantee progress) (Cintra and Llanos, 2005; Dang et al., 2002; Gupta and Nim, 1998). The techniques supported parallel reduction (Gupta and Nim, 1998; Rauchwerger and Padua, 1995). Recently, a copy-or-discard model, CorD, was developed to manage memory states for general C code (Tian et al., 2008). It divided a loop into prologue, body, and epilogue and supported dependent operations. These techniques are fully automatic but do not provide a language interface for a user to manually select parallel tasks or remove dependences.

There are three types of loop scheduling (Cintra and Llanos, 2005). Static scheduling is not applicable in our context, since the number of future tasks is not known a priori. The two types of sliding windows are analogous to the batch and continuous speculation of this report. Continuous speculation is a form of pipelined execution, which can be improved through language (Thies et al., 2007) and compiler (Rangan et al., 2008) support. Our design originated in the idea of RingSTM for managing transactions (Spear et al., 2008).

These techniques use threads and require non-trivial code changes to manage speculative state and speculative data access. Safe future used a compiler to insert read and write barriers and a virtual machine to store speculative versions of a shared object (Welc et al., 2005). Manual or automatic access monitoring was used to support speculative data abstractions in Java (Kulkarni et al., 2007) and to implement a transaction mechanism for loop parallelization (with static scheduling) (Mehrara et al., 2009). All of these approaches need to redirect shared data accesses, and two of them used static scheduling of loop iterations (Mehrara et al., 2009; Welc et al., 2005).

Thread-based speculation is necessary to exploit fine-grained parallelism. In addition, it offers precise monitoring and can avoid false sharing, although some systems monitor data conservatively or at a coarser granularity to improve efficiency, which may lead to false conflict alerts. In comparison, we focus on coarse-grained tasks with possibly billions of accesses to millions of potentially shared data items, and we use processes to minimize per-access overhead, as discussed in Section 3.2.
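At this coarse granularity, a process-based system can drive the amortized per-access cost toward zero by monitoring with virtual-memory protection instead of instrumenting every access. The POSIX sketch below shows the general pattern under the assumption of page-granularity monitoring; it illustrates the technique, not the exact mechanism of Section 3.2. The function record_access is a hypothetical access map, and distinguishing reads from writes (which conflict checking needs) would require extra machinery, such as decoding the faulting instruction or a two-step protection scheme, omitted here.

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE 4096   /* assume 4 KB pages for brevity */

    void record_access(void *page);  /* hypothetical: mark page in access map */

    /* The first touch of a protected page faults once; the handler records
       the page and unprotects it, so all later accesses to that page run at
       full hardware speed.  Error handling and async-signal-safety concerns
       are omitted for brevity. */
    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
        record_access(page);
        mprotect(page, PAGE, PROT_READ | PROT_WRITE);
    }

    /* Protect the shared region at the start of a speculative task. */
    static void install_monitor(void *region, size_t len)
    {
        struct sigaction sa;
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(region, len, PROT_NONE);   /* any access now faults */
    }

The cost is one fault per touched page per task rather than one check per access, which is what makes page-granularity monitoring attractive for tasks with billions of accesses.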
Process-based speculation poses different design problems, which we have solved using a dual-group activity window, triple updates, dual-map checking, and competitive recovery. Process-based speculation was used previously in the BOP system (Ding et al., 2007). We have augmented BOP in three ways: speculative post-wait, continuous speculation, and parallel memory allocation. The original BOP system cannot parallelize any of our test programs except one; for that program, sjeng, continuous speculation is 23% to 76% faster than the original batch speculation.

6 Summary

We have presented the design and implementation of suggestible parallelization, which includes a minimalistic suggestion language with just three primitives and several novel run-time techniques: speculative post-wait, parallel memory allocation, and continuous speculation with a dual-group activity window, triple data updates, and competitive error recovery. Suggestible parallelization is 2.9 to 6.0 times faster than fully optimized sequential execution. It enables scalable performance and removes performance anomalies for programs with unbalanced parallelism and inter-task delays, and on machines with asymmetric performance caused by hyperthreading.

A school of thought in modern software design argues for full run-time checks in production code. Tony Hoare is quoted as saying that removing these checks is like a sailor wearing a lifejacket when practicing on a pond but taking it off when going to sea. We believe the key question is the overhead of the run-time checking or, in the sailor analogy, the weight of the lifejacket. Our design protects parallel execution at an additional cost of between 23% and 38% over unprotected parallel execution, a significant improvement over the previous system that helps make speculation affordable in real systems.

Suggestible parallelization attempts to divide the complexity of program parallelization by separating the issue of finding parallelism in a program from that of maintaining the program's semantics: it separates the problem of expressing parallelism from that of implementing it. While this separation loses the opportunity to address the two issues together, we have shown that our design simplifies parallel programming and enables effective parallelization of low-level legacy code.

Acknowledgments

We wish to thank Jingliang Zhang at ICT for help with the implementation of BOP-malloc. We also wish to thank Bin Bao, Ian Christopher, Bryan Jacobs, Michael Scott, Xipeng Shen, Michael Spear, and Jaspal Subhlok for their helpful comments on the work and the report.

References

Adve, Vikram S. and John M. Mellor-Crummey. 1998. Using integer sets for data-parallel program analysis and optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 186–198.

Allen, E., D. Chase, C. Flood, V. Luchangco, J. Maessen, S. Ryu, and G. L. Steele. 2007. Project Fortress: a multicore language for multicore processors. Linux Magazine, pages 38–43.

Allen, R. and K. Kennedy. 2001. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers.

Berger, Emery D., Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. 2000. Hoard: a scalable memory allocator for multithreaded applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128.

Blume, W., et al. 1996. Parallel programming with Polaris. IEEE Computer, 29(12):77–81.
Blumofe, R. D., C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. 1995. Cilk: an efficient multithreaded runtime system. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA.

Chamberlain, Bradford L., David Callahan, and Hans P. Zima. 2007. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3):291–312.

Charles, Philippe, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. 2005. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the ACM SIGPLAN OOPSLA Conference, pages 519–538.

Cintra, M. H. and D. R. Llanos. 2005. Design space exploration of a software speculative parallelization scheme. IEEE Transactions on Parallel and Distributed Systems, 16(6):562–576.

Cytron, R. 1986. Doacross: beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing, St. Charles, IL.

Dang, F., H. Yu, and L. Rauchwerger. 2002. The R-LRPD test: speculative parallelization of partially parallel loops. Technical report, CS Dept., Texas A&M University, College Station, TX.

Ding, C., X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. 2007. Software behavior oriented parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

El-Ghazawi, Tarek, William Carlson, Thomas Sterling, and Katherine Yelick. 2005. UPC: Distributed Shared Memory Programming. John Wiley and Sons.

Grelck, Clemens and Sven-Bodo Scholz. 2003. SAC—from high-level programming with arrays to efficient parallel execution. Parallel Processing Letters, 13(3):401–412.

Gupta, M. and R. Nim. 1998. Techniques for run-time parallelization of loops. In Proceedings of SC'98.

Hall, Mary W., Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam. 2005. Interprocedural parallelization analysis in SUIF. ACM Transactions on Programming Languages and Systems, 27(4):662–731.

Halstead, R. H. 1985. Multilisp: a language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538.

Hudson, Richard L., Bratin Saha, Ali-Reza Adl-Tabatabai, and Ben Hertzberg. 2006. McRT-Malloc: a scalable transactional memory allocator. In Proceedings of the International Symposium on Memory Management, pages 74–83.

Jiang, Yunlian and Xipeng Shen. 2008. Adaptive software speculation for enhancing the cost-efficiency of behavior-oriented parallelization. In Proceedings of the International Conference on Parallel Processing, pages 270–278.

Kulkarni, Milind, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. 2007. Optimistic parallelism requires abstractions. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 211–222.

Lewis, E., C. Lin, and L. Snyder. 1998. The implementation and evaluation of fusion and contraction in array languages. In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, Montreal, Canada.

Mehrara, Mojtaba, Jeff Hao, Po-Chun Hsu, and Scott A. Mahlke. 2009. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 166–176.
MPI. 1997. MPI-2: extensions to the Message-Passing Interface. Message Passing Interface Forum. http://www.mpi-forum.org/docs/mpi-20.ps.

Numrich, R. W. and J. K. Reid. 1998. Co-array Fortran for parallel programming. ACM Fortran Forum, 17(2):1–31.

OpenMP. 2005. OpenMP application program interface, version 2.5. http://www.openmp.org/drupal/mp-documents/spec25.pdf.

Rangan, Ram, Neil Vachharajani, Guilherme Ottoni, and David I. August. 2008. Performance scalability of decoupled software pipelining. ACM Transactions on Architecture and Code Optimization, 5(2).

Rauchwerger, L. and D. Padua. 1995. The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA.

Rinard, M. C. and M. S. Lam. 1998. The design, implementation, and evaluation of Jade. ACM Transactions on Programming Languages and Systems, 20(3):483–545.

Spear, Michael F., Maged M. Michael, and Christoph von Praun. 2008. RingSTM: scalable transactions with a single atomic instruction. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 275–284.

Srinivasan, U., et al. 2005. Characterization and analysis of hmmer and svm-rfe. In Proceedings of the IEEE International Symposium on Workload Characterization.

Thies, William, Vikram Chandrasekhar, and Saman P. Amarasinghe. 2007. A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture, pages 356–369.

Tian, Chen, Min Feng, Vijay Nagarajan, and Rajiv Gupta. 2008. Copy or discard execution model for speculative parallelization on multicores. In Proceedings of the ACM/IEEE International Symposium on Microarchitecture, pages 330–341.

von Praun, Christoph, Luis Ceze, and Calin Cascaval. 2007. Implicit parallelism with ordered transactions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

Welc, Adam, Suresh Jagannathan, and Antony L. Hosking. 2005. Safe futures for Java. In Proceedings of the ACM SIGPLAN OOPSLA Conference, pages 439–453.

Wolfe, M. J. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA.

Zhai, Antonia. 2005. Compiler Optimization of Value Communication for Thread-Level Speculation. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, PA.