Extended Sequential Reasoning for Data-Race-Free Programs *

Laura Effinger-Dean †
University of Washington
effinger@cs.washington.edu

Hans-J. Boehm    Dhruva Chakrabarti    Pramod Joisha
Hewlett-Packard Laboratories
{hans.boehm, dhruva.chakrabarti, pramod.joisha}@hp.com

Abstract

Most multithreaded programming languages prohibit or discourage data races. By avoiding data races, we are guaranteed that variables accessed within a synchronization-free code region cannot be modified by other threads, allowing us to reason about such code regions as though they were single-threaded. However, such single-threaded reasoning is not limited to synchronization-free regions. We present a simple characterization of extended interference-free regions in which variables cannot be modified by other threads. This characterization shows that, in the absence of data races, important code analysis problems often have surprisingly easy answers. For instance, we can use local analysis to determine when lock and unlock calls refer to the same mutex. Our characterization can be derived from prior work on safe compiler transformations, but it can also be derived simply from first principles, and justified in a very broad context. In addition, systematic reasoning about overlapping interference-free regions yields insight about optimization opportunities that were not previously apparent. We give preliminary results for a prototype implementation of interference-free regions in the LLVM compiler infrastructure. We also discuss other potential applications for interference-free regions.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory—Semantics; D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages—Program analysis

General Terms Languages

Keywords Concurrency, memory consistency models, data races, sequential reasoning, compiler optimization

* This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-0718124.
† Some of this work was done while at HP Labs.

MSPC'11, June 5, 2011, San Jose, California, USA.

1. Introduction

Programming languages that support multiple concurrent threads communicating via shared variables must provide a memory model. This model specifies whether variables may be concurrently accessed, when an update by one thread becomes visible to another, and so on. Many mainstream programming languages now have memory models that disallow data races; that is, they completely prohibit concurrent accesses to a shared variable unless both accesses are reads. The prohibition against data races has arguably been the dominant rule for decades. Both Posix threads [6] and Ada [12] have explicitly disallowed data races from their inception. The revisions of the C and C++ standards [2–5] currently being finalized are more precise but equally explicit on this point. Here we assume such a prohibition against data races.
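As a concrete picture of this rule (the program below is our own illustration, not taken from the cited standards), the following C fragment has a data race on counter if two threads run f_racy concurrently, because the two increments are conflicting, unsynchronized accesses and at least one is a write. Routing all accesses through a mutex, as f_sync does, eliminates the race:

    #include <pthread.h>

    int counter;                       /* shared variable */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *f_racy(void *arg) {          /* data race: concurrent   */
        counter++;                     /* unsynchronized writes   */
        return NULL;
    }

    void *f_sync(void *arg) {          /* race-free: the mutex    */
        pthread_mutex_lock(&m);        /* orders the accesses     */
        counter++;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, f_sync, NULL);
        pthread_create(&t2, NULL, f_sync, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }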
We do not directly address languages like Java [9], which also discourage, but do not completely disallow, data races. Java attempts to provide well-defined semantics for data races, but recent work [1, 11] points out serious deficiencies in the approach, and calls for a reexamination of the problem. At this point, we can only speculate whether our work will eventually be applicable to Java in some form.

It is well known that, in the absence of data races, when a variable x is accessed, its value cannot be changed by another thread between the immediately preceding and immediately following synchronization operations. If another thread could modify the variable between those two points (i.e., within a synchronization-free region), then there would be an execution of the program in which the two accesses of x are concurrent and therefore form a data race. Similarly, if x is written within a synchronization-free region, it cannot be read by another thread within that region.

These observations are fundamental to code optimization in current compilers. They allow compilers for languages like C and C++ to largely ignore the presence of threads when transforming synchronization-free code. No other thread can detect such transformations without introducing a data race. As long as synchronization operations are treated as "opaque" (i.e., as potentially modifying any memory location), and no speculative operations, and hence no new data races, are introduced, safe sequential transformations remain correct in the presence of multiple threads.

In this paper, we introduce a simple but powerful extension to synchronization-free regions called interference-free regions. The observation we make is that when a variable x is accessed, its value cannot be changed by another thread between the immediately preceding acquire synchronization operation (e.g., a mutex lock, a thread join, or a volatile read) and the immediately following release synchronization operation (e.g., a mutex unlock, a thread creation, or a volatile write). This is strictly more general than synchronization-free regions, which do not distinguish between different types of synchronization operations. A nice consequence of this generalization is that the interference-free regions for different accesses to the same variable may overlap, revealing previously unrecognized optimization opportunities.

Our contributions are as follows:

• We define interference-free regions over execution traces, and present several interesting examples for which reasoning about interference-freedom, as we have defined it, allows for easy answers to tricky problems.

• We present a compiler analysis that identifies conservative approximations of interference-free regions, allowing other compiler analyses to remove redundant memory operations across synchronization boundaries. Our analysis is straightforward, but general and applicable to a number of interesting cases.

We hope that the discussion sheds light on issues that have not been well understood prior to our work.

2. Interference-Free Regions

In this paper, we concern ourselves with regions of code that are not synchronization-free. Can we regain some sequential reasoning even in the presence of synchronization? To illustrate the problem we are trying to address, consider

    lock(mtx_p);
    ...
    unlock(mtx_p);

where mtx_p is a global, potentially shared, pointer to a mutex. We will also assume, for purposes of the examples, that the only operations on mutexes are lock() and unlock() operations taking a pointer to a mutex, and that ellipses represent synchronization-free code that does not modify any of the variables mentioned in the example.
Is it the case that both instances of mtx_p always refer to the same lock? Or could another thread modify mtx_p in the interim? The prohibition against data races often allows us to answer such questions without analyzing code that might be run by other threads. Without such a prohibition, and without knowledge about other threads in the system, we clearly could not guarantee anything about concurrent updates of mtx_p. Moreover, reasoning about synchronization-free regions is insufficient in this case: the lock is acquired between the first and second load of mtx_p, and hence the two references are not in the same synchronization-free region.

Nevertheless, we argue that the data-race-freedom assumption is strong enough to establish, for this example, that another thread cannot concurrently modify mtx_p. We make an easily provable but foundational observation that appears to have been generally overlooked: the region of code during which other threads cannot modify a locally accessed variable may often be extended in both directions beyond the access's enclosing synchronization-free region. In particular, we can extend the boundary of a region backwards in the trace past any earlier release operations, such as mutex unlock() calls, and forwards through any later acquire operations, such as mutex lock() calls. To put it another way, the variable cannot be modified by other threads in the region between the most recent acquire operation and the next release operation. We call this extended region the interference-free region for an access.

Thus, in our example, the interference-free region for the initial load of mtx_p extends through the immediately following lock() call, and includes the second load of mtx_p, guaranteeing that both loads must yield the same pointer. Here we have separated the accesses to mtx_p from the synchronization operations and labeled the first load, but the idea is the same:

    A: tmp = mtx_p;   \
       lock(tmp);      |  interference-free
       ...             |  region for A
       tmp = mtx_p;   /
       unlock(tmp);

We believe that the notion of an interference-free region is a fundamental observation about the analysis of multithreaded code, in that it gives us a much better characterization of the applicability of single-threaded techniques. Note that we make these deductions with no specific information about other threads in the system; we are relying only on the data-race-freedom requirement imposed by the language.

        lock(&mtx1);
    A:  ...
        unlock(&mtx2);
        ...
        unlock(&mtx3);
        ...
    X:  tmp = x;
        ...
        lock(&mtx4);
        ...
        lock(&mtx5);
    B:  ...
        unlock(&mtx6);

Figure 1. An interference-free region in a thread trace. Ellipses are synchronization-free code. The synchronization-free region around access X runs only from unlock(&mtx3) to lock(&mtx4); the interference-free region extends further, backwards through region A (which contains no acquires) and forwards through region B (which contains no releases), i.e., over the whole trace between lock(&mtx1) and unlock(&mtx6).

2.1 Formalism

Memory models are delicate and must be reasoned about in a formal setting. We give a formal definition of our execution model and a proof that interference-free regions are correct in Appendix A. Briefly, if one thread reads a variable x and another thread modifies x, then the data-race-freedom guarantee requires that these accesses be ordered by the happens-before relation, a partial order on actions in the execution which is defined as the transitive closure of the program order and synchronizes-with relations. Because the two actions occur in different threads, and program order only orders actions from the same thread, at least one of the edges in the happens-before ordering must be a synchronizes-with edge. Given how we defined interference-free regions, it is not possible for there to be an incoming synchronizes-with edge in the region before the access, nor for there to be an outgoing synchronizes-with edge in the region after, so any write by another thread must happen-before or happen-after the entire interference-free region.
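Stated compactly in the notation of Appendix A (this is only an informal paraphrase of Theorems 1 and 2, with begin(R) and end(R) standing for the first and last actions of the interference-free region R around an access to x):

    for every write w to x by another thread:   w ≤hb begin(R)   or   end(R) ≤hb w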
2.2 Interference-Free Regions in Thread Traces

Given a memory access in an execution, we can infer the interference-free region for that access. In the execution trace for a single thread in Figure 1, the IFR for access X extends backwards through region A and forwards through region B. Any conflicting write must happen-before the lock of mtx1 or happen-after the unlock of mtx6. The exact sequence of lock acquires and releases is irrelevant; we simply identify incoming and outgoing synchronizes-with edges.

2.3 Overlapping Regions

We extend our reasoning about interference-free regions by considering cases in which two or more regions for the same variable overlap. If x cannot be changed in either interval a or interval b, and a and b overlap, then clearly it cannot change in a ∪ b. For example, suppose there is a critical section nested between two accesses, as in Figure 2. In this case, the interference-free region for load A extends forwards into region B. The interference-free region for load C extends backwards past the unlock into region B. Thus mtx_p must be interference-free for the entire region, and we can conclude that every lock acquired in this code is subsequently released.

    A: tmp = mtx_p;
       lock(tmp);
       ...
       lock(&mtx2);
    B: ...
       unlock(&mtx2);
       ...
    C: tmp = mtx_p;
       unlock(tmp);

Figure 2. The interference-free regions for accesses A and C overlap, despite the intervening critical section: the region for A extends forwards into region B, and the region for C extends backwards into region B.

The above reasoning does not generally apply if there is more than one nested critical section in a row. However, there are cases for which we can derive similar results even then. Consider the common case in which, rather than mtx_p, we have a pointer p to a structure that includes both a mutex and some data, with the program as shown in Figure 3.

    A: tmp = p;
       lock(&tmp->mtx);
       ...
       lock(&mtx2);
    B: ...
       unlock(&mtx2);
       ...
    C: tmp = p;
       local = tmp->data;
       ...
       lock(&mtx3);
    D: ...
       unlock(&mtx3);
       ...
    E: tmp = p;
       unlock(&tmp->mtx);

Figure 3. Here, the load of p at line C means that p is interference-free during both nested critical sections.

The program includes three loads of p, at lines A, C, and E. The interference-free region for the load of p at line A extends forward through region B. The interference-free region for the load of p at line E extends backwards through region D. The interference-free region for the load of p at line C extends backwards through B and forwards through D. Thus we conclude that p cannot be modified by another thread. More generally, we may conclude that a variable x is interference-free along a section of the execution trace if it is accessed between every pair of release and subsequent acquire operations.
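As a compilable rendering of the Figure 3 idiom (a sketch; the type and function names are ours, and regions B and D are elided to comments), note that p is read between every release and the following acquire, so its interference-free regions chain across both inner critical sections:

    #include <pthread.h>

    struct obj {
        pthread_mutex_t mtx;
        int data;
    };

    struct obj *p;        /* global, potentially shared pointer */
    pthread_mutex_t mtx2 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t mtx3 = PTHREAD_MUTEX_INITIALIZER;

    int read_data(void) {
        pthread_mutex_lock(&p->mtx);    /* A: loads p              */
        pthread_mutex_lock(&mtx2);
        /* ... region B ... */
        pthread_mutex_unlock(&mtx2);
        int local = p->data;            /* C: loads p again        */
        pthread_mutex_lock(&mtx3);
        /* ... region D ... */
        pthread_mutex_unlock(&mtx3);
        pthread_mutex_unlock(&p->mtx);  /* E: loads p a third time */
        return local;
    }

A compiler armed with the analysis of Section 3 may therefore keep p in a register across the whole function.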
2.4 Loop Invariance

We can also use interference-free regions to determine loop-invariant references for loops that contain synchronization. The loop in Figure 4 is again not a simple synchronization-free region, so it is not immediately clear whether the load of x can be moved out of the loop.

    while(...) {
    A:  r1 = x;
        ...
        lock(&mtx);
        ...
        unlock(&mtx);
    }

Figure 4. The access at line A is loop-invariant.

However, x is guaranteed to be accessed between every lock release and the next lock acquire operation. Hence the interference-free region for each access of x must overlap with the previous and next one, if they exist. Therefore, all loaded values of x must be the same, so it is safe to move the load out of the loop (taking care to guard the load so as not to introduce a data race, e.g., by not executing it when the loop body would never execute).

Similar observations apply to loops that access C++0x [5] atomic objects. Consider the loop below, where a is an atomic variable: since atomic reads are acquire operations, the loop contains only acquire operations, and therefore the interference-free region of any access in the loop includes all later iterations of the loop. Thus the read of x can safely be hoisted out of the loop:

    do {
        r1 = x;
        ...
    } while(a);

2.5 Barriers

A common form of synchronization is rendezvous-style barriers (e.g., the pthread_barrier_t type). Barriers allow threads to stop and wait for other threads to reach a certain point in their execution before continuing. Treating barriers as release and acquire actions is too imprecise, so we handle them specially. We have proved that if a variable is accessed both before and after a call to the "wait" function for a barrier, then we can extend the interference-free regions for both accesses through the call. This makes intuitive sense, as any remote write to the variable would have to happen either before or after the barrier call, and the write would therefore race with at least one of the two local accesses. A formal explanation of barriers and their relation to interference-free regions appears in Appendix B.

3. Compiler Analysis

Compilers can use the concept of interference-free regions to improve the scope of optimizations for data-race-free programs. We have implemented a pass in the LLVM compiler framework [8] that identifies interference-free regions. Because we do not know which path through a program a given execution will take, we must be conservative: we identify synchronization calls that, no matter which path is taken, fall into some interference-free region for a given variable. We then remove the variable from the set of variables modified by each identified synchronization call.

3.1 Algorithm

Compilers apply sequential optimizations to multithreaded code by assuming that synchronization functions are opaque: they might access any variable. If we can show that no concurrently running threads modify a particular variable at the moment a synchronization call is executed, it is sound to remove that variable from the call's set of modified variables. (The bodies of the synchronization functions themselves may modify some variables—e.g., pthread_create, a synchronization operation with release behavior, modifies the new thread ID—so we must take care to distinguish these non-racy writes from writes in other threads.) We identify synchronization calls whose modified sets may be pruned by exploiting two symmetric insights (a concrete example appears after the list):

1. If, on a path through a function that passes through an acquire call C, there is an access A to a variable x such that A precedes C and there are no release calls between A and C, then C is in the interference-free region for A for that path. Therefore, if such an access A exists for every path through C, C does not modify x.

2. If, on a path through a function that passes through a release call C, there is an access A to a variable x such that A follows C and there are no acquire calls between C and A, then C is in the interference-free region for A for that path. Therefore, if such an access A exists for every path through C, C does not modify x.
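The following function (a hypothetical input to the analysis; the names are ours) exhibits both insights at once: every path reaching the lock call reads x first with no intervening release, and every path leaving the unlock call reads x again with no intervening acquire, so the analysis may treat both calls as not modifying x:

    #include <pthread.h>

    int x;                             /* shared variable of interest */
    int count;                         /* protected by m              */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void tally(void) {
        int before = x;                /* access preceding the acquire */
        pthread_mutex_lock(&m);        /* insight 1: cannot modify x   */
        count += before;
        pthread_mutex_unlock(&m);      /* insight 2: cannot modify x   */
        int after = x;                 /* access following the release */
        (void)after;
    }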
Our analysis determines two pieces of information. First, for each acquire call, we need the set of variables that must have been accessed since the last release call (ASLR). Second, for each release call, we need the set of variables that must be accessed before the next acquire call (ABNA). The former is a simple forward dataflow analysis; the latter is a simple backward dataflow analysis. For acquire calls, we remove any variables in the ASLR set from the modified set for that call; for release calls, we remove any variables in the ABNA set. A call to pthread_barrier_wait is interference-free for a given variable if that variable appears in both the ASLR and ABNA sets for the call.

As an example, in Figure 2, mtx_p is removed from the modified sets for the first three synchronization calls, although not the last, because the last is a release action and its ABNA set is empty. A redundant load analysis will therefore find that the second load can be eliminated. Figure 4 is an example of conservatism in our analysis: x is not in the ABNA set for the unlock call (because there is a path—the loop exit—that does not access x after the unlock), so access A will not be hoisted out of the loop.

In order to improve the accuracy of the analysis, we distinguish between read and write accesses in the implementation. For example, if a variable x must be modified (not just accessed) before an acquire call, then we may assume that the call neither reads nor writes x.

    #include <pthread.h>

    int max, shared_counter;
    pthread_mutex_t m;

    void *f(void *my_num) {
        int n = *((int *) my_num);
        while (n <= max) {
            pthread_mutex_lock(&m);
            shared_counter++;
            pthread_mutex_unlock(&m);
            n += 2;
        }
        return NULL;
    }

    int main() {
        pthread_t t1, t2;
        int n1 = 1;
        int n2 = 2;
        max = 10000000;
        shared_counter = 0;
        pthread_mutex_init(&m, NULL);
        pthread_create(&t1, NULL, f, (void *) &n1);
        pthread_create(&t2, NULL, f, (void *) &n2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Figure 5. A microbenchmark demonstrating the effectiveness of IFR-based optimizations. When combined with our analysis, GVN moves the load of max out of the loop in f().

3.2 Data-Race-Freedom

Since we assume data-race-freedom, programs with data races may be transformed in hard-to-predict ways, complicating debugging. This is already true for current compilers; otherwise current sequential techniques could not even be applied within synchronization-free regions [3]. We are not qualitatively changing the situation, and it does not appear to be a major problem in practice. We expect this analysis might be useful on an opt-in basis. We envision an "-ODRF" flag in future C/C++ compilers, the documentation for which would make explicit that racy programs might have unexpected behavior.

3.3 Preliminary Results

Our LLVM implementation is still in progress, but our initial results are promising. We inserted our analysis into LLVM's link-time optimization pipeline just before the loop-invariant code motion (LICM) and global value numbering (GVN) transformations. The machine used for compilation and running the tests was a 4-core 2.8GHz Intel Xeon with 16GB of RAM, running Linux.
Microbenchmark. Figure 5 shows a small C program that contains a redundant load: every iteration of the loop in function f() checks the value of max, which is constant. But because the loop contains synchronization, the standard LLVM alias analysis pass assumes the synchronization calls may modify max. Our analysis removes max from the modified sets for the synchronization calls, allowing GVN to hoist the load out of the loop. Running the program under valgrind shows that the optimization is indeed effective: the optimized program performs 10 million fewer loads. However, the program performs over 400 million memory operations in total, so there is no noticeable performance improvement.

Realistic applications. We compiled the SPLASH-2 benchmarks [13] using our analysis. The analysis did not affect the LICM pass (we suspect because LICM handles only function calls that do not modify any memory locations), but the GVN pass found numerous opportunities to remove redundant loads (Table 1). (Theoretically, our analysis should also be useful for dead store elimination, but we did not observe any improvement in LLVM's dead store elimination pass, so we concentrate on loads here.)

    Benchmark    LOC      Syncs in IFRs   Loads deleted   Speedup
    lu-n         678      14              16              1.0080
    radix        833      27              22              0.9962
    fft          899      12              13              1.0081
    lu-c         911      17              19              0.9988
    water-n      2,063    27              24              0.9941
    water-s      2,670    27              30              0.9701
    barnes       2,864    20              11              0.9977
    ocean-n      3,046    34              53              0.9840
    volrend      4,204    31              25              0.9412
    fmm          4,325    37              36              1.0050
    ocean-c      4,774    95              79              0.9915
    cholesky     5,139    83              19              0.9869
    raytrace     10,649   19              14              1.0260
    radiosity    11,760   93              45              1.0000

Table 1. SPLASH-2 results.

The third column in Table 1 lists the number of synchronization calls which were found to be in the interference-free region of at least one variable. The fourth column lists the difference in the number of loads deleted by GVN when run with and without our analysis. The fifth column gives the speedup for each benchmark, which we compute as the runtime of the version compiled without our analysis divided by the runtime of the version compiled with our analysis.

Although the analysis exposed a number of redundant loads, we have had little success in actually extracting performance from these optimizations. The benchmarks either have similar performance on both versions of the code, or our "optimized" version is slightly worse. One problem is that the loads may not be located on hot paths. Another possibility is that the optimizations increased the live ranges of variables, resulting in more loads as register variables are spilled to the stack (perhaps due to the low number of callee-saved registers on the x86 architecture).

4. Other Applications

Interference-free regions are useful for understanding the behavior of functions that contain internal synchronization. For example, the C++ draft specification defines malloc and free as acquire and release synchronization operations, respectively [5]. We can use interference-free regions in the code below to establish that global variables p and q are not referenced by other threads, and therefore that the two free operations properly deallocate the memory allocated by the two malloc operations:

    p = malloc(...);
    q = malloc(...);
    free(p);
    free(q);
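Spelling out the reasoning (a compilable sketch; the globals and the allocation sizes are ours): the write to p is followed, with no intervening release, by the acquire performed by the second malloc, so its interference-free region extends forward to the read of p inside free(p); symmetrically, the read of q in free(q) is preceded, with no intervening acquire, by the release performed by free(p), so its region extends backward to the write of q.

    #include <stdlib.h>

    void *p, *q;              /* global, potentially shared */

    void reuse(void) {
        p = malloc(16);       /* write to p; malloc acts as an acquire */
        q = malloc(16);       /* write to q; acquire inside p's IFR    */
        free(p);              /* read of p; free acts as a release     */
        free(q);              /* read of q; IFR reaches back past the  */
    }                         /* release in free(p) to the write of q  */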
This kind of reasoning is applicable not just to compilers, but also to static analysis tools, where reasoning about properties such as deadlock freedom or memory allocation often requires knowledge about variables that might conceivably be changed by other threads. We should be able to use our alias analysis to improve the accuracy of such existing analyses, whenever it is sound to assume data-race-freedom.

5. Related Work

Our analysis is related to a compiler transformation known as roach-motel reordering, which increases the size of critical sections by moving actions either past a lock acquire or before a lock release. In some cases, it is possible to use this line of reasoning to infer interference-free regions by repeatedly swapping an access until it reaches the end of a region. Sevcik established that this transformation is legal for data-race-free programs [10]. We believe our characterization is independently useful, particularly since we make very minimal assumptions about the language and synchronization primitives and we avoid complex reasoning about syntactic code transformations. In particular, we know of no prior presentation of a similar compiler analysis, nor any discussion of the consequences of overlapping regions.

Previous work by several of the authors of this paper describes a framework for enabling sequential optimizations in multithreaded programs [7]. That work identifies paths on which variables are "siloed" by iteratively refining a graph of the program. (This "siloed" property is essentially the same as our notion of interference-freedom.) Like ours, their implementation refines the modified/referenced sets for synchronization calls. Our work is complementary, as our analysis could likely be incorporated neatly into their framework as an "interference-type refinement."

6. Conclusion

We have presented an approach to inferring interference-free regions in data-race-free programs. Since our analysis is based heavily on data-race-freedom, it requires only information about a single thread. We have implemented these ideas as an alias analysis pass in an optimizing compiler, and we have also discussed how this idea could apply to static analysis of multithreaded code. Our approach enables new kinds of inferences about program behavior that had not previously been considered.

Acknowledgments

We would like to thank Dan Grossman and the MSPC reviewers for their valuable feedback on this paper.

References

[1] S. V. Adve and H.-J. Boehm. Memory models: A case for rethinking parallel languages and hardware. Communications of the ACM, 53(8):90–101, August 2010.
[2] M. Batty, S. Owens, S. Sarkar, P. Sewell, and T. Weber. Mathematizing C++ concurrency. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 2011.
[3] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2008.
[4] C Standards Committee. Programming Languages—C. C standards committee paper WG14/N1539, http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1539.pdf, November 2010.
[5] C++ Standards Committee, Pete Becker, ed. Working Draft, Standard for Programming Language C++. C++ standards committee paper WG21/N3242=J16/11-0012, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf, February 2011.
[6] IEEE and The Open Group. IEEE Standard 1003.1-2001. 2001.
[7] P. Joisha, R. S. Schreiber, P. Banerjee, H.-J. Boehm, and D. R. Chakrabarti. A technique for the effective and automatic reuse of classical compiler optimizations on multithreaded code. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 2011.
[8] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004.
[9] J. Manson, W. Pugh, and S. Adve. The Java memory model. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 2005.
[10] J. Sevcik. Program Transformations in Weak Memory Models. PhD thesis, University of Edinburgh, 2008.
[11] J. Sevcik and D. Aspinall. On the validity of program transformations in the Java memory model. In European Conference on Object-Oriented Programming, 2008.
[12] Reference Manual for the Ada Programming Language: ANSI/MIL-STD-1815A-1983. United States Department of Defense, 1983. Springer.
[13] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In International Symposium on Computer Architecture, 1995.

A. Analysis

In this section, we formally define and prove the correctness of the interference-free regions described in Section 2. Most proofs are omitted for space; a Coq script with full proofs of all theorems is available online at http://wasp.cs.washington.edu/wasp_memmodel.html.

Define an execution to be a set of memory actions, together with three relations on that set. The sequenced-before relation totally orders all actions in a thread, and does not order actions from different threads. The synchronizes-with relation defines an ordering between synchronization operations. Finally, the happens-before order is the reflexive transitive closure of the union of the sequenced-before and synchronizes-with orders. Formally, we say that an execution is a quadruple (A, ≤sb, <sw, ≤hb) where:

• A is a set of actions, where each action is a triple (t, k, u) of a thread ID t, a kind of action k, and a UID (unique ID) u. Kinds of actions include reads and writes to shared variables:

    k ::= read(x) | write(x) | ...

• ≤sb (sequenced-before) is a partial order over UIDs that totally orders actions with the same thread ID.

• <sw (synchronizes-with) is a relation over UIDs. We do not define this relation, as the details are irrelevant to our analysis; presumably, lock releases synchronize-with acquires of the same lock, writes to volatile variables synchronize-with reads of the same variable, and so on.

• ≤hb is an antisymmetric partial order over UIDs constructed as the reflexive transitive closure of the union of ≤sb and <sw: ≤hb = (≤sb ∪ <sw)*. (The actual C++0x happens-before relation is not transitively closed, so that dependency ordering (memory_order_consume) can be supported. The effect is to prevent ≤sb from contributing to ≤hb in certain contexts. This does not affect our arguments.)

We omit the values being read and written (or, equivalently, a writes-seen relation). Because we only consider data-race-free programs, every normal read sees the value most recently written to the same variable in ≤hb.

Next we define the notion of incoming and outgoing edges: happens-before edges that start in one thread and end in another. An action has an outgoing edge if it synchronizes-with an action in another thread. An action has an incoming edge if an action in another thread synchronizes-with that action. The intuition is that by establishing that a subsection of a single thread's execution does not have any incoming or outgoing edges, we can convince ourselves that no other thread could interfere with that thread's execution.

Definition 1 (Outgoing Edge). Let (t1, k1, u1) ∈ A. u1 has an outgoing happens-before edge if there exists (t2, k2, u2) ∈ A such that u1 <sw u2 and t1 ≠ t2.

Definition 2 (Incoming Edge). Let (t2, k2, u2) ∈ A. u2 has an incoming happens-before edge if there exists (t1, k1, u1) ∈ A such that u1 <sw u2 and t1 ≠ t2.
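The paper's mechanized development is in Coq; purely as an illustration of the shape of these definitions (our sketch, in Lean-like notation, not the actual script), the model and the two edge predicates might be rendered as:

    -- A sketch (ours) of the execution model; thread IDs are
    -- natural numbers, and sw/sb are abstract relations.
    structure Execution where
      Action : Type
      tid    : Action → Nat               -- thread ID of an action
      sw     : Action → Action → Prop     -- synchronizes-with
      sb     : Action → Action → Prop     -- sequenced-before

    -- Definition 1: u synchronizes-with some action in another thread.
    def outgoing (E : Execution) (u : E.Action) : Prop :=
      ∃ v, E.sw u v ∧ E.tid u ≠ E.tid v

    -- Definition 2: some action in another thread synchronizes-with u.
    def incoming (E : Execution) (u : E.Action) : Prop :=
      ∃ v, E.sw v u ∧ E.tid v ≠ E.tid u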
If there is a happens-before edge between two actions in two different threads, then there must be an action in the first thread with an outgoing happens-before edge.

Lemma 1 (Existence of Outgoing Edge). Let (t1, k1, u1), (t2, k2, u2) ∈ A such that u1 ≤hb u2 and t1 ≠ t2. Then there exists u3 such that u1 ≤sb u3, u3 ≤hb u2, and u3 has an outgoing edge.

The proof (omitted) is a straightforward inductive case analysis on u1 ≤hb u2. Intuitively, as happens-before is the closure of the sequenced-before and synchronizes-with relations, clearly any chain of happens-before edges that crosses threads must include an outgoing synchronizes-with edge.

Symmetrically, if there is a happens-before edge between two actions in two different threads, the second thread must have an action with an incoming happens-before edge.

Lemma 2 (Existence of Incoming Edge). Let (t1, k1, u1), (t2, k2, u2) ∈ A such that u1 ≤hb u2 and t1 ≠ t2. Then there exists u3 such that u1 ≤hb u3, u3 ≤sb u2, and u3 has an incoming edge.

We now establish our key result: if a region of code after a normal memory access has no outgoing happens-before edges, then any conflicting write must happen-after that region of code. The usefulness of this result depends crucially on the fact that, in an execution of a data-race-free program, reads and writes to the same variable are ordered by happens-before.

Theorem 1 (Forwards Interference-Free). Let (t1, read(x), u1), (t1, k2, u2) ∈ A such that u1 <sb u2. Furthermore, suppose that for all u3 such that u1 ≤sb u3 <sb u2, u3 does not have an outgoing edge. Finally, suppose there is some write (t4, write(x), u4) such that u1 ≤hb u4 and t1 ≠ t4. Then u2 ≤hb u4.

Proof. By Lemma 1, there exists u3 such that u1 ≤sb u3, u3 ≤hb u4, and u3 has an outgoing edge. Clearly u2 ≤sb u3, because otherwise u3 would violate an assumption. Then u2 ≤hb u4 by transitivity of happens-before.

Symmetrically, we can show that a variable is interference-free for a region in which there are no incoming happens-before edges.

Theorem 2 (Backwards Interference-Free). Let (t1, read(x), u1), (t1, k2, u2) ∈ A such that u1 <sb u2. Furthermore, suppose that for all u3 such that u1 <sb u3 ≤sb u2, u3 does not have an incoming edge. Finally, suppose there is some write (t4, write(x), u4) such that u4 ≤hb u2 and t1 ≠ t4. Then u4 ≤hb u1.

By applying Theorems 1 and 2, we can conclude that any write must happen-before or happen-after the entire interference-free region for an access.

B. Barriers

In this section we extend our analysis to cover rendezvous-style barriers. Our key observation is that even though barriers act like a release operation followed by an acquire operation, we can still reason about interference-freedom if there are accesses to a variable both before and after a given invocation of a barrier.

Assume that the possible kinds of actions include notify and wait actions on barrier identifiers b:

    k ::= notify(b) | wait(b) | ...

Notify and wait actions have the following behavior:

Notify synchronizes-with wait: If (t1, notify(b), u1), (t2, wait(b), u2) ∈ A, then u1 <sw u2.

Notify and wait only synchronize-with each other: If (t1, k1, u1), (t2, k2, u2) ∈ A and u1 <sw u2, then k1 = notify(b) if and only if k2 = wait(b).
Notify and wait are called in the proper order with no intervening synchronization: If a thread t calls notify and wait on a barrier b ((t, notify(b), u1), (t, wait(b), u1') ∈ A), then notify is called before wait (u1 ≤sb u1'), and t does not perform any synchronization actions between the calls to notify and wait. (Formally: if ∃u2, u3 such that u1 ≤sb u2 ≤sb u1' and u2 <sw u3, then u2 = u1; and if u1 ≤sb u2 ≤sb u1' and u3 <sw u2, then u2 = u1'.)

Note that each barrier is invoked only once per thread; otherwise it would not always be the case that a notify synchronizes-with a wait on the same barrier, or that notify is sequenced-before wait with no intervening synchronization. Although real programs may wait on the same barrier multiple times, this is simply a convenience; it is possible to allocate a new barrier for every invocation. Moreover, we could make our formalism more realistic by tagging notify and wait actions with a generation that indicates how many times a thread has invoked this particular barrier, but such a change would not increase the expressiveness of the model.

Our key insight is that nothing can happen-between a notify and a wait. First, we prove two lemmas. The first lemma establishes that, if an action in another thread happens-after a notify, then either the action also happens-after the subsequent wait, or the happens-before chain between the notify and the remote action includes a synchronizes-with edge originating from the notify.

Lemma 3 (Inversion of Happens-After-Notify). Suppose (t1, notify(b), u1), (t1, wait(b), u1'), (t2, k, u2) ∈ A such that t1 ≠ t2 and u1 ≤hb u2. Then either (1) u1' ≤hb u2 or (2) there exists u3 such that u1 <sw u3 ≤hb u2.

The proof (omitted) is a straightforward inductive analysis on u1 ≤hb u2. A symmetric lemma establishes that if a remote action happens-before a wait, then either the action also happens-before the notify, or the happens-before chain includes a synchronizes-with edge terminating at the wait.

Lemma 4 (Inversion of Happens-Before-Wait). Suppose (t1, notify(b), u1), (t1, wait(b), u1'), (t2, k, u2) ∈ A such that t1 ≠ t2 and u2 ≤hb u1'. Then either (1) u2 ≤hb u1 or (2) there exists u3 such that u2 ≤hb u3 <sw u1'.

Given these two lemmas, the key theorem follows easily:

Theorem 3 (Nothing Happens-Between Notify and Wait). Suppose (t1, notify(b), u1), (t1, wait(b), u1'), (t2, k, u2) ∈ A such that t1 ≠ t2. Then ¬(u1 ≤hb u2 ≤hb u1').

Proof. Assume u1 ≤hb u2 ≤hb u1'. We will establish a contradiction by showing that ≤hb is no longer antisymmetric. By Lemma 3, either u1' ≤hb u2 or there exists u3 such that u1 <sw u3 ≤hb u2. The first case is a violation of the antisymmetry of ≤hb, as we have that u2 ≤hb u1', and we know u1' ≠ u2 because they are from different threads. Similarly, we can rule out the first case of Lemma 4. Therefore (using u4 as the witness for the second case of Lemma 4), we have that u1 <sw u3 ≤hb u2 ≤hb u4 <sw u1'. By our assumptions about barriers, u3 is a wait(b) action and u4 is a notify(b) action. Therefore u4 <sw u3, which creates a cycle in ≤hb: u3 ≤hb u4 and u4 ≤hb u3. We know that u3 ≠ u4, as they are wait and notify actions, respectively, so the theorem is proved.

We can combine Theorem 3 with Theorems 1 and 2 to infer larger interference-free regions.

    r1 = x;
    ...
    pthread_mutex_lock(...);
    ...
    pthread_mutex_lock(...);
    ...
    pthread_barrier_wait(b);
    ...
    pthread_mutex_unlock(...);
    ...
    pthread_mutex_unlock(...);
    ...
    r2 = x;

Figure 6. Interference-free region around a barrier wait.
For instance, consider the code in Figure 6. The call to pthread_barrier_wait performs both the notify and wait actions, satisfying the requirement that these actions occur in the proper order with no intervening synchronization. Suppose a remote write to x were to happen-after the first read and happen-before the second. By Theorem 1, the write must happen-after the notify; by Theorem 2, the write must happen-before the wait. Therefore, by Theorem 3, no such write exists, so the two reads see the same value. In effect, the interference-free region for x extends across the call to pthread_barrier_wait—but only because there are reads of x before and after the barrier. Otherwise the remote write could happen-before or happen-after the barrier.
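As a compilable variant of Figure 6 (our sketch; each thread passes its own pair of mutexes, so holding them across the barrier cannot deadlock, and the barrier is assumed to be initialized elsewhere), the two reads of x bracket the barrier wait, so by Theorems 1–3 they must return the same value:

    #include <pthread.h>

    int x;                    /* not written during this phase      */
    pthread_barrier_t b;      /* initialized elsewhere, e.g.,
                                 pthread_barrier_init(&b, NULL, 2); */

    int phase(pthread_mutex_t *ma, pthread_mutex_t *mb) {
        int r1 = x;                    /* read before the barrier   */
        pthread_mutex_lock(ma);
        pthread_mutex_lock(mb);
        pthread_barrier_wait(&b);      /* notify + wait             */
        pthread_mutex_unlock(mb);
        pthread_mutex_unlock(ma);
        int r2 = x;                    /* guaranteed equal to r1    */
        return r1 == r2;               /* always 1 in a DRF program */
    }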