Asynchronous Assertions

Edward E. Aftandilian, Tufts University, eaftan@cs.tufts.edu
Samuel Z. Guyer, Tufts University, sguyer@cs.tufts.edu
Martin Vechev, IBM Research, mtvechev@us.ibm.com
Eran Yahav, Technion, Israel, yahave@cs.technion.ac.il

Abstract

Assertions are a powerful bug detection technique. Traditional assertion checking, however, is performed synchronously, imposing its full cost on the runtime of the program. As a result, many useful kinds of checks are impractical because they lead to extreme slowdowns. We present a solution that decouples assertion evaluation from program execution: assertions are evaluated asynchronously while the program continues to execute. Our technique ensures that the assertion checking thread operates on a consistent view of the global state, and that an assertion always produces the same result as it would in a serial execution. We implemented our technique in a system called STROBE, a snapshot-based system for asynchronous assertion checking in both single- and multi-threaded Java applications. STROBE runs inside the Java virtual machine and uses copy-on-write to build snapshots incrementally. We find that asynchronous checking scales almost perfectly over synchronous checking in many cases, indicating that the snapshot overhead is quite low. STROBE provides tolerable overheads (under 2X) even for heavy-weight assertions that would otherwise result in crushing slowdowns.

1. Introduction

Assertions are a powerful technique for program monitoring and bug detection. Traditional general-purpose assertion checking, however, is currently considered too expensive to be of practical use. Indeed, because assertion checking is performed synchronously, it fundamentally imposes its full cost on the runtime of the program, regardless of how well the assertion checking code is optimized [3, 1, 35]. In turn, this makes the process of writing assertion checks inherently burdensome and impractical, as the programmer is forced to continually estimate and limit the complexity and frequency of assertion checks to avoid extreme slowdowns.

Asynchronous Assertions In this work we present asynchronous assertions. The idea behind asynchronous assertions is to allow assertion evaluation to proceed without stopping the application. With this approach, the program issues an assertion and immediately continues its execution, while another thread evaluates the assertion in the background. With asynchronous assertions, the cost of an assertion check is no longer imposed on the program runtime, thereby allowing developers to write arbitrarily complex assertions and removing a significant roadblock towards making dynamic assertion checking practical.

Asynchronous assertions follow a long line of research on pushing program monitoring and analysis into a separate thread [28, 2, 25, 24, 31, 33]. Our work shares many of the goals of prior work, but differs in two critical ways. First, these approaches cannot be used to check arbitrarily complex properties, as they lack a general mechanism for ensuring that helper threads see a consistent snapshot of the program state. Second, these systems mostly target low-level memory bugs, such as out-of-bounds pointer and array accesses and uninitialized reads, and provide specific solutions for each of these low-level properties.
Snapshots The key idea behind our approach is to construct snapshots incrementally by saving the relevant program state right before the application modifies it. The assertion checking code is compiled to be aware of the copies of the program state, giving the illusion of a monolithic snapshot. As a result, asynchronous assertions always produce the same result as they would in a serial execution. The key challenge in implementing this approach lies in correctly and efficiently managing multiple snapshots for multiple concurrent assertion checks. Since the state of modern programs resides mostly in the heap, many critical assertions involve heap properties, which are the most difficult to check efficiently. Our experiments focus on these difficult cases.

Programming model We view the evaluation of an asynchronous assertion as a kind of future. In this model, when the program executes an assertion, it starts the concurrent check and immediately returns a future, a handle to its future result. The program can later force the evaluation of the future, blocking until the check is complete. If the programmer does not force its evaluation, the assertion check still completes asynchronously at some future time. The challenge for programmers is to place assertions so that they have enough time to execute before critical code that depends on their results. While the interface we provide is similar to futures, it is important to note that an assertion is always evaluated on a consistent snapshot of the program state. The actual assertions consist of ordinary Java code (such as a repOk or invariant method) marked with an annotation that the system recognizes.

Evaluation Our implementation, which we call STROBE, is a snapshot-based system for supporting asynchronous assertion checking. STROBE is implemented in the Jikes RVM Java virtual machine and successfully runs both single- and multi-threaded Java applications. The two key mechanisms in STROBE are (a) a copy-on-write write barrier in the application code that builds a snapshot of the global state for each concurrently running assertion, and (b) a read barrier in the assertion code that forwards reads to the snapshots. Our evaluation of STROBE focuses on exploring the performance characteristics of this mechanism across a range of conditions. We systematically vary the frequency of the assertion checks, the cost of individual assertion checks, and the number of dedicated assertion checker threads. We find that asynchronous checking scales well over synchronous checking, and in many cases enables kinds of assertions (complex or very frequent) that would otherwise be impractical.

Main Contributions The main contributions of this paper are as follows:

• We present asynchronous assertions: assertions that are evaluated concurrently, without stopping the application, but that guarantee the same result as sequential evaluation.

• An efficient implementation of our technique in a system called STROBE.

• An evaluation of STROBE on a number of realistic usage scenarios. Our results indicate that STROBE works well in practice: it scales almost perfectly over synchronous checking, and can execute a range of intensive assertion workloads with overheads under 2X.

2. Overview

In this section, we illustrate the operation of asynchronous assertion checking on a simple example. Then, we briefly demonstrate how the assertions are used in the development of a classic data structure (a red-black tree).

...
 1: N a = new N(); // A
 2: N b = new N(); // B
 3: N c = new N(); // C
 4: a.n = b;
 5: b.n = c;
 6: assert(reach(a,c));
 7: b.n = null;
 8: assert(reach(a,c));
 9: N d = new N(); // D
10: d.n = c;
11: a.n = d;

boolean reach(N a, N b) {
  while (a != null) {
    if (a == b) return true;
    a = a.n;
  }
  return false;
}

Figure 1. A simple example with assertions checking reachability.

 6: AsyncAssert r1 = new AsyncReachAssert(a, c);
 7: b.n = null;
 8: AsyncAssert r2 = new AsyncReachAssert(a, c);
 9: N d = new N();
10: d.n = c;
11: a.n = d;

class AsyncReachAssert<T> extends AsyncAssert {
  boolean evaluate(T x, T y) { return reach(x, y); }
}

Figure 2. The example of Fig. 1 using asynchronous assertions. Lines 1-5 are omitted.

Using Asynchronous Assertions To use asynchronous assertions as provided by STROBE, we rewrite the code of Fig.
1 to use the future-like API of asynchronous assertions, as shown in Fig. 2. Using STROBE, the evaluation of both assertions can proceed concurrently with the program and with other in-flight assertions. Furthermore, regardless of the concurrent interleaving between the program and the assertions, the assertions will always produce the same result as if they were executed in a synchronous manner. In our example, regardless of how both asynchronous assertions interleave with the program, the assertion at line 6 will always pass and the assertion at line 8 will always fail. A Simple Example Consider the program excerpt shown in Fig. 1. The program first creates three new objects, and creates references between them. Then, it checks whether we can reach the last object from the first via the assertion in line 6. The code for reach is also shown in the figure: it iterates over the list and checks whether we can reach one object from another. Such assertion checks can be realized in the language runtime as in the case of Phalanx [35], or as a separate procedure in the program as in Ditto [30] (as shown here). In either case, we restrict the assertion to only allow reads of the shared state, and not modifications. In our example, in a serial execution of the program, the assertion at line 6 will pass, while the assertion at line 8 will fail. Evaluation of Asynchronous Assertions Fig. 3 shows (informally) how STROBE evaluates the asynchronous assertions in the program of Fig. 2. For this example, we assume that each object consists of three fields, the first two refer to potential snapshots of the object (one for each active assertion). We refer to these two fields as snapshot fields. The last field is the object field n. Generally, the number of snapshot fields is parametric, and there could be more than two snapshot fields. In the figure, we depict objects as rectangles consisting of the three aforementioned fields. For each object, we write the name of the local variable that points to it under the object node. For presentation purposes, we label each object with a name shown above the object node. The program starts by executing the statements up to and including line 5, resulting in a state where object A points to B which points to C. First Asynchronous Assertion In the next step, the asynchronous assertion of line 6 is invoked. STROBE starts the assertion in a separate thread and assigns to it a unique snapshot identifier 0 (we refer to this assertion as assertion 0). Assertion 0 begins execution, and the thread executing it is immediately preempted before processing any objects. Next the program executes the statement at line 7 which stores null in the n field of object B. Because there is an active assertion (assertion 0), a snapshot of object B is now taken (the copy is called Bs ), and the snapshot field 0 is updated to hold a pointer to the snapshot copy Bs . Second Asynchronous Assertion The program proceeds to execute line 8, invoking another asynchronous assertion (assigned snapshot identifier 1). The new assertion starts executing, but the thread executing it is preempted before processing any objects. Newly Allocated Objects Execution proceeds, and at line 9, a new object D is allocated. Newly allocated objects should never be seen by in-flight assertions as these new objects were not part of the program state when the assertions started. 
This invariant is ensured because whenever we store a reference to a new object into an existing object, the existing object is snapshotted (and, as we will see, assertions use the snapshotted copy). However, we would also like new objects themselves not to be snapshotted (as they will never be encountered by the assertion). Hence, at allocation time, the snapshot fields for all active assertions are set to a special symbol ⊥ to denote that the object is considered newly allocated for the in-flight assertions. In our case, there are two active assertions (0 and 1), and correspondingly their two snapshot fields are set to ⊥. At step 10, the newly allocated object is updated. However, the object is not snapshotted for either of the two in-flight assertions 0 and 1, as the object was allocated after both assertions had already started.

Figure 3. Illustration of a specific execution that can occur in our system. (The figure shows the heap after steps 1-5, 7, 9, 10, and 11, including the snapshot copies As and Bs, and marks where asynchronous assertions 0 and 1 start and end at steps 6 and 8.)

Traversal through Snapshotted Objects Finally, the program completes step 11 and updates object A. Object A is now snapshotted (the snapshot copy is object As), with both active assertions now sharing the same snapshot copy As. It is generally not necessary that assertions share the same object snapshot, but in some cases, as in our example, this is possible. Indeed, this is an effective space optimization that our system performs whenever possible. We note that in general an object can have more than one snapshot copy (each snapshot copy belonging to a different in-flight assertion). In our example, the two active assertions did not process objects while the program was updating them, but in general, assertions can interleave their execution with the program in an arbitrary way. That is, in our example, an assertion can start processing objects from any of the states in the figure.

If assertion 0 resumes its execution from the last state (after step 11), then it will first visit object A and observe that the snapshot field corresponding to its identifier is non-null (field index 0). It will then follow that reference and process the snapshot copy of A, i.e., As. Then, it will reach object B, where once again it will observe that B has a snapshot copy Bs (for that assertion). It will then follow the reference to the snapshot copy Bs, process Bs (instead of B), and reach object C. Indeed, following the object path A → As → B → Bs → C, assertion 0 will pass, the same result as if it had been executed synchronously after step 5. This is because the snapshot copies As and Bs contain the same contents as objects A and B right after step 5.

Now, if assertion 1 resumes its execution from the last state (after step 11), then it will first visit object A and observe that the snapshot field corresponding to its identifier is non-null (field index 1). It will follow that reference and process the snapshot copy of A, i.e., As. Then, it will reach object B, where, unlike assertion 0, it will observe that object B has not been snapshotted for assertion 1 (the corresponding snapshot field in object B is null). The n field of object B is also null. Hence, following the object path A → As → B → null, assertion 1 will fail, the same result as if it had been executed synchronously after step 7.
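To make this traversal concrete, the following minimal Java sketch shows how an assertion's reads can be redirected through the per-assertion snapshot fields described above. The class N, the snap array, and the sid parameter are illustrative names for this sketch only; in STROBE the redirection is performed by a compiler-inserted read barrier (Section 4), not by hand-written code.

class N {
  N n;        // the list field from Fig. 1
  N[] snap;   // one snapshot slot per assertion identifier; by the invariant above,
              // an assertion never encounters the ⊥ marker during traversal
  N(int maxAssertions) { snap = new N[maxAssertions]; }
}

class ReachCheck {
  // Read the n field of obj as assertion sid sees it: use the snapshot copy
  // if one exists for sid, otherwise read the live object.
  static N readN(N obj, int sid) {
    N copy = obj.snap[sid];
    return (copy != null) ? copy.n : obj.n;
  }

  // reach(a, c) from Fig. 1, evaluated against assertion sid's snapshot.
  static boolean reach(N a, N b, int sid) {
    while (a != null) {
      if (a == b) return true;
      a = readN(a, sid);
    }
    return false;
  }
}

With the heap of the example, reach(a, c, 0) follows A → As → B → Bs → C and returns true, while reach(a, c, 1) follows A → As → B → null and returns false, matching the synchronous results at lines 6 and 8.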
After an assertion completes, it cleans up its state: for all allocated objects, the corresponding snapshot field for that assertion is set to null. The snapshot objects can then be garbage collected like all other objects.

Custom Assertions Consider the problem of implementing a complex sequential data structure such as a red-black tree. In many cases, it is desirable to check that certain post-conditions hold after a method on the data structure completes. A typical assertion check that a programmer may write is shown in Fig. 4. There, a procedure called checkAssertions is invoked, which in turn invokes a procedure traverseAndCheck that recursively iterates over the tree and checks various properties: whether the tree is balanced and ordered, and whether the coloring of the nodes is correct. Here, we assume that a node in the tree has the following fields, with the obvious meaning: left, right, key, parent and color (which can be RED or BLACK).

int traverseAndCheck(Node n, Node p) {
  if (n == null || n == nil) return 1;
  Node l = n.left;
  Node r = n.right;

  /* Recursive traversal: return count of BLACK nodes */
  int lb = traverseAndCheck(l, n);
  int rb = traverseAndCheck(r, n);

  /* Check that the tree is balanced */
  if (lb != rb || lb == -1) return -1;

  /* Check that the tree is ordered */
  Integer val = (Integer) n.key;
  if (l != nil && val <= (Integer) l.key) return -1;
  if (r != nil && val >= (Integer) r.key) return -1;

  /* Check colors */
  int c = n.color;
  if (c == RED && (l.color != BLACK || r.color != BLACK)) return -1;

  return lb + (n.color == BLACK ? 1 : 0);
}

public boolean checkAssertions() {
  return (traverseAndCheck(root, nil) != -1);
}

Figure 4. An assertion procedure that performs recursive checking of various safety properties on a red-black tree.

Normally, when the program calls the assertion checking procedure checkAssertions, it stops, the assertion is evaluated, and then the program resumes. In this example, as the assertion iterates over the whole tree, such a pause in the program execution can be substantial, especially if the assertion check is triggered at medium to high frequency. In contrast, with our approach, once checkAssertions is invoked, the program can immediately continue execution, without stopping and waiting for the result. The result of the assertion is delivered to the program when the assertion completes. As explained for the illustrative example, every time a new assertion is triggered, logically, a new snapshot of the global state is created, and the assertion is evaluated on that snapshot. The performance overhead of our approach depends on how long the assertion takes to complete and on how many and what kind of modifications to the global state occur during assertion evaluation. Our approach is similarly applicable to assertions that check global properties such as domination, disjointness and ownership [35, 1].
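To show what this looks like from the programmer's point of view, the following illustrative sketch issues the red-black tree check of Fig. 4 asynchronously, in the style of Fig. 2. The RedBlackTree type, the zero-argument evaluate(), and the force() method name are assumptions made for this sketch; the paper describes forcing the future but does not fix the exact API.

class TreeInvariantAssert extends AsyncAssert {
  private final RedBlackTree tree;
  TreeInvariantAssert(RedBlackTree tree) { this.tree = tree; }
  // Runs in a checker thread against a consistent snapshot of the heap.
  boolean evaluate() { return tree.checkAssertions(); }   // the check of Fig. 4
}

// In the application:
//   tree.insert(key, value);
//   AsyncAssert check = new TreeInvariantAssert(tree);   // issued; program continues at once
//   ...                                                  // unrelated work overlaps the check
//   boolean ok = check.force();                          // optionally block for the result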
3. Formalization

We assume a standard programming language with a concrete semantics that defines the program state and the transitions between states. The program state consists of the local state of each thread and the shared program heap, defined in the usual way (see Tab. 1).

aobj ⊆ Objs                               Allocated objects
v ∈ Val = Objs ∪ {null, ⊥} ∪ N            Values
h ∈ Heap = Objs × FieldIds ⇀ Val          Heap
s ∈ Snapshots = Objs × SnapIds ⇀ Objs     Snapshots
as ⊆ SnapIds                              Active snapshots

Table 1. Semantic domains for a heap.

A state is a tuple σ = ⟨aobj_σ, h_σ, s_σ, as_σ⟩ ∈ ST, where ST = 2^Objs × Heap × Snapshots × 2^SnapIds. Here, FieldIds is a finite set of field identifiers and SnapIds is a finite set of snapshot identifiers. A state σ keeps track of the set of allocated objects (aobj_σ), a mapping from fields of allocated objects to values (h_σ), a mapping from an object and a snapshot identifier to a copied object (s_σ), and a set of active snapshot identifiers (as_σ). The first two state components, aobj_σ and h_σ, are standard. We use the special value ⊥ to denote newly allocated objects (for a given snapshot identifier). The component s_σ is an instrumented part of the state and represents the snapshots of a given object, one snapshot for each active snapshot identifier. The set of active snapshot identifiers as_σ represents the identifiers of the snapshots that are currently being computed.

Fig. 5 shows the procedures that we introduce in order to support snapshots. We use o.f = nv and o.sid = s as shortcuts for updating the current state into a new state σ where, respectively, h_σ(o, f) = nv and s_σ(o, sid) = s. Similarly, we use the shortcut c = o.sid for obtaining s_σ(o, sid) in state σ. The procedures are shown in set notation, and for simplicity we assume that each procedure is executed atomically. In Fig. 5, processObject and endSnapshot are invoked by an active assertion with identifier sid, while writeBarrier, allocBarrier and startSnapshot are invoked by the program. In Section 4, we provide further details on the implementation of some of these procedures. For any initial state init, no snapshots are active, i.e., as_init = ∅, and the snapshots for all objects are set to null, i.e., ∀o, sid. s_init(o, sid) = null.

processObject(o, sid) {
  if (o.sid ≠ null) o = o.sid;
  for each f ∈ o
    t = o.f;
    ...
}

endSnapshot(sid) {
  as = as \ {sid};
  for each o ∈ aobj
    o.sid = null;
}

writeBarrier(o, f, nv) {
  for each sid ∈ as
    if (o.sid = null)
      c = copyObj(o);
      break;
  for each sid ∈ as
    if (o.sid = null)
      o.sid = c;
  o.f = nv;
}

allocBarrier(o) {
  for each sid ∈ as
    o.sid = ⊥;
}

startSnapshot() {
  freesid = SnapIds \ as;
  if (freesid = ∅) return BUSY;
  pick sid ∈ freesid;
  as = as ∪ {sid};
}

Figure 5. Basic snapshot procedures.

The procedure startSnapshot() is invoked by the program when a new assertion is triggered. If all of the snapshot identifiers are taken, then the procedure returns BUSY, which the program can use to block and try again later (or to abandon this particular assertion). Otherwise, some snapshot identifier is selected and placed in the set of active identifiers.

In the procedure allocBarrier(o), for all of the active snapshots, the newly allocated object's snapshot is set to ⊥, meaning that this object is to be excluded from being snapshotted for any of the active snapshots.

The procedure writeBarrier(o, f, nv) is invoked by the program when it is about to modify the heap. The operation takes the object to be modified, the field of that object, and the new value. Because there could be multiple active snapshots at the time the modification occurs, the barrier first checks whether the object has been snapshotted for any of the currently active snapshots. If there is a snapshot identifier for which the object has not been snapshotted, then a new copy is created, and that snapshot copy is subsequently stored for each of the active snapshots. This optimization enables multiple active snapshots to share the same snapshotted object. Finally, the object is updated with the new value. Note that if o.sid = ⊥ (i.e., the object is newly allocated), the object is not snapshotted for the active snapshot identifier sid.

The procedure processObject(o, sid) is invoked by the assertion every time it encounters an object in the heap. It checks whether the object was already snapshotted for sid and, if so, obtains the snapshotted copy. Note that this procedure never encounters a newly allocated object; that is, it is an invariant that o.sid ≠ ⊥. This is because if a newly allocated object were stored in the heap, the object into which it was stored would have been snapshotted by the write barrier. Finally, traversal of the object proceeds as usual.

The procedure endSnapshot(sid) is invoked by the assertion when the snapshot computation finishes. Then, the snapshot identifier is deactivated and, for all currently allocated objects, their snapshot copies, if any, are made unreachable (to be collected at some point later by garbage collection).
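To connect the procedures of Fig. 5 back to the example of Fig. 2, the following trace sketches how the instrumented state evolves; only the interesting components are shown, and the line numbering follows Fig. 1.

• After lines 1-5: aobj = {A, B, C}, A.n = B, B.n = C, as = ∅, and s maps every pair to null.
• Line 6 (first assertion): startSnapshot() picks identifier 0, so as = {0}.
• Line 7: writeBarrier(B, n, null) finds s(B, 0) = null, creates the copy Bs, sets s(B, 0) = Bs, and only then sets B.n = null.
• Line 8 (second assertion): startSnapshot() picks identifier 1, so as = {0, 1}.
• Line 9: allocBarrier(D) sets s(D, 0) = s(D, 1) = ⊥.
• Line 10: writeBarrier(D, n, C) finds s(D, 0) = s(D, 1) = ⊥, so no copy is made; D.n = C.
• Line 11: writeBarrier(A, n, D) finds s(A, 0) = null, creates the copy As, sets s(A, 0) = s(A, 1) = As (the sharing optimization), and then sets A.n = D.

From this state, processObject resolves A to As and B to Bs for assertion 0, but only A to As for assertion 1, which yields the two outcomes discussed in Section 2.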
4. Implementation

In this section, we elaborate on the implementation of our approach. The system is implemented in Jikes RVM 3.1.1, the most recent stable release, and consists of three major components:

• Asynchronous assertion interface: The application issues a new assertion through an interface similar to futures (a minimal sketch of such an interface appears after this list). Once started, concurrent checks proceed asynchronously until they complete, or until the application chooses to wait for the result.

• Copying write barrier: When an assertion starts, STROBE activates a special write barrier in the application code that constructs a snapshot of the program state. It copies objects as necessary to preserve the state and synchronizes snapshot access with the checker threads.

• Checker thread pool: Checker threads pick up assertions as they are issued and execute them. Assertion checking code is written in regular Java, tagged with an annotation that the compiler recognizes. The code is compiled with a read barrier that returns values from object snapshots whenever they are present.
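The following minimal sketch illustrates the shape of the future-like interface and the checker pool; the names (AsyncCheck, CheckerPool, issue, force) and the queue-based structure are illustrative assumptions for this sketch, not STROBE's internals, which live inside the VM and are bounded by the number of available snapshot identifiers.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A future-like assertion handle: issued to the pool, later forced if desired.
abstract class AsyncCheck {
  private Boolean result;                             // filled in by a checker thread
  abstract boolean evaluate();                        // assertion code; reads only
  synchronized boolean force() throws InterruptedException {
    while (result == null) wait();                    // block until the check completes
    return result;
  }
  synchronized void complete(boolean r) { result = r; notifyAll(); }
}

// Checker threads pick up issued assertions and evaluate them.
class CheckerPool {
  private final BlockingQueue<AsyncCheck> pending = new LinkedBlockingQueue<>();

  CheckerPool(int numCheckerThreads) {
    for (int t = 0; t < numCheckerThreads; t++) {
      Thread checker = new Thread(() -> {
        try {
          while (true) {
            AsyncCheck c = pending.take();            // wait for the next issued assertion
            c.complete(c.evaluate());                 // evaluate it against its snapshot
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();         // allow shutdown
        }
      });
      checker.setDaemon(true);
      checker.start();
    }
  }

  void issue(AsyncCheck c) { pending.add(c); }        // application side: never blocks
}

In STROBE itself, issuing an assertion also activates the copying write barrier and associates the check with its own snapshot, as described below.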
A single checker thread can be quickly overwhelmed by even a modest number of asynchronous assertions, so STROBE allows multiple asynchronous assertions to be active at the same time, and each requires its own snapshot of the program state. To implement multiple concurrent snapshots, each object is augmented with an additional position in its header that stores a pointer to a forwarding array. This forwarding array contains a forwarding pointer for each checker thread, which holds the object snapshot for that checker.

During object copying, synchronization between the application and the concurrent checkers is accomplished by atomically writing a sentinel value into the object header. Snapshot copies live in the Java heap and are reclaimed automatically by the garbage collector when either the assertion check is complete or the copied object becomes garbage. We modify the GC object scanning code to trace the forwarding array and its pointers in addition to the object's fields.

In general, we found that the optimization (described in the previous section) of not snapshotting newly allocated objects brings significant benefits. The reason is that newly allocated objects are nearly always written to during their construction, and hence would normally need to be snapshotted. In the previous section, the snapshot fields of newly allocated objects were declaratively tagged with ⊥, indicating that they need not be copied. In our implementation, we realize this by introducing a global counter, called the epoch, that is incremented each time a new assertion starts. Each new object is then timestamped with the epoch in which it was created. Then, in the write barrier, we copy an object only if its timestamp is earlier than the current epoch. When an object is modified, the write barrier makes a copy for each active checker thread for which (a) the object is not newer than the start of the check, and (b) the object has not already been copied for that checker thread.

Further, as shown earlier, another optimization is to allow multiple snapshots to share a single copy of an object. This optimization applies to any checking thread that is active but has not already made a snapshot of that object.

We employ an additional optimization that helps the write barrier quickly identify objects that have already been copied. When an object is copied, we update its timestamp to the current epoch. Subsequent writes to the same object within the same epoch will immediately fail the test for copying. Without this optimization, every write to an object would have to check all slots in the forwarding array to decide whether a copy needs to be made.

void writeBarrier(Object src, Object target, Offset offset) {
  int epoch = Snapshot.epoch;
  if (Header.isCopyNeeded(src, epoch)) {
    // -- Needs to be copied, we are the copier
    //    timestamp(src) == BEING_COPIED
    snapshotObject(src);
    // -- Done; update timestamp to current epoch
    Header.setTimestamp(src, epoch);
  }
  // -- Do the write (omitted: GC write barrier)
  src.toAddress().plus(offset).store(target);
}

Figure 6. Copy-on-write write barrier.

boolean isCopyNeeded(Object obj, int epoch) {
  int timestamp;
  do {
    // -- Atomic read of current timestamp
    timestamp = obj.prepareWord(EPOCH_POS);
    // -- If in current epoch, nothing to do
    if (timestamp == epoch) return false;
    // -- If someone else is copying, wait
    if (timestamp == BEING_COPIED) continue;
    // -- ...until CAS BEING_COPIED succeeds
  } while (!obj.attempt(timestamp, BEING_COPIED, EPOCH_POS));
  return true;
}

Figure 7. Snapshot synchronization.

The write barrier is shown in Figure 6 (slightly simplified from the actual code). All operations on the forwarding array (making
or accessing copies) are synchronized using atomic operations on the object's timestamp. The write barrier first calls a method to determine if a copy is needed. The method isCopyNeeded() is shown in Figure 7. It consists of a loop that exits when either (a) the timestamp is current, so no copy is needed, or (b) the timestamp is older than the current epoch, so a copy is needed. In case (b) the code writes a special sentinel value BEING_COPIED into the timestamp, which effectively locks the object. All other reads and writes are blocked until the sentinel is cleared. This code is compiled inline in the application.

void snapshotObject(Object obj) {
  // -- Get forwarding array; create if needed
  Object[] forwardArr = Header.getForwardingArray(obj);
  if (forwardArr == null) {
    forwardArr = new Object[NUM_CHECK_THREADS];
    Header.setForwardingArray(obj, forwardArr);
  }
  // -- Copy object
  Object copy = MemoryManager.copyObject(obj);
  // -- Provide copy to each active checker
  //    that has not already copied it
  for (int t = 0; t < NUM_CHECK_THREADS; t++) {
    if (isActiveCheck(t) && forwardArr[t] == null)
      forwardArr[t] = copy;
  }
}

Figure 8. Copying code.

The snapshot code, shown in Figure 8 (slightly simplified from the actual code), is compiled out-of-line, since it is infrequent and relatively expensive. It first loads the forwarding array, creating one if necessary. It then makes a copy of the object using an internal fast copy (the same mechanism used in the copying garbage collectors). It installs a pointer to the copy in each slot of the forwarding array for which the corresponding checker thread is (a) active and (b) has not already copied the object.

Finally, the read barrier in the checker threads checks to see if a copy of the object exists and uses it in place of the current state of the object. To avoid a race condition with the write barrier, it locks the timestamp and checks for a forwarding array. If one exists, it looks in its slot in the array for a pointer to a copy. If there is a copy in the slot, it uses the snapshot copy; otherwise it performs the read on the original object.
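The checker-side read barrier is not shown in the figures; the following sketch, in the style of Figures 6-8, is our reconstruction of the behavior just described. The helpers lockTimestamp, unlockTimestamp, and readField are stand-ins for the BEING_COPIED protocol of Figure 7 and the raw field access of Figure 6, not actual STROBE methods.

Object readBarrier(Object src, Offset offset, int checkerId) {
  // -- Lock the timestamp so a concurrent copier cannot race with this read
  int timestamp = lockTimestamp(src);
  Object source = src;
  Object[] forwardArr = Header.getForwardingArray(src);
  if (forwardArr != null && forwardArr[checkerId] != null) {
    // -- This checker already has a snapshot copy; read from it
    source = forwardArr[checkerId];
  }
  // -- Perform the read while the object is still locked
  Object value = readField(source, offset);
  unlockTimestamp(src, timestamp);
  return value;
}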
5. Experimental Evaluation

In this section we present an experimental evaluation of STROBE. The focus of this evaluation is not on bug detection: our technique detects all the same errors that traditional synchronous assertions would catch. Our goal is to explore the performance space of this technique in order to provide a sense of how well it works under a range of conditions. Our main findings are:

• Asynchronous checking performs significantly better than synchronous checking in almost all circumstances.

• The overhead of making snapshots is quite low and grows slowly, primarily as a function of the number of concurrent in-flight checks, which increases the copying costs.

• Increasing the number of checker threads improves performance, as long as there is enough assertion checking work to keep the threads busy. In addition, with too few threads, some assertions must wait, throttling program throughput.

5.1 Experimental set up

We evaluate STROBE using two kinds of experiments: microbenchmarks instrumented with data structure invariant checks, and a real benchmark program (SPEC JBB2000) instrumented with synthetic assertions that allow us to systematically vary the frequency and cost of the checks. Both sets of benchmarks are assertion-intensive, making them a challenge to execute efficiently.

Microbenchmarks. Our microbenchmark experiments replicate the evaluation presented in the Ditto work [30]. We implemented two data structures: an ordered linked list and a tree-map using a red-black tree. For each one, we added a method that checks the data structure invariants:

• Ordered linked list: check that link.next.prev == link and that the elements are ordered (a sketch of such a check appears after this list).

• TreeMap red-black tree: (a) make sure values are in order, (b) check that all children of red nodes are black, and (c) make sure that all paths from the root to a leaf have the same number of black nodes. The code is shown in Figure 4.
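For concreteness, the linked-list invariant can be written along the following lines; the Link and OrderedList classes and their field names are illustrative, not the exact microbenchmark code.

class Link {
  Link next, prev;
  int value;
}

class OrderedList {
  Link head;

  // Invariant check (sketch): every link's next.prev must point back to it,
  // and the values must appear in non-decreasing order.
  boolean checkInvariants() {
    for (Link link = head; link != null; link = link.next) {
      if (link.next != null) {
        if (link.next.prev != link) return false;        // doubly-linked structure intact
        if (link.next.value < link.value) return false;  // elements are ordered
      }
    }
    return true;
  }
}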
SPEC JBB2000. SPEC JBB2000 [32] is a multithreaded benchmark that emulates a three-tier client-server system, with the database replaced by an in-memory tree and the clients replaced by driver threads. The system models a company, with warehouses serving different districts and processing customer orders. We added a synthetic assertion check on the main Company data structure right before each transaction is processed (in TransactionManager.go()). This synthetic assertion allows us to systematically vary the frequency of assertions and the amount of work performed by each check. It performs a bounded transitive closure on the object it is given, emulating a fixed amount of data structure traversal and access. In addition, we control the frequency of these assertions by selecting some fraction of them to perform.
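A bounded transitive closure of this kind can be sketched as follows; this reflection-based traversal is our illustration of the idea (visit at most a fixed number of objects reachable from a root), not the exact assertion inserted into SPEC JBB.

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class BoundedClosure {
  // Traverse up to 'budget' objects reachable from root, following reference
  // fields breadth-first, to emulate a fixed amount of data-structure work.
  static int traverse(Object root, int budget) {
    Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
    ArrayDeque<Object> worklist = new ArrayDeque<>();
    worklist.add(root);
    while (!worklist.isEmpty() && visited.size() < budget) {
      Object obj = worklist.poll();
      if (!visited.add(obj)) continue;                   // already seen
      for (Class<?> c = obj.getClass(); c != null; c = c.getSuperclass()) {
        for (Field f : c.getDeclaredFields()) {
          if (f.getType().isPrimitive()) continue;       // follow reference fields only
          if (Modifier.isStatic(f.getModifiers())) continue;
          f.setAccessible(true);
          try {
            Object child = f.get(obj);
            if (child != null) worklist.add(child);
          } catch (IllegalAccessException e) {
            // skip inaccessible fields in this sketch
          }
        }
      }
    }
    return visited.size();                               // amount of work performed
  }
}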
Methodology We compare three kinds of runs: asynchronous assertions (with varying numbers of checking threads), synchronous assertions (stop and evaluate each assertion), and baseline (no assertion checking). Our primary measurement is overall application runtime. For asynchronous checks we also measure the number of objects copied during the run and the total number of bytes copied. For the microbenchmarks we run each test 20 times and compute the average and confidence interval. For the SPEC JBB experiments we run four iterations of the benchmark in the same invocation and time the last iteration. The differences we observe in runtimes between synchronous and asynchronous checking are so great that small perturbations in the execution do not affect our overall results. The current implementation uses a non-generational mark-sweep collector (a current bug in the implementation, which we intend to fix soon, requires us to use a non-moving collector). A generational collector, however, is likely to improve the performance of asynchronous assertions by more effectively controlling the GC cost of the snapshots. All experiments were run on a 12-core machine (dual six-core Xeon X5660 running at 2.8GHz) with 12GB of main memory running Ubuntu Linux kernel 2.6.35.

5.2 Results

Microbenchmark Results Each run of the microbenchmark program builds one of the two data structures at a specific size and performs 1000 operations on it. Each operation is either an add, a remove, or an access. We perform the invariant checks before and after each add or remove. Operations are chosen at random so that 90% are accesses, 9% are adds, and 1% are removes. For each data structure, we ran experiments with sizes of 1,000, 10,000, 100,000 and 1,000,000 elements. We varied the number of checker threads from 1 to 12. The results are presented in Figures 9 and 10. Time is normalized to the cost of the synchronous checks.

We find that for the list, the cost of the asynchronous mechanism (coordinating with the checker threads and building the snapshots) overwhelms the relative simplicity of the computation. While performance improves with more threads, we never reach the break-even point. For the TreeMap, however, increasing the number of threads improves performance substantially, particularly as the size of the data structure increases. For the largest input size, we hit the break-even point at around three threads; using 12 threads reduces the overhead of invariant checking to within a small fraction of the baseline time.

Figure 9. Ordered linked list: overhead versus number of checker threads for various data structure sizes. The computation is too brief to recoup the cost of synchronization and snapshotting.

Figure 10. TreeMap (red-black tree): overhead versus number of checker threads for various data structure sizes. Asynchronous checks beat synchronous checks for large data structures.

SPEC JBB Results The synthetic assertion added to SPEC JBB allows us to explore the performance space of our technique along three axes: frequency of checks, cost of each check, and number of checker threads. In our experiments we vary the frequency of the checks from 3% (700 checks per run) to 100% (22,000 checks per run). The work performed in each assertion is a bounded transitive closure (like a GC trace) starting at the Company object. We vary the per-assertion work from 500 objects to 32,000 objects. We run the program with synchronous checks and with asynchronous checks using 2, 6, and 10 checker threads. All times are normalized to the runtime of SPEC JBB running without any assertions.

Figures 11 and 12 show the relative slowdown for a fixed frequency (number of checks) across a range of per-check costs, comparing synchronous checking against asynchronous checking with 2, 6, and 10 threads. For 1397 checks, the synchronous check rapidly increases to 2X and 4.5X overhead as the cost increases. The asynchronous checks never exceed a 2X slowdown, although much of the benefit is obtained with only two checker threads. The reason is that with relatively infrequent checks, most checking threads sit idle. For 11,214 checks, increasing the number of checker threads significantly improves performance. Measurements indicate that the threads are completely saturated when the per-check work goes above 2,000 objects. With 10 threads, however, the cost of maintaining the snapshots begins to hurt scalability.

Figure 11. SPECJBB: Overhead of a fixed number (1397) of checks for a range of costs, normalized to baseline (no checks). Asynchronous checking outperforms synchronous checking, but there is not enough work for 6 or 10 checker threads.

Figure 12. SPECJBB: Overhead of a fixed number (11,214) of checks for a range of costs, normalized to baseline (no checks). With more work, asynchronous checking reduces assertion cost significantly with 2 and 6 threads. With 10 threads, copying overhead constrains performance.

Figures 13 and 14 show the relative slowdown for a fixed per-check cost across a range of frequencies. For checks of cost 4000 (object traversals), the cost of synchronous checking increases linearly with the number of checks: the cost of performing 22,000 checks is almost exactly 4X the cost of performing 5,500 checks. The asynchronous checks hide almost all of this cost, particularly for 6 and 10 threads, which top out at a 2X slowdown for 27,000 checks. For checks of cost 16,000, however, the total cost increases significantly and is more difficult to hide. With 10 threads, asynchronous checks stay under 4X: slow, but significantly outperforming synchronous checks, which top out at 50X slower.

Figure 13. SPECJBB: Overhead of fixed-cost (4000 objects) assertions, normalized to baseline (no checks). Using 2 or 6 asynchronous checker threads yields an almost perfect speedup over synchronous checking, and keeps overall slowdown under 2X.

Figure 14. SPECJBB: Overhead of fixed-cost (16,000 objects) assertions, normalized to baseline (no checks). More expensive checks run longer, resulting in more snapshot copying. With 10 checker threads, overhead is still around 4X, while synchronous checking incurs a 30X slowdown.

Figure 15 shows the scalability of asynchronous checking. The X axis is the number of checking threads, where zero indicates synchronous checking. The Y axis is overhead compared to the baseline. Each line corresponds to a constant amount of total checking work per run: the cost of each assertion times the number of assertions checked. Note that different configurations (work times frequency) can involve the same amount of work, so we plot the average. Asynchronous checking scales almost perfectly up to six threads, and continues to provide some benefit at ten threads for the more expensive workloads.

Figure 15. JBB scalability: checking overhead versus number of threads for a fixed total amount of checking work (check cost times number of checks).

Copying costs. During each asynchronous run we count the total number of objects copied (and the volume in bytes) for all snapshots. This cost ranges from 25,000 objects (2MB) for fast, infrequent checks up to 800,000 objects (65MB) for long-running and frequent checks. The copying costs correlate most closely with the total checking work (described above), slightly biased towards more frequent checks.

6. Related Work

In this section, we briefly discuss some of the related work on static and dynamic checking, usage of snapshots in concurrency, and other related runtime techniques such as futures, concurrent garbage collection and software transactions.

Static analysis Research in static analysis has yielded a significant body of sophisticated techniques for modeling the heap and analyzing data structures at compile time. Previous approaches include pointer analysis [9, 20, 7], shape analysis [14, 29, 15], and type systems [13, 11]. The strength of static analysis is that it explores all possible paths through the program: a sound analysis algorithm can prove the absence of errors, or even verify the full correctness of a data structure implementation. The weakness of static analysis is that many properties, particularly those involving pointers and heap-based structures, are undecidable in principle [26], and extremely difficult to approximate precisely in practice [29, 21, 12, 40]. These problems are particularly severe for web applications, which make heavy use of dynamic language features such as reflection, bytecode rewriting, and dynamic class loading.
Pointer analysis for the full Java language requires re-running parts of the algorithm during execution [17].

Dynamic analysis Dynamic analysis avoids the main problem with static analysis: by performing error checks on the actual concrete heap of a running program, it does not have to make conservative assumptions about potential program state. The goal of this project is to address the primary challenge in dynamic invariant checking: performance. The Java Modeling Language's jmlc tool, for example, provides a very rich and expressive system for specifying and checking run-time invariants, but can slow programs down by one or two orders of magnitude [10, 22].

There has been other work on dynamically evaluating runtime assertions. However, our work is orthogonal and complementary to these approaches. Further, our algorithms can be combined with any of these ideas to yield a more efficient system. The Ditto system speeds up invariant checking by automatically incrementalizing the checking code [30]. The incremental checks are still very expensive, and, as the assertion checking and program execution are not decoupled, the cost of assertion checking will inherently dictate program execution times. The QVM system provides a number of heap probes to check data structures, and manages the cost by reducing the frequency of checking [3]. The Phalanx system [35] attempts to reduce the cost of assertion checking by parallelizing the queries. In both QVM and Phalanx, the queries are evaluated synchronously with respect to the application. In the work by Aftandilian et al. [1], assertions are evaluated synchronously and at garbage collection time. That is, assertions cannot be evaluated in parallel with the program, and assertions cannot be issued at specific program points. Subsequent work by Reichenbach et al. [27] investigates the class of assertions that can be computed with the same complexity as garbage collection. The assertions in that work are very expensive to perform (and they are performed sequentially). Further, the single-touch property limits the kinds of checks that the system performs.

Uses of Snapshots in Concurrency Copy-on-write techniques for obtaining snapshots have been used prolifically in various settings. For instance, the fork system call in UNIX uses copy-on-write to make a copy of the virtual memory address space of a process. Here, we survey only some of the recent work where snapshots have been used in the context of language runtime systems. The line of work most closely related to ours comprises the techniques used in building mark-and-sweep snapshot-based concurrent garbage collectors, e.g., [5, 6, 39, 36, 18]. In essence, once the garbage collector starts working, it computes all nodes in the heap that were reachable at the time the collector started. There are a few basic technical differences between snapshot collectors and this work. First, there is no real need for multiple garbage collectors to be running at the same time computing the same information. Usually, a new collector cycle is started after the previous one has ended. In contrast, it is sensible to have multiple assertions in flight at the same time, and hence our system must support this scenario. Second, collectors compute a specific property: transitive closure from a set of roots. In our case, each snapshot can be used to compute completely different assertions.
Third, with concurrent collectors, the program only needs to intercept heap reference modifications, while in our case, depending on whether the assertion accesses primitive values in the heap, it may be desirable to snapshot their modifications as well.

The idea of snapshots has also been used to implement specific concurrent programming models. In the work of Aviram et al. [4], the authors present a system (in the context of C/C++) which can enforce deterministic behavior. When a thread is forked, a (logical) snapshot, called a space, is created for that thread. Any updates that the thread performs are confined to its own local snapshot copy; that is, the thread only operates on its own local copy. Then, at specific join points, the changes to the local copies (snapshots) are merged. If a write-write conflict occurred, then the system flags an error; otherwise, the merge proceeds. A similar concept is used in the work of Burckhardt et al. [8], which also allows for a more flexible treatment of conflicts: for instance, one may specify that conflicts are resolved so that one process's updates win over another's. That system also requires programmers to explicitly declare the variables that need to be snapshotted. We can think of asynchronous assertions as processes that are forked off by the main program, but in our case assertions never modify the shared state, and hence there can be no conflicts at merge time.

Concurrent program checking The SuperPin system [37] provides some of the same capabilities as ours but through a completely different mechanism. SuperPin uses the fork system call to create a consistent snapshot, which it uses to run an instrumented copy of the main program. As with other previous work, however, this system is focused on relatively low-level program analysis added as binary instrumentation, and performance is still too slow for production use (100-500% slowdown). Speck [23], Fast Track [19], and ParExC [34] are all speculative execution systems. Speck and ParExC focus on low-level runtime checks such as array out-of-bounds and pointer dereference checks. Fast Track enables speculative unsafe optimizations, whose correctness it checks by running the unoptimized program on multiple processors. None of these systems is designed to check the high-level program properties we are interested in.

Safe Futures for Java [38] provides a safe implementation of futures in Java. It handles concurrency issues between the main program and the futures via many of the same mechanisms as our work: read and write barriers and object versioning. However, their system does not work with multithreaded programs, and because their futures can write to the heap, they must use a read barrier for all memory accesses. Our assertions cannot write to the heap, so we can avoid a read barrier in the application and achieve better performance.

Recently, there has been a significant amount of work on transactional memory [16]. In principle, it is possible to treat each assertion evaluation as a separate transaction and let the STM detect conflicts. However, the problem with this approach would be that either the assertion or the program would be rolled back every time a conflict occurs, which is highly likely for long-running assertions that may touch large portions of the heap. Frequent rollbacks would render the system unusable (and restarting the application has its own issues when it comes to side effects such as exceptions and I/O). Further, with our semantics, rollbacks are unnecessary, as the assertion only needs to compute a result with respect to the state in which it was triggered.
7. Conclusions

Assertions are a powerful and convenient tool for detecting bugs, but they have always been limited by the fact that the checking cost is paid for in application runtime. In this paper we show how assertions can be checked asynchronously, greatly reducing checking overhead. Our technique enables a much wider range of assertions, including complex heap checks and data structure invariants. We believe that this approach will become even more appealing as modern processors continue to add CPU cores, providing ample additional computing power that can be devoted to making software run more reliably.

References

[1] Aftandilian, E., and Guyer, S. Z. GC assertions: Using the garbage collector to check heap properties. In ACM Conference on Programming Languages Design and Implementation (2009).
[2] Aral, Z., and Gertner, I. Non-intrusive and interactive profiling in Parasight. In Principles and Practice of Parallel Programming (1988), pp. 21–30.
[3] Arnold, M., Vechev, M., and Yahav, E. QVM: An efficient runtime for detecting defects in deployed systems. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (2008), pp. 143–162.
[4] Aviram, A., Weng, S.-C., Hu, S., and Ford, B. Efficient system-enforced deterministic parallelism. In 9th Symposium on Operating Systems Design and Implementation (2010), USENIX Association.
[5] Azatchi, H., Levanoni, Y., Paz, H., and Petrank, E. An on-the-fly mark and sweep garbage collector based on sliding views. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (2003), pp. 269–281.
[6] Bacon, D. F., Cheng, P., and Rajan, V. T. A real-time garbage collector with low overhead and consistent utilization. In ACM Symposium on the Principles of Programming Languages (2003), pp. 285–298.
[7] Berndl, M., Lhoták, O., Qian, F., Hendren, L., and Umanee, N. Points-to analysis using BDDs. In ACM Conference on Programming Languages Design and Implementation (New York, NY, USA, 2003), ACM, pp. 103–114.
[8] Burckhardt, S., Baldassin, A., and Leijen, D. Concurrent programming with revisions and isolation types. In Proceedings of the ACM International Conference on Object-Oriented Programming Systems Languages and Applications (New York, NY, USA, 2010), OOPSLA '10, ACM, pp. 691–707.
[9] Chase, D. R., Wegman, M., and Zadeck, F. K. Analysis of pointers and structures. In ACM Conference on Programming Languages Design and Implementation (1990), pp. 296–310.
[10] Cheon, Y., and Leavens, G. T. A runtime assertion checker for the Java Modeling Language (JML). In SERP '02: International Conference on Software Engineering Research and Practice (2002), pp. 322–328.
[11] Clarke, D. G., Potter, J. M., and Noble, J. Ownership types for flexible alias protection. SIGPLAN Notices 33, 10 (1998), 48–64.
[12] Darga, P. T., and Boyapati, C. Efficient software model checking of data structure properties. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (2006), pp. 363–382.
[13] Fradet, P., and Métayer, D. L. Shape types. In ACM Symposium on the Principles of Programming Languages (1997), pp. 27–39.
[14] Ghiya, R., and Hendren, L. J. Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In ACM Symposium on the Principles of Programming Languages (1996), pp. 1–15.
[15] Hackett, B., and Rugina, R. Region-based shape analysis with tracked locations. In ACM Symposium on the Principles of Programming Languages (2005), pp. 310–323.
[16] Harris, T., Larus, J., and Rajwar, R. Transactional Memory, 2nd edition. Morgan and Claypool Publishers, 2010.
[17] Hirzel, M., Dincklage, D. v., Diwan, A., and Hind, M. Fast online pointer analysis. ACM Transactions on Programming Languages and Systems 29, 2 (2007).
[18] Jones, R., and Lins, R. Garbage Collection. John Wiley and Sons, 1996.
[19] Kelsey, K., Bai, T., Ding, C., and Zhang, C. Fast Track: A software system for speculative program optimization. In International Symposium on Code Generation and Optimization (Washington, DC, USA, 2009), IEEE Computer Society, pp. 157–168.
[20] Landi, W., Ryder, B. G., and Zhang, S. Interprocedural modification side effect analysis with pointer aliasing. In ACM Conference on Programming Languages Design and Implementation (1993), pp. 56–67.
[21] McPeak, S., and Necula, G. Data structure specifications via local equality axioms. In Computer Aided Verification (2005), pp. 476–490.
[22] Nethercote, N., and Seward, J. How to shadow every byte of memory used by a program. In International Conference on Virtual Execution Environments (2007), pp. 65–74.
[23] Nightingale, E. B., Peek, D., Chen, P. M., and Flinn, J. Parallelizing security checks on commodity hardware. In ACM Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2008), ACM, pp. 308–318.
[24] Oplinger, J., and Lam, M. S. Enhancing software reliability with speculative threads. In ACM Conference on Architectural Support for Programming Languages and Operating Systems (2002), pp. 184–196.
[25] Patil, H., and Fischer, C. Low-cost, concurrent checking of pointer and array accesses in C programs. Software Practice and Experience 27, 1 (1997), 87–110.
[26] Ramalingam, G. The undecidability of aliasing. ACM Transactions on Programming Languages and Systems 16, 5 (1994), 1467–1471.
[27] Reichenbach, C., Immerman, N., Smaragdakis, Y., Aftandilian, E. E., and Guyer, S. Z. What can the GC compute efficiently? A language for heap assertions at GC time. In Proceedings of the ACM International Conference on Object-Oriented Programming Systems Languages and Applications (New York, NY, USA, 2010), OOPSLA '10, ACM, pp. 256–269.
[28] Rosenblum, D. S., Sankar, S., and Luckham, D. C. Concurrent runtime checking of annotated Ada programs. In Proc. of the Sixth Conference on Foundations of Software Technology and Theoretical Computer Science (1986), pp. 10–35.
[29] Sagiv, M., Reps, T., and Wilhelm, R. Parametric shape analysis via 3-valued logic. In ACM Symposium on the Principles of Programming Languages (1999), pp. 105–118.
[30] Shankar, A., and Bodík, R. Ditto: Automatic incrementalization of data structure invariant checks (in Java). In ACM Conference on Programming Languages Design and Implementation (2007), pp. 310–319.
[31] Shetty, R., Kharbutli, M., Solihin, Y., and Prvulovic, M. HeapMon: A helper-thread approach to programmable, automatic, and low-overhead memory bug detection. IBM Journal of Research and Development 50, 2/3 (2006), 261–275.
[32] Standard Performance Evaluation Corporation. SPECjbb2000 Documentation, release 1.01 ed., 2001.
[33] Susskraut, M., Weigert, S., Nowack, M., de Brun, D. B., and Fetzer, C. Parallelize the runtime checks not the application. In EC2 - Exploiting Concurrency Efficiently and Correctly (2009).
[34] Süsskraut, M., Weigert, S., Schiffel, U., Knauth, T., Nowack, M., Brum, D. B., and Fetzer, C. Speculation for parallelizing runtime checks. In SSS '09: Proceedings of the 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems (Berlin, Heidelberg, 2009), Springer-Verlag, pp. 698–710.
[35] Vechev, M., Yahav, E., and Yorsh, G. Phalanx: Parallel checking of expressive heap assertions. SIGPLAN Not. 45 (June 2010), 41–50.
[36] Vechev, M. T., Yahav, E., and Bacon, D. F. Correctness-preserving derivation of concurrent garbage collection algorithms. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2006), PLDI '06, ACM, pp. 341–353.
[37] Wallace, S., and Hazelwood, K. SuperPin: Parallelizing dynamic instrumentation for real-time performance. In International Symposium on Code Generation and Optimization (2007), pp. 209–220.
[38] Welc, A., Jagannathan, S., and Hosking, A. Safe futures for Java. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (New York, NY, USA, 2005), ACM, pp. 439–453.
[39] Yuasa, T. Real-time garbage collection on general-purpose machines. Journal of Systems and Software 11, 3 (1990), 181–198.
[40] Zee, K., Kuncak, V., and Rinard, M. Full functional verification of linked data structures. In ACM Conference on Programming Languages Design and Implementation (2008), pp. 349–361.