>> Tom Ball: Okay. I'm Tom Ball and it's my pleasure to welcome Charles Zhang here. He's a professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology. And his major area of work is in the area of software engineering. He's been focusing lately on debugging of concurrent systems and holds a Ph.D., M.Sc. and B.Sc. with honors, all from the University of Toronto. So welcome, Charles. >> Charles Zhang: Thank you. Thank you very much. All right, so I'd like to thank Tom first for giving me this opportunity to talk about my work. So I'm very, very happy to be here, to be the home [inaudible] to talk about debugging concurrent programs, very happy that [inaudible] came to give us some opinions. So before I start to talk about the research itself, let me first introduce where I come from. There are some old friends who know where I come from, but there are many new faces. So I come from a small university in Hong Kong. It's located right here on the shore of the ocean. It's quite small; it's got about 7,000 students in total, but it's quite specialized. So this is where I come from on the map. And my research group is called the Prism Research Group. It was established in 2009. I joined the university in 2008, so we're a pretty young research group. That's why we're happy to be here to learn from the experts. I have four Ph.D. students in the group. As for the research focus of our group: we like program analysis and we want to use it to help programmers debug and develop large systems. Our current focus is concurrent systems, as Tom has already said. So we have a few major research projects. The topic of the talk today is one of them: deterministic replay of concurrent software, so we'll talk about that of course in this talk. Other than that, we also work on automatic fixing of concurrency bugs; we have two papers there. We also work on causal analysis, or what people call predictive analysis over traces, and we have a few interesting techniques there. We also work on something much lower level, on program analysis itself: we have a technique that computes pointers more accurately, and we're happy that it's been adopted by Soot. And we also work on a database approach for program analysis; that's our ongoing work. Okay, so this is an overview of the type of research we do in the group. Of course today I want to talk about a specific area that we have made a few interesting contributions to. It's called Deterministic User-level Replay of Concurrent Programs. It's joint work with Jeff, who is also here, together with pretty much everybody in the group. So this is something that everybody in the group has invested some amount of time in. Okay? I want to give you an outline of the talk. For experts I guess I don't have to explain what replay is. But for people who're not very familiar with replay, I want to define it in a very simple manner so that we know what we're talking about. And then I want to use a very simple example to illustrate some of the main challenges of performing user-level deterministic replay. Then I'll talk about some prior art to allow you to better understand the contribution of our techniques. And I'll talk about our techniques, specifically the three techniques LEAP, STRIDE and CLAP. So LEAP was presented at FSE 2010; STRIDE at ICSE this year; and CLAP was presented at the PLDI Student Research Competition. 
Now, FSE and ICSE are software-engineering-centric conferences, so I'd like to especially welcome the opinions of this audience on the work that we did. Then I'll make some conclusions. So to begin I'd like to first define what we mean by replay. First of all, when I talk about replay in this talk I'm talking about programs that use threads on a multiprocessor platform and communicate through shared memory. So these are the very common kind of threaded programs that we deal with. I'm not talking about functional or parallel programs like Haskell. So these are shared-memory multi-threaded programs; this is the kind of program we want to replay. So I'd like to offer a definition. I'd like to point out that this is by no means an authoritative definition of what replay is. This is our working definition so that we can have an easier way to position our contribution. So I define it to be a technique that repeats a previously exercised execution according to some kind of diagnostic criterion. That means that sometimes we replay the program to be able to time travel, so that we can go back to any point in a program's execution and start to execute from that point. So there we want to replay the full execution. In some other areas we want to just browse the computed values of a previous execution, but we do not care how the values are computed. Okay, so in some other scenarios we replay for getting the computed values. And a third category, which I argue most of our replay techniques fall into, is bug reproduction: reproducing a particular failure. Okay? So these are just to show you some different purposes or different criteria that each replay technique would try to achieve. As for the basic mechanism of replay: the common denominator, regardless of the goal of the replay technique itself, is that you always monitor the application to produce some log, and you use the log to try to reconstruct an earlier execution. So this is probably the basic mechanism of any replay technique. In terms of research, replay has been active for many years. So the earliest paper we can find -- we're not sure this is the earliest paper -- the earliest paper we can find is a TOCS paper. Oops. I guess Microsoft is checking on me. All right. So the earliest paper we can find is the TOCS paper in 1986, and the work goes all the way to this year's ICSE with our replay work, just examples showing that this research has been going on for many, many years. And the publication of replay work spans many different areas, all the way from software engineering conferences to hardware conferences like MICRO or VEE. The good news is that if we conduct research in this area we have many different venues where we can publish papers. Of course the bad news is that we have to read the papers from all these conferences first to understand the state of the art. So it has been a long-running research area active in many different communities. From these conferences we can probably already sense that a replay technique can be implemented in hardware, such as by modifying hardware cache coherence messages; or in software at the user level through program instrumentation; or at the system level, where we modify the virtual machine to enable logging of the low-level system events. So it can be implemented in hardware or it can be implemented in software. 
A replay technique is what we call deterministic if it guarantees to reproduce the earlier run. So we call it deterministic replay. We call it probabilistic replay if it only tries to reproduce the earlier run on a best-effort basis, okay, with high probability. So these are just some common terminologies that we use in this area. The focus of our research group is what I call deterministic user-level replay for bug reproduction. There are some terms here, so let me explain a little better. We like deterministic replay because we think it makes the diagnostic process more effective. In addition, many client analyses require deterministic replay, such as cyclic debugging, fault localization of concurrent programs, or bug classification, where we have to run the concurrent program again and again. Okay? So in those cases determinism is a very, very good thing to have. We also like user-level replay because we believe it's very easy to use; it's very accessible. With user-level replay, the technique takes a regular user program, instruments or transforms it, and just runs it. Right? So the technique is going to monitor the application, produce a log and replay it. This does not affect any other applications and does not require any OS-level, system-level or hardware modification. So it's very easy to deploy; we like this characteristic as well. And we like bug reproduction. We think that's probably the most common use of replay. The reason why we replay is to try to troubleshoot, try to find bugs. And in particular we focus on reproducing the order of races. So I'm going to illustrate that. Races are basically one of the major causes of concurrency bugs. So we focus on reproducing the order of races so that the programmer can have an easier time diagnosing the bug. And given that, we can relax on reproducing the original schedule or even the computed values. So we focus on the race order and relax on reproducing the schedule and the computed values; this is the focus of our technique. >>: [inaudible]. So you can tell me if this is more appropriately asked somewhere else, but it seems like those goals are a little bit against each other. Right? Like if you do it in software there are a lot of benefits to that, but you're kind of getting into the world of the program. Like if you truly want to be deterministic, you know, it would seem to me that you need to remain outside of the world that the program lives in. So it just seems like software is really good but it kind of encourages probabilistic; deterministic is really good but it kind of, you know, encourages other decisions that are made. >> Charles Zhang: So yeah, well, I'll illustrate later that the software approaches have some problems. But it's good for reproducing many different concurrency bugs. It has its shortcomings but it depends on how you define... >>: [inaudible] probabilistic, right, because it's practical. Right? I mean deterministic? You always want deterministic but you'll settle for probabilistic if it's practical. Right? >> Charles Zhang: You have a point. So I'll illustrate the challenges later on. Probably I can answer your question after that. >>: Okay, thank you. >> Charles Zhang: Okay. 
All right, so before I actually talk about the details of our techniques, I just want us all to go through a replay exercise together so that we can get familiar with some of the challenges and also some of the terminology that I'm going to use later on. Our replay exercise is actually very simple. It's probably the simplest concurrent program you'll ever see. We have two threads that both read a global variable, G, increment it locally and assign it back to G. So these are the two threads. And if we execute the threads following this order, meaning that we execute the entire computation of T2 after T1, we will compute 2 for the global variable G at the end. Okay? So G is 2. But if for some reason Statement 4 is executed before Statement 3 due to a different scheduling, we'll have G equal to 1 at the end. Okay? So this is a very simple example; a runnable sketch of it appears below. And I want to illustrate that for a concurrent program the order of the race -- whether Statement 4 is executed before Statement 3 or after it -- is the key to deterministic replay. We have to restore a particular race order to be able to produce the same program output. So this is the essence of deterministic replay: we need to capture and restore the order of races, as illustrated by this example; otherwise, the computation result will be different. All right, so let me illustrate the first solution, which is a very intuitive one: we can just record the order of the race directly. Right? If the order is important, let's just record the order. So here suppose that the order is that Statement 4 happens before Statement 3. And since we're doing user-level recording, we're going to insert some instrumentation code and try to record this race order. By inserting this instrumentation we hope to produce a log of 4, 3. Right? The order is that 4 executed first, followed by 3. But again, since we're doing user-level instrumentation, our instrumentation code also suffers from the [inaudible] of the scheduling, so our instrumentation code can be executed later than the other instrumentation code on T1, on Thread 1. This would produce a log of 3, 4. And if we use that log to replay this whole program, we know, as shown by my earlier illustration, that we're going to produce a different result. So this is a very simple illustration of the problem with user-level instrumentation. The common practice is that we have to somehow make the instrumentation and the actual statement execute atomically, or we have to use some kind of synchronization. And for people who know about concurrent programming, we know that adding additional synchronization is really, really bad. Here I want to illustrate another solution to the problem: we do not use any synchronization, but instead I'm going to record the values that we read or write for each of the shared variables. Okay? So in this case we can record that T1 reads G and the value is zero, and writes the value 1 to G. From these values we can reason that the first statement of T1 cannot happen after the last statement of T2, because the value is zero; otherwise, the value would be 1. Right? So this is a very simple example to illustrate that it's possible for us to find a schedule, given thread-local load/store values, just by searching -- making sure we find a schedule that is correct, for example, sequentially consistent. Gibbons et al. proved many years ago that this problem is NP-complete (NPC). Okay? 
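To make the exercise concrete, here is a minimal Java sketch of the two-thread counter; the class and variable names are mine for illustration, not code from any of the tools discussed:

```java
// A minimal, hypothetical Java version of the two-thread counter example.
public class RacyCounter {
    static int g = 0;  // the shared global variable G

    public static void main(String[] args) throws InterruptedException {
        // Each thread reads G, increments a local copy, and writes it back.
        // The read-increment-write is NOT atomic, so the threads race.
        Runnable increment = () -> {
            int local = g;      // e.g., Statement 1/3: read G
            local = local + 1;  // increment locally
            g = local;          // e.g., Statement 2/4: write G
        };
        Thread t1 = new Thread(increment);
        Thread t2 = new Thread(increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // If T2's read happens after T1's write, G ends up 2;
        // if both reads happen before either write, G ends up 1.
        System.out.println("G = " + g);
    }
}
```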
So we can do it, but it's an NPC problem. Let me summarize what the example is trying to say. The example shows you an order-based solution where we record the thread-access orders. We explicitly track all the read-write dependencies, and we can guarantee to replay the execution. But we need to use locks, a lot of them. So this is really, really bad. Or we can try the search-based solution. We can infer the read-write dependencies through the load/store values, requiring no locks, but it's an NPC problem, which means it doesn't scale. It means we cannot always produce a schedule in reasonable time, so we lose the determinism. >>: So help me understand why it's NPC. >> Charles Zhang: Well, it's not... >>: The reconstruction is NP-complete? >> Charles Zhang: So as I stated here, given the thread-local load/store value trace, like the ones that I show here, deciding if there is a schedule that's sequentially consistent is an NP-complete problem. So this is shown in... >>: Okay, so... >> Charles Zhang: ...the [inaudible] paper. >>: ...[inaudible] that 1, 2, 4, 3, 5, 6. That's [inaudible]. >> Charles Zhang: Yeah, so if you want to assign a total order to represent a schedule for these statements -- which one executes first -- it's an NPC problem if you only have the value trace. Okay? So I illustrated an order-based solution and a search-based solution. The prior art for user-level deterministic replay is primarily order-based. Okay? And the representative techniques are called DejaVu and RecPlay, so let me briefly give you some information about these related works so that you can understand our contribution a little better. To talk about the related work and also our contribution, I'm going to use a slightly more sophisticated example than the one here; it involves two branches. And what's interesting about this example is that only following this schedule can lead to evaluating both branches to true, okay, so that's the only schedule that we can follow. This just shows that we need to precisely reproduce the schedule to compute the same result. So this is the example we're going to use. The first representative work for user-level replay is called DejaVu. It was developed by J.D. Choi, who is also a well-known name in concurrency work. His proposal is very similar to the first order-based example that I gave earlier: if we can identify all the locations in the program where a shared variable or shared object is involved in a computation, we can record the exact order of the threads that exercise these shared computations. We record all of them in a global log; so we record a global sequence of the thread schedule that computes the shared-variable states. This proposal obviously works, right, but it's pretty inefficient because it involves synchronizing all the threads on a global log; a rough sketch of this recording scheme follows below. In this simple example, producing this log involves six global syncs. By logging into a global log we essentially tie all the threads together, which incurs a lot of runtime overhead. To improve upon that, RecPlay, another representative approach, presented in TOCS 2003, uses Lamport clocks. From distributed computing [inaudible], we know that the Lamport clock is a pretty efficient way of clocking distributed events. So we model each thread as a distributed process and we model objects as the sources of distributed messages. Right? 
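As a rough illustration of the DejaVu-style global-log recording just described, here is a minimal sketch, assuming a single lock-guarded global log; the names (GlobalOrderRecorder, access) are hypothetical, not DejaVu's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// A rough sketch of DejaVu-style recording: every shared access is ordered
// through ONE global log, so all threads serialize on the same lock.
public class GlobalOrderRecorder {
    private static final Object GLOBAL_LOCK = new Object();
    private static final List<String> globalLog = new ArrayList<>();

    // Wraps a shared access so the log entry and the access itself happen
    // atomically. Every thread serializes on GLOBAL_LOCK -- the "global
    // syncs" of the example -- which is where the runtime overhead comes from.
    public static int access(String site, IntSupplier sharedAccess) {
        synchronized (GLOBAL_LOCK) {
            globalLog.add(Thread.currentThread().getName() + " @ " + site);
            return sharedAccess.getAsInt();
        }
    }
    // usage at an instrumented site: int local = access("T1: read G", () -> g);
}
```

Every instrumented access funnels through the one lock, which is exactly why this scheme is expensive; the Lamport-clock approach discussed next avoids the single global bottleneck.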
So if you know about Lamport clocks, this is a pretty straightforward mapping from that problem to this problem. What they do is associate each thread and each synchronized object with a distinct clock. And this is the standard Lamport clock algorithm for updating these counters, these clocks. For our example, the details of how we arrive at these counters, how we compute them, are not important. But basically the technique is to associate each shared variable and each thread with a different clock, a different counter. And at the end each thread maintains a local history of all the counter values for all its accesses to the shared variables. So in this way the thread only --. We call it a local sync. The thread only needs to synchronize the logging when it is involved with the particular shared variable. There's no need to synchronize across all the threads. That's why we call it local synchronization. So RecPlay in this example involves six local synchronizations plus six compare-and-updates for updating the Lamport clock counters. Our contribution compared to the prior art is to lower the logging overhead further while staying deterministic. And we also want to simplify the replay process. So let me talk about our techniques now in detail. I'm still going to use the same example; only by following this path, this schedule, can we produce this error. I want to first talk about LEAP, which is our very simple idea that improves upon the RecPlay work. The idea is very simple: instead of associating each thread with a clock, we associate each shared variable with a sort of access vector. As the program executes, we just record the sequence of threads that each shared variable sees during the execution. Okay? So we maintain an access vector for each of the shared variables. Here, similar to RecPlay, we require local syncs. Each shared variable only needs to synchronize the threads involved in the computation or racing on that variable. We do not need to synchronize all the threads. So it's still the same: six local syncs in this example. The way to replay is to think of these logs as stacks. We just read the logs one by one, popping the items one by one, and use each value to direct a user-level scheduler. The user-level scheduler checks whether it is a thread's turn to read or write that variable by checking the top of the stack. So here we just check if it's t1's turn to execute. Next, we check the top of the stack for y. It shows t2, so we switch to t2 and continue to execute. And then we continue to check the log. In this very simple fashion, when we have consumed all the logs, we can guarantee that our user-level scheduler will follow the same execution path as the recorded run and finally produce this error. Okay? It's -- Yeah? >>: I have two questions. First, how do you identify the shared variables [inaudible]? >> Charles Zhang: Yeah. We use a static analysis prior to the run. So there is a phase of static analysis for identifying the shared variables. >>: Based on [inaudible], right? >> Charles Zhang: Yeah, we do it conservatively. Yeah. >>: And the second thing is, the main reason why your recording phase is faster than the global one is that you're basically allowing parallel-like recording. Is that right? >> Charles Zhang: That's right. That's right. That's right. 
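Here is a minimal sketch of the LEAP idea just described: one access vector and one lock per shared variable for recording, and a per-variable log consumed in order to drive the user-level scheduler at replay time. The class and method names are hypothetical, not LEAP's actual API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// One instance per shared variable. Only threads touching the SAME variable
// synchronize on it -- a "local sync" -- never all threads at once.
public class LeapAccessVector {
    private final List<Long> accessVector = new ArrayList<>();

    // Record phase: called, under this variable's lock, at every access,
    // appending the accessing thread's id to the access vector.
    public synchronized void record() {
        accessVector.add(Thread.currentThread().getId());
    }

    // Replay phase: consume the logged ids in order; a thread may proceed
    // only when its id is at the front of the log.
    private final Deque<Long> replayLog = new ArrayDeque<>();

    public void awaitTurn() {
        long me = Thread.currentThread().getId();
        while (true) {
            synchronized (this) {
                if (!replayLog.isEmpty() && replayLog.peekFirst() == me) {
                    replayLog.pollFirst();  // consume my turn
                    return;
                }
            }
            Thread.yield();  // not my turn yet; let the owning thread run
        }
    }
}
```

The sketch busy-waits with `Thread.yield()` for simplicity; the point is only that each shared variable independently enforces its recorded thread sequence, which is what allows the parallel-like recording mentioned in the question above.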
It's a very simple idea. We wonder why nobody thought of it earlier. But, nonetheless, it is a state-of-the-art order-based deterministic replay technique. It's more lightweight compared to existing approaches. And we use static analysis and bytecode transformation to achieve what the questions you just asked touched on. We have a formal proof of replay correctness, which I haven't found in any of the earlier papers, and we contributed the first automated tool available to the public for replay, which is very surprising. We couldn't find any other automated tool for replaying, but our tool is publicly available for download. The weaknesses of this approach are also apparent. There are still too many synchronizations. We have six synchronizations, although they are local ones. And essentially, when we lock these access vectors, we eliminate all of the low-level data races, even though many of them are benign or intentional. Essentially we're forcing a sequentially consistent execution of the program. So these are the weaknesses. Yeah? >>: In your tool would you do anything to try to optimize away conditionals? Because you talked about how you execute a statement and then do, you know, "If I should keep going then execute the next statement." Is there anything you do to try to, you know, say, "Oh, I should just do the next five off the stack?" >> Charles Zhang: You mean for the replay part or for the... >>: For the replay part. >> Charles Zhang: For the replay part we just execute -- whenever we reach a shared variable we actually check the log to see if it's my turn to execute. So just as I illustrated... >>: So you're doing that each time? >> Charles Zhang: Yeah. >>: You haven't found a need to put different tricks in there? >> Charles Zhang: Yeah, it's pretty -- Well, in other words, it's pretty slow. We just force, actually -- at each computation we have to check the log to replay. >>: Yeah. >> Charles Zhang: Yeah, we haven't done any optimization on it. >>: Thank you. >> Charles Zhang: But our OOPSLA paper actually does some optimization on it; I probably don't have time to talk about it. So given all these weaknesses of LEAP, we have improved upon it, and we propose a hybrid approach to address some of them. This is an approach where we require the recorder to synchronize on the writes but not on the reads. That is the main contribution of this technique, STRIDE. I call it a hybrid approach because it is order-based in that we record the write order, the same as in LEAP, and it is search-based in that we record the values of reads and match those values with the writes during replay. Okay? So there is a search element in the algorithm as well. And we use the write order to bound the search complexity, so we don't have the NPC problem; it's actually polynomial time. The total number of steps to search is bounded by KN, where K is the number of threads and N is the total number of operations. So it's a polynomial search. Let me give you some details of how STRIDE works, still with the same example. First of all, the difference is that we record the access vectors for writes only. All right? So for the shared variables x and y, we record the sequence of threads that write to these variables. At the same time we record two values for each read. 
I'm going to zoom in on this particular feature later on, but right now this is the log produced by STRIDE compared to LEAP. The write log is identical to LEAP's. And we log two values for each read, so we call it a double log. So let me zoom in on this double-log feature of STRIDE; this is actually the key contribution of STRIDE. Suppose that we have different instances of writes to the same variable. In my example the series of writes writes a series of prime numbers to a particular variable. And in a recorded execution the read actually reads the second write, which is the value 3. As I showed you earlier, if we want to record the exact order of the race, we need to make the recording and the actual operation happen atomically using a lock. And I've shown you that this lock is actually a bad idea, right? We would have to insert a lot of locks to do this. In STRIDE the big contribution is that we do not need this lock anymore. Instead I'm going to read twice. The first read reads the actual value committed by the remote write. And the second read reads the version, the bound: when this read commits, it returns whatever version the second read observes. We call this the bound. So how do we use these values to restore the race order? Our purpose is to restore the order of the race, right? It turns out that our bound, number four, gives us a search bound. We only need to match the values written and committed by writes before version four. Okay? So this bounds our search; that's why we're polynomial. And the rest is very simple. We just scan from the bound backwards and find the match by value, because we have the value of each read recorded. It's quite apparent that we can possibly establish a very different read-write linkage than the original one; that's possible. And we prove that this does not affect the replay correctness. >>: Are you going to define that more? "We prove this does not affect replay correctness?" >> Charles Zhang: Yeah, the full proof is in the paper but I can give you more information if... >>: Is the idea basically that it has to read the same value? >> Charles Zhang: Yeah. >>: Even if it's from a different write. >> Charles Zhang: Yeah. It returns the same value. >>: Okay. >> Charles Zhang: Even if it's from a different write. >>: Okay. >>: But if the user is debugging the code, that's still going to kind of bring -- I mean, I'm just thinking about this from a practical standpoint. If they're debugging the code and they're trying to find the line in their source code that's causing the problem, if it's still going to bring them there, I consider the proof correct, right? Just practically. >> Charles Zhang: So for... >>: So if you tell me that's the case, I'm happy. >> Charles Zhang: So if it's a failure -- this guarantees to reproduce the failure. But our focus is to try to restore the exact race order so that the programmer can actually understand how that failure happened. And this, in theory, does not guarantee the original order but gives you one of the orders that can actually lead to the same failure. >>: Okay. So at least it gives them something to debug.... >> Charles Zhang: Yeah, it gives you at least, for example, another possible order that leads... >>: Yeah, that's [inaudible]. >> Charles Zhang: ...to the same failure. Yeah, so it doesn't guarantee to give you the original order. 
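Here is a minimal sketch of the STRIDE-style double log for a single shared integer; the names and the exact data layout are my guesses for illustration, not the tool's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// One shared variable: writes are ordered (the write log, as in LEAP),
// while reads take NO lock and log a (value, version bound) pair.
public class StrideSharedInt {
    private volatile int value;
    private final AtomicInteger version = new AtomicInteger(); // write order
    private final List<Integer> writeLog = new ArrayList<>();  // value per version

    public synchronized void write(int v) {  // writes stay synchronized
        value = v;
        writeLog.add(v);
        version.incrementAndGet();
    }

    // The lock-free "double log": first read the value, then read the
    // version. The version is only an upper bound on which write produced
    // the value, because writes may commit between the two reads.
    public int[] readAndLog() {
        int v = value;              // first read: the actual value
        int bound = version.get();  // second read: the version bound
        return new int[] { v, bound };
    }

    // Replay-time matching: scan backwards from the bound and link the read
    // to the latest write with the same value -- a polynomial search.
    public int matchRead(int loggedValue, int bound) {
        for (int i = Math.min(bound, writeLog.size()) - 1; i >= 0; i--) {
            if (writeLog.get(i) == loggedValue) return i;  // matched write
        }
        return -1;  // the read saw the initial value, before any logged write
    }
}
```

As discussed above, the matched write may differ from the one the read originally saw, but because the value is the same, the replayed execution still reproduces the failure.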
So let me summarize: compared to LEAP, on this particular example we only need to perform four synchronizations, and we need to log two actual values and two versions. The big change is that with double logging we do not need to synchronize the reads. Right? So we only need four locks instead of six. This doesn't seem like a lot of improvement compared to LEAP or RecPlay, but in practice it yields a huge performance improvement. What we found is that in practice, on the benchmarks we evaluated, a large majority of the shared-variable accesses are reads. For example, in some of the benchmarks such as Derby and SpecJBB, more than 80% of the shared-variable accesses are reads. Okay, so think about the degree of lock reduction that we can achieve using STRIDE compared to LEAP. And the writes to shared variables, in our experience, are mostly already carefully protected by programmers, because they are more cautious about writing to a shared variable. Our investigation reveals that, for example, in the Derby benchmark more than 95% of accesses potentially incur some kind of race. However, if we only count write-write races, it's only 15%, meaning many of the writes to the variables are already protected by the programmer. This means we do not need to insert additional locks for these writes. The performance penalty is very limited because the programmer already has a lock around that write; we can just insert instrumentation inside the lock region without inserting additional locks. Another amazing finding -- well, probably others have already observed the same thing -- is that context switches between the two reads of a double log are actually not that frequent. Remember, we do not require a lock to synchronize those two reads. In our experiments more than 90% of the reads can be resolved in the first comparison. That means there is no context switch even though I read the value twice. Only 58 out of 400 million reads in our experiments required 4 comparisons, which means there were 4 different versions of the value inserted in between those two reads. >>: And those numbers are based on those facts, those... >> Charles Zhang: Those numbers are based on the subjects that we evaluated, so we cannot say this generalizes to any program. But intuitively I think it's not that surprising. Okay? So context switches are not that frequent for these accesses. Now, we do log extra values, but the value logs are separate from the version logs, which means we can take advantage of locality: we get long continuous runs of the same value, which is very friendly to compression. So in our experiments we don't generate large logs either. This is a summary of one of the evaluations we did for LEAP and for STRIDE. On the left side is an array of popular multi-threaded Java programs. The second column shows the percentage of reads in all the traces that we collected. A large portion of these operations are reads, so that's where the technique in STRIDE can help a lot. The third column shows the overhead of LEAP compared to the original execution. In some of the subjects, such as Derby, LEAP already has only 10% recording overhead. And the last column shows the overhead of STRIDE, and you can observe that it's significantly faster, that is, it has significantly lower recording overhead than LEAP. 
In particular, on this subject, Moldyn, LEAP incurs 100 times overhead, whereas STRIDE drives it down to 1.5 times. Okay? So this shows that relaxing the synchronization on all the reads has a huge impact on the recording overhead. The last technique I want to talk about is called CLAP. It's a solver-based approach where we go a step further to drive down the recording overhead. In this approach we do not require any synchronization, okay, and we record only the branch choices. In other words, we do path profiling. And we compute the schedule by solving constraints. This is ongoing work; I do not have full-blown details, and our evaluation might still be premature. But the core idea was presented at the PLDI Student Research Competition and actually won first place. The basic idea of CLAP is not to use any synchronization at all. We just record thread-local paths at runtime; the local path is all we record. Then we can construct the execution constraints on all the shared-variable computations using symbolic execution along the path. And we use SMT solvers to compute the schedule. Okay? More specifically, we have path constraints, program order constraints, partial order constraints, and read-write constraints, and we feed them to a constraint solver to compute a global order of shared-variable accesses. And the good news is that path profiling is pretty lightweight. It's a classic technique that people have really worked hard to optimize. For example, Ball-Larus path profiling incurs only about 30% overhead. Let me give you the basic high-level steps of how we approach recording in CLAP. Same example: we first record all the branch decisions. Okay? So in this case we record the branch decisions and we collect the execution path for each thread. The next step is to encode the constraints based on this path. We have different kinds of variables. We have a symbolic variable for shared variables, for all the different instances of shared-variable accesses. We have an S variable that symbolically represents the value returned by a remote read. And we have order variables corresponding to each of the shared-variable access instances to represent their positions in the global schedule. Okay? On the bottom, I'm showing how we construct the constraints. The first order constraint is just the program execution order; it's very straightforward. And the read-write constraint basically says that I'm trying to match the symbolic variable that represents a remote read with actual writes. But there are several possible instances to match to. For example this one, the one I circled, can potentially be matched to its local write or to a remote write. Okay? Depending on the actual pairing, I construct a different order constraint over the order variables. If you don't bother with the details, this basically says: if that variable corresponds to the local write, the computation of the other thread must happen either completely before that write or completely after the actual read. Okay? >>: So in this example, I mean the only difference in the programs that you're observing is whether or not the conditionals evaluate true or false, right? >> Charles Zhang: Right. >>: So then you can actually have -- So, okay, there are certain executions in which, for example, y is equal to 1 on the right-hand side. 
So you're saying by observing the control flow, you are getting constraints on the inter-thread execution? Because if... >> Charles Zhang: So this example... >>: ...if six executed and then seven executed then you know something about y changing, and so that would require some sort of race [inaudible]. >> Charles Zhang: So we use the control flow to compute -- this example doesn't show that explicitly. It actually computes the constraints for each shared variable to be executed first. >>: But you're not... >> Charles Zhang: Local constraints... >>: ...recording the values of variables. >> Charles Zhang: Not recording. >>: So the problem is, if you don't branch on some shared variable's value, how will you be able to reproduce the race on it? >> Charles Zhang: We use -- probably you can answer this [inaudible], but we use a load-store --. >>: Well, maybe I just don't understand. You do have orderings on the shared variables? Or is that what you want to compute? You don't have the order? >> Charles Zhang: So the purpose of the computation is to compute values for these O's, for these orders. >>: Right. >> Charles Zhang: To assign a global array of integers to these variables [inaudible].... >>: Okay, so I guess my point is, if you just have straight-line code, no conditionals, then there's only one path on each side. So what do you do in that case? >> Charles Zhang: Then we don't have the path constraints. We only have the order constraints and the read-write constraints. So it's easier... >>: Actually, so in that case you basically say any interleaving is possible. Like in that case you just observe these two threads executing, but actually all the interleavings are possible in that case? >> Charles Zhang: No, basically you have... >>: I mean all interleavings that are consistent... >> Charles Zhang: Yeah. >>: ...with the values. >> Charles Zhang: For example, you have a bad property. Right? So you have to satisfy this bad property, so you have another constraint here. >>: Right. I'm just saying that in the case where you just have straight-line code with no conditionals and just a bunch of assignments to shared variables and then reads, you have a lot of possible interleavings. >> Charles Zhang: Right. >>: So [inaudible] is sort of a predicate. >>: Okay. So I'd assume... >>: [Inaudible]. >>: ...there's some predicate. Okay. >> Charles Zhang: There's some predicate that we symbolically execute, form all the constraints and [inaudible].... >>: Okay, so there's at least one predicate that tells you something. >> Charles Zhang: Yeah, there's one predicate. >>: Okay. >> Charles Zhang: Yeah, that's right. So [inaudible], I think this is a simplified representation of what we did. But the core idea is to encode different constraints symbolically. On the top I'm giving all the constraints for this particular example, and we feed them to the solver. At the end the solver returns an integer assignment to all the order variables that represents the position of each access to a shared variable in the global schedule. So these are the order variables, and we just use these variables and the user-level scheduler to replay this program. So let me summarize the characteristics of CLAP. We think the most important contribution we make is to reduce multiprocessor replay to solving two well-known problems: one is thread-local profiling and the other is automatic theorem proving, which are really very well-developed areas, very well-developed techniques. 
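To make the encoding step concrete, here is a minimal sketch using the Z3 Java bindings on a tiny two-thread example; the example itself, the variable names, and the bug predicate are all hypothetical, and this is my illustration of the idea rather than code from CLAP itself:

```java
import com.microsoft.z3.*;

// A tiny CLAP-style encoding sketch: T1 runs W1: x = 1 then R1: r = x;
// T2 runs W2: x = 2. O_* are order variables (positions in the global
// schedule) and S_x is the symbolic value returned by the read R1.
public class ClapEncodingSketch {
    public static void main(String[] args) {
        try (Context ctx = new Context()) {
            IntExpr ow1 = ctx.mkIntConst("O_w1"); // order of W1 (T1: x = 1)
            IntExpr or1 = ctx.mkIntConst("O_r1"); // order of R1 (T1: r = x)
            IntExpr ow2 = ctx.mkIntConst("O_w2"); // order of W2 (T2: x = 2)
            IntExpr sx  = ctx.mkIntConst("S_x");  // value returned by R1

            Solver s = ctx.mkSolver();
            // Program-order constraint: within T1, W1 precedes R1.
            s.add(ctx.mkLt(ow1, or1));
            // Read-write constraint: R1 returns the most recent write
            // ordered before it. Either it reads the local write W1 and W2
            // does not fall in between, or it reads the remote write W2.
            BoolExpr readsW1 = ctx.mkAnd(
                ctx.mkEq(sx, ctx.mkInt(1)),
                ctx.mkNot(ctx.mkAnd(ctx.mkLt(ow1, ow2), ctx.mkLt(ow2, or1))));
            BoolExpr readsW2 = ctx.mkAnd(
                ctx.mkEq(sx, ctx.mkInt(2)),
                ctx.mkLt(ow1, ow2), ctx.mkLt(ow2, or1));
            s.add(ctx.mkOr(readsW1, readsW2));
            // Path/bug predicate from the recorded branch decisions, e.g.
            // the failing branch was taken only when R1 saw T2's write.
            s.add(ctx.mkEq(sx, ctx.mkInt(2)));

            if (s.check() == Status.SATISFIABLE) {
                Model m = s.getModel();
                // The integer assignment to the O_* variables is the
                // schedule the user-level scheduler replays.
                System.out.println("O_w1 = " + m.evaluate(ow1, true));
                System.out.println("O_w2 = " + m.evaluate(ow2, true));
                System.out.println("O_r1 = " + m.evaluate(or1, true));
            }
        }
    }
}
```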
And we also have a formal proof of why the replay is correct, so that the schedule computed by CLAP is guaranteed to reproduce the bug. >>: If the bug doesn't depend on [inaudible] then it's not right. For example, if you have like [inaudible], you have two data races, then since you have not recorded any values, I think your schedule does not say which order has occurred. >>: But that's a bug too. >>: Yeah. >>: I mean that's an observable bug under [inaudible]. >>: Sure, sure. >>: But I mean generally I think you're also pushing off alias analysis to your constraint solver, because in your simple case with scalars it was great. But when you have pointers and indirection, since you're not recording values, you can't use value differences to disambiguate pointers. And generally there's a rule that if star P is not equal to star Q then P and Q don't alias... >> Charles Zhang: That's right. >>: ...which you can use to your advantage when you're recording values. But you don't even have that. So you're also going to have to put all the alias or non-alias constraints... >> Charles Zhang: Right. >>: ...into your theorem prover too, and ask it to also resolve potential aliasing, yeah? >>: Yes, that's true. >> Charles Zhang: That's true. >>: So you're pushing off a lot -- you're making your runtime cheap but you're pushing a lot of problems onto the theorem prover, which is not bad, but I just sort of wonder. I mean it's not just about the -- you're going to have to have some memory model to reason about aliasing. Because like in his case, you may or may not get the error depending on whether some alias occurred or didn't occur earlier. >> Charles Zhang: Yeah, the alias thing is definitely a problem. And I think KLEE gives us a lot of symbolic representation of the pointer values. But there are also some problems with the race, with stuff like that. And, yeah, we admit that's a problem for us too. By having the order variables, another advantage of our approach is that we can represent different relaxed memory models very easily, for example, TSO/PSO models: we just encode different order constraints into our constraint solver to be able to address bugs under these relaxed memory models. Another interesting characteristic is that, like the CHESS work, we can encode bounds on the search. To give the constraint solver an easier time, we use a context-switch bound, a preemption bound, to reduce the search space. Basically we can provide a constraint on the sum of the deltas of consecutive thread-local order variables: we sum them and put that as a constraint into the constraint solver, so that we can ask the constraint solver to give us only schedules within a certain preemption bound. So this is another way for us to make the computation tractable. This is a preliminary evaluation result for CLAP, and these are some of the subjects that we used. CLAP is completely implemented on LLVM, so on C programs, different from my earlier subjects. Okay, so this is all on C programs. What we observe is that the recording overhead ranges from 10% to around 269%. For evaluation purposes we also implemented a version of LEAP on LLVM, and we observed a significant reduction, which is as expected. All right. So let me just make some conclusions. 
I hope I have highlighted the challenges of user-level deterministic replay, which are that we want to reduce the recording overhead and we want to avoid the Heisenberg effect, where the instrumentation makes a certain type of bug disappear. We also want to reduce replay complexity, although in this talk I didn't have time to talk about that. And I presented three techniques: LEAP, STRIDE and CLAP. As for future work, we'd like to focus on things like event-driven programs. We very much want to address how to replay long-running systems and also how to replay bugs in distributed systems. Okay? So thank you very much and thanks a lot for your time and attention. For more information about our research, this is my home page. And I think I took a little bit long but... >> Tom Ball: Thank you very much. [applause] >> Charles Zhang: Maybe some feedback and suggestions. >> Tom Ball: Yeah, we have plenty of time for some more questions. >>: So I'm curious about your plans for future work. Can you say a little bit more about what you're planning to do with event-driven programs? I mean, what kind of programs are you referring to when you say event-driven programs? >> Charles Zhang: So I think about two types of event-driven programs. One is the low-level event-driven programs; the most famous one is the Linux kernel, things like that. And given my background I'm more interested in addressing event-driven [inaudible] systems, for example, these messaging systems, where debugging is very, very difficult. So I have some preliminary ideas, but my surprise is that there's -- maybe I just don't know of it yet, but I don't know of any replay work that addresses this kind of system. So that's why I think it's worth investigating. >>: I think that one challenge here, [inaudible] thought about this problem, is that when we started looking at, you know, multi-threaded systems there were standard libraries that programmers could use to express multi-threaded computation. So in the open source or the UNIX world we had pthreads. On Windows there is the [inaudible] library, and Java and C#, they all have these standard APIs. But in this [inaudible] world, it seems to me that there's just a huge variety of primitives and libraries. So it's kind of hard to imagine which one you should target. You see what I'm saying? >> Charles Zhang: Well, we start with -- well, I think the world is not completely chaotic. For example, there are also well-defined type systems, type-based messaging, so that we can have some kind of information about the semantics of the messages. And there are also standards like the Java messaging system; it's a standard. So I think we have some ground to start from. It's not really completely chaotic. >>: Okay. >> Charles Zhang: But, unfortunately, I don't have further experience than that. I just feel there's a great need for it. >>: Yeah, I agree with that. >>: [Inaudible] share the, like, actual analysis time for the offline one. How was that? I mean, was it really scalable to [inaudible] solve the schedule problems? >> Charles Zhang: So the complexity is the same as CHESS. It's exponential in the number of threads. >>: Yeah. [Inaudible]... >> Charles Zhang: To enumerate the... >>: I can understand the complexity in theory, but I'm just asking: in your practice have you really re-executed or replayed the [inaudible] in a reasonable amount of time? >> Charles Zhang: Yeah, for all these subjects I list here we can -- yeah, so I think I want to race against the time. 
So I didn't explain it. The second column is actually the time taken to reproduce the -- to trigger the error for these subjects. >>: Including the [inaudible] analysis? >> Charles Zhang: No, this is only... >>: [Inaudible]... >> Charles Zhang: ...the recording runtime overhead. >>: ...[inaudible]. Yeah. >> Charles Zhang: And the time taken -- we have a table, but I should put it on a backup slide. >>: [Inaudible] for the first six programs, we finished it. We didn't [inaudible]. And for the last one [inaudible]. >>: What are the last ones? Did they like run overnight? >>: It would have been difficult because there are too many [inaudible] needed to explore. >>: That isn't really like a special benchmark, that one, two... >>: That's like all it does. >>: Yeah, all it does. And it's just kind of like a [inaudible] testing where a replay [inaudible]. >>: Which one is that? >>: That is racey. >> Charles Zhang: The last one, it's... >>: It's the worst one. >> Charles Zhang: It's basically... >>: [Inaudible]. >> Charles Zhang: ...wastes all the time. >>: I see. >> Charles Zhang: So for that I think it's a special case -- well, that's the only benchmark for which we are unable to produce the schedule. But hopefully in real-world programs, none of the race conditions are like that. >>: It is not a real [inaudible]. >> Charles Zhang: So all these subjects are small, but they are carefully constructed to produce a concurrency error. So in terms of replay difficulty I don't think they're any easier than the real or larger benchmarks. But we're still working on larger benchmarks. >>: You said you're using KLEE, right? >> Charles Zhang: Yeah. >>: So you're actually using KLEE then to do exhaustive enumeration to find the path? So you're not really truly doing symbolic... >> Charles Zhang: We modified KLEE not to duplicate states. Because we have the paths, we just use it as an engine to give us the constraints along the path. >>: Right, so you're using KLEE to give you the intra-thread path constraints? But then to that you have to add all these other constraints. >>: Right. >>: Right? So then those are being added in. >> Charles Zhang: Right. >>: I see. So you're going to let KLEE execute to some point or are you just -- I don't quite understand. So are you going to just take all -- okay, you have the separate thread traces. You're just going to make one pass with KLEE on each to get a formula per thread? >>: Right. >>: And then you make one big call to the solver or --? >>: Well, basically after we collect the path constraints for each thread... >>: Right. >>: And then we [inaudible] and then we encode all the other execution constraints. >>: So there's one big call. >>: There's going to be one big call, yes.... >>: There's one big call, right? Okay. And that call, if it comes back, will give you a model for the order variables. >>: Right. >>: That's interesting. So actually that trace then -- so a typical use of KLEE or SAGE is that you let it go until the first branch and then you say, "Can I go a different way?" You're not using it... [simultaneous inaudible comments] >>: ...long execution and I just want the path constraints for... >>: That's right. >>: ...the whole thing. >>: That's right. >>: And furthermore, that's why you get into this issue of symbolic pointers, too, because KLEE is not doing everything symbolically and you would like it to make more things symbolic, probably. Is that true? >>: So basically we need to mark the [inaudible] to be symbolic. >>: Right. 
Because KLEE is -- I mean, as I understand it with SAGE, you sort of mark some bit of state as symbolic. Is that correct? But then there's all this other stuff that you want actually to be symbolic that, if KLEE were just to treat it its default way, would not be symbolic. >>: [Inaudible]? >>: Because it would be concrete. >>: Yeah, we have to address the symbolic [inaudible] problem. >>: So you need to make more things symbolic. KLEE is essentially trying to find changes to the input byte string that will cause the sequential program to go a different way. Right? But in their case, they have all these global memory accesses. Like for a single thread it's just a function of the [inaudible].... >>: It's basically just extra data because they're analyzing each thread separately. There's this extra data [inaudible]... >>: They want to make symbolic. >>: ...in symbolic. Yeah, okay. >>: Is that correct? >> Charles Zhang: It has its own symbolic memory representation and then it actually emulates sort of the memory assignments. >>: Yeah, that's right. >> Charles Zhang: As well. So it gives us some kind of [inaudible] information but not for all the variables. >>: This is interesting. >>: Yeah, so this is... >> Charles Zhang: It doesn't... >>: ...quite interesting. >> Charles Zhang: Yeah, it doesn't solve arrays, for example. >>: You definitely want to do partial -- I mean, you definitely want to do reduction on this trace so that you don't look at all the possible... >>: So there is... >>: ...points for preemption. >>: ...the research group at NEC Labs. >>: Yeah. >>: They do stuff like this, right? >>: Yeah, I was telling them about the -- I mean... >>: Like [inaudible]? >> Charles Zhang: Yeah, Chao Wang. >>: But it's different... >>: And Chao Wang, yeah. >>: But it's different. They don't have this local -- they're doing something different because of the local stuff. Generally what happens there is, you record a partial order of a parallel execution and then you search for other schedules that obey the same constraints but might find the bug. >> Charles Zhang: I see. >>: Right? >> Charles Zhang: Okay. >>: So... >> Charles Zhang: So, yeah, this is different. >>: So this is different. It's different, yeah. It's a different take. >> Charles Zhang: Definitely a lot of room for optimization, but the key idea is to do path profiling and then do symbolic execution on that path and solve the constraints. And... >>: You shouldn't call it path profiling, though. I mean, you're really doing a full instruction trace. >> Charles Zhang: Well, path... >>: I mean... >> Charles Zhang: [Inaudible]. >>: I know, the path decision, right. But it's like -- yeah, I would talk about instruction [inaudible]. Okay. Cool, well thanks again. >> Charles Zhang: Thanks a lot. [applause] Thank you.