>> Madan Musuvathi: Hi everyone. I'm Madan Musuvathi from the Research in Software Engineering group, and it's my utmost pleasure to introduce Junfeng Yang. He is an assistant professor at Columbia University, and we go a long way back. We met in graduate school. He worked on model checking as well, and he applied model checking to find lots of errors in Linux file systems, and then he moved on to apply those techniques to distributed systems. Recently he's been working on reliable multithreading: how do you get deterministic semantics for multithreaded programs. He recently received an NSF CAREER Award for this work. So it will be exciting to hear more about it. Junfeng. >> Junfeng Yang: Thank you for the introduction, and thank you all for being here. Since this is a small group, if you have any questions, feel free to interrupt me. I tend to explain things better when it's more interactive. So this talk is about our recent work on making multithreaded programs more reliable. The key idea here is called schedule memoization, which I'll explain in detail in this talk. The talk is based on work from two papers: one paper and system is called TERN, and the other paper and system is called Peregrine. It's joint work with my students Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo and Chia-Che Tsai. These are great students; I enjoy working with them. If you are looking for interns, talk to these students. Okay. So let's look at a few trends that shape our computing today. The first trend is a hardware trend: machines are getting more and more cores. This figure shows the number of cores for AMD Opteron processors; you can see that in 2010 we have 12 cores on the Magny-Cours processor. How do we use all these cores? Developers are writing more and more multithreaded code. The second trend is a trend in software: the coming storm of cloud computing. We have more and more users connecting to the cloud of servers, and more and more devices connecting to the servers as well. This creates lots of computation, and at the same time the computation is aggregated on the servers in the datacenter. How do we get good performance in this setting? For performance, virtually all of these servers run multithreaded programs. At the same time, multithreaded programs are known to be difficult to get right. They very often contain tricky heisenbugs, data races for example, that sometimes show up and sometimes disappear. It's really hard to write, test and debug these multithreaded programs. And the key reason, I believe, and other researchers believe as well, is that these programs are nondeterministic. Running them is like doing a coin toss: sometimes you get heads, sometimes you get tails; success or failure. Specifically, today, when we run a multithreaded program, even on the same input, run repeatedly, each time we may actually use a different thread interleaving, a different schedule. This rectangle represents the input; each time we run it, we may get a different schedule. There's essentially a one-to-many mapping from an input to schedules. This nondeterminism complicates a lot of things. For example, testing becomes a lot less assuring. Many of these schedules are good and do not contain bugs, but some of the schedules contain tricky errors such as data races and concurrency errors.
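(Editorial aside, not part of the talk: a minimal pthreads sketch of the nondeterminism being described; the names are illustrative. Run repeatedly on the same input, the two threads may interleave differently, so the output order can change from run to run — the one-to-many mapping from an input to schedules.)

    #include <pthread.h>
    #include <stdio.h>

    /* Same program, same input, yet the two threads below may be scheduled
     * in either order, so the output order can differ from run to run. */
    static void *worker(void *arg) {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }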
And when we test the program, we may run into these good schedules, so the bugs do not show up in the testing lab. But once we ship the code to end users, when they run the program, they may hit the interleavings where the bugs show up and get crashes. Similarly, debugging becomes very complicated, because in order to track down a bug we have to reproduce the same buggy interleaving, which can be challenging. So recently researchers have proposed a good idea called deterministic multithreading, which conceptually forces the program to always use the same schedule on the same input. Essentially there's a one-to-one mapping in existing DMT systems: when we run the program on the same input, regardless of how many times we run it, we always run into the same schedule. This idea is great; it solves a lot of issues caused by nondeterminism. But at the same time these existing systems suffer a limitation. Basically they make the schedule tightly dependent on the input: if you change the input slightly, say you change the input by just one bit, these systems may force the program to run into a very different schedule, which may contain bugs. So there's this tight input sensitivity, this tight input dependency, in existing DMT systems. As I just mentioned, some of these schedules also contain bugs. This complicates testing as well, because we can test some inputs in the testing lab but we cannot cover all the inputs; we may ship the program to end users, and when it runs on slightly different inputs it may actually run into bugs. Similarly, debugging becomes challenging, because in order to reproduce a buggy execution we have to reproduce exactly the same input, including input coming from the environment; everything has to match, which can be hard for technical reasons or privacy reasons. By the way, you can actually observe this behavior with existing DMT systems: you change the input slightly and a bug shows up; you change the input more and the bug appears or disappears again. So to address this instability problem, to reduce these unstable behaviors, we propose an idea called schedule memoization. The idea is to memoize the schedules that worked in the past, and in the future, when an input comes, to try to reuse these schedules. Therefore, we can avoid the unknown, potentially buggy schedules. We call this stability: hang on to the good, tested schedules, and in the future, when new input comes, try to reuse these schedules that are shown to work; therefore we can avoid bugs in the unknown schedules. Do you have questions? I guess the typical question I get here is: if you do this, you sort of live with a program which contains bugs, because you hang on to the schedules which do not contain bugs even though the program itself is not really correct. My answer is that this technique can be turned on for production systems, when we actually deploy the system: we ship the code to users, hang on to the good schedules and avoid the schedules we have not tested. In testing, you can turn this technique off and use techniques such as CHESS to explore different schedules and get a set of correct schedules, and then ship those schedules together with the program to end users. Question? Right. >>: So are you going to explain what you mean by the same schedule? >> Junfeng Yang: I'm going to explain that — define what a schedule is, right?
Next few slides. We actually called the first system that implements this schedule memoization idea TERN, because the idea matches the natural tendencies of animals and humans: we repeat familiar things and avoid errors and unknowns. This slide shows the migration routes of migrant birds: they follow fixed routes; they don't take random routes. The first system that implements memoization is TERN, named after the arctic tern, the bird that migrates the longest distance of all known animals; this is a picture of the arctic tern. Now we can reuse schedules for different inputs, but a dilemma remains, which is determinism versus efficiency: you can only pick one, you cannot pick both. I'm going to explain what I mean by determinism and efficiency in the next few slides and also define what a schedule is. At the language level, where does nondeterminism come from? It comes mainly from two sources. First, we have nondeterministic synchronization; for example, locks. Locks can be acquired in different orders. This shows a bug in Apache that's caused by nondeterministic synchronization: the first thread acquires a lock and uses a shared object, the second thread frees the object, and they both compete for the same lock. If the code runs in this order, the code is correct. However, if it runs in the reverse order, because of this nondeterministic order of lock acquisition the code runs into a use-after-free crash. So this piece of code crashes or runs correctly depending on the synchronization order. The other source of nondeterminism is data races. This slide shows a data race example from one of the benchmarks we use, an implementation in SPLASH-2, a popular parallel benchmark suite. Here, thread 0 prints out some results and thread 1 actually computes the results; the computation should happen before the print. If the race resolves that way, we get the correct result. However, if the race resolves the other way, we actually print out wrong results. So this piece of code prints out correct or wrong results depending on how the races are resolved, the order of the memory accesses. Given these two types of nondeterminism, researchers have proposed two types of schedules. The first is called a sync schedule: basically a total, deterministic order of synchronization operations such as lock and unlock operations. The second type is a deterministic order of shared memory access operations, where you enforce an order of load and store operations to the same memory location. Both types of schedules have pros and cons. A sync schedule, for this first example: if you enforce this order, or you enforce this other order, either way you can make the code deterministic — the bug always shows up or the bug never shows up. The advantage is that it's very efficient to enforce a total order of synchronization operations, because most of the code does not do synchronization and therefore can still run in parallel; the overhead reported is on average 16 percent. The downside is that if there are data races, this approach does not work. Let's look at this race again: we can enforce a deterministic order on the synchronization operations, but this race can still be resolved in different ways, so you can still have nondeterministic behaviors. The second way to make a program deterministic is to enforce a memory access schedule. For this particular example, if we always enforce this order, or this order, either way we can always enforce the same consistent order to make the behavior consistent.
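(Editorial aside: a minimal sketch of the print/compute race just described, not the actual SPLASH-2 code; names are assumptions. Whether the correct value is printed depends entirely on how the race between the store and the load is resolved, which is exactly what a memory access schedule would pin down.)

    #include <pthread.h>
    #include <stdio.h>

    static int result;                    /* shared, unsynchronized */

    static void *compute(void *arg) {
        result = 42;                      /* "thread 1" produces the result */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, compute, NULL);
        printf("result = %d\n", result);  /* "thread 0" may print 0 or 42,
                                             depending on how the race resolves */
        pthread_join(t, NULL);
        return 0;
    }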
Either the bug always shows up or the bug never shows up. That determinism is the advantage. The problem is that a normal program does lots of memory accesses, lots of shared memory accesses; if you enforce orders on all of these, you slow the program down by a lot. The numbers reported in previous systems range from a 1.2x to a 10x slowdown. Questions? All right. >>: So there are two kinds of overheads. One is the instrumentation overhead: just by having to do some extra work on each memory operation, you are going to pay some cost. >> Junfeng Yang: The slowdown, right. >>: And then there's this other cost of actually enforcing the determinism, where the hardware has more parallelism but you're not able to use it because of the ordering. >> Junfeng Yang: Right. >>: Do you know how you would separate the overhead? >> Junfeng Yang: That's a good question. I do not have a specific breakdown of the instrumentation overhead for memory accesses. But for the synchronization order overhead — right here, if you enforce this synchronization order, the overhead comes mostly from waiting, because we serialize the synchronization operations in the code. And for memory accesses, I think the instrumentation alone gives you maybe a 5x to 10x slowdown, so it's reasonable to say the waiting time is not the critical part there. So the challenge here is that we have to pick a schedule type: enforce the synchronization order or enforce the memory access order. If you use sync order, we do not get full determinism, because the data races are not resolved deterministically, but the code runs efficiently. If you pick memory access order, we get full determinism — the races are resolved deterministically — but we get this efficiency problem, because the slowdown can be quite large. Can we actually get both? With Peregrine, you can actually get both: you can get both determinism and efficiency. I'll explain how we achieve that. One key idea in Peregrine is that we can enforce hybrid schedules instead of mem schedules or sync schedules. The insight here is that races actually rarely occur. Although most programs have races, either benign or harmful, these races in practice do not occur that often. Intuitively, if your program had lots of races, when you ran it in the testing lab you would probably have detected those races already. Empirically, we ran six programs that contain races with some standard workloads, such as compressing a file for the parallel compression utility, and there are millions of memory accesses and synchronization operations, but we detected only around 10 races. The races really do not occur that frequently. If you look at where the races fall in an execution, most of the execution is fine and does not contain races; only a minor portion contains races. >>: What does it mean to [inaudible]? >> Junfeng Yang: Contain a race — which means there are concurrent memory accesses that are not ordered. >>: Happening at the same time, or using happens-before — >> Junfeng Yang: Using happens-before. If you have one access here and another access there and you look at absolute time, they're not concurrent, but they could happen in either order if you run it again. >>: So this 10 is 10 happens-before races. >> Junfeng Yang: Right, happens-before races, with respect to the synchronization order we will enforce. I'll show more details when I talk about the example.
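(Editorial aside: a small sketch of the happens-before check being discussed, assuming vector clocks computed along the synchronization order that will be enforced; accesses already ordered by that synchronization order are never flagged. Types and names are illustrative.)

    #include <stdbool.h>

    #define NTHREADS 2

    /* One vector clock per access, derived from the synchronization order
     * to be enforced, so lock/unlock edges already order many accesses. */
    typedef struct { int clk[NTHREADS]; } vclock;

    static bool hb_leq(const vclock *a, const vclock *b) {
        for (int i = 0; i < NTHREADS; i++)
            if (a->clk[i] > b->clk[i])
                return false;
        return true;                      /* a happens before (or equals) b */
    }

    /* Two accesses to the same location, at least one a write, race iff
     * neither happens before the other under the enforced order. */
    static bool is_hb_race(const vclock *a, const vclock *b) {
        return !hb_leq(a, b) && !hb_leq(b, a);
    }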
If you run a regular race detector, the detector can detect a lot more races, but here we assume that we enforce the synchronization order, which prunes a lot of races. So based on this insight, we propose to use hybrid schedules to combine the benefits of both sync schedules and mem schedules. The idea is: for this portion, which is race-free, does not contain races, let's use sync schedules — enforce a total synchronization order to get TERN's efficiency. And for the minor portion that contains races, we can enforce the more expensive memory access schedule. Here we pay more overhead, but since this portion is only a small part of the entire execution, the overhead won't be that high. We implemented this idea in a system called Peregrine; this picture shows the peregrine falcon, which is the fastest flying bird. Specifically, how do we compute hybrid schedules? The idea of schedule memoization actually makes it very easy to compute hybrid schedules. First, we run the program on the input, a new input we have not seen in the past. We record an execution trace that contains both the synchronization operations and the memory access operations and their order; we do that because we do not know whether this execution will run into races or not. Once we've done that, we have this execution trace, and we run some offline analysis on the trace to relax the trace into a hybrid schedule. The hybrid schedule contains a synchronization order for the race-free part and memory access orders for the racy part. With this schedule, in the future, if an input compatible with this schedule comes, we can reuse the schedule on the new input with both determinism and efficiency. So why can we actually reuse schedules? We asked this question. Think of a simple parallel program such as the parallel compression utility in our benchmarks: it splits the file into multiple blocks and then compresses the blocks with multiple threads. As long as the number of blocks and the number of threads remain the same, we can compress the file using the same compression schedule regardless of the file contents. So it's often the case that you can actually reuse these schedules, and our evaluation found that for programs such as Apache we can reuse schedules over 90 percent of the time — and for many programs, not just Apache. The reuse part requires some analysis, and in the Peregrine system we made this part automatic; it does not require user annotations. Question? >>: So would you need to know which is the data-race-free portion and which is — >> Junfeng Yang: So it depends on what you mean by new input. Once we figure out one schedule, we're going to compute: okay, for this set of inputs you can totally reuse this schedule without introducing new races at runtime. Therefore, those inputs are covered by this schedule. If there is some new input that is not covered by your existing schedules, that input requires a new schedule. Okay. >>: And you're going to have details on — >> Junfeng Yang: Right. I'm going to show you an example explaining how this works. Question? >>: You probably are going to talk about it later, but where do you get the inputs that you tried for Apache? >> Junfeng Yang: For Apache, the input comes from the network via the recv system call that receives network data.
>>: I'm just wondering, because it really depends on how different the inputs are. >> Junfeng Yang: Right. >>: I'm just wondering what sort of requests it issues, how you generated those requests. >> Junfeng Yang: We generated the requests using two methods. One method is to run a synthetic testing tool, which launches a bunch of HTTP requests fetching different URLs. The second workload we got is a trace from the Columbia website: we collected one month's trace, looking at all the HTTP requests issued to the Columbia website. In the evaluation we replayed those HTTP requests against our system and measured the reuse rate. Does this answer your question? >>: Yes. Mostly static, or was it static and a combination? >> Junfeng Yang: Mostly static, those HTTP requests. So our system handles the pthreads synchronization operations, runs on Linux, and, for reliability, we made our system run in user space instead of modifying the kernel; it also works with server programs like Apache. I'm going to show you a summary of results before I talk about the details of the system. We evaluated Peregrine on a diverse set of 18 programs, including programs such as the Apache web server, the parallel compression utility PBZip2 — the two I just talked about — and also aget, a parallel downloader, and pfscan, a parallel file scanner. We also evaluated our system on 13 scientific programs, programs implementing scientific algorithms: ten from the popular SPLASH-2 benchmark suite and three from the popular PARSEC benchmark suite. We also tried a program called Racey, which contains tons of races; this program is actually a stress testing tool for deterministic replay and deterministic execution, because if any race is resolved differently, you get a different result. Peregrine can deterministically resolve all the races in these programs: we get determinism. As for the overhead, Peregrine can sometimes actually speed up a program by up to 54 percent, the overhead is up to 49 percent, and it can frequently reuse schedules for nine of the 18 programs. >>: The speedup? >> Junfeng Yang: The speedup, basically — there are two key reasons why we can actually speed things up. The first is that once we enforce a synchronization order, we can replace expensive synchronization operations with cheaper ones. For example, there are operations such as barrier wait; each causes a context switch, which can cause a lot of overhead. Once we enforce the order, we can use cheaper ones that just do not do the context switch, which makes it faster. Also, some of the benchmarks use usleeps [phonetic] for synchronization: they do not use explicit synchronization; they call usleep to sleep for a while to synchronize. Once you enforce the order, you can skip the sleep. That's why we can speed things up. Okay. So let me show you an overview, then talk about how the system is implemented using a detailed example, and then show the evaluation and conclude. Peregrine uses both runtime components and compile time components. Given a program, first we instrument the program using our instrumenter, which runs in the LLVM compiler infrastructure. Then at runtime, we maintain this key data structure called the schedule cache, which contains a set of tuples <C, S>.
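(Editorial aside: a sketch of what a schedule-cache entry <C, S> and its lookup could look like; the types, fields and argument positions are assumptions made for illustration, anticipating the worked example that follows.)

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical stand-ins for Peregrine's data structures. */
    typedef struct {
        int  nthread;              /* number of threads in the recorded run       */
        bool update_branch_taken;  /* outcome of the branch on the third argument */
    } precondition;                /* C: constraints the new input must satisfy   */

    typedef struct { int id; /* sync order + race edges would live here */ } hybrid_schedule;

    typedef struct {
        precondition    C;
        hybrid_schedule S;         /* S: schedule to replay when C is satisfied   */
    } cache_entry;

    static bool preconditions_hold(const precondition *c, int argc, char **argv) {
        return argc >= 4
            && atoi(argv[1]) == c->nthread
            && (atoi(argv[3]) == 1) == c->update_branch_taken;
    }

    /* On a hit, replay the memoized schedule; on a miss, record a trace,
     * analyze it into a new <C, S> tuple, and add it to the cache. */
    static hybrid_schedule *lookup(cache_entry *cache, size_t n,
                                   int argc, char **argv) {
        for (size_t i = 0; i < n; i++)
            if (preconditions_hold(&cache[i].C, argc, argv))
                return &cache[i].S;
        return NULL;
    }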
S is the hybrid schedule and C is the set of preconditions required for us to reuse the schedule on a new input. I just talked about the parallel compression example: if we maintain the same number of blocks and the same number of threads, we can reuse the same schedule all the time. However, if the input actually changes — for example, we want to compress a file using more threads or fewer threads — we need to determine that this schedule is not compatible with the input and that we need a new schedule. How do we do that? We capture these constraints in the preconditions: when we get a new input, we check the preconditions and determine whether or not we can reuse the schedule. So let's say an input comes. First we do a lookup in this schedule cache to figure out whether the input matches the preconditions. If they do not match, we forward the input into a component called the recorder, which records a detailed execution trace including synchronization operations and memory access operations. Then we run the offline analyzer to extract a hybrid schedule and the preconditions required to reuse the hybrid schedule on new inputs, and we add this new tuple into the schedule cache. If the input does match the schedule cache, it's simple: we just forward the input and schedule to the replayer, and the replayer runs the program while enforcing this schedule S. And because the analyzer computes correct preconditions, we can be sure that this input can always be processed by the program while enforcing schedule S. Okay, this may sound abstract, just a bunch of conceptual ideas, so let me show you how this works using a real example. This code is taken from the PBZip2 and FFT programs in our benchmarks, simplified so that it fits within this slide. First it reads the input, and then it spawns a bunch of worker threads to process some data. If you look at the worker thread, it does some memory allocation, reads data and does some computation — this is simplified; the real code is more complicated than that. Then it updates the result by grabbing a mutex, updating this shared variable result, and releasing the mutex. The main thread also participates in this computation; it calls worker as well. And optionally, the main thread also updates the result, based on the third argument. Then it prints out the result. This code actually contains a bug. Can you see the bug here? >>: Oh, yeah. >> Junfeng Yang: Right. There's a missing pthread_join, so this print code can run in parallel with this update code. Therefore we can have races — write-write or read-write races, for example. So given this piece of code, we first prepare it for use at runtime by running it through the instrumenter. The instrumenter will first annotate, intercept, all the lines that actually read inputs — for example, the lines that access argv, the command-line arguments, which are considered inputs to our program, and this read system call that reads file data; we also instrument the system call to mark that data as input. Given this input data at runtime, we can track how the input data is used, and therefore we can compute the preconditions required for reusing certain schedules. The instrumenter also instruments synchronization operations such as lock operations, so that we can record schedules.
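(Editorial aside: a rough reconstruction of the example just described. The real PBZip2/FFT-derived code is more complicated, and the names, types and argument positions here are guesses.)

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static int result;                            /* shared */
    static int size;

    static void *worker(void *arg) {
        int *data = malloc(size * sizeof *data);  /* thread-local allocation */
        int partial = 0;
        for (int i = 0; i < size; i++) {          /* read data, compute      */
            data[i] = i;
            partial += data[i];
        }
        pthread_mutex_lock(&mu);
        result += partial;                        /* update shared result    */
        pthread_mutex_unlock(&mu);
        free(data);
        return NULL;
    }

    int main(int argc, char **argv) {
        if (argc < 4) return 1;
        int nthread = atoi(argv[1]);              /* inputs: command-line args */
        size = atoi(argv[2]);
        pthread_t t[nthread];
        for (int i = 1; i < nthread; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        worker(NULL);                             /* main participates too     */
        if (atoi(argv[3]) == 1)
            result += 1;                          /* optional, unsynchronized  */
        printf("result = %d\n", result);          /* bug: no pthread_join, so
                                                     this races with workers   */
        return 0;
    }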
Let's say after instrumentation we get the instrumented program, and we run this program with this particular set of arguments, 2, 2, 0, which means we're going to spawn two threads, we're going to process a data set of size two, and 0 means we do not do the optional update of the result. The recorder records a detailed execution trace: it records the statements that read inputs, and it records the execution of this loop as well. Since the number of threads is two, the first time this i is compared to nthread we get true, and the second time we compare 2 to nthread we get false; this gray code corresponds to the execution of the loop. And we run this worker thread, which also records the order of the statements it executes; this gray area corresponds to the loop here. Similarly, we do the same thing for the worker called by the main thread. Then the main function updates the result by grabbing this lock, and the child thread, the worker thread, does the same thing; and we record the outcome of this comparison of the flag read from argv[3]. The last statement is to print out the result. Once we get this trace, I'm going to show you how we can actually extract hybrid schedules and constraints. First we're going to copy the synchronization operations to the hybrid schedule, and we also enforce their order: if the order in the execution trace is like this, then in the future, when we reuse the schedule, we're going to enforce the same order. Next we're going to detect races with respect to this synchronization order. There are three accesses to the shared variable result. If you look at the first pair, they're actually not a race, because there's a synchronization order constraint here making them not concurrent. Even though these two threads grab locks — and they could be grabbing different locks — we're going to enforce this total order of synchronization operations at runtime; therefore these two accesses can never run concurrently, so they're not a race. Does this answer your question? Okay. Let's look at the other two accesses, these two accesses. There's no synchronization operation in between them, so a happens-before race is detected. As a result, for this race we're going to add the two memory accesses into the schedule: we'll enforce an order between these two accesses, so that in the future, when we reuse the schedule, it's totally deterministic. So once we get the hybrid schedule, we need to compute the preconditions required for reusing this schedule. How do we do that? One naive approach is — okay, there are a bunch of challenges here. First, the preconditions need to ensure that all these events are reached when we rerun this program on future inputs: we must enforce the feasibility of these events, that the events can be reached at runtime. And also, for determinism, when we reuse schedules, we want to make sure we don't get new data races, because if there are new races, the execution results can be different. This slide shows an example of a race that may show up if we just reuse this schedule. Let's say we enforce this schedule, but this third argument, argv[3], becomes 1, so this flag becomes 1 and this particular check is satisfied; therefore, we're going to run the true branch of this statement.
We're going to access result, which actually races with the other accesses. So there are new races coming up if we reuse the schedule on new inputs; these are things we want to avoid as well. We want to avoid new races. How do we do that? One approach is to look at the conditionals in the trace, and if a conditional depends on input, that means if the input changes, the result of the conditional may change. We can grab all these conditionals and use them as the precondition, so that in the future, if the input satisfies these conditionals, we can guarantee that all threads will go down the same paths as in the recorded execution. Question? >>: [inaudible] concentrate. >> Junfeng Yang: So, input — here we consider input to be the arguments passed on the command line, and also inputs read from the network and inputs read from files. >>: [inaudible] file system state and system load and — although I'm getting like signals altering execution and that — [inaudible]. >> Junfeng Yang: Right. Right. >>: Determinism. >> Junfeng Yang: The signal part we currently cannot handle. gettimeofday we currently do not handle: if the code calls gettimeofday, gets back a value, and that value affects synchronization operations, that's something we do not handle. The things that we do handle are, for example, the values rand returns: if the program calls rand and gets back a value, we can just mark it as input and do the same thing. For gettimeofday we could have used the same approach, but the problem is that the gettimeofday value is not a stable input value that can be checked as a precondition; that's a limitation of our system. >>: [inaudible] what you've done. >> Junfeng Yang: Set the same random value. >>: Is that a generator — >> Junfeng Yang: If it has the same random seed, you get the same random value. It could also be reading things from /dev/random, which is totally random depending on the system setting. So these are the constraints, and if the input matches all these constraints, we go down the same paths; therefore we address all these challenges. This constraint set can actually be further simplified to this: these are the simplified constraints. The problem with this naive approach is that it computes overconstraining preconditions. For example, look at this size == 2 constraint. It says that in order to reuse this schedule, the hybrid schedule we just computed, you have to make sure that the data size is two. But actually this schedule — the synchronization order, the edges we created, all the memory accesses — can be reused for a lot of different data sizes; whether the data size is two doesn't actually matter. In some sense this size == 2 constraint is overconstraining: it precludes a lot of the opportunities where we could actually reuse the same schedule. One way to solve this problem would be to throw away the constraints that do not matter. But how do we figure out which constraints matter and which do not? That's one of the challenges we solve here, and it turns out this problem is very hard to solve; it took most of our effort in this project. We came up with some program analysis techniques. The first technique is to slice out operations that do not affect the schedule. The second technique improves the precision of the first: we actually need to simplify the program toward the schedule.
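(Editorial aside: for concreteness, the naive constraint set collected from the recorded run with arguments 2, 2, 0 might look like the check below; the size constraint is the over-constraining one that the slicing just mentioned is meant to remove. Argument positions are assumptions.)

    #include <stdbool.h>
    #include <stdlib.h>

    /* Every input-dependent branch outcome from the recorded run ("2 2 0")
     * becomes a constraint; assumes argv has at least three arguments. */
    static bool naive_preconditions(char **argv) {
        int nthread = atoi(argv[1]);
        int size    = atoi(argv[2]);
        int flag    = atoi(argv[3]);
        return nthread == 2    /* the spawn loop ran exactly once              */
            && size == 2       /* over-constraining: the schedule doesn't care */
            && flag != 1;      /* the optional update branch was not taken     */
    }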
We then analyze the simplified program so that we get better precision. The details are described in our paper, and I'm happy to discuss the techniques with you offline, but they're too complicated to include in the talk. Intuitively, if you look at the computation loop, it accesses mostly thread-local variables, local data: for example, this data is allocated locally, so in some sense it's private data for this thread, on the private heap. The computation reads the data and does the computation, but it doesn't touch data accessed by other threads, so it's local computation as well. These local computations can basically be sliced out. >>: I have a question. >> Junfeng Yang: Okay. >>: So what makes a relaxation appropriate for you? Say if I relax the preconditions, how do you know that relaxation is still okay? Does it mean that you know that, for any input in there, this particular schedule that you computed is still feasible? Is that what you want? >> Junfeng Yang: Still feasible, and does not introduce new races. >>: Not introduce new races. >> Junfeng Yang: Right. Because remember there's this conditional update of the result, depending on the third argument. If you run on a new input, there may be a new race coming up, and the results may be nondeterministic depending on how the race is resolved. Those are the two conditions: all the events are feasible, and at the same time no new races come up. There is some past work on slicing, and those algorithms can be used to solve part of the problem within a sequential execution. Here we need to look at inter-thread dependencies: if I read a particular variable here that's defined by another thread, there can be def-use chains across threads, which makes things complicated. That's part of the problem we solve here — making the slicing thread aware. Question? >>: Do you do all this analysis statically? >> Junfeng Yang: We do this analysis statically in the sense that we have the program, we have the execution trace, and we do the analysis offline. But it uses dynamic information, because we have this dynamic trace; it's a combined dynamic and static analysis. >>: [inaudible] doing all the analysis up front as part of the training phase, so when the program is deployed all this stuff is not being done. This analysis is not — >> Junfeng Yang: You can imagine we could also do this at runtime, updating the schedule cache. Right now the analyzer runs for a long time, so doing it online does not make a whole lot of sense because it's slow. The ideal model would be that even after you deploy the system, it can still collect schedules and still run the analyzer offline, then update the cache. >>: Symbolic execution at all — how would that help or not help? >> Junfeng Yang: What? >>: Symbolic execution, for — >> Junfeng Yang: Symbolic execution? >>: Thinking about what is effective, what condition — >> Junfeng Yang: So basically, once we slice the program — like once this statement is sliced out — we get a smaller slice, and we actually run symbolic execution to get the constraints. We look at input-dependent conditionals, track them and collect the constraints; that part is done by symbolic execution. We leverage symbolic execution. Okay. So once we slice things out — all these statements that have no relevance to the schedule — we slice them out.
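(Editorial aside: a sketch of what might remain of the example worker after slicing away operations that cannot affect the schedule; which statements drop out illustrates the intuition above and is not the actual slicer's output.)

    #include <pthread.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static int result;

    static void *worker_slice(void *arg) {
        /* int *data = malloc(...);   sliced out: thread-local               */
        /* read(fd, data, size);      sliced out: flows only into local data */
        /* for (...) partial += ...;  sliced out: no shared accesses         */
        int partial = 0;              /* kept only as the value stored below */
        pthread_mutex_lock(&mu);      /* kept: synchronization operation     */
        result += partial;            /* kept: shared access in the race     */
        pthread_mutex_unlock(&mu);    /* kept: synchronization operation     */
        return NULL;
    }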
We get the preconditions; we separate them out. These are the final preconditions: this is the hybrid schedule we computed from the trace, and these are the preconditions we compute for reusing this schedule. Now if we run this program on a new input — say we want two threads, we process a data set of 100 elements, and the third argument is three — this input, these arguments, match the preconditions. Therefore we can reuse this schedule, enforce the schedule, and get determinism. So there are a few benefits of this approach. First, we can deterministically resolve these races, and the preconditions we compute ensure there will be no new races coming up when we reuse the schedule, so we get determinism. It's also very efficient: we only enforce an order between these two accesses, and the major computation part can still run in parallel. And this schedule can be reused across a bunch of different inputs: we don't care about the data contents at all, and we don't care about the data sizes; whatever data set we give it, we can reuse the same schedule. Questions so far? >>: I had a question. >> Junfeng Yang: Okay. >>: [inaudible] how do you enforce the memory — >> Junfeng Yang: Access order? We actually instrument the program: when there's an access here, we put a semaphore operation up here, and before this access we put a semaphore operation down there, to enforce the order constraint. It's like — >>: Dynamically, as part of the instrumentation. I'm talking about deciding what schedule to use. >> Junfeng Yang: Issued by the program, that's right. >>: But you made the point that if you instrument programs, there will be overhead; even if the instrumentation does nothing, a no-op, you automatically start paying five to ten times overhead. How do you avoid doing that? >> Junfeng Yang: If you do it for all the loads and stores, it can slow things down even more. Here we instrument only these specific accesses that we know are involved in races; all the other memory accesses run without instrumentation, because they do not involve any races. For these two accesses, you get some small slowdown there, but because it's only a few memory accesses, we get away with a smaller overhead. >>: So there's one other idea. What you could do is essentially use one of the existing static data race detectors. >> Junfeng Yang: Exactly. >>: To find those accesses that [inaudible] in data races. >> Junfeng Yang: Right. >>: Even if the race set at that particular program point is empty, because you're instrumenting, you're going to pay the cost for the instrumentation. >>: That's a good observation. >>: I think that might be addressing this. >> Junfeng Yang: Actually, you guys have good static race detectors here, I think. So this is an idea we should pursue in the future: we can do static race detection, take those points, and schedule them together with the synchronization operations, and therefore get deterministic multithreading that way as well. But when we ran a race detector on these programs, it had a lot of false positives — for each function you get maybe a hundred reports. Do you guys have good static race detectors we can leverage, that we can use? >>: [inaudible] automatically come to me I think. But when there's annotation — >> Junfeng Yang: With annotations. >>: And, yes, and things like that. >> Junfeng Yang: I see. Maybe you can be clever there. It would be interesting; we're not using such a race detector right now.
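(Editorial aside: returning to the earlier question about enforcing the memory access part of the schedule, a minimal sketch of enforcing one order edge with a semaphore around the two racy accesses; the rest of the execution runs uninstrumented. The mechanism and names are simplified assumptions.)

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t edge;               /* 0 until the first ordered access is done */
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static int result;

    static void *worker(void *arg) {
        pthread_mutex_lock(&mu);
        result += 1;                 /* the access the schedule orders first */
        pthread_mutex_unlock(&mu);
        sem_post(&edge);             /* instrumentation: release the edge    */
        return NULL;
    }

    int main(void) {
        sem_init(&edge, 0, 0);
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sem_wait(&edge);             /* instrumentation: wait for the edge   */
        result += 2;                 /* the racy access ordered second       */
        printf("result = %d\n", result);
        pthread_join(t, NULL);
        return 0;
    }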
Another direction we're looking at is not just C programs but Java programs. There's a race detector developed by Aiken [phonetic] and his students, I think, that has good precision, and I think it's sound [indiscernible]. We're looking at whether or not we can actually do this for Java with that powerful detector. If we can get the equivalent to work for C or C++, that would be a great thing to do. >>: Why do you need to also order the locks and unlocks, the synchronization? Can't you just do memory? You'd be doing pretty much the same thing, right? >> Junfeng Yang: So if you just do memory, you can get determinism; we know that. It enforces a deterministic access order for shared memory. But at the same time you pay overhead, because you have the instrumentation overhead on each load and store, and the wait overhead: you cannot perform a memory access if all the accesses ordered before it are not done yet. So there's a wait overhead there as well, and the overhead can be up to 10 or 11 times, based on previous papers. >>: But this lock and unlock — >> Junfeng Yang: This one? Okay? >>: It has the same overhead, because if that lock comes second, you have to wait until the unlock is done first. So if you order the result accesses between the lock and unlock on the left and on the right, then you won't need to order the unlock and lock. >> Junfeng Yang: Okay. So, first of all, there are two questions here. The first question is the overhead of instrumenting and controlling all the synchronization operations. And the second is: if we do this, enforce an order on the memory accesses, do we get roughly the same overhead? For the first question, intercepting synchronization operations tends to be cheap, cheaper, because these are already library calls; that's easily done. Whereas for load and store operations, for each load and store, even in optimized code, we would do a function call. >>: But you're already doing that, right? You're already doing that for the first one, and then — >> Junfeng Yang: Right. For the first run we do that, but for the future runs we don't, so we can avoid that overhead for the future runs; otherwise the overhead is as much as the existing tools that enforce memory access orders. So that part does not give you a lot of overhead. Second, for these two particular accesses, if we enforce this constraint, you probably get roughly the same overhead as enforcing this constraint. But at the same time there are lots of other memory accesses here as well, and how do you know whether or not you want to enforce orders for those memory accesses? If you do enforce orders, you get slowdown; if you do not — >>: You knew that between that lock and unlock it was all ordered. Can't you figure the rest of them out the same way? >> Junfeng Yang: So here we know there are no races between these two pieces of code, and therefore we can slice them out. And you're suggesting that for this particular access, this result access, we would just add this edge instead of adding this edge, right? But these two accesses do not race: if we use your algorithm, they do not race, therefore we would not add an edge here, and therefore we can get nondeterministic synchronization. >>: But you're already adding the lock and unlock edge, which means there is a race on the lock if that edge didn't exist, right? >> Junfeng Yang: We add this edge, right, actually to resolve the race on the lock.
So this guy grabs one lock, and this guy grabs the same lock. They can grab the lock in different orders and therefore get different results, so we add this edge to say, okay, this one always gets the lock first; therefore you do not have this nondeterministic lock contention. It doesn't involve these two accesses. >>: I agree, but what I'm trying to say is that what you're doing there is fundamentally, basically, having an edge on the memory location of the lock itself, right? Kind of like — so if you didn't have lock and unlock and you had custom synchronization, you would have to do the same for the lock memory and have one of these arrows between them. >> Junfeng Yang: Maybe a better way to look at this would be the source of the nondeterminism: the nondeterministic lock operations translate into nondeterministic access orders on memory accesses. And maybe another way to look at it is that here the example shows just one access, but you can have a large critical region with lots of accesses. You could ignore the synchronization operations and enforce orders on the memory accesses directly, or you can just enforce the lock edges and thereby get better performance. Right now we just look at the synchronization and enforce this synchronization order, which automatically covers all the memory accesses within the critical region; therefore it's going to be more efficient. So our system does have limitations, as some folks just pointed out. First of all, the assumption is that we can actually reuse schedules, which may not always be possible. For example, if a program actually wants nondeterminism in its synchronization to get its results, those workloads or programs will not fit our approach. Also, by enforcing a deterministic order on operations — memory accesses or synchronization operations — we sometimes introduce delays; if your program is latency sensitive, that's not a good workload for us. We also need to keep track of constraints in order to figure out the preconditions, and our current constraint handling deals with integers; if you have floating point operations and the schedule actually depends on floating point values, we cannot handle that. It also requires source code, or at least the LLVM compiler intermediate representation, the bitcode; that's another limitation. And currently we do not handle nondeterminism that comes from gettimeofday or malloc: if you malloc something, you get back a memory address, and you do some nondeterministic synchronization based on the return address of malloc, we cannot handle that. This can probably be handled using some existing techniques; for example, there are heap implementations that are actually deterministic. We also work with only a single process: this doesn't work if you have multiple processes sending messages to each other, although we're planning on making this approach work for MPI programs as well, programs that communicate by sending and receiving messages. >>: You also have the [inaudible]. >> Junfeng Yang: Right now, no. There's no determinism for system calls; you can have races inside the kernel, possibly. So there are lots of applications we can build on top of this idea. At the basic level, you can simplify program understanding, testing and debugging: enforce the same schedules you tested in production. You can also use this for deterministic replication of multithreaded programs.
One replica runs and has some nondeterminism; instead of logging and shipping the nondeterminism, if the input hits the schedule cache, the replicas use the same schedule, and therefore execution is deterministic. You can also report bugs without revealing the full inputs, which may contain credit card numbers. And once you have the schedule, you know what the program is going to do in the future, so you can do some scheduling tricks there as well. A very interesting application we're currently exploring is using this idea to build precise static analyzers for multithreaded programs; I'll briefly talk about this idea. When you analyze multithreaded programs, you can use dynamic analysis, which analyzes the schedules as they occur; that's basically unsound, because the next run of the program may use a schedule that was not analyzed. Or you can do static analysis over all schedules, but that requires a lot of approximations, and sometimes you get imprecise results — with race detection, for example, you can get tons of false positives. With Peregrine, you can actually address these issues to some degree. We can analyze a program with respect to only a small set of schedules that will be enforced at runtime, and therefore we get precision: we don't assume all possible schedules, just a small set of schedules. And we can enforce these analyzed schedules at runtime using Peregrine, and therefore we can guarantee soundness, at least for these schedules, the schedules that we analyzed. If there are new schedules required at runtime, because we get new inputs that cannot be covered by these schedules, we need to learn the new schedules, and for these rare cases we can use some expensive techniques to check the program for errors, for example. And given that we can frequently reuse schedules — most of the inputs, 90 percent of the inputs, can hit the schedule cache — we can actually guarantee precision and soundness for 90 percent of the inputs, which I think is a good contribution. By the way, this is ongoing work: we have the idea, we have a preliminary study and some results, but it's not published yet. Okay. So one challenge here is how you actually analyze a program with respect to just a small set of schedules; when you do static analysis, you don't have this notion of schedules. One approach is to modify every static analysis to take the schedule into account along with the program. But that's kind of strenuous, because you would have to modify a lot of analyses, and also, when we build a tool, we often pull a bunch of analyses together, and if there's imprecision in one of the analyses that does not consider the schedule, that imprecision may propagate to the other analyses. So there are issues there. That's the naive solution I just talked about — modify the analyses — and it's not very practical. Our solution is: take the program, take the schedule, and specialize the program toward the schedule, so that at runtime the specialized program generates executions matching the schedule. You transform the program to get a simpler program. This transformation, this specialization process, involves specializing the control flow plus the data flow of the program. For example, look at the data flow.
We only look at the def-use chains allowed by the synchronization schedule, the schedule we enforce, when we define uses here. You can have other loads and stores on the same memory address, but if the schedule precludes them, the data cannot flow, because when you enforce certain schedules, you can ignore those def-use flows. And once we get this specialized program, which is a lot simpler than the original program and takes the schedule into account, we can apply standard static analyses to the specialized program. >>: Do you think of the specialized program as an annotated program, a program with annotations, or do you think that some statements get dropped or sliced away? >> Junfeng Yang: Some statements get dropped, some loops get unrolled, and some variables become constants. For example, if the schedule dictates two threads and the program uses a loop to create threads, we can say, okay, this loop bound must be two, because the schedule only contains two threads, and we can propagate this value to all the other places where the value is referred to. So you can regard it as a simplified program, not just slicing out statements. >>: What you're doing is really optimization. You're optimizing the program, rewriting the code to make it run faster. >> Junfeng Yang: I'm not sure about that part; we're sure about the simplification part. >>: You're making it run faster. You said that in the examples. >>: And the same thing. >> Junfeng Yang: What? >>: This part is — >> Junfeng Yang: This part is an application of the technique. You can take the program, do the simplification, remove stuff, and it may run faster, but we do not have experimental data to back that up. We believe we can do optimization. >>: That sounds like an interesting — >> Junfeng Yang: We'll definitely pursue that direction, but right now our results are mostly on the precision part. Once you throw away a lot of the cruft and specialize the program based on the schedule, you can do really precise analysis. For example, you get a form of thread sensitivity automatically: because the schedule says two threads, we're going to clone the thread function two times. >>: How do you bake the schedule into the program? >> Junfeng Yang: Okay. Let's look at the control flow part. The schedule orders the synchronization operations. Looking at the control flow from one synchronization operation to the next, we strengthen the program so that the statements corresponding to consecutive synchronization events become control equivalent, and the other paths bypassing this synchronization get chopped off, because if you ever go down those paths you won't reach the second synchronization operation. And if there's a loop containing a synchronization operation, and in my schedule I see this operation appear two times, then I know the loop must run twice, and I can actually unroll the loop. That's how we specialize the control flow. Okay? >>: It's funny to me, because what you're doing really is the techniques we use for optimization — loop unrolling and so on and so forth — but your goal is simplifying the analysis. >> Junfeng Yang: Right. Right. To get more precision. Question? >>: Can I ask — the specialized program still has threads, right? >> Junfeng Yang: Yes, it still has threads. >>: And does it — with the semantics of threads, usually when you have a multithreaded program, the semantics of the multithreaded program is that threads can be context switched, statically at least, at arbitrary places. >> Junfeng Yang: Right.
>>: The specialized program, does it continue to have the same semantics? >> Junfeng Yang: Yes. So, okay, we also need the dynamic part; maybe this is what you're getting at. The synchronization operations in my specialized program map one-to-one to my schedule, and at runtime I need some technique to enforce the order. That's why I say we actually need to enforce the schedules using Peregrine. Is that what you're getting at? >>: Yes, that is what I'm getting at. Because then the static analysis needs to be [inaudible], because if you want the static analysis to focus on those schedules, the static analysis needs to be aware of these order edges, right? >> Junfeng Yang: Sure, sure. This is how we actually get around the problem. For the data flow part, we actually build an alias analysis that considers where data flows based on the schedule, and then if you have another analysis that wants to query alias information, it just queries our alias analysis. Therefore the alias analysis is made schedule aware, and if your analysis depends on the alias analysis, your analysis does not have to be made schedule aware itself. >>: I see. >>: Okay. So you're saying that there's some basic value flow or data flow analysis that you provide [inaudible] that holds the order information, and you just query that analysis as a black box. >> Junfeng Yang: We have some initial results. For example, we looked at the alias analysis results: if we do not consider the schedule, we get this many alias pairs; with the schedule, 99 percent of them actually get pruned. The increase in precision is pretty big. And we haven't built a race detector based on this technique yet, but by running the analysis and looking at the results, we actually detected, I think, at least five or seven real races in programs that have been analyzed by a lot of previous systems — you see a lot of papers on these programs — but they did not report these races; they were previously unknown. I'm very excited about this direction that we're pursuing. So how much time do I have? Good? I guess I should start wrapping up. You've seen the summary of results, so I'll skip the evaluation. Do you want the evaluation? Okay. >>: I wanted — >> Junfeng Yang: Might want to — >>: [inaudible]. >> Junfeng Yang: Overhead? >>: One of the things I'm worried about, not just [inaudible] but in general with the whole deterministic multithreading story: you pick the sync schedule — let's say you run your program on a four-core machine and you find a schedule, then you try to run the same program on the same input on an eight-core machine, and you try to force the same synchronization schedule. I would like to know what the overheads are. >>: Let me ask the more general question: programs are nondeterministic for a reason, because then they can respond to environment changes, the presence of more cores — suddenly more cores become available because another process finished, et cetera. They're nondeterministic for a reason: to use all that computing capability. So if you're restricting that by adding some determinism, I would expect some performance loss. >> Junfeng Yang: Right. There is some overhead for determinism, particularly with the worst-case workload that you mentioned: you get the schedule on a four-core machine and then run it on an eight-core machine.
The overhead would be large, because the schedule recorded on the four-core machine probably doesn't map really well to the eight-core machine. But for the scientific computation benchmarks, normally what they do is: if you have four cores, use four threads; if you have eight cores, use eight threads. So this number is actually part of your preconditions; it's part of your schedule. So on eight cores, for some scientific benchmarks, we would automatically find and use the efficient schedule in that sense. >>: Those programs are easier, because the number of processors, the number of cores, is part of the input. >> Junfeng Yang: Right, part of the input. >>: If you look at more dynamic scheduling algorithms, task libraries, what they're doing is running Apache at the same time as, say, Internet Explorer. >> Junfeng Yang: Right. >>: And Internet Explorer is using two cores and Apache is using six cores, and when Internet Explorer finishes, Apache grows to use those two cores. So there's a lot of dynamic [inaudible] in these systems, which are designed precisely to make use of this additional computation [inaudible], so I'd like to know what — >>: My impression is that's always been the weak point of the existing deterministic multithreading story, and it's been swept under the rug from day one. >>: I want to get — since — you know — >>: Experiments-wise, what did you find? >> Junfeng Yang: In our experiments, those are mostly scientific programs, and these programs normally run using the full set of cores, and for Apache we basically also configured Apache to use all the cores. So we did not really evaluate that [inaudible] use case. My impression is that if you do those experiments — you run a bunch of competing processes together and you measure performance for some of the processes within the system — you may get higher overhead. But I can see two ways to address this problem; I do not know whether these ideas will work. One idea is that if the workload is actually predetermined, like this kind of computation, you can model the processes and the numbers of threads together, treat them as one bigger deterministic execution framework, and there you can capture the constraints and use this approach. That's one way. Another way to address this would be: instead of just one schedule for many inputs, we can have a few schedules for those many inputs, and at runtime the workload becomes a factor in choosing which schedule to use, and that can give us efficiency. So in some sense, do we want one-to-one? Probably not: if you change the input, you get really different schedules. Do we want many-to-one? Maybe. But we can also have many-to-N — a small number of schedules to choose from at runtime to get performance. That may be the way to go. I haven't tried these ideas yet. Right. Do you want to see more results? These are the overheads. For some of them we get speedups, because they use barriers or sleeps which can be avoided by us, and some of them get bigger overhead. >>: What's the overhead if you don't — if you don't get [inaudible]?
>> Junfeng Yang: So that's the recording overhead. Our current recorder is something like [inaudible]x, which is pretty slow because it's not very optimized. There are lots of papers on how to do efficient recording; we just haven't had the time to implement that yet. So it can be really large. If you look at previous papers, like the recorder developed by the MSR guys, it's something like 10 percent to record the full execution stream, the load and store instructions. >>: For Apache, is it the whole Apache server? >> Junfeng Yang: The core Apache server. We haven't tried PHP, but I think PHP won't matter much, because a single worker thread goes into the PHP module, interprets the PHP script, and comes back -- it's a single thread going in and coming back, so the schedule just covers that thread's work within the PHP module. >>: I'm a little bit troubled, because you could just forget the whole story about determinism and say: here's an automatic method to speed up programs. It works fully automatically, and on this set of programs it actually gets better performance. End of story -- no determinism needed. >> Junfeng Yang: So maybe that's a better way to spin the work, right? We started this work focusing on reliability: avoid bugs -- [inaudible] we remember schedules and avoid bugs. But right now I think the static-analysis framework is what we're really excited about, and one of my fantastic students is working on that. Down the line we want to do the optimization story as well; I think that will work well too. >>: [inaudible] if I look at -- the improvement -- >> Junfeng Yang: Maybe the programmer did not reason about it in the smartest way, because there's heavy synchronization, right? It's arguable -- >>: [inaudible]. >>: You don't have to be that smart. [laughter]. >>: Maybe these programs in particular aren't skillfully written, and so this sort of improves them by figuring out a better schedule than -- >> Junfeng Yang: Right, right. >>: And the wonderful thing about working on optimization is that you don't ever have to guarantee anything, right? >> Junfeng Yang: I think it does have guaranteed semantics; it's correct. >>: You just have to make sure you don't change the semantics, something like that. >> Junfeng Yang: Right. As you were saying -- >>: If there's some race there, let it be -- it was in the program anyway. >> Junfeng Yang: That's actually maybe a better direction, a good direction for us to go in. >>: But the original sync schedule -- [inaudible] the optimization you're getting is the sync schedule plus your optimization, right? >> Junfeng Yang: The sync schedule gives us most of the optimization, right. >>: But [inaudible] as implemented does not give you the speedup; you guys paid some [inaudible] with the [inaudible]. >> Junfeng Yang: Right. The optimization here mostly comes from the synchronization optimization. Right now we do this specialization and get a program as the result, but when we run, we actually still run the original program. We haven't tried, as you're suggesting, actually running the specialized program and also ignoring the races -- dropping the analysis part that deals with races -- because, as [inaudible] suggested, the races were in the program anyway, and by ignoring them you could get much better performance. A great direction. >>: Do you know what percentage of memory accesses [inaudible] like when you are [inaudible]? >> Junfeng Yang: Actually, I have this number. Okay, so these are the races, basically.
So these are the programs that contain races: the two real applications have races, and these [inaudible] benchmarks, from the PARSEC suite, contain races as well. When we run them, the number of racy memory accesses can be fairly large -- [inaudible] or even more; I don't know the exact number, it's in the paper. The races we have to order explicitly, the ones we add back on top of the synchronization schedule, are pretty small in number. For some of these only a few races are detected, for [inaudible] similar things happen, and for the others no races are detected at all. Most programs you run do not actually have lots of races -- that's the general intuition. Any other questions? Okay, I guess I'll just conclude. So I talked about schedule memoization: memoizing past schedules and reusing them on future inputs, so we can reuse the schedules that worked and avoid buggy schedules. I talked about the idea of hybrid schedules, which basically combines the benefits of sync schedules and memory-access schedules, and, as Madan suggested, we may be able to implement this idea using recent work that we're currently looking at. And I also talked about the Peregrine system, which computes schedules by taking a trace and relaxing it into hybrid schedules. It makes all these programs deterministic and efficient, and it can reuse memoized schedules. Okay, that's all. [applause]