>> Shaz Qadeer: Let’s get started. It’s my great pleasure to welcome once again Professor Alastair Donaldson to Microsoft Research. He has been visiting us off and on for several years now. In fact just before he joined Imperial College London he spent two months with us working on problems related to verification of GPU kernels. Since then he has done many, many more pieces of work on that topic and other topics at Imperial with his students. Today he’s going to tell us about some new work unrelated to GPU verification I think, yeah. >> Alastair Donaldson: Thank you Shaz for the great introduction. I’m really pleased to hear I’d been promoted to Professor, which I… [laughter] >> Shaz Qadeer: What do they call you over there? >> Alastair Donaldson: I’m a Lecturer. >> Shaz Qadeer: Which? >> Alastair Donaldson: Which is kind of like Assistant Professor. >> Shaz Qadeer: I see. >> Alastair Donaldson: I’ve heard in the U.S. lecturer is really not particularly prestigious, is that true? >> Shaz Qadeer: I don’t think there’s a lecturer, Alastair. >> Alastair Donaldson: They’re talking about changing Imperial to use the U.S. names. >> Shaz Qadeer: I see. >> Alastair Donaldson: Which maybe is a good thing, but we have this thing called Reader which is like Associate Professor. >> Shaz Qadeer: I see. >> Alastair Donaldson: I think that’s such a cool thing. I would love to be a Reader some time. I hope they make the changes after I become Reader so I can be a Reader. [laughter] Okay, so the work I’m going to present is joint work with my Ph.D. student Paul Thomson, and also my Post Doc Adam Betts. But the work has really been led by Paul. Paul has spent a huge amount of time on this study. It’s an empirical study of systematic concurrency testing methods based on schedule bounding. I’ll explain shortly what those methods are. Many of them were developed here at Microsoft Research. The background is that in Paul’s PhD he’s interested in looking at advanced techniques for doing systematic concurrency testing, looking at new algorithms and heuristics for bug finding. Doing this quite practical work requires significant empirical evaluation to make sense of whether the techniques are working or not. Paul spent a huge amount of time building up a set of benchmarks. This benchmark gathering is very, very time consuming. It involves a huge amount of time spent messing around with makefiles, getting things to build on a certain version of Linux, trying to then remodel parts of applications so that they’re amenable to the testing method under consideration; really a huge amount of work is involved in this. We had the idea that we would like to, I guess, get some more money’s worth for our effort. Or Paul should get more money’s worth for his effort, before starting to really look at brand new techniques and evaluating them. Why not take the existing techniques that we have read about and that we have been inspired by, and try to do a very objective evaluation of those techniques on the concurrency benchmarks that are open source, that people are using in prior work and in related work on concurrency testing. I think in the end this led to a pretty interesting study. We had a paper this year at the Principles and Practice of Parallel Programming conference, which Paul presented. I was delighted that Paul won the Best Student Paper Award for this work. The study is completely reproducible; if you search for the study online you’ll find our webpage. 
There’s a virtual machine where you can get all of the benchmarks, all of the tools, and there are scripts there so you can rerun the experiments. We hope that this could be useful to researchers in evaluating their methods. The motivation for systematic concurrency testing is that, as we all know, concurrency bugs are horrible because a concurrency bug may manifest non-deterministically, rarely, and may be hard to reproduce. The key thing is that these bugs are dependent on the schedule of threads. By a concurrency bug I specifically mean a bug that may or may not manifest according to the way threads are scheduled. A bug that would always occur would not, to my mind, be a concurrency bug even if it’s in a concurrent program. I would say a bug is a concurrency bug if whether or not it manifests depends upon the interleaving of threads. In our study we consider crashes and assertion failures to be bugs. We don’t consider data races to be bugs. I’ll come back to that point later on. We’re talking about a concurrent program that runs until either it crashes, say with a segmentation fault, or some assertion fails. The assertions are either already there in our benchmarks, or we’ve added these assertions because the benchmarks perhaps contained output checking code which we then replaced with assertions. Systematic concurrency testing is a pretty simple idea in principle. The idea is you have a concurrent program and a fixed input to that program, so one test input. Furthermore the concurrent program is assumed to be deterministic. The program should not exhibit randomization. The program should not be doing things like reading from the network and getting data values that are not inside the program. It should be a closed program. There are methods for coping with non-determinism by modeling and systematically exploring non-determinism. We didn’t look at that in this work. In this case we’re talking about a deterministic concurrent program, with the exception of the thread scheduler which of course is non-deterministic. Having this fixed input program, the OS scheduler would usually be responsible for scheduling the threads of this program. A systematic concurrency testing tool, or SCT tool, sits in between the OS scheduler and the program. It takes control of scheduling and determines the order in which threads are scheduled. This means that it’s possible to repeatedly execute the program controlling the schedules that are explored, to potentially enumerate thread schedules. If the program is guaranteed to terminate for any thread schedule then in theory it’s possible to enumerate all of the schedules of the program on this input. Of course in practice for significantly sized programs this is not feasible; there would be a vast space of schedules. While every schedule would be considered in the limit, in practice the idea here is to try to find bugs in the program through the systematic method. There are a number of tools that have implemented systematic concurrency testing. I would say the two best known tools are Verisoft and CHESS. Verisoft was developed by Patrice Godefroid when he was at Bell Labs. He’s now at MSR. The CHESS tool was developed by colleagues at MSR here and I guess some of the interns. Yeah, I think that both of these tools have had quite some impact in finding bugs in real world concurrent programs. The basic idea of SCT then is, if we consider the space of schedules as a tree, from some initial state a thread runs, executes an instruction, executes another instruction. 
Then we get to a point where there are several options of which thread could be scheduled next. Here t one, t two, or t three could be scheduled. The systematic concurrency tester makes a decision about which thread to schedule, in this case t one. Then t two or t three could be scheduled, so t one is now blocked. The tool considers t two and then this leads to termination. This is a terminal schedule, and then these dotted circles are unexplored schedules. These are schedule prefixes which demand more exploration. Okay, so this terminal schedule we can refer to as t one, t one, t one, t two. Because we have a fixed input and the only non-determinism comes from the scheduler, the sequence of thread IDs precisely characterizes the states reached during this schedule. Then we have these unexplored schedule prefixes, so t one, t one, t one, t three, which is where we get down to the bottom left, where we get to this node here, okay, and then t one, t one, t two, and t one, t one, t three; these can be explored in future executions. Then we might look at this execution next, which would then give rise to, you know, we would now have two terminal schedules explored, and then a bunch more unexplored schedules. The really good thing about systematic concurrency testing is that it’s relatively easy to apply to real programs. What you have to do is essentially make a concurrency unit test for your program. That may not be trivial. In our study we actually devoted quite some attention to discussing the challenges associated with doing that. But if you can get this concurrency unit test you can then run SCT fully automatically. There’s no need for any sort of static analysis or invariants or anything like that. You just run, and if you do find a bug you can then reproduce that bug to your heart’s content in order to debug the problem. You don’t get any false alarms because you’re really executing the program. If it’s possible to execute the program in all schedules up to some bound, and I’ll talk later in the presentation about schedule bounds, then you can get a bounded guarantee of the program’s correctness on this test input. That bounded guarantee may be useful. The problem is, though, that concurrency bugs may still be very hard to find because the schedule space is so vast. I presume this is all making sense so far? Yep, okay. There are a couple of standard optimizations which you could do. The first is reducing the scheduling points to visible operations. This was something that Verisoft (sorry, not the separation logic VeriFast tool) did from the start. You schedule only at operations that could be visible to other threads: shared memory accesses, lock, unlock operations, etcetera. The observation being that invisible operations cannot influence other threads until a visible operation occurs. The CHESS tool schedules only at synchronization operations. Rather than scheduling at every read and write, you schedule only at thread create, thread join, lock, and unlock. If you guarantee detecting data races and flagging them up as bugs then you’re guaranteed not to miss any bugs if you employ this sort of reduction. Both of these are forms of partial order reduction. Then there’s a method called dynamic partial order reduction from Flanagan and Godefroid in POPL two thousand and five, which reduces search based on a happens-before relation and on detecting conflicts during execution. These are all appealing reduction methods because they’re sound. 
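To make the enumeration idea above concrete before moving on, here is a minimal sketch in Python, purely for illustration (the actual tools instrument compiled C/C++ programs): a depth-first enumeration of schedules, where a schedule is just the sequence of thread IDs chosen at the scheduling points. The execute callback is an assumed hook standing in for a tool’s replay machinery, not any real API: it re-runs the deterministic, fixed-input program, follows the given prefix of thread choices, defaults to the first enabled thread afterwards, and reports the full schedule taken, the enabled threads at each step, and whether a crash or assertion failure occurred.

    from collections import deque

    def sct_dfs(execute, max_schedules=10000):
        # Stateless systematic search over a deterministic, fixed-input program.
        # `execute(prefix)` (assumed hook) replays the program: it follows the
        # thread IDs in `prefix`, then always picks the first enabled thread,
        # and returns (schedule, enabled_per_step, found_bug).
        worklist = deque([[]])          # unexplored schedule prefixes
        explored = 0
        while worklist and explored < max_schedules:
            prefix = worklist.pop()     # pop from the end: depth-first flavour
            schedule, enabled, found_bug = execute(prefix)
            explored += 1
            if found_bug:
                return schedule         # a replayable counterexample
            # For every choice made beyond the prefix, queue the sibling
            # prefixes (the "dotted circles" in the schedule tree).
            for i in range(len(prefix), len(schedule)):
                for t in enabled[i]:
                    if t != schedule[i]:
                        worklist.append(schedule[:i] + [t])
        return None                     # no bug within the schedule budget

Because the program is deterministic apart from scheduling, returning the thread-ID sequence of a buggy run is enough to reproduce the bug on demand.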
In a sense these reductions don’t miss any bugs, okay. However, if we’re willing to potentially miss bugs then we can do something more drastic but potentially much simpler and more useful, which is schedule bounding. The idea is as follows. There is a hypothesis that realistic concurrency bugs don’t require too many context switches to manifest. Of course we could sit down together right now and we could write a concurrent program that will only crash if seventeen threads interleave in one particular order, right. But no programs like that actually exist, okay. I could be convinced that there may be programs that require say six or seven interleavings in some strange order. But this seems to be rare. Most concurrency bugs appear to be exposable using only a small number of context switches. This motivates the idea of restricting search to only schedules that do a certain bounded number of context switches. This can drastically reduce the schedule space. If this hypothesis about concurrency bugs is true it can still be useful in finding bugs and potentially can provide a bounded guarantee. It may be feasible to explore all schedules that involve up to say three context switches. If you can prove there are no bugs up to this depth that gives some confidence in the correctness of the program on this concurrency test. You might even argue that knowing that any bug would require more than three context switches gives you some feeling for the low probability of such a bug occurring in practice. The idea of schedule bounding, and there are two key methods, preemption bounding and delay bounding, which I’ll come to in a minute, is as follows. We would explore, in the space of all schedules, potentially all schedules involving zero preemptions. This may be a very small set. A superset would be all schedules involving up to one preemption, or up to two preemptions. Then in the limit, if we carried on exploring schedules with more and more preemptions, we would in theory explore the whole space. Preemption bounding to my knowledge was first proposed by Musuvathi and Qadeer in PLDI two thousand seven. Delay bounding was proposed more recently by Michael Emmi, Shaz Qadeer, and Zvonimir Rakamaric in POPL twenty eleven. I’ll talk a little bit about preemption bounding and delay bounding. Then I’ll get onto the empirical study itself. In this diagram I’m illustrating the difference between a context switch that is forced versus unforced. In red, a schedule has used zero preemptions, and in yellow, one preemption. If you look here, thread one executes and then thread one, thread two, and thread three are enabled. Okay, so if thread one continues to execute there’s been no preemption. There has not been an unforced context switch. However, if control switches to thread two or thread three then there’s been one preemption. This schedule has cost one preemption. On the other hand, if thread one was blocked at this point then it would cost no preemptions to switch to either thread two or thread three, because there’s no choice. It’s not possible to continue execution of thread one. These are unforced context switches. If we look at this slightly more complicated example we can see for instance here this red path is a schedule with zero preemptions. This yellow path or this yellow path are schedules with one preemption. On this path that ends up blue there are two preemptions. Any questions regarding this? Okay, delay bounding is, I guess, slightly less obvious than preemption bounding. Let me try and explain it. 
The idea of delay bounding, I’ll give you the idea first and then I will try to give you some intuition for why it could be useful. The idea is to fix a deterministic scheduler. For example a round robin non-preemptive scheduler, but it can be any deterministic scheduler. If you run a fixed input deterministic test case with such a scheduler then there will be one schedule, right. Okay, so the idea of delay bounding is that during systematic testing we use this scheduler but we can deviate from the schedule by skipping over a thread at the cost of one delay. In the study, as in prior work on CHESS, we consider delay bounding with respect to a non-preemptive round robin scheduler. Okay, so let me try and illustrate further how delay bounding works and then I’ll talk briefly about the intuition. Suppose we have four threads. Initially only thread one is enabled and we’re using this round robin scheduler. Thread one executes until it becomes blocked. Even if threads two, three, four become enabled, thread one carries on executing. If at some point thread one becomes blocked then we go to thread two. Okay, if threads two and three become blocked we go to four. If threads four and one become blocked we go to two. This is the round robin scheduler. But to illustrate delay bounding, at this point suppose we’ve got the situation where thread two is executing and it’s enabled. Then at a cost of no delays it can carry on executing. That’s the default thing the scheduler would do. At a cost of one delay we can skip to the next thread. We do an unforced preemption to thread four. Note it doesn’t cost anything to skip over thread three because thread three’s disabled. Okay, and at a cost of two delays we would skip thread four as well (thread three is again passed over for free because it’s disabled) and go on to thread one. Okay, so this illustrates expending zero delays, one delay, and two delays. Here is the schedule tree. Suppose we’ve got to this point having expended no delays. We can carry on having spent no delays by continuing with thread two, or we can switch to thread four at a cost of one delay, or thread one at a cost of two delays. We now may consider any schedule that uses up to D delays for some small number D. Let me try to give you the intuition for why delay bounding can be effective in practice. I have to confess that I don’t have a terribly strong intuition. Our study confirms that it does work better than preemption bounding. My intuition for why is not all that strong, so maybe Shaz can comment as well. But my intuition is that preemption bounding can be useful because it drastically reduces the schedule space, but it does consider the possibility where at some non-forced point a thread yields to another thread. The problem is with preemption bounding the thread yields to any other thread. Then you get this explosion of possible threads. But it doesn’t seem necessary to consider all possible threads that could be yielded to. The key thing is to stop this thread executing and let someone else have a go. With delay bounding we are deterministically picking who gets to go. But not completely deterministically, because we have this costly non-determinism, so if we want to deviate quite far from the schedule this costs something. Okay, so if we use a small delay bound we’re not going to by default consider these rather costly sources of non-determinism. 
We’re going to consider the cheapest, which is just to stay with the current thread, or something slightly more expensive, which is to skip on by one thread or two threads. Yeah Tom? >>: I thought the other intuition was to deal with huge numbers of threads. This goes to the state space explosion. If I have a preemption where I have a hundred enabled threads, right, then I have many, many choices as to who goes next. >> Alastair Donaldson: Right. >>: If most of those threads are symmetric it certainly really doesn’t matter. But with preemption bounding I sort of treat them as if they’re all unique. With delay bounding I don’t suffer that problem. I think I can deal with huge numbers of threads. There will just be some threads that don’t get [indiscernible]. >> Alastair Donaldson: Well I mean absolutely, the idea of the large number of threads leading to a schedule explosion, and that going away with delay bounding, to me that completely makes sense. I guess the thing you’re adding there is saying that if these threads are kind of roughly the same, there’s nothing especially interesting about all of them, then it really would not be particularly effective to consider all these possibilities. >>: Right. >> Alastair Donaldson: Okay. >>: Yeah, well even if they’re not symmetric though. I mean if thread four can do something bad to thread one it may not matter that thread two intervenes… >> Alastair Donaldson: Right, you’re going to get to thread four eventually. The key thing is getting away from thread one at a certain point. >>: Right. >> Alastair Donaldson: Letting thread four do its bad thing, yeah. Okay. >>: One more quick… >> Alastair Donaldson: Yeah. >>: I think this can also be a function of the density of the dependencies [indiscernible]. If the dependencies are very sparse, like in message passing, sort of wildcard choices make the difference. That might be another knob to turn. >> Alastair Donaldson: Okay. >>: The examples where the dependence is very, very sparse, and then keep increasing the coupling. >> Alastair Donaldson: You think that if there’s sparse dependency then delay bounding is likely to work well? >>: No, no, the outcome would be good to [indiscernible] that numbers address. >> Alastair Donaldson: Okay, yeah [indiscernible]. We can apply these methods iteratively. The idea is, rather than necessarily saying we’re doing delay bounded search with a delay bound of three, we can actually just say we’ve got an hour, or we’ve got a hundred thousand schedules. We’re going to try all schedules with a delay bound of zero. Then if we manage to finish all of those, and there will only be one of them, so we will, then we will try all schedules with a delay bound of one. There will be many more of those. Then all schedules with a delay bound of up to two, then all with a delay bound of up to three, until we either exhaustively explore, or we run out of time, or we reach some agreed number of schedules. I use IPB for iterative preemption bounding and IDB for iterative delay bounding. The claims of prior work are as follows. A low schedule bound is effective at finding bugs. Then, schedule bounding provides benefit over more naïve systematic techniques, like just a straightforward depth-first search of the schedules. And delay bounding is better than preemption bounding in the sense that it’s faster at finding bugs, because you have a smaller schedule space to search and you can find these bugs in that smaller space. 
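Since the two cost notions are easy to mix up, here is a small sketch of both rules as just described, again in illustrative Python rather than any tool’s actual code: a context switch counts as a preemption only when the thread that was running is still enabled (an unforced switch), and under delay bounding with a non-preemptive round robin base scheduler, each enabled thread skipped over costs one delay, while disabled threads are skipped for free.

    def preemption_cost(schedule, enabled_per_step):
        # Number of preemptions (unforced context switches) in a schedule,
        # given the set of enabled threads at each scheduling point.
        cost = 0
        for i in range(1, len(schedule)):
            prev, cur = schedule[i - 1], schedule[i]
            if cur != prev and prev in enabled_per_step[i]:
                cost += 1           # prev was still runnable: unforced switch
        return cost

    def delay_choices(current, enabled, num_threads):
        # Possible next threads under delay bounding with a non-preemptive
        # round-robin base scheduler, as (thread, delays_spent) pairs.
        choices, delays = [], 0
        for k in range(num_threads):
            t = ((current - 1 + k) % num_threads) + 1   # 1-based thread IDs
            if t in enabled:
                choices.append((t, delays))
                delays += 1         # skipping an enabled thread costs a delay
            # disabled threads are skipped for free
        return choices

    # The four-thread scenario from the talk: thread 2 is running, thread 3 is
    # blocked, threads 1 and 4 are enabled.
    assert delay_choices(current=2, enabled={1, 2, 4}, num_threads=4) == \
           [(2, 0), (4, 1), (1, 2)]

An iterative search, IPB or IDB, then simply explores every schedule whose cost under the relevant rule is at most zero, then at most one, and so on, until a bug is found or the budget of schedules or time runs out.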
What we felt was that prior work is mainly from Microsoft and it uses a lot of non-public benchmarks. The benchmarks sound very interesting, but by and large they’re not publicly available, so it’s hard to independently validate this research. Then there are all these papers about concurrency analysis which use, they have these names of benchmarks you see cropping up again and again. What we thought would be good would be to try to get all of these open source benchmarks, re-implement the algorithms for delay bounding and preemption bounding, and do a study to, A, assess how effective these techniques are with respect to each other and with respect to naïve systematic search, and B, actually assess the benchmarks themselves. These benchmarks that people are using, are they any good? And I’ll come back to that. The experimental method, okay, so what we did was we took this tool called Maple, which is from researchers at the University of Michigan and at Intel. This was published at OOPSLA twenty twelve. The actual paper about Maple is to do with coverage guided hunting for concurrency bugs. That’s not really at all the focus of our study. But Maple was a suitable choice for us because it was independently implemented. It’s not something that we implemented. It supports systematic concurrency testing and it’s open source. What we did was we took Maple and we added support for delay bounding. We studied, when I say we I really mean Paul Thomson, he studied the source code of CHESS quite carefully and tried to make sure things were implemented similarly to CHESS. Then we got hold of all of the buggy concurrency benchmarks that we could find from existing papers that are amenable to systematic concurrency testing. In our paper we go to some length to explain which benchmarks we had to exclude. For instance there were quite a lot of benchmarks that involve GUIs. It’s very difficult to apply systematic concurrency testing to those without doing quite some significant work. Tom? >>: These are C programs, C++? >> Alastair Donaldson: Yeah, these are C++ programs, C++ benchmarks. Okay, and the Maple tool works using the Pin instrumentation framework. >>: Okay. >> Alastair Donaldson: We compile these programs and then we use binary instrumentation to do systematic testing. >>: You’re saying the point of Maple was to use coverage techniques to try to guide which schedule to select next? >> Alastair Donaldson: Yes, so Maple is non-systematic. It controls the scheduler but it uses heuristics to try and find a schedule that’s likely to find bugs. >>: Okay. >> Alastair Donaldson: In our paper, because we could, we did actually compare with Maple. >>: Oh, okay. >> Alastair Donaldson: I’m not going to present that data though because that wasn’t our aim at all in the study. >>: Yeah, okay. >> Alastair Donaldson: We found these fifty-two buggy benchmarks. I mean there were many more benchmarks and we whittled them down to fifty-two which were amenable to systematic concurrency testing with modest effort. These are all public code bases. We’ve amalgamated them with the permission of the various authors. We call this SCTBench, systematic concurrency testing benchmarks. This is now a publicly available benchmark suite. What we did was we looked at three techniques initially: iterative preemption bounding, iterative delay bounding, and naïve depth-first search. We applied every technique to every benchmark. 
We gave every technique ten thousand schedules in which to try to find a bug. We used schedules rather than time because we believe it’s a future-proof metric. Running this kind of experiment takes serious time. We had to do it on a cluster. Furthermore we had to do it on a cluster where many benchmarks were running on the same node using multiple cores. Then timing becomes difficult. Okay, the number of schedules doesn’t become difficult. If people want to compare with our techniques in future they can compare based on the number of schedules, even if they don’t use anything like the same hardware we’re using. Did you have a question, Tom? >>: Yeah, I was wondering why you didn’t consider the concurrency fuzzing approach as well? >> Alastair Donaldson: The concurrency fuzzing approach? >>: Yeah. >> Alastair Donaldson: Because it’s not systematic, so we wanted here to look at systematic methods, stateless model checking. >>: Okay, okay. >> Alastair Donaldson: There’s also a kind of combinatorial explosion of how many things you compare. There are various things we’d like to look at, like PCT for example. But in… >>: Yeah, right. >> Alastair Donaldson: In this study, you know what it’s like, you get closer and closer to actually trying to submit a paper. Eventually you think, right, for this study we’re going to have to just rein things in a bit and… >>: But that’s like you’re just logging that scheduler and just seeing what happens? Yeah, and that would be good for the paper, on checking other data. >> Alastair Donaldson: Yeah, absolutely, I mean we’re not finished with this study. We have a paper on the study. But we’d like to do more. >>: I think the [indiscernible] work that’s open source too. >>: Yes. >>: Right, so you could actually have this completely independent implementation… >>: [indiscernible] >>: [indiscernible] work is, I don’t think it does the systematic concurrency testing. >>: No it doesn’t do systematic and, oh, I see what you’re saying... >>: [indiscernible]. But the scheduler, the PCT scheduler, you know you can use it in a non-systematic way or in a systematic way. >>: Well right, right, yeah I’ve got it, I got it, okay. >> Alastair Donaldson: That’s absolutely on our to-do list of things. Especially given, I’m going to talk a bit about the random scheduler in a minute. We’re definitely going to try PCT given the results we have with the random scheduler. >>: Okay. >> Alastair Donaldson: Alright, so a quick word on data races. Over half of the benchmarks we found to contain data races by using a dynamic data race detector. These data races are found almost instantly by any of these methods. If we treat data races as bugs we really wouldn’t be able to distinguish between the bug finding ability of these methods because they would all just find a bug, okay. We had a look at some of the benchmarks. In some cases it’s kind of clear that the developers would regard these races as benign. In other cases it’s things like incrementing a counter in a histogram, which really should be done with a relaxed atomic in C++ eleven. In the end what we decided to do, as has been done in prior work, was run dynamic race detection up front and find a load of data races. Then look at all of the instructions that participated in some data race and promote every one of those instructions to be a visible operation. Treat every one as a sync op, and then ignore the fact that there may be further data races and do systematic concurrency testing scheduling at sync ops and known racy instructions. 
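As a rough sketch of that treatment (the instruction representation and the particular set of synchronization operations below are assumptions for illustration, not Maple’s actual interface):

    from collections import namedtuple

    # Hypothetical representation of a dynamically executed instruction.
    Instr = namedtuple("Instr", ["opcode", "address"])

    SYNC_OPS = {"thread_create", "thread_join", "lock", "unlock",
                "cond_wait", "cond_signal"}

    def is_scheduling_point(instr, racy_addresses):
        # Schedule at synchronization operations, plus at any instruction
        # (identified by its static address) that participated in a data race
        # during the up-front dynamic race detection run.
        return instr.opcode in SYNC_OPS or instr.address in racy_addresses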
Every time those instructions are executed, whether or not it’s actually a racy scenario, we consider them as sync ops. This essentially explores sequentially consistent outcomes of those data races, and is oblivious to further data races. We argue that this is, A, what has been done in some prior work, and B, not biased towards any of the particular techniques we’re evaluating. Okay, so here are the names of the benchmarks. I won’t spend too long on this slide. But you may be familiar with some of these names; aget and pbzip I think are very famous in papers about concurrency testing. Then we have this rather large set of examples which come from the TACAS Software Verification Competition. I should say we’re just taking the benchmarks we can find, that people are using in their papers and boasting about. Here for example you can see that there are many variants of a dining philosophers problem. You may or may not think that’s good in a benchmark suite. But these are the benchmarks that we could find. >>: These all look like small programs. >> Alastair Donaldson: Yeah, so… >>: The list for you guys, is it big or small? >> Alastair Donaldson: I mean I’m not intimately familiar with the source code of these things. But I think that pbzip is moderately sized. None of these are massive programs, or some of them come from massive applications, but sometimes I think it can be really misleading in these papers to say we tried this on SQLite which is two hundred thousand lines of code, when actually the test is running on a hundred of those lines, so you know. Then there are some benchmarks from CHESS which Paul ported to Linux. These work [indiscernible] queue benchmarks, which we thought were very interesting, were available. There are some PARSEC benchmarks, some RADBench benchmarks, and some of the SPLASH benchmarks. In each of these suites there are, I think in all cases, more benchmarks than just these. But there were always examples that it would have been really quite difficult to make amenable to systematic concurrency testing. There was only so much effort we could put into curating this benchmark suite for the study. Okay, so the top level Venn diagram of results is as follows. We have fifty-two benchmarks; seven of them none of our techniques could find bugs in: preemption bounding, delay bounding, or naïve search. Thirty-three of them we could find bugs in, and in all of those cases the bugs could be found by one of the bounding methods, okay, within ten thousand schedules. There’s never a case where doing this naïve depth-first search was better in terms of bug finding on these benchmarks. We can see here that delay bounding is significantly better than preemption bounding in that there are seven benchmarks where delay bounding was capable of finding the bug and preemption bounding was not. >>: The seven benchmarks where none of the techniques, do you know if there’s a bug in there, or no? >> Alastair Donaldson: All of these things have bugs. We know they all have bugs, so I’ll come back to the bugs we couldn’t find. We know anecdotally they have bugs, right. >>: Yeah. >> Alastair Donaldson: Like people say there’s a bug in this and they explain the bug in English, right. >>: Okay, yeah… >> Alastair Donaldson: It could be that they’re wrong and you know… >>: It would be interesting to look at that. >> Alastair Donaldson: Yeah, but I’ll come back to that. Okay, so what this does is confirm two claims of prior work. 
First of all, for this benchmark suite, it supports the claim that schedule bounding is effective at finding bugs. Furthermore it supports the claim that delay bounding is superior to preemption bounding for finding bugs, okay. But one of the reviewers of our PPoPP paper suggested that it might be interesting just to try a completely random scheduler. A scheduler which at every scheduling point randomly selects a thread; it doesn’t use any PCT type stuff, it just randomly selects a thread. We read that and to be honest we kind of groaned, you know, I guess we should try that but it’s going to be some more experiments to run. [laughter] But we did it, okay, and this is the result, right. What we found was that the random scheduler, the completely naïve random scheduler, was able to find all except one of the bugs that could be found by the systematic methods. Furthermore it could find an extra bug. Furthermore, which you can’t see here, it finds these bugs significantly faster in most cases than delay bounding. Of course that could be due to luck in terms of the random schedules. Let me explain again about the random scheduler. We’re picking on the fly, we’re making random choices. We’re not recording those choices, so we could be doing the same schedule twice in a set of ten thousand schedules. We would just do ten thousand random schedules and see how far we get, okay. We find this very surprising. I guess if I’m completely brutally honest it suggests one of two things, or a mixture of these two things. It either suggests that SCTBench, this set of benchmarks which people are using and we gathered, is not representative of real world bugs. Or it suggests that actually these bounding methods don’t provide benefit over just a naïve random approach for raw bug finding. I’ll talk a little bit about how good this benchmark set is in a minute. But one thing I’d say in defense of preemption bounding and delay bounding is, first of all, I think it’s definitely true that if you find a bug using a delay bound of one it’s likely that the counterexample you get is going to be much more palatable than the counterexample you get from some crazy random schedule. Okay, the second thing is that because these results support the hypothesis that bounding is useful, that adds weight to these sequentialization methods. Context bounding and certain forms of delay bounding admit sequentializations where you can take a concurrent program and you can rewrite it into a sequential program such that the sequential program exhibits all of the behaviors of the concurrent program up to some bound. Then you can use static verification methods from the sequential land to prove, not for just one input but for all inputs, that this concurrent program is correct up to this bound. Now how useful is such a claim? Well it’s potentially quite useful if it seems really to be true that you can find bugs within small bounds. These results I guess add weight to those uses of bounding methods, even if they potentially detract from their use just for bug finding. >>: The PCT work, which compares pure random to the guided approach, really does see a significant difference. But that’s on substantially large real applications. >> Alastair Donaldson: Right. >>: Like SQL Server. >> Alastair Donaldson: Yep. >>: You know, because that technique is not systematic it’s able to run on very large programs. >> Alastair Donaldson: Yes. 
>>: But just because it can run on large programs shouldn’t prevent us from running even small programs… >>: Oh, no, no, no, but what I’m saying is that you know I think these benchmarks perhaps are whittled down in such a way that your schedule space is not that huge. >> Alastair Donaldson: Right. >>: You’re… >> Alastair Donaldson: It’s definitely… >>: It’s not a needle in a haystack. It doesn’t seem like a needle in a haystack if random is doing this well. >> Alastair Donaldson: Yeah, and in a second, I just want to give one piece of intuition that I have for this scenario. It seems to me that if a bug can be exposed with just one preemption, then there seem to be two extreme scenarios. One scenario is that the bug will occur if this preemption happens some time, right, it doesn’t matter when, it just has to happen. In that case it seems that loads and loads and loads of schedules will expose the bug. Then random will find the bug. >>: Right. >> Alastair Donaldson: On the other extreme there’s the case where some preemption has to happen but at a really key point. If it doesn’t happen at that key point you won’t get the bug. Then that’s, as you say, Tom, like searching for a needle in a haystack. We wouldn’t expect random to do well. >>: Yeah, but the other thing you’re not accounting for is random is finding the bugs in fewer guesses than IDB. >> Alastair Donaldson: Yes, right. >>: Systematic is penalizing you. >> Alastair Donaldson: Right. >>: Right, it’s doing worse in that scenario. >> Alastair Donaldson: The only intuition I have there, this came from Hana Chockler, King’s College London, who saw Paul talking about his work, and I’ll try and say what Paul told me she said to him, which is that if you imagine the tree of schedules, then shallow terminal schedules are likely to be favored by random, right. If you have a bug that can crash the program very quickly then it will have a shallow terminal schedule. There may be a number of these shallow terminal schedules, so you will have a high probability of hitting one of those. Does that make sense? >>: Well yeah, but the shallowness helps both methods. >> Alastair Donaldson: Except that, so something I haven’t mentioned in this talk but is in the paper, is that we’re doing IDB and IPB with, well, you have to pick a scheduling method underneath that. We’re using depth-first search underneath IPB or IDB. >>: Why is that? I don’t understand. >> Alastair Donaldson: That’s mainly because in Maple that was a pretty core engineering decision and it was hard to reverse. Maple is running, you know, you’re running and you’re recording your schedule. Then you’re trying a schedule variant. There are many ways to enumerate the schedule variants, and a simple way is depth-first enumeration of schedule variants with a schedule bound of one, say, okay. That will not necessarily favor these shallow bugs. Does that make sense? >>: No, I guess I’m misunderstanding what you mean by, what do you mean by shallow? >> Alastair Donaldson: I guess if you imagine the space of all schedules as a tree, then shallow would mean terminal schedules which are not very deep in the tree. >>: But did not deep mean few preemptions [indiscernible]? >> Alastair Donaldson: It means few instructions executed, full stop. >>: Oh, okay. >> Alastair Donaldson: Yeah. >>: Why does the number of instructions executed matter? 
>> Alastair Donaldson: Because if you imagine the random scheduler, every time it reaches a sync op it makes a decision. If there are loads and loads of shallow terminal schedules that end with a crash, then if it makes random decisions it’s quite likely to get to one of those, rather than if there’s a very, very deep crash. Yeah? >>: Deep meaning, but deep would mean in terms of the number of scheduling decisions, not in terms of the number of instructions, right? We had counting schedules, well not counting… >> Alastair Donaldson: I wonder if we might be able to talk about it after, because it’s something I haven’t thought about that much and would like to think about more, and yeah. >>: Because it seems like what’s happening here is our IDB is somehow systematically off, right. It’s going off on some piece of the search space where the solution isn’t. >> Alastair Donaldson: Right. >>: You know, whereas apparently the space of solutions is quite dense and you’ll find one quickly if you get randomness. >> Alastair Donaldson: It does seem that way, and we can look at the big table, so this is the big table. I mean we’re not going to look at it now. But this table has all the data in it, right. We couldn’t list this in the paper because there are only so many observations we could make ourselves and fit in. This has really got a lot of information. If you’re interested we could offline maybe pore over the table and look at some of the results. We’re very curious to understand this. We were very surprised by this random result. We’re very grateful to the reviewer, if you’re watching, for suggesting we do this. Unfortunately we didn’t have a chance to have our comments about the random scheduler reviewed because we put these in the final version of the paper. Ganesh? >>: Yeah, actually [indiscernible] experiment when I was visiting Intel for a year. We took [indiscernible] and then used parallel random runs, basically using different seeds. >>: Okay. >>: We would get bugs [indiscernible]. >>: Right. >>: We didn’t record the traces, that was the problem. >>: Okay, you didn’t record the traces at the… >>: Yeah, because it was by the book. The other thing is we also did a count of how many variants of the same error there are, because we don’t [indiscernible]. [indiscernible] ten thousand and twenty thousand times in the state space. >>: Yeah. >>: You don’t need to hit that sort of instance because the core [indiscernible] populated all over the state space, so random is not [indiscernible]. >>: When you say random I presume you’re saying pseudo random. You could reconstruct the bug if you… >> Alastair Donaldson: Yes, pseudo random. >>: That’s right. >>: Showed somebody how it misbehaves. >>: That’s right. >> Alastair Donaldson: Yeah, right, but I think we should focus now on the benchmarks and their limitations. Because let’s not get too carried away, because some of these benchmarks are not good. The main findings then, I think I’ve covered these. Schedule bounding was similar in terms of bug finding ability to random search. Many bugs can be found with a small bound. Delay bounding beats preemption bounding. But what I want to talk about now is that a significant number of the benchmarks may be regarded as trivial in some sense. This is quite important because if benchmarks are trivial researchers should not boast about finding bugs in them. They should become a minimum baseline. If your tool can’t find these bugs your tool is not a tool. 
You need to boast about finding, you know, better bugs than these bugs. [laughter] I hope that our study can potentially set a clear baseline for at least systematic concurrency testing techniques, maybe concurrency techniques in general, so you have this big table of benchmarks, all of these benchmarks, which I’m not going to allude to by name here. But, yeah, well let’s see, so trivial benchmarks. Here’s a property: the bug was found with delay bound zero, fourteen benchmarks. Right, so what this means is the single schedule with delay bound zero… >>: [indiscernible] round robin. >> Alastair Donaldson: Finds the bug. >>: These should be just stricken. [laughter] >> Alastair Donaldson: But let me emphasize again, we just wanted to take the benchmarks people are using and study them, right. Okay, the numbers here, there is overlap between the numbers. This isn’t a partitioning of the benchmarks. There were sixteen benchmarks where, Tom, you were correct to say that the schedule space is not vast. There were fewer than ten thousand terminal schedules over all, forget bounding. What this means is that all of the methods we studied would eventually find the bug, because they would all exhaust the search space, right. Then it might be that delay bounding would get there faster or preemption bounding would get there faster, maybe, maybe not, but every time it would get there. We found in nineteen cases that more than fifty percent of the random schedules we tried exposed a bug. Now of course that could be luck, but I think it suggests that there are loads and loads of schedules that expose a bug in these benchmarks. Okay, and then furthermore, for nine of those every random schedule we tried was buggy, okay. >>: That’s really weird. >> Alastair Donaldson: Then in these nine… >>: Is there a default of the other… >> Alastair Donaldson: Right, and in these nine, I would have to check the exact number, but I think for four or five of them every schedule we tried with every technique hit a bug. Going back to the beginning of my talk, we wouldn’t actually call those concurrency bugs because they are not schedule dependent, right. We included them in the study because they were claimed to be concurrency bugs by… >>: I think the same is true for delay bounds here, right? >> Alastair Donaldson: Not necessarily. >>: You get a random schedule. >>: No, that’s over there, because it depends on which scheduler you chose. >> Alastair Donaldson: Yeah. >>: I mean that’s a little weird because that’s like the default scheduler. That’s the thing that would happen pretty much on an ordinary computer. >> Alastair Donaldson: I’m sure it’s not the case, because otherwise we’d find this for all fourteen. What I’m saying is, there’s another measure which we don’t have a column for, which is that every schedule was buggy. We didn’t find that for fourteen, definitely not. Right, so… >>: Just to give, I mean, anecdotes, on the benchmarks we were doing with CHESS, which were the ones I worked on for .NET from library code. I mean for small things like hash tables, you know a parallel hash table, you would have millions of schedules, right. I mean just small programs, and this is with all the bounding and everything. I mean huge numbers, so first of all just the, I mean ten thousand… >> Alastair Donaldson: Let me ask, were they all buggy? >>: Well… >> Alastair Donaldson: All these benchmarks are buggy, right? >>: We found bugs. 
But… >> Alastair Donaldson: You found bugs, but were all the… >>: What Don is saying there… >>: I’m just saying the search space… >>: [indiscernible]… >>: The search space is ridiculously small for concurrency. >> Alastair Donaldson: Well, so let me emphasize though. I’m not saying that, so what we’re saying is there were sixteen cases where you had this small search space, right. >>: Yeah, yeah. >> Alastair Donaldson: But there were fifty-two benchmarks, and for the remaining benchmarks we don’t know how big the search space was. >>: Yeah, but you don’t need it... >> Alastair Donaldson: But it was more than ten thousand. You know that it’s not going to be, it’s probably not going to be ten thousand and two. It’s probably going to be bigger. >>: Well, I’m also wondering, you know, how hard those things really are. >> Alastair Donaldson: Yeah. >>: Because I mean with CHESS definitely we had experience of exhausting a preemption bound of two or three. You know we have examples, you know you can make up whatever number you want. But I mean for the real things, you know, we could keep exploring for a very long time on relatively small but complex [indiscernible]. >> Alastair Donaldson: Okay, so… >>: These were things you know with volatiles and with interlocked operations. I mean the code was small but the schedule space was huge. >> Alastair Donaldson: Yeah, I mean it’s pretty clear from these, I mean I’m focusing here on the ones we say are trivial. >>: Okay, yeah, yeah, yeah. >> Alastair Donaldson: Then there are the rest of them, some of which come from CHESS and which are harder, and you know I’ll come back in a minute to some of those. >>: Yeah. >> Alastair Donaldson: But it sounds anecdotally that your reading of these results is probably, when I said that we draw one of two conclusions, right, either these benchmarks are not realistic or these methods are not providing benefits, it sounds like you’re probably thinking the first, which I would like to think because I find these bounding methods fascinating. I’d like to study them more. To me it was in a way a disappointment when we found this random result. Because I kind of thought, what are we doing then if we can just find these bugs randomly. You know I think the key thing is the difference with programs that are just very likely to crash. Then you could argue that maybe you don’t need systematic testing. Maybe the program is so likely to crash that… >>: Well, one was… >>: I mean there were a number of approaches that say maybe random testing also can be improved still. >>: Yeah. >>: We could do better than the random scheduler. At least the random methods, I mean the PCT scheduler. >>: True. >>: I mean it’s all a matter of searching, you search a schedule space. >> Alastair Donaldson: Yeah. >>: How do you search that space? It could be DFS, it could be BFS, it could be probabilistic search, and probabilistic search is random testing. Then you can ask the question, how do I want to guide my search? >> Alastair Donaldson: Right. >>: I think it might be useful to qualify the benchmarks using a qualitative approach which would say this is clearly a synthetic benchmark created by hand to insert a bug into a well known but small concurrent program and try to find it, I mean versus, you know, a benchmark that comes from the wild, which is typically what we were going after with CHESS. >> Alastair Donaldson: Yeah. 
>>: I mean clearly everybody creates synthetic benchmarks. But we create them basically as functional tests to say, is the tool just working? >> Alastair Donaldson: Yeah. >>: Does it find a bug or not. And also to do unit testing, which is an investment thing for an exponential search space. >> Alastair Donaldson: Yeah. >>: But to have really tiny programs that have a small search space is so that the tool can, we can test the tool… >> Alastair Donaldson: Test the tool, right. What we wanted to do in this study, like I said, was just take what there is and evaluate it, and do it quantitatively. >>: Yep, yeah. >> Alastair Donaldson: We wanted to be very objective here. But I think given some of the findings, maybe some more vetting of the code could be in order. >>: You don’t have to beat a dead horse. >> Alastair Donaldson: I’ll talk briefly about the bugs not found. >>: [indiscernible] >> Alastair Donaldson: There were three bugs which were not found in a rather trivial way. These bugs could be exposed if you reduced the number of threads. You saw in the list, I think this is the dining philosophers example, where there’s two, three, four, five, six, seven, I think is the case. I can’t remember exactly. But I think in dining philosophers seven you couldn’t find the bug. But it’s basically the same bug as in dining philosophers two. I don’t really regard those as particularly interesting unfound bugs, except that maybe delay bounding should find them if delay bounding’s meant to be good with large thread counts, maybe, okay, yep. Then there’s this radbench.bug number one. This is a JavaScript interpreter. We have a scenario where thread one destroys a hash table and thread two must access the hash table. Paul believes that this should be exposable with only one delay. But there are upwards of fourteen thousand scheduling points. He looked at an execution and he looked at the number of scheduling points in that terminal execution and there were, you know, upwards of fourteen thousand of them. You can see that there would be a very large schedule space. >>: Why are you [indiscernible] on, no technique was able to expose them? >> Alastair Donaldson: Yeah, no technique could expose it, right. The radbench.bug two we know requires at least three delays or preemptions. Because I believe we were able to search this one exhaustively; we need to double check the table, okay. We were able to search it exhaustively up to this bound of three. >>: Then you found it. >> Alastair Donaldson: Sorry, no, no, no, sorry, that’s not true. No, we didn’t find this bug. What we did was we explored zero, one, and two. We got into three when we reached the ten thousand limit. We know that we would need at least three preemptions to find this bug. >>: Okay. >> Alastair Donaldson: Yeah. >>: Have you looked at the [indiscernible] depth metric that [indiscernible] the PCT paper? >> Alastair Donaldson: No. >>: Because it sounds like you’re actually going after something similar here. That you know, figuring out how many scheduling points and how many preemptions, so we have this classification metric that’s based on a similar idea except it’s priority lowering points, not preemptions. >> Alastair Donaldson: Okay. >>: It boils down to a similar thing. >> Alastair Donaldson: Is it possible to use that metric to assess the case if you didn’t find a bug, or can you only use it when you found the bug? 
Or can you use it to give you a lower bound on… >>: It’s not hard to actually measure this because it states what is the minimum number that you need to find, to guarantee… >> Alastair Donaldson: To guarantee [indiscernible] percent. >>: Sort of quantified over all possible [indiscernible]. >> Alastair Donaldson: Okay. >>: Might be hard. >> Alastair Donaldson: You know we didn’t look at that. I didn’t remember that from my reading of the papers, so I’ll need to go back and study the paper. Alright, and then there’s a very interesting benchmark which we classified as miscellaneous. This comes from Dmitry Vyukov and it was posted to the CHESS forum. This is a lock-free stack. There are three threads, and the poster of this benchmark claims that this bug can only be exposed with at least five preemptions. There are a hundred and fourteen scheduling points. Right, so if he is correct, and you know we couldn’t find this bug, so I can’t say that we validated this. We… >>: He has explained the bug and… >> Alastair Donaldson: He’s explained the bug in English. Right, so his claim is that this bug cannot be exposed with a small number of preemptions. Obviously there will exist counterexamples to this claim of, well, you may still rule out five as a small bound. But this is a not so small bound. It will be interesting to see whether PCT could find this. We didn’t find this with random. This sounds like the sort of thing we wouldn’t find with random, right, because it requires five preemptions; the chances of just naïve random inserting those preemptions at the right places are very small. Something which I won’t talk about now but which we… >>: How long did you run, the bound was over ten thousand? >> Alastair Donaldson: Over ten thousand. >>: Okay. >> Alastair Donaldson: I mean it was Zvonimir who suggested to Paul, you know, to take that benchmark and just run it for ten weeks or something. >>: I find it hard to imagine that a real program needs five preemptions to hit the bug. Clearly one can craft a program if one needs that many. >> Alastair Donaldson: Right. >>: But I mean I would be interested to, I mean it’s probably terrible to look at so maybe not, right. >>: But do you… >>: But it’s really hard to believe that. >>: Did the author say that he crafted this to have this five preemption lower bound, or discovered this by accident? >>: Or is this a latent natural one? I could be… >> Alastair Donaldson: I haven’t personally looked at this. I could click the link now and we could look at it. >>: Ah, yeah. [laughter] >>: Yeah this goes back a long time, but I… >>: Go to chess.codeplex.com, oh there it is, okay. >>: Gosh I can’t believe that’s still up. >> Alastair Donaldson: In what sense, in that you think it’s bogus or… >>: Number… >>: We haven’t looked at that for a little while. >>: Should have been squashed a long time ago. >> Alastair Donaldson: Oh, I see, maybe Paul just linked the forum, not the actual bug. Okay, I don’t think I’m going to be able to find this during the talk. But let’s look at it afterwards, okay. >> Yeah. >> Alastair Donaldson: Right, so now I guess it’s about systematic concurrency testing, full stop. The main problem we had doing the study was environment modeling: GUI applications, applications that use processes not threads, and applications that use the network. What Paul tried to do was take those applications and then extract from them isolated concurrency tests. 
Try to take out something that involves a few threads in a scenario and use that for scenario testing. But he found often there would be global variables and this would be very hard to do. For these more real world things it would be very difficult to extract these test cases. I think for people to really use systematic concurrency testing they need to be willing to put some effort into writing concurrency tests, and trying to make these tests stand alone, so that you can systematically explore the schedules. Then another issue was that many of the bugs we found were related to memory safety. What we had to do was, knowing from reports on these bugs that there’s a problem with memory safety, we had to then add an assertion at the point where we know the problem is, and then see whether we could find that assertion. What we don’t have inside Maple is good dynamic [indiscernible] style memory analysis to try to find these bugs. >>: But there’s a [indiscernible] the program still can crash, right, with a segfault or something? >> Alastair Donaldson: Often not. >>: Okay. >> Alastair Donaldson: I mean if there’s a buffer overflow it very often doesn’t crash, and whether that happens may be non-deterministic. There’s a kind of engineering challenge of trying to, you know, if you want to do race detection, maybe to do full exploration of all schedules arising from data races, you need very fast on the fly race detection. If you want to find these memory errors you need very fast on the fly memory analysis. Getting all those things integrated into a tool, and then having the flexibility to be able to explore the scheduling strategies, is quite some engineering challenge. But not really a research challenge. >>: I have a comment about the second issue. >> Alastair Donaldson: Yeah. >>: At the start of the CHESS project there were two challenges with this kind of testing. One of them is of course creating the isolated unit test. The second one is this combinatorial explosion. But I think that at the start of the project it was not clear to me that these are two very distinct challenges. I think I found, you know, talking to engineers and testers in the company, that often times they say they are doing random testing because of one reason but it’s really because of the other reason. >> Alastair Donaldson: You mean because they can’t isolate these unit tests? They just… >>: Because they can’t isolate it. A lot of the reason why concurrency testing has not been adopted is because you can’t isolate these tests; even what you call random scheduling, right. >> Alastair Donaldson: Yep. >>: You are able to build a truly random scheduler only once you have gotten full control over all [indiscernible] choices. >> Alastair Donaldson: Right, so it’s, yeah, it’s controlled random scheduling. >>: But if you go into industry, what people refer to as random scheduling just means that they’re just running the test again and again. >> Alastair Donaldson: Yeah. >>: Trying to [indiscernible] things, right. >> Alastair Donaldson: That’s absolutely not what we did here. >>: What I’m saying is, what you are doing is already way more principled than what was happening. >> Alastair Donaldson: Yeah. >>: In actual product settings. >> Alastair Donaldson: But harder to apply. >>: There is this smooth transition, right. 
You have the [indiscernible] case where you want a random scheduler, and then you have testing in the real scenario where maybe you’re not catching every single piece of synchronization. Yeah, but you still have some idea that whatever worked better in the isolated case is probably also going to work better if you only have partial control over preemptions. I mean, the same scheduling decisions you can make in the purely isolated case for random scheduling have a good chance of working once you apply them to the real code, even if you don’t catch all the synchronization. >> Alastair Donaldson: But how do you apply them to the real code? How do you just apply them? >>: You can always intercept some synchronization. It’s just that you can’t restart the program every time you want to restart. Then you have no guarantee that you won’t miss some synchronization, or maybe sometimes you guess incorrectly that a thread is blocked when it’s really just slow. Those are the imprecise approximations that you have to make in the [indiscernible]. That’s what you do in [indiscernible], right? >> Alastair Donaldson: Right. >>: But regardless of those guesses, even though they may be incorrect, you’re still running a principled scheduler. >> Alastair Donaldson: Okay, yep, makes sense. Right, so to conclude with the slide on future work, what would we like to do? Paul has a bunch of ideas related to extensions of systematic concurrency testing and heuristics. But in terms of this study we would like to look at more techniques; the PCT technique I think is first on the list. Then it would be interesting to compare with non-systematic techniques as well. Now that we have this framework for running these tests, we would like to extend the benchmark suite significantly. We didn’t look here at partial-order reduction. The reason is that it’s not trivial to combine partial-order reduction soundly with these bounding techniques. Soundly in the sense that I would say POR is combined soundly with delay bounding, for instance, if applying POR doesn’t lose anything beyond what you’re already losing with delay bounding. My understanding is that this is not the case if you naively combine them, and that it’s an open problem on which people are working. I think Madan had a paper with Katherine Coons and Kathryn McKinley at OOPSLA last year on precisely this topic. That’s something we would like to look at. Then there’s this issue of weak memory in systematic concurrency testing, where I would say there’s been some preliminary work, but we would like to explore that further as well. Okay, so thank you for listening and thanks for the questions. I’d be delighted to answer any more questions. [applause] >> Shaz Qadeer: More questions? >>: Sure. >> Alastair Donaldson: Yeah. >>: There would be other possibilities for random. I mean, instead of just doing complete random scheduling you could do a random delay or something like that, a random perturbation from a fixed schedule. >> Alastair Donaldson: Shaz, you said something a bit like that to me, that you even tried delay bounding with a… >>: [indiscernible] scheduler. >> Alastair Donaldson: With a randomized but deterministic scheduler, where you choose the seed at the beginning. >>: Yeah, so I… >> Alastair Donaldson: Then you have a deterministic scheduler, but it’s like a randomly chosen deterministic scheduler, yeah. >>: Yeah, you could do that.
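The controlled random scheduling and the “randomly chosen deterministic scheduler” being discussed can be pictured with a small sketch. This is an illustration under simplifying assumptions, not the study’s Maple-based implementation: each thread is abstracted to a count of steps, and at every scheduling point the tool picks uniformly among the enabled threads using a PRNG that is seeded once, so that a fixed seed behaves like one deterministic scheduler.

#include <cstdio>
#include <random>
#include <vector>

// A thread is abstracted to a count of remaining steps (a stand-in for real code).
struct Thread { int remaining; };

// One terminating run under the schedule determined entirely by `seed`:
// at every scheduling point, pick uniformly among the enabled threads.
std::vector<int> run_once(std::vector<Thread> threads, unsigned seed) {
    std::mt19937 rng(seed);            // seeding once fixes the whole schedule
    std::vector<int> schedule;
    for (;;) {
        std::vector<int> enabled;
        for (int i = 0; i < (int)threads.size(); ++i)
            if (threads[i].remaining > 0) enabled.push_back(i);
        if (enabled.empty()) break;
        std::uniform_int_distribution<int> pick(0, (int)enabled.size() - 1);
        int t = enabled[pick(rng)];    // the controlled random choice
        threads[t].remaining--;        // "execute" one step of thread t
        schedule.push_back(t);
    }
    return schedule;
}

int main() {
    // Two threads of three abstract steps each: the same seed always replays
    // the same interleaving, so each seed acts as a deterministic scheduler.
    for (unsigned seed = 0; seed < 3; ++seed) {
        std::vector<int> s = run_once({{3}, {3}}, seed);
        std::printf("seed %u:", seed);
        for (int t : s) std::printf(" %d", t);
        std::printf("\n");
    }
    return 0;
}

Re-running with the same seed replays the same interleaving, which is what makes a failing schedule reproducible; drawing fresh seeds gives the naïve controlled random search that the study compared against the bounding techniques.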
>>: I was thinking of it just as a bias in your random scheduler, right. You could, say, bias towards round robin. >> Alastair Donaldson: Okay. >>: That would put you a little closer to the delay bounding ideal. >> Alastair Donaldson: Yeah. >>: And see whether that does better than an un-biased run. Even though delay bounding does seem to improve, or… >> Alastair Donaldson: Yeah, that’s an interesting idea. I think it’s a little bit like PCT, right, which rather than favoring round robin is prioritizing threads and randomly changing priorities. >>: I think, I mean, in the paper that we did that had PCT, the point of the randomization was to get as quickly as possible to the shallowest bugs, right. You have this characterization of that. The things we found you want to randomize are priorities and priority-lowering points. We give all threads an initial random priority. Then the scheduler is deterministic based on those priorities. There are random points in the execution where we lower the priority of some thread, and that’s it. >> Alastair Donaldson: Okay, you had a question in the back. >>: There is that one famous concurrency bug, the one where the Space Shuttle had to be scrubbed on the launch pad. Did you try to add it to your list? >> Alastair Donaldson: We didn’t try to add that to our list. I don’t know whether it’s a C++ program with a fixed input, and, yeah, but no, we didn’t. Kanesh? >>: I was wondering whether you could take the [indiscernible] code structure into account. Some of the initialization code may not be of interest. You want to get past it and then turn on the search, things like that, I mean. >> Alastair Donaldson: Yep, yep. >>: Start the search at, say, [indiscernible]. >> Alastair Donaldson: Absolutely, yeah, you may want to run the application up to some point, I suppose, then maybe take a snapshot at that point and do systematic concurrency testing from the snapshot, yep. >>: I don’t have a question but I have a comment. >> Alastair Donaldson: Yep, sure. >>: When you were introducing delay bounding you mentioned that delay bounding is with respect to a round robin scheduler. >> Alastair Donaldson: What I said is that in general it’s with respect to a deterministic scheduler. >>: That’s right, so… >> Alastair Donaldson: In this implementation, like CHESS, we used round robin, yeah. >>: The result of doing delay bounding there depends a lot on what deterministic scheduler you use. Because, as Ken was saying, the deterministic scheduler is sort of the point around which you’re biasing the search. >> Alastair Donaldson: Yeah. >>: Now, in other applications, so I’m interested in systematic testing of message-passing applications. In those applications there is no such notion of preemption or context switch. You have processes running and they’re communicating via [indiscernible]. There is no a priori reason to believe that preemption among those processes is particularly going to be useful. So we were using this idea of a deterministic scheduler, right, because that’s a very general concept that doesn’t depend on whether you’re running single-core or multi-core; it is applicable even to a distributed system. We found that the speed with which bugs are uncovered depends significantly on the particular deterministic scheduler that we started with. >> Alastair Donaldson: Okay. >>: We experimented with the round robin scheduler that you mentioned. >> Alastair Donaldson: Yeah.
>>: There’s another one called a [indiscernible]-completion scheduler. >> Alastair Donaldson: Right. >>: And there was one more, so we created a random delaying scheduler, also. >> Alastair Donaldson: Okay, is that akin to our random scheduler? >>: No, no, no, no. >> Alastair Donaldson: Did you try random scheduling like we tried? >>: We have not tried random scheduling. >> Alastair Donaldson: You should try that. >>: We should try that, yeah. >> Alastair Donaldson: It’s easy to try. I guess what surprised me a bit was that in prior work there was not a comparison with this very [indiscernible] random approach. Actually it didn’t surprise me, because we didn’t think of doing it either; it was a reviewer who suggested that we try it. I guess my takeaway from this work is definitely to try all these easy things, right. You know, it’s good to try them. >>: No, you shouldn’t try them, they’re embarrassing. >> Alastair Donaldson: Don’t try them because then you can’t write a paper on them, is that what you mean? [laughter] But yeah, I mean, with this benchmark set my reading of it is that about half the benchmarks are, I think, nonsense benchmarks. About a fifth of the benchmarks are really hard and we can’t find the bugs. The remaining ones are interesting benchmarks, and for those I think we’re seeing that the claims of prior work aren’t being supported. But this random approach is doing well, and maybe those bugs are not super hard to find. The benchmarks are not super simple either, though. >>: I think it’s also, like what you said about randomized versus non-randomized search, or what Chad said, and also something you knew but didn’t really point out: there’s not really that much of a difference between randomized search and non-randomized search. If you think of the random choices as an input to picking a schedule, right, then you can just pick those random choices before you start. Similarly, you can deterministically enumerate all the random choices, and then you have a systematic search. >> Alastair Donaldson: Right, okay, so if you knew there were a hundred scheduling points you could systematically enumerate all of them… >>: Yeah, so now the design of a good random scheduler is to pick fewer random choices. It’s just like how you determine the complexity of a randomized algorithm: it costs something to pick a random bit. You want a scheduler that uses as few random choices as possible. That same scheduler is then actually good for systematic exploration, because you can actually go through all the random choices. >> Alastair Donaldson: Okay, I’d like to talk more about that actually, really. >> Shaz Qadeer: Okay, more questions? Alright, cool. >> Alastair Donaldson: Thanks. [applause]
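To make that closing point concrete, here is a minimal sketch under the simplifying assumption that each thread is just a counter of abstract steps: if every scheduling decision is an explicit choice among the enabled threads, a random scheduler samples the choice sequence, and the very same search becomes systematic by enumerating all choice sequences.

#include <cstdio>
#include <vector>

// Depth-first enumeration of every choice sequence, i.e. every interleaving.
// remaining[i] is how many abstract steps thread i still has to run.
void explore(std::vector<int>& remaining, std::vector<int>& schedule, int& runs) {
    std::vector<int> enabled;
    for (int i = 0; i < (int)remaining.size(); ++i)
        if (remaining[i] > 0) enabled.push_back(i);
    if (enabled.empty()) {                    // a complete schedule: one "run"
        std::printf("run %d:", runs++);
        for (int t : schedule) std::printf(" %d", t);
        std::printf("\n");
        return;
    }
    // A random scheduler would sample `choice` here; the systematic search
    // simply tries every value of `choice` in turn.
    for (int choice = 0; choice < (int)enabled.size(); ++choice) {
        int t = enabled[choice];
        remaining[t]--; schedule.push_back(t);
        explore(remaining, schedule, runs);
        remaining[t]++; schedule.pop_back();  // backtrack; a real SCT tool re-runs the program
    }
}

int main() {
    std::vector<int> remaining = {2, 2};      // two threads, two abstract steps each
    std::vector<int> schedule;
    int runs = 0;
    explore(remaining, schedule, runs);       // prints all six interleavings
    return 0;
}

For two threads of two steps each this prints all six interleavings; an actual SCT tool performs the same enumeration by re-executing the program under test from the start for each choice sequence rather than by backtracking in memory.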
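In the same abstract-step style, the PCT-style scheduling described a few exchanges back, with random initial priorities, always running the highest-priority enabled thread, and lowering a priority at a handful of randomly chosen steps, can be sketched as follows. This illustrates only the scheduling discipline under assumed parameters (three threads, depth d = 2), not the PCT tool itself.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(12345);          // one seed -> one PCT-style schedule
    const int n = 3;                  // threads (abstract step counts below)
    const int steps_per_thread = 4;
    const int total_steps = n * steps_per_thread;
    const int d = 2;                  // assumed bug depth: d-1 priority change points

    // Random initial priorities, all above the d-1 low values reserved below.
    std::vector<int> priority(n), remaining(n, steps_per_thread);
    for (int i = 0; i < n; ++i) priority[i] = d + 1 + i;
    std::shuffle(priority.begin(), priority.end(), rng);

    // d-1 randomly chosen steps at which the running thread's priority drops.
    std::uniform_int_distribution<int> step_dist(0, total_steps - 1);
    std::vector<int> change_points(d - 1);
    for (int k = 0; k < d - 1; ++k) change_points[k] = step_dist(rng);

    int low = d - 1;                  // distinct low priorities d-1, d-2, ..., 1
    for (int step = 0; step < total_steps; ++step) {
        int t = -1;                   // run the highest-priority enabled thread
        for (int i = 0; i < n; ++i)
            if (remaining[i] > 0 && (t < 0 || priority[i] > priority[t])) t = i;
        remaining[t]--;               // "execute" one step of thread t
        std::printf("step %2d: thread %d (priority %d)\n", step, t, priority[t]);
        for (int k = 0; k < d - 1; ++k)
            if (change_points[k] == step) priority[t] = low--;  // lowering point
    }
    return 0;
}

Because only the initial priorities and the d-1 change points are random, each run needs very few random choices, which is exactly the property praised in the final exchange: such a scheduler can be sampled for random testing or have its choices enumerated for systematic search.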