>> Shaz Qadeer: Let’s get started. It’s my great pleasure to welcome once again Professor Alastair
Donaldson to Microsoft Research. He has been visiting us off and on for several years now. In fact just
before he joined Imperial College London he spent two months with us working on problems related to
verification of GPU kernels. Since then he has done many, many more pieces of work on that topic and
other topics at Imperial with his student. Today he’s going to tell us about some new work unrelated to
GPU verification I think, yeah.
>> Alastair Donaldson: Thank you Shaz for the great introduction. I’m really pleased to hear I’d been
promoted to Professor, which I…
[laughter]
>> Shaz Qadeer: What do they call you over there?
>> Alastair Donaldson: I’m a Co-Lecturer.
>> Shaz Qadeer: Which?
>> Alastair Donaldson: Which is kind of like Assistant Professor.
>> Shaz Qadeer: I see.
>> Alastair Donaldson: I’ve heard in the U.S. lecturer is really not particularly prestigious, is that true?
>> Shaz Qadeer: I don’t think there’s a lecturer, Alastair.
>> Alastair Donaldson: They're talking about changing Imperial to use the U.S. names.
>> Shaz Qadeer: I see.
>> Alastair Donaldson: Which maybe is a good thing but we have this thing called Reader which is like
Associate Professor.
>> Shaz Qadeer: I see.
>> Alastair Donaldson: I think that’s such a cool thing. I would love to be Reader some time. I hope
they make the changes after I become Reader so I can be a Reader.
[laughter]
Okay, so the work I’m going to present is joint work with my Ph.D. student Paul Thomson, and also my
Post Doc Adam Betts. But the work has really been led by Paul. Paul has spent a huge amount of
time on this study. It's an empirical study about systematic concurrency testing methods based on
schedule bounding. I'll explain shortly what those methods are. Many of them were developed here at
Microsoft Research.
The background is that in Paul’s PhD he’s interested in looking at advanced techniques for doing
systematic concurrency testing looking at new algorithms and heuristics for bug finding. Doing this quite
practical work requires significant empirical evaluation to make sense of whether the techniques are
working or not.
Paul spent a huge amount of time building up a set of benchmarks. This benchmark gathering is very,
very time consuming. It involves a huge amount of time spent on messing around with Makefiles, getting
things to build on certain versions of Linux. Trying to then remodel parts of applications so that they're
amenable to the testing method under consideration, really a huge amount of work is involved in this.
We had the idea that we would like to, I guess, get some more money's worth for our effort, or Paul
should get more money's worth for his effort. Before starting to really look at brand new techniques and
evaluating them, why not take the existing techniques that we have read about and that we have been
inspired by, and try to do a very objective evaluation of those techniques on the concurrency benchmarks
that are open source, that people are using in prior work and in related work on concurrency testing.
I think in the end this led to a pretty interesting study. We had a paper this year at the Principles and
Practices of Parallel Programming Conference, which Paul presented. I was delighted that Paul won the
best Student Paper Award for this work. The study is completely reproducible: if you search
for the study online you’ll find our webpage. There’s a virtual machine where you can get all of the
benchmarks, all of the tools, and there are scripts there so you can rerun the experiments. We hope
that this could be useful to researchers in evaluating their methods.
The motivation for systematic concurrency testing is that as we all know concurrency bugs are horrible
because a concurrency bug may manifest non-deterministically, rarely, and may be hard to reproduce.
The key thing is that these bugs are dependent on the schedule of threads. By a concurrency bug I
specifically mean a bug that may or may not manifest according to the way threads are scheduled. A
bug that would always occur would not to my mind be a concurrency bug even if it’s in a concurrent
program. I would say a bug is a concurrency bug if whether or not it manifests depends upon the
interleaving of threads.
In our study we consider crashes and assertion failures to be bugs. We don’t consider data races to be
bugs. I’ll come back to that point later on. We’re talking about a concurrent program that runs until
either it crashes say with a segmentation fault or some assertion fails. The assertions are either there in
our benchmarks. Or we’ve added these assertions because the benchmarks perhaps contained output
checking code which we then replaced with assertions.
Systematic concurrency testing is a pretty simple idea in principle. The idea is you have a concurrent
program and a fixed input to that program, so one test input. Furthermore the concurrent program is
assumed to be deterministic. The program should not exhibit randomization. The program should not
be doing things like reading from the network and getting data values that are not inside the program.
It should be a closed program. There are methods for coping with non-determinism by modeling and
systematically exploring non-determinism. We didn’t look at that in this work.
In this case we’re talking about a deterministic concurrent program with the exception of the thread
scheduler which of course is non-deterministic. Having this fixed input program, the OS scheduler would
usually be responsible for scheduling the threads of this program. A systematic concurrency testing tool
or SCT tool sits in between the OS scheduler and the program. The tool takes control of the scheduling and
determines the order in which threads are scheduled.
This means that it’s possible to repeatedly execute the program controlling the schedules that are
explored. To potentially enumerate thread schedules. If the program is guaranteed to terminate for
any thread schedule then in theory it’s possible to enumerate all of the schedules of the program on this
input. Of course in practice for significantly sized programs this is not feasible; there would be a vast
space of schedules. While every schedule would be considered in the limit, in practice the idea here is
to try to find bugs in the program through this systematic method.
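Just to make the idea concrete, here is a minimal sketch, a toy model rather than the CHESS or Maple implementation, of what "a schedule" means under these assumptions: with a fixed input and a controlled scheduler, a schedule is just the sequence of thread IDs the tool forces, and replaying that sequence reproduces the run. The names and the toy thread model are hypothetical.

    // Toy sketch: replaying one schedule when the only non-determinism is the
    // order of thread IDs chosen by the controlled scheduler.
    #include <cstdio>
    #include <vector>

    struct ToyThread {
        int remaining_steps;   // visible operations this toy thread still has to do
    };

    // Follows the given schedule; returns true if every chosen thread still had
    // a step left (i.e. the schedule was actually executable).
    bool replay(std::vector<ToyThread> threads, const std::vector<int>& schedule) {
        for (int tid : schedule) {
            if (threads[tid].remaining_steps == 0) return false;  // finished/blocked
            --threads[tid].remaining_steps;                       // "execute" one op
            std::printf("ran thread %d\n", tid);
        }
        return true;
    }

    int main() {
        std::vector<ToyThread> threads = {{3}, {2}};   // t0 has 3 steps, t1 has 2
        replay(threads, {0, 0, 1, 0, 1});              // one terminal schedule
    }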
There are a number of tools that have implemented systematic concurrency testing. I would say the
two best known tools are Verisoft and CHESS. Verisoft was developed by Patrice Godefroid when he
was at Bell Labs. He’s now at MSR. The CHESS tool was developed by colleagues at MSR here and I
guess some of the interns. Yeah, I think that both of these tools have had quite some impact in finding
bugs in real world concurrent programs.
The basic idea of SCT then is if we consider the space of schedules as a tree. From some initial state a
thread runs, executes an instruction, executes another instruction. Then there are, we get to a point where
there are several options of which thread could be scheduled next. Here t one, t two, or t three could
be scheduled. The systematic concurrency tester makes a decision about which thread to schedule, in
this case t one. Then t two or t three could be scheduled, so t one is now blocked. The tool considers t
two and then this leads to termination. This is a terminal schedule and then these dotted circles are
unexplored schedules. These are schedule prefixes which demand more exploration.
Okay, so this terminal schedule we can refer to as t one, t one, t one, t two. Because we have a fixed
input and the only non-determinism comes from the scheduler. The sequence of thread IDs precisely
characterizes the states reached during this schedule. Then we have these unexplored schedule
prefixes, so t one, t one, t one, t three, which is where we get down to the bottom left. Where we get
to this node here, okay, and then t one, t one, t two, t one, t one, t three these can be explored in future
executions.
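As a rough sketch rather than any real tool's code, the exploration itself can be pictured as a worklist of schedule prefixes: each execution follows its prefix and records the alternative choices it did not take as new prefixes for later runs. The toy "program" here is again just a step count per thread; everything else is hypothetical.

    // Toy sketch of naive systematic exploration over schedule prefixes.
    #include <cstdio>
    #include <vector>

    static const std::vector<int> kSteps = {2, 2};  // two threads, two visible ops each

    // Threads still runnable after executing the given prefix of thread IDs.
    static std::vector<int> enabled_after(const std::vector<int>& prefix) {
        std::vector<int> done(kSteps.size(), 0);
        for (int tid : prefix) ++done[tid];
        std::vector<int> enabled;
        for (size_t t = 0; t < kSteps.size(); ++t)
            if (done[t] < kSteps[t]) enabled.push_back((int)t);
        return enabled;
    }

    int main() {
        std::vector<std::vector<int>> worklist = {{}};  // start from the empty prefix
        int terminal = 0;
        while (!worklist.empty()) {
            std::vector<int> prefix = worklist.back();
            worklist.pop_back();
            std::vector<int> enabled = enabled_after(prefix);
            if (enabled.empty()) { ++terminal; continue; }   // a terminal schedule
            for (int tid : enabled) {                        // queue every alternative
                std::vector<int> next = prefix;
                next.push_back(tid);
                worklist.push_back(next);
            }
        }
        std::printf("terminal schedules: %d\n", terminal);   // prints 6 for this toy
    }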
Then we might look at this execution next which would then give rise to you know we would have now
two terminal schedules explored. Then a bunch more unexplored schedules. The really good thing
about systematic concurrency testing is that it’s relatively easy to apply to real programs. What you
have to do is essentially make a concurrency unit test for your program. That may not be trivial. In our
study we actually devoted quite some attention to discussing the challenges associated with doing that.
But if you can get this concurrency unit test you then can run SCT fully automatically. There’s no need
for any sort of static analysis or invariants or anything like that. You just run and if you do find a bug
you could then reproduce that bug to your heart’s content in order to debug the problem. You don’t get
any false alarms because you’re really executing the program. If it’s possible to execute the program in
all schedules up to some bound and we will talk, I’ll talk later in the presentation about schedule
bounds. Then you can get a bounded guarantee of the program's correctness on this test input. That
bounded guarantee may be useful. The problem is though that concurrency bugs may still be very hard
to find because the schedule space is so vast.
I presume this is all making sense so far? Yep, okay. There are a couple of standard
optimizations which you could do. First is reducing the scheduling points to visible operations. This was
something that Verisoft did from the start, sorry, the Verisoft technique, not the separation logic
VeriFast tool. You schedule only at operations that could be visible to other threads, shared memory
accesses, lock, unlock operations, etcetera.
The observation being that invisible operations cannot influence other threads until a visible operation
occurs. The CHESS tool schedules only at synchronization operations. Rather than scheduling with
every read and write you schedule only at thread create, thread join, lock, and unlock. If you guarantee
detecting data races and flagging them up as bugs then you’re guaranteed not to miss any bugs if you
employ this sort of reduction.
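A minimal sketch of what that reduction means at the code level, with hypothetical hook names rather than the actual Verisoft, CHESS, or Maple interfaces: scheduling points are inserted only before operations that other threads could observe, and never before purely thread-local work.

    // Sketch: schedule only at visible operations (hypothetical hook names).
    #include <mutex>

    void schedule_point() { /* stand-in: a real SCT tool would yield to its scheduler here */ }

    int shared_counter = 0;   // shared state: accesses to it are visible operations
    std::mutex m;

    void worker() {
        int local = 21 * 2;   // invisible, thread-local: no scheduling point needed
        schedule_point();     // before lock(): a synchronization (visible) operation
        m.lock();
        schedule_point();     // before the shared write: a visible operation
        shared_counter += local;
        schedule_point();     // before unlock()
        m.unlock();           // a CHESS-style reduction would schedule only at the
                              // synchronization ops and rely on race detection
    }

    int main() { worker(); }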
Both of these are forms of partial order reduction. Then there’s a method called Dynamic partial order
reduction from Flanagan and Godefroid in POPL two thousand and five, which reduces the search based on
the happens-before relation, detecting conflicts during execution. These are all appealing
reduction methods because they’re sound. In a sense they don’t miss any bugs, okay.
However if we’re willing to potentially miss bugs then we can do something more drastic but potentially
much simpler and more useful, which is schedule bounding. The idea is as follows. There is a hypothesis
that realistic concurrency bugs don’t require too many context switches to manifest.
Of course we could sit down together right now and we could write a concurrent program that will only
crash if seventeen threads interleave in one particular order, right. But no programs like that actually
exist, okay. I could be convinced that there may be programs that require say six or seven interleavings
in some strange order. But this seems to be rare. Most concurrency bugs appear to be exposable using
only a small number of context switches.
This motivates the idea of restricting search to only schedules that do a certain bounded number of
context switches. This can drastically reduce the schedule space. If this hypothesis about concurrency
bugs is true it can still be useful in finding bugs and potentially can provide a bounded guarantee. It may
be feasible to explore all schedules that involve up to say three context switches. If you can prove there
are no bugs up to this depth that gives some confidence in the correctness of the concurrency test. You
might even argue that knowing that any bug would require more than three context switches gives you
some feeling for the low probability of such a bug occurring in practice.
The idea of schedule bounding, and there are two key methods, preemption bounding and delay
bounding, which I'll come to in a minute, is as follows. We would explore, within the space of all
schedules, all schedules involving zero preemptions. This may be a very small set. A super
set would be all schedules involving up to one preemption, or up to two preemptions. Then in the limit
if we carried on exploring schedules with more and more preemptions then we would in theory explore
the whole space.
Preemption bounding to my knowledge was first proposed by Musuvathi and Qadeer in PLDI two
thousand seven. Delay bounding was proposed more recently by Michael Emmi, Shaz Qadeer, and
Zvonimir Rakamaric in POPL twenty eleven. I’ll talk a little bit about preemption bounding and delay
bounding. Then I’ll get onto the empirical study itself.
In this diagram I’m illustrating the difference between a context switch that is forced versus unforced.
In red a schedule has used zero preemptions and in yellow one preemption. If you look here thread one
executes and then thread one, thread two, and thread three are enabled. Okay, so if thread one
continues to execute there’s been no preemption. There’s not been an unforced context switch.
However if control switches to thread two or thread three then there’s been one preemption. This
schedule has cost one preemption.
On the other hand if thread one was blocked at this point then it would cost no preemptions to switch
to either thread two or thread three, because there’s no choice. It’s not possible to continue execution
of thread one. These are unforced context switches. If we look at this slightly more complicated
example we can see for instance here this red path is a schedule with zero preemptions. This yellow
path or this yellow path are schedules with one preemption. This path that ends up blue has two
preemptions. Any questions regarding this?
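As a small illustration of how the preemption count is charged, assuming a scheduler interface roughly along these lines rather than any particular tool's: a switch away from a thread that could have continued costs one preemption, while a switch away from a blocked thread is free.

    // Sketch: charging a preemption only for unforced context switches.
    #include <cstdio>
    #include <set>

    // Returns 1 if choosing `next` instead of continuing `current` consumes a
    // preemption, 0 otherwise.
    int preemption_cost(int current, int next, const std::set<int>& enabled) {
        if (next == current) return 0;              // no context switch at all
        if (enabled.count(current) == 0) return 0;  // current is blocked: forced switch
        return 1;                                   // current could have continued: preemption
    }

    int main() {
        std::set<int> enabled = {1, 2, 3};
        std::printf("%d\n", preemption_cost(1, 1, enabled));  // 0: keep running thread 1
        std::printf("%d\n", preemption_cost(1, 2, enabled));  // 1: thread 1 was still enabled
        std::printf("%d\n", preemption_cost(4, 2, enabled));  // 0: thread 4 is blocked
    }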
Okay, delay bounding is I guess slightly less obvious than preemption bounding. Let me try and
explain it. The idea of delay bounding, I’ll give you the idea first and then I will try to give you some
intuition for why it could be useful. The idea is to fix a deterministic scheduler. For example a round
robin non-preemptive scheduler, but it can be any deterministic scheduler. If you run a fixed input
deterministic test case with such a scheduler then there will be one schedule, right.
Okay, so the idea of delay bounding is that during systematic testing we use this scheduler but we can
deviate from the schedule by skipping over a thread at the cost of one delay. In the study as in prior
work on CHESS we consider delay bounding with respect to a non-preemptive round robin scheduler.
Okay, so let me try and illustrate further how delay bounding works and then I’ll talk briefly about the
intuition. Suppose we have four threads. Initially only thread one is enabled and we’re using this round
robin scheduler. Thread one executes until it becomes blocked. Even if threads two, three, four
become enabled thread one carries on executing. If at some point thread one becomes blocked then we
go to thread two. Okay, if threads two and three become blocked we go to thread four. If threads four
and one become blocked we go to thread two. This is the round robin scheduler.
But to illustrate delay bounding, at this point suppose we've got the situation where thread two is
executing and it's enabled. Then at a cost of no delays it can carry on executing. That's the
default thing the scheduler would do. At a cost of one delay we can skip to the next thread. We do an
unforced context switch to thread four. Note it doesn't cost anything to skip over thread three
because thread three's disabled. Okay, and at a cost of two delays we would skip thread four as well and
go on to thread one. Okay, so this illustrates expending zero delays, one delay, and two delays.
Here is the schedule tree. Suppose we’ve got to this point having expended no delays. We can carry on
having spent no delays by continuing with thread two, or we can switch to thread four at cost of one
delay or thread one at cost of two delays. We now may consider any schedule that uses up to D delays
for some small number D.
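A sketch of the choice delay bounding makes at one scheduling point, under the non-preemptive round-robin scheduler described above; the helper below is hypothetical, not the CHESS or Maple code. Skipping past an enabled thread costs one delay, and disabled threads are skipped for free.

    // Sketch: which thread a delay-bounded round-robin scheduler picks next.
    #include <cstdio>
    #include <vector>

    // Picks the thread to run given the current thread, which threads are enabled,
    // and how many delays we are willing to spend here. Returns -1 if spending
    // that many delays is not possible.
    int delayed_round_robin(int current, const std::vector<bool>& enabled, int delays) {
        int n = (int)enabled.size();
        int skipped = 0;
        for (int i = 0; i < n; ++i) {
            int t = (current + i) % n;        // round-robin order starting at current
            if (!enabled[t]) continue;        // disabled threads are skipped for free
            if (skipped == delays) return t;  // spent exactly the requested delays
            ++skipped;                        // deliberately skip this enabled thread
        }
        return -1;
    }

    int main() {
        // Thread IDs are 0-based here: threads 0, 1 and 3 enabled, thread 2 disabled,
        // thread 1 currently running.
        std::vector<bool> enabled = {true, true, false, true};
        std::printf("%d\n", delayed_round_robin(1, enabled, 0));  // 1: carry on
        std::printf("%d\n", delayed_round_robin(1, enabled, 1));  // 3: skip 1, 2 is free
        std::printf("%d\n", delayed_round_robin(1, enabled, 2));  // 0: skip 1 and 3
    }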
Let me try to give you the intuition for why delay bounding can be effective in practice. I have to
confess that I don’t have a terribly strong intuition. Our study confirms that it does work better than
preemption bounding. My intuition is not all that strong for why, so maybe Shaz can comment as well.
But my intuition is that preemption bounding can be useful because it drastically reduces the
schedule space, but it does consider the possibility where at some non-forced point a thread yields to
another thread. The problem is, with preemption bounding the thread yields to any other thread. Then
you get this explosion of possible threads.
But it doesn’t seem necessary to consider all possible threads that could be yielded to. The key thing is
to stop this thread executing and let someone else have a go. With delay bounding we are
deterministically picking who gets to go. But not completely deterministically because we have this
costly non-determinism, so if we want to deviate quite far from the schedule this costs something. Okay,
so if we use a small delay bound we're not going to consider these rather costly sources of
non-determinism by default. We're going to consider the cheapest, which is just to
stay with the current thread, or something slightly more expensive, which is to skip on by one thread or
two threads.
Yeah Tom?
>>: I thought your other intuition was going to be dealing with huge numbers of threads. This goes to the state space
explosion. If I have a preemption point where I have a hundred enabled threads, right, then I have many,
many choices of which thread runs next.
>> Alastair Donaldson: Right.
>>: If most of those threads are symmetric it certainly really doesn’t matter. But with preemption
bounding I sort of treat them as if they’re all unique. With delayed bounding I don’t suffer that problem.
I think I can deal with huge numbers of threads. There will just be some threads that don't get
[indiscernible].
>> Alastair Donaldson: Well I mean absolutely the idea of the large number of threads leading to a
schedule explosion and that going away with the delay bounding. To me that completely makes sense. I
guess the thing you’re adding there is saying that if these threads are kind of roughly the same there’s
nothing especially interesting about all of them. Then it really would not be particularly effective to
consider all these possibilities.
>>: Right.
>> Alastair Donaldson: Okay.
>>: Yeah, well even if they’re not symmetric though. I mean if thread four can do something bad to
thread one it may not matter that thread two intervenes…
>> Alastair Donaldson: Right, you’re going to get to thread four eventually. The key thing is getting
away from thread one at a certain point.
>>: Right.
>> Alastair Donaldson: Letting thread four do its bad thing, yeah. Okay.
>>: One more quick…
>> Alastair Donaldson: Yeah.
>>: I think this can also be a function of the density of the dependency [indiscernible]. If the
dependencies are very sparse, like in message passing, sort of wildcard choices make the difference. That
might be another knob to turn.
>> Alastair Donaldson: Okay.
>>: The examples where the dependence is very, very sparse, and then keep increasing the coupling.
>> Alastair Donaldson: You think that if there’s sparse dependency then delay bounding is likely to work
well?
>>: No, no the outcome would be good to [indiscernible] that numbers address.
>> Alastair Donaldson: Okay, yeah [indiscernible]. We can apply these methods iteratively. The idea is
rather than necessarily saying we're doing delay bounded search with a delay of three, we can actually
just say we've got an hour or we've got a hundred thousand schedules. We're going to try all schedules
with a delay of zero; if we manage to finish all of those, well, there will only be one of them, so we
will. Then we will try all schedules with a delay of one. There will be many more of those. Then all
schedules with a delay of up to two, then all with a delay of up to three, until we either exhaustively
explore or we run out of time, or we reach some agreed number of schedules. I use IPB for
iterative preemption bounding and IDB for iterative delay bounding.
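The iterative scheme can be pictured roughly like this, where explore_with_bound is a hypothetical stand-in for "run every schedule within the given bound", not an actual Maple or CHESS function:

    // Sketch of iterative bounding (IPB/IDB) within a fixed schedule budget.
    #include <cstdio>

    struct Result {
        bool bug_found;
        int schedules_used;
        bool exhausted;   // true if every schedule within this bound was explored
    };

    // Hypothetical: would drive the program under test; here it just fakes numbers.
    Result explore_with_bound(int bound, int budget) {
        (void)budget;
        return Result{false, bound + 1, true};
    }

    int main() {
        int budget = 10000;                       // e.g. ten thousand schedules
        for (int bound = 0; budget > 0; ++bound) {
            Result r = explore_with_bound(bound, budget);
            budget -= r.schedules_used;
            if (r.bug_found) { std::printf("bug at bound %d\n", bound); return 0; }
            if (!r.exhausted) break;              // ran out of budget inside this bound
        }
        std::printf("no bug within the budget\n");
    }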
The claims of prior work are as follows. A low schedule bound is effective at finding bugs. Then
schedule bounding provides benefit over more naïve systematic techniques like just a straightforward
depth-first search of all schedules. Then delay bounding is better than preemption bounding in the sense
that it's faster at finding bugs because you have a smaller schedule space to search. You can find
these bugs in that smaller space.
What we felt was that prior work is mainly from Microsoft and it uses a lot of non-public benchmarks.
The benchmarks sound very interesting, but by and large they're not
publicly available. It's hard to independently validate this research. Then there are all these papers
about concurrency analysis which use benchmarks whose names you see cropping up
again and again.
What we thought would be good would be to try to get all of these open source benchmarks and
implement, re-implement the algorithms for delay bounding and preemption bounding, and do a
study to A, assess how effective these techniques are with respect to each other and with respect to
naïve systematic search, and B actually assess the benchmarks themselves. These benchmarks that
people are using are they any good? Or you know and I’ll come back to that.
The experimental method, okay, so what we did was we took this tool called Maple which is from
researchers at the University of Michigan and at Intel. This was published at OOPSLA twenty twelve.
The actual paper about Maple is to do with coverage guided hunting for concurrency bugs. That’s not
really at all the focus of our study. But Maple was a suitable choice for us because it was independently
implemented. It’s not something that we implemented. It supports systematic concurrency testing and
its open source.
What we did was we took Maple and we added support for delay bounding. We studied, when I say we
I really mean Paul Thomson. He studied the source code of CHESS quite carefully and tried to make
sure things were implemented similarly to CHESS. Then we got hold of all of the buggy concurrency
benchmarks that we could find from existing papers that are amenable to systematic concurrency
testing.
In our paper we go to some length to explain which benchmarks we have to exclude. For instance there
were quite a lot of benchmarks that involve GUIs. That’s very difficult to apply systematic concurrency
testing without doing quite some significant work. Tom?
>>: These are C programs, C++?
>> Alastair Donaldson: Yeah these are C++ programs, C++ benchmarks. Okay and the Maple tool works
using the Pin binary instrumentation framework.
>>: Okay.
>> Alastair Donaldson: These are, we compile these programs and then we use binary instrumentation
to do systematic testing.
>>: You’re saying the point on Maple was to use coverage techniques to try to guide which schedule to
select next?
>> Alastair Donaldson: Yes, so Maple does, Maple is non-systematic. It controls the scheduler but it
uses heuristics to try and find a schedule that’s likely to find bugs.
>>: Okay.
>> Alastair Donaldson: That, in our paper because we could we did actually compare with Maple.
>>: Oh, okay.
>> Alastair Donaldson: I’m not going to present that data though because that’s not, that wasn’t our
aim at all in the study.
>>: Yeah, okay.
>> Alastair Donaldson: We found these fifty-two buggy benchmarks. I mean there were many more
benchmarks and we whittled them down to fifty-two which were amenable to systematic concurrency
testing with modest effort. These are all public code bases. We’ve amalgamated them with the
permission of the various authors. We call this SCTBench, systematic concurrency testing benchmarks.
This is now a publicly available benchmark suite.
What we did was we looked at three techniques initially, iterative preemption bounding, iterative delay
bounding, and naïve depth-first search. We applied every technique to every benchmark. We gave every
technique ten thousand schedules in which to try to find a bug. We used schedules rather than time
because we believe it's a more future-proof metric.
We, running this kind of experiment takes serious time. We had to do it on a cluster. Furthermore we
had to do it on a cluster where many benchmarks were running on the same node at the same time.
Then timing becomes unreliable, whereas the number of schedules doesn't. If
people want to compare with our techniques in future they can compare based on the number of
schedules. Even if they don’t use anything like the same hardware we’re using.
Did you have a question, Tom?
>>: Yeah, I was wondering why you didn't consider the concurrency fuzzing approach as well?
>> Alastair Donaldson: The concurrency fuzzing approach?
>>: Yeah.
>> Alastair Donaldson: Because it’s not systematic so we wanted here to look at systematic methods,
stateless model checking.
>>: Okay, okay.
>> Alastair Donaldson: There’s also kind of combinatorial explosion of how many things you compare.
We, there are like various things we’d like to look at like PCT for example. But in…
>>: Yeah, right.
>> Alastair Donaldson: In this study we you know what it’s like you get closer and closer to actually
trying to submit a paper. Eventually you think right for this study we’re going to have to just rein things
in a bit and…
>>: But that’s like you’re just logging that scheduler and just see what happens? Yeah and that would
be good for the paper on checking other data.
>> Alastair Donaldson: Yeah, absolutely, I mean we’re not finished with this study. We have a paper on
the study. But we’d like to do more.
>>: I think the [indiscernible] work that’s open source too.
>>: Yes.
>>: Right, so you could actually have this completely independent implementation…
>>: [indiscernible]
>>: [indiscernible] work is, I don’t think it does the systematic concurrency testing.
>>: No it doesn’t do systematic and, oh, I see what you’re saying...
>>: [indiscernible]. But the scheduler, the PCT scheduler you know you can use it in a non-systematic
way or in a systematic way.
>>: Well right, right, yeah I’ve got it, I got it, okay.
>> Alastair Donaldson: That’s absolutely on our to do list of things. Especially given, I’m going to talk a
bit about random scheduler in a minute. We’re very going to try PCT given the results we have with the
random scheduler.
>>: Okay.
>> Alastair Donaldson: Alright, so a quick word on data races. Over half of the benchmarks we found to
contain data races by using a dynamic data race detector. These data races are found almost instantly
by any of these methods. If we treat data races as bugs we really wouldn't be able to distinguish
between the bug finding ability of these methods because they would all just find a bug, okay.
We had a look at some of the benchmarks. In some cases it’s kind of clear that the developers would
regard these races as benign. In other cases it's things like incrementing a counter in a histogram, which
really should be done with a relaxed atomic in C++ eleven. In the end we decided to do what has been done
in prior work: run dynamic race detection up front and find a load of data races. Then look at all of
the instructions that participated in some data race and promote every one of those instructions to be a
visible operation. Treat each one as a sync op and then ignore the fact that there may be further data
races, and do systematic concurrency testing scheduling at sync ops and known racy instructions. But
every time those instructions are executed, whether or not it's a racy scenario, we consider them sync
ops.
This essentially explores sequentially consistent outcomes of those data races, and is oblivious to future
data races. We argue that this is, A, what has been done in some prior work, and B, it's not biased
towards any of the particular techniques we're evaluating.
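A small sketch of that race-handling step, with hypothetical names since this is not Maple's actual interface: the program counters of instructions seen to race in an up-front run are remembered, and every later execution of those instructions is treated as a visible operation.

    // Sketch: promoting known racy instructions to visible operations.
    #include <cstdint>
    #include <cstdio>
    #include <set>

    std::set<uintptr_t> racy_pcs;   // filled by an up-front dynamic race detection run

    void schedule_point() { /* stand-in: a real SCT tool would yield to its scheduler here */ }

    // Hypothetical hook the instrumentation would call before every memory access.
    void before_memory_access(uintptr_t pc) {
        if (racy_pcs.count(pc))     // this instruction once raced: promoted to a visible op
            schedule_point();       // so it always gets a scheduling point from now on
    }

    int main() {
        racy_pcs.insert(0x401234);          // pretend the race detector flagged this PC
        before_memory_access(0x401234);     // gets a scheduling point
        before_memory_access(0x401238);     // does not
        std::printf("done\n");
    }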
Okay, so here are the names of the benchmarks. I won’t spend too long on this slide. But you may be
familiar with some of these names, aget, and pbzip I think are very famous in papers about concurrency
testing. Then we have this rather large set of examples which come from the TACAS Software
Verification Competition.
May I say we’re just taking the benchmarks we can find and that people are using in their papers, and
people are boasting about it. Here for example you can see that there many variance of a dining
philosophers problem. You may or may not think that good in a benchmark suite. But these are the
benchmarks that we could find.
>>: These, all these look like small programs.
>> Alastair Donaldson: Yeah, so…
>>: The list for you guys is it big or small?
>> Alastair Donaldson: I mean I’m not intimately familiar with the source code of these things. But I
think that pbzip is moderate sized. None of these are massive programs. Or, some of them come from
massive applications but they're, sometimes I think it can be really misleading in these papers to say we
tried this on SQLite which is two hundred thousand lines of code, when actually the test is running on a
hundred of those lines, so you know.
Then there are some benchmarks from CHESS which Paul ported to Linux. There were [indiscernible]
queue benchmarks which we thought were very interesting which were available. There are some
PARSEC benchmarks, some RADBench benchmarks, and some of the SPLASH benchmarks. In each of these
suites there are, I think in all cases there are more benchmarks than just these. But there were always
examples that it would have been really quite difficult to make amenable to systematic concurrency
testing. There was only so much effort we could put into curating this benchmark suite for the study.
Okay, so the top level Venn diagram of results is as follows. We have fifty-two benchmarks, seven of
them none of our techniques could find bugs in, not preemption bounding, delay bounding, or naïve depth-first search.
Thirty-three of them we could find bugs in and in all of those cases the bugs could be found by one of
the bounding methods, okay within ten thousand schedules. There’s never a case where doing this
naïve depth-first search was better in terms of bug finding on these benchmarks.
We can see here that delay bounding is significantly better than preemption bounding in that there are
seven benchmarks where delay bounding was capable of finding the bug and preemption bounding was
not.
>>: The seven benchmarks where none of the techniques, do you know if there’s a bug in there, or no?
>> Alastair Donaldson: All of these things have bugs. We know they all have bugs, so I'll come back to
the bugs we couldn't find. We know anecdotally they have bugs, right.
>>: Yeah.
>> Alastair Donaldson: Like people say there’s a bug in this and they explain the bug in English, right.
>>: Okay, yeah…
>> Alastair Donaldson: It could be that they’re wrong and you know…
>>: It would be interesting to look at that.
>> Alastair Donaldson: Yeah, but I’ll come back to that. Okay so what this does is confirms two claims of
prior work. First of all for this benchmark suite supports the claim that schedule bounding is effective at
finding bugs. Furthermore it supports the claim that delay bounding is superior to preemption bounding
for finding bugs, okay.
But one of the reviewers of our PPoPP paper suggested that it might be interesting just to try a
completely random scheduler. A scheduler which at every scheduling point randomly selects a thread, it
doesn’t use any PCT type stuff. It just randomly selects a thread. We read that and to be honest we
thought kind of grown you know I guess we should try that but it’s going to be some more experiments
to run.
[laughter]
But we did it, okay, and this is the result, right. What we found was that the random scheduler, the
completely naïve random scheduler, was able to find all except one of the bugs that could be found by
the systematic methods. Furthermore it could find an extra bug. Furthermore, which you can't see here,
it finds these bugs significantly faster in most cases than delay bounding. Of course that could be due to
luck in terms of the random schedules.
Let me explain again about the random scheduler. We're picking on the fly, we're making random
choices. We're not recording those choices, so we could be doing the same schedule twice in a set of ten
thousand schedules. We would just do ten thousand random schedules and see how far we get, okay.
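A sketch of what that naive random scheduler amounts to, assuming a helper that just picks uniformly among the enabled threads; the choices are pseudo-random, so fixing the seed makes a run reproducible.

    // Sketch: a completely naive random scheduler choice.
    #include <cstdio>
    #include <random>
    #include <vector>

    // Picks uniformly among the currently enabled threads.
    int pick_random_thread(const std::vector<int>& enabled, std::mt19937& rng) {
        std::uniform_int_distribution<size_t> dist(0, enabled.size() - 1);
        return enabled[dist(rng)];
    }

    int main() {
        std::mt19937 rng(12345);                  // fixed seed: the run is reproducible
        std::vector<int> enabled = {0, 2, 3};     // threads currently able to run
        for (int i = 0; i < 5; ++i)
            std::printf("schedule thread %d\n", pick_random_thread(enabled, rng));
    }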
We find this very surprising. I guess if I'm completely brutally honest it suggests one of two things, or
a mixture of these two things. It either suggests that SCTBench, this set of benchmarks
which people are using and we gathered, is not representative of real world bugs. Or it suggests that
actually these bounding methods don’t provide benefit over just a naïve random approach for raw bug
finding.
I’ll talk a little bit about how good this benchmark set is in a minute. But one thing I, in defense of
preemption bounding and delay bounding is first of all it’s, I think it’s definitely true that if you find a
bug using a delay bound of one it’s likely that the counter example you get is going to be much more
palatable than the counter example you get from some crazy random schedule.
Okay, the second thing is that because these results support the hypothesis that bounding is useful that
adds weight to these sequentialization methods. Context bounding and certain forms of delay bounding
admit sequentializations where you can take a concurrent program and you can rewrite it into a
sequential program such that the sequential program exhibits all of the behaviors of the concurrent
program up to some bound. Then you can use static verification methods from the sequential land to
prove not for just one input but for all inputs that this concurrent program is correct up to this bound.
Now how useful is such a claim? Well it’s potentially quite useful if it seems really to be true that you
can find bugs with small bounds. These results I guess add weight to that use of bounding methods,
even if they potentially detract from their use just for bug finding.
>>: The PCT work, which compares pure random to the guided approach, really does see a significant
difference. But that's on substantially large real applications.
>> Alastair Donaldson: Right.
>>: Like SQL Server.
>> Alastair Donaldson: Yep.
>>: You know because that technique is not systematic. It’s able to run on very large programs.
>> Alastair Donaldson: Yes.
>>: But just because it can run on large programs shouldn’t prevent us from running even small
programs…
>>: Oh, no, no, no, but what I’m saying is that you know I think these benchmarks perhaps are whittled
down in such a way that you know your schedule space is not that huge.
>> Alastair Donaldson: Right.
>>: You’re…
>> Alastair Donaldson: It’s definitely…
>>: It’s not a needle in a hay stack. It doesn’t seem like a needle in a hay stack if randoms doing this
well.
>> Alastair Donaldson: Yeah, and a second I just want to give one piece of intuition that I have for this
scenario. It seems to me that if a bug can be exposed with just one preemption. Then to me there seem
to be two extreme scenarios. One scenario is that the bug will occur if this preemption happens some
time, right, doesn’t matter when it just has to happen. In that case it seems that loads and loads, and
loads of schedules will expose the bug. Then random will find the bug.
>>: Right.
>> Alastair Donaldson: On the other extreme there’s the case where some preemption has to happen
but at a really key point. If it doesn’t happen at that key point you won’t get the bug. Then that’s as you
say, Tom, like searching for a needle in a haystack. We wouldn't expect random to do well.
>>: Yeah but the other thing you’re not accounting for is random is finding the bugs in fewer guesses
than IDB.
>> Alastair Donaldson: Yes, right.
>>: Systematic is penalizing you.
>> Alastair Donaldson: Right.
>>: Right, it’s doing worse scenario.
>> Alastair Donaldson: The only intuition I have there. This came from Hana Chockler, King's College
London, who saw Paul talking about his work and she, I'll try and say what Paul told me she said to him,
which is that if you imagine the tree of schedules, then shallow terminal schedules are likely to be
favored by random, right. If you have a bug that can crash the program very quickly then it will have a
shallow terminal schedule. There may be a number of these shallow
terminal schedules, so you will have a high probability of hitting one of those. Does that make sense?
>>: Well yeah but the shallowness helps both methods.
>> Alastair Donaldson: Except that, something I haven't mentioned in this talk but is in the paper, is
that for IDB and IPB you have to pick a search method underneath. We're using
depth-first search underneath IPB or IDB.
>>: Why is that? I don’t understand.
>> Alastair Donaldson: That’s mainly because in Maple that was a pretty core engineering decision and
was hard to reverse. Maple is running, you know you’re running and you’re recording your schedule.
Then you’re trying to schedule variant. There are many ways to enumerate the schedule variance and a
simple way is depth first in enumeration of schedule variance with a schedule bound of one say, okay.
That will not necessarily favor these shallow bugs. Does that make sense?
>>: No, I guess I’m misunderstanding what you mean by, what do you mean by shallow?
>> Alastair Donaldson: I guess if you imagine the space of all schedules as tree. Then shallow would
mean terminal schedules which are not very deep in the tree.
>>: But does not deep mean few preemptions [indiscernible]?
>> Alastair Donaldson: It means few instructions executed full stop.
>>: Oh, okay.
>> Alastair Donaldson: Yeah.
>>: Why does the number of instructions executed matter?
>> Alastair Donaldson: Because if you imagine the random scheduler, every time it reaches a sync op it
makes a decision. If there are loads and loads of shallow terminal schedules that end with a crash then
if it makes random decisions it's quite likely to get to one of those rather than if there's a very, very deep
crash. Yeah?
>>: Deep meaning, but deep would mean in terms of the number of scheduling decisions not in terms of the
number of instructions, right? We had counting schedules, well not counting…
>> Alastair Donaldson: I wonder if we might be able to talk about it after because I would like to, it’s
something I haven’t thought about that much and would like to think about more, and yeah.
>>: Because it seems like what you're doing, what's happening here is IDB is systematically off, right.
It's going off on some piece of the search space where the solution isn't.
>> Alastair Donaldson: Right.
>>: You know whereas apparently the space of solutions is quite dense and you'll find one quickly if
you use randomness.
>> Alastair Donaldson: It does seem that way and we can look at the big table, so this is the big table. I
mean we’re not going to look at it now. But this table has all the data in it, right. You know it, we
couldn’t list this in the paper because there are only so many observations we could make ourselves and
fit in. This has really got a lot of information. If you're interested we could offline maybe pore over the
table and look at some of the results.
We’re very curious to understand this. We were very surprised by this random. We’re very grateful to
the reviewer if you’re watching for suggesting we do this. Unfortunately we didn’t have a chance to
have our comments about the random scheduler reviewed because we put these in the final version of
the paper. Ganesh?
>>: Yeah, actually [indiscernible] experiment when I was visiting Intel for a year. We took [indiscernible]
and then used parallel random runs, basically used different seeds.
>>: Okay.
>>: We would get bugs [indiscernible].
>>: Right.
>>: We didn’t record the traits that was the problem.
>>: Okay, you didn’t record the traits at the…
>>: Yeah because it was by the book. The other thing is we also did a count of how many variants of the
same error there are because we don’t [indiscernible]. [indiscernible] ten thousand and twenty
thousand times in the state space.
>>: Yeah.
>>: You don’t need to hit that sort of instance because the core [indiscernible] populated all over the
state space, so random is not [indiscernible].
>>: When you say random I presume you’re saying pseudo random. You could reconstruct the bug if
you…
>> Alastair Donaldson: Yes, pseudo random.
>>: That’s right.
>>: Showed somebody how it misbehaves.
>>: That’s right.
>> Alastair Donaldson: Yeah, right but I think we should focus now on the benchmarks and their
limitations. Because let’s not get too carried away because some of these benchmarks are not good.
But the main findings then I think I’ve covered these. Schedule bounding was similar in terms of bug
finding ability to random search.
Many bugs can be found with a small bound. Delay bounding beats preemption bounding. But what I
want to talk about now is that a significant number of the benchmarks may be regarded as trivial in some
sense. This is quite important because if benchmarks are trivial researchers should not boast about
finding bugs in them. They should become a minimum baseline. If your tool can’t find these bugs your
tool is not a tool. You need to boast about finding you know better bugs than these bugs.
[laughter]
I hope that our study can potentially set a clear baseline for at least systematic concurrency testing
techniques. Maybe concurrency techniques in general. So we have this big table of benchmarks, all of
these benchmarks, which I'm not going to allude to by name here. But, yeah, well let's
see, so trivial benchmarks. Here's a property: the bug was found with delay bound zero, fourteen
benchmarks. Right, so what this means is the single schedule with delay bound zero…
>>: [indiscernible] round robin.
>> Alastair Donaldson: Find the bug.
>>: These should be just stricken.
[laughter]
>> Alastair Donaldson: But let me emphasize again we just wanted to take the benchmarks people are
using and study them, right. Okay, the numbers here there is overlap between the numbers. This isn’t a
partitioning of the benchmarks. There were sixteen benchmarks where, Tom you were correct to say
that the schedule space is not vast. There were fewer than ten thousand terminal schedules overall,
forgetting bounding. I mean, and what this means is that all of the methods we studied would eventually
find the bug. Because they would all exhaust the search space, right.
Then it might be that delay bounding would get there faster or preemption bounding would get there faster,
maybe, maybe not, but every time it would get there. We found in nineteen cases that more than fifty
percent of the random schedules we tried exposed a bug. Now of course that could be luck but I think it
suggests that there are loads and loads of schedules that expose a bug in these benchmarks. Okay and
then furthermore, for nine of those, every random schedule we tried was buggy, okay.
>>: That’s really weird.
>> Alastair Donaldson: Then in these nine…
>>: Is there a default of the other…
>> Alastair Donaldson: Right, and in these nine, I would have to check the exact number, but I think for
four or five of them every schedule we tried with every technique hit a bug. Going back to the
beginning of my talk we wouldn’t actually call those concurrency bugs because they are not schedule
dependent, right.
We included them in the study because they are claimed to be concurrency bugs by…
>>: I think the same is true for delay bounds here, right?
>> Alastair Donaldson: Not necessarily.
>>: You get a random schedule.
>>: No that’s over there because it depends on which scheduler you chose.
>> Alastair Donaldson: Yeah.
>>: I mean that’s a little weird because that’s like the default scheduler. That’s the thing that would
happen pretty much on an ordinary computer.
>> Alastair Donaldson: I’m sure it’s not the case because we didn’t find, otherwise we’d find this for all
fourteen. We didn’t find this, I’m saying, there’s another moment which we don’t have a column for,
which is every schedule was buggy. We didn’t find that for fourteen, definitely not.
Right, so…
>>: Just to give, just to give, I mean anecdotes, I mean on the benchmarks we were doing with CHESS,
which were the ones I worked on for .NET from library code. I mean for small like things like hash
tables, you know a parallel hash table I mean you would have you know millions of schedules, right. I
mean just small programs we would just and this is with all the bounding and everything. I mean huge
numbers, so first of all just the, I mean ten thousand…
>> Alastair Donaldson: Let me ask you were they all buggy?
>>: Well…
>> Alastair Donaldson: All these benchmarks are buggy, right?
>>: We found bugs. But…
>> Alastair Donaldson: You found bugs but what all the…
>>: What Don is saying there…
>>: I’m just saying the search space…
>>: [indiscernible]…
>>: The search space is ridiculously small for concurrency.
>> Alastair Donaldson: Well, so let me emphasize though. I’m not saying that, so what we’re saying is
there were sixteen cases where you had the small search space, right.
>>: Yeah, yeah.
>> Alastair Donaldson: For everything else there were fifty-two benchmarks. For the remaining
benchmarks we don’t know how big the search space was.
>>: Yeah, but you don’t need it...
>> Alastair Donaldson: But it was more than ten thousand. You know that it’s not going to be, it
probably not going to be ten thousand and two. It’s probably going to be bigger.
>>: Well, I’m also wondering you know how those things really are.
>> Alastair Donaldson: Yeah.
>>: Because I mean with CHESS definitely we had experience of exhausting a preemption bound of two
or three. You know we have examples you know you can make up whatever number you want. But I
mean for the real things you know we would just, we could keep exploring for a very long time on
relatively small to a complex [indiscernible].
>> Alastair Donaldson: Okay, so…
>>: These were things you know with volatiles and with interlocked operations. I mean the code was
small but the schedule space was huge.
>> Alastair Donaldson: Yeah I mean it’s like pretty clear from these, I mean I’m focusing here on the
ones we say are trivial.
>>: Okay, yeah, yeah, yeah.
>> Alastair Donaldson: Then there are the rest of them. Some of which come from CHESS and which are
harder and you know I’ll come back in a minute to some of those.
>>: Yeah.
>> Alastair Donaldson: But it sounds anecdotally that your reading of these results is probably that,
when I said that you know we draw one of two conclusions, right. Either these benchmarks are not
realistic or these methods are not providing benefits. It sounds like you’re probably thinking the first,
which I would like to think because I find these bounding methods fascinating. I’d like to study them
more.
To me it was in a way a disappointment when we found this random result. Because I kind of
thought, what are we doing then if we can just find these bugs randomly. You know I think the key thing is
the difference between bugs that, you know, programs that are just very likely to crash. Then you
probably, you know, you could argue that maybe you don't need systematic testing. Maybe the
program is so likely to crash that…
>>: Well, one was…
>>: I mean there were a number of approaches that say maybe random testing also can be improved still.
>>: Yeah.
>>: We could do better than the random scheduler. At least with the random methods, I mean the PCT
scheduler.
>>: True.
>>: I mean it’s all a matter of searching, you search a schedule space.
>> Alastair Donaldson: Yeah.
>>: How do you search that space? It could be DFS, it could be BFS, it could be probabilistic
search, and probabilistic search is random testing. Then you can ask the question, how do I want to guide my
search?
>> Alastair Donaldson: Right.
>>: I think it might be useful to qualify the benchmarks using a qualitative approach which would say
this is clearly a synthetic benchmark created by hand to insert a bug into a well known but small
concurrent program and try to find it. I mean versus you know a benchmark that comes from the wild
which is you know typically what we were going after with CHESS.
>> Alastair Donaldson: Yeah.
>>: I mean clearly everybody creates synthetic benchmarks. But we create them basically as functional
tests to say is the tool just working?
>> Alastair Donaldson: Yeah.
>>: It doesn’t find a bug or not. To also do unit testing which is a investment thing for exponential
search space.
>> Alastair Donaldson: Yeah.
>>: But to have really tiny programs that have a small search space so that the tool can, we can test
the tool…
>> Alastair Donaldson: Test the tool, right. What we wanted to do in this study like I said was just take
what there is and evaluate it and do it quantitatively.
>>: Yep, yeah.
>> Alastair Donaldson: We didn’t, we wanted to be very objective here. But I think given some of the
findings I think maybe a, you know, some more vetting of the code could be in order.
>>: You don’t have to beat a dead horse.
>> Alastair Donaldson: I’ll talk briefly about the bugs not found.
>>: [indiscernible]
>> Alastair Donaldson: There were three bugs which were not found in a rather trivial way. These bugs
could be exposed if you reduced the number of threads. You saw in the list, I think this is the dining
philosophers example where there's two, three, four, five, six, seven, I think that's the case. I can't remember
exactly. But I think in dining philosophers seven you couldn't find the bug. But it's basically the
same bug as in dining philosophers two.
I don’t really regard those as particularly interesting unfound bugs. Except that maybe delay bounding
should find them if delay bounding’s meant to be good with large thread counts, maybe, okay, yep.
Then there’s this radbench.bug number one. This is a JavaScript interpreter. This is a, we have a
scenario where thread one destroys a hash table. Thread two must access the hash table.
Paul believes that this should be exposable with only one delay. But there are upwards of fourteen
thousand scheduling points. He looked at an execution and he looked at the number of scheduling
points in that terminal execution and there were you know upwards of fourteen thousand of them. You
can see that there would be a very large schedule space.
>>: Why are you [indiscernible] on, no technique was able to find them?
>> Alastair Donaldson: Yeah, no technique found it, right. The radbench.bug two we know
requires at least three delays or preemptions. Because I believe we were able to search this one
exhaustively. We need to double check the table, okay. We didn’t find any; we were able to search it
exhaustively up to this bound of three.
>>: Then you found it.
>> Alastair Donaldson: Sorry, what I mean sorry, no, no, no, sorry that’s not true. No we didn’t find this
bug. What we did was we explored zero, one, and two. We got into three when we reached the ten
thousand limit. We know that we would need at least three preemptions to find this bug.
>>: Okay.
>> Alastair Donaldson: Yeah.
>>: Have you looked at the [indiscernible] depth metric that [indiscernible] the PCT paper?
>> Alastair Donaldson: No.
>>: Because it sounds like you're actually going after something similar here. That you know figuring
out how many scheduling points and how many preemptions, so they have this classification based on
a similar idea except it's priority lowering points, not preemptions.
>> Alastair Donaldson: Okay.
>>: It boils down to a similar thing.
>> Alastair Donaldson: Is it possible to use that metric to assess the case if you didn’t find a bug or can
you only use it when you found the bug? Or can you use it as, to give you a lower bound on…
>>: It’s not hard to actually measure this because it states what is the minimum number that you need
to find, to guarantee…
>> Alastair Donaldson: To guarantee [indiscernible] percent.
>>: Sort of quantified over all possible [indiscernible].
>> Alastair Donaldson: Okay.
>>: Might be hard.
>> Alastair Donaldson: You know we didn’t look at that. I didn’t remember that from my reading of the
papers, so I’ll need to go back and study the paper. Alright and then there’s a very interesting
benchmark which we classified as miscellaneous. This comes from Dmitry Vyukov and this is posted to
the CHESS forum. This is a lock-free stack. There are three threads and the poster of this benchmark
claims that this bug could only be exposed with at least five preemptions. There are a hundred and
fourteen scheduling points. Right, so this if he is correct and you know we couldn’t find this bug, so I
can’t say you know that we validated this. We…
>>: He has explained the bug and…
>> Alastair Donaldson: He’s explained the bug in English. Right, so his claim is that this bug cannot be
exposed with a small number of preemptions. That is, obviously there will exist kind of counterexamples
to this claim, well, you may still rule out five as a small bound. But this is a not so small
bound. It will be interesting to see whether PCT could find this. We didn't find this with random.
This sounds like the sort of thing we wouldn't find with random, right, because it requires five
preemptions; the chances of just naïve random inserting those
preemptions at the right place is very small.
Something which I won’t talk about now but which we…
>>: How long did you run, the bound was over ten thousand?
>> Alastair Donaldson: Over ten thousand.
>>: Okay.
>> Alastair Donaldson: I mean it was Zvonimir who suggested to Paul that, you know, we take that
benchmark and just run it for ten weeks or something.
>>: I find it hard to imagine that a real program needs five preemptions to hit the bug. Clearly you
can craft a program that needs that many.
>> Alastair Donaldson: Right.
>>: But I mean I would be interested to, I mean it’s probably terrible to look at so maybe not, right.
>>: But do you…
>>: But it’s really hard to believe that.
>>: The author says that he crafted this to be, to have this five preemption lower bound, or discovered
this by accident?
>>: Or is this a latent natural? I could be…
>> Alastair Donaldson: I haven’t personally looked at this. I could click the link now and we could look
at it.
>>: Ah, yeah.
[laughter]
>>: Yeah this goes back a long time, but I…
>>: Go to chess.codeplex.com, oh there it is, okay.
>>: Gosh I can’t believe that’s still off.
>> Alastair Donaldson: In what sense in that you think it’s bogus or…
>>: Number…
>>: We haven’t looked at that for a little while.
>>: Should have been scrubbed a long time ago.
>> Alastair Donaldson: Oh, I see, maybe Paul just linked the forum not the actual bug. Okay, I
don’t think I’m going to be able to find this during the talk. But let’s look at it afterwards, okay.
>>: Yeah.
>> Alastair Donaldson: Right, so now I guess it's about systematic concurrency testing, full stop.
The main problem we had doing the study was environment modeling, so GUI applications, applications
that use processes not threads, and use the network. What Paul tried to do was take those applications
and then extract from them isolated concurrency tests. Try to take out something that involves a few
threads in a scenario and use that for scenario testing. But he found often there would be global
variables and this would be very hard to do. Looking at these more real world things, it
would be very difficult to extract these test cases.
I think for people to really use systematic concurrency testing they need to be willing to put some effort
into writing concurrency tests, and trying to you know make these tests stand alone, so that you can
systematically explore the schedules. Then another issue was that many of the bugs we found were related
to memory safety. What we had to do was actually, knowing from reports on these bugs that there's
a problem with memory safety, add an assertion at the point where we know the
problem is, then see whether we could trigger that assertion.
What we don’t have inside Maple is good dynamic [indiscernible] style memory analysis to try to find
these bugs.
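To make that concrete, here is a minimal hypothetical sketch of the manual step just described (written in Python for brevity, although the study's benchmarks are native C/C++ programs; the scenario and names are illustrative, not one of the actual benchmarks). Instead of waiting for a memory-safety violation to crash the program, you assert the known safety condition at the reported point, so the schedule explorer has a concrete failure to search for:

    # Hypothetical example: depending on the schedule, a consumer can observe
    # shared state after another thread has torn it down.
    import threading

    shared = {"data": [1, 2, 3]}

    def consumer():
        d = shared["data"]
        # Bug reports say 'data' may already be gone here; making that explicit
        # lets the testing tool report reaching this state as an assertion failure.
        assert d is not None, "use-after-teardown reachable under this schedule"
        print(sum(d))

    def teardown():
        shared["data"] = None

    t1, t2 = threading.Thread(target=consumer), threading.Thread(target=teardown)
    t1.start(); t2.start()
    t1.join(); t2.join()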
>>: But there's a chance the program still can crash, right, with a seg fault or something?
>> Alastair Donaldson: Often not.
>>: Okay.
>> Alastair Donaldson: I mean if there's a buffer overflow it very often doesn't crash, and whether it
crashes may be non-deterministic. There's an engineering challenge here: if you want to do full exploration
of all schedules arising from data races you need very fast on-the-fly race detection, and if you want to
find these memory errors you need very fast on-the-fly memory analysis. Getting all those things integrated
into a tool, while keeping the flexibility to explore the scheduling strategies, is quite some engineering
challenge, but not really a research challenge.
>>: I have a comment about the second issue.
>> Alastair Donaldson: Yeah.
>>: At the start of the CHESS project there were two challenges with this kind of testing. One of them is
of course creating the isolated unit test. The second one is this combinatorial explosion. But at the start of
the project it was not clear to me that these are two very distinct challenges. Talking to engineers and
testers in the company, I found that often they think they are doing random testing for one reason, but it's
really because of the other reason.
>> Alastair Donaldson: You mean because they can’t isolate these unit tests? They just…
>>: Because they can't isolate it. A lot of the reason why concurrency testing has not been adopted is that
you can't isolate these tests, even for what you call random scheduling, right.
>> Alastair Donaldson: Yep.
>>: You are able to build a truly random scheduler only once you have gotten full control over all
[indiscernible] choices.
>> Alastair Donaldson: Right, so it’s, yeah it’s controlled random scheduling.
>>: But if you go into industry, what people refer to as random scheduling just means that they're running
the test again and again.
>> Alastair Donaldson: Yeah.
>>: Trying to [indiscernible] things, right.
>> Alastair Donaldson: That’s absolutely not what we did here.
>>: If you do this, I mean, what you are doing is already way more principled. What I'm saying is it's
way more principled than what was happening.
>> Alastair Donaldson: Yeah.
>>: In actual product setting.
>> Alastair Donaldson: But harder to apply.
>>: There is this smooth transition, right. You have the [indiscernible] case where you want a random
scheduler, and then you have testing in the real scenario where maybe you're not catching every single
piece of synchronization. Yeah, but you still have some idea that whatever worked better in the isolated
case is probably also going to be better if you only have partial control over preemptions. I mean, the same
scheduling decisions you can make in the purely isolated case for random scheduling have a good chance
of working once you apply them to the real code, even if you don't catch all the synchronization.
>> Alastair Donaldson: But how do you apply them to the real code? How do you just apply them?
>>: You can always intercept some synchronization. It's just that you can't restart the program every
time you would like to, and you have no guarantee that you won't miss some synchronization, or maybe
sometimes you guess incorrectly that a thread is blocked when it's really just slow. Those are the imprecise
approximations that you have to make in the [indiscernible]. That's what you do in
[indiscernible], right?
>> Alastair Donaldson: Right.
>>: But the [indiscernible] of those guesses is that even though they're incorrect, you're still running a
principled scheduler.
>> Alastair Donaldson: Okay, yep, makes sense. Right, so to conclude with a slide on future work and
what we'd like to do. Paul has a bunch of ideas related to extensions of systematic concurrency testing
and heuristics, but in terms of this study we would like to look at more techniques. The PCT technique is
primarily on the list, but it would also be interesting to compare with non-systematic techniques.
Now that we have this framework for running these tests, we would like to extend the benchmark suite
significantly. We didn't look here at partial-order reduction. The reason is that it's not trivial to combine
partial-order reduction soundly with these bounding techniques. Soundly in the sense that I would say
POR is combined soundly with, for instance, delay bounding if applying POR doesn't lose anything
additional to what you're already losing with delay bounding. My understanding is that this is not the
case if you naively combine them, and that it's an open problem on which people are working. I think
Madan Musuvathi had a paper with Katherine Coons and Kathryn McKinley at OOPSLA last year on
precisely this topic. That's something we would like to look at.
Then there’s this issue of weak memory in systematic concurrency testing where I would say there’s
been some preliminary work, but we would like to explore that further as well. Okay, so thank you for
listening and thanks for the questions. I’d be delighted to answer any more questions.
[applause]
>> Shaz Qadeer: More questions?
>>: Sure.
>> Alastair Donaldson: Yeah.
>>: There would be other possibilities for random. I mean, instead of completely random scheduling you
could do random delays or something like that, random perturbation from a fixed schedule.
>> Alastair Donaldson: Shaz, you said something a bit like that to me, that you even tried delay bounding
with a…
>>: [indiscernible] scheduler.
>> Alastair Donaldson: With a randomized but deterministic scheduler, where the random choices are
fixed at the beginning.
>>: Yeah, so I…
>> Alastair Donaldson: Then you have a deterministic scheduler, but it's like a randomly chosen
deterministic scheduler, yeah.
>>: Yeah, you could do that. I was thinking of it just as a bias in your random scheduler, right. You could
say, bias towards round robin.
>> Alastair Donaldson: Okay.
>>: That would put you a little closer to the delay bounding ideal.
>> Alastair Donaldson: Yeah.
>>: Rather than an unbiased run. Even though delay bounding does seem to improve on…
>> Alastair Donaldson: Yeah, that's an interesting idea. I think it's a little bit like PCT, right, which isn't
favoring round robin but is prioritizing threads and randomly changing priorities.
>>: I mean, the idea in the PCT paper was to bias the randomization so that it gets quickly to the
shallowest bugs, right; you have this characterization of that. The thing we found is that what you want to
randomize is priorities and priority-lowering points. We give all threads an initial random priority, then
the scheduler is deterministic based on those priorities, and there are random points in the execution
where we lower the priority of some thread, and that's it.
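A minimal sketch of the scheme just described (illustrative only; this is not the actual PCT or CHESS implementation, and the thread model, with each thread reduced to a fixed list of steps, is a simplifying assumption):

    import random

    def pct_schedule(threads, num_change_points, total_steps):
        # PCT-style sketch: random initial priorities, deterministic
        # highest-priority-first execution, and a few random priority-lowering points.
        names = list(threads)
        # Distinct random initial priorities; a higher value runs first.
        prios = random.sample(range(len(names), 2 * len(names)), len(names))
        priorities = dict(zip(names, prios))
        change_points = set(random.sample(range(total_steps), num_change_points))
        remaining = {t: list(steps) for t, steps in threads.items()}
        schedule, step = [], 0
        while any(remaining.values()):
            runnable = [t for t in names if remaining[t]]
            current = max(runnable, key=priorities.get)  # deterministic given the priorities
            schedule.append((current, remaining[current].pop(0)))
            if step in change_points:
                # Lower the running thread's priority below every other thread.
                priorities[current] = min(priorities.values()) - 1
            step += 1
        return schedule

    # Example: two threads, one priority change point over five steps.
    print(pct_schedule({"T1": ["a1", "a2", "a3"], "T2": ["b1", "b2"]}, 1, 5))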
>> Alastair Donaldson: Okay, you had a question in the back.
>>: There is that one famous concurrency bug in the Space Shuttle, where a launch had to be scrubbed
on the launch pad. Did you try to add it to your list?
>> Alastair Donaldson: We didn’t try to add that to our list. I don’t know whether it’s a C++ program
with a fixed input and, yeah, but no we didn’t. Kanesh?
>>: I was wondering whether you could take [indiscernible] of the code structure. Some of the
initialization code may not be of interest; you want to get past it and then turn on the search, things like
that I mean.
>> Alastair Donaldson: Yep, yep.
>>: Start the search at say [indiscernible].
>> Alastair Donaldson: Absolutely, yeah, you may want to run the application up to some point, I
suppose, then take a snapshot at that point and do systematic concurrency testing from the snapshot,
yep.
>>: I don’t have a question but I have a comment.
>> Alastair Donaldson: Yep, sure.
>>: When you were introducing delay bounding you mentioned that delay bounding is with respect to a
round robin scheduler.
>> Alastair Donaldson: What I said is that in general it's with respect to a deterministic scheduler.
>>: That’s right, so…
>> Alastair Donaldson: In this implementation like CHESS we used random, yeah.
>>: The result of doing delay bounding depends a lot on what deterministic scheduler you use. Because,
as Ken was saying, the deterministic scheduler is sort of the point around which you're biasing the
search.
>> Alastair Donaldson: Yeah.
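To make that point concrete, here is a small sketch of delay bounding around a deterministic scheduler (a simplified model I am assuming for illustration, not the CHESS implementation: the deterministic scheduler always picks the first runnable thread in a fixed order, each delay skips one thread in that order, and the total number of delays per execution is capped):

    def delay_bounded_schedules(threads, delay_bound):
        # Enumerate the schedules reachable from a fixed-order deterministic
        # scheduler using at most delay_bound delays; each skipped default
        # choice costs one unit of the delay budget.
        order = list(threads)
        results = []

        def explore(remaining, delays_left, schedule):
            runnable = [t for t in order if remaining[t]]
            if not runnable:
                results.append(schedule)
                return
            for skips, t in enumerate(runnable):
                if skips > delays_left:
                    break  # not enough budget to skip any further
                nxt = {k: list(v) for k, v in remaining.items()}
                step = nxt[t].pop(0)
                explore(nxt, delays_left - skips, schedule + [(t, step)])

        explore({t: list(s) for t, s in threads.items()}, delay_bound, [])
        return results

    # With delay_bound = 0 you get exactly the deterministic schedule; increasing
    # the bound adds schedules that deviate from it in a controlled way.
    print(len(delay_bounded_schedules({"T1": ["a1", "a2"], "T2": ["b1"]}, 0)),
          len(delay_bounded_schedules({"T1": ["a1", "a2"], "T2": ["b1"]}, 1)))

Swapping in a different deterministic baseline (round robin, or some other fixed policy) changes which schedules are cheap to reach, which is the dependence being discussed here.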
>>: In other applications, so I'm interested in systematic testing of message-passing applications. In
those applications there is no such notion of preemption or context switch. You have processes running
and they're communicating via [indiscernible]. There is no a priori reason to believe that preempting
those processes is particularly going to be useful.
So we were using the idea of a deterministic scheduler, right, because that's a very general concept that
doesn't depend on whether you're running on a single core or multiple cores; it is applicable even to a
distributed system. We found that the speed with which bugs are uncovered depends significantly on the
particular deterministic scheduler that we started with.
>> Alastair Donaldson: Okay.
>>: We experimented with the round robin scheduler that you mentioned.
>> Alastair Donaldson: Yeah.
>>: There’s another one called [indiscernible] completion scheduler.
>> Alastair Donaldson: Right.
>>: There was one more; we also created a random delaying scheduler.
>> Alastair Donaldson: Okay, is that akin to our random scheduler?
>>: No, no, no, no.
>> Alastair Donaldson: Did you try random scheduling like we tried?
>>: We have not tried random scheduling.
>> Alastair Donaldson: You should try that.
>>: But we should try that, yeah.
>> Alastair Donaldson: It's easy to try. I guess what surprised me a bit was that in prior work there was
no comparison with this very [indiscernible] random approach. Actually it didn't surprise me, because we
didn't think of doing it either; it was a reviewer who suggested that we try it. I guess my take-away from
this work is to definitely try all these easy things, right. It's good to try them.
>>: No, you shouldn't try them, they're embarrassing.
>> Alastair Donaldson: Don’t try them because then you can’t write a paper on them, is that what you
mean?
[laughter]
But yeah, with this benchmark set my reading is that about half of the benchmarks are nonsense
benchmarks, about a fifth are really hard and we can't find the bugs, and the remainder are interesting
benchmarks. For those benchmarks I think we're seeing the claims of prior work not being supported, but
this random approach is doing well; maybe those bugs are not super hard to find, but the benchmarks are
not super simple either.
>>: I think it's also, like what you said about random versus non-randomized search, or what Chad said,
something you knew but didn't really point out: there's not really that much of a difference between
randomized search and non-randomized search. If you think of the sequence of random choices as an
input that picks a schedule, you can just pick those random choices before you start, right. Then you can
deterministically enumerate all the random choice sequences, and you have a systematic search.
>> Alastair Donaldson: Right, okay, so if you knew there were a hundred scheduling points you could
systematically enumerate all of them…
>>: Yeah, so now the design of a good random scheduler is to pick fewer random choices. It's just like
how you determine the complexity of a randomized algorithm: it costs something to pick a random bit.
You want a scheduler that uses as few random choices as possible. That same scheduler is then actually
good for systematic exploration, because you can actually go through all the random choices.
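A tiny sketch of that idea, under the same simplified thread model assumed in the earlier sketches (threads as fixed lists of steps, one choice consumed per scheduling point): the same choice-sequence-driven scheduler supports random testing, by drawing the sequence up front, and systematic testing, by enumerating every sequence.

    import random
    from itertools import product

    def run_with_choices(threads, choices):
        # Replay one schedule: choices[i] picks among the runnable threads at step i.
        remaining = {t: list(s) for t, s in threads.items()}
        schedule = []
        for c in choices:
            runnable = [t for t in threads if remaining[t]]
            if not runnable:
                break
            t = runnable[c % len(runnable)]
            schedule.append((t, remaining[t].pop(0)))
        return tuple(schedule)

    threads = {"T1": ["a1", "a2"], "T2": ["b1"]}
    steps = sum(len(s) for s in threads.values())

    # Random testing: fix the choice sequence up front, then replay deterministically.
    random_run = run_with_choices(threads, [random.randrange(len(threads)) for _ in range(steps)])

    # Systematic testing: enumerate every choice sequence (feasible when few choices are needed).
    all_runs = {run_with_choices(threads, cs) for cs in product(range(len(threads)), repeat=steps)}
    print(random_run in all_runs, len(all_runs))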
>> Alastair Donaldson: Okay I’d like to talk more about that actually, really.
>> Shaz Qadeer: Okay, more questions? Alright, cool.
>> Alastair Donaldson: Thanks.
[applause]