>> Shaz Qadeer: Welcome, everybody. It's my great pleasure to introduce
Serdar Tasiran to you. Actually, he doesn't really need any introduction because
he's a long-term collaborator of many people in the RiSE group here in MSR
Redmond. He started out as a professor in the computer science department -- computer science or computer engineering?
>> Serdar Tasiran: Computer engineering.
>> Shaz Qadeer: Computer engineering department at Koc University in
Istanbul, Turkey. And he's worked for many years on various aspects of testing
and verification of concurrent programs. Today he's going to tell us about some
new work he's done together with his students on coverage metrics, a novel
coverage metric for concurrent software.
>> Serdar Tasiran: Thank you, Shaz. So I'm going to present an empirical study
about a simple idea for a coverage metric. This is joint work with my students
Erkan and Kivanc. Kivanc is in the audience. He's at UW and he's an intern
here this summer.
So the use we want to put this coverage metric to, very similar to other
concurrency coverage metrics, is to answer the following questions: Have I
explored enough interesting and distinct thread interleavings during testing? If
I'm applying a certain way of testing my program, for instance, randomized
testing with a set of fixed data, has this stopped paying off or am I still
discovering new interesting scenarios?
If my current way of testing is not paying off anymore, where else should I focus
my testing effort? What inputs, what interleavings or threads should I prioritize?
These are the questions that this metric is trying to approximately answer.
And in this study I'm going to show you empirically that the location pairs
coverage metric corresponds well to a certain kind of concurrency bug, atomicity and
refinement violations. And I'll also show it strikes a good compromise between
the difficulty of achieving coverage versus the power of the metric for detecting
bugs.
Before I dive into the specifics, I want to provide the motivation for this metric
using a well-known example from the atomicity literature. This is an older version
of the string buffer class from the Java Class Library, and it stores
the characters in the string buffer in this field called value, an array of characters.
And not all of these characters are part of the string, but only count many of
them are. So count tells us how many characters in this character array are part
of the string buffer, really.
And here is the append method for the string buffer. Let's suppose a thread called
T1 is running it. So to the string buffer this, we're trying to append the string
buffer SB. This method has, again, a well-known atomicity violation. First we look
at how long SB is and then we determine how much room we need, and whether
we need to extend our array to accommodate the characters that are going to
come from SB. So newcount is count plus len, where len is the length of SB.
If newcount is bigger than how long we are, we say let's expand our capacity.
Then we run the getChars method where, starting in index -- let's see. So in the
array value, we copy len number of elements, starting from index zero in SB and
starting from index this.count.
It got a little complicated, but you get the idea. We're trying to append the
characters in here to the value array here. And here's what could go wrong.
This method is not holding the lock for its argument SB during its execution.
So it's possible that another method running on the string buffer SB, in this case a
carefully chosen setLength(0) call, could get interleaved just in between
these statements here.
And what this does is, when you try to copy the characters from SB to this, this
method here checks if len is greater than SB.count, and because SB.count
became zero it throws a StringIndexOutOfBounds exception, et cetera.
So the atomicity violation happened because the view we had of SB when we
started got changed, but we were still operating with that old view of SB.
So it's a common pattern in atomicity violations, where a block of code intended to be
atomic does some operations in the beginning, somebody else interferes, and
then when the first thread continues its operation, things go wrong.
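[A minimal, self-contained Java sketch of the pattern being described; it follows the talk rather than the actual JDK source, and the class name and line-number comments are only stand-ins for the roles played by "line 425" and "line 180".]

```java
// A simplified string buffer; follows the talk's description, not the real JDK code.
class MiniStringBuffer {
    private char[] value = new char[16];
    private int count = 0;                       // how many chars in value are in use

    synchronized int length() { return count; }

    synchronized void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) {
        if (srcEnd > count)                      // re-reads count; fails if it shrank
            throw new StringIndexOutOfBoundsException(srcEnd);
        System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
    }

    synchronized void setLength(int newLength) {
        count = newLength;                       // plays the role of "line 180"
    }

    // Locks this, but never locks sb: the method is intended to be atomic with
    // respect to sb's state, yet it is not.
    synchronized MiniStringBuffer append(MiniStringBuffer sb) {
        int len = sb.length();                   // plays the role of "line 425"
        int newcount = count + len;
        if (newcount > value.length)
            value = java.util.Arrays.copyOf(value, newcount * 2);
        // If another thread runs sb.setLength(0) right here, the getChars call
        // below re-reads sb.count and throws StringIndexOutOfBoundsException.
        sb.getChars(0, len, value, count);
        count = newcount;
        return this;
    }
}
```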
So the purpose of the metric was to capture concisely this bad interleaving
without saying let's enumerate all possible interleavings of accesses to the field.
This is how the trouble occurred.
Have I lost anyone so far? So here's how the trouble occurred. We read SB.count,
somebody modified it in the middle, and then we tried to read it here again. And it had
gotten changed.
And our initial view of SB.count wasn't valid. So if we look at the code, this
access, this read of SB.count, is line 425 in AbstractStringBuilder.java. And this
SB.count gets set to zero at line 180 in AbstractStringBuilder.java. This is a line that
gets executed only when SB is becoming shorter.
So if it is the case that the two consecutive accesses to SB.count are this line and
that line by two different threads, then this atomicity violation occurs and that
exception is thrown.
So this being the crux of the metric, I'll pause here for a second.
>>: That implies it's a one-edge bug, but it's really two edges. It's two edges
because you have to observe, you need a little bit more. You need the second
edge, correct?
>> Serdar Tasiran: So the second edge -- the second edge is being implicitly
captured by saying the two consecutive accesses to SB.count need to be this and
that. So if this guy were lower, then these two would be the two consecutive
accesses.
>>: But it's not just that, you're saying the bug occurs. But that's not correct,
because if you never do the getChars, the SB.count doesn't matter. If SB is
dead after the second write to the count then nobody would ever observe --
>> Serdar Tasiran: Correct.
>>: Always going to do it because --
>>: He's just saying -- I thought he was making a general statement. If there are
two consecutive accesses to SB.count where one was in thread T1 and the other in
thread T2, then a bug occurs.
>> Serdar Tasiran: No, if line 425 is accessed by one thread and if line 180 is
the immediate next access by another thread, and if you don't stop the program
there, if you let this append method continue.
>>: So you're making a general statement, sorry.
>> Serdar Tasiran: No. In fact, there are no theorems or any certain claims.
This being the talk about coverage, everything is about what correlates well with
bugs.
So this is a lucky scenario where this interleaving guarantees the bug, but
there are many others where it doesn't. But it increases the likelihood --
>>: You can have another operation that comes in, happens to reset it before the call gets
back --
>> Serdar Tasiran: Sure, another one could run here.
>>: Or that count happened to be zero, then the bug --
>> Serdar Tasiran: Right. It could be reset back to however long it was before.
>>: Those two accesses happen to be a data race, that's not a precondition, right?
>>: They could be synchronized. It could be a property.
>>: They cannot be immediately succeeding each other.
>> Serdar Tasiran: If that was a --
>>: A lock between.
>> Serdar Tasiran: If that was a get count, for instance.
>>: The count --
>> Serdar Tasiran: If it was a synchronized getCount method here, for instance,
that would be okay.
>>: But then I don't understand your definition of immediately succeeding each
other.
>> Serdar Tasiran: We'll get to that.
>>: What are we talking about, these super --
>>: He's showing a data race in his example. SB.count written by one thread and a read
of SB.count by a different thread.
>> Serdar Tasiran: Just pretend count is volatile, for instance. Then there is no
data race, but there's still the atomicity violation.
>>: Right. But it's still a data race for people who don't know about memory
models, let's say. That's what I'm saying.
>> Serdar Tasiran: True. In that case pretend there's a synchronized get count
method within which this occurs.
>>: And it's not synchronized in thread two or it's also synchronized in thread
two?
>> Serdar Tasiran: This one is also synchronized. But this method ends and
then this one starts. So this is problematic.
>>: So there are other statements in between, and they're not immediately
succeeding each other.
>> Serdar Tasiran: I'm talking about -- I'm not saying this is immediately
succeeded by this. I'm saying in this execution, among the accesses to SB.count.
>>: Succeeding.
>> Serdar Tasiran: I'm going to show you a formal definition. In fact, why don't I
do that right now before I get into more trouble.
Okay. So here's what a location pair means formally. I should have known with
this crowd an informal introduction is a mistake. So L1 and L2 are two bytecode
instructions in the compiled Java code, AbstractStringBuilder.class; they are two
accesses in there.
And the two can be the same. Because sometimes I want to look at the same
access being run by two different threads. So here is when we say a location pair is
covered by an execution. Here's our execution. These are states. These are
bytecode instructions leading to state transitions.
L1 executes, gets executed by a thread T1, and accesses memory location M. Later,
with no other intervening accesses to that same memory location, L2 gets executed
by a different thread and accesses memory location M.
That's what I mean by the two consecutive accesses to SB.count. These are the two, and
they are accessing the same string buffer object, not two string buffer objects.
This is when I say that location pair is covered.
So, again, without any strong claims, let's say if this happens it's a really, really
bad idea, and it could cause problems. Let me weaken my statement.
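[A minimal sketch, not the actual tool, of how this covered-by-an-execution check can be recorded online from a stream of access events; the onAccess callback, the address and location encodings, and the omission of the read/write distinction are simplifications for illustration.]

```java
import java.util.*;

// Records location-pair coverage from a stream of access events.
final class LocationPairRecorder {
    private static final class Access {
        final int location; final long thread;
        Access(int location, long thread) { this.location = location; this.thread = thread; }
    }
    // For every memory address: which bytecode location and which thread touched it last.
    private final Map<Long, Access> lastAccess = new HashMap<>();
    // Covered pairs, encoded as "l1->l2".
    private final Set<String> covered = new HashSet<>();

    // Called for every shared-memory access the instrumentation observes.
    void onAccess(long address, int location, long thread) {
        Access prev = lastAccess.get(address);
        // Two consecutive accesses to the same address by two different threads:
        // the pair (prev.location, location) is covered.
        if (prev != null && prev.thread != thread) {
            covered.add(prev.location + "->" + location);
        }
        lastAccess.put(address, new Access(location, thread));
        // Contrast with definition-use coverage: DU pairs a read only with the
        // write it actually sees, and tolerates intervening accesses in between.
    }

    Set<String> coveredPairs() { return covered; }
}
```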
So this is the idea behind the definition of the metric. It says determine all
location pairs, try to cover all of them. And I'm going to tell you more about all of
this.
And it's very similar in spirit to many other concurrency coverage metrics that talk
about sequences, definitions and uses, et cetera, so I'll briefly contrast it with the
all definitions uses concurrent coverage metric.
So here's another interleaving of the same two threads, but this one's okay
because the set length occurs before the execution of append starts.
And in the terminology of coverage metrics, an assignment to SB.count is a
definition. The reads of SB.count are uses. And the all-definitions-uses coverage
metric says cover all definition-use pairs by different threads.
So in this nonproblematic, okay, interleaving, all definition-use pairs that appear
on this slide are covered. This definition is used by this read, and this read as
well. So you have actually accomplished 100 percent all-definitions-uses
coverage, yet you haven't triggered the error.
>>: So is it implicit in -- I've never heard of the definition-use metric before. So in
that definition, does it say that, do they talk about thread scheduling, and do they
say that the definition --
>> Serdar Tasiran: Originally the metric was sequential. And then it got
generalized to concurrent threads. And then in the concurrent version it says the
definition and the use should be by two different threads.
>>: Do they specify the order of those or no?
>> Serdar Tasiran: They do, the definition comes before. In fact, there's a
concept of what definition a use sees.
>>: Right.
>> Serdar Tasiran: And I have a slide about that, too. So in this interleaving,
which is okay, there are no bugs, all definition use pairs have been exercised but
the bug hasn't been triggered. The message being, by there being a slight
difference in definition between location pairs and all-definitions-uses, location
pairs gets at atomicity violations better. Right.
So in particular --
>>: Definition first --
>> Serdar Tasiran: Yes. That's what I was about to say. So this is a read and
then a write. And so it's not an intended scenario, of course. I mean, people
look at definition use pairs because that's the intended use.
I look at the other one because I'm trying to catch unintended but possible
atomicity violations.
>>: It seems they kind of screwed up when they extended definition-use to the
multi-threaded case, because in the single-threaded case you can't use it before it's
defined. But in the multi-threaded case that's exactly what you're finding, the
places where the use is --
>> Serdar Tasiran: Exactly. That's the crux.
>>: When they extended it to multi-thread, they should have thought to do this,
but they didn't for some reason.
>> Serdar Tasiran: That's basically the summary of the talk.
>>: Okay.
>>: So --
>> Serdar Tasiran: I haven't talked about it that way, but it's good.
>>: Anti-dependence, covering anti-dependence, whereas the definition use is
very much based on data flow.
>> Serdar Tasiran: And many other concurrency coverage metrics are like that,
too. They talk about two different synchronization constructs in two different
threads. They say they should be exercised in blocked and unblocked modes, et
cetera. They all try to express intended uses. And I think Mike's point is correct,
in making that generalization to multi-threaded programs, they should have
thought of the unintended but problematic uses.
Okay. And this idea was inspired by a number of papers that study bugs in
concurrent programs. Some of them being ours. And we looked at all the
published examples in these papers that were atomicity violations.
And pretty much all of them were captured by the LP metric, LP short for location
pairs, meaning if you accomplish the 100 percent LP coverage you are
guaranteed to trigger the bugs listed in these papers.
Okay. So I'm slowly starting the more formal part of the talk. So here are the
issues that one considers when studying a coverage metric. First, is this a good
metric? Is it a good proxy? Is it a good way of getting at the bugs that I'm after?
You select a category of bugs, in my case atomicity and refinement violations,
and I try to show that if you accomplish 100 percent coverage according to this
metric, you're in some examples guaranteed, in others likely, and likelier than with
other metrics, to catch all such errors. Yes?
>>: What do you mean by location?
>> Serdar Tasiran: Bytecode.
>>: Line of code?
>> Serdar Tasiran: Bytecode.
>>: So of course a location could be hit in different contexts, could mean
completely different things.
>> Serdar Tasiran: It could. And you could make the location pairs metric more
demanding by making it context sensitive. But in the benchmarks we looked at,
you didn't have to do that. Yes?
>>: You're getting something like context sensitivity by requiring the two accesses of a pair to
be on the same memory location.
>> Serdar Tasiran: Correct. So the argument against that is, if a certain field was
only accessed with a get and a set method, then all the accesses would be those two
bytecode instructions, and then you wouldn't really get anything interesting in that
way.
Right. And then the second issue is, I mean, one could accomplish the first goal
by saying cover every possible state of this program. That wouldn't be a very
useful metric, because it's not of a reasonable size. If you run some testing, you
would end up with a huge coverage gap, and it would not be a practical tool.
So I will try to show you empirically that the LP metric is a good compromise
between these two goals. So what I'm going to do is formally define it, talk
to you about how to statically compute the coverable pairs set, or approximate it,
rather, tell you a little bit about the tool implementation, and then show you bug
detection experiments and saturation experiments.
So here's the definition, you already saw it. Location pair. A pair of bytecode
instructions in the code is covered if they both access the same memory
location, one of them is a write, they are run by two different threads, and
nothing else, nobody else accesses in between.
The definitions-uses metric, because I'm going to compare with it, is defined like this.
Again, it uses a pair of locations. One of them is a definition, and the second one
is a use.
L1 writes to M. L2 reads M. And the value that L2 reads is the one written by L1,
or, rather, the write seen by this read is L1. But there may be other accesses to M
in the middle. As long as this is the write being seen by this read, the accesses
in between do not matter.
There's another metric I'm going to compare with called method pairs. And you
look at two methods from your code and you say M1 and M2 is covered if,
while M1 is in progress in one thread, before it ends, at least one action from M2
is executed.
So while M1 is in progress, M2 gets executed. And we didn't make this up,
right?
>>: This is the metric that's the same --
>> Serdar Tasiran: In the literature. Just checking. This is what happens when
students do the work and you do the publicity.
So here is the proposed, the intended use of the LP metric. First, you use static
analysis, which I will describe, to compute an overapproximation of the possible
set of coverable pairs. Then you do your testing. Maybe randomized. Maybe
try different bits of data.
Use thread exploration using CHESS if you like. After you've done enough testing
and you're tired then you look at what location pairs are not covered. And the
ones that aren't covered, you inspect.
And if you decide they're not coverable, then that's okay. That was an error in
the overapproximation. Otherwise, you try to devise a scenario, pick the data,
pick the input interleaving to cover that pair.
And this is where the LP coverage metric pays off. If this succeeds, then you've
gotten at something, a scenario that was difficult to cover in random exploration. But
the metric drew your attention there, and the testing you did in this way helped
you discover an error.
And this approach is feasible. As our paper -- oh, I forgot to tell you, this is a paper
that recently got accepted to the Journal of Empirical Software Engineering -- and
in that paper, on a number of examples and a number of benchmarks from the
literature, we show that this approach is feasible.
>>: So say for Apache FTP server, you really took every single location pair that
your static analysis did not rule out as not coverable immediately and inspected
it --
>> Serdar Tasiran: No, after random simulation.
>>: Okay. So you took the difference between what random simulation can hit and what
[inaudible] analysis could -- and inspected it.
>> Serdar Tasiran: Yes. There aren't that many, that's what surprises me.
Okay. So the way we implemented the coverage measurement tool was we
didn't try to be efficient. We tried to just get a prototype out, because we were
just exploring if this metric makes sense. We use Java Pathfinder. It has a VM
listener interface. It calls your code after it executes a bytecode instruction, and
you can figure out what memory address was accessed, what line of code, et
cetera, and JPF can explore different thread interleavings.
The problem was it took a lot of space. Even when we turned off state caching,
even when we just stored the set of states from the initial state to the current
state, it took a huge amount of space, so we hacked it to turn off all sorts of -- any
kind of state storage, basically.
So it randomly walked the state space. And told us when it executed the
bytecode instruction. So basically we used it as a runtime instrumentation
engine, because we could work with it.
So right. So let me jump forward. We need to have a good coverage
denominator, some sort of analysis that says here's a good candidate set of
coverage, of location pairs that you should target. And for this we made use of a
static race detection tool, Chord, by Mike, et al., and it was [inaudible] and several
papers afterwards.
So Chord has a number of phases of static analysis, and it keeps narrowing down
the set of pairs of accesses in your code that could result in data races.
So here's the golden set of racing pairs, which we can't exactly determine. This
purple line, unlocked pairs, is the final result of Chord. It says here are pairs of
accesses that could result in data races. But we weren't interested in data races.
We were interested in pairs of accesses that could come one after the other and
access the same bit of data.
So what we're after lies in between unlocked pairs and escaping pairs,
somewhere in here. So what we did simply was to use the set of escaping pairs
as the statically determined overapproximation.
So that's our starting point. So when the coverage tool starts, we first run -- we
run Chord. We determine this set. We give it to the coverage tool and the
coverage tool sees what's covered and what's not covered of those pairs.
Here's a detailed slide. This is where the overapproximation could tell you things
that are not actually coverable. Think of this synchronized block and suppose
two threads are running this method concurrently.
Now, one could finish and run location L3 and another one could start and run
location L1. So L3/L1 is coverable. But, for instance, when -- right after one runs
L1, the other one could not run L2 on the same bit of data because it's
synchronized.
So there are a number of pairs in our static set that are not actually
coverable. And currently we need to weed these out manually.
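[The slide itself is not reproduced here, so the following is a hypothetical reconstruction of the kind of code being described; the class, the field x, and the L1/L2/L3 labels are invented for illustration.]

```java
class SharedCounter {
    private int x;                 // shared field that escapes to multiple threads

    void update() {
        synchronized (this) {
            int t = x;             // L1: read of x, under the lock
            x = t + 1;             // L2: write of x, under the same lock
        }
        System.out.println(x);     // L3: read of x outside the critical section
    }
}
// The escaping-pairs overapproximation reports pairs such as (L1, L2), but two
// different threads can never perform L1 and L2 as consecutive accesses to the
// same object: both are guarded by the same lock, so one thread's L1 and L2
// always come together. A pair like (L3, L1), on the other hand, is coverable,
// because L3 runs outside the synchronized block.
```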
And we could have worked on Chord and maybe come up with a more precise static
analysis to rule these out. But we first wanted to see if this metric was any good.
So I guess this is the first stab at Sebastian's question. What if you have a lot of
pairs? Are you going to really try to cover all of those?
So here's a bunch of benchmarks. Their sizes in lines of code. And this is the
size of the static set. This is before any simulation is done. Of course, we didn't
run experiment -- we didn't debug all of these by hand. Debugged some of them.
But what's interesting is these numbers are not horrible. I mean, they're not in
the tens of thousands, millions. Even -- they don't correlate tightly with the lines
of code.
Some with lots of lines have a few pairs. And there was one with just a few lines
and lots of pairs, I can't find it right now. I guess this one. It has about as many
pairs as it has lines.
And if you're thinking well what if you have millions of lines of code, then you
probably test -- well, you would probably attack such code in modules.
So for the kind of coverage that I'm talking about, probably between libraries that are
completely separate from each other, you wouldn't expect consecutive shared
variable accesses. But that's just speculation. For these ones, here are the
numbers and they're not horrible.
Okay. So here's one example. Moldyn. So --
>>: I want to understand first. So the comment you made earlier about if the
accesses were buried within the setter or -- so you're not doing transitive. So if I
call some method that calls the setter, it is only the setter that counts as the
access, not the --
>> Serdar Tasiran: Correct.
>>: So that's why if you have many modules, it wouldn't grow so large but you
also, you miss interactions because the different modules wouldn't interact by
calling --
>> Serdar Tasiran: Right. That example is a case where we would need context
sensitivity. Yes.
>>: So you're not distinguishing between synchronization variables and data,
right? In your definition? So your definition will apply equally well to objects you
are synchronizing?
>> Serdar Tasiran: No. Because we're using Chord to start with, and Chord only
looks at data variables. So I should have said that in the beginning.
>>: I see. So the other thing that I [inaudible] in your previous example, where
Chord gives you some false pairs, is that -- maybe if dynamically you did a race check,
you would be able to say well, this location is guarded by -- this.x is guarded
by this, and you could eliminate -- you're getting -- this is -- you could get some
context sensitivity, because here you're getting pairs because of flow
insensitivity, right? And if you knew that, sort of, that full set of accesses was blocked
by a lock, then at runtime you might just say it seems at runtime this set of locations
is guarded by this lock, so you can't have an interleaving of these pairs, you can
weed these pairs out, even with a runtime --
>> Serdar Tasiran: This was the flavor of most of the pairs that we had to weed
out by hand.
So on Moldyn, static analysis gave us 26 pairs. And a whole bunch of random
testing covered only nine of those. And then we tweaked the example, changed
the length of the test, and then by brute force we were able to get up to 23 pairs.
That left three pairs to analyze by hand. And here is a description of those three
pairs.
So up here is a chunk of code that is only executed by thread ID 0. And there's a
barrier here. And later, down here in these lines, all threads execute these lines.
They all go through there. And the pairs that were not exercised were 361 as
the first and then 552, 553, 554; these were the noncovered pairs.
And this was happening, upon inspection we realized, because only thread 0
executes these. And by chance that thread is also the first one to
arrive at these lines in the executions.
So what we did was to put a delay in the middle of thread 0 and let other guys go
through. And then all pairs were covered.
So this is not even empirical, it's an anecdotal demonstration of feasibility.
>>: So then you say what did you need to do to the tests to make them cover
everything at the end? So the last one, it's a matter of a simple delay, which --
>> Serdar Tasiran: Right. In others there are matrix operations, and certain
things that only happen when matrices are big enough, or when there are
enough concurrent threads. So sometimes we have to play with data or the
number of concurrent threads to make it happen.
So this was, like I said, an anecdotal demonstration of feasibility. Now, coming to bug
detectability, what we did was to first create interesting buggy programs, and
for that we did two kinds of things. We used mutation operators for concurrent
programs from this paper, and these pretty much all play with synchronized
blocks and critical regions. They shrink them, expand them, split them, et cetera.
And in some examples we inserted atomicity violations by hand. And again, how
this happened was similar in spirit to these, but it required some reordering of if
statements, for instance of the conditions in if statements, or moving certain
reads and writes out of a critical block.
One thing I should say here is a lot of these mutations, a lot of these changes
result in trivial atomicity bugs. No matter what metric you use, after the first two,
three runs, the bug is triggered. So it's not -- those aren't good ways -- those
aren't good examples on which to compare metrics.
So we created maybe around 40 buggy programs, and of those we found that
seven or eight had nontrivial difficult bugs in them.
And, again, the theme of the bugs were, there was a bigger block intended to be
atomic but in fact it was split into several atomic blocks.
And so I need to describe to you the experimental setup. So here's what I mean
by one pass of an experiment. If you have a data structure, for instance, you will
have two or three threads in your test harness.
Each thread is performing two or three operations, like insert, delete, lookup. In
other scientific computing-like programs, one pass is one execution of a program
from start to finish.
And we say that a bug is caught by a pass if one of the assertions we put in
there is violated during that pass.
And we put in assertions by hand because we knew what the bugs were. So
we wrote assertions, for instance, about the data structure's state, or what methods
could return and what they couldn't return, what the matrix contents need to satisfy, et
cetera.
Of course, when you do a pass you might not detect a bug, you might not get an
assertion violation. So while the bug isn't caught, keep repeating the pass, that's
what we call an iteration. And we did several iterations, I think 100 for each
example, to collect statistics.
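[A rough sketch of one such iteration; the Harness interface and its methods are hypothetical stand-ins for the test harness, which runs one pass with a fresh random interleaving, reports the location pairs it covered, and says whether one of the hand-written assertions failed.]

```java
import java.util.*;

class IterationSketch {
    interface Harness {
        Set<String> runOnePassAndGetCoveredPairs();  // one pass with a random schedule
        boolean lastPassViolatedAssertion();         // did a seeded assertion fail?
    }

    static void runIteration(Harness h) {
        Set<String> coveredSoFar = new HashSet<>();  // coverage is reset per iteration
        boolean bugCaught = false;
        while (!bugCaught) {                         // repeat passes until the bug is caught
            boolean coveredNewPair = coveredSoFar.addAll(h.runOnePassAndGetCoveredPairs());
            bugCaught = h.lastPassViolatedAssertion();
            // The statistics reported later: did a pass that covered a new pair also
            // catch the bug, and did the bug-catching pass cover a new pair?
            if (coveredNewPair && bugCaught) {
                System.out.println("bug caught on a pass that also covered a new pair");
            }
        }
    }
}
```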
And what you do is during each iteration you measure different kinds of
coverage. So here's a rather picture-free slide. But if you care about the
numbers that are going to show up in the next three or four slides, you should try
to understand this.
So here are our measures of how a metric is good or not. If during an iteration,
remember, an iteration is keep repeating passes with different interleavings until
you catch the bug. If the metric reaches 100 percent coverage and yet the bug
hasn't been caught, then that metric is a bad way of measuring coverage for that
bug. It's not adequate. It's not successful as a measure of coverage.
There were two other numbers we measured to check how correlated a metric
was with bugs.
So suppose -- consider a given pass and suppose it covers a new location pair,
or a definition-use pair, that hadn't been covered in earlier passes. If that pass
also catches a bug, then there's a correlation between covering a new, let's say,
LP and detecting that bug.
And conversely, consider the passes, the very final pass of each iteration, that
catches the bug. Did that pass also cover a new LP or a new definition-use pair?
Was that maybe the reason that the bug was detected in that iteration? So I'm
going to show you -- yes?
>>: How do you decide when the LP is new or not?
>> Serdar Tasiran: So in the beginning of each iteration, we start from 0.
>>: Basically in this case the metrics are kind of skewed, because I may have found a
new LP which could potentially trigger the bug, but it didn't because the context wasn't set up where the
bug --
>> Serdar Tasiran: So I think I may have miscommunicated what new means.
By new I mean you start out not having covered any location pairs, and you're doing
pass after pass after pass; every pass might or might not cover a location pair.
And then you cover a location pair that you didn't cover in previous passes. You
call that a new LP. So you're looking at -- you're running around repeating the
experiment, and some of the experiments happen to exercise an interleaving, a
location pair that you haven't exercised before. If that interleaving also happens
to catch a bug, that's a good indication that covering that LP was a good idea.
Make sense? So first set of numbers -- so remember if metric reaches
100 percent before the bug is caught, metric is bad. And this is the table for that.
On the set of benchmarks we had, here is LP, scoring 100 percent. Basically not
saturating before the bug is caught, in all cases, whereas method pairs and
definitions and uses pairs do well sometimes but not so well at other times.
And these are the multiset benchmark, with mutant-seeded errors, and then the
elevator mutants generated by hand. These are the ones that had nontrivial
atomicity problems of the 40 or so mutants that we looked at.
Now, again on the same set of benchmarks, this is the percentage of passes that
cover a new MP, DU, or LP, but also detect the bug.
So the higher this is, the better correlated the metric is with bugs. And --
>>: I'm having trouble with this. Wouldn't we need to know how many passes
that do not discover a new location pair also detect the bug?
>>: The other number [inaudible].
>>: So if you want to attribute the bug-finding event to the fact that you
discovered a new location pair, it would need to be higher in cases where you
discover a location pair than in cases where you did not discover a location pair.
>> Serdar Tasiran: I think the next slide is an indirect way of getting at that. So let
me just tell you what this is for what it is, and then see if the next slide doesn't answer
your question. So the idea here is if a pass discovers a new MP, DU, or LP, what
are the chances that it will also trigger the bug?
And you see that the numbers in almost all cases are higher for LP than other
metrics. And then the next set of numbers says let's just look at the passes that
did detect a bug. Did that pass also cover an MP, DU, or LP that hadn't been
seen before? And I think this is an indirect way of getting at that. And again LP
does significantly better, especially by this measure, than MP and DU.
Okay. So that's the empirical bug detectability study. So there's a paper by
these authors in FSE '09 about saturation-based testing, and I would like to
quote some important remarks from there.
One thing they say is "the coverage metrics that are out there for testing
concurrent programs are too weak, we need to look at stronger ones." I agree.
They also say, well, people do these things like randomization or controlled
exploration of thread interleavings and they prioritize some interleavings over
others. And they don't study whether this way of doing things is really better at
getting at bugs or not.
So this comment might not exactly be related to what I'm presenting, but I
thought it would bug some people in this room, so I'm throwing it out there.
>>: Have you tried anything in this -- obviously now you have an empirical tool to
compare all these different approaches, formal, nonformal, simulation, random.
>> Serdar Tasiran: We've tried some. I'm going to show you some saturation
results. But for others, I'm trying to bug Shaz about whether context-bounded exploration is any
better at getting at bugs. I think he'd say yes. But I haven't done that study.
And other points they make are, again, you should -- coverage metrics need to
avoid being too weak meaning saturating very quickly. For instance, line
coverage is a weak metric, it saturates very quickly.
And they should not be too strong. So strong that the coverage target is too
difficult to compute. And one, again, controversial point they make is if you have
a nontrivial metric, if your metric is useful for concurrent programs, then chances
are you cannot determine the coverage target, the denominator, precisely
enough. It's just too complicated, too complex. So they say instead look at
saturation, to stop testing, to decide when your way of testing stops paying off.
In our case, at least for the examples we looked at, static computation of the
coverage denominator was possible.
But I also think this might happen and this might be needed. So I'm going to
show you two curves for saturation and two examples. And on those we'll see
that LP saturates later than the other two metrics. But not much later.
>>: Is that better or -- if it saturates later is that good?
>> Serdar Tasiran: It's good, but if it takes impossibly long to saturate or never
saturates it's like state coverage. In fact, in their paper they have a curve of state
coverage, the line goes -- it's a straight line. It keeps going off.
So on the elevator example, the X axis is the number of method calls. It's log
scale, so the curves can be seen better. The Y axis is percent coverage,
according to the particular metric, and blue is line coverage. Red is definitions
uses coverage, and green is LP coverage.
And you see that LP coverage keeps registering interesting increments along the
way and saturates later. But not impossibly late. I think it's a million method
calls.
>>: So are you doing any randomization or --
>> Serdar Tasiran: Yes, thread schedules are random. At every scheduling point,
Java Pathfinder chooses a random new thread to schedule.
>>: That's not a very good randomization strategy. Maybe we should take it up --
choosing a random thread at any point is favoring some schedules over others.
>> Serdar Tasiran: Okay. But the X axis in any case is some measure of how
much testing you've done. Think of this as time, for instance. And at this point,
for instance, if you didn't know the rest of this curve, and if you only had line and
DU coverage to look at, they would say stop now, you're not going to get anything
more interesting.
But LP says maybe there's some more stuff coming.
>>: Is that the only randomization strategy implemented in JPF [inaudible]?
>> Serdar Tasiran: I think so, yes. There's also the nonrandomized. But
[inaudible].
>>: I will --
>> Serdar Tasiran: Here's another benchmark, same idea. This time it's not log
scale. Line coverage saturates very easily. DU coverage is a little harder, but
still saturates fairly early.
And LP coverage saturates later and keeps registering interesting increments
along the way. So LP saturates later. It is a more stringent, more demanding
metric than the other two. But not impossibly demanding. There are more
curves. I'm just showing you two.
To sum up, I showed you a coverage metric for shared memory concurrent
programs. It's called location pairs. We empirically showed it corresponds well
to atomicity violations and refinement violations, and it appears to work better than the
all-definitions-uses and method pairs metrics.
It's more demanding than other metrics, but it still saturates within a tolerable
amount of time. And we believe it strikes a good compromise between bug
detectability and being too hard to compute. Thanks.
>> Shaz Qadeer: Let's thank the speaker.
[applause].
>> Shaz Qadeer: Questions?
>> Serdar Tasiran: Sorry, I finished early.
>> Shaz Qadeer: Good one. Better than finishing late.
>> Serdar Tasiran: If you'd like, I could tell you how we debugged the Apache
FTP server using this method. But maybe not.
>> Shaz Qadeer: Sure. Yeah.
>> Serdar Tasiran: I'll tell you the gist. It will take too long. But sometimes
atomicity violations are in the --
>>: This bug was already known?
>> Serdar Tasiran: Yes, it was known. But what was interesting was we tried to
fix it. We tried several different fixes, which were also not good. And LP told us
what -- so sometimes atomic blocks are not too small, they're too big. And you
end up not seeing certain interleavings. And then the LP metric can tell you
you're not seeing this interleaving because your locks are too big and in the
Apache FTP server, for instance, the timeout thread, in one version of it the
timeout thread was never able to kick in because the atomic block was too big.
And the LP metric basically said nothing is registering, you're not covering
anything, because your atomic block is too big.
Thank you.
>>: Can you tell us about what you think of extending this [inaudible]
atomicity? Proof of that?
>> Serdar Tasiran: So there are other papers that tell you, not pairs, but triples.
Not just any two threads, but specific threads.
And they get too expensive. So of course if you look at triples or sequences of
length four you could catch more bugs. But it would be difficult to determine
statically what can be covered.
Something like this.
>>: Sorry [inaudible] to the same variable or different variables?
>> Serdar Tasiran: There are metrics that talk about both. Some --
>>: This notion of an atomic set. It's like suppose you have an atomic set, X, Y -- the locations X and Y are in an atomic set, and now you're doing something -- so what you
might say is if you see a -- so now the generalization of your coverage is
basically based on that set. So if you saw, for example, a read of X and then
followed by later a write of Y, and nowhere in between a write to either X or Y,
right now that's a new type of --
>> Serdar Tasiran: Right. So there's a paper about a hierarchy of such metrics.
And this is one of the cheaper ones in that hierarchy. A variant on one of the
cheaper ones.
But I think its real value lies in giving the programmer something simple to look
at. I mean, if you say this pair of accesses to X were never followed by this pair of
accesses to Y, that might be a little too difficult for the programmer to rule out. But a
pair -- it's a good point to start. And sometimes it tells you what different data,
what different interleavings to test.
>>: I guess the question is how many concurrency bugs are there in which you
have two totally unrelated variables, related on this path, might be --
>> Serdar Tasiran: So this paper says -- this paper agrees with you, the top
paper.
>>: Right. Sorry?
>> Serdar Tasiran: The top paper here agrees with you. It says there aren't that
many.
>>: There aren't that many.
>> Serdar Tasiran: There aren't that many. There are some but not that many.
And sometimes to make a location pair happen, you might have to do a lot of
difficult things.
We might have to fill up a hash table with thread IDs before it overflows and
times out, et cetera.
The summary of what you did in the exercise is simple but sometimes it's really
tricky to get there data-wise and interleavings wise.
Thank you.
[applause]