>> Shaz Qadeer: Welcome, everybody. It's my great pleasure to introduce Serdar Tasiran to you. Actually, he doesn't really need any introduction because he's a long-term collaborator of many people in the RiSE group here in MSR Redmond. He started out as a professor in the computer science department -- computer science or computer engineering? >> Serdar Tasiran: Computer engineering. >> Shaz Qadeer: The computer engineering department at Koc University in Istanbul, Turkey. And he's worked for many years on various aspects of testing and verification of concurrent programs. Today he's going to tell us about some new work he's done together with his students on coverage metrics, a novel coverage metric for concurrent software. >> Serdar Tasiran: Thank you, Shaz. So I'm going to present an empirical study about a simple idea for a coverage metric. This is joint work with my students Erkan and Kivanc. Kivanc is in the audience. He's at UW and he's an intern here this summer. The use we want to put this coverage metric to, very similar to other concurrency coverage metrics, is to answer the following questions: Have I explored enough interesting and distinct thread interleavings during testing? If I'm applying a certain way of testing my program, for instance randomized testing with a fixed set of data, has this stopped paying off, or am I still discovering new interesting scenarios? If my current way of testing is not paying off anymore, where else should I focus my testing effort? What inputs, what interleavings, what threads should I prioritize? These are the questions that this metric is trying to approximately answer. And in this study I'm going to show you empirically that the location pairs coverage metric corresponds well to a certain kind of concurrency error, atomicity and refinement violations. And I'll also show it strikes a good compromise between the difficulty of achieving coverage and the power of the metric for detecting bugs. Before I dive into the specifics, I want to provide the motivation for this metric using a well-known example from the atomicity literature. This is an older version of the StringBuffer class from the Java class library, and it stores the characters in the string buffer in this field called value, an array of characters. Not all of these characters are part of the string; only the first count of them are. So count tells us how many characters in this character array are really part of the string buffer. And here is the append method for StringBuffer. Let's suppose a thread called T1 is running it. So to the string buffer this, we're trying to append the string buffer sb. This method has a well-known atomicity violation. First we look at how long sb is, and then we determine how much room we need and whether we need to extend our array to accommodate the characters that are going to come from sb. So newcount is count plus len, where len is the length of sb. If newcount is bigger than how long we are, we say let's expand our capacity. Then we run the getChars method, where, starting in index -- let's see. So into the array value, we copy len elements, starting from index zero in sb and starting from index this.count in value. It got a little complicated, but you get the idea. We're trying to append the characters in here to the value array here. And here's what could go wrong. This method is not holding the lock for its argument sb during its execution.
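(For readers following the transcript without the slide, here is a simplified, hypothetical Java sketch of the two methods being discussed. It is modeled on the old StringBuffer/AbstractStringBuilder code but is not the actual JDK source; the class name SimpleStringBuffer and the details are illustrative only, and the line numbers 425 and 180 mentioned below refer to the real class, not to this sketch.)

    // Simplified sketch of the append/setLength atomicity violation (not the real JDK code).
    class SimpleStringBuffer {
        private char[] value = new char[16]; // backing array; only the first 'count' chars are live
        private int count;                   // number of valid characters

        // Thread T1 runs this; it holds the lock on 'this' but never holds the lock on 'sb' across the call.
        public synchronized SimpleStringBuffer append(SimpleStringBuffer sb) {
            int len = sb.length();               // read sb.count: our "old view" of sb
            int newCount = count + len;
            if (newCount > value.length)
                expandCapacity(newCount);
            // If another thread runs sb.setLength(0) right here, 'len' is stale...
            sb.getChars(0, len, value, count);   // ...and this throws StringIndexOutOfBoundsException
            count = newCount;
            return this;
        }

        // Thread T2 runs this on the same object that T1 received as 'sb'.
        public synchronized void setLength(int newLength) {
            count = newLength;                   // setLength(0) makes the buffer shorter
        }

        public synchronized int length() {
            return count;
        }

        public synchronized void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) {
            if (srcEnd > count)                  // fails once count was concurrently reset to 0
                throw new StringIndexOutOfBoundsException(srcEnd);
            System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
        }

        private void expandCapacity(int minimumCapacity) {
            value = java.util.Arrays.copyOf(value, Math.max(minimumCapacity, value.length * 2));
        }
    }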
So it's possible that another method running on the string buffer sb, in this case a carefully chosen setLength(0), could get interleaved just in between these statements here. And what this does is, when you try to copy the characters from sb to this, this method here checks whether len is greater than sb.count, and because sb.count became zero it throws StringIndexOutOfBounds, et cetera. So the atomicity violation happened because the view we had of sb when we started got changed, but we were still operating with that old view of sb. So it's a common pattern in atomicity violations, where a block of code intended to be atomic does some operations in the beginning, somebody else interferes, and then when the first thread continues its operation, things go wrong. So the purpose of the metric was to capture concisely this bad interleaving without saying let's enumerate all possible interleavings of accesses to the field. This is how the trouble occurred. Have I lost anyone so far? So here's how the trouble occurred. We read sb.count, somebody modified it in the middle, and then we tried to read it here again. And it had gotten changed. And our initial view of sb.count wasn't valid anymore. So if we look at the code, this access, this read of sb.count, is line 425 in AbstractStringBuilder.java. And this write where sb.count gets set to zero is line 180 in AbstractStringBuilder.java. This is a line that gets executed only when sb is becoming shorter. So if it is the case that the two consecutive accesses to sb.count are this line and that line, by two different threads, then this atomicity violation occurs and that exception is thrown. So this being the crux of the metric, I'll pause here for a second. >>: That implies it's a one-edge block, but it's really two edges. It's two edges because you have to observe -- you need a little bit more. You need the second edge, correct? >> Serdar Tasiran: So the second edge -- the second edge is being implicitly captured by saying the two consecutive accesses to sb.count need to be this and that. So if this guy were lower, then these two would be the two consecutive accesses. >>: But it's not just that, you're saying the bug occurs. But that's not correct, because if you never do the getChars on sb it doesn't matter. If sb is dead after the second write to the count then nobody would ever observe -- >> Serdar Tasiran: Correct. >>: Always going to do it because -- >>: He's just saying -- I thought he was making a general statement. If there are two consecutive accesses to sb.count where one was in thread T1 and the other in thread T2, then a bug occurs. >> Serdar Tasiran: No, if line 425 is executed by one thread and if line 180 is the immediately next access by another thread, and if you don't stop the program there, if you let this append method continue. >>: So you're making a general statement, sorry. >> Serdar Tasiran: No. In fact, there are no theorems or any certain claims. This being a talk about coverage, everything is about what correlates well with bugs. So this is a lucky scenario where this interleaving guarantees the bug, but there are many others where it doesn't. But it increases the likelihood -- >>: You could have another one that comes in and happens to reset it before the first gets back -- >> Serdar Tasiran: Sure, another one could run here. >>: Or if that count happened to be zero, then the bug -- >> Serdar Tasiran: Right. It could be reset back to however long it was before. >>: Those two accesses happen to be a data race, there's no ordering between them, right? >>: They could be synchronized. It could be a property.
>>: Then they cannot be immediately succeeding each other. >> Serdar Tasiran: If that was a -- >>: A lock in between. >> Serdar Tasiran: If that was a getCount, for instance. >>: The count -- >> Serdar Tasiran: If it was a synchronized getCount method here, for instance, that would be okay. >>: But then I don't understand your definition of immediately succeeding each other. >> Serdar Tasiran: We'll get to that. >>: What are we talking about here, then -- >>: He's showing a data race in his example: a write of sb.count by one thread and a read of sb.count by a different thread. >> Serdar Tasiran: Just pretend count is volatile, for instance. Then there is no data race, but there's still the atomicity violation. >>: Right. But it's still a data race for people who don't know about memory models, let's say. That's what I'm saying. >> Serdar Tasiran: True. In that case pretend there's a synchronized getCount method within which this read occurs. >>: And it's not synchronized in thread two, or is it also synchronized in thread two? >> Serdar Tasiran: This one is also synchronized. But this method ends and then this one starts. So this is still problematic. >>: So there are other statements in between, and they're not immediately succeeding each other. >> Serdar Tasiran: I'm not saying this statement is immediately succeeded by that statement. I'm saying these are the consecutive accesses to sb.count in this execution. >>: Succeeding accesses. >> Serdar Tasiran: I'm going to show you a formal definition. In fact, why don't I do that right now, before I get into more trouble. Okay. So here's what a location pair means formally. I should have known that with this crowd an informal introduction is a mistake. So L1 and L2 are two bytecode instructions in the compiled Java code, AbstractStringBuilder.class; they are two accesses in there. And the two can be the same, because sometimes I want to look at the same access being run by two different threads. So here is when we say a location pair is covered by an execution. Here's our execution. These are states; these are bytecode instructions leading to state transitions. L1 gets executed by a thread T1 and accesses memory location M. Later, with no other intervening accesses to that same memory location, L2 gets executed by a different thread and accesses memory location M. That's what I meant by the consecutive accesses to sb.count. These are the two, and they are accessing the same string buffer object, not two different string buffer objects. This is when I say that location pair is covered. So, again, without any strong claims, let's say if this happens it's a really, really bad idea, and it could cause problems. Let me weaken my statement. So this is the idea behind the definition of the metric. It says: determine all location pairs, and try to cover all of them. And I'm going to tell you more about all of this. It's very similar in spirit to many other concurrency coverage metrics that talk about sequences, definitions and uses, et cetera, so I'll briefly contrast it with the all-definition-use concurrent coverage metric. So here's another interleaving of the same two threads, but this one's okay, because the setLength occurs before the execution of append starts. In the terminology of coverage metrics, an assignment to sb.count is a definition, and the reads of sb.count are uses. And the all-definition-use coverage metric says: cover all definition-use pairs by different threads. So in this nonproblematic, okay, interleaving, all definition-use pairs that appear on this slide are covered.
This definition is used by this read, and by this read as well. So you have actually achieved 100 percent all-definition-use coverage, yet you haven't triggered the error. >>: So is it implicit in -- I've never heard of the definition-use metric before. So in that definition, do they talk about thread scheduling, and do they say that the definition -- >> Serdar Tasiran: Originally the metric was sequential. Then it got generalized to concurrent threads. And in the concurrent version it says the definition and the use should be by two different threads. >>: Do they specify the order of those or not? >> Serdar Tasiran: They do; the definition comes before. In fact, there's a concept of what definition a use sees. >>: Right. >> Serdar Tasiran: And I have a slide about that, too. So in this interleaving, which is okay, there are no bugs, all definition-use pairs have been exercised, but the bug hasn't been triggered. The message being that, by there being a slight difference in definition between location pairs and all-definition-use, location pairs gets at atomicity violations better. Right. So in particular -- >>: Definition first -- >> Serdar Tasiran: Yes. That's what I was about to say. So this is a read and then a write. And so it's not an intended scenario, of course. I mean, people look at definition-use pairs because that's the intended use. I look at the other one because I'm trying to catch unintended but possible atomicity violations. >>: It seems they kind of screwed up when they extended definition-use to the multi-threaded case, because in the single-threaded case you can't use it before it's defined. But in the multi-threaded case that's exactly what you're finding, the places where the use is -- >> Serdar Tasiran: Exactly. That's the crux. >>: When they extended it to multi-threaded, they should have thought to do this, but they didn't for some reason. >> Serdar Tasiran: That's basically the summary of the talk. >>: Okay. >>: So -- >> Serdar Tasiran: I hadn't talked about it that way, but it's good. >>: It's anti-dependence, covering anti-dependence, whereas definition-use is very much based on data flow. >> Serdar Tasiran: And many other concurrency coverage metrics are like that, too. They talk about two different synchronization constructs in two different threads. They say they should be exercised in blocked and unblocked mode, et cetera. They all try to express intended uses. And I think Mike's point is correct: in making that generalization to multi-threaded programs, they should have thought of the unintended but problematic uses. Okay. And this idea was inspired by a number of papers that study bugs in concurrent programs, some of them being ours. And we looked at all the published examples in these papers that were atomicity violations. And pretty much all of them were captured by the LP metric, LP being short for location pairs, meaning if you achieve 100 percent LP coverage you are guaranteed to trigger the bugs listed in these papers. Okay. So I'm slowly starting the more formal part of the talk. Here are the issues that one considers when studying a coverage metric. First, is this a good metric? Is it a good proxy? Is it a good way of getting at the bugs that I'm after? You select a category of bugs, in my case atomicity and refinement violations, and I try to show that if you achieve 100 percent coverage according to this metric, you're in some examples guaranteed, in others likely, and likelier than with other metrics, to catch all such errors. Yes?
>>: What do you mean by location? >> Serdar Tasiran: Bytecode. >>: Line of code? >> Serdar Tasiran: Bytecode. >>: So of course a location, hidden in different contexts, could mean completely different things. >> Serdar Tasiran: It could. And you could make the location pairs metric more demanding by making it context sensitive. But in the benchmarks we looked at, you didn't have to do that. Yes? >>: You're getting something like context sensitivity by requiring the two accesses to be to the same memory location. >> Serdar Tasiran: Correct. But the argument against that is that if a certain field were only accessed through a getter and a setter, then all the accesses would be those two bytecode instructions, and you wouldn't really get anything interesting that way. Right. And then the second issue is, I mean, one could accomplish the first goal by saying cover every possible state of this program. That wouldn't be a very useful metric, because it's not of a reasonable size. If you run some testing, you would end up with a huge coverage gap, and it would not be a practical tool. So I will try to show you empirically that the LP metric is a good compromise between these two goals. So what I'm going to do is formally define it, talk to you about how to statically compute the coverable pairs set, or approximate it, rather, tell you a little bit about the tool implementation, and then show you bug detection experiments and saturation experiments. So here's the definition; you already saw it. Location pair: a pair of bytecode instructions in the code is covered if they both access the same memory location, one of them is a write, they are run by two different threads, and nothing else, nobody else, accesses that location in between. The definition-use metric, because I'm going to compare with it, is defined like this. Again, it uses a pair of locations. One of them is a definition and the second one is a use. L1 writes to M, L2 reads M, and the value that L2 reads is the one written by L1 or, rather, the write seen by this read is L1's. But there may be other accesses to M in the middle. As long as this is the write being seen by this read, the accesses in between do not matter. There's another metric I'm going to compare with, called method pairs. You look at two methods from your code, and you say M1 and M2 is covered if, while M1 is in progress in one thread, before it ends, at least one action from M2 is executed by another thread. So while M1 is in progress, M2 gets executed. And we didn't make this up, right? >>: This is the metric that's the same -- >> Serdar Tasiran: In the literature. Just checking. This is what happens when students do the work and you do the publicity. So here is the proposed, the intended, use of the LP metric. First, you use static analysis, which I will describe, to compute an overapproximation of the set of possibly coverable pairs. Then you do your testing. Maybe randomized. Maybe try different bits of data. Use thread exploration using CHESS if you like. After you've done enough testing and you're tired, then you look at what location pairs are not covered. And the ones that aren't covered, you inspect. If you decide they're not coverable, then that's okay; that was an error in the overapproximation. Otherwise, you try to devise a scenario, pick the data, pick the input and the interleaving to cover that pair. And this is where the LP coverage metric pays off. If this succeeds, then you've gotten at a scenario that was difficult to cover in random exploration, but the metric drew your attention there, and the testing you did in this way helped you discover an error.
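(To make the bookkeeping behind "covered" concrete, here is a minimal, hypothetical Java sketch of what a runtime monitor for the location-pair definition above might track. It is only an illustration under that definition, not the authors' JPF-based tool described below; the class name LocationPairCoverage and the onAccess hook are invented for this sketch.)

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative bookkeeping for location-pair (LP) coverage; not the actual tool.
    class LocationPairCoverage {
        // Remembers the most recent access to each monitored memory location.
        private static final class LastAccess {
            final int codeLocation;   // id of the bytecode instruction, e.g. "AbstractStringBuilder:425"
            final boolean isWrite;
            final long threadId;
            LastAccess(int codeLocation, boolean isWrite, long threadId) {
                this.codeLocation = codeLocation;
                this.isWrite = isWrite;
                this.threadId = threadId;
            }
        }

        private final Map<Object, LastAccess> lastAccess = new ConcurrentHashMap<>();
        private final Set<Long> coveredPairs = ConcurrentHashMap.newKeySet();

        // Called by the instrumentation after every monitored field or array access.
        // 'memoryLocation' identifies the accessed object and field; 'codeLocation' the instruction.
        void onAccess(Object memoryLocation, int codeLocation, boolean isWrite, long threadId) {
            LastAccess prev =
                lastAccess.put(memoryLocation, new LastAccess(codeLocation, isWrite, threadId));
            // A pair (prev, current) is covered when two consecutive accesses to the same memory
            // location come from different threads and at least one of them is a write.
            if (prev != null && prev.threadId != threadId && (prev.isWrite || isWrite)) {
                coveredPairs.add(((long) prev.codeLocation << 32) | (codeLocation & 0xFFFFFFFFL));
            }
        }

        int coveredPairCount() {
            return coveredPairs.size();
        }
    }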
And this approach is feasible; as our paper -- oh, I forgot to tell you, this is a paper that recently got accepted to the journal Empirical Software Engineering -- shows on a number of examples and a number of benchmarks from the literature, this approach is feasible. >>: So, say, for the Apache FTP server, you really took every single location pair that your static analysis did not rule out as not coverable and inspected it -- >> Serdar Tasiran: No, after random simulation. >>: Okay. So it was the difference between what random simulation can hit and what the [inaudible] analysis could -- and you inspected those. >> Serdar Tasiran: Yes. There aren't that many; that's what surprises me. Okay. So the way we implemented the coverage measurement tool was, we didn't try to be efficient. We tried to just get a prototype out, because we were just exploring whether this metric makes sense. We use Java PathFinder. It has a VM listener interface. It calls your code after it executes a bytecode instruction, and you can figure out what memory address was accessed, what line of code, et cetera, and JPF can explore different thread interleavings. The problem was that it took a lot of space. Even when we turned off state caching, even when we just stored the set of states from the initial state to the current state, it took a huge amount of space, so we hacked it to turn off all sorts of -- any kind of state storage, basically. So it randomly walked the state space and told us when it executed a bytecode instruction. So basically we used it as a runtime instrumentation engine, because we could work with it. So, right. Let me jump forward. We need to have a good coverage denominator, some sort of analysis that says here's a good candidate set of location pairs that you should target. And for this we made use of a static race detection tool, Chord, by Naik et al.; it was [inaudible] and several papers afterwards. So Chord has a number of stages of static analysis, and it keeps narrowing down the set of pairs of accesses in your code that could result in data races. So here's the golden set of racy pairs, which we can't exactly determine. This purple line, unlocked pairs, is the final result of Chord. It says here are pairs of accesses that could result in data races. But we weren't interested in data races. We were interested in pairs of accesses that could come one after the other and access the same bit of data. So what we're after lies in between unlocked pairs and escaping pairs, somewhere in here. So what we did simply was to use the set of escaping pairs as the statically determined overapproximation. That's our starting point. So when the coverage tool starts, we first run Chord. We determine this set. We give it to the coverage tool, and the coverage tool sees what's covered and what's not covered of those pairs. Here's a detailed slide; this is where the overapproximation could tell you things that are not actually coverable. Think of this synchronized block, and suppose two threads are running this method concurrently. Now, one could finish and run location L3, and another one could start and run location L1. So L3/L1 is coverable. But, for instance, right after one runs L1, the other one could not run L2 on the same bit of data, because it's synchronized. So there are a number of pairs in our static set that are not actually coverable.
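(The slide's code isn't reproduced in the transcript, but a minimal hypothetical example in the same spirit looks like the following; the class and field names are invented for illustration.)

    // Hypothetical example of why the static set overapproximates the coverable pairs.
    class BoundedCounter {
        private static final int MAX = 100;
        private int data;                          // shared field; L1, L2, L3 mark its accesses

        void increment() {
            synchronized (this) {
                int tmp = data;                    // L1: read of data (lock held)
                data = tmp + 1;                    // L2: write of data (lock still held)
                data = Math.min(data, MAX);        // L3: write of data, last access before the lock is released
            }
        }
    }
    // All three accesses escape to multiple threads, so the static overapproximation contains pairs
    // such as (L1, L2). But right after a thread executes L1 it still holds the lock and will itself
    // execute L2 next, so another thread's L2 can never be the immediately following access to data:
    // the pair (L1, L2) across two threads is not coverable and has to be weeded out. By contrast,
    // (L3, L1) is coverable: one thread finishes the block with L3, releases the lock, and another
    // thread's L1 can be the very next access.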
And currently we need to weed these out manually. We could have worked on Chord to maybe come up with a more precise static analysis to rule these out, but we first wanted to see whether this metric was any good. So I guess this is the first stab at Sebastian's question. What if you have a lot of pairs? Are you really going to try to cover all of those? So here's a bunch of benchmarks, their sizes in lines of code, and this is the size of the static set. This is before any simulation is done. Of course, we didn't debug all of these by hand; we debugged some of them. But what's interesting is that these numbers are not horrible. I mean, they're not in the tens of thousands or millions. And they don't correlate tightly with the lines of code. Some with lots of lines have a few pairs. And there was one with just a few lines and lots of pairs; I can't find it right now. I guess this one. It has about as many pairs as it has lines. And if you're thinking, well, what if you have millions of lines of code, then you would probably attack such code in modules. So for the kind of coverage that I'm talking about, across libraries that are completely separate from each other you wouldn't expect consecutive shared-variable accesses. But that's just speculation. For these ones, here are the numbers, and they're not horrible. Okay. So here's one example, Moldyn. So -- >>: I want to understand first. So the comment you made earlier about the accesses being buried within the setter -- so you're not doing anything transitive. So if I call some method that calls the setter, it is only the setter that counts as the access, not the -- >> Serdar Tasiran: Correct. >>: So that's why if you have many modules, it wouldn't grow so large, but you also miss interactions, because the different modules wouldn't interact by calling -- >> Serdar Tasiran: Right. That example is a case where we would need context sensitivity. Yes. >>: So you're not distinguishing between synchronization variables and data, right? In your definition? So your definition would apply equally well to objects you are synchronizing on? >> Serdar Tasiran: No. Because we're using Chord to start with, and Chord only looks at data variables. I should have said that in the beginning. >>: I see. So the other thing that I [inaudible] in your previous example where Chord gives you some false pairs is that maybe dynamically, if you did something like a lockset analysis, you would be able to say, well, this location is guarded by -- this.x is guarded by this lock, and you could eliminate -- you could get some context sensitivity, because here you're getting pairs because of flow insensitivity, right? And if you knew that that whole set of accesses was protected by a lock, then at runtime you might just say, it seems at runtime this set of locations is guarded by this lock, so you can't have an interleaving of these pairs, and you can weed these pairs out, even with a runtime -- >> Serdar Tasiran: This was the flavor of most of the pairs that we had to weed out by hand. So on Moldyn, static analysis gave us 26 pairs. And a whole bunch of random testing covered only nine of those. Then we tweaked the example, changed the length of the test, and by brute force we were able to get up to 23 pairs. That left three pairs to analyze by hand. And here is a description of those three pairs. So up here is a chunk of code that is only executed by thread ID 0. And there's a barrier here.
And later, down here, all threads execute these lines. They all go through there. And the pairs that were not exercised had 361 as the first location and then 552, 553, 554; these were the noncovered pairs. Upon inspection, we realized this was happening because only thread 0 executes these lines, and by chance that thread is also the first one to arrive at these lines in the executions. So what we did was to put a delay in the middle of thread 0 and let the other guys go through. And then all pairs were covered. So this is not even empirical; it's an anecdotal demonstration of feasibility. >>: So then, what did you need to do to the tests to make them cover everything at the end? For the last one, it's a matter of a simple delay, which -- >> Serdar Tasiran: Right. In others there are matrix operations, and certain things only happen when the matrices are big enough, or when there are enough concurrent threads. So sometimes we had to play with the data or the number of concurrent threads to make it happen. So this was, like I said, an anecdotal demonstration of feasibility. Now, coming to bug detectability, what we did was to first create interesting buggy programs, and for that we did two kinds of things. We used mutation operators for concurrent programs from this paper, and these pretty much all play with synchronized blocks and critical regions: they shrink them, expand them, split them, et cetera. And in some examples we inserted atomicity violations by hand. Again, how this happened was similar in spirit to those mutations, but it required some reordering of if statements, for instance of the conditions in if statements, or moving certain reads and writes out of a critical block. One thing I should say here is that a lot of these mutations, a lot of these changes, result in trivial atomicity bugs. No matter what metric you use, after the first two or three runs the bug is triggered. So those aren't good examples on which to compare metrics. So we created maybe around 40 buggy programs, and of those we found that seven or eight had nontrivial, difficult bugs in them. And, again, the theme of the bugs was that there was a bigger block intended to be atomic, but in fact it was split into several atomic blocks. And so I need to describe to you the experimental setup. Here's what I mean by one pass of an experiment. If you have a data structure, for instance, you will have two or three threads in your test harness, and each thread is performing two or three operations, like insert, delete, lookup. In other, scientific-computing-like programs, one pass is one execution of the program from start to finish. And we say that a bug is caught by a pass if one of the assertions we put in there is violated during that pass. We put in the assertions by hand, because we knew what the bugs were. So we wrote assertions, for instance, about the data structure's state, or what methods could return and what they couldn't return, or what conditions the matrix contents had to satisfy, et cetera. Of course, when you do a pass you might not detect a bug; you might not get an assertion violation. So while the bug isn't caught, you keep repeating the pass; that's what we call an iteration. We did several iterations, I think 100 for each example, to collect statistics. And during each iteration you measure different kinds of coverage. So here's a rather picture-free slide.
But if you care about the numbers that are going to show up in the next three or four slides, you should try to understand this. So here are our measures of whether a metric is good or not. Remember, an iteration is: keep repeating passes with different interleavings until you catch the bug. If during an iteration the metric reaches 100 percent coverage and yet the bug hasn't been caught, then that metric is a bad way of measuring coverage for that bug. It's not adequate. It's not successful as a measure of coverage. There were two other numbers we measured to check how correlated a metric was with bugs. Consider a given pass, and suppose it covers a new location pair, or a definition-use pair, that hadn't been covered in earlier passes. If that pass also catches a bug, then there's a correlation between covering a new, let's say, LP and detecting that bug. And conversely, consider the very final pass of each iteration, the one that catches the bug. Did that pass also cover a new LP or a new definition-use pair? Was that maybe the reason that the bug was detected in that iteration? So I'm going to show you -- yes? >>: How do you decide whether an LP is new or not? >> Serdar Tasiran: So at the beginning of each iteration, we start from zero. >>: Basically in this case the metrics are kind of skewed, because I may have found a new LP which could potentially trigger the bug, but the context was not set up where the bug -- >> Serdar Tasiran: So I think I may have miscommunicated what new means. By new I mean you haven't covered that location pair before: you're doing pass after pass after pass, every pass might or might not cover a new location pair, and then you cover a location pair that you didn't cover in previous passes. You call that a new LP. So you're repeating the experiment, and some of the experiments happen to exercise an interleaving, a location pair, that you haven't exercised before. If that interleaving also happens to catch a bug, that's a good indication that covering that LP was a good idea. Make sense? So, first set of numbers -- remember, if a metric reaches 100 percent before the bug is caught, the metric is bad. And this is the table for that. On the set of benchmarks we had, here is LP, scoring 100 percent -- basically not saturating before the bug is caught -- in all cases, whereas method pairs and definition-use pairs do well sometimes but not so well at other times. These are the multiset benchmark with mutant-seeded errors, and then the elevator mutants generated by hand. These are the ones that had nontrivial atomicity problems out of the 40 or so mutants that we looked at. Now, again on the same set of benchmarks, here is the percentage of passes that cover a new MP, DU, or LP but also detect the bug. So the higher this is, the better correlated the metric is with bugs. And -- >>: I'm having trouble with this. Wouldn't we need to know how many passes that do not discover a new location pair also detect the bug? >>: The other number [inaudible]. >>: So if you want to attribute the bug-finding event to the fact that you discovered a new location pair, it would need to be higher in cases where you discover a new location pair than in cases where you did not. >> Serdar Tasiran: I think the next slide is an indirect way of getting at that. So let me just tell you what this is, and then see if the next slide answers your question.
So the idea here is, if a pass discovers a new MP, DU, or LP, what are the chances that it will also trigger the bug? And you see that the numbers in almost all cases are higher for LP than for the other metrics. Then the next set of numbers says, let's just look at the passes that did detect a bug. Did that pass also cover an MP, DU, or LP that hadn't been seen before? And I think this is an indirect way of getting at that. And again LP does significantly better than MP and DU, especially by this measure. Okay. So that's the empirical bug detectability study. Now, there's a paper by these authors in FSE '09 about saturation-based testing, and I would like to quote some important remarks from there. One thing they say is that the coverage metrics that are out there for testing concurrent programs are too weak, and we need to look at stronger ones. I agree. They also say, well, people do these things like randomization or controlled exploration of thread interleavings, and they prioritize some interleavings over others, and they don't study whether this way of doing things is really better at getting at bugs or not. So this comment might not exactly be related to what I'm presenting, but I thought it would bug some people in this room, so I'm throwing it out there. >>: Have you tried anything in this -- obviously now you have an empirical tool to compare all these different approaches: formal, nonformal, simulation, random. >> Serdar Tasiran: We've tried some. I'm going to show you some saturation results. For others, I keep bugging Shaz about whether context-bounded exploration is any better at getting at bugs. I think he'd say yes. But I haven't done that study. And other points they make are, again, that coverage metrics need to avoid being too weak, meaning saturating very quickly. For instance, line coverage is a weak metric; it saturates very quickly. And they should not be too strong, so strong that the coverage target is too difficult to compute. And one, again, controversial point they make is that if you have a nontrivial metric, if your metric is useful for concurrent programs, then chances are you cannot determine the coverage target, the denominator, precisely enough. It's just too complicated, too complex. So they say instead look at saturation to decide when to stop testing, when your way of testing stops paying off. In our case, at least for the examples we looked at, static computation of the coverage denominator was possible. But I also think this might happen and this might be needed. So I'm going to show you saturation curves for two examples. And on those we'll see that LP saturates later than the other two metrics, but not much later. >>: Is that better, or -- if it saturates later is that good? >> Serdar Tasiran: It's good, but if it takes impossibly long to saturate or never saturates, it's like state coverage. In fact, in their paper they have a curve for state coverage; it's a straight line, it just keeps going up. So on the elevator example, the X axis is the number of method calls. It's log scale, so the curves can be seen better. The Y axis is percent coverage according to the particular metric. Blue is line coverage, red is definition-use coverage, and green is LP coverage. And you see that LP coverage keeps registering interesting increments along the way and saturates later, but not impossibly late. I think it's a million method calls.
>>: So are you doing any randomization, or -- >> Serdar Tasiran: Yes, the thread schedules are random; at every scheduling point Java PathFinder chooses a random new thread to schedule. >>: That's not a very good randomization strategy. Maybe we should take it up -- choosing a random thread at any point is favoring some schedules over others. >> Serdar Tasiran: Okay. But the X axis in any case is some measure of how much testing you've done. Think of it as time, for instance. And at this point, for instance, if you didn't know the rest of this curve and you only had line and DU coverage to look at, they would say stop now, you're not going to get anything more interesting. But LP says maybe there's some more stuff coming. >>: Is that the only randomization strategy implemented in JPF [inaudible]? >> Serdar Tasiran: I think so, yes. There's also the nonrandomized one. But [inaudible]. >>: I will -- >> Serdar Tasiran: Here's another benchmark, same idea. This time it's not log scale. Line coverage saturates very easily. DU coverage is a little harder, but still saturates fairly early. And LP coverage saturates later and keeps registering interesting increments along the way. So LP saturates later. It is a more stringent, more demanding metric than the other two, but not impossibly demanding. There are more curves; I'm just showing you two. To sum up, I showed you a coverage metric for shared-memory concurrent programs. It's called location pairs. We empirically showed it corresponds well to atomicity violations and refinement violations, and it appears to work better than the all-definition-use and method pairs metrics. It's more demanding than other metrics, but it still saturates within a tolerable amount of time. And we believe it strikes a good compromise between bug detectability and being too hard to compute. Thanks. >> Shaz Qadeer: Let's thank the speaker. [applause]. >> Shaz Qadeer: Questions? >> Serdar Tasiran: Sorry, I finished early. >> Shaz Qadeer: Good one. Better than finishing late. >> Serdar Tasiran: If you'd like, I could tell you how we debugged the Apache FTP server using this method. But maybe not. >> Shaz Qadeer: Sure. Yeah. >> Serdar Tasiran: I'll tell you the gist; it would take too long otherwise. But sometimes atomicity violations are in the -- >>: This bug was already known? >> Serdar Tasiran: Yes, it was known. But what was interesting was that we tried to fix it. We tried several different fixes, which were also not good. And LP told us what -- so sometimes atomic blocks are not too small, they're too big, and you end up not seeing certain interleavings. And then the LP metric can tell you you're not seeing this interleaving because your locks are too big. In the Apache FTP server, for instance, in one version of it the timeout thread was never able to kick in because the atomic block was too big. And the LP metric basically said nothing is registering, you're not covering anything, because your atomic block is too big. Thank you. >>: Can you tell us what you think about extending this [inaudible] atomicity? Proof of that? >> Serdar Tasiran: So there are other papers that look not at pairs but at triples, not just any two threads but specific threads. And they get too expensive. Of course, if you look at triples or sequences of length four you could catch more bugs, but it would be difficult to determine statically what can be covered. Something like that. >>: Sorry, [inaudible] to the same variable or different variables?
>> Serdar Tasiran: There are metrics that talk about both. Some -- >>: There's this notion of an atomic set. It's like, suppose you have an atomic set with the locations X and Y in it; now the generalization of your coverage is basically based on that set. So if you saw, for example, a read of X followed later by a write of Y, and nowhere in between a write to either X or Y, that's a new type of -- >> Serdar Tasiran: Right. So there's a paper about a hierarchy of such metrics, and this is one of the cheaper ones in that hierarchy, a variant on one of the cheaper ones. But I think its real value lies in giving the programmer something simple to look at. I mean, if you say this pair of accesses to X was never followed by this pair of accesses to Y, that might be a little too difficult for the programmer to rule out. But a pair is a good point to start. And sometimes it tells you what different data, what different interleavings to test. >>: I guess the question is how many concurrency bugs are there in which you have two totally unrelated variables, related only on this path, might be -- >> Serdar Tasiran: So this paper says -- this paper agrees with you, the top paper. >>: Right. Sorry? >> Serdar Tasiran: The top paper here agrees with you. It says there aren't that many. >>: There aren't that many. >> Serdar Tasiran: There aren't that many. There are some, but not that many. And sometimes to make a location pair happen, you might have to do a lot of difficult things. You might have to fill up a hash table with thread IDs before it overflows and times out, et cetera. The summary of what you did in the exercise is simple, but sometimes it's really tricky to get there data-wise and interleaving-wise. Thank you. [applause]