>> Tom Ball: Okay. Hello, everybody. Welcome. My name is Tom Ball, and it's my pleasure to welcome Todd Mowry from Carnegie Mellon University where he's a professor in the Computer Science Department. He received his Ph.D. from Stanford in 1994, one year after I got my Ph.D., so we're pretty close there in time. And his research interests are varied and span areas of computer architecture, compilers, operating systems, parallel processing, database performance and modular robotics, which is a very cool topic too. He's served a rotation as the director of Intel's Research Lab in Pittsburgh and associate editor of ACM Transactions on Computer Systems. So today he's going to talk to us about butterfly analysis for dynamic parallel programming monitoring using dataflow analysis. Something like that. >> Todd Mowry: Exactly. >> Tom Ball: You'll probably describe it a lot better than me. >> Todd Mowry: Thanks, Tom. Sure. So I'm going to talk about, this is work that was done primarily by my student, Michelle Goodstein, and this work appeared at the ASPLOS conference last summer, and what I was going to do is I'm going to start by telling you a bit about the project that led to this particular piece of work and then I'll talk about the work and then afterwards I'll talk about where we're going to sort of follow along with the next steps. So let me just back up. So a little bit about myself. So I don't have -- my background is not in verification or particularly in formal methods. I work mostly in compilers and architecture, and most of my research had been on making things run faster either through parallelism or through cache optimizations. And after doing that for many years I wanted to do something that wasn't just about making things run faster. 
One thing I was concerned about is with the whole world moving to multicore and with everyone trying to start to write parallel programs, I know firsthand how hard that is, so I wanted to do a project that might help people with functionality bugs in parallel programs. So that was the motivation for this overall project. And as everyone knows, when a bug occurs, as your program is executing often at the moment that the root cause of the problem occurs you don't necessarily notice the problem right away. Typically you notice it later when something catastrophic happens. So at this point you get something like a core dump, and you can see what the world looks like at that point, but that's not exactly what you want. What you really want to do is know what did things look like back here so that you can find the real problem. And the way the classic debugging happens is that at this point you would then try to rerun the system with a debugger turned on and try to sneak up on the bug from the beginning, but especially when you're thinking about a parallel system where things are timing dependent, it may be extremely difficult to reproduce all the right circumstances to get that bug, and if debugging slows things down and dilates time, then maybe it's not even possible to reproduce the bug with the distorted timing. So there are many ways to attack bugs. So static analysis is one great way to do that, and a way that I think about what we're doing, we've been focusing on dynamic analysis not because we don't believe in static analysis but because we think this would be complementary to static analysis. So do everything you can with fancy static analysis tools, but for the things that you still can't catch or for the things you're not quite sure about how they behave at run time, our goal is to build the world's nicest, fastest run time analysis framework and then let clever people figure out new kinds of tools to build on top of that. 
So we didn't come up with the idea of doing dynamic analysis, of course. There have been lots of tools to do things looking -- tools that are typically built using binary instrumentation tools like Pin or Valgrind, and they look for things like memory bugs, security bugs, concurrency bugs and so on. So in particular we're interested in very sophisticated detailed tools that look typically at each and every instruction. So it's easy to do dynamic analysis without much overhead if you only monitor things that occur fairly infrequently. But if you want to look at every instruction and do something interesting, the overheads can be very, very high. But if you could make this run fast enough, a nice thing about doing an analysis dynamically is that hopefully you notice the root cause of the problem before the catastrophic consequences occur down the road. Also, if you can look at every instruction, we think you have a better chance of finding things rather than just, say, look at a core dump or a window of the last million instructions after the system crashes because you don't miss anything. You can see all the instructions from the beginning of when the program runs. So it would be wonderful if these tools were wickedly fast and we could use them for everything. But the bad news is typically these tools, when they're implemented today with binary instrumentation, slow things down quite a bit. Slowdowns of 30 to 100 times are very common with interesting tools like this. And when things slow down this much, the problem is you can't really run these on a live system. Interactivity doesn't really work anymore. So that was an issue we wanted to fix. So in the sort of overarching project that we've been working on what we've been trying to do is to accelerate this kind of analysis to the point where the slowdowns are tolerable without resorting to sampling. 
So we're still looking at every instruction, but we can get the overheads down to, say, something like 20 percent or 50 percent to the point where you could actually think about leaving this on on a live system. And I've talked about how we've done that in previous visits here and talks so I'm not going to go into a lot of detail on that. I'm just going to quickly summarize this in case there are people who haven't heard this before. So how do we run so much faster than -- yes? >>: Is 20 percent and 50 percent performance, is that something acceptable? >> Todd Mowry: Well, actually we can -- I think we can reduce it lower than that. So our actual goal is no slowdown, and -- >>: [inaudible] >> Todd Mowry: Yeah, so actually this is -- we've done a lot of other work that I'm not showing. Actually I wasn't going to -- actually I can talk at great length about that. So we've done a lot of work on different ways to accelerate things, and these are fairly conservative numbers. The 20 percent to 50 percent was our number from three years ago, and we've gotten much better results since then. In fact, you can always close the rest of the gap by doing just a little bit of sampling. So if I'm willing to look at three quarters of instructions instead of 100 percent of them, I can just immediately make any slowdown go away. >>: [inaudible] >> Todd Mowry: Sorry, what's that? >>: [inaudible] >> Todd Mowry: Yeah. So our goal is to do it without any sampling, and I think we can get down to no slowdown. There are different tricks for basically getting to that. We're very close to that. But it's more that we sped things up by, like, more than a factor of 20, and there's more low hanging fruit that we're working on to speed it up more. So things got a lot better. The point is when you're looking at this kind of slowdown, nobody even thinks about running this on a live system. Okay. 
But in a nutshell, the thing that we do, which, again, I've talked about in earlier visits, is we don't use binary instrumentation. Instead what we do is we modify the hardware a little bit to basically have the processor collect some of the information that we would want from binary instrumentation. And this is really not a big deal in that the hardware already has this information, and in fact it captures this internally for debugging purposes in some cases, but let me tell you a little bit about what we do. So, first of all, the application is running unmodified on one processor, and the analysis -- instead of taking the analysis and stitching it together into one software thread and running that on one processor, we keep them separate. We run the analysis on a separate core on the same chip and then we -- the hardware logs the information that we care about dynamically and stores it in memory and then that is consumed by the analysis. So it's basically the same type of information flow that happened with binary instrumentation, but we're pulling it apart and running it on separate processors. So the reasons why it runs faster are, first of all, well, we're using another processor, and so that gets us some benefits. It's not just that there's another processor, but the fact that we have another set of registers and another cache, that helps quite a bit because when you take a program that was already compiled to use all the registers and then you try to stick analysis into that, there's a lot of register spilling, plus the caching effects matter. The problem isn't that this analysis is knocking the cache -- is displacing the data from the application. It's the other way around. This analysis typically has a much smaller footprint than the application, but it's actually the bottleneck. And if you want this to keep up, you want it to run as fast as possible, so we don't want the application to be knocking the data out of the cache for this tool. 
And, finally -- so with the things I described so far, if we do this the right way, and hardware designers don't think this is a crazy idea, then you could run this at full speed because this is just running on bare hardware, and the information is just being captured in the background and sent over to this processor. Now, it turns out that the problem, the performance bottleneck, is on the consumer side because these tools may do something like five to ten instructions worth of analysis for every one instruction in the application. So they're starting off with a big disadvantage in terms of instructions. What was that? >>: [inaudible] >> Todd Mowry: Oh, okay. So what we did in part is both through hardware -- it says hardware here, but we also did this through the compiler -- we found ways to recognize redundancy in the analysis and eliminate it or reduce it, and we've had papers on that which I'm also happy to talk about in more detail. Yes? >>: Isn't that analysis running on exactly the same type of processor as the [inaudible]? >> Todd Mowry: So in our designs we've made each processor the same. So any processor could either run an application or could run the lifeguard. We did that just for flexibility so that if you wanted to use all resources just for computation, you could do that, so that we weren't hard wiring anything. But you certainly could -- it's interesting to think about optimizing a processor just for doing this kind of analysis. So with all these things turned on, a very -- >>: So I can think of this as another type of tool [inaudible] where you have like two cores that are running the same thing, just one is doing something else. My question is -- and you also have some changes to the hardware, so are you proposing that every work station or every machine should be changed so that -- the hardware should be changed so that it will support [inaudible]? Do you think this is feasible? >> Todd Mowry: Yes. 
So -- >>: Do you think really it's worth, like, all the costs that will be associated with hardware change? Because you know that's very expensive. >> Todd Mowry: Yes. So I often have worn an Intel hat, and this project was done jointly with a bunch of Intel people, and we've spent quite a bit of time talking to Intel architects throughout the whole project about the feasibility of this. So it's not -- this is not technically infeasible. The question is just sort of value proposition and so on, and in fact Ben and [inaudible] have helped participate in discussions about that with Intel. So it's a topic of discussion. Do you guys have other questions? >>: [inaudible] applications usually when you make a small change [inaudible] just an extra write to the memory or an extra read from the memory can result in totally different set of results, so I was wondering if your construct, your conditional construct, you're adding [inaudible]. >> Todd Mowry: It does not -- let's see. Well, it's not that there's zero perturbance of the system because it is writing a log into a small piece of memory, and it can sometimes -- there's nothing special about that memory that would prevent it from ever conflicting with other memory accesses in the application. And in our experiments we measure this. The interference between the log that we're storing here and the normal cache data is very small because this buffer is not very big at all. But there is a small effect there. So it's not that there's zero perturbance from what we're doing, but the hope is that it's not very large. It's certainly better than, I think, what it is when you slow things down by an order of magnitude. >>: So what about the memory traffic that's going off core? 
>> Todd Mowry: Well, one of the things that we like about this is -- so people are worried about off-chip bandwidth, and a nice thing in my mind about this is all of the -- nearly all of the traffic associated with doing this analysis stays on chip because the log almost never gets knocked out of the chip because it gets consumed more or less right away, and then the metadata that the lifeguard uses caches very well, so -- >>: So it's not off chip. So it's just the same core. You might even be sharing a cache? >> Todd Mowry: Yes, they are typically sharing the last-level cache. >>: I see. >> Todd Mowry: So that's actually where this thing resides is it just really never goes off chip. So one of the -- and we have papers where we look at a wide variety of different lifeguards, and this is one that is part of a Valgrind tool that's called AddrCheck. It's one of the simpler ones, but it initially has a slowdown of about 30x, and using our bells and whistles we can get that down to about, say, 2 percent, something that's in the noise. For the more sophisticated tools it's a little higher than that, like 20 percent, 30 percent or so, but it's small enough that you can think about actually turning this on and leaving it on all the time. Okay. So that's just kind of context. So that's the project. But what I've talked about so far, this was just single threaded applications. So we're using one lifeguard thread to monitor one application thread. And what we wanted to do is look at parallel programs, and this makes life much more interesting and challenging. Okay. So now I'm jumping into the official slides from Michelle's presentation so there's a little bit of repetition here. The idea is when we think about finding these problems, you can either try to catch things using static analysis before anything executes or during execution or after something crashes. And all of these are useful things to do, but we're focusing on things in the middle. 
We're trying to catch problems during the middle of execution. Okay. And the way that these tools work is that they associate extra state, the shadow state or metadata, for all of the data in the application, and that shadow state is tracking some correctness property of the data. For example, memory checkers record whether a location has been allocated and initialized and freed and that type of thing. So, for example, as your program is running if you load from an address, it's going to go over and check whether you actually have allocated that thing, and if not, then it knows that there's a problem. Okay. So for sequential programs this is basically just a finite state machine where you go from instruction to instruction. It knows how to update this metadata, and it knows how to check the metadata. So that's fairly straightforward. Okay. Now what happens when we want to look at parallel programs? Well, it would be nice if we didn't have to change things very much. So what happens in this situation? Well, that would be nice. That's not what really happens, though. So now imagine that, for example, on one thread we're allocating memory and we're dereferencing it and we want to know whether we're going to have a null pointer exception, and it's possible that this other thread is going to set P to be null, and we want to -- imagine we want to write a lifeguard that's going to check for this problem. So now we need to worry about the ordering of when this assignment occurs relative to this and this access and whether this assignment is visible at this point. So the way this basically happens today, more or less the state of the art is that people track this by doing time slicing. So the idea is that you basically force the different threads to interleave on one core, and then if you do that you can run one lifeguard. 
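The sequential shadow-state idea described above can be sketched in a few lines of Python. This is only an illustration, not the lifeguard from the project: the class name `AddrCheckSketch`, the event names, and the three-state model are all assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    UNALLOCATED = auto()
    ALLOCATED = auto()
    FREED = auto()


class AddrCheckSketch:
    """Toy sequential lifeguard: one shadow-state entry per application address.

    Hypothetical illustration; event names and states are assumptions.
    """

    def __init__(self):
        self.shadow = {}   # address -> State; a missing entry means UNALLOCATED
        self.errors = []   # (event kind, address) pairs that failed a check

    def _state(self, addr):
        return self.shadow.get(addr, State.UNALLOCATED)

    def on_event(self, kind, addr, size=1):
        """Update or check the metadata for one logged event."""
        if kind == "malloc":
            for a in range(addr, addr + size):
                self.shadow[a] = State.ALLOCATED
        elif kind == "free":
            if self._state(addr) is not State.ALLOCATED:
                self.errors.append((kind, addr))   # invalid or double free
            for a in range(addr, addr + size):
                self.shadow[a] = State.FREED
        elif kind in ("load", "store"):
            if self._state(addr) is not State.ALLOCATED:
                self.errors.append((kind, addr))   # access to unallocated memory


lifeguard = AddrCheckSketch()
for event in [("malloc", 0x10, 4), ("load", 0x11), ("load", 0x20),
              ("free", 0x10, 4), ("store", 0x10)]:
    lifeguard.on_event(*event)
print(lifeguard.errors)
```

Each event either updates the metadata or checks it, which is the finite state machine behavior the talk describes for the single-threaded case.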
A single-threaded lifeguard can look at this time-sliced execution and pretend that it's basically one program except that you have thread IDs whenever you switch from one thread to another, but the same mechanism of having a finite state machine basically still works. So the good news is this will work, it will function, but the bad news is that it's slow, because now we're throwing away all of our parallelism. We have to take all of these threads and put them all onto one core, and we don't want to do that. We want to get the performance benefits of parallelism. Okay. So let's say we want to -- yes? >>: So let me see if I understand. You only want to check actual races basically instead of possible races? >> Todd Mowry: Right. Yes. >>: So can we tell by the interleaving -- it isn't all possible interleavings, a particular interleaving, and you're going to catch that one, right? >> Todd Mowry: For the most part, yes. Actually, when I talk about what we actually did in the butterfly analysis, it's basically a hybrid where within a recent window of time we're actually considering all possible interleavings within a window, but beyond that we aren't. But with what I'm talking about here, it's only considering one interleaving. So the idea of the tool is simply to say given how things have behaved so far, does it look okay. It's not trying to think about anything about what else could have happened with a different interleaving. But the idea here is with the time slicing you force a particular interleaving and then you just analyze that interleaving and that's all that happens. Okay. So if we wanted to do this in parallel, now the analysis needs to actually run in parallel. And the tricky part is thinking about these dependencies. So one possibility -- we had another paper at ASPLOS last year where we looked at hardware support to do this. 
So if the hardware logged the communication between processors the right way, then you would have all the information you would need for the lifeguards to know precisely whenever I access something, am I reading something where this write was visible or not. So if you're willing to add extra information to all the coherence messages, you can track this precisely. So that's one possibility. But that requires even more new hardware. And also it really only works if you have sequential consistency or total store ordering. So that's one possibility, but that's not what I'm going to talk about today. So instead I gave Michelle this challenge, I said, okay, let's -- so it's natural to sit down and think if I'm going to analyze this parallel thing, step number one is figure out what the ordering is. What is the interleaved ordering of everything? But let's assume that that's impossible to observe, that we just don't know the interleaving. So in particular I think that's important for a couple reasons. First of all, without this new special hardware we really don't know what that exact interleaving is, at least within the near term -- I mean within the recent history. And also this idea of an interleaving only makes sense if you have a sequentially consistent machine, and although programmers like sequential consistency, real machines aren't sequentially consistent. So in real machines sequential consistency is an illusion. Under the covers, things are happening out of order. So let's embrace that and say that things really happen concurrently, within reason. So the idea -- the way that we're tackling this is we're not simply saying we don't know anything at all about ordering. 
What we're saying is that we have some idea of a window -- a bounded window of uncertainty where from the perspective of this reference over here, if we go back far enough on other threads, then there's some point where enough time has passed that we know that memory accesses are now visible because there's only so much buffering in the hardware, in the reorder buffers and the memory system. And that may be tens of thousands or hundreds of thousands of instructions. So beyond this point we know that things are visible, but in the recent past and in the near future on concurrent threads we don't know -- we'll assume that we don't know the ordering, and therefore anything is possible. So, for example, in this case if I see -- if there's an access where I'm setting P to be null, then if that occurs within this window of uncertainty, I have to assume the worst, which is that this was visible before I dereferenced this and that this would be a problem. So the idea is we want to have our analysis -- instead of having a total ordering, we have only a partial ordering information. And we're going to come up -- we want to design something that's conservative but correct. So I'll talk more about how this works. So, first of all, we have this window of uncertainty of a certain size, and one question is how big is it? So the answer is it's fairly big. It is typically thousands to tens of thousands of instructions. It needs to be a large enough number of instructions in this window that you would expect the reorder buffer to flush and the memory buffers to flush on the chip so that it would at least hit the cache, the last level shared cache. Although it's big relative to these structures, it's actually quite small relative to how many instructions have executed since the program started running probably. So we do know things about quite a bit of the execution. We just don't know what happened exactly in the last 10,000 instructions or so. Okay. 
So the idea is that as the program is running, we're capturing either through our fancy hardware support or through maybe even binary instrumentation, you're logging or you're watching all the instructions dynamically as they go by, and as you see these dynamic traces what we want to do is divide the execution, the log, into these different windows. So each window should contain at least W instructions, where W is the window size that we're targeting, and we call these things epochs because everyone calls windows like this epochs. We discovered that if you use the term epoch you immediately get a whole lot of related work suggestions when your paper gets reviewed even if they have nothing to do with your paper, but that's okay. So the idea is you carve these into these different epochs, and one thing to know is you don't need to precisely carve them because when you actually -- oh, second, this is not a barrier. So the application is not stalled when we're making these cuts. This is something that happens in the background. So we just take the logs after they're generated and we carve these things out. Also, they don't need to be -- it doesn't need to be a precise cut. So it's okay that there's a little bit of stagger between them as long as it's large enough that you have both W and whatever amount of stagger there is. So the way you would really implement this is, for example, you would just pass a token around between the different threads in the software and have a fence, and from the time you started until the time that it comes back, you don't really start counting W yet, and when it's made a complete round trip, then you count W instructions and then you know you've hit an entire window and then you can send the token around again. Yes? >>: So do you use some static information to put some initial guidelines [inaudible]? >> Todd Mowry: For this? >>: Yeah. Because it's like static analysis [inaudible]. 
>> Todd Mowry: In our experiments we just used machine parameters and we didn't look at the code to worry about this because code typically -- well, you could try to adjust the window size or think about optimizing it, but we were just erring on the safe side and made this a big number and didn't try to think about resizing it. But I have some -- >>: [inaudible] >> Todd Mowry: Oh, so you weren't -- so not specific to the epochs but just generally if we did static analysis -- >>: No, for the epochs in particular, if you used some static analysis and then you combine with the dynamic analysis you could save some performance overhead. >> Todd Mowry: Well, that's very -- I mean, yeah. We haven't looked at that. I'm very interested in combining static analysis with what we've done. >>: Right, yeah. I remember you said this complemented it. >> Todd Mowry: So the short answer is we didn't look at that for this. So we just did it all dynamically. In a way -- I believe the right thing to do is to combine static analysis with what we're doing. I've been in a way trying to make life hard for ourselves by not doing that static analysis yet just to see if we can keep up and get decent performance where we don't use any static analysis, but static analysis should only make things better. Okay. So first we carve things into these epochs, and this doesn't slow down the application. This just happens in the background as you capture these logs. Okay. So then what do we do? Well, conceptually we can think about these logically time stamped epochs, and relative to a particular layer of epochs if we go two layers either back or ahead, then enough time has passed that we know that there's actually a real ordering there, but if we think about adjacent ones, then those are too close. We don't know what the ordering is. So what we really need to do is consider a sliding window of three of these different layers at a time. 
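The epoch carving and the "two layers apart" rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `carve_epochs` and `relation` are invented for the example, the cut is made at exactly `w` log entries (the talk only requires at least W, with some stagger allowed), and the token-passing protocol is not modeled.

```python
def carve_epochs(log, w):
    """Split one thread's instruction log into consecutive epochs of w entries.

    Simplification: the cut is exact and the final epoch may be shorter.
    """
    return [log[i:i + w] for i in range(0, len(log), w)]


def relation(epoch_a, epoch_b, same_thread=False):
    """Say how a block in epoch_b is ordered relative to a block in epoch_a.

    Returns "before", "after", "concurrent", or "same".
    """
    if same_thread:
        # Program order on a single thread is always known.
        if epoch_b < epoch_a:
            return "before"
        if epoch_b > epoch_a:
            return "after"
        return "same"
    if epoch_b <= epoch_a - 2:
        return "before"      # far enough back that its effects are visible
    if epoch_b >= epoch_a + 2:
        return "after"       # far enough ahead that it cannot have happened yet
    return "concurrent"      # adjacent or same layer: ordering unknown


print(carve_epochs(list(range(7)), 3))
print([relation(5, e) for e in range(3, 8)])
```

For a block in layer 5, the blocks on other threads in layers 4, 5, and 6 come back as concurrent, which is why the analysis has to look at three layers at a time.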
So the way the analysis works is it just moves from three levels one at a time -- >>: Is that because you don't have the [inaudible]? >> Todd Mowry: Actually, it's not even -- even if they were precisely cut, the issue is that an instruction near the top of this overlaps some number of instructions into the -- >>: Got it. >> Todd Mowry: It's just because of the boundaries. >>: Yeah. >> Todd Mowry: Okay. Oops. I apologize. The fonts here are way too small. But the idea is that -- so, for example, if we're allocating memory here and dereferencing it down here and over here on these different threads, in this case it's two epochs away, so that's okay. We know that that's ordered and that there's not a problem here with respect to -- when you touch this, this will definitely have already been visible. But this one over here is too close. So for this case there's a good possibility, which is that it actually saw the malloc, but there's a bad possibility, which is that it didn't and it actually dereferenced this before the malloc was visible. Okay. So that's the -- so, so far the idea is we're going to have these epochs and we're going to think about these windows of uncertainty. But that's not the interesting part. The interesting part is how we actually do the analysis. Okay. One other piece of information is that things are a little bit -- there's some special information that we know, which is with respect to, say, a particular epoch on a particular thread, we know that for the same thread in the previous epoch, that that definitely occurred before the event here because it's on the same thread. That's usual thread ordering. And we also know that the things after it definitely have not occurred yet. But otherwise, for all the other concurrent threads, we don't know -- we have to assume that anything could be interleaved in any way on those other threads with respect to any instruction here. 
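The shape described above, one block with a known predecessor and successor on its own thread and unordered neighbors on every other thread, can be written down directly. The sketch below is illustrative only: the function name `butterfly`, the `(thread, epoch)` tuple encoding, and the labels `body`, `head`, `tail`, and `wings` (borrowing the butterfly terms that come up later in the talk) are assumptions made for the example.

```python
def butterfly(thread, epoch, num_threads, num_epochs):
    """Return the blocks that matter when analyzing block (thread, epoch).

    Blocks are identified by (thread, epoch) pairs. Hypothetical encoding.
    """
    def valid(layer):
        return 0 <= layer < num_epochs

    # Same thread, previous epoch: definitely happened before the body.
    head = (thread, epoch - 1) if valid(epoch - 1) else None
    # Same thread, next epoch: definitely has not happened yet.
    tail = (thread, epoch + 1) if valid(epoch + 1) else None
    # Other threads in the three-layer window: ordering unknown.
    wings = [(t, layer)
             for t in range(num_threads) if t != thread
             for layer in (epoch - 1, epoch, epoch + 1) if valid(layer)]
    return {"body": (thread, epoch), "head": head, "tail": tail, "wings": wings}


print(butterfly(thread=1, epoch=2, num_threads=3, num_epochs=5))
```

With T threads, each body has up to 3 × (T − 1) wing blocks, each holding thousands of instructions, which gives a sense of how many interleavings a naive approach would have to consider.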
So we kept drawing these things on the whiteboard and decided that that looked like a butterfly or a dog bone, I guess, but that's why it's called butterfly analysis, not for any other good reason. And people point out that our butterfly flies backwards because we put the head at the top of the whiteboard, but it's actually moving this way. But forward data flow is flowing that way, though, so it does go through the head and come out the tail. Okay. So we have this -- here's our problem, which is we need to do the analysis considering everything here to be concurrent. So what's the big deal? Well, the big deal is a combinatorial explosion of possibilities. So the size -- these blocks are large. They have tens of thousands of instructions, and we have a large number of these threads, so there's a huge number of -- even though we bounded the window of uncertainty, the number of possible combinations is enormous. So we don't want -- for performance reasons, remember, we're trying to do this analysis on-the-fly as the program is running without -- and we don't want to slow things down, so we can't afford to do something that's super computationally expensive. We also don't want to say to the lifeguard writer, well, okay, write the code here that happens to know what to do with all of these concurrent things. Good luck with that. Instead we want to have this framework where the framework knows how to deal with all the concurrency, and if you just write down your analysis in a straightforward way it will just take care of all of this for you. So there are all these possible interactions, and -- all right. So I'm a compiler person, and I tend to think about data flow analysis and that type of thing by default, and I was staring at this and thinking, wow, can I just somehow treat -- even though this is a dynamic piece of the log, what if I thought about this like a control flow graph. Is there some way that I can do flow analysis across this and borrow techniques from that? 
Okay. Well, it turns out that that's not a good idea to do it in a straightforward way, and here are some reasons for that. First, if we just think about -- let's not think about everything. Let's just start by thinking about one of these concurrent blocks. So the issue is that information can escape from this block and affect any instruction here after every instruction in this block as opposed to in a normal flow analysis where things only enter through the top and exit through the bottom. So that's a problem. And, also, similarly, information can enter this one when we go to analyze it at the beginning of each instruction. So you might think, well, okay, well, why not just represent -- take each instruction and make it its own basic block? Well, we could do that, but now we've got, you know, just a ridiculous number of little basic blocks, you know, tens and tens of thousands of them now. So obviously that's going to be too slow to compute, so that's not really practical. And there's another bad thing about doing it this way. We realized that -- this is an example here where one thread is executing along, and imagine that you're doing taint analysis to see whether you're doing information flow tracking and you want to know whether X is tainted, and assume that at the beginning of this it's not tainted. So we're executing along, we check it, this check should be okay because it's only tainted after the check, and this other thread isn't tainting it. But it turns out if we just treat this like a flow graph and let it crank away and look for -- solve the problem, it can decide that there's a problem here, and the reason is that it can think that this taint has propagated information up here and down through here and then back up over here and down here again. So we'll actually get wrong answers if we try to do this. Okay. 
So we don't want to do data flow analysis in that sort of traditional style, but the idea that we had was that we started thinking about something that's called interval analysis or region based analysis where you try to collapse a flow graph into, say, a single transfer function to represent the entire thing. And you do this by reducing different components. For example, if there's two nodes where one follows the other, you can compose them together, you can deal with meet operations, and the most interesting thing in my opinion is the thing you do for loops. So, for example, to make this back edge go away what we want to do is compute the closure of the transfer function to represent anywhere from zero to any number of instances of this transfer function being applied. But the idea is with this closure operation and region based analysis, in one shot you can summarize the net effect of any number of iterations through this loop. And that felt very much like what we were trying to do with concurrency. So the idea is we're -- our solution was inspired by that thought. So it's different, though, because in interval analysis we're dealing with a static control flow graph and we're computing the closure for a loop. In our case this is dynamic chunks of a log, and the closure is with respect to all of these concurrent blocks in the wings of the butterfly. Another thing that's different is that information can enter and escape anywhere in the middle of these blocks. So that also makes it different. But similar to flow analysis, we wanted to have this standard framework where if you wrote your problem in a certain way, the framework will just take care of all the details for you. Okay. So in the paper we go through different examples of data flow problems and lifeguards and show how to adapt the ideas of flow analysis to work here. 
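[Editorial sketch, not from the talk.] The compose and closure operations described above can be illustrated for the common gen/kill form of transfer function, f(x) = gen ∪ (x − kill). The class and method names below are hypothetical; the closure formulas are the standard ones for gen/kill functions under an intersection meet ("must" problems such as available expressions) and a union meet ("may" problems such as reaching definitions), and they are not taken from the butterfly paper's equations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """A gen/kill transfer function: f(x) = gen | (x - kill). Hypothetical helper."""
    gen: frozenset
    kill: frozenset

    def apply(self, facts):
        return self.gen | (frozenset(facts) - self.kill)

    def then(self, later):
        """Compose two nodes where `later` follows `self`."""
        return Transfer(later.gen | (self.gen - later.kill),
                        self.kill | later.kill)

    def closure_must(self):
        """Zero or more applications, combined with an intersection meet.
        A fact survives only if no iteration can remove it."""
        return Transfer(frozenset(), self.kill - self.gen)

    def closure_may(self):
        """Zero or more applications, combined with a union meet.
        A fact is present if it came in or any iteration generates it."""
        return Transfer(self.gen, frozenset())


# A loop body that computes a+b and overwrites an operand of a-b.
body = Transfer(gen=frozenset({"a+b"}), kill=frozenset({"a-b"}))
entry_facts = {"a-b", "c*d"}
print(sorted(body.closure_must().apply(entry_facts)))
print(sorted(body.closure_may().apply(entry_facts)))
```

The point of the closure is that one summary function stands in for any number of trips around the loop, which is the property the talk borrows for the concurrent blocks in the wings.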
And one of the things that's quite different, as I've been mentioning in the past, normally you have in and out and you might generate and kill different pieces of information, and we had to introduce new concepts. Our new concept we called Side-Out and Side-In. Sounds like we spent too much time playing volleyball here. But that captures -- Side-Out captured the fact that information can escape after each instruction and have side effects on other instructions, and then Side-In captures the opposite of that, that information can flow into a block before any instruction and not just at the beginning of it. So it's these things that make the equations look fundamentally different from what you're used to seeing with normal data flow analysis on control flow graphs. >>: So those two steps are sort of treating the statements in a flow-insensitive fashion. I mean, that's another way of looking at it. You're basically saying there are all these [inaudible] possible, and you have a flow insensitive step interleaved with a flow sensitive step. So within each thread you need to be flow sensitive but sort of across, these side things are like flow insensitive. >> Todd Mowry: Yes. >>: Gotcha. >> Todd Mowry: So -- okay. So what we do is, in this framework, the person writing the tool, they have to specify various things. They have to talk about which instructions and which system calls or library calls are interesting that will cause the metadata to be changed or checked, and you have to talk about how to represent the metadata, and you also have to specify a meet operation similar to normal data flow analysis. And then the framework operates in two passes. What happens is we make -- oh, and to be specific, what I mean is there's this sliding window. As the window slides to the next layer, in total we will walk over each layer twice, but that doesn't mean we record the whole log and then go back and start it over from the beginning. 
We just do this two passes as we're sliding the window forward. In the first pass over a block we compute the effect of its instructions on all the other concurrent instructions within the three layers, and on the second pass, that's the flow sensitive one where we gather all that information together and then we can walk down through the thread and decide what the real state of the world is at that point. Okay. So then to illustrate this, although this isn't an interesting lifeguard, it's the simplest thing we could come up with that was a stand-in for a real piece of analysis, which is just available expressions. That's something that compilers compute to look for redundancy. And so I'll show you what that looks like and then I'll show you how that can be applied to a real lifeguard. But the idea behind available expressions is that it's -- an expression is available, something like A plus B is available if it's evaluated along every path that it can arrive at a certain point. And this is interesting because it has to be true for all instances that arrive somewhere, and a lot of the analysis tools care about that. What is true for all possible interleavings? Okay. So for available expressions let's say the expression of interest is A minus B, and so we see it here, here and here, and let's say we want to know whether it's available right here. So we carve things into epochs and imagine that it ends up looking like this, and also imagine that the very beginning before we reach this point that A minus B is available along -- across all these different threads. So intuitively what's the answer that we want to get here? Well, if you just look up here it may look like everything's good except there's a problem down here, so down here we're rewriting B, and so there's a path where this is not available. 
So if we go here to here to here and there, which is a possible interleaving, then it's actually not -- A minus B -- the most recent A minus B has not been computed at this point because B has changed. So we wanted to get the answer that this is not available in this specific example. Okay. So as I said earlier, there are two passes. In the first pass we're going to summarize the effect of a block on the other blocks. And this boils down to computing the Side-Out or the side effects of it coming out the sides as well as the bottom. So in this case the net effect of this block is that it's going to kill A minus B because it's rewriting B. So that will end up showing up in the Side-Out for this block. So we do that for -- we've done that for all the different blocks. You can see down here that its Side-Out contains the fact that it's going to kill A minus B. So that's our first pass. We've computed this. And then in the second pass we collect all these things together and we do a meet of all of that and then we can summarize -- we have a summary of all of the Side-Outs, and given how meet occurs, you'll kill something if it's killed anywhere, so killing A minus B ends up in the input after this -- that we get after this meet, and then we can do our second pass where we now realize that this is not available, because even though it was available at the point where we started to enter this block, probably it's getting killed because of the concurrent accesses, so it's not available. Yes? >>: So the [inaudible] killed A minus B, but did it happen to take into account the fact that Z -- you know, you had this path going to Z equals A minus B and then B equals B minus 1, but is that necessary [inaudible]? >> Todd Mowry: No. That was just -- that's not the only path. >>: Okay. I see. >> Todd Mowry: It's actually just simply a fact that -- >>: [inaudible] >> Todd Mowry: That was just an illustration that touched everything. Yes? 
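[Editorial sketch, not from the talk.] The two passes for available expressions can be approximated in a few lines. This is a deliberately simplified model under stated assumptions: each block is a list of `(op, arg)` pairs, the only expression tracked is `a-b`, the "wings" are just the concurrent blocks from other threads, and the three-layer sliding window and head/tail handling are left out. All names (`EXPR_OPERANDS`, `side_out_kills`, `second_pass`) are hypothetical.

```python
# Operands of each tracked expression (assumed for this example).
EXPR_OPERANDS = {"a-b": {"a", "b"}}


def killed_by_write(var):
    """Expressions invalidated when `var` is overwritten."""
    return {e for e, operands in EXPR_OPERANDS.items() if var in operands}


def side_out_kills(block):
    """Pass 1: summarize what this block may kill in concurrent blocks.
    The kill can escape after any instruction, so order is ignored."""
    killed = set()
    for op, arg in block:
        if op == "write":
            killed |= killed_by_write(arg)
    return killed


def second_pass(body, wings, available_in):
    """Pass 2: meet the Side-Outs of the wings, then walk the body in order.
    An expression counts as available only if it is locally available
    and no concurrent block might have killed it."""
    side_in = set().union(*(side_out_kills(w) for w in wings))
    available = set(available_in)
    results = []
    for op, arg in body:
        if op == "check":
            results.append((arg, arg in available and arg not in side_in))
        elif op == "eval":
            available.add(arg)
        elif op == "write":
            available -= killed_by_write(arg)
    return results


body = [("eval", "a-b"), ("check", "a-b")]
wings = [[("eval", "a-b")], [("write", "b")]]
print(second_pass(body, wings, available_in={"a-b"}))
```

With the wing that rewrites `b` present, the check reports `a-b` as not available even though the body itself just evaluated it; with that wing removed, the same check reports it as available.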
>>: So how does this analysis work together with a machine that works based on chance? So you can do this [inaudible] in hardware and either you can do these intersections [inaudible] and say these two epochs do not access even the same [inaudible]. So wouldn't it simplify your analysis a lot? >> Todd Mowry: Yeah. Actually we do that in some of the lifeguards. That's a very good observation. That's a good trick for knowing that -- to accelerate things, because if you know that there's nothing that overlaps -- if the read and write sets don't overlap, then you don't need to actually do the second pass. So we can short circuit the second pass in some cases. And that's something we want to explore much more in our future work is ideas like that for accelerating it, taking advantage of that, plus static analysis and other ways to memoize things. I think that there's a lot of potential for that. And you could even -- if there was hardware support, like some kind of Bloom filter or something, you could use that, too. So you could, in the first pass, load up these Bloom filters and exchange them and then just do a quick test in hardware to see whether there's any conflict. That might be a nice way to do the checking too. >>: So how about if you actually -- let's say that these guys actually communicate, but if they communicate you squash one of the guys. So basically you enforce the framework do not communicate, so [inaudible] approach to this. >> Todd Mowry: So something more transactional or -- >>: Yes. >> Todd Mowry: Yeah. So in fact that's another possibility. We were -- we didn't want to -- well, we weren't thinking about -- our philosophy was to take whatever machine model or memory model the machine had and look for problems. 
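[Editorial sketch, not from the talk.] The Bloom filter idea mentioned above can be modeled in software: each block summarizes its read and write addresses as a bit mask during the first pass, and two blocks can skip the second pass when the masks show no possible write/read or write/write overlap. The filter size, hash count, and function names are arbitrary assumptions for illustration, and a hardware version would look quite different. The test is conservative: a zero intersection proves the sets are disjoint, while a nonzero one may be a false alarm.

```python
import hashlib

BITS = 1024    # filter width (assumed)
HASHES = 3     # hash functions per address (assumed)


def bloom(addresses):
    """Summarize a set of addresses as an integer bit mask."""
    mask = 0
    for addr in addresses:
        for i in range(HASHES):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % BITS)
    return mask


def may_conflict(reads_a, writes_a, reads_b, writes_b):
    """True if block A and block B might touch a common address
    with at least one of the two accesses being a write."""
    ra, wa, rb, wb = (bloom(s) for s in (reads_a, writes_a, reads_b, writes_b))
    return bool(wa & (rb | wb)) or bool(wb & ra)


# Block A writes 0x20 and block B reads it, so the second pass is needed.
print(may_conflict({0x10}, {0x20}, {0x20}, set()))
# Two blocks that only read can always skip the second pass.
print(may_conflict({0x10}, set(), {0x10}, set()))
```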
In fact, as we -- as we started doing this, it's really tempting to think of ways to simplify the problem and say, well, if we have properly synchronized programs then we could play this trick and that trick, but I always remind the group it's exactly the programs that aren't properly synchronized that we have to be paying attention to. But, anyway, that's a tangent there. That's an interesting thought. Ben? >>: So is this analysis happening dynamically every time you want to analyze this effect? >> Todd Mowry: Right. So there's a lot of potential for reusing things. That's, like, number one on our list is to take advantage of memoization, because you keep -- you know, if you're in a loop, you're going to see the same code path again and again. Some things will be a little different, maybe some data addresses will be different, but right now we're actually regenerating everything on-the-fly all the time. So the overheads are much higher right now than they need to be in this paper. Let's see. Okay. So in our ASPLOS paper we looked at available expressions and reaching definitions. They're interesting because one of them takes the intersection and one takes the union of the possible inputs, and you can apply those concepts to understand how to do real lifeguards, although there are other extra details that make those even more interesting. And what we did in the paper is we have proofs showing that there will never be false-negatives in the analysis, but there can be false-positives certainly. And in fact I've got numbers on that coming up in the results. And also you can apply this to a machine that has relaxed consistency. It doesn't depend on sequential consistency. Okay. So just very quickly, I won't go into a ton of detail on this, but looking at address check, this is a real lifeguard, and it's looking for memory allocation bugs. And I talked about it a little bit before. And it's a little bit like available expressions. 
You want to know whether something is true in all cases. So, for example, here if we are dereferencing Q and then allocating something for P and dereferencing it and freeing something over here, you know, we start with our epochs, we do our analysis, and after the first pass we might decide things like this block is mallocing P and not freeing anything, and we compute this everywhere. So down here we realize that we're freeing Q, and we exchange all this information after the first pass, we do our meet, there's a little fancy animation, and now we know that at the beginning of this block, that somehow concurrently something is freeing Q and nothing is mallocing anything. And now when we do the second pass we know that there's a problem here when we dereference Q because down there there was a free of Q. So this is a problem. There's tons more detail in the paper if you're curious. Okay. So now onto some results. So we built the code to do this, and we ran our experiments on top of our logging infrastructure. And the way that that works is we're emulating the hardware changes inside of Simics, but on top of that we have a full software stack running on it. And we measured 4-, 8-, and 16-core CMPs where we use half of the cores for the application and the other half for the lifeguards, and we looked at both 8K and 64K epoch window sizes. The first set of results is for the 64K window size. Yes? >>: So I didn't understand what kind of [inaudible]. >> Todd Mowry: Yes. So in the first few slides I was talking about the logging hardware for the project, but I forgot to mention that the butterfly work doesn't actually really need any of that. You can do all the butterfly stuff just with binary instrumentation. It doesn't require any new hardware. We just used this because it just makes everything run faster, so everything would be 20 times slower if we did it with binary instrumentation, but you can do it that way. 
The good news is it would still be a win for binary instrumentation because you could at least run it in parallel. I mean, your tool is going to start off 30 times slower, but it can at least speed up with more cores, whereas if you time slice it's going to stay 30 times slower and never get faster even if you have more and more threads. >>: But there's no additional hardware to support butterfly analysis [inaudible]? >> Todd Mowry: Yeah. In fact, a goal of the analysis was to not require any special hardware support. So it works on existing hardware. It just -- like I said, in our experiments we went ahead and used our logging because we like it, but we didn't need to do that. Okay. So here are a couple of applications that we were looking at, and along the x axis for each application this is the number of cores running the application, and there's the same number also running the lifeguard. And the y axis is time, normalized time. And it's normalized to the application running by itself on one core. So the blue bars show time when we run the application in parallel by itself without monitoring. So it gets faster because we get parallel speedup in most cases. So that's good. And we're only going to make things worse than this because monitoring isn't going to speed anything up, it's just going to make things slower. Okay. So the next bar in red is the time sliced approach. So if we just took those concurrent threads and made them run interleaved on one processor and then used one lifeguard to monitor it, we get the execution time shown in red. And, not surprisingly, this doesn't typically ever run faster than one processor. Sometimes it runs much slower due to a variety of effects like partly -- >>: [inaudible] >> Todd Mowry: Due to measurement error. It shouldn't ever run faster. But basically this isn't a good number. 
You want to compare this performance with the performance of the blue bar, and you can see that it's way slower than the performance without monitoring. So this doesn't look very good. And then, finally, the green bar is with the butterfly analysis, and so we actually can speed up. So butterfly analysis is not -- does not have zero overhead. It's a little worse than the blue case. In some cases it's actually significantly worse due to the fact that we have hardly done anything to optimize or to take advantage of memoization or other things like that. I think in our ongoing work -- in our future work we're exploring quite a few ideas to reduce this sort of constant overhead of what we're doing, but the good news is that it speeds up and it scales. So like this is a nice case here where it's getting a lot faster. Yes? >>: [inaudible] >> Todd Mowry: Yeah. So to make that comparison what you could argue is that really I should be comparing the performance of running the application on 4 cores and using 4 cores to do the analysis with the application running on 8 cores. And the data is actually there. You can just say compare this blue bar with that green bar. So it does make it look even a little bit worse, but it's better than time slicing still. And the other argument for that is -- well, but if your program is buggy, you would rather run the tool and find the bug. I've had that discussion many times with people at Intel. I don't know if that was the point you were making or not, but sometimes people complain that we're using half the cores to do the analysis, and we actually are pursuing an idea where we think we can dramatically reduce the overhead for doing that. But that's still ongoing work. Ben? 
>>: It seems like at some point you get to the conclusion that having [inaudible] language that has run time checking and takes [inaudible] is part of the language run time is way better than this, because arguably you are determining certain properties as address check properties, but at a high cost. I mean, at this point it's like half the cores ->> Todd Mowry: Actually I even hate having address check up here at all because it's a really bad motivation for doing this kind of analysis. There are other ways to get memory safety checking, but I think that security checking and concurrency checking, which are things we've looked at also, those are much more legitimate. I think security in particular, doing information flow tracking, that's -- but, yeah ->>: It's an interesting thing because I think so -- you know, I don't think I've seen this case lately, but I think you could make the case which is that if you start from a bad position -- in this case, you know, language that doesn't have garbage collection or whatever -- you know, the cost of getting back to that, especially as you go to concurrency and these kind of things, it really doesn't scale, you know. You sort of lose too much at that point, especially -- and, you know, it just reminds me of security properties. I mean, in a similar way the sort of work going on here [inaudible] which embeds the ability to [inaudible] security properties, again, in a type system. So it's -- you know, you get -- I mean, it's sort of like what you're saying, too, about doing static checking. The more static checking you can do, the less of this you have to do at run time. >> Todd Mowry: Yeah. The ideal thing would be to have type safe languages and then only add this in for whatever else is left. And then the numbers will probably look really good because there will be way less to have to check. But there will probably be some room for some dynamic things that you would want this to look at. 
You can still have bugs in type safe programs, I believe. Yes? >>: [inaudible] so you are collecting information from the user's machine so you can be able to ->> Todd Mowry: Yeah, you would definitely -- you'd certainly imagine using it that way. We have a couple of goals. One of them is, yes, you can collect better debugging information to send back to the developer so they can fix the bug. But we're also hoping long-term that if we could catch problems early enough, that we could do something in the software to not -- to maybe work around the problem or mitigate the damage. We wouldn't necessarily be able to fix the problem, but like if there was -- if we detected a malicious packet arriving, maybe we could rewind the system and kick it out and run forward without it, or if there's a data race maybe or something like that, maybe we can back up and do that type ->>: [inaudible] >> Todd Mowry: Yeah. So what we envisioned is having some recovery mechanism combined with this. That isn't something that I'm talking about yet or that we've looked at, but that was part of our long-term -- that was actually what started the whole project is we thought about doing recovery and trying to have systems that were more robust, but then we got a little bogged down in figuring out how to find bugs. Turns out that that's a hard problem. Okay. So the green bars are much better than the red bars and, in many cases, not too far off from the blue bars. In some cases the overhead's larger. There's actually a case here where we didn't actually quite beat the red bar, but that's mostly just because we start off with a very large amount of overhead. We are scaling very nicely, and if we can do some of the obvious things to reduce the constant overheads, I think the performance would look much better. But as you go to more and more cores, the scaling benefits of this will become more and more attractive. Okay. I'm getting close to the end here. 
So one of the things that I mentioned is that this is conservative analysis. It doesn't have false-negatives, but it can have false-positives. And I also said that we looked at different window sizes. So what you can do at a high level, having a larger window size is better for performance generally because there are some constant overheads every time you kind of move from one window to the next, so you can mitigate that a bit. So in these experiments the purple bar is for a smaller window size of 8K instructions and green is 64K instructions. So generally the larger window is faster. This is execution time here. But that comes at a cost, which is false-positives. The larger the window size, the more things get mixed in there. Then you can see that the false-positive rate goes up. In one case, Ocean, it gets to be probably unacceptably high. Yes? >>: These are fractions of all memory accesses that are falsely categorized as errors? >> Todd Mowry: Yes. >>: So how does this compare to the true error? I mean, if I got a report, what's my -- what's the probability that it's a real bug versus it's a false-positive? >> Todd Mowry: Oh, I think I may have that in a backup slide. Let's see. I can either jump there now or maybe get it at the end. I think we have that answer. I'll get back to that at the end. But anyway, this is a parameter that you can play with. You can trade off performance and false-positives with this, and we've got ideas, and in just a few slides I'm going to talk about another idea to build upon that. Okay. So that basically talks -- that covers what we discussed in the ASPLOS paper, but now I'll very briefly talk about where we're going with this and what we've been working on since then. So the themes of what we've been working on are improving precision and improving overhead -- I'm sorry, improving performance. 
So we have false-positives, we'd like to have fewer of them, and we're not quite keeping up with unmonitored execution, and we'd like to do something about that also. So the first thing that we're looking at is that we are -- some of the false-positives that we get are due to the fact that in the initial analysis, we just ignored synchronization completely. So if you had a case like this, this should not appear to be a taint problem, but we would think that it was. So what we're doing is we've been -- the paper we're going to submit this summer to ASPLOS, we've extended our analysis, and the basic idea is that whenever there is explicit synchronization in the middle of a block, we basically break it into a sub-block and then we have a way of meeting together. We basically merge two inputs before we process this other block so it's not -- at a high level it sounds kind of straightforward, but there were a few details to work out in our equations to make this look clean and elegant still. And we used vector clocks also. And so that's one of the things we've been working on. Lots of animations here. Ah. Now, the next thing we're doing which I'm actually really excited about is that by default, with our conservative analysis, you get -- a false-positive looks the same as a real -- a true positive, unfortunately. But what we can do is instead of just having a binary state is something tainted or untainted, we can create another state which is a maybe tainted state. So actually if I back this up a second, in this case here one possibility is that there's a taint and then an untaint and a read. In that case it's good, it's not tainted, but another interleaving is that we do the untaint and then the taint sneaks in here between these two things and then we do the read and then it's tainted. So both possibilities exist within this window of uncertainty. 
But in our analysis we can recognize that and flag it separately and say here's a case where we've seen both possibilities. So that would be a third state. So we'll have definitely untainted, definitely tainted, and possibly tainted, meaning that there's -- both possibilities exist. And I think that this is very interesting because that state -- well, first of all, once we model that we'll immediately be able to bound the false-positives because a false-positive would have to show up as this case. It wouldn't be that case. And also we want to use this as a -- this is a metric that we can use to dynamically adjust our precision. If we see a lot of these maybe cases occurring, then we may want to dial down the window size or do other things like that to do more precise analysis. So that's another idea that we're pursuing. >>: But isn't the whole premise here that the program in normal execution mostly works, which by definition would mean that things that always end up tainted would probably have been detected, right? In some sense, you know, the maybe tainted only occurs because, you know, there is one interleaving that's bad, but it's not the common case, right? So that's a bug. But a bug that occurs in every execution is pretty easy to fix in some sense, right, in a concurrency? So I understand the idea here, but I'm not sure that for real bugs it would actually help separate the false-positives from the real bugs. >> Todd Mowry: Well, what we wanted to do was when we see the maybe tainted, what we might want to do is even, say, rewind our analysis and try to do other more expensive things to try and definitively tell what is this thing really, is it really a problem or really not a problem. That's what -- the idea, rather than just saying, wow, and kind of hold on to a maybe tainted case. On-the-fly we want to zero in on that. And then one thing you could do with this is to dynamically adjust the resolution of the window. 
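[Editorial sketch, not from the talk.] The three-state idea (definitely untainted, definitely tainted, maybe tainted) can be made concrete with a small meet operator. To show where the "maybe" state comes from, the example below brute-forces every interleaving of two tiny threads; that enumeration is only for illustration, since avoiding exactly this enumeration is the purpose of the framework. All names are hypothetical.

```python
from enum import Enum
from functools import reduce


class Taint(Enum):
    UNTAINTED = "untainted"
    TAINTED = "tainted"
    MAYBE = "maybe tainted"


def meet(a, b):
    """Agreeing states stay as they are; any disagreement becomes MAYBE."""
    return a if a is b else Taint.MAYBE


def interleavings(t1, t2):
    """Yield every merge of two operation lists that preserves each thread's order."""
    if not t1 or not t2:
        yield list(t1) + list(t2)
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest


def taint_seen_by_read(order, initial=Taint.UNTAINTED):
    """State of x at the first read in one particular interleaving."""
    state = initial
    for op in order:
        if op == "taint":
            state = Taint.TAINTED
        elif op == "untaint":
            state = Taint.UNTAINTED
        elif op == "read":
            return state
    return state


def classify(t1, t2):
    """Meet the read's state over all interleavings of the two threads."""
    return reduce(meet, (taint_seen_by_read(o) for o in interleavings(t1, t2)))


# One thread untaints x and then reads it; the other taints x concurrently.
print(classify(["untaint", "read"], ["taint"]))
```

In this example the taint can land either before the untaint or between the untaint and the read, so the read is classified as maybe tainted rather than as a definite error.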
If you see a lot of these maybe tainted cases, then maybe we should make the window smaller to try to have less noise in our results. Even though that's going to slow things down a little bit, it might be worth it. Yes? >>: How close do you run these window sizes? Kind of the minimum of what you're forced to use? >> Todd Mowry: Right now not really at all. Right now they're much, much larger than they need to be. There's another issue. If you make it too small, then you can start getting false-negatives. There are other things that we can do in this case. We can start inserting other -- under the covers we could start adding other markers or things to track what's really happening, basically -- in the hardware-based approach, the hardware will pass extra information to track the real communication all the time. We could do that in software. It would be expensive. But we might think it's worth doing if we start seeing cases like this, and we could zero in on those specific cases and maybe get more precision that way but only pay for it when we really need it. So that's the idea. Okay. So, finally, last slide, so the idea is this is a way to do parallel monitoring that was inspired by interval style data flow analysis. I think it has some nice properties because if you write down your problem in this style, the framework will take care of all the unpleasant details of thinking about concurrency and running reasonably efficiently, and the fact that information goes in and out of the sides of blocks was one thing that made it new and interesting for us and we saw some decent performance. And with the larger window sizes, the false-positive rates weren't too bad, and we're excited about the things we're looking at in the future with this. So that's it. >> Tom Ball: Thank you very much. >> Todd Mowry: Thanks. [applause]