>> Ben Zorn: Welcome. It's a great pleasure to introduce Joe Greathouse. Joe's a Ph.D. student from the University of Michigan and works for Todd Austin. He's going to be talking about dynamic software analysis. I want to say a few things about that. We know at Microsoft how important static and dynamic analysis is for finding bugs and reliability problems in programs. But one of the key problems with dynamic analysis has always been that it's very slow relative to static analysis, and the overhead that you have doing it sometimes makes it unusable. So today we're going to hear more about how to make that dynamic analysis fast and usable. Thank you. >> Joseph Greathouse: Thanks. So you'll hear some ways to make it more fast and usable. I hope I can get across that this is not the only way to do it. If anybody else has any ideas, please stop me in the middle of the talk, chuck things at me, tell me that I'm wrong and that you have much better ideas. I would love to hear them, because I don't think this is going to be the be-all end-all of all of this. I just think they're interesting ideas. As I was introduced, I'm Joe Greathouse, and I'm talking about accelerating dynamic software analyses. The reason I'm here talking about dynamic software analyses is, as you can imagine, bad software is everywhere. As an example, in 2002 the National Institute of Standards and Technology estimated that bad software cost the U.S. economy a little under $60 billion per year. And that's from things like software crashing and losing a couple of man-hours here and there, corrupting data files that are worth a lot of money, et cetera. And mind you, that was about ten years ago. So besides things like software crashing all the time and losing money, you also have things like security errors that come about from bad software. Just as another example of how much that can cost, in 2005 the FBI Computer Crime Survey estimated that computer security issues cost the U.S. economy about $67 billion per year, where more than a third of that is from things like viruses or network intrusions or things you can trace back to bugs in software, bad programming practices, bad setup files, et cetera. And so I mention that these are from 2002 and 2005. Now, that data is a little bit old, but what this graph shows right here is vulnerabilities over time. This is the data from two different bug databases: CVE candidates, the Common Vulnerabilities and Exposures list, and the vulnerabilities that US-CERT releases publicly. As you can see, over time these numbers are at the very least not going down. You can kind of see a general uptick from the 2002/2005 time frame to 2008, the last year I could get data from both databases. But the point is bad software is everywhere, and it's probably getting worse over time. We're writing more software. It's harder to program complex software. So we're adding more bugs into our software, et cetera. So I just want to give you an example of a modern bug. This is one of my favorite bugs that I love to talk about, because it's really interesting to me. This is a security flaw that was disclosed a little over a year ago in the OpenSSL security library -- I believe it's actually in the TLS part of that library. It was a small piece of code in this library that is used to secure Web servers. So the idea with OpenSSL is, if you go to some website and it says you have an HTTPS connection, then you have a secure connection to that Web server, and that server is running OpenSSL.
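The problematic pattern was essentially this lazy, check-then-allocate idiom -- a simplified C sketch with made-up names, not the actual OpenSSL source:

    #include <stdlib.h>
    #include <string.h>

    static char *shared_buf = NULL;       /* shared across connection threads */

    /* Racy lazy allocation: two threads can both observe shared_buf == NULL. */
    void handle_record(const char *data, size_t len)
    {
        if (shared_buf == NULL)           /* 1: check                          */
            shared_buf = malloc(len);     /* 2: allocate                       */
        memcpy(shared_buf, data, len);    /* 3: copy -- overflows if another   */
                                          /*    thread installed a smaller     */
                                          /*    buffer between steps 1 and 3   */
    }

The usual fix is to hold a single lock across the check, the allocation, and the copy, so no other thread can slip in between those steps.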
The idea there is that there are smart people who program this. They worry about security all the time. And still these little four lines of code ended up biting them and having a really nasty security vulnerability that in fact allowed remote code execution on an OpenSSL server. What this little piece of code does is check to see if some pointer is null. If it is, it allocates a buffer into it and copies data into that buffer. A simple operation. In a single-threaded world this piece of code works perfectly fine. If there's no buffer, it puts one there that's the correct size and puts the data into it. The problem comes around when you run this in a multithreaded world. In this case OpenSSL would use some shared pointer between these threads, and it's possible two people are connecting to this SSL server at once, and they each want to allocate some different size buffer. In that case, these things might happen at the same time. This is just some example dynamic ordering of how the instructions in this program might run, where time is going down here. As you can see, in this ordering, the first thread checks to see if that shared pointer is null. And it is. So it's going to start trying to allocate the buffer and copy data into it. However, because this is on some concurrent machine, maybe there's an interrupt and a second thread gets scheduled, or a second processor is trying to do this at the same time. The second thread can ask the exact same question before anything happens to that pointer. So now both threads are going to try to make a buffer here. The second thread wins the race, allocates some large buffer, and gets ready to start copying a large amount of data into it. The problem comes about when the first thread is scheduled again and starts working on this. It still wants to put a buffer in there. So it creates the small buffer for its data and just destroys the pointer to the large buffer. So the first thing you notice here is that the large buffer has been leaked, and that's bad enough. That might end up in some type of denial of service where you run out of memory and the program crashes. But the worst thing happens after the first thread copies its small amount of data into the small buffer: the second thread is then rescheduled and starts copying data into whatever that pointer is pointing at. In that case, you end up with a buffer overflow, because it's trying to stuff a large amount of data into this small buffer. So what this is, it's a data race that caused a security error in a program written by very smart people who worry about security all the time, and they still get bit because of small little errors that are pretty difficult to find just by looking with your eyeballs. So one way that you can try to find these things is with dynamic software analysis. There are static analyses out there and other things that you can do that look at the program offline and try to reason about what kind of problems there could be. What I'm going to focus on for this talk is dynamic analysis, where you analyze the program as it runs, and you try to reason about what state the program is in, whether it's a bad state or not. So you would be able to find errors on any executed path, because what you would do as a developer is you have some program.
Maybe you throw it through some type of instrumentation tool. I chose a meat grinder here because Valgrind sounds like that. You would end up with some program that has analysis as part of its state, as part of its code. And you run that program through some type of in-house test machine. You spend some large amount of time grinding on it, you throw tests into it, and this dynamic analysis looks at the program as it executes and sees if any of those states are wrong or bad in some way. After some amount of time grinding on it, you send back your analysis results to the developer and you say, ah, the states you checked were good, or the states you checked were bad. The two downsides to this are that it has an extremely large runtime overhead on average -- I'll get to what those numbers are in a slide -- and, in conjunction with that, dynamic analyses can really only find errors on paths that you execute, because they're looking at the paths that you're executing. And if you have very large runtime overheads, that can significantly reduce the number of paths that you can test before you have to ship the product, stop your nightly test, or give up because you're sick of waiting on an answer to come back. So these two negatives actually combine to make dynamic analyses a lot weaker than we would like them to be. So when I say that the runtime overheads are large, what I mean by that -- this is just an example of a collection of software analyses. You have things like data race detection, where Intel has a product, and Sun and Google have products, and those range anywhere from 2x to 300x overhead, depending on the amount of sharing that goes on in the program, the particular algorithms, the binary instrumentation tool, et cetera. You have things like taint analyses, which are mostly in the literature right now rather than in commercial tools, but you have things like TaintCheck, which was a Valgrind tool that did taint analysis, a type of security analysis. That's anywhere from 2 to 200 times slower than the original program. Things like Memcheck, which is probably one of the most popular dynamic analysis tools out there -- it's the biggest Valgrind tool that people use -- that's still anywhere from five to 50 times slower. So that limits the number of times you can check to see if you might have a memory leak or an uninitialized value being used. Then things like bounds checking and symbolic execution can also be pretty nasty. So the point here is, if you're running 200 times slower than your original program, and it takes a day to boot up Windows instead of a few minutes, you're never going to run these tests over all the things that you want to test, because eventually you have to ship the product. So the goal of this talk is to find ways to accelerate these types of dynamic analyses. I'm going to talk about a little bit of background information before I go into my two methods of doing this, and that background information is demand-driven dynamic data flow analysis. I'm going to talk about what dynamic data flow analyses are and what I mean by demand-driven. Suffice it to say we're going to use this background information to inform the decisions in my proposals. So when I say dynamic data flow analysis, I really mean an analysis that associates metadata with the original values in the program. So, for instance, if you have some variable, you might have a piece of metadata that says whether you trust that variable or not.
And as you run the program, it forms some dynamic data flow. Meanwhile, you also propagate and clear this metadata to create a shadow data flow that goes along with the program. So if I didn't trust a source value, I shouldn't trust the destination value. You propagate that untrustedness alongside the original data flow of the program. And eventually you check this metadata at certain points in the program to see if there are errors in your software -- with this untrusted data, perhaps if I jump to an untrusted location, or if I dereference a pointer that's untrusted, that might be a bad thing, and I raise some error. As I mentioned, what this does is form a shadow data flow that looks very similar to the original data flow in the program, and that's why we talk about dynamic data flow analyses. So as an example, earlier I was talking about taint analysis. In this taint analysis example, anything that comes from outside the program, anything that comes from input, is untrusted. It has some metadata associated with it that says that I don't trust it. And so as I read a value from outside of the program, I put that value into X, and I have some meta value associated with X that says I don't trust X. Then as I use X as the source for further operations, I propagate that tainted value alongside the original data flow in the program. So because Y is based on X, I also don't trust Y, and you can continue to do that for multiple other operations. Now, it's also possible to clear this. Maybe I have some type of validation operation, because even though X came from the outside world, I validated it through some software method and now I trust it, and so I should be able to do dangerous things with it. So we can also clear these tainted values. And then if you use X after it's trusted as a source operand, you don't propagate metadata alongside it, because now W is also trusted. And eventually you can check these values. So I want to dereference W. I check it. That's fine. I trust it. However, if I want to jump to Z or A, I check them and they are untrusted. That's a bad thing. I raise some error. So this is the idea of dynamic data flow analysis in this talk. Again, the problem with this was that these analyses were extremely slow -- 200 times slower than the original program, which means no user's ever going to run these to find a buffer overflow on their system. Instead they'll reboot the thing, and it will be done before this program ever comes back. One way you can accelerate this is to note that not all of the values need to go through all of the slow dynamic operations. If I'm working entirely on trusted data -- on data that I trust, I want to be very specific with that -- then I don't need to spend any time in instrumentation code doing propagation rules, right? Because if there's nothing to propagate, why would I spend time in software calculating that I'm not doing anything? So in that case you can take two versions of your application: your native application, which doesn't have any of this large instrumented code to do the analysis, and that runs at full speed -- that's your original program -- and next to it you can have some instrumented version that does all the propagation, clearing and checking. And what you'd like to do is only have to turn on that slow version of the application whenever you're touching metadata, whether it's a destination or a source.
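To make that concrete, the instrumented slow path is essentially doing shadow bookkeeping like this -- a minimal, byte-granularity sketch with invented names, not any particular tool's implementation:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One shadow byte per tracked address: 1 = tainted, 0 = trusted. */
    static uint8_t shadow[1u << 20];
    #define SHADOW(p) shadow[(uintptr_t)(p) & ((1u << 20) - 1)]

    void taint(const void *p)   { SHADOW(p) = 1; }  /* e.g., bytes read from the network  */
    void untaint(const void *p) { SHADOW(p) = 0; }  /* an explicit or implicit validation */

    /* Propagation rule: the destination inherits its source's metadata. */
    void propagate(const void *dst, const void *src) { SHADOW(dst) = SHADOW(src); }

    /* Check rule: fail if we are about to use tainted data somewhere dangerous. */
    void check(const void *p, const char *what)
    {
        if (SHADOW(p)) {
            fprintf(stderr, "taint violation: untrusted %s\n", what);
            abort();
        }
    }

The expensive part is that, done naively, every instruction in the program has to call into propagate or check, whether or not any metadata is involved.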
But if you're not -- if you're doing a trusted value plus a trusted value goes into a trusted value, say -- there's no reason to do anything different than your native program. So what I've added down here at the bottom, besides these two programs, is some type of tool, some utility, that tells us when we're touching shadowed values, touching metadata. I'm going to leave what this is a little nebulous for right now, but suffice it to say it will immediately tell us whenever we touch a tainted value. So we run some instruction from the program and this metadata detection system says this is not shadowed data -- everything in this instruction is completely trusted. So we can immediately go back and update program state. There's no runtime overhead with that, as long as this thing doesn't have any overhead. The next instruction can go through here, and maybe it is touching shadowed data. It's touching something that's tainted, that we don't trust. In that case we need to flip the switch, turn on this instrumented version of our code, and spend some time doing this analysis. We go back to the slow runtime overhead before we eventually also go back and update program state. And we can then continue to send instructions through this for some time, until we see that we're not operating on metadata, in which case we can flip the switch again and go back to running full speed on native code. So the benefit here is that if very little time is spent touching tainted data, then very little time is spent in the slow version of our application. And this was originally described by Alex Ho, Steve Hand, Andrew Warfield, et cetera -- the Xen guys -- at EuroSys 2006. The next question is, how do we find that metadata? How do we build that thing that was at the bottom of that last image? You'd like it to have no additional overhead, right? We'd like to say there's zero runtime overhead until we start touching tainted data. To me that implies you need some type of hardware support to do those checks for you. So what that hardware should do is cause some fault or some interrupt whenever you touch shadowed values. The simple solution to that, and this is what they used, is virtual memory watchpoints. Here I show three pages as an example, and virtual memory watchpoints work like this. Regular virtual memory lets you access a value and does a virtual-to-physical translation for you. The TLB gives you an answer back; if it hits, there's no runtime overhead. And that happens on any page that is mapped in your virtual memory space. However, if a page has tainted values in it -- if one of the values in this page is shadowed in some way -- what you can do is mark the entire page as unavailable in the virtual memory system. The translation still exists, but what happens is, when you touch something on this page, the hardware will cause a page fault, because it thinks that that translation is invalid. And then the operating system can signal the software that now is when we should turn on that analysis tool, because now is when we're touching something that's tainted. Meanwhile, any other page in the system still does the really fast virtual-to-physical translation. So this allows you to have the hardware tell you whenever you're touching tainted data. Now, one thing you might note here is that it also gives you a fault whenever you touch things around tainted data.
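For the mechanics, here is a minimal sketch of that page-protection trick on Linux, assuming a hypothetical enter_instrumented_mode slow path; a real system would also re-protect the page after the access, which I omit here:

    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    extern void enter_instrumented_mode(void *addr);  /* hypothetical slow path */

    static long page_size;
    #define PAGE_OF(a) ((void *)((uintptr_t)(a) & ~(uintptr_t)(page_size - 1)))

    /* Revoke access to any page that holds shadowed (tainted) bytes. */
    void watch_page(void *addr)
    {
        mprotect(PAGE_OF(addr), page_size, PROT_NONE);
    }

    /* Fault on a watched page: restore access, then switch to the slow version. */
    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        mprotect(PAGE_OF(si->si_addr), page_size, PROT_READ | PROT_WRITE);
        enter_instrumented_mode(si->si_addr);
    }

    void install_watchpoints(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        page_size = sysconf(_SC_PAGESIZE);
    }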
That over-faulting on nearby data is the granularity gap problem, and I have some other work that addresses something like that at this upcoming ASPLOS -- I recommend everybody read that paper. These are the results they showed. The nice thing they show here is that with lmbench -- just tiny microbenchmarks that don't have tainted data -- they were able to reduce the system from over 100x to just under 2x overhead, because the vast majority of the time they're not within this slow taint analysis tool and so there's no overhead at all. Unfortunately, the bottom part here is where things kind of break down. If everything that you're touching is tainted -- in this case, network throughput applications, because anything that came in over the network was untrusted -- if everything you touch is tainted, you never turn the system off. So you're back to your 100-or-more-times overhead. It doesn't give you any help at all, which means that even if you shipped this to users and most of the time it was fast, some of your users are still going to be really mad, because they're running some input set that really breaks the system. So that's where my stuff comes in. I have two things I want to talk about here. The first is demand-driven data race detection, where I'm going to do something similar to that demand-driven data flow analysis, except I'm going to do it for a different type of analysis -- one that virtual memory watchpoints don't work for. And then I'm going to talk about sampling as a way to fix that last problem I mentioned, where we cap maximum overheads, basically keep your maximum overhead under some user-defined threshold, and just throw away some of our answers and get it wrong sometimes. But I'll get into that in a second. So when I talk about software data race detection, I want to make sure we're all on the same page here. What this does is add checks around every memory access in the program, and what these checks look for are two things. First, it looks to see if any of these memory accesses are causing inter-thread data sharing, and if so, it sees if they're appropriately synchronized in some way. There are multiple ways -- happens-before detection, lockset detection, et cetera -- but I'm going to leave that a little nebulous for now and just say, if there's no synchronization between these two inter-thread sharing accesses, then there's a data race. Let me give you an example of how this would work. This is the same example I showed earlier, where the OpenSSL security flaw exists, except this is a different dynamic ordering. In this case, there's no obvious error in the program when you run with this dynamic ordering, because the first thread allocates its buffer and copies data into it, and the second thread sees that the buffer already exists and doesn't do anything. There's no buffer overflow in this dynamic ordering. So what the software data race detector does is check on this access and say there's no inter-thread sharing -- this is not the second access in an inter-thread sharing pair, so there's no data race. In fact, it does that for every other instruction on this thread as well. However, when it gets to the first instruction of thread two, it asks those questions again. It says, first, is the value that this instruction is accessing shared? Is it write-shared between threads? Is there a write and a read, or a write and a write, et cetera? In fact, in this ordering there is.
The pointer is written by thread one and is read by thread two with its instruction. So that's when we get to the second question. If there is this inter-thread sharing, this movement of data between threads, is there some type of intervening synchronization? I didn't add one in this example. So in fact there is a data race here. And that's pretty powerful. There was nothing obviously wrong with this program, yet the data race detector was still able to tell us there was a problem we needed to fix. Great -- I really like data race detectors. They're fun and pretty powerful. The problem is, as I mentioned before, they're really slow. So this is a collection of multithreaded benchmarks from the Phoenix suite and the PARSEC suite, and the Y axis here is the slowdown of each of these benchmarks over running them without the data race detector. This is Intel Inspector XE, a commercial data race detector you can go out and buy right now. These are actually pretty decent numbers -- if you look at the Helgrind tool, I believe it's maybe 1.5 to 2 times worse than this, depending on specific settings. But the point to take home here is the dashed line, which is purely illustrative: you're about 75 to 80 times slower on average running this data race detector than running your program without any detection at all. So while you can find errors, you have to spend a lot of time trying to do it. I'd like to find some way to speed these up. And one of the things that you should note about data race detection, and data races in general, is that inter-thread sharing is what's really important with these things. Netzer and Miller kind of formalized data races in a 1992 paper. What they said was that data races are failures in programs that access and update shared data, and that's the important part here. So these were the five times you ran the data race detector on this example before. One of these instructions is working entirely on thread-local data: you're copying this thread-local mylen into len1, which is a thread-local variable. So there was absolutely no reason to run the data race detector there. There could never be a data race on that. There's no sharing. On a slightly more advanced point, there are also instructions that access variables that are sometimes shared, but in this dynamic ordering they are not participating in sharing -- they're not the second instruction in a sharing event. So the data race detector will also never declare a data race on those instructions in this ordering. We didn't need to run it there. That was wasted work. So what you can see here is that of those five times we ran the data race detector before, we only really needed to do two of them -- depending on what the first check of the pointer was, but I'll leave that there. So 40 percent of our work was useful; 60 percent was useless. In fact, it's actually much worse in real programs. This is the same set of benchmarks I showed before, and the Y axis here is the percent of dynamic memory operations that are participating in a write-sharing event. You'll see this only goes up to 3 percent, and that's for one benchmark. Everything over here in the Phoenix suite -- these are basically data-parallel benchmarks -- has very little data sharing, maybe 300 operations out of a few billion that are participating in sharing. So the vast majority of the work that we're doing in our data race detector is completely useless.
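In software, the filter you would want in front of the detector is something as cheap as this sketch: track who last wrote each (hashed) address, and only pay for the full detection algorithm when the current access touches data last written by a different thread. The table size, the hash, and run_full_race_detector are placeholders, and the point of what follows is to get the hardware to answer this "did another thread write this?" question for free rather than doing it in software:

    #include <stdint.h>

    #define TABLE_SIZE 4096
    static int last_writer[TABLE_SIZE];   /* thread id of last writer, -1 = none */
                                          /* (left unsynchronized for brevity)   */

    extern void run_full_race_detector(int tid, uintptr_t addr, int is_write);

    void init_filter(void)
    {
        for (int i = 0; i < TABLE_SIZE; i++)
            last_writer[i] = -1;
    }

    void on_memory_access(int tid, uintptr_t addr, int is_write)
    {
        int idx = (int)((addr >> 3) % TABLE_SIZE);

        /* Only the second access of an inter-thread sharing pair needs the
         * expensive happens-before / lockset machinery. */
        if (last_writer[idx] != -1 && last_writer[idx] != tid)
            run_full_race_detector(tid, addr, is_write);

        if (is_write)
            last_writer[idx] = tid;
    }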
And, in fact, even in dedup, where you have 3 percent of your dynamic operations participating, 97 percent of the time we're still doing work without getting anything useful out of it. So what that leads me to say is that we should use demand-driven analysis, where we turn off this tool whenever we don't have to be doing anything. Rather than keying on metadata, however, what we want to do is look for inter-thread sharing. So that's where this inter-thread sharing monitor at the bottom comes from, rather than some metadata checking utility. In this case, much like before, you send some instruction through it, and it tells you whether this access is local or participating in sharing. If it's a local access, great, update program state. Frag dudes in Quake 3, no problem at all. Your next instruction comes down, and if it is participating in sharing, then that's when we need to run the data race detector. Flip the switch, spend time grinding on this instruction, decide whether there's a problem here, and eventually update program state. And then the next instruction can just go through this sharing monitor again. Most of the time -- 97 percent or more of the time -- you're just going to update state immediately. You won't have to spend time in the tool. The question then is, like before, how do we build that utility at the bottom that tells us whenever we're doing this sharing? We could try virtual memory watchpoints, just like we did for taint analysis. One way you could do this -- for instance, Emery Berger does stuff like this for his deterministic execution engines -- is to mark everything in memory as watched, in both threads, and as you touch the data that your thread is working on, you take faults on it. What that fault handler will do is remove the watchpoint from that value, and eventually you'll carve out a local working set. So anything in that page right there that no longer has watchpoints on it is data I touched last. If I touch it again, it's free -- there's no sharing going on, and that takes no time. However, if another thread touches that same address, it will still take a fault on it, and that is indicative of inter-thread sharing, because it was owned by thread one and now it's being touched by thread two. So, great, we can find inter-thread sharing that way. The problem is that this system causes about 100 percent of accesses to take page faults, which significantly reduces your performance. There are multiple reasons for that. The first is the granularity gap. Let's say that that system works the way I just said. Then after that inter-thread sharing event was caught, now all accesses are again watched, because there's one byte on page one that is watched in thread one and one byte that's unwatched in thread two. Well, you can't change the granularity of the page table system so that it's one byte. So now everything's watched again, and you take a bunch of faults on data that you still own. But worst of all, page tables are per process, not per thread. So even if thread one has nothing watched on that page, if thread two is using the same page table, the same virtual memory space, then it's still watched, because some of those bytes are watched in a different thread. So what that means is that basically everything that you access in this program causes a page fault.
So virtual memory faults don't cut it for finding inter-thread sharing, unless you play some tricks and do, say, data race detection at the page granularity rather than at the access granularity. But anyway, what I'd like to say is that there are better ways to do sharing detection in hardware, and what I'm going to talk about is hardware performance counters. Let me give you a little bit of background on those, and let me tell you how we can use them. Hardware performance counters work like this. You have a pipeline, a cache, and some performance counters that sit next to all of these. What they're normally used for is to watch events in the processor and count them, so that you can see where the slowdowns in your program are coming from. I might have had 500 branch mispredictions and a thousand cache misses, so those are things I need to try to fix in my program. As events happen in the pipeline, these counters increment. Take two events in the pipeline; the counter now says two. And there's no overhead with that -- this is done in hardware for free. Similarly, something happens in the cache -- you have a miss, you count it in the performance counter. So if we can find some event that corresponds to this inter-thread sharing, and we can count it, great. Now we know how many happened. But what we'd like is to still have the hardware tell us about it, right? Well, you can do that with performance counters by setting the counter to, say, negative one. And when that event occurs -- whatever it will be, and I'll get to that in a second -- it's going to cause that counter to roll over to zero, in which case you can take some interrupt. Now, one of the downsides is that this is not a precise event, and so we have to add a little bit more complexity here, because otherwise you take that fault on an instruction that might not even have a data access in it. That's where we add even more complexity. Intel processors in particular have this thing called PEBS, precise event-based sampling. It works like this. When you roll that counter over to zero, it arms a piece of the PEBS hardware associated with that performance counter. And then the next time that event happens, you get a precise record of exactly which instruction did it and the register values at that time, and you take a fault as soon as that instruction commits. You're now ready to send a signal to the data race detector that it needs to turn on. Now, some people out there might note that that means we've missed a couple of events before we turn the data race detector on -- I'll get to that in a second. Great, so we can have the hardware tell us when events happen. The question now is, is there an event that we can actually use to find sharing between threads? And that's where another little Intel thing comes in. That's this event that I'm going to shorten to HITM. If I try to remember the full name, it's something like L2 cache HITM, other core. What that means is that there is write-to-read data sharing in the program. Let me give you an example. You have two cores in this chip, and each of those cores has some local cache associated with it. When you write into the first cache, it sets that cache line to the Modified state, rather than Invalid or Shared or Exclusive. And then when another core reads that value -- when core two reads that value, it needs the cache line from core one -- that movement of data is a HITM event.
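On Linux, one way to arrange for that interrupt is perf_event_open with a sample period of one, which is effectively starting the counter at negative one. This is a rough sketch, not the tool's actual setup code, and the raw event code below is a placeholder since the exact HITM encoding differs across microarchitectures:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/perf_event.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a sampled counter that delivers SIGIO to us on every overflow. */
    int open_sharing_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x0;           /* placeholder: HITM-style raw event code */
        attr.sample_period = 1;      /* overflow on every occurrence           */
        attr.precise_ip = 2;         /* ask for PEBS-style precise samples     */
        attr.wakeup_events = 1;
        attr.disabled = 1;

        int fd = syscall(__NR_perf_event_open, &attr, 0 /* this thread */,
                         -1 /* any cpu */, -1 /* no group */, 0);
        if (fd < 0)
            return -1;

        fcntl(fd, F_SETFL, O_ASYNC);      /* async notification on overflow */
        fcntl(fd, F_SETSIG, SIGIO);       /* ...delivered as SIGIO          */
        fcntl(fd, F_SETOWN, getpid());

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        return fd;
    }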
The event is called HITM because it means that you hit in a cache, but the line was in the Modified state somewhere else. The reason you normally count these is that it's a relatively slow proposition: when you're sharing data between threads, you have to move a lot of data around between the caches, and that's slow. But for our purposes, it means that we can find out when there is write-to-read, or read-after-write, data sharing. Now, if we wanted to use this to turn on our data race detector, there are, of course, as always, some difficulties. There are limitations of the performance counter. First of all, it only finds write-to-read data sharing. It does not find write-to-write sharing, because an RFO -- a read for ownership -- that hits a modified line in another cache does not cause this event to increment. It also does not find read-to-write sharing. Even though that's a different dynamic ordering, it doesn't see it; that would require a different performance event entirely. So we can only find one-third of the events we'd like to see. Similarly, things the hardware prefetcher does are not counted -- this only counts instructions that commit and cause a HITM event. So we very well may miss a number of events that we'd like to turn this system on for. And even if it did work perfectly, because we're counting cache events, we might still miss some things. For instance, if you have two threads that share the same L1 cache, they can share data all day long and we would never see an event. If you write a value into the cache and it eventually gets evicted and you read it later, there's no cache event -- it's in main memory. Then there are other things like false sharing, et cetera. So suffice it to say, this is not a perfect way to do this. So what I'm going to go through here is an algorithm that lets you try to do this demand-driven data race detection in a best-effort way. The hypothesis here is that when you see a sharing event, you're in a region of code where there's more sharing going on, because a good parallel programmer will try not to be sharing data all the time. You'd like to load in some shared data when you have to and then work thread-locally as much as possible, because that will significantly reduce your overheads, for cache sharing reasons, et cetera. So what we'll do here is start by executing some instruction. We'll execute the actual instruction on the real hardware, but what the analysis system will do is check to see if it's supposed to be turned on or not. If the analysis system is enabled, then you just run everything through the software data race detector. It's slow, but you may be looking for errors. And in the software data race detector you can precisely keep track of whether you're sharing data between threads. In fact, the tool already does that -- that tool with the 75-to-80x overhead I showed you before is already checking this stuff, so it doesn't have to run the full algorithm if it doesn't want to. And if you have been sharing recently, great, you just go ahead and execute the next instruction. This circle right here is what the tool already does. This is where we sit right now, when we have the slow tool that's always on. However, if you have not been sharing data between threads recently -- where I'm going to leave "recently" as kind of a nebulous concept --
but say within the last few thousand instructions -- then you can disable your tool entirely, because you're probably in some region of code where there's no sharing going on. In that case, when you get back to the center diamond, all you do is wait for the hardware to tell you if there is a sharing event going on. If there is, great, enable your tool, turn on your software data race detector, and go back to your life in the slow lane. But what we'd like to see is that in 97 percent or more of the cases, you just go ahead and execute the next instruction, because the hardware doesn't interrupt you and you continue on at full speed, basically the same speed your program would originally run. Again, what we'd like to see is that 97 percent of the time or more -- the vast majority of accesses out of billions -- is up in this corner here where there's no slowdown, and maybe only 3 percent or less is down here where we exist now, in the slow lane. And so I built this system. I added it on top of a real system -- there's no simulation going on here. Real hardware, commercial data race detector, et cetera. And what we see here is the speedup you can get over the tool that's on all the time. So this is a tool that does the algorithm I just showed you, and the Y axis is the number of times faster it is to turn this off whenever we don't need to be on. For data-parallel benchmarks like Phoenix, we can see almost a 10 times performance improvement on average, and in fact in some extreme corner cases like matrix multiply we're about 51 times faster. PARSEC has more data sharing going on. It's not as easy to turn the detector off all the time; it's not data-parallel in most cases. So you see a somewhat reduced performance gain, where it's about three times faster, which is in my opinion still respectable. In fact, in freqmine here you're about 13 times faster, because much like Phoenix it's a data-parallel benchmark -- it's an OpenMP benchmark with very little data sharing going on. Great, the tool's faster. The next question is, does it still find errors? Because if it's infinitely fast and doesn't give us any useful answers, it's a completely useless tool. There's not a great way to display this data -- I didn't want to bring up a giant table of all of it -- but these bubbles are all of the data races that the regular, always-on tool can find in any of these benchmarks, and that's the number on the right. So, for instance, the always-on tool that you can go out and buy in the store right now finds one data race in kmeans. In facesim it finds four, et cetera. The number on the left is the number that my demand-driven tool finds. And you'll see right up here it only finds two of the four static data races in facesim. The reason for missing those two is one of the reasons I mentioned earlier: the write and the read are so far apart that the write is no longer in the cache when the read happens. So there's no event to see there. It misses it. Sorry. But one thing I do like to brag about is the green bars. Anything highlighted in green here is a data race that was actually non-benign -- and I know benign is a bad word when you start talking about data races, but I mean not ad hoc synchronization variables, et cetera. These were races we reported to the PARSEC developers, and they will be fixed -- my patches will fix them -- in the next version of PARSEC.
And my favorite story about that is in fact freqmine. Because it's 13 times faster, I was able to run the benchmark the first time, see that there was a data race, recompile it with debug symbols on, run it again, try to hunt down exactly where this was, and run it a few more times to find out exactly what the problem was, all before the tool ever came back to give me the runtimes for the full, always-on data race detector. 13 times faster is quite a bit faster. I also ran this on some benchmarks that I don't list here. This was the RADBench suite -- the race, atomicity-violation and deadlock benchmark suite -- which is a collection of nasty concurrency bugs, and this was over all the ones that had data races in them. Overall, of the static data races in any of these programs, this demand-driven tool is able to find 97 percent of the ones the always-on tool is able to find. The 3 percent comes from those two that I missed in facesim. Question? >>: Does this show that all these races are write-before-read, and there are no other kinds of races? >> Joseph Greathouse: The question, because I don't think there's a microphone out there, is does this mean that all the races are write-before-read data races? No. In fact, let me see if I can flip ahead to that table. There we go. This is the actual table of all of that junk, where the rows here are the type of data race and the columns are the particular programs. What this shows is that this still finds all kinds of data races, and it kind of supports our hypothesis that even, say, read-to-write or write-to-write data races happen near other write-to-read accesses. These sharing events all happen in basically the same region of code. So you can use this one event to turn on the race detector at the right time, leave it on for some amount of time, and you'll catch all those data races too -- except in facesim, where those were the only two accesses, and even though they were the right kind of accesses, they were so far apart that we still missed them. >>: But there's a threshold, which basically means that when you turn it off it's going to determine -- it's a trade-off between this stuff and -- >> Joseph Greathouse: You're right. There is a performance versus accuracy trade-off. If the first time we see a sharing event happen we leave the detector on for the rest of the program, we'll probably find more stuff. I still can't guarantee you'll find all of it -- the data race might be the one at the very beginning and there are no more data races afterwards. So all I'll say is that I didn't look into that deeply. I understand that's true and I absolutely agree; this was done on a pretty tight schedule. What I did was make the amount of time that I had the system turned on after a sharing event kind of intrinsic to the tool itself. It just so happened that after something like 2,000 accesses -- 2,000 instructions, in that range -- it was easy to turn the tool off. But, yeah, I found it only needed to be a few thousand to get these kinds of accuracy numbers. I found that if you made it really long, if you left it at a million, it was almost always on, because for a lot of these programs shared memory accesses are relatively frequent -- maybe every 500,000 instructions one would turn the system on, so in that case it might never turn off. It is definitely a knob; if you were going to do this in a real product, you would want to tweak that number rather than do exactly what I did here.
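To make that policy concrete, the enable/disable loop with that window knob boils down to roughly this per-thread state machine; the 2,000-instruction window, the function names, and the idea of hooking every memory access are placeholders for illustration, not the tool's actual interface:

    #include <stdbool.h>
    #include <stdint.h>

    #define SHARING_WINDOW 2000   /* "recently" = within this many accesses */

    static __thread bool detector_on = false;
    static __thread uint64_t insns_since_sharing = 0;

    /* The wrapped detector; returns true if this access participated in sharing. */
    extern bool race_detector_step(uintptr_t addr, int is_write);

    /* Called from the overflow interrupt / signal for the HITM-style event. */
    void on_sharing_interrupt(void)
    {
        detector_on = true;
        insns_since_sharing = 0;
    }

    /* Called per instrumented memory access while the detector is on. */
    void on_instruction(uintptr_t addr, int is_write)
    {
        if (!detector_on)
            return;                          /* full speed: the hardware is watching */

        if (race_detector_step(addr, is_write))
            insns_since_sharing = 0;         /* still in a sharing-heavy region      */
        else if (++insns_since_sharing > SHARING_WINDOW)
            detector_on = false;             /* go back to waiting on the hardware   */
    }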
So I think I might have spent a lot of time talking about that, and I'm sorry, because I haven't gotten to fixing the other problem here. Which is: great, that's a demand-driven tool, but what if it's always on? What if I'm sharing data all the time? What if my demand-driven data flow analysis tool is always touching tainted data? Well, in that case what I'd like to do is use sampling to reduce the overhead, and let the user say how much overhead they want to see, or let the system administrator say, my users are okay with 10 percent overhead. So what we see here is a graph. The left side is where we have no analysis, and there's no runtime overhead. The right side is where our analysis tool finds every error it can find, but the overhead is really nasty. And those are really the only two points that exist in a taint analysis system right now. What we'd like to do is fill in the middle, where you can change your overhead to whatever you want, and what you give up is accuracy. So I might say, I'm okay with 10 percent overhead, and I can find 10 percent of every error that happens. That would be ideal, because what that gives you is an accuracy-versus-speed knob. And it also allows you to send this out to more people. So, for instance, developers right now sit at the always-on, always-find-every-error end. They turn on Valgrind whenever they want to find memory leaks, and they get whatever it finds. But in this column, most users sit down here at zero -- in fact, the vast majority sit at zero. They don't find any errors until the program crashes. But if you have this knob that allows you to trade off performance versus accuracy, maybe your beta testers can sit at 20 percent overhead and find a bunch of errors for you. Maybe your gigantic base of end users can sit at 1 percent overhead, and, sure, they only find maybe 1 percent, or one out of every thousand, of the errors that happen. But because you have so many users testing at so little overhead that they don't notice it, in aggregate you can find a lot more errors, because they test a lot more inputs than you could possibly think to try, and there are just more of them. Even if they're all running the same input, you'll see an error more often because there are a lot of them. So what we'd like to do is that type of sampling, where you have that knob. Unfortunately, you can't just naively sample a data flow analysis. And by naive, I mean the classic way of doing sampling, which is to maybe turn on your analysis every tenth instruction, so you have 10 percent overhead that way, give or take. That works for some types of analysis. It works for sampling performance counters, for instance. But if you do that for a data flow analysis, things fall apart. So this gigantic switch here is going to tell us when we're performing propagation or assignment or checks in this program. We're on right now, because we're not over some overhead threshold, and we do assignments and propagations like I showed before in the example quite a long time ago. However, then we go over our overhead budget. I don't want to spend any more time analyzing this stuff. So we flip the switch and turn it off, and now when the next instruction reads an untrusted value -- Y is untrusted and we're using it as the source for this -- we skip the propagation instructions, because that's overhead that we can't deal with right now. So now Z is untainted; we trust it implicitly. However, similarly, this validation operation also gets skipped.
So when I show "validate X" up here, the first thing you might think is, well, just don't skip those. But validations are not easy to find. Any movement of data in the program can be an implicit validation. Maybe the source of a move instruction, of a copy instruction, is trusted, and I copy it over some untrusted value, so now that value is trusted again. So if we skip the instruction that does that validation for us, then later, when we turn the system back on because we're under the overhead threshold, we don't trust values that we should. So W comes from X, and we never decided to trust X again, so we don't trust W now. And that means that when we perform the checks, we get very different answers than what we saw before. Sure, we get false negatives on Z -- that's implicit in any sampling system; you're going to miss some errors. But the bad thing here is that we now get false positives. W is supposed to be trusted here, and we say there's an error there. So now we can't really trust any answer that this system gives us. That's probably a bad way to ingratiate yourself with your developers: you give them a giant bunch of errors and say, some of these are right, some of these are wrong, have at it. So what we'd like to do, instead of sampling code, is sample data. What that means is that the sampling system needs to be aware of the shadow data flow. Instead of being on all the time and having the metadata flow look like this, what we'd like to do is, over multiple users, look at subsets of that data flow. And any time you want to turn this system off, rather than just skipping instructions, what you should instead do is remove the metadata from the data flow and stop propagating it. That should help you prevent these false positives while still reducing your total overhead. So as an example of that -- again, big switch, same example I've done a few times -- the system is on, we do the initial propagations, we go over our overhead budget, the system turns off, and you skip the data flow of this propagation. Now, this looks very similar to the instruction-sampling case, right? Sure. But you also remove the metadata from X itself, rather than just skipping the instruction. And so when you turn the system back on, W is now trusted, as it should be. So, yes, you do get false negatives -- any sampling system is going to have that -- but you no longer have this false-positive problem. Of course, the question is: how do you skip the data flow for this validation? I just said that finding these validation operations is not easy; I can't just say, of course, turn the system on for validations. So instead, what we can use is the demand-driven system I mentioned earlier to remove data flows whenever the system is too slow. So this is the same setup I've shown many times before, where your metadata detection, your virtual memory watchpoints, are down here at the bottom. And as you send instructions through and you update program state, eventually you hit some metadata and you turn your instrumentation on. And maybe this time your instrumentation is on for a really long time. Eventually you hit some overhead threshold. The user says, I don't want more than 5 percent overhead. I don't want this program to be too slow. It makes me mad. I'll close the program. I'll go install Linux or something.
So once that happens, you basically want to flip a coin -- you don't want to do it deterministically, because if you do it deterministically, you get the same answer every time, and that defeats the point of sampling. If you win or lose the coin flip, depending on what you want to call it, you clear the metadata from that page. You mark everything on that entire page as implicitly trusted, and now whenever you run operations that touch that page, they work in the native application. Or, if they are still touching metadata elsewhere, you would eventually clear some of that too. So what that does is remove metadata from the system and let you go back to operating at full speed, and it does sampling in that manner. And, again, you continue to frag dudes in Quake 3. So I built a system that did that. This is kind of a complex prototype, so let me try to explain it. The way this demand-driven taint analysis worked is, you have multiple virtual machines running under a single hypervisor. The hypervisor does the page table analysis, the virtual memory watchpoints, and if a virtual machine touches tainted data, the entire thing is moved into QEMU, where QEMU does taint analysis at the instruction level, the x86 instruction level. Eventually you can move the entire virtual machine back to running on the real hardware. What I added here was this OHM, the overhead manager, which watches how long a virtual machine is in analysis versus running on real hardware. And if you cross some threshold, then it flips a coin for you and forcibly untaints things in the page table system. That then allows you, more often than not, to start running that virtual machine back on real hardware, and your overheads go down. So much like before, there are two types of benchmarks you can run on this. The first is, does this actually improve performance -- can we control the overheads? That's one axis of that accuracy-versus-performance knob. The other one is, do we still find errors, and if so, at what rate? For those two categories I want to talk about the benchmarks a little bit. Because anything that comes over the network is tainted, the worst applications that I showed before were network throughput benchmarks. SSH receive is a server constantly receiving data over an SSH tunnel and throwing it into /dev/null. So everything it does is working on decoding this encrypted packet that's entirely tainted. So the vast majority of your work is tainted and the whole system slows down -- 150 times slower. What we'd like to see is that as we turn the knob, we can accurately control that overhead. Then there are these real-world security exploits. These were me digging through some exploit database online and trying to get these to actually work. These are five random benchmarks that are network-based and that have either stack or heap overflows. So, well, first let me start off with the performance analysis. The X axis here is our overhead threshold -- the maximum amount of time we want to stay in analysis without turning the system off -- while the Y axis is total throughput, which is analogous to performance in a network throughput benchmark. The blue dashed line is when there's no analysis system at all: how fast this runs in the real world when you're not doing taint analysis.
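Putting the sampling pieces together, the overhead manager's decision looks roughly like this; everything here -- the accounting functions, the threshold, clear_page_metadata -- is a placeholder sketch of the idea, not the prototype's actual interface:

    #include <stdbool.h>
    #include <stdlib.h>

    extern double time_in_analysis(void);          /* seconds spent emulating       */
    extern double total_run_time(void);            /* wall-clock seconds so far     */
    extern void   clear_page_metadata(void *page); /* untaint + unprotect the page  */

    static double overhead_threshold = 0.10;       /* user's choice, e.g. 10 percent */

    /* Called when a watched page traps us into the slow path.  Returns true if
     * the metadata was shed and the access can run natively instead. */
    bool maybe_shed_metadata(void *faulting_page)
    {
        double overhead = time_in_analysis() / total_run_time();
        if (overhead <= overhead_threshold)
            return false;                   /* under budget: keep analyzing */

        /* Over budget: flip a coin, so different runs (and different users)
         * drop different pieces of the shadow data flow. */
        if (rand() & 1) {
            clear_page_metadata(faulting_page);
            return true;
        }
        return false;
    }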
And as you can see on that graph, as you go from the system always being on, where you're 150 times slower, as you turn this knob and reduce the amount of time that you're in analysis, you can basically linearly control the amount of overhead that you see in this system. So when you're down at a 10 percent overhead threshold, you're about 10 percent slower. So, great, that's nice. And in fact this worked for the other benchmarks that I don't show in this talk. So we can control performance. What about accuracy? Well, first things first. If all you're doing is receiving these bad packets that are going to exploit these programs, even at a 1 percent maximum overhead we were always able to find those errors. But that's not fair -- most servers are not that underutilized. What this benchmark does is send a torrent of data through SSH receive. All of that is benign; it causes no errors. However, at some point in that torrent of benign data, you send the one packet that exploits the benchmark program. The Y axis here, then, is the percent chance of finding that error from that one packet -- seeing the exploit and having the taint analysis system say, aha, I caught it, I'm going to report this to the developers, they have a problem. The X axis here is five different bins of total performance. When you're up at a 90 percent threshold, where your performance is still pretty bad -- a 90 percent slowdown, say -- you can still catch the error most of the time. But I think the interesting part of this graph is over here at 10 percent, where, like I mentioned before, your performance is only 10 percent below what you can get without doing any analysis at all. And still, for four of these five benchmarks, you find the error about 10 percent of the time, which means that we do look quite a bit like that bar I showed way back at the beginning. Of course, Apache is a little bit more difficult, because its data flows are very long, and so it's much more likely that you will cut off the data flow for performance reasons before you find the error. However, even in something that's pretty nasty, we're still able to find that error one out of a thousand times at only a 10 percent overhead. So, great, we have some type of sampling system to solve that. And I think that's the end of the talk -- I've hit the backup slides. So I can take any questions or arguments. Maybe somebody didn't like this. Maybe somebody loves it. >>: So how general are these results? You're doing this QEMU emulation and looking for particular exploits here. Can you do other kinds of analyses? I guess the question is -- and this is just for SSH receive. >> Joseph Greathouse: Sure. The question there, for the microphone, et cetera, is how general is this. This is a taint analysis system -- a very particular type of taint analysis system, in an emulator, for some very specific benchmarks. So actually, when you said it was just for SSH receive, I think that any type of benign application would work pretty well here. We actually found out that SSH was really nasty -- its performance was really, really bad. For other throughput benchmarks that didn't have as bad a baseline performance, these numbers go up: the system is on less, so you have a higher chance that it's on when the error happens. >>: Right. >> Joseph Greathouse: Now, one of the last parts of my dissertation is to actually look at this for dynamic bounds checking. I think it will work, but I can't give you any quantitative data.
But I would wager that, on the whole, for applications that do this type of dynamic data flow thing, you'll do pretty well, unless your dynamic data flows are really, really long, because extremely long data flows do poorly with a system like this -- you have to have the system on the entire time, and if you turn the system off anywhere in the middle, you don't see anything below that point. One thing I will mention is that this is kind of a bad case, because the taint analysis system here is extremely inefficient. As I showed way back at the beginning, they found it was 100 times slower; it's 150 times slower when it's on all the time. There are tools out there -- for instance, Minemu is the big one right now, released maybe six months ago -- where they showed that for a very specifically designed dynamic analysis engine, they were able to get about a 5x overhead for taint analysis. And if your baseline overhead is much lower, then you'll turn the system off less, and then again your accuracy will go up. What I can't guarantee you is how representative these benchmarks are. I think that these are representative of the type of errors that you see in the real world -- I got them from real-world exploit mailing lists -- but this might not stop Stuxnet. That might have been a really, really nasty error. I don't know. Such is the life of security: when you're not doing purely formal analyses, you kind of have to go with what you have. >>: Could you explain again how you got from the SSH receive numbers to kind of projecting them onto the other five benchmarks? >> Joseph Greathouse: So do you mean -- the question was how did I get from the SSH receive numbers to projecting them onto the other five. You mean these numbers? This blue bar is not real data. It's just, do we get some type of linear ability to reduce overhead and still find errors. But if I take that off, does that make a little more sense? Or, I guess, what do you mean by that question? >>: How did you measure or calculate this percent chance of detecting the exploit? >> Joseph Greathouse: You'll see here, the way we did it was doing these tests a large number of times. Turning the system on, waiting until you go outside of the overhead window that we set -- so basically turning it on for a while -- sending in a whole bunch of packets that are not going to cause an exploit, sending in the one exploit packet, and seeing if we caught it. Doing that a thousand times, or 5,000 times. So the error bars here are 95 percent confidence intervals on the mean of detecting it over the number of tests that we ran. >>: These are actual empirical results, not just statistical? >> Joseph Greathouse: Yes, this is on a real system. >>: What about SSHD running in the background? How does that relate? >> Joseph Greathouse: So SSHD running in the background relates insofar as these benign packets coming in aren't causing errors, but they are still tainted, so you still have to perform taint analysis on them in this demand-driven system. If there's nothing going on in the system and I send an exploit packet in, for every one of these benchmarks I always find it, even at a 1 percent overhead, because the system is off almost all the time -- it's a demand-driven analysis system. The one packet comes in and it exploits the system in a tenth of a second. So even at a 1 percent overhead threshold we still find it. So I thought that wasn't fair, right? I put that in the paper.
I couldn't even make a table for it, because it was just: yes, we win. But the idea here is that in a very bad situation, where your server is really, really busy and one exploit packet comes in, this is still the chance of finding that error in this really nasty system. Anyone else? Otherwise it sounds like I'm on to the backup slides. >>: You didn't do anything about hardware -- I mean, hardware support for this stuff, right? So clearly you could do better with hardware support, right? >> Joseph Greathouse: So, funny you should ask. The data flow sampling system originally came from a paper that we wrote -- actually, Dave was on that paper, too -- back at MICRO 2008, where we did it entirely in hardware. The way that we did sampling there was we had a cache on chip that held your metadata; if that overflowed, you randomly picked something out of it and threw it away, and in such a way you could do the subsetting of the data flows. But I also have the thought that one of these took virtual memory and one took performance counters. I think there's hardware that you could add that would work for everything. In fact, like I said, I have a paper coming up in less than two weeks at ASPLOS, a case for unlimited watchpoints. If you have particularly well designed hardware that allows you to have a virtually unlimited number of watchpoints -- because it stores them in memory and has a cache on chip for them -- you can use that to accelerate a whole lot of systems, not just taint analysis or data race detection, but deterministic execution, transactional memory, speculative program optimizations, et cetera. In other talks that I give to hardware companies, I try to make the case that they should pay attention to that, because it's not just one piece of hardware that accelerates one analysis; it's one piece of hardware that works for a lot of people. >>: How does that relate to lifeguards, I guess? >> Joseph Greathouse: So you could use it for that as well. It could relate to lifeguards -- I see what you're asking; you mean the guys at CMU? >>: Right. >> Joseph Greathouse: So I think if anyone built a system that allowed you to have any type of fine-grained memory protection, that would be great. You could use lifeguards, you could use Mondriaan memory protection, you could use MemTracker, a whole bunch of other systems -- in fact, I compare three or four of them in the paper itself. In particular, I think those systems are not designed in such a way as to be generic for lots of different analyses beyond data flow analyses. The lifeguard stuff and, for instance, MemTracker out of Georgia Tech are very optimized for taking a value, seeing its metadata and propagating it. If you're instead worried about setting a watchpoint on all of memory and carving out working sets, those work much less well, because most of them store watchpoints as bitmaps. If you want to watch a gigabyte of memory, you have to write out a gigabit of bits, rather than saying watch the first byte through to the end and breaking that range apart as needed. If somebody ended up putting fine-grained memory protection on a processor, you could probably figure out how to use it to do all this stuff. I wish someone would do something like that. People have been citing Mondriaan memory protection for a decade and it still hasn't been built. So my argument, then, is that if we can get enough software people to say we need this, then maybe they will build something like it -- rather than pitching it as something that only works for memory protection, it should work for everything.
>> Ben Zorn: All right. Let's thank the speaker. [applause]