>> Jim Larus: So it's my pleasure to welcome Tunji who is a graduate student of Todd Mowry at CMU. He's on the job market obviously. He's done some really interesting work over the past few years in both the architecture and the programming languages and the systems community, really brings a very broad, diverse background. And he's going to be talking about his PhD work today. So, welcome. >> Olatunji Ruwase: Thanks, Jim. Thanks for the introduction. Good morning, everyone. Thank you so much for coming for my talk. I'm just going to get straight into the talk. Jim has done a pretty good introduction. So I guess everyone here is very well familiar with the challenges that defects in software create for the computing systems that we use today, in particular when those bugs actually make their way onto the systems of end users. And so this has created a lot of interest in tools that can actually monitor a program while it's executing in order to identify bugs and mitigate their harmful effects. In our case we call these sorts of tools lifeguards. And in particular what's interesting about lifeguards is that they actually operate on unmodified binaries. So you can imagine deploying a lifeguard, for example, to block security attacks at least until the user is able to, you know, patch the software appropriately. And so lifeguards basically complement a lot of the efforts that developers make to prevent bugs from getting out into the wild. So for this reason there have been a lot of proposals for lifeguards for finding different kinds of bugs ranging from things like data races to memory faults or security vulnerabilities in programs. So a little more about how lifeguards work: I'm going to use TaintCheck, which is a security lifeguard, to illustrate how lifeguards work in general. So TaintCheck is a lifeguard that tries to prevent input-based security attacks. So in particular the sort of attack that would come in, perhaps, through a network packet as we have here in this piece of code and that attempts to take control of your program. And the way TaintCheck does this -- So let me just fill out the rest of this code, this vulnerable code. The way TaintCheck detects these sorts of attacks is basically by tagging input that comes from untrusted sources, like the network, as being tainted or untrusted, and then sort of tracking how tainted data propagates through the program as the program executes, such that when we get to a point where we're about to make a control flow operation based on a value that we got from the network then TaintCheck actually steps in and stops the execution and consequently stops the attack. What's interesting about lifeguards, or what's interesting about TaintCheck in this case, is that it actually maintains metadata about the registers and memory locations which enables it to track how the taint status of data propagates through the locations. So to deploy TaintCheck today or any other lifeguard for that matter, the standard approach is basically to instrument your binary. And there are a number of platforms or frameworks available that do dynamic binary instrumentation such as Valgrind or PIN and so many others. And essentially the way that works is if we just consider this piece of x86 code here, in particular this move from a memory location into a register, what would happen is this instruction will be instrumented with the corresponding TaintCheck rule for propagating the taint status of the memory location into that of the register.
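Conceptually, those propagation rules look something like the sketch below. This is just my own illustration in C with made-up names, using a flat shadow array for simplicity rather than the actual TaintCheck implementation (which, as I'll get to later, shadows memory with two-level tables):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TOY_MEM (1u << 20)
    static uint8_t mem_taint[TOY_MEM];   /* 1 = tainted, 0 = untainted */
    static uint8_t reg_taint[16];        /* one taint bit per register */

    /* Data read from an untrusted source (e.g. a network packet) gets tagged. */
    static void on_untrusted_input(uint32_t addr, size_t len) {
        memset(&mem_taint[addr], 1, len);
    }

    /* mov reg <- [addr]: the register inherits the location's taint. */
    static void on_load(int reg, uint32_t addr) {
        reg_taint[reg] = mem_taint[addr];
    }

    /* mov [addr] <- reg: the location inherits the register's taint. */
    static void on_store(uint32_t addr, int reg) {
        mem_taint[addr] = reg_taint[reg];
    }

    /* jmp/call *reg: never transfer control based on untrusted data. */
    static void on_indirect_branch(int reg) {
        if (reg_taint[reg]) {
            fprintf(stderr, "tainted jump target -- stopping execution\n");
            abort();
        }
    }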
And that applies for the other instructions there as well. We will propagate the taint status from the source to the destination in general. And what's great about doing this is that we are actually able to detect bugs early. In particular bug instructions are identified before they even execute and they can be squashed. But there is a downside to doing this, which is basically that our program becomes slower. Now I have sort of shown the TaintCheck code in terms of serial code, but if we sort of dive in and see exactly what's going on with each one of these rules, it requires a number of instructions basically to propagate taint. So essentially we are spending a number of instructions to analyze each instruction in the program. So if I inline all of the necessary TaintCheck code here, our code suddenly becomes very big and consequently very slow. And so if you then look at your program at a very high level, and let's say we're looking at a program execution where time goes from top to bottom, essentially what we see is that the instrumented program becomes a lot slower and a big chunk of that time is spent actually running lifeguard code, right, which makes everything slower. So this leads to the question -- Well, in terms of how this plays out, I mean here I'm listing some lifeguard tools that are out there that are deployed using binary instrumentation. And we see average slowdowns ranging from 10x to 30x. So sort of like an order of magnitude is lost basically by the fact that we're instrumenting our programs with this very sophisticated analysis [inaudible] lifeguards. And so the fact that we're coupling them together means our program has to run as slow as the lifeguard itself, which leads to the question, well, what if we could decouple the lifeguard from the program itself. And this is what I mean by that: imagine that we could run the lifeguard separately from the program. So what happens then? Well, the lifeguard itself needs some way to know what to check, so I'm assuming that we can stream an instruction trace of the program to the lifeguard for the lifeguard to perform its checking. And I'm assuming here that we could do this efficiently, as I will talk about later. So one immediate impact is the fact that performance improves. We live in a world where there are many cores, there are a lot of parallel computing resources available, so having the program and the lifeguard run on separate resources with little interference just makes them run faster basically. Right? And in fact we saw this in the initial research work that I did in this direction called log-based architecture, LBA. We saw a scenario where the average slowdown of TaintCheck dropped from 30x to 3.4x simply by running TaintCheck on a separate core from the program. Another benefit of the decoupling, which is what I spent most of my time working on, is that the lifeguard actually becomes easier to optimize. The lifeguard code can be optimized. And in particular -- And I'll talk a lot more about this later -- it's much easier or more effective to parallelize the lifeguard or even apply interesting compiler optimizations on the lifeguard code basically by decoupling it. And I have some results out here where we see that the average slowdown for TaintCheck -- this is on top of the improvements we had before by simply separating them; we could actually get additional improvements.
So by parallelizing we could get the average slowdown of TaintCheck down to about 2x, and by applying some interesting path-based optimizations on the lifeguard code, we could actually get the slowdown down to about 2.5x. So a final benefit of decoupling the lifeguard from the program is that it gives us a lot of flexibility in terms of ensuring the integrity of the lifeguard. In particular what this means is that we can actually run the lifeguard in a separate fault domain from the program. Now when you're monitoring privileged code like device drivers which, you know, can destroy your system in all sorts of interesting ways, this is actually very useful. And this is a big part of my thesis which I'll talk about later on, in terms of monitoring device drivers and how decoupling helps with that. So I've sort of talked about all the benefits of decoupling. Of course there are some downsides to it and probably you've been thinking about that already, which is basically we lose this guarantee of being able to detect bugs early. In fact, we only detect bugs after some delay. So as an example here I have, you know, this piece of code here on the left-hand side and there's a bug there. And we don't detect that until much later when the lifeguard is able to analyze the trace that comes from that execution. And so what this means is that containing the side effects of bugs becomes much more difficult. But this is a challenge that I have tackled in my research in terms of monitoring applications and also monitoring in the kernel space. And I'll talk more about that later. So this is sort of the road map for the rest of this talk. I will start out with my early work that I did in terms of how do we take advantage of the fact that the lifeguard is decoupled to actually make the lifeguard faster, and I will talk about two ideas that I worked on. One is parallelizing the lifeguard and then doing interesting path-based optimizations on the lifeguard. And after that then we'll just switch into the kernel space and see what decoupling buys us in terms of monitoring device driver execution to mitigate bugs. And there I will describe a framework which I built for my thesis called Guardrail which actually allows you to sort of protect the I/O state from bugs in the device driver. I will conclude by basically giving a brief description of what I consider to be my future research directions and then summarize my contributions. And so switching to the section where we'll talk about how to optimize lifeguard code after it has been decoupled. I will start out by first motivating how do you actually decouple lifeguards in the application space, and here I'll be describing the log-based architecture project that I worked on much earlier and how that helps to decouple lifeguards. And then I'll go into the optimization techniques, which are parallelization and path-based optimization. And then I'll come back and talk about some other related work that has looked into how to make lifeguard code faster. So the basic idea behind log-based architecture or LBA is that since we live in a world where we have many cores available, why don't we run the lifeguard on a separate processor from the program. And if we do this what we get is improvement in performance like I showed you before. And, moreover, since transistor density is also increasing, then perhaps we could dedicate some more gates to like helping us stream the instruction trace.
So we're using hardware to stream the instruction trace from the application to the lifeguard; that way we avoid the software cost of instruction tracing. And so we built this design and basically we reserved some amount of memory to sort of cache the log or the trace. And so here the lifeguard in this case ends up becoming basically a collection of event handlers, because what's happening is we are streaming the application events, the instructions, into it, and the lifeguard is just responding and invoking the right analysis code to sort of do the right checking. Now the containment guarantee that we make in this design is basically that even though it might take us a while to detect a bug, the bug will actually not propagate outside the process context of the application. And we're able to achieve this in a couple of ways. One is the lifeguard itself runs as a separate process so it's sort of isolated from the application. And two is that we actually stall the program when it tries to interact with the kernel; so at the system call boundary, we stall it there until the lifeguard can actually catch up. And we also stall it, of course, when the buffer which we use for the instruction trace fills up. And so with this design we were able to achieve some significant improvements. I have shown the TaintCheck numbers before. But, for example, a tool like AddressCheck which basically checks -- Yes? >>: [inaudible] the application because the lifeguard is too slow... >> Olatunji Ruwase: So that shows up basically as the slowdown numbers. Yeah. So the throttling happens because we have a fixed-size buffer and so once that fills up then it indicates how much faster the program is than the lifeguard. And so that shows... >>: And how does the signaling happen? >> Olatunji Ruwase: How does the signaling happen? So we do that in the operating system, so we have some operating system support here. Yeah. Okay. So AddressCheck is a tool that basically checks for accesses to unallocated memory, and we saw like an improvement. We saw the slowdown relative to binary instrumentation go from 19x to 3.2x. Lockset looks for data races and we were able to reduce the slowdown from a 30x slowdown to like about a 4.3x slowdown. And MemCheck probably had the least improvement but we still had like close to a 3x improvement in this case. And so this is the baseline for the rest of this portion of the talk where I'm going to talk about how then do we, you know, tackle -- So the slowdowns you're seeing here, as the question alluded to, are really the overhead of the lifeguard computation itself, the fact that we're using multiple instructions to analyze each instruction in the program. And so the question is how can we make this better? And so the first idea I pushed was, "Well, can't we parallelize the lifeguard?" Right? And so let's sort of see why this would make sense. So imagine we have an application here and we're just looking at time. Time goes from top to bottom and we're looking at instructions here. In the instrumented case what ends up happening is we mix up, you know, the analysis. We interweave the analysis with the code and this is really hard to parallelize because it sort of requires the original code to be also parallelizable. Otherwise, you can't even parallelize here. But decoupling helps us because now if we decouple the analysis of the lifeguard from the program then the question is can the analysis itself be parallelized? That's the problem we have.
And we see that this is possible because the lifeguard is running behind the application. It means that the application is creating streams of events to be processed. And so there's no reason why we can't just fire off another lifeguard to process the parts of the stream further in the future which the lifeguard can't get to right now. And this way we can get improvements. >>: Is that automatic or does that require [inaudible] support on the actual lifeguard? So do I have to have language support as someone who writes a lifeguard in, for instance, Valgrind? >> Olatunji Ruwase: We -- Let's see. So the way this works is, yes, there is some minimal API that you need to use to specify that you're interested. In particular, you need to specify how much parallelism you want because that sort of goes into it. And it only makes sense if you have as many cores. So, yeah, there is some minimal API that you need to change. >>: It's not even necessarily easy even if you have the API. I mean if you're trying to do taint tracking. >> Olatunji Ruwase: I am going to get to that in a second. >>: Okay. [laughing] >> Olatunji Ruwase: Yes, as the questioner suggested, there are serial dependencies in what the lifeguard is doing. So in particular TaintCheck is a perfect example of this, like I showed before. In order to detect a security attack, TaintCheck needs to track the propagation of data. And this is inherently sequential. In fact TaintCheck falls into this class of lifeguards that do this kind of propagation called dynamic information flow tracking, or DIFT for short. And so MemCheck, which detects unsafe uses of uninitialized memory or uninitialized data, also has the same behavior. They are very, very sequential; the computation they do is very sequential. So how do you handle this? And so I came up with this algorithm for parallelizing DIFTs which essentially works as follows. So here let's say we have this stream of application events that we want to process. What we first do is we break this up into segments and then work on them in parallel. In parallel we run an algorithm called symbolic inheritance tracking. Now the basic idea here is that rather than trying to track the propagation concretely, we can symbolically track how taint is inherited. And I'll show some more details about that later. But it turns out that you can do this in parallel. Now when you're done with that we can then process each of the segments sequentially, resolving -- because essentially, because of the dependencies, we will still have some unresolved dependencies in there, and the resolving of these we call inheritance resolution. That's sort of the next step, where we then sequentially go through each segment and then resolve all of this inheritance. Now what's interesting about this is that we can actually perform it in parallel in terms of the locations. So for individual locations we can resolve the inheritance in parallel. And then for each segment we have to sort of do that sequentially. And so with this design we came up with an algorithm that basically has this nice property that asymptotically it has linear speedup. In practice the constant factors really prevent our implementation from actually having this behavior, however. So what do I mean by symbolic inheritance tracking? The idea here is that we are in parallel; we're in the first phase where we are trying to propagate the inheritance here. And so let's start with the very first instruction where we're copying from memory location Mx into R1. Well, what is the taint status of Mx? We don't know.
Well, that's fine. All we just record there is the fact that, well, the taint status of R1 now is whatever the taint status of Mx was in the previous segment. Okay, and so that's the inheritance, so we can then sort of propagate this along as the program executes. And so we end up in this scenario where either we are able to resolve the taint status for sure or we're able to like create an inheritance saying, "It comes from the previous segment." So what we've essentially done is we've been able to collapse the propagation chain which would have been a difficulty for us. >>: Can you go back to the previous slide? >> Olatunji Ruwase: Sure. >>: So one thing that sort of occurs to me with this is that it looks a lot like carry propagation [inaudible]... >> Olatunji Ruwase: Yes. >>: There are schemes that involve lots of hardware to do carry propagation very fast. Is there an equivalent for this? >> Olatunji Ruwase: I haven't really thought about that. But I mean the -- The equivalent for this? I'm not sure. I think there are some slight differences there because I mean the idea is that we may not really know what the true answer is until we get to the next phase. And so we sort of have to have this sort of not "don't care" but this generic answer like we don't know what it is. So I don't know if the carry propagation hardware actually does that. >>: I think it does. When you add two [inaudible] together you don't know whether it's zero or one... >> Olatunji Ruwase: Or one, okay. >>: ...so you have to carry them. >> Olatunji Ruwase: Then in that case I think this is quite similar. >>: Yeah. >> Olatunji Ruwase: I think it's quite similar actually. Yeah, I wasn't aware of that. Yes? >>: So you'll have to excuse me, I'm a little bit naïve on how you do TaintCheck. But my assumption is that you do a load and then if you do operations on that load, to carry taint you just "or..." >> Olatunji Ruwase: Yes. >>: ...an "or" across all of the values that you've loaded. >> Olatunji Ruwase: No. I'm sorry, can you rephrase that...? >>: So if it is set as being tainted. >> Olatunji Ruwase: Yes. >>: Then I just "or" the results of when I apply operations on that load, something that's derived from that load. Right? >> Olatunji Ruwase: Right, right. >>: So isn't "or" an associative operation? >> Olatunji Ruwase: It is an associative operation. >>: So then isn't this -- Even though you have a dependence chain, isn't that a parallel operation by definition? So if I want to add up a bunch of numbers, because addition is associative, I can do this? >> Olatunji Ruwase: Right. But if you don't know -- Well, okay. I see. So that's assuming you know one of the values. Right? Like you know it was tainted. But if you don't know, you can make the wrong conclusions. So it is an associative operation but it still depends on the input values. So if it's all untainted the "or" wouldn't change it. But if at least one of the input values is tainted then the final answer is tainted, at least that's the rule. So maybe I should've clarified that. So if at least one of the values is tainted then the whole computation is tainted; that's the challenge. >>: I think what you said is right for each location, right, but the problem seems to me that you have the instructions not the [inaudible]. >>: But if you have the... >>: The representations. >>: You only need to do an "or" when that value surfaces, I guess is the point. So there is some parallelism in the fact that you can stall until you actually -- and do what you have.
>> Olatunji Ruwase: Oh, okay. I see. Okay. >>: And then apply the "or." >>: That's very fine grain. >>: Yeah, yeah, yeah. >> Olatunji Ruwase: Yeah. Right. Yeah, so you sort of have to carry all the values along, is that what you're suggesting? So, yeah, we actually had some sort of similar problem. The problem is that the amount of [inaudible] that you need to carry that along is huge and it actually just kills your performance completely. All right, so here we are. So here we've been able to collapse the propagation chain and we're at the point where we know that the only taint values that we cannot resolve here are the ones that are dependent on the previous segments. And so we then move into the inheritance resolution stage and, like I said, we do that sequentially. So in other words we first resolve everything in segment J minus 1, and we have this nice property that those can be done in parallel. I'll show you that in a second. Now we are ready to resolve the taints for the values in segment J. And we can also do that in parallel because they all depend on one sort of predecessor. The question that you asked sort of alludes to the issue that there might be binary dependencies here as well. Here we assume for a particular lifeguard it's possible to have like a simplifying operation. So for TaintCheck, for example, when you combine -- I'm sorry, let me rephrase this. For MemCheck, for example, if you have at least one uninitialized value, it doesn't matter what the other value is. And so for TaintCheck as well, if you have one tainted value it doesn't matter whether the other value is untainted. So you can actually always collapse it easily, so we get this nice property that we don't need to worry about binary operations. And so once we resolve this then we can go ahead and sort of resolve the inheritance for the next segment. And so in terms of the implementation I basically built parallel TaintCheck in this way where I have a bunch of parallel workers but my inheritance resolution is done sequentially. So I have a single thread doing the inheritance resolution, and so we sort of call them the parallel workers and then we have a master essentially. You have to sort of compare this to sequential TaintCheck and assume that the master is equivalent to one sequential TaintCheck and then you give it a bunch of parallel workers to sort of help it improve performance. And so what does this look like in terms of performance? These are some specific benchmarks. And here I am showing the speedup over LBA, so over the sequential TaintCheck running on LBA. And we saw some pretty decent improvements and this is with eight workers, so the speedup is not quite linear. But we did see some interesting improvements and we saw in some cases up to 3.4x. On average we saw about a 1.7x improvement over the sequential approach using eight workers. And this sort of shows that there are opportunities to improve performance by parallelizing. We might even have gotten some more if I had parallelized the master, for example. Yes? >>: So what's the slowdown of TaintCheck with LBA? >> Olatunji Ruwase: With LBA it's about 1.9x. I showed that a bit earlier. Okay, so yes? >>: [inaudible] show that one processor is both a worker and a master? >> Olatunji Ruwase: And this question? >>: The leftmost... >> Olatunji Ruwase: Oh, so you mean this leftmost one? Oh sorry, actually that's not what's going on here. So these are the workers, like this symbol indicates a worker.
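In code form, the worker/master structure I'm describing is roughly the following. This is a heavily simplified sketch with invented names and a single flat address space, not the actual parallel TaintCheck implementation, which also has to handle registers, taint-introducing events, and binary operations:

    #include <stdint.h>

    enum tval { UNTAINTED, TAINTED, INHERITED };   /* INHERITED = "whatever location
                                                      .from held at the segment boundary" */
    struct sym { enum tval v; uint32_t from; };

    #define NLOC 4096

    /* Phase 1, run by a worker on its own segment (workers run in parallel):
       propagate taint symbolically, never blocking on earlier segments.
       'written' must start out all zero for each segment. */
    void worker_track(struct sym out[NLOC], int written[NLOC],
                      const uint32_t (*copies)[2], int ncopies) {
        for (int i = 0; i < ncopies; i++) {
            uint32_t dst = copies[i][0], src = copies[i][1];
            if (written[src])
                out[dst] = out[src];                       /* resolved within segment */
            else
                out[dst] = (struct sym){ INHERITED, src }; /* defer to the boundary   */
            written[dst] = 1;
        }
    }

    /* Phase 2, run by the master one segment at a time (but each location can
       be resolved independently, i.e. in parallel): replace INHERITED entries
       with the concrete taint left behind by the previous segment. */
    void master_resolve(const enum tval prev[NLOC],    /* taint at segment entry */
                        enum tval final[NLOC],         /* taint at segment exit  */
                        const struct sym seg[NLOC], const int written[NLOC]) {
        for (uint32_t loc = 0; loc < NLOC; loc++) {
            if (!written[loc])
                final[loc] = prev[loc];                    /* carried over unchanged */
            else if (seg[loc].v == INHERITED)
                final[loc] = prev[seg[loc].from];          /* cross-boundary lookup  */
            else
                final[loc] = seg[loc].v;                   /* already concrete       */
        }
    }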
And the only thing going on in the figure is that there are four segments and then the master, which is this symbol, works on each of the four segments. So maybe this figure is confusing. It's not really like the same processor. >>: Okay. >> Olatunji Ruwase: I mean it could be but it doesn't have to be. >>: Do you know if the master ends up being the bottleneck? >> Olatunji Ruwase: No it's not actually. It actually turns out not to be. The bottleneck here really is that this inheritance tracking doesn't shrink -- like the segments that come out of it are still quite big. And so I mean we could either try to like break down segments even more but it turns out not to help so much. >>: How big is big? >> Olatunji Ruwase: Big is like half of what I give, so it only shrinks about half. So like I give it a segment and it only shrinks about half. And then the other thing... >>: The number of briefly exposed... >> Olatunji Ruwase: Right. >>: ...values is half what's... >> Olatunji Ruwase: What's -- Yes. Yes, [inaudible] eliminates half. >>: So how big is the segment, though, in terms of instructions? >> Olatunji Ruwase: I think that this was probably like 16 K; these results are for 16 K. >>: 16 K? >> Olatunji Ruwase: Yeah, 16,000 instructions. Probably the other interesting thing, which I sort of left out, is that the inheritance resolution is about twice as slow as just propagation. So this guy is, you know, half the speed of the sequential TaintCheck. So that's the cost you get for symbolically tracking inheritance rather than actually tracking propagation. Any more questions? Okay. All right, and so what if you don't have many cores to commit to making your lifeguard faster? What else can you do? Well, one approach is, "Well, why don't you do path-based optimizations?" And you might ask, "Why path-based optimizations for lifeguards?" Well, because the compiler community has shown for so many years that, you know, in most programs the execution time is dominated by hot paths. Now if we consider what happens when we instrument those hot paths we see a number of things. And so here I have instrumentation of the hot paths. The colored bars are the analysis code; I've used different colors in this case so hopefully that delineates the different instrumentations, while the black lines are basically the original code. So we have something like this, like these basic blocks, working up to something like this where we still have a lot of analysis code in there. So if we consider one such hot path that's been instrumented, we can conclude a number of things. One is that the lifeguard actually spends most of its time analyzing the hot paths. Right? So the overhead of the lifeguard mostly comes from the hot paths. It also means that if we could somehow optimize how the lifeguard analyzes these hot paths which run frequently then, you know, we could improve performance overall. But the challenge to doing this with instrumentation is that this requires you to analyze across two different kinds of code. There's the program code and there's the analysis code, and there's some context-switching code in between them. It's just really, really complicated to find redundancies here. And this is where decoupling helps us. So by decoupling the analysis, the lifeguard, from the program we can just focus on the lifeguard code itself to see if we can improve it by analyzing along paths. And so here I have basically the lifeguard code that is the correct analysis code for this particular program path.
And we call that sort of unoptimized because we sort of just pull them out and sort of execute them in order, and so it's unoptimized. Now even with a standard compiler you could imagine that if you inlined all of this code together, a standard compiler could probably find opportunities to, you know, constant propagate or maybe save some saves and restores here. So we could get some improvements there. But the real big win comes from the fact that, you know, the lifeguards have a particular way to compute, right, so TaintCheck propagates. And so if we could exploit the domain knowledge of these lifeguards, we could actually get even bigger optimizations. And one example, which I'll talk about today, is how the lifeguards access their metadata. So the metadata is basically the taint values, and it's how they access it. And you could also eliminate things like the redundant checks. You can find redundant checks because you know what the lifeguard is doing. And as an example: here is the path handler for a particular path from mcf, a particular benchmark. We start out with a path handler that's the taint analysis handler that [inaudible] x86 instructions. Now when we sort of compose the path handler together and give it to a standard compiler to inline and do constant propagation, we're able to like eliminate 5 percent of the instructions. But if we apply TaintCheck-specific optimizations, we're actually able to [inaudible] about 45 percent of the instructions [inaudible]. So I talked about how metadata access is a big source of overhead in lifeguards. So let's consider TaintCheck for example. And I've shown this before where, you know, to propagate the taint of a memory location into that of a register it requires like [inaudible] x86 instructions. But what's interesting is that 6 of these instructions are all to derive the metadata. And the reason for this is because most lifeguards use two-level tables to shadow the whole memory. This is like probably the most flexible way of doing this. So you have 6 instructions that you have to execute each time just to obtain the metadata. So the question is, what can we do to improve this? Just looking at the lifeguard code itself -- I've represented the memory locations with A, B, C, and D -- it's very hard to say whether there are any similarities in this, because remember all of these values are just streaming in from the log to the analysis. And it's hard for us to tell is there any real value here. I mean are they similar? Are there optimization opportunities here? But instead what if we looked at the application path itself? And if we do that we see something very interesting. So here I've circled all the corresponding memory accesses in the path in the mcf program. And what we can see here, well, is that the first three are exactly the same address because, you know, the register doesn't change. So it's exactly the same address. So, you know, A, B, C are actually A. Right? And then what's also interesting is that even though the last one is not the same address, it's actually close enough. It's like 64 bytes away from A, which basically means there's a high probability that it falls into the same second-level table. So, like with a page table, we could leverage some of the work that we've done for accessing A. We could use it to also access the other location D much faster. Now what was interesting, at least cool to me when I did this, was like, "Wow, I'm optimizing my lifeguard, but I couldn't find anything in my lifeguard code to help me do it."
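Concretely, the kind of metadata-access specialization I mean looks roughly like this. It's a simplified sketch with made-up names, assuming a two-level shadow table with 64 KB second-level chunks that are already allocated; the exact layout and instruction sequence in TaintCheck differ:

    #include <stdint.h>

    #define L2_BITS 16                          /* 64 KB second-level chunks  */
    #define L2_SIZE (1u << L2_BITS)
    #define L1_SIZE (1u << (32 - L2_BITS))

    static uint8_t *shadow_l1[L1_SIZE];         /* first-level table          */

    /* The generic path: two dependent loads plus shifts/masks per access --
       this is where the extra per-propagation instructions go. */
    static uint8_t *shadow_of(uint32_t addr) {
        uint8_t *chunk = shadow_l1[addr >> L2_BITS];
        return &chunk[addr & (L2_SIZE - 1)];
    }

    /* The specialized path handler: the application path tells us A, B and C
       are the same address and D is usually within the same 64 KB chunk, so
       we walk the table once and reuse the chunk pointer. */
    static void path_handler(uint32_t a, uint32_t d) {
        uint8_t *chunk = shadow_l1[a >> L2_BITS];      /* one table walk for A, B, C */
        uint8_t *ta = &chunk[a & (L2_SIZE - 1)];
        uint8_t *td = ((d >> L2_BITS) == (a >> L2_BITS))
                    ? &chunk[d & (L2_SIZE - 1)]        /* D: likely same chunk       */
                    : shadow_of(d);                    /* fall back to full lookup   */
        *td |= *ta;                                    /* propagate, e.g. taint OR   */
    }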
Instead, I had to go and look into another program, which is basically the application I'm monitoring. And there I was able to find, you know, useful information that enabled me to implement this optimization. And so for me it was really cool being able to like optimize a program by looking at another program. And so these basically were the results that I got for this. And so here I'm showing average speedup over LBA on the Y-axis in percentages. And I'm showing two results. One is if you give a path [inaudible] to a standard compiler and it can perform inlining and constant propagation, what kind of improvements would you get; and then, when you implement lifeguard-specific optimizations like the kind I've described. And here we see that, well, the standard compiler does pretty well in some cases, like GCC could get a 21 percent improvement over LBA, but in some cases it's bad, like Lockset. And the problem here is the fact that when we inline all the handlers, we basically increase the code size and that hurts. But the cool thing is that with lifeguard-specific optimizations we always do better and sometimes quite significantly. Like we have a 61 percent improvement here for AddressCheck. And so by leveraging the domain knowledge of what the lifeguard is actually doing, you can actually improve the lifeguard significantly simply using standard compiler tricks. >>: So [inaudible]. Suppose I ran a profiler, then tried to figure out basic blocks and hot functions, and then applied basically inlining and reorganization of basic blocks based on that profile information. Do you think GCC would then have been able to pick up a lot of the optimization opportunities that the lifeguard... >> Olatunji Ruwase: That's exactly what I did, actually. >>: Okay. >> Olatunji Ruwase: That's -- Yeah. >>: [inaudible] >> Olatunji Ruwase: That's exactly -- Yeah. So I figured out the hot paths and I presented them both to GCC and to my new compiler, and these are the results. Yeah, yeah. Okay. All right, so we get improvements this way. All right, so in terms of related work into how do we make lifeguard code faster, we can sort of categorize it based on whether it was targeted at coupled lifeguards or decoupled lifeguards. So for the coupled lifeguards, Qin basically accelerates TaintCheck based on the observation that most locations have a stable taint value. So they're either always tainted or always untainted; they never really fluctuate. And so they were able to use that to do less taint propagation. Dalton basically proposed doing taint propagation in hardware and that helps. And more recently Bosman and others have come up with some interesting techniques on machines that have many registers, basically partitioning the registers between taint checking and your actual program. And they've got some improvements there. And then they had some interesting but custom metadata organization that probably only works for TaintCheck. And in the decoupled space, well, we had some hardware accelerators -- hardware acceleration with the ISCA work and the Vlachos work. The Nightingale work actually is kind of interesting. So they parallelize TaintCheck as well and basically they found some good speedups. But one of the results from there, which I didn't talk about today, is the fact that you need at least four parallel workers before you can actually match the speed of the sequential version, simply because of the constant factors.
And Corliss is basically a hardware extension in the commit phase where the analysis is performed at commit time using special hardware. So at this point we're going to sort of switch gears and sort of dive into the next part of my talk, which is where I'll talk about using lifeguards in the kernel space, basically the Guardrail work. So the outline for this is first I'm going to motivate why we considered device drivers as a target application for lifeguards in the kernel space. And then I'll give an overview of Guardrail and what sort of design goals we had in mind. I'll sort of look at some related work in that space. And then I'll talk about how we do containment. So we're going to be detecting bugs after the fact, so how do we do containment. And then I'll present some -- I wrote three new driver lifeguards. I'm going to present their bug detection effectiveness and then their end-to-end performance. So why drivers? I guess this is probably the wrong community to be asking this question. [laughing] Well, they are an important part of our software today, right? They allow us to use our hardware devices that we love, but unfortunately they're a major source of bugs. And, you know, one particular operating system reported that 85 percent of their crashes were due to drivers. And then the other thing that's actually interesting about them in terms of runtime monitoring is the fact that drivers are actually very sensitive to perturbations, which is something that you don't see in the application space. This is because a lot of their computation is timing sensitive, so responding to interrupts is timing sensitive. If you stall it significantly enough you actually break the driver. And so for this reason this has been the key limitation to applying sophisticated analyses to drivers, because if you slow them down enough, you just introduce a new set of reliability issues, right, besides the ones you're looking for. So what does Guardrail do? So Guardrail is a system that I built for using decoupled analysis to actually mitigate driver bugs. In particular I'm looking to protect the I/O state, the device's I/O state, from driver bugs. And the way Guardrail works is basically it streams an instruction trace to the lifeguard and then it interposes on the interface between the driver and the device such that I/O operations are not allowed to pass through until they've been verified by the lifeguard. And so there the lifeguard can run any arbitrary sophisticated analysis that you can think of. And so with this we had like a number of design goals which Guardrail meets. So the first is we wanted containment, right, so we wanted to protect the external world, starting from the device, from bugs in the driver. We have the unfortunate effect that the operating system kernel is still sort of vulnerable to driver bugs. We're able... >>: Is that because the lifeguard runs after the instruction has already clobbered your kernel? >> Olatunji Ruwase: Yes. Because we don't have an interposition layer there. >>: Are you going to tell us how you implement interposition there on the outside because I'm not sure how you did that one either? >> Olatunji Ruwase: Oh, you mean on the kernel side? >>: No, between the driver and the device...? >> Olatunji Ruwase: Sure. >>: Okay. >> Olatunji Ruwase: Yeah. And also the fact that we run the lifeguard in a separate protection domain, which I talked about earlier, means that even if the driver hoses the system, our lifeguard is still able to at least tell us what went wrong.
But in reality here we -- Yeah, if the driver hoses the kernel, the lifeguard can still tell us exactly what went wrong. >>: Do you also [inaudible] like DMA? >> Olatunji Ruwase: Yes. In an interesting way I'll get to that too in a second. Right. Right. So the other sort of design goal we had was generality. So drivers suffer from a variety of bugs, buffer overflows, data races, security vulnerabilities, all sorts that you can think of. And so we wanted a system that allowed lifeguard writers to write arbitrary analyses, and we sort of hit that by making the lifeguard sort of be a separate component that you can just implement your new analysis in. And in particular -- And also our interposition layer is actually transparent; it doesn't require changes to either the driver or the device and so it allows for generality. It allows us to support arbitrary combinations of drivers and devices. And in particular in this work I looked at about 11 Linux drivers from four different classes, so audio, video, network and disk. And also I implemented three new lifeguards for finding concurrency errors, operating system protocol errors such as DMA errors, and memory safety errors. Yes? >>: [inaudible] specify prevent [inaudible] corruption of I/O state [inaudible]? >> Olatunji Ruwase: Sorry, I couldn't understand. >>: So do you prevent corruption of the I/O state or just the corruption of the device? >> Olatunji Ruwase: Of the device, yeah. So here we're just safeguarding the I/O state from the driver bugs. Yeah, so the kernel is vulnerable to bugs in the driver [inaudible] Guardrail. >>: Do you actually run the lifeguard completely offline? >> Olatunji Ruwase: This is online. This is all online. >>: How much buffering do you have to do for this instruction trace? Because that can explode very fast, right? >> Olatunji Ruwase: It doesn't matter. It really doesn't explode so fast because when you think about the fraction of time that drivers execute, it's only for brief periods within your overall execution time. And so the driver is not at all like applications, which can be CPU intensive and run all the time; drivers only run intermittently. And they might run for a burst but then, you know, they go off and do some I/O and they don't run for a while. So detection fidelity: the idea there is essentially that some bugs just require instruction-level analysis. And so we want to support that. And we do that by basically providing the lifeguard with an instruction trace. And finally trustworthiness. And the idea here is that, well, ultimately we have to make some changes to the trusted computing base and we wanted that to be minimal. And in particular the I/O interposition is the only layer that really needs to be in the trusted computing base. And by ensuring that the only thing that it does is basically intercept the interaction between the driver and the device, it doesn't really do any sort of computation to detect errors. That's actually decoupled off to the lifeguard, so that way it's at least conceptually very simple. And so that helps for the trustworthiness of the TCB. Now in terms of existing work in this direction, in terms of driver reliability there's been a lot of work. I think we can categorize existing dynamic approaches along two different axes. So the first axis, which is the columns here, is exactly when does the dynamic approach perform its correctness check? Right?
So you have the coupled approach, which is you perform the check before every driver instruction, or the decoupled approach, which is sort of like my work, Guardrail-style, where you do the check some time later. And on the other dimension is how much of the driver execution is checked by the dynamic analysis? So you could either check all of the driver instructions or there are a lot of techniques that check only the API interaction between the driver and either the kernel or the device. Now these design points matter because they impact things like performance. So one way to improve performance is basically to check just the API calls. Right? You ignore like the bulk of the driver execution. Another way to [inaudible] the performance is to decouple it, so run the checks sort of asynchronously. But of course this all comes at a price, right? So if you check only the driver API then you have poor fidelity in the sense that you miss a lot of bugs that occur within the driver itself. And of course if you decouple then you have this containment challenge which I'll get to a little later. And of course the other properties sort of fall into place. And so in terms of how existing approaches map in, we have this sort of picture. And so perhaps the most immediate thing to observe is that there is only one prior work that does like decoupled checking and that's Aftersight. And Aftersight provides like no containment at all so it doesn't protect anything, let's just put it that way, because it detects the errors after the fact. It has no containment. And so in terms of the coupled approaches, which is the majority of the approaches, we see that there is a lot of work in checking only the API to improve performance. So that's where people have sort of optimized: they want good performance and so they only check the API. And so of course the problem with that is that, you know, you miss some actual bugs. So Guardrail sort of sits here, in the decoupled space where we check all the driver instructions. And the downside there is the containment, and I'll describe how I addressed that, how I solved that. >>: So can you say what you mean when you put like software fault isolation in the upper half of this? I mean I usually think of that as checking all instructions that like run? >> Olatunji Ruwase: So software fault isolation is basically almost like sandboxing. And so the idea is you're fine with the module that you're checking corrupting its own internal data structures, but you're just saying it will not get out. The problem there is that, well, while you could prevent things like writes that would directly corrupt the external world, there's a question of whether your module itself is behaving correctly at a fundamental level. I mean it could constantly overflow its own buffer or corrupt its data. Is it even doing the right thing? It might not be able to corrupt the kernel but is it even providing the functionality that you want? >>: So you're worried that it's going to corrupt the device and decide the driver [inaudible] are happening? >> Olatunji Ruwase: Yeah. [inaudible] exactly. It's a good point. So even like a lot of the fault isolation work has really been about the interface between the driver and the kernel. Really the interface between the driver and the device is something that -- those actually fall into that as well. But even so, is my driver behaving correctly? Even if it's not crashing, is it displaying the right images?
It may display the wrong images without actually crashing the kernel, and that's not what you want. >>: I guess the followup question is how many of those crashes that you mentioned are actually because of drivers [inaudible]? >> Olatunji Ruwase: Well, how many of which crashes? Well, I mean I didn't mention any particular crashes but I can answer your question. And the answer there is that yes, the challenge is that while -- [inaudible]. So crashes that come from the driver corrupting the kernel are well studied and there are a lot of numbers there. Unfortunately for crashes that come from the driver corrupting the device, there are not that many numbers on them. But there have been some really interesting cases. So for example a few years ago there was this E1000 network card that was basically being hosed by a driver. And, you know, your network card literally just dies. And the challenge there is if your driver actually hoses your disk, for example, it could cause some other kinds of failures that are just impossible to trace. So it's really hard to actually quantify how bad this is. So I mean imagine that the driver actually corrupted -- So you have virtual memory, right, so you're storing stuff on disk. And you think your virtual memory is working well. What if the driver just scribbled over all of your virtual memory metadata? Your operating system starts failing in all sorts of arbitrary ways and it's very hard to sort of quantify this. I think this is part of the challenge why, you know, there hasn't been a really good study in terms of these kinds of failures, because they tend to be very catastrophic. Any questions? Okay. And so the other dimension to look at in dynamic approaches is what system components they are protecting. And like I said most of the work has really been in terms of protecting the operating system kernel. It was only recently that people started looking at protecting the I/O device state, and so Guardrail sort of falls into that space. But that was like an interesting area to go after. The other space is really well congested. There has been a lot of good work and interesting results from that direction. So I just decided to pursue this new space that is fairly new. >>: Can you talk a little bit about Nexus-RVM? >> Olatunji Ruwase: Oh, sure. >>: It protects both, right? [inaudible]... >> Olatunji Ruwase: It protects both, yes. Yes. So Nexus-RVM protects the kernel by basically running the drivers in user space. Okay, so that gives [inaudible] protection. And it protects the device by basically requiring the hardware vendor to write a reference validator which indicates how the drivers should interact with the device. And so by trapping -- So now that the driver is in user space, you can essentially trap all of its -- Well, you basically change its API such that all of its interaction with the device actually turns into system calls. And so you can interpose on that and then check that against the model. And then the reference model sits in the kernel and then verifies basically that the driver is, you know, interacting with the device in the right way. >>: The reference model has to be written manually. >> Olatunji Ruwase: Yeah, it has to be written manually. It's very hard. It's specific to that particular device and it's not clear that it could be written in an open source fashion because for devices that are proprietary it reveals probably way too much about the [inaudible] that go into the device itself. Sorry? >>: So performance? >> Olatunji Ruwase: Yeah. Yeah.
But it was really the first work to look into this dynamic approach -- And they also made this interesting observation in that work where they said, like, you know, "When we banged on our device hard enough without the RVM, our device stopped functioning." So they sent like all sorts of random data inputs to the device and they actually killed the network card as well. [inaudible] But Guardrail sort of falls into this space. All right, so containment, which I guess there's a bunch of interest in here. So how do we contain bugs if we detect them after the fact? How do we prevent them from corrupting the device? So let me just rephrase the bug containment challenge. Here we have driver code. It's running. It's hit a bug. It's gone past the bug, so time is going from top to bottom. The analysis is still somewhere back, still checking some of its history. And the driver is going to do a disk write at some point. And so the question is, okay, is that bug going to cause terrible things when this disk write happens? What can we do to prevent this from happening? So in other words we have this challenge that we need to preserve integrity even though we're detecting bugs after the fact. And so I looked at it and said, well, you know, this is because the driver talks directly to the device. So why don't we just put something in between them? And virtualization seemed like a really good way to go, so I built something around a virtual machine monitor where I added an interposition layer here. And so with Guardrail -- So this is how I/O operations get executed. So first the driver issues an operation and we intercept it, okay, in the interposition layer. And then we send a request to the lifeguard which is analyzing. And it'll say like, oh, is it okay to let this access go through? Have you caught up? Eventually, let's say it's a good case. It says, yeah, it's okay. You can go ahead and do it. And then the interposition layer then replays the access and -- I don't know. Should I take a question? Okay, very good. It replays the access. Now what's cool about this is that this is transparent to the driver and the device. So the driver doesn't know that this interception occurs and the device doesn't know that the access is coming from the virtualization layer and not directly from the driver. And in particular we do the interception using traps, and we complete the operation using emulation, which presents some performance challenges. >>: [inaudible] question. How does this work with DMA? >> Olatunji Ruwase: With what? >>: DMA? >> Olatunji Ruwase: So there are two parts to DMA, right, there's the control part and then there's the actual streaming data part. So what we are tracking here is ensuring that the control part is correct. >>: So the control part is just a data structure, like a network transmit queue that's sitting in memory. >> Olatunji Ruwase: Mmm-hmm. >>: And the driver is just doing memory writes to queue packet transmissions by queuing stuff in a memory data structure. >> Olatunji Ruwase: Mmm-hmm. >>: And asynchronously with that, the device is walking this thing in memory and pulling packets out of it. So the control channel... >> Olatunji Ruwase: Well, so... >>: To the extent that they exist, the control channel is just [inaudible] to memory. >> Olatunji Ruwase: No, it's a little more than that because the driver has to tell the device where it wrote stuff into. The device is not going to just go to some arbitrary location and start reading from there.
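Just to ground this point: in a typical network driver the transmit path looks roughly like the sketch below, with completely made-up register and field names. The descriptor writes are ordinary memory writes that Guardrail does not interpose on; the doorbell write is a device register access, and that's the kind of operation Guardrail traps and holds until the lifeguard has caught up:

    #include <stdint.h>

    struct tx_desc { uint64_t buf_addr; uint16_t len; uint16_t flags; };

    struct toy_nic_regs {                 /* memory-mapped device registers    */
        volatile uint32_t ring_base_lo;   /* where the transmit ring lives     */
        volatile uint32_t ring_base_hi;
        volatile uint32_t tx_tail;        /* "doorbell": last valid descriptor */
    };

    /* Queueing a packet: plain memory writes into the DMA descriptor ring.
       These do not change device state and are not interposed on. */
    static void queue_packet(struct tx_desc *ring, unsigned idx,
                             uint64_t dma_addr, uint16_t len) {
        ring[idx].buf_addr = dma_addr;
        ring[idx].len      = len;
        ring[idx].flags    = 1;           /* e.g. "descriptor owned by device" */
    }

    /* Telling the device about it: an MMIO register write. With Guardrail the
       mapping has been revoked, so this traps into the interposition layer,
       which asks the lifeguard whether it has verified the trace up to here
       and only then replays (emulates) the write to the real device. */
    static void ring_doorbell(struct toy_nic_regs *regs, unsigned new_tail) {
        regs->tx_tail = new_tail;
    }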
>>: So the driver might tell the device, "Here's my transmit queue." >> Olatunji Ruwase: Right. >>: And the device starts transmitting. >> Olatunji Ruwase: Right. >>: And as long as there are still packets in that queue, the device will keep transmitting, the network device will keep transmitting. >> Olatunji Ruwase: Right. >>: And the driver can happily queue more stuff there. It doesn't have to tell the device, "I've got another packet. I've got another packet. I've got another packet." >> Olatunji Ruwase: Right. Right. So I... >>: This is asynchrony. >> Olatunji Ruwase: So the -- Right. Right. So eventually, right, even with this asynchrony, at some point the driver has to be done and has to tell the device that it's okay to get the data, at some point, right. Or else the device has to generate an interrupt to indicate that "I have completed" -- I mean, it's a transaction at the end of the day. >>: Well, maybe we should talk about this later.... >> Olatunji Ruwase: Yeah, okay. >>: My argument is that there's a control channel but it's asynchronous in the sense that the driver can get stuff done without having to invoke that for every operation. And I'm sure that in this system I could slip things out to the device without... >> Olatunji Ruwase: Oh, no, no, no. Okay, so now I -- Okay, I think now I [inaudible]. So the driver can write corrupted data and slip it out, but it would not modify the state of the device itself. Because the control data that you're sending -- sorry, the I/O data that you put in the packet queues is not actually going to affect the state of the device itself. You can only do that by programming these device registers and sending them some arbitrary values. >>: Okay. >> Olatunji Ruwase: I mean, there is a performance implication for trapping and emulation, but by doing this the design tradeoff that I made was that I would rather stall I/O operations, the device register reads and writes, which are typically slow anyway and relatively infrequent, rather than impact the actual CPU instructions of the driver which are much faster. But of course there is some performance overhead implied here. And we can see that here I'm showing performance relative to Linux, where Linux will be one. And so lower is worse. And here I am evaluating the performance of using Xen, so I built this in Xen. I added this interposition layer into Xen, and so I'm comparing the interposition numbers against Xen. And I did this for basically four classes of devices, so audio, video, disk and network. And the idea is to say what is the performance impact of doing this sort of interposition on every device register read and write, every I/O operation. And what we see is that in most cases the overheads are not so bad; it's like up to 10 percent. And this is because these operations are slow anyway to begin with and then they're infrequent. But there are cases too where it gets worse. So one of the cases where I have a very poor number is on compilation. So we're losing about 30 percent performance relative to Linux. And it turns out that, well, the virtualization of Xen is not doing much better. So, you know, it seems like most of that overhead is coming from virtualization. But with Memcached, where I have the system fully saturated, I am losing close to 40 percent of the performance. And so here it shows both sides of it.
So like 20 percent of that seems to be coming from virtualization and the other part is coming from the fact that, as I'll show a little bit later, Memcached for this particular workload actually makes the driver do a lot more I/O operations. I mean the get size is about 64 bytes, so it's doing more I/O operations per byte. And so in those cases we can see the slowdowns. So that's sort of how the interposition works. So now what does this allow me to do? Now I can write some new interesting tools that have not existed before. And in particular I wrote these three tools: DRCheck finds data races in drivers, DMACheck finds DMA protocol bugs, while DMCheck finds memory bugs, unsafe uses of uninitialized data, sort of like MemCheck at the user level. In the interest of time I'm going to focus only on the first two. >>: One thing [inaudible] said was that the reference monitors, they had to write them for each device so they'd have to know about the device. The things you're talking about, the I/O interposition and these checkers -- are those per device or are those general? >> Olatunji Ruwase: So the I/O interposition itself is general. And the reason for that is because it intercepts the operation in sort of like a device-agnostic way, so it's essentially a trap. So the driver will map the device's memory into its address space and try to touch it. And I mean the details are that I just revoke that mapping so that every access to it becomes a trap. And so this works regardless of the driver or the device. >>: So you just need to know where the control write [inaudible]... >> Olatunji Ruwase: Yes. Exactly. Right, exactly. And I mean there are standard APIs for how to do this in the kernel. So, yeah. At least within a particular operating system kernel it's general in that sense. >>: But to go back to your motivation... >> Olatunji Ruwase: Sure. >>: ...to drivers that have bugs that cause hardware to be corrupted like the network card [inaudible]. >> Olatunji Ruwase: Yes. >>: To catch that you would need to write a checker that understood the device's interface. >> Olatunji Ruwase: So I will put that in a class -- So like I said there are a wide variety of bugs that drivers could have. So it's conceivable that if I gave you an instruction stream, an instruction trace of the driver, you could actually write a checker and get it to check what the driver is doing and how it's programming the device registers and find such bugs. I didn't look into that in particular for my thesis because one of the challenges of doing that is you sort of need to know how the device behaves, right, and that seemed like really hard. So these are sort of like the easiest things I could do using, like, open-source information. But I believe very much that you could actually write similar checks. All right, so I'm going to start with DRCheck which finds data races, so a quick background on data races. So a data race essentially is two accesses to shared data, typically from two different threads, where at least one of the accesses is a write and they're not serialized. And so like the example we have here would be a data race, right. And like a lot of the work in data race detection has been done in the user space. And so some of the techniques they've used are things like trying to infer whether there's a consistent locking discipline for protecting shared data. So this would be the Lockset approach, where we can see here that this access here with these locks held is safe because the access is consistently protected by the lock Lx.
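In code, the Lockset idea is basically set intersection. Here's a tiny sketch, with locks represented as a bitmask; this is my own simplification of the Eraser-style algorithm, not any particular implementation:

    #include <stdint.h>
    #include <stdio.h>

    #define NVARS 1024
    static uint32_t candidate[NVARS];        /* locks consistently held so far */
    static int      seen[NVARS];

    /* On every access to shared variable v, intersect the set of locks the
       accessing thread currently holds with the variable's candidate set.
       If the candidate set ever becomes empty, no single lock protects v. */
    static void lockset_access(int v, uint32_t locks_held) {
        if (!seen[v]) { candidate[v] = locks_held; seen[v] = 1; return; }
        candidate[v] &= locks_held;
        if (candidate[v] == 0)
            fprintf(stderr, "possible data race on variable %d\n", v);
    }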
And then the other approach used is Happens-Before, which builds on the fact that certain operations are serialized in time. So a lock and an unlock on the same lock are serialized; one happens after the other. So the technique builds on this. And then in addition they maintain a vector clock to keep track of logical time. And so here we'll see that we can conclude that this access to X is safe. So this is the vector clock of the variable, and it indicates that Thread 1 wrote this variable at a [inaudible] time of 2. And this is the vector clock for Thread 2, and relative to it, a [inaudible] time of 3 for Thread 1. So the accesses are serialized in time. And then there are hybrid approaches that combine them, because these broad approaches have complementary strengths and weaknesses. And so that, in essence, covers race detection at the user level. And you might ask, well, why not just use these tools on the kernel? What's the problem? Well, it turns out that concurrency in kernel mode is vastly more complicated than at the user level. So one of the things that can happen is that things that are actually racy in the kernel will appear as being safe to a user-level detector. And one example is this notion of intra-thread concurrency. This is when a thread actually races with itself. So when I was defining data races, the common definition implied that there have to be two threads involved to create a data race. In the kernel this is not true at all. >>: What does that mean? >> Olatunji Ruwase: I will get into that; I'll go into that in a minute. And then the other problem is that safe accesses, the well-serialized accesses in the kernel, will actually appear racy. And this is because kernel synchronization is done using other things besides mutual exclusion primitives. In particular there's a form of synchronizing called state-based synchronization, where drivers serialize their operations based on the state of the device. And I'll talk about that too. So there's only one work that I'm aware of that finds data races in kernel mode using a dynamic approach, and this is called DataCollider. And the way it works is basically, say here we have the shared variable in two threads. DataCollider will stall a thread at a shared memory access and set a breakpoint on that data. And the idea is that if some other thread comes along and tries to access it, then there will be a collision, because it obviously means that that access was not serialized relative to the stalled thread. And so the cool thing about DataCollider is that it has no false positives, simply because it doesn't need to worry about synchronization primitives or protocols or things like that. And it only reports actual races, races that actually occur. So that's a strength. But the downsides are basically that, well, because it stalls threads, it cannot stall threads that are timing critical, like interrupt-handling threads. So if the first access is from an interrupt-handling thread, then DataCollider will not stall it and will miss the race. And also it cannot stall threads indefinitely. It has to let them go eventually. Now this is less of a problem because typically no data corruption happens if the two conflicting accesses happen far apart in time. But it's still a near miss, and we would like to know that two conflicting accesses have occurred far apart in time; I would like to know that.
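Just to give the intuition behind the collision test, it can be sketched at user level roughly like this. This is my own simplification, not DataCollider's actual implementation, which does its sampling and breakpointing inside the Windows kernel; the thread structure, timings, and variable names here are made up, and this variant only notices a conflicting write that changes the value.

    /* Rough user-level sketch of the "stall and watch" idea: record the
     * value at a sampled access, pause the thread, and re-read it.  A
     * change during the pause means another thread wrote the location
     * concurrently.  Illustrative only. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile int shared_flag;

    static void sampled_access(volatile int *addr)
    {
        int before = *addr;        /* value seen at the sampled access */
        usleep(100 * 1000);        /* stall this thread briefly */
        int after = *addr;         /* did anyone else touch it meanwhile? */
        if (after != before)
            printf("collision: %p changed from %d to %d during the stall\n",
                   (void *)addr, before, after);
    }

    static void *other_thread(void *arg)
    {
        (void)arg;
        usleep(10 * 1000);
        shared_flag = 42;          /* unsynchronized write collides with the stalled read */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, other_thread, NULL);
        sampled_access(&shared_flag);
        pthread_join(t, NULL);
        return 0;
    }

The two downsides I just mentioned fall straight out of this structure: you cannot insert that stall into a timing-critical interrupt handler, and the stall has to be bounded, so conflicting accesses that are far apart in time slip through.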
So let me get back to how things that are racy in the kernel might appear safe to a user-space detector. So this is intra-thread concurrency. Now what is this all about? Well, the idea here is that the kernel typically has to perform a wide range of tasks at different priority levels. And so a kernel thread actually doesn't execute in just one context. In fact, if we look in the Linux kernel, a thread can be in the process context, which is perhaps confusingly named; it really means that it's servicing a system call from the user level. Another context that the same kernel thread could be in is the top-half context, where it's servicing interrupts generated by the device. And then in the middle we have the bottom half. And the idea is that the top half is the highest-priority context. And so if a kernel thread is doing system call stuff and, you know, a packet arrives on a network card, we're going to interrupt it and use that same thread to go service that interrupt. And just to show an example, we have here a network driver. It starts out at time 1 in a process context sending packets, you know, in the send system call. And packets arrive, and so we need to move it -- sorry -- into the top half. Now the question is, what happens if it accesses the same data in both contexts, right? This is actually a race, even though to a user-level detector it's the same thread, so it looks serialized. But no, this is actually a race, because eventually when it returns to the process context, that context is not aware that another change has happened. Right? So the driver can get into an inconsistent state. Now DataCollider can detect this particular example here because, you know, it can stall the thread while it's in process context, and then when the thread tries to access the data in the top half it will actually detect that, because it will, you know, fire off the breakpoint. And so the basic insight here is that when we think about concurrency and race detection in the kernel, we need to think not only about the threads but also about the contexts in which they're executing. Now in terms of state-based synchronization: so this is where something is well serialized by the driver but might appear to be buggy to a user-level detector. The basic idea here is that the devices themselves are basically finite-state machines; we can view them as finite-state machines. And here I have a snippet of a network card where it starts out in an inactive state, at some point it gets connected to the PCI bus, and at some point it gets initialized and moves into packet transmission. So if I consider this network driver code where, well, the same flag is updated by two different functions which could be executed by different threads, and they don't seem to be serialized, then my accesses look like a race. Well, it turns out not to be a race, simply because the probe function here is only valid when the device is in an inactive state. It's actually what takes the device from an inactive state and connects it to the PCI bus. While the open function only gets called after the device is already connected to the PCI bus. So in reality these two functions are serialized in time; they will never execute concurrently. And so this is actually not a race, but a user-level detector, because it doesn't see any explicit synchronization between them, will think that this is a race. >>: But DataCollider will [inaudible] this as a race. >> Olatunji Ruwase: DataCollider, yes.
No, DataCollider has zero false positives. Yeah, I had that on the slide, right, zero false positives. That's the great thing about it. So this actually goes to the fundamental point, which is that by being agnostic to synchronization protocols, DataCollider will not have this sort of problem regardless of whatever new fancy synchronization protocol comes up in the future. That's really cool. In Guardrail we're trying to understand synchronization protocols, so we're susceptible to this, and in particular because the states in devices can vary... >>: Why are you doing that? Why don't you just use DataCollider's trick? >> Olatunji Ruwase: Why don't I use DataCollider's trick? Well... >>: Use that trick, right? I mean this trick of just... >> Olatunji Ruwase: So combine [inaudible]... >>: ...[inaudible] >> Olatunji Ruwase: Actually, combining the strengths of DataCollider, in fact using it to help eliminate the dependence that DRCheck has on synchronization protocols, is an interesting direction of research. But here I wanted to really push as far as I could: if I try to understand the synchronization protocols, what would that look like? And so the challenge that comes from this is that devices are very different and they have different states. So how can I possibly encode all the different states into my tool, even for a device that hasn't been invented yet? So at this point I have to make a trade-off, right? I can't possibly track all the possible states that a device could have. But fortunately the kernel itself is also aware of some of the states, because the kernel is actually what invokes the driver code. And so for example here is the network stack code that invokes the open function of a driver. And you can see here that it's actually checking that the device is connected to the PCI bus before doing that. And so in DRCheck we basically leverage this fact that some of the states are already exposed to the kernel in a standard way. And so we essentially just track those kinds of kernel-visible states: whether the device is connected to the PCI bus, whether it's generating interrupts, and things like that. So this allows us to handle a wide variety of drivers, but it also creates the potential that we may miss synchronization; we may have some false positives, and I'll have those numbers in a second. So in particular, what are the false positives that we have? For my comparison I started out with DataCollider, which has zero false positives. Well, okay, let me just rephrase this: DataCollider has zero false positives. This experiment was done using Linux drivers; of course DataCollider was only available on Windows when I was doing this evaluation, but it has zero false positives. So no big deal here. Now Kernel-Lockset is a tool that I wrote where I took a user-level Lockset -- excuse me -- and basically made it aware of the locking primitives that exist in the kernel. And then I applied it and, whoa, I get all these false positives, simply because, you know, the synchronization protocol in the kernel is different from that in user space. But with DRCheck, we do better. I mean we still have some false positives, and a lot of this comes from the fact that we are only tracking states that are visible to the kernel; states that only the driver and the device know about, which they may use to serialize themselves, we don't see.
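To make concrete what kernel-visible state buys us, here is a schematic C sketch of the earlier probe/open example. The names are made up and this is illustrative only, not code from a real driver, from the Linux network stack, or from DRCheck, but it shows the shape of the thing: two unlocked writes that can never actually overlap, and the kernel-visible state check that serializes them.

    #include <stdio.h>

    /* Schematic sketch with made-up names, not a real driver: both
     * functions write priv->flags without holding any lock.  A lock-based
     * detector would report a race, but the kernel only calls probe()
     * while the device is still unattached and open() only after it is
     * attached, so the two writes never run concurrently. */
    struct my_priv {
        unsigned long flags;
    };

    static int my_probe(struct my_priv *priv)   /* device: inactive -> attached */
    {
        priv->flags = 0;                        /* no lock held */
        return 0;
    }

    static int my_open(struct my_priv *priv)    /* only reachable once attached */
    {
        priv->flags |= 1;                       /* no lock held either */
        return 0;
    }

    /* The serialization lives in the caller: roughly speaking, the stack
     * checks an attachment/registration state before it will invoke
     * open().  That state is visible to the kernel, and states like it
     * are what DRCheck tracks in place of locks. */
    static int kernel_side_open(struct my_priv *priv, int attached)
    {
        if (!attached)                          /* kernel-visible state check */
            return -1;                          /* would be an error code like -ENODEV in real code */
        return my_open(priv);
    }

    int main(void)
    {
        struct my_priv priv;
        my_probe(&priv);                        /* "probe": device not yet attached */
        printf("open before attach: %d\n", kernel_side_open(&priv, 0));
        printf("open after attach:  %d\n", kernel_side_open(&priv, 1));
        return 0;
    }

What DRCheck cannot see is any serialization that lives purely in driver-private or device-private state, which is where the remaining false positives come from.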
And so these are sort of the false positives that we get here. But we think that at least with this we're able to get a good balance between techniques that are completely agnostic of the synchronization protocol and those that try to understand synchronization protocols. >>: So when you say false positive here you mean things that are not in fact data races? >> Olatunji Ruwase: Yes. >>: But it's not necessarily true that all of the data races are bugs. >> Olatunji Ruwase: Right. Okay, so yeah. So the concept of benign races... >>: Yeah. >> Olatunji Ruwase: ...is what you're talking about. So I'll come to that -- so I'm not reporting the number of benign races that I have here. And I think that's a problem for data race detection in general because... >>: Do you have a sense of how many of these are actually bugs that we care about? >> Olatunji Ruwase: Oh, these are all false positives. I haven't... >>: [inaudible] >>: Oh, sorry, sorry, sorry. Sorry, they're false positives, yes. >> Olatunji Ruwase: So they're all false positives. >>: I have a followup question. >> Olatunji Ruwase: Yes. >>: How many -- so these are all false positives, but you have this other problem that if you try to be too precise, you are so precise that you don't report any real data races either, right? >> Olatunji Ruwase: Yes. >>: So when you are pursuing low false positives, you have to simultaneously make sure that you are doing something useful also. >> Olatunji Ruwase: Right. Absolutely. >>: So did you find any real races? >> Olatunji Ruwase: Yes, that's the next slide. >>: Okay. >> Olatunji Ruwase: So we found true races in Linux drivers, and this is after pruning away all of the benign races. This is where we actually found real races. Some of them were kind of interesting. So one of the races here actually never got a Linux bug report. I mean, the way I verified this was basically, you know, I have my driver version and then I keep looking through the commits in Linux. And I saw one that was fixed, but there were no comments in there. And I asked, like, "Hey guys, I thought there was a race here?" And yeah, it was a race, but the problem there was that the race causes failures in strange ways that are just hard to reproduce. And so unless you are truly analyzing at the instruction level what the driver is doing, you would actually not find this conflicting access. And so, yeah, with DRCheck we found a total of nine races. So we decided to -- well, like I said, DataCollider is on Windows, so we looked at each race and tried to see, would DataCollider catch this? So we modeled two forms of DataCollider. One is deterministic. So one thing I didn't mention before was that DataCollider probabilistically stalls threads. So here we're assuming that for every race, DataCollider actually will stall the first access, deterministically. And there, we saw that it only detects two of these nine races. And the reason is that in some of these cases the first access was in an interrupt handler. So in the example I was just talking about, the first access was in an interrupt handler. And so because DataCollider doesn't stall it, it will actually miss that race. Then the other reason was that some of the races actually occurred far apart in time. And so while they might not lead to data corruption, they're near misses that we would like to know about. So for the interrupt handling we decided, okay, let's invert it.
So for a scenario where the first access is in an interrupt handler, why don't we reverse it so that the first access is no longer in an interrupt handler, so that DataCollider could actually stall it. And we see that if we did that -- and this is what I call ideal -- DataCollider catches six out of the nine. And so the remaining three are basically the ones where either both accesses are in interrupt handlers or they are far apart in time. So by checking more of the driver execution, in particular the interrupt context, we are actually able to find more races. >>: If you have a race between two interrupt handlers, aren't they serialized anyway by the interrupt delivery mechanism from the device? >> Olatunji Ruwase: Right, so... >>: How is that a race? >> Olatunji Ruwase: Right. Actually -- no, that's not necessarily true. So it depends. On a particular core the interrupt handler is... >>: [inaudible] same driver on different cores... >> Olatunji Ruwase: On different cores, right. >>: [inaudible] >> Olatunji Ruwase: So, yeah. Two types of -- yeah. >>: But even on the same core, the two interrupts have different priorities... >> Olatunji Ruwase: Priorities. >>: ...or goals, right? >> Olatunji Ruwase: Well, in Linux it's not... >>: [inaudible] serialized. >> Olatunji Ruwase: Yeah, not in Linux. Yeah, not in Linux. So it's really the separate cores. Okay, so this is sort of the balance we're getting at, right: you can either try to be so precise that you detect nothing, or be so imprecise that you get too many false positives. And we think that we are somewhere in the middle, but there is definitely room for improvement, at least in terms of the false positives. So let's see how I'm doing on time. DMACheck detects bugs in how drivers use DMA buffers. So a DMA buffer is basically some part of system memory that's shared by the driver and the device, such that the device can copy data directly off of it, which is good for I/O performance. So a number of issues come up with how DMA buffers are used. One is the fact that they are shared, so we have to avoid the driver and the device racing on it. I don't know if that gets to your question. So the driver should avoid racing on the buffer while the device is using it. The other issue that comes up is the fact that the driver and the device actually access the DMA buffer through different paths. Right? So the driver goes through the cache, while the device uses physical addresses directly. And so this leads to all sorts of coherence issues, and so you have to be careful in terms of how you create your DMA buffers. In particular you should ensure that your DMA buffers are cache aligned. And finally, DMA buffers are system resources. The driver should avoid leaking them or mapping them in inconsistent ways. Because Guardrail gives us an instruction trace of what the driver is doing, we can actually write checking tools to check for violations of these rules. And in fact I did that for Linux drivers and I found about 25 violations of this nature, 7 of which were races between the driver and the device. And there were about 4 cases where the DMA buffer was misaligned. Now what's interesting is that no one has really studied these kinds of violations before, but the fact that we have an instruction trace allows us to check arbitrary rules.
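As a rough illustration of the discipline these rules come from, here is a schematic snippet in the style of the Linux streaming DMA API, annotated with the points a checker can key on. The device pointer and the surrounding function are hypothetical, it is not standalone code (it would only build against kernel headers), and it paraphrases the documented rules rather than quoting any real driver or DMACheck itself.

    /* Schematic use of the Linux streaming DMA API for a receive buffer,
     * annotated with the rules a DMA checker can look for.  Names like
     * my_dev and rx_one_packet are hypothetical. */
    #include <linux/dma-mapping.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    #define RX_BUF_SIZE 2048

    static int rx_one_packet(struct device *my_dev)
    {
        void *buf;
        dma_addr_t handle;

        /* Rule: the buffer handed to the device should not share a cache
         * line with unrelated data; allocating it on its own avoids the
         * misalignment problem (mapping a field in the middle of a live
         * struct is the bad case). */
        buf = kmalloc(RX_BUF_SIZE, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;

        handle = dma_map_single(my_dev, buf, RX_BUF_SIZE, DMA_FROM_DEVICE);
        if (dma_mapping_error(my_dev, handle)) {
            kfree(buf);
            return -EIO;
        }

        /* ... hand 'handle' to the device and let it DMA into the buffer ... */

        /* Rule: between map and unmap the buffer belongs to the device;
         * a CPU read or write of 'buf' here is the driver/device race
         * that gets reported. */

        /* Rule: unmap exactly once, with matching size and direction,
         * before the CPU looks at the data; forgetting this is the
         * leak / inconsistent-mapping case. */
        dma_unmap_single(my_dev, handle, RX_BUF_SIZE, DMA_FROM_DEVICE);

        /* only now is it safe for the driver to read the received data */
        kfree(buf);
        return 0;
    }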
And in fact the way I went about this was that I picked up the Linux documentation about how to use DMA. And I'm like, okay, these are the rules. I can write this up as a checking rule. I have an instruction trace of the driver; you know, I can find these violations. >>: Do you know how long that's been in the base before [inaudible]... >> Olatunji Ruwase: Well... >>: [inaudible] >> Olatunji Ruwase: So, okay. I don't know exactly how long it's been in the base; the Linux kernel that I used for my evaluation is quite old, actually. And so some of the devices might look really old. But the other thing that I found out -- because I contacted the DMA expert on Linux -- is that not all of these bugs will actually cause an actual crash or corruption. And so people have been reluctant to actually fix them. But this tool just shows a new breadth of correctness checking that we can apply to drivers, which is one of the reasons why I went down this direction. >>: How can you know if it's a race without knowing the details of the hardware specification? Because the way [inaudible] works is you write some stuff and then you set a bit. And that bit says, "I'm done with this," [inaudible] the device now. So it's perfectly safe to modify this until that bit is set. >>: So when you're saying race here, this is a race between the hardware and the software? >> Olatunji Ruwase: Yes. So the answer to that -- to be precise, you're right. But what I've taken here is really what the Linux kernel developers recommend in terms of how you use DMA. So there is an API for mapping a DMA buffer to a device. And the idea is that while that mapping is in place, the driver should not touch that piece of memory. And so what I am doing is basically looking for those maps when they happen. And at that point, you're right, maybe the device hasn't started using it, so it's somewhat more conservative than reality, but this is at least what the Linux [inaudible] requires. Now if I had the device logic, I could also encode that and say, well, until the driver writes to this particular control register, the device hasn't started doing DMA. But you actually get to one of the limitations of this work, which is that the analysis cannot directly examine the device state. We're sort of relying on the driver to do this, like poking the device registers and reading them. So you can sort of get into that. But this was just an interesting direction to go in, especially since issues like this don't exist at the user level. Data races exist at the user level; memory [inaudible] exist at the user level. >>: When you [inaudible] these bugs, what was the response from the developer community? >> Olatunji Ruwase: They were like, "They don't cause any crashes." And like I said, this is a very old kernel. So they were not particularly interested. So that's basically the bug detection effectiveness we get by doing lifeguard-style instruction [inaudible] checking in drivers. So this is my design. Now I'll talk about the end-to-end performance, but first let me just show how I built this. For my design, basically I built my prototype where the analysis runs in a separate virtual machine from the driver, and so it's isolated in that sense. This is done in Linux, as I said. I used Xen to build my interposition layer. And that's sort of how this picture looks. And then for my instruction tracing, I rely on the LBA hardware-assisted instruction tracing.
And so this essentially is what the current prototype looks like. All right, and this is all in simulation, because the tracing hardware doesn't exist today. And in particular I just want to point out that this is sort of like your conventional system setup, but I reserve about half a megabyte of memory for holding the instruction trace. And in my simulation, which is actually in [inaudible], I evaluate protecting the network card and the hard drive. The driver VM has two virtual CPUs; the analysis VM has one. This is a dual-core system that I'm simulating. And I'm measuring end-to-end performance: what would it cost to protect my device from any of these kinds of errors in an online fashion? And these are the two drivers that I used. These are stock Linux drivers, so the old Linux drivers. And so this is what the end-to-end performance looks like. Here I'm reporting the performance normalized to Linux. And this is for the disk workload using the Postmark benchmark, which tests different kinds of file operations. And here I'm reporting the transaction rates, the read rates and the write rates. And here, you know, we see the performance is not so bad; in a lot of cases we have less than 10 percent impact, although there are one or two cases where we have like 12 to 15 percent. >>: [inaudible]? >> Olatunji Ruwase: Unfortunately this is not quantified here. I have some backup slides that can get into that. >>: [inaudible]? >> Olatunji Ruwase: Yes. But what about network? So here for the network servers I'm normalizing the throughput that the server delivers to Linux. You know, Linux is one. And I used a range of network applications. And in general the performance seems reasonable for most of these benchmarks, until we get to network streaming, where I'm losing close to 60 percent of the performance. And then the question is, why is network streaming so bad? Well, it's because of the frequency with which the driver performs I/O operations. In particular I'm reporting here the device register accesses, which is how I/O operations are performed, and I'm reporting in particular the reads and the writes. And so one interesting observation is that, well, there are usually more writes than reads in the network space. I mean this is kind of interesting. And this is on a log scale, so each step is an order of magnitude difference. And we can see here that the streaming benchmarks are actually performing an order of magnitude more device writes than the other benchmarks, at least. So this sort of shows the worst case, because what happens there is that this is the point where we have to stall; the lifeguard has to catch up. And so even though this shows the worst case here, the good thing is that this doesn't seem to be prevalent for a lot of workloads. Okay. So that's the end of Guardrail. And I'll just quickly go through some of the future work that I intend to do. I intend to continue in this direction of looking at device driver reliability. And in particular the first thing I would like to do is to see if I can actually protect the operating system kernel as well. But this is going to be quite challenging relative to the interposition layer that I had here, because the kernel-driver crossings happen much more frequently than the driver-I/O crossings. So this is going to be really tricky.
And so maybe speculation might be a good way to go rather than stalling the operations. So you assume that the operation is successful and sort of [inaudible] if it turns out to be unsafe. And then the other challenge here is that the granularity at which the kernel and the driver share memory can be at a sub-page level, right? Because they're in the same address space. Whereas for the device, everything is at a nice page granularity, which makes it easier to, you know, restrict access. But when you want to restrict access to sub-parts of a page, it's going to be really hard. And so I might be looking into things like adapting the memory allocation policies in the kernel so that driver data goes into separate pages from the kernel's. And of course I also want to tackle the virtualization overheads that I'm seeing here with the interposition layer, where I would like to have something like lightweight virtualization, and then asynchronous I/O writes. Now this one's kind of interesting. So I showed that the device register writes were really the big performance killer, and this is because in a virtualized setting the I/O writes become synchronous. They become synchronous, which is bad, especially if your driver does a lot of those. In fact, while the I/O reads see maybe a 5x slowdown, the I/O writes see something like a 30 to 40x slowdown. And that's simply because in the non-virtualized setting, I/O writes are asynchronous anyway. So the fact that you issue a write to a device doesn't mean you wait for a response; your thread can continue executing. But with virtualization everything just becomes synchronous. So I would love to explore the ability to do asynchronous writes, where the guest continues executing while the virtualization layer takes on the responsibility of actually completing the I/O write for it. And so with that, my contributions. Basically I've looked into decoupling as an idea for improving the performance of lifeguards. And I've shown that we can actually contain bugs even though we are delaying the detection of bugs. And this is great because then we can use increasing processor counts to actually improve run-time monitoring in general. I have proposed novel software optimizations for existing lifeguards, so these are existing lifeguards that I did not write myself. And I was able to get good speedups for them, basically by parallelizing them or applying compiler optimizations. And then I've gone into the kernel space and said, okay, what if we could do this sort of detailed correctness checking in the kernel space? And I've shown this is practical by decoupling. Here I've shown a framework called Guardrail which actually safeguards the I/O state from driver bugs. And it does containment using commodity virtualization. I have created three novel lifeguards for finding data races, memory faults and DMA bugs in drivers. And with that, I'll conclude and take questions. >>: I think rather than questions we should thank the speaker, because he's had plenty of questions. [applause]