>> Jim Larus: So it's my pleasure to welcome Tunji who is a graduate student of Todd
Mowry at CMU. He's on the job market obviously. He's done some really interesting
work over the past few years in both the architecture and the programming languages
and the systems community, really brings a very broad, diverse background. And he's
going to be talking about his PhD work today. So, welcome.
>> Olatunji Ruwase: Thanks, Jim. Thanks for the introduction. Good morning, everyone.
Thank you so much for coming for my talk. I'm just going to get straight into the talk. Jim
has done a pretty good introduction. So I guess everyone here is very well familiar with
the challenges that defects in software create for computing systems that we use today,
in particular when those bugs actually make their way onto the systems of end users. And
so this has created a lot of interest in tools that can actually monitor a program while it's
executing in order to identify bugs and mitigate their harmful effects.
In our case we call these sort of tools lifeguards. And in particular what's interesting
about lifeguards is that they actually operate on unmodified binaries. So you can imagine
deploying a lifeguard, for example, to block security attacks at least until the user is
able to, you know, patch the software appropriately. And so lifeguards basically
complement a lot of the efforts that developers make to prevent bugs from getting out into
the wild. So for this reason there have been a lot of proposals for lifeguards for finding
different kinds of bugs ranging from things like data races to memory faults or security
vulnerabilities in programs.
So a little more about how lifeguards work: I'm going to use TaintCheck, which is a
security lifeguard, to illustrate how lifeguards work in general. So TaintCheck is a
lifeguard that tries to prevent input-based security attacks. So in particular
the sort of attacks that would come in, perhaps, through a network packet as we have
here in this piece of code and that attempts to take control of your program. And the way
TaintCheck does this -- So let me just fill out the rest of this code, this vulnerable code.
The way TaintCheck detects these sort of attacks is basically by tagging input that
comes from untrusted sources, like the network as being tainted or untrusted, and then,
sort of tracking how tainted data propagates through the program as the program
executes such that when we get to a point where we're about to make a control flow
operation based on a value that we got from the network then TaintCheck actually steps in
and stops the execution and consequently stops the attack.
What's interesting about lifeguards or what's interesting about TaintCheck in this case is
that it actually maintains metadata about the registers and memory locations which
enables it to track how the taint status of data propagates through the locations.
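To make the propagation rules concrete, here is a minimal C sketch of the kind of shadow state and rules a DIFT lifeguard like TaintCheck maintains; the names, sizes, and one-flag-per-byte layout are illustrative assumptions, not the actual TaintCheck implementation.

    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_REGS 16
    #define SHADOWED_BYTES (1u << 20)          /* shadow a 1 MB region for illustration */

    static uint8_t reg_taint[NUM_REGS];        /* one taint flag per register */
    static uint8_t mem_taint[SHADOWED_BYTES];  /* one taint flag per application byte */

    /* Tag data arriving from an untrusted source (e.g., a network packet) as tainted. */
    static void taint_input(uint32_t addr, size_t len) {
        for (size_t i = 0; i < len; i++)
            mem_taint[addr + i] = 1;
    }

    /* Propagation rule for "mov reg, [addr]": the destination inherits the source's taint. */
    static void propagate_load(int reg, uint32_t addr) {
        reg_taint[reg] = mem_taint[addr];
    }

    /* Check rule for an indirect jump/call through a register: if the target value
     * was derived from tainted input, stop the program before the control transfer. */
    static void check_indirect_jump(int reg) {
        if (reg_taint[reg])
            abort();                           /* raise an alarm and squash the attack */
    }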
So to deploy TaintCheck today or any other lifeguard for that matter, the standard
approach is basically to instrument your binary. And there are a number of platforms or
frameworks available that do dynamic binary instrumentation such as Valgrind or PIN
and so many others. And essentially the way that works is if we just consider this piece
of x86 code here, in particular this move from a memory location into a register, what
would happen is this instruction will be instrumented with the corresponding TaintCheck
rule for propagating the taint status of the memory location into that of the register. And
that applies for the other instructions there as well. We will propagate the taint status
from the source to the destination in general. And what's great about doing this is that
we are actually able to detect bugs early. In particular bug instructions are identified
before they even execute and they can be squashed. But there is a downside to doing this,
which is basically that our program becomes slower. Now I have sort of shown the
TaintCheck code in terms of serial code, but if we sort of dive in and see exactly what's
going on with each one of these rules, it requires a number of instructions basically to
propagate taints. So essentially we are spending a number of instructions to analyze
each instruction in the program. So if I inline all of the necessary TaintCheck code here,
our code suddenly becomes very slow. It becomes very big and consequently very slow.
And so if you then look at your program at a very high level, and let's say we're looking
at a program execution where time goes from top to bottom, essentially what we see is
that the instrumented program becomes a lot slower and a big chunk of that time is
spent actually running lifeguard code, right, which makes everything slower. So this
leads to the question -- Well, in terms of how this plays out I mean here I'm listing some
tools, some lifeguard tools that are out there that are deployed using dynamic binary
instrumentation. And we see slowdowns, average slowdowns ranging from 10x to 30x.
So sort of like an order of magnitude is lost basically by the fact that we're instrumenting
our programs with this very sophisticated analysis [inaudible] lifeguards.
And so the fact that we're coupling them together means our program has to run as
slowly as the lifeguard itself, which leads to the question, well, what if we could decouple the
lifeguard from the program itself. And this is what I mean by that: if we had the notion
that we could run the lifeguard separately from the program. So what happens then?
Sort of the lifeguard itself needs some way to know what to check, so I'm assuming that
we can stream an instruction trace of the program to the lifeguard for the lifeguard to
perform its checking. And I'm assuming here that we could do this efficiently as I will talk
about later.
So one immediate impact is the fact that performance improves. We live in a world
where there are many cores, there are a lot of parallel computing resources available, so
having the program and a lifeguard run on separate resources with little interference just
makes them run faster basically. Right? And in fact we saw this in the initial research
work that I did in this direction called log-based architecture, LBA. We saw a scenario
where the average slow down of TaintCheck dropped from 30x to 3.4x simply by running
TaintCheck on a separate core from the program.
Another benefit of the decoupling, which is what I spent most of my time working on, is that
the lifeguard actually becomes easier to optimize. The lifeguard code can be optimized.
And in particular -- And I'll talk a lot more about this later -- it's much easier or more
effective to parallelize the lifeguard or even apply interesting compiler optimizations on
the lifeguard code basically by decoupling it. And I have some results out here where we
see that the average slowdown for TaintCheck -- this is on top of the improvements we
had before by simply separating them; we could actually get additional improvements.
So by parallelizing we could get the average slowdown of TaintCheck down to about 2x,
and by applying some interesting path-based optimizations on the lifeguard code, we
could actually get the slowdown down to about 2.5x.
So a final benefit of decoupling the lifeguard from the program is that it gives us a lot of
flexibility in terms of ensuring the integrity of the lifeguard. In particular what this means
is that we can actually run the lifeguard in a separate fault domain from the program.
Now when you're monitoring privileged code like device drivers which, you know, can
destroy your system in all sorts of interesting ways this is actually very useful. And this is
a big part of my thesis which I'll talk about later on in terms of monitoring device drivers
and how decoupling helps with that.
So I've sort of talked about all the benefits of decoupling. Of course there are some
downsides to it and probably you've been thinking about that already, which is basically
we lose this guarantee of being able to detect bugs early. In fact, we only detect bugs
after some delay. So as an example here I have, you know, this piece of code here on
the left-hand side and there's a bug there. And we don't detect that until much later when
the lifeguard is able to analyze the trace that comes from that execution.
And so what this means is that containing the side effects of bugs becomes much more
difficult. But this is a challenge that I have tackled in my research in terms of monitoring
applications and also monitoring in the kernel space. And I'll talk more about that later.
So this is sort of the road map for the rest of this talk. I will start out with my early work
that I did in terms of how do we take advantage of the fact that lifeguard is being
decoupled to actually make the lifeguard faster, and I will talk about two ideas that I
worked on. One is parallelizing the lifeguard and then doing interesting path-based
optimizations on the lifeguard. And after that then we'll just switch into the kernel space and
ask what does decoupling buy us in terms of monitoring device driver execution to mitigate
bugs. And there I will describe a framework which I built for my thesis called Guardrail
which actually allows you to sort of protect the I/O state from bugs in the device driver.
I will conclude by basically giving a brief description of what I consider to be my future
research directions and then summarize my contributions.
And so switching to the section where we'll talk about how to optimize lifeguard code
after it has been decoupled. I will start out by first motivating how do you actually
decouple lifeguards in the application space, and here I'll be describing the log-based
architecture project that I worked on much earlier and how that helps to decouple
lifeguards. And then, I'll go into the optimization techniques, which are parallelization and
path-based optimization. And then, I'll come back and talk about some other related
work that has looked into how we make lifeguard code faster.
So the basic idea behind log-based architecture, or LBA, is that since we live in a world
where we have many cores available why don't we run the lifeguard in a separate
processor from the program. And if we do this what we get is improvement in
performance like I showed you before. And so, moreover, since transistor density
is also increasing then perhaps we could dedicate some more gates to like helping us
stream the instruction trace. So we're using hardware to stream the instruction trace
from the application to the lifeguard; that way we avoid the software cost of instruction
tracing.
And so we built this design and basically we reserved some amount of memory to sort of
cache the log or the trace. And so here the lifeguard in this case ends up becoming like
basically a collection of event handlers, because what's happening is we are streaming
the application events, the instructions into it, and the lifeguard is just responding and
invoking the right analysis code to sort of do the right checking.
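As a rough sketch of that structure, a decoupled lifeguard can be pictured as a dispatch loop over log records; the record layout and handler names below are hypothetical, not the actual LBA event format.

    #include <stdint.h>

    typedef enum { EV_LOAD, EV_STORE, EV_ALU, EV_JUMP, EV_SYSCALL } event_kind_t;

    typedef struct {
        event_kind_t kind;
        uint64_t     addr;   /* memory address, if any */
        uint8_t      reg;    /* destination register, if any */
    } log_record_t;

    /* Provided by the logging layer: blocks until the next record is available. */
    log_record_t *log_next(void);

    /* Lifeguard-specific analysis handlers (e.g., the TaintCheck rules sketched earlier). */
    void handle_load(uint64_t addr, uint8_t reg);
    void handle_store(uint64_t addr, uint8_t reg);
    void handle_alu(uint8_t reg);
    void handle_jump(uint8_t reg);
    void handle_syscall(void);   /* the application is stalled here until the lifeguard catches up */

    void lifeguard_main_loop(void) {
        for (;;) {
            log_record_t *ev = log_next();
            switch (ev->kind) {
            case EV_LOAD:    handle_load(ev->addr, ev->reg);  break;
            case EV_STORE:   handle_store(ev->addr, ev->reg); break;
            case EV_ALU:     handle_alu(ev->reg);             break;
            case EV_JUMP:    handle_jump(ev->reg);            break;
            case EV_SYSCALL: handle_syscall();                break;
            }
        }
    }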
Now the containment guarantee that we make in this design is basically that even
though it might take us a while to detect a bug, the bug will actually not propagate
outside the process context of the application. And we're able to achieve this in a couple
of ways. One is the lifeguard itself runs as a separate process so it's sort of isolated from
the application. And two is that we actually stall the program when it tries to interact with
the kernel; so at that system call boundary, we stall it there until the lifeguard can actually
catch up. And we also stall it, of course, when a buffer which we use for the instruction
trace fills up.
And so with this design we were able to achieve some significant improvements. I have
shown the TaintCheck numbers before. But, for example, a tool like AddressCheck
which basically checks -- Yes?
>>: [inaudible] the application because lifeguard is too slow...
>> Olatunji Ruwase: So that shows up basically as the slowdown numbers. Yeah. So
the throttling happens because we have a fixed-size buffer, and so once that fills up then
it indicates how much faster the program is than the lifeguard. And so that shows...
>>: And how does the signaling happen?
>> Olatunji Ruwase: How does the signaling happen? So we do that in operating
systems, so we have some operating system support here. Yeah. Okay. So
AddressCheck is a tool that basically checks for accesses to unallocated memory, and we
saw like an improvement: we reduced the slowdown relative to binary instrumentation from
19x to 3.2x. Lockset looks for data races and we were able to reduce the slowdown
from a 30x slowdown to like about a 4.3x slowdown. And MemCheck probably had the
least improvement, but we still had like close to a 3x improvement in this case. And so
this is baseline for the rest of the portion of this talk where I'm going to talk about how
then do we, you know, tackle this -- So the slowdowns you're seeing here, as the question
alluded to, is really the overhead of the lifeguard computation itself, the fact that we're
using multiple instructions to analyze each instruction in the program. And so the
question is how can we make this better?
And so the first idea I pushed was, "Well, can't we parallelize the lifeguard?" Right? And
so let's sort of see why this would make sense. So imagine we have an application here and
we're just looking at time, which runs from top to bottom, and we're looking at the instructions
here. In the instrumented case what ends up happening is we mix up, you know, the
analysis. We interweave the analysis with the code and this is really hard to parallelize
because it sort of requires the original code to be also parallelizable. Otherwise, you
can't even parallelize here.
But decoupling helps us because now if we decouple the analysis, the lifeguard, from
the program then the question is can the analysis itself be parallelizable? That's the
problem we have. And we see that this is possible because the lifeguard is running
behind the application. It means that the application is creating streams of events to be
processed. And so there's no reason why we can't just fire off another lifeguard to
process the parts of the stream further in the future which the current lifeguard can't get to right now.
And this way we can get improvements.
>>: Is that automatic or does that require [inaudible] support on the actual lifeguard? So
do I have to have language support as someone who writes a lifeguard in, for instance,
Valgrind?
>> Olatunji Ruwase: We -- Let's see. So the way this works is, yes, we have some
minimal API that you need to use to specify that you are interested. In particular, you need to
specify how much parallelism you want, because that sort of goes into it. And it only makes
sense if you have as many cores. So, yeah, there is some minimal API that you need to
change.
>>: It's not even necessarily easy even if you have the API. I mean if you're trying to do
taint tracking.
>> Olatunji Ruwase: I am going to get to that in a second.
>>: Okay.
[laughing]
>> Olatunji Ruwase: Yes, as the questioner suggested, there are serial dependencies
on what the lifeguard is doing. So in particular TaintCheck is a perfect example for this
like I showed before. In order to detect a security attack, TaintCheck needs to track the
propagation of data. And this is inherently sequential. In fact TaintCheck falls into this
class of lifeguards that do this kind of propagation called dynamic information flow
tracking, or DIFT for short. And so MemCheck, which detects unsafe uses of uninitialized
memory or uninitialized data, also has the same behavior. They are very, very sequential;
the computation they do is very sequential. So how do you handle this?
And so I came up with this algorithm for parallelizing DIFTs which essentially works as
follows. So here let's say we have this stream of application events that we want to
process. What we first do is we break this up into segments and then work on them in
parallel. In parallel we run an algorithm called symbolic inheritance tracking.
Now the basic idea here is that rather than trying to track the propagation concretely, we
can symbolically track how taint is inherited. And I'll show some more details about that
later. But it turns out that you can do this in parallel. Now when you're done with that we
can then process each of the segments sequentially, resolving -- because essentially,
because of the dependencies, we will still have some unresolved dependencies
in there, and the resolving of these we call inheritance resolution. That's sort of the next
step, where we then sequentially go through each segment and resolve all of this
inheritance.
Now what's interesting about this is that we can actually perform it in parallel in terms of
the locations. So for individual locations we can resolve the inheritance in parallel, and then
for each segment we have to sort of do that sequentially. And so with this design we
came up with an algorithm that basically has this nice property that it has
asymptotically linear speedup. In practice the constant factors really prevent our
implementation from actually having this behavior, however.
So what do I mean by symbolic inheritance tracking? The idea here is that we are working in
parallel; we're in the first phase where we are trying to propagate the inheritance here.
And so let's start with the very first instruction where we're copying from memory location
Mx into R1. Well, what is the taint status of Mx? We don't know. Well, that's fine. All we
record there is the fact that, well, the taint status of R1 now is whatever the taint status
of Mx was in the previous segment. Okay, and so that's the inheritance, so we can then
sort of propagate this along as the program executes.
And so we end up in this scenario where either we are able to resolve the taint status for
sure or we're able to like create an inheritance saying, "It comes from the previous
segment." So what we've essentially done is we've been able to collapse the
propagation chain which would have been a difficulty for us.
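A minimal sketch of what the symbolic state might look like, assuming one entry per location and ignoring binary operations for simplicity (the struct and function names are illustrative, not the actual implementation):

    #include <stdbool.h>
    #include <stdint.h>

    /* Within a segment, a location's taint is either a value resolved inside the
     * segment, or a symbolic link: "whatever this other location's taint was at
     * the end of the previous segment". */
    typedef struct {
        bool     known;         /* resolved inside this segment? */
        bool     taint;         /* valid only if known */
        uint64_t inherit_from;  /* valid only if !known: inherit that location's prior taint */
    } sym_taint_t;

    /* At segment start every location L is initialized to {known=false, inherit_from=L}. */

    /* "mov R1, [Mx]": R1 either gets Mx's concrete taint or its collapsed inheritance
     * link, so propagation chains never grow across the segment. */
    static void sym_copy(sym_taint_t *dst, const sym_taint_t *src) {
        *dst = *src;
    }

    /* Data read from an untrusted source is concretely tainted, regardless of history. */
    static void sym_taint_input(sym_taint_t *dst) {
        dst->known = true;
        dst->taint = true;
    }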
>>: Can you go back to the previous slide?
>> Olatunji Ruwase: Sure.
>>: So one thing that sort of occurs to me with this is that it looks a lot like carry
propagation [inaudible]...
>> Olatunji Ruwase: Yes.
>>: There are schemes that involve lots of hardware to do carry propagation very fast.
Is there an equivalent for this?
>> Olatunji Ruwase: I haven't really thought about that. But I mean the -- The equivalent
for this? I'm not sure. I think there are some slight differences there because I mean the
idea is that we may not really know what the true answer is until we get to the next
phase. And so we sort of have to have this sort of not "don't care" but this generic
answer like we don't know what it is. So I don't know if the carry propagation hardware
actually does that.
>>: I think it does. When you add two [inaudible] together you don't know whether it's
zero or one...
>> Olatunji Ruwase: Or one, okay.
>>: ...so you have to carry them.
>> Olatunji Ruwase: Then in that case I think this is quite similar.
>>: Yeah.
>> Olatunji Ruwase: I think it's quite similar actually. Yeah, I wasn't aware of that. Yes?
>>: So you'll have to excuse me, I'm a little bit naïve on how you do TaintCheck. But my
assumption is that you do a load and then if you do operations on that load, to carry taint
you just "or..."
>> Olatunji Ruwase: Yes.
>>: ...an "or" across all of the values that you've loaded.
>> Olatunji Ruwase: No. I'm sorry, can you rephrase that...?
>>: So if it is set at being tainted.
>> Olatunji Ruwase: Yes.
>>: Then I just "or" the results of when I apply operations on that load, something that's
derived from that load. Right?
>> Olatunji Ruwase: Right, right.
>>: So isn't "or" an associate of operation?
>> Olatunji Ruwase: It is an associate of operation.
>>: So then isn't this -- Even though you have a dependence chain, isn't that a parallel
operation by definition? So if I want to add up a bunch of numbers, because addition is
associative, I can do this?
>> Olatunji Ruwase: Right. But if you don't know -- Well, okay. I see. So that's assuming
you know one of the values. Right? Like you know it was tainted. But if you don't know,
you can make the wrong conclusions. So it is an associative operation but it still depends on
the input values. So if it's all untainted the "or" wouldn't change it. But if at least one
of the input values is tainted then the final answer is tainted, at least that's the rule. So
maybe I should've clarified that. So if at least one of the values is tainted then the whole
computation is tainted; that's the challenge.
>>: I think what you said is right for each location, right, but the problem seems to me
that you have the instructions not the [inaudible].
>>: But if you have the...
>>: The representations.
>>: You only need to do an "or" when that value surfaces, I guess is the point. So there
is some parallelism in the fact that you can stall until you actually -- and do what you
have.
>> Olatunji Ruwase: Oh, okay. I see. Okay.
>>: And then apply the "or."
>>: That's very fine grain.
>>: Yeah, yeah, yeah.
>> Olatunji Ruwase: Yeah. Right. Yeah, so you sort of have to carry all the values
along, is that what you're suggesting? So, yeah, we actually had some sort of similar
problem. The problem is that the amount of [inaudible] that you need to carry that along
is huge and it actually just kills your performance completely.
All right, so here we are. So here we've been able to collapse the propagation chain and
we're at the point where we know that the only taint values that we cannot resolve here
are the ones that are dependent on the previous segments. And so we then move into
the inheritance resolution stage and, like I said, we do that sequentially. So in other
words we first resolve all the inheritances in segment J minus 1, and we have this nice property that
then they can be done in parallel. I'll show you that in a second. Now we are ready to
resolve the taints for the values in segment J. And we can also do that in parallel
because they all depend on one sort of predecessor.
Your question that you asked sort of alludes to the issue of there might be binary
dependencies here as well. Here we assume for a particular lifeguard it's possible to
have like a simplifying operation. So for TaintCheck, for example, when you combine -- I'm sorry, I can rephrase this. For MemCheck, for example, if you have at least one
initialized value, it doesn't matter what the other value is. And so for TaintCheck as well if
you have one taint value it doesn't matter whether the other value is untainted. So you
can actually always collapse it easily, so we get this nice property that we don't need to
worry about binary operations.
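Continuing the sketch above, the resolution step for a segment then just fills in each unresolved entry from the now-final taint of the previous segment; the loop over locations is what can run in parallel (again, a simplified illustration rather than the real code):

    /* Resolve segment j once segment j-1 is final.  prev_taint[loc] is the concrete
     * taint of each location at the end of segment j-1.  This loop can be split
     * across workers because the entries are independent of one another. */
    static void resolve_segment(sym_taint_t *seg, const uint8_t *prev_taint, size_t nlocs) {
        for (size_t loc = 0; loc < nlocs; loc++) {
            if (!seg[loc].known) {
                seg[loc].taint = prev_taint[seg[loc].inherit_from] != 0;
                seg[loc].known = true;
            }
        }
    }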
And so once we resolve this then we can go ahead and sort of resolve the inheritance
for the next segment. And so in terms of the implementation, I basically built parallel
TaintCheck in this way, where I have a bunch of parallel workers but my inheritance
resolution is done sequentially. So I have a single thread doing the inheritance
resolution, and so we sort of call them the parallel workers and then we have a master,
essentially. You have to sort of compare this to sequential TaintCheck and assume that
the master is equivalent to one sequential TaintCheck and then you give it a bunch of
parallel workers to sort of help it improve performance.
And so what does this look like in terms of performance? These are some specific
benchmarks. And here I am showing the speedup over LBA, so over the sequential
TaintCheck running on LBA. And we saw some pretty decent improvements, and this is with
eight workers so the speedup is not quite linear. But we did see some interesting
improvements and we saw in some cases up to 3.4x. On average we saw about a 1.7x
improvement over the sequential approach using eight workers. And this sort of shows
that there are opportunities to improve performance by parallelizing. We might even get
some more if I had parallelized the master, for example.
Yes?
>>: So what's the slowdown of TaintCheck with LBA?
>> Olatunji Ruwase: With LBA it's about 1.9x. I showed that a bit earlier. Okay, so yes?
>>: [inaudible] show that one processor is both a worker and a master?
>> Olatunji Ruwase: And this question?
>>: The leftmost...
>> Olatunji Ruwase: Oh, so you mean this leftmost one? Oh sorry, actually that's not
what's going on here. So these are the workers, like this symbol indicates a worker. And
the only thing I'm doing here is the fact that there are four segments and then the
master, which is this symbol, works on each of the four segments.
So maybe this figure is confusing. It's not really like the same processor.
>>: Okay.
>> Olatunji Ruwase: I mean it could be but it doesn't have to be.
>>: Do you know if the master ends up being the bottleneck here?
>> Olatunji Ruwase: No it's not actually. It actually turns out not to be. The bottleneck
here really is that this inheritance tracking doesn't shrink -- like the segments that come
out of it are still quite big. And so I mean we could either try to like break down segments
even more but it turns out not to help so much.
>>: How big is big?
>> Olatunji Ruwase: Big is like half of what I give, so it only shrinks about half. So like I
give it a segment and it only shrinks about half. And then the other thing...
>>: The number of briefly exposed...
>> Olatunji Ruwase: Right.
>>: ...values is half what's...
>> Olatunji Ruwase: What's -- Yes. Yes, [inaudible] eliminates half.
>>: So how big is the segment, though, in terms of instructions?
>> Olatunji Ruwase: I think that this was probably like 16 K; these results for 16 K.
>>: 16 K?
>> Olatunji Ruwase: Yeah, 16,000 instructions. Probably the other interesting thing,
which I sort of left out, is that the inheritance resolution is about twice as slow as just
propagation. So this guy is, you know, half the speed of the sequential TaintCheck. So
that's the cost you get for symbolically tracking inheritance rather than actually tracking
propagation.
Any more questions? Okay. All right, and so what if you don't have many cores to
commit to making your lifeguard faster? What else can you do? Well, one approach is,
"Well, why don't you do path-based optimizations?" And you might ask, "Why pathbased optimizations for lifeguards?" Well, because the compiler community has shown
for so many years that, you know, in most programs the execution time is dominated by
hot paths. Now if we consider what happens when we instrument those hot paths we
see a number of results. And so here I have instrumentation of the hot paths. The cold
bars are the analysis code; I've used different colors in this case so hopefully it
delineates the different instrumentations. While, the black lines are basically the original
code so we have something like this, like these basic blocks working up to something
like this where we still have a lot of analysis code in there. So if we consider one such
hot path that's been instrumented, we can conclude a number of things.
One is that the lifeguard code has actually spent most of its time analyzing the hot paths.
Right? So the overhead of the lifeguard mostly comes from the hot paths. It also means
that if we could somehow optimize how the lifeguard analyzes these hot paths which run
frequently then, you know, we could improve performance overall. But the challenge to
doing this with instrumentation is that this requires you to analyze across two different
kinds of code. There's the program code and there's the analysis code, and there's some
context-switching code in between them. It is just really, really complicated to find
redundancies here.
And this is where decoupling helps us. So by decoupling the analysis, the lifeguard from
the program we can just focus on the lifeguard code itself to see if we can improve it by
analyzing along paths. And so here I have basically the lifeguard code that is the correct
analysis code for this particular program path. And we call that sort of unoptimized
because we sort of just pull them out and sort of execute them in order, and so it's
unoptimized.
Now even with a standard compiler you could imagine that if you in-lined all of this code
together, a standard compiler could probably find opportunities to, you know, constant
propagate or maybe save some saves and restores here. So we could get some
improvements there. But the real big win comes from the fact that, you know, the
lifeguards have a particular way to compute, right, so TaintCheck propagates. And so if
we could exploit the domain knowledge of these lifeguards, we could actually get even
bigger optimizations. And one example which I'll talk about today is how the lifeguards
access their metadata. So the metadata is basically the taint values, and how they access
it.
And you could also eliminate things like redundant checks. You can find redundant
checks because you know what the lifeguard is doing. And as an example: here is the
path handler for a particular path from the benchmark mcf.
We start out with a path handler that's the taint analysis handler that [inaudible] x86
instructions. Now when we sort of compose the path handler together and give it to a
standard compiler to in-line and also do constant propagation, we're able to like eliminate 5
percent of the instructions. But when we apply TaintCheck-specific optimizations, we're
actually able to [inaudible] about 45 percent of the instructions [inaudible].
So I talked about how metadata access is a big source of overhead in lifeguards. So let's
consider TaintCheck for example. And I've shown this before where, you know, to
propagate the taint of a memory location into that of a register it requires like [inaudible]
x86 instructions.
But what's interesting is that 6 of these instructions are all to derive the metadata. And
the reason for this is because most lifeguards use two-level tables to shadow the whole
memory. This is like probably the most flexible way of doing this. So you have 6
instructions that you have to execute each time just to obtain the metadata. So the
question is, what can we do to improve this?
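For reference, here is what a typical two-level shadow-memory lookup looks like in C; the split sizes and names are illustrative, but compiled down this is roughly where the handful of extra x86 instructions per metadata access come from.

    #include <stdint.h>

    #define L2_BITS  16
    #define L2_SIZE  (1u << L2_BITS)

    typedef struct { uint8_t taint[L2_SIZE]; } shadow_chunk_t;

    /* First-level directory covering a 32-bit address space (on-demand allocation
     * of second-level chunks is omitted for brevity). */
    static shadow_chunk_t *shadow_dir[1u << (32 - L2_BITS)];

    static uint8_t *shadow_addr(uint32_t app_addr) {
        shadow_chunk_t *chunk = shadow_dir[app_addr >> L2_BITS];  /* first-level lookup */
        return &chunk->taint[app_addr & (L2_SIZE - 1)];           /* second-level index */
    }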
Just looking at the lifeguard code itself -- I've represented the memory locations with A,
B, C, and D -- it's very hard to say whether there are any similarities in this because,
remember, all of these values are just streaming in from the log to the analysis. And it's
hard for us to tell whether there is any redundancy here. I mean, are they similar? Are there
optimization opportunities here? But instead what if we looked at the application path
itself? And if we do that we see something very interesting.
So here I've circled all the corresponding memory accesses in the path in the mcf
program.
And what we can see here, well, is like the first three are exactly the same address
because, you know, the register doesn't change. So it's exactly the same address. So,
you know, A, B, and C are actually the same as A. Right? And then what's also interesting is that even
though the last one is not the same address, it's actually close enough. It's like 64 bytes
away from A, which basically means there's a high probability that it falls into the
same second-level table. So just like with a page table, we could leverage some of the work
that we've done for accessing A; we could use it to also access the other location D much
faster.
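Building on the two-level lookup sketched above, the path-specific handler can hoist the first-level lookup: A, B, and C share one shadow entry, and D reuses the same second-level chunk when it falls inside it, falling back to the full walk otherwise (variable names here are illustrative).

    static void path_handler_optimized(uint32_t addr_A, uint32_t addr_D) {
        shadow_chunk_t *chunk = shadow_dir[addr_A >> L2_BITS];    /* one first-level lookup */
        uint8_t *tA = &chunk->taint[addr_A & (L2_SIZE - 1)];      /* shared by A, B, and C */

        uint8_t *tD;
        if ((addr_D >> L2_BITS) == (addr_A >> L2_BITS))
            tD = &chunk->taint[addr_D & (L2_SIZE - 1)];           /* same chunk: reuse it */
        else
            tD = shadow_addr(addr_D);                             /* rare case: full walk */

        /* ... propagate taint through tA (three accesses) and tD ... */
        (void)tA; (void)tD;
    }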
Now what was interesting, at least cool to me, when I did this was like, "Wow, I want to optimize
my lifeguard," but I couldn't find anything in my lifeguard code to help me. I had to go
and look into another program, which is basically the application I'm monitoring. And
there I was able to find, you know, useful information that enabled me to
implement this optimization. And so for me it was really cool being able to like optimize a
program by looking at another program.
And so these basically were the results that we had, that I got for this. And so here I'm
showing average speedup over LBA on the Y-axis in percentages. And I'm showing two
results. One is if you give a path [inaudible] to a standard compiler and it can perform in-lining and constant propagation, what kind of improvements would you get; and then, when you
implement lifeguard-specific optimizations like the kind I've described. And here we see
that, well, the standard compiler does pretty well in some cases, like GCC could get a 21
percent improvement over LBA, but in some cases it's bad, like Lockset. And the
problem here is the fact that when we in-line all the handlers, we basically increase the
code size and that hurts. But the cool thing is that with lifeguard-specific optimizations
we always do better, and sometimes quite significantly. Like we have a 61 percent
improvement here for AddressCheck. And so by leveraging the domain knowledge of
what the lifeguard is actually doing, you can actually improve the lifeguard significantly
beyond what standard compiler tricks give you.
>>: So [inaudible]. Suppose I ran a profiler, then tried to figure out basic blocks and hot
functions, and then applied basically in-lining and reorganization of basic blocks based on
that profile information. Do you think GCC would then have been able to pick up a lot of
the optimization opportunities that the lifeguard...
>> Olatunji Ruwase: That's exactly what I did, actually.
>>: Okay.
>> Olatunji Ruwase: That's -- Yeah.
>>: [inaudible]
>> Olatunji Ruwase: That's exactly -- Yeah. So I figured out the hot paths and I
presented it both to GCC and to my new compiler, and these are the results. Yeah,
yeah.
Okay. All right, so we get improvements this way. All right, so in terms of related
work on how we make lifeguard code faster, we can sort of categorize it based
on whether it was targeted at coupled lifeguards or decoupled lifeguards. So for the
coupled lifeguards, Qin basically accelerates TaintCheck based on the observation
that most locations have a stable taint value. So they're either always tainted or always
untainted; they never really fluctuate. And so they were able to use that to do less taint
propagation. Dalton basically proposed doing TaintCheck propagation in hardware
and that helps. And more recently Bosman, they've come up with some interesting
techniques on machines, like Itanium, that have many registers,
basically partitioning your registers between taint checking and your actual program. And
they've got some improvements there. And then, they had some interesting but custom
metadata organization that probably only works for TaintCheck.
And in the decoupled space, well, we had some hardware acceleration with the ISCA
and Vlachos work. The Nightingale
work actually is kind of interesting. So they parallelized TaintCheck as well and basically
they found some good speedups. But one of the results they had there, which I
didn't talk about today, is the fact that you need at least four parallel workers before you
can actually match the speed of the sequential version, simply because of the constant factors. And
Corliss is basically a hardware extension in the commit phase where the analysis is
performed at commit time using special hardware.
So at this point we're going to sort of switch gears and sort of dive into the next part of
my talk which is where I'll talk about using lifeguards in the kernel space, basically the
Guardrail work.
So the outline for this is first I'm going to motivate why we considered device drivers as
the target application for lifeguards in the kernel space. And then, I'll give an overview of
Guardrail and what sort of design goals we had in mind. I'll sort of look at some related
work in that space. And then, I'll talk about how do we do containment? So we're going
to be detecting bugs after the fact, so how do we do containment.
And then I'll present some -- I wrote three new driver lifeguards. I'm going to present
their bug detection effectiveness and then their end-to-end performance. So why
drivers? I guess probably the wrong community to be asking this question. [laughing]
Well, they are an important part of our software today, right? They allow us to use our
hardware devices that we love, but unfortunately they're a major source of bugs. And,
you know, one particular operating system reported that 85 percent of their crashes were
due to drivers. And then the other thing that's actually interesting about them in terms of
runtime monitoring, is the fact that drivers are actually very sensitive to perturbations
which is something that you don't see in the application space.
So this is because a lot of their computation is timing sensitive, so responding to
interrupt is timing sensitive. If you stall it significantly enough you actually break the
driver. And so for this reason this has been the key limitation to applying sophisticated
analysis to drivers, because if you slow them down enough, you just introduce a new set of
reliability issues, right, besides the ones you're looking for.
So what does Guardrail do? So Guardrail is a system that I built for using decoupled
analysis to actually mitigate driver bugs. In particular I'm looking to protect the I/O state,
the persistent I/O state, from driver bugs. And the way Guardrail works is basically it
streams an instruction trace to the lifeguard and then, it interposes on the interface
between the driver and the device such that I/O operations are not allowed to pass
through this until they've been verified by the lifeguard. And so there the lifeguard can
run any arbitrary sophisticated analysis that you can think of.
And so with this we had like a number of design goals which Guardrail meets. So the
first is we wanted containment, right, so we wanted to protect the external world, starting
from the device, from bugs in the driver. We have the unfortunate effect that the
operating system kernel is still sort of vulnerable to driver bugs. We're able...
>>: Is that because the lifeguard runs after the instruction has already clobbered your
kernel?
>> Olatunji Ruwase: Yes. Because we don't have an interposition layer there.
>>: Are you going to tell us how you implement interposition there on the outside
because I'm not sure how you did that one either?
>> Olatunji Ruwase: Oh, you mean on the kernel side?
>>: No, between the driver and the device...?
>> Olatunji Ruwase: Sure.
>>: Okay.
>> Olatunji Ruwase: Yeah. And also the fact that we run the lifeguard in a separate
protection domain, which I talked about earlier, means that even if the driver hoses the
system, our lifeguard is still able to at least tell us what went wrong.
But in reality here we -- Yeah, if the driver hoses the kernel, the lifeguard can still tell us
exactly what went wrong.
>>: Do you also [inaudible] like DMA?
>> Olatunji Ruwase: Yes. In an interesting way I'll get to that too in a second. Right.
Right. So the other sort of design goal we had was generality. So drivers suffer from a
variety of bugs, buffer overflows, data races, all sorts that you can think of, security
vulnerabilities. And so we wanted a system that allowed lifeguard writers to write
arbitrary analysis, and we sort of hit that by making the lifeguard sort of be a separate
component that you can just implement your new analysis in it.
And so in particular -- And also our interposition layer is actually transparent; it doesn't
require changes to either the driver or the device and so it allows for generality. It allows
us to support arbitrary combinations of drivers and devices. And in particular in this work
I looked at about 11 Linux drivers from four different classes, so audio, video, network
and disk.
And also I implemented three new lifeguards for finding concurrency errors, operating system
protocol errors such as DMA misuse, and memory safety errors. Yes?
>>: [inaudible] specify prevent [inaudible] corruption of I/O state [inaudible]?
>> Olatunji Ruwase: Sorry, I couldn't understand.
>>: So do you prevent corruption of the I/O state or just the corruption of the device?
>> Olatunji Ruwase: Of the device, yeah. So here we're just safeguarding the I/O state from the
driver bugs. Yeah, so the kernel is vulnerable to bugs in the driver [inaudible] Guardrail.
>>: Do you actually run the lifeguard completely offline?
>> Olatunji Ruwase: This is online. This is all online.
>>: How much buffering do you have to do for this instruction trace? Because that can
explode very fast, right?
>> Olatunji Ruwase: It doesn't matter. It really doesn't explode that fast because when you think
about the amount of time when drivers execute, it's only for brief periods within your
overall execution time. And so drivers are not at all like applications, which can be CPU
intensive and run all the time; drivers only run intermittently. And they might run for a
burst but then, you know, they go off and do some I/O and they don't run for a while.
So detection fidelity is essentially that some bugs just require instruction-level analysis.
And so we want to support that. And we do that by basically providing the lifeguard with
an instruction trace. And finally trustworthiness. And the idea here is that, well, ultimately
we have to make some changes to the trusted computing base and we wanted that to be
minimal. And in particular the I/O interposition is the only layer that really needs to be in
the trusted computing base. And by ensuring that the only thing that it does is basically
intercept the interaction between the driver and the device, it doesn't really do any sort of
computation to detect errors. That's actually decoupled off to the lifeguard so that way
it's at least conceptually very simple. And so that helps for the trustworthiness of the TCB.
Now in terms of existing work in this direction, in terms of driver reliability there's been a lot
of work. I think we can categorize existing dynamic approaches along two different axes.
So the first axis, which are the columns here, is exactly when does the dynamic
approach perform its correctness check? Right? So you have the coupled approach,
where you perform the checking before every driver instruction, or the decoupled approach, which
is sort of like my work, Guardrail-style, where you do the checking some time later.
And on the other dimension is how much of the driver execution is checked by the
dynamic analysis? So you could either check all of the driver instructions or there are a
lot of techniques that check only the API interaction between the driver and either the
kernel or the device.
Now these design points matter because they impact things like performance. So one
way to improve performance is basically to check just the API calls. Right? You ignore
like the bulk of the driver execution. Another way to [inaudible] the performance is to
decouple it, so run the checks sort of asynchronously. But of course this all comes at a
price, right? So if you check only the driver API then you have poor fidelity in the sense
that you miss a lot of bugs that occur within the driver itself. And of course if you
decouple then you have this containment challenge which I'll get to a little later. And of
course the other properties sort of fall in place.
And so in terms of our existing approaches mapped in, we have this sort of picture. And
so perhaps the most immediate one to observe is that there is only one prior work that
does like decoupled checking, and that's Aftersight. And Aftersight provides like no
containment at all so it doesn't protect anything, let's just put it that way, because it
detects the errors after the fact. It has no containment.
And so in terms of the coupled approaches which is the majority of the approaches, we
see that there is a lot of work in checking only the API to improve performance. So it's
where people have sort of optimized for it so they want good performance and so they
only check the API. And so of course the problem with that is that, you know, you miss
some actual bugs.
So Guardrail sort of sits here in the decoupled and checking all the driver instruction
space. And the downside there is the containment, and I'll describe how I addressed
that, how I solved that.
>>: So can you say what you mean when you put like software fault isolation in the
upper half of this? I mean I usually think of that as checking all instructions that like run?
>> Olatunji Ruwase: So software fault isolation is basically almost like sandboxing. And so
the idea is you're fine with the module that you're checking corrupting its own internal
data structures, but you're just saying it will not get out. The problem there is that, well,
while you could prevent things like writes that would directly corrupt the external world,
there's a question of whether your module itself is behaving correctly at a fundamental
level.
I mean it could constantly overflow its own buffer or corrupt its data. Is it even doing the
right thing? It might not be able to corrupt the kernel but is it even providing the
functionality that you want?
>>: So you're worried that it's going to corrupt the device and decide the driver
[inaudible] are happening?
>> Olatunji Ruwase: Yeah. [inaudible] exactly. It's a good point. So even like a lot of the
fault isolation has really been about the interface between the driver and the kernel.
Really the interface between the driver and the device is something that -- those actually
fall into that as well. But even, is my driver behaving correctly? Even if it's not crashing,
is it displaying the right images? It may display the wrong images without actually
crashing the kernel, and that's not what you want.
>>: I guess the followup question is how many of those crashes that you mentioned are
actually because of drivers [inaudible]?
>> Olatunji Ruwase: Well, how many of which crashes? Well, I mean I didn't mention
any particular crashes but I can answer your question. And the answer there is that yes,
the challenge is that while -- [inaudible]. So crashes that come from the driver corrupting
the kernel are well studied and there are a lot of numbers there. Unfortunately for crashes that
come from the driver corrupting the device, there are not that many numbers on them. But
there have been some really interesting cases. So for example a few years ago there
was this E1000 network card that was basically being hosed by a driver. And, you know,
your network card literally just dies. And the challenge there is if your driver actually
hoses your disk, for example, it could cause some other kinds of failures that are just
impossible to trace.
So it's really hard to actually quantify how bad this is. So I mean imagine that the driver
actually corrupted -- So you have virtual memory, right, so you're storing stuff on disk.
And you think your virtual memory is working well. What if the driver just scribbled over
all of your virtual memory metadata? Your operating system starts failing in all sorts of
arbitrary ways and it's very hard to sort of quantify this. I think this is part of the challenge
why, you know, there hasn't been a really good study in terms of these kind of failures
because they tend to be very catastrophic.
Any questions? Okay. And so the other dimension to look at in dynamic approaches is
what system components are they protecting. And like I said most of the work has really
been in terms of protecting the operating system kernel. It was only recently that people
started looking at protecting the I/O device state, and so Guardrail sort of falls into that
space. And that was like an interesting area to go after. The other space is really
congested. There has been a lot of good work and interesting results from that direction.
So I just decided to pursue this space that is fairly new.
>>: Can you talk a little bit about Nexus-RVM?
>> Olatunji Ruwase: Oh, sure.
>>: It protects both, right? [inaudible]...
>> Olatunji Ruwase: It protects both, yes. Yes. So Nexus-RVM protects the kernel by
basically running the drivers in user space. Okay, so that has [inaudible] protection. And
it protects the device by basically requiring the hardware vendor to basically write a reference
validator which indicates how the drivers should interact with the device. And so by trapping -- so now that the driver is in user space, you can essentially trap all of its -- well, you
basically change its API such that all of its interaction with the device actually turns into
system calls. And so you can interpose on that and then check that against the model.
And then, the reference model sits in the kernel and then verifies basically that the driver
is, you know, interacting with the device in the right way.
>>: The reference model has to be written manually.
>> Olatunji Ruwase: Yeah, it has to be written manually. It's very hard. It's specific to
that particular device and it's not clear that it could be written in an open source fashion
because for devices that are proprietary it reveals probably way too much about the
[inaudible] that go into the device itself. Sorry?
>>: So performance?
>> Olatunji Ruwase: Yeah. Yeah. But it was really the first work to look into this, into
dynamic -- And they also made this interesting observation in that work where they said,
like, you know, "When we banged on our device hard enough without the RVM, our
device stopped functioning." So they sent like all sorts of random data inputs to the
device and they actually killed the network card as well. [inaudible] But Guardrail sort of
falls into this space.
All right, so containment which I guess has a bunch of interest here. So how do we
contain bugs if we detect them after the fact? How do we prevent them from corrupting the
device? So let me just rephrase the bug containment challenge. Here we have driver
code. It's running. It's hit a bug. It's gone past the bug, so time is going from top to
bottom. The analysis is still somewhere back, still checking some of its history. And the
driver is going to do a disk write at some point. And so the question is, okay, is that bug
going to cause terrible things when this disk write happens? What can we do to prevent
this from happening?
So in other words we have this challenge that we need to preserve I/O integrity even
though we're detecting bugs after the fact. And so I looked at it and said, well, you know,
this is because the driver talks directly to the device. So why don't we just put something
in between them? And virtualization seemed like a really good way to go, so I built
something around a virtual machine monitor where I added an interposition layer here.
And so with Guardrail -- So this is how I/O operations get executed. So first the driver
issues an operation and we intercept it, okay, in the interposition layer. And then, we
send a request to the lifeguard, which is analyzing. And it'll say like, oh, is it okay to let
this access go through? Have you caught up? Eventually, let's say it's a good case, it says,
yeah, it's okay, you can go ahead and do it.
And then, the interposition layer then replays the access and -- I don't know. Should I
take a question? Okay, very good. It replays the access. Now what's cool about this is
that this is transparent to the driver and the device. So the driver doesn't know that this
interception occurs, and the device doesn't know that the access is coming from the
virtualization layer and not from the driver. And in particular we do the
interception using traps, and we complete the operation using emulation, which presents
some performance challenges.
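Conceptually, the trap handler in the interposition layer looks something like the following sketch; all of the function names here are hypothetical placeholders, not Xen's actual interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    /* Provided elsewhere in this sketch:
     * lifeguard_verify_up_to() blocks until the decoupled lifeguard has checked the
     * driver's trace up to the given position and reports whether it was clean;
     * emulate_mmio_write() performs the real device register write on the driver's
     * behalf; quarantine_driver() keeps further I/O from reaching the device. */
    bool lifeguard_verify_up_to(uint64_t trace_position);
    void emulate_mmio_write(uint64_t dev_addr, uint32_t value);
    void quarantine_driver(void);

    /* Invoked by the VMM when the driver's write to the (unmapped) MMIO region traps. */
    void on_mmio_write_trap(uint64_t dev_addr, uint32_t value, uint64_t trace_position) {
        if (lifeguard_verify_up_to(trace_position)) {
            /* Execution up to this I/O is verified: replay the access transparently. */
            emulate_mmio_write(dev_addr, value);
        } else {
            /* A bug was detected before this I/O: contain it by never letting the
             * write reach the device. */
            quarantine_driver();
        }
    }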
>>: [inaudible] question. How does this work with DMA?
>> Olatunji Ruwase: With what?
>>: DMA?
>> Olatunji Ruwase: So there are two parts to DMA, right, there's the control part and
then there's the actual streaming data part. So what we are tracking here is ensuring that
the control part is correct.
>>: So the control part is just a data structure, like a network transmit queue that's sitting in
memory.
>> Olatunji Ruwase: Mmm-hmm.
>>: And the driver is just doing memory writes to queue packet transmissions by queuing
stuff in a memory data structure.
>> Olatunji Ruwase: Mmm-hmm.
>>: And asynchronously with that, the device is walking this thing in memory and pulling
packets out of it. So the control channel...
>> Olatunji Ruwase: Well, so...
>>: To the extent that they exist, the control channel is just [inaudible] to memory.
>> Olatunji Ruwase: No, it's a little more than that because the driver has to tell the
device where it wrote stuff into. The device is not going to just go to some arbitrary
location and start reading from there.
>>: So the driver might tell the device, "Here's my transmit queue."
>> Olatunji Ruwase: Right.
>>: And the device starts transmitting.
>> Olatunji Ruwase: Right.
>>: And as long as there are still packets in that queue, the device will keep transmitting,
the network device will keep transmitting.
>> Olatunji Ruwase: Right.
>>: And the driver can happily queue more stuff there. It doesn't have to tell the device,
"I've got another packet. I've got another packet. I've got another packet."
>> Olatunji Ruwase: Right. Right. So I...
>>: This is asynchrony.
>> Olatunji Ruwase: So the -- Right. Right. So eventually, right, even with this
asynchrony at some point the driver has to be done and has to tell the device that it's
okay to get the data, at some point, right.
Or the device has to generate an interrupt to indicate that "I have completed" -- I
mean, it's a transaction at the end of the day.
>>: Well, maybe we should talk about this later....
>> Olatunji Ruwase: Yeah, Okay.
>>: My argument is that there's a control channel but it's asynchronous in the sense that
the driver can get stuff done without having to invoke that for every operation. And I'm
sure that in this system I could slip things out to the device without...
>> Olatunji Ruwase: Oh, no, no, no. Okay, so now I -- Okay, I think now I [inaudible]. So
the driver can write corrupted data and slip it out, but it would not modify the state of the
device itself. Because the control data that you're sending -- sorry, the I/O data in
your packet queues is not actually going to affect the state of the device itself. You can
only do that by programming the device registers and sending them some arbitrary
values.
>>: Okay.
>> Olatunji Ruwase: I mean, there is a performance implication for trapping and
emulation, but by doing this the design tradeoff that I made was that I would rather stall
I/O operations, the device register reads and writes, which are typically slow anyway and
relatively infrequent, rather than impact the actual CPU
instructions of the driver, which are much faster.
But of course there is some performance overhead implied here. And we can see that
here I'm showing performance relative to Linux, where Linux will be one. And so lower is
worse. And here I am evaluating the performance of using Xen, so I built this in Xen. I
added this interposition layer into Xen, and so I'm comparing against Xen the
interposition numbers. And I did this for basically four classes of devices, so audio,
video, disk and network. And the idea is to say what is the performance impact of doing
this sort of interposition on every device register read and write, every I/O operation.
And what we see is that in most cases the overheads are not so much; it's like up to 10
percent. And this is because these operations are slow anyway to begin with and then,
they're infrequent. But there are cases too where it gets worse.
So one of the cases where I have a very poor number is on compilation. So we're losing
about 30 percent performance relative to Linux. And it turns out that, well, the
virtualization of Xen is not doing much better. So, you know, it seems like most of that
overhead is coming from virtualization.
But with Memcached, where I have the system fully saturated, I am losing close to 40
percent of the performance. And so here it shows both sides of it. So like 20 percent of
that seems to be coming from virtualization and the rest is coming from the fact that, as
I'll show a little bit later, Memcached for this particular workload actually makes the
driver do a lot more I/O operations. I mean the get size is about 64 bytes, so it's doing
more I/O operations per byte. And so in those cases we can see the slowdowns.
And I'll show sort of how the interposition works. So now what does this allow me to do?
Now I can write some new interesting tools that have not existed before. And in
particular I wrote these three tools: DRCheck finds data races in drivers, DMACheck
finds DMA protocol bugs, while DMCheck finds memory bugs, unsafe uses of uninitialized
data, sort of like Memcheck at the user level.
In the interest of time I'm going to focus only on these first two.
>>: One thing [inaudible] said was that the reference monitors, they had to write them
for each device so they'd have to know about the device. In the things you're talking
about is I/O interposition in these checkers, are those per device or are those general?
>> Olatunji Ruwase: So the I/O interposition itself is general. And the reason for that is
because it intercepts the operation in sort of like a device agnostic way so it's essentially
a trap. So the driver will map the device's memory into its address space and try to touch
it. And the detail is that I just revoke that mapping, so that every access to it becomes a trap. And so this works regardless of the driver or the device.
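A minimal user-space sketch of that trap-and-revoke idea, assuming nothing beyond standard POSIX (the real interposition lives in Xen's page tables; the page and handler names here are made up): access to a stand-in device-register page is revoked, every touch faults, and the fault handler logs the access before letting it proceed.

#define _DEFAULT_SOURCE
/* User-space analogue of page-protection interposition: revoke access to a
 * page standing in for the device registers so that every access faults,
 * log the access in the fault handler, then re-enable the page and let the
 * faulting instruction proceed.  The real Guardrail layer does this in the
 * hypervisor's page tables; this sketch only illustrates the trap-and-log idea. */
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *regs_page;                 /* stand-in for mapped device registers */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    /* "Interposition layer": record the access before letting it proceed. */
    static const char msg[] = "trapped access to device register page\n";
    write(2, msg, sizeof msg - 1);
    /* Re-enable the page so the faulting instruction can re-execute.
     * A real interposition layer would single-step and revoke again. */
    mprotect(regs_page, getpagesize(), PROT_READ | PROT_WRITE);
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = fault_handler, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    regs_page = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(regs_page, getpagesize(), PROT_NONE);   /* revoke: every touch traps */

    volatile unsigned *reg = (volatile unsigned *)regs_page;
    *reg = 0xdeadbeef;                  /* this "device register write" is trapped */
    printf("write completed after trap, value = 0x%x\n", *reg);
    return 0;
}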
>>: So you just need to know where the control write [inaudible]...
>> Olatunji Ruwase: Yes. Exactly. Right, exactly. And I mean there are standard APIs
for how to do this in the kernel. So, yeah. At least within a particular operating system
kernel it's general in that sense.
>>: But to go back to your motivation...
>> Olatunji Ruwase: Sure.
>>: ...to drivers that have bugs that cause hardware to be corrupted like the network
card [inaudible].
>> Olatunji Ruwase: Yes.
>>: To catch that you would need to write a checker that understood the device's interface.
>> Olatunji Ruwase: So I will put that in a class -- So like I said there are a wide variety
of bugs that drivers could have. So it's conceivable that if I gave you an instruction stream, an instruction trace of the driver, you could actually write a checker and get it to
check for what the driver is doing and how it's programming the device registers and find
such bugs. I didn't look into that in particular for my thesis because one of the challenges
of doing that is you sort of need to know how the device behaves, right, and that seemed
like really hard.
So this is sort of like the easiest things I could do using like open source information. But
I believe very much that you could actually write similar checks.
All right, so I'm going to start with DRCheck which finds data races, so a quick
background on data races. So a data race essentially is two accesses to shared data, typically from two different threads, where at least one of the accesses is a write and they're not serialized.
And so the example we have here would be a data race, right. And a lot of the work in data race detection has been done in user space. And so some of the techniques they've used are things like trying to infer whether there's a consistent locking discipline for protecting shared data. So this would be the Lockset approach, where we can see here that this access here, with these locks here, is safe because every access consistently holds the lock Lx.
And then the other approach used is Happens-Before, which is basically the fact that, well, certain operations are serialized in time. So like a lock and an unlock on the same lock are serialized. And so, you know, one happens after the other. So the technique is to use this. And then in addition they maintain a vector clock to sort of keep track of what the time was. And so here we can conclude that this access to X here is safe. So this is the vector clock of the variable, and it indicates that Thread 1 wrote this variable at a [inaudible] time of 2. And this is the vector clock for Thread 2, and relative to it, it's at a [inaudible] time of 3 for Thread 1. So the access is serialized in time.
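A minimal sketch of that vector-clock comparison, with the clock values taken from the example just described and everything else assumed for illustration: an earlier write is ordered before the current access exactly when its clock is component-wise less than or equal to the reader's.

/* Sketch of the vector-clock test a Happens-Before detector applies:
 * an earlier write with clock W is ordered before the current access of
 * a thread with clock C iff W[i] <= C[i] for every thread i.
 * NTHREADS and the clock values are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 2

/* True if every component of w is <= the corresponding component of c. */
static bool happens_before(const int w[NTHREADS], const int c[NTHREADS])
{
    for (int i = 0; i < NTHREADS; i++)
        if (w[i] > c[i])
            return false;
    return true;
}

int main(void)
{
    int write_clock[NTHREADS]  = { 2, 0 };  /* Thread 1 wrote X at its time 2   */
    int reader_clock[NTHREADS] = { 3, 1 };  /* Thread 2 has seen Thread 1's time 3 */

    /* Ordered: the earlier write is serialized before this access. */
    printf("serialized: %s\n",
           happens_before(write_clock, reader_clock) ? "yes" : "no (race)");
    return 0;
}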
And then, there are hybrid approaches that combine the two, because these broad approaches have complementary strengths and weaknesses.
So there has been a lot of race detection at the user level. And you might ask, well, why not just use these tools on the kernel? What's interesting? Well, it turns out that concurrency in kernel mode is vastly more complicated than at the user level. So one of the things that can happen is that things that are actually racy in the kernel will appear safe to a user-level detector. And one example is this notion of intra-thread concurrency. This is when a thread actually races with itself. So when I was defining data races, the common definition implied that there have to be two threads involved to create a data race. In the kernel this is not true at all.
>>: What does that mean?
>> Olatunji Ruwase: I will get into that in a later slide. So I'll go into that in a minute. And
then the other problem is that safe accesses, the well-serialized accesses in the kernel
will actually appear racy. And this is because the kernel synchronization is done using
other things besides mutual exclusion primitives. In particular there's a form of synchronization called state-based synchronization, where drivers serialize their operations based on the state of the device. And I'll talk about that too.
So there's only one work that I'm aware of that finds data races in kernel mode with a dynamic approach, and this is called DataCollider. And the way it works is basically, say here we have a shared variable and two threads. DataCollider will stall a thread at a shared memory access and set a breakpoint on that data. And the idea is that if some other thread comes along and tries to access it then there'll be a collision, because it obviously means that that access was not serialized relative to the stalled thread.
And so the cool thing about DataCollider is that it has no false positives, simply because it doesn't need to worry about synchronization primitives or protocols or things like that. And it only reports actual races, races that actually occur. So that's a strong property. But the downsides are basically that, well, because it stalls threads it cannot stall threads that are timing critical, like interrupt-handling threads. So if the first access is from an interrupt-handling thread then DataCollider will not stall it and will miss the race.
And also it cannot stall threads indefinitely. It has to let them go eventually. Now this is less of a problem because typically no data corruption happens if the two conflicting accesses happen far apart in time. But it's still a near miss, and I would like to know that two conflicting accesses have occurred, even far apart in time.
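A rough user-space analogue of that stall-and-watch idea, purely illustrative (the real DataCollider uses code and data breakpoints inside the Windows kernel): stall at a sampled access, watch the location, and report a collision if its value changes while stalled.

/* Rough user-space analogue of DataCollider's stall-and-watch idea
 * (everything here is an illustrative stand-in).  The main thread pauses
 * at a sampled access and watches the location; if the value changes
 * while it is stalled, a conflicting, unserialized access happened. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int shared_flag;

static void *other_thread(void *arg)
{
    (void)arg;
    usleep(10 * 1000);
    shared_flag = 42;            /* conflicting, unserialized write */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, other_thread, NULL);

    /* Sampled access: remember the value, stall briefly, re-check. */
    int before = shared_flag;
    usleep(50 * 1000);           /* the "stall" window */
    int after = shared_flag;

    if (before != after)
        printf("collision: another thread touched the location "
               "while we were stalled -> real race\n");

    pthread_join(t, NULL);
    return 0;
}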
So let me get back to how things that are racy in the kernel might appear safe to a user-space detector. So this is intra-thread concurrency. Now what is this all about? Well the
idea here is that the kernel typically has to perform a wide range of different types of tasks at different priority levels. And so a kernel thread actually doesn't execute in just one context. In fact if we look in the Linux kernel, threads can either be in process context, which is dubiously named -- it really means it's servicing a system call from the user level. Another context that the same kernel thread could be in is a top-half context, where it's servicing interrupts generated by the device. And then, in the
middle we have this bottom-half. And the idea is that, you know, the top half is the
highest priority context. And so if a kernel thread is doing system call stuff and, you
know, a packet arrives on a network card, we're going to interrupt it and use that same
thread to go service that interrupt.
And just to show as an example we have here a network driver. It starts out in Time 1. It
starts out in a process context sending packets, you know, the send system call. And
packets arrive and so we need to move it -- sorry -- into the top-half. Now the question
is, what happens if it accesses the same data in both contexts, right? This is actually a race even though, to a user-level detector, it's the same thread, so it looks serialized. But, no, this is actually a race, because eventually when it returns to the process context, that context is not aware that a change has happened in the other context. Right? So the driver can get into an inconsistent state.
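A user-space analogue of that intra-thread race, with a signal handler standing in for the top-half interrupt context and all names made up: the handler's reset can be lost inside the main loop's unserialized read-modify-write, so the thread races with itself.

/* User-space analogue of intra-thread concurrency: a thread races with
 * itself because an asynchronous handler (a signal here, standing in for
 * the kernel's top-half interrupt context) interrupts it mid-update and
 * touches the same data.  All names are illustrative. */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile long tx_pending;        /* "driver state" shared across contexts */

static void top_half(int sig)           /* stands in for the interrupt handler */
{
    (void)sig;
    tx_pending = 0;                     /* interrupt path resets the counter */
}

int main(void)
{
    signal(SIGALRM, top_half);
    struct itimerval it = { .it_interval = { 0, 1000 }, .it_value = { 0, 1000 } };
    setitimer(ITIMER_REAL, &it, NULL);  /* deliver an "interrupt" every 1 ms */

    for (long i = 0; i < 50000000; i++) {
        /* Process-context path: read-modify-write with no serialization.
         * If the handler fires between the read and the write, its reset
         * is overwritten and lost -- the thread has raced with itself. */
        tx_pending = tx_pending + 1;
    }

    struct itimerval off = { 0 };
    setitimer(ITIMER_REAL, &off, NULL);
    printf("tx_pending = %ld (resets from the \"interrupt\" context can be lost)\n",
           tx_pending);
    return 0;
}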
Now DataCollider can detect this particular example here because, you know, it can stall
the thread while it's in process context and then, when the thread tries to access it in the
top-half it will actually detect that because it will, you know, fire off the breakpoint.
And so the basic insight from here is essentially that when we think about concurrency and race detection in the kernel, we need to think not only about the threads but also about the contexts in which they're executing.
Now in terms of state-based synchronization: so this is where something is well
serialized by the driver but it might appear to be buggy to a user-level detector. Now the
basic idea here is that the devices themselves are basically finite-state machines; we can view them as finite-state machines. And here I have sort of a snippet of a network card's state machine, where it starts out in an inactive state, at some point it gets connected to the PCI bus, and at some point it gets initialized and moves into packet transmission.
So if I consider this network driver code where, well, the same flag is updated by two different functions which could be executed by different threads, and they don't seem to be serialized, then my accesses look like a race. Well, it turns out not to be a race, simply because the probe function here is only valid when the device is in an inactive state. It's actually what takes the device from an inactive state and connects it to the PCI bus. While the open function only gets called after the device is already connected to the PCI bus. So in reality these two functions are serialized in time; they will never execute concurrently. And so this is actually not a race, but a user-level detector, because it doesn't see any explicit concurrency between them, will think that this is a race.
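A schematic sketch of that probe/open situation, with all names invented: both functions write the same field with no lock, which looks racy to a purely lock-based detector, but the device lifecycle guarantees probe() finishes before open() can run.

/* Schematic illustration of state-based synchronization (all names are
 * made up).  probe() and open_dev() both write dev.link_mode with no lock,
 * but the device lifecycle serializes them: open_dev() can only run after
 * probe() has already moved the device out of the INACTIVE state. */
#include <stdio.h>

enum dev_state { INACTIVE, ON_BUS, RUNNING };

struct net_dev {
    enum dev_state state;
    int link_mode;               /* the "shared flag" from the example */
};

/* Only ever called while the device is INACTIVE; attaches it to the bus. */
static void probe(struct net_dev *d)
{
    d->link_mode = 0;            /* unlocked write #1 */
    d->state = ON_BUS;
}

/* Only called by the kernel once the device is already ON_BUS. */
static void open_dev(struct net_dev *d)
{
    d->link_mode = 1;            /* unlocked write #2 -- still not a race */
    d->state = RUNNING;
}

int main(void)
{
    struct net_dev d = { .state = INACTIVE };
    probe(&d);                   /* state-machine ordering, not locks,       */
    open_dev(&d);                /* guarantees these never run concurrently  */
    printf("state=%d link_mode=%d\n", d.state, d.link_mode);
    return 0;
}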
>>: But DataCollider will [inaudible] this as a race.
>> Olatunji Ruwase: DataCollider, yes. No, DataCollider has zero false positives. Yeah,
I had that on the slide, right, zero false positives. That's the great thing about it. But this
is a problem -- So this actually goes to the fundamental problem which is that by being
agnostic to synchronization protocols, DataCollider will not have this sort of problem
regardless of whatever new fancy synchronization protocol comes up in the future.
That's really cool. In Guardrail we're trying to understand synchronization protocols, so
we're sort of susceptible to this and in particular because the states in devices can
vary...
>>: Why are you doing that? Why don't you just use DataCollider's trick?
>> Olatunji Ruwase: Why don't I use DataCollider as this? Well...
>>: Use that trick, right? I mean this trick of just...
>> Olatunji Ruwase: So combine [inaudible]...
>>: ...[inaudible]
>> Olatunji Ruwase: Actually it's an interesting direction. Bringing in the strengths of DataCollider -- in fact using it to help eliminate the dependence that DRCheck has on synchronization protocols -- is an interesting direction of research. But here I wanted to just really push as far as I can in terms of: if I try to
understand the synchronization protocols, what would that look like? And so the
challenge that comes from this is that devices are very different and they have different
states. And so how can I possibly encode all the different states into my tool or even for
a device that hasn't been invented yet. So here at this point I have to make a trade-off,
right? I can't possibly track all the possible states that a device could have.
But fortunately the kernel itself is also aware of some of the states because the kernel is
actually what invokes the driver code. And so for example here is the networking stack code that invokes the open function of a driver. And you can see here that it's actually checking that the device is connected to the PCI bus before it does that. And so in DRCheck we basically leverage this fact that some of the states are already exposed to the kernel in a standard way. And so we essentially just track those kinds of kernel-visible states: whether the device is connected to the PCI bus, whether it's generating interrupts, and things like that. So this allows us to sort of handle a wide
variety of drivers but it also creates the potential that we may miss synchronization; we
may have some false positives and I'll have those numbers in a second.
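A toy sketch of how a DRCheck-style detector might fold such kernel-visible states into its race test; the state list and the serialization rule here are assumptions for illustration only.

/* Sketch of folding kernel-visible device states into race detection
 * (all names and the state list are assumptions).  Two accesses made in
 * lifecycle phases that cannot overlap -- e.g. "probing" vs "opened" --
 * are treated as ordered even though no lock is held. */
#include <stdbool.h>
#include <stdio.h>

enum kernel_visible_state { PROBING, BUS_ATTACHED, OPENED, IRQ_ENABLED };

/* The only ordering modeled here: anything done while PROBING happens
 * before anything done in a post-probe state for the same device. */
static bool states_serialize(enum kernel_visible_state a,
                             enum kernel_visible_state b)
{
    return (a == PROBING) != (b == PROBING);
}

static bool is_race(enum kernel_visible_state a, enum kernel_visible_state b,
                    bool locks_serialize)
{
    return !locks_serialize && !states_serialize(a, b);
}

int main(void)
{
    /* probe() vs open(): no common lock, but different lifecycle states. */
    printf("probe vs open racy? %s\n",
           is_race(PROBING, OPENED, false) ? "yes" : "no");
    /* Two unlocked accesses in the same state: still reported. */
    printf("open vs open racy? %s\n",
           is_race(OPENED, OPENED, false) ? "yes" : "no");
    return 0;
}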
So in particular what are the false positives that we have? So for my comparison I
started out with DataCollider, which has zero false positives. Well, okay, let me just rephrase this: DataCollider has zero false positives. This experiment was done using Linux drivers, and of course DataCollider was only available on Windows when I was doing this evaluation, but it has zero false positives. So no big deal here.
Now Kernel-Lockset is a tool that I wrote where I took a user-level Lockset -- excuse me -- and basically made it aware of the locking primitives that exist in the kernel. And then I applied it and, whoa, I get all these false positives, simply because, you know, the synchronization protocol in the kernel is different from that in user space.
But in DRCheck, we do better. I mean we still have some false positives, and a lot of this
comes from the fact that we are only tracking states that are visible to the kernel so
states that the driver and the device are both aware about. What they use to serialize
themselves, we don't know. And so this is sort of the false positives that we get in here.
But we think at least with this we're able to get a good balance between techniques that
are like completely agnostic of synchronization protocol and those that try to understand
synchronization protocols.
>>: So when you say false positive here you mean things that are not in fact data
races?
>> Olatunji Ruwase: Yes.
>>: But it's not necessarily true that all of the data races are bugs.
>> Olatunji Ruwase: Right. Okay, so yeah. So the concept of benign races...
>>: Yeah.
>> Olatunji Ruwase: ...is what you're talking about. So I'll come to that -- So I'm not
reporting the number of benign races that I have here. And I think that's a general
problem for data race detection in general because...
>>: Do you have a sense of how many of these are actually bugs that we care about?
>> Olatunji Ruwase: Oh, these are all false positives. I haven't...
>>: [inaudible]
>>: Oh, sorry, sorry, sorry. Sorry, they're false positives, yes.
>> Olatunji Ruwase: So they're all false positives.
>>: I have a followup question.
>> Olatunji Ruwase: Yes.
>>: How many -- So these are all false positives but you have this other problem that if
you try to be too precise, you are so precise that you don't report any real data races
either, right?
>> Olatunji Ruwase: Yes.
>>: So when you are pursuing low false positives, you have to simultaneously make
sure that you are doing something useful also.
>> Olatunji Ruwase: Right. Absolutely.
>>: So did you find any real races?
>> Olatunji Ruwase: Yes, that's the next slide.
>>: Okay.
>> Olatunji Ruwase: So we found true races in Linux drivers, and this is after pruning
away all of the benign races. This is where we actually found real races. So some of
them were kind of interesting. So one of the races here actually never got a Linux bug
report. So I mean the way I verify this was basically, you know, I have my driver, my
driver version and then I kept looking through the commits in Linux. And I saw one that was fixed but there were no comments in there. And I asked, like, "Hey guys, I thought there was a race here?" And yeah, it was a race, but the problem there was that the race causes failures in strange ways that are just hard to reproduce. And so unless you are truly analyzing at the instruction level what the driver is doing, you would actually not find this conflicting access.
And so, yeah, for DRCheck we found a total of nine races. So we decided to like -- Well,
I mean like I said DataCollider is on Windows, so we looked at each race and tried to
see will DataCollider catch this?
So we then modeled like two forms of DataCollider. One is deterministic. So one thing I
didn't mention before was that DataCollider probabilistically stalls threads. So here we're
assuming that for every race here, DataCollider actually will stall the first access, right,
like deterministically.
And there, we saw that it only detects just two of these nine races. And the reason is
because in some of these cases, the first access was in an interrupt handler. So in the
example I was just talking about, the first access was in an interrupt handler. And so because
DataCollider doesn't stall it, it will actually miss that race.
Then the other reason was that some of the races actually occurred far apart in time. And so while they might not lead to data corruption, they're near-misses that we would like to know about. So for the interrupt handling we decided, okay, let's invert it. So for a scenario where the first access is in an interrupt handler, why don't we reverse it such that the first access is no longer in an interrupt handler, so that DataCollider could actually stall that.
And we see that if we did do that -- And this is what I call ideal -- DataCollider catches
six out of the nine. And so the remaining three are basically the ones where either the
two accesses are both in interrupt handlers or they are far apart in time.
So by checking more of the driver execution, in particular the interrupt context we are
actually able to find more races.
>>: If you have a race between two interrupt handlers, aren't they serialized anyway by
the interrupt delivery mechanism from the device?
>> Olatunji Ruwase: Right so...
>>: How is that a race?
>> Olatunji Ruwase: Right. Actually, you're right. So, yeah, you're right. Actually, no,
that's not necessarily true. So it depends. On a particular core the interrupt handler is...
>>: [inaudible] same driver on different cores...
>> Olatunji Ruwase: On different cores, right.
>>: [inaudible]
>> Olatunji Ruwase: So, yeah. Two types of -- Yeah.
>>: But even on the same core, the two interrupts have different priorities...
>> Olatunji Ruwase: Priorities.
>>: ...or goals, right?
>> Olatunji Ruwase: Well, in Linux it's not...
>>: [inaudible] serialized.
>> Olatunji Ruwase: Yeah, not in Linux. Yeah, yeah, not in Linux. So it's really the
separate cores. Okay, so yeah, this is sort of the balance we're getting at, right, which is you can either try to be so precise that you detect nothing, or be too imprecise and get too many false positives. And so we think we're sort of somewhere in the middle, but there is definitely room for improvement, at least in terms of the false positives.
So let's see. [inaudible] doing on time. DMACheck actually detects bugs in how drivers use DMA buffers. So a DMA buffer is basically some part of system memory that's shared by the driver and the device such that the device can directly copy data off of it, which is good for I/O performance.
So a number of issues come up with how DMA buffers are used. One is the fact that
they are shared, so we have to avoid having the driver and the device race on it. I don't know if that
gets to your question. So the driver should avoid racing on the buffer while the device is
using it.
Now the other issue that comes up is the fact that the driver and the device actually access the DMA buffer through different paths. Right? So the driver goes through the cache, while the device uses physical addresses directly. And so this leads to all sorts of
coherence issues, and so you have to sort of be careful in terms of how you create your
DMA buffers. In particular you should ensure that your DMA buffers are cache aligned.
And finally DMA buffers are system resources. The driver should avoid leaking them or
mapping them in inconsistent ways. Because Guardrail gives us an instruction trace of what the driver is doing, we can actually write checking tools to check for violations of these rules. And in fact I did that for Linux drivers and I found about 25 violations of this nature, 7 of which were races between the driver and the device. And there were about 4 cases where the DMA buffer was misaligned. Now what's interesting is that no one has really studied these kinds of violations before, but the fact that we have an instruction trace allows us to check for arbitrary rules.
And in fact the way I went about this was I picked up the Linux documentation about how
to write DMA. And I'm like, okay, these are the rules. I can write this up as a checking
rule. I have an instruction trace of the driver; you know, I can find these violations.
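A toy sketch of that kind of checking rule applied to an instruction trace; the rule is paraphrased from the Linux DMA documentation and the trace format is invented: any driver access to a buffer between its map and unmap is flagged.

/* Sketch of the DMA-rule checking idea.  The checker walks the driver's
 * instruction trace: between a "map" and the matching "unmap", any CPU
 * access by the driver to the mapped range is flagged as a driver/device
 * race on the DMA buffer.  Trace format and addresses are made up. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum op { DMA_MAP, DMA_UNMAP, MEM_ACCESS };

struct trace_entry { enum op op; uintptr_t addr; size_t len; };

int main(void)
{
    struct trace_entry trace[] = {
        { DMA_MAP,    0x1000, 256 },   /* buffer handed to the device  */
        { MEM_ACCESS, 0x1040, 4   },   /* driver touches it: violation */
        { DMA_UNMAP,  0x1000, 256 },
        { MEM_ACCESS, 0x1040, 4   },   /* after unmap: fine            */
    };

    uintptr_t mapped_base = 0; size_t mapped_len = 0; bool mapped = false;

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        const struct trace_entry *e = &trace[i];
        switch (e->op) {
        case DMA_MAP:   mapped = true;  mapped_base = e->addr;
                        mapped_len = e->len;               break;
        case DMA_UNMAP: mapped = false;                    break;
        case MEM_ACCESS:
            if (mapped && e->addr >= mapped_base &&
                e->addr < mapped_base + mapped_len)
                printf("violation: driver access to 0x%lx while buffer "
                       "is mapped for DMA\n", (unsigned long)e->addr);
            break;
        }
    }
    return 0;
}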
>>: Do you know how long that'll be in the base before [inaudible]...
>> Olatunji Ruwase: Well...
>>: [inaudible]
>> Olatunji Ruwase: So, okay. I don't know exactly how long it's been in the code base; the Linux kernel that I used for my evaluation is quite old actually. And so some of the devices might look really old. But the other thing that I found out -- because I contacted the DMA expert on Linux -- is that not all of these bugs actually cause an actual crash or corruption. And so people have been reluctant to actually fix them. But it just -- This tool just actually shows a new breadth of correctness checking that we can apply to drivers, which is one of the reasons why I went down this direction.
>>: How can you know if it's a race without knowing the details of the hardware
specification? Because the way [inaudible] works is you write some stuff and then you
set a bit. And that bit says, "I'm done with this," [inaudible] the device now. So it's
perfectly safe to modify this until that bit is set.
>>: So when you're saying race here this is a race between the hardware and the
software?
>> Olatunji Ruwase: Yes. So the answer to that -- To be precise, you're right. But what
I've taken here is really what the Linux kernel developers recommend in terms of how
you use DMA. So there is an API for mapping a DMA buffer to a device. And the idea is that while that mapping is in place, the driver should not touch that piece of memory. And so what I am doing is basically looking for those maps when they happen. And
at that point, you're right, maybe the device hasn't started using it so it's somewhat more
conservative than reality but it's just like this is at least what the Linux [inaudible]
requires.
Now if I had the device logic, I could also encode that and say, well, until the driver
writes to this particular control register, the device hasn't started doing DMA. But here you actually get to one of the limitations of this work, which is that the analysis cannot directly examine the device state. We're sort of relying on the driver to do this, like poking a device register and reading it. So you can only sort of get at it that way.
But this was just an interesting direction to go in, especially since issues like this don't exist at the user level. Like, data races exist at the user level, memory [inaudible] exists at the user level.
>>: When you [inaudible] these bugs what was the update from the developer
community?
>> Olatunji Ruwase: They were like, "They don't cause any crashes." And like I said this
is a very old kernel. So they were not particularly interested. So that's basically the bug
detection effectiveness we get by doing lifeguard-style instruction [inaudible] checking in
drivers.
So this is my design. So now I'll talk about the end-to-end performance, but first let me
just show how I built this. So for my design basically I built my prototype where the
analysis runs in a separate virtual machine from the driver, and so it's isolated in that
sense. This is done in Linux as I said. I used Xen to build my interposition layer. And
that's sort of how this picture looks. And then for my instruction tracing, I sort of rely on
the LBA hardware-assisted instruction tracing. And so this essentially is what the current
prototype looks like. All right, and this is only in simulation because the hardware tracing doesn't exist today. And in particular I just sort of want to point out that this is sort of like your conventional system setup.
But I reserve about half a megabyte of memory for holding the instruction trace. And in my simulation, which is actually in [inaudible], I evaluate protecting the network card and the hard drive. And the driver VM has two virtual CPUs. The analysis VM has one. This is a dual-core system that I'm simulating.
And in particular I'm measuring end-to-end performance: what would it cost to protect my device from any of these kinds of errors in an online fashion? And these are the two drivers that I used. These are stock Linux drivers, so fairly old Linux drivers.
And so this is what the end-to-end performance looks like. So here I'm reporting the
performance normalized to Linux. And this is for the disk workload using the postmark
benchmark which tests different kinds of file operations.
And here I'm reporting the transaction rates, the read rates and the write rates. And
here, you know, we see the performance is not so bad in some cases. I mean in a lot of
cases we have less than 10 percent impact; although, there is like one or two where we
have like 12 to 15 percent.
>>: [inaudible]?
>> Olatunji Ruwase: Unfortunately this is not quantified here. I have some backup slides
that can get into that.
>>: [inaudible]?
>> Olatunji Ruwase: Yes. But what about network? So here for the network servers I'm reporting the throughput that the server delivers, normalized to Linux. You know, Linux is one. And I used a range of network applications. And in general the performance seems reasonable for most of these benchmarks until we get to network streaming, where I'm losing close to 60 percent of the performance in that case. And then, the question is why is network streaming so bad? Well, it's because this reflects the frequency with which the driver performs I/O operations. And in particular I'm reporting here the device register accesses, which is how I/O operations are performed, and I'm reporting in particular the reads and the writes. And so one interesting observation is that, well, there are usually more writes than reads in the network space. I mean this is kind of interesting.
And this is on a log scale, so each step is an order of magnitude difference. And we can see here that the streaming benchmarks are actually performing an order of magnitude more device writes than the other benchmarks, at least.
So this sort of shows the worst case, because what happens there is that this is the point where we have to stall and the lifeguard has to catch up. And so even though this shows the worst case here, the good thing is that this doesn't seem to be prevalent for a lot of workloads.
Okay. So sort of that's the end of Guardrail. And so I'll just quickly go through some of
the future work that I intend to do. I intend to continue in this direction of looking at
device driver reliability. And in particular the first thing I would like to do is to see if I can
actually protect the operating system kernel as well. But this is going to be quite
challenging relative to the interposition layer that I had here, because kernel-driver crossings happen much more frequently than driver-I/O crossings. So this is going to
be really tricky. And so maybe speculation might be a good way to go rather than stalling
the operations. And so you assume that the operation is successful and sort of
[inaudible] if it turns out to be unsafe.
And then, the other challenge here is that the granularity with which kernel and driver
share memory can be at a sub-page level, right? Because they're in the same address
space. Whereas for the device, everything is on a nice page granularity which makes it
easier to like, you know, restrict access. But now when you want to restrict access to
sub-parts of a page it's going to be really hard. And so I might be looking into things like
maybe adapting the memory allocation policies in the kernel so that driver data goes into
separate pages from the kernel.
And of course I also want to tackle sort of the virtualization overheads that I'm seeing here with the interposition layer, where I would like to have a lightweight virtualization, and then asynchronous I/O writes.
Now this one's kind of interesting. So I showed that the device register writes were like
really the big performance killer, and this is because in a virtualized setting the I/O writes become synchronous. They become synchronous, which is bad, especially if your driver does a lot of those. In fact, while the I/O reads get maybe a 5x slowdown, the I/O writes see something like a 30 to 40x slowdown. And that's simply because in the non-virtualized setting, I/O writes are asynchronous anyway. So when you issue a write to a device you don't wait for a response; your thread can continue executing.
But with virtualization everything just sort of becomes synchronous. So I would love to explore the ability to do asynchronous writes, where the guest continues executing while the virtualization layer takes on the responsibility of actually completing the I/O writes for it.
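A minimal sketch of what such posted, asynchronous register writes could look like, with the guest as producer and the virtualization layer as consumer; the ring layout and sizes are assumptions, not the proposed design.

/* Sketch of the "asynchronous I/O writes" idea: register writes are posted
 * into a small ring and completed later by a separate consumer, so the
 * writer never stalls on a synchronous trap per write.  In the proposal
 * the producer is the guest driver and the consumer is the virtualization
 * layer; here both are ordinary threads for illustration. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 64
#define NWRITES   1000

struct posted_write { unsigned reg; unsigned val; };

static struct posted_write ring[RING_SIZE];
static atomic_uint head, tail;          /* head: next to post, tail: next to complete */

static void post_write(unsigned reg, unsigned val)    /* "guest" side */
{
    unsigned h = atomic_load(&head);
    while (h - atomic_load(&tail) == RING_SIZE)        /* ring full: wait */
        ;
    ring[h % RING_SIZE] = (struct posted_write){ reg, val };
    atomic_store(&head, h + 1);         /* publish and keep executing */
}

static void *completer(void *arg)                      /* "virtualization layer" */
{
    (void)arg;
    for (unsigned done = 0; done < NWRITES; ) {
        unsigned t = atomic_load(&tail);
        if (t == atomic_load(&head)) continue;         /* nothing posted yet */
        struct posted_write w = ring[t % RING_SIZE];
        (void)w;                        /* here the real layer would perform the
                                           trapped device register write */
        atomic_store(&tail, t + 1);
        done++;
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, completer, NULL);
    for (unsigned i = 0; i < NWRITES; i++)
        post_write(0x10, i);            /* driver keeps running between writes */
    pthread_join(t, NULL);
    printf("completed %u posted writes\n", atomic_load(&tail));
    return 0;
}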
And so with that, my contributions. So basically I've looked into decoupling as an idea for improving the performance of lifeguards. And I've shown that we can actually contain bugs even though we are delaying the detection of bugs. And this is great because then we can use increasing processor counts to actually improve run-time monitoring in general. I have proposed novel software optimizations for existing lifeguards -- so these are existing lifeguards that I did not write myself -- and I was able to get good speedups for them by basically parallelizing them or applying compiler optimizations. And then I've gone into the kernel space and said, okay, what if we could do this sort of detailed correctness checking in the kernel space? And I've shown this is practical by decoupling. Here I've shown a framework called Guardrail which actually safeguards the I/O state from driver bugs, and it does containment using commodity virtualization. I have created three novel lifeguards for finding data races, memory faults and DMA bugs in drivers.
And with that, I'll conclude and take questions.
>>: I think rather than questions we should thank the speaker because he's had plenty
of questions.
[applause]