>> Tom Ball: Okay. Hello, everybody. Welcome. My name is Tom Ball, and it's my pleasure to
welcome Todd Mowry from Carnegie Mellon University where he's a professor in the Computer
Science Department. He received his Ph.D. from Stanford in 1994, one year after I got my Ph.D.,
so we're pretty close there in time. And his research interests are varied and span areas of
computer architecture, compilers, operating systems, parallel processing, database performance
and modular robotics, which is a very cool topic too.
He's served a rotation as the director of Intel's Research Lab in Pittsburgh and associate editor of
ACM Transactions on Computer Systems. So today he's going to talk to us about butterfly
analysis for dynamic parallel program monitoring using dataflow analysis. Something like
that.
>> Todd Mowry: Exactly.
>> Tom Ball: You'll probably describe it a lot better than me. Thanks, Todd.
>> Todd Mowry: Sure. So I'm going to talk about, this is work that was done primarily by my
student, Michelle Goodstein, and this work appeared at the ASPLOS conference last summer, and
what I was going to do is I'm going to start by telling you a bit about the project that led to this
particular piece of work and then I'll talk about the work and then afterwards I'll talk about where
we're going to sort of follow along with the next steps.
So let me just back up.
So a little bit about myself. So I don't have -- my background is not in verification or particularly
in formal methods. I work mostly in compilers and architecture, and most of my research had
been on making things run faster either through parallelism or through cache optimizations. And
after doing that for many years I wanted to do something that wasn't just about making things run
faster.
One thing I was concerned about is with the whole world moving to multicore and with everyone
trying to start to write parallel programs, I know firsthand how hard that is, so I wanted to do a
project that might help people with functionality bugs in parallel programs. So that was the
motivation for this overall project.
And as everyone knows, when a bug occurs, as your program is executing often at the moment
that the root cause of the problem occurs you don't necessarily notice the problem right away.
Typically you notice it later when something catastrophic happens.
So at this point you get something like a core dump, and you can see what the world looks like at
that point, but that's not exactly what you want. What you really want to do is know what did
things look like back here so that you can find the real problem.
And the way the classic debugging happens is that at this point you would then try to rerun the
system with a debugger turned on and try to sneak up on the bug from the beginning, but
especially when you're thinking about a parallel system where things are timing dependent, it may
be extremely difficult to reproduce all the right circumstances to get that bug, and if debugging
slows things down and dilates time, then maybe it's not even possible to reproduce the bug with
the distorted timing.
So there are many ways to attack bugs. So static analysis is one great way to do that, and a way
that I think about what we're doing, we've been focusing on dynamic analysis not because we
don't believe in static analysis but because we think this would be complementary to static
analysis. So do everything you can with fancy static analysis tools, but for the things that you still
can't catch or for the things you're not quite sure about how they behave at run time, our goal is to
build the world's nicest, fastest run time analysis framework and then let clever people figure out
new kinds of tools to build on top of that.
So we didn't come up with the idea of doing dynamic analysis, of course. There have been lots of
tools to do things looking -- tools that are typically built using binary instrumentation tools like Pin
or Valgrind, and they look for things like memory bugs, security bugs, concurrency bugs and so
on.
So in particular we're interested in very sophisticated detailed tools that look typically at each and
every instruction. So it's easy to do dynamic analysis without much overhead if you only monitor
things that occur fairly infrequently. But if you want to look at every instruction and do something
interesting, the overheads can be very, very high.
But if you could make this run fast enough, a nice thing about doing an analysis dynamically is
that hopefully you notice the root cause of the problem before the catastrophic consequences
occur down the road.
Also, if you can look at every instruction, we think you have a better chance of finding things
rather than just, say, look at a core dump or a window of the last million instructions after the
system crashes because you don't miss anything. You can see all the instructions from the
beginning of when the program runs.
So it would be wonderful if these tools were wickedly fast and we could use them for everything. But
the bad news is typically these tools, when they're implemented today with binary
instrumentation, slow things down quite a bit. Slowdowns of 30 to 100 times are very common
with interesting tools like this.
And when things slow down this much, the problem is you can't really run these on a live system.
Interactivity doesn't really work anymore. So that was an issue we wanted to fix.
So in the sort of overarching project that we've been working on what we've been trying to do is to
accelerate this kind of analysis to the point where the slowdowns are tolerable without resorting to
sampling. So we're still looking at every instruction, but we can get the overheads down to, say,
something like 20 percent or 50 percent to the point where you could actually think about leaving
this on on a live system.
And I've talked about how we've done that in previous visits here and talks so I'm not going to go
into a lot of detail on that. I'm just going to quickly summarize this in case there are people who
haven't heard this before.
So how do we run so much faster than -- yes?
>>: Is 20 percent and 50 percent performance, is that something acceptable?
>> Todd Mowry: Well, actually we can -- I think we can reduce it lower than that. So our actual
goal is no slowdown, and --
>>: [inaudible]
>> Todd Mowry: Yeah, so actually this is -- we've done a lot of other work that I'm not showing.
Actually I wasn't going to -- actually I can talk at great length about that.
So we've done a lot of work on different ways to accelerate things, and these are fairly
conservative numbers. The 20 percent to 50 percent was our number from three years ago, and
we've gotten much better results since then.
In fact, you can always close the rest of the gap by doing just a little bit of sampling. So if I'm
willing to look at three quarters of instructions instead of 100 percent of them, I can just
immediately make any slowdown go away.
>>: [inaudible]
>> Todd Mowry: Sorry, what's that?
>>: [inaudible]
>> Todd Mowry: Yeah. So our goal is to do it without any sampling, and I think we can get down to
no slowdown. There are different tricks for basically getting to that. We're very close to that.
But it's more that we sped things up by, like, more than a factor of 20, and there's more low
hanging fruit that we're working on to speed it up more. So things got a lot better. The point is
when you're looking at this kind of slowdown, nobody even thinks about running this on a live
system.
Okay. But in a nutshell, the thing that we do, which, again, I've talked about in earlier visits, is we
don't use binary instrumentation. Instead what we do is we modify the hardware a little bit to
basically have the processor collect some of the information that we would otherwise get from binary
instrumentation.
And this is really not a big deal in that the hardware already has this information, and in fact it
captures this internally for debugging purposes in some cases, but let me tell you a little bit about
what we do.
So, first of all, the application is running unmodified on one processor, and the analysis -- instead
of taking the analysis and stitching it together into one software thread and running that on one
processor, we keep them separate. We run the analysis on a separate core on the same chip
and then we -- the hardware logs the information that we care about dynamically and stores it in
memory and then that is consumed by the analysis.
So it's basically the same type of information flow that happened with binary instrumentation, but
we're pulling it apart and running it on separate processors.
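A minimal sketch of this decoupled producer/consumer flow, assuming an invented event format and using a Python queue as a stand-in for the hardware log buffer:

```python
# Sketch only: the application side produces a stream of (operation, address)
# records, and the lifeguard consumes them on another core and updates its
# metadata. The record format, queue, and addresses are invented for illustration.
import threading
import queue

log = queue.Queue(maxsize=1024)      # stands in for the small in-memory log buffer
allocated = set()                    # lifeguard metadata: currently live addresses

def application():
    # The monitored program runs unmodified; conceptually the hardware appends
    # one record per interesting event (malloc/free/load) to the log.
    for record in [("malloc", 0x1000), ("load", 0x1000),
                   ("free", 0x1000), ("load", 0x1000)]:   # last load is a use-after-free
        log.put(record)
    log.put(None)                    # end-of-log marker

def lifeguard():
    while (record := log.get()) is not None:
        op, addr = record
        if op == "malloc":
            allocated.add(addr)
        elif op == "free":
            allocated.discard(addr)
        elif op == "load" and addr not in allocated:
            print(f"lifeguard: access to unallocated address {addr:#x}")

threads = [threading.Thread(target=application), threading.Thread(target=lifeguard)]
for t in threads: t.start()
for t in threads: t.join()
```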
So the reasons why it runs faster are, first of all, well, we're using another processor, and so that
gets us some benefits. It's not just that there's another processor, but the fact that we have
another set of registers and another cache, that helps quite a bit because when you take a
program that was already compiled to use all the registers and then you try to stick analysis into
that, there's a lot of register spilling, plus the caching effects matter.
The problem isn't that this analysis is knocking the cache -- is displacing the data from the
application. It's the other way around. This analysis typically has a much smaller footprint than
the application, but it's actually the bottleneck. And if you want this to keep up, you want it to run
as fast as possible, so we don't want the application to be knocking the data out of the cache for
this tool.
And, finally -- so with the things I described so far, if we do this the right way, and hardware
designers don't think this is a crazy idea, then you could run this at full speed because this is just
running on bare hardware, and the information is just being captured in the background and sent
over to this processor.
Now, it turns out that the problem, the performance bottleneck, is on the consumer side because
these tools may do something like five to ten instructions worth of analysis for every one
instruction in the application. So they're starting off with a big disadvantage in terms of
instructions.
What was that?
>>: [inaudible]
>> Todd Mowry: Oh, okay.
So what we did in part is both through hardware -- it says hardware here, but we also did this
through the compiler -- we found ways to recognize redundancy in the analysis and eliminate it or
reduce it, and we've had papers on that which I'm also happy to talk about in more detail.
Yes?
>>: Isn't that analysis running on exactly the same type of processor as the [inaudible]?
>> Todd Mowry: So in our designs we've made each processor the same. So any processor
could either run an application or could run the lifeguard. We did that just for flexibility so that if
you wanted to use all resources just for computation, you could do that, so that we weren't hard
wiring anything. But you certainly could -- it's interesting to think about optimizing a processor
just for doing this kind of analysis.
So with all these things turned on, a very --
>>: So I can think of this as another type of tool [inaudible] where you have like two cores that
are running the same thing, just one is doing something else. My question is -- and you also
have some changes to the hardware, so are you proposing that every work station or every
machine should be changed so that -- the hardware should be changed so that it will support
[inaudible]? Do you think this is feasible?
>> Todd Mowry: Yes. So --
>>: Do you think really it's worth, like, all the costs that will be associated with hardware change?
Because you know that's very expensive.
>> Todd Mowry: Yes. So I often have worn an Intel hat, and this project was done jointly with a
bunch of Intel people, and we've spent quite a bit of time talking to Intel architects throughout the
whole project about the feasibility of this. So it's not -- this is not technically infeasible. The
question is just sort of value proposition and so on, and in fact Ben and [inaudible] have helped
participate in discussions about that with Intel. So it's a topic of discussion.
Do you guys have other questions?
>>: [inaudible] applications usually when you make a small change [inaudible] just an extra write
to the memory or an extra read from the memory can result in totally different set of results, so I
was wondering if your construct, your conditional construct, you're adding [inaudible].
>> Todd Mowry: It does not -- let's see. Well, it's not that there's zero perturbance of the system
because it is writing a log into a small piece of memory, and it can sometimes -- there's nothing
special about that memory that would prevent it from ever conflicting with other memory accesses
in the application. And in our experiments we measure this.
The interference between the log that we're storing here and the normal cache data is very small
because this buffer is not very big at all. But there is a small effect there. So it's not that there's
zero perturbance from what we're doing, but the hope is that it's not very large. It's certainly
better than, I think, what it is when you slow things down by an order of magnitude.
>>: So what about the memory traffic that's going off core?
>> Todd Mowry: Well, one of the things that we like about this is -- so people are worried about
off-chip bandwidth, and a nice thing in my mind about this is all of the -- nearly all of the traffic
associated with doing this analysis stays on chip because the log almost never gets knocked out
of the chip because it gets consumed more or less right away, and then the metadata that the
lifeguard uses caches very well, so --
>>: So it's not off chip. So it's just the same core. You might even be sharing a cache?
>> Todd Mowry: Yes, they are typically sharing the last-level cache.
>>: I see.
>> Todd Mowry: So that's actually where this thing resides is it just really never goes off chip.
So one of the -- and we have papers where we look at a wide variety of different lifeguards, and this
is one that is part of a Valgrind tool that's called Address Check. It's one of the simpler ones, but
it initially has a slowdown of about 30x, and using our bells and whistles we can get that down to
about, say, 2 percent, something that's in the noise.
For the more sophisticated tools it's a little higher than that, like 20 percent, 30 percent or so, but
it's small enough that you can think about actually turning this on and leaving it on all the time.
Okay. So that's just kind of context. So that's the project.
But what I've talked about so far, this was just single threaded applications. So we're using one
lifeguard thread to monitor one application thread. And what we wanted to do is look at parallel
programs, and this makes life much more interesting and challenging.
Okay. So now I'm jumping into the official slides from Michelle's presentation so there's a little bit
of repetition here.
The idea is when we think about finding these problems, you can either try to catch things using
static analysis before anything executes or during execution or after something crashes. And all
of these are useful things to do, but we're focusing on things in the middle. We're trying to catch
problems during the middle of execution.
Okay. And the way that these tools work is that they associate extra state, the shadow state or
metadata, for all of the data in the application, and that shadow state is tracking some
correctness property of the data. For example, memory checkers record whether a location has
been allocated and initialized and freed and that type of thing.
So, for example, as your program is running if you load from an address, it's going to go over and
check whether you actually have allocated that thing, and if not, then it knows that there's a
problem.
Okay. So for sequential programs this is basically just a finite state machine where you go from
instruction to instruction. It knows how to update this metadata, and it knows how to check the
metadata. So that's fairly straightforward.
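As a concrete illustration of that finite-state-machine idea for a sequential memory checker, here is a minimal sketch; the state names and events are invented, not any particular tool's exact metadata:

```python
# Per-location shadow state for a sequential memory-checking lifeguard (sketch).
UNALLOC, ALLOC, INIT = "unallocated", "allocated", "initialized"
shadow = {}                                   # address -> metadata state

def on_event(event, addr):
    state = shadow.get(addr, UNALLOC)
    if event == "malloc":
        shadow[addr] = ALLOC
    elif event == "free":
        if state == UNALLOC:
            print(f"error: free of unallocated {addr:#x}")
        shadow[addr] = UNALLOC
    elif event == "store":
        if state == UNALLOC:
            print(f"error: store to unallocated {addr:#x}")
        else:
            shadow[addr] = INIT
    elif event == "load":
        if state == UNALLOC:
            print(f"error: load from unallocated {addr:#x}")
        elif state == ALLOC:
            print(f"warning: load of uninitialized {addr:#x}")

# Replaying a sequential trace one instruction at a time:
for ev, a in [("malloc", 0x20), ("load", 0x20), ("store", 0x20),
              ("load", 0x20), ("free", 0x20)]:
    on_event(ev, a)
```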
Okay. Now what happens when we want to look at parallel programs? Well, it would be nice if
we didn't have to change things very much. So what happens in this situation? Well, that would
be nice. That's not what really happens, though.
So now imagine that, for example, on one thread we're allocating memory and we're
dereferencing it and we want to know whether we're going to have a null pointer exception, and
it's possible that this other thread is going to set P to be null, and we want to -- imagine we want
to write a lifeguard that's going to check for this problem.
So now we need to worry about the ordering of when this assignment occurs relative to this and
this access and whether this assignment is visible at this point. So the way this basically
happens today, more or less the state of the art is that people track this by doing time
slicing.
So the idea is that you basically force the different threads to interleave on one core, and then if
you do that you can run one lifeguard. A single-threaded lifeguard can look at this time-sliced
execution and pretend that it's basically one program except that you have thread IDs whenever
you switch from one thread to another, but the same mechanism of having a finite state machine
basically still works.
So the good news is this will work, it will function, but the bad news is that it's slow, because now
we're throwing away all of our parallelism. We have to take all of these threads and put them all
onto one core, and we don't want to do that. We want to get the performance benefits of
parallelism.
Okay. So let's say we want to -- yes?
>>: So let me see if I understand. You only want to check actual races basically instead of
possible races?
>> Todd Mowry: Right. Yes.
>>: So can we tell by the interleaving -- it isn't all possible interleavings, a particular interleaving,
and you're going to catch that one, right?
>> Todd Mowry: For the most part, yes. Actually, when I talk about what we actually did in the
butterfly analysis, it's basically a hybrid where within a recent window of time we're actually
considering all possible interleavings within a window, but beyond that we aren't. But with what
I'm talking about here, it's only considering one interleaving.
So the idea of the tool is simply to say given how things have behaved so far, does it look okay.
It's not trying to think about anything about what else could have happened with a different
interleaving.
But the idea here is with the time slicing you force a particular interleaving and then you just
analyze that interleaving and that's all that happens.
Okay. So if we wanted to do this in parallel, now the analysis needs to actually run in parallel.
And the tricky part is thinking about these dependencies. So one possibility -- we had another
paper at ASPLOS last year where we looked at hardware support to do this. So if the hardware
logged the communication between processors the right way, then you would have all the
information you would need for the lifeguards to know precisely whenever I access something,
am I reading something where this write was visible or not.
So if you're willing to add extra information to all the coherence messages, you can track this
precisely. So that's one possibility. But that requires even more new hardware.
And also it really only works if you have sequential consistency or total store ordering. So that's
one possibility, but that's not what I'm going to talk about today.
So instead I gave Michelle this challenge, I said, okay, let's -- so it's natural to sit down and think
if I'm going to analyze this parallel thing, step number one is figure out what the ordering is. What
is the interleaved ordering of everything?
But let's assume that that's impossible to observe, that we just don't know the interleaving. So in
particular I think that's important for a couple reasons. First of all, without this new special
hardware we really don't know what that exact interleaving is, at least within the near term -- I
mean within the recent history. And also this idea of an interleaving only makes sense if you
have a sequentially consistent machine, and although programmers like sequential
consistency, real machines aren't sequentially consistent.
So in real machines sequential consistency is an illusion. Under the covers, things are happening
out of order. So let's embrace that and say that things really happen concurrently, within reason.
So the idea -- the way that we're tackling this is we're not simply saying we don't know anything at
all about ordering. What we're saying is that we have some idea of a window -- a bounded
window of uncertainty where from the perspective of this reference over here, if we go back far
enough on other threads, then there's some point where enough time has passed that we know
that memory accesses are now visible because there's only so much buffering in the hardware, in
the reorder buffers and the memory system. And that may be tens of thousands or hundreds of
thousands of instructions.
So beyond this point we know that things are visible, but in the recent past and in the near future
on concurrent threads we don't know -- we'll assume that we don't know the ordering, and
therefore anything is possible.
So, for example, in this case if I see -- if there's an access where I'm setting P to be null, then if
that occurs within this window of uncertainty, I have to assume the worst, which is that this was
visible before I dereferenced this and that this would be a problem.
So the idea is we want to have our analysis -- instead of having a total ordering, we have only a
partial ordering information. And we're going to come up -- we want to design something that's
conservative but correct.
So I'll talk more about how this works.
So, first of all, we have this window of uncertainty of a certain size, and one question is how big is
it? So the answer is it's fairly big. It is typically thousands to tens of thousands of instructions. It
needs to be a large enough number of instructions in this window that you would expect the
reorder buffer to flush and the memory buffers to flush on the chip so that it would at least hit the
cache, the last level shared cache.
Although it's big relative to these structures, it's actually quite small relative to how many
instructions have executed since the program started running probably. So we do know things
about quite a bit of the execution. We just don't know what happened exactly in the last 10,000
instructions or so.
Okay. So the idea is that as the program is running, we're capturing either through our fancy
hardware support or through maybe even binary instrumentation, you're logging or you're
watching all the instructions dynamically as they go by, and as you see these dynamic traces
what we want to do is divide the execution, the log, into these different windows.
So each window should contain at least W instructions, where W is the window size that we're
targeting, and we call these things epochs because everyone calls windows like this epochs. We
discovered that if you use the term epoch you immediately get a whole lot of related work
suggestions when your paper gets reviewed even if they have nothing to do with your paper, but
that's okay.
So the idea is you carve these into these different epochs, and one thing to know is you don't
need to precisely carve them because when you actually -- oh, second, this is not a barrier. So
the application is not stalled when we're making these cuts. This is something that happens in
the background. So we just take the logs after they're generated and we carve these things out.
Also, they don't need to be -- it doesn't need to be a precise cut. So it's okay that there's a little
bit of stagger between them as long as it's large enough that you have both W and whatever
amount of stagger there is. So the way you would really implement this is, for example, you
would just pass a token around between the different threads in the software and have a fence,
and from the time you started until the time that it comes back, you don't really start counting W
yet, and when it's made a complete round trip, then you count W instructions and then you know
you've hit an entire window and then you can send the token around again.
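A toy simulation of that carving scheme, assuming invented data structures and a deterministic round trip, just to show the shape of the boundaries:

```python
# Sketch: draw epoch boundaries in each thread's log without stalling the
# threads, by letting a token make a round trip and then counting W more
# instructions before cutting, so each epoch holds roughly W instructions plus
# whatever stagger the round trip introduced (final truncated epoch aside).
W = 5                                     # tiny window size, for illustration only

def carve_epochs(per_thread_logs):
    """per_thread_logs: one instruction list per thread.
    Returns, per thread, the indices where epoch boundaries were cut."""
    n = len(per_thread_logs)
    boundaries = [[] for _ in range(n)]
    cursor = [0] * n                      # how far each thread has executed
    while not all(cursor[t] >= len(per_thread_logs[t]) for t in range(n)):
        # Token round trip: each thread advances a bit while it holds the token
        # (a fixed amount here to keep the sketch deterministic).
        for t in range(n):
            cursor[t] = min(cursor[t] + 2, len(per_thread_logs[t]))
        # Then count W more instructions on every thread and cut a boundary.
        for t in range(n):
            cursor[t] = min(cursor[t] + W, len(per_thread_logs[t]))
            boundaries[t].append(cursor[t])
    return boundaries

logs = [[f"T{t}_i{i}" for i in range(20)] for t in range(3)]
print(carve_epochs(logs))                 # e.g. [[7, 14, 20], [7, 14, 20], [7, 14, 20]]
```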
Yes?
>>: So do you use some static information to put some initial guidelines [inaudible]?
>> Todd Mowry: For this?
>>: Yeah. Because it's like static analysis [inaudible].
>> Todd Mowry: In our experiments we just used machine parameters and we didn't look at the
code to worry about this because code typically -- well, you could try to adjust the window size or
think about optimizing it, but we were just erring on the safe side and made this a big number and
didn't try to think about resizing it. But I have some --
>>: [inaudible]
>> Todd Mowry: Oh, so you weren't -- so not specific to the epochs but just generally if we did
static analysis --
>>: No, for the epochs in particular, if you used some static analysis and then you combine with
the dynamic analysis you could save some performance overhead.
>> Todd Mowry: Well, that's very -- I mean, yeah. We haven't looked at that. I'm very interested
in combining static analysis with what we've done.
>>: Right, yeah. I remember you said this complemented it.
>> Todd Mowry: So the short answer is we didn't look at that for this. So we just did it all
dynamically. In a way -- I believe the right thing to do is to combine static analysis with what
we're doing. I've been in a way trying to make life hard for ourselves by not doing that static
analysis yet just to see if we can keep up and get decent performance where we don't use any
static analysis, but static analysis should only make things better.
Okay. So first we carve things into these epochs, and this doesn't slow down the application.
This just happens in the background as you capture these logs.
Okay. So then what do we do? Well, conceptually we can think about these logically time
stamped epochs, and relative to a particular layer of epochs if we go two layers either back or
ahead, then enough time has passed that we know that there's actually a real ordering there, but
if we think about adjacent ones, then those are too close. We don't know what the ordering is.
So what we really need to do is consider a sliding window of three of these different layers at a
time. So the way the analysis works is it just moves this window of three levels forward one at a time --
>>: Is that because you don't have the [inaudible]?
>> Todd Mowry: Actually, it's not even -- even if they were precisely cut, the issue is that an
instruction near the top of this overlaps some number of instructions into the --
>>: Got it.
>> Todd Mowry: It's just because of the boundaries.
>>: Yeah.
>> Todd Mowry: Okay. Oops. I apologize. The fonts here are way too small.
But the idea is that -- so, for example, if we're allocating memory here and dereferencing it down
here and over here on these different threads, in this case it's two epochs away, so that's okay.
We know that that's ordered and that there's not a problem here with respect to -- when you touch
this, this will definitely have already been visible.
But this one over here is too close. So for this case there's a good possibility, which is that it
actually saw the malloc, but there's a bad possibility, which is that it didn't and it actually
dereferenced this before the malloc was visible.
Okay. So that's the -- so, so far the idea is we're going to have these epochs and we're going to
think about these windows of uncertainty. But that's not the interesting part.
The interesting part is how we actually do the analysis. Okay. One other piece of information is
that things are a little bit -- there's some special information that we know, which is with respect
to, say, a particular epoch on a particular thread, we know that for the same thread in the
previous epoch, that that definitely occurred before the event here because it's on the same
thread. That's the usual thread ordering. And we also know that the things after it definitely have not
occurred yet.
But otherwise, for all the other concurrent threads, we don't know -- we have to assume that
anything could be interleaved in any way on those other threads with respect to any instruction
here.
So we kept drawing these things on the whiteboard and decided that that looked like a butterfly or
a dog bone, I guess, but that's why it's called butterfly analysis, not for any other good reason. And
people point out that our butterfly flies backwards because we put the head at the top of the
whiteboard, but it's actually moving this way. But forward data flow is flowing that way, though,
so it does go through the head and come out the tail.
Okay. So we have this -- here's our problem, which is we need to do the analysis considering
everything here to be concurrent. So what's the big deal?
Well, the big deal is a combinatorial explosion of possibilities. So the size -- these blocks are
large. They have tens of thousands of instructions, and we have a large number of these
threads, so there's a huge number of -- even though we bounded the window of uncertainty, the
number of possible combinations is enormous.
So we don't want -- for performance reasons, remember, we're trying to do this analysis on-the-fly
as the program is running without -- and we don't want to slow things down, so we can't afford to
do something that's super computationally expensive.
We also don't want to say to the lifeguard writer, well, okay, write the code here that happens to
know what to do with all of these concurrent things. Good luck with that. Instead we want to
have this framework where the framework knows how to deal with all the concurrency, and if you
just write down your analysis in a straightforward way it will just take care of all of this for you.
So there are all these possible interactions, and -- all right. So I'm a compiler person, and I tend
to think about data flow analysis and that type of thing by default, and I was staring at this and
thinking, wow, can I just somehow treat -- even though this is a dynamic piece of the log, what if I
thought about this like a control flow graph. Is there some way that I can do flow analysis across
this and borrow techniques from that?
Okay. Well, it turns out that that's not a good idea to do it in a straightforward way, and here are
some reasons for that.
First, if we just think about -- let's not think about everything. Let's just start by thinking about one
of these concurrent blocks. So the issue is that information can escape from this block and affect
any instruction here after every instruction in this block as opposed to in a normal flow analysis
where things only enter through the top and exit through the bottom. So that's a problem.
And, also, similarly, information can enter this one when we go to analyze it at the beginning of
each instruction. So you might think, well, okay, well, why not just represent -- take each
instruction and make it its own basic block?
Well, we could do that, but now we've got, you know, just a ridiculous number of little basic
blocks, you know, tens and tens of thousands of them now. So obviously that's going to be too
slow to compute, so that's not really practical.
And there's another bad thing about doing it this way. We realized that -- this is an example here
where one thread is executing along, and imagine that you're doing taint analysis to see whether
you're doing information flow tracking and you want to know whether X is tainted, and assume
that at the beginning of this it's not tainted. So we're executing along, we check it, this check
should be okay because it's only tainted after the check, and this other thread isn't tainting it.
But it turns out if we just treat this like a flow graph and let it crank away and look for -- solve the
problem, it can decide that there's a problem here, and the reason is that it can think that this taint
has propagated information up here and down through here and then back up over here and
down here again. So we'll actually get wrong answers if we try to do this.
Okay. So we don't want to do data flow analysis in that sort of traditional style, but the idea that
we had was that we started thinking about something that's called interval analysis or region
based analysis where you try to collapse a flow graph into, say, a single transfer function to
represent the entire thing. And you do this by reducing different components. For example, if
there's two nodes where one follows the other, you can compose them together, you can deal
with meet operations, and the most interesting thing in my opinion is the thing you do for loops.
So, for example, to make this back edge go away what we want to do is compute the closure of
the transfer function to represent anywhere from zero to any number of instances of this transfer
function being applied. But the idea is with this closure operation and region based analysis, in
one shot you can summarize the net effect of any number of iterations through this loop. And that
felt very much like what we were trying to do with concurrency.
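To make the composition-and-closure intuition concrete, here's a tiny gen/kill sketch; the expressions and sets are made up, and this is the textbook interval-analysis idea rather than the butterfly equations themselves:

```python
# Represent each block's effect as a gen/kill transfer function
# f(x) = gen | (x - kill), compose sequential blocks into one function, and
# compute a closure summarizing "zero or more" trips around a loop by meeting
# repeated applications until a fixed point.
def apply_tf(tf, x):
    gen, kill = tf
    return gen | (x - kill)

def compose(tf1, tf2):
    """Transfer function for running tf1 and then tf2."""
    gen1, kill1 = tf1
    gen2, kill2 = tf2
    return (gen2 | (gen1 - kill2), kill1 | kill2)

def closure(tf, x0, meet):
    """Summarize applying tf zero or more times to x0, meeting the results
    (meet = intersection for a 'must' analysis like available expressions,
    union for a 'may' analysis like reaching definitions)."""
    result, current = x0, x0
    while True:
        current = apply_tf(tf, current)
        merged = meet(result, current)
        if merged == result:
            return result
        result = merged

stmt1 = ({"a+b"}, set())                 # evaluates a+b
stmt2 = (set(), {"c*d"})                 # writes an operand of c*d, killing it
loop_body = compose(stmt1, stmt2)        # net effect of one loop iteration
available_at_entry = {"c*d", "x-y"}
print(closure(loop_body, available_at_entry, set.__and__))  # {'x-y'}
print(closure(loop_body, available_at_entry, set.__or__))   # {'x-y', 'c*d', 'a+b'}
```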
So the idea is we're -- our solution was inspired by that thought. So it's different, though, because
in interval analysis we're dealing with a static control flow graph and we're computing the closure
for a loop. In our case this is dynamic chunks of a log, and the closure is with respect to all of
these concurrent blocks in the wings of the butterfly.
Another thing that's different is that information can enter and escape anywhere in the middle of
these blocks. So that also makes it different.
But similar to flow analysis, we wanted to have this standard framework where if you wrote your
problem in a certain way, the framework will just take care of all the details for you.
Okay. So in the paper we go through different examples of data flow problems and lifeguards
and show how to adapt the ideas of flow analysis to work here. And one of the things that's quite
different, as I've been mentioning in the past, normally you have in and out and you might
generate and kill different pieces of information, and we had to introduce new concepts. Our new
concept we called Side-Out and Side-In. Sounds like we spent too much time playing volleyball
here.
But that captures -- Side-Out captured the fact that information can escape after each instruction
and have side effects on other instructions, and then Side-In captures the opposite of that, that
information can flow into a block before any instruction and not just at the beginning of it.
So it's these things that make the equations look fundamentally different from what you're used to
seeing with normal data flow analysis on control flow graphs.
>>: So those two steps are sort of treating the statements in a flow-insensitive fashion. I mean,
that's another way of looking at it. You're basically saying there are all these [inaudible] possible,
and you have a flow insensitive step interleaved with a flow sensitive step. So within each thread
you need to be flow sensitive but sort of across, these side things are like flow insensitive.
>> Todd Mowry: Yes.
>>: Gotcha.
>> Todd Mowry: So -- okay. So what we do is, in this framework, the person writing the tool,
they have to specify various things. They have to talk about which instructions and which system
calls or library calls are interesting that will cause the metadata to be changed or checked, and
you have to talk about how to represent the metadata, and you also have to specify a meet
operation similar to normal data flow analysis.
And then the framework operates in two passes. What happens is we make -- oh, and to be
specific, what I mean is there's this sliding window. As the window slides to the next layer, in total
we will walk over each layer twice, but that doesn't mean we record the whole log and then go
back and start it over from the beginning. We just do this two passes as we're sliding the window
forward.
In the first pass over a block we compute the effect of its instructions on all the other concurrent
instructions within the three layers, and on the second pass, that's the flow sensitive one where
we gather all that information together and then we can walk down through the thread and decide
what the real state of the world is at that point.
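A highly simplified skeleton of that two-pass, sliding-window structure, with the lifeguard supplied as callbacks; this shows the control flow only and is not the paper's exact Side-Out/Side-In equations:

```python
# Pass 1 summarizes each block's Side-Out; then each middle-layer block folds
# in a Side-In (the meet of the Side-Outs of the blocks it must treat as
# concurrent) during a flow-sensitive second pass.
def run_window(window, summarize, meet, resolve, thread_state):
    """window: three epoch layers (previous, current, next), each a list of
    per-thread blocks; thread_state: per-thread flow-sensitive state."""
    # Pass 1: compute every block's Side-Out for this window.
    outs = [[summarize(block) for block in layer] for layer in window]
    mid = 1
    for t, block in enumerate(window[mid]):
        # Side-In: meet over the Side-Outs of other threads' blocks in all
        # three layers (the "wings of the butterfly").
        side_in = meet([outs[layer][u]
                        for layer in range(3)
                        for u in range(len(window[layer]))
                        if u != t])
        # Pass 2: flow-sensitive walk of the thread's own block, accounting
        # for what concurrent code may have done.
        thread_state[t] = resolve(block, side_in, thread_state[t])
    return thread_state

# Toy instantiation: events are sets of facts, Side-Out is their union, the
# meet is union, and resolve just accumulates everything it sees.
window = [[[{"w0"}], [{"x0"}]],           # layer 0: thread 0's block, thread 1's block
          [[{"w1"}], [{"x1"}]],           # layer 1 (the one being resolved)
          [[{"w2"}], [{"x2"}]]]           # layer 2
print(run_window(window,
                 summarize=lambda blk: set().union(*blk),
                 meet=lambda outs_list: set().union(*outs_list),
                 resolve=lambda blk, side_in, st: st | side_in | set().union(*blk),
                 thread_state=[set(), set()]))
```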
Okay. So then to illustrate this, although this isn't an interesting lifeguard, it's the simplest thing
we could come up with that was a stand-in for a real piece of analysis, which is just available
expressions. That's something that compilers compute to look for redundancy. And so I'll show
you what that looks like and then I'll show you how that can be applied to a real lifeguard.
But the idea behind available expressions is that it's -- an expression is available, something like
A plus B is available if it's evaluated along every path by which it can arrive at a certain point. And this
is interesting because it has to be true for all instances that arrive somewhere, and a lot of the
analysis tools care about that. What is true for all possible interleavings?
Okay. So for available expressions let's say the expression of interest is A minus B, and so we
see it here, here and here, and let's say we want to know whether it's available right here.
So we carve things into epochs and imagine that it ends up looking like this, and also imagine
that at the very beginning, before we reach this point, A minus B is available along -- across all
these different threads.
So intuitively what's the answer that we want to get here? Well, if you just look up here it may
look like everything's good except there's a problem down here, so down here we're rewriting B,
and so there's a path where this is not available. So if we go here to here to here and there,
which is a possible interleaving, then it's actually not -- A minus B -- the most recent A minus B
has not been computed at this point because B has changed.
So we wanted to get the answer that this is not available in this specific example. Okay. So as I
said earlier, there are two passes. In the first pass we're going to summarize the effect of a block
on the other blocks. And this boils down to computing the Side-Out or the side effects of it
coming out the sides as well as the bottom.
So in this case the net effect of this block is that it's going to kill A minus B because it's rewriting
B. So that will end up showing up in the Side-Out for this block.
So we do that for -- we've done that for all the different blocks. You can see down here that its Side-Out contains the fact that it's going to kill A minus B. So that's our first pass. We've
computed this.
And then in the second pass we collect all these things together and we do a meet of all of that
and then we can summarize -- we have a summary of all of the side outs, and given how meet
occurs, you'll kill something if it's killed anywhere, so killing A minus B ends up in the input after
this -- that we get after this meet, and then we can do our second pass where we now realize that
this is not available, because even though it was available at the point where we started to enter
this block, probably it's getting killed because of the concurrent accesses, so it's not available.
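Here is that outcome boiled down to a few lines of Python; the block contents are invented to mirror the slide, and the single-letter operand test is only good enough for this toy:

```python
# Available-expressions sketch: "a-b" is available on entry, but a concurrent
# block within the window writes b, so after the meet it cannot be counted as
# available at the check.
EXPR = "a-b"

def kills(block):
    """Side-Out for this analysis: expressions the block kills (an expression
    is killed when one of its operands is written; the substring test below is
    only adequate for single-letter operands in this toy)."""
    killed = set()
    for stmt in block:
        lhs = stmt.split("=")[0].strip()
        if lhs in EXPR:
            killed.add(EXPR)
    return killed

available_on_entry = {EXPR}
own_block        = ["x = a-b", "check a-b"]     # the thread doing the check
concurrent_block = ["z = a-b", "b = b-1"]       # the wing that rewrites b

# Pass 1: the concurrent block's Side-Out says it kills a-b.
side_in_kills = kills(concurrent_block)
# Pass 2 ("must" reasoning, i.e. intersection-style): anything some possible
# interleaving of concurrent code kills cannot be counted as available.
available_at_check = available_on_entry - side_in_kills
print("a-b available at the check?", EXPR in available_at_check)   # False
```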
Yes?
>>: So the [inaudible] killed A minus B, but did it happen to take into account the fact that Z -- you know, you had this path going to Z equals A minus B and then B equals B minus 1, but is that
necessary [inaudible]?
>> Todd Mowry: No. That was just -- that's not the only path.
>>: Okay. I see.
>> Todd Mowry: It's actually just simply a fact that --
>>: [inaudible]
>> Todd Mowry: That was just an illustration that touched everything.
Yes?
>>: So how does this analysis work together with a machine that works based on chance? So
you can do this [inaudible] in hardware and either you can do these intersections [inaudible] and
say these two epochs do not access even the same [inaudible]. So wouldn't it simplify your
analysis a lot?
>> Todd Mowry: Yeah. Actually we do that in some of the lifeguards. That's a very good
observation. That's a good trick for knowing that -- to accelerate things, because if you know that
there's nothing that overlaps -- if the read and write sets don't overlap, then you don't need to
actually do the second pass. So we can short circuit the second pass in some cases. And that's
something we want to explore much more in our future work is ideas like that for accelerating it,
taking advantage of that, plus static analysis and other ways to memoize things. I think
think that there's a lot of potential for that.
And you could even -- if there was hardware support, like some kind of bloom filter or something,
you could use that, too. So you could, in the first pass, load up these bloom filters and exchange
them and then just do a quick test in hardware to see whether there's any conflict. That might be
a nice way to do the checking too.
>>: So how about if you actually -- let's say that these guys actually communicate, but if they
communicate you squash one of the guys. So basically you enforce the framework do not
communicate, so [inaudible] approach to this.
>> Todd Mowry: So something more transactional or --
>>: Yes.
>> Todd Mowry: Yeah. So in fact that's another possibility. We were -- we didn't want to -- well,
we weren't thinking about -- our philosophy was to take whatever machine model or memory
model the machine had and look for problems. In fact, as we -- as we started doing this, it's really
tempting to think of ways to simplify the problem and say, well, if we have properly synchronized
programs then we could play this trick and that trick, but I always remind the group it's exactly the
programs that aren't properly synchronized that we have to be paying attention to. But, anyway,
that's a tangent there. That's an interesting thought.
Ben?
>>: So is this analysis happening dynamically every time you want to analyze this effect?
>> Todd Mowry: Right. So there's a lot of potential for reusing things. That's, like, number one
on our list is to take advantage of memoization, because you keep -- you know, if you're in a loop,
you're going to see the same code path again and again. Some things will be a little different,
maybe some data addresses will be different, but right now we're actually regenerating everything
on-the-fly all the time. So the overheads are much higher right now than they need to be in this
paper.
Let's see. Okay. So in our ASPLOS paper we looked at available expressions and reaching
definitions. They're interesting because one of them takes the intersection and one takes the
union of the possible inputs, and you can apply those concepts to understand how to do real
lifeguards, although there are other extra details that make those even more interesting.
And what we did in the paper is we have proofs showing that there will never be false-negatives
in the analysis, but there can be false-positives certainly. And in fact I've got numbers on that
coming up in the results.
And also you can apply this to a machine that has relaxed consistency. It doesn't depend on
sequential consistency.
Okay. So just very quickly, I won't go into a ton of detail on this, but looking at address check,
this is a real lifeguard, and it's looking for memory allocation bugs. And I talked about it a little bit
before. And it's a little bit like available expressions. You want to know whether something is
true in all cases.
So, for example, here if we are dereferencing Q and then allocating something for P and
dereferencing it and freeing something over here, you know, we start with our epochs, we do our
analysis, and after the first pass we might decide things like this block is mallocing P and not
freeing anything, and we compute this everywhere. So down here we realize that we're freeing
Q, and we exchange all this information after the first pass, we do our meet, there's a little fancy
animation, and now we know that the beginning of this block, that somehow concurrently
something is freeing Q and nothing is mallocing anything. And now when we do the second pass
we know that there's a problem here when we dereference Q because down there there was a
free of Q. So this is a problem.
There's tons more detail in the paper if you're curious.
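A small sketch of that reasoning in code, with invented block contents mirroring the example; this illustrates the conservative check, not the actual lifeguard implementation:

```python
# Within the window of uncertainty, a concurrent block frees q, so the
# dereference of q in this thread's block must conservatively be flagged.
def side_out_heap(block):
    """Pass 1 summary: which locations a block may malloc or free."""
    mallocs = {x for op, x in block if op == "malloc"}
    frees   = {x for op, x in block if op == "free"}
    return mallocs, frees

def check_block(block, concurrent_frees, allocated):
    """Pass 2: flow-sensitive walk of one thread's block, treating anything
    freed by concurrent code as possibly already freed."""
    for op, x in block:
        if op == "deref" and (x not in allocated or x in concurrent_frees):
            print(f"possible error: dereference of {x} "
                  f"(freed or unallocated under some interleaving)")
        elif op == "malloc":
            allocated.add(x)
        elif op == "free":
            allocated.discard(x)

allocated = {"q"}                                   # q was allocated long ago
my_block        = [("deref", "q"), ("malloc", "p"), ("deref", "p")]
concurrent_wing = [("free", "q")]                   # concurrent block in the window

_, concurrent_frees = side_out_heap(concurrent_wing)
check_block(my_block, concurrent_frees, allocated)  # flags the deref of q
```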
Okay. So now onto some results. So we built the code to do this, and we ran our experiments on
top of our logging infrastructure. And the way that that works is we're emulating the hardware
changes inside of Simics, but on top of that we have a full software stack running on it.
And we measured 4-, 8-, and 16-core CMPs where we use half of the cores for the application
and the other half for the lifeguards, and we looked at both 8K and 64K epoch window sizes. The
first set of results are for the 64K window size.
Yes?
>>: So I didn't understand what kind of [inaudible].
>> Todd Mowry: Yes. So in the first few slides I was talking about the logging hardware for the
project, but I forgot to mention that the butterfly work doesn't actually really need any of that. You
can do all the butterfly stuff just with binary instrumentation. It doesn't require any new hardware.
We just used this because it just makes everything run faster, so everything would be 20 times
slower if we did it with binary instrumentation, but you can do it that way.
The good news is it would still be a win for binary instrumentation because you could at least run
it in parallel. I mean, your tool is going to start off 30 times slower, but it can at least speed up
with more cores, whereas if you time slice it's going to stay 30 times slower and never get faster
even if you have more and more threads.
>>: But there's no additional hardware to support butterfly analysis [inaudible]?
>> Todd Mowry: Yeah. In fact, a goal of the analysis was to not require any special hardware
support. So it works on existing hardware. It just -- like I said, in our experiments we went ahead
and used our logging because we like it, but we didn't need to do that.
Okay. So here are a couple of applications that we were looking at, and along the x axis for each
application this is the number of cores running the application, and there's the same number also
running the lifeguard. And the y axis is time, normalized time. And it's normalized to the
application running by itself on one core.
So the blue bars show time when we run the application in parallel by itself without monitoring.
So it gets faster because we get parallel speedup in most cases. So that's good.
And we're only going to make things worse than this because monitoring isn't going to speed
anything up, it's just going to make things slower.
Okay. So the next bar in red is the time sliced approach. So if we just took those concurrent
threads and made them run interleaved on one processor and then used one lifeguard to monitor
it, we get the execution time shown in red. And, not surprisingly, this doesn't typically ever run
faster than one processor. Sometimes it runs much slower due to a variety of effects like partly --
>>: [inaudible]
>> Todd Mowry: Due to measurement error. It shouldn't ever run faster. But basically this isn't a
good number. You want to compare this performance with the performance of the blue bar, and
you can see that it's way slower than the performance without monitoring. So this doesn't look
very good.
And then, finally, the green bar is with the butterfly analysis, and so we actually can speed up. So
butterfly analysis is not -- does not have zero overhead. It's a little worse than the blue case. In
some cases it's actually significantly worse due to the fact that we've hardly done anything to
optimize to take advantage of memoization or other things like that.
I think in our ongoing work -- in our future work we're exploring quite a few ideas to reduce this
sort of constant overhead of what we're doing, but the good news is that it speeds up and it
scales. So like this is a nice case here where it's getting a lot faster.
Yes?
>>: [inaudible]
>> Todd Mowry: Yeah. So to make that comparison what you could argue is that really I should
be comparing the performance of running the application on 4 cores and using 4 cores to do the
analysis with the application running on 8 cores. And the data is actually there. You can just say
compare this blue bar with that green bar.
So it does make it look even a little bit worse, but it's better than time slicing still. And the other
argument for that is -- well, but if your program is buggy, you would rather run the tool and find
the bug. I've had that discussion many times with people at Intel. I don't know if that was the
point you were making or not, but sometimes people complain that we're using half the cores to
do the analysis, and we actually are pursuing an idea where we think we can dramatically reduce
the overhead for doing that. But that's still ongoing work.
Ben?
>>: It seems like at some point you get to the conclusion that having [inaudible] language that
has run time checking and takes [inaudible] is part of the language run time is way better than
this, because arguably you are determining certain properties as address check properties, but at
a high cost. I mean, at this point it's like half the cores --
>> Todd Mowry: Actually I even hate having address check up here at all because it's a really
bad motivation for doing this kind of analysis. There are other ways to get memory safety
checking, but I think that security checking and concurrency checking, which are things we've
looked at also, those are much more legitimate. I think security in particular, doing information
flow tracking, that's -- but, yeah --
>>: It's an interesting thing because I think so -- you know, I don't think I've seen this case lately,
but I think you could make the case which is that if you start from a bad position -- in this case,
you know, language that doesn't have garbage collection or whatever -- you know, the cost of
getting back to that, especially as you go to concurrency and these kind of things, it really doesn't
scale, you know. You sort of lose too much at that point, especially -- and, you know, it just
reminds me of security properties. I mean, in a similar way the sort of work going on here
[inaudible] which embeds the ability to [inaudible] security properties, again, in a type system. So
it's -- you know, you get -- I mean, it's sort of like what you're saying, too, about doing static
checking. The more static checking you can do, the less of this you have to do at run time.
>> Todd Mowry: Yeah. The ideal thing would be to have type safe languages and then only add
this in for whatever else is left. And then the numbers will probably look really good because
there will be way less to have to check. But there will probably be some room for some dynamic
things that you would want this to look at. You can still have bugs in type safe programs, I
believe.
Yes?
>>: [inaudible] so you are collecting information from the user's machine so you can be able to --
>> Todd Mowry: Yeah, you would definitely -- you'd certainly imagine using it that way. We have
a couple of goals. One of them is, yes, you can collect better debugging information to send back
to the developer so they can fix the bug. But we're also hoping long-term that if we could catch
problems early enough, that we could do something in the software to not -- to maybe work
around the problem or mitigate the damage. We wouldn't necessarily be able to fix the problem,
but like if there was -- if we detected a malicious packet arriving, maybe we could rewind the
system and kick it out and run forward without it, or if there's a data race maybe or something like
that, maybe we can back up and do that type --
>>: [inaudible]
>> Todd Mowry: Yeah. So what we envisioned is having some recovery mechanism combined
with this. That isn't something that I'm talking about yet or that we've looked at, but that was part
of our long-term -- that was actually what started the whole project is we thought about doing
recovery and trying to have systems that were more robust, but then we got a little bogged down
in figuring out how to find bugs. Turns out that that's a hard problem.
Okay. So the green bars are much better than the red bars and, in many cases, not too far off
from the blue bars. In some cases the overhead's larger. There's actually a case here where we
didn't actually quite beat the red bar, but that's mostly just because we start off with a very large
amount of overhead. We are scaling very nicely, and if we can do some of the obvious things to
reduce the constant overheads, I think the performance would look much better. But as you go to
more and more cores, the scaling benefits of this will become more and more attractive.
Okay. I'm getting close to the end here.
So one of the things that I mentioned is that this is conservative analysis. It doesn't have
false-negatives, but it can have false-positives. And I also said that we looked at different window
sizes.
So what you can do at a high level, having a larger window size is better for performance
generally because there are some constant overheads every time you kind of move from one
window to the next, so you can mitigate that a bit.
So in these experiments the purple bar is for a smaller window size of 8K instructions and green
is 64K instructions. So generally the larger window is faster. This is execution time here.
But that comes at a cost, which is false-positives. The larger the window size, the more things
get mixed in there. Then you can see that the false-positive rate goes up. In one case, ocean, it
gets to be probably unacceptably high.
Yes?
>>: These are fractions of all memory accesses that are falsely categorized as errors?
>> Todd Mowry: Yes.
>>: So how does this compare to the true error? I mean, if I got a report, what's my -- what's the
probability that it's a real bug versus it's a false-positive?
>> Todd Mowry: Oh, I think I may have that in a backup slide. Let's see. I can either jump there
now or maybe get it at the end. I think we have that answer. I'll get back to that at the end.
But anyway, this is a parameter that you can play with. You can trade off performance and
false-positives with this, and we've got ideas, and in just a few slides I'm going to talk about
another idea to build upon that.
Okay. So that basically talks -- that covers what we discussed in the ASPLOS paper, but now I'll
very briefly talk about where we're going with this and what we've been working on since then.
So the themes of what we've been working on are improving precision and improving overhead -- I'm sorry, improving performance. So we have false-positives, we'd like to have fewer of them,
and we're not quite keeping up with unmonitored execution, and we'd like to do something about
that also.
So the first thing that we're looking at is that we are -- some of the false-positives that we get are
due to the fact that in the initial analysis, we just ignored synchronization completely. So if you
had a case like this, this should not appear to be a taint problem, but we would think that it was.
So what we're doing is we've been -- the paper we're going to submit this summer to ASPLOS,
we've extended our analysis, and the basic idea is that whenever there is explicit synchronization
in the middle of a block, we basically break it into sub-blocks and then we have a way of meeting them
together. We basically merge two inputs before we process this other block so it's not -- at a high
level it sounds kind of straightforward, but there were a few details to work out in our equations to
make this look clean and elegant still. And we used vector clocks also. And so that's one of the
things we've been working on.
Lots of animations here.
Ah. Now, the next thing we're doing which I'm actually really excited about is that by default, with
our conservative analysis, you get -- a false-positive looks the same as a real -- a true positive,
unfortunately. But what we can do is instead of just having a binary state, something tainted or
untainted, we can create another state which is a maybe tainted state.
So actually if I back this up a second, in this case here one possibility is that there's a taint and
then an untaint and a read. In that case it's good, it's not tainted, but another interleaving is that
we do the untaint and then the taint sneaks in here between these two things and then we do the
read and then it's tainted. So both possibilities exist within this window of uncertainty.
But in our analysis we can recognize that and flag it separately and say here's a case where
we've seen both possibilities. So that would be a third state. So we'll have definitely untainted,
definitely tainted, and possibly tainted, meaning that there's -- both possibilities exist.
And I think that this is very interesting because that state -- well, first of all, once we model that
we'll immediately be able to bound the false-positives because a false-positive would have to
show up as this case. It wouldn't be that case.
And also we want to use this as a -- this is a metric that we can use to dynamically adjust our
precision. If we see a lot of these maybe cases occurring, then we may want to dial down the
window size or do other things like that to do more precise analysis. So that's another idea that
we're pursuing.
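A minimal sketch of that three-valued idea; the lattice and merge rule here are just an illustration of the concept being described, not the exact formulation under development:

```python
# Three-valued taint state: interleavings that agree keep a definite verdict,
# interleavings that disagree within the window of uncertainty yield MAYBE.
UNTAINTED, MAYBE, TAINTED = 0, 1, 2

def merge(outcomes):
    """Combine the taint outcomes of the interleavings possible in the window."""
    if all(o == TAINTED for o in outcomes):
        return TAINTED
    if all(o == UNTAINTED for o in outcomes):
        return UNTAINTED
    return MAYBE          # disagreement: bounds where a false positive can hide,
                          # and can trigger a more precise (smaller-window) re-check

# One interleaving untaints x before the read, another lets the concurrent
# taint land first:
print(merge([UNTAINTED, TAINTED]))   # 1 == MAYBE
print(merge([TAINTED, TAINTED]))     # 2 == TAINTED
```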
>>: But isn't the whole premise here that the program in normal execution mostly works, which
by definition would mean that things that always end up tainted would probably have been
detected, right? In some sense, you know, the maybe tainted only occurs because, you know,
there is one interleaving that's bad, but it's not the common case, right? So that's a bug. But a
bug that occurs in every execution is pretty easy to fix in some sense, right, in a concurrency? So
I understand the idea here, but I'm not sure that for real bugs it would actually help separate the
false-positives from the real bugs.
>> Todd Mowry: Well, what we wanted to do was when we see the maybe tainted, what we
might want to do is even, say, rewind our analysis and try to do other more expensive things to try
and definitively tell what is this thing really, is it really a problem or really not a problem. That's
what -- the idea, rather than just saying, wow, and kind of hold on to a maybe tainted case.
On-the-fly we want to zero in on that.
And then one thing you could do with this is to dynamically adjust the resolution of the window. If
you see a lot of these maybe tainted cases, then maybe we should make the window smaller to
try to have less noise in our results. Even though that's going to slow things down a little bit, it
might be worth it.
Yes?
>>: How close do you run these window sizes? Kind of the minimum of what you're forced to
use?
>> Todd Mowry: Right now not really at all. Right now they're much, much larger than they need
to be. There's another issue. If you make it too small, then you can start getting false-negatives.
There are other things that we can do in this case. We can start inserting other -- under the
covers we could start adding other markers or things to track what's really happening, basically -- in the hardware-based approach, the hardware will pass extra information to track the real
communication all the time. We could do that in software. It would be expensive. But we might
think it's worth doing if we start seeing cases like this, and we could zero in on those specific
cases and maybe get more precision that way but only pay for it when we really need it. So that's
the idea.
Okay. So, finally, last slide, so the idea is this is a way to do parallel monitoring that was inspired
by interval style data flow analysis. I think it has some nice properties because if you write down
your problem in this style, the framework will take care of all the unpleasant details of thinking
about concurrency and running reasonably efficiently, and the fact that information goes in and
out of the sides of blocks was one thing that made it new and interesting for us and we saw some
decent performance. And with the larger window sizes, the false-positive rates weren't too bad,
and we're excited about the things we're looking at in the future with this.
So that's it.
>> Tom Ball: Thank you very much.
>> Todd Mowry: Thanks.
[applause]