>> Jim Larus: It’s my pleasure to introduce Brandon Lucia from the University of Washington. He’s a grad student working with Professor Luis Ceze, and he’s finishing up this year. He’s done work I think with a number of people in this room before. He’s going to give a talk on his Ph.D. research, so Brandon.
>> Brandon Lucia: Cool. Alright thanks Jim for the introduction. Today I’m going to talk about my
research on using architecture and system support to make concurrent and parallel software more
correct and more reliable. So as Jim mentioned this is the work I’ve done during my Ph.D. at the
University of Washington with my advisor Luis Ceze. So, I’m going to begin my talk today by talking
about what are the key challenges that I’ve looked at in my research. The key challenges that I’ve
focused on are those posed by concurrency and parallelism. In particular what the impact of
concurrency and parallelism is on the problems of correctness and reliability.
Then I’m going to dive into the key research themes that show up in my approach to solving these
research challenges. The focus of my work is on using architecture and system support to solve these
problems. Then after that I’ll dive into a couple of my research contributions in some more detail to
give you an idea for the kind of projects that I like to work on. At the end of my talk I’m going to talk
about some of the things that I’m interested in going forward.
So, first I’ll talk about the key research challenges that I’ve looked at. So, I’m going to show you now
that concurrency and parallelism are essential and they’re unavoidable. There are two main reasons
why concurrency and parallelism are essential and unavoidable. That is that there’s pressure from the
bottom because technology is changing and there’s pressure from the top because the applications that
people care about running on computers are changing.
I’ll talk about technology first. The best example of a change in technology is the shift to multicore
devices. Everyone has a multicore in their phone, in their laptop, and everywhere else. In order to get
energy efficient and high performance computation out of a multicore you need software that exposes
parallelism down to the multicore. At the other end of the spectrum we have warehouse-scale
computers. These are the things that run our data centers and to fully utilize these and get energy
efficiency and performance you also need to write software that maps computation across all the nodes
in a data center.
So, we see a need for parallelism because of the shift in technology. This is also true because of the shift
in applications that we’re seeing. Mobile applications and server and Cloud applications are two examples of this. Mobile devices are little devices; they run on a multicore and they’re powered off a battery. In order to get energy efficient computation out of a multicore you need software that utilizes that multicore. That kind of software requires that there’s parallelism to map down to the multicore. So mobile applications demand this especially because energy efficiency is important when you’re running off a battery, and in the server and Cloud domain you’re running on warehouse
scale computers so you need parallelism to get performance. In addition, there’s also some concurrency
constraints here. Mobile applications need to communicate with Cloud applications and in Cloud and
server applications you need to coordinate sharing of resources across simultaneous client requests.
So there’s really a need because of these applications for concurrency and parallelism. So there is a
need but why is that an interesting research question? So to talk about that we can look back at the
model that we have, we’re all familiar with this model, for sequential programs. In a sequential
program there’s one thread of control and execution hops through that thread of control by doing a
series of steps like A and then B and then C. If you’re a good programmer you write your software and
then you give it an input and you run the program and you see what the output is. If you try enough
inputs and you see enough outputs that match the specification that you have in your head, or you have
written down somewhere then you can put that software out in the world and have some assurance
that it’s correct.
However, the story is different when we look at multi-threaded software. Multi-threaded software,
shared memory multi-threading is the most common idiom these days for writing concurrent and
parallel programs. In multi-threaded software there’s not just a single thread of control. There’s
multiple threads of control and they can interact by sharing, reading and writing a shared memory
space, and explicitly synchronizing with one another to order events in different threads. Any events in
different threads that aren’t explicitly ordered with synchronization execute independently.
This leads to what we call the nondeterministic thread interleaving problem. The nondeterministic
thread interleaving problem manifests in the following way. If we take this program and we give it an
input and we run it, and we see its output we might get one output on the first execution with that
input. If we take the same input, run the program again we could get a different output. The reason is
that independent operations in different threads could execute in any order in those executions,
potentially changing the result of the computation.
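As a concrete illustration of the interleaving problem just described, here is a minimal C++ sketch. This is an editorial example rather than code from the talk: two threads increment a shared counter without synchronization, so the same input can print a different result from run to run.

```cpp
// Minimal illustration of nondeterministic thread interleaving: the counter is
// intentionally left unsynchronized, so increments from the two threads can
// interleave and updates can be lost.
#include <iostream>
#include <thread>

int counter = 0;  // shared, deliberately unsynchronized

void work() {
    for (int i = 0; i < 100000; ++i) {
        ++counter;  // read-modify-write that races with the other thread
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    // Expected 200000, but the printed value can differ between executions
    // because the interleaving of the increments is nondeterministic.
    std::cout << counter << std::endl;
}
```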
So the nondeterministic thread interleaving problem has several implications. The first is that these
programs are hard to write. They’re hard to write because it’s hard to understand how different
interleavings of independent operations will impact the execution of the program. These programs are
hard to debug. Bugs might only manifest as failures in some program executions. When they do it’s
hard to reason about what the effects were that actually led to the failure.
Finally, testing these programs is currently infeasible because testing requires looking not just at the
space of all possible inputs but also at the space of all possible interleavings. Although there have been
some advances in recent years in the area of testing multi-threaded software by some people in the
room. So, it’s getting better but at the moment it’s still infeasible to comprehensively test these
programs.
Just to show you this is not an academic daydream. This isn’t just something we do in the lab. Here are
three examples from recent headlines that illustrate that concurrency bugs cause problems in the real
world. We’ve seen infrastructure failures, security holes that have led to millions of dollars being stolen,
Amazon Web Services went down. So this is a serious problem and these problems were all the result of
concurrency errors in software.
So, they’re difficult to write, they’re hard to debug, and they’re infeasible to fully test. So bugs are going
to find their way into production systems and cause those kinds of problems. What we want is that
these programs are simple to write. We want them to be easier to debug so when we have bugs we can
fix them. We want them to be reliable despite the fact that we can’t comprehensively test them and
bugs might find their way into production systems.
So, I’ve just identified three key research challenges, programmability, debugging, and failure avoidance.
These are the challenges that I’ve looked at during the work that I’ve done on my thesis. So, now I’m
going to talk about the themes that show up in my approach to solving those key research challenges.
There are four key themes that I’m going to talk about.
The first is looking at system and architecture support across the system stack. Second I’ll look at
designing new abstractions that allow us to develop new analyses. Third I’m interested in systems that
leverage the behavior of collections of machines. Finally, I’m interested in mechanisms that are useful
not just during development but for the lifetime of the system.
So first I’ll talk about my approach to using architecture and system support. So many people see this
slide and they see the word architecture here and they think, this guy works on hardware. Hardware is
part of architecture but the way that I think of architecture is that hardware is where the architecture
begins. We have to think about the interaction between hardware and the compiler, and the compiler
and the operating system, and the hardware and language run times, and programming languages and
software engineering tools, and even things at the application level, like statistical models of the
behavior of a program. So my approach to architectural support is to look across the entire system
stack.
The second theme that I’m going to talk about is the use of new program abstractions that allow us to
develop new analyses that are more powerful than prior work for solving some problems. An example of a new abstraction from my work is the use of something called a context-aware communication graph. I’m going to describe this in more detail later but I’m showing you now because this is an abstraction. Context-aware communication graphs correspond to the interactions between threads that occur when a program is executing. This is important because it allowed us to develop a new analysis that helped us find concurrency bugs better than prior techniques.
The third theme is that I like to take advantage of the behavior of collections of systems. The way that I
do this is by collecting data on individual machines and incorporating it together into models that help us find bugs when they show up as anomalies, help us predict where failures might happen so we can avoid those failures, and feed information back to the programmer so they have a better set of data to work from when they’re trying to fix bugs.
Finally, I’m interested in systems that remain useful, not just during development but also in deployment
for the lifetime of the system. So we’ve developed mechanisms that help during debugging and these
have remained useful in deployment because they feed information from deployment machines back to
developers. Failure avoidance mechanisms are useful during deployment because they continue to
provide value when systems are running in production. So, I’m interested in these kinds of mechanisms
especially when they’re hardware mechanisms that provide benefit for the lifetime of the system.
So, I’ve used these themes to address these key research challenges. My publication record shows that
I’ve worked in all three of these areas. It also shows that I’ve worked across the layers of the system
stack. I have papers that showed up at MICRO and ISCA which are architecture conferences but also at
OOPSLA which is a programming systems conference and PLDI which is a conference on programming
language design and implementation.
I’d like to talk about all three of these today. But, unfortunately I’ll only have time to talk about my
work on debugging and failure avoidance. I’m going to do that now by jumping into two of my research
contributions in more detail to give you an idea for the projects that I’ve done in those two areas. There
are two efforts that I’m going to focus on. The first is a project called Recon; this is a project in which we
used architecture and system support to make it easier to debug concurrent programs. The second is a
system called Aviso. Aviso is a technique that enables production systems to cooperate by
sharing information so that they can avoid failures that are the result of concurrency errors.
So, I’ll talk about Recon first. If the lighting was a little better in here you’d, so Recon is a technique for
finding bugs in programs. If the lighting was better you’d see that on this antelope there are birds that
are finding the bugs and pulling them out of the antelope’s fur, so, ha, ha, yeah. Okay, so to…
>>: [inaudible]
>> Brandon Lucia: What’s that?
>>: Is that what advocating birds or bugs on…
>> Brandon Lucia: That’s actually, that’s my future work section. I’m going to see if we can recruit
pigeons that are a nuisance in the cities and we can put them to work. So, in order to understand a
technique that helps with concurrency debugging I unfortunately need everyone to just read a little bit
of code. The code’s pretty simple…
[laughter]
>>: [inaudible]
>> Brandon Lucia: Ha, ha, I know how much everyone loves reading code but it’s going to help a lot. So
there are three threads in the program. The program’s pretty simple, green is setting some flag called
ready equal to true and then it’s setting a shared object pointer equal to some new object. Blue checks
that flag to see if it’s true and then takes a copy of that shared object pointer. The programmer when
they wrote this they had this invariant in mind. Whenever ready is true shared object is going to be a
valid pointer. You can see that they implemented this incorrectly if you are good at reading this kind of
bubbles-on-slides code. What blue does with that pointer that it copied is to put it into this queue called Q that it shares with the red thread. The red thread dequeues objects from the queue and uses them.
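Here is a hedged C++ sketch of the three-thread program on the slide. The names (Object, sharedObject, workQueue) are illustrative rather than the actual code from the talk, and the queue is left unsynchronized to keep the sketch small.

```cpp
// Sketch of the slide example: green publishes a flag before the pointer it
// guards, blue copies the pointer under the flag, and red fails much later.
#include <queue>
#include <thread>

struct Object { void use() {} };

bool ready = false;               // green sets this flag
Object* sharedObject = nullptr;   // green publishes this pointer
std::queue<Object*> workQueue;    // shared by the blue and red threads

void green() {
    ready = true;                 // (1) sets the flag first...
    sharedObject = new Object();  // (2) ...then sets the pointer, breaking the
}                                 // intended invariant "ready implies valid pointer"

void blue() {
    if (ready) {                  // can observe ready == true between (1) and (2)
        Object* p = sharedObject; // so p may still be the old, invalid value
        workQueue.push(p);        // the bad pointer gets enqueued anyway
    }
}

void red() {
    if (!workQueue.empty()) {
        Object* p = workQueue.front();
        workQueue.pop();
        p->use();                 // the failure manifests here, far from the root cause
    }
}

int main() {
    std::thread g(green), b(blue), r(red);
    g.join(); b.join(); r.join();
}
```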
Okay, so the program executes like that and what happens? We see that green sets ready to true and
blue sees that ready is true and takes a copy of that pointer, but green hasn’t set the pointer to a new
object yet. So it has an invalid pointer at this point. It enqueues the invalid pointer and then red uses the invalid pointer when it dequeues it from that queue. So what’s interesting about this bug is that the root cause is
over here in the interaction between blue and green. But the failure manifests over here in red. That
makes this a very difficult error to debug.
Luckily there’s been a lot of prior work in debugging this kind of error. One category of prior work is
work that uses program traces to debug these kinds of errors. Program traces are big lists of everything
that happened during a program’s execution. The programmer can look at this list of things that
happened starting from the point of the failure and hopefully eventually they get back to the point in
the execution where the root cause happened. These techniques are effective but they’re limited in one
way and that is if the execution is very long and there’s a large distance between the root cause of the
bug and the failure, like this could be several days, then the traces can be huge. The programmer’s going
to have to look at gigabytes of stuff to try and figure out what’s wrong with the program. So those
techniques are useful but they give too much information.
So there have been other techniques that try to help debugging by focusing in on a narrow subset of
the operations that happened during the execution. These are techniques that help to debug using
dependence information. So dependences occur when operations access the same piece of memory,
like the green and the blue thread are both accessing this ready variable. These techniques are useful
because they focus in on just the operations that the programmer might care about. But in fact they
sometimes give too little information, like you can see here these operations are dependent so they
might be selected in one of these techniques but they don’t tell the whole story. They don’t include
information about the accesses to the shared object variable. That would be important for
understanding why this bug happened.
So what we want to do is develop the technique that gives neither too little nor too much information.
We want to show the programmer the root cause but we don’t want to distract them. We want to take
a cue from those dependence-based techniques and we want to give the programmer information about
communication. Communication is inter-thread dependence. When one thread writes a variable and
another thread reads or over-writes that, that’s when we have communication. So we want to show the
programmer when that happens too.
So our goal is to develop a debugging methodology that can reconstruct the root cause of failures. We
want to include all the code that’s involved in the root cause and we want to show it to the programmer
in time order. We want to give them the information about the communication that occurred. When
we do that we have a reconstructed execution fragment. That’s one of the main contributions of this
work. These reconstructed execution fragments are actually derived from a model of inter-thread
communication that we also developed in this work.
>>: Can I ask a question Brandon?
>> Brandon Lucia: Absolutely.
>>: I have a straw man proposal for solving this problem.
>> Brandon Lucia: Say again.
>>: I have a straw man proposal for solving this problem.
>> Brandon Lucia: Okay, what is it?
>>: The proposal is that you ask the programmer to write down the invariant and then…
>> Brandon Lucia: And then check it.
>>: And then check it dynamically. It will fail exactly at that window.
>> Brandon Lucia: So the problem is that the programmer often doesn’t really know that invariant
explicitly. They have it sort of implicitly in their brain and I think often programmers haven’t thought far
enough ahead to really encode that and crystallize it and put it down in code. There’s also the problem
of asking people to express invariants in code, which can sometimes be complicated. Actually writing
these things, that was a simple invariant but invariants could be much more complicated. You could
have pre and post conditions on API entry points and you could have data structure invariants that are not so simple to express. That work is complementary, though. I think that’s a great idea. I wish
people would do that.
So, cool, feel free to interrupt if you have questions. We have a fairly long time slot and we can talk in
the middle of the talk if you like. Okay, so we want a debugging methodology that produces those
reconstructed execution fragments. Here’s an overview of the methodology that we developed in this
work. The first step is that the program crashes and someone sends a bug report to the developer. The
developer looks in the bug report and sees that there’s some bug triggering input and they use our tool
to run the program repeatedly with that bug-triggering input. Our tool generates communication
graphs. In particular context aware communication graphs which I mentioned a little earlier in my talk
and I’m going to explain in detail in just a second. Having produced a set of context-aware
communication graphs from many executions the programmer labels each of them as having come from
a program execution that manifested the failure or did not, so buggy or not buggy. Then our tool takes
that set of labeled communication graphs and it builds a set of reconstructions that might help the
programmer understand the bug. Then the last step is to look at that set of reconstructions and assign a
rank to each one. So the programmer knows which one is most likely to be beneficial to look at when
they’re trying to figure out why a failure happened.
Okay, so now I’m going to go through each of those steps in a little more detail, starting with what
communication graphs are and how we build them. So communication, as I said a second ago, happens
when one thread writes a value to memory and another thread reads or over-writes that value. We can pretty naturally represent that as a graph where the nodes are static program points and edges exist between nodes whenever the corresponding instructions communicate during the execution. So we have the source,
the sink, and we have shared memory communication encoded, indicated by the edge. If we do that we
have what we would call a simple static communication graph. Static because the nodes represent
static program instructions and in fact this is a little too simple. The way that it’s too simple is that
representing static program instructions in this graph doesn’t differentiate between different dynamic
instances of the same program instruction. So if you’re going around a loop the first iteration of
the loop is the same as the second and so forth, whereas for understanding why a bug happened it
might be interesting to differentiate between instructions executing in different contexts.
To get around that we could look at a dynamic communication graph. In a dynamic communication
graph every different dynamic instance of a static instruction would be differentiated. So the way to
think about this is there’s some monotonically increasing instruction counter. Whenever an instruction
executes it adds a node to the graph that’s identified by the instruction address and that counter. This is
essentially a program trace. While this gets around the problem with the static graphs, the
problem with this is that it’s unbounded. So we end up with that too much information problem that
we had before.
In this work we developed a middle ground between the simple static graph and the unbounded
dynamic graph. We called that a context-aware communication graph. The key idea in a context-aware
communication graph is that a node represents a static instruction executing in a particular
communication context. We add communication context to the nodes. A communication context
encodes abstractly a short history of communication events that preceded the instruction represented
by that node, so if there’s some sequence of communication events and then an instruction executes
that’s one node in the graph. If there’s a different sequence of communication events and then that
same instruction executes it’s another node in the graph. So we differentiate between different
instances of a static instruction. Sure.
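To make the abstraction concrete, here is a minimal sketch, under my own assumptions about the representation, of a per-thread communication context and a graph node keyed by a static instruction plus that context. The context length bound and the local/remote event encoding are illustrative choices, not necessarily Recon’s exact design.

```cpp
// Sketch: a node is a static instruction address plus a short, bounded history
// of the communication events that preceded it. The same instruction under two
// different histories yields two different nodes in the graph.
#include <cstddef>
#include <cstdint>
#include <deque>

enum class Event : uint8_t { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

constexpr std::size_t kContextLength = 5;  // assumed bound on the history length

struct CommContext {                       // thread-local, updated as events occur
    std::deque<Event> history;
    void record(Event e) {
        history.push_back(e);
        if (history.size() > kContextLength) history.pop_front();
    }
};

struct Node {                              // what gets added to the graph
    uintptr_t instructionAddr;             // static program point
    std::deque<Event> context;             // snapshot of the current CommContext
    bool operator<(const Node& o) const {
        if (instructionAddr != o.instructionAddr) return instructionAddr < o.instructionAddr;
        return context < o.context;        // same instruction, different context: different node
    }
};
```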
>>: So a couple of questions.
>> Brandon Lucia: Okay.
>>: If I have a loop with the instruction in there I can have multiple instances of the instruction at the
same communication graph label? The communication is outside the loop, say.
>> Brandon Lucia: Yeah, so if you are in a loop and there’s no communication taking place then you
would add multiple instances of that instruction. In practice the graph doesn’t grow. The node is
already there so we don’t add anything.
>>: Okay, and what communication you mean shared memory communication?
>> Brandon Lucia: Yeah, I mean shared memory communication in the way that I described before
when one thread writes a value and another thread reads or over writes it, that’s when communication
occurs. That’s…
>>: And…
>> Brandon Lucia: Yeah.
>>: Semaphores and locks and all those other things, how do they…
>> Brandon Lucia: U-huh. So those would show up as, so the question was whether synchronization
operations would show up in the communication graph. In fact they would because they manipulate
pieces of shared memory. So as you’ll see in my implementation we instrument programs at a very low
level and we have another implementation which uses hardware support. So we’re observing the
execution from a very low level of abstraction and all these things look similar. They look like shared
memory operations. Sure.
>>: Should I think of the entries as something like vector clocks?
>> Brandon Lucia: No, I would think of them a little more like calling context in a compiler analysis but
instead of being calling context we’re looking at communication context. So rather than abstractly
encoding a call stack we’re abstractly encoding the sequence of communication operations that
preceded this operation. Did that help?
>>: Yeah, I’m still, I’m trying to understand what’s wrong with the logical [indiscernible] analysis like in a
distributed system where each node maintains an [indiscernible] messages…
>> Brandon Lucia: So…
>>: Potentially what you’re doing here is you’re encoding the communication that’s happened, seems
like the same, seems like it should be a dual.
>> Brandon Lucia: They encode similar information. This I expect is cheaper to implement which is one
of the reasons that we did it this way, because we only have to do things when things actually share not
preemptively on other operations, so, yeah.
>>: [inaudible] question.
>> Brandon Lucia: Sure.
>>: So, I’m still trying to understand what you mean by preceding. So for T nine the communication,
T nine let’s say equals nine, that red value that’s written by [inaudible]. So you put the green dot there…
>> Brandon Lucia: So the green dot, yeah so I was hoping to gloss over those details but I can get into
those details now. So the entries that go into the context are indicators that say a local read or write
happened, meaning a read or write that didn’t communicate, or that a remote read or write happened, meaning a read or write that did communicate. So that’s how we abstract, we abstract away the addresses.
>>: Oh, so you don’t know where you read it from but it’s, you know that it’s a remote?
>> Brandon Lucia: We know that it’s a remote read. So you can actually think of it, our motivation for
getting to this abstraction was thinking about coherence. So if preceding some operation there was an
incoming coherence request that showed that some other processor had written to some piece of
memory that would populate one entry in the communication context. So that might be another way of
helping to think about it is local operations are just memory operations and remote operations are
incoming coherence requests.
>>: And one more question, so…
>> Brandon Lucia: Sure.
>>: Let’s say T equals fifteen was not accessing the same variable as T equals nine, you’re
accessing different variables, even then you would put the communication of the previous instruction in
the context of T equals fifteen?
>> Brandon Lucia: That’s correct. Yeah, so the context…
>>: [inaudible] local and its inter-thread happens before an inter-thread [indiscernible].
>> Brandon Lucia: Right. You can think of the context as a thread local property. The context is always
being updated and whenever a node gets added to the graph by a particular thread you grab the current
context and you add it to the node. Then the context changes and you add another node and you grab
the new context. Okay.
>>: [inaudible] sort of like K-limiting in the call graph analysis…
>> Brandon Lucia: Exactly.
>>: You’re going to go back so if I think of it as a happens-before graph dynamically you have a compression technique that is basically doing paths of length K, and that sort of, that now becomes your context for identifying a node as unique…
>> Brandon Lucia: Absolutely.
>>: In your compressed [indiscernible].
>> Brandon Lucia: That’s a great way to think about it, as an analogy to K-bounded calling context.
That’s the perfect way to think about it. That’s the way I think about it, so. Okay, I’m going to move on
just so I can get through all the content here, so.
>> Jim Larus: You have good time.
>> Brandon Lucia: Yeah, but it’s about time, so.
[laughter]
Okay, so I just described how we build these context-aware communication graphs. Now I’m going to
talk about how we go from communication graphs to these things called reconstructed execution
fragments, which I described a second ago. So to build a reconstruction we start with an edge from the
communication graph. I’ve omitted the context just so the diagrams are simpler. Okay, then, oh, you
know I forgot to mention something just a second ago because we got in that discussion. One more
thing that we add to this graph is a form of bounded timestamp. The way these work is not especially interesting: it’s a monotonically increasing counter that we update in a lossy
way so the representation remains bounded.
So we start with one of these edges and then we want to build a reconstruction. So we can look at
those lossy timestamps that I just described and we can populate three regions of a reconstruction, the
prefix, the body, and the suffix. We populate those regions based on those timestamps. So, to populate the
body for example we look at operations that showed up in the graph that had timestamps between the
timestamp on this source node and the one on this sink node. We do the same for the prefix and the suffix. So building reconstructions is very straightforward. Yep.
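Here is a rough sketch, under my own assumptions, of that prefix/body/suffix partition by timestamp: given the edge’s source and sink, other nodes land in a region based on where their timestamps fall. The window parameter and field names are mine, not Recon’s actual data structures.

```cpp
// Sketch: partition timestamped graph nodes into prefix, body, and suffix
// regions around a source/sink edge, using only the (lossy) timestamps.
#include <cstdint>
#include <vector>

struct TimedNode { uintptr_t instructionAddr; uint64_t timestamp; };

struct Reconstruction {
    TimedNode source, sink;
    std::vector<TimedNode> prefix;  // shortly before the source
    std::vector<TimedNode> body;    // between source and sink
    std::vector<TimedNode> suffix;  // shortly after the sink
};

Reconstruction build(const TimedNode& src, const TimedNode& snk,
                     const std::vector<TimedNode>& allNodes,
                     uint64_t window /* assumed prefix/suffix width */) {
    Reconstruction r{src, snk, {}, {}, {}};
    for (const auto& n : allNodes) {
        if (n.timestamp >= src.timestamp && n.timestamp <= snk.timestamp)
            r.body.push_back(n);
        else if (n.timestamp < src.timestamp && n.timestamp + window >= src.timestamp)
            r.prefix.push_back(n);
        else if (n.timestamp > snk.timestamp && n.timestamp <= snk.timestamp + window)
            r.suffix.push_back(n);
    }
    return r;
}
```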
>>: [inaudible] how do you know when it was a remote read but you don’t know where you read them.
So how do you know what is the source and what’s the [inaudible]?
>> Brandon Lucia: I don’t think I understand your question.
>>: So, when explaining [indiscernible] you said that you don’t have information as to which type of
[indiscernible] that you read it from other than the fact that it was remote read, that as you read a value
that was remotely written by somebody…
>> Brandon Lucia: Yeah, so there’s, we keep a distinction between the entries in the context, which are abstracted, and the nodes in the graph. So a node is the tuple of a static instruction
address and the communication context in which it executed. So you know which operation it was.
That’s how you know it was a read or a write.
>>: [inaudible] okay for the arrow, right, so you know that the blue box is actually a read or some read
of text…
>> Brandon Lucia: Or over write, yeah.
>>: And it was a remote read that is it read something [indiscernible] by somebody else on the thread…
>> Brandon Lucia: That’s right.
>>: But you don’t know which thread it was.
>> Brandon Lucia: We actually do. We keep track of that. We don’t record that in the graph though. So
in our implementation we need to keep track of that because we need to be able to identify when
remotes, reads and writes are remote. But the graph abstracts away threads. That’s actually important
for remaining bounded because if you think of applications that have thousands of threads, like
something that’s built with [indiscernible] then it might be a scalability problem for our representation if
we actually encoded the thread in the graph. Did that answer your question? Good, okay.
>>: This graph makes it look like the time is essentially sequencing everything. Can you have multiple
instructions happening at the same time?
>> Brandon Lucia: Yes, but our timestamps are sort of, they’re sort of a cheap implementation of
timestamps. So we have this monotonically increasing counter that gets updated lossily. So we don’t do
that but you very easily could. You could think about things that happened concurrently and use that as
the time stamp instead. The reason we did this was as a convenience in our implementation because we
actually had…
>>: [inaudible] processes?
>> Brandon Lucia: Yeah, so we used the real time, the, what’s it called? The Intel timestamp counter
instruction.
>>: So you can have multiple instructions happening at the same time, same timestamp on different
processors?
>> Brandon Lucia: Yes, due to imprecision in that counter yes you could. Okay…
>>: I’m trying to understand, like this picture makes it look like everything is serialized. So I’m trying to
understand if you have to construct an arbitrary serialization of all the instructions across all the
different processors or whether your timestamp just gives [indiscernible]?
>> Brandon Lucia: The timestamp gives us the serialization.
>>: Okay.
>> Brandon Lucia: The time, think of it as a system wide time that we’re using to populate this.
>>: Okay, then I don’t understand how you can’t end up with multiple instructions occurring at exactly
the same time.
>>: You mean RDTSC on two processors could have the same counter?
>> Brandon Lucia: Yeah, so I guess…
>>: [inaudible]
>> Brandon Lucia: Because of the, because of imprecision in that thing, yeah because of concurrency and imprecision in that counter it’s possible. I’ve omitted that because I don’t think it’s an especially
important detail. But you’re right that that could happen. If things did have the same timestamp
because they happened on two processors to have the same counter they would end up in the same
region of the reconstruction. So you wouldn’t necessarily know the ordering across those things but
you’d know which region they showed up in. There’s something I’m going to get to which makes it less
important to know…
>>: [inaudible]
>> Brandon Lucia: Ordering within a region, yeah, and I’ll show you that in just a second. I’ll come back
to your question when I get there if you want, yeah. So it’s actually this right here. So the reason that
that’s not especially important is that we take, so one of the big problems with dealing with concurrency
errors is that you get different behavior from one execution to the next. That means you get different
reconstructions from one execution to the next, even if you start with the same communication graph
edge.
So, we have a way of aggregating reconstructions together that came from different executions.
Obviously from different executions there will be substantially different and incomparable timestamps.
So the way that we produce an ordering that we show to the programmer eventually is by aggregating
across executions and combining things that occurred in the same region of the reconstruction. So the,
this is why I was sort of hedging around that question because I was going to get to this. It only matters
that they end up in the right region of the reconstruction. Yeah, and then we know ordering: prefix things happen before the source and the source happened before the body, and so forth.
>>: So in the reconstruction on the right hand side of the equal sign, the blue and the green oval that
are sort of parallel to each other means one of those occurred, both of those occurred, or either of them occurred. What’s the semantics…
>> Brandon Lucia: So there’s something else that I’m leaving out of this diagram for simplicity because
usually I smoke through this in about 10 minutes, ha, ha, but I…
>>: [indiscernible]
>> Brandon Lucia: No, no that’s fine, yeah. I’ll add more detail so. In our actually implementation these
things come with confidence values. The confidence value says, this happened in fifty percent and this
happened in fifty percent, or this happened in ninety-nine point nine, nine, nine percent and this
happened in one percent of executions.
>>: But what does that mean?
>> Brandon Lucia: It means, so we build reconstructions from graph edges that came from, graphs from
failing executions. So if we see green in ninety-nine point nine, nine, nine percent of the body regions
from failing executions then we can have some confidence that when the program fails whether this is
significant or not is something else that we decide, but when the program fails that thing tends to
happen between the source and the sink, very often happens between the source and the sink. So
that’s what that confidence value gives us.
>>: Okay.
>> Brandon Lucia: Question in the back.
>>: [inaudible] independent [indiscernible] it could be that, you know, like fifty percent of the time
green happens and fifty percent of the time blue happens but they never occur together.
>> Brandon Lucia: U-huh.
>>: So there’s no dependence encoded in this problem, is that correct?
>> Brandon Lucia: No we’re not encoding [indiscernible]. We’re treating them as independent. Sorry, I
probably just went out of the range of the camera, so.
>>: Sorry.
[laughter]
>> Brandon Lucia: Okay, so I’ve just talked about how we build reconstructions starting with those
graphs. Talked about how we aggregate reconstructions from different executions. Now I’m going to
talk about how we figure out which reconstructions are actually useful. We do that by representing
reconstructions as a vector of numeric valued features. Each of those features represents a different
property of the reconstructions. Using the values of those features we can compute a rank for each of
the reconstructions. So our tool works by generating lots of reconstructions. Computing these feature
vectors, computing a rank, and then ranking the reconstructions that were produced. Our goal is to
produce a rank ordered list of reconstructions where the first one in that list is one that points the
programmer to the root cause of the bug. So you’re all probably wondering, what are B, C, and R?
What are those features?
So, I’m not going to talk about all of them. But I’ll talk about one to give you an intuition for how the
features work to help us figure out which reconstructions are related to a bug. So one of the features
that we use is something that we call the buggy frequency ratio and the intuition is this, you build a
reconstruction around a graph edge. If the graph edge occurs often in failing executions and occurs very
rarely in non-failing executions then we assume that that graph edge might have something to do with
the failure. So we improve the rank of reconstructions built around that edge. Conversely, if this were the other way around and this graph edge were to happen often in non-failing executions and rarely in failing executions then we would say, that’s probably not
very useful for understanding the bug. So we would give that reconstruction a lower rank. So that’s the
intuition behind the features. The other features encode similar ideas but for different properties of the
reconstructions. Sure.
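To make that intuition concrete, here is a hedged sketch of how a buggy frequency ratio and a rank over reconstructions might be computed. The formula and names are illustrative only; the real Recon ranking combines several features.

```cpp
// Sketch of the "buggy frequency ratio" idea: an edge that appears often in
// failing runs and rarely in non-failing runs gets a higher score, and
// reconstructions are sorted by score, best first.
#include <algorithm>
#include <vector>

struct EdgeStats {
    int failingRunsWithEdge = 0;     // how many buggy graphs contain the edge
    int nonFailingRunsWithEdge = 0;  // how many non-buggy graphs contain it
};

double buggyFrequencyRatio(const EdgeStats& s, int failingRuns, int nonFailingRuns) {
    double fBuggy = failingRuns ? double(s.failingRunsWithEdge) / failingRuns : 0.0;
    double fOk    = nonFailingRuns ? double(s.nonFailingRunsWithEdge) / nonFailingRuns : 0.0;
    return fBuggy / (fOk + 1e-9);    // high when common in failures, rare otherwise
}

// Rank reconstructions by some combination of feature values, best first.
template <typename Recon, typename ScoreFn>
void rankReconstructions(std::vector<Recon>& recons, ScoreFn score) {
    std::sort(recons.begin(), recons.end(),
              [&](const Recon& a, const Recon& b) { return score(a) > score(b); });
}
```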
>>: [inaudible] question…
>> Brandon Lucia: Yes.
>>: Does your encoding capture bugs that occur when two things happen together, in some sense because
these are independent, so?
>> Brandon Lucia: Yeah, so one of the other features that we look at looks at the consistency of things
happening in a particular region of the reconstruction. That captures that idea that two
things happen at once. So if, maybe we should, maybe we can talk about this later because I think it
would be easier to talk about it offline than to try to get into it without a whiteboard right now, so.
>>: [inaudible]
>> Brandon Lucia: Yeah, one of our other features does capture that property though. Okay, now I’m
going to talk a little bit about our implementation. Our implementation…
>>: [inaudible]
>> Brandon Lucia: Sure, yeah.
>>: So how dependent are you on the quality of labels buggy, non-buggy, because you could have a
non-buggy run where the bug has just not caused a crash?
>> Brandon Lucia: Yeah, we are completely and absolutely dependent on that property. Something that
I’m really interested in my future work is to make systems that can tell you earlier than we know now
that something has gone wrong. I think that’s actually a very hard problem in general.
Okay, so now I’m going to talk about implementation. We started with a software implementation. We used binary instrumentation: for C++ we used Pin and for Java we used RoadRunner. Our instrumentation is simple; we inject code around memory operations. The code that we
inject updates a data structure that represents the communication graph. So you can go to my website
and you can download the stuff now and you can use it if you want to. That makes it pretty cool in my
opinion because it’s practical and you can go and run it on your machine. The downside is that using binary instrumentation is a little bit of a bummer because the overhead can range from fifty percent for some applications to like a hundred X.
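As a rough idea of what the injected code might do, here is a simplified sketch. It is my own illustration, not the actual Pin/RoadRunner tool code: it keeps per-address last-writer metadata and records an inter-thread edge whenever a thread reads or overwrites a value last written by a different thread.

```cpp
// Simplified instrumentation callback: maintain "who wrote this address last"
// and add a communication graph edge when another thread touches that value.
#include <cstdint>
#include <mutex>
#include <set>
#include <unordered_map>
#include <utility>

struct LastWrite { uintptr_t instructionAddr; int threadId; };

std::unordered_map<uintptr_t, LastWrite> lastWriter;          // address -> last writing instruction
std::set<std::pair<uintptr_t, uintptr_t>> communicationGraph; // (source instr, sink instr)
std::mutex graphLock;

// Called (by the injected code) before every monitored load or store.
void onMemoryAccess(uintptr_t addr, uintptr_t instructionAddr, int threadId, bool isWrite) {
    std::lock_guard<std::mutex> g(graphLock);
    auto it = lastWriter.find(addr);
    if (it != lastWriter.end() && it->second.threadId != threadId) {
        // Another thread wrote this location last: record an inter-thread edge.
        communicationGraph.emplace(it->second.instructionAddr, instructionAddr);
    }
    if (isWrite) lastWriter[addr] = {instructionAddr, threadId};
}
```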
So obviously a hundred X slowdown is a little bit of a drag but if you look at tools like Valgrind you see
overheads that are actually similar for some applications. So it’s high but it’s not unreasonable. People
actually use Valgrind in practical software development. So we saw those overheads and we were
encouraged because it was useable but we wanted it to go faster and so we looked at how we could use
hardware support to make graph collection more efficient. Our base design for our hardware support
mechanisms was a multicore processor that has coherent data caches. I’m going to add some things to
this design and they’ll show up in blue and those are the extensions that we proposed.
The first extension that we proposed is communication metadata. Communication metadata is
information that we add to each of the lines in the cache. In particular we record what was the last
instruction to access each cache line. That’s the information that
processors need to build the communication graph. We extended the cache coherence protocol to
shuttle our communication metadata around. That’s useful for the following reason. Cache coherence
protocol messages are sent between processors when communication is taking place in the application.
So if we attach our communication metadata to coherence protocol messages and a processor receives
an incoming coherence message they know that communication is happening and they know the
instruction with which the communication is happening. So using that information they can actually
build an edge that they can add to the graph. Yep.
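Here is a hedged sketch of the hardware idea just described, written as plain C++ purely for illustration: each cache line carries metadata naming the last instruction to access it, that metadata rides along on coherence messages, and the receiving core turns an incoming message into a communication graph edge. The field names and layout are my assumptions, not the actual proposed design.

```cpp
// Sketch: per-line metadata travels on coherence messages; an incoming message
// plus the local access that triggered it yields one communication graph edge.
#include <cstdint>
#include <vector>

struct LineMetadata {
    uintptr_t lastInstruction;   // last instruction to access this cache line
    // (the real proposal also stores the communication context; omitted here)
};

struct CoherenceMessage {
    uintptr_t lineAddress;       // which line the message is about
    LineMetadata senderMeta;     // metadata attached by the remote core
};

struct Edge { uintptr_t source, sink; };

// Called when a core receives a coherence message caused by its own access at
// `localInstruction`: the remote instruction in the metadata is the source of
// the communication and the local access is the sink.
void onCoherenceMessage(const CoherenceMessage& msg, uintptr_t localInstruction,
                        std::vector<Edge>& graphEdges) {
    graphEdges.push_back({msg.senderMeta.lastInstruction, localInstruction});
}
```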
>>: When you say instruction do you mean the IP or…
>> Brandon Lucia: Instruction pointer.
>>: Okay, not your previous thing when you had like…
>> Brandon Lucia: I was doing that for illustrative purposes...
>>: [inaudible] still identify instruction just by [indiscernible].
>> Brandon Lucia: I’m not sure, I think I lost you.
>>: [inaudible] right, right back at the beginning of that problem with identifying the instruction statically versus dynamically.
>> Brandon Lucia: Oh, right. So, the context is part of our hardware support. I left it out of this diagram
because usually I actually find out that I don’t get into this much detail in the discussion. I’m really glad
you guys are asking the questions, this is more fun than the normal talks that I give where everyone is
just silent. But the context is part of that.
>>: Okay, so you also have some additional information.
>> Brandon Lucia: Yeah, we keep the context on the core so in the metadata it’s actually the instruction and the context that both get stored. Right, and we also add a simple hardware structure to store the communication graph. That’s a fixed-size FIFO and it’s fixed-size so that when it reaches capacity a software trap happens, we have a run time layer that empties it out, stores it in memory, and you can
use it during debugging. We have a software tool that does all the other stuff that I described a minute
ago. Sure.
>>: So you have to worry about false sharing here, right, in some sense?
>> Brandon Lucia: Yeah, absolutely so false sharing means that we’re going to see communication that
didn’t really happen and cache evictions mean that we don’t see communication that might have
happened and things get out of date. So that’s some imprecision and we have numbers on that in our
paper. We showed that it’s not a huge problem for debugging but, yeah, it does show up as a problem.
>>: So back on your software [indiscernible]. Did you guys look at in your compiler [indiscernible] did
you have any kind of compiler analysis that looked at code and said, hey this is local, I can guarantee this
is [indiscernible] versus not in which case…
>> Brandon Lucia: No, but we cheat a little bit and we excluded stack locations assuming that they
wouldn’t be shared. In Java that’s reasonable and in C++ people can do whatever they want to but
we find that common practice is not to do that. So we omitted those accesses, yeah.
Okay, now I’ll just talk about some of our evaluations. So we built that tool and we simulated that
hardware and now I’m going to talk about how we evaluated that. So if we had just built a compiler
optimization we could take some program, use our optimization, and show that our optimization makes it go
lots faster. Evaluating this was a little less straightforward. So we had to come up with a measure of
what was the quality of our technique? We measured quality by looking in that rank ordered list of
reconstructions that Recon produces. The quality is higher if an earlier entry in that rank ordered list
points us to the root cause of the bug. The quality is lower if there are more things ahead of that root
cause reconstruction that don’t have anything to do with the root cause.
We also looked at performance which is just the run time overhead. We looked at some of the
hardware overheads as well. For benchmarks this was also kind of a challenge and I guess I, some of you
in the room could empathize with me here, finding tools, finding programs to evaluate concurrency
debugging tools can be a real challenge because there’s no standard benchmark suite. So we actually
went to the web and we found programs like MySQL, Apache, Java Standard Library, things like that.
We found bug-triggering inputs and we reproduced those bugs and we showed that our tool can actually
lead us to the root cause of the failures that the bugs trigger. We evaluated performance using a set of
standard benchmarks, PARSEC, DaCapo, and Java Grande.
So here’s a high level summary of the results that we found when we evaluated the quality of our
system. The first was that using a set, a reasonably sized set of graphs from buggy and non-buggy
executions, twenty-five was the number. We found that a reconstruction of the bugs root cause was
first in that rank order list that Recon outputs. That was nice because it shows that with a modest
amount of effort devoted to collecting graphs the programmer is led to the root cause of the bug.
We also identified a tradeoff with respect to effort. That tradeoff was the following; if the programmer
uses more graphs then the quality is higher. If the programmer uses fewer graphs they spend less time
collecting graphs but the quality is lower. So they might have to spend time looking through what are
effectively false positives in the output. Sure.
>>: So does it matter what fraction of the twenty-five are buggy?
>> Brandon Lucia: Right, so that was twenty-five buggy and twenty-five non-buggy.
>>: So fifty-fifty.
>> Brandon Lucia: Fifty in total.
>>: So suppose it was, suppose the bug didn’t occur all the time and it was ten-forty?
>> Brandon Lucia: So we actually, in the experiments that we used to illustrate this trade off we used
five or fifteen buggy graphs assuming it was harder to get buggy executions and twenty-five correct
graphs, because correct graphs are essentially in limitless abundance.
Okay, in our performance evaluation we showed ten to a hundred times overhead in software, like I said
before, pretty high but comparable with other tools. There are two sources of hardware overhead that
we found interesting. One is how often do those traps happen where you have to empty out the FIFO
store it in memory? Two is how often do you need to update the metadata that’s hanging on the end of
the cache line? So we found that traps are pretty infrequent, less than one in ten million on average.
Smiling and have a question, what’s up?
>>: Well…
[laughter]
>>: You’ve heard the traps weren’t frequent but what I would [indiscernible] as more is how long does it
take you to handle the trap in [indiscernible] FIFO and what’s that overhead on the overall
performance?
>> Brandon Lucia: Yeah, so I don’t have the numbers on that. We can talk about that later but the
infrequency helps to amortize that cost. But you’re right; I mean it’s really the increase in latency that
could be a problem, yeah. Second result is how often do we need to update that metadata? Because
that could be a problem and you’ll see that this is considerably more frequent, two percent of memory
operations is fairly often. However, in a hardware implementation this can happen in parallel with
accessing the cache line itself. So it’s not likely a performance problem because it can be parallelized.
Okay, so just to summarize those themes that I described…
>>: [inaudible] cache line or for…
>> Brandon Lucia: Per cache line.
>>: [inaudible] cache line.
>> Brandon Lucia: Yeah and it’s imprecise because of that. We have an analysis of that in our paper if
you’re interested in checking that out, yeah. So I just showed you that we developed new abstractions,
context-aware communication graphs and we use those to build reconstructed execution fragments.
There was support across the system stack. I showed you hardware and software implementations and I
showed you in our results some of the tradeoffs of using each of those. This is a system that is useful
even in deployment because with a hardware implementation we can collect this information all the
time and send it back to developers. Finally this system takes advantage of collective behavior because
information could be pulled from many systems that run the same piece of software and the
information can be combined.
So that’s what I have to say about Recon. This is a new architecture and system support mechanism for
making concurrency debugging easier. Okay, yep, questions? I’m just…
>>: [inaudible]
>> Brandon Lucia: Starting to look at the time. We have been doing a lot of questions, I want to make
sure I do get through everything without keeping…
>>: You have…
>> Brandon Lucia: I don’t want to, okay, sure.
>>: Seriously.
>>: So if, for instance, I did something really dumb like just recording the last thread that accessed the variable and kept that as my kind of straw man that says this is where the potential bug, this is the IP of the source, of the root cause of the bug, could occur. So, okay, what I’m trying to get out
of this is did you guys do any kind of analysis where you had some sort of a baseline that said that, you
know I’m a very complicated system, is there any kind of baseline where you have some comparison
that says something simple like the straw man [indiscernible] unique…
>> Brandon Lucia: Yeah, so…
>>: That actually do something that’s real to do [inaudible].
>> Brandon Lucia: We didn’t do that but something that I would really like to do in the future is to
actually get some human subjects into the lab and say debug using technique A, debug using technique
B, and do a comparison. Maybe not even just across the work that I’ve done but across work that has
come from other groups. I think it would be a really informative study to see which techniques are
actually good and it might involve collaborating with some HCI people because that’s a little bit outside
my area of expertise. But it could be really interesting to see those results.
>>: Okay.
>> Brandon Lucia: So…
>>: [inaudible]
>> Brandon Lucia: Cool, well, yeah so that’s my answer and I would love to see more human subject
studies going on in this area of research. I just, you don’t see that many and I think it would be really
cool to see more of those, so.
Alright, now I’m going to change gears. I’m going to talk about a system that isn’t about finding and
fixing bugs for a programmer but rather it’s about systems cooperating to learn how to automatically
avoid failures. You can see these buffalo are all looking outward. They’re cooperating to avoid
failures which would be lion attacks or something in this example, ha, ha. These photos, I took a trip to
Zimbabwe so I’ve got a bunch of stupid vacation photos in my talk, ha, ha.
>>: [inaudible]
[laughter]
>>: There are bugs that…
>> Brandon Lucia: Yeah, it’s a little bit, you know debugging and failure avoidance is really synergistic,
they go together, so. So, I’m going to start this section of my talk with an example that shows you at a
high level how our system works. But first I’m going to talk about how things work today. When you
develop software today you have your development and debugging system, you make your application
and then you push it out to the deployed systems like this. The deployed systems run and sometimes
they get one of these thread interleavings that leads to a failure, so this might be a concurrency bug. If
you’re a good developer you collect the core dump and you have that sent back to your development
and debugging box. With the core dump in hand you can spend time trying to come up with a patch and
figure out what went wrong with the program. The interesting thing about this is the developer is active
but the deployed systems are passive in this process, just waiting for a patch to come from the
developer. In the meantime, the deployed systems might experience the same failure over again
degrading the reliability of the community of systems.
So in this work we had the idea to make the deployed systems be active in this process as well. We
make them cooperate by sharing information to learn why failures happen and what they can do to
avoid those failures in future program executions. Okay, so now I’m going to give you an overview of
what things look like if we have Aviso which is our system that takes advantage of that idea. So, just like
you have a development debugging server we have an Aviso server. In the deployed systems we see
that the application is linked against the Aviso run time, which runs on the same machines as the
application itself. We see that same failure and just like we sent a core dump back in the baseline
system, in the case with Aviso we sent an event history back to the Aviso server. Aviso does some
analysis on that event history and the information that it extracts from that analysis goes into building a
model of what happened in that failure, what happened preceding that failure. It’s important to note
that this is a cooperative model. Any time a failure happens over here it ships an event history over to
the Aviso server and contributes to that model. So nodes are cooperating, deployed system nodes are
cooperating by sharing information. Using the cooperative failure model the Aviso server generates
constraints on the execution schedule of the program that restrict the order of certain events in
different threads. When Aviso finds a constraint that prevents a failure it ships it back across to the
deployed systems. The deployed systems can use those constraints to avoid failures and note that if
one node fails and has a constraint sent to it that same constraint can be sent to all the other machines
trivially and they can share the wealth of failure avoidance.
>>: Are these constraints guarantees or are they probabilistic?
>> Brandon Lucia: I’ll show you. I’m going to get to that, yeah. So there are three parts to the system,
the first is what are constraints and how do they work? The second is, what are the event histories that
we collect and how do we use the information in the event histories to generate constraints? Finally I’m
going to talk about what goes into that cooperative failure model and how is it useful for picking which
constraints are going to avoid bugs.
So, first I’ll talk about these schedule constraints. To talk about these I need to show you a little bit
more code. This code is really simple though, there are two threads, you have the green thread, it’s
doing something funny which is set this variable to null and then set it to a new object, so it does two
operations. Blue thread is acquiring a lock and then using that pointer that green is playing with over
there and then releasing the lock. So this program is broken in several ways. We can talk about them,
yeah, at length. The important thing to know though is if it executes under this interleaving blue uses
the null pointer because green set it to null and then blue used it. That’s a problem. The way to
understand this bug is that this bug is characterized by the event ordering that I’ve indicated with those
dashed arrows. When P equals null happens and then P pointer use happens we get the failure, only if P
pointer use precedes the assignment of P to that new variable.
We can also observe that if we had a different ordering of events like P equals null followed by
assignment of P to the new pointer and then the use of P, well that wouldn’t lead to a failure, so one of
the key ideas in this work is to shift the execution away from failing schedules like the one on the
previous slide and toward non-failing schedules like the one on this slide. To do that we developed the
idea of a schedule constraint and a schedule constraint says that a pair of operations contribute to a
failure and reordering around those operations can prevent that failure. So a schedule constraint is
really nothing more than a pair of instructions in the execution. The semantics of a schedule constraint are very simple. We have a schedule constraint like this and it has the green instruction and the blue
instruction. The semantics are the following, when in the execution we reach that first instruction the
constraint gets activated. Subsequently in the execution when we reach that second instruction, the
blue one that instruction gets delayed. Those are the semantics of a schedule constraint.
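Here is a minimal sketch, under my own assumptions, of those semantics in code: a constraint is a pair of static instructions, reaching the first one activates it, and another thread reaching the second one while it is active gets briefly delayed. The delay length, deactivation policy, and data layout are illustrative, not Aviso’s actual runtime.

```cpp
// Sketch of schedule constraint semantics: activate on the first instruction,
// delay the second instruction when it is reached by a different thread.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

struct ScheduleConstraint {
    uintptr_t first;   // static instruction that activates the constraint
    uintptr_t second;  // static instruction that gets delayed while active
    std::atomic<int> activatingThread{-1};  // -1 means "not active"
};

// Called by the runtime when `threadId` reaches static instruction `pc`.
void onInstruction(ScheduleConstraint& c, uintptr_t pc, int threadId) {
    if (pc == c.first) {
        c.activatingThread.store(threadId);          // activate the constraint
    } else if (pc == c.second) {
        int activator = c.activatingThread.load();
        if (activator != -1 && activator != threadId) {
            // Delay the second instruction in a *different* thread for a short
            // window, giving the activating thread time to run past the
            // dangerous region before this thread proceeds.
            std::this_thread::sleep_for(std::chrono::microseconds(100));
            c.activatingThread.store(-1);            // deactivate after the delay
        }
    }
}
```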
Now I’m going to show you with that example why this is actually effective at avoiding failures, it’s
effective because in this example you can see that P equals null gets executed. That activates the
schedule constraint, then P pointer use tries to execute, normally that would cause a failure but instead
the constraint delays the execution of that operation. In the meantime, green steps in and executes its P equals new P, and later, after that delay expires, blue gets to execute its operation without failing.
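To make that mechanism concrete, here is a minimal sketch of how a schedule constraint could be enforced at run time. This is an illustration only, not Aviso's actual runtime (which is compiler-inserted native code); the class name, the delay value, and the on_event hook are all hypothetical.

    # Minimal sketch of schedule-constraint semantics (illustrative; not Aviso's real runtime).
    import threading
    import time

    class ScheduleConstraint:
        def __init__(self, first_pc, second_pc, delay_s=0.0001):
            self.first_pc = first_pc      # static instruction that activates the constraint
            self.second_pc = second_pc    # static instruction that gets delayed while active
            self.delay_s = delay_s        # short, empirically chosen delay window
            self.activating_thread = None

        def on_event(self, pc):
            # Called by (hypothetical) instrumentation whenever an instrumented event executes.
            tid = threading.get_ident()
            if pc == self.first_pc:
                self.activating_thread = tid        # a dynamic instance of the first instruction activates it
            elif pc == self.second_pc and self.activating_thread not in (None, tid):
                time.sleep(self.delay_s)            # delay the second instruction, but only in a
                self.activating_thread = None       # thread other than the one that activated it

In the example above, the green thread's P equals null would activate the constraint, the blue thread's pointer use would be briefly delayed, and the green thread's P equals new P would get a chance to run before the delay expires.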
>>: Why are you expressing this in terms of a delay as opposed to, sort of, you know, thinking of the second green instruction as enabling the blue instruction to continue?
>> Brandon Lucia: That’s a really great question. So why don’t we just figure out what instruction this is
and make constraints that have all three of those instructions, right?
>>: Well, no I would just…
>> Brandon Lucia: Okay.
>>: Yeah.
>> Brandon Lucia: Yeah, something like that. The main reason is, you’ll recall from the previous
example that the failure occurred at this instruction. So if we want to do forensic analysis in our server
we don’t know that this instruction exists. We have an event history and I’ll show you in a second what
kind of event histories we keep. The event history doesn’t say anything about P equals new P.
>>: But you have the code of the program.
>> Brandon Lucia: We have the code of the program and something I’m looking at in future work is
doing a better job of tuning these delays based on predictions of which instructions might be good
constraint deactivators, yeah, okay.
>>: [inaudible] instructions?
>> Brandon Lucia: Say again.
>>: These constraints are referring to dynamic instructions or static…
>> Brandon Lucia: Static instructions. A constraint is a pair of static program instructions. When a
dynamic instance of the first instruction in the pair occurs it activates the constraint. When an active
constraint is active and the second instruction executes, then that causes a delay
like this.
>>: In any [indiscernible]…
>> Brandon Lucia: Say again.
>>: So those instructions could occur on any thread, right?
>> Brandon Lucia: So if a constraint is active because the first instruction executed in one thread, the second instruction gets delayed in any thread except the one that activated the constraint.
>>: Oh…
>> Brandon Lucia: And not in the same thread but in any other thread. Otherwise you’d get some
atomic region that prevented itself from proceeding because…
>>: So how long are the delays? Is there [indiscernible] a scenario where this causes timeouts and it cascades, and the system you have is even worse [indiscernible]?
>> Brandon Lucia: Yeah, so that’s a problem. The delays are fairly short on the order of hundreds of
instructions. We did a characterization of the delay, we established this empirically. One of the things I
want to do in the future is do a better job of figuring out how long those delays should be and if there
are program events, as Jim pointed out, that we could use to trigger expiration of a delay instead. But we did this empirically and found that the range of failures we were dealing with in our experiments fit into a particular delay window. So that's an area I want to look at in future work.
>>: [inaudible] what if there was a, some other constraint between those two green instructions that
decided to delay the allocation of P to satisfy some other constraint, right. So then these two delays
would cancel each other…
>> Brandon Lucia: Yeah, so…
>>: And so you would not [indiscernible].
>> Brandon Lucia: Right, so the situation is where you essentially end up with live lock because delays
cascade between threads, so.
>>: Not a livelock, because you're using delays they just cancel out; you have two different constraints there inserting delays that just cancel each other.
>> Brandon Lucia: So…
>>: So…
>> Brandon Lucia: Yeah, I agree this is problematic. One delay could undo the good of another. So the
good news is you’re only as bad as the program was initially. The bad news is that means that this
mechanism doesn't actually work in that case. Another answer to that question is that I'm actually trying to work on a formalism right now that shows that as long as delays are acyclic, and the hard part is defining what acyclicity means for these kinds of things, then you can't end up with situations where delays cause livelock or, in the way you described, undo one another's work, so.
>>: I don’t agree with your statement you’re only as bad as [indiscernible]…
>>: No.
>>: Increase in delay that kind of causes…
[laughter]
>> Brandon Lucia: Plus performance…
>>: Much worse then, yeah.
>> Brandon Lucia: With performance [indiscernible].
>>: Yes.
>> Brandon Lucia: You’re right. So there is an impact on latency, yeah. We did find in our evaluation of
this that delays are very infrequent however. That’s a property of the applications that we looked at. So
you’re right to say that if delays happen very frequently this could be increasing latencies and causing
timeouts and bad things to happen. In practice we found that’s not the case. Furthermore, in our
model for selecting constraints which I’m going to talk about in just a couple minutes, we can build in a
quality of service constraint that says don’t use schedule constraints that degrade quality of service,
meaning cause timeouts, cause unacceptable increase in request latency, things like that.
>>: Yeah but it's not just latency, right. You're adding delays and what you're doing is biasing the
schedules to run, a subset of schedules. Those schedules could kind of cause…
>> Brandon Lucia: Well…
>>: [inaudible] expose other bugs that you…
>> Brandon Lucia: So, u-huh…
>>: Wouldn’t then you know…
>> Brandon Lucia: That’s a very pessimistic view…
>>: [inaudible]…
>> Brandon Lucia: I mean the reason…
>>: [inaudible]…
>> Brandon Lucia: We’re doing this is because…
[laughter]
The reason we’re doing this is to bias the schedule away from schedules that we think…
>>: Right.
>> Brandon Lucia: Are going to cause bugs…
>>: Yeah, but you don’t know that in advance so could be kind of, you have no guarantee…
>> Brandon Lucia: It’s true but I think that that’s an incredibly pessimistic view. That’s saying that when
you go to avoid one bug you’re going to land another bug. I just…
>>: [inaudible]…
>> Brandon Lucia: Think it’s possible…
>>: [indiscernible]
>>: [inaudible]
[laughter]
>>: Actually, you know this might be one place where [indiscernible] are [indiscernible]…
>> Brandon Lucia: Yeah.
>>: Think of there being a [indiscernible] that P is not null before the use, right.
>> Brandon Lucia: U-huh.
>>: So what you do at run time is that you evaluate the invariant and if it fails…
>> Brandon Lucia: U-huh.
>>: They’re going to crash. So rather than crashing the program you just delay…
>> Brandon Lucia: You hang this thread…
>>: And then…
>> Brandon Lucia: Yeah.
>>: Evaluate again and hopefully…
>> Brandon Lucia: Yeah.
>>: It’s true again and then you just run, right.
>> Brandon Lucia: It actually sounds like…
>>: That might be a sure way of like avoiding all of these problems. You’re about to crash, rather than
crashing, you know you actually make the program…
>> Brandon Lucia: The problem is having a specification.
>>: [inaudible]
>> Brandon Lucia: We don’t…
>>: [inaudible] isn’t specification, right. There’s a lot of [indiscernible] wasn’t done here [indiscernible]
right.
>> Brandon Lucia: U-huh.
>>: So you can just [indiscernible] for it.
>>: [inaudible] or actually…
>> Brandon Lucia: You’re right. The…
>>: [inaudible] Java you’d have the null test equipment, right.
>> Brandon Lucia: You could even thread this. Yeah…
>>: [inaudible]
>> Brandon Lucia: Well, we should collaborate on that kind of project in the future.
>>: [inaudible]
>> Brandon Lucia: I like that idea.
[laughter]
>>: Yeah, so…
>>: [inaudible] then your actual benchmarks…
>>: [inaudible]
>>: Is this the kind of thing you see go wrong or do you see things go wrong that are semantically off,
like you get the wrong answer but it doesn’t crash?
>> Brandon Lucia: You’re talking about the question of how do we identify failures?
>>: I’m saying in the benchmarks…
>> Brandon Lucia: Yeah.
>>: You said in practice this happens rarely. What is it that goes wrong in practice, is it null
[indiscernible] or is it that we got the wrong answer because we referenced the wrong pointer?
>> Brandon Lucia: Yeah, I touched on this point before you came in. So the way that we identify failures
is actually looking for fail stop conditions, assertion failures and segmentation faults and signals and
things. In general, finding failures is a hard problem and predicting when something has gone wrong is unsolved. I think it's a cool thing to look at in the future. So, sure.
>>: [inaudible]
>> Brandon Lucia: Yeah.
>>: So could you maybe try different values of the delay in production…
>> Brandon Lucia: U-huh.
>>: And monitor the effects this has on maybe the latency that you’re seeing?
>> Brandon Lucia: You can…
>>: [inaudible] end latency?
>> Brandon Lucia: You can, and that would increase the search space because you'd have to try tuples of constraint and delay time…
>>: What I’m saying is…
>> Brandon Lucia: But it’s feasible, yeah.
>>: This, without, okay, right, right, but…
>> Brandon Lucia: You’ll see…
>>: [inaudible]…
>> Brandon Lucia: When I get to my discussion…
>>: [indiscernible] flawless…
>> Brandon Lucia: Of the model…
>>: This, you know I guess…
>> Brandon Lucia: Yeah.
>>: The largest delay that has no effect on the [indiscernible] system, right, something like that.
>> Brandon Lucia: That’s certainly what you want. When I get to the model I can talk just for a second
about how we can incorporate that information into the model…
>>: But you could certainly just send out a variety of constraints with different delays to each, if you
have a large collection of systems, right, you could kind of in parallel search the space effectively…
>> Brandon Lucia: Yeah.
>>: By running different delays.
>> Brandon Lucia: Have you seen my talk before? Ha, ha.
[laughter]
Yeah, this is essential to the way the technique works. So, yeah, so something I mentioned before is
that the way we generate these constraints is by collecting a history of events that happened before the
failure. We use that information to generate constraints. So, now I’m going to talk about what goes
into those histories and how we collect that information.
So if we have a program like this we need to instrument events that are interesting when we’re trying to
deal with concurrency bugs. There are two kinds of events that we think are interesting. One is
synchronization events and the other is sharing events. Synchronization is locks and unlocks, thread forks and joins, things like that. These are easy to find with a compiler and if there's custom
synchronization or something like that the programmer can tell our system this is what I’m using for
synchronization.
Sharing events are memory operations that access memory locations that could be shared across
threads. These are harder to find. So the way we do that is by using a profiler, because with a compiler we
have to be conservative and it’s hard to identify a reasonably small set of sharing events in the
execution. So once we’ve found sharing events and synchronization events we have a compiler pass
that inserts calls into our run time into the program. Those run time calls are used to populate the
event history that I described before.
So the event history is the data structure that exists in the run time while the program executes. As the execution unfolds we see the event history gets P equals null, because that's an event, and then we see this acquire lock, and then we see the P pointer use, and then we see a failure. Aviso also monitors for failures, and it considers assertion failures, terminating signal deliveries, and fail-stop conditions like that to be indicators of failures. Like I said a second ago, we're looking at other ways of identifying failures.
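As a rough illustration of what that data structure might look like, here is a small sketch of a bounded per-execution event history. The structure and names are assumptions made for illustration; the actual runtime keeps this state in native code.

    # Rough sketch of a bounded per-execution event history (names and structure assumed).
    from collections import deque

    class EventHistory:
        def __init__(self, window=64):
            self.events = deque(maxlen=window)   # bounded window of the most recent events

        def record(self, pc, thread_id, kind):
            # kind is "sync" (lock/unlock, fork/join) or "sharing" (profiled shared-memory access)
            self.events.append((pc, thread_id, kind))

        def snapshot(self):
            # shipped to the Aviso server when a fail-stop condition is observed
            return list(self.events)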
So after the program fails we have this event history that shows what happened leading up to the
failure. We want to generate a set of constraints that are candidates for preventing that failure
behavior that occurred. We do that by enumerating all the possible pairs of instructions in a window of
that event history that execute in different threads. So in the [indiscernible] example event history, here we have two different constraints that we can generate: the P equals null and acquire lock pair, and the P equals null and P pointer use pair. You'll remember this second one is the constraint from a moment ago that I showed actually works to avoid that failure by adding a delay.
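A tiny sketch of that enumeration step, under the assumption that the event history holds (pc, thread, kind) tuples as in the sketch above (again, illustrative names only):

    # Sketch of candidate-constraint generation from a failing event history:
    # every ordered cross-thread pair of events in the window is a candidate.
    def candidate_constraints(history_snapshot):
        candidates = []
        for i, (pc_a, tid_a, _) in enumerate(history_snapshot):
            for pc_b, tid_b, _ in history_snapshot[i + 1:]:
                if tid_a != tid_b:                   # only pairs from different threads matter
                    candidates.append((pc_a, pc_b))  # (activating instruction, delayed instruction)
        return candidates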
Okay, now I've just shown you how we can generate a set of constraints. But I didn't tell you how we decide which one is actually useful for avoiding the failure. So that's the last part: Aviso gets a bunch of failures, builds up a big set of constraints, and then it needs to decide which ones it wants to send over to the deployed systems so they can use them to avoid failures. Which one does it pick? To
answer that question we develop a constraint selection model. Our constraint selection model has two
parts, the first part is the event pair model and the second part is the failure feedback model.
The event pair model looks at pairs of events that occur in the program's execution and in particular how
frequently pairs of events occur in non-failing portions of the execution. To get that information Aviso
sparsely samples event histories from non-failing execution. The intuition behind why this is useful is if a
pair of events happens often in a correct portion of the execution then it’s unlikely to be responsible for
the failure. So trying to reorder around those events isn’t likely to have any impact on whether the
failure manifests or not.
The other side of the model is the failure feedback model. This model gets populated when we start
issuing constraints out to deployed systems. It explicitly tracks the impact on the failure rate of the
system when a particular constraint is active and when no constraints are active. So the intuition here is
that if the instance of a constraint being used by a system correlates with a decrease in
the failure rate then that constraint is more likely to be useful in future executions for avoiding the
failure.
We have a way of combining that information together that I’m not going to describe in detail into a
combined model that is a probability distribution defined over the set of all the constraints that we have
and Aviso draws constraints according to the probability distribution and issues them to the deployed
systems. So if anyone’s a machine learning person in the audience this is an instance of reinforcement
learning and it’s a variant of the K-Armed Bandit model for reinforcement learning. Yep.
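To give a sense of how those two parts might combine, here is a toy version of the selection step. The weighting formula, names, and inputs are assumptions for illustration; the talk only states that the real model is a K-armed-bandit-style reinforcement learning scheme that draws constraints from a probability distribution.

    # Toy constraint-selection model: pairs seen often in correct runs are down-weighted
    # (event pair model), and constraints that correlate with fewer failures when deployed
    # are up-weighted (failure feedback model); one constraint is drawn from the result.
    import random

    def select_constraint(candidates, correct_pair_counts, failure_rate_with, baseline_failure_rate):
        weights = []
        for c in candidates:
            prior = 1.0 / (1.0 + correct_pair_counts.get(c, 0))   # frequent in correct runs -> unlikely culprit
            observed = failure_rate_with.get(c, baseline_failure_rate)
            benefit = max(baseline_failure_rate - observed, 0.0)  # drop in failure rate when c was active
            weights.append(prior * (1.0 + benefit))
        return random.choices(candidates, weights=weights, k=1)[0]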
>>: How many failures do you need to see for the feedback model to actually be useful?
>> Brandon Lucia: [inaudible]…
>>: You only get one sample, one data point, right, per one crash?
>> Brandon Lucia: Yep.
>>: Is that…
>> Brandon Lucia: That's right, yeah, so we found in our experiments that it's relatively few, ten to a hundred failures, and we start to see the feedback having an impact on which constraint is drawn. So in a way
you can think about this model as being predictive. We have an infinite amount of correct execution
data and then a failure happens. So this predictive model says which pairs of events aren’t likely to be
useful. So we discard those as much as we can. But we have some that we either don’t have enough
information about or are actually useful for preventing the bug…
>>: [inaudible]…
>> Brandon Lucia: So we use the prediction…
>>: [inaudible] failure is already to [indiscernible] if the…
>> Brandon Lucia: It…
>>: Crashes occurred then by the time they, you know it’s a pretty big emergency maybe. I don’t know
I’m just…
>> Brandon Lucia: It might be a pretty big emergency but it doesn’t fix the program…
>>: Look, I know but maybe at that point like there’s ten people working on it and they might fix it and
deploy it…
>> Brandon Lucia: So as an anecdote, and time is becoming an issue, but…
>>: [inaudible]
>> Brandon Lucia: There is an anecdote that I like to talk about, and this is something I saw on the MEMCACHED developer board. They had this bug and it was open for a year. It was a lost update that triggered an assertion failure at some point later, and the developers saw the bug report and then decided to ignore it because they said fixing it would introduce a seven percent performance overhead. So it stayed open for a year; who knows how many people using MEMCACHED experienced this bug, saw that their server went down, and restarted the stupid thing. Eventually enough bug reports came in that they actually went and fixed the bug; a year later they submitted a patch with the seven percent performance degradation. In contrast our system was able to avoid that bug with a fifteen
percent performance overhead which I’ll show you later and it did it in the space of ten to a hundred
executions rather than the number of executions that had bug reports submitted for them in the space
of a year. So that kind of helps to tune the timeframe for how bugs get dealt with. In general bugs can
stay open for multiple years; a year could be generous for some open source packages that are pretty
widely used.
>>: [inaudible]…
>>: So there’s millions of dollars right there in lost revenue?
>> Brandon Lucia: It could be. I mean that’s hard to quantify but I mean it’s a, yeah.
>>: [inaudible] contribute to that [inaudible].
>> Brandon Lucia: Say again.
[laughter]
>>: [inaudible].
>>: Yeah.
[laughter]
>> Brandon Lucia: Well, I mean so the information we’re collecting says a lot about why the bug
happened. So we can send it back to the developers and they can use this. Yeah.
>>: [inaudible]
>> Brandon Lucia: Well some are, I mean, ha, ha.
>>: My questions on [indiscernible].
>> Brandon Lucia: Okay.
>>: So mine was just like, you know, you had this sixty-second window, the Citibank window. So what your stuff is doing is reducing the window to forty-five seconds. I'd rather shut the program down, ha, ha, and not let people…
>> Brandon Lucia: I don’t understand, what is the window that…
>>: So there was the, one of the three things you motivated your talk was, I think was this casino
robbery…
>> Brandon Lucia: Yeah.
>>: This guy is exploiting this race or something…
>> Brandon Lucia: Yeah, yeah.
>>: With a sixty second window to download [inaudible]. So what you're saying, [indiscernible] somebody might do is shrink the size of the window but still keep it open, as opposed to someone noticing or detecting it and just shutting the system down.
>> Brandon Lucia: Unless there was some way of identifying that a failure or an attack…
>>: Something bad is happening…
>> Brandon Lucia: [indiscernible], yeah.
>>: It’s better sometimes just to shut it down, right…
>>: Are you saying that…
>> Brandon Lucia: But for…
>>: Prove that…
>> Brandon Lucia: Yeah.
>>: He’s trying stuff, right…
>> Brandon Lucia: For…
>>: [inaudible]…
>> Brandon Lucia: To keep the system available in the case of fail stop bugs this is definitely, I think this
is a better option than letting the system crash, if availability is the most important thing.
>>: Correct, but I’m say that sometimes…
>> Brandon Lucia: Sometimes it’s not…
>>: Sometimes we use the wrong metric, right…
>> Brandon Lucia: That’s absolutely true.
>>: There’s times where you kind of more or less sacrifice availability…
>> Brandon Lucia: Yeah.
>>: Because it’s worse to be available.
>> Brandon Lucia: Even in the case of security bugs this can still be useful and I think especially when
combined with techniques for identifying that something anomalous is happening in the execution. I
haven’t done that work yet but if you think about a technique that says, hey an attack might be
happening maybe we can use a mechanism like this in combination with something like that to keep the
system available and to close the security hole. I think that’s something interesting to think about, so...
>>: I think this is great for the staging area or [indiscernible] where you can [inaudible].
>> Brandon Lucia: Sure.
[laughter]
>> Jim Larus: So you and I are going to lunch so [indiscernible] can go ahead…
>> Brandon Lucia: Do you want to talk about [indiscernible] because I have a few more slides…
>> Jim Larus: Sure.
>> Brandon Lucia: I’m afraid people are…
[laughter]
[indiscernible] start leaving because it is…
>>: [inaudible]
>> Brandon Lucia: Twenty minutes to twelve now, ha, ha.
[laughter]
>>: So you know what…
[laughter]
>>: One piece of information that you don’t seem to use is that in correct executions something
happens in correct executions which doesn’t happen in incorrect executions so you don’t have to worry
about that [indiscernible].
>> Brandon Lucia: Right, we have half of that information but we don’t have…
>>: You don’t have that half.
>> Brandon Lucia: We don’t have the other half.
>>: Right.
>> Brandon Lucia: Right, so we can incorporate information from failing event histories into our
predictive model but I haven’t done that because I couldn’t come up with a way that reliably produced
good predictions. It's a hard problem because you also have a data sparsity problem: you see fewer failing executions than you see correct executions. There are lots of events in a
program. So for some of those events…
>>: [inaudible]
>> Brandon Lucia: You don’t have, you don’t have information from the failing executions which makes
it a hard thing to even incorporate that information. Yeah.
>>: So failure [indiscernible] instance which is a static instruction address, right?
>> Brandon Lucia: Yes.
>>: It tends, you’re not adding any context.
>> Brandon Lucia: We have none, it's context insensitive.
>>: So have you thought about adding more context like your context-aware stuff…
>> Brandon Lucia: I have.
>>: And see if it does actually do failure of…
>> Brandon Lucia: So call stack information would eliminate some spurious delays but collecting it is
expensive. I mean it adds overhead to collect the call stack information.
>>: [inaudible] right, so you can collect a lot more…
>> Brandon Lucia: No, we're not sampling, so in order to activate a constraint we would need to know that a particular instruction was executing under a particular call stack. So we would need to do a check that computed the call stack at each activation point, so.
Cool, our implementation is simpler than this slide makes it seem. There are three parts: the run time, the compiler and profiler, and the server. The profiler was written in Pin; the compiler we wrote as a pass for LLVM, and it takes responsibility for finding and instrumenting events and linking to the run time. The important thing about the interaction between the run time and the server is that they exchange event histories and they exchange schedule constraints. The server maintains the model of how to draw constraints, and they communicate over HTTP, so the system is portable in implementation. You can put it anywhere; it doesn't need to be on a single machine, for example.
So, now I’m going to talk about how we evaluated our system. My goal in the evaluation is to convince
you that we measurably decreased failure rates in our experiments with some real applications and that
our technique has overheads which are reasonable especially when availability is the key concern. Our
set up was to use a small cluster of machines that all run the application and we had a single Aviso
server. We used a set of benchmarks that partially overlaps with the ones that I used in the previous study; MEMCACHED, Apache, and Transmission were the biggest applications we looked at.
So here's a summary, a very high level summary of the results. For one of the applications we had a reduction in failure rate of eighty-five times. That was a failure in the PHP processing subsystem of Apache HTTPD, and we saw, yeah, an eighty-five times decrease in the failure rate, so it happened eighty-five times less frequently.
>>: That just one bug…
>> Brandon Lucia: This is one bug.
>>: You just avoid [indiscernible].
>> Brandon Lucia: That's correct. I mean that frequency was over a humongous space. In our experiments we used hundreds of billions of requests hitting these systems so it was a very large space of
execution that we looked at.
>>: Did you have any executions where you had; you were actually avoiding multiple bugs…
>> Brandon Lucia: Yeah, we had a study that didn’t make it into the paper where we took two different
bugs in a version of Apache and we showed that we can avoid them and that the key to that is that we
have schedule constraints and we need to decide if it’s the same bug happening or a different one. We
do that by fingerprinting bugs based on the event history that preceded the failure. So doing that we
can send one constraint for each bug that we’ve fingerprinted and we can solve that problem. It didn’t
make it into the paper, but we showed that it does work without the overhead growing like the product of the individual constraints' delays, so.
The average case was about a fifteen times decrease in the rate of manifestation of the failures. The overheads that we saw were practical, especially if availability is the most important thing.
There were overheads as low as five percent when we were monitoring the execution and using delays
to avoid failures. The average overhead was around twenty percent. So these are overheads that are
acceptable in production systems, like I said, especially when latency is not the highest priority and
availability is more important.
Okay, so just to wrap up this section. Schedule constraints are the new abstraction that we introduced. We have support in the compiler and run time, and we have a statistical model at the application level. This is a system that's useful across the system's lifetime because it actually helps deployed systems be more reliable. It takes advantage of collective behavior by sampling information from many deployed systems.
So that’s it for the two projects that I was going to drill down into. This has been just awesome to have
this many questions. I really appreciate it. Usually it can get dry to give this talk a million times, so, ha,
ha.
So now I’m going to move on in like three minutes and talk about some future work and then I’ll take
more questions afterwards if there are people that are wondering things. So in the future I’m interested
in continuing work in the direction of reliability and in looking at adapting these techniques to energy
efficiency. I’m also looking at some emerging architectures which I think are interesting. So I’m going to
talk about those now.
To start with though this is a picture of the way that I think computer systems are being built today and I
think it’s getting worse. So we have multicores and in order to get good performance out of a multicore
or a data center you have to put a lot of burden on the programmer to get that programming right so
that we get reliable execution. The burden is primarily on the programmer to go and avoid crashes, just
like this guy on the bike needs to…
>>: Is that also in Zimbabwe?
>> Brandon Lucia: No that’s not, ha, ha, this is stock art from the internet, ha, ha. This is stock art from
the internet…
>>: Oh.
>> Brandon Lucia: I just, I found this photo, ha, ha. I thought it was funny. So this guy gets brick level
parallelism but he pays for it in that he has to carefully stack these bricks on his bicycle.
[laughter]
He gets good performance but it’s really hard. So I think we need to address the reliability problem. In
the past we’ve been focusing a lot on the performance problem and I think the problem is getting worse
as we move toward heterogeneous architectures where the programming problem is going to be more
complicated. When we’re addressing reliability and performance we need to keep in mind complexity.
We need to balance where complexity ends up in a system. Does it end up in the architecture or the
compiler or the language, or in the programmer’s hands, or wherever. We need to keep that in mind
when we’re coming up with solutions.
So one thrust of my future work is going to be to continue to look at reliability, reliability is the problem
I’ve been talking about. In fact I think that as the performance benefits of Moore’s law are petering out
because of the utilization wall and the power wall we’re going to need to find other ways of adding
value to platforms. This is especially interesting to companies but I think that this is interesting in
general. One way that I think is a very promising way to do that is to add features to architectures and
systems that improve reliability all the time. I described two of them today, one that has hardware
mechanisms and another that’s a software layer. So I think there’s a big opportunity to do research in
reliability.
One idea in particular that I’m really interested in is the idea of decoupling the process of developing
software from the reliability of the software. Aviso is a really early example of this. You’re taking some
of the responsibility for avoiding failures out of the hands of the programmer. One area where I think
this is especially interesting is in shared resource platforms like Cloud applications and in mobile
applications. So I think of the process as hardening applications in these kinds of platforms. The
programmer doesn’t see anything different they just deploy the software. The user doesn’t see
anything different they just get software as it’s distributed.
Some interesting points related to how we can take advantage of…
>>: [inaudible]
>> Brandon Lucia: Shared infrastructure.
>>: Can you go back to the previous slide?
>> Brandon Lucia: Sure.
>>: So…
>> Brandon Lucia: Is it about the cats, ha, ha?
>>: At least in this company…
>> Brandon Lucia: Yeah.
>>: I’ve never seen that anybody would care so much about reliability especially if [indiscernible].
>> Brandon Lucia: U-huh.
>>: That, you know if the user doesn’t see any perceptible benefit right, why would a company invest in
reliability?
>> Brandon Lucia: They see it by comparison to other platforms. So you have all sorts of reviews on,
just take the mobile space for an example. You have Android versus iPhone…
>>: Okay.
>> Brandon Lucia: If I’m an end user and I’m saying which phone do I want to buy the next version of?
Well you can look at Android and you can look at iPhone and ask which one has more crashes, and then you can go and buy a Microsoft Windows phone or something like that because this one has fewer crashes, because someone baked something into the software run time layers to improve the reliability.
That could actually…
>>: That has never happened in the…
>> Brandon Lucia: It’s never happened…
>>: [inaudible]
>> Brandon Lucia: I totally agree with you. It’s never happened because people have focused on making
performance better in subsequent…
>>: [inaudible]
>> Brandon Lucia: Generations. What?
>>: Or features.
>> Brandon Lucia: But features are essentially performance. Features are things like vector
processing and that gets performance. Features are things like better optimizing compilers, it’s for
performance. So I think reliability, no?
>>: Features are features on your phone.
>>: Yeah.
>>: Something like…
>>: You know I want to talk to my phone so…
>>: [inaudible]
>>: I think the argument against that is that we spend an enormous amount of money on testing in our software…
>>: The other argument is we expose [indiscernible] lots of developers now, right, if you’re on Windows
[indiscernible] you can get [indiscernible] and we expose those to [indiscernible] more reliable.
>> Brandon Lucia: Right, so I think that because you have shared resource platforms like that you can do things like what you just described, and Aviso and what you just described are only the very beginning; I think there are a lot of other opportunities. So this shows some of the advantages to looking at these platforms and some of the opportunities that are there. One is that you have common infrastructure, so you don't need to boil the ocean. If you want to push a new testing tool, or a new optimization technique, or a new failure avoidance technique, you load it into the platform and everyone gets it, everyone has to use it. You have control over the hardware. So if you find that you can simplify the programming problem or get better performance by using some heterogeneous hardware for solving some problems, you can do that, because you have control over the hardware and the environment.
You have massive scale, so models like the statistical models I showed that we use in Recon and in Aviso improve when you have more data. You'll have lots of data if you're looking at a Cloud
system. We can also make systems that do something similar to what Aviso does and that is to make
changes to the way that they behave experimentally. Some of those changes might turn out bad but if
one of those changes turns out to be really good then that change can be shared with all the other
systems on that platform. I think that’s a really cool idea.
Finally I think it’s interesting to look at how we can use a model of behavior built in one system and we
can transfer that information over to another system. So what can we learn about Windows by looking at Linux, for example? Are there things at some level of abstraction that will transfer usefully between those systems? I think that there are. It's going to require changes to the system. We're going to need new primitives for introspecting into the behavior of the system, things related to concurrency like coherence and sequences of events potentially from different threads; exposing that in an efficient way to run time layers or to the developer is going to be a challenge, as is energy, which is a problem on everyone's mind especially in the mobile space.
I want to look at new mechanisms for failure avoidance. You’ll notice that there was no hardware
support in Aviso, but one of the challenges Aviso has to overcome is the overhead of enforcing those constraints. I think with hardware support we could do a better job of that. So I think changing lower levels of the system, whether actually in hardware or not, is an interesting thing to look at when dealing with failure avoidance, also looking not just at concurrent programs but at sequential programs
as well.
I also think another way to deal with this problem, the problem that programming is so difficult, is to
just do the programming for the programmers, so looking at synthesis techniques. I’m working on a
project with some natural language processing researchers right now where we mined a bunch of code
from the internet and we’re looking at ways of incorporating that into an active learning based code
synthesis engine.
The last idea I want to talk about is that power failures impact your reliability. If you have a platform
that experiences power failures often that’s a reliability issue. So energy efficiency is a form of being
reliable. I’m especially interested in this area in the domain of small and unpowered devices. Intel has a
little device called a Wisp and this was developed in collaboration with several people from academia.
It’s a very interesting device because it doesn’t have a battery. The way that it powers itself is by
harvesting ambient radio frequency energy and charging a supercapacitor, and as the supercapacitor discharges it does a little bit of computation. That's a really interesting platform because it requires interruption tolerance during the execution. Power failure goes from being the once-in-a-while event where someone kicks the plug out of the wall to being maybe ten or a hundred times every second, depending on the size of the capacitor and the rate of the computation. That fundamentally changes the way that you design what is an operating system. How do you program devices like this? Maybe we
want to treat power failures as recoverable exceptions. What would be the system layers that we
require to do that? So I think this is really an interesting problem to look at especially as these devices
find use in more ubiquitous computing applications.
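As a purely illustrative sketch of that idea, one could picture treating an imminent power failure as an exception that is handled by checkpointing volatile state to non-volatile storage and resuming from the checkpoint on the next charge cycle. This is not a description of any existing WISP system software; all names below are made up, and real energy-harvesting runtimes are written for bare-metal microcontrollers rather than Python.

    # Toy sketch of treating a power failure as a recoverable exception via checkpointing.
    import os
    import pickle

    CHECKPOINT = "nv_state.pkl"        # stands in for non-volatile memory

    class PowerFailure(Exception):
        pass

    def restore_or_init():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)  # resume from the last checkpoint
        return {"i": 0, "acc": 0}      # fresh start on first boot

    def run_until_power_fails(state, budget):
        # budget stands in for the energy left in the capacitor for this charge cycle
        for _ in range(budget):
            state["acc"] += state["i"]
            state["i"] += 1
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)      # checkpoint before the lights go out
        raise PowerFailure()

Each charge cycle then becomes a call to restore_or_init followed by run_until_power_fails, with the PowerFailure caught by whatever plays the role of the operating system layer.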
I also think it's interesting to look at ways of avoiding power failures by, for example, trading off energy related failures against programmer reliability mechanisms. A programmer reliability mechanism is something like a null check. If you can elide a null check to save enough energy to keep the system alive, you might want to do that. But you only want to decide to do that when it's really important. So you have to have some introspection on how much energy remains, and how you can make that tradeoff dynamically is an interesting question.
I'm going to skip this last bullet and just point out that there are lots of cool applications for this stuff with people working in health and environmental sciences; especially in the northwest we have lots of forestry and water research. There are lots of interesting health applications that would be relevant to a
company like Microsoft, especially working with these small devices and how they’re programmed and
things like that. So I think there’s a lot of really interesting and visible opportunities for collaboration
and applications there.
Okay, so that’s my talk. There’s a big list of collaborators that I’ve worked with over the years at UW, as
well as several people from Microsoft Research and HP Labs and IBM. I really appreciate you giving me
your attention and asking so many questions. I’ll take more questions in the last five minutes if there
are any.
>> Jim Larus: Alright, are there any more questions? I didn’t think so, ha, ha.
>> Brandon Lucia: Cool.
>> Jim Larus: Thank you very much.
>> Brandon Lucia: Thank you very much. Yeah this is great.
[applause]