>> Andrew Baumann: Okay. So thank you all for coming. I'm Andrew
Baumann, and it's my pleasure to introduce Gautam Altekar from UC Berkeley.
You might remember some of Gautam's work. For example, at the last SOSP
there was a paper about output-deterministic replay. More recently he's been
taking some of those similar ideas and extending them to replay debugging of
distributed systems and datacenter applications. And I'm very excited to hear
about that work and what he has to say.
>> Gautam Altekar: Thank you. Okay. Thank you, Andrew for that introduction.
Okay. Thank you for coming to my talk. My name is Gautam. And I'd like to talk
to you about debugging datacenter applications. Now, only my name is on the
slide, but actually there are several collaborators, including Dennis Geels, who is
now at Google, my colleague, my advisor, and several folks from EPFL.
Okay. So we all know that debugging is hard. Now, debugging datacenter
software is really hard. Datacenter software. What am I talking about? I'm
talking about large-scale data-intensive distributed applications, things like
Hadoop MapReduce, the Cassandra and Hypertable distributed key-value stores,
and Memcached, the distributed object caching system.
Now, why is it hard to debug these applications? Well, there are three main
reasons. Foremost is nondeterminism. You develop these applications, you
deploy them to production. Something goes wrong and then you try to reproduce
that in a development environment and you can't reproduce the failure. And
because you can't reproduce it, you can't employ traditional cyclic debugging
techniques to repeatedly reexecute and home in on the root cause.
Now, the second thing that makes it difficult is that these large scale systems are
prone to distributed bugs, bugs that span multiple nodes in the system. And
when you're talking about thousands of nodes you're talking about really complex
interactions that are very difficult to get your mind around.
And the final thing that really makes things complicated is the need for
performance. These applications are part of 24 by 7 services. You can't just
stop them to do your debugging. And you can't use heavyweight instrumentation
to collect debug logs for later debugging.
So in this talk, I want to focus on the problem of how we can make it easier to
debug these datacenter applications despite these challenges that we're faced
with.
Now, the Holy Grail of debugging in our view is a system that automatically
isolates the root cause of the failure and automatically fixes the defect underlying
the root cause. Okay? So at this point we're thinking okay, why automated
debugging, why not use, you know, static analysis, testing, model checking,
simulation, to find the errors before they manifest in production? Well, as good
as these tools are, oftentimes they miss errors, particularly when you're talking
about large state spaces.
In the case of things like static analysis, they can be conservative and they can
give you -- tell you about all sorts of errors, but still the developer may not
be interested in going through these error reports. And eventually errors
will manifest in production. And at that point it would be nice to
have an automated debugger to debug those errors for you, those failures for
you.
Now, as you might imagine, building a fully automated debugger is very difficult,
okay. Building one for datacenter applications is even more difficult. So the goal
of this project is much more modest and that is we want to try and build a
semi-automated bug isolation system. Now, semi-automated, what do I mean by
that? I mean that we still want developers to be engaged in the bug isolation
process. So there's still going to be some manual effort involved. But at the
same time we want to go beyond printfs and GDB to reduce the overall amount of
work that they have to do. They shouldn't have to do, you know, laborious, tedious work.
Now, I should mention that there are certain things we aren't trying to do. In
particular we weren't trying to automatically fix the bug with this system. And
that's deferred to future work. And I'll talk a little bit about that later on.
Okay. So I've told you what we're trying to do, what the problem is:
semi-automated bug isolation. What exactly is it that we've done? What is our
contribution? A key contribution is a system and framework that we call ADDA.
It stands for automated debugging for datacenter applications. And what ADDA
is, at a high level, is a framework for analyzing datacenter executions. Okay?
So it offers two things for this purpose. It offers a powerful analysis plugin
architecture with which you can construct sophisticated distributed analysis tools,
things like global invariant checking, distributed dataflow and communication
graph analysis.
And beyond being powerful, it offers a very simple programming model for the
framework. So you can write new plugins that are sophisticated fairly easily.
And so these plugins are written in the Python programming language. We
chose that for its rapid prototyping capabilities. And then on top of the
Python we also provide facilities for inspecting and reasoning about distributed
state. So that's key. We wanted to make -- we want to make it easy to reason
about distributed state.
Okay. So plugins. And analysis. What am I talking about? So let's take a look
at an example. This is an honest to goodness ADDA plugin. And it's very
simple. It performs a distributed check, a global invariant check. And basically
checks that no messages were lost during an execution. Okay. So it's a
distributed check. It's very simple. And fundamentally it's written in Python and it
employs a callback model whereby whenever a message is sent in the distributed
system, on send gets invoked. Whenever it's received, on receive gets invoked.
And on sends and receives, you maintain this set of in-transit messages,
represented by the intransit Python variable. At the end of the execution we look
at the set and say okay, any messages that are still in transit that were never
delivered, okay, then those messages were lost. Question? Yes.
>>: What [inaudible] find trend?
>> Gautam Altekar: So this is the application messaging layer. So system call
level.
>>: [inaudible] you don't have reliable transport?
>> Gautam Altekar: Yeah. So for example in the case if you use UDP, for
example. Yeah, this would be a concern for UDP. TCP obviously might not be a
problem.
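To make that callback model concrete, here is a minimal sketch of what such a plugin might look like. The callback names (on_send, on_receive, on_end) and message fields are illustrative assumptions, since the actual ADDA plugin API isn't reproduced in this transcript.

    # Hypothetical sketch of the message-loss invariant checker described above.
    in_transit = set()   # global state; ADDA gives the illusion of shared memory

    def on_send(msg):
        # Invoked whenever any node in the distributed system sends a message.
        in_transit.add(msg.id)

    def on_receive(msg):
        # Invoked whenever any node receives a message; causal consistency
        # guarantees this fires only after the corresponding on_send.
        in_transit.discard(msg.id)

    def on_end():
        # At the end of the replayed execution, anything still in transit
        # was sent but never delivered -- a lost message.
        for msg_id in in_transit:
            print("lost message:", msg_id)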
Okay. So that's a very concrete example of what ADDA offers. But beyond that
there are really four key features offered by the analysis framework programming
model. First is the illusion of shared memory. You don't need to explicitly send
messages around to maintain global state. You can store it as single Python
variables. You get serializability: the callbacks execute one at a time,
so you don't need to worry about concurrency issues, data races, when you're
developing these plugins.
And a really powerful feature is a causally consistent view of distributed state, so
what that basically means is that whenever you look at distributed state you'll
never see a state in which a message has been received but not yet sent. And
this is a very important property for reasoning about causality in the distributed
system.
And finally ADDA provides fine-grained primitives for performing deep
inspection of your execution, so things like instruction-level analysis, taint-flow
analysis, instruction tracing, that sort of thing, all that stuff comes out of the box
with ADDA.
So at this point you're wondering, okay, well, that's great, a powerful
analysis framework, I'd like to use that, but what's the problem? What is the
challenge behind developing something like this?
The main challenge is that there is a conflict between providing these powerful
analysis tools and achieving good in-production performance, okay? Things like
providing the illusion of shared memory, serializability, causal consistency -- all of
these things are very expensive to provide in an in-production execution. Causal
consistency, for example, requires distributed snapshots. You're talking about
Chandy-Lamport, heavyweight techniques. And then for fine-grained
introspection, taint-flow analysis, instruction-level analysis, you're talking about
binary translation. You can't do this kind of stuff on an in-production execution
without major slowdowns.
Okay. So I've told you about the problem. I've told you that there is a conflict
between providing powerful analysis and achieving good in-production
performance. So how do we resolve this conflict? That's the key question.
So the key observation that we make is that for debugging purposes, we don't
need to do these analyses in production. For debugging purposes it suffices to do
the analysis offline, okay. We don't need to do it in production. It suffices to do it
offline for debugging. And this observation motivates ADDA's approach of using
deterministic replay technology to perform the analysis on an offline replay
execution. Okay?
And, in fact, we shift the burden, the overheads of analysis from in production to
an offline replay execution. This works in two phases. You basically run your
datacenter application on the production cluster with some light tracing, and then
later when you want to perform an analysis you do a replay on a development
cluster and then you turn on the heavyweight analysis that you've written using
ADDA. Question. Yes?
>>: [inaudible].
>> Gautam Altekar: Absolutely not. ADDA supports a partial analysis mode in
which, if you have a thousand nodes, you can actually try and replay it on one node.
It's going to be really slow, of course, but it's possible. Okay. So deterministic
replay is the key technology that we leverage to perform the analysis offline. But
you might say well there's a problem here. These datacenter applications are
nondeterministic. How do you make sure that you can deterministically replay
them? Because when you run them, you know, again they might not do the
same thing.
So to address this problem, we employ what is known as record replay
technology. Okay? It's a brand -- it's a kind of technology for providing
deterministic replay. And it works in two phases. In the first phase you record
your application's nondeterministic data -- things like inputs, incoming messages,
thread interleavings -- to a log file. And this is done in production, from your
production datacenter application.
And this -- all this information goes to a log file. And at a later time when you
want to do the analysis, you start up your replay session and you basically -- this
entails rerunning your application, but this time using the information in the log file to
drive the replay. Okay? Now, at this point you're probably wondering okay, why
use record replay. Why not employ some of the more recent deterministic
execution techniques, things like Kendo and dOS and CoreDet?
Well, the short answer to this is that these techniques are complementary to
record replay and that the goal of deterministic execution is to minimize the
amount of nondeterminism in these applications. Now, that's good for us
because the less nondeterminism there is, the less we have to record. At
the same time it's important to note that nondeterminism can't be completely
eliminated. There's nondeterminism inherent in the external environment -- in the
network and the routers, for example DNS messages -- and it still
has to be recorded, even with deterministic execution tools. So we think that
record/replay and deterministic execution techniques are complementary and
can coexist and have much to learn from each other.
Now, okay, so we want to do record/replay. What exactly are we looking for in a
datacenter replay system? There are three things. Foremost, we want to be able to
record efficiently, okay? Good enough for production use, which demands at most a 10
percent slowdown and about 100 KBps per node logging rates.
The second requirement is that we'd like to be able to replay the entire cluster if
it's desired. And the reason is that we don't know where -- on which nodes a
bug, a distributed bug in particular is going to manifest. It could be on one node,
it could be on two nodes, it could be the whole cluster, so we would like to be
able to replay. Questions?
>>: [inaudible] slow down just when you decide to do some recording for replay
or is this slowdown --
>> Gautam Altekar: Yeah, this is record-mode slowdown. We're willing
to tolerate much more slowdown for replay because it's offline, but for recording it
should be at most that much -- I realize it's kind of on the high end, but I think this is
at least good enough for at least part-time use.
>>: [inaudible] 10 percent slowdown or is it just like you decide there's some
issue you want to try to record, then you could --
>> Gautam Altekar: Yeah. Yeah. So, you know, if you notice some problems in
your production system you can just turn it on for a while and you'll take a 10 percent hit
for that time, you can turn it off and then you can replay it. So that's one usage
model for the system. Question?
>>: 10 percent on aggregate throughput or on 99 [inaudible] latency? What do
you mean --
>> Gautam Altekar: Aggregate throughput. Throughput. Not latency but
actually throughput of these applications.
>>: [inaudible] latency [inaudible].
>> Gautam Altekar: It might. Well, no, in practice the latency is actually pretty
good too. So it's not -- I don't think -- I don't think we make any sacrifices in
latency either. But our focus has been really on getting good throughput
because that's what -- that's what these distributed datacenter applications care
about rather than latency. Or at least the apps that we consider. Question?
>>: Just to follow up on Rich's question, if you did turn on recording for just a
period of time, do you have to take a checkpoint [inaudible] at that period?
>> Gautam Altekar: Yes. You do. Yeah. You would have to do that. Or you
would have to restart your services for those nodes that you consider.
>>: Definition of deterministic replay. You're not recording every thread
interleaving?
>> Gautam Altekar: Absolutely. I'm going to get to that.
>>: [inaudible].
>> Gautam Altekar: I'm going to get to that in just a few slides. Any other
questions? Question?
>>: [inaudible] sort of wonder about your assumptions about [inaudible] so have
there been studies done to -- of these [inaudible] and, you know, what sort of
techniques would capture them. I mean Microsoft, going back to the static
analysis, whenever we get security alerts or things like that, we get these bugs
from the [inaudible] and we look and see whether there could have been a static
analysis [inaudible] that they could have gotten [inaudible] so I wonder, do you
have, you know, data about the types of bugs that -- and, you know, you made
the case that regardless bugs will always get into production. But that said, you
know, it would be nice to sort of know how many of those bugs that made it to
production could have been [inaudible]. So that's -- have there been studies
[inaudible].
>> Gautam Altekar: Yeah. I agree that would be nice, but I don't think there's
any comprehensive study of that kind. And we haven't done it. Most of this work
has been motivated by our own prior experience in building large-scale
distributed applications in house in the RAD Lab at Berkeley. And there we've
encountered many issues, but I don't think we've ever documented it in a formal
published paper. So it's just kind of -- you know, people kind of generally
accept that this is an issue but we don't really have hard numbers. And I agree
that --
>>: [inaudible] there's lots of war stories. There's very little hard data sort of
about how much people time it costs, you know, how much machine time it costs.
[inaudible] everybody sort of accepts I guess it's a problem. But we don't really
know how big a problem it is in terms of cost. We also [inaudible] look at any
classifications [inaudible].
>> Gautam Altekar: Yeah. Well, I think that kind of data would be very
interesting, especially coming from industry. I think that would have more clout.
Because our experience has been with just a few grad students, and it's not clear how
much we can do with that kind of system and manpower. So I think it would be a
great project to do for the industry.
>>: Like this basis, this idea that the [inaudible] I believe you observe this like.
>> Gautam Altekar: Oh, yeah.
>>: Like most of the [inaudible] 50 percent --
>> Gautam Altekar: Well, the hardest -- among the hardest ones, qualitatively
speaking, but we don't have -- again, I can't really say that 15
percent of all bugs are like this. Something to do for future work. Okay.
So those were the requirements. Now, if you look at the related work in this area
in replay systems, you'll see that there are many of them, many systems but
none of them are quite suitable for the datacenter. In particular they're deficient
in one requirement or the other. Systems like VMware's deterministic replay
system, and Microsoft's several replay systems, provide -- you know, they
can be made to or they already provide whole-cluster replay, and they have wide
applicability. However, in the datacenter context -- you know, in the context of
processing terabytes [inaudible] computations, these systems don't record very
efficiently. You can't use them in production. In the datacenter.
And then you have systems like IBM's Deja Vu system, an older system that records
efficiently and provides wide applicability, but it makes certain assumptions that don't
hold in the datacenter. So you may not be able to provide whole cluster replay.
And then finally you have Microsoft's R2 system which records efficiently and can
provide distributed system replay, however you have to retrofit your application
using certain -- using the annotation framework. And this might be a bit of an
annotation burden. And so this doesn't have quite the wide applicability
that we were looking for.
Okay. So what have we done? Well, we've built a datacenter replay system that
meets all three of these requirements, records efficiently, not quite 10 percent, I'll
say that up front, it's about 40 percent right now. We're working on getting that
down. It provides whole-cluster replay. And it has wide applicability in the sense
that you can record and replay arbitrary Linux x86 applications, particularly in the
EC2 environment.
Now, ADDA, our analysis framework, uses DCR to do the offline analysis, okay?
And again, DCR is designed for large scale data intensive computations,
applications like Hadoop MapReduce, Cassandra, Memcached, Hypertable, so
on, so forth.
Okay. So deterministic replay is the key. But -- and we've built a system that
meets all three of these requirements. But how do we do it? What is the key?
Well, the key intuition behind this replay system is that for debugging purposes
we don't need to produce an identical run to that of the original that we saw in
production. It often suffices for debugging purposes to produce some run that
exhibits the original run's control-plane behavior, okay? Control-plane behavior.
What am I talking about? If you look at most datacenter applications, they can
be separated into two components: a control-plane code component and a
data-plane code component.
The control-plane of the application is kind of the administrative brains of the
system that manages the flow of data through the distributed application. It tends
to be complicated; it does things like distributed data placement and replica consistency.
And it accounts for 99 percent of the code. But at the same time, it accounts for
just one percent of the aggregate systems traffic, just one percent of all the
traffic.
Now, this is in contrast to the data plane code which is the work horse of the
system. It processes the data -- things like checksum verification, string
matching; it goes through every byte, computes the checksums, that kind of thing.
Now, it turns out data plane code is actually very simple. It accounts for about
one percent of the code, a lot of it coming from libraries. But at the same time it
accounts for 99 percent of all data, traffic that is generated in the system. Now,
the key observation here is that most failures in these datacenter applications
stem from the control plane. And this is backed up empirically in our HotDep
paper, and you can take a look at that for some numbers.
Now, what does this observation do for us? Well, now we can relax the
determinism guarantees offered by the system to what we call control-plane
determinism. And if we shoot for control-plane determinism, we can altogether
avoid the need to record the data plane, the most data-intensive, traffic-
intensive component. And as a result we can meet all the requirements. We can
record efficiently. Recording the control plane is very cheap, it's just one percent
of all the traffic.
We can provide whole-system replay because now we can record all of the
nodes, all of the control planes. And we can provide wide applicability because
we don't need to resort to any kind of special-purpose hardware or languages in
order to provide efficient recording.
So control plane determinism is the key to meeting all three of these
requirements.
Okay. So if control plane determinism is the key, then how do we take this and
turn it into a concrete system design? Okay? So what you see here is a
distributed design, and it operates in two phases, a record phase and a
replay-debug phase.
Now, in the first phase ADDA's replay system, DCR, records each node's
execution and logs it to a distributed file system. We use the Hadoop distributed
file system in this case.
Now, you'll notice that each node doesn't record all sources of nondeterminism,
each node records just the control-plane inputs and outputs. Again, we're
shooting for control plane determinism so we just need to record control plane
inputs and outputs. Now, during replay, the replay system reads the control
plane inputs and outputs and starts up a replay session. And then on top of this
distributed replay session we run the distributed analysis plugins. So that's the
high-level view of ADDA's architecture. Got a question?
>>: Yes. So when you talk about the control plane more specifically do you
mean that you need to record the headers of messages but not the bodies?
>> Gautam Altekar: Yeah. That's one way to think about it. You have the
metadata describing the data. That's at the headers of messages. In other
cases in datacenter applications components are exclusively control plane. So if
you think about say Map Reduce, okay, so you have a job tracker which is
exclusively a control-plane component, it's responsible for maintaining the
mappings of jobs; it performs an exclusively control-plane activity, and so you
could record all of its channels.
>>: So how do you automatically or maybe not determine, you know, which data
is controlled by [inaudible].
>> Gautam Altekar: Okay.
>>: [inaudible] so there's a long history of work [inaudible] scientific community
[inaudible] programs, how is this different from the 20-, 30-year predecessors?
>> Gautam Altekar: Well, so there's several challenges. The sheer volume of
the data that has to be recorded is a major challenge. In the scientific computing
it's mostly compute heavy you have -- you don't -- well, I don't know what
applications you have in mind, but it's not that -- it's not that scale of processing.
>>: [inaudible].
>> Gautam Altekar: Well, okay. So in scientific --
>>: Look at the very large scale [inaudible] they are running at very large scale, tens of
thousands of processors in the space.
>> Gautam Altekar: Okay. But do they have stringent in-production overhead
requirements?
>>: Yes.
>> Gautam Altekar: So tens -- they demand 10 percent or otherwise it doesn't --
>>: [inaudible].
>> Gautam Altekar: Okay. Well, another challenge then is shared memory
multiprocessors. Now, you have concurrency. And how do you provide replay for
a concurrent execution?
>>: All of that is hard to [inaudible].
>> Gautam Altekar: Certainly there are many techniques. But the overheads of
those techniques are quite high, if you look at them carefully. So things like --
one of the earlier systems was the Instant Replay system, which proposed a
CREW model of recording shared memory accesses. It's very expensive; the
overheads are very expensive for in-production use. I mean, I don't think you could
use those systems in a production datacenter.
>>: [inaudible] message based, message passing systems and you can
[inaudible] determinism [inaudible].
>>: [inaudible].
>>: I think it's also the case that in the HP [inaudible] typically you're operating
on a [inaudible] inputs and outputs [inaudible] do things like [inaudible].
>>: They do do deterministic checkpoints but the replay systems [inaudible]
checkpoint [inaudible].
>>: [inaudible] is the nondeterminism strictly from the messages or is there also
shared memory --
>> Gautam Altekar: There's shared memory as well.
>>: Okay. How does the [inaudible] the log of the shared memory
nondeterminism compared to the volume [inaudible].
>> Gautam Altekar: So for these -- so it depends obviously on the application,
the amount of sharing. So I'm going to get to that in just a few slides if you don't
mind. So are there any other questions? Okay. Okay. So, when you look at this
design, two questions come to mind, okay. So first of all, how do you identify
the control plane, okay?
We're talking about recording control-plane I/O. How do you know what that is?
And the second question is okay, you record just the control-plane I/O, but you
don't know what the data-plane inputs are. Because you don't have all
the inputs to the program, how is it that you provide replay? Okay.
So let's start with the first question, how do we identify control-plane I/O? So we
have essentially an automated identification technique. And I stress it's a
heuristic. It's not perfect. And this heuristic is based on the observation that
control-plane channels operate at low data rates. Okay? They account for one
percent of all traffic in the datacenter application. So this leads to an automated
classification technique that has two phases. First we interpose on nondeterministic
channels and then we classify these channels as control or data plane by looking
at their data rates. So we interpose specifically on network and file channels,
basically using system calls -- sys_send, sys_receive -- at the kernel level, and then we
interpose on shared memory -- a particular type of shared-memory
channels, in particular data-race channels, using the page-based concurrent
read/exclusive write memory sharing protocol.
Now, this is a conservative protocol. It doesn't detect data races but it will detect
conflicting accesses to pages. And if there is a data race, then it is handled. It is
intercepted.
Now, in the second phase the key challenge of course in classification is that you
have these bursts in any kind of communication channel, even in the control
plane. So it's not a flat data rate, and you can't just use that as a threshold. So
to deal with that problem, we employ a token bucket filter. And we found that
basically a rate of about 100 Kbps with bursts of 1,000 bytes suffices
conservatively to capture most control-plane I/O.
And similarly for shared-memory communication: a fault rate of 200 faults per
second, with bursts of 1,000 faults, is a pretty conservative bound on capturing control-plane
I/O. Now, if all of this fails, keep in mind that you can always go in there and
annotate specific components, say okay, you know, this is the control plane, I
know this -- the Hadoop job tracker is a control-plane component -- so forget it, don't worry about automatically trying to figure out if it's control plane or not.
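To make the classification heuristic concrete, here is a minimal token-bucket sketch along the lines described above; the class structure and parameter names are assumptions, not DCR's actual code, and the rate constant is only a rough stand-in for the roughly 100 KB-per-second figure mentioned in the talk.

    # Sketch of token-bucket classification of channels as control vs. data plane.
    class TokenBucket:
        def __init__(self, rate_bytes_per_sec=100_000, burst_bytes=1_000):
            self.rate = rate_bytes_per_sec   # sustained rate budget
            self.burst = burst_bytes         # tolerate short bursts
            self.tokens = burst_bytes
            self.last = 0.0

        def conforms(self, nbytes, now):
            # Refill tokens for elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if nbytes <= self.tokens:
                self.tokens -= nbytes
                return True      # low-rate traffic: looks like control plane
            return False         # exceeded the budget: treat as data plane

    # A channel whose transfers always conform is classified as control plane
    # and recorded; a channel that overflows the bucket is treated as data plane.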
>>: [inaudible] control-plane shared-memory accesses [inaudible] I understand
you're trying -- distributed system messages, data [inaudible] control messages
over the network but you [inaudible] page measuring a number of faults by
[inaudible] read only --
>> Gautam Altekar: Yeah, so --
>>: What does that mean, control page [inaudible].
>> Gautam Altekar: Yeah. So in terms of shared memory, we consider locking
operations on data -- coordinated accesses to data -- to be control plane. You know,
suppose you have a red-black tree that you need to coordinate access to, then
you acquire a lock on a shared memory location. Then I would consider that to
be a control plane -- a form of control-plane communication that needs to be
intercepted.
>>: And you expect those accesses to be lower than [inaudible].
>> Gautam Altekar: Yeah, so --
>>: Protected.
>> Gautam Altekar: So for example on Linux you use futexes to avoid kind of
spinning, spin locks. So you try once; if you don't get it, then you block in the
kernel. So this reduces the kind of -- the high data rates that you might see
otherwise. Okay?
Okay. Of course now the second question is, okay, we record just the control-plane
I/O; now how is it that you provide replay with just that information? To address
this question, we employ what we call -- a technique we call Deterministic-Run
Inference, DRI for short. The key idea is that, yes, it is true that we don't record
the data-plane inputs of the original run, however, we can infer these inputs
postmortem, offline, okay? And again, this inference doesn't need to be precise.
We don't need to infer the exact concrete values of the original run. It suffices so
long as we infer some values that make -- make the replay execution exhibit the
same control-plane behavior.
So this is a relaxation of determinism that we're shooting for. And this inference
process works in three phases. First you do the recording. You
send the control-plane I/O to the inference mechanism, and then the inference
mechanism will compute concrete control- and data-plane inputs -- the control-plane
inputs are already recorded, so no computation is required there. And then you feed
this into a subsequent execution and then you get a control-plane deterministic
replay run. Okay. Question?
>>: So to you [inaudible].
>>: A protocol that's piggybacking control-plane messages on top of --
>> Gautam Altekar: Yes. Yes. So one of the -- yeah. So it depends on how good the heuristic is. So for
example we've -- one thing that we do is we observe that you have, you know,
any -- even in data-plane channels, you have these embedded control-plane
channels; in particular, message headers are a type of control-plane channel. So
you can just say okay, I'm going to consider the first 32 bytes of every application
message boundary to be control plane, as a kind of heuristic in order to capture them.
So, yes, it's a heuristic -- it isn't perfect, but we think we can, with these
engineering tricks, make it more accurately approximate the control
plane.
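A trivial sketch of that header-prefix heuristic, just to pin it down; the constant and function are illustrative, not ADDA's API.

    # Treat the first 32 bytes of each application-level message as an
    # embedded control-plane channel.
    CONTROL_PREFIX_BYTES = 32

    def split_message(payload: bytes):
        header = payload[:CONTROL_PREFIX_BYTES]   # recorded as control plane
        body = payload[CONTROL_PREFIX_BYTES:]     # left to data-plane inference
        return header, body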
>>: So even though you might have only a small number of bytes recorded
[inaudible] be able to replay it? If [inaudible] if I have 38 bytes and you record
only 32 [inaudible] replay?
>> Gautam Altekar: Yes.
>>: Okay.
>> Gautam Altekar: [inaudible].
>>: Do you think there maybe is a more fundamental insight? I mean the control
plane versus data plane that's just giving you [inaudible] but it seems like what
you really want is you want [inaudible] just the bytes control our data or whatever
to let you get this.
>> Gautam Altekar: Right. Well, sure. But I mean you can take -- if you take
this to the extreme, right, and not record any control plane, none of the inputs, it
is possible you could get something -- you could get an execution but there's a
chance that you may not reproduce the underlying root cause as a result. The
observation is that the control plane -- it is important to try and reproduce the
control plane because the root cause is embedded as part of that code. So we
want to reproduce that behavior of that code as much as possible.
>>: I guess I'm just asking, have you considered -- you could look at the group
[inaudible] maybe there's some slice in the program and it's maybe not even
control versus data but something even smaller that [inaudible].
>> Gautam Altekar: That's possible. Well, although we haven't completely
tested the limits of the control-plane determinism model it seems to be pretty
good so far, but some challenges with shared memory of course. So there's
room for future work on that.
>>: Sure.
>> Gautam Altekar: Question?
>>: What if the data plane can't be inferred -- like what if the code said receive the
message, if the signature is correct then do X. How are you going to generate a
data plane --
>> Gautam Altekar: These are all very good questions and I'm going to address
that in two slides. Okay? So how does this inference work? Let me just give
you a brief overview of this.
There are two phases. First you take your distributed application, your program,
and you translate it into a logical formula known traditionally as a verification
condition. And this is done using symbolic execution techniques. And the
resulting formula basically expresses the program's control-plane output as a
function of its control-plane input and data-plane input.
Now, we know the control-plane input and output because we recorded it, okay?
It's in the logs. So we know the concrete values there. But we don't know the
data-plane inputs. They're unknowns. So in order to figure that out, we send this
formula over to a constraint solver which thinks about it for a while and then
returns concrete values to the data-plane inputs. Okay? So that's the basic idea
-- question?
>>: Are you translating program or program trace? Because translating a
program into [inaudible] is a little difficult in the presence of [inaudible].
>> Gautam Altekar: Oh, yeah, absolutely.
>>: So what is it you're actually translating?
>> Gautam Altekar: It's a partial trace so we have basically an execution with
recorded control-plane I/O, right? But some of the stuff -- some of the inputs we
don't know, and so we'll still have to deal with loops and things like that.
>>: So --
>> Gautam Altekar: So it's a kind -- it's static verification condition generation,
but we have some of the inputs, so that allows us to kind of reduce the path space and be more like symbolic execution.
>>: So I have -- just to be concrete, I have some program that's taking some
input off the network. We know what that input is. And you're going to reexecute
the code with that input to get sort of a partial program. What's the technique
exactly?
>> Gautam Altekar: The technique is symbolic execution, okay? So --
>>: [inaudible] as far as an input.
>> Gautam Altekar: Which -- which --
>>: [inaudible] execution along a path or.
>> Gautam Altekar: Along a set of paths. It's multipath symbolic execution. So
we have to consider multiple paths. So in the simplest case we take the
application, we pick a path for the application, symbolically execute for it. So
now we have a formula for one path.
>>: Right.
>> Gautam Altekar: Pick a different path. This may entail going around two
times in a loop, okay? And we pick another path, so on and so --
>>: Pick a path that's consistent with the control-plane inputs.
>> Gautam Altekar: Exactly. So that's why when we do the symbolic execution
we feed in the control-plane inputs that we know. So then this focuses us
essentially on the path traces that are consistent with those inputs that we received.
So it's a technique also for reducing the search space for the symbolic execution.
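As a toy illustration of the inference step just described: below, the control-plane output (a recorded checksum) is expressed as a function of unknown data-plane bytes, and a solver fills in concrete values. The Z3 solver and the additive checksum are my own choices for the sketch; the talk doesn't name the solver or the exact formula shape.

    from z3 import BitVec, BitVecVal, Solver, sat

    # Recorded control-plane output (hypothetical): the node echoed back a
    # checksum value of 7 for a 2-byte payload it received.
    recorded_checksum = BitVecVal(7, 32)

    # Unrecorded data-plane inputs: the payload bytes are symbolic unknowns.
    payload = [BitVec(f"byte_{i}", 32) for i in range(2)]

    s = Solver()
    # Verification condition: control-plane output as a function of the
    # data-plane input (here, a simple additive checksum over the payload).
    s.add(recorded_checksum == payload[0] + payload[1])
    for b in payload:
        s.add(b >= 0, b < 256)   # each unknown is a byte

    if s.check() == sat:
        m = s.model()
        # Any satisfying assignment (0 and 7, 3 and 4, ...) yields a replay run
        # with the same control-plane behavior; the exact original bytes are
        # not needed for control-plane determinism.
        print([m[b].as_long() for b in payload])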
Okay. Okay. So a recap. What we're trying to do is offline
replay of datacenter applications. We record the control-plane I/O and we try to
infer the rest, the data-plane inputs. A gentleman brought up a very important
question: what about the scalability challenges? Okay.
So indeed, this problem is not solved, and it is an essential challenge to making this
inference technique work. Now, if you do this inference naively, it doesn't scale.
And the basic reason is that you're -- you know, you're searching for gigabytes of
concrete inputs, data-plane inputs. And this is a search through an exponential
search space.
And fundamentally there are -- more specifically, there are two problems.
First, you're doing a multipath symbolic execution through a potentially exponential
path space, and even if you do manage to -- so in addition to the path explosion,
the cost of the symbolic execution per path is quite high. And we do it
at the instruction level, so we're talking about some 50X, 60X slowdown in our
unoptimized implementation.
Now, even if you do manage to generate a formula for any given path, you're
talking about gigabyte- or even hundreds-of-gigabytes-sized formulas, okay? And this is
not surprising because you're talking about gigabytes of inputs. Okay? And
even if these formulas weren't that large, you're talking about constraints that are
very hard to invert and solve, for hash functions. So this seems insurmountable.
Is there any kind of hope?
So we make two observations, okay? The first is that if you look at these
applications, most of the unknowns in the data plane come from network and file
data-plane inputs, okay? 99 percent, in fact. Just one percent of the inputs come
from other types of nondeterminism. So I think this comes
back to the earlier question about the communication cost of shared memory: data
races account for a very tiny fraction of the overall inputs. 99 percent of the
inputs come from network and file inputs, basically the datasets that you're
processing with these applications.
Now, the second part of this observation is that the network and file inputs
can be derived from external data -- are derived from external datasets, for
example click logs. And these datasets moreover are persistently stored in
distributed storage, okay? And why are they persistently stored? Well, usually
for fault tolerance purposes. You have these click logs you want to tally.
Sometimes something goes wrong in the tallying process so you need to be able
to restart it and try it again.
And so because these inputs are persistent we have access to them during
replay. And also they're persisted in append-only, read-only distributed storage.
Now, these two observations lead to the idea of using these persistent datasets
to regenerate the original network and file data-plane inputs. And if we can do
that, if we can regenerate these concrete inputs, then we can get rid of 99
percent of the inference task, the inference unknowns. So that's -- that's a key
idea behind ADDA.
So how does this work exactly? So we have a technique, it's called data-plane
regeneration. And the basic idea, you give it access to the persistent datasets
stored in HDFS, for example, and then it will regenerate the original concrete
network and file data-plane inputs without using inference. That's important to
emphasize.
And there are two observations behind this operation. First is that if you look at
these applications, the inputs for any given node are the outputs of upstream
nodes, okay? This is a basic property of distributed applications. Pretty easy to
see. Second property is that if you look at the outputs of any single node, they
are exclusively a function of its inputs and the ordering of operations on those
inputs -- so things like the memory-access interleaving on those inputs.
So, now, if we put these two observations together, the implication is that we can
regenerate the original concrete data-plane inputs simply by replaying the inputs
and the ordering of operations of upstream nodes.
Okay. Let's make this a little bit more concrete and look at this in detail. So
there are two cases. The first case, the easy case, is where you have
boundary nodes, nodes that are directly connected to the persistent storage --
Cosmos, GFS, HDFS, so on, so forth. In this case if you want to regenerate the
data-plane inputs, all you have to do is read the inputs from the file, just open the
file from the persistent store and read it back in again. So this is directly
connected.
Now, the internal case is a little bit more complicated because these nodes aren't
directly connected to the persistent store; they have to communicate through some
other nodes. So for example let's consider what would happen if we need to
recompute -- regenerate the inputs to internal node B. The first observation is
that B's data, the -- B's input is actually C's output. Okay? That's very simple.
And then the second observation is that, okay, C's output is actually a function of its
input, which is one data-plane input, and of the ordering of operations -- here, in this
case, the multiplication, then the addition. And the result is 3, which is B's input.
And hence we have completed the regeneration task in this very simple case.
Now, we can extend this using induction to the rest of the cluster. You can see
that data-plane inputs can be regenerated for all the nodes with these two cases.
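Here is a toy sketch of those two regeneration cases; the store contents, node names, and recorded operation log are made-up stand-ins, not ADDA's implementation.

    # Append-only persistent store (think HDFS/GFS): still available at replay time.
    persistent_store = {"/store/click_logs/part-0": 1}

    def boundary_input(path):
        # Boundary case: a boundary node's data-plane input is simply
        # re-read from the persistent store during replay.
        return persistent_store[path]

    def replay_node_c(recorded_op_order):
        # Internal case: node B's input is node C's output, and C's output is
        # a function of C's input plus the recorded ordering of operations.
        x = boundary_input("/store/click_logs/part-0")
        for op in recorded_op_order:      # e.g. the multiply, then the add
            if op == "mul":
                x = x * 2
            elif op == "add":
                x = x + 1
        return x

    # Replaying C's recorded operation order regenerates B's concrete input
    # (3 in the slide's example) without any inference.
    node_b_input = replay_node_c(["mul", "add"])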
Yes?
>>: [inaudible] computations are purely [inaudible] and if you I/O bound the
storage and you're having to read all the storage back in and you're having to
[inaudible] all the same states and you have a latent bug that [inaudible]
recompute [inaudible] to recomputation? Don't you have to recompute [inaudible]
usually one --
>> Gautam Altekar: You mean to replay and do the analysis?
>>: Often you can -- you accelerate replays and future replay, right? You cut out
-- you [inaudible] but in this system since you're actually bound to the data
[inaudible].
>> Gautam Altekar: Yeah. So in reality, yes, we have to -- because we have to
regenerate the data and the data has to traverse the network links then yes,
you're talking about actually having to wait -- you know, if you have two weeks
worth of computation, then, yes, you will have to wait two weeks. And that is a
drawback of using the system. However, it's offline. So, you
know, if you have a bug that you're having a really hard time with,
we think it's still useful in those cases.
>>: And if the bug was caused by let's say the checksum missed corruption on
the wire, which caused a function to do something incorrect on the data, that
would not be [inaudible] this case, right, because you're not replaying -- you're
not [inaudible] to replay the inputs, you're depending on the inputs that existed on
persistent storage to begin with?
>> Gautam Altekar: Right.
>>: Okay.
>> Gautam Altekar: Yeah, you're depending on the inputs in persistent storage.
So one thing I'll note is that mostly these applications communicate by TCP, so
there's -- and the probability of checksum violations, things like that --
>>: Is actually [inaudible] actually the TCP check [inaudible] scale.
>> Gautam Altekar: Okay.
>>: Having corruption in TCP is not an uncommon [inaudible] having the
corruption you know to be weak, you know. To miss file [inaudible] I'm just
curious. I'm just trying to understand what the scope is.
>> Gautam Altekar: Yeah.
>>: So it's [inaudible].
>> Gautam Altekar: Certainly. There are tradeoffs in actually regenerating the
computation. And if -- yeah, if you have, you know, data that gets
corrupted and the checksum fails, then yes, it becomes a problem.
>>: And you're assuming that your replay cluster has access to the same data
store app or [inaudible].
>> Gautam Altekar: Yes.
>>: [inaudible].
>> Gautam Altekar: It has access to the same distributed data store. That's
hosting these persistent files. That's the assumption. And, yes, all of the stuff
depends on that assumption quite critically. Okay?
Okay. So at this point, we've covered a lot. So let me do a quick recap. ADDA
-- here is ADDA's distributed design again. It records just
the control-plane inputs and outputs. And then it uses the control-plane inputs
and outputs to infer a control-plane deterministic run. And this is done using
program verification techniques and symbolic execution. Once it has the distributed
replay then it performs the analysis on top of the replay. That's the basic design
of the system. Now, how well does this system actually work?
So with this evaluation we have two questions. First is how effective is ADDA as
a debugging platform? What kinds of interesting tools can you write on top of it?
And what is the actual overhead of using ADDA, both in production and offline?
And to evaluate -- to answer both of these questions, we run ADDA on three
real-world applications, the Hypertable distributed key-value store, used by
companies like Baidu and Quantcast. The Cassandra distributed key-value store
used by Facebook and others. And then Memcached, which almost everybody
uses, the distributed object cache.
>>: [inaudible].
>> Gautam Altekar: I/O to disk or the persistent store?
>>: [inaudible].
>> Gautam Altekar: Okay. All of them do I/O to the persistent store actually. So
for Memcached depends on the particular application setup.
>>: [inaudible].
>> Gautam Altekar: Cassandra, for example, you could -- in our experiments we
got the datasets out of the distributed storage. We mounted a --
>>: [inaudible] perhaps that's distributed storage [inaudible] cache. That's what
I'm asking.
>> Gautam Altekar: Yeah.
>>: Are you --
>> Gautam Altekar: Yeah.
>>: You're dealing with -- you're dealing with benchmarks where it could all sit in
memory.
>> Gautam Altekar: Uh-huh.
>>: Okay.
>> Gautam Altekar: Okay. So what about effectiveness? So to gauge this, we
developed three powerful sophisticated debugging plugins for ADDA. The first is
a global invariant checker with which we check many invariants and we actually
found three bugs in research distributed systems.
The second is our most powerful analysis we think. It's a distributed dataflow
analysis. It's written in about 10 lines of Python and tracks the flow of data through
the distributed system, so it lets you trace dataflow. And we've used it to actually
debug a data loss bug in the Hypertable key-value store.
The final tool that we developed was a communication-graph analysis; it makes a
graph of the communication patterns of all nodes in the application, and we've
used it to isolate bottleneck nodes in Hypertable.
Now, I want to focus specifically on our most powerful, most interesting tool in my
opinion, the distributed data-flow analysis plugin we call
DDFLOW. And the basic idea is to dynamically track data through the distributed
system, okay? And so this works in two phases. Okay. We track taint within a
node simply by maintaining a taint map of memory state and then updating it as
the data flows through the machine, and then we track taint across nodes by keeping
track of which messages are tainted by the input file or message.
Now, the key question was how easy is it to develop DDFLOW, this plugin, and
we took about -- it's about 10 lines of Python. I've omitted some initialization
code that bloats the code a little bit. But this is -- this is our most interesting
plugin in the sense it uses almost all of the framework features.
For example, it uses the shared memory analysis model to maintain the set of all
in-transit tainted messages, okay? So that's the message taint set variable that
you see here. It uses the causal consistency properties -- for example, you're ensured
that the on receive callback will be invoked after the on send for any given
message, okay? So causality is very important to track taint. It uses
serializability in the sense that on send and on receive will look as though
they're executed one at a time. And so you don't have to worry about data races
on the message taint set variable. And then finally it leverages the fine-grained
introspection primitives provided by ADDA and in particular the local taint-flow
analysis. Okay?
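Since the slide itself isn't reproduced here, the following is only a rough sketch of what a DDFLOW-style plugin could look like based on that description; the callback signatures and the local taint-query calls (is_tainted, taint_bytes) are assumptions, not ADDA's actual primitives.

    # ids of in-transit messages carrying tainted data (shared-memory illusion)
    message_taint_set = set()

    def on_read(node, fd, data):
        # Taint data as it enters the system from the file of interest.
        if fd.path == "/hdfs/input/table_data":
            node.taint_bytes(data)            # local taint-flow primitive

    def on_send(node, msg):
        # If the bytes being sent are locally tainted, the message is tainted.
        if node.is_tainted(msg.data):
            message_taint_set.add(msg.id)

    def on_receive(node, msg):
        # Causal consistency guarantees the matching on_send already ran,
        # so membership in message_taint_set is meaningful here.
        if msg.id in message_taint_set:
            node.taint_bytes(msg.data)        # propagate taint into the receiver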
Now, that was the effectiveness. Now, how well does this thing actually perform
in production and offline? Now, for production we want to look at the throughput
cost and the storage costs as well, because you can't record a ton of data
on these systems. Storage is important. You can't double the terabytes of data
that you're already using.
And then secondly, we want to look at offline analysis. How long does it take to
run these ADDA analysis plugins? And for this we had a very simple setup on our
local 16-node cluster, to make it
easier to understand what was going on. And then we have basically input
datasets that we varied from one gigabyte to three gigabytes. They were all
stored in persistent storage in our local HDFS cluster.
Okay. So what about in-production overhead? So here you see two graphs, the
first showing the recording slowdown for these applications; the second for the
recording rates for these applications. And the basic take away for the first is
that ADDA achieves an overhead slowdown -- a slowdown of about 1.2X, okay?
That's almost across the board, 1.2X. And then for the recording rate, the
takeaway message is that you have an average recording rate of about 200
gigabytes a day. Now, this is in contrast to the 4.3 terabytes a day
that you would get if you were to record both control and data planes.
ADDA in contrast records just the control plane, and it's able to get away with
much less in terms of recording --
>>: [inaudible].
>> Gautam Altekar: I'm sorry?
>>: Over three -- so you have 200 gigabytes of -- traced it over three gigabytes
of data being manipulated?
>> Gautam Altekar: Uh-huh.
>>: Okay.
>> Gautam Altekar: Yeah. So this is control plane. And it actually should be
much less, but because of implementation issues with our system we record more
than we should. It's a matter of fine-tuning our heuristics, actually.
Okay. Now, the key question is, well, how does this overhead scale as you
increase the dataset sizes, because these are actually small datasets. So the first
graph shows the slowdown, throughput slowdown as we scale the input sizes.
And then the second graph shows the log sizes as we scale the input sizes.
Now, you can see that slowdowns for ADDA stays relatively stable as the input
sizes increase and it's not too surprising because again we intercept and record
just the control-plane data and it's not a whole lot of data.
Now, for the second graph you can see that ADDA -- ADDA's -- it's a little hard to
see, but ADDA's log file size does grow as the input scales, however, this growth
is nowhere near the growth that you see when you record both the control and
data planes.
And the reason is that if you record both, then you have to obviously record all of
the data as well. Okay. So now, that was in production. What about offline
analysis speed?
Now, here are two graphs. In the first one you see the slowdown for analyzing a
one -- a uniprocessor recording, okay? All nodes in the system had -- were
restricted to use just one processor in the first graph. In the second graph you
see the slowdowns where all nodes were using two CPUs. Okay? Now, the
take-away for the first graph is that you get an average slowdown of 50X for the
DDFLOW analysis plugin. And in particular if you look at the breakdown for this
overhead, you will see that most of the overhead in the uniprocessor case
comes from the analysis itself -- distributed data-flow at the instruction level. And
this is not surprising because this requires instruction-level tracing. And so it
makes sense that the analysis dominates the replay cost.
>>: [inaudible] for the symbolic execution?
>> Gautam Altekar: I'm sorry?
>>: The symbolic execution and the taint.
>> Gautam Altekar: So for the uniprocessor case we don't need to do symbolic
execution. We don't need to do inference at all because we were able to
regenerate the data-plane inputs entirely. And as a result, this
overhead is just for the taint-flow analysis.
>>: [inaudible].
>> Gautam Altekar: I'm sorry?
>>: [inaudible].
>> Gautam Altekar: No, we use Valgrind. It's structured along the same design
as David's catch con system. Yeah, it's called libflex, which is a binary-translation
back end that we developed.
And so now if you -- what's interesting here is that the first graph, you know,
contrasts quite clearly with the second one in that the multi-processor slowdown,
most of the slowdown there is attributable to the replay phase, not the analysis
phase, which is a bad thing because ideally the system should just stay out of
your way and let you do expensive analysis, right, or lightweight analysis if you
so desire.
And the overhead there is about 800X slowdown for computing these
multi-processor runs. And the question is why, why does 1-CPU perform so
much better than 2-CPUs?
Now, what you see here is a graph that breaks down the replay time both for the
1-CPU recording case and the 2-CPU recording case. And the breakdown is into
two parts, the amount of time spent in data-plane regeneration and the amount of
time spent in the inference process itself.
Now, the interesting thing here is that in the 1-CPU case you will see that no time is
spent in inference, okay? Again, because data-plane regeneration recomputes
all of the concrete values. All right? So no inference is needed for the 1-CPU
case. For the 2-CPU case, it's a different story. We need inference to compute
the outcomes of data races, okay, because we didn't -- we may not have
captured all of those data races using the CREW interception protocol that we
used.
And unfortunately this inference requires a backtracking search through the path
space, and this is where most of the expense comes from for the two processor
case.
>>: [inaudible] when you're doing the symbolic execution and looking at the
paths, these are paths of multithreaded execution as well?
>> Gautam Altekar: Yeah.
>>: So the number of -- so as the number of -- as the number of threads that you
have to interleave increases, then you're going to get exponential --
>> Gautam Altekar: Yeah, so --
>>: [inaudible] path space.
>> Gautam Altekar: Absolutely. But we have some tricks where we avoid -- we
don't actually try all the different interleavings. In fact, in particular, we consider
racing reads to be inputs, symbolic inputs, and as a result we can eliminate the
need to search through all the different schedules. So it's a relaxation, it's an
optimization that makes things better. But still, you have to consider
multiple paths, symbolically execute along multiple paths.
>>: Multiple paths not of single-threaded execution but multithreaded --
>> Gautam Altekar: Multithreaded executions.
>>: Okay.
>> Gautam Altekar: Okay. So I've told you about ADDA. I've told you about
the basic design, basic evaluation. And I want to emphasize to you at this point
that ADDA is a real system and that it works on real datacenter applications. For
this purpose, I'd like to give you a brief demo, okay? First -- the goals of this
demo are twofold. First I want to show that ADDA can analyze real datacenter
applications, okay? And for this purpose I will replay a recording of Hypertable --
the Hypertable key-value store -- and I will perform some simple analysis on that.
And then for the second part, I want to show you that, beyond just doing
analysis, it's actually useful for debugging, and for that purpose I will use our
distributed data-flow analysis plugin on a Hypertable bug. Okay? To help with
that process.
Now, for the first part, this would be easier if I had two screens, but that's okay. The
first thing I want to show you is that ADDA does simple stuff quite easily. So for
example, you know, you want to look at the aggregate output of all of the nodes
in the system. Well, ADDA can provide that for you. Let me pause this replay for
a second.
What you're seeing here is ADDA's main console interface. The top panel
indicates the replay status of the system. Here you can see that we're seven
percent into the replay execution. This is a very short replay execution I
collected for purposes of the demo. In the bottom you'll see the aggregate TTY
output of all the nodes in the system. Here you can see Hyperspace -- the
Hypertable lock server -- starting up. And then the master -- let me get to -- if
we let it go, okay, at some point you'll see the slaves are starting up, so on and
so forth, they're communicating and dumping information to the log files.
That's a very simple replay mode analysis. Just looking at the output. Very
simple type of thing. And of course you might say, well, gee, an output trace, that's
not too useful for debugging, I need more information.
So ADDA can give you more information. So here I'm going to pause this replay
for a second. What you're seeing here is a distributed instruction trace, an
instruction-by-instruction trace of Hypertable in replay. We currently have it
paused at the startup sequence, where the lock server has just started up and
it's executing code inside the dynamic linker, and then you can just
single-step and let it go.
It will take quite a long time. But again, this demonstrates the type of
heavyweight analysis that you can do offline that would not be possible to do in
production.
Now, of course, you might say, okay, instruction traces, that's too much
information. I want the bigger picture -- I want to see the bigger picture of
my distributed system. And ADDA can give that to you as well. Let me pause
that. What you're seeing here is a bird's-eye view of the Hypertable datacenter
application in replay. Again, you have the top panel, which shows the replay
status, the progress. And then the left-most threads panel shows you all the
active threads in the entire distributed -- in the entire datacenter
application. You can see a bunch of lock server [inaudible] have started up,
including the shell and all this stuff.
At the top right, you will see all the active communication channels between all
the threads. This includes sockets, files, pipes, all that kind of thing. You can
see that, you know, it's loaded a bunch of libraries as you might expect from
startup and started some sockets.
And then below that you see some in-transit -- the in-transit panel which gives
you all the messages that are in transit. Currently we have a TCP message in
transit that has been sent but not yet received. And below that you have a list of
all of the received messages.
Now, an important thing that I should point out here is that the distributed
replay is causally consistent, okay? What that means is that you're not going
to see a message in the received panel until after it has shown up in the
in-transit panel. And I wish I had a bigger screen to show you that, but
basically you should see mash at 35 at the end -- okay, scroll past it. But
anyway, basically mash at 35 shows up only after it's shown up in the
in-transit panel. It's causally consistent replay.
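(A toy illustration of the causal-consistency property being described -- a
message may be marked received only after its send has been replayed. This is
our own sketch, not ADDA's console code; the event format is assumed.)

    # Check that replayed events never show a receive before its send.
    def check_causal_order(events):
        """events: list of (kind, msg_id) where kind is 'send' or 'recv'."""
        sent = set()
        for kind, msg_id in events:
            if kind == "send":
                sent.add(msg_id)
            elif kind == "recv" and msg_id not in sent:
                raise AssertionError(
                    f"message {msg_id} received before it was sent")
        return True

    # A replay order with the send first passes; the reverse would be flagged.
    check_causal_order([("send", 35), ("recv", 35)])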
Okay. So that was part one of the demo. Okay.
In part two of the demo I'm going to show you that, going beyond these kinds of
analysis tricks, the system is actually useful for debugging. So let's consider
a data-loss bug in the Hypertable distributed key-value store. In this
particular bug, what's happening is that you insert some data into the table,
okay, some real data, but then when you try to do a lookup later you can't find
the data. The data is gone and you want to know what happened to it, okay?
So I'll tell you up front that the data loss is caused by a race in the
table-migration code within Hypertable. Hypertable splits large tables once
they get to a certain size. And so what's happening is that when you inserted
the data, a split occurred concurrently, and as a result the inserted data went
into the wrong shard, essentially.
Now, the question is, okay, can we use ADDA's distributed data-flow analysis to
actually figure out what happened to our data? Did it go to the node that we
expected?
Now, from inspecting the logs we know that the lookups go to slave node 1. The
question is, does the data go there as well? So we can use the distributed
data-flow analysis to figure out where this data went.
I've started up a session already for you, [inaudible] let me let it go again.
And what you're seeing is the set of all functions that are operating on the
data that you inserted into the distributed system. And now the data has
flowed -- node 2 is the slave, so basically the data has gone there. And then
the slave responds to the client. Okay. So let me just pause this.
So what's happening here is that you have the data, it's being inserted through
the Hypertable client -- you can see it's loading the data source and then, you
know, doing some processing on it. And at some point it reaches node 2. The
data gets sent up to node 2, which is the Hypertable slave; the slave inserts
it into a red-black tree -- you get the unresolved symbol there because not all
symbols are available -- and then it replies. So the take-away message from
this trace is that the data goes to node 2, not node 1. Now, to do a similar
type of thing manually would take a lot of effort. You would have to instrument
your program to track this particular lost data. And that's assuming that you
could actually reproduce the data race offline. So this is just an example of
the power of offline analysis that ADDA provides.
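(As a rough sketch of the kind of dynamic taint tracking the distributed
data-flow plugin performs: the function names, node names, and data labels
below are hypothetical, not ADDA's API. The point is only that taint from the
inserted row is carried across the message boundary to node 2 and never reaches
node 1's lookup path.)

    # Mark the inserted row as the taint source, then propagate taint through
    # every replayed function, on every node, that reads a tainted value.
    tainted = {"row:user42"}          # taint source: the data we inserted
    trace = []                        # (node, function) pairs that touched it

    def record(node, func, inputs, outputs):
        """Called for each replayed function; propagates taint input -> output."""
        if tainted & set(inputs):
            trace.append((node, func))
            tainted.update(outputs)   # outputs derived from tainted inputs

    # Hypothetical replayed events for the data-loss bug:
    record("client", "load_data_source", ["row:user42"], ["buf:1"])
    record("client", "send_update",      ["buf:1"],      ["msg:7"])
    record("slave2", "handle_update",    ["msg:7"],      ["cell:9"])
    record("slave2", "rbtree_insert",    ["cell:9"],     [])
    record("slave1", "handle_lookup",    ["key:other"],  [])   # untainted

    print(trace)   # shows the data ended up on slave2, not slave1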
>>: [inaudible].
>> Gautam Altekar: What's that? Oh, because we know from the logs -- we did a
lookup, we looked inside the logs, and we saw where all the lookups were
happening. If we knew where the data was inserted we wouldn't have to do all of
this. We knew where the lookup went because it's in the log file. I didn't show
that part. But --
>>: So is this trace filtered based on this one action, [inaudible] other
client request processing?
>> Gautam Altekar: Yes. Well, it's only showing you the functions that process
the data that you're interested in. So all of the other client request
processing is not shown here. And that's one of the key advantages of the
taint-flow analysis: it narrows things down to just the things that you want to
see.
>>: [inaudible] set up, say, this was the taint source -- how did you do that?
Was that in the Python
>> Gautam Altekar: Yeah.
>>: Code?
>> Gautam Altekar: I'll show you. Okay. Ignore that. So basically we have a
command-line interface. You can see here -- if you could see the cursor --
>>: [inaudible] the file that represents the source of the taint?
>> Gautam Altekar: Yes.
>>: Oh, okay.
>> Gautam Altekar: And basically you can specify a file, storage and -- or you
can --
>>: And whoever reads that will [inaudible].
>> Gautam Altekar: Yeah. That's one option. Another option is you can actually
explicitly mark memory regions at a certain time.
>>: Sure.
>> Gautam Altekar: Which is a bit more complicated, which is why I didn't do it
that way. So --
>>: So if there are lots of other database tables or tables that are being read
as a matter of course, those won't be tainted?
>> Gautam Altekar: Yeah. Exactly. Okay. So that was the demo of the system.
Now, the key question is where is all of this going. Okay.
So in the short term we want to address the limitations of ADDA. And there are
many of them. [inaudible] pointed out a key limitation: the fact that replay of
multiprocessor runs is very slow, and perhaps impractically so.
And of course the key challenge here is the inference: you have an
exponential-sized path space that you have to search, and then you have the
cost of symbolic execution. So there are many things that we haven't tried yet.
A simple thing you could do is actually record the path in the original
execution. That has a certain cost. So maybe we can record selectively, you
know, path samples, branch samples, and maybe that can narrow down the search
space of paths.
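(A small sketch of that selective-branch-sampling idea, which, as noted, has
not been tried yet; the program counters and sampling policy are invented for
illustration.)

    # During the original run, record the outcome of a sample of branches;
    # during inference, discard any candidate path that disagrees with a sample.
    recorded_samples = {      # branch program counter -> taken?
        0x4005a0: True,
        0x4007f4: False,
    }

    def consistent(candidate_path):
        """candidate_path: list of (pc, taken) pairs explored symbolically."""
        return all(recorded_samples.get(pc, taken) == taken
                   for pc, taken in candidate_path)

    paths = [
        [(0x4005a0, True),  (0x4007f4, False), (0x400810, True)],   # kept
        [(0x4005a0, False), (0x4007f4, False), (0x400810, True)],   # pruned
    ]
    survivors = [p for p in paths if consistent(p)]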
Another option is to use annotations to summarize loops. The key observation
here is that we can limit the annotation burden by annotating just the
data-plane code, which is the part that we don't record. And that data-plane
code is just one percent of the overall application code.
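(An illustrative sketch of what a loop-summary annotation on data-plane code
might look like -- this is a hypothetical interface, not an existing ADDA
feature; the decorator name and the form of the summary are assumptions.)

    # Instead of letting the symbolic executor unroll a data-plane copy loop
    # over a symbolic buffer, the developer states the loop's net effect once.
    def summarize(effect):
        def wrap(fn):
            fn.loop_summary = effect   # consulted by a (hypothetical) engine
            return fn
        return wrap

    @summarize(lambda src, dst, n: dst[:n] == src[:n])   # "dst equals src"
    def copy_chunk(src, dst, n):
        for i in range(n):             # concrete, data-plane copy loop
            dst[i] = src[i]

    buf_src, buf_dst = [1, 2, 3, 4], [0, 0, 0, 0]
    copy_chunk(buf_src, buf_dst, 4)
    assert copy_chunk.loop_summary(buf_src, buf_dst, 4)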
Now, the second challenge of course is large formulas -- you have hundreds of
gigabytes of formulas. One thing that we've observed is that you can actually
split these formulas up into smaller chunks, because these datacenter
computations operate on chunks. And if that's the case, then you can solve just
those parts of the formula that your analysis is interested in, and no more.
Yes?
>>: [inaudible] or anything like that [inaudible].
>> Gautam Altekar: No, not at this point. We have a simple -- I think I used
the same disjoint-set computation that you use in catch con to separate out the
formulas into different groups and then just plug it in.
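(A minimal sketch of that disjoint-set idea: group constraints that share
variables using union-find and solve each group independently. The constraint
and variable names are made up; this is not the actual implementation.)

    # Union-find over variables; constraints sharing a variable share a group.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    constraints = [
        ("c1", ["a", "b"]),
        ("c2", ["b", "c"]),
        ("c3", ["x", "y"]),     # independent chunk
    ]
    for _, vars_ in constraints:
        for v in vars_[1:]:
            union(vars_[0], v)

    groups = {}
    for name, vars_ in constraints:
        groups.setdefault(find(vars_[0]), []).append(name)
    print(list(groups.values()))   # [['c1', 'c2'], ['c3']] -- solve separately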
So stepping back at a higher level, where is all of this research going? What is
the ultimate aim? Well, the holy grail again is the fully-automated debugger. It's
a very hard thing to do. ADDA only provides semi-automated bug isolation. It
doesn't even try to do automated bug fixing.
So, you know, there are many opportunities in terms of synergy with existing
automated debugging techniques. For example, you have delta debugging,
statistical bug isolation, invariant inference. All of these techniques can be
combined with the power of offline analysis that ADDA provides to make them
even better, to narrow down the root causes even further. So all these things
could make interesting ADDA tools.
And the next step of course is automated bug fixing, which is very hard. One
observation made in program-synthesis techniques is that if you have a program
specification, you can generate a concrete implementation from it. Of course, a
key challenge here is that you assume access to a specification. But perhaps
ADDA can help with that by inferring distributed invariants. So that would be
an interesting direction to pursue.
So with that I'll conclude with the key take-away point: if you combine the
notion of control-plane determinism -- recording just the control-plane I/O --
with inference and some cool techniques like data-plane regeneration, then you
can get record-efficient datacenter replay. It is possible. And that in turn
enables us to do fairly powerful offline distributed analysis.
Now, of course, there are many things to address in the future. One of them is
the multiprocessor replay issue, which is very hard. But we'll continue to bang
away at that.
And at this point I'll stop and take any questions you may have.
[applause].
>> Gautam Altekar: Okay?
>>: So suppose you just want to [inaudible] multiprocessor replay time and just
do EC2 instances that are all micro instances, all single-core. What kind of
hit would you take on your Hadoop throughput versus having [inaudible]?
>> Gautam Altekar: Well, that simplifies the problem enormously. If you look at
commercial replay systems like VMware's, if you give them very lightly
I/O-intensive tasks they get really good performance -- the overhead is under
one percent in terms of throughput. And those are the numbers that are
published. And I've seen the system -- I interned there, so I know what the
capabilities of uniprocessor replay recording efficiency are.
So in that case, I think it can be quite good. Of course the key challenge is
the multiprocessor case, which makes things a lot more complicated.
>>: [inaudible] I guess I'd just ask, if you want this and you just wanted
Hadoop throughput, why don't you just get 10X more micro instances on the
[inaudible]?
>> Gautam Altekar: That's an interesting usage model, I think, for this type
of -- for datacenter applications.
>>: [inaudible].
>> Gautam Altekar: Yeah. I actually don't know the answer to that question. I
don't -- I'm not the one operating the systems in production, but --
>>: [inaudible] Cassandra [inaudible] if you could just duck that, you could
get this really nice capability.
>> Gautam Altekar: If you separate the tasks into distinct components that
don't share much, then you avoid a lot of these problems.
>>: Do these systems, you know, have any invariant [inaudible] just data
auditing? Distributed systems in the old world [inaudible] they had something
called an audit subsystem [inaudible] distributed system, and each node had a
little process that audited its data structures and basically, you know, did
some low-overhead communication with other parts of the switch to sort of make
sure things were consistent and throw alarms and do all that. So my
understanding is, you know, you got six nines of reliability if this audit
subsystem was on and you got three nines if you turned it off, because it also
had some repair action -- that is, integral to the system was an auditing and
repair subsystem. And I just sort of wonder, in the architecture of the thing
that you're looking at, how much is built in [inaudible].
>> Gautam Altekar: My impression is not a whole lot. Like, a pretty typical
thing is just to have checksums, checksum verification. But beyond that, these
systems rely on recovering by using the persistent datasets. So I've talked to
folks at Google, you know, that work on the ads back-end teams, and they have
to process terabytes of click logs. A lot of times things go wrong and, you
know, it's usually a bug. But they don't have checks for all of that because
they don't want to --
>>: The difference with telephony is that the data is so huge and throughput is
so essential -- compared to, like, telephony, where you have a really strict
data/control separation -- that they can't tolerate doing these runtime checks
and this sort of auditing architecture; they can't tolerate the overhead of
extra runtime checks.
>> Gautam Altekar: Well, you just -- so --
>>: Huh?
>>: [inaudible] penalty for getting it wrong is high.
>>: [inaudible].
>>: And besides, [inaudible] the switch had to be reliable because of the old
days, right? [laughter]
>>: Just click refresh.
>>: And also AT&T was a regulated monopoly, so there's a whole other reason
that it was overengineered. But yeah, okay, so your point is there's not much
pseudo -- what we call ship asserts? There's not a lot of -- like, so --
>>: [inaudible].
>>: When I say a ship assert, it just means I have assertions in my code that
are -- that are in my live code.
>> Gautam Altekar: Yeah. Not to my knowledge. I mean, I think there's very
light logging that's being done. But beyond that, I don't think -- they might
leave some assertions on, but again, you know, they'll turn off anything that
affects the aggregate throughput of the cluster, so --
>>: One of the potential performance enhancements you mentioned was to add
annotations to the data-plane code, and you have techniques that separate out
data-plane traffic from control-plane traffic. Is it straightforward to then
work back from those techniques to figure out which code is data-plane and thus
where you would have to put the annotations?
>> Gautam Altekar: Yeah. So that's an interesting point. We did a study in
which we basically used our taint-flow analysis to track these datasets and
then figured out what code was tainted by these datasets -- that's where we got
the one percent number from. And it also narrowed down the locations that we
had to actually look at, you know. So the developer doesn't have to manually go
through the code and say, okay, this is data-plane, this is control-plane. We
can actually provide that to you automatically using the dynamic analysis.
An interesting question is, you know, whether you can provide this statically,
maybe with some interesting static techniques. That way you aren't limited to
just one execution or a set of executions.
>> Andrew Baumann: Thank you again.
>> Gautam Altekar: Thank you.
[applause]