>> Andrew Baumann: Okay. So thank you all for coming. I'm Andrew Baumann, and it's my pleasure to introduce Gautam Altekar from UC Berkeley. You might remember some of Gautam's work. For example, at the last SOSP there was a paper about output-deterministic replay. More recently he's been taking some of those ideas and extending them to replay debugging of distributed systems and datacenter applications. And I'm very excited to hear about that work and what he has to say. >> Gautam Altekar: Thank you, Andrew, for that introduction, and thank you all for coming to my talk. My name is Gautam, and I'd like to talk to you about debugging datacenter applications. Now, only my name is on the slide, but there are actually several collaborators, including Dennis Geels, who is now at Google, my colleague, my advisor, and several folks from EPFL. Okay. So we all know that debugging is hard. Debugging datacenter software is really hard. Datacenter software -- what am I talking about? I'm talking about large-scale, data-intensive distributed applications: things like Hadoop MapReduce; Cassandra and Hypertable, which are distributed key-value stores; and Memcached, the distributed object caching system. Now, why is it hard to debug these applications? There are three main reasons. Foremost is nondeterminism. You develop these applications, you deploy them to production, something goes wrong, and then you try to reproduce the failure in a development environment and you can't. And because you can't reproduce it, you can't employ traditional cyclic debugging techniques to repeatedly reexecute and home in on the root cause. The second thing that makes it difficult is that these large-scale systems are prone to distributed bugs -- bugs that span multiple nodes in the system. And when you're talking about thousands of nodes, you're talking about really complex interactions that are very difficult to get your mind around. And the final thing that really complicates matters is the need for performance. These applications are part of 24/7 services. You can't just stop them to do your debugging, and you can't use heavyweight instrumentation to collect debug logs for later debugging. So in this talk, I want to focus on how we can make it easier to debug these datacenter applications despite these challenges. Now, the holy grail of debugging, in our view, is a system that automatically isolates the root cause of the failure and automatically fixes the defect underlying the root cause. Okay? At this point you might be thinking: why automated debugging? Why not use static analysis, testing, model checking, simulation -- find the errors before they manifest in production? Well, as good as these tools are, they often miss errors, particularly when you're talking about large state spaces. Tools like static analysis can be conservative: they'll report all sorts of errors, but the developer may not want to wade through those error reports. And eventually errors will manifest in production, and at that point it would be nice to have an automated debugger to debug those failures for you. Now, as you might imagine, building a fully automated debugger is very difficult, okay? Building one for datacenter applications is even more difficult.
So the goal of this project is much more modest: we want to build a semi-automated bug isolation system. Semi-automated -- what do I mean by that? I mean that we still want developers to be engaged in the bug isolation process, so there's still going to be some manual effort involved. But at the same time we want to go beyond printfs and GDB to reduce the overall amount of work the developer has to do. They shouldn't have to do laborious, tedious work. Now, I should mention that there are certain things we aren't trying to do. In particular, we aren't trying to automatically fix the bug with this system; that's deferred to future work, and I'll talk a little bit about that later on. Okay. So I've told you what we're trying to do, what the problem is: semi-automated bug isolation. What exactly is it that we've done? What is our contribution? The key contribution is a system and framework that we call ADDA. It stands for Automated Debugging for Datacenter Applications. At a high level, ADDA is a framework for analyzing datacenter executions. Okay? It offers two things for this purpose. It offers a powerful analysis plugin architecture with which you can construct sophisticated distributed analysis tools -- things like global invariant checking, distributed dataflow, and communication graph analysis. And beyond being powerful, it offers a very simple programming model, so you can write sophisticated new plugins fairly easily. These plugins are written in the Python programming language; we chose Python for its rapid prototyping capabilities. And on top of Python we also provide facilities for inspecting and reasoning about distributed state. That's key: we want to make it easy to reason about distributed state. Okay. So plugins and analysis -- what am I talking about? Let's take a look at an example. This is an honest-to-goodness ADDA plugin, and it's very simple. It performs a distributed check -- a global invariant check -- and basically verifies that no messages were lost during an execution. Okay. So it's a distributed check, it's very simple, and fundamentally it's written in Python and it employs a callback model: whenever a message is sent in the distributed system, on-send gets invoked; whenever a message is received, on-receive gets invoked. On sends and receives you maintain a set of in-transit messages, represented by the in-transit Python variable. At the end of the execution we look at that set and say: okay, any messages that are still in transit, that were never delivered, were lost.
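A minimal sketch of what the lost-message plugin just described might look like, assuming hypothetical callback and field names (on_send, on_receive, on_end, msg.id) rather than ADDA's actual plugin API:

```python
# Sketch of the no-messages-lost invariant check described in the talk.
# The callback names and message fields are assumptions; the real ADDA
# plugin interface may differ.

in_transit = set()     # appears as ordinary shared Python state thanks to
                       # the framework's illusion of shared memory

def on_send(msg):
    # Invoked whenever any node in the distributed system sends a message.
    in_transit.add(msg.id)

def on_receive(msg):
    # Invoked whenever any node receives a message.
    in_transit.discard(msg.id)

def on_end():
    # At the end of the replayed execution, anything still in the set was
    # sent but never delivered -- i.e., a lost message.
    if in_transit:
        print("lost messages:", sorted(in_transit))
```

Because the framework lets plugins keep global state in plain Python variables and executes the callbacks as if one at a time -- two of the features described next -- no explicit locking or message passing is needed here.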
Question? Yes. >>: What [inaudible]? >> Gautam Altekar: So this is at the application messaging layer -- at the system call level. >>: [inaudible] you don't have reliable transport? >> Gautam Altekar: Yeah. So, for example, if you use UDP, this would be a concern. With TCP it obviously might not be a problem. Okay. So that's a very concrete example of what ADDA offers. But beyond that, there are really four key features offered by the analysis framework's programming model. First is the illusion of shared memory: you don't need to explicitly send messages around to maintain global state; you can store it in ordinary Python variables. You get serializability: the callbacks appear to execute one at a time, so you don't need to worry about concurrency issues or data races when you're developing these plugins. A third, really powerful feature is a causally consistent view of distributed state. What that basically means is that whenever you look at distributed state, you'll never see a state in which a message has been received but not yet sent. And this is a very important property for reasoning about causality in the distributed system. And finally, ADDA provides fine-grained primitives for performing deep inspection of your execution -- things like instruction-level analysis, taint-flow analysis, and instruction tracing. All of that comes out of the box with ADDA. So at this point you're wondering: okay, that's great, a powerful analysis framework, I'd like to use that -- but what's the problem? What is the challenge behind developing something like this? The main challenge is that there is a conflict between providing these powerful analysis tools and achieving good in-production performance, okay? Things like the illusion of shared memory, serializability, causal consistency -- all of these are very expensive to provide in an in-production execution. Causal consistency, for example, requires distributed snapshots -- you're talking about Chandy-Lamport, heavyweight techniques. And fine-grained introspection -- taint-flow analysis, instruction-level analysis -- means binary translation. You can't do this kind of stuff on an in-production execution without major slowdowns. Okay. So I've told you about the problem: there is a conflict between providing powerful analysis and achieving good in-production performance. How do we resolve this conflict? That's the key question. The key observation we make is that for debugging purposes, we don't need to do these analyses in production; it suffices to do them offline. And this observation motivates ADDA's approach of using deterministic-replay technology to perform the analysis on an offline replay execution. In effect, we shift the overheads of analysis from the in-production execution to an offline replay execution. This works in two phases: you run your datacenter application on the production cluster with some light tracing, and then later, when you want to perform an analysis, you do a replay on a development cluster and turn on the heavyweight analysis that you've written using ADDA. Question? Yes? >>: [inaudible]. >> Gautam Altekar: Absolutely not. ADDA supports a partial-analysis mode in which, if you have a thousand nodes, you can actually try to replay on one node. It's going to be really slow, of course, but it's possible. Okay. So deterministic replay is the key technology that we leverage to perform the analysis offline. But you might say, well, there's a problem here: these datacenter applications are nondeterministic. How do you make sure you can deterministically replay them, since when you run them again they might not do the same thing? To address this problem, we employ what is known as record-replay technology. Okay? It's a kind of technology for providing deterministic replay, and it works in two phases. In the first phase you record your application's nondeterministic data -- things like inputs, incoming messages, thread interleavings -- to a log file. And this is done in production, on your production datacenter application.
All this information goes to a log file, and at a later time, when you want to do the analysis, you start up your replay session. This entails rerunning your application, but this time using the information in the log file to drive the replay. Okay? Now, at this point you're probably wondering: why use record-replay? Why not employ some of the more recent deterministic execution techniques, things like Kendo and dOS and CoreDet? The short answer is that these techniques are complementary to record-replay. The goal of deterministic execution is to minimize the amount of nondeterminism in these applications, and that's good for us, because the less nondeterminism there is, the less we have to record. At the same time, it's important to note that nondeterminism can't be completely eliminated. There's nondeterminism inherent in the environment -- in the network, in the routers, DNS messages, for example -- that still has to be recorded even with deterministic execution tools. So we think record-replay and deterministic execution techniques are complementary, can coexist, and have much to learn from each other. Okay. So we want to do record-replay. What exactly are we looking for in a datacenter replay system? There are three things. Foremost, we want to be able to record efficiently -- good enough for production use, which demands at most a 10 percent slowdown and about 100 KBps per-node logging rates. The second requirement is that we'd like to be able to replay the entire cluster if desired. The reason is that we don't know on which nodes a bug -- a distributed bug in particular -- is going to manifest. It could be one node, it could be two nodes, it could be the whole cluster, so we would like to be able to replay all of it. Questions? >>: Is the [inaudible] slowdown just when you decide to do some recording for replay, or is this slowdown -- >> Gautam Altekar: Yeah, this is record-mode slowdown. We're willing to tolerate much more slowdown for replay because it's offline, but for recording it should be at most that much. I realize it's kind of on the high end, but I think it's at least good enough for part-time recording. >>: [inaudible] 10 percent slowdown, or is it just like you decide there's some issue you want to try to record, then you could -- >> Gautam Altekar: Yeah. So, you know, if you notice some problems in production you can just turn it on for a while, you'll take a 10 percent hit for that time, then you can turn it off and replay it. So that's one usage model for the system. Question? >>: 10 percent on aggregate throughput or on 99th [inaudible] latency? What do you mean -- >> Gautam Altekar: Aggregate throughput. Throughput, not latency. >>: [inaudible] latency [inaudible]. >> Gautam Altekar: It might. Well, no, in practice the latency is actually pretty good too, so I don't think we make any sacrifices in latency either. But our focus has been on getting good throughput, because that's what these distributed datacenter applications care about rather than latency -- or at least the apps that we consider. Question? >>: Just to follow up on Rich's question, if you did turn on recording for just a period of time, do you have to take a checkpoint [inaudible] at that period? >> Gautam Altekar: Yes, you do. You would have to do that.
Or you would have to restart your services on the nodes you're considering. >>: On the definition of deterministic replay -- you're not recording every thread interleaving? >> Gautam Altekar: Absolutely. I'm going to get to that. >>: [inaudible]. >> Gautam Altekar: I'm going to get to that in just a few slides. Any other questions? Question? >>: [inaudible] sort of wonder about your assumptions about [inaudible]. Have there been studies done of these [inaudible] and, you know, what sorts of techniques would capture them? I mean, at Microsoft, going back to the static analysis, whenever we get security alerts or things like that, we get these bugs from the [inaudible] and we look and see whether there was a static analysis [inaudible] that could have caught [inaudible]. So I wonder, do you have data about the kinds of bugs that -- you know, you made the case that regardless, bugs will always get into production. But that said, it would be nice to know how many of those bugs that made it to production could have been [inaudible]. Have there been studies [inaudible]? >> Gautam Altekar: Yeah, I agree that would be nice, but I don't think there's any comprehensive study of that kind, and we haven't done one. Most of this work has been motivated by our own prior experience building large-scale distributed applications in house in the RAD Lab at Berkeley. We've encountered many issues there, but I don't think we've ever documented them in a formal published paper. So people generally accept that this is an issue, but we don't really have hard numbers. And I agree that -- >>: [inaudible] there are lots of war stories. There's very little hard data about how much people time it costs, how much machine time it costs. [inaudible] everybody sort of accepts, I guess, that it's a problem, but we don't really know how big a problem it is in terms of cost. We also [inaudible] look at any classifications [inaudible]. >> Gautam Altekar: Yeah. Well, I think that kind of data would be very interesting, especially coming from industry -- I think that would have more clout. Our experience has been with a few grad students, and it's not clear how much you can conclude from that kind of system and manpower. So I think it would be a great project for industry to do. >>: Like this thesis, this idea that the [inaudible] -- I believe you observe this, like -- >> Gautam Altekar: Oh, yeah. >>: Like most of the [inaudible] 50 percent -- >> Gautam Altekar: Well, among the hardest ones, qualitatively speaking. But again, I can't really say that 15 percent of all bugs are like this. Something for future work. Okay. So those were the requirements. Now, if you look at the related work in this area, in replay systems, you'll see that there are many of them -- many systems -- but none of them is quite suitable for the datacenter. In particular, they're deficient in one requirement or another. Systems like VMware's deterministic replay system and Microsoft's several replay systems provide -- or can be made to provide -- whole-cluster replay, and they have wide applicability. However, in the datacenter context -- in the context of processing terabytes [inaudible] computations -- these systems don't record efficiently. You can't use them in production in the datacenter.
And then you have systems like IBM's DejaVu, an older system that records efficiently and has wide applicability, but it makes certain assumptions that don't hold in the datacenter, so you may not be able to provide whole-cluster replay. And finally you have Microsoft's R2 system, which records efficiently and can provide distributed-system replay; however, you have to retrofit your application using its annotation framework, and that can be a significant annotation burden, so it doesn't have quite the wide applicability that we were looking for. Okay. So what have we done? Well, we've built a datacenter replay system, DCR, that meets all three of these requirements. It records efficiently -- not quite 10 percent, I'll say that up front; it's about 40 percent right now, and we're working on getting that down. It provides whole-cluster replay. And it has wide applicability, in the sense that you can record and replay arbitrary Linux x86 applications, particularly in the EC2 environment. Now, ADDA, our analysis framework, uses DCR to do the offline analysis, okay? And again, DCR is designed for large-scale, data-intensive computations -- applications like Hadoop MapReduce, Cassandra, Memcached, Hypertable, and so on. Okay. So deterministic replay is the key, and we've built a system that meets all three of these requirements. But how do we do it? What is the key? Well, the key intuition behind this replay system is that for debugging purposes, we don't need to reproduce a run identical to the original that we saw in production. It often suffices to produce some run that exhibits the original run's control-plane behavior, okay? Control-plane behavior -- what am I talking about? If you look at most datacenter applications, they can be separated into two components: control-plane code and data-plane code. The control plane of the application is the administrative brains of the system; it manages the flow of data through the distributed application. It tends to be complicated -- it does things like distributed data placement and replica consistency -- and it accounts for 99 percent of the code. But at the same time, it accounts for just one percent of the system's aggregate traffic. This is in contrast to the data-plane code, which is the workhorse of the system: it processes the data. Things like checksum verification and string matching -- it goes through every byte, computes the checksums, that kind of thing. Now, it turns out data-plane code is actually very simple. It accounts for about one percent of the code, a lot of it coming from libraries, but at the same time it accounts for 99 percent of all the data traffic generated in the system. Now, the key observation here is that most failures in these datacenter applications stem from the control plane. And this is backed up empirically in our HotDep paper; you can take a look at that for some numbers. So what does this observation do for us? Well, now we can relax the determinism guarantees offered by the system to what we call control-plane determinism. And if we shoot for control-plane determinism, we can altogether avoid the need to record the data plane -- the most data-intensive, traffic-intensive component. And as a result we can meet all the requirements. We can record efficiently: recording the control plane is very cheap, it's just one percent of all the traffic.
We can provide whole-system replay, because now we can record all of the nodes, all of the control planes. And we can provide wide applicability, because we don't need to resort to any kind of special-purpose hardware or languages in order to record efficiently. So control-plane determinism is the key to meeting all three of these requirements. Okay. So if control-plane determinism is the key, how do we turn it into a concrete system design? What you see here is a distributed design, and it operates in two phases: a record phase and a replay-debug phase. In the first phase, ADDA's replay system, DCR, records each node's execution and logs it to a distributed file system -- we use the Hadoop Distributed File System in this case. Now, you'll notice that each node doesn't record all sources of nondeterminism; each node records just the control-plane inputs and outputs. Again, we're shooting for control-plane determinism, so we just need to record control-plane inputs and outputs. Then, during replay, the replay system reads the control-plane inputs and outputs and starts up a replay session, and on top of this distributed replay session we run the distributed analysis plugins. So that's the high-level view of ADDA's architecture. Got a question? >>: Yes. So when you talk about the control plane, more specifically do you mean that you need to record the headers of messages but not the bodies? >> Gautam Altekar: Yeah, that's one way to think about it. You have the metadata describing the data -- that's in the headers of messages. In other cases, components of the datacenter application are exclusively control plane. So if you think about, say, MapReduce: you have a job tracker, which is exclusively a control-plane component -- it's responsible for maintaining the mappings of jobs, which is an exclusively control-plane activity -- and so you could record all of its channels. >>: So how do you automatically, or maybe not, determine, you know, which data is control [inaudible]? >> Gautam Altekar: Okay. >>: [inaudible] so there's this long history of work in the [inaudible] scientific community [inaudible] programs -- how is this different from the 20, 30 years of predecessors? >> Gautam Altekar: Well, there are several challenges. The sheer volume of the data that has to be recorded is a major challenge. Scientific computing is mostly compute-heavy -- well, I don't know what applications you have in mind, but it's not that scale of data processing. >>: [inaudible]. >> Gautam Altekar: Well, okay. So in scientific -- >>: Look at the very large scale [inaudible]; they are running very large, tens of thousands of processors in that space. >> Gautam Altekar: Okay. But do they have stringent in-production overhead requirements? >>: Yes. >> Gautam Altekar: So they demand 10 percent or otherwise it doesn't -- >>: [inaudible]. >> Gautam Altekar: Okay. Well, another challenge then is shared-memory multiprocessors. Now you have concurrency, and how do you provide replay for a concurrent execution? >>: All of that is hard to [inaudible]. >> Gautam Altekar: Certainly there are many techniques, but the overheads of those techniques are quite high if you look at them carefully. One of the earlier systems was the Instant Replay system, which proposed a CREW model of recording shared-memory accesses. It's very expensive -- the overheads are too high for in-production use.
I mean, I don't think you could use those systems in a production datacenter. >>: [inaudible] message-based, message-passing systems and you can [inaudible] determinism [inaudible]. >>: [inaudible]. >>: I think it's also the case that in the HPC [inaudible] typically you're operating on a [inaudible] inputs and outputs [inaudible] do things like [inaudible]. >>: They do do deterministic checkpoints, but the replay systems [inaudible] checkpoint [inaudible]. >>: [inaudible] is the nondeterminism strictly from the messages, or is there also shared memory -- >> Gautam Altekar: There's shared memory as well. >>: Okay. How does the [inaudible] the log of the shared-memory nondeterminism compare to the volume [inaudible]? >> Gautam Altekar: So it depends, obviously, on the application and the amount of sharing. I'm going to get to that in just a few slides, if you don't mind. Are there any other questions? Okay. So when you look at this design, two questions come to mind. First, how do you identify the control plane? We're talking about recording control-plane I/O -- how do you know what that is? And the second question is: okay, you record just the control-plane I/O, but you don't know what the data-plane inputs are. If you don't have all the inputs to the program, how is it that you provide replay? So let's start with the first question: how do we identify control-plane I/O? We have an essentially automated identification technique -- and I stress that it's a heuristic; it's not perfect. This heuristic is based on the observation that control-plane channels operate at low data rates, okay? They carry one percent of all traffic in the datacenter application. This leads to an automated classification technique with two phases. First we interpose on nondeterministic channels, and then we classify these channels as control or data plane by looking at their data rates. Specifically, we interpose on network and file channels using system calls -- sys_send, sys_recv, at the kernel level -- and we interpose on a particular type of shared-memory channel, namely data-race channels, using a page-based concurrent-read/exclusive-write (CREW) memory-sharing protocol. Now, this is a conservative protocol: it doesn't detect data races per se, but it will detect conflicting accesses to pages, so if there is a data race, it is intercepted and handled. In the second phase, the key challenge in classification is that you have bursts on any kind of communication channel, even in the control plane. So it's not a flat data rate, and you can't just use a simple threshold. To deal with that, we employ a token-bucket filter. We found that a rate of about 100 KBps with bursts of 1,000 bytes suffices, conservatively, to capture most control-plane I/O. And similarly for shared-memory communication: a fault rate of 200 faults per second, with bursts of 1,000 faults, is a pretty conservative bound for capturing control-plane I/O. Now, if all of this fails, keep in mind that you can always go in and annotate specific components -- say, okay, this is the control plane, I know this; the Hadoop job tracker is a control-plane component -- so don't worry about automatically trying to figure out whether it's control plane or not.
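A sketch of the token-bucket test behind this classification, using the roughly 100 KBps rate and 1,000-byte burst figures quoted above; the class and parameter names are illustrative rather than DCR's actual kernel-level implementation:

```python
class TokenBucket:
    """Token-bucket budget for one channel: control-plane traffic should fit
    within ~100 KBps with 1,000-byte bursts (figures quoted in the talk)."""
    def __init__(self, rate_bytes_per_sec=100 * 1024, burst_bytes=1000):
        self.rate, self.burst = rate_bytes_per_sec, burst_bytes
        self.tokens, self.last = float(burst_bytes), None

    def conforms(self, timestamp, nbytes):
        # Refill tokens for the time elapsed since the last transfer, then
        # check whether this transfer fits within the control-plane budget.
        if self.last is not None:
            self.tokens = min(self.burst,
                              self.tokens + (timestamp - self.last) * self.rate)
        self.last = timestamp
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

def classify(transfers, max_violations=3):
    """transfers: iterable of (timestamp_seconds, nbytes) observed on a channel.
    A channel that repeatedly exceeds the budget is treated as data plane."""
    bucket, violations = TokenBucket(), 0
    for ts, nbytes in transfers:
        if not bucket.conforms(ts, nbytes):
            violations += 1
    return "data plane" if violations > max_violations else "control plane"
```

The same shape of test applies to the shared-memory case, with page faults per second in place of bytes per second.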
>>: [inaudible] control-plane shared-memory accesses [inaudible] I understand you're trying -- distributed system messages, data [inaudible] control messages over the network, but you [inaudible] page, measuring a number of faults by [inaudible] read only -- >> Gautam Altekar: Yeah, so -- >>: What does that mean, a control page [inaudible]? >> Gautam Altekar: Yeah. So in terms of shared memory, we mean locking operations on data -- coordinated accesses to the data being processed. Suppose you have a red-black tree that you need to coordinate access to, so you acquire a lock on a shared-memory location. I would consider that a form of control-plane communication that needs to be intercepted. >>: And you expect those accesses to be lower than [inaudible]? >> Gautam Altekar: Yeah, so -- >>: Protected. >> Gautam Altekar: For example, on Linux you use futexes to avoid spinning -- spin locks. You try once, and if you don't get the lock, you block in the kernel. So that reduces the high data rates you might see otherwise. Okay. Of course, now the second question is: okay, you record just the control-plane I/O -- how is it that you provide replay with just that information? To address this question, we employ a technique we call Deterministic-Run Inference, DRI for short. The key idea is that, yes, it is true that we don't record the data-plane inputs of the original run; however, we can infer these inputs postmortem, offline, okay? And again, this inference doesn't need to be precise. We don't need to infer the exact concrete values of the original run; it suffices to infer some values that make the replay execution exhibit the same control-plane behavior. So this is the relaxation of determinism that we're shooting for. This inference process works in three phases. First you do the recording. You send the control-plane I/O to the inference mechanism, and the inference mechanism computes concrete control- and data-plane inputs -- the control-plane inputs are already recorded, so no computation is required for those. Then you feed this into a subsequent execution, and you get a control-plane deterministic replay run. Okay. Question? >>: So do you [inaudible]. >> Gautam Altekar: A protocol that's piggybacking control-plane messages on top of -- yes. Yes. So it depends on how good the heuristic is. One thing that we do is observe that even in data-plane channels you have these embedded control-plane channels -- in particular, message headers are a type of control-plane channel. So you can say, okay, I'm going to consider the first 32 bytes of every application message boundary to be control plane, as a kind of heuristic to capture that. So yes, it's a heuristic, it isn't perfect, but we think that with these engineering tricks we can make it more accurate -- a closer approximation of the control plane. >>: So even though you might have only a small number of bytes recorded, [inaudible] be able to replay it? If [inaudible] if I have 38 bytes and you record only 32, [inaudible] replay? >> Gautam Altekar: Yes. >>: Okay. >> Gautam Altekar: [inaudible]. >>: Do you think this maybe is a more fundamental insight? I mean, control plane versus data plane -- that's just giving you [inaudible], but it seems like what you really want is [inaudible] just the bytes, control or data or whatever, that let you get this.
>> Gautam Altekar: Right. Well, sure. But if you take this to the extreme and don't record any of the control plane -- none of the inputs -- it's possible you could still get an execution, but there's a chance that you would not reproduce the underlying root cause. The observation is that it's important to try to reproduce the control plane, because the root cause is embedded in that code, so we want to reproduce that code's behavior as much as possible. >>: I guess I'm just asking, have you considered -- you could look at the [inaudible]; maybe there's some slice of the program, and it's maybe not even control versus data but something even smaller that [inaudible]. >> Gautam Altekar: That's possible. We haven't completely tested the limits of the control-plane determinism model; it seems to be pretty good so far, but there are some challenges with shared memory, of course. So there's room for future work on that. >>: Sure. >> Gautam Altekar: Question? >>: What if the data plane can't be inferred? Like, what if the code says: receive the message, and if the signature is correct then do X. How are you going to generate a data plane -- >> Gautam Altekar: These are all very good questions, and I'm going to address that in two slides. Okay? So how does this inference work? Let me give you a brief overview. There are two phases. First you take your distributed application -- your program -- and you translate it into a logical formula, known traditionally as a verification condition. This is done using symbolic execution techniques. The resulting formula basically expresses the program's control-plane output as a function of its control-plane input and its data-plane input. Now, we know the control-plane input and output because we recorded them -- they're in the logs, so we know those concrete values. But we don't know the data-plane inputs; they're unknowns. So to figure those out, we send the formula over to a constraint solver, which thinks about it for a while and then returns concrete values for the data-plane inputs. Okay? So that's the basic idea.
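To make the flavor of this concrete, here is a toy, self-contained illustration of a single-path formula with a recorded control-plane output and unknown data-plane inputs. The z3 solver is used purely as a stand-in -- the talk doesn't say which solver DCR actually uses -- and the tiny "program" below is invented for illustration:

```python
# Toy Deterministic-Run Inference example (illustrative only).
# Imagine data-plane code that sums four unrecorded input bytes, and
# control-plane code that logged both the branch taken (total > 500)
# and the value 612 on a recorded control-plane channel.
from z3 import BitVec, Solver, sat

data = [BitVec(f"b{i}", 32) for i in range(4)]   # unknown data-plane inputs
total = data[0] + data[1] + data[2] + data[3]

s = Solver()
for b in data:
    s.add(b >= 0, b <= 255)      # each unknown is one byte
s.add(total > 500)               # the branch the recorded run took
s.add(total == 612)              # the value seen on the control-plane channel

if s.check() == sat:
    m = s.model()
    # Any satisfying assignment reproduces the recorded control-plane
    # behavior, even if it differs from the original input bytes.
    print([m.eval(b, model_completion=True).as_long() for b in data])
```

The point is only that the replayed data-plane values need not match the originals, as long as they drive the control plane through the same recorded behavior.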
Question? >>: Are you translating the program or a program trace? Because translating a program into [inaudible] is a little difficult in the presence of [inaudible]. >> Gautam Altekar: Oh, yeah, absolutely. >>: So what is it you're actually translating? >> Gautam Altekar: It's a partial trace. We have an execution with recorded control-plane I/O, but some of the inputs we don't know, so we still have to deal with loops and things like that. >>: So -- >> Gautam Altekar: So it's a kind of static verification-condition generation, but we have some of the inputs, and that allows us to reduce the path space and be more like symbolic execution. >>: So just to be concrete: I have some program that's taking some input off the network, we know what that input is, and you're going to reexecute the code with that input to get sort of a partial program. What's the technique exactly? >> Gautam Altekar: The technique is symbolic execution, okay? >>: [inaudible] as far as an input. >> Gautam Altekar: Which -- which -- >>: [inaudible] execution along a path, or -- >> Gautam Altekar: Along a set of paths. It's multipath symbolic execution, so we have to consider multiple paths. In the simplest case, we take the application, we pick a path, and we symbolically execute it, so now we have a formula for one path. >>: Right. >> Gautam Altekar: Then we pick a different path -- this may entail going around a loop twice -- and we pick another path, and so on. >>: You pick a path that's consistent with the control plane. >> Gautam Altekar: Exactly. That's why, when we do the symbolic execution, we feed in the control-plane inputs that we know. That focuses us on the path traces that are consistent with the inputs we received, so it's also a technique for reducing the search space of the symbolic execution. Okay. So, a recap: what we're trying to do is offline replay of datacenter applications. We record the control-plane I/O and we try to infer the rest -- the data-plane inputs. A gentleman brought up a very important question: what about the scalability challenges? Indeed, this problem is not solved, and it is the essential challenge in making this inference technique work. If you do this inference naively, it doesn't scale, and the basic reason is that you're searching for gigabytes of concrete data-plane inputs -- it's an exponential search space. More specifically, there are two problems. First, you're doing multipath symbolic execution through a potentially exponential path space, and the cost of symbolic execution per path is quite high -- we do it at the instruction level, so we're talking about a 50x to 60x slowdown in our unoptimized implementation. Second, even if you do manage to generate a formula for any given path, you're talking about formulas that are gigabytes, even hundreds of gigabytes, in size. And this is not surprising, because you're talking about gigabytes of inputs. And even if these formulas weren't that large, you're talking about constraints that are very hard to invert and solve -- hash functions, for example. So this seems insurmountable. Is there any hope? We make two observations. The first is that if you look at these applications, most of the unknowns in the data plane come from network and file data-plane inputs -- 99 percent, in fact. Just one percent of the inputs come from other kinds of nondeterministic data-plane inputs -- and I think this answers the earlier question about the volume of shared-memory communication: data races account for a very tiny fraction of the overall inputs. Ninety-nine percent of the inputs come from network and file inputs -- basically the datasets that these applications process. The second part of the observation is that the network and file inputs are derived from external datasets -- for example, click logs -- and these datasets, moreover, are persistently stored in distributed storage. Why are they persistently stored? Usually for fault-tolerance purposes: you have these click logs you want to tally, sometimes something goes wrong in the tallying process, and you need to be able to restart it and try again. Because these inputs are persistent, we have access to them during replay -- and they're persistent in append-only, read-only distributed storage. Now, these two observations lead to the idea of using these persistent datasets to regenerate the original network and file data-plane inputs.
And if we can do that -- if we can regenerate these concrete inputs -- then we can get rid of 99 percent of the inference unknowns. So that's the key idea behind ADDA. How does this work exactly? We have a technique called data-plane regeneration. The basic idea is that you give it access to the persistent datasets stored in HDFS, for example, and it will regenerate the original concrete network and file data-plane inputs without using inference -- that's important to emphasize. There are two observations behind this. First, if you look at these applications, the inputs of any given node are the outputs of upstream nodes, okay? This is a basic property of distributed applications -- pretty easy to see. The second property is that the outputs of any single node are exclusively a function of its inputs and the ordering of operations on those inputs -- things like the memory-access interleavings on those inputs. Now, if we put these two observations together, the implication is that we can regenerate the original concrete data-plane inputs simply by replaying the inputs and the order of operations of upstream nodes. Okay, let's make this a little more concrete and look at it in detail. There are two cases. The first, easy case is where you have boundary nodes -- nodes that are directly connected to the persistent storage: Cosmos, GFS, HDFS, and so forth. In this case, if you want to regenerate the data-plane inputs, all you have to do is read the inputs from the file -- just open the file from the persistent store and read it back in again. So that's the directly connected case. Now, the internal case is a little more complicated, because these nodes aren't directly connected to the persistent store; they communicate through other nodes. So, for example, let's consider what would happen if we need to regenerate the inputs to internal node B. The first observation is that B's input is actually C's output -- that's very simple. The second observation is that C's output is a function of its input, which is a data-plane input, and the ordering of its operations -- here, in this case, the multiplication, then the addition -- and the result is 3, which is B's input. And hence we have completed the regeneration task in this very simple case. Now, we can extend this by induction to the rest of the cluster: data-plane inputs can be regenerated for all the nodes using these two cases.
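A tiny sketch of the internal-node case just described, where the concrete values, node names, and the multiply-then-add operation order are placeholders standing in for the slide's example rather than anything recorded from a real run:

```python
# Illustrative data-plane regeneration. Node C is a boundary node: it re-reads
# its input from persistent storage and re-applies its operations in the
# replayed order; its regenerated output becomes internal node B's regenerated
# input. All names and values here are made up for illustration.

persistent_store = {"click_log": (1, 2)}        # stand-in for HDFS/GFS data

def regenerate_node_c(op_order):
    x, y = persistent_store["click_log"]        # boundary case: just re-read the file
    acc = 0
    for op in op_order:                         # operation order as replayed for node C
        if op == "mul":
            acc = x * y
        elif op == "add":
            acc = acc + 1
    return acc                                  # C's regenerated output

b_input = regenerate_node_c(["mul", "add"])     # ...which is B's regenerated input
print(b_input)                                  # 3, with no constraint solving involved
```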
Yes? >>: [inaudible] computations are purely [inaudible], and if you're I/O bound on the storage and you're having to read all the storage back in and you're having to [inaudible] all the same states and you have a latent bug that [inaudible] recompute [inaudible] the recomputation? Don't you have to recompute [inaudible]? >> Gautam Altekar: You mean to replay and do the analysis? >>: Often you can accelerate replays and future replays, right? You cut out -- you [inaudible]. But in this system, since you're actually bound to the data [inaudible]. >> Gautam Altekar: Yeah. So in reality, yes: because we have to regenerate the data, and the data has to traverse the network links, you're talking about actually having to wait. If you have two weeks' worth of computation, then, yes, you will have to wait two weeks. That is a drawback of using the system. However, it's offline. So if you have a bug that you're having a really hard time with, we think it's still useful in those cases. >>: And if the bug was caused by, let's say, a checksum missing corruption on the wire, which caused a function to do something incorrect on the data, that would not be [inaudible] in this case, right? Because you're not replaying -- you're not [inaudible] to replay the inputs; you're depending on the inputs that existed in persistent storage to begin with? >> Gautam Altekar: Right. >>: Okay. >> Gautam Altekar: Yeah, you're depending on the inputs in persistent storage. One thing I'll note is that these applications mostly communicate over TCP, so the probability of checksum violations, things like that -- >>: Actually [inaudible] the TCP checksum [inaudible] scale. >> Gautam Altekar: Okay. >>: Having corruption in TCP is not uncommon [inaudible]; the checksum is known to be weak, you know, to miss [inaudible]. I'm just curious -- I'm just trying to understand what the scope is. >> Gautam Altekar: Yeah. >>: So it's [inaudible]. >> Gautam Altekar: Certainly. There are tradeoffs in actually regenerating the computation, and if data gets corrupted and the checksum fails, then yes, it becomes a problem. >>: And you're assuming that your replay cluster has access to the same data store, or [inaudible]? >> Gautam Altekar: Yes. >>: [inaudible]. >> Gautam Altekar: It has access to the same distributed data store that's hosting these persistent files. That's the assumption, and yes, all of this depends on that assumption quite critically. Okay. So at this point we've covered a lot, so let me do a quick recap. Here is ADDA's distributed design again. It records just the control-plane inputs and outputs, and then it uses those control-plane inputs and outputs to infer a control-plane deterministic run. This is done using program verification and symbolic execution techniques. Once it has the distributed replay, it performs the analysis on top of the replay. That's the basic design of the system. Now, how well does this system actually work? In this evaluation we have two questions. First, how effective is ADDA as a debugging platform -- what kinds of interesting tools can you write on top of it? And second, what is the actual overhead of using ADDA, both in production and offline? To answer both of these questions, we run ADDA on three real-world applications: the Hypertable distributed key-value store, used by companies like Baidu and Quantcast; the Cassandra distributed key-value store, used by Facebook and others; and Memcached, the distributed object cache, which almost everybody uses. >>: [inaudible]. >> Gautam Altekar: I/O to disk, or to the persistent store? >>: [inaudible]. >> Gautam Altekar: Okay. All of them do I/O to the persistent store, actually. For Memcached it depends on the particular application setup. >>: [inaudible]. >> Gautam Altekar: For Cassandra, for example, in our experiments we got the datasets out of the distributed storage -- we mounted a -- >>: [inaudible] perhaps that's distributed storage [inaudible] cache. That's what I'm asking. >> Gautam Altekar: Yeah. >>: Are you -- >> Gautam Altekar: Yeah. >>: You're dealing with -- you're dealing with benchmarks where the data could fit in memory. >> Gautam Altekar: Uh-huh. >>: Okay. >> Gautam Altekar: Okay. So what about effectiveness?
So to gauge this, we developed three powerful, sophisticated debugging plugins for ADDA. The first is a global invariant checker, with which we checked many invariants and actually found three bugs in research distributed systems. The second is our most powerful analysis, we think: a distributed dataflow analysis. It's written in about 10 lines of Python and tracks the flow of data through the distributed system, so it lets you trace dataflow. We've used it to debug a data-loss bug in the Hypertable key-value store. The final tool we developed is a communication-graph analysis, which builds a graph of the communication patterns of all nodes in the application; we've used it to isolate bottleneck nodes in Hypertable. Now, I want to focus specifically on our most powerful, most interesting tool, in my opinion: the distributed data-flow analysis plugin, which we call DDFLOW. The basic idea is to dynamically track data through the distributed system, okay? This works in two parts. We track taint within a node simply by maintaining a taint map of memory state and updating it as the data flows through the machine, and we track taint across nodes by keeping track of which messages are tainted by the input file or message. Now, the key question was: how easy is it to develop DDFLOW, this plugin? It took about 10 lines of Python -- I've omitted some initialization code that bloats it a little. This is our most interesting plugin in the sense that it uses almost all of the framework features. It uses the shared-memory analysis model to maintain the set of all in-transit tainted messages -- that's the message-taint-set variable you see here. It uses the causal-consistency properties: for example, you're assured that the on-receive callback will be invoked after the on-send for any given message, and causality is very important for tracking taint. It uses serializability, in the sense that on-send and on-receive appear to execute one at a time, so you don't have to worry about data races on the message-taint-set variable. And finally, it leverages the fine-grained introspection primitives provided by ADDA -- in particular, the local taint-flow analysis.
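A minimal sketch of what a DDFLOW-style plugin might look like, again with invented callback names and helpers (on_send, on_receive, msg.id, taint_of, mark_tainted) standing in for ADDA's actual local taint-flow primitives:

```python
# Sketch of distributed data-flow (taint) tracking in the style of DDFLOW.
# Per-node taint propagation is assumed to be handled by the framework's
# taint-flow primitives; taint_of / mark_tainted below are made-up stand-ins
# for them. This plugin only stitches taint across messages.

tainted_buffers = set()                       # stand-in local taint map
def taint_of(buf): return buf in tainted_buffers
def mark_tainted(buf): tainted_buffers.add(buf)

message_taint_set = set()                     # ids of in-transit tainted messages

def on_send(msg):
    # If the bytes being sent are tainted on the sender, remember that this
    # message carries taint.
    if taint_of(msg.data):
        message_taint_set.add(msg.id)

def on_receive(msg):
    # Causal consistency guarantees this runs after the matching on_send,
    # so membership in the set is meaningful here.
    if msg.id in message_taint_set:
        mark_tainted(msg.data)                # continue tracking on the receiver
        message_taint_set.discard(msg.id)
```

As with the lost-message example earlier, the serialized callbacks mean the shared message_taint_set needs no locking.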
Now, that was effectiveness. How well does this thing actually perform, in production and offline? For production we want to look at the throughput cost and the storage cost -- storage matters because you can't record a ton of data on these systems; you can't double the terabytes of data you're already storing. And second, we want to look at offline analysis: how long does it take to run these ADDA analysis plugins? For this we had a very simple setup on our local 16-node cluster, to make it easier to understand what's going on, with input datasets that we varied from one gigabyte to three gigabytes, all stored in persistent storage on our local HDFS cluster. Okay. So what about in-production overhead? Here you see two graphs, the first showing the recording slowdown for these applications and the second showing their recording rates. The basic takeaway from the first is that ADDA incurs a slowdown of about 1.2x, almost across the board. And for the recording rate, the takeaway is that you have an average recording rate of about 200 gigabytes a day. This is in contrast to the roughly 4.3 terabytes a day that you would get if you were to record both the control and data planes. ADDA records just the control plane, and so it gets away with much less in terms of recording. >>: [inaudible]. >> Gautam Altekar: I'm sorry? >>: Over three -- so you have 200 gigabytes of trace for three gigabytes of data being manipulated? >> Gautam Altekar: Uh-huh. >>: Okay. >> Gautam Altekar: Yeah. So this is control plane, and it actually should be less, but because of implementation issues with our system we record more than we should. It's a matter of fine-tuning our heuristics. Okay. Now, the key question is: how does this overhead scale as you increase the datasets? Because these are actually small datasets. The first graph shows the throughput slowdown as we scale the input sizes, and the second graph shows the log sizes as we scale the input sizes. You can see that the slowdown for ADDA stays relatively stable as the input sizes increase, which is not too surprising, because again we intercept and record just the control-plane data, and that's not a whole lot of data. In the second graph you can see that ADDA's log file size does grow as the input scales -- it's a little hard to see -- however, this growth is nowhere near the growth you see when you record both the control and data planes, for the obvious reason that recording both means recording all of the data as well. Okay. So that was in production. What about offline analysis speed? Here are two graphs. In the first you see the slowdown for analyzing a uniprocessor recording -- all nodes in the system were restricted to one processor. In the second graph you see the slowdowns where all nodes were using two CPUs. The takeaway from the first graph is that you get an average slowdown of 50x for the DDFLOW analysis plugin. And if you look at the breakdown of this overhead, you'll see that most of the overhead in the uniprocessor case comes from the analysis itself -- distributed dataflow at the instruction level -- which is not surprising, because it requires instruction-level tracing, so it makes sense that the analysis dominates the replay cost. >>: [inaudible] for the symbolic execution? >> Gautam Altekar: I'm sorry? >>: The symbolic execution and the taint. >> Gautam Altekar: So for the uniprocessor case we don't need to do symbolic execution -- we don't need to do inference at all, because we were able to regenerate the data-plane inputs entirely. As a result, this overhead is just the taint-flow analysis. >>: [inaudible]. >> Gautam Altekar: I'm sorry? >>: [inaudible]. >> Gautam Altekar: No, we use Valgrind -- it's structured along the same design as David's CatchConv system. It's called libflex, which is a back-end binary translation system.
And what's interesting here is that the first graph contrasts quite clearly with the second one, in that for the multiprocessor case most of the slowdown is attributable to the replay phase, not the analysis phase. That's a bad thing, because ideally the system should stay out of your way and let you do expensive analysis -- or lightweight analysis, if you so desire. The overhead there is about an 800x slowdown for computing these multiprocessor runs. So the question is: why does 1 CPU perform so much better than 2 CPUs? What you see here is a graph that breaks down the replay time, for both the 1-CPU recording case and the 2-CPU recording case, into two parts: the amount of time spent in data-plane regeneration and the amount of time spent in the inference process itself. The interesting thing is that in the 1-CPU case, no time is spent in inference, okay? Again, that's because data-plane regeneration recomputes all of the concrete values, so no inference is needed. For the 2-CPU case, it's a different story. We need inference to compute the outcomes of data races, because we may not have captured all of those data races using the CREW interception protocol. And unfortunately this inference requires a backtracking search through the path space, and that's where most of the expense comes from in the two-processor case. >>: [inaudible] when you're doing the symbolic execution and looking at the paths, these are paths of multithreaded execution as well? >> Gautam Altekar: Yeah. >>: So as the number of threads that you have to interleave increases, you're going to get an exponential -- >> Gautam Altekar: Yeah, so -- >>: [inaudible] path space. >> Gautam Altekar: Absolutely. But we have some tricks where we avoid -- we don't actually try all the different interleavings. In particular, we consider racing reads to be inputs -- symbolic inputs -- and as a result we can eliminate the need to search through all the different schedules. So it's a relaxation, an optimization that makes things better. But you still have to consider multiple paths -- symbolically execute multiple paths. >>: Multiple paths not of single-threaded execution but multithreaded -- >> Gautam Altekar: Multithreaded executions. >>: Okay. >> Gautam Altekar: Okay. So I've told you about ADDA -- the basic design, the basic evaluation. And I want to emphasize at this point that ADDA is a real system and that it works on real datacenter applications. For this purpose, I'd like to give you a brief demo. The goals of this demo are twofold. First, I want to show that ADDA can analyze real datacenter applications, and for this purpose I will replay a recording of the Hypertable key-value store and perform some simple analyses on it. Then, in the second part, I want to show you that beyond just doing analysis, this is actually useful for debugging, and for that purpose I will use our distributed data-flow analysis plugin to help with a Hypertable bug. Okay? Now, for the first part -- this would be easier if I had two screens, but that's okay -- the first thing I want to show you is that ADDA does simple stuff quite easily. For example, say you want to look at the aggregate output of all of the nodes in the system.
Well, ADDA can provide that for you. Let me pause this replay for a second. What you're seeing here is ADDA's main console interface. The top panel indicates the replay status of the system; here you can see that we're seven percent into the replay execution -- this is a very short replay execution I collected for purposes of the demo. In the bottom you see the aggregate TTY output of all the nodes in the system. Here you can see Hyperspace -- the Hypertable lock server -- starting up, and then the master. If we let it go, at some point you'll see the slaves starting up, and so on and so forth -- they're communicating and dumping information to their log files. That's a very simple replay-mode analysis: just looking at the output. And of course you might say, well, gee, the output -- that's not too useful for debugging, I need more information. So ADDA can give you more information. Here I'm going to pause this replay for a second. What you're seeing here is a distributed instruction trace -- an instruction-by-instruction trace of Hypertable in replay. We currently have it paused at the startup sequence, where the lock server has just started up and is executing code inside the dynamic linker, and you can single-step or let it go. It would take quite a long time, but again, this demonstrates the type of heavyweight analysis that you can do offline that would not be possible to do in production. Now, of course, you might say: okay, instruction traces, that's too much information; I want the bigger picture of my distributed system. ADDA can give you that as well. Let me pause that. What you're seeing here is a bird's-eye view of the Hypertable datacenter application in replay. Again, the top panel is the replay status, the progress. The left-most threads panel shows you all the active threads in the entire datacenter application -- you can see a bunch of lock server [inaudible] have started up, including the shell and all this stuff. On the top right, you see all the active communication channels between all the threads -- this includes sockets, files, pipes, all that kind of thing. You can see that it's loaded a bunch of libraries, as you might expect at startup, and opened some sockets. Below that you see the in-transit panel, which shows you all the messages that are in transit; currently we have a TCP message in transit that has been sent but not yet received. And below that you have a list of all the received messages. Now, an important thing to point out here is that the distributed replay is causally consistent, okay? What that means is that you're not going to see a message in the received panel until after it has shown up in the in-transit panel. I wish I had a bigger screen to show you this, but basically you should see message 35 at the end -- okay, we scrolled past it -- but anyway, message 35 shows up in the received panel only after it has shown up in the in-transit panel. It's causally consistent replay. Okay. So that was part one of the demo. In part two of the demo I'm going to show you that, going beyond these analysis tricks, the system is actually useful for debugging. So let's consider a data-loss bug in the Hypertable distributed key-value store.
In this particular bug, what's happening is that you insert some data into the table -- some real data -- but when you try to do a lookup later you can't find it. The data is gone, and you want to know what happened to it. I'll tell you up front that the data loss is caused by a race in the table migration code within Hypertable. Hypertable splits large tables once they get to a certain size, and what happened is that when you inserted the data, a split occurred concurrently, and as a result the inserted data went into the wrong shard, essentially. Now, the question is: can we use ADDA's distributed data-flow analysis to figure out what happened to our data? Did it go to the node we expected? From inspecting the logs we know that the lookups go to slave node 1. The question is, does the data go there as well? So we can use the distributed data-flow analysis to figure out where this data went. I've started up a session already; [inaudible] let me let it go again. What you're seeing is the set of all functions that are operating on the data that you inserted into the distributed system. And now the data has flowed to node 2, which is the slave -- so basically the data has gone there -- and then the slave responds to the client. Let me just pause this. What's happening here is that you have the data, it's being inserted through the Hypertable client -- you can see it's loading the data source and doing some processing on it -- and at some point it reaches node 2. The data gets sent up to node 2, which is the Hypertable slave; the slave inserts it into a red-black tree -- forgive the unresolved symbols, not all symbols are available -- and then it replies. So the take-away message from this trace is that the data goes to node 2, not node 1. Now, to do this kind of thing manually would take a lot of effort. You would have to instrument your program to track this particular lost data, and that's assuming you could actually reproduce the data race offline. So this is just an example of the power of offline analysis that ADDA provides. >>: [inaudible]. >> Gautam Altekar: What's that? Oh, because we know from the logs -- we looked inside the logs and we saw where all the lookups were happening. If we knew where the data was inserted, we wouldn't have to do all of this. We knew where the lookup went because it's in the log file. I didn't show that part. >>: So is this trace filtered based on this one action, [inaudible] other client request processing? >> Gautam Altekar: Yes. It's only showing you the functions that process the data that you're interested in. All of the other client request processing is not shown here. And that's one of the key advantages of the taint-flow analysis: it narrows things down to exactly what you want to see. >>: [inaudible] set up, say, this was the taint source -- how did you do that? Was that in the Python -- >> Gautam Altekar: Yeah. >>: -- code? >> Gautam Altekar: I'll show you. Okay, ignore that. So basically we have a command line interface. You can see here -- if you could see the cursor -- >>: [inaudible] the file that represents the source of the taint? >> Gautam Altekar: Yes. >>: Oh, okay.
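For a rough idea of what such a distributed data-flow (taint) plugin might look like, here is a hedged Python sketch. The callback names, the taint-source path, the message ids, and the function names are all hypothetical rather than ADDA's real interface:

    TAINT_SOURCE = "/tmp/inserted_rows.dat"   # hypothetical file marked as the taint source

    class DataFlowTracer:
        def __init__(self):
            self.tainted_msgs = set()   # message ids that carry tainted bytes
            self.trace = []             # (node, function) pairs that touched taint

        def on_file_read(self, node, path):
            # Reading the designated source file introduces taint on that node.
            if path == TAINT_SOURCE:
                self.trace.append((node, "read:" + path))
                return True             # the value just read is tainted
            return False

        def on_send(self, node, msg_id, carries_taint):
            # Taint crosses the network with the message that carries it.
            if carries_taint:
                self.tainted_msgs.add(msg_id)

        def on_receive(self, node, msg_id):
            return msg_id in self.tainted_msgs   # receiver's buffer becomes tainted

        def on_function_entry(self, node, func, args_tainted):
            # Record every function, on any node, that handles the tainted data;
            # this is the per-node trace shown in the demo.
            if args_tainted:
                self.trace.append((node, func))

    # Hypothetical replay: the client reads the rows, then sends them to node 2.
    tracer = DataFlowTracer()
    tainted = tracer.on_file_read("client", TAINT_SOURCE)
    tracer.on_send("client", "msg-7", carries_taint=tainted)
    if tracer.on_receive("node2", "msg-7"):
        tracer.on_function_entry("node2", "slave_insert", args_tainted=True)
    print(tracer.trace)   # [('client', 'read:/tmp/inserted_rows.dat'), ('node2', 'slave_insert')]

The output of such a plugin is what the trace above shows: only the functions that actually touched the lost data, on whichever node they ran.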
>> Gautam Altekar: And basically you can specify a file, or storage, or you can -- >>: And whoever reads that will [inaudible]. >> Gautam Altekar: Yeah, that's one option. Another option is you can explicitly mark memory regions at a certain time. >>: Sure. >> Gautam Altekar: Which is a bit more complicated, which is why I didn't do it that way. >>: So if there are lots of other database tables that are being read as a matter of course, those won't be tainted? >> Gautam Altekar: Yeah, exactly. Okay, so that was the demo of the system. Now, the key question is: where is all of this going? In the short term we want to address the limitations of ADDA, and there are many of them. As [inaudible] pointed out, a key limitation is that replay of multiprocessor runs is very slow, perhaps impractically so. The key challenge there is inference: you have an exponential-sized path space to search through, and then you have the cost of symbolic execution. There are many things we haven't tried yet. A simple thing you could do is record the path in the original execution. That has a certain cost, so maybe we can record selectively -- path samples, branch samples -- and maybe that can narrow down the search space of paths. Another option is to use annotations to summarize loops. The key observation here is that we can limit the annotation burden by annotating just the data-plane code, which is the part we don't record, and that data-plane code is just one percent of the overall application code. Now, the second challenge of course is large formulas -- formulas that are hundreds of gigabytes. One thing we've observed is that you can actually split these formulas up into smaller chunks, because these datacenter computations operate on chunks. And if that's the case, then you can solve just those parts of the formula that your analysis is interested in, and no more. Yes? >>: [inaudible] or anything like that [inaudible]. >> Gautam Altekar: No, not at this point. We have a simple -- I think I used the same disjoint-set computation that you use in catch con to separate out the formulas into different groups and then just plug it in. So stepping back to a higher level, where is all of this research going? What is the ultimate aim? Well, the holy grail, again, is the fully automated debugger. That's a very hard thing to do. ADDA only provides semi-automated bug isolation; it doesn't even try to do automated bug fixing. So there are many opportunities for synergy with existing automated debugging techniques. For example, you have delta debugging, statistical bug isolation, invariant inference. All of these techniques can be combined with the power of offline analysis that ADDA provides to make them even better, to narrow down root causes even further. So all of these things could make interesting ADDA tools. And the next step, of course, is automated bug fixing, which is very hard. One observation made in program synthesis is that if you have a program specification, you can generate a concrete implementation from it. Of course, a key challenge is that you assume access to a specification, but perhaps ADDA can help with that by inferring distributed invariants. So that would be an interesting direction to pursue.
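The formula-splitting idea mentioned above can be sketched as a simple union-find over the symbolic variables that constraints share; this is an illustrative sketch with made-up constraint and variable names, not the actual implementation:

    class UnionFind:
        # Minimal union-find keyed by symbolic-variable names.
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            self.parent[x] = root      # one level of path compression
            return root

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    def partition(constraints):
        # constraints: list of (name, set_of_symbolic_variables).
        # Returns groups of constraint names that share variables, directly or
        # transitively; only the group your analysis touches needs a solver call.
        uf = UnionFind()
        for _, variables in constraints:
            variables = list(variables)
            for v in variables[1:]:
                uf.union(variables[0], v)
        groups = {}
        for name, variables in constraints:
            root = uf.find(next(iter(variables)))
            groups.setdefault(root, []).append(name)
        return list(groups.values())

    print(partition([("c1", {"x", "y"}),    # shares y with c2
                     ("c2", {"y"}),
                     ("c3", {"z"})]))       # independent chunk
    # -> [['c1', 'c2'], ['c3']]

Because the chunks of a datacenter computation rarely share symbolic variables, the groups stay small, and an analysis that only asks about one chunk only pays for that chunk's slice of the formula.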
So with that, I'll conclude with some of the key take-away points: if you combine the notion of control-plane determinism -- recording just the control-plane I/O -- with inference and some techniques like data-plane regeneration, then you can get record-efficient datacenter replay. It is possible. And that in turn enables fairly powerful offline distributed analysis. Now, of course, there are many things to address in the future. One of them is the multiprocessor replay issue, which is very hard, but we'll continue to bang away at that. And at this point I'll stop and take any questions you may have. [applause]. >> Gautam Altekar: Okay? >>: So suppose you just want to [inaudible] multiprocessor replay time and just do EC2 instances, all microinstances, all single core. What kind of hit would you take on your Hadoop throughput versus having [inaudible]? >> Gautam Altekar: Well, that simplifies the problem enormously. If you look at commercial replay systems like VMware's, if you give them lightly I/O-intensive tasks they get really good performance -- the overhead is around one percent in terms of throughput. Those are the numbers that are published, and I've seen the system -- I interned there -- so I know what the capabilities of uniprocessor replay recording are. So in that case, I think it can be quite good. Of course, the key challenge is the multiprocessor case, which makes things a lot more complicated. >>: [inaudible] I guess I'm just asking, if you wanted this and you just wanted Hadoop throughput, why don't you just get 10X more microinstances on the [inaudible]? >> Gautam Altekar: That's an interesting usage model, I think, for these datacenter applications. >>: [inaudible]. >> Gautam Altekar: Yeah, I actually don't know the answer to that question. I'm not the one operating the systems in production, but -- >>: [inaudible] Cassandra [inaudible] if you could just do that, you could get this really nice capability. >> Gautam Altekar: Separate the tasks into distinct components that don't share much, then you avoid a lot of these problems. >>: Do these systems do invariant [inaudible], just data auditing? Distributed systems in the old world [inaudible] had something called an audit subsystem [inaudible] distributed system, and each node had a little process that audited its data structures and did some low-overhead communication with other parts of the switch to make sure things were consistent, throw alarms, and all that. My understanding is you got six nines of reliability if this audit subsystem was on and three nines if you turned it off, because it also had some repair action -- integral to the system was an auditing and repair subsystem. And I just wonder, in the architecture of the things you're looking at, how much is built in [inaudible]? >> Gautam Altekar: My impression is not a whole lot. A pretty typical thing is just to have checksums, checksum verification. But beyond that, these systems rely on recovering by using the persistent datasets. I've talked to folks at Google that work on the ads back-end teams, and they have to process terabytes of click logs. A lot of times things go wrong and, you know, it's usually a bug.
But they don't have checks for all of that, because they don't want to -- >>: The difference with telephony is that here the data is so huge and throughput is so essential -- compared to telephony, where you have a really strict data/control separation -- that they can't tolerate doing these runtime checks in this sort of auditing architecture; they can't tolerate the overhead of extra runtime checks. >> Gautam Altekar: Well, you just -- so -- >>: Huh? >>: [inaudible] penalty for getting it wrong is high. >>: [inaudible]. >>: And besides, [inaudible] the switch had to be reliable, because it was the old days, right [laughter]. >>: Just click refresh. >>: And also AT&T was a regulated monopoly, so there's a whole other reason it was overengineered. But okay, so your point is there's not much of -- what we call ship asserts? There's not a lot of -- like -- >>: [inaudible]. >>: A ship assert just means I have assertions in my code that are in my live code. >> Gautam Altekar: Yeah, not to my knowledge. I mean, I think there's very light logging being done, but beyond that, I don't think -- they might leave some assertions on, but again, they'll turn off anything that affects the aggregate throughput of the cluster, so -- >>: One of the potential performance enhancements you mentioned was to add annotations to the data-plane code, and you have techniques that separate out data-plane traffic from control-plane traffic. Is it straightforward to work back from those techniques to figure out which code is data plane and where you would have to put the annotations? >> Gautam Altekar: Yeah, so that's an interesting point. We did a study in which we basically used our taint-flow analysis to track these datasets and figured out what code was tainted by them -- that's where we got the one percent number from. And it also narrowed down the locations we'd actually have to look at. So the developer doesn't have to manually go through the code and say, okay, this is data-plane, this is control-plane; we can provide that automatically using the dynamic analysis. An interesting question is whether you could provide this statically, with some interesting static techniques -- that way you aren't limited to just one execution or a set of executions. >> Andrew Baumann: Thank you again. >> Gautam Altekar: Thank you. [applause]