>> Jim Larus: It’s my pleasure to introduce Brandon Lucia from the University of Washington. He’s a grad student working with Professor Luis Ceze, and he’s finishing up this year. He’s done work I think with a number of people in this room before. He’s going to give a talk on his Ph.D. research, so Brandon. >> Brandon Lucia: Cool. Alright, thanks Jim for the introduction. Today I’m going to talk about my research on using architecture and system support to make concurrent and parallel software more correct and more reliable. As Jim mentioned, this is the work I’ve done during my Ph.D. at the University of Washington with my advisor Luis Ceze. So, I’m going to begin my talk today by talking about the key challenges that I’ve looked at in my research. The key challenges that I’ve focused on are those posed by concurrency and parallelism, in particular the impact of concurrency and parallelism on the problems of correctness and reliability. Then I’m going to dive into the key research themes that show up in my approach to solving these research challenges. The focus of my work is on using architecture and system support to solve these problems. After that I’ll dive into a couple of my research contributions in some more detail to give you an idea of the kind of projects that I like to work on. At the end of my talk I’m going to talk about some of the things that I’m interested in going forward. So, first I’ll talk about the key research challenges that I’ve looked at. I’m going to show you now that concurrency and parallelism are essential and unavoidable. There are two main reasons why: there’s pressure from the bottom because technology is changing, and there’s pressure from the top because the applications that people care about running on computers are changing. I’ll talk about technology first. The best example of a change in technology is the shift to multicore devices. Everyone has a multicore in their phone, in their laptop, and everywhere else. In order to get energy-efficient and high-performance computation out of a multicore you need software that exposes parallelism down to the multicore. At the other end of the spectrum we have warehouse-scale computers. These are the things that run our data centers, and to fully utilize these and get energy efficiency and performance you also need to write software that maps computation across all the nodes in a data center. So, we see a need for parallelism because of the shift in technology. This is also true because of the shift in applications that we’re seeing. Mobile applications and server and Cloud applications are two examples of this. Mobile devices are little devices; they run on a multicore and they’re powered off a battery. In order to get energy-efficient computation out of a multicore you need software that utilizes that multicore, and that kind of software requires that there’s parallelism to map down to the multicore. So mobile applications demand this especially, because energy efficiency is important when you’re running off a battery, and in the server and Cloud domain you’re running on warehouse-scale computers so you need parallelism to get performance. In addition, there are also some concurrency constraints here. Mobile applications need to communicate with Cloud applications, and in Cloud and server applications you need to coordinate sharing of resources across simultaneous client requests.
So there’s really a need because of these applications for concurrency and parallelism. So there is a need, but why is that an interesting research question? To talk about that we can look back at the model that we have, we’re all familiar with this model, for sequential programs. In a sequential program there’s one thread of control and execution hops through that thread of control by doing a series of steps like A and then B and then C. If you’re a good programmer you write your software and then you give it an input and you run the program and you see what the output is. If you try enough inputs and you see enough outputs that match the specification that you have in your head, or that you have written down somewhere, then you can put that software out in the world and have some assurance that it’s correct. However, the story is different when we look at multi-threaded software. Multi-threaded software, shared-memory multi-threading, is the most common idiom these days for writing concurrent and parallel programs. In multi-threaded software there’s not just a single thread of control. There are multiple threads of control and they can interact by reading and writing a shared memory space, and explicitly synchronizing with one another to order events in different threads. Any events in different threads that aren’t explicitly ordered with synchronization execute independently. This leads to what we call the nondeterministic thread interleaving problem. The nondeterministic thread interleaving problem manifests in the following way. If we take this program and we give it an input and we run it, and we see its output, we might get one output on the first execution with that input. If we take the same input and run the program again we could get a different output. The reason is that independent operations in different threads could execute in any order in those executions, potentially changing the result of the computation. So the nondeterministic thread interleaving problem has several implications. The first is that these programs are hard to write. They’re hard to write because it’s hard to understand how different interleavings of independent operations will impact the execution of the program. These programs are hard to debug. Bugs might only manifest as failures in some program executions. When they do it’s hard to reason about what the effects were that actually led to the failure. Finally, testing these programs is currently infeasible because testing requires looking not just at the space of all possible inputs but also at the space of all possible interleavings. Although there have been some advances in recent years in the area of testing multi-threaded software by some people in the room. So, it’s getting better, but at the moment it’s still infeasible to comprehensively test these programs. Just to show you this is not an academic daydream. This isn’t just something we do in the lab. Here are three examples from recent headlines that illustrate that concurrency bugs cause problems in the real world. We’ve seen infrastructure failures, security holes that have led to millions of dollars being stolen, Amazon Web Services went down. So this is a serious problem and these problems were all the result of concurrency errors in software. So, they’re difficult to write, they’re hard to debug, and they’re infeasible to fully test. So bugs are going to find their way into production systems and cause those kinds of problems. What we want is that these programs are simple to write.
We want them to be easier to debug so when we have bugs we can fix them. We want them to be reliable despite the fact that we can’t comprehensively test them and bugs might find their way into production systems. So, I’ve just identified three key research challenges: programmability, debugging, and failure avoidance. These are the challenges that I’ve looked at during the work that I’ve done on my thesis. So, now I’m going to talk about the themes that show up in my approach to solving those key research challenges. There are four key themes that I’m going to talk about. The first is looking at system and architecture support across the system stack. Second I’ll look at designing new abstractions that allow us to develop new analyses. Third I’m interested in systems that leverage the behavior of collections of machines. Finally, I’m interested in mechanisms that are useful not just during development but for the lifetime of the system. So first I’ll talk about my approach to using architecture and system support. Many people see this slide and they see the word architecture here and they think, this guy works on hardware. Hardware is part of architecture, but the way that I think of architecture is that hardware is where the architecture begins. We have to think about the interaction between hardware and the compiler, and the compiler and the operating system, and the hardware and language runtimes, and programming languages and software engineering tools, and even things at the application level, like statistical models of the behavior of a program. So my approach to architectural support is to look across the entire system stack. The second theme that I’m going to talk about is the use of new program abstractions that allow us to develop new analyses that are more powerful than prior work for solving some problems. An example of a new abstraction from my work is the use of something called a context-aware communication graph. I’m going to describe this in more detail later but I’m showing you now because this is an abstraction. Context-aware communication graphs correspond to the interactions between threads that occur when a program is executing. This is important because it allowed us to develop a new analysis that helped us find concurrency bugs better than prior techniques. The third theme is that I like to take advantage of the behavior of collections of systems. The way that I do this is by collecting data on individual machines and incorporating it together into models that help us find bugs when they show up as anomalies. These models help us predict where failures might happen so that we can avoid failures, and they feed information back to the programmer so they have a better set of data to work from when they’re trying to fix bugs. Finally, I’m interested in systems that remain useful, not just during development but also in deployment for the lifetime of the system. So we’ve developed mechanisms that help during debugging and these have remained useful in deployment because they feed information from deployment machines back to developers. Failure avoidance mechanisms are useful during deployment because they continue to provide value when systems are running in production. So, I’m interested in these kinds of mechanisms especially when they’re hardware mechanisms that provide benefit for the lifetime of the system. So, I’ve used these themes to address these key research challenges.
My publication record shows that I’ve worked in all three of these areas. It also shows that I’ve worked across the layers of the system stack. I have papers that showed up at MICRO and ISCA, which are architecture conferences, but also at OOPSLA, which is a programming systems conference, and PLDI, which is a conference on programming language design and implementation. I’d like to talk about all three of these today. But, unfortunately, I’ll only have time to talk about my work on debugging and failure avoidance. I’m going to do that now by jumping into two of my research contributions in more detail to give you an idea of the projects that I’ve done in those two areas. There are two efforts that I’m going to focus on. The first is a project called Recon; this is a project in which we used architecture and system support to make it easier to debug concurrent programs. The second is a system called Aviso. Aviso is a technique that enables production systems to cooperate by sharing information so that they can avoid failures that are the result of concurrency errors. So, I’ll talk about Recon first. If the lighting was a little better in here you’d see, so Recon is a technique for finding bugs in programs. If the lighting was better you’d see that on this antelope there are birds that are finding the bugs and pulling them out of the antelope’s fur, so, ha, ha, yeah. Okay, so to… >>: [inaudible] >> Brandon Lucia: What’s that? >>: Is that what advocating birds or bugs on… >> Brandon Lucia: That’s actually, that’s my future work section. I’m going to see if we can recruit pigeons that are a nuisance in the cities and we can put them to work. So, in order to understand a technique that helps with concurrency debugging I unfortunately need everyone to just read a little bit of code. The code’s pretty simple… [laughter] >>: [inaudible] >> Brandon Lucia: Ha, ha, I know how much everyone loves reading code but it’s going to help a lot. So there are three threads in the program. The program’s pretty simple: green is setting some flag called ready equal to true and then it’s setting a shared object pointer equal to some new object. Blue checks that flag to see if it’s true and then takes a copy of that shared object pointer. The programmer, when they wrote this, had this invariant in mind: whenever ready is true, shared object is going to be a valid pointer. You can see that they implemented this incorrectly if you are good at reading this kind of bubbles-on-slides code. What blue does with that pointer that it copied is to put it into this queue called Q that it shares with the red thread. The red thread dequeues objects from the queue and uses them. Okay, so the program executes like that and what happens? We see that green sets ready to true and blue sees that ready is true and takes a copy of that pointer, but green hasn’t set the pointer to a new object yet. So it has an invalid pointer at this point. It enqueues the invalid pointer and then red uses the invalid pointer when it dequeues it from that queue. So what’s interesting about this bug is that the root cause is over here in the interaction between blue and green. But the failure manifests over here in red. That makes this a very difficult error to debug. Luckily there’s been a lot of prior work in debugging this kind of error. One category of prior work is work that uses program traces to debug these kinds of errors. Program traces are big lists of everything that happened during a program’s execution.
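To make that three-thread example concrete before going further, here is a rough C++ sketch of the pattern on the slide (hypothetical names; the locking the real code would put around the queue is omitted, since the point is the ordering bug between green and blue):

```cpp
#include <queue>
#include <thread>

struct Object { void use() {} };

bool ready = false;                    // unsynchronized shared flag
Object* shared_object = nullptr;       // the pointer the invariant is about
std::queue<Object*> q;                 // shared between the blue and red threads

void green() {
    ready = true;                      // BUG: flag is set before the pointer
    shared_object = new Object();
}

void blue() {
    if (ready) {                       // may observe ready == true...
        Object* copy = shared_object;  // ...while shared_object is still null
        q.push(copy);                  // enqueue the possibly invalid pointer
    }
}

void red() {
    if (!q.empty()) {
        Object* obj = q.front();
        q.pop();
        obj->use();                    // failure manifests here, far from the root cause
    }
}

int main() {
    std::thread t1(green), t2(blue), t3(red);
    t1.join(); t2.join(); t3.join();
}
```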
The programmer can look at this list of things that happened starting from the point of the failure and hopefully eventually they get back to the point in the execution where the root cause happened. These techniques are effective but they’re limited in one way, and that is if the execution is very long and there’s a large distance between the root cause of the bug and the failure, like this could be several days, then the traces can be huge. The programmer’s going to have to look at gigabytes of stuff to try and figure out what’s wrong with the program. So those techniques are useful but they give too much information. So there have been other techniques that try to help debugging by focusing in on a narrow subset of the operations that happened during the execution. These are techniques that help to debug using dependence information. Dependences occur when operations access the same piece of memory, like the green and the blue thread are both accessing this ready variable. These techniques are useful because they focus in on just the operations that the programmer might care about. But in fact they sometimes give too little information; like you can see here these operations are dependent so they might be selected in one of these techniques, but they don’t tell the whole story. They don’t include information about the accesses to the shared object variable. That would be important for understanding why this bug happened. So what we want to do is develop a technique that gives neither too little nor too much information. We want to show the programmer the root cause but we don’t want to distract them. We want to take a cue from those dependence-based techniques and we want to give the programmer information about communication. Communication is inter-thread dependence. When one thread writes a variable and another thread reads or over-writes that, that’s when we have communication. So we want to show the programmer when that happens, too. So our goal is to develop a debugging methodology that can reconstruct the root cause of failures. We want to include all the code that’s involved in the root cause and we want to show it to the programmer in time order. We want to give them the information about the communication that occurred. When we do that we have a reconstructed execution fragment. That’s one of the main contributions of this work. These reconstructed execution fragments are actually derived from a model of inter-thread communication that we also developed in this work. >>: Can I ask a question Brandon? >> Brandon Lucia: Absolutely. >>: I have a straw man proposal for solving this problem. >> Brandon Lucia: Say again. >>: I have a straw man proposal for solving this problem. >> Brandon Lucia: Okay, what is it? >>: The proposal is that you ask the programmer to write down the invariant in question… >> Brandon Lucia: And then check it. >>: And then check it dynamically. It will fail exactly at that window. >> Brandon Lucia: So the problem is that the programmer often doesn’t really know that invariant explicitly. They have it sort of implicitly in their brain and I think often programmers haven’t thought far enough ahead to really encode that and crystallize it and put it down in code. There’s also the problem of asking people to express invariants in code, which can sometimes be complicated. Actually writing these things down, that was a simple invariant but invariants could be much more complicated.
You could have pre and post conditions on API entry points and you could have data structure invariants that are not so simple to express. That work is complementary, though. I think that’s a great idea. I wish people would do that. So, cool, feel free to interrupt if you have questions. We have a fairly long time slot and we can talk in the middle of the talk if you like. Okay, so we want a debugging methodology that produces those reconstructed execution fragments. Here’s an overview of the methodology that we developed in this work. The first step is that the program crashes and someone sends a bug report to the developer. The developer looks in the bug report and sees that there’s some bug-triggering input and they use our tool to run the program repeatedly with that bug-triggering input. Our tool generates communication graphs, in particular context-aware communication graphs, which I mentioned a little earlier in my talk and I’m going to explain in detail in just a second. Having produced a set of context-aware communication graphs from many executions, the programmer labels each of them as having come from a program execution that manifested the failure or did not, so buggy or not buggy. Then our tool takes that set of labeled communication graphs and it builds a set of reconstructions that might help the programmer understand the bug. Then the last step is to look at that set of reconstructions and assign a rank to each one, so the programmer knows which one is most likely to be beneficial to look at when they’re trying to figure out why a failure happened. Okay, so now I’m going to go through each of those steps in a little more detail, starting with what communication graphs are and how we build them. So communication, as I said a second ago, happens when one thread writes a value to memory and another thread reads or over-writes that value. We can pretty naturally represent that as a graph where the nodes are static program points and edges exist between nodes whenever the corresponding instructions communicate during the execution. So we have the source, the sink, and we have shared memory communication indicated by the edge. If we do that we have what we would call a simple static communication graph. Static because the nodes represent static program instructions, and in fact this is a little too simple. The way that it’s too simple is that representing static program instructions in this graph doesn’t differentiate between different dynamic instances of the same program instruction. So if you’re going around a loop, the first iteration of the loop is the same as the second and so forth, whereas for understanding why a bug happened it might be interesting to differentiate between instructions executing in different contexts. To get around that we could look at a dynamic communication graph. In a dynamic communication graph every different dynamic instance of a static instruction would be differentiated. So the way to think about this is there’s some monotonically increasing instruction counter. Whenever an instruction executes it adds a node to the graph that’s identified by the instruction address and that counter. This is essentially a program trace. While this gets around the problem with the static graphs, the problem with this is that it’s unbounded. So we end up with that too-much-information problem that we had before. In this work we developed a middle ground between the simple static graph and the unbounded dynamic graph.
We called that a context-aware communication graph. The key idea in a context-aware communication graph is that a node represents a static instruction executing in a particular communication context. We add communication context to the nodes. A communication context encodes abstractly a short history of communication events that preceded the instruction represented by that node, so if there’s some sequence of communication events and then an instruction executes, that’s one node in the graph. If there’s a different sequence of communication events and then that same instruction executes, it’s another node in the graph. So we differentiate between different instances of a static instruction. Sure. >>: So a couple of questions. >> Brandon Lucia: Okay. >>: If I have a loop with the instruction in there I can have multiple instances of the instruction at the same communication graph label? The communication is outside the loop, say. >> Brandon Lucia: Yeah, so if you are in a loop and there’s no communication taking place then you would see multiple dynamic instances of that instruction, but in practice the graph doesn’t grow. The node is already there so we don’t add anything. >>: Okay, and by communication you mean shared memory communication? >> Brandon Lucia: Yeah, I mean shared memory communication in the way that I described before: when one thread writes a value and another thread reads or over-writes it, that’s when communication occurs. That’s… >>: And… >> Brandon Lucia: Yeah. >>: Semaphores and locks and all those other things, how do they… >> Brandon Lucia: U-huh. So those would show up as, so the question was whether synchronization operations would show up in the communication graph. In fact they would because they manipulate pieces of shared memory. So as you’ll see in my implementation we instrument programs at a very low level and we have another implementation which uses hardware support. So we’re observing the execution from a very low level of abstraction and all these things look similar. They look like shared memory operations. Sure. >>: Should I think of the entries as something like vector clocks? >> Brandon Lucia: No, I would think of them a little more like calling context in a compiler analysis, but instead of being calling context we’re looking at communication context. So rather than abstractly encoding a call stack we’re abstractly encoding the sequence of communication operations that preceded this operation. Did that help? >>: Yeah, I’m still trying to understand what’s wrong with the logical clock analysis like in a distributed system where each node maintains an [indiscernible] messages… >> Brandon Lucia: So… >>: Potentially what you’re doing here is you’re encoding the communication that’s happened, seems like the same, seems like it should be a dual. >> Brandon Lucia: They encode similar information. This I expect is cheaper to implement, which is one of the reasons that we did it this way, because we only have to do things when things actually share, not preemptively on other operations, so, yeah. >>: [inaudible] question. >> Brandon Lucia: Sure. >>: So, I’m still trying to understand what you mean by preceding. So for T equals nine, the communication, let’s say T equals nine, that red value that’s written by [inaudible]. So you put the green dot there… >> Brandon Lucia: So the green dot, yeah, so I was hoping to gloss over those details but I can get into those details now.
So the entries that go into the context are indicators that say a local read or write happened, meaning a read or write that didn’t communicate, or a remote read or write happened, meaning a read or write that did. So that’s how we abstract, we abstract away the addresses. >>: Oh, so you don’t know where you read it from, but you know that it’s remote? >> Brandon Lucia: We know that it’s a remote read. So you can actually think of it, our motivation for getting to this abstraction was thinking about coherence. So if preceding some operation there was an incoming coherence request that showed that some other processor had written to some piece of memory, that would populate one entry in the communication context. So that might be another way of helping to think about it: local operations are just memory operations and remote operations are incoming coherence requests. >>: And one more question, so… >> Brandon Lucia: Sure. >>: Let’s say T equals fifteen was not accessing the same variable as T equals nine, you’re accessing different variables, even then you would put the communication of the previous instruction in the context of T equals fifteen? >> Brandon Lucia: That’s correct. Yeah, so the context… >>: [inaudible] local and its inter-thread happens before an inter-thread [indiscernible]. >> Brandon Lucia: Right. You can think of the context as a thread-local property. The context is always being updated, and whenever a node gets added to the graph by a particular thread you grab the current context and you add it to the node. Then the context changes and you add another node and you grab the new context. Okay. >>: [inaudible] sort of like K-limiting in the call graph analysis… >> Brandon Lucia: Exactly. >>: You’re going to go back, so if I think of it as a happens-before graph dynamically, you have a compression technique that is doing basically paths of length K and that sort of, that now becomes your context for identifying a node as unique… >> Brandon Lucia: Absolutely. >>: In your compressed [indiscernible]. >> Brandon Lucia: That’s a great way to think about it, as an analogy to K-bounded calling context. That’s the perfect way to think about it. That’s the way I think about it, so. Okay, I’m going to move on just so I can get through all the content here, so. >> Jim Larus: You have good time. >> Brandon Lucia: Yeah, but it’s about time, so. [laughter] Okay, so I just described how we build these context-aware communication graphs. Now I’m going to talk about how we go from communication graphs to these things called reconstructed execution fragments, which I described a second ago. So to build a reconstruction we start with an edge from the communication graph. I’ve omitted the context just so the diagrams are simpler. Okay, then, oh, you know, I forgot to mention something just a second ago because we got into that discussion. One more thing that we add to this graph is a form of bounded timestamp. The way these work is not especially interesting: it’s a monotonically increasing counter that we update in a lossy way so the representation remains bounded. So we start with one of these edges and then we want to build a reconstruction. So we can look at those lossy timestamps that I just described and we can populate three regions of a reconstruction: the prefix, body, and suffix. We populate those regions based on those timestamps.
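As a rough illustration of the graph abstraction described above, here is a hypothetical data-structure sketch of a context-aware communication graph; the names, context length, and layout are assumptions for illustration, not the actual Recon implementation:

```cpp
#include <cstdint>
#include <deque>
#include <set>
#include <utility>
#include <vector>

// Each context entry just says what kind of memory operation preceded this one.
enum class Event : uint8_t { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

constexpr size_t kContextLength = 5;   // hypothetical K bound on the history

struct ThreadState {
    std::deque<Event> context;         // thread-local, bounded history of events

    void record(Event e) {             // updated on every memory operation
        context.push_back(e);
        if (context.size() > kContextLength) context.pop_front();
    }
};

// A node is a static instruction address plus the context it executed in.
using Node = std::pair<uint64_t, std::vector<Event>>;
// An edge pairs the writing (source) node with the reading/over-writing (sink) node.
using Edge = std::pair<Node, Node>;

Node makeNode(uint64_t pc, const ThreadState& t) {
    return { pc, std::vector<Event>(t.context.begin(), t.context.end()) };
}

std::set<Edge> communication_graph;    // only grows when a new (node, node) pair appears

void addCommunication(uint64_t src_pc, const ThreadState& src,
                      uint64_t sink_pc, const ThreadState& sink) {
    communication_graph.insert({ makeNode(src_pc, src), makeNode(sink_pc, sink) });
}

// The bounded, lossily updated timestamp mentioned in the talk is omitted here;
// it would be recorded alongside each node when the node is added.
```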
So, to populate the body, for example, we look at operations that showed up in the graph that had timestamps between the timestamp on this source node and the one on this sink node. We do the same for the prefix and the suffix. So building reconstructions is very straightforward. Yep. >>: [inaudible] how do you know, when it was a remote read but you don’t know where you read it from, what is the source and what’s the [inaudible]? >> Brandon Lucia: I don’t think I understand your question. >>: So, when explaining [indiscernible] you said that you don’t have information as to which type of [indiscernible] that you read it from, other than the fact that it was a remote read, that is, you read a value that was remotely written by somebody… >> Brandon Lucia: Yeah, so we keep a distinction between the entries in the context, which are abstracted, and the nodes in the graph. So a node is the tuple of a static instruction address and the communication context in which it executed. So you know which operation it was. That’s how you know it was a read or a write. >>: [inaudible] okay for the arrow, right, so you know that the blue box is actually a read or some read of text… >> Brandon Lucia: Or over-write, yeah. >>: And it was a remote read, that is, it read something [indiscernible] by somebody else on another thread… >> Brandon Lucia: That’s right. >>: But you don’t know which thread it was. >> Brandon Lucia: We actually do. We keep track of that. We don’t record that in the graph though. So in our implementation we need to keep track of that because we need to be able to identify when reads and writes are remote. But the graph abstracts away threads. That’s actually important for remaining bounded, because if you think of applications that have thousands of threads, like something that’s built with [indiscernible], then it might be a scalability problem for our representation if we actually encoded the thread in the graph. Did that answer your question? Good, okay. >>: This graph makes it look like the time is essentially sequencing everything. Can you have multiple instructions happening at the same time? >> Brandon Lucia: Yes, but our timestamps are sort of a cheap implementation of timestamps. So we have this monotonically increasing counter that gets updated lossily. So we don’t do that, but you very easily could. You could think about things that happened concurrently and use that as the timestamp instead. The reason we do this was as a convenience in our implementation because we actually had… >>: [inaudible] processes? >> Brandon Lucia: Yeah, so we used the real time, the, what’s it called? The Intel timestamp counter instruction. >>: So you can have multiple instructions happening at the same time, same timestamp on different processors? >> Brandon Lucia: Yes, due to imprecision in that counter, yes, you could. Okay… >>: I’m trying to understand, like this picture makes it look like everything is serialized. So I’m trying to understand if you have to construct an arbitrary serialization of all the instructions across all the different processors or whether your timestamp just gives [indiscernible]? >> Brandon Lucia: The timestamp gives us the serialization. >>: Okay. >> Brandon Lucia: The time, think of it as a system-wide time that we’re using to populate this. >>: Okay, then I don’t understand how you can’t end up with multiple instructions occurring at exactly the same time.
>>: You mean RDTSC on two machines could have the same counter? >> Brandon Lucia: Yeah, so I guess… >>: [inaudible] >> Brandon Lucia: Because of, yeah, because of concurrency and imprecision in that counter it’s possible. I’ve omitted that because I don’t think it’s an especially important detail. But you’re right that that could happen. If things did have the same timestamp because they happened on two processors that have the same counter value, they would end up in the same region of the reconstruction. So you wouldn’t necessarily know the ordering across those things, but you’d know which region they showed up in. There’s something I’m going to get to which makes it less important to know… >>: [inaudible] >> Brandon Lucia: Ordering within a region, yeah, and I’ll show you that in just a second. I’ll come back to your question when I get there if you want, yeah. So it’s actually this right here. So the reason that that’s not especially important is that we take, so one of the big problems with dealing with concurrency errors is that you get different behavior from one execution to the next. That means you get different reconstructions from one execution to the next, even if you start with the same communication graph edge. So, we have a way of aggregating reconstructions together that came from different executions. Obviously, from different executions there will be substantially different and incomparable timestamps. So the way that we produce an ordering that we show to the programmer eventually is by aggregating across executions and combining things that occurred in the same region of the reconstruction. So this is why I was sort of hedging around that question, because I was going to get to this. It only matters that they end up in the right region of the reconstruction. Yeah, and then we know the ordering: prefix things happen before the source and the source happened before the body, and so forth. >>: So in the reconstruction on the right hand side of the equal sign, the blue and the green oval that are sort of parallel to each other means one of those occurred, both of those occurred, or either of them occurred. What’s the semantics… >> Brandon Lucia: So there’s something else that I’m leaving out of this diagram for simplicity because usually I smoke through this in about 10 minutes, ha, ha, but I… >>: [indiscernible] >> Brandon Lucia: No, no, that’s fine, yeah. I’ll add more detail, so. In our actual implementation these things come with confidence values. The confidence value says, this happened in fifty percent and this happened in fifty percent, or this happened in ninety-nine point nine, nine, nine percent and this happened in one percent of executions. >>: But what does that mean? >> Brandon Lucia: It means, so we build reconstructions from graph edges that came from graphs from failing executions. So if we see green in ninety-nine point nine, nine, nine percent of the body regions from failing executions then we can have some confidence that when the program fails, whether this is significant or not is something else that we decide, but when the program fails that thing tends to happen between the source and the sink, very often happens between the source and the sink. So that’s what that confidence value gives us. >>: Okay. >> Brandon Lucia: Question in the back. >>: [inaudible] independent [indiscernible] it could be that, you know, like fifty percent of the time green happens and fifty percent of the time blue happens but they never occur together.
>> Brandon Lucia: U-huh. >>: So there’s no dependence encoded in this problem, is that correct? >> Brandon Lucia: No, we’re not encoding [indiscernible]. We’re treating them as independent. Sorry, I probably just went out of the range of the camera, so. >>: Sorry. [laughter] >> Brandon Lucia: Okay, so I’ve just talked about how we build reconstructions starting with those graphs. Talked about how we aggregate reconstructions from different executions. Now I’m going to talk about how we figure out which reconstructions are actually useful. We do that by representing reconstructions as a vector of numeric-valued features. Each of those features represents a different property of the reconstructions. Using the values of those features we can compute a rank for each of the reconstructions. So our tool works by generating lots of reconstructions, computing these feature vectors, computing a rank, and then ranking the reconstructions that were produced. Our goal is to produce a rank-ordered list of reconstructions where the first one in that list is one that points the programmer to the root cause of the bug. So what you’re all probably wondering is, what are B, C, and R? What are those features? So, I’m not going to talk about all of them. But I’ll talk about one to give you an intuition for how the features work to help us figure out which reconstructions are related to a bug. So one of the features that we use is something that we call the buggy frequency ratio, and the intuition is this: you build a reconstruction around a graph edge. If the graph edge occurs often in failing executions and occurs very rarely in non-failing executions, then we assume that that graph edge might have something to do with the failure. So we improve the rank of reconstructions built around that edge, and conversely, if this were the other way around and this graph edge were to happen often in non-failing executions and rarely in failing executions, then we would say that’s probably not very useful for understanding the bug. So we would give that reconstruction a lower rank. So that’s the intuition behind the features. The other features encode similar ideas but for different properties of the reconstructions. Sure. >>: [inaudible] question… >> Brandon Lucia: Yes. >>: Does this encoding capture bugs that occur when two things happen together, in some sense, because these are independent, so? >> Brandon Lucia: Yeah, so one of the other features that we look at looks at the consistency of things happening in a particular region of the reconstruction. That captures that idea that two things happen at once. So if, maybe we should, maybe we can talk about this later because I think it would be easier to talk about it offline than to try to get into it without a whiteboard right now, so. >>: [inaudible] >> Brandon Lucia: Yeah, one of our other features does capture that property though. Okay, now I’m going to talk a little bit about our implementation. Our implementation… >>: [inaudible] >> Brandon Lucia: Sure, yeah. >>: So how dependent are you on the quality of the labels, buggy versus non-buggy, because you could have a non-buggy run where the bug has just not caused a crash? >> Brandon Lucia: Yeah, we are completely and absolutely dependent on that property. Something that I’m really interested in for my future work is making systems that can tell you earlier than we know now that something has gone wrong. I think that’s actually a very hard problem in general.
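Before the implementation discussion, here is a small sketch of that ranking idea; the formula and the decision to rank on this one feature alone are assumptions for illustration, not Recon’s actual feature set:

```cpp
#include <algorithm>
#include <vector>

struct Reconstruction {
    int buggy_runs_with_edge;      // failing executions whose graphs contain the edge
    int total_buggy_runs;
    int correct_runs_with_edge;    // non-failing executions whose graphs contain the edge
    int total_correct_runs;
    double rank;
};

// Buggy frequency ratio: close to 1.0 when the edge shows up mostly in failing runs.
double buggyFrequencyRatio(const Reconstruction& r) {
    double f_buggy   = double(r.buggy_runs_with_edge)   / std::max(r.total_buggy_runs, 1);
    double f_correct = double(r.correct_runs_with_edge) / std::max(r.total_correct_runs, 1);
    return f_buggy / (f_buggy + f_correct + 1e-9);
}

// Rank the reconstructions so the most bug-related one comes first in the list.
void rankAll(std::vector<Reconstruction>& recons) {
    for (auto& r : recons) r.rank = buggyFrequencyRatio(r);   // other features omitted here
    std::sort(recons.begin(), recons.end(),
              [](const Reconstruction& a, const Reconstruction& b) { return a.rank > b.rank; });
}
```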
Okay, so now I’m going to talk about implementation. For our implementation we started with a software implementation. We used binary instrumentation; for C++ we used Pin and for Java we used RoadRunner. Our instrumentation is simple; we inject code around memory operations. The code that we inject updates a data structure that represents the communication graph. So you can go to my website and you can download the stuff now and you can use it if you want to. That makes it pretty cool in my opinion because it’s practical and you can go and run it on your machine. The downside is that using binary instrumentation is a little bit of a bummer because the overhead can range from fifty percent for some applications to like a hundred X. So obviously a hundred X slowdown is a little bit of a drag, but if you look at tools like Valgrind you see overheads that are actually similar for some applications. So it’s high but it’s not unreasonable. People actually use Valgrind in practical software development. So we saw those overheads and we were encouraged because it was usable, but we wanted it to go faster and so we looked at how we could use hardware support to make graph collection more efficient. Our base design for our hardware support mechanisms was a multicore processor that has coherent data caches. I’m going to add some things to this design and they’ll show up in blue, and those are the extensions that we proposed. The first extension that we proposed is communication metadata. Communication metadata is information that we add to each of the lines in the cache. In particular we record what was the last instruction to access each cache line. That’s the information that processors need to build the communication graph. We extended the cache coherence protocol to shuttle our communication metadata around. That’s useful for the following reason. Cache coherence protocol messages are sent between processors when communication is taking place in the application. So if we attach our communication metadata to coherence protocol messages and a processor receives an incoming coherence message, they know that communication is happening and they know the instruction with which the communication is happening. So they can actually, using that information, build an edge that they can add to the graph. Yep. >>: When you say instruction do you mean the IP or… >> Brandon Lucia: The instruction address. >>: Okay, not your previous thing when you had like… >> Brandon Lucia: I was doing that for illustrative purposes... >>: [inaudible] still identify instruction just by [indiscernible]. >> Brandon Lucia: I’m not sure, I think I lost you. >>: [inaudible] right, right back at the beginning of that problem with identifying the instruction statically versus dynamically. >> Brandon Lucia: Oh, right. So, the context is part of our hardware support. I left it out of this diagram because usually I actually find out that I don’t get into this much detail in the discussion. I’m really glad you guys are asking the questions, this is more fun than the normal talks that I give where everyone is just silent. But the context is part of that. >>: Okay, so you also have some additional information. >> Brandon Lucia: Yeah, we keep the context on the core, so in the metadata it’s actually the instruction and the context that both get stored. Right, and we also add a simple hardware structure to store the communication graph.
That’s a fixed-size FIFO, and it’s fixed-size so that when it reaches capacity a software trap happens; we have a runtime layer that empties it out, stores it in memory, and you can use it during debugging. We have a software tool that does all the other stuff that I described a minute ago. Sure. >>: So you have to worry about false sharing here, right, in some sense? >> Brandon Lucia: Yeah, absolutely, so false sharing means that we’re going to see communication that didn’t really happen, and cache evictions mean that we don’t see communication that might have happened and things get out of date. So that’s some imprecision and we have numbers on that in our paper. We showed that it’s not a huge problem for debugging but, yeah, it does show up as a problem. >>: So back on your software [indiscernible]. Did you guys look at, in your compiler [indiscernible], did you have any kind of compiler analysis that looked at code and said, hey, this is local, I can guarantee this is [indiscernible] versus not, in which case… >> Brandon Lucia: No, but we cheat a little bit and we excluded stack locations assuming that they wouldn’t be shared. In Java that’s reasonable, and in C++ people can do whatever they want to, but we find that common practice is not to do that. So we omitted those accesses, yeah. Okay, now I’ll just talk about some of our evaluations. So we built that tool and we simulated that hardware and now I’m going to talk about how we evaluated that. So if we had just built a compiler optimization we could take some program, use our optimization, and show that our optimization makes it go lots faster. Evaluating this was a little less straightforward. So we had to come up with a measure of the quality of our technique. We measured quality by looking in that rank-ordered list of reconstructions that Recon produces. The quality is higher if an earlier entry in that rank-ordered list points us to the root cause of the bug. The quality is lower if there are more things ahead of that root cause reconstruction that don’t have anything to do with the root cause. We also looked at performance, which is just the runtime overhead. We looked at some of the hardware overheads as well. For benchmarks this was also kind of a challenge, and I guess some of you in the room could empathize with me here; finding programs to evaluate concurrency debugging tools can be a real challenge because there’s no standard benchmark suite. So we actually went to the web and we found programs like MySQL, Apache, the Java Standard Library, things like that. We found bug-triggering inputs and we reproduced those bugs and we showed that our tool can actually lead us to the root cause of the failures that the bugs trigger. We evaluated performance using a set of standard benchmarks: PARSEC, DaCapo, and Java Grande. So here’s a high-level summary of the results that we found when we evaluated the quality of our system. The first was that using a reasonably sized set of graphs from buggy and non-buggy executions, twenty-five was the number, we found that a reconstruction of the bug’s root cause was first in that rank-ordered list that Recon outputs. That was nice because it shows that with a modest amount of effort devoted to collecting graphs the programmer is led to the root cause of the bug. We also identified a tradeoff with respect to effort. That tradeoff was the following: if the programmer uses more graphs then the quality is higher.
If the programmer uses fewer graphs they spend less time collecting graphs but the quality is lower. So they might have to spend time looking through what are effectively false positives in the output. Sure. >>: So does it matter what fraction of the twenty-five are buggy? >> Brandon Lucia: Right, so that was twenty-five buggy and twenty-five non-buggy. >>: So fifty-fifty. >> Brandon Lucia: Fifty in total. >>: So suppose it was, suppose the bug didn’t occur all the time and it was ten-forty? >> Brandon Lucia: So we actually, in the experiments that we used to illustrate this tradeoff, we used five or fifteen buggy graphs, assuming it was harder to get buggy executions, and twenty-five correct graphs, because correct graphs are essentially in limitless abundance. Okay, in our performance evaluation we showed ten to a hundred times overhead in software, like I said before, pretty high but comparable with other tools. There are two sources of hardware overhead that we found interesting. One is how often do those traps happen where you have to empty out the FIFO and store it in memory? Two is how often do you need to update the metadata that’s hanging on the end of the cache line? So we found that traps are pretty infrequent, less than one in ten million on average. You’re smiling and have a question, what’s up? >>: Well… [laughter] >>: You said the traps weren’t frequent, but what I would [indiscernible] more is how long does it take you to handle the trap and [indiscernible] the FIFO, and what’s that overhead on the overall performance? >> Brandon Lucia: Yeah, so I don’t have the numbers on that. We can talk about that later, but the infrequency helps to amortize that cost. But you’re right; I mean, it’s really the increase in latency that could be a problem, yeah. The second result is how often do we need to update that metadata? Because that could be a problem, and you’ll see that this is considerably more frequent; two percent of memory operations is fairly often. However, in a hardware implementation this can happen in parallel with accessing the cache line itself. So it’s not likely a performance problem because it can be parallelized. Okay, so just to summarize those themes that I described… >>: [inaudible] per cache line or per… >> Brandon Lucia: Per cache line. >>: [inaudible] cache line. >> Brandon Lucia: Yeah, and it’s imprecise because of that. We have an analysis of that in our paper if you’re interested in checking that out, yeah. So I just showed you that we developed new abstractions, context-aware communication graphs, and we used those to build reconstructed execution fragments. There was support across the system stack. I showed you hardware and software implementations and I showed you in our results some of the tradeoffs of using each of those. This is a system that is useful even in deployment because with a hardware implementation we can collect this information all the time and send it back to developers. Finally, this system takes advantage of collective behavior because information could be pulled from many systems that run the same piece of software and the information can be combined. So that’s what I have to say about Recon. This is a new architecture and system support mechanism for making concurrency debugging easier. Okay, yep, questions? I’m just… >>: [inaudible] >> Brandon Lucia: Starting to look at the time. We have been doing a lot of questions, I want to make sure I do get through everything without keeping… >>: You have… >> Brandon Lucia: I don’t want to, okay, sure. >>: Seriously.
>>: So if, for instance, I did something really dumb, like I just recorded the last thread that accessed the variable and kept that as my kind of straw man that says this is where the potential bug, this is the IP of the source, of the root cause of the bug, could occur. So, okay, what I’m trying to get out of this is, did you guys do any kind of analysis where you had some sort of a baseline that said, you know, I’m a very complicated system, is there any kind of baseline where you have some comparison that says something simple like the straw man [indiscernible] unique… >> Brandon Lucia: Yeah, so… >>: That actually do something that’s real to do [inaudible]. >> Brandon Lucia: We didn’t do that, but something that I would really like to do in the future is to actually get some human subjects into the lab and say debug using technique A, debug using technique B, and do a comparison. Maybe not even just across the work that I’ve done but across work that has come from other groups. I think it would be a really informative study to see which techniques are actually good, and it might involve collaborating with some HCI people because that’s a little bit outside my area of expertise. But it could be really interesting to see those results. >>: Okay. >> Brandon Lucia: So… >>: [inaudible] >> Brandon Lucia: Cool, well, yeah, so that’s my answer and I would love to see more human subject studies going on in this area of research. I just, you don’t see that many and I think it would be really cool to see more of those, so. Alright, now I’m going to change gears. I’m going to talk about a system that isn’t about finding and fixing bugs for a programmer but rather it’s about systems cooperating to learn how to automatically avoid failures. You can see, just like these buffalo are all looking outward, they’re cooperating to avoid failures, which would be lion attacks or something in this example, ha, ha. These photos, I took a trip to Zimbabwe so I’ve got a bunch of stupid vacation photos in my talk, ha, ha. >>: [inaudible] [laughter] >>: There are bugs that… >> Brandon Lucia: Yeah, it’s a little bit, you know, debugging and failure avoidance is really synergistic, they go together, so. So, I’m going to start this section of my talk with an example that shows you at a high level how our system works. But first I’m going to talk about how things work today. When you develop software today you have your development and debugging system, you make your application and then you push it out to the deployed systems like this. The deployed systems run and sometimes they get one of these thread interleavings that leads to a failure, so this might be a concurrency bug. If you’re a good developer you collect the core dump and you have that sent back to your development and debugging box. With the core dump in hand you can spend time trying to come up with a patch and figure out what went wrong with the program. The interesting thing about this is the developer is active but the deployed systems are passive in this process, just waiting for a patch to come from the developer. In the meantime, the deployed systems might experience the same failure over again, degrading the reliability of the community of systems. So in this work we had the idea to make the deployed systems be active in this process as well. We make them cooperate by sharing information to learn why failures happen and what they can do to avoid those failures in future program executions.
Okay, so now I’m going to give you an overview of what things look like if we have Aviso, which is our system that takes advantage of that idea. So, just like you have a development and debugging server, we have an Aviso server. In the deployed systems we see that the application is linked against the Aviso runtime, which runs on the same machines as the application itself. We see that same failure and, just like we sent a core dump back in the baseline system, in the case with Aviso we send an event history back to the Aviso server. Aviso does some analysis on that event history and the information that it extracts from that analysis goes into building a model of what happened in that failure, what happened preceding that failure. It’s important to note that this is a cooperative model. Any time a failure happens over here it ships an event history over to the Aviso server and contributes to that model. So nodes are cooperating, deployed system nodes are cooperating by sharing information. Using the cooperative failure model the Aviso server generates constraints on the execution schedule of the program that restrict the order of certain events in different threads. When Aviso finds a constraint that prevents a failure it ships it back across to the deployed systems. The deployed systems can use those constraints to avoid failures, and note that if one node fails and has a constraint sent to it, that same constraint can be sent to all the other machines trivially and they can share the wealth of failure avoidance. >>: Are these constraints guarantees or are they probabilistic? >> Brandon Lucia: I’ll show you. I’m going to get to that, yeah. So there are three parts to the system. The first is, what are constraints and how do they work? The second is, what are the event histories that we collect and how do we use the information in the event histories to generate constraints? Finally I’m going to talk about what goes into that cooperative failure model and how it is useful for picking which constraints are going to avoid bugs. So, first I’ll talk about these schedule constraints. To talk about these I need to show you a little bit more code. This code is really simple though; there are two threads. You have the green thread, it’s doing something funny, which is setting this variable to null and then setting it to a new object, so it does two operations. The blue thread is acquiring a lock and then using that pointer that green is playing with over there and then releasing the lock. So this program is broken in several ways. We can talk about them, yeah, at length. The important thing to know though is if it executes under this interleaving, blue uses the null pointer because green set it to null and then blue used it. That’s a problem. The way to understand this bug is that this bug is characterized by the event ordering that I’ve indicated with those dashed arrows. When P equals null happens and then P pointer use happens we get the failure, only if P pointer use precedes the assignment of P to that new object. We can also observe that if we had a different ordering of events, like P equals null followed by the assignment of P to the new object and then the use of P, well, that wouldn’t lead to a failure, so one of the key ideas in this work is to shift the execution away from failing schedules like the one on the previous slide and toward non-failing schedules like the one on this slide.
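Here is a rough C++ sketch of that two-thread example (hypothetical names; note that the lock in blue does nothing to order blue’s use against green’s two writes):

```cpp
#include <mutex>
#include <thread>

struct Obj { void use() {} };

Obj* p = new Obj();
std::mutex m;

void green() {
    p = nullptr;        // (1) in the failing ordering, blue's use lands between (1) and (2)
    p = new Obj();      // (2) if this executes before blue's use, there is no failure
}

void blue() {
    m.lock();
    p->use();           // crashes if it observes the null written at (1)
    m.unlock();
}

int main() {
    std::thread t1(green), t2(blue);
    t1.join(); t2.join();
}
```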
To do that we developed the idea of a schedule constraint, and a schedule constraint says that a pair of operations contribute to a failure and reordering around those operations can prevent that failure. So a schedule constraint is really nothing more than a pair of instructions in the execution. The semantics of a schedule constraint are very simple. We have a schedule constraint like this and it has the green instruction and the blue instruction. The semantics are the following: when in the execution we reach that first instruction, the constraint gets activated. Subsequently in the execution, when we reach that second instruction, the blue one, that instruction gets delayed. Those are the semantics of a schedule constraint. Now I’m going to show you with that example why this is actually effective at avoiding failures. It’s effective because in this example you can see that P equals null gets executed. That activates the schedule constraint, then P pointer use tries to execute; normally that would cause a failure, but instead the constraint delays the execution of that operation. In the meantime, green steps in and executes its P equals new P, and later, after that delay expires, blue gets to execute its operation without failing. >>: Why are you expressing this in terms of delay as opposed to sort of, you know, thinking of the second green instruction as enabling the blue instruction to continue? >> Brandon Lucia: That’s a really great question. So why don’t we just figure out what instruction this is and make constraints that have all three of those instructions, right? >>: Well, no, I would just… >> Brandon Lucia: Okay. >>: Yeah. >> Brandon Lucia: Yeah, something like that. The main reason is, you’ll recall from the previous example that the failure occurred at this instruction. So if we want to do forensic analysis in our server we don’t know that this instruction exists. We have an event history and I’ll show you in a second what kind of event histories we keep. The event history doesn’t say anything about P equals new P. >>: But you have the code of the program. >> Brandon Lucia: We have the code of the program, and something I’m looking at in future work is doing a better job of tuning these delays based on predictions of which instructions might be good constraint deactivators, yeah, okay. >>: [inaudible] instructions? >> Brandon Lucia: Say again. >>: These constraints are referring to dynamic instructions or static… >> Brandon Lucia: Static instructions. A constraint is a pair of static program instructions. When a dynamic instance of the first instruction in the pair occurs it activates the constraint. When a constraint is active and the second instruction executes, then that causes a delay like this. >>: In any [indiscernible]… >> Brandon Lucia: Say again. >>: But those instructions could occur on any thread, right? >> Brandon Lucia: So if a constraint is active because the first instruction executed in one thread, the delay happens in any thread except the one that activated the constraint. >>: Oh… >> Brandon Lucia: Not in the same thread but in any other thread. Otherwise you’d get some atomic region that prevented itself from proceeding because… >>: So how long are [indiscernible]? There’s a scenario where this causes timeouts and it cascades and with the system you have even worse [indiscernible]. >> Brandon Lucia: Yeah, so that’s a problem. The delays are fairly short, on the order of hundreds of instructions. We did a characterization of the delay; we established this empirically.
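As a minimal sketch of those semantics, here is a hypothetical runtime hook, not the actual Aviso implementation; the fixed delay length is an assumption, and tuning it is exactly the future work mentioned next:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// A schedule constraint is just a pair of static instruction addresses.
struct ScheduleConstraint {
    uint64_t first_pc;                 // e.g., the "p = null" instruction
    uint64_t second_pc;                // e.g., the "use of p" instruction
    std::atomic<bool> active{false};
};

// Hypothetical fixed delay; the real system established its delay window empirically.
constexpr auto kDelay = std::chrono::microseconds(50);

// Called by instrumentation just before a constrained instruction executes.
// Simplification: the real mechanism only delays threads other than the one
// that activated the constraint.
void onInstruction(ScheduleConstraint& c, uint64_t pc) {
    if (pc == c.first_pc) {
        c.active.store(true);                  // reaching the first instruction activates it
    } else if (pc == c.second_pc && c.active.exchange(false)) {
        std::this_thread::sleep_for(kDelay);   // delay the second instruction briefly
    }
}
```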
One of the things I want to do in the future is do a better job of figuring out how long those delays should be and whether there are program events, as Jim pointed out, program events that we could use to trigger expiration of a delay instead. But we did this empirically and found that the range of failures that we were dealing with in our experiments fit into a particular delay window. So that’s an area I want to look at in future work. >>: [inaudible] what if there was some other constraint between those two green instructions that decided to delay the allocation of P to satisfy some other constraint, right? So then these two delays would cancel each other… >> Brandon Lucia: Yeah, so… >>: And so you would not [indiscernible]. >> Brandon Lucia: Right, so the situation is where you essentially end up with live lock because delays cascade between threads, so. >>: Not a live lock, because you’re using delays they just cancel out, because you have two different constraints there inserting delays that just cancel each other. >> Brandon Lucia: So… >>: So… >> Brandon Lucia: Yeah, I agree this is problematic. One delay could undo the good of another. So the good news is you’re only as bad as the program was initially. The bad news is that means that this mechanism doesn’t actually work. Another answer to that question is that I’m actually trying to work on a formalism right now that shows that as long as delays are acyclic, and the hard part is defining what acyclicity is for these kinds of things, then you can’t end up with situations where delays cause live lock or, in the way you described, undo one another’s work, so. >>: I don’t agree with your statement you’re only as bad as [indiscernible]… >>: No. >>: Increase in delay that kind of causes… [laughter] >> Brandon Lucia: Plus performance… >>: Much worse then, yeah. >> Brandon Lucia: With performance [indiscernible]. >>: Yes. >> Brandon Lucia: You’re right. So there is an impact on latency, yeah. We did find in our evaluation of this that delays are very infrequent however. That’s a property of the applications that we looked at. So you’re right to say that if delays happen very frequently this could be increasing latencies and causing timeouts and bad things to happen. In practice we found that’s not the case. Furthermore, in our model for selecting constraints which I’m going to talk about in just a couple minutes, we can build in a quality of service constraint that says don’t use schedule constraints that degrade quality of service, meaning cause timeouts, cause an unacceptable increase in request latency, things like that. >>: Yeah but it’s not just latency, right. You’re adding delays and what you’re doing is biasing the schedules to run a subset of schedules. Those schedules could kind of cause… >> Brandon Lucia: Well… >>: [inaudible] expose other bugs that you… >> Brandon Lucia: So, u-huh… >>: Wouldn’t then you know… >> Brandon Lucia: That’s a very pessimistic view… >>: [inaudible]… >> Brandon Lucia: I mean the reason… >>: [inaudible]… >> Brandon Lucia: We’re doing this is because… [laughter] The reason we’re doing this is to bias the schedule away from schedules that we think… >>: Right. >> Brandon Lucia: Are going to cause bugs… >>: Yeah, but you don’t know that in advance so it could be kind of, you have no guarantee… >> Brandon Lucia: It’s true but I think that that’s an incredibly pessimistic view. That’s saying that when you go to avoid one bug you’re going to land on another bug.
I just… >>: [inaudible]… >> Brandon Lucia: Think it’s possible… >>: [indiscernible] >>: [inaudible] [laughter] >>: Actually, you know this might be one place where [indiscernible] are [indiscernible]… >> Brandon Lucia: Yeah. >>: Think of there being a [indiscernible] that P is not null before the use, right. >> Brandon Lucia: U-huh. >>: So what you do at run time is that you evaluate the invariant and if it fails… >> Brandon Lucia: U-huh. >>: It’s going to crash. So rather than crashing the program you just delay… >> Brandon Lucia: You hang this thread… >>: And then… >> Brandon Lucia: Yeah. >>: Evaluate again and hopefully… >> Brandon Lucia: Yeah. >>: It’s true again and then you just run, right. >> Brandon Lucia: It actually sounds like… >>: That might be a sure way of like avoiding all of these problems. You’re about to crash, rather than crashing, you know you actually make the program… >> Brandon Lucia: The problem is having a specification. >>: [inaudible] >> Brandon Lucia: We don’t… >>: [inaudible] isn’t specification, right. There’s a lot of [indiscernible] wasn’t done here [indiscernible] right. >> Brandon Lucia: U-huh. >>: So you can just [indiscernible] for it. >>: [inaudible] or actually… >> Brandon Lucia: You’re right. The… >>: [inaudible] Java you’d have the null test anyway, right. >> Brandon Lucia: You could even thread this. Yeah… >>: [inaudible] >> Brandon Lucia: Well, we should collaborate on that kind of project in the future. >>: [inaudible] >> Brandon Lucia: I like that idea. [laughter] >>: Yeah, so… >>: [inaudible] then your actual benchmarks… >>: [inaudible] >>: Is this the kind of thing you see go wrong or do you see things go wrong that are semantically off, like you get the wrong answer but it doesn’t crash? >> Brandon Lucia: You’re talking about the question of how do we identify failures? >>: I’m saying in the benchmarks… >> Brandon Lucia: Yeah. >>: You said in practice this happens rarely. What is it that goes wrong in practice, is it a null [indiscernible] or is it we got the wrong answer because we referenced the wrong pointer? >> Brandon Lucia: Yeah, I touched on this point before you came in. So the way that we identify failures is actually looking for fail stop conditions, assertion failures and segmentation faults and signals and things. In general, finding failures is a hard problem and predicting when something has gone wrong is unsolved. I think it’s a cool thing to look at in the future. So, sure. >>: [inaudible] >> Brandon Lucia: Yeah. >>: So could you maybe try different values of the delay in production… >> Brandon Lucia: U-huh. >>: And monitor the effects this has on maybe the latency that you’re seeing? >> Brandon Lucia: You can… >>: [inaudible] end latency? >> Brandon Lucia: You can, and that would increase the search space because you’d have to try tuples of constraint and delay time… >>: What I’m saying is… >> Brandon Lucia: But it’s feasible, yeah. >>: This, without, okay, right, right, but… >> Brandon Lucia: You’ll see… >>: [inaudible]… >> Brandon Lucia: When I get to my discussion… >>: [indiscernible] flawless… >> Brandon Lucia: Of the model… >>: This, you know I guess… >> Brandon Lucia: Yeah. >>: The largest delay that has no effect on the [indiscernible] system, right, something like that. >> Brandon Lucia: That’s certainly what you want.
When I get to the model I can talk just for a second about how we can incorporate that information into the model… >>: But you could certainly just send out a variety of constraints with different delays to each, if you have a large collection of systems, right, you could kind of in parallel search the space effectively… >> Brandon Lucia: Yeah. >>: By running different delays. >> Brandon Lucia: Have you seen my talk before? Ha, ha. [laughter] Yeah, this is essential to the way the technique works. So, yeah, so something I mentioned before is that the way we generate these constraints is by collecting a history of events that happened before the failure. We use that information to generate constraints. So, now I’m going to talk about what goes into those histories and how we collect that information. So if we have a program like this we need to instrument events that are interesting when we’re trying to deal with concurrency bugs. There are two kinds of events that we think are interesting. One is synchronization events and the other is sharing events. Synchronization is locks and unlocks, thread forks and joins, things like that. These are easy to find with a compiler, and if there’s custom synchronization or something like that the programmer can tell our system this is what I’m using for synchronization. Sharing events are memory operations that access memory locations that could be shared across threads. Those are harder to find. So the way we find them is by using a profiler, because using a compiler we have to be conservative and it’s hard to identify a reasonably small set of sharing events in the execution. So once we’ve found sharing events and synchronization events we have a compiler pass that inserts calls to our run time into the program. Those run time calls are used to populate the event history that I described before. So the event history is the data structure that exists in the run time while the program executes. As the execution unfolds we see the event history gets P equals null because that’s an event, and then we see this acquire lock, and then we see P pointer use, and then we see a failure. Aviso also monitors for failures and it considers assertion failures, terminating signal deliveries and things like that, fail stop conditions, to be indicators of failures. Like I said a second ago we’re looking at other ways of identifying failures. So after the program fails we have this event history that shows what happened leading up to the failure. We want to generate a set of constraints that are candidates for preventing that failure behavior that occurred. We do that by enumerating all the possible pairs of instructions in a window of that event history that execute in different threads. So in the [indiscernible] example event history, here we have two different constraints that we can generate: the P equals null and the acquire lock, and the P equals null and the P pointer use. You’ll remember this is the one from a second ago that I showed actually works to avoid that failure by adding a delay. Okay, now I’ve just showed you how we can generate a set of constraints. But I didn’t tell you how we decide which one is actually useful for avoiding the failure. So the last part of my talk is this: Aviso gets a bunch of failures, builds up a big set of constraints, and then it needs to decide which ones it wants to send over to the deployed systems so they can use them to avoid failures. Which one does it pick? To answer that question we developed a constraint selection model.
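To make the constraint generation step concrete, here is a minimal sketch that enumerates cross-thread pairs in a window of an event history. The event format and the window size are assumptions for illustration, not Aviso’s actual representation.

    # A minimal sketch of generating candidate constraints from an event history (assumed format).
    from collections import namedtuple
    from itertools import combinations

    Event = namedtuple("Event", ["instr", "thread_id"])  # static instruction plus executing thread
    WINDOW = 16                                          # assumed size of the window before the failure

    def candidate_constraints(event_history):
        """Enumerate all ordered pairs of instructions in the window preceding the failure
        whose dynamic instances executed in different threads."""
        window = event_history[-WINDOW:]
        pairs = set()
        for first, second in combinations(window, 2):    # combinations preserve observed order
            if first.thread_id != second.thread_id:
                pairs.add((first.instr, second.instr))
        return pairs

    # The example history from the talk: the failing use follows the null store in another thread.
    history = [
        Event("p = null", thread_id=2),
        Event("acquire lock", thread_id=1),
        Event("use of p", thread_id=1),  # the failure is observed here
    ]
    print(candidate_constraints(history))
    # Two candidates: (p = null, acquire lock) and (p = null, use of p)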
Our constraint selection model has two parts: the first part is the event pair model and the second part is the failure feedback model. The event pair model looks at pairs of events that occur in the program’s execution and in particular how frequently pairs of events occur in non-failing portions of the execution. To get that information Aviso sparsely samples event histories from non-failing executions. The intuition behind why this is useful is that if a pair of events happens often in a correct portion of the execution then it’s unlikely to be responsible for the failure. So trying to reorder around those events isn’t likely to have any impact on whether the failure manifests or not. The other side of the model is the failure feedback model. This model gets populated when we start issuing constraints out to deployed systems. It explicitly tracks the impact on the failure rate of the system when a particular constraint is active and when no constraints are active. So the intuition here is that if the instance of a constraint being used by a system correlates with a decrease in the failure rate then that constraint is more likely to be useful in future executions for avoiding the failure. We have a way of combining that information together, which I’m not going to describe in detail, into a combined model that is a probability distribution defined over the set of all the constraints that we have, and Aviso draws constraints according to the probability distribution and issues them to the deployed systems. So if anyone’s a machine learning person in the audience, this is an instance of reinforcement learning and it’s a variant of the k-armed bandit model for reinforcement learning. Yep. >>: How many failures do you need to see for the feedback model to actually be useful? >> Brandon Lucia: [inaudible]… >>: You only get one sample, one data point, right, per one crash? >> Brandon Lucia: Yep. >>: Is that… >> Brandon Lucia: That’s right, yeah, so we found in our experiments that it’s relatively few, ten to a hundred, and we start to see the feedback having an impact on which constraint is drawn. So in a way you can think about this model as being predictive. We have an infinite amount of correct execution data and then a failure happens. So this predictive model says which pairs of events aren’t likely to be useful. So we discard those as much as we can. But we have some that we either don’t have enough information about or are actually useful for preventing the bug… >>: [inaudible]… >> Brandon Lucia: So we use the prediction… >>: [inaudible] failure is already to [indiscernible] if the… >> Brandon Lucia: It… >>: Crashes occurred then by the time they, you know it’s a pretty big emergency maybe. I don’t know I’m just… >> Brandon Lucia: It might be a pretty big emergency but it doesn’t fix the program… >>: Look, I know but maybe at that point like there’s ten people working on it and they might fix it and deploy it… >> Brandon Lucia: So as an anecdote, this is, time is becoming an issue, but… >>: [inaudible] >> Brandon Lucia: There is an anecdote that I like to tell and this is something I saw on the MEMCACHED developer board. They had this bug and it was open for a year. It was a lost update that triggered an assertion failure at some point later, and the developers saw the bug report and then decided to ignore it because they said fixing this would introduce a seven percent performance overhead.
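For the selection model, here is a hedged sketch of a bandit-style selector that combines an event-pair prior with failure feedback. The talk doesn’t spell out how the two models are combined, so the scoring formula below is an assumption for illustration, not Aviso’s actual model.

    # A sketch of bandit-style constraint selection; the scoring/combination rule is assumed.
    import random
    from collections import Counter

    class ConstraintSelector:
        def __init__(self, candidates):
            self.candidates = list(candidates)
            self.correct_pair_counts = Counter()  # event pair model: counts from sampled non-failing runs
            self.runs_with = Counter()            # failure feedback model: deployments per constraint
            self.failures_with = Counter()        # ...and failures observed while that constraint was active

        def observe_correct_run(self, cross_thread_pairs):
            self.correct_pair_counts.update(cross_thread_pairs)

        def observe_deployment(self, constraint, failed):
            self.runs_with[constraint] += 1
            self.failures_with[constraint] += int(failed)

        def _score(self, c):
            # Pairs that are common in non-failing executions are unlikely to matter for the failure.
            prior = 1.0 / (1.0 + self.correct_pair_counts[c])
            # Constraints that correlate with fewer failures while active get boosted.
            feedback = (1.0 + self.runs_with[c] - self.failures_with[c]) / (1.0 + self.runs_with[c])
            return prior * feedback

        def draw(self):
            # Draw a constraint according to the (normalized) score distribution.
            weights = [self._score(c) for c in self.candidates]
            return random.choices(self.candidates, weights=weights, k=1)[0]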
So it stayed open for a year, who knows how many people using MEMCACHED experienced this bug, saw that their server went down and restarted the stupid thing. Eventually enough bug reports came in that they actually went and fixed the bug, and a year later submitted a patch with the seven percent performance degradation. In contrast our system was able to avoid that bug with a fifteen percent performance overhead, which I’ll show you later, and it did it in the space of ten to a hundred executions rather than the number of executions that had bug reports submitted for them in the space of a year. So that kind of helps to frame the timeframe for how bugs get dealt with. In general bugs can stay open for multiple years; a year could be generous for some open source packages that are pretty widely used. >>: [inaudible]… >>: So there’s millions of dollars right there in lost revenue? >> Brandon Lucia: It could be. I mean that’s hard to quantify but I mean it’s a, yeah. >>: [inaudible] contribute to that [inaudible]. >> Brandon Lucia: Say again. [laughter] >>: [inaudible]. >>: Yeah. [laughter] >> Brandon Lucia: Well, I mean the information we’re collecting says a lot about why the bug happened. So we can send it back to developers and they can use this. Yeah. >>: [inaudible] >> Brandon Lucia: Well some are, I mean, ha, ha. >>: My question’s on [indiscernible]. >> Brandon Lucia: Okay. >>: So mine was just like, you know, you put this FBI in sixty seconds, the Citibank window. So what your stuff is doing is you’re reducing the window to forty-five seconds. I’d rather shut the program down, ha, ha, and not let people… >> Brandon Lucia: I don’t understand, what is the window that… >>: So there was the, one of the three things you motivated your talk with, I think, was this casino robbery… >> Brandon Lucia: Yeah. >>: This guy is exploiting this race or something… >> Brandon Lucia: Yeah, yeah. >>: With a sixty second window that download [inaudible]. So what I think [indiscernible] somebody might do is shrink the size of the window but still keep it open, as opposed to someone noticing or detecting and just shutting the system down. >> Brandon Lucia: Unless there was some way of identifying that a failure or an attack… >>: Something bad is happening… >> Brandon Lucia: [indiscernible], yeah. >>: It’s better sometimes just to shut it down, right… >>: Are you saying that… >> Brandon Lucia: But for… >>: Prove that… >> Brandon Lucia: Yeah. >>: He’s trying stuff, right… >> Brandon Lucia: For… >>: [inaudible]… >> Brandon Lucia: To keep the system available in the case of fail stop bugs, I think this is a better option than letting the system crash, if availability is the most important thing. >>: Correct, but I’m saying that sometimes… >> Brandon Lucia: Sometimes it’s not… >>: Sometimes we use the wrong metric, right… >> Brandon Lucia: That’s absolutely true. >>: There’s times where you kind of more or less sacrifice availability… >> Brandon Lucia: Yeah. >>: Because it’s worse to be available. >> Brandon Lucia: Even in the case of security bugs this can still be useful, I think, especially when combined with techniques for identifying that something anomalous is happening in the execution. I haven’t done that work yet, but if you think about a technique that says, hey, an attack might be happening, maybe we can use a mechanism like this in combination with something like that to keep the system available and to close the security hole. I think that’s something interesting to think about, so...
>>: I think this is great for the staging area or [indiscernible] where you can [inaudible]. >> Brandon Lucia: Sure. [laughter] >> Jim Larus: So you and I are going to lunch so [indiscernible] can go ahead… >> Brandon Lucia: Do you want to talk about [indiscernible] because I have a few more slides… >> Jim Larus: Sure. >> Brandon Lucia: I’m afraid people are… [laughter] [indiscernible] start leaving because it is… >>: [inaudible] >> Brandon Lucia: Twenty minutes to twelve now, ha, ha. [laughter] >>: So you know what… [laughter] >>: One piece of information that you don’t seem to use is that something happens in correct executions which doesn’t happen in incorrect executions, so you don’t have to worry about that [indiscernible]. >> Brandon Lucia: Right, we have half of that information but we don’t have… >>: You don’t have that half. >> Brandon Lucia: We don’t have the other half. >>: Right. >> Brandon Lucia: Right, so we can incorporate information from failing event histories into our predictive model but I haven’t done that because I couldn’t come up with a way that reliably produced good predictions. It’s a hard problem because you also have a data sparsity problem: you see fewer failing executions than you see correct executions. There are lots of events in a program. So for some of those events… >>: [inaudible] >> Brandon Lucia: You don’t have information from the failing executions, which makes it a hard thing to even incorporate that information. Yeah. >>: So a failure [indiscernible] instance, which is a static instruction address, right? >> Brandon Lucia: Yes. >>: That is, you’re not adding any context. >> Brandon Lucia: We have no, it’s context insensitive. >>: So have you thought about adding more context like your context-aware stuff… >> Brandon Lucia: I have. >>: And see if it does actually do failure of… >> Brandon Lucia: So call stack information would eliminate some spurious delays but collecting it is expensive. I mean it adds overhead to collect the call stack information. >>: [inaudible] right, so you can collect a lot more… >> Brandon Lucia: No, we’re not sampling, so in order to activate a constraint we would need to know that a particular instruction executing in a call stack was happening. So we would need to do a check that computed the call stack at each activation point, so. Cool, our implementation is simpler than this slide makes it seem. There are three parts: the run time, the compiler and profiler, and the server. For the compiler and profiler, the profiler was written in Pin and the compiler we wrote as a pass for LLVM; it takes responsibility for finding and instrumenting events and linking to the run time. The important thing about the interaction between the run time and the server is that they exchange event histories and they exchange schedule constraints. The server maintains the model of how to draw constraints and they communicate over HTTP, so the system is portable in implementation. You can put it anywhere and it doesn’t need to be on a single machine, for example. So, now I’m going to talk about how we evaluated our system. My goal in the evaluation is to convince you that we measurably decreased failure rates in our experiments with some real applications and that our technique has overheads which are reasonable, especially when availability is the key concern. Our setup was to use a small cluster of machines that all run the application and we had a single Aviso server.
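As a rough illustration of that run time/server split, here is a hedged sketch of the HTTP exchange. The endpoint names, JSON shapes, and server address are made up for illustration, since the talk only says the two sides exchange event histories and schedule constraints over HTTP.

    # A sketch of the runtime/server HTTP exchange; endpoints and payload shapes are assumptions.
    import json
    import urllib.request

    SERVER = "http://aviso-server.example:8080"  # hypothetical server address

    def report_failure(event_history):
        """Runtime side: after a failure, send the recent event history to the server."""
        body = json.dumps({"events": event_history, "outcome": "failure"}).encode()
        req = urllib.request.Request(SERVER + "/histories", data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    def fetch_constraint():
        """Runtime side: ask the server to draw a schedule constraint from its model."""
        with urllib.request.urlopen(SERVER + "/constraints/draw") as resp:
            return json.loads(resp.read())  # e.g. {"first": "p = null", "second": "use of p"}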
We used a set of benchmarks that partially overlaps with the ones that I used in the previous study; MEMCACHED and Apache and Transmission were the biggest applications we looked at. So here’s a summary, a very high level summary of the results. For one of the applications we had a reduction in failure rate of eighty-five times. That was a failure in the PHP processing subsystem of Apache HTTPD, and we saw, yeah, an eighty-five times decrease in the failure rate, so it happened eighty-five times less frequently. >>: That’s just one bug… >> Brandon Lucia: This is one bug. >>: You just avoid [indiscernible]. >> Brandon Lucia: That’s correct. I mean that frequency was over a humongous space. In our experiments we used hundreds of billions of requests hitting these systems, so it was a very large space of execution that we looked at. >>: Did you have any executions where you were actually avoiding multiple bugs… >> Brandon Lucia: Yeah, we had a study that didn’t make it into the paper where we took two different bugs in a version of Apache and we showed that we can avoid them, and the key to that is that we have schedule constraints and we need to decide if it’s the same bug happening or a different one. We do that by fingerprinting bugs based on the event history that preceded the failure. So doing that we can send one constraint for each bug that we’ve fingerprinted and we can solve that problem. It didn’t make it into the paper but we showed that it does work without increasing overhead as like the product of the delays, the overheads, so. The average case was about a fifteen times decrease in the rate of manifestation of the failures. The overheads that we saw were practical, especially if availability is the most important thing. There were overheads as low as five percent when we were monitoring the execution and using delays to avoid failures. The average overhead was around twenty percent. So these are overheads that are acceptable in production systems, like I said, especially when latency is not the highest priority and availability is more important. Okay, so, just to wrap up this section. Schedule constraints are the new abstractions that we introduced. We have support in the compiler and run time, and we have a statistical model at the application level. This is a system that’s useful over the system’s lifetime because it actually helps deployed systems be more reliable. It takes advantage of collective behavior by sampling information from many deployed systems. So that’s it for the two projects that I was going to drill down into. This has been just awesome to have this many questions. I really appreciate it. Usually it can get dry to give this talk a million times, so, ha, ha. So now I’m going to move on in like three minutes and talk about some future work, and then I’ll take more questions afterwards if there are people that are wondering things. So in the future I’m interested in continuing work in the direction of reliability and in looking at adapting these techniques to energy efficiency. I’m also looking at some emerging architectures which I think are interesting. So I’m going to talk about those now. To start with though, this is a picture of the way that I think computer systems are being built today, and I think it’s getting worse.
So we have multicores, and in order to get good performance out of a multicore or a data center you have to put a lot of burden on the programmer to get that programming right so that we get reliable execution. The burden is primarily on the programmer to go and avoid crashes, just like this guy on the bike needs to… >>: Is that also in Zimbabwe? >> Brandon Lucia: No that’s not, ha, ha, this is stock art from the internet, ha, ha. This is stock art from the internet… >>: Oh. >> Brandon Lucia: I just, I found this photo, ha, ha. I thought it was funny. So this guy gets brick level parallelism but he pays for it in that he has to carefully stack these bricks on his bicycle. [laughter] He gets good performance but it’s really hard. So I think we need to address the reliability problem. In the past we’ve been focusing a lot on the performance problem, and I think the problem is getting worse as we move toward heterogeneous architectures where the programming problem is going to be more complicated. When we’re addressing reliability and performance we need to keep in mind complexity. We need to balance where complexity ends up in a system. Does it end up in the architecture or the compiler or the language, or in the programmer’s hands, or wherever? We need to keep that in mind when we’re coming up with solutions. So one thrust of my future work is going to be to continue to look at reliability; reliability is the problem I’ve been talking about. In fact I think that as the performance benefits of Moore’s law are petering out because of the utilization wall and the power wall, we’re going to need to find other ways of adding value to platforms. This is especially interesting to companies but I think that this is interesting in general. One way that I think is a very promising way to do that is to add features to architectures and systems that improve reliability all the time. I described two of them today, one that has hardware mechanisms and another that’s a software layer. So I think there’s a big opportunity to do research in reliability. One idea in particular that I’m really interested in is the idea of decoupling the process of developing software from the reliability of the software. Aviso is a really early example of this. It’s taking some of the responsibility for avoiding failures out of the hands of the programmer. One area where I think this is especially interesting is in shared resource platforms like Cloud applications and mobile applications. So I think of the process as hardening applications in these kinds of platforms. The programmer doesn’t see anything different, they just deploy the software. The user doesn’t see anything different, they just get software as it’s distributed. Some interesting points related to how we can take advantage of… >>: [inaudible] >> Brandon Lucia: Shared infrastructure. >>: Can you go back to the previous slide? >> Brandon Lucia: Sure. >>: So… >> Brandon Lucia: Is it about the cats, ha, ha? >>: At least in this company… >> Brandon Lucia: Yeah. >>: I’ve never seen that anybody would care so much about reliability especially if [indiscernible]. >> Brandon Lucia: U-huh. >>: That, you know if the user doesn’t see any perceptible benefit, right, why would a company invest in reliability? >> Brandon Lucia: They see it by comparison to other platforms. So you have all sorts of reviews on, just take the mobile space for an example. You have Android versus iPhone… >>: Okay.
>> Brandon Lucia: If I’m an end user and I’m asking which phone do I want to buy the next version of? Well, you can look at Android and you can look at iPhone and say which one has more crashes, and then you can go and buy a Microsoft Windows phone or something like that because this one has fewer crashes, because someone baked something into the software run time layers to improve the reliability. That could actually… >>: That has never happened in the… >> Brandon Lucia: It’s never happened… >>: [inaudible] >> Brandon Lucia: I totally agree with you. It’s never happened because people have focused on making performance better in subsequent… >>: [inaudible] >> Brandon Lucia: Generations. What? >>: Or features. >> Brandon Lucia: But what are, features are essentially performance. Features are things like vector processing and that gets performance. Features are things like better optimizing compilers, it’s for performance. So I think reliability, no? >>: Features are features on your phone. >>: Yeah. >>: Something like… >>: You know I want to talk to my phone so… >>: [inaudible] >>: I think the argument against that is that we spend an enormous amount of money on test and in our software… >>: The other argument is we expose [indiscernible] lots of developers now, right, if you’re on Windows [indiscernible] you can get [indiscernible] and we expose those to [indiscernible] more reliable. >> Brandon Lucia: Right, so I think that because you have shared resource platforms like that you can do things like you just described, and Aviso and what you just described are only the very beginning, and I think there’s a lot of other opportunities. So this shows some of the advantages to looking at these platforms and some of the opportunities that are there. One is that you have the common infrastructure so you don’t need to boil the ocean. If you want to push a new testing tool or a new optimization technique, or a new failure avoidance technique, you load it into the platform and you get it, everyone has to use it. You have control over the hardware. So if you find that you can simplify the programming problem using some heterogeneous hardware for solving some problems, or you can get better performance, you have control over the hardware and the environment. You have massive scale, so the statistical models like the ones I showed, the ones we use in Recon and in Aviso, improve when you have more data. You’ll have lots of data if you’re looking at a Cloud system. We can also make systems that do something similar to what Aviso does, and that is to make changes to the way that they behave experimentally. Some of those changes might turn out bad but if one of those changes turns out to be really good then that change can be shared with all the other systems on that platform. I think that’s a really cool idea. Finally, I think it’s interesting to look at how we can use a model of behavior built in one system and transfer the information over to another system. So what can we learn about Windows by looking at Linux, for example? Are there things at some level of abstraction that will transfer usefully between those systems? I think that there are. It’s going to require changes to the system.
We’re going to need new primitives for introspecting into the behavior of the system, things related to concurrency like coherence and sequences of events potentially from different threads. Exposing that up in an efficient way to run time layers or to the developer is going to be a challenge, and so is energy, which is a problem on everyone’s mind especially in the mobile space. I want to look at new mechanisms for failure avoidance. You’ll notice that there was no hardware support in Aviso, but one of the challenges that Aviso has to overcome is the overhead of enforcing those constraints. I think with hardware support we could do a better job of that. So I think changing lower levels of the system, whether actually in hardware or not, is an interesting thing to look at when dealing with failure avoidance, also looking not just at concurrent programs but at sequential programs as well. I also think another way to deal with this problem, the problem that programming is so difficult, is to just do the programming for the programmers, so looking at synthesis techniques. I’m working on a project with some natural language processing researchers right now where we mined a bunch of code from the internet and we’re looking at ways of incorporating that into an active learning based code synthesis engine. The last idea I want to talk about is that power failures impact your reliability. If you have a platform that experiences power failures often, that’s a reliability issue. So energy efficiency is a form of being reliable. I’m especially interested in this area in the domain of small and unpowered devices. Intel has a little device called a WISP and this was developed in collaboration with several people from academia. It’s a very interesting device because it doesn’t have a battery. The way that it powers itself is by harvesting ambient radio frequency energy, charging a super capacitor, and as the super capacitor discharges it does a little bit of computation. That’s a really interesting platform because it requires interruption tolerance during the execution. Power failure goes from being the once-in-a-while event where someone kicks the plug out of the wall to being something that happens maybe ten or a hundred times every second, depending on the size of the capacitor and the rate of the computation. That fundamentally changes the way that you design what is an operating system. How do you program devices like this? Maybe we want to treat power failures as recoverable exceptions. What would be the system layers that we require to do that? So I think this is really an interesting problem to look at, especially as these devices find use in more ubiquitous computing applications. I also think it’s interesting to look at ways of avoiding power failures by, for example, trading off energy related failures against program reliability mechanisms. A program reliability mechanism is like a null check. If you can elide a null check to save enough energy to keep the system alive, you might want to do that. But you only want to decide to do that when it’s really important. So you have to have some introspection on how much energy remains, and how you can make that tradeoff dynamically is an interesting question. I’m going to skip this last bullet and just point out that there’s lots of cool applications for this stuff with people working in health and environmental sciences; especially in the northwest we have lots of forestry and water research.
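As a speculative illustration of treating a power failure as a recoverable exception, here is a sketch that checkpoints volatile state before the capacitor dies and resumes after reboot. The file-based checkpoint, the energy read, and the threshold are stand-ins for illustration, not the WISP’s actual system software.

    # A speculative sketch: checkpoint before an impending power failure, resume after reboot.
    import json
    import os

    CHECKPOINT = "state.json"        # stands in for nonvolatile memory on the device
    ENERGY_LOW_WATERMARK = 0.2       # assumed threshold signaling an impending power failure

    def read_energy_level():
        # On real hardware this would sample the capacitor voltage; here it is a stub.
        return 1.0

    def checkpoint(state):
        with open(CHECKPOINT, "w") as f:
            json.dump(state, f)

    def restore():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)
        return {"i": 0, "total": 0}

    def main():
        state = restore()                    # after a reboot, resume from the last checkpoint
        while state["i"] < 1000:
            state["total"] += state["i"]     # one small unit of computation
            state["i"] += 1
            if read_energy_level() < ENERGY_LOW_WATERMARK:
                checkpoint(state)            # treat the looming power failure as recoverable
                return                       # the device browns out; work resumes on the next charge
        checkpoint(state)

    main()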
There’s lots of interesting health applications that would be relevant to a company like Microsoft, especially working with these small devices and how they’re programmed and things like that. So I think there’s a lot of really interesting and visible opportunities for collaboration and applications there. Okay, so that’s my talk. There’s a big list of collaborators that I’ve worked with over the years at UW, as well as several people from Microsoft Research and HP Labs and IBM. I really appreciate you giving me your attention and asking so many questions. I’ll take more questions in the last five minutes if there are any. >> Jim Larus: Alright, are there any more questions? I didn’t think so, ha, ha. >> Brandon Lucia: Cool. >> Jim Larus: Thank you very much. >> Brandon Lucia: Thank you very much. Yeah this is great. [applause]