>> Jay Lorch: [laughs] Okay, welcome, everyone, to Colin Scott. He’s no stranger to the area; he went to
the University of Washington for his undergrad; and although he’s a PhD student now at Berkeley, he
spends most of his time here, collaborating with people at UW and living in our glorious area. He does
research on networks and distributed systems, and ‘cause distributed systems are notoriously buggy, he
spends a lot of his research on ways to find bugs, and understand them, and ultimately fix them. And he'll
tell us about some of his research in this area now.
>> Colin Scott: Thanks, Jay. Yeah, so like Jay mentioned, I work on systems and
networking. I also pursue research that’s at the intersection of software engineering, and programming
languages, and formal methods with systems—distributed systems. So part of the reason that I’m
excited about Microsoft Research is that there’s an opportunity to collaborate with all four of those
groups, and you have phenomenal groups of people working in those areas. Okay, so what I'll be discussing
today is my dissertation work; I also have some other work that I’ll briefly touch on at the very end, but
this is basically my dissertation.
Okay, so I’ll just go ahead and jump in. When a software developer receives a bug report for a
distributed system, they typically start their debugging process by looking at the console output—or the
log files—for each machine in the cluster, and their intent here is to try and understand what events in
the execution caused the bug—triggered the bug. Now, if the system was running for a long period of
time, these log files could be quite large; if it’s running for several hours, they could be multiple
gigabytes. So on a lucky day, the developer might immediately find some obvious piece of diagnostic
information—maybe like a stack trace—that immediately tells them what is the root cause. What is the
line of code that is causing my problem? Or maybe they can just look at the last few events in these log
files and then use their intuition to try and understand what was actually causing the bug. But often,
developers are not so lucky; in these cases, they need to find the events throughout the execution—not
just at the end—that were responsible for getting the system into that unsafe, buggy state. Okay, so
effectively, what they need to do is filter out the events in this execution that were not relevant—there
might be many thousands of extraneous events—so that they’re left with the handful of events that are
responsible for triggering the bug, that help them understand what actually went wrong. So you can
imagine that this whole process could be very time-consuming.
So more generally, software developers spend a significant portion of their time on debugging—some
form of debugging—at work. According to one study—I think this was a study of
Microsoft product groups, just a random sampling of product groups—they found that forty-nine
percent of time at work was spent on doing some form of debugging, okay? Now, some portion of that
time is spent understanding: what were the environmental conditions that led the system to trigger the
bug in the first place? And then, some other portion of that time is spent coming up with a fix to the
code so that future executions don’t exhibit the same problem, ‘kay? So our goal is to allow developers
to focus most of that effort—most of their effort—on that latter part—coming up with a fix to the
code—instead of first having to understand what were the conditions that trigger the bug in the first
place, okay? The way that we achieve our goal is by identifying—thanks—what we call a minimal causal
sequence of events that triggers the same bug as before. And what I mean here is just that if we were
to execute just this minimal sequence of events—instead of the entire execution—we would end up
with the system exhibiting the same bug as before, okay? Now, the key idea here is that the smaller the
execution, the easier it should be for a human to understand—the less time it should take for a human
to understand what was going wrong. And in general, human time is much, much more expensive than
machine time, so the value that we’re adding here is that we’re saving human time. Now, this is
certainly true of my experience in debugging—I mean, it’s also corroborated by research in the field of
psychology—that smaller event traces should be easier to understand.
So for the rest of this talk, I’ll start by defining a simple model of distributed systems—a computational
model—to give us intuition for what exactly we’re trying to achieve; then, we’ll look at how we obtain
faulty executions using our tool, DEMi—the distributed execution minimizer. We happen to use fuzz testing—randomized concurrency testing—to find our bugs, but you could imagine taking production executions, feeding them into our tool, and minimizing those as well. For the core of the talk, I'll discuss how we perform minimization once we're given
one of these faulty executions; and then lastly, I’ll look at how well this works in practice before I
conclude.
Okay, so you can think of a distributed system as just a collection of N single-threaded processes, and each process is a little automaton; it has unbounded memory—it's essentially a little Turing machine—it starts in a known initial state; and then it changes states deterministically according to some transition function, which I'll describe next. Yeah?
>>: Is there anything special about single-threaded? You mention it’s a single-thread process.
>> Colin Scott: Well, conceptually, if you have a machine that's running multiple threads, you can treat each thread as a separate process, as long as there's no shared state between the threads. At the end of the talk, I'll discuss how you could adapt this to a multi-threaded system, but essentially, the key insight is that message passing—this is a message-passing model—is equivalent to shared memory. If you interposed on reads and writes to shared memory, you could treat it in exactly the same way as a message-passing system, yeah.
Okay, right. So you can think of the network as just a giant buffer; it’s a buffer that keeps all messages
that have been sent from one process in the cluster but not yet delivered to the … to its recipient, okay?
Now, the execution takes place one step at a time. Conceptually, we’re going to linearize all of the
events in this execution. In reality, these things—these processes—may be running concurrently, but
there’s always some partial order that we can define, so conceptually, it’s easiest to think of it as a linear
order. So at each step in the execution, we're going to pick one of the pending messages in the buffer to deliver—in this case, there's only one—and then, based on the contents of that message and the current state of the recipient, the recipient will go through some state transition based on its transition function, okay? So it goes through a state transition, and in the process, it will send zero or more messages to other machines in the cluster; does this make sense? And
note here that we can treat timers as a special kind of message; this is just a special kind of message that
we deliver back to the sender at some known, later point in the execution. So in this model, we assume
that the … none of these processes have access to a local clock—or at least, we interpose on that local
clock, okay?
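To make the model just described concrete, here is a minimal sketch in Scala (the language of the systems tested later in the talk); all of the names and types here are illustrative stand-ins, not part of DEMi's actual API:

```scala
// Minimal sketch of the computational model: each process is a deterministic
// automaton that, given its current state and a delivered message, produces a
// new state plus zero or more messages to send. Timers are just messages a
// process sends to itself, delivered at some later step.
case class Message(src: String, dst: String, payload: Any)

trait Process {
  type State
  def initialState: State
  // Deterministic transition function: (current state, delivered message) ->
  // (new state, messages to place into the network buffer).
  def transition(state: State, msg: Message): (State, Seq[Message])
}

// The "network" is just a buffer of messages that have been sent but not yet delivered.
class NetworkBuffer {
  private var pending = Vector.empty[Message]
  def send(m: Message): Unit = pending :+= m
  def pendingMessages: Vector[Message] = pending
  def deliver(m: Message): Unit = {
    val i = pending.indexOf(m)
    if (i >= 0) pending = pending.patch(i, Nil, 1) // remove the chosen message
  }
}
```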
Now, a step in this execution could also be external, and by external, I just mean that it's outside of the control of the processes themselves. A couple of examples: if we assume that there is some process outside of the system that we're not modelling—for example, a client—we might have that client send a message and put it into the message buffer; or we might have a new process be nondeterministically created; or we might force one of these processes through a crash-recovery event—so we force it back to its initial state. Now, note here that I
didn’t explicitly model fail-stop crashes where a single machine crashes, because that’s equivalent to a
schedule where we just partition that process from all of the other processes in the system. So this is
sort of … that … fail-stop crashes are sort of implicit in our model.
>>: How ‘bout message loss or duplication?
>> Colin Scott: So a message loss is equivalent to having the message just stay in the buffer
indefinitely—this is an infinite buffer. Duplication, we could model that; we don’t happen to model that;
but it would be a simple change to the model. [coughs] Okay, sorry. So a schedule—which we denote
with this symbol here, tau—is just a sequence of these events, either external—which I’m showing here
in dark green—or internal—which I'm showing here in blue. And a valid schedule is just a schedule such that, if we were to execute it, then at each step the next event to play would be currently pending in the buffer—so it's a valid schedule for us to execute. Now, recall that we're trying to find bugs here, so we assume that we have access to some
invariant, and this is a safety condition—some predicate p—that’s defined over the local variables of all
of the processes, okay? And this predicate will tell us whether or not the system is currently in a safe or
unsafe state. So an example of an invariant violation might be violating linearizability when you're running raft consensus or some other consensus protocol. Okay, now ultimately, what we're looking for is some faulty execution; this is just a particular schedule such that, if we were to execute each event in the schedule, we would end up with the system violating that invariant. Does this make
sense?
Okay, right, so now that we have that sort of formalism in hand, we can now get an intuition for exactly
what we’re trying to achieve. We assume that we’re given one of these faulty executions—some
schedule, tau—such that we … if we execute it, we end up with an invariant violation. And now, what
we’re looking for, informally, is a locally minimal reproducing sequence, tau prime. And just to unpack
what I mean by this, by reproducing, I just mean that if we were to execute tau prime, we would also get
the same invariant violation; tau prime should be at least as short as the original and should not contain
any external events that were not also in the original. And then, by locally minimal, I just mean: if we were to remove any single external event, e, from tau prime, there would not exist some other schedule, tau double prime—containing the same external events minus e—such that tau double prime also violates p. Yeah?
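Written out, the goal just stated is roughly the following (my notation: ext(τ) denotes the external events of a schedule τ, and "violates p" means that executing the schedule ends with the safety predicate p being false):

```latex
\text{Given a faulty schedule } \tau \text{ that violates } p, \text{ find } \tau' \text{ such that:}\\
\begin{aligned}
&\text{(reproducing)}     && \tau' \text{ violates } p,\\
&\text{(no new inputs)}   && |\tau'| \le |\tau| \ \text{and}\ \mathrm{ext}(\tau') \subseteq \mathrm{ext}(\tau),\\
&\text{(locally minimal)} && \forall e \in \mathrm{ext}(\tau'):\ \nexists\, \tau'' \text{ with } \mathrm{ext}(\tau'') = \mathrm{ext}(\tau') \setminus \{e\} \text{ such that } \tau'' \text{ violates } p.
\end{aligned}
```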
>>: [indiscernible] all the states of the process are known, or …?
>> Colin Scott: Sorry—we assume that, at any point in the execution, we can just halt the state of the system—halt the execution—take a snapshot, and
then the user is going to define—the application developer, who has some domain-specific knowledge
about what kinds of bugs they’re looking for—they define this predicate over the local variables
contained in that snapshot, yeah. So there is some work that the developer has to do for us.
>>: I have a question.
>> Colin Scott: Yeah? [coughs]
>>: When the … when your message handler sends a whole bunch of messages out into the network …
>> Colin Scott: Right.
>>: … is that treated as an atomic step in your computation model?
>> Colin Scott: Yes, and that's actually crucial. So I'll discuss the implementation details a little more later, but for the systems that we interpose on, we need a well-defined point where the process blocks—like you said, an atomic start and an atomic end. This way, we can ensure linearizability when we do our minimization.
>>: But what is worrying me is that that doesn’t really match what happens in reality, right? I mean, in
reality, you could have … you could do a send …
>> Colin Scott: Right.
>>: … and that can cause a de-queue immediately—right—before the other sends have happened, and
that can cause that message handler to start executing, if you know what I mean?
>> Colin Scott: Yeah, you’re absolutely right. So I think for now—for this part of the talk, where I’m just
kind of defining it clearly—let’s assume that we’re alway … we’ll always be able to interpose on any
point. Like you said, maybe there would be some de-queue that we don’t know about. Let’s assume for
now that we’re going to interpose on it; we sort of play God; and we get to interpose on any
nondeterministic event.
>>: I see.
>> Colin Scott: And there’s a lot of engineering effort that I had to do to make sure that, in the systems
we were testing, that didn’t happen.
>>: It’s just an engineering question; it’s not something foundational to your model here. I see.
>> Colin Scott: Exactly, and later in the talk, I fully admit that it may be a lot of engineering effort to do
… to get this to work in practice. So later in the talk, I’ll sort of touch on the engineering aspects of this,
yeah. Okay, right, so we focus on removing external events in this formulation because external events are the first layer of abstraction that developers think about when they're reasoning about the behavior of their system. After we've minimized external events, we can go ahead and try to minimize the internal events remaining in our schedule, tau prime. And
conceptually, what this just means is that we’re gonna try keeping some of those messages in the
buffer, and not delivering them, and seeing if we can still trigger the invariant violation. Does this make
sense?
>>: How will you define it in terms of removing exactly one? You can imagine removing more than one.
>> Colin Scott: Yep, you’re exactly right. Yeah, so that’s … it’s an excellent point. Unfortunately, if you
try to remove more than one, that’s equivalent to enumerating the power set …
>>: Yes.
>> Colin Scott: … and that's a problem. So one nice thing about the algorithm that we use is that if we had infinite time, it would give you a provably minimal result, but of course, we don't have infinite time, so this is why we're looking for a local minimum here. Okay, and in general—by the way—if you get stuck in some local minimum, you could run it again, induce some sort of nondeterminism, and we might find some other local minimum that might be smaller.
>>: [indiscernible] local minima.
>> Colin Scott: Exactly, yep. Okay, right, so now onto how we obtain these faulty executions. [coughs]
You can think of a distributed system just as an application process that runs on each machine in the
cluster; I’m showing three here. That application process calls into some messaging library—some RPC
library—which, in turn, calls into the operating system to have the bytes sent across the wire. So what
we do is we interpose on this messaging library; in our case, we interpose on a particular library called
akka, but you could imagine applying the same sort of ideas to a different messaging library. We interpose on this library such that whenever an application sends a message, we intercept that message before it's sent to the operating system; we place it into a buffer that we
control. And now, at this point, we essentially get to play God, so we get to choose arbitrarily which of
these pending messages we want to deliver next. So we’re going to essentially linearize the events in
this execution. In this case, there’s only one choice to be made, but there might be other choices. So
we record our choice to disk, including the contents of that message, and then we go ahead and allow
this message to be delivered to the recipient. Now, based on the content of the message and the
recipient’s current state, it’ll go through some state transition and then send some zero or more
messages to other machines in the cluster. Now again, we could make some choice, or based on some
probability, we might also decide to inject some external event; let's suppose that we decide to have this process go through a crash recovery. Now, because we precisely control when each process gets to execute or not, we can—at any point in time—stop the state of the world—stop the execution—take a snapshot of the local state of each process, and then run our invariant over it, okay? And we'll just keep doing that until we find our bug—
some invariant violation we care about.
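The recording loop just described can be sketched roughly as follows, reusing the toy Message and NetworkBuffer types from the earlier sketch; StatefulProcess, ExternalEvent, and the probabilities here are hypothetical stand-ins for DEMi's real interposition machinery, not its actual interfaces:

```scala
import scala.util.Random

// Hypothetical interfaces for the sketch.
trait StatefulProcess {
  def localState: Map[String, Any]         // the variables the invariant inspects
  def handle(msg: Message): Seq[Message]   // deterministic transition; returns messages to send
}
trait ExternalEvent {
  def applyTo(procs: Map[String, StatefulProcess], net: NetworkBuffer): Unit
}

object FuzzLoop {
  // Repeatedly pick a pending message to deliver (or occasionally inject an external
  // event), record the choice, and check the safety invariant after every step.
  // Returns the recorded schedule if a violation is found within maxSteps.
  def run(net: NetworkBuffer,
          procs: Map[String, StatefulProcess],
          external: Seq[ExternalEvent],
          invariantHolds: Map[String, Map[String, Any]] => Boolean,
          maxSteps: Int,
          rng: Random = new Random()): Option[Vector[Any]] = {
    var trace = Vector.empty[Any]          // the recorded schedule (written to disk in practice)
    var remaining = external
    for (_ <- 1 to maxSteps) {
      if (remaining.nonEmpty && rng.nextDouble() < 0.1) {
        val ev = remaining.head
        remaining = remaining.tail
        trace :+= ev
        ev.applyTo(procs, net)             // e.g. a client request or a crash-recovery
      } else if (net.pendingMessages.nonEmpty) {
        val msg = net.pendingMessages(rng.nextInt(net.pendingMessages.size))
        trace :+= msg
        net.deliver(msg)
        procs(msg.dst).handle(msg).foreach(net.send)
      }
      // Because we control all delivery, we can pause here, snapshot each process's
      // local state, and evaluate the invariant.
      val snapshot = procs.map { case (name, p) => name -> p.localState }
      if (!invariantHolds(snapshot)) return Some(trace)
    }
    None
  }
}
```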
Okay, so once we're given one of these sequences recorded with our tool, we're gonna go ahead and try to minimize it. To make the discussion more concrete—I assume most people probably know what consensus is trying to achieve—I'm gonna briefly go over the mechanics of how raft consensus works; raft is a particular kind of consensus protocol. So in raft, we have some number of processes in the cluster, and the first thing that they do is vote in order to decide on some leader. So in this case, we have Putin over here as our leader.
>>: Why debug it?
>> Colin Scott: What’s that? [laughter] Exa … that’s the joke, exactly. Putin doesn’t really need
consensus; he doesn’t need a quorum, but in raft, he would need it; he would need a quorum. Okay, so
let’s suppose that he received a quorum of votes; now client requests will be directed to the leader; the
leader will mark in his log that he might commit this entry; he hasn’t yet committed this entry. Now he
tries to replicate that entry to the rest of his peers; they also log that they might commit that entry; and
then, they acknowledge back to the leader that they have received the entry. Now, at this point, the
leader knows that a quorum of the processes in the cluster have received that entry in this particular
slot, so it goes ahead and marks that it has committed this entry, and then it tells the rest of its peers
that they can go ahead and commit that entry as well. Now of course, the hard part about consensus is
that there might be failures or arbitrary message delays.
So what I’m showing here is a particular bug that we found and minimized using our tool. So this is a
bug in an implementation of that raft consensus protocol. So the bug here—the invariant violation—
was that at the end of this execution—at this point here—we had two processes in this cluster who both
believe that they were leader in the same election term. So in raft, that’s a very, very bad state to be in,
because those two leaders might overwrite each other’s log entries; they might violate the linearizability
… the main linearizability constraint that raft is trying to achieve, okay? So the two leader … the two
processes here, by the way, were the green process and the orange process, and if we look at this
minimized execution, you can almost, sort of, immediately tell what was going wrong. So the green
node here requests a vote from the red node, and then the red node grants that vote; and then later on, the green node requests a vote from the very same red node, and the red node again grants the vote. So the bug here—and by the way, the same thing happens for the orange and blue nodes—the bug here was that we were accepting duplicate votes from the very same peer. In other words, the actual implementation had an integer that got incremented every time we received a vote-granted message, whereas what you really wanted was a hash map keyed by which peer you got the vote from, so that duplicate votes from the same peer aren't counted, okay? And by the way, I gave a talk at Salesforce where my host was Diego Ongaro, the guy who invented raft, and he said that in his implementation of raft, they had the very same bug. So apparently, even Diego himself has had these kinds of implementation issues. Now, in retrospect, this
bug is easy to figure out, but the original execution for this—this initial fuzz test that we generated—was
something like fifteen hundred events.
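The bug as described maps onto roughly the following shape of code; this is a paraphrase for illustration, not the actual raft-in-Scala implementation that was tested:

```scala
// Buggy version: a bare counter, so a duplicate VoteGranted from the same peer
// is counted twice, and a candidate can declare itself leader without a real majority.
class BuggyCandidate(clusterSize: Int) {
  private var votes = 1                       // the candidate votes for itself
  def onVoteGranted(fromPeer: String): Boolean = {
    votes += 1
    votes > clusterSize / 2                   // true => becomes leader
  }
}

// Fixed version: remember *which* peers granted a vote (the "keyed by peer" idea
// from the talk), so duplicate votes from the same peer are ignored.
class FixedCandidate(clusterSize: Int, self: String) {
  private val voters = scala.collection.mutable.Set(self)
  def onVoteGranted(fromPeer: String): Boolean = {
    voters += fromPeer
    voters.size > clusterSize / 2
  }
}
```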
Okay, so what we're looking for is this minimal output—something that's pretty easy for a developer to understand—but we're not initially given that minimized output, right? The initial execution we're given might be thousands of events; now we'd like to find that minimal output. So how
do we do this? One straightforward approach might be to just simply enumerate all possible
schedules—all possible interleavings of external events and internal events—that are at least as short as
the original. So the original is some point in this schedule space, and we’re going to enumerate all of
them, and we’re going to execute every single event, and then at each step, we’re going to check the
invariant violation.
>>: Yeah … that's O of n factorial many schedules.
>> Colin Scott: Yeah, oh, very good. [laughter] Yeah, sorry.
>>: I’m wondering: why did M get to be so big in the first place? See, you said you got a fifteen hundred
length trace for this thing; why didn’t you just restart and only try traces of length two, three, four, and
five, and so on?
>> Colin Scott: Yeah, that’s a great question. So the point is that if you do breadth-first search when
you’re finding bugs, then your bugs are minimal … your executions are minimal by construction, and
that is a good idea; we actually tried that; it turns out that it takes a very long time to get to interesting
lengths. So most bugs happen at a length of at least, like, thirty, say, and even getting to executions of length thirty with breadth-first search takes a very, very long amount of time—like we ran it for
twenty-four hours, and it still hadn’t got past, like, length five. So we just found that—I mean, in
general—that this space of possible interleavings is so large that we—in our experience—we found that
randomized testing was much more effective for finding bugs. Incidentally, the MoDist—I don’t know if
you guys know MoDist; it was done by MSR here—they also did an evaluation; they’re doing something
very similar to us in terms of finding bugs; what they found was that systematic enumeration often gets stuck exploring only some small part of the schedule space. So when they combined systematic exploration with randomness, they found bugs much more
effectively. Now of course, the problem is: when you’re using randomness, now you have to minimize,
which is where our technique comes in. Yeah?
>>: I have another quick question. It’s about the engineering of your system, really.
>> Colin Scott: Right.
>>: So you mentioned that there is this invariant that can refer to the states of all the processes. So how did you actually implement that? I mean, the state of a process—there could be a stack, and it's just all sorts of heap and stuff.
>> Colin Scott: Right.
>>: How does somebody even write that, and how do you know that this thing that you wrote in the
invariant refers to this bit of memory?
>> Colin Scott: Yeah, good question. So in our case—in the case of raft—our job was made much, much
easier, because Leslie Lamport did that work for us; he said, “Here are the well-defined safety conditions
that we care about.”
>>: Okay, so you can just bake it in.
>> Colin Scott: Yeah, so … and then each process has the local variables that are defined by Leslie
Lamport or Diego, and then we just examine the values of those local variables and see if they match
what the safety conditions prescribe. Now, I totally agree that there's an issue there—in most systems, unlike Leslie Lamport's work, we don't write our specs first; we actually just write a
bunch of code, and then maybe, we realize that we want something else, so we change our code, and
it’s just really hard to actually understand what it is that we actually really want. So I’ll get to Ratul in a
second; I have just one idea there. Michael Ernst at UW did his PhD dissertation on this thing called
Daikon; it’s inferring likely invariants. So this is one way that you could help developers sort of …
>>: No, I think that you read too much into my question. I’m just talking …
>> Colin Scott: [laughs] Oh, okay, great.
>>: … basics. Well, I’m making sure …
>> Colin Scott: Yeah.
>>: How do you write code that simultaneously looks over the state of multiple processes?
>> Colin Scott: Well, a distributed snapshot is a well-known technique—in our case, I don't actually need Chandy-Lamport—but a distributed snapshot, conceptually, just takes a consistent snapshot of all the local variables. So as long as you can look at the local variables of one process—you say, "I know the locations in memory that I want to examine"—then you can just use that same code to run across multiple processes. So there are separate questions: how do you obtain the
snapshot, and then how do you run the invariant? I think Ratul was first, and then I’ll get …
>>: So one thing about these invariants that we're running here: some invariants may get violated, like, transiently.
>> Colin Scott: Yeah, good … yeah.
>>: And so does your code know when to actually evaluate its own invariants?
>> Colin Scott: So if your safety condition is violated transiently … the definition of a safety condition is that it should never be violated. So if it's violated transiently, one answer—one snarky answer—might be that you just haven't written your invariant very well. That you sh …
>>: I think we're maybe talking past each other ... so I was just thinking, like, take a distributed routing computation; let's say a set of processes are trying to compute shortest paths.
>> Colin Scott: Right.
>>: Right, so I think: before the whole thing converges, like, paths are not short …
>> Colin Scott: Right.
>>: … and your invariant is correctness: that—you know—after things converge, you should basically get shortest paths.
>> Colin Scott: So what you're actually describing is a liveness condition. I think that's the challenge.
>>: So it’s a safety condition, but it still the … you need to be defined as after the [indiscernible] that is
your safety … it’s part of your safety property. After you converge, then paths are short, right? So
before you converge, you just [indiscernible]
>>: But how does … I’m just wondering, like, how does this engine know that things have converged,
and now’s the right time to check it, versus you shouldn’t have checked it before this has happened? Or
these types of invariants are not checkable? I’m trying to …
>> Colin Scott: No, no, I think they are checkable. I think in practice, like, it … people do—developers—
do exactly what you said; they define some sort of threshold. They say, “If five seconds have passed,
then the assertion should hold.” And the problem with that—like you said—is that it can be transiently
violated. If the execution times are slightly different, then all of a sudden, we have flakiness. I don’t
know. I don't have a great answer. I mean, one thing you could do is empirically build a distribution of how long convergence takes and then pick some time in that distribution that would be reasonable. I mean, it's a hard question. I think what you really want is a liveness condition, but unfortunately, liveness involves infinity, and you can't execute systems infinitely long, so … there is one answer, though. There's a nice paper called MaceMC; I don't know if you've heard of it. The basic idea is that you can convert liveness conditions into safety conditions using some heuristics. So they do random walks of the search space, and they say, "If we've done n random walks, and the liveness condition still doesn't hold, then we'll consider that a liveness violation." And that might be one way of getting rid of this transient issue you mentioned. I don't know if that … are you satisfied?
>>: We can continue to talk offline later, perhaps.
>> Colin Scott: Yeah. Sure.
>>: [indiscernible] You said you could do the minimization in factorial time. Before, you said that you—
pity for you—if you were given infinite time, you could do it, but actually, you …
>> Colin Scott: Yeah, you’re right. I meant factorial time. That’s right. Yeah, you’re absolutely right.
Which is still … could be potentially years. Right, okay, so one observation that others have made is that
many of the schedules in this schedule space are commutative. And I’ll explain what this means, but
incidentally, I don’t know if … many of you probably know Patrice Godefroid. This was actually his
dissertation work on this topic. So the basic idea here is: suppose we have two pending events, i2 and i3—they're both in the message buffer at the same time—and let's further assume that they have different destinations—so they're destined for different processes—and they are concurrent with each other according to the happens-before relation. So essentially, they're happening on different
machines. Now, if you consider the state of the overall system at some step, n, before we execute
either of those, and then we decide to deliver i2 and then deliver i3, we’ll actually end up at exactly the
same state as if we had instead delivered i3 before i2. So we know that any schedules that only differ in the order of i2 and i3 will end us up in the
exact same state. Now, the algorithm that we use to reason about this kind of commutativity is called a
Dynamic Partial Order Reduction—again, this was Patrice. Conceptually … sorry, I guess I’ll explain how
this works in a few slides, but essentially, what it does for us is it allows us to only explore one of those
commutative schedules, and the rest of the equivalent schedules we get to ignore.
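The commutativity condition just described amounts to an independence check along these lines; this only captures the flavor of the reduction, not Godefroid's full DPOR algorithm, and the vector-clock representation is an assumption of the sketch:

```scala
object Commutativity {
  // Two pending deliveries commute if they target different processes and neither
  // happens-before the other; in that case only one of the two orderings needs
  // to be explored, since both lead to the same state.
  case class PendingDelivery(dst: String, clock: Map[String, Int]) // vector clock of the send

  def happensBefore(a: Map[String, Int], b: Map[String, Int]): Boolean =
    a != b && a.forall { case (proc, t) => t <= b.getOrElse(proc, 0) }

  def commute(i2: PendingDelivery, i3: PendingDelivery): Boolean =
    i2.dst != i3.dst &&
      !happensBefore(i2.clock, i3.clock) &&
      !happensBefore(i3.clock, i2.clock)
}
```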
Okay, so this is great. We’ve reduced the size of the schedule space by some factor k, where k … again,
I’m being somewhat sort of loose in my notation, but k is like the size of each equivalence class. But in
general, this is still intractable. So our approach is to prioritize the order in which we explore schedules
within this schedule space, and the idea here is that we’re given … the user is going to give us some fixed
time budget—they’re only willing to wait so long before they hit control C—and then we’re going to try
and make as much minimization progress as we can within that time budget. Okay now, some of you
may observe—may have already observed—that for any prioritization function that we choose, an
adversary could construct the program under test, or even just the initial
execution, such that whatever prioritization function we chose makes very, very little progress within
our time budget. And the reasoning behind this objection is actually absolutely sound, but our
conjecture is that the systems that we care about in practice are not adversarial—in particular, they
exhibit some set of program properties, or they adhere to some constrained computational models that
make them amenable to the kinds of prioritization functions that we’re gonna define. Does this make
sense? So the research agenda is really trying to define what are these program properties that we care
about in practice and then defining prioritization functions for them.
So I’ll briefly go over the main program property that we assume. Consider a single process. You can
think of the single process as a state machine—some potentially infinite I/O automaton—and each
transition here is the event of receiving a message. Now, at each state in this state machine, we have
some set of local variables. I’m only showing two here—x and y—but you could imagine it might have
many more local variables. Now consider the cross product of all of the processes in the cluster. This is,
again, some state machine. This defines, sort of, the behavior of the overall system. Now, let’s assume
that two properties hold. One is that any given invariant—we might care about lots of different
invariants—but any single invariant is defined over a small subset of the process’s local variables. So we
only look at, maybe, x and not x and y. In this case … again, imagine that there are many, many variables
here. Another property would be that each event—each transition in this diagram here—only affects a
small subset of the receiver’s variables’ values. So in other words, when we receive a message, the
receiver is not going to flip all of its local variables, it’s only gonna flip some local variables that are
relevant to that kind of message, okay? Now, if those two facts hold, then it seems highly likely that the
initial execution—the initial execution is some path through this overall state machine here—that initial
execution will define a path that contains loops. And essentially, what we’re going to do is remove
those loops. So if we were to reduce this state machine to only the transitions that affect the invariant's local variables, then that state machine would contain loops, and we're going to remove those loops. Does this make sense?
Okay, now the challenge, of course, is that we don’t know … we treat the … we’re treating the system as
a black box, and because we’re treating it as a black box, we don’t know which local variables are
actually relevant or not, or which events are actually relevant or not. So our approach is to experiment with different executions in order to infer which of those local variables or events are relevant. Does this make sense? Okay, and the
key insight here is as follows: we know one very important piece of information, which is that the
original execution, when we execute it, caused the system to violate the invariant. So you can think of
this original execution as sort of a guide for how we can get the system to progress through its state
machine so that it ends up in this buggy state here. But again, like I said, there are gonna be some loops in the path through that state machine. So what we're going to do is
we’re going to selectively mask off some of these original events and see if we can still trigger the
invariant violation so that we can find a shorter execution.
Okay, so the way we do that is somewhat detailed; I’ll walk through it slowly. Recall that we’re trying to
minimize external events first, so just consider the external events from that original execution. Now
let’s suppose that we only consider the right half of these external events. We’re going to ignore the left
half. Now, the algorithm that we use to do this splitting between right half and left half is called Delta
Debugging. Conceptually, you can just think of it as just a modified version of binary search. Okay, so
now, [coughs] we’re just gonna consider these last three external events, and now, we’d like to know: is
there some schedule—some execution—containing just these three external events that would trigger
the same invariant violation? So the way that we find that schedule is as follows. We walk through each of the original internal events, and we check: is the message that we originally delivered at this point in the execution currently pending in the buffer in our current execution? If it is, we go ahead and deliver it. If we ever get to a message that is not currently pending, we'll just skip over it. And we'll keep doing this for all the messages that we delivered in the original execution, okay? And at the end, we just check: is the invariant violation still there? Now, let's suppose in this case
that it was; it was still there. What that means is that we can now ignore the first three external events.
Those are not necessary.
>>: Does this require re-execution?
>> Colin Scott: Yes. Yeah, exactly. We’re going to re-execute, starting from a known initial state, for
each of these schedules, yep. Okay, now we’re gonna proceed with the rest of our minimization. So
now we're gonna consider just the right half of the remaining external events, and again, we're gonna try and find some schedule. Now, let's suppose, in this case, that we didn't find the invariant
violation at the end. That doesn’t necessarily mean that we’re done. There could be some other
schedule containing just these last two external events that also triggers the same invariant violation.
So remember that algorithm I told you about earlier, DPOR. What DPOR will do is set
backtrack points at each step in this execution where there was some alternate, noncommutative choice
it might have made about which message to deliver next, okay? Now, what we’re going to do is
continue exploring these backtrack points until either we find some schedule that triggers the invariant
violation, or until we run out of our time budget for that subsequence, okay? Then we’ll continue this
until we finally produce our minimal output.
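Putting the pieces together, the loop just described looks roughly like the sketch below (again reusing the toy types from the earlier sketches; DPOR backtrack points, time budgets, and the application of the external-event subset itself are omitted, and the real delta debugging algorithm also tries complements and finer granularities):

```scala
object MinimizationSketch {
  // "Stay close to the original" replay: walk the originally delivered messages in
  // order, deliver each one for which an equivalent message (per the user-supplied
  // fingerprint) is currently pending, skip the rest, then check the invariant.
  // Returns true if the invariant violation was reproduced.
  def replayCloseToOriginal(originalDeliveries: Seq[Message],
                            net: NetworkBuffer,
                            procs: Map[String, StatefulProcess],
                            fingerprint: Message => Any,
                            invariantHolds: Map[String, Map[String, Any]] => Boolean): Boolean = {
    for (orig <- originalDeliveries) {
      net.pendingMessages.find(m => fingerprint(m) == fingerprint(orig)).foreach { m =>
        net.deliver(m)
        procs(m.dst).handle(m).foreach(net.send)
      }
    }
    !invariantHolds(procs.map { case (n, p) => n -> p.localState })
  }

  // Very simplified delta debugging over the external events: try the right half,
  // then the left half, and recurse into whichever still reproduces the bug.
  // `reproduces` is expected to reset the system to its initial state and replay.
  def ddmin(external: Seq[ExternalEvent],
            reproduces: Seq[ExternalEvent] => Boolean): Seq[ExternalEvent] =
    if (external.size <= 1) external
    else {
      val (left, right) = external.splitAt(external.size / 2)
      if (reproduces(right)) ddmin(right, reproduces)
      else if (reproduces(left)) ddmin(left, reproduces)
      else external
    }
}
```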
Now, there was one major detail which I swept under the rug here, which is that it’s not always
straightforward for us to compare messages in the current execution with messages from the original
execution, because we’re essentially modifying history. So as an example here, let’s look at one … we
have some message that we originally delivered on the left here. And now, in our modified execution,
we’ve got some pending message, which looks very similar except for one field, which is our sequence
number here. So let’s suppose, in this system, that the processes keep a sequence number; this is just
an integer that they increment by one for every message they send or receive. Now, we removed some
of the events, so now the sequence number has a lower value. But we, as humans, know that—for any bugs that don't involve the sequence number, which should be most bugs—we should be able to mask over that message field. We know that the value of that local variable should not affect whether or not we trigger the bug. So we know that we can mask
over that when we’re comparing the equivalency of these two messages. And by the way, the intuition
for this observation is exactly the same as the one I showed you earlier. This sequence number in the
message is reflected in some local variable at each process, and that local variable does not happen to
affect our vi … invariant. Yeah?
>>: Is it possible to infer this by analyzing the program …
>> Colin Scott: Yes.
>>: … to see some local var … the relation between local variables?
>> Colin Scott: Yes, that's exactly right. Another thing you might do—other than applying a program analysis—is try to experimentally infer which of these fields are nondeterministic. We don't do that. For our prototype, we just assume that the user is going to give us some fingerprint function which tells us which fields to ignore. But I think you're absolutely right; that would be an interesting avenue for future work. In general, we want to decrease the amount of engineering effort it takes to get our system to work.
Okay, so in the first phase, we’re gonna use this user-defined fingerprint function to choose the initial
schedule that DPOR executes. And then, in the second phase, what we’ll do is we’ll only match
messages according to their type. So by type here, I mean a class tag or an object type—a language-level class tag. And the intuition here is that messages that have the same type should be semantically similar to the original, except they're gonna differ in some of the values of the message fields. So again, we're trying to stay as close as possible to the original execution, except now, we're going to explore backtrack points whose messages only differ in the values of some fields. Does this make sense? Yeah, Ratul?
>>: So tell me something. So what exactly happens … so suppose you don’t … you want to drop
message with sequence number three and instead, go directly to one with five; your matching will say
they’re equivalent, but when you inject it into your system, is the sequence number three or five?
>> Colin Scott: In the example I gave, we’re trying to play this message on the left here, but it doesn’t
exist in our pending buffer. There’s only one message that matches; it just happens to have a sequence
number of three. So the user-defined function tells us that these two are equivalent. Now, suppose that there was some ambiguity—there was some other message that had the same type—in that case, we're going to explore a backtrack point. Sorry, I have too many animations. So if there's ambiguity in which message we might choose—there are multiple pending messages that match—we'll try one of 'em now, and then backtrack, and try the other one in its place later, if that makes sense.
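A user-supplied fingerprint function of the kind just described might look like the following; AppendEntries and its fields are hypothetical examples, and the real message types and masking rules are application-specific:

```scala
// Hypothetical message type with a nondeterministic field (seqNo), plus a
// fingerprint that masks it, so messages differing only in sequence number are
// treated as equivalent when matched against the original execution.
case class AppendEntries(term: Int, leaderId: String, seqNo: Int, entries: Seq[String])

object Fingerprints {
  def fingerprint(msg: Any): Any = msg match {
    case AppendEntries(term, leaderId, _, entries) =>
      ("AppendEntries", term, leaderId, entries)   // seqNo deliberately dropped
    case other =>
      other.getClass.getSimpleName                 // second phase: match by type only
  }
}
```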
Okay, so the last observation we make is that often, the contents of external messages affect the
behavior of the system. And if we try and minimize the contents of those external messages, we can get
better minimization. So I’ll go over a brief example here. Let’s suppose that our system assumes that
we have some bootstrap message. This bootstrap message tells each process what are their peers. So
at the beginning of time, they don't know who their peers are, and in our execution, we happen to send them a bootstrap message with a list—a, b, c, d, e—which tells them the cluster membership. Now, let's suppose that we're going to mask off some of these values of that list. We'll
mask off one, and then we’ll mask off two. At some point, the quorum size will actually change; it
changes from three to two, and now, the remaining processes in this cluster have to wait for fewer
messages before they can achieve quorum. So if we minimize these external contents—which we
control—we can in fact get better internal minimization by doing this. Does this make sense? Okay.
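As a quick worked example of the quorum effect just described (this is just standard majority-quorum arithmetic, nothing DEMi-specific):

```scala
object QuorumExample {
  // Quorum is a strict majority of the cluster.
  def quorum(clusterSize: Int): Int = clusterSize / 2 + 1

  val full   = quorum(5) // 3: with peers a, b, c, d, e a candidate needs 3 votes
  val masked = quorum(3) // 2: mask off two peers and only 2 votes (fewer deliveries) are needed
}
```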
>>: Is it the saving [indiscernible] I mean, by changing the quorum size, you might miss a bug, or—
right—maybe the bug was only happening—you know—with f equals two, because—you know—you
sort of hard-coded a quorum of looking for two messages only or something.
>> Colin Scott: Yeah.
>>: And they won’t find the schedule.
>> Colin Scott: Yeah, so we’re always doing this experimentally. So in that case—in the case that you
described—if we tried … we had the original execution, which triggered the bug, and which had five
entries; we try masking off one of the contents; and everything we try doesn’t trigger the bug. We try
masking off two; everything we tried doesn’t trigger the bug. So we say … so we put our hands up, and
we say, “Okay, fine, we’ll go back to five.” Yeah. Okay, so just to … yeah, go ahead.
>>: Yeah, so back to that example—I think just a follow-up to Christoph's original question. So there's the process of exploring this state space; you can do that in a very smart way; and you argue that—you know—combining that with randomness actually helps exploring a bigger space.
>> Colin Scott: Right.
>>: But for this particular one, I think you'd probably be better off having—you know—three …
>> Colin Scott: From the be … very beginning.
>>: Yeah, we will probably find the same bug …
>> Colin Scott: Yeah.
>>: … if it actually shows up.
>> Colin Scott: I totally agree. So if you’re just doing randomized testing, you should just start with a
small cluster size, and then increment it. However, if you’re trying to minimize production executions,
you don’t necessarily have control over the size of the initial cluster. So this technique might be more
useful for a production run, yeah.
>>: But hang on one second. I mean, your technique fundamentally requires this ability to replay
executions.
>> Colin Scott: Yeah.
>>: If you have generated a fault from a production execution, how are you going to do replay?
>> Colin Scott: That's a great question. I have a section in my dissertation which discusses how you might do that. So you need a couple of things; you need a partial order—you might need a Lamport clock on all of your messages, so you can always partially order the execution.
>>: Uh-huh.
>> Colin Scott: You would also need to make sure that—like you said—that each external event, like a
failure, is non-redundant—so you have exactly one of those, a unique event for each external event. So
that might … I totally admit that that might be hard to obtain in practice. We also have some thoughts
about how you might reduce that overhead. So if you’re looking at the kinds of systems we’re looking
at, which are actor systems or message-passing systems, it gets a lot easier; you don’t actually need—
because there’s no shared state—you don’t actually need Lamport clocks. So we have some thoughts
about that, but we haven’t pursued it yet. Yeah?
>>: Did you let these … skipping something like [indiscernible] is impossible.
>> Colin Scott: Sorry, say that again? So …
>>: Do you assume that you are just skipping over some events—that’s what you’re doing as in this
binary search.
>> Colin Scott: Yeah.
>>: This message being [indiscernible] You assume that doing that will not crash [indiscernible]
>> Colin Scott: Ah, you're absolutely right that when we do this splitting between left half and right half, we might end up with a semantically invalid split, which might cause a crash or just be a nonsensical set of external events.
>>: So that’s the invariant.
>> Colin Scott: Yeah, so in that case, we assume that the invariant that the user defines correctly
disambiguates those cases. So if there's a crash that was not our original bug, we'll save that for later, like, "By the way, we've figured out a way to trigger a new bug," but we're not gonna use that for minimization, yeah. Okay, almost done here, so I'm just summarizing all of our techniques in one slide here—I guess you can read it yourself. One thing I will say is that once we've minimized external
events, we’ll go ahead and use the same techniques that we use to minimize external events to then
minimize internal events. So like I said, that just means we’re gonna keep them in the buffer, and not
deliver them, and see if we can still trigger the bug.
Okay, so before I end, I'll quickly look at how well this works in practice. We've applied our tool to two different distributed systems so far; one is an implementation of the raft consensus protocol, and the other is the Spark data analytics engine. You should note that these are obviously two very different systems; they're trying to achieve very different things; they have very different behaviors. And also note that the raft implementation we looked at was a somewhat early-stage development project, whereas Spark is obviously very, very mature. So what I'm showing here is the
size of the executions on the y-axis, and then on the x-axis is each of our case studies—so each of these is a distinct bug. And then the blue bars here are showing the size of the initial execution that we found with randomized concurrency testing, and the green bars here show the size of the minimized output that our tool produces. So a few takeaways from this graph: I don't actually show it here, but we found that randomized concurrency testing—or fuzz testing—turns out to be very useful for uncovering bugs that developers didn't anticipate. So this raft implementation had existing unit tests and integration tests, but each unit test only explores one particular ordering,
and there’s an exponential space of possible orderings, so there are a lot of things that they didn’t
anticipate. Yeah, go ahead.
>>: Question about the use of the phrase “fuzz testing.”
>> Colin Scott: Yeah.
>>: I think by that phrase, you mean that you have control over all scheduling choices.
>> Colin Scott: That’s right.
>>: And you randomly pick one choice.
>> Colin Scott: Yep.
>>: Right? So I just wanted to point out that that’s what you’re assuming. A lot of people in the
industry …
>> Colin Scott: Yep?
>>: … they just think that: “Oh—you know—fuzz testing is when I artificially inject some delays here and
there; I create big workloads to indirectly—not directly—influence schedules.”
>> Colin Scott: Right, right.
>>: Okay.
>> Colin Scott: And I agree that in production, that might be what you want to do. So in production, it might be too much engineering effort to actually interpose on all the sources of nondeterminism, so one way to deal with that nondeterminism would be to just replay multiple times—flip the coin multiple times—and then hope that one of those coin flips is gonna trigger the bug. So that's one way. The first chapter of my dissertation—we had a prior SIGCOMM paper—covers how you would do this for systems where you don't have as much control. One messy part is: now you become dependent on wall clock time, which is not a clean way of thinking about the problem, but it's more practical, yeah. So you could, in practice, do that without full interposition. Yeah?
>>: So are raft and Spark here re-implementations, or [indiscernible] did you take some existing implementation, and you're looking at bugs that already existed? What are these bugs here that you're talking about?
>> Colin Scott: So—sorry—Spark and raft were both implemented in Scala; well, Spark is actually multiple languages, but the parts of the process that we interposed on were Scala. Raft is completely written in Scala.
>>: So you used implementations that were already there?
>> Colin Scott: That’s right.
>>: I see.
>> Colin Scott: So Spark obviously is on Apache, and then this one we found on GitHub. All of the bugs in raft, in fact, were previously unknown. So all we did was we sat down and wrote down the five safety conditions that Leslie Lamport prescribed, then we did randomized testing, and we checked those five invariants at each step, and this is what we uncovered. Yeah?
>>: Were these fixed bugs at the end?
>> Colin Scott: Yes. Yeah.
>>: Then if it’s a fixed bug, is it possible to come up with the optimal number of messages?
>> Colin Scott: Yes. I’ll get to that in one slide.
>>: Okay.
>> Colin Scott: Okay, so what we’re showing here is that across all of our case studies, we get pretty
good reduction. We’re improving the state of the art. So they would have started with this, and we give
them this. So we’re … it seems like we’re helping developers, but you might also ask, “How much room
is there for improvement? How far off are we from optimal?” And so what I did here was, again, I’m
showing here the green bars are the size of our … the output of our tool, and then the orange bars are
the minimal trace—so this is the smallest manual trace that I could produce by hand. It was actually fairly painstaking; it took me about a month to do this, but what it shows here is how much room there is for improvement. So across all of our case studies, we're within a factor of five—four point six, actually—of that smallest manual trace. In these two cases here—raft-58a and raft-42—part of the reason that we were so inflated—so far from that optimal—was that we
ran out of our time budget. So what I’m showing here is the minimization runtime on the y-axis, and we
had some maximum time budget. But in general, across all of our case studies, we’re doing pretty
well—most of these finish within ten minutes. In a few cases, if we had a better or smarter
prioritization function, we could have found our minimal output in much less time. So there is definitely
room for improvement, but in general, we’re doing pretty well. Yeah?
>>: How many processes are you running, and how does that play into the size of the trace?
>> Colin Scott: Yeah, good question. In this case—for raft—we were running four processes; for Spark, we were running—I mean, they have a master and a whole bunch of other processes—like, roughly a dozen. If we had more processes, the executions could be much, much larger, so …
>>: Yeah, I’m impressed that there are bugs that require even, like, fifty-something messages to
reproduce in four processes. So that’s impressive.
>> Colin Scott: Right. Yeah, that's right. If you think about it, this is only a
couple of round-trips.
>>: Yeah.
>> Colin Scott: If you have four processes, and there’s some bootstrapping …
>>: Oh, each RPC is two, and then …
>> Colin Scott: Yeah.
>>: Yeah, alright.
>> Colin Scott: And there are always gonna be some bootstrapping messages present in the execution, so … okay. Alright, so there are many more details that I didn't have a chance to go over here. I'd encourage you to check out our NSDI paper for a much more lucid explanation of those details. So I probably don't have time to do a demo. I have a pretty cool demo; it's essentially a GDB breakpoint for your network. So you get to choose which pending message to deliver next. So I think it's pretty neat.
>>: Yeah.
>>: Go ahead. Do it.
>>: Let’s do it.
>>: Do it.
>>: Let’s do it.
>> Colin Scott: Okay, sweet. Okay, so I've got a Java program here. What I did is I compiled the raft implementation, which is a Scala or Java program, and then I used AspectJ to interpose on the RPC library that it uses. I've got this dash-dash-interactive flag; that's just telling our tool how it's gonna behave. So what I'm showing here: I've got four raft processes in this cluster, and I'm just printing out the external events. In this case, the processes assume they have some bootstrap message that tells them their peers, and then we also have two client commands—in this case, append word: please append a word to the raft log—and then this here is just our prompt for the console. A little bit further up, I'm just showing you the console; this—here at the top—is the console output printed out by each process. So it's just saying, "Oh, I'm getting ready to run my raft implementation." And then, now, what
I’m showing here is the set of pending events. So these are the messages that have been sent, but not
yet delivered. So we interposed on them, and now we get to choose which of these we're gonna deliver next. So now I've got—essentially—a little GDB prompt. I just type "help," and it tells me all the commands I can run. So now, what I might do is deliver some message, or check the invariant, or cause one of the processes to fail, et cetera. So
let’s suppose that we’re gonna … so the way …
>>: You're running all these processes on your machine, right?
>> Colin Scott: Yeah, so … yeah, that’s right. So each of these processes is running locally on my
machine, but they are not aware of colocation. So as far as they know …
>>: Oh, okay.
>> Colin Scott: … as far as they know, they’re running a distributed system across the network.
>>: Yeah.
>> Colin Scott: Okay, so now, let’s suppose that we allow the first process to have its bootstrap
message. Now it's gonna set a timer. It sets a timer, and it says, "When this timer goes
off—I’m gonna try and elect myself as leader.” So let’s go ahead and allow that timer message to go
through. Now, it’s going through a state transition where it wants to begin an election. Again, I’ll let
that message go through, and now, it’s sent request-vote messages to all of the other processes in the
cluster; so it’s gonna go and try and get itself elected. Now, if I had some bug in mind about—some
particular interleaving in mind—about how to trigger some bug, I could use this system to try and trigger
it. And then, at any point, I can check the invariant, and in this case, we haven’t done much, so there’s
no var … there’s no violation yet, but we could just keep doing this until we find the bug. And now when
I exit out of this prompt, we saved a recording of the … of all the events that we played. So I could now,
if I wanted to, replay the execution that I had generated here. I could also do this programatically
instead of doing it interactively; I could also just make random choices programmatically. And one thing
to note here is that because we’re interposing on timers, we’re able to execute these … run these
executions much, much faster than you would be able to in practice, and because we don’t have to wait
for the wall clock time of each timer to go off. So it’s actually pretty amazing how many executions we
can go through in a minute. If we get lucky, we might trigger the bug. I know there’s a known bug here,
but maybe I won’t test my luck. So I actually have a … I also saved a recording of some fuzz test that
ended up with triggering the invariant violation, so now, what I’m a do is try and minimize the recording
that I put to disk. So now, what it’s doing is it’s—like I said—it’s … what I describe in the talk is just
experimentally trying smaller subsequences of events, and eventually—after about ten minutes—it’ll
stop and tell us that it has some minimal output.
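To make “experimentally trying smaller subsequences” concrete, here is a minimal sketch of that outer loop in Scala. It is simplified in two ways worth flagging: replay and violates are hypothetical stand-ins for the tool’s replay engine and invariant checker, and it drops one event at a time rather than using the delta-debugging-style splits the real minimizer starts with.

```scala
// Keep any smaller subsequence of external events that still reproduces the
// invariant violation; stop once no single-event removal still triggers it.
// `replay` and `violates` are hypothetical stand-ins for the real machinery.
def minimize[E, S](events: Seq[E],
                   replay: Seq[E] => S,
                   violates: S => Boolean): Seq[E] = {
  var current = events
  var shrunk = true
  while (shrunk) {
    shrunk = false
    for (i <- current.indices if !shrunk) {
      val candidate = current.patch(i, Nil, 1)   // try the trace without event i
      if (violates(replay(candidate))) {
        current = candidate                      // still buggy, so keep the smaller trace
        shrunk = true
      }
    }
  }
  current
}
```

Each accepted removal is a full re-execution of the cluster, which is why the time budget that comes up in the questions below matters.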
Now, I think one cool aspect of this tool is that we’re allowing developers to generate regression tests without having to write much code; well, they have to write a little bit. They just tell our tool, “Hey, please fuzz,” and using the invariant that we gave it, it’ll run fuzz tests until it finds some bug, then it’ll minimize the trace and save it to disk. And now we can actually go change the system: we can add print statements to help us debug, or we might even change how the system behaves. And then, say a month later, we can just rerun that minimal execution we saved to disk. If the protocol changes radically, that recording will no longer be valid, but as long as the protocol doesn’t change too much, we can rerun that test to see whether the bug comes up again. So I think this is pretty neat.
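As a small sketch of the save-and-replay-later piece, here is one way a minimized trace could be persisted to disk; the Event type, the file name, and the commented replay step are all illustrative, since the real tool has its own trace format and replay engine.

```scala
import java.io._

// Illustrative external-event record; the real trace format is richer.
case class Event(kind: String, payload: String)

object TraceStore {
  // Persist a minimized trace so it can serve as a regression test later.
  def save(events: Seq[Event], path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(events.toList) finally out.close()
  }

  // Load it back, say a month later, after the system has been modified.
  def load(path: String): Seq[Event] = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[List[Event]] finally in.close()
  }
}

// A regression run would then feed the trace back through the scheduler and
// re-check the invariant (hypothetical scheduler API):
//   val trace = TraceStore.load("raft-election-bug.trace")
//   assert(invariantViolated(scheduler.replay(trace)))
```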
>>: So what is the complexity of this minimizing tool that you are running right now?
>> Colin Scott: Well, the complexity in the worst case is n factorial, like we said.
>>: [indiscernible] factorial.
>> Colin Scott: Yeah, but I actually have the system configured to exit fairly soon, so I give it a small time budget; in this case, it’ll actually finish in about ten minutes. Yeah, I don’t know if we want to wait ten minutes, but … [laughter] yeah?
>>: How large is the raft program you are testing with?
>> Colin Scott: Raft is relatively small; it’s fifteen hundred lines of Scala. Spark is much, much larger;
Spark is something like forty thousand lines of code.
>>: Yeah, but you only interpose on certain messages, right? You’re not trying all the possibilities. For instance, you start with fifteen hundred lines, but I assume the code that’s actually touched is probably a small fraction.
>> Colin Scott: That’s right. So in … so one major advantage of doing black-box testing is … yes?
>>: [indiscernible] sorry.
>> Colin Scott: Oh, sorry. One major advantage of doing black-box testing is that the system might be written in multiple languages, and, like you said, there might be lots and lots of code that’s actually being used, and we’re agnostic to that.
>>: I understand, we just walked through it together. But do you know how much of the Raft program is actually involved in this?
>> Colin Scott: There was a little bit of work I needed to do to deal with nondeterminism; there are some sources of nondeterminism outside of the RPC layer. Conceptually, all we do is interpose on the RPC layer, and that’s it; we interpose below the application, so we shouldn’t have to touch it. But in some cases, applications also depend on nondeterminism that’s outside the control of the RPC library. An example would be a hash map: suppose you keep values in a hash map and then iterate through all the values; the order in which the JVM places values in the hash map depends on the memory address, which is nondeterministic. So in some cases, we had to sort the values of the hash map to get better determinism. So there were some changes we needed to make to the application, but for the most part, conceptually, we’re agnostic to it.
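A minimal illustration of that kind of fix follows. The Peer class and the values are made up; the point is just that a key type which falls back to identity hash codes gives the map a run-to-run-dependent iteration order, and sorting on a stable field before iterating restores determinism.

```scala
import scala.collection.mutable

object DeterministicIteration {
  // A key type with no overridden hashCode: it uses the identity hash,
  // which depends on object placement and can vary across runs.
  final class Peer(val name: String)

  def main(args: Array[String]): Unit = {
    val replicas = mutable.HashMap[Peer, Int](
      new Peer("c") -> 3,
      new Peer("a") -> 1,
      new Peer("b") -> 2
    )

    // Nondeterministic: iteration order depends on the identity hash codes.
    // replicas.foreach { case (p, n) => send(p, n) }

    // The small application-side change: sort on a stable field first, so a
    // replayed execution sees the same order every time.
    replicas.toSeq.sortBy(_._1.name).foreach { case (p, n) =>
      println(s"${p.name} -> $n")
    }
  }
}
```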
>>: Mmhmm.
>> Colin Scott: Yeah?
>>: You talked about applying this to multi-threaded apps …
>> Colin Scott: Yeah.
>>: [indiscernible] Yeah.
>> Colin Scott: Yeah, good question. I’ll just go right to a slide here. Like I said earlier, shared memory is functionally equivalent to message passing, so there are actually a bunch of other papers that look at essentially what we’re trying to do: systematic exploration of schedules for multi-core systems instead of distributed systems. And the basic idea is that you interpose on the language runtime, so you detect whenever a thread reads or writes shared memory, and then you trap, block that thread, and treat it as if we had just sent a message; it’s conceptually the same. It would be some amount of engineering effort to interpose on that runtime, which we haven’t done yet, but you could do it.
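A toy sketch of that idea, with all names illustrative: every shared-memory access first blocks until a central scheduler “delivers” it, so accesses can be ordered, explored, and replayed the same way pending messages are. A real tool would interpose through the runtime, for instance with bytecode instrumentation, rather than hand-written wrappers.

```scala
import java.util.concurrent.Semaphore

// Central scheduler that decides when a pending shared-memory access may
// proceed, mirroring how the message scheduler picks which RPC to deliver.
final class ToyScheduler {
  private val turns = new Semaphore(0)
  def awaitTurn(): Unit = turns.acquire()   // accessing thread blocks here
  def grantTurn(): Unit = turns.release()   // "deliver" one pending access
  // A real scheduler would also record which thread it unblocked, so the
  // choice could be replayed; a bare semaphore leaves that to chance.
}

// A shared variable whose reads and writes are treated like messages: they
// do not happen until the scheduler grants a turn.
final class ScheduledVar[T](initial: T, sched: ToyScheduler) {
  @volatile private var value: T = initial
  def read(): T         = { sched.awaitTurn(); value }
  def write(v: T): Unit = { sched.awaitTurn(); value = v }
}
```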
>>: And the invariant is [indiscernible] shared.
>> Colin Scott: Yeah, the invariant would be defined over the shared memory, yeah.
>>: You’d better not be using too much shared memory, or it won’t scale.
>> Colin Scott: That’s right, or the schedule space would be too huge. And then—I don’t know—Shaz
Qadeer has a bunch of techniques for helping with that problem, yeah. [laughs]
>>: I’m a huge fan of binary search, so I’m glad you’re using it productively. I have a comment about the problem formulation.
>> Colin Scott: Yeah.
>>: You start out by saying you want to minimize an existing trace …
>> Colin Scott: Right.
>>: I’m wondering if you had set the problem in the following alternative way …
>> Colin Scott: Right.
>>: … and you just say that I have a faulty execution, and I have a generic tool for doing prioritized
search available to me, except that the tool needs a particular prioritization function.
>> Colin Scott: Yeah.
>>: And all I’m going to do is—I’m not gonna bother trying to minimize this execution—I will just try to
learn from that execution a prioritization function, or maybe a family of prioritization functions, that’s
likely to lead to a violation of the same invariant.
>> Colin Scott: Yep.
>>: Forget about trying to minimize that particular execution itself.
>> Colin Scott: Right.
>>: So how would you compare that problem formulation to what you already have?
>> Colin Scott: I actually think your intuition is right that it’s very, very similar—it’s almost the same.
>>: Okay.
>> Colin Scott: So in future work … we don’t currently learn. I have some master’s students I’m working with who are looking at using this tool called Synoptic; Synoptic looks at log files and tries to learn a state machine from them. So what they’re looking at is, like you said, trying to learn some prioritization function based on the observed behavior of the system. Another answer to your question is: conceptually, what we’re doing is essentially explicit-state model checking.
>>: Okay, prioritized search.
>> Colin Scott: Prioritized search.
>>: Yeah.
>> Colin Scott: And … except that we have one additional piece of information, which is we know that
the original execution triggered the bug.
>>: Right.
>> Colin Scott: But that’s essentially what we’re doing.
>>: Okay.
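One way to picture that framing is a prioritized explicit-state search in which candidate schedules that look more like the recorded failing execution are tried first. The sketch below is illustrative only: the similarity score is deliberately crude, standing in for whatever a learned prioritization function (for example, one derived from a Synoptic-style inferred state machine) would actually compute, and run and violates are hypothetical stand-ins for executing a schedule and checking the invariant.

```scala
import scala.collection.mutable

// Crude similarity score: how many positions of the candidate schedule agree
// with the original failing execution. A learned prioritization function
// would replace this.
def similarity[E](candidate: Seq[E], original: Seq[E]): Int =
  candidate.zip(original).count { case (a, b) => a == b }

// Explore candidate schedules in order of similarity to the known-bad one,
// stopping at the first schedule that violates the invariant.
def prioritizedSearch[E, S](original: Seq[E],
                            candidates: Iterator[Seq[E]],
                            run: Seq[E] => S,
                            violates: S => Boolean): Option[Seq[E]] = {
  val byCloseness: Ordering[Seq[E]] = Ordering.by((s: Seq[E]) => similarity(s, original))
  val queue = mutable.PriorityQueue.empty[Seq[E]](byCloseness)  // max-heap: closest first
  candidates.foreach(queue += _)
  while (queue.nonEmpty) {
    val schedule = queue.dequeue()
    if (violates(run(schedule))) return Some(schedule)
  }
  None
}
```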
>> Colin Scott: Yep. Okay, I had a couple of slides here before I ended. So this is the technical conclusion: the results we found for those two systems leave us pretty optimistic that these kinds of techniques can be applied more broadly. Our tool is open source; you can check it out on GitHub. And of course, I’d encourage you to check out our paper. I wanted to go over a couple of slides about where my research trajectory is currently headed. I have done work on a lot of prior topics, not just minimization; the common thread through all of these is troubleshooting and reliability. My background is in networking and distributed systems. I think this is an increasingly important area, because everything is becoming distributed; we all have smartphones in our pockets. But unfortunately, the kinds of tools that we use to develop concurrent and distributed systems lag significantly behind the kinds of tools we have for sequential code. Actually, this is a quote from Parkinson; Parkinson is an MSR researcher. I never met this person, but they have a conjecture, or a claim, that the kinds of tools we use to develop concurrent or distributed code lag behind sequential tools by about a decade. So I see …
>>: More like twenty years.
>> Colin Scott: Yeah, maybe more, that’s right. [laughter] Maybe even more. So I see a great need to bridge this gap. Now, there’s a lot of great research from the software engineering, programming languages, and formal methods communities on how to debug, verify, et cetera; lots and lots of cool tools. There are amazing things you can do with program analysis.
>>: [indiscernible] I thought it was sequential. Sequential [indiscernible] what does sequential mean?
>> Colin Scott: Well, a distributed system is just a concurrent system that happens to also have partial failure and asynchrony. What I really mean is that I’m trying to distinguish between concurrent and sequential code. Lots of verification tools, for example, assume a single-threaded, sequential computational model.
>>: There’s also a huge gap between, like, concurrently working and also [indiscernible]
>> Colin Scott: That’s right. Yeah, that’s right.
>>: Numbers together one place.
>> Colin Scott: That’s right. You’re exactly right; partial failure makes things even worse. So anyway, the point is that I see a great need to bridge this gap, and there’s a lot of great research from these two communities that we could use to help us deal with these problems.
So I have a bunch of projects that I would like to pursue. These are just some ideas I’ve had in the last five years for how we could take ideas from software engineering and programming languages and adapt them; it’s not just straightforward adaptation, we have to do some research here; but we could apply those ideas to problems in concurrency and distributed systems. So this is basically the direction that I’m planning to move in.
So thanks a lot. If you have any questions you can e-mail me. My e-mail’s very easy to remember,
cs@cs.berkeley.edu. [laughter] Thank you. [applause]
>> Jay Lorch: We peppered him with many questions throughout the talk. Are there any remaining?
Okay, great. Then we’re done.
>>: Whoo-hoo.