>> Jay Lorch: [laughs] Okay, welcome, everyone, to Colin Scott. He’s no stranger to the area; he went to
the University of Washington for his undergrad; and although he’s a PhD student now at Berkeley, he
spends most of his time here, collaborating with people at UW and living in our glorious area. He does
research on networks and distributed systems, and ‘cause distributed systems are notoriously buggy, he
spends a lot of his research on ways to find bugs, and understand them, and ultimately fix them. And he'll
tell us about some of his research in this area now.
>> Colin Scott: Thanks, Jay. Yeah, so like Jay mentioned, I work on systems and
networking. I also pursue research that’s at the intersection of software engineering, and programming
languages, and formal methods with systems—distributed systems. So part of the reason that I’m
excited about Microsoft Research is that there’s an opportunity to collaborate with all four of those
groups, and you have phenomenal groups of people working in those areas. Okay, so what I'll be discussing
today is my dissertation work; I also have some other work that I’ll briefly touch on at the very end, but
this is basically my dissertation.
Okay, so I’ll just go ahead and jump in. When a software developer receives a bug report for a
distributed system, they typically start their debugging process by looking at the console output—or the
log files—for each machine in the cluster, and their intent here is to try and understand what events in
the execution caused the bug—triggered the bug. Now, if the system was running for a long period of
time, these log files could be quite large; if it’s running for several hours, they could be multiple
gigabytes. So on a lucky day, the developer might immediately find some obvious piece of diagnostic
information—maybe like a stack trace—that immediately tells them what is the root cause. What is the
line of code that is causing my problem? Or maybe they can just look at the last few events in these log
files and then use their intuition to try and understand what was actually causing the bug. But often,
developers are not so lucky; in these cases, they need to find the events throughout the execution—not
just at the end—that were responsible for getting the system into that unsafe, buggy state. Okay, so
effectively, what they need to do is filter out the events in this execution that were not relevant—there
might be many thousands of extraneous events—so that they’re left with the handful of events that are
responsible for triggering the bug, that help them understand what actually went wrong. So you can
imagine that this whole process could be very time-consuming.
So more generally, software developers spend a significant portion of their time on debugging—some
form of debugging—at work. According to one study—I think this was a study of
Microsoft product groups, just a random sampling of product groups—they found that forty-nine
percent of time at work was spent on doing some form of debugging, okay? Now, some portion of that
time is spent understanding: what were the environmental conditions that led the system to trigger the
bug in the first place? And then, some other portion of that time is spent coming up with a fix to the
code so that future executions don’t exhibit the same problem, ‘kay? So our goal is to allow developers
to focus most of that effort—most of their effort—on that latter part—coming up with a fix to the
code—instead of first having to understand what were the conditions that trigger the bug in the first
place, okay? The way that we achieve our goal is by identifying—thanks—what we call a minimal causal
sequence of events that triggers the same bug as before. And what I mean here is just that if we were
to execute just this minimal sequence of events—instead of the entire execution—we would end up
with the system exhibiting the same bug as before, okay? Now, the key idea here is that the smaller the
execution, the easier it should be for a human to understand—the less time it should take for a human
to understand what was going wrong. And in general, human time is much, much more expensive than
machine time, so the value that we’re adding here is that we’re saving human time. Now, this is
certainly true of my experience in debugging—I mean, it’s also corroborated by research in the field of
psychology—that smaller event traces should be easier to understand.
So for the rest of this talk, I’ll start by defining a simple model of distributed systems—a computational
model—to give us intuition for what exactly we’re trying to achieve; then, we’ll look at how we obtain
faulty executions using our tool, DEMi—the distributed execution minimizer. We happen to use fuzz testing—randomized concurrency testing—to find our bugs, but you could imagine taking production executions, feeding them into our tool, and minimizing those as well. For the core of the talk, I'll discuss how we perform minimization once we're given
one of these faulty executions; and then lastly, I’ll look at how well this works in practice before I
conclude.
Okay, so you can think of a distributed system as just a collection of N single-threaded processes, and each process is a little automaton; it has unbounded memory—it's essentially a little Turing machine—it starts in a known initial state; and then it changes states deterministically according to some transition function, which I'll describe next. Yeah?
>>: Is there anything special about single-threaded? You mention it’s a single-thread process.
>> Colin Scott: Well, conceptually, if you have a machine that's running multiple threads, you can treat each thread as a separate process, as long as there's no shared state between the threads. At the end of the talk, I'll discuss how you could adapt this to a multi-threaded system, but essentially, the key insight is that message passing—this is a message-passing model—is equivalent to shared memory. If you interposed on reads and writes to shared memory, you could treat it in exactly the same way as a message-passing system, yeah.
Okay, right. So you can think of the network as just a giant buffer; it’s a buffer that keeps all messages
that have been sent from one process in the cluster but not yet delivered to the … to its recipient, okay?
Now, the execution takes place one step at a time. Conceptually, we’re going to linearize all of the
events in this execution. In reality, these things—these processes—may be running concurrently, but
there’s always some partial order that we can define, so conceptually, it’s easiest to think of it as a linear
order. So at each step in the execution, we're going to pick one of the pending messages in the buffer to deliver—in this case, there's only one—and then, based on the contents of that message and the current state of the recipient, the recipient will go through some state transition based on its transition function, okay? So it goes through a state transition, and in the process, it will send zero or more messages to other machines in the cluster; does this make sense? And
note here that we can treat timers as a special kind of message; this is just a special kind of message that
we deliver back to the sender at some known, later point in the execution. So in this model, we assume
that the … none of these processes have access to a local clock—or at least, we interpose on that local
clock, okay?
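To make the model just described concrete, here is a minimal sketch in Scala (the language of the systems tested later in the talk); all of the names and types here are illustrative stand-ins, not part of DEMi's actual API:

```scala
// Minimal sketch of the computational model: each process is a deterministic
// automaton that, given its current state and a delivered message, produces a
// new state plus zero or more messages to send. Timers are just messages a
// process sends to itself, delivered at some later step.
case class Message(src: String, dst: String, payload: Any)

trait Process {
  type State
  def initialState: State
  // Deterministic transition function: (current state, delivered message) ->
  // (new state, messages to place into the network buffer).
  def transition(state: State, msg: Message): (State, Seq[Message])
}

// The "network" is just a buffer of messages that have been sent but not yet delivered.
class NetworkBuffer {
  private var pending = Vector.empty[Message]
  def send(m: Message): Unit = pending :+= m
  def pendingMessages: Vector[Message] = pending
  def deliver(m: Message): Unit = {
    val i = pending.indexOf(m)
    if (i >= 0) pending = pending.patch(i, Nil, 1) // remove the chosen message
  }
}
```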
Now, a step in this execution could also be external, and by external, I just mean that it's outside of the control of the processes themselves. A couple of examples: if we assume that there is some process outside of the system that we're not modelling—for example, a client—we might have that client send a message and put it into the message buffer; or we might have a new process be nondeterministically created; or we might force one of these processes through a crash-recovery event—so we force it back to its initial state. Now, note here that I
didn’t explicitly model fail-stop crashes where a single machine crashes, because that’s equivalent to a
schedule where we just partition that process from all of the other processes in the system. So this is
sort of … that … fail-stop crashes are sort of implicit in our model.
>>: How ‘bout message loss or duplication?
>> Colin Scott: So a message loss is equivalent to having the message just stay in the buffer
indefinitely—this is an infinite buffer. Duplication, we could model that; we don’t happen to model that;
but it would be a simple change to the model. [coughs] Okay, sorry. So a schedule—which we denote
with this symbol here, tau—is just a sequence of these events, either external—which I’m showing here
in dark green—or internal—which I'm showing here in blue. And a valid schedule is just a schedule such that, if we were to execute it, then at each step the next event to play would be currently pending in the buffer—so it's a valid schedule for us to execute. Now, recall that we're trying to find bugs here, so we assume that we have access to some
invariant, and this is a safety condition—some predicate p—that’s defined over the local variables of all
of the processes, okay? And this predicate will tell us whether or not the system is currently in a safe or
unsafe state. So an example of an invariant violation might be violating linearizability when you're running raft consensus or some other consensus protocol. Okay, now ultimately, what we're looking for is some faulty execution; this is just a particular schedule such that, if we were to execute each event in the schedule, we would end up with the system violating that invariant. Does this make
sense?
Okay, right, so now that we have that sort of formalism in hand, we can now get an intuition for exactly
what we’re trying to achieve. We assume that we’re given one of these faulty executions—some
schedule, tau—such that we … if we execute it, we end up with an invariant violation. And now, what
we’re looking for, informally, is a locally minimal reproducing sequence, tau prime. And just to unpack
what I mean by this, by reproducing, I just mean that if we were to execute tau prime, we would also get
the same invariant violation; tau prime should be at least as short as the original and should not contain
any external events that were not also in the original. And then, by locally minimal, I just mean: if we were to remove any single external event, e, from tau prime, there would not exist some other schedule, tau double prime—containing the same external events minus e—such that tau double prime also violates p. Yeah?
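Written out, the goal just stated is roughly the following (my notation: ext(τ) denotes the external events of a schedule τ, and "violates p" means that executing the schedule ends with the safety predicate p being false):

```latex
\text{Given a faulty schedule } \tau \text{ that violates } p, \text{ find } \tau' \text{ such that:}\\
\begin{aligned}
&\text{(reproducing)}     && \tau' \text{ violates } p,\\
&\text{(no new inputs)}   && |\tau'| \le |\tau| \ \text{and}\ \mathrm{ext}(\tau') \subseteq \mathrm{ext}(\tau),\\
&\text{(locally minimal)} && \forall e \in \mathrm{ext}(\tau'):\ \nexists\, \tau'' \text{ with } \mathrm{ext}(\tau'') = \mathrm{ext}(\tau') \setminus \{e\} \text{ such that } \tau'' \text{ violates } p.
\end{aligned}
```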
>>: [indiscernible] all the states of the process are known, or …?
>> Colin Scott: Sorry—we assume that, at any point in the execution, we can just halt the state of the system—halt the execution—take a snapshot, and
then the user is going to define—the application developer, who has some domain-specific knowledge
about what kinds of bugs they’re looking for—they define this predicate over the local variables
contained in that snapshot, yeah. So there is some work that the developer has to do for us.
>>: I have a question.
>> Colin Scott: Yeah? [coughs]
>>: When the … when your message handler sends a whole bunch of messages out into the network …
>> Colin Scott: Right.
>>: … is that treated as an atomic step in your computation model?
>> Colin Scott: Yes, and that's actually crucial. So I'll discuss the implementation details a little more later, but for the systems that we interpose on, we need a well-defined point where the process blocks—like you said, an atomic start and an atomic end. This way, we can ensure linearizability when we do our minimization.
>>: But what is worrying me is that that doesn’t really match what happens in reality, right? I mean, in
reality, you could have … you could do a send …
>> Colin Scott: Right.
>>: … and that can cause a de-queue immediately—right—before the other sends have happened, and
that can cause that message handler to start executing, if you know what I mean?
>> Colin Scott: Yeah, you’re absolutely right. So I think for now—for this part of the talk, where I’m just
kind of defining it clearly—let’s assume that we’re alway … we’ll always be able to interpose on any
point. Like you said, maybe there would be some de-queue that we don’t know about. Let’s assume for
now that we’re going to interpose on it; we sort of play God; and we get to interpose on any
nondeterministic event.
>>: I see.
>> Colin Scott: And there’s a lot of engineering effort that I had to do to make sure that, in the systems
we were testing, that didn’t happen.
>>: It’s just an engineering question; it’s not something foundational to your model here. I see.
>> Colin Scott: Exactly, and later in the talk, I fully admit that it may be a lot of engineering effort to do
… to get this to work in practice. So later in the talk, I’ll sort of touch on the engineering aspects of this,
yeah. Okay, right, so we focus on removing external events in this formulation because external events are the first layer of abstraction that developers think about when they're reasoning about the behavior of their system. After we've minimized external events, we can go ahead and try to minimize the internal events remaining in our schedule, tau prime. And
conceptually, what this just means is that we’re gonna try keeping some of those messages in the
buffer, and not delivering them, and seeing if we can still trigger the invariant violation. Does this make
sense?
>>: How will you define it in terms of removing exactly one? You can imagine removing more than one.
>> Colin Scott: Yep, you’re exactly right. Yeah, so that’s … it’s an excellent point. Unfortunately, if you
try to remove more than one, that’s equivalent to enumerating the power set …
>>: Yes.
>> Colin Scott: … and that's a problem. So one nice thing about the algorithm that we use is that if we had infinite time, it would give you a provably minimal result, but of course, we don't have infinite time, so this is why we're looking for a local minimum here. Okay, and in general—by the way—if you get stuck in some local minimum, you could run it again, induce some sort of nondeterminism, and we might find some other local minimum that might be smaller.
>>: [indiscernible] local minima.
>> Colin Scott: Exactly, yep. Okay, right, so now onto how we obtain these faulty executions. [coughs]
You can think of a distributed system just as an application process that runs on each machine in the
cluster; I’m showing three here. That application process calls into some messaging library—some RPC
library—which, in turn, calls into the operating system to have the bytes sent across the wire. So what
we do is we interpose on this messaging library; in our case, we interpose on a particular library called
akka, but you could imagine applying the same sort of ideas to a different messaging library. We interpose on this library such that whenever an application sends a message, we intercept that message before it's sent to the operating system; we place it into a buffer that we
control. And now, at this point, we essentially get to play God, so we get to choose arbitrarily which of
these pending messages we want to deliver next. So we’re going to essentially linearize the events in
this execution. In this case, there’s only one choice to be made, but there might be other choices. So
we record our choice to disk, including the contents of that message, and then we go ahead and allow
this message to be delivered to the recipient. Now, based on the content of the message and the
recipient’s current state, it’ll go through some state transition and then send some zero or more
messages to other machines in the cluster. Now again, we could make some choice, or based on some
probability, we might also decide to inject some external event; let's suppose that we decide to have this process go through a crash recovery. Now, because we precisely control when each process gets to execute or not, we can—at any point in time—stop the state of the world—stop the execution—take a snapshot of the local state of each process, and then run our invariant over it, okay? And we'll just keep doing that until we find our bug—
some invariant violation we care about.
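The recording loop just described can be sketched roughly as follows, reusing the toy Message and NetworkBuffer types from the earlier sketch; StatefulProcess, ExternalEvent, and the probabilities here are hypothetical stand-ins for DEMi's real interposition machinery, not its actual interfaces:

```scala
import scala.util.Random

// Hypothetical interfaces for the sketch.
trait StatefulProcess {
  def localState: Map[String, Any]         // the variables the invariant inspects
  def handle(msg: Message): Seq[Message]   // deterministic transition; returns messages to send
}
trait ExternalEvent {
  def applyTo(procs: Map[String, StatefulProcess], net: NetworkBuffer): Unit
}

object FuzzLoop {
  // Repeatedly pick a pending message to deliver (or occasionally inject an external
  // event), record the choice, and check the safety invariant after every step.
  // Returns the recorded schedule if a violation is found within maxSteps.
  def run(net: NetworkBuffer,
          procs: Map[String, StatefulProcess],
          external: Seq[ExternalEvent],
          invariantHolds: Map[String, Map[String, Any]] => Boolean,
          maxSteps: Int,
          rng: Random = new Random()): Option[Vector[Any]] = {
    var trace = Vector.empty[Any]          // the recorded schedule (written to disk in practice)
    var remaining = external
    for (_ <- 1 to maxSteps) {
      if (remaining.nonEmpty && rng.nextDouble() < 0.1) {
        val ev = remaining.head
        remaining = remaining.tail
        trace :+= ev
        ev.applyTo(procs, net)             // e.g. a client request or a crash-recovery
      } else if (net.pendingMessages.nonEmpty) {
        val msg = net.pendingMessages(rng.nextInt(net.pendingMessages.size))
        trace :+= msg
        net.deliver(msg)
        procs(msg.dst).handle(msg).foreach(net.send)
      }
      // Because we control all delivery, we can pause here, snapshot each process's
      // local state, and evaluate the invariant.
      val snapshot = procs.map { case (name, p) => name -> p.localState }
      if (!invariantHolds(snapshot)) return Some(trace)
    }
    None
  }
}
```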
Okay, so once we're given one of these sequences recorded with our tool, we're gonna go ahead and try to minimize it. To make the discussion more concrete—I assume most people probably know what consensus is trying to achieve—I'm gonna briefly go over the mechanics of how raft consensus works; raft is a particular kind of consensus protocol. So in raft, we have some number of processes in the cluster, and the first thing that they do is vote in order to decide on some leader. So in this case, we have Putin over here as our leader.
>>: Why debug it?
>> Colin Scott: What’s that? [laughter] Exa … that’s the joke, exactly. Putin doesn’t really need
consensus; he doesn’t need a quorum, but in raft, he would need it; he would need a quorum. Okay, so
let’s suppose that he received a quorum of votes; now client requests will be directed to the leader; the
leader will mark in his log that he might commit this entry; he hasn’t yet committed this entry. Now he
tries to replicate that entry to the rest of his peers; they also log that they might commit that entry; and
then, they acknowledge back to the leader that they have received the entry. Now, at this point, the
leader knows that a quorum of the processes in the cluster have received that entry in this particular
slot, so it goes ahead and marks that it has committed this entry, and then it tells the rest of its peers
that they can go ahead and commit that entry as well. Now of course, the hard part about consensus is
that there might be failures or arbitrary message delays.
So what I’m showing here is a particular bug that we found and minimized using our tool. So this is a
bug in an implementation of that raft consensus protocol. So the bug here—the invariant violation—
was that at the end of this execution—at this point here—we had two processes in this cluster who both
believe that they were leader in the same election term. So in raft, that’s a very, very bad state to be in,
because those two leaders might overwrite each other’s log entries; they might violate the linearizability
… the main linearizability constraint that raft is trying to achieve, okay? So the two leader … the two
processes here, by the way, were the green process and the orange process, and if we look at this
minimized execution, you can almost, sort of, immediately tell what was going wrong. So the green
node here requests a vote from the red node, and then the red node grants that vote; and then later on, the green node requests a vote from the very same red node, and the red node again grants the vote. So the bug here—and by the way, the same thing happens for the orange and blue nodes—the bug here was that we were accepting duplicate votes from the very same peer. In other words, the actual implementation had an integer that got incremented every time we received a vote-granted message, whereas what you really wanted was a hash map keyed by which peer you got the vote from, so that duplicate votes from the same peer aren't counted, okay? And by the way, I gave a talk at Salesforce where my host was Diego Ongaro, the guy who invented raft, and he said that in his implementation of raft, they had the very same bug. So apparently, even Diego himself has had these kinds of implementation issues. Now, in retrospect, this
bug is easy to figure out, but the original execution for this—this initial fuzz test that we generated—was
something like fifteen hundred events.
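The bug as described maps onto roughly the following shape of code; this is a paraphrase for illustration, not the actual raft-in-Scala implementation that was tested:

```scala
// Buggy version: a bare counter, so a duplicate VoteGranted from the same peer
// is counted twice, and a candidate can declare itself leader without a real majority.
class BuggyCandidate(clusterSize: Int) {
  private var votes = 1                       // the candidate votes for itself
  def onVoteGranted(fromPeer: String): Boolean = {
    votes += 1
    votes > clusterSize / 2                   // true => becomes leader
  }
}

// Fixed version: remember *which* peers granted a vote (the "keyed by peer" idea
// from the talk), so duplicate votes from the same peer are ignored.
class FixedCandidate(clusterSize: Int, self: String) {
  private val voters = scala.collection.mutable.Set(self)
  def onVoteGranted(fromPeer: String): Boolean = {
    voters += fromPeer
    voters.size > clusterSize / 2
  }
}
```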
Okay, so what we're looking for is this minimal output—something that's pretty easy for a developer to understand—but we're not initially given that minimized output, right? The initial execution we're given might be thousands of events; now we'd like to find that minimal output. So how
do we do this? One straightforward approach might be to just simply enumerate all possible
schedules—all possible interleavings of external events and internal events—that are at least as short as
the original. So the original is some point in this schedule space, and we’re going to enumerate all of
them, and we’re going to execute every single event, and then at each step, we’re going to check the
invariant violation.
>>: Yeah … that's O of n factorial many schedules.
>> Colin Scott: Yeah, oh, very good. [laughter] Yeah, sorry.
>>: I’m wondering: why did M get to be so big in the first place? See, you said you got a fifteen hundred
length trace for this thing; why didn’t you just restart and only try traces of length two, three, four, and
five, and so on?
>> Colin Scott: Yeah, that’s a great question. So the point is that if you do breadth-first search when
you’re finding bugs, then your bugs are minimal … your executions are minimal by construction, and
that is a good idea; we actually tried that; it turns out that it takes a very long time to get to interesting
lengths. So most bugs happen at a length of at least, like, thirty, say, and even getting to executions of length thirty with breadth-first search takes a very, very long amount of time—like we ran it for
twenty-four hours, and it still hadn’t got past, like, length five. So we just found that—I mean, in
general—that this space of possible interleavings is so large that we—in our experience—we found that
randomized testing was much more effective for finding bugs. Incidentally, the MoDist—I don’t know if
you guys know MoDist; it was done by MSR here—they also did an evaluation; they’re doing something
very similar to us in terms of finding bugs; what they found was that systematic enumeration often gets stuck exploring only some small part of the schedule space. So when they combined systematic exploration with randomness, they found bugs much more
effectively. Now of course, the problem is: when you’re using randomness, now you have to minimize,
which is where our technique comes in. Yeah?
>>: I have another quick question. It’s about the engineering of your system, really.
>> Colin Scott: Right.
>>: So you mentioned that there is this invariant that can refer to the states of all the processes. So how did you actually implement that? I mean, the state of a process—there could be a stack, and it's just all sorts of heap and stuff.
>> Colin Scott: Right.
>>: How does somebody even write that, and how do you know that this thing that you wrote in the
invariant refers to this bit of memory?
>> Colin Scott: Yeah, good question. So in our case—in the case of raft—our job was made much, much
easier, because Leslie Lamport did that work for us; he said, “Here are the well-defined safety conditions
that we care about.”
>>: Okay, so you can just bake it in.
>> Colin Scott: Yeah, so … and then each process has the local variables that are defined by Leslie
Lamport or Diego, and then we just examine the values of those local variables and see if they match
what the safety conditions prescribe. Now, I totally agree that there's an issue there—in most systems, unlike Leslie Lamport's work, we don't write our specs first; we actually just write a
bunch of code, and then maybe, we realize that we want something else, so we change our code, and
it’s just really hard to actually understand what it is that we actually really want. So I’ll get to Ratul in a
second; I have just one idea there. Michael Ernst at UW did his PhD dissertation on this thing called
Daikon; it’s inferring likely invariants. So this is one way that you could help developers sort of …
>>: No, I think that you read too much into my question. I’m just talking …
>> Colin Scott: [laughs] Oh, okay, great.
>>: … basics. Well, I’m making sure …
>> Colin Scott: Yeah.
>>: How do you write code that simultaneously looks over the state of multiple processes?
>> Colin Scott: Well, a distributed snapshot is a well-known technique—in our case, I don't actually need Chandy-Lamport—but a distributed snapshot, conceptually, just takes a consistent snapshot of all the local variables. So as long as you can look at the local variables of one process—you say, "I know the locations in memory that I want to examine"—then you can just use that same code to run across multiple processes. So there are separate questions: how do you obtain the
snapshot, and then how do you run the invariant? I think Ratul was first, and then I’ll get …
>>: So one thing about these invariants that we're running here: some invariants may get violated, like, transiently.
>> Colin Scott: Yeah, good … yeah.
>>: And so does your code know when to actually evaluate its own invariants?
>> Colin Scott: So if your safety condition is violated transiently … the definition of a safety condition is that it should never be violated. So if it's violated transiently, one answer—one snarky answer—might be that you just haven't written your invariant very well. That you sh …
>>: I think we're maybe talking past each other ... so I was just thinking, like, take a distributed routing computation; let's say a set of processes are trying to compute shortest paths.
>> Colin Scott: Right.
>>: Right, so I think: before the whole thing converges, like, paths are not short …
>> Colin Scott: Right.
>>: … and your invariant is correctness: that—you know—after things converge, you should basically get shortest paths.
>> Colin Scott: So what you're actually describing is a liveness condition. I think that's the challenge.
>>: So it’s a safety condition, but it still the … you need to be defined as after the [indiscernible] that is
your safety … it’s part of your safety property. After you converge, then paths are short, right? So
before you converge, you just [indiscernible]
>>: But how does … I’m just wondering, like, how does this engine know that things have converged,
and now’s the right time to check it, versus you shouldn’t have checked it before this has happened? Or
these types of invariants are not checkable? I’m trying to …
>> Colin Scott: No, no, I think they are checkable. I think in practice, like, it … people do—developers—
do exactly what you said; they define some sort of threshold. They say, “If five seconds have passed,
then the assertion should hold.” And the problem with that—like you said—is that it can be transiently
violated. If the execution times are slightly different, then all of a sudden, we have flakiness. I don’t
know. I don't have a great answer. I mean, one thing you could do is empirically build a distribution of how long convergence takes and then pick some time in that distribution that would be reasonable. I mean, it's a hard question. I think what you really want is a liveness condition, but unfortunately, liveness involves infinity, and you can't execute systems infinitely long, so … there is one answer, though. There's a nice paper called MaceMC; I don't know if you've heard of it. The basic idea is that you can convert liveness conditions into safety conditions using some heuristics. So they do random walks of the search space, and they say, "If we've done n random walks, and the liveness condition still doesn't hold, then we'll consider that a liveness violation." And that might be one way of getting rid of this transient issue you mentioned. I don't know if that … are you satisfied?
>>: We can continue to talk offline later, perhaps.
>> Colin Scott: Yeah. Sure.
>>: [indiscernible] You said you could do the minimization in factorial time. Before, you said that you—
pity for you—if you were given infinite time, you could do it, but actually, you …
>> Colin Scott: Yeah, you’re right. I meant factorial time. That’s right. Yeah, you’re absolutely right.
Which is still … could be potentially years. Right, okay, so one observation that others have made is that
many of the schedules in this schedule space are commutative. And I’ll explain what this means, but
incidentally, I don’t know if … many of you probably know Patrice Godefroid. This was actually his
dissertation work on this topic. So the basic idea here is: suppose we have two pending events, i2 and i3—they're both in the message buffer at the same time—and let's further assume that they have different destinations—so they're destined for different processes—and they are concurrent with each other according to the happens-before relation. So essentially, they're happening on different
machines. Now, if you consider the state of the overall system at some step, n, before we execute
either of those, and then we decide to deliver i2 and then deliver i3, we’ll actually end up at exactly the
same state as if we had instead delivered i3 before i2. So we know that any schedules that only differ in the order of i2 and i3 will end us up in the
exact same state. Now, the algorithm that we use to reason about this kind of commutativity is called a
Dynamic Partial Order Reduction—again, this was Patrice. Conceptually … sorry, I guess I’ll explain how
this works in a few slides, but essentially, what it does for us is it allows us to only explore one of those
commutative schedules, and the rest of the equivalent schedules we get to ignore.
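The commutativity condition just described amounts to an independence check along these lines; this only captures the flavor of the reduction, not Godefroid's full DPOR algorithm, and the vector-clock representation is an assumption of the sketch:

```scala
object Commutativity {
  // Two pending deliveries commute if they target different processes and neither
  // happens-before the other; in that case only one of the two orderings needs
  // to be explored, since both lead to the same state.
  case class PendingDelivery(dst: String, clock: Map[String, Int]) // vector clock of the send

  def happensBefore(a: Map[String, Int], b: Map[String, Int]): Boolean =
    a != b && a.forall { case (proc, t) => t <= b.getOrElse(proc, 0) }

  def commute(i2: PendingDelivery, i3: PendingDelivery): Boolean =
    i2.dst != i3.dst &&
      !happensBefore(i2.clock, i3.clock) &&
      !happensBefore(i3.clock, i2.clock)
}
```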
Okay, so this is great. We’ve reduced the size of the schedule space by some factor k, where k … again,
I’m being somewhat sort of loose in my notation, but k is like the size of each equivalence class. But in
general, this is still intractable. So our approach is to prioritize the order in which we explore schedules
within this schedule space, and the idea here is that we’re given … the user is going to give us some fixed
time budget—they’re only willing to wait so long before they hit control C—and then we’re going to try
and make as much minimization progress as we can within that time budget. Okay now, some of you
may observe—may have already observed—that for any prioritization function that we choose, an
adversary could construct the program under test, or even just the initial
execution, such that whatever prioritization function we chose makes very, very little progress within
our time budget. And the reasoning behind this objection is actually absolutely sound, but our
conjecture is that the systems that we care about in practice are not adversarial—in particular, they
exhibit some set of program properties, or they adhere to some constrained computational models that
make them amenable to the kinds of prioritization functions that we’re gonna define. Does this make
sense? So the research agenda is really trying to define what are these program properties that we care
about in practice and then defining prioritization functions for them.
So I’ll briefly go over the main program property that we assume. Consider a single process. You can
think of the single process as a state machine—some potentially infinite I/O automaton—and each
transition here is the event of receiving a message. Now, at each state in this state machine, we have
some set of local variables. I’m only showing two here—x and y—but you could imagine it might have
many more local variables. Now consider the cross product of all of the processes in the cluster. This is,
again, some state machine. This defines, sort of, the behavior of the overall system. Now, let’s assume
that two properties hold. One is that any given invariant—we might care about lots of different
invariants—but any single invariant is defined over a small subset of the process’s local variables. So we
only look at, maybe, x and not x and y. In this case … again, imagine that there are many, many variables
here. Another property would be that each event—each transition in this diagram here—only affects a
small subset of the receiver’s variables’ values. So in other words, when we receive a message, the
receiver is not going to flip all of its local variables, it’s only gonna flip some local variables that are
relevant to that kind of message, okay? Now, if those two facts hold, then it seems highly likely that the
initial execution—the initial execution is some path through this overall state machine here—that initial
execution will define a path that contains loops. And essentially, what we’re going to do is remove
those loops. So if we were to reduce this state machine to only the transitions that affect the invariant's local variables, then that state machine would contain loops, and we're going to remove those loops. Does this make sense?
Okay, now the challenge, of course, is that we don’t know … we treat the … we’re treating the system as
a black box, and because we’re treating it as a black box, we don’t know which local variables are
actually relevant or not, or which events are actually relevant or not. So our approach is to experiment with different executions in order to infer which of those local variables or events are relevant. Does this make sense? Okay, and the
key insight here is as follows: we know one very important piece of information, which is that the
original execution, when we execute it, caused the system to violate the invariant. So you can think of
this original execution as sort of a guide for how we can get the system to progress through its state
machine so that it ends up in this buggy state here. But again, like I said, there are gonna be some loops in the path through that state machine. So what we're going to do is
we’re going to selectively mask off some of these original events and see if we can still trigger the
invariant violation so that we can find a shorter execution.
Okay, so the way we do that is somewhat detailed; I’ll walk through it slowly. Recall that we’re trying to
minimize external events first, so just consider the external events from that original execution. Now
let’s suppose that we only consider the right half of these external events. We’re going to ignore the left
half. Now, the algorithm that we use to do this splitting between right half and left half is called Delta
Debugging. Conceptually, you can just think of it as just a modified version of binary search. Okay, so
now, [coughs] we’re just gonna consider these last three external events, and now, we’d like to know: is
there some schedule—some execution—containing just these three external events that would trigger
the same invariant violation? So the way that we find that schedule is as follows. We walk through each of the original internal events, and we check: is the message that we originally delivered at this point in the execution currently pending in the buffer in our current execution? If it is, we go ahead and deliver it. If we ever get to a message that is not currently pending, we'll just skip over it. And we'll keep doing this for all the messages that we delivered in the original execution, okay? And at the end, we just check: is the invariant violation still there? Now, let's suppose in this case
that it was; it was still there. What that means is that we can now ignore the first three external events.
Those are not necessary.
>>: Does this require re-execution?
>> Colin Scott: Yes. Yeah, exactly. We’re going to re-execute, starting from a known initial state, for
each of these schedules, yep. Okay, now we’re gonna proceed with the rest of our minimization. So
now we're gonna consider just the right half of the remaining external events, and again, we're gonna try and find some schedule. Now, let's suppose, in this case, that we didn't find the invariant
violation at the end. That doesn’t necessarily mean that we’re done. There could be some other
schedule containing just these last two external events that also triggers the same invariant violation.
So remember that algorithm I told you about earlier, DPOR. What DPOR will do is set
backtrack points at each step in this execution where there was some alternate, noncommutative choice
it might have made about which message to deliver next, okay? Now, what we’re going to do is
continue exploring these backtrack points until either we find some schedule that triggers the invariant
violation, or until we run out of our time budget for that subsequence, okay? Then we’ll continue this
until we finally produce our minimal output.
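Putting the pieces together, the loop just described looks roughly like the sketch below (again reusing the toy types from the earlier sketches; DPOR backtrack points, time budgets, and the application of the external-event subset itself are omitted, and the real delta debugging algorithm also tries complements and finer granularities):

```scala
object MinimizationSketch {
  // "Stay close to the original" replay: walk the originally delivered messages in
  // order, deliver each one for which an equivalent message (per the user-supplied
  // fingerprint) is currently pending, skip the rest, then check the invariant.
  // Returns true if the invariant violation was reproduced.
  def replayCloseToOriginal(originalDeliveries: Seq[Message],
                            net: NetworkBuffer,
                            procs: Map[String, StatefulProcess],
                            fingerprint: Message => Any,
                            invariantHolds: Map[String, Map[String, Any]] => Boolean): Boolean = {
    for (orig <- originalDeliveries) {
      net.pendingMessages.find(m => fingerprint(m) == fingerprint(orig)).foreach { m =>
        net.deliver(m)
        procs(m.dst).handle(m).foreach(net.send)
      }
    }
    !invariantHolds(procs.map { case (n, p) => n -> p.localState })
  }

  // Very simplified delta debugging over the external events: try the right half,
  // then the left half, and recurse into whichever still reproduces the bug.
  // `reproduces` is expected to reset the system to its initial state and replay.
  def ddmin(external: Seq[ExternalEvent],
            reproduces: Seq[ExternalEvent] => Boolean): Seq[ExternalEvent] =
    if (external.size <= 1) external
    else {
      val (left, right) = external.splitAt(external.size / 2)
      if (reproduces(right)) ddmin(right, reproduces)
      else if (reproduces(left)) ddmin(left, reproduces)
      else external
    }
}
```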
Now, there was one major detail which I swept under the rug here, which is that it’s not always
straightforward for us to compare messages in the current execution with messages from the original
execution, because we’re essentially modifying history. So as an example here, let’s look at one … we
have some message that we originally delivered on the left here. And now, in our modified execution,
we’ve got some pending message, which looks very similar except for one field, which is our sequence
number here. So let’s suppose, in this system, that the processes keep a sequence number; this is just
an integer that they increment by one for every message they send or receive. Now, we removed some
of the events, so now the sequence number has a lower value. But we, as humans, know that—for any bugs that don't involve the sequence number, which should be most bugs—we should be able to mask over that message field. We know that the value of that local variable should not affect whether or not we trigger the bug. So we know that we can mask
over that when we’re comparing the equivalency of these two messages. And by the way, the intuition
for this observation is exactly the same as the one I showed you earlier. This sequence number in the
message is reflected in some local variable at each process, and that local variable does not happen to
affect our vi … invariant. Yeah?
>>: Is it possible to infer this by analyzing the program …
>> Colin Scott: Yes.
>>: … to see some local var … the relation between local variables?
>> Colin Scott: Yes, that's exactly right. Another thing you might do—other than applying a program analysis—is try to experimentally infer which of these fields are nondeterministic. We don't do that. For our prototype, we just assume that the user is going to give us some fingerprint function which tells us which fields to ignore. But I think you're absolutely right; that would be an interesting avenue for future work. In general, we want to decrease the amount of engineering effort it takes to get our system to work.
Okay, so in the first phase, we’re gonna use this user-defined fingerprint function to choose the initial
schedule that DPOR executes. And then, in the second phase, what we’ll do is we’ll only match
messages according to their type. So by type here, I mean a class tag or an object type—a language-level class tag. And the intuition here is that messages that have the same type should be semantically similar to the original, except they're gonna differ in some of the values of the message fields. So again, we're trying to stay as close as possible to the original execution, except now, we're going to explore backtrack points whose messages only differ in the values of some fields. Does this make sense? Yeah, Ratul?
>>: So tell me something. So what exactly happens … so suppose you don’t … you want to drop
message with sequence number three and instead, go directly to one with five; your matching will say
they’re equivalent, but when you inject it into your system, is the sequence number three or five?
>> Colin Scott: In the example I gave, we’re trying to play this message on the left here, but it doesn’t
exist in our pending buffer. There’s only one message that matches; it just happens to have a sequence
number of three. So the user-defined function tells us that these two are equivalent. Now, suppose that there was some ambiguity—there was some other message that had the same type—in that case, we're going to explore a backtrack point. Sorry, I have too many animations. So if there's ambiguity in which message we might choose—there are multiple pending messages that match—we'll try one of 'em now, and then backtrack, and try the other one in its place later, if that makes sense.
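A user-supplied fingerprint function of the kind just described might look like the following; AppendEntries and its fields are hypothetical examples, and the real message types and masking rules are application-specific:

```scala
// Hypothetical message type with a nondeterministic field (seqNo), plus a
// fingerprint that masks it, so messages differing only in sequence number are
// treated as equivalent when matched against the original execution.
case class AppendEntries(term: Int, leaderId: String, seqNo: Int, entries: Seq[String])

object Fingerprints {
  def fingerprint(msg: Any): Any = msg match {
    case AppendEntries(term, leaderId, _, entries) =>
      ("AppendEntries", term, leaderId, entries)   // seqNo deliberately dropped
    case other =>
      other.getClass.getSimpleName                 // second phase: match by type only
  }
}
```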
Okay, so the last observation we make is that often, the contents of external messages affect the
behavior of the system. And if we try and minimize the contents of those external messages, we can get
better minimization. So I’ll go over a brief example here. Let’s suppose that our system assumes that
we have some bootstrap message. This bootstrap message tells each process what are their peers. So
at the beginning of time, they don't know who their peers are, and in our execution, we happen to send them a bootstrap message with a list—a, b, c, d, e—which tells them the cluster membership. Now, let's suppose that we're going to mask off some of these values of that list. We'll
mask off one, and then we’ll mask off two. At some point, the quorum size will actually change; it
changes from three to two, and now, the remaining processes in this cluster have to wait for fewer
messages before they can achieve quorum. So if we minimize these external contents—which we
control—we can in fact get better internal minimization by doing this. Does this make sense? Okay.
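As a quick worked example of the quorum effect just described (this is just standard majority-quorum arithmetic, nothing DEMi-specific):

```scala
object QuorumExample {
  // Quorum is a strict majority of the cluster.
  def quorum(clusterSize: Int): Int = clusterSize / 2 + 1

  val full   = quorum(5) // 3: with peers a, b, c, d, e a candidate needs 3 votes
  val masked = quorum(3) // 2: mask off two peers and only 2 votes (fewer deliveries) are needed
}
```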
>>: Is it the saving [indiscernible] I mean, by changing the quorum size, you might miss a bug, or—
right—maybe the bug was only happening—you know—with f equals two, because—you know—you
sort of hard-coded a quorum of looking for two messages only or something.
>> Colin Scott: Yeah.
>>: And they won’t find the schedule.
>> Colin Scott: Yeah, so we’re always doing this experimentally. So in that case—in the case that you
described—if we tried … we had the original execution, which triggered the bug, and which had five
entries; we try masking off one of the contents; and everything we try doesn’t trigger the bug. We try
masking off two; everything we tried doesn’t trigger the bug. So we say … so we put our hands up, and
we say, “Okay, fine, we’ll go back to five.” Yeah. Okay, so just to … yeah, go ahead.
>>: Yeah, so back to that example—I think just a follow-up to Christoph's original question. So there's the process of exploring this state space; you can do that in a very smart way; and you argue that—you know—combining that with randomness actually helps exploring a bigger space.
>> Colin Scott: Right.
>>: But for this particular one, I think you'd probably be better off having—you know—three …
>> Colin Scott: From the be … very beginning.
>>: Yeah, we will probably find the same bug …
>> Colin Scott: Yeah.
>>: … if it actually shows up.
>> Colin Scott: I totally agree. So if you’re just doing randomized testing, you should just start with a
small cluster size, and then increment it. However, if you’re trying to minimize production executions,
you don’t necessarily have control over the size of the initial cluster. So this technique might be more
useful for a production run, yeah.
>>: But hang on one second. I mean, your technique fundamentally requires this ability to replay
executions.
>> Colin Scott: Yeah.
>>: If you have generated a fault from a production execution, how are you going to do replay?
>> Colin Scott: That's a great question. I have a section in my dissertation which discusses how you might do that. So you need a couple of things; you need a partial order—you might need a Lamport clock on all of your messages, so you can always partially order the execution.
>>: Uh-huh.
>> Colin Scott: You would also need to make sure that—like you said—that each external event, like a
failure, is non-redundant—so you have exactly one of those, a unique event for each external event. So
that might … I totally admit that that might be hard to obtain in practice. We also have some thoughts
about how you might reduce that overhead. So if you’re looking at the kinds of systems we’re looking
at, which are actor systems or message-passing systems, it gets a lot easier; you don’t actually need—
because there’s no shared state—you don’t actually need Lamport clocks. So we have some thoughts
about that, but we haven’t pursued it yet. Yeah?
>>: Did you let these … skipping something like [indiscernible] is impossible.
>> Colin Scott: Sorry, say that again? So …
>>: Do you assume that you are just skipping over some events—that’s what you’re doing as in this
binary search.
>> Colin Scott: Yeah.
>>: This message being [indiscernible] You assume that doing that will not crash [indiscernible]
>> Colin Scott: Ah, you're absolutely right that when we do this splitting between left half and right half, we might end up with a semantically invalid split, which might cause a crash or just be a nonsensical set of external events.
>>: So that’s the invariant.
>> Colin Scott: Yeah, so in that case, we assume that the invariant that the user defines correctly
disambiguates those cases. So if there's a crash that was not our original bug, we'll save that for later, like, "By the way, we've figured out a way to trigger a new bug," but we're not gonna use that for minimization, yeah. Okay, almost done here, so I'm just summarizing all of our techniques in one slide here—I guess you can read it yourself. One thing I will say is that once we've minimized external
events, we’ll go ahead and use the same techniques that we use to minimize external events to then
minimize internal events. So like I said, that just means we’re gonna keep them in the buffer, and not
deliver them, and see if we can still trigger the bug.
Okay, so before I end, I'll quickly look at how well this works in practice. We've applied our tool to two different distributed systems so far; one is an implementation of the raft consensus protocol, and the other is the Spark data analytics engine. You should note that these are obviously two very different systems; they're trying to achieve very different things; they have very different behaviors. And also note that the raft implementation we looked at was a somewhat early-stage development project, whereas Spark is obviously very, very mature. So what I'm showing here is the
size of the executions on the y-axis, and then on the x-axis is each of our case studies—so each of these is a distinct bug. And then the blue bars here are showing the size of the initial execution that we found with randomized concurrency testing, and the green bars here show the size of the minimized output that our tool produces. So a few takeaways from this graph: I don't actually show it here, but we found that randomized concurrency testing—or fuzz testing—turns out to be very useful for uncovering bugs that developers didn't anticipate. So this raft implementation had existing unit tests and integration tests, but each unit test only explores one particular ordering,
and there’s an exponential space of possible orderings, so there are a lot of things that they didn’t
anticipate. Yeah, go ahead.
>>: Question about the use of the phrase “fuzz testing.”
>> Colin Scott: Yeah.
>>: I think by that phrase, you mean that you have control over all scheduling choices.
>> Colin Scott: That’s right.
>>: And you randomly pick one choice.
>> Colin Scott: Yep.
>>: Right? So I just wanted to point out that that’s what you’re assuming. A lot of people in the
industry …
>> Colin Scott: Yep?
>>: … they just think that: “Oh—you know—fuzz testing is when I artificially inject some delays here and
there; I create big workloads to indirectly—not directly—influence schedules.”
>> Colin Scott: Right, right.
>>: Okay.
>> Colin Scott: And I agree that in production, that might be what you want to do. So in production, it might be too much engineering effort to actually interpose on all the sources of nondeterminism, so one way to deal with that nondeterminism would be to just replay multiple times—flip the coin multiple times—and then hope that one of those coin flips is gonna trigger the bug. So that's one way. The first chapter of my dissertation—we had a prior SIGCOMM paper—covers how you would do this for systems where you don't have as much control. One messy part is: now you become dependent on wall clock time, which is not a clean way of thinking about the problem, but it's more practical, yeah. So you could, in practice, do that without full interposition. Yeah?
>>: So are raft and Spark here re-implementations, or [indiscernible] did you take some existing implementation, and you're looking at bugs that already existed? What are these bugs here that you're talking about?
>> Colin Scott: So—sorry—Spark and raft were both implemented in Scala; well, Spark is actually multiple languages, but the parts of the process that we interposed on were Scala. Raft is completely written in Scala.
>>: So you used implementations that were already there?
>> Colin Scott: That’s right.
>>: I see.
>> Colin Scott: So Spark obviously is on Apache, and then this one we found on GitHub. All of the bugs in raft, in fact, were previously unknown. So all we did was we sat down and wrote down the five safety conditions that Leslie Lamport prescribed, then we did randomized testing, and we checked those five invariants at each step, and this is what we uncovered. Yeah?
>>: Were these fixed bugs at the end?
>> Colin Scott: Yes. Yeah.
>>: Then if it’s a fixed bug, is it possible to come up with the optimal number of messages?
>> Colin Scott: Yes. I’ll get to that in one slide.
>>: Okay.
>> Colin Scott: Okay, so what we’re showing here is that across all of our case studies, we get pretty
good reduction. We’re improving the state of the art. So they would have started with this, and we give
them this. So we’re … it seems like we’re helping developers, but you might also ask, “How much room
is there for improvement? How far off are we from optimal?” And so what I did here was, again, I’m
showing here the green bars are the size of our … the output of our tool, and then the orange bars are
the minimal trace—so this is the smallest manual trace that I could produce by hand. It was actually fairly painstaking; it took me about a month to do this, but what it shows here is how much room there is for improvement. So across all of our case studies, we're within a factor of five—four point six, actually—of that smallest manual trace. In these two cases here—raft-58a and raft-42—part of the reason that we were so inflated—so far from that optimal—was that we
ran out of our time budget. So what I’m showing here is the minimization runtime on the y-axis, and we
had some maximum time budget. But in general, across all of our case studies, we’re doing pretty
well—most of these finish within ten minutes. In a few cases, if we had a better or smarter
prioritization function, we could have found our minimal output in much less time. So there is definitely
room for improvement, but in general, we’re doing pretty well. Yeah?
>>: How many processes are you running, and how does that play into the size of the trace?
>> Colin Scott: Yeah, good question. In this case—for raft—we were running four processes; for Spark, we were running—I mean, they have a master and a whole bunch of other processes—like, roughly a dozen. If we had more processes, the executions could be much, much larger, so …
>>: Yeah, I’m impressed that there are bugs that require even, like, fifty-something messages to
reproduce in four processes. So that’s impressive.
>> Colin Scott: Right. Yeah, that's right. If you think about it, this is only a
couple of round-trips.
>>: Yeah.
>> Colin Scott: If you have four processes, and there’s some bootstrapping …
>>: Oh, each RPC is two, and then …
>> Colin Scott: Yeah.
>>: Yeah, alright.
>> Colin Scott: And there are always gonna be some bootstrapping messages present in the execution, so … okay. Alright, so there are many more details that I didn't have a chance to go over here. I'd encourage you to check out our NSDI paper for a much more lucid explanation of those details. So I probably don't have time to do a demo. I have a pretty cool demo; it's essentially a GDB breakpoint for your network. So you get to choose which pending message to deliver next. So I think it's pretty neat.
>>: Yeah.
>>: Go ahead. Do it.
>>: Let’s do it.
>>: Do it.
>>: Let’s do it.
>> Colin Scott: Okay, sweet. Okay, so I've got a Java program here. What I did is I compiled the raft implementation, which is a Scala or Java program, and then I used AspectJ to interpose on the RPC library that it uses. I've got this dash-dash-interactive flag; that's just telling our tool how it's gonna behave. So what I'm showing here: I've got four raft processes in this cluster, and I'm just printing out the external events. In this case, the processes assume they have some bootstrap message that tells them their peers, and then we also have two client commands—in this case, append word: please append a word to the raft log—and then this here is just our prompt for the console. A little bit further up, I'm just showing you the console; this—here at the top—is the console output printed out by each process. So it's just saying, "Oh, I'm getting ready to run my raft implementation." And then, now, what
I’m showing here is the set of pending events. So these are the messages that have been sent, but not
yet delivered. So we interposed on them, and now we get to choose which of these we're gonna deliver next. So now I've got—essentially—a little GDB prompt. I just type "help," and it tells me all the commands I can run. So now, what I might do is deliver some message, or check the invariant, or cause one of the processes to fail, et cetera. So
let’s suppose that we’re gonna … so the way …
>>: You're running all these processes on your machine, right?
>> Colin Scott: Yeah, so … yeah, that’s right. So each of these processes is running locally on my
machine, but they are not aware of colocation. So as far as they know …
>>: Oh, okay.
>> Colin Scott: … as far as they know, they’re running a distributed system across the network.
>>: Yeah.
>> Colin Scott: Okay, so now, let’s suppose that we allow the first process to have its bootstrap
message. Now it's gonna set a timer. It sets a timer, and it says, "When this timer goes
off—I’m gonna try and elect myself as leader.” So let’s go ahead and allow that timer message to go
through. Now, it’s going through a state transition where it wants to begin an election. Again, I’ll let
that message go through, and now, it’s sent request-vote messages to all of the other processes in the
cluster; so it’s gonna go and try and get itself elected. Now, if I had some bug in mind about—some
particular interleaving in mind—about how to trigger some bug, I could use this system to try and trigger
it. And then, at any point, I can check the invariant, and in this case, we haven’t done much, so there’s
no var … there’s no violation yet, but we could just keep doing this until we find the bug. And now when
I exit out of this prompt, we saved a recording of the … of all the events that we played. So I could now,
if I wanted to, replay the execution that I had generated here. I could also do this programatically
instead of doing it interactively; I could also just make random choices programmatically. And one thing
to note here is that because we’re interposing on timers, we’re able to execute these … run these
executions much, much faster than you would be able to in practice, and because we don’t have to wait
for the wall clock time of each timer to go off. So it’s actually pretty amazing how many executions we
can go through in a minute. If we get lucky, we might trigger the bug. I know there’s a known bug here,
but maybe I won’t test my luck. So I actually have a … I also saved a recording of some fuzz test that
ended up with triggering the invariant violation, so now, what I’m a do is try and minimize the recording
that I put to disk. So now, what it’s doing is it’s—like I said—it’s … what I describe in the talk is just
experimentally trying smaller subsequences of events, and eventually—after about ten minutes—it’ll
stop and tell us that it has some minimal output.
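To make “experimentally trying smaller subsequences” concrete, here is a minimal sketch of that outer loop in Scala. It is simplified in two ways worth flagging: replay and violates are hypothetical stand-ins for the tool’s replay engine and invariant checker, and it drops one event at a time rather than using the delta-debugging-style splits the real minimizer starts with.

```scala
// Keep any smaller subsequence of external events that still reproduces the
// invariant violation; stop once no single-event removal still triggers it.
// `replay` and `violates` are hypothetical stand-ins for the real machinery.
def minimize[E, S](events: Seq[E],
                   replay: Seq[E] => S,
                   violates: S => Boolean): Seq[E] = {
  var current = events
  var shrunk = true
  while (shrunk) {
    shrunk = false
    for (i <- current.indices if !shrunk) {
      val candidate = current.patch(i, Nil, 1)   // try the trace without event i
      if (violates(replay(candidate))) {
        current = candidate                      // still buggy, so keep the smaller trace
        shrunk = true
      }
    }
  }
  current
}
```

Each accepted removal is a full re-execution of the cluster, which is why the time budget that comes up in the questions below matters.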
Now, I think one cool aspect of this tool is that we’re allowing developers to generate regression tests without having to write much code; well, they have to write a little bit. They just tell our tool, “Hey, please fuzz,” and using the invariant that we gave it, it’ll run fuzz tests until it finds some bug, then it’ll minimize the trace and save it to disk. And now we can actually go change the system: we can add print statements to help us debug, or we might even change how the system behaves. And then, say a month later, we can just rerun that minimal execution we saved to disk. If the protocol changes radically, that recording will no longer be valid, but as long as the protocol doesn’t change too much, we can rerun that test to see whether the bug comes up again. So I think this is pretty neat.
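As a small sketch of the save-and-replay-later piece, here is one way a minimized trace could be persisted to disk; the Event type, the file name, and the commented replay step are all illustrative, since the real tool has its own trace format and replay engine.

```scala
import java.io._

// Illustrative external-event record; the real trace format is richer.
case class Event(kind: String, payload: String)

object TraceStore {
  // Persist a minimized trace so it can serve as a regression test later.
  def save(events: Seq[Event], path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(events.toList) finally out.close()
  }

  // Load it back, say a month later, after the system has been modified.
  def load(path: String): Seq[Event] = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[List[Event]] finally in.close()
  }
}

// A regression run would then feed the trace back through the scheduler and
// re-check the invariant (hypothetical scheduler API):
//   val trace = TraceStore.load("raft-election-bug.trace")
//   assert(invariantViolated(scheduler.replay(trace)))
```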
>>: So what is the complexity of this minimizing tool that you are running right now?
>> Colin Scott: Well, the complexity in the worst case is n factorial, like we said.
>>: [indiscernible] factorial.
>> Colin Scott: Yeah, but I actually have the system configured to exit fairly soon, so I give it a small time budget; in this case, it’ll actually finish in about ten minutes. Yeah, I don’t know if we want to wait ten minutes, but … [laughter] yeah?
>>: How large is the raft program you are testing with?
>> Colin Scott: Raft is relatively small; it’s fifteen hundred lines of Scala. Spark is much, much larger;
Spark is something like forty thousand lines of code.
>>: Yeah, but you only interpose on certain messages, right? You’re not trying all the possibilities. For instance, you start with fifteen hundred lines, but I assume the code that’s actually touched is probably a small fraction.
>> Colin Scott: That’s right. So in … so one major advantage of doing black-box testing is … yes?
>>: [indiscernible] sorry.
>> Colin Scott: Oh, sorry. One major advantage of doing black-box testing is that the system might be written in multiple languages, and, like you said, there might be lots and lots of code that’s actually being used, and we’re agnostic to that.
>>: I understand, we just walked through it together. But do you know how much of the Raft program is actually involved in this?
>> Colin Scott: There was a little bit of work I needed to do to deal with nondeterminism; there are some sources of nondeterminism outside of the RPC layer. Conceptually, all we do is interpose on the RPC layer, and that’s it; we interpose below the application, so we shouldn’t have to touch it. But in some cases, applications also depend on nondeterminism that’s outside the control of the RPC library. An example would be a hash map: suppose you keep values in a hash map and then iterate through all the values; the order in which the JVM places values in the hash map depends on the memory address, which is nondeterministic. So in some cases, we had to sort the values of the hash map to get better determinism. So there were some changes we needed to make to the application, but for the most part, conceptually, we’re agnostic to it.
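A minimal illustration of that kind of fix follows. The Peer class and the values are made up; the point is just that a key type which falls back to identity hash codes gives the map a run-to-run-dependent iteration order, and sorting on a stable field before iterating restores determinism.

```scala
import scala.collection.mutable

object DeterministicIteration {
  // A key type with no overridden hashCode: it uses the identity hash,
  // which depends on object placement and can vary across runs.
  final class Peer(val name: String)

  def main(args: Array[String]): Unit = {
    val replicas = mutable.HashMap[Peer, Int](
      new Peer("c") -> 3,
      new Peer("a") -> 1,
      new Peer("b") -> 2
    )

    // Nondeterministic: iteration order depends on the identity hash codes.
    // replicas.foreach { case (p, n) => send(p, n) }

    // The small application-side change: sort on a stable field first, so a
    // replayed execution sees the same order every time.
    replicas.toSeq.sortBy(_._1.name).foreach { case (p, n) =>
      println(s"${p.name} -> $n")
    }
  }
}
```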
>>: Mmhmm.
>> Colin Scott: Yeah?
>>: You talked about applying this to multi-threaded apps …
>> Colin Scott: Yeah.
>>: [indiscernible] Yeah.
>> Colin Scott: Yeah, good question. I’ll just go right to a slide here. Like I said earlier, shared memory is functionally equivalent to message passing, so there are actually a bunch of other papers that look at essentially what we’re trying to do: systematic exploration of schedules for multi-core systems instead of distributed systems. And the basic idea is that you interpose on the language runtime, so you detect whenever a thread reads or writes shared memory, and then you trap, block that thread, and treat it as if we had just sent a message; it’s conceptually the same. It would be some amount of engineering effort to interpose on that runtime, which we haven’t done yet, but you could do it.
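A toy sketch of that idea, with all names illustrative: every shared-memory access first blocks until a central scheduler “delivers” it, so accesses can be ordered, explored, and replayed the same way pending messages are. A real tool would interpose through the runtime, for instance with bytecode instrumentation, rather than hand-written wrappers.

```scala
import java.util.concurrent.Semaphore

// Central scheduler that decides when a pending shared-memory access may
// proceed, mirroring how the message scheduler picks which RPC to deliver.
final class ToyScheduler {
  private val turns = new Semaphore(0)
  def awaitTurn(): Unit = turns.acquire()   // accessing thread blocks here
  def grantTurn(): Unit = turns.release()   // "deliver" one pending access
  // A real scheduler would also record which thread it unblocked, so the
  // choice could be replayed; a bare semaphore leaves that to chance.
}

// A shared variable whose reads and writes are treated like messages: they
// do not happen until the scheduler grants a turn.
final class ScheduledVar[T](initial: T, sched: ToyScheduler) {
  @volatile private var value: T = initial
  def read(): T         = { sched.awaitTurn(); value }
  def write(v: T): Unit = { sched.awaitTurn(); value = v }
}
```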
>>: And the invariant is [indiscernible] shared.
>> Colin Scott: Yeah, the invariant would be defined over the shared memory, yeah.
>>: You’d better not be using too much shared memory, or it won’t scale.
>> Colin Scott: That’s right, or the schedule space would be too huge. And then—I don’t know—Shaz
Qadeer has a bunch of techniques for helping with that problem, yeah. [laughs]
>>: I’m a huge fan of binary search, so I’m glad you’re using it productively. I have a comment about the problem formulation.
>> Colin Scott: Yeah.
>>: You start out by saying you want to minimize an existing trace …
>> Colin Scott: Right.
>>: I’m wondering if you had set the problem in the following alternative way …
>> Colin Scott: Right.
>>: … and you just say that I have a faulty execution, and I have a generic tool for doing prioritized
search available to me, except that the tool needs a particular prioritization function.
>> Colin Scott: Yeah.
>>: And all I’m going to do is—I’m not gonna bother trying to minimize this execution—I will just try to
learn from that execution a prioritization function, or maybe a family of prioritization functions, that’s
likely to lead to a violation of the same invariant.
>> Colin Scott: Yep.
>>: Forget about trying to minimize that particular execution itself.
>> Colin Scott: Right.
>>: So how would you compare that problem formulation to what you already have?
>> Colin Scott: I actually think your intuition is right that it’s very, very similar—it’s almost the same.
>>: Okay.
>> Colin Scott: So in future work … we don’t currently learn. I have some master’s students I’m working with who are looking at using this tool called Synoptic; Synoptic looks at log files and tries to learn a state machine from them. So what they’re looking at is, like you said, trying to learn some prioritization function based on the observed behavior of the system. Another answer to your question is: conceptually, what we’re doing is essentially explicit-state model checking.
>>: Okay, prioritized search.
>> Colin Scott: Prioritized search.
>>: Yeah.
>> Colin Scott: And … except that we have one additional piece of information, which is we know that
the original execution triggered the bug.
>>: Right.
>> Colin Scott: But that’s essentially what we’re doing.
>>: Okay.
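One way to picture that framing is a prioritized explicit-state search in which candidate schedules that look more like the recorded failing execution are tried first. The sketch below is illustrative only: the similarity score is deliberately crude, standing in for whatever a learned prioritization function (for example, one derived from a Synoptic-style inferred state machine) would actually compute, and run and violates are hypothetical stand-ins for executing a schedule and checking the invariant.

```scala
import scala.collection.mutable

// Crude similarity score: how many positions of the candidate schedule agree
// with the original failing execution. A learned prioritization function
// would replace this.
def similarity[E](candidate: Seq[E], original: Seq[E]): Int =
  candidate.zip(original).count { case (a, b) => a == b }

// Explore candidate schedules in order of similarity to the known-bad one,
// stopping at the first schedule that violates the invariant.
def prioritizedSearch[E, S](original: Seq[E],
                            candidates: Iterator[Seq[E]],
                            run: Seq[E] => S,
                            violates: S => Boolean): Option[Seq[E]] = {
  val byCloseness: Ordering[Seq[E]] = Ordering.by((s: Seq[E]) => similarity(s, original))
  val queue = mutable.PriorityQueue.empty[Seq[E]](byCloseness)  // max-heap: closest first
  candidates.foreach(queue += _)
  while (queue.nonEmpty) {
    val schedule = queue.dequeue()
    if (violates(run(schedule))) return Some(schedule)
  }
  None
}
```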
>> Colin Scott: Yep. Okay, I had a couple of slides here before I ended. So this is the technical conclusion: the results we found for those two systems leave us pretty optimistic that these kinds of techniques can be applied more broadly. Our tool is open source; you can check it out on GitHub. And of course, I’d encourage you to check out our paper. I wanted to go over a couple of slides about where my research trajectory is currently headed. I have done work on a lot of prior topics, not just minimization; the common thread through all of these is troubleshooting and reliability. My background is in networking and distributed systems. I think this is an increasingly important area, because everything is becoming distributed; we all have smartphones in our pockets. But unfortunately, the kinds of tools that we use to develop concurrent and distributed systems lag significantly behind the kinds of tools we have for sequential code. Actually, this is a quote from Parkinson; Parkinson is an MSR researcher. I never met this person, but they have a conjecture, or a claim, that the kinds of tools we use to develop concurrent or distributed code lag behind sequential tools by about a decade. So I see …
>>: More like twenty years.
>> Colin Scott: Yeah, maybe more, that’s right. [laughter] Maybe even more. So I see a great need to bridge this gap. Now, there’s a lot of great research from the software engineering, programming languages, and formal methods communities on how to debug, verify, et cetera; lots and lots of cool tools. There are amazing things you can do with program analysis.
>>: [indiscernible] I thought it was sequential. Sequential [indiscernible] what does sequential mean?
>> Colin Scott: Well, a distributed system is just a concurrent system that happens to also have partial failure and asynchrony. What I really mean is that I’m trying to distinguish between concurrent and sequential code. Lots of verification tools, for example, assume a single-threaded, sequential computational model.
>>: There’s also a huge gap between, like, concurrently working and also [indiscernible]
>> Colin Scott: That’s right. Yeah, that’s right.
>>: Numbers together one place.
>> Colin Scott: That’s right. You’re exactly right; partial failure makes things even worse. So anyway, the point is that I see a great need to bridge this gap, and there’s a lot of great research from these two communities that we could use to help us deal with these problems.
So I have a bunch of projects that I would like to pursue. These are just some ideas I’ve had in the last five years for how we could take ideas from software engineering and programming languages and adapt them; it’s not just straightforward adaptation, we have to do some research here; but we could apply those ideas to problems in concurrency and distributed systems. So this is basically the direction that I’m planning to move in.
So thanks a lot. If you have any questions you can e-mail me. My e-mail’s very easy to remember,
cs@cs.berkeley.edu. [laughter] Thank you. [applause]
>> Jay Lorch: We peppered him with many questions throughout the talk. Are there any remaining?
Okay, great. Then we’re done.
>>: Whoo-hoo.