>> Peter Bodik: All right. Welcome everybody. Happy to
introduce Ivan Beschastnikh who recently graduated from UW and
will be starting as an assistant professor at UBC up in
Vancouver. And he'll talk about his work in software
engineering particularly how to understand complex logs from
distributed systems. All right.
>> Ivan Beschastnikh: Thanks, Peter.
Hi, everybody.
So today I'll be telling you about my basically dissertation
work, and the perspective that you should have in this work is
what I'm attempting to do is the following, take the log like
the one you're seeing on the left and attempting to convert it
into a model, a more abstract model like this thing on the
right. And the idea is that you can use this model for various
other tasks once you've done this conversion.
So kind of the high level view, though, for this is that I
started in systems, have been building large systems, some of
them distributed, and have always been running into this
problem, right, which a lot of people face, which is that you
build the system and somehow it behaves in an unexpected way.
It does something that you would not expect it to do. So the
question is how do you answer this question.
And typically the setting here is that you have written some
code and perhaps lots of code and you've looked at it and you've
attempted to test it, but you still have this question. And of
course there's some developer, some confused developer, and this
developer has a mental model of what the code is supposed to do.
So, right, she's thinking about the code. She wrote down the
code. And the reason this question comes up is that there's
some disconnect, right, some mismatch between the mental model
the developer has of their artifact and then the actual
artifact, the implementation.
So one typical way to kind of try to bridge these two is to
instrument the system, run it, get a bunch of run time
observations, behavior of the system. So you would output this
log, and then you would inspect this log and try to find out,
okay, there's some lines in this log that somehow when mapped to
my mental model give me some kind of contradiction, right. They
create -- you know, maybe there's a loop on that invalid state,
and that loop -- I had never thought about a loop, but it
actually exists in the implementation. So the idea here is that
you would generate this log, inspect it and then check it for
validity against your mental model.
And this sounds really plausible except in practice your systems
are going to have huge logs. So your logs may be gigabytes
long. They might be very elaborate. And as a developer faced
with this very large log, you're not really sure where to
look within the log or what to look for. So the problem is the
log gives you a very concrete view of the system, and it's very
easy to get a very large log, right. I just record all the
methods I called. I just record all the activities. It's very
easy to get a lot of information. The problem is to actually
inspect it.
So, you know, and then in a distributed setting, things get
worse, right. The problem is that now you have multiple
hosts, processes, and you have to reconstruct a distributed
execution where you've captured logs for the different entities
in your system, and somehow you have to string them all
together.
So going back to this, to this kind of setting what I'm going to
tell you about today is an attempt to actually replace the log
with something a little bit more abstract. So you want to not
deal with the log but deal with something that matches your
mental model more closely, right. And the idea is that if you
provide the developer with a model that models the log, a
different representation, if you will, that's closer to their
mental model, then it would be much easier
for the developer to actually inspect the model and find the
disconnect, find where their mental model differs from the
implementation. So that's kind of the context for the work.
And this process of going from a log to a model is typically
referred to as model inference; at least that's what I'm going
to refer to it as. And lots of great use cases for this. So
I've told you about mental model validation, but you could also
use this for test case generation.
So, for example, you generate the model from this log, and if
your model generalizes then you can predict behaviors that are
plausible for your system, and then you can use those behaviors
to actually induce executions in the system. You could also use
it to evaluate test suites. So you might say I tested my
system, you know, in this one environment, and then I deploy it
in production. So its behavior in the test environment is going
to differ from production behavior. And so how do I compare
that behavior? How do I know what's present in production
that's not present during testing.
So one way to do that is to generate the two different models and
then compare them. We know how to compare models in different
ways. And you can find paths that are in production that are
not in testing that should be exercised. You can also use it
for anomaly detection. This is kind of the classic case where
you take the model from last week and then you take the model
from today and then you compare the two. So there are numerous
applications you can use this for.
In my case, it's going to be mostly the mental model validation
application. And so I'm not the first person to work on this
topic. People have worked on this in the software engineering
domain, and they refer to this as specification mining or
process discovery. And usually the name here depends on the
task that you're going to apply the model to. And prior work
spans at least a decade of research, and a lot of challenges
remain in this prior work, things like efficiency, accuracy,
and distribution. And those are the ones that I'm going to
sort of talk about in my work today.
So the first one's efficiency, how do you get this process, this
model inference process, to work on very large logs. All right.
So it's okay -- you know a lot of the prior work works fine if
you have a hundred -- you know, this log is a hundred lines long
or 200 lines long. How do you get it to work on a gigabyte log?
How do you make this model more accurate? What notion of
accuracy can you use? And then, finally, how do you get it to
work for a distributed system, distributed setting.
So the three tools that I built are Synoptic, Dynoptic and
InvariMint.
And to briefly go over it, you know, the contribution of
Synoptic is that it gains efficiency through the process of
refinement. And I'll go into more details about what that
means. It gains accuracy by mining certain properties from this
log and then making sure that those properties are going to be
true in the final model.
And then Dynoptic is sort of a follow-on work to Synoptic which
infers a different kind of model. So the model I'm going to
attempt to infer here is one where I have a finite-state machine
for a process. So Synoptic infers a finite-state machine like
model. And in the Dynoptic case, I actually want to model my
distributed system as a set of finite-state machines. And in
Dynoptic, it's going to apply some of the same aspects as
Synoptic like refinement and mining properties, but also it's
going to handle distribution.
And then the final work is InvariMint, which is sort of very
different from the other two, and I'm not going to talk about it
much in this talk, but it fits into this puzzle in that it takes
the idea of mining properties and composing them to the extreme,
essentially, where there is no refinement in InvariMint. It
just mines a set of properties here and composes them in an
interesting way.
So this is really the motivation for my talk. And next I'll
really tell you about Synoptic and Dynoptic. So there's not
going to be any InvariMint in the talk.
So jumping into Synoptic. So the goal is you have this log, and
you want to produce this model. And the way Synoptic is going
to do this -- you know, initially I didn't tell you about any
constraints on the log.
So the first step is going to be just to parse this log. So I'm
going to assume that the user will give me a log and a set of
regular expressions that will match the lines that they care
about in the log. So if you care about disks in your system,
then you give me regular expressions to extract disk events.
And then I'll build you a model that's relevant to just those
sets of events.
The second step is to then build a compact model. You want to
build a model that will include as many behaviors as possible.
So include the behaviors that you have in the log but then many
more things. Then what we're going to do is mine these
properties or what I call log invariants, and they're going to
be very temporal-like things, like lock is always followed by
unlock, open always precedes read, and then use these invariants
to constrain the model. So I'm going to take this initial
model, use the invariants that I mined and then build you a more
accurate model.
>>: [indiscernible]
>> Ivan Beschastnikh: Right. I'll give you -- yeah. I'll give
you an example of what I mean by regular expressions. But
overall, I think of this log as a set of events. So I do not
reason about state at all. So the model I'm going to give you
actually is going to be an event-based model. So I'm modeling
sequences of events. And so your regular expressions basically
have to tell me for every log line that you care about associate
an abstract event with that log line. So if you're sending a
message, like it might be an acknowledgement message, like
launch back into TCP. It has sequence numbers. It has all
sorts of stuff in it. But your abstract event type is
acknowledgement, right. That would make sense if you're
attempting to come up with a model that reasons just the event
types. And depending on the setting, you might say that's
unreasonable, right. So you might say, okay, maybe I care about
every fifth acknowledgement packet. So it would be
acknowledgement five, acknowledgement ten, right.
>>: So this is not an event.
>> Ivan Beschastnikh: Right. So you would include that into
your regular expression. But, in general, this does not reason
about event data. So it's not powerful enough to reason about
event data and state.
>>: So if you have this acknowledgement model, right, if we
wanted to model a handshake, then I have to give all that data,
teasing out all of these events?
>> Ivan Beschastnikh: Yeah, you would. I mean, you know, the
answer here depends on what that log looks like. If your log is
in a certain format, then you can just say you have a regular
expression that matches different kinds of events and extracts
some sort of subset.
Let me work through these different steps. And, actually, I'll
go through them in order.
And the example that I'll use is kind of a very simplified
version of two phase commit. So in two phase commit, you have a
manager and some number of replicas. And the manager is going
to propose a transaction, and then your replicas will reply with
either commit or abort. And then the manager will say, Okay,
collect all this information and then either reply with
transaction commit if I've seen only commits or reply with
transaction abort if there's one abort or more.
So the way we're going to cheat here and use Synoptic is that
we're going to maintain a totally ordered log as the manager.
So we're not going to care about logs anywhere else, because I
can have a totally ordered, global view as the manager. That's
the log I'll plug into Synoptic. So in this case, my
input might be these sets of events, and then my regular
expressions are going to extract just the packet types or event
types that I care about. And for two phase commit, the things
you care about are, you know, what kind of messages are you
sending: propose, abort, transaction abort, commit, and so forth. So
I'm going to essentially extract these execution chains from
this log.
One thing that I should mention is you also need something that
tells you when -- basically tells you execution boundaries. So
in this case, the transaction ID is going to be your execution
boundary. So every new transaction will induce a new execution.
And the kind of format I'm using here is that the square node is
going to be this initial node. So all of the propose nodes are
going to be the first ones. And then the very bottom-most node,
that terminal node, is going to be the response. So I have this
initial set of traces; there is going to be a huge number of them.
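To make these first steps concrete, here is a minimal Java sketch of the kind of parsing being described. The log format, the regular expression, and the use of a transaction ID as the execution boundary are all invented for illustration; this is not Synoptic's actual API.

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical parsing step: a user-supplied regex maps each relevant log
// line to an abstract event type, and a transaction ID (invented here)
// acts as the execution boundary that splits the log into traces.
public class LogParser {
    public static void main(String[] args) {
        List<String> log = List.of(
            "12:01 tx=1 propose payload=0xbeef",
            "12:02 tx=1 commit seq=44",
            "12:03 tx=1 tx_commit");
        Pattern p = Pattern.compile(
            "tx=(?<txId>\\d+) (?<type>propose|commit|abort|tx_commit|tx_abort)");
        Map<String, List<String>> traces = new LinkedHashMap<>();
        for (String line : log) {
            Matcher m = p.matcher(line);
            if (!m.find()) continue;  // unmatched lines are simply ignored
            traces.computeIfAbsent(m.group("txId"), k -> new ArrayList<>())
                  .add(m.group("type"));  // keep the event type, drop the data
        }
        System.out.println(traces);  // {1=[propose, commit, tx_commit]}
    }
}
```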
And what I want to do next is build this compact model. So the
compact model is going to be built very simply, where I want
one node, one abstract node, for every kind of event that I
have. So I take all of the -- for example, I might take all the
propose nodes and create one single propose node to represent
all of them. And I'll take all the commit nodes and create one
commit node to represent all of them. And I'll do this for all
of the event types that I have. And then I'll create edges
between these based on the concrete observations.
So if there's -- if there's a concrete edge between commit and
transaction commit, then there's going to be an abstract edge
between the commit in the abstract model and the transaction
commit in the abstract model. So, basically, I just built you a
model that's very compact, compact in the sense that there's
only one node per event type, and it admits all the behaviors
that I observed in the log by construction. But this model also
admits lots of other things.
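A toy version of this construction, assuming the traces have already been parsed into event sequences (the event names follow the two phase commit example; this is a simplification of the partition graph, not Synoptic's real code):

```java
import java.util.*;

// Toy compact-model construction: one abstract node per event type, with an
// edge a -> b whenever b immediately follows a in some observed trace.
// Stitching traces through shared nodes is what makes the model generalize.
public class CompactModel {
    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("propose", "commit", "tx_commit"),
            List.of("propose", "abort", "tx_abort"));
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        for (List<String> t : traces)
            for (int i = 0; i + 1 < t.size(); i++)
                edges.computeIfAbsent(t.get(i), k -> new LinkedHashSet<>())
                     .add(t.get(i + 1));
        // Every observed trace is a path here by construction, but the
        // model can also admit paths that no single trace contained.
        edges.forEach((a, bs) -> System.out.println(a + " -> " + bs));
    }
}
```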
So, you know, now the question becomes how do I get this model
to be a little bit more accurate. And this is where invariants
will come in or those log properties. So we built this compact
model. Now we're going to mine these salient log invariants.
So it turns out you can get away with very simple properties
for pretty complex systems.
So for Synoptic, we're actually going to use these three
properties. They're going to be temporal properties. You can
express them in LTL, but I'm not going to show that to you.
But, in general, we know from prior work by Dwyer, et al., for
example, that in general when you specify systems there are very
few patterns that you tend to reuse, right?
So these three actually cover the top six patterns out of the
ten or so that Dwyer documented. So the patterns here are going
to be X always followed by Y. And on the example log, X always
followed by Y would look like abort is always followed by
transaction abort. So when you see an abort event, then you
know that before the trace ends you will see transaction abort.
And commit always precedes transaction commit is kind of the
reverse, you know, looking back. So you look at transaction
commit in that third trace, and looking back, you must have
gone through a commit event. So this
one really parallels causality. And then the final one is abort
never followed by transaction commit, which is basically what you
think it is. If I reach an abort event, then I will never see a
transaction commit event in the same execution.
Got it?
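Here is a minimal sketch of how mining one of these templates might look; the trace data is invented, and a real miner would also handle "always precedes" and "never followed by" by swapping out the check below.

```java
import java.util.*;

// Toy miner for "X always followed by Y": the invariant is kept only if it
// holds in every trace. Checking the last occurrence of X is sufficient,
// since any Y after it also covers the earlier occurrences of X.
public class InvariantMiner {
    static boolean alwaysFollowedBy(List<String> t, String x, String y) {
        int lastX = t.lastIndexOf(x);
        return lastX == -1 || t.subList(lastX + 1, t.size()).contains(y);
    }
    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("propose", "commit", "tx_commit"),
            List.of("propose", "abort", "tx_abort"));
        Set<String> types = new TreeSet<>();
        traces.forEach(types::addAll);
        for (String x : types)
            for (String y : types)
                if (!x.equals(y) &&
                    traces.stream().allMatch(t -> alwaysFollowedBy(t, x, y)))
                    System.out.println(x + " AlwaysFollowedBy " + y);
    }
}
```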
>>: To do the model based on the log, you say you want to
get -- the most accurate would be to capture exactly what
happens [indiscernible] compact model?
>> Ivan Beschastnikh: It's actually -- that initial model that
I told you about, right, that construction -- the model
represents the log, but it also captures other things. So there
are paths in that model that you have never observed that are
illegal. And it can happen because you're
essentially stitching together executions. So here in this
model, this edge might have come -- abort always followed by
transaction abort might have come from here, right?
But then the edge preceding it might have come from a different
execution, and so you might get a path in there that actually
composes two different executions, right?
And this is really the power of this model is that it
generalizes by stitching together different executions based
on -- based on common events.
>>: Models?
>> Ivan Beschastnikh: Well, now my question is I have this
model, and it generalizes, right? How can I make it a little
bit more accurate. And accuracy will come from these
invariants, so these things that I told you about.
What I'll do next -- what I'll do next is actually mine these
invariants from the log, and then I'll tell you about a process
for changing the model to satisfy these properties.
So, for example, this first one, abort always followed by
transaction abort, if you actually mined this property from the
log, right, you would like the property to be true of your
model. That's also kind of a generalizing statement. It says
that there's a bunch of behaviors that we haven't seen of your
system, right? But if we've seen this property for all of the
executions, then we would like this property to be true of the
model, right?
And this property is also a correctness requirement for two
phase commit. So you would like this to be true of the model as
well.
>>: So these are inferred?
>> Ivan Beschastnikh: These are going to be mined. Yeah,
exactly. So I'll show you that in a sec.
I have a note here to kind of related work. There's been a lot
of work on actually inferring these temporal properties from
sequences in different domains and a lot in the software
engineering domain. And I guess the contribution of this work
is to actually use these properties for modeling. So they're
not -- I mean, you can see them in the tool, and I'll show you
where you can see them, but that's not the purpose of the tool.
So for two phase commit, here are all of the invariants you would
mine. So this is the set of all of them. And some of them are
going to be false positives, because your log might not include
all possible executions of your system. And some would be
actually true of your system, right? And depending on how
interested you are in inspecting these, you can deselect some of
them, right, to have the model be not constrained by the false
positives.
>>: [inaudible]
>> Ivan Beschastnikh: No. These are mined automatically based
on these kinds of variants.
So now the question is how do you combine these two. So you
have this initial model. You have these properties, and you
want to refine the model to satisfy the properties. And to give
you an example: so for this initial model, right, the ones that
I grayed out below are properties that are already true of
this model. And what I mean by true or not true is that, for
example, the top one, abort always followed by transaction
abort, this property's not true of this model, because there's a
path in this model that violates this property, right? So
there's a path, propose, abort, commit, transaction commit,
where you go through the abort node, but then you don't reach
the transaction abort node. So that property's not true of the
model. So those three are not satisfied.
Now the question is how can you satisfy them. And the answer is
going to be essentially for each violation, right, for each of
these counter-examples to the property, we're going to use
counter-example guided abstraction refinement to eliminate it.
And I'll -- kind of a technical term. I'll show you how it
works.
So this initial model is what you start with, and then you have
those invariants, and this invariant is mined from the log. So
it's true of the log, but it's false in the current model,
right?
So the first step is to find the counter-example. So this
counter-example is exactly the same path that I showed you on
the previous slide. And you want to change the model to
eliminate this counter-example. So the thing to realize about
this model is that it really is a partition graph. So it's an
abstraction of the underlying concrete events. So that commit
node, this commit abstract node, contains a bunch of concrete
instances of commit from the log, right?
So you have this underlying graph structure induced
on it. And one way to change this graph is to refine a
partition, right?
So somehow we grouped all these commit nodes assuming they're the
same, right, but the realization is that actually they're not the
same, and the way we're going to differentiate them is based on
these properties. So you can split the larger commit partition
into these two partitions that are smaller, right.
And by doing so, you actually eliminate that path. So now when
you kind of look at the more abstract version of this graph when
you reach the abort node, you have to go through the transaction
abort, right? There's no way you can get to the
transaction commit. So by doing this refinement, you've
eliminated the counter-example, right? You're going to do this
over and over again, right? So you're going to eventually
satisfy all of the properties that you mine from the log and get
a more accurate model, and that's kind of the core of the
Synoptic procedure.
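A toy illustration of a single refinement step, under the assumption that splitting a partition by each concrete event's successor is enough to separate the stitched executions; Synoptic's actual counter-example search and splitting heuristics are more involved, and the trace names below are invented.

```java
import java.util.*;

// Toy version of one refinement step. Each concrete "commit" event is
// mapped to the event that followed it in its own trace; merging them all
// into one partition stitches executions together and can create a path
// that no trace contained. Splitting by concrete successor removes it.
public class RefinementSketch {
    public static void main(String[] args) {
        Map<String, String> successorOf = Map.of(
            "commit@trace1", "tx_commit",
            "commit@trace2", "tx_commit",
            "commit@trace3", "tx_abort");
        Map<String, List<String>> split = new TreeMap<>();
        for (String ev : successorOf.keySet())
            split.computeIfAbsent(successorOf.get(ev), k -> new ArrayList<>())
                 .add(ev);
        // Two smaller partitions replace the single merged "commit" node.
        split.forEach((succ, evs) ->
            System.out.println("commit[-> " + succ + "] = " + evs));
    }
}
```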
>>: Does the procedure still guarantee all your original traces
still parse to the graph?
>> Ivan Beschastnikh: That's right. The model will still
always accept the logged executions.
>>: If you have a huge database, you probably have a large set
of invariants that you mine. And now depending on the order in
which you address invariants and all the counter-examples to
each invariant, you have a huge graph, that final graph that you
might end up with.
>> Ivan Beschastnikh: Perfect. You're leading right into my
next slide. So yes.
>>: Are you going to explain the refinement?
>> Ivan Beschastnikh: Huh?
>>: Are you going to explain how to do the refinement?
>> Ivan Beschastnikh: The refinement -- there's a set of
[indiscernible] that we use for refinement, because to eliminate
the path, what we do is actually find the first node where you're
stitching multiple executions together along the counter-example,
so the first partition where that happens. And then the way you
break up the partition -- I mean, it's a little bit detailed,
because there's kind of different sets of concrete events. There
are concrete events that are from these two different paths that
you're stitching together, which you should separate. So you put
them in two different partitions, right, and then there's
everything else that's in that partition. So different strategies
are to make these partitions as balanced as possible and to
assign the remaining commit nodes at random. So there's actually
different kinds of --
>>: You said you look at the log as a set of expressions. Then
you got into this problem that actually the log was not a set of
expressions, it was a set of sequences [indiscernible] execution
was not a regular expression. It was actually --
>> Ivan Beschastnikh: Right.
>>: It was a problem. You kind of forgot about that. And now
you're kind of coming back [indiscernible] and then just the
same model.
>> Ivan Beschastnikh: Well, I guess the initial regular
expression parsing is intended to extract the log, right?
So, you know, I don't care what -- I lost some things. But the
idea with the regular expressions is that you composed them. So
you get to choose what you lose and what you keep, right?
So do you care about the disk or do you care about the network
or do you care about both, which events interest you, right?
That is the concern that the user will have.
>>: Would you have -- would you treat each event individually?
You just look at pairs of events? Well, you look at pairs of
events when you create the initial model.
>> Ivan Beschastnikh: Correct.
>>: You look at sequences of --
>> Ivan Beschastnikh: Those --
>>: Or maybe you --
>> Ivan Beschastnikh: Those are already satisfied, right. So
in the model --
>>: They've been generalizing, right? You maybe build the
model.
>> Ivan Beschastnikh: I'm generalizing as much as possible.
>>: Right. Then you have to find it, and then you have to --
>> Ivan Beschastnikh: That's right. That's a good point. I
should tell you --
Let me move to this slide. Let me describe this. So here I
worked out an example on a smaller log that has a bunch of these
models, right. And so the model on the left is sort of the
trace graph which is what you would actually parse from the log,
right? So these are the individual chains.
And then the model on the right is kind of the initial model
that you start with, where you have one kind of partition for
every event, right? And so this is your starting point, right?
The model over there is the most abstract one, and it's the most
condensed one. So it admits a lot of behavior.
The one over here is very large, because it's very concrete, and
it admits only the things that you've seen,
right?
So now the question is, like, which model do you want. So this
model is the log. So if you want to look at the gigabyte log,
you can use this model. That model is very small, but it's
going to admit a lot of stuff that you don't want.
>>: [indiscernible]
>> Ivan Beschastnikh: So there have been other approaches that
start with this and actually go this way, right.
I guess the approach that I'm describing here is one where you
start with that model and you go this way. The problem with
starting from this direction, right, is that you're going to
have to do a lot of compaction. What we found in general is that
starting over there is going to be way quicker. So performance
results for our technique are way better than the versions of
techniques that do compaction, because there you
have to actually do -- there's one algorithm where
you do compaction based on the length of strings, right, and you
find nodes that have identical runs, and then you compact them
and you merge them, right. So that algorithm is going to be very
inefficient on a very large graph. Whereas this thing is going to
be much more efficient, because you start with something much
more compact. So there's a lot of tradeoffs.
So our algorithm is to basically say there's going to be this
dividing line, and this dividing line separates models that, you
know, satisfy all the invariants, and obviously this one will
satisfy all of them by definition, and then models that violate
at least one invariant. Those are the ones in red, and so some
invariants violated, all invariants satisfied, right?
And then the exploration stage is going to be -- well, and
here's the model. You actually want to find -- you want to find
the model that is as abstract and small as possible, right, but
is in this green space, right, because it satisfies these
invariants. And you can change the definition of this space.
You can add another kind of invariant, and this line will move
to the left, right, or you can remove an invariant, and the line
will move to the right.
Now the operation that you have is refinement, right.
Refinement will start with the model that's further to the right
and will produce a number of models to the left, right,
depending on which invariant you're going to refine, depending
on which counter-example you want to eliminate, you know, and so
forth. And then one -- and other techniques use coarsening. So
refinement is the thing I described. Coarsening here is going
to be the technique from prior work which is like k-Tails where
you can merge these. So this one splits partitions. This one
merges partitions.
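For contrast, here is a toy version of the k-Tails merge test (with k = 1 and invented state names); this is the coarsening direction from prior work, not Synoptic's own operation: two states are merge candidates when their sets of length-k outgoing event sequences are identical.

```java
import java.util.*;

// Toy k-Tails check with k = 1: two concrete states may be merged when the
// sets of event labels leaving them are identical. State names and tails
// are made up for illustration.
public class KTails {
    public static void main(String[] args) {
        Map<String, Set<String>> tails = Map.of(
            "commit#1", Set.of("tx_commit"),
            "commit#2", Set.of("tx_commit"),
            "commit#3", Set.of("tx_abort"));
        List<String> states = new ArrayList<>(tails.keySet());
        Collections.sort(states);
        for (int i = 0; i < states.size(); i++)
            for (int j = i + 1; j < states.size(); j++)
                if (tails.get(states.get(i)).equals(tails.get(states.get(j))))
                    System.out.println("merge " + states.get(i)
                                       + " and " + states.get(j));
        // prints: merge commit#1 and commit#2
    }
}
```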
Now the question is how do you get to this intended goal model,
and that's really the problem that Synoptic attempts to solve.
In practice what this will look like is you start with this initial
model. You have some number of choices. You're going to make a
choice. And in practice we found that, you know, there's a lot
of strategies, but they're all suboptimal. Unfortunately,
you're never going to get the global optimal here. And so
you're going to -- we select something very simple and cheap, as
cheap as possible, because each of these partitions might have,
you know, thousands of events. So choose some simple strategy
for actually doing the splitting of nodes and then keep going so
you end up with this model. You have some more refinement
choices. And then once you jump over this line and end up in a
model that satisfies all the invariants, then we're going to
apply coarsening.
And the idea with coarsening is that it's sort of like a local
optimization. It's where you want to merge partitions locally,
right, but it's not guaranteed to find -- not guaranteed to find
a global optimum. So that's the full -- the full algorithm, and one
note about prior work: this idea of starting from the far left and
then going to the right just using coarsening has been explored in
prior work as far back as the '70s, but it's very inefficient. So
really the contribution here is to come up with refinement and
also to use these invariants as a guide for what kind of
refinement is actually working. So that's kind of the Synoptic
technique.
Now, the evaluation -- you know, we've done a couple of
different evaluations. One was to apply it to a system. The second
one was to actually have students in a class use these. So the
handwritten diagram that I showed at the very beginning was by a
student in the class, where they first had to write down the
model of their system and then run Synoptic on the system and
then compare the two models. And then we've also done some
formal evaluations to show that the algorithm always terminates,
that it has these nice properties, like always
accepting the log, satisfying the mined properties, and so forth.
So I'll only tell you about kind of this small user study with
the developer of reverse traceroute. Reverse traceroute is a
system for actually finding the reverse routing path. So
typically you run traceroute to find the forward path, but the
reverse path is usually obscured, and you don't see it. So
reverse traceroute is a system that uses multiple vantage points
and a controller to find out, you know, what path your packets
are taking on the reverse path. And it's deployed internally at
Google, and has been deployed for a while.
And what we've done is apply Synoptic to the server logs, and so
the developer had to change the logging code, which is not good.
Ideally you would like this to apply to just existing logs. But
they had to change the logging code to have better format. Free
form is too difficult to write the regular expressions. And
we've basically processed about a million events in 12 minutes.
And the model that we got is the following thing. So unless
you're a developer of reverse traceroute, this will not make a
lot of sense to you, but, basically, a single path in this model
is a path that reverse traceroute takes as an algorithm, right.
So sometimes you perform measurements. Sometimes you assume
symmetry, because you don't have any measurements, so forth. So
those are the key things that were logged by the developer, and
each of these corresponds to a method in the code of the
controller.
And so I'm highlighting a couple of things.
One thing I should say is that there are numbers on these edges,
and the numbers are actually probabilities. So -- and they
don't always sum up to one in this diagram, because for this
developer this graph was more complicated. So we had to hide
the low probability edges in this diagram for this to be more
readable. So you can think of the model as -- you know, you
could look at just common behavior, which is what we're showing,
by hiding low probability edges, or you could look at rare
probability behavior if you want to find rare occurring bugs,
for example, and then you would only show the low probability
edges. So you could have different use cases. So this one was
showing the common behavior.
And the things I'm highlighting are two issues that the
developer found. One was these rhombus nodes that are shaded
are terminal nodes, but they shouldn't have been terminal. So
the system was -- essentially one execution of the system would
terminate at an event that was not supposed to be a terminal
event, and that was one thing the developer found by looking at
the model. And the second thing is these dashed edges. So
these dashed red edges were not present in the final model,
but they should have been, and that was the other issue.
>>: So you say the nodes should not be terminal. The developer
just knew that, right, or was that inferred?
>> Ivan Beschastnikh: The developer knew that.
>>: Okay.
>> Ivan Beschastnikh: Both of these features were found by the
developer. So I basically overlaid on top of the developer's
interpretation of what is a bug and what is not; okay?
>>: [indiscernible]
>> Ivan Beschastnikh: Yes. I'll show you the -- well, let me
actually show you the tool. I have it running online. And here
it is. It's called Synoptic. And so here's an example log,
right. So this log is going to be an Apache log where you have
a bunch of GET requests to Apache, and so you have an access
log. And this log is going to be for a web application that's
like a shopping cart, right. So people come online. They check
out. They get credit cards, so forth. So the input to the
system is going to be this log and then these two regular
expressions. You'll note it matches every log line. So it
matches this thing perfectly. There's some IP address, and then
there's the magic keyword type. And type is going to be our
abstract event type, right. So the abstract event type is going to
be just the name of the PHP file that you're getting.
And then the other thing that you need is to actually somehow
split these executions. So what is an execution boundary?
And in this case, execution boundary is simply the IP address.
So a new client to the server is a new execution of the system.
And so basically you have executions that are matched one to one
with the client.
>>: [inaudible]
>> Ivan Beschastnikh: And that's -- here you can actually
notice they're all intermixed. So it's going to actually pull
them out.
So this is the only input to the system. And then the model that
you'll get out will be something like this. It's -- I'm not a
graphics person, so I can't make this look fantastic.
But here it is.
So basically you start off in this initial node, and then a
trace starts in initial and ends in terminal and then goes
through these partitions. And you can click on any one of these
partitions and then find out the log lines that are matched up,
that are basically merged into that one partition. So there's
some invalid nodes, and then there's two checkout nodes. So
there were two checkout nodes that were differentiated based on
the invariants because somehow they're considered different.
Let me see if I can make this a little bit better. Yes.
So the question for you guys then is can you find a bug in this
model.
>>: [indiscernible]
>> Ivan Beschastnikh: Yeah, that's right. So that's kind of
the idea, right. Do you guys see that?
So there's a path where the invalid coupon you would assume
would take you back to checkout. It goes to reduced price. So,
you know, what does that mean? Well, it means that there is --
this transaction is actually in the log, and you can select two
of these partitions and ask what paths go through
these, and you'll find there's some 15 traces that actually go
through both of these nodes. And then ideally you would -- I
didn't build that thing yet. But that's the intuition for how
you would use this model. It's an extraction of what's in this
log. The log has a lot more detail. It has timestamps. It has
all this other stuff. Really when you think about the logic of
your program maybe you don't care about any of that, right.
What you care about is just the event sequence and whether it's
reasonable.
>>: How do you find the one, like 151?
>> Ivan Beschastnikh: Ah, yes. The probabilities are based
on -- the probabilities are based on the number of events; all
right. So for instance this checkout node contains N
events, right, and then N over 2 go to this guy. So
probabilities are based on the concrete -- based on the concrete
observed events.
>>: [indiscernible]
>> Ivan Beschastnikh: Yeah. So in the log, right, there were
some traces that went this way and some traces went that way,
right. And so exactly half of them took the edge, and the other
half took the other edge. And that's where I didn't talk about
them in the talk, because you could show much more information
on top of this model. And probabilities are just one thing that
we thought about.
The one other thing you could show is actually time or
distribution of time, right, because you have time between
events. Then you can say here's the shape of like the
timestamps, for example. But it's based on the concrete
underlying log, and it's added to the model after the fact. So
it's not used in the actual modeling.
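A small sketch of how such edge probabilities could be computed from the concrete observations; the event names echo the shopping-cart demo and are purely illustrative.

```java
import java.util.*;

// Sketch: the probability on an abstract edge is just the normalized count
// of the concrete transitions that were merged into it.
public class EdgeProbabilities {
    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("checkout", "credit-card", "confirm"),
            List.of("checkout", "coupon", "confirm"));
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (List<String> t : traces)
            for (int i = 0; i + 1 < t.size(); i++)
                counts.computeIfAbsent(t.get(i), k -> new LinkedHashMap<>())
                      .merge(t.get(i + 1), 1, Integer::sum);
        counts.forEach((src, outs) -> {
            int total = outs.values().stream().mapToInt(Integer::intValue).sum();
            outs.forEach((dst, n) -> System.out.printf(
                "%s -> %s : %.2f%n", src, dst, (double) n / total));
        });
    }
}
```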
>>: You're kind of assuming that the process or where the nodes
came from are completely right?
>> Ivan Beschastnikh: You assume it's a sequence of events
then, and that it's reasonable to model a sequence of events in
this formalism, right.
>>: Two branches are going out from the initial checkout, one
that goes to the credit card and one that goes to coupons.
>> Ivan Beschastnikh: You mean this one, right?
>>: Yeah.
>> Ivan Beschastnikh: Sorry. That's actually a bug. I keep --
you know, it's sad. I keep presenting it over and over. People
keep finding it. I have to actually go fix it.
>>: So are we talking about -- so there's this interplay
between filtering the invariants and the complexity of the model
that you generate, right? And at least as you presented it, we
discover the invariants, and we select which ones we're going
to --
>> Ivan Beschastnikh: We select all of them right now. So as
they develop -- right now -- you know, in that screen, in that
input screen, you don't get a choice. One thing I didn't --
>>: Also you have some threshold for the strength of the
invariant.
>> Ivan Beschastnikh: The invariants are all very simple right
now. So they have to be true for all the executions, right.
So, yes, ideally you would want like a probabilistic invariant
that would be more robust. So you would say 99 percent of the
time this is true; maybe the one percent of traces are actually
just malformed.
>>: The invariants that are true all of the time?
>> Ivan Beschastnikh: Yes. And we use all of those,
absolutely. And so you can look at them here right. So here
are all the invariants that we mined for this thing. And right
here's a little visualization. Always followed by, always preceded
by, never followed by. And you could take these and then
actually, you know, remove them.
And then the model would change, right. So yes, this gives you
some kind of knob to go and tweak. So if you disable all of the
invariants, then your model will be the initial model. You
wouldn't perform any refinement.
>>: I was wondering, if you rely on a visual
representation of a machine, it usually doesn't scale around
[indiscernible] maximum a day or something like that. When
that's the case, what do you recommend?
Like removing some of these things or use a fancy visualization
for very large things.
>> Ivan Beschastnikh: Yeah. When I started this project, I
didn't think that visualization would be kind of an important
component of it. It turns out it is. We found that in the
regular -- so, for example, in the reverse traceroute model that I
showed you, we had to remove nodes and edges of low
probability in order to make that model kind of fit on the
screen and be readable. So that was one simplification that
we've done.
I would actually argue that what you really want to do is choose
a different -- choose a different smaller component or choose a
different abstraction for your model. So basically it
means to me that the level of abstraction that you selected with
these regular expressions on top of the log is simply too close
to the log, too concrete, right. So you should raise the level
of abstraction in order to make that model easier to interpret.
But that's not always the case. You know sometimes you do have
a lot -- it really is a complex case. So this tool definitely
would not work for a very complex system that has that much
complexity.
>>: A question from someone watching the video online: he's asking
what [inaudible] found in reverse traceroute, how critical were
they, did they actually cause crashes or
performance problems or anything at all.
>> Ivan Beschastnikh: Yes. The developer knew about one of
them. Like they knew that this thing happened. They didn't
fully understand where it came from. And actually the model
doesn't tell you where it comes from either, right. It's like a
diagnosis tool. You then have to go find the root cause after
all. So the underlying cause was a threading problem, a
concurrency problem. The bugs were not -- they didn't crash the
system. So they were not that severe.
Okay. So I'll switch over. I showed you this demo thing. So
to summarize, Synoptic takes a single totally
ordered log and produces this event-based model, and it does
this with refinement, and it leverages these mined invariants, which
it satisfies in the final model. And primarily the use case
here was comprehension. So the idea is you want to help
developers with large and complex logs. That was the original
intent. And it's open source and deployed on this web interface
that sort of scales, you know.
Yeah, please.
>>: What's the largest log that you've run?
>> Ivan Beschastnikh: Like a million log lines.
>>: A million?
>> Ivan Beschastnikh: Yeah.
>>: So was it reverse?
>> Ivan Beschastnikh: Different versions of reverse traceroute.
We've done larger ones for it. The complexity of the algorithm
really depends on the number of event types and the diversity of
the log that you have. So the more connected your graph becomes,
the more difficult it is to check those properties, and then
refinement of course slows down as you have more.
So the next project I'll talk to you about --
Yes, please.
>>: Did you do any investigation of how robust this is like to
the loss of log lines or if you have to log inside one
environment and not the other?
>> Ivan Beschastnikh: Right. So you wouldn't pick things up
that you haven't logged. So it's very much constrained by what
you have in the log.
>>: If you just take out a single log line from the log, how
badly would that screw up the model?
>> Ivan Beschastnikh: That's a good question. If you have
another execution that is just like this one, it wouldn't change
the model, because essentially you have this robust
redundancy. The underlying concrete execution would
still have the same edges and you would mine the same
invariants, and refinement would perform the same.
But if you remove a log line that would remove an invariant, for
example, it might change the model radically.
>>: In that case, if one execution has an event and the
other does not, then would that show as an error?
>> Ivan Beschastnikh: What do you mean?
>>: So say one execution has some logs omitted, another
execution there are some logs. So will it show it in the
execution logs as an error?
>> Ivan Beschastnikh: As an error?
We wouldn't -- well, the model would represent both executions,
right, and we would merge those sections that it would be
able to merge, but it wouldn't show as an error, actually. It
would attempt to satisfy both. It would attempt to model both.
>>: [indiscernible]
>> Ivan Beschastnikh: What about what?
>>: [indiscernible]
>> Ivan Beschastnikh: So certain -- yes. So the invariants
have to be true for every execution. If there's at least one
execution, even if you have a million of them, that doesn't satisfy
the invariant, then you throw it out. Then you're not going to
mine it, and then you're not going to use it.
>>: When you have one trace that looks like an invariant -- you
have the conditions for the invariant that only happen in that
one trace -- and you run a million more runs, maybe you would have
found the case where you get that condition but you go the
other way, so it's really not an invariant.
>> Ivan Beschastnikh: Right. It would be a false positive;
absolutely right. So you would mine the false positive along
with the true ones. The answer here is really you want as much
information as possible. For the model to be accurate, you need
to observe more things.
>>: So the initial model is just one [indiscernible].
>> Ivan Beschastnikh: Right.
>>: When you do that refinement [indiscernible] or how do you
do it?
>> Ivan Beschastnikh: Yeah, we --
>>: Does it become a bottleneck?
>> Ivan Beschastnikh: Yeah. The memory pressure is definitely
a bottleneck or if you get --
>>: Do you need to have the full log in memory?
>> Ivan Beschastnikh: You do need to have the full log in memory.
There are some optimizations that you can make where you only
really need to store some concrete aspects of the traces. You
don't need to store all of them. For example, if you have two
identical traces, I just need one of them. I don't need both.
So that's just one kind of optimization.
>>: From the left.
>> Ivan Beschastnikh: You're right. And there are other cases
where you might not need to store some things, because, you
know, during the merge the invariant is satisfied, right. So if
certain partitions are not going to be ever refined, then you
can lose them.
>>: Will you know that?
>> Ivan Beschastnikh: Well --
>>: The partition.
>> Ivan Beschastnikh: You can know that sometimes because the
underlying refinement assumes that you can only refine a
partition if it stitches multiple executions together, right,
that have different futures essentially, right. And that's when
you're going to want to refine it. So the optimization would
be, you know, I take that partition and then I check -- if you
can check that all of the invariants are satisfied below, then I
can throw away the state for them. That would be one
optimization. But we haven't implemented any of those. So that
would be kind of future work.
>>: Yeah.
>> Ivan Beschastnikh: Yeah. Definitely. This is a single
threaded Java process right now. So this is not optimized at all.
It's mostly experimental.
So let me jump into the Dynoptic thing, because it has a bunch of
details you guys might or might not enjoy.
So the sequential case was you have this, you know, the
intuitive model. There is kind of a sequence of events. The
question is what is intuitive in a distributed case. And in a
distributed case when we told students to write down a model of
their system, they came back with pictures that looked like
this. All right. So basically they would model each component
as a finite-state machine. You would model the client
finite-state machine. You have the server finite-state machine.
The only catch is that the server emits events that the client
consumes, right. So they're actually linked, you know. And so
this was kind of -- this is very intuitive to students, and the
idea was well people are familiar with finite-state machines.
Let's use a formalism that's close to this.
So in the Synoptic case, you would infer an event based model.
In the Dynoptic case, we would actually infer a model that is
a set of communicating finite-state machines. And let me
tell you more about what these things are. So CFSMs basically
have some number of processes, and they're connected with FIFO
queues, and they're reliable. So one example CFSM that
I'll use here is a very simple one where you have process one on
the left, process two on the right. And you actually have
states, and they both start off in their initial states. And
then you have these funny looking transitions where some of them
may be communication events. So exclamation M means that you
want to send M on Q1, so inserting M into that queue. And then
question mark M means receive that M on that queue. And once
you receive it, you can proceed, right, but you
cannot execute this event unless there's actually an M at the
head of that queue. So this is really modeling message passing,
and it also includes local events. So you can execute a local
event for free, right. And then you might communicate back over
a different queue, because the queues are in either direction.
And then you would consume it on this end. And that would be a
complete
execution.
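A minimal sketch of these queue semantics, with invented channel names and a simplified event encoding; it only illustrates that sends are always enabled while receives block until the message is at the head of the queue.

```java
import java.util.*;

// Minimal CFSM-style step semantics: "!" enqueues a message on a channel,
// "?" dequeues it but only when it sits at the head; anything else is a
// local event that is always enabled. Channel names are made up.
public class CfsmStep {
    static final Map<String, Deque<String>> channels = new HashMap<>();
    static {
        channels.put("q1", new ArrayDeque<>());
        channels.put("q2", new ArrayDeque<>());
    }
    static boolean exec(String channel, String event) {
        if (event.startsWith("!")) {               // send: always enabled
            channels.get(channel).addLast(event.substring(1));
            return true;
        }
        if (event.startsWith("?")) {               // receive: may block
            Deque<String> q = channels.get(channel);
            if (!event.substring(1).equals(q.peekFirst())) return false;
            q.removeFirst();
            return true;
        }
        return true;                               // local event: free
    }
    public static void main(String[] args) {
        System.out.println(exec("q1", "?m"));      // false: q1 is still empty
        System.out.println(exec("q1", "!m"));      // true: m enqueued on q1
        System.out.println(exec("q1", "?m"));      // true: m consumed
    }
}
```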
>>: [indiscernible]
>> Ivan Beschastnikh: Exactly.
>>: So that could be anything.
>> Ivan Beschastnikh: Yes. I'm not assuming anything. This
could be a shared memory. This could be a socket. So this is
one execution of this CFSM. There's only one execution of the
CFSM, but in general they may be asynchronous, right, because
there might be another process that I'm communicating with
independently, right, and they don't have to match up.
So the idea here now is what if we have a log, and we want to
produce this kind of model, right. How could we infer this kind
of model from a set of events from the log.
So, you know, Dynoptic very similar sounding to Synoptic. So
our pipeline is going to resemble Synoptic. We're going to have
very similar steps where we parse the log, build this compact
model, mine some properties. These properties are going to be
more interesting, because now they're going to be events at
certain processes. So I might have something like send at
process one is always followed by receive at process two, use
something like refinement to get the final model. So very
similar.
So I'll describe all of these steps kind of now in the Dynoptic
setting. So the first question is, okay, so you have a log, how
do you get back an execution out of it, how do you parse it.
Well, it's a distributed log. So what you're going to need is
something like vector timestamps. So you want to actually
capture the total order -- the partial order of your system
execution. So these vector timestamps are logical clocks that
you can use to reconstruct the partial order of the
execution as it occurred.
And what we have built is like a little library that you can
compile in -- for Java, you recompile your Java program with this
jar, and then existing log messages would then have these vector
clocks put in automatically. So you get this in the log for free.
So you don't have to have it as part of your system. We'll add
it in automatically. It's going to have an overhead, but you
don't have to worry about generating these timestamps.
>>: You said that you would add the vector clocks into the log?
>> Ivan Beschastnikh: Yes. So assume that your initial log
was -- you know, so if you didn't have these things initially,
right, you just had these things on the right, so then you
compile it with the jar, and it automatically tracks and senses
causality and then just adds all these to every log line. So
when you log normally, you get Q exclamation M. Then it would
also prefix this vector clock. So the idea is that you just
want to capture the partial order, right?
And I don't want to change your log, because you log some things
for a reason, right? You just keep that. But I want to keep
the causality. I want to reconstruct it. So very basic vector
clock mechanism.
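A bare-bones vector clock of the kind being described; the class and method names here are made up for illustration, not the actual library's API.

```java
import java.util.Arrays;

// Basic vector clock: bump your own entry on every local or send event,
// and take the entrywise max (then bump) when receiving, so the clocks
// encode the partial order of the execution.
public class VectorClock {
    final int[] clock;
    final int me;
    VectorClock(int processes, int me) {
        this.clock = new int[processes];
        this.me = me;
    }
    void tick() { clock[me]++; }                 // local event or send
    void onReceive(int[] sender) {               // merge the sender's clock
        for (int i = 0; i < clock.length; i++)
            clock[i] = Math.max(clock[i], sender[i]);
        clock[me]++;
    }
    public static void main(String[] args) {
        VectorClock p0 = new VectorClock(2, 0), p1 = new VectorClock(2, 1);
        p0.tick();                               // p0 sends: [1, 0]
        p1.onReceive(p0.clock);                  // p1 receives: [1, 1]
        System.out.println(Arrays.toString(p1.clock));
    }
}
```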
So you have this log now. So how do we build the initial model?
You're going to have kind of two steps. First of all, we're going
to deal with states. Unfortunately because these CFSMs have
queues, you actually have to reason about message -- message
FIFO state, you know, what messages are outstanding. So first
we're going to reconstruct the state, and then we're going to do
partitioning not based on events but based on queue state. So
let me show you how this works.
So initially you have this time space DAG that you parse from
the log, and then you want to come up with a state based DAG.
The process here is going to be very simple. It's going to be
simulation, where you first see the first event. So I start off
with a state where both of the queues have no messages; they're
empty. Then the first event, it's a send of M. So
then my simulation will basically say, okay, I just want to add
M on the queue. Then I receive it. Then I go to a new state
that has both queues empty. I execute a local event, doesn't
change the queue. And then I execute the ack kind of sequence.
So the idea is that you would parse a bunch of these from your
DAG, and now you have a model that has states and events, more
complicated then Synoptic, but you can apply some of the same
ideas. So the idea now is you want to build this initial global
model. That's compaction. And remember in Synoptic what we've
done is we've seen commit and commit, and we've merged them
together. We assume they're the same. In this case, we're
going to build a state based version of that. So we're going to
take -- look at these queues, and then we're going to merge
queue states that are the same. Actually in practice there's going
to be an approximation here, where we'll merge queue states that
have the same top K messages. So we're just going to assume
that after K the queues are identical. So this is where we're
going to lose our -- this is where it's not going to be exact.
So queue one and queue two is going to be the abstract state
that represents all of these orange states. And I'll do the
same thing for this guy and the same thing for this guy. So now
I have all of my abstract states, and I create edges between
them in the same way. So I have a transition between a state
where I have an M and a state where all queues are empty, on
receive of M, if this actually happened in practice, right, if I
saw the concrete event that made this transformation.
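A sketch of the two state-related steps just described: replaying one execution to recover the queue contents after every event, then keying each state by its top-K queue prefix so that states agreeing on the first K messages collapse into one abstract state. The "q1!m" / "q1?m" event encoding is invented.

```java
import java.util.*;

// Replay an execution to reconstruct FIFO queue state, then abstract
// states by their first K queued messages (here K = 1).
public class QueueStates {
    static final int K = 1;
    public static void main(String[] args) {
        List<String> exec = List.of("q1!m", "q1?m", "local", "q2!ack", "q2?ack");
        Map<String, List<String>> queues = new TreeMap<>();
        queues.put("q1", new ArrayList<>());
        queues.put("q2", new ArrayList<>());
        Set<String> abstractStates = new LinkedHashSet<>();
        abstractStates.add(key(queues));
        for (String e : exec) {
            if (e.contains("!")) {
                String[] p = e.split("!");
                queues.get(p[0]).add(p[1]);        // send: enqueue message
            } else if (e.contains("?")) {
                String[] p = e.split("\\?");
                queues.get(p[0]).remove(0);        // receive: dequeue head
            }                                      // local events: no change
            abstractStates.add(key(queues));
        }
        System.out.println(abstractStates);
    }
    // States with the same first-K queue contents share one abstract key.
    static String key(Map<String, List<String>> queues) {
        StringBuilder sb = new StringBuilder();
        queues.forEach((q, msgs) -> sb.append(q).append('=')
              .append(msgs.subList(0, Math.min(K, msgs.size()))).append(' '));
        return sb.toString().trim();
    }
}
```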
So now I have the initial global model. It has some of the same
great features as Synoptic. So it accepts the log where every
one of these executions in the log is going to be a valid trace
in this model, but this model is actually -- you can think of it
like a cross-product of the processes. It's not actually a CFSM.
You have to reason about global state, you know, global events
across all of your nodes. So the actual decomposition is going to
come
up in refinement.
So let me tell you about invariants. So invariants are going to
be very straight forward. You have a bunch of these DAGs, and
what you're going to do is mine the same kind of events, the
same kind of templates that we had in Synoptic. Except that now
your events have kind of a process ID associated with them,
right? So you might have an event that only executes at process
one, you know, is always followed by an event that executes at
process two. So you mine the same set of things.
And now the question is how do you use both the initial model
and invariants to get this CFSM formalism that I told you about
before.
>>: If a process writes a message send on queue one, the receive
is always enforced by construction, by your log. Am I missing
something here?
>> Ivan Beschastnikh: That one -- yeah, I think you're right.
It is enforced by construction in logging, but it's not enforced
by the model. So you do have to -- you still have
to -- I haven't thought about that. That's a really great
point. So you don't have to mine it in a sense. So you can
just add those. Yeah. The reason it's actually there is
because the library that adds the vector timestamps came in
after the fact. So if you implemented your own version of
vector clocks, or if you have message loss, then that would be one
way of handling it.
So now you want to compose these two, use both of them together.
So this is the really fun part, the really complex part. So you
have this global model. So the first step is going to be to
decompose it into a CFSM, and the decomposition is pretty
straightforward where you just pay attention to the events of
the individual processes, right. So I take this model, and I
only look at events for process one, right. And I treat events
for other processes as epsilon transitions. So I'm going to
receive an M, and then I'm going to send an M. And this is
going to be an epsilon transition. And then something happens
somewhere else, and I don't care what it is, but for me locally
I transition to this other state.
So using that approach you can decompose this thing into these
two CFSMs, and these two CFSMs are very compact. As you see,
they're just one state each, and the reason for that is because
it is the most compact global model for that example.
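A minimal sketch of that projection, with an invented event-to-process mapping: keep the events of the process you're projecting onto and turn everyone else's into epsilon transitions, which a standard epsilon-elimination pass would then remove.

```java
import java.util.*;

// Toy projection of a global model onto process 1: its own events keep
// their labels; other processes' events become epsilon transitions.
public class Projection {
    record Edge(int src, int dst, String event, int process) {}
    public static void main(String[] args) {
        List<Edge> global = List.of(
            new Edge(0, 1, "q1!m", 1),     // process 1 sends m
            new Edge(1, 2, "q1?m", 2),     // process 2 receives m
            new Edge(2, 3, "local", 2));   // process 2 local event
        int me = 1;
        for (Edge e : global) {
            String label = e.process() == me ? e.event() : "epsilon";
            System.out.println(e.src() + " --" + label + "--> " + e.dst());
        }
    }
}
```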
And then the next step is to use prior work on formal methods.
Luckily I didn't have to invent this. So there's been prior work
on actually model checking CFSMs, right. And so we're going to
throw it at this model that we have right now to get back
counter-example for an invariant if there is one.
So, for example, on this guy, there is an invariant: send of M
is always followed by receive of M, and we use a model checker
called McScM which model checks CFSMs exactly. So it might not
terminate. So we're thinking about these things. So that's kind
of a side point. But the point is you want to check the
invariants. And you find there's a counter-example, right.
There's an execution where you can send them, and I can receive.
And then you can execute, and so this execution is a
counter-example to that invariant. And that execution is then
going to induce a refinement of this graph. So here I can -you know, I can send an M, and I can execute a local event,
right?
So the idea is you want to split this guy out to require that
every time you send an M you're going to receive an M, right?
And so this is refinement as before in the Synoptic case. And
once you have this new global model, you do this step again,
right? You're going to come back with a new CFSM, model check
it, and then eventually you're going to be done. So for this
example that I worked out, you know, Dynoptic will accept
exactly the two traces that you observed -- it will give you
exactly the right CFSM for that example.
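Stepping back, the overall loop just described has the rough
shape below. The helper functions are hypothetical stand-ins for
the steps in the talk (projection, McScM-style checking, and
state splitting), passed in as parameters precisely because they
are assumptions, not Dynoptic's API.

    # Counterexample-guided refinement, as sketched in the talk.
    def infer_cfsm(global_model, invariants,
                   decompose, model_check, refine):
        while True:
            cfsm = decompose(global_model)   # per-process projection
            cex = None
            for inv in invariants:
                cex = model_check(cfsm, inv) # e.g., via McScM
                if cex is not None:
                    break                    # violating execution
            if cex is None:
                return cfsm                  # all invariants hold
            # Split states so the violating execution is no
            # longer accepted by the global model.
            global_model = refine(global_model, cex)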
So that's the Dynoptic process. It's more involved, it has many
more formal-methods parts in it, and it's a little trickier. I
feel like we're still struggling to understand all aspects of
it, but we've done some preliminary evaluation on it: we've
simulated some protocols and gave it some simulated traces.
We've evaluated it on the Voldemort DHT, which has a replication
protocol inside of it, and so we selected just the messages that
do the replication, which turns out to be really simple. So it's
perfect for our tool. And then we've done a case study with TCP,
the opening and closing handshakes. The problem with CFSMs is
that they, again, don't model data. So you cannot reason about
sequence numbers, for example. So you cannot do the data
transfer stage of TCP, but you can do the opening and closing
handshakes pretty easily. And then there's been a bunch of work
looking at formal evaluation and usability of these models.
I'll just show you the DHT result. So the Voldemort distributed
hash table basically has this very narrow interface: you can
associate some value with a key, or you can retrieve the value
for a certain key. And this is actually deployed, and it's open
source, so you can download it and exercise it.
So we ran Dynoptic on logs generated by Voldemort's unit tests,
and we just targeted the protocol messages for the DHT. And so
this is the CFSM that you get out. It's not as pretty; I had to
prettify it manually.
But, basically, these unit tests use two replicas and one
client, and this replication protocol is really straightforward.
You basically have a right side to it and a left side. On the
right-hand side you're executing put: you're associating a value
with a key. And so this guy is going to essentially execute a
put on this replica, wait for a reply, and then execute a put on
the lower replica. And then the same thing for get: you're going
to follow the exact same path. So they kind of mirror each
other.
So through inspection, we found out that this is indeed the true
model for replication in Voldemort. Replication is really,
really simple in Voldemort, and that's why this model is pretty
compact: you can inspect it, and it succinctly captures the
three-node distributed execution.
So the contributions here are very much like in the Synoptic
case, except we have one more, which is handling distribution.
So how do you handle a log that's generated by multiple
processes? Our answer is that you want one finite-state machine
per process; that would be one way of doing it. And in our case,
we found that it elucidates distributed protocols, because, you
know, distributed logs with no partial order are hard to
interpret, and logs that have the partial order are exact but
are even more difficult to read, because now you have to draw
out these orderings by hand. So a more general model can help
you understand these protocols better. It's open source. It's
not widely deployed yet, but you can try it out.
So before I conclude, I want to thank a bunch of people. I
worked with a trio of advisors on this project, along with my
collaborator and a ton of students at UW, and we were generously
funded by DARPA, Google and NSF.
So the contributions of this talk: I think logs have a lot of
potential; they have a lot of content in them. What I attempted
to do is to apply, basically, formal methods to log analysis and
model inference in these two tools. Synoptic infers sequential
models. Dynoptic infers distributed models. The idea is that you
can then use these to help developers understand what goes on in
their systems. So you can find out more.
Thanks for your attention. Thanks for coming.
[applause]
>>: So what is your next step?
>> Ivan Beschastnikh: Actually, I'm excited about applying
model inference to other domains, other kinds of problems. So I
think Dynoptic is an interesting theoretical kind of tool, but I
think it's really difficult to make it scale. So I was thinking
of taking the Synoptic approach and applying it to logs that
have more information.
So I was telling Peter about logs that have timing information,
for example. So you could have probabilities on edges, but you
could also have time on edges. And you could use this to
actually reason about performance within your system. So now I'm
not only modeling an execution, I'm also reasoning about the
time it takes between events, and I could ask, you know, give me
the path in the model that is slowest based on the observations,
right. And I could think about that for doing things like
performance testing or performance analysis. So that's an
immediate low-hanging fruit that I was thinking about.
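As a minimal sketch of that slowest-path idea: annotate model
edges with observed mean latencies and ask for the slowest path.
This assumes an acyclic model for simplicity, and the node names
and numbers are made up for illustration, not Synoptic output.

    # Sketch: edges carry mean latencies; find the slowest path.
    def slowest_path(edges, src, dst):
        """edges: node -> list of (next_node, mean_seconds)."""
        memo = {}
        def visit(node):
            if node == dst:
                return (0.0, [dst])
            if node in memo:
                return memo[node]
            best = None
            for nxt, dt in edges.get(node, []):
                sub = visit(nxt)
                if sub is None:
                    continue          # dead end, cannot reach dst
                total, path = sub
                if best is None or dt + total > best[0]:
                    best = (dt + total, [node] + path)
            memo[node] = best
            return best
        return visit(src)

    edges = {"INIT":  [("send", 0.1)],
             "send":  [("recv", 2.5), ("retry", 0.3)],
             "retry": [("recv", 2.8)],
             "recv":  [("TERM", 0.05)]}
    print(slowest_path(edges, "INIT", "TERM"))
    # -> (3.25, ['INIT', 'send', 'retry', 'recv', 'TERM'])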
I guess, in general, I'm more of a systems person, and I like to
apply these techniques to systems. So I'm thinking about doing
test case generation for distributed systems. That's one of
my -- I think that would be a great thing to go to next, because
distributed systems are very difficult to test, and I feel like
not a lot of people know how to test them well. So when people
write distributed systems code, they don't test as much, but
they test very specific things, and I was wondering if there's
some way to leverage the intuition that developers have about
their systems to generate test cases better. So both of these
are kind of in the software engineering domain as techniques,
but the applications would be kind of the systems for me.
>>: [inaudible]
>> Ivan Beschastnikh: Yeah. You know, there are some ways of
cheating with Synoptic. I also don't think model inference is
applicable to all problems. I wouldn't necessarily use model
inference for test case generation. It would be nice to, but I'm
not sure it would work out.
>>: It seems like you touched on it. But you said that one of
the major reasons people started using this was the realization
that the model was actually more useful than the raw data
underneath it.
>> Ivan Beschastnikh: Right.
>>: Have you put more thought into what you could do, or what
can be done, to improve that side of it? From a developer's
point of view, as far as I can see, that seems most valuable.
>> Ivan Beschastnikh: Right. Yeah. I think a tighter
integration between the model and the log would be really
useful. So right now you have some kind of window that you can
peer through, you know, into the abstraction. So you have this
node, and you can see the log lines that are related to it.
But I think you can ask more queries of the model. Like, you
could actually ask a question like: why are these two nodes
split out, why can't I merge them? And then the answer would be,
well, there's this invariant, and if I merged them, then there
would be a path that violates this invariant. Being able to pose
questions like that I think would be useful, because oftentimes
when you have this log you have a very specific question, and I
think the research question is whether you can interpret that
question, pose it against this abstraction, and get an answer.
So I think that would be very helpful. Right now it's very much
open ended: you could use it for exploration, but I think
building it towards a certain set of tasks would be the right
thing to do.
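One way to picture such a query, as a heavily hedged sketch:
tentatively merge the two nodes and report the first mined
invariant the merged model would violate. The merge and check
operations are passed in as parameters because they are my
assumptions about the tool, not actual Synoptic APIs.

    # Hypothetical "why can't I merge these two nodes?" query.
    def why_not_merged(model, node_a, node_b, invariants,
                       merge, check):
        candidate = merge(model, node_a, node_b)
        for inv in invariants:
            cex = check(candidate, inv)  # violating path, or None
            if cex is not None:
                return inv, cex          # invariant blocks the merge
        return None                      # merge would be safe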
>>: It's a luxury to be able to just collect logs and say, let's
go find some bugs. Usually it's like, oh, my God, it's totally
broken, what's happening with the system? That's usually why
people go down these sorts of avenues, to be able to get more
direct answers.
>> Ivan Beschastnikh: That was my initial goal. It's like just
comprehension overall, because you do use logs for so many
different things. You know, so if I just give you something
that's a little bit easier to use than a large text log, it will
make your life a little bit better. But I certainly don't want
to build in any assumptions about what you might be interested
in, which is why, you know, the abstraction that is done by the
regular expressions is left completely up to the user.
So thanks very much.
[applause]