>> Peter Bodik: All right. Welcome everybody. Happy to introduce Ivan Beschastnikh who recently graduated from UW and will be starting as an assistant professor at UBC up in Vancouver. And he'll talk about his work in software engineering, particularly how to understand complex logs from distributed systems. All right. >> Ivan Beschastnikh: Thanks, Peter. Hi, everybody. So today I'll be telling you about basically my dissertation work, and the perspective that you should have on this work is that what I'm attempting to do is the following: take a log like the one you're seeing on the left and convert it into a model, a more abstract model like this thing on the right. And the idea is that you can use this model for various other tasks once you've done this conversion. So kind of the high level view for this is that I started in systems, have been building large systems, some of them distributed, and have always been running into this problem, right, which a lot of people face, which is that you build the system and somehow it behaves in an unexpected way. It does something that you would not expect it to do. So the question is how do you answer this question. And typically the setting here is that you have written some code, perhaps lots of code, and you've looked at it and you've attempted to test it, but you still have this question. And of course there's some developer, some confused developer, and this developer has a mental model of what the code is supposed to do. So, right, she's thinking about the code. She wrote down the code. And the reason this question comes up is that there's some disconnect, right, some mismatch between the mental model the developer has of their artifact and the actual artifact, the implementation. So one typical way to try to bridge these two is to instrument the system, run it, and get a bunch of runtime observations of the behavior of the system. So you would output this log, and then you would inspect this log and try to find out, okay, there are some lines in this log that somehow, when mapped to my mental model, give me some kind of contradiction, right. Maybe there's a loop on that invalid state, and I had never thought about a loop, but it actually exists in the implementation. So the idea here is that you would generate this log, inspect it and then check it for validity against your mental model. And this sounds really plausible except in practice your systems are going to have huge logs. So your logs may be gigabytes long. They might be very elaborate. And as a developer faced with this very large log, you're not really sure where to look within the log or what to look for. So the problem is the log gives you a very concrete view of the system, and it's very easy to get a very large log, right. I just record all the methods I called. I just record all the activities. It's very easy to get a lot of information. The problem is to actually inspect it. So, you know, and then in a distributed setting, things get worse, right. The problem is that now you have multiple hosts and processes, and you have to reconstruct a distributed execution where you've captured logs for the different entities in your system, and somehow you have to string them all together. So going back to this kind of setting, what I'm going to tell you about today is an attempt to actually replace the log with something a little bit more abstract.
So you want to not deal with the log but deal with something that matches your mental model more closely, right. And the idea is that if you provide the developer with a model that models the log, a different representation, if you will, that's closer to their mental model, then it would be much easier for the developer to actually inspect the model and find the disconnect, find where their mental model differs from the implementation. So that's kind of the context for the work. And this process of going from a log to a model is typically referred to as model inference; at least that's what I'm going to refer to it as. And there are lots of great use cases for this. So I've told you about mental model validation, but you could also use this for test case generation. So, for example, you generate the model from this log, and if your model generalizes then you can predict behaviors that are plausible for your system, and then you can use those behaviors to actually induce executions in the system. You could also use it to evaluate test suites. So you might say I tested my system, you know, in this one environment, and then I deploy it in production. So its behavior in the test environment is going to differ from production behavior. And so how do I compare that behavior? How do I know what's present in production that's not present during testing? So one way to do that is to generate the two different models and then compare them. We know how to compare models in different ways. And you can find paths that are in production that are not in testing that should be exercised. You can also use it for anomaly detection. This is kind of the classic case where you take the model from last week and then you take the model from today and then you compare the two. So there are numerous applications you can use this for. In my case, it's going to be mostly mental model validation. And I'm not the first person to work on this topic. People have worked on this in the software engineering domain, and they refer to this as specification mining or process discovery. And usually the name here depends on the task that you're going to apply the model to. And prior work spans at least a decade of research, and a lot of challenges remain in this prior work, things like efficiency, accuracy and distribution. And those are the ones that I'm going to talk about in my work today. So the first one's efficiency: how do you get this process, this model inference process, to work on very large logs? You know, a lot of the prior work works fine if the log is a hundred lines long or 200 lines long. How do you get it to work on a gigabyte log? Then, how do you make this model more accurate? What notion of accuracy can you use? And then, finally, how do you get it to work for a distributed system, a distributed setting? So the three tools that I built are Synoptic, Dynoptic and InvariMint. And to briefly go over them, the contribution of Synoptic is that it gains efficiency through the process of refinement, and I'll go into more details about what that means. It gains accuracy by mining certain properties from the log and then making sure that those properties are going to be true in the final model. And then Dynoptic is sort of follow-on work to Synoptic which infers a different kind of model. So the model I'm going to attempt to infer here is one where I have a finite-state machine per process.
So Synoptic infers a finite-state machine like model. And in the Dynoptic case, I actually want to model my distributed system as a set of finite-state machines. And Dynoptic is going to apply some of the same techniques as Synoptic, like refinement and mining properties, but it's also going to handle distribution. And then the final work is InvariMint, which is sort of very different from the other two, and I'm not going to talk about it much in this talk, but it fits into this puzzle in that it takes the idea of mining properties and composing them to the extreme, essentially, where there is no refinement in InvariMint. It just mines a set of properties and composes them in an interesting way. So this is really the motivation for my talk. And next I'll really tell you about Synoptic and Dynoptic. So there's not going to be any InvariMint in the talk. So jumping into Synoptic. So the goal is you have this log, and you want to produce this model. And the way Synoptic is going to do this -- you know, initially I didn't tell you about any constraints on the log. So the first step is going to be just to parse this log. So I'm going to assume that the user will give me a log and a set of regular expressions that will match the lines that they care about in the log. So if you care about disks in your system, then you give me regular expressions to extract disk events, and then I'll build you a model that's relevant to just those sets of events. The second step is to then build a compact model. You want to build a model that will include as many behaviors as possible. So include the behaviors that you have in the log but then many more things. Then what we're going to do is mine these properties, or what I call log invariants, and they're going to be temporal properties, things like lock is always followed by unlock, or open always precedes read, and then use these invariants to constrain the model. So I'm going to take this initial model, use the invariants that I mined and then build you a more accurate model. >>: [indiscernible] >> Ivan Beschastnikh: Right. I'll give you -- yeah. I'll give you an example of what I mean by regular expressions. But overall, I think of this log as a set of events. So I do not reason about state at all. So the model I'm going to give you actually is going to be an event-based model. So I'm modeling sequences of events. And so your regular expressions basically have to tell me, for every log line that you care about, which abstract event to associate with that log line. So if you're sending a message, it might be an acknowledgement message, like an ACK in TCP. It has sequence numbers. It has all sorts of stuff in it. But your abstract event type is acknowledgement, right. That would make sense if you're attempting to come up with a model that reasons about just the event types. And depending on the setting, you might say that's unreasonable, right. So you might say, okay, maybe I care about every fifth acknowledgement packet. So it would be acknowledgement five, acknowledgement ten, right. >>: So this is not an event. >> Ivan Beschastnikh: Right. So you would include that into your regular expression. But, in general, this does not reason about event data. So it's not powerful enough to reason about event data and state. >>: So if you have this acknowledgement model, right, if we wanted to model a handshake, then I have to give all that data, extracting all of these events? >> Ivan Beschastnikh: Yeah, you would. I mean, you know, the answer here depends on what that log looks like.
If your log is in a certain format, then you can just say you have a regular expression that matches different kinds of events and extracts some sort of subset. Let me work through these different steps, and I'll go through them in order. And the example that I'll use is a very simplified version of two phase commit. So in two phase commit, you have a manager and some number of replicas. And the manager is going to propose a transaction, and then your replicas will reply with either commit or abort. And then the manager will collect all this information and then either reply with transaction commit if it's seen only commits, or reply with transaction abort if there's one abort or more. So the way we're going to cheat here and use Synoptic is that we're going to maintain a totally ordered log at the manager. So we're not going to care about logs anywhere else, because I can have a totally ordered global view at the manager. That's the log I'll plug into Synoptic. So in this case, my input might be this set of events, and then my regular expressions are going to extract just the packet types or event types that I care about. And for two phase commit, the things you care about are, you know, what kind of messages you are sending: propose, abort, transaction abort, commit, and so forth. So I'm going to essentially extract these execution chains from this log. One thing that I should mention is you also need something that basically tells you execution boundaries. So in this case, the transaction ID is going to be your execution boundary. So every new transaction will induce a new execution. And the format I'm using here is that the square node is going to be the initial node. So all of the propose nodes are going to be the first ones. And then the very bottommost node, the terminal node, is going to be the response. So I have this initial set of traces, and there's going to be a huge number of them. And what I want to do next is build this compact model. So the compact model is going to be built very simply: I want one node, one abstract node, for every kind of event that I have. So, for example, I might take all the propose nodes and create one single propose node to represent all of them. And I'll take all the commit nodes and create one commit node to represent all of them. And I'll do this for all of the event types that I have. And then I'll create edges between these based on the concrete observations. So if there's a concrete edge between commit and transaction commit, then there's going to be an abstract edge between the commit in the abstract model and the transaction commit in the abstract model. So, basically, I just built you a model that's very compact, compact in the sense that there's only one node per event type, and it admits all the behaviors that I observed in the log by construction. But this model also admits lots of other things. So, you know, now the question becomes how do I get this model to be a little bit more accurate. And this is where invariants, those log properties, will come in. So we built this compact model. Now we're going to mine these salient log invariants. So it turns out you can get away with very simple properties for pretty complex systems. For Synoptic, we're actually going to use these three properties. They're going to be temporal properties. You can express them in LTL, but I'm not going to show that to you.
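As a brief aside on the parsing step just described, here is a minimal sketch in Java of the idea (this is not Synoptic's actual code; the log format, regex, and event names are hypothetical): one regular expression maps each manager-side log line to an abstract event type, and a second capture group, the transaction ID, supplies the execution boundary.

```java
import java.util.*;
import java.util.regex.*;

public class LogParser {
    public static void main(String[] args) {
        // Hypothetical manager-side log: each line carries a transaction id
        // and a message type; everything else (hosts, payloads) is ignored.
        String[] log = {
            "txid=1 send propose", "txid=1 recv commit", "txid=1 send tx_commit",
            "txid=2 send propose", "txid=2 recv abort",  "txid=2 send tx_abort",
        };
        // One regex does both jobs: the "txid" group defines execution
        // boundaries, the "type" group is the abstract event type.
        Pattern p = Pattern.compile("txid=(?<txid>\\d+) \\w+ (?<type>\\w+)");

        // Group events into one trace (event sequence) per transaction.
        Map<String, List<String>> traces = new LinkedHashMap<>();
        for (String line : log) {
            Matcher m = p.matcher(line);
            if (!m.matches()) continue;          // skip lines we don't care about
            traces.computeIfAbsent(m.group("txid"), k -> new ArrayList<>())
                  .add(m.group("type"));
        }
        traces.forEach((tx, events) -> System.out.println(tx + ": " + events));
        // 1: [propose, commit, tx_commit]
        // 2: [propose, abort, tx_abort]
    }
}
```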
Now, in general, we know from prior work by Dwyer, et al., for example, that when you specify systems there are very few patterns that you tend to reuse, right? So these three actually cover the top six patterns out of the ten or so that Dwyer documented. So the patterns here are going to be X always followed by Y. On the example log, X always followed by Y would look like: abort is always followed by transaction abort. So when you see an abort event, then you know that before the trace ends you will see a transaction abort. And commit always precedes transaction commit is kind of the reverse, you know, looking back. So you look at the transaction commit in that third trace, and looking back, you must have gone through a commit event. So this one really parallels causality. And then the final one is abort never followed by transaction commit, which is basically what you think it is. If I reach an abort event, then I will never see a transaction commit event in the same execution. Got it? >>: To do the model based on the log, you say you want to get -- the most accurate would be to capture exactly what happens [indiscernible] compact model? >> Ivan Beschastnikh: That initial model that I told you about, right, that construction, the model represents the log, but it also captures other things. So there are paths in that model that you have never observed that are illegal. And it can happen because you're essentially stitching together executions. So here in this model, this edge, abort followed by transaction abort, might have come from here, right? But then the edge preceding it might have come from a different execution, and so you might get a path in there that actually composes two different executions, right? And this is really the power of this model: it generalizes by stitching together different executions based on common events. >>: Models? >> Ivan Beschastnikh: Well, now my question is, I have this model, and it generalizes, right? How can I make it a little bit more accurate? And accuracy will come from these invariants, these things that I told you about. What I'll do next is actually mine these invariants from the log, and then I'll tell you about a process for changing the model to satisfy these properties. So, for example, this first one, abort always followed by transaction abort: if you actually mined this property from the log, right, you would like the property to be true of your model. That's also kind of a generalizing statement. It says something about a bunch of behaviors of your system that we haven't seen, right? But if we've seen this property hold for all of the executions, then we would like this property to be true of the model, right? And this property is also a correctness requirement for two phase commit. So you would like this to be true of the model as well. >>: So these are inferred? >> Ivan Beschastnikh: These are going to be mined. Yeah, exactly. So I'll show you that in a sec. I have a note here on related work. There's been a lot of work on actually inferring these temporal properties from sequences in different domains, and a lot in the software engineering domain. And I guess the contribution of this work is to actually use these properties for modeling. So you can see them in the tool, and I'll show you where you can see them, but they're not the purpose of the tool.
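To illustrate how cheap these templates are to mine, here is a rough sketch (my own simplification, not the tool's mining code) that checks the "X is always followed by Y" template against a set of parsed traces; the other two templates are analogous scans:

```java
import java.util.*;

public class InvariantMiner {
    // True iff in EVERY trace, each occurrence of x has a later occurrence of y.
    // A candidate that fails in even one trace is discarded, so invariants
    // mined this way hold for all observed executions by construction.
    static boolean alwaysFollowedBy(List<List<String>> traces, String x, String y) {
        for (List<String> trace : traces) {
            for (int i = 0; i < trace.size(); i++) {
                if (!trace.get(i).equals(x)) continue;
                boolean sawY = trace.subList(i + 1, trace.size()).contains(y);
                if (!sawY) return false;   // counterexample trace found
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("propose", "commit", "tx_commit"),
            List.of("propose", "abort", "commit", "tx_abort"));
        // Try every ordered pair of event types as a candidate invariant.
        System.out.println(alwaysFollowedBy(traces, "abort", "tx_abort"));   // true
        System.out.println(alwaysFollowedBy(traces, "propose", "tx_commit")); // false
    }
}
```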
So for two phase commit, here are all of the invariants you would mine. So this is the set of all of them. And some of them are going to be false positives, because your log might not include all possible executions of your system. And some will be actually true of your system, right? And depending on how interested you are in inspecting these, you can deselect some of them, right, to have the model not be constrained by the false positives. >>: [inaudible] >> Ivan Beschastnikh: No. These are mined automatically based on these kinds of invariant templates. So now the question is how do you combine these two. So you have this initial model. You have these properties, and you want to refine the model to satisfy the properties. And to give you an example: for this initial model, the properties that I grayed out below are already true of this model. And what I mean by true or not true is that, for example, the top one, abort always followed by transaction abort, this property's not true of this model, because there's a path in this model that violates this property, right? So there's a path, propose, abort, commit, transaction commit, where you go through the abort node, but then you don't reach the transaction abort node. So that property's not true of the model. So those three are not satisfied. Now the question is how can you satisfy them. And the answer is going to be, essentially, for each violation, for each of these counterexamples to the property, we're going to use counterexample guided abstraction refinement to eliminate it. That's kind of a technical term; I'll show you how it works. So this initial model is what you start with, and then you have those invariants, and this invariant is mined from the log. So it's true of the log, but it's false in the current model, right? So the first step is to find the counterexample. So this counterexample is exactly the same path that I showed you on the previous slide. And you want to change the model to eliminate this counterexample. So the thing to realize about this model is that it really is a partition graph. It's an abstraction of the underlying concrete events. So that commit node, this commit abstract node, contains a bunch of concrete instances of commit from the log, right? So you have this underlying graph structure induced on it. And so one way to change this graph is to refine a partition, right? So somehow we grouped all these commit nodes assuming they're the same, right, but the realization is that actually they're not the same, and the way we're going to differentiate them is based on these properties. So you can change the larger commit partition into these two partitions that are smaller, right. And by doing so, you actually eliminate that path. So now when you look at the more abstract version of this graph, when you reach the abort node, you have to go through the transaction abort, right? There's no way you can get to the transaction commit. So by doing this refinement, you've eliminated the counterexample, right? You're going to do this over and over again, right? So you're going to eventually satisfy all of the properties that you mined from the log and get a more accurate model, and that's kind of the core of the Synoptic procedure. >>: Does the procedure still guarantee all your original traces still parse to the graph? >> Ivan Beschastnikh: That's right. The model will still always accept the logged executions.
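The overall shape of this counterexample-guided loop can be sketched as follows (a skeleton only, written against hypothetical Model and Invariant interfaces; Synoptic's actual heuristics for choosing and splitting partitions are more involved):

```java
import java.util.List;

// A highly simplified skeleton of counterexample guided abstraction
// refinement over a partition graph. Model and Invariant are hypothetical
// interfaces standing in for the tool's real data structures.
public class RefinementLoop {
    interface Invariant {}
    interface Model {
        // Returns a path through the partition graph that violates inv,
        // or null if the invariant holds in the model.
        List<String> findCounterexample(Invariant inv);
        // Splits the first partition on the path that stitches together
        // concrete events from different executions, removing the path.
        void splitPartitionAlong(List<String> path);
    }

    static void refine(Model model, List<Invariant> minedInvariants) {
        boolean changed;
        do {
            changed = false;
            for (Invariant inv : minedInvariants) {
                List<String> cex = model.findCounterexample(inv);
                if (cex != null) {
                    model.splitPartitionAlong(cex);
                    changed = true;  // re-check everything: a split can expose
                }                    // new counterexamples for other invariants
            }
        } while (changed);
        // Termination: in the worst case every partition is split down to
        // single concrete events, and the trace graph itself satisfies all
        // invariants that were mined from it.
    }
}
```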
>>: If you have a huge database, you probably have a large set of invariants that you mine. And now depending on the order in which you address the invariants and all the counterexamples to each invariant, you could end up with a huge final graph. >> Ivan Beschastnikh: Perfect. You're leading right into my next slide. So yes. >>: Are you going to explain the refinement? >> Ivan Beschastnikh: Huh? >>: Are you going to explain how to do the refinement? >> Ivan Beschastnikh: The refinement -- there's a set of [indiscernible] that we use for refinement. To eliminate the path, what we do is actually find the first node along the counterexample where you're stitching multiple executions together, so the first partition where that happens. And then the way you break up the partition -- I mean, it's a little bit detailed, because there are different sets of concrete events. There are concrete events from the two different paths that you're stitching together that you should separate, so you put them in two different partitions, right, and then there's everything else that's in that partition. So different strategies are to make these partitions as balanced as possible, or to assign the remaining commit nodes at random. So there are actually different kinds of -- >>: You said you look at the log as a set of expressions. Then you got into this problem that actually the log was not a set of expressions, it was a set of sequences [indiscernible] execution was not a regular expression. It was actually -- >> Ivan Beschastnikh: Right. >>: It was a problem. You kind of forgot about that. And now you're kind of coming back [indiscernible] and then just the same model. >> Ivan Beschastnikh: Well, I guess the initial regular expression parsing is intended to abstract the log, right? So, you know, I lose some things. But the idea with the regular expressions is that you compose them. So you get to choose what you lose and what you keep, right? Do you care about the disk, or do you care about the network, or do you care about both; which events interest you, right? That is a concern that the user will have. >>: Would you treat each event individually? You just look at pairs of events? Well, you look at pairs of events when you create the initial model. >> Ivan Beschastnikh: Correct. >>: You look at sequences of -- >> Ivan Beschastnikh: Those -- >>: Or maybe you -- >> Ivan Beschastnikh: Those are already satisfied, right. So in the model -- >>: You've been generalizing, right, when you build the model. >> Ivan Beschastnikh: I'm generalizing as much as possible. >>: Right. Then you have to refine it, and then you have to -- >> Ivan Beschastnikh: That's right. That's a good point. I should tell you -- let me move to this slide. Let me describe this. So here I worked out an example on a smaller log that has a bunch of these models, right. And the model on the left is the trace graph, which is what you would actually parse from the log, right? So these are the individual chains. And then the model on the right is the initial model that you start with, where you have one partition for every event type, right? So these are your two extremes, right? The model over there is the most abstract one, and it's the most condensed one. So it admits a lot of behavior. The one over here is very large, because it's very concrete, and it admits only the things that you've seen, right?
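For concreteness, here is a minimal sketch of building the compact initial model at the right end of this spectrum (a hypothetical representation, not the tool's data structures): one abstract node per event type, with an edge wherever some observed trace contains that transition.

```java
import java.util.*;

public class InitialModel {
    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("propose", "commit", "tx_commit"),
            List.of("propose", "abort", "commit", "tx_abort"));

        // Edges of the partition graph: eventType -> set of successor types.
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        for (List<String> t : traces)
            for (int i = 0; i + 1 < t.size(); i++)
                edges.computeIfAbsent(t.get(i), k -> new LinkedHashSet<>())
                     .add(t.get(i + 1));

        // By construction this model accepts every observed trace, but it
        // also admits stitched paths never seen in the log, e.g.
        // propose -> abort -> commit -> tx_commit: exactly the kind of
        // spurious path that refinement later removes.
        edges.forEach((from, tos) -> System.out.println(from + " -> " + tos));
    }
}
```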
So now the question is, like, which model do you want. The one on the left is the log. So if you want to look at the gigabyte log, you can use that model. The model on the right is very small, but it's going to admit a lot of stuff that you don't want. >>: [indiscernible] >> Ivan Beschastnikh: So there have been other approaches that start with the trace graph and actually go this way, right. I guess the approach that I'm describing here is one where you start with the compact model and you go the other way. The problem with starting from the concrete direction, right, is that you're going to have to do a lot of compaction. What we found in general is that starting from the compact side is going to be way quicker. So performance results for our technique are way better than for techniques that do compaction, because there you have to run an algorithm where you do compaction based on the length of strings, right; you find nodes that have identical runs, and then you compact them and you merge them, right. So that algorithm is going to be very inefficient on a very large graph, whereas this thing is going to be much more efficient, because you start with something much more compact. So there are a lot of tradeoffs. So our approach is to basically say there's going to be this dividing line, and this dividing line separates models that satisfy all the invariants -- and obviously the trace graph will satisfy all of them by definition -- from models that violate at least one invariant. Those are the ones in red: some invariants violated versus all invariants satisfied, right? And then the exploration stage is going to be -- well, and here's the goal model. You actually want to find the model that is as abstract and small as possible, right, but is in this green space, right, because it satisfies the invariants. And you can change the definition of this space. You can add another kind of invariant, and this line will move to the left, right, or you can remove an invariant, and the line will move to the right. Now, the operation that you have is refinement, right. Refinement will start with a model that's further to the right and will produce a number of models to the left, right, depending on which invariant you're going to refine, depending on which counterexample you want to eliminate, and so forth. And then other techniques use coarsening. So refinement is the thing I described. Coarsening here is going to be the technique from prior work, like k-Tails, where you can merge these. So refinement splits partitions; coarsening merges partitions. Now the question is how do you get to this intended goal model, and that's really the problem that Synoptic attempts to solve. In practice what this will look like is you start with this initial model. You have some number of choices. You're going to make a choice. And in practice we found that, you know, there are a lot of strategies, but they're all suboptimal. Unfortunately, you're never going to get the global optimum here. And so we select something very simple and cheap, as cheap as possible, because each of these partitions might have, you know, thousands of events. So choose some simple strategy for actually doing the splitting of nodes and then keep going, so you end up with this model. You have some more refinement choices. And then once you jump over this line and end up in a model that satisfies all the invariants, then we're going to apply coarsening.
And the idea with coarsening is that it's sort of like a local optimization. You want to merge partitions locally, right, but it's not guaranteed to find a global optimum. So that's the full algorithm. And one note about prior work: this idea of starting from the far left and then going right, just using coarsening, has been explored in prior work as far back as the '70s, but it's very inefficient. So really the contribution here is to come up with refinement and also to use these invariants as a guide for which refinements are actually worth doing. So that's kind of the Synoptic technique. Now, the evaluation -- you know, we've done a couple of different evaluations. One was to apply it to a system. The second one was to actually have students in a class use it. So the handwritten diagram that I showed at the very beginning was by a student in the class, where they first had to write down the model of their system, then run Synoptic on the system and then compare the two models. And then we've also done some formal evaluations to show that the algorithm always terminates and that it has these nice properties, like always accepting the log, satisfying the mined properties, and so forth. So I'll only tell you about this small user study with the developer of reverse traceroute. Reverse traceroute is a system for actually finding the reverse routing path. So typically you run traceroute to find the forward path, but the reverse path is usually obscured, and you don't see it. So reverse traceroute is a system that uses multiple vantage points and a controller to find out, you know, what path your packets are taking on the reverse path. And it's deployed internally at Google, has been deployed for a while. And what we've done is apply Synoptic to the server logs, and the developer had to change the logging code, which is not good. Ideally you would like this to apply to just existing logs. But they had to change the logging code to have a better format; free-form text makes it too difficult to write the regular expressions. And we basically processed about a million events in 12 minutes. And the model that we got is the following thing. So unless you're a developer of reverse traceroute, this will not make a lot of sense to you, but, basically, a single path in this model is a path that reverse traceroute takes as an algorithm, right. So sometimes you perform measurements; sometimes you assume symmetry, because you don't have any measurements; and so forth. So those are the key things that were logged by the developer, and each of these corresponds to a method in the code of the controller. And so I'm highlighting a couple of things. One thing I should say is that there are numbers on these edges, and the numbers are actually probabilities. And they don't always sum up to one in this diagram, because the full graph was more complicated: we had to hide the low probability edges in this diagram for it to be more readable. So you can think of the model as -- you know, you could look at just common behavior, which is what we're showing, by hiding low probability edges, or you could look at rare behavior if you want to find rarely occurring bugs, for example, and then you would only show the low probability edges. So you could have different use cases. This one was showing the common behavior. And the things I'm highlighting are two issues that the developer found.
One was that these shaded rhombus nodes are terminal nodes, but they shouldn't have been terminal. So essentially one execution of the system would terminate at an event that was not supposed to be a terminal event, and that was one thing the developer found by looking at the model. And the second thing is these dashed edges. So these dashed red edges were not present in the final model, but they should have been, and that was the other issue. >>: So you say the nodes should not be terminal. The developer just knew that, right, or was that inferred? >> Ivan Beschastnikh: The developer knew that. >>: Okay. >> Ivan Beschastnikh: Both of these issues were found by the developer. So I basically overlaid the developer's interpretation of what is a bug and what is not on top of the model; okay? >>: [indiscernible] >> Ivan Beschastnikh: Yes. I'll show you the -- well, let me actually show you the tool. I have it running online. And here it is. It's called Synoptic. And so here's an example log, right. This log is going to be an Apache log where you have a bunch of get requests, so you have an access log. And this log is going to be for a web application that's like a shopping cart, right. So people come online. They check out. They enter credit cards, and so forth. So the input to the system is going to be this log and then these two regular expressions. You'll note it matches every log line. So it matches this thing perfectly. There's some IP address, and then there's the magic keyword 'type'. And type is going to be our abstract event type, right. So the abstract event type is going to be just the name of the PHP file that you're getting. And then the other thing that you need is to actually somehow split these executions. So what is an execution boundary? And in this case, the execution boundary is simply the IP address. So a new client to the server is a new execution of the system. And so basically you have executions that are matched one to one with clients. >>: [inaudible] >> Ivan Beschastnikh: And here you can actually notice they're all intermixed. So it's going to actually pull them out. So these are the only inputs to the system. And then the model that you'll get out will be something like this. I'm not a graphics person, so I can't make this look fantastic. But here it is. So basically you start off in this initial node, and then a trace starts in initial and ends in terminal and goes through these partitions. And you can click on any one of these partitions and then find out the log lines that matched, that are basically merged into that one partition. So there are some invalid nodes, and then there are two checkout nodes. So there were two checkout nodes that were differentiated based on the invariants, because somehow they're considered different. Let me see if I can make this a little bit better. Yes. So the question for you guys then is: can you find a bug in this model? >>: [indiscernible] >> Ivan Beschastnikh: Yeah, that's right. So that's kind of the idea, right. Do you guys see that? So there's a path where the invalid coupon, you would assume, would take you back to checkout. Instead it goes to reduced price. So, you know, what does that mean? Well, it means that this transition is actually in the log, and you can select two of these partitions and ask what paths go through them, and you'll find there are some 15 traces that actually go through both of these nodes.
And then ideally you would -- I didn't build that thing yet. But that's the intuition for how you would use this model. It's an abstraction of what's in this log. The log has a lot more detail. It has timestamps. It has all this other stuff. Really, when you think about the logic of your program, maybe you don't care about any of that, right. What you care about is just the event sequence and whether it's reasonable. >>: How do you find the one, like 151? >> Ivan Beschastnikh: Ah, yes. The probabilities are based on the number of events, all right. So this checkout node contains N events, right, and then N over 2 go to this guy. So probabilities are based on the concrete observed events. >>: [indiscernible] >> Ivan Beschastnikh: Yeah. So in the log, right, there were some traces that went this way and some traces that went that way, right. And so exactly half of them took one edge, and the other half took the other edge. And I didn't talk about this in the talk, but you could show much more information on top of this model. And probabilities are just one thing that we thought about. One other thing you could show is actually time, or a distribution of time, right, because you have time between events. Then you can say here's the shape of the timestamps, for example. But it's based on the concrete underlying log, and it's added to the model after the fact. So it's not used in the actual modeling. >>: You're kind of assuming that the process where the nodes came from is completely right? >> Ivan Beschastnikh: You assume it's a sequence of events, and that it's reasonable to model a sequence of events in this formalism, right. >>: Two branches are going out from the initial checkout, one that goes to the credit card and one that goes to coupons. >> Ivan Beschastnikh: You mean this one, right? >>: Yeah. >> Ivan Beschastnikh: Sorry. That's actually a bug. You know, it's sad. I keep presenting it over and over. People keep finding it. I have to actually go fix it. >>: So are we talking about -- so there's this interplay between filtering the invariants and the complexity of the model that you generate, right? And at least as you presented it, we discover the invariants, and we select which ones we're going to -- >> Ivan Beschastnikh: We select all of them right now. So, as a developer -- right now, you know, in that input screen, you don't get a choice. One thing I didn't -- >>: Also, do you have some threshold for the strength of the invariant? >> Ivan Beschastnikh: The invariants are all very simple right now. So they have to be true for all the executions, right. So, yes, ideally you would want something like a probabilistic invariant that would be more robust. So you would say 99 percent of the time this is true; maybe the one percent of traces are actually just malformed. >>: The invariants that are true all of the time? >> Ivan Beschastnikh: Yes. And we use all of those, absolutely. And you can look at them here, right. So here are all the invariants that we mined for this thing. And right here's a little visualization: always followed by, always preceded by, never followed by. And you could take these and then actually, you know, remove them. And then the model would change, right. So yes, this gives you some kind of knob to go and tweak. So if you disable all of the invariants, then your model will be the initial model. You wouldn't perform any refinement.
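Returning to the probability question above, a minimal sketch of how such edge annotations can be derived (my own illustration, computed after the fact from the concrete log, just as described): the probability of an abstract edge is the fraction of concrete events in the source partition that transition to the target partition.

```java
import java.util.*;

public class EdgeProbabilities {
    public static void main(String[] args) {
        List<List<String>> traces = List.of(
            List.of("checkout", "creditcard", "confirm"),
            List.of("checkout", "coupon", "reduced"));

        // Count concrete transitions: eventType -> (successor -> count).
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (List<String> t : traces)
            for (int i = 0; i + 1 < t.size(); i++)
                counts.computeIfAbsent(t.get(i), k -> new LinkedHashMap<>())
                      .merge(t.get(i + 1), 1, Integer::sum);

        // Normalize per source partition, e.g. checkout -> creditcard 0.50
        // and checkout -> coupon 0.50, since half the events took each edge.
        counts.forEach((from, tos) -> {
            int total = tos.values().stream().mapToInt(Integer::intValue).sum();
            tos.forEach((to, n) -> System.out.printf(
                "%s -> %s  %.2f%n", from, to, (double) n / total));
        });
    }
}
```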
>>: I was wondering, if you rely on a visual representation of a machine, it usually doesn't scale beyond [indiscernible] or something like that. When that's the case, what do you recommend? Like removing some of these things, or using a fancy visualization for very large things? >> Ivan Beschastnikh: Yeah. When I started this project, I didn't think that visualization would be an important component of it. It turns out it is. So, for example, in the reverse traceroute model that I showed you, we had to remove nodes and edges of low probability in order to make that model fit on the screen and be readable. So that was one simplification that we've done. I would actually argue that what you really want to do is choose a different, smaller component, or choose a different abstraction for your model. So basically, to me it means that the level of abstraction that you selected with these regular expressions on top of the log is simply too close to the log, too concrete, right. So you should raise the level of abstraction in order to make that model easier to interpret. But that's not always the case. You know, sometimes it really is a complex case. So this tool definitely would not work for a very complex system that has that kind of complexity. >>: A question from the video, from someone watching online: he's asking what [inaudible] found in reverse traceroute -- how critical were they, did they actually cause crashes or performance problems or anything at all? >> Ivan Beschastnikh: Yes. The developer knew about one of them. Like, they knew that this thing happened. They didn't fully understand where it came from. And actually the model doesn't tell you where it comes from either, right. It's like a diagnosis tool. You then have to go find the root cause afterwards. So the underlying cause was a threading problem, a concurrency problem. The bugs didn't crash the system, so they were not that severe. Okay. So I'll switch over. I showed you this demo thing. So to summarize, Synoptic takes a single totally ordered log and produces this event-based model, and it does this with refinement and leverages these mined invariants, which it satisfies in the final model. And primarily the use case here was comprehension. So the idea is you want to help developers with large and complex logs. That was the original intent. And it's open source and deployed on this web interface, which sort of scales, you know. Yeah, please. >>: What's the largest log that you've run? >> Ivan Beschastnikh: Like a million log lines. >>: A million? >> Ivan Beschastnikh: Yeah. >>: So was it reverse traceroute? >> Ivan Beschastnikh: Different versions of reverse traceroute. We've done larger ones for it. The complexity of the algorithm really depends on the number of event types and the diversity of the log that you have. So the more connected your graph becomes, the more difficult it is to check those properties, and then refinement of course slows down as you have more. So the next project I'll talk to you about -- Yes, please. >>: Did you do any investigation of how robust this is, like to the loss of log lines, or if you log inside one environment and not the other? >> Ivan Beschastnikh: Right. So you wouldn't pick up things that you haven't logged. So it's very much constrained by what you have in the log.
>>: If you just take out a single log line from the log, how badly would that screw up the model? >> Ivan Beschastnikh: That's a good question. If you have another execution that is just like this one, it wouldn't change the model, because essentially you have this redundancy. The underlying concrete executions would still have the same edges, and you would mine the same invariants, and refinement would perform the same. But if you remove a log line that would remove an invariant, for example, it might change the model radically. >>: In the case that one execution has an event and the other does not, would that show as an error? >> Ivan Beschastnikh: What do you mean? >>: So say one execution has some logs omitted, and another execution has those logs. So will it show up in the execution logs as an error? >> Ivan Beschastnikh: As an error? Well, the model would present both executions, right, and we would merge those sections that it would be able to merge, but it wouldn't show as an error, actually. It would attempt to satisfy both. It would attempt to model both. >>: [indiscernible] >> Ivan Beschastnikh: What about what? >>: [indiscernible] >> Ivan Beschastnikh: So, yes, the invariants have to be true for every execution. If there's at least one execution, even if you have a million of them, that doesn't satisfy the invariant, then you throw it out. Then you're not going to mine it, and then you're not going to use it. >>: When you have one trace that looks like an invariant -- you have three conditions for the invariant that only happen in that one trace -- and you run a million more runs, maybe you would have found the case where you get those three conditions but you go the other way, so it's really not an invariant. >> Ivan Beschastnikh: Right. It would be a false positive; absolutely right. So you would mine the false positive along with the true ones. The answer here is really you want as much information as possible. For the model to be accurate, you need to observe more things. >>: So the initial model is just one [indiscernible]. >> Ivan Beschastnikh: Right. >>: When you do that refinement [indiscernible] or how do you do it? >> Ivan Beschastnikh: Yeah, we -- >>: Does it become a bottleneck? >> Ivan Beschastnikh: Yeah. The memory pressure is definitely a bottleneck, or if you get -- >>: Do you need to have the full log in memory? >> Ivan Beschastnikh: You do need to have the full log in memory. There are some optimizations that you can make where you only need to store some concrete aspects of the traces. You don't need to store all of them. For example, if you have two identical traces, I just need one of them. I don't need both. So that's one kind of optimization. >>: From the left. >> Ivan Beschastnikh: You're right. And there are other cases where you might not need to store some things, because, you know, during the merge the invariant is satisfied, right. So if certain partitions are never going to be refined, then you can lose them. >>: Will you know that? >> Ivan Beschastnikh: Well -- >>: The partition. >> Ivan Beschastnikh: You can know that sometimes, because the underlying refinement assumes that you can only refine a partition if it stitches multiple executions together, right, that have different futures essentially, right. And that's when you're going to want to refine it.
So the optimization would be, you know, I take that partition and then I check -- if you can check that all of the invariants are satisfied below, then I can throw away the state for them. That would be one optimization. But we haven't implemented any of those. So that would be future work. >>: Yeah. >> Ivan Beschastnikh: Yeah. Definitely. This is a single threaded Java process right now. So this is not optimized at all. It's mostly experimental. So let me jump into the Dynoptic thing, because it has a bunch of details you guys might or might not enjoy. So in the sequential case, the intuitive model is kind of a sequence of events. The question is what is intuitive in the distributed case. And in the distributed case, when we told students to write down a model of their system, they came back with pictures that looked like this. All right. So basically they would model each component as a finite-state machine. You would model the client as a finite-state machine. You have the server finite-state machine. The only catch is that the server emits events that the client consumes, right. So they're actually linked, you know. So this is very intuitive to students, and the idea was, well, people are familiar with finite-state machines; let's use a formalism that's close to this. So in the Synoptic case, you would infer an event-based model. In the Dynoptic case, we would actually infer a model that is communicating finite-state machines, CFSMs. And let me tell you more about what these things are. So CFSMs basically have some number of processes, and they're connected with FIFO queues, and the queues are reliable. So one example CFSM that I'll use here is a very simple one where you have process one on the left, process two on the right. And you actually have states, and they both start off in their initial states. And then you have these funny looking transitions, where some of them may be communication events. So exclamation mark M means that you want to send M on queue one; you're inserting M into that queue. And then question mark M means receive that M on that queue. And once you receive it, you can proceed, right, but you cannot execute this event unless there's actually an M at the head of that queue. So this is really modeling message passing, and it also includes local events. So you can execute a local event for free, right. And then you might communicate back over a different queue, because there are queues in both directions. And then you would consume it on this end. And that would be a complete execution. >>: [indiscernible] >> Ivan Beschastnikh: Exactly. >>: So that could be anything. >> Ivan Beschastnikh: Yes. I'm not assuming anything. This could be shared memory. This could be a socket. So this is one execution of this CFSM. There's only one execution of this CFSM, but in general they may be asynchronous, right, because there might be another process that I'm communicating with independently, right, and they don't have to match up. So the idea here now is: what if we have a log, and we want to produce this kind of model, right. How could we infer this kind of model from a set of events from the log? So, you know, Dynoptic is very similar sounding to Synoptic, and our pipeline is going to resemble Synoptic's. We're going to have very similar steps where we parse the log, build this compact model, and mine some properties. These properties are going to be more interesting, because now they're going to involve events at particular processes.
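Pausing on the CFSM formalism walked through a moment ago, here is a minimal sketch of that complete execution (the event and queue names are illustrative): two processes coupled only through reliable FIFO queues, where a send enqueues a message, a receive is enabled only when that message is at the head of the queue, and local events fire unconditionally.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CfsmSemantics {
    public static void main(String[] args) {
        Deque<String> q1 = new ArrayDeque<>(); // process 1 -> process 2
        Deque<String> q2 = new ArrayDeque<>(); // process 2 -> process 1

        q1.addLast("m");                       // p1: q1!m   (send m on q1)
        if ("m".equals(q1.peekFirst()))        // p2: q1?m   (enabled only
            q1.removeFirst();                  //   because m is at the head)
        System.out.println("p2: local event"); // local events are free
        q2.addLast("ack");                     // p2: q2!ack (reply travels on
                                               //   the queue going the other way)
        if ("ack".equals(q2.peekFirst()))      // p1: q2?ack
            q2.removeFirst();

        System.out.println("complete execution, queues empty: "
                + (q1.isEmpty() && q2.isEmpty()));
    }
}
```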
So as an invariant I might have something like: send at process one is always followed by receive at process two. And we use something like refinement to get the final model. So it's very similar. I'll describe all of these steps now in the Dynoptic setting. So the first question is, okay, you have a log, how do you get an execution back out of it; how do you parse it? Well, it's a distributed log. So what you're going to need is something like vector timestamps. So you want to actually capture not a total order but the partial order of your system execution. So these vector timestamps are logical clocks that you can use to reconstruct the partial order of the execution as it occurred. And what we have built is a little library that you can compile in -- for Java, you recompile your system with this jar, and then existing log messages will have these vector clocks put in automatically. So you get this logging for free. So you don't have to have it as part of your system. We'll add it in automatically. It's going to have an overhead, but you don't have to worry about generating these timestamps. >>: You said that you would add the vector clocks into the log? >> Ivan Beschastnikh: Yes. So assume that your initial log didn't have these things; you just had these things on the right. Then you compile it with the jar, and it automatically tracks causality and then just adds these to every log line. So when you log normally, you get queue-one exclamation M; then it would also prefix this vector clock. So the idea is that you just want to capture the partial order, right? And I don't want to change your log, because you log some things for a reason, right? You just keep that. But I want to capture the causality. I want to reconstruct it. So it's a very basic vector clock mechanism. So you have this log now. So how do we build the initial model? You're going to have kind of two steps. First of all, we're going to deal with state. Unfortunately, because these CFSMs have queues, you actually have to reason about message FIFO state, you know, what messages are outstanding. So first we're going to reconstruct the state, and then we're going to do partitioning not based on events but based on queue state. So let me show you how this works. So initially you have this time-space DAG that you parse from the log, and then you want to come up with a state-based DAG. The process here is going to be very simple. It's going to be simulation, where you walk through the events. So I start off with a state where both of the queues have no messages; they're empty. Then the first event is a send of M. So then my simulation will basically say, okay, I just want to add M to the queue. Then I receive it. Then I go to a new state that has both queues empty. I execute a local event, which doesn't change the queues. And then I execute the ack sequence. So the idea is that you would parse a bunch of these from your DAG, and now you have a model that has states and events, more complicated than Synoptic, but you can apply some of the same ideas. So the idea now is you want to build this initial global model. That's compaction. And remember, in Synoptic what we've done is we've seen commit and commit, and we've merged them together. We assume they're the same. In this case, we're going to build a state-based version of that. So we're going to look at these queue states, and then we're going to merge states whose queues are the same.
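Backing up to the instrumentation step for a moment, here is a sketch of the bookkeeping such a vector clock library relies on (the library's actual API is not shown; this is just the standard logical clock algorithm): each process ticks its own entry on a local event or send, attaches a copy of its clock to each message, and on receive takes the elementwise maximum before ticking.

```java
import java.util.Arrays;

public class VectorClock {
    final int[] clock;   // one entry per process
    final int me;        // this process's index

    VectorClock(int numProcesses, int me) {
        this.clock = new int[numProcesses];
        this.me = me;
    }

    int[] onLocalOrSend() {            // local event or send: tick own entry;
        clock[me]++;                   // on a send, a copy travels with the message
        return clock.clone();
    }

    void onReceive(int[] received) {   // receive: elementwise max with the
        for (int i = 0; i < clock.length; i++)          // sender's attached clock,
            clock[i] = Math.max(clock[i], received[i]); // then tick own entry
        clock[me]++;
    }

    public static void main(String[] args) {
        VectorClock p1 = new VectorClock(2, 0), p2 = new VectorClock(2, 1);
        int[] attached = p1.onLocalOrSend();            // p1 sends: clock [1, 0]
        p2.onReceive(attached);                         // p2 receives: clock [1, 1]
        // The prefix that would be written before p2's log line:
        System.out.println(Arrays.toString(p2.clock)); // [1, 1]
    }
}
```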
Actually, in practice there's going to be an approximation here, where we'll merge queue states that have the same top K messages. So we're just going to assume that beyond K the queues are identical. So this is where it's not going to be exact. So queue one and queue two is going to be the abstract state that represents all of these orange states. And I'll do the same thing for this guy and the same thing for this guy. So now I have all of my abstract states, and I create edges between them in the same way. So I have a transition between a state where I have an M and a state where all queues are empty, on receive of M, if this actually happened in practice, right, if I saw the concrete event that made this transition. So now I have the initial global model. It has some of the same great features as Synoptic. So it accepts the log: every one of these executions in the log is going to be a valid trace in this model. But this model is actually -- you can think of it like a cross-product of the processes. It's not actually a CFSM. You have to reason about global state, you know, global events across all of your nodes. So the actual decomposition is going to come up in refinement. So let me tell you about invariants. The invariants are going to be very straightforward. You have a bunch of these DAGs, and what you're going to do is mine the same kind of templates that we had in Synoptic, except that now your events have a process ID associated with them, right? So you might have an event that only executes at process one, you know, that is always followed by an event that executes at process two. So you mine the same set of things. And now the question is how do you use both the initial model and the invariants to get this CFSM formalism that I told you about before. >>: If a process sends a message on queue one and it's received, isn't that always enforced by construction, by your log? Am I missing something here? >> Ivan Beschastnikh: Yeah, I think you're right. It is enforced by construction in the logging, but it's not enforced by the model. So you still have to -- I haven't thought about that. That's a really great point. So you don't have to mine it in a sense; you could just add those. Yeah. The reason it's actually there is because the library that adds the vector timestamps came in after the fact. So if you implement your own version of vector clocks, or if you have message loss, then that would be one way of handling it. So now you want to compose these two, use both of them together. And this is the really fun part, the really complex part. So you have this global model. The first step is going to be to decompose it into a CFSM, and the decomposition is pretty straightforward: you just pay attention to the events of the individual processes, right. So I take this model, and I only look at events for process one, right. And I treat events for other processes as epsilon transitions. So I'm going to receive an M, and then I'm going to send an M, and this is going to be an epsilon transition. And then something happens somewhere else, and I don't care what it is, but for me, locally, I transition to this other state. So using that approach you can decompose this thing into these two CFSMs, and these two CFSMs are very compact. As you see, they're just one state each, and the reason for that is because it is the most compact global model for that example.
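A rough sketch of just the filtering half of this decomposition (the representation is hypothetical): keep a global transition when its event belongs to the chosen process, and treat other processes' events as epsilon transitions. The real decomposition additionally merges states connected only by epsilon transitions (an epsilon closure), which is omitted here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class Projection {
    record Transition(String from, String event, int process, String to) {}

    static List<Transition> project(List<Transition> global, int process) {
        List<Transition> local = new ArrayList<>();
        for (Transition t : global) {
            if (t.process() == process) {
                local.add(t);  // this process actually moves on this event
            }
            // else: epsilon -- some other process moved; nothing is emitted
            // locally, and the endpoints would later be merged together
        }
        return local;
    }

    public static void main(String[] args) {
        List<Transition> global = List.of(
            new Transition("s0", "q1!m", 1, "s1"),
            new Transition("s1", "q1?m", 2, "s2"),
            new Transition("s2", "local", 2, "s3"),
            new Transition("s3", "q2!ack", 2, "s4"),
            new Transition("s4", "q2?ack", 1, "s5"));
        // Process 1 keeps only its own send of m and receive of ack.
        System.out.println(project(global, 1));
    }
}
```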
And then the next step is to use prior work on formal methods. Luckily I didn't have to invent this. So there's been prior work on actually model checking CFSMs, right. And so we're going to throw a model checker at the model that we have right now, to get back a counterexample for an invariant if there is one. So, for example, on this guy, there is an invariant: send of M is always followed by receive of M. And we use a model checker called McScM, which model checks CFSMs exactly. So it might not terminate; we're thinking about these things. But that's kind of a side point. The point is you want to check the invariants. And you find there's a counterexample, right. There's an execution where you send an M and then, without it being received, execute the local event, and so this execution is a counterexample to that invariant. And that execution is then going to induce a refinement of this graph. So here, you know, I can send an M, and I can execute a local event, right? So the idea is you want to split this guy out to require that every time you send an M you're going to receive an M, right? And so this is refinement, as before in the Synoptic case. And once you have this new global model, you do this step again, right? You're going to come back with a new CFSM, model check it, and then eventually you're going to be done. So for this example that I worked out, Dynoptic would give you exactly the two traces that you observed; it will give you exactly the CFSM for that example. So that's the Dynoptic process. It's more involved. It has many more formal methods parts in it, and it's a little trickier. I feel like we're still struggling to understand all aspects of it, but we've done some preliminary evaluation on this. We've simulated some protocols and generated some simulated traces. We've evaluated it with the Voldemort DHT, which has a replication protocol inside of it, and we selected just the messages that do the replication, which turns out to be really trivial. So it's perfect for our tool. And then we've done a case study with the TCP opening and closing handshakes. The problem with CFSMs is that they again don't model data. So you cannot reason about sequence numbers, for example. So you cannot do the data phase of TCP, but you can do the opening and closing handshakes pretty easily. And then there's been a bunch of formal evaluation and looking at the usability of these models. I'll just show you the DHT result. So Voldemort is a distributed hash table. It basically has this very narrow interface: you can associate some value with a key, or you can retrieve the value for a certain key. And you know this is actually deployed, and it's open source, and you can download it and exercise it. So we ran Dynoptic on logs generated by Voldemort's unit tests, and we just targeted the protocol messages for the DHT. And so this is the CFSM that you get out. It's not as pretty; I had to prettify it manually. But, basically, these unit tests use two replicas and one client, and this replication protocol is really straightforward. You basically have a right side to it and a left side. On the right-hand side you're executing put: you're associating a value with a key. And so this guy is going to essentially execute put on this replica, wait for a reply and then execute a put on the lower replica. And then the same thing for get; you're going to follow the exact same path. So they kind of mirror each other.
So, coming back to the Voldemort model: through inspection, we found out that this is indeed the true model for replication in Voldemort. Replication is really, really simple in Voldemort, and that's why this model is pretty good: you can inspect it, and it succinctly captures the three-node distributed execution. So the contributions here are very much like in the Synoptic case, except we have one more, which is to handle distribution. So how do you handle a log that's generated by multiple processes? Our answer is that you want one finite-state machine per process; that's one way of doing it. And in our case, we found that it elucidates distributed protocols. You know, logs that capture no partial order in a distributed setting are hard to work with; logs that do capture the partial order are exact, but they're even more difficult, because now you have to reason through all of these orderings. So a more general model can help you understand these protocols better. It's open source. It's not actually deployed anywhere, but you can try it out. So before I conclude, I want to thank a bunch of people. I worked with a trio of advisors on this project, and my collaborators, and a ton of students at UW, and we were generously funded by DARPA, Google and NSF. So the contribution of this talk is that, you know, I think logs have a lot of potential; they have a lot of content in them. What I attempted to do is apply basically formal methods to log analysis and model inference in these two tools: Synoptic infers sequential models, and Dynoptic infers distributed models. The idea is that you can then use these to help developers understand what goes on in their systems. You can find out more online. Thanks for your attention. Thanks for coming. [applause] >>: So what is your next step? >> Ivan Beschastnikh: Actually, I'm excited about applying model inference to other domains, to other kinds of problems. I think Dynoptic is an interesting theoretical kind of tool, but I think it's really difficult to make it scale. So I was thinking of taking the Synoptic approach and applying it to logs that have more information. I was telling Peter about logs that have timing information, for example: you could have probabilities on edges, but you could also have time on edges. And you could use this to actually think about performance within your system. So now I'm not just modeling an execution, I'm also reasoning about the time it takes between events, and I could ask, you know, give me the path in the model that is slowest based on the observations, right. And I could think about using that for things like performance testing or performance analysis. So that's an immediate low-hanging fruit that I was thinking about. I guess, in general, I'm more of a systems person, and I like to apply these techniques to systems. So I'm thinking about doing test case generation for distributed systems. I think that would be a great thing to go to next, because distributed systems are very difficult to test, and I feel like not a lot of people know how to test them well. So when people write distributed systems code, they don't test as much, but they test very specific things, and I was wondering if there's some way to leverage that intuition that developers have about their systems to generate test cases better. So they're both kind of in the software engineering domain as techniques, but the applications would be kind of in systems for me. >>: [inaudible] >> Ivan Beschastnikh: Yeah.
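As a rough illustration of the timing idea in that exchange (this is speculation about future work, not a shipped feature; the event names and numbers below are made up):

    from collections import defaultdict

    # Each trace is a list of (event, timestamp-in-seconds) pairs.
    traces = [
        [("put", 0.0), ("put-reply", 0.4), ("put", 0.5), ("put-reply", 1.6)],
        [("put", 0.0), ("put-reply", 0.3), ("put", 0.4), ("put-reply", 0.9)],
    ]

    # Collect the observed elapsed times along every model edge.
    edge_times = defaultdict(list)
    for trace in traces:
        for (a, ta), (b, tb) in zip(trace, trace[1:]):
            edge_times[(a, b)].append(tb - ta)

    # The edge with the worst observed latency is a candidate for
    # performance testing; here put -> put-reply at roughly 1.1 seconds.
    slowest = max(edge_times, key=lambda e: max(edge_times[e]))
    print(slowest, max(edge_times[slowest]))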
You know, there's some ways of cheating with Synoptic. I also think that model inference is not applicable to all problems. I wouldn't necessarily use model inference for test case generation; it would be nice to, but I'm not sure it would work out. >>: It seems like you touched on it, but you said that one of the major reasons people started using this was the realization that it was actually more useful than the raw data underneath it. >> Ivan Beschastnikh: Right. >>: Have you put more thought into what you could do, or what can be done, to improve that side of it? From a developer's point of view, as far as I can tell, that seems most valuable. >> Ivan Beschastnikh: Right. Yeah. I think a tighter integration between the model and the log would be really useful. So right now you have some kind of window that you can peer through, you know, into the abstraction. So you have this node, and you can see the log lines that are related to it. But I think you can ask more queries of the model. Like you could actually ask a question like, why are these two nodes split out, why can't I merge them? And then the answer would be, well, there's this invariant, and the merge would violate it: if I merged them, then this path would violate this invariant. Being able to pose questions like that I think would be useful, because oftentimes when you have this log you have a very specific question, and I think the research question is, can you interpret that question, pose it against this abstraction, and get an answer. So I think that would be very helpful. Right now it's very much open ended. You could use it for exploration, but I think building it towards a certain set of tasks would be the right thing to do. >>: It's a luxury to be able to just collect logs and say, let's go find some bugs. Usually it's like, oh my God, it's totally broken, what's happening with the system? That's usually why people go down these sorts of avenues, to have something more direct. >> Ivan Beschastnikh: That was my initial goal. It's just comprehension overall, because you do use logs for so many different things. You know, if I just give you something that's a little bit easier to use than a large text log, it will make your life a little bit better. But I certainly don't want to build in any assumptions about what you might be interested in, which is why, you know, the abstraction that is done by the regular expressions is left completely up to the user. So thanks very much. [applause]
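A hypothetical sketch of the "why can't I merge these nodes?" query discussed above; merge_nodes and check_invariant are assumed helpers, not part of Synoptic's real API:

    def explain_no_merge(model, node_a, node_b,
                         invariants, merge_nodes, check_invariant):
        # Tentatively merge the two nodes, then report the first mined
        # invariant (and the violating path) the merged model would break.
        candidate = merge_nodes(model, node_a, node_b)
        for inv in invariants:
            violating_path = check_invariant(candidate, inv)
            if violating_path is not None:
                return inv, violating_path  # merging would admit this path
        return None                         # the merge would be safe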