>> Tom Ball: Hello, everybody. Welcome. Good Monday morning to everybody. My name is Tom Ball, and it's my great pleasure to welcome Domagoj Babic back to Microsoft Research. He was an intern with us previously and worked with Madan Musuvathi and myself for a little bit. He was an MSR graduate fellowship recipient, and he got his PhD with Alan Hu at the University of British Columbia. Since then he has been in industry for a little while and, more recently, a research scientist at UC Berkeley. He has been working on security and program analysis, and we're going to hear more about that today. >> Domagoj Babic: Thanks, Tom, for that great introduction. Good morning, everyone. It has been a while since I visited MSR; I think the last time I was here was three or four years ago. So today I'm going to talk a little bit about grammatical inference and its applications to security and program analysis. This is the work that I've been doing mostly during the last year, year and a half, with a number of collaborators at Berkeley. At a high level, you can see grammatical inference as a set of machine learning techniques that try to infer a formal language, either context free or regular, something like that. I'm going to talk about this more; this is just so you know what I'm talking about at this point. To motivate the work I'm going to start with some numbers. Over the last decade we've been witnessing a deluge of malware. According to some statistics, anti-virus vendors are receiving over 60,000 samples per day nowadays, and if you sum it up, it's around 20 million a year. One of the main attack vectors that malware exploits is software flaws, not surprisingly. For instance, only last year, in 2010, we saw over 4,000 medium and high severity vulnerabilities in the Common Vulnerabilities and Exposures (CVE) database and several hundred low severity ones. And this is probably just a drop in the sea of vulnerabilities that haven't been disclosed or were just patched without filing a CVE. I think that state of the art is not surprising considering how complicated software systems are; as Ken said in his talk, software systems are probably the most complicated systems that people develop nowadays. The process of designing software systems is essentially a two-component process: in one part we are adding new features, and in the other part we are finding and removing bugs and fixing security issues. The second part tends to be around 50 percent of the effort, sometimes much more. I looked at some statistics, for instance, on the relative sizes of the Windows test and development teams, and it seems that the test team is somewhat larger, so that statistic seems about right. So effectively the complexity of the systems that we can develop is, at least in part, limited by what we can verify to be correct and secure. Just recently, for instance, I talked to some people from Lockheed Martin, and they said that the systems they are developing are really limited by what they can test and verify. That really seems to be a limiting factor. So what happens if you allow the systems to be partially faulty or partially insecure? Well, the cost of failures can be tremendous, especially security-related failures. According to some statistics, Code Red, which exploited an unchecked buffer in the MS IIS server, incurred over two billion dollars in damages.
So now if you compare that, for instance, to the cost of bugs in the hardware industry, it's roughly the same order of magnitude. Traditionally we believe that verification has been especially successful in the hardware industry exactly because those bugs can be so expensive, but it seems that we have exactly the same thing happening in the software industry. Well, a lot of work has been done on improving the state of the art in verification, automated testing, and various approaches to improve security, and automation has been the core aspect of many of those approaches. I think the community has definitely made huge progress, especially in automating reasoning; for instance, SMT has had a huge impact on verification and automated testing. But one aspect that is still problematic is inductive reasoning. It seems that when it comes to inductive reasoning we still have a long way to go. Thinking about this, I started looking into a set of techniques called grammatical inference, which, as I said before, is a class of machine learning techniques that learn a formal language. There are several basic flavors of grammatical inference. For instance, you can learn from observed behavior, generalize from that behavior, and construct, say, a state machine or a [inaudible]. Or you can learn interactively by probing a black box, or say a gray box, and listening to the responses, and in that way learn your grammar. This class of approaches was studied very intensively in the '90s, and then somehow a large part of the machine learning community figured out that for many of their applications they could also use less expensive statistical approaches, so that became the focus of most of the research. But there is still a small part of the community that continued research along the grammatical inference direction. One of the strengths of GI, grammatical inference, is that it's very effective at inferring structure. Often, when you need to understand the structure of your problem, GI is the way to go. Another strength is that it's inherently inductive reasoning. But the problem is that it can often overgeneralize. If you don't have negative counterexamples that you can use for refinement, there's a real danger of overgeneralizing. For instance, if you have only positive examples, you can always trivially infer a single-state state machine that accepts everything, but that's not what you actually want. So GI has many applications, and I'm going to talk today about several applications that I've studied, and also a little bit about some possible applications that I'm planning to study in the future. The first, and probably the most obvious, application is reverse engineering. For instance, if you have a proprietary or classified protocol and you want to figure out what that protocol does, then GI is a good way to go about that. I'm going to talk in the first part of the talk about inference of unknown proprietary protocols, and I'll describe how one could do that. Another possible application is program abstraction, and that's something that I'm going to talk about in the second part of the talk.
So you can use GI to infer stateful abstractions of your program, either abstractions of interfaces or abstractions of certain modules or abstractions of the whole program, and then use those abstractions in various ways. Another possibility is to infer abstractions of behavior. For instance, for malware you're really interested in what the malware does, and the idea is to infer stateful models of that behavior. That's something that I'm going to be discussing in the third part of the talk. Another possible application, something I'm actually working on these days, is inference of invariants. How can you use GI to infer interface invariants or, for instance, queue invariants? If you're analyzing, say, distributed systems with unbounded queues, then one of the critical parts of the analysis is figuring out what the queue invariants are, because once you know the queue invariants, especially if the system has certain properties, the analysis becomes much simpler and tractable. Other possible applications are in synthesis. For instance, there is a seminal paper by Biermann, I think in '79, who showed that grammatical inference can be used to infer programs essentially from traces. He focused his research on the inference of incomplete state machines, and he proposed an algorithm for doing this inference. So this figure just illustrates how grammatical inference works at a very high level. As I said, one possible flavor of GI is that you observe some traffic between two effectively black boxes; they exchange sequences of messages, you observe the traffic, and then you learn a grammar of that protocol. Another possibility is a proactive approach where you keep sending sequences of messages, listen to the responses, and learn the state machine that way. So this is the outline of the talk. In the first part I'm going to be talking about protocol inference, which was more or less a warmup project for the student I'm working with, and I'm going to use it just to introduce the basic concepts and to explain how this inference works in the protocol setting. Then in the second part, I'm going to show how this protocol inference can be combined with program analysis to guide state space exploration. I named that approach MACE, or Model-inference-Assisted Concolic Exploration, and just presented it last week at USENIX Security. And then in the third part I'm going to present something slightly different. This is more focused on inferring structure, inferring abstractions of the behavior of malware, and I'm going to show how you can use GI there to detect malware and present some results on that. To start with protocol inference, in particular we focused on botnet protocols. Those protocols are essentially proprietary protocols, and the way these botnets work is that they infect a potentially large number of machines. Some of these botnets infect hundreds of thousands of machines, and they're fairly complicated distributed systems: you have individual bots on clients, and they are the clients in this whole distributed system, and then you have servers that support this whole network, and they serve various purposes.
So, for instance, in the MegaD botnet command and control protocol, you have clients on one side, and then you have three types of servers -- actually there are more types of servers, but three types are really important. You have the master server, which is used to send commands to individual bots: you can send them commands to send spam, you can send them commands to start eavesdropping on the infected machine, all kinds of commands; you can essentially control the individual bots. And then they also have -- so which one was which one? Oh, this one was the template server. They have template servers that serve templates. The way these [inaudible] bots work is that they keep getting fresh templates on a daily basis, and then they individualize these templates so that they increase the chances of people clicking on them, and then they start spamming. And if you're controlling, say, hundreds of thousands of machines, you can send really huge amounts of spam through that botnet, and people are actually paying a lot for that service. Then there is also the third type of server, just called the SMTP server, which is more or less a test server. Each bot sends an e-mail to that server, and if it gets a response, that means it can successfully send spam. So it's more or less a testing server, but it plays a very important role in the whole protocol, as I will show later. And then there are some other servers as well. For instance, there is an update server; these botnets have an automatic distributed update infrastructure, so they're fairly complicated. Now, the problem that we want to solve: we want to figure out how this protocol works so that we can potentially figure out how to defeat it or how to detect such an infection on, say, a corporate or academic network. So how would we go about this? Well, we can apply Angluin's classic L star algorithm. I can't really go into all the details, but I want to give you an impression of how it works at a high level. Here we have a state machine that we want to infer. It's a [inaudible] state machine, and let's assume we know the input alphabet, sigma I, and the output alphabet of that state machine. The way L star works is that it constructs a so-called observation table, fills up all the rows and columns, and when it's done, it just reads off the states and transitions from the table. What we have here is that this part of the table contains all the outputs, meaning responses from the black box that we are probing. The first row contains the suffixes, and the first column contains the prefixes of inputs. The way you generate inputs is that you combine a prefix with a suffix, and that's how you get a sequence of messages, the query. Then you send that query to the black box whose state machine you're trying to learn, you listen to the response, and you store the response in the observation table. So, for example, here we start with epsilon and concatenate it with A, and so we get the string A as the input. The response here is Y, and so we store that in the table. We always store only the suffix of the response whose length is equal to the length of the string in the first row. So in this case, we store only a single symbol.
We repeat the same thing with the other symbols, B and C. At this point, we have learned what we call a state-distinguishing vector; this is like a signature of a state. Then we extend the symbol in the first column, epsilon, with all the symbols in the input alphabet. So now we concatenate this prefix with the suffix, so we get the input A,A. We execute that on the state machine and get the response Y,Y, but then we store only the suffix Y in the table. You repeat the same thing with A,B and A,C, and then you repeat the same thing for the other sequences. Now you can see that the first state-distinguishing vector is already present in the upper half of the table, which essentially contains unique state-distinguishing vectors for every state, so we don't need to move it up there. But the second vector doesn't have a representative in the upper part of the table, so effectively it represents a new state, and we move it up there. And then the same for the third one. At this point we have effectively identified all three states, but we still don't know all the transitions. So we take the sequences that represent states and append all the symbols from the input alphabet, getting sequences like B,A, B,B, and so on. >>: You got all the three states, but actually the algorithm doesn't know that. >> Domagoj Babic: Yes. >>: Right. Okay. Go ahead. >> Domagoj Babic: Yeah. >>: So we know that, but the algorithm is proceeding just with the same -- >> Domagoj Babic: Yes. >>: What you're describing now is just -- okay. >> Domagoj Babic: Correct. So now we essentially concatenate, for instance, this prefix with the suffixes, and we fill up the table. At that point, when we check all the state-distinguishing vectors, we see that they all have representatives in the upper part. So at this point the algorithm makes a conjecture that this is the state machine implemented by the black box, and it reads off the states from the upper half and the transitions from the lower half of the table. At this point, we say that the table is closed, and we have essentially learned the conjectured model. Now the problem is: how do we check that this state machine is really what the black box actually implements? There are various approaches to do that. One possibility is to generate so-called sampling queries; if you generate a sufficient number of these sampling queries, then you can guarantee that the inferred state machine is correct with some probability and with some confidence. Another possible approach is bounded black box model checking: you can generate the distinguishing sequences up to a certain depth of the state machine and then check that what you've inferred is really what the black box implements. So it all looks great, but unfortunately it doesn't really work in practice, and there are several reasons for that. First, the state space is really too large in practical applications. If you have, say, a 32-bit packet size that some server receives or sends, then the number of messages that you need to consider is really large, and L star is not going to scale to that. Furthermore, there are some other problems in the context of inferring botnet protocols. These botnets are fairly large, and there is also the problem of their owners.
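Just to make the bookkeeping concrete, here is a minimal Python sketch of the observation-table loop described above. The `query` function, the tuple-based representation of message sequences, and all the names are illustrative assumptions, not the actual implementation used in this work.

```python
# Minimal sketch of the L* observation-table loop described above.
# query(seq) is assumed to send a concrete input sequence to the black box
# and return the corresponding output sequence (both as tuples of symbols).

def row(table, prefix, suffixes):
    """State-distinguishing vector of a prefix: its stored responses."""
    return tuple(table[(prefix, s)] for s in suffixes)

def fill(table, prefixes, suffixes, query):
    """Fill in any missing cells by querying the black box."""
    for p in prefixes:
        for s in suffixes:
            if (p, s) not in table:
                out = query(p + s)                       # concatenate prefix and suffix
                table[(p, s)] = out[len(out) - len(s):]  # keep only the matching suffix

def close_table(table, upper, lower, suffixes, alphabet, query):
    """Move rows from the lower part to the upper part until the table is closed."""
    changed = True
    while changed:
        changed = False
        fill(table, upper + lower, suffixes, query)
        upper_rows = {row(table, p, suffixes) for p in upper}
        for p in lower:
            if row(table, p, suffixes) not in upper_rows:
                upper.append(p)                             # a new state was discovered
                lower.remove(p)
                lower.extend(p + (a,) for a in alphabet)    # its one-symbol extensions
                changed = True
                break
    # Conjecture: states are the distinct upper rows; transitions are read off
    # the lower rows, exactly as described in the talk.
```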
They have a lot of capability to inspect what's going on in the network. So if they figure out that you're playing with their botnet, they can launch a [inaudible] service attack on the source of that weird traffic. So when we experimented with these botnets, we essentially had to use Tor, which is a network anonymizer. It anonymizes all the traffic so that the receiver can't figure out where the traffic is coming from. That is a nice solution, but unfortunately it introduces a lot of delay. On average we had a delay of about 6.8 seconds per message that we sent, and it turns out that you need something like four and a half days to infer a 17-state protocol, which is not really acceptable in practice. There are also some other problems, like dealing with encryption, compression, and non-determinism, but we're not going to talk about those today; there's been some prior work that we built upon for dealing with encryption and compression. >>: [inaudible], I mean, what's the alphabet? >> Domagoj Babic: I'll get to that. I'll get to that. >>: Okay. >> Domagoj Babic: So -- >>: That might be [inaudible]. >> Domagoj Babic: Yes. Yes. Right. So at this point, I'm just saying that if you treat the packets as they are sent as the alphabet, that's not going to scale. And now I'm going to get to the part where I explain how to deal with the alphabet. As I mentioned, it's computationally infeasible to infer these protocols over the packets that are actually sent over the network, simply because the state space is too large. So the approach that we took in this particular work -- later we changed it a little bit -- is that we first observe communication between the client and the server; in this case it was the communication between the bot and those servers that I mentioned earlier. And so we find the set of input messages and the set of output messages. The output messages are all sent from the server back to the bot, to individual bots. Studying those, we came up with two abstraction functions: one is the input message abstraction function and the other is the output message abstraction function. They take these network packets and abstract them into abstract alphabets called sigma I and sigma O. Once you write these abstraction functions manually, the abstraction is of course automatic. It takes some effort; it's a bit tedious to come up with good abstraction functions, because the state machine that you infer essentially depends on how well these abstraction functions work. But it didn't actually take that much time, and we also used some prior work on reverse engineering of message formats to help us with that. In the later work that I'm going to present in the second part of the talk, we actually figured out how to do the input message abstraction automatically, but we still require the output message abstraction function to be provided; I'm going to talk about that later. So now what's happening is that you have the inference engine, which has inferred some state machine so far, and it keeps sending sequences of abstract input messages.
Then we have a script that actually does the concretization. It also has to take care of some other aspects, for instance keeping the session alive and using the right session identifier; there are some details there that I'm going to abstract away. But essentially what it does is concretize the messages from this abstract alphabet into the concrete input alphabet. We send these sequences to the server, collect responses, and then abstract them using the output abstraction function, and that's how we get sequences of abstract output messages. And then we essentially refine the state machine that we have. Even with abstraction, unfortunately, this still doesn't quite work, simply because the computational complexity is too high and we have a pretty high message delay in this setting. The complexity is quadratic in the size of the input alphabet, quadratic in the number of states, and linear in the size of the counterexamples that we construct by sampling queries. So unfortunately abstraction is still insufficient. Then, studying the state machines that we inferred, we also found that there is a lot of redundancy in them. The primary cause of the redundancy seems to be our focus on inferring complete state machines, meaning that we want to know, for every input message and for every state, where the corresponding transition goes. It just happens that many of these messages don't do anything interesting in most of the states; they do something interesting in only one or two states, but not in all states. So we end up with cases like this one, where we have a huge number of self loops, which just increase the cost of learning without really adding anything useful to the inferred state machine. Our idea was to try to use prediction and then rely on sampling to catch mispredictions. Because, if you remember, you need to generate the sampling queries anyway, whether you are doing sampling-based checking or bounded black box model checking, so we might as well use them for checking our predictions as well. I'm going to explain only the first of the two prediction approaches that we use. The first one saves about 73 percent of queries, which makes a pretty big impact on the performance. Then we also have some probabilistic prediction, which saves an additional 13 percent. The basic idea behind response prediction is as follows. In this state machine, for instance, we have two self loops, the red one and the blue one, represented by the red and blue state-distinguishing vectors. Now we can see that the response in the upper part of the table is of course exactly the same as in the lower part of the table. So the question is: can we actually use that insight to predict responses and thereby avoid these two queries? The first insight here is that if you look at the S part of the table, which is the upper part, then the prefixes are essentially strings of messages that are the shortest sequences that get you to each individual state.
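As a small illustration of where these abstraction functions sit, here is a sketch of the concretize/send/abstract round trip around the learner. The `concretize`, `abstract_out`, and `session` names are hypothetical placeholders standing in for the hand-written scripts described in the talk, not their actual interfaces.

```python
# Sketch of the layer between the learner and the network: the learner speaks
# the abstract alphabets, the script handles concrete packets and session state.
# All names here are illustrative placeholders.

def answer_query(abstract_msgs, session):
    abstract_outputs = []
    for m in abstract_msgs:
        packet = concretize(m, session)        # abstract symbol -> concrete bytes,
                                               # filling in session IDs, keep-alives
        reply = session.send(packet)           # sent to the server (through Tor)
        abstract_outputs.append(abstract_out(reply))  # concrete reply -> abstract symbol
    return abstract_outputs
```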
So effectively you can imagine expanding the state machine into a spanning tree, and then these sequences in the upper part of the table are essentially the labels of the paths to each individual node in the tree. Because they're free of self loops, we can try to use these sequences in the upper part of the table to predict responses, and that's actually what we do. We introduce a restriction function rho, which takes a sequence of input messages in the original input alphabet and a set D of messages that appear in this S part of the table; in this particular case D would contain only B and C. It essentially removes all the messages that are not in D. For instance, if you take the string B concatenated with A, then rho of B concatenated with A is just B, simply because A is not in D. Let's take another example: for B concatenated with C, rho of B concatenated with C is B concatenated with C, because both of these symbols are in D, so there is nothing to abstract here. Formally, this is what the rho function does: if the sequence is empty, that's what it returns; otherwise, if the input sequence S is equal to A concatenated with R and A is not in D, then we recurse on R; and otherwise we copy A to the output and then recurse on R again. So it's a fairly simple function. How does this work? Let's assume that we got to this stage in building the table. Now we have the input B concatenated with A concatenated with A, so we have B,A,A. If we restrict that, it turns out that we get B from the prefix, and we see that we already have that sequence in the upper part of the table, so we can use the whole distinguishing vector from that state to predict the response. And that's what we do. Then we repeat the same for the next one; unfortunately there we can't reuse any of the previously generated sequences. We repeat the same thing again, and when we get, for instance, to C concatenated with A, then again the restriction of that is just C. We already have that row in the upper part of the table, so we just copy the response, and that's it. At this point we are done. In this case the predictions happen to be correct, but when we mispredict something, we can use the sampling queries to detect that. Okay. So how well does this work? Well, here I have some results. For MegaD we got a huge saving, from about 4.5 days to about 12 hours, and if you parallelize -- we also have some parallelization -- you can get it down to 2.4 hours. However, for the SMTP protocols we actually didn't get that much of a saving. But there is a good explanation for that: when we were writing these abstraction functions for SMTP, we already knew what the protocol looks like, so it was very easy for us to come up with the right abstraction, and there was very little redundancy in that abstraction. However, when you are working with, say, a proprietary or classified protocol, then you don't really know how the protocol works or what is important, so you tend to err on the side of caution and usually come up with abstractions that are too precise. In that setting, this prediction actually saves a lot. And this is just the high-level architecture of our system.
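Since the rho function is defined recursively in the talk, here is a direct transcription of that definition as a small Python sketch; the worked example from the slide is included as a usage note.

```python
def rho(seq, D):
    """Restriction function from the talk: drop every message not in D,
    where D is the set of messages appearing in the S (upper) part of the table."""
    if not seq:
        return []                       # the empty sequence maps to itself
    head, rest = seq[0], seq[1:]
    if head not in D:
        return rho(rest, D)             # skip symbols outside D
    return [head] + rho(rest, D)        # keep symbols in D

# Example from the slide: with D = {"B", "C"},
# rho(["B", "A", "A"], {"B", "C"}) == ["B"], so the stored response of the
# upper-table row for prefix B can be reused to predict this query's response.
```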
So we use L star and send queries, and we use response prediction to try to avoid sending them to the network. We use a whole bunch of bot emulators in parallel, I think about eight of them; these are just scripts that we wrote to pretend to be a bot. In that sense, this experiment was safe; we were careful not to spread the infection. Also, there is a limit on how much you can parallelize, because Tor becomes the bottleneck; after adding more than eight emulators in parallel, we essentially started getting diminishing returns. We send the queries through Tor and we get responses, and that's how the whole thing works. Here we have an example of a protocol state machine for SMTP, inferred for the Postfix implementation. These red edges are the edges that the prior work was not able to infer; this is the incremental improvement of our work upon what was done in the past. Most of the prior work actually focused on incomplete state machine inference, and that's essentially the reason why they miss these edges. Another thing that you can do, once you have a protocol state machine, is of course model check it. This is the state machine for the MegaD protocol, and one of the properties, for instance, that we checked was whether we can steal templates from the template server. The way it's supposed to work is that each bot is supposed to authenticate with the master server, get an authentication ID, and then send it to the template server in order to get the template. The red part shows the standard operation of this protocol, the part that each bot is supposed to follow. But we actually found plenty of ways to just bypass authentication, send essentially random messages to the template server, and that way steal the templates. The reason why that is useful is that you can essentially get unlimited access to fresh templates and therefore update spam filters before the first spam hits the net. Another thing that you can do is use these differences for fingerprinting. For instance, MegaD has its own SMTP protocol implementation, which is slightly different from the Postfix SMTP, so you can use these differences to detect, on your corporate network, that you have an infection going on. We also found differences between different implementations of SMTP, so you can also distinguish different types of implementations, like Postfix from something else. >>: So you're saying that often the malware is going to be operating on known ports, and so you're effectively just probing the known ports with these messages to figure out -- >> Domagoj Babic: Yes. So in this case, we knew the ports that it operated on. What I'm saying is that if you do this kind of inference and then you infer a state machine, then the state machine tells you about the differences between normal traffic and, in some cases, of course -- >>: So the idea is that you're going to have state machines [inaudible] correctly behaving protocols on certain ports, and you're going to do this periodic learning in the environment and then compare -- >> Domagoj Babic: Yeah, possibly. Or you can even have stateful firewalls that are just going to follow each session through the state machine.
And that way it will tell you that something is going on. Okay. So with that, I will move to the second part of the talk, which is essentially combining what I've just presented with DART. The main idea is to use the protocol model that we can infer to actually guide the search. What it actually does is combine DART with learning. My insight is that in many ways DART is very similar to what decision procedures do. The one big difference is that decision procedures also get a lot of leverage from learning various lemmas that prevent them from making the same mistakes in the future. So I thought that perhaps combining this learning approach in some way with DART might give us some benefit, for instance for reducing the size of the search space, pruning the search space, or just providing more guidance. The basic idea here is to use the approach that I presented in the first part to infer a state machine of, say, some implementation of a protocol, say a server, and then use the state machine to first initialize the search to a certain state -- that way you get more control over the search -- and then do local exploration using just standard DART. Another benefit is that the state machine specifies the sequence of messages that you need to get to a particular state, which is something that can be fairly difficult to construct with standard approaches, even with decision procedures, simply because we don't have enough information to do that. So this is the MACE approach at a very high level of abstraction. We start running some number of state explorers on the server or network application that we are interested in, and so we generate a whole bunch of input and output messages. As I said before, inferring a state machine over all of these messages would be computationally infeasible, so we need some kind of abstraction to reduce or abstract these messages. Here we have a filter function, which I'm going to explain later, that does exactly that: it figures out which input messages to keep and which to discard. It effectively decides which messages the state machine is going to be learned over. Then we go to L star, which uses this monotonically increasing set of input messages to learn more and more refined state machines. And then we use just the standard approach that I presented in the first part. Once we get the state machine, we generate, for every state, the shortest sequence that tells you how to get to that state, we initialize the state explorers with those sequences, and then we repeat the process. Eventually this thing terminates, because we limit the amount of time spent per state in the state exploration phase: either you don't discover any new messages or you infer a complete state machine, and so the thing terminates. In practice we also do something else: in the very first iteration we start with some set of seed messages to infer the very first state machine. The reason for that is just to speed up convergence; strictly speaking, it's not necessary. >>: Sorry. So let me understand. The classic way DART is used is you have some symbolic input, like a file or something, and then you try to get different inputs to increase coverage. But you really want to use that as a subroutine for testing network protocols.
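To summarize the loop just described, here is a rough Python sketch of one way the MACE iteration could be organized. The `lstar`, `shortest_path`, `explore`, and `filter_new_messages` functions are placeholders for the components named in the talk; the details are assumptions rather than the actual implementation.

```python
# Rough sketch of the MACE loop: alternate model inference with concolic
# exploration. lstar, shortest_path, explore, and filter_new_messages are
# placeholders for the components described in the talk.

def mace(server, seed_messages, budget_per_state):
    alphabet = set(seed_messages)          # seed messages bootstrap iteration one
    while True:
        model = lstar(server, alphabet)    # infer a machine over the current alphabet
        new_messages = set()
        for state in model.states:
            prefix = shortest_path(model, state)          # how to reach this state
            for in_seq, out_seq in explore(server, prefix, budget_per_state):
                new_messages |= filter_new_messages(model, in_seq, out_seq)
        if not new_messages:               # nothing new discovered: we are done
            return model
        alphabet |= new_messages           # refine the abstract input alphabet
```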
So you have, in addition to the problem of finding the inputs, you also want to find the state machine to allow you to drive the program to interesting states. >> Domagoj Babic: So the current version of MACE, what I'm presenting, is really targeted towards networked applications. But I think the same idea essentially applies to, say, parsers, because you could imagine inferring a context free language and then using that to guide further exploration in a similar way. >>: Right. Then you need some way to observe the response. >> Domagoj Babic: Yes. >>: I mean, you need this notion of an observable output that could be used to distinguish internal states? >> Domagoj Babic: Yeah. Yeah. But you can also essentially use the white box model, and you can analyze the application at the same time. >>: Right. >> Domagoj Babic: So the first [inaudible] that I presented, on inference of botnet protocols and protocols in general, really assumed a completely black box model. Simply because we didn't even have access to the code of these servers, we had no choice; we had to treat those servers completely as black boxes. But in MACE we actually do code analysis to infer these messages, so it's already a combination of a black box and white box approach. The way it works is in many ways similar to what I presented before. The difference here is that we actually use DART to generate messages rather than just observe random traffic. Another difference is that now we infer the input abstraction function -- we do this abstraction automatically, as I'll describe on the next slide -- but we still require the output abstraction function to be provided manually, simply because it determines the coarseness of the state machine that you infer, and it seems fairly difficult to automatically find the right trade-off between the precision of the state machine and the computational cost of inferring a very precise state machine. This is the filtering function. It takes the current version of the inferred automaton, a sequence of input messages, and a sequence of output messages, and it produces a set of new input messages that are going to be used to refine the current abstract input alphabet. What it does is actually fairly simple: it looks at whether there exists a path in the current version of the state machine that produces the same output sequence as the one you pass to the function. In other words, if there is a way to produce that sequence of output messages with the current state machine, then you don't add anything to the input alphabet. On the other hand, if you can't produce the same sequence of output messages, then you know that there is at least one new message, and then we actually add all the messages in the input sequence to the abstract input alphabet, and the next iteration of learning is done over this refined alphabet. We evaluated this on a number of benchmarks. We inferred the protocol on the Vino implementation of the RFB protocol and on the Samba implementation of the SMB protocol, and once we inferred these state machines, we also used them to test RealVNC and Win XP SMB without re-inferring the protocol.
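Here is a small Python sketch of that filtering check, under the assumption that the model exposes an initial state and output-labeled edges; the `out_edges` accessor is a hypothetical name, not the actual interface.

```python
# Sketch of the MACE filter function: keep the observed input messages only if
# the current model cannot reproduce the observed output sequence.

def filter_new_messages(model, in_seq, out_seq):
    if can_produce(model, out_seq):      # some path already yields this output
        return set()
    return set(in_seq)                   # at least one message is new: keep them all

def can_produce(model, out_seq):
    """Breadth-first check for a path labeled with the given output sequence."""
    frontier = {model.initial}
    for out in out_seq:
        frontier = {dst
                    for src in frontier
                    for o, dst in model.out_edges(src)   # (output symbol, successor)
                    if o == out}
        if not frontier:
            return False
    return True
```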
And that of course relies on the assumption that the protocols are fairly similar: once you infer the state machine, you can use it for testing various implementations, as long as the protocols are fairly similar. For Vino we used a 45-second remote desktop session to generate the set of seed messages, and for Samba we used gentest for -- I don't know how many seconds we ran it -- just to bootstrap the process. We ran this on the DETER security testbed, allocating about 2.5 hours of state exploration for each discovered state, and we did this only when a new state was inferred; we wouldn't repeat this for previously discovered states, for obvious reasons. Also, for the coverage measurement experiments we made sure that the baseline, which was a state-of-the-art DART engine, got exactly the same amount of time as the MACE approach, and that also includes the time that MACE required for learning the protocol. For Vino we inferred the protocol in two iterations, after about 150 minutes, and for Samba we inferred it in about three iterations, and it took over 4,000 minutes. This is the SMB protocol state machine that we inferred. It's about 84 states. I think it's clear that it would be too much to ask a programmer to specify this. Perhaps not, I don't know, but it looks fairly difficult to infer and to specify. This table shows the vulnerabilities that we found. We found seven vulnerabilities altogether, four of which were new, and we got some CVE numbers for them. For instance, for Vino the first one was found after about one hour of exploration in total, the second one after four hours, the third one after 15 hours. The baseline, unfortunately, didn't have the capability to detect the very first vulnerability, because it was an infinite loop -- it's a kind of denial-of-service attack -- and the baseline DART implementation doesn't have a detector for infinite loops, so it wasn't able to discover this one in particular. But it had the capability to discover the remaining two, which were wild out-of-bound reads. Even though it had the capability, it actually failed to discover them even after 105 hours of exploration. For Samba, we found three vulnerabilities. We actually hadn't known about any of these when we found them, but later we found that they were already known; the baseline approach managed to discover only one of those, after about 602 hours, while MACE took only about 12 hours. For RealVNC we found another one, and we found none for Win XP SMB. This is an interesting graph. We wanted to see how deeply the baseline approach gets into the search space compared to MACE. What we did is we expanded the state machine into a tree and then measured the percentage of states that each approach reaches at a certain depth. For MACE, not surprisingly, it can reach any state, simply because it knows the state machine, so it's very easy to construct the sequence that's going to get you to that state. But what's interesting is that the baseline approach's coverage falls very rapidly: when you get to a depth of five, it reaches only about 40 percent of the states, and when you get to a depth of eight, it essentially falls down to zero.
And I believe that this is the reason why we also got much better coverage with MACE; we got coverage improvements ranging from about six percent to 58 percent, depending on the benchmark. My impression is that the reason why MACE works so well is that by learning a state machine you effectively use relatively cheap reasoning to infer a high-level abstraction of the program -- actually, of the protocol that it implements -- and then you can use that to guide the search. It's also very easy to construct sequences that are going to get you to a certain state, which is something that's relatively difficult for DART to do on its own, simply because it doesn't have enough information. Another side effect is that you also get more control over your search, because you can diversify your search more easily and it's less likely you are going to get stuck in loops. So with that, I'll end the second part of the talk, and I have just about five minutes to zip over the malware detection part. >>: [inaudible]. >> Domagoj Babic: Okay. So, you know, what is malware? I'm going to skip that. This slide essentially shows the effectiveness of modern anti-virus tools; this is a study done by the Cisco security team. It shows that on the very day when new malware is released, only about 20 percent of malware is detected by contemporary anti-virus tools. Then, as they keep cranking out signatures and updating the signature database, they get to about 60 percent or something like that after about seven days. But only one of these samples suffices to really create problems, so this is far from satisfying. The reason why this analysis is so difficult: well, in general, sources are not available, and binaries are quite often obfuscated or even encrypted, so it's very frequent that you can't even disassemble the code; you can't distinguish what's code from what's data. Also, many of these samples automatically detect that you're in debug mode or that you're running some anti-virus, and then it's difficult to analyze them. Just to summarize this malware crash course: as I mentioned earlier, we are getting around 60,000 samples per day, so the daily volume is just too large for manual analysis. And unfortunately that's what's being done today; many of these samples are analyzed manually, and it takes about 15 to 20, maybe 30, minutes for an experienced analyst to go through a sample. Also, the cumulative volume is too large for expensive analysis, because you're getting about 20 million of these samples per year, and there is a huge backlog of malware that you have to go through. Also, signatures are unfortunately too easy to defeat, as the previous slide showed, and static analysis is frequently very difficult or impossible. To address these last two issues, the security community has started researching behavioral detection approaches that focus on what software does rather than how it does it. One popular abstraction of behavior is a sequence of system calls, because in order to change the state of the system, the application has to execute a system call, for instance to create a file or change a registry key. So what you can do is generate sequences of these system calls and then use that to recognize what's potentially malicious and what's not. What we did was slightly more complicated.
We actually used taint analysis to construct data flow dependency graphs of system calls. For instance, if one system call generates some result and later that result, possibly changed in the application, [inaudible] another system call, then we would say that there is a data flow dependency edge there. From that relation we can construct a graph like the one you see here; this is from a real-world example, a trojan called Banker. Now, once you construct these graphs, you can imagine expanding them into trees. That's not what we really do, because that would incur an exponential cost, but you can just imagine it that way; it simplifies the presentation. So you can imagine expanding these into trees, and then we can eliminate the graphs that are common between malware and goodware, and we end up with a reduced set. Now we can use this reduced set to infer a state machine that's going to distinguish the two. What I'm actually doing in the paper is slightly simpler than that, because I just used the positive examples, but one could use negative examples as well to get higher precision. I don't think I'll have time to really go into tree automata, so I'm going to skip the next two slides. The main idea is that you have these trees and you construct a window of a set size. Then, by sliding that window over the entire tree and creating a state for every unique subtree that you see, you can essentially construct the tree automaton; that's roughly how it works. The accepting states are those states that accept the whole trees that you've seen. The K factor, which is the size of that window, is the inductive bias, and that helps us do inference from positive examples. That K factor is very important, because the smaller the K, the more abstract the state machines that you infer are going to be. So you can actually vary the trade-off between true positives and false positives by changing this K, and that's something that's very useful in practice. That is actually due to a theorem by García: the languages determined by a window of size K plus one are contained in the languages determined by a window of size K. The algorithm that I came up with has almost linear complexity -- it's O(K times N), where N is the size of the graphs -- so it's very efficient in practice, and it can certainly scale to the very large backlog of malware that we have. The overall algorithm works like this. We collect the graphs, we learn an automaton, and then we partition the test set according to the heights of the trees. Then we run all those graphs against the tree automaton, and for each height we compute a score: the ratio of the number of accepted trees with that height to the total number of trees in that partition, which we then multiply by the height of the tree, because the idea is that the larger the tree that is accepted, the more weight it should get. And that is the score that we compute; the higher the score, the more likely it is that the sample is malicious. We did experiments on a pretty large set of malware, grouped into something like 48 families, and we also used some goodware samples to compute the false positive rate. So I'm going to skip this.
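Since the tree-automata slides are skipped, here is a rough Python sketch of the sliding-window idea just described, with trees represented as nested (label, children) tuples. It is only an illustration of k-testable inference under those representational assumptions, not the algorithm from the paper.

```python
def fork(tree, k):
    """Truncate a tree (label, children) to depth k; this depth-k window is the state."""
    label, children = tree
    if k <= 1 or not children:
        return (label, ())
    return (label, tuple(fork(c, k - 1) for c in children))

def infer_tree_automaton(trees, k):
    """Slide a depth-k window over every node and record one transition per fork."""
    transitions, accepting = {}, set()

    def visit(node):
        label, children = node
        child_states = tuple(visit(c) for c in children)   # bottom-up traversal
        state = fork(node, k)                               # this node's window
        transitions[(label, child_states)] = state          # deterministic rule
        return state

    for t in trees:
        accepting.add(visit(t))        # the root's state accepts the whole tree
    return transitions, accepting
```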
And so these are the results. The curves rising from left to right are the malware detection curves. Every point on such a curve -- we have a different curve for each K here -- shows the percentage of malware that had a score less than what you have on the X axis. So for instance in this case, 40 percent of malware had a score less than -- no, wait -- about 10 percent of malware had a score less than 0.4. The curves that fall from left to right are the goodware detection curves; they're exactly the opposite. Every point on such a curve shows the percentage of goodware that had a score larger than what is indicated on the X axis. So you can find some sweet spot. I think that one of the sweet spots was, for instance, K equals 4 and a score of 0.6; at that point you would get something like an 80 percent malware recognition rate with about a five percent false positive rate, and you can adjust this as needed. Another interesting result is that you can also use these inferred tree automata to try to classify samples into families. If you learn one automaton per family -- you split these families into a train and a test set, learn one automaton for every family, and then run the samples from all the test sets on the inferred automata -- then you can compute scores, and that's exactly what I did here. You can see that there is a fairly pronounced diagonal, which indicates that this approach can also be used, at least partially, for classification. It would be nice if the classification capability were more pronounced, but that's what we have at the moment. So, to summarize, I talked a little bit about grammatical inference, and I presented three possible applications: protocol inference, using that inference for guiding state-space exploration, and structural feature recognition for malware detection. I think that many other applications are possible, namely synthesis, inference of queue invariants, and many others. I'd also like to make a prediction. My gut feeling is that in the coming years GI might become as important as SMT, simply because it allows us to do a kind of reasoning that other techniques are not that strong at, and it's a very complementary approach that I think can complement SMT and the other approaches that we've been using. For future work, one of the things that I'm looking into is symbolic MACE: how can we do this analysis with, say, symbolic automata and try to make it more generic, and that way perhaps we can even improve performance by going symbolic rather than having to deal with concrete messages and potentially large alphabets. Another possible direction is to try to do the same thing for context free grammars, because getting past parsers is a fairly big problem in practice; if you can infer a context free grammar, then you know the structure of the state space, and it could help you get past the parser. Also, something that I've been looking into is regular model checking of distributed systems, and there we can use grammatical inference to infer the queue invariants and also do some other things. I'd like to wrap up by acknowledging the people that have been working with me on a number of these projects, and also the funding agencies. Thanks. [applause].
>> Tom Ball: Now, questions. >>: [inaudible] So I thought that using machine learning for classification of malware was pretty standard -- not necessarily using tree automata -- but can you compare, I mean, is it true [inaudible], what's the key strength of using tree automata and this type of abstraction versus [inaudible] being used for malware classification? >> Domagoj Babic: That's a great question. There are a few papers on using machine learning -- I think feature identification and leap analysis or something like that, if I remember correctly -- for malware detection, and they got relatively similar results. But their implementation, as far as I know, is not publicly available, so we were not able to do a direct comparison, although we used a very similar, essentially the same, set of malware samples for benchmarking. Other than that, most of the mainstream anti-virus tools, as far as I know, actually use a signature-based approach. They also detect some behaviors, but they are less systematic about it. But I think, yeah, potentially statistical approaches might be useful for detecting these behaviors as well. Actually, I have one collaboration with some people from Rice where we have started looking into this a little bit; I don't know what's going to come out of that. >>: So learning the most general automaton for some [inaudible] inputs is [inaudible]? >> Domagoj Babic: Well, it depends on a lot of factors. It depends on the languages that you want to learn, it depends on whether you have only positive examples or negative examples as well, it depends on whether it's passive or active, and it depends on whether you want to learn a minimal automaton or not. So if you can specify your question a little bit more, perhaps I can give you a more precise answer. >>: [inaudible] about, you know, inference of models for programs and so on. That seems like the general idea; at some point in the 2000s it was, let's just learn A, B pairs because learning automata is hard. So there I guess the question is [inaudible] the most general minimal automaton for a given set of traces. >> Domagoj Babic: So you're essentially talking about passive inference, from an observed set of traces. If you want to learn a finite state machine that's minimal, I believe that's NP complete, but you need to have negative samples as well. So I'm assuming you have both positive and negative samples. Do you have negative samples as well? >>: That's just a set of sequences. I mean, one option is the general automaton that captures every sequence. >> Domagoj Babic: Well, the most general automaton is just a single-state automaton, so that's still a little bit underspecified. But there are algorithms, like RPNI, that learn from positive and negative sets and are polynomial. If you want to learn only from positive sets, then approaches based on K-testable automata might be a good way to go. I don't know; I would need to learn more details about the problem that you have. >>: [inaudible] using an abstraction to guide the state, so I [inaudible] in which he was doing these abstractions [inaudible] testing for those abstractions. So are you saying that MACE is a different way of constructing those abstractions with which you can just do better testing of the things? Should I look at MACE as -- you also have an abstraction aligning with your [inaudible].
>> Domagoj Babic: Yeah. So the way I see it, it really is a particular type of learning combined with DART. You essentially do the state space exploration the same way as DART does; you're just trying to use the information, whatever you discover during that exploration, to learn something about the state space and then use that knowledge to drive further search later down the line. And this is just one way of doing it; there's a lot of information that's currently being discarded by DART, and I guess you could come up with many, many different approaches to learn from it. This is just one simple point in the design space -- so [inaudible] had a question? >>: So you could look at it as learning summaries of some sort. I mean, generally you could learn summaries of procedures, but you could also learn a summary of the whole program. So your state machine is a summary of the input output behavior of the program. And the summaries you learned at that level you could also learn, if you were able to observe them, for the internal input output relations of functions as well. >> Domagoj Babic: Right. Right. >>: And all that you've learned here [inaudible] summary is you learn behavior, behavioral components, compositions that [inaudible] program analysis, where here you're learning basically a monolithic state machine, which is [inaudible] state machine, where you do not learn for each procedure an individual [inaudible]. It would be interesting to see how -- I mean, to learn the [inaudible] abstractions from the abstraction of the system test. Otherwise you're going to have to pay the price [inaudible]. >>: But I guess in the case of the network protocol, you have a small set of output states. Is that what helps you? I mean -- >> Domagoj Babic: Output states? >>: Well, how do you observe -- like in your MACE, what do you observe? [inaudible] again, do you have some abstraction for the input? >>: [inaudible]. Your output alphabet is fixed in advance. >> Domagoj Babic: No, only the abstraction function is fixed. >>: Oh, so the output is also part of the [inaudible]. >>: Right. And that determines what state machine you learn. >> Domagoj Babic: Yes, correct. That's why we didn't actually succeed at doing this output abstraction automatically as well: exactly because the state machine that we tend to infer is so sensitive to this output abstraction. You get a very wide range of state machines even after doing very small tweaks to the abstraction function. So, yeah, but -- >>: But it seems like even a return code, like an error code, could just be useful for that. Right? You send it back, you get one of, I don't know, 10 error codes maybe in a packet -- not recognized, illegal state for this message -- or hopefully there would be a small set. I don't know. >> Domagoj Babic: A small set of -- >>: Of error codes. You could use the error codes as -- I mean, presumably there would be some -- >>: [inaudible] What was interesting for Samba, for instance, was that there are so many error codes that if you want to handle all of them, if you want to represent all of them in your alphabet, then the inference blows up. So we actually abstracted many of these error responses into one equivalence class; we just care whether there was an error or not.
But, yeah, if we could come up with some more modular or automatic way to abstract these, or even infer symbolic state machines, that would take care of a lot of these issues, which are essentially all due to the size of the alphabet. >> Tom Ball: Okay. Thanks again. [applause]