>> Tom Ball: Hello, everybody. Welcome. Good Monday morning to everybody. My name is Tom Ball. And it's my great pleasure to welcome Domagoj Babic back to Microsoft Research. He was an intern with us previously and worked with Madan Musuvathi and myself for a little bit. He was an MSR graduate fellowship recipient. And he got his PhD with Alan Hu at the University of British Columbia. Since then he's been in industry for a little bit, and more recently a research scientist at UC Berkeley. And he's been working on security and program analysis. And we're going to hear more about that today.
>> Domagoj Babic: Thanks, Tom. So thanks for that great introduction. Good
morning, everyone. It has been a while since I visited MSR. I think the last time I
was here was like three or four years ago. Yeah, perhaps.
So today I'm going to talk a little bit about grammatical inference and its applications to security and program analysis. This is the work that I've been doing mostly during the last year, year and a half, with a number of collaborators at Berkeley.
So you can see grammatical inference at the high level as a set of machine
learning techniques that try to infer a formal language, either context free or
regular, something like that. And, you know, I'm going to talk about this more.
This is just so that you know what I'm talking about at this point.
To motivate the work I'm going to start with some numbers first. So over the last decade we've been witnessing a deluge of malware. According to some statistics, anti-virus vendors are receiving over 60,000 samples per day nowadays. And if you sum it up, it's around 20 million a year.
And one of the main attack vectors that malware exploits are software flaws, not surprisingly. For instance, only last year, in 2010, we saw over 4,000 medium and high severity vulnerabilities in the Common Vulnerabilities and Exposures (CVE) database and several hundred low severity ones. And this is probably just a drop in the sea of vulnerabilities that haven't been disclosed or were just patched without filing a CVE.
So I think that this state of the art is not surprising considering how complicated software systems are. And as Ken said in his talk, software systems are probably the most complicated systems that people develop nowadays.
And so the process of designing software systems is kind of like a two component process, where in one part we are adding new features, and in the other part we are finding and removing bugs and fixing security issues. And the second part tends to be usually around 50 percent of the effort, sometimes much more. I looked at some statistics, for instance, on the relative sizes of the Windows test and development teams, and it seems that the test team is somewhat larger. So this seems to be a correct statistic.
So effectively the complexity of the systems that we can develop at least in part
is limited by what we can verify to be correct and secure. And so just recently,
for instance, I talked to some people from Lockheed Martin and they said that
they're really limited in the systems that they are developing by what they can
test and verify. So that really seems to be a limiting factor.
And so what happens if you actually allow the systems to be kind of partially faulty or partially insecure? Well, the cost of failures can be really tremendous, especially security-related failures. According to some statistics, Code Red, which exploited an unchecked buffer in the MS IIS server, incurred over two billion dollars in damages. So now if you compare that, for instance, with the cost of bugs in the hardware industry, it's roughly the same order of magnitude.
And traditionally we believe that verification is especially successful in the hardware industry exactly because these bugs can be so expensive. But it seems that exactly the same thing is happening in the software industry.
And so, well, a lot of work has been done on improving the state of the art in verification, automated testing, and various approaches to improve security. And automation has been kind of like the core aspect of many of those approaches. And I think that the community has definitely made huge progress, especially in automating reasoning; for instance, SMT has had a huge impact on verification and automated testing.
But one aspect that is still kind of problematic is inductive reasoning. It seems that when it comes to inductive reasoning we still have a long way to go. And thinking about this I started looking into a set of techniques called grammatical inference, which is, as I said before, a class of machine learning techniques that learn a formal language.
And there are several basic flavors of grammatical inference. For instance, you can learn from observed behavior and then generalize from that behavior and construct, say, a state machine or a [inaudible]. Or you can learn interactively by probing a black box, or say a gray box, and listening to the responses, and in that way learning your grammar.
So this class of approaches was very intensively studied in the '90s, and then somehow a large part of the machine learning community figured out that for many of their applications they can actually also use less expensive statistical approaches, so that became kind of the focus of most of the research. But still there is a small part of the community that continued research in the grammatical inference direction.
So some of the strengths of GI, grammatical inference, are that it's very effective at inferring structure. So often when you need to understand the structure of your problem, then GI is the way to go. And another strength is that it's inherently inductive reasoning. But the problem is that it can often overgeneralize. So if you don't have negative counterexamples that you are going to use for refining, there's a real danger of overgeneralizing. For instance, if you have only positive examples, you can always trivially infer a single-state state machine that is going to accept everything, but that's not what you actually want.
And so GI has many applications. And I'm going to talk today about several applications that I've studied and also a little bit about some possible applications that I'm planning to study in the future.
So the first and probably the most obvious application is reverse engineering.
For instance, if you have, you know, a proprietary or classified protocol and you
want to figure out what that protocol does, then GI is kind of a good way to go
about that. And I'm going to talk in the first part of the talk about inference of
unknown proprietary protocols, and I'll describe how one could do that.
So another possible application is for program abstraction. And that's something that I'm going to talk about in the second part of the talk. So you can use GI to infer kind of stateful abstractions of your program, either, you know, abstractions of interfaces or abstractions of certain modules or abstractions of the whole program, and then use those abstractions in various ways.
And also another possibility is to infer abstractions of behavior. For instance for
malware you're really interested in what malware does. And now the idea is to
kind of infer models, stateful models of that behavior. And so that's something
that I'm going to be discussing in the third part of the talk.
Another possible application, something actually that I'm working on these days, is inference of invariants. So how can you use GI to infer interface invariants, or, for instance, queue invariants? If you're analyzing, say, distributed systems with unbounded queues, then one of the critical parts of the analysis is figuring out what the queue invariants are. Because once you know the queue invariants, especially if the system has certain properties, then the analysis becomes much simpler and tractable.
Other possible applications are in synthesis. For instance, there is a seminal paper by Biermann, I think in '79, who showed that grammatical inference can be used to infer programs essentially from traces. He focused his research on the inference of incomplete state machines, and he proposed an algorithm for doing this inference.
So this figure just illustrates how grammatical inference works at the very high level. As I said, one possible flavor of GI is that you observe some traffic between two effective black boxes, you know, they exchange sequences of messages, and then you observe the traffic and learn the grammar of that protocol. And another possibility is a proactive approach where you keep sending sequences of messages and listening to the responses and then learn the state machine that way.
So this is the outline of the talk. In the first part I'm going to be talking about protocol inference; this was more or less a warmup project for the student that I'm working with, and I'm going to use it to just introduce the basic concepts and to kind of explain how this inference works in the protocol setting.
And then in the second part, I'm going to show how this protocol inference can be combined with program analysis to guide state space exploration. I named that approach MACE, or Model-Inference-Assisted Concolic Execution, and I just presented it last week at USENIX Security.
And then in the third part I'm going to present something slightly different. So this
is more focused on inferring structure and inferring kind of abstractions of
behavior of malware. And then I'm going to show how you can use GI there to
kind of detect malware and I'm going to present some results on that.
So to start with protocol inference: in particular we focused on botnet protocols. Those protocols are essentially proprietary protocols, and the way these botnets work is that they infect a potentially large number of machines. Some of these botnets infect hundreds of thousands of machines. And they're fairly complicated distributed systems, so you have individual bots on clients, and they are the clients in this whole distributed system, and then you have servers that kind of support this whole network, and they serve various purposes.
So for instance in the MegaD botnet command and control protocol, you have clients on one side, and then you have three types -- actually more types of servers, but three types are really important. So you have the master server, which is used to send commands to individual bots. You can send them commands to, kind of like, send spam. You can send them commands to start eavesdropping on the infected machine, or, you know, all kinds of commands; you can essentially control the individual bots.
And then they also have -- so which one was which one? Oh, this one was the template server. So they have template servers that kind of serve templates. And the way these [inaudible] bots work is that they keep getting fresh templates on a daily basis. And then they kind of individualize these templates so that they increase the chances of people clicking on them, and then they start spamming around.
And if you're controlling say hundreds of thousands of machines, you can really
send like really huge amounts of spam through that botnet, and people are
actually paying a lot for that service.
And then there is also a third type of server, just called the SMTP server, that's kind of like a test server. So each bot sends a message, an e-mail, to that server, and then if it gets a response, that means that it can successfully send spam. So it's more or less a testing server. But it plays a very important role in the whole protocol, as I will show later.
And then there are some other servers as well. For instance there is an update server. You know, these botnets have an automatic distributed update infrastructure, so they're fairly complicated.
So now, the problem that we want to solve: we want to figure out how this protocol works so that we can figure out potentially how to defeat it, or how to detect such infections on, say, a corporate or academic network.
So how would we go about this? Well, we can apply the classic Angluin's L star algorithm. And I can't really go into all the details, but I want to give you just an impression of how it works at the high level. So here we have a state machine that we want to infer. So it's an [inaudible] state machine. And let's assume we know both the input alphabet, sigma I, and the output alphabet of that state machine. The way L star works is that it constructs a so-called observation table and then fills up all the rows and columns. And when it's done, it just reads off the states and transitions from the table.
So what we have here is that this part of the table contains all the outputs, meaning responses from the black box that we are probing. The first row contains the suffixes and the first column contains the prefixes of inputs. And the way you generate inputs is that you combine the prefix with the suffix, and that's how you get a sequence of messages, the query. And then you send that query to the black box whose state machine you're trying to learn, you listen to the response, and then you store the response in the observation table.
So for example, here we start with epsilon and then concatenate it with A. And so we get the string A as the input. And then the response here is Y. And so we store that in the table. We always store only the suffix of the response whose length is equal to the length of the string in the first row. So in this case, we store only a single symbol.
So we repeat the same thing with the other edges, with B and C. And at this point, we've learned what we call a state distinguishing vector. So this is kind of like a signature of a state. And then we extend the symbol in the first row, epsilon, with all the symbols in the input alphabet. And so now we concatenate this symbol, the prefix, with the suffix, so we get A,A as input. We execute that on the state machine, we get the response Y,Y, but then we store only the suffix Y in the table.
So you repeat the same thing with A,B and A,C. And now we repeat the same thing for the other sequences. And now you can see that the first state distinguishing vector is already present in the upper half of the table, which essentially contains unique state distinguishing vectors for every state. So we don't need to move it up there.
But the second vector doesn't have a representative in the upper part of the table, so effectively it represents a new state. So we move it up there. And then the same for the third one.
At this point we have effectively identified all three states. But we still don't know all the transitions. So we take the sequences that represent states and append all the symbols from the input alphabet, getting sequences like B,A, B,B, and so on.
>>: You got all the three states, but actually the algorithm doesn't know that.
>> Domagoj Babic: Yes.
>>: Right. Okay. Go ahead.
>> Domagoj Babic: Yeah.
>>: So we know that, but the algorithm is proceeding just with the same --
>> Domagoj Babic: Yes.
>>: What you're describing now is just -- okay.
>> Domagoj Babic: Correct. So now we essentially concatenate, for instance, this part, the prefix, with the suffixes, and we fill up the table. And at the point when we check all the state distinguishing vectors, we see that they all have representatives in the upper part. And so at this point, the algorithm makes a conjecture that this is the state machine implemented by the black box. And so it reads off the states from the upper half and the transitions from the lower half of the table.
And at this point, we say that the table is closed, and we've essentially learned a conjecture rather than the full model.
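To make that concrete, here is a rough Python sketch of the closedness check and conjecture step just described. The table representation, the helper names, and the membership oracle query are illustrative assumptions, not the actual implementation, and the Mealy outputs are simplified away.

    # Sketch of an L*-style observation table (simplified).
    # S: prefixes identifying states; E: suffixes (columns);
    # query(seq) returns the black box's response to the input sequence.

    def row(prefix, E, query):
        """State distinguishing vector: responses to prefix+e for every suffix e."""
        return tuple(query(prefix + e) for e in E)

    def close_table(S, E, sigma_in, query):
        """Move lower-table rows with unseen distinguishing vectors up into S."""
        changed = True
        while changed:
            changed = False
            upper = {row(s, E, query) for s in S}
            for s in list(S):
                for a in sigma_in:
                    r = row(s + (a,), E, query)
                    if r not in upper:          # new state discovered
                        S.append(s + (a,))
                        upper.add(r)
                        changed = True
        return S

    def conjecture(S, E, sigma_in, query):
        """Read states off the upper table and transitions off the lower table."""
        state_of = {row(s, E, query): i for i, s in enumerate(S)}
        trans = {(i, a): state_of[row(s + (a,), E, query)]
                 for i, s in enumerate(S) for a in sigma_in}
        return state_of, trans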
So now the problem is: how do we check that this state machine is really what the black box actually implements? And there are various approaches to do that. One possibility is to generate so-called sampling queries. And if you generate a sufficient number of these sampling queries, then you can guarantee that the inferred state machine is correct with some probability and with some confidence.
Another possible approach is to do something that's called black box model checking -- bounded black box model checking -- so you can generate distinguishing sequences up to a certain depth of the state machine and then check that what you've inferred is really what the black box implements.
So it all looks great, but unfortunately it doesn't really work in practice. And there are several reasons for that. First, the state space is really too large in practical applications. If you have, say, a 32 bit message -- a 32 bit packet, you know, that some server receives or sends -- then the number of messages that you need to consider is really large. And L star is not going to scale to that.
Furthermore, there are some other problems in the context of inferring botnet protocols. These botnets are fairly large, and also their owners have a lot of capability to inspect what's going on in the network. So, you know, if they figure out that you're kind of playing with their botnet, they can launch a [inaudible] service attack on the source of that kind of weird traffic.
So when we experimented with these botnets we essentially had to use Tor, which is a network anonymizer. It anonymizes all the traffic so that the receiver can't figure out where the traffic is coming from. So that is kind of a nice solution, but unfortunately it introduces a lot of delay. On average we had a delay of about 6.8 seconds per every message that we sent. And so it turns out that you need something like four and a half days to infer a 17-state protocol, which is not really acceptable in practice. And also there are some other problems like dealing with encryption, compression and non-determinism, but we're not going to talk about those today. There's been some prior work that we build upon for dealing with encryption and compression.
>>: [inaudible], I mean, what's the alphabet?
>> Domagoj Babic: I'll get to that. I'll get to that.
>>: Okay.
>> Domagoj Babic: So --
>>: That might be [inaudible].
>> Domagoj Babic: Yes. Yes. So that -- right. So at this point, I'm just saying that if you treat the packets as they are sent as the alphabet, that's not going to scale. And now I'm going to get to the part where I explain how to deal with the alphabet.
So as I mentioned, it's computationally infeasible to really infer these protocols over the packets that are actually sent over the network, simply because the state space is too large.
So the approach that we took in this particular work -- and later we changed it a little bit -- is that we first observe communication between the client and the server. In this case it was the communication between the bot and those servers that I mentioned earlier. And so we find the set of input messages and the set of output messages. The output messages are all sent from the server back to the individual bots. And then, studying those, we came up with two abstraction functions. One is the input message abstraction function and the other one is the output message abstraction function. They take these network packets and abstract them into abstract alphabets called sigma I and sigma O. And so once you write these abstraction functions manually, then the abstraction is of course automatic. And it takes some effort. It's a bit tedious to come up with good abstraction functions, because the state machine that you infer essentially depends on how well these abstraction functions are working.
So it's a bit tedious. But it didn't actually take that much time. And we also used some prior work on reverse engineering of message formats to help us with that.
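As a purely hypothetical illustration of what such a hand-written input abstraction function might look like (the field layout and the abstract symbol names here are invented; the real functions were handcrafted per protocol and are not shown in the talk):

    # Hypothetical input-message abstraction: map a raw packet to an abstract
    # symbol in sigma_I by looking only at fields assumed to determine the
    # protocol state -- here, an imagined one-byte message-type field.

    def abstract_input(packet: bytes) -> str:
        if len(packet) == 0:
            return "EMPTY"
        msg_type = packet[0]                       # assumed type field at offset 0
        return {0x01: "HELLO", 0x02: "AUTH", 0x03: "TEMPLATE_REQ"}.get(
            msg_type, "OTHER")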
And then in the later work, which I'm going to present in the second part of the talk, we actually figured out how to do the input message abstraction automatically. But we still require the output message abstraction function to be provided. And I'm going to talk about that later.
So now what's happening is that you have the inference engine that has inferred some state machine so far. And then it keeps sending these sequences of abstract input messages. And then we have a script that actually does the concretization, and it also has to take care of some other aspects, for instance keeping the session alive and having the right session identifier. So there are some details there that I'm going to abstract away. But essentially what it does is it concretizes the messages from this abstract alphabet into concrete messages.
So we send these sequences to the server, collect responses, and then we abstract them using the output abstraction function. And that's how we get sequences of abstract output messages. And then we essentially refine the state machine that we have.
So even with abstraction it unfortunately still doesn't quite work, simply because the computational complexity is too high and we have a pretty high message delay in this setting. The complexity is quadratic in the size of the input alphabet, also quadratic in the number of states, and linear in the size of the counterexamples that we construct by sampling queries. So unfortunately abstraction is still insufficient.
So then, studying the state machines that we inferred, we also found that there is a lot of redundancy in those state machines. And the primary cause of redundancy seems to be our focus on inferring complete state machines, meaning that we want to know, for every input message and for every state, where the corresponding transition goes.
And it just happens that many of these messages don't do anything interesting in most of the states. You know, they do something interesting in only one or two states, but not in all states. So we end up with cases like this one where we have a huge number of self loops, which just increase the cost of learning without really adding anything useful to the inferred state machine.
So our idea was to try to use prediction and then rely on sampling to actually catch mispredictions. Because if you remember, you need to generate the sampling queries anyways, whether you are doing sampling-based checking or bounded black box model checking; you have to generate these queries anyways. So we might as well use them for checking our predictions as well.
And so it turns out that this prediction -- I'm going to explain only the first of the two prediction approaches that we use -- the first one actually saves about 73 percent of queries, which makes a pretty big impact on the performance. Then we also have some probabilistic prediction which saves an additional 13 percent.
So the basic idea behind response prediction is as follows. In this state machine, for instance, we have two self loops, the red one and the blue one, represented by red and blue state distinguishing vectors. Now we can see that the response in the upper part of the table is of course exactly the same as in the lower part of the table. So the question is, can we actually use that insight to predict responses and therefore avoid these two, the red and blue, queries.
So the first insight here is that if you look at the S part of the table, which is the upper part, then the prefixes are essentially strings of messages that are kind of like the shortest sequences that get you to each individual state. So effectively you can imagine expanding the state machine into a spanning tree, and then these sequences in the upper part of the table are essentially labels of paths to each individual node in the tree.
Because they're free of self loops, we can try to use these sequences in the upper part of the table to predict responses. And that's actually what we do. So we introduce a restriction function rho, which takes a sequence of input messages in the original input alphabet and a set D of messages that are in this S part of the table. In particular, in this case we would have only B and C in D. And then it essentially removes all the messages that are not in D.
For instance, if you take the string B concatenated with A, then rho of B concatenated with A is just B, simply because A is not in D. Let's take another example into consideration. For B concatenated with C, rho of B concatenated with C is B concatenated with C, because both of these symbols are in D. So there is nothing to abstract here.
So formally, this is what the rho function does. If the sequence is an empty sequence, that's what it returns. Otherwise, if the input sequence S is equal to A concatenated with R and A is not in D, then we recurse on R; and otherwise we copy A to the output and then recurse on R again. So it's a fairly simple function.
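A small Python sketch of rho, following the recursion just described (the names are mine):

    # Restriction function rho: drop every message not in the set D,
    # keeping the relative order of the remaining messages.

    def rho(seq, D):
        if not seq:                      # empty sequence maps to itself
            return []
        a, rest = seq[0], seq[1:]
        if a not in D:
            return rho(rest, D)          # skip symbols outside D
        return [a] + rho(rest, D)        # keep symbols inside D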
So how does this work? Let's assume that we got to this stage in building the table. So now we have the input B concatenated with A concatenated with A. So we have B,A,A. And now, if we restrict that, it turns out that we get B from the prefix, and we see that we already have that sequence in the upper part of the table, so we can use the whole state distinguishing vector from that state to predict the response. And that's what we do.
And then we repeat the same for the next one. Unfortunately here we can't reuse any of the previously generated sequences. We repeat the same thing again. And when we get, for instance, to C concatenated with A, then again the restriction of that is just C. We already have that row in the upper part of the table, so we just copy the response and that's it. At this point we are done.
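The prediction itself might be wired up roughly like this, reusing the rho sketch above; the table representation is again an assumption:

    # Predict the response for a query prefix: restrict it with rho and, if the
    # restricted sequence already labels a state in the upper table, reuse that
    # state's distinguishing vector instead of sending the query to the network.

    def predict(prefix, D, upper_rows):
        """upper_rows: dict mapping tuple(restricted prefix) -> stored responses."""
        key = tuple(rho(list(prefix), D))
        return upper_rows.get(key)       # None means we must really send the query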
And in this case the predictions actually happen to be correct, but when we mispredict something then we can use the sampling queries to detect that. Okay. So how well does this work? Well, here I have some results. So for MegaD we got a huge saving, from about 4.5 days to about 12 hours. And then if you parallelize -- we also have some parallelization -- then you can get it down to 2.4 hours.
However, for the SMTP protocol we actually didn't get that much of a saving. But there is actually a good explanation behind that, because when we were writing these abstraction functions for SMTP, we already knew what the protocol looks like. And so it was very easy for us to come up with the right abstraction.
And so there was very little redundancy in that abstraction. However, when you are working with, say, a proprietary or classified protocol, then you don't really know how the protocol works or what is important, so you tend to err more on the side of caution. So you usually come up with abstractions that are too precise. And in that setting, this prediction actually saves a lot.
And this is just the high level architecture of our system. So we use L star and send queries, and use response prediction to try to avoid sending them to the network. We use a whole bunch of bot emulators in parallel, I think about eight of them. And these are just scripts that we wrote to kind of pretend to be a bot. So in that sense, this experiment was safe; we were careful not to spread the infection. And also, there is a limit on how much you can parallelize because Tor becomes the bottleneck, so after adding more than eight emulators in parallel, we essentially started getting diminishing returns.
And then we send the queries through Tor and we get responses. And that's how the whole thing works. So here we have an example of a protocol state machine for SMTP inferred for the Postfix implementation. And these red edges are the edges that the prior work was not able to infer. So this is the kind of incremental improvement of our work upon what was done in the past. Most of the prior work actually focused on incomplete state machine inference, so that's essentially the reason why they miss these edges.
Another thing that we can do once we have, say, a protocol state machine is of course to model check it. And this is the state machine for the MegaD protocol. And one of the properties, for instance, that we checked was whether we can steal templates from the template server. The way it's supposed to work is that each bot is supposed to authenticate with the master server, get an authentication ID, and then send it to the template server in order to get the template. The red part shows the standard operation of this protocol, the part that each bot is supposed to follow.
But we actually found plenty of ways to just bypass authentication, send essentially random messages to the template server, and that way steal the templates. And the reason why that is useful is because you can essentially get unlimited access to fresh templates and therefore update spam filters before the first spam hits the net.
And another thing that you can do is use these differences for fingerprinting. For instance, MegaD has its own SMTP protocol implementation which is slightly different from the Postfix SMTP. So you can use these differences to detect on your corporate network that you have an infection going on. And we also found differences between different implementations of SMTP, so you can also distinguish different types of implementations, like Postfix from something else.
>>: So you're saying that often the malware is going to be operating on known ports and so you're effectively sort of just probing the known ports with these messages to figure out --
>> Domagoj Babic: Yes. So in this case, we knew the ports that it operated on. So what I'm saying is that if you do this kind of inference and then you infer a state machine, then the state machine tells you the differences between normal traffic and, in some cases, of course --
>>: So the idea is that you're going to have state machines [inaudible] correctly behaving protocols on certain ports and you're going to do this periodic learning in the environment and then compare --
>> Domagoj Babic: Yeah, possibly. Or you can even essentially have stateful firewalls that are just going to follow each session through the state machine. And that way it will tell you that something is going on.
Okay. So with that, I will move to the second part of the talk, which is essentially combining what I've just presented with DART. The main idea is to use the protocol model that we can infer to actually guide the search.
So it combines DART with learning. And my insight is that in many ways DART is very similar to what decision procedures do. The one big difference is that decision procedures also get a lot of leverage from learning various lemmas that prevent them from making the same mistakes in the future. And so I thought that perhaps combining this learning approach in some way with DART might give us some benefit, for instance for reducing the size of the search space, pruning the search space, or just providing more guidance.
So the basic idea here is to use the approach that I presented in the first part to infer a state machine of, say, some implementation of a protocol, say a server, and then use the state machine to first initialize the search to a certain state -- so that way you get more control over the search -- and then do local exploration using just standard DART.
Another benefit is that the state machine specifies the sequence of messages that you need to get to a particular state, which is something that can be fairly difficult to construct with standard approaches, even with decision procedures, simply because they don't have enough information to do that.
So this is the MACE approach at a very high level of abstraction. We start by running some number of state explorers on the server or network application that we are interested in. And so we generate a whole bunch of input and output messages. And as I said before, inferring a state machine over all of these messages would be computationally infeasible, so we need some kind of abstraction to reduce or abstract these messages. And here we have a filter function, which I'm going to explain later, that does exactly that. It essentially figures out which input messages to keep and which to discard. It effectively decides over which messages the state machine is going to be learned.
And then we go to L star, which uses this monotonically increasing set of input messages to learn more and more refined state machines. And then we use just the standard approach that I presented in the first part. Once we get the state machine, we generate for every state the shortest sequence that essentially tells you how to get to that state. We initialize the state explorers to those states and then we repeat the process. And eventually this thing terminates because we limit the amount of time spent per state in the state exploration phase. So you either don't discover any new messages or you infer a complete state machine, and so the thing terminates.
In practice we also do something else. In the very first iteration we start with some set of seed messages to infer the very first state machine. And the reason for that is just to speed up the convergence, but, strictly speaking, it's not necessary.
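Roughly, the overall iteration might be summarized by a sketch like the following; every helper here is a placeholder for a component described in the talk, not the actual MACE interface.

    # Illustrative MACE-style loop: alternate L* inference, per-state concolic
    # exploration, and message filtering. All callables are supplied by the caller.

    def mace_loop(lstar_infer, explore, filter_messages, seed_msgs, budget):
        """lstar_infer(alphabet) -> model with .states and .shortest_sequence_to();
        explore(prefix, budget) -> (input_msgs, output_msgs) from concolic search;
        filter_messages(model, ins, outs) -> set of new abstract input messages."""
        alphabet = set(seed_msgs)            # abstract input alphabet so far
        while True:
            model = lstar_infer(alphabet)    # inference as in the first part
            new_msgs = set()
            for state in model.states:
                prefix = model.shortest_sequence_to(state)   # drive server to state
                ins, outs = explore(prefix, budget)          # local DART exploration
                new_msgs |= filter_messages(model, ins, outs)
            if not new_msgs:                 # nothing new discovered: converged
                return model
            alphabet |= new_msgs             # refine the alphabet and iterate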
>>: Sorry. So let me understand. The classic way DART is used is you have some symbolic input like a file or something, and then you try to get different inputs to increase coverage. But you really want to use that as a subroutine for testing network protocols. So, in addition to the problem of finding the inputs, you also want to find the state machine to allow you to drive the program to interesting states.
>> Domagoj Babic: So the current version of MACE -- what I'm presenting is really targeted towards kind of networked applications. But I think the same idea essentially applies to, say, parsers, because you could imagine inferring a context free language and then using that to guide further exploration in a similar way.
>>: Right. Then you need some way to curve the response.
>> Domagoj Babic: Yes.
>>: I mean, you need this notion of an observable output that could be used to distinguish internal states?
>> Domagoj Babic: Yeah. Yeah. But you can also -- you can essentially use
the white box model, and you can -- you can analyze the application at the same
time.
>>: Right.
>> Domagoj Babic: So the first [inaudible] that I presented, on inference of botnet protocols and protocols in general, really assumed a completely black box model. Simply because we didn't even have access to the code of these servers, we had no choice. So we had to treat those servers completely as black boxes. But in MACE we actually do kind of code analysis to infer these messages. So it's already kind of a combination of black box and white box approaches.
So the way it works is in many ways similar to what I presented before. The difference here is that we actually use DART to generate messages rather than just observe kind of random traffic. And another difference here is that we now infer the input abstraction function -- we essentially do this abstraction automatically, as I'll describe on the next slide -- but we still require the output abstraction function to be provided manually, simply because it determines the coarseness of the state machine that you infer, and therefore it seems fairly difficult to automatically find the right trade-off between the precision of the state machine and the computational cost of inferring a very precise state machine.
So this is the filtering function. It takes the current version of the inferred automaton, it takes an input sequence -- a sequence of input messages -- and a sequence of output messages, and then produces a set of new input messages that are going to be used to refine the current abstract input alphabet. And what it does is actually fairly simple. It looks at whether there exists a path in the current version of the state machine that produces the same output as the output sequence that you pass to the function.
So, in other words, if there is a way to produce that sequence of output messages with the current state machine, then you don't add anything to the input alphabet.
On the other hand, if you can't produce the same sequence of output messages, then you know that there is at least one new message. And then we actually add all the messages in the input sequence to the abstract input alphabet and essentially repeat the learning: the next iteration of learning is done over this refined alphabet, and we learn a refined state machine.
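A minimal sketch of that filtering step, assuming the inferred automaton exposes a check for whether some path can produce a given output sequence (the method name is an assumption):

    # Filter function: keep the input messages of a trace only if the current
    # automaton cannot already explain the observed output sequence.

    def filter_messages(model, input_seq, output_seq):
        if model is not None and model.can_produce(output_seq):
            return set()                  # nothing new: outputs already explained
        return set(input_seq)             # otherwise add all inputs for refinement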
So we evaluated this on a number of benchmarks. We inferred the protocol on the Vino implementation of the RFB protocol and on the Samba implementation of the SMB protocol. And once we inferred these state machines, we also used them to test RealVNC and the Win XP SMB implementation without re-inferring the protocol. And that of course relies on the assumption that the protocols are fairly similar. Once you infer the state machine, you can use it for testing various implementations, as long as the protocols are fairly similar.
And for Vino we used a 45 second remote desktop session to generate the set of seed messages, and for Samba we used gentest -- I don't know for how many seconds we ran it -- just to bootstrap the process.
We ran this on the DETER security testbed, allocating about 2.5 hours of state exploration for each discovered state, and we did this only when a new state was inferred. So we wouldn't repeat this for previously explored states, for obvious reasons.
And also, for the coverage measurement experiments, we made sure that the baseline, which was a state of the art DART engine, gets exactly the same amount of time as the MACE approach. And that also includes the time that MACE required for learning the protocol.
So for Vino we inferred the protocol in two iterations after about 150 minutes.
And for Samba, we inferred it in about three iterations and it took us over 4,000
minutes.
So this is the SMB protocol state machine that we inferred. It's about 84 states. I think it's clear that it would be too much to ask a programmer to specify this. Perhaps not, I don't know. But it looks fairly difficult to infer and to specify.
And so this table shows the vulnerabilities that we found. We found seven vulnerabilities altogether, four of which are new. And we got some CVE numbers for them. For instance, for Vino the first one was found after about one hour of exploration in total, the second one after four hours, the third one after 15 hours. And the baseline, unfortunately, didn't have the capability to detect the very first vulnerability because it was an infinite loop, so it's kind of like a denial of service attack, and the baseline DART implementation doesn't have a detector for infinite loops. So it wasn't able to discover that one in particular. But it had the capability to discover the remaining two, which were wild out-of-bounds reads. But even though it had the capability, it actually failed to discover them even after 105 hours of exploration.
And then for Samba, we found three vulnerabilities. We actually hadn't known about any of these when we found them, but later we found that they were already known. And the baseline approach managed to discover only one of those, after about 602 hours, while MACE took only about 12 hours.
And then for RealVNC we found another one, and we found none for Win XP
SMB.
So this is an interesting graph. We wanted to see how deeply the baseline approach gets into the search space compared to MACE. And so what we did is we expanded the state machine into a tree and then measured the percentage of states that each approach reaches at a certain depth. And for MACE, not surprisingly, it can reach any state simply because it knows the state machine, so it's very easy to construct the sequence that's going to get you to that state.
But what's interesting is that the baseline approach's coverage actually falls very rapidly. So for instance when you get to a depth of five, it reaches only about 40 percent of states, and when you get to a depth of eight, it essentially falls down to zero.
And I believe that this is the reason why we also got much better coverage with MACE. We got coverage improvements ranging from about six percent to 58 percent, depending on the benchmark. So my impression is that the reason why MACE works so well is that by learning a state machine you effectively use relatively cheap reasoning to infer a kind of high level abstraction of the program -- actually of the protocol that it implements -- and then you can use that to guide the search. And it's also very easy to construct sequences that are going to get you to a certain state, which is something that's relatively difficult for DART to do on its own, simply because it doesn't have enough information.
And another side effect is that you also get more control over your search, because you can diversify your search more easily and it's less likely you are going to get stuck in loops.
So with that, I'll end the second part of the talk. And I have just about five
minutes to zip over the malware detection part.
>>: [inaudible].
>> Domagoj Babic: Okay. So, you know, what is malware -- I'm going to skip that. So this slide essentially shows the effectiveness of modern anti-virus tools. This is a study done by the Cisco security team. It essentially shows that on the day when new malware is released, only about 20 percent of malware is detected by contemporary anti-virus tools. And then as they keep cranking out signatures and updating the signature database, they get to about 60 percent or something like that after about seven days. But only one of these samples suffices to really create problems. So this is far from satisfying.
The reason why this analysis is so difficult: well, in general sources are not available, and binaries are quite often obfuscated or even encrypted, so it's very frequent that you can't even disassemble the code; you can't distinguish what's code and what's data.
And also, many of these tools automatically detect that you're in debug mode or that you're running some anti-virus, and then it's difficult to analyze them.
Just to summarize this malware crash course: as I mentioned earlier, we are getting around 60,000 samples per day, so the daily volume is just too large for manual analysis. And unfortunately that's what's being done today. Many of these samples are analyzed manually, and it takes about 15 to 20, 30 minutes for an experienced analyst to actually go through these samples.
Also, the cumulative volume is too large for expensive analysis, because you're getting about 20 million of these samples per year, and then there is a huge backlog of malware that you have to go through.
Also, signatures are unfortunately too easy to defeat, as the previous slide showed. And static analysis is frequently very difficult or impossible.
And so to address these last two issues, the security community has started researching behavioral detection approaches that really focus on what software does rather than how it does it. And one popular abstraction of behavior is essentially a sequence of system calls, because in order to change the state of the system, the application has to execute a system call, for instance, to create a file or change a registry key. And so what you can do is generate sequences of these system calls and then use that to recognize what's potentially malicious and what's not. What we did was slightly more complicated. We actually used taint analysis to construct data flow dependency graphs of system calls.
For instance, if one system call generates some result, and then later that result is kind of like changed in the application and [inaudible] another system call, then we would say that there is a data flow dependency edge there. And then of course from that relation we can construct the graph that you can see here. This is from a real world example, a trojan called Banker.
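Conceptually, building such a graph from a taint-annotated system call trace might look like the following sketch; the trace record format here is an assumption for illustration, not the actual tool's data structures.

    # Build a data-flow dependency graph over system calls: add an edge i -> j
    # whenever some output (tainted) value of call i flows into an argument of a
    # later call j, as reported by the taint tracker.

    def build_dependency_graph(trace):
        """trace: list of records with fields 'name', 'out_taints', 'arg_taints'."""
        edges = set()
        produced_by = {}                       # taint label -> index of producer
        for j, call in enumerate(trace):
            for label in call["arg_taints"]:   # inputs derived from earlier outputs
                if label in produced_by:
                    edges.add((produced_by[label], j))
            for label in call["out_taints"]:   # remember who produced each label
                produced_by[label] = j
        return edges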
So now, once you construct these graphs, you can imagine expanding them into trees. That's not what we really do, because that would incur an exponential cost, but you can just imagine it that way, [inaudible]; it just simplifies the presentation. So you can imagine expanding these into trees, and then we can eliminate the graphs that are common between malware and goodware. And then we end up with a reduced set. And now we can use this reduced set to infer a state machine that's going to distinguish the two.
In the paper, what I'm actually doing is slightly simpler than that, because I just used the positive examples. But one could use negative examples as well to get higher precision.
So I don't think I'll have time to actually go into the -- into tree automata, so I'm
going to skip the next two slides.
The main idea is that you have these trees and then you construct a window of a set size. And then, by sliding that window over the entire tree and creating a state for every unique subtree that you see, you can essentially construct the tree automaton. That's roughly how it works.
And then of course the accepting states are those states that essentially accept the whole trees that you've seen. The K factor, which is the size of that window, is the inductive bias. And that helps us to actually do inference from positive examples. But that K factor is very important, because the smaller the K, the more abstract the state machines that you infer are going to be. So you can actually vary the trade-off between true positives and false positives by changing this K. And that's something that's very useful in practice. And that is actually due to a theorem by Garcia, that the language determined by a K plus one size window is kind of contained in the language determined by the K size window.
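Very roughly, the windowing idea might be sketched as follows; this is a simplified reconstruction of a K-testable tree automaton learner from positive examples, not the algorithm from the paper.

    # Each state is the subtree truncated to depth k (the "window"); transitions
    # map a node label plus the children's states to the node's state; accepting
    # states are the root states of the training trees.
    # A tree is represented as (label, (child_tree, ...)).

    def k_root(tree, k):
        label, children = tree
        if k <= 1:
            return (label, ())
        return (label, tuple(k_root(c, k - 1) for c in children))

    def infer(trees, k):
        transitions, accepting = {}, set()
        def visit(tree):
            label, children = tree
            child_states = tuple(visit(c) for c in children)
            state = k_root(tree, k)
            transitions[(label, child_states)] = state   # one transition per window
            return state
        for t in trees:
            accepting.add(visit(t))
        return transitions, accepting

    def run(tree, transitions):
        label, children = tree
        child_states = tuple(run(c, transitions) for c in children)
        return transitions[(label, child_states)]        # KeyError means reject

    def accepts(tree, transitions, accepting):
        try:
            return run(tree, transitions) in accepting
        except KeyError:
            return False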
And so the algorithm that I came up with essentially has almost linear complexity; it's K times N, where N is the size of the graphs. So it's very efficient in practice. And it certainly can scale to the very large backlog of malware that we have.
The overall algorithm works like this. We collect the graphs. We learn the automaton. And then we partition the test set according to the heights of the trees. And then we run all those graphs against the tree automaton. And then for each height we compute a score by computing the ratio of the number of accepted trees with that height to the total number of trees in that partition, i.e., having that height. And then we multiply it by the height of the tree, because the idea is that the larger the tree that is accepted, the more weight it should get. And that is the score that we compute for malware. The higher the score, the more likely it is that the sample is malicious.
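A possible sketch of that scoring step; aggregating the per-height ratios by summing is my assumption, since the talk does not spell out the final combination.

    # Score a set of test trees against an inferred tree automaton: group trees
    # by height, take the fraction accepted at each height, weight it by the
    # height, and sum. Higher scores suggest the sample is more likely malicious.

    from collections import defaultdict

    def height(tree):
        label, children = tree
        return 1 + max((height(c) for c in children), default=0)

    def score(trees, accepts):
        by_height = defaultdict(list)
        for t in trees:
            by_height[height(t)].append(t)
        total = 0.0
        for h, group in by_height.items():
            accepted = sum(1 for t in group if accepts(t))
            total += h * accepted / len(group)     # height-weighted acceptance ratio
        return total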
And then we did experiments on a pretty large set of malware grouped into something like 48 families. And we also used some goodware samples to compute the false positive rate. So I'm going to skip this.
And so these are the results. The rising curves, from left to right, are the malware detection curves. What a curve means is that every point on the curve, for every K -- we have different Ks here -- shows the percentage of malware that had a score less than what you have on the X axis. So for instance in this case, 40 percent of malware had a score less than -- no, wait. About 10 percent of malware had a score less than 0.4. The curves that fall from left to right are the goodware detection curves. They're exactly the opposite: every point on such a curve gives the percentage of goodware that had a score larger than what is indicated on the X axis. So now you can find some sweet spot. I think that one of the sweet spots was, for instance, K equals 4 and a score of 0.6. At that point you would get something like an 80 percent malware recognition rate with about a five percent false positive rate. And you can adjust this as needed.
And another interesting result is that you can also use these inferred tree automata to try to classify samples into families. If you learn one automaton per family -- you split these families into train and test sets, learn one automaton for every family, and then run the samples from all the test sets on the inferred automata -- then you can compute scores. That's exactly what I did here, and you can see that there is a fairly pronounced diagonal, which essentially indicates that this approach can also be used, at least partially, for classification. It would be nice if the classification capability were more pronounced, but that's what we have at the moment.
And so to summarize, I talked a little bit about grammatical inference, and I presented three possible applications: protocol inference, then using that for guiding state-space exploration, and also structural feature recognition for malware detection.
And I think that many other applications are possible, namely synthesis, inference of queue invariants, and many others. And I'd also like to make a prediction. My gut feeling is that in the coming years GI might become as important as SMT, simply because it allows us to do the kind of reasoning that other techniques are not that strong at. And it's a very complementary approach that I think can complement SMT and other approaches that we've been using.
So for future work, one of the things that I'm looking into is symbolic MACE: how can we do this analysis with, say, symbolic automata and try to make it more generic, and that way perhaps we can even improve performance by going symbolic rather than having to deal with concrete messages and potentially large alphabets.
Another possible direction is to try to do the same thing for context free grammars, because it is a fairly big problem to get past these parsers in practice. And if you can infer a context free grammar, then you know the structure of the state space and it could help you get past the parser.
And also, something that I've been looking into is regular model checking of distributed systems, and there we can use grammatical inference to infer the queue invariants and also to do some other things.
So I'd like to wrap up by acknowledging the people that have been working with me on a number of these projects, and also the funding agencies.
Thanks.
[applause].
>> Tom Ball: Now, questions.
>>: [inaudible] so I thought that using machine learning for classification of malware was pretty standard. Not necessarily using tree automata, but can you compare -- I mean, what's the key strength of using tree automata and this type of abstraction versus [inaudible] being used just for malware classification?
>> Domagoj Babic: So that's a great question. So there are a few papers on using machine learning, I think feature identification and leap analysis or something like that, if I remember correctly, for malware detection. And they got relatively similar results. But their implementation as far as I know is not publicly available, so we're not able to do a direct comparison, although we use a very similar or essentially the same set of malware samples for benchmarking.
Other than that, most of the mainstream anti-virus tools, as far as I know, actually use a signature-based approach. And they also detect some behaviors, but they are kind of less systematic about it. But, yeah, potentially statistical approaches might be useful for detecting these behaviors as well. And actually I have one collaboration with some people from Rice where we have started looking into this a little bit. I don't know what's going to come out of that.
>>: So learning the most general automaton for some [inaudible] inputs is
[inaudible]?
>> Domagoj Babic: Well, it depends on a lot of factors. It depends on the languages that you want to learn, it depends on whether you have only positive examples or negative examples as well. It depends on whether it's passive or active. It depends on whether you want to learn a minimal automaton or not.
So if you can specify your question a little bit more, perhaps I can give you a more precise answer.
>>: [inaudible] about, you know, inference of models for programs and so on, so that seems like the general idea at some level. At some point in the 2000s it was, let's just learn A, B pairs because learning automata is hard. So there I guess the question is [inaudible] the most general minimal automaton for a given set of traces.
>> Domagoj Babic: So you're essentially talking about passive inference, from an observed set of traces. If you want to learn a finite state machine that's minimal, I believe that's NP complete. But you need to have negative samples as well. So I'm assuming you have both positive and negative samples. Do you have negative samples as well?
>>: That's just a set of sequences. I mean, one thing is the general automaton that captures every sequence.
>> Domagoj Babic: Well, the most general automaton is just a single-state automaton. So that's still a little bit underspecified. But there are algorithms like RPNI that learn from positive and negative sets and that are kind of polynomial.
If you want to learn only from positive sets, then approaches based on, say, K-testable automata might be a good way to go. I don't know, I would need to learn more details about the problem that you have.
>>: [inaudible] using an abstraction to guide the state [inaudible] in which he was doing these abstractions, [inaudible] testing for those abstractions. So are you saying that MACE is a different way of constructing those abstractions with which you can just do better testing of things? So should I look at MACE as you also having an abstraction aligning with your [inaudible]?
>> Domagoj Babic: Yeah. So the way I see it, it really is a particular type of learning combined with DART. So you essentially do the state space exploration the same way as DART does, just you're trying to use the information, whatever you discover during that exploration, to actually learn something about the state space and then use that knowledge to drive further search later down the line.
And this is just one way of doing it; there are probably many, many possible ways to learn something here. There's a lot of information that's currently being discarded by DART, and I guess that you could come up with many, many different approaches to learn. This is just one kind of simple point in the -- so [inaudible] had a question?
>>: So you could look at it as learning summaries of some sort. I mean, generally you could learn summaries of procedures, but you could also learn a summary of the whole program. So your state machine is a summary of the input output behavior of the program. And those summaries you learned at that level you could also learn, if you were able to observe them, for the internal input output relations of functions as well.
>> Domagoj Babic: Right. Right.
>>: And all that you've learned here [inaudible] summary is you learn behavioral components, compositions, that [inaudible] program analysis, where here you're learning basically a monolithic state machine, which is [inaudible] state machine, where you do not learn for each procedure an individual [inaudible]. It would be interesting to see how -- I mean, to learn the [inaudible] abstractions from the abstraction of the system test. Otherwise you're going to have to pay the price [inaudible].
>>: But I guess in the case of the network protocol, you have a small set of output states. Is that what helps you? I mean --
>> Domagoj Babic: Output states?
>>: Well, how do you observe, like in your MACE, what do you observe
[inaudible] again do you have some abstraction for the input?
>>: [inaudible]. In your output -- your output alphabet is fixed in advance.
>> Domagoj Babic: No, only the abstraction function is fixed.
>>: Oh, so the output is also part of the [inaudible].
>>: Right. And that determines what state machine you learn.
>> Domagoj Babic: Yes, correct. So that's why we -- essentially it was very difficult -- we didn't actually succeed at doing this output abstraction automatically as well, exactly because the state machine that we tend to infer is so sensitive to this output abstraction. You get a very wide range of state machines even after doing very small tweaks to the abstraction function. So, yeah, but --
>>: But it seems like even a return code, like an error code, could just be useful for that. Right? You send it back, you get one of, I don't know, 10 error codes maybe in a packet -- not recognized, illegal state for this message -- or hopefully there would be like a small set. I don't know.
>> Domagoj Babic: A small set of --
>>: Of error codes. You could use the error codes as -- I mean, presumably there would be some --
>> Domagoj Babic: What was interesting for Samba, for instance, was that there are so many error codes that if you want to handle all of them, if you want to represent all of them in your alphabet, then the inference blows up. So we actually abstracted many of these error responses into one equivalence class. We just care whether there was an error or not. But, yeah, if we could come up with some kind of more modular or automatic way to abstract these, or even to infer symbolic state machines, you know, that would take care of a lot of these issues that are essentially all due to the size of the alphabet.
>> Tom Ball: Okay. Thanks again.
[applause]