>> Patrice Godefroid: Hello. Hi, everyone. It's my great pleasure to introduce
David Molnar who is going to talk about RFID security and security testing. So,
please, David.
>> David Molnar: All right. Thank you very much, Patrice. It is my pleasure
today to talk to you about two areas of computer security that exemplify the
range of different approaches we need to make progress in the area.
So the first thing I want to talk to you about is RFID, and to do that, of course, I
have to tell you what it is. So RFID is a term that refers to a range of
technologies where a small computer with an antenna called a tag is attached to
a person, an item or a collection of items. And then a different computer called
the reader is able to talk to this little tag and ask it for the information it carries.
And in many cases, and the cases I'll be talking about in this talk, the power
comes strictly from the radio waves emitted by the reader. There's no battery to
change, nothing to wear out, and this is what makes it very attractive for the sort
of identification space.
So the interesting thing about RFID is the applications are everywhere. The slide
just shows a few of them. So you can see here on the top we have FasTrak,
that's the California automatic toll payment system. You see credit cards on the
top right, where the credit card sends your credit card number wirelessly over the
air to a reader as it comes in.
On the bottom right you have library books with RFID tags on them which I will
talk about. And of course your driver's licenses here in Washington now have
the option of the so called enhanced driver's license which includes an RFID tag.
And even United States passports now include these devices which I will also
talk about.
But one of the key questions with these devices is: what are the security and
privacy implications of using them?
[Video played:]
>>: [Inaudible] or is this her? The card reader that restricts access to the state
capitol says it's this gentleman.
>>: I was Ms. Pafly [phonetic].
>>: Jonathan Westhues is really a security consultant hired by
State Senator Joe Simitian.
>>: I [inaudible].
>>: With a home-made antenna and a laptop, Westhues was able to read
radio waves emitted by Pafly's identity card, duplicate it and then gain access to
a secured area of the state capitol.
>>: All that was done in a moment of time without me even being aware of it.
>>: Simitian's experiment illustrated just how easy it is for a hacker to read those
Radio Frequency Identification cards, or RFIDs, from a few feet away.
>>: If you can read someone's information and then literally in a matter of
seconds clone their card and pass yourself off as them, imagine the mischief that
people can do.
>> David Molnar: So what this shows is that people use these devices believing
that they have certain security properties when in fact they don't. And anyone
with some basic electrical engineering knowledge can come along and build
something that defeats the security of the system.
One of the things that makes it difficult to design around these security issues is
that, as I said, RFID is a range of technologies. On the one hand you have things
like the chips which are used in US passports. They carry a lot of information,
up to and including a digital photograph of you, and they can do a lot of
computation, up to and including RSA signatures.
On the other hand, you have devices used by Wal-Mart, originally designed for
tracking tubes of toothpaste, which are now in enhanced driver's licenses and the
so-called PASS card, and which are nothing more than a sort of long-range
readable bar code. They can be read at five to ten meters but don't really have
much in the way of computation.
So what I did as part of my work was take a look at some real deployments to
understand what are the security and privacy issues and how can we come up
with new methods of addressing these problems?
So the first one I want to talk to you about is library RFID. So the San Francisco
and Berkeley public libraries were considering spending millions of dollars on
RFID tags for every single book in the library. A dollar per book, a million books,
a million dollars of public funds. The payoff is that these tags enable easy self
checkout. You can walk up to a kiosk, wave the book over the kiosk, and it
automatically figures out who you are and what the book is and checks the book
out to you.
Or, as you can see in the picture, when you walk out the library door you walk
through these gates, the gates read all the books in your backpack, and if you
haven't checked one of them out, it reminds you that perhaps you need to do
that.
So there were actually huge public concerns, people coming to open
meetings, people coming to library trustee meetings, about: well, what is on these
tags, who can read them, and from how far? So I was able to serve as a pro
bono consultant for the Berkeley Public Library. So I volunteered my expertise as
a computer scientist to help them understand these issues. In return they actually
let me ask vendors: so what do you really do? Tell me.
And they had to answer. It was great. [laughter].
>> David Molnar: So one of the things I discovered was that some vendors
actually advocated putting the title and the author of the book on the tag. And
what's more, these tags could be read by anyone with a compatible reader. So
that leads to an immediate privacy concern of someone being able to scan you
as you walk through these gates or anywhere like that, and learn exactly what
you're reading. And it turns out there's a long history of people being very
concerned about what you read and other people wanting to not reveal that
information to just anyone who might be passing by.
But then we looked a little bit deeper and said okay, what if you get rid of this title
and author? Well, a lot of different places also suggested putting the bar code,
the library bar code on the RFID tag. And this is sort of a static unique identifier
and it never changes. And one of the things we discovered was this attack we
call hot listing. Suppose you're interested in a specific book in the library. For
example, the FBI circulated a memo to its agents in late 2003 saying be on the
lookout for people with Almanacs because we believe they are using Almanacs
to plan an attack on the United States. [laughter]. You can't make this up.
I can go to a library that uses RFID like the Cesar Chavez Public Library branch
in Oakland and look at the tags, discover the bar code for the Almanacs and now
because it's a static unique identifier that never changes, every time I see that
particular bar code in the future I know it's from Oakland, and I know it's the
Koran or an Almanac or what have you.
So one of the things we did then was say, okay, let's work to figure out if there's
some new cryptography we could use, or some other methods, to change these
IDs every time. And then we ran into a systems problem: the collision avoidance
protocol used by these ISO 15693 tags is based on unique tag identifiers at the
level of the radio protocol. And by looking at the collision avoidance behavior you
could just find out which tag it was, even if at the data level you were rewriting the
ID every time.
So our contribution there was to recognize the full, vertically integrated systems
problem. I then moved on to electronic passports. So the proposed -- yes?
>>: [inaudible] for that or --
>> David Molnar: I'll talk about some cryptography we came up with as a partial
solution for that. And since then, there have been new air interface protocols
which don't have the same problems that we talked about. Does that answer
your question?
So the next point I want to talk to you about is passports. So here, remember I
said that passports are sort of on the other end of the spectrum of computation.
The digital passport includes your name, your passport number, a photograph of
you and your nationality. So if you take a look at your passport, you have all this
information here on the front cover, like your picture, name, passport number,
and so on and so forth. All of that is on the chip inside the passport. For instance,
my passport, I can tell by looking at the small symbol, actually has the chip inside
of it.
The original deployment choices in the United States included no encryption or
authentication of the chip. So anyone with a compatible reader could go up to
your passport and say, excuse me, I'd like to know who you are, and the passport
would send all of this over to the reader. Yes?
>>: Is there a [inaudible] Faraday cage in the cover? Does it or does it not?
Because a closed passport is a fairly good [inaudible] if it's made of metal.
>> David Molnar: So the question is is there a Faraday cage in the cover. And
the answer is there is now. There was not in the original deployment choices.
So I'm talking about in 2005, when we started looking at this, the United States
State Department published what they called a concept of operations where they
said this is what we're going to do, and a Faraday cage was not part of that
concept. Does that answer your question?
So the key problems with these original deployment choices: well, first of all,
there's a privacy issue. Someone comes up and reads your passport, they know
who you are, even if you didn't want to tell them. But there's actually a more
serious issue. The more serious issue is that because it didn't check whether or
not the information was coming from a real passport or from a small home-made
device like the one you saw Jonathan Westhues made, what this means is if I
scan your passport, I can pretend that I'm using your passport in anything that
makes automatic use of the passport. For example, Australia trialed gates where
you would slide your passport in and face recognition would figure out if you're
allowed in the country or not. So I notice here you have these nice gates at the
entrance of Microsoft Research.
>>: [inaudible].
>> David Molnar: So imagine coming back to your home country, swiping your
passport and then having a camera look at your face, look at your passport's
picture that's presented, and say, oh, is this the right person or not? Because the
original choices didn't actually check whether the information was coming from a
real passport or not, I could try to use your passport. And the reason this is
important is because it lets me try to imitate your face at home, instead of at the
border.
So I can go scan someone who looks somewhat like me and then figure out at
home what the correct Polaroid to hold up to the camera is, or facial prosthetics,
or what have you. But the point is that the biometrics are coming from the
passport and it didn't authenticate the chip -- yes?
>>: So you're saying there was some kind of cryptographic signature that paired
the identifying information with the face, so you couldn't just replace the picture
and then show up and it looks like you?
>> David Molnar: Yes. So the question is, was there a cryptographic signature
that made sure I couldn't just replace the identifying information directly, and the
answer is yes, there's something they call passive authentication where the
issuing nation signs the information.
We've recently learned that no one checks those signatures. But that was not
part of my work.
And then the final thing, of course, is that this reveals the nationality, so it's a US
citizen detector, which might be bad. So what we did was we wrote this down.
We submitted a formal comment to the State Department with the Electronic
Frontier Foundation. 2,400 other people sent in negative comments about these
choices. And there was public outcry, demonstrations, what have you. So the
State Department ended up adding encryption to e-passports. Now, whenever
we have encryption, of course, a key question is where you get the key used for
the encryption. And the answer is the key is actually derived from information
that's right here on the front cover of the passport. When you go to the airport, as
I did in San Francisco International before coming here, you either swipe your
passport or have it scanned, and then the machine gets this key, authenticates
the passport and obtains the information.
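To make that concrete, here is a minimal sketch in Python of the general idea, loosely following the outline of ICAO Basic Access Control: the key seed is hashed from fields printed in the passport's machine-readable zone, so only a reader that has physically seen the document can derive it. The field values are illustrative, and the check digits and 3DES key expansion of the real scheme are omitted.

```python
import hashlib

def derive_access_key(passport_number: str, birth_date: str, expiry_date: str) -> bytes:
    # The real scheme concatenates these MRZ fields (with check digits)
    # and takes the first 16 bytes of their SHA-1 hash as the key seed.
    mrz_info = (passport_number + birth_date + expiry_date).encode()
    return hashlib.sha1(mrz_info).digest()[:16]

# Illustrative values only; dates are YYMMDD as printed in the MRZ.
key = derive_access_key("C01X00T47", "640812", "101031")
```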
This works for the passport workflow where you're already giving your passport
to somebody. But it doesn't work for other RFID applications where the whole
point of radio frequency and remote reads is to not touch things. Yes?
>>: It does work for the passport, but now that you have to expose it over the
[inaudible], what does the RFID do that a [inaudible] bar code would not do?
>>: Yeah. [inaudible] read the information.
>> David Molnar: So the question is what does the RFID do that the 2D bar code
does not do? And the answer is in the particular choices they ended up with,
there's no difference, but there is an optional feature not yet implemented where
the chip would do a challenge response protocol to prove it was an actual chip
made in an actual passport facility. This is called active authentication. It is an
optional feature. It has not been implemented in current US passports to my
knowledge. Does that answer your question?
Yes?
>>: It doesn't answer my question. What does that [inaudible].
>> David Molnar: Well, the active authentication would actually authenticate the
specific chip, not just the actual digital -- not just the information on it.
>>: Does part of the signature include the chip ID?
>> David Molnar: Well, how would you authenticate a particular chip has that
specific ID?
>>: Okay.
>> David Molnar: Does that answer your question? Okay.
>>: [inaudible].
>> David Molnar: Yeah, it's a cloning issue. So the question was: why would a
challenge response do something different from just signing the information on
the passport? And the answer I gave is that it's a question about trying to prevent
cloning of the actual chip involved.
But the point I'm trying to make here is that we found a solution for this problem
that integrates with the workflow of actual passports, but what are we going to do
for other RFID deployments? And if you think about it, there's an underlying
theoretical problem here. It's a problem I like to call scalable private
authentication.
So here you see Alice talking to one of two different RFID devices and you see
an adversary listening in on them. And I'm going to show you a protocol which
solves the problem of private authentication. That is Alice should be able to
authenticate to the chip and vice versa without the adversary knowing which tag
is being read by Alice. I want to talk about solutions for a subclass of all the tags
we have where we can do some cryptography.
So the solutions I'll talk about will cover some of the RFID devices in existence
but not all. And I'll argue there's still some interesting problems to solve even in
this class of devices where we're allowed to share keys between Alice and the
tags and where we're allowed to do a little bit of cryptography. So the first
protocol I want to show you is a protocol where Alice and the tag share a secret
key. Alice begins by creating a random nonce, a 32-bit value that never repeats,
and sends it to the tag; the tag creates its own nonce and evaluates a
pseudo-random function on the shared key and the two nonces. And just to
briefly argue why this gives us some nice properties: an adversary who doesn't
know the shared key won't be able to predict the proper value to respond with as
the tag, and so won't be able to impersonate a particular tag.
By the same token, because this is a pseudo-random function, without knowing
the secret keys of different tags, the adversary won't be able to figure out
whether a particular response came from tag A or tag B. And the only problem
here is now scaling.
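To make the protocol concrete, here is a minimal sketch in Python, assuming HMAC-SHA256 as the pseudo-random function F; the class names and 16-byte key length are illustrative, not taken from any deployed tag.

```python
import hmac, hashlib, os

def prf(key: bytes, reader_nonce: bytes, tag_nonce: bytes) -> bytes:
    # F_k(n_reader, n_tag): the pseudo-random function on the two nonces.
    return hmac.new(key, reader_nonce + tag_nonce, hashlib.sha256).digest()

class Tag:
    def __init__(self, key: bytes):
        self.key = key

    def respond(self, reader_nonce: bytes):
        tag_nonce = os.urandom(4)            # the tag's own 32-bit nonce
        return tag_nonce, prf(self.key, reader_nonce, tag_nonce)

class Reader:                                # "Alice"
    def __init__(self, key: bytes):
        self.key = key

    def authenticate(self, tag: Tag) -> bool:
        reader_nonce = os.urandom(4)         # fresh 32-bit value, never repeats
        tag_nonce, response = tag.respond(reader_nonce)
        # Without the key, an eavesdropper can neither forge this response
        # nor link two responses to the same tag.
        expected = prf(self.key, reader_nonce, tag_nonce)
        return hmac.compare_digest(response, expected)

shared_key = os.urandom(16)
assert Reader(shared_key).authenticate(Tag(shared_key))
```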
So in this picture, Alice and the tag share a key, and they both know they share a
key. What happens when a tag shows up in Alice's field of vision that she's
never seen before? Well, the naive thing to do is for Alice to try the key for every
tag it could possibly be until she succeeds. So that scales linearly as the number
of tags scales. And the question is: can we do better? So
let me show you something, an attempt to do better that doesn't actually work.
So the attempt is this. Let's give every key a unique identifier, have the tag
simply say I'm using key number five and then Alice has no problem with scaling,
she just looks up key number five in some hash table and says okay, this is the
key I want to use. Great. And we run the protocol, or any other authenticated key
exchange, or any other protocol that you happen to like.
But the problem with this is that it breaks our privacy guarantees. We're back to
a unique static identifier for every tag. And the question now is what can we do?
If the tag sends anything that's correlated with its identity, it seems like the
adversary will get an idea of which tag it is, which breaks our privacy guarantee.
But if we don't, then Alice doesn't know which key to use, which breaks her
scaling.
So my solution to this is to use a so-called tree of secrets. Alice knows every
single secret key in this tree. Each tag is associated with a leaf in this tree and
knows the secrets on the path from the root to the leaf. So in this particular
picture you have a tag on a library book, and it knows the right secret, then the
left secret, then the left secret. And what we do in order to authenticate and
identify a particular tag is the following:
particular tag is the following: We start out using the protocol I showed you and
say okay, am I the left subtree or the right subtree? Which of these two keys do I
share. So we can use the protocol I showed you which scales linearly in the
number of possible keys, but there's only two keys here just the left key and the
right key.
And Alice and the tag can figure out which one they share and then walk down
into the correct subtree, while an adversary who is listening in has no idea where
they are. Similarly, they can do the same thing, left subtree or right subtree, and
so on and so forth until they reach a leaf and they discover -- yes?
>>: What if the adversary is the neighbor at the bottom of the tree?
>> David Molnar: So the question is what if the adversary is a neighbor at the
bottom of the tree. And that's a question about tags sharing keys. For right now
I'm talking about an adversary that's a radio-only adversary, and I'll get to this
question in just a second.
So for a radio only adversary, the adversary can't tell anything about which tag it
is. And we get scaling logarithmic in the number of tags.
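Here is a rough sketch of that tree walk, in the same Python style as the previous sketch and self-contained. The node key derivation is invented for illustration; the point is only that the reader does two key trials per level, which is logarithmic in the number of tags.

```python
import hmac, hashlib, os

DEPTH = 20                                   # 2^20: about a million leaves

def prf(key, reader_nonce, tag_nonce):
    return hmac.new(key, reader_nonce + tag_nonce, hashlib.sha256).digest()

def node_key(master: bytes, path: tuple) -> bytes:
    # Illustrative: derive each tree node's key from a master secret and the
    # node's path of left/right (0/1) choices from the root.
    return hmac.new(master, bytes(path), hashlib.sha256).digest()

class TreeTag:
    def __init__(self, master: bytes, leaf_path: tuple):
        # The tag stores only the keys on its own root-to-leaf path.
        self.keys = [node_key(master, leaf_path[:i + 1]) for i in range(DEPTH)]

    def respond(self, level: int, reader_nonce: bytes):
        tag_nonce = os.urandom(4)
        return tag_nonce, prf(self.keys[level], reader_nonce, tag_nonce)

def identify(master: bytes, tag: TreeTag) -> tuple:
    # Two key trials per level: O(log n) work instead of O(n) over all tags.
    path = ()
    for level in range(DEPTH):
        reader_nonce = os.urandom(4)
        tag_nonce, response = tag.respond(level, reader_nonce)
        for bit in (0, 1):
            candidate = node_key(master, path + (bit,))
            if hmac.compare_digest(response, prf(candidate, reader_nonce, tag_nonce)):
                path += (bit,)
                break
    return path

master = os.urandom(16)
leaf_path = tuple(b & 1 for b in os.urandom(DEPTH))
assert identify(master, TreeTag(master, leaf_path)) == leaf_path
```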
Now, Stuart's question is: what if the adversary happens to be one of the
neighboring tags? Suppose you've broken into one of these tags and extracted
its secret keys? In that case, there's an interesting question about trading off
privacy for efficiency that other people have followed up on from my original
work.
In particular, you don't need a fully branching binary tree; you could actually
have a larger branching factor, even different branching factors at different parts
of the tree. There's a very nice paper at PETS 2006 which talks about how to
make the trade-off based on how many tags you think will be broken into and the
different efficiency metrics you might have, such as the amount of
communication or the amount of computation for the reader. Does that answer
your question, Stuart?
>>: Okay.
>> David Molnar: Other questions so far? So what I want to show you on this
slide is a simple comparison between the asymptotics and the actual concrete
numbers for a back-of-the-envelope implementation of the scheme for 2^20, or
about one million, tags.
So you can see here that it scales better than the sort of naive scheme of trying
everything, every tag in turn, but it also scales better than a scheme where you
try to do some precomputation ahead of time and treat it as sort of a key cracking
problem with a time-space trade-off. And here at the bottom we have, again,
concrete numbers for the reader time and reader space, and the space and
communication for the tag.
Just to give you some idea of how we could actually make this work in an actual
implementation. So the thing I want to leave you with from this part of the talk
and this particular project is that this is a story where we started by looking at
practice, at real deployments of RFID; we discovered there was a fundamental
problem, private authentication and scaling private authentication; and then we
needed to come up with a new algorithm to solve this problem, which I just
showed you.
I want to change directions now to talk about some of the work I did while I was
an intern here at Microsoft and that's in software security, looking for serious
bugs. Now, as I'm sure people here are all familiar with, these bugs are quite
common, and in fact if you take a look at this URL, you can go to the computer
emergency response team statistics and look at how many bugs there were in
2007 or how many bugs there were in April of your favorite year. Well, in 2007
there were 65,015 such bugs reported. All major vendors: Apple, Adobe,
Microsoft, many others.
And as you probably know, for each bug, writing a patch, queuing the patch, and
releasing it is very costly. So we'd like to figure out ways to have fewer of these
bugs and to mitigate them as early as possible.
The way I like to think about work in this general area is there's sort of a bug
cycle. We start out where we write a bug. We don't really mean to write a bug,
but we do. And then we find a bug or more likely someone else finds a bug. We
have the bug reported to us. We try to fix the bug and then we write another bug.
So there's a lot of work on how not to write bugs in the first place.
>>: [inaudible]. [laughter].
>> David Molnar: That's wonderful. Yeah. So there's a lot of work on how not to
write bugs in the first place. And if you can do that, you should, obviously. But
sometimes we don't have that luxury and sometimes, you know, we have legacy
code or we have other things that prevent us from using techniques to not write
bugs in the first place. So my work that I'm going to talk to you about has been
focused here on the finding bugs and reporting them part of the cycle. And that's
work I did here with Patrice and Michael Levin when I was an intern at Microsoft
Research, and which I've continued back at the University of California at
Berkeley.
So the jumping-off point for this work is a classical technique called fuzz testing,
whose story actually does begin on a dark and stormy night in the middle of
Madison, Wisconsin. Professor Bart Miller of the University of Wisconsin,
Madison, is dialing into the modem pool, and he notices that the line noise from
the storm is causing his utilities to crash. So he realizes that what nature can do
by chance, man can do by design. And he gets two of his graduate students to
write a line noise generator for his favorite UNIX utilities.
And it finds lots of bugs. So today one implementation of this basic idea is the
following: You pick a seed file. Where you get the seed file from is up to you.
You can get it from Bartlett's Familiar Quotations, you can generate it purely at
random, or you can use some other heuristic to generate it. Then you take
random bytes, here highlighted in red, change them to some other random bytes,
feed the result to your program and ask: does it crash or does it not crash? So
this is a very simple, very straightforward way of testing your program. Miller
himself refers to fuzz testing as a sticks-and-stones kind of testing, but it's
remarkably effective.
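As a sketch, the whole loop fits in a few lines of Python; the target command and file names here are placeholders, not any particular fuzzer's interface.

```python
import random
import subprocess
import sys

def mutate(seed: bytes, flips: int = 4) -> bytes:
    # Flip a few randomly chosen bytes to randomly chosen values.
    data = bytearray(seed)
    for _ in range(flips):
        data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

seed = open(sys.argv[1], "rb").read()
for i in range(100_000):                      # e.g. the 100,000 fuzzed files
    fuzzed = mutate(seed)
    open("input.fuzz", "wb").write(fuzzed)
    result = subprocess.run(["./target", "input.fuzz"])
    if result.returncode < 0:                 # killed by a signal: a crash
        open(f"crash-{i}.fuzz", "wb").write(fuzzed)
```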
So in the original paper they found that between a quarter and a third of all the
UNIX utilities they looked at crashed, depending upon the particular version of
UNIX. And of course the Microsoft Security Development Lifecycle now requires
a hundred thousand fuzzed files before releasing any software to the wild. And I
can give you many, many more anecdotal reports of fuzz testing's remarkable
effectiveness in finding high-value security bugs.
But the problem with this particular approach is that it doesn't handle unlikely
paths. Here's a small piece of code which I hope none of us will ever write for
real. It simply compares the very first character of the input to 'B' and crashes if
it's equal. And as you can see, if you're just randomly testing, you have a very
low chance of hitting this particular bug. So the fix is something called white box
fuzz testing, which I worked on here, and which combines this idea of fuzz
testing with the idea of dynamic test generation. So let me now tell you what
dynamic test generation is.
In dynamic test generation we trace the dynamic execution of your favorite
program and capture a symbolic path condition: predicates about the input that
have to be true to go down this path. We then pick a new path we want to
explore and come up with a new symbolic formula for that path. We have a
constraint solver which tries to solve this new path condition. If it can solve the
new path condition, we extract a new input from the result and run the program,
thereby expanding the coverage of the program that we've tested.
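To make the negate-and-solve step concrete, here is a toy sketch using the Python bindings of Z3 (one of the solvers mentioned later in this talk), on the single-branch example from the previous slide: the path condition observed on the input "good" is that the first byte is not 'B'; negating and solving it yields the crashing input.

```python
from z3 import BitVec, Solver, Not, sat

b0 = BitVec("input_byte_0", 8)           # symbolic first byte of the input
path_condition = [b0 != ord("B")]        # predicate observed on the trace of "good"

solver = Solver()
solver.add(Not(path_condition[-1]))      # negate the last (only) branch condition
if solver.check() == sat:
    new_byte = solver.model()[b0].as_long()
    print("new input starts with:", chr(new_byte))   # 'B' takes the crash path
```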
So this was originally developed in a pair of seminal papers, one by
Godefroid, Klarlund and Sen, another by Cadar and Engler,
and the idea of using symbolic execution goes back to static test
generation ideas as far back as King in 1976.
So let me now show you this idea in practice on this particular small program.
This is the program we had before. It simply compares the first character of the
input to 'B' and crashes if it's equal. Let's run it on the input "good" and see what
happens. Well, if we run on this particular input, we come up with a path
constraint that says the first letter of the input is not equal to 'B'. That's the one
predicate tested on the input to go down this particular path.
We want to come up with a new path through the program. There's really only
one logical choice: take the other direction of the if statement. So we negate
this path condition and then we feed it to our constraint solver in the corner; we
ask the constraint solver to find a new input that satisfies this constraint. The
constraint solver says, okay, I have one for you. Good. We run the program, and
then it crashes.
So we have now overcome the unlikely paths problem. Yes?
>>: Well I was just concerned about the constraint solver's ability to find ways to
get to all paths. Ultimately that is the [inaudible].
>> David Molnar: Right. So the question is, isn't this the halting problem, and
the answer is, yes, finding all paths is difficult. Which is why we end up in a
regime where, it turns out, empirically the constraint solver does very well on
many of the paths we want to solve for. Why it does very well is an interesting
research question, and that's something that I actually have an undergraduate
working on at the moment, who is trying to characterize some of the reasons why
that might be true. For example, in the tool I built at Berkeley, which uses the
constraint solver STP, 70 percent of the time we call the constraint solver it
returns in under a second with an answer.
And we're trying to figure out what it is about real programs that leads to that. So
Josh's question also touches on this other issue of scale, right? The first
generation tools were extremely exciting, but they looked at sort of smaller
programs. And the question now is, how can we scale them up to larger
programs? Maybe the constraint solver will fall over. Maybe we don't know how
to instrument large programs. There are some other questions that come up.
And I want to focus on the search strategy question. And I'm going to argue that
the depth first search approach which the early tools used doesn't let them scale
as much as we would like.
So here's a slightly more complicated program. And you can see here we've run
it on the input "good" again, and we've come up with these four predicates that it
generates for us as the path condition. And now the question is, okay, we have a
much larger search space than even the small program I just showed you. What
do we do to search?
Well, the initial choice, a very natural initial choice, is to just use depth first
search. We'll take the last condition, negate it, go down this path, and then
symbolically reexecute the entire program with the new input. And we continue
doing this over and over and over again. So we keep symbolically reexecuting
and then doing a depth first search.
Now, the reason this is particularly unfortunate in large programs is just the way
the economics work out. For a typical program that's very large, the time to
create a symbolic trace might be 25 minutes. There might be thousands of
predicates in that path condition, each one corresponding to a different if
statement or a different branch you could try to go down. And the time for each
branch to solve, generate a new test case, and check that new test case for a
crash is about a second.
So what we want to do is amortize the expensive work of generating these path
conditions over many new test cases. And the reason for that is each test case
is a bite at the apple. Each test case is an opportunity to find the new bug that
will justify all the hard work we've done so far. And so we had to come up with a
new search strategy that lets us do this. The answer is something called a
generational search. So in the generational search you see the initial path taken
by "good", and what happens is we generate each of the test cases we can get
to by flipping one of the predicates in the path condition. And so we get four test
cases in this example instead of just one. And the set of test cases we generate
here at the bottom, we call these generation one test cases. The seed file is
generation zero; we generate generation one test cases, and then of course we
can reexecute on any of those to get generation two test cases, generation three,
and so on and so forth.
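In sketch form, the expansion step looks like this; trace, solve, and negate stand in for the symbolic tracer, the constraint solver, and logical negation, since the point here is only the search strategy.

```python
def expand(path_condition, solve):
    # One expensive symbolic trace (minutes) is amortized over up to
    # len(path_condition) cheap solver queries (about a second each):
    # child i keeps predicates 1..i-1 and negates predicate i.
    children = []
    for i, predicate in enumerate(path_condition):
        query = path_condition[:i] + [negate(predicate)]
        new_input = solve(query)            # None if the query is unsatisfiable
        if new_input is not None:
            children.append(new_input)      # one generation-(k+1) test case
    return children

def negate(predicate):
    return ("not", predicate)               # placeholder for logical negation
```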
The overall search space for the program now looks like this. So let me tell you
what you're looking at here. On the bottom you see the actual test cases
generated by this technique. Above them you see a number which represents
the generation of that test case. And then the bomb represents test cases that
actually crash.
So after only three generations, three symbolic executions, you end up hitting
different test cases that crash, instead of, with the depth first search, having to
walk all the way from left to right through the search space. Yes?
>>: [inaudible] right so that's a big part of the overhead.
>> David Molnar: So the question is we're very trace dependent and isn't that a
big part of the overhead? And the answer is yes, we are trace dependent but we
manage to get lots of different test cases from each trace. And that's one of the
things that makes this search strategy better than the previous search strategies
that were employed. Does that answer your question?
>>: [inaudible] I'm just wondering -- I mean the [inaudible] so if you don't come
close to, you know, if the bad parts of the program are in cases where you
don't even have a good trace through them, you won't find them.
>> David Molnar: Right. So the question is, what happens if there's a part of the
program that we don't even have a trace nearby? And the answer I would say is
that that's where the art in picking the seed
files comes in, for example, or trying to figure out if there are parts of the program
you haven't covered yet that you need to direct the technique towards. Does that
answer your question?
So one of the things I worked on while I was here is this idea of active property
checking. The basic technique I just showed you looks at code coverage: we
came up with new paths that cover new parts of the code that haven't been
tested yet. But of course there are security bugs that don't actually show up in
the path condition. For example, you might observe that there is a buffer whose
index depends on the input in a way we can reason about, and we would like to
know if that index can ever be outside the bounds of the buffer. Or we might
know, because perhaps someone has given us a SAL annotation, that this
particular parameter should never be null for this particular function. We would
like to know if we can solve for an input that makes the parameter null. Or
maybe we have a division whose denominator depends on the particular input.
We want to see, can we solve for division by zero.
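A sketch of what such a bug-seeking query looks like to the solver, in the same toy Z3 style used above: alongside the constraints collected on the trace, we ask whether the input can push a tainted index past the end of a buffer. The bounds here are invented for illustration.

```python
from z3 import BitVec, Solver, ULT, Not, And, sat

idx = BitVec("idx", 32)                  # index value derived from the input
trace_constraints = [ULT(idx, 1000)]     # say the path only constrained idx < 1000

BUFFER_LEN = 64
solver = Solver()
solver.add(And(*trace_constraints))      # stay on the same path...
solver.add(Not(ULT(idx, BUFFER_LEN)))    # ...but seek an out-of-bounds index
if solver.check() == sat:
    print("out-of-bounds index reachable:", solver.model()[idx])
```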
So I worked on ways to check many properties simultaneously. And the way I
like to talk about this now is by dividing the solver queries into two types. One
type is coverage-seeking, where we're looking for new inputs to increase our
coverage of the state space and coverage of the program. The others are
bug-seeking, where we say, okay, I'm going to solve for a bug and see what
happens.
I want to talk to you about one particular type of bug that I think is a really great
fit for this technique and where I've had to develop some new methods to look
for this kind of bug. And that is integer overflow, integer underflow and other
integer bugs. These are bugs that come about because programmers believe
machine arithmetic is unbounded and works just like the arithmetic we all learned
in school. It doesn't. And what happens is that this leaves subtle bugs that can
really confound our traditional approaches to finding security flaws. So, for
example, one of the traditional things that we do is use static analysis. But
reasoning precisely about the values of integer variables is hard, which leads to
many false positives and leads to programmers turning off the tool.
There's a humorous quote from Linus Torvalds about GCC's attempt at finding
such bugs in 2001. He refers to it as so broken it's not even funny. Now, we've
made a lot of progress since then, but it's still a fundamental issue.
Another approach that we often use in security is to have a runtime monitor that
looks for unsafe behavior and then terminates the program if it looks like we're
doing something unsafe. But the problem here is that there are benign integer
overflows. Some code in cryptography, for instance, might use integer overflow
to do a very fast modular reduction.
If you terminate the program when that kind of overflow happens, you'll have a
very angry user. In contrast, we can use ideas about these bugs to direct our
search and only report the generated test cases that exhibit real, serious bugs.
So in slogan form: static analysis wastes the programmer's time, runtime
analysis wastes the user's time, but using these checks for dynamic test
generation, white box fuzzing, only wastes the tool's time. And at 10 cents an
hour for the tool's time, versus more than 10 cents an hour for my time, I know
which one I'd rather use.
So here is a particular kind of bug. This is a piece of code that you might write if
you were trying to bounds check this integer x before passing it to copy bytes.
Unfortunately for me, this particular integer is signed, and if x is equal to
negative one then it will pass the bounds check. But copy bytes has a prototype
which says the parameter is an unsigned integer. So when I pass negative one
into this particular function, it will be promoted to an unsigned integer and we're
going to copy far more than 800 bytes.
So the bug pattern here is that we're treating the same value as signed and then
as unsigned, or vice versa. The way I like to think about this is with a four point
lattice of types. Every value in the program starts out as unknown: we don't
know if it's signed or unsigned.
If we see it as an argument to a signed comparison or an unsigned comparison,
then we can give it a type. And if we see it as an argument to both, one after the
other, then we move it to the bottom value, which indicates a potential bug. The
way this interacts with the technique I've talked to you about so far is that if you
see a tainted program value, a program value you can reason about, that has
this type bottom, you can solve for an input that will make it equal negative one.
Why negative one? Because that will exhibit the difference between signed and
unsigned comparison. And so what I developed are methods to infer these types
in a memory-efficient way over very long traces. Because it turns out, as I'll show
in a few slides, the traces we deal with are several hundreds of millions of
instructions long.
So my first attempts at inferring these types automatically were not memory
efficient in the size of the trace, and so I ended up running out of memory on real
programs. So I developed a method which uses a very small amount of memory
by keeping track of only the live values at each point of the program that might
have different types.
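Here is a small sketch of that inference, with the lattice elements as plain Python values; a real implementation works over a binary trace and keeps this map only for live values.

```python
UNKNOWN, SIGNED, UNSIGNED, BOTTOM = "unknown", "signed", "unsigned", "bottom"

def meet(current, observed):
    # Four-point lattice: unknown refines to signed or unsigned;
    # seeing both pushes the value to bottom, a potential bug.
    if current in (UNKNOWN, observed):
        return observed
    return BOTTOM

types = {}                               # value id -> lattice element

def observe(value_id, usage):            # usage is SIGNED or UNSIGNED
    types[value_id] = meet(types.get(value_id, UNKNOWN), usage)
    if types[value_id] == BOTTOM:
        # Emit a bug-seeking query: solve for this value == -1 (0xFFFFFFFF),
        # which exhibits the signed/unsigned discrepancy.
        print(f"solve for {value_id} == -1")

observe("x", SIGNED)                     # if (x > 800) ...   signed bounds check
observe("x", UNSIGNED)                   # copy_bytes(buf, x)  unsigned parameter
```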
So if you put this all together, the architecture looks something like this. You
come up with an initial input, you check it for crashes, then you trace the program
on that particular input, you gather some constraints for your constraint solver,
you solve the constraints, you get a whole bunch of new inputs, and you move
on and repeat the cycle.
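Putting the pieces together, the cycle just described is roughly this loop, reusing expand from the generational-search sketch; run_and_check, symbolic_trace, and solve are placeholders for the crash checker, the instrumented tracer, and the constraint solver.

```python
def whitebox_fuzz(seed, run_and_check, symbolic_trace, solve):
    queue = [seed]                       # generation zero
    crashes = []
    while queue:
        test_input = queue.pop(0)
        if run_and_check(test_input):    # 1. check the input for crashes
            crashes.append(test_input)
            continue
        path_condition = symbolic_trace(test_input)   # 2. trace, gather constraints
        queue.extend(expand(path_condition, solve))   # 3. solve for new inputs
    return crashes                       # 4. ...and repeat the cycle
```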
So I've been privileged to be involved with two different systems that do this.
One is SAGE here at Microsoft, and the other one is called SmartFuzz, which is
what I wrote at the University of California, Berkeley. And I wanted to share with
you some of the initial experiences that we've had. So SAGE was originally
released internally in Microsoft in April 2007, and since then it has found dozens
of new security bugs which had been missed by black box fuzzing, static
analysis, and human code review; bugs that, if they had been found externally to
Microsoft, would have resulted in a security fix. And you can see here that there
have been a number of people who have worked on SAGE, showing the amount
of investment that Microsoft has put into it. I want to show you an example of
such a bug.
So this is a bug in animated icon parsing. Some of you may be familiar with this
bug already. On the left is the initial seed file that we fed to SAGE. On the right is
an example input generated by SAGE after seven hours and 36 minutes which
shows the bug in action: if you run the code on this particular input, it will crash in
the place that has the bug. This particular bug only exhibits if there are two
so-called ANIH records.
So on the left-hand side you can see we have highlighted the little LIST record
type, and on the right-hand side it's been changed to an ANIH, and you can see
there's another one up at the top.
So SAGE was able to figure out that, oh goodness, the code is looking for these
ANIH records, and it synthesized a new test case that has two ANIH records,
which was what was required to show the bug. And you can see that just
randomly testing, you would have about a one in 2^32 chance of coming up with
such a bug. But I wanted to show you something that I think is even more
interesting, and that's what happens when running SAGE on a particular media
file format, starting with 100 zero bytes. So this is our initial input, 100 zero
bytes. We then generated a bunch of different test cases, but one of those
different test cases had this RIFF at the top. So SAGE was able to figure out that
the program was comparing the first four bytes to RIFF and generated a new test
case that makes that particular condition true. And then that generates several
more test cases, one of which had the particular file type, and so on and so forth;
with each generation SAGE discovers more about the different parts of the
program that are looking at different parts of the input file.
And here, after ten generations, SAGE generates a new test case that actually
crashes the program. Now, what's interesting is that after only three generations
from a well-formed seed file, an actual playing media file, you end up with the
same sort of crash, which shows that even though SAGE was able to discover
the structure of the input, the choice of seed file still makes a big difference.
So for the rest of the talk I'm going to talk a little bit more about SmartFuzz,
which is the implementation of these ideas I've worked on at Berkeley. It's built
on the Valgrind framework for binary instrumentation, and it runs on Linux
programs. It uses the STP solver from Stanford, although we've now actually
generalized it and use Z3 and a couple of other solvers. And it's available on
SourceForge; you can download a virtual machine with it preinstalled if you want
to run it. It's on sourceforge.net. But the first thing I want to point out is that both
of these tools now scale to real programs, with millions of instructions per trace.
So these are two tables that show you the source lines of code, where
applicable, and the number of x86 instructions in a typical trace. You can see
here that both SmartFuzz and SAGE now handle code with hundreds of millions
of x86 instructions per trace, gathering the constraints into path conditions and
then solving those path conditions. I want to talk to you about some experiments
I did with SmartFuzz where I took six Linux programs and three seed files each,
and for each program and each seed file ran 24 hours with this tool and 24 hours
with zzuf, which is a black-box testing tool that's roughly comparable to the file
fuzzer used here at Microsoft.
And I used the Amazon Elastic Compute Cloud to run these, about 50 different
instances over a weekend. So Amazon is great because you can check out a
machine with two gigs of RAM for 10 cents an hour, or 7 gigs of RAM for 40
cents an hour. So you can really just summon lots of machines to do your
bidding and then put them back when you're done.
So you don't need a cluster anymore to do this kind of work. And an interesting
thing about this, of course, is that both of these techniques, the white box fuzzing
and the black box fuzzing, give you millions and millions of test cases. So you
have to figure out how to pan through all those test cases for the few flecks of
gold which exhibit bugs we didn't know about before, which are high value and
worth fixing. In other words, how do we find the right tests?
So for Linux my answer is to use Valgrind's memcheck. This is a tool that checks
the execution of your program for memory safety violations: things like memory
leaks, writing to memory you shouldn't be writing to, reading memory you
shouldn't be reading from, uses of uninitialized values, a whole passel of other
properties you probably don't want violated in your particular program. But you
can of course use your favorite bug oracle.
So if you like AppVerifier you can use that, or if you like any other such tool you
can use that. Yes?
>>: [inaudible] precision or --
>> David Molnar: Memcheck only has one level of precision, really, and it does
have a slowdown. So the question is what level of precision do I recommend
using memcheck with, and the answer is really only the one default, and that's
the one I've been using.
>>: [inaudible].
>> David Molnar: It's about 2 to 5X. Does that answer your question?
>>: [inaudible].
>> David Molnar: So the question is what does AppVerifier do, and the answer is
that AppVerifier has a plug-in architecture; you pick what to do. So in the
experiments I did when I was here, we looked at the electric fence sort of mode
of AppVerifier, where it puts a guard page either before or after a particular
memory object. That was the main way I used AppVerifier when I was here.
>>: [inaudible].
>> David Molnar: So the question is does Valgrind have more precision in its
checks than what I had at Microsoft. The answer is Valgrind is roughly
comparable to TruScan in what it's looking for and the precision it's looking for.
AppVerifier is less precise and looks for less. But on the other hand AppVerifier
has much less of a slowdown for the checking; AppVerifier is almost
unnoticeable, whereas this is a 2 to 5X slowdown. Does that answer your
question?
>>: Yes.
>> David Molnar: Yes?
>>: [inaudible] slowdown, so the traces you generate, especially if there's
non-determinism in the program, are you really checking paths that occur
in practice, or are you checking, you know, paths that you [inaudible]? I
mean it's still valuable to find bugs.
>> David Molnar: Right.
>>: But are you finding the bugs that are likely to occur in the wild?
>> David Molnar: So the question is do I find bugs that are likely to occur in the
wild, and the answer is, for the security testing regime, I don't care, because an
adversary, if he finds any of these bugs, is interested in them. But from a
reliability perspective, I don't think these are going to be good exemplars of bugs
that we'd find in the wild from random people using the software. Does that
answer your question?
So the nice -- the other nice thing about Valgrind memcheck is that people in the
open source world understand Valgrind; they have seen it before, it's been
around about five years, so when you report Valgrind test cases to people they
actually fix them, which is really nice.
So I was lucky enough to have nine undergraduates for eight weeks, as part of a
team of mentors, and we set up a project comparing both white box and black
box fuzzing. You can see them here. And you can't really read it, but their
T-shirt says: we found 1,124,452 test cases with at least one Valgrind error and
all we got was this lousy T-shirt.
Does that really mean we found a million bugs? Well, no. Okay? So there's this
problem of bug bucketing, which turned out to be a rather serious problem during
this summer work. The issue is there are many-to-many relations between test
cases and bugs. One particular test case, especially with something as precise
as TruScan or Valgrind memcheck, can actually exhibit multiple bugs, because
the first bug you see may not crash the execution of your program. At the same
time, it could be the case that you have one semantic bug which is exhibited by
multiple test cases.
And in our experience the developers get very angry with you if you report
duplicate bugs. And we were principally reporting to the MPlayer software
project, which is an open source media player that ships with major Linux
distributions. And the thing is, we were just posting bugs on their Bugzilla and
saying, hi, you don't know us, but you have an invalid write bug you might want
to look at.
And the thing is, there was this interesting sort of interaction where they would fix
these bugs but they still had no idea who we were or why we were doing this
until much later, at the conclusion of the project. So I wanted to talk to you about
how you actually bucket these bugs. So this is the first thing I tried. It's called a
stack hash. This is the actual GDB backtrace of a real test case that we
synthesized and sent to the MPlayer developers. In bold you see the actual
instruction pointers in the stack trace. And the first thing to try is, well, you know,
I have a hash function floating around; let me just put these in order and feed
them to the hash function. That's the bug bucket ID.
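In sketch form, with made-up instruction pointers:

```python
import hashlib

def stack_hash(instruction_pointers) -> str:
    # Bucket ID: hash of the ordered instruction pointers in the backtrace.
    data = b"".join(ip.to_bytes(8, "little") for ip in instruction_pointers)
    return hashlib.sha256(data).hexdigest()[:16]

bucket_id = stack_hash([0x080B1551, 0x08049D92, 0x08049F33])   # illustrative
```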
In essence I'm saying, up to the collision resistance of the hash function, two
bugs are the same if and only if they have the same stack trace. This turns out to
have a lot of problems actually, but the main one is that it doesn't work, and I
know that because we used it as our main approach for bug bucketing during the
summer work I mentioned, and we had about a 9 to 10 percent duplicate rate, as
in the developers looked at the bugs we reported and said, oh look, that's a
duplicate right there.
What was interesting here is we actually had the students concurrently reporting
things, too, so they really did rely on the stack hash as their duplicate detection
method. So what I do now, and I don't say this is the end-all [inaudible], but I'll
talk -- yes?
>>: [inaudible] reduce your bug count from a million to a hundred or is there
other stuff going on?
>> David Molnar: Oh, the students were choosing which bugs to -- so the
question is did the hash reduce the bug count from a million to a hundred. The
answer is no. The students picked which bugs to report. We had nine students.
We told each of them to report at least 10 bugs by the end of the summer, so we
ended up with 110 total. Some students were more industrious than others.
Does that answer your question?
>>: [inaudible].
>> Patrice Godefroid: So [inaudible].
>>: [inaudible] how you reduce from a million to 10 if you use undergraduate
type of work.
>> David Molnar: So the question is how do you reduce from a million different
bugs to the 10 that each undergraduate reports? Well, first of all, we actually
used the stack hash in order to say, okay, here are distinct buckets. So the
million is before any bucketing whatsoever. All right? The undergraduates got to
see the different buckets, which is much less than a million, like several orders of
magnitude less. I don't have the numbers in front of me, but it did significantly
reduce the number.
And the next question is, okay, did we give them criteria? Well, we told them to
look for invalid reads and invalid writes and to prefer reporting those, because
those bugs are more serious. But beyond that, we didn't give them any particular
criteria for reporting. Does that answer your question?
Okay.
So what I use now is what I call a fuzzy stack hash. And this is motivated by the
fact that for the kinds of programs and testing I was doing, I actually had access
to the source code and could compile in debug symbols.
So one problem I noticed when we were doing this work is that the code changed
quite often. MPlayer in particular at one point was updating about four times a
day while we were doing this work. The developers of the MPlayer project really,
really want you to report against the most recent version of the code, or they
reject your bug report out of hand. So we had to recompile MPlayer all the time.
And a small, tiny change to the source code can actually lead to a very big
change in the instruction pointers, and so what would happen is we would get
different-looking stack traces, because the instruction pointers would be different,
and then report duplicate bugs.
So the approach to get around that is to use the line of code and the file the
function is in instead of the instruction pointer. And then, to be robust against
small changes, look at all but the last digit of the line number. And if it's only a
single digit, you take just that digit.
Another issue that I observed is that you can have a buggy function that's called
in multiple contexts. Each context is a new call site and a new stack trace, but
they're really semantically the same bug underneath. So in order to address that,
the fuzzy stack hash I use now looks at just the top three frames of the stack
trace, not the full stack trace. Yes?
>>: [inaudible] looking at what the line actually has, the contents of the source
lines?
>> David Molnar: So the question is why did I take the sort of fuzzy approach to
the line numbers instead of looking at what the line actually contains. The
answer is, first of all, I didn't think about what you just said, looking at what the
line actually contains, and second of all, just coming up with an idea on the fly for
why one might prefer one over the other: this would also let you handle things
that are slightly different about the line. Like if there's a -- huh?
[laughter].
>> David Molnar: Oh, great. Thank you. Thanks for saving me time.
Appreciate it. So -- yes?
>>: Why didn't you use the relative offset [inaudible]?
>> David Molnar: Right. So the question is why not use the relative offset within
the method. Again, I didn't think about doing that. That would be --
>>: [inaudible] all kinds of other [inaudible].
>> David Molnar: Right. And in particular the relative offset from the beginning
of the method wouldn't require the same level of debug information as I have
here. Right.
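Putting those pieces together, here is a sketch of the fuzzy stack hash as just described; the frame data is illustrative.

```python
import hashlib

def fuzzy_stack_hash(frames) -> str:
    # frames: list of (file_name, line_number) pairs, innermost frame first.
    parts = []
    for file_name, line in frames[:3]:        # top three frames only
        fuzzy_line = line // 10               # drop the last digit of the line
        parts.append(f"{file_name}:{fuzzy_line}")
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

# Illustrative frames; a one-line edit that shifts line 1135 to 1137
# still lands in the same bucket.
bucket_id = fuzzy_stack_hash([("demux_avi.c", 402), ("mplayer.c", 1135), ("main.c", 88)])
```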
So as I said, I don't believe this is the end-all and be-all; it's a starting point for
future work. And what's nice is I actually have a data set now, if anyone wants it,
about which bugs were marked as duplicates and which weren't. So you could
imagine going back and redoing this with a different approach for bucketing.
So what I'm going to -- and then what we did in addition is build a sort of front
end for collecting all of these different bugs, called metafuzz.com. This is a live
website. You can go there right now, if you want to, and it shows you the fuzzy
stack hash bucket ID, the particular program that was tested, and a link to
download the test case. The students used this to browse through different bugs
that they might want to report to developers.
And developers used this to download test cases; we were able to provide links
to specific test cases that they could then use to try out the particular bug on
their own code and see if it reproduced. And we've been adding features for
reproducing against the most recent version of the software, and so on and so
forth.
So what we found in these experiments -- remember, I talked to you about the
setup: six programs, three seed files per program, zzuf random testing versus
SmartFuzz white box testing -- is that they find different bugs.
And the reason I talked about the fuzzy stack hash is that when I say bug here, I
mean a bucket under the fuzzy stack hash, okay? That's why I went through that.
So we have eight bugs that SmartFuzz found that weren't found by the other
technique, 31 that the random technique found that SmartFuzz didn't find, and 19
in the intersection, found by both.
So if you break it down, this table shows that we found bugs in five out of six test
programs. The numbers on the left are SmartFuzz, the numbers on the right are
zzuf, and in two out of the six we ended up finding more bugs with SmartFuzz
than with the random technique. At the bottom you have the cost per bug, billed
at the Amazon EC2 rates as of February 2009.
So what it says to me is, of course you want to try using both techniques,
because you don't know ahead of time whether your program is one where
SmartFuzz would be better or zzuf would be better. And in particular, on gzip,
which is a decompression utility that's used pretty widely, SmartFuzz found two
bugs and zzuf found nothing. So I'm still trying to figure out exactly why those
bugs are the way they are. It's a very optimized code base. It's very fun to read.
But this tells us about the relative [inaudible] of these two different types of
bugs.
The other thing I want to point out: remember I talked about bug-seeking queries,
where we solve for particular kinds of properties that might correlate with a bug.
So this is showing us, over all the runs we had, how many queries there were of
particular types, and how many of them succeeded. This gets back to the
question about solving the queries; it tells you roughly how many succeeded. We
have a timeout of about five seconds for each query. And then, of those, how
many bug buckets came from each particular type of query. So you can see here
we actually were successful in using them to find bugs, and the signed/unsigned
property was the most successful out of all of our bug-seeking queries.
Yes?
>>: [inaudible] these three bug categories [inaudible] these are not necessarily
bugs, or are they?
>> David Molnar: They are not. So Patrice's question is that these are not
necessarily bugs, either overflow or underflow, as I argued earlier. What this
table is showing you is: we created a new test case from a property whose name
is here; we're trying to force an underflow, force an overflow, or force a
signed/unsigned conversion. And then Valgrind memcheck told us the resulting
test case had an error, and then we applied the fuzzy stack hash I discussed
earlier.
>>: [inaudible].
>> David Molnar: It was something else. So it's an invalid read, an invalid
write, something like that. So again, you can see that these techniques work,
and that the signed/unsigned queries found the largest number of bug buckets out
of all the techniques we tried.
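As an illustration of what a signed/unsigned bug-seeking query looks like to a
solver, here is a sketch using Z3's Python bindings. The real tool recovers its
constraints from the x86 trace and uses a different solver, so the variable name
and the toy path condition below are assumptions made for the example.

    from z3 import BitVec, Solver, UGT, sat

    # A 32-bit value derived from the input, observed on the trace.
    n = BitVec("n", 32)

    s = Solver()
    # Path condition from the concrete run: the program did a *signed*
    # comparison of n against a buffer size (think `if (n < 64)` in C).
    s.add(n < 64)
    # Bug-seeking query: the same bits, viewed as *unsigned* (say, when
    # passed as memcpy's size_t argument), exceed the buffer. Negative
    # signed values become huge unsigned ones.
    s.add(UGT(n, 64))

    if s.check() == sat:
        print("candidate bug-triggering value:", s.model()[n])

Any negative value of n satisfies both constraints, which is exactly the kind of
input that then gets run under memcheck to see whether a real error results.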
So the way I see this is, this is a story where we had a beautiful theory which
showed us the way to a new method of testing software. And the work that we've
been able to do takes this beautiful theory and puts it into practice with
scalable, practical tools that have made a real difference in the way that people
test software. So going forward -- yes?
>>: So a question about the methodology: the traces you start with, do you
require those to be collected from the beginning of the program, when it gets the
input, or can I hook up something in the middle of a program to trace and then
use your techniques?
>> David Molnar: So the question is, can I start collecting in the middle of a
program execution and start symbolically executing? The answer is not with my
current infrastructure. My infrastructure uses Valgrind, and Valgrind requires
loading the program. The reason for that is that Valgrind actually takes over the
duties of the program loader. Valgrind sets up, in the same address space as
your guest program, room for its own host code, which does the dynamic binary
translation and then the recompilation of x86. And attaching to a running
process is not supported at this time.
>>: [inaudible] just finding [inaudible] bugs in the startup of programs because
you know, like it takes a while to generate the [inaudible] trace and the
[inaudible].
>> David Molnar: So the question is, will I only find bugs in the startup of
programs? The answer is I don't believe so, because we only look at the
constraints from the parts of the program that depend on the initial input, so
different inputs that exercise different parts of the program will lead to
different constraints and then different bugs. Think of a media player program:
MPlayer supports like 15, 20 different file formats. The bugs we find from the
MP3 playing are different from what you find from the WAV playing or what you
find from the AVI playing.
>>: [inaudible] some overflow bug where the program really needs to run for a
length of time to --
>> David Molnar: Right.
>>: Probably not going to find those [inaudible].
>> David Molnar: That's true. And so that's why I've been talking to people
about checkpointing approaches for trying to save the state of the initial part
of the program. For instance, I was just speaking with Gene Cooperman at
Northeastern University; he has a checkpointing approach he thinks would help
with this. Yes?
>>: There's a very large number of possible 16-bit conversion errors there.
Where do they come from? [inaudible] 16 bit [inaudible] 16 bit data [inaudible].
>> David Molnar: So the question is, where do UNIX programs get 16-bit data
types? I don't have a great answer for you, because I haven't gone back and
traced all those back to the original source code, but my gut feeling is that
MPlayer might have them in the particular file types we're looking at. Yes?
>>: [inaudible].
>> David Molnar: Right. So the question is contrasting with KLEE. So first,
there's a difference in focus. KLEE is focused on very high code coverage, but
they're looking at smaller programs. So they do things like keep all the program
states in memory, and then they fork to try and go down different code paths.
Whereas I actually do just a single trace at a time, and I try to focus on larger
traces.
So I look forward to a day where we can take some of the things they're doing
and some of the things I'm doing and do really high code coverage of really large
programs. That's their main difference: their focus is on getting 99 percent
coverage of grep, and my main focus is on finding bugs in MPlayer. Beyond that
there are lots of other smaller differences that we could go into, according to
my understanding of KLEE, or I could tell you more about the internals of this,
but I think that might be best done after the talk. Does that answer your
question?
Okay. So going forward, we now know how to create more bugs than anyone will
ever be able to fix unassisted. What do we do to fix these more effectively? And
stepping back a bit, we now have this amazing ability to check out hundreds of
machines and use lots and lots of information about the code base we're working
on. How can we use that to help a programmer write better code, more
interactively and more immediately than our current techniques? That's one of
the directions I'm really interested in: this part of the cycle where we report
and fix.
For example, there have been techniques developed here at Microsoft and other
places for worm defense which try to synthesize patches immediately, given an
example of a worm that exploits a particular piece of code. Well, how does that
change if we go into this regime where we have a lot of time to create a patch,
but the quality has to be higher because human beings have to maintain and
understand it? So that's an example of a direction I'm interested in going in in
the future.
So with that, I thank you for your time, and I welcome your further questions.
[applause].
>> David Molnar: Yes?
>>: [inaudible] and tell the tools to re-emphasize code that may be related to
the bug [inaudible].
>> David Molnar: So the question is, after I fix a bug, can I tell the tools to
emphasize a particular piece of code? My work does not currently allow you to
say this particular function is interesting, but you can rerun the tool with the
test case it generated earlier, expand around that particular path, and look for
bugs in the patch to the original problem. So yes, I can do that, but no, I
can't do what I would really like to do, which would be to say this particular
line of code needs to be exercised quite a bit. Does that answer your question?
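That "expand around the path" step is the standard concolic move of negating one
branch at a time while keeping the prefix. A minimal sketch in Python with Z3,
assuming the recorded path is already available as a plain list of solver
constraints, might look like this:

    from z3 import Solver, Not, sat

    def expand_around(path_constraints):
        # For each branch on the recorded path, keep the prefix of
        # earlier constraints, negate this one branch, and solve; each
        # model is a new input steering execution down the other side.
        models = []
        s = Solver()
        for c in path_constraints:
            s.push()
            s.add(Not(c))
            if s.check() == sat:
                models.append(s.model())
            s.pop()
            s.add(c)  # re-assert the branch so the prefix stays intact
        return models

Each returned model is turned back into a concrete test case, which is what lets
a rerun explore the code surrounding the original path.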
>>: [inaudible] how do techniques like this one scale to more interactive
programs or networking protocols [inaudible]? When you specify test cases, is
the program taking one input and running with it from start to end, or can you
also do: supply an input, the program does something, supply another input, the
program does something [inaudible] kind of testing as well?
>> David Molnar: Right. So the question is, what about interactive programs or
protocols? The work that I showed you doesn't handle that. There has been other
work done here at Microsoft that does look at that, and I'm talking to some
people in the networking group at Berkeley about how to look at that particular
issue. In principle, what you need is a proxy that lets you replay protocol
dialogues and sort of close the entire environment, so you can treat the input
to the whole program as the entire dialogue between one server and one client,
or what have you. In principle it's possible; it just has to be done, and there
are some different questions that come up about how to search the state space.
Does that answer your question? All right.
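A minimal sketch of that recording-proxy idea in Python: forward bytes between
one client and one server and log both directions, so the whole dialogue can
later be treated as a single replayable input. The ports, the framing, and the
single-connection shape are illustrative assumptions; a real tool would also
need deterministic replay of the server side.

    import socket
    import threading

    def record_dialogue(listen_port, server_host, server_port, log):
        # Accept one client, connect to the real server, and pump bytes
        # both ways, appending (direction, data) pairs to `log`.
        lsock = socket.socket()
        lsock.bind(("", listen_port))
        lsock.listen(1)
        client, _ = lsock.accept()
        server = socket.create_connection((server_host, server_port))

        def pump(src, dst, tag):
            while True:
                data = src.recv(4096)
                if not data:
                    break
                log.append((tag, data))
                dst.sendall(data)

        t = threading.Thread(target=pump, args=(server, client, "s->c"))
        t.start()
        pump(client, server, "c->s")
        t.join()
        for sock in (client, server, lsock):
            sock.close()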
Thank you for your time, everyone.
[applause]