>> Patrice Godefroid: Hello. Hi, everyone. It's my great pleasure to introduce David Molnar, who is going to talk about RFID security and security testing. So, please, David. >> David Molnar: All right. Thank you very much, Patrice. It is my pleasure today to talk to you about two areas of computer security that exemplify the different range of approaches we need to make progress in the field. So the first thing I want to talk to you about is RFID, and to do that, of course, I have to tell you what it is. RFID is a term that refers to a range of technologies where a small computer with an antenna, called a tag, is attached to a person, an item, or a collection of items. And then a different computer, called the reader, is able to talk to this little tag and ask it for the information it carries. In many cases, including the cases I'll be talking about in this talk, the tag's power comes strictly from the radio waves sent by the reader. There's no battery to change, nothing to wear out, and this is what makes it very attractive for the identification space. So the interesting thing about RFID is that the applications are everywhere; the slide just shows a few of them. You can see here on the top we have FasTrak, California's automatic toll payment system. You see credit cards on the top right, where the card sends your credit card number wirelessly over the air as it comes near a reader. On the bottom right you have library books with RFID tags on them, which I will talk about. And of course your driver's licenses here in Washington now have the option of the so-called enhanced driver's license, which includes an RFID tag. And even United States passports now include these devices, which I will also talk about. But one of the key questions with these devices is: what are the security and privacy implications of using them? [Video played:] >>: [Inaudible] or is this her? The card reader that restricts access to the state capitol says it's this gentleman. >>: I was Ms. Pafly [phonetic]. >>: Jonathan Westhues is really a security consultant hired by State Senator Joe Simitian. >>: I [inaudible]. >>: With a home-made antenna and a laptop, Westhues was able to read radio waves emitted by Pafly's identity card, duplicate it, and then gain access to a secured area of the state capitol. >>: All that was done within a moment of time without me even being aware of it. >>: Simitian's experiment illustrated just how easy it is for a hacker to read those Radio Frequency Identification cards, or RFIDs, from a few feet away. >>: If you can read someone's information and then literally in a matter of seconds clone their card and pass yourself off as them, imagine the mischief that people can do. >> David Molnar: So what this shows is that people use these devices believing that they have certain security properties when in fact they don't. And anyone with some basic electrical engineering knowledge can come along and build something that defeats the security of the system. One of the things that makes it difficult to design around these security issues is that, as I said, RFID is a range of technologies. On the one hand you have things like the chips which are used in US passports. They carry a lot of information, up to and including a digital photograph of you, and they can do a lot of computation, up to and including RSA signatures.
On the other hand, you have devices originally designed for Wal-Mart for tracking tubes of toothpaste, which are now in the enhanced driver's license and the so-called PASS card, and which are nothing more than a sort of long-range readable bar code. They can be read at five to ten meters but don't really have much in the way of computation. So what I did as part of my work was take a look at some real deployments to understand the security and privacy issues and ask how we can come up with new methods of addressing these problems. The first one I want to talk to you about is library RFID. The San Francisco and Berkeley public libraries were considering spending millions of dollars on RFID tags for every single book in the library. A dollar per book, a million books, a million dollars of public funds. The payoff is that these tags enable easy self checkout: you can walk up to a kiosk, wave the book over it, and it automatically figures out who you are and what the book is and checks it out to you. Or, as you can see in the picture, when you walk out the library door you walk through these gates, the gates read all the books in your backpack, and if you haven't checked one of them out, it reminds you that perhaps you need to do that. So there were actually huge public concerns, people coming to open meetings and to library trustee meetings asking: what is on these tags, who can read them, and from how far? I was able to serve as a pro bono consultant for the Berkeley Public Library; I volunteered my expertise as a computer scientist to help them understand these issues. In return they actually let me ask vendors, so what do you really do, tell me. And the vendors had to answer. It was great. [laughter]. >> David Molnar: So one of the things I discovered was that some vendors actually advocated putting the title and the author of the book on the tag. And what's more, these tags could be read by anyone with a compatible reader. That leads to an immediate privacy concern: someone could scan you as you walk through these gates, or anywhere else, and learn exactly what you're reading. And it turns out there's a long history of people being very concerned about what you read, and of people wanting not to reveal that information to just anyone who might be passing by. But then we looked a little bit deeper and said, okay, what if you get rid of the title and author? Well, a lot of different places also suggested putting the library bar code on the RFID tag. And this is a static unique identifier that never changes. One of the things we discovered was an attack we call hotlisting. Suppose you're interested in a specific book in the library. For example, the FBI circulated a memo to its agents in late 2003 saying: be on the lookout for people with almanacs, because we believe they are using almanacs to plan an attack on the United States. [laughter]. You can't make this up. I can go to a library that uses RFID, like the Cesar Chavez branch of the Oakland Public Library, look at the tags, and discover the bar code for the almanacs. And now, because it's a static unique identifier that never changes, every time I see that particular bar code in the future I know it's from Oakland, and I know it's the Koran or an almanac or what have you.
So one of the things we did then was say, okay, let's work to figure out whether there's some new cryptography or some other methods we could use to change these IDs every time. And then we ran into a systems problem: the collision avoidance protocols used by these ISO 15693 tags are based on unique tag identifiers at the radio layer, and by looking at the collision avoidance behavior you could just find out which tag it was, even if at the data level you were rewriting the identifier every time. So our contribution there was to recognize the full, vertically integrated systems problem. I then moved on to electronic passports. So the proposed -- yes? >>: [inaudible] for that or -- >> David Molnar: I'll talk about some cryptography we came up with as a partial solution for that. And since then, there have been new air interface protocols which don't have the same problems that we talked about. Does that answer your question? So the next thing I want to talk to you about is passports. Here, remember, I said that passports are on the other end of the spectrum of computation. The digital passport includes your name, your passport number, a photograph of you, and your nationality. So if you take a look at your passport, you have all this information here on the front cover page, like your picture, name, passport number, and so on and so forth. All of that is on the chip inside the passport. For instance, I can tell by looking at the small symbol on my passport that it actually has the chip inside of it. The original deployment choices in the United States included no encryption and no authentication of the chip. So anyone with a compatible reader could go up to your passport and say, excuse me, I'd like to know who you are, and the passport would send all of this over to the reader. Yes? >>: Is there a [inaudible] Faraday cage in the cover? Does it or does it not? Because a closed passport is a fairly good [inaudible] if it's made of metal. >> David Molnar: So the question is whether there is a Faraday cage in the cover. And the answer is there is now; there was not in the original deployment choices. I'm talking about 2005, when we started looking at this. The United States State Department published what they called a concept of operations where they said, this is what we're going to do, and a Faraday cage was not part of that concept. Does that answer your question? So the key problems with these original deployment choices: well, first of all, there's a privacy issue. Someone comes up and reads your passport, and they know who you are, even if you didn't want to tell them. But there's actually a more serious issue. The more serious issue is that nothing checked whether the information was coming from a real passport or from a small home-made device like the one you saw Jonathan Westhues make. What this means is that if I scan your passport, I can pretend that I'm your passport to anything that makes automated use of the passport. For example, Australia trialed gates where you would swipe your passport and then face recognition would figure out whether you're allowed in the country or not. So I notice here you have these nice gates at the entrance to Microsoft Research. >>: [inaudible].
>> David Molnar: So imagine coming back to your home country, swiping your passport, and then having a camera look at your face and at the picture your passport presents and say, oh, is this the right person or not? Because the original choices didn't actually check whether the information was coming from a real passport or not, I could try to use your passport. And the reason this is important is because it lets me try to imitate your face at home, instead of at the border. I can go scan someone who looks somewhat like me and then figure out, at home, the correct Polaroid to hold up to the camera, or the right facial prosthetics, or what have you. But the point is that because the biometrics are coming from the passport and it didn't authenticate the chip -- yes? >>: So you're saying there was some kind of cryptographic signature that paired the identifying information with the face, so you couldn't just replace the picture and then show up and it looks like you? >> David Molnar: Yes. So the question is, was there a cryptographic signature that made sure I couldn't just replace the identifying information directly? And the answer is yes, there's something they call passive authentication, where the issuing nation signs the information. We've recently learned that no one checks those signatures. But that was not part of my work. And then the final thing, of course, is that this reveals the nationality, so it's a US citizen detector, which might be bad. So what we did was we wrote this down. We submitted a formal comment to the State Department with the Electronic Frontier Foundation; 2,400 other people sent in negative comments about these choices, and there was public outcry, demonstrations, what have you. So the State Department ended up adding encryption to e-passports. Now, whenever we have encryption, of course, a key question is where you get the key used for the encryption. And the answer is that the key is actually derived from information that's right here on the front cover page of the passport. When you go to the airport, as I did at San Francisco International before coming here, you either swipe your passport or have it scanned, and then the machine derives this key, authenticates to the passport, and obtains the information. This works for the passport workflow, where you're already handing your passport to somebody. But it doesn't work for other RFID applications, where the whole point of radio frequency and remote reads is to not touch things. Yes? >>: It does work for the passport, but now that you have to expose it over the [inaudible], what does the RFID do that a [inaudible] bar code would not do? >>: Yeah. [inaudible] read the information. >> David Molnar: So the question is, what does the RFID do that a 2D bar code does not do? And the answer is that in the particular choices they ended up with, there's no difference. But there is an optional feature, not yet implemented, where the chip would do a challenge-response protocol to prove it was an actual chip made in an actual passport facility. This is called active authentication. It is an optional feature, and it has not been implemented in current US passports to my knowledge. Does that answer your question? Yes? >>: It doesn't answer my question. What does that [inaudible]. >> David Molnar: Well, the active authentication would actually authenticate the specific chip, not just the information on it. >>: Does part of the signature include the chip ID?
>> David Molnar: Well, how would you authenticate that a particular chip has that specific ID? >>: Okay. >> David Molnar: Does that answer your question? Okay. >>: [inaudible]. >> David Molnar: Yeah, it's a cloning issue. So the question was, why would a challenge-response do something different from just signing the information on the passport? And the answer I gave is that it's about trying to prevent cloning of the actual chip involved. But the point I'm trying to make here is that we found a solution for this problem that integrates with the workflow of actual passports; but what are we going to do for other RFID deployments? And if you think about it, there's an underlying theoretical problem here, a problem I like to call scalable private authentication. So here you see Alice talking to one of two different RFID devices, and you see an adversary listening in on them. And I'm going to show you a protocol which solves the problem of private authentication: that is, Alice should be able to authenticate the tag and vice versa, without the adversary learning which tag is being read by Alice. I want to talk about solutions for a subclass of all the tags we have, the ones where we can do some cryptography. So the solutions I'll talk about will cover some of the RFID devices in existence, but not all. And I'll argue there are still some interesting problems to solve even in this class of devices, where we're allowed to share keys between Alice and the tags and where we're allowed to do a little bit of cryptography. So the first protocol I want to show you is one where Alice and the tag share a secret key. Alice begins by creating a random nonce, a 32-bit value that never repeats, and sends it to the tag; the tag creates its own nonce and evaluates a pseudorandom function on the shared key and the two nonces. And just to briefly argue why this gives us some nice properties: an adversary who doesn't know the shared key won't be able to predict the proper value to respond with as the tag, and so won't be able to impersonate a particular tag. By the same token, because this is a pseudorandom function, without knowing the secret keys of different tags, the adversary won't be able to figure out whether a particular response came from tag A or tag B.
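To make this concrete, here is a minimal sketch of that challenge-response protocol in C. The prf() below is a toy mixing function standing in for a real pseudorandom function such as AES or HMAC, and it is not cryptographically secure; the key sizes and nonce handling are illustrative assumptions, not the exact scheme from the paper.

    /* Sketch of shared-key private authentication: reader sends a nonce,
     * tag replies with its own nonce and PRF(key, nonce_r, nonce_t). */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in PRF: maps (key, reader nonce, tag nonce) to 64 bits. */
    static uint64_t prf(uint64_t key, uint32_t nr, uint32_t nt) {
        uint64_t x = key ^ (((uint64_t)nr << 32) | nt);
        for (int i = 0; i < 4; i++) {   /* a few rounds of mixing */
            x ^= x >> 33;
            x *= 0xff51afd7ed558ccdULL;
            x ^= x >> 29;
        }
        return x;
    }

    /* Tag side: answer the reader's nonce with a fresh nonce and a PRF value. */
    static void tag_respond(uint64_t key, uint32_t nr, uint32_t *nt, uint64_t *resp) {
        *nt = (uint32_t)rand();         /* the tag's 32-bit nonce */
        *resp = prf(key, nr, *nt);
    }

    /* Reader side: figure out which of n shared keys the tag holds.
     * Note the linear scan over every key -- this is the scaling problem. */
    static int reader_identify(const uint64_t *keys, int n,
                               uint32_t nr, uint32_t nt, uint64_t resp) {
        for (int i = 0; i < n; i++)
            if (prf(keys[i], nr, nt) == resp)
                return i;               /* tag i authenticated */
        return -1;                      /* no known tag matched */
    }

    int main(void) {
        uint64_t keys[3] = { 0x1111, 0x2222, 0x3333 };  /* one key per tag */
        uint32_t nr = (uint32_t)rand();                 /* reader's challenge */
        uint32_t nt;
        uint64_t resp;
        tag_respond(keys[2], nr, &nt, &resp);           /* tag 2 responds */
        printf("reader identified tag %d\n",
               reader_identify(keys, 3, nr, nt, resp));
        return 0;
    }

An eavesdropper sees only two nonces and a PRF output, so without a key it can neither impersonate a tag nor link responses to tags; but notice the reader's linear scan over every key it knows.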
And the only problem here is now scaling. So in this picture, Alice and the tag share a key, and they both know they share a key. What happens when a new tag shows up in Alice's field of vision that she's never seen before? Well, the naive thing to do is for Alice to try a different key for each tag it could possibly be, until she succeeds. So that scales linearly in the number of tags. And the question is, can we do better? So let me show you an attempt to do better that doesn't actually work. The attempt is this: let's give every key a unique identifier, have the tag simply say "I'm using key number five," and then Alice has no problem with scaling; she just looks up key number five in some hash table and says, okay, this is the key I want to use. Great. And we run the protocol, or any other authenticated key exchange, or any other protocol that you happen to like. But the problem with this is that it breaks our privacy guarantee. We're back to a unique static identifier for every tag. And the question now is, what can we do? If the tag sends anything that's correlated with its identity, it seems like the adversary will get an idea of which tag it is, which breaks our privacy guarantee. But if it doesn't, then Alice doesn't know which key to use, which brings back her scaling problem. So my solution to this is to use a so-called tree of secrets. Alice knows every single secret key in this tree. Each tag is associated with a leaf in the tree and knows the secrets on the path from the root to its leaf. So in this particular picture you have a tag on a library book, and it knows the right secret, then the left secret, then the left secret. And what we do in order to authenticate and identify a particular tag is the following: we start out using the protocol I showed you and ask, am I in the left subtree or the right subtree? Which of these two keys do I share? So we can use the protocol I showed you, which scales linearly in the number of possible keys, but there are only two keys here, just the left key and the right key. And Alice and the tag can figure out which one they share and then walk down into that subtree. Well, an adversary who is listening in has no idea where they are. Similarly, they can do the same thing, left subtree or right subtree, and so on and so forth until they reach a leaf and they discover -- yes? >>: What if the adversary is the neighbor at the bottom of the tree? >> David Molnar: So the question is, what if the adversary is a neighbor at the bottom of the tree? And that's a question about tags sharing keys. For right now I'm talking about an adversary that's a radio-only adversary, and I'll get to this question in just a second. So for a radio-only adversary, the adversary can't tell anything about which tag it is, and we get scaling logarithmic in the number of tags. Now, Stuart's question is, what if the adversary happens to be one of the neighboring tags? Suppose you've broken into one of these tags and extracted its secret keys. In that case, there's an interesting question about trading off privacy for efficiency that other people have followed up on my original work and looked at. In particular, you don't need a fully branching binary tree; you could actually have a larger branching factor, even different branching factors at different parts of the tree. There's a very nice paper at PETS 2006 which talks about how to make the trade-off based on how many tags you think will be broken into and on different efficiency metrics you might have, such as the amount of communication or the amount of computation for the reader. Does that answer your question, Stuart? >>: Okay. >> David Molnar: Other questions so far? So what I want to show you on this slide is a simple comparison between the asymptotics and the actual concrete numbers for a back-of-the-envelope implementation of the scheme for 2^20, or about one million, tags. So you can see here that it scales better than the naive scheme of trying every tag in turn, but it also scales better than a scheme where you try to do some precomputation ahead of time and treat it as sort of a key cracking problem with a time-space trade-off. And here at the bottom we have concrete numbers for the reader time, reader space, and communication for the tag and for the reader, just to give you some idea of how we could actually make this work in an actual implementation.
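Here is a minimal sketch of the tree walk, reusing the toy prf() from the previous sketch. The depth, the derivation of node keys from a single master secret, and the data layout are illustrative assumptions, not the exact scheme from the paper.

    /* Sketch of the tree-of-secrets walk: one two-key protocol round per
     * level, so the reader does about 2*log(n) PRF calls instead of n. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define DEPTH 20                    /* 2^20 leaves, about a million tags */

    static uint64_t prf(uint64_t key, uint32_t nr, uint32_t nt) {
        uint64_t x = key ^ (((uint64_t)nr << 32) | nt);
        for (int i = 0; i < 4; i++) {
            x ^= x >> 33;
            x *= 0xff51afd7ed558ccdULL;
            x ^= x >> 29;
        }
        return x;
    }

    /* The reader derives any node key on demand from a master secret; a tag
     * would be provisioned with just the DEPTH keys on its root-to-leaf path. */
    static uint64_t node_key(uint64_t master, int level, uint32_t index) {
        return prf(master, (uint32_t)level, index);
    }

    int main(void) {
        uint64_t master = 0xdeadbeefcafef00dULL;
        uint32_t leaf = 713705;         /* the tag's leaf index */
        uint32_t node = 0;              /* reader's current tree position */

        /* Walk down the tree: left child or right child at each level? */
        for (int level = 0; level < DEPTH; level++) {
            int bit = (leaf >> (DEPTH - 1 - level)) & 1;    /* tag's branch */
            uint64_t tag_key = node_key(master, level + 1, node * 2 + bit);

            uint32_t nr = (uint32_t)rand(), nt = (uint32_t)rand();
            uint64_t resp = prf(tag_key, nr, nt);           /* tag's reply */

            /* The reader tries only the two child keys, never all 2^20. */
            if (prf(node_key(master, level + 1, node * 2), nr, nt) == resp)
                node = node * 2;        /* left child matched */
            else
                node = node * 2 + 1;    /* must be the right child */
        }
        printf("reader identified leaf %u\n", node);        /* prints 713705 */
        return 0;
    }

Each round looks to the eavesdropper like an independent run of the two-key protocol, which is why a radio-only adversary learns nothing about where in the tree the walk is going.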
So the thing I want to leave you with from this part of the talk and this particular project is that this is a story where we started by looking at practice, at real deployments of RFID; we discovered there was a fundamental problem, scalable private authentication; and then we needed to come up with a new algorithm to solve this problem, which I just showed you. I want to change directions now to talk about some of the work I did while I was an intern here at Microsoft, and that's in software security, looking for serious bugs. Now, as I'm sure people here are all familiar with, these bugs are quite common, and in fact if you take a look at this URL, you can go to the Computer Emergency Response Team statistics and look at how many bugs were reported in 2007, or how many bugs were reported in April of your favorite year. Well, in 2007 there were 65,015 such bugs reported, across all major vendors: Apple, Adobe, Microsoft, many others. And as you probably know, for each bug, writing a patch, queuing a patch, and releasing it is very costly. So we'd like to figure out ways to have fewer of these bugs and to mitigate these bugs as early as possible. The way I like to think about work in this general area is that there's sort of a bug cycle. We start out where we write a bug. We don't really mean to write a bug, but we do. And then we find the bug, or more likely someone else finds the bug and reports it to us. We try to fix the bug, and then we write another bug. So there's a lot of work on how not to write bugs in the first place. >>: [inaudible]. [laughter]. >> David Molnar: That's wonderful. Yeah. So there's a lot of work on how not to write bugs in the first place. And if you can do that, you should, obviously. But sometimes we don't have that luxury; sometimes we have legacy code or other constraints that prevent us from using techniques to not write bugs in the first place. So the work I'm going to talk to you about has been focused on the finding-bugs-and-reporting-them part of the cycle. That's work I did here with Patrice and Michael Levin when I was an intern at Microsoft Research, and I've continued it back at the University of California, Berkeley. So the jumping-off point for this work is a classical technique called fuzz testing, whose story actually does begin on a dark and stormy night in the middle of Madison, Wisconsin. Professor Bart Miller of the University of Wisconsin, Madison is dialing into the modem pool, and he notices that the line noise from the storm is causing his utilities to crash. So he realizes that what nature can do by chance, man can do by design. And he gets two of his graduate students to write a line noise generator for his favorite UNIX utilities. And it finds lots of bugs. So today, one implementation of this basic idea is the following. You pick a seed file; where you get the seed file from is up to you. You can get it from Bartlett's Familiar Quotations, you can generate it purely at random, or you can use some other heuristic. Then you take random bytes, here highlighted in red, change them to some other random bytes, feed the result to your program and ask: does it crash or does it not crash? So this is a very simple, very straightforward way of testing your program. Miller himself refers to fuzz testing as "sticks and stones" testing, but it's remarkably effective. In the original paper they found that between a quarter and a third of all the UNIX utilities they tested crashed, depending upon the particular version of UNIX they looked at.
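Here is a minimal mutation fuzzer in that style, as a C sketch. The target path, the mutation count, and the output file name are illustrative assumptions; a real harness would also save each crashing mutant under a unique name.

    /* Read a seed file, flip a few random bytes, run the target on the
     * mutant, and check whether it crashed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>

    #define TARGET "./program_under_test"   /* hypothetical program under test */

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s seedfile\n", argv[0]); return 1; }

        /* Read the seed file into memory. */
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("seed"); return 1; }
        fseek(f, 0, SEEK_END);
        long n = ftell(f);
        fseek(f, 0, SEEK_SET);
        if (n <= 0) { fprintf(stderr, "empty seed\n"); return 1; }
        unsigned char *buf = malloc(n);
        if (fread(buf, 1, n, f) != (size_t)n) { perror("read"); return 1; }
        fclose(f);

        for (int trial = 0; trial < 100000; trial++) {
            /* Mutate: overwrite a handful of random bytes with random values. */
            unsigned char *mut = malloc(n);
            memcpy(mut, buf, n);
            for (int i = 0; i < 4; i++)
                mut[rand() % n] = (unsigned char)rand();

            /* Write the mutant out and run the target on it. */
            FILE *out = fopen("mutant.bin", "wb");
            if (!out) { perror("mutant"); return 1; }
            fwrite(mut, 1, n, out);
            fclose(out);
            free(mut);

            /* Death by signal, or a shell reporting exit >= 128, means a crash. */
            int status = system(TARGET " mutant.bin");
            if (status != -1 && (WIFSIGNALED(status) ||
                (WIFEXITED(status) && WEXITSTATUS(status) >= 128)))
                printf("trial %d crashed: save mutant.bin for triage\n", trial);
        }
        free(buf);
        return 0;
    }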
And of course the Microsoft Security Development Lifecycle now requires a hundred thousand fuzzed files before releasing any software to the wild. And I could give you many, many more anecdotal reports of fuzz testing's remarkable effectiveness in finding high-value security bugs. But the problem with this particular approach is that it doesn't handle unlikely paths. Here's a small piece of code which I hope none of us will ever write for real. It simply compares the very first character of the input to 'b' and crashes if it's equal. And as you can see, if you're just randomly testing, you have a very low chance of hitting this particular bug. So the fix is something called white box fuzz testing, which I worked on here, and which combines this idea of fuzz testing with the idea of dynamic test generation. So let me now tell you what dynamic test generation is. In dynamic test generation, we trace the dynamic execution of your favorite program and capture a symbolic path condition: predicates about the input that have to be true to go down this path. We then pick a new path we want to explore and come up with a new symbolic formula for that path. We have a constraint solver which tries to solve this new path condition. If it can solve the new path condition, we extract a new input from the result and run the program, thereby expanding the coverage of the program that we've tested. This was originally developed in a pair of seminal papers, one by Godefroid, Klarlund, and Sen, another by Cadar and Engler, and the idea of using symbolic execution goes back to static test generation ideas as far back as King in 1976. So let me now show you this idea in practice on this particular small program. This is the program we had before: it simply compares the first character of the input to 'b' and crashes if it's equal. Let's run it on the input "good" and see what happens. Well, if we run on this particular input, we come up with a path constraint that says the first character of the input is not equal to 'b'. That's the one predicate tested on the input to go down this particular path. We want to come up with a new path through the program, and there's really only one logical choice: take the other direction of the if statement. So we negate this path condition and feed it to our constraint solver in the corner, and we ask the constraint solver to find a new input that satisfies this constraint. The constraint solver says, okay, I have one for you. Good. We run the program, and it crashes. So we have now overcome the unlikely-paths problem. Yes? >>: Well, I was just concerned about the constraint solver's ability to find ways to get to all paths. Ultimately that is the [inaudible]. >> David Molnar: Right. So the question is, isn't this the halting problem? And the answer is yes, finding all paths is difficult. Which is why we end up in this regime where, it turns out, empirically the constraint solver does very well on many of the paths we want to solve for. Why it does so well is an interesting research question, and that's something I actually have an undergraduate working on at the moment, trying to characterize some of the reasons why that might be true. For example, in the tool I built at Berkeley, which uses the STP constraint solver, 70 percent of the time we call the constraint solver it returns with an answer in under a second. And we're trying to figure out what it is about real programs that leads to that.
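Here is a toy end-to-end run of that loop on the one-branch example, as a C sketch. The "symbolic execution" and the "solver" are hard-wired to this tiny program, purely to illustrate the four steps: trace, negate the path condition, solve, re-run.

    #include <stdio.h>

    /* The program under test: crashes if the first character is 'b'. */
    static int crashes(const char *in) {
        if (in[0] == 'b')
            return 1;                   /* the "crash" */
        return 0;
    }

    int main(void) {
        char input[5] = "good";         /* seed input */

        /* Step 1: trace the seed. On "good" the else-branch is taken, so
         * the path condition is the single predicate  in[0] != 'b'. */
        char pc_char = 'b';
        int taken_equal = (input[0] == pc_char);    /* 0 on "good" */

        /* Step 2: negate the path condition:  in[0] == 'b'. */
        int want_equal = !taken_equal;

        /* Step 3: "solve" the negated constraint. For byte equality the
         * solution is immediate: set that byte to 'b'. */
        if (want_equal)
            input[0] = pc_char;         /* new input: "bood" */

        /* Step 4: run the program on the new input. */
        printf("new input %s -> %s\n", input,
               crashes(input) ? "crash" : "no crash");
        return 0;
    }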
So Josh's question also touches on this other issue of scale, right? The first-generation tools were extremely exciting, but they looked at smaller programs. And the question now is, how can we scale them up to larger programs? Maybe the constraint solver will fall over. Maybe we don't know how to instrument large programs. There are some other questions that come up. And I want to focus on the search strategy question. I'm going to argue that the depth first search approach which the early tools used doesn't let them scale as much as we would like. So here's a slightly more complicated program. You can see here we've run it on the input "good" again, and we've come up with these four predicates that it generates for us as the path condition. And now the question is, okay, we have a much larger search space than even the small program I just showed you. How do we search it? Well, the initial choice, a very natural initial choice, is to just use depth first search. We take the last condition, negate it, go down this path, and then symbolically re-execute the entire program with the new input. And we continue doing this over and over and over again. So we keep symbolically re-executing and then doing a depth first search. Now, the reason this is particularly unfortunate in large programs is just the way the economics work out. For a typical program that's very large, the time to create a symbolic trace might be 25 minutes. There might be thousands of predicates in that path condition, each one corresponding to a different if statement or a different branch you could try to go down. And the time for each branch to solve, generate a new test case, and check that new test case for a crash is about a second. So what we want to do is amortize the expensive work of generating these path conditions over many new test cases. And the reason for that is that each test case is a bite at the apple: each test case is an opportunity to find the new bug that will justify all the hard work we've done so far. And so we had to come up with a new search strategy that lets us do this. The answer is something called a generational search. In the generational search, you see the initial path taken on "good", and what happens is we generate each of the test cases we can get to by flipping one of the predicates in the path condition. So we get four test cases in this example instead of just one. And the set of test cases we generate here at the bottom we call generation one test cases; the seed file is generation zero. We generate generation one test cases, and then of course we can re-execute on any of those to get generation two test cases, generation three, and so on and so forth. The overall search space for the program now looks like this. So let me tell you what you're looking at here. On the bottom you see the actual test cases generated by this technique. Above them you see a number which represents the generation of that test case. And the bomb icon represents test cases that actually crash. So after only three generations, three symbolic executions, you end up hitting different test cases that crash, instead of the depth first search having to walk all the way from left to right through the search space.
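Here is a toy generational search in the spirit of that four-predicate example, as a C sketch. The program under test and the byte-equality "solver" are hard-wired so the whole loop fits here; a real tool derives the path condition from a binary trace and calls an SMT solver such as STP or Z3.

    #include <stdio.h>
    #include <string.h>

    #define LEN 4
    #define QSZ 512
    static const char magic[LEN] = { 'b', 'a', 'd', '!' };

    /* Program under test: "crashes" once three or more checks pass. */
    static int crashes(const char *in) {
        int cnt = 0;
        if (in[0] == 'b') cnt++;
        if (in[1] == 'a') cnt++;
        if (in[2] == 'd') cnt++;
        if (in[3] == '!') cnt++;
        return cnt >= 3;
    }

    int main(void) {
        /* Worklist of inputs, each tagged with its generation number. */
        char queue[QSZ][LEN + 1];
        int gen[QSZ], head = 0, tail = 0;
        strcpy(queue[tail], "good");
        gen[tail++] = 0;                /* the seed is generation zero */

        while (head < tail) {
            const char *parent = queue[head];
            int g = gen[head++];

            /* Tracing the parent yields one predicate per if statement:
             * in[i] == magic[i] or in[i] != magic[i]. Expand the parent by
             * negating each predicate in turn: one child per flip. */
            for (int i = 0; i < LEN && tail < QSZ; i++) {
                char child[LEN + 1];
                strcpy(child, parent);
                /* "Solve" the flipped predicate for byte i. */
                child[i] = (parent[i] == magic[i]) ? 'z' : magic[i];
                if (crashes(child)) {
                    printf("generation %d found crashing input: %s\n",
                           g + 1, child);
                    return 0;
                }
                strcpy(queue[tail], child);
                gen[tail++] = g + 1;
            }
        }
        return 0;
    }

From the seed "good" this produces the four generation-one children ("bood", "gaod", "godd", "goo!") and reaches a crashing input such as "badd" in generation three, without ever re-tracing its way down one deep path the way depth first search would.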
Yes? >>: [inaudible] right, so that's a big part of the overhead. >> David Molnar: So the question is, we're very trace dependent, and isn't that a big part of the overhead? And the answer is yes, we are trace dependent, but we manage to get lots of different test cases from each trace. And that's one of the things that makes this search strategy better than the previous search strategies that were employed. Does that answer your question? >>: [inaudible] I'm just wondering -- I mean the [inaudible] so if you don't come close, you know, if the bad parts of the program are in places where you don't even have a good trace through them, you won't find them. >> David Molnar: Right. So the question is, what happens if there's a part of the program where we don't even have a trace nearby? And the answer I would say is that that's where the art in picking the seed files comes in, for example, or in trying to figure out whether there are parts of the program you haven't covered yet that you need to direct the technique towards. Does that answer your question? So one of the things I worked on while I was here is this idea of active property checking. The basic technique I just showed you looks at code coverage: we come up with new paths that cover new parts of the code that haven't been tested yet. But of course there are security bugs that don't actually show up in the path condition. For example, you might observe that there is a buffer whose index depends on the input in a way we can reason about, and we would like to know if that index can ever be outside the bounds of the buffer. Or we might know, because perhaps someone has given us a SAL annotation, that a particular parameter should never be null for a particular function; we would like to know if we can solve for an input that makes the parameter null. Or maybe we have a division whose denominator depends on the input, and we want to see whether we can solve for division by zero. So I worked on ways to check many such properties simultaneously. And the way I like to talk about this now is by dividing the solver queries into two types. One type is coverage seeking, where we're looking for new inputs to increase our coverage of the state space and coverage of the program. The others are bug seeking, where we say, okay, I'm going to solve directly for a bug and see what happens.
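Here is a toy bug-seeking query in the spirit of active property checking, as a C sketch. Beyond the branch predicates, we also ask whether an input-dependent denominator can be zero; as in the earlier sketches, the constraint and its solution are hard-wired for illustration, where a real tool would hand the symbolic expression to the solver.

    #include <stdio.h>

    /* Program under test: the denominator depends on the input byte. */
    static int compute(unsigned char in0) {
        int denom = (int)in0 - 42;      /* tracked as a function of the input */
        return 1000 / denom;            /* divides by zero when in0 == 42 */
    }

    int main(void) {
        /* A coverage-seeking query would explore branches; the bug-seeking
         * query instead asks the solver:  in0 - 42 == 0  over the input. */
        unsigned char solved_in0 = 42;  /* what the solver would return */

        printf("bug-seeking query satisfied with in0 = %u\n", solved_in0);
        printf("running the target on it...\n");
        compute(solved_in0);            /* crashes: integer division by zero */
        return 0;
    }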
I want to talk to you about one particular type of bug that I think is a really great fit for this technique, and where I've had to develop some new methods: integer overflow, integer underflow, and other integer bugs. These are bugs that come about because programmers believe machine arithmetic is unbounded and works just like the arithmetic we all learned in school. It doesn't. And what happens is that this leaves subtle bugs that can really confound our traditional approaches to finding security flaws. So, for example, one of the traditional things we do is use static analysis. But reasoning precisely about the values of integer variables is hard, which leads to many false positives, which leads to programmers turning off the tool. There's a humorous quote from Linus Torvalds about GCC's attempt at finding such bugs in 2001; he refers to it as "so broken it's not even funny." Now, we've made a lot of progress since then, but it's still a fundamental issue. Another approach that we often use in security is to have a runtime monitor that looks for unsafe behavior and terminates the program if it looks like we're doing something unsafe. But the problem here is that there are benign integer overflows. Some code in cryptography, for instance, might use integer overflow to do a very fast modular reduction. If you terminate the program when that kind of overflow happens, you'll have a very angry user. In contrast, we can use ideas about these bugs to direct our search, and report only the generated test cases that exhibit real, serious bugs. So in slogan form: static analysis wastes the programmer's time, runtime analysis wastes the user's time, but using these checks for dynamic test generation in white box fuzzing wastes only the tool's time. And at 10 cents an hour for the tool's time, versus considerably more than 10 cents an hour for my time, I know which one I'd rather use. So here's a particular kind of bug. This is a piece of code that you might write if you were trying to bounds check this integer x before passing it to copy_bytes. Unfortunately for me, this particular integer is signed, so if x is equal to negative one, it will pass the bounds check. But copy_bytes has a prototype which says the length is an unsigned integer. So when I pass negative one into this function, it will be promoted to an unsigned integer, and we're going to copy far more than 800 bytes. So the bug pattern here is that we're treating the same value as signed and then as unsigned, or vice versa. The way I like to think about this is with a four-point lattice of types. Every value in the program starts out as unknown: we don't know if it's signed or unsigned. If we see it as an argument to a signed comparison or an unsigned comparison, then we can give it a type. And if we see it as an argument to both, one after the other, then we can move it to the bottom value, which indicates a potential bug. The way this interacts with the technique I've talked about so far is that if you see a tainted program value, a value you can reason about, that has this type bottom, you can solve for an input that will make it equal negative one. Why negative one? Because that will exhibit the difference between the signed and unsigned interpretations. And so what I developed are methods to infer these types in a memory-efficient way over very long traces. Because it turns out, as I'll show in a few slides, the traces we deal with are several hundred million instructions long. My first attempts at inferring these types automatically were not memory efficient in the size of the trace, and so I ended up running out of memory on real programs. So I developed a method which uses a very small amount of memory by keeping track of only the live values, at each point of the program, that might have different types. So if you put this all together, the architecture looks something like this. You come up with an initial input, you check it for crashes, then you trace the program on that particular input, you gather constraints for your constraint solver, you solve the constraints, you get a whole bunch of new inputs, and you repeat the cycle. I've been privileged to be involved with two different systems that do this. One is SAGE, here at Microsoft; the other one is called SmartFuzz, which is what I wrote at the University of California, Berkeley. And I wanted to share with you some of the initial experiences that we've had.
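Before the results, here is a minimal C sketch of the signed/unsigned pattern from the copy_bytes example above. The names, the 800-byte buffer, and copy_bytes itself are illustrative assumptions reconstructing the slide's example; the internal guard is only there so the demo prints instead of smashing memory.

    #include <stdio.h>
    #include <string.h>

    static char dst[800], src[800];

    /* The prototype takes an unsigned length... */
    static void copy_bytes(char *d, const char *s, unsigned int len) {
        /* With len == (unsigned)-1 == 0xFFFFFFFF this would copy about
         * 4 GB and smash memory; we just print the promoted value. */
        printf("copy_bytes called with len = %u\n", len);
        if (len <= sizeof(dst))         /* the check the caller forgot */
            memcpy(d, s, len);
    }

    /* ...but the caller bounds checks a signed value. */
    static void checked_copy(int x) {
        if (x <= 800)                   /* -1 passes this signed check */
            copy_bytes(dst, src, x);    /* promoted to unsigned here */
    }

    int main(void) {
        checked_copy(16);               /* fine: copies 16 bytes */
        checked_copy(-1);               /* passes the bounds check, then
                                         * becomes 4294967295: exactly the
                                         * value a signed/unsigned
                                         * bug-seeking query solves for */
        return 0;
    }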
So SAGE was originally released internally at Microsoft in April 2007, and since then it has found dozens of new security bugs which had been missed by black box fuzzing, static analysis, and human code review. These are bugs that, if they had been found externally to Microsoft, would have resulted in a security fix. And you can see here that a number of people have worked on SAGE, showing the amount of investment that Microsoft has put into it. I want to show you an example of such a bug. This is a bug in animated icon parsing; some of you may be familiar with it already. On the left is the initial seed file we fed to SAGE. On the right is an example input generated by SAGE after seven hours and 36 minutes which shows the bug in action: if you run the code on this particular input, it will crash in the place that has the bug. This particular bug only manifests if there are two so-called anih records. So on the left-hand side you can see we have highlighted the little LIST record type, and on the right-hand side it's been changed to anih, and you can see there's another one up at the top. So SAGE was able to figure out that the code is looking for these anih records, synthesize a new test case that has two anih records, and that was what was required to expose the bug. And you can see that just randomly testing, you would have something like a one in 2^32 chance of coming up with such an input. But I want to show you something that I think is even more interesting, and that's what happens when you run SAGE on a particular media file format starting with 100 zero bytes. So this is our initial input, 100 zero bytes. We then generated a bunch of different test cases, but one of them had this RIFF at the top. So SAGE was able to figure out that the program was comparing the first four bytes to RIFF and generate a new test case that makes that condition true. And then that generates several more test cases, one of which had the particular file type, and so on and so forth, so that with each generation SAGE discovers more about the different parts of the program that are looking at different parts of the input file. And here, after ten generations, SAGE generates a new test case that actually crashes the program. Now, what's interesting is that if you start instead with a well-formed seed file, an actual playing media file, you end up with the same sort of crash after only three generations. Which shows that even though SAGE was able to learn the structure of the input on its own, the choice of seed file still makes a big difference. So for the rest of the talk I'm going to talk a little bit more about SmartFuzz, which is the implementation of these ideas I've worked on at Berkeley. It's built on the Valgrind framework for binary instrumentation and runs on Linux programs. It uses the STP solver from Stanford, although we've now actually generalized it to use Z3 and a couple of other solvers. And it's available on SourceForge; you can download a virtual machine with it preinstalled, at sourceforge.net, if you want to run it. The first thing I want to point out is that both of these tools now scale to real programs, with millions of instructions per trace. These are two tables that show you the source lines of code, where applicable, and the number of x86 instructions in a typical trace. So you can see here that both SmartFuzz and SAGE now handle code with hundreds of millions of x86 instructions per trace, gathering the path conditions and then solving them.
I want to tell you about some experiments I did with SmartFuzz, where I took six Linux programs and three seed files each, and for each program and seed file I ran 24 hours with this tool and 24 hours with zzuf, which is a black box testing tool that's roughly comparable to the file fuzzer used here at Microsoft. And I used the Amazon Elastic Compute Cloud to run these jobs over about 50 different instances over a weekend. So Amazon is great, because you can check out a machine with two gigs of RAM for 10 cents an hour, or 7 gigs of RAM for 40 cents an hour. You can really just summon lots of machines to do your bidding and then put them back when you're done. So you don't need a cluster anymore to do this kind of work. An interesting thing about this, of course, is that both of these techniques, white box fuzzing and black box fuzzing, give you millions and millions of test cases. So you have to figure out how to pan through all the different test cases for the few flecks of gold: the test cases which exhibit bugs we didn't know about before, which are high value and worth fixing. In other words, how do we find the right tests? For Linux, my answer is to use Valgrind's memcheck. This is a tool that checks the execution of your program for memory safety violations: things like memory leaks, writing to memory you shouldn't be writing to, reading memory you shouldn't be reading from, uses of uninitialized values, and a whole passel of other properties you probably don't want violated in your program. But you can of course use your favorite bug oracle. If you like AppVerifier you can use that, or any other tool you like. Yes? >>: [inaudible] precision or -- >> David Molnar: Memcheck really only has one level of precision, and so it does have a slowdown. So the question is what level of precision I recommend using memcheck with, and the answer is really only the one default, and that's the one I've been using. >>: [inaudible]. >> David Molnar: It's about 2 to 5x. Does that answer your question? >>: [inaudible]. >> David Molnar: So the question is what AppVerifier does, and the answer is that AppVerifier has a plug-in architecture: you pick what to check. In the experiments I did when I was here, we used the electric-fence style mode of AppVerifier, where it puts a guard page either before or after a particular memory object. That was the main thing I used AppVerifier for when I was here. >>: [inaudible]. >> David Molnar: So the question is whether Valgrind has more precision in its checks than what I had at Microsoft. The answer is that Valgrind is roughly comparable to TruScan in what it's looking for and the precision it's looking for it with. AppVerifier is less precise and looks for less. But on the other hand, AppVerifier imposes much less of a slowdown for the checking: AppVerifier is almost unnoticeable, whereas this is a 2 to 5x slowdown. Does that answer your question? >>: Yes. >> David Molnar: Yes? >>: [inaudible] slowdown, so the traces you generate, especially if there's non-determinism in the program, are you really checking paths that occur in practice, or are you checking, you know, paths that you [inaudible]? I mean it's still valuable to find bugs. >> David Molnar: Right. >>: But are you finding the bugs that are likely to occur in the wild?
>> David Molnar: So the question is, do I find bugs that are likely to occur in the wild? And the answer is that for the security testing regime, I don't care, because if an adversary finds any of these bugs, he's interested in them. But from a reliability perspective, I don't think these are going to be good exemplars of the bugs that random people using the software would find in the wild. Does that answer your question? The other nice thing about Valgrind memcheck is that people in the open source world understand Valgrind; they have seen it before, it's been around about five years. So when you report Valgrind test cases to people, they actually fix them, which is really nice. So I was lucky enough to have nine undergraduates for eight weeks, as part of a team of mentors, and we set up a project comparing both white box and black box fuzzing. You can see them here. And you can't really read it, but their T-shirt says "we found 1,124,452 test cases with at least one Valgrind error and all we got was this lousy T-shirt." Does that really mean we found a million bugs? Well, no. Okay? So there's this problem of bug bucketing, which turned out to be a rather serious problem during this summer work. The issue is that there are many-to-many relations between test cases and bugs. One particular test case, especially with something as precise as TruScan or Valgrind memcheck, can actually exhibit multiple bugs, because the first bug you see may not crash the execution of your program. At the same time, it could be the case that you have one semantic bug which is exhibited by multiple test cases. And in our experience, developers get very angry with you if you report duplicate bugs. We were principally reporting to the MPlayer software project, which is an open source media player that ships with major Linux distributions. And the thing is, we were just posting bugs on their Bugzilla and saying, hi, you don't know us, but you have an invalid write bug you might want to look at. There was this interesting interaction where they would fix these bugs but still had no idea who we were or why we were doing this until much later, at the conclusion of the project. So I want to talk to you about how you actually bucket these bugs. This is the first thing I tried; it's called the stack hash. This is the actual GDB backtrace of a real test case that we synthesized and sent to the MPlayer developers. In bold you see the actual instruction pointers in the stack trace. And the first thing to try is, well, I have a hash function floating around, let me just put these in order and feed them to the hash function; that's the bug bucket ID. In essence I'm saying, up to the collision resistance of the hash function, two bugs are the same if and only if they have the same stack trace. This turns out to have a lot of problems, actually, but the main one is that it doesn't work. And I know that because we used it as our main approach for bug bucketing during the summer work I mentioned, and we had about a 9 to 10 percent duplicate rate, as in, the developers would look at a bug we reported and say, oh, look, that's a duplicate right there. What was interesting here is that we actually had the students concurrently reporting things, so they really did rely on the stack hash as their duplicate detection method. So what I do now, and I don't say this is the end-all [inaudible], but I'll talk -- yes?
>>: [inaudible] reduce your bug count from a million to a hundred, or is there other stuff going on? >> David Molnar: Oh, the students were choosing which bugs to -- so the question is, did the hash reduce the bug count from a million to a hundred? The answer is no; the students picked which bugs to report. We had nine students, and we told each of them to report at least 10 bugs by the end of the summer, so we ended up with 110 total. Some students were more industrious than others. Does that answer your question? >>: [inaudible]. >> Patrice Godefroid: So [inaudible]. >>: [inaudible] how you reduce from a million to 10 if you use undergraduate type of work. >> David Molnar: So the question is, how do you reduce from a million down to the 10 that each undergraduate reports? Well, first of all, we actually used the stack hash to say, okay, here are the distinct buckets. So the million is before any bucketing whatsoever, all right? The undergraduates got to see the different buckets, which is much less than a million, by several orders of magnitude; I don't have the numbers in front of me, but it did significantly reduce the number. And the next question is, okay, did we give them criteria? Well, we told them to look for invalid reads and invalid writes and to prefer reporting those, because those bugs are more serious. But beyond that, we didn't give them any particular criteria for reporting. Does that answer your question? Okay. So what I use now is what I call a fuzzy stack hash. And this is motivated by the fact that for the kinds of programs and testing I was doing, I actually had access to the source code, so I can compile in debug symbols. One problem I noticed when we were doing this work is that the code changed quite often. MPlayer in particular was at one point updating about four times a day while we were doing this work. The developers of MPlayer really, really want you to report against the most recent version of the code, or they reject your bug report out of hand. So we had to recompile MPlayer all the time. And a tiny change to the source code can lead to a very big change in the instruction pointers, so what would happen is we would get different looking stack traces, because the instruction pointers were different, and then report duplicate bugs. So the approach to get around that is to use the line of code and the file the function is in, instead of the instruction pointer. And then, to be robust against small changes, you look at all but the last digit of the line number. And if the line number is only a single digit, you just take that single digit. Another issue I observed is that you can have a buggy function that's called in multiple contexts. Each context is a new call site, and so a new stack trace, but they're really semantically the same bug underneath. So to address that, the fuzzy stack hash I use now looks at just the top three frames of the stack trace, not the full stack trace. Yes? >>: [inaudible] looking at what the line actually has, the contents of the source lines? >> David Molnar: So the question is, why did I take this fuzzy approach to the line numbers instead of looking at what the line actually contains?
And the answer is, first of all, I didn't think about what you just said, looking at what the line actually contains; and second of all, just coming up with a reason on the fly for why one might prefer one over the other, this would also let you handle things that are slightly different about the line. Like if there's a -- huh? [laughter]. >> David Molnar: Oh, great. Thank you. Thanks for saving me time. Appreciate it. So -- yes? >>: Why didn't you use the relative offset [inaudible]? >> David Molnar: Right. So the question is, why not use the offset relative to the beginning of the method? Again, I didn't think about doing that. That would be -- >>: [inaudible] all kinds of other [inaudible]. >> David Molnar: Right. And in particular, the relative offset from the beginning of the method wouldn't require the same level of debug information as I have here. Right. So as I said, I don't believe this is the end-all and be-all; it's a starting point for future work. And what's nice is I actually have a data set now, if anyone wants it, about which bugs were marked as duplicates and which weren't. So you could imagine going back and redoing this with a different approach to bucketing. And then, in addition, we built a front end for collecting all of these different bugs, called metafuzz.com. This is a live website; you can go there right now, if you want to, and it shows you the stack hash, the bucket ID, the particular program that was tested, and a link to download the test case. The students used this to browse through different bugs that they might want to report to developers. And developers used it too: we were able to provide links to specific test cases that they could download and try out against their own code to see if the bug reproduced. And we've been adding features for reproducing against the most recent version of the software and so on and so forth. So here's what we found in these experiments. Remember, I told you about the setup: six programs, three seed files per program, zzuf doing random testing, SmartFuzz doing white box testing. We found that they find different bugs. And the reason I went through the fuzzy stack hash is that when I say "bug" here, I mean a bucket under the fuzzy stack hash, okay? That's why I explained what that is. So we had eight bugs that SmartFuzz found that weren't found by the other technique, 31 that the random technique found that SmartFuzz didn't find, and 19 in the intersection. If you break it down, this table shows that we found bugs in five out of the six test programs. The numbers on the left are SmartFuzz, the numbers on the right are zzuf, and in two out of the six we ended up finding more bugs with SmartFuzz than with the random technique. At the bottom you have the cost per bug, billing at the Amazon EC2 rates as of February 2009. So what this says to me is that of course you want to try both techniques, because you don't know ahead of time whether your program is one where SmartFuzz or zzuf will do better. And in particular, on gzip, a decompression utility that's used pretty widely, SmartFuzz found two bugs and zzuf found nothing. I'm still trying to figure out exactly why those bugs are the way they are. It's a very optimized code base; it's very fun to read. But this tells us about the relative [inaudible] of these two different types of bugs.
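Since the bug counts above are all in terms of fuzzy stack hash buckets, here is a minimal C sketch of that hash: the top three frames, keyed on file and line with the last digit of the line number dropped. The FNV-1a hash and the frame layout are illustrative assumptions; the file names and line numbers in main() are made up.

    #include <stdint.h>
    #include <stdio.h>

    struct frame { const char *file; int line; };

    static uint64_t fnv1a(uint64_t h, const char *s) {
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 0x100000001b3ULL; }
        return h;
    }

    /* Bucket ID for a backtrace of n frames. */
    static uint64_t fuzzy_stack_hash(const struct frame *bt, int n) {
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */
        int top = n < 3 ? n : 3;                /* only the top 3 frames */
        for (int i = 0; i < top; i++) {
            /* Drop the last digit of the line number, so small edits that
             * shift code by a few lines land in the same bucket; a line
             * number that is already a single digit is kept as-is. */
            int coarse = bt[i].line < 10 ? bt[i].line : bt[i].line / 10;
            char buf[32];
            snprintf(buf, sizeof buf, "%d", coarse);
            h = fnv1a(h, bt[i].file);
            h = fnv1a(h, ":");
            h = fnv1a(h, buf);
        }
        return h;
    }

    int main(void) {
        /* The same bug before and after an edit that shifted a line by 3:
         * both backtraces hash to the same bucket. */
        struct frame a[] = { {"demux.c", 812}, {"parse.c", 304}, {"main.c", 57} };
        struct frame b[] = { {"demux.c", 815}, {"parse.c", 304}, {"main.c", 57} };
        printf("bucket a = %016llx\n", (unsigned long long)fuzzy_stack_hash(a, 3));
        printf("bucket b = %016llx\n", (unsigned long long)fuzzy_stack_hash(b, 3));
        return 0;
    }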
The other thing I want to point out: remember, I talked about bug-seeking queries, where we solve for particular kinds of conditions that might correlate with a bug. So this slide shows, over all the runs we had, how many queries of each particular type succeeded. This gets back to the question about solving the queries: it tells you roughly how many succeeded. We have a timeout of about five seconds for each query. And then, of those, it shows how many bug buckets came from each particular type of query. So you can see here that we actually were successful in using them to find bugs, and that the signed/unsigned property was the most successful of all our bug-seeking queries. Yes? >>: [inaudible] these three bug categories [inaudible] these are not necessarily bugs, or are they? >> David Molnar: They are not. So Patrice's question is that these are not necessarily bugs, either overflow or underflow, as I argued earlier. What this table is showing you is that we created a new test case from a property whose name is here: we're trying to force an underflow, force an overflow, or force a signed/unsigned conversion. And then Valgrind memcheck told us the resulting test case had an error, and then we applied the fuzzy stack hash I discussed earlier. >>: [inaudible]. >> David Molnar: It was something else. So it's an invalid read, an invalid write, something like that. So again, you can see that these techniques work, and that signed/unsigned found the largest number of bug buckets out of all the techniques we tried. So the way I see this is that this is a story where we had a beautiful theory which showed us the way to a new method of testing software. And the work we've been able to do takes this beautiful theory and puts it into practice, with scalable, practical tools that have made a real difference in the way people test software. So going forward -- yes? >>: So a question about the methodology: the traces you start with, do you require those to be collected from the beginning of the program, when it gets the input, or can I, in the middle of a program, hook up something to trace and then use your techniques? >> David Molnar: So the question is, can I start collecting in the middle of a program execution and start symbolically executing? The answer is not with my current infrastructure. My infrastructure uses Valgrind, and Valgrind requires loading the program, because Valgrind actually takes over the duties of the program loader. Valgrind sets up, in the same address space as your guest program, room for its own host code, which does the dynamic binary translation and recompilation of the x86 code. And attaching to a running process is not supported at this time. >>: [inaudible] just finding [inaudible] bugs in the startup of programs, because, you know, it takes a while to generate the [inaudible] trace and the [inaudible]. >> David Molnar: So the question is, will I only find bugs in the startup of programs? The answer is I don't believe so, because we only look at the constraints from the parts of the program that depend on the initial input, and so different inputs that exercise different parts of the program will lead to different constraints and then different bugs. Think of a media player program: MPlayer supports maybe 15 or 20 different file formats. The bugs we find from the MP3 playing are different from what we find from the WAV playing or from the AVI playing.
>>: [inaudible] some overflow bug where the program really needs to run for a length of time to -- >> David Molnar: Right. >>: Probably not going to find those [inaudible]. >> David Molnar: That's true. And that's why I've been talking to people about checkpointing approaches for saving the state of the initial part of the program. For instance, I was just speaking with Gene Cooperman at Northeastern University, who has a checkpointing approach he thinks would help with this. Yes? >>: There's a very large number of 16-bit conversion errors there. Where do they come from? [inaudible] 16 bit [inaudible] 16 bit data [inaudible]. >> David Molnar: So the question is, where do UNIX programs get 16-bit data types? I don't have a great answer for you, because I haven't gone back and traced all of those to the original source code, but my gut feeling is that MPlayer might have them in the particular file formats we're looking at. Yes? >>: [inaudible]. >> David Molnar: Right. So the question is the contrast with KLEE. So first, there's a different focus. KLEE is focused on very high code coverage, but they're looking at smaller programs. So they do things like keeping all the program states in memory, and then they fork to try to go down different code paths, whereas I do just a single trace at a time and try to focus on larger traces. So I look forward to a day when we can take some of the things they're doing and some of the things I'm doing and do really high code coverage of really large programs. That's their main difference: their focus is on gaining 99 percent coverage, and my main focus is on finding bugs in MPlayer. Beyond that, there are lots of other smaller differences that we could go into, according to my understanding of KLEE, but I think that might be best done after the talk. Does that answer your question? Okay. So going forward, we now know how to create more bugs than anyone will ever be able to fix unassisted. What do we do to fix these bugs more effectively? And stepping back a bit, we now have this amazing ability to check out hundreds of machines and to use lots and lots of information about the code base we're working on. How can we use that to help a programmer write better code, more interactively and more immediately than our current techniques? That's one of the directions I'm really interested in: this part of the cycle where we report and fix. For example, there have been techniques developed here at Microsoft and at other places for worm defense, which try to synthesize patches immediately given an example of a worm that exploits a particular piece of code. Well, how does that change if we go into this regime where we have a lot of time to create a patch, but the quality has to be higher because human beings have to maintain and understand it? So that's an example of a direction I'm interested in going in in the future. So with that, I thank you for your time, and I welcome your further questions. [applause]. >> David Molnar: Yes? >>: [inaudible] and tell the tools to reemphasize that may be related to the bug [inaudible]. >> David Molnar: So the question is, after I fix a bug, can I tell the tools to emphasize a particular piece of code?
So my work does not currently allow you to say "this particular function is interesting," but you can rerun the tool with a test case it generated earlier and expand around that particular path, and look for bugs in the patch to the original problem. So yes, I can do something, but no, I can't do what I would really like to do, which would be to say "this particular line of code needs to be exercised quite a bit." Does that answer your question? >>: [inaudible] how do techniques like this one scale to, like, more interactive programs or networking protocols [inaudible]? So when you specify test cases, is this like the program taking one input and then running with it from start to end, or can you also do, like, you know, supply an input, the program does something, supply another input, the program does something [inaudible] kind of testing as well? >> David Molnar: Right. So the question is, what about interactive programs or protocols? The work I showed you doesn't handle that. There has been other work done here at Microsoft that does look at it, and I'm talking to some people in the networking group at Berkeley about how to look at that particular issue. In principle, what you need is a proxy that lets you replay protocol dialogues, and you close off the entire environment so that you can treat the input to the whole program as the entire dialogue between one server and one client, or what have you. In principle it's possible; it just has to be done, and there are some different questions that come up about how to search the state space. Does that answer your question? All right. Thank you for your time, everyone. [applause]