>> Margus Veanes: Good morning, everybody. This is Pieter's candidate talk, and it's a pleasure to welcome him back. Pieter Hooimeijer is from the University of Virginia, working with Westley Weimer on various problems related to strings, which is also the topic of his talk today. Personally, I have known Pieter for roughly two years -- actually, more like two and a half or three. He did his first internship here with me in 2010, on work related to strings, and last year he did a second internship continuing on the same topic. Some of what he'll talk about today will touch on the things he did here at that time. Let me also say that Pieter has worked in other areas, like empirical software engineering and sensor networks, so he has a pretty broad scope as well. With that, I'll let Pieter start talking.

>> Pieter Hooimeijer: Thank you, Margus. Thanks, everyone, for coming. So I'll be talking about strings. The working title of this talk is "Pieter talks engagingly about a single data type for one hour." In order to make that happen, I figured I'd tie this to the audience. Imagine you're currently using your laptop -- you might have a Lenovo that looks a lot like this -- and you're browsing a website. Let's say you're looking at Stack Overflow to scout out your competition; it turns out Jon Skeet has an impossible number of points and you'll never be able to beat him, but you're looking at his profile. The developer who implemented this page decided it would be a good idea to allow users to provide their own profile picture, so they can provide an address that points to the image file. I have a very bare-bones image tag on the slide, and the idea is that we'll treat the source attribute as untrusted input. As a developer -- let's say I'm quite competent but not super security savvy -- I might ask what could possibly go wrong; and, of course, the answer is: it depends. It depends on what we do with this untrusted input. The address is user provided, so we should make sure that it doesn't do anything bad to our page. Let's imagine that we don't do anything. In that case an attacker might provide a specially crafted attack string -- in this case an address that contains a single quote -- that allows the attacker to escape from the source attribute and inject, let's say, their own attributes; for example, an onload attribute that loads arbitrary JavaScript into the page. So this is a cross-site scripting attack. You may have heard of them; this is very common, and this is perhaps not the most exciting example of one, but nevertheless the problem here is that the attacker can run code whenever someone visits this profile page. In addition to annoying alert boxes, they might execute website functions with the privileges of the user currently viewing the page, and that is problematic if you're a bank, or if you really care about your score on Stack Overflow. All right. So there's a lot of research on cross-site scripting.
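(To make the scenario above concrete, here is a minimal sketch in C# of the kind of vulnerable page construction being described. The names and the attack URL are hypothetical illustrations, not the code from the slide.)

    using System;

    class ProfilePageDemo
    {
        static void Main()
        {
            // Hypothetical attacker-controlled input; on the real page this would be
            // the user-supplied profile image address.
            string imageUrl = "http://example.com/x.png' onload='alert(document.cookie)";

            // A vulnerable way to build the page: the address is pasted into the markup
            // without any sanitization.
            string html = "<img src='" + imageUrl + "' />";

            // The embedded single quote closes the src attribute early, so the attacker's
            // onload attribute becomes part of the tag and its JavaScript runs on page load.
            Console.WriteLine(html);
            // Prints: <img src='http://example.com/x.png' onload='alert(document.cookie)' />
        }
    }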
If you go to Microsoft's academic search engine and look for this, you'll turn up hundreds of papers -- for some reason in a variety of areas not related to computer science as well -- but I'll assert that the majority of results come from computer science, and they tend to try to mitigate this problem when it happens, or to prevent cross-site scripting altogether through some constructive means. Either way, the point is there's a lot of work on this, and part of my research aims to generalize some of the insights gleaned from the existing research on, let's say, cross-site scripting or SQL injection -- in short, vulnerabilities related to string manipulation. So with that, I'll give you a dark slide. The talk will be in two parts. I'll talk about my string constraint solving work, which is roughly my dissertation, for the first half, let's say, and after that I'll briefly touch on the Bek project, which is my internship work with Margus from the last two years, among a few other things.

So let's talk about string constraint solving: what it is, why I care, and why you should care. The structure will be roughly this: I'll go into some background to explain what I mean by string constraint solving and why you should care; following that I'll talk about building a string constraint solver, in this case my first paper on the topic; and then I'll talk about tuning, which is the topic of a few other papers, trying to make this sufficiently fast for people to actually be able to use this stuff. So roughly: background and definitions, then evaluation and performance tuning.

All right, let's start with background. There are a number of constraint solvers out there; you may have heard of some of them. There is this notion of doing program analysis using a constraint solver. This is very common -- it is prevalent in a variety of contexts, everything from automated testing to static analysis, model checking and so on; you typically end up using one of these. So this is the MSR potentially-redundant slide. There are a number of different implementations, and one key point to keep in mind is that there's a standard input format; at least in principle, if you don't use terribly esoteric features, you can interchange these tools, and this also allows for annual competitions and so on. So this is the state of the art in terms of mathematical constraints, typically involving, let's say, integers, bit vectors, data structures and so on. A very simple example: if I have x squared equals 25 and I want to know what x might be, I can solve for x using an SMT solver. It looks something like this: the declarations roughly match what we have in straight math notation, and in this case my result is that x can be 5. There might be other solutions, but we just have one particular example here, and what's most important is that we know it is possible to satisfy these constraints.

So what about strings? It turns out existing SMT solvers have not traditionally had a theory of string constraints. Bit vectors are well represented, like I mentioned, as are various types of arithmetic on integers, and even some things involving quantifiers, but not necessarily strings. So I've said here on the slide that reasoning about strings is difficult, and I mean that in an informal way. For programmers, it is apparently difficult, because they write websites that have errors in them, like cross-site scripting vulnerabilities.
And it is difficult for automated tools for somewhat different reasons, which I'll go into; in that case, I guess, the definition of "difficult" is a little bit more formal. So there are a number of existing approaches -- or, rather, approaches that exist now that did not exist previously. These are the names of four tools I will feature in the talk, and each of these tools is essentially a domain-specific SMT solver; let's say a string constraint solver for just string constraints. DPRLE is a tool I presented at PLDI '09, and Hampi came a month later at ISSTA '09, and I'll assert that all these tools were published, let's say, in the last three years.

>>: So what about the whole second-order stuff, Mona and all that? It goes way back, right?

>> Pieter Hooimeijer: Sure.

>>: What about Mona?

>> Pieter Hooimeijer: So the question is: what about Mona? In the mid '90s there was a bunch of research on writing dedicated solvers for monadic second-order logic, using, for example, multi-track automata, depending on which fragment. And those are in many ways theoretically similar: they solve, in the multi-track automata case, constraints that are equivalent to a regular language. But they're not specifically designed for string constraint solving in the way we do it here. There's one paper -- I forget the venue -- a recent paper that uses Mona to solve string constraints, and it turns out this is possible; you can express regular languages and so on, this is well known. However, it is not particularly performant. So there's still space for domain-specific tools and so on. And I think it would be very interesting to look at the deeper implications of, let's say, the Mona implementation, which is very highly tuned and so on. We have some of the same things showing up in string constraint solvers that they had for Mona -- the use of BDDs, for example.

>>: What makes these tools domain-specific?

>> Pieter Hooimeijer: So the question is what makes them domain-specific. Very bluntly put, they are domain-specific in the sense that that's how they're presented in the papers that publish them. They're domain-specific in the sense that they solve, in this case, regular language constraints. So if you do symbolic execution on some code and it generates some string constraints -- stuff like "this string should match this regular expression" -- then you can use one of these tools to solve those constraints. Does that answer your question?

>>: That seems pretty general to me.

>>: But in terms of logic, compared to Mona, for example, are these languages logically, formally more restrictive than --

>> Pieter Hooimeijer: That's an interesting question, and the short answer is I don't know the full specifics, or at least not well enough to state them on the record. But a fragment of Mona's second-order logic is definitely representable using automata, and most of these constraints are representable using automata as well; the structure is just slightly different. So my guess is that they would be similar. There's a notion of repeated or iterated automata intersection being PSPACE-complete, and I believe that is the case for both things here. But that's the best I can do in terms of a specific upper bound, let's say. Are you satisfied, Chas? Thanks.
So, yeah, four different tools, all published in the last few years, and the basic notion is that we'll generate constraints that look an awful lot like string-manipulating code -- it will look like C# -- and we'll pass them to any of these tools, which have roughly equivalent input languages, sufficiently similar for this to be engineering-feasible, and we'll get out some answer. So where previously we got a response that says x should be 5, now we get a response that says string variable A should get the concrete string value A/B. So, very briefly, I've shown a very short example constraint, talked about different solvers, and shown what the shape of a solution looks like: an assignment to string variables. What that doesn't tell you is where the constraints come from. I've already touched on this, but the basic notion is that we'll punt on it. We'll say you can use standard techniques to generate constraints that include string operations -- in this example, let's say, symbolic execution. So if I want to exercise the if statement in this code, I can generate a constraint system that, instead of branching on r.IsMatch(a), asserts that it's true, and I can solve those constraints and find inputs that exercise this path through the code. But in general we separate constraint generation as a problem from constraint solving. This is common to all of the tools and presentations that I've shown, and we'll focus, at least for the purposes of this talk, on constraint solving. We'll assume there is a reasonable way to generate string constraints without really going into the details of how best to generate constraints for strings. In practice, we evaluate empirically whether our assumptions about string constraint solving, or our assumptions about string-manipulating code, are correct: we'll use standard techniques and see if we can solve the constraints that result.

All right. So far I've talked mostly in terms of examples, and I'll continue to do so. But in general I think this raises a question of scope: what is a string constraint, as you've already asked, and what is not? One example might be MD5. It's a cryptographic hash, or at least a one-way hash, and it is technically a function that takes a string and outputs a string. So if we wanted to do constraint solving and we didn't narrow our scope, that would be fair game -- except there are separate papers on reversing cryptographic hashes, and it's a very domain-specific problem that we may not want to solve. For a different example -- which, if you've seen this slide before, you are prohibited from answering -- how hard is it to do a simple one-string, one-regular-expression match in Perl? Some of you may have seen this slide at VMCAI maybe last year, and the answer -- which you can't see because the text is blue very close to the black -- is that it's NP-hard, and the reduction is from 3-SAT. So this is a five-line piece of Perl code that, given an array representation of a 3-SAT problem, turns it into a single string and a single specially crafted regex that uses backreferences in order to solve SAT. So if you ever need a five-line SAT solver, there you go. This is not my result; it's something that was floating around on a Perl mailing list some number of years ago. So this still doesn't answer our question of scope.
But in general it does answer the question of whether we can model anything that shows up in the wild, and the answer is: probably not. MD5 is an obvious example; regular expressions in Perl are perhaps not so obvious -- it turns out the exact formal language class of Perl regular expressions is sort of nebulous. We'll focus in this talk on constraints that we do know how to solve, and in practice that usually means we'll go with the strictly regular component of real-world regular expressions.

All right. So let's talk about my first attempt at building one of these tools, and we'll see the implications. In the PLDI '09 paper, I provided some definitions of basic string constraints, and the little logo on there is meant to indicate that we defined this stuff in Coq. So we went through the trouble of defining strings from first principles, the string constraints of interest from first principles, and then showing that our core solving algorithm is sound and complete relative to those definitions. We also provide an implementation -- and this is one of those stories where the implementation is strictly not related to the formal proof -- but it solves string constraints in practice, and we use it on an existing benchmark to show that we can generate attack inputs for 17 known SQL injection vulnerabilities. So we look at some corpus of PHP code, we use off-the-shelf techniques to generate constraints, and then we solve them using our tool, measuring, let's say, running time and whether or not it works.

So rather than defining formally what the constraints of this particular presentation look like, I will focus on a quick demo. There's an online Web version of this tool available. If you're currently on your laptop, the URL is something at virginia.edu, and if you happen to be playing around with your phone, it turns out I wrote a quick script in TouchDevelop. So if you want to follow along, look for the [inaudible] script in TouchDevelop -- it's the shortest TouchDevelop script you have ever seen; it links you to the website, which I will show you now.

All right. For this short demonstration, I want to emphasize a few things. For this particular tool, variables represent regular languages, and I took that very literally in the tool implementation: they literally are automaton definitions, concrete ones. There are two aspects to this: one is regular inclusion constraints, and the other is concatenation. It turns out that if you support those two operations, you can model the majority of constraints that originate from the symbolic execution of string-manipulating code in some way or another -- sometimes less directly, sometimes more. On the slide is a short example; I apologize for the limited visibility of black and colors. I guess the main notion is one that Tom already raised: there's an equivalence between other logics and what we call string constraint solving. The example I have on the slide is one where I want to do modular arithmetic. I have two automata, twos and threes, that represent in unary the sets of numbers that are multiples of two and three. I'll define 9 as a single number -- that's a little tedious in automaton representation, but let's imagine we can also do regexes -- and what I want to find out is how many ways there are to put twos and threes together to form 9.
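(As an aside, the decomposition being asked for here can be sketched with a few lines of C#. This brute-force enumeration only illustrates which answers the solver is expected to find, not how the automaton-based tool finds them; the choice of '1' as the unary alphabet symbol is an assumption.)

    using System;

    class UnaryDecompositionDemo
    {
        static void Main()
        {
            const int target = 9;   // the number 9, written in unary as "111111111"

            // Enumerate all ways to write 9 as (a multiple of 2) + (a multiple of 3).
            for (int numTwos = 0; 2 * numTwos <= target; numTwos++)
            {
                int rest = target - 2 * numTwos;
                if (rest % 3 == 0)
                {
                    int numThrees = rest / 3;
                    // Print the corresponding split of the unary string.
                    Console.WriteLine("twos = \"{0}\", threes = \"{1}\"",
                                      new string('1', 2 * numTwos),
                                      new string('1', 3 * numThrees));
                }
            }
            // Output (the two disjoint solutions the demo reports):
            //   twos = "", threes = "111111111"   (0 + 9)
            //   twos = "111111", threes = "111"   (6 + 3)
        }
    }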
So this is truly the arithmetic example from middle school, let's say. The way I express that constraint is by saying twos concat threes should be a subset of 9, which is a single string. If I want all solutions, I use solve-all in this case, and then there's some additional machinery required to select the two solutions and display them. So let's hit submit on that and hope that I have Internet. Looks like it ran successfully -- again, this is a tool implemented in, let's say, 2008, so it's been a while. The result is two disjunctive solutions. The first looks like this: for the 2s it looks like we took three 2s, so I end up with 6, and then one 3 -- in other words, 6 plus 3. My second solution looks like zero 2s and exactly three 3s. So I've done a very basic modular arithmetic example to show the decomposition of the number 9. This example is meant to illustrate some interesting properties, most notably the use of concatenation, in this case to separate the 2s and the 3s and get separate solutions for them. The other is that it is not sufficient to say: here's a regular language for each variable, and I have two disjunctive solutions. It does not work for me to, say, combine the zero 2s from one solution with the 3s part of the other. In other words, these solutions are inherently disjunctive; I can't just merge them together into a single regular language per variable. If I have multiple solutions, they will be strictly separate. In practice we probably won't be doing modular arithmetic, but I figured it would be an interesting example to show a known equivalence using a new tool.

>>: Do you have a slide on the actual logic?

>> Pieter Hooimeijer: I do not. I mean, it's really: variables are regular languages, and we have grounded regular inclusion constraints -- so there's a single constant regex that is the right-hand side of this subset constraint -- and concatenation. I can say things like "variable one concat variable two should be a subset of this regular set", which is really all the core constraint language handles. And this is equivalent to the constraints in other papers. In the Hampi core language, this is essentially the same, except that we bound the length of each string, because it's a reduction to SAT, and in this case the length is unbounded.

So, moving on. The evaluation, as I mentioned, was on an existing corpus produced by Wassermann and Su at PLDI '07. They do a static analysis that finds SQL injection vulnerabilities, and the main motivator, at least for me personally at the time, was the output of this tool, which looked a lot like "there might be a bug on line 58 of your code" and nothing else. We wanted to add indicative inputs to this static analysis. In order to do so we had to put together some additional machinery: we had to find a path to each particular potential bug and symbolically evaluate that path to find constraints. We use running time as a metric. Like I said, it's 17 vulnerabilities, so it's not the biggest corpus for doing statistically significant performance results, but we wanted to know if this is feasible, just as a first cut. And the results are that, yes, we can generate successful attack inputs for these constraint systems, and our running time is between about a hundredth of a second and about ten minutes. That's quite a big range, even across a very limited sample set. So more on that later. In fact, I'll talk about that now.
My next step in this process was finding faster ways of doing this. I'll go into the context in a little bit, but in general the idea at the time was that there were two competing approaches. One was the Hampi tool, which I was a co-author on, which uses a reduction from string constraints to bit vector constraints, which then get turned into a SAT problem. That tool is relatively fast in practice -- faster than the tool I just showed you, in fact. Nevertheless, we had this feeling that automata-based constraint solving could be faster; it seems sort of obvious in hindsight, but at the time it wasn't immediately clear why.

>>: My impression is that the Hampi tool is bounded string length -- a bounded string encoding, right?

>> Pieter Hooimeijer: Yes.

>>: It's not comparable to what you are doing, because you are --

>> Pieter Hooimeijer: Yeah. So, yeah, the Hampi problem is NP-complete and ours is not believed to be.

>>: What is the complexity of your problem?

>> Pieter Hooimeijer: Like I said, I believe it is PSPACE-complete, but I have not formally evaluated that; I focused more on empirical performance evaluation instead.

All right. So there are two papers on this. One is a VMCAI paper which I wrote with Margus during my first internship in Redmond, and it does basically data structure selection. We implemented a bunch of different techniques from known automata libraries in the same context and used them on string constraint solving problems over one variable, which reduce to single automaton intersection, or single automaton determinization and computing the inverse. I won't talk about that paper in a lot of detail for lack of time, but I think the key point there is that we did as rigorous an evaluation as reasonable, in the sense that we fixed pretty much everything, down to the front-end parser, the language of implementation, and so on. At the time, existing work had the more usual structure of "we have a tool, it works better on some benchmarks than some other tool, therefore we win." This is a reimplementation of existing techniques, not a novel technique, to see which data structures work best, for example, for representing large character sets and intersecting them efficiently.

What I'll focus on instead is the ASE 2010 paper, which essentially takes some of the results from VMCAI and implements them in a real solver. In this case I sat down and decided to code this stuff in C++, so it has some engineering benefits as well as some of the data structure and algorithm selection that we gleaned from the other paper. The approach is just like the solver you saw before, except that wherever we previously relied on full automata operations, we now do things as lazily as possible. For finding a single string in an automaton, that means we find a path to a final state without looking at other parts of the automaton if we can avoid it. If we do intersection, it's essentially the same thing. It turns out that for multivariate string constraints this becomes a little trickier; in some cases we may have to find paths that are essentially circular in nature. If I have a constraint that A followed by B is in some regular language, and then a different constraint that B followed by A is in some other regular language, it suddenly becomes very tricky even to figure out where I should start in the hypothetical search space, which may be quite large. So I won't go into detail about the algorithm too much.
It's basically a wall of pseudocode in one of my papers, as well as in my dissertation, with some examples and so on. If that interests you, please have a look, and I'll be happy to talk about it offline as well. Instead I'll focus on the evaluation -- the hard numbers in this case. We do a bunch of different experiments; I will skim over some of them and focus on the two middle ones. I'll do a comparison with Hampi, which, like I said, is a different tool that I was involved with, and I'll do the long strings experiment, which is designed to illustrate some of the limitations of other string constraint solvers.

So let's talk about Hampi, with a little bit of background on the Hampi architecture. Like I said, where I went the route of using automata for this stuff, which made things easy to prove but not particularly efficient, for Hampi my co-authors came up with a separate algorithm, which is a reduction to bit vector constraints. For a given regular language, it essentially enumerates the various possibilities of characters appearing in a particular position, in a single bit vector that represents the entire constraint system. Like I said, this approach worked well in practice, but there's some question of how much we can improve it. Hampi internally uses the STP bit vector constraint solver, which internally uses MiniSat, so there are several layers here already, and that makes the performance at least somewhat unpredictable: for different problem sizes you might get a noncontinuous function of performance results. What we wanted to find out in this particular experiment is how much Hampi could be improved if we had faster bit vector solvers. One of the main advantages of doing a re-encoding like this, instead of relying on ad hoc algorithms, is that if STP were to become twice as fast, Hampi could directly benefit from that performance improvement. So in this experiment we'll assume that we replace Hampi's bit vector solver with a zero-time oracle that answers bit vector constraints. The task for this experiment will be to do 100 instances of regular set difference: I have 10 regular expressions taken from real-world code, varying in size and so on, and we'll do A set-minus B for each pair, leading to 100 data points for each length bound that we give Hampi. Our metric will be the proportion of time that Hampi spends solving constraints versus encoding them in the first place. The idea is that we could eliminate the solving time if bit vector constraint solvers were perfect, and we want to see how much that improvement would buy.

>>: What is the question you're asking about the set difference -- what do you ask about it?

>> Pieter Hooimeijer: Is it empty or not. The minimal requirement is finding a single string if it's not empty, and reporting that it is empty if it is.

So here are some results for length bounds one through 15 on the vertical axis, with the proportion of running time on the horizontal axis as a stacked bar. In this case solving is gold, on the left, and it takes relatively little of the time; in fact, most of the time is spent in the encoding step, which turns out to be exponential. In these results I have light gray as everything else, which you might consider parsing the constraints, returning the solution, and so on. What's actually happening in each individual bar is not clear in this graph, right? This is an aggregate based on 100 results per bar.
So just to reinforce this a little bit more, let's look at, let's say, just N equals 15. Here I have the scatter plot of total running time on the horizontal axis and the proportion of encoding versus solving time, ignoring everything else, on the vertical axis. In general, we see a vague trend where the longer the running time is, the clearer the difference between solving and encoding becomes, with encoding almost always dominating.

>>: So you already mentioned that Hampi is solving a problem in NP, but here you've said that the encoding time was exponential.

>> Pieter Hooimeijer: Yes.

>>: So what is going on?

>> Pieter Hooimeijer: There are two things going on. One is the question of what the size of a regex is: a linear-size regex can represent a potentially large number of strings, and here we've defined things strictly in terms of the output length. More generally speaking, the proof of Hampi's NP-completeness is unrelated to its implementation.

>>: I see.

>>: So the problem that Hampi solves is NP-complete, but the implementation does not take the polynomial-time reduction to an NP-complete problem's solution.

>>: So was that never implemented? Why are they doing that?

>> Pieter Hooimeijer: That is a great question, which I will leave for later.

>>: All right.

>> Pieter Hooimeijer: Thanks. So, in short, we find that even if we were to replace Hampi's solving time with zero -- that is, eliminate the gold portion of each bar -- it is still orders of magnitude slower than our fastest automata-based implementation.

So let's look at long strings. This is a benchmark designed to show the performance of string constraint solving tools relative to the output size. I have two regexes and I want to intersect them -- in other words, find a single string that matches both of them at once. The goal is to do that parameterized on N. The curly-brace notation here, {N+1} and {N}, represents repeating the most recent element of the regex -- in this case the character class a through c -- exactly that number of times. Notable is that we'll need some string that contains the substring "ab" somewhere. If you do this incorrectly, you'll spend a long time looking; if you do this correctly, you can do it in roughly linear time in terms of the output, if you're very lucky, let's say. For this experiment, we ran four different tools: DPRLE, which you have seen before; Hampi; Rex, from MSR; and StrSolve, which is what we'll call our new prototype implemented in C++. The graph shows N from 0 to 1,000, and the time on the vertical axis is log scale. We see DPRLE is very slow compared to the other tools; that's because of its eager automata implementation. Hampi actually fares reasonably well -- in fact, it is a little slower than DPRLE up to around N equals 50, but after that it definitely beats DPRLE, confirming what we already had as an informal impression. Rex is very flat at around a tenth of a second, and our implementation is pushing the boundaries of how precisely we can measure running time, because, as it happens, it makes the right choices when implementing automaton intersection. The take-away, in addition to Hampi being a little slower, is that it also exhibits a lot of these bumps in running time, which don't show up that well on a log graph; if you look at the non-log version of this graph, it fans out.
So as you go to bigger and bigger N, the optimizations that STP, and in turn MiniSat, make to solve these constraints certainly matter. For some reason the encoding time dominates, but the solving time fluctuates. In some sense this is undesirable because it makes performance a little unpredictable. But in general, if you look at, say, N equals 750 through 1,000, you can go from, let's say, 60 hours' worth of solving for DPRLE to a couple of seconds for our tool, in the worst case.

All right. So I've talked about string constraint solving, and I guess the main point there was that we didn't really talk about constraint generation. I talked a lot about solving -- solving string constraints, and this will be good for everyone, and so on. What I haven't talked about is an end-to-end tool that implements this for a particular class of programs, say. And that's pretty much exactly what we did with the Bek project. It's a complementary approach to what I've presented so far. This is the work of two internships, and I'll very briefly skim over the details.

So let's return to our earlier example: a Web developer implements a profile page, and let's say that we throw in a sanitizer this time. Previously we were vulnerable because we didn't include any sanitization, but there are all these libraries that claim to help mitigate cross-site scripting attacks. So we can call HtmlEncode and this will save us, right? And it turns out the answer in this case is that it might well work. The single quote in the attacker's input string is encoded to &#39; and now there's no way to escape the source attribute and include code. So this will just be an image that doesn't show up, because the URL does not exist. We've successfully avoided this attack. Now, as a developer I might ask: what could possibly go wrong? In other words, did I just fix all my problems permanently? And the answer is, well, it depends on which library you used. Let's say library A has a function called HtmlEncode that has been available to C# developers for some number of years and is well regarded. And now I have a library B, which through a Bing search I find to be seemingly equivalent, published by the same people and with, let's say, the exact same credentials. It turns out that in this case if I use library A, I win: my single quote gets escaped. Library B takes the more formal route: since the HTML standard says I shouldn't be using single quotes around my source attribute anyway, it doesn't escape the single quote, because that's not deemed necessary. Now I still have the same vulnerability, in spite of the fact that I called a function called HtmlEncode. It turns out libraries A and B correspond to Microsoft's AntiXSS library's implementation of HtmlEncode and Microsoft's .NET WebUtility library. And it turns out Microsoft is not the only entity that has concerned itself with what the exact semantics of HtmlEncode should be. In fact, if you look at the source code for the PHP interpreter, the portion that implements HTML encoding has been updated 151 times in the last decade -- roughly once every three weeks on average. In that time span, across hundreds of thousands of SVN revisions, it has grown from 135 lines of code, which as a programmer I could inspect and manually verify relatively easily, to about 1,693 lines -- roughly a tenfold increase -- so it's a little bit more challenging to look at and figure out what's going on.
In the meantime they've added all sorts of flags -- for example, to check whether the function is idempotent and so on, to make sure it doesn't double-encode things -- and all sorts of other subtle changes in behavior. If you think about it, that's kind of bad: if there are millions of Web apps out there that rely on the exact behavior of HtmlEncode for their security, then it is not necessarily a good thing that HtmlEncode apparently changes its semantics in subtle ways once every three weeks.

I am somewhat limited in terms of time, so I'll go into the background of the Bek project as well as a very high-level overview of our approach; if you're interested in both the external and internal evaluation, I'm going to refer you to the papers in just a bit.

All right, let's talk about background. The key idea is that we want to create essentially a regular expression language for string transformations. Where a classical regex is a well-defined formal entity with properties that let me convert it to an automaton, which I can use to match strings, for Bek we want essentially the same thing, but for transducers: automata that take inputs but also produce outputs -- potentially empty, potentially many. So the key idea is to create a domain-specific language that programmers can actually use to write sanitizers, and to convert it to this formal model so we can do interesting analyses. At a higher level, let's say there's this gap between code and the formal model we would like to use; this is a problem that comes up a lot. Rather than punting and separating constraint solving from constraint generation like we did before, we'll actually try to fully close this gap for a limited class of programs. So we'll create a domain-specific language to make the code more amenable to translation into this formal model, and on the right-hand side I've replaced "model" with finite state transducers. The second step will be to make those sufficiently expressive that they easily capture the types of transducers we would want to build based on our domain-specific language. These two steps correspond pretty much one-to-one to our USENIX Security paper, which presents the language and its practical applications in terms of modeling real-world sanitizers, and the POPL paper, which goes into the details of symbolic finite state transducers -- an abstraction that we show to be strictly more expressive than classical transducers and yet still very analyzable and very useful.

So let's talk a little bit about the approach, and I'll start with a short Bek program. This is a program -- I won't go into the details -- that escapes quotes: double quotes and single quotes get a backslash in front of them unless they are already escaped. The key parts of this program are an iteration over some input string s, and a Boolean variable, updated across iterations, which captures whether the last thing we've seen is a backslash or not -- so as to avoid double escaping. I won't go into detail about what this code looks like as a transducer and so on; you can try that on rise4fun.com, where there's an online demo which includes examples similar to this and allows you to visualize the transducer, perform analyses, and so on. But one thing that we might want to prove about this transducer is that we implemented it correctly.
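(For reference, here is a minimal C# sketch of the kind of single-pass escaping routine being described -- a loop over the input with one Boolean of state to avoid double escaping. It illustrates the style of code that Bek models; it is not the actual Bek program from the slide.)

    using System;
    using System.Text;

    class EscapeQuotesDemo
    {
        // Escape single and double quotes with a backslash, unless the quote is
        // already escaped (i.e. the previously emitted character was an unescaped backslash).
        static string EscapeQuotes(string s)
        {
            var sb = new StringBuilder();
            bool prevWasBackslash = false;   // the single bit of state carried across iterations
            foreach (char c in s)
            {
                if ((c == '"' || c == '\'') && !prevWasBackslash)
                    sb.Append('\\');         // insert the escape character
                sb.Append(c);
                prevWasBackslash = (c == '\\') && !prevWasBackslash;
            }
            return sb.ToString();
        }

        static void Main()
        {
            // Escapes the unescaped single quote, leaves the already-escaped double quote alone.
            Console.WriteLine(EscapeQuotes("it's \\\" fine"));   // prints: it\'s \" fine
        }
    }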
So if I apply this to the same string twice, will it end up double-escaping quotes, or will it do the right thing and escape them exactly once?

All right, a slight transition to the formal definition of symbolic finite state transducers. I'll go over this at a very high level. The basic definition is something you might have seen before: it's a four-tuple with a set of states, a start state, a set of final states, and a transition relation. All of the state-related parts are pretty much as you would expect; the main difference is in the transition relation. Each transition goes from a state q to a state r, and the edge has two annotations: a formula, phi, and, let's say, a bold f, which is an output -- a sequence of output functions in this case. The phi part goes from the input alphabet to the Booleans; it decides whether the edge can be taken on a given character. The f portion is an output sequence: I provide some number of functions that take the input character -- basically lambdas -- and turn it into an output, and the star means we can have a sequence of those. I guess the only really notable thing about this formal definition is that the star is outside the function: normally I would say the output goes to a list of characters, but here we explicitly require a sequence of functions of some bounded length. This helps make the algorithms even more decidable than they already are, so to speak, and the definition buys us a clean separation between state-related operations on the one hand and background-theory-related operations on the other. This work is an extension of Margus's work on symbolic finite automata, taking it into the symbolic finite state transducer space, where we have outputs. What's nice is that it works for any decidable background theory -- anything for which we can do satisfiability and witness generation in Z3, let's say. In practice we use the theory of bit vectors to represent characters, and we can do that quite efficiently for large alphabets, including UTF-16 and so on.

For further details I would refer you to the POPL paper, at least with regard to what we can do with symbolic finite state transducers. The high-level idea is that we define a formal algebra of operations that is closed over symbolic finite automata, which represent values, and symbolic finite state transducers, which represent transformations. Based on that algebra you can implement a lot of different interesting analyses: you can do relational subsumption; if you do it in both directions you get relational equivalence; and you can use that to test for, let's say, idempotence -- can I apply this sanitizer twice, and will it double-encode things or not -- and commutativity, and so on. So that concludes my portion of the show on that. Like I mentioned, it's available as a demo on rise4fun, and I think it's a great demonstration.

So, in conclusion, I've presented two complementary approaches to dealing with code that manipulates strings, from a security perspective or, more generally, from a validation or test case generation perspective. For string constraint solving there are a number of different concretely available tools that are open source and so on, and they all solve a similar set of constraints. In contrast to that, the Bek project is a singleton at the moment, and it models a subset of important string-manipulating functions more directly.
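(As an aside, here is a minimal C# sketch of how one might represent a single transition of a symbolic finite state transducer as defined above -- a guard predicate over characters plus a bounded sequence of output functions. This only mirrors the definition from the talk; it is not the implementation behind Bek.)

    using System;
    using System.Collections.Generic;

    // One transition of a symbolic finite state transducer: from state q to state r,
    // guarded by a predicate over input characters, emitting a bounded sequence of
    // output functions applied to the consumed character.
    sealed class SftTransition
    {
        public int From;                        // state q
        public int To;                          // state r
        public Func<char, bool> Guard;          // phi: input alphabet -> Booleans
        public List<Func<char, char>> Outputs;  // bounded sequence of output functions
    }

    class SftDemo
    {
        static void Main()
        {
            // Hypothetical edge: on a single quote, emit a backslash followed by the quote itself.
            var escapeQuote = new SftTransition
            {
                From = 0,
                To = 0,
                Guard = c => c == '\'',
                Outputs = new List<Func<char, char>> { _ => '\\', c => c }
            };

            char input = '\'';
            if (escapeQuote.Guard(input))
                foreach (var f in escapeQuote.Outputs)
                    Console.Write(f(input));    // prints \'
        }
    }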
So in terms of, let's say, my work, I would say that the bulk of it has focused on string constraint solving; a relatively small portion of it so far -- two summers -- has focused on Bek, and I would love to continue working on this.

In terms of future work, I have a couple of stray bullets which I will cover very briefly. Basically, we're looking into closer integration of string constraint solving into SMT solvers. There's a draft for a standard that would add this to the SMT-LIB set of theories that we know how to deal with, and potentially generate benchmarks that would allow future tools to participate in SMT-COMP while solving string constraints specifically. Another potential use for Bek, for example, would be as an educational tool. There's some of this available at MSR already, but basically, let's say I learned to program in QBasic in 1992 or some such, and QBasic had as its main points in favor very good documentation, a very good debugger, and so on. What if we could extend something along those lines into a more domain-specific tool -- say, using Bek for string manipulation -- and do automatic checking of student-submitted code by using the fact that we can check for program equivalence? So if there's a gold-standard correct version of a function, we can always tell you exactly why you're wrong, if you're wrong. The other point I briefly wanted to mention is Bek for other domains. I've done work on wireless sensor networks, using different models to, I guess, model a distributed system in this case; it is a common approach to analyze distributed systems using automata. I think it would be very interesting to see if there's a front-end language extension to Bek that would be amenable to use as an easy way to write wireless sensor network applications that run on distributed, heterogeneous hardware, with guaranteed properties of the communicating programs. So that's it for my talk, and I'll be very happy to take any questions.

[applause]

>> Margus Veanes: Plenty of time for questions -- I think formally half an hour. So, please.

>>: So actually I didn't quite understand: what do you do with these transducers, and how are they related to the security of Web pages?

>> Pieter Hooimeijer: Sure. The basic notion is that a function like HtmlEncode is written in a certain style. We looked at a bunch of different implementations for the USENIX paper, and they follow this pattern: a loop over the string that keeps some Boolean state and looks at a sliding window of the input, converting characters into other characters as needed. It turns out that the translation from, let's say, a Java implementation of HtmlEncode to Bek -- which in this case is the front-end language to our tool -- is fairly direct. The purpose of the language is to mimic the style in which programmers already write low-level string-manipulating code such as HtmlEncode. The transducer is the formal model which we can use to analyze things like equivalence. These are single-valued symbolic finite state transducers, and we show that equivalence checking of those transducers is decidable, which means that if you give me two different versions of HtmlEncode, both implemented in our language, we convert them both to transducers and check whether they are equivalent. If they are equivalent, we say yes. If they're not equivalent, we say no, and here's a differentiating input that produces different results if you run it through the two transducers.
So that's the core of the Bek project: basically a language that describes a transducer in the same way a classic regex would describe a set of strings.

>>: What are your thoughts about -- in C programming you might do a lot of this by hand, just because you didn't know there were tools that could do it for you. If you --

>> Pieter Hooimeijer: What is "this"?

>>: Sanitization.

>> Pieter Hooimeijer: Sure, sure.

>>: So if you extended the C language, or any other programming language, with these features -- essentially putting Bek into the language -- is it there yet? Would that be useful for languages, or is it just going to be a handful of people using it?

>> Pieter Hooimeijer: I think that's a great question. So I guess the basic question is: what about ad hoc sanitization -- can programmers use this in real life? The answer is yes. There's an ongoing effort to turn Bek back into real-world code. In the same way you might use lex and yacc to write a specification which gets turned into relatively obtuse-looking C code, we have a similar project underway -- or at least Margus has been working on this, I believe -- which is basically the translation from Bek into various mainstream languages. So it actually becomes quite feasible, once you recognize that your current task is sanitizing a string in, let's say, a single pass, to write it in Bek instead. You can do some analysis before generating the code that will then link into your application. Patrice?

>>: I was wondering, today, what's the main source of string constraints -- where do they come from?

>> Pieter Hooimeijer: Sure, the question is where string constraints come from. Literally --

>>: High level.

>> Pieter Hooimeijer: From a slide. I think the answer is this: if you look at modern Web development, there are a lot of templating languages out there. I would say that if you were to analyze them in the same way we analyzed PHP code -- which represents a slightly older class of Web application, in my opinion -- then you might look at the fact that these templating frameworks may apply some auto-sanitization. In order to reason about their correctness, you would need to symbolically execute through that library code or develop some model of it. And in that way the string constraint solving approach is almost exactly complementary to the Bek project: we might use Bek to essentially summarize different parts of library code, and we might use string constraint solving for executing or evaluating single paths across library boundaries. Does that answer your question?

>> Margus Veanes: More questions?

>>: I have a question about one of the experiments you did -- the experiment about the long strings. The example regexes you gave are classical examples of regexes that blow up if you try to determinize them.

>> Pieter Hooimeijer: Yes, definitely.

>>: Recognizing that some character occurs at a certain position from the end.

>> Pieter Hooimeijer: Yes.

>>: But the experiment was on the intersection, not the difference.

>> Pieter Hooimeijer: Yes.

>>: So none of these tools tried to determinize.

>> Pieter Hooimeijer: This is true. That would make all the tools slower, especially the automata-based ones. That's a good suggestion.

>>: Taking the difference.

>> Pieter Hooimeijer: Right. Doing difference instead of intersection is generally slower. I wonder if that's also the case for Hampi -- maybe Hampi comes out looking a little better.

>>: Related to that, was the Rex used in this case the one using BDDs?
>> Pieter Hooimeijer: It was the more recent implementation that we had a license for -- the Rex implementation that used concrete ranges, I believe. So it's not exactly the fastest Rex implementation; BDDs would be faster, but BDDs tend to consume a lot of memory and run out in some cases, at least in that particular version of the code, so we went with the version that returned results for all inputs. I guess I should qualify that graph by saying that the Rex results could have been a little faster if we had used BDDs, but then there would be fewer data points to look at. Tom?

>>: So one of the things that makes the approach viable in general is that these general-purpose languages you're analyzing have a domain-specific sublanguage that closely matches your domain, right?

>> Pieter Hooimeijer: Right.

>>: So if you wanted to extend this, for example, to trees, then the way people construct trees and match over trees in programs might not be as regular, shall we say, as it's encoded in the general-purpose language.

>> Pieter Hooimeijer: Sure.

>>: What would you -- if you had to sort of generalize your approach to tree-manipulating code --

>> Pieter Hooimeijer: The question is what about data structures with multiple successors instead of just one -- so trees instead of, let's say, lists. It's a very interesting area of research. I think Margus is looking into trees this upcoming summer, in particular tree transducers, and this goes back to your earlier question about Mona as well: Mona has specialized support for this. It's something that I've been meaning to look into on and off over time. I think in general it would be very useful, but, like you said, we benefit a lot from the fact that high-level library operations over strings look a lot like, let's say, string constraint solving combined with some Bek. So I'm not totally sure that many tree manipulations take the same form.

>> Margus Veanes: So, any more questions? No? Then I think we're done. We'll thank Pieter once more.

[applause]