>> Margus Veanes: Good morning, everybody. This is Pieter's candidate
talk. And it's a pleasure to welcome him back. So Pieter Hooimeijer
is from the University of Virginia, working with Westley Weimer on
various problems related to strings and that's also the topic of his
talk today.
Personally I have known Pieter for roughly two years. Or actually, no,
even two and a half or three years. He did his first internship here
with me in 2010, which was work related to strings. And then last year
he did a second internship following that, on the same topic.
And some of the stuff he'll talk about today will touch upon these
topics, the things he did here at that time. So with this -- and
also let me say that Pieter has been working on other stuff, other
areas like empirical software engineering and sensor networks. So he
has a pretty broad scope from other areas as well.
So with that, I think I'll give Pieter -- let Pieter start talking.
>> Pieter Hooimeijer: Thank you, Margus. Thanks everyone for coming.
So I'll be talking about strings. The working title of this talk is
Pieter talks engagingly about a single data type for one hour.
So in order to make that happen I figured I'd tie this to the audience
by saying: Imagine you're currently using your laptop. So you might
have a Lenovo which looks a lot like this, imagine you're browsing a
website, let's say you're looking at Stack Overflow to scout out your
competition, and it turns out Jon Skeet has an impossible number of
points, you'll never be able to beat him. But you're looking at his
profile.
And the developer who implemented this page decided it would be a good
idea to allow users to provide their own profile picture. So in this
case they can provide an address that is the address of the image file.
So in this case I have a very bare bones image tag on the slide, and
the idea is that we'll treat the source attribute in this case as
untrusted input.
And as a developer, let's say I'm quite competent but not super
security savvy, I might ask what could possibly go wrong; and, of
course, the answer is it depends. It depends on what we do with this
untrusted input. So the address is user provided. We should make sure
that it doesn't do anything bad to our page. And let's imagine that we
don't do anything, and in that case an attacker might provide a
specially crafted attack string, in this case an address that contains
a single quote and allows the attacker to escape from the source
attribute and enter, let's say, their own attributes. For example, an
onload attribute that loads arbitrary JavaScript into the page.
So this is a cross-site scripting attack. You may have heard of them.
This is very common, and this is perhaps not the most exciting example
of one; but, nevertheless, the problem here is that the attacker can
run code whenever someone visits this profile page. And in addition to
annoying alert boxes, they might execute website functions with the
privileges of the user currently viewing the page, and this is
problematic if you're a bank or if you really care about your score on
stack overflow.
All right. So there's a lot of research on cross-site scripting. If
you go on Microsoft's academic search engine and look for this, you'll
turn up hundreds of papers, for some reason in a variety of areas not
related to computer science as well, but I'll assert that the majority
of results come from computer science, and they tend to try to mitigate
this problem if it happens, or to prevent cross-site scripting
altogether through some constructive means.
Either way, the point is there's a lot of work on this, and part of my
research aims to generalize some of the insights gleaned from the
existing research on, let's say, cross-site scripting or SQL injection.
In short, vulnerabilities related to string manipulation.
So with that, I'll give you a dark slide. The talk will be two parts.
So I'll talk about my string constraint solving work, which is roughly
my dissertation, for the first, let's say, half, and after that I'll
briefly touch on the Bek project, which is my internship work with
Margus from the last two years, among a few other things.
So let's talk about string constraint solving, sort of what it is, why
I care and why you should care. So the structure of this will be
roughly like this. I'll go into some background to sort of explain
what I mean by string constraint solving and why you should care, and
also following that I'll talk about sort of building a string
constraint solver in this case sort of my first paper on the topic, and
then I'll talk about tuning, which is on the topic of a few other
papers trying to make this sufficiently fast for people to actually be
able to use this stuff.
So roughly background definitions, evaluation or performance tuning.
All right. Start with background. So there's a number of constraint
solvers out there. You may have heard of some of them. There is this
notion of doing program analysis, using a constraint solver. This is
very common. It is prevalent in a variety of contexts.
So everything from automated testing to static analysis, doing model
checking and so on, you typically end up using one of these. So this
is sort of the MSR potentially redundant slide. There's a number of
different implementations. One key point to keep in mind is that
there's a standard input format. At least in principle, if you don't
use terribly esoteric features, you might interchange these tools. And
this also allows for doing annual competitions and so on. So this is
the state of the art in terms of mathematical constraints, typically
involving, let's say, integers, bit vectors, data structures and so on.
So a very simple example. If I have X squared is 25, I want to know
what X might be. I can solve for X using an SMT solver. Looks like
something like this. The declarations roughly match what we have and
sort of straight math notation. And in this case my result is that X
can be 5. There might be other solutions, but we just have
one particular example here. And what's most important is that we know
it is possible to satisfy these constraints.
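As a rough illustration of what such a query looks like in code, here is
the same x squared example using the Z3 Python bindings; the choice of
Z3 and of its Python API here is purely illustrative, not something the
slide prescribes.

    from z3 import Int, Solver, sat

    x = Int('x')
    s = Solver()
    s.add(x * x == 25)        # the constraint from the example
    if s.check() == sat:
        print(s.model()[x])   # one satisfying assignment, e.g. 5 or -5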
So what about strings? It turns out existing SMT solvers have not
traditionally had a theory of string constraints. So bit vectors are
well represented, like I mentioned sort of various types of arithmetic
on integers.
Some stuff involving quantifiers even, but not necessarily strings. So
I've said here on the slide that reasoning about strings is difficult.
And I mean that in sort of an informal way.
For programmers, it is apparently difficult, because they write
websites that have errors in them, like cross-site scripting
vulnerabilities. And it is difficult for automated tools sort of for
other reasons, which I'll go into. So in that case I guess the
definition of difficult is a little bit more formal.
So there's a number of existing approaches or, rather, approaches that
exist now that did not exist previously. These are the names of four
tools I will feature in the talk. And each of these tools is
essentially a domain-specific SMT solver. Let's say a string
constraint solver for just string constraints. So DPRLE is a tool I
presented at PLDI '09. And Hampi came a month later, at ISSTA '09, and
I'll assert all these tools were published, let's say, in the last
three years.
>>: So what about the whole second order stuff, Mona and all that, it
goes way back, right?
>> Pieter Hooimeijer: Sure.
>>: So -- what about Mona?
>> Pieter Hooimeijer: So the question is what about Mona?
>> Pieter Hooimeijer: In the mid 90s there was a bunch of research on
writing dedicated solvers for monadic second-order logic, using
multi-track automata, depending on which fragment. And those are in
many ways theoretically similar. They solve, in the multi-track
automata case, constraints that are equivalent to a regular language.
But they are not sort of specified -- they're not specifically designed
for string constraint solving in a way we do here. So there's one
paper, I forget where, I forget the venue, a recent paper that uses
Mona to solve string constraints, and it turns out this is possible.
And you can express regular languages and so on. This is well known.
However, it is not particularly performant. So there's still sort of
space for domain-specific tools and so on. And I think it would be
very interesting to look at sort of the deeper implications, since,
let's say, the Mona implementation is very highly tuned and so on.
And we have some of the same stuff showing up in string constraint
solvers that showed up in Mona, such as the use of BDDs, for example.
>>: What makes these tools domain-specific?
>> Pieter Hooimeijer: So the question is what makes them
domain-specific? Very bluntly put, they are domain-specific in the
sense that that is how they are presented in the papers that publish
them. And they're domain-specific in the sense that they solve, in
this case, regular language constraints.
So if you do symbolic execution on some code and it generates some
string constraints, so stuff like this string should match this regular
expression, then you can use one of these tools to solve those
constraints. Does that answer your question?
>>: That seems pretty general to me.
>>: But in terms of logic, what are the -- compared to Mona, for
example, are these languages logically, formally more restrictive
than --
>> Pieter Hooimeijer: That's an interesting question. And the short
answer is I don't know the sort of full specifics, or at least not well
enough to state them on the record. But I think a fragment of monadic
second-order logic is definitely representable using automata. Most
of these constraints are representable using automata, but the
structure is slightly different. So my guess is that they would be
similar. So there's a notion of repeated or iterated automata
intersection being PSPACE-complete. And I believe that is the case
for sort of both things here. But that's the best I can do in
terms of a specific upper bound, let's say.
Are you satisfied, Chas? Thanks. So, yeah, so four different tools,
all published in the last four years, and the basic notion is that we
will generate constraints that look an awful lot like string
manipulating code, code that might look like C#, and we'll pass them to
sort of any of these tools, which have roughly equivalent input
languages, sort of sufficiently similar to be engineering feasible.
And we'll get out some answer. So where previously we got out a
response that says X should be 5, now we get a response that says
string variable A should get concrete string value A/B.
So very briefly I've shown a very short example constraint. Talked
about different solvers, and what the shape of a solution looks like.
So an assignment to string variables. But what that doesn't tell you
is sort of where the constraints come from. So I've already touched on
this, but the basic notion is that we'll sort of punt on this.
So we'll say stuff like you can use standard techniques to generate
constraints that include string operations like in this example, let's
say, symbolic execution. So if I want to exercise the if statement in
this code, I can generate a constraint system that, instead of branching
on r.IsMatch(a), asserts that it is true, and I can solve those
constraints and find inputs that exercise this path through the code.
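To make that concrete, here is a minimal sketch of the idea; the regex,
the variable name, and the brute-force search are illustrative
assumptions rather than any particular tool's interface.

    import itertools
    import re

    # Instead of executing `if r.IsMatch(a): ...`, symbolic execution records
    # the branch condition as a constraint on the symbolic string input `a`.
    path_constraints = [('a', r'[0-9]+\.png', True)]  # (variable, regex, branch taken)

    def solve(constraints, alphabet='0.png', max_len=6):
        # Tiny enumeration-based stand-in for a real string constraint solver.
        for n in range(max_len + 1):
            for cand in map(''.join, itertools.product(alphabet, repeat=n)):
                if all(bool(re.fullmatch(rx, cand)) == taken
                       for _, rx, taken in constraints):
                    return cand
        return None

    print(solve(path_constraints))  # e.g. '0.png', an input that takes the branch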
But, in general, we separate constraint generation as a problem from
constraint solving. And this is sort of common to all of the tools and
presentations that I've shown, and we'll focus at least for purposes of
this talk on constraint solving.
So we'll assume there is a reasonable way to generate string
constraints without really going into the details of how to better
generate constraints for strings.
In practice, we evaluate whether our assumptions about string
constraint solving or our assumptions about string manipulating code
are correct empirically. So we'll use standard techniques and see if
we can solve the constraints that result.
All right. So so far I've talked mostly in terms of examples. And
I'll sort of continue to do so. But in general I think this raises a
question of scope. So what is a string constraint, as you've already
sort of asked, and what is not? So one example might be MD5, which is a
cryptographic hash or at least a one-way hash, and it is technically a
function that takes a string and outputs a string. So if you wanted to
do constraint solving, and we sort of didn't narrow our scope, that
would be fair game, except there's sort of separate papers on reversing
cryptographic hashes, and sort of very domain-specific problem that we
may not want to solve.
So here's a different example, which, if you've seen this slide, you
are prohibited from answering: how hard is it to do a simple one
string/one regular expression match in Perl? Some of you may have seen
this slide at VMCAI maybe last year, and the answer is, you can't see
it because it's colored blue very close to the black: it's NP-hard.
And the reduction is from 3-SAT. So this is a five line piece of Perl
code that, given a string, or rather given an array representation of a
3-SAT problem, turns it into a single string and a single specially
crafted regex that uses back references in order to solve SAT.
So if you ever need a five-line SAT solver, there you go. So this is
not my result. This is something that was sort of floating around on
the Perl mailing list some number of years ago.
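Not the original five-line Perl, but a sketch of the same reduction in
Python, purely to illustrate the trick: a CNF formula becomes one string
plus one backreference regex that matches exactly when the formula is
satisfiable.

    import re

    def sat_via_regex(num_vars, clauses):
        # clauses are lists of nonzero ints: [1, -2] means (x1 or not x2).
        # Per variable i: group 2i-1 ("true") and group 2i ("false"); exactly
        # one of them captures the single 'x' in that variable's "x;" block.
        pattern = ''.join('(x?)(x?);' for _ in range(num_vars))
        text = 'x;' * num_vars
        for clause in clauses:
            # The clause's 'x' can only be consumed by a backreference to a
            # group that captured 'x', i.e. by a literal that is true.
            refs = '|'.join(r'\%d' % (2 * abs(lit) - (1 if lit > 0 else 0))
                            for lit in clause)
            pattern += '(?:%s);' % refs
            text += 'x;'
        return re.fullmatch(pattern, text) is not None

    print(sat_via_regex(2, [[1, -2], [-1, 2]]))  # True: satisfiable
    print(sat_via_regex(1, [[1], [-1]]))         # False: x1 and not x1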
So this still doesn't answer our question of scope. But in general it
does answer the question of whether we can model sort of anything that
shows up in the wild, and the answer is probably not. So MD5 is an
obvious example. Regular expressions in Perl are perhaps not so
obvious. It turns out the exact formal language class of Perl regular
expressions is sort of nebulous. We'll focus in this talk on
constraints that we do know how to solve, and in practice that usually
means we'll go with the sort of strictly regular component of real
world regular expressions.
All right. So let's talk about my first attempt at building one of
these tools. We'll sort of see the implications. So at PLDI '09, I
provided some definitions of basic string constraints, and the little
logo on there is meant to indicate that we defined this stuff in Coq.
So we went through the trouble of defining sort of strings from first
principles, string constraints of interest from first principles, and
then showing that our core solving algorithm is sound and complete
relative to those definitions.
And we also provide an implementation, and this is one of those stories
where the implementation is strictly not related to the formal proof.
But it solves string constraints in practice. And we use it on an
existing benchmark to show that we can generate attack inputs for 17
known SQL injection vulnerabilities.
So we look at some corpus of PHP code. We use off-the-shelf techniques
to generate constraints. And then we solve them using our tool
measuring, let's say, running time and whether or not it works.
So rather than defining formally what the constraints of this
particular presentation look like, I will focus on a quick demo. So
there's an online Web version of this tool available. If you're
currently on your laptop, the URL is something along virginia.edu. And
if you happen to be playing around with your phone, it turns out I
wrote a quick script in Touch Develop. So if you want to follow along,
look for the [inaudible] script in Touch Develop; it is the shortest
Touch Develop script you have ever seen. It links you to the website,
which I will show you now.
All right. So for this short demonstration, I want to emphasize a few
things. And for this particular tool, variables represent regular
languages. And I took that very literally in the tool implementation.
So they literally are automata definitions, concrete ones. So there's
two sort of aspects to this. One is regular inclusion constraints.
And the other is concatenation. So it turns out if you support those
two operations, you can do the majority of -- you can model the
majority of constraints that originate from the symbolic execution of
string manipulating code in some way or another.
Sometimes less direct. Sometimes more. So on the slide is a short
example. So I apologize for the sort of limited visibility of black
and colors. I guess the main notion is one that Tom already raised,
which is that there's an equivalence between other logics and what we
call string constraint solving. So the example I have on the slide is
something where I want to do modular arithmetic. So I have two
automata, twos and threes, that represent in unary the sets of numbers
that are multiples of two and three.
What I'll do is I'll define 9 as a single number. That's a little
tedious in automaton representation, but let's imagine we can also do
regexes. And what I want to find out is how many ways I can put twos
and threes together as multiples to form 9.
So this is truly the arithmetic example from middle school, let's say.
So the way I express that constraint is by saying 2 concat 3, or twos
and threes, should be a subset of 9, which is a single string. So if I
want all solutions, I use solve all in this case, and then there's some
additional machinery required to select the two solutions and display
them.
So let's hit submit on that and hope that I have Internet. So looks
like it ran successfully. So, again, this is a tool implemented in,
let's say, 2008. So it's been a while. The result is two disjunctive
solutions. The first looks like this: for twos, it looks like we
multiplied by 3, so I end up with 6 out of the 9, and then one 3. So
in other words, 6 plus 3. And my second solution looks like zero 2s
and exactly three 3s. So I've done a very basic modular arithmetic
example to show the decomposition of the number 9.
So this is an example meant to illustrate sort of interesting
properties. Most notably the use of concatenation in this case to
separate the 2s and the 3s and get separate solutions for them. The
other is that it is not sufficient to say here's a regular language for
each variable. And I have two disjunctive solutions. It does not work
for me to mix them, say taking the zero 2s from one solution and the
three 3s from the other. So, in other words, these solutions are
inherently disjunctive. I can't just sort of merge them together into
a single regular language. So if I have
multiple solutions, they will be sort of strictly separate. So in
practice we probably won't be doing modular arithmetic. But I figured
it would be a sort of interesting example to show a known equivalence
using a new tool in this case.
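For comparison, here is a brute-force sketch of the same constraint,
just to confirm the two disjoint solutions; this is only an
illustration, not the tool's algorithm, and the variable names are made
up.

    target = 'a' * 9  # the language containing only the string of nine a's

    solutions = set()
    for split in range(len(target) + 1):
        twos, threes = target[:split], target[split:]
        # membership in the regular languages (aa)* and (aaa)*
        if len(twos) % 2 == 0 and len(threes) % 3 == 0:
            solutions.add((len(twos), len(threes)))

    print(sorted(solutions))  # [(0, 9), (6, 3)]: the two disjunctive solutions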
>>: Do you have a slide on the actual logic?
>> Pieter Hooimeijer: I do not. I mean, it's really variables are
regular languages. We have grounded regular inclusion constraints. So
there's a single constant regex that is the right-hand side of this
subset constraint, and concatenation. So I can say stuff like variable
one concat variable two should be a subset of this regular set, which
is really all the core constraint language handles.
And this is also, this is equivalent to constraints in other papers.
So in the Hampi core language, this is essentially the same, except
that we bound the length of each string, because it's a reduction to
SAT. And in this case the length is unbounded.
So moving on. The evaluation, as I mentioned, was on an existing
corpus produced by Wassermann and Su at PLDI '07.
They do a static analysis that finds SQL injection vulnerabilities.
And the sort of main motivator at least for me personally at the time
was the output of this tool, which looked a lot like there might be a
bug on line 58 of your code and nothing else. So we wanted to add
indicative inputs to this static analysis. In order to do so we had to
put together some additional machinery.
So we had to find a path to this particular potential bug and
symbolically evaluate that path to find constraints. So we use running
time as a metric. Like I said, it's 17 vulnerabilities. So it's not
the biggest corpus for doing sort of statistically significant
performance results. But we wanted to know if this is feasible just as
a first cut.
And the results are that, yes, we can generate successful attack inputs
for these constraint systems. And our running time is between about
100th of a second and about ten minutes. So that's quite a big range,
even across a very limited sample set. So more on that later.
In fact, I'll talk about that now. My next sort of step in this
process was finding out faster ways of doing this. So I'll go into the
context in a little bit. But in general the idea at the time was
there's two competing approaches. One was the Hampi tool, which I was
a co-author on, which uses a reduction from string constraints to bit
vector constraints which then get turned into a SAT problem.
And that tool is relatively fast in practice. Faster than the tool I
just showed you, in fact. Nevertheless, we had this feeling that
automata-based constraint solving would be faster, and it seems sort
of obvious in hindsight. But at the time it wasn't immediately clear
to me why.
>>: My impression is that the Hampi tool is bounded string length,
bounded string encoding, right.
>> Pieter Hooimeijer: Yes.
>>: It's not comparable to what you are doing because you are --
>> Pieter Hooimeijer: Yeah. So, yeah, the Hampi problem is NP-complete
and ours is not believed to be.
>>: What is the complexity of your problem?
>> Pieter Hooimeijer: So like I said, I believe it is PSPACE-complete,
but I have not formally evaluated that. I focused more on empirical
performance evaluation instead.
All right. So there's two papers on this one. One is a VMCAI paper
which I wrote with Margus during my first internship in Redmond. And
it does basically data structure selection. So we implemented a bunch
of different techniques from known automata libraries in the same
context and used them to do some string constraint solving problems on
one variable, which reduces to single automaton intersection, or single
automaton determinization and computing the inverse.
I won't talk about that paper in a lot of detail for lack of time. But
I think the key point there is that we did as rigorous an evaluation
as was reasonable, in the sense that we fixed pretty much
everything down to the front end parser, the language of implementation
and so on. At the time existing work had the more usual structure of
we have a tool, it works better on some benchmarks than some other
tool, so therefore we win. This is sort of a reimplementation of
existing techniques, not a novel technique, to see which data
structures work best, for example, for representing large character
sets and intersecting them efficiently.
What I'll focus on instead is the ASE 2010 paper, which essentially
takes some of the results from VMCAI and implements them in a real
solver.
In this case I sat down and decided to code this stuff in C++, so it
has some engineering benefits as well as some of the sort of data
structure algorithm selection that we gleaned from this other paper.
So the approach is just like the solver you saw before, except that
wherever we relied on sort of full automata operations previously,
we'll now do stuff as lazily as possible. For finding a single string
in an automaton, that means we find a path to a final state without
looking at other parts of the automaton if we can avoid it.
If we do intersection, it's essentially the same thing. It turns out
for multivariate string constraints that becomes a little trickier. In
some cases we may have to find paths that are essentially circular in
nature. If I have a constraint A follows B in some regular language,
and then I have a different constraint, B follows A and some other
regular language, it suddenly becomes very tricky even to sort of
figure out where I should start in the hypothetical search space which
may be quite large. So I won't go into detail about the algorithm too
much. It's basically a wall of pseudo code in one of my papers as well
as my dissertation with some examples and so on.
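To give the flavor of the lazy approach in miniature: search the product
of two automata on the fly, materializing only the state pairs that are
actually reached, and stop at the first witness. The two example
automata below are illustrative, not from the paper.

    from collections import deque

    def lazy_intersection_witness(dfa1, dfa2):
        # Each DFA is (start, accepting_set, {(state, char): next_state}).
        (s1, acc1, d1), (s2, acc2, d2) = dfa1, dfa2
        alphabet = {ch for _, ch in d1} | {ch for _, ch in d2}
        queue = deque([((s1, s2), '')])
        seen = {(s1, s2)}
        while queue:
            (q1, q2), word = queue.popleft()
            if q1 in acc1 and q2 in acc2:
                return word                      # shortest string in both languages
            for ch in sorted(alphabet):
                nxt = (d1.get((q1, ch)), d2.get((q2, ch)))
                if None not in nxt and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + ch))
        return None                              # the intersection is empty

    # strings over {a, b} with an even number of a's ...
    even_a = (0, {0}, {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1})
    # ... intersected with strings that end in 'b'
    ends_b = (0, {1}, {(0, 'a'): 0, (1, 'a'): 0, (0, 'b'): 1, (1, 'b'): 1})
    print(lazy_intersection_witness(even_a, ends_b))  # 'b'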
If that interests you, please have a look and I'll be happy to talk
about that off line as well. So instead I'll focus on the evaluation,
sort of the hard numbers in this case. So we do a bunch of different
experiments, and I will sort of skim over the two middle ones of these.
So I'll do a comparison with Hampi, which is like I said a different
tool that I was involved with. And I'll do the long strings
experiment, which is sort of designed to illustrate some of the
limitations of other string constraint solvers in this case.
So let's talk about Hampi, do a little bit of background in terms of
the Hampi architecture. So, like I said, where I went the route of
using automata for this stuff, which made stuff easy to prove but not
particularly efficient, for Hampi my co-authors came up with a separate
algorithm, which is a reduction to bit vector constraints. So for a
given regular language, it sort of enumerates the various possibilities
of characters appearing in a particular position, in a single bit
vector that represents the entire constraint system.
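This is not Hampi's actual encoding, but a sketch of the general
bounded-unrolling idea behind it: fix a length, unroll an automaton for
that many steps, and let the underlying solver choose the characters.
The DFA, the bound, and the use of the Z3 Python bindings (standing in
for STP) are all assumptions for illustration.

    from z3 import Int, Solver, Or, And, sat

    # DFA for the regex (ab)*: state 0 is the start and only accepting state.
    delta = {(0, 'a'): 1, (1, 'b'): 0}
    accepting = {0}
    alphabet = ['a', 'b']
    N = 6  # fixed length bound

    solver = Solver()
    chars = [Int('c%d' % i) for i in range(N)]       # character code at position i
    states = [Int('s%d' % i) for i in range(N + 1)]  # automaton state before position i

    solver.add(states[0] == 0)                           # start in the start state
    solver.add(Or([states[N] == q for q in accepting]))  # end in an accepting state
    for i in range(N):
        solver.add(Or([chars[i] == ord(c) for c in alphabet]))
        # each step must follow some DFA edge that agrees with the chosen character
        solver.add(Or([And(states[i] == q, chars[i] == ord(c), states[i + 1] == r)
                       for (q, c), r in delta.items()]))

    if solver.check() == sat:
        m = solver.model()
        print(''.join(chr(m[c].as_long()) for c in chars))  # e.g. 'ababab'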
And, like I said, this approach worked well in practice. But there's
some question of how much can we improve this. So Hampi internally
uses the STP bit vector constraint solver, which internally uses
MiniSat. So there's several layers here already. And that makes the
performance at least somewhat unpredictable. So for different problem
sizes you might get sort of noncontinuous function of performance
results.
And what we wanted to find out for this particular experiment is how
much could Hampi be improved if we were to have faster bit vector
solvers. One of the main advantages of doing a re-encoding like this
instead of relying on ad hoc algorithms is that if STP were to become
twice as fast, then Hampi can directly benefit from that performance
improvement. So in this experiment we'll assume that we replace
Hampi's bit vector solver with a zero-time oracle that answers bit
vector constraints.
So the task for this experiment will be to do 100 instances of regular
set difference. So I have 10 regular expressions taken from real world
code. They vary in size and so on. And we'll do A set minus B for
each pair. So leading to 100 data points for each length bound that we
give Hampi. And our metric will be the proportion that Hampi spends
solving constraints versus encoding them in the first place.
So our idea will be that we can eliminate the solving time if bit
vector constraint solvers were perfect, but we want to see how much
that improvement --
>>: What is the question you're asking about the set difference, what
do you ask about it?
>> Pieter Hooimeijer: Is it empty or not. The minimal requirement is
finding a single string if it's not empty and reporting that it is
empty if it is.
So here's some results for length bounds one through 15 on the vertical
axis with the proportion running time on the horizontal axis as a stack
bar. So in this case solving is gold on the left and it takes
relatively little of the time. In fact, most of the time is spent in
the exponential, it turns out, encoding step. And in these results I
have sort of light gray as everything else, which you
might consider like parsing the constraints and returning the solution
and so on.
What's actually happening in each individual bar is sort of not clear
in this graph, right? So this is aggregate based on 100 results per
bar. So just to reinforce this a little bit more, let's look at let's
say just N equals 15. And it turns out I have sort of the correlation
or the scatter plot of total running time on the horizontal axis, and
the proportion of encoding and solving time ignoring everything else on
the vertical axis.
So, in general, we see sort of a vague trend where the longer the
running time is, the clearer the difference between solving and
encoding becomes with encoding almost always dominating.
>>: So you already mentioned that Hampi is solving a problem in NP.
But here you've said that the encoding time was exponential.
>> Pieter Hooimeijer: Yes.
>>: So what is going on?
>> Pieter Hooimeijer: There are two things going on. One is
what is the size of a regex. So a linear-size regex represents a
potentially large number of strings, and here we've defined stuff
strictly in terms of the output length. More generally speaking, the
proof of Hampi's NP-completeness is unrelated to its implementation.
>>: I see.
>>: So the problem that Hampi solves is NP-complete, but the
implementation does not go through the polynomial-time reduction to an
NP-complete problem?
>>: So was that never implemented? Why are they doing that?
>> Pieter Hooimeijer: That is a great question, which I will leave for
later.
>>: All right.
>> Pieter Hooimeijer: Thanks. So in short, we find that even if we
were to replace Hampi's solving time with zero, let's say, so eliminate
the gold portion of each bar, it is still orders of magnitude slower
than our fastest automata-based implementation. So let's look at long
strings. This is a benchmark that's sort of designed to show the
performance of string constraint solving tools relative to the output
size.
So I have two regexes. I want to intersect them. In other words, find
a single string that matches both of them at once. The goal will be to
do that parameterized on N.
So in this case the curly braced notation, N plus one and N, represents
repetition of the most recent element in the regex, in this case a to
c, some exact number of times.
So what's notable is that we'll need some string that contains the
substring ab somewhere. And if you do this incorrectly, you'll spend a
long time looking. If you do this correctly, you can do this in sort
of linear time in terms of the output, if you're very lucky, let's say.
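As a quick sanity check of that shape, assume the two regexes look
roughly like the ones below (the exact patterns in the benchmark may
differ): both pin a character at a fixed distance from the end, so any
common string must contain 'ab' at the right offset.

    import re

    def witness(n):
        return 'ab' + 'c' * n  # conjectured shortest common string

    for n in (5, 50, 500):
        r1 = re.compile(r'[a-c]*a[a-c]{%d}' % (n + 1))
        r2 = re.compile(r'[a-c]*b[a-c]{%d}' % n)
        w = witness(n)
        assert r1.fullmatch(w) and r2.fullmatch(w)
    print('witnesses check out')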
So for this experiment, we ran four different tools. So DPRLE, which
you have seen before, and Hampi, and Rex from MSR, and StrSolve, which
is what we'll call our new prototype implemented in C++, and so on.
So the graph shows N from 0 to 1,000 and the time on the vertical axis
is log scale. We see DPRLE is very sort of slow compared to the other
tools in this case. That's because of its eager automata
implementation. Hampi actually fares reasonably well. In fact, it is
a little slower than DPRLE for, let's say, the first N equals 50 or up
to there.
After that, it definitely beats DPRLE, sort of confirming what we
already had as an informal impression. Rex is very horizontal at
around a tenth of a second, and our implementation is sort of pushing
the boundaries of how precisely we can measure running time in this
case, because as it so happens it makes the right choice when
implementing automata intersection. So the take-away from this: in
addition to being a little bit slower, Hampi, for example, also
exhibits a lot of these.
So some bumps in running time which on a log graph don't show up that
well if you look at sort of the nonlog version of this graph, it sort
of fans out. So as you go to bigger and bigger N, the optimizations
that STP and in turn MiniSat make to solve these constraints certainly
matter. For some reason the encoding time dominates, but the solving
time sort of fluctuates. So in some sense this is undesirable because
it makes performance a little unpredictable.
But, in general, if you look at, say, N equals 750 through 1,000 you
can sort of go from, let's say, 60 hours of worth of solving for DPRLE
to a couple of seconds for our tool sort of in the worst case.
All right. So I've talked about string constraint solving, and I guess
the main point there was that we didn't really talk about constraint
generation. So I talked a lot about solving. Solving string
constraints, and this will be good for everyone and so on.
What I haven't talked about is sort of an end-to-end tool that
implements this for a particular class of programs, say. And that's
pretty much exactly what we did with the Bek project. So it's a
complementary approach to what I've presented so far. So this is the
work of two internships, and I'll sort of very briefly skim over the
details.
So let's return to our earlier example: a Web developer implements a
profile page, and let's say that we throw in a sanitizer this time. So
previously we were vulnerable because we didn't include any
sanitization, but there are all these libraries that claim to help
mitigate cross-site scripting attacks.
So we can call HTML encode and this will save us, right? And it turns
out the answer in this case is that it might well work. So in this
case the single quote in the attacker's input string is actually
encoded to ampersand hash 39 semicolon, and now there's no way to
actually escape the source attribute and include code. So this will
just be an image that shows up as, well, URL does not exist. So we've
successfully avoided this attack.
Now as a developer I might ask: What could possibly go wrong? In
other words, did I just fix all my problems permanently? And the
answer is, well, it depends on which library you used. So let's say
library A has a function called HTML encode, and it has been available
to C# developers for some number of years and is sort of well regarded.
And now I have a library B, which through a Bing search I find to be
seemingly equivalent, published by the same people and with, let's say,
the exact same credentials. And it turns out that in this case if
I use library A, I win. My single quote gets escaped. It turns out
library B takes the more formal route, and since the HTML standard says
I shouldn't be using single quotes for my source attribute anyway, it
doesn't escape it, because it's not deemed necessary. Now I still have
the same vulnerability in spite of the fact that I called a function
called HTML encode.
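To make the difference concrete, here is an illustration in Python
rather than the actual C# libraries: two reasonable-looking encoders
that differ on exactly the character this attack needs, the single
quote.

    def encode_a(s):  # escapes the single quote, like library A in the example
        table = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'}
        return ''.join(table.get(c, c) for c in s)

    def encode_b(s):  # leaves the single quote alone, like library B
        table = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;'}
        return ''.join(table.get(c, c) for c in s)

    attack = "x' onload='alert(1)"
    print("<img src='%s'>" % encode_a(attack))  # quote neutralized
    print("<img src='%s'>" % encode_b(attack))  # attacker escapes the attribute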
So it turns out libraries A and B correspond exactly to Microsoft's
AntiXSS library's implementation of HTML encode and Microsoft's .NET
WebUtility library. So it turns out Microsoft is not the only entity
that has concerned itself with what the exact semantics of HTML encode
should be. In fact, if you look at the source code for the PHP
interpreter, the portion that implements HTML encode has been updated
151 times in the last decade.
So I'd say every three weeks on average. In that time span, across
200, 300,000 SVN revisions, it has grown from 135 lines of code, which
I'll assert is relatively easy to inspect and manually verify, to about
a tenfold increase, 1693 lines. So a little bit more challenging to
look at as a programmer and figure out what's going on.
So in the meantime they've added all sorts of flags, for example, to
see if this function is idempotent and so on, to make sure it doesn't
double encode things, and all sorts of other subtle changes in
behavior. If you think about it, that's kind of bad. If you have
millions of Web apps out there that rely on the exact behavior of HTML
encode for their security, then it is not necessarily a good thing that
HTML encode apparently changes its semantics in subtle ways once every
three weeks.
So I am sort of limited in terms of time. I'll go into the background
of the Bek project as well as a very high level overview of our
approach. If you're interested in both the external and internal
evaluation, I'm going to refer you to the papers in just a bit.
All right. So let's talk about background. The key idea is that we
want to create essentially a regular expression language for string
transformations. So where a classical regex is a well-defined formal
entity that has sort of properties that I can convert to an automaton
which I can use to match strings. For Bek we want essentially the
same thing, but for transducers. So an automaton that takes inputs but
also produces a set of outputs, potentially empty, potentially many.
So the key idea is to create a domain-specific language that
programmers can actually use to write sanitizers and convert it to this
formal model so we can do interesting analysis.
At a higher level, let's say there's this gap between code and the
formal model we would like to use. So this is a problem that comes up
a lot. Rather than punting and doing a separation between constraint
solving and constraint generation like we did before, we'll actually
try to fully close this gap for a limited class of programs.
So we'll create a domain-specific language in this case to make the
code more amenable to translation into this formal model. And on the
right-hand side I've replaced model with finite state transducers. The
second step will be to make those sufficiently expressive so that they
easily capture the types of transducers we would want to build based on
our domain-specific language. And these two approaches correspond
pretty much one-to-one to our USENIX Security paper, which presents the
language and practical applications in terms of modeling real world
sanitizers.
And the POPL paper which goes into the detail of symbolic finite state
transducers, which are an abstraction that we show to be sort of
strictly more expressive than classical transducers and yet somehow
still very analyzable and very useful.
So let's talk a little bit about the approach. And I guess I'll start
with a short Bek program. So this is a program -- I won't go into the
details -- but it escapes quotes. So double quotes and single quotes
get a slash in front of them unless they are already escaped. So the
key sort of parts of this program are an iteration over some input
string S and a Boolean variable which will update across iterations
which captures whether the last thing we've seen is a slash or not.
So to avoid double escaping. So I won't go into detail about what this
code looks like as a transducer and so on. It turns out you can try
that on rise4fun.com. There's an online demo which includes examples
similar to this and allows you to visualize the transducer and perform
analysis and so on. But one thing that we might want to prove about
this transducer is that we did it correctly.
So if I apply this to the same string twice, will it end up double
escaping quotes or will it do the right thing and sort of escape them
exactly once.
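As a rough model of that property check: the same loop written directly
in Python, with a brute-force test for double escaping. Bek proves this
kind of property symbolically; this sketch only samples random inputs,
and the details of the escaping function are an approximation of the
program on the slide.

    import random

    def escape_quotes(s):
        out, prev_was_slash = [], False
        for c in s:
            if c in ('"', "'") and not prev_was_slash:
                out.append('\\')            # escape an unescaped quote
            out.append(c)
            prev_was_slash = (c == '\\')
        return ''.join(out)

    random.seed(0)
    alphabet = ['a', 'b', '\\', "'", '"']
    for _ in range(10000):
        s = ''.join(random.choice(alphabet) for _ in range(random.randrange(12)))
        once, twice = escape_quotes(s), escape_quotes(escape_quotes(s))
        assert once == twice, (s, once, twice)
    print('no double escaping found on these samples')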
All right. So slight transition to the formal definition of symbolic
finite state transducers. I'll go over this sort of at a very high
level. The basic definition is something you might have seen before.
It's a 4-tuple with states, a start state, a set of final states and a
transition relation. So all of the state-related parts are pretty much
as you would expect. The main difference is in the transition relation.
So each transition is of the form Q to R, where the edge has two
annotations. In this case a formula phi and, let's say, a bold f,
which is an output. So a sequence of output formulas in this case.
So the phi part goes from the input alphabet to the Booleans. This
decides if the edge can be traversed on a given character, and the f
portion is an output sequence. So I can then provide some
number of functions that take the input. So basically as a lambda and
turn it into an output. And the star here means that we can have a
sequence of those. And I guess the only really notable thing about
this formal definition is that the star in this case is outside of the
sort of function scope.
So normally I would say it goes to a list of characters. In this case
we explicitly require a sequence of functions for some number that's
bounded. So this helps make the algorithms even more decidable than
they already are. But this definition gives us a clean separation
between state-related operations on the one hand and background theory
related operations on the other. So this work is an extension of
Margus's work on symbolic finite state machines, taking that into the
symbolic finite state transducer space where we have outputs, and
what's nice about this is that it works for any decidable background
theory. So anything that can do satisfiability and witness generation
in Z3, let's say. So in practice we use the theory of bit vectors to
represent characters. And we can do that quite efficiently for large
alphabets, including UTF-16 and so on.
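A toy rendering of that structure, purely illustrative and not the
actual SFT library: each transition carries a guard predicate over
characters and a bounded sequence of output functions of the input
character.

    # A single-state symbolic transducer that escapes every quote character.
    SFT = {
        'start': 'q0',
        'final': {'q0'},
        'transitions': [
            # (source, guard(c), [output functions of c], target)
            ('q0', lambda c: c in '\'"', [lambda c: '\\', lambda c: c], 'q0'),
            ('q0', lambda c: c not in '\'"', [lambda c: c], 'q0'),
        ],
    }

    def run(sft, word):
        state, out = sft['start'], []
        for c in word:
            src, guard, outputs, dst = next(t for t in sft['transitions']
                                            if t[0] == state and t[1](c))
            out.extend(f(c) for f in outputs)  # emit the bounded output sequence
            state = dst
        return ''.join(out) if state in sft['final'] else None

    print(run(SFT, "it's"))  # it\'s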
So for further details I would refer you to the POPL paper at least in
regards to what we can do with symbolic finite state transducers. The
high level idea is that we define a formal algebra of operations that
are closed on symbolic finite automata, which represent values, and
symbolic finite state transducers, which represent transformations.
And based on that algebra you can implement a lot of different
interesting analyses. You can do relational subsumption. If you do it
in two directions you get relational equivalence, and you can use that
to test for, let's say, idempotence: can I apply this sanitizer twice,
and will it double encode stuff or not, and commutativity and so on.
So that concludes my portion of the show on that. Like I mentioned,
it's available as a demo on rise4fun and I think it's a great
demonstration. So in conclusion I've presented two complementary
approaches to dealing with code that manipulates strings from a
security perspective or more generally from a validation or test case
generation perspective. String constraint solving. There's a number
of different concretely available tools that are open source and so on.
And they all solve a sort of similar set of constraints.
And in sort of contrast to that, the Bek project is sort of a
singleton at the moment and models a subset of important string
manipulating functions more directly. So in terms of, let's say, my
own work, I would say that the bulk of it has focused on string
constraint solving, and a relatively small portion of it so far has
focused on Bek. Two summers so far. And I would sort of love to
continue working on this.
So in terms of future work, I have a couple of stray bullets which I
will cover very briefly. Basically we're looking into the closer
integration of string constraint solving into SMT solvers. So there's
a draft for a standard that would add this to the SMT-LIB sort of set
of theories that we know how to deal with, and potentially generate
benchmarks that would allow future tools to participate in SMT-COMP
while solving string constraints specifically.
And then another potential use for Bek, for example, would be as an
educational tool. So there's some of this available at MSR already.
But basically, let's say I learned to program in QBasic, in 1992 or
some such, and QBasic had as its sort of main point in favor very good
documentation, a very good debugger and so on. What if we could extend
something along those lines to a more domain-specific tool, like using
Bek for string manipulation, or manipulation in general, and do
automatic checking of student-submitted code by using the fact that we
can check for program equivalence?
So if there's a gold standard correct version of a function, we can
always tell you exactly why you're wrong, if you're wrong. The other
point I briefly wanted to mention is Bek for other domains. I've done
work on wireless sensor networks using different models to, I guess,
model a distributed system in this case. It is a common approach to
analyze distributed systems using automata. I think it would be very
interesting to see if there's a front end language extension to Bek
that would be amenable to use as an easy tool to write wireless sensor
network applications that run on distributed heterogeneous hardware and
sort of guarantee properties of communicating programs.
So that's it for my talk. And I'll be very happy to take any
questions.
[applause]
>> Margus Veanes: Plenty of time for questions. I think formally half
an hour. So, please.
>>: So actually I didn't quite understand what do you do with these
transducers, how are they related to the security of Web pages.
>> Pieter Hooimeijer: Sure. So the basic notion is that a function
like HTML encode is written in a certain style. So we've looked at a
bunch of different implementations for the USENIX paper, and they
follow this loop over the string. They keep some Boolean state and
look at a sliding window of the input string, converting characters
into other characters as needed. It turns out the translation from,
let's say, a Java implementation of HTML encode to Bek, which in this
case is the front end language to our tool, is fairly direct. The
purpose of the language is to mimic the style in which programmers
already write low level string manipulating code such as HTML encode.
The transducer is the formal model which we can do -- which we can use
to analyze stuff like equivalence. So they are single valued symbolic
finite state transducers. We show that equivalence checking of those
transducers is decidable, which means if you give me two different
versions of HTML encode, both implemented in our language, then we
convert them both to a transducer and check whether they are
equivalent. If they are equivalent we say yes. If they're not
equivalent we say no, here's a differentiating input that looks
different if you run it through the two transducers. So that's the
sort of core of the Bek project. Basically a language that describes a
transducer in the same way a classic regex would describe a set of
strings.
>>: What are your thoughts about -- in C programming you might do a lot
of this by hand, just because you didn't know there were tools that
could do it for you. If you --
>> Pieter Hooimeijer: What is this?
>>: Sanitization.
>> Pieter Hooimeijer: Sure, sure.
>>: So if you extended the C language or any other programming language
with these features, is it appropriate for that, so that you can encode
the -- essentially putting Bek into the language? Is it there yet?
Would that be useful for languages, or is it just going to be used by a
handful of people?
>> Pieter Hooimeijer: I think that's a great question. And so I guess
the basic question is what about ad hoc sanitization: can programmers
use this in real life? The answer is yes. There's an ongoing effort
to turn Bek back into real world code. So in the same way you might
use lex and yacc to write a specification which gets turned into
relatively obtuse-looking C code, we have a similar project underway,
or at least Margus has been working on this, I believe. It's basically
the translation from Bek into various mainstream languages. So then it
actually becomes quite feasible, if you recognize that your current
task is one of sanitizing a string using, let's say, a single pass, to
write it in Bek instead. So you can do some analysis before generating
the code that will then link into your application. Patrice?
>>: I was wondering today what's the main source of string constraints,
where does it come from?
>> Pieter Hooimeijer: Sure, the question is where do string
constraints come from. Literally --
>>: High level.
>> Pieter Hooimeijer: From a slide. I think the answer is: So if you
look at modern Web development, there's a lot of templating languages
out there. I would say that if you were to analyze them in the same
way we analyzed PHP code which represents a slightly older class of Web
application in my opinion, then you might look at stuff like -- so
these templating frameworks, they might apply some auto-sanitization.
In order to reason about their correctness, you would need to
symbolically execute through that library code or develop some model of
that library code.
And that way the string constraint solving approach is almost exactly
sort of complementary to the Bek project. We might use Bek to
essentially summarize different parts of library code. We might use
string constraint solving for executing or evaluating single paths
through library boundaries.
Does that answer your question?
>> Margus Veanes: More questions?
>>: I have a question about one of the experiments you did. The
experiment about the long strings. So the example regexes you gave are
classical examples of regexes that blow up if you try to determinize
them.
>> Pieter Hooimeijer: Yes, definitely.
>>: Recognizing that some character occurs like that from the end.
>> Pieter Hooimeijer: Yes.
>>: But the experiment was on the intersection and not difference.
>> Pieter Hooimeijer: Yes.
>>: So none of these tools tried to determinize.
>> Pieter Hooimeijer: This is true. That would make all the tools
slower, especially the automata-based ones. That's a good suggestion.
>>: Taking the difference.
>> Pieter Hooimeijer: Right. So doing difference instead of
intersection is generally slower. I wonder if that's also the case for
Hampi, maybe Hampi comes out looking a little better.
>>: Related to that, which Rex was used in this case?
>> Pieter Hooimeijer: It was the more recent implementation that we
had a license for. So the Rex implementation that used concrete ranges
I believe. So it's not exactly the fastest Rex implementation, BDDs
would be faster. But BDDs tend to consume a lot of memory and run out
in some cases at least that particular version of the code did. So we
went with the version that returned results for all inputs. So I guess
I should qualify that graph by saying that the Rex results could have
been a little faster if we had used BDDs but then there would be sort
of fewer data points to look at.
Tom.
>>: So one of the things that makes the approach viable in general is
that these general purpose languages you're analyzing have a
domain-specific sub language that closely matches sort of your domain,
right.
>> Pieter Hooimeijer: Right.
>>: So if you wanted to extend this, for example, to trees, then the
way people construct trees and matchover trees and programs might not
be as regular, shall we say, as it's encoded in the general purpose
language.
>> Pieter Hooimeijer: Sure.
>>: What would you -- if you had to sort of generalize your approach to
tree manipulating code --
>> Pieter Hooimeijer: The question is what about data structures with
multiple successors instead of just one, so trees instead of lists.
It's a very interesting area of research. So I think Margus is looking
into trees this upcoming summer, in particular tree transducers, and
this goes back to your earlier question about Mona as
well. So Mona has specialized support for this. This is something
that I've been meaning to look into sort of on and off over time. I
think in general it would be very useful but like you said we do
benefit a lot from the fact that high level library operations over
strings look a lot like let's say string constraint solving combined
with some Beck. So I'm not totally sure that many tree manipulations
take the same form.
>> Margus Veanes: So any more questions? No, then I think we're done.
We'll thank Pieter once more.
[applause]