>> Sumit Gulwani: Okay. Hello, everyone. Thanks for coming to this immediately-post-lunch talk. It is my immense pleasure to introduce Armando Solar-Lezama, who is a professor at MIT.
So Armando is one of the world's leading experts on program synthesis, and he's leading
a group, computer aided programming, at MIT. And before that he obtained his Ph.D.
from Berkeley and he worked on a new program synthesis technique called Sketching,
which actually revived interest in program synthesis, and several people have since then
started working on this topic again, myself included. So that was a very inspiring piece of
work.
And over to Armando now to tell you more about the exciting things that he has been
doing at MIT since then.
>> Armando Solar-Lezama: Thank you very much. So for today's talk I decided to focus
on some of the new stuff that we've been working on over the last couple of months in
my group. Most -- in fact, all of the work that you'll see here is still unpublished. It's at
the stage right now where it's started to produce results, so it's in a very exciting stage.
It's at the stage where if you give me a hard question that I can't answer, it's going to
make the future paper all that much better. So please do ask lots and lots of questions.
So my group, the group that I've just started at MIT for now a little over a year, it's called
the Computer Aided Programming Group. And as the name suggests a lot of our
inspiration comes from this field and this idea of computer aided design, computer aided
engineering.
So computer aided design, computer aided engineering has been around at least since the
1960s. It was one of the -- it was one of those applications of computers that had a very
dramatic impact in the way people do engineering and the way people design things all
the way from airplanes and cars to buildings.
And it's kind of interesting to look at the way this field has evolved and some of the
parallels with the way programming languages have evolved.
So if you look at some of the papers in this area from -- starting really in the 1960s and
really all the way to the 1980s, a lot of the issues that people worry about in this field
were very much the same kind of issues that people in programming languages have been
worrying about for a while, things like how do you organize your design so that it's easier
to understand and easier to communicate.
How do you manage these very large designs for something like an airplane or a building.
How do you modularize it so that you can design different pieces independently and put
them together. How do you handle reusability, so that if you're dealing with a part you have to make lots of copies of with small differences, you can parameterize it and create lots of instances of it for different contexts and different situations. How do you compile these designs?
And in this case what people mean -- what we mean by compiling a design, well,
something like taking your designs that you've made in the computer and building actual
mechanical parts out of them, using things like numerically controlled machines, or how do you generate drawings so that the builders in the field and the welders in the plant can actually use these drawings to produce this product.
And even at this point people are already starting to worry about very low-level checks,
things like checking compatibility between different parts, what you could almost think of as interface checking, making sure that if you design one part and you design
another part, when you put them together, they're actually going to match.
So a lot of these -- a lot of these things are very much the kinds of things that people have
been worrying about in programming languages for a very long time, how do you create
modular designs, how do you organize your designs. A lot of the literature at the time
also talked about things like version control, how do you make sure that -- how do you
keep different versions of the same design organized.
Now, some interesting things started happening around the 1990s. So for a very long
time people had already been looking at analyzing these designs from the point of view
of doing, say, finite element analysis or doing simulations.
But one of the really interesting things that started happening in the '90s is that people
started really looking for push-button design validation. At the level of these computer
aided design tools, they wanted to be able to press a button and see how this particular
piece that they designed was going to dissipate heat, for example. Or how this particular
piece that they designed was going to withstand vibrations.
And so there was a big movement for taking a lot of this technology that had been
developed for numerical simulation and actually incorporating it as part of these
computer aided design tools.
So you could say that today something very similar is happening in the world of software
for a lot of this technology that we have been developing for a very long time for
verifying programs, for checking them, for validating them. It's actually moving into the
mainstream. It's moving into a position where people in the field can actually push a
button and check their programs, check for bugs.
But in the 2000s something really interesting happened, which is that in this field, once you have all of these decision procedures in place, once you have the capability of simulating and analyzing your designs, people actually started playing with design
synthesis.
And with what people refer to as design optimization, which essentially in this setting means that you create the design and give it to the system and tell it, this is not just the design, this is actually an entire space of possible designs for the wing of your airplane or for the airfoil in your race car or even for the buttresses in a building. And you give it to the computer and let the computer go, and by relying on these decision procedures that had been developed over the years, get the system to actually help you with the design.
It's interesting to look at, for example, race cars. If you look at what the airfoils in race cars used to look like, they were very geometric. It's the kind of thing that you can
design with a ruler and a pencil. A lot of straight lines, very simple curves. This is the
kind of thing that people would design by hand and then simply validate through the
simulator.
Compare that with the kind of airfoils that you see today which are actually designed
through this kind of parameter optimization where you have very complex shapes, very
complex curves where you actually give it to the system and tell the system I want an
airfoil that maximizes downward force while minimizing drag.
This is the kind of thing that we would really like to be able to do with programs. We
would really like to be able to take some of the burden of designing these very complex
systems and leave it up to the machine to do it.
Now, one of the interesting things about this metaphor relating computer aided design
and programming is that in the computer aided design world, from the very beginning,
from the very early days, people have been aware that these are not automated design
tools. These tools are not called computerized design. They are purposely called
computer aided design.
And what does that mean? It means that we're not leaving the design to the tools; rather,
what we're looking for is a synergy between the machine and the humans. What we're
really after is being able to exploit the knowledge and the insights that programmers have
and combine them with the power of machines for doing very thorough design space
exploration, for doing very detailed analysis of the low-level aspects of the design and
really combine them together in a way that is to the benefit of both.
This doesn't mean, for example, simply using the human as an oracle so that when the machine gets stuck the human has to go in and fix the mess that the machine made. It
really means designing from the very beginning, looking at these problems from the point
of view of what is it that the human knows and what it is that the machine knows, what is
it that the machine is good at and being able to combine the capabilities of the human
with the capabilities of the machine, to actually make for systems that make
programming easier and simpler; that allow better productivity from the side of the programmers; that allow fewer bugs; and that ultimately -- the dream goal is to allow for designs that programmers by themselves could not have achieved.
So this is the goal. And in particular, what we want to do as part of the work of my group
is we want to go beyond validation. The next frontier beyond validation is going to be to
use the capabilities that we have, to actually help the programmer come up with these
designs and come up with these programs in the first place.
So today I'm going to talk about three projects that we recently started at MIT. As I said,
all three of these projects are relatively early stage, but I'm going to talk about what are
the key ideas and what are the key results that we have so far.
But all three of these projects are looking at dealing with essentially very important problems. The first one has to do with implementing complex algorithms. If you talk to
a lot of people today, they will tell you, well, nowadays people don't really write
complicated algorithms. They don't really write complicated data structures.
It turns out a lot of people don't. But there are a lot of people who do. Ultimately
somebody has to write the runtime for the CLR. Somebody has to write the libraries on which .NET or Java subsist. And, yes, even though modern programming languages have given us these tremendous abilities to reuse these complicated algorithms and these complicated data structures, so that most people in the field don't have to worry about them, ultimately somebody has to sit down and write them.
And for the people who actually have to sit down and write them, the tools that they have
at their disposal today are actually not different from the tools that people had at their
disposal in the 1970s, except maybe for the ability nowadays to at least validate their
designs.
So we're going to talk about some technologies that could make that process simpler.
We're also going to talk about another problem that a lot of programmers do face on a
day-to-day basis. And it is the problem of dealing with these really large code bases,
dealing with programs that are so large that not a single person understands the whole thing.
So we want the machine to help us reason about these programs, to help us
understand what's going on with these programs and help us with the problem of going
and adding functionality to these programs, of building functionality on top of these very
complicated frameworks.
This is a very important problem and a very difficult one because these programs are at
the scale where not only do humans have a very hard time understanding them, but even current analysis capabilities have a very hard time understanding them.
And finally we're going to talk about this issue of unpredictable environments. One of
the things that makes programming hard is the fact that when you write a program,
particularly when you write a program that is going to have to interact with the world in
some form, you have to think beyond the common case.
You have to really think about all the different possible scenarios that might arise where
the main algorithm that is running in your program might have to do something a little bit
different, might have to use slightly different code, might require some slightly different
line of reasoning to deal with each one of these different corner cases.
And a lot of times these corner cases arise not necessarily out of the specification, but out
of some properties of the programming tools.
So let's -- so to deal with these complicated algorithms, I'm going to talk about this idea
of storyboard programming. And the key idea is we wanted to turn the programmer's
graphical insights into code.
So this is work with a student, Rishabh Singh. We're also going to talk about this project called MatchMaker, which is really a case study in data-driven synthesis: using large amounts of data about the execution of the program to actually help the programmer come up with the pieces of code that are needed in order to fulfill a particular programming task.
And finally I'm going to talk about this project with Jean Yang dealing with specification-based hardening, where the idea is, again, to use symbolic reasoning to make the program more robust, to allow programmers to cope with some of these unexpected situations.
So any questions up to this point? All right. So let's start with this idea of storyboard
programming. So this is really a metaphor. A lot of animated films today actually start
their life as a storyboard. And what is a storyboard? Well, the storyboard, what it is, is
it's a sequence of frames that describes the action in the movie. It essentially describes
the sequence of events that are going to happen as the movie progresses.
And it's a great mechanism for communicating insight, in this case the creative writers and artists communicating their insight to this very rigid and very rule-bound structure, which in this case happens to be not a machine but studio executives.
So it's a mechanism for communicating this creative insight to the studio executives.
And what it is, again, it's the sequence of frames, the sequence that describes the action of
the movie. So now the question is, why can't we do the same thing to communicate our insights and our intentions to the computer?
So we've come up with this idea of storyboard programming. And it's particularly geared
to programs like data structure manipulations, where programmers really start with this
very high-level, this very graphical intuition of how the program is supposed to proceed.
And there what they have to do is they have to take this very high-level graphical
intuition about how their data structures are going to evolve and convert it into the
sequence of pointer updates, convert it into a sequence of [inaudible] updates into this
invisible heap that is completely conceptual.
And what you end up with is this inscrutable mass of code that if you want to have any
chance of understanding it, you really have to reverse engineer. You have to reconstruct
from this sequence of pointer updates, this graphical intuition that tells you this is what's
actually going on in your data structure.
So what we want is we want to allow programmers to convey their insight in the form of
a storyboard. And what is the storyboard? The storyboard is actually a cartoon version of this data structure transformation. In this case, for example, you can see that what we're trying to do is insert an element into a linked list. So we start with a linked list where A happens to be less than X and B happens to be greater than X.
And then we finish with the same data structure where now X is sitting there right in the
middle.
Now, in addition to this, we want to provide the system with some structure, some
structure that tells the system the kind of solutions that we're looking for. In this case,
this -- we can call this a sketch or a scaffold. It's really providing the control flow
structure of the solution that you expect.
But the key insights really come from here, from the storyboard, which is telling you a story. It's telling you about how that data structure is going to evolve in the process of doing this data structure transformation.
This storyboard is actually composed of different scenarios. Each one of these different
scenarios corresponds to different situations, different cases that the input is going to have to deal with. In this case, in addition to the common case where you're inserting into the middle of the data structure, you have to worry about the case where you're inserting at the end, or you're inserting at the beginning, or where you're inserting into an empty data structure.
So the storyboard is essentially providing the scenarios -- yes.
>> [inaudible] box mean something special?
>> Armando Solar-Lezama: Yes. This is an excellent question. So one of the things that
you see here is -- one way to think about this is this is essentially an input/output pair.
This is essentially an example that says when you get a list like this, you want to produce a list like this. But it's more than an input/output pair because we're using some
abstraction here.
What we're really telling the system is we're going to get this list and there is this part of
the list that is just some series of nodes that you don't particularly care about, and then
there is an interesting part, the part where you're actually going to be modifying and
inserting the thing. And then there's, again, this boring part that just contains a lot of
nodes.
And when you're done, the front part is going to be the same and the back part is going to
be the same. And the change is going to happen here in the middle, where this is where
the important stuff happens.
>> Is this not about inserting in a sorted list?
>> Armando Solar-Lezama: Yes.
>> So you should have some kind of a predicate saying that X is between A and B, right?
>> Armando Solar-Lezama: Yes. That's absolutely right. So --
>> Why is that not in the picture?
>> Armando Solar-Lezama: Because I forgot.
>> Okay.
>> Armando Solar-Lezama: So, yes, you're absolutely right. So you also need a
predicate here that says the fact that all the elements in front are less than A and that A is
less than X and that X is less than B and that B is less than all the elements in back.
So this is the predicate. And this is exactly the predicate that you need. And so the idea
is this predicate is really part of the storyboard, is part of the description of what's going
on in this execution. And from that the system actually synthesizes a correct
implementation of the list insertion and implementation that realizes that what it has to do
is iterate over this front part until it gets to the interesting stuff.
And when it gets to the interesting stuff it has to go and insert into the interesting stuff by
doing some pointer manipulations, and that it has to worry about different cases, like the case where the head is equal to null, or the case where prev is equal to null, in which case you are inserting at the beginning of the list.
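Just to make this concrete, the kind of code we expect the synthesizer to produce for this storyboard looks roughly like the following hand-written Java sketch (my own illustration against a minimal Node class, not the actual tool output):

    class ListSketch {
        static final class Node { int val; Node next; Node(int v) { val = v; } }

        // Insert x into a sorted list: scan over the "front" part, then splice
        // the new node in between A and B, handling the empty-list and
        // insert-at-head cases as well.
        static Node insert(Node head, int x) {
            Node toAdd = new Node(x);
            Node prev = null, cur = head;
            while (cur != null && cur.val < x) {   // scan the uninteresting front part
                prev = cur;
                cur = cur.next;
            }
            toAdd.next = cur;                      // X now points at B (or null at the end)
            if (prev == null) {
                return toAdd;                      // inserting at the head of the list
            }
            prev.next = toAdd;                     // A now points at X
            return head;
        }
    }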
So this is the information that you want to provide to the synthesizers, and now the
question is what do you need in order to actually make this real.
So the first thing and really the most important thing is you need to give semantic
meaning to the storyboard. You need to be able to go from this very abstract graphical
description to something the system actually can manipulate, something the system can
actually reason formally about so that it can actually tell whether what it's producing
actually conforms to the storyboard that you provided.
So the storyboard is -- the storyboard, as I said before, you can think of it as a set of
input/output pairs, but it's a little bit more than that.
The fact that you have these abstract nodes means that the single input/output pair, what
it's really representing is an infinite family of input/output pairs. It's describing the
behavior of this program on an infinite set of input/output pairs.
So this, on the one hand, helps you constrain the set of solutions that you want to
consider, but it also gives you something very important: It tells the system what to focus
on.
What we want to do is we want to take advantage of this abstraction that the user is
providing. The fact that the user is telling you what's happening here is not really that
relevant, this is where the interesting stuff is happening, that's what you really need to
reason about in order to reason about this data structure manipulation.
So you want to treat the storyboard as a specification, but you also want to treat it as a
source of an abstraction, something that tells you what's important, what you need to
reason about in order to get your algorithm right. So then, of course, you need to actually
be able to use this insight presented in the storyboard to make the synthesis happen.
And I'm going to talk a little bit about how we deal with issues of expressiveness and scalability.
So to see how we do this, let's start by thinking a little bit -- by looking in a little bit more
detail at the information that we need to be able to extract out of these storyboards.
So the first thing we're going to extract out of a storyboard is an environment. So the first
thing that you have in an environment is a sequence of variables. Some of these variables
actually show up in the storyboard. Some of them don't show up in the storyboard but
instead they come from this scaffold, the fact that you have this current and this previous
pointer.
So the environment -- the environment describes those variables that are going to be
necessary to reason about this data structure manipulation.
>> So who provides this? Is it the creative designer or the studio executives?
>> Armando Solar-Lezama: The idea is that this is provided by the creative designer. This
is extracted from the storyboard.
>> [inaudible]
>> Armando Solar-Lezama: Right. Right now we're actually at the stage of writing
these by hand. But I'm going to try to convince you that this -- all this is is a text
representation of what you see in this picture with a couple of extra things.
So first you have the variables. And again the variables just come from the scaffolding.
Then you have a collection of concrete nodes. Those come directly from the storyboard.
The storyboard tells you that in order to reason about this you really only have to reason
about these concrete nodes A, X and B. And you really don't need to reason about any
additional concrete nodes.
There what you have are these abstract nodes, front and back. So the system is also
going to have to reason about these abstract nodes front and back.
And here you also need to tell the system a little bit about these abstract nodes and
relationships between them. In this case, the fact that front.next -- in other words, for any concrete node coming from front, its next pointer -- can either point to front or it can point to A. And in the case of back, the fact that back.next can either point to back or it can point to null.
And finally you have this invariant that describes the relationships between the nodes in your storyboard, the fact that the front nodes have to be less than A and that A
has to be less than X, X has to be less than B. This is something that the user has to
provide separately. It's an invariant about the data structure that you're trying to
synthesize.
So in addition to the environment you have -- this scenario is actually composed of two
frames. The first frame says this is what things look like before I start the execution of the program, and the second says what things look like at the end of the program. And this is just a text description of this picture. All it's saying is -- it's describing the connections that you
see graphically in this picture.
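Written out, and purely as an illustration of the idea rather than the tool's actual input syntax, the environment and the two frames for this scenario amount to something like:

    vars:      head, cur, prev              (cur and prev come from the scaffold)
    concrete:  A, X, B
    abstract:  front, back
    pointers:  front.next in { front, A }   back.next in { back, null }
    invariant: front < A < X < B < back     (on the values stored in the nodes)

    input frame:   head -> front,  front leads to A,  A.next -> B,  B.next -> back
    output frame:  head -> front,  front leads to A,  A.next -> X,  X.next -> B,  B.next -> back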
So now the key is that from this scenario and from the environment what we're actually
going to do is we're going to derive an abstraction. In this -- in our case -- yes.
>> Can you go to the previous slide?
>> Armando Solar-Lezama: Yes.
>> So somewhere in this there is an implicit [inaudible] information that you're reasoning
about [inaudible], right? Otherwise you cannot go from this picture to [inaudible]. So
this has to be -- I mean, that has to be information that's provided by the creative
designer.
>> Armando Solar-Lezama: In this case it's actually not the case. So you can actually
think of this -- the only case where that becomes relevant is here. From the point of view
of these predicates.
All the information that we have here is the fact that this corresponds to some set of
nodes and that if you take one of those nodes and you do .next on it, either it's going to
take you to another node in here or it's going to take you to A.
>> But the picture has more information and [inaudible] writing [inaudible].
>> Armando Solar-Lezama: Right. So the picture has -- the picture has -- in addition to
this information, it also has this information right here.
>> So why does the picture not show an arrow from front to B?
>> Armando Solar-Lezama: That's a good question. So this is actually something that is
part of -- this actually is something that is part of our description of the scene, the fact that this is really what the list looks like. At least for the lists that I care about, the data structures that I care about, this is what they look like.
And from front there are links to A, but there are no links to B. And I also don't want there to be links to X or links directly to back.
So I really only care about lists that look exactly like this in the beginning, and at the end
I want them to look exactly like this. Yes.
>> So does the picture also implicitly say that the front [inaudible] exactly the same as
the front [inaudible] what the links between the different nodes in the front can be
changed or they're not supposed to change?
>> Armando Solar-Lezama: That's a very important question. So right now the
convention that we're using is that these abstract nodes cannot change. And this is a very
strong restriction. I'll talk a little bit more in a few slides about how that -- how we can
actually deal with that restriction. Yes.
>> So [inaudible] sort of dealt with the problem of giving logical meaning to a picture --
>> Armando Solar-Lezama: Yes.
>> -- with shape graphs, and there's all sorts of hygiene conditions and all sorts of extra
things. You know, a picture may be a thousand words, but a few symbols are worth a
thousand pictures, right [inaudible] so do you -- more technically, do you give a -- do you
use their sort of notions of hygiene conditions and not having -- are you using something
like their encoding of shape graphs? Or how does it compare?
>> Armando Solar-Lezama: So you could think of this as a cartoon version of shape
analysis. Some of the important things that we don't have here and that we're trying to
get away without are the kind of elaborate predicates that you see sometimes in shape
analysis, things like reachability predicates.
>> Well, that's not very fundamental. Like, for example, you can't have -- I mean, some
are just obvious to programmers but not to a computer system. Like next can only point
to one thing. A pointer can only point to one thing. In a graph you could make it point to
two things. But we know that the next field can only have --
>> Armando Solar-Lezama: Right. So those things are implicitly encoded here.
>> Ah. Yes. That's what I'm referring to.
>> Armando Solar-Lezama: Yes.
>> And there's others that, you know, that there's no dangling, there's no garbage, all the
sort of things you get when you really want to precisely describe what the picture means.
>> Armando Solar-Lezama: Right. That is absolutely right. So it's actually -- there is a
set of background -- there is a set of background conditions here in terms of --
>> [inaudible]
>> Armando Solar-Lezama: Yes.
>> [inaudible] why are you trying to get away from the reachability predicate? You said
that explicitly, right? You don't want to use a reachability predicate.
>> Armando Solar-Lezama: Right now one of the things that we're trying to see is how
far we can push this with very simple -- with very simple predicates that you can get out
of the picture with very minimal extra annotation.
So it's probably -- you will probably need them for certain complicated things. But we're
trying to see how far we can push it with just a very simple interpretation of these graphs,
just based on very simple predicates and very simple predicate abstraction.
>> [inaudible] a big deal that the front and back [inaudible] you can always [inaudible]
another template to match [inaudible] some units [inaudible] you don't have to change.
>> Armando Solar-Lezama: Having front and back not have to change --
>> It's not a big deal.
>> Armando Solar-Lezama: It's not -- conceptually it's not. In practice if front and back
don't have to change, it makes the algorithm a little bit simpler. In general, if abstract
nodes don't have to change, it makes certain aspects of the encoding simpler.
>> I agree. But so what I was saying is that going further, if you want [inaudible] to
change, you can actually make two complex [inaudible].
>> Armando Solar-Lezama: You're getting ahead of me, but that is exactly -- that is
exactly right.
>> [inaudible] is it true that you're [inaudible] this assertion actually means something
specifically with the pointer structure as opposed -- I mean, I first read this and I thought
you were talking about sorting delays based on [inaudible] key or something like that.
But you're actually talking about pointer structure here.
>> Armando Solar-Lezama: So that's a very good point. This is not about the pointer
structure. This is just a statement about the value stored in node A versus the value of the
nodes that belong to the set front versus the value of the nodes that belong to the set back.
>> [inaudible] overloading the pointer values and the data values stored here, right?
>> Armando Solar-Lezama: Yes. So that is solely for the purpose of presentation. This
should actually be front.val and a.val except when it gets too long.
>> [inaudible] don't understand how you know the picture -- how you even know that the
arrows go that way. You know, the picture -- the sequence of arrows in the picture are
very specific.
>> Armando Solar-Lezama: Well, so this is the key. The key is -- what you're trying to
do is you want to get this environment from the picture, not the other way around.
So from the picture, you want to get the fact that -- you have a certain set of variables,
you want to get the fact that you have a certain set of concrete nodes, a certain set of
abstract nodes, the fact that you have certain relationships between the pointers for these
nodes and the fact that you have certain relationships between the values of those nodes.
The pointer structure is really coming into play here in the description of the scenario,
where you're stating very explicitly the fact that X points to some node in front -- sorry.
The fact that head points to some node in front, the fact that A.next points exactly to B -- yes?
>> Why is front pointing to A? I don't see anything on the left that tells me that front
should be pointing to A.
>> But the thing is that he is not saying that the left is -- has just as much information as a
picture.
>> Armando Solar-Lezama: Right.
>> He wants to go from the picture to the left. So the left is really I think perhaps an
abstraction of the picture, right?
>> Armando Solar-Lezama: Exactly. So because we have this rule that these nodes in
the -- these abstract nodes, we're not going to allow them to change. That's just a design
decision. Then this relationship between -- these pointer relationships corresponding to
this front node are actually part of the environment. They are -- they are fixed. They
are static in this case.
>> [inaudible] pointed out was that you can also use more sophisticated [inaudible] on
the graphical side [inaudible].
>> I guess I'm trying to -- that's what I'm trying to understand. And I was under the
impression at first that your left encoding is an exact encoding of the information in the
pictures, and now you're saying no, it's just a subset of --
>> Armando Solar-Lezama: So together this environment plus the information in the two
scenarios, that those two things together encompass the information in the picture.
So -- and so that's the basic idea. You have these scenarios and what we do from these
scenarios is we actually define a -- we actually define a predicate abstraction over our
domain of lists. And in this case it's something very simple. We have already our sets of
nodes. So in this case -- and we also have that every variable, every location in the program, can point to a very specific set of locations.
So we have that head could either point to this front node or some node corresponding to
front or to the null pointer. And we have that a.next could point to either B or X or to
null.
So essentially what we have is we have a set of predicates that define an abstract domain.
And so from those predicates what we do is we construct an abstract domain. In this case
each valuation of the predicates corresponds to a particular set of states. And what we
have is essentially this power set domain where you can represent it as a bit vector and
every bit corresponds to one of these abstract states which in turn correspond to
valuations of these predicates.
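As a rough Java sketch of that representation (my own illustration, not the system's actual data structures), an element of this power set domain is just a bit vector indexed by the predicate valuations:

    import java.util.BitSet;

    // An abstract value in the power-set domain: bit i is set exactly when
    // abstract state i (one valuation of the predicates) is possible.
    class AbstractValue {
        final BitSet states;
        AbstractValue(int numStates) { states = new BitSet(numStates); }

        void join(AbstractValue other) { states.or(other.states); }   // join = set union = bitwise OR

        boolean leq(AbstractValue other) {                            // ordering = subset inclusion
            BitSet extra = (BitSet) states.clone();
            extra.andNot(other.states);
            return extra.isEmpty();
        }
    }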
And you can actually frame this entire synthesis problem as a very -- as a very simple
dataflow analysis. So this is actually borrowing from the work of Saurabh and Sumit that essentially tells you that you can actually do synthesis on top of an abstraction -- on top of predicate abstraction.
And in this setting it actually leads to a very simple formulation. So in this case what you
have is the scaffold or the sketch gives you the structure of the solution and what you're
trying to figure out are what these functions, F3 and F2 and F1 and this F predicate,
you're trying to figure out what they are.
And so what this leads to is essentially a set of equations that tells you that there -- each
one of these, F1, F2, F3, FP and negation of FP, they're actually parameterized. They're
actually parameterized by a bit vector in such a way that depending on the value of the
parameters you're going to get a different concrete piece of code for F2 or for F3.
And so what you're really looking for is that there exists the value of T1 and T2 and T3
where these Ts correspond to your bit vector representation of your abstract domain.
And this C corresponds to the set of bits that determine how all of these unknown pieces of code are going to get instantiated.
And because we're doing predicate abstraction, in principle, you can solve this in one
shot. So this is just a satisfiability problem that you can solve with one shot.
And an interesting observation is that if you're only trying to do synthesis and if you're not really trying to find the invariants, you don't actually care to find the least fixed point. Because, after all, if you find some values for the different missing pieces of code that work for any fixed point, they're going to work for the least fixed point.
So you don't really care about minimizing --
>> You also need to make sure that you've done this.
>> Armando Solar-Lezama: That is actually very important, and the fact is that what this is going to give you is partial correctness.
So currently we actually -- we're actually not -- we actually haven't gotten to the point of
doing more than partial correctness.
So what this means is that this is giving you a set of constraints such that any solution to
these constraints is going to satisfy your storyboard, but potentially it's going to loop forever.
Yes.
>> [inaudible]
>> Armando Solar-Lezama: Yeah. So actually one of the things that you have here is
that your storyboard doesn't have to have abstract nodes. So you can also provide a
couple of concrete input/output examples. And if you have a couple of concrete
input/output examples, then this just becomes a simple inductive -- a simple inductive
synthesis.
>> But how do you know that?
>> [inaudible] true and semicolon, right, so [inaudible].
>> Armando Solar-Lezama: So one of the things that we do require is that at the end of the process, for your final state, the system finds a solution to the dataflow equations such that the final state is identical to this.
Now, sure, it could make up some values out of thin air and then say, yeah, here it is, it
looks like it works correctly even though it actually doesn't terminate for any inputs.
When you have concrete input/output pairs, though, you -- then it really becomes like the
inductive synthesis that we do in standard sketch, where we actually require termination
after a fixed number of iterations.
One of the things that we like actually about these constraints is that they are compatible
with the constraints that we generate in sketch. Which means in sketch you can write, for
example, a test method and say I want my solution to actually work correctly for this test
method, for all inputs within this bounded size and I want everything to terminate for
those inputs.
And you can actually just take these constraints, combine them with the constraints that you get from sketch, and get termination out of that.
So the first question is, well, does this work? And the short answer is yes. So these are some of the experiments that we have run, things like inserting into the
linked list. It turns out if you want to delete from a linked list, then all you have to do is
swap these inputs and outputs in your storyboard. And then you'd get -- then you get a
storyboard for linked list deletion for free.
Inserting into a binary search tree is another interesting one. So you can see that the
space of possible solutions that the system is considering is quite astronomical. The
performance is not -- the performance is not amazing. It's of the order of a few minutes.
But it's reasonable. But when we started playing with this --
>> [inaudible]
>> Armando Solar-Lezama: Yeah, that is -- that is probably true.
One of the things that we realized, though, when we started playing with this and with
just this very simple formulation was that this formulation is great for scan and modify
type of manipulations. Something like inserting into a list where the solution is of the
form -- you're going to scan through some portion of the list, and then you're going to get
to this middle interesting part where you're actually going to modify it.
So something like inserting into a linked list certainly falls into that pattern, something
like removing from a linked list certainly falls into that pattern. Something like inserting
into a binary tree certainly falls into that pattern.
But something like reversing a list, for example, does not fall into that pattern. Yes.
>> [inaudible]
>> Armando Solar-Lezama: Sorry?
>> What about splay tree? Can it do it?
>> Armando Solar-Lezama: Splay tree also requires a little bit more than just scan and
modify. Because it also requires some restructuring of the tree. Similarly in the case of
something like a red-black tree. It's -- so in order to deal with those kind of
manipulations, we really need more than this. Any question? Oh.
So dealing with more complex operations really requires additional machinery. And in particular, dealing with something like linked list reversal really requires some level of inductive reasoning, and this kind of abstraction, like the one that I showed here, doesn't really support the kind of inductive reasoning that you need for something like an in-place linked list reversal.
>> I still don't -- I don't quite see what is the difference between the two? What is the
difference between inserting and reversal?
>> Armando Solar-Lezama: I'll show you in a second. I think this example will make it
very clear.
>> Okay.
>> Armando Solar-Lezama: So let's say that instead of inserting into a linked list, I want to reverse a linked list. One way I can describe it is as follows: I'm going to have this head node
A, and that is going to point to some middle part of the list, and then that is going to point
to some end of the list Z. And what I want to get is I want to get essentially the same
thing but backwards.
And here's the thing: The way this linked list reversal works is that you essentially -- this
middle part, you actually have to break it down into two separate pieces.
And the way this in-place linked list reversal happens is that you have one part of the list
that you've already reversed and that is already pointing backwards. And you have
another part of the list that is still pointing forward. And essentially you have to keep
pointers pointing to the edges of that gap in the middle, where the list goes from pointing
backwards to pointing forward.
And what happens is that every iteration, this gap moves by one step. And so in order to
reason about this kind of transformation, you really need to reason about what's going on
in this middle part to a much greater level of detail than what the rules that I described
earlier allowed.
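For reference, the standard in-place reversal looks like this (my own Java rendering, not the synthesized output); everything before cur already points backwards, everything from cur on still points forward, and each iteration moves that gap one node to the right:

    class ReverseSketch {
        static final class Node { int val; Node next; }

        static Node reverse(Node head) {
            Node prev = null;          // head of the already-reversed prefix
            Node cur = head;           // head of the still-forward suffix
            while (cur != null) {
                Node next = cur.next;  // remember the rest of the forward part
                cur.next = prev;       // flip one pointer across the gap
                prev = cur;            // the reversed prefix grows by one node
                cur = next;            // the gap advances one step to the right
            }
            return prev;               // new head of the fully reversed list
        }
    }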
If you remember, one of the things that we said here is that these asterisk nodes, we don't
want to let the system go and modify them. We want the system to come up with
solutions that don't require modifying this abstract node.
And in the case of this kind of iterate scan and modify algorithms, that is perfectly
reasonable because then the part that you scan, that's the part where you abstract. And
then when you get to the part where you modify, that's where all the detail is. And then
you work on that part that has the detail. And you get your scanning and modify.
With something like reversing a linked list, you can't quite do that. You really need to be
doing modifications in this abstract part of the list.
And the way we get away with this is by providing additional inductive invariants. So in this case what the programmer has to describe is the fact that this middle part actually corresponds to some node -- and I use A here, but it really shouldn't be A -- it's really some new node connected to this middle part, or just some individual node. In other words, this is a way of describing that this middle part is really a sequence of nodes, a nonempty sequence of nodes.
So now we are saying something very specific about what this middle part looks like. We are giving it this inductive invariant that the system can now use to reason about what's going to happen in the middle of this algorithm.
And so how do you deal with this kind of inductive invariant in the context of our abstract interpretation?
Well, the important part happens right here, in the body of that while loop. Because in
the body of that while loop you need to reason about when you have to unfold this
abstraction or when you have to fold it back so that eventually you reach a fixed point.
And when you're dealing with shape analysis, this is actually one of the tricky aspects of
shape analysis, the fact that the shape analysis engine has to figure out this kind of
inductive definitions of your data structure. It has to figure out when to expand these
abstract nodes and when to collapse them into a single abstract node.
In this case, on the other hand, because we're doing synthesis anyway, what we're going
to do is we're going to explicitly introduce these fold and unfold operations as part of the
basic operations that the system can come up with, just like -- for this part, just like the
system can choose between doing a pointer assignment or doing a different pointer
assignment, it can also choose just to insert an unfold operation or a fold operation.
And so the idea is that the synthesized program is actually going to tell you exactly where
you need to unfold and where you need to fold in order for the abstract interpretation to
actually work, in order for you to actually reach a fixed point.
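Purely as an illustration of what such synthesized output might look like (the helper names and the exact placement here are mine, not the tool's), the reversal loop with the ghost operations made explicit could read:

    class ReverseWithGhosts {
        static final class Node { int val; Node next; }

        // Ghost operations: no runtime effect; they only tell the abstract
        // interpreter when to expand an abstract (summary) node into one
        // concrete node plus a smaller summary, and when to collapse back.
        static void unfold(Node summary) { /* ghost */ }
        static void fold(Node summary)   { /* ghost */ }

        static Node reverse(Node head) {
            Node prev = null, cur = head;
            while (cur != null) {
                unfold(cur);          // expose the next concrete node of the forward summary
                Node next = cur.next;
                cur.next = prev;
                prev = cur;
                cur = next;
                fold(prev);           // absorb the freshly reversed node into the reversed summary
            }
            return prev;
        }
    }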
And so once you introduce these fold and unfold operations, then the job of the
synthesizer and the abstract interpretation becomes very, very similar to what we had
before.
The only difference is that now you have this new extra operation that when you apply
fold into something that looks like this, then you're going to get something that looks like
this. When you apply fold to something that looks like this, you're going to get
something that looks like this. When you apply unfold, then it's nondeterministic. You can either get this or you can get this, and you don't know. Yes.
>> So you're essentially telling me that you are going to synthesize these fold and unfold
instructions.
>> Armando Solar-Lezama: Exactly. So the code that it synthesizes is going to have -- explicitly there it's going to tell you this is where you fold. This is where you unfold.
And you fold on whatever this pointer is pointing to.
>> It's totally the dual of you know what we do in verification where we have these
synthetic instructions sort of in your code just to guide the [inaudible].
>> Armando Solar-Lezama: That's exactly right.
>> And you will actually also have to sort of synthesize those even though they have no
runtime effect?
>> Armando Solar-Lezama: That's exactly right. Because even though they have no
runtime effect, from the point of view of making the abstract interpretation work, they're
crucial. And any solution that doesn't have them, it's not going to verify according to the
abstract interpreter.
So that's the basic idea. Now, what are the big obstacles in order to make this work? The
biggest obstacle is that when you have something like this, your abstract domain can start
growing, and it can start growing quite a bit.
So for this example we have something of the order of 7 to the 9th possible abstract
states. So what that means is that your space is now partitioned into -- your infinite space
of lists is now partitioned into 7 to the 9th possible states.
>> [inaudible]
>> Armando Solar-Lezama: Well, it's not that much until you realize that you actually
have a power set domain.
>> [inaudible]
>> Armando Solar-Lezama: It's true. It's the fact that in the previous setting what we
had was -- what we wanted to do, one-shot synthesis, essentially what we had was this
T1, T2, and T3 had to -- had one bit right there for every one of those abstract states.
So what that means is that now the width of T1, T2, and T3 has to be 7 to the 9th. And having T1, T2, T3 have a width of 7 to the 9th, which gives you a search space of 2 to the 7 to the 9th, just doesn't work.
>> [inaudible]
>> Armando Solar-Lezama: Exactly. So how can we deal with this? This kind of
sounds like a show-stopper, right? Well, we have a trick up our sleeve. So what is this trick? So before, when we were doing sketching, we had this algorithm for doing
counterexample-guided inductive synthesis.
And the idea of counterexample-guided inductive synthesis is that you have automated
validation and as part of this automated validation -- and so what you have is you have
this inductive synthesizer that has some concrete inputs, and based on those concrete
inputs it's going to synthesize something that works for those concrete inputs. And it
doesn't care what else happens in the world. It just cares that for those inputs it works.
And so that solution is thrown over the fence to an automated validation procedure that then goes and checks whether this is indeed correct or not. And in the case of sketch, it was just a symbolic model checker in the case of sequential things, and it was just SPIN in the case of concurrent things.
But what you wanted to get out of that was a counterexample input. And so by following
on this idea of inductive synthesis, the idea is that you never have to instantiate this
problem that says for all possible inputs. Instead you only instantiate it for a small
number of inputs. And from that small number of inputs you get a solution that happens
to generalize, that happens to generalize for all inputs.
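Schematically, the counterexample-guided loop looks like this minimal Java sketch, with hypothetical Synthesizer, Verifier, Candidate, and CounterExample types standing in for the real components:

    import java.util.ArrayList;
    import java.util.List;

    class Cegis {
        interface Candidate {}
        interface CounterExample {}
        interface Synthesizer { Candidate synthesize(List<CounterExample> examples); }
        interface Verifier    { CounterExample check(Candidate candidate); }  // null means no bug found

        static Candidate run(Synthesizer synth, Verifier verifier) {
            List<CounterExample> examples = new ArrayList<>();
            while (true) {
                // Inductive step: find something that works on the examples seen so far.
                Candidate candidate = synth.synthesize(examples);
                if (candidate == null) return null;            // nothing fits: give up
                // Validation step: check the candidate against the full specification.
                CounterExample cex = verifier.check(candidate);
                if (cex == null) return candidate;             // it generalizes: done
                examples.add(cex);                             // otherwise learn from the failure
            }
        }
    }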
So what we want to do is we want to expand this idea in the context of abstract
interpretation. Now, why is it tricky to expand this in the context of abstract
interpretation? Well, the problem is if you do abstract interpretation here, what you're
going to get out of it is not a -- what you're going to get out of it is not a counterexample
input that says, oh, yes, this counterexample input is where your concrete solution fails.
And, moreover, because we're doing abstraction, we don't want to do inductive synthesis
on a concrete input and in the concrete space. We really want to do this in the abstract
setting.
So how in the world do we do this, how in the world do we combine this idea of
counterexample-guided inductive synthesis with the kind of abstraction-based synthesis
that we're doing here.
And the idea is very simple. For validation, we're going to do this abstract interpretation.
It's really just a very simple dataflow analysis in this case because of the fact that we have
our explicit fold and unfold instructions there.
And so what you have is you have your control flow graph and you start doing dataflow
analysis. And what happens is you start iterating over this loop part until you get a fixed
point, and eventually once you reach a fixed point you find that this particular program that you have is allowing you to reach a bad output; this particular program that you have is leading to a bad output.
So you can actually trace back the chain of dependencies that led to the bad output. You
can look at this function, F3, for example, is really a function that takes some input
abstract states and produces some output abstract state, and you can look at, well, what
was the input abstract state that allowed it to produce this bad output abstract state.
And then you can keep going backwards. And essentially what you get is a formula like
this that says, well, when you take this abstract state from the input and you apply this
composition of all the functions corresponding to essentially this path that is shown here,
what you get is the output.
So this is very similar to the path-based synthesis Saurabh showed earlier today. The
main difference is that in this case we're doing it -- we're getting this path by tracking
backwards from the result of abstract interpretation.
By tracking backwards these dependencies in the abstract interpretation. And one of the things that is interesting is that because we are dealing with abstract interpretations,
these functions F are nondeterministic. So -- or you want to think of them as
nondeterministic in the sense that for any given input state they can produce potentially
many output states.
And one of the things that you want to do when you convert it into this big composition
of functions is settle -- fix that nondeterminism to the nondeterminism that actually
caused things to go wrong.
So then what this gives you is this gives you a constraint that you can then incorporate
into the inductive synthesizer so that the next solution that is produced by the inductive
synthesizer doesn't run into the same problems that the previous solution did.
And this is really the basic kind of constraints that are going to be added. There's a
couple of additional constraints that you want to add for technical reasons, things like the
synthesizer can come up with a really bad solution, a really bad solution that after
thousands and thousands of iterations it will fail to reach a fixed point.
Now, you know that for the linked list reversal that is not the case. The fixed point is
actually reached relatively quickly. It's reached after -- I think in this case it's reached
after something like four iterations with this particular abstraction.
And so you want to tell the system, look, if after three iterations you haven't found the
solution -- you haven't converged to a fixed point, then you should -- then that's not a
good solution and you actually want to incorporate a constraint that says that's not a good
solution, I need a different one and a better one.
And in this way the idea is to constrain the system so that it gives you only good
solutions. So this we're actually in the process of trying out. Ask me in two weeks and
I'll tell you how it works.
Any questions up to this point?
>> [inaudible] examples in the graphical phase?
>> Armando Solar-Lezama: The counterexample essentially is this.
>> [inaudible] the graph space [inaudible]. So we have your model [inaudible] lots of
specifications that you looked at. So here you can actually [inaudible] another examples
that don't do this kind of [inaudible] they don't do this.
>> Armando Solar-Lezama: So there's actually an interesting question. One of the
questions that we want to explore is what happens when you provide richer things in
addition to these input/output pairs. You have, for example -- you tell the system, for
example, that in the middle of the loop right here there is a particular shape that you don't
want to see or there is a particular shape that you do want to see, that you want a solution
to have this particular shape.
And we think that that is going to be particularly relevant when you're trying to
synthesize something like a red-black tree, for example. What you'd really like to say,
well, here in the middle there is this rotate operation. And here is the storyboard for the
rotate operation in addition to the higher-level storyboard for the complete insertion into
a red-black tree, for example.
So this is basically the idea of doing -- of doing synthesis from the storyboard. And,
again, the high-level point that you want to take out of this is that this is really a
mechanism for providing the system with your insight about what a particular data
structure manipulation looks like, about what are the interesting aspects of this data
structure manipulation, what are the aspects that are relevant that the system really has to
focus on in order to come up with a reasonable solution.
So as we said before, a lot of people don't write these big -- a lot of people don't write
complicated data structures. Instead their job is to go in and go into this massive, giant
piece of code that is already sitting there, lots of people have worked on it, and now they
have to go and add functionality to it. They have to use this massive framework to build
a new application, but not from scratch, but using the functionality that the framework
provides.
And these kind of object-oriented frameworks that are very popular today, they have
really revolutionized programming. They are very much designed around flexibility and
extensibility. And overall this is a really good thing. It has been a great way to improve
program productivity, it facilitates reuse, and it's really easy to write really rich
applications that deliver lots of functionality with relatively little code.
But unfortunately there are also [inaudible] consequences to the use of these massive,
large-scale object-oriented frameworks. And it's a fact that in order to be extensible, in
order to be really flexible, a lot of the functionality has to be atomized into these very
small methods, into these very small objects and components.
And you end up with this proliferation of classes and interfaces and you end up with what
a lot of people call Ravioli code, where everything has been chopped into tiny little
pieces and in order to do anything you have to understand a dozen interfaces and half a
dozen classes, and you go into one method and -- trying to understand what it does, and
all you see are two method calls, and both of them are virtual so you have no idea what
it's actually going to run when you call it.
So I'll show you an example of this that we've been playing with. So most of our work in
this setting has been in the context of Eclipse. And Eclipse actually makes it really easy
to write your own editors and write your own tools on top of the framework.
And one of the functionalities that you have in a lot of these editors is that you can
selectively highlight the syntax and distinguish between different lexical elements by
their color. And it turns out the framework actually makes it really easy to do this. So
you can define -- you can differentiate between, say, a comment and a tag or a string and
make them all come out in different color. It's a great thing to have if you're writing your
own editor.
And so we want to add this functionality to our own language. We have actually been playing with trying to write an editor for the sketch language, and this is the kind of functionality that is a must if you're writing a new editor for a language.
So let's start with what the programmer knows starting out. You know that you have
this text editor, you're building a text editor. And if you go and you look around in
Google a little bit, you find that there is this -- there is actually this token scanner. And
there have actually been studies that look at how people look at -- look for functionality
in these massive code bases.
And one of the things they do is guide themselves by what people in HCI tend to call smells, which are patterns of code, things like lexical clues that tell you this piece of code probably does what I want based on its name or based on how it's written. And so clearly something called token scanner sounds like it's going to do the right thing.
And if you look at the documentation, you see that in fact it does. The token scanner is
the one that is responsible for telling you this is a comment, this is a string, this is a tag,
color it accordingly.
So in the case of our sketch editor, we want to create the sketch editor, and we've created
our sketch scanner. And now the question is how do we get them to interact with each
other? How do we get the token scanner to be part of the editor? And so, well, what
about just calling the set token scanner method in the editor, right? Then we just register
the scanner and we're done.
Well, we wish. It turns out the story is a little bit more complicated than that. It turns out that in order to actually make this work you start with the scanner. And the first thing you have to do is create what is called a damager repairer.
And any of these people who study the smells would probably be flabbergasted at the fact that this functionality that is there to help you color code your syntax actually has to go through a class called a damager repairer. Who in the world would have known that a damager repairer actually has the purpose of adding this color coding to your class? To this date I have no idea why it's called a damager repairer.
But then what you have to do is pass your scanner to this damager repairer. And then you have to create this presentation reconciler and set a damager and a repairer on your presentation reconciler. And don't ask me what this presentation reconciler is, because I have no idea, but that's what you have to do in order to get these to work.
And so what you end up with is a chain of links from the presentation reconciler through the damager repairer to the scanner. And what actually happens is that the editor, through some chain of pointers, has a source viewer. The source viewer has a configuration, and on that configuration the editor is actually going to call this method called get presentation reconciler to get to this presentation reconciler.
So it means that in order to get the framework to use your scanner, you have to create your own configuration and you have to override this method called get presentation reconciler.
And then when you create the editor you have to give it your new configuration through this method called set source viewer configuration.
So, in essence, this is the code that you need to write in order to get your editor to use the
presentation reconciler.
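Written out, that wiring looks roughly like this -- a minimal sketch in Java, where SketchScanner is a hypothetical token scanner for our language and the surrounding class and method names are the standard Eclipse JFace ones:

import org.eclipse.jface.text.IDocument;
import org.eclipse.jface.text.presentation.IPresentationReconciler;
import org.eclipse.jface.text.presentation.PresentationReconciler;
import org.eclipse.jface.text.rules.DefaultDamagerRepairer;
import org.eclipse.jface.text.rules.RuleBasedScanner;
import org.eclipse.jface.text.source.ISourceViewer;
import org.eclipse.jface.text.source.SourceViewerConfiguration;
import org.eclipse.ui.editors.text.TextEditor;

// Hypothetical scanner for the Sketch language; a real one would add rules
// for comments, strings, tags, and so on.
class SketchScanner extends RuleBasedScanner { }

// The configuration is where the framework actually asks for the reconciler.
class SketchSourceViewerConfiguration extends SourceViewerConfiguration {
    @Override
    public IPresentationReconciler getPresentationReconciler(ISourceViewer viewer) {
        // Wrap the scanner in a damager/repairer and register it with a
        // presentation reconciler for the default content type.
        DefaultDamagerRepairer dr = new DefaultDamagerRepairer(new SketchScanner());
        PresentationReconciler reconciler = new PresentationReconciler();
        reconciler.setDamager(dr, IDocument.DEFAULT_CONTENT_TYPE);
        reconciler.setRepairer(dr, IDocument.DEFAULT_CONTENT_TYPE);
        return reconciler;
    }
}

public class SketchEditor extends TextEditor {
    public SketchEditor() {
        // Hand the framework our configuration so that, through its chain of
        // pointers, it ends up calling getPresentationReconciler above.
        setSourceViewerConfiguration(new SketchSourceViewerConfiguration());
    }
}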
So it's very complicated code. It's not just a matter of adding a little line of code in a method; it's really a matter of knowing that you have to subclass this configuration class, knowing exactly which method you have to override, knowing that you have to register the functionality. And if you don't do all of these things, then it's not going to work. It's just going to fail silently and not color your syntax.
So with the technique that I'm going to show you right now, we can actually synthesize
this code and we can actually discover for the user that if you have an editor and you
have a scanner, this is the code that you need to write in order to make it work.
And so this is really a job for synthesis. There are other ways we could try to do this. We could write documentation. And, in fact, Eclipse has really good documentation. But the kind of documentation that you have with something like Javadoc is very fragmented at the level of classes. You can find out what the scanner does and you can find out what the editor does, but if you want to find out how to combine them, it's a little bit trickier.
In the case of Eclipse, you can actually find tutorials that do this. But doing this with tutorials is a little bit trickier, because if you have a hundred classes and you want to write tutorials for how every one of these classes interacts with every other one of those classes, then you would end up having to write 10,000 tutorials.
And a lot of times what really helps is actually going through these examples that other people have written, going through and figuring out how other people have used these token scanners. And a lot of times, because of all the indirection, because of all the reflection, because of all these interfaces, that really requires firing up the debugger, putting breakpoints here and there, and stepping through until you find exactly how these things interact and exactly how they work.
So synthesis is a better answer. And what you want to do is essentially for the
synthesizer to do this process for you, to do this process of analyzing the execution of
programs and figuring out how these things interact.
So the approach that we're following is a very data-intensive approach. You could say that the data is where the human insight really comes from: the fact that lots of people have written editors in Eclipse. Eclipse itself comes with lots of editors, and many of those editors come with their own scanner, and there are also some editors that don't have scanners.
And so people have already done this before in different ways, in different contexts, but
you can actually record information about how people have done this before. And the
crucial novelty here is that we actually want to collect this information not just at the
level of the source code but at the level of the actual execution.
We have a database of Eclipse behaviors that records how Eclipse actually works when it
fires up, when it runs, when it loads different editors.
And so the idea is to really break from this paradigm that says that if the user has a
programming tool, this programming tool has to figure out things from scratch by
running on the user's machine.
By having this data-intensive approach, you can actually have a situation where the
programming tools are accessing a remote, highly distributed resource that actually
contains this knowledge and this insight about how other people have solved this
problem, about how this program executes when it runs. And you can use that to drive the program analysis tools.
So in our case the tool that we have is based on three observations. The first observation is that if two objects are going to interact, usually a precondition for this interaction is that there has to be a reference chain from one to the other. If the editor is to use the scanner, then the editor has to have a way to get to the scanner. There has to be a link in the heap between the editor and the scanner.
So this is going to be our crucial assumption in this work. And our goal is to find the pieces of code that actually make this connection happen.
This is very important because finding these connections in the heap is something that is
very difficult to do statically, particularly in something like Eclipse where you have a lot
of reflection, where you have very dynamic behavior.
But this is something that you can actually do if you're observing a database of program executions. You can look through the execution and find those places where these links get established, then look at the sequence of actions that led to those links being established, in the context of a particular execution, in the context of a very specific editor, and use that knowledge to tell the user: this is what needs to happen in order for this connection to be made, in order for the scanner and the editor to really start talking to each other.
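In rough terms, the query over the execution database looks something like this -- just an illustrative sketch, where the interfaces are hypothetical stand-ins for the recorded data, not the actual tool:

import java.util.List;
import java.util.Optional;

// Hypothetical view of the execution database: ordered heap snapshots plus
// the events recorded between them.
interface HeapSnapshot {
    long timestamp();
    boolean hasPath(long fromObjectId, long toObjectId); // reference chain exists?
}

interface TraceEvent { String describe(); }

interface ExecutionDatabase {
    List<HeapSnapshot> snapshotsInOrder();
    List<TraceEvent> eventsBetween(long fromTime, long toTime);
}

class ChainFinder {
    // Find the first snapshot in which some reference chain connects the
    // editor object to the scanner object, then return the events recorded
    // since the previous snapshot -- the candidate actions that established
    // the link.
    static Optional<List<TraceEvent>> eventsEstablishingLink(
            ExecutionDatabase db, long editorId, long scannerId) {
        long previousTime = 0;
        for (HeapSnapshot snap : db.snapshotsInOrder()) {
            if (snap.hasPath(editorId, scannerId)) {
                return Optional.of(db.eventsBetween(previousTime, snap.timestamp()));
            }
            previousTime = snap.timestamp();
        }
        return Optional.empty(); // the two objects never got connected
    }
}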
So the other important observation is that it's always very helpful to imitate the behavior
of sibling classes. So in the case of the scanner and this editor, lots of people have
written editors, lots of people have added scanners to them. So there are even some
examples as part of Eclipse that contain their own editors and their own scanners.
So by looking, for example, at how the XML editor and the XML scanner talk to each
other, you can make inferences about how your editor and your scanner should talk to
each other.
And the final observation is that if we have many of these traces and if we have
information about many of these interactions over time, we can use similarities and
differences between these operations to look at what is specific to every particular
instance and distinguish it from what is general. You can use it to distinguish between
operations that are done by the framework and that the framework is going to take care of
compared to operations that are really the responsibility of the programmer.
So in our case, we have some examples of, say, the XML editor being used in different contexts. And we also have some examples of another editor that doesn't use a scanner.
And by combining the information from this, you can find things that are common
between the different uses of the XML editor.
Sometimes the XML editor is going to get loaded one way, sometimes it's going to be
loaded a different way, sometimes it's going to cache some information so the
initialization is going to look a little bit different.
But there's going to be a layer of behavior that is common, because it is the actual
behavior that the programmer had to write and that actually has to run every single time
you want to connect the scanner and the editor.
And by looking also at editors that don't use scanners or that use a different scanner, you
can distinguish between how much of that behavior is really specific versus how much of
that behavior is general to making this interaction happen.
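As a very rough illustration of that comparison, you can think of each recorded execution as a set of event descriptions and use intersection and difference to separate the two kinds of behavior; the actual tool is certainly more sophisticated than this sketch:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: approximate the behavior that is specific to wiring a
// scanner into an editor by keeping what is common to every scanner-using
// run and removing what also shows up in runs without a scanner.
class TraceComparison {
    static Set<String> scannerSpecificBehavior(List<Set<String>> withScannerRuns,
                                               List<Set<String>> withoutScannerRuns) {
        // Assumes at least one run with a scanner.
        Set<String> common = new HashSet<>(withScannerRuns.get(0));
        for (Set<String> run : withScannerRuns) {
            common.retainAll(run);     // keep only events seen in every such run
        }
        for (Set<String> run : withoutScannerRuns) {
            common.removeAll(run);     // drop events that happen even without a scanner
        }
        return common;
    }
}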
So currently our database is actually very rudimentary. We are essentially tracking method entry and exit. We're tracking heap loads and stores, but at a relatively coarse level of granularity. We're also tracking the class hierarchy. One of the things we are not tracking is anything that happens inside the standard Java classes, things like java.util containers. All of that we are abstracting away.
And so a lot of the actual expensive behavior, which happens inside containers, for example, we don't have to track, because all we care about is that there is some container that has pointers to some particular set of elements. So a lot of these events are ignored in the database.
The database also contains some periodic heap snapshots that allow us to run these queries where we go through the snapshots, look for the objects that we're interested in, and look for where they are connected, so that we can track back and find the place where the connections between these objects first appear.
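To give a rough picture of what gets recorded, you can think of the database as holding records something like these -- hypothetical types only, not the actual schema:

import java.util.Map;
import java.util.Set;

// Hypothetical record types sketching what the database tracks; the real
// schema used by the tool is not shown here.
enum EventKind { METHOD_ENTRY, METHOD_EXIT, FIELD_LOAD, FIELD_STORE }

record TraceRecord(EventKind kind, long timestamp, String className,
                   String memberName, long receiverObjectId) { }

// A periodic heap snapshot: a coarse points-to map from object ids to the
// object ids they reference, with java.util container internals abstracted
// into a single "this container holds these elements" edge set.
record HeapSnapshotRecord(long timestamp, Map<Long, Set<Long>> pointsTo) { }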
It's a lot of data, but it's manageable. And note that currently we're, again, not doing anything very sophisticated. One of the areas that we're really interested in is how we can use abstraction at the level of this data to deal with the fact that there's a lot of redundancy in this data, a lot of things that happen over and over again. There's a lot of redundancy across different executions, and there's a lot of redundancy even within executions.
So currently, with the way we're tracking data, we're getting between about three and seven megabytes per second of real-time execution. Yes.
>> Can you compare this approach with how jungloid mining could help in this case, and with the work on specification mining, which tries to compute temporal relationships between [inaudible] API calls?
>> Armando Solar-Lezama: So this is an excellent question. The big obstacle to something like jungloid mining is that this is not a jungloid. Jungloid mining helps you find these instances where you have an object of type A and you want, by some chain of calls, to get out of it an object of type B.
And so what jungloid mining is going to give you is this sequence of calls that will return your object of type B from an object of type A.
In this case what you want is something that requires much richer knowledge about the runtime behavior of the program. There are three things that you need to discover: first, the classes that the user needs to extend in order to get the desired behavior; then the methods within those classes that the user needs to override; and then the methods that need to be called within those methods to actually get the desired behavior.
So something like this is much richer than what the jungloid provides. And that's why you need much richer semantic information about the program than the kind of type-level information that the jungloid uses.
The big difference with code mining is that in this case we're using the runtime information. And it's not just the runtime information from one execution; we're really putting together the runtime information from many different executions. And the things that we get from the runtime information are these relationships, these heap-level relationships, that are very difficult to extract directly from the code.
>> I have a question.
>> Armando Solar-Lezama: Yes.
>> So for all of the syntheses you always need a component that validates that your result
is actually working. So what is your validation in this case? You have a user seeing the
color appear in the browser?
>> Armando Solar-Lezama: Yes. So at this point the validation is essentially for the
user to take the code, put it in the program, run it and see if it works.
This is actually a very good question. Data is a great way to learn about the common
case, to learn about how people usually go about doing things. It's a very bad way of
trying to find bugs, for example, or trying to find, you know, that rare corner case that
nobody thought about and that is going to get your program to crash.
So in this case the synthesis is really there to tell you: this is how people usually do this, this is the accepted way of getting this functionality to work.
And it might be that beyond that accepted way there is this extra thing that you need to do in order for it not to crash, or it might be that there is this rare input that is going to cause it not to work.
But here what we're trying to do is get you a foot in the door, to allow you to get that crucial first running version where you can actually get a hold of how this functionality is put together: what are the classes involved, what do I need to override, what do I need to write in order to have any chance of this working at all.
And, again, one other thing that is crucial here is the issue of insight. In this case there's
actually very little insight that comes from the programmer who is asking the questions at
this point. The programmer has just a very tenuous grasp of exactly how this
functionality is supposed to work and what needs to be done.
The insight is really coming from all of these other programmers who have written test cases for this, who have written unit tests. From those test cases, from those unit tests, from those other instances where this has been used, you build the database, and the database is where the knowledge and the insight really come from, more than from the programmer who is doing the programming right then and there.
But still there is a crucial aspect of usability in that the programmer needs to be able to
provide at least this tenuous insight and this tenuous description of, well, this is roughly
what I'm trying to do, I'm trying to get these two classes to talk to each other and to
interact with each other.
So I'm going to finish this talk with a very brief look at this final project dealing with specification-based hardening. The basic idea is that in many cases people write imperative programs, or even functional programs, where you have a very prescriptive programming model that says you have to do this and then you have to do this and then you have to do this, and they do it really for performance reasons. They know what the algorithm is that they want to implement, they know what that algorithm looks like, and they just want to program it.
But a lot of times what you have are corner cases, corner cases where you have to do something a little bit different: where the user didn't enter the right information, where the permissions don't quite match. And there you need to do something different.
And for all of these corner cases, a lot of times you might end up having to write even more code than you have to write for the common case, even though most of the runtime is actually going to be spent in the common case.
At the other extreme, you have declarative programming where you can really specify
things at a much higher level of abstraction, where you don't have to worry so much
about all of these different corner cases.
But the catch is that you don't have as much control over exactly how the computation happens. You don't have as much control over exactly what algorithm is used, exactly how the computation should proceed.
So what we really want to do is allow the programmer to write imperative code for the common case, and then, for a moment, stop worrying about these corner cases and instead provide a declarative specification for them, a declarative specification that says, globally, this is how the program should behave, this is how things should happen when you run the program.
And what you want is that, as you're running, when you encounter these corner cases, you switch from this purely concrete mode of execution, where the program is just running, to a mode of execution where you're really going to an oracle and asking: what should I do now, what should the answer be, what should this value be?
And this way you use the declarative specification to deal with these corner cases while
most of the effort and most of the computation is taken care of through the standard
imperative computation.
So the idea is that as the program is running it runs into this exceptional case, where it falls back on the specification and asks the specification: what should happen here, what is it that I should do? There's some amount of symbolic execution, and then after a while that symbolic state gets flushed out and the program continues doing its own standard computation.
So we've tried this idea in a very simple context, the context of data processing applications. These are applications where you have a lot of data and you want to write a very simple program on top of it. So here's a trivial program: you want to take this table of census data, filter based on marriage status, and then average. It's a two-line program. How much simpler can things get?
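Just to make that concrete, the two-line program might look something like the following sketch -- the Person record, its fields, and the choice of averaging income are made up for illustration; the example only says filter on marriage status and then average:

import java.util.List;

// Hypothetical data-processing program: filter a census table on marriage
// status, then average. Person and the income field are illustrative
// assumptions, not an actual benchmark.
record Person(boolean married, long spouseId, double income) { }

class CensusQuery {
    static double averageIncomeOfMarried(List<Person> table) {
        return table.stream()
                    .filter(Person::married)      // line 1: filter on marriage status
                    .mapToDouble(Person::income)  // line 2: average the remaining rows
                    .average()
                    .orElse(0.0);
    }
}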
But what happens if some of the spouses can be missing? Well, one of the things that you know is that you can take advantage of the fact that the spouse relationship is symmetric -- it's done?
>> Sumit Gulwani: I think we should wrap up.
>> Armando Solar-Lezama: All right. Well, so I think we can conclude. And the conclusion is that this is a case study for a much richer programming paradigm where we want to be able to deal with some aspects of the computation in an imperative manner and with some aspects of the computation in a constraint-based manner. And we want to use nondeterminism in the programming model to connect the two, to tell the system: this is where you should use the declarative model, this is where you should use the imperative model.
And we're studying right now some applications beyond data processing to security and
privacy.
So that's probably it. We can -- I guess we can continue the discussion outside if you
want.
>> Sumit Gulwani: Let's thank the speaker.
[applause]