>> Sumit Gulwani: Okay. Hello, everyone. It's
my great pleasure to introduce Rishabh Singh who
is a Ph.D. student at MIT and is graduating this
year. So Rishabh was a three-time intern at
Microsoft Research, so some of you might know him
well. He's also the recipient of the MSR Ph.D.
fellowship award.
So Rishabh was a key contributor to the FlashFill
project, so FlashFill feature, as many of you
know, shipped in Excel 2013. Back at his
university he is also leading several other
interesting projects, and one of my personal
favorites is his work on automatically grading
introductory programming assignments, which I
think has the potential to revolutionize
computer science education.
So without any further ado, Rishabh. He's going
to tell us something about some of these
interesting projects, and perhaps you will learn
more.
>> Rishabh Singh: Thanks a lot, Sumit. So hi,
everyone. Today I'm going to talk about program
synthesis for the masses, but before that, let me
start with something that happened to me three
years ago. I was a teaching assistant for an
introductory programming course at MIT, and I was
supposed to grade assignments of 200 students,
and it just so happened that I was lazy, that I
only had two days to grade them, and there was
also [indiscernible] line, so what could I have
done? So what I did, I just ran test cases and
gave them appropriate marks based on how many
test cases passed.
But then from the feedback we got later, the
students were quite unhappy, and in this
particular case they were saying that the
feedback based on test cases was not really
helpful. It doesn't really tell them what's
wrong with the program. And also, the kind of
grades they get is not in accordance with their
potential to write this code.
So this is already a problem in classrooms. We
have to spend a lot of time grading these.
Typically it used to take me two to three days to
give good feedback.
But this problem is getting even bigger now with
MOOCs coming up. When there are hundreds of
thousands of students signing up, there's no way
we can hire enough people to give good feedback
for all of them.
So in this talk I'll show you one way we can use
synthesis to provide automated feedback similar
to what teachers would give. But more generally,
my research goal is to make programming more
accessible to people, and we can achieve this
goal in two ways, and these are the two ways I've
been looking at it.
We can build systems to help people learn
programming, to let them go up the
[indiscernible], and the second way we can do
this is to make it easier for people to program
by letting them program using more intuitive
specifications.
And these two directions I have been taking over
the last few years. So the three systems I've
built or worked on. The first system is
Autograder, which provides automated feedback on
programming assignments, and we have used it to
generate feedback for thousands of submissions
from edX, and we are in the process of deploying
it for the next course.
The second system is FlashFill, which is a
programming by example system for Excel users,
for people who don't know programming but still
want to get tasks done. And I'll briefly talk
about this, as well.
The third system, which I won't have time to talk
about today, but I can talk about it afterwards,
is storyboard programming, which lets students
write data structure manipulations using
examples of data structures.
So these three projects, automated grading,
programming by example, and storyboard
programming, seem quite different, and they
indeed are quite different from each other. But
one thing that ties all of them together is
program synthesis. And in this talk I'll show
you how some of the same ideas can enable all of
these different applications.
Now, traditionally, back in the '80s, program
synthesis looked like this. Somebody would go in
and write a complete specification. This is a
specification for merge sort. Give it to the
'80s computer, and out comes some implementation,
in this case some sort implementation. And as
you can imagine here, oftentimes the
implementation would be much smaller than the
spec itself.
So this approach didn't gain much traction, and
programmers would rather write code instead of
writing this larger piece of specification.
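To make the contrast concrete, here is a small Python sketch of what a complete sorting specification looks like when written as a checkable predicate (the function name and encoding are illustrative, not from the talk):

```python
from collections import Counter

# Illustrative only: a complete specification of sorting, stated as a
# predicate relating an input list to a candidate output list.
def meets_sort_spec(inp, out):
    # The output must be in non-decreasing order...
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    # ...and a permutation of the input (same elements, same multiplicities).
    permutation = Counter(inp) == Counter(out)
    return ordered and permutation
```

Even for sorting, the declarative spec carries as much text as a short implementation, which is the friction the talk is describing.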
So now let me show you what the more modern view
of synthesis has been in recent years. First of
all, we can break it down into three components.
The first component is the specification
mechanism, and we are going to embrace the fact
that it's hard for people to write a complete
description, and we're going to let them provide
specs using more intuitive specifications, things
like examples, reference implementations, control
flow. And I'll show some of these examples more
concretely.
The second component is the hypothesis space.
Typically people have tried to synthesize
programs in Turing-complete languages, and one of
the contributions of our work has been to
identify subsets of logic, so that you don't
always have to go Turing-complete, which have two
properties. These domain-specific languages are
expressive in the sense that you're able to
express most of the useful tasks, but at the same
time, they're efficiently learnable. So it has
these two properties.
And finally, we have the third main component,
the synthesis algorithm that learns programs from
this hypothesis space that conform to the
specification that somebody has given. And there
are also interesting challenges here: how do you
efficiently search this large space of programs?
So now let me start with the first system,
Autograder. This is joint work with Sumit and
Armando, and this project aims to automate
grading. So before we think about automation,
let's look at how it's done in classrooms today
typically.
We TAs all sit in a classroom, go over a few
assignments, and after going over a few
assignments, you typically know what common
mistakes students are making, and then you
construct a grading rubric. And after
constructing that, there's a very repetitive
process. You take the rubric, you apply it to
the student's program, and give appropriate
feedback. And this repetitive process is the
exact thing we want to automate in this system.
>>:
What do you mean by "grading rubric"?
>> Rishabh Singh: Oh, so a rubric is simply
saying what mistakes students are making, how
many marks I should deduct for each, and what
feedback I should give for this particular
mistake. So these are the common sorts of
mistakes students make on a given assignment.
So we wanted to replicate this repetitive
process, and instead of calling it a rubric, in
this system we're going to call it an error
model. But everything remains similar.
So let me tell you about the system using a very
simple example. This is an exercise from the edX
class on introductory programming. This exercise
asked students to write a Python function that
takes a polynomial, which is represented as a
Python list, and returns a list whose values
correspond to the coefficients of the derivative
of the polynomial.
And if your polynomial is constant, you want to
return zero. So these are the two cases, the
constant case and the general case. And you can
imagine from very simple algebra [indiscernible]
a very simple solution. First of all, check if
the polynomial is constant; if so, return zero
and don't do anything else. Otherwise, simply
iterate over the list and compute the
coefficients of the derivative by multiplying
each value by its index in the list.
So this is a pretty straightforward way to
implement this. Pretty much what a teacher would
write.
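A Python sketch of that teacher-style solution (the exact code is not shown in the talk, so the name and details here are a reconstruction from the description):

```python
# Reconstruction of the teacher's solution described above: the derivative
# of poly[0] + poly[1]*x + poly[2]*x^2 + ... as a list of coefficients.
def compute_deriv(poly):
    # Constant polynomial: the derivative is just zero.
    if len(poly) == 1:
        return [0]
    # The coefficient at position i-1 of the derivative is i * poly[i],
    # i.e. each value multiplied by its index in the list.
    return [i * poly[i] for i in range(1, len(poly))]
```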
Now let me show you one solution that somebody on
the edX class was submitting.
So this was this particular student's 10th
attempt. You get multiple attempts; you can keep
clicking, and it will tell you what test cases
are passing or failing.
And so for this particular student, this was the
10th attempt, and after this the student is going
to give up, because the student is not getting
any feedback from the system about what's
happening with this code and why it's failing.
And a similar problem also happens when you
typically grade these programs, because you would
see such a variety of solutions that it's not
even clear to me what's wrong with this solution.
I would have to spend four or five minutes to
understand it.
But now, actually, we don't have to worry about
it. We can give it to our system and we can
enjoy in the meantime. So let me show you what
our system can do.
This is the exact problem I was showing you
before, and we can ask the system to do some
analysis for us. It's going to search over a
space of corrections and then it's going to say
your program is almost correct, but it requires
two changes. In line number 14, instead of
range(1, length), it has to be
range(1, length + 1).
So we can do that. And now hopefully the system
is going to tell us there's only one change we
need to make, which in this case I think is that
instead of greater than or equal to, it should be
greater than.
And there are multiple choices the system can
produce, so here it is saying make I into I minus
one, which leads to another correct solution;
I minus one is the same as changing greater than
or equal to into greater than.
So one more thing I wanted to show you here was
that this problem is challenging because for a
given problem, there are many ways to solve it.
For example, this was a problem, and the attempt
was closer to what the teacher had written, but
it still had some small mistakes. In this
particular case the student was using --
>>: I have a question.
>> Rishabh Singh: Yeah.
>>: The previous demo, if you made that change
with I minus one, what would your system return?
>> Rishabh Singh: Oh, if you just make it I
minus one?
>>: Yeah, whatever it said.
>> Rishabh Singh: Then it would say the program
is correct. Nothing else. Yeah.
>>: Oh, I see.
>> Rishabh Singh: Yeah. So it doesn't do
anything else, yeah.
>>: Is that, by the way, I mean, just generally
is that worrisome to say that the program is
correct? I'm sure that your system doesn't
really know that the program is correct
[inaudible].
>> Rishabh Singh: That's a very good point,
right. So I'll tell you what the system does and
what kind of guarantees you get, but as you can
imagine, we are doing exhaustive testing; even
though it's bounded, it's still much more
exhaustive than what typically happens in edX
right now. They have a fixed set of test cases,
and then also they would say your program is
correct. So at least it's better than that.
>>:
[inaudible].
>> Rishabh Singh: But not a completely accurate
thing, right?
>>: Is this [inaudible] impact to the teacher or
the student in this case?
>> Rishabh Singh: That's another good point. It
depends on the use case, and I'll tell you some
of the use cases from our experience. But one of
the use cases is actually telling students -- for
example, the student who was stuck on the tenth
attempt. Maybe you want to give this kind of
feedback, but not exactly how to fix it. So
that's also a good --
>>: It seems like something [inaudible] if you
just told them what to do --
>> Rishabh Singh: Then the learning component is
going away, right. So one thing I can show you
here, so we support different hint levels, so you
don't always have to say everything. You can say
that maybe there's something wrong in this
expression, but I won't tell you how to fix it.
>>:
[inaudible].
>> Rishabh Singh: You can also go one level
down. This is what we're going to experiment
with on edX. You just highlight the expression,
yeah.
>>: So there's a level, I mean, this is like
fixing the program, but there's also a level of
understanding. It's like they don't understand
the difference between greater than or equal and
greater than, like an off-by-one kind of thing.
>> Rishabh Singh:
Yes.
>>: Is there any conceptual framework that you
can raise these things into to actually explain
the real problem?
>> Rishabh Singh: That's also a very good point,
right. For example, one thing we have been
thinking about has been a more interactive
fashion. So let's say the fix is: your array
index I should be I minus one. Instead of just
telling them to make I into I minus one, you can
actually ask the student: let me ask you a
different problem, not this problem. Given a
list, can you print the elements of it? And if
the student is still making the mistake of
thinking [indiscernible] they start from one
instead of zero, now you have a high-level idea
of what the problem is with the student.
So that way you can actually use this low-level
information to figure out the root cause and then
give high-level information. For example, from I
greater than or equal to zero versus I greater
than zero, you can figure out you're doing one
more loop iteration. So instead of telling them
that, you give more high-level information. But
so that's --
>>: Yeah, that's exactly how I would do the
teaching, right? You don't want to have them
solve the problem. You don't want to tell the
answer at hand.
>> Rishabh Singh: Right.
>>: You want to, you know, it's just like --
>> Rishabh Singh: Somehow --
>>: -- you want to simplify it.
>> Rishabh Singh: Yeah.
>>: You know, find the thing that they really
need to understand and then say, okay, now think
about this other thing, how can you solve it.
>> Rishabh Singh: That's actually a very good
point. So we had lots of meetings with the
teachers who are teaching this course, as well,
and they had the same opinion, as well, that you
want to teach them debugging instead of just
teaching them how to fix code, right. And this
was one way we were thinking to have an
interactive dialogue.
So one thing also: this was the most interesting
one I found in the database. If you look at this
program, it looks really, really different, and
it seems it cannot be correct, because you're
only returning when the length of the input is
less than two, and you want your program to work
for all inputs.
But in this case, the student is actually doing a
very interesting thing here, and it just so
happens this is a nearly correct program. The
student is popping elements from the list,
computing the result, and when you have popped
enough, then it's okay to return, which is
actually a quite interesting, clever way to do
it, even though this shouldn't be recommended
because you are making a stateful change to the
input. And the only change here is that in line
11 it should be one instead of zero, the
initialization, and the student is missing a base
case, but that's still pretty close.
>>: It's interesting, because if I was teaching
it, we talked about the rubric; the rubric would
be something like: you missed the base case,
minus one; off by one, minus one, right? I mean,
there are these sorts of conceptual things.
They're not at the level of the detail, but,
yeah.
>> Rishabh Singh: Yeah.
>>: I'm curious about these English sentences --
>> Rishabh Singh: Yeah.
>>: -- that your tool is outputting. So I'm
sure that at the core of it there is some SAT
solving going on.
>> Rishabh Singh: Yeah, yeah, yeah.
>>: From there you have to construct these --
>> Rishabh Singh: Yeah.
>>: -- into sentences. How do you do that?
>> Rishabh Singh: So this is very basic right
now. It doesn't do any NLP. For every
correction rule, there is a template of an
English statement with some blanks, and you fill
them up based on the feedback the SAT solver is
telling you.
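For instance, the template mechanism might look like this (a hypothetical sketch; the actual template strings are not shown in the talk):

```python
# Hypothetical feedback template for one correction rule; the blanks are
# filled in from what the SAT solver reports back.
TEMPLATE = "In line {line}, instead of {old}, it has to be {new}."

feedback = TEMPLATE.format(
    line=14,
    old="range(1, len(poly))",
    new="range(1, len(poly) + 1)",
)
```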
>>:
So that --
>> Rishabh Singh:
Template is fixed.
>>:
-- template is your domain, the one that
you were talking about earlier?
>> Rishabh Singh: The error model.
>>: The error model.
>> Rishabh Singh: The rubric, yeah. So for
every correction rule, you have a template, what
feedback should I give if this correction rule is
applied.
>>:
Right.
>> Rishabh Singh:
space --
And then you fill up the blank
>>: But I was trying to connect it to the
broader picture that you had when you were
putting programs in the [indiscernible] middle
where -- I don't know what you were calling it.
You know, there was a DSL component, right?
>> Rishabh Singh: Oh, actually --
>>: Is that the same as the error model? It's
the same --
>> Rishabh Singh: Yeah, so that's part of the
DSL, yeah.
>>: Okay.
>> Rishabh Singh: So I'll actually go back to
that. That's a good point, right? Which is
probably the next slide, actually. Or maybe not.
Yeah.
So the error model I showed you -- let me tell
you what an error model is, which in this case,
in our system, is simply a collection of
correction rules. For example, if you look at
this correction rule, it says whenever you see a
return statement in the program, you can
optionally modify it to return, in this case, a
list containing zero, or question mark V.
Question mark V means any variable defined in the
program, and this corresponds to mistakes
students were doing, that they would do
everything right, but they would return something
else or forget to change the return statement
because it comes with a default return, the
template of the code.
So this correction rule lets you fix that
problem. You can return any of the variables in
the program. And typically you would have a
large number of these rules.
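As a rough sketch, the return-statement rule can be thought of as generating a candidate set like this (an illustrative encoding, not the system's actual representation):

```python
# Illustrative encoding of the correction rule
# "return e  ->  return [0]  |  return ?v":
# keep the original expression, and add [0] plus every defined variable.
def return_rewrites(returned_expr, defined_vars):
    candidates = {returned_expr, "[0]"}
    candidates.update(defined_vars)
    return candidates
```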
>>:
Where are these rules from?
>> Rishabh Singh: So right now some teacher,
somebody, has to go and write these manually, but
towards the end I'll show you how we can learn
these from data as well. Right now, though,
these rules have to be manual.
So, yeah, now we can revisit the synthesis
problem, and this goes back to Shah's question,
as well. We can go back and look at our bigger
picture of three components and try to put things
together.
So the specification mechanism in this case is
whatever the teacher is writing. If we have a
correct implementation -- let's say if we were
doing sorting, the teacher can write bubble sort.
The hypothesis space comes from whatever the
student has submitted, the student's solution,
and the error model that you have, which is
giving you a space of corrections you are
searching over.
And the algorithm essentially searches over this
space, representing the solution space using a
language we're going to call Python tilde, and
then tries to find a correction that matches the
behavior of the teacher's solution.
So let me tell you some of the key ideas in each
of these steps, starting from the student code
and going to feedback. There are four main
phases. The first phase is rewriting, which
takes a Python program, takes the error model,
and rewrites it into this language we call Python
tilde. Then we have to solve these constraints,
so we're going to use an off-the-shelf solver.
In this case we're going to use the Sketch solver
to do the synthesis task for us, so it creates a
Sketch file, which is then solved, and then it
gets translated to natural language using
templates.
So let's go over the interesting things in each
one of these phases. The rewriting phase takes
the error model. So let's assume I have a very
simple error model: any arithmetic expression A
goes to A plus one.
And let's assume this is what the student had
written, this expression. So we can start
applying this rule. We can apply it to the first
subexpression, zero, and we get zero or zero plus
one. We can apply this to poly, as well. Also,
one thing is, whatever the student had written,
we're going to keep it in a box, just to make
sure we remember what the student had written.
We can apply it to len of poly, and poly becomes
poly or poly plus one. Now, you would imagine
poly is a list and one is an integer, so it
doesn't make sense to add an integer to a list,
but this phase is completely syntactic, so it's
not taking semantics into account; this would be
handled in the next phase by the semantics of the
language.
You can also apply it to the whole expression len
of poly, and you get two more choices. So in
this way, you keep rewriting the program
symbolically, and this set expression now
represents, in this case, eight possible rewrites
of what the student had written. And in this way
you take the error model, you apply it to the
whole program, and now you get a family of
programs --
>>: This is -- this is a never-ending thing, you
keep rewriting and it will --
>> Rishabh Singh: Yes, exactly. So we have
well-formed error correction rules. This is a
well-formed rule in the sense that it's not
nested. To do nested rewrites, you have to have
a special symbol. You have to say A prime.
>>: I see. So you're saying that when you do A
to A plus one --
>> Rishabh Singh: There's no more rewriting you
can do on this thing.
>>: I see. Okay.
>> Rishabh Singh: Yeah. And then there's some
well-formedness condition to make sure it always
terminates. But typically we always have
one-level rewrites. We never go into nested
rewrites. But the way the semantics are defined,
it actually all works out, at least
theoretically.
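The rewriting on the range(0, len(poly)) example can be sketched by explicit enumeration (a toy stand-in; in the real system the choices stay symbolic as set expressions rather than being enumerated):

```python
from itertools import product

# Toy model of the rewrite phase with the single rule A -> A + 1, applied
# optionally (and non-nested) to every subexpression of a tuple-encoded AST.
def all_rewrites(expr):
    if isinstance(expr, tuple):
        op, *args = expr
        combos = [(op, *c) for c in product(*(all_rewrites(a) for a in args))]
    else:
        combos = [expr]
    # Each candidate may be left alone or wrapped as candidate + 1.
    return [e for c in combos for e in (c, ("+", c, 1))]

# Rewriting the two arguments of range(0, len(poly)) independently gives
# 2 * 4 = 8 candidate programs, matching the eight rewrites mentioned above.
family = [("range", a, b)
          for a, b in product(all_rewrites(0), all_rewrites(("len", "poly")))]
```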
So now the problem is you have this program in
the Python tilde language, where you have set
expressions, and the problem boils down to: give
me a program from this set of programs that is
functionally equivalent to what the teacher had
written.
And by functionally equivalent, we mean that it's
more than just syntactic. If the teacher had
written bubble sort and the student is writing
merge sort, we still want to give feedback. It
should have the same functional behavior.
>>:
So I've got a question --
>> Rishabh Singh:
Yeah.
>>: -- for you. So I noticed that some of the
programs that you were generating, they're not
type-safe. Python is a dynamic language. The
fact that Python is a dynamic language, does it
help you or hinder you?
>> Rishabh Singh: At least from my experience,
it hinders us a lot because you have to model all
these type constraints inside.
>>:
I see.
>> Rishabh Singh: We'll go a little bit into the
encoding after that, but, yeah. So it's a
bigger challenge than a normal C-like language.
>>: Also, a lot of these programs, if there was
a static type checker, then you could have just
pruned them away.
>> Rishabh Singh: That's a good point, right.
You can prune it --
>>: [inaudible] layer right now [inaudible] --
>> Rishabh Singh: So right now --
>>: -- dynamic.
>> Rishabh Singh: Right. Right now the
constraint solver is doing type checking for us,
right? But, yeah, that's a good point. You
could remove some of them during rewriting before
giving it to the solver. And it actually does a
lot of efficient checking also before giving it
to SAT. So in between there are also some
optimizations going on, but, yeah.
So the thing is, you can take this program, and
if you try to do an explicit-state
model-checking-like technique of smart search --
and I tried doing it, as well; it's still
running, it's been more than one year -- from my
back-of-the-envelope calculation, it would take
more than 30 years to get feedback on just one
assignment like this. So we want to do better
than that. We can't wait that long.
So that brings us to our second phase. Instead
of just enumerating and pruning, we're going to
give it to a symbolic solver to solve this search
space, and we're going to use the Sketch solver.
So to give you a quick introduction to Sketch:
it's a C-like language, very similar to C. It
has one extra construct called holes, and holes
are represented using a double question mark. So
you can write a program like this, P, where C is
some unknown, and you have some assertion in the
program that says input plus input should be C
times input. And I want this assertion to hold
for all inputs.
And you can give it to the solver, and the solver
would say: when C is equal to two, I can prove
that for all inputs, all assertions will be
satisfied. So we're going to use this system,
and essentially what this system is solving for
us is an equation of this form: there exists a
program P such that for all inputs, all my
assertions are satisfied. So phi of P and in is
satisfied.
So it's solving this doubly quantified equation.
So we can go ahead and try to represent our
problem in this form, as well. We can say, let's
assume S is the student solution and ERR is the
error model. Give me a rewrite of the student
solution, S prime, which belongs to the set of
rewrites, such that for all inputs, the behavior
of the teacher's program is the same as the
modified student program. So this is where the
functional equivalence is coming in.
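Written out, the grading problem has the same exists-forall shape as the Sketch equation:

```latex
\exists\, S' \in \mathrm{ERR}(S).\ \forall\, \mathit{in}.\ \ T(\mathit{in}) = S'(\mathit{in})
```

where T is the teacher's implementation and ERR(S) is the set of rewrites of the student solution S under the error model.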
But at least in this way we can formulate our
grading problem as a synthesis problem. But
there are some challenges again: we have Python
programs, as Shah was mentioning. We don't have
C programs for this. So we would have to somehow
encode Python semantics in Sketch, which has a
lot of interesting details as well, but let me
point out one or two interesting ones.
So essentially the idea is we're going to take
Python, write a compiler to Sketch, and in some
sense in this way we are going to get a synthesis
system for a dynamically typed language. So --
yeah.
>>: When you say "correct," I mean, you mean
you're going to test it for some number of cases,
or you're actually proving it's correct, or --
>> Rishabh Singh: Yeah, so I'm going to go over
the algorithm after this, but, yeah, so it has to
be bounded, yeah.
>>: The double quantifier that you have, it's
really just -- it's an or [indiscernible]?
>> Rishabh Singh: Yes, exactly. It's going to
turn into that, yeah. Right.
So the first challenge is we have to handle
dynamic types, and we're going to take a strategy
similar to union types. We're going to define a
structure which has a flag, let's say an integer
flag in this case, which denotes what the type of
the variable is, and for every type it has a
field. So it will have an integer field, a
Boolean field, a list field, a string field.
For example, if I have constants like integers, I
can say my MultiType's type is int and the
integer value is the corresponding integer value.
For lists, it's going to be a recursive type.
We're going to say the type field is list and the
values are the other two MultiTypes.
So this way we can encode constants into
MultiType. We now also have to encode all the
Python statements and library functions over
MultiTypes. For example, if you take the
addition expression, you have to encode that when
there are two integers, it adds the two integers
and preserves the type flag.
When there are two lists, it appends the two
lists, and whenever you are trying to add an
integer to a list, it leads to an error state.
So this is where type checking comes in, in some
sense. And, yeah, so typing rules are encoded
now as constraints inside the system, and you do
this for all of Python, or at least the subset of
Python you want to support.
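A minimal Python sketch of the MultiType idea (the field names and the tuple encoding of list values are illustrative, not the system's actual Sketch structure):

```python
from dataclasses import dataclass

INT, LIST, ERR = 0, 1, 2  # type flags; the real encoding also covers bool, string, ...

@dataclass
class MultiType:
    flag: int
    ival: int = 0    # used when flag == INT
    lval: tuple = () # used when flag == LIST (elements are MultiTypes)

# Addition over MultiTypes: int + int adds the values, list + list appends,
# and mixing an integer with a list leads to the error state.
def mt_add(a, b):
    if a.flag == INT and b.flag == INT:
        return MultiType(INT, ival=a.ival + b.ival)
    if a.flag == LIST and b.flag == LIST:
        return MultiType(LIST, lval=a.lval + b.lval)
    return MultiType(ERR)
```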
Yeah.
>>: So then if students create new classes, they
don't [inaudible] Python basic types?
>> Rishabh Singh: Oh, no, it's actually -- we
also have an encoding for classes and objects,
yeah. So this is a very simple thing, but, yeah,
you can build more complicated objects on top of
it.
>>: This seems to me to be a massive
undertaking.
>> Rishabh Singh: Yeah, it is.
>>: To encode the operational semantics of all
of Python. So what -- I'm sure you're not
handling all of Python, right?
>> Rishabh Singh: Yeah, exactly right, because
you can't do it.
>>: How do you go about [inaudible] keep getting
programs --
>> Rishabh Singh: Yeah, so the idea was, we
wanted to start small, so we took the first seven
to eight weeks of the edX course. At that time
it didn't include classes. It was mostly all of
the basic data types and things like higher-order
functions, loops, list comprehensions, those
things. And these are the things we wanted to
capture, most of what people do for the first
eight weeks. But after that we got more people,
and we now also have class encodings and other
things on top of it. So, yeah, it's a subset,
but still rich enough to actually handle a very
large class [inaudible].
And also there are some complicated library
functions that people use which are not good for
solvers, so we built what we call models for such
complicated functions. For example, if you have
square root, the system is going to encode it in
such a way that it says: I don't care what the
function is; I'm just saying it's some
uninterpreted function that satisfies some
postconditions. So here we're going to say,
whatever the return value is, its square should
be less than the input. So there's some
approximation also happening for complicated
library functions, and we have models for them.
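As a sketch of such a model: instead of computing square root exactly, the encoding only constrains the return value by a postcondition (the exact postcondition used here is an assumption for illustration):

```python
# Model of integer square root as an uninterpreted function constrained only
# by a postcondition: any r with r*r <= x < (r+1)*(r+1) is acceptable to the
# solver, so sqrt itself never has to be encoded.
def sqrt_model_ok(x, r):
    return r * r <= x < (r + 1) * (r + 1)
```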
So with these two things, you can imagine that we
have a compiler from Python to Sketch, and I can
talk about more details afterwards if you're
interested.
But let's assume we have a compiler now, so we
can translate these Python programs to Sketch.
Now the challenge is how do you solve them. And
this is very rich. I think Reston [phonetic] was
asking.
So essentially we're going to use an algorithm
similar to CEGIS. First let me tell you what
CEGIS gives us. It's counterexample-guided
inductive synthesis. We're trying to solve a
doubly quantified equation, and you want to know
what the value of C is.
And the way we're going to solve it is, since
it's a doubly quantified equation, we're going to
remove one quantifier at a time and divide it
into two parts. The first phase drops the
for-all quantifier and says: I don't care about
all inputs; give me a program that works for a
random input, and let's say in this case it's
going to be zero. I put zero in my constraints
there and I would say, okay, I know zero plus
zero is ten times zero. So one valid solution is
C is equal to ten,
which is okay for now. But the problem is this
program works for zero but may not work for every
input, so we need to go to the second phase,
which says: verify; give me a counterexample
input such that this program doesn't work,
whatever I have found till now.
So in this case, the system would say yes: when
the input is four, four plus four is not equal to
ten times four, so it doesn't work on four. So
four becomes your new counterexample input, and
you go back to the synthesis phase, and this time
you ask the question: give me a program that
works for both zero and four.
And you keep doing this. At some point, in this
case, it just so happens that when you have two
inputs, it's sufficient to constrain it to get
the right answer, C is equal to two, and you give
it to the verifier, and the verifier can't find a
counterexample for you and says okay, the program
is correct for all inputs.
Now, this technique depends on how strong your
verifier is. If your verifier can reason about
infinite inputs, then you would get those
guarantees. For our system, we have a bounded
model checker, so you only do it for bounded
inputs.
So in general, yeah, you would have a few
iterations of these two phases until you
[indiscernible] to a solution.
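The two-phase loop on the toy Sketch example (find a hole C with input + input == C * input) can be sketched like this, with brute-force search standing in for both the synthesizer and the bounded verifier:

```python
# Minimal CEGIS sketch: the synthesis phase finds a C consistent with the
# counterexample inputs collected so far; the bounded verification phase
# searches for a new counterexample input.
def cegis(bound=100):
    inputs = [0]  # counterexample inputs collected so far
    while True:
        # Synthesis phase: any C that works on all collected inputs.
        C = next(c for c in range(bound)
                 if all(i + i == c * i for i in inputs))
        # Verification phase (bounded): look for an input where C fails.
        cex = next((i for i in range(bound) if i + i != C * i), None)
        if cex is None:
            return C  # verified for all bounded inputs
        inputs.append(cex)
```

Running it plays out the trace from the talk: C = 0 works for input 0, input 4 (or any nonzero input) is a counterexample, and the second synthesis round lands on C = 2, which verifies.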
So we can use this algorithm and give it this
problem that we had, and it would come up with
this thing: your program requires 15 changes.
And why did this happen? Because we didn't
really say anything about what kind of solutions
we want. We just said give me a fix, and it
found some random fix.
So ideally, we want minimal changes. We don't
want just any number of changes, because the
minimal change is what's going to correspond to
what the student had in mind. But minimization
is hard for SAT solvers. Typically solvers don't
support such a feature.
But we have alternate opportunities to solve this
problem. We could do binary search. We could
say let's try with some value; if it works, we
try half of it. So we do some iterations, but
this is very expensive.
We could also use MAX-SAT, but then again, as you
saw, there are too many calls happening to the
solvers, many [indiscernible] iterations. So
that also was very expensive.
So one thing that really worked well was what we
call incremental linear search, which tries to
reuse as much as possible of what the solver has
learned in previous iterations.
So let me give you the high-level idea of what
that is. Let's go back to our CEGIS loop, and
let's say I want to minimize some function f(x).
We can use the same algorithm. We get program P1,
and let's say the value of f(x) in the current
context is seven.
So what we do, we just say don't throw away
anything that you have done till now, so whatever
constraint you have learned in synthesis, keep
it. Whatever constraint you have learned in
verification, keep it.
And you put that state back, but you also add a
constraint that f(x) is now less than seven. The
nice thing that happens now is that in all the
later iterations, since the solver has learned so
many things, the next query becomes really fast,
and it would say, okay, I can find you another
program P2 where the value of f(x) is four. And
this way you keep asking it for smaller
solutions, trying to reuse as much as possible of
whatever you've done in the past.
At some point the synthesis phase is going to say
I can't find a solution for you anymore, and then
you know the minimum value of f(x) is four.
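The incremental linear search just described can be sketched as follows. This is an illustration, assuming a hypothetical incremental solver interface (constraints accumulate and are never dropped between queries); the "solver" here is simulated with a fixed list of candidate fixes and their costs.

```python
# Sketch of incremental linear search minimization: keep everything the
# solver has learned, repeatedly add f(x) < current best, stop on UNSAT.

candidates = [("fix_a", 7), ("fix_b", 4), ("fix_c", 4), ("fix_d", 9)]

def solve(constraints):
    """Stand-in for an incremental solver query: find any candidate fix
    satisfying every constraint accumulated so far."""
    for prog, cost in candidates:
        if all(con(prog, cost) for con in constraints):
            return prog, cost
    return None

def minimize():
    constraints = []        # learned constraints are kept, never dropped
    best = None
    while True:
        result = solve(constraints)
        if result is None:
            return best     # UNSAT: the previous solution was minimal
        prog, cost = result
        best = (prog, cost)
        # add f(x) < current value on top of everything learned so far
        constraints.append(lambda p, c, bound=cost: c < bound)

print(minimize())  # ('fix_b', 4): no candidate has cost below 4
```

Note the direction: each query tightens the bound downward, so earlier learned facts stay valid, which is exactly why going from high to low preserves the solver's work.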
So this is pretty interesting in the sense that
in practice we have seen the first iteration can
take quite a lot of time, but all the other
iterations are really, really fast. They take
milliseconds.
And as you can imagine, we went from high to low.
If you went from low to high, then you would
throw away whatever the solver has learned,
because those queries would be UNSAT.
So finally, this way we can solve these
minimization constraints, and now it's pretty
straightforward: you have a template for each
correction rule, and it gets translated to
natural language.
So now let me tell you briefly about some of the
evaluation we did with the system. We took lots
of benchmarks from both the classroom, the intro
programming class at MIT as well as the edX
class, over these types: integers, tuples, and
strings.
And for each of these problems we removed the
submissions which were correct, so our benchmark
set is the submissions that were incorrect. For
the classroom we got hundreds of them, and for
the edX class we got thousands of them.
And this graph shows how much time on average our
system took to generate feedback, when it could
generate feedback. Typically, on average it took
about ten seconds. And when you translate that to
a hundred thousand students on an Amazon cluster,
it comes out to be about $14. It used to be
$1,400 and the edX team was quite worried, but
now it's quite reasonable.
Okay.
>>: [inaudible] go down?
>> Rishabh Singh: Oh, just the optimizations,
yes. It used to take two minutes per problem.
Now it's taking ten seconds, yeah. Just the
optimizations.
>>: What was the reason for that, that linear
search?
>> Rishabh Singh: Oh, so the linear search was
also one, yeah, but also a lot of optimizations
in the encoding going to SAT before. So lots of
details there. Some of them are in the VM type,
but yeah.
>>: So what are you showing here in terms of
these bars?
>> Rishabh Singh: Oh, so these are the average
amount of time it took per problem to give
feedback. Yeah.
>>: Does that correlate with the number of lines
of code or --
>> Rishabh Singh: That's a good point, right.
So it's not clearly correlated, but at least this
problem is the biggest problem; it has 60 lines
of Python code. But some of the bigger problems
also take less time. So the complexity measure is
not just the lines of code but also how
comprehensive the error model is. And in some
cases it's more comprehensive than for other
problems. Yeah.
>>: Do you have some data on for each problem
that you are trying to correct, how many
possibilities did it generate, you know, by doing
that, adding that error model you were generating
a family of programs?
>> Rishabh Singh: Right, right.
>>: So what was the size of that family?
>> Rishabh Singh: Yeah. So I was telling you ten
to the 15 was the basic thing, and then it would
just grow. So the ten to the 15 comes from ten
lines, where every line has about 15 choices.
You have X goes to [indiscernible], and for X you
can do five transformations, for Y you can do
five transformations, and it lets you also have
an operator. So that way you get 15 to 20 per
line, and you have ten lines, so -- but, yeah.
So I think about ten to the 15 is the
approximate size. We didn't count it, but that's
the size we are looking at.
>>: I got the impression when you were describing
the application of the error model that you
construct explicitly the fixed point of applying
all those rewrite operations, but I guess you're
not doing that. So you somehow are trying to
encode that directly --
>> Rishabh Singh: You are just representing it
syntactically -- yeah, [indiscernible] is just a
syntactic representation of all choices, but when
you go to the solver, it has to explore all the
choices, right.
>>: So you literally do explore all ten to the
15 --
>> Rishabh Singh: Yeah.
>>: -- choices, right?
>> Rishabh Singh: Using the solver, yeah.
>>: So what was the [indiscernible] fixes for
this, like three fixes or four or --
>> Rishabh Singh: Right. So I don't have that
graph. It's in the paper, but the biggest class
was one and two changes. There were a few with
three and four as well, but the biggest class had
one and two changes.
>>: [inaudible] then the search space can become
quite small.
>> Rishabh Singh: That's a good point, right.
>>: And then --
>> Rishabh Singh: So if you start with that
assumption, yeah.
>>: So these are the final submitted programs?
I mean, how does --
>> Rishabh Singh: This is everything, yeah. So
this is even intermediate attempts, as well.
>>: Intermediate attempts as well.
>> Rishabh Singh: For the classroom, they are the
final ones, yeah. For the edX class, they are
everything.
>>: So if somebody just has a totally garbage
program, you could find and fix this to get it to
be a working program?
>> Rishabh Singh: I'll actually get into that,
yeah.
>>: Okay.
>> Rishabh Singh: But, yeah.
>>: And are you going to talk about the number
of times you couldn't find it?
>> Rishabh Singh: Yeah, yeah, yeah. After this,
yeah. So in total we ran it over 13,000
submissions and it was able to give feedback
about 64 percent of the time. So 36 percent of
the time it couldn't give feedback.
Let's go into more detail. So for every problem
this graph is showing the percentage of times the
system was able to give feedback. For some
problems it did really well: it gave feedback 80
to 90 percent of the time. For some it did okay,
about 50 percent, and for some it didn't do that
well, 30 to 40 percent of the time.
>>: Does that mean you weren't able to find a
rewrite?
>> Rishabh Singh: Yes, in the system. Not able
to fix the program, right. So we wanted to also
see what happened --
>>: You mean it didn't have the right set of
rules for fixing the errors?
>> Rishabh Singh: Yeah, exactly. So we wanted
to see what happened, yeah. So one thing we were
doing was this was all manual, so we would use
the error model, run it on the system, and then
check what happened and what could we add to the
error model to fix them.
So it's a very manual process. But the reason we
couldn't go much farther than that for some
problems was that this was the first offering of
the edX class, and people were not used to it, so
many times they would just submit an empty body.
They just wanted to see what the system was
doing.
So a large percentage was just empty solutions,
and there's no way you can rewrite it to make it
correct.
There were a few where students were using
features which they were not supposed to use,
outside the scope we were handling -- things like
classes -- and some of them timed out, but there
were very few. The larger one was the
compute-balance one.
>>: What was the timeout value you were using?
>> Rishabh Singh: This was five minutes in our
case.
>>: So a question about the rewrite. So this
was a kind of line by line, is it -- so the
system cannot insert new lines into the program?
>> Rishabh Singh: That's a very good point.
Yeah, so in general it can do it -- it's actually
Turing-complete to rewrite, because you can write
a program fragment.
So for some problems we did actually write rules
which could introduce a statement. For example, a
transformation to swap two values inside sorting:
students would write a[i] = a[j], a[j] = a[i],
and the only way to fix it was to introduce a
temp. So you can write specific things like that,
but in general we don't do it, because then the
space becomes too big.
So if you know the problem, then you can do it.
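The swap bug just mentioned, written out in Python (a reconstruction of the bug pattern described, not an actual student submission):

```python
# Buggy "swap": a[i] = a[j]; a[j] = a[i] overwrites a[i] before it is
# read back, so both slots end up holding the same value.

a = [5, 3]
i, j = 0, 1

a[i] = a[j]     # the original a[i] is lost here
a[j] = a[i]
print(a)        # [3, 3], not the intended [3, 5]

# The fix the correction rule has to introduce: a temporary variable.
a = [5, 3]
tmp = a[i]
a[i] = a[j]
a[j] = tmp
print(a)        # [3, 5]
```

(In idiomatic Python one would write `a[i], a[j] = a[j], a[i]`, but the temp form is the statement-insertion fix the talk describes.)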
The most interesting class was this one, where we
found the student actually tried to do something
reasonable, but the system still couldn't fix it.
We call them big conceptual errors. For example,
one class of error was misunderstanding the
Python API. This function is asked to take a
polynomial and a value and evaluate the
polynomial at that given value. This works really
well. It's completely fine.
The only problem is students were using this
function index, which is supposed to take a value
and a list and return the index of the value in
the list, and it works fine if your list is
distinct, but as soon as you have duplicates, it
will always return the first occurrence of the
value. It will never return the second or third
occurrence.
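The failure mode is easy to see concretely. The student code below is a reconstruction (the exact submission isn't shown in the talk): it tries to recover each coefficient's exponent with `index()`, which silently breaks as soon as the coefficient list contains duplicates.

```python
# list.index always returns the FIRST occurrence of a value, so using it
# to recover a coefficient's position (its exponent) fails on duplicates.

def eval_poly(coeffs, x):
    # coeffs.index(c) is meant to be c's exponent, but for a duplicated
    # coefficient it always returns the first position.
    return sum(c * x ** coeffs.index(c) for c in coeffs)

print(eval_poly([1, 2, 3], 2))   # 17: works while coefficients are distinct
print(eval_poly([7, 2, 7], 2))   # 18, but 7 + 2x + 7x^2 at x = 2 is 39
```

No rewrite over the same API can repair this, which is why the system can't make it equivalent to the reference.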
And it just so happens that, since our system is
doing exhaustive checking, there's no way to fix
this program, because there does not exist any
API call that would do the right thing. So
there's no way we can rewrite this program.
Similarly, in another class of failures, students
misunderstood the problem statement. Even though
the problem is asking them to do one thing, they
were doing something else. And then again it just
so happens that you can't make it functionally
equivalent to what the teacher had written. But
we have some other techniques to give feedback on
such assignments now; we're building on top of
it.
So now -- we built this system for Python,
because that was the data we had, both from the
classroom and edX. We also had a small variant of
it for C#, because we were able to get some data
from Nikolai and Peli from Pex for Fun, but it's
not as robust as the Python one.
But after the system was out, there were a lot of
universities who wanted to use it. The only
problem was that everybody was teaching the
course in a different language. But that was good
again, because it forced us to come up with an
architecture that makes it easier for us to add
new languages on top of it, and we are using it
to build a system for a language called JSIM,
which is a hardware description language at MIT,
to give feedback. And also there's a startup,
which is no longer a startup -- it has many
people now, about 300 -- but still, they're
building on top of our system to make a similar
grading system for C and Java. Their motive is
somewhat different, but hopefully they'll use it
for grading as well.
So last semester we also used the system for the
classroom. We gave it to the TAs and told them,
if you want, you can run it on your students'
programs, and it did pretty well on the
assignments we created: in almost 90 percent of
the cases the system was able to give feedback.
In one of the cases a TA emailed us that the
system found a certain problem she didn't realize
was a common mistake, and she went into the
classroom the next day and told the class about
it.
There was also a problem where the system didn't
do that well, and the reason was just that this
problem was supposed to teach students about
performance: it didn't matter what input you give
to the program, it would always take 16,000 loop
iterations, and it was also working with
floating-point numbers, which are hard for
solvers to analyze. But now we are combining
dynamic analysis with the purely symbolic one --
we call it concolic synthesis -- to handle such
cases: to be a little incomplete, but still be
able to give some feedback.
The edX team is actually finally now very
excited, as well, and they are currently
designing a study based on different feedback
levels and also when to show the hints. So there
are two main questions: when do you want to show
the hints, and at what level do you want to show
the hints. Depending on many parameters, we are
going to figure that out during this study.
So some of the takeaways from the system were
that we can actually frame grading, or feedback
generation, efficiently as a synthesis problem,
and this idea that we can just write a compiler
to get a synthesis system for a new language was
quite interesting. Even though it took a lot of
work, we're really designing a common platform,
so hopefully in the future it won't take that
much work to come up with a synthesis system for
a new language.
And finally, even though in this case the
specification was complete, it was easy for
teachers to write. They didn't have to learn a
new language, and they were doing this for the
classroom anyway: they were writing a reference
implementation. That's the only thing they had to
do.
So in this system the specification was complete,
but now let me go to the next system, where the
specification is going to be very, very
incomplete. Yeah.
>>: Couldn't students try to cheat this system --
I mean, give inputs that make the system believe
it's correct when it's not? You've got to handle
that, right?
>> Rishabh Singh: That's a good point, right. So
a student could just have an if-then-else
statement and say, if this is the input, return
this. Right, right. Since we are only
checking --
>>: [inaudible] exploiting some implementation of
the tool, right?
>>: Why would the student do that?
>>: The student --
>>: It may be easier for the student to just
solve it right now rather than try --
[inaudible].
>>: Sometimes it's common knowledge: oh, if you
take [inaudible] -- the system is going to
believe --
>> Rishabh Singh: Yeah, that's actually a good
point. Yeah, so the thing is --
>>: There is not a very easy way to fool it and
[inaudible].
>> Rishabh Singh: Yeah. The thing is, we try to
be sound in the sense that only if the system
solves it do we give any feedback. Otherwise, we
give up and fall back to testing. It's not the
only check: students still have to go through the
test cases, whatever the teacher has written. So
sometimes.
But yeah, I was also talking to the edX team, and
they were saying that if students are trying to
fool the system, it's their loss, not so much
ours. They are robbing themselves of learning
opportunities. But yeah.
>>: [inaudible] it's kind of related, but asking
it, doing it from the teacher's side, so I know
that the teacher probably needs to write the
reference solution anyway, but can you just give
some examples where he --
>> Rishabh Singh: Yeah, exactly right. So the
thing is, the system I showed you, Sketch, it
doesn't always have to be a reference
implementation. You can use assertions, yeah.
It's just that it's going to be less complete
than having a complete solution, but if your
assertions can cover all possible cases, then,
yeah, it can support that.
>>: It has a set of -- small set of test cases,
right?
>> Rishabh Singh: Right.
>>: So -- then you couldn't have a solution --
>> Rishabh Singh: Those could be assertions as
well, right. It's just that we were trying to
explore all possible inputs up to a bounded
space, yeah.
>>: Can I ask you a question?
>> Rishabh Singh: Yeah.
>>: Why is the classical method of getting
feedback not good enough? I think you started by
talking about it and I sort of lost track of it.
>> Rishabh Singh: Yeah, yeah, yeah.
>>: We know that classical method is there are
some inputs on which I know the expected output
and I run my program on that input, and if it
doesn't give it, then I step through it.
>> Rishabh Singh: Yeah, yeah, yeah. Yeah,
that's a good point. And some people actually do
it, learn that way. I also learned that way.
Many people do that.
But at least from the data we were seeing on edX,
and also in the classroom, some students are
taking this class as the first class in their
programming career, and they're also not computer
science students. They just want to learn
programming. And something I realized is that
debugging is a skill that has to be taught. You
can't just tell students: this is a failing test
case, go take a pen and pencil and figure out
what's happening.
So this is something that has to be taught, and
this system is just one way to accelerate that.
Maybe the teacher can use it, or maybe students
can themselves learn stepping through, using this
more guided feedback.
>>: [inaudible] assertion like --
>> Rishabh Singh: Yeah, so that is a very open
question right now. There hasn't been any
usability study in that sense. So we've only
used it for grading in classroom, but nothing for
teaching purposes.
>>: No, it just would be interesting to see, you
know, [inaudible] that same question of like, you
know, like, I mean, you can unfold this in
multiple ways. You could say like a teacher
could spend X amount of time writing rules or,
you know, generating rules, or maybe they write
more test cases, right? And then you could kind
of compare what's the student's reaction and, you
know, the amount of pain or, you know --
>> Rishabh Singh: Right, right.
>>: -- how much success they face under those
different --
>> Rishabh Singh: Right, so the A/B test study I
was telling you about before on edX is supposed
to do that. There's going to be a basic case
which just gives test cases, and there are going
to be different feedback levels. You can't just
measure directly because you don't have access to
the students, but you can see retention rate: did
they solve the problem, did they solve more
problems.
So those things you can measure.
>>: Well, I think you can't just test against
the original set of test cases because there is
more effort from the teachers going into writing
the rules.
>> Rishabh Singh: Oh, so actually, I will tell
you how you can do that automatically. It was
just because we were building the system in the
beginning that we had to write the rules by hand,
but these rules can be learned automatically,
yeah.
>>: So [inaudible] what you're doing relative to
folks that do fault localization?
>> Rishabh Singh: That's a good point. So the
question is how we relate to fault localization.
I think these are complementary techniques. Right
now we're not doing any kind of fault
localization, but if you were able to do fault
localization, you could be much more precise when
you apply the error correction rules.
Right now we are applying them everywhere in the
program, but it would scale better if you could
actually localize the fault and only apply them
there. And typically -- so there has been some
work on fault localization for program repair,
but they typically do it for just one test case
at a time, because it's hard to consider all the
test cases together, and that's what synthesis is
doing for you: it's considering all the test
cases. People typically do it for a single input,
yeah. But, yeah, those are complementary
techniques.
>>: You see them -- [inaudible]. Do you provide
the teacher with kind of a probabilistic view of
where the error might [inaudible], or
[inaudible] feedback from applying corrections a
bunch of times [inaudible]. So you mention --
>> Rishabh Singh: Oh, right, right. So some
visualization afterwards, when you have done all
the analysis, yeah. Not yet, actually, but that's
also an interesting thing to do. We are doing it
more at the before level, not so much at the
after level, but yeah, you could do some analysis
afterwards to aggregate common mistakes, right,
right. Yeah.
>>:
So how much is left to do in this space?
>> Rishabh Singh: Oh, actually, a lot. I'll talk
about that later. So one thing is just
scalability: right now it only scales to about 30
to 40 lines of Python code, so if you want to go
to the next step, scalability is one issue. But
also a lot of teachers were expressing the
concern that you want to give feedback not just
on correctness but also on style, the design of
the program, the modularity, all the other
aspects -- things like how people decompose the
problem, how they name the variables. So all
these other things.
So that's one thing to do here. And the other
thing is just studying how this helps students
learn programming in a teaching setting, not so
much in just an analysis setting.
So let me go quickly, because otherwise we might
run over time. This is the second system,
FlashFill. The motivation for the system came
from looking at lots of help forums for Excel end
users -- people who don't know programming but
struggle with getting tasks done.
And what they typically do, they pose the problem
on a help forum saying that this is my data and I
want it to look like this. Can somebody help me.
So some expert goes in and says, okay, yes, I see
what you're doing. Try this formula. It will
work.
So these people go back, paste this formula in
the spreadsheet, and figure out that it mostly
works, but in some cases it doesn't, and they
give a new input, and then the expert says, okay,
now I see what's happening, so this is what you
wanted. And they go back, and after some
iterations, the people are happy. They say thanks
a lot.
And this was the process we wanted to automate in
the system. Instead of taking days or weeks, we
want to do it in seconds. For example, we can
use the same system now, give the same data and
the system will learn the program for them
without having to go to a forum.
So this system was started by Sumit here, on
string transformations, and I soon joined the
project and worked on extending the system to
handle more sophisticated things that Excel
wanted us to do, and also worked on the problem
of ranking, which I'll get into later. But that
was also a pretty important thing to do.
And then we worked on this problem, and together
with Shobana [phonetic], Sumit, Dany, and Ben,
and also people from the Excel team, we were able
to work out all the other aspects of the system,
and a part of the string system was shipped in
Excel 2013. Oops, again.
So, and then we extended the system also to not
just handle strings but also to be able to learn
lookup transformations from examples, joins and
lookups, and also for number transformations. So
let me show you some examples of what the system
can do. I think you guys probably pretty much
know what it does, so let me show some of the
interesting things.
For example, let's say you have data in your
spreadsheet and you want to extract a certain
part of it. As you can imagine, there are going
to be many, many possible regular expressions to
get the city out of this address, but we can just
give one example, and the system learns an
expression that is likely to be the one the user
had in mind.
So this is where ranking comes in: there are
multiple regular expressions -- for example, one
regular expression would be "the sixth string
starting with a capital letter, followed by the
second comma," something like that -- but it's
still going to rank them and pick the one which
is more likely.
The one I use it for is this. So let's imagine
you are -- oops -- let's imagine you are
traveling for faculty [indiscernible] and you
have a paper deadline, and this is the data that
my system produces, and you want to put it in the
paper in LaTeX format. So I just write one
example and let the system figure it out, and it
basically gives me a table that I can go and
paste into my Emacs. The nice thing here is that
if I want to swap the columns, I don't have to go
over it line by line; I can just say make it two
and make it four, and I get a new table with the
columns swapped. So this is the thing I use it
for.
So let me tell you the other thing which --
>>: What happens [inaudible] --
>> Rishabh Singh: So I just updated the example,
the first example. So it just learned that I
want four to be the second column and two to be
the third column. So it learned a different
program and ran it on the spreadsheet.
So this is the system for learning lookup tables.
This is something which is now with the Excel
team, and hopefully it will be part of some
future release. The idea here was, let's say
there is a shopkeeper who has two tables: one
that maps how much profit the person should make
on each item, and one with the cost at which the
shopkeeper purchased each item. And -- oops, it's
not showing everything.
So the idea was, given just the item name and the
date, compute the price. And the challenge here
is that just given "stroller," I can't compute
the price, because first of all I have to find
the item ID, then do a join of these two tables,
then get the cost.
And similarly, after getting the cost and markup,
I have to do some string transformations: I have
to remove the percent sign and add the plus, the
multiplication, the semicolon, things like that.
But in this system, we can just give one example
and let the system figure out the program, and it
learns the program to do the appropriate joins,
lookups, and string transformations and does the
task for us.
>>: Who [indiscernible] the sizes of the actual
programs that you --
>> Rishabh Singh: That's a good point, yeah.
So let me show you the program that was learned
here. It says basically: concatenate a few
expressions, and each one of them has some
lookups or some regular expressions. So it looks
quite complicated, just because it was not meant
to be readable; it was more meant to make these
programs easy to learn.
So let me show you some of the key ideas. There
are two main ideas in actually all of the systems
we built. The first idea is how you design your
logic, the domain-specific language, such that
you can divide the task of learning into
independent subtasks. So that's the first key
idea: how much independence can you have.
And the second key idea is how you rank them:
once you have so many hypotheses, which one do
you prefer. For example, this is the output I
showed you in the demo. This is the string
somebody wanted. So let's see what kind of
independence we can get if we want to learn this
output.
So we can chunk this string into many small
substrings and look at one substring at a time.
We will have some structure for each index in the
string, and we can look at one chunk at a time:
all expressions to learn the output from position
zero to seven are going to go on that edge. I'm
going to put all those expressions there, but
they're going to be independent from the programs
that I'm going to learn for the other substrings.
So you get substring-level independence. For
every substring, you will have different
programs, but the programs are independent of
each other. You don't have to worry about how I
computed this substring when computing the
program for another substring. So that's one
independence.
The second independence is going to come from the
lookup transformation language we developed. For
example, let's say I have these two tables and I
want to do a join to get the price of an item.
I'm not going to do it the SQL way, where I take
the cross product and do a projection, because
that's not learnable. What's efficiently
learnable is a nested join, where the idea is:
you select the price from this table such that
the item ID is equal to the selection of the item
ID from the first table.
So this way you're doing a nested join, and the
nice thing about doing this is that the
subexpression now becomes independent of the
top-level expression. The top-level expression
only needs to know that it gets an item ID; it
doesn't care how you compute it. Even though you
may have a thousand ways to compute it, the top
level is not going to care about that. So this
way you again get independence of subexpressions.
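The nested-join idea can be sketched in Python with made-up tables (the item names, IDs, and prices are illustrative, loosely following the shopkeeper demo):

```python
# Nested join: instead of a SQL cross product plus projection, the price
# is selected by a lookup whose key is itself the result of an inner
# lookup. Each level only consumes the value produced by the level below.

items = {"stroller": "I1", "bib": "I2"}       # name -> item ID (table 1)
costs = {"I1": "$145.67", "I2": "$3.56"}      # item ID -> cost (table 2)

def lookup_cost(name):
    item_id = items[name]      # inner select: item ID from the first table
    return costs[item_id]      # outer select: cost keyed by that item ID

print(lookup_cost("stroller"))  # $145.67
```

The outer lookup never needs to know how `item_id` was computed, which is exactly the independence of subexpressions described above.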
And finally, the third independence comes from
learning substring programs. For example, I want
to remove the dollar sign from the cost I got
from the table, and the idea is that we're going
to learn programs to get one and seven, the left
and right index, but they're, again, going to be
independent of each other. It doesn't matter how
I compute one; it's independent of how I compute
seven. So the regular expressions I learn to get
one are going to be independent of the regular
expressions to get seven.
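That left/right independence can be illustrated concretely. The "position programs" below are simplified stand-ins for the actual regex-based position expressions (constants plus match boundaries of a couple of token patterns), enumerated once and then filtered independently for each end:

```python
import re

# For s = "$145.67", stripping the dollar sign means extracting s[1:7],
# so we need candidate programs that evaluate to position 1 (left) and
# position 7 (right) -- learned independently of each other.

s = "$145.67"

def candidate_positions(s):
    """All (description, position) pairs from a few toy position programs."""
    cands = [(f"const({k})", k) for k in range(len(s) + 1)]
    for name, pat in [("number", r"\d+(\.\d+)?"), ("dollar", r"\$")]:
        for m in re.finditer(pat, s):
            cands.append((f"start_of({name})", m.start()))
            cands.append((f"end_of({name})", m.end()))
    return cands

left  = [d for d, pos in candidate_positions(s) if pos == 1]
right = [d for d, pos in candidate_positions(s) if pos == 7]
print(left)    # ['const(1)', 'start_of(number)', 'end_of(dollar)']
print(right)   # ['const(7)', 'end_of(number)']
print(s[1:7])  # 145.67
```

Any program from `left` pairs with any program from `right`, so the set of substring programs is the cross product of two independently learned sets -- which is how a huge program space fits in a compact representation.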
So we have all this independence, and it lets us
learn a huge space of programs: if you do a rough
calculation, it comes to ten to the 20 different
programs. You can represent them in polynomial
space as well as learn them in polynomial time.
That's the good thing you get from such
independence. If you know your domain well and
you design a language for it, then you can do
this much better.
But the problem is, once you have so many
programs, which one do you show back to the user,
because there are too many. For example, let's
say this was a task somebody wanted to do: I want
to add "Mr." in front of all names, and I want to
keep the first name. So I give one example; it
becomes "Mr. Rick." And the thing is, if I learn
the wrong program -- because "R" can come from
many different places: "R" can be a constant, but
it can also come from the first name Rick or from
the last name Rashid -- if it learns one of those
programs, we might actually get something like
this, which can be not good, right?
So, yeah. So the thing is, we have to make a
choice. Whenever we have many different options,
we have to make a choice. For example, you could
say: since "R" is such a small string, I'm always
going to make it a constant. So let's say that's
the choice we make. But then in the future we get
somebody else doing this, and again in this case
"S" is a small substring, but I can't make it a
constant; it has to come from the input.
So that's where the challenge comes in: when you
have many choices, in some contexts you're going
to prefer one over the other. And you're also
going to have many regular expressions to get the
last name -- for example, it could be the second
word, the last word, or a thousand other regular
expressions.
And as you can imagine, if you just pick "second
word," that's not going to be appropriate for
this task, because if you have a person with a
middle name, it would get the middle name. So in
this case the preferred one is actually "last
word."
So the idea is, how do you solve this problem
when you have so many choices -- what do you do?
There is that quote we got from Spider-Man, which
says: with great power comes great
responsibility. You have made the language so
expressive, there are so many choices; now you
have to do the right thing. And that's where we
use machine learning.
So the idea is we're going to divide the task
into two phases, training and testing. In the
training phase, we'll have a bunch of benchmarks,
and for every benchmark I'll give lots of
examples. In the training phase I don't care how
many examples I give; I give as many examples as
possible to make the task unambiguous.
So in this case, I would learn programs for "R,"
and some of them are going to be good, some of
them are going to be bad. The good ones I'm going
to label positive; the bad ones are going to be
negative. And the goal is to learn a ranking
function such that in the future, when you see
the same kind of task and learn the same
programs, it ranks the positive ones higher than
the negative ones. So that's the task: give me a
ranking function such that in the future I will
prefer positive programs over negative ones.
And this is actually a task which typically
people do in search. There's a whole field
called learning to rank, and the only difference
is that typically people solve the problem where
all good results must come before all bad
results. But in this case, we say if any good
result comes before all bad results, I'm happy
with it.
So the interesting thing here is how do you come
up with a loss function to optimize. So here we
are saying: give me the rank of the highest
negative program and give me the rank of the
highest positive program. I want this difference
to be negative, and since this is a loss
function, it's going to be the negative of that.
So this is the function we optimize to learn such
a ranking function. But this function is highly
discontinuous, not even differentiable, so we
smooth it up a little bit, and this is what we
use for regression to try to learn the function.
So here are some of the results showing --
>>: How do you learn a function like that?
What's the technique you use?
>> Rishabh Singh: Yes. So the idea is that for
every expression you're going to have features.
The function we are going to assume is a linear
combination of those features. So the challenge
is how do you learn the coefficients of the
function, and for that you do a regression over
your training set. You say, on the training set
I want to optimize this loss, so learn a function
for that.
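The ranking objective just described can be sketched in code. This is a hedged illustration, not the actual implementation: the toy features, the log-sum-exp smoothing of the max, and the numerical-gradient training loop are all assumptions made to keep the sketch self-contained.

```python
import numpy as np

def rank_loss(w, pos_feats, neg_feats, beta=10.0):
    """Smoothed version of the objective in the talk: the top-ranked
    positive program should score above the top-ranked negative one.
    The hard loss is max(negative scores) - max(positive scores);
    each max is smoothed with log-sum-exp so the loss is differentiable."""
    pos_scores = pos_feats @ w  # scoring function: linear in the features
    neg_scores = neg_feats @ w
    smooth_max = lambda s: np.log(np.sum(np.exp(beta * s))) / beta
    return smooth_max(neg_scores) - smooth_max(pos_scores)

def learn_weights(tasks, dim, lr=0.1, steps=200):
    """Gradient descent on the summed smoothed loss over all training
    tasks; each task is a (positive_features, negative_features) pair.
    A numerical gradient keeps the sketch dependency-free."""
    w = np.zeros(dim)
    eps = 1e-5
    for _ in range(steps):
        grad = np.zeros(dim)
        for pos, neg in tasks:
            for i in range(dim):
                wp, wm = w.copy(), w.copy()
                wp[i] += eps
                wm[i] -= eps
                grad[i] += (rank_loss(wp, pos, neg) -
                            rank_loss(wm, pos, neg)) / (2 * eps)
        w -= lr * grad
    return w
```

On a toy task where positive programs have feature vector [1, 0] and negative ones [0, 1], the learned weights end up ranking every positive program above every negative one.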
>>: [inaudible] this length --
>> Rishabh Singh: Yeah, so that's a good point.
So I showed you three levels of independence. So
you have different features at different
independence level. So one of the features are
going to be the length of the substring, the
relative length, the context, what happens to the
left-hand side, right-hand side. So you have
different features at different levels.
>>: Are the features that these -- I mean, a
list of all, you know, one possible set of
features could be every possible expression that
you could use in a program, but of course, that
would be too large --
>> Rishabh Singh: Yeah, so we use frequencies,
yeah. So, yeah, so it's --
>>: [inaudible] the most frequent --
>> Rishabh Singh: Yeah, most frequent, less
frequent, right. So we do some abstractions,
right. Yeah.
>>: The one question I have is, in the
[indiscernible], the methodology was supposed to
make writing this program simple.
>> Rishabh Singh: Yeah.
>>: But now, you know, if you actually have to
do all this feature design and give all these
examples, that's kind of taking [inaudible] from
the simplicity of writing --
>> Rishabh Singh: So the thing is, this task is
only supposed to be done by us, not by users. So
users are still going to use the system as if
they're just providing examples. You could say
the developer has to do a little more work now
to make the system more robust and better at
reading what users have in mind.
>>: Sometimes, you know, the [indiscernible]
Developer, right? You had a task to do, you
have, you know, Rick Rashid, Satya Nadella and
you want to do this, right?
>>: The developer was the person who implemented
the system, yeah.
>> Rishabh Singh: Who implemented the system,
yeah.
>>: The developer who implemented FlashFill
feature.
>>: So to be able to have these kinds of modules,
your learning modules for various domains and one
for, you know --
>> Rishabh Singh: Yeah, for exactly that. So
the hope is -- so the thing is that the learning
has to be done once, but in the test phase you
just use the function. So all this done offline.
Once you have the function, you just -- when
you're learning the program, you apply the
function and you rank them. So when you're
running the system, there's no overhead. Very
minimal overhead. But yeah, it would be done
offline. So let's say at Microsoft somebody
would do this.
>>: But I think the question is that the learned
ranking -- right? -- might be different depending
on the domain --
>> Rishabh Singh: Oh, yes, yes. So --
>>: -- working.
>>: So if for every task you have to learn this,
then it's too much of an [inaudible], but you're
saying you don't have to learn it for every task
that you do.
>> Rishabh Singh: Yeah, yeah. You have to learn
it over a set of tasks.
>>: Domains.
>> Rishabh Singh: Yeah.
>>: [inaudible] what is it that identifies a
domain?
>> Rishabh Singh: So right now we are doing it
just for strings, string transformations. So it
would say any kind of string transformation.
>>: [inaudible] you know, you can have names and
addresses, which are very different types of
strings. The type of features that you have for
names might be very different from that.
>> Rishabh Singh: Yeah, exactly. So, yeah. So
right now we're not modeling any semantic
information in this work, yeah. This was just
for learning regular expressions, yeah. But
that's a good point, that if you know what your
data is, you can do much better job at ranking.
>>:
[inaudible] Excel logs and stuff like that.
>> Rishabh Singh: Excel?
>>: You exploit the [indiscernible] the features
[inaudible] you exploit like [indiscernible].
>> Rishabh Singh: Yeah.
>>: Excel user --
>> Rishabh Singh: Data.
>>: -- data.
>> Rishabh Singh: Yeah. So right now we just get it
from -- so people have uploaded videos on
YouTube, so we try to get as much data from there
and also from blogs, so people writing lots of
blogs.
>>: [inaudible].
>> Rishabh Singh: So for this we used 300
benchmarks -- yeah, so like 300. So actually,
yeah, these are the results. Or maybe 170,
something like that. We used 50 programs for
training and 120 for testing.
And this graph is showing three different
approaches to rank. The blue one is the one
which was the very first thing we came up with,
which was Occam's razor which said learn the
simplest program that you can learn that's going
to correspond to what user has in mind.
The orange one is the one which we spent almost
four to five months working with the Excel team.
They would give us new benchmarks and we would
tweak the parameters. They would give us more
benchmarks. We would manually do it. As you can
see, it did much better than Occam's razor. For
most benchmarks, about 50, it could do with one
example. Quite a few would need even more.
So it was not that bad, but even a very simple
machine learning based technique we found
actually did much, much better. So for about 80
percent of the benchmarks it was able to learn
from one example after learning the ranking
function.
So let me try to put it back into context -- oh,
you had a question? Oh, thanks. So in this case
our specification mechanism is going to be
input-output examples. The hypothesis space is
the interesting thing: how you design your
domain-specific language to have independence.
And finally, the algorithm is going to learn all
programs in the language in polynomial time and
then rank them.
And the biggest difference from all the -- there
has been a lot of work in [indiscernible] text
by example, but the biggest difference here is
the language-based approach, where you have a
completeness guarantee as well as a much richer
language for this particular task. And actually,
as I was saying, the system was shipped in Excel
2013, and there's been, at least initially, an
encouraging response. A lot of people writing
good things about it.
And currently we are now working on actually, as
you were mentioning, trying to model more
semantic knowledge. So if you know something is
a date, if you know something is a name,
something is an address, you can do much better,
so semantic knowledge, as well as making the
system more probabilistic.
So right now, if you make a mistake in giving an
input, it would just say I can't learn a program
for you. But can you tolerate some noise and
make the synthesis algorithm a little more
probabilistic?
So some of the takeaways here are that partial
specifications, like input-output examples, can
be really useful for end users.
takeaway was if you know your domain well enough,
you can design a language which can be both
expressive and learnable.
And the interesting idea finally was we didn't
use machine learning to learn these programs, but
we actually used it to bootstrap the synthesis
algorithm, which was also interesting.
So now let me tell you very quickly some of the
things I want to do in future. I'm very excited
about using synthesis for education and users and
programmers and both from theory and practice
side. From the theoretical side, there's been a
lot of work in inductive inference and machine
learning going back to Gold's and Angluin's work.
And the thing was they have done it for a very
small subset of languages, like regular languages
and context-free languages, but here we are
talking about much richer class of languages. So
I want to see what kind of relationship there
exists, and can we actually exploit some of the
things they have learned in the past, and also
the other way, how it corresponds to their work.
From the practice side, I just showed you a very
small part of Excel, but there are many more
things that could be automated -- repetitive
tasks users have to do -- and not just in Excel.
Actually, there are many things which I do
routinely on PowerPoint. Even actually for this
one, if I have to change fonts, I have to change
margins, there's something called page -- I
forgot. Something where you can set for one
master slide, I guess.
>>:
[inaudible].
>> Rishabh Singh: But I still can't use it
because it's too complicated. So is there a
natural way for us to make a system where you do
one or two changes and let the system figure out
the changes for you. Similarly for Word
documents, if you want to do more complex
searches, you still have to learn regular
expressions. So can you automate that.
Finally, we are also getting new devices now --
almost everybody has a smartphone now, and
people are saying in five years everybody will
have a robot. So what is a more natural way
for us to specify tasks to these systems, where
people don't need to know programming to get
simple tasks done?
For example, let's say I want my robot to clean
my room, bedroom, but the thing is, everybody's
bedroom is going to be different and everybody's
definition of "cleaning" is also going to be
different. So how do you express that intent and
figure out a program or something for the system
to do it for you.
So that's very interesting.
For programmers, I've been looking at trying to
come up with a language where such a synthesis
construct would be first class -- a seamless
language where you can let programmers write
examples. Maybe for some tasks writing test
cases is better; for some tasks, writing a
complete spec is better. So you have all these
different ways to specify your intent, and a
very efficient runtime system to actually
compile it and efficiently run the whole program
when you have all these different classes of
specifications.
And finally, for education, one thing I'm really
excited about is the data we're getting right
now. So that's the only thing that has changed.
People have been asking me what MOOCs are
actually doing now that we couldn't do ten years
back, and the interesting thing is we are able
to capture so much data that we didn't have
before, and that is something we could really
use now.
So some of the things I'm excited about and also
doing a little bit of work on, has been
clustering solutions: we use clustering
techniques, more probabilistic techniques, to
cluster submissions to assignments, to do power
grading, to give feedback on alternative
approaches to solving the same problem, and also
for teachers to know, as you were saying, what's
happening in the classroom. If I have a hundred
thousand
submissions, I have to go through each one of
them, there's no way I can figure out, but if you
can give a bird's eye view of what's happening in
the classroom, it's going to be much useful.
And finally, yeah, right now the error models we
had to write by hand, but since you have
time-stamped data, you know what was run at time
i and at time i plus one, so you can diff the
two programs and see what changed. Many times
students themselves fix the mistakes, so that
way you can do a frequency measure of what the
common changes are and build an automated error
model from them.
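The idea of mining error models from time-stamped submissions can be sketched like this. This is a hypothetical illustration using Python's difflib; the actual system and its data format are not described in the talk.

```python
import difflib
from collections import Counter

def common_fixes(submission_history):
    """Given each student's time-stamped sequence of programs, diff
    consecutive versions and count how often each (removed, added)
    line pair occurs. Frequent pairs are candidate rewrite rules for
    an automated error model."""
    fixes = Counter()
    for versions in submission_history:  # one list of versions per student
        for before, after in zip(versions, versions[1:]):
            before_lines = before.splitlines()
            after_lines = after.splitlines()
            sm = difflib.SequenceMatcher(None, before_lines, after_lines)
            for op, i1, i2, j1, j2 in sm.get_opcodes():
                if op == "replace":  # lines the student rewrote
                    removed = "\n".join(before_lines[i1:i2])
                    added = "\n".join(after_lines[j1:j2])
                    fixes[(removed, added)] += 1
    return fixes
```

For example, if two students both change `range(1, n)` to `range(1, n + 1)` between submissions, that fix gets counted twice, suggesting an off-by-one rewrite rule for the error model.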
So finally, to conclude, let's divide the room
into three groups. At the top we have all the
good people who know programming. Then we have
the students, who are an order of magnitude more
in number; and finally, we have end users, who
are yet another order of magnitude more in this
world.
And I've shown you a system, Autograder, which is
aimed for helping students. Storyboard
programming I didn't have time, but is also aimed
for helping people, students learn data
structures. And FlashFill was more for end
users.
So I want to end my talk with the thought that a
lot of work in our community in program analysis
and verification is focused on programmers, which
is great. We want to make their lives easier,
but a lot of techniques actually equally apply to
even these two broader classes of users, and we
can potentially have even bigger impact than what
we have had on programmers. And with this
thought, I would like to end my talk. Thanks.
[applause]
>> Rishabh Singh:
Yes.
>>: So you mentioned that in the 1980s the
specifications were much larger than the
programmers. Of course, there may be tricky
programs --
>> Rishabh Singh: Yes.
>>: -- that's why they [inaudible] some
description, but in the '70s and '60s, before XD,
they were learning some examples --
>> Rishabh Singh: Yes, actually, yeah. So I was
talking to you --
>>: So do you think that now that we can do
learning from examples much better is the time to
return to synthesizing things from specifications
and get, I mean, get full correctness? Can we do
that much better now?
>> Rishabh Singh: Hmm. Than before?
>>: If the '80s was too early for a lot of those
things, right?
>> Rishabh Singh: Right. I think -- yeah, I
think the problem was -- yeah, so definitely our
theorem provers have gotten better, so a lot of
work was done in the '80s on deductive synthesis. So
the idea was given a program, can I take -- can I
generate a proof for a condition and try to
rewrite the proof to -- from one specification
language to an implementation.
So definitely our theorem provers have gotten
better, our algorithms have gotten better, so we
can scale to much bigger programs. But I think
one issue there was -- the other issue was
orthogonal, that it was hard for people to write
complete specification. I think that was a
bigger issue in that sense, which I'm still not
sure if people are willing to do that, but maybe
if you have better ways to specify, maybe
different specification languages, then maybe we
could do that. But, yeah, but I think there was
the other issue, as well, how much programmers
are willing to write these complete specs.
>>: Just a question for my edification. What's
the difference between deductive synthesis and
inductive synthesis?
>> Rishabh Singh: So in deductive you have
everything with you, so you have a complete
specification. So at any point of proof search,
if you stop, you have some completeness
guarantees.
In inductive the idea is you don't have complete
information. You start from few information
about here and there. Some examples, let's say,
or demonstrations. So that's more inductive.
You try to generalize from small amounts of
information. In deductive you have complete
information. You're just transforming it into
another language.
>>: Do you have plan or thoughts to use like
existing code, like, say a prerequisite and then
you have programming in some form already there
[indiscernible] and much better and pointed at
code written by human which is probably already
[indiscernible].
>> Rishabh Singh: Yeah.
>>: Then more or less [indiscernible].
>> Rishabh Singh: That's a very good point,
right. So that actually goes back to the point
of data-driven synthesis where the idea is this,
even the way I code right now, many times I just
search on Google or Bing to find -- actually,
it's a fact that Bing is better in searching code
than Google.
So you search for something and it gives you
code. You copy/paste it, right? So can you
actually let -- build a system that goes over the
net and figures out the code for you? So there's
already been a little work done in this area.
But, yeah, I think that's a very useful approach.
And there's also a very big DARPA grant this year
on the same topic. We have so much code data
everywhere. How can you leverage that for
synthesis. Also verification and other things.
>>: [inaudible] in the case of FlashFill I
wonder how much effort is put in actually writing
the exact same properties all the time.
>> Rishabh Singh: It's a good point, right.
>>: The [indiscernible] that kind of
[indiscernible] code.
>> Rishabh Singh: Yes. The thing about Excel,
it's interesting, that typically formulas are not
that big. So we can actually just do synthesis
on it. But you can imagine if you really want to
write a [indiscernible] script of 100 lines, then
maybe this idea that you search over repository
would be better suited, but, yeah.
>>: So when you're -- in FlashFill when you're
ranking and [indiscernible] learning, right? A
bunch of possible programs that match the set of
examples you've given, any model that you built
potentially could be wrong, right?
>> Rishabh Singh: Right.
>>: It will always be [indiscernible].
>> Rishabh Singh: Right.
>>: So have you ever found -- how do you expose
that to a user? The fact that your model is
ranking something and, yes, it's producing a
result, but you don't necessarily know that the
user of that thing --
>> Rishabh Singh: Right.
>>: If it's a good result. [inaudible].
>> Rishabh Singh: Yeah. So there's an --
there's also an interaction mechanism which was
not shipped in Excel, so I think where Ben and
Sumit have also worked on that, but the idea was
you're essentially learning all the programs, and
even though you're ranking them, you don't have
to run just one program. You can run the top
ten, top 100,
and if you see the top two answers are very
different from each of them, you can highlight
the cell. You can say that I'm not sure about
this answer. There's the other answer, you want
to pick it or not. But it was, unfortunately,
not shipped in Excel. They thought it was too
complicated. But, yeah.
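The interaction mechanism described here -- running the top-ranked programs and flagging disagreement -- can be sketched as follows. The top-10 cutoff and the return format are illustrative assumptions, not details of the research prototype.

```python
def flag_ambiguous(programs, cell):
    """Run the top-ranked learned programs on a new cell; if their
    outputs disagree, flag the cell and surface the alternatives
    instead of silently committing to the top-ranked answer."""
    outputs = [p(cell) for p in programs[:10]]  # top 10 by rank
    if len(set(outputs)) > 1:
        return ("uncertain", sorted(set(outputs)))
    return ("confident", [outputs[0]])

# Two toy candidate programs: take the last word vs. the first word.
programs = [lambda s: s.split()[-1], lambda s: s.split()[0]]
print(flag_ambiguous(programs, "Madonna"))     # ('confident', ['Madonna'])
print(flag_ambiguous(programs, "John Smith"))  # ('uncertain', ['John', 'Smith'])
```

On single-word input the candidates agree, so the answer is shown with confidence; on two-word input they diverge, and the cell would be highlighted with both alternatives for the user to pick from.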
>>: I think that the issue is the same thing.
If you are going to target the masses, they're,
by definition, unsophisticated.
>> Rishabh Singh: Right.
>>: So either you get the job done right away or
you don't bother them. [indiscernible] right?
>> Rishabh Singh: But it was funny that we saw
videos on YouTube that said FlashFill doesn't
work, because I gave an example it doesn't work.
Just because they had data in different formats,
they were supposed to give multiple examples, but
they don't even know you can give multiple
examples.
So, yeah. So they're not very sophisticated,
yeah.
>>: But if you think about it, you know, I mean,
we're equivalent to the "I'm feeling lucky"
button in Google, right? You type a search query
and you get a bunch of results, and most people
don't hit "I'm feeling lucky," because it would
basically only show one result.
So the question is how do you educate people to
think about this a different way, which is to say
there are multiple things. It's sometimes going
to get it wrong, but if they think in those
terms, you're going to converge. You're going to
get a lot more feedback and you're going to
converge on a really, really good solution over
time.
>> Rishabh Singh: Yes.
>>: So I think, you know, if you think of how
good Google has gotten, I mean, it didn't start
as good as it is now, right? That's part of the
story, and I think --
>> Rishabh Singh: Right.
>>: -- this is a question of how you give
people the --
>> Rishabh Singh: It's actually a good analogy,
yeah, because do you want to prefer to see just
one result or multiple results, and we have
learned over the years to look at multiple
results, maybe.
>>: Filter.
>> Rishabh Singh: And filter automatically.
>>: Very good output from search engines.
>>: He's right now [indiscernible] actually
multiple results.
>> Rishabh Singh: So not in the Excel side,
yeah. So in the research prototype, you just
right click on a cell and you can see multiple
results.
>>: Can you -- I know that you know all these
Excel PowerPoint and all, they're extensible.
>> Rishabh Singh: Yes, yes, yes.
>>: So is it plausible for you to just ship to
make your own --
>> Rishabh Singh: Actually, so this --
>>: [inaudible] stop you.
>> Rishabh Singh: Yeah, so the idea was,
actually, yeah, you can ship your thing as an
add-in, as well, which a lot of people do. But
the thing is, people are not sophisticated enough
to download that add-in. But, yes, if some
people are, you can give them, but typically
people just use basics. So we --
>>: Stop [indiscernible] it.
>>: That's the problem, right? You go down that
stack.
>> Rishabh Singh: Yeah.
>>: Then, I mean --
>>: The expectations --
>>: The expectations go down very fast.
>>: I mean, the problem is, I mean, I think
search analogy, the thing is that, like, you
know, the reason that people have become better
at searching is because there are multiple things
and there's an opportunity for them to improve
their mental model.
But if they only ever had the "I'm feeling lucky
button," there's not an opportunity for improving
their mental model [indiscernible] of reaching
only possibilities. So I'm starting to push the
Excel team a lot to say, hey, you guys need to
show more than one.
>> Rishabh Singh: Maybe that's a new way to look
at spreadsheets, yeah.
>>:
Yeah.
>>: I don't know whether it's lack of
sophistication. It might actually be just a
discoverability issue, you know, the right
button.
>>:
Well, no, no.
So for example, I mean --
>>: It's been shown they cannot search
[indiscernible] on the right-hand side --
>>: There's been a bunch of things, so if you
want to pick a font and you mouse over a font, it
automatically shows you --
>>: That's right.
>>: You know. So they have ways to show you
lots of choices quickly. And I think this is
part of the story is that it's really about the
UI and the user experience and preparing people
and giving them confidence, you know, that make
them learn and understand it, et cetera. So I
mean, there's huge opportunity.
>> Rishabh Singh: Right, that's the thing.
>>: Yeah.
>> Rishabh Singh: Yeah.
>>: It's not limited by the capability of the
underlying synthesis.
>> Sumit Gulwani: Okay. So let's thank the
speaker here.
>> Rishabh Singh: Thanks for coming.
[applause]