>> Sumit Gulwani: Okay. Hello, everyone. It's my great pleasure to introduce Rishabh Singh, who is a Ph.D. student at MIT and is graduating this year. Rishabh was a three-time intern at Microsoft Research, so some of you might know him well. He's also the recipient of the MSR Ph.D. fellowship award. Rishabh was a key contributor to the FlashFill project, and the FlashFill feature, as many of you know, shipped in Excel 2013. Back at his university he is also leading several other interesting projects, and one of my personal favorites is his work on automatically grading introductory programming assignments, which I think has the potential to revolutionize computer science education. So without any further ado, Rishabh. He's going to tell us something about some of these interesting projects, and perhaps you will learn more. >> Rishabh Singh: Thanks a lot, Sumit. So hi, everyone. Today I'm going to talk about program synthesis for the masses, but before that, let me start with something that happened to me three years ago. I was a teaching assistant for a course, introductory programming at MIT, and I was supposed to grade assignments of 200 students, and it just so happened that I was lazy, that I only had two days to grade them, and there was also [indiscernible] line, so what could I have done? So what I did, I just ran test cases and gave them appropriate marks based on how many test cases passed. But then there's the feedback we got later: the students were quite unhappy, and in this particular case they were saying that the feedback based on test cases was not really helpful. It doesn't really tell them what's wrong with the program. And also, the kind of grades they get is not in accordance with their potential to write this code. So this is already a problem in classrooms. We have to spend a lot of time grading these. Typically it used to take me two to three days to give good feedback.
But this problem is getting even bigger now with MOOCs coming up. When there are hundreds of thousands of students signing up, there's no way we can hire enough people to give good feedback to all of them. So in this talk I'll show you one way we can use synthesis to provide automated feedback similar to what teachers would give. But more generally, my research goal is to make programming more accessible to people, and we can achieve this goal in two ways, and these are the two ways I've been looking at it. We can build systems to help people learn programming, to let them go up the bad [indiscernible], and the second way we can do this is to make it easier for people to program by letting them program using more intuitive specifications. And these are the two directions I have been taking over the last few years. So there are three systems I've built or worked on. The first system is Autograder, which provides automated feedback on programming assignments, and we have used it to generate feedback for thousands of submissions from edX, and we are in the process of deploying it for the next course. The second system is FlashFill, which is a programming-by-example system for Excel users, for people who don't know programming but still want to get tasks done. And I'll briefly talk about this, as well. The third system, which I won't have time to talk about today, but I can talk about it afterwards, is storyboard programming, which lets students write data [indiscernible] manipulations using examples of data structures. So these three projects, automated grading, programming by example, and visual programming, seem quite different, and they indeed are quite different from each other. But one thing that ties all of them together is program synthesis. And in this talk I'll show you how some of its ideas can enable all of these different applications.
Now, back in the '80s, the traditional view of program synthesis looked like this. Somebody would go in and write a complete specification. This is a specification for mergesort; you give it to the '80s computer, and out comes an implementation. In this case, it would be some sorting implementation. And as you can imagine here, oftentimes the implementation would be much smaller than the spec itself. So this complete process didn't gain much traction, and programmers would rather write code instead of writing this larger piece of specification. So now let me show you what the more modern view of synthesis has been in recent years. First of all, we can break it down into three components. The first component is the specification mechanism, and we are going to embrace the fact that it's hard for people to write a complete spec, a complete description, and we're going to let them provide specs using more [indiscernible] specifications, things like examples, reference implementations, control flow. And some of the examples are there more concretely. The second component has been the hypothesis space. Typically people have tried to synthesize programs in Turing-complete languages, and one of the contributions of our work has been to identify subsets of logic, so that you don't always have to go to Turing-complete languages, which have two properties. These domain-specific languages are expressive, in the sense that you're able to express most of the useful tasks, but at the same time they're efficiently learnable. So they have these two properties. And finally, we have the third main component, the synthesis algorithm, which learns programs from this hypothesis space that conform to the specification somebody has given. And there are also interesting challenges here: how do you efficiently search this large space of programs? So now let me start with the first system, Autograder.
This is joint work with Sumit and Armando, and this project aims to automate grading. So before we think about automation, let's look at how it's typically done in classrooms today. We, all the TAs, sit in a classroom, go over a few assignments, and after going over a few assignments, you typically know what common mistakes students are making, and then you construct a grading rubric. And after constructing that, there's a very repetitive process. You take the rubric, you apply it to the student program, and give appropriate feedback. And this repetitive process is the exact thing we want to automate in this system. >>: What do you mean by "grading rubric"? >> Rishabh Singh: Oh, so a rubric is simply saying what are the mistakes students are making, how many marks should I deduct for each, and what feedback should I give for this particular mistake. So these are the common sorts of mistakes students are making on a given assignment. So we wanted to replicate this repetitive process, and instead of calling it a rubric, in this system we're going to call it an error model. But everything remains similar. So let me explain the system using a very simple example. This is an exercise from the edX class on introductory programming. This exercise asked students to write a Python function that takes a polynomial, which is represented as a Python list, and returns a list whose values correspond to the coefficients of the derivative of the function. And if your function is constant, you want to return zero. So these are the two cases, the corner case and the general case. And you can imagine from very simple algebra teaching [indiscernible] a very simple solution. First of all, check if the polynomial is constant; if so, you return zero. Don't do anything. Otherwise, simply iterate over the list and compute the coefficients of the derivative by multiplying each value by the index of the value in the list. So this is a pretty straightforward way to implement this.
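The reference solution described here might look like the following minimal sketch. The function name `compute_deriv` and the choice to return `[0]` for the constant case are assumptions for illustration; the talk only describes the behavior in words.

```python
def compute_deriv(poly):
    """Return the derivative's coefficients for a polynomial given as a list.

    poly[i] is the coefficient of x**i, so [2, -3, 1, 4] means 2 - 3x + x^2 + 4x^3.
    """
    # Corner case: a constant (or empty) polynomial has derivative zero.
    if len(poly) <= 1:
        return [0]
    # General case: d/dx (c * x**i) = (i * c) * x**(i - 1), so multiply each
    # coefficient by its index and drop the constant term.
    return [i * poly[i] for i in range(1, len(poly))]
```

For example, `compute_deriv([2, -3, 1, 4])` gives `[-3, 2, 12]`, the coefficients of -3 + 2x + 12x^2.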
Much cleaner, what a teacher would write. Now let me show you one solution that somebody on the edX class was submitting. This was this particular student's 10th attempt. You get multiple attempts. You can keep clicking, and it will tell you which test cases are passing or failing. And so for this particular student, this was the 10th attempt, and after this the student is going to give up because the student is not getting any feedback from the system about what's happening with this code and why it is failing. And a similar problem also happens when you typically grade these programs, because you would see such an array of solutions that it's not even clear to me what's wrong with this solution. I would have to spend four or five minutes to understand it. But now, actually, we don't have to worry about it. We can give it to our system and we can enjoy in the meantime. So let me show you what our system can do. This is the exact problem I was showing you before, and we can ask the system to do some analysis for us. It's going to search over a space of corrections and then it's going to say your program is almost correct, but it requires two changes. In line number 14, instead of range(1, length), it has to be range(1, length + 1). So we can do that. And now hopefully the system is going to tell us there's only one change we need to make, which in this case I think is that instead of greater than or equal to, it should be greater than. And there are multiple choices the system can produce, so here it is saying change I to I minus one, which leads to another question, but I minus one is the same as changing greater than or equal to into greater than. So one more thing I wanted to show you here is that this problem is challenging because, for a given problem, there are many ways to solve it. For example, this was one attempt that was closer to what the teacher had written, but it still had some small mistakes. In this particular case the student was using -- >>: I have a question.
>> Rishabh Singh: Yeah. >>: In the previous demo, if you made that change with I minus one, what would your system return? >> Rishabh Singh: Oh, if you just make it I minus one? >>: Yeah, whatever it said. >> Rishabh Singh: Then it would say the program is correct. Nothing else. Yeah. >>: Oh, I see. >> Rishabh Singh: Yeah. So it doesn't do anything else, yeah. >>: Is that, by the way, I mean, just generally is that worrisome to say that the program is correct? I'm sure that your system doesn't really know that the program is correct [inaudible]. >> Rishabh Singh: That's a very good point, right. So I'll tell you what the system does and what kind of guarantees you get, but as you can imagine, we are doing exhaustive testing, and even though it's bounded, it's still much more exhaustive than what typically happens in edX right now. There you have a finite set of test cases, and then they would also say your program is correct. So at least it's better than that. >>: [inaudible]. >> Rishabh Singh: But it's not a completely accurate thing, right? >>: Is this [inaudible] impact to the teacher or the student in this case? >> Rishabh Singh: That's another good point. So it depends on the use case, and I'll tell you some of the use cases we have tried in our experience. But one of the use cases is that it actually tells students if they're -- for example, the student was stuck on the tenth attempt. Maybe you want to give this kind of feedback, but not exactly how to fix it. So that's also a good -- >>: It seems like something [inaudible] if you just told them what to do -- >> Rishabh Singh: Then the learning component is going away, right. So one thing I can show you here is that we support different hint levels, so you don't always have to say everything. You can say that maybe there's something wrong in this expression, but I won't tell you how to fix it. >>: [inaudible]. >> Rishabh Singh: You can also go one level down. This is what we're going to experiment with on edX.
You're just going to highlight the expression, yeah. >>: So there's a level, I mean, this is like fixing the program, but there's also a level of understanding. It's like they don't understand the difference between greater than or equal and greater than, like an off-by-one kind of thing. >> Rishabh Singh: Yes. >>: Is there any conceptual framework that you can raise these things into to actually explain the real problem? >> Rishabh Singh: That's also a very good point, right. For example, one thing we have been thinking about is a more interactive fashion. So let's say your fix is that your array index I should be I minus one, and instead of just telling them to change I to I minus one, you can actually ask the student, let me ask you a different problem, not this problem. Given a list, can you print the elements of it, and if the student is still making the mistake or thinking [indiscernible] they start from one instead of zero, now you have a high-level idea of what the problem with the student is. So that way you can actually use this low-level information to figure out what's the root cause and then try to give high-level information. For example, from I greater than or equal to zero versus I greater than zero, you can figure out you're doing one more loop iteration. So instead of telling them that, you give more high-level information. But so that's -- >>: Yeah, that's exactly how I would do the teaching, right? You don't want to have them solve the problem. You don't want to tell them the answer at hand. >> Rishabh Singh: Right. >>: You want to, you know, it's just like -- >> Rishabh Singh: Somehow -- >>: -- you want to simplify it. >> Rishabh Singh: Yeah. >>: You know, find the thing that they really need to understand and then say, okay, now think about this other thing, how can you solve it. >> Rishabh Singh: That's actually a very good point.
So we had lots of meetings with the teachers who are teaching this course, as well, and they had the same opinion, that you want to teach students debugging instead of just teaching them how to fix code, right. And this was one way we were thinking of having an interactive dialogue. So one more thing: this was the most interesting one I found in the database. If you look at this program, it looks really, really different, and it seems it cannot be correct because you're only returning when the length of the input is less than two, and your program should work for all inputs. But in this case, the student is actually doing a very interesting thing, and it just so happens this is a nearly correct program. The student is popping elements from the list, computing the result, and when you have popped enough, then it's okay to return, which is actually quite an interesting, clever way to do it, even though this shouldn't be recommended because you are making a stateful change to the input. And the only change here is that in line 11 the initialization should be one instead of zero, and the student is missing a base case, but that's still pretty close. >>: It's interesting, because if I was teaching it, we talk about the rubric, the rubric would be something like you missed the base case, minus one; off by one, minus one, right? I mean, they are these sort of conceptual things. They're not at the level of the detail, but, yeah. >> Rishabh Singh: Yeah. >>: I'm curious about these English sentences -- >> Rishabh Singh: Yeah. >>: -- that your tool is outputting. So I'm sure that at the core of it there is some SAT solving going on. >> Rishabh Singh: Yeah, yeah, yeah. >>: From there you have to construct these -- >> Rishabh Singh: Yeah. >>: -- into sentences. How do you do that? >> Rishabh Singh: So this is very basic right now. It doesn't do any NLP. So for every correction rule, there is a template of an English statement.
It just comes with some blanks, and you fill them in based on the feedback the SAT solver is giving you. >>: So that -- >> Rishabh Singh: The template is fixed. >>: -- template is your domain, the one that you were talking about earlier? >> Rishabh Singh: The error model. >>: The error model. >> Rishabh Singh: The rubric, yeah. So for every correction rule, you have a template: what feedback should I give if this correction rule is applied. >>: Right. >> Rishabh Singh: And then you fill in the blank space -- >>: But I was trying to connect it to the broader picture that you had when you were putting programs in the [indiscernible] middle where -- I don't know what you were calling it. You know, there was a DSL component, right? >> Rishabh Singh: Oh, actually -- >>: Is that the same as the error model? It's the same -- >> Rishabh Singh: Yeah, so that's part of the DSL, yeah. >>: Okay. >> Rishabh Singh: So I'll actually go back to that. That's a good point, right? Which is probably the next slide, actually. Or maybe not. Yeah. So the error model I showed you, let me tell you what an error model is. In our system it's simply a collection of correction rules. For example, if you look at this correction rule, it says whenever you see a return statement in the program, you can optionally modify it to return, in this case, a list containing zero or question mark V. Question mark V means any variable defined in the program, and this corresponds to mistakes students were making: they would do everything right, but they would return something else or forget to change the return statement, because the code template comes with a default return. So this correction rule lets you fix that problem. You can return any of the variables in the program. Typically you would have a large number of these rules. >>: Where are these rules from?
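As a rough illustration of the return-statement rule and its attached feedback template, here is a minimal sketch. The function names, the string-based representation, and the wording of the template are all hypothetical; the talk does not show the system's actual rule syntax.

```python
def return_rule_candidates(original_expr, program_vars):
    """Options for one `return` statement under the rule described above:
    keep the student's expression, return [0], or return any program variable (?v)."""
    options = [original_expr, "[0]"] + list(program_vars)
    # Drop duplicates while keeping order (the original may itself be a variable).
    return list(dict.fromkeys(options))


# Hypothetical English template tied to this rule; the blanks are filled in
# from whichever candidate the solver ends up choosing.
FEEDBACK_TEMPLATE = "In line {line}, change the return value from {old} to {new}."


def render_feedback(line, old, new):
    return FEEDBACK_TEMPLATE.format(line=line, old=old, new=new)
```

With `program_vars = ["poly", "result"]` and an original `return poly`, the candidates are `poly`, `[0]`, and `result`; if the solver picks `result`, the filled template is the feedback shown to the student.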
>> Rishabh Singh: So right now some teacher, somebody, has to go and write them manually. Towards the end I'll show you how we can learn these from data as well, but right now these rules have to be manual. So, yeah, one question would be where is the synthesis problem, and this goes back to Shah's question, as well. So we can go back and look at our bigger picture of three components and try to put things together. The specification mechanism in this case is whatever the teacher is writing. We have a correct implementation; let's say if we were doing sorting, the teacher can write bubble sort. The hypothesis space comes from whatever the student has submitted, the student's solution, and the error model that you have, which gives you a space of corrections you are searching over. And the algorithm essentially searches over this class of programs, but represents the solution space using a language we're going to call Python tilde, and then tries to find a correction that matches the behavior of the teacher's solution. So let me tell you some of the key ideas in each of these steps, starting from the student code and going to feedback. There are four main phases. The first phase is rewriting, which takes a Python program, takes the error model, and rewrites the program into this language we call Python tilde. And then we have to solve these programs, these constraints, so we're going to use an off-the-shelf solver. In this case we're going to use the Sketch solver to do the synthesis task for us, so it creates a Sketch file, which is then solved, and then the result gets translated to natural language using templates. So let's go over the interesting things in each one of these phases. The rewriting phase takes the error model. Let's assume I have a very simple error model: any arithmetic expression A goes to A plus one. And let's assume this expression is what the student had written. So we can start applying this rule.
We can apply it to the first subexpression, zero. We get zero and one. We can apply this to poly, as well. Also, whatever the student had written, we're going to keep it in a box, just to make sure we remember what the student had written. We can apply it to poly inside length of poly, and poly becomes poly and poly plus one. Now, one thing you would notice is that poly is a list and one is an integer, so it doesn't make sense to add an integer to a list, but this phase is completely syntactic, so it's not taking semantics into account. This will be handled in the next phase, by the semantics of the language. So you can also apply the rule to the whole expression, length of poly, and you get two more choices. So in this way, you keep rewriting the program symbolically, and this set expression now represents, in this case, eight possible rewrites of what the student had written. And in this way you take the error model, you apply it to the whole program, and now you get a family of programs -- >>: This is -- is this a never-ending thing, you keep rewriting and it will -- >> Rishabh Singh: Yes, exactly. So we have well-formed error correction rules, and this is a well-formed rule in the sense that it's not nested. To do nested rewrites, you have to have a special symbol. You have to say A prime. >>: I see. So you're saying that once you go from A to A plus one -- >> Rishabh Singh: There's no more rewrite on this thing you can do. >>: I see. Okay. >> Rishabh Singh: Yeah. And then there's a well-formedness condition to make sure it always terminates. But typically we always have one-level rewrites. We never go into nested rewrites. But the way the semantics are defined, it all actually works out, at least theoretically.
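To see where the count of eight comes from, the purely syntactic rewriting step can be sketched as below. This assumes the student's expression was `range(0, len(poly))`, which the transcript does not state explicitly, and uses a toy expression tree (`Atom`, `Call`) invented for the illustration rather than the system's real representation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Atom:
    text: str                      # a constant or variable, e.g. "0" or "poly"


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple
    arithmetic: bool = True        # may the rule a -> a + 1 apply to the call itself?


def rewrites(node):
    """All strings obtainable by optionally applying a -> a + 1 once per subexpression."""
    if isinstance(node, Atom):
        base, arithmetic = [node.text], True
    else:
        arg_options = [rewrites(arg) for arg in node.args]
        base = [f"{node.func}({', '.join(combo)})" for combo in product(*arg_options)]
        arithmetic = node.arithmetic
    if not arithmetic:
        return base
    # Keep the student's original next to the rewritten form; "a + 1" is not
    # rewritten again, which mirrors the one-level (non-nested) rule.
    return base + [f"{b} + 1" for b in base]


# Assumed student expression: range(0, len(poly)); range itself is not arithmetic.
expr = Call("range", (Atom("0"), Call("len", (Atom("poly"),))), arithmetic=False)
family = rewrites(expr)
```

Here `family` holds two choices for `0`, times four for `len(poly)` (two for `poly`, each with or without the outer `+ 1`), giving the eight rewrites mentioned. Type-incorrect members such as `len(poly + 1)` are kept on purpose, since this phase ignores semantics.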
So now you have this program in the Python tilde language, where you have set expressions, and the problem boils down to: give me a program from the set of programs that is functionally equivalent to what the teacher had written. By functionally equivalent, we mean that it's more than just syntactic. So if the teacher had written bubble sort and the student is writing mergesort, we still want to give feedback. It should have the same functional behavior. >>: So I've got a question -- >> Rishabh Singh: Yeah. >>: -- for you. So I noticed that some of the programs that you were generating are not type safe. Python is a dynamic language. The fact that Python is a dynamic language, does it help you or hinder you? >> Rishabh Singh: At least from my experience, it hinders us a lot because you have to model all these type constraints inside. >>: I see. >> Rishabh Singh: We'll go a little bit into the encoding after that, but, yeah. So it's a somewhat bigger challenge than a normal C-like language. >>: Also, a lot of these programs, if there was a static type checker, then you could have just pruned them away. >> Rishabh Singh: That's a good point, right. You can prune it -- >>: [inaudible] layer right now [inaudible] -- >> Rishabh Singh: So right now -- >>: -- dynamic. >> Rishabh Singh: Right. Right now the constraint solver is doing the type checking for us, right? But, yeah, that's a good point. You can remove some of them up front before giving it to the solver. And it actually does a lot of efficient checking also before giving it to SAT. So in between there are also some optimizations going on, but, yeah. So the thing is, you can take this program and try an explicit-state model-checking-like technique or smart search, and I tried doing it, as well, and it's still running. It's been more than one year, but from my back-of-the-envelope calculation, it would take more than 30 years to get feedback on just one assignment like this. So we want to do better than that.
We can't wait for that long. So that takes us to our second phase. Instead of just enumerating them and pruning, we're going to give the problem to a symbolic solver to explore this search space, and we're going to use the Sketch solver. So to give you a one-slide introduction to Sketch, it's a C-like language, very similar to C. It has one extra construct called holes, and holes are represented using a double question mark. So you can write a program like this, P, where C is some unknown, and you have some assertion in the program that says input plus input should be C times input. And I want this assertion to hold for all inputs. You can give it to the solver and the solver would say when C is equal to two, I can prove that for all inputs, all assertions will be satisfied. So we're going to use this system, and essentially what it is solving for us is an equation of this form: there exists a program P such that for all inputs, all my assertions are satisfied. So phi of P and in is satisfied. So it's solving this doubly quantified equation. We can go ahead and try to represent our problem in this form, too. Let's assume S is the student solution and ERR is the error model. So give me a rewrite of the student solution, S prime, which belongs to the set of rewrites, such that for all inputs, the behavior of the teacher's program is the same as that of the modified student program. So this is where functional equivalence is coming in. At least in this way we can formulate our grading problem as a synthesis problem. But there are some challenges again, because we have Python programs, as Shah was mentioning. We don't have C programs for this. So we would have to somehow encode Python semantics in Sketch, which has a lot of interesting details as well, but let me point out one or two interesting ones.
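The doubly quantified formulation (there exists a rewrite S' such that, for all inputs, teacher and S' agree) can be illustrated with a bounded brute-force search. This is only a sketch of the problem statement, not of how Sketch solves it: the candidate family, its two parameters, and the input bounds are all invented for the example.

```python
from itertools import product


def teacher(poly):
    """Reference solution: derivative of a polynomial given as a coefficient list."""
    if len(poly) <= 1:
        return [0]
    return [i * poly[i] for i in range(1, len(poly))]


def make_candidate(start, stop_offset):
    """One member of a hypothetical family of rewrites of a student's loop bounds."""
    def candidate(poly):
        if len(poly) <= 1:
            return [0]
        return [i * poly[i] for i in range(start, len(poly) + stop_offset)]
    return candidate


def bounded_inputs(max_len=3, values=(-1, 0, 2)):
    """All coefficient lists up to max_len over a small set of values."""
    for n in range(max_len + 1):
        for combo in product(values, repeat=n):
            yield list(combo)


def find_fix():
    """Exists S' in the family such that, for all bounded inputs, S'(in) == teacher(in)."""
    for start, stop_offset in product((0, 1), (-1, 0)):
        cand = make_candidate(start, stop_offset)
        if all(cand(p) == teacher(p) for p in bounded_inputs()):
            return start, stop_offset
    return None
```

Here `find_fix()` settles on `start=1, stop_offset=0`, the only member of this small family that matches the teacher on every bounded input. Enumeration like this is exactly what becomes infeasible for realistic families, which is the motivation for handing the problem to a symbolic solver.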
So essentially the idea is we're going to take Python, write a compiler to Sketch, and in some sense in this way we are going to get a synthesis system for a dynamically typed language. So -- yeah. >>: When you say "correct," I mean, do you mean you're going to test it for some number of cases, or you've actually proved it's correct, or -- >> Rishabh Singh: Yeah, so I'm going to go over the algorithm after this, but, yeah, it has to be bounded, yeah. >>: The double quantifier that you have, it's really just -- it's an or [indiscernible]? >> Rishabh Singh: Yes, exactly. Right. It's going to turn into that, yeah. So the first challenge is we have to handle dynamic types, and we're going to take a strategy similar to union types. We're going to define a structure which is going to have a Boolean flag, or let's say an integer flag in this case. It's going to denote what type the variable is, and for every type it's going to have a field. So it will have an integer field, a Boolean field, a list field, a string field. For example, if I have a constant like an integer, I can say my MultiType's type is int and the integer value is the corresponding integer value. For lists, it's going to be a recursive type. We're going to say the type field is list and the values are themselves other MultiTypes. So this way we can encode constants as MultiTypes. We now also have to lift all the Python statements and library functions to work over MultiTypes. For example, if you take the addition expression, you have to encode that. When there are two integers, it adds the two integers and preserves the type flag. When there are two lists, it appends the two lists, and whenever you are trying to add an integer to a list, it leads to an error state. So this is where type checking is coming in, in some sense. And, yeah, so typing rules are now encoded as constraints inside the system, and you do this for all of Python, or at least the subset of Python you want to support. Yeah.
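A minimal Python sketch of the tagged MultiType structure and the lifted addition just described is shown below. The real encoding lives in Sketch; the field names, the `ERROR` flag, and the helper constructors here are assumptions made for illustration, and only the three cases mentioned in the talk are modeled.

```python
from dataclasses import dataclass, field
from typing import List

# Type flags (an integer tag, as in the talk); ERROR models the error state.
INT, BOOL, LIST, STR, ERROR = range(5)


@dataclass
class MultiType:
    flag: int
    int_val: int = 0
    bool_val: bool = False
    str_val: str = ""
    list_val: List["MultiType"] = field(default_factory=list)  # recursive case


def mk_int(n):
    return MultiType(INT, int_val=n)


def mk_list(items):
    return MultiType(LIST, list_val=list(items))


def add(a, b):
    """Addition lifted to MultiTypes, covering only the cases described above."""
    if a.flag == INT and b.flag == INT:
        return mk_int(a.int_val + b.int_val)        # int + int keeps the int flag
    if a.flag == LIST and b.flag == LIST:
        return mk_list(a.list_val + b.list_val)     # list + list concatenates
    return MultiType(ERROR)                         # e.g. int + list is a type error
```

For instance, `add(mk_int(1), mk_list([]))` yields a value whose flag is `ERROR`, which is how a syntactically generated rewrite such as `poly + 1` gets ruled out semantically.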
>>: So then students can create new classes, then they don't [inaudible] Python basic types? >> Rishabh Singh: Oh, no, it's actually -- so we also have an encoding for classes and objects, yeah. So this is a very simple thing, but yeah, you can build more complicated objects on top of it. >>: This seems to me to be a massive undertaking. >> Rishabh Singh: Yeah, it is. >>: To encode the operational semantics of all of Python. So what -- I'm sure you're not handling all of Python, right? >> Rishabh Singh: Yeah, exactly right, because you can't do it. >>: How do you go about [inaudible] keep getting programs -- >> Rishabh Singh: Yeah, so the idea was that we wanted to start small, so we took seven to eight weeks of the edX course. At that time it didn't include classes. It was mostly all of the basic data types and things like higher-order functions, loops. So you have list comprehensions, those things. And these are the things we wanted to capture, most of the things people do in the first eight weeks. But then after that we got more people, and we now also have class encodings and other things on top of it. But, yeah, it's a subset, but still rich enough to actually handle a very large class [inaudible]. And also there are some complicated library functions that people use which are not good for solvers, so we built what we call models for such complicated functions. For example, if you have square root, the system is going to encode it in such a way that it says I don't care what the function is, so I'm just saying it's some uninterpreted function that satisfies some postconditions. So here we're going to say whatever the return value is, its square should be less than the integer. So there is some approximation also happening for complicated library functions, and we have models for them.
So doing these two things, you can imagine that we have a compiler going from Python to Sketch, and I can talk about more details afterwards if you're interested. But let's assume we have a compiler now, so we can translate these Python programs to Sketch. Now the challenge is how do you solve them. And this is what, I think, Reston [phonetic] was asking. So essentially we're going to use an algorithm similar to CEGIS. So first let me tell you what CEGIS is. It's counterexample-guided inductive synthesis. We're trying to solve a doubly quantified equation, and you want to know what the value of C is. Since it's a doubly quantified equation, we're going to remove one quantifier at a time and divide the work into two phases. The first phase would drop the for-all quantifier and say I don't care about all inputs. Give me a program that works for a random input, and let's say in this case it's going to be zero. I put zero in my constraints there and I would say, okay, I know zero plus zero is ten times zero. So one valid solution is C equal to ten, which is okay for now. But the problem is this program works for zero but may not work for other inputs, so we need to go to the second phase, verification, which says give me a counterexample input such that this program, whatever I have found till now, doesn't work. In this case, the system would say yes, when the input is four, four plus four is not equal to ten times four, so it doesn't work on four. So four becomes your new counterexample input, and you go back to the synthesis phase, and this time you ask the question: give me a program that works for both zero and four. And you keep doing this.
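The synthesize/verify loop just described can be sketched on the same toy problem (find C with input + input == C * input). This is a brute-force stand-in, not the Sketch solver: the candidate range, the bounded verifier, and the descending search order (chosen so that the first guess is ten, as in the talk) are all assumptions.

```python
def synthesize(examples, candidates):
    """Inductive step: find a C that satisfies the assertion on the examples seen so far."""
    for c in candidates:
        if all(x + x == c * x for x in examples):
            return c
    return None


def verify(c, inputs):
    """Bounded verifier: return a counterexample input, or None if c works on all of them."""
    for x in inputs:
        if x + x != c * x:
            return x
    return None


def cegis(candidates=range(10, -1, -1), inputs=range(0, 32), first_input=0):
    examples = [first_input]              # start from a single arbitrary input
    while True:
        c = synthesize(examples, candidates)
        if c is None:
            return None, examples         # no candidate fits the examples
        counterexample = verify(c, inputs)
        if counterexample is None:
            return c, examples            # verified on every bounded input
        examples.append(counterexample)   # feed the counterexample back in
```

Starting from the single input 0, the first guess is C = 10; the verifier then returns a counterexample (here 1 rather than the 4 used in the talk, since this verifier scans inputs in order), and the second synthesis round produces C = 2, which passes verification. As with the bounded model checker mentioned in the talk, "verified" here only means verified on the bounded input range.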
At some point, and in this case it just so happens that two inputs are sufficient, the examples constrain the problem enough to get the right answer, C equal to two. You give it to the verifier, and the verifier can't find a counterexample for you and says okay, the program is correct for all inputs. Now, this technique depends on how strong your verifier is. If your verifier can reason about infinite inputs, then you would get those guarantees. For our system, we have a bounded model checker. So you only do it for bounded inputs. So in general, yeah, you would have a few iterations of these two phases until you converge to a solution. So we can use this algorithm on the problem that we had, and it would come up with this: your program requires 15 changes. And why did this happen? Because we didn't really say anything about what kind of solutions we want. We just said give me a fix, and it found some random fix. Ideally, we want minimal changes. We don't want an arbitrary number of changes, because the minimal fix is what's going to correspond to what the student had in mind. But minimization is hard for SAT solvers. Typically solvers don't support such a feature. But we have alternative options to solve this problem. We could do binary search. We could say let's try with some value; if it works, we try half of it. So we do some iterations, but this is very expensive. We could also use MAX-SAT, but then again, as you saw, there are too many calls happening to solvers. Many [indiscernible] iterations. So that also was very expensive. One thing that really worked well is what we call incremental linear search, which tries to reuse as much as [indiscernible] solver has learned in previous iterations. So let me give you a high-level idea of what that is. Let's go back to our CEGIS loop, and let's say I want to minimize some function f(x). We can use the same algorithm.
We get program P1, and let's say the value of fx in the current context is seven. So what we do, we just say don't throw away anything that you have done till now, so whatever constraints you have learned in synthesis, keep them. Whatever constraints you have learned in verification, keep them. And you feed that state back and also add a constraint that fx is now less than seven, and the nice thing that happens now is that, since the solver has learned so many things, the next iteration becomes really fast and it would say, okay, I can find you another program P2 where the value of fx is four. And this way you keep asking it to give you smaller solutions while trying to reuse as much as possible of whatever you've done in the past. At some point the synthesis phase is going to say I can't find a solution for you anymore, and then you know the minimum value of fx is four. So this is pretty interesting in the sense that in practice we have seen the first iteration would take quite a lot of time, but all the other iterations are really, really fast. They take milliseconds. And as you can imagine, we went from high to low. If you went from low to high, then you would have to throw away whatever the solver has learned, because those queries would be UNSAT. So finally, this way we can solve these constraints, the minimization constraints, and now it's pretty straightforward. You have templates for each correction rule, and the fix gets translated to natural language. So now let me tell you briefly about some of the evaluation we did with the system. We took lots of benchmarks from both the classroom, the intro programming class at MIT, as well as the edX class, over these types: integers, tuples, and strings. And for each of these problems we removed the submissions which were correct, so these are the submissions that were incorrect in our benchmark set. For classrooms we got hundreds of them, and for the edX class we got thousands of them.
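The high-to-low search can be illustrated with a small Python sketch. This is only a toy under stated assumptions: the "solver state" is modelled as a pool of candidates already known to be correct, which stands in for the constraints a real incremental SAT solver would retain, and the repair problem (constants `a`, `b`, `c` in `a*x + b*x + c`, a student answer of `(3, 1, 0)`, and the reference `2*x`) is invented for the example.

```python
from itertools import product


def incremental_linear_search(candidates, is_valid, cost):
    """Find a minimum-cost valid candidate by repeatedly tightening cost < bound."""
    # Expensive first round: everything learned here is kept for later rounds.
    pool = [c for c in candidates if is_valid(c)]
    best = None
    while pool:
        best = pool[0]                                # any solution under the bound
        bound = cost(best)
        pool = [c for c in pool if cost(c) < bound]   # add the constraint cost < bound
    return best                                       # None if nothing was valid


student = (3, 1, 0)  # hypothetical student constants in a*x + b*x + c


def is_valid(cand):
    a, b, c = cand
    # Bounded check against the hypothetical reference 2*x.
    return all(a * x + b * x + c == 2 * x for x in range(-5, 6))


def cost(cand):
    # Number of constants that differ from what the student wrote.
    return sum(1 for s, t in zip(student, cand) if s != t)


fix = incremental_linear_search(product(range(4), repeat=3), is_valid, cost)
print(fix, cost(fix))
```

Here the first solution found changes two constants, and the tightened query then finds the one-change fix `(1, 1, 0)`. Going from high to low means every intermediate query is satisfiable except the last one, which is the property the talk relies on for reusing solver state.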
And this graph shows how much time on average our system took to generate feedback, when it could generate feedback. And typically on average it took about ten seconds. And when you translate that to a hundred thousand students on an Amazon cluster, it comes out to be about $14. It used to be $1400 and the edX team was quite worried, but now it's quite reasonable. Okay. >>: [inaudible] go down? >> Rishabh Singh: Oh, just the optimizations, yes. So it used to take two minutes per problem. Now it's taking ten seconds, yeah. Just the optimizations. >>: What was the reason for that, that linear search? >> Rishabh Singh: Oh, so the linear search was also one, yeah, but also a lot of optimizations in the encoding going to SAT before. So lots of details there. Some of them are in the VM type, but yeah. >>: So what are you showing here in terms of these bars? >> Rishabh Singh: Oh, so these are the average amount of time it took per problem to give feedback. Yeah. >>: Does that correlate with the number of lines of code or -- >> Rishabh Singh: That's a good point, right. So there's no clear correlation, but at least this problem is the biggest problem. It has 60 lines of Python code. But some of the bigger problems also take less time. So the complexity measure is not just the lines of code but also how comprehensive the error model is. And in some cases it's more comprehensive than for other problems. Yeah. >>: Do you have some data on, for each problem that you are trying to correct, how many possibilities did it generate, you know, by adding that error model you were generating a family of programs? >> Rishabh Singh: Right, right. >>: So what was the size of that family? >> Rishabh Singh: Yeah. So I was telling you ten to the power 15 was the basic thing, and then it would just grow. So ten to the 15 comes from ten lines, and every line has 15 choices. So you have X's go to [indiscernible] and you have for X you can do five transformations. Y you can do five transformations.
It let's you also have an operator. So that way you get 15, 20 per line, and you have ten lines, so -- but, yeah. So I think ten, about ten, 15 is the approximate. We didn't count it, but that's the size we are looking at. >>: I got the impression when you were describing the application of the error model that you construct explicitly the fixed point of applying all those rewrite operations, but I guess you're not doing that. So you somehow are trying to encode that directly ->> Rishabh Singh: You are just representing syntactically -- yeah, you are just -[indiscernible] is just syntactically representation of all choices, but yeah, but when you go to the solver, it has to explore all choices, right. >>: So you literally do explore all ten of the 15 ->> Rishabh Singh: >>: Yeah. -- choices, right? >> Rishabh Singh: Using the solver, yeah. >>: So what was the [indiscernible] fixes for this, like three fixes or four or ->> Rishabh Singh: Right. So I don't have that graph. It's in the paper, but the average -- the biggest class was one and two. There were few with three and four, as well. But the biggest class had one and two changes. >>: [inaudible] then the search space can become quite small. >> Rishabh Singh: >>: That's a good point, right. And then -- >> Rishabh Singh: assumption, yeah. So if you start with that >>: So these are the final submitted programs? I mean, how does -- >> Rishabh Singh: This is everything, yeah. this is even intermediary attempts, as well. >>: So Intermediate attempts as well. >> Rishabh Singh: For classrooms, they are the final, yeah. For the edX class, they are everything. >>: So if somebody just has a totally garbage program, you could find and fix this to get it to be a working program? >> Rishabh Singh: yeah. >>: Okay. >> Rishabh Singh: I'll actually get into that, But, yeah. >>: And are you going to talk about the number of times you couldn't find it? >> Rishabh Singh: Yeah, yeah, yeah. After this, yeah. 
So in total we ran it over 13,000 submissions and it was able to give feedback about 64 percent of the time. So 36 percent of the time it couldn't give feedback. Let's go into more detail. So for every problem this graph is showing the percentage of times the system was able to give feedback. For some problems it did really well. It gave feedback 80, 90 percent of the time. For some it did okay, about 50 percent, and for some it didn't do that well, 30 to 40 percent of the time. >>: Does that mean you weren't able to find a rewrite? >> Rishabh Singh: Yes, in the system. Not able to fix the program, right. So we wanted to also see what happened -- >>: You mean it didn't have the right set of rules for fixing the errors? >> Rishabh Singh: Yeah, exactly. So we wanted to see what happened, yeah. So one thing we were doing was this was all manual, so we would use the error model, run it on the system, and then check what happened and what could we add to the error model to fix them. So it's a very manual process, but the reason we couldn't go much farther than that for some problems was that this was the first offering of the class, the edX class, and people were not used to it, so many times they would just submit the empty body. They just wanted to see what the system is doing. So a large percentage was just empty solutions, and there's no way you can rewrite that to make it correct. There were a few where students were using features which they were not supposed to use, outside the scope we were handling. Things like classes. And some of them timed out, but there were very few. The larger one, the compute balance one. >>: What was the timeout value you were using? >> Rishabh Singh: So this was five minutes in our case. >>: So a question about the rewrite. So this was kind of line by line, is it -- so the system cannot insert new lines into the program? >> Rishabh Singh: That's a very good point.
Yeah, so in general it can do it, but if you -- so it's actually Turing complete to rewrite, because you can write a program fragment. So for some problems we actually did write rules which could introduce a statement. For example, students would try to swap two values inside sorting. They would say a[i] is equal to a[j], a[j] is equal to a[i], and the only way to fix it was to introduce a temp and then -- so you could actually write specific things, but in general we don't do it because then the space becomes too big. So if you know the problem, then you can do it. The most interesting class was this one, where we found the student actually tried to do something reasonable, but the system still couldn't fix it. We call them big conceptual errors. For example, one class of error was misunderstanding the Python API. This function asks to take a polynomial and a value and evaluate the polynomial on that given value. This works really well. It's completely fine. The only problem is students were using this function index, which is supposed to take a value and a list and return the index of the value in the list, and it works fine if the values in your list are distinct, but as soon as you have duplicates, it would always return the first occurrence of the value. It would never return the second or third occurrence. And it just so happened that, yeah, since our system is doing exhaustive checking, there's no way to fix this program because there does not exist any API function that would do the right thing. So there's no way we can rewrite this program. Similarly, in another class, students were misunderstanding the problem statement. Even though the problem is asking them to do one thing, they were doing something else. And then again it just so happens that you can't make it functionally equivalent to what the teacher had written. But we have some other techniques to give feedback on such assignments now. We're building on top of it.
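The `index` pitfall can be reproduced in a few lines of Python. The student code below is a hypothetical reconstruction (the slide is not in the transcript), and it assumes the polynomial is given as a list of coefficients in which position `i` holds the coefficient of `x**i`.

```python
def eval_poly_buggy(poly, x):
    """Hypothetical student attempt: recovers the exponent with list.index."""
    total = 0
    for coef in poly:
        # list.index returns the position of the FIRST matching value, so a
        # repeated coefficient always gets the exponent of its first occurrence.
        total += coef * x ** poly.index(coef)
    return total


def eval_poly(poly, x):
    """Reference-style version: enumerate carries the true position."""
    return sum(coef * x ** i for i, coef in enumerate(poly))


print(eval_poly_buggy([1, 2, 3], 2), eval_poly([1, 2, 3], 2))  # agree: distinct values
print(eval_poly_buggy([1, 2, 1], 2), eval_poly([1, 2, 1], 2))  # differ: 1 is repeated
```

For `[1, 2, 1]` at `x = 2` the buggy version computes `1 + 4 + 1` instead of `1 + 4 + 4`. No local replacement of the `index` call repairs this, which matches the speaker's point that the fix requires restructuring the loop.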
So now let me tell you this system we built for Python because that was the data we had both from classroom and edX. We also had a small waiting of it for C# because we were able to get some data from Nicoli and Pele from Pex for Fun, but it's not as robust as Python one. But after the system there were a lot of universities who wanted to use the system. The only problem was everybody was teaching the course in different language. But that was good again because it forced us to come up with an architecture that becomes -- it becomes easier for us to add new languages on tip of it, and we are using it for building a system for a language called JSIM, which is a hardware description language at MIT to give feedback, and also there's a startup, which is no more a startup. It has many people now. About 300. But still, they're building on top of our system to make similar grading system for C and Java. So their motive is something different, but hopefully we'll use it for grading as well. So last semester we also used the system for classrooms. We gave it to TAs and asked them if you want, you can run it on your programs, and it did pretty well for -- when we create assignments. Almost about 90 percent of the case the system was able to give feedback. In one of the cases TA emailed us as well that there was some certain problem system found that she didn't realize it was a common mistake and she didn't -- went in the classroom next day and told the class about it. There was also a problem where system didn't do that well, and the reason was just because this problem was supposed to teach students about performance and it didn't matter what input you give to the program, it would always take 16,000 loop iterations, and it was also working with floating point numbers, which is hard for solvers to do in the analysis, but now we are combining dynamic analysis plus -- with purely symbolic. 
We call it concolic synthesis to handle such cases to try to add -- to be a little incomplete, but still be able to give some feedback. The edX team is actually finally now very excited, as well, and they are currently designing a study based on different feedback levels and also when to show the hints. So their two main questions, when do you want to show the hints and what level do you want to show the hints. And depending on many parameters, we are going to figure out that during this study. So some of the takeaways from the system was that we can actually solve grading or feedback generation as an efficiently as a synthesis problem, and this idea that we can just write a compiler to get a synthesis system for a new language was quite interesting. Even though it took a lot of work, but really, we're designing a common platform, so hopefully in future it won't take that much work to come up with synthesis system for a new language. And finally, even though in this case specification was complete -- was complete, but it was easy for teachers to write specification. They didn't have to learn a new language, and it was doing this for classroom. They were writing a reference implementation. That's the only thing they had to do. So this system specification was complete, but now let me go to the next system where specification is going to be very, very incomplete. Yeah. >>: Shouldn't students try to treat this system, I mean, to just input that make the system believe it's correct but it's not? That's got to have that right. >> Rishabh Singh: That's a good point, right. So using certain just have if and if an L statement and say if this is input, return this. Right, right. Since we are only checking ->>: [inaudible] exploiting some implementation of the tool, right? >>: Why would the student do that? >>: The student ->>: It may be easier for the student to just solve it right now rather than try -[inaudible]. 
>>: Sometimes it's common knowledge, oh, if you take [inaudible] -- the system is going to believe -- >> Rishabh Singh: Yeah, that's actually a good point. Yeah, so the thing is -- >>: There is not a very easy way to fool it and [inaudible]. >> Rishabh Singh: Yeah. The thing is we try to be sound in the sense that only if the system solves the problem do we give any feedback. Otherwise, we give up and we fall back to testing. It's not very one way, but students still have to go through the test cases of whatever the teacher has written. So sometimes. But yeah, so I was also talking to the edX team and they were saying that if students are trying to fool the system, it's their loss. It's not so much our loss. They are robbing themselves of learning opportunities, but yeah. >>: [inaudible] it's kind of related, but asking it, doing it from the teacher's side, so I know that the teacher probably needs to write the reference solution anyway, but can you just give some examples where he -- >> Rishabh Singh: Yeah, exactly right. So the thing is, with the system I showed you, Sketch, it doesn't always have to be a reference implementation. You can use assertions, yeah. It's just that it's going to be less complete than having a complete solution, but if your assertions can cover all possible cases, then, yeah, it can support that. >>: It has a set of -- small set of test cases, right? >> Rishabh Singh: Right. So -- >>: Then you couldn't have a solution -- >> Rishabh Singh: Those could be assertions as well, right. It's just that we were trying to explore all possible inputs up to a bounded space, yeah. >>: Can I ask you a question? >> Rishabh Singh: Yeah. >>: Why is the classical method of getting feedback not good enough? I think you started by talking about it and I sort of lost track of it. >> Rishabh Singh: Yeah, yeah, yeah.
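To make the difference between a small test suite and bounded exhaustive checking concrete, here is a minimal Python sketch. The reference function, the hard-coded submission, the test cases, and the bound are all invented for illustration and are not taken from the course.

```python
def reference(n):
    """Hypothetical teacher reference implementation."""
    return n * n


def hard_coded(n):
    """Hypothetical submission that special-cases the published tests."""
    if n == 2:
        return 4
    if n == 3:
        return 9
    return 0


def passes_tests(program, tests):
    """Classical grading: compare outputs on a fixed list of (input, expected)."""
    return all(program(x) == expected for x, expected in tests)


def first_counterexample(program, spec, bound):
    """Bounded check: compare against the spec on every input in [-bound, bound]."""
    for x in range(-bound, bound + 1):
        if program(x) != spec(x):
            return x
    return None


tests = [(2, 4), (3, 9)]
print(passes_tests(hard_coded, tests))
print(first_counterexample(hard_coded, reference, 8))
```

The hard-coded submission passes both tests, but the bounded comparison against the reference finds a disagreeing input immediately. The check is still only as strong as the bound, which is the caveat the speaker attaches to it.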
>>: We know that classical method is there are some inputs on which I know the expected output and I run my program on that input, and if it doesn't give it, then I step through it. >> Rishabh Singh: Yeah, yeah, yeah. Yeah, that's a good point. And some people actually do it, learn that way. I also learned that way. Many people do that. But at least from the data we were seeing on edX and also in classrooms with some students, they're taking this class as the first class in their programming career, whatever, and it's not something -- they're also not in computer science student. They just want to know programming. And it's something I realize that it's a skill that has to be taught. You can't just tell students that this is a test case, go take a pen and pencil and figure out what's happening. So this is something that has to be taught. And this system is just one way to accelerate that thing. Maybe teacher can use it or maybe student can themselves learn stepping through using this more guided feedback. >>: [inaudible] assertion like -- >> Rishabh Singh: Yeah, so that is a very open question right now. There hasn't been any usability study in that sense. So we've only used it for grading in classroom, but nothing for teaching purposes. >>: No, it just would be interesting to see, you know, [inaudible] that same question of like, you know, like, I mean, you can unfold this in multiple ways. You could say like a teacher could spend X amount of time writing rules or, you know, generating rules, or maybe they write more test cases, right? And then you could kind of compare what's the student's reaction and, you know, the amount of pain or, you know ->> Rishabh Singh: Right, right. >>: -- how much success they face under those different ->> Rishabh Singh: Right, so the AP test study I was telling about before on edX, it's supposed to do that. There's going to be a basic case which just gives test cases. 
There's going to be different feedback levels, and you can't measure directly because you don't have access to students, but you can see retention rate, did they solve the problem, did they solve more problems. So those things you can measure. >>: Well, I think you can't just test against the original set of test cases because there is more effort from the teachers going into writing the rules. >> Rishabh Singh: Oh, so actually, I will tell you about how you can automatically do that. So it was just because we were building the system in the beginning, so we had to go through rules, but these rules can be learned automatically, yeah. >>: So [inaudible] you're doing relative to folks that have pre-distinctions error. >> Rishabh Singh: That's a good point. So the question is how do you relate to fault localization. So I think that these are complementary techniques. Right now we're not doing any kind of fault localization, but if you were able to do fault localization, you can be much more precise when you apply the error correction rules. Right now we are applying them everywhere in the program, but it would scale better if you can actually localize the fault and only apply them there. And typically -- so there has been some work on fault localization for program repair, but they typically do it for just one test case at a time because it's hard to consider all test cases together, and that's what synthesis is doing for you. It's considering all cases, whereas in testing people typically do it for a single input, yeah. But, yeah, those are complementary techniques. >>: You see them -- [inaudible]. Do you provide the teacher with kind of a probabilistic view of where the error -- where the error might [inaudible] or [inaudible] feedback from applying correction a bunch of access [inaudible]. So you mention -- >> Rishabh Singh: Oh, right, right. So some sort of visualization afterwards when you have done all the analysis, yeah.
Not yet, actually, but that's also an interesting thing to do. We are doing it more at the before level but not so much at the after level, but yeah, you could do some analysis afterwards to aggregate common mistakes, right, right. Yeah. >>: So how much is left to do in this space? >> Rishabh Singh: Oh, actually, a lot. I'll talk to you late -- about that. So one thing is just scalability, I think right now it only scales to about 30 to 40 lengths of Python code, but if you want to go to next step, the scalability is one, but also a lot of teachers were expressing concern that you want to give feedback on not just correctness but also style, the design of the program, the modularity, all the other aspects. Things like how people decompose the problem, how they name the variables. So all these other things. So that's one thing to do here. And also the other thing is just doing this learning that when you want to show how does it help students to learn programming and in a teaching setting, not so much in just analysis setting. So let me quickly go, because otherwise we might run over time. So this is the second system, FlashFill. And the motivation of the system came from we looked at lots of help forums for Excel end users, so these are people who don't know programming but struggle with getting tasks done. And what they typically do, they pose the problem on a help forum saying that this is my data and I want it to look like this. Can somebody help me. So some expert goes in and says, okay, yes, I see what you're doing. Try this formula. It will work. So these people go back, place this formula in the spreadsheet, and they figure out it works mostly, but in some case it doesn't work, and they give a new input, and then the expert says, okay, now I see what's happening, so this is what you wanted. And you go back, and after some iterations, the people are happy. They say thanks a lot. And this was the process we wanted to automate in the system. 
Instead of taking days or weeks, we want to do it in seconds. For example, we can use the same system now, give the same data and the system will learn the program for them without having to go to a forum. So this system was started with Sumit, by Sumit here on string transformations, and I soon joined the project and worked on extending the system to handle more sophisticated things that Excel wanted us to do, and also work on the program of ranking, which I'll go into later. But that was also pretty important thing to do. And then we worked on this problem and together with Shobana [phonetic], Sumit, Dany, and Ben and also people from the Excel team, we were able to work with all the other aspects of the system, and a part of the string system was shipped in Excel 2013. Oops again. So, and then we extended the system also to not just handle strings but also to be able to learn lookup transformations from examples, joins and lookups, and also for number transformations. So let me show you some examples of what the system can do. I think you guys probably pretty much know what it does, so let me tell some of the interesting things. For example, let's say you have data in your spreadsheet and you want to get certain amount of it, and as you can imagine, there are going to be many, many possible regular expressions to get city out of this address, and there are many, but we can just give one example, and the system learns such an expression that is likely going to be the one which user had in mind. So this where ranking also coming in, that if there are multiple regular expressions, for example, one regular expression would be this is the sixth string starting with a capital letter followed by second comma. Something like that. But it's still, it's going to rank them and pick the one which is going to be more likely. The one which I use it for is this. 
So let's imagine you are -- oops -- let's imagine you are traveling for faculty [indiscernible] and you have a paper deadline, and this is the data that my system produces and you want to put it in paper and latic [phonetic] format. So I just write one example and let the system figure out, and it basically gives me a table that I can go and paste in my Emacs. The nice thing here is that if I want to change, swap the columns, I don't have to go over it line by line, so I can just say make it two and make it four and I can get a new table with columns that swap here. So this is the thing I use it for. So let me tell you the other thing which ->>: What happens [inaudible] -- >> Rishabh Singh: So I just updated the example, the first example. So it just learned that I want four to be the second column and two to be the third column. So it learned a different program and ran it on the spreadsheet. So this is the system for learning lookup tables. This is something which is now with the Excel team, but hopefully it will part of some future release, but the idea here was let's say there was a shopkeeper who had two tables, one that mapped how much profit should the person make on each item, and what's the cost at which the shop keeper purchased this item. And -- oops, it's not showing everything. So the idea was given just the item name and the date, compute the price for the system. And the challenge here is that just given stroller, I can't compute the price because, first of all, I have to find the item ID. Then do a join of these two tables, then get the cost. And similarly, after getting all the cost and markup, I have to do some string transformations. I have to replace, remove percent sign and have to add plus and multiplication, semicolon, things like that. But in this system, we can just give one example and let the system figure out the program and it learns the program to do appropriate joins, lookups, and string transformations and do the task for us. 
>>: Who [indiscernible] the sciences of the action programs that you ->> Rishabh Singh: That's a good pointed, yeah. So let me show you the program that we learned here. So actually, this is the program that was learned in this case. It said basically concatenate few expressions, and for each one of them you have some lookups or some regular expressions. So it looks quite complicated just because it was not meant to be readable. It was more meant to be easier to learn these programs. So let me show you some of the key ideas. The main -- there are two main ideas in actually all of the systems we built. First idea is how do you design your logic, the domain specific language such that you can divide the task of learning into independent subtasks. So that's the first key idea, how much independence can you have. And the second key idea is how do you rank them once you have so many hypotheses, which one do you prefer. For example, this is the output I showed you on the demo. This is the string somebody wanted. So let's see what kind of independence can we get if you want to learn this output. So we can say that we can chunk this string into many different small strings, and we can look at one substring at a time. We can say we will have some structure for each index in the string and we can look at one chunk at a time. We can say all expressions to learn this output are going to be from zero to seven the edge. I'm going to put all expressions there, but they're going to be independent from the programs that I'm going to learn for the other substring. So you get the substring level independence. For every substring, you will have different programs, but the programs are independent of each other. You don't have to worry about how I computed this to compute program for other substring. So that's one subindependence. This, again, independence is going to come from the lookup transformation language we developed. 
For example, let's say I have these two tables and I want to do a join to get the price of an item. I'm not going to do it the SQL way where I'll take the cross product and do a projection, because that's not learnable. What's efficiently learnable is a nested join, where the idea is you select price from this table such that the item ID is equal to a selection of item ID from the first table. So this way you're doing a nested join, and the nice thing about doing this is that this subexpression now becomes independent of the top-level expression. So the top-level expression only needs to know that I need to get an item ID. I don't care how you compute it. Even though you have a thousand ways to compute it, I'm not going to care about that at the top level. So this way you again get independence of subexpressions. And finally, the third independence comes from learning substring programs. For example, I want to remove the dollar sign from the cost I got from the table, and the idea is that we're going to learn a program to get one and seven, the left and right index, but they're, again, going to be independent of each other, so it doesn't matter how I compute one, it's independent of how I compute seven. So the regular expressions I learn to get one are going to be independent of the regular expressions to get seven. So in this way we have all this independence, and this lets us again learn a huge space of programs, so if you do a rough calculation, it comes to ten to the power 20 different programs. You can represent them in polynomial space as well as learn them in polynomial time. So that's the good thing you get from such independence. If you know your domain well and you design a language, then you can do this much better. But the problem is once you have so many programs, which one do you show back to the user, because there's too many. For example, let's say this was a task somebody wanted to do. I want to add "Mr." in front of all names and I want to have the first name.
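The nested select and the independent substring positions described above can be sketched in Python. The table contents, column names, and the `select` helper are hypothetical; they only illustrate the shape of the expressions, not the actual lookup language.

```python
# Hypothetical tables in the spirit of the shopkeeper example.
ITEMS = [
    {"name": "Stroller", "item_id": "S30"},
    {"name": "Bib", "item_id": "B56"},
]
COSTS = [
    {"item_id": "S30", "price": "$145.67"},
    {"item_id": "B56", "price": "$3.56"},
]


def select(table, column, **conditions):
    """Return `column` from the first row whose fields match `conditions`."""
    for row in table:
        if all(row[key] == value for key, value in conditions.items()):
            return row[column]
    raise KeyError(conditions)


# Nested select: the outer expression only needs *an* item ID and does not
# care which of the many possible inner expressions produced it.
item_id = select(ITEMS, "item_id", name="Stroller")
price = select(COSTS, "price", item_id=item_id)

# Substring step: the left index (1) and right index (7) are chosen
# independently of each other.
left, right = 1, 7
print(price, price[left:right])
```

Because the inner expression can be swapped for any other expression that yields the same item ID, the learner can store the alternatives for each position separately. The number of programs represented is then the product of the per-position counts, while the storage grows only with their sum.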
So I give one example. It goes to Mr. Rick, and the thing is if I learn the wrong program, if I -- because "R" can come from many different places. "R" can be a constant, but it can also come from the first name, Rick, or the second name, Rashid. So if it learns one of those programs, we might actually get something like this, which is not good, right? So, yeah. So the thing is, we have to make a choice. Whenever we have many different options, we have to make a choice. For example, let's say I'm going to make the choice that since "R" is such a small string, I'm always going to make it a constant. So let's say that's a choice we make. But then in future we get somebody else doing this, and again in this case, "S" is a small substring, but I can't make it a constant. It has to come from the input. So that's where the challenge comes in: when you have many choices, in some contexts you're going to prefer one over the other. And also you're going to have many regular expressions, as well, to get the last name. For example, it could be second word, last word, all the other thousand different regular expressions. And as you can imagine, if you just do second word, that's not going to be appropriate for this task because if you have a person with a middle name, it would get the middle name. So in this case actually the more preferred one is last word. So the question is how do you solve this problem when you have so many choices, what do you do. So this is that quote we got from Spiderman, which said: With great power comes great responsibility. So you have made the language so expressive, there's so many choices. Now you have to do the right thing. And that's where we use machine learning. So the idea is we're going to divide the task into two phases, training and test. So in the training phase, we'll have a bunch of benchmarks, and for every benchmark, I'll give lots of examples.
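The ambiguity between "second word" and "last word" is easy to see in a small Python sketch. The two candidate programs and the extra input with a middle initial are assumptions for illustration; the real system represents far more candidates than this.

```python
def second_word(s):
    return s.split()[1]


def last_word(s):
    return s.split()[-1]


CANDIDATES = {"second word": second_word, "last word": last_word}


def consistent(examples):
    """Names of the candidate programs that reproduce every given example."""
    return [
        name
        for name, program in CANDIDATES.items()
        if all(program(inp) == out for inp, out in examples)
    ]


one_example = [("Rick Rashid", "Rashid")]
print(consistent(one_example))

# A hypothetical input with a middle initial separates the two programs.
new_input = "Alice B. Smith"
print({name: CANDIDATES[name](new_input) for name in consistent(one_example)})
```

Both programs agree with the single example, so the example alone cannot pick between them. Something else, such as a ranking function, has to prefer "last word" before the user ever sees a name with a middle initial.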
So in the training phase I don't care how many examples I give. I give as many examples as possible to make the task unambiguous. So in this case, I would learn programs for "R" and some of them are going to be good, some of them are going to be bad. The good ones I'm going to say are positive. The bad ones are going to be negative. And the goal is to learn a ranking function f such that in future, when you see the same task and you learn the same programs, it should rank the positive ones higher than the negative ones. So that's the task. Give me a ranking function such that in future I would prefer positive over negative programs. And this is actually a task which people typically do in search. There's a whole field called learning to rank, and the only difference is that typically people solve the problem where I want all good results to come before all bad results. But in this case, we say any good result can come before all bad results. I'm happy with it. So the interesting thing here is how do you come up with a loss function to optimize. So here we are saying give me the rank of the highest negative program and give me the rank of the highest positive program. I want this to be negative, and since this is the loss function, it's going to be negative of that. So this is the function we optimize to learn such a function. But this function is highly discontinuous, not even differentiable, so we smooth it a little bit, and this is what we use for regression to try to learn this function. So here are some of the results showing -- >>: How do you learn a function like that? What's the technique you use? >> Rishabh Singh: Yes. So the idea is for every expression you're going to have features. Features are going to be there. So the function, we are going to assume, is a linear combination of features. So the challenge is how do you learn the coefficients of the function, and for that you do a regression over your training set.
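One way to read this description is as a loss of the form "best negative score minus best positive score" over a linear scoring function, with the max smoothed so it can be optimized. The Python sketch below follows that reading; it is an assumption about the general shape, not the loss from the paper. The feature vectors, the log-sum-exp smoothing, and the plain gradient descent with numeric gradients are all choices made for the example.

```python
import math


def score(w, feats):
    """Linear ranking function: weighted sum of a program's features."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def soft_max(values, t=1.0):
    """Smooth, differentiable stand-in for max(values) (log-sum-exp)."""
    m = max(values)
    return m + t * math.log(sum(math.exp((v - m) / t) for v in values))


def loss(w, positives, negatives):
    """Low when at least one positive program outscores the negatives."""
    best_pos = soft_max([score(w, f) for f in positives])
    best_neg = soft_max([score(w, f) for f in negatives])
    return best_neg - best_pos


def train(tasks, dim, steps=200, lr=0.1, eps=1e-4):
    """Gradient descent on the summed loss, using central-difference gradients."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = []
        for i in range(dim):
            hi = w[:i] + [w[i] + eps] + w[i + 1:]
            lo = w[:i] + [w[i] - eps] + w[i + 1:]
            diff = sum(loss(hi, p, n) - loss(lo, p, n) for p, n in tasks)
            grad.append(diff / (2 * eps))
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical features: [is_constant, uses_last_word, uses_second_word].
tasks = [
    ([[0, 1, 0]], [[1, 0, 0], [0, 0, 1]]),  # (positive programs, negative programs)
    ([[0, 1, 0]], [[1, 0, 0]]),
]
weights = train(tasks, dim=3)
print(score(weights, [0, 1, 0]) > score(weights, [0, 0, 1]))
```

Only the best positive program enters the loss, which reflects the speaker's point that any one good program ranked above all the bad ones is enough. A real system would add regularization and far richer features; this toy loss is unbounded below and is only stopped by the fixed step count.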
You say in training set I want to optimize this function, so give me a -- learn a function for that. >>: [inaudible] this length -- >> Rishabh Singh: Yeah, so that's a good point. So I showed you three levels of independence. So you have different features at different independence levels. So some of the features are going to be the length of the substring, the relative length, the context, what happens to the left-hand side, right-hand side. So you have different features at different levels. >>: Are the features that these -- I mean, a list of all, you know, one possible set of features could be every possible expression that you could use in a program, but of course, that would be too large -- >> Rishabh Singh: Yeah, so we use frequencies, yeah. So, yeah, so it's -- >>: [inaudible] the most frequent -- >> Rishabh Singh: Yeah, most frequent, less frequent, yeah, right, right. So we do some abstractions, right, right, right. Yeah. >>: The one question I have is in the [indiscernible] the methodology was supposed to be in the writing this program simple. >> Rishabh Singh: Yeah. >>: But now, you know, if you actually have to do all these feature design and give all these examples, that's kind of taking [inaudible] from the simplicity of writing -- >> Rishabh Singh: So the thing is this task is only supposed to be done by us, not by users. So users are still going to use the system as if they're just providing examples. Somebody said that developer has to do a little more work now to make the system more robust and more -- something that could try to read what users have in mind. >>: Sometimes, you know, the [indiscernible] developer, right? You had a task to do, you have, you know, Rick Rashid, Satya Nadella and you want to do this, right? >>: The developer was the person who implemented the system, yeah. >> Rishabh Singh: Yeah. Who implemented the system. >>: The developer who implemented FlashFill feature.
>>: So to be able to have this kind of modules, your learning modules for various domains and one for, you know -- >> Rishabh Singh: Yeah, for exactly that. So the hope is -- so the thing is that the learning has to be done once, but in the test phase you just use the function. So all this is done offline. Once you have the function, you just -- when you're learning the program, you apply the function and you rank them. So when you're running the system, there's no overhead. Very minimal overhead. But yeah, it would be done offline. So let's say at Microsoft somebody would do this. >>: But I think the question is that the learn ranking -- right? -- might be different depending on the domain -- >> Rishabh Singh: Oh, yes, yes. So -- >>: -- working. >>: So every task you have to learn this, then it's too much of an [inaudible], but you're saying you don't have to learn it for every, you know, every, you know, task that you do. >> Rishabh Singh: Yeah, yeah. You have to learn it over a set of tasks. >>: Domains. >> Rishabh Singh: Yeah. >>: [inaudible] what is it that identifies a domain? >> Rishabh Singh: So right now we are doing it just for strings, string transformations. So it would say any kind of string transformation. >>: [inaudible] you know, you can have names and addresses are very different types of strings. The type of features that you have for names might be very different from that. >> Rishabh Singh: Yeah, exactly. So, yeah. So right now we're not modeling any semantic information in this work, yeah. This was just for learning regular expressions, yeah. But that's a good point, that if you know what your data is, you can do a much better job at ranking. >>: [inaudible] Excel logs and stuff like that. >> Rishabh Singh: Excel? >>: You exploit the [indiscernible] the features [inaudible] you exploit like [indiscernible]. >>: Excel user -- >> Rishabh Singh: Data. Yeah. >>: -- data.
>> Rishabh Singh: So right now we just get it from -- so people have uploaded videos on YouTube, so we try to get as much data from there and also from blogs, so people writing lots of blogs. >>: [inaudible]. >> Rishabh Singh: So for this we use 300 benchmarks, yeah, so like 300. So actually, yeah, these are the results, actually. Or maybe 170. Something like that. So we use 50 programs for training and there's 120 for testing. And this graph is showing three different approaches to rank. The blue one is the one which was the very first thing we came up with, which was Occam's razor, which said learn the simplest program that you can learn and that's going to correspond to what the user has in mind. The orange one is the one on which we spent almost four to five months working with the Excel team. They would give us new benchmarks and we would tweak the parameters. They would give us more benchmarks. We would manually do it. As you can see, it did pretty well compared to Occam's razor. So for most benchmarks, about 50 it could do with one example. Quite a few would do even more. So it was not that bad, but even a very simple machine learning based technique we found actually did much, much better. So for about 80 percent of the benchmarks it was able to learn from one example after learning the ranking function. So let me tell -- try to put it back into -- oh, you had a question? Oh, thanks. So in this case our specification mechanism is going to be input-output examples. The hypothesis space is the interesting thing, how you design your domain-specific language to have independence. And finally, the algorithm is going to learn all programs in the language in polynomial time and then rank them. And the biggest difference from all the -- there has been a lot of work in [indiscernible] text by example, but the biggest difference has been a language-based approach where you have the completeness guarantee as well as a much richer language for this particular task.
And actually, so as I was saying, the string system was shipped in Excel 2013 and there's been, at least initially, encouraging response. A lot of people writing good things about it. And currently we are now working on actually, as you were mentioning, trying to model more semantic knowledge. So if you know something is a date, if you know something is a name, something is an address, you can do much better, so semantic knowledge, as well as making the system more probabilistic. So right now if you make a mistake in giving an input, it would just say I can't learn a program for you. But can you tolerate some noise and make the synthesis algorithm a little more probabilistic. So some of the takeaways here are that partial specifications like input-output examples can be really useful for end users. And the main takeaway was if you know your domain well enough, you can design a language which can be both expressive and learnable. And the interesting idea finally was we didn't use machine learning to learn these programs, but we actually used it to bootstrap the synthesis algorithm, which was also interesting. So now let me tell you very quickly some of the things I want to do in future. I'm very excited about using synthesis for education, end users, and programmers, and both from the theory and practice side. From the theoretical side, there's been a lot of work in inductive inference and machine learning going back to Gold's and Angluin's work. And the thing was they have done it for a very small subset of languages, like regular languages and context-free languages, but here we are talking about a much richer class of languages. So I want to see what kind of relationship exists there, and can we actually exploit some of the things they have learned in the past, and also the other way, how it corresponds to their work.
From the practical side, I just showed you a very small part of Excel, but there are many more things that could be automated, the repetitive tasks users have to do, and not just in Excel. Actually, there are many things which I do routinely in PowerPoint. Even actually for this one, if I have to change fonts, I have to change margins, there's something called page -- I forgot. Something where you can set for one master slide, I guess. >>: [inaudible]. >> Rishabh Singh: But I still can't use it because it's too complicated. So is there a natural way for us to make a system where you do one or two changes and let the system figure out the changes for you. Similarly for Word documents, if you want to do more complex searches, you still have to learn regular expressions. So can you automate that. Finally, also we are getting new devices now, things like almost everybody has a smartphone now. People are saying in five years everybody would have a robot. So what is the more natural way for us to specify tasks to these systems where people don't need to know programming to get the task done, simple tasks. For example, let's say I want my robot to clean my room, bedroom, but the thing is, everybody's bedroom is going to be different and everybody's definition of "cleaning" is also going to be different. So how do you express that intent and figure out a program or something for the system to do it for you. So that's very interesting. For programmers, I've been looking at trying to come up with a language where such synthesis constructs would be first class, where you can let programmers -- a seamless language where you can let programmers write examples, maybe for some tasks writing test cases is better. For some tasks, writing a complete spec is better. So all these different ways to specify your intent and have a very efficient runtime system to actually compile it and also efficiently run the whole program when you have all different classes of specifications.
And finally, for education, one thing I'm really excited about is the data we're getting right now. So that's the only thing that has changed. People have been asking me what are MOOCs actually doing now that we couldn't do ten years back, and the interesting thing is we are able to capture so much data that we didn't have the resources for before, and that is something we could really use now. So some of the things I'm excited about and also doing a little bit of work on have been clustering solutions, where we use clustering techniques, more probabilistic techniques to cluster these assignments, to do power grading, to actually give feedback on alternative approaches to solve the same problem, and also for teachers to know, as you were saying, what's happening in the classroom. If I see a hundred thousand submissions and I have to go through each one of them, there's no way I can figure it out, but if you can give a bird's eye view of what's happening in the classroom, it's going to be much more useful. And finally, yeah, right now the error models we had to write by hand, but since in the data you have time-stamped data, you know what was run at time i and at time i plus one, so you can diff the two programs and see what changed. Many times students themselves fix the mistakes, so that way you can do a frequency measure of what are the common changes and make an automated error model from them. So finally, to conclude, let's divide the room into three groups. At the top we have all the good people who know programming. Then we have the students who are one order of magnitude more; and finally, we have end users who are even one more order of magnitude more in this world. And I've shown you a system, Autograder, which is aimed at helping students. Storyboard programming I didn't have time for, but it is also aimed at helping students learn data structures. And FlashFill was more for end users.
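The idea of diffing time-stamped submissions to find common fixes can be sketched as follows. This is only an illustration under assumptions: it compares submissions line by line with Python's standard `difflib`, treats each replaced line (or run of lines) as one candidate correction, and uses made-up student code. It is not how Autograder's error models are actually built.

```python
import difflib
from collections import Counter

def line_edits(before, after):
    """Return (old, new) pairs for every run of lines replaced between two versions."""
    a, b = before.splitlines(), after.splitlines()
    edits = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            old = "\n".join(line.strip() for line in a[i1:i2])
            new = "\n".join(line.strip() for line in b[j1:j2])
            edits.append((old, new))
    return edits

def common_fixes(histories):
    """Count edits between consecutive submissions (time i and i + 1) across students."""
    counts = Counter()
    for versions in histories:
        for before, after in zip(versions, versions[1:]):
            counts.update(line_edits(before, after))
    return counts

# Two made-up submission histories, oldest first.
student_a = [
    "for i in range(1, n):\n    total += xs[i]",
    "for i in range(0, n):\n    total += xs[i]",
]
student_b = [
    "s = 0\nfor i in range(1, n):\n    s += xs[i]",
    "s = 0\nfor i in range(0, n):\n    s += xs[i]",
]
fixes = common_fixes([student_a, student_b])
top_fix, top_count = fixes.most_common(1)[0]
```

Frequent `(old, new)` pairs are the candidates for correction rules; a real system would likely compare syntax trees rather than raw lines so that renamed variables do not hide the same fix.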
So I want to end my talk with the thought that a lot of work in our community in program analysis and verification is focused on programmers, which is great. We want to make their lives easier, but a lot of techniques actually equally apply to even these two broader classes of users, and we can potentially have even bigger impact than what we have had on programmers. And with this thought, I would like to end my talk. Thanks. [applause] >> Rishabh Singh: Yes. >>: So you mentioned that in the 1980s the specifications were much larger than the programs. Of course, there may be tricky programs -- >> Rishabh Singh: Yes. >>: -- that's why they [inaudible] some description, but in the '70s and '60s, before XD, they were learning some examples -- >> Rishabh Singh: Yes, actually, yeah. So I was talking to you -- >>: So do you think that now that we can do learning from examples much better is the time to return to synthesizing things from specifications and get, I mean, get full correctness? Can we do that much better now? >> Rishabh Singh: Hmm. Than before? >>: If the '80s was too early for a lot of those things, right? >> Rishabh Singh: Right. I think -- yeah, I think the problem was -- yeah, so definitely our theorem provers have gotten better, so a lot of work was done in the '80s on deductive synthesis. So the idea was given a program, can I take -- can I generate a proof for a condition and try to rewrite the proof to -- from one specification language to an implementation. So definitely our theorem provers have gotten better, our algorithms have gotten better, so we can scale to much bigger programs. But I think one issue there was -- the other issue was orthogonal, that it was hard for people to write a complete specification. I think that was a bigger issue in that sense, which I'm still not sure if people are willing to do that, but maybe if you have better ways to specify, maybe different specification languages, then maybe we could do that.
But, yeah, but I think there was the other issue, as well, how much programmers are willing to write these complete specs. >>: Just a question for my edification. What's the difference between deductive synthesis and inductive synthesis? >> Rishabh Singh: So in deductive you have everything with you, so you have a complete specification. So at any point of proof search, if you stop, you have some completeness guarantees. In inductive the idea is you don't have complete information. You start from a little information here and there. Some examples, let's say, or demonstrations. So that's more inductive. You try to generalize from small information. In deductive you have complete information. You're just transforming it into another language. >>: Do you have plan or thoughts to use like existing code, like, say a prerequisite and then you have programming in some form already there [indiscernible] and much better and pointed at code written by human which is probably already [indiscernible]. >> Rishabh Singh: Yeah. >>: Then more or less [indiscernible]. >> Rishabh Singh: That's a very good point, right. So that actually goes back to the point of data-driven synthesis where the idea is this, even the way I code right now, many times I just search on Google or Bing to find -- actually, it's a fact that Bing is better in searching code than Google. So you search for something and it gives you code. You copy/paste it, right? So can you actually build a system that goes over the net and figures out the code for you? So there's already been a little work done in the area. But, yeah, I think that's a very useful approach. And there's also a very big DARPA grant this year on the same topic. We have so much code data everywhere. How can you leverage that for synthesis. Also verification and other things. >>: [inaudible] in the case of FlashFill I wonder how much effort is put in actually writing the exact same properties all the time.
>> Rishabh Singh: It's a good point, right. >>: The [indiscernible] that kind of [indiscernible] code. >> Rishabh Singh: Yes. The thing about Excel, it's interesting, that typically formulas are not that big. So we can actually just do synthesis on it. But you can imagine if you really want to write a [indiscernible] script of 100 lines, then maybe this idea that you search over a repository would be better suited, but, yeah. >>: So when you're -- in FlashFill when you're ranking and [indiscernible] learning, right? A bunch of possible programs that match the set of examples you've given, any model that you built potentially could be wrong, right? >>: It will always be [indiscernible]. >> Rishabh Singh: Right. >>: So have you ever found -- how do you expose that to a user? The fact that your model is something and, yes, it's producing a ranking result, but you don't necessarily know that the user of that thing -- >> Rishabh Singh: Right. >>: If it's a good result. [inaudible]. >> Rishabh Singh: Yeah. So there's an -- there's also an interaction mechanism which was not shipped in Excel, so I think where Ben and Sumit have also worked on that, but the idea was you're essentially learning all the programs, even though you're ranking one, so you don't have to run just one program. You can run ten, top ten, top 100, and if you see the top two answers are very different from each other, you can highlight the cell. You can say that I'm not sure about this answer. There's the other answer, you want to pick it or not. But it was, unfortunately, not shipped in Excel. They thought it was too complicated. But, yeah. >>: I think that the issue is the same thing. If you are going to target the masses, they're, by definition, unsophisticated. >> Rishabh Singh: Right. >>: So either you get the job done right away or you don't bother them. [indiscernible] right?
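The interaction mechanism mentioned in this exchange (run the top few ranked programs and highlight a cell when their answers differ) might look roughly like the sketch below. The function name `flag_uncertain`, the two candidate programs, and the three-word input name are all hypothetical; this is an illustration of the idea, not the research prototype's code.

```python
def last_word(s):
    return s.split()[-1]

def second_word(s):
    return s.split()[1]

def flag_uncertain(ranked_programs, inputs, k=2):
    """Map each input whose top-k programs disagree to the distinct outputs, best first."""
    top = ranked_programs[:k]
    flagged = {}
    for s in inputs:
        outputs = []
        for program in top:
            out = program(s)
            if out not in outputs:
                outputs.append(out)
        if len(outputs) > 1:
            # These are the cells a UI could highlight and offer alternatives for.
            flagged[s] = outputs
    return flagged

# Programs are assumed to be already sorted by the ranking function.
ranked = [last_word, second_word]
flagged = flag_uncertain(ranked, ["Rick Rashid", "Ada M. Lovelace"])
```

Only the row where the two programs diverge is flagged, so a user who never looks at alternatives still sees the top-ranked answer everywhere else.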
>> Rishabh Singh: But it was funny that we saw videos on YouTube that said FlashFill doesn't work, because I gave an example and it doesn't work. Just because they had data in different formats, you're supposed to give multiple examples, but they don't even know you can give multiple examples. So, yeah. So they're not very sophisticated, yeah. >>: But if you think about it, you know, I mean, we're equivalent to the "I'm feeling lucky" button in Google, right? You type a search query and you get a bunch of results, and most people don't hit "I'm feeling lucky." You know, it would only show one result because, basically. So the question is how do you educate people to think about this a different way, which is to say there are multiple things. It's sometimes going to get it wrong, but if they think in those terms, you're going to converge. You're going to get a lot more feedback and you're going to converge on a really, really good solution over time. >> Rishabh Singh: Yes. >>: So I think, you know, if you think of how good Google has gotten, I mean, it didn't start as good as it is now, right? That's part of the story, and I think -- >> Rishabh Singh: Right. >>: -- this is a question of how you give people the -- >> Rishabh Singh: It's actually a good analogy, yeah, because do you prefer to see just one result or multiple results, and we have learned over the years to look at multiple results, maybe. >>: Filter. >> Rishabh Singh: And filter automatically. >>: Very good output from search engines. >>: He's right now [indiscernible] actually multiple results. >> Rishabh Singh: So not in the Excel side, yeah. So in the research prototype, you just right click on a cell and you can see multiple results. >>: Can you -- I know that you know all these Excel PowerPoint and all, they're extensible. >> Rishabh Singh: Yes, yes, yes. >>: So is it plausible for you to just ship to make your own -- >> Rishabh Singh: Actually, so this -- >>: [inaudible] stop you.
>> Rishabh Singh: Yeah, so the idea was, actually, yeah, you can ship your thing as an add-in, as well, which a lot of people do. But the thing is, people are not sophisticated enough to download that add-in. But, yes, if some people are, you can give them, but typically people just use basics. So we -- >>: Stop [indiscernible] it. >>: That's the problem, right? You go down that stack. >> Rishabh Singh: Yeah. >>: Then, I mean -- >>: The expectations -- >>: The expectations go down very fast. >>: I mean, the problem is, I mean, I think search analogy, the thing is that, like, you know, the reason that people have become better at searching is because there are multiple things and there's an opportunity for them to improve their mental model. But if they only ever had the "I'm feeling lucky" button, there's not an opportunity for improving their mental model [indiscernible] of reaching only possibilities. So I'm starting to push the Excel team a lot to say, hey, you guys need to show more than one. >> Rishabh Singh: Maybe that's a new way to look at spreadsheets, yeah. >>: Yeah. >>: I don't know whether it's lack of sophistication. It might actually be just a discoverability issue, you know, the right button. >>: Well, no, no. So for example, I mean -- >>: It's been shown they cannot search [indiscernible] on the right-hand side -- >>: There's been a bunch of things, so if you want to pick a font and you mouse over a font, it automatically shows you -- >>: That's right. >>: You know. So they have ways to show you lots of choices quickly. And I think this is part of the story is that it's really about the UI and the user experience and preparing people and giving them confidence, you know, that make them learn and understand it, et cetera. So I mean, there's huge opportunity. >> Rishabh Singh: Right, that's the thing. >>: Yeah. >> Rishabh Singh: Yeah. >>: It's not limited by the capability of the underlying synthesis. >> Sumit Gulwani: Okay. So let's thank the speaker here. >> Rishabh Singh: Thanks for coming. [applause]