>> Judith Bishop: And so now we have Amey, who is going to tell us about a system to teach introductory programming, which will jumpstart us into the final session today. Thank you.
>> Amey Karkare: Thank you, Judith, for inviting me to this workshop. Actually, yesterday's session sort of covered what I want to say, but at the same time I will show what we have developed, and in a way it aligns with the kind of feedback we got yesterday, because our experience has been the same with this particular tool. Before I start I would like to acknowledge some people. These are four students who are working on the project. Rajdeep Das is the chief architect of the particular tool I'm going to show. The others are Umair Z. Ahmed and Naman Bansal, who are my PhD and [inaudible] students respectively, and we are collaborating with Ivan from [inaudible] University; several interns have also helped us build the system. And finally we have Sumit, who is collaborating on this project, and myself, and we are getting more students to continue this work. Before I demo the tool I would like to tell you about some challenges that are specific to India as well as some general ones. It's our experience that teaching the first programming course is difficult because you have several challenges, several problems, especially in India, where we do not have proper infrastructure in faraway villages; and then we have many languages, and I mean not programming languages but natural languages that people speak. I looked it up recently: there are 22 official languages, and unfortunately we do not have programming languages in those languages, so we cannot [inaudible] in that language. The second major issue is the wide gap in the level of exposure to computers. In our institute we get everyone from students who are competitive programmers, already winning competitions, to students who have not even seen a computer yet. These are students who in some cases have not even seen a mobile phone. Mobile has sort of increased exposure; people at least know what a computer looks like, if you think of a mobile as a small computer, but still we have about 5 to 10 percent of students who have not seen a proper computer. We cannot solve all those challenges. So what are the challenges for us? The first challenge is how to keep good programmers engaged while teaching simple stuff to beginners, because very soon, after the second or third lecture, we start seeing sleepy faces. Then, how to provide early feedback, because if we are giving assignments and the TAs take a long time to correct them, it again becomes boring for students; and we don't want to use too many extra resources because we have limited TA support. These TAs have different levels of expertise; in fact, sometimes we have teaching assistants who themselves don't know proper programming, and at the end of the day nobody likes to work extra hours, so they want to do their job fast. And we want to use existing computers. We have a lot of existing computers; we are not allowed to buy many more. And these computers have various flavors of OS, browsers, and so on. Some are [inaudible]; I saw a computer which uses Firefox 3, on which nothing seems to work. So what is the solution? This is where Code Hunt can come into the picture.
In fact I was here in 2013 and I had some talks with Judith, and I think the timing is important, because it was the same time when [inaudible] was getting converted into Code Hunt, and I got to use probably the first version of Code Hunt, as probably one of the few people outside of Microsoft that used it. Unfortunately things did not work out, for several reasons. One was probably that Judith and team were too busy porting it to Code Hunt. Also, it's difficult to convince people at IIT [inaudible] to shift from C to some other language, especially because this course is taught across departments and our department has very little say in it, even though we are the computer science department. The other idea was using some online browser-based IDE like IDEone. This is a browser-based IDE where people can log in and start trying out programming languages. The nice thing about this browser-based IDE, IDEone, is that it supports some 34 different languages. If you just want to get a feel for some language, go to their site and start typing in that particular language; you don't have to install anything. But the negative part is that it does nothing else: you can just compile a program, run it, and see the output. But these two provided the motivation to create our own solution. Before I start the demo I would just like to describe the setup, because some of these points will be important if we have a discussion afterwards about why we chose to do certain things a particular way. The course is called ESC101 and it has approximately 400 students every semester; it is offered both semesters to a [inaudible] set of students. As for weekly load, every student has to go through three lectures, one tutorial, and one lab. The lab is three hours of programming assignments, but for the instructor it is much heavier, because the instructor has to prepare three lectures, prepare for one tutorial, even though the tutorial itself is taken by someone else, some helper or other faculty member, and finally prepare for four labs every week, which is four times the student load. And why is that so? Because other courses have labs too, so there are logistic issues, space issues, timetable issues. What has been done is that every class is divided into sections, and different sections have their labs on different days. In particular, our course has labs on Monday, Tuesday, Wednesday, and Thursday. So this is the setup before I come to the demo. Any questions up to this point? So this is our system. I am an administrator here, so once I log in I come to what is called the administrative interface. Before going to the administrative side I will just go to student mode so that it's clear what the student sees, and you can see there is quite a similarity with online browser-based IDEs like IDEone or even Code Hunt. This is a typical session for a student. Every week when he logs in at lab time he sees the assignment that is given to him or her, he or she can look at the grade card to see how previous assignments were graded, and also gets some statistics about previous submissions. So let's assume we are planning to solve one of the questions, and in this respect this is very different from Code Hunt: we provide the problem statement. In this case the problem statement is very simple; the student has to check whether a given number is a palindrome or not.
And at this point it's important to think in terms of how a student is thinking: how can we provide feedback for students who are stuck at some level, or how can we provide better feedback if a student is close to a solution? These are all the issues that we discussed yesterday. So let's start typing some code. Suppose the student is stuck at the very beginning itself, or maybe he started with some code and made some of the basic errors we have seen: students forget to add a declaration, students make some mistake in reading the numbers, and so on. What the system is doing is intercepting compiler messages and rewriting them so that they are more useful to the user. One example I will show you: in this particular case the student just gets "num undeclared (first use in this function)"; this is the original compiler message. I think the font size is very small, but it should be readable. The student gets some 5 or 6 lines of message, he doesn't know what this message means, and how is he going to fix that? What we do is intercept this message and present our own message, which is hopefully much more readable and useful to the student. We also give some sort of hint about whether you should initialize or not, and so on.
>>: [inaudible]?
>> Amey Karkare: It says you are using a variable over here that is undeclared. Num has not been declared before; please declare this variable as "int num" or "int num = 0".
>>: [inaudible] type inference that num should be an int, so that's why you [inaudible]?
>> Amey Karkare: No, we are not doing type inference. We are actually rewriting what the compiler is telling us. So let me just see. In this case it's a hardcoded message, "int num". Because it's not declared we don't even know what the type should be, so it's hardcoded. I will show you how this message is generated. There is another warning for a very common mistake people keep making. This is the bad thing about C: people keep forgetting to put the address-of operator in front of scanf, and this is something that, no matter how many times you teach it in class, unless the student sees it he or she is not going to learn it. The compiler in this case generates a warning which is really complicated for a student to understand. The compiler starts talking about pointers even though the student doesn't even know what a pointer is. The compiler gives the proper message, which a seasoned programmer can understand, but think about the poor student who is just writing a hello-world-like program and gets the message that [inaudible] are of type [inaudible]. He does not know what the star is; he just knows that it's probably used for multiplication. So we can again simplify it. At the same time we also don't know whether it's really a genuine warning or not, so we give both types of hints: "It's also likely that you forgot to use the address-of operator in a scanf," and we also rewrite the original warning in some sense.
>>: Did you go through all the warnings for the C compiler?
>> Amey Karkare: No. We have done some; I will show you the admin interface and then it will be clear. We are not going through all the warnings.
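The interception idea described here can be sketched in a few lines; the following is a minimal Python illustration, not the actual tool's code, and the rewrite patterns, messages, and function names are hypothetical examples modeled on the GCC diagnostics mentioned in the talk:

```python
import re
import subprocess

# Hypothetical rewrite table: each entry maps a GCC diagnostic pattern to a
# beginner-friendly message; captured identifiers fill the {0} placeholders,
# mirroring the meta-variables shown later in the admin interface.
REWRITE_RULES = [
    (re.compile(r"'(\w+)' undeclared \(first use in this function\)"),
     "You are using a variable '{0}' that has not been declared. "
     "Please declare it first, e.g. 'int {0};' or 'int {0} = 0;'."),
    (re.compile(r"format '%d' expects argument of type 'int \*'"),
     "It is likely that you forgot the address-of operator '&' in scanf, "
     "e.g. scanf(\"%d\", &x); instead of scanf(\"%d\", x);"),
]

def friendly_diagnostics(source_file):
    """Compile with gcc, intercept stderr, and rewrite known diagnostics."""
    result = subprocess.run(["gcc", "-Wall", "-c", source_file],
                            capture_output=True, text=True)
    messages = []
    for line in result.stderr.splitlines():
        for pattern, template in REWRITE_RULES:
            m = pattern.search(line)
            if m:
                messages.append(template.format(*m.groups()))
                break
        else:
            messages.append(line)  # unknown diagnostics pass through unchanged
    return messages
```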
So let's say a student has started solving the problem and now he's stuck at the first level itself. What he can do is, well, we have what are called three events associated with each program: you can compile it, you can execute it on your own input, or you can evaluate the program on each of the provided inputs or test cases. If the student evaluates the program, this is where we do some kind of analysis of the program to give hints. In this case, if you did not notice, I will just run it again: a small popup comes up which tells you that there is feedback and that the program did not pass all tests. If you notice, on the left-hand side we have several tabs. The first one is the problem statement; the second one is the tutor, which is where we give the hint. In this case we get a hint that you should try to use a loop to compute the reverse of a number, and again I will show you in the admin interface how this hint is generated. At the same time we also get the result of running this program on each of the provided inputs. We have the concept of visible tests as well as invisible tests, because we noticed that if all tests are visible, students start writing code like what we saw yesterday: if X is 1552 the answer is yes, if X is 42 the answer is no, and so on. Now at this point the student has two choices. Either he understands the hint and starts working according to it, or we also provide what is called a downgrade option for the problem. What downgrade does is choose a similar problem which is possibly of lesser difficulty. Suppose I downgrade this problem. In this case the idea is that the student probably does not know how to get digits out of a number, so we can give him a problem which asks the student to print the last two digits in reverse order. This problem does not require any loops, so if a student knows how to extract digits out of a number, he or she should be able to solve it. If he or she can solve it, then we can again ask them to upgrade the problem and solve it from there onward. Otherwise the student can start writing the program, introduce some variables, so let me see. And I'm making some mistakes, hopefully [inaudible]. So let's just check; let's try to evaluate this program. We still got a message that it did not pass, and there is more feedback. It says that you have to add an assignment to reverse at the beginning of the main function, you have to use a new integer variable, you have to check assignments, and you have to add a printf, because right now the program is not printing anything even though we are expecting some output. So we can add some assignment. This is the mistake I made in the first version of the code when I wrote the model solution, and that's why I like this particular mistake. There is a bug in our interface: as you can see, on one side it says all the test cases have passed when they did not. This bug has been fixed in the live system; I'm using a clone, and I discovered this yesterday. So we still have some more feedback; we just look at it, and the tool still asks me to create a new integer variable. At this point, if I am curious enough, I will realize that this number is changing: I read some number, I keep changing it until it is zero, and then I compare the modified number with the reverse, so this test is always going to fail. It is asking me to add a saved copy, to save the number before starting to [inaudible]. So now everything has passed.
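For concreteness, the reverse-and-compare logic, including the "save the number first" fix the tutor hints at, looks roughly like this; a minimal sketch in Python rather than the C the course uses, with illustrative names:

```python
def is_palindrome(num):
    """Reverse the digits of num and compare with the original.

    The bug described above: if you destroy num inside the loop and then
    compare it against reverse, the test always fails. You must save a
    copy of the number before the loop starts.
    """
    saved = num          # the "save the number" step the tutor asks for
    reverse = 0          # the "add an assignment to reverse" hint
    while num > 0:
        reverse = reverse * 10 + num % 10   # append the last digit
        num //= 10                          # drop the last digit
    return saved == reverse

assert is_palindrome(1551) and not is_palindrome(42)
```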
But I just want to show you that there are some other bugs that our tool can catch. [inaudible] our tool; I will explain that as well. If there is an initialization error, it can tell you that there is some bug in the assignment at line four. Similarly, it's not able to deduce that fixing this error will fix the other two problems as well, so it gives some extra messages, and similarly if there's a problem with the loop condition. Okay. For figuring out this feedback we are using the tool developed by Ivan, Sumit, and Florian at the University of [inaudible]. The way this tool works is that it creates a control flow graph of the program, and it also uses a lot of specifications, which are provided by the instructor. These specifications are various ways to write a correct program. What it does is compare the student program with the existing specifications, do some analysis to figure out which specification most closely matches this program, and then print the line which differs between the specification and the student submission. Yes.
>>: Are you assuming that there is only one master solution for every problem?
>> Amey Karkare: No. I will just show you that. The tutor has to write as many specifications as possible. Again, this is not a very nice thing to ask of an instructor who is already overloaded, so we have some ways around it; that's why I said that some of the setup I described earlier is important. I will come to that point as well. So we need many, many specifications; ideally we need all possible ways to correctly solve the problem. And on top of that we want some common mistakes as specifications, so that we can generate proper feedback. Again, I will come back to this question when we get to the administrative interface. In this case, as you can see, the downgraded problem was manually created, but what we have done, with some interns, is create a small framework to automatically generate problem statements. In particular, this is a problem where we want students to print, actually it's not coming out very nicely, so I'll just compile it and run it: we want students to print two square boxes using stars. The output is supposed to be two square boxes with a boundary of stars, side by side. And this is a very challenging problem, because a first-time programmer has to understand the concept of printing spaces; it's a doubly nested loop, and there is a nested if condition as well. In case a student is not finding this problem [inaudible], he can downgrade, and in that case he gets a simpler problem where he has to print just one square. I will show you some problems that we generated for patterns; this "PVD" is just our internal code name. So for every assignment, and this is not a good example, here is a problem which has three variations. In the first case a student has to print a rhombus with some increasing numbers, but if the student is stuck we can downgrade this to another problem where the student has to print just the top part of the rhombus, and if the student is still stuck we can go to an even simpler problem where the student has to print just a square. And as you can see, we can go down even further: instead of numbers we can print stars or simple prorations[phonetic]. We did this for nested-loop pattern problems, some series problems where you are computing a number of terms in a series or sequence, and then also mixing two series, like [inaudible] with x factorial, and so on. Any questions? Okay. So this is whatever we have in the student interface.
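The two-square-boxes pattern problem mentioned above can be read as follows; a sketch in Python of one plausible model solution, with hypothetical names, not the generated assignment code itself:

```python
def two_star_boxes(n):
    """Print two n-by-n boxes whose boundaries are stars, side by side,
    separated by a gap of spaces: a doubly nested loop with a nested
    condition, as described in the talk."""
    for row in range(n):
        line = []
        for box in range(2):                      # two boxes side by side
            for col in range(n):
                on_boundary = row in (0, n - 1) or col in (0, n - 1)
                line.append("*" if on_boundary else " ")
            if box == 0:
                line.append("   ")                # gap between the boxes
        print("".join(line))

two_star_boxes(4)
# ****   ****
# *  *   *  *
# *  *   *  *
# ****   ****
```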
We have also incorporated the bug-reporting feature, but it's still disabled because we are not yet comfortable with it; it's not fully tested. And then there are other management things: students can see their existing solutions for previous labs and so on, they can test using the Scratchpad, they can file some bugs and so on. So let's go to the admin mode. This is what a tutor or instructor will see. The first thing is problem management, and this particular tab is very similar to the puzzle designer in Code Hunt. Here, for every problem, we have to provide the data. You can create a new problem with a unique name and so on. I'll just take the particular problem that we were trying to solve. For every problem there are three or four sections. We give it a unique name to identify it; then we can decide whether the problem is a practice problem, which is released every week and does not have any deadline associated with it; then comes the problem statement. This is what Code Hunt does not have. After that comes the instructor solution, which is used to compute the desired outputs for the set of inputs. Then we have these specifications, so this should answer your question: we need as many correct solutions as we can get. For example, specification three is the same as the instructor solution. Then we have some more specifications. This one is a totally different solution: basically it first finds the highest power of 10 in the number and then computes the reverse in some other way. The interesting thing is that we also need some incorrect specifications. For example, this is an incomplete program which does not have any loop. This helps us identify that a student is not using a loop where he is supposed to use one, and we can add feedback for such cases: whenever a student's incorrect attempt matches this particular program, he will get this feedback. And to decide whether they match, this again uses concolic execution: it uses some concrete inputs to figure out which variables in the student solution match which variables in the instructor solution or the specification. After the specifications we have the initial template. This is like the starting code in Code Hunt, the initial screen that a student sees. And we have test cases; in this case we have just manual test cases. You can see three visible and two invisible test cases, and there is the possibility of adding automatically generated test cases. In the previous semester we used KLEE to generate test cases. I'll just show you one example. This was a lab exam and we needed a lot of test cases. Here we needed an array of size 500 or something; it was not possible for us to write it manually, so we used KLEE to generate these test cases and used the model program to generate the solutions.
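Putting the pieces just walked through together, a problem record might be represented roughly like this; the field names and file names below are purely illustrative, not the system's actual schema:

```python
# A hypothetical, simplified shape for one problem record.
palindrome_problem = {
    "name": "lab5-palindrome",
    "practice": False,                       # practice problems have no deadline
    "statement": "Check whether a given number is a palindrome.",
    "model_solution": "palindrome_model.c",  # used to compute expected outputs
    "specifications": [
        # as many correct variants as possible ...
        {"file": "spec_reverse_loop.c", "correct": True},
        {"file": "spec_power_of_ten.c", "correct": True},
        # ... plus deliberately incorrect ones carrying targeted feedback
        {"file": "spec_no_loop.c", "correct": False,
         "feedback": "Try to use a loop to compute the reverse of the number."},
    ],
    "template": "palindrome_template.c",     # initial code the student sees
    "tests": [
        {"input": "1551", "visible": True},
        {"input": "42", "visible": True},
        {"input": "7", "visible": True},
        {"input": "1000021", "visible": False},   # invisible: defeats hardcoding
        {"input": "999999999", "visible": False},
    ],
}
```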
Apart from problem management we have account management; again, it's not very interesting. What is more interesting is, first, that our tool is more like a framework which can use different compilers and different feedback mechanisms. Right now we are using GCC with certain compiler flags, and we are using this strategy-based feedback as a plug-in, which was developed by Ivan. There are other things: we run the programs in a sandbox, again to avoid any changes to the system itself; and then there are certain delays, which have been added so that students are not evaluating their programs again and again, putting load on the server. The nice thing is that we can use any feedback mechanism developed by a third party, as long as it follows a certain protocol for input and output. So this is not the only feedback mechanism we can add; we did try some other mechanisms as well, and we are looking forward to input from people like you. The last part is analytics on the data. This is the place where we generate error messages for students. What we do is constantly monitor which error messages occur most often. For example, "undeclared variable" we have seen some 7500 times in lab four and lab five, as of when this clone was generated. At this place we can create our own rewriting of the compiler message. What we do is remove all the specific information and create meta-variables; actually this is a very compute-intensive operation, it's loading 7400 data points from the database. So we have created these meta-variables, :X1, :X2 and so on, and you can rewrite the messages using :X1. That's how we get the name num and so on. I should have chosen some simpler bug, one which occurs less often. So let me just see if I can open another instance. As you can see, these crosses mean we have not provided any feedback; this is something we added just last week. Let me try this one. Here you can add feedback, you can update it, and you can also see instances. Sometimes a particular instance is more interesting, because if a semicolon is missing it has a cascading effect and a lot more messages come out, so we want to treat such errors specially. And you can see all the particular assignment submissions that caused this issue. Finally we come to the code submissions. Whatever code has been submitted by students gets recorded in the system; in fact, we store incremental changes to every submission. So this "test one" is the particular event that was created when I tried to solve some problem. These are different students; each box corresponds to one student and one particular problem. As you can see, the data is anonymized: there's no easy way to relate a student to a particular submission. You have to know the ID under which the student tried the solution, or you have to go through several database exercises to map them. So this is the submission that I made at the end. However, the nice thing is I can go over all the submissions from the very first one; in fact I did a lot of trial and error. I can keep looking at how this code evolved from the first template that was given to the student. This is more of a selling feature; it does not do anything useful by itself, it just shows you how a student is thinking, how he's coding. We keep saving the student's input, auto-saving it every few characters or every few seconds, so in a way we get a snapshot of the student's submission every 10 or 15 seconds. This also helps us in figuring out the mistakes a student is making. Whenever a student files a bug report or some issue using this e-mail, we also get the associated ID and the particular version number of the code that he's talking about, so we can just look at the code and think about what could be going wrong. You can see I tried a lot of variations before getting here.
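The incremental-snapshot idea described above (a full first version plus a diff every few seconds, with version numbers that can be referenced in bug reports) can be sketched like this; a minimal Python illustration with hypothetical names, not the real storage code:

```python
import difflib
import time

class SnapshotStore:
    """Store a student's evolving code as a sequence of diffs, one entry per
    autosave. Replaying diffs and per-student indexing are omitted here."""

    def __init__(self):
        self.versions = []   # (timestamp, unified diff against the previous save)
        self.latest = ""

    def autosave(self, code):
        diff = "".join(difflib.unified_diff(
            self.latest.splitlines(keepends=True),
            code.splitlines(keepends=True)))
        self.versions.append((time.time(), diff))
        self.latest = code
        return len(self.versions) - 1   # version number quoted in bug reports

store = SnapshotStore()
v0 = store.autosave("int main() { }\n")
v1 = store.autosave("int main() { int num; }\n")
```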
>>: So is the purpose that you would respond to the student at this point?
>> Amey Karkare: That is one of the uses, but in reality we want to do other analytics as well. Basically we want to look at the kinds of errors that students are making most often, because this is storing all the events as well: whenever the code was compiled, what error was produced, whether it passed the test cases or not. We also record any test case that a student entered manually, so that it can be used with other submissions.
>>: So I don't want to anticipate what you're going to say, but are you expecting any surprises here, or did you find any surprises?
>> Amey Karkare: In fact, in the first few labs I do go over these submissions, because they tell me what students are thinking; and as I said yesterday, one of the major pain points is this forgetting of the percent operator, and the second one is integer division. Most students divide two integers and are surprised that the answer does not match their expectation. Last semester, when I taught this course for the first time, I saw this mistake; on any random submission [inaudible] I saw this mistake, so I dedicated 15-20 minutes in the next class to explaining how integer division works and why it is different from floating-point division, and sort of gave them some motivation. So this helps in understanding the thinking of the students, where they are stuck, but in the long run we can also do analytics on top of it. One analysis was the error messages. We can also look at how the code size is changing. This is something where I was trying [inaudible] different combinations: you can see the code size was constant for a long time, and then this morning suddenly there were several changes. Then we look at different events. This is not coming out very well, probably because I zoomed in: how many times the code was auto-saved, how many times the student saved it explicitly. There is another interesting event, called submit. Because we capture code every few seconds with auto-saves, we don't want to evaluate a wrong version of the code, so we have something called submit, where a student tells us this is the version he or she wants us to evaluate and grade. So all these events, and once again you can see that they are all concentrated towards the end, because this is my submission. I can show you some real submissions as well, and we also collect some other data which is not available in this virtual clone. So let's look at some real submissions. This is an exam that we had just a few days back, and I found some interesting patterns there. I'll just open one of them. This is the view a tutor gets when he or she is grading a submission. A tutor can evaluate the submission, look at what kinds of errors there are, and these test cases help the tutor grade the submission; you can grade it right here and provide feedback. Why this one was interesting: it seemed that the student was running out of time, and he just wrote a comment at the end that he knows how to solve it, he's just unable to convert it to a correct integer: "My program is not complete because I don't know how to convert [inaudible]." In fact, last time we saw a program which just had one printf: "I don't know how to code this." We gave him two marks. And there was one more. So I think I'm pretty good on time; I'll finish in two minutes.
I think there were some interesting patterns here as well. You can see the problem statement: given an array, you have to do [inaudible] encoding on the array, so basically, for every element, what is the frequency of that element in a consecutive [inaudible]. In this case the interesting part was the analytics, because the code size changed a lot at the start but then became almost constant, and towards the end, close to the finish of the exam at 5 PM, the code size increased again; probably the student was adding some comment or trying to change his program to pass some more tests. The interesting thing is that even though plagiarism detection was not our first goal, this tool has helped us do that. We used [inaudible] along with this tool to decide copying cases, and we found some interesting cases there as well, which I will not go into because that's not a very proud thing for me as a teacher. So I will just conclude. We started developing the system in July last year, just before the semester started, so it's about nine months old and still in the experimentation phase. We are trying a lot of things; we don't know what will work and what will not, but the good thing is that this framework allows us to plug and play different components. We are using GCC, but we have done some experiments with Python as well as Haskell. Right now we are using Ivan's strategy-based feedback and compiler message rewriting, but in the previous semester we did some ad hoc scripting using Python, and there we used an AST of the program to do some type checking, especially the integer-division kind of type checking. Then we used some ad hoc programs to automatically generate problems, which are [inaudible] with the system; right now these are two independent systems. And we have used KLEE earlier for automated test case generation; this time we did not get a chance. We do automated test case generation mostly for lab exams, because that's where we want to check all possible interesting cases, but for normal day-to-day labs KLEE generates test cases which are again simple: it will find the first thing which satisfies certain properties, so if you want an array it will generate all zeroes [inaudible], which is not very interesting from the student's perspective. So there is a lot of future work, a lot of data waiting to be processed. There are many [inaudible] and HCI issues which have to be resolved. For example, we had some discussion yesterday about how much feedback should be given to a student; how long we should wait; whether a student should click a button to get feedback, or we should just give it [inaudible]; and so on. The feedback tool has some limitations: it takes a lot of time to process the inputs and compare them with the specifications, it generates false positives, as we saw, and it requires a large number of specifications if your program is not trivial. So this is something we have decided to do: in some cases we are using crowdsourcing, using the same batch of students to generate test cases. Peer review is in [inaudible], and even specification generation. This is where it helps us that our labs are distributed across four days: whatever problems we give on the first day, we change them slightly for the second day, and we use correct programs from the first day to act as specifications.
This has worked very well for us in some cases. And at the end, this is our lab where our system is getting used [inaudible]. This is the [inaudible]-developed system, and it is a coincidence that on the day this lab exam was conducted it was raining heavily, like in Seattle, so all the umbrellas are lined up outside the lab. Okay. So, any questions? We are in fact trying to collaborate with several groups. [inaudible] is maintained, so I am in touch with them, and we are planning to use it as a feedback generation mechanism. We are also talking to some other people about other tools which have been reported in the last few years. And that is one of the advantages of the system: we are not tied to one mechanism. Even though it's not our research, it's getting used in our tool. The main challenge is simply finding the tools. Right now these tools work on programs like bzip[phonetic] or GCC, which are huge. In our case the programs are simple; we have a few lines of code, so we can go all out. We don't worry about state space explosion in general. So thank you. Any questions? Yes.
>>: How are students liking this system? What's the experience been like?
>> Amey Karkare: That's one of the [inaudible] issues; we have not done any formal study. Informally, students are happy, but TAs are much happier: their job has been cut down quite a lot. Each TA has to grade about 80 problems every week, and they are finding this much, much more useful, because earlier it was an ad hoc process: people used to send e-mails on Gmail or submit on Moodle, and TAs had to download everything and make sure there were no name clashes and so on. Now everything can be done online and they have these existing test cases. Yes?
>>: So beyond intro programming courses, for higher-level courses do you use Java?
>> Amey Karkare: Generally we do not force any language. In our departmental courses we do not fix any language; students are free to code in any language of their choice. But typically we see C++ or Java being used most often. We have some courses on Haskell and [inaudible] as well.
>>: So I think another question is, how do you see Code Hunt being integrated with your system, or what kind of relationship between these two systems, if you're going to use Code Hunt in the future?
>> Amey Karkare: One of the problems, and this is again an HCI issue, is that students are not interested in writing their own test cases. What we see is that they just keep pressing evaluate, evaluate, and then, especially during an exam, they try to fit their code to the test cases. That's why we introduced this concept of hidden test cases. But one takeaway from Code Hunt, if I could borrow some technology from it, would be the dynamic test case generation, where students see just one or two failing tests at a time and there's no chance of fitting the program to the tests. In fact, one of the funniest things was that I had a [inaudible] variation which generated some 70,000 outputs for L equal to 20 or something, and a student was trying to hardcode every output. So the idea of dynamic test generation is very attractive to me.
And there are other ways we can use Code Hunt, like programming contests, because it's very easy for us to hide the problem statement and let the student work on figuring out the problem in the same way as Code Hunt; but we would need automatic test case generation in that case, the dynamic test case generation.
>>: [inaudible] you build up a very good problem repository along with these tool supports on the side. I mean, the conversion from C to Java or C# in Code Hunt, these kinds of coding tools, would be difficult, right? It would be worthwhile to think about community efforts for building these problem repositories.
>> Amey Karkare: Definitely. It's a very good [inaudible]. In fact, there is one advantage of using the system: officially we cannot say that we are teaching Python, but we can have an optional track which uses Python or Java or C#, whatever; unofficially we can do all that. So once we have the system up and running we can offer a choice of language, so just like Code Hunt gives you C# and Java, we can add Python and some other languages which are more popular. And that will also give us more tools to explore, like [inaudible] feedback generation. So thank you very much.
>> Rishabh Singh: Hi everyone. Today I'm going to talk about some of the things I thought might be interesting for the audience here, since we'll be getting the Code Hunt data, and what kind of analysis we can do with it. This is joint work with the [inaudible] group at MIT: Elena Glassman and Jeremy Scott are the students; Philip Guo, who used to be a postdoc, is now a professor at the University of Rochester; and Rob Miller leads the group. The motivation is quite clear, as everybody before has said: it's already a big task for teachers in classrooms with 200 students to go over each submission and figure out what's happening. And with these online platforms, [inaudible], Coursera, coming up, there are more than one hundred thousand students, and there's no way a teacher can go over each submission and figure out what students are doing and what needs to be focused on in the next lecture. The goal of this project was: can we build something where a teacher can easily go over millions of submissions and try to see what students are doing in the classroom? We have already seen that there's been a lot of work on providing feedback for programming assignments, and most of the work that I'm familiar with I can characterize into two parts. One is more programming-languages-based, where we did some work before on autograding, and there's some work which I may briefly mention [inaudible] feedback; that's coming from the PL side. There's also been a lot of work from the machine learning community: a group from Stanford has been working on a system called Codewebs which tries to use machine learning to give feedback. I'll go briefly into the pros and cons of each of these approaches. In this particular project, OverCode, we tried to combine the two: we used programming language techniques to complement machine-learning-based techniques. But first I'll digress a little, because we've seen lots of interesting talks yesterday and today on giving good feedback on programming assignments, so I also want to briefly mention the PL-based techniques that we are working on to give feedback on programming assignments.
The idea here is that we have Python programs coming in from the edX class, with all kinds of mistakes, and we have built a system called AutoProf which tries to do some analysis. It doesn't matter what kind of programs students have written; we just need one solution from the teacher and something we call an error model, which corresponds to the common mistakes students make, and then we suggest syntactic changes to the program so that it becomes correct. The problem is to find a minimum number of changes to the program that make it correct. It would probably be better to just give a brief demo here. This is a problem which asks students to compute something over a polynomial in Python, and this particular submission is wrong, but it's hard for a TA to just look at it and figure out what's going wrong. The system does some analysis and figures out that the program is almost correct; it just needs two syntactic changes. In line 19, instead of i greater than or equal to zero, you need to make it something else; in this case there are many choices, and the system comes up with one particular choice: it says not equal to. Now the program is one step closer to the solution; you get some sense of progress this way. Here I'm showing the raw output; in the version we are deploying on the edX platform we don't want to give everything away, because that would take away the learning part. So we have different hint levels: we can say that something is wrong in this particular expression in your code, go check it, and we can do other levels as well. The nice thing is that it doesn't depend on what kind of programs students write. Here somebody wrote a for loop to solve the same problem, again with some mistakes; somebody else used a list comprehension. This was the most interesting one I found. The first time I saw it I thought it was completely incorrect, because the student only returns the result when the length of the input is less than two, but the inputs can be of arbitrary length. It just so happens that this particular student is employing a very interesting tactic: the student keeps popping from the list, doing some computation, and when enough has been popped, returns. Here the program is actually really close; it says you just need a base case, and some initialization is wrong. So this is the class of mistakes the system can find and give feedback on. Briefly, the technique is like this: you have a student solution and a teacher solution. In your error model you describe all kinds of changes, the kinds of mistakes students can make. The system then goes in and starts mutating the program in all possible ways based on the error model, and this is a big space: typically you get about 10^12 to 10^15 different mutants of a program. Then, to check whether any particular mutant is correct, it does a functional equivalence check against the teacher solution, and by functional equivalence I mean it checks them over a large space of inputs; in our case we use up to 10^6 different inputs. We make sure that on the same inputs the mutant and the teacher solution produce the same outputs, and this way we have some confidence that the fix is correct.
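To make the error-model idea concrete, here is a toy sketch in Python: one rewrite rule over comparison operators, naive enumeration of mutants, and equivalence checking by testing over a range of inputs. The real system works over ASTs and uses constraint-based search rather than enumeration, and all names and rules here are illustrative:

```python
import itertools
import re

# One toy error-model rule: a comparison operator the student may have
# gotten wrong can be replaced by any of these alternatives.
COMPARISONS = [">=", "<=", "!=", ">", "<"]

def mutants(source):
    """Naively enumerate every program obtained by swapping comparison ops."""
    spots = [m.span() for m in re.finditer(r">=|<=|!=|>|<", source)]
    for choice in itertools.product(COMPARISONS, repeat=len(spots)):
        out, prev = [], 0
        for (start, end), op in zip(spots, choice):
            out.append(source[prev:start])
            out.append(op)
            prev = end
        out.append(source[prev:])
        yield "".join(out)

def equivalent(candidate, reference, lo=-1000, hi=1000):
    """Check functional equivalence by testing over a large range of inputs."""
    env = {}
    try:
        exec(candidate, env)            # defines f(x)
    except Exception:
        return False
    for x in range(lo, hi + 1):
        try:
            if env["f"](x) != reference(x):
                return False
        except Exception:
            return False
    return True

# The student wrote '>=' where the intended behavior needs '>':
buggy = "def f(x):\n    return 1 if x >= 0 else 0\n"
teacher = lambda x: 1 if x > 0 else 0
fixes = [m for m in mutants(buggy) if equivalent(m, teacher)]
print(fixes[0])   # only the mutant using '>' survives the equivalence check
```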
The problem is that if you do this in a really naive way, assuming it takes one millisecond to do one functional equivalence check, it will take more than 30 years to give feedback on just one assignment (10^12 checks at one millisecond each is roughly 31 years), and that's far too long. The trick is that we solve this problem using program synthesis. Essentially I can characterize PL-based feedback techniques like this: we give teachers a way to describe the common mistakes students make, so they are describing error patterns; then, to search over this large family of mutants, we use constraint-based search to solve it efficiently; and to check equivalence we use program equivalence techniques that have been developed in the verification community. And the nice thing with a PL-based technique, this is the most important point: the teacher has a way to incrementally add new rules. It's not a black-box thing. So the good thing about PL-based techniques is that teachers have control over what kinds of mistakes they are looking for. We used it over lots of benchmarks from the edX class; in the beginning it was taking quite a bit of time, but we have since made the system quite efficient, and it came out to about 14 dollars per assignment for the edX team, and they were quite happy with it. These are some numbers on how well the system performs: over 13,000 submissions, the system is able to give feedback within about 10 seconds for 64 percent of the cases. And this slide is a little more detailed; if you're interested in the technique I can talk about it afterwards. So those were the PL-based techniques for giving feedback, in some sense using program analysis and verification-based techniques. There's another community also looking at giving feedback; this is more machine-learning-based, and this particular system is called Codewebs, from Stanford. They wanted to give feedback on assignments from the Coursera machine learning class.
>>: I had a question about the previous one. It seems like from the [inaudible] standpoint you might want to be a little less specific when you give the feedback. Do you have any way of abstracting the feedback?
>> Rishabh Singh: Actually, I didn't go into too many details, but the version we are putting on edX is not going to say "you did something wrong in this line, change it to plus 1"; it would say "print this expression at this line". We also want to teach students how to debug, so it would say "you may want to print the value of i at this particular line", or it might say "this expression looks buggy". So it won't really give you the answer, but it will guide you towards learning debugging. What the Codewebs technique does is take these assignments, get the ASTs out of them, and compute all kinds of features, what they call forests of trees, lots of syntactic features; then they use a [inaudible] clustering technique and get this nice graph. What's happening is that you see all these very close chunks; the idea is that all the programs that look similar go into the same chunk, and then what an instructor can do is go over these chunks and look at not all the assignments in a particular chunk but only a few of them, and give feedback; this way you can do power grading.
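A rough stand-in for the kind of syntactic features just described might look like this in Python; Codewebs' actual features and clustering are far richer, and everything here is an illustrative simplification:

```python
import ast
from collections import Counter

def subtree_features(source, depth=2):
    """Count shallow AST subtree shapes for one submission."""
    def shape(node, d):
        if d == 0:
            return type(node).__name__
        kids = ",".join(shape(c, d - 1) for c in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({kids})"
    return Counter(shape(n, depth) for n in ast.walk(ast.parse(source)))

def similarity(a, b):
    """Overlap similarity between two feature multisets (1.0 = identical)."""
    return sum((a & b).values()) / (sum((a | b).values()) or 1)

# Two submissions that differ only in identifier names:
s1 = "def f(x):\n    return x * x\n"
s2 = "def g(y):\n    return y * y\n"
print(similarity(subtree_features(s1), subtree_features(s2)))   # 1.0
```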
But some shortcomings here are that this kind of graph is really hard for teachers. We took these results and showed them to instructors, and it's very hard for them to make sense of it, because sometimes you find two quite different submissions in one cluster. It just happens that, because the feature space is very high dimensional, the [inaudible] clustering thinks they should be together, but it's not clear why, so it's hard to understand. It is also computationally expensive: this particular result took multiple days on large clusters to compute. So just to summarize, PL-based techniques require-
>>: How many programs, do you have any idea? Approximately.
>> Rishabh Singh: One hundred thousand. This was of course from Coursera, the machine learning class; they want to give feedback. So if you compare PL-based techniques with machine learning: one bad thing about the PL-based technique I showed you is that there's some manual effort, the teacher needs to go in and say these are the classes of mistakes I'm looking at, whereas machine learning is quite automated, it's data-driven. But the good thing about PL-based techniques is that the teacher has control over what's happening; the teacher can express, via the error models, the kinds of patterns they want feedback on. It's also more interactive, in the sense that if something doesn't work the teacher can go in and add rules, whereas machine learning is more of a black-box, one-shot technique: if it works, it's good, but it's not clear how to improve it.
>>: You mentioned that in the machine learning technique, because of the [inaudible] in high dimensionality, they put solutions together just because there are lots of dimensions, not because they were really similar?
>> Rishabh Singh: Yeah, exactly. The problem is that you get a feature vector out of these programs, they are mapped to higher dimensions, and the clustering actually happens in the higher dimension; it doesn't happen in the lower dimension.
>>: [inaudible].
>> Rishabh Singh: Yeah, exactly. So the correlation between the syntactic features and the features they use for clustering is not clear.
>>: It seems that the key difference would be in the domain knowledge that you apply in your error models, right? I don't think it's particularly about PL. [inaudible]. In an ML solution, if you provide some domain knowledge you may get some understandable results, is my guess.
>> Rishabh Singh: Right, exactly. There does need to be similar feature engineering: somebody has to say these are the features I'm interested in, and it's not clear how somebody not in machine learning would specify the features that would work for this particular domain; with an error model it's made clearer: this is a mistake, and I'm looking for common patterns of it. So it's a little more intuitive, but you're right: if someone knows machine learning and how clustering algorithms work, they can do the feature engineering and provide those insights through features as well. So now let me briefly go over the system I want to talk to you about. We wanted to combine both PL-based techniques and machine-learning-based techniques, and this is the system we have built, called OverCode. The idea is that there are submissions coming in from the edX class, more than 100,000, so as an instructor teaching the class, how can you keep track of what's happening?
In some sense you can think of it as an interactive dashboard, something we could have with Code Hunt as well, to try to make sense of what kinds of programs people are writing, with some analysis and search over it. The motivation is again similar: there's a lot of variation in programming assignments; the same assignment can be solved in many different ways. The focus here is not so much on correctness; it was more on the design of the programs, more software engineering skills, where you want to say it's better to use these constructs in this manner rather than that manner, or some design patterns. We are trying to make the system lightweight, so normal teachers on laptops can use it without having access to clusters. And the goal is to help teachers identify interesting examples in the data, to go into the classroom and say these are pedagogically interesting examples; one side effect is that once you have a nice clustering, you can also use it for giving feedback and power grading. So let me briefly show you the system. This is a problem from the edX class which asks students to solve this [inaudible] power problem: it takes a base and an exponent, and they have to compute base to the power exponent. This looks a little complicated, but what's happening is we have this concept of stacks, which are these rectangular boxes, and each has a number to its left. It says there were 1500 submissions that did something like this. I'll tell you how we sort these, but the idea is that 1500 students did something similar to this, and then you have different stacks going this way which the system thinks are different. These are essentially clusters: every stack represents a cluster. The teacher also has access to some filters on the right. The system comes up with all the interesting lines and expressions in the code that students have submitted, and a teacher might say, I want to see which students are using zero as an argument to the range function. The teacher can filter the assignments, the stacks rearrange, and now the teacher can give the feedback that when you are using range, zero is implicit, so you don't really need to provide it, and go over those programs. They can also filter by other things. Let me show you one example. Here the teacher is saying that cluster one and cluster two are kind of similar: I don't really think they should be different. So again there's this interactivity, where the teacher can provide rules saying these two are the same, instead of defining features for a machine learning algorithm to match clusters. Here the teacher can say that "result = result * base" is a difference I'm not interested in: some teachers might want to teach that "variable = variable * operand" can be rewritten as "variable *= operand", but in this case the teacher is not interested, so the teacher can say, I think these are the same. So we can provide additional rules to help the clustering algorithm collapse all the clusters where the difference was just that.
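A teacher-supplied rewrite rule of the kind just described might be expressed roughly like this; a hypothetical sketch, not OverCode's actual rule syntax:

```python
import re

# Normalize "v = v <op> e" to "v <op>= e" so that clusters differing only
# in this stylistic choice collapse into one stack.
AUGMENTED_ASSIGN = re.compile(r"^(\s*)(\w+)\s*=\s*\2\s*([*+/-])\s*(.+)$")

def normalize_line(line):
    m = AUGMENTED_ASSIGN.match(line)
    if m:
        indent, var, op, rhs = m.groups()
        return f"{indent}{var} {op}= {rhs}"
    return line

assert normalize_line("result = result * base") == "result *= base"
```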
Yeah.
>>: So you showed your data is what students have written. How much do you currently have? How many data items?
>> Rishabh Singh: This interface is kind of a demo version, so it's working over, I think, 6000 submissions.
>>: How many?
>> Rishabh Singh: 6000.
>>: 6000. So that's sufficient for what you need, do you think?
>> Rishabh Singh: The goal is, since this is a scalable algorithm, we can feed it one hundred thousand as well, and I think the storage is already [inaudible] with the edX class data.
>>: I think the question I am aiming for is: are you finding interesting rules or interesting mistakes and so on, in the same way as [inaudible] was? So if you had more data, do you think you would find more things, or do you think it would just be repetition, that there's actually nothing more interesting to find?
>> Rishabh Singh: Right. The goal of this interface was to find interesting things in the data itself, to collapse as many of the things I don't care about into one stack. The more data there is, the more chance I'll find interesting things, but it depends on how the students are writing the code.
>>: So what that's leading up to is, once we provide you with the Code Hunt data, obviously you need to provide a [inaudible] in Python.
>> Rishabh Singh: A good thing is that it now also works for Java.
>>: So you could take that data and then put that in, and you could then incorporate it.
>> Rishabh Singh: Exactly. I'll show you; we gave this system to teachers, and what kind of studies we did. But, yeah, one of the features is that it shows you different stacks and also tells you the differences between two of them. You'll see that some of the lines are grayed out and only the differences are apparent, so this can also help teachers figure out what's happening. Then there's this legend, which I think won't make sense until I go into the algorithm, but here teachers can see what values the variables take for different test case values.
>>: Are the variables renamed by your code?
>> Rishabh Singh: I'll tell you, yeah, exactly; that's where the PL part comes into the clustering. So I'll briefly go over the algorithm, the following steps. First of all we reformat the submitted programs: we remove all comments, we do whitespace tightening and so on, so that the programs are clean. Then we do some dynamic analysis: we run the programs on different test cases and collect all the traces for each program. In this example it would say, when I run on (5, 3), these are the different program states I get at different program points. Essentially these are program traces, and then we do an abstraction where we remove duplicate values. For example, this loop index i would have the values 0, 0, 1, 1, 2, 2. Instead of keeping duplicates, we abstract it to 0, 1, 2, because some people might have multiple lines in the loop body and some might have a single line. This is a kind of trace abstraction, and now we get these values for each variable in all the programs. Then, once we have these variable values, we take two programs at a time and try to see which traces correspond to each other. Even though these programs are different in that they use different variable names, somebody used r, somebody used result, you can now match which variable corresponds to which variable based on these trace values.
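The trace abstraction and variable matching just described can be sketched as follows; the traces shown are hypothetical values for an input of (5, 3):

```python
def dedupe(values):
    """Collapse consecutive duplicates: [0, 0, 1, 1, 2, 2] -> [0, 1, 2],
    so a one-line loop body and a two-line loop body abstract to the
    same trace."""
    out = []
    for v in values:
        if not out or out[-1] != v:
            out.append(v)
    return out

def match_variables(traces_a, traces_b):
    """Greedily pair variables of two programs whose abstracted value
    sequences agree on the same test input, e.g. 'r' with 'result'."""
    pairs = {}
    for name_a, seq_a in traces_a.items():
        for name_b, seq_b in traces_b.items():
            if dedupe(seq_a) == dedupe(seq_b):
                pairs[name_a] = name_b
                break
    return pairs

# Hypothetical traces collected by running both programs on input (5, 3):
prog1 = {"r": [1, 5, 25, 125], "i": [0, 0, 1, 1, 2, 2]}
prog2 = {"result": [1, 5, 25, 125], "k": [0, 1, 2]}
print(match_variables(prog1, prog2))   # {'r': 'result', 'i': 'k'}
```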
We do this across all the programs in the code base, and we find the most common name used for each trace sequence. Here we find that for this particular trace value most of the students used result; some students used other names, but result was the most common one. Similarly, different variable names were used for other trace values, and then essentially we do majority voting: when we get a new program, we rename it based on the sequences of values its variables take. So this program would get converted into something like that, where r changes to result. The idea is that once all programs have similar variable names, it becomes easier for teachers to understand what different variables are doing in different submissions. So this is how we rename the variables, and then there are some corner cases that come up. One of the cases is when I try to rename something to X and the student was already using X; there's some way to handle that. And finally we need to create the clusters. We take two programs that have been renamed and reformatted, so they look pretty similar. There's one more abstraction we do: we take the lines of each program, and instead of doing line-by-line matching we do set-based matching. We say this set of lines corresponds to that set of lines, and if everything is the same we create a cluster out of them.
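The majority-vote renaming and the set-based clustering can be sketched like this; again a minimal illustration with made-up data, not the system's code:

```python
from collections import Counter, defaultdict

def canonical_names(all_programs):
    """For each abstract trace, pick the variable name most students used
    (majority vote); programs are then renamed accordingly."""
    votes = defaultdict(Counter)
    for traces in all_programs:                 # traces: name -> value tuple
        for name, seq in traces.items():
            votes[tuple(seq)][name] += 1
    return {seq: counter.most_common(1)[0][0] for seq, counter in votes.items()}

def cluster_key(lines):
    """Set-based matching: two programs whose renamed, reformatted lines
    form the same set land in the same stack, regardless of line order."""
    return frozenset(line.strip() for line in lines if line.strip())

programs = [
    {"r": (1, 5, 25)}, {"result": (1, 5, 25)}, {"result": (1, 5, 25)},
]
print(canonical_names(programs))   # {(1, 5, 25): 'result'}
```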
>>: I was just wondering, for this little example, when you swap the two lines around do you also do some kind of inference to tell you that you're allowed to do that? >> Rishabh Singh: That's a good point actually. So there are a lot of hidden assumptions behind this. Since we are renaming all variables and the programs are all functionally correct, the idea is that when you're doing the set-based abstraction you might sometimes collapse two things that shouldn't be together, so there's incompleteness in that sense; but typically, with high probability, the lines are going to be independent of each other and the order is not going to matter that much. But yeah, we don't do any analysis of when this can be done or can't be done. Okay. So let me quickly also tell you, we gave this system to teachers who were teaching the edX class, and essentially the TAs who were helping the professor teach the edX class. And this is the user study. So there were a few hypotheses we wanted to check with the system. Does the system help teachers get more satisfied, and I'll tell you what that means. Does it help them read more solutions, and faster, than what people typically do? Does it help them give better feedback? And does it help them gain confidence, in terms of how confident they are that they've given good feedback to the class? So study one was a kind of open-ended task, where the idea was we gave them two ways to look at programs. One was the more traditional way. There's this big list; you can imagine a giant HTML file with all submissions. The other was using our system. And the task was: write a Piazza post in 20 minutes, using both of the interfaces, about one of the things you found interesting in this data set. So it's very open-ended. We didn't give any specific things to do. So there were 12 TAs here, and almost everybody had graded these particular assignments before, and we had some qualitative questions here, and based on those surveys it was actually statistically significant: almost everybody said it was much easier to use, on a Likert scale from 1 to 7. It helped them get an understanding of what students were doing in the class, and also it was less overwhelming than just getting a giant HTML file. So it was kind of reasonable that they liked the system, and then there were some qualitative comments that they liked certain features more than certain other features. But the problem was we weren't able to get any quantitative data out of that study, so it was very qualitative. So we designed another study where we focused on getting more quantitative data. So here we told the teachers, the TAs: don't just write a Piazza post, but go over and tell us the five most important things you found in this data to tell the class, and also tell us how confident you are about it. So it was a little more focused task, more concrete. And again, we recruited 12 more people. So there were some interesting observations here. So this particular graph is showing you control, which is the baseline, versus OverCode, which is the system. So for different problems it's showing you that teachers, when they were using the baseline, in this particular graph it actually shows that they were able to look at more solutions than with OverCode, and I'll explain why that is. But actually for this particular problem, even here, they were able to read more solutions. So the difference is, when they read one solution in OverCode they're actually looking at a large family of solutions, whereas in the baseline when they look at one solution they're looking at just one solution. So when you multiply by the sizes of the stacks they were reading, the difference becomes quite appreciable.
In some cases they were able to cover 64 percent of the problems in 10 minutes, so in some sense this interface helped them go over students' submissions much faster than the baseline. Then the second thing we wanted to do was, when they gave feedback, when they talked about the five important points, we wanted to see how many students it applied to. So in this case we found, interestingly, that for these two problems, using both the baseline and our system, there wasn't really a statistically significant difference between the two; but for the quantitative problem, the feedback they gave talked about points that actually applied to many more students than with the baseline, and also it helped them be more confident: everybody was much more confident that they had covered everything in the class, as compared to using the baseline. And some TAs were so excited about the rewrite rules. In the five minutes they were given, they did all kinds of analysis and wrote lots of rules like this. So in summary, I showed you an interactive visualization tool for going over thousands of programming solutions. And it's kind of language independent, so I showed you this for Python, but they're already using it in the Java class. It's lightweight. So the clustering technique I showed you is linear in the number of solutions, as compared to quadratic, which is typically the case with clustering algorithms, so it's scalable. And it helps teachers get an overview of the class and better understand the kinds of solutions students are submitting. That's it. Thanks. >>: What's TOCHI? >> Rishabh Singh: Which one? >>: TOCHI. >> Rishabh Singh: So it's actually, I think CHI is something different. So what happens with CHI is they have this journal, TOCHI, where if you submit within a timeframe you can present it at CHI. So it's running here. So this actually is going to be at CHI this year. >>: So I'm interested, a little off the track here, why you chose to present this work at CHI. >> Rishabh Singh: That's a good point. So most of my collaborators work at [inaudible] for this particular work and it's their top conference. So the thing is, there's a lot of interesting education research in the CHI community. They were looking more toward the interface side, and here we wanted to combine a little bit of PL-based semantic techniques with [inaudible] issues. >>: So do you think that's a community that we should also address? >> Rishabh Singh: Actually, I think CHI is probably the biggest community working on education right now. I think they have separate sessions inside CHI just on education, and this particular paper is going to be presented in one of those sessions. It's a big community, and they also started this conference called Learning at Scale, which is about [inaudible] and education. And that's primarily [inaudible] people. So yeah, they are actually very interested in education-related research. >>: So for the system that you developed, is there any plan for engaging the community or open-sourcing part of it? >> Rishabh Singh: So actually the student who was working on it, I'm not sure if she has already open-sourced it, but the goal was to open-source it, and I think somebody at UDub is also using it. So the goal is actually to have the community use it and build on top of it. >>: Excellent. So we can start coffee early for a change and come back for our third paper.