>> Judith Bishop: And so now we have Amey who is going to tell us about a system to teach
Introductory Programming which will jumpstart us into the final session today. Thank you.
>> Amey Kakare: Thank you Judith for inviting me for this workshop. And actually a study
session sort of covered what I want to say, but at the same time I will show what we have
developed, and in a way it aligns with the kind of feedback we got yesterday because our
experience has been the same with this particular tool that we have developed.
So before I start I would like to acknowledge some people. So these are four students who are
working on the project. Rajdeep Das is the chief architect of the particular tool I'm going to
show. Others are Umair Z. Ahmed and Naman Bansal who are my PhD and [inaudible] student
respectively and we are collaborating with Ivan from [inaudible] University; and then several
interns have helped us build the system. And finally we have Sumit who is collaborating on this
project and myself and we are getting more students to continue with this work.
So before I start demoing the tool I would like to tell you about some challenges that are
specific to India as well as in general. So it's our experience that teaching the first programming
course is difficult because you have several challenges, several problems and especially in India
where we do not have proper infrastructure in faraway villages; and then we have many
languages, and I'm saying not programming languages but natural languages that people speak.
I checked this morning. So there are 22 official languages, and unfortunately we do not have
programming languages in those languages so we cannot [inaudible] in that language.
And the second major issue is the wide gap in the level of exposure to computers. So in our
institute we get students ranging from competitive programmers already winning competitions
to students who have not even seen a computer yet. Some of these students have not even
seen a mobile phone. So mobile has sort of increased exposure; people at least know what a
computer looks like, if you can think of a mobile as a small computer, but still we have about
5 to 10 percent of students who have not seen a proper computer.
We cannot solve all those challenges. So what are the challenges for us? The first challenge is
how to keep good programmers engaged while teaching simple stuff to beginners because very
soon after the second or third lecture we start seeing these sleepy faces. How to provide early
feedback, because if we are giving assignments and the TAs are taking a long time to correct the
assignments then it becomes boring for students again. And we don't want to use too many
extra resources because we have limited TA support. These TAs have different levels of
expertise; in fact, sometimes we have teaching assistants who themselves don't know proper
programming. And at the end nobody likes to work extra hours, so they want to do their job fast.
And we want to use existing computers. We have a lot of existing computers. We are not
allowed to buy too many computers. And these computers have various flavors of OS, browsers
and so on. Some are [inaudible]; I saw a computer which uses Firefox 3, on which nothing seems
to work.
So what is the solution? And this is where Code Hunt can come into the picture. In fact I was
here in 2013 and I had some talks with Judith, and I think the timing is important because it was
the same time when [inaudible] was getting converted into Code Hunt, and I got to use probably
the first version of Code Hunt, as probably one of the few people outside of Microsoft that used
it; and unfortunately things did not work out, for several reasons. One was probably that Judith
and team were too busy porting it to Code Hunt. It's also difficult to convince people at IIT
[inaudible] to shift from C to some other language, especially because this course is across
departments and our department has very little say in it even though we are the computer
science department.
So the other idea was using some online IDE like IDEone. So this is a browser-based IDE where
people can log in and start trying out programming languages. The nice thing about this
browser-based IDE, IDEone, is that it supports some 34 different languages. So if you just want
to get a feel of some language, go to their site and start typing in that particular language. You
don't have to install anything. But the negative part is there is nothing else. You can't do
anything else: just compile a program, run it and see the output.
But these two provided us the motivation to create our own solution. So before I start showing
you the demo I would just like to describe the setup, because some of these points will be
important if we have a discussion afterwards about why we chose to do certain things in a
particular way. So the course is called ESC101 and it has approximately 400 students every
semester.
So this is offered both semesters to a [inaudible] set of students. So weekly load, every student
has to go through three lectures, one tutorial and one lab. Lab is three hours of programming
assignments, but for the instructor it is much heavier, because the instructor has to prepare
three lectures, prepare for one tutorial, even though the tutorial itself is taken by someone
else, some helper or other faculty member, and finally prepare for four labs every week, which
is four times the student load. And why is that so? Because other courses
have labs too so there are logistic issues, space issues, timetable issues. So what has been done
is every class is divided into sections and different sections have their different labs on different
days. So in particular our course has labs on Monday, Tuesday, Wednesday and Thursday.
So this is the setup before I come to the demo time. So any questions up to this point? So this
is our system. So I am an administrator here. So once I log in I come to what is called the
administrative interface. Before coming to the administrative side I will just go to student
mode so that it's clear what the student sees here, and you can see there is quite a similarity
with an online browser-based IDE like IDEone or even Code Hunt. So this is a typical session for
the student. Every week, whenever he logs in at lab time, he sees the assignment that is given
to him or her, he or she can look at the grade card to figure out previous assignments, how they
were graded and so on, and then he or she also gets some statistics about the previous
submissions and so on.
So let's assume we are planning to solve one of the questions, and in this case this is very
different from Code Hunt: we provide a problem statement. In this case the problem
statement is very simple: the student has to check whether a given number is a palindrome or not.
And at this point it's important that we have to think in terms of how a student is thinking, how
can we provide feedback for students who are stuck at some level, or how can we provide
better feedback if a student is close to a solution. So these are all the issues that we discussed
yesterday. So let's start typing some code. So suppose the student is stuck at the very
beginning itself. So what he can do is, or maybe he started with some code, and these are some
of the basic errors that we have seen: students forget to add a declaration, students make some
mistake in reading the numbers, and so on.
So what the system is doing is intercepting compiler messages and rewriting them so that they
are more useful to the user. So one example I will show you: in this particular case the student
just gets num undeclared, first [inaudible], in this function; so this is the original compiler
message. I think the font size is very small but it should be readable. So the student gets some
5 or 6 lines of message, he doesn't know what this message means, and how is he going to fix
that? So what we do is we intercept this message and present our own message which is
hopefully much more readable and useful to the student. We also give some sort of hint about
whether you should initialize or not and so on.
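A minimal sketch of this interception idea, not the production system: the pattern table,
message wordings, and function name here are hypothetical, and a real deployment would cover
many more GCC diagnostics.

```python
import re

# Hypothetical rewrite table: raw GCC diagnostic pattern -> friendly template.
REWRITES = [
    (re.compile(r"'(\w+)' undeclared"),
     "You are using a variable '{0}' that is undeclared. Please declare it "
     "first, for example as 'int {0};' or 'int {0} = 0;'."),
    (re.compile(r"expects argument of type 'int \*'"),
     "It is likely that you forgot the address-of operator '&' in scanf."),
]

def rewrite_diagnostic(gcc_message):
    """Return a friendly message for a known diagnostic, else the original."""
    for pattern, template in REWRITES:
        match = pattern.search(gcc_message)
        if match:
            return template.format(*match.groups())
    return gcc_message  # unknown diagnostics pass through unchanged

print(rewrite_diagnostic("error: 'num' undeclared (first use in this function)"))
```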
>>: [inaudible]?
>> Amey Kakare: It says you are using a variable over here that is undeclared. Num has not
been declared before; please declare this variable as int num or int num equal to zero.
>>: [inaudible] type inference that num should be an 8 so that's why you [inaudible]?
>> Amey Kakare: No, we are not doing type inferencing. We are actually rewriting what the
compiler is telling us. So let me just see. So in this case it’s a hardcoded message int num.
Because it’s not declared we don't even know what the type should be so it’s hardcoded. I will
show you how this message is generated. There is another warning for a very common mistake.
So this is the bad thing about C: people keep forgetting to put the address-of operator in front
of the scanf argument, and this is something that, no matter how many times you teach it in
class, unless the student sees it he or she is not going to learn it. So the compiler in this case
generates a warning which is really complicated for a student to understand. The compiler
starts talking about [inaudible] even though the student does not even know what a pointer is
and so on. So the compiler gives a proper message which a seasoned programmer can
understand, but think about the poor student who is just writing a hello-world-like program and
he gets the message that [inaudible] are of type [inaudible]. He does not know what the star
is; he just knows that it's probably used for multiplication and so on. So we can again simplify
it. At the same time we also don't know whether it's really a genuine warning or not, so we
give both types of hint: it's also likely that you forgot to use the address-of operator in a
scanf, and we also rewrite the original warning in some sense.
>>: Did you go through all the warnings for the C compiler?
>> Amey Kakare: No. We have done some, I will show you the admin interface and then it will
be clear. So we are not going through all the warnings. So let's say a student has started
solving the problem and now he's stuck at the first level itself. So what he can do is, well, we
have what are called three events associated with each program: you can compile it, you can
execute it on your own input, or you can evaluate the program on each of the provided inputs
or test cases.
So if the student evaluates the program, this is where we do some kind of analysis of the
program to give hints. So in this case, if you did not notice, I will just run it again. A small
pop-up comes up which tells the student that there is feedback and that the program did not
pass all tests. If you notice, on the left-hand side we have several tabs. The first one is the
problem statement; the second one is the tutor, which is where we give the hint; so in this case
we are getting a hint that you should try to use a loop to compute the reverse of a number, and
again I will show you in the admin interface how this hint is generated; and at the same time we
also get the result of running this program on each of the provided set of inputs. So we have
the concept of visible tests as well as invisible tests, because we noticed that if all tests are
visible students start writing code like what we saw yesterday: if X is 1552 the answer is yes, if
X is 42 the answer is no, and so on.
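The evaluate event could be sketched like this; a toy Python illustration where the test data
and the binary name are hypothetical: visible tests show expected versus actual output, while
invisible tests only contribute to a pass count.

```python
import subprocess

# Hypothetical tests for the palindrome problem: (stdin, expected stdout).
visible_tests = [("121", "YES"), ("123", "NO"), ("7", "YES")]
hidden_tests = [("1221", "YES"), ("10", "NO")]

def run_student_binary(binary, stdin_text):
    """Run the compiled student program on one input and capture its output."""
    result = subprocess.run([binary], input=stdin_text, capture_output=True,
                            text=True, timeout=2)  # timeout guards infinite loops
    return result.stdout.strip()

def evaluate(binary):
    for inp, expected in visible_tests:
        actual = run_student_binary(binary, inp)
        status = "PASS" if actual == expected else "FAIL"
        print(f"visible input={inp!r} expected={expected!r} got={actual!r}: {status}")
    hidden_passed = sum(run_student_binary(binary, i) == e for i, e in hidden_tests)
    print(f"hidden tests passed: {hidden_passed}/{len(hidden_tests)}")

evaluate("./palindrome")  # assumes the student's C program is compiled here
```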
So now at this point the student has two choices to make. Either he understands the hint and
starts working according to the hint, or we also provide what is called a downgrade option for a
problem. So what downgrade does is choose a similar problem which is possibly of a lesser
difficulty. So suppose I downgrade this problem. So in this case the idea is that probably the
student does not know how to get digits out of a number, so we can give him a problem which
wants the student to print the last two digits in reverse order. Now this program does not
require any loops, so if a student knows how to extract digits out of a number he or she should
be able to solve this. If he or she can solve it then we can again ask them to upgrade the
problem and solve it from there onward.
So otherwise a student can start writing the program, introduce some variables, so let me see.
And I'm making some mistakes, hopefully [inaudible]. So let's just check, let's try to evaluate
this program. So we still get some message that it did not pass, and there is more feedback. It
says that you have to add an assignment to reverse at the beginning of the main function, you
have to use a new integer variable, you have to check assignments, and you have to add a
printf, because right now the program is not printing anything even though we are expecting
some output.
So this is the mistake I made in the first version of the code when I wrote the model solution
for this, and that's why I like this particular mistake. So there is a bug in our interface: as
you can see, on one side it is saying all the test cases have passed when they did not. This bug
has been fixed in the live system; I'm using the clone system. Yesterday I discovered this. So
still we have some more feedback, so we just look at it, and the system still asks me to create
a new integer variable. And at this point, if I am curious enough, I will realize that this
number is changing: I read some number, I keep changing it until it is zero, and then I am
comparing the modified number with the reverse, so this test is always going to fail. So it is
asking me to add a saved copy and save the number before starting to [inaudible].
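The fix being hinted at, save the number before the reversing loop destroys it, is language
independent; the lab uses C, but a minimal Python sketch of the corrected logic is:

```python
def is_palindrome(num):
    """Check a non-negative number by reversing its digits."""
    saved = num            # the fix: keep a copy, because the loop consumes num
    reverse = 0
    while num > 0:
        reverse = reverse * 10 + num % 10   # append the last digit
        num //= 10                          # drop the last digit
    return saved == reverse  # compare the saved copy, not the zeroed-out num

assert is_palindrome(121) and not is_palindrome(123)
```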
So now everything has passed, but I just want to show you that there are some other bugs that
our tool can catch. [inaudible] our tool; I will explain that as well. So if there is an
initialization error it can tell you that there is some bug in the assignment at line four. It is
not able to deduce that fixing this error will fix the other problems as well, so it gives some
extra messages; and similarly if there's a problem with the loop condition.
Okay. So for figuring out this feedback we are using the tool developed by Ivan, Sumit, and
Florian at University of [inaudible]. So the way this tool works is it creates a control flow graph
of this program and it also uses a lot of specifications which are provided by the instructor.
These specifications are various ways to write a correct program. So what it does is it compares
this program with the existing specifications, it does some analysis to figure out which
specification closely matches this program, and then prints the line which differed in the
specification and in the student submission. Yes.
>>: Are you assuming that there is only one master solution for every problem?
>> Amey Kakare: No. So I will just show you that. So the tutor has to write as many
specifications as possible. Again, this is not a very nice thing to ask of an instructor because
he's already overloaded, so we have some ways around it; and that's why I said that some of the
setup I described earlier is important. So I will come to that point as well. So we need many,
many specifications. Ideally we need all possible ways to correctly solve this problem. And on
top of it we want some common mistakes as specifications so that we can generate proper
feedback. So again, I will come back to this question when we get to the administrative
interface.
So in this case, as you can see, the downgraded problem was actually manually generated, but
with some interns we have created a small framework to automatically generate problem
statements. So in particular this is a problem where we want students to print; actually it's
not coming out very nicely so I'll just compile it and run it. So we want students to print two
square boxes using stars, and the output is supposed to be something like two square boxes with
boundaries of stars, side by side. And this is a very, very challenging problem, because a
first-time programmer has to understand the concept of printing spaces and figuring out the
edges; it's a doubly nested loop and there is a nested if condition as well. So in case a
student is not finding this problem [inaudible] then he can downgrade, and in that case he gets
a simpler problem where he has to print just one square. In fact, I will show you some problems
that we generated for patterns.
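The nested structure this problem exercises, a doubly nested loop with a nested if deciding
boundary versus interior, looks like this; a Python sketch of the logic (the lab solutions
themselves are written in C, and the helper name is hypothetical):

```python
def two_star_boxes(n, gap=2):
    """Print two n-by-n boxes side by side: star boundary, space interior."""
    for row in range(n):
        line = ""
        for col in range(n):                              # doubly nested loop
            if row in (0, n - 1) or col in (0, n - 1):    # nested if: boundary?
                line += "*"
            else:
                line += " "
        print(line + " " * gap + line)  # the second box repeats the first

two_star_boxes(4)
```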
So this PVD is some internal coding of ours. So for every assignment, this is not a good one.
So here is a problem which has three variations. In the first case a student has to print a
rhombus with some increasing numbers, but if the student is stuck we can downgrade this to
another problem where the student has to print just the top part of the rhombus, and if the
student is still stuck we can go to an even simpler problem where the student has to print just
a square. And as you can see we can go down even further: instead of numbers we can print
stars or simple prorations[phonetic].
So we did this for nested-loop pattern problems, and some series problems where you are
computing some number of terms in a series or sequence, and then you are also mixing two
series, like [inaudible] with x factorial and so on. Any questions?
Okay. So this is whatever we have in the student interface. We have also incorporated the bug
feature, but it's still disabled because we are not comfortable with it yet; it's not fully
tested. And then there are other management things: students can see their existing solutions
for previous labs and so on, they can test using the Scratchpad, they can file some bugs and so
on. So let's go to the admin mode.
So this is what a tutor or instructor will see. And the first thing is problem management, and
this particular tab is very similar to the puzzle designer of Code Hunt. So here for every
problem we have to provide the data. You can create a new problem with some unique name and
so on. So I'll just take the particular problem that we are trying to solve. So for every
problem there are 3 or 4 sections. We give some unique name to identify it; then we can decide
whether the problem is a practice problem, which is released every week and does not have any
deadline associated with it; then comes the problem statement. This is what Code Hunt does not
have. And after that there is the instructor solution, which is used to compute the desired
output for the set of inputs. Then we have these specifications, so this should answer your
question: we need as many correct solutions as we can get. So, for example, specification three
is the same as the instructor solution. Then we have some more specifications. So this is a
totally different solution: it is first finding the highest power of 10 which is there in the
number, and then it's computing the reverse in some other way and so on.
The interesting thing is we also need some incorrect specifications. For example, this is an
incomplete program which does not have any loop. This helps us in identifying that a student is
not using a loop where he is supposed to use one. So we can add some feedback for such
problems: whenever a student's incorrect attempt matches this particular program, we'll get
this feedback. And to decide the match, this thing again uses concolic execution: it uses some
concrete input to figure out which variables in the student solution match which variables in
the instructor solution or the specification.
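A toy sketch of that variable-matching step (the real tool is Ivan's; the traces below are
hypothetical): run the student program and a specification on the same concrete input, record
the sequence of values each variable takes, and pair variables whose sequences coincide.

```python
# Value sequences observed on the concrete input 121 (hypothetical data).
student_trace = {"n": [121, 12, 1, 0], "rev": [0, 1, 12, 121]}
spec_trace = {"num": [121, 12, 1, 0], "reverse": [0, 1, 12, 121], "saved": [121]}

def match_variables(student, spec):
    """Map each student variable to a spec variable with an identical trace."""
    mapping = {}
    for s_var, s_vals in student.items():
        for t_var, t_vals in spec.items():
            if s_vals == t_vals and t_var not in mapping.values():
                mapping[s_var] = t_var
                break
    return mapping

print(match_variables(student_trace, spec_trace))
# {'n': 'num', 'rev': 'reverse'} -- 'saved' is unmatched on the student side
```

An unmatched specification variable, like saved here, is exactly the kind of evidence behind
the "use a new integer variable" feedback shown earlier.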
So after the specifications we have the initial template. Again, this is like the starter code
in Code Hunt: the initial screen that a student sees. And we have test cases. So in this case
we have just manual test cases; you can see three visible and two invisible test cases, and
there is a possibility to add automatically generated test cases. In the previous semester we
used KLEE to generate test cases. I'll just show you one example. So this was a lab exam and we
needed a lot of test cases. So here we needed an array of size 500 or something, so it was not
possible for us to write them manually. We used KLEE to generate these test cases and used the
model program to generate the solutions.
So apart from problem management we have account management; again, it's not very interesting.
What is more interesting is that our tool is more like a framework which can use different
compilers or different feedback mechanisms. So right now we are using GCC with certain compiler
flags, and we are using this strategy-based feedback as a plug-in, which was developed by Ivan.
There are other things: we run the programs in a sandbox, again to avoid any changes to the
system itself, and then there are certain delays. These have been added so that students are
not evaluating their program again and again and putting a load on the server.
So the nice thing is we can use any feedback mechanism developed by a third party as long as it
follows a certain protocol about input and output. So this is not the only feedback mechanism
we can add. We did try some other mechanisms as well, and we are looking forward to other
input from users, from people like you.
So the last part is about analytics of the data. So this is the place where we are generating
error messages for a student. What we do is constantly monitor which error messages are
occurring most often. So 'undeclared variable' we have seen some 7500 times on lab four and lab
five, when this clone was generated. So at this place we can create our own rewriting of the
compiler message. What we are doing is removing all the specific information and creating some
meta-variables; actually this is a very compute-intensive thing, it's loading 7400 data points
from the database. So what we have done is we have created these meta-variables, colon x1, x2
and so on, and you can rewrite these messages using this :x1. So that's how we get the name num
in there and so on.
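A small sketch of that meta-variable abstraction, with hypothetical message data: concrete
identifiers are abstracted into :x1, :x2, and so on, so that equivalent errors collapse into one
template the instructor rewrites once.

```python
import re
from collections import Counter

raw_messages = [
    "error: 'num' undeclared (first use in this function)",
    "error: 'count' undeclared (first use in this function)",
]

def abstract(message):
    """Replace each quoted identifier with a numbered meta-variable :x1, :x2, ..."""
    names = re.findall(r"'(\w+)'", message)
    for i, name in enumerate(names, start=1):
        message = message.replace(f"'{name}'", f":x{i}")
    return message, names

# Count how often each abstracted template occurs across the lab data.
templates = Counter(abstract(m)[0] for m in raw_messages)
print(templates.most_common(1))
# [('error: :x1 undeclared (first use in this function)', 2)]

# The instructor's rewrite is stored per template and re-instantiated per student.
rewrite = "Variable :x1 has not been declared; please declare it, e.g. 'int :x1;'."
template, names = abstract(raw_messages[0])
print(rewrite.replace(":x1", names[0]))
```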
So I should have chosen some simpler bug, one which is occurring less often. So let me just see
if I can open another instance. So as you can see these crosses mean we have not provided any
feedback. So this is just something we added last week. So let me try this one. So here you
can add feedback, you can update it, and you can also see instances. What happens is that
sometimes a particular instance is more interesting, because if a semicolon is missing it sort
of has a cascading effect, a lot more messages come out, so we want to treat such errors
specially. And you can see all the particular assignment submissions that caused this issue. And
finally we come to the code submissions. So whatever code has been submitted by students gets
recorded in the system. In fact, we store incremental changes to every submission, so this test
one is the particular event that was created where I tried to solve some problem. So these are
different students; each box corresponds to one student and one particular problem. As you can
see, the data is anonymized: there's no easy way to relate a student to a particular
submission. You have to know the ID under which the student is trying the solution, or you have
to go through several database exercises to map these. So this is the submission that I
made at the end. However, the nice thing is I can go over all submissions from the very first
one. In fact I did a lot of trial and error. So I can keep looking at how this code evolved from
the first template that was given to the student. So this is more like a selling feature; it
does not do anything useful by itself, it just shows you how a student is thinking, how he's
coding. So we keep saving a student's input, auto-saving it every few characters or after a few
seconds. So in a way we are getting a snapshot of the student's submission every 10 or 15
seconds. This also helps us in figuring out the mistakes a student is making. So whenever a
student files a bug request or some issue using this e-mail, we also get the associated ID and
the particular version number for the code that he's talking about. So we can just look at the
code and think about what could be going wrong. So you can see I tried a lot of variations
before coming here.
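A minimal sketch of that snapshot log, with hypothetical field names: every auto-save appends a
numbered version, so a bug report carrying an ID and a version number pins down the exact code
the student was looking at, and the whole evolution can be replayed.

```python
import time

snapshots = []  # (version, timestamp, code) in submission order

def autosave(code):
    """Append one numbered snapshot of the editor buffer."""
    snapshots.append((len(snapshots) + 1, time.time(), code))

def replay():
    """Walk the evolution of the solution from the initial template onward."""
    for version, stamp, code in snapshots:
        print(f"--- version {version} ---")
        print(code)

autosave("int main() { return 0; }")            # the initial template
autosave("int main() { int num; return 0; }")   # a few seconds later
replay()
```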
>>: So is the purpose that you would respond to the student at this point?
>> Amey Kakare: That is one of the uses, but in reality we want to do other analytics as well.
So basically we want to look at the kinds of errors that students are making most often, because
this is storing all the events as well: whenever the code was compiled, what error was
generated, whether it passed the test cases or not. We are also recording any test case that a
student entered manually so that it can be used with other submissions.
>>: So I don't want to anticipate what you're going to say, but are you expecting any surprises
here or did you find any surprises?
>> Amey Kakare: In fact, in the initial few labs I do go over these submissions because they
tell me what students are thinking; and yesterday, as I said, one of the major pain points is
this forgetting the percent operator, and the second one is integer division. Most students use
two integers and they are surprised that the answer is not matching their expectation. So last
semester, when I taught this course for the first time, I saw this mistake, on any random
submission [inaudible] I saw this mistake, so I dedicated 15-20 minutes in the next class
explaining how integer division works and why it is different from floating-point division, and
sort of gave them some motivation. So this helps in understanding the thinking of the students,
where they are stuck, but in the long run we can also do some analytics on top of it.
So one analytic was getting the error messages. We can also look at how the code size is
changing. So this is something where I was trying out [inaudible] different combinations. So
you can see the code size was constant for a long time, and then this morning suddenly there are
several changes. And then we are looking at different events. So this is not coming out very
well, probably because I zoomed it: how many times the code was auto-saved, how many times
students saved it explicitly.
So there is another interesting event. This is called submit. Because we are capturing code
every few seconds, we are doing auto-saves, and we don't want to evaluate a wrong version of
the code, we have something called submit, where a student tells us this is the version he or
she wants us to evaluate and grade. So all these events, and once again you can see that all
these are concentrated towards the end, because this is my submission. I can show you some
real submissions as well, and we also collect some other data which is not available in this
virtual clone.
So let's look at some real submissions. So this is an exam that we had just a few days back, and
actually I found some interesting patterns there. So I'll just open one of them. So this is a view
which a tutor gets when he or she is grading the submission. So a tutor can evaluate the
submission, look at what kinds of errors are there, and these test cases help the tutor to grade
the submission; you can grade it right here and provide certain feedback. So why this was
interesting is that it seemed the student was running out of time, and he just wrote a comment
at the end that he knows how to solve it, it's just that he's unable to convert it to a correct
integer: 'my program is not complete because I don't know how to convert [inaudible].' In fact,
last time we saw a program where it just had one printf: 'I don't know how to code this.' We
gave him two marks. So there was one more. So I think I'm pretty good on time. I'll finish in
two
minutes. So I think there were some interesting patterns here as well. So you can see the
problem statement: this problem is about, given an array, you have to do [inaudible] encoding
on the array, so basically every element and what the frequency of that element is in a
consecutive [inaudible].
So in this case the interesting part was the analytics, because the code size changed a lot at
the start but then became almost constant, and towards the end, close to the finish of the exam
at 5 PM, the code size increased again; probably the student was adding some comments or trying
to change his program to pass some more tests. The interesting thing is that even though we did
not have plagiarism detection as a first goal, this tool has helped us in doing that. So we used
[inaudible] along with this tool to decide about copying cases, and we could find some
interesting cases there as well, which I will not go into because that's not a very proud thing
for me as a teacher. So I will just conclude with that.
We started developing the system in July last year, just before the semester started. So it's
about nine months old and it's still in the experimentation phase. We are doing a lot of
things; we don't know what will work and what will not, but the good thing is this framework
allows us to plug-and-play different components. We are using GCC, but we have done some
experiments with Python as well as Haskell. So right now we are using Ivan's strategy-based
feedback and compiler message rewriting, but in the previous semester we did some ad hoc
scripting using Python, and there we used an AST of the program to do some type checking,
especially the division-by-integer kind of type checking. Then we used some ad hoc programs to
automatically generate problems, which are [inaudible] with the system, right now these are two
independent systems; and we have used KLEE earlier to do automated test case generation. This
time we did not get a chance. We do automated test case generation mostly for lab exams because
that's where we want to check all possible interesting cases, but for normal day-to-day labs
KLEE generates test cases which are again simple; it will find the first thing which satisfies
certain properties, so if you want an [inaudible] it will generate all zeroes [inaudible], which
is not very interesting from the student's perspective.
So there is a lot of future work, a lot of data waiting to be processed. There are many
[inaudible] and HCI issues which have to be resolved. For example, we had some discussion
yesterday about how much feedback should be given to a student, and at what interval: should we
wait for a student to click a button to give feedback, or just give it [inaudible], and so on.
The feedback tool has some limitations. It takes a lot of time to process the inputs and compare
them with a specification. It generates false positives, we saw that, and it requires a large
number of specifications in case your problem is not trivial.
So this is something we have decided to do. In fact, in some cases we are using crowdsourcing:
we are using the same batch of students to generate test cases. Peer review is in [inaudible]
and even specification generation. So this is where it helps us that our labs are distributed
across four days: whatever problems we give on the first day we change slightly for the second
day, and we use correct programs from the first day to act as specifications. So this has worked
very well for us in some cases. And at the end, this is our lab where our system is getting used
[inaudible]. This is [inaudible] developed system, and it is sort of a coincidence that the day
this lab exam was conducted it was raining heavily, like Seattle, so all the umbrellas are lined
up outside the lab. Okay. So any questions?
So we are in fact trying to collaborate with several groups. In fact [inaudible] is maintained,
so I am in touch with the maintainers, and we are planning to use it as a feedback generation
mechanism. We are also talking to some other people about some other tools which have been
reported in the last few years. And that is one of the advantages of the system: we are not
tied to one system. Even though it's not our research, it's getting used in our tool. The main
challenge is simply finding the tools. Right now these tools are working on programs like
BZ[phonetic] or GCC, which are huge. In our case the problems are simple. We have a few lines
of code so we can go all out there. We don't worry about state space explosion in general.
So thank you. Any questions? Yes.
>>: How are students liking this system? What's the experience been like?
>> Amey Kakare: So that's one of the [inaudible] issues. We have not done any formal study. So
informally students are happy, but the TAs are much more happy; their job has been cut down
quite a lot. So each TA has to grade about 80 problems every week, and they are finding this
much, much more useful, because earlier it used to be done in an ad hoc way. People used to
send e-mails on Gmail or submit on Moodle, and TAs had to download everything and make sure
that there are no name clashes and so on. So right now everything can be done online and they
have these existing test cases. Yes?
>>: So beyond intro programming courses, for higher-level courses do you use Java?
>> Amey Kakare: So generally we do not force any language. In our departmental courses we do
not fix any language; students are free to code in any language of their choice. But typically
we see C Plus Plus or Java being used most often. And we have some courses on Haskell and
[inaudible] as well.
>>: So I think another question is how do you see that Code Hunt could be integrated with your
system, or what kind of relationship these two systems could have if you're going to use Code
Hunt in the future?
>> Amey Kakare: So one of the problems, as I said, and this is again an HCI issue, is that
students are not interested in writing their own test cases. What we see is they just keep
pressing evaluate, evaluate, and then, especially during the exam, they try to fit their code to
the test cases. So that's why we introduced this concept of hidden test cases. But one of the
takeaways from Code Hunt, if I can borrow some technology from Code Hunt, would be this dynamic
test case generation, where students just see one or two failing tests at a time and there's no
chance of fitting the program to them. In fact, one of the funniest things was I had a
[inaudible] variation which used to generate some 70,000 outputs for L equal to some 20 or
something, and a student was trying to hardcode that output, move by move. So the idea of
dynamic test case generation is very attractive for me.
And in other ways we can use Code Hunt, like programming contests, because it's very easy for
us to delete the problem statement and let the student work on finding out the problem in the
same way as Code Hunt; but we will need automatic test case generation in that case, the
dynamic test case generation.
>>: [inaudible] you build up a very good problem repository along with these tool supports on
the side. I mean the conversion from C to Java or C Sharp in Code Hunt, these kinds of coding
tools, would be difficult, right? It would be worthwhile to think about some kind of community
effort for building these problem repositories.
>> Amey Kakare: Definitely. It's a very good [inaudible]. In fact there is one advantage of
using the system. So officially we cannot say that we are teaching Python, but we can have an
optional track which can use Python or Java or C Sharp, whatever; unofficially we can do all
that. So once we have the system up and running we can have a choice of language, so just like
Code Hunt gives you C Sharp and Java we can add Python and some other languages which are more
popular. And that will also give us some more tools to explore, like [inaudible] feedback
generation. So thank you very much.
>> Rishabh Singh: Hi everyone. Today I'm going to talk about some of the things I thought might
be interesting for the audience here, since we'll be getting the Code Hunt data, and what kind
of analysis we can do with it. So this is actually joint work with the [inaudible] group at MIT.
Elena Glassman and Jeremy Scott are the students; Philip Guo, who used to be a postdoc, is now
a professor at the University of Rochester; and Rob Miller leads the group. And actually the
motivation is quite clear, as everybody before has said: it's already a big task for teachers in
classrooms with 200 students to go over each submission and figure out what's happening. And
with these online platforms, [inaudible], Coursera, things coming up, there are more than one
hundred thousand students, and there's no way a teacher can go over each submission and figure
out what students are doing, what things need to be focused on in the next lecture. So the goal
in this project was: can we build something where a teacher can easily go over millions of
submissions and try to see what students are doing in the classroom?
And actually, as we have already seen, there's been a lot of work on providing feedback for
programming assignments, and most work that I'm familiar with I can characterize into two
parts. One is more programming-languages-based, where we did some work before on autograding,
and there's some work which I may briefly mention on [inaudible] feedback. So that's coming
more from the PL side. There's been a lot of work from the machine learning community as well.
A lot of people at Stanford have been working on this system called Codewebs which tries to use
machine learning to give feedback. And I'll go briefly into the pros and cons of each of these
approaches.
So in this particular project, OverCode, we tried to combine the two things: we tried to use
programming language techniques to complement machine-learning-based techniques to build a
system called OverCode. But actually let me digress a little bit, because I think we've seen
lots of interesting talks yesterday as well as today on giving good feedback on programming
assignments, so I also briefly want to mention these PL-based techniques that we are also
working on to give feedback on programming assignments. So the idea here is that we have Python
programs coming in from the edX class and they have all kinds of mistakes, and we have built a
system called AutoProf which tries to do some analysis. It doesn't matter what kind of program
the student has written; we just need one answer from the teacher and something which we call
an error model, which corresponds to common mistakes students are making, and then we give
syntactic changes to the program so that the program becomes correct.
So the problem is: give me a minimum number of changes to the program to make it correct.
So let me, it would probably be better to just give a brief demo here. So this is a problem
which asks students to compute the derivative of a polynomial in Python, and this particular
submission is wrong, but it's hard for a TA to just look at it and figure out what's going wrong
here. So the system tries to do some analysis and figures out that actually the program is
almost correct; it just needs two syntactic changes. In line number 19, instead of i greater
than or equal to zero you need to make it something else, and in this case there are many
choices. The system comes up with one particular choice, it says not equal to, and now
hopefully the system would say the program is just one step closer to the solution. Some kind
of progress indicator you can have this way.
So here actually we are giving the answer right away; in the version we are deploying on the
edX platform we don't want to give everything, because that would take away the learning part.
So we have different hint levels here. We can say that something is wrong in this particular
expression in your code, go check it; we can do different levels as well. And the nice thing
about this is that it doesn't depend on what kind of programs students are writing. So here
somebody wrote a for loop to solve the same problem, again with some mistakes.
Somebody else used a list comprehension to solve it. This was the most interesting one I found.
The first time I saw this I thought it was completely incorrect, because the student is only
returning the result when the [inaudible] of the input is less than two, but the inputs can be
of arbitrary length. It just so happens that this particular student is employing a very
interesting tactic, where the student keeps popping elements off the list, doing some
computation, and when you have popped enough you want to return. So here actually the program
is really close: it says you just need a base case, and some initialization is wrong here.
So this is the class of mistakes the system can find and give feedback on. And the technique,
briefly, I won't go into too many details, is like this: you have some student solution and you
have a teacher solution. In your error model you describe all kinds of changes, the kinds of
mistakes students can make. So the system goes in and starts modifying the program, mutating it
in all possible ways based on the error model, and this is a big space; typically you get about
10^12 to 10^15 different mutants of a program. And then, to check whether any particular mutant
is correct or not, it does a functional equivalence check with the teacher solution. And by
functional equivalence I mean it checks them over a large space of inputs; in our case we do up
to 10^6 different inputs. We make sure that on the same inputs the mutant and the teacher
solution have the same outputs, so this way we can have some confidence that the fix is
correct.
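A toy sketch of the idea, not the actual constraint-based system: one error-model rule, relaxing
i > 0 to i >= 0, yields a candidate mutant, and the mutant is accepted if it agrees with the
teacher solution on a large set of random inputs. All function bodies here are hypothetical.

```python
import random

def teacher_poly(coeffs, x):
    """Teacher solution: evaluate a polynomial given its coefficient list."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def student_poly(coeffs, x):
    """Buggy student solution: skips the constant term because of 'i > 0'."""
    return sum(c * x ** i for i, c in enumerate(coeffs) if i > 0)

def mutant_poly(coeffs, x):
    """One mutant proposed by the rule 'i > 0 -> i >= 0'."""
    return sum(c * x ** i for i, c in enumerate(coeffs) if i >= 0)

def equivalent(f, g, trials=10_000):
    """Approximate functional equivalence by agreement on random inputs."""
    for _ in range(trials):
        coeffs = [random.randint(-5, 5) for _ in range(random.randint(1, 5))]
        x = random.randint(-10, 10)
        if f(coeffs, x) != g(coeffs, x):
            return False
    return True

print(equivalent(student_poly, teacher_poly))  # False: the bug is real
print(equivalent(mutant_poly, teacher_poly))   # True: this mutant is a fix
```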
The problem is, if you do this in a really naive way, and assuming it takes one millisecond to
do a functional equivalence check, it will take more than 30 years to give feedback on just one
assignment, and that's too long a time. So here the secret is that we solve this problem using
program synthesis. So essentially I can characterize PL techniques for giving feedback like
this: we give teachers a way to describe the common mistakes students are making, so they are
describing error patterns; then, to search over this large family of mutants, we use
constraint-based search to solve it efficiently; and then, to check the equivalence, we use
program equivalence techniques that have been developed in the verification community. And the
nice thing with PL-based techniques, and this is the most important point here, is that the
teacher has a way to incrementally add new rules. So it's not a black-box thing. So the good
thing about PL-based techniques is that teachers have control over what kinds of mistakes they
are looking for. And actually we used it over lots of benchmarks from the edX class; in the
beginning it was taking quite a bit of time, but now we have made the system quite efficient, so
it came out to be about 14 dollars per assignment for the edX team and they were quite happy
with it.
So these are just some numbers on how well the system performs. Over 13,000 submissions the
system is able to give feedback in about 10 seconds for 64 percent of the cases. And this is a
little more detailed; if you're interested in this technique I can talk about it afterwards. So
this was the PL-based way of giving feedback, in some sense using program analysis and
verification-based techniques.
There's this other community who are also looking at giving feedback. This is more machine
learning-based, and this particular system is called Codewebs, from Stanford. And they wanted
to give feedback on assignments from the Coursera machine learning class.
>>: I had a question about the previous one. It seems like from the [inaudible] standpoint you
might want to be a little less specific when you give the feedback. Do you have any way of
abstracting the feedback?
>> Rishabh Singh: Actually, so I didn't go into too many details, but the version we are putting
on edX is not going to say that you did something wrong in this line, change it to plus 1; it
would say print this expression at this line. We want to also teach students how to debug, so
it would say you want to print the value of i at this particular line in the program, or it
might say that this expression looks buggy or something. So it won't really give you the answer
but it will guide you towards learning debugging.
So what this particular technique does is take these assignments, get the ASTs out of them, and
then compute all kinds of features, so they call it a forest of trees, lots of syntactic
features, and then they use clustering, a [inaudible] clustering technique here, and then they
get this nice graph. So here what's happening is you see all these very close chunks; the idea
is all the programs that look similar would go into that chunk, and now what an instructor can
do is go over these chunks and look at not all assignments in that particular chunk but only a
few of them and give feedback, and this way you can do power grading.
But some shortcomings here are that this kind of graph is really hard for teachers. We used
these things and showed them to instructors, and it's very hard for them to make sense of it,
because sometimes you would find two submissions in one cluster; it just so happens that because
the feature space is very high-dimensional, the [inaudible] clustering thinks they should be
together, but it's not clear why they should be together, so it's hard to understand. It is
also computationally expensive: this particular thing took multiple days on large clusters to
come up with this clustering. So just to summarize, PL-based techniques require-
>>: How many programs, do you have any idea? Approximately.
>> Rishabh Singh: One hundred thousand. So this is of course from Coursera, the machine
learning course. They want to give feedback. So if you want to compare PL-based techniques
with machine learning: one bad thing about the PL-based technique I showed you was there's some
manual effort here, the teacher needs to go in and say these are the classes of mistakes I'm
looking at, whereas machine learning is quite automated, it's data driven. But the good thing
about PL-based techniques was that the teacher has control of what's happening, the teacher can
comprehend the error models, the kinds of patterns they want to give feedback on, and also it's
more interactive in the sense that if something doesn't work the teacher can go in and add
rules, whereas machine learning is more a black-box, one-shot technique: if it works then it's
good, but it's not clear how to improve it.
>>: You mentioned that in the machine learning technique, because of the [inaudible] in high
dimensionality, they put solutions together just because there are lots of dimensions, not
because they were really similar?
>> Rishabh Singh: Yeah. Exactly. So the problem is you get a feature vector out of these
programs, and then they are mapped to higher dimensions, and then the clustering actually
happens in the higher dimension. It doesn't happen in the lower dimension.
>>: [inaudible].
>> Rishabh Singh: Yeah. Exactly. So the correlation is not clear between the syntactic features
and the features that they use for clustering.
>>: It seems that the key difference would be in the domain knowledge that you apply in your
error models, right? I don't think it's particularly about PL. [inaudible]. In an ML solution
if you provide some domain knowledge you may get some understandable result, is my guess.
>> Rishabh Singh: Right. So that question is, yeah, exactly. So there does need to be similar
feature engineering: somebody has to say that these are the features that I'm interested in,
and it's not clear that somebody not in machine learning would be able to specify the features
that would work for this particular domain. But with an error model it's made clear that this
is a mistake that I'm looking for, these are common patterns. So it's a little bit more
intuitive, but you're right: if someone knows machine learning and how clustering algorithms
work, they can do feature engineering. So you can provide those insights through features as
well.
So now let me briefly go over the system I want to talk to you about. So we wanted to combine
both PL-based techniques and machine-learning-based techniques, and this is the system we have
built, which is called OverCode. And here the idea is there are submissions coming in from the
edX class, more than 100,000, so as an instructor who's teaching the class, how can the
instructor keep track of what's happening? So in some terms you can think of it as a dashboard,
an interactive dashboard, that we could have with Code Hunt as well, to try to make sense of
what kinds of programs people are writing, and you want to do some analysis and search over it.
And the motivation is again similar: there's a lot of variation in programming assignments, the
same assignment can be solved in many different ways. The focus here is not so much on
correctness; the focus here was more on the design of the programs, more software engineering
skills, where you want to say it's better to use these constructs in this manner rather than
that manner, or some design patterns. So we have this system, which we're trying to make
lightweight so normal teachers on laptops can use it without having access to clusters.
And the goal is to help teachers identify interesting examples from the data to go to the
classroom and say these are pedagogically interesting examples; and one side effect is that
once you have nice clustering you can also use it for giving feedback and power grading.
So let me also briefly show you the system. So this is a problem from the edX class which asked
students to solve this [inaudible] power problem, which takes a base and an exponent and they
have to compute base to the power exponent. So this is a little complicated, but actually
what's happening is we have this concept of stacks, which are these rectangular boxes, and they
have a number associated to the left. So it says there were 1500 submissions that did something
like this. I'll tell you how we can sort these, but the idea is 1500 students were doing
something similar to this, and then you have different stacks going this way which the system
thinks are different. So these are essentially clusters; every stack represents a cluster. And
then the teacher has access to some filters on the right. So here the system comes up with all
the interesting lines and expressions in the code that students have submitted, and the teacher
might be interested to say, I want to see which students are actually using zero as an argument
to the range function. So here the teacher can filter these assignments and the stacks
rearrange, and now the teacher can give the feedback that when you are using range, zero is
implicit, so you don't really need to provide zero, and go over these programs.
They can also filter by other things. So let me show you one example. So here the teacher is
saying that cluster one and cluster two are kind of similar; I don't really think they should be
different. So again, there is this interactivity coming in, in that the teacher can provide
rules saying these two are the same, instead of defining features for a machine learning
algorithm to come up with matching clusters. So here the teacher can say, actually, result
equals result times base is something that I'm not interested in distinguishing. Some teachers
might want to teach that variable equals variable operator operand can be rewritten with the
operator-equals shorthand, but in this case the teacher is not interested, so the teacher can
say, I think these are the same. So here we can provide additional rules to help the clustering
algorithm to then collapse all the clusters where the difference was just that. Yeah.
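A sketch of such an instructor-supplied rule, with a hypothetical rule format: normalizing the
augmented-assignment shorthand so that result *= base and result = result * base land in the
same cluster.

```python
import re

def normalize(line):
    """Rewrite 'var op= expr' into the long form 'var = var op expr'."""
    match = re.match(r"^(\w+)\s*([+\-*/])=\s*(.+)$", line.strip())
    if match:
        var, op, expr = match.groups()
        return f"{var} = {var} {op} {expr}"
    return line.strip()

# Both spellings now compare equal, so they no longer split a cluster.
assert normalize("result *= base") == normalize("result = result * base")
```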
>>: So you showed your data is what students have written. So how much do you currently have?
How many data items?
>> Rishabh Singh: So this interface is kind of a demo version. So it's working over I think
6000 submissions.
>>: How many?
>> Rishabh Singh: 6000.
>>: 6000. So that's sufficient for what you need, do you think?
>> Rishabh Singh: So the goal is actually, since this is kind of a scalable algorithm, we can
feed it with one hundred thousand as well, and I think the storage is already [inaudible] with
the edX class data.
>>: I think the question I am aiming for is, are you finding interesting rules and things, or
interesting mistakes and so on, in the same way as [inaudible] was? So if you had more data do
you think that you would find more things, or do you think it would just be repetition, that
there's actually not more interesting things to find?
>> Rishabh Singh: Right. So actually the goal of this interface was to find interesting things
in the data itself, to collapse as many things as possible that I don't care about into one
stack. The more data, the more chances I'll find interesting things, but it depends on the
students and how they are writing the code.
>>: So what that's leading up to is, once we provide you with the Code Hunt data, obviously you
need to provide a [inaudible] in Python.
>> Rishabh Singh: So a good thing is that it now also works for Java.
>>: So you could take that data and then put that in and you could then incorporate it.
>> Rishabh Singh: Exactly. So I'll just show you; we gave this system to teachers, and I'll show
what kinds of studies we did. But, yeah, so one of the features is it will show you different
stacks and it will also tell you what the differences between the two are. So you'll see that
some of the lines are grayed out; only the differences are apparent, so this way it can also
help teachers figure out what's happening. And then there's this legend thing, which I think
doesn't make sense right now until I go into the algorithm part, but here teachers can see what
values the variables are taking for different test case values.
>>: Are the variables renamed by your code?
>> Rishabh Singh: I'll tell you, yeah, exactly. So that's where the PL part comes in in
clustering. So I'll briefly go over the algorithm, the main steps. So first of all we try to
reformat the programs that are submitted: we remove all comments, we do whitespace tightening
and everything so that the programs are kind of clean. And now, this is where we actually do
some dynamic analysis. We run the programs on different test cases and we get all the traces
for the program. So here in this example it would say, when I run on 5,3 these are the
different program states I get at different program points. And essentially these are program
traces, and then we do an abstraction where we try to remove duplicate values. For example, if
I have this loop index i, it would have values 0, 0, 1, 1, 2, 2. So instead of having
duplicates we do an abstraction and say make it 0, 1, 2, because some people might have multiple
lines in the loop and some people might have single lines in the loop. So this is some kind of
a trace abstraction we do, and now we get these values for each variable in all the programs.
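The abstraction step itself is tiny; a sketch:

```python
from itertools import groupby

# Collapse consecutive duplicate values, so a loop body with two statements
# and a loop body with one statement produce the same abstracted trace.
def abstract_trace(values):
    return [value for value, _group in groupby(values)]

assert abstract_trace([0, 0, 1, 1, 2, 2]) == [0, 1, 2]
```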
Then, once we have these variable values, we take two programs at a time and try to see which
traces correspond to each other. So even though these programs are kind of different, in the
sense that they are similar but use different variable names, somebody used r, somebody used
result, now you can match which variable corresponds to which variable based on these trace
values. So we do this across all the programs in the code base, and we find the most common
name used for each trace sequence. So here we find that for this particular trace value most of
the students used result; some other students used other variables, but result was the most
common one. Similarly, for other trace values different variable names were used, and then
essentially we do a majority voting.
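A sketch of the voting and renaming, with hypothetical trace keys: each abstracted trace is
mapped to the most common variable name used for it across the code base, and every program is
renamed accordingly.

```python
from collections import Counter

# Names observed for each abstracted trace across all submissions (hypothetical).
names_per_trace = {
    (0, 1, 2): ["i", "i", "j", "i"],                    # loop counters
    (1, 5, 25, 125): ["result", "r", "result", "res"],  # accumulators
}

# Majority voting: one canonical name per trace.
canonical = {trace: Counter(names).most_common(1)[0][0]
             for trace, names in names_per_trace.items()}

def rename(program_vars):
    """Map each variable in one program to the canonical name for its trace."""
    return {var: canonical.get(trace, var) for var, trace in program_vars.items()}

print(rename({"r": (1, 5, 25, 125), "i": (0, 1, 2)}))
# {'r': 'result', 'i': 'i'}
```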
When we get a new program we rename its variables based on the sequence of values those variables take. So this program would get converted into something like that, where r changes to result. The idea is that once all programs have similar variable names, it becomes easier for teachers to understand what different variables are doing in different submissions. So this way we rename the variables, and then there are some corner cases that happen. One of the cases is that I try to rename something to x and the student was already using x, so there's some way to handle that. And finally we now need to create the clusters. What we do is take two programs that have been renamed and reformatted, so they look pretty much similar, and there's one more abstraction we do: we get the lines of each program, and instead of doing line-by-line matching we do set-based matching. So we say this set of lines corresponds to that set of lines, and if everything is the same we create a cluster out of it. Any questions up until now? Yeah.
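A minimal sketch of that set-based clustering, assuming the programs have already been renamed and reformatted; the names and details are illustrative. Comparing sets of lines rather than sequences is what lets two submissions that merely reorder independent statements fall into the same stack, and hashing each line set into a dictionary bucket is what keeps the clustering linear in the number of solutions.

# Cluster renamed, reformatted programs whose *sets* of lines match,
# ignoring statement order. (Illustrative sketch only.)
def line_set(source):
    return frozenset(line.strip() for line in source.splitlines()
                     if line.strip())

def cluster(programs):
    stacks = {}  # line set -> list of programs in that stack
    for src in programs:
        stacks.setdefault(line_set(src), []).append(src)
    return list(stacks.values())

# Two submissions that differ only in the order of two independent
# lines end up in the same stack.
a = "total = 0\ncount = 0\nfor x in xs:\n    total += x\n    count += 1"
b = "count = 0\ntotal = 0\nfor x in xs:\n    count += 1\n    total += x"
print(len(cluster([a, b])))  # prints 1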
>>: How big are the programs you’re typically dealing with?
>> Rishabh Singh: Right. So actually the edX class was Introduction to Programming, so most of the programs here were about 10 to 15 lines. It's now being used in the Software Engineering class, where programs are about 100 lines of Java essentially. Yeah.
>>: Can you kind of tweak the system so that you also identify plagiarism at the same time? Because in a sense you're looking for similarities, so maybe if you find cases that are very similar but not that common, that's an indication of the students actually-
>> Rishabh Singh: Right. Exactly. So you can go down the list of stacks and look at the small-sized ones. You could do that as well. I don't think they're using it that way yet, but that could be useful. Yeah.
>>: [inaudible] rare traces, so you don't rename them, like a student used an unconventional solution [inaudible].
>> Rishabh Singh: So that's actually a good thing. The idea of this interface is to surface these anomalies, to identify them. You had this legend tab on the right; there you can actually see all these funny sequences, and it's also supposed to tell you how many solutions each variable is used in. That way the teacher can go in and say, this looks highly unusual to me; I want to check that. So from this large set of programs you can actually find the interesting programs as well.
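For what it's worth, the rare-trace flagging the legend supports could be sketched like this, reusing the trace layout from the earlier sketch; the threshold and names are assumptions for illustration.

from collections import Counter

# Flag traces that appear in only a few solutions; these are the
# unusual sequences a teacher might want to inspect. (Sketch only.)
def rare_traces(programs, threshold=1):
    counts = Counter()
    for program in programs:  # program: variable name -> trace
        for trace in program.values():
            counts[trace] += 1
    return [trace for trace, n in counts.items() if n <= threshold]

programs = [{"result": (0, 5, 8)}, {"result": (0, 5, 8)}, {"acc": (8,)}]
print(rare_traces(programs))  # prints [(8,)]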
>>: I was just wondering for this little example when you swap the two lines around do you also
do some kind of inference to tell you that you're allowed to do that?
>> Rishabh Singh: That's a good point actually. So there are a lot of hidden assumptions behind this: since we are renaming all the variables and the programs are all functionally correct, the idea is that when you're doing the set-based abstraction you might sometimes collapse two things that shouldn't be together, so there's incompleteness in that sense. But typically, with high probability, the lines are going to be independent of each other and the order is not going to matter that much. But yeah, we don't do any analysis of when this can or can't be done.
Okay. So let me quickly also tell you about the user study: we gave this system to the teachers who were teaching the edX class, essentially the TAs who were helping the professor teach it. There were a few hypotheses we wanted to check with the system. Does the system help teachers be more satisfied? I'll tell you what that means. Does it help them read more solutions, and faster, than what people typically do? Does it help them give better feedback? And does it help them gain confidence that they've given good feedback to the class?
So study one was a kind of open-ended task where we gave them two ways to look at the programs. One was the more traditional way: there's this big list; you can imagine a giant HTML file with all the submissions. The other was using our system. And the task was to write a Piazza post in 20 minutes, using both interfaces, about one of the things they found interesting in this data set. So it was very open-ended; we didn't give them any specific things to do.
There were 12 TAs here, and almost everybody had graded these particular assignments before, and we had some qualitative questions. Based on those surveys, the result was statistically significant: almost everybody said the system was much easier to use on a Likert scale from 1 to 7, it helped them get an understanding of what students were doing in the class, and it was less overwhelming than just getting a giant HTML file. So it was reasonable that they liked the system, and there were some qualitative comments that they liked certain features more than others.
But the problem was we weren't able to get any quantitative data out of that study; it was very qualitative. So we designed another study where we focused on getting more quantitative data. Here we told the teachers, the TAs: don't just write a Piazza post, but go over the data and tell us the five most important things you found to tell the class, and also tell us how confident you are about them. So it was a little more focused, more concrete task. And again, we recruited 12 more people.
So there were some interesting observations here. In this particular graph, control is the baseline and OverCode is our system. For different problems it shows that when teachers were using the baseline, in this particular graph, they were actually able to look at more solutions than with OverCode, and I'll explain why that is. But actually for this particular problem, even here, they were able to read more solutions.
So the difference is that when they read one solution in OverCode they're actually looking at a large family of solutions, whereas in the baseline when they look at one solution they're looking at just one solution. So when you multiply by the sizes of the stacks they were reading, the difference becomes quite appreciable. In some cases they were able to cover 64 percent of the solutions in 10 minutes, so in some sense this interface helped them go over students' submissions much faster than the baseline.
Then the second thing we wanted to look at was, when they gave feedback, when they talked about the five important points, how many students each point applied to. In this case we found, interestingly, that for these two problems there wasn't really a statistically significant difference between the baseline and our system; but for the quantitative problem, the feedback they gave talked about points that applied to many more students than with the baseline. It also helped them be more confident: everybody was much more confident that they had covered everything in the class as compared to using the baseline. And some TAs were so excited about the rewrite rules that in the five minutes they were given they did all kinds of analysis and wrote lots of rules like this.
So in summary, I showed you an interactive visualization tool for going over thousands of programming solutions. It's kind of language independent: I showed it here for Python, but they're already using it in the Java class. It's lightweight: the clustering technique I showed you is linear in the number of solutions, as compared to the quadratic behavior that is typical of clustering algorithms, so it's scalable. And it helps teachers get an overview of the class and better understand the kinds of solutions students are submitting. That's it. Thanks.
>>: What's TOCHI?
>> Rishabh Singh: Which one?
>>: TOCHI.
>> Rishabh Singh: So TOCHI is actually a journal, ACM Transactions on Computer-Human Interaction; I think it's something different from CHI itself. What happens is that CHI has this arrangement with the TOCHI journal where, if you submit within a certain timeframe, you can present the paper at CHI. So this is actually going to be at CHI this year.
>>: So, I'm interested, a bit off track here: why did you choose to present this work at CHI?
>> Rishabh Singh: That's a good point. So most of my collaborators on this particular work are at [inaudible], and it's their top conference. And the thing is, there's a lot of interesting education work in the CHI community. They were looking more toward the interface side, and here we wanted to combine a little bit of PL-based semantic techniques with [inaudible] issues.
>>: So do you think that's a community that we should also address?
>> Rishabh Singh: Actually, I think CHI is probably the biggest community working on education right now. I think they have separate sessions inside CHI just on education, and this particular paper is going to be presented in one of those sessions. It's a big community, and they also started this conference called Learning at Scale, which is about [inaudible] and education. And that's primarily [inaudible] people. So yeah, they are actually very interested in education-related research.
>>: So for the system that you developed, is there any plan for engaging the community, or open-sourcing part of it?
>> Rishabh Singh: So the student who was working on it, I'm not sure if she has already open-sourced it, but the goal was to open-source it, and I think somebody at UDub is also using it. So the goal is to have the community use it and build on top of it.
>>: Excellent. So we can start coffee early for a change and come back for our third paper.