>> Nikolai Tillmann: Hello. It's my pleasure today to have here Darko Marinov,
who is now an assistant professor at Urbana-Champaign, Illinois. Before that
he got his Ph.D. at MIT, and even before that he was actually interning with
us for three months, so we have known him for a long time. And he's been
working on testing and how to improve it. So I'm actually looking forward to
hearing his latest results today.
>> Darko Marinov: Okay. Thank you. Thanks, Nikolai. Hello, everyone. Okay.
So the latest results will be on this work on automated testing of refactoring
engines using something we call test abstractions. I'll describe what I mean by
these test abstractions. I'll be here talking, but of course the work was
actually done by students, so here's the list of people who were involved in
that: Brett Daniel, he's my Ph.D. student, he was actually the main person
behind this work, and the others who helped are Danny Dig, Kelly Garcia, Vilas
Jagannath and, more recently, Yun Young Lee.
The work was also supported by the National Science Foundation through some
grants, and then a gift from Microsoft. It was a small gift. If you have money
to give for testing, it would be good. So what we wanted to look at was the
testing of refactoring engines. Let's first see what these are. Refactorings
are behavior-preserving program transformations. The idea there is that you
want to change the code of the program, but you don't want to change its
external behavior.
The reason why people want to do this refactoring is typically to improve
program design and to make the program easier to maintain or easier to use and
so on. Some examples involve things like renaming a class: maybe you were
writing an object-oriented program and you had some class that you called A,
and you realize that A is really a bad name, so instead of A you should have
called it, you know, Airplane if it models an airplane or something like that.
So you simply want to go and apply this change across the program. Now, you
can do this manually, but it's kind of tedious to go everywhere in the program
and find wherever you had A and replace that with, say, Airplane. So what you
want to do is actually to automate this process. And refactoring engines are
programming tools that automate these applications of refactorings. So if you,
say, want to rename class A to Airplane, you just go to the tool, tell it that
that's what you want to do, and it automatically traverses the entire code and
finds where to make these changes.
And these refactoring engines are included in many modern IDEs. Here are
examples from two Open Source IDEs, Eclipse and NetBeans. If you go to the
top-level menu, they actually offer these refactorings that can be applied. So
on the left is Eclipse; they started building their refactoring engine a few
years before the one in NetBeans, so they have more of these refactorings
built in, things like renaming program elements, moving them, you know, things
like convert local variable to field and so on.
So NetBeans has somewhat fewer of these. And then Visual Studio also has a
number of refactorings built in, but I think even fewer; it has something like
four or five. We did not test Visual Studio just because it's not Open Source,
so we didn't have easy access to it or to adding some changes to the tool
itself to make it easier to test.
So why did we want to test these refactoring engines? Well, one thing is they
are widely used. Programmers actually want to use the refactoring engines
rather than manually making these changes throughout the program. Also, they
are complex. It's kind of interesting to test something that's not simple, and
they're complex in two dimensions. First, the kinds of inputs that they take
are complex: the refactoring engine takes an input program, takes source code,
and as output it also produces changed source code, say after just renaming a
class. So the inputs themselves are complex, and also the code of the
refactoring engine itself is complex: it usually needs to perform some
sophisticated program analysis to find out whether it's fine to perform a
refactoring and how to change the program, and then it also needs to perform a
transformation that actually goes and makes the changes in the program.
And then what's also important is that a buggy refactoring engine can have
severe consequences on development, namely if there is some bug it can
sometimes silently corrupt large parts of the program. If you are a developer
building some program and you apply a refactoring, it may happen that the
refactoring engine just goes and silently corrupts something: say instead of
replacing your class A with Airplane it goes and replaces it with something
else, and now things don't work. Sometimes you can find this easily because
the output program does not compile. So if there is a bug in the refactoring
engine you can easily detect that there is a bug. But sometimes that's not the
case, and if you find a bug that's very well hidden, it's as unpleasant as
finding a bug in the compiler itself.
And last but not least, we wanted to test refactoring engines because they
contain bugs. If you do some research on testing you should try to pick
something where there are bugs so you can say that you found bugs. But more
importantly, it was because people use these tools, so they looked like a
significant and complex application and something interesting and challenging
to test.
So here's an example of how a refactoring looks. There's this refactoring
called encapsulate field, and what it does is replace all field accesses with
appropriate accesses through getter and setter methods. So on the left here we
have an input program, a small program for illustrative purposes: there's only
one class A, only one field, this F, and there is some method M here that
accesses this field F. So it reads F, multiplies it maybe with some I, and
then writes that back to F.
Now, what we want to do here is replace all these accesses to the field F with
uses of getter and setter methods. The reason why we want to do that, again,
is to kind of improve the design of our program: it's considered bad practice
to access fields directly; it's much better to access them through getters and
setters. In this simple example the field is accessed from the same class, so,
you know, it's fine to access it there, but in general these field accesses
could have been from different classes.
Now, we can use the tool to do this. We just go to the field F, we click there
and say encapsulate this field, and what the tool does is make the changes
shown there on the right. So this on the right is the output program. And it
makes five changes. First it adds the getter method, this method here that
just returns the value of F. Then it adds the setter method, the one that can
set the field F: given some new value it just sets F. Then it replaces the
field reads with calls to the getter method, as in this thing here. Then these
field writes, here where F was set, it replaces with calls to the setter
method. Then finally it makes the field private, such that if it was accessed
from another class it cannot be accessed directly any more. Okay.
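Since the slides aren't reproduced in the transcript, here is a minimal sketch
of the input and output being described, assuming lowercase Java names for the
spoken F, M, and I:

    // Input program (before encapsulate field)
    class A {
        int f;
        void m(int i) {
            f = f * i;   // reads f, multiplies by i, writes back to f
        }
    }

    // Output program (after encapsulate field): the five changes
    class A {
        private int f;                        // 5. field made private
        int getF() { return f; }              // 1. getter added
        void setF(int f) { this.f = f; }      // 2. setter added
        void m(int i) {
            setF(getF() * i);                 // 3. read via getter, 4. write via setter
        }
    }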
How many people here have used refactorings? Okay. Almost everyone. So here's
a bug that we found in Eclipse. Here we have again a very similar input
program on the left. In this case we have two classes, one of which subclasses
the other. And if you apply encapsulate field here, here's the output program
that Eclipse generates. And as it turns out, there is actually a bug here. The
bug is right here. What happened was that we had what is effectively a field
write -- we were setting super dot F to zero -- but because the engine has
some bug, it thought that this was a read of the field F. So the output that
it generated said somehow that you want to get the field F and then do
something with it. As a matter of fact, this thing here would not even
compile. What it should have generated is this: super.setF(0).
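A reconstruction of the shape of this bug, with hypothetical names based on
the description (the exact buggy output isn't legible from the transcript, so
it is only summarized in a comment):

    // Input: a subclass writes to an inherited field through super.
    class A { int f; }
    class B extends A {
        void m() {
            super.f = 0;   // effectively a field write
        }
    }

    // Eclipse misclassified the write as a read and emitted code built
    // around getF() that does not even compile. After the refactoring
    // adds setF to A, the correct replacement would be:
    class B extends A {
        void m() {
            super.setF(0);
        }
    }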
Here is another bug that we found in NetBeans. Here the bug is slightly
different in that the output program does compile, but the refactoring was
still not applied correctly. Again, the input program is something small: we
have one class with one field, and there is some method. And we are writing to
this field F in a somewhat weird way -- we have these brackets around the
receiver, so we are setting the field F of some object A to zero. We want to
encapsulate this, and this is what the refactoring engine gives: it properly
introduces the setter and getter methods; however, it does not replace this
access, it just leaves it as it was. So the bug again is here in this line,
and what it should have done is generate the setter call; it should have been
(a).setF(0). Makes sense? No, yes? Nikolai doesn't think that this is a bug.
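For concreteness, a sketch of the pattern being described, with hypothetical
names (a field f accessed through a parenthesized receiver from within the
same class, which is why the buggy output still compiles):

    // Input: a field write with brackets around the receiver.
    class A {
        int f;
        void m(A a) {
            (a).f = 0;
        }
    }

    // NetBeans output: getter and setter are added and f becomes private,
    // but the bracketed write is left untouched.
    class A {
        private int f;
        int getF() { return f; }
        void setF(int f) { this.f = f; }
        void m(A a) {
            (a).f = 0;   // BUG: should have become (a).setF(0)
        }
    }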
>>: [inaudible]. It's a refactoring, but what is a bug here, anyway?
>> Darko Marinov: Okay. That's actually a very good question. I mean, what's
even the definition of a bug in a refactoring engine? I guess part of what you
want is to preserve behavior, right? So if it generated something where, you
know, when I run the program I get a different value, we could say that that's
definitely a bug. But if that's the only thing that you want out of a
refactoring engine, then it would be very easy, right, I could just build, you
know...
>>: [inaudible].
>> Darko Marinov: Sorry?
>>: In this case, M is in the class, so it has access to [inaudible].
>> Darko Marinov: Okay. Yeah. I guess --
>>: [inaudible].
>> Darko Marinov: All right. So what [inaudible] is saying is that in this
example F is inside the class. Let me first answer that question, then I can
go back to the other one. Yes, in this case F is indeed inside. So you could
even say that maybe when you do this encapsulate field, you don't need to
change the accesses that are inside the class. It's fine to access those
directly; you don't want to change those to setters and getters.
But the same bug would have occurred even if M was outside. If M was in
another class B accessing this F, it would still not replace this. In that
particular case you would also then get a compile error, because the field F
became private and you are trying to access it from outside. So that's kind of
the answer to this specific question.
But going back to this more general question, you know, what is a bug in a
refactoring engine? You want to preserve behavior, but you also actually want
the change that you asked for to be made. Here is actually an interesting
example of something like this: we wanted to replace some field accesses, and
the refactoring engine did not replace one of them. We artificially created an
example to show how a bug here may have, you know, big consequences. There was
some field that was measuring some temperature, and what we wanted to do was
to say, for this set temperature: if the temperature you want to set exceeds
some limit, raise an exception rather than, you know, exploding something
somewhere. If you want to do that, it means that once you encapsulate the
field, you start adding some behavior of what you want to happen whenever you
set this field F. And if not all the writes to F, you know, properly go
through this setF, then it could just miss something like that.
So in an example like that, we would get some boiler to explode or something
like that.
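A minimal illustration of that scenario, with hypothetical names: after
encapsulating, a safety check goes in the setter, and any field write the
engine failed to replace silently bypasses it.

    class Boiler {
        private static final int LIMIT = 100;
        private int temperature;

        void setTemperature(int t) {
            if (t > LIMIT) throw new IllegalArgumentException("over limit");
            temperature = t;
        }

        void update(Boiler b, int t) {
            (b).temperature = t;   // write missed by the refactoring: no limit check
        }
    }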
But I agree with you that these examples may look, you know, somewhat simple
and artificial, and one may ask, you know, is this really a bug or not? But
usually you can construct some bigger bugs from these smaller, simpler
examples. Nikolai's still not convinced about that.
>>: [inaudible] so if you say a [inaudible] then the bug surfaces, but then
again it wouldn't really be about the refactoring; you have some other
invariant in the program which is violated, but [inaudible] how can you blame
the refactoring engine? So what's the definition [inaudible] refactoring
engine?
>> Darko Marinov: Well, I guess the question is the definition of what it
means to be a bug. Okay. So I guess one way you could define that is actually
going and implementing your own refactoring engine and saying: if this other
one doesn't generate the same thing as mine, that would be a bug. So, say, for
encapsulate field, the definition of encapsulate field says, you know, it
replaces all field reads and writes with accesses through getter and setter
methods. So I guess the point here is that it should replace all field reads
and writes.
Of course you could have a different encapsulate field which says, you know, I
do not replace those that are within the same class -- encapsulate field
outside of the class -- in which case you could say, you know, it replaces all
accesses that are outside the class with getters and setters. So, you know,
the definition would be: you could, you know, formalize what this means and
then check whether the output program really satisfies this property or not.
Does this make sense?
>>: [inaudible] you expect [inaudible] and then you [inaudible].
>> Darko Marinov: Yes. So if you run encapsulate field you do expect that all
the field accesses were replaced with get and set, so if you later add some
behavior to these getters and setters, you hope that everything was indeed
replaced.
>>: So in a sense the [inaudible] -- is that actually also how you check for
bugs, or how do you discover these bugs?
>> Darko Marinov: That's a very good question. So maybe I can postpone that
for a few slides and discuss what oracles we used for that. All right. So
these were the examples -- maybe you can trust me so far that these are bugs,
or maybe, you know, you wouldn't call them bugs; we can discuss later whether
these are really bugs or not. So one way we check whether this is a bug or not
is to take the same input program, say this one, and run it both through
Eclipse and NetBeans. And if they give something different, then, you know, we
ask the question: is there really a bug? Is one of them wrong, because they
are getting different results, or is it simply that, you know, your spec is
[inaudible] can give any of these two.
I think that was the way we discovered this one, because the things that
compile are actually the most problematic bugs. Because if something doesn't
compile -- if you go back here, if you just apply encapsulate field and you
get this thing -- you know, it will immediately tell you that it does not
compile; I can just immediately see that something went wrong. If I give you a
program that compiles and you make some change and it doesn't compile anymore,
you know that something went wrong.
But here it's much trickier, because, you know, if everything compiles and it
looks seemingly fine, you could even find that it preserved behavior but still
didn't do what you wanted it to do. All right. So how does one test a
refactoring engine? This is kind of the general setup. You take the
refactoring engine, you give it some input program which we want to change,
you tell it what refactoring to apply, say encapsulate field or rename class
or something like that, and what the refactoring engine gives you as output is
either a refactored program, as we've seen in these examples before, or it can
instead give you a set of warnings and say: oh, I cannot actually go and
refactor this, I cannot apply this for several reasons. Say, if you want to
rename the class A to be called Airplane, maybe there is already a class
Airplane in your program; then it would just say, well, I cannot create
something called Airplane because now obviously we are going to get two
classes that have the same name and that's going to create some problems.
Or, you know, it could say I cannot encapsulate some fields because you
already have setters and getters, or maybe because the field is accessed from
outside, so once it becomes private that couldn't be done, and so on. So these
warnings are very refactoring specific, and a good engine needs to apply all
this program analysis to figure out whether it's safe to apply the refactoring
or not before it actually goes and changes the program.
So how do people actually test these refactoring engines? Both Eclipse and
NetBeans have a number of manually written tests: the developers of these
engines care about their programs, so they go and write unit tests. They write
these input programs -- they prepare a program with a number of classes that
are referenced in a certain way -- they also have the code that invokes these
refactorings, and then they have the expected output. This is either the
refactored program written by hand, saying this is what you should get, or
some set of expected warnings, in cases where you do want to test for the
cases where you're expecting to get warnings. So these are all manually
written tests, and then they're automatically executed.
Eclipse uses JUnit for this; they have over 2,600 of these JUnit tests.
NetBeans uses a different testing framework called XTest, and they have a
seemingly much smaller number, 252, but the issue here is that it's not quite
a fair comparison, because NetBeans has much larger tests: they have more like
system tests, where they go and execute many things at once. So even with this
number, 252, there are actually many more things happening there than in any
one of the small JUnit tests for Eclipse.
>>: What's the number of lines of the tests [inaudible]?
>> Darko Marinov: I don't know that by heart, but --
>>: Each individual test, is it 10 lines, a hundred lines?
>> Darko Marinov: Most of the tests, interestingly enough, are one line,
because all that the test does is -- so when you ask about the number of
lines, there are two things to distinguish. There is the code of the test and
the test input, right. So what is in the file? The code is usually just one
line. It would say: apply encapsulate field on some project P0, or project P1,
or project P2, or project P3, and check that the expected result is the same
as in E0, E1, E2, or E3.
Maybe the size should be measured more in terms of the program files -- how
big your input program is. Because this testing is mostly data driven, right:
your input data is the program files, and the code to actually apply the test
is fairly simple. Load the whole project, apply the refactoring, and check
whether the output is the same as expected. And that's very generic, so you
write that once and then each test is just one line that calls the generic
thing. Actually, not to go there, but each test actually has zero lines,
because they use reflection and refer to the name of the test method to figure
out which project to read.
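A minimal sketch of what such a one-line test might look like, assuming a
hypothetical generic helper (performRefactoringAndCompare and
RefactoringTestCase are stand-ins, not the actual Eclipse test API):

    public class EncapsulateFieldTests extends RefactoringTestCase {
        // The whole test is one line: load input project p0, apply the
        // refactoring, and compare against the expected output e0.
        public void testEncapsulateField0() {
            performRefactoringAndCompare("p0", "e0");
        }

        // With reflection, even that line can go away: the framework can
        // derive the project name from the test method's own name.
    }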
The point there was that, you know, these people care about this, and so they
built a large number of tests, and we still show that we are finding bugs in
that code. So what we actually wanted to do here was to automate this testing.
Rather than, you know, manually creating all these test inputs and then
checking whether the outputs are correct, we wanted to automate both the input
generation and the output checking. Now, a lot of testing research goes, you
know, along the lines of: I'll give you this refactoring engine and some big
pile of code, and somehow magically, just by looking at this code, generate
the inputs that could expose bugs, or maybe inputs that achieve coverage or
something like that.
In this domain, however, it's very hard to do that. I mean, refactoring
engines are just too complex to be able to generate test inputs for them that
way. The inputs themselves are also fairly complex: you effectively need to
generate programs, you need to generate syntax that satisfies the syntactic
and semantic constraints of valid programs. You have very strong preconditions
on these test inputs. So we did not even attempt to do that. We did not want
to look at the refactoring code and just from there automatically generate
tests.
What we actually wanted to do is something else. Basically we made these
assumptions: that the tester has -- yes, Nikolai?
>>: I have a question. So a very simple way to test appears to be the
following. I can just take any program which has already been tested, which
has lots of tests already written, and then just apply refactorings and run
the tests [inaudible] the refactored one, and you should see the same
behavior, right? So that could be in addition to these tests.
>> Darko Marinov: Yes, yes, yes, that's a very good thing, yes. That's
something that we did discuss. So you take an existing program, as you said,
that [inaudible] tests, now you go and refactor that program plus its tests,
and then you rerun the tests and see whether they are still passing.
Yeah, that's definitely something that can be done, something that we
discussed. We did not do that -- sorry?
>>: You still have to come up with specifications of, you know, what you want
to refactor.
>> Darko Marinov: Yes. Yes, in theory you could just traverse and apply all
possible refactorings. So if I have a program with many classes, I could just
say, okay, take any of these classes, try to rename it to some new name that
does not appear there, or then try --
>>: [inaudible] lots of tests you should have quite some confidence
[inaudible].
>> Darko Marinov: That's true, except that you may miss many of the bugs that
we have found. You know, if the programs that you have do not have certain,
you know, weird properties -- if in your program you never use something like
this -- you're just not going to find the specific bug. Of course, then you
could say, well, maybe this bug does not matter, right? If no real program
would ever write something like this, then, you know, it does not matter.
Actually, I didn't even know that in Java you can put these brackets on the
left side of an assignment. But the thing is, usually, you know, when you go
and try to look for these bugs, these bugs are, I suppose, social creatures:
they tend to go together, so when you find some bug somewhere, there are more
bugs there. Once we found this bug here in encapsulate field, we actually went
from there and found many more bugs. So even if, you know, this specific one
looks like 'who cares', there are many others that may be much more important.
But what you suggest is one way in which this testing can be done. We
discussed that, but never, you know, did that.
What we did was to start looking at this other thing here. We made the
assumption that the tester kind of knows what inputs could expose bugs -- the
tester has good intuition for that. Say, for example, for encapsulate field we
would say maybe there are some problems with inherited fields: if some
subclass is inheriting a field and referring to it in some weird ways, that
could potentially show a bug. So that's one important assumption behind the
work. And the other one is that it's labor intensive to manually write many
input programs. If this encapsulate-inherited-field scenario requires you to
write only one or two test inputs, that's fine, you can just go and do that
manually and be done.
But if it requires you to write, say, thousands of them, then it's very hard
to go and actually generate them manually one by one. So if you want to take
this approach, the challenges become, first, how to capture this tester's
intuition -- if I already know where the bugs may be, how can I turn that into
a way to automatically generate a large number of test inputs -- and then also
how to automatically check that the outputs are correct once I start
generating inputs automatically.
The general solution that I propose for this problem is something I call test
abstractions. The idea of test abstractions is that they conceptually describe
a set of test inputs. So the main idea is that instead of manually writing a
large number of tests, the user writes these test abstractions and then tools
automatically generate the tests. If you want to test refactoring engines,
rather than manually writing a large number of input programs, you just
somehow describe what the input programs are that you want to generate, and
then have the tool generate them automatically.
And the point that I'm making is that this is useful not only for test
generation, when you want to generate these things once, but also for
maintenance. Say you decide to make some changes in your test inputs, some
changes in your code, and you need to regenerate the tests: if you used one of
these test abstractions and you have a description of the set of inputs, then
you can just regenerate everything. Whereas if you had manually written, you
know, hundreds of thousands of them, you would need to somehow manually update
them all, or write some scripts to change all these tests.
So if you do something like this, using these test abstractions, then you're
going to be automatically generating tests, but whenever you have that, you
have the issue that you need to check whether the code is actually working
correctly or not. Namely, we run into this problem of test oracles: we need to
have a way to automatically determine whether an execution was correct or not.
And then the other related problem is clustering: if you start getting tests
that fail, we would like to group those that are likely due to the same fault,
so you don't need to explore all of them. Yes, Tom?
>>: So the first two, three -- the test abstractions -- that sounds sort of
like model-based testing.
>> Darko Marinov: Yes. But you need to have a different buzzword to get a
CAREER proposal from NSF. So my buzzword is test abstractions. Yes, but you
could call this model-based testing, I guess. That would be fine.
>>: I mean, what's -- is there some technical difference between these? Are
these programs -- I mean, are these model programs, these test abstractions
[inaudible]?
>> Darko Marinov: Well, they could be. I mean, the idea is that test
abstraction is a general term, and, you know, you can put whatever you want
there, so I would say that, you know, model-based testing is one approach to
doing test abstractions. Usually when you say model-based testing, you know,
people think of certain models, right? I mean, they would think of state
machines -- not necessarily; different people may use different models. But
suppose that, you know, you want to describe some complex input such as, you
know, Java programs. Would you call that model-based testing? You know, what
would be your appropriate model? Of course you can do that and just say, you
know, my models are some grammars, you know, or something that describes what
programs are, and this could be called model-based testing. I think that
model-based testing usually, at least, you know, to me, means that you're
generating some kind of, you know, sequence of inputs that you are giving to
your program. At least in my mind. Yuri, what would you think of model-based
testing? Would you think of generation of complex inputs as model-based
testing or not?
>>: [inaudible].
>> Darko Marinov: Okay. I mean, this could have been easily called model-based
testing. Oftentimes when you do model-based testing it kind of has the test
oracle embedded in the model, right: the model not only helps you to generate
the inputs but also to check the outputs. Whereas here, at present, the model
somehow does not embody what the correct output is.
Okay. So that was sort of the general thing about test abstractions, or
model-based testing, or however we want to call it, and here's the specific
solution that we built for testing these refactoring engines. We developed
something we call ASTGen, a framework for generating abstract syntax trees;
this is the way we generate input programs for testing refactoring engines.
Basically, it provides a library of generators that can produce simple parts
of ASTs, and then there is a way to combine these simpler things and actually
build larger programs. So that was as far as the test inputs go.
And as far as the test outputs go, we just manually developed a variety of
oracles; there was no automation there in terms of automatically generating
oracles, so those were manually written and then just automatically run. And
we have some ongoing work on clustering, basically trying to group failing
tests together by their causes.
So ASTGen was the main thing that we developed, this framework for generating
abstract syntax trees, and we had a few design goals for it. First, we wanted
it to be imperative; the idea there is that the tester can control how to
build this complex data. In some previous work that I worked on, we had taken
a declarative approach: the tester would not describe directly how to generate
complex data but would only describe what the properties of the data were. You
would just describe what a Java program looks like and maybe what specific
properties you wanted it to satisfy, and then the tool would generate that. So
here the approach is different, and the tester directly writes how to build
this data. We also wanted it to be iterative, which means that it generates
these inputs lazily, because oftentimes you can end up, you know, with
thousands of them, or even with millions of these inputs, so we don't want to
generate them all at once. We wanted this framework, of course, to be bounded
exhaustive. There are some interesting points here. The idea of bounded
exhaustive is that you want to try all tests within given bounds. So say, if
you want to generate programs, you may want to say, well, we are going to
generate the programs that have up to three classes, or we are going to
generate the expressions that have up to three levels of nesting, and so on.
So you put some bounds on the size of the program, and then you want your
testing to try all possible test inputs within those bounds. Yes?
>>: Those appear to be also desirable for compilers, right? Compiler
[inaudible] generate all these programs because the compiler [inaudible] so I
was wondering whether somehow this aims to go further than just refactorings.
>> Darko Marinov: Yes. In theory, one can use this ASTGen to test any other
piece of code that takes programs as inputs. You can just use ASTGen to
generate various kinds of inputs and run that. Now, if you want to test
compilers, that becomes a bit trickier, because you need some way to check the
output, and checking the output of a compiler is, you know, much trickier than
checking the output of a refactoring engine. So, yes, conceptually you can use
this to test your compiler, but that part wouldn't quite work.
Another issue here is that when you go to generate these programs, sometimes
it may be very hard to generate the programs exactly as you want them. So
suppose that we want to generate only input programs that compile. That may be
fairly hard to express in this framework. So what you do then is generate a
larger superset of programs and use the compiler to filter out those that
don't compile. So what I'm trying to say is, if I wanted to test specifically
a compiler, I might have a harder time using this, because of both the output
checking and the properties of the input. Yes?
>>: [inaudible] do refactoring tools have to support, like, [inaudible]
correct programs that don't compile, because the user's always in the middle
of making changes?
>> Darko Marinov: Yes. They do support that, in the sense that they let you
apply refactorings on programs that don't compile, but usually that just comes
with a big warning, and the tool just says: look, your program doesn't
compile; you know, maybe I cannot even parse your program; I really don't
know, if you have class A appearing somewhere where, you know, my parser got
lost, whether there is an A there or not. If you wanted to replace A with
Airplane, I may easily miss that, and I cannot give you any guarantees about
that.
In most cases -- I don't have any empirical data to support this, but I think
that in most cases developers probably apply the refactoring engine only at
points where their code does compile. Because otherwise, if your code doesn't
compile, and especially if some parts cannot even be parsed, then it's very
hard to know what guarantees you are going to get out of the engine. So all
our testing was done with inputs that do compile. We still found many bugs. So
I assume that if you go and test on cases where the input program does not
compile, you'd be able to find even more bugs, but it would be hard to
describe even what the correct behavior of the refactoring engine is in those
cases. So going back to this: we do this bounded-exhaustive testing, and the
goal is to catch the corner cases. The reason why we want to try all possible
inputs is to catch the corner cases that may be there.
And then, last but not least, we wanted this to be composable. The idea is
that you write generators that can create the simple parts of inputs and then
from there you can build the larger parts. Here then is how the whole testing
process looks. If one wants to use ASTGen, first the tester has to manually
write a generator using this framework; then the tester instantiates this
generator by providing some bounds -- maybe how many classes you want to have,
or what their relationships should be, and so on. And then there needs to be
some driver code that actually runs this whole thing in a loop. So here is an
example of how this driver code looks.
So here we need to create some generator -- say, if you want to encapsulate
some field, you would say we want to encapsulate the field F, and we need some
generator that can actually create the programs for that -- and then there is
just a small piece of scaffolding code that does this: we get the refactoring
that encapsulates a field, we tell it to encapsulate the field F, we perform
this refactoring on the input program, and then we just check whether the
output actually satisfies the required properties or not.
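The slide itself isn't reproduced here, so this is a minimal sketch of the
shape of such a driver loop; the names (ProgramGenerator,
EncapsulateFieldRefactoring, checkOracles) are hypothetical stand-ins, not
ASTGen's actual API:

    // Iterate lazily over generated input programs, apply the
    // refactoring to each one, and check the oracles on the result.
    ProgramGenerator gen = new DoubleClassFieldReferenceGenerator("A", "B", "f");
    Refactoring refactoring = new EncapsulateFieldRefactoring("f");
    while (gen.hasNext()) {
        Program input = gen.next();
        RefactoringResult result = refactoring.apply(input);
        checkOracles(input, result);   // compiles? warnings? structural changes?
    }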
So let's see now how an example generator looks. This is really the key part
of the framework. Again, the tester has some intuition about what kinds of
test inputs may show bugs, and sometimes you can express this intuition fairly
easily in English -- you can just go and write, you know, two or three
sentences describing a set of input programs -- but it's very hard to actually
go and manually write all the programs that satisfy it. This particular
example is from our testing of encapsulate field, a generator we called the
double class field reference generator. Here is the short English description
of it: produce input programs that have two classes that are related in
various ways by containment and inheritance, where one class declares a field
and the other references the field in some way.
Down here we can see some examples of programs that satisfy this description.
We have two classes, A and B. They may have various relationships, like
subclass or inner class and so on, and they reference the field in various
ways. Here we are only showing three examples, but of course the number is
much larger -- here the number is unbounded, because, you know, we can
reference the field F in an unbounded number of ways in various expressions.
But even if you put some bounds -- if you say we want to have the nesting
depth of these expressions up to some bound, two or three -- you still end up
with thousands of these programs.
Then the question is how we can go from this English description to these
thousands of programs. We don't do anything with the English description -- we
don't, you know, try to analyze natural language -- we actually ask the tester
to express these properties directly in code. So here are the parts of this
description: we want classes that are related by containment or related by
inheritance, we want one class to declare a field, and we want the other class
to reference the field in some way.
Each of these parts effectively corresponds to a large number of ways in which
it can be done in the programs. And then for each of these parts we go and
build a small generator that just focuses on that one thing. For example, for
the containment between classes, we would build a specific generator that can
generate all the different possibilities: maybe the classes are independent,
or one is an inner class or a method-local class, and so on.
If you look at inheritance, we can again generate all possible ways in which
one class can inherit from the other: they can again be unrelated, or
superclass or subclass, or related through an interface, and so on. If you
want to declare a field, there are many ways in which this field can be
declared: it can have different types, it can have different visibility, and
so on. Again, we build this field declaration generator that can go and
enumerate a large number of these declarations.
And also, if you want to reference a field, there are various expressions with
which you can reference the field. And again, we go and build a small piece of
code that can produce all these pieces of abstract syntax tree. And now, in
order to test our program, what we want to do is take the cross product of all
these possibilities: we want to generate all possible programs by combining
these things and then test the refactoring engines on the resulting programs.
Yes?
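To make the composition concrete, here is a small sketch of the cross-product
idea under hypothetical names (these are illustrative, not ASTGen's actual
classes); each sub-generator enumerates one dimension, and the composite walks
all combinations:

    // Each small generator enumerates one aspect of the input space.
    Generator<Containment> containment = new ClassContainmentGenerator();  // unrelated, inner class, ...
    Generator<Inheritance> inheritance = new ClassInheritanceGenerator();  // unrelated, subclass, ...
    Generator<FieldDecl> declaration = new FieldDeclarationGenerator("f"); // types, visibilities, ...
    Generator<Expr> reference = new FieldReferenceGenerator("f");          // f, this.f, super.f, (a).f, ...

    // The composite lazily walks the cross product of all four dimensions
    // and assembles a complete two-class program for each combination.
    for (Program p : crossProduct(containment, inheritance, declaration, reference)) {
        testRefactoring(p);
    }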
>>: You want [inaudible] basically [inaudible] for all kinds of programs. Here
[inaudible] like A, B, and [inaudible] called F. So to what extent [inaudible]
actually want to generate? Are the names [inaudible]?
>> Darko Marinov: Okay. So the question is whether the names are hard coded.
You could hard code them, but they are not necessarily hard coded. Typically
the names are just given as parameters here. So if I build here, you know, a
generator that actually creates these programs, you know, I can pass the field
name as a parameter in there.
>>: [inaudible].
>> Darko Marinov: If you're generating only one field and you say it should be
F, then in all programs it will be called F. Did I answer that question? It
would never generate a field G, unless you have, you know, a different
generator that goes and generates, you know, something else. Here you would
also give the names of these classes, you know, A and B. Because in some
programs you may want to generate things with three classes, right? So if you
want to generate something with three classes, presumably they would be called
A, B, and C, but you may want different relations between B and C, and A and
C, and A and B, so these are parameters that you can just give here. Yes?
>>: If you suspect [inaudible] identifiers are handled in the refactoring,
could you have an identifier generator that would then add to the cross
product? So that would affect the identifier creation of all [inaudible].
>> Darko Marinov: Yes, yes. One could do that. So the field declaration
generator, I guess, takes a parameter which is the name of the field to
generate, and then what the possible types to put there are, and maybe what
the possible visibilities are, and whether there is also an initialization for
this field, and so on. It's much more involved than what is shown in this
example. So each of these generators takes a number of subgenerators for the
generation. The field declaration generator takes one generator which says
what types to generate, another generator which says exactly what you ask --
what identifiers to generate for the field names -- yet another one for the
possible visibilities, and so on.
>>: And that's also used by these [inaudible] to say [inaudible].
>> Darko Marinov: Yes. Then you could pass that same identifier generator,
say, in here, into the one that generates field references, and, you know, if
you are generating, say, F and G, then it would generate here F and also G,
and this F and this G, and this A.F and this A.G, and so on. Yes?
>>: [inaudible].
>> Darko Marinov: That's an excellent question. So once we start combining
these things, the problem is that there may be dependencies there, that one
thing can depend on another: if I generate here F or G or some other field
name, then when I want to generate here an expression that references it, I'd
better use the same identifier. Otherwise, you know, I'm going to create a
program that doesn't compile, if I have here super dot G and there I had
[inaudible]. So the way that's done is that you need to build these dependent
generators. And there things become a bit trickier to express, especially to
put the generation in the right order, because now this means you first have
to iterate this one to produce the values, and then you need to iterate this
one to refer to those values from there.
But in general, the problem arises when some of these compositions may be
invalid -- say, you know, you cannot take this particular containment of
classes together with this particular inheritance between classes, because you
are going to get something that doesn't compile. And then there are various
solutions there. One is going to these dependent generators, which means you
spend more work describing what's proper and what's not. And then another,
sort of the easiest, solution is to just delegate this to the compiler. And
that goes back to the question of whether you could use this for testing the
compiler. The problem then is that you would need to spend more time
describing all these dependencies here. And if you want to avoid that, if you
want to kind of be lazy, then you just make the generator produce things that
don't necessarily compile, but then you compile to filter those out. Yes?
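A sketch of the dependent-generator idea, again with hypothetical names: the
reference generator is re-created for each declared field, so it only ever
uses an identifier that the declaration step actually produced.

    // The outer generator enumerates field declarations named f, g, ...
    FieldDeclarationGenerator decls = new FieldDeclarationGenerator("f", "g");
    while (decls.hasNext()) {
        FieldDecl d = decls.next();   // iterate the outer value first
        // The inner generator depends on the outer value: it references
        // exactly the field that was just declared, so the program compiles.
        FieldReferenceGenerator refs = new FieldReferenceGenerator(d.getName());
        while (refs.hasNext()) {
            testRefactoring(assembleProgram(d, refs.next()));
        }
    }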
>>: This is very heavily language-testing dependent; do these same kinds of
cross-product principles apply to data generators also?
>> Darko Marinov: Yes, yes, yes. You know, I claim that that's possible to do,
and I've done some research on that before, just not using this particular
kind of generators, not using ASTGen -- something else that we called Korat,
basically generating complex data. We've done things like that for testing
basic data structures: you may want to generate, say, binary search trees or
red-black trees and so on. And some of those things have also been used at
Microsoft for testing various things. They've used it for generating some XML
documents and other things. What are the other things?
>>: [inaudible] codes?
>> Darko Marinov: Parts of the serialization code and so on. So there are a
number of things that were done here as well where some data was generated
using, conceptually, the same idea. These are what I like to call test
abstractions -- or we can call them model-based testing or something -- but
the idea is: describe a set of test inputs and somehow have the tool generate
them. Of course, then the question just becomes how you actually describe this
set of test inputs -- what is the language used for that -- and how the tool
generates them.
>>: But do you [inaudible] all three or just the compiler?
>> Darko Marinov: You know, I [inaudible] the students that go and implement
all of this, so I just wave hands. No, all these things, all these three
things are supported in the framework. There is this thing with dependent
generators. I'm not going to go here much into the details of how this
actually works, but the way this is done in the design of the generators is
that it distinguishes two phases. One phase is how to iterate to the next
value: during the generation you kind of need to -- you know, suppose that I
chose first this, this, this and that; now I need to go to the next value,
which means I need to move one of these guys forward to actually try the whole
cross product. And, you know, as you move forward it may become immediately,
you know, illegal to combine these two things -- say, if this field F is not
actually in this class but it's somewhere else and you don't have the field F,
then you cannot generate this.
So one thing is how to move forward, how to describe what should be moving
forward, and then the other thing is, once you generate these pieces, how to
combine small pieces into larger pieces. So what the framework does is
distinguish these two concerns: one is the iteration and the other is the
composition. And what that actually allows is to make some of these things
easier, and for the dependent generators it would take much more time to
describe this, so we can, you know, take this offline and I can show you some
of these things. But basically the framework supports all of these things. So,
you know, you can write some kinds of filters -- you can write filters that
just throw things away as soon as they are generated, if they are invalid --
or you can force the generation to only produce valid things, or you can just
eventually delegate everything to the compiler.
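One plausible way to read that separation, sketched as a hypothetical
interface (not ASTGen's literal API): iteration steps a generator through its
sequence of values, and composition combines values from sub-generators,
advancing them like an odometer.

    // Iteration concern: step lazily through a bounded-exhaustive sequence.
    interface Generator<T> {
        boolean hasNext();
        T next();
        void reset();   // rewind, so an enclosing cross product can re-run it
    }

    // Composition concern: walk the cross product of two sub-generators
    // like an odometer: advance the second; when it is exhausted, reset it
    // and advance the first.
    final class PairGenerator<A, B> implements Generator<Object[]> {
        private final Generator<A> first;
        private final Generator<B> second;
        private A currentFirst;

        PairGenerator(Generator<A> first, Generator<B> second) {
            this.first = first;
            this.second = second;
            this.currentFirst = first.hasNext() ? first.next() : null;
        }
        public boolean hasNext() {
            return currentFirst != null && (second.hasNext() || first.hasNext());
        }
        public Object[] next() {
            if (!second.hasNext()) {   // roll over to the next outer value
                currentFirst = first.next();
                second.reset();
            }
            return new Object[] { currentFirst, second.next() };
        }
        public void reset() {
            first.reset();
            second.reset();
            currentFirst = first.hasNext() ? first.next() : null;
        }
    }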
There is this tradeoff between the amount of effort that you spend on writing
the generator versus the generation time. Because if you wait for the
compiler, then you may just be wasting some of the time and throwing stuff
away, but, you know, you just kind of wait a bit more. So that was all about
inputs. I'm not going to go into more detail, but I'll be happy to discuss
this offline.
So the other thing that we needed to do was about oracles, which is to
validate the outputs of the refactoring engines, and of course the challenge
there is that you don't know the expected outputs, because we automatically
generated the inputs -- that's the case whenever you automatically generate
the inputs. Another issue is that, at their base, the refactorings require
that the output program be equivalent to the input. That of course is
undecidable by itself, but the thing is that, you know, our problem is even
harder than this -- it's even different from this. Not only can you not check
this in general, but we also want to check that, you know, the structural
changes were made, as in that example with the field: if you do want to
encapsulate the field F in setter and getter methods, you want that actually
to be done.
So the question is how to do that. As I said, we just built manually a number
of oracles for that, you know, ranging from simple things: whether the
refactoring engine crashes altogether (we never found any bug there); whether
the output program compiles; and then whether we are getting appropriate
warnings from the refactoring engine, because remember it need not always
generate the refactored program -- it may sometimes say 'I cannot refactor
this', similar to how a compiler would say, you know, you have a compile error
and so I cannot generate the assembly code or whatever code it should
generate.
Then there were also a few interesting things that we did, for example this
one with the inverse refactorings. Many of the refactorings you can apply the
other way around. Say, if you rename A to B, then you can rename B to A and
hopefully get the same input program. So what you want to do is check,
starting from some program, that renaming A to B and then renaming B to A
gives you back the same thing. Of course you need not get exactly the same
thing at whatever level you compare: if you just print the programs out, they
may not have exactly the same files, exactly the same sequence of characters;
even the ASTs may not be exactly the same. So we needed to build some tool
that tries to compare these ASTs while ignoring some of the details. Maybe the
order of methods would be different, or maybe the new name that was chosen --
the fresh name that was chosen in some rename -- would be different, and so
on. So you need to have a comparison that ignores some of that stuff.
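A minimal sketch of such an inverse-refactoring oracle, with hypothetical
helper names (apply, astEqualsModuloDetails, and so on are stand-ins):

    // Inverse-refactoring oracle: rename(A -> B) followed by rename(B -> A)
    // should give back (approximately) the original program.
    Program original = gen.next();
    Program renamed = apply(new RenameClass("A", "B"), original);
    Program restored = apply(new RenameClass("B", "A"), renamed);

    // Compare ASTs while ignoring tolerated differences such as method
    // ordering or engine-chosen fresh names.
    if (!astEqualsModuloDetails(original.getAst(), restored.getAst())) {
        reportFailure(original, restored);
    }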
Then there were some custom oracles: for example, if you apply encapsulate
field, you want to check that there are no more references to the field except
from the setter and getter. And then, last but not least, we did differential
testing: you take the same input program, you give it to both Eclipse and
NetBeans, and then you check whether you're getting the same output program or
not. Of course, even there, 'the same' means modulo some of the changes that
you tolerate.
Yes?
>>: So a refactoring engine which does nothing [inaudible]?
>> Darko Marinov: Well, it would presumably not pass this one, the custom one,
because you would check for certain structural changes and find that it didn't
make them. And then also you hope that of these two refactoring engines at
least one of them does something, and then, you know, the one that does
nothing would show up as having a bug. Yes?
>>: [inaudible] involves actually running the program and seeing whether the
input or output changes.
>> Darko Marinov: These that I'm mentioning here, no, but we had also
experimented a bit with that. That becomes much more complex: now we'd have
automatically generated programs, for which we automatically generate tests,
and then we want to refactor the program and run those tests. So it becomes a
bit -- we had, I think, one experiment with that, but we did not pursue it too
much.
And these are all good questions. I guess the thing was that, you know, we
still found a lot of bugs even with the stuff that we did, so that's why we
didn't pursue some more things. So here then is basically what we did with
ASTGen: we tested Eclipse and NetBeans, and we tested eight refactorings, I
guess, from each. They target various program elements -- you know, fields,
methods, classes -- we had things like, you know, encapsulate field, move
method, rename class and so on. We had about 50 generators, some of them very
small ones for generating small parts of ASTs and then some complex ones for
building entire programs. We actually found 47 new bugs, and we reported these
bugs. Then we did some comparison of how good these oracles are and how well
the generation works.
So here are some of the results for the generators. Say, for testing this
encapsulate field we've written a number of generators that generate various
programs; I guess the one I used as the example was this double class field
reference one. If you run that, it can produce a number of inputs, some of
them, you know, in the hundreds or thousands. Here was the time taken to
produce all those inputs and to run the refactoring engines. And then here was
the number of bugs found in the refactoring engine.
What we found overall was that the generation and compilation times were much
less than the refactoring time -- actually running the refactoring and
checking the oracles. So generation by itself was not that big a problem, but
the execution of refactorings was actually taking a lot of time. So that's as
far as the machine time goes. And then, as far as the human time goes, in some
of the initial experiments we kind of tried to track how much time it takes to
build one of these generators, and it took about two work days. But this was
really still, you know, the initial phase, where we were not just saying,
okay, we want to build this specific generator, but we were still also
developing the library -- not only writing some specific generator but also
writing the small subgenerators that we needed.
Nowadays, you know, the library is much bigger and there are many things that
you reuse, so it takes about two hours to write one of these, and in turn that
one can produce, you know, this many input programs. Yes?
>>: How many generated inputs actually compiled?
>> Darko Marinov: I don't have that number with me here, but, you know, again,
that's in the paper. I believe about, you know, one in three or four would
compile. I think it all depends, you know, on specifically what the generator
was doing and so on.
>>: [inaudible] easy to do, or [inaudible] these five generators where you
found most of the bugs?
>> Darko Marinov: So the way it went is -- this table is much bigger. So there
are many more refactorings that we tested, and for various refactorings we
would have more generators and so on. The reason why we had the most
generators for encapsulate field was, as I said, you know, once you start
finding some bugs somewhere, then you kind of, you know, figure out that there
may be more bugs there. So we found some bugs in encapsulate field and then
just figured out that this refactoring was probably not as robust as the
others. Actually, you can even see that: this one was built maybe two, three
years ago, whereas these are seven, eight years old.
So, you know, these ones that are older did not have any bugs -- actually,
here in Eclipse we did not find any bug with rename, at least for these
generators. That does not mean that some other generator could not find a bug;
we did still find, you know, one of them for NetBeans. But for this
encapsulate field, you know, we would just find bugs with almost any of these
generators, and, you know, at some point you just stop writing any more of
these, you know, generators, and, you know, probably we could find even more
bugs there. Yes?
>>: So this number of bugs -- does it mean that a test failed, or is it an
actual bug?
>> Darko Marinov: So this is an actual bug. Actually, the way it goes -- let
me just see if I have this number here. Yes. So here's a number that shows a
bit more of what happened there. You've seen that we can generate hundreds or
thousands of test inputs. When we run them, we can still get a large number of
failures. What this table here shows is the number of failures for the various
oracles. WS, I guess, was warning status; this DNC was does not compile -- so
you give it a program that compiles, we run the refactoring engine, and you
get back a program that does not compile.
So as you can see here for this particular case, say this double class field
reference generator, I think we were generating about 4,000 -- that was on the
previous slide. So here we are generating 4,000 of them, and then once we run
that, we obtain a few hundred of them that actually fail. So we would get 187
of them that fail this does-not-compile oracle, and, you know, hundreds more
that fail this custom oracle, and 500 more where Eclipse and NetBeans would
differ. So this was the number of failures. But then the actual number of bugs
was only this. So, you know, the number of bugs that we reported in the bug
tracking system for Eclipse was just one.
And so the issue here is that you can get a large number of failing tests that
are actually due to the same underlying cause -- they are due to the same bug.
And one needs to sort of address that, and that's part of ongoing work; I have
a few slides on that. Does this answer the question?
>>: Yes.
>> Darko Marinov: Okay. Yes?
>>: You're saying all of those failures were the result of one bug? Or are
there [inaudible] sort of hidden ones -- once we fix that bug, there's going
to be --
>> Darko Marinov: So in this particular case -- now, again, I don't have the
data, but it may be that we found more; the numbers in this column are what we
reported in the bug database. We did not report the things that were already
fixed in a later version. All of this work was done on a somewhat older
version; we needed to fix on some old version and build this whole thing for
automatically running and so on -- there needed to be a lot of changes to
actually get this whole stuff to run automatically.
So at the time when we would find some bug, we would check against the newer
version whether it was already fixed or not. And we were about six months or
so behind, so they would already have fixed some bugs. So that was one thing.
And then sometimes we would find a bug that had already been reported
previously. So even if it was not fixed in the current version, it may have
been reported, and we did not just want to go there and file duplicate bug
reports. So what this column shows is just how many we actually put in their
bug tracking system. We did find some more; now, I don't have, again, the
numbers on this slide of how many more we exactly found.
>>: So with respect to bugs, how many bugs does this approach find? The one is
not really representative?
>> Darko Marinov: It's slightly more than one, but not much more, maybe like
two or three. So it's not like it found 20 and all those other 19 were either
fixed or already reported; it's just slightly more than what's shown here.
Those are all good questions. Any more questions?
>>: I'm just curious about the number of open bugs in each of those in the
Eclipse refactoring [inaudible], just to give us some perspective: are there
just ten open bugs and you added 21, or are there a thousand open bugs --
>> Darko Marinov: I think probably on the order of a hundred, but again, I would
not know the exact number. Actually, it's very hard to search through those bug
tracking reports. We wanted to do even simple things like finding which of
these refactorings had the most bugs submitted in, say, the last six months;
that one is presumably the least robust, and maybe we want to focus our testing
effort on it. Even asking this simple query, it is actually fairly hard to get
the numbers. Eclipse uses Bugzilla for their bug tracking, and NetBeans uses
Issuezilla, which is very similar to Bugzilla.
So asking these simple queries is actually fairly hard. Of course, one can just
search for encapsulate field, but then you are going to miss some of them;
some people just don't use the name encapsulate field refactoring and refer to
it by some other name, and so on. So improving these bug tracking systems,
being able to better search through them, is a challenging and important
problem. Sometimes, if you just want to find, say, rename method, this is fine:
you can just search for rename method and find it. But if we also wanted to
find things that are cross-cutting, say all bug reports that say an incorrect
warning status was generated, regardless of which refactoring, that is almost
impossible, unless we download all the hundreds of open bug reports and someone
manually inspects them.
So I think that the number of open reports is maybe on the order of a hundred
or so; that of course is across all the refactorings. Eclipse has, I think,
about two dozen or so refactorings, whereas we only tested eight of them in our
approach, and NetBeans has maybe slightly fewer, about 15 or so.
So here are the results that we obtained by running this: we got 47 new bugs,
21 in Eclipse, of which they confirmed 20, and 26 in NetBeans, of which they
confirmed 17 and have already fixed three, and they marked one as won't-fix
although we still think it is incorrect. And then, this is I guess what made
the students furious, they marked some of the reported bugs as duplicates. The
students spent a lot of time checking whether the bugs were duplicates or not,
making their best effort not to report something that's already there, but then
the triagers or developers on the NetBeans side just came and said that some
things were duplicates. We still think they are wrong there, in the sense that
if they go and patch whatever they said was duplicated, that's not necessarily
going to fix the actual bug that we reported. That's very hard to evaluate; we
need to wait for them to actually patch the claimed duplicate and see whether
they actually fixed our bug or not.
And then, as I said, we did find some more bugs but did not report them,
because they were either obvious duplicates, or something that was reported
previously, or already fixed by the time we found them. What's also interesting
in these results is that parts of ASTGen are being included in the NetBeans
process: they started adding not only the manually written tests but also some
infrastructure to be able to run this. It is still not finished, but they have
started some things on that.
>>: [inaudible].
>> Darko Marinov: So the question is about the bug report that they do not want
to fix, even though it's incorrect? I happen to remember that one; it was also
related to, I think, encapsulate field. The problem there was that in most
cases, when you put in this setter, the return type is void. But in some fairly
tricky cases, the return type of the setter needs to be the same as the type of
the field, so we need the setter setF to return it.
The reason why you need that is that you can create a kind of weird expression
where the assignment is actually a subexpression of something else, so the
assignment not only needs to make the change on the state but also needs to
return the new value. So you would do this.f = f and then return this.f, and
the return type needs to become the field's type. What they said is they don't
want to fix that, because if they did, that would violate the JavaBeans
conventions: the setters there have to return void, otherwise reflection cannot
find them. But the problem is that their fix should not be, oh, we will change
the way we do encapsulate field; the fix should be to raise a warning: if you
have an assignment as a subexpression, the refactoring engine should say, I
cannot do encapsulate field because you have an assignment as a subexpression.
But now they just said they don't want to fix it, and what that means is that
if you have an [inaudible] that's a subexpression and you apply encapsulate
field, you get something that doesn't compile. This is an easy bug; you can
just go and immediately revert the refactoring and say, okay, the refactoring
cannot apply, I'm just going to do it manually or something like that.
>>: Well, [inaudible] in that case?
>> Darko Marinov: Eclipse creates it correctly here: it puts in this.f = f and
adds return this.f. This creates a setter that also returns the new value.
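To make this concrete, here is a minimal sketch of the kind of input and output
involved, reconstructed from the discussion above; the class, field, and method
names are illustrative.

    // Before encapsulate field: the assignment is used as a subexpression.
    class C {
        int f;
        int m() {
            return (this.f = 5) + 1;  // assignment's value feeds a larger expression
        }
    }

    // After encapsulate field, a void setter would break compilation:
    //     return setF(5) + 1;   // error: setF(5) would have type void
    // so the generated setter must return the new value, as Eclipse does:
    class CEncapsulated {
        private int f;
        int setF(int f) {
            this.f = f;
            return this.f;  // returning the value keeps the subexpression legal
        }
        int m() {
            return setF(5) + 1;  // now compiles and preserves behavior
        }
    }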
>>: [inaudible] when they fix a bug [inaudible] instead of generating something
that's wrong, now they [inaudible].
>> Darko Marinov: Yes, it could give a warning. Actually, what we found is that
NetBeans is much less aggressive in trying to apply refactorings. NetBeans much
more often gives a warning and says, I don't want to proceed with this because
I could do something wrong.
>>: [inaudible].
>> Darko Marinov: Yeah. The refactoring engine that always gives you a warning,
saying I could do something wrong if I proceed, would satisfy probably all your
requirements. It would be useless, and if one of your requirements was for the
software to be useless, it would satisfy even that requirement. But at least it
would be correct; according to our oracles, we would never find a bug there.
What we would find is that there is some difference: when we run the same input
program through both Eclipse and NetBeans, Eclipse proceeds but NetBeans
doesn't. Then we need to go and manually inspect why, what the difference is,
and whether Eclipse should also have refused to proceed or NetBeans should be
more aggressive and actually proceed. At least that way you are not introducing
bugs that could trick the developer by creating something hidden somewhere.
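A sketch of this differential oracle idea, with hypothetical types standing in
for the engine wrappers: apply the same refactoring in both engines and flag
any input where their outcomes differ for manual inspection.

    // Sketch of a differential oracle over two refactoring engines.
    enum Outcome { APPLIED, WARNED_AND_REFUSED }

    class DifferentialOracle {
        // Hypothetical interface standing in for an engine wrapper.
        interface Engine { Outcome apply(String program); }

        // Returns true when the two engines disagree on the same input,
        // which is exactly the case that needs manual inspection.
        static boolean disagree(Engine eclipse, Engine netbeans, String program) {
            return eclipse.apply(program) != netbeans.apply(program);
        }
    }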
Okay. Any more questions about these things? Yes?
>>: Just one more question on [inaudible] generation: did you guys spend any
time looking at why the compiler wasn't able to compile some of these
[inaudible]? You're trying to generate [inaudible] structured programs, right?
Like, you're not deliberately increasing [inaudible] or anything?
>> Darko Marinov: Okay. You mean whether, in our own generation, we looked at
why the [inaudible] generate programs that cannot compile, right?
>>: [inaudible].
>> Darko Marinov: We spent some time on that; that's how we were building these
dependent generators and adding some of these filters. But we did not spend too
much time on it, because eventually, yes, what we would like is to get only the
programs that compile. Maybe someone wants to test the refactoring engine with
programs that don't compile, but still, if you have some properties in mind,
you want to generate test inputs that satisfy those properties.
>>: [inaudible] might not generate really complex inputs that satisfy the
properties as long as [inaudible] some inputs?
>> Darko Marinov: I would actually say that it is a big deal; it's something
that we should spend more effort on. If I want to generate test inputs that
satisfy certain properties, I only want to have those that satisfy the
properties. Here we have a sort of easy solution: we want inputs that compile,
but even if we get something that doesn't compile, that's fine; we run it
through the compiler and just throw it away. But the question is, what if you
really wanted to generate only things that do compile, or more generally things
that satisfy certain properties: how can you express that more easily here?
That's part of future work; it's not something we have a final solution for.
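The easy solution mentioned here is essentially generate-and-filter; a minimal
sketch of that pattern, with illustrative names, reusing the hypothetical
DncOracle from the earlier sketch:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Generate-and-filter: enumerate candidate inputs and keep only those
    // that satisfy the property of interest (here, "the program compiles").
    class GenerateAndFilter {
        static List<File> compilableOnly(Iterable<File> candidates) {
            List<File> kept = new ArrayList<>();
            for (File candidate : candidates) {
                if (!DncOracle.doesNotCompile(candidate)) {
                    kept.add(candidate);  // discard anything that fails to compile
                }
            }
            return kept;
        }
    }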
So here's also some ongoing work: we are trying to reduce the machine time and
the human time in using this ASTGen framework. The machine time is for the
generation and execution of these test inputs. Of course, machine time by
itself wouldn't be all that important, but the issue is that this machine time
actually translates into human time. As I said, nowadays it takes the students
about two hours to write one of these generators; if I write the generator,
push a button, and say start the testing now, and then I need to wait one, two,
or three hours to get back some result, then the whole process is just idling,
because the developer is just waiting for the result.
So that's one part. The other thing is the human time for inspection, because
as we discussed, you can get hundreds of programs that fail but that expose
only one or two actual bugs. So here are the things that we are doing. One is
to reduce the time to first failure: rather than exhaustively trying all
possible inputs, we skip over a number of inputs and try to quickly find some
input that fails, and if this kind of sparse generation does not find anything,
then we go and proceed exhaustively. We also want to reduce the test generation
and execution time, so here we are trying a smaller number of larger tests
rather than generating a large number of smaller tests: where we had one class
with only one expression, here we want to try one class with many expressions.
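A minimal sketch of this sparse-then-exhaustive search, again with illustrative
names rather than the actual ASTGen code:

    // Sparse-then-exhaustive search for a first failing input: first probe
    // every stride-th generated input; only if that finds nothing, fall back
    // to checking every input exhaustively. Assumes the Iterable can be
    // traversed twice (e.g., a list of generated programs).
    class TimeToFirstFailure {
        interface Oracle { boolean fails(String program); }  // hypothetical

        static String firstFailure(Iterable<String> generated, Oracle oracle, int stride) {
            int i = 0;
            for (String program : generated) {            // sparse pass
                if (i++ % stride == 0 && oracle.fails(program)) {
                    return program;
                }
            }
            for (String program : generated) {            // exhaustive pass
                if (oracle.fails(program)) {
                    return program;
                }
            }
            return null;  // no failing input in this batch
        }
    }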
And then, to reduce the time for inspection, we have some oracle-based
clustering: we try to group the tests together based on the oracle messages,
such that tests in the same group hopefully have the same underlying bug, so
there is less to inspect. The actual results are quite promising. Here we can
save the time by roughly an order of magnitude; here it's also like 2 or 3x;
and here we can significantly save this time. We can sometimes merge a hundred
failing inputs and just say these all seem to be due to the same bug, so we
only need to inspect one or two from that group rather than all hundred.
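A small sketch of such oracle-based clustering, grouping failing inputs by
their oracle message; the Failure type here is hypothetical:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Group failing test inputs by the message their oracle produced; each
    // cluster is then inspected via one or two representatives only.
    class OracleClustering {
        record Failure(String testInput, String oracleMessage) {}  // hypothetical

        static Map<String, List<String>> cluster(List<Failure> failures) {
            Map<String, List<String>> clusters = new HashMap<>();
            for (Failure f : failures) {
                clusters.computeIfAbsent(f.oracleMessage(), k -> new ArrayList<>())
                        .add(f.testInput());
            }
            return clusters;
        }
    }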
So here's some future work, things that can be done. Specific to testing
refactoring engines: of course, always try more refactorings, and try different
refactoring engines; maybe someone wants to try this for Visual Studio. Some
people have actually used ASTGen: recently there was a paper presented at
[inaudible], I guess two or three weeks ago, where they used ASTGen to test
their own refactoring engine that they built. Then we can apply ASTGen to other
program analyzers, basically anything that takes programs as inputs. We've
already done a small study with a tool called JavaCOP, developed by Todd
Millstein at UCLA, and we found a small bug there. The other things that can be
done are to reduce or eliminate the false alarms, which I didn't even discuss:
because of this comparison of ASTs, we don't always get the correct results; we
have some false positives there. And then to reduce redundant tests, rather
than showing a hundred failures, to maybe show only one.
So this was specific to refactoring engines, but there are more general
questions about the test abstractions. If you remember, a test abstraction
means that in some language you want to describe a whole set of test inputs,
rather than writing them manually one by one. So the research there is along
the lines of: what are the languages that make this easier to use, how to
better describe these sets, and how to generate them faster. And of course,
always improve the oracles and the clustering. This is, I guess, what we are
also finding to be very important: now we can find many bugs, but you still
have many tests to inspect and many failing tests, so how can you reduce even
that effort, even at the expense of maybe missing one bug here or there?
I mean, it's very hard to go read a hundred programs and just try to figure out
whether they are due to the same bug or not. So that is the work on our test
abstractions, basically asking the questions of how to describe these tests,
what to generate, how to generate it, and so on. And here's the conclusion: we
applied this to refactoring engines, we found some bugs, and the code is
probably available for download there.
Okay.
>> Nikolai Tillmann: If there are no other questions.
[applause].
>> Nikolai Tillmann: And so that will be [inaudible] for the entire week for the
[inaudible] contents and the Microsoft [inaudible] so [inaudible].