>> Nikolai Tillmann: Hello. It's my pleasure today to have here Darko Marinov, who is now an assistant professor at Urbana-Champaign, Illinois. Before that he got his Ph.D. at MIT, and even before that, he actually was interning with us for three months, so we have known him for a long time. And he's been working on testing and how to improve it. So I'm actually looking forward to hearing his latest results today. >> Darko Marinov: Okay. Thank you. Thanks, Nikolai. Hello, everyone. Okay. So the latest results will be on this work, which was on automated testing of refactoring engines using something we call test abstractions. I'll describe what I mean by these test abstractions. I'll be the one talking here, but of course the work was actually done by students, so here's the list of people who were involved in that: Brett Daniel, he's my Ph.D. student, he was actually the main person behind this work, and the others who helped are Danny Dig, Kely Garcia, Vilas Jagannath, and more recently Yun Young Lee. The work was also supported by the National Science Foundation through some grants, and by a gift from Microsoft. It was a small gift. If you have money to give for testing, it would be good. So what we wanted to look at was testing of refactoring engines. Let's first see what these are. Refactorings are behavior-preserving program transformations. The idea there is that you want to change the code of the program, but you don't want to change its external behavior. The reason why people want to do this refactoring is typically to improve program design and to make the program easier to maintain or easier to use and so on. Some examples involve things like renaming a class: say you have an object-oriented program and you had some class that you called A, and you realize that A is really a bad name, so instead of A you should have called it, you know, Airplane, if it models an airplane or something like that. So you simply want to go and apply this change across the program. Now, you can do this manually, but it's kind of tedious to go everywhere in the program and find wherever you had A and replace that with, say, Airplane. So what you want to do is actually automate this process. And refactoring engines are programming tools that automate these applications of refactorings. So if you, say, want to rename class A to Airplane, you just go to the tool and point out that that's what you want to do, and it automatically traverses the entire code and finds where to make these changes. And these refactoring engines are included in many modern IDEs. Here are examples from two open-source IDEs, Eclipse and NetBeans. If you go to the top-level menu, they actually offer these refactorings that can be applied. On the left is Eclipse; they started building their refactoring engine a few years before the one in NetBeans, so they have more of these refactorings built in, things like renaming program elements, moving them, you know, things like convert local variable to field and so on. NetBeans has somewhat fewer of these. And then Visual Studio also has a number of refactorings built in, but I think even fewer; it has something like four or five. We did not test Visual Studio just because it's not open source, so we didn't have easy access to it, or to making changes to the tool itself to make it easier to test. So why did we want to test these refactoring engines? Well, one thing is they are widely used.
Programmers actually want to use the refactoring engines rather than manually making these changes throughout the program. Also, they are complex -- it's kind of interesting to test something that's not simple -- and they're complex in two dimensions. First, the kind of inputs that they take are complex: what the refactoring engine does is take an input program, source code, and as output it also produces source code, the changed program, after say just renaming a class. So the inputs themselves are complex, and the code of the refactoring engine itself is also complex: it usually needs to perform some sophisticated program analysis to find out whether it's fine to perform a refactoring, and then it also needs to perform a transformation that actually goes and makes the changes in the program. And then, what's also important is that a buggy refactoring engine can have severe consequences on development; namely, if there is some bug, it can sometimes silently corrupt large parts of the program. If you are a developer building some program and you apply a refactoring, it may happen that the refactoring engine just goes and silently corrupts something -- say, instead of replacing your class A with Airplane, it goes and replaces something else, and now things don't work. Sometimes you can find this easily because the output program does not compile; if there is a bug of that kind in the refactoring engine, you can easily detect it. But sometimes that's not the case, and then you have a bug that's very much hidden, and it's as unpleasant as finding a bug in the compiler itself. And, last but not least, we wanted to test refactoring engines because they contain bugs. If you do some research on testing, you should try to pick something where there are bugs, so you can say that you found bugs. But more importantly, it was because people use these engines, so they looked like a significant and complex application, something interesting and challenging to test. So here's an example of what a refactoring looks like, an example of how to use [inaudible]. There's this refactoring called encapsulate field, and what it does is replace all field accesses with appropriate accesses through getter and setter methods. On the left here, we have an input program, a small program for illustrative purposes: there's only one class A, only one field, this F, and there is some method M here that accesses this field F. So it reads F, multiplies it maybe with some I, and then writes the result back to F. Now, what we want to do here is replace all these accesses to the field F with uses of getter and setter methods. The reason why we want to do that, again, is to improve the design of our program: it's bad practice to access fields directly; it's much better to access them through getters and setters. In this simple example, the field is accessed from the same class, so, you know, it's fine to access it there, but in general these field accesses could come from different classes. Now, we can use the tool to do this. We just go to the field F, we click there and say encapsulate this field, and what the tool does is make the changes shown there on the right. So this on the right is the output program, and it makes five changes. First, it adds this getter method, the method here that just returns the value of F. Then it adds the setter method, the one that can set the field F: given some new value, it just sets F. Then it replaces the field reads with calls to the getter method, like this one here. Then the field writes, where F was set, it replaces with calls to the setter method. And finally it makes the field private, so that if it was accessed from another class, it cannot be accessed anymore.
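For reference, here is a minimal before/after sketch of those five changes, reconstructed from the description above; the field name f, the method m, and the getF/setF names are assumptions, not the exact slide content.

```java
// Before: class A reads and writes its field f directly.
class A {
    int f;
    void m(int i) {
        f = f * i; // direct field read and write
    }
}
```

```java
// After encapsulate field: getter and setter added, accesses rerouted,
// and the field made private.
class A {
    private int f;                    // change 5: field becomes private
    int getF() { return f; }          // change 1: getter added
    void setF(int f) { this.f = f; }  // change 2: setter added
    void m(int i) {
        setF(getF() * i);             // changes 3 and 4: read via getF, write via setF
    }
}
```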
Okay. How many people here have used refactorings? Okay. Almost everyone. So here's a bug that we found in Eclipse. Here we have again a very similar input program on the left. In this case we have two classes, one of which subclasses the other. And if you apply encapsulate field here, here's the output program that Eclipse generates. As it turns out, there is actually a bug here, and the bug is right here. What happened was that we had, effectively, a field write -- we were setting super.f to zero -- but because the engine has some bug, it decided that this was a read of the field F. So the output it generated says, somehow, that you want to get the field F and then do something with it. As a matter of fact, this thing here would not even compile. What it should have generated is this: super.setF(0). Here is another bug, one that was found in NetBeans. Here the bug is slightly different in that the output program does compile, but still the refactoring was not applied correctly. Again, here's the input program, something small: we have one class with one field, and there is some method. And we are writing to this field F in a somewhat weird way -- we have these parentheses around it -- so we are setting the field F of some object a to zero, and we want to encapsulate this. And this is what the refactoring engine gives: it properly introduces the setter and getter methods; however, it does not replace this access, it just leaves it as it was. So the bug again is in this line, and what it should have done is generate the setter call: it should have been a.setF(0).
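Here is a sketch of the two buggy cases, reconstructed from the descriptions above; the exact input programs and names are assumptions. Note that the parenthesized left-hand side in the second case was accepted by pre-Java 8 compilers.

```java
// Eclipse case: a subclass writes an inherited field through super.
class A {
    int f;
}
class B extends A {
    void m() {
        super.f = 0; // a field write; Eclipse treated it as a read and
                     // produced output that does not even compile.
                     // The correct output would be: super.setF(0);
    }
}
```

```java
// NetBeans case: a parenthesized field write (legal in pre-Java 8).
class A {
    int f;
    void m(A a) {
        (a.f) = 0; // left unchanged by the refactoring;
                   // it should have become: a.setF(0);
    }
}
```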
Makes sense? No, yes? Nikolai doesn't think that this is a bug. >>: [inaudible] refactoring, but what is it anyway? >> Darko Marinov: Okay. That's actually a very good question. I mean, what is even a definition of a bug in a refactoring engine? I guess part of what you want is to preserve behavior, right, so if it generated something where, you know, when I run the program I get a different value, we could say that that's definitely a bug. But if that's the only thing that you want out of a refactoring engine, then it would be very easy, right, I could just build, you know... >>: [inaudible]. >> Darko Marinov: Sorry? >>: In this case, M is in the class; it has access to [inaudible]. >> Darko Marinov: Okay. Yeah. I guess -- >>: [inaudible]. >> Darko Marinov: All right. So what [inaudible] is saying is that in this example F is inside the same class. Let me first answer that question, and then I can go back to the other one. Yes, in this case, F is indeed inside. So you could even say that maybe, when you do this encapsulate field, you don't need to change the accesses that are inside the class -- it's fine to access those directly; you don't want to change those to setters and getters. But the same bug would have occurred even if M was outside. If M was outside, in another class B that was accessing this F, it would still not replace this access. In that particular case, it would also produce a compile error, because the field F became private and you are trying to access it from outside. So that's the answer to the specific question. But going back to the more general question -- what is a bug in a refactoring engine? -- you want to preserve behavior, but you also actually want it to make the change that you asked for. Here's an interesting example of something like this, where we wanted to replace some field accesses and the refactoring engine did not replace them. We artificially created an example to show how a bug here may have, you know, big consequences. There was some field that was measuring some temperature, and what we wanted to do was say: for this set-temperature operation, if the temperature you want to set exceeds some limit, raise an exception, rather than, you know, exploding something somewhere. If you want to do that, what it means is: first you encapsulate the field, and then you start adding, in the setter, the behavior you want to happen whenever you set this field F. And if not all writes to F, you know, properly go through this setF, then it could just miss a write like that. So in an example like that, you would get some boiler to explode or something like that. But I agree with you that these examples may look, you know, somewhat simple and artificial, and one may ask, is this really a bug or not? But usually you can construct some bigger bugs from these smaller, simpler examples. Nikolai's still not convinced about that. >>: [inaudible] so if you say a [inaudible] then the bug surfaced, but then again it wouldn't be really about the refactoring; you have some other invariant in the program which is violated, but [inaudible] how can you blame the refactoring engine? So what is the definition [inaudible] refactoring engine? >> Darko Marinov: Well, I guess a definition of what it means to be a bug. Okay. So I guess one way you could define that is by actually going and implementing your own refactoring engine and saying: if this other one doesn't generate the same thing as mine, that would be a bug. You know, so say for encapsulate field, the definition says it replaces all field reads and writes with accesses through getter and setter methods. So I guess the point is that it should replace all field reads and writes. Of course, you could have a different encapsulate field which says, you know, I do not replace those that are within the same class -- encapsulate field outside of the class -- in which case you could say it replaces all accesses that are outside the class with getters and setters. So the definition would be: you could formalize what this means, and then check whether the output program really satisfies this property or not. Does this make sense? >>: [inaudible] you expect [inaudible] and then you [inaudible]. >> Darko Marinov: Yes. So if you run encapsulate field, you do expect that all the field accesses were replaced with get and set, so that if you later add some behavior to these getters and setters, you can hope that everything was indeed replaced. >>: So in a set of the [inaudible], is that actually also how you check for bugs, or how do you discover these bugs? >> Darko Marinov: That's a very good question.
So maybe I can postpone that for a few slides and discuss then what oracles we used for that. All right. So these were the examples; maybe you can trust me so far that these are bugs, or maybe, you know, we wouldn't call them bugs, and we can discuss later whether these are really bugs or not. One way we check whether something is a bug or not is to take the same input program, say this program, and run it through both Eclipse and NetBeans. And if they give something different, then, you know, we ask the question: is there really a bug? Is one of them wrong, because they are getting different results, or is it simply that, you know, the spec is [inaudible] can give any of these two? I think that was the way we discovered this one, because the outputs that compile are actually the most problematic bugs. If something doesn't compile -- if you go back here, if you just apply encapsulate field and you get this thing -- it will immediately tell you that it does not compile; I can immediately see that something went wrong. If I give you a program that compiles, and you make some change, and it doesn't compile anymore, you know that something went wrong. But here it's much trickier, because, you know, everything compiles and it looks seemingly fine; you could even find that it preserved behavior but still didn't do what you wanted it to do. All right. So how does one test these refactoring engines? This is the general setup: you take the refactoring engine, you give it some input program which you want to change, you tell it what refactoring to apply, say encapsulate field or rename class or something like that, and what the refactoring engine gives you as output is either a refactored program, as we've seen in these examples before, or it can instead give you a set of warnings and say, oh, I cannot actually go and refactor this, I cannot apply this for several reasons. Say, if you want to rename the class A to be called Airplane, maybe there is already a class Airplane in your program; then it would just say, well, I cannot create something called Airplane, because obviously we are going to get two classes that have the same name and that's going to create some problems. Or, you know, it could say I cannot encapsulate some field because you already have setters and getters, or maybe because the field is accessed from outside, so once it becomes private those accesses couldn't be done, and so on. So these warnings are very refactoring specific, and a good engine needs to apply all this program analysis to figure out whether it's safe to apply a refactoring before it actually goes and changes the program. So how do people actually test these refactoring engines? Both Eclipse and NetBeans have a number of manually written tests; the developers of these engines, I guess, care about their programs, so they go and write unit tests for them. They write these input programs -- they prepare a program with a number of classes that are referenced in certain ways -- they also have the code that invokes these refactorings, and then they have the expected output. This is either the refactored program, prepared by hand, saying this is what you should get, or some set of expected warnings, for the cases where you're expecting to get warnings. So these are all manually written tests, and they're automatically executed.
Eclipse uses JUnit for this, so they have over 2,600 of these JUnit tests. NetBeans uses a different testing framework called XTest, and they have a seemingly much smaller number, 252, but it's not quite a fair comparison, because NetBeans has much larger tests -- they have more like system tests, where they go and execute many things at once. So even with this number, 252, there are many more things happening there than in any one of the small JUnit tests for Eclipse. >>: What is the number of lines of the tests [inaudible]? >> Darko Marinov: I don't know that by heart, but -- >>: Each individual test, is it 10 lines, a hundred lines? >> Darko Marinov: Most of the tests, interestingly enough, are one line, because all the test does is -- so when you ask about the number of lines, there are two things to distinguish: there is the code of the test and the test input, right? So what is in the file? The code is usually just one line. It would say: apply encapsulate field on some project P0, or project P1, P2, or P3, and check that the result is the same as the expected E0, E1, E2, or E3. Maybe the size should rather be measured in terms of the program files, how big your program input is. Because this testing is mostly data driven, right: your input data is the program files, and the code to actually apply the test is fairly simple -- load the whole project, apply the refactoring, and check whether the output is the same as expected. And that's very generic, so you write that once, and then each test is just one line that calls the generic thing. Actually, not to go there, but each test actually has zero lines, because there's reflection that infers from the name of the test method which project to read.
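A hypothetical sketch of what such a data-driven test looks like; the class, method, and helper names are illustrative, not the actual Eclipse test code.

```java
import junit.framework.TestCase;

// Each test body is one line; the generic helper loads an input project,
// applies the refactoring, and compares against the expected output.
public class EncapsulateFieldTest extends TestCase {
    public void testP0() { check("P0", "E0"); }
    public void testP1() { check("P1", "E1"); }

    private void check(String inputProject, String expectedProject) {
        String actual = applyEncapsulateField(load(inputProject));
        assertEquals(load(expectedProject), actual);
    }

    // Stubs standing in for the real project loading and engine invocation.
    private String load(String project) { /* read project sources */ return ""; }
    private String applyEncapsulateField(String sources) { /* invoke engine */ return sources; }
}
```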
The point there was that, you know, these people care about this, and they built a large number of tests, and still we show that we are finding bugs in that code. So what we actually wanted to do here was to automate this testing. Rather than manually creating all these test inputs and then checking whether the outputs are correct, we wanted to automate both the input generation and the output checking. A lot of testing research goes, you know, along the lines of: I'll give you this refactoring engine and some big pile of code, and somehow, magically, just by looking at this code, generate the inputs that could show bugs, or maybe inputs that achieve coverage or something like that. In this domain, however, it's very hard to do that. I mean, refactoring engines are just too complex to be able to generate test inputs for them that way. The inputs themselves are also fairly complex: you need to effectively generate programs, you need to generate syntax trees that satisfy the syntactic and semantic constraints of valid programs. You have very strong preconditions on these test inputs. So we did not even attempt to do that. We did not want to look at the refactoring code and just from there automatically generate tests. What we actually wanted to do is something else. So basically we made these assumptions: the tester has -- yes, Nikolai? >>: I have a question. So a very simple way to test appears to be the following. I can just take any program which has already been tested, which has lots of tests already written, and then just apply refactorings and run the tests [inaudible] refactored one, and you should see the same behavior, right? So that can be in addition to these tests. >> Darko Marinov: Yes, yes, yes, that's a very good idea. That's something that we did discuss. So you take an existing program, as you said, that [inaudible] tests, now you go and refactor that program plus its tests, and then you rerun the tests and see whether they are still passing. Yeah, that's definitely something that can be done, something that we discussed. We did not do that -- sorry? >>: You still have to come up with specifications, you know, what you want to refactor. >> Darko Marinov: Yes. Yes, in theory you could just traverse and apply all possible refactorings. If I have a program with many classes, I could just say, okay, take any of these classes, try to rename it to some new name that does not appear there, or then try -- >>: [inaudible] lots of tests you should have quite some confidence [inaudible]. >> Darko Marinov: That's true, except that you may miss many of the bugs that we have found. You know, if the programs that you have do not have certain kinds of, you know, weird properties -- if in your programs you never use something like this -- you're just not going to find the specific bug. Of course, then you could say, well, maybe this bug does not matter, right? If no real program would ever contain something like this, then, you know, it does not matter. Actually, I didn't even know that this is Java, that you can put these parentheses on the left side of an assignment. But the thing is, usually, when you go and try to look for these bugs, these bugs are, I suppose, social creatures -- they tend to go together -- so when you find some bug somewhere, there are more bugs there. Actually, once we found this bug in encapsulate field, we went from there and found many more bugs. So even if this specific one looks like "who cares," there are many others that may be much more important. But what you suggest is one way in which this testing can be done. We discussed that, but never, you know, did it. What we did was to look at this other thing here: we made the assumption that the tester kind of knows what inputs could expose bugs -- the tester has good intuition for that. Say, for example, for encapsulate field, we would say maybe there are some problems with inherited fields: if some subclass is inheriting a field and referring to it in some weird ways, that could potentially show a bug. So that's one important assumption behind the work. And the other one is that it's labor intensive to manually write many input programs. If this encapsulate inherited field case requires you to write only one or two test inputs, that's fine, you can just go and do that manually and be done. But if it requires you to write, say, thousands of them, then it's very hard to go and manually generate them one by one. So if you want to take this approach, the challenges become how to codify this tester's intuition -- if I already know where the bugs may be, how can I turn that into a way to automatically generate a large number of test inputs -- and then also how to automatically check that the outputs are correct, once I start generating inputs automatically.
And so the general solution that I propose for this problem is something I call test abstractions. The idea of test abstractions is that they conceptually describe a set of test inputs. The main idea is that instead of manually writing a large number of tests, the user writes these test abstractions, and then tools automatically generate the tests. If you want to test refactoring engines, rather than manually writing a large number of input programs, you just somehow describe the input programs that you want to generate, and then have the tool generate them automatically. And the point that I'm making is that this is useful not only for test generation, when you want to generate these things once, but also for maintenance. Say you decide to make some changes in your test inputs, some changes in your code, and you need to regenerate things: if you use these test abstractions and you have descriptions of a set of inputs, you can just regenerate everything. Whereas if you had manually written, you know, hundreds of thousands of them, you would need to somehow manually update them all, or write some scripts to change all these tests. So if you do something like this, using these test abstractions, then you're going to be automatically generating tests; but once you have that, you have the issue that you need to check whether the code is actually working correctly or not. Namely, we run into the problem of test oracles: we need a way to automatically determine whether an execution was correct or not. And then the other related problem is clustering: once you start getting tests that fail, we would like to group those that are due to the same fault, so you don't need to explore all of them. Yes, Tom? >>: So the first two, three, the test abstractions, that sounds sort of like model-based testing. >> Darko Marinov: Yes. But you need to have a different buzzword to get a CAREER proposal from NSF. So my buzzword is test abstractions. Yes, but you could call this model-based testing, I guess. That would be fine. >>: I mean, is there some technical difference between these? Are these model programs, these test abstractions [inaudible]? >> Darko Marinov: Well, they could be. I mean, test abstraction is a general term, and, you know, you can put whatever you want there, so I would say that model-based testing is one approach to doing test abstractions. Usually when you say model-based testing, people think of certain models, right? I mean, they would think of state machines, but not necessarily -- different people may use different models. But suppose that you want to describe some complex inputs, such as, you know, Java programs. Would you call that model-based testing? You know, what would be the appropriate model? Of course you can do that and just say, you know, my models are some grammars, or something that describes what programs are, and this could be called model-based testing. I think that model-based testing usually, at least to me, means that you're generating some kind of sequence of inputs that you are giving to your program. At least in my mind. Yuri, what would you think of model-based testing? Would you think of generation of complex inputs as model-based testing or not? >>: [inaudible]. >> Darko Marinov: Okay.
I mean, this could have easily been called model-based testing. Often, when you do model-based testing, the model kind of has the test oracle embedded in it, right? The model not only helps you generate the inputs but also check the outputs. Whereas here, as I'll present it, the model somehow does not embody what the correct output is. Okay. So that was the general thing about test abstractions, or model-based testing, or however we want to call it, and here's the specific solution that we built for testing these refactoring engines. We developed something we call ASTGen, a framework for generating abstract syntax trees; this is our way of generating input programs for testing refactoring engines. Basically, it provides a library of generators that can produce simple parts of ASTs, and then there is a way to combine these simpler things and build larger programs. So that's as far as the test inputs go. As far as the test outputs go, we just developed a variety of oracles manually -- there was no automation in terms of automatically generating oracles; they were manually written and then automatically run. And we have some ongoing work on clustering, basically trying to group failing tests together according to their causes. So ASTGen was the main thing that we developed, this framework for generating abstract syntax trees, and we had a few design goals for it. First, we wanted it to be imperative; the idea is that the tester can control how to build this complex data. In some previous work that I did, we had taken a declarative approach: the tester would not describe directly how to generate complex data, but would only describe what the properties of the data were. You would just describe what a Java program is, and maybe what specific properties you want it to satisfy, and then the tool would generate it. Here the approach is different, in that the tester directly writes how to build the data. We also wanted this to be iterative, which means that it can generate these inputs lazily, because oftentimes you can end up with thousands of them, or even millions of these inputs, so we don't want to generate them all at once. We also wanted this framework to be bounded exhaustive. There are some interesting points here. The idea of bounded-exhaustive testing is that you want to try all tests within given bounds. Say, if you want to generate programs, you may say, well, we are going to generate the programs that have up to three classes, or we are going to generate the expressions that have up to three levels of nesting, and so on. So you put some bounds on the size of the program, and then you want your testing to try all possible test inputs within those bounds. Yes? >>: So those appear to be also desirable by compiler writers, right, compiler [inaudible] generate all these programs because the compiler [inaudible], so I was wondering whether somehow this aims to be broader than just refactorings. >> Darko Marinov: Yes. In theory, one can use ASTGen to test any other piece of code that takes programs as inputs. You can just use ASTGen to generate various kinds of inputs and run that. Now, if you want to test compilers, that becomes a bit trickier, because you need some way to check the output.
And checking the output of a compiler is, you know, much trickier than checking the output of a refactoring engine. So, yes, conceptually you can use this to test a compiler, but it wouldn't quite work directly. Another issue here is that when you go to generate these programs, sometimes it may be very hard to generate the programs exactly as you want them. Suppose that we want to generate only input programs that compile. That may be fairly hard to express in this framework. So what you do then is generate a larger superset of programs and use the compiler to filter out those that don't compile. So what I'm trying to say is, if I wanted to test specifically a compiler, I might have a harder time using this, because of both the output checking and the properties of the input. Yes? >>: [inaudible] refactoring tools have to support, like, [inaudible] correct programs that don't compile, because the user's always in the middle of making changes? >> Darko Marinov: Yes. They do support that, in the sense that they let you apply refactorings on programs that don't compile, but usually that just comes with a big warning, and the tool says: look, your program doesn't compile; maybe I cannot even parse your program; I really don't know if you have class A appearing somewhere where my parser got lost, and I don't know whether there is an A there or not. If you wanted to replace A with Airplane, I may easily miss that, and I cannot give you any guarantee about it. In most cases -- I don't have any empirical data to support this, but I think in most cases -- developers probably invoke the refactoring engine only at points where their code does compile. Because otherwise, if your code doesn't compile, and especially if some parts cannot even be parsed, then it's very hard to know what guarantees you are going to get out of the engine. So all our testing was done with inputs that do compile. We still found many bugs. I assume that if you go and test the cases where the input program does not compile, you'd be able to find even more bugs, but it would be hard to describe even what the correct behavior of the refactoring engine is in those cases. So going back to this: we do bounded-exhaustive testing, and the goal is to catch the corner cases. The reason why we want to try all possible inputs is to catch the corner cases that may be there. And then, last but not least, we wanted this to be composable. The idea is that you write generators that can create the simple parts of inputs, and then from there you can build the larger parts. Here is how the whole testing process looks. If one wants to use ASTGen, first the tester manually writes a generator using the framework; then the tester instantiates this generator by providing some bounds -- maybe how many classes you want to have, or what their relationships should be, and so on -- and then there needs to be some driver code that actually runs the whole thing in a loop. So here is an example of how this driver code looks.
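A structural sketch of that driver loop, in hypothetical names (this is not the actual ASTGen API): the generator lazily yields input programs, and for each one the driver applies the refactoring and checks the oracles.

```java
// Hypothetical driver: iterate the generated programs, refactor, check.
interface ProgramGenerator {
    boolean hasNext();
    String next(); // source text of the next generated input program
}

class Driver {
    static void run(ProgramGenerator gen) {
        while (gen.hasNext()) {
            String input = gen.next();
            String output = applyEncapsulateField(input, "f"); // engine under test
            checkOracles(input, output); // e.g., does the output still compile?
        }
    }
    static String applyEncapsulateField(String program, String field) {
        /* invoke the refactoring engine on the given field */
        return program;
    }
    static void checkOracles(String input, String output) {
        /* compile the output, run custom checks, compare engines, ... */
    }
}
```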
So here we need to create some generator -- say, if you want to encapsulate some field, you would say we want to encapsulate the field F, and then we need some generator that can actually create the programs that exercise that -- and then there is a small piece of scaffolding code that does this: we get the refactoring that encapsulates a field, we tell it to encapsulate the field F, we perform this refactoring on each input program, and then we just check whether the output actually satisfies the required properties or not. So let's see now what an example generator looks like. This is really the key part of the framework. Again, the tester has some intuition about what kind of test inputs may show bugs, and sometimes you can express this intuition fairly easily in English -- you can just write two, three sentences describing a set of input programs. It's very hard, though, to actually go and manually write all the programs that satisfy the description. This particular example is from our testing of encapsulate field, a generator that we called the double class field reference generator, and here is its short English description: produce input programs that have two classes that are related in various ways by containment and inheritance, where one class declares a field and the other references the field in some way. Down here we can see some examples of programs that satisfy this description. We have two classes, A and B. They may be in various relationships, like subclass or inner class and so on, and they reference the field in various ways. Here we are only showing three examples, but of course the number is much larger; in fact, the number is unbounded, because we can reference the field F in an unbounded number of ways in various expressions. But even if you put some bounds -- if you say we want the nesting depth of these expressions up to some bound, two or three -- you still end up with thousands of these programs. Then the question is how we can go from this English description to these thousands of programs. We don't do anything with the English description itself -- we don't try to analyze natural language -- we actually ask the tester to express these properties directly in code. So here are the parts of the description: we want classes that are related by containment or related by inheritance, we want one class to declare a field, and we want the other class to reference the field in some way. Each of these parts effectively corresponds to a large number of ways in which it can appear in programs. And then for each of these parts we build a small generator that focuses on that one thing. For example, for the containment between classes, we build a specific generator that can generate all the different possibilities: maybe the classes are independent, or one is an inner class, or a method-local class, and so on. For the inheritance, we can again generate all the possible ways in which one class can inherit from the other: they can be unrelated, or a superclass, or a subclass, or related through an interface, and so on. If you want to declare a field, there are many ways in which the field can be declared: it can have different types, it can have different visibility, and so on. Again, we build a field declaration generator that can enumerate a large number of these declarations. And also, if you want to reference a field, there are various expressions with which we can reference the field, and again we build a small piece of code that can produce all these pieces of abstract syntax tree. And now, in order to test the program, what we actually want to do is take the cross product of all these possibilities: we want to generate all possible programs by combining these things, and then test the refactoring engines on the resulting programs.
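A toy sketch of the cross-product idea, with strings standing in for AST fragments (the real ASTGen generators produce typed AST pieces, so this is only an analogy): each small generator enumerates one dimension, and the composite enumerates every combination.

```java
import java.util.List;

class CrossProductSketch {
    public static void main(String[] args) {
        // One list per small generator; real generators yield AST pieces.
        List<String> containment = List.of("B top level", "B inner class of A");
        List<String> inheritance = List.of("B unrelated to A", "B extends A");
        List<String> fieldDecl   = List.of("int f;", "public int f;");
        List<String> fieldRef    = List.of("f = 0;", "int x = f;", "super.f = 0;");
        // The composite walks the full cross product; some combinations are
        // invalid (e.g., super.f without inheritance) and get filtered later.
        for (String c : containment)
            for (String i : inheritance)
                for (String d : fieldDecl)
                    for (String r : fieldRef)
                        System.out.println(c + " | " + i + " | " + d + " | " + r);
    }
}
```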
Yes? >>: You want [inaudible] basically [inaudible] for all kinds of programs. Here [inaudible] like A, B, and [inaudible] called F. So to what extent [inaudible] actually want to generate. Are the names [inaudible]? >> Darko Marinov: Okay. So the question is whether the names are hard coded. You could hard code them, but they are not necessarily hard coded. Typically the names are just given as parameters. So if I build here a generator that creates these programs, I can pass the field name as a parameter. >>: [inaudible]. >> Darko Marinov: If you're generating only one field and you say it should be F, then in all programs it will be called F. Did I answer the question? It would never generate a field G, unless you have a different generator that goes and generates something else. Here you would also give the names of these classes, you know, A and B. Because for some programs you may want to generate things with three classes, right? So if you want to generate something with three classes, presumably they would be called A, B, and C, but you may want different relations between B and C, and A and C, and A and B, so these are parameters that you can just pass in. Yes? >>: If you suspect [inaudible] identifiers are handled in the refactoring, could you have an identifier generator that would then add to the cross product? So that would affect identifier creation for all [inaudible]. >> Darko Marinov: Yes, yes. One could do that. So the field declaration generator, I guess, takes a parameter which is the name of the field to generate, and then what the possible types to put there are, and maybe what the possible visibilities are, and so on -- whether there is also an initialization for this field, and so on. It's much more involved than what is shown in this example. So each of these generators takes a number of subgenerators for the generation. The field declaration generator takes one generator which says what types to generate, another generator which says exactly this, what you asked -- what identifiers to generate for the field names -- yet another one for the possible visibilities, and so on. >>: And that's also used by these [inaudible] to say [inaudible]. >> Darko Marinov: Yes. Then you could pass that same identifier generator, say, in here, to the one that generates field references, and say, if you are generating, say, F and G, then it would generate here F and also G, this.f and this.g, a.f and a.g, and so on. Yes? >>: [inaudible]. >> Darko Marinov: That's an excellent question. Once we start combining these things, the problem is that there may be dependencies -- one thing can depend on another, right? So if I generate here F or G or some other field name, then when I generate here an expression that references it, I'd better use the same identifier. Otherwise I'm going to create a program that doesn't compile, if I have here super.g and there I had [inaudible].
So the way that's done is that you need to build these dependent generators. And there, things become a bit trickier to express, especially to put the generation in the right order, because now it means you first have to iterate this one to produce the values, and then you need to iterate the other one to refer to those values. But in general, the problem arises when some of these compositions may be invalid -- say, you cannot take this particular containment of classes together with this particular inheritance between classes, because you are going to get something that doesn't compile. And there are various solutions. One is going to dependent generators, which means you spend more work describing what's proper and what's not. Another, sort of the easiest solution, is to just delegate this to the compiler. And that goes back to this question of whether you could use this for testing a compiler. The problem then is that you would need to spend more time describing all these dependencies. And if you want to avoid that, if you want to be lazy, then you just make the generator produce things that don't necessarily compile, and then you compile them to filter those out. Yes? >>: This is very, very heavily language-testing dependent; do the same kind of cross-product principles apply to data generators, also? >> Darko Marinov: Yes, yes, yes. You know, I claim that that's possible to do; I've done some research on that before, just not using this particular kind of generators, not using ASTGen -- something else that we called Korat -- basically generating complex data. We've done those things for, say, testing basic data structures: you may want to generate, say, binary search trees or red-black trees and so on. And some of those things were also used at Microsoft for testing various things. They've used it for generating some XML documents and other things. What were the other things? >>: [inaudible] codes? >> Darko Marinov: Parts of the serialization code and so on. So a number of things were also done here where some data was generated using conceptually the same idea. So these are what I like to call test abstractions, or we can call it model-based testing or something, but the idea is: describe a set of test inputs such that somehow the tool generates them. Of course, then the question just becomes how you actually describe this set of test inputs -- what is the language used for that -- and how the tool generates them. >>: But do you [inaudible] all three or just the compiler? >> Darko Marinov: You know, I [inaudible] the students who go and implement all of this, so I just wave hands. No. All these things, all these three things, are supported in the framework. There is this thing with dependent generators. I'm not going to go much into the details of how this actually works, but the way this is done in the design of the generators is that it distinguishes two phases. One phase is how to iterate to the next value: during the generation, suppose that I chose first this, this, this, and that; now I need to go to the next value, which means I need to move one of these guys forward to actually try the whole cross product. And as you move forward, it may immediately become illegal to combine these two things. Say, if the field F is not actually in this class but somewhere else, and you don't have the field F, then you cannot generate this reference. So one concern is how to move forward, how to describe what moving forward should be; and the other concern is, once you generate these pieces, how to combine small pieces into larger pieces. So what the framework actually does is distinguish these two concerns: one is the iteration and the other is the composition.
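A sketch of that separation in hypothetical interfaces (again, not the actual ASTGen API): advance moves the iteration forward odometer-style, and current assembles the piece for the current state.

```java
// Iteration and composition as separate concerns.
interface Generator<T> {
    boolean advance(); // iteration: step to the next value; false when exhausted
    void reset();      // restart from before the first value
    T current();       // composition: build the piece for the current state
}

// A composite walks a cross product like an odometer: step the fast dimension;
// when it wraps around, reset it and step the slow dimension.
class PairGenerator implements Generator<String> {
    private final Generator<String> first, second;
    private boolean started = false;
    PairGenerator(Generator<String> first, Generator<String> second) {
        this.first = first;
        this.second = second;
    }
    public boolean advance() {
        if (!started) { // first call: initialize both dimensions
            started = true;
            return first.advance() && second.advance();
        }
        if (first.advance()) return true;            // step the fast dimension
        first.reset();                               // it wrapped around, so
        return first.advance() && second.advance();  // reset it and step the slow one
    }
    public void reset() { first.reset(); second.reset(); started = false; }
    public String current() {
        return second.current() + first.current(); // composition: combine pieces
    }
}
```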
And what that actually allows is to make some of these things easier, even for the dependent generators. It would take much more time to describe this fully, so we can take this offline and I can show you some of these things. But basically the framework supports all of these approaches: you can write filters that just throw things away as soon as they are generated, if they are invalid; or you can force the generation to only produce valid things; or you can just delegate everything to the compiler. There is a trade-off between the amount of work that you spend on writing the generator and the generation time. Because if you wait for the compiler, then you may just be wasting some of the time and throwing stuff away, but, you know, you just wait a bit more. So that was all about inputs. I'm not going to go into more detail, but I'll be happy to discuss this offline. The other thing that we needed to do was about oracles, which is to validate the outputs of the refactoring engines, and of course the challenge there is that you don't know the expected outputs, because we automatically generated the inputs -- that's the case whenever you automatically generate the inputs. Another issue is that, at its base, a refactoring requires that the output program be equivalent to the input. This of course is undecidable by itself, but the thing is, our problem is even harder than that -- it's even different from that. Not only can you not check this in general, but we also want to check that the structural changes were made, as in that example with the field: if you do want to encapsulate the field F in a setter and getter, you want that change actually to be made. So the question is how to do that. As I said, we just built a number of oracles manually, ranging from simple things -- whether the refactoring engine crashes altogether; we never found any bug there -- to whether the output program compiles, and then whether we are getting appropriate warnings from the refactoring engines, because remember, it need not always generate the refactored program; it may sometimes say, I cannot refactor this, similar to how a compiler would say, you know, you have a compile error and so I cannot generate the assembly code or whatever code it should generate. Then there were also a few interesting things that we did, for example, these inverse refactorings. Many of the refactorings you can apply the other way around: say, if you rename A to B, then you can rename B to A and hopefully get the same input program. So what you want to do is check, in a [inaudible], that starting from some program, renaming A to B and then renaming B to A gives you the same thing. Of course, you need not get exactly the same thing at whatever level you compare -- if you just print it out, it may not have exactly the same files, exactly the same sequence of characters. Even the ASTs may not be exactly the same, so we needed to build a tool that tries to compare the ASTs while ignoring some of the details: maybe the order of methods would be different, or maybe the fresh name that was chosen in some rename would be different, and so on. So you need a comparison that ignores some of that stuff.
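A sketch of the inverse-refactoring oracle with hypothetical helpers; the real oracle compares ASTs rather than raw text.

```java
// Apply rename A -> B, then B -> A, and compare with the original program
// modulo irrelevant details (member order, whitespace, fresh names, ...).
class InverseOracle {
    static boolean check(String program) {
        String there = rename(program, "A", "B"); // forward refactoring
        String back  = rename(there, "B", "A");   // inverse refactoring
        return normalize(back).equals(normalize(program));
    }
    static String rename(String p, String from, String to) {
        /* invoke the refactoring engine */
        return p;
    }
    static String normalize(String p) {
        /* parse and canonicalize the AST, dropping tolerated differences */
        return p;
    }
}
```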
Then there were some custom oracles: for example, if you apply encapsulate field, you want to check that there are no more references to the field except from the setter and getter. And then, last but not least, we did differential testing: you take the same input program, you give it to both Eclipse and NetBeans, and then you check whether you're getting the same output program or not. Of course, even there, "the same" means modulo some changes that you tolerate. Yes? >>: So a refactoring engine which does nothing [inaudible]. >> Darko Marinov: Well, it would presumably not pass this one, the custom one, because you would check for certain structural changes and find that it didn't make them. And then also you hope that, out of these two refactoring engines, at least one of them does something, and then the difference would reveal the bug. Yes? >>: [inaudible] involves actually running the program and seeing whether the input or output changes. >> Darko Marinov: These that I'm mentioning here do not, but we also experimented a bit with that. That becomes much more complex: now we have automatically generated programs, for which we automatically generate tests, and then we want to refactor the program and run those tests. So it becomes a bit -- we had, I think, one experiment with that, but we did not pursue it too much. And these are all good questions. I guess the thing was that we still found a lot of bugs even with the stuff that we had, so that's why we didn't pursue some of the other things. So here is what we did with ASTGen: we tested Eclipse and NetBeans, eight refactorings, I guess, from each. They target various program elements -- fields, methods, classes -- we had things like encapsulate field, move method, rename class, and so on. We had about 50 generators, some of them very small ones for generating small parts of ASTs, and some complex ones for building entire programs. We actually found 47 new bugs, and we reported these bugs. Then we did some comparison of how good these oracles are and how well the generation works. So here are some of the results for the generators: say, for testing encapsulate field, we've written a number of generators that generate various programs -- I guess the one used in the example was this double class field reference. If you run that, it can produce a number of inputs, some of them, you know, in the hundreds or thousands. Here is the time taken to produce all those inputs and to run the refactoring engines, and here is the number of bugs found in the refactoring engine. What we found overall was that the generation and compilation times were much less than the refactoring time -- actually running the refactorings and checking the oracles. So the generation by itself was not that big a problem, but the execution of the refactorings was actually taking a lot of time. So that's as far as the machine time goes.
And then, as far as the human time goes: in some of the initial experiments we tried to track how much time it takes to build one of these generators, and it took about two work days. But that was really still the initial phase, where we were not just saying, okay, we want to build this specific generator; we were still also developing the library -- not only writing the specific generator, but also writing the small subgenerators that we needed. Nowadays, the library is much bigger and there are many things that you can reuse, so it takes about two hours to write one of these, and in turn that one can produce, you know, this many input programs. Yes? >>: How many generated inputs actually compiled? >> Darko Marinov: I don't have that number with me here, but, again, that's in the paper. I believe about one in three or four would compile. I think it all depends on what specifically the generator was doing and so on. >>: [inaudible] easy to do or [inaudible] these five generators that found most of the bugs? >> Darko Marinov: So the way it went is -- this table is much bigger; there are many more refactorings that we tested, and for various refactorings we would have more generators and so on. The reason we had the most generators for encapsulate field was, as I said: once you start finding some bugs somewhere, you figure out that there may be more bugs there. So we found some bugs in encapsulate field and then just figured out that this refactoring was probably not as robust as the others. Actually, you can even see that simply from when this one was built: maybe two, three years ago, whereas the others are seven, eight years old. So these older ones did not have any bugs; actually, here in Eclipse we did not find any bug with rename, at least with these generators. That does not mean that some other generator could not find a bug -- we did still find one for NetBeans. But for this encapsulate field, we would just find bugs with almost any of these generators, and at some point you stop writing any more of these generators, even though we could probably find even more bugs there. Yes? >>: So this number of bugs, does it mean that a test failed, or an actual bug? >> Darko Marinov: This is an actual bug. Actually, the way it goes -- let me just see if I have this number here. Yes. So here's a table that shows a bit more of what happened there. You've seen that we can generate hundreds or thousands of test inputs. When we run them, we can still get a large number of failures. What this table shows is the number of failures for the various oracles. WS, I guess, was warning status; DNC was does not compile -- we give it a program that compiles, we run the refactoring engine, and we obtain a program that does not compile. As you can see here for this particular case, say this double class field reference generator, I think we were generating about 4,000 -- that was on the previous slide. So here we are generating 4,000 of them, and once we run them, we obtain a few hundred that actually fail. We would get 187 of them that fail the does-not-compile oracle, hundreds more that fail the custom oracle, and 500 more where Eclipse and NetBeans would differ.
So this was the number of failures, but the actual number of bugs was only this: the number of bugs that we reported in the bug tracking system for Eclipse was just one. So the issue here is that you can get a large number of failing tests that are actually due to the same underlying cause -- they are due to the same bug. And one needs to address that somehow; that's part of our ongoing work, and I have a few slides on it. Does this answer the question? >>: Yes. >> Darko Marinov: Okay. Yes? >>: You're saying all of those failures were the result of one bug? Or are there [inaudible] sort of hidden, and once we fix that bug, there's going to be -- >> Darko Marinov: So in this particular case -- now, again, I don't have the data -- it may be that we found more; the numbers in this column are what we reported in the bug database. We did not report the things that were already fixed in a later version. All of this work was done on a somewhat older version: we needed to take an old version and stick with it, to build this whole infrastructure for automatically running everything -- there needed to be a lot of changes to actually get this whole stuff to run automatically. So at the time when we would find some bug, we would check against the newer version whether it was already fixed or not; we were about six months or so behind, so they would have already fixed some bugs. So that was one thing. And then sometimes we would find a bug that was already reported previously -- even if it was not fixed in the current version, it may have been reported, and we did not want to go there and file duplicate bug reports. So what this column shows is just how many we actually put in their bug tracking system. We did find some more; I don't have the numbers on this slide of how many more we exactly found. >>: So with respect to bugs, how many bugs does this approach find -- the one is not really representative? >> Darko Marinov: It's slightly more than one, but not much more, maybe like two or three. So it's not like, oh, it found 20, but all those 19 were either fixed or reported; it's just slightly more than what's shown here. Those are all good questions. Any more questions? >>: I'm just curious about the number of open bugs in each of those, in the Eclipse refactoring [inaudible], just to give us perspective: are there just ten open bugs and you added 21, or are there a thousand open bugs? >> Darko Marinov: I think probably on the order of a hundred, but again, I would not know the exact number. Actually, it's very hard to search through those bug tracking reports. We wanted to do even simple things like, you know, find which of these refactorings had the most bugs submitted in, say, the last six months -- that one would be, say, the least robust, and maybe we want to focus our testing effort on that one. Even asking this simple query, it is actually fairly hard to get the numbers. Eclipse uses Bugzilla for their bug tracking, and NetBeans uses Issuezilla, which is very similar to Bugzilla. Asking these simple queries is actually fairly, fairly hard. Of course, one can just go and search for "encapsulate field," but then you are going to miss some of them: some people just don't use the name encapsulate field for the refactoring, some refer to it by some other name, and so on. So improving these bug tracking systems -- being able to better search through them -- is a challenging and important problem.
Sometimes, if you just want to find, say, rename method, it's fine: you can search for "rename method" and find it. But if we wanted to find things that are cross-cutting, say all bug reports stating that an incorrect warning status was generated, regardless of which refactoring, that is almost impossible, unless we download all the hundreds of open bug reports and someone manually inspects them. So I think the number of open reports is maybe on the order of a hundred or so, across all the refactorings. Eclipse has, I think, about two dozen refactorings, of which we tested only eight, and NetBeans has maybe slightly fewer, about 15 or so. So here are the results we obtained by running this: we got 47 new bugs, 21 in Eclipse, of which they confirmed 20, and 26 in NetBeans, of which they confirmed 17 and have already fixed three. They marked one as something they don't want to fix, although we still think it is incorrect. And then, and this is I guess what made the students furious, they marked some of our reports as duplicates. The students had spent a lot of time checking whether the bugs were duplicates, doing their best not to report something that was already there, but then the triagers or developers on the NetBeans side came and said that some things were duplicates. We still think they are wrong there, in the sense that if they go and patch whatever they said was duplicated, that is not necessarily going to fix the actual bug we reported. That is very hard to evaluate; we need to wait for them to actually patch the duplicate and see whether they also fixed our bug. And as I said, we did find some more bugs but did not report them, because they were either obvious duplicates of something reported previously or already fixed by the time we found them. What is also interesting in these results is that NetBeans started including ASTGen in their own process: not only the manually written tests they already have, but also some infrastructure to run this kind of generation. It's still not finished, but they have started working on it. >>: [inaudible]. >> Darko Marinov: So the question is about the bug report they do not want to fix, even though it is incorrect? I happen to remember that one; it was also related, I think, to encapsulate field. The problem there is that in most cases, when you create the setter, its return type is void. But some cases are fairly tricky: the return type of the setter needs to be the same as the type of the field, so here you would need int setF rather than void setF. The reason you need that is that you can create a weird expression where the assignment is actually a subexpression of something else, so the assignment not only needs to change the state but also needs to return the new value. So you would write this.f = f, then return this.f, and the void in the signature needs to become int. What they said is that they don't want to fix that, because doing so would violate the JavaBeans conventions: when you build JavaBeans, the setters have to return void, otherwise reflection cannot find them.
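To make the tricky case concrete, here is a small illustrative example; the class and member names are made up for illustration, not taken from an actual generated test:

    // Before encapsulate field: the assignment to f is a subexpression,
    // so its value is consumed by the enclosing expression.
    class A {
        int f;
        int m(int y) {
            return (f = y) * 2;
        }
    }

    // After encapsulation: the JavaBeans-style rewrite with "void setF"
    // cannot replace (f = y), because "setF(y) * 2" would not compile.
    // A compiling rewrite needs the setter to return the new value:
    class AEncapsulated {
        private int f;
        int getF() { return this.f; }
        int setF(int f) { this.f = f; return this.f; } // int, not void
        int m(int y) {
            return setF(y) * 2; // compiles and preserves behavior
        }
    }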
But the problem is that their fix should not be "we will change the way we do encapsulate field"; the fix should be "we will raise a warning." If you have an assignment as a subexpression, the refactoring engine should say: I cannot do encapsulate field because you have an assignment as a subexpression. Instead they just said they don't want to fix it, and what that means is that if you have an assignment as a subexpression and you apply encapsulate field, you get something that doesn't compile. This is an easy bug in the sense that you can immediately revert the refactoring and say, okay, the refactoring cannot apply, I'll just do it manually or something like that. >>: Well, [inaudible] in that case? >> Darko Marinov: Eclipse creates int setF here and adds return this.f, so it creates a setter that also returns the new value. >>: [inaudible] when they fix a bug [inaudible] instead of generating something that's wrong, now they [inaudible]. >> Darko Marinov: Yes, it could give a warning. Actually, what we found is that NetBeans is much less aggressive in trying to apply refactorings; NetBeans much more often gives a warning and says, I don't want to proceed with this because I could do something wrong. >>: [inaudible]. >> Darko Marinov: Yeah. A refactoring engine that always gives you a warning saying "I could do something wrong if I proceed" would satisfy probably all our requirements. It would be useless, of course; if one of your requirements was for the software to be useless, it would satisfy even that one. But at least it would be correct; according to our oracles we would never find a bug there. What we would find is a difference: when we run the same input program through both Eclipse and NetBeans, Eclipse proceeds but NetBeans doesn't (a sketch of this cross-engine comparison follows at the end of this answer). Then we need to go and manually inspect why, what the difference is, and whether Eclipse should also have refused to proceed or NetBeans should be more aggressive and actually proceed. But at least that way you are not introducing bugs that could trick the developer by hiding something somewhere. Okay, any more questions about these things? Yes? >>: Just one more question on [inaudible] generation: did you guys spend any time looking at why the compiler wasn't able to compile some of these [inaudible]? You're trying to generate [inaudible] structured programs, right? You're not deliberately increasing [inaudible] or anything? >> Darko Marinov: Okay, you mean in our own generation, whether we looked at why the generated programs cannot compile, right? >>: [inaudible]. >> Darko Marinov: We spent some time on that; that is how we were building these dependent generators and adding some of the filters. But we did not spend too much time on it, because eventually, yes, what we would like is to generate only programs that compile. Maybe someone wants to test a refactoring engine with programs that don't compile, but in general, if you have some properties in mind, you want to generate test inputs that satisfy those properties. >>: [inaudible] might not generate really complex inputs that satisfy the properties, as long as [inaudible] some inputs? >> Darko Marinov: I would actually say that it's a big deal.
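To illustrate the cross-engine comparison mentioned above, here is a minimal sketch of a differential oracle; EngineResult and the notion of driving each IDE's refactoring programmatically are hypothetical stand-ins, not the actual ASTGen harness:

    // Differential oracle sketch: run the same input program through two
    // engines and flag any disagreement, in warning status or in output.
    record EngineResult(boolean refusedWithWarning, String refactoredSource) {}

    class DifferentialOracle {
        static String compare(EngineResult eclipse, EngineResult netbeans) {
            if (eclipse.refusedWithWarning() != netbeans.refusedWithWarning()) {
                // One engine proceeds while the other refuses; a human
                // decides which behavior is the correct one.
                return "DIFFER: warning status";
            }
            if (!eclipse.refusedWithWarning()
                    && !eclipse.refactoredSource().equals(netbeans.refactoredSource())) {
                // Both proceeded but produced different programs; the real
                // comparison is on ASTs rather than on raw text.
                return "DIFFER: refactored output";
            }
            return "AGREE";
        }
    }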
Generating only inputs that satisfy the desired properties is something we should spend more effort on. If I want test inputs that satisfy certain properties, I want to get only those that satisfy the properties. Here we have an easy solution of sorts: we want inputs that compile, and if we get something that doesn't compile, that's fine; we run it through the compiler and just throw it away. But what if you really want to generate only things that compile, or more generally only things that satisfy certain properties? How can you express that more easily? That is part of future work; we don't have a final solution for it. Here is also some ongoing work: we are trying to reduce the machine time and the human time in using the ASTGen framework. The machine time goes to generation and execution of the test inputs. Of course, machine time by itself wouldn't be all that important, but it translates into human time: as I said, nowadays it takes a student about two hours to write one of these generators, but if I write the generator, push a button, and say "start the testing now," and then have to wait one, two, or three hours to get a result back, the developer doing the testing is just idling, waiting for results. That is one part. The other part is the human time for inspection, because as we discussed, you can get hundreds of failing programs that correspond to only one or two actual bugs. So here are the things we are doing. One is to reduce the time to first failure: rather than exhaustively trying all possible inputs, we skip through a number of inputs and try to quickly find one that fails, and if this sparse generation does not find anything, we then proceed exhaustively. We also want to reduce the test generation and execution time, so we are trying a smaller number of larger tests rather than a large number of smaller tests: where we previously had one class with only one expression, here we try one class with many expressions. And to reduce the time for inspection we have oracle-based clustering: we group failing tests together based on the oracle messages, such that they hopefully share the same underlying bug, so there is less to inspect (a small sketch of this clustering idea follows below). The actual results are quite promising: here we can save about an order of magnitude of time; here it's also 2 to 3X; and here we can significantly reduce the effort, sometimes merging a hundred failing inputs and saying these all seem to be due to the same bug, so you only need to inspect one or two of them rather than all hundred. Here is some future work, things that can be done. Specific to testing refactoring engines, one can always try more refactorings and different refactoring engines; maybe someone wants to try this for Visual Studio. Some people have actually used ASTGen: there was a paper presented at [inaudible], I guess two or three weeks ago, where they used ASTGen to test a refactoring engine that they built themselves. Then we can apply ASTGen to other program analyzers, basically anything that takes programs as inputs.
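Coming back to the oracle-based clustering mentioned in this answer, here is a minimal sketch of the idea; the FailingTest record and the normalization rules are assumptions for illustration, not the actual implementation:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of oracle-based clustering: bucket failing tests by a
    // normalized oracle message so a human inspects one representative
    // per bucket instead of every failure.
    class FailureClustering {
        record FailingTest(String inputProgram, String oracleMessage) {}

        static Map<String, List<FailingTest>> cluster(List<FailingTest> failures) {
            Map<String, List<FailingTest>> clusters = new LinkedHashMap<>();
            for (FailingTest t : failures) {
                // Strip numbers and quoted identifiers so failures with the
                // same underlying cause land in the same bucket.
                String key = t.oracleMessage()
                              .replaceAll("\\d+", "N")
                              .replaceAll("'[^']*'", "'ID'");
                clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
            }
            return clusters;
        }
    }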
Along those lines, we've already done a small study with a tool called JavaCOP, developed by Todd Millstein at UCLA, and we found a small bug there. Other things that can be done: reduce or eliminate the false alarms, which I didn't even discuss; because of the comparison of ASTs we don't always get correct results, so we have some false positives there. And reduce the redundant tests: rather than a hundred failures, show maybe just one. That was specific to refactoring engines, but there are more general questions about the test abstractions. Remember, the idea of test abstractions is that in some language you describe a whole set of test inputs, rather than writing them manually one by one. The research there is along the lines of: what languages make this easy to use, how to better describe these sets, and how to generate them faster. And of course, always improving the oracles and the clustering. This, I guess, is what we are also finding to be very important: we can now find many bugs, but you still have many failing tests to inspect, so how can you reduce that effort, even at the expense of maybe missing a bug here or there? It is very hard to read through a hundred programs and try to figure out whether they are due to the same bug or not. So that is the overall story of test abstractions: asking how to describe these tests, what to generate, how to generate it, and so on. And here is the conclusion: we applied this to refactoring engines, we found some bugs, and the code should be available for download there. Okay. >> Nikolai Tillmann: If there are no other questions. [applause]. >> Nikolai Tillmann: And so that will be [inaudible] for the entire week for the [inaudible] contents and the Microsoft [inaudible] so [inaudible].