>> Todd Mytkowicz: Okay. Thanks everybody for coming. This is Matthias Hauswirth from the University of Lugano and the Faculty of Informatics. I was lucky enough to visit Matthias for three months in beautiful Lugano when I was doing my PhD, so it is great to have him here talking about his recent work that has been in both MODELS and OOPSLA.
>> Matthias Hauswirth: Thanks Todd, yeah. Talking about beautiful Lugano, I have to show my mandatory slide, sponsored by the tourist office [laughter], no, it's really that beautiful. So if you ever get to Milan, we are basically a suburb of Milan, but in Switzerland, not in Italy. Or if you get to Zürich, it is a beautiful train ride across the Alps, so visit us. I'm going to talk about two things. My research spans various things, and it also spans the bridge between design and performance. The first part will be about research that my PhD student Dmitrijs is doing, and it is kind of an offspring of something else; he is also working on performance, but this is something that fell out of that. And I would like to bounce this idea off of you and get some of your thoughts on it. So here is a little algorithm example. Cocktail Sort is kind of like Bubble Sort. If you want to sort a list like this, you start from one side and you just swap elements, so you bump the bigger one to the right-hand side, go through it, and then you say okay, the biggest one is on the right, and then, unlike Bubble Sort, you go back towards the left and you shift the smallest one to the left, and then you again go to the other side, and you shake your cocktail here and it is sorted. So that is a little algorithm. If you implement it you usually get a performance of O(n squared), or a best case of O(n), and the code looks roughly like this. So this is in a language kind of like Java, and if you look at this code you may say it is a little bit dense, maybe, maybe not, depending on your preference. And you can maybe refactor it. So you could say here are two things that maybe I should treat as abstractions, and I maybe want to extract a method for each of those, and you end up with code like this. So you say I am walking forwards and I am walking backwards, and around those two method calls I have the outer loop, which basically shifts the bounds towards the center of the array that you are sorting. So this would be a refactoring. Now we have code that may be more understandable, depending on your background. You could even refactor more. I probably wouldn't recommend this, but some people may go to the extreme and say, oh, I want to extract those green things as methods too. So let's say less-than is a method, and increment, and decrement, and greater-than, and you end up with code like this, which probably is not ideal, but you could do it. So now we have three different versions of our algorithm, three implementations, and we see that they differ. But the algorithm is exactly the same. So the implementations have different sizes. Now you can measure the size, for example, by counting the number of methods, at a high granularity, and then you have a spectrum from dense code towards code with lots of indirections, and the question is what kind of metric could we use here to measure this density of your code. The problem is we can measure methods, but that is a size metric, so the bigger your code, the more methods, so that doesn't measure density. You have to divide by the size of the actual algorithmic core of your program.
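For reference, here is a minimal Java sketch of the Cocktail Sort being described, in its dense form and in the first refactored form. This is a reconstruction from the spoken description, not the code on the slides; the method names walkForwards and walkBackwards follow the talk, everything else is assumed.

```java
// Reconstructed example, not the slide's actual code.
class CocktailSort {

    // Dense version: one sort method, three loops
    // (the outer "shake" plus the two directional passes).
    static void sort(int[] a) {
        int left = 0, right = a.length - 1;
        while (left < right) {
            for (int i = left; i < right; i++) {      // bubble the largest element to the right
                if (a[i] > a[i + 1]) swap(a, i, i + 1);
            }
            right--;
            for (int i = right; i > left; i--) {      // bubble the smallest element to the left
                if (a[i] < a[i - 1]) swap(a, i, i - 1);
            }
            left++;
        }
    }

    // Refactored version: the same three loops, now spread over three methods.
    static void sortRefactored(int[] a) {
        int left = 0, right = a.length - 1;
        while (left < right) {                        // outer loop shifts the bounds toward the center
            walkForwards(a, left, right);
            right--;
            walkBackwards(a, left, right);
            left++;
        }
    }

    private static void walkForwards(int[] a, int left, int right) {
        for (int i = left; i < right; i++) {
            if (a[i] > a[i + 1]) swap(a, i, i + 1);
        }
    }

    private static void walkBackwards(int[] a, int left, int right) {
        for (int i = right; i > left; i--) {
            if (a[i] < a[i - 1]) swap(a, i, i - 1);
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```

The third, extreme version would additionally extract the comparisons and index updates (lessThan, greaterThan, increment, decrement) into their own methods.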
So how do we measure the size of the algorithm given an implementation? And we had this thought of saying, well, the essential part of an algorithm really lies in its repetitions. If you look at the algorithmic complexity, the computational complexity, what drives that complexity are loops, are repetitions, so the more repetitions you have, the bigger your O(n). Now there are other things that affect the complexity of algorithms, obviously, like conditionals. But conditionals can't increase the algorithmic complexity; they can only reduce it. You need the loops, otherwise you have O(1). So loops are kind of…
>>: [inaudible] isn't that good?
>> Matthias Hauswirth: It's great, yeah [laughter]. But it is not an interesting algorithm anymore, is it? [laughter]. So if we accept this idea, and I don't say that it is the only idea, but if you accept this idea that the essential aspects of your code are the repetitions, then we can go back to this and we can measure the algorithmic content by just counting loops. So we have three loops, three loops, three loops, and now we can divide those two metrics and get something like the Essence, or the relative Essence, so it is normalized by size: we divide the number of loops, the algorithmic, essential part of your code, by the actual implementation cost, which would be the number of methods in your code. So this is a new metric we introduce that we call relative Essence. The way we compute it is we build a big call graph of your application, and this would be the call graph for this example. Call graphs of course can have cycles if you have recursion; recursions are another form of repetition, so we count recursion headers the same way we count loop headers in a control flow graph. We actually augment this call graph with loops. So for every loop we put an extra node in there, so we can say, for example, that this loop here is calling the decrement method, and we also color all the recursion headers in this graph a different color. And we count the colored nodes, which is the Essence, and we count all of the method nodes, which is the size. So here is the definition; it is very, very simple. Absolute Essence is basically the number of repetitive constructs, loops and recursion headers, and relative Essence just divides that by the total number of methods. If you look at how that actually correlates with constructs people use when they develop code, it is quite interesting. If you look at the code smells that Fowler's book, the refactoring book, lists, here are all of the code smells from the book, and I colored them by whether those smells represent high relative Essence, so kind of loopy, highly recursive code without much indirection, that would be red, or low relative Essence, where you have a lot of indirections, that would be green. And you see that most of those code smells actually correspond to really dense code, so he basically says if you have dense code without many indirections in between, it is a bad smell, or bad smells are like that, like Long Method, a really big method, which is like my left-hand example. But there are other examples too, like Middle Man, where you have a proxy or a mediator sitting in between which doesn't do anything, it just forwards or mediates. Same with refactorings. Refactorings are used to get rid of a code smell, so you are transforming something so that you get a better design.
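Stated compactly, the definition just given (the notation is mine, not from the slides):

    absolute Essence = (number of loop headers) + (number of recursion headers)
    relative Essence = absolute Essence / (number of methods)

For the three Cocktail Sort versions, the absolute Essence is 3 in each case (three loops, no recursion), while the number of methods grows with each extraction, so the relative Essence drops from version to version; the exact values depend on which helpers you count as methods.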
Now what you see here is that green is dominating, so when you perform a refactoring, you usually introduce indirections, so you reduce your relative Essence. You see that some of the refactorings are paired, like Extract Method, which adds an extra level of indirection, and Inline Method, which does just the opposite. There are quite a few of those pairs, but most of the refactorings are green; that is the second set of refactorings here from the book. If you read the book, Fowler actually even says that: he mentions that the most common variant of this refactoring game is to add indirection, and the less common game, which also exists but is much rarer, is to find parasitic indirection and get rid of some extra bloat in your code. If you look at design patterns, you get the same picture. If you think about what a design pattern does to your code: usually, the green ones, it introduces extra indirections. The black ones don't really do anything either way. But none of the design patterns is indicative of inlining stuff. We measured this Essence metric on real-world systems, so we used the Qualitas Corpus, which has 100 applications. It is used as a benchmark for static analysis, and we added the common Java benchmarks and the Java runtime libraries from three different platforms, so that's a total of 133 programs, 11,000 packages, 200,000 classes, 2 million methods, and in there roughly 10% of the methods have loops and roughly 1.5% are recursion headers. So here is kind of an overview. On the x-axis you see the size of those programs in terms of the loop call graphs, so the number of methods plus loops in those programs. Each point represents one of the programs, so you have 133 points. And the y-axis is our metric, relative Essence. So the higher you go, the more algorithmically dense it is, and the lower you go, the more indirection is in your code. The yellow bar here represents the interquartile range, and the line is the median, which is roughly 0.2 loops and recursion headers per method. You see some extreme outliers. The red ellipse actually represents the SPEC benchmark suites, no wonder. They are very small; they are not representative. And you also see…
>>: [inaudible] and C++ [inaudible]?
>> Matthias Hauswirth: JVM. JVM and…
>>: Oh, these are JVM?
>> Matthias Hauswirth: SPEC JVM 2008. This is all Java here.
>>: But which benchmarks are in 2008?
>> Matthias Hauswirth: I don't remember all of them. I think they are different from the '98 ones.
>>: Right.
>> Matthias Hauswirth: But I think there is still some kind of compiler in there and some more numerical ones. I think some encoding, decoding, some security stuff.
>>: Okay.
>>: [inaudible] spec [inaudible] benchmark [inaudible] Java [inaudible] benchmark.
>> Matthias Hauswirth: Java [inaudible] is still there, yeah.
>>: Okay.
>>: [inaudible] [laughter].
>> Matthias Hauswirth: So there are some computational kernels in there, and those are the guys at the top. They are real outliers. They are just loops. They are kind of micro-benchmarks.
>>: [inaudible] loop [inaudible].
>> Matthias Hauswirth: Yeah, they are probably very similar. And DaCapo, I didn't circle them, but they are really more realistic, and then the Qualitas Corpus really contains a lot of open-source applications, so they should be at least representative for a segment of the applications. Now if you exclude SPEC and you look at the others, you can see some outliers even there.
Weka, for example, has dense code, and if you know what Weka is, it is a machine learning library, so it is a big set of classifiers and clusterers; you see like 20,000 or so classes that implement machine learning algorithms. So it is not really a big, complicated application where you have a lot of architecture in there; it is just a large number of compact algorithms. On the other side of the spectrum you have jmoney, which is like Microsoft Money more or less, but it is just glue. It is glue code around existing libraries that do the computation. So if you have glue, then you have a very, very low relative Essence. Yes?
>>: [inaudible]?
>> Matthias Hauswirth: I removed that graph. I can show it at the end if I can find it. I have it, and there are some other very interesting findings about it. Actually, maybe one I can mention: we have JavaCC, or whatever it is called, the Java compiler compiler, right? And we have ANTLR, which is also a compiler compiler in Java, and we have SableCC, which is also a compiler compiler. Now SableCC was written by Laurie Hendren's group, and in the documentation they say that SableCC generates nice object-oriented code, so the generated code is nice and object-oriented. And I assume that they also wrote their own code in a nice object-oriented way, and if you look at where it is located in the spectrum, it is somewhere, I don't know, somewhere down low. I don't know where exactly it was. And if you look at JavaCC, it is somewhere really high up, so it is much more densely packed algorithmically and not using indirections much. So we came up with this metric basically out of a random event that just happened, and we wondered, is this something new, or is it exactly the same as an existing metric? So we used tools to measure, I don't know, 100 or so design metrics that we found, and we couldn't find any that really correlated well. Maybe the most obvious one you would want to compare against is Cyclomatic Complexity, which, given the control flow graph of a method, also tries to figure out how much computation is in there and how much is fluff, like extra basic blocks or instructions that are not essential. So it just counts branches, or independent paths. So we tried to correlate them, but Cyclomatic Complexity is an absolute number, so we have to divide by something to normalize; we divide it by the number of instructions in the code, so branches per instruction basically, that is this axis here, and here we have our relative Essence, and you see that it doesn't really correlate much. I don't remember the correlation coefficient, but it was very uninteresting. Another one, probably the main design metric, is coupling, and also cohesion, and I think they are very nice in the sense that they are orthogonal to Essence. If you measure coupling or cohesion, then it is about moving things around in your code: if you have two components and you have some method, for example, you can move that method into the other component to reduce coupling. So you have this scenario here; you reduce coupling, or the other way around. So if you measure those metrics, they tell you to rearrange your code in this way. If you measure Essence, it doesn't tell you to move things. It says, oh, you should probably insert some extra abstraction layer, or you should remove an abstraction layer, because you are too dense, or too abstract with too much indirection.
It is also related to bloat, which is not really a metric, but I think Nick Mitchell has done very nice work in figuring out unnecessary computations or unnecessary transformations of data, so there is a very clear relationship.
>>: But that was not measured with respect to code; they only measured its representation in the heap, right?
>> Matthias Hauswirth: Right. Actually I am going to go in that direction in the next part. But you are right. I think they might also have looked at some kind of code, but I don't remember all of the work. I think he had a few papers.
>>: Well, they tried to map the heap back to the code to find why it happened, but from the manifestation back to the code rather than from the code to the manifestation, right?
>> Matthias Hauswirth: Right. So I haven't tried to somehow correlate this. I don't know how best to do it, but it might be interesting to see if there is any connection between the two. Conceptually there should be.
>>: So if you have nested loops, do you multiply them, or do you still just add them up?
>> Matthias Hauswirth: Add. You are not measuring the runtime complexity; you are just measuring how many decision points there are, like Cyclomatic Complexity, right, you just count the branches, but you are not going to really multiply things. We try to do the same, which is to say how many repetitive constructs are there that the programmer has to think of? And whether two loops are nested or sequential probably makes a difference cognitively, but we don't distinguish that. It might be…
>>: It might be the first thing to try, to… Because for me an algorithm is all the levels of building. [inaudible].
>> Matthias Hauswirth: But that is one point, right? If you say every program is an algorithm, then you would have to inline all of the code and you would get one big loop nest.
>>: No, not necessarily. I mean you have a sequential stream with a bunch of different…
>> Matthias Hauswirth: That's true. I mean [inaudible]…
>>: [inaudible] but processing loops, the processing loops put everything in one loop, but if you try--it would be interesting to build like a loop nesting call graph for doing loop transformations, right? And figuring out where to inline so you could do loop interchange, like in my thesis work, and that was something we thought about there. But just for program transformation, not for program understanding, right? That is just something that should be easy for you to do if you wanted to try to figure that out.
>> Matthias Hauswirth: You would need a different value, but I would have to think about what it means when you extract a method and suddenly an inner loop is not an inner loop anymore, but just a method that is called from within the outer loop. Is that different? Or how is that different? I would have to think it through, but it is an interesting thought, yeah.
>>: What about the loop contents, for example, does it go to [inaudible]?
>> Matthias Hauswirth: So we handle it even if you have unstructured control flow. We actually don't just do a dominator analysis or so to identify the loop headers; we have an algorithm that goes deeper to find even loops with multiple entries into the middle of the loop, and irreducible graphs, we deal with those too. We just take multiple header nodes for a given loop. It is in the paper I will point you to at the end. We looked at different ways to do it.
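As a rough illustration of the behavioral-Essence computation, here is a simplified Java sketch (my own code, not the actual analysis described in the talk). It assumes the loop headers inside each method have already been identified by a separate control-flow analysis, and it detects recursion headers with simple DFS back edges; the real tool, as just discussed, also handles irreducible loops with multiple header nodes.

```java
import java.util.*;

// Simplified sketch: methods form a call graph; each method already knows how
// many loop headers its body contains. Recursion headers are found as DFS
// back edges over the call graph.
class EssenceSketch {

    static class MethodNode {
        final String name;
        final int loopHeaders;                          // loops found in this method's body
        final List<MethodNode> callees = new ArrayList<>();
        MethodNode(String name, int loopHeaders) { this.name = name; this.loopHeaders = loopHeaders; }
    }

    static double relativeEssence(Collection<MethodNode> methods) {
        int loops = 0;
        for (MethodNode m : methods) loops += m.loopHeaders;
        int recursionHeaders = countRecursionHeaders(methods);
        return (loops + recursionHeaders) / (double) methods.size();
    }

    // A method counts as a recursion header if the DFS reaches it again while
    // it is still on the DFS stack (a back edge in the call graph).
    static int countRecursionHeaders(Collection<MethodNode> methods) {
        Set<MethodNode> headers = new HashSet<>();
        Set<MethodNode> visited = new HashSet<>();
        Deque<MethodNode> stack = new ArrayDeque<>();
        for (MethodNode root : methods) dfs(root, visited, stack, headers);
        return headers.size();
    }

    private static void dfs(MethodNode m, Set<MethodNode> visited,
                            Deque<MethodNode> stack, Set<MethodNode> headers) {
        if (stack.contains(m)) { headers.add(m); return; }   // back edge: m heads a recursion cycle
        if (!visited.add(m)) return;
        stack.push(m);
        for (MethodNode callee : m.callees) dfs(callee, visited, stack, headers);
        stack.pop();
    }
}
```

Which nodes get flagged as recursion headers can depend on the traversal order, which is exactly the kind of choice alluded to when the speaker says they looked at different ways to do it.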
I think we picked the one that seemed most correct, but it is kind of an intuitive notion of correctness. One last thing that probably correlates is design pattern density. I don't remember exactly, but I think it was Dirk Riehle who wrote a paper on measuring design pattern density in code, but for measuring it, at least for that paper, they had to annotate the code and say this class corresponds to this collaborator in this pattern, or something, and then you can count it. So it is not fully automatic, and so we couldn't correlate our metric with that one. But given that you see that every design pattern seems to be green, meaning it seems to represent indirection, I guess it should correlate, but I am not sure, because the patterns overlap. So the metric really is related to two very key principles. First, information hiding, right? Indirections are usually introduced to hide something, to abstract from something. And the second one is Brooks' accidental versus essential complexity. That is where we stole the name from, Essence: in his view you have an abstract model, which is the Essence of your computation, and then you have a solution, which is the implementation in a specific programming language and which adds accidental complexity. We try to take an implementation and extract some notion of Essence. There are some interesting dualities. We looked at behavior, loops and such, but there is also structure. So if you look at a loop, you could say that the dual to a loop is the array. If you look at a recursive method or function, the dual to that is the recursive type. So we can compute Essence on behavior by going over a call graph, augmenting it with loops, and counting those kinds of constructs, and then we can do the same when we go over a type graph or a class diagram and count these kinds of constructs. If you want to optimize code, you often do method inlining. It is one of the most effective optimizations. If you want to optimize for structure, you may want to inline objects into other objects or even allocate them on the stack. So those are really two duals also. And our Essence to some degree measures the minimum you can reach when you inline everything, somehow, kind of, in both cases. There is also a limitation of our metric, which is when you do generic things. So you could write one traversal algorithm and you would have one loop, let's say apply something to a data structure, and then you can pass in functions or visitors or something like that that do the computation. And you measure one loop, so the Essence is one, although you have many different computations. So what you would have to do there is to say whenever I instantiate that generic traversal, that adds one to the Essence, instead of saying there is one generic traversal with many instances of it. Same for generic types. So I can have a linked list of N, where N is an arbitrary type, so whenever I instantiate that container, I would say the Essence is one higher. So if you look at a UML class diagram and you want to measure Essence there, you again have to go for repetition, so instead of loops we go for arrays. And if you look at the class diagram, arrays are an implementation thing; they are not in the diagram, but often they correspond to multiplicities in the class diagram. So here a student visits multiple courses, and a course is visited by multiple students, and so you have those stars. Every star represents a multiplicity, a one-to-N relationship, and can be implemented as an array.
The second thing you can have is recursive types, so a professor and a course form a recursive structure. A professor is in a course, the course is taught by another professor, and so on. So we can have recursion there. So we do the same thing here as we do with the call graph. We build a graph, we also call it the distilled model, where you put in all of the types, and we add these extra gray nodes. They are basically the loop nodes, so array nodes, wherever you have a multiplicity, and then we again count the cycle headers of the recursions and those special loop or array nodes, and that gives us again an Essence, but this time it is structural Essence, not behavioral Essence. If you look at the essential parts of this graph, it is basically everything you cannot inline, so the supertype edges are irrelevant anyway. The supertype may be an abstract class you never instantiate, so you can get rid of it. And those things here, you could inline them all into this array element. You could just inline the objects, the Time into the Period, the Day into the Period, and then just combine the Period into the array, so you have…
>>: It is relevant for the student to get to class on time. [laughter].
>> Matthias Hauswirth: That is actually a very good point. When Dmitrijs presented this at MODELS, one of the comments was that what you are doing here is getting rid of the essential part, because it is actually relevant that I can talk about whether you get there on time. And you are removing this knowledge and you are just left with this course array, which doesn't tell me anything about what it is, right? So in terms of concepts, yes, we get rid of concepts. In terms of the structure of the solution, I think we preserve some aspects of that structure.
>>: What you're getting at is the mapping between the components, but not, it is like the bloat work is getting at that also, right? And saying the essential data is the stuff that is unique, and not the mapping between.
>> Matthias Hauswirth: It goes in that direction.
>>: Right.
>> Matthias Hauswirth: So now we have those two Essences. Actually, for the structural Essence, what is interesting is we tried some data structures. We implemented them in Java and then we looked at what they turn out to be in terms of Essence. If you do a linked list, no matter how you do it, if it is singly linked, then you end up with an absolute Essence of one, so you get one recursive node in the distilled diagram. If you do a tree, with the parent pointing to the children and no back pointers, then you get two. And if you do a graph, then you get three, if it can't be traversed backwards. So that is kind of an interesting result. We built an infrastructure called the Essentializer. I think it is available, on the Essentializer tutorial page, if it is running. You basically take some input, which could be Java, the current implementation is Java, a compiled project. There is a jar file. We analyze it. We build this distilled model, this graph, out of it, and then we have a tool that computes the metrics and another tool that visualizes the model in some way. And so we have transformers from Java and from UML class diagrams in XMI format, and one could add arbitrary transformers to it. Maybe to conclude, two comments. This is a comment from a reviewer for the ECOOP paper. I think he seemed to like it. Now I know people who don't like it, and I didn't put that on here [laughter]. They don't believe…
>>: Why not?
>> Matthias Hauswirth: I'll tell you.
They don't believe that this is useful, and I am not sure that it is useful. I like the idea, and I would like to figure out whether it is useful. Some people think it is useful, and that reviewer pointed us to a blog post by a developer I have never met, Bill Woody. And in his post he talks about how Java programmers create these really bloated things that start out as something simple. He talks about implementing factorial, and he has a very nice implementation of factorial, very, very few lines, and then he turns it into this framework where he can plug in different algorithms for implementing factorials, and you need a factory, and you have to have this really big architecture for computing factorials. And he says that is the direction Java developers who build web applications take. They add layer upon layer.
>>: [inaudible] in the last few months I [inaudible] collaborate with [inaudible] and if you look how they do Java you [inaudible] print. [inaudible] streaming part [inaudible] system.
>> Matthias Hauswirth: Overdesigned, yeah.
>>: It's not. I mean it is over, over, overdesigned. [laughter].
>> Matthias Hauswirth: So it could be that Essence might capture that. It would be interesting to see.
>>: And another interesting [inaudible]. If you look at the [inaudible] kernel, in the last 20 years it is the same stuff, the interface has not changed that much. But the code keeps getting refactored, for the past 20 years, and you can get all of the patches, and you could think this is actually unfair for the system, for example, [inaudible] the first time they had one [inaudible] system and one [inaudible]. Later, now they have probably 100 architectures, and each one has some [inaudible] the same function, and they keep refactoring and refactoring and refactoring. A large part of the system is [inaudible]. And [inaudible] increasing to have [inaudible] pick up [inaudible].
>> Matthias Hauswirth: I think one of the problems we have is that we don't have a baseline. We don't know whether something is overdesigned or not; that is something a human expert would have to evaluate. So actually that is kind of my pitch here.
>>: I think there is some way you can do that if you have the history of the program. For example, [inaudible] okay, the work is for [inaudible] okay, start from the first version, right, and someday [inaudible] does that and I have to [inaudible] and the result [inaudible] merged. The merging has some problems, and they tried to, they did some research and picked out a problem here. And we watched and [inaudible] of the same work. We took the patches and researched all of them. So I think you can run some metrics which I think [inaudible] what these patches do and how [inaudible] one function, for example, this function was just for one [inaudible] system and now this function is for 100 new systems and you [inaudible] some patches [inaudible] fix some part, which is because of speed. And you do have [inaudible] because people have similar [inaudible] actually we can do the [inaudible] analysis and [inaudible].
>> Matthias Hauswirth: We can try. We should talk about that later. I want to make sure that I don't have to skip the second part, but we should definitely talk about that. I am looking for ways to actually be able to say whether this is useful or not.
>>: [inaudible].
>> Matthias Hauswirth: Yeah, I have seen that too. So I went and searched a little bit, and I believe there is something there.
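To make the factorial example concrete, here is a reconstruction in the spirit of what the speaker describes (not the actual code from Bill Woody's post): a few-line factorial turned into a pluggable-algorithm framework with a factory. The interface and class names are invented for illustration.

```java
// Reconstruction for illustration, not the blog post's actual code.
class Factorials {

    // Dense version: one method, one loop.
    static long factorial(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) result *= i;
        return result;
    }

    // "Framework" version: the same computation behind extra layers.
    interface FactorialAlgorithm {
        long compute(int n);
    }

    static class IterativeFactorial implements FactorialAlgorithm {
        public long compute(int n) {
            long result = 1;
            for (int i = 2; i <= n; i++) result *= i;
            return result;
        }
    }

    static class RecursiveFactorial implements FactorialAlgorithm {
        public long compute(int n) {
            return n <= 1 ? 1 : n * compute(n - 1);
        }
    }

    // A factory choosing an algorithm: yet another layer of indirection.
    static class FactorialFactory {
        static FactorialAlgorithm create(String kind) {
            return "recursive".equals(kind) ? new RecursiveFactorial()
                                            : new IterativeFactorial();
        }
    }
}
```

The repetitive core, one loop (or one recursion), stays the same while the number of methods and types grows, which is exactly the drop in relative Essence the metric is meant to expose.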
>>: I liked when we did the DaCapo paper. The software engineering metrics that we were using were all useless. They weren't telling me enough, because they didn't get at what the program was actually doing and how well it was expressed. So we could have big DaCapo benchmarks with lots of things going on in them, but it could all be useless [laughter], right? But we really didn't know that. Instruction footprint was one of the things that I found especially suspect for us, because it tells you how many instructions you had to generate, but it didn't tell you how many you really needed, right?
>> Matthias Hauswirth: Well, we don't really tell you that either. There can be loops that are totally inessential. We actually found that in the developer's example I saw in that blog post: he then goes on and has this factory instantiation idea, but he has a recursion where there needs to be no recursion at all, and our thing picks up that there is a recursion. In practice it will never execute recursively, but it is--so I don't claim that this is really going to find the useful core of your algorithm, but it might. If you think about it, Dmitrijs gave the talk at MODELS, so they focus on models, on abstractions, and my claim is that probably, if you have a model and you compute this Essence, this relative Essence, on the model, it should be bigger. It should be denser in terms of the essential aspects than the implementation, which has all of these extra things that are necessary when you implement the model. That is kind of Brooks, so we are trying to quantify Brooks. So if I had a set of models and their implementations, we could just do it right now and look at this, but I don't have any models. I haven't found any models. I don't know whether there are any UML models that people can just use publicly.
>>: What about maybe taking it down a notch and looking at the programming language and then back into the compiler generating code? So you look at the code that gets generated, because the optimizing compiler actually has some constraints in which it builds up the implementation, where the model is now the code and the implementation is x86.
>> Matthias Hauswirth: I am not sure whether it actually does much in that respect. Sure, it can unroll loops, so if you have a static bound or whatever, in some cases it would unroll a loop. Change it…
>>: Well, would you consider as unnecessary abstraction anything that is so short the inliner inlines it, right?
>> Matthias Hauswirth: Yes. You could look at the inlining instead of this conceptual view, at what the compiler actually does, yeah. That would be an interesting--but in the end…
>>: There is some notion that the compiler is getting rid of some of the abstractions for you, right? Maybe not perfectly…
>> Matthias Hauswirth: Yeah, maybe what I am showing here is the perfect case, right? The nodes that I color are the minimum nodes that you have to have, and everything else you could in theory inline, but whether that is a good idea, I don't know. But it is something. The second point is, and that is kind of the idea of the metric: if you measure a metric, a design metric, you have to have a goal, and the main goal is usually the humans that use the code, not the compiler. But maybe this could be useful in both ways. So if I measure Essence…
>>: Just assume that it is the compiler. Like that is the other consumer of code.
People say, oh, it doesn't matter because it is an automated tool, but it is automated for human patterns. So if you have some notion of what the patterns are, you should be able to build a better compiler, because it is just an engineered object based on program behavior. So people ignore that the compiler is a consumer of whatever people are producing. And it is engineered fundamentally that way.
>> Matthias Hauswirth: That is a good point. Yeah.
>>: [inaudible] in the [inaudible] program we were writing code and you have to understand the compiler [inaudible]. And you try to make your program have more patterns [inaudible] even [inaudible]. [laughter].
>> Matthias Hauswirth: If you are good, you write the right ones.
>>: [inaudible] like that so they have to, [inaudible] if they have a lot of loops in the code, then we should, and then because they try to use [inaudible] because they do original [inaudible] and they first [inaudible] and that is not performing as they expected. And so they change the code so that the code [inaudible] compiler work, and they changed it, they make the code not that easy really to [inaudible].
>> Matthias Hauswirth: So they make it much more performant, right? [laughter]. So this is the second part. It is not exactly connected to the Essence. It is about work by Milan Jovic, who is another PhD student of mine. It focuses on performance, the performance of interactive applications. So I personally, I don't know, somehow attract performance problems. I very often run a program on my phone or my computer and it is slow and I don't know why, and I want to get rid of that problem. So let me show you a performance problem which is not in a Microsoft product [laughter]. Here we have Eclipse, which you all probably have heard of. This is version 3.5, so it is not the most recent version of Eclipse. It has this beautiful rename refactoring, so I can go on something simple like a variable name here and I can say, rename this. And if I rename this, it will automatically, while I am typing, change all occurrences of this variable in the currently open editor. And I can now start typing and I can say, okay, I want to change the name and call it Matthias, and that obviously is not fast enough, right? So somehow there is this performance problem. It has been around for years, I don't know, maybe two years, maybe three years. I know that it just suddenly appeared and I was thinking, oh, I need to update, and the new version came out and I updated and it is still there. And eventually I thought I just have to live with it forever. So now I will show you how we can detect the reason for this bug. If you use a normal profiler, then you will get something like this. You get a hotness profile, which tells you that in this function or method you spend so much time in total. That is not going to help you at all to find this bug; it is going to point you to something that is absolutely unrelated. I could show you a picture. It is going to show that this is a huge problem, and the actual problem is a pixel in the picture where you cannot even see what it is. So this is an abstraction, and it is the wrong abstraction for this kind of performance problem. What you really need to look at is the trace, the behavior over time. And when you look at this trace of the execution, you see a lot of white space sitting in between those colorful boxes, and that is the time when the user is thinking. So we found in some experiments…
>>: You need to optimize that [laughter].
>> Matthias Hauswirth: Yes, ideally, but that is more AI I think than… So actually, for AI research, 98% of the time that could be improved is up to AI, to make humans faster, because only about 2% of the time is spent computing. We did thousands of hours of tracing of interactive applications, just Java applications. Most of it is idle. We don't care about that. What you care about is the time when the computer handles a user event. We really care about the latency of handling the event, not the overall time. Now if you look at this picture here, you see that the longest, we call that an episode, so the longest handling of an event is this one that I circled, and the dominant code in there was the green code. If you look at the hotness profile, the hottest code is the blue code. But the blue is spread across many, many handlings of different events, so nobody would ever notice it, but the developer would optimize the blue one and would not think about the green one. And this picture is nice; it is much worse in practice. So when you characterize performance, you want to characterize latency. Here you see latency on the x-axis, 0 milliseconds, 100 milliseconds, 10 seconds or whatever to the right. And on the y-axis you see how many times, how many events or episodes you had with that latency, and it is cumulative. So the point I show there means 179 episodes or events took longer than 100 milliseconds. And this hundred-millisecond line there is not there by chance. This is a very important line. It is not entirely clear that it is exactly 100, but this is the perceptibility threshold. Ben Shneiderman did some experiments and he found that people do not notice latencies below 100 milliseconds. So if you click a button and it takes 90 milliseconds to respond, people will perceive that as instantaneous. So what you want to have for performance is an L-shaped curve which goes all the way to the bottom and then stays at the bottom beyond 100 milliseconds, so nothing takes more than 100 milliseconds. But everything can take something below that; it doesn't matter where the upward part of the curve is in this range. So you don't have to optimize forever. You can give up and say it doesn't make a difference to the user; maybe for energy consumption there is a difference, but for the user it doesn't.
>>: [inaudible] this company gain?
>> Matthias Hauswirth: There is some variability to this. There is a gain. They did a study where they put a virtual reality helmet on people and they put them in a virtual maze, and in the maze there was a huge hole. And the goal was the user needs to walk around the hole and not fall into the hole. And then they varied the latency from moving to the update of the display between 75 milliseconds and 125, and they found that people died at a much higher rate when they had the higher latencies. So somewhere in this range there is a critical threshold where you can't really walk anymore because the delay is too long. So it is not just an effect on perception, that you notice it, but it actually decreases performance, at least for this game-like task. And they did it for other things. They measured the differences by input device. There are many studies on this. And the threshold changes, but it is in this order of magnitude. So we don't need to make those episodes shorter than something like 100, or maybe, if you want to be optimal, 40 or 20 milliseconds.
>>: Seventy-five if you don't want to fall in the hole [laughter].
>> Matthias Hauswirth: Exactly. So here we have latency curves for Eclipse.
So we have five versions of Eclipse. Black is the oldest. The lightest gray is the most recent. And you see that every subsequent version got slower; I mean, you have more and more episodes that take that long. And the reason is because…
>>: [inaudible] more [inaudible]. [laughter]. [inaudible].
>> Matthias Hauswirth: I think that may be it. So the reason is that the code grew. The code grows over those versions. It went from 20,000 to thirty-something thousand classes. Now, just adding classes doesn't make it slower.
>>: That is why you have to do nesting.
>>: That's why you have to…?
>>: That's why you have to do nesting, because if you added separate functionality, then it doesn't matter. But if it is all messy inside each other, then you're stuck with it.
>> Matthias Hauswirth: Exactly. It has to be somehow, you are right, in this nested shape, and it actually is in this nested shape, because if you write an Eclipse plug-in, you are not writing an extra application that sits next to Eclipse; you are plugging it into the framework and you are going to listen to changes in the global state. So whenever you press a key, your plug-in is also going to respond, and it is doing that synchronously. This is not an asynchronous kind of message passing; it is going to call your plug-in and tell you, oh, this key was just pressed, and your plug-in is going to compute some very complex thing. And so every plug-in is going to add a little bit of latency, and your system gets slower and slower. Now we want to detect the causes of those latencies, and we want to detect the ones that people really care about. So our idea is to put a little agent with each application that is deployed in the field. You may know that in this research group here, the idea of putting an agent or some kind of analysis in the field so you can do collaborative bug detection is not [inaudible], but we are focusing on performance here, on latency in particular. And then we gather traces or profiles from all of those agents in a global repository, and then we ship that back as bug reports to the developers. And hopefully the end users will be happy once the developers fix the latency bugs. Now you could say, why do you want to deploy that in the field? Why don't you just test for performance? We do unit tests for everything anyway, so you just measure and you know that you are not too slow. Well, the problem is that coverage is not the same for performance tests as for functional tests. For performance tests, you have to take the entire system into consideration. It depends on the platform, meaning the virtual machine, how much memory you have, what hardware, what other background applications are running, like the user has an MP3 player running in the background while they work on Eclipse or whatever. So all of those things impact performance, and you can't tell what users do without actually observing the users. That's why we ship the profiler to the users, and we get the performance they perceive in their context, and we do it for everybody. So the profiler needs to be extremely lightweight, and it needs to give you just the right information to find the bug.
>>: You could also say that that tells you how much you need to drive down latency. 100 is with nothing else, just the tester running, and then if everybody is listening to their MP3 player while they [inaudible], then the latency has to accommodate that. But if only one or two people are, then maybe you don't really care.
>> Matthias Hauswirth: Yes, but you need to know that. So you have to measure something in the field, so you might as well have a profiler in the field, and then just look at the latency that users actually get, which includes the scheduling and the contention and whatever else is happening.
>>: [inaudible] there is a game developer who has a game for his Android phone and what he [inaudible] his [inaudible] experiments on the Android platform. Actually [inaudible] he tried his best to [inaudible] Matthias [laughter].
>>: Matthias wasn't online [laughter].
>> Matthias Hauswirth: Actually yeah, we were doing something like that too. Now the question is, what do you measure? You want to measure latency. And you have some behavior in this blue box over time, and you measure the start and the end timestamp and you compute the latency. Now the question is, of what? What is this blue box? And so when you pick what you actually want to measure, what kind of behavior, we have this two-dimensional space. You want to have minimal overhead. You want it deployed in the field, so users must not notice anything, and it must not perturb your measurement results. On the x-axis you have the amount of information you get. You want to figure out what the problem is. You don't want to just get the information that it is slow; you want to know why it was slow. So ideally you would be here: you know everything and it costs nothing. That is unrealistic. So you could go there: you could instrument every single method call and say this method took so long, and you have this whole calling context with latency information everywhere. That is going to be too expensive. On the other hand you could say, okay, I reduce my overhead; I will only measure, in the dispatching of events, how long each event takes, and I don't know exactly what happened. So you know nothing. You just know that it was slow, but you don't know why. So we went somewhere in the middle, and we tried to get towards this corner. So we instrument methods, but we instrument only a subset of methods, which we term landmarks. Those are methods that are infrequently called, so we have little overhead. They are covered in most of the executions, so we don't miss any latency bugs. They are usually boundary calls, like calls between components in your system, and a very important point is we use those methods to classify the bugs, so we have one bug in our bug report, one report, for each landmark method. That is how we group or cluster things, by method. So if you take an execution of a program, you have time, and then you have the call stack that grows and shrinks over time; some of the methods will be landmarks, only a small subset, very few. And usually they are listener notifications or observer notifications. If you have a command pattern in your application, say compile, then this compile method or the listener will be a landmark that you time. Paint, which is output, where the IDE or your application draws something complex, that can take time; and native calls, of course, because that is where we do I/O and all the possibly long-latency OS calls. So those are landmarks. You can define what that means for your scenario. To make sure that we don't miss anything, we also include the dispatches, which are at the bottom of the call stack in the event loop, as landmarks.
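A minimal sketch of the measurement idea in Java (my own code, not Milan's agent, and the names are invented): a timed wrapper around a landmark call that records its latency and lets you ask how many episodes crossed the roughly 100-millisecond perceptibility threshold. The deployed agent injects this kind of timing through bytecode instrumentation and additionally takes stack samples, as described next.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: record the latency of a few "landmark" methods (event dispatch,
// listener/command notifications, paint, native calls) and keep a per-landmark
// latency distribution that a deployed agent could later report.
final class LandmarkRecorder {

    // One list of observed latencies (in milliseconds) per landmark method name.
    private static final Map<String, List<Long>> latencies = new ConcurrentHashMap<>();

    // Wrap a landmark call; in the real system this timing would be injected
    // by instrumentation rather than written by hand.
    static void timed(String landmark, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            long millis = (System.nanoTime() - start) / 1_000_000;
            if (millis > 0) {   // near-zero latencies are not interesting
                latencies.computeIfAbsent(landmark,
                        k -> Collections.synchronizedList(new ArrayList<>())).add(millis);
            }
        }
    }

    // How many episodes of this landmark exceeded the ~100 ms perceptibility threshold.
    static long perceptibleEpisodes(String landmark) {
        return latencies.getOrDefault(landmark, List.of()).stream()
                        .filter(ms -> ms >= 100).count();
    }
}

// Usage sketch, e.g. inside an event dispatcher:
//   LandmarkRecorder.timed("EventDispatch", () -> dispatchEvent(event));
```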
So we know how long everything that happened took, but if this is the only landmark, we have one entry in our bug report that says dispatch, and then you have the distribution of latencies, and it doesn't tell us what actually was dispatched. But we know that there was a problem and we didn't capture it with those other guys, like here for example. We don't know exactly what happened, so we have to introduce more landmarks. It is kind of a safety net. If we just did the landmarks, we wouldn't find the reason for the problem. We would know a bit; we would know, oh, the compile command was slow, but we don't know why the compile command was slow. So we use an old approach, which is stack sampling, like a stack sampling profiler, like hprof, but randomly spaced to prevent bias. And what we then do is, if you have such a long-latency episode like this one, you take the stack samples that occurred during that episode and we build a calling context tree. So we merge the samples; we form a tree, and we have a weight in each node of the tree, which is the number of stack samples, or the hotness of this calling context. So we see that this context occurred in one sample, so it is kind of cold, but this trunk here is relatively hot because all the samples were in it. So we did a study to evaluate that. We ran Eclipse, or our students ran Eclipse, for three months during a semester. There were 24 students in that course and they were using our little agent. They did over 1000 sessions. A session is: start Eclipse and use Eclipse for a while. And the total was almost 2000 hours of usage; that is not CPU hours, that is total hours where Eclipse was running. Out of that we got this huge repository. We have about 800 issues in there, but the issues, like I said, are landmark methods. Every issue is a circle. So we have roughly 800 circles here. And they are positioned according to their severity. On the x-axis you have how many times this landmark actually executed with any kind of non-zero latency. So if it is almost 0 we just throw it out, but if it is a little bit higher then we show it. And on the y-axis you see the latency that occurred in these landmarks. So the higher it is, the slower the response was, and the more to the right, the more often the problem occurred. The size of the circle corresponds to how many users actually observed this problem, so if somebody, and our students did, installs extra plug-ins that somehow slow down the thing, then only one user would experience that and that would be a small dot, so those are up here, and if it is some plug-in for, I don't know, playing movies that he installed, then we don't care about that. Maybe. So I picked one of those issues, the red one. This is the issue that actually corresponds to the demo I did. And if you select that issue, it will show us related issues. So if you have a method, let's say a mouse click listener or mouse button press listener, you often have a kind of corresponding method, mouse button release. So you have two different landmarks, but in reality both of them do very similar computation. And so we group all of the landmarks with the same kind of computation into a family of landmarks, and here we highlight them. And we group them by similarity of their calling context trees. So we do a tree similarity computation and we group those that are closely related, like this. So if you go and try to fix this one, you know that you will probably also fix this one and this one and this one. So let's go to our specific issue.
If you look at the calling context tree produced by this specific landmark, it starts at the top. That is the name of the landmark, verifyText, and then you see the tree; this is one method here on the stack, so the stack goes like this. And if you follow this, there is lots of information you can see. Here you spend a lot of time. And if you look at what that is there, it is a method called pack on a pop-up, and you spend 84% or 82% of the stack samples of this long-latency behavior in that method.
>>: And so you find that out because you sampled all of these calling contexts all the way to there, and that is in 80% of the samples?
>> Matthias Hauswirth: Eighty percent of the samples taken during this long-latency operation. If you look at the calling context tree of the whole run, this would not show up.
>>: Right. It is because you are in this focused region and the samples of the calling contexts merged together into these, well, not always merged, there are some completely separate ones, right? But they, I guess because you are starting there, they all have to be under that call, right? For the listener?
>> Matthias Hauswirth: Actually the listener for this verifyText could be called from different contexts, and often it is. So we can merge them together, starting at that method, and we merge those trees so we see it.
>>: All right.
>> Matthias Hauswirth: We can also show which contexts this thing occurs in and all kinds of extra information. So now if I go back to Eclipse, you can see what the problem is. Let me refactor this again: refactor, rename, and then try to delete. It is again slow. And what you see here is this kind of fancy tooltip that has this very, very useful information; I mean, it just says enter new name, press return to refactor, always, no matter what. So I guess the developer of this refactoring, this in-place refactoring, thought he would use this fancy kind of tooltip API, and the API actually has this neat feature: you can have this little callout, this little thing here, this triangle. Which is a very advanced feature actually that was not possible before, and then they added to this GUI toolkit the ability to have non-rectangular windows. Normally windows are rectangular, but these have arbitrary shapes, and now pack on the pop-up is the method that tries to figure out the area that is hidden by this tooltip. And because the tooltip is not rectangular, it takes a long time to compute this geometry, for some reason. So, you have a nice feature here too: you can actually drag this guy and move it somewhere else, for example here in the corner, and then this little triangle disappears, the other triangle is gone, and if I continue typing, the performance problem is gone. So it really is exactly that, whoops, that problem. So with this we found that it has nothing to do with what we originally thought. Originally we thought, oh, there are many occurrences of this name, and when you type a key it needs to update the text in many different places, so it slows it down by a factor of x. It has nothing to do with that. It is just the GUI, this fancy feature in there. They fixed it in a later version of Eclipse.
>>: How did they fix it?
>> Matthias Hauswirth: I think they did some caching. I don't remember, but I think they cached the outline of that thing, because it is constant anyway, right? It is always the same shape. It doesn't change.
>>: You could move it out of the way too.
>> Matthias Hauswirth: Yes, but still, they want to keep this user feature that you can move it. If you like it at the bottom, you can move it to the bottom.
>>: Yes, but by default you could move it into blank space at the beginning, right?
>> Matthias Hauswirth: Yeah, you could, but then if a user finds it nicer up there, he would still have the problem. So they did the right thing. They cached it. They should have cached it from the beginning.
>>: [inaudible] problem and they start to feel that they…
>> Matthias Hauswirth: Actually, they fixed it and we found the problem, but independently. So when we had understood the problem, we checked, and the newest version had fixed that thing, not because of us.
>>: So the history of this research, are you saying that you got the problem first and then tried to build some tools to find the problem, or…
>> Matthias Hauswirth: We didn't start with this specific problem; we started long before I was aware of this problem. But this was a nice example where I said, look, this is really annoying me, I use it every day and it is slow, so, Milan, find the reason for this. And so he used the tools that he had already developed.
>>: [inaudible] you find a reason, you find a reason for the tools or you find [inaudible].
>> Matthias Hauswirth: I think I found the problem first, but not this problem; just generally I got annoyed by those kinds of performance problems, and I found that traditional profilers do not help me. And then this came along as kind of a nice bonus, but it was not the inspiration for the whole research. So Milan used this quite a bit. This is one bug in Eclipse; he found quite a few of them.
>>: When did they fix the bug in Eclipse?
>> Matthias Hauswirth: I don't know exactly, but I think it took more than two versions, so 3.5 still has it. Two, maybe three years. I mean, it is amazing, you know, because everybody uses rename refactoring. It is the one refactoring that everybody uses, so I wonder what…
>>: So you could say, because tools like this didn't exist, if you were trying to sell your work a little better, that it took them two years to fix the bug, whereas if they had used your tools they could have fixed the bug a lot faster.
>> Matthias Hauswirth: You are totally right. And I am going to undersell it a bit: you could also say that users probably have complained and maybe they just didn't listen. Maybe. Actually, I didn't complain. I used it all the time and I just screamed at the computer, and I didn't tell the Eclipse people that…
>>: [inaudible] assumed that those people would hear about that. Eclipse has these problems, like it is slow, [inaudible] and people assume that is normal, people would think…
>> Matthias Hauswirth: People get used to it, yeah.
>>: [inaudible] kind of get used to that too, that it is slow. And a lot of people [inaudible] because it is written in Java. Then they think it is a common thing.
>> Matthias Hauswirth: Yeah. So we used it on different applications. Here are three examples. Informa Clicker is a tool that we wrote and use in our teaching; it is a distributed system, and the students and the professor have a graphical user interface that is relatively involved. We ran Milan's tool on a few sessions and we fixed some bugs, and the next time we used the tool the students actually said, whoa, what happened? It is so fast. So it was clearly finding a useful bug. It was a simple one, but it was a useful one.
And Milan integrated it into an alpha version of Code Bubbles by Steven Reiss, or I think Steven integrated it. And Milan found some interesting performance problems in the repository that I think are probably fixed now. So this works in general for any kind of graphical Java application, no matter what framework it uses. So the conclusion is: for performance, look at the performance problems that actually matter, and measure on the users' desktops or phones or whatever, so you know what you need to optimize. Okay. So that is both halves of my spectrum.
>> Todd Mytkowicz: Thank you. [applause].
>> Matthias Hauswirth: Okay. Any more questions, or are you questioned out? So if somebody has comments on anything, like this tool or the Essence, or if you totally disagree, let me know too. I want to know. And if somebody has models that I could use, I am very interested in running tests on some models. Okay.