>> Todd Mytkowicz: Okay. Thanks everybody for coming. This is Matthias Hauswirth
from the University of Lugano and the Faculty of Informatics. I was lucky enough to
visit Matthias for three months in beautiful Lugano when I was doing my PhD, so it is
great to have him here talking about his recent work that has been in both MODELS and
OOPSLA.
>> Matthias Hauswirth: Thanks Todd, yeah. Talking about beautiful Lugano, I have to
show my mandatory slide, sponsored by the tourist office [laughter] no, it's really that
beautiful. So if you ever get to Milan we are basically a suburb of Milan but in
Switzerland, not in Italy. Or if you get to Zürich, it is a beautiful train ride across the
Alps, so visit us.
I'm going to talk about two things. My research is kind of spanning various things and it
spans also the bridge between design and performance. My first part will be about the
research that my PhD student Dmitrijs is doing and it is kind of an offspring of
something else. He is also working on performance, but this is something that fell out.
And I would like to bounce this idea off of you and get some of your thoughts on it.
So here is a little algorithm example, Cocktail Sort is kind of like Bubble Sort. If you
want to sort a list like this, you start from one side and you just swap elements so you
bump the bigger one to the right hand side, go through it and then you say okay, the
biggest one is on the right and then instead of Bubble Sort you go back towards the left
and you shift the smallest one to the left, and then you do the other side again, and you shake
your cocktail here and it is sorted. So that is a little algorithm. If you implement that you
get performance of O(n²) usually or a best case of O(n), and the code looks roughly
like this. So this is something, a language kind of like Java, and if you look at this code
you may say it is a little bit dense maybe, maybe not, depends on your preference. And
you can maybe re-factor it. So you could say here are two things that maybe I should
treat as abstractions and I will maybe want to extract a method for each of those and
you end up with code like this. So you say I am walking forwards and I am walking
backwards, and around those two method calls I have the outer loop which basically
shifts the bounds towards the center of the array that you are sorting.
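As a concrete sketch, not the code from the slides, the refactored version might look roughly like this in Java; the method names walkForward and walkBackward are invented here, and inlining those two methods back into sort gives the dense version:

    class CocktailSort {
        static void sort(int[] a) {
            int lo = 0, hi = a.length - 1;
            boolean swapped = true;
            while (swapped) {                        // outer loop: shrink the bounds toward the center
                swapped = walkForward(a, lo, hi);    // bubble the biggest remaining element to the right
                hi--;
                swapped |= walkBackward(a, lo, hi);  // bubble the smallest remaining element to the left
                lo++;
            }
        }

        // Extracted from the dense version: the forward pass.
        static boolean walkForward(int[] a, int lo, int hi) {
            boolean swapped = false;
            for (int i = lo; i < hi; i++) {
                if (a[i] > a[i + 1]) { swap(a, i, i + 1); swapped = true; }
            }
            return swapped;
        }

        // Extracted from the dense version: the backward pass.
        static boolean walkBackward(int[] a, int lo, int hi) {
            boolean swapped = false;
            for (int i = hi; i > lo; i--) {
                if (a[i - 1] > a[i]) { swap(a, i - 1, i); swapped = true; }
            }
            return swapped;
        }

        static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }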
So this will be re-factoring. Now we have code that may be understandable, more
understandable depending on your background. You could even re-factor more. I
probably wouldn't recommend this, but some people may go to the extreme and say oh, I
want to extract those green things as methods too. So let's say less than is a method, and
increment, and decrement and greater than and you end up with code like this, which
probably is not ideal but you could do it.
So now we have three different versions of our algorithm, three implementations and we
see that they differ. But the algorithm is exactly the same. So the implementation is a
different size. Now you can measure the size, for example, by counting the number of
methods at a high granularity, and then you have a spectrum from dense code
towards a kind of code with lots of indirections, and the question is what kind of metric
could we use here to measure this density of your code. The problem is we can measure
methods, but that is a size metric, so the bigger your code the more methods, so that doesn't
measure density. You have to divide by the size of the actual algorithmic core of your
program. So how do we measure the size of the algorithm given an implementation?
And we have this thought of saying well the essential part of an algorithm really lies in its
repetitions. If you look at the algorithmic complexity, the computational complexity,
what drives that complexity are loops, or repetitions, so the more repetitions you have,
the bigger your O(n). Now there are other things that affect the complexity of algorithms,
obviously like conditionals. But conditionals can't increase the algorithmic complexity;
they can only reduce it. You need the loops otherwise you have O of 1. So loops are
kind of…
>>: [inaudible] isn't that good?
>> Matthias Hauswirth: It's great, yeah [laughter]. But it is not an interesting algorithm
anymore, is it? [laughter]. So if we accept this idea, I don't say that it is the only idea,
but if you accept this idea that the kind of essential aspects of your code are repetitions,
then we can go back to this and we can measure the algorithmic contents by just counting
loops. So we have three loops, three loops, three loops and now we can compute or
divide those two metrics and get something like the Essence, or the relative Essence, so it
is normalized by size by dividing the number of loops or the algorithmic, the essential
part of your code, by the actual implementation cost, which would be the number of
methods in your code.
So this is a metric, a new metric we introduce that we call relative Essence. The way we
compute it is we build a big call graph of your application, and here is the call graph
for this example. Call graphs of course can have cycles if you have recursion;
recursions are another form of repetition, so we count recursion headers the same as we
count loop headers in a control flow graph. We actually augment this call graph with
loops. So for every loop we put an extra node in there, so we can say, for example, that
this loop here is calling the decrement method, and we would also color all the recursion
headers in this graph a different color. And we count the colored nodes, which is the
Essence, and we count all of the method nodes, which is the size. So here the definition
is very, very simple. Absolute Essence is basically the amount of repetitive constructs, or
the number of repetitive constructs, loops and recursion headers, and relative
Essence just divides by the total number of methods.
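In other words, with the definition just given: absolute Essence = number of loop nodes plus recursion headers, and relative Essence = absolute Essence divided by the number of methods. A minimal sketch of that computation over such a loop call graph; the LcgNode representation here is invented for illustration and is not the actual tool's data model:

    import java.util.List;

    class EssenceMetric {
        enum Kind { METHOD, LOOP }

        static class LcgNode {
            final Kind kind;
            final boolean recursionHeader;       // method node that heads a cycle in the call graph
            LcgNode(Kind kind, boolean recursionHeader) {
                this.kind = kind;
                this.recursionHeader = recursionHeader;
            }
        }

        // Absolute Essence: the number of repetitive constructs (loop nodes and recursion headers).
        static int absoluteEssence(List<LcgNode> loopCallGraph) {
            int essence = 0;
            for (LcgNode n : loopCallGraph) {
                if (n.kind == Kind.LOOP || n.recursionHeader) essence++;
            }
            return essence;
        }

        // Relative Essence: absolute Essence normalized by the total number of methods.
        static double relativeEssence(List<LcgNode> loopCallGraph) {
            long methods = loopCallGraph.stream().filter(n -> n.kind == Kind.METHOD).count();
            return methods == 0 ? 0.0 : absoluteEssence(loopCallGraph) / (double) methods;
        }
    }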
If you look how that actually correlates with constructs people use when they develop
code, it is quite interesting. If you look at the code smells that Fowler's book, the refactoring book, lists, here are all of the code smells from the book, and I colored them by
whether those smells represent high relative Essence, so kind of loopy highly recursive
code without much isolation; that would be red. Or low relative Essence where you have
a lot of indirections; that would be green. And you see that most of those code smells
actually correspond to really dense code, so he basically says if you have dense code
without many isolations or indirections in between, it is a bad smell or bad smells are like
that, like long method, a really big method that's like my left-hand example. But there are
other examples too, like middleman, where you have like a proxy or a media [inaudible]
sitting in between which doesn't do anything, it just forwards or mediates. The same with re-factorings. Re-factorings are used to get rid of a code smell, so you are transforming
something so that you get better designs. Now what you see here is green is dominating, so
when you perform the re-factoring, you introduce indirections, so you reduce your
relative Essence. You see that some of the re-factorings are paired like extract method
would add extra levels of indirection. Inline method does just the opposite. There are
quite a few of those pairs, but most of them are green, so that is the second set of re-factorings here from the book.
If you read the book, Fowler actually even says that, so he mentions the most common
variant of this re-factoring game being to add indirection, and the less common game, which
also exists but is much rarer, is to find parasitic indirection and get rid of
some extra bloat somehow in your code. If you look at design patterns, you have the
same picture. If you think about the design pattern, what that does to your code, usually,
the green ones, introduce extra indirections. The black ones don't really do anything.
But none of the design patterns is indicative of in-lining stuff. To measure this
Essence metric in real-world systems, we used this corpus, the Qualitas Corpus, which has 100
applications. It is used as a benchmark for static analysis, and we added the common Java
benchmarks and the Java runtime libraries from three different platforms, so that's a total
of 133 programs, 11,000 packages, 200,000 classes, 2 million methods and in there we
have roughly 10% of those methods have loops, roughly 1.5% had recursion headers.
So here is kind of an overview. On the x-axis you see the size of those programs in terms
of the loop call graphs, so the number of methods plus loops in those programs. Each
point represents one of the programs, so you have 133 points. And the y-axis is our
metric, relative Essence. So the higher you go, the more algorithmically dense it is and
the lower you go the more indirection is in your code. The yellow bar here represents the
interquartile range, and the line is the median, which is roughly 0.2 loops and recursion headers
per method. You see some extreme outliers. The red ellipse actually represents the
SPEC benchmark suites, no wonder. They are very small; they are not representative.
And you also see…
>>: [inaudible] and C++ [inaudible]?
>> Matthias Hauswirth: JVM. JVM and…
>>: Oh these are JVM?
>> Matthias Hauswirth: SPEC JVM 2008. This is all Java here.
>>: But which benchmarks are in 2008?
>> Matthias Hauswirth: I don't remember all of them. I think different from the ‘98
ones.
>>: Right.
>> Matthias Hauswirth: But I think there is still some kind of compiler in there and some
more numerical ones. I think some encoding, decoding, some security stuff.
>>: Okay.
>>: [inaudible] spec [inaudible] benchmark [inaudible] Java [inaudible] benchmark.
>> Matthias Hauswirth: Java [inaudible] is still there, yeah.
>>: Okay.
>>: [inaudible] [laughter].
>> Matthias Hauswirth: So there is some kind of computational kernels in there and
those are the guys on the top. They are really outliers. They are just loops. They are
kind of micro-benchmarks.
>>: [inaudible] loop [inaudible].
>> Matthias Hauswirth: Yeah, they are probably very similar. And DaCapo, I didn't circle
them, but they are really more realistic, and then the Qualitas Corpus really contains a
lot of open-source applications, so they should be at least representative for a segment of
the applications. Now if you exclude the SPEC and you look at the others, you can see
some outliers even there. Weka, for example, has dense code and if you know what
Weka is, it is a machine learning library, so it is a big set of methods or classes, you see
like 20,000 or so methods or classes that implement machine learning algorithms. So it is
not really a big complicated application where you have a lot of architecture in there, but
it is just a large number of compact algorithms.
On the other side of the spectrum you have jmoney which is like Microsoft Money more
or less, but it is just glue. It is glue code around existing libraries that do the
computation. So if you have glue, then you have a very, very low relative Essence. Yes?
>>: [inaudible]?
>> Matthias Hauswirth: I removed that graph. I can show it at the end if I can find it. I
have it and there are some other very interesting findings about it. Actually maybe one I
can say is we have JavaCC or whatever it is called, the Java compiler compiler, right?
And we have ANTLR which is also a compiler compiler in Java, and we have SableCC
which is also a compiler compiler. Now SableCC was written by Laurie Hendren’s
group and in the documentation they say that SableCC generates nice object-oriented
code, so the generated code is nice and object-oriented. And I assume that they also wrote
their own code in a nice object-oriented way, and if you look at where it is located in the spectrum,
it is somewhere, I don't know, somewhere down. I don't know where exactly it was. And
if you look at JavaCC it is somewhere really high up, so it is much more kind of densely
packed algorithmically and not using indirections much.
So we came up with this metric basically out of a random event that just happened, and
we thought is it just something new, or is it just something that is exactly the same as an
existing metric, so we used tools to measure I don't know, 100 or so design metrics that
we found and we couldn't find any that really correlated well. So maybe some of the
most obvious ones you would want to compare against would be Cyclomatic Complexity, which also tries,
given a control flow graph of a method, to figure out how much computation is in
there and how much is fluff, like extra basic blocks or instructions that are not essential.
So it just counts branches, or independent paths. So we tried to correlate, but Cyclomatic
Complexity is an absolute number; we have to divide by something to normalize, so we
divide it by the number of instructions in the code, so branches per instruction basically,
that is this axis here, and here we have our relative Essence and you see that it doesn't
really correlate much. I don't remember the correlation coefficient but it was very
uninteresting.
Another one that is probably the main design metric is coupling and also cohesion and I
think they are very nice in terms of they are orthogonal to Essence. So if you do coupling
or cohesion then it is about moving things around in your code, so if you have two
components and you have some method for example, you can move that method into the
other component to reduce coupling. So you have this scenario here; you reduce
coupling, or the other way around. So if you measure those metrics, it tells you to
rearrange your code in this way. If you measure Essence, it doesn't tell you move this. It
says oh, you should probably insert some extra abstraction layer or you should remove an
extra abstraction layer because you are too dense or too abstract or too much indirection.
It is also related to Bloat which is not really a metric, but I think Nick Mitchell has done
very nice work in figuring out unnecessary computations or unnecessary transformations
of data so it has a very clear relationship.
>>: But then it was measured with respect to code? They have only measured its
representation in the heap, right?
>> Matthias Hauswirth: Right. Actually I am going to go in that direction in the next
part. But you are right. I think they might also have some kind of code. But I don't
remember all of the work. I think he had a few papers.
>>: Well, they tried to map the heap back to the code to find why it happened, but from
manifestation back to code rather than from code to manifestation, right?
>> Matthias Hauswirth: Right. So I haven't tried to somehow correlate this. I don't
know how to best do it, but it might be interesting to see if there is any connection
between the two. Conceptually there should be.
>>: So if you have nested loops like do you multiply them or do you still just add them,
one up, multiply?
>> Matthias Hauswirth: Add. You are not measuring the runtime complexity, but you
are just measuring how many kinds of decision points, like the Cyclomatic Complexity,
right, you just count the branches, but you are not going to really multiply things. We try
to do the same, which is to say how many repetitive constructs are there that the programmer
has to think of? And whether the two loops are nested or sequential, it probably makes a
difference cognitively, but we don't distinguish that. It might be…
>>: It might be the first thing to try to… Because that’s, for me an algorithm is all the
levels of building. [inaudible].
>> Matthias Hauswirth: But that is one point, right? If you say an algorithm, then
actually every program is an algorithm and you would have to in-line all of the code and
you would get one big loop nest.
>>: No, not necessarily. I mean you have a sequential stream with a bunch of
different…
>> Matthias Hauswirth: That's true. I mean [inaudible]…
>>: [inaudible] but processing loops, the processing loops put everything in one loop,
but if you try--it would be interesting to rebuild like a loop nesting call graph for doing
loop transformations, right? And figuring out where to inline so you could do loop
interchange like in my thesis work, and so that was something we thought about there. But
just for program transformation, not for program understanding, right? That is just
something that should be easy for you to do if you could try to figure out that.
>> Matthias Hauswirth: You would need a different value, but I would have to think
about what it means when you extract a method and suddenly you have an inner loop not
being an inner loop anymore, but just being a method that is called from within the outer
loop. Is that different? Or how is that different? I would have to think through it, but it
is an interesting thought, yeah.
>>: What about the loop contends, for example, does it go to [inaudible]?
>> Matthias Hauswirth: So we do it well even if you have unstructured control flow. So
we actually don't just do a dominator analysis or so to identify the loop headers, but we
have an algorithm that is able to find even loops with multiple entries into the middle
of the loop, and non-reducible graphs; we deal with that too. We just take multiple
header nodes for a given loop. It is in the paper; I will point you to it at the end. We
looked at different ways to do it. I think we picked the one that seemed most correct, but
it is kind of an intuitive way of correctness. One last thing that is correlated probably is
design pattern density. I don't remember but I think it was Dirk Riehle wrote a paper on
measuring design pattern density in code, but for measuring at least for that paper they
had to annotate code, and say this class corresponds to this collaborator in this pattern or
something and then you could count it. So it is not fully automatic and so we couldn't
correlate our method with this one. But given that you see that every design pattern
seems to be green, meaning it seems to represent indirection, I guess it should correlate,
but I am not sure, because the patterns overlap.
So the metrics really are related to two very key principles, first, information hiding,
right? Indirections are usually introduced to hide something, to abstract from
something, and the second one is Brooks' accidental versus essential complexity. That is
where we stole the name from, Essence, so in his view you have an abstract model which
is the Essence of your computation and then you have a solution which is the
implementation in a specific programming language which adds accidental complexity.
Then we tried to take an implementation and extract some notion of Essence.
There are some interesting dualities. If you look at behavior, which is what we looked at,
loops and stuff. There is also structure. So if you look at the loop, you could say that the
dual to a loop is the array. If you look at the recursive method or function, the dual to
that is the recursive type. So we can compute Essence on behavior by going through a call
graph, augmenting it with loops and counting those kinds of constructs, and then we could do
the same when we go through a type graph or a class diagram and count these kinds of
constructs. If you want to optimize code, then you often do method inlining. It is one of
the most effective optimizations. If you want to optimize for structure, you may want to
inline objects into other objects or even allocate them on the stack. So those are really
two duals also. And our Essence to some degree measures the minimum you can reach
when you inline everything, somehow, kind of, in both cases.
There is also a limitation of our metrics which is if you do generic things. So you could
write one traversal algorithm and you would have one loop, let's say apply something to a
data structure, and then you can pass in functions or visitors or something like that that do
the computation. And you measure one loop, so the Essence is one, even though you have
many different computations. So what you would have to do there is to say whenever I
instantiate that generic traversal, that's adding one to the Essence, instead of saying there
is one generic traversal. There are many instances of that, same for generic types. So I can
have lists, linked lists of N where N is an arbitrary type, so when I instantiate that container, I
would say I have Essence one higher.
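A tiny made-up Java illustration of this limitation; the only loop lives in the generic traversal, so under the plain definition all the different computations plugged into it share a single unit of Essence:

    import java.util.List;
    import java.util.function.Consumer;

    class Traversals {
        // The one and only loop: a generic traversal that clients instantiate with different actions.
        static <T> void forEach(List<T> data, Consumer<T> action) {
            for (T element : data) {
                action.accept(element);
            }
        }

        // A distinct computation, but it contributes no loop of its own.
        static int sum(List<Integer> numbers) {
            int[] total = {0};
            forEach(numbers, n -> total[0] += n);
            return total[0];
        }

        // Another instantiation of the same traversal.
        static void printAll(List<String> names) {
            forEach(names, System.out::println);
        }
    }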
So if you look at the UML class diagram and you want to measure Essence here, you
would have to again go for repetition, so instead of loops we go for arrays. And if you
look at the class diagram, arrays are an implementation; they are not in the diagram, but
often they correspond to multiplicities in the class diagram. So here a student visits
multiple courses and the course is visited by multiple students and so you have those
stars. Every star represents a multiplicity, a one to N relationship and can be
implemented as an array. The second thing you can have is recursive types, so a
professor and the course form a recursive structure. A professor is in the course. The
course is taught by another professor, another course, and so on. So we can have this
recursion there. So what we do is the same thing here as we do with the call graph. We
build a graph. We also call it the distilled model where you put all of the types and we
add these extra gray nodes. They are the loop nodes basically, so array nodes, whenever
you have a multiplicity, and then we count again the cycle headers of the recursions and
those special loop nodes or array nodes, and that gives us again Essence, but this time it
is structural Essence, not behavioral Essence.
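A hedged Java rendering of the kind of class diagram being described; the actual classes are on the slide, not in the transcript, but the idea is that each one-to-many multiplicity typically becomes an array field (a gray loop node in the distilled model) and the mutual references form the recursion:

    // Illustrative only: each * multiplicity becomes an array field, and the
    // Professor/Course references close a recursive cycle in the distilled model.
    class Student {
        Course[] courses;       // visits multiple courses: a multiplicity, counted toward Essence
    }

    class Course {
        Student[] students;     // visited by multiple students: another multiplicity
        Professor professor;    // part of the recursive cycle back to Professor
    }

    class Professor {
        Course[] teaches;       // multiplicity that closes the cycle Course -> Professor -> Course
    }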
If you look at the essential parts of this graph, it is basically everything you cannot inline,
so the supertype Address is irrelevant anyway. It may be an abstract class; you never
instantiate it so you can get rid of it. And those things here you could all inline into
this array element. You could just inline the Time object into the Period, the Day into the
Period, and then just combine the Period into the array so you have…
>>: It is relevant for the student to get to class on time. [laughter].
>> Matthias Hauswirth: That is actually a very good point. When Dmitrijs presented
these models, one of the comments was that what you are doing here is getting rid of the
essential part, because it is actually relevant that I can talk about whether you get there on time.
And you are removing this knowledge and you are just left with this course array which doesn't
tell me anything what it is, right? So in terms of concepts, yes we get rid of concepts. In
terms of the structure of the solution, we preserve I think some of the aspects of that
structure.
>>: What you're getting at is the mapping between the components but not, it is like the
bloat is getting to that also, right? And saying the essential data is the stuff that is unique
and not the mapping between.
>> Matthias Hauswirth: It goes in that direction.
>>: Right.
>> Matthias Hauswirth: So now we have those two Essences. Actually for the structural
Essence, what is interesting is we tried some data structures. We implemented them in
Java and then we looked at what it turns out to be in terms of Essence. If you do a linked
list, no matter how you do it, if it is singly linked, then you end up with an absolute
Essence of one, so you get one recursive node in the distilled diagram. If you do a tree, if
it is the parent pointing to the children and no back pointers, then you get two. And if
you do a graph, then you get three, if it can't be traversed backwards. So that is kind of
an interesting result.
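For example, hedged sketches of shapes that give those counts; the exact fields used in the study are not in the transcript:

    // Singly linked list: one recursive reference, so absolute structural Essence 1.
    class ListNode<T> {
        T value;
        ListNode<T> next;          // the single repetitive construct
    }

    // Tree, parent pointing to children, no back pointers: one multiplicity plus one
    // recursion, so Essence 2.
    class TreeNode<T> {
        T value;
        TreeNode<T>[] children;    // array (multiplicity) of recursive references
    }

    // One plausible graph encoding that yields Essence 3 when it cannot be traversed
    // backwards: a node array on the graph plus successor arrays on the nodes.
    class Graph<T> {
        GraphNode<T>[] nodes;      // multiplicity
    }

    class GraphNode<T> {
        T value;
        GraphNode<T>[] successors; // multiplicity plus recursion through GraphNode
    }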
We built an infrastructure called the centralizer. I think it is available on the centralizer
tutorial, if it is running. And you basically take some input that could be Java, the current
implementation is Java, a compiled project. There is a jar file. We analyze it. We build
this distilled model, this graph out of it and then we have a tool that computes the metrics
and another tool that visualizes the model in some way. And so we have transformers
from Java and from UML class diagrams in XMI format and one could add arbitrary
transformers to it.
Maybe to conclude, two comments, so this is a comment from a reviewer for the ECOOP
paper. I think he seemed to like it. Now I know people that don't like it, and I don't put
that on here [laughter]. They don't believe…
>>: Why not?
>> Matthias Hauswirth: I tell you. They don't believe that this is useful and I am not
sure that it is useful. I like the idea and I wonder and I would like to figure out if it is
useful. Some people think it is useful and that reviewer pointed us to a blog post of a
developer. I have never met him, Bill Woody. And he talks about in his whole post
about how Java programmers create these really bloated things that start with a simple
thing. He talks about implementing factorial, and he has a very nice implementation
of factorial, very, very few lines, and then he turns it into this framework where he
can plug in different algorithms for implementing factorials and he needs a factory and you
have to have this really big architecture for computing factorials. And he says that is how
Java developers who build web applications, that is the direction they take. They add
layer upon layer.
>>: [inaudible] in the last few months I [inaudible] collaborate with [inaudible] and if
you look how they do Java you [inaudible] print. [inaudible] streaming part [inaudible]
system.
>> Matthias Hauswirth: Overdesigned, yeah.
>>: It's not. I mean it is over, over, overdesigned. [laughter].
>> Matthias Hauswirth: So it could be that Essence might capture that. It would be
interesting to see it.
>>: And another interesting [inaudible]. If you look at [inaudible] kernel, in the last 20
years the same stuff, the interface has not changed that much. But the code keep refactoring keep re-factoring for the past 20 years at and you can get all of the patch and
you can think actually this is unfair for the system, for example, [inaudible] the first time
they got one [inaudible] systems and one [inaudible]. In laser, now they got probably
100 architectures and each one has some [inaudible] and by the same function and keep
re-factor keep re-factor keep re-factor. A large part of the system is [inaudible]. And
[inaudible] increasing to have [inaudible] pickup [inaudible].
>> Matthias Hauswirth: I think one of the problems that we have is we don't have a
baseline. We don't know if it is overdesigned or not. That is a human expert that should
be evaluated. So actually that is my kind of pitch here.
>>: I think there is some method you can do that if you have a history of the program.
For example [inaudible] okay the workings for [inaudible] okay start from the first
portion, right and someday [inaudible] does that and I have to [inaudible] and can result
[inaudible] merged. In the merging has some problems and they tried to, they do some
better research and pick out a problem here. And we watched and [inaudible] of same
work. We did the patch and researched all of them. So and I think you can run some
metrics which I think [inaudible] what these patches do and how [inaudible] one function
to do, for example, this function just for one [inaudible] system and now this function for
100 new systems and you [inaudible] some packages [inaudible] fix some part which is
because this speed. And you do have [inaudible] because people have similar [inaudible]
actually we can do the [inaudible] analyze and [inaudible].
>> Matthias Hauswirth: We can try. We should talk about that later. I want to make
sure that I don't have to skip the second part, but we should definitely talk about that. I
am looking for ways to actually be able to say whether this is useful or not.
>>: [inaudible].
>> Matthias Hauswirth: Yeah, I have seen that too. So I went and searched a little bit
and there seems to be I believe there is something there.
>>: I liked when we did the DaCapo paper. The software engineering metrics that we
were using were all useless. And they weren't telling me enough because they didn't get
at what the program was actually doing and how well it was expressed. So we
could have big DaCapo benchmarks with lots of things going on in them, but it could all
be useless [laughter], right? But we really didn't know that. Instruction footprint was
one of the things that I found especially suspect for us because it said
how many instructions you had to generate, but it didn't tell you how many you
really needed, right?
>> Matthias Hauswirth: Well, we don't really tell you that either. There can be loops
that are totally inessential, so we actually found that in the developer's example that I saw
in this blog post, he then goes on and he has this factory instantiation idea, but he has a
recursion where there needs to be no recursion at all, but our thing picks up that there is a
recursion. In practice it will never be executed recursively, but it is--so I don't claim that
this is really going to find your really useful core of your algorithm, but it might. If you
think about it, Dmitrijs gave the talk at MODELS, so they focus on models, on
abstractions and so my claim is that probably if you have a model and you compute this
Essence, this relative Essence, on the model, that should be bigger. It should be denser in
terms of the essential aspects than the implementation, which has all of these kinds of extra
things that are necessary when you implement the model. That is kind of Brooks, so we
are trying to quantify Brooks. So if I had a set of models and their implementation and
you could just do it right now and look at this, but I don't have any models. I haven't
found any models. I don't know whether there are any UML models that people can just
use publicly.
>>: What about maybe taking it down a notch and looking at programming language and
then back into the compiler generating code? So you look at the code that gets
generated, because the optimizing compiler actually has some constraints in which it
actually builds up the implementation of, where the model is now the code and the
implementation is x86.
>> Matthias Hauswirth: I am not sure whether it actually does much in that respect.
Sure it can unroll loops, so if you have a static bound or whatever, in some cases it would
unroll a loop and change it…
>>: Well, would you consider unnecessary abstraction anything that is so
short the inliner inlines it, right?
>> Matthias Hauswirth: Yes. You could look at the inlining instead of this conceptual
view, but really what the compiler actually does, yeah. That would be an interesting--but
in the end…
>>: There is some notion that the compiler is getting rid of some of the abstractions for
you, right? Maybe not perfectly…
>> Matthias Hauswirth: Yeah, maybe what I am showing here is the perfect thing, right?
The nodes that I color are the minimum nodes that you have to have and everything else
you could in theory inline, but whether that is a good idea, I don't know. But it is
something. The second point is, and that is kind of the idea of the metric. If you measure
a metric, a design metric, you have to have a goal and the main goal is usually the
humans that use the code, not the compiler part. But maybe this could be useful in both
ways. So if I measure Essence…
>>: Just assume that it is the compiler. Like that is the other consumer of code. People
say, oh it doesn't matter because it is an automated tool, but it is automated for human
patterns. So if you have some notion of what the patterns are, you should be able to build
a better compiler because it is just an engineered object based on program behavior. So
people think, people ignore that the compiler is a consumer of whatever people are
producing. And it is engineered fundamentally that way.
>> Matthias Hauswirth: That is a good point. Yeah.
>>: [inaudible] in the [inaudible] program we were writing code and you have to
understand the compilers [inaudible]. And you try to make your program have more
patterns [inaudible] even [inaudible].
[laughter].
>> Matthias Hauswirth: If you are good, you write the right ones.
>>: [inaudible] like that so they have to, [inaudible] if they have a lot of loops in the
code then we should and then because they try to use [inaudible] because they do original
[inaudible] and they first [inaudible] and that is not performing as they expected. And so
they change the code so that the code [inaudible] compiler work, and they changed it,
they make the code not that easy really to [inaudible].
>> Matthias Hauswirth: So they make it much more performant, right? [laughter]. So
this is the second part. It is not exactly connected to the Essence. It is about Milan Jovic
who is another PhD student of mine. He is focusing on performance, and performance of
interactive applications. So I personally, I don't know, attract performance problems
somehow. I very often run a program on my phone or my computer and it is slow and I
don't know why and I want to get rid of that problem. So let me show you a performance
problem which is not in a Microsoft product [laughter]. Here we have Eclipse which you
all probably have heard of. This is version 3.5, so it is not the most recent version of
Eclipse. It has this beautiful rename refactoring, so I can go on something simple like a variable
name here and I can say, rename this. And if I rename this it will automatically, when I
am typing change all occurrences of this variable in the currently open editor. And I can
now start typing and I can say, okay, I want to change a name and say Matthias, and that
obviously is not fast enough, right?
So somehow there is this performance problem. It has been around for years. I don't
know, maybe two years, maybe three years. I know that it just suddenly appeared and I
was thinking oh, I need to update and the new version came out and I updated and is still
there. And eventually I thought I just have to live with it forever. So now I will show
you how we can detect the reason for this bug. If you use a normal profiler, then you will
get something like this. So you get a hotness profile which tells you in this function or
method you spend so much time total. So that is not going to help you at all to find this
bug; it is going to point you to something that is absolutely unrelated. I could show you a
picture. It is going to show that this is a huge problem while the actual problem is a pixel in the
picture which you cannot even see. So this is an abstraction and it is the wrong
abstraction for this kind of performance problem.
What you really need to look at is the trace, behavior over time. And when you look at
this trace effect of execution, you see a lot of white spaces are sitting here in between
those colorful boxes and that is the time when the user is thinking. So we found in some
experiments…
>>: You need to optimize that [laughter].
>> Matthias Hauswirth: Yes, ideally, but that is more AI I think than… So actually AI
research, 98% of the time that could be improved is due to AI, to make humans faster
because only about 2% of time is spent computing. We did thousands of hours of tracing
of interactive applications on just Java applications. Most of it is idle. We don't care
about that. What you care about is the time when the computer handles a user event. We
really care about the latency of handling the event, not the overall time. Now if you look
at this picture here, you see the longest, we call that an episode, so the longest handling
of an event is this one that I circled, and the dominant code in there was the green code.
If you look at the hotness profile, the hottest code is the blue code. But the blue is spread
across many, many handlings of different events, so nobody would ever notice this, but
the developer would optimize for the blue one, would not think about the green one. This
picture is nice; it is much worse in practice.
So when you characterize performance you want to characterize latency. Here you see
latency on the x-axis, 0 milliseconds, 100 milliseconds, 10 seconds or whatever to the
right. And on the y-axis you see how many times, how many events or episodes you had
with that latency and it is cumulative. So the point I show there is 179 episodes or events
took longer than 100 milliseconds. And this hundred millisecond line there is not there
by chance. This is a very important line. It is not entirely clear that it is exactly 100, but this is
the perceptibility threshold. So Ben Shneiderman did some experiments and he found that
people do not notice latencies below 100 milliseconds. So if you click a button and it
takes 90 milliseconds to respond, people will perceive that as instantaneous. So what you
want to have in performance is you want to have an L-shaped curve which goes all the
way to the bottom and then stays at the bottom beyond 100 milliseconds so nothing takes
more than 100 milliseconds. But everything can take something here. It doesn't matter
where the upward thing is here in this range.
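A minimal sketch of the kind of cumulative count behind such a curve, for example how many episodes exceeded the threshold; the 100 ms constant and the data shape are assumptions here, not the tool's actual code:

    import java.util.List;

    class LatencyReport {
        // Perceptibility threshold assumed from the Shneiderman-style result mentioned above.
        static final long PERCEPTIBILITY_THRESHOLD_MS = 100;

        // Count how many event-handling episodes took longer than the given threshold.
        static long episodesAbove(List<Long> episodeLatenciesMs, long thresholdMs) {
            return episodeLatenciesMs.stream().filter(latency -> latency > thresholdMs).count();
        }

        // The ideal "L-shaped" curve: no episode is perceivably slow.
        static boolean isLShaped(List<Long> episodeLatenciesMs) {
            return episodesAbove(episodeLatenciesMs, PERCEPTIBILITY_THRESHOLD_MS) == 0;
        }
    }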
So you don't have to optimize forever. You can give up and say it doesn't make a
difference to the user, maybe for energy consumption there is a difference, but for the
user it doesn't.
>>: [inaudible] this company gain?
>> Matthias Hauswirth: There is some variability to this. There is a gain. They did a
study where they put a virtual reality helmet on people and they put them in a virtual
maze and in the maze there was a huge hole. And the goal was the user needs to walk
around the hole and not fall into the hole. And then they changed the latency from
moving to the update of the display between 75 milliseconds and 125 and they found that
people died at a much higher rate when they had the higher latencies. So somewhere in
this range there is a critical threshold where you can't really walk anymore because the
delay is too long. So it is not an effect just in perception, so that you notice it, but it actually
decreases performance, at least for this game task.
And they did it for other things. They measured the differences by input device. There
are many studies on this. And the threshold changes, but it is in this order of magnitude.
So we don't need to make those episodes shorter than something like 100, or maybe, if
you want to be optimal, 40 or 20 milliseconds.
>>: Seventy-five if you don't want to fall in the hole [laughter].
>> Matthias Hauswirth: Exactly. So here we have latency curves for Eclipse. So we
have five versions of Eclipse. Black is the oldest. The lightest gray is the most recent.
And you see that every subsequent version got slower. I mean you have more and more
episodes that take that long. And the reason is because…
>>: [inaudible] more [inaudible].
[laughter]. [inaudible].
>> Matthias Hauswirth: I think that may be. So the reason is that the code grew. The
code grows over those versions. It went from 20,000 to 30-something thousand classes.
Now if you just add classes that doesn't make it slower.
>>: That is why you have to do nesting.
>>: That's why you have to…?
>>: That's why you have to do nesting, because if you added separate functionality, then
it doesn't matter. But if it is all messy inside each other, that, you're stuck with it.
>> Matthias Hauswirth: Exactly. It has to be somehow, you are right. In the nesting
shape, and it is actually in this nested shape, because if you write an Eclipse plug-in, you
are not writing an extra application that sits next to Eclipse, but you are plugging it into
the framework and you're going to listen to changes in the global state. So whenever you
press a key, your plug-in is also going to respond and it is doing that synchronously. This
is not an asynchronous kind of message passing, but it is going to call your plug-in and tell
you oh, this key was just pressed, and your plug-in is going to compute some very
complex thing. And so every plug-in is going to add a little bit of latency and your
system gets slower and slower.
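A toy sketch of why the latencies add up; this is not the Eclipse or SWT API, just a made-up dispatcher showing that synchronous listeners run one after another on the event thread:

    import java.util.ArrayList;
    import java.util.List;

    class EventDispatcher {
        interface KeyListener { void keyPressed(char key); }

        private final List<KeyListener> listeners = new ArrayList<>();

        void register(KeyListener listener) {      // each installed plug-in adds a listener
            listeners.add(listener);
        }

        void dispatchKeyPress(char key) {
            for (KeyListener listener : listeners) {
                listener.keyPressed(key);           // synchronous: the user waits for every listener
            }
        }
    }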
Now we want to detect the causes of those latencies, and we want to detect the ones that
people really care about. So our idea is to put a little agent with each application that is
deployed in the field. So you may know that in this research group here the idea of
putting an agent or some kind of an analysis in the field so you can do this collaborative
bug detection is not [inaudible] but we are focusing on performance here, on latency in
particular. And then we gather traces or profiles from all of those agents in a global
repository and then we ship that back as bug reports to the developers. Now you could
say that hopefully the end users will be happy once the developers fix the latency bugs.
Now you could say why do you want to deploy that in the field? Why don't you just test
for performance? We do unit tests for everything anyway, so you just measure and you
know that you are not too slow.
Well, the problem is that coverage is not the same for performance tests as for functional
tests. If you are talking about performance tests, you have to take the entire system into
consideration. It depends on the platform, meaning the virtual machine, how much
memory you have, what hardware, what other background applications are running, like they
use an MP3 player in the background when they work on Eclipse or whatever. So all of
those things impact performance and you can't tell what users do without actually
observing the users. That's why we ship the profiler to the users and we get the
performance they perceive in their context, and we do it for everybody. So the profiler
needs to be extremely lightweight and it needs to give you just the right information to
find the bug.
>>: You could also say that that tells you how much you need to drive down
latency. 100 is with no one else and just the tester running, and then if everybody is
listening to their MP3 player while they [inaudible] then the latency has to
accommodate that. But if only one or two people are, then maybe you don't really care.
>> Matthias Hauswirth: Yes, but you need to know that. So you have to measure
something in the field, so you can have as well a profile in the field, and then just look at
the latency that they actually get which includes the scheduling and the contention and
whatever is happening.
>>: [inaudible] devise a game developer and got a game for his android phone and what
he [inaudible] his [inaudible] experiments on the android platform. Actually [inaudible]
he tried his best to [inaudible] Matthias [laughter].
>>: Matthias wasn't online [laughter].
>> Matthias Hauswirth: Actually yeah, we were doing something like that too. Now the
question is what do you measure? You want to measure latency. And you have some
behavior in this blue box over time and you measure the start and the end time stamp and
you compute latency. Now the question is from what? Like what is this blue box? And
so when you pick what you actually want to measure, what kind of behavior you want to
measure, we have this two-dimensional space. You want to have minimum overhead.
You want it deployed in the field so users must not notice anything and it must not perturb
your measurement results. On the x-axis you have the amount of information you get.
You want to figure out what the problem is. You don't want to just get information that it
is slow, but you want to know why it was slow.
So ideally it would be here. You know everything and it costs nothing. That is
unrealistic. So you could go there. You could instrument every single method call and
say this method took so much and you have this whole calling context measuring latency
information everywhere. That is going to be too expensive. On the other hand you could
say okay, I reduced my overhead. I will only measure in the dispatching of events how
much each event takes and I don't know exactly what happens. So you know nothing.
You just know that it was slow, but you don't know why. So we went in the middle
somewhere and we tried to get towards this corner. So we instrument methods, but we
instrument only a subset of methods which we termed landmarks. Those are methods
that are infrequently called so we have little overhead. They are covered in most of the
executions, so we don't miss any latency bugs. They are usually boundary calls, so like
calls between components in your system, and a very important point is we use those
methods to classify the bugs, so we have one bug in our bug report, one report for each
landmark method. That is how we group or cluster things, by method.
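A minimal sketch of what timing a landmark could look like, written as a hand-coded wrapper rather than the bytecode instrumentation the deployed agent presumably performs; LatencyRepository is a made-up sink that groups latencies per landmark:

    class Landmarks {
        // Made-up sink: one entry (issue) per landmark method in the bug report.
        interface LatencyRepository {
            void record(String landmarkName, long latencyNanos);
        }

        // Time one execution of a landmark method and report its latency.
        static void runLandmark(String landmarkName, Runnable body, LatencyRepository repository) {
            long start = System.nanoTime();        // timestamp at entry
            try {
                body.run();                        // the infrequently called boundary method
            } finally {
                repository.record(landmarkName, System.nanoTime() - start);
            }
        }
    }

A caller would then wrap something like the compile command, for example runLandmark("compile", () -> compiler.compile(project), repository), where compiler, project, and repository are placeholders for this sketch.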
So if you take an execution of a program, you have time and then you have the call stack
that grows and shrinks over time, some of the methods will be landmarks, only a small
subset, very few. And usually they are listener notifications or observer notifications. If
you have a command pattern in your application you say compile, then this compile
method or the listener will be a landmark that you time. Paint, which is output, where the IDE or
your application draws something complex that can take time, and native calls of course,
because that is where we do I/O and all the possibly long-latency OS calls. So those are
landmarks. You can define what that means for your scenario.
To make sure that we don't miss anything, we also include the dispatches, which are at
the bottom of the call stack in the event loop as landmarks. So we know for everything
that happened how long it took, but if this is a landmark, we have one entry in our bug
report that says dispatch, and then you have the distribution of latencies and it doesn't tell
us what actually was dispatched. But we know that there was a problem and we didn't
capture it with those other guys, like here for example. But we don't know exactly what
happened, so we have to introduce more landmarks. It is kind of a safety net.
If we just did the landmarks, we wouldn't find the reason for the problem. We know a bit.
We know oh, compile, the compile command was slow, but we don't know why the
compile command was slow. So we use an old approach, which is stack sampling, like a
stack sampling profiler like hprof, with randomly spaced samples to prevent bias. And what we then
do is if you have such a long latency episode like this one, you take the stack samples that
occurred during that episode and we build a calling context tree. So we merge the
samples; we form a tree and we have weights in each node of the tree which is the
number of stack samples, or the hotness of this calling context.
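A rough sketch of folding the stack samples taken during one long episode into a calling context tree, with the sample count in each node as its weight; the data types are invented here:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CallingContextTree {
        static class Node {
            final Map<String, Node> children = new HashMap<>();
            int samples;                 // weight: how many samples passed through this context
        }

        final Node root = new Node();

        // Each sample is the list of method names on the stack, outermost frame first.
        void addSample(List<String> stackFrames) {
            Node current = root;
            current.samples++;
            for (String frame : stackFrames) {
                current = current.children.computeIfAbsent(frame, name -> new Node());
                current.samples++;       // a hot trunk accumulates counts from all samples that pass through it
            }
        }
    }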
So we see that this context occurred in one sample so it is kind of cold, but this trunk here
is relatively hot because all samples were in that. So we did this study to evaluate that.
We ran Eclipse, or our students ran Eclipse for three months during a semester. There
were 24 students in that course and they were using our little agent in there. They did
over 1000 sessions. So start Eclipse and use Eclipse for a while and the total was almost
2000 hours of usage, so that is not CPU hours; that is total hours where Eclipse was
running. Out of that we got this huge repository. We have about 800 issues in there, but
the issues like I said are landmark methods. Every issue is a circle. So we have roughly
800 circles here. And they are positioned according to their severity. On the x-axis you
have how many times this landmark actually executed with any kind of non-zero latency.
So if it is almost 0 we just throw it out, but if it is a little bit higher then we show it. And
on the y-axis you see the latency that occurred in these landmarks. So the higher it is, the
slower the response was, and the more to the right, the more often the problem occurred.
The size of the circle corresponds to how many users actually observed this problem, so
if somebody, and our students did, installs extra plug-ins that somehow slow down the
thing, then only one user would experience that and that would be a small dot, so they are
up here, and if it is a stupid plug-in for, I don't know, playing movies that he installed, then
we don't care about that. Maybe. So I picked one of those issues, the red one. This is the
issue that actually corresponds to the demo I did. And if you select that issue it will show
us related issues. So if you have a method, let's say mouse click listener or mouse button
press listener, you often have a kind of corresponding method, mouse button release. So
you have to have two different landmarks but in reality both of them do very similar
computation. And so we group all of the landmarks with the same kind of computation
into a family of landmarks and here we highlight them. And we group them by similarity
of the calling context tree. So we do a tree similarity computation and we group those
that are closely related like this. So if you go and you try to fix this, you know that you
will probably also fix this one and this one and this one.
So let's go to our specific issue. If you look at the calling context tree produced by this
specific landmark, it starts at the top. That is the name of the landmark. Verify text, and
then you see the tree like this is one method here on the stack so the stack goes like this.
And if you follow this, there is lots of information you can see. Here you spend a lot of
time. And if you look at what that is there, it is a method called pack popup, and you
spend 84% or 82% of the stack samples in this long latency behavior in that method.
>>: And so when you find that out because you sampled all of these calling contexts and
all the way to there and that is in 80% of the samples?
>> Matthias Hauswirth: Eighty percent of the samples taken during this long latency
operation. If you look at the calling context tree of the whole run, this would not show
up.
>>: Right. It is because you are in this focused region and the samples of the calling
context merged together into these, well, not always merged, there are some completely
separate ones, right? But they, I guess because you are starting with there, they all have
to be under that call, right? For the listener?
>> Matthias Hauswirth: Actually the listener for this verify text could be called from
different contexts, and often it is. So we can merge them together, which is to say that
method, and we merge those trees so we see it.
>>: All right.
>> Matthias Hauswirth: We can also show which context this thing occurs in and all
kinds of extra information. So now if I go back to Eclipse you can see what the problem
is. So let me re-factor this again so re-factor, rename, and then try to delete. It is again
slow. And what you see here is this kind of fancy tooltip that has this very, very useful
information. I mean it starts to say enter new name, press return to re-factor, always, no
matter what. So I guess the developer of this re-factoring, this in place re-factoring,
thought that he would use this fancy kind of tooltip API, and actually the API has this
neat feature. You can do this little callout, this little thing here, this triangle. Which is a
very advanced feature actually that was not possible before and then they added this to
this GUI toolkit that you can have nonrectangular windows. Normally windows are
rectangular, but they have these arbitrary shapes, and now pack popup is the method that tries
to kind of figure out the area that is kind of hidden by this tooltip. And because the
tooltip is not rectangular, it takes a long time to compute this geometry, for some reason.
So, you have a nice feature here too. You can actually drag this guy and you can move it
somewhere else, for example here in the corner and then this little triangle disappears and
the other triangle is gone and if I continue typing, the performance problem is gone. So it
really is exactly that, whoops, that problem. So with this we found that originally we
thought it has to do with oh, there are many instances or occurrences of this name, and
when you type a key it needs to update the text in many different places, so it slows it
down by a factor of x and it has nothing to do with that. It is just the GUI that is
somehow the fancy feature in there. They fixed it in a later version of Eclipse.
>>: How did they fix it?
>> Matthias Hauswirth: I think they did some caching. I don't remember, but I think
they cached the outline of that thing, because it is constant anyway, right? It is always
the same shape. It doesn't change.
>>: You could move it out of the way too.
>> Matthias Hauswirth: Yes, but still they want to keep this user feature that
you can move it. If you like it at the bottom you can move it to the bottom.
>>: Yes, but by default you can move it into blank space at the beginning, right?
>> Matthias Hauswirth: Yeah. You could, but then if a user finds it nicer up there, he
would still have the problem. So they did the right thing. They cached it. They should
have cached in the beginning.
>>: [inaudible] problem and they start to feel that they…
>> Matthias Hauswirth: Actually they fixed it and we found the problem, but
independently. So when we had the problem understood, we checked and the newest
version had fixed that thing, not because of us.
>>: So for the history of this research, you are saying that you got the problem first and
tried to build some tools to find the problem or…
>> Matthias Hauswirth: We didn't start with this specific problem; we started long
before I was aware of that problem. But this was a nice example that I said look this is
really annoying me. I use it every day and it is slow and so, Milan, find the reason for
this. And so he used the tools that he already had developed for it.
>>: [inaudible] you find a reason, you find a reason for the tools or you find [inaudible].
>> Matthias Hauswirth: I think I found the problem first, but not this problem, just
generally I got annoyed by those kinds of performance problems and I found that
traditional profilers do not help me. And then this came along as kind of a nice bonus,
but it was not the inspiration for the whole research.
So Milan used this for quite a few tools. This is one bug in Eclipse. He found quite a
few of them.
>>: When did they fix the problem with the bug in Eclipse?
>> Matthias Hauswirth: I don't know exactly but I think it was more than two versions,
so like five still have it. Two, maybe three years. I mean it is amazing, you know,
because everybody uses rename re-factoring. It is the one re-factoring that everybody
uses, so I wonder what…
>>: So you could say that because tools like this didn't exist if you were trying to sell
your work a little better, that it took them two years to fix the bug, whereas if they had
used your tools they could have fixed the bug a lot faster.
>> Matthias Hauswirth: You are totally right. And I am going to undersell a bit, you
could also say that users probably have complained and maybe they just didn't listen.
Maybe, actually I didn't complain. I used it all the time and I just screamed at the
computer and I didn't tell the Eclipse people that…
>>: [inaudible] assumed that those people would hear about that. It Eclipse has these
problems like it is slow, [inaudible] and people assume that is normal, people would
think…
>> Matthias Hauswirth: People get used to it, yeah.
>>: [inaudible] kind of get used to that too, that it is slow. And a lot of people
[inaudible] because it is written by Java. Then they think it is a common thing.
>> Matthias Hauswirth: Yeah. So we used it for different applications. Here are three
examples. Informa clicker is a tool that we wrote and we use in our teaching and it is a
distributed system and our students and the professor have a graphical user interface that
is relatively involved. We used Milan’s tool on a few sessions, and we fixed some bugs
and the next time we used the tool the students actually said whoa, what happened? It is
so fast. And so it was clearly finding a useful bug. It was a simple one, but it was a
useful one. And Milan integrated it into an alpha version of Code Bubbles by Steven
Reiss and I think Steven integrated it. And Milan found in the repository some
interesting performance problems that I think probably are fixed now.
So this works in general for any kind of Java application, graphical Java
applications no matter what framework they use. So the conclusion is, for performance, look
at performance problems that actually matter, measure it on the user’s desktops or phones
or whatever, so you know what you need to optimize. Okay. So that is both halves of
my spectrum.
>> Todd Mytkowicz: Thank you. [applause].
>> Matthias Hauswirth: Okay. Any more questions, or are you questioned out? So if
somebody has comments on anything like this tool or the Essence, if you totally disagree,
let me know too. I want to know. And if somebody has models that I could use to test, I
am very interested in running tests on some models. Okay.