>>: Thank you for coming out. It's my pleasure to introduce Andreas Zeller, who
has been here visiting us for the past week or so. Andreas came out for two
months in March; is that right? Earlier this year. And the paper that he's going
to be presenting is one of the pieces of work that he did with us, Tom
Zimmermann and myself.
Back in Germany, Andreas is a full professor at Saarland University. He's done
work in a lot of areas of software engineering, has written a book and a number
of papers, and most recently -- I think it was just in June -- he was inducted as
an ACM Fellow. So we're happy to have him here.
We look forward to your feedback. He's actually going to be presenting this at
Promise, the international workshop, or conference, on predictor models in
software engineering. So, Andreas.
>> Andreas Zeller: I can see you can all hear me. Marcus, can you hear me
back there? Wonderful, great. This is Failure is a Four Letter Word. As the title
already says, this talk is about software failures. Software failures as you all
know them -- this is one of the most spectacular examples, videos: here's the
crash of the Ariane 5 rocket, one of the many software failures we have to deal
with in real life. Besides programming, probably one of the major obsessions of
computer researchers is: how can we find all the bugs and how can we eradicate
them? One particularly interesting line of research in here has been the analysis
of software histories.
What you see here is actually all the classes of Firefox, arranged in a tree map
where you can also see the individual packages. And in this picture we have
color-coded the defect density in the individual classes of Firefox.
What you see here below, for instance, are classes in the JavaScript engine of
Firefox. And you can see these red boxes down here are classes where a
high number of security bugs showed up.
There are also other places in here. Here's the layout base; this is where
cross-site scripting attacks happen. Over here we have the document object
model, where a number of attacks have also happened, and where many such
bugs have been fixed in the past.
Such results come from a pretty new field of software engineering, which is
mining software repositories, where you go along and look at the history -- in
particular the change and bug history -- of systems, and then you map them to
individual places in your source code in order to figure out where the most bugs
have occurred and been fixed in the past.
This field has taken off enormously since, well, 2004 -- I think that's when the
first workshop on the topic happened. At that point there were about 20 people
attending. Now it's almost a major conference, with 150, 160 people attending,
and plenty of people all along the way who work on this kind of research.
The interesting thing is, once you see a picture like this, you wonder how it can
actually be that there are specific places in your code where there are many
bugs, or many bugs of a specific kind. And how can it be that there are other
places which are literally stable from the beginning?
So what people typically do is go and associate this with possible failure
causes. Failure causes such as, for instance, the quality of your developers:
you look at appropriate features of your developers and you try to correlate
them with specific features of the code they've been writing.
One of the things that my student [inaudible] and I found out at the time, for
instance, was that in Eclipse the more experienced you were, the more bugs
you would create -- very much in contrast to what the picture shows over here.
We can also go and try to correlate failure causes with specific management
styles or with the specific ways a team works together. This is actually
something that was found out here at Microsoft Research: if you have many
different people from different teams, and none of these individuals actually
takes responsibility for the actual outcome, it turns out that the resulting
modules have a higher failure rate than modules for which there are individuals
who take responsibility.
This is another interesting point that researchers have been investigating. And
then, of course, there's the issue of complexity: all sorts of code features that
people have looked at, various metrics that you can go and correlate with this
failure data. Obviously, you would assume high complexity yields more bugs.
Certainly if the complexity is high, it dominates everything else.
The problem is that at this point what we see is a large, large number of
studies all trying to correlate failures with various aspects of the code and its
process, which means that for a manager the situation is actually pretty
confusing, because there are so many relationships between failures and
features of code and process. And then there's something we call the cost of
consequence: we may choose to work on a specific thing.
We may work, say, on the management style. We may work on reducing the
code complexity. We may work on increasing the abilities of our developers.
But all of these have costs, and there's the risk that any of these actions we
could possibly undertake will actually not change things.
And we see this problem primarily as a problem of, well, too many results, too
many findings -- and it's hard to choose the right one among all of these.
What I'm going to suggest in this talk is that we go back to the very basics.
Back to the very basics of programming. What does programming in its true
sense actually mean? And in its true sense, programming from the beginning is
simply typing. We're typing our code on the keyboard, and we're producing
code.
This is what programmers do at the very lowest level of abstraction. Actually,
I'm going to make this a metaphor. I'm going to say we try to avoid all
abstraction -- trying to come up with abstract concepts in the code or abstract
concepts of the process -- and rather we're going to look into the code as it is
right here in front of us. Simply as a combination of individual characters, as
they form the source code in the very end.
And this is precisely the level at which we analyze code: simply by looking into
its individual constituents, which means individual characters. Actually, if you
apply such a thing -- if you simply look at the distribution of characters in
Eclipse -- you'll already find that there's a certain bias in here, because
individual characters are not equally distributed. For one thing, there are plenty
of printable characters in code and very few nonprintable characters, except for
space and for new line -- new line and line feed, which make up the gist. There
are plenty of lower case letters in here, very few upper case letters in here.
And you can also see this difference in distribution: E is the most prominent
character among the lower case letters. This difference in distribution actually
points to a concept: we may be able to look at the distribution of individual
characters in source code in order to come up with empirical findings about
them.
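To make the idea concrete, here is a minimal sketch of such a character-count
analysis in Python; the directory path is hypothetical, and this is an illustration
of the idea rather than the speaker's actual scripts.

```python
from collections import Counter
from pathlib import Path

def char_distribution(root):
    """Count how often each character occurs across all Java sources
    under `root` -- the raw, abstraction-free view of the code."""
    counts = Counter()
    for path in Path(root).rglob("*.java"):
        counts.update(path.read_text(encoding="utf-8", errors="ignore"))
    return counts

# Hypothetical usage: the most frequent characters are typically
# space, newline, and lower case letters such as 'e'.
dist = char_distribution("eclipse-2.0/plugins")
print(dist.most_common(10))
```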
So what we looked at is the publicly available Eclipse bug dataset. This is
available for download for anyone, and plenty of people have actually worked
with it.
It comes in three releases, from 2.0 to 3.0, and it consists of between 6,000
files in the earlier releases and 10,000 files in the later releases, each with an
individual defect rate, which means the number of defects which have been
fixed in the appropriate file over time.
Here are a few more features. This is the number of characters that you find:
we have 45 million characters in Eclipse 2.0 and 76 million characters in
Eclipse 3.0. This is something we look at. This is the number of files that have
defects: about one-sixth of all files in these settings have had defects fixed in
the past.
So we're looking at three hypotheses here. The first hypothesis is that we can
actually use these programmer actions, these very basic typing actions, to
predict defects in Eclipse or elsewhere. The second is that we can go and find
out which programmer actions are the ones most correlated with failures. And
the third is that we can actually go and prevent defects by restricting such
actions.
And let's start with the first one: predicting defects from programmer actions.
This was a fairly straightforward approach. We simply trained a machine
learner, a regression model, on the distribution of characters in each file. We
used the number of specific characters in each file to train it, and we used this
to make predictions on whether a specific file would be failure-prone or not.
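As a rough illustration, a sketch of such a per-file learner with scikit-learn; a
logistic model stands in for the regression learner of the talk, and the feature
matrix and defect labels are assumed to have been extracted from the Eclipse
bug dataset elsewhere.

```python
from sklearn.linear_model import LogisticRegression

def train_defect_model(X, y):
    """X: one row per file, one column per character code, holding how
    often that character occurs in the file.
    y: 1 if defects were fixed in the file, 0 otherwise."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model

# Hypothetical usage: train on one release, predict on another.
# model = train_defect_model(X_eclipse20, y_eclipse20)
# defect_prone = model.predict(X_eclipse30)
```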
These are our results for Eclipse. We used training sets from Eclipse 2.0 to 3.0
and we checked this against the very same sets, Eclipse 2.0 to 3.0. If, for
instance, we train a regression model on 2.0 and use it to predict which files
would be defect-prone in the later releases, Eclipse 2.1 and 3.0, we find that
we're actually able to come back with about 50 percent precision, meaning that
every second one of our predictions of defect-prone files is actually right.
Eclipse 2.1 is better; 2.0 is better still. Interestingly enough -- and this is very
important for our work -- if we use the same set for training and prediction, that
is, we learn things in Eclipse 2.0 and predict things in Eclipse 2.0, this is
actually where we get the highest numbers in terms of precision. This is very
important, because normally you do not want to learn from an earlier version to
do things in a current version; you'd like to learn from your current version,
because this is the version you're currently working with. And if you are
learning from this current version and applying it right away, that's when this
whole approach will be most effective, of course.
In terms of recall, recall is not as good as precision. We are at about
20 percent recall, so we are able to correctly identify 20 percent of the
defect-prone files. But this is generally on par with bug detecting tools: if you
use an arbitrary, say, bug detector and apply it, you'll find a number of bugs,
but you'll never find all the bugs. There will be a number of questions at the
very end, if you would like to have them.
So that's the first thing. By training a regression model on the distribution of
characters and the number of characters in files, we can very easily predict
defects in various Eclipse versions.
Let's come to the next topic: can we actually figure out which programmer
actions are the most defect-prone? And this is where the truly interesting stuff
begins. Because we actually looked at correlations of the individual characters
in files -- we looked at whether these individual characters would correlate,
what their correlation coefficient with defect-proneness would be. So here you
see the correlation coefficient, and here you see the correlation values for the
individual characters. And you can clearly see that there are four values that
stand out.
This is the lower case letter I. This is the lower case letter R. This is O and this
is P. These are the four characters in our dataset which had the strongest
correlation with failures. And we found that striking -- not only because of the
correlation itself, but also because these four letters actually make a nice
acronym. We call this the IROP principle.
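The per-character correlation itself is straightforward to compute; a minimal
sketch, reusing the same hypothetical per-file feature matrix as above:

```python
import numpy as np

def per_character_correlation(X, defects):
    """Pearson correlation of each character's per-file count with the
    per-file defect count. X has one column per character code."""
    corr = {}
    for c in range(X.shape[1]):
        col = X[:, c].astype(float)
        if col.std() > 0:                  # skip characters that never vary
            corr[chr(c)] = np.corrcoef(col, defects)[0, 1]
    return corr

# Sorting by coefficient would, in the talk's tongue-in-cheek result,
# put 'i', 'r', 'o', 'p' near the top.
```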
>>: [inaudible].
>> Andreas Zeller: I'll come to that in a second. You have the IROP principle,
which incidentally is also airline slang for irregular operations. So this makes a
nice mnemonic: remember IROP and you'll find the four important characters
that correlate with failure.
This was striking for us, because now we knew which programmer actions led
to failures. And so we had a clear sign of what the basic actions behind them
were in the end.
The third point is: can we actually go and prevent defects like this? I'll come to
this in a second; we'll be there at the end.
So the third thing is: can we prevent defects by restricting programmer actions?
And what we did here is, we first went and mapped the individual characters
onto keyboards. We came up with this concept of color-coding the correlations
of the individual characters on a keyboard. We found numbers, for instance,
were okay. But here you see the four letters of failure, and Enter actually
was -- this is one of the nonprinting characters that correlated with failures.
So we thought that by color-coding this, we would be able to provide immediate
feedback for programmers, such that they would be able to see exactly the
consequence of typing that key, of introducing that particular letter into the
code.
So we thought, well, that's one thing. But it didn't go over too well. So we went
for a more drastic action. We set up a special keyboard, which is shown here,
and this keyboard actually eliminates the possibility -- well, you can still press
ALT and something, but this is really difficult. So we simply eliminated these
letters up here.
So, making the cost of these letters immediately clear. You'll find that there still
is an Enter key. This is right. Formally speaking, we also should have
eliminated the Enter key in here. But the thing is, if you eliminate the Enter
key, what happens is you get many, many more defects per line than
beforehand. So we kept the Enter key on the keyboard, but we eliminated the
other letters.
So this is what we thought would eventually lead to better code. However, it
turned out we had to introduce a number of new coding standards, because of
course your regular code as you have it in here has a number of these IROP
characters in it. And not only that: the keywords in here have IROP letters too.
We had to rename them appropriately, which actually is fairly straightforward. If
'int' becomes 'num' and 'return' becomes 'handbag', which is essentially the
same, we avoid these four letters of failure.
And the interesting thing here: this is 100 percent semantics-preserving, so
there's no chance of this automatic translation introducing any new bugs. So
everything stays as before, except that now we can avoid the correlations
between these four letters and failures.
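For illustration, a sketch of such a keyword rewrite; the two renamings are the
ones from the slide, and a plain text substitution like this would of course also
hit strings and comments, so a real tool would have to work on the token
stream.

```python
import re

# The tongue-in-cheek renamings from the talk; illustrative only.
RENAMES = {"int": "num", "return": "handbag"}

def shun_irop_keywords(source):
    """Replace whole-word keyword occurrences. A real rename would have
    to work on the token stream to spare strings and comments."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")
    return pattern.sub(lambda m: RENAMES[m.group(1)], source)

print(shun_irop_keywords("int f() { return 42; }"))
# -> num f() { handbag 42; }
```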
And these keyboards actually caught on very nicely with our test persons.
What we found is that they also used these keyboards in real life for their daily
communication, and this is what one of them wrote us on this keyboard: "We
can shun these set [inaudible] and the text stays just as well; incidentally, let us
just bend them." You'll see there's no IROP in this text anymore, and this has
become a true habit of developers.
So this is our third hypothesis -- we can prevent defects by restricting
programmer actions -- and it is true just as well. Now, I understand you have a
number of questions. Since the paper has been published on the Internet
beforehand, I've received a number of questions, and I'd like to address them
all.
The first thing, of course, is: is this externally valid? The question here always
is: you only analyzed a small set of actual instances; is this actually true for all
instances? And I can easily shrug this off, because we have been looking at no
less than 177 million individual characters.
This is something which actually brings our statistical package almost to a
crash, because it's so huge. This is one of the largest studies ever in terms of
numbers: thousands of files, thousands of defects, 177 million characters.
Really, this is huge. This is something which encompasses a really large body
of code. The next question is about internal validity. Are the correlations that
we found significant? Are they statistically significant? And all of them are
statistically significant.
We put them through the test. All of them are as they are. And the last
question, of course, is in terms of construct validity. Construct validity means:
are the constructs that we build actually appropriate? The good news about this
work is that there is no construct. There is no abstraction, such as making a
model of the process or making a model of the program or anything; we just
look at the basic characters that are in there. And there is no construct, no
abstraction on our part, that could possibly have interfered with that.
So after these stats, let us focus a bit on future work. One instance of that is
automatic renamings. What we're working on is setting up automatic renaming
programs that go and take source code and other text and simply rename it
appropriately. Where we find an appropriate semantic replacement -- such as
for 'promise', which we can rename to 'engagement' -- we use it; other words
for which there's no good substitution we simply strip of the appropriate letters.
We are also thinking about what could actually be the reason behind all this.
And what we found is, and this is striking, that there are a number of words
with negative connotations -- such as failure, mistake, error, problem, bug
report -- that all contain these four letters of failure.
In contrast, positive words, like success or fame, which we all want, of course,
do not. And it may well be that while typing these letters, there's a sort of
resonance that goes on, a resonance that goes toward the risk of making a
mistake.
That's what we think, and thereby, by focusing on words without the four letters
of failure, we think we can get a more positive attitude, and this more positive
attitude will then come up with more positive test results.
And, of course, we are generalizing. The interesting question for us is: is this
true only for Java code, as we looked at it, or does it also work for C++ code,
or is this actually something that would work only on English-language code?
So right now we're looking at a big base of Russian source code. And what we
found already is that for this Russian code, which is very different from the
source code we've seen so far, there are just as well strong correlations
between the abundance of individual letters and the number of defects in a file.
So I think we're looking at something that really works across a large number of
domains. So this was Failure is a Four Letter Word. And with that, let me get
out of this again -- obviously you found this out: there are very few serious
things in that.
Now let me point out why we did this whole thing, why we wrote this paper in
this very way. It's a fun read; that's at least what I hear from many people. And
let me give you a few explanations of why we did this and what we did in here.
So the first thing is that what we came up with in this research was, well, plenty
of correlations. That's right. And the truth is, in any big dataset you can always
find lots of correlations if the dataset is simply big enough. There are a couple
of famous correlations. Say in Germany, for instance, where I come from, you
can find a very clear correlation between the decline of the number of storks
and the decline of the birth rate. It's correlated almost to the point: the fewer
storks, the lower the birth rate. Sure, there's always a correlation. The thing is,
for storks and babies there are at least a couple of theories that explain the
whole thing.
No matter how bogus they may be. Over here, however, there's no such theory
at all; we were simply making up that there could be something. The second
observation is that machine learning works. You can train a machine learner
on a number of features and it will make accurate predictions based on these
very features. It doesn't matter which features you feed in there.
Which is exactly one of the reasons that this particular prediction works well.
And with this prediction, the fact that we use the same set as the training set
and as the evaluation set is, of course, an absolute no-no. I hope some of you
noticed this along the way. This is the reason why this particular line in here
has particularly high values.
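The methodological fix is to evaluate only on data the learner has never seen.
A self-contained sketch with synthetic stand-in data (the real study would use
per-release character counts and defect labels instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in data: 1,000 "files", 128 character-count features.
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(1000, 128))
y = rng.integers(0, 2, size=1000)

# Hold out a test set -- or, better yet, train on release N and
# evaluate on release N+1; never evaluate on the training data itself.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall:   ", recall_score(y_test, pred, zero_division=0))
```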
Cherry picking is another problem we actually sometimes see in published
scientific papers, where people come along and say: we have invented this nice
technique, and now it's evaluated; we selected five programs on which to
evaluate it. No discussion of how these five programs were chosen or anything;
they're just out there. Okay? So in here -- I think one of you pointed this out --
it is the N which has a higher value. That's right.
But we decided to go for the other four characters because it made this nice
acronym. If we had gone for N, it would have been P-O-R-N. This would have
become the PORN principle, which probably would have lowered its acceptance
rate -- although it could have been fun coming up with a PORN principle in
empirical software engineering. The other minor thing is looking at the scale. If
you look at the scale, you see that any differences in here are hugely
exaggerated. If you look at the true distributions, the true correlations, on a
regular scale, say from 0 to 1, you'll find it's just a flat line -- and the flat line
goes over all characters.
Okay? Because obviously the more characters you have in a file, the higher the
chance of having a bug in it. It doesn't take lots of rocket science to come up
with a conclusion like this. Yet, again, this paper is published on this very fact.
The larger the file, the more bugs. Lo and behold. Which is maybe mightily
interesting the first time, but it goes on: there are actually people who have
published a whole series of papers on that. Quite remarkable.
Our next thing, of course: fix causes, not symptoms. You need to find out what
the actual cause behind something is; just going for the characters doesn't
work. A phrase like "100 percent semantics-preserving" means whatever you
do has no impact at all on the other side. And finally: actionable findings. You
need findings that will not only be of interest to the reader, but will actually
impact the way people work, and you need to show up with some credible
evidence that the findings you have will have some impact.
I'd like to point out also that this research was inspired by a comic called
XKCD. Very nice comic, by the way; I'm happy to endorse it over here. This is
the jelly beans comic, the "Significant" one: jelly beans cause acne, scientists
investigate. Whoops -- no correlation between jelly beans and acne. So if
there's no correlation between jelly beans overall, maybe there's a certain color
that causes it. So you go and investigate: first purple jelly beans, then brown,
then blue jelly beans, and you never find a correlation. But it turns out that if
you do this 20 times at a 95 percent confidence level, then you will actually find
at least one color where there is such a correlation. Or, as they say over here:
we found a link between green jelly beans and acne -- whoa, this is a very
interesting result. Out of these experiments, of course, you focus just on the
one result that stands out with this high error rate. Then you come out with the
result: green jelly beans lead to acne, 95 percent confidence -- which is
formally true. But it's also a problem, because all the negative results we
produce, which never ever get published, eventually cause a bias in public
perception. And this is exactly what comes out at the very end.
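The arithmetic behind the strip: with 20 independent tests at a 5 percent
significance level, the chance of at least one spurious "discovery" is already
about 64 percent. A one-line check:

```python
# P(at least one false positive in 20 independent tests at alpha = 0.05)
print(1 - 0.95 ** 20)   # ~0.64
```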
So for us: we should go and educate reviewers. We should go and educate
professionals. We should go and educate future generations of computer
scientists to know more about empirical evidence, and also to know about the
many ways one can fake empirical evidence -- not so much to use these
techniques, but to discover them. And, of course, there's also a nice paper on
this, which has all of these results: a parody in empirical research, which on the
first four pages actually comes across as a full-blown description of what we
claim to have done, and then there are two pages of revelation. And feel free, if
you teach a class on empirical research, to pass the first four pages on to
students; you can leave out the parody part and have them figure out all the
many ways that this research may be wrong.
This is where we have the paper -- the source, the original paper -- as well as
our scripts and the data available; the slides are available on this site as well.
And with this, I'd like to conclude my talk on Failure is a Four Letter Word, and
I'm happy to take questions. Thank you very much.
[applause].
>>: Can I ask a question in character?
>> Andreas Zeller: Yes, please.
>>: So given your results -- you showed the keyboard slide, and the numbers
were lower, they had low correlation, right? So what if we just used the number
pad and typed in octal, because the number pad has Enter and has numbers?
If we just type in the numbers as code, won't we get a lower defect correlation?
>> Andreas Zeller: That may well be. Actually, typing the numbers in octal --
pressing the numbers in octal -- was perceived as a work-around by our interns
in order to enter these characters nonetheless. But this way it became explicit
to them that they were entering dangerous ground, okay, and this made them
more aware of what they were doing right now. So what we actually found was
that code with these letters, entered after we had distributed these keyboards,
tended to be of better quality, because people were more aware and tended to
focus specifically on the risks involved.
>>: Thank you.
>>: So what would be your estimate of what fraction of all research in this area
of empirical software engineering actually would meet your standards for being
statistically reasonable?
>> Andreas Zeller: So there's a saying that 80 percent of all research
anywhere -- not only in empirical software research, but 80 percent of all
research -- is empirically flawed. This is a number that goes around. If --
>>: [inaudible].
>> Andreas Zeller: There are estimates to that effect. But a simple example: if
you look at headlines, for instance, in a newspaper -- whatever, "A is related to
B" -- and you actually look at the real paper, you'll find out that this typically
comes from studies that have been bought by specific companies, where the
people who responded to them were carefully selected. Where it doesn't take
rocket science to figure out it's flawed from the beginning.
>>: But this is going on in medical research and so on; people do retrospective
studies.
>> Andreas Zeller: For empirical research, I can't come up with a true estimate
on that. What I see are the papers that I'm supposed to review as a reviewer,
and I can see that about one-third of the papers I get to review have very
serious empirical problems. However, these are not the papers that eventually
get accepted. That's the good news.
>>: [inaudible].
>>: Cherry-picking a few examples and saying "it works on these examples" --
that strikes home in a bad way with me, because in some of my research into
program verification, I might come up with a technique that fills a
well-identifiable hole in some particular verification method. I don't ever claim
my technique is a one-size-fits-all verification method, so I find some programs
where I demonstrate some benefit and present results for these programs. I'm
not making any statistical claim that this technique is good or bad, but the
trouble is I always have this nagging feeling that if you took my technique and
ran it on an arbitrary program, it almost certainly wouldn't give you any results;
it would say it can't verify the program. If you combined it with the right other
techniques, it probably would work, but only if you knew magically which
techniques to use.
So it seems like something I would feel a bit guilty about as a researcher, but I
don't see a way around it, because I don't think my techniques are rubbish; I
think they're useful. Can you give me some advice there?
>> Andreas Zeller: The thing is simply whether your contribution, by
construction, will always do things better if it works. You can come up with
techniques that by construction are set up in a way that either improves things
or keeps things as they are. Okay. Then there's no risk whatsoever involved
with them.
>>: You'd say run these two things in parallel.
>> Andreas Zeller: But there are also techniques which you think may make
things better but which also run the risk of making things worse. Bug finding is
a typical example of that, because with most static bug finding tools you have a
number of false positives, and you need to offset the value of the true
positives -- that is, the value of the bugs you find -- against the cost of the
many false positives that such tools produce.
Having been at Google just a few weeks ago, they told me to my face: for us
it's much better not to have false alarms at all; for us it's most important not to
have any false positives, because people get distracted by these false positives
and think they devalue the tool. And for any recommendation an empirical
study can give you, there's always the risk of such recommendations going in
the wrong direction.
So in these situations you clearly have to evaluate: what's the potential benefit,
but also what's the risk? And if there are risks involved, one would like to see
how large the risk is and how large the benefit is -- and this, of course, can best
be done by applying the technique to a number of representative subjects.
>>: All right. Let's thank the speaker one more time. [applause]