>>: Thank you for coming out. It's my pleasure to introduce Andreas Zeller, who has been here visiting us for the past week or so. Andreas came out for two months in March, is that right? Earlier this year. And the paper he's going to be presenting is one of the pieces of work that he worked on with us, Tom Zimmermann and myself. Back in Germany, Andreas is a full professor at Saarland University. He's done work in a lot of areas of software engineering, has written a book and a number of papers, and most recently, I think it was just in June, he was inducted as an ACM Fellow. So we're happy to have him here, and we look forward to your feedback. He's actually going to be presenting this at PROMISE, the international workshop, or conference, on predictor models in software engineering. So, Andreas. >> Andreas Zeller: I can see you can all hear me; Marcus, can you hear me back there? Wonderful, great. This is "Failure Is a Four-Letter Word." As the title already says, this talk is about software failures, software failures as you all know them. This is one of the most spectacular examples on video: here is the crash of the Ariane 5 rocket, one of the many software failures we have to deal with in real life. Besides programming, probably one of the major obsessions of computer science researchers is how we can find all the bugs and how we can eradicate them. One particularly interesting line of research here has been the analysis of software histories. What you see here is actually all the classes of Firefox, laid out as a tree map, where you can also see the individual packages. In this picture we have color-coded the defect density in the individual classes of Firefox. What you see down here, for instance, are classes in the JavaScript engine of Firefox, and these red boxes are classes where a high number of security bugs showed up. There are also other places in here. Here is the layout base; this is where cross-site scripting attacks happen. Over here we have the document object model, where a number of attacks have also happened and where many such bugs have been fixed in the past. Such results come from a fairly new field of software engineering called mining software repositories, where you look at the history of a system, in particular its change and bug history, and map it to individual places in your source code in order to figure out where the most bugs have occurred and been fixed in the past. This field has taken off enormously since 2004, I think, which is when the first workshop on the topic happened. At that point there were about 20 people attending; now it's almost a major conference, with 150 or 160 people attending and plenty of people all along the way who work on this kind of research. The interesting thing is that once you see a picture like this, you wonder how it can be that there are specific places in your code with many bugs, or many bugs of a specific kind, and how it can be that there are other places which have literally been stable from the beginning. So what people typically do is associate this with plausible failure causes, failure causes such as, for instance, the quality of your developers.
You look at appropriate features of your developers and you try to correlate them with specific features of the code they've been writing. One of the things that my student [inaudible] and I found out at the time, for instance, was that in Eclipse, the more experienced you were, the more bugs you would create, very much in contrast to what the picture over here suggests. We can also try to correlate failure causes with specific management styles or with specific ways the team works together. This is actually something that was found out here at Microsoft Research: if you have many different people from different teams, and none of these individuals actually takes responsibility for the outcome, it turns out that the resulting modules have a higher failure rate than modules for which there are individuals who take responsibility. That is another interesting point that researchers have been investigating. And then, of course, there is the issue of complexity: all sorts of code features and various metrics that you can correlate with this failure data. Obviously you would assume that high complexity yields more bugs; certainly if the complexity is high, it dominates everything else. The problem at this point is that what we see is a very large number of studies, all trying to correlate failure causes with various aspects of the code and its process, which means that for a manager the situation is actually pretty confusing, because there are so many relationships between failures and features of code and process. This leads to something we might call the cost of consequence. We may choose to work on a specific thing: we may work, say, on management style, we may work on reducing code complexity, we may work on increasing the abilities of our developers. But all of these have costs, and there is the risk that all of these actions we could possibly undertake will not actually change anything. We see this primarily as a problem of too many results, too many findings, where it is hard to choose the right one among all of these. What I'm going to suggest in this talk is that we go back to the very basics, back to the very basics of programming. What does programming in its true sense actually mean? In the beginning, programming is simply typing. We're typing our code on the keyboard, and we're producing code. This is what programmers do at the very lowest level of abstraction. Actually, I'm going to take this as my guiding metaphor: we try to avoid all abstraction, avoiding abstract concepts in the code or abstract concepts of the process, and instead we look at the code as it is, right here in front of us, simply as a combination of individual characters that form the source code in the very end. And this is precisely the level at which we analyze code: by looking at its individual constituents, which means individual characters.
Actually, if you apply such a thing, if you simply look at the distribution of characters in Eclipse, you will already find that there is a certain bias, because individual characters are not equally distributed. For one thing, there are plenty of printable characters in code and very few nonprintable characters, except for space and for the newline, the line feed, which make up the gist of them. There are plenty of lowercase letters in here and very few uppercase letters. You can also see differences within the distribution: e is the most prominent character among the lowercase letters. This difference in distribution points to the concept: we may be able to look at the distribution of individual characters in source code in order to come up with empirical findings about them. So what we looked at is the publicly available Eclipse bug dataset. This is available for download by anyone, and plenty of people have actually worked with it. It comes in three releases, from 2.0 to 3.0, and it consists of between roughly 6,000 files in the earlier releases and 10,000 files in the later releases, each with an individual defect count, which means the number of defects that have been fixed in that file over time. Here are a few more figures. This is the number of characters that you find: we have 45 million characters in Eclipse 2.0 and 76 million characters in Eclipse 3.0. And this is the number of files that have defects: about one-sixth of all files in these settings have had defects fixed in the past. So we are looking at three hypotheses here. The first hypothesis is that we can actually use these programmer actions, these very basic typing actions, to predict defects in Eclipse or elsewhere. The second is that we can find out which programmer actions are the ones most correlated with failures. And the third is that we can actually prevent defects by restricting such actions. Let's start with the first one: predicting defects from programmer actions. This was a fairly straightforward approach. We simply trained a machine learner, a regression model, on the distribution of characters in each file; we used the number of each specific character in each file to train it, and we used this to make predictions on whether a specific file would be failure-prone or not.
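As a rough illustration of this kind of pipeline (not the authors' actual scripts), the sketch below counts characters per file and trains a simple classifier on those counts. The defects.csv file, its format, and the choice of logistic regression are assumptions made for the example.

```python
# Sketch: per-file character counts as features for defect prediction.
# Assumes a hypothetical defects.csv with columns "path,defects" listing
# each source file and the number of defects fixed in it over time.
import csv
from collections import Counter
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

ALPHABET = [chr(c) for c in range(32, 127)] + ["\n"]  # printable ASCII plus newline


def char_features(path: Path) -> list[int]:
    """Count how often each character of ALPHABET occurs in one source file."""
    counts = Counter(path.read_text(errors="ignore"))
    return [counts.get(ch, 0) for ch in ALPHABET]


with open("defects.csv", newline="") as f:
    rows = list(csv.DictReader(f))

X = [char_features(Path(row["path"])) for row in rows]
y = [1 if int(row["defects"]) > 0 else 0 for row in rows]  # defect-prone or not

# Hold out part of the data so precision and recall are measured on unseen files.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

print("precision:", precision_score(y_test, predictions))
print("recall:   ", recall_score(y_test, predictions))
```

Training on one release and testing on another, as in the results that follow, would simply mean building the feature matrix and labels separately for each release.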
These are our results for Eclipse. We used Eclipse 2.0 through 3.0 as training sets and checked the predictions against the very same set of releases, Eclipse 2.0 through 3.0. If, for instance, we train the regression model on 2.0 and use it to predict which files would be defect-prone in the later releases, Eclipse 2.1 and 3.0, we find that we are actually able to come back with about 50 percent precision, meaning that every second one of our predictions of defect-prone files is actually right. Eclipse 2.1 is better, and 3.0 is better still. Interestingly enough, and this is very important for our work, if we use the same set for training and prediction, that is, we learn things in Eclipse 2.0 and predict things in Eclipse 2.0, this is actually where we get the highest numbers in terms of precision. This is very important, because normally you would not want to learn from an earlier version in order to do things in the current version; you would like to learn from your current version, because this is the version you are currently working with, and if you are learning from the current version and applying it right away, that is when this whole approach will be most effective, of course. In terms of recall, recall is not as good as precision; we are at about 20 percent recall, so we are able to correctly identify 20 percent of the defect-prone files. But this is generally on par with bug detection tools: if you take an arbitrary bug detector and apply it, you will find a number of bugs, but you will never find all the bugs. There will be room for questions at the very end if you would like. So that is the first thing: by training a regression model on the distribution and number of characters in files, we can quite easily predict defects in various Eclipse versions. Let's come to the next topic: can we actually figure out which programmer actions are the most defect-prone? This is where the truly interesting stuff begins, because we looked at the individual characters in files and at what their correlation coefficient with defect-proneness would be. So here you see the individual characters and here you see their correlation values. You can clearly see that there are four values that stand out: the lowercase letter i, the lowercase letter r, o, and p. These are the four characters in our dataset which had the strongest correlation with failures. We found this striking, not only because of the correlations themselves, but also because these four letters make a nice acronym. We call this the IROP principle. >>: [inaudible]. >> Andreas Zeller: I'll come to that in a second. You have the IROP principle, which incidentally is also airline slang for irregular operations, so it makes a nice mnemonic: remember IROP and you will know the four important characters that correlate with failure. This was striking for us, because now we knew which programmer actions led to failures, and so we had a clear sign of what the basic actions behind them were.
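A minimal sketch of the per-character correlation ranking just described is below. It assumes the character-count matrix and per-file defect counts have been saved to a hypothetical eclipse_chars.npz, and it uses a Spearman rank correlation, which may differ from the exact statistic used in the study.

```python
# Sketch: rank every character by how strongly its per-file count
# correlates with the file's defect count.
import numpy as np
from scipy.stats import spearmanr

data = np.load("eclipse_chars.npz")  # hypothetical file with the per-file counts
counts = data["counts"]              # shape (n_files, n_characters)
defects = data["defects"]            # shape (n_files,)
alphabet = [chr(c) for c in range(32, 127)] + ["\n"]

ranked = []
for j, ch in enumerate(alphabet):
    rho, _pvalue = spearmanr(counts[:, j], defects)
    if not np.isnan(rho):            # skip characters whose counts never vary
        ranked.append((rho, ch))
ranked.sort(reverse=True)

# Characters with the strongest correlation to defect counts come first.
for rho, ch in ranked[:10]:
    print(f"{ch!r}: {rho:.3f}")
```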
The third thing is: can we prevent defects by restricting programmer actions? What we did here is we first mapped the individual characters onto keyboards. We came up with this concept of color-coding the correlations of the individual characters on a keyboard. We found that the numbers, for instance, were okay, but here you see the four letters of failure, and Enter, which is one of the nonprinting characters that correlated with failures. We thought that by color-coding this, we would be able to provide immediate feedback for programmers, so that they would see exactly the consequence of typing that key, of introducing that particular letter into the code. So we thought, well, that's one thing, but it didn't go over too well. So we went for a more drastic action. We set up a special keyboard, which is shown here, and this keyboard actually eliminates the possibility of typing these letters; well, you can still press Alt and something, but that is really difficult. So we simply eliminated these letters up here, making the cost of these letters immediately clear. You will find that there still is an Enter key. That's right; formally speaking, we should have eliminated the Enter key as well. But the thing is, if you eliminate the Enter key, you get many, many more defects per line than before. So we kept the Enter key on the keyboard, but we eliminated the other letters. This is what we thought would eventually lead to better code. However, it turned out we had to introduce a number of new coding standards, because of course your regular code has a number of these IROP characters in it, and not only that, the keywords have IROP letters in them as well. We had to rename them appropriately, which actually is fairly straightforward: if becomes when, int becomes num, and return becomes handback, which is essentially the same, and so we avoid these four letters of failure. And the interesting thing is that this is 100 percent semantics-preserving, so there is no chance of this automatic translation introducing any new bugs. Everything stays as before, except that now we can avoid the correlations between these four letters and failures.
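As a rough illustration of this kind of keyword renaming (a crude sketch, not the authors' tool: a word-boundary replacement rather than a real parser, so not actually semantics-preserving in general), the mapping below mixes the substitutions mentioned in the talk with made-up ones.

```python
# Sketch: replace Java keywords containing i, r, o or p with IROP-free
# substitutes. A faithful rename would need a proper Java parser; a
# word-boundary regex is only an illustration.
import re

RENAMES = {
    "if": "when",          # substitution mentioned in the talk
    "int": "num",          # substitution mentioned in the talk
    "return": "handback",  # substitution mentioned in the talk
    "for": "each",         # hypothetical substitution
    "import": "fetch",     # hypothetical substitution
}

KEYWORD = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")


def shun_irop(java_source: str) -> str:
    """Rewrite source text so the listed keywords avoid the four letters of failure."""
    return KEYWORD.sub(lambda match: RENAMES[match.group(1)], java_source)


print(shun_irop("int abs(int x) { if (x < 0) return -x; return x; }"))
# num abs(num x) { when (x < 0) handback -x; handback x; }
```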
These keyboards actually caught on very nicely with our test persons. What we found is that they also used them in real life for their daily communication, and this is what one of them wrote to us on such a keyboard: "We can shun these [inaudible] and the text stays just as well; let us just ban them." You will see there is no IROP in this text anymore; this has become a true habit for the developers. So that is our third hypothesis, that we can prevent defects by restricting programmer actions, and this is true just as well. Now, I understand you have a number of questions. Since the paper has been published on the Internet beforehand, I have already received a number of questions, and I'd like to address them all. The first one, of course: is this externally valid? The question here always is: you only analyzed a small set of actual instances, so is this actually true for all instances? And I can easily shrug this off, because we have been looking at no less than 177 million individual characters. This is something which almost makes our statistical package crash because it is so huge. This is one of the largest studies ever in terms of numbers: thousands of files, thousands of defects, 170 million characters. Really, this is huge; it encompasses a very large body of code. The next question is about internal validity: are the correlations that we found statistically significant? We put them through the test, and all of them are statistically significant. And the last question, of course, is about construct validity. Construct validity means: are the constructs that we build actually appropriate? The good news about this work is that there is no construct. There is no abstraction, such as making a model of the process or making a model of the program or anything; we just look at the basic characters that are in there. And there is no construct, no abstraction on our part, that could possibly have interfered with the results. So after these threats, let us focus a bit on future work. One instance of that is automatic renaming: we are setting up automatic renaming programs that take source code and other text and simply rename it appropriately. Where we find an appropriate semantic replacement, such as for promise, which we can rename to engagement, we use it; for words for which there is no good substitution, we simply eliminate the appropriate letters. We are also thinking about what could actually be the reason behind all this. And what we found, and this is striking, is that there are a number of words with negative connotations, such as failure, mistake, error, problem, or bug report, that all contain these four letters of failure. In contrast, positive words like success or fame, which we all want, of course, do not. And it may well be that while typing these letters there is a sort of resonance that goes on, a resonance toward the risk of making a mistake. That is what we think, and so by focusing on words without the four letters of failure, we think we can get a more positive attitude, and this more positive attitude will then lead to more positive results. And, of course, we are generalizing. The interesting question for us is: is this true only for the Java code we looked at, or does it also work for C++ code, or is this actually something that would work only on English-language code? So right now we are looking at a big base of Russian source code, Russian-language source code. And what we have found already is that for this Russian code, which is very different from the source code we have seen so far, there are just as strong correlations between the abundance of individual letters and the number of defects in a file. So I think we are looking at something that really works across a large number of domains. So this was Failure Is a Four-Letter Word. And with that, let me step out of this again and give you a few explanations; obviously you have figured it out by now: there is very little in all of this that is serious. Let me point out why we did this whole thing, why we wrote this paper in this very way. It's a fun read; that is at least what I hear from many people. So let me give you a few explanations of why we did this and what we did in here. The first thing is that what we came up with in this research is, well, plenty of correlations. That's right. And the truth is that in any big dataset you can always find lots of correlations, if the dataset is simply big enough.
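That point is easy to reproduce on synthetic data. The sketch below (pure noise, nothing from the Eclipse dataset) correlates a hundred random "features" with random "defect counts" and counts how many look significant at the usual 5 percent level.

```python
# Sketch: spurious correlations appear in pure noise once enough features
# are tested. At p < 0.05, roughly 5 "significant" hits per 100 tests are expected.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_files, n_features = 10_000, 100
features = rng.normal(size=(n_files, n_features))  # random, meaningless features
defects = rng.poisson(1.0, size=n_files)           # random, unrelated "defect counts"

significant = [
    j for j in range(n_features)
    if pearsonr(features[:, j], defects)[1] < 0.05
]
print(f"{len(significant)} of {n_features} noise features correlate 'significantly'")
```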
There are a couple of famous correlations. In Germany, for instance, where I come from, you can find a very clear correlation between the decline in the number of storks and the decline of the birth rate; they track each other almost exactly: the fewer storks, the lower the birth rate. Sure, there is always a correlation. The thing is, for storks and babies there are at least a couple of theories that explain the whole thing, no matter how bogus they may be. Over here, however, there is no such theory at all; we are simply making up that there could be something. The second observation is that machine learning works. You can train a machine learner on a number of features and it will make accurate predictions based on these very features; it does not matter much which features you feed in there. That is exactly one of the reasons this particular prediction works so well. And the fact that we used the same set as the training set and as the evaluation set is, of course, an absolute no-no; I hope some of you noticed this along the way. This is the reason why this particular line in here has particularly high values. Cherry-picking is another problem we sometimes see in published scientific papers, where people come along and say: we have invented this wonderful new technique, and to evaluate it we selected five programs. There is no discussion of how these five programs were chosen or anything; they are just out there. Okay? So in here, and I think one of you pointed this out, this is the letter n, which has an even higher value. That's right. But we decided to go for the other four characters because they made this nice acronym; if we had gone for n, it would have been P-O-R-N, and this would have become the PORN principle, which probably would have lowered its acceptance rate. Although it could have been fun coming up with a PORN principle in empirical software engineering. The other, minor thing is the scale. If you look at the scale, you see that any differences in here are hugely exaggerated. If you look at the true correlations on a regular scale, say from 0 to 1, you will find it is just a flat line, and the flat line goes over all characters. Okay? Because obviously the more characters you have in a file, the higher the chance of having a bug in it. It does not take a lot of rocket science to come up with a conclusion like this, and yet papers get published on this very fact: the larger the file, the more bugs, lo and behold. Which is maybe mightily interesting the first time, but it goes on; there are actually people who have published a whole series of papers on that. Quite remarkable. Our next point, of course: fix causes, not symptoms. You need to find out what the actual cause behind something is; just going after the characters does not work. A phrase like "100 percent semantics-preserving" means that whatever you do has no impact at all on the behavior, so it cannot change the outcome either. And finally, actionable findings. You need findings that will not only be of interest to the reader, but that will actually impact the way people work, and you need to show some credible evidence that the findings you have will have that impact.
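The training-set-equals-test-set trick is also easy to demonstrate on synthetic data. In this sketch (again pure noise, not the Eclipse data), a flexible model looks nearly perfect on the data it was fitted to and no better than chance on held-out data.

```python
# Sketch: why evaluating on the training set is a no-no. The features carry
# no signal at all, yet the model "predicts" its own training data almost perfectly.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))    # 100 purely random features
y = rng.integers(0, 2, size=500)   # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("accuracy on the training set:", accuracy_score(y_train, model.predict(X_train)))
print("accuracy on held-out data:   ", accuracy_score(y_test, model.predict(X_test)))
```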
I would also like to point out that this research was inspired by a comic called xkcd. Very nice comic, by the way; I'm happy to endorse it here. This is the jelly bean comic, "Significant": jelly beans cause acne, scientists investigate. Whoops, no correlation between jelly beans of any kind and acne. So if there is no correlation for jelly beans overall, maybe there is a certain color that causes it, and you go and investigate that. You do this: first purple jelly beans, then brown, then blue jelly beans, and you never find a correlation. But it turns out that if you do this 20 times with a confidence level of 95 percent, then you will actually find at least one color for which there is such a correlation, and you say: we found a link between green jelly beans and acne. Whoa, this is a very interesting result. Of all the results of these experiments, of course, you focus just on the one that stands out, despite the high error rate, and you come out with the result: green jelly beans lead to acne, 95 percent confidence. Which is technically true. But it is also a problem, because all the negative results we produce, which never ever get published, eventually cause a bias in public perception. And this is exactly what comes out at the very end. So for us, the lesson is that we should go and educate reviewers, educate professionals, educate future generations of computer scientists to know more about empirical evidence, and also to know about the many ways one can fake empirical evidence; not so much to use these techniques, but to discover them. And, of course, there is also this nice paper which has all of these results, a parody in empirical research, which on the first four pages comes across as a full-blown description of what we claim to have done, followed by two pages of revelation. Feel free, if you teach a class on empirical research, to pass on just the first four pages to a student, leaving out the revelation, and have them figure out all the many ways this research may be wrong. This is where we have the paper, the Word source of the original paper, as well as our scripts and the data, available; the slides are available on this site as well. And with this I'd like to conclude my talk on Failure Is a Four-Letter Word, and I'm happy to take questions. Thank you very much. [applause]. >>: Can I ask a question in character? >> Andreas Zeller: Yes, please. >>: So given your results, and you showed the keyboard slide where the numbers had low correlation, right, what if we just used the number pad and typed the code in octal, since the number pad has Enter and the digits? If we just type in the code as numbers, won't we get a lower defect correlation? >> Andreas Zeller: That may well be. Actually, typing characters in octal on the number pad was perceived by our interns as a workaround to enter these characters nonetheless, but this way it became explicit to them that they were entering dangerous ground, and it made them more aware of what they were doing. So what we actually found was that code with these letters, entered after we had distributed these keyboards, tended to be of better quality, because people were more aware and tended to focus specifically on the risks involved. >>: Thank you. >>: So what would be your estimate of what fraction of all research in this area, in empirical software engineering, would actually meet your standards for being statistically reasonable? >> Andreas Zeller: There is a saying that 80 percent of all research, not only empirical software research but 80 percent of all research anywhere, is empirically flawed. That is a number that goes around. If... >>: [inaudible].
>> Andreas Zeller: There are estimates to that effect. A simple example: if you look at headlines, for instance in a newspaper, claiming that A is related to B, whatever it may be, and you then look at the actual paper, you will often find that these are studies that have been bought by specific companies, where the people who responded were carefully selected, and where it does not take rocket science to figure out that the study was flawed from the beginning. >>: But going on, in medical research and so on, people do retrospective studies. >> Andreas Zeller: For empirical software research, I cannot come up with a true estimate. What I do see are the papers that I am supposed to review as a reviewer, and about one-third of the papers I get to review have very serious empirical problems. However, these are not the papers that eventually get accepted. That is the good news. >>: [inaudible]. >>: The point about cherry-picking a few examples and saying "it works on these examples" strikes home in a bad way with me, because in some of my research into program verification, I might come up with a technique that addresses a well-identifiable hole in some particular verification method. I don't ever claim my technique is a one-size-fits-all verification method, so I find some programs where I can demonstrate some benefit and present results for those programs. I'm not making any statistical claim that this technique is good or bad, but the trouble is I always have this nagging feeling that if you took my technique and ran it on an arbitrary program, it almost certainly wouldn't give you any result; it would just say it can't verify the program. If you combined it with the right other techniques, it probably would work, but only if you magically knew which techniques to use. So it seems like something I should feel a bit guilty about as a researcher, but I don't see a way around it, because I don't think my techniques are rubbish; I think they're useful. Can you give me some advice there? >> Andreas Zeller: The thing is simply whether your contribution, by construction, will always do things better if it works. You can come up with techniques that by construction are set up in such a way that either they improve things or they keep things as they are; then there is no risk whatsoever involved in them. >>: You'd say run these two things in parallel. >> Andreas Zeller: But there are also techniques which you think may make things better but which also run the risk of making things worse. Bug finding is a typical example of that, because with most static bug finding tools you have a number of false positives, and you need to offset the number of true positives, that is, the value of the bugs you find, against the cost of the many false positives that such tools produce. Having been at Google just a few weeks ago, they told me to my face that for them it is much better not to have false alarms at all; for them it is most important not to have any false positives, because developers get distracted by these false positives and feel that this devalues the tool. And for any recommendation an empirical study can give you, there is always the risk of such recommendations going in the wrong direction.
So in these situations you clearly have to evaluate what the potential benefit is, but also what the risk is. If there are risks involved, one would like to see how large the risk is and how large the benefit is, and this, of course, can best be done by applying the technique to a number of representative subjects. >>: All right. Let's thank the speaker one more time. [applause]