>> Nachi: Hello all. Thank you for coming. We are very happy to have with us Chris Bird. Chris is actually no stranger to MSR. He's interned twice with us in SRR. He's going to talk to us today about his research interests in sociotechnical congruence, primarily based on teams and on software systems.

>> Christian Bird: Thanks, Nachi. So yeah. So I'm a student at UC Davis, finishing up in the summer. And in this talk I wanted to tell you what my background is, why I do what I do, what I have been doing recently, and also what I hope to do in the future.

So I'll start off and tell you a little bit about myself. I got here by kind of the traditional means. I did my undergraduate in computer science, and towards the end of my undergraduate career, and also while my wife was finishing up her degree, I worked for a large tech company writing software. And I worked with some really bright people, some very capable managers, but there were some things about how the projects I worked on were run that were a little disconcerting. And when I talked to some coworkers and managers, we kind of had this running theory that when managers make decisions, a lot of times they're making those decisions based on their intuition and on anecdotal evidence. But as we know, the plural of anecdote is not evidence. And so I was kind of disheartened by some of what was going on. And so when I went back to grad school, I really wanted to do something about this. What I was thinking was: there's got to be a better way to run software projects, a more principled way. Is there some way that we can work towards making decisions based on actual, real evidence?

So I'm going to make an assertion that I don't think anybody here would argue with: developing software is expensive and time-consuming. If you talk to people -- this probably isn't as true at Microsoft, but you would know better than me -- a lot of projects run over budget, and when you talk to developers, most of them ship late. Just this week we have another victim of something shipping, I think, a week late. The rumors on the net are that the iPad is actually shipping late due to a software issue and not a hardware issue. So this is very relevant today. And if you look at the literature, the number that's classically given is that 80 percent of the time and money is spent after release. So all of this is just to get there, and then you have to deal with everything that comes after release. And I should really point out, I don't think it's because we don't have smart people working on the problem. You've got bright people here at Microsoft and also at other places, so I don't think it's that we have dumb people; I think it's just a hard problem.

So that's why we have software engineering research, right? We want to make this better. And at a high level, there are two things that software engineering research tends to try to do. Not all of it does this, but what we want to do is in some way increase productivity. That may be in terms of processes, it may be in terms of tools to help people, it may be in terms of language abstractions that allow people to be more productive. But we don't want to sacrifice quality. It's painful if you're shipping bugs. That can have financial and reputation impact as well. Okay. So this is what we want to get to. So clearly the next thing to talk about is the cholera outbreak of 1854. This is related.
But to give you some background: 1854, the cholera outbreak that some said was the worst outbreak in the kingdom. And there were some ideas at the time about how cholera was spread. One idea is miasma, which is this idea that you've got this poisonous gas going around, and if you're in its way, you're gone. Sometimes, well, you know, we deserve it: God comes and smites you and you get cholera. And then some people said: this is life, it just happens, you deal with it, move on. So the government at the time said: nothing we can do, we just have to deal with it.

But fortunately there were some people at the time that had other ideas. This is John Snow. He was a surgeon at the time, and he had some ideas about how cholera was spread and its effects, and he saw this opportunity to test some of his ideas. And he was under some constraints. If you have an idea about the way a disease is spread, typically in medicine, if we have questions, we can run something like a double-blind clinical trial. But you don't want to give some people cholera and, you know, not give other people the cholera and see who dies and say, oh, now I know. You just have to watch what happens and see. So he had these constraints: he has to observe the phenomenon after the fact.

So he had some hypotheses: I think this is how cholera is spread. And then what he did was he went around and he interviewed families. He talked to people who had family members that had passed away and those who were relatively healthy. And he also collected geographic data. There's a very famous map, called the ghost map, that he made of the area where the outbreak occurred, and the little black dots represent families, residences where people had passed away from cholera. And generally he observed the community. There are some interesting things that he found, if you look online at this research -- he found some key events that led him to the conclusion that cholera was passed by water, and he identified the Broad Street pump, this well in southern England. And so he went to the authorities and said, look, I think the water here, I think this is the source of your outbreak. And he convinced them to take the handle off the pump so no one could get water, and immediately the outbreak stopped. And this is really considered one of the watershed events in epidemiology; it spawned the field. And some of the methods that he used are still in use today.

Okay. So why does this matter for us? I work under some of the same constraints that Dr. Snow worked under. He couldn't conduct these trials; he had to observe a phenomenon and gather whatever data he could to test some hypotheses about how things worked. Hopefully we're not dealing with people living and dying -- we do care about people living and dying, but my research isn't necessarily about that. What we care about are software projects that fail and those that succeed. Windows 7 has actually gotten a pretty good reception in the marketplace, so this may be a success. And this is the Ariane 5 rocket that actually blew up as a result of a software failure. So we want less of these exploding rockets and more of these Windows 7s, right?

So what we do is we use what John Snow used; we call it the empirical method. There are three main steps to the method. The first is that we gather data, typically related to our outcomes -- measures of software quality and productivity -- and then also whatever factors we think are related to those. And then we examine relationships.
So both quantitatively and qualitatively we try to understand what's going on. And then, based on that, we can make changes to processes, or maybe we build tools to help developers and managers, so we can have an impact and make things better.

So does anybody care about this? There are a lot of problems we can solve -- more than we have time for. We want to look for solutions to problems that matter. And really, if you ask the right questions, then people do care. Some of the questions I've asked have been based on my prior experience working in software and also the experience that other people have had and the gripes that they've had about what they've encountered. The goal being that if we can take this empirical data, we can improve processes -- maybe it turns out that working a 9-to-5 work week is better than putting in hundred-hour blitzes at the end of a dev cycle. We can target resources if we know that certain parts of the system are more prone to failure, and hopefully we can improve the quality and the productivity of our developers. And does it matter? Yeah, there's a lot of money riding on this. In 2008, software was a $300 billion market.

So there are a lot of ways to look at this, and if you look at the literature, it's replete with ways to address this problem. My perspective is to look at the people. This is Bjarne Stroustrup, kind of the father of the C++ programming language. He says design and programming are human activities; if you forget that, all is lost. And if you remember, John Snow was interested in disease, but what did he do? He looked at the people. And so that's what I do.

So I'm going to present some of my results today, but to give you an idea of what my whole graduate career spans, the things I've looked at: I'm looking at open source software and how it works. Other people are interested in how it works -- is it really this bazaar with everybody doing everything and magic just comes out, or is there some more organization? I look at defect prediction, so both using attributes of the people working on software to predict defects, but also looking at the effect of the quality of the data that is used to make defect predictions. We found that that has a big effect. And then also process, and the effects of the process used on software quality. And it's not just software. I've also looked at some other things. I've looked at collaboration in computer science research and found that in different areas of research, the collaboration patterns are a little bit different. And I'm an empiricist. I love sports. So we're actually submitting a paper about NCAA football, and I'm happy to talk about this ad nauseam if anybody is interested. Love college football.

So today I'm going to talk about three things. I've looked at distributed development, ownership and expertise, and also how we think open source software works. So the first one: does distributed development hurt code quality? So unless you've been under a rock, you realize that in the tech sector, at least, offshoring has been a really hot topic. And if you followed the 2004 elections, this actually came up in some of the political platforms -- offshoring from the U.S. It's a big issue. And there are some people that have some ideas about it. So this is Tom Allen, and he's at MIT. He studies innovation in the workplace. And he's developed what he calls the 50-meter phenomenon.
And the idea is that when you have people working in a very creative and innovative environment, even when people are as little as 50 meters apart, you see a dramatic decrease in the frequency and the richness of their communication. So that's 50 meters. The question is: well, what happens when you're talking about 5,000 miles? Do we see an effect? Does it hurt software quality? And when you talk to developers about the issues that they face when dealing with people that are operating remotely, there are a lot of issues that they'll raise, but all of them lead back to the claim that, look, quality will suffer if you distribute development, especially around the world. And in this study, what we showed is that it can be done with little effect on quality -- not that it always will be, but there are ways to do it.

So Windows Vista is really a great candidate to study to ask this question. I don't have to give people here much of a background. You have thousands of people that worked on it. It's a very large project, and it was definitely distributed around the entire world. There are thousands of individual pieces, and we can compare the pieces within one project rather than across different projects, so we're really trying to compare apples to apples and avoid some of the confounding factors of looking at two separate projects.

So we have this question. John Snow had some ways of gathering data to test his hypotheses. Well, what kind of data do we gather? It's kind of four pieces. Initially we start with just the source code, so we know who contributed code to every binary in Vista -- a binary is an executable, a shared library, or a driver. We know who wrote every line of code. Next, this dialog -- it does have a bad connotation, but people that work here know that the information related to crashes helps the management at Microsoft make decisions about what to fix and who's being affected, which crash is the most important. So this is kind of our outcome measure of software quality. Next we have the org chart. We know who was working where when they were contributing code to Vista. And then last, we actually have the geographic data, so this is a map of precisely where I am, actually. And the ovals indicate buildings that are served by the same cafeteria. So fortunately I don't have to describe as much to you as I have had to for other audiences.

So with these four pieces of data, we can really answer this question. What we did was we binned binaries based on the level of distribution. At the very lowest level, you have the building level. This is where most of the developers work in the same building. If you work in the same building, it's very easy to walk next door to ask someone a question about maybe an interface. You may run into people just informally. You've worked together. You've probably been at the white board together, in meetings, talking about your design. Then the cafeteria level, where you have developers in buildings that are served by the same cafeteria, so it's still not too hard to walk next door. You can probably arrange a meeting fairly quickly and you can have meetings over lunch. At the campus level, here in Redmond, it can take a while to get between buildings. You may be less familiar with someone that works in another building. You probably want to schedule a meeting a day in advance. And then the locality level: it takes a while to get between sites. So, like, you have the Seattle locality.
You may not have ever even met face to face someone that's working on something you're working on if you're in the same locality but not on the same campus. And then things start to get bad pretty quickly. At the continent level, you start to deal with time zone issues. Meetings are very difficult; they almost always have to be conducted electronically. And then at the world level, you've got to fly. It's expensive. You start to deal with cultural issues. There may be sites that don't have any overlap in terms of working hours. And the idea behind this -- it kind of harkens back to Tom Allen's work -- is that, look, as the distance increases, it becomes harder to coordinate, to be aware of what everybody is doing, and to be managed.

And so we started off looking for a 50-meter rule of software. So what we did was we took these levels and we created five different splits and said, look, first we'll say that collocated is all binaries developed by developers in the same building and everything else is distributed, all the way down to distributed being only binaries that were worked on by developers across the world and everything else is collocated. So is there any split here where we see a dramatic difference in software quality?

So I'll show you the differences for the very first split. Two-thirds of the binaries fall over here in collocated and then one-third is in distributed. And these distributions look a little bit different because there's a lot more on the left than on the right, but the actual distributions -- there's not a strong difference between the two. I'll just tell you right now, I had to take off the numbers because people outside of this organization can't see them, but the peak on this side is the same thing as the peak over there. And so although you have different mass, the distribution is fairly similar. And so you conclude, just from visual inspection, you don't see a huge difference. Yeah?

>>: Binaries aren't equal, right? So some are huge, some are tiny. Isn't it [inaudible] just binaries [inaudible] and then you can -- so you would infer from that that collocated can be much worse than distributed, but yet, as you know, at Microsoft [inaudible] is entirely done inside here in Redmond and less important perhaps or whatever is distributed, and now --

>> Christian Bird: So --

>>: So you're going to discuss all that?

>> Christian Bird: Yeah. Yeah, I will get exactly to that point.

>>: That is shocking to me [inaudible].

>> Christian Bird: It's shocking to a lot of people. And I should mention, this is actually -- this is not worse than that. So the X axis is the number of binaries and failures -- or, excuse me, the Y axis is binaries and the X axis is the failures. So although there's more in this peak than that peak, the proportion, the actual density distribution, is about the same between them. But you're right, binaries are different, and that's something that we looked at, so I'll get to that in a little bit.

So we decided to take a more principled approach than just looking at pretty pictures, and so we used linear regression. And the data that we used looked like this. For every binary, we included the level of distribution. There's no reason to believe that there's a linear relationship from buildings to cafeterias to campus, so what we did is we encoded each one as a binary variable in our model, and so you'll notice that we don't have any category for binaries developed in the same building.
That's kind of our baseline. So we're comparing the quality of distributed binaries to those that are developed in the same building. And then our output is software quality, which we measure as the number of failures in the first six months after the release of Vista.

Okay. So this is what our model came out to. And don't be afraid; I'll actually interpret what this means. The two pieces of information that we're interested in are, first, the percent increase. This is the percent increase in failures relative to binaries developed by engineers in the same building. So as an example, if you look at binaries developed in different cafeterias but on the same campus, you see a 16 percent increase. And over on the right is the significance. That's just the likelihood that what we're seeing is due to noise in the model, so lower values are better, and .05 is about the cutoff for saying that something is statistically significant. In almost all cases, we see that it's significant. And 16 percent is nothing to sneeze at. It's not as high as what we were expecting to find, but, you know, that is an increase. So we concluded, look, there actually is a little bit of an increase in failures when you distribute development.

But, interestingly, Jim Herbsleb did a study in 2003 that's kind of similar to ours. He looked at productivity, not quality. He also found that when development was distributed, the outcome variable went down -- [inaudible] people were less productive. But then what they did is they controlled for the number of developers in the teams. So essentially, if you have ten people in the same building or ten people scattered worldwide, you're going to see the same increase relative to five people in one building. And so we ask the same question: what happens when you control for team size? So again, we're using similar data, but now, in addition to the level of distribution and the failures, we add the number of people that worked on a binary. And the story changes a little bit.

So this is the result of the new model. The two things to pay attention to are, first, the percent increase: it's dropped dramatically. The highest is in the different-localities case, but we can't even tell if that's due to real data or just noise in the model. The only one that is statistically significant is different campuses, where you see a six percent increase. So it's dropped, which means a lot of what we were seeing was just the factor of larger teams, and larger teams tend to be more distributed. So what we conclude is, look, there is a small increase, but it's mostly attributable to the size of the team rather than the level of distribution.

Okay. So now back to Patrice's question, because it's a really good question. When I presented this result to people, both inside and outside of Microsoft, they said, well, look, maybe management knows we should distribute the simpler things, those that have a lower risk if they were to fail. So they said: we think simpler binaries are distributed, and because distributed development is hard, everything balances out and they look like they're the same. So the first question is: how do you define simpler? There are a lot of ways to define simpler. Fortunately, Microsoft gathers all kinds of metrics from source code, so they've got measures of complexity and measures of churn -- churn size is like the number of lines changed, edits is the number of commits -- in-degree and out-degree on the dependency graph, path coverage in testing. All kinds of things.
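Stepping back for a moment, here is a minimal sketch of the regression setup described above: the distribution level dummy-coded with same-building binaries as the baseline, failures log-transformed, and team size added as a control in a second model. The data frame, column names, and the use of ordinary least squares here are illustrative assumptions of mine, not the study's actual data or model specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per binary, with its level of distribution,
# the number of engineers who touched it, and its post-release failures.
rng = np.random.default_rng(0)
levels = ["building", "cafeteria", "campus", "locality", "continent", "world"]
n = 300
df = pd.DataFrame({
    "level": rng.choice(levels, size=n),
    "num_devs": rng.integers(2, 40, size=n),
})
df["failures"] = rng.poisson(2 + 0.1 * df["num_devs"])

# Dummy-code the distribution level, dropping "building" so that binaries
# developed within a single building are the baseline category.
dummies = pd.get_dummies(df["level"], prefix="lvl").drop(columns=["lvl_building"]).astype(int)
data = pd.concat([df, dummies], axis=1)
rhs = " + ".join(dummies.columns)

# Model 1: log failures vs. level of distribution only.
m1 = smf.ols(f"np.log1p(failures) ~ {rhs}", data=data).fit()
# Model 2: the same, but controlling for team size (the Herbsleb-style control).
m2 = smf.ols(f"np.log1p(failures) ~ {rhs} + num_devs", data=data).fit()

# With a logged outcome, exp(coef) - 1 is roughly the percent increase in
# failures relative to same-building binaries; the p-value indicates whether
# the increase is distinguishable from noise in the model.
for name, model in [("distribution only", m1), ("with team size", m2)]:
    print(name)
    for term in dummies.columns:
        pct = 100 * (np.exp(model.params[term]) - 1)
        print(f"  {term:15s} {pct:+6.1f}%   p = {model.pvalues[term]:.3f}")
```

The only point of the sketch is the encoding and the percent-increase reading of the coefficients; on real data, comparing the two models is what shows how much of the apparent distribution effect is really a team-size effect.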
And so we looked at the correlation between these metrics and the level of distribution for the binaries to see -- you know, do we see that those that are simpler are more distributed? This is actually the list of those with the highest correlations. Correlation ranges from negative 1 to 1, with extreme values showing a strong relationship, and we don't see very high values. The highest is the number of developers. But this isn't a surprise; we just found that controlling for the number of developers accounted for more of the variance in failures. So from this correlation analysis, it doesn't look like what's being distributed is actually simpler. You don't see a huge difference.

But I've actually found in my research that sometimes just digging into the data by hand can show you things that you wouldn't find just by doing a quantitative analysis. So we didn't look at all 4,000 binaries, but we did look at top-20 lists of those that were distributed and those that were the largest and the smallest. We looked to see if maybe there were subsystems that were more distributed than others, and we didn't find anything. We even went so far as to build a logistic regression model: if we include all of these metrics, can we predict the level of distribution? And the precision and recall were really bad. And so what we concluded from this is that, at least relative to the metrics that we used to measure simpler, there really wasn't much of a difference at all between the binaries that were distributed and those that weren't. So the next two questions: how in the world did they do this, and why are we getting these results?

>>: [inaudible] communication to get to the same results?

>> Christian Bird: So yeah, that may be. To the degree that it's -- so it depends on how you define effort, right? We did measure effort in terms of the number of changes made to code. There are clearly other ways to measure it, like how often are we having meetings and that kind of thing; we didn't measure that, but I would expect that it does take more effort, at least in coordination, definitely. And actually some of the qualitative factors, when we talked to people, kind of bear that out.

So here are some factors that we got from managers. I should emphasize we haven't shown a causal relationship, but the intuition and people's experiences support the idea that this may be some of the reason why we got these results. So first, Microsoft uses liaisons and face-to-face meetings. There's literature that shows that if people have face-to-face communication, then later, when they work remotely, there's more trust there and you're able to work together more easily. And if you have liaisons, then people know who the point of contact is on the different teams. Next, a lot of the senior engineers at the distributed sites started here in Redmond. So if there was a question in Beijing about what tool to use, or why was this design decision made, or who should I contact because I don't understand something, these senior people had a lot of that information. If they didn't have the answers, they knew who to go to, so they were kind of taking this information in their heads with them to the remote sites. During the Vista cycle, they had daily synchronous communication.
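Here is a rough sketch, with made-up metric names and synthetic data, of the two checks just described: ranking metrics by their correlation with the level of distribution, and then trying to predict from those metrics whether a binary is distributed at all. Low correlations and poor precision/recall are what would support the conclusion that the distributed binaries are not systematically simpler; the actual study used Microsoft's internal complexity, churn, dependency, and coverage measures.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

# Hypothetical per-binary metrics.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "complexity":  rng.gamma(2.0, 50, n),
    "churn_lines": rng.gamma(1.5, 200, n),
    "num_edits":   rng.poisson(30, n),
    "in_degree":   rng.poisson(5, n),
    "out_degree":  rng.poisson(5, n),
    "num_devs":    rng.poisson(10, n),
    "distribution_level": rng.integers(0, 6, n),  # 0 = same building ... 5 = worldwide
})
metrics = [c for c in df.columns if c != "distribution_level"]

# Check 1: correlation between each metric and how distributed the binary is.
# Values near +/-1 would mean the "simpler" binaries are the ones being
# distributed; values near 0 mean there is no such pattern.
for m in metrics:
    rho, _ = spearmanr(df[m], df["distribution_level"])
    print(f"{m:12s}  rho = {rho:+.2f}")

# Check 2: can the metrics predict whether a binary is distributed at all?
y = (df["distribution_level"] > 0).astype(int)
pred = cross_val_predict(LogisticRegression(max_iter=1000), df[metrics], y, cv=5)
print("precision:", round(precision_score(y, pred), 2))
print("recall:   ", round(recall_score(y, pred), 2))
```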
So that means that for sites where there's no overlap in working hours, they had people coming in early or staying late so that you could have meetings and talk about issues before they escalated out of control and went unaddressed for weeks or even months. And lastly, Microsoft tries to use the same process and the same tools -- at least within Vista; I don't know about the other projects. But within Vista, they tried to use the same process uniformly. In some other studies, they found that if remote sites are working on their own modules, everything is fine and good until it comes time to integrate, and then you have problems, because some people run their tests with their builds and other people pull from a different repository, and you really run into problems then. But Vista didn't have this because they used a uniform process.

So what do we conclude? Well, I'm not claiming that distributed development is easy. It's clearly difficult, and anybody who has done it will tell you that. But after looking at the community, gathering some data, and testing some hypotheses, what we've found is that it's possible to do it with little effect on post-release failures. And this is a good thing. This is good news. Microsoft, at least in terms of the Vista development, has a handle on some of the problems that we face and how to overcome them.

So that's collaboration in terms of distance. Now I want to talk a little bit about collaboration in terms of expertise as well. So I'm interested in expertise and how much experience people have with certain parts of the code base. And in the software engineering literature, you'll find that a lot of times people use ownership as a proxy for expertise. If you have made a lot of changes to a particular portion of the code, then you have the experience to probably understand it better, especially if you're the one that wrote a large part of it. And if you haven't worked with it, then you have low expertise, less knowledge. And so the question that I ask is: what happens when you have a lot of people working on something and those people have low expertise? Is that bad? Our intuition would tell us, yeah, that's probably not a good thing. And next, is ownership related to defects? If you have a binary that's clearly owned by someone, is that better than having a piece where there's kind of shared ownership amongst a large team? And then, does the process matter? Is the development style that you use related to the relationship between ownership and quality? And I'll define ownership more formally a little bit later.

So I'm going to make a bit of a generalization. There is a development process spectrum, and I'm generalizing here because it's not just this one-dimensional spectrum. But for the sake of some first steps: Vista is clearly on the commercial side. Eclipse is a project that kind of lives in this hybrid land. If you look at the development activity, it's mostly owned and controlled by IBM, but it does espouse some open source principles. It's under an open source license, and it accepts contributions from the community at large. So call that kind of a hybrid. And then Firefox -- though lately it's been moving to more of a corporate-controlled entity, for this study we looked at older versions of Firefox [inaudible] more of an open-source style. I should point out there is no one open source method of development, and no one commercial method either. So I'm kind of generalizing a bit here.
And so the question that I ask is: well, how do these differ? So I need to define some ownership terms here. On a per-component basis, I say a major contributor is someone who has made at least five percent of the total commits. The idea here is that these major contributors probably have expertise; they have worked on the component a fair amount. Five percent probably sends up a warning flag -- it's a magic number -- so we actually tested other sensitivities, other levels for this threshold, from two percent all the way up to ten percent, with similar results. So we don't think it's a function of just this magic number. Similarly, a minor contributor has made less than five percent of the total commits. These are people that make fewer commits, and we think they have less expertise. And the ownership of a component is the proportion of the commits made by the person that made the most commits.

And a graph probably explains this a little bit better. This is a graph for one of the shared libraries in Vista, and I've ordered the developers by the amount of contributions that they made. For this one, the top contributor made 41.2 percent of the commits, so we say that the ownership is 41 percent. Five developers made at least five percent of the commits, and then 12 developers made less than five percent of the commits. So the idea is that when you've got lots of these low-expertise people, it may be a problem.

So the first thing we did was a correlation analysis. We just said, let's look at the ownership metrics and look at their correlation with failures. And in this setting, I was actually interested in both pre-release and post-release failures, because I think -- and the results showed -- that there are things you can do to mitigate the effects of ownership. So, an interesting finding -- and we also included some of the base metrics that are already collected -- yeah, Rob?

>>: [inaudible]?

>> Christian Bird: Oh, yeah, okay. So these are bugs found during testing or at the end of the QA cycle that were put into the issue tracker and had to be fixed prior to release.

>>: [inaudible].

>> Christian Bird: Yeah, yeah. Sorry. Thank you. Good question. So there are some things that are known to be related to failures, and we included those as well. The interesting finding was that the number of minor contributors actually had the highest correlation of any metric that we saw, with both pre-release and post-release failures. It's stronger with pre-release, and we found this trend in all the projects and for all the ownership metrics: pre-release had a stronger relationship. So this gives us the idea that we're on the right track. But one of the things we found -- related to Patrice's question earlier -- is that those binaries that have a lot of minor contributors are also those that are more critical in the system and larger, and so this may not be as big a finding if we're just looking at a [inaudible] correlation.

So we return to linear regression to ask this question, because we want to look at the effect of ownership while controlling for other factors that are known to be a problem. So in our model, in all cases except for Vista post-release, the distribution of failures was heavily right-skewed, and when you have that -- one of the assumptions in your regression is that your errors, or your residuals, are normally distributed.
So when our output was the logarithm of the number of failures, that assumption was met. So we did that transformation in all but the Vista post-release case. We built a base model where we had some factors related to size, complexity, and churn. One of the problems when doing linear regression is that if you have too many variables that are all highly correlated, you can suffer from overfitting and multicollinearity, and that can be a problem. So we picked the measure in each of these categories that had the highest correlation with failures. Then, with this base model, we added additional ownership metrics -- so we added the number of minor contributors -- and asked the question: does the predictive power of the model go up? And then we also added ownership and a number of others, and I can direct you to the paper. This is work that we just submitted to FSE on Friday. And to evaluate whether the models get better when adding these metrics, we look at goodness-of-fit tests and also the amount of variance in failures explained.

Okay. So I'll give you the results for Vista, since you guys are probably most interested in that, and then high-level results for Eclipse and Firefox, but the details are in our paper. With our base model, we were able to explain about 26 percent of the variance in failures for pre-release and then 29 percent for post-release. When we add the number of minor contributors, it adds quite a bit: an increase of 20 percent for the pre-release and 12 percent for the post-release failures. And then when you add ownership, it goes up, but not as much. We actually tested adding variables to the models in different orders, and it was fairly consistent: no matter what we did, minors had more of an effect than ownership. And minors were also always more significant than just adding the total number of contributors, the team size. So what this means is that when we take ownership into account, even including the things that we know are related to failures, the power of the model increases, which means it's having an effect beyond just those standard things.

So we did this analysis on the other projects as well. The interesting thing that we found was that for the projects that were more on the industrial side, we saw a stronger effect of ownership on quality. So in this case, plus means that the number of failures went up when the metric went up; minus means that the number of failures went down. So higher ownership is good; more minor contributors is bad. For Eclipse, you see a range like medium to strong in some places because we tested against six major releases of Eclipse. Although this is only three projects, and somewhat preliminary, we definitely see a trend here that in places where development is more industrial, where there are ownership policies in place, violating those policies tends to lead to more failures. Similarly, the number of major contributors -- those who, we say, have more expertise -- adding them didn't have a very large effect. Yeah?

>>: [inaudible]?

>> Christian Bird: Ownership is a function of the number of commits that you've made to the code, so how many times you've worked on the code. If you have worked on the code more, then your ownership is higher. The intuition is you're more familiar with that code and less likely to make mistakes. And then the base metrics, the things that we know are related to failures -- they were significant across the board, so that wasn't surprising.
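As a concrete reference, here is a small sketch of the ownership measures as defined earlier in the talk: per component, ownership is the top contributor's share of commits, major contributors are those with at least five percent of the commits, and minor contributors are everyone below that threshold. The function and the toy data are mine, for illustration only.

```python
from collections import Counter

def ownership_metrics(commits, threshold=0.05):
    """Compute the ownership measures described in the talk for one component.

    commits: list of author names, one entry per commit to the component.
    threshold: share of commits separating major from minor contributors
               (5% in the talk, tested from 2% to 10% with similar results).
    """
    counts = Counter(commits)
    total = sum(counts.values())
    shares = {dev: n / total for dev, n in counts.items()}

    ownership = max(shares.values())  # top contributor's share of the commits
    majors = [d for d, s in shares.items() if s >= threshold]
    minors = [d for d, s in shares.items() if s < threshold]
    return {
        "ownership": ownership,
        "num_major": len(majors),
        "num_minor": len(minors),
        "num_total": len(shares),
    }

# Toy example echoing the shared-library figure: one developer with ~41% of
# the commits, a handful of majors, and a tail of minor contributors.
example = (["alice"] * 41 + ["bob"] * 20 + ["carol"] * 12 + ["dave"] * 8 +
           ["erin"] * 7 + [f"dev{i}" for i in range(12)])
print(ownership_metrics(example))
```

On this toy data the output has the shape of the example in the talk: ownership of about 41 percent, five major contributors, and a dozen minors.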
>>: [inaudible].

>> Christian Bird: So for Eclipse, we used plug-ins and we also looked at Java packages. And for Firefox, we actually did the same thing that you did for your cross-project prediction, your [inaudible] papers: we looked at kind of directory-level stuff.

So there are two findings. One is that ownership is related to quality, but also that the process that you use has an effect on the relationship between ownership and quality. So when we talked to people, one of the questions they had was: why do we see these minor contributors? What are the reasons? And one of the things that we found in talking to people is that people would say, well, I made this change to a component I hadn't changed before because I needed to: I was trying to fix a bug that was assigned to the component that I'm in charge of. So we'd see someone who is a major contributor to one component who may need to add a feature or fix a bug and, in the process of doing that, say, oh, well, I actually need to go change the shared library. And so he would be a minor contributor to a different component, and we would also see dependency relationships between the two components. So this is something that people had hypotheses about. We looked, and we actually found this happened pretty often.

One of the problems in doing something like this is that you often see what you're looking for. This is the type of thing where you buy a red car and then all of a sudden it looks like everybody's driving a red car, but it's just because you're looking for it, right? And so we wanted to be careful that we weren't suffering from this. So what we did is we ran a Monte Carlo simulation to see how often we would see this in a random graph where there was no attention to these dependency relationships. We took the contribution graphs -- so this is an example contribution graph with some smart and dumb people working on things, and the ovals are the binaries -- and what we wanted to do is create random graphs with the same number of the same developers and the same distribution of major and minor contributors. So this is like my advisor: if he made three major and two minor contributions, we wanted to keep that the same but randomize what he was contributing to, to see if this phenomenon, this major/minor relationship, was real.

So what we did was use what's called graph rewiring. In that, you grab two edges at random -- either both major contribution edges or both minor contribution edges -- and then you flip them. And so what this does is, you have the same number of binaries, each binary has the same number of major and minor contributors, the same number of people make the same number of major and minor contributions, but now it's random. You do this a lot -- N squared times, where N is the number of edges -- and you can reach a sufficiently random graph. Well, you do that a bunch of times, you generate thousands of random graphs, and you see: how often do we see this major/minor dependency relationship in these random graphs versus what we actually observed? If there's a big difference, then that means that what we're observing isn't just noise; it's a real phenomenon that's occurring in these software teams. What we found was that we saw this dependency, this major/minor dependency relationship, about 50 percent of the time. So we can explain about 50 percent of the minor contributions to binaries.
So we know why they're acting this way. And when we do this on the random graphs, it was about 24 percent of the time. So just given the distribution -- that some people work a lot more than others -- we'd expect at random to see 24 percent. What that means is we're really seeing a process, a phenomenon that's real; it's really one of the reasons that nonexperts are making changes to code. And that's good, because now not only do we have a result, but we have some idea as to why that result is occurring, which makes it somewhat actionable.

So we also decided to dig in a little bit deeper and replicate a study based on these contributions. Martin Pinzger was here a couple of years ago working with Nachi, and they built a prediction model where they looked at the topology of the network of people contributing to binaries, and they were able to predict which binaries were the most failure-prone with really high accuracy -- 85 percent precision, 90 percent recall. And the way that they formulated the problem, random guessing would get you about 50 percent of the way there. So we asked the question: what happens when you remove the nonexpert people, these minor contribution edges? So we replicated the study by removing these -- I show them here as dashed edges -- the nonexperts. And when we did that, the precision and recall both fell dramatically. So what we can conclude from this is that the topology introduced by these minor contributors is really important; they add a lot of signal to the model. We also tried it by removing the major contributors instead, and the precision and recall did not fall as much. So what does that mean? These minor contributors, these nonexperts, are part of the reason that you're having these failures, and they add to the predictive power of the model.

So with these findings in hand, we want to do what John Snow did, right? We want to take off the handle and see what happens. So we have some recommendations. I actually wrote an internal report that I'm told went to management. I'd like to see, you know, whether this has been put into practice, whether people have followed these recommendations, and what the results were.

So, three recommendations. The first is that when changes are made by minor contributors, those changes should be reviewed more heavily than other changes, and further, those changes should be reviewed by the people who are the major contributors, the owners, those with the most expertise. They're the ones that are most likely to spot problems. Next, in cases where people want to make a change -- so maybe you're working on a component and you see that you need to change another component that you haven't worked with, because your component depends on it -- where possible, communicate the needed change to the people in charge of that component rather than making the change yourself. Clearly you can't do this all the time, because people don't scale indefinitely, but because this is the situation where we see lots of failures being introduced, if you follow this, hopefully you'll introduce fewer. And then lastly, Microsoft already uses a number of metrics to decide where to focus its [inaudible] resources at the end, but ownership should be added to these metrics, because we've shown that even when using those metrics, you can increase your prediction accuracy when you add ownership. So those are the recommendations. And the next step is to see what happens when you follow these recommendations: does the quality actually get better? Okay.
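For readers who want to see the shape of that null-model check, here is a toy sketch of the graph-rewiring idea described a couple of slides back: swap randomly chosen pairs of same-type contribution edges until the graph is scrambled but every developer and every binary keeps its counts of major and minor contributions, then compare how often the major/minor dependency pattern appears in the rewired graphs versus the observed one. The data structures, names, and the particular pattern test are illustrative assumptions of mine, not the study's implementation.

```python
import random
from collections import defaultdict

def rewire(edges, num_swaps):
    """Degree-preserving rewiring: repeatedly pick two edges of the same
    contribution type and swap their binary endpoints.  Every developer keeps
    the same number of major/minor contributions and every binary keeps the
    same number of major/minor contributors, but who touches what is random.
    (A real implementation would also avoid creating duplicate edges.)"""
    edges = list(edges)  # each edge is (developer, binary, kind), kind in {"major", "minor"}
    for _ in range(num_swaps):
        i, j = random.randrange(len(edges)), random.randrange(len(edges))
        d1, b1, k1 = edges[i]
        d2, b2, k2 = edges[j]
        if k1 != k2 or b1 == b2:
            continue
        edges[i], edges[j] = (d1, b2, k1), (d2, b1, k2)
    return edges

def dependency_pattern_rate(edges, depends_on):
    """Fraction of minor-contribution edges (dev -> B) where the developer is a
    major contributor to some binary A that has a dependency link with B."""
    majors = defaultdict(set)
    for d, b, k in edges:
        if k == "major":
            majors[d].add(b)
    minor_edges = [(d, b) for d, b, k in edges if k == "minor"]
    if not minor_edges:
        return 0.0
    hits = sum(
        any(b in depends_on.get(a, set()) or a in depends_on.get(b, set())
            for a in majors[d])
        for d, b in minor_edges
    )
    return hits / len(minor_edges)

# Toy data: depends_on["A"] = {"B"} means binary A depends on binary B.
edges = [("alice", "A", "major"), ("alice", "B", "minor"),
         ("bob", "B", "major"), ("bob", "C", "minor"),
         ("carol", "C", "major"), ("carol", "A", "minor"),
         ("dave", "A", "minor"), ("dave", "D", "major")]
depends_on = {"A": {"B"}, "B": {"C"}, "C": {"A"}}

observed = dependency_pattern_rate(edges, depends_on)
null = [dependency_pattern_rate(rewire(edges, len(edges) ** 2), depends_on)
        for _ in range(1000)]
print(f"observed rate: {observed:.2f}, random baseline: {sum(null) / len(null):.2f}")
```

A large gap between the observed rate and the random baseline is what supports the claim that the pattern is a real phenomenon rather than an artifact of some people simply committing more than others.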
And so the last thing that I wanted to share with you is understanding how open source software communities work. This kind of harkens back to the question: does this really matter? Do people really care, or is this just an intellectual exercise where we say, well, we want to know how this works? And I think the answer is yes, people care, because there are companies that are trying to compete with open source software and there are companies that are trying to embrace and work with open source software. And so if we understand how these communities work, this can benefit both parties.

So first you start with this premise made by a guy named Eric Raymond. He's this kind of self-appointed spokesman for open source software. He has a very famous essay called The Cathedral and the Bazaar, where he characterizes industrial software as this well-planned, well-executed, modular cathedral, and then you have open source as this bazaar where you have people just working on anything they want to, wandering around, talking to everybody, and wonderful high-quality stuff just emerges.

Well, Fred Brooks has something to say about this. If you're familiar with Brooks's law, it says that adding people to a software project that's late will only make it later. And one of the reasons that he gives as support for this premise is that you have communication problems. Say you have a project with three people working on it; that's all fine and good. Well, you add three people to it, and now the number of potential communication paths grows quadratically with the number of people that you add. And so if you have people just working randomly, then a project will begin to fall apart under the weight of its own communication. So you have to have some type of organization, or you run into this N-squared communication path problem. So if open source really is this bazaar community where everybody works in an ad hoc fashion, how do they deal with this communication problem that Brooks talks about? And so the question that I asked is: are there chapels of organization within this bazaar of open source?

So, our approach. We started by looking at the social network. There is no org chart for open source like there is here at Microsoft, so we looked at communication on the developer mailing lists to get this structure. Next, we took some clustering techniques from the areas of complex networks and physics and altered them to work on our social networks -- this had to do with edge weights, where an edge weight may be the number of messages sent between people in the community. And to give you an example of what we found -- most of the networks were very large; this is one of the smaller ones. This is Perl, and the edges represent communication between people, and the boxes represent the clusterings that we found using this technique.

And then we examined something called modularity. Modularity is a formal metric that comes out of physics. It ranges from 0 to 1, and values of about .3 and higher are considered modular for networks that occur in nature and are known to be modular. So what we did was we took a number of open source projects and looked to see how modular they were -- did they pass this threshold? We looked at these five projects, and in all cases we saw that the modularity was much higher than you would expect to see by chance.
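For reference, the modularity score in question is, as far as one can tell from the description, the standard measure from the complex-networks literature; this is a sketch of the usual (Newman) definition, possibly adapted for weighted edges in the actual study:

```latex
% Modularity Q of a partition of a network into clusters (communities).
% A_{ij}: weight of the edge between nodes i and j (e.g., messages exchanged),
% k_i = \sum_j A_{ij}: total edge weight attached to node i,
% m = \tfrac{1}{2}\sum_{ij} A_{ij}: total edge weight in the network,
% c_i: the cluster that node i is assigned to.
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
```

Q compares the weight of edges falling inside clusters to what would be expected if edges were placed at random while preserving each node's degree; values around .3 and above are the conventional sign of real community structure, which is the threshold mentioned above.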
And so what we find is that there are tight-knit groups of people within the organization of open source. There are all these teams that form somewhat organically, similar to the teams in industrial development that are organized by managers.

And the last thing that we did is we looked at the differences in discussion. We looked at messages that were more technical, that mentioned things like function names, files, and global variables, and then messages that didn't contain any of these more technical terms. Typically, when we read the ones without technical terms, they were talking about more process-oriented things, like should we release this week, or should Joe be given access to the source code repository. So the latter we called process topics and the former we called product topics. What we saw is that the community was much more modular when talking about more technical things. People were drawn to different parts of the system, and they would talk to other people that were drawn to the same parts of the system.

So again, we've looked at the community to try to test some hypotheses, and what we found is that, look, the open source communities aren't just ad hoc. They actually organically form into teams. We did find that the teams are more dynamic than you typically find in industry. We found that when they're talking about more technical things, they are even more modular; they form into these tight teams. And also, when I looked at their actual development efforts -- what they were actually talking about, what they were doing, the changes that they were making -- they were clearly making changes related to what they were talking about and to their organization. And so in some ways it's kind of a validation of Conway's law, which says that the communication structure and the architecture of a system are intimately tied.

Okay. So that's what I've done. What do I plan to do? So, a few things. First, one of the things that I found is that the quality of the data that you're working with directly affects the power of the conclusions that you're able to make. So [inaudible] paper [inaudible] last year where we looked at bias in data. We found that there's a lot of bias in a lot of the data that's used in research. In Apache, we looked at the number of bugs in a lot of ways, but one of the ways was by severity category. And we found that for bugs that are marked more critical -- so maybe in the blocker, critical, and major categories -- we're not able to tie those bugs back to the bug fixes as often as the bugs that are less important, those in the minor and trivial categories. So this is clearly biased: certain types of bugs are overrepresented and others are underrepresented. And when we looked at its effect on bug prediction and hypothesis testing, we found that the amount of bias in the data has a direct effect on the ability of the technique to predict the bugs. And so right now -- we just submitted a paper to FSE where we actually tried to recover -- okay, it doesn't like me -- we tried to recover complete data sets and look at the effect of bias, and how can we overcome this bias and get better prediction. So this is one of the things that I'm trying to look at: how do we deal with these problems in data? And I should mention I think this is a problem here as well.
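To illustrate how the kind of bias just described can be measured, here is a toy sketch: given a bug tracker dump with severities and a set of commit messages, compute what fraction of fixed bugs in each severity category can be traced back to a fixing commit. The linking heuristic and the data are deliberately naive assumptions of mine; real linking studies use much more careful matching.

```python
import re
from collections import Counter

def linked_bug_ids(commit_messages):
    """Naive linking heuristic: a commit fixes bug N if its message mentions
    'bug N' or '#N'.  Only for illustrating the measurement, not for real use."""
    ids = set()
    for msg in commit_messages:
        ids.update(int(m) for m in re.findall(r"(?:bug\s*#?|#)(\d+)", msg, re.I))
    return ids

def link_rate_by_severity(bugs, commit_messages):
    """bugs: dict bug_id -> severity ('blocker', 'critical', ..., 'trivial').
    Returns, per severity, the fraction of fixed bugs traceable to a commit.
    Large differences across severities indicate the kind of bias discussed."""
    linked = linked_bug_ids(commit_messages)
    total, hit = Counter(), Counter()
    for bug_id, severity in bugs.items():
        total[severity] += 1
        if bug_id in linked:
            hit[severity] += 1
    return {sev: hit[sev] / total[sev] for sev in total}

# Toy example: the critical bugs are the ones we fail to link.
bugs = {101: "blocker", 102: "critical", 103: "major", 104: "minor", 105: "trivial"}
commits = ["fix typo, closes #104", "address bug 105 in parser", "refactor build"]
print(link_rate_by_severity(bugs, commits))
```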
Gina Venolia had a paper at ICSE 2009 called The Secret Life of Bugs, where they found that, look, a lot of the data related to bugs is actually not being captured, so this may have implications on our ability to use this data -- or rather, on how much better we can do if we get the whole story.

And also, I've made some findings here, but once we know things, what we want to do is actually remove the handle from the pump, like John Snow, and see what we can affect, what we can change. So we can make changes to process and observe the results. In cases where we can't actually make a change, or it's not clear how to make the change, we can build tools to help people working on software.

Another thing that I'm interested in is studying different domains. In research, there are certain areas that are overrepresented and certain areas that are underrepresented. And I think that there are some areas of software development that are very underrepresented, and if we haven't studied them, then we don't know which laws of software development are universal and which ones are context-dependent. So the two areas that I think are hot right now and worthy of study are web applications and game development. And I'll tell you why I think these may be very different from what we've studied in the past.

So, web applications: you have multiple languages running on different machines. On the client, you have JavaScript and HTML. On the back end, you have some type of database, and you may be using SQL for queries. In the middle tier, you may be running C# or C++, Java, or some other newer language. So this is very different from an operating system or a desktop application. In addition, there are different schedules. I know of one website called Stack Overflow -- it's a programmer help website -- and their goal is to release some new feature or bug fix on a nightly basis. So every night they're deploying their software again. This is very different from, you know, Windows, where you may be deploying a new version every few years, or even patches on a monthly basis. So you can fix bugs very quickly, which means that testing may not be as much of an issue, but ultimately, if there's a bug that brings down your site, all of a sudden all of your users are affected. And then rich monitoring: I talked to Jim Larus [phonetic] when he visited UC Davis, and he said that one of the most important milestones of Windows development was when they added Watson and they were able to see what was affecting users the most. With web applications, you have an even richer view. You know exactly what's being used. If there's a crash, you know what led to that crash. You get a lot more insight into the uses of your application than you get in other domains.

In game development, there are also some key differences. The amount of content that ships relative to the amount of executable code is very different from a typical application -- typically the content is an order of magnitude larger or more. And likewise, the team makeup is different. So in cases where there are tools that rely on aspects of the source code, you may not have as much source code. You have people who -- I won't say are nontechnical, but they're technical in different ways, because they're not writing code. They're creating content: 3D models and story lines.
So there may be some implications for coordination and how these people interact. And in addition, the systems that the software is running on are different. Think of a web browser: you should be able to send it anything, and even if it's malformed, it has to be able to display something to the user. With a game engine, the content is fairly static. You know, when you ship the game, the content that's going with it, and so you can test both of them together. Nowadays there's downloadable content, but it's controlled, so even then, if you find out, look, there's a bug in my game engine that is exposed by this new level, well, you can actually test that level and make changes to it so it doesn't expose the same bug in the engine. So I think this may also have some impact on your QA. Very different from traditional development.

And then, I've looked at the team level, but I'm also interested in the individual level. Right now you see this proliferation of new languages and new language abstractions added to existing languages to deal with things like multicore and some other problems. If you look at, like, LINQ -- it was just added to, well, I guess not just, but was recently added to C#. Java is introducing closures. Twitter just moved to Scala as their implementation language. The people that are proponents of these say, look, this solves these problems, it makes lots of things easier. But it's not clear to me that you just get something for nothing. When you introduce a new abstraction, that introduces some cognitive load on the programmer and may be difficult to deal with. There are some projects that decide, look, there's a new feature of the language, but we're not going to start using it yet, because we don't trust it, because it may be more difficult for our developers. So I'm interested in looking at the effect of these new abstractions and these new languages on developers at the individual level.

And then lastly, I'm interested in coordination. I've only looked at the team level, but I'm interested in looking at the individual level. It's important to know when you should be coordinating and also when you shouldn't be coordinating. If you're always talking about what you're doing, and everybody is doing this, you just have too much noise and people stop paying attention. And the literature so far has just looked at things like the number of lines changed and commits. I'm interested in looking at the semantics of the changes that people are making and determining when certain semantics imply that you should be coordinating your changes, and with whom you should be coordinating -- so you can direct it to certain people rather than sending it to a mailing list.

So this is kind of where I'm going, and that's what I've done. I'm happy to answer any questions that you might have about the things that I have presented here, also things in my papers or what I plan on doing. So I appreciate you coming out and listening to what I have to say. Thanks.

[applause]

>> Christian Bird: So I've answered everything.

>>: So the changes that you are proposing -- why haven't they been evaluated? [inaudible] Is it because they were proposed to you, or because there's no infrastructure to measure whether things really have an impact, or [inaudible] currently it's mainly observation?

>> Christian Bird: Uh-huh.

>>: But I want to see that with your changes, things really will go better. So --

>> Christian Bird: Yeah. Yeah. Okay.
So when I was here over the summer, I actually wrote up some internal reports. And I'm told by Nachi that they have actually gone to management and they have been evaluating whether to put them into practice. Honestly, I don't know where it's gone from there, because I think at that point it's internal, and as I'm not a Microsoft employee, I'm not privy to that type of thing. But if I come here, this is clearly something that I'd like to do: who can we convince to try these recommendations? I doubt people are going to make, you know, broad sweeping changes and say now we're going to change how we do Windows, right? It's pretty risky. But you can start on projects where it's maybe less intrusive, smaller projects, and see what the effects are. And that's part of why I want to be here: because in an industrial setting, you can recommend changes, right? Microsoft Research has a lot of reputation within the company. I know Nachi has done some work that has actually impacted development, and that's why I'm here.

>>: You made a comment about the minor contributors. If, instead of making the changes themselves, they go to an owner or to a major contributor and say, please, I'd like these changes, won't you make them for me -- versus they make them and then they're carefully reviewed by the major contributor. What I'm wondering is what kind of impact that might have on the time required of the major contributors [inaudible] hope that the major contributor might be helped by having other people also do stuff.

>> Christian Bird: Yeah. So that's a good question. Right, people can't scale indefinitely, and if you have one person that has to make every change, that is going to be a problem. So one of the things that I saw is that oftentimes there's a clear owner, but there are also other people who are not the owner yet have a fair amount of expertise. And when you increase the number of those people, the software quality went down, but not tremendously, and nowhere near as much as with the number of minor contributors. So I think it's not just one person that you can work with but a group of people. But clearly I wouldn't expect that there would be no minor contributors to any component in the next release. At least if you're aware of it -- you know, when there are [inaudible] cycles, or a situation where, look, I have time to review five changes and there have been 20 made -- now we can help you prioritize where to put your resources, and that's helpful.

>>: Do you know, I mean, for those bugs introduced by minor contributors, who fixed them? Was it the major contributors [inaudible]?

>> Christian Bird: Okay. So I should be very careful here. I don't know exactly that the bugs were introduced by those people. Talking with people makes me think this, based on anecdotal evidence. The data that I had to work with didn't indicate that this person introduced this bug. It's hard, when you see a bug fixed, to know who introduced it, because it's not always the same lines. So we have strong reason to believe it, but I don't know it for sure. And because I didn't have access to that data, I don't know who made the fix. I will say this: I know that some companies tend to, at the end of a release cycle, put everybody working on everything -- like, you have some free time, go fix this bug in this one thing. In Vista they don't do that. There's still ownership at the end of the cycle.
I should also point out that we looked at the timing to see whether maybe all the minor contributions were coming at the end -- whether it's really hiding this issue that it's not the minor people, it's just all the stuff made in a flurry at the end. We didn't see a large effect. So Vista, I think it was like a six-year cycle. We looked at the difference in the timing of the commits, and it was normal-ish; the median difference was about 60 days between minor and major contributions. So if you look at the distributions, there's this big, heavy overlap. So that gives us some idea of what's not causing the problem.

>>: So the data sources that you used for your studies were basically not intended as measures for studies, right? I mean, you were basically looking at things like the bug database and check-ins.

>> Christian Bird: Yeah. [inaudible] some other purpose, and we're trying to use it for our own.

>>: Exactly, exactly. So speculate for me. You know, if you got very high-level buy-in from an executive that you could put any instrument into the work process you wanted, to take measurements different from what you can get from the archives, is there anything you would want inserted? What would it be?

>> Christian Bird: Let me think about this. So [inaudible] Brandon -- he said that there are some groups that actually are buying it and saying, well, what would you need to make it better? So I think this may happen. What would I like to see? One is I'd like to see why changes are being made. So if a change is being made, can we say with confidence this is a bug fix, this is a feature addition, you know, who exactly is making this change. That's probably the most important. Another piece that I'm really interested in -- I've looked at this in open source -- is that I'm just looking at the repository as this black box. We know that when you're doing development, it's not all on just one line of development. People branch the code or work on -- I don't know what the terminology is here. But not everybody is working on the same code base, right? I'm working on my portion; you're working on yours. And then at some point we're going to merge those in, or abandon them if they go wrong. Right? So I'd like to be able to look at that level of granularity to get a more complete picture. I think that can tell us something about how you should work and coordinate your changes. It may be that if I am working in a small team on our own branch, then that's better -- at least I have evidence in the open source world that that's much better -- than if you're all working on one code base. And I'd like to actually empirically show that here at Microsoft, so that's another piece. I think those two off the top of my head, but, you know, give me till tomorrow and I'll probably have a longer list of, you know, if you could have a candy store, what would you put in it. Because, yeah, the more data you have, the more interesting questions you can answer.

>>: No other questions [inaudible].

>> Christian Bird: Thanks.

[applause]