>> Nachi: Hello all. Thank you for coming. We are very happy to have with us
Chris Bird. Chris is actually no stranger to MSR. He's interned twice with us
in SRR. He's actually going to talk to us today about his research interests
in sociotechnical congruence, primarily based on teams and on software systems.
>> Christian Bird: Thanks, Nachi. So yeah. So I'm a student at UC Davis,
finishing up in the summer. And in this talk I kind of wanted to tell you what
my background is, why I do what I do, what I have been doing recently and also
what I hope to do in the future. So I'll start off and tell you a little bit
about myself.
So I got here by kind of the traditional means. I did my undergraduate in
computer science and towards the end of my undergraduate career, and also while
my wife was finishing up her degree, I worked for a large tech company writing
software. And I worked with some really bright people, some very capable
managers but there were some things about the projects, how the projects that I
worked for were run that were a little disconcerting. And when I talked to
some coworkers and managers, we kind of had this running theory that when
managers make decisions, a lot of times they're making their decisions based on
their intuition and just kind of anecdotal evidence. But, as we know, the
plural of anecdote is not evidence. And so I was kind of
disheartened by some of what was going on.
And so when I went back to grad school, I really wanted to do something about
this. And so kind of what I was thinking was there's got to be a better way to
run software projects, a more principled way, and is there some way that we can
work towards making decisions based on actual real evidence.
So I'm going to make an assertion that I don't think anybody here would argue
with, that developing software is expensive and time-consuming. If you talk to
people -- this probably isn't as true at Microsoft, but you would know better
than me, but a lot of projects run over budget, and when you talk to
developers, most of them ship late.
Just this week, we have another victim of something shipping I think a week
late. And the rumors are on the net that the iPad is actually shipping late due
to a software issue and not a hardware issue. So this is very relevant today.
And if you look at the literature, the number that's classically given is that
80 percent of the time and money is spent after release, so this is just to get
there. And then you have to deal with this.
And I should really point out, I don't think it's because we don't have smart
people working on the problem. You've got bright people here at Microsoft and
also at other places, so I don't think it's that we have dumb people; I think
it's just a hard problem.
So that's why we have software engineering research. Right? We want to make
this better. And at a high level, there's two things that software engineering
tends to try to do. Not all of it does this, but what we want to do is in some
way increase productivity. So this may be in terms of processes. It may be in
terms of tools to help people. It may be in terms of like language
abstractions that allow people to be more productive, but we don't want to
sacrifice quality. That's painful if you're shipping bugs. That can have
financial and reputation impact as well.
Okay. So this is what we want to get to. So clearly the next thing to talk
about is a cholera outbreak of 1854. So this is related. But to give you some
background, 1854, the cholera outbreak that some said was the worst outbreak in
the kingdom. And there were some folks' ideas at the time about how cholera
was spread. So one idea is miasma, which is this idea that you've got this
poisonous gas going around, if you're in its way, you're gone. Sometimes,
well, you know, we deserve it. God comes and smites you and you get cholera.
And then some people said: This is life. It just happens. You deal with it.
Move on.
So the government at the time said: Nothing we can do. We just have to deal
with it. But fortunately there were some people at the time that had other
ideas. This is John Snow. He was a surgeon at the time and he had some ideas
about how cholera was spread and its effects and he saw this opportunity to
kind of test some of his ideas. And he was under some constraints. If you
have an idea about the way disease is spread, typically in medicine if we have
questions, we can run like a double blind clinical trial. But you don't want
to give people cholera and other people, you know, not give them the cholera
and see who dies and say, oh, now I know. You just have to watch what happens
and see.
So he had these constraints in that he has to watch after the phenomenon. So
he has some hypotheses: I think this is how cholera is spread. And then what
he did was he went around and he interviewed families. He talked to people who
had family members that had passed away and those who were relatively healthy.
And he also collected geographic data. There's a very famous map called the
ghost map that he made of the area where the outbreak occurred and the little
black dots represent families, residences where people had passed away from
cholera. And generally he observed the community. There's some interesting
things that he found if you look on line at this research, that he found some
key events that led him to come to the conclusion cholera was passed by water,
and he identified the Broad Street Pump. This is this well in southern
England.
And so he went to the authorities and said, look, I think the water here, I
think this is the source of your outbreak. And he convinced them to take the
handle off the pumps so no one could get water, and immediately the outbreak
stopped. And this is really considered one of the watershed events in
epidemiology, spawned the field. And some of the methods that he used are
still in use today.
Okay. So what does this matter for us? Hopefully we're not dealing with
people living and dying, but I work under some of the same constraints that
Dr. Snow worked under. He couldn't conduct these trials. He had to observe a
phenomenon and gather whatever data he could to test some hypotheses about how
things worked.
So we care about people living and dying, but my research isn't necessarily
about that. What we care about are software projects that fail and those that
succeed. We want more -- Windows 7 has actually gotten a pretty good reception
in the marketplace, so this may be a success. And this is the Ariane 5 rocket
that actually blew up as a result of a software failure. So we want less of
these exploding rockets and more of these Windows 7s, right?
So what we do is we use what John Snow used, we call the empirical method.
There's three main steps to the method. The first is that we gather data,
typically related to our outcomes -- measures of software quality and
productivity -- and then also whatever factors we think are related to those.
And then we
examine relationships. So both quantitatively and qualitatively try to
understand what's going on. And then based on that, we can make changes to
processes. Maybe we build tools to help developers and managers so we can have
an impact, make things better.
So does anybody care about this? There's a lot of problems we can solve. More
than we have time for. We want to make -- look for solutions to problems that
matter. And so really, if you ask the right questions, then people do care.
And so some of the questions I've asked have been based on my prior experience
working in software and also the experience that other people have had and the
gripes that they've had about what they've encountered.
And the goal being that if we can take this empirical data, if we can improve
processes, maybe it turns out that working a 9-to-5 work week is better than
putting in hundred-hour blitzes at the end of a dev cycle. We can target
resources if we know that certain parts of the system are more prone to
failure, and hopefully we can improve the quality and the productivity of our
developers.
And does it matter? Yeah. There's a lot of money riding on this. So 2008,
software was a $300 billion market. So there's a lot of ways to look at this.
And if you look at the literature, it's replete with ways to address this
problem.
My kind of perspective is to look at the people. So this is Bjarne Stroustrup,
kind of the father of the C++ programming language. He says design and programming
are human activities. If you forget that, all is lost. And if you remember,
John Snow, he was interested in disease, but what did he do? He looked at the
people. And so that's what I do.
So I'm going to present some of my results today but to give you an idea of
what my whole kind of graduate career spans, the things I've looked at, I'm
looking at open source software and how it works. Other people are interested
in how it works: is it really this bazaar with everybody doing everything and
magic just comes out, or is there some more organization? I look at defect
prediction, both using attributes of the people working on software to
predict defects and also looking at the effect of the quality of the data
that's used to make defect predictions. We found that that has a big effect.
And then also process and the effects of the process used on software quality.
And it's not just software. I've also looked at some other things. I've
looked at collaboration in computer science research, found that different
areas of research, kind of the collaboration patterns are a little bit
different. And I'm an empiricist. I love sports. So I actually have a --
we're submitting a paper about NCAA football, and I'm happy to talk about this
ad nauseam if anybody is interested. Love college football.
So today I'm going to talk about three things. I've looked at distributed
development, ownership and expertise, and also how we think open source
software works.
So the first one: Does distributed development hurt code quality? So unless
you've been under a rock, you realize that in the tech sector, at least,
offshoring has been a really hot topic. And if you followed the 2004
elections, this actually came up in some of the political platforms.
Offshoring from the U.S. It's a big issue. And there's some people that have
some ideas about it.
So this is Tom Allen, and he's at MIT. He studies innovation in the workplace.
And he's developed what he calls the 50-meter phenomenon. And the idea is that
when you have people working in a very creative and innovative environment,
when you have people even as far as 50 meters apart, you see a dramatic
decrease in the frequency of communication and the richness of communication.
And so this is 50 meters, the question is: Well, what happens when you're
talking about 5,000 miles? Do we see an effect? Does it hurt software
quality?
And when you talk to developers about the issues that they face when dealing
with people that are operating remotely, they have a lot of -- a lot of issues
that they'll raise, but all of them leading back to the claim that, look,
quality will suffer if you distribute development, especially around the world.
And in this study, what we showed is that it can be done with little effect
on quality -- not that it always will be, but there are ways to do it.
So Windows Vista is really a great candidate to study to ask this question. I
don't have to give people here much of a background. You have thousands of
people that worked on it. Very large project. There are thousands of
individual pieces, and we can compare the pieces within one project rather than
different projects, so we're really trying to compare apples to apples and
avoid some of the confounding factors of looking at two individual projects.
Very large and definitely distributed around the entire world.
So we have this question. John Snow had some ways of gathering data to test
his hypotheses. Well, what kind of data do we gather? It's kind of four
pieces. Initially we start with just the source code, so we know who
contributed source code to every binary in Vista. So a binary is an
executable, a shared library, or a driver. We know who wrote every line of
code.
Next, this dialog does have a bad connotation, but people that work here know
that the information related to crashes helps the management at Microsoft make
decisions about what to fix and who's being affected, which crash is the most
important. So this is kind of our outcome measure of software quality.
Next we have the org chart. We know who is working where, when they were
contributing code to Vista. And then last, we actually have the geographic
data, so this is a map of precisely where I am, actually. And the ovals
indicate buildings that are served by the same cafeteria. So fortunately I
don't have to describe as much to you as I have had to the other people.
So armed with these four pieces of data, we can really answer this question.
So what we did was we binned binaries based on the level of distribution. So at the
very lowest level, you have the building level. So this is where most of the
developers work in the same building. If you work in the same building, it's
very easy to walk next door to ask someone a question about maybe an interface.
You may run into people just informally. You've worked together. You've
probably been at the white board together, in meetings, talking about your
design.
Cafeterias where you have developers in buildings that are served by the same
cafeteria, so still not too hard to walk next door. You can probably arrange a
meeting fairly quickly and you can have meetings over lunch.
The campus level, here in Redmond, it can take a while to get between
buildings. You may be less familiar with someone that works at another
building. Probably want to schedule a meeting a day in advance.
And then locality. So like you have the Seattle locality. It takes a while to
get between sites. You may not have even ever met face to face someone that's
working on something you're working on if you live in the same locality but not
on the same campus.
And then things start to get bad pretty quickly. At the continent level, you
start to begin to deal with time zone issues. Meetings are very difficult.
They almost always have to be conducted electronically.
And then at the world level, you've gotta fly. It's expensive. You start to
deal with cultural issues. There may be sites that don't have any overlap in
terms of working hours. And the idea behind this is -- it kind of harkens back
to Tom Allen's work, is that, look, as the distance increases, it becomes
harder to coordinate, to become aware of what everybody is doing and to be
managed.
And so we started off looking for a 50-meter rule of software. So what we did
was we took these levels and we created five different splits and said, look,
first we'll say that collocated is all binaries developed by developers in the
same building and everything else is distributed, all the way down to
distributed being only binaries that were worked on by developers across the
world and everything else is collocated.
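The five splits described here can be sketched in a few lines of code. This is a minimal illustration, not the actual study tooling; the binary names and their levels are invented examples:

```python
# Sketch: generate the five collocated/distributed splits from the
# six geographic levels described in the talk. Binary names and
# levels below are hypothetical, not Vista data.
LEVELS = ["building", "cafeteria", "campus", "locality", "continent", "world"]

def make_splits(binaries):
    """binaries: dict mapping binary name -> the most distributed level
    among its contributors. Returns one (collocated, distributed) pair
    per cut point between adjacent levels."""
    splits = []
    for cut in range(1, len(LEVELS)):
        collocated = {b for b, lvl in binaries.items()
                      if LEVELS.index(lvl) < cut}
        distributed = set(binaries) - collocated
        splits.append((collocated, distributed))
    return splits

binaries = {"kernel32": "building", "shell32": "cafeteria",
            "mshtml": "campus", "intl": "world"}
for i, (co, di) in enumerate(make_splits(binaries), 1):
    print(f"split {i}: {len(co)} collocated, {len(di)} distributed")
```

At the first cut only same-building binaries count as collocated; by the last cut only worldwide binaries count as distributed.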
So is there any split here where we see a dramatic difference in software
quality? So I'll show you the split for the differences for the very first
split. So two-thirds of the binaries fall over here in collocated and then
one-third is in distributed.
And these distributions look a little bit different because there's a lot more
on the left than on the right, but the actual distributions are -- there's not
a strong difference between the two. I'll just tell you right now, I had to
take off the numbers because people outside of this organization can't see
them, but the peak on this side is the same thing as the peak over there. And
so although you have different mass, the distribution is fairly similar. And
so you conclude just from visible inspection, you don't see a huge difference.
Yeah?
>>: Binaries aren't equal, right? So some are huge, some are tiny. Isn't it
[inaudible] just binaries [inaudible] and then you can -- so you would infer
from that that collocated can be much worse than distributed but yet as you
know at Microsoft, [inaudible] is entirely done inside here in Redmond and less
important perhaps or whatever is distributed and now --
>> Christian Bird: So --
>>: So you're going to discuss all that.
>> Christian Bird: Yeah. Yeah, I will, exactly as to that point.
>>: That is shocking to me [inaudible].
>> Christian Bird: It's shocking to a lot of people. And I should mention, so
this is actually -- this is not worse than that. So the Y axis is the number of
binaries and the X axis is the failures. So although there's like more in this
peak than that peak, the proportion, the actual distribution, the density
distribution is about the same between them.
But you're right. Binaries are different. And that's something that we looked
at, so I'll get to that in a little bit.
So we decided to take a more principled approach than just looking at pretty
pictures. And so we used linear regression. And the data that we used looked
like this. So for every binary, we included the level of
distribution. There's no reason to believe that there's a linear relationship
between buildings to cafeterias to campus so what we did is we encoded each one
as a binary variable in our model, and so you'll notice that we don't have any
category for binaries in the same building. That's kind of our baseline. So
we're comparing quality of distributed binaries to those that are developed in
the same building.
And then our output is quality, software quality, so we measure that in number
failures in the first six months after the release of Vista.
Okay. So this is what our model came out to. And don't be afraid; I'll
actually interpret what this means.
So the two pieces of information, that we're interested in are the percent
increase. So this is the percent increase in failures relative to binaries
developed by engineers in the same building. So as an example, if you look at
binaries developed in different cafeterias but the same campus, you see a
16 percent increase. And over on the right is the significance. That's just
the likelihood that what we're seeing is just due to the noise in the model.
So lower values are better. And .05 is about the cutoff for saying that
something is statistically significant. So in almost all cases, we see that
it's significant.
16 percent is nothing to sneeze at. It's not as high as what we were expecting
to find, but, you know, that is an increase. So we concluded, look, there
actually is a little bit of an increase in failures when you distribute
development. But, interestingly, Jim Herbsleb did a study in 2003
that's kind of similar to ours. He looked at productivity, not quality. He
also found that when it was distributed, the output variable went down.
[inaudible] people were less productive. But then what they did is they
controlled for the number of developers in the teams. So essentially, if you
have ten people in the same building or ten people scattered worldwide, you're
going to see an increase, the same increase relative to five people in one
building.
And so we ask the same question: What happens when you control for team size?
So again, we're using similar data, but now, in addition to the level and the
failures, we add the number of people that worked on a binary. And the story
changes a little bit.
So again, this is a result of the new model. The two things to pay attention
to are first, the percent increase. It's dropped dramatically. The highest is
in the different localities but we can't even tell if that's due to real data
or just noise in the model. The only one that is statistically significant is
different campuses. You see a six percent increase. So it's dropped, which
means a lot of what we're seeing was the factor just larger teams, and larger
teams tend to be more distributed.
So what we conclude is, look, there is a small increase, but it's actually
mostly attributed to the size of the team rather than the level of
distribution.
Okay. So now back to Patrice's question, because it's a really good question.
And when I presented this result to people, both inside and outside of
Microsoft, they said, well, look, maybe management knows we should distribute
the simpler things, those that are -- have a lower risk if they were to fail.
And so they said we think simpler binaries are distributed, and so because
distributed development is hard, it -- everything balances out and they look
like they're the same.
So the first question: how do you define simpler? There's a lot of ways to
define simpler. Fortunately Microsoft gathers all kinds of metrics from source
code, so they've got measures of complexity, measures of churn -- so churn size
is like number of lines changed, edits is number of commits -- in-degree and
out-degree on the dependency graph, path coverage in testing. All kinds of
things.
And so we looked at the correlation between these metrics and the level of
distribution for the binaries to see is -- you know, do we see that those that
are simpler are more distributed.
This is actually the list of those with the highest correlations. As you can
see, correlation ranges from negative 1 to 1, with extreme values showing a
strong relationship, and we don't see very high values. The highest is number
of developers. But this isn't a surprise. We just found that when we
controlled for the number of developers, we were able to account for more of
the variance in failures.
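The correlation analysis here is standard Pearson correlation; a minimal hand-rolled version, with invented metric values rather than the Vista data, looks like this:

```python
# Sketch: Pearson correlation between a code metric and the level
# of distribution, written out by hand. All numbers are invented.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

num_devs   = [3, 5, 8, 12, 20]   # hypothetical metric per binary
dist_level = [1, 1, 2, 4, 5]     # hypothetical distribution level
r = pearson(num_devs, dist_level)
print(round(r, 2))
```

Values near +1 or -1 would indicate the simpler-is-distributed effect; values near 0, as the study found, indicate no strong relationship.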
So from this correlation analysis, it doesn't look like what's being
distributed is actually simpler. You don't see a huge difference. But I've
actually found in my research that sometimes just digging into the data by hand
can show you things that you wouldn't find just by doing a quantitative
analysis. So we actually looked -- didn't look at all 4,000 binaries, but we
did look at like top-20 lists of those that were distributed and those that
were the largest and the smallest. Looked to see if maybe there were
subsystems that were more distributed than others, and we didn't find anything.
We even went so far as to build a logistic regression model: if we include
all of these metrics, can we predict the level of distribution? And the
precision and recall were really bad.
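Precision and recall, the two numbers used to judge that classifier, can be computed like this. The labels and predictions are fabricated for illustration only:

```python
# Sketch: precision and recall for a classifier predicting whether
# a binary is distributed. Labels and predictions are invented.
def precision_recall(actual, predicted):
    tp = sum(a and p for a, p in zip(actual, predicted))          # true positives
    fp = sum((not a) and p for a, p in zip(actual, predicted))    # false positives
    fn = sum(a and (not p) for a, p in zip(actual, predicted))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = [1, 1, 1, 0, 0, 0]  # 1 = actually distributed
predicted = [1, 0, 0, 1, 1, 0]  # the model's guesses
p, r = precision_recall(actual, predicted)
print(p, r)
```

A model with both numbers low, as in the study, is telling you the metrics carry little signal about which binaries are distributed.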
And so what we concluded from this is that at least relative to the metrics
that we used to look at for simpler, there really wasn't much of a difference
at all between the binaries that were distributed and those that weren't.
So the next two questions, how in the world did they do this and why -- why are
we getting these results?
>>: [inaudible] communication to get to the same results?
>> Christian Bird: So yeah, that may be. To the degree that it's -- so it
depends on how you define effort, right? So we did measure effort in terms of
the number of changes made to code. There are clearly other ways to measure
them. So like how often are we having meetings and that kind of thing, we
didn't measure that, but I would expect that it's more effort at least in
coordination, definitely. And actually some of the qualitative factors when we
talked to people kind of bear that out.
So here's some factors that we got from managers. I should emphasize we don't
know -- we haven't shown a causal relationship, but the intuition and people's
experiences support the idea that this may be some of the reason why we got
these results.
So first, Microsoft uses liaisons and face-to-face meetings. There's
literature that shows that if people have face-to-face communication, that
later when they work remotely, there's more trust there and you're able to work
easier. And if you have liaisons, then people know who the point of contact is
in the different teams.
Next, a lot of the senior engineers at kind of the distributed sites, a lot of
them started here at Redmond. So if there was a question in Beijing about what
tool to use or why was this design decision made or who should I contact
because I don't understand something, these senior people, they had a lot of
that information. If they didn't have the answers, they knew who to go to, so
they were kind of taking this information in their heads with them to the
remote sites.
During the Vista cycle, they had daily synchronous communication. So that
means that for sites where there's no overlap in work, they had people coming
in early or staying late so that you could have meetings and talk about issues
before they escalated out of control and went unaddressed for weeks or even
months.
And lastly, Microsoft tries to use the same process and the same tools, at
least within Vista. I don't know about the other projects. But within Vista,
they tried to use the same process uniformly. In some other studies, they
found that if remote sites are working on their own modules, everything is
going fine and good until it comes time to integrate, and then you have
problems because some people run their tests with their builds and other
people pull from a different repository, and so you really run into problems
then. But Vista didn't have this because they used a uniform process.
So what do we conclude? Well, I'm not claiming that distributed development is
easy. It's clearly difficult. And anybody who has done it will tell
you that. But after looking at the community and gathering some data, and
testing some hypotheses, what we've found is that it's possible to do it with
little effect on post-release failures. And this is a good thing. This is
good news. We have a handle -- Microsoft at least has a handle in terms of the
Vista development on some of the problems that we face and how to overcome
them.
So that's collaboration in terms of distance. So I want to talk a little bit
about collaboration in terms of expertise as well.
So in software engineering literature -- so I'm interested in expertise and how
much experience people have with certain parts of the code base. And in the
literature, you'll find that a lot of times people use ownership as a proxy for
expertise. So if you have made a lot of changes to a particular portion of
that code, then you have the experience to probably understand it better.
Especially if you're the one that like wrote a large part of it. And if you
haven't worked with it, then you have low expertise, less knowledge.
And so the question that I ask is: What happens when you have a lot of people
working on something and those people have low expertise? Is that bad? Our
intuition would tell us, yeah, that's probably not a good thing.
And next, is ownership related to defects? So if you have a binary that's
clearly owned by someone, is that better than having a piece where -- as kind
of shared ownership amongst a large team.
And then does the process matter. Does development style that you use, is that
related to relationships of ownership and quality? So -- and I'll define
ownership more formally a little bit later.
So I'm going to make a bit of a generalization. There is a development process
spectrum, and I'm kind of generalizing here because it's not just this
one-dimensional spectrum. But for the sake of some first steps, Vista is
clearly on the commercial side. Eclipse is a project that kind of lives in
this hybrid land. If you look at the development activity, it's mostly owned
and controlled by IBM but it does espouse some open source principles. It's
under an open source license, and it accepts contributions from the community
at large. So call that kind of a hybrid.
And then Firefox, though lately it's been moving to more of a
corporate-controlled entity, for this study, we looked at older versions of
Firefox [inaudible] more of an open-source style.
I should point out there is no like one open source method of development, no
one commercial. So I'm kind of generalizing a bit here.
And so the question that I ask is: Well, how do these differ? So I need to
define some ownership terms here. So on a per-component basis, I say a major
contributor is someone who has made at least five percent of the total commits.
So the idea here is that these major contributors probably have expertise.
They have worked a fair amount. Five percent is probably sending up a warning
flag -- it's a magic number. We actually tested it at other sensitivities,
other levels for this threshold, from two percent all the way up to ten
percent, with similar results. So we don't think it's a function of just this
magic number.
So similarly, a minor contributor has made less than five percent of the
total commits, so these are people that make fewer commits, and we think they
have less expertise. And the ownership for a component is the proportion of
the commits made by the person that made the most commits. And the graph
probably explains this a little bit better. So this is a graph for one of the
shared libraries in Vista. And I've ordered the developers by the amount of
contributions that they made. So for this one, the top contributor made
41.2 percent of the commits, so we say that the ownership is 41 percent.
Five developers made at least five percent of the commits, and then 12
developers made less than five percent of the commits.
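These three ownership metrics are straightforward to compute from per-developer commit counts. The developer names and counts below are invented; they only loosely mirror the shared-library example in the talk:

```python
# Sketch of the ownership metrics defined above, computed from a
# per-developer commit count for one component. Data is invented.
def ownership_metrics(commits, threshold=0.05):
    """commits: dict of developer -> number of commits to a component.
    Returns (ownership, major contributors, minor contributors)."""
    total = sum(commits.values())
    props = {d: c / total for d, c in commits.items()}
    ownership = max(props.values())               # top contributor's share
    majors = [d for d, p in props.items() if p >= threshold]
    minors = [d for d, p in props.items() if p < threshold]
    return ownership, majors, minors

commits = {"alice": 41, "bob": 20, "carol": 15, "dave": 10, "erin": 8,
           "frank": 3, "grace": 2, "heidi": 1}
own, majors, minors = ownership_metrics(commits)
print(round(own, 2), len(majors), len(minors))
```

With the 5 percent threshold, alice through erin count as major contributors and the three low-commit developers count as minor contributors.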
So the idea is that when you've got lots of these low-expertise people, it may
be a problem. So the first thing we did was a correlation analysis. We just
said, let's look at the ownership metrics and look at the correlation with
failures. And in this setting, we were actually interested in both pre-release
and post-release failures, because I think, and the results showed, that there
are things you can do to mitigate the effects of ownership.
So interesting finding, and we also included some of the base metrics that are
already collected -- yeah, Rob?
>>: [inaudible]?
>> Christian Bird: Oh, yeah, okay. So these are things that were used that --
bugs found during testing or at the end of the QA cycle that were put into
the issue tracker and had to be fixed prior to release.
>>: [inaudible].
>> Christian Bird: Yeah, yeah. Sorry. Thank you. Good question.
So there are some things that are known to be related to failures and we
included those as well. So the interesting finding was that the number of
minor contributors actually had the highest correlation of any metric that we
saw, with both pre-release and post-release failures. It's stronger with
pre-release, and we found this trend in all the projects and for all the
ownership metrics: pre-release had a higher, stronger relationship.
So this gives us the idea that we're on the right track, but one of the things
we found is related to Patrice's question earlier is that those binaries that
have a lot of minor contributors are also those that are more critical in the
system, are larger, and so this may not be as big a finding, just looking at a
[inaudible] correlation.
So we returned to linear regression to ask this question, because we want to
control -- want to look at the effect of ownership when controlling for other
factors that are known to be a problem.
So in our model, in all cases except for Vista post-release, the distribution
of failures was heavily right-skewed, and when you have that, you have to --
one of the assumptions in linear regression is that your residuals are
normally distributed.
So when our output was the log, the logarithm of the number of failures, that
assumption was met. So we did that transformation in all but the Vista
post-release case.
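The log transform for a right-skewed failure count can be sketched like this; the counts are invented, and the plus-one guard (a common convention, not necessarily what the study used) keeps zero-failure binaries in the data:

```python
# Sketch: log-transforming a right-skewed failure count before
# regression, as described above. Counts are invented.
from math import log

failures = [1, 1, 2, 2, 3, 4, 8, 25, 60]  # heavily right-skewed

# log(f + 1) guards against zero counts (an assumed convention here)
logged = [log(f + 1) for f in failures]

mean_raw = sum(failures) / len(failures)
mean_log = sum(logged) / len(logged)
print(round(mean_raw, 2), round(mean_log, 2))
```

After the transform the few extreme values no longer dominate, which is what lets the normal-residuals assumption hold.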
We built a base model where we had some factors related to size, complexity and
churn. And one of the problems when doing linear regression is that if you
have too many variables that are all highly correlated, you can suffer from
overfitting and multicollinearity, and that can be a problem. So we picked
the measures in each of these categories that had the highest correlation with
failures.
What we did was take this base model and then add additional ownership metrics. So we added the number of contributors to a binary and asked the question: does the predictive power of the model go up? Then we added ownership, and a number of other metrics as well; I can direct you to the paper. This is actually work that we just submitted to FSE on Friday.
And so to evaluate whether the models get better when adding these metrics, we look at goodness-of-fit tests and also the amount of variance in failures explained.
Okay. So I'll give you the results for Vista since you guys are probably most
interested in that and then high-level results for Eclipse and Firefox, but the
details are in our paper.
So with our base model, we were able to explain about 26 percent of the variance in failures for pre-release and 29 percent for post-release. When we add the number of minor contributors, it adds quite a bit: an increase of 20 percent for the pre-release failures and 12 percent for the post-release failures. When we then add ownership, it goes up, but not as much. We actually tested adding variables to the models in different orders, and it was fairly consistent: no matter what we did, minors had more of an effect than ownership. And minors were also always more significant than just adding the total team size.
So what this means is that when we take ownership into account, even including the things that we know are related to failures, the power of the model increases, which means ownership is having an effect beyond just those standard things. We did this analysis on the other projects as well.
So the interesting thing that we found was that for the projects that were more on the industrial side, we saw a stronger effect of ownership on quality. In this case, plus means that the number of failures went up when the metric went up; minus means that the number of failures went down. So higher ownership is good; more minor contributors is bad.
For Eclipse, you see a range like medium to strong in some places because we tested against six major releases of Eclipse. Although this is only three projects, so it's somewhat preliminary, we definitely see a trend here: in places that are more industrial, where there are ownership policies in place, violating those policies tends to lead to more failures.
Similarly, for the number of major contributors, those who, say, have more expertise: adding them didn't have a very large effect. Yeah?
>>: [inaudible]?
>> Christian Bird: Ownership is a function of the number of commits that you've made to the code, so how many times you've worked on the code. If you have worked on the code more, then your ownership is higher. The intuition is that you're more familiar with that code and less likely to make mistakes.
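As a sketch of how such metrics can be computed from commit history: ownership is each author's share of commits, and contributors are split into major and minor by a cutoff. The 5 percent cutoff and the names here are assumptions for illustration, not necessarily the study's exact parameters.

```python
from collections import Counter

def ownership_metrics(commit_authors, minor_cutoff=0.05):
    """commit_authors: one author name per commit to a component.
    Returns each author's ownership (share of commits) plus the sorted
    lists of major and minor contributors under the given cutoff."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    ownership = {a: n / total for a, n in counts.items()}
    majors = sorted(a for a, p in ownership.items() if p >= minor_cutoff)
    minors = sorted(a for a, p in ownership.items() if p < minor_cutoff)
    return ownership, majors, minors

# Hypothetical component with 100 commits from four developers.
commits = ["ann"] * 60 + ["bob"] * 35 + ["cal"] * 3 + ["dee"] * 2
own, majors, minors = ownership_metrics(commits)
print(own["ann"], majors, minors)  # 0.6 ['ann', 'bob'] ['cal', 'dee']
```

Here "ann" would be the owner, and "cal" and "dee" are the minor contributors the talk is concerned with.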
And then the base factors, the things that we know are related to failures, were significant across the board. So that wasn't surprising.
>>: [inaudible].
>> Christian Bird: So for Eclipse, we used plug-ins and we also looked at Java packages. And for Firefox, we actually did the same thing that you did for your cross-project prediction [inaudible] papers: we looked at kind of directory-level stuff.
So, kind of two findings. One is that ownership is related to quality, but also the process that you use has an effect on the relationship between ownership and quality.
So when we talked to people, one of the questions they had was: why do we see these minor contributors? What are the reasons? One of the things we found is that people would say, well, I made this change to this component, which I hadn't changed before, because I needed to: I was trying to fix a bug that was assigned to a component that I'm in charge of. So we'd see someone who is a major contributor to one component who may need to add a feature or fix a bug, and in the process of doing that say, oh, I actually need to go change this shared library. So he would be a minor contributor to a different component, and we would also see dependency relationships between them. This is something that people had hypotheses about; we looked, and we actually found it happened pretty often.
One of the problems in doing something like this is that you often see exactly what you're looking for. This is the type of thing where you buy a red car and then all of a sudden it looks like everybody's driving a red car, but it's just because you're looking for it. Right? So we wanted to be careful that we weren't suffering from this. What we did was a Monte Carlo simulation to see how often we would see this pattern in a random graph where there was no attention paid to these dependency relationships.
So what we did is we took the contribution graphs. This is an example contribution graph with some smart and dumb people working on things, and the ovals are the binaries. What we want to do is create random graphs with the same number of the same developers and the same distribution of major and minor contributions. So, like my advisor: if he made three major and two minor contributions, we wanted to keep that the same but randomize what he was contributing to, to see if this phenomenon, this major/minor relationship, was real.
So what we did is use what's called graph rewiring. You grab two edges at random, either both major contribution edges or both minor contribution edges, and then you flip them. What this gives you is a graph with the same number of binaries, where each binary has the same number of major and minor contributors, and the same people make the same number of major and minor contributions, but now the assignment is random. You do this a lot, N squared times where N is the number of edges, and you reach a sufficiently random graph.
You do that a bunch of times, generate thousands of random graphs, and see how often this major/minor dependency relationship shows up in the random graphs versus what we actually observed. If there's a big difference, then what we're observing isn't just noise; it's a real phenomenon that's occurring in these software teams.
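A minimal sketch of the double-edge-swap rewiring he describes, on a made-up contribution graph; the developer and binary names are invented. Each swap exchanges the binaries of two same-kind edges, so every developer keeps their count of major and minor contributions and every binary keeps its count of major and minor contributors.

```python
import random

def rewire(edges, swaps):
    """Randomize a contribution graph with double-edge swaps that preserve
    each developer's count of major/minor contributions and each binary's
    count of major/minor contributors."""
    edges = list(edges)
    for _ in range(swaps):
        i, j = random.sample(range(len(edges)), 2)
        (d1, b1, k1), (d2, b2, k2) = edges[i], edges[j]
        if k1 != k2:
            continue  # only swap two major edges or two minor edges
        if (d1, b2, k1) in edges or (d2, b1, k2) in edges:
            continue  # avoid creating duplicate edges
        edges[i], edges[j] = (d1, b2, k1), (d2, b1, k2)
    return edges

def degree_profile(edges):
    """Per-(node, kind) edge counts on both sides of the bipartite graph."""
    prof = {}
    for d, b, k in edges:
        prof[("dev", d, k)] = prof.get(("dev", d, k), 0) + 1
        prof[("bin", b, k)] = prof.get(("bin", b, k), 0) + 1
    return prof

random.seed(1)
observed = [("ann", "B1", "major"), ("ann", "B2", "minor"),
            ("bob", "B1", "minor"), ("bob", "B3", "major"),
            ("cal", "B2", "major"), ("cal", "B3", "minor")]
randomized = rewire(observed, swaps=len(observed) ** 2)
print(degree_profile(observed) == degree_profile(randomized))  # True
```

You would then count how often the major/minor dependency pattern appears across thousands of such randomized graphs and compare to the observed graph.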
What we found was that we saw this major/minor dependency relationship about 50 percent of the time, so we can explain about 50 percent of the minor contributors to binaries; we know why they're acting this way. In the random graphs, it was about 24 percent of the time. So just given the distribution that some people work a lot more than others, we'd expect at random to see 24 percent. What that means is we're really seeing a process, a phenomenon that's real, that's really one of the reasons that nonexperts are making changes to code.
So that's good, because now not only do we have a result, but we have some idea as to why that result is occurring, which makes it somewhat actionable.
We also decided to dig in a little bit deeper and replicate a study based on these contributions. Martin Pinzger was here a couple of years ago working with Nachi, and they built a prediction model based on the topology of the network of people contributing to binaries. They were able to predict which binaries were the most failure prone with really high accuracy: 85 percent precision, 90 percent recall.
And the way that they formulated the problem, random guessing would get you about 50 percent of the way there. So we asked the question: what happens when you remove the nonexpert people, these minor contribution edges? We replicated the study by removing them (I show them here as dashed edges), and when we did that, the precision and recall both fell dramatically.
So what we can conclude from this is that the topology introduced by these minor contributors is really important; it adds a lot of signal to the model. We also tried removing the major contributors instead, and the recall did not fall as much. So what does that mean? These minor contributors, these nonexperts, are part of the reason that you're having these failures, and they add to the predictive power of the model.
So with these findings in hand, we want to do what John Snow did, right? We want to take off the pump handle and see what happens. So we have some recommendations.
I actually wrote an internal report that I'm told went to management. I'd like to see if this has been put into practice, whether people have followed these recommendations, and what the results were.
So, three recommendations. The first is that when changes are made by minor contributors, those changes should be reviewed more heavily than other changes. Further, those changes should be reviewed by the people who are the major contributors, the owners, those with the most expertise. They're the ones most likely to spot problems.
Next, in cases where people want to make a change, so maybe you're working on a
component and you see that you need to change another component that you
haven't worked with because your component depends on that one, where possible,
communicate those changes to the people in charge rather than making the change
yourself.
So clearly you can't do this all the time, because people don't scale indefinitely, but because this is the situation where we see lots of failures being introduced, if you follow this, hopefully you'll introduce fewer.
And then lastly, Microsoft already uses a number of metrics to decide where to focus their [inaudible] resources at the end, but ownership should be added to these metrics, because we've shown that even when using those metrics, you can increase your prediction accuracy by adding ownership.
So these are the recommendations. And the next step is to see what happens
when you follow these recommendations. Does the quality actually get better?
Okay. The last thing that I wanted to share with you is understanding how open source software communities work. This harkens back to the question: does this really matter? Do people really care, or is this just an intellectual exercise? I think the answer is yes, because there are companies that are trying to compete with open source software, and there are companies that are trying to embrace and work with open source software. If we understand how these communities work, that can benefit both parties.
So first you start with a premise made by a guy named Eric Raymond, a kind of self-appointed spokesman for open source software. He has a very famous essay called The Cathedral and the Bazaar, where he characterizes industrial software as this well-planned, well-executed, modular cathedral, while open source is this bazaar where people just work on anything they want to, wandering around, talking to everybody, and wonderful high-quality stuff just emerges.
Well, Fred Brooks has something to say about this. If you're familiar with Brooks's law, it says that adding people to a software project that's late will only make it later. One of the reasons he gives as support for this is communication problems. Say you have a project with three people working on it; that's all fine and good. Well, you add three people to it, and now the number of potential communication paths grows quadratically with the number of people. So if you have people just working randomly, a project will begin to fall apart under the weight of its own communication. You have to have some type of organization, or you run into this N squared communication path problem.
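Brooks's quadratic growth is just the number of potential pairs, n(n-1)/2; a quick illustration:

```python
def comm_paths(n):
    """Potential pairwise communication paths among n people."""
    return n * (n - 1) // 2

for n in (3, 6, 12, 24):
    print(n, "people ->", comm_paths(n), "paths")
```

Doubling the team from three to six people grows the potential paths from 3 to 15, which is the overhead Brooks is pointing at.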
So if open source is really this bazaar community where everybody works in an ad hoc fashion, how do they deal with this communication problem that Brooks talks about? The question that I asked is: are there chapels of organization within this bazaar of open source?
So, our approach. We started by looking at the social network. There is no org chart for open source like there is here at Microsoft, so we looked at communication on the developer mailing list to get this structure.
Next, we took some clustering techniques from the areas of complex networks and physics and altered them to work on our social networks, for example to handle edge weights, where an edge weight may be the number of messages sent between people in the community. To give you an example of what we found: most of the networks were very large; this is one of the smaller ones. This is Perl, and what you see is that the edges represent communication between people, and the boxes represent the clusterings that we found using this technique.
Then we examined something called modularity. Modularity is a formal metric that comes out of physics. It ranges from 0 to 1, and values of about .3 and higher are considered modular, based on networks occurring in nature that are known to be modular.
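The modularity he refers to (Newman's Q) can be computed per clustering as the fraction of within-cluster edges minus the fraction expected under a random null model. A small sketch on an invented graph of two tight groups joined by one bridge edge:

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.
    edges: list of (u, v) pairs; community: node -> cluster id."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for c in set(community.values()):
        within = sum(1 for u, v in edges
                     if community[u] == c and community[v] == c)
        deg_sum = sum(d for n, d in deg.items() if community[n] == c)
        # observed within-cluster fraction minus the random expectation
        q += within / m - (deg_sum / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge: clearly modular.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, community), 3))  # 0.357, above the ~0.3 bar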
So what we did was take a number of open source projects and look at how modular they were: did they pass this threshold? We looked at these five projects, and in all cases we saw that the modularity was much higher than you would expect to see by chance. So what we find is that there actually are tight-knit groups of people within the organization of open source. There are all these teams that form somewhat organically, similar to teams in industrial development that are organized by managers.
And the last thing that we did is look at differences in discussion. We looked at messages that were more technical, that mentioned things like function names, files, and global variables, and then messages that didn't contain any of those more technical terms. Typically, when we read the messages without technical terms, they were talking about more process things, like should we release this week, or should Joe be given access to the source code repository. We called the latter process topics and the former product topics. What we saw is that the community was much more modular when talking about more technical things. People were drawn to different parts of the system, and they would talk to other people that were drawn to the same parts of the system.
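A toy version of the product/process split he describes, using a keyword heuristic; the regex and messages here are invented for illustration and are not the study's actual classifier:

```python
import re

# Messages mentioning code artifacts (function calls, source files,
# globals) are treated as "product"; everything else as "process".
TECHNICAL = re.compile(r"\w+\(\)|\b\w+\.(?:c|h|py|java)\b|\bglobal\b",
                       re.IGNORECASE)

def topic(message):
    return "product" if TECHNICAL.search(message) else "process"

messages = [
    "parse_config() crashes when util.c is built with -O2",
    "Should we release this week or wait for the docs?",
    "Can Joe get access to the source repository?",
]
print([topic(m) for m in messages])  # ['product', 'process', 'process']
```

Modularity can then be computed separately over the product-topic and process-topic communication graphs.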
So again, we've looked at the community to try to test some hypotheses, and what we found is that open source communities aren't just ad hoc. They're actually organically formed into teams. We did find that the teams are more dynamic than you typically find in industry. We found that when they're talking about more technical things, they are even more modular; they form into these tight teams. And when I looked at their actual development efforts, the changes that they were making, they were clearly making changes related to what they were talking about and to their organization.
And so in some ways, it's kind of a validation of Conway's law that says that
the communication structure and the architecture of a system are intimately
tied.
Okay. So that's what I've done. What do I plan to do? A few things. First, one of the things that I found is that the quality of the data that you're working with directly affects the power of the conclusions that you're able to make. So we had a [inaudible] paper [inaudible] last year where we looked at bias in data. We found that there's a lot of bias in a lot of the data that's used in research. In Apache, we looked at the bugs in a lot of ways, but one of the ways was by severity category.
And we found that for bugs that are marked more critical, in the blocker, critical, and major categories, we're not able to tie those bugs back to bug fixes as often as the bugs that are less important, in the minor and trivial categories. So this is clearly biased: certain types of bugs are overrepresented and others are underrepresented.
When we looked at its effect on bug prediction and hypothesis testing, we found that the amount of bias in the data has a direct effect on the ability of the technique to predict bugs.
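The bias measurement boils down to a linkage rate per severity: of the fixed bugs in each category, what fraction can be tied back to a fixing commit? A sketch with fabricated records (the numbers are illustrative only):

```python
def link_rate(bugs, severity):
    """Fraction of bugs of a given severity linked to a fixing commit."""
    group = [b for b in bugs if b["severity"] == severity]
    return sum(b["linked"] for b in group) / len(group)

# Made-up bug records; in practice, 'linked' would come from matching
# bug IDs mentioned in commit messages against the issue tracker.
bugs = [
    {"id": 1, "severity": "blocker", "linked": False},
    {"id": 2, "severity": "blocker", "linked": True},
    {"id": 3, "severity": "major",   "linked": False},
    {"id": 4, "severity": "major",   "linked": True},
    {"id": 5, "severity": "minor",   "linked": True},
    {"id": 6, "severity": "minor",   "linked": True},
    {"id": 7, "severity": "trivial", "linked": True},
    {"id": 8, "severity": "trivial", "linked": True},
]
for s in ("blocker", "major", "minor", "trivial"):
    print(s, link_rate(bugs, s))
```

Unequal rates across severities, as in this toy data, are exactly the overrepresentation problem the talk describes.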
And so right now, we've just submitted a paper to FSE where we actually tried to recover -- okay, it doesn't like me -- we tried to recover complete data sets and look at the effect of bias: how can we overcome this bias and get better prediction?
So this is one of the things that I'm trying to look at: how do we deal with these problems in data? And I should mention I think this is a problem here as well. Gina Venolia had a paper at ICSE 2009 called The Secret Life of Bugs, where they found that a lot of the data related to bugs is actually not being captured, so this may have implications for our ability to use this data, or rather for how much better we can do if we get the whole story.
And also, I've made some findings here, but once we know things, what we want to do is actually remove the handle from the pump, like John Snow, and see what we can affect, what we can change. So we can make changes to process and observe the results. In cases where we can't actually make a change, or it's not clear how to make the change, we can build tools to help people working on software.
Another thing that I'm interested in is studying different domains. In research, there are certain areas that are overrepresented and certain areas that are underrepresented, and I think that there are some areas of software development that are very underrepresented. If we haven't studied them, then we don't know which laws of software development are universal and which ones are context-dependent. The two areas that I think are hot right now and worthy of study are web applications and game development. I'll tell you why I think these may be very different from what we've studied in the past.
So with web applications, you have multiple languages running on different machines. On the client, you have JavaScript and HTML. On the back end, you have some type of database, and you may be using SQL for queries. In the middle tier, you may be running C#, C++, Java, or some other language. This is very different from an operating system or a desktop application.
In addition, there are different schedules. I know of one website called Stack Overflow, a programmer help website, whose goal is to release some new feature or bug fix on a nightly basis. Every night they're deploying their software again. This is very different from, say, Windows, where you may be deploying a new version every few years, or patches on a monthly basis.
So you can fix bugs very quickly, which means that testing may not be as much of an issue, but ultimately if there's a bug that brings down your site, all of a sudden all of your users are affected.
And then there's rich monitoring. I talked to Jim Larus [phonetic] when he visited UC Davis, and he said that one of the most important milestones of Windows development was when they added Watson and they were able to view what was affecting developers the most.
With web applications, you get an even richer view. You know exactly what's being used; if there's a crash, you know what led to that crash. You get a lot more insight into the uses of your application than you get in other domains.
In game development, there are also some key differences. The amount of content that ships relative to the amount of executable code is very different from a typical application; typically the content is an order of magnitude larger or more. Likewise, the team makeup is different, so in cases where there are tools that rely on aspects of the source code, you may not have as much source code.
You have people who, I won't say are nontechnical, but are technical in different ways, because they're not writing code; they're creating content, like 3D models and story lines. There may be some implications for coordination and how these people interact. In addition, the systems that the software runs on are different. Think of a web browser: you should be able to send it anything, and even if it's malformed, it has to be able to display something to the user.
With a game engine, the content is fairly static. You know, when you ship the game, the content that's going with it, so you can test both of them together. Nowadays there's downloadable content, but it's controlled, so even then, if you find out there's a bug in my game engine that is exposed by this new level, you can actually test that level and make changes to it so it doesn't expose the same bug in the engine. I think this may also have some impact on QA. It's very different from traditional development.
And then, I've looked at the team level, but I'm also interested in the individual level. Right now, you see this proliferation of new languages and of language abstractions added to existing languages to deal with things like multicore and some other problems. For example, LINQ was recently added to C#, Java is introducing closures, and Twitter just moved to Scala as their implementation language.
The people that are proponents of these say, look, this solves these problems and makes lots of things easier, but it's not clear to me that you just get something for nothing. When you introduce a new abstraction, that introduces some cognitive load on the programmer and may be difficult to deal with.
There are some projects that decide, look, there's a new feature of the language, but we're not going to start using it yet, because we don't trust it, or because it may be more difficult for our developers. So I'm interested in looking at the effect of these new abstractions and new languages on developers at the individual level.
And then lastly, I'm interested in coordination. I've only looked at the team level, but I'm interested in looking at the individual level. It's important to know when you should be coordinating and also when you shouldn't be. If you're always talking about what you're doing, and everybody is doing this, you just have too much noise and people stop paying attention. The literature so far has just looked at things like the number of lines changed and commits. I'm interested in looking at the semantics of the changes that people are making and determining when certain semantics imply that you should be coordinating your changes, and with whom, so you can direct communication to certain people rather than sending it to a mailing list.
So this is kind of where I'm going, and I've shared with you what I've done. I'm happy to answer any questions that you might have about things that I have presented here, things in my papers, or what I plan on doing. I appreciate you coming out and listening to what I have to say. Thanks.
[applause]. So I've answered everything.
>>: So why haven't the changes that you're proposing been evaluated [inaudible]? Is it because they were only proposed, or because there's no infrastructure to measure whether things really have an impact, or [inaudible]? Currently it's mainly observation.
>> Christian Bird: Uh-huh.
>>: But I want to see that with your changes, things really will go better. So --
>> Christian Bird: Yeah. Yeah. Okay. So when I was here over the summer,
actually I wrote up some internal reports. And I'm told by Nachi that they have actually gone to management and that they have been evaluating whether to put them into practice. Honestly, I don't know where it's gone from there, because at that point it's internal, and as not a Microsoft employee, I'm not privy to that type of thing. But if I come here, this is clearly something I'd like to do: see who we can convince to act on these recommendations. I doubt people are going to take broad, sweeping changes and say, now we're going to change how we do Windows. Right? That's pretty risky. But you can start on projects where it's maybe less intrusive, smaller projects, and see what the effects are. That's part of why I want to be here: in an industrial setting, you can recommend changes. Right?
Microsoft Research has a lot of reputation within the company. I know Nachi has done work that has actually impacted development, and that's why I'm here.
>>: You made a comment about the minor contributors. Instead of making the changes themselves, they could go to an owner or major contributor and say, please, I'd like these changes, won't you make them for me; versus making the changes themselves and having them carefully reviewed by the major contributor. What I'm wondering is what kind of impact that might have on the time required of the major contributors, [inaudible] hope that the major contributor might be helped by having other people also do stuff.
>> Christian Bird: Yeah, that's a good question. Right, people can't scale indefinitely, and if you have one person that has to make every change, that's going to be a problem. One of the things that I saw is that oftentimes there's a clear owner, but there are also other people who are not the owner but have a fair amount of expertise. When you increase the number of those people, the software quality went down, but not tremendously, and nowhere near as much as with the number of minor contributors. So I think it's not just one person that you can work with, but a group of people.
But clearly I wouldn't expect that there would be no minor contributors to any component in the next release. At least if you're aware of it, when there are [inaudible] cycles, or a situation where, look, I have time to review five changes and there have been 20 made, now we can help you prioritize where to put your resources, and that's helpful.
>>: Do you know, I mean, for those -- the bugs introduced by minor
contributors, who fixed them? Was it the major contributors [inaudible]?
>> Christian Bird: Okay, so I should be very careful here. I don't know for certain that the bugs were introduced by those people. Talking with people makes me think this, based on anecdotal evidence. The data that I had to work with didn't indicate that this person introduced this bug. It's hard, when you see a bug fixed, to tell who introduced it, because it's not always the same lines. So we have strong reason to believe it, but I don't know that for sure. And because I didn't have access to that data, I don't know who made the fixes either.
I will say this: I know that some companies tend to, at the end of a release cycle, put everybody to work on everything. Like, you have some free time, go fix this bug in this one thing. In Vista they don't do that; there's still ownership at the end of the cycle.
I should also point out that we looked at timing, to see whether maybe all the minor contributions were coming at the end, which would be hiding the real issue: that it's not the minor people, it's just all the stuff made in a flurry at the end. We didn't see a large effect. So, Vista, I think it was like a six-year cycle. The median difference in commit times between minor and major contributions was about 60 days, and it was normalish. If you look at the distributions, there's this big, heavy overlap. So that gives us some idea of what's not causing the problem.
>>: So the data sources that you used for your studies were basically not
intended as measures for studies, right? I mean, you were basically looking at
things like the bug database and check-ins.
>> Christian Bird: Yeah. [inaudible] some other purpose, and we're trying to use it for our own.
>>: Exactly, exactly. So speculate for me. If you got very high-level buy-in from an executive, so that you could put any instrument into the work process you wanted and take measurements different from what you can get from the archives, is there anything you would want inserted? What would it be?
>> Christian Bird: Let me think about this. So [inaudible] Brandon. He said that there are some groups that actually are buying in and saying, well, what would you need to make it better? So I think this may happen.
What would I like to see? One is, I'd like to see why changes are being made. If a change is being made, can we say with confidence: this is a bug fix, this is a feature addition, and who exactly is making this change? That's probably the most important.
Another piece I'm really interested in, from looking at open source, is that I'm just treating the repository as a black box. We know that when you're doing development, it's not all on just one line of development. People branch the code, or work on -- I don't know what the terminology is here -- but not everybody is working on the same code base, right? I'm working on my portion; you're working on yours. At some point we're going to merge those in, or abandon them if they go wrong. Right?
So I'd like to be able to look at that level of granularity to get a more complete picture. I think that can tell us something about how you should work and coordinate your changes. It may be that if I'm working in a small team on our own branch, that's better than everyone working on one code base; at least I have evidence in the open source world that it's much better. And I'd like to actually empirically show that here at Microsoft, so that's another piece.
I think those are the two off the top of my head, but give me till tomorrow and I'll probably have a longer list: you know, if you could have a candy store, what would you put in it? Because the more data you have, the more interesting the questions you can answer.
>>: No other questions [inaudible].
>> Christian Bird: Thanks.
[applause].