>> Chris: Okay. So we're pleased to have here Abram Hindle who is visiting us from -- he's right now a postdoc at UC Davis with Prem Devanbu and Zhendong Su. And he just recently got his Ph.D. from Waterloo working with Mike Godfrey and Rick Holt. So he's going to tell us about evidence-based software process recovery. So take it away, Abram. >> Abram Hindle: Thank you, Chris. So thanks for the introduction. So in this presentation really what we're trying to do is we're trying to take a theoretical diagram that tries to explain maybe what software development processes are and what to expect inside of a project such as this unified process diagram. We're trying to take the theoretical diagram which never was concrete, never was based on actual data, and extract it from a real project and see what it actually looks like based upon real data, based upon things that were recorded. So when we're talking about process, we're really talking about software development processes, and these range across a wide variety of different kinds of ideas of what processes are. So there is the prescribed process. So these would be things like you have to follow test first, we're following scrum, we have little scrums in the morning, or we use story cards to define requirements, things like that. Many of these prescribed processes are oftentimes based upon formal processes, things like a waterfall model, scrum, XP, the unified process, anything like that. But there's also a whole set of sort of structured behaviors which are basically the process of the project. They're ad hoc processes. These are basically things that maybe as developers we've come to an agreement that we're going to do, we didn't write it down, the manager maybe didn't specify it, but it's sort of this default behavior that we follow. So the process I'm talking about, basically we cover the whole range here from the formal to the prescribed to the ad hoc. So on the formal side, there's actually quite a few. There's the waterfall model which suggests how development could be staggered, where you have requirements going into design and analysis going into implementation and eventually deployment. There's also the iterative view, such as the spiral model where basically you repeat this waterfall model a couple of times over, so you keep reiterating over what you're doing and you have basically multiple iterations. Then the one we're mainly focusing on in this presentation is the unified process. And the unified process is basically a model where you have multiple disciplines and you take part in these disciplines in different proportions over time. So this diagram here, the unified process diagram, this diagram we'll be coming back to where it shows your disciplines, like your business modeling, and then it shows basically the amount of effort over the existence of the project for that one discipline. So you can look at, say, one release and see what the effort was around at that time, the proportion of efforts. Okay. So the world that we're really dealing with if we want to actually look at, recover, and extract these processes, is a world where developers -- they want to follow a software development process but they -- in order to do so, they have to exhibit behavior. And this behavior is exhibited in order to fulfill a purpose or task which composes the software development process. When they do something, when they exhibit a behavior, sometimes they produce evidence. Lots of the time this is lossy. It doesn't contain all the information. 
But this evidence can be used to suggest the underlying processes, purposes, tasks and behaviors. So the world that we're dealing with in order to actually recover these processes is where we have this evidence that was produced from the behaviors they followed in order to fulfill their processes, fulfill their purposes, and then we basically have this shadow world which we try to recover the behavior, recover the purposes and tasks, and use those behaviors and tasks to compose underlying software development processes. And this all comes from the evidence that developers produce. But the evidence is lossy, so we don't see everything, like if they have a meeting and it's face-to-face, maybe there's minutes, maybe there's not, maybe they have a talk in the hall. We don't see everything. So for the rest of this presentation, I'm going to break it down into basically behavior, intents and purposes, and software development processes. And we've got four basically different kinds of research that we integrated for this work where we did release patterns where we basically looked at the types of files changed and the reasons why you would change those files. And so that dealt with behavior and the process, because we also correlated it with how they acted around release time, was there a freeze, things like that. And then we had the large changes study where we looked at the purpose behind large changes and tried to categorize them by that. We also used topic analysis which helped describe the behavior and also helped elicit some of the intents and purposes, followed by our summary of the processes, the recovered unified process views. That was mostly about software development processes. Okay. So who would actually want to see what was going on in a project from a process standpoint. Well, there's a variety of stakeholders ranging from managers who aren't really intimately involved in the code base who might not be sure what's going on, they might be another tier above another manager, things like that; programmers who are basically shunted around between projects and who basically fix messes. So I got a friend in Victoria, and he's basically one of these. He goes between Java projects, and he's got to figure out what the project's about, get the work done, and then get out, because, well, he's the valuable guy and they don't want to waste him on all the small projects. But then there's the new developers who are unsure of how a project is actually being done, what are the protocols and procedures, how do you, say, check in code, things like that. As well as there's other kinds of stakeholders who are not necessarily very code driven, such as investors or people trying to acquisition a company. They might be interested in what underlying software development processes were being followed during the development of that project within that company that they're interested in purchasing. And as well as there's ISO9000 where you want to document your software, you want to get the certification, it's a big pain, and you don't really have a lot to work with. So those are the main stakeholders who might be interested in this kind of recovery. >>: [inaudible] >> Abram Hindle: Thank you for [inaudible]. >>: Couldn't figure out what it was. Thank you. >> Abram Hindle: Okay. So one example that a manager might have is they might propose a process. They might propose it as a mixture of workflows over time. 
And then when they recover this process, they can go ahead and actually see if it matches their proposed process. So they could look at the different -- they could look at the similarities and differences between the processes they recovered and the proposed, and then they can investigate further what those differences actually were, why weren't their expectations met. They could just be straight-up wrong, but it'd be interesting to know. So that's one potential use. So how do we get this kind of information, how do we figure out what's going on. Well, we could ask the developers. We could ask the people inherently involved. But there's issues there. You might not have access to them. They might not be around anymore. Basically talking to developers, I think, frankly, annoys many of them, especially if it's not really important to them. And it also takes up a lot of time. So if you are not going to interview developers a lot, how are you going to get your information. Well, we could rely on software repositories and the data left behind. We could try to summarize this in a kind of unified process manner where we basically break down a lot of the information events here into the different disciplines such as business modeling, requirements, design, implementation, and then we could look at how these things change over time and what events relate to these different disciplines and workflows. So I'm now going to cover a little bit of previous work and related research to this -- to our work. So in terms of mining software repositories and stochastic processes, Israel Herraiz, et al., they looked at the distributions of, say, metrics over time, things like McCabe Cyclomatic Complexity and other things, and a lot of code metrics. And they looked at many, many open source projects and they found a wide variety of distributions, most of which were log normal, double Pareto, things like that which were sort of nasty exponentials. And they also found that many of these metrics correlated very heavily with lines of code. There was this laws of software evolution by Manny Lehman, and then people tried to take his laws of software evolution and either validate them, such as in Turksi, or invalidate them in some cases, as in Tu and Godfrey, where in Tu and Godfrey found that -- I think it was the ninth law of software evolution that said that the growth would be sublinear due to complexity. They found in the Linux kernel, at least due to copy and pasting drivers, that the growth of the Linux kernel was superlinear. So it was a little bit above linear. Other work we rely upon is business processes. This is really just another kind of process. It's a little too formal for, say, information work and development. But we rely on it nonetheless. So Van der Aalst basically would pose a business process as either a finite state machine or as a Petri net where you push a token through, say, a finite state machine. Then there is -- we rely on a wide variety of analysis, a little bit of social network analysis such as work by Bird [phonetic] et al., as well as statistics, some natural language processing, mostly at the level of counting words and doing topics. We also use machine learning to do classification. We rely heavily on time series as the unified process diagram does as well. So this work comes out of work by Van der Aalst on process mining where he would monitor a live business process, like buying Chicken McNuggets at McDonald's or setting up insurance clients, things like that. 
They would monitor and measure the process, and then they'd formalize either as a Petri net or finite state machine. Cook, et al., took that further and they applied this to software. So they tooled the process, they modified the process to actually get more information so they could observe it and extract these Petri nets and finite state machine representations of the process. So our approach is a little bit different. It's the mining software repositories approach where we take the information left behind and then we analyze that after the fact without access to the process itself, without tooling the process to get more information. And we try to get things like statistics, the underlying distribution of effort per discipline and other things out of that. So that's really what process recovery is about. And, again, what we're going to try to do is we're going to try to summarize the information extracted from those repositories in a unified process kind of manner, mostly because it's been used in software engineering textbooks to explain that we do a bunch of things at the same time in software development but we might have different proportions of these disciplines at different times. So you might not be doing a lot of requirements later on unless you're adding new features, things like that. Sorry. >>: Two questions about this. One is was this descriptive or prescriptive when it was created? >> Abram Hindle: What's "this"? >>: The model that we're looking at right now would be [inaudible]. >> Abram Hindle: Oh, it's -- I think it was descriptive, trying to explain what you'd probably see. >>: And then how does this handle hierarchical projects? Like Office is made up of six different projects that are all synced together, and each one might have its own set of phases at this point and they're all kind of unified [inaudible]. >> Abram Hindle: Well, I think you can have multiple views. So you can do your subprojects as separate ones of these, and then you can have an aggregate view where this stuff would probably be thrown away because it no longer syncs up. But at the very least you'd have the proportion [inaudible]. So I'll get into that later. But I think it can be applied and you don't need to apply it to the whole thing. It can be applied to subviews. So I'll get into that. Other work we heavily rely on is the whole mining software repositories field where you basically mine repositories like version control and other repositories in order to get information about what was going on in a project at a certain time. And oftentimes this research does certain things like try to predict faults and also expertise, basically, like who would be an expert in a certain part of the project based upon their past history. So in this work we rely mainly on three software repositories. We rely on discussion and mailing lists, we rely on bugs in the bug tracker, and we rely on version control systems and the revisions to source code and other files in those repositories. So just a quick overview of what mailing list archives are consisting of. So basically mailing lists are often topic driven. Some are user based, some are development based, some have a different topic like off-topic discussions. And they're basically discussions between different people and these discussions occur over time. And these discussions often reference what other people have said and sometimes reference documents. They also have a bunch of metadata in the header, and they have big natural language text in the body. 
And this stuff can actually reference other things. So it's oftentimes quite difficult to parse the body because you usually need something like natural language processing or some kind of way of understanding it. Followed by that would be bug trackers, which share some similarities. But I'd say the main difference between, say, a mailing list and a bug tracker is that you have a bug ID, they've named them. So not only do you have a subject, you've named basically the whole discussion itself with a bug ID. And they're sort of like mailing lists, but usually different software, usually a little bit different. But you still have this discussion between people about what to do, referencing artifacts. And that all occurs over time. Then we had the version control system where we have authors over time making commits to the code base. These commits are composed of revisions. These revisions basically are changes to separate kinds of files, like build or configuration scripts, sometimes documentation, sometimes the actual test. Oftentimes source code. So those are the three main repositories we rely upon to extract information, at least in this study. You don't -- if you had a documentation repository or you had any other information, it might be useful. So other work we heavily rely upon are source code metrics, whether they're straight source code, whether they're evolution metrics, such as like information about the deltas or, say, coupling metrics where you measure how much files change together. Other work we heavily rely on is topic and concept analysis. So Poshyvanyk and Marcus heavily used LSI and somewhat LDA to figure out what entities are associated with certain concepts. So a lot of this was unsupervised and automatic where these topics -- where they're extracted from source code or natural language text would be extracted from the repository. And then others such as Lukins, Linstead and Maletic and Hindle would actually use LDA to apply it to natural language text, whether it was in the version control commit comments or it was in the bug repositories. So Lukins actually had an interesting paper where they would use LDA on the bug tracker in order to find -- in order to query it for template bugs. So you provide a template of what your bug sort of looks like, and then you ask the bug tracker and it comes up with a similar document. So it was document retrieval. Okay. Other stuff we also rely upon is quality-related nonfunctional requirements. So Cleland-Huang has published a lot on mining NFRs from source code and requirements documents. And Ernst, Neil Ernst, has also published on just basically mining these quality-related nonfunctional requirements from mailing list histories and version control histories. Okay. So that was a lot of the work that we rely upon in this. So sorry about its length, but let's get down to the brass tacks of software process recovery itself. So we rely on subsignals for software process recovery. We rely on information extracted from version control systems and things like this. In the case of release patterns, what we do is we take the revisions of the version control system over time and we basically partition by file type. So if it's a change to source code, we suggest it's a source code or implementation revision. If it's a change to your benchmarks, to your unit tests, anything like that, we say it's a test revision. If it's a change to your build files, Automake, Autoconf and your project files, it's a build change. 
And if it's a change to your user documentation or your developer documentation that's recorded in the repository, we suggest it's a documentation change. Now, this doesn't sound that useful, but it actually is useful once you aggregate it in a large -- especially with respect to events like a release. So what we found was that in certain projects, like, say, MySQL, if you looked at these signals across release time, you could get a general behavior. That behavior was actually consistent across the release types. So minor releases in MySQL would look the same. They'd have the same kind of behavior around release time for source code. They might have a lot of changes and then it would taper off after. Where something like PostgreSQL did more of a freeze, where they would have no real source code changes, maybe a bunch of test changes and minor build changes before a release, then afterwards they'd have a huge spike in source code changes because they'd integrate all that cool stuff they were working on, which they couldn't have integrated before the freeze. So you could see some kind of behaviors, especially process-related behaviors, based upon these four signals in a very simple manner. >>: When you just mentioned the story about Postgre, how do you -- was that story derived straight from these kinds of diagrams, or are you using some of your own knowledge about how software development works [inaudible] to embellish that story? Is that grounded or is that -- >> Abram Hindle: So it was based upon the data, but also in order to see it, I guess you got to know [inaudible] occurred. So I saw it, I thought [inaudible]. >>: But you didn't ask that ->> Abram Hindle: No, I didn't ask that. >>: [inaudible] >> Abram Hindle: No, I didn't ask that. So the usefulness of this is that it's just relatively simple. It's basically partitioning by file type, and you get these four different signals suggesting what kind of behavior is happening. And you can do neat things like look at the correlation between the tests and source code. You could ask things like are we doing tests first, things like that. >>: Did you try any other splits to see whether that gave you more interesting signals, or are these the four that were the best ones? >> Abram Hindle: In this study, we only did these four. But if you have -- and this was on file type. So you could also split on author, because authors are pretty heavily loaded, especially in open source projects where the top three authors are really responsible for everything. So if you would subsplit these by author, it might give you more information. >>: How do you differentiate between source code and test code? >> Abram Hindle: Well, these don't have to be straight-up partitions. They can overlap. But you can just say anything that's test code is not source code. It's up to ->>: I guess my -- how do you identify test code? >> Abram Hindle: Oh. Okay. So in Perl you look for .T files. The quick way, the dirty way, which requires no supervision, it's dangerous, is look for test. But in, say, something like a database system, there's a lot of things which are test that are not test code, and especially in, say, a package like R, there would be a lot of cases where tests would be actual code. So you got to be a little bit careful with test code, and mostly you're relying on the idioms that the programmers use to identify what parts of the system are test. >>: So it's pretty project specific. >> Abram Hindle: And language specific. 
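To make the file-type partitioning concrete, here is a minimal Python sketch of the idea; the category names, extensions, and path patterns below are illustrative assumptions, not the exact heuristics used in the study.

```python
# Illustrative sketch: partition version-control revisions into four signals
# (test, build, documentation, source) by matching changed file paths against
# per-category patterns. The patterns are assumptions for illustration only.
import re
from collections import Counter, defaultdict

CATEGORY_PATTERNS = {
    # test is checked before source so test code is not double-counted
    "test":   [r"\.t$", r"(^|/)tests?(/|$)", r"benchmark"],
    "build":  [r"(^|/)Makefile", r"\.mk$", r"configure", r"\.am$", r"\.ac$"],
    "doc":    [r"\.(md|txt|texi|html)$", r"(^|/)docs?(/|$)", r"README"],
    "source": [r"\.(c|h|cc|cpp|hpp|java|pl|pm|py|sql)$"],
}

def classify_file(path):
    """Return the first category whose patterns match the file path."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(re.search(p, path, re.IGNORECASE) for p in patterns):
            return category
    return "other"

def revision_signals(revisions):
    """revisions: iterable of (date, [changed file paths]).
    Returns {category: Counter mapping date -> revisions touching it}."""
    signals = defaultdict(Counter)
    for date, files in revisions:
        for category in {classify_file(f) for f in files}:
            signals[category][date] += 1
    return signals
```

Plotted around release dates, those four counters give the kind of source, test, build, and documentation signals being discussed here.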
So basically for this I had a bunch of language-specific test [inaudible]. So like .T for Perl, benchmark, and there's a few others. So the problem with tests is there's also different kinds of testing [inaudible] like regression tests, and benchmarks would still be considered tests by a lot of people. And in a database system, benchmarks are really important. So if you've got a performance-oriented project, you might have to get more specific with the tests. >>: Have you ever partitioned data into like the different types of tests to see if there's any different [inaudible] like unit tests versus [inaudible]? >> Abram Hindle: No. You're making me feel stupid, because that sounds like a great idea. So this is just a quick example of applying to SQLite over time. So from 2001 to 2010. You can see there's lots of source code revisions. There's a bunch of test revisions. There's hardly any documentation revisions. And there's a couple build revisions. So this is just a concrete view of the source test build documentation revisions. The next thing we did was a large -- a study of the large changes in version control systems, and we basically categorized them with the three Swanson maintenance classifications, so we manually looked at them across I think 18 open source projects. >>: How do you define a large change? >> Abram Hindle: It was top 1 percent in size. So size of lines changed. What was interesting was for the large changes, the vast majority weren't really that Swanson orientated. They weren't really maintenance orientated. They were implementation. So some people would suggest an implementation would belong in adaptive. But these were explicitly implementation, like lots of times larger merges from another project, a totally new feature, things like that. We also found that while the Swanson maintenance classifications weren't really version control specific, we were dealing with version control. So we had to deal with things like copyright changes, legal changes, comment changes, things like that, stuff that never would change the execution of the code but existed nonetheless. And so we applied it to many projects. And not all projects were consistent. I guess relevant to here, the Samba project, which is basically Linux version of the Windows network filesystem, they had a ton of adaptive changes because they had to adapt a lot to anything that changed in the Windows filesystem. So some are more consistent. Like Firebird, which is a database system, was pretty consistent across everything. >>: So does Evolution have no bugs? >> Abram Hindle: Evolution. >>: There's no corrective. >> Abram Hindle: No big bugs. No like hundred-line bugs. Yep. >>: So what do you do with this? It's pretty. >> Abram Hindle: Oh. >>: Like what's -- so at the beginning you talked about applications, but you -- what's the application for this diagram? >> Abram Hindle: Oh, this diagram is to show you sort of what exists in the open source world. So we took the previous information, the manual stuff, and we checked to see if we could apply machine learners to automatically classify the changes. >>: What's the user -- who's the user that's looking at this and what is their need? >> Abram Hindle: Okay. So the first user would be the researcher who learns they shouldn't throw away the big outliers, because big outliers can change architecture. So that's important. 
The second user would be more of an end user with the previous data where this data would be used to train the learner which would automatically classify their changes. >>: [inaudible] >> Abram Hindle: A machine learner. >>: What about a human? >> Abram Hindle: Well, they'd at least get to see an overview of what the changes were thought to be. So you could use the learner to tag a change with, say, adaptive or corrective. And before they look at the change, they already see it's been tagged with adaptive or corrective. This would allow querying, allow them to scroll through changes and decide I only want to see the bug fixes or I want to see the [inaudible] changes or what were the last set of license [inaudible]. >>: How difficult is it to train a learner to do that? Seems like deciding between whether the change is like adaptive [inaudible] tough. >> Abram Hindle: So we did single classes here. We learned the hard lesson that this is software, categorization is not so hard and fast. So you want to use a multilabeled one. So that was the thing we really learned. >>: What features did you ->> Abram Hindle: Oh. The features were -- they were actually really interesting. They were file type, author, the text in the change commit. I think that was about it. And what we found was that you could throw away all the files changed and keep only the author, or you could keep all the files changed and throw away the author. There was so heavy a correlation in shared information between those two that you could choose one or the other. The author was very, very important to determining what this was, which suggests that in some projects ->>: [inaudible] with the projects? >> Abram Hindle: Yeah. In some projects certain authors wear a few hats. >>: How do you validate the training for your learner? I mean, how did you validate that? You found those labels, right? >> Abram Hindle: Oh. We couldn't really validate too well that we got the labels right in the manual labeling where we went through and we labeled. So me and Daniel German did look at each other's labeling. We didn't ask him. We didn't go to developers. We didn't ask them. >>: [inaudible] randomly sample or something for the automatic labeling? >> Abram Hindle: Yeah. So we labeled a bunch. And then we trained the learners. And then we tested the learners using [inaudible] validation to see how well they did against each other. And they didn't do super great. They were like area under the ROC curve, like .6 to .8 depending on the project. And that converts to a [inaudible] score about -- I guess it's easier to think about the letter grades. So an 80 is okay whereas .6 would be like a [inaudible], whereas .5 ->>: Maybe in Canada. It's a D here. >> Abram Hindle: It's a D? Wow, you guys are tough. Okay. So then the other work we used were developer topics. And so what the developer topics were is we took the change log comments and we pushed them through an unsupervised topic analysis engine, like LSI or LDA, and then we got the topics out. So we say LDA, give us 20 topics for this input text, and it gives us 20 topics. And these topics are basically word distributions. So basically counts of words, which isn't really that useful. So what we did is we applied it per month, and then we looked to see if any of the topics reoccurred in the consecutive months. And what we found was most topics don't reoccur. About 80 percent don't reoccur. They're very specific to that month. 
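A rough sketch of the per-month topic analysis described here, assuming gensim's LDA and a simple top-word-overlap rule for deciding that a topic recurs in the next month; the topic count, top-word count, and overlap threshold are all assumptions rather than the study's actual settings.

```python
# Sketch: run LDA over each month's commit comments, then link a topic to a
# topic in the following month when their top words overlap enough (Jaccard).
# Topics that never link forward correspond to the one-off, month-specific ones.
from gensim import corpora, models

def monthly_topics(docs_by_month, num_topics=20, topn=10):
    """docs_by_month: ordered mapping month -> list of tokenized comments.
    Returns month -> list of top-word sets, one per topic."""
    topics = {}
    for month, docs in docs_by_month.items():
        dictionary = corpora.Dictionary(docs)
        corpus = [dictionary.doc2bow(d) for d in docs]
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
        topics[month] = [{w for w, _ in lda.show_topic(t, topn=topn)}
                         for t in range(num_topics)]
    return topics

def recurring_topics(topics_by_month, min_overlap=0.3):
    """Return (month, i, next_month, j) links between similar topics."""
    months = list(topics_by_month)
    links = []
    for prev, cur in zip(months, months[1:]):
        for i, a in enumerate(topics_by_month[prev]):
            for j, b in enumerate(topics_by_month[cur]):
                union = a | b
                if union and len(a & b) / len(union) >= min_overlap:
                    links.append((prev, i, cur, j))
    return links
```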
Sometimes the topics would mention even a bug number, and it would be a bunch of documents, a bunch of changes related to that one bug, but it wouldn't occur in the next month. Whereas there were some topics which occurred across time. And we looked at these big long topics because they were sort of interesting. And we found that they seem to deal with -- sorry. Did you ask ->>: [inaudible] >> Abram Hindle: Yes. >>: Can you explain the colors? >> Abram Hindle: Okay. So gray is never repeat. Not gray is does repeat. >>: Okay. And what about the boxes? >> Abram Hindle: Each box is a topic. And if you had a PDF viewer, you could zoom in and you could see the words. The top ten words in that topic embedded in there. So it's technically like a zoomable graphic. So this sort of illustrates how fundamentally annoying the output from, say, LSI or LDA is. Because these are the top ten words. It will give you many more words. >>: So how do you identify -- like you've got this big brown box at the top [inaudible] I doubt it was the same ten words every [inaudible]. >> Abram Hindle: No, it wasn't the same words. So we didn't have a threshold like they did. >>: Oh, okay. >>: Why are some boxes lighter than others? >> Abram Hindle: Like this box? >>: Yeah, or the boxes in the lighter brown, some are narrower and some are wider than the gray boxes, let's say, or ->> Abram Hindle: Because if they -- if the topic occurs in the next time window, we join the boxes. So this box occurs from 2004 July to 2006 March. >>: Oh, so that's why the big box. >> Abram Hindle: Yeah. So that's a topic that spanned a long time. >>: So you're seeing a lot more words for that topic than you are for any of the gray topics. >> Abram Hindle: Well, it's a lot of topics joined together over time. So these are topics that were similar to each other joined together. And I think this one was correctness orientated. So a lot of the words dealt with bugs and bug fixing and fixes. And so based upon that observation, we felt, well, this diagram in itself is really not that useful until we interpret the topics, right? Like right now this is just some giant matrix, right? It's not really that fun. And so what we tried to do is we tried to label the topics. And we had one interesting method which was unsupervised where we provided a dictionary of software engineering terms related to nonfunctional requirements like portability or, say, reliability, and then if a topic contained any of these terms, we just labeled it with the concept, portability or reliability. And we had five NFRs that we labeled the topics with. And this allowed us to produce a similar diagram but with labels. So this one was maintainability, that one was portability, and there also was -- there were topics that dealt with more than one issue. So they'd be maintainability and portability, things like that. >>: Did you see anything that came up relative -- like periodically relative to the release cycle, anything like that? >> Abram Hindle: Not specifically that I can remember. But what was interesting was that lots of the repeating topics were actually related to the nonfunctional requirements. So they were issues that cross-cut a lot of other projects, issues like performance, maintainability, portability, functionality, efficiency. And by using a very simple dictionary based upon mining ISO9126, which is software quality standards something or other, I don't remember, they had a bunch of words in there. 
We stole those words, put them into this dictionary, threw it at this, and it worked out. Then we also tried, well, let's use WordNet. WordNet was interesting. It had similar performance, but WordNet would include neat little words. So for things like efficiency, it'd include theater. The reason why the WordNet would include theater is because performance and theater go together in the English language, but for software, it's not really that meaningful. So we thought it'd be really nice having like a software engineering WordNet where it was more domain specific, and then underneath that having a domain-specific WordNet would be cool too, like for databases. >>: Did the authors of these topics correlate similarly to the previous study where you were looking at who made similar types of changes, large changes to the project where you said you could probably just save that guy if he makes [inaudible] could you do the same thing, this guy, he always talks about efficiency? >> Abram Hindle: So we didn't do that, but it'd be pretty simple to go through the file just for authors corresponding [inaudible]. That's a good idea. Okay. So we had the labeled topics which would -- what was also neat is these topics are related to documents. So when we get this topic, it's back-related to the documents. So basically what LDA tried to do is they tried to say, hey, look, you can compress these documents you gave me by these mixture models of topics and via that you also know which documents relate to which topics, thus you know, given this topic, what documents are related. So you can use that to tag documents as well. Okay. So we did a bunch of work which didn't really seem all that coherent. But we had to string it together. So what we really tried to do was we tried to take all that previous work and we tried to integrate it in order to take this theoretical diagram of what software process was and produce a practical version of it where we took those previous signals, those previous events and information that are tagged, and we produced a practical view of it based on aggregating those. So as an example, we had the unified process requirement signal. So we had a requirements-based word bag. So basically with that dictionary that I mentioned before, we grepped through three repositories: version control, bugs, and mailing list. And then we also looked for the NFRs we were able to grab, such as usability and functionality. So this was open source code. We weren't really sure where requirements were discussed for the most part. In many cases open source projects don't really have a lot of requirements other than clone that other project. So sometimes a requirement's already implicit. Yeah. And sometimes they have external requirements documents. So we suggested that the UP requirements view, the requirements signal would be this mixture model of these. And in this case we just have coefficients of 1, so it's a summation. So what this is is it's basically the events that are related to requirements over time, pulled from three repositories. And we haven't done any kind of real mixing of it other than summing the events. And I'm not saying this is hard and fast. I think if you had a project and you knew that, say, you actually had a documentation repository where requirements documentation was or you had story cards and you had a signal where you knew how many story cards you had over time created, removed, things like that, well, you'd want to include that in the requirements signal. 
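As a sketch of that word-bag signal and the coefficients-of-1 summation: the dictionary terms below are placeholders, not the ISO 9126-derived list, and the event format is an assumption for illustration.

```python
# Sketch: count events (commits, bug comments, mailing-list messages) whose
# text matches a requirements/NFR word bag, per month and per repository,
# then sum the three per-repository signals with unit coefficients.
import re
from collections import Counter

REQUIREMENTS_TERMS = ["requirement", "specification", "shall",
                      "definition", "usability", "functionality"]  # placeholder terms
PATTERN = re.compile("|".join(REQUIREMENTS_TERMS), re.IGNORECASE)

def word_bag_signal(events):
    """events: iterable of (month, text). Counter of month -> matching events."""
    signal = Counter()
    for month, text in events:
        if PATTERN.search(text):
            signal[month] += 1
    return signal

def up_requirements_signal(vcs_events, bug_events, mail_events, weights=(1, 1, 1)):
    """Weighted sum of the three per-repository signals (coefficients of 1 here)."""
    combined = Counter()
    for w, events in zip(weights, (vcs_events, bug_events, mail_events)):
        for month, count in word_bag_signal(events).items():
            combined[month] += w * count
    return combined
```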
So we just -- we had to do something that would -- we tried to produce something that would look like the unified process model on something that wasn't necessarily unified process. Because our purpose with using the unified process diagram was to communicate, was at least a first step to show that, well, this could be done in a certain manner. >>: So there's [inaudible] 2001. >> Abram Hindle: Yeah. >>: What does that mean? What happened? What does that mean? >> Abram Hindle: Okay. I think this one is FreeBSD. And in 2001 I grepped around because I was worried about that spike. Because if it's not requirements related, it's relevant. So there actually was requirements-related events occurring at the time. And one of them was they were trying to meet up with Single UNIX Specification, Version 2. So they were trying to conform to that external requirements document. They were mentioning that in the version control. Another reason requirements got ticked up was in terms of one of the requirements words I think was definition. And they were converting to GCC 2.96 at the time. And they mentioned that they were changing function definitions. So that's probably the majority of the peak. But there was requirements-related stuff there. So the design signal looks pretty well similar to this one, and at least it peaks up in the design signal. So this isn't necessarily very accurate. >>: So you've got a lot of like external knowledge about these. Does that just come from you follow these projects, you're aware, or was -- like let's say that I didn't know anything, and I'm like, dude, there's this spike there. What would I do to try to figure out what actually was going on? >> Abram Hindle: [inaudible] to find what was going on in the spike. I use AWK. And I said AWK, on these files, these CSV files, between these two dates, grab me those, and then I use grep and I told grep here's my requirements document or my requirements dictionary, grep anything that matches that. So [inaudible] and then I looked at what was there. >>: Okay. So you could do it without external ->> Abram Hindle: Yeah. So if like [inaudible] user interface [inaudible] then yes. >>: Okay. >>: It seems that when the project are smaller [inaudible] 1994 and 1995 a spike might still exist, whereas in 2001, if you show it -- like do you normalize per quarter [inaudible]? >> Abram Hindle: No, I didn't. And I mention it in the paper it's based on -- it's something you might want to do. Because, you're right, there's very little here and there's a heck of a lot over here. And in terms of version control, it's pretty well [inaudible]. There's a ton of work over here and very little done over here. So you do want to -- if you're looking at a specific time period, you might want to normalize for that time period. And you might also want to normalize this up based on size. >>: Would you normalize [inaudible] individual components first or would you normalize [inaudible]? Because each question might be -- you might want to normalize versus the number of [inaudible] versus the number of people who ever looked at that [inaudible]. >> Abram Hindle: Yeah. So there's definitely multiple ways of doing the normalization. And so I don't have any hard and fast [inaudible]. So if I was to give this to an end user, like a manager, they'd have all these coefficients which they could fiddle with and potentially see what they want to see but also potentially see what's there. So there is that balance. Because -- yeah. 
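One minimal way to do the normalization just discussed, as a sketch; whether the denominator should be total events, lines of code, or team size per period is an open choice, not something prescribed in the talk.

```python
# Sketch: normalize a per-month signal by overall activity in that month so
# that low-activity early years and high-activity later years are comparable.
from collections import Counter

def normalize_signal(signal, activity):
    """signal, activity: Counter(month -> count). Returns month -> fraction."""
    return {month: count / activity[month]
            for month, count in signal.items() if activity.get(month)}
```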
>>: [inaudible] talked about these kind of things matching up to the release cycle [inaudible] at these signals [inaudible]? >> Abram Hindle: No, I didn't correlate with [inaudible]. I basically -- for validating this, it was two case studies, FreeBSD and SQLite, which we'll get into. And basically I was looking at mostly the peaks. So I didn't do negative validation [inaudible] it was mostly just two case studies looking at are the large behaviors visible. So there's definitely a lot more validation work to be done on this. But I think what was neat about it was that we showed that we could try to derive some kind of process view out of the events that occurred, and I don't think unified process is really all that valid for every project, especially a lot of the open source projects. You'd probably want to show some more concrete signals as well. Like I think the build signal is very important for C project, because every time you add a .C file, well, you're probably going to change the make file or the [inaudible] file or one of those things. >>: How do you choose the projects that you decided to look at? >> Abram Hindle: Ad hoc. So I had FreeBSD because I had done the mining challenge, and it was a long-lived project. And SQLite was long lived as well. And I had just written a fossil extractor, so I was the only guy who actually had most of SQLite's information because no one else had done the fossil extractor. So it was basically two case studies of long-lived projects which were popular. So FreeBSD is popular and SQLite's popular. >>: If you had to -- let's say that like you could snap your fingers right now and get the data for another open source project, what would be one that, given what you know about them, would be a good one to look at? Do you think like [inaudible]? >> Abram Hindle: I'm not going to answer unified process one, but I'd use Apache, would be the next step. Because ->>: Because everybody looks at Apache? >> Abram Hindle: Well, because Apache is very explicit about what their process is. And if I'm doing process validation, obviously the next step is compare -- we said we did this with what we did. So like this stuff is pretty well baby steps towards the real software [inaudible] needs a lot of validation. A lot of people aren't really up to this point. So this is what I got a thesis out of. Okay. I'll carry on. The implementation signal is much more concrete. I took the source code changes and I said they were implementation changes. They might have been maintenance changes, but at the very least it was a very concrete signal, very direct. The testing signal is more interesting because I took the testing changes along with the portability changes, which we can argue about, and the efficiency changes because it might be benchmarking regression tests. Especially something like FreeBSD or SQLite where they care about performance. And I also took reliability changes because a lot of those sometimes when you do a fix you might do a test. So this is the most concrete of the signals and these ones I would say have less power but they might be relevant in terms of regression tests and performance testing. And this produces the UP testing signal. 
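A sketch of how the per-discipline views could be assembled from signals like the ones above; the discipline-to-signal mapping mirrors the description in the talk (implementation from source changes; testing from test, portability, efficiency, and reliability events), but the coefficients and names are adjustable assumptions.

```python
# Sketch: each recovered Unified Process discipline is a linear combination
# of already-extracted per-month signals. Coefficients default to 1.
from collections import Counter

DISCIPLINE_MIX = {
    "implementation": {"source": 1},
    "testing": {"test": 1, "portability": 1, "efficiency": 1, "reliability": 1},
}

def recovered_up_view(signals, mix=DISCIPLINE_MIX):
    """signals: {signal name: Counter(month -> count)}.
    Returns {discipline: Counter(month -> weighted sum)}."""
    view = {}
    for discipline, components in mix.items():
        total = Counter()
        for name, coeff in components.items():
            for month, count in signals.get(name, Counter()).items():
                total[month] += coeff * count
        view[discipline] = total
    return view
```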
>>: [inaudible] how would you write error bars on these things in the sense of you have some -so you're doing several different kinds of analysis that each have their own possibility for error [inaudible] and merging them together, is there a sense that you would actually -- it would be [inaudible] more clear if you drew like a band of what the trend could be over time as opposed to these individual spikes which actually probably are attenuated based on error. >> Abram Hindle: I don't think you can truly get error until you have a concrete view of what's going on. Oracle can do everything, then you know. But if you don't know, then sort of hard to tell an error. So in that case I think you rely on the confidence of those people who have expertise in some of these signals. So it's sort of more accumulative. So that's a big problem with it is how to view it and how to display it and how to analyze it. >>: It's interesting. Your source signals have pretty high standard deviations, but then the summation of them seems to have [inaudible]. >> Abram Hindle: Yeah. >>: And it's interesting to me that they are not mutually supporting; that what that's suggesting is there is a not high degree of correlation between those four source signals. >> Abram Hindle: Yeah. If we go back to the topics, we could see that certain topics are prevalent over different periods. So I think this has to do a lot with how software isn't about everything at once; it's about what you're focusing on at one time. At least if you look at one slice, what are we doing right now, we're not doing everything, we've chosen to do a couple little things. So that might be a kind of topic shift. >>: Right. Which kind of makes you wonder [inaudible]. Yeah. It shows you pick signals that are highly correlated or should you pick signals that are deliberately not correlated at all. So you're getting some sort of ->>: And then you get a flat line [inaudible] which is by definition interesting. >> Abram Hindle: You don't have to totally smoosh. You can look at these signals and then you can look at the signals they're derived from. So I believe like the build signal is a really great example of something that has a lot of interesting information in it for certain kinds of projects, especially C projects. Because they -- oftentimes the build change would indicate an architectural change. >>: Yeah, I guess that comes back to your question about [inaudible] there's no story here. There's not -- it doesn't feel -- certainly with these [inaudible] there's not -- there's no explanation happening. Maybe that comes back to what Chris was saying earlier, you've got to be deeply contextualized on this step to -- for these [inaudible]. And I don't know if that's a presentation problem or ->> Abram Hindle: Well, I don't have the releases [inaudible]. >>: [inaudible] test it, let's say, you took a release manager or a testing manager and you gave that in the chart for their project [inaudible]. >>: Right. Exactly. >>: And how accurate are they or how knowledgeable are they without having ->> Abram Hindle: Yeah. >>: And maybe there's another question of like why is it interesting to reflect on a decade and a half of the history of those projects? What do we expect to learn from that? >> Abram Hindle: Doesn't need to be the whole project. >>: But it is. I mean, you're showing me the whole project. >> Abram Hindle: That's true. >>: It is what it is. So you're showing me this picture for a reason. And what is that reason? 
>> Abram Hindle: Well, I needed a way to express what was potentially the underlying software development process, a view of it [inaudible]. And so if we look at the original UP diagram, it actually -- the UP diagram has the whole lifetime on there. They've got the inception and then they've got -- what's the last phase when you peter out -- >>: Transition? >> Abram Hindle: Yeah, transition. So they have that in there. So they've got the whole lifetime. So this was a first step in trying to show, well, what would the lifetime of [inaudible]. >>: [inaudible] >> Abram Hindle: So there's definitely a lot of issues with it, and there's a lot more validation to be done and a lot more investigation in like how to actually show it as well as in future work I mentioned, I really want to see if iterations are automatically identifiable or what adding the iteration bars would really tell you. Okay. So I already discussed the FreeBSD stuff. Sorry. >>: It seems like there's another problem with smooshing these things together like the -- if you have a particular issue, presumably a feature, like it's this fast to log on -- gosh, I wish we had this particular feature in [inaudible]. Presumably that's coming up on user lists, then the developers are doing a whole bunch of [inaudible] and then the user list is going back on the user [inaudible] for a particular issue, that's going to kind of move through the pipeline ->> Abram Hindle: Yeah. But at least with the smooshing you get to see, you know, in this repository [inaudible] in repository 2 it was visible, repository 3 it was visible. So it might look flat even though in repository 3 it might look -- so it's a multirepository view. So it's already a multidimensional signal in total, so how do we show that. We could do other stuff [inaudible] but I don't think that's as valid as, say, asking people how much they trust a signal and how much they want to see it. Like do you really want to see the user list? Maybe you do. Maybe it's very important. Maybe actually a lot of dev work [inaudible] into one user list. Like for SQLite, these developers don't really let a lot of people onto the dev list. So I'd already explained this before when Rob asked about the peak. And so just to reiterate, it was the GCC 2.95 port and it was the Single UNIX Specification, Version 2, conformance. And so that caused this peak in analysis and requirements. At least that's what I grepped out of there. And I'm -- wasn't really sure about testing. So we applied this to SQLite as well. And we looked at the interesting peak at the end which was across quite a few things. And so we see this big requirements peak here. If we look over here, the configuration management peak is different. The testing peak is different. So it's not necessarily the same event. So in terms of requirements, this was really interesting. They went to their .H files, and they had requirements jammed into the .H file comments. What they did here is they actually took those requirements out and they made a formal requirements document. Very rare in open source and also sort of strange seeing as 2001 was over here, and this occurred in 2009. So later I went and looked and looked for why would you do this. And they basically wanted to have a requirements document where anyone could reimplement SQLite, even though it is public domain software. So it's not even an open source license, it's public domain. Like you're free to take it and no attribution. 
But what was interesting was, yeah, that was noticed, the requirements things was noticed. And there was also another requirements peak around here which was interesting, which was the SQLite 3 discussion where they were referencing SQL books to look at for, say, implementations of B-trees, things like that. So at least in terms of some of the peaks, they panned out in terms of what was in the repositories, and that was validated by AWK and grep. Actually, grep's really nice in terms of ignoring schemas. So you can basically grep across data without the schema getting in the way. So that actually have some value. So does SQLite. So this led to issues of observability. So we discussed the requirements, we discussed the weightings, we discussed things like that. Certain signals were not as observable, particularly business modeling. In open source sense, not a lot of projects have business modeling. Maybe Evolution had a little bit at the start in their mailing list, the mail client, but that was about it. So not everything was observable, and that was one of the issues. So some common threads we observed while going through all this stuff was there was idioms and we could rely on idioms, whether they're file naming, other kinds of naming, behaviors, use of different kinds of files. And this also related to sort of a vocabulary within the project. A kind of lexicon used internally to a project which actually had little shared vocabulary. But of those shared vocabulary we found that many of the shared terms seemed to relate to nonfunctional requirements, like usability, maintainability, portability, and especially reliability and correctness. >>: Did you -- could you identify these idioms automatically? >> Abram Hindle: I don't know. Maybe. >>: It seems like -- I mean, each [inaudible] I look at has their own idioms for a number of different I guess dimensions. So I bet it'd be useful for like maybe a newcomer to a project or something like that just trying to -- having gone through this just recently, for like, you know, joining scripts or something within a project, understanding the idioms are seeing examples or saying like this is prevalent [inaudible] would be ->> Abram Hindle: I think language management is sort of the next sort of big software engineering tool. So let's try to fortify, you know, this word means this and we're going to use this on all of our clients as well as the domain modeling. >>: Yeah. Or the reverse. Like I've encountered problems where I know what word I used for a concept and I'm trying to figure out what word is used for that concept somewhere else. It can be really tough [inaudible] like understanding how to [inaudible] concepts amount to stuff I've seen in the open source world is really difficult. >> Abram Hindle: So for future work we want to apply more people in the teams-orientated analysis. So imagine doing the RUPVs per author, how would those change the unified process views we extract, how would those change per author, things like that. We also have to do a ton of validation work, some of which requires harassing people and asking them is this really what happened. We also want to improve the accuracy of some of these things and maybe do an additional case study. We also want to look into iteration identification. So basically I've done some machine learning in the past trying to figure out is this release time or not. And it didn't really pan out that well. But with this new source of data, it might pan out. >>: I have a question. 
Not to be too down on it, but I see the issue identifying releases as [inaudible] but it's unclear to me what the real benefit of that is. Because if it's in real time, you can like [inaudible] and if it's retrospective, you can get super high accuracy. The releases aren't that frequent. You can just go into the project page or something like that. So is it really worth putting a lot into identifying releases? >> Abram Hindle: Maybe not releases. But the phases within a release I think. If you can say the certain phase is a linear combination of certain disciplines or certain signals you extracted, then you can suggest how much a certain window is, how much a certain time that -- how much a certain window of development is related to that phase. So are we in a freeze phase, are we in a testing phase, are we in a heavy implementation phase, things like that. Are we crystallizing. So those kinds of phases would be [inaudible]. >>: How much of all this could be replaced with an anthropologist hired by let's say the team manager to sit there and shout it out, get all this information [inaudible] what's the tradeoff here? >> Abram Hindle: Well, I think definitely some of these tools would be used by the anthropologist to keep [inaudible] what's going on. >>: [inaudible] >>: No, no, no, very good point [inaudible]. >>: [inaudible] anthropologists, that's their whole job. >> Abram Hindle: [inaudible] >>: [inaudible] >> Abram Hindle: But if you're, say, purchasing the company who didn't the anthropologist [inaudible]. A lot of this stuff is after the fact. If you're going to change the process [inaudible] anthropologist [inaudible] I definitely think some -- I think it's managing language, having at least the language half in terms of a shared lexicon, shared terms [inaudible] things like that. I think that's [inaudible]. >>: The hiring [inaudible] company that didn't have the anthropologist, then you need the archaeologist, but the archaeologist [inaudible]. >>: I would also say there's two other problems with anthropology which is all [inaudible] which is anthropologist A [inaudible] plus they also, I don't know, I think they spend like 20 years or something [inaudible] so you probably don't want to wait 20 years when you have 15 years of [inaudible] but if back in 1995 someone wanted to know what was going on [inaudible] say, well, [inaudible]. That might be a problem as well. >>: [inaudible] you have an anthropologist for two weeks of time for something. Could they help you identify the relevant topics ahead of time or identify the [inaudible]? >> Abram Hindle: One of the reasons I was doing the topic analysis was because of yak shaving. So say you have a story card that you have to fulfill and it causes, say, a performance regression but that wasn't a story card but you've got to put out that fire, so you've gone on this long journey to fulfill this story card and you end up over there shaving a yak for some reason. So the topic analysis I was hoping would maybe highlight what's going on if there's yak shaving involved, what were some topics that occurred that were important that maybe the manager wouldn't know about because why didn't you guys finish that story card last week, we're working on this [inaudible] different topics. Okay. Just a quick summary of the process recovery. 
So we've got our repositories that we relied on at least for this study, our discussion lists, our bugs in the bug tracker, our version control systems, and we applied a wide variety of analyses, from NFR-related word lists to maintenance classes, topic analysis, release patterns, all that, and then we aggregated lots of those signals up into a -- what we called the recovered unified process views, which was basically a concrete version of the unified process diagram. And this stuff was relevant to a wide variety of stakeholders, mostly those not inherently involved with the code or those who are just beginning in a project and not really sure what's going on. Now, I want to go a little bit further about this. This is like the awful selling point. So how do I think this is relevant to Microsoft. Well, I think there's three main points. So: internal use, integration into existing products, and how existing products would actually in the future help this kind of analysis. So for internal processes, I think one of the issues is when you have the globally distributed development you want to keep tabs on things, you don't always have your manager in the same location, things like that. You might have the proposed process, and at least maybe you can validate what the underlying process was and what the differences were. So at least in terms of Project Dashboard or project timeline, some of this stuff might help, and I don't doubt that you have some of it. Another thing would be to look at, say, a successful project and try to see is there actually a correlation between the successful projects in terms of their processes or teams and look at that and see if there's any kind of consistency across there. >>: I have a question. >> Abram Hindle: Yeah. >>: In terms of studying the software process, is there a relationship to how MBAs study business processes and are there lessons to be learned from management schools on what good processes look like or whether they correlate? >> Abram Hindle: Okay. So I was reading Van der Aalst's book on workflow processes where basically he wanted to mine the processes and then refactor them. And he used the formalism of Petri net [inaudible]. So it's trying to look at how that's applied, and Cook applied that, but it required a lot of work and a lot of tooling. So I think it is possible to use some of the business process work, but a lot of the business process work is very, very fine grained and would probably only be appropriate to software in very small domains, at least in my opinion, like the bug tracker. The bug tracker oftentimes imposes a process. So that's one side effect of the software. So you could look at that process and you could use the business process refactoring in order to improve, well, how can we get nicer bug reports, how can we get people who report bugs by not [inaudible] things like that. So I think in some subsets, in some contexts, we can use business process stuff for software. But in many cases the very information work kind of thing, so we don't really have the very strict states where we can just pass the document on. I guess one thing to look into would be is there any way to avoid the expertise issue where you have nonexpert programmers and expert programmers. Because a lot of the business process refactoring is about parallelizing the work and moving the documents between different people who might not have a lot of expertise and what's necessary there. So maybe some of that. But I didn't really go into that in my stuff. 
And I didn't really feel that the business processes were all that appropriate to, say, modeling an open source system. Maybe some of the Apache stuff, where they claim they have very strict processes, but not in all cases. And I think because we're fundamentally doing information work while developing, we don't really have this clear, staggered, stage-to-stage kind of development. So there's also various products that might be amenable to having some of the software process recovery integrated into them, whether for Microsoft or for their end users. So Project Dashboard in Visual Studio 2010 might help. Some signals might be useful, like the STBD stuff, the source/test/build/documentation signals. It might be useful just to see whether your current release is similar to your last release. If they're not similar, well, why aren't they similar? You can go and look. There's also the word bag analysis stuff, which is very cheap to apply, relatively inaccurate, but still very cheap; once you've got a dictionary you can ship it to other people. And then the topic analysis is sort of interesting, but it eventually becomes a supervised method if you really want very well-labeled topics. So that might be a little bit hard. And then the harder stuff to apply would be anything that requires end-user interaction for training or annotation. And I don't think that would really pan out and go very far with end users, because, well, I bet everyone in this room has done annotation, and I don't think anyone here likes it. Okay. So these were a couple things that could be integrated into, say, Project Dashboard. Project Dashboard has some interesting stuff in it that I really like, like the burndown charts, the burn rates, the backlogs. These are things I'd want as signals when I'm looking at or aggregating to produce the recovered unified process views. Those are really nice signals to have, because they're very process oriented. They're usually requirements or story card based, what's getting done. I think that's really interesting. And it'd be neat to have it flow both ways, get access to that and vice versa. So Codebook is interesting, and I think some of this stuff could go into Codebook. So imagine a social view of the recovered unified process views. You're looking at one project, you're looking at a couple people involved who you're connected with: what does your aggregate network look like, or what does one of your buddies look like, or has your one buddy been mostly working on portability-related issues, things like that, that kind of analysis. So that might be really neat.
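As a rough sketch of the cheap word-bag signal extraction mentioned above: commit messages are matched against small per-discipline keyword dictionaries, counted per time window, and normalized, giving the kind of per-discipline series that the recovered unified process views stack up. The dictionaries, messages, and window size are illustrative stand-ins, not the actual word lists from the study:

    # Minimal sketch: word-bag classification of commit messages into
    # disciplines, aggregated into per-window proportions.
    from collections import Counter

    # Tiny stand-in dictionaries; real NFR/STBD word lists are larger.
    DICTIONARIES = {
        "implementation": {"add", "implement", "feature", "refactor"},
        "testing": {"test", "assert", "coverage", "regression"},
        "build": {"build", "makefile", "compile", "package"},
        "documentation": {"doc", "readme", "manual", "comment"},
    }

    def classify(message):
        """Return every discipline whose dictionary matches the message."""
        words = set(message.lower().split())
        return [d for d, vocab in DICTIONARIES.items() if words & vocab]

    def window_proportions(commits, window_days=30):
        """commits: iterable of (day_number, message) pairs."""
        windows = {}
        for day, message in commits:
            bucket = windows.setdefault(day // window_days, Counter())
            bucket.update(classify(message))
        # Normalize each window so the disciplines can be stacked like the
        # per-discipline effort bands in the unified process diagram.
        return {
            w: {d: n / max(sum(c.values()), 1) for d, n in c.items()}
            for w, c in sorted(windows.items())
        }

    commits = [(3, "add csv export feature"), (10, "regression test for export"),
               (35, "update readme and manual"), (40, "fix makefile for package")]
    print(window_proportions(commits))

Comparing one release against the previous one, in the sense discussed above, then amounts to comparing two of these windows.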
>>: If you have a smaller corpus, like [inaudible] produced by an individual, then does this approach scale down like that? Or do you need big bags of words to work on? >> Abram Hindle: For like the [inaudible]? Well, with at least the NFRs we have shared terminology, at least in English. So I think with the NFRs you can get away with them across projects and across individuals [inaudible] so you can get away without a lot of training on individuals. >>: But, I mean, you'd end up with just little tiny spikes [inaudible] a whole lot of zero baseline. The further you go down -- getting smaller and smaller in terms of people or smaller and smaller windows of time -- in both cases your bags of words could get small and the signals are going to just become very spotty. >>: You could imagine [inaudible] solution approach to a stack [inaudible]? >>: You could. >>: You could see if there's a topic that moves across people, or there's a spike from one person [inaudible] from the other people it's just noise; they'll all stack up and you can kind of see that everyone got involved. Or actually this one person got involved, maybe there was like another [inaudible] that turned out to be really significant and everyone had [inaudible]. >> Abram Hindle: Yeah. So I think you could get some really neat interaction graphs out of this. And even this stuff, like seeing who's working on implementation immediately, who's actually doing testing, who's committing build changes -- this stuff is totally unsupervised and very easy to apply. So I think that would be pretty neat. And I think you're right about the word bag. Maybe some people don't use certain terms. But you could also -- >>: [inaudible] >> Abram Hindle: Anyway, so I'm just saying that there's potential for interaction [inaudible] the recovered unified process views. And at least in the social setting, I think it'd be really interesting, especially person specific and group specific. >>: Okay. Here's the [inaudible] if Chris and I work on internationalization, localization of all the [inaudible], that's our job and that's what we do together, we will never say those words. >>: Because they're tacit. >>: Because they're tacit. Because they're -- they are the bubble that is around us. We don't need to talk about them. So even though that's what we're doing [inaudible]. >> Abram Hindle: But then, again, if someone who's not part of your group starts talking about internationalization because they're your buddies, that would be interesting too. >>: It's going to be the same as the Web, where a link to a page is often much more descriptive of that page, so anybody who references you would probably also use the word internationalization, because, oh, jeez, the expert on internationalization [inaudible] and so if you found links to you in other e-mail lists, then you could figure out that you were the internationalization person and assign that topic and then understand what the domain is inside your messages. >> Abram Hindle: Okay. So then I'm going to cover what Microsoft can do. Basically, spy. Spy hard. So one of the problems I found was I couldn't estimate effort at all, at least with open source stuff. There was no indication of time. You have the log, but that doesn't necessarily say how long they spent. So you could take the big brother approach and you could actually record -- at least this might work in companies where there's sort of a lesser expectation of privacy. But it doesn't always happen. Like I understand there's serious issues with it. But at the very least, if you allowed the commits to be tagged with time spent or effort spent, things like that, and then maybe allowed the developers to modify them just in case, you know, like maybe you left the computer on and -- >>: What? [inaudible] >> Abram Hindle: So I've done not quite this, but a little bit of it, and what I found interesting about monitoring which windows were open was that you really had to be aware of idle and non-idle time. And different applications probably need different idle thresholds. So I had mine set to 30 seconds, which would make all my movie watching not count, because movies are pretty long.
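A minimal sketch of the idle-aware monitoring just described, using the 30-second threshold from above; get_active_window_title() and seconds_since_last_input() are assumed, hypothetical platform-specific helpers, so this is only the bookkeeping logic, not a working tracker:

    # Minimal sketch: credit time to the active window only while the user
    # is non-idle. The two helpers referenced below are hypothetical
    # stand-ins for platform-specific calls and are not defined here.
    import time
    from collections import defaultdict

    IDLE_THRESHOLD_SECONDS = 30  # probably needs tuning per application
    POLL_SECONDS = 5

    def track(duration_seconds=3600):
        """Accumulate non-idle seconds per active window title."""
        active_seconds = defaultdict(float)
        end = time.time() + duration_seconds
        while time.time() < end:
            if seconds_since_last_input() < IDLE_THRESHOLD_SECONDS:
                active_seconds[get_active_window_title()] += POLL_SECONDS
            time.sleep(POLL_SECONDS)
        return dict(active_seconds)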
>>: Well, yeah, but you're talking about a particular [inaudible]. >> Abram Hindle: Yeah. >>: [inaudible] like sometimes when I'm actually developing, I'll go read an MSDN page and I'm moving around that for 15 minutes -- >> Abram Hindle: But it might be relevant. >>: Yeah, and it's totally relevant, but Visual Studio just thinks I'm totally idle. >> Abram Hindle: Yeah. So that's also a possibility, like -- >>: [inaudible] >> Abram Hindle: If you had the Web browser aware of it. >>: I have a hard time trusting developers' estimates. >> Abram Hindle: So having some kind of concrete measurement might be useful, but there's definite downsides to it and there's definite [inaudible]. It's still a -- [multiple people speaking at once] >> Abram Hindle: So this goes back to the commits and adding more metadata to commits. So something like Visual Studio could provide more information: the structure, the time, traceability, related artifacts. It could even go through all the Web pages you looked at, which in some cases would not be great -- >>: Filter out [inaudible]. >> Abram Hindle: Probably have to have a delete button on some of those. It's like Facebook, Facebook, Facebook, Facebook. Yep. So there are possibilities for adding more information so that you could do better tracking of certain things you're interested in. Especially structure. A lot of the Smalltalk VCSs capture structure, so you can actually see when the structure changes over time. >>: You mean structure of the code itself? >> Abram Hindle: Yeah, like the architecture. Yeah. Another thing that could be improved would be -- so I'm not really sure what you guys use for project documentation. I assume it's like Word and maybe you commit that somewhere. I'm not sure. But the point is that one other thing that could help would be the ability to have more traceability between all these documents. So when you're writing something up and it goes into a mail message or it references someone else, if there was a way to get better traceability out of that, and that could be enabled by, I don't know, better document repositories, better analysis of the documents put into repositories, things like that. Okay. So in conclusion, we've got software process recovery, which is the after-the-fact recovery of software development processes from the artifacts left behind. And this is exploitable by Microsoft both internally and externally, and I've shown how Microsoft could also improve things in the future, at least by getting some interesting signals out so that you can better track your project, with some caveats. Okay. Thanks. >> Chris: Thanks. [applause]