>> Nachi Nagappan: Thank you for coming and to all our viewers online. It's my
pleasure to welcome Hridesh Rajan from Iowa State University. He has been
there since 2005. And prior to that he was at the University of Virginia where he
worked with Kevin Sullivan and got his Ph.D.
His primary area of research has been on aspect-oriented software development.
He's going to talk to us today about a new project he's been working on
called BOA, for mining large-scale software repositories.
Hridesh Rajan: Thank you for coming, both here and online. So this talk is about
BOA, which grew out of my personal pain trying to do empirical evaluation of
programming languages, in some sense. This is joint work with Robert Dyer, Hoan
Nguyen, and Tien Nguyen. All of them are at Iowa State University. So hopefully
I don't have to convince many of you that mining software repositories is an
important aspect of software development analytics.
We want to be able to learn from the past, to spot anti-patterns and what is
actually practiced, and then use those observations to inform the future: for
example, find better designs, keep doing what works, and do some empirical
validation. So in order to mine software repositories, the first thing is to find a
target, the repositories to mine. There are several examples of repositories that
academics have access to, and then some organization-specific repositories that
industry folks have access to. This is a rich set of data that is available out
there, and it allows us to make observations that previously could not even be
imagined. To give you an idea of the scale of these things: there are three large
repositories, Google Code, GitHub, and SourceForge, and these three together
have over a million projects, over a billion lines of code, and tens of millions of
revisions. When you compare this to a large software company, this is really not
such big data. But for academics, this is a huge amount of data at our disposal
for testing research hypotheses. So what would be examples of such hypotheses?
I'll pick some simple ones. For example, you could ask: what is the most used
programming language? A very simple hypothesis. Or: how many methods in my
code are named test, and what is the number of unit tests in the system?
You could ask things like: how many words are in log messages? Do people
actually bother to communicate their intent when committing code?
You could ask questions like: how many issue reports have duplicates? This is a
very common question in software development.
So I'm going to pick one of these questions and explain what it takes to answer
it today. Here's the task: computing the average churn rate for projects on
SourceForge. Churn rate here is the number of files changed per revision.
So today, if we wanted to do this task, we would first go to sourceforge.net and
say, okay, let me mine the project metadata. This is essentially a JSON file, and
for each project you find, you're going to run this loop. What this loop does is
essentially say: okay, is this a valid project? If that is the case, then does it have
a repository? If so, then write some code that accesses the repository to mine
revisions, then use the revisions to calculate the churn rate, and keep doing this
until you're done, to finally give you the average churn rate over projects.
Now, for somebody whose primary task is not mining software repositories, this
is a lot of work. The solution I present on the slide is too much code; don't
bother reading it. I show it just to illustrate how complex it can become. This
code was written by a mining software repositories expert, one of my
collaborators, who optimized it down to the bare minimum.
That is over 70 lines of code you have to write, and you can only write it after
learning about two libraries, a JSON library and an SVN library, and how to
access and use them effectively.
This code runs sequentially, and it takes over 24 hours to run and process the
SourceForge data. So if you start running it, it will take over 24 hours.
And if you take the time to download the project data locally and run this code
on the downloaded data, it will still take almost three hours to finish. Now, it is
conceivable that one could write a parallel version that runs faster, but that adds
complexity on top of what we already have, for answering a relatively simple
question. Here is an alternative in BOA. This program answers the same
question. I'm going to explain its different parts over the course of my
presentation, but just to make some basic high-level observations:
This full program is about six lines of code. It doesn't require users to understand
external libraries. It's automatically parallelized. It gives you the same result in
about a minute.
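A sketch of what that query looks like, reconstructed from the published BOA examples rather than the slide itself (the attribute names code_repositories, revisions, and files follow the domain-specific types described later in this talk; treat exact spellings as approximate):

    # average churn rate per project: mean number of files per revision
    p: Project = input;
    rates: output mean[string] of int;
    foreach (i: int; def(p.code_repositories[i]))
        foreach (j: int; def(p.code_repositories[i].revisions[j]))
            rates[p.id] << len(p.code_repositories[i].revisions[j].files);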
And that's what BOA is all about: giving software engineers and programming
language practitioners the ability to write queries of this sort and run them in a
timely manner, in some sense.
So this is the topic of my talk: the BOA language and data-intensive
infrastructure. It is available as a limited release right now at boa.cs.iastate.edu.
To tell you a little bit about how I came to BOA: from 2005 through 2012 the
focus of my work was reconciling modular reasoning and crosscutting concerns.
I worked on a language called Ptolemy, with Gary Leavens from the University of
Central Florida, whose primary goal was to do this.
As part of the language evaluation, we wanted to empirically validate the
Ptolemy design. In order to do that, we wanted to use real code, and a significant
set at that, to show benefits. The question we wanted to ask was: which projects
use event-based separation of concerns, and how do they evolve over time?
Through this work I had a couple of my students mine software repositories such
as SourceForge and so on. It took them a full three months, with not much
significant result coming out of it.
I didn't think much of that problem at the time. But recently, when I started
working on this idea called modularity-guided parallelism in this [inaudible] that
we're working on (I won't get into the details of what it does), we wanted to do
the same thing: empirically validate the design, using real code, and a significant
set at that, to show benefits.
For that we wanted to find out which projects use concurrent language features.
Same story: three or four months of student time spent, and not much progress
to show for it, because it's just very hard to do these things manually.
So: a lot of work, not much result. And so then we asked, can we fix this
problem? Can we fix this problem of running these sorts of queries, so that we
can get results in a reasonable time without writing thousands of lines of code or
spending thousands of hours doing this work?
And that's where BOA came to be. BOA has three design goals. The first is that
it should be easy to use, and I'll come back and explain who it is for. The second
is that it should be scalable and efficient, and the third is that it should enable
reproducible results. I'll explain each of these goals.
Easy to use: a simple language, with no need to know the details of repository
mining or data parallelization. What I want is to empower social scientists and
humanities experts to be able to do open source research. That's essentially my
goal with this project.
There's also a little story behind this. This requirement came out of a
collaboration that I recently became involved in with an English faculty member
at Iowa State University. This person is interested in understanding whether
terms mean the same thing in different domains.
So, for example, if I use the word "efficiency" in one context, does it mean the
same in another context? This person is trying to do that study on open source
projects, and he came to me and said it is just too hard to do this sort of work:
as a person who doesn't know very much about programming and its details, it's
just not possible for me to go and answer these sorts of questions.
BOA is hoping to fill that sort of gap, and I will come back and show you some
details on that.
Another goal is to be scalable and efficient. We want to study millions of
projects and have results in minutes, not days. And essentially what we want to
do is enable more ambitious [inaudible] scientific discovery in this particular
area.
Finally, we want reproducible research results. This goal essentially came out of
a paper by Gregorio Robles published at the Mining Software Repositories
conference. That paper studied results in this area, about 171 papers, and it
says that only a couple of them are replication friendly, for a variety of reasons.
Sometimes the data source is not available; sometimes it's proprietary; and
there's really nothing others can do about that. So we wanted to fix this problem
as well, so that researchers can provide something that a third party can verify,
and I'll show how we achieve that.
And if you have any questions, please do not hesitate to interrupt me. So BOA's
capabilities are essentially on three dimensions. The first dimension, the one
most program analysis folks are familiar with, is containment: we look at the
program as an AST, where programs contain classes, classes contain methods,
methods contain statements, and so on and so forth. When we do program
analysis, we think in terms of this containment relation. That's one dimension.
The second dimension that is important is software artifact evolution, in the
sense of how code changes over time: are methods being added, are statements
being added, and so on.
The third dimension, which is often thrust upon us, is scale: we want to be able
to do this on a very, very large code base.
I'm going to now delve into the details of this language. So let's look at the BOA
architecture. An important component is the data store that we have. The data
store essentially replicates SourceForge: there's some caching and translation
that goes on, which creates a local cache here and has the responsibility of
keeping it up to date, in some sense. The other aspect was designing the
language.
The language that we are designing, and I'm going to come to its details, is
essentially a variant of a language that came out of Google called Sawzall; we
did more work on that, and on top of it we have added domain-specific types and
functions. I'll get into the details of those also.
As part of this, we built a compiler for BOA, because while Google's language is
available, its compiler isn't available in a usable form, so we did some work on
that. Essentially a BOA program runs like this: once you write a query program,
it is compiled and a query plan is created. This query plan is deployed on the
cluster, which makes use of the local cache and finally produces query results.
It's a standard MapReduce environment.
And you will see a demo of that today as well. So, regarding domain-specific
types: this is one of the core contributions of BOA, figuring out what kinds of
types we need to provide to make the job of a mining software repositories
expert easier. So BOA provides domain-specific types, and in this program we
are seeing several of them. Project is a type; a project has properties like
programming languages, code repositories, and repository kind.
Code repositories have properties like revisions, and revisions have properties
like files, and so on. Once these facilities are available at the language level, the
person is not doing any work to mine this sort of information.
>>: Sorry, can I think of these types as basically just being the schema of the
underlying data?
Hridesh Rajan: Yes, you can. And that's -- at the implementation level, that's
where we start from.
So the role that these types play is to abstract the details of how to mine software
repositories so that people can just worry about programming in terms of these.
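As a minimal sketch of what programming in terms of these types looks like (assuming p.name is an attribute of the Project type, per the BOA documentation; the query itself is illustrative, not from the slides):

    # emit each project's name; the types expose repository metadata directly
    p: Project = input;
    names: output collection of string;
    names << p.name;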
>>: Maybe you'll explain the program later, but I'm confused about it. If you're
going to compute the average churn rate, I would have expected to get a number
out, and here the main operator is exists. So I would guess from that that this
turns out to give you [inaudible].
Hridesh Rajan: I will come to that. But you will get a number out. And so for
now think of the first line in the program as what you will get out of it.
>>: The rate?
Hridesh Rajan: Yes. And I'll come back and talk about that also.
Hridesh Rajan: So here are some details of this domain-specific type. It provides
the name of the project, the home page URL, programming languages, licenses,
and all the information you would typically expect about a project.
There are other types like code repositories, revisions, and files, and they
provide the usual details about these artifacts. More information about this is
available from the website, of course.
We also have domain-specific types for source code analysis, and these include
declarations, namespaces, types, et cetera; things like variables, modifiers, and
methods. All these types exist to make the life of a mining software repositories
expert easier.
We also have statement-level and expression-level access, so you really have
the entire abstract syntax tree available at your disposal.
Once again, I'm not getting into the details of these because this information is
all available online, and when you're programming you can look up which type
you need. Okay. Most of these types are not specific to any particular language;
they generally support any object-oriented language. Currently we have support
for Java source code, but we're planning to add support for C++ and also for C#
if we can find a suitable library.
What we do for the mapping, essentially, is parse the source file into an AST
and store it in the database; the database we're using is HBase.
So Java code like this would be represented like this in the internal AST. Again,
the details are not so important; just take a look at the containment relation.
This is essentially what I meant when I talked about the containment dimension:
you can browse through the statements, and ask what is the name, what are the
variables, what are the expressions it contains, and so on.
For those familiar with programming language parsing, this will come naturally in
some sense; it is standard AST containment. BOA also has domain-specific
functions. These are first-class functions that we make available in the language.
So hasfiletype is a function that takes two arguments, a revision and a string,
and returns a Boolean. As its body, it tests whether there exists a file in that
revision that matches a certain extension: if so it returns true, and otherwise it
returns false.
There are other versions of this function that take a Project or a CodeRepository.
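A sketch of how such a function might appear in a query, assuming the Project-taking overload he mentions:

    # count projects that contain at least one Java file
    p: Project = input;
    counts: output sum of int;
    if (hasfiletype(p, "java"))
        counts << 1;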
So here's another example. One of the common things people do in mining
software repositories is to ask whether a commit to the code is fixing a bug.
This function tries to test for that by looking at the log of the commit: oftentimes
people write things like "fixing this bug" or "fixing some other bug" and so on.
So it looks at the log string and says, okay, if it matches a certain pattern, then
we will consider it a bug fix.
The details of this particular implementation are not so important. What's
important is that a user is allowed to write these heuristics themselves.
In general, user-defined functions take this form; the return type is optional. The
idea is essentially to allow more complex algorithms to be encoded.
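A sketch of a user-defined heuristic in that form, loosely modeled on the bug-fix test he describes (the patterns here are illustrative, not the ones on the slide):

    # returns true if a commit log message looks like a bug fix
    isfixing := function(log: string) : bool {
        l: string = lowercase(log);
        if (match("fix(es|ed|ing)?", l)) return true;
        if (match("(bug|issue|defect)", l)) return true;
        return false;
    };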
Next are quantifiers, and this relates to your earlier question. Quantifiers are
essentially the same as in mathematics, and they allow us to easily express
loops over data, with the bounds inferred from the conditions.
I'll go through the quantifiers in detail. This is one, the foreach quantifier: for
each value of i, if a certain condition holds, run the body with i bound to that
value.
The exists quantifier is essentially the mathematical existential, as one would
imagine: if there exists a value of i where the condition holds, run the body once
with i bound to one matching value. And ifall says: if the condition holds for all
values of i, run the body once with i not bound. The reason we have these is
that we have found these kinds of quantifiers make the task of expressing
queries easier.
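A side-by-side sketch of the three forms (assuming p: Project = input and suitably declared output variables langs, counts, and ids; the conditions are illustrative):

    # foreach: body runs once per i for which the condition holds, i bound
    foreach (i: int; def(p.programming_languages[i]))
        langs << p.programming_languages[i];

    # exists: body runs once if some i satisfies the condition, i bound to one match
    exists (i: int; lowercase(p.programming_languages[i]) == "java")
        counts << 1;

    # ifall: body runs once, with i not bound, if the condition holds for every i
    ifall (i: int; p.programming_languages[i] != "cobol")
        ids << p.id;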
So BOA hides the details of distributed computation, and the underlying
architecture uses the MapReduce framework. For those not familiar with it, I'm
going to give a quick summary. MapReduce is essentially the idea that when you
are processing a bunch of data, in the map step you have identical copies of a
program, each processing a certain portion of the data.
Then, in the shuffle step, you take the output to a reducer, which, based on that
output, gives you one or more values. So essentially the idea is the same as
multiple data, single program: the same program runs on each portion of the
data.
That's the basic idea. It's not a contribution of BOA per se, but BOA makes use
of that paradigm. Output aggregation is also part of BOA's design. Output is
defined in terms of predefined data aggregators, like sum, set, mean, maximum,
minimum, et cetera.
Values are sent to output aggregator variables. An example output is declared
here: it says rates is an output that computes a mean, indexed by the string
sent to it, of integer values. Okay.
Here is sending data to the aggregator, and this uses a C++-like syntax: if
you're familiar with sending data to output with cout and <<, you'll definitely see
what's going on here. We're sending to rates, with an index and a value,
essentially saying: find this index and update the mean with this additional data
value.
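A minimal sketch of that declare-and-send pattern, with hypothetical literal values standing in for the slide's expressions:

    # an output computing a per-key mean of the ints sent to it
    rates: output mean[string] of int;

    # send (index, value) pairs; the aggregator folds them into the mean
    rates["some-project-id"] << 7;
    rates["some-project-id"] << 3;   # mean for this key is now 5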
Okay. So let's see it in a demo. When you go to the web page, there's a user
login on the left. We don't have open access to this infrastructure available yet.
But once you log in, you get to this page. This is how you use BOA, from a
tablet or a smartphone or a web browser. And we teach people how to program
in BOA using examples, so we make a wide variety of examples available to
users.
If my query matches one of these examples, I'll typically start with that. So
imagine for a second that I was interested in asking this question: how many
projects use the Java programming language? I'll start with this example, which
asks how many projects use the Scheme programming language, and when I
pick that, it populates the program. This program is saying that the output of the
program is a count; it's going to be a sum of the integer values that are sent to
it. The input of the program is a project.
And for each integer i such that a programming language used in the project
matches Scheme, I'm going to send the value one to the count, so it's basically
counting, distributed over all projects.
So I want to change this program to count projects using Java. Okay. Now this
is a program that counts the total number of projects on SourceForge that use
the Java programming language.
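A sketch of the demo query as he describes it (the Scheme original with "java" substituted; attribute spelling is per the BOA documentation, so treat it as approximate):

    # total number of projects that list Java among their languages
    p: Project = input;
    counts: output sum of int;
    foreach (i: int; lowercase(p.programming_languages[i]) == "java")
        counts << 1;   # sends one per matching entry, as in the spoken description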
The last thing we have to do is pick the dataset we want this program to run on.
This is sort of where the reproducibility requirement comes in.
When you report a result, you can report it by mentioning which dataset you ran
your program on, so that a third person understands: oh, this is the dataset on
which this result holds.
We have only one dataset at the moment, a snapshot taken in July 2012. We're
planning on doing one more snapshot, but I'm currently just going to run it on the
live data. So once I have my program, I can run it. That is essentially just
clicking a button. That's about all the work a user has to do.
So it's going to compile the program; let me return to the presentation here for a
second while we wait a minute for the results. Let's understand why we are
waiting for these results. The program we just wrote is analyzing about 600,000
projects and about 370,000 repositories.
And about 4 million revisions. Okay. And these are contained in this many files.
So let's take a look at what the program is doing.
So see here: the compilation has finished, and the execution has also finished.
After execution has finished, it allows you to see the job output. It says that
about 50,136 projects use Java as a programming language.
You can also download the job output if you so wish. And suppose your
program had an error: you can edit the source code and submit a new query
based on your program. All of this has to do with the ease-of-use requirement
that I pointed out earlier. We want this to be available to a class of users who
just cannot be bothered to install a bunch of libraries and other software systems
to make their queries work. They should be able to just point to a web page and
be done with it, in some sense. And this is a requirement that BOA achieves
very well in this setting.
We also checked the performance of BOA on a bunch of tasks; the details of
these tasks are not important, they're essentially different categories of things.
More details about that are in our XP paper this year.
What we saw is that, compared to the Java programs, BOA programs achieve
results much faster. As you can see, this is time on a logarithmic scale; on
average, BOA programs finish much faster. There were three categories of
these programs.
One category accessed only one repository revision, the second accessed all
revisions of a project, and the third only used metadata; that was this one here.
So while the Java programs, as you see, vary, the BOA programs take nearly the
same amount of time, and most of that is essentially Hadoop's start-up
overhead.
>>: Do the Java programs [inaudible] parallelism?
Hridesh Rajan: The Java programs are sequential, but they're looking at local
data -- and you can, for example, parallelize them if you would like.
>>: But the main speedup here is due to parallelism? I'm just trying to interpret
why.
Hridesh Rajan: The main speedup here is because of parallelism and also
because of data storage strategies. Those two things play a role: querying from
a distributed store versus querying from a centralized dataset. That's where it
comes from.
Any other questions on this? Let's look at it in some more detail. One of the
important things to note is the predictability of the result: a BOA program, we
can say, takes about a minute to run, whereas for a Java program it varies
based on what you're trying to do. Okay.
The next thing we did was examine whether these things scale. For that we
looked at three different input sizes: 6,000 projects, 60,000 projects, and
620,000 projects. The 6,000- and 60,000-project sets were picked out of the full
set using random sampling.
This again shows the BOA programs in blue and the Java programs in red. For
the Java programs, as the datasets grow, the performance decreases as
expected; for BOA the slope is much smaller.
And this, as I pointed out previously, is due to two reasons: first, increased
parallelism, and second, better data storage strategies.
At this point it's important to note that all of these queries could be run using
Java in about the same time, but the cost of writing those queries would be
much greater compared to what we are providing here.
For some concrete numbers, you can see the improvement for larger sizes:
BOA goes from 30 seconds to 32 to 41 seconds, not very much increase in the
amount of time it takes to process, whereas for Java the times climb from 96
seconds up into the thousands of seconds.
At that point it becomes very difficult to wait that long to compute the answer to
questions like this.
The next thing we did was ask: if we throw more infrastructure at it, will it work
for us? For that we scaled to more cores. These charts show different numbers
of maps, which essentially represent the amount of resources available: 32
maps essentially means 32 cores were available to it. This shows that as we
throw more machines at it, the performance actually increases; the time taken
decreases. If we continue to add more infrastructure, we would expect to see
similar trends.
One of the goals for BOA is to be able to replicate earlier results. If you recall,
we brought up this paper earlier, which says two out of 154 experimental papers
were replication friendly, mostly due to the lack of published data. And it's
understandable, because there are cases in which you cannot make data
available publicly.
So part of the problem is that results are difficult to reproduce, and our claim is
that BOA makes this easier. To validate this claim, I would like to reproduce
something very small for you today, to give you a taste of what it is like.
What we have done is write about 18 or 19 questions using BOA, in many
different categories, and we make them available from our examples web page.
One thing we did was show what it would be like to publish a result using BOA.
Okay. And this is an example. So imagine that this is essentially the
contribution of my research paper, a very small research paper.
What I would do is say: okay, this is the research question that I'm answering;
here's a program to answer that research question; and I'm using this dataset for
my results.
Okay. So I can take a look at that dataset, and, please don't try to read this, it
essentially just shows you the count of projects.
What I'm going to try to do, however, is reproduce this result. Reproducing the
result essentially involves copying this program, going to this infrastructure,
pasting it there, picking the right dataset, and running it. Okay. For this trivial
thing, it will run and give you the right answer. But the essence is that, with a
simple example query and a dataset tag, this infrastructure will allow people to
essentially see what you saw in the first place, without having to build the whole
thing.
We also did a controlled experiment where we used the published artifacts from
the BOA website and the datasets used. Here are the results of this controlled
study. In this study we had folks ranging from post-docs to undergraduate
students who tried to use BOA to reproduce some results. Some were experts;
some were not. Their task was essentially to pick one task out of the 18 that we
provide and see if they could reproduce that result. Most of them were able to
do this in less than ten minutes: they were able to show the same results that
were published earlier.
There is a bunch of related work, but none of it provides direct support for mining
software repositories. I'm going to give you some examples. Sawzall is where
BOA comes from; it has a similar syntax to BOA and compiles to MapReduce.
Pig Latin is another example in this category, with a syntax similar to SQL.
DryadLINQ is another example of this kind of language and framework.
There was some work on Sourcerer, an SQL database of about 18K projects,
which allowed you to do mining of software repositories.
BOA is more scalable compared to Sourcerer and more fine grained, because it
provides the full abstract syntax tree. And there are commercial products
available. There used to be a Google Code Search that allowed you to do grep
over repositories, but it allowed you to answer only very limited questions,
whereas we provide much more information than that.
Ohloh, from Black Duck, gives you a REST API, project metrics, et cetera, for
some projects, but we found it's not as scalable as BOA.
And then there are things like Google Code Search, GitHub, et cetera; compared
to those, BOA provides more than just textual matching.
There's ongoing work in several directions; I'm going to identify them and tell
you a little bit about each. First, we want to support more kinds of repositories.
We have support for SVN; we want to do Git, Bazaar, et cetera. And we want to
support more datasets: expand to Google Code, GitHub. This is more of a
logistics question than a research question.
But for the dataset to really contain millions of projects, it will be important to
have some sort of arrangement with these folks to host the dataset and
repository.
We want to enable mining of other artifacts: issues, e-mails, bug reports, other
project-related communication, wikis, et cetera.
We also want to design further language abstractions, and some of them we
have started to look at already. Last but not least, infrastructure improvements
are extremely important: as we throw more data at it, we are really hitting scale
issues, but we are making some progress on that also.
So with that, I will conclude for today. What I've shown you today is a
domain-specific language for software repository mining, which essentially has
three goals: it aims to be easy to use, efficient and scalable, and to enable
reproducible research results.
So if you have any further questions I'll be happy to take them at this point.
[applause].
>>: So do you support caching of intermediate results?
Hridesh Rajan: Let's suppose you have two queries and you want to -- you are
talking about -- no, currently we do not. Currently we do not support
intermediate results.
>>: So queries are totally independent?
Hridesh Rajan: That is correct.
>>: What if I wanted to define some tables to reuse across multiple queries,
stored in the cluster?
Hridesh Rajan: At the moment we don't -- the website that I showed you does
not have support for that. But that may not be the case in a month, because
there's somebody already working on it.
>>: Okay.
Hridesh Rajan: So this is -- but that's a good question. Thank you, Tom. And --
>>: Why did you decide to design an external language rather than, say, a DSL
embedded in Java that could call out to other Java methods to, for example,
build comma-separated output or run analyses on ASTs, something like that?
Hridesh Rajan: It was primarily due to the audience that we are targeting. The
goal of BOA is to bring these sorts of capabilities to people who may not
necessarily be expert programmers.
Given the choice between giving them an assignment in Java and giving them
this simple language, we think it's likely that they will succeed more here than in
Java.
>>: How hard would it be to support something like that? I would assume most
of the infrastructure is independent of that choice.
Hridesh Rajan: That is correct. It would not be very difficult to support
something like that for expert users, as an API-like thing. It would not be
difficult at all; it was simply a matter of selecting the right design for the
audience that we wanted to target.
>>: So in the examples you showed, the scripts are running over cleaned-up,
well-schematized datasets. What do you think about the other 80 percent of the
job, which is starting with the messy raw data and getting it to a clean dataset?
Is that in scope for your project?
Hridesh Rajan: It is. It is. If you recall from the data infrastructure, one of the
important things is to replicate the data from the repository into our tool, and
there are many problems that need to be reconciled for that. For example, the
same things don't mean the same across repositories. It is certainly within
scope. We have started off by addressing issues related to SourceForge at the
moment, and currently there's work in progress on seeing how the ontology for
SourceForge matches the ontology for Google Code. So yes, that is within
scope; it is essentially an ontology mapping problem.
>>: I was going to suggest that whatever approach you take, you kind of want to
make it open, because as more and more miners go after this data, they'll want
to schematize it in different ways to answer different questions. If too many
decisions get baked into the tool, that will hurt adoption rates. So you want it
pretty open, of course.
Hridesh Rajan: Correct. I also want to bring up another philosophy that we
have, which is not to provide an interpretation of the data but rather the data
itself, and let people make their own interpretations, in some sense. All right.
Good. Thank you. [applause]