>> Nachi Nagappan: Thank you for coming and to all our viewers online. It's my
pleasure to welcome Hridesh Rajan from Iowa State University. He has been
there since 2005. And prior to that he was at the University of Virginia where he
worked with Kevin Sullivan and got his Ph.D.
His primary area of research has been on aspect-oriented software development.
He's going to talk to us today about a new project he's been working on
called BOA, for mining large-scale software repositories.
Hridesh Rajan: Thank you for coming, both here and online. So this talk is about
BOA, which grew out of my personal pain trying to do empirical evaluation of
programming languages, in some sense. This is joint work with Robert Dyer, Hoan
Nguyen, and Tien Nguyen. All of them are at Iowa State University. So hopefully
I don't have to convince many of you that mining software repositories is an
important aspect of software development analytics.
We want to be able to learn from the past, to spot anti-patterns and what is
actually practiced, and then use those observations to inform the future: for
example, find better designs, keep doing what works, and do some empirical
validation. So in order to mine software repositories, the first thing is to find a
target, the repositories to mine. There are several examples of repositories that
academics have access to, and then some organization-specific repositories that
industry folks have access to. This is a rich set of data that is available out
there, and it allows us to make observations that previously could not even be
imagined. To give you an idea of the scale of these things: there are three large
repositories, Google Code, GitHub, and SourceForge, and these three together
have over a million projects, over a billion lines of code, and tens of millions of
revisions. When you compare this to a large software company, this is really not
such big data. But for academics, this is a huge amount of data at our disposal
for testing research hypotheses. So what would be examples of such hypotheses?
I'll pick some simple ones. For example, you could ask: what is the most used
programming language? A very simple hypothesis. Or: how many methods in my
code are named test, and what is the number of unit tests in the system?
You could ask things like: how many words are in log messages? Do people
actually bother to communicate their intent when committing code?
You could ask questions like: how many issue reports have duplicates? This is a
very common question in software development.
So I'm going to pick one of these questions and explain what it takes to answer
it today. Here's the task: computing the average churn rate for projects on
SourceForge. Churn rate here is the number of files changed per revision.
So today, if we wanted to do this task, we would first go to sourceforge.net and
say, okay, let me mine the project metadata. This is essentially a JSON file, and
for each project you find, you're going to run this loop. What this loop does is
essentially say: okay, is this a valid project? If that is the case, then does it have
a repository? If so, then write some code that accesses the repository to mine
revisions, then use the revisions to calculate the churn rate, and keep doing this
until you're done, to finally give you the average churn rate over projects.
Now, for somebody whose primary task is not mining software repositories, this
is a lot of work. The solution I present on the slide is too much code; don't
bother reading it. I show it just to illustrate how complex it can become. This
code was written by a mining software repositories expert, one of my
collaborators, who optimized it down to the bare minimum.
That is over 70 lines of code you have to write, and you can only write it after
learning about two libraries, a JSON library and an SVN library, and how to
access and use them effectively.
This code runs sequentially, and it takes over 24 hours to run and process the
SourceForge data. So if you start running it, it will take over 24 hours.
And if you take the time to download the project data locally and run this code
on the downloaded data, it will still take almost three hours to finish. Now, it is
conceivable that one could write a parallel version that runs faster, but that adds
complexity on top of what we already have, for answering a relatively simple
question. Here is an alternative in BOA. This program answers the same
question. I'm going to explain its different parts over the course of my
presentation, but just to make some basic high-level observations:
This full program is about six lines of code. It doesn't require users to understand
external libraries. It's automatically parallelized. It gives you the same result in
about a minute.
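A sketch of what that query looks like, reconstructed from the published BOA examples rather than the slide itself (the attribute names code_repositories, revisions, and files follow the domain-specific types described later in this talk; treat exact spellings as approximate):

    # average churn rate per project: mean number of files per revision
    p: Project = input;
    rates: output mean[string] of int;
    foreach (i: int; def(p.code_repositories[i]))
        foreach (j: int; def(p.code_repositories[i].revisions[j]))
            rates[p.id] << len(p.code_repositories[i].revisions[j].files);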
And that's what BOA is all about: giving software engineers and programming
language practitioners the ability to write queries of this sort and run them in a
timely manner, in some sense.
So this is the topic of my talk: the BOA language and data-intensive
infrastructure. It is available as a limited release right now at boa.cs.iastate.edu.
To tell you a little bit about how I came to BOA: from 2005 through 2012 the
focus of my work was reconciling modular reasoning and crosscutting concerns.
I worked on a language called Ptolemy, with Gary Leavens from the University of
Central Florida, whose primary goal was to do this.
As part of the language evaluation, we wanted to empirically validate the
Ptolemy design. In order to do that, we wanted to use real code, and a significant
set at that, to show benefits. The question we wanted to ask was: which projects
use event-based separation of concerns, and how do they evolve over time?
Through this work I had a couple of my students mine software repositories such
as SourceForge and so on. It took them a full three months, with not much
significant result coming out of it.
I didn't think much of that problem at the time. But recently, when I started
working on this idea called modularity-guided parallelism in this [inaudible] that
we're working on (I won't get into the details of what it does), we wanted to do
the same thing: empirically validate the design, using real code, and a significant
set at that, to show benefits.
For that we wanted to find out which projects use concurrent language features.
Same story: three or four months of student time spent, and not much progress
to show for it, because it's just very hard to do these things manually.
So: a lot of work, not much result. And so then we asked, can we fix this
problem? Can we fix this problem of running these sorts of queries, so that we
can get results in a reasonable time without writing thousands of lines of code or
spending thousands of hours doing this work?
And that's where BOA came to be. BOA has three design goals. The first is that
it should be easy to use, and I'll come back and explain who it is for. The second
is that it should be scalable and efficient, and the third is that it should enable
reproducible results. I'll explain each of these goals.
Easy to use: a simple language, with no need to know the details of repository
mining or data parallelization. What I want is to empower social scientists and
humanities experts to be able to do open source research. That's essentially my
goal with this project.
There's also a little story behind this. This requirement came out of a
collaboration that I recently became involved in with an English faculty member
at Iowa State University. This person is interested in understanding whether
terms mean the same thing in different domains.
So, for example, if I use the word "efficiency" in one context, does it mean the
same in another context? This person is trying to do that study on open source
projects, and he came to me and said it is just too hard to do this sort of work:
as a person who doesn't know very much about programming and its details, it's
just not possible for me to go and answer these sorts of questions.
BOA is hoping to fill that sort of gap, and I will come back and show you some
details on that.
Another goal is to be scalable and efficient. We want to study millions of
projects and have results in minutes, not days. And essentially what we want to
do is enable more ambitious [inaudible] scientific discovery in this particular
area.
Finally, we want reproducible research results. This goal essentially came out of
a paper by Gregorio Robles published at the Mining Software Repositories
conference. That paper studied results in this area, about 171 papers, and it
says that only a couple of them are replication friendly, for a variety of reasons.
Sometimes the data source is not available; sometimes it's proprietary; and
there's really nothing others can do about that. So we wanted to fix this problem
as well, so that researchers can provide something that a third party can verify,
and I'll show how we achieve that.
And if you have any questions, please do not hesitate to interrupt me. So BOA's
capabilities are essentially on three dimensions. The first dimension, the one
most program analysis folks are familiar with, is containment: we look at the
program as an AST, where programs contain classes, classes contain methods,
methods contain statements, and so on and so forth. When we do program
analysis, we think in terms of this containment relation. That's one dimension.
The second dimension that is important is software artifact evolution, in the
sense of how code changes over time: are methods being added, are statements
being added, and so on.
The third dimension, which is often thrust upon us, is scale: we want to be able
to do this on a very, very large code base.
I'm going to now delve into the details of this language. So let's look at the BOA
architecture. An important component is the data store that we have. The data
store essentially replicates SourceForge: there's some caching and translation
that goes on, which creates a local cache here and has the responsibility of
keeping it up to date, in some sense. The other aspect was designing the
language.
The language that we are designing, and I'm going to come to its details, is
essentially a variant of a language that came out of Google called Sawzall; we
did more work on that, and on top of it we have added domain-specific types and
functions. I'll get into the details of those also.
As part of this, we built a compiler for BOA, because while Google's language is
available, its compiler isn't available in a usable form, so we did some work on
that. Essentially a BOA program runs like this: once you write a query program,
it is compiled and a query plan is created. This query plan is deployed on the
cluster, which makes use of the local cache and finally produces query results.
It's a standard MapReduce environment.
And you will see a demo of that today as well. So, regarding domain-specific
types: this is one of the core contributions of BOA, figuring out what kinds of
types we need to provide to make the job of a mining software repositories
expert easier. So BOA provides domain-specific types, and in this program we
are seeing several of them. Project is a type; a project has properties like
programming languages, code repositories, and repository kind.
Code repositories have properties like revisions, and revisions have properties
like files, and so on. Once these facilities are available at the language level, the
person is not doing any work to mine this sort of information.
>>: Sorry, can I think of these types as basically just being the schema of the
underlying data?
Hridesh Rajan: Yes, you can. And that's -- at the implementation level, that's
where we start from.
So the role that these types play is to abstract the details of how to mine software
repositories so that people can just worry about programming in terms of these.
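As a minimal sketch of what programming in terms of these types looks like (assuming p.name is an attribute of the Project type, per the BOA documentation; the query itself is illustrative, not from the slides):

    # emit each project's name; the types expose repository metadata directly
    p: Project = input;
    names: output collection of string;
    names << p.name;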
>>: Maybe you'll explain the program later, but I'm confused about it. If you're
going to compute the average churn rate, I would have expected to get a number
out, and here the main operator is exists. So I would guess from that that this
turns out to give you [inaudible].
Hridesh Rajan: I will come to that. But you will get a number out. And so for
now think of the first line in the program as what you will get out of it.
>>: The rate?
Hridesh Rajan: Yes. And I'll come back and talk about that also.
Hridesh Rajan: So here are some details of this domain-specific type. It provides
the name of the project, the home page URL, programming languages, licenses,
and all the information you would typically expect about a project.
There are other types like code repositories, revisions, and files, and they
provide the usual details about these artifacts. More information about this is
available from the website, of course.
We also have domain-specific types for source code analysis, and these include
declarations, namespaces, types, et cetera; things like variables, modifiers, and
methods. All these types exist to make the life of a mining software repositories
expert easier.
We also have statement-level and expression-level access, so you really have
the entire abstract syntax tree available at your disposal.
Once again, I'm not getting into the details of these because this information is
all available online, and when you're programming you can look up which type
you need. Okay. Most of these types are not specific to any particular language;
they generally support any object-oriented language. Currently we have support
for Java source code, but we're planning to add support for C++ and also for C#
if we can find a suitable library.
What we do for the mapping, essentially, is parse the source file into an AST
and store it in the database; the database we're using is HBase.
So Java code like this would be represented like this in the internal AST. Again,
the details are not so important; just take a look at the containment relation.
This is essentially what I meant when I talked about the containment dimension:
you can browse through the statements, and ask what is the name, what are the
variables, what are the expressions it contains, and so on.
For those familiar with programming language parsing, this will come naturally in
some sense; it is standard AST containment. BOA also has domain-specific
functions. These are first-class functions that we make available in the language.
So hasfiletype is a function that takes two arguments, a revision and a string,
and returns a Boolean. As its body, it tests whether there exists a file in that
revision that matches a certain extension: if so it returns true, and otherwise it
returns false.
There are other versions of this function that take a Project or a CodeRepository.
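A sketch of how such a function might appear in a query, assuming the Project-taking overload he mentions:

    # count projects that contain at least one Java file
    p: Project = input;
    counts: output sum of int;
    if (hasfiletype(p, "java"))
        counts << 1;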
So here's another example. One of the common things people do in mining
software repositories is to ask whether a commit to the code is fixing a bug.
This function tries to test for that by looking at the log of the commit: oftentimes
people write things like "fixing this bug" or "fixing some other bug" and so on.
So it looks at the log string and says, okay, if it matches a certain pattern, then
we will consider it a bug fix.
The details of this particular implementation are not so important. What's
important is that a user is allowed to write these heuristics themselves.
In general, user-defined functions take this form; the return type is optional. The
idea is essentially to allow more complex algorithms to be encoded.
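A sketch of a user-defined heuristic in that form, loosely modeled on the bug-fix test he describes (the patterns here are illustrative, not the ones on the slide):

    # returns true if a commit log message looks like a bug fix
    isfixing := function(log: string) : bool {
        l: string = lowercase(log);
        if (match("fix(es|ed|ing)?", l)) return true;
        if (match("(bug|issue|defect)", l)) return true;
        return false;
    };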
Next are quantifiers, and this relates to your earlier question. Quantifiers are
essentially the same as in mathematics, and they allow us to easily express
loops over data, with the bounds inferred from the conditions.
I'll go through the quantifiers in detail. This is one, the foreach quantifier: for
each value of i, if a certain condition holds, run the body with i bound to that
value.
The exists quantifier is essentially the mathematical existential, as one would
imagine: if there exists a value of i where the condition holds, run the body once
with i bound to one matching value. And ifall says: if the condition holds for all
values of i, run the body once with i not bound. The reason we have these is
that we have found these kinds of quantifiers make the task of expressing
queries easier.
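A side-by-side sketch of the three forms (assuming p: Project = input and suitably declared output variables langs, counts, and ids; the conditions are illustrative):

    # foreach: body runs once per i for which the condition holds, i bound
    foreach (i: int; def(p.programming_languages[i]))
        langs << p.programming_languages[i];

    # exists: body runs once if some i satisfies the condition, i bound to one match
    exists (i: int; lowercase(p.programming_languages[i]) == "java")
        counts << 1;

    # ifall: body runs once, with i not bound, if the condition holds for every i
    ifall (i: int; p.programming_languages[i] != "cobol")
        ids << p.id;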
So BOA hides the details of distributed computation, and the underlying
architecture uses the MapReduce framework. For those not familiar with it, I'm
going to give a quick summary. MapReduce is essentially the idea that when you
are processing a bunch of data, in the map step you have identical copies of a
program, each processing a certain portion of the data.
Then, in the shuffle step, you take the output to a reducer, which, based on that
output, gives you one or more values. So essentially the idea is the same as
multiple data, single program: the same program runs on each portion of the
data.
That's the basic idea. It's not a contribution of BOA per se, but BOA makes use
of that paradigm. Output aggregation is also part of BOA's design. Output is
defined in terms of predefined data aggregators, like sum, set, mean, maximum,
minimum, et cetera.
Values are sent to output aggregator variables. An example output is declared
here: it says rates is an output that computes a mean, indexed by the string
sent to it, of integer values. Okay.
Here is sending data to the aggregator, and this uses a C++-like syntax: if
you're familiar with sending data to output with cout and <<, you'll definitely see
what's going on here. We're sending to rates, with an index and a value,
essentially saying: find this index and update the mean with this additional data
value.
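A minimal sketch of that declare-and-send pattern, with hypothetical literal values standing in for the slide's expressions:

    # an output computing a per-key mean of the ints sent to it
    rates: output mean[string] of int;

    # send (index, value) pairs; the aggregator folds them into the mean
    rates["some-project-id"] << 7;
    rates["some-project-id"] << 3;   # mean for this key is now 5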
Okay. So let's see it in a demo. When you go to the web page, there's a user
login on the left. We don't have open access to this infrastructure available yet.
But once you log in, you get to this page. This is how you use BOA, from a
tablet or a smartphone or a web browser. And we teach people how to program
in BOA using examples, so we make a wide variety of examples available to
users.
If my query matches one of these examples, I'll typically start with that. So
imagine for a second that I was interested in asking this question: how many
projects use the Java programming language? I'll start with this example, which
asks how many projects use the Scheme programming language, and when I
pick that, it populates the program. This program is saying that the output of the
program is a count; it's going to be a sum of the integer values that are sent to
it. The input of the program is a project.
And for each integer i such that a programming language used in the project
matches Scheme, I'm going to send the value one to the count, so it's basically
counting, distributed over all projects.
So I want to change this program to count projects using Java. Okay. Now this
is a program that counts the total number of projects on SourceForge that use
the Java programming language.
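A sketch of the demo query as he describes it (the Scheme original with "java" substituted; attribute spelling is per the BOA documentation, so treat it as approximate):

    # total number of projects that list Java among their languages
    p: Project = input;
    counts: output sum of int;
    foreach (i: int; lowercase(p.programming_languages[i]) == "java")
        counts << 1;   # sends one per matching entry, as in the spoken description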
The last thing we have to do is pick the dataset we want this program to run on.
This is sort of where the reproducibility requirement comes in.
When you report a result, you can report it by mentioning which dataset you ran
your program on, so that a third person understands: oh, this is the dataset on
which this result holds.
We have only one dataset at the moment, a snapshot taken in July 2012. We're
planning on doing one more snapshot, but I'm currently just going to run it on the
live data. So once I have my program, I can run it. That is essentially just
clicking a button. That's about all the work a user has to do.
So it's going to compile the program; let me return to the presentation here for a
second while we wait a minute for the results. Let's understand why we are
waiting for these results. The program we just wrote is analyzing about 600,000
projects and about 370,000 repositories.
And about 4 million revisions. Okay. And these are contained in this many files.
So let's take a look at what the program is doing.
So see here: the compilation has finished, and the execution has also finished.
After execution has finished, it allows you to see the job output. It says that
about 50,136 projects use Java as a programming language.
You can also download the job output if you so wish. And suppose your
program had an error: you can edit the source code and submit a new query
based on your program. All of this has to do with the ease-of-use requirement
that I pointed out earlier. We want this to be available to a class of users who
just cannot be bothered to install a bunch of libraries and other software systems
to make their queries work. They should be able to just point to a web page and
be done with it, in some sense. And this is a requirement that BOA achieves
very well in this setting.
We also checked the performance of BOA on a bunch of tasks; the details of
these tasks are not important, they're essentially different categories of things.
More details about that are in our XP paper this year.
What we saw is that, compared to the Java programs, BOA programs achieve
results much faster. As you can see, this is time on a logarithmic scale; on
average, BOA programs finish much faster. There were three categories of
these programs.
One category accessed only one repository revision, the second accessed all
revisions of a project, and the third only used metadata; that was this one here.
So while the Java programs, as you see, vary, the BOA programs take nearly the
same amount of time, and most of that is essentially Hadoop's start-up
overhead.
>>: Do the Java programs [inaudible] parallelism?
Hridesh Rajan: The Java programs are sequential, but they're looking at local
data -- and you can, for example, parallelize them if you would like.
>>: But the main speedup here is due to parallelism? I'm just trying to interpret
why.
Hridesh Rajan: The main speedup here is because of parallelism and also
because of data storage strategies. Those two things play a role: querying from
a distributed store versus querying from a centralized dataset. That's where it
comes from.
Any other questions on this? Let's look at it in some more detail. One of the
important things to note is the predictability of the result: a BOA program, we
can say, takes about a minute to run, whereas for a Java program it varies
based on what you're trying to do. Okay.
The next thing we did was examine whether these things scale. For that we
looked at three different input sizes: 6,000 projects, 60,000 projects, and
620,000 projects. The 6,000- and 60,000-project sets were picked out of the full
set using random sampling.
This again shows the BOA programs in blue and the Java programs in red. For
the Java programs, as the datasets grow, the performance decreases as
expected; for BOA the slope is much smaller.
And this, as I pointed out previously, is due to two reasons: first, increased
parallelism, and second, better data storage strategies.
At this point it's important to note that all of these queries could be run using
Java in about the same time, but the cost of writing those queries would be
much greater compared to what we are providing here.
For some concrete numbers, you can see the improvement for larger sizes:
BOA goes from 30 seconds to 32 to 41 seconds, not very much increase in the
amount of time it takes to process, whereas for Java the times climb from 96
seconds up into the thousands of seconds.
At that point it becomes very difficult to wait that long to compute the answer to
questions like this.
The next thing we did was ask: if we throw more infrastructure at it, will it work
for us? For that we scaled to more cores. These charts show different numbers
of maps, which essentially represent the amount of resources available: 32
maps essentially means 32 cores were available to it. This shows that as we
throw more machines at it, the performance actually increases; the time taken
decreases. If we continue to add more infrastructure, we would expect to see
similar trends.
One of the goals for BOA is to be able to replicate earlier results. If you recall,
we brought up this paper earlier, which says two out of 154 experimental papers
were replication friendly, mostly due to the lack of published data. And it's
understandable, because there are cases in which you cannot make data
available publicly.
So part of the problem is that results are difficult to reproduce, and our claim is
that BOA makes this easier. To validate this claim, I would like to reproduce
something very small for you today, to give you a taste of what it is like.
What we have done is write about 18 or 19 questions using BOA, in many
different categories, and we make them available from our examples web page.
One thing we did was show what it would be like to publish a result using BOA.
Okay. And this is an example. So imagine that this is essentially the
contribution of my research paper, a very small research paper.
What I would do is say: okay, this is the research question that I'm answering;
here's a program to answer that research question; and I'm using this dataset for
my results.
Okay. So I can take a look at that dataset, and, please don't try to read this, it
essentially just shows you the count of projects.
What I'm going to try to do, however, is reproduce this result. Reproducing the
result essentially involves copying this program, going to this infrastructure,
pasting it there, picking the right dataset, and running it. Okay. For this trivial
thing, it will run and give you the right answer. But the essence is that, with a
simple example query and a dataset tag, this infrastructure will allow people to
essentially see what you saw in the first place, without having to build the whole
thing.
We also did a controlled experiment where we used the published artifacts from
the BOA website and the datasets used. Here are the results of this controlled
study. In this study we had folks ranging from post-docs to undergraduate
students who tried to use BOA to reproduce some results. Some were experts;
some were not. Their task was essentially to pick one task out of the 18 that we
provide and see if they could reproduce that result. Most of them were able to
do this in less than ten minutes: they were able to show the same results that
were published earlier.
There is a bunch of related work, but none of it provides direct support for mining
software repositories. I'm going to give you some examples. Sawzall is where
BOA comes from; it has a similar syntax to BOA and compiles to MapReduce.
Pig Latin is another example in this category, with a syntax similar to SQL.
DryadLINQ is another example of this kind of language and framework.
There was some work on Sourcerer, an SQL database of about 18K projects,
which allowed you to do mining of software repositories.
BOA is more scalable compared to Sourcerer and more fine grained, because it
provides the full abstract syntax tree. And there are commercial products
available. There used to be a Google Code Search that allowed you to do grep
over repositories, but it allowed you to answer only very limited questions,
whereas we provide much more information than that.
Ohloh, from Black Duck, gives you a REST API, project metrics, et cetera, for
some projects, but we found it's not as scalable as BOA.
And then there are things like Google Code Search, GitHub, et cetera; compared
to those, BOA provides more than just textual matching.
There's ongoing work in several directions; I'm going to identify them and tell
you a little bit about each. First, we want to support more kinds of repositories.
We have support for SVN; we want to do Git, Bazaar, et cetera. And we want to
support more datasets: expand to Google Code, GitHub. This is more of a
logistics question than a research question.
But for the dataset to really contain millions of projects, it will be important to
have some sort of arrangement with these folks to host the dataset and
repository.
We want to enable mining of other artifacts: issues, e-mails, bug reports, other
project-related communication, wikis, et cetera.
We also want to design further language abstractions, and some of them we
have started to look at already. Last but not least, infrastructure improvements
are extremely important: as we throw more data at it, we are really hitting scale
issues, but we are making some progress on that also.
So with that, I will conclude for today. What I've shown you today is a
domain-specific language for software repository mining, which essentially has
three goals: it aims to be easy to use, efficient and scalable, and to enable
reproducible research results.
So if you have any further questions I'll be happy to take them at this point.
[applause].
>>: So do you support caching of intermediate results?
Hridesh Rajan: Let's suppose you have two queries and you want to -- you are
talking about -- no, currently we do not. Currently we do not support
intermediate results.
>>: So queries are totally independent?
Hridesh Rajan: That is correct.
>>: What if I wanted to define some tables to reuse across multiple queries,
stored in the cluster?
Hridesh Rajan: At the moment we don't -- the website that I showed you does
not have support for that. But that may not be the case in a month, because
there's somebody already working on it.
>>: Okay.
Hridesh Rajan: So this is -- but that's a good question. Thank you, Tom. And --
>>: Why did you decide to design an external language rather than, say, a DSL
embedded in Java that could call out to other Java methods to, for example,
build comma-separated output or run analyses on ASTs, something like that?
Hridesh Rajan: It was primarily due to the audience that we are targeting. The
goal of BOA is to bring these sorts of capabilities to people who may not
necessarily be expert programmers.
Given the choice between giving them an assignment in Java and giving them
this simple language, we think it's likely that they will succeed more here than in
Java.
>>: How hard would it be to support something like that? I would assume most
of the infrastructure is independent of that choice.
Hridesh Rajan: That is correct. It would not be very difficult to support
something like that for expert users, as an API-like thing. It would not be
difficult at all; it was simply a matter of selecting the right design for the
audience that we wanted to target.
>>: So in the examples you showed, the scripts are running over cleaned-up,
well-schematized datasets. What do you think about the other 80 percent of the
job, which is starting with the messy raw data and getting it to a clean dataset?
Is that in scope for your project?
Hridesh Rajan: It is. It is. If you recall from the data infrastructure, one of the
important things is to replicate the data from the repository into our tool, and
there are many problems that need to be reconciled for that. For example, the
same things don't mean the same across repositories. It is certainly within
scope. We have started off by addressing issues related to SourceForge at the
moment, and currently there's work in progress on seeing how the ontology for
SourceForge matches the ontology for Google Code. So yes, that is within
scope; it is essentially an ontology mapping problem.
>>: I was going to suggest that whatever approach you take, you kind of want to
make it open, because as more and more miners go after this data, they'll want
to schematize it in different ways to answer different questions. If too many
decisions get baked into the tool, that will hurt adoption rates. So you want it
pretty open, of course.
Hridesh Rajan: Correct. I also want to bring up another philosophy that we
have, which is not to provide an interpretation of the data but rather the data
itself, and let people make their own interpretations, in some sense. All right.
Good. Thank you. [applause]