Boa: Ultra-Large-Scale Software Repository and Source Code Mining Robert Dyer et al.

Presented by:
Esteban Parra
Javier Escobar-Avila
Mining Software Repositories
Challenges
● Software repositories are huge
● Knowledge discovery is nontrivial
● Complex (aka almost non-existent) structure
● Short-term usage
● Not reusable or scalable
Motivating example
RQ: What are the average numbers of changed files per revision for all Java
projects that use SVN?
Boa
Knowledge Discovery and Data Mining framework for
analyzing ultra-large software repositories.
● Domain-specific language
● Domain-specific types (e.g., Project, Revision, Code Repository, and AST)
● User specifies what to do, instead of how to do it.
● Implemented on top of Sizzle (Sawzall implementation)
● Runs on a Hadoop cluster (parallelization)
Boa design
Extracted from Dyer et al.
Characteristics
● Domain-specific types
● MapReduce support (Hadoop)
● User-defined functions
● Up-to-date repository data (updated monthly)
● Source code as ASTs
Motivating example (Boa)
RQ: What are the average numbers of changed files per revision for all Java
projects that use SVN?
Extracted from Dyer et al.
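To make the question concrete, here is a minimal Python sketch of what the query computes. It is not the Boa program from the paper. The `Project`, `Repository`, and `Revision` classes are hypothetical in-memory stand-ins for Boa's domain-specific types, and the sketch assumes one reading of the question: one average per project.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Revision:
    changed_files: list  # paths touched by this revision


@dataclass
class Repository:
    kind: str  # e.g. "SVN" or "GIT" (hypothetical labels)
    revisions: list = field(default_factory=list)


@dataclass
class Project:
    name: str
    languages: list
    repositories: list = field(default_factory=list)


def avg_changed_files(projects):
    """Map each Java project that uses SVN to its mean changed files per revision."""
    result = {}
    for p in projects:
        # Keep only projects that list Java among their languages.
        if not any(lang.lower() == "java" for lang in p.languages):
            continue
        # Collect one file count per revision, SVN repositories only.
        counts = [
            len(rev.changed_files)
            for repo in p.repositories
            if repo.kind.upper() == "SVN"
            for rev in repo.revisions
        ]
        if counts:  # skip Java projects with no SVN revisions
            result[p.name] = mean(counts)
    return result
```

Even in this toy form, the filtering and aggregation have to be spelled out by hand, and nothing here runs in parallel. That is the boilerplate Boa aims to hide.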
Evaluation
● 700K SourceForge projects
● 12 software-repository-related questions (tasks)
● A program answering each task was written and executed in Boa, Java, and Hadoop
● Each program was run with different input sizes.
Task                                                             Java             Hadoop           Boa
                                                                 <LOC, seconds>   <LOC, seconds>   <LOC, seconds>
What are the five most used licenses?                            <63, 673>        <83, 24>         <3, 26>
How often is each database used in each programming language?    <71, 655>        <46, 26>         <4, 27>
What are the ten most used programming languages?                <61, 706>        <88, 24>         <3, 26>
What are the churn rates for all Java projects that use SVN?     <68, 13457>      <60, 26>         <4, 30>
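The gap between the implementations is easier to read as ratios. The sketch below copies the <LOC, seconds> pairs from the evaluation results and computes how many times shorter and faster one implementation is than another. The `results` keys and the `compare` helper are illustrative names, not part of the paper.

```python
# <LOC, seconds> per implementation, copied from the evaluation results.
results = {
    "licenses": {"Java": (63, 673), "Hadoop": (83, 24), "Boa": (3, 26)},
    "databases": {"Java": (71, 655), "Hadoop": (46, 26), "Boa": (4, 27)},
    "languages": {"Java": (61, 706), "Hadoop": (88, 24), "Boa": (3, 26)},
    "churn": {"Java": (68, 13457), "Hadoop": (60, 26), "Boa": (4, 30)},
}


def compare(task, baseline="Java", candidate="Boa"):
    """Return (LOC ratio, time ratio) of baseline over candidate for one task."""
    base_loc, base_sec = results[task][baseline]
    cand_loc, cand_sec = results[task][candidate]
    return base_loc / cand_loc, base_sec / cand_sec


for task in results:
    loc_ratio, time_ratio = compare(task)
    print(f"{task}: {loc_ratio:.1f}x fewer lines, {time_ratio:.1f}x faster than Java")
```

For the churn task, the Boa program is 17 times shorter than the Java one (68 vs. 4 lines). Against Hadoop, `compare(task, baseline="Hadoop")` shows Boa running in about the same time with far fewer lines.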
How many fixing revisions added null checks?
45 lines of code in Boa
Revisions that added lines of code of the form:
if(<variable> == null){
//Do something
}
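As a rough illustration of the idea, the following Python sketch counts fixing revisions that add a null check. It swaps in a plain regular expression over source text where Boa works on ASTs, and it guesses at "fixing" revisions with a hypothetical keyword heuristic on the log message. All names are assumptions made for this example.

```python
import re

# Matches `if (x == null)` or `if (null == x)`; a textual approximation only.
NULL_CHECK = re.compile(
    r"\bif\s*\(\s*(?:[A-Za-z_$][\w$.]*\s*==\s*null|null\s*==\s*[A-Za-z_$][\w$.]*)\s*\)"
)
# Hypothetical heuristic for spotting bug-fixing commits from their log message.
FIX_WORDS = re.compile(r"\b(?:fix(?:e[sd])?|bug|error|defect)\b", re.IGNORECASE)


def count_null_checks(source):
    """Number of null-check conditions found in a piece of Java source text."""
    return len(NULL_CHECK.findall(source))


def adds_null_check(before, after):
    """True if the file has more null checks after the change than before."""
    return count_null_checks(after) > count_null_checks(before)


def count_fixing_revisions_with_null_checks(revisions):
    """revisions: iterable of (log_message, [(before_src, after_src), ...])."""
    total = 0
    for log, files in revisions:
        if FIX_WORDS.search(log) and any(adds_null_check(b, a) for b, a in files):
            total += 1
    return total
```

A regex misses compound conditions such as `if (x == null || y.isEmpty())` and can match inside comments or strings, which is one reason an AST-based query is the more reliable approach.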
Conclusions
● Users can easily produce fast, parallel code using Boa, with far fewer
lines of code and without having to learn how to write Hadoop programs
● Boa lacks additional software artifacts such as bug reports, forum
posts, and social data
Our project
● Same motivation as Boa (ultra-large repository), but using a different
approach.
● We want to include different sources of information about Java programming
in a single repository:
  ○ Source code from GitHub
  ○ Questions and answers from Stack Overflow
  ○ Video tutorials from YouTube
● Define a semantic similarity measure among all the elements in our
repository.
● User inputs a query (set of keywords):
  ○ Top k relevant results
  ○ Complementary information (using the semantic links)
● Community detection algorithms to identify topics/communities.
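The query step can be sketched as follows. The slides do not say which similarity measure is used, so this example assumes Jaccard overlap between keyword sets as a simple stand-in. The `tokens`, `jaccard`, and `top_k` helpers and the sample items are hypothetical.

```python
import heapq
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(query, items, k=3):
    """Return up to k (score, source, text) tuples, best score first.

    items: list of (source, text) pairs drawn from any of the repositories.
    Items sharing no keyword with the query are dropped.
    """
    q = tokens(query)
    scored = ((jaccard(q, tokens(text)), source, text) for source, text in items)
    return heapq.nlargest(k, (s for s in scored if s[0] > 0), key=lambda s: s[0])


items = [
    ("github", "java hashmap iteration example"),
    ("stackoverflow", "how to iterate a hashmap in java"),
    ("youtube", "python decorators tutorial"),
]
for score, source, text in top_k("java hashmap", items, k=2):
    print(f"{score:.2f}  {source}: {text}")
```

Because every item is scored the same way regardless of origin, results from source code, Q&A, and videos can be ranked in a single list. The same pairwise scores could also serve as edge weights for the community detection step.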