Boa: Ultra-Large-Scale Software Repository and Source Code Mining
Robert Dyer et al.
Presented by: Esteban Parra, Javier Escobar-Avila

Mining Software Repositories Challenges
● Software repositories are huge
● Knowledge discovery is non-trivial
● Complex (i.e., almost non-existent) structure
● Short-term usage
● Not reusable or scalable

Motivating example
RQ: What is the average number of changed files per revision for all Java projects that use SVN?

Boa
Knowledge Discovery and Data Mining framework for analyzing ultra-large software repositories.
● Domain-specific language
● Domain-specific types (e.g., Project, Revision, Code Repository, and AST)
● User specifies what to do, instead of how to do it
● Implemented on top of Sizzle (a Sawzall implementation)
● Runs on a Hadoop cluster (parallelization)

Boa design
(Figure extracted from Dyer et al.)

Characteristics
● Domain-specific types
● MapReduce support (Hadoop)
● User-defined functions
● Up-to-date repository data (monthly)
● Source code as ASTs

Motivating example (Boa)
RQ: What is the average number of changed files per revision for all Java projects that use SVN?
(Figure extracted from Dyer et al.)

Evaluation
● 700K SourceForge projects
● 12 questions (tasks) related to software repositories
● A program to answer each task was written and executed using Boa, Java, and Hadoop
● Each program was run with different input-size configurations

Task                                                           Java <LOC, seconds>  Hadoop <LOC, seconds>  Boa <LOC, seconds>
What are the five most used licenses?                          <63, 673>            <83, 24>               <3, 26>
How often is each database used in each programming language?  <71, 655>            <46, 26>               <4, 27>
What are the ten most used programming languages?              <61, 706>            <88, 24>               <3, 26>
What are the churn rates for all Java projects that use SVN?   <68, 13457>          <60, 26>               <4, 30>

How many fixing revisions added null checks?
● 45 lines of Boa code
● Finds revisions that added lines of code of the form:
    if (<variable> == null) {
        // Do something
    }

Conclusions
● Users can easily produce fast, parallel code using Boa, with far fewer lines of code and without having to learn how to write Hadoop programs
● Boa lacks additional software artifacts such as bug reports, forum posts, and social data, among others

Our project
● Same motivation as Boa (ultra-large repository), but using a different approach
● We want to include different sources of information about Java programming in a single repository
  ○ Source code from GitHub
  ○ Questions and answers from Stack Overflow
  ○ Video tutorials from YouTube
● Define a semantic similarity measure among all the elements in our repository
● User inputs a query (set of keywords):
  ○ Top k relevant results
  ○ Complementary information (using the semantic links)
● Community detection algorithms to identify topics/communities
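
Sketch: keyword query to top k results
The query step in our project can be illustrated with a small Python sketch. It is only a stand-in, not the project's actual design: the slides do not define the semantic similarity measure, so this example swaps in plain bag-of-words cosine similarity. The names tokenize, cosine, top_k, and REPOSITORY, as well as the three toy entries, are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=3):
    """Return up to k (score, source, text) tuples with a positive score, best first."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(text))), source, text)
        for source, text in documents
    ]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical toy repository mixing the three kinds of sources (assumption).
REPOSITORY = [
    ("GitHub", "public String readFile(Path path) throws IOException"),
    ("Stack Overflow", "How do I read a file line by line in Java?"),
    ("YouTube", "Java tutorial: sorting a list with a comparator"),
]

if __name__ == "__main__":
    for score, source, text in top_k("read file java", REPOSITORY, k=2):
        print(f"{score:.2f}  {source}: {text}")
```

With this purely lexical measure, the query "read file java" does not match the GitHub entry at all, because the identifier readFile is kept as a single token. Gaps like this are one reason a semantic measure, rather than exact keyword overlap, would be needed to link source code with questions and videos.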