Information Retrieval Based Analysis of Software Artifacts
Ripon K. Saha, Lingming Zhang, Sarfraz Khurshid, Dewayne E Perry
Advanced Research in Software Engineering (ARiSE), ECE, UT Austin
ARiSE - UT Austin

Introduction
- Threats to safety, security, and reliability
  - Poor software engineering
  - Software faults
    - Inherent because of complexity
    - Present even with the best SE teams using the best SE practices
- Software artifacts
  - Source code, software fault reports, test cases, etc.
  - Text based
  - Have structure
  - Carry semantics-infused names and words

Exciting Results
- Significant improvements gained using well-understood, standard IR techniques
- Scalable approaches that are
  - Text based
  - Lightweight
  - Language independent (in structured retrieval we do exploit language structure)
  - Independent of specially built structures
  - Computationally efficient
- We will look at two types of fault-related analyses
  - Bug localization
  - Test case prioritization

Bug Localization Approaches
- Dynamic
  - (+) precise detection
  - (-) needs information such as test coverage and execution traces
  - (-) computationally expensive
- Static
  - (+) computationally efficient
  - (+) fewer requirements, e.g., only source code and/or bug reports
  - (-) generally less accurate than dynamic approaches
  - (-) high-level detection
- IR-based approach
  - Input: document collection of source code files
  - Query: bug reports
  - Output: ranked list of source code files
  - (a toy sketch of this formulation appears below)
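To make the input/query/output formulation above concrete, the following toy Java sketch ranks source files against a bug report using a simple term-overlap score. It is an illustration only; the class and method names (ToyBugLocalizer, rankFiles, scoreOverlap) are hypothetical, and the actual tool, described on the following slides, uses the Indri toolkit with BM25 term weighting.

import java.util.*;

// Toy sketch of IR-based bug localization: documents are source files,
// the query is a bug report, the output is a ranked list of files.
// Hypothetical names; the real scoring uses Indri/BM25.
public class ToyBugLocalizer {

    // Score a file by how often query terms occur in it (raw term frequency).
    static int scoreOverlap(Set<String> queryTerms, List<String> fileTerms) {
        int score = 0;
        for (String t : fileTerms) {
            if (queryTerms.contains(t)) score++;
        }
        return score;
    }

    // Rank file names by descending score against the bug report text.
    static List<String> rankFiles(String bugReport, Map<String, List<String>> fileTerms) {
        Set<String> query = new HashSet<>(Arrays.asList(bugReport.toLowerCase().split("\\W+")));
        List<String> files = new ArrayList<>(fileTerms.keySet());
        files.sort((a, b) -> scoreOverlap(query, fileTerms.get(b)) - scoreOverlap(query, fileTerms.get(a)));
        return files;
    }

    public static void main(String[] args) {
        Map<String, List<String>> corpus = new HashMap<>();
        corpus.put("ConsoleView.java", Arrays.asList("pinned", "console", "display", "view"));
        corpus.put("LaunchConfig.java", Arrays.asList("launch", "configuration", "program"));
        System.out.println(rankFiles("Pinned console does not remain on top", corpus));
    }
}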
A Bug Report and Corresponding Fix

Bug ID: 80720
Summary: Pinned console does not remain on top
Description: Open two console views, ... Pin one console. Launch another program that produces output. Both consoles display the last launch. The pinned console should remain pinned.

Source code file: ConsoleView.java

public class ConsoleView extends PageBookView implements IConsoleView, IConsoleListener {
    ...
    public void display(IConsole console) {
        if (fPinned && fActiveConsole != null) {
            return;
        }
        ...
    }
    ...
    public void pin(IConsole console) {
        if (console == null) {
            setPinned(false);
        } else {
            if (isPinned()) {
                setPinned(false);
            }
            display(console);
            setPinned(true);
        }
    }
}

Figure: A real bug report from the Eclipse project and the corresponding fixed source code [taken from Zhou et al., ICSE 2012]

Steps in IR-based Bug Localization
- Preprocessing source code and bug reports
  - Extract program constructs: class names, method names, variable names, comments
  - Extract the bug report summary and the bug report description
  - Tokenization
  - Removing stop words
  - Stemming
- Indexing preprocessed source code
  - Used the Lemur/Indri toolkit for indexing
  - Indexed both CamelCase identifiers as-is and tokenized identifiers
- Retrieving source code based on the bug report
  - TF.IDF term weighting based on BM25

BLUiR: Architecture
(architecture diagram)

Case Study 1 - Java
- Data set
  - Eclipse: 3075 bug reports, 12863 Java files
  - AspectJ: 286 bug reports, 6485 Java files (used as the training set)
  - SWT: 98 bug reports, 484 Java files
  - ZXing: 20 bug reports, 391 Java files
- Evaluation metrics (a small computation sketch follows the result charts below)
  - Recall in Top 1, Top 5, and Top 10
  - Mean Average Precision (MAP)
  - Mean Reciprocal Rank (MRR)
- Threats to validity
  - Structure of OOP
  - Quality of bug reports and source code
  - Open source code & system code bias

RQ: Does indexing the exact identifier names improve bug localization?
(Bar charts for Eclipse, SWT, AspectJ, and ZXing comparing tokenized-only indexing ("Token") against indexing both exact and tokenized identifiers ("Both") on Top 1, Top 5, Top 10, MAP, and MRR.)

RQ: Does modeling source code structure help improve accuracy?
(Bar charts for Eclipse, SWT, AspectJ, and ZXing comparing retrieval without ("No") and with ("Yes") source code structure on Top 1, Top 5, Top 10, MAP, and MRR.)

RQ: Does BLUiR outperform other bug localization tools and models?
(Bar charts for Eclipse, SWT, AspectJ, and ZXing comparing BugLocator and BLUiR, each in "No SB" and "SB" variants, on Top 1, Top 5, Top 10, MAP, and MRR.)
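The charts above report the metrics defined for Case Study 1. As a hedged illustration, not the authors' evaluation code, the sketch below shows one common way to compute Top-N recall, MAP, and MRR from each bug's ranked file list and its set of actually fixed files; the class and method names are hypothetical.

import java.util.*;

// Illustrative computation of Top-N, MAP, and MRR for bug localization.
// For each bug: 'ranked' is the retrieved file list, 'relevant' the fixed files.
public class LocalizationMetrics {

    // Top-N: fraction of bugs with at least one relevant file in the first n results.
    static double topN(List<List<String>> ranked, List<Set<String>> relevant, int n) {
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            List<String> top = ranked.get(i).subList(0, Math.min(n, ranked.get(i).size()));
            if (!Collections.disjoint(top, relevant.get(i))) hits++;
        }
        return (double) hits / ranked.size();
    }

    // MRR: mean over bugs of 1 / (rank of the first relevant file).
    static double mrr(List<List<String>> ranked, List<Set<String>> relevant) {
        double sum = 0;
        for (int i = 0; i < ranked.size(); i++) {
            for (int r = 0; r < ranked.get(i).size(); r++) {
                if (relevant.get(i).contains(ranked.get(i).get(r))) { sum += 1.0 / (r + 1); break; }
            }
        }
        return sum / ranked.size();
    }

    // MAP: mean over bugs of the average precision at each relevant file's rank.
    static double map(List<List<String>> ranked, List<Set<String>> relevant) {
        double sum = 0;
        for (int i = 0; i < ranked.size(); i++) {
            int found = 0;
            double ap = 0;
            for (int r = 0; r < ranked.get(i).size(); r++) {
                if (relevant.get(i).contains(ranked.get(i).get(r))) {
                    found++;
                    ap += (double) found / (r + 1);
                }
            }
            sum += found == 0 ? 0 : ap / relevant.get(i).size();
        }
        return sum / ranked.size();
    }

    public static void main(String[] args) {
        List<List<String>> ranked = Arrays.asList(Arrays.asList("A.java", "B.java", "C.java"));
        List<Set<String>> relevant = Arrays.asList(new HashSet<>(Arrays.asList("B.java")));
        System.out.println(topN(ranked, relevant, 1) + " " + map(ranked, relevant) + " " + mrr(ranked, relevant));
    }
}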
RQ: Does BLUiR outperform other bug localization tools and models? (continued)
(Chart: rank improvement per query, roughly -20 to +50, plotted against query number. Query-wise comparison for SWT.)

Case Study 2 – C (with Julia Lawall)
- Same metrics used
- Two data sets used:
  - For comparison with Java, the previous data set (less ZXing)
  - For analysis of C code, the following data set:

System              SLOC     Bugs   Files   Bugs (function level)   Functions
Python 3.4.0        380k     3407   488     -                       -
GDB 7.7             1982k    195    2655    177                     41298
WineHQ              2340k    2350   2815    2218                    89430
GCC 4.9.0           2571k    216    22678   193                     75746
GCC NT              2062k    -      2473    -                       40684
Linux Kernel 3.14   11829k   1548   19853   1178                    347057

RQ: IR-related Properties of C & Java
(figure)

              Function/Method-Terms        Identifier-Terms
              Actual%      Unique%         Actual%      Unique%
Python        66%          67%             67%          59%
GDB           72%          71%             59%          69%
WineHQ        76%          59%             72%          59%
GCC           44%          32%             33%          18%
GCC NT        20%          30%             12%          21%
Linux         46%          20%             25%          10%
SWT           95%          92%             84%          91%
AspectJ       97%          85%             94%          48%
Eclipse       67%          75%             52%          48%

- C file sizes are larger
- The C file median indicates that file sizes are highly skewed
- The number of terms in C files is considerably larger than in Java classes
- English words are considerably more common in Java code than in C code
- C and Java are quite different
  - Different language paradigms
  - Different naming styles (affecting IR issues)

RQ: Accuracy of Language-Independent Retrieval

Project   Top 1   Top 5   Top 10   MAP     MRR
Python    45%     68%     76%      0.508   0.555
GDB       26%     41%     48%      0.249   0.341
WineHQ    20%     40%     50%      0.273   0.301
GCC       31%     52%     61%      0.314   0.406
Linux     25%     44%     53%      0.307   0.343
SWT       38%     72%     86%      0.465   0.539
AspectJ   26%     47%     54%      0.200   0.360
Eclipse   16%     35%     44%      0.190   0.253

- The best results come from the smallest projects
- Recall in Top 1, MAP, and MRR are consistently higher for C
- However, Java gains a larger improvement in Recall in Top 10 than C
- High correlation between the use of English words and the accuracy of retrieval
- Hence, greater use of English words in C programs leads to higher accuracy in localizing bugs

RQ: Accuracy of Flat-Text Retrieval

Project   Top 1        Top 5        Top 10       MAP            MRR
Python    47% (+4%)    72% (+6%)    80% (+5%)    0.532 (+5%)    0.581 (+5%)
GDB       22% (-15%)   42% (+2%)    49% (+2%)    0.231 (-7%)    0.315 (-8%)
WineHQ    26% (+30%)   47% (+18%)   55% (+10%)   0.322 (+18%)   0.358 (+19%)
GCC       29% (-6%)    52% (0%)     61% (0%)     0.300 (-4%)    0.401 (-1%)
Linux     23% (-8%)    44% (0%)     53% (0%)     0.299 (-3%)    0.333 (-3%)
SWT       38% (0%)     72% (0%)     86% (0%)     0.465 (0%)     0.539 (0%)
AspectJ   30% (+15%)   51% (+9%)    61% (+13%)   0.221 (+10%)   0.407 (+13%)
Eclipse   24% (+50%)   45% (+29%)   54% (+23%)   0.265 (+39%)   0.340 (+34%)

RQ: Accuracy of Flat-Text Retrieval
- Parse the source code and extract only
  - class/file names
  - method/function names
  - identifiers
  - comment words
- Discard language keywords and tokens
- Again, the small projects have the highest recall, MAP, and MRR
- But removing keywords does not bring much improvement to C code retrieval
- (a simplified extraction sketch follows below)
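As a rough illustration of the flat-text document construction described above, the sketch below keeps identifier-like tokens and comment words, splits camelCase and underscore_separated names, and drops a small set of language keywords. It is a simplified, hypothetical extractor built on regular expressions, not the parser-based implementation used in the studies.

import java.util.*;
import java.util.regex.*;

// Simplified flat-text extraction: keep identifiers and comment words,
// split camelCase / snake_case, drop language keywords. Hypothetical sketch;
// the actual studies parse the code rather than using regexes.
public class FlatTextExtractor {

    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "public", "class", "void", "if", "else", "return", "int", "static", "new"));

    // Split a single identifier into lower-case word parts.
    static List<String> splitIdentifier(String id) {
        String spaced = id.replaceAll("_", " ")
                          .replaceAll("([a-z0-9])([A-Z])", "$1 $2"); // camelCase boundary
        List<String> parts = new ArrayList<>();
        for (String p : spaced.split("\\s+")) {
            if (!p.isEmpty()) parts.add(p.toLowerCase());
        }
        return parts;
    }

    // Extract index terms from a source fragment: identifiers (split) plus comment words.
    static List<String> extract(String source) {
        List<String> terms = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*").matcher(source);
        while (m.find()) {
            String tok = m.group();
            if (!KEYWORDS.contains(tok)) terms.addAll(splitIdentifier(tok));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(extract("public void setPinned(boolean fPinned) { /* pin the console view */ }"));
    }
}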
RQ: Accuracy of Structured Retrieval

Project   Top 1        Top 5        Top 10       MAP            MRR
Python    53% (+13%)   77% (+7%)    85% (+6%)    0.586 (+10%)   0.640 (+10%)
GDB       28% (+27%)   46% (+10%)   55% (+15%)   0.261 (+13%)   0.366 (+16%)
WineHQ    25% (-4%)    47% (+0%)    56% (+4%)    0.325 (+1%)    0.360 (+1%)
GCC       34% (+17%)   55% (+6%)    63% (+3%)    0.319 (+6%)    0.441 (+10%)
Linux     27% (+17%)   49% (+11%)   57% (+8%)    0.337 (+13%)   0.375 (+15%)
SWT       55% (+45%)   77% (+7%)    86% (+0%)    0.557 (+19%)   0.645 (+20%)
AspectJ   32% (+7%)    51% (0%)     61% (0%)     0.240 (+20%)   0.414 (+11%)
Eclipse   31% (+29%)   53% (+18%)   63% (+17%)   0.316 (+19%)   0.416 (+22%)

RQ: Accuracy of Structured Retrieval
- Distinguish between class/file names, method/function names, identifiers, and comment words
- Case Study 1 showed the improvement for Java
- Structured retrieval is more effective than flat-text retrieval for all C projects except WineHQ
  - Recall in Top 1: 13% to 27% improvement for C; 7% to 45% for Java
  - MAP: up to 13% for C; 20% for Java
- Less improvement over language-independent retrieval for C
  - 25% for WineHQ but less than 10% for most others
  - Eclipse: 94% improvement

RQ: Accuracy of Function-Level Structured Retrieval

Project   Top 1   Top 5   Top 10   MAP     MRR
GDB       7%      21%     27%      0.073   0.145
WineHQ    8%      16%     21%      0.085   0.122
GCC       9%      18%     25%      0.081   0.144
Linux     11%     19%     24%      0.127   0.159

- Recall in Top 1 is 7-11%; in Top 10, 21-27%
- Not surprising: a function carries much less information than a file
- Retrieval at the function level is also more expensive than at the file level
- Effective strategy:
  - Use file-level retrieval first (about 80% accuracy in Top 100)
  - Then, within that set, use function-level retrieval, with almost the same accuracy as at the file level

RQ: Importance of Information Sources
- All information sources are important
- Method/function names and identifier names are the most important

Conclusions: IR-Based Bug Localization
- Challenges for IR-based bug localization in C
  - Less structure than Java
  - C preprocessor directives
  - Macros
  - Historically, fewer English words used in C programs
- Still, IR-based bug localization is effective for C programs
- But C programs benefit less from the use of program structure in IR retrieval

IR-based Regression Test Prioritization
- Dynamic/static analysis approaches to regression test case prioritization
  - Precise, but with significant overhead: instrumentation, dynamic profiling of test coverage, static analysis
- REPiR: an IR-based regression test prioritization tool
  - Document collection: the test cases
  - Query: the difference between two program versions
  - (a toy prioritization sketch follows below)
- Results of our study of REPiR
  - Computationally efficient
  - Significantly better than many existing dynamic and static analysis techniques
  - At least as good as the others
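As a hedged sketch of the IR formulation above, and not REPiR's actual implementation (which builds structured documents and uses the Indri toolkit), the following code treats each test's text as a document, the program difference as the query, and orders tests by a simple shared-term score. The class and method names are hypothetical.

import java.util.*;

// Toy IR-based regression test prioritization: documents are test cases,
// the query is the textual difference between two program versions.
// Hypothetical names; REPiR itself uses structured documents and Indri.
public class ToyTestPrioritizer {

    // Count how many terms the test shares with the change diff.
    static long score(Set<String> diffTerms, String testText) {
        return Arrays.stream(testText.toLowerCase().split("\\W+"))
                     .filter(diffTerms::contains)
                     .count();
    }

    // Return test names ordered by descending relevance to the diff.
    static List<String> prioritize(String diffText, Map<String, String> tests) {
        Set<String> diffTerms = new HashSet<>(Arrays.asList(diffText.toLowerCase().split("\\W+")));
        List<String> order = new ArrayList<>(tests.keySet());
        order.sort(Comparator.comparingLong((String t) -> score(diffTerms, tests.get(t))).reversed());
        return order;
    }

    public static void main(String[] args) {
        Map<String, String> tests = new LinkedHashMap<>();
        tests.put("testPinnedConsole", "console view pinned display launch");
        tests.put("testDateParsing", "parse date format locale");
        System.out.println(prioritize("display pinned console when a new launch produces output", tests));
    }
}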
REPiR Approach
- Used the same IR approach as in bug localization
  - Off-the-shelf, state-of-the-art Indri toolkit
  - Same techniques: tokenization, stemming, removing stop words
- Evaluation
  - 7 open source Java projects with real regression faults
  - Compared against dynamic and static strategies
- Document collection
  - Structured: each document has 4 kinds of terms: class, method, and identifier names, and comments
- Queries
  - Low-level differences (LDiff): using diff against versions in git
  - High-level differences (HDiff): 9 types of changes

REPiR Architecture
(architecture diagram)

Empirical Evaluation
- Is REPiR more effective than random or untreated test orders?
- Do high-level program differences help improve the performance of REPiR?
- Is structured retrieval more effective for RTP?
- How does REPiR perform compared to existing RTP techniques?
- How does REPiR perform when it ignores language-level information?

Evaluation
- Subject systems range in size from Time and Money (2.7k SLOC and 3k LOC of test code) to Joda-Time (32.9k SLOC and 55.9k LOC of test code)
- Dynamic and static analysis strategies
  - UT: untreated test prioritization, keeps the original order of the test cases
  - RT: random test prioritization, rearranges the test cases randomly
  - Dynamic coverage-based test prioritization
    - CMA: method coverage with the additional strategy
    - CMT: method coverage with the total strategy
    - CSA: statement coverage with the additional strategy
    - CST: statement coverage with the total strategy
  - JUPTA: static program analysis based on TA (test ability); analogously JMA, JMT, JSA, JST

REPiR (LDiff) at the Test Method and Test Class Level
(figure)

REPiR at the Test Method and Test Class Level
- REPiR LDiff performs significantly better than UT or RT
- REPiR is fairly independent of the length of the program differences
- At the test method level, HDiff works a bit better than LDiff
- At the test class level, LDiff works a bit better than HDiff
- Structured retrieval works basically the same as unstructured retrieval

REPiR vs. JUPTA / Coverage-Based RTP
(figures: test method level and test class level)

REPiR vs. JUPTA / Coverage-Based RTP
- REPiR with HDiff at the method level and LDiff at the class level outperforms JUPTA and coverage-based RTP
- At the test method level
  - APFD for REPiR HDiff: 0.76
  - APFD for JUPTA (JMA): 0.72
  - APFD for CSA: 0.73
- At the test class level
  - APFD for REPiR LDiff: 0.81
  - APFD for JUPTA (JMA): 0.76
  - APFD for CSA: 0.70
- (a small APFD computation sketch appears at the end of these notes)
- Significance of results
  - REPiR LDiff and HDiff working better than any total strategy was statistically significant
  - REPiR's better performance against the additional strategies was not statistically significant

Summary and Future
- IR-based analysis of software artifacts is an exciting area of research
- Initial results are very encouraging
  - Language independent, scalable, efficient, and competitive
- Future
  - Explore other areas of software artifact analysis
  - Explore hybrid approaches, e.g., using IR-based analysis in conjunction with dynamic and/or static techniques
  - Explore, with IR colleagues (Matt Lease, UT; Dave Binkley and Dawn Lawrie, Loyola), new IR search and IR evaluation techniques
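To close, here is a small illustration of the APFD (Average Percentage of Faults Detected) values reported in the REPiR results above. It follows the standard APFD formula, 1 minus the normalized sum of each fault's first-detection position plus 1/(2n); the class name and example data are hypothetical and not taken from the REPiR tooling.

import java.util.*;

// Standard APFD computation: given a test order and, for each fault, the set of
// tests that detect it, APFD = 1 - (sum of first-detection positions)/(n*m) + 1/(2n).
public class Apfd {

    static double apfd(List<String> testOrder, List<Set<String>> faultDetectors) {
        int n = testOrder.size();       // number of tests
        int m = faultDetectors.size();  // number of faults
        int sumFirst = 0;
        for (Set<String> detectors : faultDetectors) {
            for (int pos = 0; pos < n; pos++) {
                if (detectors.contains(testOrder.get(pos))) { sumFirst += pos + 1; break; }
            }
        }
        return 1.0 - (double) sumFirst / ((double) n * m) + 1.0 / (2.0 * n);
    }

    public static void main(String[] args) {
        List<String> order = Arrays.asList("t3", "t1", "t2", "t4");
        List<Set<String>> faults = Arrays.asList(
                new HashSet<>(Arrays.asList("t3")),        // fault 1 first detected at position 1
                new HashSet<>(Arrays.asList("t2", "t4"))); // fault 2 first detected at position 3
        System.out.println(apfd(order, faults));           // 1 - 4/8 + 1/8 = 0.625
    }
}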