Automatic Software Testing via Mining Software Data

Wujie Zheng
Supervisor: Prof. Michael R. Lyu
Department of Computer Science & Engineering
The Chinese University of Hong Kong
August 17, 2011

Outline
• Introduction
• Part 1: Unit-Test Generation via Mining Relevant APIs
• Part 2: Test Selection via Mining Operational Models
• Part 3: Mining Test Oracles of Web Search Engines
• Conclusions

Introduction
• Software bugs annoy users or even cause great losses!
– "Google harms your computer": Google's 2009 mishap of flagging every search result as harmful
– The destruction of NASA's Mariner 1
• Software failures cost the US economy about $60 billion every year [NIST Report 2002]

Software Testing
• The primary way to remove bugs
• Three steps
– Generate test inputs
– Run the test inputs
– Inspect the test results (check actual outputs or properties against test oracles)

Software Testing
• A system test: test inputs are fed to the whole system, and the outputs are checked against test oracles

Software Testing
• A unit test: a method call sequence (the test input) plus an assertion (the test oracle)

    public void testSearch() {
        // test input
        Stack var0 = new Stack();
        String var1 = "hi!";
        var0.push((Object)var1);
        int var2 = var0.search(var1);
        // test oracle
        assertTrue(var2 == 1);
    }

Software Testing
• Manual software testing
– Difficult to create a good set of test inputs, as software systems become large and complex
– Tedious to inspect a large set of test results

Automatic Software Testing
• Test input generation
– Random testing, combinatorial testing, model-based testing, grammar-based testing
• Test result inspection
– Model-based testing
• A specification drives both test input generation and test result inspection

Automatic Software Testing
• Specification: a complete description of the behavior of the software to be developed
– Constraints on test inputs
• socket -> bind -> listen -> accept
• For a method f(int x, int y): x > 0, y > 0
– Constraints on program states
• From state s and action x, the next state should be t
• There should be no memory errors, e.g., double free
– Constraints on test outputs
• For a method sort(x), the output is sorted
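To make the last kind of constraint concrete, here is a minimal sketch (the class and helper names are illustrative, not from the thesis) of how an output constraint can serve as an executable, partial test oracle in Java:

    import java.util.Arrays;

    public class SortOracleSketch {
        // Executable form of the output constraint "for sort(x), the
        // output is sorted": a partial oracle that needs no expected output.
        static boolean isSorted(int[] a) {
            for (int i = 1; i < a.length; i++) {
                if (a[i - 1] > a[i]) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            int[] x = {3, 1, 2};
            Arrays.sort(x); // stands in for the sort implementation under test
            System.out.println(isSorted(x) ? "pass" : "output constraint violated");
        }
    }

Note that such a constraint is partial: it would accept a buggy sort that returns a sorted but wrong multiset of elements, yet it still catches many failures without a full specification.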
Challenges
• The specification is often unavailable or incomplete

My Thesis
• Mining specifications from software data to guide test input generation and test result inspection
– Software data -> mined specifications -> test input generation and test result inspection

My Thesis
• Part 1: unit-test generation via mining relevant APIs
– A unit test is a method call sequence
– Source code -> relevant APIs -> test input generation
• Contribution
– Reduce the search space of possible method call sequences by exploiting the relevance of methods

My Thesis
• Part 2: test selection via mining operational models
– Control rules and data rules, e.g., Br1 => Br2
– Execution traces -> operational models -> test result inspection
• Contribution
– Propose two kinds of operational models that can detect failing tests effectively and can be mined efficiently

My Thesis
• Part 3: mining test oracles of Web search engines
– Program outputs -> output rules and classification models (e.g., P1 => P2) -> test result inspection
• Contribution
– Apply test selection techniques to Web search engines
– Select failing tests by exploiting application-level knowledge

My Thesis
• Overview

    Part    Software Data               Mined/Learned Specifications                               Testing Task
    Part 1  Source code                 Relevant APIs (specifications about program inputs)        Test input generation
    Part 2  Execution traces            Operational models (specifications about program states)   Test result inspection (test selection)
    Part 3  Program inputs and outputs  Output rules (specifications about program outputs)        Test result inspection (test selection)

Part 1: Unit-Test Generation via Mining Relevant APIs

Problem
• Given a set of methods under test (MUTs), generate inputs (method call sequences) that explore different behaviors of each method

Existing Approaches
• Random
– Select parameters of methods randomly
– Notation: A.f(B) means f is a method of class A that takes an argument of class B

    Stack.push(Object):
        Stack var0 = new Stack();
        String var1 = "hi!";
        var0.push((Object)var1);

    Stack.search(Object):
        Stack var0 = new Stack();
        String var1 = "hi!";
        var0.push((Object)var1);
        int var2 = var0.search(var1);

Existing Approaches
• Feedback-directed generation
– Discard sequences whose execution throws exceptions
• Adaptive random generation
– Select sequences that are most different from previously selected ones
• Neither approach considers how the specific method under test is implemented

The Idea
• A method cannot affect the execution of the method under test (MUT) if it does not mutate an input's fields that the MUT accesses
• Candidate sequences for the Stack parameter of Stack.search(Object):

    Stack var0 = new Stack();

    Stack var0 = new Stack();
    String var1 = "hi!";
    var0.push((Object)var1);

    Stack var0 = new Stack();
    int var1 = var0.size();    // X: not useful

– The call to size() has no effect because size() does not change any field that search() accesses

Example
• openDatabase() calls setupDatabase(), which calls getAllowCreate(), which accesses allowCreate
• setAllowCreate() accesses allowCreate
• To test openDatabase(), among the sequences that produce DatabaseConfig objects, we prefer those that call setAllowCreate()

Our Approach
• Mining relevant APIs
– Use the Eclipse JDT compiler to analyze the object fields accessed by each method
– Each method is represented as an itemset of the object fields it accesses:

    openDatabase(): Environment.envImpl, DatabaseConfig.allowCreate, ...
    setAllowCreate(): DatabaseConfig.allowCreate

– Find relevant APIs as those that access the same object fields
• openDatabase() is relevant to setAllowCreate()
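A minimal sketch of this relevance check, assuming the fields accessed by each method have already been extracted (e.g., with the Eclipse JDT compiler) into sets of qualified field names; the class name and the Stack field name below are illustrative:

    import java.util.HashSet;
    import java.util.Set;

    public class RelevantApiSketch {
        // Two methods are relevant if their accessed-field itemsets overlap.
        static boolean relevant(Set<String> fieldsOfMut, Set<String> fieldsOfCandidate) {
            Set<String> common = new HashSet<>(fieldsOfMut);
            common.retainAll(fieldsOfCandidate);
            return !common.isEmpty();
        }

        public static void main(String[] args) {
            Set<String> openDatabase = Set.of("Environment.envImpl", "DatabaseConfig.allowCreate");
            Set<String> setAllowCreate = Set.of("DatabaseConfig.allowCreate");
            Set<String> size = Set.of("Stack.elementCount"); // hypothetical field name
            System.out.println(relevant(openDatabase, setAllowCreate)); // true
            System.out.println(relevant(openDatabase, size));           // false
        }
    }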
Our Approach
• RecGen: recommendation-based test generation
– For each parameter of the MUT, recommend a method call sequence from the existing sequences
– Assign higher weights to short sequences that contain more relevant APIs (see the sketch below)
• E.g., to build a test for A.f(B), pick from the pools of method call sequences that produce objects of type A and of type B
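The exact weighting scheme is not spelled out on the slide; the sketch below shows one plausible scoring rule consistent with "short sequences with more relevant APIs", assuming a sequence is summarized by its length and its number of relevant API calls. The formula is an illustration, not RecGen's actual definition:

    import java.util.Comparator;
    import java.util.List;

    public class SequenceRecommenderSketch {
        record Sequence(String code, int length, int relevantCalls) {}

        // Higher score for sequences containing more relevant calls,
        // discounted by sequence length. Illustrative formula only.
        static double score(Sequence s) {
            return s.relevantCalls() / (double) (s.length() + 1);
        }

        static Sequence recommend(List<Sequence> pool) {
            return pool.stream()
                       .max(Comparator.comparingDouble(SequenceRecommenderSketch::score))
                       .orElseThrow();
        }

        public static void main(String[] args) {
            List<Sequence> pool = List.of(
                new Sequence("cfg = new DatabaseConfig()", 1, 0),
                new Sequence("cfg = new DatabaseConfig(); cfg.setAllowCreate(true)", 2, 1));
            // Picks the sequence that calls the relevant API setAllowCreate().
            System.out.println(recommend(pool).code());
        }
    }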
Experiments
• Three subjects
– Berkeley DB Java Edition (BDB)
– Java Data Structure Library (JDSL)
– Scientific computing library (JScience)
• Compared with three representative tools
– JCrasher
– Randoop
– ARTGen
• Metric
– Code coverage

Experiments
• [Coverage charts]
• Generation with feedback performs better
• Generation with sequence recommendation performs better

Summary of Part 1
• Problem
– Unit-test input generation (method call sequences)
• Our approach
– Mine relevant APIs that access common fields
– For each parameter, select short method call sequences that contain more relevant APIs
• Contribution
– Reduce the search space of possible method call sequences by exploiting the relevance of methods

Part 2: Test Selection via Mining Operational Models

Problem
• Given a large set of test results, find the failing tests among them
– Without executable test oracles, manual test result inspection can be labor-intensive

Solution
• Test selection for result inspection
– Select a small subset of tests that are likely to reveal faults
– "Hey! Check only these tests!"

Existing Approaches
• Code coverage based selection
• Clustering based selection
• Operational model based selection

Code Coverage Based Selection
• Select a new test if it increases some coverage criterion, otherwise discard it
– Method, line, branch coverage

           Test1  Test2  Test3  Test4
    Br1      1      1      0      1
    Br2      0      0      1      0
    Br3      1      1      0      1
    Br4      1      1      0      0
    …        …      …      …      …

• Selected: Test1, Test3

Clustering Based Selection
• Cluster execution profiles hierarchically and perform one-per-cluster sampling
– Failing tests are often grouped into small clusters

Operational Model Based Selection
• Mine invariants from passing tests (Daikon, DIDUCE)
• Select tests that violate the existing invariants (Jov, Eclat, DIDUCE)

Our Approach
• Mine common operational models from unverified tests
– The models are often, but not always, true in the observed traces

Our Approach
• Why is it difficult?
– The previous templates of operational models generate too many candidates
– Examining all the candidates at runtime may incur high runtime overhead
• For passing tests, we can discard any violation
• For unverified tests, we cannot!

Our Approach
• Effective mining of operational models
– Collect simple traces at runtime
• Branch coverage
• Data value bounds
– Generate and evaluate potential operational models after running all the tests
• Control rules: implication relationships between branches
• Data rules: implicit data value distributions

Common Operational Models
• Control rules: implication relationships between branches

           Test1  Test2  Test3  Test4
    Br1      1      1      0      1
    Br2      0      0      1      0
    Br3      1      1      0      1
    Br4      1      1      0      0
    …        …      …      …      …

• Mined rules: Br1 => !Br2, Br1 => Br3

Common Operational Models
• Data rules: implicit data value distributions

           min(Var1)  max(Var1)  min(Var2)  max(Var2)  …
    Test1      0         10          0         11      …
    Test2      0         32         -1          1      …
    Test3      0          1          1          3      …
    Test4      0         23          2          6      …

• Consider the distribution of max(Var1): values that are too large or too small are suspicious

Test Selection
• Select tests for result inspection
– Sort the mined rules in descending order of confidence
– Select tests that violate the rules, from top to bottom (see the sketch below)
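A minimal sketch of how a control rule could be mined from the branch coverage matrix above and used to flag violating tests; the rule template and the matrix follow the slides, while the code itself is an illustration:

    public class ControlRuleSketch {
        // cov[b][t] == true if branch b was taken in test t.
        // For a candidate rule Br_i => Br_j, confidence is
        // |tests where both hold| / |tests where Br_i holds|.
        static double confidence(boolean[][] cov, int i, int j) {
            int premise = 0, both = 0;
            for (int t = 0; t < cov[i].length; t++) {
                if (cov[i][t]) {
                    premise++;
                    if (cov[j][t]) both++;
                }
            }
            return premise == 0 ? 0 : both / (double) premise;
        }

        // A test violates Br_i => Br_j if it covers Br_i but not Br_j.
        static boolean violates(boolean[][] cov, int i, int j, int t) {
            return cov[i][t] && !cov[j][t];
        }

        public static void main(String[] args) {
            // The matrix from the slide: rows Br1..Br4, columns Test1..Test4.
            boolean[][] cov = {
                {true, true, false, true},   // Br1
                {false, false, true, false}, // Br2
                {true, true, false, true},   // Br3
                {true, true, false, false}   // Br4
            };
            System.out.println(confidence(cov, 0, 2));  // Br1 => Br3: 1.0
            System.out.println(violates(cov, 0, 3, 3)); // Test4 violates Br1 => Br4
        }
    }

Rules with a negated consequent such as Br1 => !Br2 can be handled the same way by complementing the consequent's coverage row.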
Experiments
• Subject programs
– Siemens suite: 130 faulty versions of 7 programs
– grep: 20 faulty versions

Experiments
• Effectiveness metrics
– The number of selected tests
– The percentage of faults revealed

Experiments
• [Result charts] Our approach is more effective

Control Rules vs. Data Rules
• Control rules reveal more faults

Random Test Suites
• Our approach also works well on automatically generated test suites

Summary of Part 2
• Problem
– Test selection for result inspection
• Our approach
– Mine common operational models (control rules and data rules) from the execution traces of unverified tests
• Contribution
– Propose two kinds of operational models that can detect failing tests effectively and can be mined efficiently

Part 3: Mining Test Oracles of Web Search Engines

Background
• Find defects of Web search engines with respect to retrieval effectiveness
– Web search engines have a major impact on people's everyday lives
– Retrieval effectiveness is one of the major concerns of search engine users
• How well a search engine satisfies users' information needs
• Relevance, authority, and freshness

Background
• An example
– A notice on the PuTTY website addressed a change in Google's search results
– The notice suggests that Google's search results for "putty" at some point may not have been satisfactory and may have confused users

Problem
• Given a large set of search results, find the failing tests among them
– Test oracles: relevance judgments

Problem
• It is labor-intensive to collect relevance judgments of search results
– For a large number of queries
• Previous relevance judgments may not be reusable
– The desired search results may change over time

Existing Approaches
• The pooling process
– Different information retrieval systems submit their top K results per query
– Assessors judge the results for relevance manually
• The idea
– Inspect part of the search results for all queries
• Limitations
– Too costly; the judgments are hardly reusable

Existing Approaches
• Click-through data as implicit feedback
– Clicked results are considered relevant
• The idea
– Let users inspect all search results of all queries
• Limitations
– Position bias, summary bias
• E.g., cannot find relevant pages that are absent from the search results

Our Approach
• Test selection
– Inspect part of the search results for some queries by mining the search results of all queries
– Exploit application-level knowledge
• Execution traces may not help
– Utilize the existing labels of testers
• The process needs to be repeated

Our Approach
• Mining and learning output rules
– Queries and search results -> feature vectors (itemsets)
– Itemsets -> association rules -> detecting violations
– Feature vectors + failing/passing labels -> classification models -> predicting failing tests

Mining Output Rules
• Query items
– Query words, query types, query length, etc.
• Search result items
– Domain, domain's Alexa rank, etc.
• Query-result matching items
– Whether the domain name contains the query, whether the title contains the query, etc.
• Search engine items
– Search engine names

Example Itemsets
• SE:bing, Q:boston colleges, QW:boston, QW:colleges, TwoWords, CommonQ, top10:searchboston.com, top1:searchboston.com, top10:en.wikipedia.org, …, SOMEGE100K, SOMELE1K
• SE:bing, Q:day after day, QW:day, QW:after, ManyWords, CommonQ, top10:en.wikipedia.org, top1:en.wikipedia.org, top10:dayafterday.org, …, SOMEGE100K, SOMELE1K

Mining Association Rules
• Mine frequent itemsets with a length constraint
– An itemset is frequent if its support is larger than min_support
– E.g., {SE:bing, top10:en.wikipedia.org}
• Generate rules with only one item on the right-hand side (see the sketch below)
– For each item xi in a frequent itemset Y, generate the rule Y - xi => xi
– E.g., SE:bing => top10:en.wikipedia.org
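A sketch of this rule-generation step, assuming the frequent itemsets have already been mined (e.g., by an Apriori-style pass respecting the length constraint maxL); the class and record names are illustrative:

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class OutputRuleSketch {
        record Rule(Set<String> lhs, String rhs) {}

        // For each item x in a frequent itemset Y, emit the rule Y - {x} => x.
        static List<Rule> rulesFrom(Set<String> frequentItemset) {
            List<Rule> rules = new ArrayList<>();
            for (String x : frequentItemset) {
                Set<String> lhs = new LinkedHashSet<>(frequentItemset);
                lhs.remove(x);
                if (!lhs.isEmpty()) rules.add(new Rule(lhs, x));
            }
            return rules;
        }

        public static void main(String[] args) {
            Set<String> y = Set.of("SE:bing", "top10:en.wikipedia.org");
            for (Rule r : rulesFrom(y)) {
                System.out.println(r.lhs() + " => " + r.rhs());
                // e.g., [SE:bing] => top10:en.wikipedia.org
            }
        }
    }

In the full pipeline, each generated rule would then be kept only if its confidence, support(Y) / support(Y - xi), exceeds minconf.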
Learning Classification Models
• Feature vectors
– Can describe more general types of properties than itemsets

                            wordLength  queryType  max(domainRank)  google.com  facebook.com  …
    Search Result List 1        2        common          900             1            0       …
    Search Result List 2        3        hot          100000             0            1       …
    Search Result List 3        1        hot            9782             1            1       …

Learning Classification Models
• Learn classification models of the failing tests from the training data
• Given new search results, use the learned model to predict whether they fail

Experiments
• Search engines
– Google
– Bing
– These two search engines, together with the many other search engines powered by them (e.g., Yahoo! Search is now powered by Bing, and AOL Search by Google), hold more than 90 percent of the U.S. search market

Experiments
• Queries
– Common queries: 800 queries from KDD Cup 2005
– Hot queries: 3,432 unique hot queries from Google Trends and Yahoo! Buzz, from November 25, 2010 to April 21, 2011

Experiments
• Search results
– Used the Web services of Google and Bing to collect the top 10 search results of each query from December 25, 2010 to April 21, 2011
– 390,797 ranked lists of search results (each list contains the top 10 results for one query)

The Mined Rules
• Mining from one search engine's results on one day
– Google's search results on December 25, 2010
– minsup = 20, minconf = 0.95, maxL = 3

The Mined Rules
• Mining from multiple search engines' results on one day
– Google's and Bing's search results on December 25, 2010
– minsup = 20, minconf = 0.95, maxL = 3
– Rules 9-12 show the search engines' differing opinions of certain websites

The Mined Rules
• Mining from one search engine's results over multiple days
– Google's search results from December 25, 2010 to March 31, 2011
– minsup = 20, minconf = 0.95, maxL = 2
– Rules 13-18 concern the top 1 results for the queries

Example Violations
• The search results of Bing on April 1, 2011 violate one of the mined rules (shown on the slide)
• Bing's actual result, http://www.jcu.edu/index.php, points to the homepage of John Carroll University, from which it is not easy to get the answer to the query

Learning Classification Models
• Experiments with the following classes
– Unexpected top 1 change
• The other search engines oppose the change (they returned the same top 1 result and did not change it)
– Normal top 1 change
• The other search engines do not oppose the change
• Task
– Given a top 1 change of the search engine under test, predict whether it is an unexpected change (the labeling rule is sketched below)
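A minimal sketch of the labeling rule stated above, assuming we can observe the other engines' top 1 results before and after the change; the class name and the example domain are illustrative:

    import java.util.List;

    public class Top1ChangeLabelerSketch {
        // A top-1 change of the engine under test is "unexpected" if the
        // other engines oppose it: they all return the same top-1 result
        // for the query and did not change it themselves.
        static boolean isUnexpected(List<String> othersTop1Before, List<String> othersTop1After) {
            String first = othersTop1After.get(0);
            for (int i = 0; i < othersTop1After.size(); i++) {
                boolean agrees = othersTop1After.get(i).equals(first);
                boolean unchanged = othersTop1After.get(i).equals(othersTop1Before.get(i));
                if (!agrees || !unchanged) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            // One other engine kept the same top 1 result, so the change
            // made by the engine under test is labeled unexpected.
            System.out.println(isUnexpected(List.of("example.org"), List.of("example.org"))); // true
        }
    }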
Learning Classification Models
• Data
– Training data: December 26, 2010 to March 31, 2011
– Testing data: April 1, 2011 to April 22, 2011
• Results of predicting unexpected top 1 changes
– The decision tree is more accurate, but naive Bayes is faster

Summary of Part 3
• Problem
– Search engine testing
• Our approach
– Mine and learn output rules to find suspicious search results automatically
• Contribution
– Apply test selection techniques to Web search engines
– Select failing tests by exploiting application-level knowledge

Conclusions
• Mining specifications from software data to guide test input generation and test result inspection
– Part 1: unit-test generation via mining relevant APIs
• Reduce the search space of possible method call sequences by exploiting the relevance of methods
– Part 2: test selection via mining operational models
• Propose two kinds of operational models that can detect failing tests effectively and can be mined efficiently
– Part 3: mining test oracles of Web search engines
• Apply test selection techniques to Web search engines
• Select failing tests by exploiting application-level knowledge

Publications
• Wujie Zheng, Hao Ma, Michael R. Lyu, Tao Xie, and Irwin King, "Mining Test Oracles of Web Search Engines", to appear in the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), short paper, Lawrence, KS, USA, November 6-10, 2011.
• Wujie Zheng, Qirun Zhang, and Michael R. Lyu, "Cross-library API Recommendation using Web Search Engines", to appear in ESEC/FSE 2011, New Ideas Track, Szeged, Hungary, September 5-9, 2011.
• Qirun Zhang, Wujie Zheng, and Michael R. Lyu, "Flow-Augmented Call Graph: A New Foundation for Taming API Complexity", in Fundamental Approaches to Software Engineering (FASE 2011), Saarbrücken, Germany, March 26 - April 3, 2011.
• Wujie Zheng, Qirun Zhang, Michael R. Lyu, and Tao Xie, "Random Unit-Test Generation with MUT-aware Sequence Recommendation", in Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering (ASE 2010), short paper, Antwerp, Belgium, September 2010.
• Wujie Zheng, Michael R. Lyu, and Tao Xie, "Test Selection for Result Inspection via Mining Predicate Rules", in Proceedings of ICSE Companion 2009, pp. 219-222, Vancouver, Canada, May 2009.
• Xiaoqi Li, Wujie Zheng, and Michael R. Lyu, "A Coalitional Game Model for Heat Diffusion Based Incentive Routing and Forwarding Scheme", in Proceedings of IFIP Networking 2009, pp. 664-675, Aachen, Germany, May 2009.

Q&A
Thanks!

Backup Slides

Introduction
• Software permeates our life

Our Approach
• Why can mining unverified tests help?
– Models that are often true can be good approximations of the real models
– [Figure: the real operational model, an operational model learned from limited passing tests, and the operational rules learned from all the unverified tests]

Common Operational Models
• Control rules: an example

     1  bool Non_Crossing_Biased_Climb()
     2  {
     3      ...
     4      upward_preferred = Inhibit_Biased_Climb() >= Down_Separation; /* >= should be > */
     5      if (upward_preferred)
     6      ...
     7  }
     8  bool Non_Crossing_Biased_Descend()
     9  {
    10      ...
    11      upward_preferred = Inhibit_Biased_Climb() > Down_Separation;
    12      if (upward_preferred)
    13      ...
    14  }

– x: upward_preferred at line 12; y: upward_preferred at line 5
– In 1608 tests, y is true in 514 tests, among which x is true 478 times
– We can mine a control rule y => x, whose violation is an error

Common Operational Models
• Data rules: an example

     1  while (!token_found)
     2  {
     3      if (token_ind < 80)
     4      {
     5          token_str[token_ind++] = ch;
     6          next_st = next_state(cu_state, ch);
     7      }
     8      ...
     9      switch (next_st)
    10      {
    11      ...
    12      case 30:
    13          skip(tstream_ptr->ch_stream);
    14          next_st = 0; /* missing code: token_ind = 0; */
    15          break;
    16      }
    17  }

– Var: token_ind at line 5
– In 4130 tests, max(Var) has mean = 11 and std = 13
– We can mine a data rule max(Var) <= c1 = 28 by solving P(max(Var) <= c1) = 0.9
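For the data rule above, the threshold c1 follows from the observed statistics. A minimal sketch, assuming max(Var) is modeled as approximately normal (an assumption, but one that reproduces the slide's c1 = 28):

    public class DataRuleThresholdSketch {
        public static void main(String[] args) {
            double mean = 11.0, std = 13.0;
            // 90th percentile of the standard normal distribution, so that
            // P(max(Var) <= c1) = 0.9 under the normal model.
            double z90 = 1.2816;
            double c1 = mean + z90 * std;
            System.out.println(Math.round(c1)); // 11 + 1.2816 * 13 ≈ 27.7, i.e., 28
        }
    }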