An Analysis of Machine Learning Algorithms for Condensing Reverse Engineered Class Diagrams

Presenter: Hafeez Osman
Hafeez Osman (hosman@liacs.nl)
Michel R.V. Chaudron (chaudron@chalmers.se)
Peter v.d. Putten (putten@liacs.nl)

Leiden University. The university to discover.

OVERVIEW
1. Introduction
2. Research Questions
3. Approach
4. Results
5. Discussion
6. Future Work
7. Conclusion

Introduction
Why?  Reverse engineered class diagrams are typically too detailed a representation.
What? Simplifying UML class diagrams based on software design metrics, using machine learning.
Who?  Software engineers, software maintainers, software designers.

Introduction
Aim: analyze the performance of classification algorithms that decide which classes should be included in a class diagram.
This paper focuses on using design metrics as predictors (the input variables used by the classification algorithm).

Introduction
Explore structural properties of classes
• Software design metrics from the following categories:
  • Size: NumAttr, NumOps, NumPubOps, Getters, Setters
  • Coupling: Dep_Out, Dep_In, EC_Attr, IC_Attr, EC_Par, IC_Par
Machine learning classification algorithms
• Supervised classification algorithms:
  • J48 Decision Tree, Decision Tables, Decision Stumps, Random Forests, Random Trees
  • k-Nearest Neighbor, Radial Basis Function Networks
  • Logistic Regression, Naive Bayes

Research Questions
RQ1: Which individual predictors are influential for the classification?
For each case study, the predictive power of each individual predictor is explored.
RQ2: How robust is the classification to the inclusion of particular sets of predictors?
Explore how the performance of the classification algorithm is influenced by partitioning the predictor variables into different sets.
RQ3: What are suitable algorithms for classifying classes?
The candidate classification algorithms are evaluated w.r.t. how well they perform in classifying the key classes in a class diagram.

Approach
Evaluation method
RQ1 (predictors): univariate analysis using the Information Gain attribute evaluator, to measure the predictive power of each predictor.
RQ2, RQ3 (machine learning classification algorithms): Area Under the ROC Curve (AUC). The AUC shows the ability of a classification algorithm to correctly rank classes as included in the class diagram or not.

Approach
Case study characteristics

Project         Classes in source code (a)   Classes in design (b)   (b)/(a) %
ArgoUML         903                          44                       4.87
Mars            840                          29                       3.45
JavaClient      214                          57                      26.64
JGAP            171                          18                      10.52
Neuroph 2.3     161                          24                      14.90
JPMC            127                          11                       8.66
Wro4J            87                          11                      12.64
xUML-Compiler    84                          37                      44.05
Maze             59                          28                      47.45

Approach
Grouping predictors into sets
The eleven predictors (NumAttr, NumOps, NumPubOps, Setters, Getters, Dep_Out, Dep_In, EC_Attr, IC_Attr, EC_Par, IC_Par) are partitioned into three predictor sets: A, B, and C. (The original slide shows the per-set membership as a matrix of check marks, which did not survive extraction.)

Approach
1. Reverse engineer the source code to a UML design.
   i. Eliminate library classes.
2. Calculate design metrics.
   i. Eliminate unused metrics.
3. Merge the "in design" information with the software design metrics data.
4. Prepare the sets of predictors.
5. Run all sets of predictors through the machine learning tool.

Result
RQ1: Predictor evaluation
Number of projects (out of 9) for which a predictor is influential:

Predictor   Projects
EC_Par      6
NumOps      5
Dep_In      5
NumPubOps   4
Dep_out     4
NumAttr     3
Setters     3
Getters     3
EC_Attr     3
IC_Attr     3
IC_Par      2
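The univariate analysis used for RQ1 (the Information Gain attribute evaluator, as in Weka) can be illustrated from first principles. Below is a minimal sketch with made-up per-class data, assuming the predictor has been discretized into low/high values; it shows the idea only, not Weka's exact implementation (which also handles the discretization itself):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Info gain of splitting `labels` by the discrete predictor `values`."""
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

# Hypothetical discretized metric values ("low"/"high" NumOps) and the
# ground-truth label: 1 = the class appears in the forward design.
num_ops = ["high", "high", "low", "low", "high", "low", "low", "low"]
in_design = [1, 1, 0, 0, 1, 0, 1, 0]
print(round(information_gain(num_ops, in_design), 3))  # → 0.549
```

A predictor with higher information gain reduces more uncertainty about whether a class belongs in the condensed diagram, which is the sense in which EC_Par, NumOps, and Dep_In turn out to be influential in the study.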
Result
RQ2: Dataset evaluation
Number of projects (out of 9) for which a classification algorithm scores AUC > 0.60:

Classifier       Predictor Set A   Predictor Set B   Predictor Set C
Decision Table   3                 3                 3
J48              5                 5                 5
Decision Stump   6                 7                 6
RBF Network      7                 7                 7
Naïve Bayes      8                 7                 7
Random Tree      8                 7                 7
Logistic         7                 7                 8
k-NN(1)          8                 9                 8
k-NN(5)          7                 8                 9
Random Forest    9                 9                 9

Result
RQ3: Evaluation of classification algorithms
Average AUC score over the 9 projects:

Classifier       Predictor Set A   Predictor Set B   Predictor Set C
Decision Table   0.60              0.59              0.58
J48              0.63              0.61              0.61
Random Tree      0.66              0.66              0.66
RBF Network      0.66              0.67              0.66
Decision Stump   0.66              0.64              0.65
Logistic         0.69              0.70              0.68
Naïve Bayes      0.70              0.70              0.69
k-NN(1)          0.70              0.71              0.72
k-NN(5)          0.70              0.73              0.72
Random Forest    0.75              0.75              0.74

Discussion
A. Predictors
Three class diagram metrics should be considered influential predictors:
• Export Coupling (Parameter) (EC_Par)
• Dependency In (Dep_In)
• Number of Operations (NumOps)
A higher value of these metrics for a class indicates that the class is a candidate for inclusion in the class diagram.

B. Classification algorithms
k-NN(5) and Random Forest are suitable classification algorithms in this study:
• their AUC score is at least 0.64, and
• they are robust across all projects and predictor sets.

Discussion
C. Threats to validity
i.   Assumption about the ground truth: that exactly the classes that should be in the forward design are in the forward design. However, some of these classes may not be key classes of the system, and the forward design used may be too 'old'.
ii.  The input depends on the reverse engineering tool (MagicDraw).
iii. Only 9 open-source projects are covered.

Future Work
1.
Alternative predictor variables
• other types of design metrics, e.g. the (semantics of) names of classes, methods, and attributes
• source code metrics, such as Lines of Code (LOC) and Lines of Comments
• the change history of a class
2. Learning models (classification algorithms)
• testing an ensemble approach (combining classification algorithms)
3. A semi-supervised or interactive approach
4. Comparing these results with other approaches
• other works apply different algorithms, such as HITS web mining, network analysis on dependency graphs, and PageRank
5. Validating the understandability of abstracted class diagrams

Conclusion
1. The most influential predictors:
• Export Coupling (Parameter) (EC_Par)
• Dependency In (Dep_In)
• Number of Operations (NumOps)
2. The most suitable classification algorithms:
• Random Forest
• k-Nearest Neighbor
3. Classification algorithms are able to produce a predictor that can be used to rank classes by relative importance.
4. Based on this class-ranking information, a tool can be developed that provides views of reverse engineered class diagrams at different levels of abstraction.
5. Developers may generate class diagram abstractions at multiple levels.

Questions?
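As an illustration of conclusion point 3 and of the AUC measure used throughout the evaluation, here is a library-free sketch: given per-class inclusion probabilities from whatever classifier (e.g. Random Forest), score them with AUC against the ground truth and rank the classes for a condensed diagram. All class names, scores, and labels below are invented for illustration:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier output: P(class is kept in the forward design).
classes = ["ModelFacade", "DiagramRenderer", "StringUtils", "ConfigLoader"]
scores = [0.85, 0.78, 0.12, 0.34]
in_design = [1, 1, 0, 0]  # ground truth from the forward design

print(auc(in_design, scores))  # 1.0: every kept class outranks every dropped one

# Rank classes by predicted importance; a tool could cut this list at any
# depth to render the diagram at different levels of abstraction.
ranking = [c for _, c in sorted(zip(scores, classes), reverse=True)]
print(ranking)
```

An AUC of 0.5 would mean the scores rank classes no better than chance; the study's best classifiers reach average AUC around 0.75.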