An Analysis of Machine Learning Algorithms for Condensing Reverse Engineered Class Diagrams

Presenter: Hafeez Osman

Hafeez Osman (hosman@liacs.nl)
Michel R.V. Chaudron (chaudron@chalmers.se)
Peter van der Putten (putten@liacs.nl)
OVERVIEW
1. Introduction
2. Research Question
3. Approach
4. Results
5. Discussion
6. Future Work
7. Conclusion
Introduction
Why?
Reverse engineered class diagrams are typically too detailed a representation.
What?
Simplifying UML class diagrams based on software design metrics, using machine learning.
Who?
Software engineers, software maintainers, software designers.
Introduction
Aim: analyze the performance of classification algorithms that decide which classes should be included in a class diagram.
This paper focuses on using design metrics as predictors (input variables used by the classification algorithm).
Introduction
Explore Structural Properties of Classes
• Software design metrics from the following categories:
  • Size: NumAttr, NumOps, NumPubOps, Getters, Setters
  • Coupling: Dep_Out, Dep_In, EC_Attr, IC_Attr, EC_Par, IC_Par
Machine Learning Classification Algorithms
• Supervised classification algorithms (sketched below):
  • J48 Decision Tree, Decision Tables, Decision Stumps, Random Forests and Random Trees
  • k-Nearest Neighbor, Radial Basis Function Networks
  • Logistic Regression, Naive Bayes
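As a minimal sketch (assuming Python with scikit-learn, which is not necessarily the toolchain behind these slides), the eleven metrics can be modelled as a per-class feature vector, with a few scikit-learn estimators standing in for the algorithms listed above; Decision Table and RBF Network have no direct scikit-learn counterpart.

```python
# Sketch only: a feature vector for one class plus approximate stand-ins
# for the classifiers named on the slide (not the authors' tooling).
from dataclasses import dataclass, astuple

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


@dataclass
class ClassMetrics:
    """Size and coupling metrics for one class in the reverse-engineered model."""
    # Size
    NumAttr: int      # number of attributes
    NumOps: int       # number of operations
    NumPubOps: int    # number of public operations
    Getters: int
    Setters: int
    # Coupling
    Dep_Out: int      # outgoing dependencies
    Dep_In: int       # incoming dependencies
    EC_Attr: int      # export coupling via attributes
    IC_Attr: int      # import coupling via attributes
    EC_Par: int       # export coupling via parameters
    IC_Par: int       # import coupling via parameters

    def as_row(self):
        return list(astuple(self))


# Approximate scikit-learn counterparts of the listed algorithms.
CANDIDATE_CLASSIFIERS = {
    "J48 (C4.5-style tree)": DecisionTreeClassifier(),
    "Decision Stump": DecisionTreeClassifier(max_depth=1),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "k-NN(1)": KNeighborsClassifier(n_neighbors=1),
    "k-NN(5)": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}
```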
Research Questions
RQ1: Which individual predictors are influential for
the classification?
For each case study, the predictive power of individual predictors is explored.
RQ2: How robust is the classification to the inclusion of
particular sets of predictors?
Explore how the performance of the classification algorithms is influenced by partitioning the predictor variables into different sets.
RQ3: What are suitable algorithms for classifying
classes?
The candidate classification algorithms are evaluated w.r.t. how well
they perform in classifying the key classes in a class diagram.
Approach
Evaluation Method
RQ1: Predictors
Univariate analysis with the Information Gain attribute evaluator, to measure the predictive power of individual predictors.
RQ2, RQ3: Machine Learning Classification Algorithms
Area Under the ROC Curve (AUC): the AUC shows the ability of a classification algorithm to correctly rank classes as included in the class diagram or not.
(A sketch of both measures follows below.)
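A minimal sketch of the two measures, assuming Python with NumPy, SciPy and scikit-learn rather than the tooling used in the study: the information gain of a single (discretized) predictor against the "in design" label, and the AUC of a classifier's class-inclusion scores.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import roc_auc_score


def information_gain(predictor, in_design, bins=10):
    """H(label) - H(label | discretized predictor)."""
    y = np.asarray(in_design, dtype=int)
    x = np.digitize(predictor, np.histogram_bin_edges(predictor, bins=bins))

    def label_entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts[counts > 0], base=2)

    h_y = label_entropy(y)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = x == value
        h_y_given_x += mask.mean() * label_entropy(y[mask])
    return h_y - h_y_given_x


def auc_score(in_design, scores):
    """Probability that a random in-design class is ranked above a random omitted class."""
    return roc_auc_score(in_design, scores)
```

For example, `information_gain(data["NumOps"], data["in_design"])` would score a single metric, where `data` is the (hypothetical) per-class table used in the later pipeline sketch.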
Approach
Case Study Characteristics
Project          Source code classes (a)   Design classes (b)   (b)/(a) %
ArgoUML                  903                      44                4.87
Mars                     840                      29                3.45
JavaClient               214                      57               26.64
JGAP                     171                      18               10.52
Neuroph 2.3              161                      24               14.9
JPMC                     127                      11                8.66
Wro4J                     87                      11               12.64
xUML-Compiler             84                      37               44.05
Maze                      59                      28               47.45
Approach
Grouping Predictors in Sets
The eleven predictors (NumAttr, NumOps, NumPubOps, Setters, Getters, Dep_Out, Dep_In, EC_Attr, IC_Attr, EC_Par, IC_Par) are grouped into three predictor sets: Predictor Set A, Predictor Set B and Predictor Set C.
Approach
1. Reverse engineer the source code into a UML design.
   i. Eliminate library classes
2. Calculate design metrics.
   i. Eliminate unused metrics
3. Merge the "In Design" information with the software design metrics data.
4. Prepare the sets of predictors.
5. Run all sets of predictors through the machine learning tool (a sketch of steps 2-5 follows below).
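A minimal sketch of steps 2-5 for one project, assuming Python with pandas and scikit-learn; the file names, the column names (class_name, in_design), the single predictor set and the choice of Random Forest are illustrative assumptions, not the setup used in the study.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative predictor grouping: the study partitions the eleven metrics into
# sets A, B and C; "all metrics" is used here as a stand-in for one such set.
PREDICTOR_SETS = {
    "A": ["NumAttr", "NumOps", "NumPubOps", "Setters", "Getters",
          "Dep_Out", "Dep_In", "EC_Attr", "IC_Attr", "EC_Par", "IC_Par"],
}


def evaluate_project(metrics_csv, design_csv):
    metrics = pd.read_csv(metrics_csv)     # step 2: design metrics per class
    design = pd.read_csv(design_csv)       # classes present in the forward design
    # Step 3: merge the "In Design" label with the metrics data.
    data = metrics.merge(design[["class_name", "in_design"]],
                         on="class_name", how="left")
    data["in_design"] = data["in_design"].fillna(0).astype(int)

    results = {}
    for set_name, predictors in PREDICTOR_SETS.items():   # step 4: predictor sets
        X, y = data[predictors], data["in_design"]
        # Step 5: run the classifier and score it with cross-validated AUC.
        auc = cross_val_score(RandomForestClassifier(), X, y,
                              cv=10, scoring="roc_auc").mean()
        results[set_name] = auc
    return results
```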
Result
RQ1: Predictor Evaluation
Influential predictors, by the number of projects in which the predictor is influential:

Predictor     No. of projects
EC_Par               6
NumOps               5
Dep_In               5
NumPubOps            4
Dep_Out              4
NumAttr              3
Setters              3
Getters              3
EC_Attr              3
IC_Attr              3
IC_Par               2
** Out of 9 projects
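A small sketch of how such a tally could be produced from per-project information-gain scores, if "influential" is taken to mean an information gain above some cutoff; the 0.1 cutoff here is an illustrative assumption, not the threshold used in the study.

```python
from collections import Counter


def count_influential(info_gain_per_project, cutoff=0.1):
    """info_gain_per_project: one {predictor: information gain} dict per project."""
    counts = Counter()
    for project_scores in info_gain_per_project:
        for predictor, gain in project_scores.items():
            if gain >= cutoff:
                counts[predictor] += 1
    # e.g. Counter({'EC_Par': 6, 'NumOps': 5, ...}) over the 9 projects
    return counts
```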
Result
RQ2: Dataset Evaluation
No. of projects for which a classification algorithm scores AUC > 0.60:

Algorithm             Predictor Set A   Predictor Set B   Predictor Set C
Decision Table               3                 3                 3
J48                          5                 5                 5
Decision Stump               6                 7                 6
RBF Network                  7                 7                 7
Naïve Bayes                  8                 7                 7
Random Tree                  8                 7                 7
Logistic Regression          7                 7                 8
k-NN(1)                      8                 7                 9
k-NN(5)                      8                 8                 9
Random Forest                9                 9                 9
** Out of 9 projects
Result
RQ3: Evaluation of Classification Algorithms
Average AUC score:

Algorithm             Predictor Set A   Predictor Set B   Predictor Set C
Decision Table              0.60              0.59              0.58
J48                         0.63              0.61              0.61
Random Tree                 0.66              0.66              0.66
RBF Network                 0.66              0.67              0.66
Decision Stump              0.66              0.64              0.65
Logistic Regression         0.69              0.70              0.68
Naïve Bayes                 0.70              0.70              0.69
k-NN(1)                     0.70              0.70              0.71
k-NN(5)                     0.73              0.72              0.72
Random Forest               0.75              0.75              0.74
** Out of 9 projects
Discussion
A. Predictors
Three class diagram metrics should be considered influential predictors:
• Export Coupling Parameter (EC_Par)
• Dependency In (Dep_In)
• Number of Operations (NumOps)
** This means that a higher value of these metrics for a class indicates that the class is a candidate for inclusion in the class diagram.
B. Classification Algorithms
k-NN(5) and Random Forest are suitable classification algorithms in this study:
• their AUC score is at least 0.64
• the classifiers are robust across all projects and predictor sets
Discussion
C. Threats to Validity
i. Assumption of ground truth: exactly the classes that should be in the forward design are in the forward design. There is a possibility that:
   • some of these classes were not key classes of the system
   • the forward design used is too 'old'
ii. The input is dependent on the reverse engineering tool (MagicDraw).
iii. Only 9 open-source projects are covered.
Future Work
1. Alternative predictor variables
• other types of design metrics, e.g. the (semantics of) names of classes, methods and attributes
• source code metrics such as Lines of Code (LOC) and Lines of Comments
• the change history of a class
2. Learning models (classification algorithms)
• testing an ensemble approach (combining classification algorithms)
3. Semi-supervised or interactive approaches
4. Comparing this study's results with other approaches
• other works that apply different algorithms, such as HITS web mining, network analysis on dependency graphs, and PageRank
5. Validating the understandability of abstracted class diagrams
Conclusion
1. The most influential predictors:
• Export Coupling Parameter (EC_Par)
• Dependency In (Dep_In)
• Number of Operations (NumOps)
2. The most suitable classification algorithms:
• Random Forest
• k-Nearest Neighbor (k-NN(5))
3. Classification algorithms are able to produce a predictor that can be used to rank classes by relative importance.
4. Based on this class-ranking information, a tool can be developed that provides views of reverse engineered class diagrams at different levels of abstraction (a sketch follows below).
5. Developers may generate multiple levels of class diagram abstraction.
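A minimal sketch of conclusion items 3-5, assuming a scikit-learn-style classifier and a pandas data frame with a class_name column (the names and the abstraction-level fractions are illustrative, not from the study).

```python
def rank_classes(model, data, predictors):
    """Rank classes from most to least likely to belong in the condensed diagram."""
    scores = model.predict_proba(data[predictors])[:, 1]
    ranked = data.assign(score=scores).sort_values("score", ascending=False)
    return list(ranked["class_name"])


def abstraction_levels(ranked_classes, fractions=(0.05, 0.15, 0.30)):
    """Progressively larger slices of the ranking give progressively detailed diagrams."""
    return {f: ranked_classes[:max(1, int(f * len(ranked_classes)))]
            for f in fractions}
```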
Questions?