On the application of GP for software engineering predictive modeling: A systematic review
Expert Systems with Applications, Vol. 38, no. 9, 2011
Wasif Afzal, Richard Torkar
Blekinge Institute of Technology, Karlskrona, Sweden.
{waf,rto}@bth.se

Agenda
• Research question
• Symbolic regression
• Prediction and estimation in sw engineering
• GP for prediction and estimation in sw engineering
• Application of GP for sw quality classification
• Application of GP for sw cost/effort/size estimation
• Application of GP for sw fault prediction and sw reliability growth modeling
• Future work
• Conclusions
• Recommendations

Our research question
• Is there evidence that symbolic regression using GP is an effective method for prediction and estimation, in comparison with regression, machine learning and other models (including expert opinion and different improvements over the standard GP algorithm)?

It is about symbolic regression!
• Symbolic regression
  – One of the many application areas of GP
  – Finds a function whose outputs match the desired outcomes.
  – Makes no assumptions about:
    • Structure of the function
    • Data distribution
    • Relationship between independent and dependent variables
• Helps in identifying the significant variables in subsequent modeling attempts

Prediction and estimation in sw engineering
• Software quality
  – Software quality classification
  – Software fault prediction
  – Software reliability growth modeling
• Software size
• Software development cost/effort
• Maintenance task effort
• Software release timing

GP for prediction and estimation in sw engineering
• 23 identified primary studies
  – Software quality classification (8)
  – Software cost/effort/size estimation (7)
  – Software fault prediction and software reliability growth modeling (8)

Application of GP for sw quality classification (8 studies)
• Variations of the dependent variable:
  – Fault proneness
  – Quality ranking of program modules (high risk to low risk)
• Variations in sampling of training and testing sets:
  – Simple hold-out and 10-fold CV.
• Variations in fitness function
  – Single objective
    • Minimization of root mean square
    • Minimization of average cost of misclassification
  – Multi-objective
    • Minimization of average cost of misclassification + minimization of tree size
    • Maximization of the best percentage of the actual faults averaged over the percentiles level of interest + controlling the tree size.
    • Balancing the over sampling and under sampling in each class for a decision tree.
• Variations in comparison groups:
  – Neural networks
  – k-nearest neighbour
  – Regression (linear, logistic)
  – Humans
• Results:
  – Majority of the studies (6 out of 8) reported results in favor of using GP for the classification task.
• Limitations:
  – Increase the comparisons with a more representative set of techniques.
  – Increase the use of publicly available data sets for easier replication.
• Encouraging aspects:
  – The datasets used represent real-world projects.
  – Problem dependent objectives represented in fitness functions perform better than standard GP.

Application of GP for sw cost/effort/size (CES) estimation (7 studies)
• Variations of the dependent variable
  – Software effort
  – Software cost
  – Software size
• Variations in fitness function
  – Single objective
    • Minimization of mean squared error or MMRE
• Variations in comparison groups
  – ANN, nearest neighbour and different forms of regression.
• Variations in sampling of training and testing sets
  – Simple hold-out.
• Results
  – No strong evidence of GP performing consistently on all evaluation measures used.
• Limitations
  – Evaluation measures used are not standardized.
  – Different hold-out samplings for train and test sets.
  – Lack of statistical hypothesis testing.
  – Lack of comparison groups.

Application of GP for sw fault prediction and sw reliability growth modeling (8 studies)
• Variations of the dependent variable
  – SW fault prediction
  – SW reliability growth modeling
• Variations in fitness function
  – Single objective:
    • Minimization of standard error
• Variations in comparison groups
  – Standard GP, Naive Bayes, traditional software reliability growth models.
• Variations in sampling of training and testing sets
  – Hold-out and 10-fold CV
• Results:
  – 7 out of 8 studies favor the use of GP.
• Limitations:
  – Poor representation of comparison groups
  – Absence of a baseline to compare to.

Promising future work to undertake
• Multi-objective fitness evaluation (e.g.
  minimization of standard error and maximization of correlation coefficient)
• Simplification of GP solutions to help interpretation of relationships between variables.
• Evaluation of techniques to minimize overfitting of GP solutions.

Conclusions
• A total of 23 studies apply GP for predictive studies in sw engineering:
  – sw quality classification (8)
  – sw cost/effort/size estimation (7)
  – sw fault prediction and sw reliability growth modeling (8)
• There is evidence in support of using GP for:
  – sw quality classification
  – sw fault prediction and sw reliability growth modeling
• but not for:
  – sw cost/effort/size estimation.

Recommendations
• Use public data sets wherever possible.
• Apply commonly used sampling strategies.
• Use techniques to avoid overfitting in GP solutions.
• Report the settings of GP parameters.
• Compare the performances against a commonly used baseline.
• Use statistical experimental designs.
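
Illustration (not from the reviewed studies)
The evaluation measures named in the slides (mean squared error, MMRE), 10-fold CV and comparison against a simple baseline could be sketched roughly as below. This is a minimal sketch under stated assumptions: all function names (mse, mmre, k_fold_indices, cross_validate, fit_mean_baseline) are hypothetical, the baseline simply predicts the mean of the training targets, and no GP learner is implemented. Any model can be plugged in as a `fit` callable that returns a predictor.

```python
import random


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mmre(actual, predicted):
    """Mean magnitude of relative error; assumes no actual value is zero."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean_baseline(train_x, train_y):
    """Hypothetical baseline: always predict the mean training target."""
    m = sum(train_y) / len(train_y)
    return lambda x: m


def cross_validate(xs, ys, fit, measure=mmre, k=10, seed=0):
    """Average `measure` over k held-out folds for the model built by `fit`."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        if not test_idx:  # fewer samples than folds: skip empty folds
            continue
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        predicted = [model(xs[i]) for i in test_idx]
        actual = [ys[i] for i in test_idx]
        scores.append(measure(actual, predicted))
    return sum(scores) / len(scores)
```

Running cross_validate with the same folds (same seed) for both a candidate model and fit_mean_baseline gives paired per-measure scores, which is one way to follow the recommendations on sampling strategy and baseline comparison.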