SE690 Research Proposal

Title: Prediction of Testing Effort in Enterprise-Level Software Development Projects Using a Multiple Regression Model
Presented By: Subhash Mookerji
Presentation Date: June 4, 2004

Literature Review

This section presents brief results of the literature review. Each article title is followed by a short description of its dependent and independent variables and the modeling technique used.

1. Ross Jeffery, Melanie Ruhe, Isabella Wieczorek, 'Using Public Domain Metrics to Estimate Software Development Effort', Proceedings of the Seventh International Software Metrics Symposium (METRICS '01), April 04 - 06, 2001, London, England, page 16.
Dependent variable = project cost
Independent variables = work effort, system size, maximum team size, development platform, language type, business area type, organization type
Analysis techniques = ordinary least squares, stepwise ANOVA, regression trees, analogy-based estimation

2. Taghi M. Khoshgoftaar, Kehan Gao, Robert M. Szabo, 'An Application of Zero-Inflated Poisson Regression for Software Fault Prediction', 12th International Symposium on Software Reliability Engineering (ISSRE '01), November 27 - 30, 2001, Hong Kong, China.
Dependent or response variable = number of faults discovered in a source file during system test
Independent variables:
• Number of times the source file was inspected prior to system test release
• Number of lines of code for the source file prior to the coding phase (this represents auto-generated code)
• Number of lines of code for the source file prior to system test release
• Number of lines of commented code for the source file prior to the coding phase (this represents auto-generated code)
• Number of lines of commented code for the source file prior to system test release
Modeling technique = zero-inflated Poisson regression

3. J.A. Morgan, G.J. Knafl, 'Residual Fault Density Prediction Using Regression Methods', The Seventh International Symposium on Software Reliability Engineering (ISSRE '96), October 30 - November 02, 1996, White Plains, New York.
Dependent or response variable = fault detection effectiveness, that is, the percentage of seeded faults detected by a test set
Independent or explanatory variables:
Product measures
• Program size - lines of code (LOC)
• Blocks per LOC
• Decisions per LOC
• All-uses per LOC
Testing process measures
• Size of test set
• Test coverage - block, in percent
• Test coverage - decision, in percent
• Test coverage - all-uses, in percent
• Discovered coverage per LOC
The linear and quadratic model selection technique was based on "leave-one-out" cross validation using the predicted residual sum of squares (PRESS).

4. S. Muthanna, K. Ponnambalam, K. Kontogiannis, B. Stacey, 'A Maintainability Model for Industrial Software Systems Using Design Level Metrics', Seventh Working Conference on Reverse Engineering (WCRE '00), November 23 - 25, 2000, Brisbane, Australia.
Dependent or response variable = maintainability index of the code as determined by the developers
Independent or explanatory variables:
1. Module-level function point metric
2. Module-level information flow metric
3. Module-level global data flow
4. Average structural complexity metric (fan-out)
5. Average knot count
6. Average cyclomatic complexity metric
The modeling technique used was polynomial regression.
5. L. Angelis, I. Stamelos, M. Morisio, 'Building a Software Cost Estimation Model Based on Categorical Data', Seventh International Software Metrics Symposium, April 04 - 06, 2001, London, England.
Dependent variable = work effort
Independent variables:
• Development type
• Development platform
• Language type
• Used methodology (yes/no)
• Organisation type
• Business area type
• Application type
The modeling technique used was ordinary least squares (OLS).

6. Taghi M. Khoshgoftaar, Edward B. Allen, Jianyu Deng, 'Controlling Overfitting in Software Quality Models: Experiments with Regression Trees and Classification', Seventh International Software Metrics Symposium, April 04 - 06, 2001, London, England, page 190.
Dependent variable = number of faults in a module
Independent variables are many, including the number of distinct procedure calls to others and the number of second and following calls to others.
The modeling technique used was regression tree analysis.

7. E. Mendes, N. Mosley, S. Counsell, 'Using an Engineering Approach to Understanding and Predicting Web Authoring and Design', 34th Annual Hawaii International Conference on System Sciences (HICSS-34), Volume 7, January 03 - 06, 2001, Maui, Hawaii, page 7075.
Dependent or response variable = total effort
Independent or predictor variables:
• Page count
• Media count
• Program count
• Total page allocation
• Total media allocation
• Total code length
• Reused media count
• Reused program count
• Total reused media allocation
• Total reused code length
• Connectivity
• Connectivity density
• Total page complexity
• Cyclomatic complexity
Both linear regression and stepwise multiple regression modeling techniques were used.

8. Lesley Pickard, Barbara Kitchenham, Susan Linkman, 'An Investigation of Analysis Techniques for Software Datasets', Sixth IEEE International Symposium on Software Metrics, November 04 - 06, 1999, Boca Raton, Florida, page 130.
In this paper, the authors compared different software data analysis techniques, namely ordinary least squares, residual analysis, multivariate regression, and classification and regression trees (CART), on a single data set to investigate the efficacy of these techniques.

9. Emilia Mendes, Ian Watson, Chris Triggs, Nile Mosley, Steve Counsell, 'A Comparison of Development Effort Estimation Techniques for Web Hypermedia Applications', Eighth IEEE Symposium on Software Metrics, June 04 - 07, 2002, Ottawa, Canada, page 131.
In this paper, the authors compared linear regression, stepwise regression, case-based reasoning, and regression tree statistical analysis techniques for estimating the effort to develop hypermedia applications.

10. Lionel C. Briand, Jürgen Wüst, 'Modeling Development Effort in Object-Oriented Systems Using Design Properties', IEEE Transactions on Software Engineering, vol. 27, no. 11, November 2001, pp. 963-986.
Dependent variable = development and testing effort
Independent variables are design attributes such as size, coupling, cohesion, and inheritance.
The statistical technique used for modeling was a combination of regression trees and Poisson regression analysis.

11. Yooichi Yokoyama, Mitsuhiko Kodaira, 'Software Cost and Quality Analysis by Statistical Approaches', The 20th International Conference on Software Engineering, April 19 - 25, 1998, Kyoto, Japan, page 465.
Dependent variable = projected man-hours
Independent variable = development size in thousands of lines of code
The modeling technique used was forward stepwise regression.
12. E. Stensrud, I. Myrtveit, 'Human Performance Estimating with Analogy and Regression Models: An Empirical Validation', 5th International Symposium on Software Metrics, March 20 - 21, 1998, Bethesda, Maryland, page 205.
In this paper, the authors validated human performance estimation by comparing estimates obtained from experience, from a freeware estimation tool called ANGEL, and from regression analysis.

Conclusions from Literature Review

All the articles reviewed can be classified into two main groups.

Main Group One - Predict the number of faults in software
• Dependent variable is the predicted number of faults
• Independent variables are very low-level development attributes

Main Group Two - Predict work effort
• Dependent variable is development work effort
• Independent variables are:
  — technology- or business-related: platform, area of business, methodology, nature of organization, language type, development type, etc.
  — code-related attributes: page count, total code length, connectivity density, media count, etc.

Gaps Found in Current Approaches
• Work effort included mostly development effort
• A few papers used development and testing effort combined
• Independent variables are:
  — either major technology- or business-related variables
  — or detailed code-level variables

This study will focus on testing effort only and will use a combination of variables related to human skills and test metrics.

Definition of the Problem

To build a multiple regression analysis model that predicts the testing effort for enterprise-level software development projects from several independent or predictor variables, using a real-life data set.

The Approach: Data Collection

1. Data collection - Collect and analyze real-life data from two different major development projects from a source. Both projects were completed over the two years 2001-2002. The candidate independent variables are: the total number of defects found per module during functional, regression, and system testing; the total number of test cases created per module; the experience of the module's developer in years; the experience of the tester in years; and the number of steps in the use cases for the module. The dependent variable of the analysis will be the total testing effort per module.
2. Data analysis - The data set will be tested for correlation between each pair of independent variables (a minimal illustrative sketch follows this list). Since these are real data points, the data and related information will be masked wherever needed to maintain the client confidentiality agreement.
3. Use of tool - A statistical analysis tool will be used to perform all statistical analysis.
4. Statistical technique - The multiple regression analysis technique will be used to model the problem.
5. Model validation - A number of statistical techniques will be used to validate the final regression model.
6. Publication of results and future work - The results will be presented in the final presentation of the SE690 class. A final paper will be written describing all the steps.
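To make step 2 concrete, here is a minimal sketch of the correlation check. The proposal performs its analysis in a statistical tool (MS Excel, per the research design); Python with pandas is used here purely for illustration, and the file name module_data.csv and the column labels X1..X5 and Y are assumptions matching the variable definitions in the Data Details section:

import pandas as pd

# Hypothetical input file; columns X1..X5 and Y follow the
# variable definitions given in the Data Details section.
df = pd.read_csv("module_data.csv")

predictors = ["X1", "X2", "X3", "X4", "X5"]

# Pairwise Pearson correlations among candidate predictors; large
# absolute values flag potential multicollinearity between them.
print(df[predictors].corr().round(2))

# Correlation of each predictor with total testing effort (Y).
print(df[predictors].corrwith(df["Y"]).round(2))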
Study Rationale

Has the study been done before? As indicated in the conclusions of the literature review, no paper was located that addresses the same problem definition. Some of the papers noted in the literature review section ([10], [11]) presented related work using multiple regression or stepwise regression methods.

If so, how will your study advance the understanding of the topic area? Why is this study of interest to others? Predicting testing effort is a very practical problem in industry for practicing software project managers and software QA managers. The rule-of-thumb method frequently used is to compare the size of the last testing project with the current one and produce a forecast. This unscientific method more often than not creates resource bottlenecks in large software development projects. A robust prediction model can be built into an estimation tool to support decision making and budgeting of the testing effort in large software development projects.

Study Problem or Purpose. What are the goals of the project? To build a multiple regression model that predicts testing effort for software development projects.

Research Objectives

A multiple regression analysis model will be built to predict the testing effort for software development projects based on several independent or predictor variables. The variables currently under consideration are the total number of defects found per module, the total number of test cases created per module, the experience of the module's developer in years, the experience of the tester in years, and the number of use case steps per module.

Research Design

Data collection – Real-life data from software development projects will be collected and normalized.
Data analysis – The data will be analyzed using statistical analyses such as correlation analysis, histogram analysis, and scatter diagram analysis.
Use of tool – All statistical analysis will be performed in MS Excel.
Statistical technique – The multiple regression analysis technique will be used to model the problem.
Model validation – The residual analysis statistical technique will be used to validate the final regression model.
Publication of results and future work – The results will be presented in the final presentation of the SE690 class. A final paper will be written describing all the steps, results, and conclusions.

Work Plan

Phase 1: ...
Milestone 1: Present project plan (short presentation) - Target date: 6/4/2004
Milestone 2: Complete data collection and analysis - Target date: 9/1/2004
...
Phase 2: ...
Milestone 1: Complete model building - Target date: 10/1/2004
Milestone 2: Complete model validation - Target date: 11/1/2004
Milestone 3: Final presentation and complete final paper - Target date: 12/15/2004

Data Details

Independent or Predictor Variables

1. Total number of defects found in different testing phases per module (X1): Source – defect tracking tool
   Testing phases are:
   a. Unit testing
   b. Functional testing
   c. Functional integration testing
   d. Client acceptance testing
2. Total number of test cases per module (X2): Source – test management tool
   a. Basic flow
      i. Number of positive test cases
      ii. Number of negative test cases
   b. Each alternate flow
      i. Number of positive test cases
      ii. Number of negative test cases
3. Experience in number of years of the developer of each module (X3): Source – survey
4. Experience in number of years of the tester of each module (X4): Source – survey
5. Number of use case steps per module (X5): Source – requirements management tool

Dependent or Response Variable

Total testing effort per module (Y): Source – defect tracking tool and project plan

Components of Testing Effort
i. Functionality testing time for the module (including test design, test scripting, test data preparation, and test execution): Source – project plan
ii. Testing time for each resolved defect for each module: Source – defect tracking system logs, with further analysis to eliminate the idle time from the data.

For example, let T1 be the time stamp from the log when the status of the defect was changed from Assigned to Resolved, and let T2 be the time stamp from the log when the status was changed from Resolved to Closed. The total duration is then T2 - T1. However, we need to analyze whether the tester had other tasks during the same period. If so, we count how many other tasks the same tester had at the same time and use that number to assign a percentage of the tester's time to this specific defect. After that, we subtract the non-working time during that period so that only working time is counted. A minimal computational sketch of this adjustment follows.
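The sketch below illustrates the adjustment under stated assumptions: the timestamps, the count of concurrent tasks, and the simple Monday-to-Friday, 9:00-17:00 working calendar are all hypothetical, and the real defect-tracking log format and working-hours rules will differ.

from datetime import datetime, timedelta

def working_hours(start, end, day_start=9, day_end=17):
    # Count weekday working hours between two timestamps
    # (simplified calendar: Mon-Fri, 9:00-17:00, no holidays).
    total = 0.0
    t = start
    while t < end:
        if t.weekday() < 5:  # Monday..Friday
            day_open = t.replace(hour=day_start, minute=0, second=0)
            day_close = t.replace(hour=day_end, minute=0, second=0)
            lo = max(t, day_open)
            hi = min(end, day_close)
            if hi > lo:
                total += (hi - lo).total_seconds() / 3600.0
        # jump to midnight of the next calendar day
        t = (t + timedelta(days=1)).replace(hour=0, minute=0, second=0)
    return total

# T1: status changed Assigned -> Resolved; T2: Resolved -> Closed.
t1 = datetime(2002, 3, 4, 10, 0)   # hypothetical log timestamps
t2 = datetime(2002, 3, 6, 15, 0)

concurrent_tasks = 3                  # tester's tasks in the same period
allocation = 1.0 / concurrent_tasks   # prorated share for this defect

effort_hours = working_hours(t1, t2) * allocation
print(round(effort_hours, 1))  # effective testing hours for the defect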
Steps of Model Development and Comparisons

1. Insert collected data into Excel
2. Macro sort and list
3. Correlation analysis
4. Run the model using the Excel statistical tool
5. Analyze the results
6. Transformation: multiply one independent variable by another independent variable
7. Plot all independent variables against the dependent variable
8. Delete insignificant variables and add significant variables
9. Plot % residuals vs. observed values of the dependent variable and analyze
10. Square transformation of the independent variables
11. Repeat steps 4 to 9
12. Plot residuals vs. predicted values of the dependent variable
13. Plot residuals vs. each independent variable
14. Cubic transformation
15. Repeat steps 4 to 9
16. Pick the best model by analyzing all alternatives
17. If satisfied, this is the final formula
18. If not satisfied, go to step 19
19. Find outliers and remove them if necessary
20. Run the model in the computer
21. Analyze the results
22. Plot % residuals vs. independent variables
23. Plot all predictor variables at a time against the response variable
24. Run stepwise regression, maximize B
25. Run stepwise regression, minimize A
26. Plot R-value vs. number of independent variables used and look for an elbow
27. Choose candidate model(s)
28. Transformation: divide one independent variable by another correlated independent variable
29. Repeat steps 20 to 25
30. Transformation: add correlated independent variables
31. Repeat steps 20 to 25
32. Transformation: square root of the dependent variable
33. Repeat steps 20 to 25
34. Plot normal probability plots of residuals
35. Based on all plots, check for an appropriate relationship function
36. Calculate AAPD for all alternatives and compare

(Minimal sketches of the model fit with AAPD comparison, steps 4 and 36, and of stepwise selection, steps 24 to 26, appear at the end of this proposal.)

Expected Result from the Study

A robust testing effort prediction model, which contains:
• the response variable
• the predictor variables
  o or their transformations
  o or their arithmetic relationships
• the intercept
• the coefficients

Future Work
• This is an a posteriori observation; convert it into a controlled a priori experiment
• Use a larger set of data
• Use more testing phases (e.g., performance testing)
• Add maintenance effort
• Use other predictor variables
• Develop an estimation tool based on the model
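The following is a minimal sketch of steps 4 and 36 of the model development procedure: an ordinary least squares fit of Y on X1..X5, followed by the AAPD comparison metric. The proposal performs these steps in MS Excel; Python is used here only for illustration, the data are synthetic stand-ins for the confidential project data, and AAPD is read here as average absolute percentage deviation, since the proposal does not spell out the acronym.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the confidential module-level data set;
# the five columns play the roles of X1..X5 from Data Details.
n = 40
X = rng.uniform(5, 50, size=(n, 5))
y = 2.0 + X @ np.array([0.5, 0.3, 0.2, 0.1, 0.4]) + rng.normal(0, 2, n)

def fit_ols(X, y):
    # Ordinary least squares; returns [intercept, b1, ..., b5].
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def aapd(y, y_hat):
    # Average absolute percentage deviation, in percent.
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

beta = fit_ols(X, y)
y_hat = np.column_stack([np.ones(len(y)), X]) @ beta
print("intercept:", round(beta[0], 2))
print("coefficients:", np.round(beta[1:], 2))
print("AAPD: %.1f%%" % aapd(y, y_hat))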
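Steps 24 to 26 call for stepwise regression, but the criteria "maximize B" and "minimize A" are not defined in this document. As an illustrative stand-in, the sketch below substitutes a plain forward-selection procedure scored by R-squared, which produces the R-vs.-number-of-variables curve whose elbow step 26 looks for. It reuses the synthetic X and y from the previous sketch.

import numpy as np

def r_squared(cols, y):
    # R^2 of an OLS fit on the given list of predictor columns.
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(X, y):
    # Greedy forward selection: at each step, add the predictor that
    # most improves R^2, and record R^2 after each addition.
    remaining = list(range(X.shape[1]))
    chosen, trace = [], []
    while remaining:
        scores = []
        for j in remaining:
            cols = [X[:, k] for k in chosen + [j]]
            scores.append((r_squared(cols, y), j))
        best_r2, best_j = max(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
        trace.append((best_j, best_r2))
    return trace

# Example with the synthetic X, y from the previous sketch:
rng = np.random.default_rng(0)
X = rng.uniform(5, 50, size=(40, 5))
y = 2.0 + X @ np.array([0.5, 0.3, 0.2, 0.1, 0.4]) + rng.normal(0, 2, 40)
for j, r2 in forward_stepwise(X, y):
    print("added X%d -> R^2 = %.3f" % (j + 1, r2))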