SE690 Research Proposal

Title: Prediction of Testing Effort in Enterprise-Level Software Development Projects Using a Multiple Regression Model
Presented By: Subhash Mookerji
Presentation Date: June 4, 2004
Literature Review
This section presents brief results of the literature review. Each article title is followed by a short description of the independent and dependent variables and the modeling technique used.
1. Ross Jeffery, Melanie Ruhe, Isabella Wieczorek, 'Using Public Domain Metrics to Estimate
Software Development Effort', Proceedings of the Seventh International Software Metrics
Symposium (METRICS '01), April 04 - 06, 2001, London, England, p. 16
Dependent variable = project cost
Independent variables = work effort, system size, maximum team size, development platform,
language type, business area type, organization type
Analysis techniques = ordinary least squares, stepwise ANOVA, regression trees, analogy-based
estimation
2. Taghi M. Khoshgoftaar, Kehan Gao, Robert M. Szabo, 'An Application of Zero-Inflated Poisson
Regression for Software Fault Prediction', 12th International Symposium on Software Reliability
Engineering (ISSRE '01), November 27 - 30, 2001, Hong Kong, China
Dependent or response variable = number of faults discovered in a source file during system test
Independent variables
 Number of times the source file was inspected prior to system test release.
 Number of lines of code for the source file prior to the coding phase. This represents auto-generated code.
 Number of lines of code for the source file prior to system test release.
 Number of lines of commented code for the source file prior to the coding phase. This
represents auto-generated code.
 Number of lines of commented code for the source file prior to system test release.
Modeling technique = Zero-Inflated Poisson Regression
3. J.A. Morgan, G.J. Knafl, 'Residual fault density prediction using regression methods,' The
Seventh International Symposium on Software Reliability Engineering (ISSRE '96), October 30 - November 02, 1996, White Plains, New York
Dependent or response variable = fault detection effectiveness, that is, the percentage of seeded
faults detected by a test set.
Independent or explanatory variables
Product Measure
 Program size - lines of code (LOC)



 Blocks per LOC
 Decisions per LOC
 All-uses per LOC
Testing Process Measure
 Size of test set
 Test coverage - block, in percent
 Test coverage - decision, in percent
 Test coverage - all-uses, in percent
 Discovered coverage per LOC
The linear and quadratic model selection technique was based on "leave-one-out"
cross-validation using the predicted residual sum of squares (PRESS).
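The PRESS criterion used in [3] can be sketched in Python. This is an illustrative sketch only: the simple one-predictor OLS fit and the toy data are my assumptions, not taken from the paper.

```python
def fit_simple_ols(xs, ys):
    """Closed-form ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def press(xs, ys):
    """Leave-one-out PRESS: refit without observation i, then score
    the squared error of predicting the held-out y[i]."""
    total = 0.0
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        a, b = fit_simple_ols(xs_i, ys_i)
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total

# Toy data, roughly y = 2x; the model with the smallest PRESS wins.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
print(round(press(xs, ys), 4))
```

For competing linear, quadratic, etc. candidate models, the one with the lowest PRESS generalizes best to unseen observations.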
4. S. Muthanna, K. Ponnambalam, K. Kontogiannis, B. Stacey, 'A Maintainability Model for
Industrial Software Systems Using Design Level Metrics', Seventh Working Conference on
Reverse Engineering (WCRE '00), November 23 - 25, 2000, Brisbane, Australia
Dependent or Response variable = Maintainability Index of the code as determined by the
developers
Independent or explanatory variables
1. Module level Function Point Metric
2. Module level Information Flow Metric
3. Module level Global Data Flow
4. Average Structural Complexity Metric (Fan-out)
5. Average Knot Count
6. Average Cyclomatic Complexity Metric
The modeling technique used was polynomial regression.
5. L. Angelis, I. Stamelos, M. Morisio, 'Building a Software Cost Estimation Model Based on
Categorical Data', Seventh International Software Metrics Symposium, April 04 - 06, 2001,
London, England
Dependent variable = The work effort
Independent variables
Development type
Development platform
Language type
Used Methodology (yes/no)
Organisation type
Business area type
Application type
The modeling technique used is ordinary least squares (OLS).
7. Taghi M. Khoshgoftaar, Edward B. Allen, Jianyu Deng, 'Controlling Overfitting in Software
Quality Models: Experiments with Regression Trees and Classification', Seventh
International Software Metrics Symposium, April 04 - 06, 2001, London, England, p. 190
Dependent variable = number of faults in a module
Independent variables are many, including the number of distinct procedure calls to others and the
number of second and following calls to others.
Modeling technique used was Regression Tree analysis.
8. E. Mendes, N. Mosley, S. Counsell, 'Using an Engineering Approach to Understanding and
Predicting Web Authoring and Design', 34th Annual Hawaii International Conference on
System Sciences (HICSS-34), Volume 7, January 03 - 06, 2001, Maui, Hawaii, p. 7075.
Dependent or response variable = Total effort
Independent or predictor variables
 Page Count
 Media Count
 Program Count
 Total Page Allocation
 Total Media Allocation
 Total Code Length
 Reused Media Count
 Reused Program Count
 Total Reused Media Allocation
 Total Reused Code Length
 Connectivity
 Connectivity Density
 Total Page Complexity
 Cyclomatic Complexity
Both Linear Regression and Stepwise Multiple Regression modeling techniques were used.
9. Lesley Pickard, Barbara Kitchenham, Susan Linkman, 'An Investigation of Analysis Techniques for
Software Datasets', Sixth IEEE International Symposium on Software Metrics, November 04 - 06, 1999,
Boca Raton, Florida, p. 130
In this paper, the authors compared different software data analysis techniques, namely ordinary least
squares, residual analysis, multivariate regression, and Classification and Regression Trees (CART), on a
single data set to investigate the efficacy of these techniques.
10. Emilia Mendes, Ian Watson, Chris Triggs, Nile Mosley, Steve Counsell, 'A Comparison of
Development Effort Estimation Techniques for Web Hypermedia Applications', Eighth IEEE Symposium
on Software Metrics, June 04 - 07, 2002, Ottawa, Canada, p. 131.
In this paper, the authors compared linear regression, stepwise regression, case-based reasoning, and
regression tree statistical analysis techniques to estimate the effort to develop hypermedia applications.
11. Lionel C. Briand, Jürgen Wüst, 'Modeling Development Effort in Object-Oriented Systems Using
Design Properties', IEEE Transactions on Software Engineering, vol. 27, no. 11, November 2001, pp. 963-986.
Dependent variable = development and testing effort
Independent variables are design attributes such as size, coupling, cohesion, and inheritance
The statistical technique used for modeling was a combination of regression trees and Poisson regression
analysis.
12. Yooichi Yokoyama, Mitsuhiko Kodaira, 'Software Cost and Quality Analysis by Statistical
Approaches', The 20th International Conference on Software Engineering, April 19 - 25, 1998, Kyoto,
Japan, p. 465.
Dependent variable = projected man-hours
The independent variable is development size in units of 1000 lines of code.
The modeling technique used was forward stepwise regression.
13. E. Stensrud, I. Myrtveit, 'Human Performance Estimating with Analogy and Regression Models: An
Empirical Validation', 5th International Symposium on Software Metrics, March 20 - 21, 1998,
Bethesda, Maryland, p. 205.
In this paper, the authors compared human performance estimation using results obtained from
experience, from a freeware estimation tool called ANGEL, and from regression analysis.
Conclusions from literature review
All the articles reviewed can be classified into two main groups:
Main Group One - Predict number of faults in software
• Dependent variable is predicted number of faults
• Independent variables are very low level development attributes.
Main Group Two - Predict work effort
• Dependent variable is development work effort
• Independent Variables
• Technology- or business-related
— platform, area of business, methodology, nature of organization, language type,
development type, etc.
• Code-related attributes
— page count, total code length, connectivity density, media count, etc.
Gaps Found in Current Approaches
• Work effort mostly included development effort
• A few papers used development and testing effort combined
• Independent variables are:
— either high-level technology- or business-related variables
— or detailed code-level variables
This study will focus on testing effort only and will use a combination of variables
related to human skills and test metrics.
Definition of the problem
To build a multiple regression analysis model that predicts the testing effort
for enterprise-level software development projects based on several independent or
predictor variables, using a real-life data set.
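In regression notation, the model to be built has the standard multiple linear regression form. This is only a sketch of the general form; the final number of predictors p and their transformations will be determined by the analysis:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon
```

Here Y is the total testing effort per module, the X_i are the predictor variables, the \beta_i are coefficients estimated by least squares, and \varepsilon is the error term.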
The Approach: Data Collection
1. Data collection - Collect and analyze real-life data from two different major development projects from
a single source. Both projects were completed over the two-year period 2001-2002.
The possible independent variables I want to consider are total number of defects found per module
during functional, regression and system testing, total number of test cases created per module,
experience of the developer of the module in years, experience of the tester in years and the number
of the steps in the use cases for the module.
The dependent variable of the analysis will be the total testing effort per module.
2. Data analysis - The data set will be tested for correlation between each pair of independent variables.
Since these are real data points, the data and related information will be masked wherever needed to
maintain the client confidentiality agreement.
3. Use of tool - A statistical analysis tool will be used to perform all statistical analysis.
4. Statistical technique - The multiple regression analysis technique will be used to model the problem.
5. Model validation - A number of statistical techniques will be used to validate the final regression
model.
6. Publication of results and future work - The results will be presented in the final presentation of the
SE690 class. A final paper will be written describing all the steps.
Study Rationale
Has the study been done before?
As indicated in the conclusion of the literature review, no paper was located that studies the same
problem definition. Some of the papers noted in the literature review section [11] [12] presented related
work using multiple regression or stepwise regression methods.
If so, how will your study advance the understanding of the topic area? Why is this study of interest to
others?
Predicting testing effort is a very practical problem in industry for practicing software project managers
and software QA managers. The rule-of-thumb method frequently used is to compare the size of the last
testing project with the current one and produce a forecast. This unscientific method more often than not
creates resource bottlenecks in large software development projects. A robust prediction model can be
built into an estimation tool to help with decision making and budgeting of testing effort in large
software development projects.
Study Problem or Purpose.
What are the goals of the project?
To build a Testing Effort predictor Multiple Regression model for software development projects.
Research Objectives
A multiple regression analysis model will be built to predict the testing effort for software
development projects based on several independent or predictor variables. The variables currently under
consideration are the total number of test cases created per module, the experience of the developer of the
module in years, the experience of the tester in years, and the size of the module.
Research design
Data Collection - Real life data from software development projects will be collected and normalized.
Data analysis – The data will be analyzed using statistical techniques such as correlation analysis,
histogram analysis, and scatter diagram analysis.
Use of tool – All statistical analysis will be performed in MS Excel.
Statistical technique - The Multiple Regression Analysis technique will be used to model the problem.
Model validation - Residual Analysis statistical technique will be used to validate the final regression
model.
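The residual analysis planned here can be sketched in Python. This is a minimal illustration of the idea, assuming a simple one-predictor fit; the toy module data are invented for the example, and the actual work will use Excel:

```python
def ols_fit(xs, ys):
    """Closed-form simple OLS: returns (intercept, slope) for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def percent_residuals(xs, ys):
    """For each observation, the residual (actual - predicted) and the
    residual as a percentage of the observed value."""
    a, b = ols_fit(xs, ys)
    out = []
    for x, y in zip(xs, ys):
        resid = y - (a + b * x)
        out.append((resid, 100.0 * resid / y))
    return out

# Toy data: x = test cases per module, y = testing hours per module.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [12.0, 19.0, 33.0, 38.0]
for resid, pct in percent_residuals(xs, ys):
    print(f"residual={resid:+.2f}  percent={pct:+.1f}%")
```

If the model is adequate, the residuals should scatter randomly around zero with no visible pattern against the predicted values or the predictors.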
Publication of results and future work - The results will be presented in the final presentation of SE690
class. A final paper will be written describing all the steps and results and conclusion.
Work plan
Phase 1: ...
Milestone 1: Present Project Plan (Short Presentation) - Target Date: 6/4/2004
Milestone 2: Complete Data Collection and Analysis - Target Date: 9/1/2004
...
Phase 2: ...
Milestone 1: Complete Model Building - Target Date: 10/1/2004
Milestone 2: Complete Model Validation - Target Date: 11/1/2004
Milestone 3: Final Presentation and complete final paper - Target Date: 12/15/2004
Data Details
Independent or Predictor Variables
1. Total number of defects found in different testing phases per module (X1): Source – defect
tracking tool
Testing phases are
a. Unit Testing
b. Functional Testing
c. Functional Integration Testing
d. Client Acceptance Testing
2. Total number of test cases per module (X2): Source – Test Management tool
a. Basic flow
i. Number of positive test cases
ii. Number of negative test cases
b. Each alternate flow
i. Number of positive test cases
ii. Number of negative test cases
3. Experience in number of years for the developer of each module (X3): Source – Survey
4. Experience in number of years for the tester of each module (X4): Source – Survey
5. Number of use case steps per module (X5): Source – Requirements Management tool
Dependent or Response Variable
Total testing effort per module (Y): Source – Defect Tracking Tool and Project Plan
Components of Testing Effort
i. Functionality testing time for the module (including test design, test scripting,
test data preparation and test execution): Source – Project plan
ii. Testing time for each resolved defect for each module: Source – Defect tracking
system logs, with further analysis to eliminate idle time from the data. For
example, let T1 be the timestamp from the log when the status of the defect
was changed from Assigned to Resolved, and T2 the timestamp from the log
when the status was changed from Resolved to Closed. The total duration is
then T2 – T1. However, we need to analyze whether the tester had other tasks
during the same period; if so, count how many other tasks the same tester had
at the same time and use that number to assign a percentage of the tester's time
to this specific defect. After that, calculate the non-working time during that
period so that only working time is counted. I will explain this in detail using
an example.
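The per-defect time calculation described above can be sketched as follows. The function name, the sample timestamps, and the way idle time is supplied are illustrative assumptions; the real values would come from the defect tracking system logs:

```python
from datetime import datetime

def effective_defect_hours(t1, t2, concurrent_tasks, idle_hours):
    """Effective tester time spent on one defect.

    t1: log timestamp when the defect status changed Assigned -> Resolved
    t2: log timestamp when the defect status changed Resolved -> Closed
    concurrent_tasks: total tasks (including this defect) the tester worked
        on in parallel during [t1, t2]
    idle_hours: non-working hours (nights, weekends) inside [t1, t2]
    """
    elapsed = (t2 - t1).total_seconds() / 3600.0
    working = elapsed - idle_hours        # drop non-working time first
    return working / concurrent_tasks     # pro-rate across parallel tasks

# 32 elapsed hours, 16 of them non-working, split across 2 parallel tasks.
t1 = datetime(2002, 3, 4, 9, 0)
t2 = datetime(2002, 3, 5, 17, 0)
print(effective_defect_hours(t1, t2, concurrent_tasks=2, idle_hours=16))  # -> 8.0
```

Summing these effective hours over all resolved defects of a module, plus the functionality testing time from the project plan, gives the dependent variable Y for that module.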
Steps of Model Development and Comparisons
1. Insert collected data into Excel macro
2. Sort and list
3. Correlation analysis
4. Run the model using the Excel statistical tool
5. Analyze the results
6. Transformation: multiply one independent variable by another independent variable
7. Plot all independent variables against the dependent variable
8. Delete insignificant variables and add significant variables
9. Plot % residual vs. observed values of the dependent variable and analyze
10. Square transformation of the independent variables
11. Repeat steps 4 to 9
12. Plot residual vs. predicted values of the dependent variable
13. Plot residual vs. independent variables
14. Cubic transformation
15. Repeat steps 4 to 9
16. Pick the best model by analyzing all alternatives
17. If satisfied, finalize the formula
18. If not satisfied, go to step 19
19. Find outliers and remove if necessary
20. Run the model in the computer
21. Analyze the results
22. Plot % residuals vs. independent variables
23. Plot all predictor variables at a time against the response variable
24. Run stepwise regression, maximize B
25. Run stepwise regression, minimize A
26. Plot R-value vs. number of independent variables used and look for an elbow
27. Choose candidate model(s)
28. Transformation: divide one independent variable by another correlated independent variable
29. Repeat steps 20 to 25
30. Transformation: add correlated independent variables
31. Repeat steps 20 to 25
32. Transformation: square root of the dependent variable
33. Repeat steps 20 to 25
34. Plot normal probability plots of residuals
35. Based on all plots, check for an appropriate relationship function
36. Calculate AAPD for all alternatives and compare
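The AAPD comparison in the final step can be sketched as follows, taking AAPD to mean the average absolute percentage deviation between actual and predicted effort; the candidate model predictions here are invented for illustration:

```python
def aapd(actual, predicted):
    """Average absolute percentage deviation of predictions from actuals."""
    terms = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)

actual  = [10.0, 20.0, 40.0]   # observed testing effort per module
model_a = [11.0, 18.0, 42.0]   # hypothetical predictions from candidate model A
model_b = [12.0, 25.0, 30.0]   # hypothetical predictions from candidate model B
# A ~ 8.33%, B ~ 23.33%: model A deviates less on average, so it is preferred.
print(aapd(actual, model_a), aapd(actual, model_b))
```

The candidate with the smallest AAPD across all alternatives becomes the final model.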
Expected Result from the Study
A robust testing effort prediction model containing
 response variable
 predictor variables
o or their transformations
o or their arithmetic relationships
 the intercept
 the coefficients
Future Work
 This is an a posteriori observation - convert it into a controlled a priori experiment
 Use a larger set of data
 Use more testing phases (e.g. performance)
 Add maintenance effort
 Use other predictor variables
 Develop an estimation tool based on the model