“2cee”
A 21st Century
Effort Estimation Methodology
Jairus Hihn (jhihn@jpl.nasa.gov)
Karen Lum (ktlum@jpl.nasa.gov)
Tim Menzies (tim@menzies.us)
Dan Baker (dbaker6@mix.wvu.edu)
22nd International Forum on COCOMO and Systems/Software Cost Modeling (2007)
Our Journey
• It quickly became apparent in the early stages of our research
task that there was a major disconnect between the
techniques used by estimation practitioners and the numerous
ideas being addressed in the research community
• It also became clear that many fundamental estimation
questions were not being addressed
– What is a model's real estimation uncertainty?
– How many records are required to calibrate?
• Answers have varied from 10 to 20 just for the intercept and slope
• If we do not have enough data, what is the impact on model uncertainty?
– Data is expensive to collect and maintain, so we want to keep the number of cost
drivers and effort multipliers as small as possible
• But which ones are the right ones?
• When should we build domain-specific models?
– What are the best functional forms?
– What are the best ways to tune/calibrate a model?
Our Journey Continued
• Data mining techniques provided us with the rigorous tool set we needed
to explore the many dimensions of the problem we were addressing in a
repeatable manner:
– Different calibration and validation datasets
– Analyze standard and non-standard models
– Perform exhaustive searches over all parameters and records in order to
guide data pruning (see the sketch after this list)
• Rows (stratification)
• Columns (variable reduction)
– Measure model performance with multiple measures
– We have even been able to determine which performance measures are best
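The exhaustive column-pruning search can be pictured roughly as below. This is a hypothetical sketch, not the actual COSEEKMO code; the `score` callback (for example, a cross-validated median-error routine) is an assumed hook supplied by the caller.

```python
# Hypothetical sketch of exhaustive "column pruning": score every subset of
# candidate cost-driver columns and keep the best-scoring subset.
from itertools import combinations

def best_column_subset(columns, score):
    """columns: list of column names; score(cols) -> error (lower is better)."""
    best_cols, best_err = None, float("inf")
    for k in range(1, len(columns) + 1):
        for cols in combinations(columns, k):
            err = score(cols)
            if err < best_err:
                best_cols, best_err = list(cols), err
    return best_cols, best_err

# e.g. best_column_subset(["acap", "time", "cplx", "aexp"], score=my_cv_error)
# where my_cv_error is a user-supplied cross-validation routine (assumed name).
```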
Some Things We
Learned Along the Way
Local Calibration
Does Not Always Improve Performance
• For the NASA data set, Local Calibration (LC), i.e. re-estimating
only a and b, does not produce the ‘best’ model (see the sketch below)
• A more thorough analysis is required, including
reducing the number of variables
• Effort models were learned via either standard
LC or COSEEKMO
• The top plot shows the number of projects in
27 subsets of our two data sources
• The middle and bottom plots show the
standard deviation and mean in performance
error
• Data subsets are sorted by the error’s
standard deviation
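For reference, here is a minimal sketch of standard local calibration, i.e. re-fitting only a and b by least squares in log space. It is illustrative only, not the exact 2CEE/COSEEKMO procedure, and the example numbers are made up.

```python
# A minimal sketch of COCOMO-style local calibration (re-fitting only a and b).
# Each project record supplies KLOC, the product of its effort multipliers,
# and actual effort in person-months.
import numpy as np

def local_calibration(kloc, em_product, actual_effort):
    """Fit a and b in effort = a * KLOC**b * prod(EM) by least squares in log space."""
    kloc = np.asarray(kloc, float)
    em = np.asarray(em_product, float)
    y = np.log(np.asarray(actual_effort, float)) - np.log(em)   # remove multiplier term
    X = np.column_stack([np.ones_like(kloc), np.log(kloc)])     # [1, ln(KLOC)]
    (ln_a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.exp(ln_a), b

# Example (illustrative numbers only):
a, b = local_calibration(kloc=[25, 60, 120], em_product=[1.1, 0.9, 1.3],
                         actual_effort=[120, 260, 700])
```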
Stratification
Does Not Always Improve Performance
• Stratification does not always improve model performance
• Our results show it helps only about half the time
• The main implication is that one must really know one's data, as there is
no general rule for determining the best approach to model calibration
• The plots show mean performance error
(i.e. |(predicted − actual)|/actual) based on 30
experiments with each subset (see the sketch after this list)
• The dashed horizontal lines show the error rate of
models learned from all data from the two sources
• The crosses show the mean error performance seen in
models learned from subsets of that data
• Crosses below/above the lines indicate models
performing better/worse (respectively) than models
built from all the data
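A minimal sketch of the error measure used in these plots, the magnitude of relative error (MRE), summarized by both its mean and its median (the median being more robust to the outliers noted later). The helper names are assumptions for illustration.

```python
# MRE = |predicted - actual| / actual, summarized as MMRE (mean) and MdMRE (median).
import numpy as np

def mre(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs(predicted - actual) / actual

def summarize(actual, predicted):
    errors = mre(actual, predicted)
    return {"MMRE": float(np.mean(errors)), "MdMRE": float(np.median(errors))}

# e.g. summarize([100, 250, 40], [130, 210, 90])  # illustrative values only
```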
Cost Driver Instability

[Table: for each data subset (coc81_all, coc81_mode_embedded, coc81_mode_organic, nasa93_all, nasa93_mode_embedded, nasa93_mode_semidetached, nasa93_fg_ground, nasa93_category_missionplanning, nasa93_category_avionicsmonitoring, nasa93_year_1975, nasa93_year_1980, nasa93_center2, nasa93_center5, nasa93_project_gro, nasa93_project_sts), the table marks which COCOMO 81 cost drivers (acap, time, cplx, aexp, virt, data, turn, rely, stor, lexp, sced, tool, pcap, modp, vexp) were found significant, with per-driver totals for ‘Always Significant’, ‘Usually Significant’, and ‘Total Number of Significant Occurrences’.
Legend: l = not significantly different than 10 at a 95% confidence interval; m = not significantly different than 9 or greater at a 95% confidence interval]
The bottom line is that we have way too many cost drivers in our models!
• Furthermore, which smaller set is best varies across different domains and stratifications
• The cost drivers that are unlikely to improve model performance are pcap, vexp, lexp, modp, tool,
and sced
• For more contemporary data, we expect stor and time to drop out, because there are fewer
computer constraints these days, and modp may become more significant
Some Good News
• Physical SLOC always loads as significant, with no language
adjustment
• The standard functional form shown below is virtually always
selected, as indicated by the non-standard model M5P being
selected only once (see the sketch at the end of this slide):

effort(person-months) = a · KLOC^b · ∏_j EM_j
• The ‘out-of-the-box’ version of COCOMO 81 is almost
always the best model on the original COCOMO 81 data
– View this as a sanity check on our methodology
• However, for the NASA93 data:
– sometimes one can use the model right out of the box
– sometimes local calibration is sufficient
– sometimes a full regression analysis needs to be performed to obtain
optimal results
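As a concrete reference for the functional form above, here is a minimal sketch. The a and b values in the example are the textbook COCOMO 81 semidetached-mode constants, and the multiplier values are placeholders.

```python
# effort (person-months) = a * KLOC**b * product of effort multipliers
from math import prod

def cocomo_effort(kloc, a, b, effort_multipliers):
    """Standard COCOMO-style functional form."""
    return a * kloc ** b * prod(effort_multipliers)

# e.g. with COCOMO 81 semidetached-mode constants a=3.0, b=1.12 and placeholder multipliers:
estimate = cocomo_effort(kloc=50, a=3.0, b=1.12, effort_multipliers=[1.15, 0.88, 1.06])
```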
Key Research Findings
• Our models have too many inputs
– Measures of RE go up with over-specified models
• Median measures of error, not mean or Pred, should be used to compare models
– There is an instability issue due to the small data sets with significant outliers,
which makes it difficult to determine which estimation model or calibration is best
– Mann-Whitney U test (see the sketch below)
• Manual stratification does not lead to the ‘best’ model
– E.g. a combination of flight SW and Class B ground produces a ‘better’ model than
just selecting all your flight records and doing LC
– Nearest-neighbor searches find analogous records based on your current project's
model inputs
• The same approach is never best, but some combination of the following always wins:
– LC
– Column pruning
– Nearest neighbor
• Which is best is determined case-by-case
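The Mann-Whitney U test mentioned above can be applied roughly as follows. This is a sketch, not 2CEE's exact procedure, and `lc_errors`/`coseekmo_errors` in the usage line are assumed names.

```python
# Use the (nonparametric, outlier-robust) Mann-Whitney U test to decide whether two
# calibration methods' MRE distributions really differ before declaring a winner.
import numpy as np
from scipy.stats import mannwhitneyu

def compare_methods(mre_a, mre_b, alpha=0.05):
    """Return which method 'wins', or 'tie' if the error distributions are
    not significantly different at the given alpha level."""
    stat, p = mannwhitneyu(mre_a, mre_b, alternative="two-sided")
    if p >= alpha:
        return "tie"
    return "A" if np.median(mre_a) < np.median(mre_b) else "B"

# e.g. compare_methods(lc_errors, coseekmo_errors) -> 'tie', 'A', or 'B'
```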
2cee
• 21st Century Estimation Environment
– Just Born: released October 2, 2007
– The result of four years of research using machine learning to study
model calibration and validation techniques
– Probabilistic
– Key Features:
• Dynamic calibration using variable reduction and nearest-neighbor search
• Can be used as a model analysis tool, a calibration tool, and/or an
estimation tool
• Can estimate with partial inputs
• Uses N-fold cross-validation (leave-one-out cross-validation when N equals
the number of records); see the sketch below
• Uses median not mean to evaluate model performance
– Runs in Windows, coded in Visual Basic
– We will be running it in parallel with our core tools over the next year
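A minimal sketch of the leave-one-out evaluation mentioned above: each project is held out in turn, the model is calibrated on the rest, and the held-out project's error is recorded. The `calibrate`/`estimate` hooks and the record layout are assumptions, not 2CEE's actual API.

```python
import numpy as np

def leave_one_out_mre(records, calibrate, estimate):
    """records: list of project dicts; calibrate(train) -> model; estimate(model, rec) -> effort."""
    errors = []
    for i, held_out in enumerate(records):
        train = records[:i] + records[i + 1:]          # all records except the held-out one
        model = calibrate(train)
        predicted = estimate(model, held_out)
        errors.append(abs(predicted - held_out["effort"]) / held_out["effort"])
    return float(np.median(errors))                    # median, not mean, per the findings above
```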
2CEE

[Workflow diagram: Load Historical Data; Use Predefined COCOMO Coefficients; Optionally Use Manual Stratification; Optionally Use Manual or Automatic Feature Selection; Define Project Ranges; Full Local Calibration; Bootstrapped Local Calibration; Nearest Neighbour Local Calibration; Monte Carlo Project Instances; Produce Range of COCOMO Estimates]
2CEE Steps
• Define Model Calibration
• Evaluate with Cross Validation
• Define Project Ranges
• Monte Carlo Estimates (see the sketch below)
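A minimal sketch of the final Monte Carlo step: sample project instances from user-defined input ranges, push each through the calibrated COCOMO form, and report a range of estimates. The function and parameter names are assumptions, not 2CEE's implementation.

```python
import random
from math import prod

def monte_carlo_estimates(a, b, kloc_range, em_ranges, n=1000, seed=0):
    """kloc_range = (lo, hi); em_ranges = list of (lo, hi) per effort multiplier."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n):
        kloc = rng.uniform(*kloc_range)
        ems = [rng.uniform(lo, hi) for lo, hi in em_ranges]
        estimates.append(a * kloc ** b * prod(ems))      # calibrated COCOMO form
    estimates.sort()
    return {"p10": estimates[n // 10],
            "median": estimates[n // 2],
            "p90": estimates[9 * n // 10]}

# e.g. monte_carlo_estimates(a=3.0, b=1.12, kloc_range=(40, 60),
#                            em_ranges=[(0.9, 1.2), (1.0, 1.3)])
```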
2cee Provides Insight into Model Performance and Tuning
• E.g. “officially”, COCOMO's tuning parameters vary over
– 2.5 <= a <= 2.94
– 0.91 <= b < 1.01
• Which is nothing like what we see with real NASA data, where
– 3.5 <= a <= 14
– 0.65 <= b <= 1
• There are many outliers in our data
Karen will be available at the tool fair
Stop in and take a look
under the hood
Bibliography
Current Research Publications
Selecting Best Practices for Effort Estimation, IEEE Transactions on Software Engineering, Nov 2006. (Menzies, Chen, Hihn, Lum)
Evidence-Based Cost Estimation for Better-Quality Software, IEEE Software, July/August 2006. (Menzies and Hihn)
Studies in Software Cost Model Behavior: Do We Really Understand Cost Model Performance?, Proceedings of the ISPA
International Conference 2006, Seattle, WA. (Lum, Hihn, Menzies) (Best Paper Award)
Simple Software Cost Analysis: Safe or Unsafe?, Proceedings of the International Workshop on Predictor Models in Software
Engineering (PROMISE 2005), St. Louis, MO, 14 June 2005. (Menzies, Port, Hihn, Chen)
Feature Subset Selection Improves Software Cost Estimation, Proceedings of the International Workshop on Predictor Models in
Software Engineering (PROMISE 2005), St. Louis, MO, 14 June 2005. (Chen, Menzies, Port, Boehm)
Validation Methods for Calibrating Software Effort Models, ICSE 2005 Proceedings, St. Louis, MO, May 2005. (Menzies,
Port, Hihn, Chen)
Specialization and Extrapolation of Software Cost Models, Proceedings of the Automated Software Engineering Conference, Nov
2005. (Menzies, Chen, Port, Hihn)
Finding the Right Data for Software Cost Modeling, IEEE Software, Nov/Dec 2005. (Chen, Menzies, Port, Boehm)
State of the Art Best Practice
The following is a comprehensive list of best practices based on an extensive review
of the literature.
Our proposed methodology, 2cee, addresses the practices designated in green.
• According to Jorgensen [2], expert-based best practices include:
1. Evaluate estimation accuracy, but avoid high evaluation pressure;
2. Avoid conflicting estimation goals;
3. Ask the estimators to justify and criticize their estimates;
4. Avoid irrelevant and unreliable estimation information;
5. Use documented data from previous development tasks;
6. Find estimation experts with relevant domain background;
7. Estimate top-down and bottom-up, independently of each other;
8. Use estimation checklists;
9. Combine estimates from different experts and estimation strategies;
10. Assess the uncertainty of the estimate;
11. Provide feedback on estimation accuracy; and,
12. Provide estimation training opportunities.
• According to Boehm [3], [4]; Chulani [5], [6]; Kemerer [7]; Stutzke [8]; Shepperd [9]; our own work [10]–[12];
and a recent tutorial at the 2006 International Conference of the International Society of Parametric Analysts
[13], best practices for model-based estimation include at least the following:
13. Reuse regression parameters learned from prior projects on new projects.
14. Log-transform costing data before performing linear regression to learn log-linear effort models (see the sketch after this list).
15. Model-tree learning to generate models for non-linear relationships.
16. Stratification, i.e. given a database of past projects and a current project to be estimated, learn
models only from the records of similar projects.
17. Local calibration, i.e. tune a general model to the local data via a small number of special tuning
parameters.
18. Hold-out experiments for testing the learned effort model [10].
19. Assessing effort model uncertainty via the performance deviations seen during the hold-out experiments
of item #18.
20. Variable subset selection methods for minimizing the size of the learned effort model [11], [12], [14], [15].
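As an illustration of practice #14, here is a minimal log-linear regression sketch. The column layout and helper names are assumptions for illustration, not a prescribed schema.

```python
# Log-transform effort data and regress in log space to learn a log-linear effort model:
# ln(effort) = c0 + c1*ln(KLOC) + sum_k c_k*ln(driver_k)
import numpy as np

def fit_log_linear(kloc, effort, drivers):
    """drivers: 2-D array of cost-driver values (one column per driver)."""
    X = np.column_stack([np.ones(len(effort)), np.log(kloc), np.log(drivers)])
    coef, *_ = np.linalg.lstsq(X, np.log(effort), rcond=None)
    return coef                      # back-transform with exp() when producing estimates

def predict(coef, kloc, drivers_row):
    x = np.concatenate([[1.0, np.log(kloc)], np.log(drivers_row)])
    return float(np.exp(x @ coef))
```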