“2cee” A 21st Century Effort Estimation Methodology

Jairus Hihn (jhihn@jpl.nasa.gov), Karen Lum (ktlum@jpl.nasa.gov), Tim Menzies (tim@menzies.us), Dan Baker (dbaker6@mix.wvu.edu)

22nd International Forum on COCOMO and Systems/Software Cost Modeling (2007)

Our Journey
• It quickly became apparent in the early stages of our research task that there was a major disconnect between the techniques used by estimation practitioners and the numerous ideas being addressed in the research community
• It also became clear that many fundamental estimation questions were not being addressed
  – What is a model's real estimation uncertainty?
  – How many records are required to calibrate?
    • Answers have varied from 10-20 just for intercept and slope
    • If we do not have enough data, what is the impact on model uncertainty?
  – Data is expensive to collect and maintain, so we want to keep cost drivers and effort multipliers as few as possible
    • But what are the right ones?
    • When should we build domain-specific models?
  – What are the best functional forms?
  – What are the best ways to tune/calibrate a model?

Our Journey Continued
• Data mining techniques provided us with the rigorous tool set we needed to explore the many dimensions of the problem we were addressing in a repeatable manner
  – Different calibration and validation datasets
  – Analyze standard and non-standard models
  – Perform exhaustive searches over all parameters and records in order to guide data pruning
    • Rows (stratification)
    • Columns (variable reduction)
  – Measure model performance by multiple measures
  – We have even been able to determine which performance measures are best

Some Things We Learned Along the Way

Local Calibration Does Not Always Improve Performance
• For the NASA data set, Local Calibration (LC), i.e. re-estimating only a and b, does not produce the ‘best’ model.
• A more thorough analysis is required, including reducing the number of variables
• Effort models were learned via either standard LC or COSEEKMO
• The top plot shows the number of projects in 27 subsets of our two data sources
• The middle and bottom plots show the standard deviation and mean of the performance error
• Data subsets are sorted by the error's standard deviation

Stratification Does Not Always Improve Performance
• Stratification does not always improve model performance
• Results show it is 50-50
• Main implication is that one must really know their data, as there is no solution for determining the best approach to model calibration
• The plots show mean performance error (i.e. |predicted − actual|/actual) based on 30 experiments with each subset
• The dashed horizontal lines show the error rate of models learned from all data from the two sources
• The crosses show the mean error performance seen in models learned from subsets of that data
• Crosses below/above the lines indicate models performing better/worse (respectively) than models built from all the data

Cost Driver Instability
[Table: significance of each COCOMO 81 cost driver (columns) in each data subset (rows); per-cell markers omitted]
Cost drivers: acap, time, cplx, aexp, virt, data, turn, rely, stor, lexp, pcap, modp, vexp, sced, tool
Data subsets: coc81_all, coc81_mode_embedded, coc81_mode_organic, nasa93_all, nasa93_mode_embedded, nasa93_mode_semidetached, nasa93_fg_ground, nasa93_category_missionplanning, nasa93_category_avionicsmonitoring, nasa93_year_1975, nasa93_year_1980, nasa93_center2, nasa93_center5, nasa93_project_gro, nasa93_project_sts
Legend:
  Always significant = not significantly different from 10 at a 95% confidence interval
  Usually significant = not significantly different from 9 or greater at a 95% confidence interval
Total number of significant occurrences per cost driver: acap 13; time, cplx, aexp 12 each; virt, data, turn, rely, stor 11 each; lexp, pcap, modp, vexp, sced, tool 7 or 8 each
Number of significant cost drivers per data subset, in the order listed: 15, 14, 13, 8, 11, 3, 5, 9, 6, 10, 11, 14, 9, 13, 7

• The bottom line is that we have way too many cost drivers in our models!
• Furthermore, which smaller set is best varies across different domains and stratifications
• The cost drivers that are unlikely to improve model performance are pcap, vexp, lexp, modp, tool, sced
• For more contemporary data, stor and time are expected to drop out because there are fewer computer constraints these days, and modp may become more significant

Some Good News
• Physical SLOC always loads as significant with no language adjustment
• The standard functional form shown below is virtually always selected, as indicated by the non-standard model M5P being selected only once

\[ \mathrm{effort\ (person\text{-}months)} = a \cdot \mathrm{KLOC}^{b} \cdot \prod_{j} \mathrm{EM}_{j} \]

• The ‘out-of-the-box’ version of COCOMO 81 is almost always the best model on the original COCOMO81 data
  – View as a sanity check on our methodology
• However, for the NASA93 data
  – sometimes one can use the model right out of the box
  – sometimes local calibration is sufficient
  – sometimes a full regression analysis needs to be performed to obtain optimal results

Key Research Findings
• Our models have too many inputs
  – Measures of RE go up with over-specified models
• Median measures of error, not Mean or Pred, should be used to compare models
  – There is an instability issue due to the small data sets with significant outliers, which makes it difficult to determine which estimation model or calibration is best.
  – Mann-Whitney U Test
• Manual stratification does not lead to the ‘best’ model
  – E.g. a combination of flight SW and Class B ground produces a ‘better’ model than just selecting all your flight records and doing LC.
  – Nearest Neighbor searches for analogous records based on your current project model inputs
• The same approach is never best, but some combination of the following always wins
  – LC
  – Column Pruning
  – Nearest Neighbor
• Which is best is determined case-by-case

2cee
• 21st Century Estimation Environment
  – Just Born: released October 2, 2007
  – Result of four years of research using machine learning techniques to study model calibration and validation techniques
  – Probabilistic
  – Key Features:
    • Dynamic calibration using variable reduction and nearest neighbor search
    • Can be used as a model analysis tool, a calibration tool, and/or an estimation tool
    • Can estimate with partial inputs
    • Uses N-Fold Cross Validation with one fold per record (also called Leave One Out Cross Validation)
    • Uses median, not mean, to evaluate model performance
  – Runs in Windows, coded in Visual Basic
  – Will be running it in parallel with core tools over the next year

2CEE
[Flow diagram boxes: Load Historical Data; Use Predefined COCOMO Coefficients; Optionally Use Manual Stratification; Optionally Use Manual Or Automatic Feature Selection; Define Project Ranges; Full Local Calibration; Bootstrapped Local Calibration; Monte Carlo Project Instances; Nearest Neighbour Local Calibration; Produce Range of COCOMO Estimates]

2CEE Steps
• Define Model Calibration
• Evaluate with Cross Validation
• Define Project Ranges
• Monte Carlo Estimates

2cee Provides Insight into Model Performance and Tuning
• e.g. “officially”, COCOMO's tuning parameters vary
  – 2.5 <= a <= 2.94
  – 0.91 <= b < 1.01
• Which is nothing like what we see with real NASA data,
  – 3.5 <= a <= 14
  – 0.65 <= b <= 1
• There are many outliers in our data

Karen will be available at the tool fair. Stop in and take a look under the hood.

Bibliography: Current Research Publications

Selecting Best Practices for Effort Estimation, IEEE Transactions on Software Engineering, Nov 2006.
(Menzies, Chen, Hihn, Lum)

Evidence-Based Cost Estimation for Better-Quality Software, IEEE Software, July/August 2006.
(Menzies and Hihn)

Studies in Software Cost Model Behavior: Do We Really Understand Cost Model Performance?, Proceedings of the ISPA International Conference 2006, Seattle, WA. (Best Paper Award)
(Lum, Hihn, Menzies)

Simple Software Cost Analysis: Safe or Unsafe?, Proceedings of the International Workshop on Predictor Models in Software Engineering (PROMISE 2005), St. Louis, MO, 14 June 2005.
(Menzies, Port, Hihn, Chen)

Feature Subset Selection Improves Software Cost Estimation, PROMISE 2005, St. Louis, MO, 14 June 2005.
(Chen, Menzies, Port, Boehm)

Validation Methods for Calibrating Software Effort Models, ICSE 2005 Proceedings, St. Louis, MO, May 2005.
(Menzies, Port, Hihn, Chen)

Specialization and Extrapolation of Software Cost Models, Proceedings of the Automated Software Engineering Conference, Nov 2005.
(Menzies, Chen, Port, Hihn)

Finding the Right Data for Software Cost Modeling, IEEE Software, Nov/Dec 2005.
(Chen, Menzies, Port, Boehm)

State of the Art Best Practice
The following is a comprehensive list of best practices based on an extensive review of the literature. Our proposed methodology, 2cee, addresses the practices designated in green.

• According to Jorgensen [2], expert-based best practices include:
  1. Evaluate estimation accuracy, but avoid high evaluation pressure;
  2. Avoid conflicting estimation goals;
  3. Ask the estimators to justify and criticize their estimates;
  4. Avoid irrelevant and unreliable estimation information;
  5. Use documented data from previous development tasks;
  6. Find estimation experts with relevant domain background;
  7. Estimate top-down and bottom-up, independently of each other;
  8. Use estimation checklists;
  9. Combine estimates from different experts and estimation strategies;
  10. Assess the uncertainty of the estimate;
  11. Provide feedback on estimation accuracy; and,
  12. Provide estimation training opportunities.
• According to Boehm [3], [4]; Chulani [5], [6]; Kemerer [7]; Stutzke [8]; Shepperd [9]; our own work [10]–[12]; and a recent tutorial at the 2006 International Conference of the International Society of Parametric Analysts [13], best practices for model-based estimation include at least the following:
  13. Reuse regression parameters learned from prior projects on new projects;
  14. Log-transforms on costing data before performing linear regression to learn log-linear effort models;
  15. Model-tree learning to generate models for non-linear relationships;
  16. Stratification, i.e. given a database of past projects and a current project to be estimated, learn models only from the records of similar projects;
  17. Local calibration, i.e. tune a general model to the local data via a small number of special tuning parameters;
  18. Hold-out experiments for testing the learned effort model [10];
  19. Assessing effort model uncertainty via the performance deviations seen during the hold-out experiments of item #18;
  20. Variable subset selection methods for minimizing the size of the learned effort model [11], [12], [14], [15].
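
Illustration: practices 14, 17, 18 and 19 in a few lines
A minimal Python sketch, not the 2cee implementation (the slides say the tool is coded in Visual Basic). It assumes the standard form effort = a * KLOC^b * (product of effort multipliers), fits only a and b by least squares on log-transformed data (local calibration), and scores the fit with leave-one-out hold-out experiments summarized by the median relative error. The function names and the sample records are hypothetical; the records are made-up numbers, not NASA or COCOMO 81 data.

```python
import math
import statistics


def local_calibration(projects):
    """Fit a and b in effort = a * KLOC**b * EM by least squares in log space.

    Each project is a (kloc, em_product, actual_effort) tuple, where
    em_product is the product of that project's effort multipliers.
    """
    xs = [math.log(kloc) for kloc, _, _ in projects]
    # Divide out the effort multipliers so only a and b remain to be fitted.
    ys = [math.log(effort / em) for _, em, effort in projects]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the log-linear model
    a = math.exp(y_mean - b * x_mean)  # intercept, mapped back from log space
    return a, b


def estimate(a, b, kloc, em_product):
    """Effort in person-months from the calibrated standard form."""
    return a * kloc ** b * em_product


def leave_one_out_median_mre(projects):
    """Hold out each project in turn, calibrate on the rest, and return the
    median of |predicted - actual| / actual over all held-out projects."""
    errors = []
    for i, (kloc, em, actual) in enumerate(projects):
        training = projects[:i] + projects[i + 1:]
        a, b = local_calibration(training)
        predicted = estimate(a, b, kloc, em)
        errors.append(abs(predicted - actual) / actual)
    return statistics.median(errors)


# Hypothetical records: (KLOC, EM product, noise factor). Effort is generated
# from a = 3.0, b = 1.1 and perturbed by the noise factor. Illustrative only.
_RAW = [(10, 1.0, 1.10), (25, 1.2, 0.90), (50, 0.9, 1.05),
        (100, 1.4, 0.95), (200, 1.1, 1.00)]
SAMPLE_PROJECTS = [(k, em, 3.0 * k ** 1.1 * em * noise) for k, em, noise in _RAW]

if __name__ == "__main__":
    a, b = local_calibration(SAMPLE_PROJECTS)
    print(f"calibrated a = {a:.3f}, b = {b:.3f}")
    print(f"leave-one-out median relative error = "
          f"{leave_one_out_median_mre(SAMPLE_PROJECTS):.3f}")
```

The median is used in place of the mean because, as noted under Key Research Findings, small data sets with significant outliers make mean-based comparisons unstable.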