Slide 1: Software Cost Estimation
Strictly speaking, effort!
Kwon Ki-Tae, Department of Computer Engineering, Gangneung National University

Slide 2: Agenda
1. Background
2. "Current" techniques
3. Machine learning techniques
4. Assessing prediction systems
5. Future avenues

Slide 3: 1. Background
Scope:
- software projects
- early estimates
- effort ≠ cost
- estimate ≠ expected answer

Slide 4: What the Papers Say...
From Computing, 26 November 1998: "Defence system never worked. MoD project loses £34m. The Ministry of Defence has been forced to write off £34.6 million on an IT project it commissioned in 1988 and abandoned eight years later, writes Joanne Wallen. The Trawlerman system, designed ..."

Slide 5: The Problem
Software developers need to predict, e.g.:
- effort
- duration
- number of features
- defects and reliability
But there is:
- little systematic data
- noise and change
- complex interaction between variables
- poorly understood phenomena

Slide 6: So What is an Estimate?
An estimate is a prediction based upon a probabilistic assessment.
[Figure: probability (p) against effort, marking the most likely value and the value with an equal probability of under- and over-estimation.]

Slide 7: Some Causes of Poor Estimation
- We don't cope with the political problems that hamper the process.
- We don't develop estimating expertise.
- We don't systematically use past experience.
Tom DeMarco, Controlling Software Projects: Management, Measurement and Estimation. Yourdon Press: NY, 1982.

Slide 8: 2. "Current" Techniques
Essentially, a software cost estimation system maps an input vector to an output. Techniques include:
- expert judgement
- COCOMO
- function points
- DIY models
Barry Boehm, "Software Engineering Economics," IEEE Transactions on Software Engineering, vol. 10, pp. 4-21, 1984.

Slide 9: 2.1 Expert Judgement
- Most widely used estimation technique
- No consistently "best" prediction system
- Lack of historical data
- Need to "own" the estimate
- Experts plus ... ?

Slide 10: Expert Judgement Drawbacks
BUT:
- Lack of objectivity
- Lack of repeatability
- Lack of recall / awareness
- Lack of experts!
It is preferable to use more than one expert.

Slide 11: What Do We Know About Experts?
- The most commonly practised technique.
- A Dutch survey revealed that 62% of estimators used intuition supplemented by remembered analogies.
- A UK survey found that the time taken to estimate ranged from 5 minutes to 4 weeks.
- A US survey found that the only factor with a significant positive relationship with accuracy was responsibility.

Slide 12: Information Used
- Design requirements
- Resources available
- Base product / source code (enhancement projects)
- Software tools available
- Previous history of the product
- ...

Slide 13: Information Needed
- Rules of thumb
- Available resources
- Data on past projects
- Feedback on past estimates
- ...

Slide 14: Delphi Techniques?
Methods for structuring group communication processes to solve complex problems, characterised by:
- iteration
- anonymity
Devised by the Rand Corporation (1948); refined by Boehm (1981).

Slide 15: Stages of the Delphi Approach
1. Experts receive the specification plus an estimation form
2. Discussion of the product and estimation issues
3. Experts produce individual estimates
4. Estimates are tabulated and returned to the experts
5. Only each expert's personal estimate is identified
6. Experts meet to discuss the results
7. Estimates are revised
8. The cycle continues until an acceptable degree of convergence is obtained
(A small sketch of the bookkeeping behind steps 4-8 follows the form below.)

Slide 16: Wideband Delphi Form
Project: X134  Date: 9/17/03  Estimator: Hyolee  Estimation round: 1
[Form: individual estimates marked on a 0-50 scale. Key: x = estimate; x* = your estimate; x! = median estimate.]
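The Delphi stages above describe a group protocol rather than an algorithm, but the bookkeeping between rounds (tabulating the estimates, reporting the median that the Wideband form records, and judging convergence) is easy to mechanise. The sketch below is not part of the original slides; it is a minimal illustration in which the convergence tolerance and the estimate figures are invented.

```python
from statistics import median

def delphi_round_summary(estimates, tolerance=0.25):
    """Tabulate one Delphi round: report the median estimate and whether the
    spread of estimates is within the given relative tolerance."""
    med = median(estimates)
    spread = (max(estimates) - min(estimates)) / med
    return med, spread <= tolerance

# Hypothetical size estimates (e.g. number of delimiters) from four experts.
rounds = [
    [120, 300, 80, 950],   # initial estimates
    [150, 220, 160, 400],  # after the first group discussion
]

for i, estimates in enumerate(rounds, start=1):
    med, converged = delphi_round_summary(estimates)
    print(f"Round {i}: median = {med}, converged = {converged}")
```

In practice convergence is judged by the group rather than by a fixed tolerance; the code simply shows the kind of per-round summary an estimation coordinator might circulate.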
Slide 17: Observing Delphi Groups
- Four groups of MSc students
- Developing a C++ prototype for some simple scenarios
- Asked to estimate the size of the prototype (number of delimiters)
- Initial estimates followed by two group discussions
- Group discussions recorded, plus scribes

Slide 18: Delphi Size Estimation Results
Absolute errors:
Estimation   Mean    Median   Min   Max
Initial      371     160.5    23    2249
Round 1      219     40       23    749
Round 2      271     40       3     949

Slide 19: A Converging Group
[Chart: the group's estimates converge towards the true size over successive rounds.]

Slide 20: A Dominant Individual
[Chart: the group's estimates converge on a dominant individual's estimate rather than on the true size.]

Slide 21: 2.2 COCOMO
- The best known example of an algorithmic cost model.
- A series of three models: basic, intermediate and detailed.
- The models assume relationships between size (KDSI) and effort, and between effort and elapsed time:
  MM = a * KDSI^b
  TDEV = c * MM^d
Barry Boehm, "Software Engineering Economics," IEEE Transactions on Software Engineering, vol. 10, pp. 4-21, 1984. http://sunset.usc.edu/COCOMOII/cocomo.html

Slide 22: COCOMO contd.
The model coefficients are dependent upon the type of project:
- organic: small teams, familiar application
- semi-detached
- embedded: complex organisation, software and/or hardware interactions

Slide 23: COCOMO Cost Drivers
- product attributes
- computer attributes
- personnel attributes
- project attributes
The drivers are hard to validate empirically. Many are inappropriate for the 1990s, e.g. database size. The drivers are not independent, e.g. MODP and TOOL.

Slide 24: COCOMO Assessment
- A very influential, non-proprietary model.
- The drivers help the manager understand the impact of different factors upon project costs.
- Hard to port to different development environments without extensive recalibration.
- Vulnerable to mis-classification of the development type.
- Hard to estimate KDSI at the start of a project.

Slide 25: 2.3 What are Function Points?
A synthetic (indirect) measure of the attribute functionality, derived from a software requirements specification. This conforms closely to our notion of specification size.
Uses:
- effort prediction
- productivity measurement

Slide 26: Function Points (a brief history)
- Albrecht developed FPs in the mid 1970s at IBM.
- A measure of system functionality as opposed to size.
- A weighted count of function types derived from the specification: interfaces, inquiries, inputs / outputs, files.
A. Albrecht and J. Gaffney, "Software function, source lines of code, and development effort prediction: a software science validation," IEEE Transactions on Software Engineering, vol. 9, pp. 639-648, 1983.
C. Symons, "Function Point Analysis: Difficulties and Improvements," IEEE Transactions on Software Engineering, vol. 14, pp. 2-11, 1988.

Slide 27: Function Point Rules
A weighted count of the different types of function:
- external input types (4), e.g. file names
- external output types (5), e.g. reports, messages
- inquiries (4), i.e. interactive inputs needing a response
- external files (7), i.e. files shared with other software systems
- internal files (10), i.e. invisible outside the system
The unadjusted function count (UFC) is the weighted sum of the counts of each type of function.

Slide 28: Function Types
Type                    Simple   Average   Complex
External input          3        4         6
External output         4        5         7
Logical internal file   7        10        15
External interface      5        7         10
External inquiry        3        4         6

Slide 29: Adjusted FPs
- 14 factors contribute to the technical complexity factor (TCF), e.g. performance, on-line update, complex interface.
- Each factor is rated from 0 (not applicable) to 5 (essential).
- TCF = 0.65 + (sum of factor ratings) / 100
- Thus TCF may range from 0.65 to 1.35, and FP = UFC * TCF.
(A worked sketch of this calculation follows.)
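As a concrete illustration of the rules on the last two slides, the sketch below computes an adjusted function point count. The weights are the "average" column of the Function Types table and the adjustment uses the TCF formula just given; the function counts and the 14 complexity ratings are invented for the example.

```python
# Albrecht-style function point calculation (illustrative figures only).

# Weights for function types of "average" complexity (see the Function Types table).
WEIGHTS = {
    "external_input": 4,
    "external_output": 5,
    "logical_internal_file": 10,
    "external_interface": 7,
    "external_inquiry": 4,
}

# Hypothetical counts taken from a requirements specification.
counts = {
    "external_input": 12,
    "external_output": 8,
    "logical_internal_file": 5,
    "external_interface": 2,
    "external_inquiry": 6,
}

# Hypothetical ratings (0 = not applicable ... 5 = essential) for the 14 TCF factors.
tcf_ratings = [3, 2, 4, 1, 3, 5, 4, 2, 3, 1, 2, 3, 0, 2]
assert len(tcf_ratings) == 14

ufc = sum(WEIGHTS[t] * counts[t] for t in WEIGHTS)   # unadjusted function count
tcf = 0.65 + sum(tcf_ratings) / 100                  # ranges from 0.65 to 1.35
fp = ufc * tcf                                       # adjusted function points

print(f"UFC = {ufc}, TCF = {tcf:.2f}, FP = {fp:.1f}")
```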
Slide 30: Technical Complexity Factors
- Data communications
- Distributed functions
- Performance
- Heavily used configuration
- Transaction rate
- Online data entry
- End user efficiency
- Online update
- Complex processing
- Reusability
- Installation ease
- Operational ease
- Multiple sites
- Facilities change

Slide 31: Function Points and LOC
LOC per FP (figures in parentheses are from the second source):
Assembler         320
C                 150 (128)
COBOL             106 (105)
Modula-2          71 (80)
4GL               40 (20)
Query languages   16 (13)
Spreadsheet       6
Behrens (1983), IEEE TSE 9(6); C. Jones, Applied Software Measurement, McGraw-Hill (1991).

Slide 32: FP Based Predictions
Effort v FPs at XYZ Bank.
[Chart: effort plotted against function points for projects at XYZ Bank.]
The simplest form is: effort = FC + p * FP
We need to determine the local productivity, p, and the fixed costs, FC.

Slide 33: All Environments are not Equal
Productivity figures in FPs per 1,000 hours:
IBM       29.6
Finnish   99.5
Canada    58.9
Mermaid   37.0
US        28.5
The differences reflect training, personnel, management techniques, tools, applications, etc.

Slide 34: Function Point Users
- Widely used (e.g. government, financial organisations) with some success, to monitor team productivity and for cost estimation.
- Most effective where the environment is homogeneous.
- Variants include Mk II Function Points and Feature Points.

Slide 35: Function Point Weaknesses
- Subjective counting (Low and Jeffery report 30% variation between different analysts).
- Hard to automate.
- Hard to apply to maintenance work.
- Not based upon organisational needs, e.g. is it productive to produce functions irrelevant to the user?
- Oriented to traditional DP-type applications.
- Hard to calibrate.
- Frequently leads to inaccurate prediction systems.

Slide 36: Function Point Strengths
- The necessary data can be available early in a project.
- Language independent.
- Layout independent (unlike LOC).
- More accurate than estimated LOC?
- What is the alternative?

Slide 37: 2.4 DIY Models
[Chart: ACT plotted against FILES. Predicting effort using the number of files.]

Slide 38: A Non-linear Model
To introduce economies or diseconomies of scale, add an exponent:
effort = p * S^e, where e > 0.
An empirical study of 60 projects at IBM Federal Systems Division during the mid 1970s concluded that effort could be modelled as:
effort (PM) = 5.2 * KLOC^0.91
(A sketch of fitting such a model to local data appears at the end of this section.)

Slide 39: Productivity and Size
Productivity and project size using the Walston and Felix model:
Effort (PM)   Size (KLOC)   KLOC/PM
42.27         10            0.24
79.42         20            0.25
182.84        50            0.27
343.56        100           0.29
2792.57       1000          0.36

Slide 40: Productivity v Size
[Chart: productivity plotted against project size.]

Slide 41: Bespoke is Better!
Model                 Researcher        MMRE
Basic COCOMO          Kemerer           601%
FP                    Kemerer           103%
SLIM                  Kemerer           772%
ESTIMACS              Kemerer           85%
COCOMO                Miyazaki & Mori   166%
Intermediate COCOMO   Kitchenham        255%

Slide 42: So Where Are We?
- A major research topic.
- Poor results "off the shelf".
- Accuracy improves with calibration, but is still mixed.
- Needs accurate, (largely) quantitative inputs.
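The sketch referred to on the "A Non-linear Model" slide is given below: fitting effort = p * S^e to an organisation's own past projects by least squares in log space, which is one way a "DIY" model can be calibrated locally. It is illustrative only and not part of the original slides; the project data are invented and numpy is assumed to be available.

```python
import numpy as np

# Hypothetical past projects: sizes in KLOC and actual efforts in person-months.
sizes = np.array([5.0, 12.0, 20.0, 33.0, 60.0, 110.0])
efforts = np.array([24.0, 55.0, 92.0, 140.0, 260.0, 430.0])

# effort = p * size^e  =>  log(effort) = log(p) + e * log(size),
# so a straight-line fit on the logs recovers e (slope) and p (exp of the intercept).
e, log_p = np.polyfit(np.log(sizes), np.log(efforts), 1)
p = np.exp(log_p)

print(f"Fitted model: effort = {p:.2f} * KLOC^{e:.2f}")
print(f"Prediction for a 40 KLOC project: {p * 40.0 ** e:.1f} PM")
```

With only a handful of local projects the exponent estimate is unstable, which is one reason the closing guidelines recommend minimising the number of coefficients that have to be calibrated.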
Slide 43: 3. Machine Learning Techniques
- A newer area, but one demonstrating promise.
- The system "learns" how to estimate from a training set.
- Doesn't assume a continuous functional relationship.
- In theory more robust against outliers, and allows more flexible types of relationship.
Du Zhang and Jeffrey Tsai, "Machine Learning and Software Engineering," Software Quality Journal, vol. 11, pp. 87-119, 2003.

Slide 44: Different ML Techniques
- Case based reasoning (CBR), or analogical reasoning
- Neural nets
- Neuro-fuzzy systems
- Rule induction
- Meta-heuristics, e.g. GAs, simulated annealing

Slide 45: Case Based Reasoning
[Diagram: the CBR cycle. A new problem case is matched against previous cases and general knowledge to RETRIEVE a similar case; the retrieved case is REUSED to give a suggested solution; the solution is REVISED (tested / repaired) to give a confirmed solution; and the solved case is RETAINED among the previous cases.]

Slide 46: Using CBR
- Characterise a project, e.g. number of interrupts, size of interface, development method.
- Find similar completed projects.
- Use the completed projects as a basis for the estimate (with adaptation).

Slide 47: Problems
- Finding the analogy, especially in a large organisation.
- Determining how good the analogy is.
- Need for domain knowledge and expertise for case adaptation.
- Need for systematically structured data to represent each case.

Slide 48: ANGEL
ANaloGy Estimation tooL (ANGEL)
http://dec.bmth.ac.uk/ESERG/ANGEL/

Slide 49: ANGEL Features
- A shell
- n features (continuous or categorical)
- Brute-force search for the optimal subset of features: O(2^n - 1)
- Measures Euclidean distance (standardised dimensions)
- Uses the k nearest cases
- Simple adaptation strategy (weighted mean)
- With k = 1 it becomes a nearest-neighbour technique
(A small sketch of this analogy procedure appears at the end of this section.)

Slide 50: CBR Results
A study of 275 projects from 9 datasets suggests that CBR outperforms more traditional statistical methods, e.g. stepwise regression.
M. Shepperd and C. Schofield, IEEE Transactions on Software Engineering, vol. 23, no. 11, pp. 736-743.

Slide 51: Sensitivity Analysis
[Chart: MMRE (%) for three treatments (T1, T2, T3) as the number of projects grows from 3 to 31.]

Slide 52: Independent Replication
- Stensrud and Myrtveit (1998, 99)
- Jeffery and Walkerden (1999)
- Niessink and van Vliet (1997): no search for the best subset of features
- Briand and El Emam (1998): approx. 30 features, so an exhaustive search for the best subset was not possible; homogeneity and well defined relationships favour regression techniques

Slide 53: Artificial Neural Nets
[Diagram: a multi-layer feed-forward ANN. Inputs such as FP, number of files, number of screens and team size form the input layer, which feeds hidden layers and an output layer giving effort.]

Slide 54: ANN Results
Study                 Learning algorithm    n         Results
Venkatachalam         BP                    63        "Promising"
Wittig & Finnie       BP                    81, 136   MMRE = 29%
Jorgenson             BP                    109       MMRE = 100%
Serluca               BP                    28        MMRE = 76%
Karunanithi et al.    Cascade correlation   N/A       "More accurate than algorithmic models"
Samson et al.         BP                    63        MMRE = 428%
Srinivasan & Fisher   BP                    78        MMRE = 70%
Hughes                BP                    33        MMRE = 55%
BP = back propagation learning algorithm

Slide 55: ANN Lessons
- Need large training sets.
- Deal with heterogeneous datasets.
- Opaque (poor explanatory power).
- Sensitive to the choice of topology and learning algorithm.
- Problems of over-adaptation (neuro-fuzzy approaches?).

Slide 56: Rule Induction
IF module_size > 100 THEN
  high_development_effort
ELSE IF developer_experience < 2 THEN
  low_development_effort
ELSE
  moderate_development_effort
C. Mair, G. Kadoda, M. Lefley, K. Phalp, C. Schofield, M. Shepperd, and S. Webster, "An investigation of machine learning based prediction systems," Journal of Systems and Software, vol. 53, pp. 23-29, 2000.

Slide 57: Machine Learning Summary
- Need training sets; ANNs require sets of significant size (n ≈ 50).
- Configuring the system can be a hard search problem.
- No need to specify the form of the relationship in advance.
- Can produce more accurate results than other methods.
- Adapts as new cases are acquired.
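Referred to on the "ANGEL Features" slide, the sketch below illustrates the core of an analogy-based (CBR) estimate: standardise the feature dimensions, find the k completed projects closest in Euclidean distance to the new project, and adapt by averaging their efforts. It is a simplified illustration of that idea rather than the ANGEL tool itself; the feature choice and project data are invented, and the search for the best feature subset is omitted.

```python
import math

# Hypothetical completed projects: (size in FP, team size) and actual effort in PM.
completed = [
    ((120.0, 3.0), 14.0),
    ((300.0, 5.0), 41.0),
    ((450.0, 8.0), 70.0),
    ((200.0, 4.0), 25.0),
    ((600.0, 10.0), 95.0),
]

def standardise(vectors):
    """Rescale each feature dimension to 0..1 so no single feature dominates the distance."""
    dims = range(len(vectors[0]))
    lows = [min(v[d] for v in vectors) for d in dims]
    highs = [max(v[d] for v in vectors) for d in dims]
    return [tuple((v[d] - lows[d]) / (highs[d] - lows[d]) for d in dims) for v in vectors]

def estimate_by_analogy(new_project, cases, k=2):
    """Analogy-style estimate: mean effort of the k nearest cases in standardised
    Euclidean distance (with k = 1 this reduces to nearest neighbour)."""
    scaled = standardise([features for features, _ in cases] + [new_project])
    target = scaled[-1]
    distances = sorted(
        (math.dist(scaled[i], target), effort) for i, (_, effort) in enumerate(cases)
    )
    nearest = distances[:k]
    return sum(effort for _, effort in nearest) / k  # unweighted mean of the k analogues

print(estimate_by_analogy((350.0, 6.0), completed, k=2))
```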
Slide 58: 4. Assessing Estimation Systems
- accuracy
- tolerance of measurement error
- explanatory power
- ease of use
- availability of inputs
- ...

Slide 59: Assessing Model Performance
- Absolute error
- Percentage error and mean percentage error
- Magnitude of relative error and mean magnitude of relative error (MMRE)
- PRED(n)
- Sum of the squares of the residuals (SSR)
- ...
(A small worked sketch combining several of these measures appears at the end of this section.)

Slide 60: Absolute Error
Absolute error = E_pred - E_act
But it fails to take into account the size of the project: a 6 PM error is serious if the prediction is only 3 PM, yet a 6 PM error on a 3,000 PM project is a triumph.

Slide 61: Percentage Error
Percentage error = (E_pred - E_act) / E_act
or, for more than one estimate, the mean percentage error:
MPE = (1/n) * sum over i of (E_pred,i - E_act,i) / E_act,i
where n is the number of estimates.
- Reveals any systematic bias in a predictive model, e.g. if the model always over-estimates then the percentage error will be positive.
- A weakness is that it will mask compensating errors.

Slide 62: MMRE
MMRE is defined as:
MMRE = (1/n) * sum over i of |E_pred,i - E_act,i| / E_act,i
- Masks any systematic bias but highlights overall accuracy.
- Penalises regression-derived models based on least squares algorithms.

Slide 63: PRED(n)
- Conte et al. suggest ≤ 25% as an indicator of an acceptable prediction model.
- PRED(25) measures the percentage of predictions that lie within 25% of the actual values.
- PRED(25) ≥ 75% is a typical target (seldom achieved!).

Slide 64: Sum of the Squared Residuals
SSR = sum over i of (E_pred,i - E_act,i)^2
If you are risk averse, SSR penalises large deviations more than small ones.
Can also compute the mean squared error.

Slide 65: A Comparison Case Study
Statistic       LSR    Robust   Median
R-squared       0.28   0.25     0.26
MMRE            0.78   0.62     0.62
Pred(25)        45%    35%      35%
Balanced MMRE   0.84   0.78     0.77

Slide 66: So What's Going On?
The i-th residual is ŷ_i - y_i. Consider:
- central tendency (mean, median)
- spread (variance, kurtosis and skewness)
M. J. Shepperd, M. H. Cartwright, and G. F. Kadoda, "On building prediction systems for software engineers," Empirical Software Engineering, vol. 5, pp. 175-182, 2000.

Slide 67: Estimation Objectives
Objective          Indicator               Type
Risk averse        sum of squares          spread
Error minimising   median absolute error   spread
Portfolio          total error             centre
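To make the definitions in this section concrete, the sketch below computes the mean percentage error, MMRE, PRED(25) and SSR for a small set of paired predictions and actuals. It simply implements the formulas above; the effort figures are invented.

```python
def accuracy_metrics(predicted, actual, pred_level=0.25):
    """Compute MPE, MMRE, PRED(pred_level) and SSR for paired predictions and actuals."""
    n = len(predicted)
    errors = [p - a for p, a in zip(predicted, actual)]
    relative = [(p - a) / a for p, a in zip(predicted, actual)]
    mpe = sum(relative) / n                                 # signed: reveals bias
    mmre = sum(abs(r) for r in relative) / n                # unsigned: overall accuracy
    pred = sum(1 for r in relative if abs(r) <= pred_level) / n
    ssr = sum(e * e for e in errors)                        # penalises large deviations
    return mpe, mmre, pred, ssr

# Hypothetical effort predictions and actuals in person-months.
predicted = [120.0, 45.0, 300.0, 80.0]
actual = [100.0, 60.0, 280.0, 75.0]

mpe, mmre, pred25, ssr = accuracy_metrics(predicted, actual)
print(f"MPE = {mpe:.2f}, MMRE = {mmre:.2f}, PRED(25) = {pred25:.0%}, SSR = {ssr:.0f}")
```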
Slide 68: 5. Summary
- Accuracy is a non-trivial concept.
- There is no 'best' technique.
- Algorithmic models need to be calibrated.
- Simple linear models can be surprisingly effective.
- ANNs need large, though not necessarily homogeneous, training sets.
- There is evidence to suggest that CBR is often the most accurate and most robust technique.

Slide 69: Some Estimation Guidelines
- Collect data.
- Use more than one estimating technique.
- Minimise the number of cost drivers / coefficients in a model to facilitate calibration: use smaller, more homogeneous data sets, and look for simple solutions first.
- Exploit any local structure or standardisation.
- Remember that an estimate is a probabilistic statement (bounds?).
- Provide feedback for estimators.

Slide 70: Future Avenues
- A great need for useful prediction systems
- Consider the nature of the prediction problem
- Combining prediction systems
- Collaboration with experts
- Managing with little or no systematic data

Slide 71: Experts plus ... ?
- An experiment by Myrtveit and Stensrud using project managers at Andersen Consulting.
- Subjects were asked to make predictions.
- Found that expert + tool was significantly better than either the expert or the tool alone.
- Which types of estimation system are easiest to collaborate with?
I. Myrtveit and E. Stensrud, "A controlled experiment to assess the benefits of estimating with analogy and regression models," IEEE Transactions on Software Engineering, vol. 25, pp. 510-525, 1999.