Building Models from Your Software Data
Brad Clark, Ph.D.
Software Metrics, Inc.
16th International Forum on COCOMO and Software Cost Modeling
Los Angeles, CA, October 23-26, 2001

Agenda
– 1:00 - 2:30 PM  Tutorial
– 2:30 - 3:00 PM  Break
– 3:00 - 4:30 PM  Tutorial conclusion
– Miscellaneous: bathrooms, telephones
– Tutorial format: collaborative participation
  • One person talks at a time
  • Keep discussions to the point
  • No attribution
  • End-of-course evaluation

Gate #1 Directions
(Campus map: from Gate #1 and Parking Structure A, past the Electrical Engineering Building (Hughes), the Salvatori Computer Science Building, and the Gerontology Auditorium, via West 37th Place and McClintock Ave., to CSE on the 3rd floor.)

Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary

The Need for Models
• Models are useful for forecasting, performance analysis, and decision-making
  – The WBS is only narrowly addressed by current estimation models
  – Strength of cause-and-effect relationships
  – Impact of decision-making, e.g., personnel turnover
• Establish data requirements (model parameters)
• Explain assignable causes of variation and their degree of influence
• Used to validate data
  – Poor data definitions and collection consistency
  – Poor processes that produce the data

WBS Help
How is the effort estimated for the rest of these? Software cost estimation models address only part of the work breakdown structure:
  3.1 Program Management: 3.1.1 Planning & Mgt, 3.1.2 Program Control, 3.1.3 Contract Management, 3.1.4 Contractor Laboratory
  3.2 System Engineering: 3.2.1 Sys Req'ts, 3.2.2 Design and Integration, 3.2.4 Sup. & Maintainability Eng., 3.2.5 QA, 3.2.6 CM, 3.2.7 Human Factors, 3.2.8 Security
  3.3 HW/SW Design, Development and Production: 3.3.1 HW Design & Dev, 3.3.2 SW Design & Dev, 3.3.3 HW/SW Integration & Checkout
  3.5 Test and Evaluation: 3.5.1 Sys T&E, 3.5.4 Site Acceptance
  3.6 Documentation
  3.7 Support

Decision Impact Analysis
• Do we give the team an incentive to stay, or do we look for new hires?
    Estimated PM = 2.94 * KSLOC * PCON
• Estimated Person Months for a 100 KSLOC project with 3%/yr turnover (PCON = 0.81): 238 PM
• Same project with 12%/yr turnover (PCON = 1.00): 294 PM, a 23.5% increase
• If the burdened labor rate is $10,000/PM, the cost increase is 56 PM * $10,000/PM = $560,000
• PCON effect on PM, relative to 3%/yr turnover:
    Turnover    PCON   Effect on PM
    3% / yr     0.81      0.0%
    6% / yr     0.90    +11.0%
    12% / yr    1.00    +23.5%
    24% / yr    1.12    +38.0%
    48% / yr    1.29    +59.0%
• Why not give everyone a financial incentive to stay? (See the sketch below.)
  PM: Person Months; PCON: Personnel Continuity

Data Validation -1
• Check for internal consistency
• Be suspicious of “perfect” data
• Understand the reason for outliers
• Check data relationships
  – Effort and size
  – Effort and schedule
  – Size and defects
  – Effort and defects

Data Validation -2
What looks suspicious here?
(Chart: Budget, $0 to $140,000, plotted against Actual Hours, 0 to 2,500.)

Tutorial Objectives
• Share data analysis experiences with real data (COCOMO as a thinking aid)
• Show how models created from data are based on the average (or mean) of the data and its spread or variation
• Show how model performance improves with the removal of assignable causes of variation
• Raise awareness of the many sources of variation in software engineering data
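To make the Decision Impact Analysis slide concrete, here is a minimal Python sketch of the same arithmetic. The 2.94 coefficient, the PCON multipliers, the 100 KSLOC size, and the $10,000/PM labor rate come from the slide; the retention-bonus comparison at the end is a hypothetical figure added purely for illustration.

```python
# Effort model from the Decision Impact Analysis slide: PM = 2.94 * KSLOC * PCON,
# where PCON is the COCOMO II Personnel Continuity effort multiplier.
PCON = {"3%/yr": 0.81, "6%/yr": 0.90, "12%/yr": 1.00, "24%/yr": 1.12, "48%/yr": 1.29}

def estimated_pm(ksloc, pcon):
    """Estimated person-months for a project of `ksloc` thousand SLOC."""
    return 2.94 * ksloc * pcon

ksloc = 100            # project size from the slide
labor_rate = 10_000    # burdened $/PM from the slide

baseline = estimated_pm(ksloc, PCON["3%/yr"])
for turnover, multiplier in PCON.items():
    pm = estimated_pm(ksloc, multiplier)
    extra_cost = (pm - baseline) * labor_rate
    print(f"{turnover:>7}: {pm:6.1f} PM, "
          f"+{(pm / baseline - 1) * 100:4.1f}% effort, "
          f"${extra_cost:,.0f} over the 3%/yr case")

# Hypothetical comparison: a $3,000 retention bonus for a 20-person team costs
# $60,000 -- far less than the ~$560,000 penalty of moving to 12%/yr turnover.
bonus_cost = 3_000 * 20
print(f"Retention incentive (hypothetical): ${bonus_cost:,}")
```

Running it reproduces the slide's 238 PM versus 294 PM comparison and makes the "why not pay people to stay?" question a simple cost trade-off.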
What Will We Do?
• Using supplied data, we will build simple models
  – Mean or median
  – One-variable regression models
  – Stratifying data
• Two sets of data
  – The first set will be used to learn a technique
  – The second set will be used to practice the technique
• Intent is to show how to create small models by example

What You Will Walk Away With
• A new skill: using Excel to look at data
  – Data summaries
  – Graphing data
  – Simple regression models
• An understanding of what is behind the numbers produced by models, a.k.a. understanding variation
  – An intelligent consumer of data (which you can practice during this conference’s presentations)
  – A responsible data reporter
• Understanding model parameters and their impact on explaining variation

About the Instructor and SMI
• Brad Clark
  – Former Navy pilot
  – Worked in civil service for 10 years
  – Attended USC Graduate School, 1992-1997
    • Development of the COCOMO II model
    • Process maturity effects on effort
  – Started consulting in 1998 in using measurement to manage software projects
• Software Metrics, Inc. (SMI)
  – Very small, private consulting company located in Haymarket, Va.
  – Started in 1983 by John and Betsy Bailey
  – Focus: using software measurement to manage software projects: estimation, feasibility analysis, performance

About You
• What is your name?
• Where do you work?
• Do you have any experience with statistics or empirical modeling?

Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary

What Is a Model?
• A model is a representation of the essential structure of some object or event in the real world.
  – Physical (airplane, building, bridge)
  – Symbolic (language, computer program, mathematical equation)
• Two major characteristics of models
  – Models are necessarily incomplete
  – Models may be changed or manipulated with relative ease
• No model includes every aspect of the real world
  – Building models necessarily involves simplifying assumptions
  – It is critical that the assumptions made when constructing models be understood and be reasonable.
Source: Introductory Statistics: Concepts, Models, and Applications by David Stockburger

Using Data to Estimate Effort
Effort Consumption = 11.9 Person Hours / Function Point. What does this mean? (See the sketch below.)
    Function Points * Effort Consumption = Estimated Effort | Actual Effort
          74        *        11.9        =       880.6      |       165
         308        *        11.9        =     3,665.2      |    14,080
         425        *        11.9        =     5,057.5      |     3,602
Yikes!

First Model: Sample Mean (est. X)
(Figure: a normal distribution showing the population spread about the mean X, with 68%, 95%, and 99.7% of values within ±1, ±2, and ±3 standard deviations, compared with a t-distribution showing the sample spread about the estimated mean. For the PHr/FP data, the estimated mean is 11.9 with a 90% confidence interval from 3.6 to 20.3.)

Necessary and Sufficient Information
• What additional information do we want to know about the stated relationship to make it more accurate?
    Effort Consumption = 11.9 Person Hours / Function Point
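As a quick check on the Using Data to Estimate Effort slide above, this minimal Python sketch applies the 11.9 PHr/FP consumption rate to the three sample projects and compares the results with the actuals. The function-point and actual-effort values come from the slide; everything else is plain arithmetic.

```python
# Single-rate effort model: Estimated PHr = Function Points * 11.9 PHr/FP.
CONSUMPTION = 11.9  # person-hours per function point, from the slide

# (function_points, actual_person_hours) for the three projects on the slide
projects = [(74, 165), (308, 14_080), (425, 3_602)]

for fp, actual in projects:
    estimate = fp * CONSUMPTION
    error = (estimate - actual) / actual * 100
    print(f"FP={fp:4d}  estimate={estimate:8.1f} PHr  "
          f"actual={actual:6d} PHr  error={error:+7.1f}%")

# The huge errors ("Yikes!") are why a single average rate, with no
# understanding of its variation, is a poor model on its own.
```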
Data Analysis: PHr/FP
(Figure: frequency histogram of PHr/FP for the nine projects, with most values between 0 and 10 and outliers near 18 and 46.)

  Confidence intervals on the mean PHr/FP of 11.9:
    80% CI:  5.7 to 18.2
    90% CI:  3.6 to 20.3
    95% CI:  1.6 to 22.3
  The confidence interval can be “tightened” by removing assignable causes of variation.

  PN    FP     PHrs    PHr/FP
   1     40      300     7.50
   2    931    6,400     6.87
   3    425    3,602     8.48
   4    181    1,550     8.56
   5    308   14,080    45.71
   6    163    1,090     6.69
   7     74      165     2.23
   8    333    1,070     3.21
   9    241    4,350    18.05

  Descriptive statistics for PHr/FP:
    Mean                       11.92
    Standard Error              4.48
    Median                      7.50
    Standard Deviation         13.44
    Range                      43.48
    Minimum                     2.23
    Maximum                    45.71
    Confidence Level (90.0%)    8.33

Reducing the Confidence Interval
• Some assignable causes of variation among project data points
  – Noisy data (size and effort)
  – Complexity of the software (effort)
  – Amount of required testing (effort)
  – Building components for reuse (effort)
  – Changes in requirements (size)
  – Required reliability and safety features (size)
  – Interoperability (effort and size)
  – Development / maintenance team experience (effort)
  – Turnover of key people (effort)

Measurement Specifications
• Staff turnover specification example
  – Typical data items
    • Number of personnel
    • Number of personnel gained (per period)
    • Number of personnel lost (per period)
  – Typical attributes
    • Experience factor
    • Organization
  – Typical aggregation structure
    • Activity
  – Typically collected for each
    • Project
  – Count actuals based on
    • Financial reporting criteria
    • Organization restructuring or new organizational chart
Source: Practical Software Measurement: Objective Information for Decision Makers by McGarry et al.

Models Depend on Solid Data
• Models are created from data; models are only as good as the data used to create them
  – life-cycle phase
  – overtime to get work done
  – experience
  – tools
  – complexity
  – reuse
• Data used to create models must be well specified

Accounting for Requirements Volatility
Assignable cause of variation: adjust the size for the effects of requirements volatility (REVL):
    Adj_FP = FP * (1 + REVL%)

  PN    FP    REVL%    Adj_FP     PHrs    PHr/Adj_FP
   1     40      0       40.00      300       7.50
   2    931     50    1,396.50    6,400       4.58
   3    425     30      552.50    3,602       6.52
   4    181     10      199.10    1,550       7.79
   5    308    100      616.00   14,080      22.86
   6    163      1      164.63    1,090       6.62
   7     74      9       80.66      165       2.05
   8    333     10      366.30    1,070       2.92
   9    241     60      385.60    4,350      11.28

  Descriptive statistics for PHr/Adj_FP:
    Mean                        8.01
    Standard Error              2.07
    Median                      6.62
    Standard Deviation          6.21
    Range                      20.81
    Minimum                     2.05
    Maximum                    22.86
    Confidence Level (90.0%)    3.85
(Figure: frequency histogram of PHr/Adj_FP; the 90% confidence interval narrows to 4.2 - 11.9 around a mean of 8.0.)

Variation: Staff Turnover
Impact of Personnel Continuity (PCON) on effort. This factor captures the turmoil caused by the project losing key, lead personnel. The loss of key personnel leads to extra effort as new people join the project and have to spend time coming up to speed on what has to be done. The rating scale is in terms of the project’s personnel turnover normalized to a year.

  Rating level   Turnover     Effort multiplier
  Very Low       48% / yr         1.29
  Low            24% / yr         1.12
  Nominal        12% / yr         1.00
  High            6% / yr         0.90
  Very High       3% / yr         0.81
  Effect on effort moving between adjacent ratings: +15%, +12%, -11%, -11%

Source: Software Cost Estimation with COCOMO II by Barry Boehm et al.
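The descriptive statistics above can be reproduced outside Excel. This is a minimal Python sketch, using only the standard library, that computes the mean, standard error, and 90% confidence interval for PHr/FP and for the REVL-adjusted PHr/Adj_FP from the nine-project table; the t critical value is hard-coded for n = 9 rather than looked up.

```python
import math
from statistics import mean, stdev

# Nine-project dataset from the tables above: (FP, REVL%, person-hours).
projects = [
    (40, 0, 300), (931, 50, 6_400), (425, 30, 3_602),
    (181, 10, 1_550), (308, 100, 14_080), (163, 1, 1_090),
    (74, 9, 165), (333, 10, 1_070), (241, 60, 4_350),
]

T_90_DF8 = 1.860  # two-sided 90% t critical value for 9 - 1 = 8 degrees of freedom

def summarize(label, values):
    """Print mean, standard deviation, and the 90% confidence interval,
    as Excel's Descriptive Statistics tool does."""
    m, sd = mean(values), stdev(values)
    half = T_90_DF8 * sd / math.sqrt(len(values))
    print(f"{label:12s} mean={m:6.2f}  std dev={sd:6.2f}  "
          f"90% CI = {m - half:5.2f} to {m + half:5.2f}")

# Raw productivity: person-hours per function point.
summarize("PHr/FP", [phr / fp for fp, _, phr in projects])

# Remove one assignable cause of variation: requirements volatility (REVL).
summarize("PHr/Adj_FP", [phr / (fp * (1 + revl / 100)) for fp, revl, phr in projects])
```

Running it reproduces the slide numbers: the 90% interval shrinks from roughly 3.6 to 20.3 down to 4.2 to 11.9 once requirements volatility is accounted for.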
Accounting for Staff Turnover
Assignable cause of variation: adjust the effort for the effects of Personnel Continuity (PCON):
    Adj_PHr = PHr / PCON

  PN     Adj_FP      PHrs    PCON    Adj_PHrs    Adj_PHr/Adj_FP
   1      40.00        300    0.90      333.33        8.33
   2   1,396.50      6,400    0.81    7,901.23        5.66
   3     552.50      3,602    0.81    4,446.91        8.05
   4     199.10      1,550    0.81    1,913.58        9.61
   5     616.00     14,080    1.29   10,914.73       17.72
   6     164.63      1,090    1.00    1,090.00        6.62
   7      80.66        165    0.81      203.70        2.53
   8     366.30      1,070    0.81    1,320.99        3.61
   9     385.60      4,350    1.29    3,372.09        8.75

  Descriptive statistics for Adj_PHr/Adj_FP:
    Mean                        7.87
    Standard Error              1.46
    Median                      8.05
    Standard Deviation          4.39
    Range                      15.19
    Minimum                     2.53
    Maximum                    17.72
    Confidence Level (90.0%)    2.72
(Figure: frequency histogram of Adj_PHr/Adj_FP; the 90% confidence interval narrows further to 5.2 - 10.6 around a mean of 7.9.)

COCOMO Suite
• Attempts to identify and quantify assignable causes of variation (drivers)
    Model      Purpose
    COCOMO     Custom cost and schedule estimation
    COCOTS     COTS-based systems cost estimation
    COQUALMO   Defect introduction and removal
    CORADMO    Rapid application development cost and schedule estimation
    COPSEMO    Staged schedule and effort model
    COPROMO    Productivity improvement model
    COSYSMO    System engineering cost and schedule estimation

Second Model: Linear Regression Analysis (my favorite!)
• Statistical regression fits a line through the points, minimizing the least-squares error between the points and the line
• The regression analysis yields a line with a slope, M, and intercept, A:  Y = A + MX + e
• The goodness of fit is given by a statistic called R^2; the closer to 1.0, the better the fit
(Figure: scatter plot of PHr against Adj_FP with the fitted line.)
    est_PHr = 1061.3 + 6.0645 * Adj_FP,  R^2 = 0.3232

Starting Point: Scatter Plot - Model Boundaries
(Figure: scatter plot of KSLOC, 0 to 140, against Unadjusted Function Points, 0 to 2,000, with the fitted line Y = 11.3 + 0.0651 X + e.)
• There is variation (a random component) associated with each model coefficient
Source: Albrecht and Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Transactions on Software Engineering, Vol. SE-9, No. 6, Nov. 1983.

Regression Analysis Example: Compare Models
(Figures: two plots of each model's estimate of PHr against the actual PHr.)
    Linear model:          est_PHr = -12127 + 6.16 * Adj_FP + 13879 * PCON,   Adj. R^2 = 0.64
    Multiplicative model:  est_PHr = 4.84 * Adj_FP^1.08 * PCON^2.72,          Adj. R^2 = 0.88

Model Accuracy
PRED(L) = X
• Means that the model estimates within L% of the actual values X% of the time
• In other words, how often does the model predict within the desired band of ±L% around the actual?
• Example: PRED(30) = 70 means the model predicts within 30% of the actuals 70% of the time
• Models are necessarily incomplete and are not 100% accurate
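The single-variable regression quoted on the Linear Regression Analysis slide can be checked by hand. Below is a minimal Python sketch of ordinary least squares on the nine (Adj_FP, PHrs) pairs from the earlier tables; only the standard library is used, and the formulas are the textbook ones for slope, intercept, and R^2.

```python
from statistics import mean

# (Adj_FP, PHrs) for the nine projects, from the requirements-volatility table.
adj_fp = [40.00, 1396.50, 552.50, 199.10, 616.00, 164.63, 80.66, 366.30, 385.60]
phrs   = [300, 6_400, 3_602, 1_550, 14_080, 1_090, 165, 1_070, 4_350]

def least_squares(x, y):
    """Fit y = a + m*x by ordinary least squares; return (a, m, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    m = sxy / sxx
    a = y_bar - m * x_bar
    ss_res = sum((yi - (a + m * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return a, m, 1 - ss_res / ss_tot

a, m, r2 = least_squares(adj_fp, phrs)
print(f"est_PHr = {a:.1f} + {m:.4f} * Adj_FP   (R^2 = {r2:.4f})")
# The slide quotes est_PHr = 1061.3 + 6.0645 * Adj_FP with R^2 = 0.3232.
```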
Model Evaluation
  PN    Actual PHrs   PHr/FP Model   Adj_PHr/Adj_FP Model   Linear Model   Multiplicative Model
   1         300          476.80           314.80              612.26            200.76
   2       6,400       11,097.52         7,326.97            7,703.17          7,026.10
   3       3,602        5,066.00         3,344.75            2,512.57          2,570.26
   4       1,550        2,157.52         1,424.47              339.16            849.69
   5      14,080        3,671.36         2,423.96            9,565.01         10,274.27
   6       1,090        1,942.96         1,282.81            2,764.17          1,227.46
   7         165          882.08           582.38             -389.25            318.92
   8       1,070        3,969.36         2,620.71            1,367.44          1,645.88
   9       4,350        2,872.72         1,896.67            8,148.05          6,181.82
  PRED(30)                  0.0             0.55                0.33              0.44
Which model would you choose?

Summary -1: Two Models
(Figure: the mean-based model, a distribution of points spread about the mean X; and the one-variable regression model, a line Y = A + MX + e fitted through a scatter of points.)

Summary -2
• Definition of a model
• Data specifications
• Normal versus t distribution
• Model characteristics
  – Model usage
  – Model boundaries
  – Confidence interval
  – Model accuracy
  – Assignable causes of variation
• Large model examples

Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary

Modeling Steps
1. Decide what relationship you would like to investigate
• What do you want to know?
  – Estimation of requirements volatility
  – Establishing thresholds for performance monitoring
  – Working overtime’s effect on personnel turnover
  – Estimation of the number of defects to be found before Final Acceptance Test
• Candidate measures: people, cost, build duration, code size, function points, requirements, defects, design units, test cases, rework, change requests, requirements evolution, process maturity, documentation

2. Identify assignable causes of variation (drivers)
• Use your experience and intuition
• Possible sources of variation:
  – Customer participation            – Development team experience
  – Application domain experience     – Complexity of application
  – Development flexibility           – Design constraints
  – Requirements volatility           – Adaptation of existing code
  – Programming language experience   – Use of modern methodologies
  – Compression of schedule           – Use of software tools
  – Code inspections                  – Management capability
  – Team size                         – Application size
  – Personnel turnover                – Architecture & risk resolution

3. Collect data
• Specify data to be collected based on:
  – assignable causes of variation
  – what is available
• Select 10 projects to go back and collect extra data from
  – Based on project applicability
  – Use measurement specifications as a checklist for each data item
• How much project data is enough?

4. Normalize data and check for consistency
• In some cases it may be appropriate to normalize (a short normalization sketch follows step 5 below):
  – normalize data to a “per unit” measure
    • size
    • defects
    • calendar days, weeks, months
    • effort hours, effort days, effort months
  – normalize about the mean of the data to get the percentage increase or decrease from the mean
• Plot data
  – Check that known relationships exist
  – Detect outliers and investigate
  – Scatter plots are very useful

5. Build model and evaluate
• Models should be:
  – Simple
  – Explainable
  – Analyzable
  – Most important: they should make sense!
• Models make data relationships explicit
  – They show the strength and direction of relationships
  – While a relationship may exist, it may not be valid or make sense
  – The relationships you want to use in modeling are ones that show a valid “cause and effect”
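As promised under step 4, here is a minimal Python sketch of the two normalizations mentioned there: converting raw counts to per-unit measures and expressing each project as a percentage deviation from the mean. The field names and sample values are hypothetical.

```python
from statistics import mean

# Hypothetical project records: raw effort hours, size in KSLOC, and defects.
projects = [
    {"name": "P1", "effort_hours": 3_200, "ksloc": 24, "defects": 110},
    {"name": "P2", "effort_hours": 9_800, "ksloc": 61, "defects": 420},
    {"name": "P3", "effort_hours": 1_500, "ksloc": 9,  "defects": 35},
]

# 1. Normalize to "per unit" measures so projects of different sizes are comparable.
for p in projects:
    p["hours_per_ksloc"] = p["effort_hours"] / p["ksloc"]
    p["defects_per_ksloc"] = p["defects"] / p["ksloc"]

# 2. Normalize about the mean: percentage increase or decrease from the mean.
avg = mean(p["hours_per_ksloc"] for p in projects)
for p in projects:
    pct_from_mean = (p["hours_per_ksloc"] - avg) / avg * 100
    print(f"{p['name']}: {p['hours_per_ksloc']:6.1f} hrs/KSLOC, "
          f"{p['defects_per_ksloc']:5.1f} defects/KSLOC, "
          f"{pct_from_mean:+6.1f}% vs. mean")
```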
6. Add or remove drivers
• Drivers are data attributes that explain (or drive) variation.
• The more drivers used in a model, the more data that must be collected.
• While a driver may make sense to use in explaining variation, the data may not support this conclusion
  – Collect more data; the current dataset may be biased and not represent a true sample
  – There may be drivers that are correlated; this could mask the effects of the weaker-performing driver
• Warning: correlation effects between drivers

7. Repeat steps 3 to 7
• If the model does not have an acceptable accuracy, then:
  – collect more data
  – analyze it for its influence on variation
  – add and remove cost drivers
  – evaluate the model

8. Pilot model
• Document the model and create a tool for its use
• The model must be piloted to test its reasonableness, understandability, and accuracy
  – Collect actual values of model inputs (including assignable causes of variation)
• The model should be used with its confidence interval
• Feedback should be incorporated into the model and the tool

Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary

Exercise 1: Growth Model
• Modeling step #1: What do we want to investigate?
  – We are going to develop a growth model based on real data in a report from the NASA Software Engineering Laboratory*
  – A growth model increases size based on “other information” (see the sketch below)
  – It will be used in estimating cost and schedule for future software projects
• When will the model be used?
  – What information will be available at the time?
• What will be the scope of the model?
  – What will be included or excluded in the estimate?
* Cost and Schedule Estimation Study Report, SEL-93-001
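To make the purpose of a growth model concrete before digging into the data, here is a minimal sketch of how such a model would be applied once built. The 0.21 mean growth and 0.08 confidence half-width echo the TS-project mean model built later in Exercise 1; treat them here as placeholders for whatever your own data produces.

```python
# Applying a growth model: an early size estimate is grown to a more realistic
# final size, with bounds taken from the model's confidence interval.
MEAN_GROWTH = 0.21     # placeholder mean growth fraction
CI_HALF_WIDTH = 0.08   # placeholder 80% confidence half-width on the mean

def grown_size(estimated_sloc):
    """Return (low, expected, high) final size after applying the growth model."""
    return (
        estimated_sloc * (1 + MEAN_GROWTH - CI_HALF_WIDTH),
        estimated_sloc * (1 + MEAN_GROWTH),
        estimated_sloc * (1 + MEAN_GROWTH + CI_HALF_WIDTH),
    )

low, expected, high = grown_size(50_000)   # hypothetical 50 KSLOC early estimate
print(f"Expected final size: {expected:,.0f} SLOC ({low:,.0f} to {high:,.0f})")
```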
Exercise 1: Assignable Causes of Variation
• Modeling step #2: What are possible causes (that can be controlled) of growth?
  ?  ?  ?  ?

Exercise 1: Survey the Data
• Modeling step #3: Collect data
• Using Microsoft Excel, open the file with the NASA SEL data.
• Select the data definitions worksheet
  – Project type
  – Programming language
  – Duration
  – Effort for management and technical work
  – Estimated SLOC size
  – Actual SLOC size
  – New SLOC size
  – Growth (derived)
  – Reuse (derived)

Exercise 1: Plot the Data
• Modeling step #4: Normalize data and check for consistency
  – Copy the Data worksheet and name it “Scatter Plots”
• Create a scatter plot of each of the following data elements against Growth%
  – Project Type
  – Programming Language
  – Duration
  – Effort
  – Estimated SLOC
  – New SLOC
  – Reuse%

Exercise 1: Check for Correlation
• Check for correlation of the data elements with Growth%
  – Excel: Tools -> Data Analysis -> Correlation (New Worksheet Ply: “Correlation”; this will create a new worksheet)

              TypeN   LangN  Duration  Effort  SLOC_Est  SLOC_Act  SLOC_New  Growth  Reuse%
  TypeN       1.000
  LangN      -0.507   1.000
  Duration    0.694  -0.274   1.000
  Effort      0.307  -0.392   0.681    1.000
  SLOC_Est    0.248  -0.348   0.346    0.626   1.000
  SLOC_Act    0.337  -0.412   0.524    0.785   0.919     1.000
  SLOC_New    0.344  -0.363   0.724    0.972   0.566     0.758     1.000
  Growth      0.041  -0.360   0.085   -0.001  -0.359    -0.095     0.138     1.000
  Reuse%     -0.331   0.379  -0.623   -0.561   0.011    -0.164    -0.673    -0.472   1.000

• Compare the correlation numbers to the scatter plots
  – What can you conclude?

Exercise 1: Create Mean-Based Models
• Modeling step #5: Build model and evaluate
  – Copy the Data worksheet and name it “Project-Models”
• Which relationships shown in the scatter plots looked most promising?
• Based on the intended model’s purpose, what data would be realistically available?

Exercise 1: Create Project Type Mean-Based Models
• Build 3 mean models based on “Project Type” (a scripted version of these descriptive statistics follows the next slide)
  – Sort data by project type
    • 1 - TS
    • 2 - AGSS
    • 3 - DS
  – Excel: Tools -> Data Analysis -> Descriptive Statistics
  – Input Range: the Growth% values for one project type at a time
  – Output Range: swipe two empty columns
  – Check Summary Statistics
  – Check Confidence Interval: 80%
• Describe each model
  – Mean, standard deviation, min - max values, number of data points, 80% confidence interval

Exercise 1: TS Mean-Model
  TS Projects - Growth
    Mean                       0.21   (sum of the Growth values / number of projects)
    Standard Error             0.05   (standard deviation / SQRT(number of projects))
    Median                     0.20   (middle Growth% value)
    Mode                       0.20
    Standard Deviation         0.11   (spread; SQRT(variance))
    Sample Variance            0.01
    Kurtosis                   2.82
    Skewness                   1.49
    Range                      0.30
    Minimum                    0.10   (smallest TS project Growth%)
    Maximum                    0.40   (largest TS project Growth%)
    Sum                        1.05
    Count                      5.00
    Confidence Level (80.0%)   0.08   (half-width of the interval within which the real population mean lies)
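For readers who prefer scripting to Excel's Descriptive Statistics tool, here is a minimal Python sketch of the same summary for one stratum. The Growth% values are not reprinted in this handout, so `ts_growth` is a placeholder list, and the t critical value is hard-coded for five data points.

```python
import math
from statistics import mean, median, stdev

def describe(values, t_crit):
    """Excel-style descriptive statistics with an 80% confidence half-width.
    `t_crit` is the two-sided 80% t critical value for len(values) - 1 degrees
    of freedom (1.533 for five data points)."""
    n = len(values)
    sd = stdev(values)
    se = sd / math.sqrt(n)
    return {
        "Mean": mean(values),
        "Standard Error": se,
        "Median": median(values),
        "Standard Deviation": sd,
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Count": n,
        "Confidence Level (80.0%)": t_crit * se,
    }

# Placeholder: substitute the Growth% values of the five TS projects from the
# NASA SEL worksheet here.
ts_growth = [0.10, 0.20, 0.20, 0.25, 0.30]

for name, value in describe(ts_growth, t_crit=1.533).items():
    print(f"{name:26s} {value:6.2f}")
```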
Exercise 2: Create Reuse% Mean-Based Model
• Build a mean-based model for Reuse%
  – Looking at the scatter plot, how can this data be stratified?
  – Copy the Data worksheet and name it “Reuse-Model”
  – Sort the data by Reuse%
  – Use Descriptive Statistics to build two Reuse% models based on the stratified data
    • Excel: Tools -> Data Analysis -> Descriptive Statistics (80% Confidence Interval)
  – Describe each model
    • Mean, standard deviation, min - max values, number of data points, 80% confidence interval

Mean-Based Modeling Conclusions
• Correlation analysis versus scatter plots
• Variation in the data
  – When to use the mean versus the median
  – Stratifying or categorizing data
  – Determines (in part) the confidence interval
• The number of data points is important
• Minimum and maximum values set the model boundaries
• The mean is a model that describes the data “on average”
• The standard deviation is a model that describes distances “in general”

Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary

Linear Regression Models
• Statistical regression fits a line through the points, minimizing the least-squares error between the points and the line
• The regression analysis yields a line with a slope, M, and intercept, A:  Y = A + MX
• The goodness of fit is given by a statistic called R^2; the closer to 1.0, the better the fit
(Figure: scatter plot of Unadjusted Function Points, 0 to 2,500, against COBOL KSLOC, 0 to 350, with the fitted line y = 6.1365x + 206.12, R^2 = 0.7286.)

Two Types of Regression Models
• Additive (linear):   Y = A + M*X + e
• Multiplicative:      Y = A * X^M * e
• The regression technique requires a linear form
  – It works for the first model form
  – It does not work for the second model form
• Non-linear models must be transformed into a linear form

Transforming Non-Linear Models
• Log-log transformation:
    Y = A * X^B
    ln(Y) = ln(A) + B * ln(X)
• Reversing the log-log transformation (where a is the fitted intercept in log space):
    e^ln(Y) = e^(a + B * ln(X))
    Y = e^a * X^B
    A = e^a, so Y = A * X^B
(A sketch of this transform-fit-reverse procedure follows the next slide.)

Exercise 3: Create Additive Duration Regression Model
• Modeling step #5: Build model and evaluate
  – Copy the Data worksheet and name it “Duration-Model”
• Examine the Duration relationship to Growth% in the scatter plots
• Based on the intended model’s purpose, would this data be realistically available?
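Here is a minimal Python sketch of the transform-fit-reverse procedure from the Transforming Non-Linear Models slide: take logs, fit an ordinary least-squares line in log space, then raise e to the intercept to recover the multiplicative form. The example data are made up purely for illustration.

```python
import math
from statistics import mean

def fit_power_law(x, y):
    """Fit Y = A * X**B by linear regression on ln(Y) = ln(A) + B*ln(X)."""
    lx, ly = [math.log(v) for v in x], [math.log(v) for v in y]
    lx_bar, ly_bar = mean(lx), mean(ly)
    b = sum((xi - lx_bar) * (yi - ly_bar) for xi, yi in zip(lx, ly)) / \
        sum((xi - lx_bar) ** 2 for xi in lx)
    intercept = ly_bar - b * lx_bar        # this is ln(A)
    return math.exp(intercept), b          # reverse the transform: A = e**intercept

# Made-up (duration in weeks, growth %) pairs, purely for illustration.
duration = [30, 45, 60, 80, 100, 150]
growth   = [4, 10, 18, 25, 45, 90]

A, B = fit_power_law(duration, growth)
print(f"Growth% = {A:.4f} * Duration^{B:.2f}")
```

Note that log-transformed data cannot contain zeros or negative values, so projects with zero growth have to be handled before transforming.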
Exercise 3: Create Additive Model
• Build a regression model based on Duration
  – Select Tools -> Data Analysis -> Regression
  – Y Input range: Growth% data
  – X Input range: Duration data
  – Select Labels
  – Set the confidence interval to 80%
  – Output range: select 7 columns for the output
  – Select Residuals
  – Select OK
• Describe the model (a sketch of how the coefficient confidence intervals are computed follows the scatter plot below)
  – Intercept (A), slope (M), R^2 of the model, min - max values, number of data points, 80% confidence intervals for A and M
• Create a scatter plot of the data with a trend line

Exercise 3: Additive Duration Model
  Growth% = -40.5 + 0.72 * Duration
  Growth% = -87.3 + 0.30 * Duration   (lower 80%)
  Growth% =   6.2 + 1.14 * Duration   (upper 80%)

  SUMMARY OUTPUT - Regression Statistics
    Multiple R          0.572
    R Square            0.327
    Adjusted R Square   0.266
    Standard Error     29.28
    Observations       13

  ANOVA
                 df        SS         MS        F     Significance F
    Regression    1     4,587.54   4,587.54   5.35        0.041
    Residual     11     9,431.69     857.43
    Total        12    14,019.23

                Coefficients   Standard Error   t Stat   P-value   Lower 80.0%   Upper 80.0%
    Intercept     -40.528          34.293       -1.182    0.262      -87.284        6.228
    Duration        0.718           0.310        2.313    0.041        0.295        1.141

Additive Model Scatter Plot
(Figure: Growth, -20 to 140, plotted against Duration, 0 to 200 weeks, with the trend line y = 0.7177x - 40.528, R^2 = 0.3272.)
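Excel's Lower/Upper 80% columns above can be reproduced with the standard formulas for simple-regression coefficient standard errors. The sketch below is generic, since the raw Duration and Growth% columns are not reprinted here; the t critical value for 13 observations (11 degrees of freedom) is hard-coded.

```python
import math
from statistics import mean

def coefficient_intervals(x, y, t_crit):
    """Simple linear regression y = a + m*x with t-based confidence bounds on
    a and m (t_crit = 1.363 gives two-sided 80% bounds for 11 df)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - m * x_bar
    residual_ss = sum((yi - (a + m * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(residual_ss / (n - 2))          # standard error of the regression
    se_m = s / math.sqrt(sxx)
    se_a = s * math.sqrt(1 / n + x_bar ** 2 / sxx)
    return {
        "intercept": (a, a - t_crit * se_a, a + t_crit * se_a),
        "slope": (m, m - t_crit * se_m, m + t_crit * se_m),
    }

# Usage: pass the Duration and Growth% columns from the NASA SEL worksheet.
# coefficient_intervals(duration_weeks, growth_pct, t_crit=1.363)
```

With the worksheet's 13 (Duration, Growth%) pairs this should reproduce the -87.3/6.2 intercept bounds and 0.30/1.14 slope bounds shown above.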
Exercise 4: Create Multiplicative Duration Regression Model
• Modeling step #5: Build model and evaluate
  – Copy the Data worksheet and name it “Ln-Duration-Model”
• Transform the Growth% and Duration data into log space by taking the logarithm of each column
  – Insert a new column next to Growth% and another next to Duration
  – Label them Ln-Growth% and Ln-Duration
  – In each new column take the logarithm of the column next to it
    • e.g. in cell H2 type =LN(G2); copy this formula into the remaining cells

Exercise 4: Create Multiplicative Model
• Build a regression model based on Ln-Duration
  – Use the same procedure as last time
• Transform the results back into normal space:
    e^ln(Y) = e^(a + B * ln(X))
    Y = e^a * X^B
    A = e^a  <- all we have to do is raise e to the intercept: =EXP(intercept)
• Describe the model
  – Intercept (A), exponent (B), R^2 of the model, min - max values, number of data points, 80% confidence intervals for A and B
• Create a scatter plot of the data with a trend line

Exercise 4: Multiplicative Duration Model
  Growth% = 0.015 * Duration^1.6
  Growth% = 0.0002 * Duration^0.63   (lower 80%)
  Growth% = 1.42 * Duration^2.6      (upper 80%)

  SUMMARY OUTPUT - Regression Statistics
    Multiple R          0.559
    R Square            0.313
    Adjusted R Square   0.250
    Standard Error      0.735
    Observations       13

  ANOVA
                 df      SS       MS       F     Significance F
    Regression    1     2.705    2.705    5.01       0.047
    Residual     11     5.943    0.540
    Total        12     8.648

                   Coefficients   Standard Error   t Stat   P-value   Lower 80.0%   Upper 80.0%
    Intercept        -4.2214          3.3529       -1.259    0.234      -8.7929        0.3501
    ln-Duration       1.6138          0.7212        2.238    0.047       0.6305        2.5970
    exp(Intercept)    0.0147                                             0.00015       1.4192

Multiplicative Model Scatter Plot
(Figure: Growth%, 0 to 140, against Duration in weeks, 0 to 200, with the fitted curve y = 0.0147x^1.6138, R^2 = 0.3128.)

Multiplicative Model Scatter Plot (Log-Log Scale)
(Figure: the same data plotted in log space, where the fit is the straight line y = 1.6138x - 4.2214, R^2 = 0.3128.)

Model Comparison -1
• Using the two models, estimate Growth%
  – In each Duration worksheet, create a new column next to Growth% and Ln-Growth%
  – Label it est. Growth%
  – Using the models created in Exercises 3 and 4, compute the estimated Growth%
    • Additive model:        Growth% = -40.5 + 0.72 * Duration
    • Multiplicative model:  Growth% = 0.015 * Duration^1.6

Model Comparison -2
• Compute the Magnitude of Relative Error (MRE) for each Growth estimate:
  – Create a new column next to the est. Growth% column
  – Label it MRE
  – Compute MRE = |Actual Growth% - Estimated Growth%| / Actual Growth%
• Count the errors that are less than or equal to 30% and divide by the number of data points; this is PRED(30). (A scripted version of this comparison follows the results table below.)

Model Comparison Results
  PN    Actual Growth%   Additive Model   Multiplicative Model
   2          5               0.43              30.15
   3         50               0.66              45.00
   4         30               0.45              31.40
   5         30              41.80              29.73
   6         40              30.28              23.39
   7         80              64.84              44.04
   8        130              51.16              35.29
   9         20              26.68              21.53
  10         20              32.44              24.54
  11         15              18.76              17.64
  12         10              -6.44               7.35
  13         25              20.20              18.33
  14         20              38.92              28.09
  PRED(30)                      31                 62
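Here is a scripted version of the MRE and PRED(30) comparison above, using the actuals and estimates from the results table; it is a minimal Python sketch of the calculation Exercise 4 does in Excel.

```python
# (actual Growth%, additive estimate, multiplicative estimate) from the table above.
rows = [
    (5, 0.43, 30.15), (50, 0.66, 45.00), (30, 0.45, 31.40), (30, 41.80, 29.73),
    (40, 30.28, 23.39), (80, 64.84, 44.04), (130, 51.16, 35.29), (20, 26.68, 21.53),
    (20, 32.44, 24.54), (15, 18.76, 17.64), (10, -6.44, 7.35), (25, 20.20, 18.33),
    (20, 38.92, 28.09),
]

def mre(actual, estimate):
    """Magnitude of Relative Error."""
    return abs(actual - estimate) / actual

def pred(pairs, level=0.30):
    """PRED(L): fraction of estimates whose MRE is at or below `level`."""
    return sum(mre(a, e) <= level for a, e in pairs) / len(pairs)

additive = [(a, add) for a, add, _ in rows]
multiplicative = [(a, mult) for a, _, mult in rows]
print(f"Additive model:       PRED(30) = {pred(additive):.0%}")
print(f"Multiplicative model: PRED(30) = {pred(multiplicative):.0%}")
# Matches the slide: about 31% for the additive model, 62% for the multiplicative one.
```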
Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary

Summary -1
• Modeling steps
  1. Decide what relationship you would like to investigate
  2. Identify assignable causes of variation (drivers)
  3. Collect data
  4. Normalize data and check for consistency
  5. Build model and evaluate
  6. Add or remove drivers
  7. Repeat steps 3 to 7
  8. Pilot model
• Mean-based models
  – Scatter plot versus correlation analysis
  – Stratify data to identify different relationships
  – Mean, standard deviation, min - max values, number of data points, 80% confidence intervals

Summary -2
• Regression-based models
  – Additive: Y = A + M*X
  – Multiplicative: Y = A * X^M
  – Use of logarithms to transform multiplicative into additive
  – Analysis in log space versus linear space
  – Use of PRED(L) as a measure of model performance
• Data defines the model!
  – Data quality
  – Scope of coverage: life-cycle phases
  – Depth of coverage: what is included / excluded in the count
  – Correlation effects among assignable causes of variation
  – Min and max inputs (based on the low and high data points)

Further Information
• Cost and Schedule Estimation Study, NASA Software Engineering Laboratory, SEL-93-002, Nov. 1993
• Introductory Statistics: Concepts, Models, and Applications by David Stockburger, www.atomicdogpublishing.com, 2nd ed., 2001
• Practical Software Measurement: Objective Information for Decision Makers by John McGarry, David Card, Cheryl Jones, Beth Layman, Elizabeth Clark, Joseph Dean, and Fred Hall, Addison-Wesley, 2001
• Software Cost Estimation with COCOMO II by Barry Boehm, Chris Abts, Winsor Brown, Sunita Chulani, Brad Clark, Ellis Horowitz, Ray Madachy, Donald Reifer, and Bert Steece, Prentice Hall PTR, 2000
• Statistics, Data Analysis, and Decision Making by James Evans and David Olson, Prentice Hall, 1999
• Statistical Analysis Simplified by Glen Hoffherr and Robert Reid, McGraw-Hill, 1997

Contact Information
Brad Clark
Software Metrics, Inc.
Washington, D.C. area
(703) 754-0115
Brad@Software-Metrics.com
http://www.software-metrics.com