University of Southern California
Center for Systems and Software Engineering

An Investigation on Domain-Based Effort Distribution

Thomas Tan
26th International Forum on Systems, Software, and COCOMO Cost Modeling
November 2011

Research Overview
• Motivation:
  – Project risks early in the project lifecycle:
    • Lack of project knowledge.
    • Known project attributes may change later on.
  – These risks contribute to the Cone of Uncertainty effect:
    • They lead to extremely unrealistic cost and schedule estimates.
  – However, many estimation methodologies use a one-size-fits-all effort distribution.
  – Application domains:
    • Easy to define for a project.
    • Available early.
    • Relatively stable throughout the project lifecycle.

Figure 1: Cone of Uncertainty for Software Cost and Size Estimation

Table 1: COCOMO II Waterfall Effort Distribution Percentages
  Phase/Activities        Effort %
  Plan and Requirement    7 (2-15)
  Product Design          17
  Detailed Design         27-23
  Code and Unit Test      37-29
  Integration and Test    19-31
  Transition              12 (0-20)

Research Overview
• Goal:
  – Study the impacts of application domains.
  – Apply the effects of application domains to an effort distribution guideline.
• This work is part of a research study to provide better software estimation guidelines:
  – This is ongoing research.
  – Based on data analysis of government projects – the SRDR data.
  – Sponsored by the Air Force Cost Analysis Agency.
Research Overview
• We are investigating whether we can use application domains as illustrated in the following diagram:

Figure 2: Expected Extension using Application Domains
(Size in KSLOC, Application Domain, and Personnel Ratings feed the COCOMO II Model plus an Application Domain Extension for effort distribution, which, with Data Support, produces a Domain-based Effort Distribution Guideline.)

Research Overview
• Research plan:
  – Determine application domains.
    • 22 application domains selected from the US Air Force Cost Estimation Handbook and the MIL-881 standard.
    • Also referencing other studies.
  – Normalize the SRDR data.
  – Determine effort distribution patterns by application domain.
    • Calculate average effort distribution percentages for each application domain.
    • Hoping that effort distribution patterns are different:
      – Prove that the differences between domains, and between each domain and the COCOMO II model, are statistically significant.
  – Study how system size and personnel ratings affect the effort distribution patterns for different domains.
  – Establish an effort distribution guideline based on the findings.
  – Integrate this guideline with the COCOMO II model.

Effort Distribution Definitions
• Matching the research data, we will only investigate the following activity groups:
  – Plan & Requirements
  – Architecture & Design
  – Code and Unit Testing
  – Integration
  – Qualification Testing
  Note: we will combine Integration and Qualification Testing to match the COCOMO II Waterfall phases.
• Effort activity group definitions:
  – We will use the SRDR standard activity definitions, which are similar to those of the COCOMO II model.
• Adjusted COCOMO II model distribution averages: divided by 1.07 to ensure the sum of all averages is 100%.

Table 2: Mapping of SRDR Activities to COCOMO II Phases
  COCOMO II Phase                         SRDR Activities
  Plan and requirements                   Software requirements analysis
  Product design and detailed design      Software architecture and detailed design
  Coding and unit testing                 Coding, unit testing
  Integration and qualification testing   Software integration and system/software integration; qualification/acceptance testing

Table 3: COCOMO II Waterfall Effort Distribution Percentages with Adjustment
  Phase/Activities                        Effort %
  Plan and Requirement                    6.5
  Product Architecture & Design           39.3
  Code and Unit Testing                   30.8
  Integration and Qualification Testing   23.4

Data Processing
• Data Set:
  – A collection of project data from the DoD and other government agencies.
  – Data are extracted from SRDR Form 2630-3, the final report filed for each project after the project has been completed.
  – Data include effort, size, and other development parameters (such as language, process, staffing information, etc.).
  – Data issues:
    • Missing data.
    • Untrustworthy data patterns.
    • Duplicated records.
    • Lack of quality measures.
  – Need further data processing before analysis:
    • Eliminate bad records.
    • Normalize the data.
    • Backfill effort data.

Data Processing
• Data Normalization:
  – Evaluate all data records and eliminate those that we are unable to normalize (i.e., missing important fields, no definitions at all, or duplicated records).
  – Eliminate those with odd patterns that are likely the result of bad reporting, e.g., huge effort with little size or vice versa.
  – Deal with missing effort fields (see details in the data backfilling section on the next couple of slides).
  – Transform all data records into the same units of size, effort, and schedule.
    • Calculate equivalent size for all records.
      – Use of DM, CM, and IM.
    • Calculate effort and schedule totals for all records.
  – Calculate personnel ratings.

Data Processing
• Backfilling Data:
  – Rationale:
    • Backfilling is necessary to increase the number of usable data records.
  – Method: Approximative Non-Negative Matrix Factorization:
    • Factorize the subject data set (X) into two matrices: X ≈ W × H.
    • W and H are two random matrices whose dot product approximates the data set X.
    • Iteratively adjust the values in W and H using a simple approximation algorithm (with α and β as the adjustment factors).
    • Exit the iterations when the error margin is smaller than a preset value (usually 0.01 or smaller).
    • Also set a maximum number of iterations to stop the process if it runs too long (usually 5,000 to 10,000).

Data Processing
• Backfilling Data Sets:
  – Missing 2 Set: missing 1 to 2 values from the 5 activity groups.
  – Missing 3 Set: missing 3 values from the 5 activity groups.
  – Missing 4 Set: missing 4 values from the 5 activity groups.
• Developed a matrix factorization program in Matlab™:
  – Exit margin = 0.001.
  – 10,000 iterations.
  – α = 0.0002, β = 0.02.
  – Import and export CSV.
• Calculated error margins between backfilled and original data points:
  – Most backfilled values are within 10% of the original value (many errors are very small).
  – A few show huge differences due to significant discrepancies in data patterns (very small values in one activity and huge values in another).
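The backfilling step described above can be sketched in code. The authors' implementation was in Matlab; the NumPy version below is a hedged reading of the slide's description (random non-negative factors W and H, a simple α/β-regularized gradient update, an exit margin, and an iteration cap). The rank of the factorization and the exact update rule are illustrative assumptions, not the authors' actual algorithm.

```python
import numpy as np

def backfill_nmf(X, rank=2, alpha=0.0002, beta=0.02,
                 max_iter=10_000, tol=0.001, seed=0):
    """Approximate X (rows = records, columns = activity groups) as W @ H,
    fitting only the observed entries, then fill the missing (NaN) cells
    from the reconstruction.  alpha/beta, tol, and max_iter mirror the
    settings quoted on the slide."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)                       # True where a value was reported
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, rank))
    H = rng.random((rank, n_cols))
    for _ in range(max_iter):
        E = np.where(mask, X - W @ H, 0.0)    # error on observed cells only
        # Regularized gradient step; beta penalizes large factor values.
        W += alpha * (2 * E @ H.T - beta * W)
        H += alpha * (2 * W.T @ E - beta * H)
        W = np.clip(W, 0.0, None)             # keep the factors non-negative
        H = np.clip(H, 0.0, None)
        if np.sqrt((E[mask] ** 2).mean()) < tol:   # exit margin reached
            break
    return np.where(mask, X, W @ H)           # keep originals, backfill gaps
```

Because backfilled values come from the low-rank structure learned across all records, a record whose pattern deviates sharply from the rest (tiny effort in one activity, huge effort in another) can be reconstructed poorly, which matches the handful of large errors reported on the next slide.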
Data Processing
• Error comparing original and backfilled data points:
  – Error = (Backfilled – Original) / Original.

Table 4: Backfilled Error from the “Missing 2” Set (all records are from the Mission Planning domain)
            Original                 |           Backfilled             |              Error
  REQ   ARCH   CODE   INT    QT      |  REQ   ARCH   CODE   INT    QT   |  REQ    ARCH      CODE   INT    QT
  68.1  39.6   112.0  25.3   1.8     |  65.4  39.5   109.9  25.0   1.8  |  -4.1%  -0.4%     -1.9%  -1.3%  2.8%
  9.7   0.046  9.8    11.1   5.7     |  9.6   9.4    9.9    10.8   5.7  |  -1.2%  20359.0%  0.5%   -2.0%  -0.6%
  15.8  102.8  208.1  121.9  0.0     |  16.0  101.3  206.2  116.8  13.6 |  1.5%   -1.4%     -0.9%  -4.2%  Inf
  22.1  16.6   167.9  17.5   4.3     |  21.8  16.8   160.8  17.6   4.4  |  -1.6%  1.2%      -4.2%  0.2%   0.8%
  27.6  62.2   364.4  37.9   9.9     |  27.6  61.8   350.3  37.7   9.9  |  0.0%   -0.7%     -3.9%  -0.4%  -0.2%
  6.3   89.1   100.9  37.6   0.0     |  6.4   85.1   100.4  37.0   15.3 |  2.8%   -4.4%     -0.6%  -1.6%  Inf
  0.0   142.9  845.2  55.0   0.0     |  80.4  140.3  823.3  55.2   30.7 |  Inf    -1.8%     -2.6%  0.5%   Inf
  2.2   2.4    14.8   7.9    6.7     |  2.3   2.6    14.4   7.7    6.5  |  1.5%   5.1%      -3.2%  -3.1%  -2.6%
  3.1   4.1    16.9   2.9    0.2     |  3.1   4.1    16.4   2.9    2.1  |  0.1%   0.7%      -3.0%  1.2%   937.2%
  210.2 103.0  113.2  25.4   418.5   |  202.5 103.1  113.3  25.5  396.6 |  -3.7%  0.1%      0.0%   0.3%   -5.2%
  97.8  46.1   89.5   134.3  16.2    |  94.5  45.4   91.2   128.0  16.2 |  -3.4%  -1.5%     1.9%   -4.7%  0.3%
  24.0  20.5   27.4   4.8    0.0     |  23.5  20.0   27.2   5.0   19.2  |  -2.3%  -2.5%     -0.7%  2.4%   Inf
  4.2   10.5   14.8   4.3    5.8     |  4.3   10.0   14.7   4.4   5.8   |  0.8%   -4.7%     -0.5%  2.3%   -0.4%
(“Inf” marks cells whose original value was 0, so the relative error is undefined.)

Application Domain and Effort Distribution
• Records by application domain:

Table 5: Research Data Records Count
  Application Domain              Missing 4  Missing 3  Missing 2  Perfect Set
  Business                             10         6          6          5
  Command & Control                    47        29         29         14
  Communications                       40        38         37         26
  Controls & Displays                  12         7          7          3
  Executive                             5         1          1          1
  Information Assurance                 1         1          1          1
  Infrastructure or Middleware         13         8          8          2
  Maintenance & Diagnostics             1         1          1          1
  Mission Management                   39        38         20         13
  Mission Planning                     20        16         13          9
  None                                  1         1          1          1
  Process Control                       9         4          4          0
  Scientific Systems                    1         1          1          1
  Sensor Control and Processing        18        17         17          5
  Simulation & Modeling                19        17         14          9
  Spacecraft Bus                        1         1          1          1
  Spacecraft Payload                    1         1          1          0
  Test & Evaluation                     2         2          1          1
  Tool & Tool Systems                   5         5          5          2
  Training                              2         2          1          0
  Weapons Delivery and Control         21        21         15          5
  Total                               268       216        184        100

Application Domain and Effort Distribution
• Calculate percentages for each record.
• Calculate average percentages for each domain.

Figure: Effort Distribution by Domains – four plots (one each for the “Perfect”, “Missing 2”, “Missing 3”, and “Missing 4” sets) showing effort % across the activity groups (Requirements, Arch & Design, Code & Unit Test, Integration & QT) for the Business, CC, Comm, MM, MP, Sim, Sensors, and Weapons domains.

Application Domain and Effort Distribution
• From the plots of the calculated averages, we find the following:
  – Plan and requirements efforts are similar across all domains (not widely spread).
  – Notable differences in coding and integration & qualification testing efforts across all domains.
  – Obvious differences in architecture & design efforts across domains in the “Perfect” set results; possible differences in the other data sets.
  – More effort is allocated as the project moves from plan & requirements activities to integration & qualification testing activities.
  – Results from the backfilled sets are similar.
• Although we can see clear differences between domain averages in the resulting plots, we still need statistical proof that the differences are not the result of noise.

Application Domain and Effort Distribution
• Test 1: show whether there is a difference between application domains in terms of effort distribution percentages.
• Test 1: Use a simple ANOVA to test the following:
  – H0: effort distributions are the same across domains.
  – Ha: effort distributions are not all the same across domains.
• Test input is the list of effort percentages grouped by application domain.
• Tests use a 90% confidence level to determine the significance of the results.

Application Domain and Effort Distribution
• Test 1 Results:
  – The following table shows the results from two of the four testing data sets.
  – Results from the “Missing 3” and “Missing 4” data sets are essentially the same as those from “Missing 2”.
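Test 1's ANOVA reduces to comparing between-domain variance against within-domain variance. The sketch below computes the F statistic by hand for three hypothetical domains; the percentages are illustrative, not SRDR data, and in practice the p-value would come from the F distribution with (k-1, n-k) degrees of freedom (e.g., scipy.stats.f.sf).

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square."""
    k = len(groups)                              # number of domains
    n = sum(len(g) for g in groups)              # total records
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical Architecture & Design effort percentages per domain
# (values are made up for illustration only).
arch_pct = [
    [28.1, 31.4, 25.9, 30.2, 27.5],   # e.g. one domain's records
    [35.6, 38.2, 33.9, 36.8, 40.1],   # a second domain
    [22.4, 20.9, 25.3, 23.8, 21.7],   # a third domain
]
F = one_way_anova_F(arch_pct)   # a large F suggests the domain means differ
```

At the 90% confidence level, H0 is rejected when F exceeds the critical value F(0.10; k-1, n-k), equivalently when the p-value is below 0.10, which is how the Reject / Can't Reject entries in the results table are decided.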
  – The results indicate that the domain effect is not significant in Plan & Requirements but is active in all other activity groups (based on the consensus that three of the data sets favor this result).
  – Based on this result, we can say that domains differ in their effort distribution percentages.

Table 6: Test 1 Results
                                             “Perfect” Data Set               “Missing 2” Data Set
  Activity Group                          F        P-Value  Result         F        P-Value  Result
  Plan & Requirements                     0.8817   0.5249   Can’t Reject   1.9889   0.0605   Reject
  Architecture & Design                   3.2783   0.0042   Reject         1.9403   0.0674   Reject
  Code & Unit Testing                     1.5823   0.1531   Can’t Reject   3.0126   0.0056   Reject
  Integration and Qualification Testing   2.3402   0.0319   Reject         4.5838   0.0001   Reject

Application Domain and Effort Distribution
• Test 2: show whether there is a difference between the application domain averages and the COCOMO II effort distribution averages.
• Test 2: Use an independent one-sample t-test to test the following:
  – H0: the domain average is the same as the COCOMO average.
  – Ha: the domain average is not the same as the COCOMO average.
  – Tests run for every domain on every activity group.
• Use the following formula to calculate the T value in order to determine the result of the t-test:
  – t = (x̄ − µ0) / (s / √n), where x̄ is the domain's sample mean, s is the standard deviation, n is the sample size, and µ0 is the COCOMO average we compare against.
• Also use a 90% confidence level to determine the significance of the results.

Application Domain and Effort Distribution
• Test 2 Results:
  – Again, the results from the “Missing 3” and “Missing 4” testing data sets are similar to those of “Missing 2”.
  – Three out of four activity groups show at least 50% of the domains differing from the COCOMO II model: tentatively enough for us to move on at this point.
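The t statistic above is straightforward to compute directly. In this minimal sketch the sample percentages are hypothetical; 6.5% is the adjusted COCOMO II Plan & Requirements average from Table 3, and rejection at the 90% confidence level would compare |t| against the two-sided critical value of the t distribution with n - 1 degrees of freedom.

```python
from math import sqrt

def one_sample_t(sample, mu0):
    """t = (mean - mu0) / (s / sqrt(n)) for a one-sample t-test of a
    domain's mean effort percentage against a COCOMO II average mu0."""
    n = len(sample)
    mean = sum(sample) / n
    s = sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std. dev.
    return (mean - mu0) / (s / sqrt(n))

# Hypothetical Plan & Requirements percentages for one domain, tested
# against the adjusted COCOMO II average of 6.5%.
t = one_sample_t([9.1, 10.4, 8.7, 11.2, 9.8], mu0=6.5)
```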
Table 7: Test 2 Results
• Plan & Requirements (COCOMO average 6.5%):
  – “Perfect” set: all domains reject except Sensor Control and Simulation.
  – “Missing 2” set: all domains reject except Sensor Control.
• Architecture & Design (COCOMO average 39.3%):
  – “Perfect” set: all domains reject except Sensor Control and Simulation.
  – “Missing 2” set: all domains reject except Simulation.
• Code & Unit Testing (COCOMO average 30.8%):
  – “Perfect” set: no domains reject.
  – “Missing 2” set: only the Mission Planning domain rejects.
• Integration and Qualification Testing (COCOMO average 23.4%):
  – “Perfect” set: only the Mission Management and Weapons Delivery domains reject.
  – “Missing 2” set: the Communications, Mission Management, Sensor Control, and Weapons Delivery domains reject; the other four domains do not.

Application Domain and Effort Distribution
• System size effects on effort distribution:
  – The COCOMO II model shows that the effort distribution varies a little as system size grows.
  – For our study, we tried a simple experiment on system size to observe any effects on effort distribution:
    • Tests on the Command & Control and Communications domains.
    • Divide projects into three sizing groups:
      – 0 to 32 KSLOC
      – 33 to 128 KSLOC
      – 129+ KSLOC
    • Plot the results to observe any patterns or trends, in order to see any effects of system size on effort distribution.
  – Need to run a similar experiment on all domains.

Application Domain and Effort Distribution
• Command & Control:
  – Observed a decreasing trend in Requirements and an increasing trend in Architecture: as project size grows, more effort is allocated to design.
  – No clear trend for coding and integration & testing effort.

Figure: Command & Control by Size – effort % by activity group (Req, Arch, Code, Int&QT) for the three sizing groups.

• Communications:
  – Hard to find an obvious trend in any activity group.
  – It seems that none of the sizing groups produces the common distribution: growing from requirements to coding and then decreasing from coding to integration & testing.

Figure: Communications by Size – effort % by activity group (Req, Arch, Code, Int&QT) for the three sizing groups.

Next Steps
• Go deeper with the system size experiment.
• Expand the system size experiment to all subject domains and groups of domains.
• Use a similar approach to study the effect of personnel ratings on effort distribution by application domain.
• Run the study using the Productivity Types (from Brad's research) and compare the results with the current results using domains.
• Combine the results of these studies into a proposal for the domain-based effort distribution guideline.
• Design and integrate the application domain extension into the COCOMO II model.

For more information, contact:
Thomas Tan
thomast@usc.edu
626-617-1128

Questions?