University of Southern California
Center for Systems and Software Engineering
An Investigation on Domain-Based Effort Distribution
Thomas Tan
26th International Forum on Systems, Software, and COCOMO Cost Modeling
November 2011
Research Overview
• Motivation:
  – Project risks early in the project lifecycle:
    • Lack of project knowledge.
    • Known project attributes may change later on.
  – These risks contribute to the Cone of Uncertainty effect:
    • They can lead to extremely unrealistic cost and schedule estimates.
  – However, many estimation methodologies use a one-size-fits-all effort distribution.
  – Application domains are:
    • Easy to define for a project.
    • Available early.
    • Relatively stable throughout the project lifecycle.
Figure 1: Cone of Uncertainty for Software Cost and Size Estimation
Phase/Activities                         Effort %
Plan and Requirement                     7 (2-15)
Product Design                           17
Detailed Design                          27-23
Code and Unit Test                       37-29
Integration and Test                     19-31
Transition                               12 (0-20)

Table 1: COCOMO II Waterfall Effort Distribution Percentages
Research Overview
• Goal:
  – Study the impacts of application domains.
  – Apply the effects of application domains to an effort distribution guideline.
• This work is part of a research study to provide better software estimation guidelines:
  – This is ongoing research.
  – It is based on data analysis of government projects: the SRDR data.
  – It is sponsored by the Air Force Cost Analysis Agency.
Research Overview
• We are investigating whether we can use application domains as illustrated in the following diagram:
[Diagram: Size (KSLOC), Application Domain, and Personnel Ratings feed the COCOMO II Model plus an Application Domain Extension for effort distribution, producing a Domain-based Effort Distribution Guideline backed by data support.]
Figure 2: Expected Extension using Application Domains
Research Overview
• Research plan:
  – Determine application domains.
    • 22 application domains selected from the US Air Force Cost Estimation Handbook and the MIL-881 standard.
    • Also referencing other studies.
  – Normalize the SRDR data.
  – Determine effort distribution patterns by application domain.
    • Calculate average effort distribution percentages for each application domain.
      – Hoping that the effort distribution patterns are different.
    • Prove that the differences between domains, and between each domain and the COCOMO II model, are statistically significant.
  – Study how system size and personnel ratings affect the effort distribution patterns for different domains.
  – Establish an effort distribution guideline based on the findings.
  – Integrate this guideline with the COCOMO II model.
Effort Distribution Definitions
• Matching the research data, we will only investigate the following activity groups:
  – Plan & Requirements
  – Architecture & Design
  – Code and Unit Testing
  – Integration
  – Qualification Testing
  – Note: we will combine Integration and Qualification Testing to match the COCOMO II Waterfall phases.
• Effort activity group definitions:
  – We will use the SRDR standard activity definitions, which are similar to those of the COCOMO II model.
  – Adjusted COCOMO II model distribution averages: divided by 1.07 to ensure the sum of all averages is 100%.
COCOMO II Phase                          SRDR Activities
Plan and requirement                     Software requirements analysis
Product design and detailed design       Software architecture and detailed design
Coding and unit testing                  Coding, unit testing
Integration and qualification testing    Software integration and system/software integration; Qualification/Acceptance testing

Table 2: Mapping of SRDR Activities to COCOMO II Phases
Phase/Activities                         Effort %
Plan and Requirement                     6.5
Product Architecture & Design            39.3
Code and Unit Testing                    30.8
Integration and Qualification Testing    23.4

Table 3: COCOMO II Waterfall Effort Distribution Percentages with adjustment
Data Processing
• Data Set:
  – A collection of project data from the DoD and other government agencies.
  – Data are extracted from SRDR Form 2630-3, the final report delivered for each project after it has been completed.
  – Data include effort, size, and other development parameters (such as language, process, staffing information, etc.).
  – Data issues:
    • Missing data.
    • Untrustworthy data patterns.
    • Duplicated records.
    • Lack of quality measures.
  – The data need further processing before analysis:
    • Eliminate bad records.
    • Normalize the data.
    • Backfill effort data.
Data Processing
• Data Normalization
  – Evaluate all data records and eliminate those that we are unable to normalize (i.e., missing important fields, no definitions at all, or duplicated records).
  – Eliminate records with anomalous patterns that are likely the result of bad reporting, e.g. huge effort with very little size or vice versa.
  – Deal with missing effort fields (see details in the data backfilling section on the next few slides).
  – Transform all data records into the same units of size, effort, and schedule.
    • Calculate equivalent size for all records.
      – Using DM, CM, and IM (a simplified sketch of this calculation follows below).
    • Calculate effort and schedule totals for all records.
      – Calculate personnel ratings.
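As an illustration of the equivalent-size step, below is a minimal Python sketch of a COCOMO II-style adaptation adjustment using DM, CM, and IM. It keeps only the basic adaptation adjustment factor and omits the assessment-and-assimilation and software-understanding terms of the full reuse model; the function and parameter names are illustrative and not taken from the study's tooling.

```python
def equivalent_ksloc(new_ksloc, adapted_ksloc, dm, cm, im):
    """Simplified equivalent-size calculation.

    dm, cm, im: percentage of design modified, code modified, and
    integration effort required for the adapted code (0-100).
    """
    aaf = 0.4 * dm + 0.3 * cm + 0.3 * im        # adaptation adjustment factor
    return new_ksloc + adapted_ksloc * aaf / 100.0

# Example: 20 KSLOC of new code plus 50 KSLOC adapted with DM=10, CM=20, IM=40
size = equivalent_ksloc(20, 50, dm=10, cm=20, im=40)   # -> 31.0 equivalent KSLOC
```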
Data Processing
• Backfilling Data:
  – Rationale:
    • Backfilling is necessary to increase the number of usable data records.
  – Method: approximate Non-Negative Matrix Factorization (see the sketch after this list):
    • Factorize the subject data set X into two matrices: X ≈ W × H.
    • W and H are two random matrices whose dot product approximates the data set X.
    • Iteratively adjust the values in W and H using a simple approximation algorithm (with α, β as the adjustment factors).
    • Exit the iterations when the error margin is smaller than a preset value (usually 0.01 or smaller).
    • Also set a maximum number of iterations to stop the process if it runs too long (usually 5,000 to 10,000).
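Below is a minimal Python sketch of this kind of gradient-based approximate matrix factorization, updating W and H only over the known cells. The rank k, the random seed, and the non-negativity clipping are illustrative assumptions and are not details of the authors' Matlab program.

```python
import numpy as np

def backfill_mf(X, k=2, alpha=0.0002, beta=0.02, max_iter=10000, tol=0.001):
    """Approximate X (records x activity groups, NaN = missing effort) by W @ H
    and return a copy of X with the missing cells filled from the approximation."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((n, k)), rng.random((k, m))
    known = ~np.isnan(X)

    for _ in range(max_iter):
        for i, j in zip(*np.nonzero(known)):
            e = X[i, j] - W[i, :] @ H[:, j]                       # cell error
            W[i, :] += alpha * (2 * e * H[:, j] - beta * W[i, :])
            H[:, j] += alpha * (2 * e * W[i, :] - beta * H[:, j])
        W, H = np.maximum(W, 0), np.maximum(H, 0)                 # keep factors non-negative

        # squared error over the known cells only
        err = sum((X[i, j] - W[i, :] @ H[:, j]) ** 2 for i, j in zip(*np.nonzero(known)))
        if err < tol:                                             # exit margin reached
            break

    filled = X.copy()
    filled[~known] = (W @ H)[~known]
    return filled
```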
Data Processing
• Backfilling Data Sets:
  – Missing 2 Set: missing 1 to 2 values from the 5 activity groups.
  – Missing 3 Set: missing 3 values from the 5 activity groups.
  – Missing 4 Set: missing 4 values from the 5 activity groups.
• Developed a matrix factorization program in Matlab™:
  – Exit margin = 0.001.
  – 10,000 iterations.
  – α = 0.0002, β = 0.02.
  – Import and export CSV.
• Calculated error margins between backfilled and original data points.
  – Most backfilled values are within 10% of the original value (many differences are very small).
  – A few show huge differences due to significant discrepancies in the data patterns (very small in one activity while huge in another).
Data Processing
• Error between original and backfilled data points:
  – Error = (Backfilled – Original) / Original
Domain: all records below are from the Mission Planning domain.

   Original                                 Backfilled                                Error
  REQ    ARCH    CODE     INT      QT      REQ    ARCH    CODE     INT      QT       REQ       ARCH      CODE     INT      QT
 68.1    39.6   112.0    25.3     1.8     65.4    39.5   109.9    25.0     1.8      -4.1%     -0.4%     -1.9%   -1.3%    2.8%
  9.7   0.046     9.8    11.1     5.7      9.6     9.4     9.9    10.8     5.7      -1.2%   20359.0%     0.5%   -2.0%   -0.6%
 15.8   102.8   208.1   121.9     0.0     16.0   101.3   206.2   116.8    13.6       1.5%     -1.4%     -0.9%   -4.2%     Inf
 22.1    16.6   167.9    17.5     4.3     21.8    16.8   160.8    17.6     4.4      -1.6%      1.2%     -4.2%    0.2%    0.8%
 27.6    62.2   364.4    37.9     9.9     27.6    61.8   350.3    37.7     9.9       0.0%     -0.7%     -3.9%   -0.4%   -0.2%
  6.3    89.1   100.9    37.6     0.0      6.4    85.1   100.4    37.0    15.3       2.8%     -4.4%     -0.6%   -1.6%     Inf
  0.0   142.9   845.2    55.0     0.0     80.4   140.3   823.3    55.2    30.7        Inf     -1.8%     -2.6%    0.5%     Inf
  2.2     2.4    14.8     7.9     6.7      2.3     2.6    14.4     7.7     6.5       1.5%      5.1%     -3.2%   -3.1%   -2.6%
  3.1     4.1    16.9     2.9     0.2      3.1     4.1    16.4     2.9     2.1       0.1%      0.7%     -3.0%    1.2%  937.2%
210.2   103.0   113.2    25.4   418.5    202.5   103.1   113.3    25.5   396.6      -3.7%      0.1%      0.0%    0.3%   -5.2%
 97.8    46.1    89.5   134.3    16.2     94.5    45.4    91.2   128.0    16.2      -3.4%     -1.5%      1.9%   -4.7%    0.3%
 24.0    20.5    27.4     4.8     0.0     23.5    20.0    27.2     5.0    19.2      -2.3%     -2.5%     -0.7%    2.4%     Inf
  4.2    10.5    14.8     4.3     5.8      4.3    10.0    14.7     4.4     5.8       0.8%     -4.7%     -0.5%    2.3%   -0.4%

Table 4: Backfilled Error from “Missing 2” Set
Application Domain and Effort Distribution
• Records by application domains:

Application Domains               Missing 4   Missing 3   Missing 2   Perfect Set
Business                                 10           6           6             5
Command & Control                        47          29          29            14
Communications                           40          38          37            26
Controls & Displays                      12           7           7             3
Executive                                 5           1           1             1
Information Assurance                     1           1           1             1
Infrastructure or Middleware             13           8           8             2
Maintenance & Diagnostics                 1           1           1             1
Mission Management                       39          38          20            13
Mission Planning                         20          16          13             9
None                                      1           1           1             1
Process Control                           9           4           4             0
Scientific Systems                        1           1           1             1
Sensor Control and Processing            18          17          17             5
Simulation & Modeling                    19          17          14             9
Spacecraft Bus                            1           1           1             1
Spacecraft Payload                        1           1           1             1
Test & Evaluation                         2           2           1             1
Tool & Tool Systems                       5           5           5             2
Training                                  2           2           1             0
Weapons Delivery and Control             21          21          15             5
Total                                   268         216         184           100

Table 5: Research Data Records Count
Application Domain and Effort Distribution
Process: calculate effort percentages for each record, then calculate the average percentages for each domain (a minimal sketch follows below).
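A minimal pandas sketch of these two steps, assuming a DataFrame with a 'domain' column and one raw effort column per activity group (the column names here are placeholders, not the SRDR field names):

```python
import pandas as pd

ACTS = ["req", "arch_design", "code_ut", "int_qt"]   # assumed activity-group columns

def domain_distribution(records: pd.DataFrame) -> pd.DataFrame:
    # Step 1: convert each record's raw effort into percentages of its own total
    pct = records[ACTS].div(records[ACTS].sum(axis=1), axis=0) * 100
    pct["domain"] = records["domain"]
    # Step 2: average the per-record percentages within each application domain
    return pct.groupby("domain")[ACTS].mean()
```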
Application Domain and Effort Distribution
[Four plots, "Effort Distribution by Domains", one each for the Perfect Set, Missing 2, Missing 3, and Missing 4 data sets. Each plot shows Effort % by activity group (Requirement, Arch&Design, Code&Unit Test, Integration & QT) for the Business, CC, Comm, MM, MP, Sim, Sensors, and Weapons domains.]
Application Domain and Effort Distribution
• From the plots of the calculated averages, we find the following:
  – Plan and requirements effort is similar across all domains (not widely spread).
  – There are notable differences in effort for the coding and integration & qualification testing activity groups across all domains.
  – There are obvious differences in architecture & design effort across all domains in the “Perfect” set results, and possible differences in the other data sets.
  – More effort is allocated as the project moves from plan & requirements activities to integration & qualification testing activities.
  – Results from the backfilled sets are similar.
• Although we can see clear differences between the domains’ averages in the resulting plots, we still need statistical proof that the differences are not the result of noise.
Application Domain and Effort Distribution
• Test 1: show whether there is a difference between application domains in terms of effort distribution percentages.
• Test 1: use simple ANOVA to test the following:
  – H0: the effort distributions are the same across domains.
  – Ha: the effort distributions are not all the same across domains.
• The test input is the list of effort percentages grouped by application domain.
• The test uses a 90% confidence level to determine the significance of the results (a minimal sketch follows below).
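A minimal scipy sketch of Test 1, run one activity group at a time on the per-record percentage table (the table and column names follow the earlier sketch and are assumptions):

```python
from scipy import stats

def anova_by_domain(pct, activity, alpha=0.10):
    """One-way ANOVA of one activity group's effort percentages across domains."""
    groups = [g[activity].dropna().values for _, g in pct.groupby("domain")]
    f, p = stats.f_oneway(*groups)
    # 90% confidence level: reject H0 (all domains the same) when p < 0.10
    return f, p, ("Reject" if p < alpha else "Can't Reject")
```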
Application Domain and Effort Distribution
• Test 1 Results:
  – The following table shows the results from two of the four testing data sets.
  – Results from the “Missing 3” and “Missing 4” data sets are essentially the same as those from “Missing 2”.
  – The results indicate that the domain effect is not significant in Plan & Requirements but is active in all other activity groups (based on a consensus of 3 data sets favoring this result).
  – Based on this result, we can say that domains differ in their effort distribution percentages.
                                           “Perfect” Data Set                 “Missing 2” Data Set
Activity Group                             F        P-Value   Results         F        P-Value   Results
Plan & Requirements                        0.8817   0.5249    Can't Reject    1.9889   0.0605    Reject
Architecture & Design                      3.2783   0.0042    Reject          1.9403   0.0674    Reject
Code & Unit Testing                        1.5823   0.1531    Can't Reject    3.0126   0.0056    Reject
Integration and Qualification Testing      2.3402   0.0319    Reject          4.5838   0.0001    Reject

Table 6: Test 1 Results
Application Domain and Effort Distribution
• Test 2: show whether there is a difference between the application domain averages and the COCOMO II effort distribution averages.
• Test 2: use an independent one-sample t-test to test the following:
  – H0: the domain average is the same as the COCOMO average.
  – Ha: the domain average is not the same as the COCOMO average.
  – Tests are run for every domain on every activity group.
• Use the following formula to calculate the T value in order to determine the result of the t-test:
  – t = (x̄ - µ0) / (s / √n)
  – where x̄ is the domain sample mean, s is the standard deviation, n is the sample size, and µ0 is the COCOMO average we compare against.
• The test also uses a 90% confidence level to determine the significance of the results (a minimal sketch follows below).
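A minimal sketch of Test 2, computing the t statistic directly from the formula above and taking a two-sided p-value from scipy (function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def t_test_vs_cocomo(sample, mu0, alpha=0.10):
    """One-sample t-test of a domain's effort percentages against the
    adjusted COCOMO II average mu0 for the same activity group."""
    x = np.asarray(sample, dtype=float)
    n, s = x.size, x.std(ddof=1)
    t = (x.mean() - mu0) / (s / np.sqrt(n))      # t = (x_bar - mu0) / (s / sqrt(n))
    p = 2 * stats.t.sf(abs(t), df=n - 1)         # two-sided p-value
    return t, p, ("Reject" if p < alpha else "Can't Reject")

# Example: compare one domain's Plan & Requirements percentages against 6.5%
# t, p, verdict = t_test_vs_cocomo(req_pcts, mu0=6.5)
```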
Application Domain and Effort Distribution
• Test 2 Results:
  – Again, the results from the “Missing 3” and “Missing 4” testing data sets are similar to the results of “Missing 2”.
  – Three out of four activity groups show at least 50% of the domains differing from the COCOMO II model: tentatively enough for us to move on at this point.
Plan & Requirements (COCOMO average 6.5%)
  – “Perfect” data set: all domains reject except the Sensor Control and Simulation domains.
  – “Missing 2” data set: all domains reject except Sensor Control.
Architecture & Design (COCOMO average 39.3%)
  – “Perfect” data set: all domains reject except the Sensor Control and Simulation domains.
  – “Missing 2” data set: all domains reject except Simulation.
Code & Unit Testing (COCOMO average 30.8%)
  – “Perfect” data set: no domains reject.
  – “Missing 2” data set: only the Mission Planning domain rejects.
Integration and Qualification Testing (COCOMO average 23.4%)
  – “Perfect” data set: only the Mission Management and Weapons Delivery domains reject.
  – “Missing 2” data set: Communications, Mission Management, Sensor Control, and Weapons Delivery domains reject; the other four domains do not.

Table 7: Test 2 Results
Application Domain and Effort Distribution
• System size effects on effort distribution:
  – The COCOMO II model shows that effort distribution varies a little as system size grows.
  – For our study, we tried a simple experiment on system size to observe any effects on effort distribution:
    • Tests on the Command & Control and Communications domains.
    • Divide projects into three sizing groups:
      – 0 to 32 KSLOC
      – 33 to 128 KSLOC
      – 129+ KSLOC
    • Plot the results to observe any patterns or trends, in order to see any effect of system size on effort distribution (a minimal sketch follows below).
  – We need to run a similar experiment on all domains.
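A minimal pandas sketch of the size-group experiment, assuming the per-record percentage table also carries an equivalent-size column named 'ksloc' (column names and bin labels are illustrative; the bin edges follow the groups listed above):

```python
import pandas as pd

BINS = [0, 32, 128, float("inf")]
LABELS = ["0-32 KSLOC", "33-128 KSLOC", "129+ KSLOC"]
ACTS = ["req", "arch_design", "code_ut", "int_qt"]    # assumed activity-group columns

def distribution_by_size(pct: pd.DataFrame, domain: str) -> pd.DataFrame:
    d = pct[pct["domain"] == domain].copy()
    d["size_group"] = pd.cut(d["ksloc"], bins=BINS, labels=LABELS)
    # average effort percentages within each size group; the result is ready to plot
    return d.groupby("size_group", observed=True)[ACTS].mean()

# Example: one line per size group, activity groups on the x axis
# distribution_by_size(pct, "Command & Control").T.plot()
```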
Application Domain and Effort Distribution
• Command & Control:
  – We observed a decreasing trend in Requirements and an increasing trend in Architecture: as project size grows, more effort is allocated to design.
  – There is no clear trend for the coding and integration & testing effort.
  [Plot "Command & Control by Size": Effort % by activity group (Req, Arch, Code, Int&QT) for the 0 to 31, 32 to 128, and 129+ KSLOC size groups.]
• Communications:
  – It is hard to find an obvious trend in any activity group.
  – It seems that none of the sizing groups produces the common distribution: growing from requirements to coding and then decreasing from coding to integration & testing.
  [Plot "Communications by Size": Effort % by activity group (Req, Arch, Code, Int&QT) for the 0 to 31, 32 to 128, and 129+ KSLOC size groups.]
Next Steps
• Go deeper with the system size experiment.
• Expand the system size experiment to all subject domains and groups of domains.
• Use a similar approach to study the effect of personnel ratings on effort distribution by application domain.
• Run the study using the Productivity Types (from Brad's research) and compare the results with the current domain-based results.
• Combine the results of these studies into a proposal for the domain-based effort distribution guideline.
• Design and integrate the application domain extension into the COCOMO II model.
For more information, contact:
Thomas Tan
thomast@usc.edu
626-617-1128
Questions?