I-2a-Building Models Tutorial

Building Models from Your
Software Data
Brad Clark, Ph.D.
Software Metrics, Inc.
16th International Forum on COCOMO
and Software Cost Modeling
Los Angeles, CA
October 23-26, 2001
Agenda
– 1:00 - 2:30 PM Tutorial
– 2:30 - 3:00 PM Break
– 3:00 - 4:30 PM Tutorial conclusion
Miscellaneous
– Bathrooms
– Telephones
– Tutorial Format: Collaborative participation
• One person talks at a time
• Keep discussions to the point
• No attribution
• End-of-course evaluation
Tutorial: Building Models
2
Directions
[Campus map. Labels: Gate #1; CSE - 3rd Floor; Electrical Engineering Building (Hughes); School of Engineering; Computer Science Building (Salvatori); West 37th Place; McClintock Ave.; Parking Structure A; Gerontology Auditorium]
Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression based model exercise
• Summary
The need for models
• Models are useful for forecasting, performance
analysis, and decision-making
– WBS is narrowly addressed with current estimation models
– Strength of cause and effect relationships
– Impact of decision making: Personnel turnover
• Establish data requirements (model parameters)
• Explain assignable causes of variation and their
degree of influence
• Used to validate data
– Poor data definitions and collection consistency
– Poor processes that produce the data
WBS Help
[Callouts on the WBS: "Software Cost Estimation Models"; "How is the effort estimated for the rest of these?"]
3.1 Program Management
  3.1.1 Planning & Mgt
  3.1.2 Program Control
  3.1.3 Contract Management
  3.1.4 Contractor Laboratory
3.2 System Engineering
  3.2.1 Sys Req'ts
  3.2.2 Design and Integration
  3.2.4 Sup. & Maintainability Eng.
  3.2.5 QA
  3.2.6 CM
  3.2.7 Human Factors
  3.2.8 Security
3.3 HW/SW Design, Development and Production
  3.3.1 HW Design & Dev
  3.3.2 SW Design & Dev
  3.3.3 HW/SW Integration & Checkout
3.5 Test and Evaluation
  3.5.1 Sys T&E
  3.5.4 Site Accep
3.6 Documentation
3.7 Support
Decision Impact Analysis
• Do we give the team an incentive to stay or do we look for new hires?
Estimated PM = 2.94 * KSLOC * PCON
• Estimated Person Months for a 100 KSLOC project with 3% per year turnover (PCON = 0.81) = 238 PM
• Same project with 12% per year turnover (PCON = 1.00) = 294 PM (23.5% increase)
• If the burdened labor rate is $10,000/PM, the cost increase is about $560,000 (56 PM * $10,000/PM)
PCON Effect on PM
Turnover    PCON   Effect on PM
3% / yr     0.81   0.0%
6% / yr     0.90   +11.0%
12% / yr    1.00   +23.5%
24% / yr    1.12   +38.0%
48% / yr    1.29   +59.0%
Why not give everyone a financial incentive to stay?
PM: Person Months
PCON: Personnel Continuity
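The turnover arithmetic on this slide can be reproduced with a short script. This is a minimal sketch, not the author's tool: the function name `estimated_pm` and the dictionary layout are assumptions made for illustration, while the 2.94 coefficient, the PCON multipliers, and the $10,000/PM rate come from the slide.

```python
# Sketch of the slide's relationship: PM = 2.94 * KSLOC * PCON.
# Keys are annual turnover percentages; values are the PCON multipliers
# listed in the slide's table.
PCON_MULTIPLIER = {3: 0.81, 6: 0.90, 12: 1.00, 24: 1.12, 48: 1.29}

LABOR_RATE = 10_000  # burdened labor rate in dollars per person-month


def estimated_pm(ksloc, turnover_pct):
    """Estimated person-months for a project of `ksloc` thousand SLOC."""
    return 2.94 * ksloc * PCON_MULTIPLIER[turnover_pct]


low = estimated_pm(100, 3)    # 3% per year turnover
high = estimated_pm(100, 12)  # 12% per year turnover
extra_pm = high - low
extra_cost = extra_pm * LABOR_RATE

print(f"{low:.0f} PM -> {high:.0f} PM "
      f"(+{extra_pm / low:.1%}, about ${extra_cost:,.0f} more)")
```

Working with unrounded person-months, the difference is a little under 56 PM, so the dollar figure lands slightly below what rounding to whole person-months gives.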
Data validation -1
• Check for internal consistency
• Be suspicious of “perfect” data
• Understand reason for outliers
• Check data relationships
– Effort and size
– Effort and schedule
– Size and defects
– Effort and defects
Data validation -2
What looks suspicious here?
[Figure: scatter plot of Budget ($0 to $140,000) against Actual Hours (0 to 2,500)]
Tutorial Objectives
• Share data analysis experiences with real data
– (COCOMO as a thinking aid)
• Show how models created from data are based on
the average (or mean) of the data and its spread or
variation
• Show how model performance improves with the
removal of assignable causes of variation
• Raise awareness of the many sources of variation in
software engineering data
What Will We Do?
• Using supplied data, we will build simple models
– Mean or Median
– One variable regression models
– Stratifying data
• Two sets of data
– The first set will be used to learn a technique
– The second set will be used to practice the technique
• Intent is to show how to create small models by
example
What You Will Walk Away With
• A new skill: using Excel to look at data
– Data summaries
– Graphing data
– Simple regression models
• An understanding of what is behind numbers
produced by models, a.k.a. understanding variation
– An intelligent consumer of data
(which you can practice during this conference’s presentations)
– A responsible data reporter
• Understanding model parameters and their impact on
explaining variation
About the Instructor and SMI
• Brad Clark
– Former Navy Pilot
– Worked in civil service for 10 years
– Attended USC Graduate School: 1992 - 1997
• Development of the COCOMO II model
• Process Maturity Effects on Effort
– Started consulting in 1998 in using measurement to manage
software projects
• Software Metrics, Inc. (SMI)
– Very small, private consulting company located in
Haymarket, Va.
– Started in 1983 by John and Betsy Bailey
– Focus: Using software measurement to manage software
projects: estimation, feasibility analysis, performance
About You
• What is your name?
• Where do you work?
• Do you have any experience with statistics or
empirical modeling?
What is a model?
• A model is a representation of the essential structure
of some object or event in the real world.
– Physical (airplane, building, bridge)
– Symbolic (language, computer program, mathematical
equation)
• Two major characteristics of models
– Models are necessarily incomplete
– Models may be changed or manipulated with relative ease
• No model includes every aspect of the real world
– Building models necessarily involves simplifying
assumptions
– It is critical that the assumptions made when constructing
models be understood and be reasonable.
Source: Introductory Statistics Concepts, Models, and Applications by David Stockburger
Using Data to Estimate
Effort Consumption = 11.9 Person Hours / Function Point
What does this mean?
Actual Effort   Estimated Effort  =  Function Points  *  Effort Consumption
165             880.6             =  74               *  11.9
14,080          3,665.2           =  308              *  11.9
3,602           5,057.5           =  425              *  11.9
Yikes!
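The "Yikes!" becomes concrete when the single-rate estimate is set beside the actuals. The sketch below is illustrative only; `estimate_effort` is a hypothetical helper name, and the three projects and the 11.9 rate are the ones shown on the slide.

```python
RATE = 11.9  # person-hours per function point (the mean used on the slide)

# (function points, actual person-hours) for the three projects shown
projects = [
    (74, 165),
    (308, 14_080),
    (425, 3_602),
]


def estimate_effort(function_points, rate=RATE):
    """Mean-based estimate: size multiplied by the average consumption rate."""
    return function_points * rate


for fp, actual in projects:
    est = estimate_effort(fp)
    # A ratio far from 1.0 shows how badly the average fits that project.
    print(f"FP={fp:4d}  estimated={est:8.1f}  actual={actual:6d}  "
          f"estimate/actual={est / actual:5.2f}")
```

The estimate overshoots the smallest project several times over and falls far short of the 14,080-hour project, which is the motivation for looking at the spread around the mean.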
First Model: Sample Mean (est. X)
[Figure: a normal distribution showing the population spread about the mean X, with 68%, 95%, and 99.7% of values within ±1, ±2, and ±3 SD; and a t-distribution showing the sample spread about the estimated mean, est. X = 11.9, with a 90% Confidence Interval (CI) from 3.6 to 20.3]
Necessary and Sufficient
Information
• What additional information do we want to know
about the stated relationship to make it more
accurate?
Effort Consumption = 11.9 Person Hours / Function Point
?
?
?
Data Analysis: PHr/FP
[Figure: histogram of PHr/FP; x-axis PHr/FP (0 to 50), y-axis Frequency (0 to 6)]
Confidence Interval (CI) about the mean of 11.9:
80% CI: 5.7 to 18.2
90% CI: 3.6 to 20.3
95% CI: 1.6 to 22.3
Confidence Interval can be “tightened” by removing assignable causes of variation.

PN   FP    PHrs     PHr/FP
1    40    300      7.50
2    931   6,400    6.87
3    425   3,602    8.48
4    181   1,550    8.56
5    308   14,080   45.71
6    163   1,090    6.69
7    74    165      2.23
8    333   1,070    3.21
9    241   4,350    18.05

Descriptive statistics for PHr/FP:
Mean                       11.92
Standard Error             4.48
Median                     7.50
Standard Deviation         13.44
Range                      43.48
Minimum                    2.23
Maximum                    45.71
Confidence Level (90.0%)   8.33
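The three confidence intervals can be recomputed from the nine PHr/FP values. This is a sketch under stated assumptions: the t critical values for 8 degrees of freedom are standard table values hard-coded here so that no statistics package is needed, and the function name `mean_confidence_interval` is invented for the example.

```python
import math
import statistics

# PHr/FP for the nine projects on the slide
phr_per_fp = [7.50, 6.87, 8.48, 8.56, 45.71, 6.69, 2.23, 3.21, 18.05]

# Two-sided Student's t critical values for 8 degrees of freedom
# (standard table values; valid only for samples of 9 points).
T_CRIT_DF8 = {80: 1.397, 90: 1.860, 95: 2.306}


def mean_confidence_interval(values, level):
    """Return (lower, upper) bounds of the CI for the mean of `values`."""
    n = len(values)
    if n - 1 != 8:
        raise ValueError("critical values are hard-coded for 8 degrees of freedom")
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # sample SD / sqrt(n)
    half_width = T_CRIT_DF8[level] * std_err
    return mean - half_width, mean + half_width


for level in (80, 90, 95):
    lower, upper = mean_confidence_interval(phr_per_fp, level)
    print(f"{level}% CI: {lower:.1f} to {upper:.1f}")
```

The half-width is the quantity Excel reports as "Confidence Level"; a wider level or a larger standard error both widen the interval.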
Reducing the Confidence Interval
• Some assignable causes of variation among project
data points
– Noisy data (size and effort)
– Complexity of the software (effort)
– Amount of required testing (effort)
– Building components for reuse (effort)
– Changes in requirements (size)
– Required reliability and safety features (size)
– Interoperability (effort and size)
– Development / maintenance team experience (effort)
– Turnover of key people (effort)
Measurement Specifications
• Staff Turnover Specification Example
– Typical Data Items
• Number of personnel
• Number of personnel gained (per period)
• Number of personnel lost (per period)
– Typical Attributes
• Experience factor
• Organization
– Typical Aggregation Structure
• Activity
– Typically Collected for Each
• Project
– Count Actuals Based On
• Financial reporting criteria
• Organization restructuring or new organizational chart
Source: Practical Software Measurement: Objective Information for Decision Makers by McGarry et al.
Models Depend on Solid Data
• Models are created from data, so models are only as good as the data used to create them
– life-cycle phase
– overtime to get work done
– experience
– tools
– complexity
– reuse
• Data used to create models must be well specified
Accounting for Requirements Volatility
[Figure: histogram of PHr/Adj_FP; x-axis PHr/Adj_FP (5 to 50), y-axis Frequency (0 to 6)]
Mean 8.0; 90% CI: 4.2 to 11.9
Assignable cause of variation: Adjust the size with the effects of requirements volatility (REVL)
Adj_FP = FP * (1 + REVL%)

PN   FP    REVL   Adj_FP    PHrs     PHr/Adj_FP
1    40    -      40.00     300      7.50
2    931   50     1396.50   6,400    4.58
3    425   30     552.50    3,602    6.52
4    181   10     199.10    1,550    7.79
5    308   100    616.00    14,080   22.86
6    163   1      164.63    1,090    6.62
7    74    9      80.66     165      2.05
8    333   10     366.30    1,070    2.92
9    241   60     385.60    4,350    11.28

Descriptive statistics for PHr/Adj_FP:
Mean                       8.01
Standard Error             2.07
Median                     6.62
Standard Deviation         6.21
Range                      20.81
Minimum                    2.05
Maximum                    22.86
Confidence Level (90.0%)   3.85
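The size adjustment can be checked by applying Adj_FP = FP * (1 + REVL%) to each project. This is an illustrative sketch: the helper name `adjusted_fp` is made up, and project 1 is given a REVL of 0 as an assumption, since the slide lists no REVL value for it and its Adj_FP equals its FP.

```python
import statistics

# (PN, FP, REVL %, PHrs) from the slide.
# Project 1's REVL is assumed to be 0 (no value is listed for it).
projects = [
    (1, 40, 0, 300),
    (2, 931, 50, 6400),
    (3, 425, 30, 3602),
    (4, 181, 10, 1550),
    (5, 308, 100, 14080),
    (6, 163, 1, 1090),
    (7, 74, 9, 165),
    (8, 333, 10, 1070),
    (9, 241, 60, 4350),
]


def adjusted_fp(fp, revl_pct):
    """Grow the size by the requirements-volatility percentage."""
    return fp * (1 + revl_pct / 100)


ratios = [phrs / adjusted_fp(fp, revl) for _, fp, revl, phrs in projects]

print(f"mean PHr/Adj_FP = {statistics.mean(ratios):.2f}")
print(f"std deviation   = {statistics.stdev(ratios):.2f}")
```

Because volatile projects get a larger denominator, the extreme ratios shrink and the standard deviation drops, which is what tightens the confidence interval.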
Variation: Staff Turnover
Impact of Personnel Continuity on Effort
This factor captures the turmoil caused by the project losing key, lead personnel. The loss of key personnel leads to extra effort because new people joining the project must spend time coming up to speed on what has to be done. The rating scale is in terms of the project’s personnel turnover normalized to a year.

Rating Level   Descriptor     Effort Multiplier
Very Low       48% per year   1.29
Low            24% per year   1.12
Nominal        12% per year   1.00
High           6% per year    0.90
Very High      3% per year    0.81

Effect on Effort between adjacent rating levels: +15%, +12%, -11%, -11%
Source: Software Cost Estimation with COCOMO II by Barry Boehm et al.
Accounting for Staff Turnover
[Figure: histogram of Adj_PHr/Adj_FP; x-axis Adj_PHr / Adj_FP (5 to 50), y-axis Frequency (0 to 8)]
Mean 7.9; 90% CI: 5.2 to 10.6
Assignable cause of variation: Adjust the effort with the effects of Personnel Continuity (PCON)
Adj_PHr = PHr / PCON

PN   Adj_FP    PHrs     PCON   Adj_PHrs   Adj_PHr/Adj_FP
1    40.00     300      0.90   333.33     8.33
2    1396.50   6,400    0.81   7901.23    5.66
3    552.50    3,602    0.81   4446.91    8.05
4    199.10    1,550    0.81   1913.58    9.61
5    616.00    14,080   1.29   10914.73   17.72
6    164.63    1,090    1.00   1090.00    6.62
7    80.66     165      0.81   203.70     2.53
8    366.30    1,070    0.81   1320.99    3.61
9    385.60    4,350    1.29   3372.09    8.75

Descriptive statistics for Adj_PHr/Adj_FP:
Mean                       7.87
Standard Error             1.46
Median                     8.05
Standard Deviation         4.39
Range                      15.19
Minimum                    2.53
Maximum                    17.72
Confidence Level (90.0%)   2.72
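The effort adjustment works the same way as the size adjustment: divide each project's hours by its PCON multiplier, then recompute the ratio. A minimal sketch, assuming the tabulated values above; `adjusted_phr` is a hypothetical helper name.

```python
import statistics

# (Adj_FP, PHrs, PCON multiplier) per project, as tabulated on the slide
projects = [
    (40.00, 300, 0.90),
    (1396.50, 6400, 0.81),
    (552.50, 3602, 0.81),
    (199.10, 1550, 0.81),
    (616.00, 14080, 1.29),
    (164.63, 1090, 1.00),
    (80.66, 165, 0.81),
    (366.30, 1070, 0.81),
    (385.60, 4350, 1.29),
]


def adjusted_phr(phr, pcon):
    """Normalize effort to nominal turnover by dividing out the multiplier."""
    return phr / pcon


ratios = [adjusted_phr(phr, pcon) / adj_fp for adj_fp, phr, pcon in projects]

print(f"mean Adj_PHr/Adj_FP = {statistics.mean(ratios):.2f}")
print(f"std deviation       = {statistics.stdev(ratios):.2f}")
```

Dividing by a multiplier above 1.0 pulls high-turnover projects down and dividing by one below 1.0 pushes low-turnover projects up, so the ratios cluster more closely around the mean.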
COCOMO Suite
• Attempts to identify and quantify assignable causes of variation (drivers)

Model      Purpose
COCOMO     Custom cost and schedule estimation
COCOTS     COTS Based Systems cost estimation
COQUALMO   Defect introduction and removal
CORADMO    Rapid application development cost and schedule estimation
COPSEMO    Staged schedule & effort model
COPROMO    Productivity improvement model
COSYSMO    System engineering cost and schedule est.
Second Model: Linear Regression Analysis
My favorite!
• Statistical regression fits a line through points, minimizing the sum of squared errors between the points and the line
• The regression analysis yields a line with a slope, M, and intercept, A:
Y = A + MX + e
• The goodness of fit is given by a statistic called R^2. The closer to 1.0, the better the fit.
[Figure: scatter plot of PHr (0 to 15000) against Adj_FP (0 to 1500) with the fitted line]
est_PHr = 1061.3 + 6.0645 Adj_FP
R^2 = 0.3232
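The fitted line can be reproduced with the closed-form least-squares formulas. The sketch below is one plain-Python way to do it, not the Excel procedure used in the tutorial; `fit_line` is a name chosen for the example, and the data are the Adj_FP and PHr columns from the earlier slides.

```python
adj_fp = [40.00, 1396.50, 552.50, 199.10, 616.00, 164.63, 80.66, 366.30, 385.60]
phr = [300, 6400, 3602, 1550, 14080, 1090, 165, 1070, 4350]


def fit_line(xs, ys):
    """Ordinary least squares for Y = A + M*X; returns (A, M, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # valid for one-variable regression
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_line(adj_fp, phr)
print(f"est_PHr = {intercept:.1f} + {slope:.4f} * Adj_FP, R^2 = {r_squared:.4f}")
```

An R^2 near 0.32 says that size alone explains only about a third of the variation in effort for these nine projects.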
Starting Point: Scatter Plot
[Figure: scatter plot of KSLOC (0.0 to 140.0) against Unadjusted Function Points (0 to 2000) with a fitted line; annotations: "Model Boundaries", "Random"]
Y = 11.3 + 0.0651 X + e
There is variation with each model coefficient
Source: Albrecht and Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Transactions on Software Engineering, Vol SE-9, No 6, Nov 1983.
Regression Analysis Example
Compare Models
[Figure: two scatter plots of actual PHr against Model Estimate of PHr, one for each model]
Linear Model
est_PHr = -12127 + 6.16 Adj_FP + 13879 PCON
Adj. R^2 = 0.64
Multiplicative Model
est_PHr = 4.84 * Adj_FP^1.08 * PCON^2.72
Adj. R^2 = 0.88
Model Accuracy
PRED(L) = X
• Means that the model estimates
within L% of the actual values X% of
the time
• In other words, how often does the model predict within the desired circle?
[Figure: target circle labeled "30% of the actual"]
Example: PRED(30) = 70
The model predicts within 30% of the actuals 70% of the time.
Models are necessarily incomplete and are not 100% accurate
Model Evaluation

PN         Actual PHrs   PHr/FP      Adj_PHr/Adj_FP   Linear Model   Multiplicative Model
1          300           476.80      314.80           612.26         200.76
2          6,400         11,097.52   7,326.97         7,703.17       7,026.10
3          3,602         5,066.00    3,344.75         2,512.57       2,570.26
4          1,550         2,157.52    1,424.47         339.16         849.69
5          14,080        3,671.36    2,423.96         9,565.01       10,274.27
6          1,090         1,942.96    1,282.81         2,764.17       1,227.46
7          165           882.08      582.38           -389.25        318.92
8          1,070         3,969.36    2,620.71         1,367.44       1,645.88
9          4,350         2,872.72    1,896.67         8,148.05       6,181.82
PRED(30)                 0.0         0.55             0.33           0.44

Which model would you choose?
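PRED(30) can be computed directly from the actuals and each model's estimates. This sketch uses a strict "relative error no greater than 30%" cutoff, which is an assumption about how the counts were made; the function name `pred` and the dictionary keys are illustrative.

```python
actual = [300, 6400, 3602, 1550, 14080, 1090, 165, 1070, 4350]

# Estimates per model, copied from the Model Evaluation table
estimates = {
    "PHr/FP": [476.80, 11097.52, 5066.00, 2157.52, 3671.36,
               1942.96, 882.08, 3969.36, 2872.72],
    "Adj_PHr/Adj_FP": [314.80, 7326.97, 3344.75, 1424.47, 2423.96,
                       1282.81, 582.38, 2620.71, 1896.67],
    "Linear": [612.26, 7703.17, 2512.57, 339.16, 9565.01,
               2764.17, -389.25, 1367.44, 8148.05],
    "Multiplicative": [200.76, 7026.10, 2570.26, 849.69, 10274.27,
                       1227.46, 318.92, 1645.88, 6181.82],
}


def pred(actuals, estimated, level=30):
    """Fraction of estimates whose relative error is within `level` percent."""
    hits = sum(
        1 for a, e in zip(actuals, estimated)
        if abs(a - e) / a <= level / 100
    )
    return hits / len(actuals)


for name, values in estimates.items():
    print(f"{name:16s} PRED(30) = {pred(actual, values):.2f}")
```

With the strict cutoff the linear model scores 2 of 9 rather than the tabulated 0.33: project 3 misses by roughly 30.2%, so the table presumably counts it as a hit after rounding. The other three models match the table.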
Summary -1: Two Models
[Figure: two panels. "Mean X": data points on a Y-versus-X plot summarized by the mean, Y = 0 + X, with the spread marked from -3 SD to +3 SD about X. "1-Variable Regression": data points on a Y-versus-X plot with a fitted line, Y = A + MX + e]
Summary -2
• Definition of a model
• Data specifications
• Normal versus T distribution
• Model characteristics
– Model Usage
– Model boundaries
– Confidence interval
– Model accuracy
– Assignable causes of variation
• Large model examples
Modeling Steps: 1. Decide what relationship you would like to investigate
• What do you want to know
– Estimation of Requirements Volatility
– Establishing thresholds for performance monitoring
– Working overtime’s effect on personnel turnover
– Estimation of the number of defects to be found before Final Acceptance Test
[Figure: candidate measures: People, Cost, Build Duration, Code Size, Function Points, Requirements, Defects, Design Units, Test Cases, Rework, Change Requests, Req’ts Evolution, Process Maturity, Documentation]
2. Identify assignable causes of
variation (drivers)
• Use your experience and intuition
• Possible sources of variation:
– Customer participation
– Development team experience
– Application domain exp.
– Complexity of application
– Development flexibility
– Design constraints
– Requirements volatility
– Adaptation of existing code
– Programming lang. exp.
– Use of modern methodologies
– Compression of schedule
– Use of software tools
– Code inspections
– Management capability
– Team size
– Application size
– Personnel turnover
– Architecture & Risk resolution
3. Collect data
• Specify data to be collected based on:
– assignable causes of variation
– what is available
• Select 10 projects to go back and collect extra data
– Based on project applicability
– Use measurement specifications as a checklist for each data item
• How much project data is enough?
4. Normalize data and check for
consistency
• In some cases it may be appropriate to normalize:
– normalize data to a “per unit” measure
• size
• defects
• calendar days, weeks, months
• effort hours, effort days, effort months
– normalize about the mean of the data to get percentage increase or decrease from the mean
• Plot data
– Check that known relationships exist
– Detect outliers and investigate
– Scatter plots are very useful
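The two normalizations named in this step can be sketched in a few lines. The data here are hypothetical and exist only to show the mechanics; `per_unit` and `percent_from_mean` are names invented for the example.

```python
def per_unit(totals, units):
    """Normalize each total to a per-unit measure (e.g. defects per KSLOC)."""
    return [total / unit for total, unit in zip(totals, units)]


def percent_from_mean(values):
    """Express each value as a percentage increase or decrease from the mean."""
    mean = sum(values) / len(values)
    return [100 * (value - mean) / mean for value in values]


# Hypothetical projects: defect counts and sizes in KSLOC
defects = [50, 200, 90]
ksloc = [5, 10, 3]

density = per_unit(defects, ksloc)      # defects per KSLOC
deviation = percent_from_mean(density)  # percent above or below the mean

for d, pct in zip(density, deviation):
    print(f"{d:5.1f} defects/KSLOC  ({pct:+.0f}% from mean)")
```

Per-unit measures make projects of different sizes comparable, and percentages from the mean make it easy to spot values that deserve a closer look before modeling.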
5. Build model and evaluate
• Models should be:
– Simple
– Explainable
– Analyzable
– Most important: They should make sense!
• Models make explicit data relationships
– Show strength and direction of relationships
– While a relationship exists, it may not be valid or make sense
– The relationships you want to use in modeling are ones that show valid “cause and effect”
6. Add or remove drivers
• Drivers are data attributes that explain (or drive)
variation.
• The more drivers used in a model - the more data
that must be collected.
• While a driver may make sense to use in explaining
variation, the data may not support this conclusion
– Collect more data; the current dataset may be biased and not represent a true sample
– Some drivers may be correlated, which could mask the effects of the weak-performing driver
• Warning: Correlation effects between drivers
7. Repeat steps 3 to 7
• If the model does not have an acceptable accuracy,
then:
– collect more data
– analyze it for its influence on variation
– add and remove cost drivers
– evaluate model
8. Pilot model
• Document model and create a tool for its use
• Model must be piloted to test its reasonableness,
understandability, and accuracy.
– Collect actual values of model inputs (including assignable
causes of variation)
• Model should be used with its confidence interval
• Feedback should be incorporated into model and tool
Exercise 1: Growth Model
• Modeling step #1: What do we want to investigate?
– We are going to develop a growth model based on real data
in a report from NASA Software Engineering Laboratory*
– A growth model increases size based on “other information”
– It will be used in estimating cost and schedule for future
software projects
• When will the model be used?
– What information will be available at the time?
• What will be the scope of the model?
– What will be included or excluded in the estimate?
* Cost and Schedule Estimation Study Report, SEL-93-002
Exercise 1: Assignable Causes of
Variation
• Modeling step #2: What are possible causes (that
can be controlled) of growth?
?
?
?
?
Exercise 1: Survey the Data
• Modeling step #3: Collect data
• Using Microsoft Excel, open the file with the NASA
SEL data.
• Select the data definitions worksheet
– Project type
– Programming Language
– Duration
– Effort for management and technical
– Estimated SLOC size
– Actual SLOC size
– New SLOC size
– Growth (derived)
– Reuse (derived)
Exercise 1: Plot the Data
• Modeling step #4: Normalize data and check for
consistency
– Copy Data worksheet and name it “Scatter Plots”
• Create a scatter plot of the following data elements
against Growth%
– Project Type
– Programming Language
– Duration
– Effort
– Estimated SLOC
– New SLOC
– Reuse%
Exercise 1: Check for Correlation
• Check for correlation of data elements to Growth%
– Excel: Tools -> Data Analysis -> Correlation
(new Worksheet Ply: Correlation - this will create a new
worksheet)
                   TypeN    LangN    Duration  Effort   SLOC_Est  SLOC_Act  SLOC_New  Growth   Reuse%
TypeN              1.000
LangN              -0.507   1.000
Duration (Weeks)   0.694    -0.274   1.000
Effort (Hours)     0.307    -0.392   0.681     1.000
SLOC_Est           0.248    -0.348   0.346     0.626    1.000
SLOC_Act           0.337    -0.412   0.524     0.785    0.919     1.000
SLOC_New           0.344    -0.363   0.724     0.972    0.566     0.758     1.000
Growth             0.041    -0.360   0.085     -0.001   -0.359    -0.095    0.138     1.000
Reuse%             -0.331   0.379    -0.623    -0.561   0.011     -0.164    -0.673    -0.472   1.000

• Compare correlation numbers to scatter plots
– What can you conclude?
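Each cell of the correlation matrix is a Pearson correlation coefficient between two columns. As a rough sketch of what the Excel Correlation tool computes per pair, the function below implements the standard formula; the name `pearson_r` and the sample data are hypothetical, since the NASA SEL worksheet values are not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical columns, used only to show the range of the statistic
duration = [10, 20, 30, 40, 50]
rising = [12, 19, 33, 41, 48]   # tends to grow with duration
falling = [50, 40, 30, 20, 10]  # shrinks exactly as duration grows

print(f"rising:  r = {pearson_r(duration, rising):+.3f}")
print(f"falling: r = {pearson_r(duration, falling):+.3f}")
```

A coefficient near zero only rules out a straight-line relationship, which is why the slide asks for the numbers to be compared against the scatter plots: a stratified or curved pattern can hide behind a small r.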
Exercise 1: Create Mean-Based
Models
• Modeling step #5: Build model and evaluate
– Copy the Data worksheet and name it “Project-Models”
• Which relationships shown in the scatter plots looked
most promising?
• Based on the intended model’s purpose, what data
would be realistically available?
Exercise 1: Create Project Type
Mean-Based Models
• Build 3 Mean-Models based on “Project Type”
– Sort data by project type
• 1 - TS
• 2 - AGSS
• 3 - DS
– Excel: Tools -> Data Analysis -> Descriptive Statistics
– Input Range: TypeN (for one set of values)
– Output Range: swipe two empty columns
– Check Summary Statistics
– Check Confidence Interval: 80%
– Describe each model
• Mean, Standard Deviation, Min - Max values, Number of data points, 80% confidence intervals
Exercise 1: TS Mean-Model
TS Projects Growth

Statistic                  Value   Note
Mean                       0.21    Sum of Growth% values / No. projects
Standard Error             0.05    Std Deviation / SQRT(No. projects)
Median                     0.20    Middle Growth% value
Mode                       0.20
Standard Deviation         0.11    Spread; SQRT(Variance)
Sample Variance            0.01
Kurtosis                   2.82
Skewness                   1.49
Range                      0.30
Minimum                    0.10    Smallest TS project Growth%
Maximum                    0.40    Largest TS project Growth%
Sum                        1.05
Count                      5.00
Confidence Level (80.0%)   0.08    Confidence interval within which lies the real population mean
Exercise 2: Create Reuse% Mean-Based Model
• Build a Mean-Based Model for Reuse%
– Looking at scatter plot, how can this data be stratified?
– Copy Data worksheet and name it “Reuse-Model”
– Sort data by Reuse%
– Use Descriptive Statistics to build two Reuse% models based on stratified data
Excel: Tools -> Data Analysis -> Descriptive Statistics (80% Confidence Interval)
– Describe each model
Mean, Standard Deviation, Min - Max values, Number of data points, 80% confidence intervals
Mean-Based Modeling Conclusions
• Correlation analysis versus Scatter Plots
• Variation in the data
– When to use the Mean versus the Median
– Stratifying or categorizing data
– Determines (in part) the confidence interval
• The number of data points is important
• Minimum and maximum values set model boundaries
• The mean is a model that describes data “on
average”
• The standard deviation is a model that describes
distances “in general”
Linear Regression Models
• Statistical regression fits a line through points, minimizing the sum of squared errors between the points and the line
• The regression analysis yields a line with a slope, M, and intercept, A:
Y = A + MX
• The goodness of fit is given by a statistic called R^2. The closer to 1.0, the better the fit.
[Figure: scatter plot of Unadjusted Function Points (0 to 2500) against COBOL KSLOC (0.0 to 350.0) with trend line]
y = 6.1365x + 206.12
R^2 = 0.7286
Two Types of Regression Models
• Additive (linear)
Y = A + MX + e
• Multiplicative
Y = A * X^M * e
• The regression technique requires a linear form
– Works for the first model form
– Does not work for the second model form
• Non-linear models must be transformed into a linear form
Transforming Non-Linear Models
• Log-log transformation
Y = A * X^M
ln(Y) = ln(A) + M * ln(X)
• Reversing the log-log transformation (a is the fitted intercept in log-space, a = ln(A))
e^ln(Y) = e^[a + M * ln(X)]
Y = e^a * X^M
A = e^a
Y = A * X^M
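The transformation can be exercised end to end on synthetic data. In this sketch the points are generated from a known power law (A = 2, M = 1.5, chosen arbitrarily for illustration), a straight line is fitted in log-space, and the intercept is transformed back; `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Fit Y = A * X^M by least squares on ln(Y) = ln(A) + M * ln(X)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
    m = sxy / sxx                   # slope in log-space is the exponent M
    a = mean_ly - m * mean_lx       # intercept in log-space is ln(A)
    return math.exp(a), m           # reverse the transformation: A = e^a


# Synthetic data that follow Y = 2 * X^1.5 exactly (illustrative values)
xs = [1, 2, 4, 8, 16]
ys = [2 * x ** 1.5 for x in xs]

A, M = fit_power_law(xs, ys)
print(f"Y = {A:.3f} * X^{M:.3f}")
```

Note that the logarithm requires strictly positive X and Y values, so zero or negative observations have to be handled before a multiplicative model can be fitted this way.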
Exercise 3: Create Additive
Duration Regression Model
• Modeling step #5: Build model and evaluate
– Copy the Data worksheet and name it “Duration-Model”
• Examine the Duration Relationship to Growth% in the
scatter plots
• Based on the intended model’s purpose, would this
data be realistically available?
Exercise 3: Create Additive Model
• Build a Regression Model based on Duration
– Select Tools -> Data Analysis -> Regression
– Y Input range: Growth% data
– X Input range: Duration data
– Select labels
– Confidence interval set to 80%
– Output range: select 7 columns for the output
– Select residuals
– Select: OK
– Describe the model
• Intercept (A), Slope (M), R^2 of the model, Min - Max values, Number of data points, 80% confidence intervals for A and M
– Create a scatter plot of the data with trend line
Exercise 3: Additive Duration Model
Growth% = -40.5 + 0.72 Duration
Growth% = -87.3 + 0.30 Duration (lower 80%)
Growth% = 6.23 + 1.14 Duration (upper 80%)

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.57204208
R Square            0.32723214
Adjusted R Square   0.26607143
Standard Error      29.2818403
Observations        13

ANOVA
             df   SS            MS         F          Significance F
Regression   1    4587.542915   4587.543   5.350365   0.041074958
Residual     11   9431.687854   857.4262
Total        12   14019.23077

            Coefficients   Standard Error   t Stat      P-value    Lower 80.0%    Upper 80.0%
Intercept   -40.527843     34.29305817      -1.181809   0.2622     -87.2840386    6.228353
Duration    0.71766616     0.310263556      2.313086    0.041075   0.294643418    1.140689
Additive Model Scatter Plot
[Figure: scatter plot of Growth (-20 to 140) against Duration (0 to 200) with trend line]
y = 0.7177x - 40.528
R^2 = 0.3272
Exercise 4: Create Multiplicative
Duration Regression Model
• Modeling step #5: Build model and evaluate
– Copy the Data worksheet and name it “Ln-Duration-Model”
• Transform the Growth% and Duration data into log-space by taking the natural logarithm of each column
– Insert a new column next to each of Growth% and Duration
– Label them Ln-Growth% and Ln-Duration
– In each new column, take the natural logarithm of the column next to it
• e.g. in cell H2 type =ln(G2); copy this formula into the remaining cells
Exercise 4: Create Multiplicative
Model
• Build a Regression Model based on Ln-Duration
– Use the same procedures as last time
• Transform the results back into normal-space
e^ln(y) = e^[a + b * ln(x)]
y = e^a * x^b
A = e^a: all we have to do is raise e to the power of the intercept, =exp(intercept)
• Describe the model
– Intercept (A), Slope (M), R^2 of the model, Min - Max values, Number of data points, 80% confidence intervals for A and M
• Create a scatter plot of the data with trend line
Exercise 4: Multiplicative Duration Model
Growth% = 0.015 * Duration^1.6
Growth% = 0.0002 * Duration^0.63 (lower 80%)
Growth% = 1.42 * Duration^2.6 (upper 80%)

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.55929926
R Square            0.312815663
Adjusted R Square   0.250344359
Standard Error      0.735024727
Observations        13

ANOVA
             df   SS         MS         F         Significance F
Regression   1    2.705278   2.705278   5.00735   0.046889
Residual     11   5.942875   0.540261
Total        12   8.648152

                 Coefficients   Standard Error   t Stat      P-value    Lower 80.0%   Upper 80.0%
Intercept        -4.221394794   3.352932         -1.259016   0.234087   -8.792883     0.350094
ln-Duration      1.613752301    0.721162         2.237711    0.046889   0.630498      2.597007
exp(Intercept)   0.014678157                                            0.000152      1.419201
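The exp(Intercept) row is the log-space intercept and its 80% bounds transformed back into normal-space. A minimal sketch of that step, using the coefficients from the regression output; the variable and function names are assumptions made for the example.

```python
import math

# Log-space regression output: (coefficient, lower 80%, upper 80%)
intercept = (-4.221394794, -8.792883, 0.350094)
slope = (1.613752301, 0.630498, 2.597007)

# Back-transform the intercept: A = e^a (the slope is used as the exponent as-is)
A, A_lower, A_upper = (math.exp(value) for value in intercept)


def growth_pct(duration_weeks, a=A, m=slope[0]):
    """Multiplicative model in normal-space: Growth% = A * Duration^M."""
    return a * duration_weeks ** m


print(f"A = {A:.6f} (80% bounds: {A_lower:.6f} to {A_upper:.6f})")
print(f"Growth% at 100 weeks = {growth_pct(100):.1f}")
```

The bounds on A span several orders of magnitude because a symmetric interval in log-space becomes a multiplicative one after exponentiation.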
Multiplicative Model Scatter Plot
[Figure: scatter plot of Growth% (0 to 140) against Weeks (0 to 200) with power trend line]
y = 0.0147x^1.6138
R^2 = 0.3128
Multiplicative Model Scatter Plot (Log - Log scale)
[Figure: scatter plot of Growth% (0.000 to 6.000) against Weeks (0 to 6), both in log-space, with linear trend line]
y = 1.6138x - 4.2214
R^2 = 0.3128
Model Comparison -1
• Using the two models, estimate Growth%
– In each Duration worksheet, create a new column next to
Growth% and Ln-Growth%
– Label it est. Growth%
– Using the models created in Exercise 3 and 4, compute the
estimated Growth%
• One model is additive: Growth% = -40.5 + 0.72 Duration
• One model is multiplicative: Growth% = 0.015 * Duration^1.6
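The two estimate columns can be filled in with one function per model. This sketch uses the rounded coefficients quoted on the slide, and the durations in the loop are arbitrary example values in weeks, not rows from the NASA SEL data.

```python
def additive_growth(duration_weeks):
    """Additive model from Exercise 3: Growth% = -40.5 + 0.72 * Duration."""
    return -40.5 + 0.72 * duration_weeks


def multiplicative_growth(duration_weeks):
    """Multiplicative model from Exercise 4: Growth% = 0.015 * Duration^1.6."""
    return 0.015 * duration_weeks ** 1.6


# Arbitrary example durations (weeks) to compare the two model shapes
for weeks in (50, 100, 150):
    print(f"{weeks:3d} weeks: additive = {additive_growth(weeks):6.1f}%, "
          f"multiplicative = {multiplicative_growth(weeks):6.1f}%")
```

For short durations the additive model goes negative, which is a reminder that a model should only be used within the minimum and maximum of the data it was built from.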
Model Comparison -2
• Compute the Magnitude of Relative Error (MRE) for each Growth estimate:
– Create a new column next to the est. Growth% column
– Label it MRE
– Compute MRE: |Actual Growth% - Estimated Growth%| / Actual Growth%
• Count the errors that are less than or equal to 30% and divide by the number of data points. This is PRED(30)
Model Comparison Results

PN         Growth %   Additive Model   Multiplicative Model
2          5          0.43             30.15
3          50         0.66             45.00
4          30         0.45             31.40
5          30         41.80            29.73
6          40         30.28            23.39
7          80         64.84            44.04
8          130        51.16            35.29
9          20         26.68            21.53
10         20         32.44            24.54
11         15         18.76            17.64
12         10         -6.44            7.35
13         25         20.20            18.33
14         20         38.92            28.09
PRED(30)              31               62
Summary -1
• Modeling steps
1. Decide what relationship you would like to investigate
2. Identify assignable causes of variation (drivers)
3. Collect data
4. Normalize data and check for consistency
5. Build model and evaluate
6. Add or remove drivers
7. Repeat steps 3 to 7
8. Pilot model
• Mean-Based Models
– Scatter plot versus correlation analysis
– Stratify data to identify different relationships
– Mean, Standard Deviation, Min - Max values, Number of
data points, 80% confidence intervals
Summary -2
• Regression-Based Models
– Additive: Y = A + MX
– Multiplicative: Y = A * X^M
– Use of logarithms to transform multiplicative into additive
– Analysis in log-space versus linear-space
– Use of PRED(L) as a measure of model performance
• Data defines the model!
– Data quality
– Scope of coverage: life-cycle phases
– Depth of coverage: included / excluded in the count
– Correlation effects among assignable causes of variation
– Min and Max inputs (based on low and high data points)
Further Information
• Cost and Schedule Estimation Study, NASA Software Engineering Laboratory, SEL-93-002, Nov 1993
• Introductory Statistics: Concepts, Models, and Applications by David Stockburger, www.atomicdogpublishing.com, 2nd ed., 2001
• Practical Software Measurement: Objective Information for Decision Makers by John McGarry, David Card, Cheryl Jones, Beth Layman, Elizabeth Clark, Joseph Dean, and Fred Hall, Addison-Wesley, 2001
• Software Cost Estimation with COCOMO II by Barry Boehm, Chris Abts, Winsor Brown, Sunita Chulani, Brad Clark, Ellis Horowitz, Ray Madachy, Donald Reifer, and Bert Steece, Prentice Hall PTR, 2000
• Statistics, Data Analysis, and Decision Making by James Evans and David Olson, Prentice-Hall, 1999
• Statistical Analysis Simplified by Glen Hoffherr and Robert Reid, McGraw-Hill, 1997
Contact Information
Brad Clark
Software Metrics, Inc.
Washington, D.C. area
(703) 754-0115
Brad@Software-Metrics.com
http://www.software-metrics.com