Brad Clark
Software Metrics Inc.
November 17, 2015
• Share my experiences collecting and analyzing data to create cost and schedule estimating relationships
• Experience breakdown:
– Model conceptualization
– Collecting Data
– Analyzing data, statistics and model formulation
– Using the model
– Conclusion
4/15/2020 Data Collection and Analysis 2
• The model must be explainable (and teachable). The effect of varying each predictor variable on the response variable must be understandable.
• The model should use only enough predictor variables to explain the variation in the response variable, and each predictor must be significant.
• The model must use predictor variables that are independent of each other.
• It must be possible to numerically calibrate the predictor variables in the model using a statistical analysis method (but which method?).
– Examines interrelationships among the predictor variables
– Shows unexplained variance
– Provides a goodness of fit measure of the model to data
– Reports the significance that each predictor variable has in predicting the response.
• The model must be sufficiently accurate.
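Two of the criteria above (independent predictors, a goodness-of-fit measure) can be checked with a few lines of code. The following is a minimal sketch, assuming NumPy is available; the function names and the 0.7 correlation threshold are illustrative choices, not part of the original slides.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: share of the variance in y explained by y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def flag_correlated_pairs(X, names, threshold=0.7):
    """Return predictor pairs whose absolute Pearson correlation meets the threshold.

    X is an (n_samples, n_predictors) array; threshold is an assumed cut-off.
    """
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs
```

A flagged pair suggests the two predictors carry overlapping information, so one of them is a candidate for removal before calibration.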
[Figure: Data Relationships and Domain Relationships]
• Identify factors that predict the desired response, e.g.
– Software growth
– Integration & Testing cost and schedule
• Approach #1: Data Relationships
– Identify correlations between data elements (variables)
– Correlations may be nonsense
– Limited by whatever data is available
• Approach #2: Domain Relationships
– What will cause or influence the desired response
– Data analysis may show there is no correlation between “predictor” and “response”
– No data may be available
• Noisy data confounds analysis!
– There is natural variation in data
– But there is also induced variation in data due to lack of definition
• Define the data to be collected*
– Data item (what is to be collected)
• Included / Excluded
• Unit of measure
– Data attributes (priority, source, scope, estimated/actual)
– Aggregation structure (component, activity, lifecycle increment)
• Is the data consistently collected?
* From Practical Software and Systems Measurement (PSM)
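A data definition of this kind can be written down as a small record type so that every submission is checked against the same rules. This is a sketch only: the class name, field names, and the record keys (`unit`, `kind`) are hypothetical and are not taken from PSM.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataItemDefinition:
    """Hypothetical definition of one data item to be collected."""
    name: str                                          # what is to be collected
    unit_of_measure: str                               # e.g. "SLOC" or "person-hours"
    included: List[str] = field(default_factory=list)  # what counts
    excluded: List[str] = field(default_factory=list)  # what does not count
    priority: str = "required"                         # data attribute
    source: str = ""                                   # data attribute
    scope: str = ""                                    # data attribute
    estimated_or_actual: str = "actual"                # data attribute
    aggregation_level: str = "component"               # component, activity, increment

    def check(self, record: dict) -> List[str]:
        """Return a list of problems found in one submitted record."""
        problems = []
        if self.name not in record:
            problems.append(f"missing value for '{self.name}'")
        if record.get("unit") != self.unit_of_measure:
            problems.append(f"unit should be '{self.unit_of_measure}'")
        if record.get("kind") != self.estimated_or_actual:
            problems.append(f"expected {self.estimated_or_actual} data")
        return problems
```

Running `check` on every incoming record is one way to detect the induced variation that comes from inconsistent definitions before it reaches the analysis.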
1. Gather Collected Data
– Paper forms, web page forms, spreadsheet forms
Boehm Data Collection Rule:
The more data you ask for, the less data you are likely to collect
2. Inspect each Data Record
– From single item or aggregated component
– Size, effort and schedule data
3. Determine Data Quality Levels
– Rate the data by source & completeness
4. Correct Missing or Questionable Data
– Is this ethical?
5. Normalize Size and Effort Data
– Convert to the same units of measure
6. Convert Raw SLOC to Equivalent SLOC
– Account for new, modified, reused, deleted code
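Steps 5 and 6 can be sketched as two small conversion functions. The 152 hours per person-month factor and the weights for modified, reused, and deleted code are assumptions chosen for illustration; they are not values given in these slides, and a real calibration would replace them.

```python
# Assumed conversion factor; substitute the organization's own value.
HOURS_PER_PERSON_MONTH = 152.0

def effort_to_person_months(value, unit):
    """Normalize an effort value to person-months."""
    if unit == "person-months":
        return float(value)
    if unit == "person-hours":
        return float(value) / HOURS_PER_PERSON_MONTH
    raise ValueError(f"unknown effort unit: {unit}")

# Illustrative weights only (hypothetical, not from the source).
DEFAULT_WEIGHTS = {"new": 1.0, "modified": 0.5, "reused": 0.1, "deleted": 0.0}

def equivalent_sloc(new, modified, reused, deleted=0, weights=None):
    """Combine raw SLOC counts into a single equivalent-size number."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    return (w["new"] * new
            + w["modified"] * modified
            + w["reused"] * reused
            + w["deleted"] * deleted)
```

Passing the weights as a parameter keeps the assumption visible and makes it easy to rerun the analysis with a different weighting.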
• Y = A * X^B * C * D * E * F…
– This model is more difficult to calibrate
– Interactions between predictor variables
– More data is required
– This is the COCOMO II software estimation model (2000)
• Y = A * X^B
– This model is simple and easy to calibrate
– Less data required
– May be less accurate
– This is the software estimation model used in the Software Cost Estimation Metrics Manual for Defense Systems (2015)
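One common way to calibrate the simple form Y = A * X^B is to take logarithms, which gives ln Y = ln A + B ln X, and fit a straight line by ordinary least squares. The sketch below assumes NumPy and uses hypothetical function names; it is not presented as the calibration procedure used for either model named above.

```python
import numpy as np

def calibrate_power_model(x, y):
    """Fit Y = A * X^B by least squares on log-transformed data.

    Returns (A, B). All x and y values must be positive.
    """
    ln_x = np.log(np.asarray(x, dtype=float))
    ln_y = np.log(np.asarray(y, dtype=float))
    b, ln_a = np.polyfit(ln_x, ln_y, 1)  # slope first, then intercept
    return float(np.exp(ln_a)), float(b)

def predict(a, b, x):
    """Evaluate Y = A * X^B for one value or an array of values."""
    return a * np.asarray(x, dtype=float) ** b
```

Fitting in log space minimizes relative rather than absolute error, which is one reason this form is easy to calibrate with limited data.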
Regression models must satisfy five assumptions to be valid in their results:
1. The independent variables and the dependent variables have a linear relationship.
2. The dependent variable is a continuous random variable, and the independent variables are set at various values and are not random.
3. The variance of the dependent variable is equal across the various combinations of the independent variables.
4. Successive observed values of the dependent variable are uncorrelated.
5. The distribution of the sampling error, e_i, in the regression model is normal.
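Assumption 4 can be screened from the residuals of a fitted model. The sketch below computes the Durbin-Watson statistic directly with NumPy; the function names are illustrative, and choosing this statistic is an assumption, since the slides do not name a specific test. Values near 2 suggest uncorrelated successive residuals, values near 0 suggest positive correlation, and values near 4 suggest negative correlation.

```python
import numpy as np

def residuals(y, y_hat):
    """Observed minus predicted values."""
    return np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)

def durbin_watson(e):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))
```

The statistic only makes sense when the observations have a meaningful order, for example by project completion date.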
Life cycle processes challenge the use of a model for estimation
[Figure: Evolutionary Overlapped life cycle, showing phases C, T, I, E for increments 1, 2, 3, …]
ACA Estimate Breakdown Structure (EBS): Software
• R 1.0 / 2.0-2.4: FY10 – FY14 (1&2Q), sunk
• R 3.0: CDR R1, ISR R1, IFSV R1, PTC R1, CPE, Test, ES: SE/EA
• R 3.1: CPE, Test, ES: SE/EA
• R 4.1: BPD R3.4, AIR R1, IPF R1, CPE, Test, ES: SE/EA
• R 4.0: CDR R2, ISR R2, CPE, Test, ES: SE/EA
• R 5.0: CDR R3, ISR R3, AIR R2, AVS R1, IRDB R1, CPE, Test, ES: SE/EA
• R 6.0: CPE, Test, ES: SE/EA
• R 6.1: CDR R4, ISR R4, IFSV R2, ACV R1, CPE, Test, ES: SE/EA
• Data specification and collection is the “Achilles heel” of model building
– A reduced number of model variables helps this situation
• The arena of software development is becoming more diverse (rapid development, COTS, variety of platforms)
– Smaller models are quicker to develop and use to meet changing situations
• Software assurance (privacy, security, encryption, intrusion detection, etc.) is becoming increasingly important
• The need for cost and schedule estimating models renews itself continuously