Data Collection and Analysis: Lessons Learned on my Journey

Brad Clark

Software Metrics Inc.

November 17, 2015

Intention

• Share my experiences collecting and analyzing data to create cost and schedule estimating relationships

• Experience breakdown:

– Model conceptualization

– Collecting Data

– Analyzing data, statistics and model formulation

– Using the model

– Conclusion

4/15/2020 Data Collection and Analysis 2

Modeling Requirements

• The model must be explainable (and teachable). The effect on the response variable by varying each predictor variable must be understandable.

• The model should use only enough predictor variables to explain the variation in the response variable, and each predictor should be significant.

• The model must use predictor variables that are independent of each other.

• It must be possible to numerically calibrate the predictor variables in the model using a statistical analysis method (but which method?).

– Examines interrelationships among the predictor variables

– Shows unexplained variance

– Provides a goodness of fit measure of the model to data

– Reports the significance that each predictor variable has in predicting the response.

• The model must be sufficiently accurate.
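One way to make "sufficiently accurate" concrete is to score estimates against actuals. The sketch below uses two measures common in software cost estimation, MMRE (mean magnitude of relative error) and PRED(L) (the fraction of estimates within L of the actual). The slides do not say which accuracy measure was used, so treat the choice of measures, the 25% level, and the sample numbers as assumptions for illustration.

```python
def mre(actual, estimate):
    """Magnitude of relative error for a single project."""
    return abs(actual - estimate) / actual


def mmre(actuals, estimates):
    """Mean magnitude of relative error across all projects."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return sum(errors) / len(errors)


def pred(actuals, estimates, level=0.25):
    """Fraction of projects whose estimate is within `level` of the actual."""
    hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= level)
    return hits / len(actuals)


# Hypothetical actual vs. estimated effort in person-months (not real data).
actual_pm = [100.0, 200.0, 50.0, 80.0]
estimated_pm = [110.0, 150.0, 55.0, 120.0]

print(f"MMRE     = {mmre(actual_pm, estimated_pm):.3f}")
print(f"PRED(25) = {pred(actual_pm, estimated_pm):.2f}")
```

A lower MMRE and a higher PRED(25) both indicate a better fit to the calibration data; what counts as "sufficient" remains a judgment for the estimator.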

Model Conceptualization

[Diagram: Data Relationships and Domain Relationships]

• Identify factors that predict the desired response, e.g.

– Software growth

– Integration & Testing cost and schedule

• Approach #1: Data Relationships

– Identify correlations between data elements (variables)

– Correlations may be nonsense

– Limited by whatever data is available

• Approach #2: Domain Relationships

– What will cause or influence the desired response

– Data analysis may show there is no correlation between “predictor” and “response”

– No data may be available
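Approach #1 can be sketched as a simple correlation screen. The example computes Pearson's r between candidate predictors and a response. All variable names and values are hypothetical, and the "badge number" column is there only to show a candidate that should be rejected on domain grounds whatever its r turns out to be.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical project data (not from any real data set).
effort_pm = [30, 80, 120, 200, 310]          # response
candidates = {
    "size_ksloc": [10, 25, 40, 60, 90],      # plausible cause of effort
    "lead_badge_number": [7, 3, 9, 1, 5],    # no domain reason to matter
}

for name, values in candidates.items():
    print(f"{name:>18}: r = {pearson_r(values, effort_pm):+.2f}")
```

A high r only flags a candidate; Approach #2 (does the domain say this factor should influence the response?) is still needed to rule out nonsense correlations.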

Data Headaches

Data Collection

• Noisy data confounds analysis!

– There is natural variation in data

– But there is also induced variation in data due to lack of definition

• Define the data to be collected*

– Data item (what is to be collected)

• Included / Excluded

• Unit of measure

– Data attributes (priority, source, scope, estimated/actual)

– Aggregation structure (component, activity, lifecycle increment)

• Is the data consistently collected?

* From Practical System and Software Measurement (PSM)
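The definition checklist above could be captured as a small record type so that every submitted data item is checked against the same definition. This is a sketch only: the class name, field names, and the choice of which fields are mandatory are assumptions, not part of PSM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataItemDefinition:
    """Hypothetical record describing one data item to be collected."""
    name: str                                           # what is to be collected
    unit_of_measure: str                                # e.g. "person-hours"
    included: List[str] = field(default_factory=list)   # what counts
    excluded: List[str] = field(default_factory=list)   # what does not count
    priority: str = ""                                  # data attribute
    source: str = ""                                    # data attribute
    scope: str = ""                                     # data attribute
    is_actual: bool = True                              # False means estimated
    aggregation_level: str = "component"                # or activity, increment


def missing_fields(definition):
    """Return the names of mandatory fields that were left empty."""
    required = ("name", "unit_of_measure", "included", "source", "scope")
    return [f for f in required if not getattr(definition, f)]


effort = DataItemDefinition(
    name="Development effort",
    unit_of_measure="person-hours",
    included=["design", "code", "unit test"],
    excluded=["program management"],
    source="timesheet system",
    scope="one release",
)
print(missing_fields(effort))
```

Rejecting or flagging records whose definition is incomplete is one way to reduce the induced variation that comes from a lack of definition.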

Analyzing Data

1. Gather Collected Data

– Paper forms, web page forms, spreadsheet forms

Boehm Data Collection Rule:

The more data you ask for, the less data you are likely to collect

2. Inspect each Data Record

– From single item or aggregated component

– Size, effort and schedule data

3. Determine Data Quality Levels

– Rate the data by source & completeness

4. Correct Missing or Questionable Data

– Is this ethical?

5. Normalize Size and Effort Data

– Convert to the same units of measure

6. Convert Raw SLOC to Equivalent SLOC

– Account for new, modified, reused, deleted code
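Steps 5 and 6 can be sketched as two small conversion functions. The hours-per-person-month factor and the weights applied to modified, reused, and deleted code are placeholder assumptions chosen for illustration; the slides do not give the values used, so substitute your organization's own.

```python
# Assumption: 152 hours per person-month; replace with your local definition.
HOURS_PER_PERSON_MONTH = 152.0

# Assumption: illustrative weights only, not calibrated values.
DEFAULT_WEIGHTS = {"new": 1.0, "modified": 0.5, "reused": 0.2, "deleted": 0.0}


def effort_to_person_months(value, unit):
    """Normalize an effort value to person-months."""
    if unit == "person-months":
        return float(value)
    if unit == "person-hours":
        return value / HOURS_PER_PERSON_MONTH
    raise ValueError(f"unknown effort unit: {unit}")


def equivalent_sloc(new, modified, reused, deleted=0, weights=None):
    """Convert raw SLOC counts to a single equivalent-size figure."""
    w = dict(DEFAULT_WEIGHTS)
    if weights:
        w.update(weights)
    return (new * w["new"] + modified * w["modified"]
            + reused * w["reused"] + deleted * w["deleted"])


# Hypothetical record: effort reported in hours, size reported by code category.
print(effort_to_person_months(3040, "person-hours"))
print(equivalent_sloc(new=1000, modified=2000, reused=5000, deleted=300))
```

Keeping the conversion factors in one place makes it easy to rerun the normalization when a definition changes, and raising an error on an unknown unit stops mislabeled records from slipping into the analysis.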

Model Calibration

• Y = A * X^B * C * D * E * F…

– This model is more difficult to calibrate

– Interactions between predictor variables

– More data is required

– This is the COCOMO II software estimation model (2000)

• Y = A * X^B

– This model is simple and easy to calibrate

– Less data required

– May be less accurate

– This is the software estimation model used in the Software Cost Estimation Metrics Manual for Defense Systems (2015)
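A common way to calibrate the simple form Y = A * X^B is to take logarithms, which gives the straight line ln(Y) = ln(A) + B * ln(X), and then fit that line by ordinary least squares. The sketch below does this with the standard library only. It is one possible calibration method, not necessarily the one used for either model named above, and the data is synthetic and noise-free so that the known A and B are recovered.

```python
import math


def calibrate_power_model(sizes, efforts):
    """Fit effort = A * size**B by least squares on the log-transformed data."""
    log_x = [math.log(x) for x in sizes]
    log_y = [math.log(y) for y in efforts]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                          # slope in log space is the exponent B
    a = math.exp(mean_y - b * mean_x)      # intercept in log space is ln(A)
    return a, b


# Synthetic data generated from A = 3.0, B = 1.1 (illustration only).
sizes_ksloc = [5, 10, 20, 50, 100]
efforts_pm = [3.0 * s ** 1.1 for s in sizes_ksloc]

a, b = calibrate_power_model(sizes_ksloc, efforts_pm)
print(f"A = {a:.3f}, B = {b:.3f}")
```

With only two parameters to estimate, a handful of data points is enough to run the fit, which is the practical sense in which this form needs less data than the multi-factor model.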

Regression Model Assumptions

Regression models must satisfy five assumptions to be valid in their results:

1. The independent variables and the dependent variables have a linear relationship.

2. The dependent variable is a continuous random variable, and the independent variables are set at various values and are not random.

3. The variance of the dependent variable is the same for all combinations of values of the independent variables.

4. Successive observed values of the dependent variable are uncorrelated.

5. The distribution of the sampling error, e_i, in the regression model is normal.
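Some of these assumptions can be spot-checked from the residuals of a fitted model. The sketch below computes the Durbin-Watson statistic as a check on assumption 4 (values near 2 suggest no first-order correlation between successive residuals) and a crude split-half variance ratio as a check on assumption 3. The residual values are hypothetical, the split-half ratio is an informal screen rather than a formal test, and Durbin-Watson is meaningful only when the observations have a natural order such as time.

```python
import statistics


def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def split_half_variance_ratio(residuals):
    """Variance of the first half of the residuals divided by that of the second half."""
    half = len(residuals) // 2
    first, second = residuals[:half], residuals[half:]
    return statistics.pvariance(first) / statistics.pvariance(second)


# Hypothetical residuals (observed minus predicted), ordered by predictor value.
residuals = [0.10, -0.20, 0.15, -0.05, 0.30, -0.25, 0.20, -0.30]

print(f"Durbin-Watson        = {durbin_watson(residuals):.2f}")
print(f"Split-half var ratio = {split_half_variance_ratio(residuals):.2f}")
```

A variance ratio far from 1 hints that the spread of the residuals changes across the range of the predictor, which is a reason to look at a residual plot before trusting the fit.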

Using the Model

[Diagram: Evolutionary Overlapped life cycle, with phases C, T, I, and E shown for increments 1, 2, and 3]

Life cycle processes challenge the use of a model for estimation

Estimate Breakdown Structure (EBS)

ACA Estimate Breakdown Structure (EBS): Software

• R 1.0 / 2.0-2.4: FY10 – FY14 (1&2Q), Sunk

• R 3.0: CDR R1, ISR R1, IFSV R1, PTC R1, CPE, Test, ES: SE/EA

• R 3.1: CPE, Test, ES: SE/EA

• R 4.1: BPD R3.4, AIR R1, IPF R1, CPE, Test, ES: SE/EA

• R 4.0: CDR R2, ISR R2, CPE, Test, ES: SE/EA

• R 5.0: CDR R3, ISR R3, AIR R2, AVS R1, IRDB R1, CPE, Test, ES: SE/EA

• R 6.0: CPE, Test, ES: SE/EA

• R 6.1: CDR R4, ISR R4, IFSV R2, ACV R1, CPE, Test, ES: SE/EA

Conclusions

• Data specification and collection are the “Achilles heel” of model building

– A reduced number of model variables helps this situation

• The arena of software development is becoming more diverse (rapid development, COTS, variety of platforms)

– Smaller models are quicker to develop and use to meet changing situations

• Software assurance (privacy, security, encryption, intrusion detection, etc.) is becoming increasingly important

• The need for cost and schedule estimating models renews itself continuously
