Frank Tsui - Analysis and Evaluation of Data

Managing Software Projects
Analysis and Evaluation of Data
- Reliable, Accurate, and Valid Data
- Distribution of Data
- Centrality and Dispersion
- Data Smoothing: Moving Averages
- Data Correlation
- Normalization of Data
(Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)
Reliable, Accurate, and Valid Data
Definitions
• Reliable data: Data that are collected and tabulated according to
the defined rules of measurement and metric
• Accurate data: Data that are collected and tabulated according
to the defined level of precision of measurement and metric
• Valid data: Data that are collected, tabulated, and applied
according to the defined intention of applying the measurement
Distribution of Data
Definition
• Data distribution: A description of a collection of data
that shows the spread of the values and the frequency of
occurrences of the values of the data
Example #1: Skew of the Distribution
The number of problems detected at each of five severity levels
• Severity level 1: 23
• Severity level 2: 46
• Severity level 3: 79
• Severity level 4: 95
• Severity level 5: 110
Example #1 (continued)
Number of problems is skewed towards the higher-numbered severity levels
[Bar chart: Number of Problems Found (0 to 120) by Severity Level (1 to 5)]
Example #2: Range of Data Values
The number of severity level 1 problems by functional area
• Functional area 1: 2
• Functional area 2: 7
• Functional area 3: 3
• Functional area 4: 8
• Functional area 5: 0
• Functional area 6: 1
• Functional area 7: 8
The range is from 0 to 8
Example #3: Data Trends
The total number of problems found in a specific
functional area across the test time period in weeks
• Week 1: 20
• Week 2: 23
• Week 3: 45
• Week 4: 67
• Week 5: 35
• Week 6: 15
• Week 7: 10
Centrality and Dispersion
Definition
• Centrality analysis: An analysis of a data set to find the
typical value of that data set
• Approaches
  – Average value
  – Median value
  – Mode value
  – Variance and standard deviation
  – Control chart
Average, Median, and Mode
• Average value (or mean): One type of centrality analysis that
estimates the typical (or middle) value of a data set by summing all the
observed data values and dividing the sum by the number of data
points
– This is the most common of the centrality analysis methods
• Median: A value used in centrality analysis to estimate the typical (or
middle) value of a data set. After the data values are sorted, the
median is the data value that splits the data set into upper and lower
halves
– If there are an even number of values, the values of the middle two
observations are averaged to obtain the median
• Mode: The most frequently occurring value in a data set
– If the data set contains floating point values, use the highest frequency of
values occurring between two consecutive integers (inclusive)
Example
Data Set = {2, 7, 3, 8, 0, 1, 8}
Average = xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1
Median = 3 (sorted data: 0, 1, 2, 3, 7, 8, 8)
Mode = 8
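These three values can be checked with a short Python sketch (not part of the original slides), using only the standard library:

    import statistics

    data = [2, 7, 3, 8, 0, 1, 8]

    print(statistics.mean(data))    # 4.142857..., shown rounded to 4.1 on the slide
    print(statistics.median(data))  # 3
    print(statistics.mode(data))    # 8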
Variance and Standard Deviation
• Variance: The average of the squared deviations from the average value
  s² = SUM [ (xi – xavg)² ] / (n – 1)
• Standard deviation: The square root of the variance. A metric used to define
  and measure the dispersion of data from the average value in a data set
• It is numerically defined as follows:
  s = SQRT [ SUM [ (xi – xavg)² ] / (n – 1) ]
  where
    SQRT = square root function
    SUM = sum function
    xi = ith observation
    xavg = average of all xi
    n = total number of observations
Standard Deviation: Example
Data Set = {2, 7, 3, 8, 0, 1, 8}
xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1
SUM [ (xi – xavg)² ] = 4.41 + 8.41 + 1.21 + 15.21 + 16.81 + 9.61 + 15.21
                     = 70.87
Variance = SUM [ (xi – xavg)² ] / (n – 1) = 70.87 / 6 = 11.81
Standard deviation = s = SQRT(11.81) = 3.44
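The same result in Python (a sketch, not from the slides); statistics.variance and statistics.stdev use the same n – 1 divisor as the formulas above:

    import statistics

    data = [2, 7, 3, 8, 0, 1, 8]

    print(round(statistics.variance(data), 2))  # 11.81 (sample variance)
    print(round(statistics.stdev(data), 2))     # 3.44  (sample standard deviation)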
Control Chart
• Control chart: A chart used to assess and control the
variability of some process or product characteristic
• It usually involves establishing lower and upper limits (the
control limits) of data variations from the data set’s
average value
• If an observed data value falls outside the control limits,
then it would trigger evaluation of the characteristic
Control Chart (continued)
[Control chart for the example data set: upper control limit at 7.54 problems, average at 4.1 problems, lower control limit at 0.66 problems]
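The limits shown here equal the average (4.1) plus and minus one sample standard deviation (3.44). A minimal sketch (not from the slides; the one-standard-deviation choice is only an assumption to match these numbers, and projects often use wider limits):

    import statistics

    data = [2, 7, 3, 8, 0, 1, 8]   # problems found per functional area

    avg = statistics.mean(data)
    s = statistics.stdev(data)

    upper = avg + s   # about 7.6 (7.54 on the slide, which uses the rounded 4.1 and 3.44)
    lower = avg - s   # about 0.7 (0.66 on the slide)

    for x in data:
        if x < lower or x > upper:
            print(f"{x} is outside the control limits -> evaluate this data point")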
Data Smoothing: Moving Averages
Definitions
• Moving average: A technique for expressing data by
computing the average of a fixed grouping (e.g., data for a
fixed period) of data values; it is often used to suppress the
effects of one extreme data point
• Data smoothing: A technique used to decrease the effects
of individual, extreme variability in data values
Example
Test week   Problems found   2-week moving avg   3-week moving avg
    1             20                -                    -
    2             33              26.5                   -
    3             45              39                   32.7
    4             67              56                   48.3
    5             35              51                   49
    6             15              25                   39
    7             20              17.5                 23.3
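The two moving-average columns can be reproduced with a short Python sketch (not part of the original slides):

    # n-week moving average: average of the current week and the n-1 weeks before it
    def moving_average(values, n):
        return [
            round(sum(values[i - n + 1 : i + 1]) / n, 1) if i >= n - 1 else None
            for i in range(len(values))
        ]

    problems = [20, 33, 45, 67, 35, 15, 20]

    print(moving_average(problems, 2))  # [None, 26.5, 39.0, 56.0, 51.0, 25.0, 17.5]
    print(moving_average(problems, 3))  # [None, None, 32.7, 48.3, 49.0, 39.0, 23.3]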
Data Correlation
Definition
• Data correlation: A technique that analyzes the degree of
relationship between sets of data
• One sought-after relationship in software is that between some attribute
prior to product release and the same attribute after product release
• One popular way to examine data correlation is to analyze
whether a linear relationship exists
– Two sets of data are paired together and plotted
– The resulting graph is reviewed to detect any relationship between
the data sets
Linear Regression
• Linear regression: A technique that estimates the relationship
between two sets of data by fitting a straight line to the two sets
of data values
• This is a more formal method of doing data correlation
• Linear regression uses the equation of a line: y = mx + b,
where m is the slope and b is the y-intercept value
• To calculate the slope, use the following:
m = SUM [ (xi – xavg) × (yi – yavg) ] / SUM [ (xi – xavg)² ]
• To calculate the y-intercept, use the following:
b = yavg – (m × xavg)
Example
Pre-release and Post-release Problems
SW Product   #Pre-release   #Post-release
    A             10              24
    B              5              13
    C             35              71
    D             75             155
    E             15              34
    F             22              50
    G              7              16
    H             54             112
Example (continued)
xavg = 27.9
yavg = 59.4
m = 2.0 (slope, approx.)
b = 3.6 (y-intercept, approx.)
y = 2x + 3.6
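Applying the slope and y-intercept formulas from the previous slide to this data set in Python (a sketch, not from the slides):

    pre  = [10, 5, 35, 75, 15, 22, 7, 54]      # x: pre-release problems
    post = [24, 13, 71, 155, 34, 50, 16, 112]  # y: post-release problems

    x_avg = sum(pre) / len(pre)     # 27.875 (27.9 on the slide)
    y_avg = sum(post) / len(post)   # 59.375 (59.4 on the slide)

    m = (sum((x - x_avg) * (y - y_avg) for x, y in zip(pre, post))
         / sum((x - x_avg) ** 2 for x in pre))
    b = y_avg - m * x_avg

    print(round(m, 2), round(b, 2))  # about 2.02 and 3.13; the slide's b = 3.6
                                     # comes from plugging the rounded m = 2.0 and
                                     # rounded averages into the intercept formula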
Example (continued)
[Scatter plot: Number of Post-release Problems Found (0 to 200) vs. Number of Pre-release Problems Found (0 to 80), with the points falling roughly along a straight line]
Normalization of Data
Definition
• Normalizing data: A technique used to bring data
characterizations to some common or standard level so that
comparisons become more meaningful
• This is needed because a pure comparison of raw data
sometimes does not provide an accurate comparison
• The number of source lines of code is the most common
means of normalizing data
– Function points may also be used
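As an illustration (not from the book, and using made-up numbers), normalizing defect counts by thousands of source lines of code lets products of different sizes be compared on defect density rather than raw counts:

    # Hypothetical products: (defects found, source lines of code)
    products = {
        "Product X": (120, 40_000),
        "Product Y": (150, 90_000),
    }

    for name, (defects, sloc) in products.items():
        density = defects / (sloc / 1000)
        print(f"{name}: {density:.1f} defects per KSLOC")

    # Product Y has more raw defects, but Product X has the higher density
    # (3.0 vs. about 1.7 per KSLOC), so comparing raw counts alone would mislead.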
Summary
• Reliable, Accurate, and Valid Data
• Distribution of Data
• Centrality and Dispersion
• Data Smoothing: Moving Averages
• Data Correlation
• Normalization of Data
