Managing Software Projects
Analysis and Evaluation of Data

- Reliable, Accurate, and Valid Data
- Distribution of Data
- Centrality and Dispersion
- Data Smoothing: Moving Averages
- Data Correlation
- Normalization of Data

(Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)

Reliable, Accurate, and Valid Data
Definitions
• Reliable data: Data that are collected and tabulated according to the defined rules of measurement and metric
• Accurate data: Data that are collected and tabulated according to the defined level of precision of measurement and metric
• Valid data: Data that are collected, tabulated, and applied according to the defined intention of applying the measurement

Distribution of Data
Definition
• Data distribution: A description of a collection of data that shows the spread of the values and the frequency of occurrences of the values of the data

Example #1: Skew of the Distribution
The number of problems detected at each of five severity levels:
• Severity level 1: 23
• Severity level 2: 46
• Severity level 3: 79
• Severity level 4: 95
• Severity level 5: 110

Example #1 (continued)
The number of problems is skewed towards the higher-numbered severity levels.
[Bar chart: Number of Problems Found by Severity Level, rising from 23 at level 1 to 110 at level 5]

Example #2: Range of Data Values
The number of severity level 1 problems by functional area:
• Functional area 1: 2
• Functional area 2: 7
• Functional area 3: 3
• Functional area 4: 8
• Functional area 5: 0
• Functional area 6: 1
• Functional area 7: 8
The range is from 0 to 8.

Example #3: Data Trends
The total number of problems found in a specific functional area across the test time period, in weeks:
• Week 1: 20
• Week 2: 23
• Week 3: 45
• Week 4: 67
• Week 5: 35
• Week 6: 15
• Week 7: 10

Centrality and Dispersion
Definition
• Centrality analysis: An analysis of a data set to find the typical value of that data set
• Approaches:
  – Average value
  – Median value
  – Mode value
  – Variance and standard deviation
  – Control chart

Average, Median, and Mode
• Average value (or mean): One type of centrality analysis that estimates the typical (or middle) value of a data set by summing all the observed data values and dividing the sum by the number of data points
  – This is the most common of the centrality analysis methods
• Median: A value used in centrality analysis to estimate the typical (or middle) value of a data set. After the data values are sorted, the median is the data value that splits the data set into upper and lower halves
  – If there is an even number of values, the values of the middle two observations are averaged to obtain the median
• Mode: The most frequently occurring value in a data set
  – If the data set contains floating-point values, use the highest frequency of values occurring between two consecutive integers (inclusive)

Example
Data set = {2, 7, 3, 8, 0, 1, 8}
Average = xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1
Median = 3 (sorted: 0, 1, 2, 3, 7, 8, 8)
Mode = 8
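The three measures above are easy to check mechanically. The short Python sketch below is an illustration (it is not from the text) that recomputes the average, median, and mode for the example data set; all the variable names are my own.

```python
from collections import Counter

data = [2, 7, 3, 8, 0, 1, 8]

# Average (mean): sum of the values divided by the number of values
average = sum(data) / len(data)              # 29 / 7, about 4.1

# Median: middle value of the sorted data; with an even count,
# the two middle values would be averaged instead
ordered = sorted(data)                       # [0, 1, 2, 3, 7, 8, 8]
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    median = ordered[mid]
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the most frequently occurring value
mode = Counter(data).most_common(1)[0][0]    # 8 (occurs twice)

print(average, median, mode)                 # 4.14..., 3, 8
```

The standard library's statistics module (statistics.mean, statistics.median, statistics.mode) gives the same results; the explicit version simply mirrors the definitions on the slide.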
Variance and Standard Deviation
• Variance: The average of the squared deviations from the average value
  s^2 = SUM[ (xi – xavg)^2 ] / (n – 1)
• Standard deviation: The square root of the variance; a metric used to define and measure the dispersion of data from the average value in a data set. It is numerically defined as follows:
  s = SQRT[ SUM[ (xi – xavg)^2 ] / (n – 1) ]
  where
    SQRT = square root function
    SUM = sum function
    xi = ith observation
    xavg = average of all xi
    n = total number of observations

Standard Deviation: Example
Data set = {2, 7, 3, 8, 0, 1, 8}
xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1
SUM[ (xi – xavg)^2 ] = 4.41 + 8.41 + 1.21 + 15.21 + 16.81 + 9.61 + 15.21 = 70.87
SUM[ (xi – xavg)^2 ] / (n – 1) = 70.87 / 6 = 11.81
Standard deviation = s = SQRT(11.81) = 3.44

Control Chart
• Control chart: A chart used to assess and control the variability of some process or product characteristic
• It usually involves establishing lower and upper limits (the control limits) of data variations from the data set's average value
• If an observed data value falls outside the control limits, it triggers an evaluation of the characteristic

Control Chart (continued)
[Control chart for the example data set: average of 4.1 problems, upper control limit of 7.54 problems, lower control limit of 0.66 problems]

Data Smoothing: Moving Averages
Definitions
• Moving average: A technique for expressing data by computing the average of a fixed grouping (e.g., data for a fixed period) of data values; it is often used to suppress the effect of one extreme data point
• Data smoothing: A technique used to decrease the effects of individual, extreme variability in data values

Example
Test week   Problems found   2-week moving avg   3-week moving avg
    1             20                 –                   –
    2             33               26.5                  –
    3             45               39                   32.7
    4             67               56                   48.3
    5             35               51                   49
    6             15               25                   39
    7             20               17.5                 23.3
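As a quick cross-check on the dispersion figures worked out earlier (the standard deviation of 3.44 and the control limits of 0.66 and 7.54), here is a minimal Python sketch. It is an illustration, not the author's code, and it assumes the control limits are set at one standard deviation above and below the average, which is what the numbers on the control chart correspond to.

```python
import math

data = [2, 7, 3, 8, 0, 1, 8]          # the example data set
n = len(data)

avg = sum(data) / n                    # about 4.1

# Sample variance: sum of squared deviations from the average,
# divided by (n - 1)
variance = sum((x - avg) ** 2 for x in data) / (n - 1)   # about 11.81
std_dev = math.sqrt(variance)                            # about 3.44

# Control limits one standard deviation above and below the average.
# The chart in the text shows 0.66 and 7.54, which come from the rounded
# values 4.1 and 3.44; the unrounded figures differ slightly.
lower_limit = avg - std_dev                              # about 0.71
upper_limit = avg + std_dev                              # about 7.58

# Flag any observation outside the control limits for evaluation
out_of_control = [x for x in data if x < lower_limit or x > upper_limit]

print(std_dev, lower_limit, upper_limit, out_of_control)
```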
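The moving averages in the example table can likewise be reproduced in a few lines. The sketch below is illustrative Python (not from the text); the helper name moving_average is my own.

```python
# Weekly problem counts from the moving-average example
weekly = [20, 33, 45, 67, 35, 15, 20]

def moving_average(values, window):
    """Average of each consecutive group of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# The first result in each list corresponds to test week `window`,
# which is why the table leaves the earlier weeks blank.
two_week = moving_average(weekly, 2)
# [26.5, 39.0, 56.0, 51.0, 25.0, 17.5] -- matches the 2-week column
three_week = moving_average(weekly, 3)
# [32.67, 48.33, 49.0, 39.0, 23.33] -- matches the 3-week column (rounded)

print(two_week)
print(three_week)
```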
Data Correlation
Definition
• Data correlation: A technique that analyzes the degree of relationship between sets of data
• One sought-after relationship in software is that between some attribute prior to product release and the same attribute after product release
• One popular way to examine data correlation is to analyze whether a linear relationship exists
  – The two sets of data are paired together and plotted
  – The resulting graph is reviewed to detect any relationship between the data sets

Linear Regression
• Linear regression: A technique that estimates the relationship between two sets of data by fitting a straight line to the two sets of data values
• This is a more formal method of performing data correlation
• Linear regression uses the equation of a line, y = mx + b, where m is the slope and b is the y-intercept
• To calculate the slope:
  m = SUM[ (xi – xavg) × (yi – yavg) ] / SUM[ (xi – xavg)^2 ]
• To calculate the y-intercept:
  b = yavg – (m × xavg)

Example: Pre-release and Post-release Problems
SW Product       A    B    C    D    E    F    G    H
#Pre-release    10    5   35   75   15   22    7   54
#Post-release   24   13   71  155   34   50   16  112

Example (continued)
xavg = 27.9
yavg = 59.4
m = 2.0 (approximate slope)
b = 3.6 (approximate y-intercept)
y = 2x + 3.6

Example (continued)
[Scatter plot: number of post-release problems found versus number of pre-release problems found]

Normalization of Data
Definition
• Normalizing data: A technique used to bring data characterizations to some common or standard level so that comparisons become more meaningful
• This is needed because a pure comparison of raw data sometimes does not provide an accurate comparison
• The number of source lines of code is the most common means of normalizing data
  – Function points may also be used

Summary
• Reliable, Accurate, and Valid Data
• Distribution of Data
• Centrality and Dispersion
• Data Smoothing: Moving Averages
• Data Correlation
• Normalization of Data
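As a closing worked sketch (illustrative Python, not from the text), the regression coefficients from the pre-release/post-release example can be recomputed directly from the slope and intercept formulas given earlier.

```python
# Pre-release (x) and post-release (y) problem counts for products A-H
x = [10, 5, 35, 75, 15, 22, 7, 54]
y = [24, 13, 71, 155, 34, 50, 16, 112]

x_avg = sum(x) / len(x)                                  # about 27.9
y_avg = sum(y) / len(y)                                  # about 59.4

# Slope: m = SUM[(xi - xavg) * (yi - yavg)] / SUM[(xi - xavg)^2]
m = (sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
     / sum((xi - x_avg) ** 2 for xi in x))               # about 2.0

# Intercept: b = yavg - m * xavg
# (about 3.1 here; the text's 3.6 comes from rounding the slope to 2.0)
b = y_avg - m * x_avg

# The fitted line y = mx + b can then project post-release problems
# from a pre-release count, e.g. for 30 pre-release problems:
print(m, b, m * 30 + b)
```

Normalization can be illustrated just as briefly. The product names and figures below are hypothetical, made up only to show why dividing by size (here, thousands of source lines of code) makes the comparison more meaningful, as the slide suggests.

```python
# Hypothetical raw data: total defects and product size in KLOC
# (thousands of source lines of code); not taken from the text.
products = {"Product X": (120, 40.0),    # 120 defects, 40 KLOC
            "Product Y": (90, 15.0)}     # 90 defects, 15 KLOC

for name, (defects, kloc) in products.items():
    density = defects / kloc             # defects per KLOC
    print(f"{name}: {defects} defects, {density:.1f} per KLOC")

# Product Y has fewer raw defects (90 vs. 120) but a much higher defect
# density (6.0 vs. 3.0 per KLOC) -- a comparison the raw counts alone hide.
```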