Measures of Variability

I. Range

The range for a set of data items is the difference between the largest and smallest values. Although the range is the easiest of the numerical measures of variability to compute, it is not widely used because it is based on only two of the items in the data set and thus is influenced too much by extreme data values.

II. Interquartile Range

A form of the range that avoids the dependence on extreme values in the data set is the interquartile range (IQR), or Q-spread. This descriptive measure of variability is simply the difference between the third quartile ($Q_3$), or 75%-tile data item, and the first quartile ($Q_1$), or 25%-tile data item. In effect, it shows the range for the middle 50% of the data and, as such, is not affected by the extreme values in the data set.

To calculate $Q_3$, let $i = \frac{3}{4}N$, where $N$ is the number of data items, arranged in increasing order. If $i$ is not an integer, then the next integer greater than $i$ denotes the position of the 75%-tile; if $i$ is an integer, then the 75%-tile is the average of the data values in positions $i$ and $i + 1$. Similarly, to calculate $Q_1$, let $i = \frac{1}{4}N$ and follow the same guidelines as above.

Example 1: Given the following data: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. Find the IQR.

$N = 10$, so $i = \frac{3}{4}(10) = 7.5$. $Q_3$ is the 8th data item, so $Q_3 = 19$. Next, $i = \frac{1}{4}(10) = 2.5$. $Q_1$ is the 3rd data item, so $Q_1 = 5$. Therefore, IQR = 19 - 5 = 14.

Example 2: Given the following data: 2, 3, 5, 7, 11, 13, 17, 19. Find the IQR.

$N = 8$, so $i = \frac{3}{4}(8) = 6$. $Q_3$ is the average of the data values in the 6th and 7th positions, so $Q_3 = \frac{13 + 17}{2} = 15$. Next, $i = \frac{1}{4}(8) = 2$. $Q_1$ is the average of the data values in the 2nd and 3rd positions, so $Q_1 = \frac{3 + 5}{2} = 4$. Therefore, IQR = 15 - 4 = 11.

III. Average Absolute Deviation from the Mean

Obviously, there are limitations in using range or interquartile range as measures of variability. It would seem reasonable that any useful measure of variability should measure the spread around the mean since the mean is the “balance point” of a distribution.
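Before turning to deviations from the mean, the range and the quartile position rule from Sections I and II can be sketched in code. This is a minimal illustration, not a standard library routine: the function names `data_range`, `quartile`, and `iqr` are made up for this sketch, and the rule implemented is the one described in Section II, which can differ from the quartile conventions used by statistical software.

```python
def data_range(data):
    """Range: largest value minus smallest value."""
    return max(data) - min(data)


def quartile(ordered, k):
    """k-th quartile (k = 1 or 3) of an already sorted list.

    Follows the Section II rule with i = (k/4) * N and 1-based positions.
    Integer arithmetic is used so the "is i an integer?" test is exact.
    """
    n = len(ordered)
    whole, remainder = divmod(k * n, 4)
    if remainder != 0:
        # i is not an integer: use the next position greater than i.
        return ordered[whole]  # 1-based position whole + 1
    # i is an integer: average the items in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


def iqr(data):
    """Interquartile range: Q3 minus Q1."""
    ordered = sorted(data)
    return quartile(ordered, 3) - quartile(ordered, 1)


example_1 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
example_2 = [2, 3, 5, 7, 11, 13, 17, 19]

print(iqr(example_1))  # 14, as in Example 1
print(iqr(example_2))  # 11.0, as in Example 2
```

The sketch reproduces the two worked examples: positions 8 and 3 are used when N = 10, and pairs of adjacent positions are averaged when N = 8.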
If you find the difference between each data item and the mean, you will get negative values for items that are less than the mean and positive values for items greater than the mean. If you then sum up all of these differences, you will get zero; this illustrates a special property of the mean. However, by taking the absolute value of each difference, you will get the distance of each item from the mean, and the sum of these distances would measure the total spread around the mean. If you were to include more data items, equally spread around the mean, you would increase the total of the distances even though the new distribution might be less variable. Therefore, it is important to divide the total absolute deviation by the number of data items; this will give an average absolute deviation from the mean.

$$\text{Average Absolute Deviation} = \frac{\sum |X - \bar{X}|}{N}$$

This average absolute deviation gives the average distance of a data item from the mean and thus is a good measure of spread.

IV. Standard Deviation

If you were to calculate the average absolute deviation of a distribution using a value other than the mean, you could possibly get a smaller average absolute deviation. This result is one of the reasons that the average absolute deviation is not the best measure of variability. Instead, calculate the average of the squared differences from the mean; this is the variance of a distribution. If you were to calculate the average of the squared differences of a distribution by using a value other than the mean, you would always get a larger value. The mean is the one number that minimizes the average of the squared differences in a distribution.

$$\text{Variance} = \sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$$

There are still two slight inconveniences in using variance as our measure of variability. First, variance does not give an estimate of the distance of a typical data item from the mean; because the differences are squared, it is typically too big.
Second, if the data items have a unit of measurement associated with them, then the variance would not have the same unit of measurement; it would have square units. By taking the square root of the variance, we get the standard deviation, which is the measure of variability that we want.

$$\text{Standard Deviation} = \sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$

The standard deviation can be calculated in an alternative way.

$$\text{Standard Deviation} = \sigma = \sqrt{\frac{\sum X^2}{N} - \bar{X}^2}$$

Example: Given the following histogram, estimate the standard deviation.

[Histogram: vertical axis in %/cig; horizontal axis "Number of cigarettes" with class boundaries at 0, 10, 20, 40, 80.]

Class interval   Midpoint   Area of block
0 to 10          5          10%
10 to 20         15         30%
20 to 40         30         40%
40 to 80         60         20%

Recall that the mean of a histogram can be determined by calculating a “weighted” average using the midpoints of the class intervals and the areas of the blocks. Thus,

$$\bar{X} = .1(5) + .3(15) + .4(30) + .2(60) = .5 + 4.5 + 12 + 12 = 29 \text{ cigarettes}.$$

The standard deviation of a histogram can also be calculated using the midpoints of the class intervals, the areas of the blocks, and the “weighted” average. Using the first formula, we get:

$$SD = \sqrt{\frac{.1(5 - 29)^2 + .3(15 - 29)^2 + .4(30 - 29)^2 + .2(60 - 29)^2}{.1 + .3 + .4 + .2}} \approx 17.6 \text{ cig}$$

Using the second formula, we get:

$$SD = \sqrt{\frac{.1(5^2) + .3(15^2) + .4(30^2) + .2(60^2)}{.1 + .3 + .4 + .2} - 29^2} \approx 17.6 \text{ cig}$$

Important Note: Some textbooks will give the following formulas for variance and standard deviation:

$$\text{Variance} = s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum X^2 - N\bar{X}^2}{N - 1}$$

$$\text{Standard Deviation} = s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} = \sqrt{\frac{\sum X^2 - N\bar{X}^2}{N - 1}}$$

These formulas should be used when the N data items are taken as a sample from a larger population in which the variance and standard deviation of that population are unknown. These formulas give good approximations of the variance and standard deviation of the population.

Practice Sheet: Measures of Variability

I. The following are 25 final averages in a math class:

46 49 53 60 61
64 66 66 67 71
72 74 75 76 79
79 79 80 83 88
89 91 94 95 98

(1) What is the range?
(2) What is the interquartile range?

II. Given the following data: 5, 7, 11, 12, 13, 18.

(1) What is the mean?
(2) What is the average absolute deviation from the mean?
(3) What is the median?
(4) What is the average absolute deviation from the median?
(5) What is the standard deviation?
(6) Add 8 to each item. What is the new SD?
(7) Subtract 7 from each item. What is the new SD?
(8) Multiply each item by 7. What is the new SD?
(9) Divide each item by 5. What is the new SD?

III. In the histogram given below, the class intervals include the right endpoint, not the left:

[Histogram: vertical axis in %/$1000 with ticks at 0, 0.25, 0.50, 0.75, 1.00, 1.25; horizontal axis "Income (in $1000)" with class boundaries at 0, 20, 40, 80, 100, 120.]

(1) What is the estimated mean?
(2) What is the estimated standard deviation?
(3) What is the estimated interquartile range?

IV. Class A: $N = 20$, $\bar{X} = 70$, $\sigma_X = 10$
    Class B: $N = 30$, $\bar{X} = 80$, $\sigma_X = 6$

(1) Find $\sum X$ for class A.
(2) Find $\sum X$ for class B.
(3) Find $\sum X$ for the two classes combined.
(4) Find $\bar{X}$ for the combined classes.
(5) Find $\sum X^2$ for class A. [Hint: Use the alternative formula for SD.]
(6) Find $\sum X^2$ for class B.
(7) Find $\sum X^2$ for the combined classes.
(8) Find $\sigma_X$ for the combined classes.

Solution Key for Measures of Variability

I.
(1) 98 - 46 = 52
(2) 83 - 66 = 17

II.
(1) 11
(2) 3 1/3
(3) 11.5
(4) 3 1/3
(5) 4.2
(6) 4.2
(7) 4.2
(8) 29.4
(9) .84

III.
(1) 56
(2) 26
(3) 76 - 35 = 41

IV.
(1) 1400
(2) 2400
(3) 3800
(4) 76
(5) 100,000
(6) 193,080
(7) 293,080
(8) 9.25
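The Part II answers in the key above can be checked numerically. The following is a minimal sketch using only the population formulas (dividing by N) from Sections III and IV; the helper names `mean`, `median`, `avg_abs_deviation`, `sd`, and `sd_alternative` are chosen for this illustration and are not taken from any particular library.

```python
import math


def mean(data):
    return sum(data) / len(data)


def median(data):
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle items.
    return (ordered[mid - 1] + ordered[mid]) / 2


def avg_abs_deviation(data, center):
    """Average distance of the data items from the given center."""
    return sum(abs(x - center) for x in data) / len(data)


def sd(data):
    """First formula: root of the average squared difference from the mean."""
    m = mean(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))


def sd_alternative(data):
    """Alternative formula: root of (average of squares minus squared mean)."""
    m = mean(data)
    return math.sqrt(sum(x * x for x in data) / len(data) - m ** 2)


data = [5, 7, 11, 12, 13, 18]

print(mean(data))                                      # 11.0
print(round(avg_abs_deviation(data, mean(data)), 2))   # 3.33
print(median(data))                                    # 11.5
print(round(avg_abs_deviation(data, median(data)), 2)) # 3.33
print(round(sd(data), 1))                              # 4.2
print(round(sd([x + 8 for x in data]), 1))             # 4.2
print(round(sd([x - 7 for x in data]), 1))             # 4.2
print(round(sd([x * 7 for x in data]), 1))             # 29.4
print(round(sd([x / 5 for x in data]), 2))             # 0.84
```

Adding or subtracting a constant leaves the SD unchanged because every difference from the mean stays the same, while multiplying or dividing each item by a constant scales the SD by that constant. Both SD functions agree on this data, up to floating-point rounding.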