Measures of Variability

I. Range
The range for a set of data items is the difference between the largest and smallest
values. Although the range is the easiest of the numerical measures of variability
to compute, it is not widely used because it is based on only two of the items in the
data set and thus is influenced too much by extreme data values.
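As a small illustration of this sensitivity, the following Python sketch computes the range of a data set before and after one extreme value is appended. The helper name `data_range` and the appended value 290 are assumptions made only for this illustration; the ten data values are the ones the handout uses in Example 1.

```python
def data_range(values):
    """Return the difference between the largest and smallest values."""
    return max(values) - min(values)


data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
with_outlier = data + [290]  # one hypothetical extreme value

print(data_range(data))          # 27
print(data_range(with_outlier))  # 288
```

A single added item changes the range by a factor of more than ten, even though the other ten items are unchanged.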
II. Interquartile Range
A form of the range that avoids the dependence on extreme values in the data set is
the interquartile range (IQR), or Q-spread. This descriptive measure of variability
is simply the difference between the third quartile ($Q_3$), or 75%-tile data item, and
the first quartile ($Q_1$), or 25%-tile data item. In effect, it shows the range for
the middle 50% of the data and, as such, is not affected by the extreme values in the
data set. To calculate $Q_3$, arrange the data in increasing order and let $i = \frac{3}{4}N$, where $N$ is the number of data items. If $i$ is
not an integer, then the next integer greater than $i$ denotes the position of the 75%-tile;
if $i$ is an integer, then the 75%-tile is the average of the data values in positions $i$ and
$i + 1$. Similarly, to calculate $Q_1$, let $i = \frac{1}{4}N$ and follow the same guidelines as above.
Example 1: Given the following data: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. Find the IQR.
$N = 10 \Rightarrow i = \frac{3}{4}(10) = 7.5 \Rightarrow Q_3$ is the 8th data item $\Rightarrow Q_3 = 19$. Next,
$i = \frac{1}{4}(10) = 2.5 \Rightarrow Q_1$ is the 3rd data item $\Rightarrow Q_1 = 5$. Therefore, IQR = 19 - 5 = 14.
Example 2: Given the following data: 2, 3, 5, 7, 11, 13, 17, 19. Find the IQR.
$N = 8 \Rightarrow i = \frac{3}{4}(8) = 6 \Rightarrow Q_3$ is the average of the data values in the 6th and 7th
positions $\Rightarrow Q_3 = \frac{13 + 17}{2} = 15$. Next, $i = \frac{1}{4}(8) = 2 \Rightarrow Q_1$ is the average of the
values in the 2nd and 3rd positions $\Rightarrow Q_1 = \frac{3 + 5}{2} = 4$. Therefore, IQR = 15 - 4 = 11.
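The quartile rule used in Examples 1 and 2 can be sketched in Python as follows. The function names `quartile` and `iqr` are hypothetical, and the code follows only the positional rule stated in this handout. Statistical libraries often use other quartile conventions, so their results may differ slightly.

```python
def quartile(ordered, k):
    """Return the k-th quartile (k = 1 or 3) of an already sorted list."""
    n = len(ordered)
    # i = k * N / 4, split into its whole part and a remainder
    whole, remainder = divmod(k * n, 4)
    if remainder:
        # i is not an integer: use the next integer position above i
        position = whole + 1
        return ordered[position - 1]  # positions in the text are 1-based
    # i is an integer: average the items in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


def iqr(values):
    """Interquartile range Q3 - Q1 under the handout's rule."""
    ordered = sorted(values)
    return quartile(ordered, 3) - quartile(ordered, 1)


example_1 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
example_2 = [2, 3, 5, 7, 11, 13, 17, 19]

print(iqr(example_1))  # 14
print(iqr(example_2))  # 11.0
```

Integer arithmetic with `divmod` is used to decide whether $i$ is an integer, which avoids any floating-point comparison.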
III. Average Absolute Deviation from the Mean
Obviously, there are limitations in using range or interquartile range as measures of
variability. It would seem reasonable that any useful measure of variability should
measure the spread around the mean since the mean is the “balance point” of a
distribution. If you find the difference between each data item and the mean, you
will get negative values for items that are less than the mean and positive values for
items greater than the mean. If you then sum up all of these differences, you will get
zero; this illustrates a special property of the mean. However, by taking the absolute
value of each difference, you will get the distance of each item from the mean, and
the sum of these distances would measure the total spread around the mean. If you
were to include more data items, equally spread around the mean, you would
increase the total of the distances even though the new distribution might be less
variable. Therefore, it is important to divide the total absolute deviation by the
number of data items; this will give an average absolute deviation from the mean.
Average Absolute Deviation = $\frac{\sum |X - \bar{X}|}{N}$
This average absolute deviation gives the average distance of any data item from the
mean and thus is a good measure of spread.
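The reasoning in this section can be checked numerically. The sketch below, with the hypothetical helper names `mean` and `average_absolute_deviation`, uses the Example 1 data to show that the signed differences from the mean cancel out while the absolute differences do not.

```python
def mean(values):
    return sum(values) / len(values)


def average_absolute_deviation(values):
    """Average distance of the data items from their mean."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)


data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
center = mean(data)

# Signed differences sum to zero (up to floating-point rounding).
signed_total = sum(x - center for x in data)
print(abs(signed_total) < 1e-9)  # True

# Absolute differences give the average distance from the mean.
aad = average_absolute_deviation(data)
print(round(aad, 4))  # 7.3
```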
IV. Standard Deviation
If you were to calculate the average absolute deviation of a distribution using a value
other than the mean, you could possibly get a smaller average absolute deviation.
This result is one of the reasons that the average absolute deviation is not the best
measure of variability. Instead, calculate the average of the squared differences from
the mean; this is the variance of a distribution. If you were to calculate the average
of the squared differences of a distribution by using a value other than the mean,
you would always get a larger value. The mean is the one number that minimizes
the average of the squared differences in a distribution.
Variance = $\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
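The two claims in the paragraph above can be tried out on a small data set. The data values below are hypothetical and chosen only so that the mean (4) and the median (3) differ; the helper names are also assumptions. The sketch suggests that the median can give a smaller average absolute deviation than the mean, while the mean gives the smallest average squared deviation.

```python
import statistics

data = [1, 2, 3, 4, 10]  # hypothetical data with mean 4 and median 3


def avg_abs_dev(values, center):
    return sum(abs(x - center) for x in values) / len(values)


def avg_sq_dev(values, center):
    return sum((x - center) ** 2 for x in values) / len(values)


mean = statistics.mean(data)
median = statistics.median(data)

print(avg_abs_dev(data, mean), avg_abs_dev(data, median))  # 2.4 2.2
print(avg_sq_dev(data, mean), avg_sq_dev(data, median))    # 10.0 11.0

# Search candidate centers 0.0, 0.1, ..., 10.0 for the smallest
# average squared deviation; the search lands on the mean.
candidates = [c / 10 for c in range(101)]
best = min(candidates, key=lambda c: avg_sq_dev(data, c))
print(best)  # 4.0
```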
There are still two slight inconveniences in using variance as our measure of
variability. First, variance does not give an estimate of the distance of a typical data
item from the mean; it is typically too big because the differences have been squared.
Second, if the data items have a unit of measurement associated with them, then the
variance would not have the same unit of measurement; it would have square units.
By taking the square root of variance, we get standard deviation, which is the
measure of variability that we want.
Standard Deviation = $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
The standard deviation can be calculated in an alternative way.
Standard Deviation = $\sigma = \sqrt{\frac{\sum X^2}{N} - \bar{X}^2}$
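The handout does not show why the alternative form gives the same value. One way to see it, offered here as a short sketch, is to expand the square in the variance and use the fact that $\sum X = N\bar{X}$:

```latex
\begin{align*}
\sigma^2 &= \frac{\sum (X - \bar{X})^2}{N} \\
         &= \frac{\sum X^2 - 2\bar{X}\sum X + N\bar{X}^2}{N} \\
         &= \frac{\sum X^2}{N} - 2\bar{X}^2 + \bar{X}^2 \\
         &= \frac{\sum X^2}{N} - \bar{X}^2
\end{align*}
```

Taking the square root of both sides gives the alternative formula for the standard deviation.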
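Example: Given the following histogram, estimate the standard deviation.
[Histogram: horizontal axis "Number of cigarettes" with class intervals 0 to 10, 10 to 20, 20 to 40, and 40 to 80; vertical axis in % per cigarette (%/cig). The areas of the four blocks are 10%, 30%, 40%, and 20%, respectively.]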
Recall that the mean of a histogram can be determined by calculating a “weighted”
average using the midpoints of the class intervals and the areas of the blocks. Thus,
$\bar{X} = .1(5) + .3(15) + .4(30) + .2(60) = .5 + 4.5 + 12 + 12 = 29$ cigarettes. The standard
deviation of a histogram can also be calculated using the midpoints of the class
intervals, the areas of the blocks, and the “weighted” average.
Using the first formula, we get:
SD = $\sigma = \sqrt{\frac{.1(5 - 29)^2 + .3(15 - 29)^2 + .4(30 - 29)^2 + .2(60 - 29)^2}{.1 + .3 + .4 + .2}} \approx 17.6$ cig
Using the second formula, we get:
SD = $\sigma = \sqrt{\frac{.1(5^2) + .3(15^2) + .4(30^2) + .2(60^2)}{.1 + .3 + .4 + .2} - 29^2} \approx 17.6$ cig
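The histogram calculation can be reproduced with a short Python sketch. The list name `blocks` and its (midpoint, area) layout are assumptions made for this illustration; the numbers are the midpoints and block areas used in the example.

```python
import math

# (midpoint of class interval, area of block) for the cigarette histogram
blocks = [(5, 0.10), (15, 0.30), (30, 0.40), (60, 0.20)]

total_area = sum(area for _, area in blocks)

# Weighted average of the midpoints
mean = sum(area * mid for mid, area in blocks) / total_area

# First formula: weighted average of squared deviations from the mean
var_1 = sum(area * (mid - mean) ** 2 for mid, area in blocks) / total_area

# Second formula: weighted average of the squares minus the squared mean
var_2 = sum(area * mid ** 2 for mid, area in blocks) / total_area - mean ** 2

print(round(mean, 1))              # 29.0
print(round(math.sqrt(var_1), 1))  # 17.6
print(round(math.sqrt(var_2), 1))  # 17.6
```

Both formulas lead to the same variance (309, up to rounding), so the two standard deviations agree.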
Important Note:
Some textbooks will give the following formulas for variance and standard deviation:
Variance = $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum X^2 - N\bar{X}^2}{N - 1}$
Standard Deviation = $s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} = \sqrt{\frac{\sum X^2 - N\bar{X}^2}{N - 1}}$
These formulas should be used when N data items are taken as a sample from a
larger population in which the variance and standard deviation of that population are
unknown. These formulas give good approximations of the variance and standard
deviation of the population.
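Python's standard `statistics` module provides both versions: `pvariance` and `pstdev` divide by $N$, while `variance` and `stdev` divide by $N - 1$. The sketch below treats the Example 2 data as a sample, which is an assumption made only for illustration, and compares the library result with the second form of the $N - 1$ formula.

```python
import math
import statistics

sample = [2, 3, 5, 7, 11, 13, 17, 19]  # Example 2 data, treated as a sample
n = len(sample)
mean = sum(sample) / n

# Second form of the sample variance: (sum of X^2 - N * mean^2) / (N - 1)
manual_variance = (sum(x ** 2 for x in sample) - n * mean ** 2) / (n - 1)

print(math.isclose(manual_variance, statistics.variance(sample)))  # True
print(round(statistics.pstdev(sample), 2))  # 5.98 (divides by N)
print(round(statistics.stdev(sample), 2))   # 6.39 (divides by N - 1)
```

The $N - 1$ version is always slightly larger, and the gap shrinks as $N$ grows.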
Practice Sheet – Measures of Variability
I. The following are 25 final averages in a math class:
46  49  53  60  61
64  66  66  67  71
72  74  75  76  79
79  79  80  83  88
89  91  94  95  98
(1) What is the range?
(2) What is the interquartile range?
II. Given the following data: 5, 7, 11, 12, 13, 18.
(1) What is the mean?
(2) What is the average absolute deviation from the mean?
(3) What is the median?
(4) What is the average absolute deviation from the median?
(5) What is the standard deviation?
(6) Add 8 to each item. What is the new SD?
(7) Subtract 7 from each item. What is the new SD?
(8) Multiply each item by 7. What is the new SD?
(9) Divide each item by 5. What is the new SD?
III. In the histogram given below, the class intervals include the right endpoint, not the
left:
[Histogram: horizontal axis "Income (in $1000)" marked at 0, 20, 40, 80, 100, and 120; vertical axis "%/$1000" marked from 0 to 1.25 in steps of 0.25.]
(1) What is the estimated mean?
(2) What is the estimated standard deviation?
(3) What is the estimated interquartile range?
IV. Class A: $N = 20$, $\bar{X} = 70$, $\sigma_X = 10$
Class B: $N = 30$, $\bar{X} = 80$, $\sigma_X = 6$
(1) Find $\sum X$ for class A.
(2) Find $\sum X$ for class B.
(3) Find $\sum X$ for the two classes combined.
(4) Find $\bar{X}$ for the combined classes.
(5) Find $\sum X^2$ for class A. [Hint: Use the alternative formula for SD.]
(6) Find $\sum X^2$ for class B.
(7) Find $\sum X^2$ for the combined classes.
(8) Find $\sigma_X$ for the combined classes.
Solution Key for Measures of Variability
I. (1) 98 – 46 = 52
(2) 83 – 66 = 17
II. (1) 11
(2) 3 1/3
(3) 11.5
(4) 3 1/3
(5) 4.2
(6) 4.2
(7) 4.2
(8) 29.4
(9) .84
III. (1) 56
(2) 26
(3) 76 – 35 = 41
IV. (1) 1400
(2) 2400
(3) 3800
(4) 76
(5) 100,000
(6) 193,080
(7) 293,080
(8) 9.25