Measures of Variability

Sample I:   30, 35, 40, 45, 50, 55, 60, 65, 70
Sample II:  30, 41, 48, 49, 50, 51, 52, 59, 70
Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59
Liang Zhang (UofU)
Applied Statistics I
June 10, 2008
Sample Range: the difference between the largest and the smallest
sample values.
e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70
the sample range is 40 (= 70 − 30).
Deviation from the Sample Mean: the difference between an
individual sample value and the sample mean.
e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70
the sample mean is 50, so the deviations from the sample mean
are -20, -15, -10, -5, 0, 5, 10, 15, 20.
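The two definitions above can be sketched in a few lines of Python. The function names `sample_range` and `deviations_from_mean` are chosen here for illustration and do not come from the slides.

```python
# Sample I from the slides.
sample_i = [30, 35, 40, 45, 50, 55, 60, 65, 70]

def sample_range(values):
    """Largest sample value minus smallest sample value."""
    return max(values) - min(values)

def deviations_from_mean(values):
    """Each sample value minus the sample mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

print(sample_range(sample_i))          # 40
print(deviations_from_mean(sample_i))  # deviations run from -20.0 to 20.0 in steps of 5
```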
Sample Variance: the sum of the squared deviations from the sample
mean, divided by n − 1 (roughly the average squared deviation).
If the sample size is n and $\bar{x}$ denotes the sample mean, then
the sample variance $s^2$ is given by:

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1} $$

Sample Standard Deviation: the square root of the sample variance

$$ s = \sqrt{s^2} $$
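As a minimal sketch, the two formulas translate directly into Python. The names `sample_variance` and `sample_std` are illustrative choices, and the function assumes at least two observations so that n − 1 is not zero.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    s_xx = sum((x - mean) ** 2 for x in values)
    return s_xx / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Sample I from the slides.
sample_i = [30, 35, 40, 45, 50, 55, 60, 65, 70]
print(sample_variance(sample_i))       # 187.5
print(round(sample_std(sample_i), 1))  # 13.7
```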
e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70, the mean is 50 and
we have

$x_i$                  30   35   40   45   50   55   60   65   70
$x_i - \bar{x}$       -20  -15  -10   -5    0    5   10   15   20
$(x_i - \bar{x})^2$   400  225  100   25    0   25  100  225  400

Therefore the sample variance is
(400 + 225 + 100 + 25 + 0 + 25 + 100 + 225 + 400)/(9 − 1) = 187.5
and the standard deviation is $\sqrt{187.5} \approx 13.7$.
e.g. for Sample II: 30, 41, 48, 49, 50, 51, 52, 59, 70, the mean is also 50
and we have

$x_i$                  30   41   48   49   50   51   52   59   70
$x_i - \bar{x}$       -20   -9   -2   -1    0    1    2    9   20
$(x_i - \bar{x})^2$   400   81    4    1    0    1    4   81  400

Therefore the sample variance is
(400 + 81 + 4 + 1 + 0 + 1 + 4 + 81 + 400)/(9 − 1) = 121.5
and the standard deviation is $\sqrt{121.5} \approx 11.0$.
e.g. for Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59, the mean is also 50
and we have

$x_i$                  41   45   48   49   50   51   52   55   59
$x_i - \bar{x}$        -9   -5   -2   -1    0    1    2    5    9
$(x_i - \bar{x})^2$    81   25    4    1    0    1    4   25   81

Therefore the sample variance is
(81 + 25 + 4 + 1 + 0 + 1 + 4 + 25 + 81)/(9 − 1) = 27.75
and the standard deviation is $\sqrt{27.75} \approx 5.3$.
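The sample variance is 187.5 for Sample I, 121.5 for Sample II, and
27.75 for Sample III. All three samples have the same mean (50), and
Samples I and II even have the same range (40), yet the variance gets
smaller as the values cluster more tightly around the mean.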
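The comparison can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor. This is only a sketch for checking the arithmetic; the dictionary names `samples` and `summary` are illustrative.

```python
import statistics

# The three samples from the slides.
samples = {
    "Sample I": [30, 35, 40, 45, 50, 55, 60, 65, 70],
    "Sample II": [30, 41, 48, 49, 50, 51, 52, 59, 70],
    "Sample III": [41, 45, 48, 49, 50, 51, 52, 55, 59],
}

summary = {}
for name, values in samples.items():
    value_range = max(values) - min(values)
    variance = statistics.variance(values)  # divisor n - 1
    std_dev = statistics.stdev(values)      # square root of the above
    summary[name] = (value_range, variance, std_dev)
    print(f"{name}: range={value_range}, s^2={variance:.2f}, s={std_dev:.1f}")

# Sample I: range=40, s^2=187.50, s=13.7
# Sample II: range=40, s^2=121.50, s=11.0
# Sample III: range=18, s^2=27.75, s=5.3
```

The range cannot tell Samples I and II apart, while the variance and standard deviation can.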
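Remark:
1. Why use the sum of squares of the deviations? Why not simply sum
the deviations?
Because the sum of the deviations from the sample mean is always equal to 0:

$$
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x}) &= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} \\
&= \sum_{i=1}^{n} x_i - n\bar{x} \\
&= \sum_{i=1}^{n} x_i - n\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) \\
&= 0
\end{aligned}
$$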
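A quick numerical check of this identity, sketched with exact rational arithmetic so that rounding cannot hide a nonzero remainder. The fourth data set `[1, 2, 4]` is a made-up example (not from the slides) whose mean, 7/3, has no finite decimal expansion.

```python
from fractions import Fraction

def sum_of_deviations(values):
    """Exact sum of (x_i - mean), computed with rational numbers."""
    mean = Fraction(sum(values), len(values))
    return sum(x - mean for x in values)

data_sets = [
    [30, 35, 40, 45, 50, 55, 60, 65, 70],  # Sample I
    [30, 41, 48, 49, 50, 51, 52, 59, 70],  # Sample II
    [41, 45, 48, 49, 50, 51, 52, 55, 59],  # Sample III
    [1, 2, 4],                             # hypothetical, mean = 7/3
]

for values in data_sets:
    print(sum_of_deviations(values))  # prints 0 for every data set
```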
Remark:
2. Why do we use the divisor n − 1 in the calculation of the sample
variance, while we use the divisor N in the calculation of the
population variance?
The variance measures deviation from the “center”. However, the
“center” differs between sample and population: it is the sample mean
$\bar{x}$ for the sample and the population mean $\mu$ for the population.
If $\mu$ were known, we could use it instead of $\bar{x}$ and take
$\sum (x_i - \mu)^2 / n$.
But generally the population mean is unavailable to us, so our choice is
the sample mean. In that case, the observations $x_i$ tend to be closer to
their own average $\bar{x}$ than to the population average $\mu$, so the
squared deviations from $\bar{x}$ tend to be too small. To compensate, we
use the divisor n − 1.
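The effect of the divisor can be illustrated exactly on a toy case. The sketch below assumes a hypothetical three-value population {30, 50, 70} and samples of size 2 drawn with replacement (both are assumptions made for illustration, not part of the slides), enumerates all 9 equally likely samples, and averages the two candidate estimators.

```python
from fractions import Fraction
from itertools import product

population = [30, 50, 70]  # hypothetical tiny population (assumption)
n = 2                      # sample size, drawn with replacement (assumption)

# Population mean and population variance (divisor N).
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

samples = list(product(population, repeat=n))  # all 9 possible samples

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2)             # 800/3
print(avg_div_n)          # 400/3, too small on average
print(avg_div_n_minus_1)  # 800/3, matches the population variance
```

In this setting the divisor n underestimates the population variance on average, while n − 1 hits it exactly.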
Remark:
3. It is customary to refer to $s^2$ as being based on n − 1 degrees of
freedom (df).
$s^2$ is computed from n quantities: $(x_1 - \bar{x})^2, (x_2 - \bar{x})^2, \ldots, (x_n - \bar{x})^2$.
However, the sum of $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$ is 0. Therefore if we know
any n − 1 of them, we know all of them.
e.g. {$x_1 = 4$, $x_2 = 7$, $x_3 = 1$, and $x_4 = 10$}.
Then the mean is $\bar{x} = 5.5$ and $x_1 - \bar{x} = -1.5$, $x_2 - \bar{x} = 1.5$ and
$x_3 - \bar{x} = -4.5$. From that, we know directly that $x_4 - \bar{x} = 4.5$ since their
sum is 0.
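The same example as a short Python sketch: the last deviation is recovered from the first three alone, without looking at the fourth observation. Variable names are illustrative.

```python
x = [4, 7, 1, 10]
xbar = sum(x) / len(x)  # 5.5

# Suppose only the first n - 1 deviations are known.
known_deviations = [xi - xbar for xi in x[:-1]]  # [-1.5, 1.5, -4.5]

# All n deviations sum to 0, so the last one is forced.
recovered_last = -sum(known_deviations)

print(recovered_last)  # 4.5
print(x[-1] - xbar)    # 4.5, the deviation computed directly
```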
Some mathematical results for $s^2$:
$s^2 = \frac{S_{xx}}{n-1}$, where $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$;
If $y_1 = x_1 + c, y_2 = x_2 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$;
If $y_1 = cx_1, y_2 = cx_2, \ldots, y_n = cx_n$, then $s_y = |c| s_x$.
Here $s_x^2$ is the sample variance of the x's and $s_y^2$ is the sample
variance of the y's. c is any nonzero constant.
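The shift and scale rules can be checked numerically. The sketch below uses Sample III from the slides and an arbitrarily chosen constant c = -3 (an assumption for illustration; any nonzero constant would do).

```python
import math
import statistics

x = [41, 45, 48, 49, 50, 51, 52, 55, 59]  # Sample III
c = -3                                    # arbitrary nonzero constant (assumption)

shifted = [xi + c for xi in x]  # y_i = x_i + c
scaled = [c * xi for xi in x]   # y_i = c * x_i

# Adding a constant leaves the variance unchanged.
print(math.isclose(statistics.variance(shifted), statistics.variance(x)))  # True

# Multiplying by c multiplies the standard deviation by |c|.
print(math.isclose(statistics.stdev(scaled), abs(c) * statistics.stdev(x)))  # True
```

A negative c is used on purpose: it shows why the absolute value is needed in the scale rule.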
e.g. in the previous example, Sample III is {41, 45, 48, 49, 50, 51, 52, 55,
59}, so we can calculate the sample variance as follows:

$x_i$       41    45    48    49    50    51    52    55    59    $\sum x_i = 450$
$x_i^2$   1681  2025  2304  2401  2500  2601  2704  3025  3481    $\sum x_i^2 = 22722$

Therefore the sample variance is
$(22722 - \frac{450^2}{9})/(9 - 1) = 27.75$
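The shortcut computation for Sample III can be sketched as follows. The function name `variance_shortcut` is illustrative; it only needs the two running totals, the sum of the values and the sum of their squares.

```python
def variance_shortcut(values):
    """Sample variance via S_xx = sum(x^2) - (sum(x))^2 / n."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    s_xx = sum_x2 - sum_x ** 2 / n
    return s_xx / (n - 1)

sample_iii = [41, 45, 48, 49, 50, 51, 52, 55, 59]
print(sum(sample_iii))                 # 450
print(sum(x * x for x in sample_iii))  # 22722
print(variance_shortcut(sample_iii))   # 27.75
```

With floating-point data whose mean is large relative to its spread, subtracting two nearly equal totals can lose precision, so the definition-based formula is often the safer choice in software.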
Boxplots
e.g. A recent article (“Indoor Radon and Childhood Cancer”) presented the accompanying data
on radon concentration (Bq/m^3) in two different samples of houses. The first sample consisted
of houses in which a child diagnosed with cancer had been residing. Houses in the second
sample had no recorded cases of childhood cancer. The following is a back-to-back
stem-and-leaf display of the data.
           1. Cancer | Stem | 2. No cancer
             9683795 |  0   | 95768397678993
86071815066815233150 |  1   | 12271713114
            12302731 |  2   | 99494191
                8349 |  3   | 839
                   5 |  4   |
                   7 |  5   | 55
                     |  6   |
                     |  7   |
                     |  8   | 5
Stem: Tens digit
Leaf: Ones digit
[Figure: boxplot for the 1st data set (Cancer)]
[Figure: boxplot for the 2nd data set (No cancer)]
[Figure: boxplots for both data sets on a common scale]
Some terminology:
Lower Fourth: the median of the smallest half of the data
Upper Fourth: the median of the largest half of the data
Fourth spread: the difference between the upper fourth and the lower fourth,
fs = upper fourth − lower fourth
Outlier: any observation farther than 1.5fs from the closest fourth
An outlier is extreme if it is more than 3fs from the nearest fourth,
and it is mild otherwise.
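These definitions can be sketched in Python as below. One detail is not stated on the slide and is an assumption here: when n is odd, the median is counted in both halves. The function names `fourths` and `classify_outliers` are illustrative. Sample II from the earlier slides serves as the data.

```python
import statistics

def fourths(values):
    """Return (lower fourth, upper fourth).

    Assumption: for odd n the median belongs to both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half, median included when n is odd
    lower = statistics.median(data[:half])
    upper = statistics.median(data[n - half:])
    return lower, upper

def classify_outliers(values):
    """Return (fourth spread, mild outliers, extreme outliers)."""
    lower, upper = fourths(values)
    fs = upper - lower
    mild, extreme = [], []
    for x in sorted(values):
        # Distance beyond the nearest fourth (not positive inside the box).
        distance = max(lower - x, x - upper)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return fs, mild, extreme

sample_ii = [30, 41, 48, 49, 50, 51, 52, 59, 70]
print(fourths(sample_ii))            # (48, 52)
print(classify_outliers(sample_ii))  # (4, [41, 59], [30, 70])
```

Here fs = 4, so anything more than 6 from the nearest fourth is an outlier and anything more than 12 away is extreme: 41 and 59 are mild outliers, 30 and 70 are extreme.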
[Figure: boxplot for the 2nd data set (No cancer)]