Measures of Variability

Measures of Variability
Sample I:   30, 35, 40, 45, 50, 55, 60, 65, 70
Sample II:  30, 41, 48, 49, 50, 51, 52, 59, 70
Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59
Measures of Variability
- Sample Range: the difference between the largest and the smallest sample values.
  e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70,
  the sample range is 40 (= 70 − 30).
- Deviation from the Sample Mean: the difference between an individual sample value and the sample mean.
  e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70,
  the sample mean is 50, so the deviations from the sample mean are -20, -15, -10, -5, 0, 5, 10, 15, 20.
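As a quick illustration of these two definitions, here is a minimal sketch in Python (Python is not otherwise used in these notes) that computes the sample range and the deviations from the mean for Sample I:

```python
sample_1 = [30, 35, 40, 45, 50, 55, 60, 65, 70]

# Sample range: largest value minus smallest value
sample_range = max(sample_1) - min(sample_1)      # 70 - 30 = 40

# Deviations from the sample mean
xbar = sum(sample_1) / len(sample_1)              # 50.0
deviations = [x - xbar for x in sample_1]         # [-20.0, -15.0, ..., 20.0]

print(sample_range, deviations)
```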
Measures of Variability
- Sample Variance: essentially the average of the squared deviations from the sample mean over the individual observations.
  If our sample size is n and we use x̄ to denote the sample mean, then the sample variance s² is given by
  \[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1} \]
- Sample Standard Deviation: the square root of the sample variance,
  \[ s = \sqrt{s^2} \]
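The formula above translates directly into code. Here is a minimal sketch (plain Python, standard library only) of the sample variance and sample standard deviation, checked against the three samples from the beginning of this section:

```python
from math import sqrt

def sample_variance(xs):
    """s^2 = sum((x_i - xbar)^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)   # S_xx: sum of squared deviations
    return sxx / (n - 1)

def sample_std(xs):
    """s = sqrt(s^2)."""
    return sqrt(sample_variance(xs))

print(sample_variance([30, 35, 40, 45, 50, 55, 60, 65, 70]))  # 187.5
print(sample_variance([30, 41, 48, 49, 50, 51, 52, 59, 70]))  # 121.5
print(sample_variance([41, 45, 48, 49, 50, 51, 52, 55, 59]))  # 27.75
```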
Measures of Variability
e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70, the mean is 50 and we have

  xi:          30   35   40   45   50   55   60   65   70
  xi − x̄:     -20  -15  -10   -5    0    5   10   15   20
  (xi − x̄)²:  400  225  100   25    0   25  100  225  400

Therefore the sample variance is
(400 + 225 + 100 + 25 + 0 + 25 + 100 + 225 + 400)/(9 − 1) = 187.5
and the standard deviation is √187.5 ≈ 13.7.
Measures of Variability
e.g. for Sample II: 30, 41, 48, 49, 50, 51, 52, 59, 70, the mean is also 50 and we have

  xi:          30   41   48   49   50   51   52   59   70
  xi − x̄:     -20   -9   -2   -1    0    1    2    9   20
  (xi − x̄)²:  400   81    4    1    0    1    4   81  400

Therefore the sample variance is
(400 + 81 + 4 + 1 + 0 + 1 + 4 + 81 + 400)/(9 − 1) = 121.5
and the standard deviation is √121.5 ≈ 11.0.
Measures of Variability
e.g. for Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59, the mean is also 50 and we have

  xi:          41   45   48   49   50   51   52   55   59
  xi − x̄:      -9   -5   -2   -1    0    1    2    5    9
  (xi − x̄)²:   81   25    4    1    0    1    4   25   81

Therefore the sample variance is
(81 + 25 + 4 + 1 + 0 + 1 + 4 + 25 + 81)/(9 − 1) = 27.75
and the standard deviation is √27.75 ≈ 5.3.
Measures of Variability
The sample variance for Sample I is 187.5, for Sample II it is 121.5, and for Sample III it is 27.75, which matches the impression that Sample I is the most spread out and Sample III the least.
Measures of Variability
Remark 1: Why use the sum of squares of the deviations? Why not simply sum the deviations?
Because the sum of the deviations from the sample mean is ALWAYS equal to 0:
\[
\sum_{i=1}^{n} (x_i - \bar{x})
= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x}
= \sum_{i=1}^{n} x_i - n\bar{x}
= \sum_{i=1}^{n} x_i - n\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
= 0
\]
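A quick numerical check of this identity (in exact arithmetic the result is exactly 0; floating point may leave a tiny rounding residue):

```python
xs = [30, 41, 48, 49, 50, 51, 52, 59, 70]   # Sample II
xbar = sum(xs) / len(xs)
print(sum(x - xbar for x in xs))            # 0.0, up to floating-point rounding
```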
Measures of Variability
Remark 2: Why do we use divisor n − 1 in the calculation of the sample variance, while we use divisor N in the calculation of the population variance?
The variance is a measure of deviation from the "center". However, the "centers" of the sample and of the population are different, namely the sample mean and the population mean.
If we could use µ instead of x̄ in the definition of s², then we would divide by n:
\[ s^2 = \frac{\sum (x_i - \mu)^2}{n}. \]
But generally the population mean is unavailable to us, so our only choice is the sample mean. In that case, the observations xᵢ tend to be closer to their own average x̄ than to the population average µ. To compensate, we use the divisor n − 1.
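One way to see the effect of the two divisors is a small simulation: repeatedly draw samples from a population whose variance is known, and compare the average of the divisor-n estimate with the average of the divisor-(n − 1) estimate. This is only an illustrative sketch; the population, sample size, and number of replications below are arbitrary choices, not part of the lecture material:

```python
import random

random.seed(0)
TRUE_VAR = 4.0                 # variance of Normal(mu=0, sigma=2)
n, reps = 5, 100_000

avg_div_n = avg_div_n_minus_1 = 0.0
for _ in range(reps):
    xs = [random.gauss(0, 2) for _ in range(n)]
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    avg_div_n += sxx / n / reps
    avg_div_n_minus_1 += sxx / (n - 1) / reps

print(avg_div_n)             # about 3.2 = (n-1)/n * 4: dividing by n underestimates
print(avg_div_n_minus_1)     # about 4.0: dividing by n - 1 recovers the true variance
```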
Measures of Variability
Remark 3: It is customary to refer to s² as being based on n − 1 degrees of freedom (df).
s² is based on the n quantities (x₁ − x̄)², (x₂ − x̄)², . . . , (xₙ − x̄)². However, the sum of x₁ − x̄, x₂ − x̄, . . . , xₙ − x̄ is 0, so if we know any n − 1 of these deviations, we know all of them.
e.g. suppose x₁ = 4, x₂ = 7, x₃ = 1, and x₄ = 10.
Then the mean is x̄ = 5.5, and x₁ − x̄ = −1.5, x₂ − x̄ = 1.5, x₃ − x̄ = −4.5. From that we know directly that x₄ − x̄ = 4.5, since the deviations sum to 0.
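The same bookkeeping in code for the tiny example above: only the first three deviations need to be computed, and the fourth is forced by the constraint that the deviations sum to 0.

```python
xs = [4, 7, 1, 10]
xbar = sum(xs) / len(xs)                    # 5.5
first_three = [x - xbar for x in xs[:3]]    # [-1.5, 1.5, -4.5]
fourth = -sum(first_three)                  # 4.5, forced by sum of deviations = 0
print(fourth, xs[3] - xbar)                 # 4.5 4.5
```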
Measures of Variability
Some mathematical results for s²:
- s² = Sxx/(n − 1), where
  \[ S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}; \]
- if y₁ = x₁ + c, y₂ = x₂ + c, . . . , yₙ = xₙ + c, then sy² = sx²;
- if y₁ = cx₁, y₂ = cx₂, . . . , yₙ = cxₙ, then sy = |c| sx.
Here sx² is the sample variance of the x's, sy² is the sample variance of the y's, and c is any nonzero constant.
Measures of Variability
e.g. in the previous example, Sample III is {41, 45, 48, 49, 50, 51, 52, 55, 59}; then we can calculate the sample variance as follows:

  xi:      41    45    48    49    50    51    52    55    59
  xi²:   1681  2025  2304  2401  2500  2601  2704  3025  3481

  Σxi = 450,   Σxi² = 22722

Therefore the sample variance is
(22722 − 450²/9)/(9 − 1) = 27.75.
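Both the shortcut formula and the two shift/scale properties are easy to verify numerically. A small sketch using Sample III (the constant c = 3 is an arbitrary choice for the illustration):

```python
def sample_variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    # S_xx = sum(x_i^2) - (sum(x_i))^2 / n, then divide by n - 1
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return sxx / (n - 1)

xs = [41, 45, 48, 49, 50, 51, 52, 55, 59]   # Sample III
c = 3

print(sample_variance(xs), sample_variance_shortcut(xs))  # both 27.75
print(sample_variance([x + c for x in xs]))               # 27.75: adding c changes nothing
print(sample_variance([c * x for x in xs]))               # 249.75 = c**2 * 27.75
```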
Measures of Variability
Boxplots
e.g. A recent article (“Indoor Radon and Childhood Cancer”) presented the
accompanying data on radon concentration (Bq/m³) in two different samples of
houses. The first sample consisted of houses in which a child diagnosed with cancer
had been residing. Houses in the second sample had no recorded cases of childhood
cancer. The following graph presents a stem-and-leaf display of the data.
             1. Cancer   |   | 2. No cancer
               9683795   | 0 | 95768397678993
  86071815066815233150   | 1 | 12271713114
              12302731   | 2 | 99494191
                  8349   | 3 | 839
                     5   | 4 | 55
                     7   | 5 |
                         | 6 |
                         | 7 |
                         | 8 | 5
Stem: Tens digit    Leaf: Ones digit
Measures of Variability
The boxplot for the 1st data set is:
Measures of Variability
The boxplot for the 2nd data set is:
Measures of Variability
We can also make the boxplot for both data sets:
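As an illustration of how such a comparative boxplot can be produced, here is a minimal matplotlib sketch. It uses Samples I–III from the start of this section rather than the radon data, so the figure it draws is not the one referred to above.

```python
import matplotlib.pyplot as plt

samples = {
    "Sample I":   [30, 35, 40, 45, 50, 55, 60, 65, 70],
    "Sample II":  [30, 41, 48, 49, 50, 51, 52, 59, 70],
    "Sample III": [41, 45, 48, 49, 50, 51, 52, 55, 59],
}

fig, ax = plt.subplots()
ax.boxplot(list(samples.values()), labels=list(samples.keys()))  # one box per sample
ax.set_ylabel("value")
plt.show()
```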
Measures of Variability
Some terminology:
- Lower Fourth: the median of the smallest half of the data.
- Upper Fourth: the median of the largest half of the data.
- Fourth Spread: the difference between the upper fourth and the lower fourth,
  fs = upper fourth − lower fourth.
- Outlier: any observation farther than 1.5 fs from the closest fourth.
  An outlier is extreme if it is more than 3 fs from the nearest fourth, and it is mild otherwise.
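Following the definitions above, here is a small sketch that computes the fourths, the fourth spread, and flags mild and extreme outliers. One detail the slides do not spell out is how to split the data when n is odd; the code below uses the common convention of including the middle observation in both halves, which is an assumption on my part.

```python
def median(xs):
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

def fourths(xs):
    ys = sorted(xs)
    n = len(ys)
    half = (n + 1) // 2                  # odd n: middle value goes into both halves
    return median(ys[:half]), median(ys[n - half:])

def classify_outliers(xs):
    lower, upper = fourths(xs)
    fs = upper - lower                   # fourth spread
    mild, extreme = [], []
    for x in xs:
        dist = max(lower - x, x - upper, 0)   # distance to the closest fourth
        if dist > 3 * fs:
            extreme.append(x)
        elif dist > 1.5 * fs:
            mild.append(x)
    return lower, upper, fs, mild, extreme

print(classify_outliers([30, 35, 40, 45, 50, 55, 60, 65, 70]))
# (40, 60, 20, [], [])  -- Sample I: no outliers
```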