Why is the sample variance calculated over n-1 instead of n?
The difference between the variance and the sample variance highlights the difference
between parameters and estimated parameters. There are mathematical consequences
to working with estimated parameters and samples rather than with entire populations.
One example of this is the way in which variance is estimated when using sample
means rather than population means.
This is reflected in equation (1), which states that the sum of squares of n
observations x₁, …, xₙ about their true mean μ is equal to the sum of squares of the
same observations about their sample mean x̄, plus n times the squared difference
between the sample mean and the true mean. If the sample mean is exactly equal to the
population mean, then the rightmost term is equal to zero and the two sums of squares
are identical. If there is a difference, the rightmost term will be greater than zero,
but this error will not be accounted for by the sum of squares about the sample mean
(the middle term). This is because you don't know how much error there is in your
sample mean estimate and can't account for it.
$$\sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \tag{1}$$
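The identity in equation (1) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the population parameters, the sample size, the seed and the helper name `sum_sq_about` are all arbitrary choices made for illustration.

```python
import random


def sum_sq_about(values, center):
    """Sum of squared deviations of `values` about `center` (hypothetical helper)."""
    return sum((x - center) ** 2 for x in values)


random.seed(0)                 # arbitrary seed so the run is repeatable
mu, sigma, n = 10.0, 2.0, 8    # assumed true mean, standard deviation, sample size
sample = [random.gauss(mu, sigma) for _ in range(n)]
x_bar = sum(sample) / n        # sample mean

lhs = sum_sq_about(sample, mu)                                      # left side of (1)
rhs = sum_sq_about(sample, x_bar) + n * (x_bar - mu) ** 2           # right side of (1)

print("about true mean:  ", lhs)
print("about sample mean:", sum_sq_about(sample, x_bar))
print("right side of (1):", rhs)
```

The two sides agree up to floating-point rounding, and the sum of squares about the sample mean is never larger than the sum of squares about the true mean.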
If (x̄ − μ)² > 0 (which it almost certainly will be), then the sum of squares about the
sample mean will be smaller than the sum of squares about the population mean, so
dividing it by n will tend to underestimate the population variance. The terms in
equation (1) can be rearranged to isolate the sum of squares about the sample mean on
the left-hand side. This is presented in equation (2):
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \tag{2}$$
The middle term only needs to be divided by n in order to be a population variance.
Another way to look at this is that its expected value is n times the population
variance, nσ². The middle term in (3) has been modified to reflect this, so from this
point on the equality holds in expectation rather than for any single sample.
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = n\sigma^2 - n(\bar{x} - \mu)^2 \tag{3}$$
The rightmost term on the right side of the equation is more easily handled if the
bracketed element is considered independently. The expected value of the squared
deviation of the sample mean about the population mean (the variance of the sample
mean, i.e. the square of its standard error) is equal to the population variance
divided by n, σ²/n. Equation (4) shows this change.
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = n\sigma^2 - n\left(\frac{\sigma^2}{n}\right) \tag{4}$$
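The substitution used in equation (4), that the squared deviation of the sample mean about the population mean averages out to σ²/n, can be illustrated by simulation. This is only a sketch under assumed conditions: normally distributed data, arbitrary values of μ, σ and n, and a hypothetical helper name.

```python
import random


def mean_sq_deviation_of_mean(mu, sigma, n, trials, rng):
    """Average of (x_bar - mu)^2 over many simulated samples of size n."""
    total = 0.0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (x_bar - mu) ** 2
    return total / trials


rng = random.Random(1)          # arbitrary seed for repeatability
mu, sigma, n = 0.0, 3.0, 5      # assumed population parameters and sample size
estimate = mean_sq_deviation_of_mean(mu, sigma, n, trials=20000, rng=rng)

print("simulated average of (x_bar - mu)^2:", estimate)
print("sigma^2 / n:                        ", sigma ** 2 / n)
```

With enough trials the simulated average settles close to σ²/n, which is the value substituted for (x̄ − μ)² in equation (4).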
The rightmost term can be simplified, because the multiplier n and the divisor n cancel
each other out, leaving σ². At this point, both sides of the equation can be divided by
n − 1:
$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{n\sigma^2 - \sigma^2}{n-1} \tag{5}$$
The left hand side of the equation is now equal to the sample variance. This is the sum
of squares about the sample mean, divided by n-1 degrees of freedom. The right hand
side can be simplified to show that it is equal to the population variance by eliminating
the n-1 term from the numerator and the denominator.
$$s^2 = \frac{(n-1)\,\sigma^2}{n-1} = \sigma^2 \tag{6}$$
This shows that the sample variance, when calculated over n − 1 degrees of freedom, has
the same expected value as the population variance. Calculating the sample variance
over n rather than n − 1 gives an estimate whose expected value is only (n − 1) / n
times the population variance, so it underestimates the population variance on average.
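The conclusion can be illustrated with a small simulation that averages both versions of the estimate over many samples. This is a sketch under stated assumptions rather than a proof: the data are drawn from a normal distribution, and the parameter values, seed and function name are arbitrary.

```python
import random


def average_variance_estimates(mu, sigma, n, trials, rng):
    """Average the divide-by-n and divide-by-(n-1) estimates over many samples."""
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        ss = sum((x - x_bar) ** 2 for x in sample)   # sum of squares about x_bar
        total_div_n += ss / n
        total_div_n_minus_1 += ss / (n - 1)
    return total_div_n / trials, total_div_n_minus_1 / trials


rng = random.Random(2)          # arbitrary seed for repeatability
mu, sigma, n = 0.0, 2.0, 5      # assumed population parameters; sigma^2 = 4
biased, unbiased = average_variance_estimates(mu, sigma, n, trials=20000, rng=rng)

print("average estimate dividing by n:    ", biased)
print("average estimate dividing by n - 1:", unbiased)
print("population variance:               ", sigma ** 2)
print("ratio of the two averages:         ", biased / unbiased)
```

The divide-by-(n − 1) average lands near σ², while the divide-by-n average lands near (n − 1)/n times σ². The ratio of the two averages equals (n − 1)/n up to rounding, because both are built from the same sums of squares.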