Why is the sample variance calculated over n-1 instead of n?

The difference between the variance and the sample variance highlights the difference between parameters and estimated parameters. There are mathematical consequences to working with estimated parameters and samples rather than with entire populations. One example is the way variance is estimated when the sample mean is used in place of the population mean. This is reflected in equation (1), which states that the sum of squares of $n$ observations $x_1, \dots, x_n$ about their true mean $\mu$ is equal to the sum of squares of the same observations about their sample mean $\bar{x}$, plus $n$ times the squared difference between the sample mean and the true mean.

If the sample mean is exactly equal to the population mean, the rightmost term is zero and the two sums of squares are identical. If there is a difference, the rightmost term is greater than zero, but this error is not captured by the sum of squares about the sample mean (the middle term). This is because you don't know how much error there is in your sample mean and so can't account for it.

$$\sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 \tag{1}$$

If $(\bar{x} - \mu)^2 > 0$ (which it almost certainly will be), the sum of squares about the sample mean is smaller than the sum of squares about the true mean, so dividing it by $n$ tends to underestimate the population variance.

The terms in equation (1) can be rearranged to isolate the sum of squares about the sample mean on the left-hand side. This is presented in equation (2):

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \tag{2}$$

The middle term, the sum of squares about the true mean, only needs to be divided by $n$ to give the population variance on average over repeated samples. Another way to look at this is that its expected value is $n$ times the population variance, $n\sigma^2$. Taking expected values of both sides, the middle term in (3) has been modified to reflect this.

$$E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = n\sigma^2 - n\,E\left[(\bar{x} - \mu)^2\right] \tag{3}$$

The rightmost term on the right side of the equation is more easily handled if the bracketed element is considered on its own.
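The decomposition in equation (1) holds for any set of numbers and any value of the true mean, so it can be checked numerically before going further. The following is a minimal sketch using only the Python standard library; the function names, the seed, and the values chosen for the mean, standard deviation and sample size are all hypothetical and picked purely for illustration.

```python
import random


def sum_sq_about(values, center):
    """Sum of squared deviations of `values` about `center`."""
    return sum((v - center) ** 2 for v in values)


def check_decomposition(values, mu):
    """Return the left and right sides of equation (1).

    `mu` plays the role of the true (population) mean; the sample
    mean is computed from `values`.
    """
    n = len(values)
    xbar = sum(values) / n
    lhs = sum_sq_about(values, mu)
    rhs = sum_sq_about(values, xbar) + n * (xbar - mu) ** 2
    return lhs, rhs


# Hypothetical population parameters and sample size (assumptions).
mu, sigma, n = 10.0, 2.0, 8
rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(mu, sigma) for _ in range(n)]

lhs, rhs = check_decomposition(sample, mu)
print("sum of squares about the true mean:  ", lhs)
print("right-hand side of equation (1):     ", rhs)
print("sum of squares about the sample mean:", sum_sq_about(sample, sum(sample) / n))
```

The two sides agree up to floating-point rounding, and the sum of squares about the sample mean is never larger than the sum of squares about the true mean, which is the shortfall the rest of the derivation corrects for.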
The expected value of the squared deviation of the sample mean about the population mean (that is, the variance of the sample mean, which is the square of its standard error) is equal to the population variance divided by $n$ for a sample of $n$ independent observations: $E[(\bar{x} - \mu)^2] = \sigma^2 / n$. Substituting this into equation (3), with both sides read as expected values over repeated samples, gives equation (4).

$$E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = n\sigma^2 - n\left(\frac{\sigma^2}{n}\right) \tag{4}$$

The rightmost term can be simplified, because the multiplier $n$ and the divisor $n$ cancel each other out, leaving $\sigma^2$. At this point, both sides of the equation can be divided by $n-1$:

$$E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\right] = \frac{n\sigma^2 - \sigma^2}{n-1} \tag{5}$$

The quantity inside the expectation on the left-hand side is now the sample variance $s^2$: the sum of squares about the sample mean, divided by $n-1$ degrees of freedom. The right-hand side can be simplified to the population variance by factoring the numerator as $(n-1)\sigma^2$ and cancelling $n-1$ from the numerator and the denominator.

$$E\left[s^2\right] = \frac{(n-1)\,\sigma^2}{n-1} = \sigma^2 \tag{6}$$

This shows that the sample variance, when calculated over $n-1$ degrees of freedom, has the same expected value as the population variance. Calculating the sample variance over $n$ rather than $n-1$ gives an estimate whose expected value is $(n-1)/n$ times the population variance, that is, an underestimate by a factor of $(n-1)/n$.
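The conclusion can be illustrated with a small simulation: draw many samples from a population with known variance, compute the variance estimate both ways, and average each over all samples. This is a sketch under stated assumptions, not a proof. The normal population, the sample size, the number of trials, the seed, and the helper names `variance` and `average_estimates` are all hypothetical choices made for the example.

```python
import random


def variance(values, ddof):
    """Sum of squares about the sample mean, divided by (n - ddof).

    ddof=0 divides by n; ddof=1 divides by n - 1.
    """
    n = len(values)
    xbar = sum(values) / n
    return sum((v - xbar) ** 2 for v in values) / (n - ddof)


def average_estimates(n, sigma, trials, seed=0):
    """Average the divide-by-n and divide-by-(n-1) estimates over many samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_over_n = 0.0
    total_over_n1 = 0.0
    for _ in range(trials):
        # Assumed population: normal with mean 0 and standard deviation sigma.
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_over_n += variance(sample, ddof=0)
        total_over_n1 += variance(sample, ddof=1)
    return total_over_n / trials, total_over_n1 / trials


# Hypothetical settings: small samples make the bias easy to see.
n, sigma = 5, 2.0
mean_over_n, mean_over_n1 = average_estimates(n, sigma, trials=20000)

print("population variance:          ", sigma ** 2)
print("average estimate, divisor n:  ", mean_over_n)
print("average estimate, divisor n-1:", mean_over_n1)
print("ratio of the two averages:    ", mean_over_n / mean_over_n1)
print("(n - 1) / n:                  ", (n - 1) / n)
```

With these settings the average of the divide-by-(n-1) estimates should land close to the population variance, while the divide-by-n average should sit near (n-1)/n of it. The ratio of the two averages matches (n-1)/n up to rounding, because the two estimates differ only by that constant factor in every sample.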