Distribution of Sample Means and the Central Limit Theorem

The Central Limit Theorem (CLT) is very important when dealing with statistical distributions concerning a sample mean.

Background: Applied statistics is often concerned with drawing inferences about a population from a sample. A sample consists of the values $y_1, y_2, \ldots, y_n$ of n individuals sampled from the population. Each individual’s value can be considered a random variable. Consequently, the mean of the sampled values, $\bar{y}$, is a realization of a random variable.

[Figure: a population and a sample drawn from it; the sample has mean $\bar{y}$ and standard deviation $s$.]

Example: If you were to measure the diameter of 10 randomly sampled abalones, you might get 10.5 inches as the average. Grab another 10 abalones, and you might get 9.2 inches as the average. The average can therefore be considered a random variable.

The sample mean, $\bar{Y}$, is the most commonly used estimate of the population mean $\mu$. (Note: Upper case Y is now used instead of lower case y because we are talking about yet-to-be-observed random values, whereas lower case y refers to observed numbers that we have in hand.) $\bar{Y}$ is the average of the n values from a random sample, so $\bar{Y}$ is itself a random variable. A random variable has a distribution, and therefore it has a population mean and a population standard deviation.

Distribution of a sample mean

1. When Y is distributed with mean $\mu$ and standard deviation $\sigma$, then $\bar{Y}$ has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
2. As the sample size increases, the standard deviation of $\bar{Y}$ decreases. Consequently, when trying to estimate the population mean, the larger your sample size, the more likely it is that your sample mean, $\bar{y}$, will be close to the unobservable $\mu$.
3. To halve the standard deviation of $\bar{Y}$, you need to quadruple the sample size ($\sqrt{4n} = 2\sqrt{n}$).
4. In practice, you do not know $\sigma$, so you calculate the sample standard deviation $s$. The quantity $s/\sqrt{n}$ is called the standard error of the sample mean. It is an estimate of $\sigma/\sqrt{n}$.

Knowing the distribution of $\bar{Y}$ allows you to answer many practical questions using statistics.
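The four points about the sample mean can be checked with a small simulation. The following is a minimal sketch, not part of the original handout: it assumes a normal population with mean 10 and standard deviation 2 (arbitrary illustrative values), and the function names `sd_of_sample_means` and `standard_error` are hypothetical helpers introduced only for this example. It draws many samples of size n, computes the standard deviation of their means, and compares it with sigma divided by the square root of n.

```python
import math
import random
import statistics

def sd_of_sample_means(n, mu=10.0, sigma=2.0, reps=4000, seed=1):
    """Empirical standard deviation of the means of `reps` samples of size n.

    The normal population and the values of mu and sigma are assumptions
    made for illustration only.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.stdev(means)

def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

if __name__ == "__main__":
    for n in (10, 40):
        theory = 2.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 2
        print(n, round(sd_of_sample_means(n), 3), round(theory, 3))
    # Standard error computed from a single (made-up) sample in hand.
    print(round(standard_error([10.5, 9.2, 11.0, 9.8]), 3))
```

Going from n = 10 to n = 40 quadruples the sample size, so the simulated standard deviation of the sample mean should come out at roughly half its earlier value.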
Each population, whether it be red abalone shell diameters, weights of people, river water flow, or tree heights, has its own probability distribution. (To really take advantage of statistics, you need to think of most objects in nature as being an observation of a random variable.) So how can you possibly know the distribution of the sample mean $\bar{Y}$ if you do not know the true distribution of the population? In many cases, the distribution of $\bar{Y}$ is known because of the Central Limit Theorem. The CLT deals with the distribution of $\bar{Y}$, which is why it is so important when working with a sample mean.

Central Limit Theorem

Suppose Y has almost any distribution with mean $\mu$ and standard deviation $\sigma$. Then, as n gets large, the distribution of $\bar{Y}$ becomes “approximately normal” with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The larger n is, the closer the distribution of $\bar{Y}$ will be to normal.

Note 1: If Y is normally distributed, then $\bar{Y}$ is exactly normally distributed regardless of n.

Note 2: There are rare instances where the CLT fails. For this to happen, the population distribution must have tails so heavy that its variance is not finite (the Cauchy distribution is the standard example). You are unlikely to encounter this in practice.

How large n must be before it is “large enough” depends on how different the distribution of Y is from normal. Typically n = 30 is used as a rough guideline. Remember, it is the distribution of the sample means that becomes approximately normal; the distribution of the individuals in the population does not change.

Why is the CLT important?

Statistical tests, such as a t-test or Analysis of Variance (ANOVA), assume that the data were sampled from a normal population, which gives a $\bar{Y}$ that is normally distributed. Because of the CLT, however, $\bar{Y}$ can be assumed to be approximately normally distributed whenever n is large. Consequently, for “large” sample sizes, whether or not the population is normally distributed is not a concern.
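The effect of the CLT on a clearly non-normal population can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not something taken from the handout: it uses an exponential population with mean 1 and standard deviation 1 (strongly right-skewed, skewness 2), and the helper names `sample_means` and `skewness` are hypothetical. It tracks the skewness of the simulated sample means, which is 0 for a normal distribution.

```python
import random
import statistics

def sample_means(n, reps=5000, seed=2):
    """Means of `reps` samples of size n from an exponential population.

    The exponential population (rate 1.0, so mean 1 and sd 1) is an
    assumption chosen only because it is visibly non-normal.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

def skewness(values):
    """Moment-based skewness: average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

if __name__ == "__main__":
    for n in (1, 5, 30):
        means = sample_means(n)
        print(
            n,
            round(statistics.fmean(means), 3),   # should stay near mu = 1
            round(statistics.stdev(means), 3),   # should be near 1 / sqrt(n)
            round(skewness(means), 3),           # should shrink toward 0
        )
```

For independent draws, the skewness of the mean of n values equals the population skewness divided by the square root of n, so the last column should fall steadily as n grows while the population itself stays just as skewed as before.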
Thus the CLT permits the use of t-tests and ANOVA when the population is non-normal, provided the sample size is large. The same is true of confidence intervals for the population mean $\mu$.

Example: Suppose Gambel Quail lay an average of 14 eggs in a clutch with a standard deviation of 2. The number of eggs cannot be normally distributed because it is a discrete, not continuous, random variable. Furthermore, it is reasonable to assume that the distribution of the number of eggs is skewed to either the left or the right. Suppose many students were each to go out and count the eggs in 50 different Gambel Quail clutches and calculate the average number of eggs in a clutch. What distribution would these averages follow?

Answer: Because n = 50 is “large”, the distribution of the mean number of eggs would be approximately normal with $\mu_{\bar{x}} = 14$ and $\sigma_{\bar{x}} = 2/\sqrt{50} \approx 0.28$.

Example: Consider the density curve of a chi-square distribution with 6 degrees of freedom. It is clearly not normal, but it does have a somewhat bell-shaped curve. Consequently, the distribution of its sample means approaches a normal distribution at relatively small sample sizes.

Example: The CLT even works with a discrete distribution, such as rolling a die. The probabilities for 1 through 6 are equal. Roll the die n times and average those rolls; if n is large, the average can be treated as a random variable from an approximately normal distribution.
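The die example can be tried directly. The following is a rough sketch, not the handout's own material: it assumes a fair six-sided die, for which the mean of one roll is 3.5 and the variance is 35/12, and the names `mean_of_rolls` and `coverage` are hypothetical. If the average of n rolls is approximately normal, about 95% of simulated averages should fall within 1.96 standard deviations of 3.5.

```python
import math
import random
import statistics

DIE_MEAN = 3.5                 # mean of one roll of a fair die
DIE_SD = math.sqrt(35 / 12)    # standard deviation of one roll

def mean_of_rolls(n, rng):
    """Average of n rolls of a fair six-sided die."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n)])

def coverage(n, reps=5000, seed=3):
    """Fraction of simulated averages within 1.96 * sd / sqrt(n) of 3.5."""
    rng = random.Random(seed)
    half_width = 1.96 * DIE_SD / math.sqrt(n)
    hits = sum(
        abs(mean_of_rolls(n, rng) - DIE_MEAN) <= half_width
        for _ in range(reps)
    )
    return hits / reps

if __name__ == "__main__":
    # With n = 30 the fraction should be close to 0.95 if the normal
    # approximation is working.
    print(coverage(30))
```

The same check applies to the quail example: with a population standard deviation of 2 and n = 50, the averages would be expected to cluster around 14 with a standard deviation of about 0.28.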