Sampling Distribution of Mean

Young men's heights are roughly bell-shaped with a mean of 70 inches (5'10") and a standard deviation of 2.5 inches. A randomly selected man will on average be 5'10", but any value between 5'5" and 6'3" would not be unusual. Suppose we took a sample of 12 men, measured their heights and averaged them to get the statistic \(\bar{X}\). What would you expect to get? On average you would expect it to be 70 inches, but each sample may be a little higher or lower. How much? What would be a typical range of values? What would be the shape of the histogram of \(\bar{X}\)?

• The basic statistic for a binary variable is the sample proportion, and the most basic statistic for a numerical variable is the sample mean, so these are the sampling distributions we will discuss. If you go further in statistics you will learn the sampling distributions of many different statistics.

Sampling Distribution of Mean

In general suppose \(X\) is a numerical variable with mean \(\mu_X\) and standard deviation \(\sigma_X\). The set of values of the sample mean \(\bar{X}\) over all possible samples of size \(n\) is the sampling distribution of the mean, and it has three properties:

• Its mean is \(\mu_{\bar{X}} = \mu_X\).
• Its standard error is \(\sigma_{\bar{X}} = \sigma_X / \sqrt{n}\).
• If \(n\) is big enough its shape is roughly normal.

• The formula for the mean looks like a pretty crazy thing but it is not. It says that \(\bar{X}\) may be a little more or less than \(\mu_X\), but on average it should give \(\mu_X\).
• The formula for the standard error says that the more variation there is in \(X\) the more variation there is in \(\bar{X}\), which makes sense, and the bigger \(n\) is the less variation there is in \(\bar{X}\). Think of the variation as cancelling out: you will get some tall people and some short people, but it would be really hard to have all tall people in a big sample so that the average was far from the mean.
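The first two properties above can be checked with a quick simulation. The following is a minimal sketch, not part of the lecture: it assumes the population of heights is exactly normal with mean 70 and s.d. 2.5, draws many samples of size 12, and compares the mean and standard deviation of the resulting sample means to \(\mu_X\) and \(\sigma_X/\sqrt{n}\). The function name `simulate_sample_means`, the number of samples, and the seed are illustrative choices.

```python
import math
import random
import statistics


def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n from Normal(mu, sigma).

    Returns a list containing the mean of each sample.
    (Hypothetical helper; the normal population is an assumption.)
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(num_samples)
    ]


# Heights example: mu = 70 in, sigma = 2.5 in, samples of n = 12 men.
means = simulate_sample_means(mu=70, sigma=2.5, n=12, num_samples=20000)

simulated_mean = statistics.fmean(means)    # should be close to 70
simulated_se = statistics.pstdev(means)     # should be close to sigma / sqrt(n)
theoretical_se = 2.5 / math.sqrt(12)        # about 0.722

print(f"mean of sample means: {simulated_mean:.3f}")
print(f"s.d. of sample means: {simulated_se:.3f}")
print(f"sigma / sqrt(n):      {theoretical_se:.3f}")
```

Plotting a histogram of `means` would also let you eyeball the third property, the roughly normal shape.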
The Fine Print

Once again everything assumes the SRS Assumption (the sample is a simple random sample). The formula for the standard error assumes sampling with replacement, but it is a fine approximation as long as the sample and population satisfy the Large Population Assumption. The last fact relies on the Central Limit Theorem:

Central Limit Theorem

If \(X\) has any distribution (with a finite standard deviation), then as \(n\) gets larger the distribution of \(\bar{X}\) approaches a normal distribution. When \(n\) is small it looks like the distribution of \(X\), but as \(n\) gets larger it becomes more unimodal, then more symmetric, then very close to normal.

• The Central Limit Theorem is the basic mathematics that makes all of statistics tick, because it allows us to use the normal distribution to calculate things.
• That funny formula for the normal distribution essentially comes from here: of all possible distributions, this is the one that this process converges to. That is why the normal distribution is the best model for symmetric unimodal distributions.
• A fuzzy version of the CLT is why symmetric unimodal distributions occur so frequently. Many random processes can be viewed as the sum (or average) of a bunch of relatively independent small effects. Such a sum tends to be close to a normal distribution.

Central Limit Theorem

• As \(n\) gets bigger notice how the bumps get ironed out by the averaging process. The last thing that remains is the skew, and as \(n\) gets larger that goes away as well.

The 0 : 15 : 40 Rule

The Central Limit Theorem means \(\bar{X}\) is roughly normal if \(n\) is big enough, or if \(X\) is close enough to normal, or some combination. The rule of thumb is the 0 : 15 : 40 Rule: \(\bar{X}\) can be taken to be normal if one of the following conditions is met:

Either...
i. \(X\) is known to be normal
...or...
ii. \(n \ge 15\) and \(X\) is not too skewed, with no major outliers
...or...
iii. \(n \ge 40\)

• This is a complicated rule. I will give you a mnemonic next time to help you remember it.
• Remember \(X\) refers to the population distribution and \(\bar{X}\) to the sampling distribution. It is really important to keep them straight. The 0 : 15 : 40 rule uses knowledge of the population distribution to get information about the sampling distribution.
• In practice you will meet the first condition if the problem tells you \(X\) is normal, or bell-shaped. You will meet the second condition if you have access to a histogram so you can see it is not too skewed. You will meet the third condition if the sample is big enough, so that is the most straightforward one and the first one to check.

Example

Young American men's heights are bell-shaped with a mean of 70 inches and an s.d. of 2.5 inches. You take a simple random sample of 12 young American men and compute their average height \(\bar{X}\). Find the mean and standard error of \(\bar{X}\). What is the chance that the average you get will be more than 6 feet? Less than 5'6"? Between what two values can we be 95% sure your answer will fall?

It is an SRS. We have \(\mu_X = 70\), \(\sigma_X = 2.5\), \(n = 12\). This tells us the mean of \(\bar{X}\) is

\[ \mu_{\bar{X}} = \mu_X = 70 \text{ in} \]

Since there are more than \(20 \cdot 12 = 240\) young American men, the Independence/Large Population assumption is met, so the standard error is

\[ \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{2.5}{\sqrt{12}} = 0.722 \text{ in} \]

Example

Recall \(\mu_X = 70\), \(\sigma_X = 2.5\), \(n = 12\). We found

\[ \mu_{\bar{X}} = 70, \qquad \sigma_{\bar{X}} = 2.5/\sqrt{12} = 0.722 \]

Here \(n = 12 < 15\), so we cannot possibly meet condition (ii) in the 0 : 15 : 40 rule (requires \(n \ge 15\)) or condition (iii) (requires \(n \ge 40\)), but we are told \(X\) is bell-shaped, so we meet condition (i), and so we meet this assumption. Therefore we can assume \(\bar{X}\) is normal.

What is the chance that the average you get will be more than 6 feet?

\[ P(\bar{X} > 72) = 1 - \text{normdist}(72, 70, 2.5/\sqrt{12}, 1) = 0.279\% \]

What is the chance \(\bar{X}\) will be less than 5'6"?

\[ P(\bar{X} < 66) = \text{normdist}(66, 70, 2.5/\sqrt{12}, 1) = 1.49 \times 10^{-8} = 0.00000149\% \]

• If it meets one of the conditions of 0 : 15 : 40, you don't have to think about the other two. I just went through them to give you more experience with it.
• So if you got an average of less than 5'6", either it was a one in a hundred million coincidence, or you are wrong about its being a random sample or about the mean and standard deviation of men's heights. That is an idea we will come back to. If a really low probability event happens, it probably means you are wrong in one of the assumptions you used to calculate that probability.

More Example

Recall \(\mu_X = 70\), \(\sigma_X = 2.5\), \(n = 12\). We found

\[ \mu_{\bar{X}} = 70, \qquad \sigma_{\bar{X}} = 2.5/\sqrt{12} = 0.722 \]

and \(\bar{X}\) is normal. Between what two values will 95% of samples' sample means fall? We could use the Empirical Rule and go up and down 2 standard errors from the mean, but let's be more precise and recall that we get almost exactly 95% in a normal distribution if we go up and down \(1.96\,\sigma_{\bar{X}}\) from \(\mu_{\bar{X}}\).

\[ 70 \pm 1.96 \frac{\sigma_X}{\sqrt{n}} = 70 \pm 1.96 \times 0.722 = 70 \pm 1.41 = [68.6, 71.4] \text{ in} \]

• That is a pretty narrow range, compared to the usual range we got for the original variable \(X\).

Another Example

The mean cost of a haircut for an American college student is $18 with a standard deviation of $22. What is the probability that a simple random sample of 52 college students will have an average haircut cost of less than $16? Between $17 and $19? Between what two values would 95% of all such sample averages fall? If I actually got an average of $30 in my sample, would that suggest there was something wrong with my sampling or with my presumed mean and s.d.?

Asking about the average of a sample of 52 is asking about \(\bar{X}\). It says it is a simple random sample, so \(\mu_{\bar{X}} = \mu_X = 18\). There are more than \(20 \cdot 52 = 1040\) college students (large population), so

\[ \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{22}{\sqrt{52}} = 3.05 \]

Another Example

\[ \mu_X = 18, \quad \mu_{\bar{X}} = 18, \quad \sigma_X = 22, \quad n = 52, \quad \sigma_{\bar{X}} = \frac{22}{\sqrt{52}} = 3.05 \]

We don't know the shape of the population distribution, so we cannot use the first two conditions (i-ii). In fact, because \(\sigma\) is about the same size as \(\mu\) and \(X\) must be positive, it is surely skewed right. But \(n \ge 40\), so by condition (iii) we can assume \(\bar{X}\) is normal.
Chance \(\bar{X}\) is less than 16:

\[ P(\bar{X} < 16) = \text{normdist}(16, 18, 22/\sqrt{52}, 1) = 25.6\% \]

• If \(X\) is always positive (like haircut costs) and \(\mu\) is about the same as \(\sigma\), then if \(X\) were normal we would know about a sixth of the time \(X\) would be below \(\mu - \sigma\), which would be negative, which makes no sense. Since \(X\) stops at 0, to have that big a \(\sigma\) it would have to get very large occasionally, so it would have to be skewed right.

More Another Example

\[ \mu_X = 18, \quad \mu_{\bar{X}} = 18, \quad \sigma_X = 22, \quad n = 52, \quad \sigma_{\bar{X}} = \frac{22}{\sqrt{52}} = 3.05 \]

\(\bar{X}\) is approximately normal. Chance between 17 and 19:

\[ P(17 < \bar{X} < 19) = \text{normdist}(19, 18, 22/\sqrt{52}, 1) - \text{normdist}(17, 18, 22/\sqrt{52}, 1) = 25.7\% \]

95% of samples give an average haircut cost (\(\bar{X}\)) between

\[ 18 \pm 1.96\,\sigma_{\bar{X}} = 18 \pm 1.96 \cdot 3.05 = 18 \pm 5.98 = [12.0, 24.0] \text{ dollars} \]

Finish Another Example

Chance of getting over 30:

\[ P(\bar{X} > 30) = 1 - \text{normdist}(30, 18, 22/\sqrt{52}, 1) = 4 \times 10^{-5} \]

Since this would be a very surprising result assuming the given mean and standard deviation and the other assumptions, it suggests one of our assumptions is wrong.

Example: You Try

College students' heights are bimodal and symmetric with a mean of 68 inches and an s.d. of 3.5 inches. If you take a simple random sample of 18 college students and compute their average height \(\bar{X}\), find the mean and standard error of \(\bar{X}\) and check all assumptions. What is the probability you will get an average height for your sample over 70 inches?

It says it is a simple random sample. There are more than \(20 \cdot 18 = 360\) college students, so the Large Population assumption is met. We see \(n = 18 \ge 15\) and we are told \(X\) is symmetric, so the second condition of the 0 : 15 : 40 rule is met, and we can assume \(\bar{X}\) is normal.

\[ \mu_{\bar{X}} = \mu_X = 68, \qquad \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{3.5}{\sqrt{18}} = 0.825 \]

Chance that \(\bar{X}\) is more than 70 is

\[ P(\bar{X} > 70) = 1 - \text{normdist}(70, 68, 3.5/\sqrt{18}, 1) = 0.767\% \]

• We were told \(X\) is bimodal, so we know it does not meet condition (i), and \(n = 18 < 40\) so it does not meet condition (iii).
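The normdist calculations in the examples above can be reproduced outside a spreadsheet. The sketch below uses `NormalDist` from Python's standard `statistics` module, where `cdf(x)` plays the role of `normdist(x, mean, sd, 1)`. The helper name `sampling_dist` is an illustrative choice, not something from the lecture, and the code assumes the normality conditions have already been checked as in the worked examples.

```python
import math
from statistics import NormalDist


def sampling_dist(mu, sigma, n):
    """Normal model for the sample mean: mean mu, standard error sigma / sqrt(n).

    Hypothetical helper; assumes the 0 : 15 : 40 rule has already been checked.
    """
    return NormalDist(mu, sigma / math.sqrt(n))


# Young men's heights: mu = 70, sigma = 2.5, n = 12
heights = sampling_dist(70, 2.5, 12)
p_over_72 = 1 - heights.cdf(72)      # text reports 0.279%
p_under_66 = heights.cdf(66)         # text reports 1.49e-8

# Haircut costs: mu = 18, sigma = 22, n = 52
haircuts = sampling_dist(18, 22, 52)
p_under_16 = haircuts.cdf(16)                      # text reports 25.6%
p_17_to_19 = haircuts.cdf(19) - haircuts.cdf(17)   # text reports 25.7%
p_over_30 = 1 - haircuts.cdf(30)                   # text reports about 4e-5
low = haircuts.inv_cdf(0.025)        # lower end of the middle 95%
high = haircuts.inv_cdf(0.975)       # upper end of the middle 95%

# College students' heights: mu = 68, sigma = 3.5, n = 18
students = sampling_dist(68, 3.5, 18)
p_over_70 = 1 - students.cdf(70)     # text reports 0.767%

print(f"P(heights mean > 72)  = {p_over_72:.5f}")
print(f"P(heights mean < 66)  = {p_under_66:.3e}")
print(f"P(haircut mean < 16)  = {p_under_16:.4f}")
print(f"P(17 < haircut < 19)  = {p_17_to_19:.4f}")
print(f"P(haircut mean > 30)  = {p_over_30:.2e}")
print(f"middle 95% of haircut means: [{low:.2f}, {high:.2f}]")
print(f"P(student mean > 70)  = {p_over_70:.5f}")
```

Using `inv_cdf` for the 95% range is equivalent to going 1.96 standard errors either side of the mean, so it should agree with the hand calculation up to rounding.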
Lecture 17 Key Points

After watching this lecture you should be able to

• say what we mean by the sampling distribution of \(\bar{X}\), and what it represents
• calculate the mean and standard deviation of \(\bar{X}\)
• check the Independence/Large Population assumption and what it tells you (that the s.d. formula is correct)
• check the Normality / 0 : 15 : 40 Rule and what it tells you (can use normdist)
• calculate probabilities of \(\bar{X}\) using normdist

Categorical Vs. Numerical

If the original variable (pop. dist.) is CATEGORICAL

Variable is a yes or no question. parameter = \(p\), statistic = \(\hat{P}\),

\[ \mu_{\hat{P}} = p, \qquad \sigma_{\hat{P}} = \sqrt{\frac{p(1-p)}{n}} \]

Assumptions: Large Pop (pop ≥ 20 × sample) and Rule of 15 (\(np \ge 15\), \(n(1-p) \ge 15\))

If the original variable (pop. dist.) is NUMERICAL

Variable is a numerical question \(X\), with mean \(\mu\) and s.d. \(\sigma\). parameter = \(\mu\), statistic = \(\bar{X}\),

\[ \mu_{\bar{X}} = \mu_X, \qquad \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} \]

Assumptions: Large Pop (pop ≥ 20 × sample) and 0 : 15 : 40 Rule
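The assumption checks in the summary above can be written down as small yes/no functions. This is only a sketch to make the logic of the rules explicit: the function and parameter names are hypothetical, and the judgment calls ("is the population normal?", "is it not too skewed, with no major outliers?") still have to come from the problem statement or a histogram and are passed in as booleans.

```python
def large_population_ok(population_size, n):
    """Large Population assumption: population at least 20 times the sample."""
    return population_size >= 20 * n


def rule_0_15_40_ok(n, population_is_normal=False, not_too_skewed=False):
    """0 : 15 : 40 rule for treating the sample mean as normal.

    population_is_normal: condition (i), X is known to be normal/bell-shaped.
    not_too_skewed: X is not too skewed and has no major outliers,
                    which together with n >= 15 gives condition (ii).
    Condition (iii) is n >= 40 on its own.
    """
    if population_is_normal:
        return True
    if n >= 15 and not_too_skewed:
        return True
    return n >= 40


def rule_of_15_ok(n, p):
    """Rule of 15 for a sample proportion: np >= 15 and n(1 - p) >= 15."""
    return n * p >= 15 and n * (1 - p) >= 15


# The three worked examples from the lecture:
print(rule_0_15_40_ok(12, population_is_normal=True))  # heights: condition (i)
print(rule_0_15_40_ok(52))                             # haircuts: condition (iii)
print(rule_0_15_40_ok(18, not_too_skewed=True))        # students: condition (ii)
print(large_population_ok(population_size=1040, n=52)) # exactly 20 * 52
```

Checking condition (iii) last mirrors the order of the rule as stated, although in practice the sample size is the easiest thing to check first.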