STAT 305: Chapter 9 – Sampling Distribution of the Sample Mean (Ȳ)
Spring 2014

Introduction:
When we take a sample of size n from a population and calculate summary statistics like the sample mean (ȳ), the sample median (m), the sample variance (s²), the sample standard deviation (s), or the sample proportion (p̂), we must realize that these quantities will __________________________________________ and hence are themselves ___________________.

Any random variable in statistics has a probability distribution. We have been talking about three common probability distributions in statistics. When Y = # of “successes” in n independent trials we used the binomial distribution to talk about Y probabilistically. When Y was continuous and had an approximately bell-shaped distribution we used the normal distribution to calculate probabilities and quantiles associated with Y.

Because the summary statistics discussed above are random variables, they also have a probability distribution that determines the likelihood of obtaining certain values of these statistics. The distribution of a summary statistic, e.g. the sample mean (ȳ), is called the ______________________________________. In this handout we explore the sampling distribution of the sample mean (Ȳ).

Sampling Distribution of Ȳ
The sample mean (Ȳ) is a random quantity that varies from sample to sample. The probability distribution the sample mean follows is called the sampling distribution of Ȳ.

The sampling distribution demo I showed in class is found at the following web address:
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/

The Central Limit Theorem for the Sample Mean (CLT)
The CLT tells us about the sampling distribution of the sample mean (Ȳ). The CLT for the sample mean Ȳ says the following:
1.
2.
3.
The sampling distribution will be ___________ if either of the two conditions below is met:
•
or if
•

We now consider applications of the central limit theorem (CLT).

Applications of the CLT to Decision Making

Example 9.1: Cholesterol levels of adult males (50-60 yrs. old)
The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a standard deviation of σ = 40 mg/dl. Assume also that blood cholesterol levels are approximately normally distributed in this population.

a) What is the probability that a sample of size n = 25 would yield a sample mean greater than 225 mg/dl?

b) Give a range of values within which we would expect the sample mean to fall approximately 95% of the time.

c) Suppose we took a sample of adult males between the ages of 50 and 60 who are also strict vegetarians and obtained a sample mean of ȳ = 188 mg/dl. Does this provide evidence that the subpopulation of vegetarians has a lower mean cholesterol level than the greater population of men in this age group? Explain.

Example 9.2: Mercury Levels Found in Boulder Reservoir Walleyes
Fish consumption guidelines suggest you should limit the number of fish you eat with Hg levels above .25 ppm. Is there evidence to suggest that walleyes from Boulder Reservoir have a mean Hg content exceeding .25 ppm?

Confidence Intervals for the Population Mean (μ)

Example 9.3: Suppose we are trying to estimate the mean protein content of zebra mussels, which are becoming an increased part of the diet for ducks on the Mississippi River. A sample of n = 25 zebra mussels is analyzed for protein content, giving a sample mean of ȳ = 9.14 units. This is called a _____________________ for the population mean (μ) because it yields a single value for this unknown quantity.
A better estimate might be 9.14 give or take _____ units, i.e. ______ up to _______. This is called an __________________________ as it gives a range or interval of plausible values for the population mean (μ).

How do we know if this is a good interval estimate? __________________

What properties should a good interval estimate have?
• It
•

The central limit theorem states that if our sample size (n) is sufficiently large, then

$$\bar{Y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

which also implies that after standardizing

$$Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$

This means that when we collect our data the probability our observed sample mean will fall within two standard errors of the mean is approximately .95, or a 95% chance. Being more precise, we could use ±1.96 standard errors because

$$P(-1.96 < Z < 1.96) = .9500$$

which gives

$$P\left(\mu - 1.96\,\frac{\sigma}{\sqrt{n}} < \bar{Y} < \mu + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = .9500$$

For a 99% chance we use _______ and for 90% we use ________ in place of 1.96.

Starting with the statement,

$$P(-1.96 < Z < 1.96) = P\left(-1.96 < \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .9500$$

we will perform algebraic manipulations to isolate the population mean μ in the middle of this inequality instead. By doing this we will obtain an interval that has a 95% chance of covering the true population mean (μ).

Algebraic manipulations of the inequality above:

This says that the interval from $\bar{Y} - 1.96 \cdot \frac{\sigma}{\sqrt{n}}$ up to $\bar{Y} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}$ has a 95% chance of covering the true population mean μ. This interval is simply the sample mean plus or minus roughly two standard errors.

However, this interval cannot be calculated in practice! WHY?

A “simple fix” to this would be to replace ____ by the estimated standard deviation from our data _____. The problem with our “simple fix” is that the distribution of $\frac{\bar{Y} - \mu}{s/\sqrt{n}}$ is not standard normal, i.e. N(0,1), and therefore the 1.96 value will not necessarily produce the desired level of confidence.
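The role of the 1.96 multiplier in the σ-known interval can be checked numerically. The sketch below is an illustration only and is not part of the original handout. It relies on Python's standard-library `statistics.NormalDist`; the helper names `z_multiplier` and `coverage` are hypothetical. The simulation borrows the population values from Example 9.1 (μ = 200, σ = 40, n = 25) and assumes a normal population.

```python
import random
from statistics import NormalDist

def z_multiplier(confidence):
    """Return z such that P(-z < Z < z) = confidence for Z ~ N(0, 1)."""
    tail = (1 - confidence) / 2            # area left in each tail
    return NormalDist().inv_cdf(1 - tail)

def coverage(mu, sigma, n, z, reps=5000, seed=305):
    """Fraction of simulated samples whose interval
    ybar +/- z * sigma / sqrt(n) covers the true mean mu."""
    rng = random.Random(seed)              # fixed seed so runs are repeatable
    se = sigma / n ** 0.5                  # standard error of the sample mean
    hits = 0
    for _ in range(reps):
        ybar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if ybar - z * se < mu < ybar + z * se:
            hits += 1
    return hits / reps

# Multipliers for the three confidence levels discussed in the handout
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_multiplier(level):.3f}")

# Assumed population values taken from Example 9.1
print(f"simulated coverage: {coverage(200, 40, 25, z_multiplier(0.95)):.3f}")
```

The simulated coverage should land close to .95, though not exactly on it, because it is itself an estimate based on a finite number of simulated samples.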
FACT: If the population we are sampling from is approximately normal, then

$$\frac{\bar{Y} - \mu}{s/\sqrt{n}}$$

has a t-distribution with degrees of freedom df = n - 1.

What does a t-distribution look like?

Facts about the t-distribution:
•
•
•

Examples: Using the t-table to find confidence intervals
a) n = 20 and 95% confidence    t =
b) n = 20 and 99% confidence    t =
c) n = 50 and 90% confidence    t =
d) n = 10 and 95% confidence    t =

The basic form of most confidence intervals is:

(estimate) ± (table value)(SE of estimate)

where the product (table value)(SE of estimate) is the MARGIN OF ERROR.

General Form for a Confidence Interval for the Mean
For the population mean we have,

$$\bar{Y} \pm (t\text{-table value}) \cdot SE(\bar{Y}) \quad \text{or} \quad \bar{Y} \pm t\,\frac{s}{\sqrt{n}}$$

The appropriate columns in the t-distribution table for the different confidence levels are as follows:
90% Confidence: look in the .05 column (if n is “large” we can use 1.645)
95% Confidence: look in the .025 column (if n is “large” we can use 1.960)
99% Confidence: look in the .005 column (if n is “large” we can use 2.576)

Example 9.3 (cont’d): Suppose we are trying to estimate the mean protein content of zebra mussels, which are becoming an increased part of the diet for ducks on the Mississippi River. A sample of n = 25 zebra mussels is analyzed for protein content, giving a sample mean of ȳ = 9.14 units with a sample standard deviation of s = 2.98 units.

a) Use this information to find a 95% CI for the mean protein content found in the tissues of zebra mussels, assuming that protein content of zebra mussels has a normal distribution.

Suppose a sample of n = 25 freshwater clams was obtained and a similar protein analysis was conducted, resulting in a sample mean of ȳ = 26.66 units with a standard deviation of s = 12.12 units.

b) Find a 95% confidence interval for the mean protein content found in the tissue of freshwater clams.

c) Does this interval in conjunction with the interval obtained for zebra mussels provide evidence that freshwater clams are richer in protein than zebra mussels?
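
As a hedged illustration of the interval form (sample mean) ± t · s/√n, the sketch below plugs in the summary statistics from Example 9.3. It is not part of the original handout. The helper name `t_interval` is hypothetical, and the multiplier 2.064 is an assumption taken from a standard t-table (.025 column, df = 24); it is hardcoded because Python's standard library has no t-distribution quantile function.

```python
from math import sqrt

def t_interval(ybar, s, n, t_star):
    """Return (lower, upper) for ybar +/- t_star * s / sqrt(n)."""
    margin = t_star * s / sqrt(n)          # margin of error
    return ybar - margin, ybar + margin

# Assumed t-table value: .025 column with df = 25 - 1 = 24
T_STAR_95_DF24 = 2.064

# Summary statistics as given in Example 9.3
mussels = t_interval(9.14, 2.98, 25, T_STAR_95_DF24)
clams = t_interval(26.66, 12.12, 25, T_STAR_95_DF24)

print(f"zebra mussels:    {mussels[0]:.2f} to {mussels[1]:.2f}")
print(f"freshwater clams: {clams[0]:.2f} to {clams[1]:.2f}")
```

For a different sample size or confidence level, the multiplier would need to be looked up again using df = n - 1 and the matching table column.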