Author: Brenda Gunderson, Ph.D., 2012

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License: http://creativecommons.org/licenses/by-nc-sa/3.0/

The University of Michigan Open.Michigan initiative has reviewed this material in accordance with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it. The attribution key provides information about how you may share and adapt this material.

Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content.

For more information about how to attribute these materials visit: http://open.umich.edu/education/about/terms-of-use.

Some materials are used with permission from the copyright holders. You may need to obtain new permission to use those materials for other uses. This includes all content from:

Mind on Statistics
Utts/Heckard, 4th Edition, Cengage L, 2012
Text Only: ISBN 9781285135984
Bundled version: ISBN 9780538733489

SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary computer software. Other product names mentioned in this resource are used for identification purposes only and may be trademarks of their respective companies.

Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy

Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.

Make Your Own Assessment
Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright.
Public Domain – Ineligible. Works that are ineligible for copyright protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may differ.
Content Open.Michigan has used under a Fair Use determination
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.

Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should conduct your own independent analysis to determine whether or not your use will be Fair.

Module 4: Sampling Distributions and the CLT

Objective: The objective of this module is to give you hands-on experience with, and an understanding of, sampling distributions and the Central Limit Theorem (CLT), a theorem that plays an important role in statistics. The sampling distribution of a statistic can be obtained mathematically, but we will simulate the sampling process and observe the empirical sampling distribution of various statistics. In this module, you will simulate random samples from a known population distribution and compute a sample statistic for each of the generated samples. The generated sample statistics can be examined to learn about properties of the sampling distribution of the statistic.

Overview: Statistical inference is the process of drawing conclusions about a population parameter based on data. When a sample is selected from a population, a summary number can be computed from the observations, resulting in the value of a statistic. A statistic is used to estimate the corresponding value for a population (that is, a sample statistic estimates a population parameter). However, a sample chosen at random will not necessarily yield an estimate (or statistic) that is exactly equal to the corresponding parameter for the population; the next selected sample of the same size will probably give a different estimate from the first one.
If additional samples of the same size were taken, you would begin to see how the possible estimates (possible values of the statistic) vary and how close they tend to be to the parameter value. With a large number of samples, you can assess whether the value of the statistic (e.g., sample mean, X̄) will frequently be close to the true value of the population parameter (e.g., population mean, μ), and if so, how close on average. This can be seen more easily through some pictures:

[Figure: three panels of statistic values, titled "1 Random Sample", "5 Random Samples", and "20 Random Samples".]
Note: Each X represents one statistic value (one estimate) computed from one sample.

When data are gathered by random sampling, the statistic will be a random variable and, as such, it will have a probability distribution. The probability distribution of the sample statistic is called its sampling distribution. Generally, if we use a statistic to make an inference about a population parameter, we want its sampling distribution to be centered at the true parameter (a characteristic which allows us to call that statistic unbiased), and we would like variability in the estimates to be as small as possible. Below, we have two estimators that are both unbiased, but Estimator I has less variability (is more precise). Thus, we would prefer Estimator I to Estimator II.

We will next examine the sampling distribution of the sample statistic most commonly used for measuring the center of a distribution -- the sample mean.

Formula Card:

Activity: How Do Sample Size and the Distribution of the Parent Population Affect the Sampling Distribution of the Sample Mean?

In this activity, you will observe the effects that sample size and the distribution of the population you are sampling from have on the sampling distribution of the sample mean. The sampling distribution of the sample mean, X̄, is the distribution of the sample mean values for all possible samples of the same size from the same population.
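As a rough companion to the applet, the idea of a sampling distribution can also be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original activity: the function name `simulate_sample_means` and the seed are hypothetical, and the parent population is assumed to be normal with mean 16 and standard deviation 5 (the applet's default). It draws many samples of the same size, records each sample mean, and summarizes the resulting collection of means.

```python
import random
import statistics


def simulate_sample_means(n, reps, mu=16.0, sigma=5.0, seed=0):
    """Draw `reps` random samples of size `n` from a normal parent
    population (assumed mean `mu`, standard deviation `sigma`) and
    return the list of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


if __name__ == "__main__":
    # Compare two sample sizes, as the activity does with n = 5 and n = 25.
    for n in (5, 25):
        means = simulate_sample_means(n=n, reps=10_000)
        center = statistics.fmean(means)
        spread = statistics.pstdev(means)
        print(f"n = {n:2d}: mean of sample means = {center:.2f}, "
              f"sd of sample means = {spread:.2f}")
```

The printed center and spread play the same role as the summaries the applet shows beside its 3rd and 4th histograms, so they can be compared with what you observe there.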
Open the sampling distribution applet from the applet link in the “Links to Applets for Modules” folder on the Stat 250 CTools site (in the “Lab Info” folder, which is in the “Resources” folder). Alternatively, the original applet can be found at: http://onlinestatbook.com/stat_sim/sampling_dist/index.html

This applet will help you simulate sampling distributions for a variety of statistics and allows you to vary the sample size and the population from which the samples are taken. Read the Instructions. Press Begin to open the applet; you will see the screen pictured below.

Notice that when the applet begins, a histogram of the normal distribution with mean 16 and standard deviation 5 is displayed for the default parent distribution.

The Sampling Distribution Applet has several options from which you can choose:

The 1st histogram, the Parent Population histogram, is the population from which the sample will be drawn. You can select from Normal, Uniform, Skewed, or even customize the distribution by selecting Custom and dragging the mouse over the plot. For now, keep the default N(16, 5) distribution as the parent population. When you are done with a particular simulation, you can click the Clear lower 3 button to clear the remaining histograms, and select new settings for your next simulation.

The 2nd plot, the Sample Data histogram, displays a histogram of the sampled data. This histogram is initially blank. You can select to draw Animated Sample, 5 Samples, 1,000 Samples, or 10,000 Samples from the parent population.

The 3rd and 4th histograms show the distribution of statistics computed from the sampled data. The number of samples (replications) on which the 3rd and 4th histograms are based is indicated by the label "Reps=," which is displayed once the simulation is started. You can also control which statistic to examine, as well as the sample size, by using the dropdown menu options to the right of each plot.
(Note that the applet uses N to denote sample size, whereas we generally use n.) The statistic options include:

Mean
Median
sd = standard deviation (uses N in the denominator)
Variance = variance of the sample (uses N in the denominator)
Variance (U) = unbiased estimate of variance (uses N-1 in the denominator)
MAD = mean absolute value of the deviation from the mean
Range

Select Mean as the statistic in the 3rd histogram and a sample size of 5 (default), then click on Animated Sample to draw one sample of size n = 5 from the normal parent population. You will see five observations appear in the 2nd histogram, and the sample mean of the five numbers will appear in the 3rd histogram as a blue rectangle. This graphically shows the process of obtaining the sample mean from one sample of size 5. Repeat this several times and you will see how the sampling distribution of the sample mean starts to form in the 3rd histogram. Once you have a feel for how this works, you can speed things up by choosing the larger sampling options: 5, 1,000, or 10,000 samples.

1. Select the Normal distribution as a parent population.

a. What are the mean and standard deviation of this population?

b. Select Mean (sample mean) as the statistic of interest in both the 3rd and 4th histograms, sample size n = 5 for the 3rd histogram, and n = 25 for the 4th. Do about 5 animated samples, and then take 10,000 samples at once. Draw rough sketches of each of the distributions of the sample means. Make sure to label both axes.
n = 5:
n = 25:

How do the distributions of each sample mean in the 3rd and 4th histograms compare with the parent population in the 1st histogram? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:

c. Looking at the properties of the population and sample distributions (displayed to the left of their respective histograms), what can you say about the relationship between the standard deviation of the sample mean and the population standard deviation?

d. What can you say about the relationship between the sample size and the standard deviation of the sample mean?

e. Does the number of replications influence the shape of the sampling distribution? That is, as you take more samples, does the shape of the sampling distribution change significantly?

2. Clear the lower three graphs and then select the Skewed distribution as a parent population.

a. Select Mean (sample mean) as the statistic of interest in both the 3rd and 4th histograms, sample size n = 5 for the 3rd histogram, and n = 25 for the 4th. Do about 5 animated samples, and then take 10,000 samples at once. Draw rough sketches of each of the distributions of the sample means. Make sure to label both axes.
n = 5:
n = 25:

How do the distributions of each sample mean in the 3rd and 4th histograms compare with the parent population in the 1st histogram? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:

How do the distributions of each sample mean in the 3rd and 4th histograms compare to each other? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:

How do the distributions of each sample mean in the 3rd and 4th histograms compare with those created for the sample mean when the parent population was normal (in question 1)? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:

b. What should be the value of the standard deviation of the sample mean if the population standard deviation is 6.22 and the sample size is n = 25? (Show the calculation.) How does this value compare to the standard deviation displayed to the left of the 4th histogram created above?

3. Clear the lower three graphs and then select the Custom distribution as a parent population. The parent population plot should be empty. To create a distribution, you will need to use the mouse to click and drag on different parts of the parent population graph until you have drawn a distribution that you like.

a. Provide a rough sketch of your custom population.
Be sure to note the mean and standard deviation.

b. Select Mean (sample mean) as the statistic of interest in both the 3rd and 4th histograms, sample size n = 5 for the 3rd histogram, and n = 25 for the 4th. Do a few animated samples, and then take 10,000 samples at once. Draw rough sketches of each of the distributions of the sample means. Make sure to label both axes.
n = 5:
n = 25:

How do the distributions of each sample mean in the 3rd and 4th histograms compare with the parent population in the 1st histogram? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:

How do the distributions of each sample mean in the 3rd and 4th histograms compare to each other? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:

c. Considering the changes observed from n = 5 to n = 25 in questions 2 and 3, what can you say about the shape of the distribution of the sample mean with respect to the sample size n?

d. What should be the standard deviation of the sample mean for samples of size n = 25 from your custom population? (Show the calculation.) How does this value compare to the standard deviation displayed to the left of the 4th histogram created above?

4. Fill in the blanks to summarize your findings in Questions 1, 2, and 3:

a. If the parent population is a normal distribution with a mean μ and a standard deviation σ, then for any sample size, the sample mean will have a _____________ distribution with a mean of _______ and a standard deviation of ___________.

b. If the parent population is NOT a normal distribution, but has a mean μ and a standard deviation σ, then for a large sample size, the sample mean will have approximately a _____________ distribution with a mean of _______ and a standard deviation of _________.

The result in 4(a) is known as the Sampling Distribution of the Sample Mean. The result in 4(b) is known as the Central Limit Theorem.
While there are several similarities between the two results, make sure you can see and understand the difference between them.

Fill out the chart below to further summarize your findings regarding the sampling distribution of the sample mean based on the CLT.

Sample Settings                        | Will the sampling distribution of sample mean be approximately normal?
n = 10, Parent Population Normal       |
n = 10, Parent Population NOT Normal   |
n = 50, Parent Population Normal       |
n = 70, Parent Population NOT Normal   |

Check Your Understanding:

A researcher interested in the environmental impact of contaminants in soil has collected a sample of 100 tree saplings of a certain species. Ten years ago, the average height of all such tree saplings was 60 inches with a standard deviation of 4 inches. Let X denote the height of a tree sapling.

a. The sample mean for the 100 tree saplings was 56. Fill in appropriate notation: ____ = 56

b. Provide the expected value, standard deviation, and approximate distribution of the sample mean height of tree saplings, assuming the values from ten years ago are treated as population parameters.

c. Draw a detailed sketch of the sampling distribution of the sample mean height of tree saplings. Make sure to include your labels.

Example Exam Question on Sampling Distribution of the Sample Mean

For a particular community it is known that the mean amount of water used per home during October is 1250 gallons and the standard deviation is 325 gallons.

a. The distribution for amount of water used is skewed to the right. Sketch a skewed right distribution below and label both axes.

b. For a promotional campaign, a radio station plans to randomly select 50 homes and pay their water bills for the month of October. Describe the approximate sampling distribution of the sample mean amount of water used for a random sample of 50 homes. Provide all features of the distribution.

c. The radio station can afford to pay for a total of 67,000 gallons.
What is the probability that the total number of gallons for a random sample of 50 homes will exceed 67,000 gallons? (Hint: Think about how a total and an average are related.)
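One way to check your own work on the hint is sketched below in Python. It is only an illustration under stated assumptions: the function name `prob_total_exceeds` is hypothetical, and the calculation assumes the sample mean is approximately normal (the Central Limit Theorem approximation for a sample of 50 homes) with standard deviation σ/√n. The key step is that a total exceeds a limit exactly when the sample mean exceeds that limit divided by n.

```python
from math import sqrt
from statistics import NormalDist


def prob_total_exceeds(total_limit, n, mu, sigma):
    """Approximate P(sample total > total_limit) for a random sample of
    size n, assuming the sample mean is approximately normal (CLT)."""
    mean_limit = total_limit / n          # total > limit  <=>  mean > limit / n
    standard_error = sigma / sqrt(n)      # standard deviation of the sample mean
    xbar = NormalDist(mu=mu, sigma=standard_error)
    return 1.0 - xbar.cdf(mean_limit)


if __name__ == "__main__":
    # Values taken from the exam question: 50 homes, mean 1250, sd 325.
    p = prob_total_exceeds(total_limit=67_000, n=50, mu=1250, sigma=325)
    print(f"P(total > 67,000 gallons) is approximately {p:.4f}")
```

Try working the problem by hand first (convert the total to an average, standardize, and use the normal table), then compare your answer with the value the sketch prints.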