Module on Sampling Distribution

Author(s): Brenda Gunderson, Ph.D., 2011

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution–Non-commercial–Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.

Some material may be sourced from: Mind on Statistics, Utts/Heckard, 3rd Edition, Duxbury, 2006. Text only: ISBN 0495667161. Bundled version: ISBN 1111978301. Material from this publication used with permission.

Module 3: Sampling Distributions and the CLT

Objectives:
The objective of this module is to give you a hands-on discussion and understanding of the Central Limit Theorem (CLT), a theorem that plays an important role in statistics. The sampling distribution of a statistic can be obtained mathematically, but we will simulate the sampling process and will observe the empirical sampling distribution of various statistics. In this module you will simulate random samples from a known population distribution and compute a sample statistic for each of the generated samples. The generated sample statistics can be examined to learn about properties of the sampling distribution of the statistic.

Overview:
Statistical inference is the process of drawing conclusions about a population parameter based on data. When a sample is selected from a population, a summary number can be computed from the observations, resulting in the value of a statistic.
A statistic is used to estimate the corresponding value for a population (that is, a sample statistic estimates a population parameter). However, a sample chosen at random will not necessarily yield an estimate (a value of a statistic) that is exactly equal to the corresponding parameter for the population. The next selected sample of the same size will probably give a different estimate from the first one. If additional samples of the same size were taken you would begin to see how the possible estimates (possible values of the statistic) vary and how close they tend to be to the parameter value. With a large number of samples, you can assess whether the value of the statistic (e.g., sample mean $\bar{X}$) will be frequently close to the true value of the population parameter (e.g., population mean $\mu$), and if so, how close on average. This can be seen more easily through some pictures:

[Figure: three dot plots titled "One Random Sample", "Five Random Samples", and "Twenty Random Samples", each with the true population parameter marked on the horizontal axis.]

Note: Each X represents one statistic value (one estimate) computed from one sample.

When data are gathered by random sampling, the statistic will be a random variable and as such it will have a probability distribution. The probability distribution of the sample statistic is called its sampling distribution. Generally speaking, if we use a statistic to make an inference about a population parameter, we want its sampling distribution to be centered at the true parameter (a characteristic which allows us to call that statistic unbiased), and we would like variability in the estimates to be as small as possible.

Below we have two estimators that are both unbiased, but Estimator I has less variability (is more precise). Thus, we would prefer Estimator I over Estimator II.
[Figure: dot plots for Estimator I and Estimator II, each with the true population parameter marked on the horizontal axis.]

We will next examine the sampling distribution of the sample statistic most commonly used for measuring the center of a distribution: the sample mean.

Activity: How do Sample Size and the Distribution of the Parent Population affect the Sampling Distribution of the Sample Mean?

In this activity you will observe the effects that sample size and the distribution of the population you are sampling from have on the sampling distribution of the sample mean. The sampling distribution of the sample mean, $\bar{X}$, is the distribution of the sample mean values for all possible samples of the same size from the same population.

For this activity open the sampling distribution applet (the original applet can be found at http://onlinestatbook.com/stat_sim/sampling_dist/index.html). This applet will help you simulate sampling distributions for a variety of statistics, allowing you to vary the sample size and the population from which the samples are taken.

Read the Instructions. Press “Begin” and the Sampling Distribution Applet will open. Notice that when the applet begins, a histogram of the normal distribution with mean 16 and standard deviation 5 is displayed for the default “parent distribution”.

The Sampling Distribution Applet has several options you can choose from:

- The 1st histogram, the Parent Population histogram, is the population from which the sample will be drawn. You can select from Normal, Uniform, Right Skewed or even customize the distribution by selecting Custom and dragging the mouse over the plot of the parent distribution. For now, keep the default N(16, 5) distribution as the parent population.
- The 2nd histogram, the Sample Data plot, displays a histogram of the sampled data. This histogram is initially blank.
- The 3rd and 4th histograms show the distribution of statistics computed from the sampled data. The number of samples (replications) that the 3rd and 4th histograms are based on is indicated by the label "Reps=", which is displayed once the sample is simulated.
- Select the Mean as the statistic in the 3rd histogram with a sample size of 5 (default), then click on Animated sample, and one sample of size n = 5 will be drawn from the normal parent population (note that the applet labels the sample size N, whereas we generally use n). You will see the five observations appear in the 2nd histogram; the sample mean of the five numbers will appear in the 3rd histogram as a blue square. This graphically shows the process of getting the sample mean from one sample of size 5. Repeat this several times and you will see how the “sampling distribution” of the sample mean starts to form in the 3rd histogram. Once you have a feel for how this works, you can speed things up by taking 5, 1,000, or 10,000 samples at one time.
- Although we will focus primarily on the sampling distribution of the sample mean, you do have the option to simulate the sampling distribution of any of the following statistics:
    Mean
    Median
    sd = Standard deviation (N is used in the denominator)
    Variance = Variance of the sample (N is used in the denominator)
    Variance(U) = Unbiased estimate of variance (N-1 is used in the denominator)
    MAD = Mean absolute value of the deviation from the mean
    Range

When you are done with a particular simulation, you can click on the Clear Lower 3 button to clear histograms 2, 3, and 4 and select new settings for your next simulation.

Tasks:
For the following tasks always select Mean (sample mean) as the statistic of interest in the 3rd histogram (and leave the 4th histogram set to none).

1. Select the Normal distribution as a parent population.

a. What are the mean and standard deviation of this population?
Answer: Mean = 16.00, sd = 5.00

b. Select a sample size n = 5 for the mean as the statistic of interest. Do about 5 animated samples and then take 10,000 samples at once. Draw a picture of the distribution of the sample means. Make sure to label both axes. How does the distribution of the sample mean (3rd histogram) compare with the parent population (e.g., shape, mean, standard deviation)?
Answer: The distribution of the resulting sample mean values follows approximately a normal shape that is centered around the original population mean value of 16, but the spread of the sample mean values is smaller than the spread of the values in the original population; that is, the sample mean values have a smaller standard deviation.

c. Clear the lower three graphs and change the sample size to n = 25. Again, do about 5 animated samples and then take 10,000 samples at once. Draw a picture of the distribution of the sample means. Comment on the changes observed on the 3rd histogram here as compared to the 3rd histogram generated in part 1(b).
Answer: The distribution of the resulting sample mean values again follows a normal shape that is centered around the original population mean value of 16, but the sample means are more concentrated (less varied) around the population mean of 16.

d. What can you say about the relationship between the standard deviation of the sample mean and the population standard deviation?
Answer: The standard deviation of the sample mean is smaller than the population standard deviation.

e. What can you say about the relationship between the sample size and the standard deviation of the sample mean?
Answer: The standard deviation of the sample mean becomes smaller as the sample size increases.

f. Does the number of samples (replications) influence the shape of the sampling distribution? (Note: the number of samples is not the sample size.) For example, is the shape of the sampling distribution when Reps = 10,000 significantly different from the shape of the sampling distribution when Reps = 100,000?
Answer: No, only the sample size n and the shape of the parent population influence the shape of the sampling distribution.

2. Clear the lower three graphs and then select the skewed distribution as a parent population.

a. Select a sample size n = 5 for the mean as the statistic of interest. Do a few animated samples and then take 10,000 samples at once. Draw a picture of the distribution of the sample means.

b. How does the distribution of the sample mean (3rd histogram) compare to the distribution of the sample mean in part 1(b) (when the parent population was normal)?
Answer: When the parent population was normal, the distribution of the sample mean looked more like a normal distribution: it was more symmetric and bell shaped than this histogram of sample means.

c. How does the distribution of the sample mean (3rd histogram) compare with the parent population (e.g., shape, mean, standard deviation)?
Answer: The distribution of the sample mean has a somewhat symmetric shape, with a mean close to the population mean and a standard deviation smaller than that of the population.

d. Change the sample size to n = 25. Do a few animated samples and then take 10,000 samples at once. Draw a picture of the distribution of the sample means. Comment on the changes observed on the 3rd histogram as compared to the 3rd histogram generated in part 2(a).
Answer: The sample means are more concentrated around the value of the population mean, and the shape of the distribution is closer to normal.

e. What should be the value of the standard deviation of the sample mean if the population standard deviation is 6.22 and the sample size is n = 25? How does the standard deviation in histogram 3 from part 2(c) compare to this value?
Answer: The standard deviation of the sample mean will be equal to $6.22/\sqrt{25} \approx 1.24$. The standard deviation from 2(c), 2.81, is larger because that histogram was based on the smaller sample size n = 5, and the factor $1/\sqrt{n}$ is larger when n is smaller.

3. Clear the lower three graphs, then select the custom distribution as a parent population. The parent population plot should be empty. To “draw” a distribution, you will need to use the mouse. Click and drag on different parts of the parent population graph until you have drawn a distribution that you like.

a. Sketch your custom population.
Answer: This will vary by student. Encourage students to create a unique distribution.

b. Select a sample size n = 5 for the mean as the statistic of interest. Do a few animated samples and then take 10,000 samples at once. How does the distribution of the sample mean (3rd histogram) compare with the parent population (e.g., shape, mean, standard deviation)?
Answer: The distribution of the sample mean has a somewhat symmetric shape, with a mean close to the population mean and a standard deviation smaller than that of the population.

c. Change the sample size to n = 25. Do a few animated samples and then take 10,000 samples at once. Comment on the changes observed on the 3rd histogram here as compared to the 3rd histogram generated in part 3(b). What can you say about the shape of the distribution of the sample mean with respect to the sample size n?
Answer: The sample means are more concentrated around the value of the population mean, with a distribution that looks approximately normal. The larger the sample size n, the narrower the distribution of the sample mean.

d. What should be the standard deviation of the sample mean for samples of size n = 25 from your custom population? (Show your calculation.) How does the standard deviation of the values in histogram 3 from part 3(c) compare to it?
Answer: According to the central limit theorem, the standard deviation for the sample mean should be equal to $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation. In this particular “custom” distribution $\sigma = 6.26$, thus the standard deviation of the sample mean is $6.26/\sqrt{25} \approx 1.25$.
We can see from the 3rd histogram that the standard deviation of this empirically generated sampling distribution of the sample mean is 1.26, which is quite close to the expected 1.25.

4. Fill in the blanks to summarize your findings in Exercises 1, 2, and 3:

a. If the parent population is a normal distribution with a mean $\mu$ and a standard deviation $\sigma$, then for any sample size (small or large), the sample mean will have a __normal__ distribution with a mean of __$\mu$__ and a standard deviation of __$\sigma/\sqrt{n}$__.

b. If the parent population is NOT a normal distribution but has a mean $\mu$ and a standard deviation $\sigma$, then for a large sample size, the sample mean will have approximately a __normal__ distribution with a mean of __$\mu$__ and a standard deviation of __$\sigma/\sqrt{n}$__.

The result in 4a is known as the Sampling Distribution of the Sample Mean. The result in 4b is known as the Central Limit Theorem. You should note that there are several similarities between them. However, make sure you can see and understand the difference between the two results.

Fill out the chart below to further summarize your findings regarding the sampling distribution of the sample mean based on the CLT.

Will the Sampling Distribution of the Sample Mean be approximately Normal?

Sample size   Parent population   Approximately Normal?
n = 10        Normal              Yes
n = 10        NOT Normal          No
n = 50        Normal              Yes
n = 70        NOT Normal          Yes

Check Your Understanding:
A researcher interested in the environmental impact of contaminants in soil has collected a sample of 100 tree saplings of a certain species. Ten years ago, the average height of all such tree saplings was 60 inches with a standard deviation of 4 inches. Let X denote the height of a tree sapling.

a. The sample mean for the 100 tree saplings was 56. Fill in appropriate notation: __$\bar{x}$__ = 56.

b. Provide the expected value, standard deviation, and approximate distribution of the sample mean height of tree saplings, assuming the values from ten years ago are treated as population parameters.
Answer: The sample mean is approximately Normal(60, 0.4): the expected value is 60 inches and the standard deviation is 0.4 inches. Note that 0.4 comes from $4/\sqrt{100}$.

c. Draw a detailed sketch of the sampling distribution of the sample mean height of tree saplings. Make sure to include your labels.
Answer: This will be approximately normal with the x-axis labeled $\bar{x}$ or ‘sample mean values’. The distribution should be centered at a mean of 60 with a standard deviation of 0.4.

[Figure: density curve labeled N(60, 0.4); vertical axis: density.]

Example Exam Question on Sampling Distribution of the Sample Mean
For a particular community it is known that the mean amount of water used per home during October is 1250 gallons and the standard deviation is 325 gallons.

a. The distribution for amount of water used is skewed to the right. Sketch a skewed right distribution below and label both axes.

[Sketch: right-skewed density curve; vertical axis: density; horizontal axis: Amount of water (gallons).]

b. For a promotional campaign a radio station plans to randomly select 50 homes and pay their water bills for the month of October. Describe the approximate sampling distribution of the sample mean amount of water used for a random sample of 50 homes. Provide all features of the distribution.
Answer: The sample mean will have approximately a NORMAL distribution with a mean of 1250 gallons and a standard deviation of $325/\sqrt{50} \approx 45.962$ gallons.

c. The radio station can afford to pay for a total of 67,000 gallons. What is the probability that the total number of gallons for a random sample of 50 homes will exceed 67,000 gallons? (Hint: think about how a total and an average are related.)
Answer:
$$P(\text{TOTAL} > 67000) = P\left(\text{MEAN} > \frac{67000}{50}\right) = P(\bar{X} > 1340) = P\left(Z > \frac{1340 - 1250}{45.962}\right) = P(Z > 1.96) = 0.025$$
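
The arithmetic in the exam question above can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and only the standard library; the variable names are illustrative and not part of the original handout. It converts the total budget to a threshold for the sample mean, standardizes it, and evaluates the upper-tail normal probability.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the exam question.
mu = 1250.0             # population mean (gallons per home)
sigma = 325.0           # population standard deviation (gallons)
n = 50                  # number of homes sampled
budget_total = 67000.0  # total gallons the station can pay for

# Standard deviation of the sample mean: sigma / sqrt(n).
se = sigma / sqrt(n)

# A total above the budget is the same event as a mean above budget / n.
threshold_mean = budget_total / n

# Standardize and take the upper-tail probability of the standard normal.
z = (threshold_mean - mu) / se
p_exceed = 1.0 - NormalDist().cdf(z)

print(f"standard deviation of the sample mean: {se:.3f}")
print(f"threshold for the sample mean: {threshold_mean:.0f}")
print(f"z = {z:.2f}, P(Z > z) = {p_exceed:.3f}")
```

The normal approximation is used here because n = 50 is large, even though the parent distribution is right skewed.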

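
The applet tasks earlier in this module can also be approximated in code. The following is a minimal sketch, not the applet's implementation: it assumes Python 3.8 or later with only the standard library, uses the N(16, 5) parent from Task 1, and substitutes an exponential distribution with mean 6 (standard deviation 6) as a stand-in right-skewed parent, which is an assumption and not the applet's skewed population. The helper names are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def simulate_sample_means(draw, n, reps, rng):
    """Return `reps` sample means, each computed from a sample of size `n`."""
    return [fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


def normal_parent(rng):
    # Parent population from Task 1: normal with mean 16 and sd 5.
    return rng.gauss(16.0, 5.0)


def skewed_parent(rng):
    # Assumed stand-in for a right-skewed parent: exponential with mean 6, sd 6.
    return rng.expovariate(1.0 / 6.0)


rng = random.Random(2011)  # fixed seed so repeated runs give the same numbers
reps = 10_000              # number of samples (replications), not the sample size

results = {}
parents = [("normal", normal_parent, 5.0), ("skewed", skewed_parent, 6.0)]
for name, draw, sigma in parents:
    for n in (5, 25):
        means = simulate_sample_means(draw, n, reps, rng)
        results[(name, n)] = (fmean(means), pstdev(means))
        print(
            f"{name:6s} parent, n = {n:2d}: "
            f"mean of sample means = {results[(name, n)][0]:.2f}, "
            f"sd of sample means = {results[(name, n)][1]:.2f}, "
            f"sigma/sqrt(n) = {sigma / sqrt(n):.2f}"
        )
```

In each case the simulated standard deviation of the sample means should land close to $\sigma/\sqrt{n}$, and it should shrink when n goes from 5 to 25. Plotting a histogram of `means` for the skewed parent would show the shape moving toward normal as n grows.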