Module on Sampling Distribution

Author(s): Brenda Gunderson, Ph.D., 2011
License: Unless otherwise noted, this material is made available under the
terms of the Creative Commons Attribution–Non-commercial–Share
Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your
ability to use, share, and adapt it. The citation key on the following slide provides information about how you
may share and adapt this material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any
questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
Some material may be sourced from:
Mind on Statistics
Utts/Heckard, 3rd Edition, Duxbury, 2006
Text Only: ISBN 0495667161
Bundled version: ISBN 1111978301
Material from this publication used with permission.
Attribution Key
for more information see: http://open.umich.edu/wiki/AttributionPolicy
Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC §
105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License
Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in
your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your
jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that
your use of the content is Fair.
To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
Module 3: Sampling Distributions and the CLT
Objectives: The objective of this module is to give you a hands-on discussion and understanding of the
Central Limit Theorem (CLT), a theorem that plays an important role in statistics. The sampling
distribution of a statistic can be obtained mathematically, but we will simulate the sampling process and
will observe the empirical sampling distribution of various statistics.
In this module you will simulate random samples from a known population distribution and compute a
sample statistic for each of the generated samples. The generated sample statistics can be examined to
learn about properties of the sampling distribution of the statistic.
Overview: Statistical inference is the process of drawing conclusions about a population parameter
based on data. When a sample is selected from a population, a summary number can be computed
from the observations resulting in the value of a statistic. A statistic is used to estimate the
corresponding value for a population (that is, a sample statistic estimates a population parameter).
However, a sample chosen at random will not necessarily yield an estimate (a value of a statistic) that is
exactly equal to the corresponding parameter for the population. The next selected sample of the same
size will probably give a different estimate from the first one. If additional samples of the same size
were taken you would begin to see how the possible estimates (possible values of the statistic) vary and
how close they tend to be to the parameter value.
With a large number of samples, you can assess whether the value of the statistic (e.g. sample mean $\bar{X}$)
will be frequently close to the true value of the population parameter (e.g. population mean $\mu$), and if
so, how close on average. This can be seen more easily through some pictures:
[Figure: three dot plots, titled One Random Sample, Five Random Samples, and Twenty Random Samples. In each, X marks are stacked along a horizontal axis on which the True Population Parameter is marked.]
Note: Each X represents one statistic value (one estimate) computed from one sample.
When data are gathered by random sampling, the statistic will be a random variable and as such it will
have a probability distribution. The probability distribution of the sample statistic is called its sampling
distribution.
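The idea of estimates varying from sample to sample can be sketched in a few lines of Python. This is only an illustration: the population values (normal with mean 16 and standard deviation 5), the sample size, and the helper name `simulate_sample_means` are assumptions chosen for the example.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` random samples of size n from N(mu, sigma); return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]

# Assumed population: normal with mean 16 and standard deviation 5.
means = simulate_sample_means(mu=16, sigma=5, n=5, reps=20)

# Each value below is one estimate of the population mean, from one sample.
print("first five estimates:", [round(m, 2) for m in means[:5]])
print("average of all 20 estimates:", round(statistics.fmean(means), 2))
```

Each printed estimate plays the role of one X in the pictures: individual estimates differ, but together they cluster around the parameter.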
Generally speaking, if we use a statistic to make an inference about a population parameter, we want its
sampling distribution to be centered at the true parameter (a characteristic which allows us to call that
statistic unbiased), and we would like variability in the estimates to be as small as possible.
Below we have two estimators that are both unbiased, but Estimator I has less variability (is more
precise). Thus, we would prefer Estimator I over Estimator II.
[Figure: two dot plots, titled Estimator I and Estimator II. Both stacks of X marks are centered at the True Population Parameter; the Estimator I values are clustered more tightly than the Estimator II values.]
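The handout does not say which statistics Estimator I and Estimator II are. As an illustrative assumption, the sketch below compares the sample mean and the sample median of samples from a normal population. Both are centered at the population mean, but the sample mean varies less. The population values, sample size, and number of replications are made up for the example.

```python
import random
import statistics

rng = random.Random(1)
MU, SIGMA, N, REPS = 16, 5, 25, 5000  # illustrative values, not taken from the figure

means, medians = [], []
for _ in range(REPS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    means.append(statistics.fmean(sample))     # estimator 1: sample mean
    medians.append(statistics.median(sample))  # estimator 2: sample median

for name, values in (("mean", means), ("median", medians)):
    print(f"sample {name}: center = {statistics.fmean(values):.2f}, "
          f"spread = {statistics.stdev(values):.2f}")
```

In this setup the sample mean plays the role of the more precise estimator.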
We will next examine the sampling distribution of the sample statistic most commonly used for
measuring the center of a distribution -- the sample mean.
Formula card:
Activity: How do Sample Size and the Distribution of the Parent
Population affect the Sampling Distribution of the Sample Mean?
In this activity you will observe the effects that sample size and the distribution of the population you
are sampling from have on the sampling distribution of the sample mean. The sampling distribution of
the sample mean, $\bar{X}$, is the distribution of the sample mean values for all possible samples of the same
size from the same population.
For this activity open the sampling distribution applet (the original applet can be found at
http://onlinestatbook.com/stat_sim/sampling_dist/index.html). This applet will help you simulate
sampling distributions for a variety of statistics, allowing you to vary the sample size and the population
from which the samples are taken.
Read the Instructions.
Press “Begin” and the Sampling
Distribution Applet will open; you will
see the screen at the right.
Notice that when the applet begins, a
histogram of the normal distribution
with mean 16 and standard deviation 5
is displayed for the default “parent
distribution”.
The Sampling Distribution Applet has several options you can choose from:
• The 1st histogram, the Parent Population histogram, shows the population from which the sample will
be drawn. You can select from Normal, Uniform, Right Skewed or even customize the
distribution by selecting Custom and dragging the mouse over the plot of the parent
distribution. For now, keep the default N(16, 5) distribution as the parent population.
• The 2nd histogram, the Sample Data plot, displays a histogram of the sampled data. This
histogram is initially blank. The 3rd and 4th histograms show the distribution of statistics
computed from the sampled data. The number of samples (replications) that the 3rd and 4th
histograms are based on is indicated by the label "Reps=", which will be displayed once the
sample is simulated.
• Select the Mean as the statistic in the 3rd histogram with a sample size of 5 (default), then click
on Animated sample, and one sample of size n = 5 will be drawn from the normal parent
population (note that the applet labels the sample size N, whereas we generally use n). You will see the
five observations appear in the 2nd histogram; the sample mean of the five numbers will appear
in the 3rd histogram as a blue square. This graphically shows the process of getting the sample
mean from one sample of size 5. Repeat this several times and you will see how the “sampling
distribution” of the sample mean starts to form in the 3rd histogram. Once you have a feeling
for how this works, you can speed things up by taking 5, 1000 or 10,000 samples at one time.
• Although we will focus primarily on the sampling distribution of the sample mean, you do have
the option to simulate the sampling distribution of any of the following statistics:
Mean; Median; sd = Standard deviation (N is used in the denominator); Variance = Variance of
the sample (N is used in the denominator); Variance(U) = Unbiased estimate of variance (N-1 is
used in the denominator); MAD = Mean absolute value of the deviation from the mean; Range
When you are done with a particular simulation, you can click on the Clear Lower 3 button to clear
histograms 2, 3 and 4 and select new settings for your next simulation.
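The statistics the applet offers can be computed for a single sample as follows. This is a minimal sketch using Python's standard `statistics` module; the five data values are made up, and the dictionary keys simply mirror the applet's labels.

```python
import statistics

sample = [12.0, 15.0, 16.0, 18.0, 24.0]  # made-up sample of size N = 5

mean = statistics.fmean(sample)
summary = {
    "Mean": mean,
    "Median": statistics.median(sample),
    "sd": statistics.pstdev(sample),             # N in the denominator
    "Variance": statistics.pvariance(sample),    # N in the denominator
    "Variance(U)": statistics.variance(sample),  # N - 1 in the denominator
    "MAD": statistics.fmean([abs(x - mean) for x in sample]),
    "Range": max(sample) - min(sample),
}

for label, value in summary.items():
    print(f"{label}: {value:.2f}")
```

For this sample the squared deviations from the mean sum to 80, so Variance is 80/5 = 16 while Variance(U) is 80/4 = 20. The only difference between the two is the denominator.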
Tasks: For the following tasks always select Mean (sample mean) as the statistic of interest in the
3rd histogram (and leave the 4th histogram with none).
1. Select the Normal distribution as a parent population.
a. What are the mean and standard deviation of this population?
Mean = 16.00, sd = 5.00
b. Select a sample size n = 5 for the mean as the statistic of interest. Do about 5 animated
samples and then take 10,000 samples at once.
Draw a picture of the distribution of the sample means. Make sure to label both axes.
How does the distribution of the sample mean (3rd histogram) compare with the parent
population (e.g., shape, mean, standard deviation)?
The distribution of the resulting sample mean values follows approximately a normal shape that
is centered around the original population mean value of 16, but the spread of the sample mean
values is smaller than the spread of the values in the original population – that is, the sample
mean values have a smaller standard deviation.
c. Clear the lower three graphs and change the sample size to n = 25. Again, do about 5
animated samples and then take 10,000 samples at once.
Draw a picture of the distribution of the sample means.
Comment on the changes observed on the 3rd histogram here as compared to the 3rd
histogram generated in part 1(b).
The distribution of the resulting sample mean values again follows a normal shape that is
centered around the original population mean value of 16, but the sample means seem to be
more concentrated (less varied) around the population mean of 16.
d. What can you say about the relationship between the standard deviation of the sample
mean and the population standard deviation?
The standard deviation of the sample mean is smaller than the population standard deviation.
e. What can you say about the relationship between the sample size and the standard
deviation of the sample mean?
The standard deviation of the sample mean becomes smaller as the sample size increases.
f. Does the number of samples (replications) influence the shape of the sampling distribution?
(Note: the number of samples is not the sample size.) For example, is the shape of the
sampling distribution when Rep = 10,000 significantly different from the shape of the
sampling distribution when Rep = 100,000?
No, only the sample size n and the shape of the parent population will influence the shape of
the sampling distribution.
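Task 1 can also be reproduced outside the applet. The sketch below assumes the same normal parent population (mean 16, standard deviation 5) and 10,000 replications. The function name `summarize_sample_means` is made up for this example.

```python
import math
import random
import statistics

def summarize_sample_means(n, reps=10_000, mu=16.0, sigma=5.0, seed=0):
    """Return (mean, standard deviation) of `reps` sample means, each from a sample of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
             for _ in range(reps)]
    return statistics.fmean(means), statistics.stdev(means)

for n in (5, 25):
    center, spread = summarize_sample_means(n)
    print(f"n = {n}: mean of sample means = {center:.2f}, "
          f"sd of sample means = {spread:.2f}, "
          f"sigma / sqrt(n) = {5 / math.sqrt(n):.2f}")
```

The empirical standard deviation of the sample means should land close to 5/√n for each sample size, which matches the answers to parts (d) and (e).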
2. Clear the lower three graphs and then select the skewed distribution as a parent population.
a. Select a sample size n = 5 for the mean as the statistic of interest. Do a few animated samples
and then take 10,000 samples at once. Draw a picture of the distribution of the sample means.
b. How does the distribution of the sample mean (3rd histogram) compare to the distribution of the
sample mean in part 1(b) (when the parent population was normal)?
When the parent population was normal, the distribution of the sample mean looked more like a
normal distribution – more symmetric and bell shaped than this histogram of sample means.
c. How does the distribution of the sample mean (3rd histogram) compare with the parent
population (e.g., shape, mean, standard deviation)?
The distribution of the sample mean has a somewhat symmetric shape, with a mean close to the
population mean, and the standard deviation smaller than that of the population.
d. Change the sample size to n = 25. Do a few animated samples and then take 10,000
samples at once. Draw a picture of the distribution of the sample means.
Comment on the changes observed on the 3rd histogram as compared to the 3rd histogram
generated in part 2(a).
The sample means seem to be more concentrated around the value of the population mean and the
shape of the distribution is somewhat normal looking.
e. What should be the value of the standard deviation of the sample mean if the population
standard deviation is 6.22 and the sample size is n = 25? How does the standard deviation in
histogram 3 from part 2(c) compare to this value?
The standard deviation of the sample mean will be equal to $6.22/\sqrt{25} = 1.24$. The standard
deviation from 2(c), 2.81, is larger because that histogram was based on a smaller sample size (n = 5), for which $6.22/\sqrt{5} \approx 2.78$ ($1/\sqrt{n}$ is larger when n is smaller).
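The change in shape seen in Task 2 can be checked numerically. The applet's skewed population is not specified here, so the sketch below assumes an exponential population with mean 1 as a stand-in for a right-skewed parent. The helper names `skewness` and `sample_means` are made up for the example.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def sample_means(n, reps=10_000, seed=2):
    """Means of `reps` samples of size n from an exponential population with mean 1."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

skew_by_n = {n: skewness(sample_means(n)) for n in (5, 25)}
for n, skew in skew_by_n.items():
    print(f"n = {n}: skewness of the sample means = {skew:.2f}")
```

The skewness of the sample means is still positive at n = 5 and is noticeably closer to 0 at n = 25, which is the pattern described in parts (b) through (d).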
3. Clear the lower three graphs, then select the custom distribution as a parent population. The
parent population plot should be empty. To “draw” a distribution, you will need to use the mouse.
Click and drag on different parts of the parent population graph until you have drawn a distribution
that you like.
a. Sketch your custom population.
This will vary by student. Encourage students to create a unique distribution.
b. Select a sample size n = 5 for the mean as the statistic of interest. Do a few animated samples
and then take 10,000 samples at once. How does the distribution of the sample mean
(3rd histogram) compare with the parent population (e.g., shape, mean, standard deviation)?
The distribution of the sample mean has a somewhat symmetric shape, with a mean close to the
population mean, and the standard deviation smaller than that of the population.
c. Change the sample size to n = 25. Do a few animated samples and then take 10,000 samples at
once. Comment on the changes observed on the 3rd histogram here as compared to the 3rd
histogram generated in part 3(b). What can you say about the shape of the distribution of the
sample mean with respect to the sample size n?
The sample means seem more concentrated around the value of the population mean with a
distribution that does look approximately normal. The larger the sample size n, the narrower the
distribution of the sample mean is.
d. What should be the standard deviation of the sample mean for samples of size n = 25 from your
custom population? (Show your calculation.) How does the standard deviation of the values in
histogram 3 from part 3(c) compare to it?
According to the central limit theorem, the standard deviation for the sample mean should be equal
to $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation. In this particular “custom” distribution
$\sigma = 6.26$, thus the standard deviation of the sample mean is $6.26/\sqrt{25} = 1.25$. We can see from the
3rd histogram that the standard deviation of this empirically generated sampling distribution of the
sample mean is 1.26, which is quite close to the expected 1.25.
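The same check can be run for any custom population. In the sketch below, the bar positions and heights are hypothetical (they are not the distribution used in the answer above). The code computes the population standard deviation from the bars, then compares σ/√n with the standard deviation of 10,000 simulated sample means.

```python
import math
import random
import statistics

# Hypothetical custom parent population: bar positions and relative bar heights.
values = [0, 4, 8, 12, 16, 20, 24, 28, 32]
weights = [9, 1, 1, 6, 1, 1, 2, 1, 8]

total = sum(weights)
pop_mean = sum(v * w for v, w in zip(values, weights)) / total
pop_sd = math.sqrt(sum(w * (v - pop_mean) ** 2 for v, w in zip(values, weights)) / total)

n, reps = 25, 10_000
rng = random.Random(3)
# Each replication draws n values with probability proportional to the bar heights.
means = [statistics.fmean(rng.choices(values, weights=weights, k=n)) for _ in range(reps)]

expected_sd = pop_sd / math.sqrt(n)
observed_sd = statistics.stdev(means)
print(f"population mean = {pop_mean:.2f}, population sd = {pop_sd:.2f}")
print(f"expected sd of sample mean = {expected_sd:.2f}, observed = {observed_sd:.2f}")
```

Changing `values` and `weights` corresponds to drawing a different custom distribution in the applet.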
4. Fill in the blanks to summarize your findings in Exercises 1, 2, and 3:
a. If the parent population is a normal distribution with a mean $\mu$ and a standard deviation $\sigma$,
then for any sample size (small or large), the sample mean will have a __normal__ distribution
with a mean of __$\mu$__ and a standard deviation of __$\sigma/\sqrt{n}$__.
b. If the parent population is NOT a normal distribution but has a mean $\mu$ and a standard
deviation $\sigma$, then for a large sample size, the sample mean will have approximately a
__normal__ distribution with a mean of __$\mu$__ and a standard deviation of __$\sigma/\sqrt{n}$__.
The result in 4a is known as the Sampling Distribution of the Sample Mean.
The result in 4b is known as the Central Limit Theorem. You should note that there are several
similarities between them. However, make sure you can see and understand the difference between
the two results.
Fill out the chart below to further summarize your findings regarding the sampling distribution of
the sample mean based on the CLT.
Will the Sampling Distribution of the Sample Mean be approximately Normal?
n = 10, Parent Population Normal: Yes
n = 10, Parent Population NOT Normal: No
n = 50, Parent Population Normal: Yes
n = 70, Parent Population NOT Normal: Yes
Check Your Understanding:
A researcher interested in the environmental impact of contaminants in soil has collected a sample of
100 tree saplings of a certain species. Ten years ago, the average height of all such tree saplings was
60 inches with a standard deviation of 4 inches. Let X denote the height of a tree sapling.
a. The sample mean for the 100 tree saplings was 56. Fill in appropriate notation: __$\bar{x}$__ = 56.
b. Provide the expected value, standard deviation, and approximate distribution of the sample mean
height of tree saplings assuming the values from ten years ago are treated as population
parameters.
Approximately Normal, N(60, 0.4): the expected value is 60 inches and the standard deviation is 0.4 inches.
Note that 0.4 comes from $4/\sqrt{100}$.
c. Draw a detailed sketch of the sampling distribution of the sample mean height of tree saplings.
Make sure to include your labels.
This will be approximately normal with the x-axis labeled x(bar) or ‘sample mean values’. The
distribution should be centered at a mean of 60 with a standard deviation of 0.4.
[Figure: sketch of the N(60, 0.4) density curve; vertical axis: density.]
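To label the horizontal axis of the sketch in part (c), it can help to mark the mean and one, two, and three standard deviations on either side of it. The short calculation below uses the values given in the problem; the variable names are made up for the example.

```python
import math

mu, sigma, n = 60, 4, 100

sd_mean = sigma / math.sqrt(n)  # 4 / 10 = 0.4
# Axis labels at the mean and at 1, 2, and 3 standard deviations on either side.
ticks = [round(mu + k * sd_mean, 1) for k in range(-3, 4)]
print(ticks)  # [58.8, 59.2, 59.6, 60.0, 60.4, 60.8, 61.2]
```

Nearly all of the area under the curve falls between the first and last of these labels.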
Example Exam Question on Sampling Distribution of the Sample Mean
For a particular community it is known that the mean amount of water used per home during October is
1250 gallons and the standard deviation is 325 gallons.
a. The distribution for amount of water used is skewed to the right. Sketch a skewed right distribution
below and label both axes.
[Sketch: a right-skewed curve; vertical axis: density; horizontal axis: Amount of water (gallons).]
b. For a promotional campaign a radio station plans to randomly select 50 homes and pay their water
bills for the month of October. Describe the approximate sampling distribution of the sample
mean amount of water used for a random sample of 50 homes.
Provide all features of the distribution.
The sample mean will have approximately a NORMAL distribution with a mean of 1250 gallons and a
standard deviation of $325/\sqrt{50} \approx 45.962$ gallons.
c. The radio station can afford to pay for a total of 67,000 gallons. What is the probability that the
total number of gallons for a random sample of 50 homes will exceed 67,000 gallons? (Hint: think
about how a total and an average are related.)
$$P(\text{TOTAL} > 67000) = P\left(\text{MEAN} > \frac{67000}{50}\right) = P(\bar{X} > 1340)$$
$$= P\left(Z > \frac{1340 - 1250}{45.962}\right) = P(Z > 1.96) = 0.025$$
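The calculation in part (c) can be checked with Python's standard library. This is a sketch using `statistics.NormalDist` for the standard normal distribution; the variable names are made up for the example, and the normal approximation is the one described in part (b).

```python
import math
from statistics import NormalDist

mu, sigma, n = 1250, 325, 50
budget_total = 67_000

sd_mean = sigma / math.sqrt(n)       # about 45.962 gallons
threshold_mean = budget_total / n    # a total of 67,000 means an average of 1340 per home
z = (threshold_mean - mu) / sd_mean  # about 1.96
prob = 1 - NormalDist().cdf(z)       # upper-tail area, about 0.025

print(f"sd of sample mean = {sd_mean:.3f}, z = {z:.2f}, P(total > 67000) = {prob:.3f}")
```

Converting the total to a mean first is what allows the sampling distribution of the sample mean to be used.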