15-f12-bgunderson-wb-module4 - Open.Michigan

Author: Brenda Gunderson, Ph.D., 2012
License: Unless otherwise noted, this material is made available under the terms of the
Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License:
http://creativecommons.org/licenses/by-nc-sa/3.0/
The University of Michigan Open.Michigan initiative has reviewed this material in accordance
with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it.
The attribution key provides information about how you may share and adapt this material.
Copyright holders of content included in this material should contact
open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of
content.
For more information about how to attribute these materials visit:
http://open.umich.edu/education/about/terms-of-use. Some materials are used with permission
from the copyright holders. You may need to obtain new permission to use those materials for
other uses. This includes all content from:
Mind on Statistics
Utts/Heckard, 4th Edition, Cengage L, 2012
Text Only: ISBN 9781285135984
Bundled version: ISBN 9780538733489
SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary
computer software. Other product names mentioned in this resource are used for identification
purposes only and may be trademarks of their respective companies.
Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy
Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain – Self Dedicated: Works that a copyright holder has
dedicated to the public domain.
Make Your Own Assessment
Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for
copyright.
Public Domain – Ineligible. Works that are ineligible for copyright
protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may
differ.
Content Open.Michigan has used under a Fair Use determination
Fair Use: Use of works that is determined to be Fair consistent with the
U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.
Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and
we DO NOT guarantee that your use of the content is Fair. To use this content you should
conduct your own independent analysis to determine whether or not your use will be Fair.
Module 4: Sampling Distributions and the CLT
Objective: The objective of this module is to give you a hands-on discussion and
understanding of sampling distributions and the Central Limit Theorem (CLT), a
theorem that plays an important role in statistics. The sampling distribution of a
statistic can be obtained mathematically, but we will simulate the sampling
process and will observe the empirical sampling distribution of various statistics.
In this module, you will simulate random samples from a known population
distribution and compute a sample statistic for each of the generated samples.
The generated sample statistics can be examined to learn about properties of the
sampling distribution of the statistic.
Overview: Statistical inference is the process of drawing conclusions about a
population parameter based on data. When a sample is selected from a
population, a summary number can be computed from the observations resulting
in the value of a statistic. A statistic is used to estimate the corresponding value
for a population (that is, a sample statistic estimates a population parameter).
However, a sample chosen at random will not necessarily yield an estimate (or
statistic) that is exactly equal to the corresponding parameter for the population;
the next selected sample of the same size will probably give a different estimate
from the first one. If additional samples of the same size were taken, you would
begin to see how the possible estimates (possible values of the statistic) vary and
how close they tend to be to the parameter value.
With a large number of samples, you can assess whether the value of the statistic
(e.g., sample mean, X̄) will frequently be close to the true value of the population
parameter (e.g., population mean, μ), and if so, how close on average. This can be
seen more easily through some pictures:
[Figure: three panels titled "1 Random Sample", "5 Random Samples", and "20 Random Samples"]
Note: Each X represents one statistic value (one estimate) computed from one sample.
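The repeated-sampling idea described above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is taken to be normal with mean 16 and standard deviation 5 (the applet default used later in this module), the sample size of 10 is arbitrary, and the function name sample_means is hypothetical.

```python
import random
import statistics

def sample_means(num_samples, n, mu=16.0, sigma=5.0, seed=1):
    """Draw num_samples random samples of size n from an assumed N(mu, sigma)
    population and return the mean of each sample (one estimate per sample)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(num_samples)
    ]

# Mirror the three panels: 1, 5, and 20 random samples of the same size.
for num_samples in (1, 5, 20):
    estimates = sample_means(num_samples, n=10)
    print(num_samples, "sample(s):", [round(m, 1) for m in estimates])
```

Each printed number plays the role of one X in the pictures: an estimate of the population mean that changes from sample to sample.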
When data are gathered by random sampling, the statistic will be a random
variable and as such, it will have a probability distribution. The probability
distribution of the sample statistic is called its sampling distribution.
Generally, if we use a statistic to make an inference about a population parameter,
we want its sampling distribution to be centered at the true parameter (a
characteristic which allows us to call that statistic unbiased), and we would like
variability in the estimates to be as small as possible.
Below, we have two estimators that are both unbiased, but Estimator I has less
variability (is more precise). Thus, we would prefer Estimator I to Estimator II.
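To see what "unbiased but less variable" can look like numerically, here is a rough simulation sketch. The source does not say what Estimator I and Estimator II are, so the sample mean and the sample median are used here purely as hypothetical stand-ins; the N(16, 5) population, the sample size of 5, and the function name compare_estimators are also assumptions.

```python
import random
import statistics

def compare_estimators(n=5, reps=5000, mu=16.0, sigma=5.0, seed=2):
    """Simulate the sampling distributions of the sample mean and the sample
    median (stand-ins for two estimators) for an assumed normal population."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # For each estimator: (center of its estimates, spread of its estimates)
    return {
        "mean": (statistics.fmean(means), statistics.stdev(means)),
        "median": (statistics.fmean(medians), statistics.stdev(medians)),
    }

result = compare_estimators()
for name, (center, spread) in result.items():
    print(f"{name:>6}: center = {center:.2f}, spread = {spread:.2f}")
```

For a normal population both stand-in estimators are centered near the true mean, but the estimates from the sample mean are less spread out, which is the sense in which one unbiased estimator can be preferred to another.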
We will next examine the sampling distribution of the sample statistic most
commonly used for measuring the center of a distribution -- the sample mean.
Formula Card:
Activity:
How Do Sample Size and the Distribution of the
Parent Population Affect the Sampling
Distribution of the Sample Mean?
In this activity, you will observe the effects that sample size and the distribution of
the population you are sampling from have on the sampling distribution of the
sample mean. The sampling distribution of the sample mean, X̄, is the
distribution of the sample mean values for all possible samples of the same size
from the same population.
Open the sampling distribution applet from the applet link in the “Links to Applets
for Modules” folder on the Stat 250 CTools site (in the “Lab Info” folder, which is in
the “Resources” folder).
Alternatively, the original applet can be found at:
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
This applet will help you simulate sampling distributions for a variety of statistics
and allows you to vary the sample size and the population from which the samples
are taken.
Read the Instructions. Press Begin to open the applet; you will see the screen
pictured below.
Notice that when the applet begins, a histogram of the normal distribution with
mean 16 and standard deviation 5 is displayed for the default parent distribution.
The Sampling Distribution Applet has several options from which you can choose:
• The 1st histogram, the Parent Population histogram, is the population from
which the sample will be drawn. You can select from Normal, Uniform,
Skewed, or even customize the distribution by selecting Custom and
dragging the mouse over the plot. For now, keep the default N(16, 5)
distribution as the parent population. When you are done with a
particular simulation, you can click the Clear lower 3 button to clear the
remaining histograms and select new settings for your next simulation.
• The 2nd plot, the Sample Data histogram, displays a histogram of the
sampled data. This histogram is initially blank. You can choose to draw
Animated Sample, 5 Samples, 1,000 Samples, or 10,000 Samples from the
parent population.
• The 3rd and 4th histograms show the distribution of statistics computed
from the sampled data. The number of samples (replications) on which
the 3rd and 4th histograms are based is indicated by the label "Reps=,"
which is displayed once the simulation is started. You can also control
which statistic to examine, as well as the sample size, by using the
drop-down menu options to the right of each plot. (Note that the applet uses N
to denote sample size, whereas we generally use n.) The statistic options
include:
  - Mean
  - Median
  - sd = standard deviation (uses N in the denominator)
  - Variance = variance of the sample (uses N in the denominator)
  - Variance (U) = unbiased estimate of variance (uses N-1 in the denominator)
  - MAD = mean absolute value of the deviation from the mean
  - Range
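The difference between the N and N-1 denominators in the statistic options is easy to check by hand on a small data set. The sketch below computes each listed statistic with Python's standard statistics module. It is not the applet's code: the function name applet_statistics and the eight data values are made up for illustration.

```python
import statistics

def applet_statistics(sample):
    """Compute the statistics listed above for one sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    return {
        "Mean": mean,
        "Median": statistics.median(sample),
        "sd": statistics.pstdev(sample),              # N in the denominator
        "Variance": statistics.pvariance(sample),     # N in the denominator
        "Variance (U)": statistics.variance(sample),  # N-1 in the denominator
        "MAD": sum(abs(x - mean) for x in sample) / n,
        "Range": max(sample) - min(sample),
    }

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, not from the applet
for name, value in applet_statistics(sample).items():
    print(f"{name:>12}: {value:.3f}")
```

For these eight values the sum of squared deviations from the mean is 32, so Variance is 32/8 = 4 while Variance (U) is 32/7, about 4.571. The unbiased version is always the larger of the two.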
Select Mean as the statistic in the 3rd histogram and a sample size of 5 (default),
then click on Animated Sample to draw one sample of size n = 5 from the normal
parent population. You will see five observations appear in the 2nd histogram, and
the sample mean of the five numbers will appear in the 3rd histogram as a blue
rectangle. This graphically shows the process of attaining the sample mean from
one sample of size 5. Repeat this several times and you will see how the sampling
distribution of the sample mean starts to form in the 3rd histogram. Once you have
a feel for how this works, you can speed things up by choosing the larger sampling
options – 5, 1,000, or 10,000 samples.
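If the applet is unavailable, the same procedure can be imitated in Python: draw a sample of size n from the N(16, 5) parent population, record its mean, and repeat 10,000 times. This is a minimal sketch of the idea, not the applet's implementation. The function names simulate_means and text_histogram, the bin settings, and the seed are all assumptions.

```python
import random
import statistics

def simulate_means(n, reps=10_000, mu=16.0, sigma=5.0, seed=3):
    """Return reps sample means, each from a sample of size n drawn from N(mu, sigma)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

def text_histogram(values, low=0.0, high=32.0, bins=16, width=50):
    """Print a crude sideways histogram of values over the interval [low, high)."""
    counts = [0] * bins
    step = (high - low) / bins
    for v in values:
        if low <= v < high:
            counts[min(bins - 1, int((v - low) / step))] += 1
    peak = max(counts) or 1  # avoid dividing by zero if nothing was counted
    for i, c in enumerate(counts):
        print(f"{low + i * step:5.1f} | " + "#" * round(width * c / peak))

for n in (5, 25):
    means = simulate_means(n)
    print(f"n = {n}: mean = {statistics.fmean(means):.2f}, "
          f"sd = {statistics.pstdev(means):.2f}")
    text_histogram(means)
```

The two printed histograms correspond to the 3rd and 4th histograms in the applet when Mean is selected with n = 5 and n = 25.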
1. Select the Normal distribution as a parent population.
a. What are the mean and standard deviation of this population?
b. Select Mean (sample mean) as the statistic of interest in both the
3rd and 4th histograms, sample size n = 5 for the 3rd histogram, and
n = 25 for the 4th. Do about 5 animated samples, and then take
10,000 samples at once. Draw rough sketches of each of the
distributions of the sample means. Make sure to label both axes.
n = 5:
n = 25:
How do the distributions of each sample mean in the 3rd and 4th
histograms compare with the parent population in the 1st
histogram? Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:
c. Looking at the properties of the population and sample
distributions (displayed to the left of their respective histograms),
what can you say about the relationship between the standard
deviation of the sample mean and the population standard
deviation?
d. What can you say about the relationship between the sample size
and the standard deviation of the sample mean?
e. Does the number of replications influence the shape of the
sampling distribution? That is, as you take more samples, does the
shape of the sampling distribution change significantly?
2. Clear the lower three graphs and then select the Skewed distribution as a
parent population.
a. Select Mean (sample mean) as the statistic of interest in both the 3rd and
4th histograms, sample size n = 5 for the 3rd histogram, and n = 25 for the
4th. Do about 5 animated samples, and then take 10,000 samples at once.
Draw rough sketches of each of the distributions of the sample means.
Make sure to label both axes.
n = 5:
n = 25:
How do the distributions of each sample mean in the 3rd and 4th
histograms compare with the parent population in the 1st histogram?
Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:
How do the distributions of each sample mean in the 3rd and 4th
histograms compare to each other? Comment on shape, mean, standard
deviation, etc.
n = 5:
n = 25:
How do the distributions of each sample mean in the 3rd and 4th
histograms compare with the distributions of the sample mean obtained
when the parent population was normal (in question 1)? Comment on shape,
mean, standard deviation, etc.
n = 5:
n = 25:
b. What should be the value of the standard deviation of the sample mean if
the population standard deviation is 6.22 and the sample size is n = 25?
(Show the calculation.) How does this value compare to the standard
deviation displayed to the left of the 4th histogram created above?
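One way to double-check what you see for a skewed parent population is to simulate it. The sketch below uses an exponential population with mean 8 as a stand-in; this is an assumption, and it is not the applet's Skewed distribution (an exponential population's standard deviation equals its mean). The helper names skewness and skewed_sample_means are hypothetical.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric distribution, positive for right skew."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def skewed_sample_means(n, reps=10_000, pop_mean=8.0, seed=4):
    """Sample means from an assumed exponential population with mean pop_mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1 / pop_mean) for _ in range(n))
        for _ in range(reps)
    ]

for n in (5, 25):
    means = skewed_sample_means(n)
    # For an exponential population sigma equals the mean, so sigma / sqrt(n) = 8 / sqrt(n).
    print(f"n = {n}: mean = {statistics.fmean(means):.2f}, "
          f"sd = {statistics.pstdev(means):.2f} "
          f"(sigma/sqrt(n) = {8.0 / n ** 0.5:.2f}), "
          f"skewness = {skewness(means):.2f}")
```

Comparing the simulated standard deviation with sigma/sqrt(n), and the skewness at n = 5 with the skewness at n = 25, follows the same comparison the question asks you to make with the applet's numbers.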
3. Clear the lower three graphs and then select the Custom distribution as a
parent population. The parent population plot should be empty. To create a
distribution, you will need to use the mouse to click and drag on different parts
of the parent population graph until you have drawn a distribution that you
like.
a. Provide a rough sketch of your custom population. Be sure to note the mean
and standard deviation.
b. Select Mean (sample mean) as the statistic of interest in both the 3rd and
4th histograms, sample size n = 5 for the 3rd histogram, and n = 25 for the
4th. Do a few animated samples, and then take 10,000 samples at once.
Draw rough sketches of each of the distributions of the sample means.
Make sure to label both axes.
n = 5:
n = 25:
How do the distributions of each sample mean in the 3rd and 4th
histograms compare with the parent population in the 1st histogram?
Comment on shape, mean, standard deviation, etc.
n = 5:
n = 25:
How do the distributions of each sample mean in the 3rd and 4th
histograms compare to each other? Comment on shape, mean, standard
deviation, etc.
n = 5:
n = 25:
c. Considering the changes observed from n = 5 to n = 25 in questions 2 and
3, what can you say about the shape of the distribution of the sample
mean with respect to the sample size n?
d. What should be the standard deviation of the sample mean for samples of
size n = 25 from your custom population? (Show the calculation.) How
does this value compare to the standard deviation displayed to the left of
the 4th histogram created above?
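A custom population can also be mimicked in Python by listing possible values with relative heights, much like the bars you drag in the applet. The values and weights below are invented for illustration (a two-humped shape), and the function name custom_sample_means is hypothetical. The sketch computes the population mean and standard deviation, then compares sigma/sqrt(n) with the standard deviation of 10,000 simulated sample means.

```python
import math
import random
import statistics

# Hypothetical custom population: possible values and their relative bar heights.
values = [0, 4, 8, 12, 16, 20, 24, 28, 32]
weights = [6, 1, 1, 0, 0, 0, 1, 2, 5]

total = sum(weights)
pop_mean = sum(v * w for v, w in zip(values, weights)) / total
pop_sd = math.sqrt(
    sum(w * (v - pop_mean) ** 2 for v, w in zip(values, weights)) / total
)

def custom_sample_means(n, reps=10_000, seed=5):
    """Sample means for samples of size n drawn (with replacement) from the custom population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(values, weights=weights, k=n))
        for _ in range(reps)
    ]

n = 25
means = custom_sample_means(n)
print(f"population: mean = {pop_mean:.2f}, sd = {pop_sd:.2f}")
print(f"sigma / sqrt(n)       = {pop_sd / math.sqrt(n):.2f}")
print(f"simulated sd of means = {statistics.pstdev(means):.2f}")
```

Replace values and weights with numbers read off your own custom population to check your hand calculation against a simulation.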
4. Fill in the blanks to summarize your findings in Questions 1, 2, and 3:
a. If the parent population is a normal distribution with a mean μ and a
standard deviation σ, then for any sample size, the sample mean will have
a _____________ distribution with a mean of _______ and a standard
deviation of ___________.
b. If the parent population is NOT a normal distribution, but has a mean μ
and a standard deviation σ, then for a large sample size, the sample mean
will have approximately a _____________ distribution with a mean of
_______ and a standard deviation of _________.
The result in 4(a) is known as the Sampling Distribution of the Sample Mean.
The result in 4(b) is known as the Central Limit Theorem. While you should
note that there are several similarities between them, make sure you can see
and understand the difference between the two results.

Fill out the chart below to further summarize your findings regarding the
sampling distribution of the sample mean based on the CLT.
Sample Settings                              Will the sampling distribution of the
                                             sample mean be approximately normal?
n = 10, Parent Population Normal             __________
n = 10, Parent Population NOT Normal         __________
n = 50, Parent Population Normal             __________
n = 70, Parent Population NOT Normal         __________
Check Your Understanding:
A researcher interested in the environmental impact of contaminants in soil has
collected a sample of 100 tree saplings of a certain species. Ten years ago, the
average height of all such tree saplings was 60 inches with a standard deviation of
4 inches. Let X denote the height of a tree sapling.
a. The sample mean for the 100 tree saplings was 56. Fill in appropriate
notation: ____ = 56
b. Provide the expected value, standard deviation, and approximate distribution
of the sample mean height of tree saplings assuming the values from ten years
ago are treated as population parameters.
c. Draw a detailed sketch of the sampling distribution of the sample mean height
of tree saplings. Make sure to include your labels.
Example Exam Question on
Sampling Distribution of the Sample Mean
For a particular community it is known that the mean amount of water used per home
during October is 1250 gallons and the standard deviation is 325 gallons.
a. The distribution for amount of water used is skewed to the right. Sketch a
skewed right distribution below and label both axes.
b. For a promotional campaign, a radio station plans to randomly select 50
homes and pay their water bills for the month of October. Describe the
approximate sampling distribution of the sample mean amount of water
used for a random sample of 50 homes. Provide all features of the
distribution.
c. The radio station can afford to pay for a total of 67,000 gallons. What is the
probability that the total number of gallons for a random sample of 50 homes
will exceed 67,000 gallons? (Hint: Think about how a total and an average are
related.)
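The hint can be written as a one-line identity. Here n is the number of homes sampled and T is a generic threshold for the total (both symbols are introduced only for this illustration). Dividing both sides of the inequality by n turns a statement about the total into an equivalent statement about the sample mean:

```latex
\[
  P\!\left(\sum_{i=1}^{n} X_i > T\right)
  = P\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i > \frac{T}{n}\right)
  = P\!\left(\bar{X} > \frac{T}{n}\right)
\]
```

Once the question is restated in terms of the sample mean, the sampling distribution described in part (b) can be used to find the probability.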