SAMPLING: Process of Selecting your Observations

(Masoud Hemmasi, Ph.D.)
• QUESTION:
During a presidential election campaign, of the potentially
100 million voters, how many would you say are contacted
in a typical poll?
• History and Evolution of Political Polling
Types of Probability Sampling:

Simple (Unrestricted) Random Sampling

Complex (Restricted) Probability Sampling:
Sometimes offers more efficient alternatives to Simple
Random Sampling
a. Systematic Sampling
b. Stratified Random Sampling
c. Cluster Sampling
d. Double Sampling

Nonprobability Sampling (for contrast):
Convenience Sampling

Simple Random (or Unrestricted) Sampling
A sampling procedure in which every element in the population
has a known and equal chance of being selected as a subject
(e.g., drawing names out of a hat).
Advantage: has the least bias and offers the most generalizability.
Disadvantage: At times, can be inefficient/expensive.
Systematic Sampling
If a sample size of n is desired from a population containing N
elements, we might sample one element for every N/n elements
in the population.
First, we randomly select one of the first N/n elements from
the population list.
We then select every (N/n)th element that follows in the
population list.
This method has the properties of a simple random sample,
especially if the list of the population elements is in random
order.
Advantage: The sample usually will be easier to identify than it
would be if simple random sampling were used.
Example: Selecting every 100th listing in a telephone book
after the first randomly selected listing
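The systematic procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `systematic_sample`, the `seed` parameter, and the `listings` data are hypothetical, and the interval is taken as the integer part of N/n.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick every k-th element after a random start, where k = N // n."""
    N = len(population)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    k = N // n                       # sampling interval (integer part of N/n)
    rng = random.Random(seed)
    start = rng.randrange(k)         # random position among the first k elements
    return [population[start + i * k] for i in range(n)]

# Hypothetical list of 1,000 telephone-book listings; a sample of 10 gives k = 100.
listings = [f"listing-{i}" for i in range(1, 1001)]
sample = systematic_sample(listings, n=10, seed=1)
print(sample)
```

With N = 1,000 and n = 10, the selected listings are exactly 100 positions apart, matching the telephone-book example.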
Stratified Random Sampling
The population is first divided into groups called strata with
respect to salient/relevant characteristics (e.g., gender, age, race,
department, location, industry, etc.)
Each element in the population belongs to one and only one
stratum.
Best results are obtained when the elements within each stratum
are as much alike as possible (i.e. a homogeneous group).
A simple random sample is taken from each stratum.
Advantage: If strata are homogeneous, this method is as “precise”
as simple random sampling but with a smaller total sample size.
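A minimal Python sketch of stratified random sampling follows. The helper name `stratified_sample`, the equal allocation of `per_stratum` elements to every stratum, and the `employees` data are assumptions made for illustration; the slides do not specify how the sample is allocated across strata.

```python
import random
from collections import defaultdict

def stratified_sample(elements, stratum_of, per_stratum, seed=None):
    """Take a simple random sample of `per_stratum` elements from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in elements:
        # Each element belongs to one and only one stratum.
        strata[stratum_of(element)].append(element)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample

# Hypothetical population: 30 employees stratified by department.
employees = [{"id": i, "dept": "sales" if i % 3 else "IT"} for i in range(30)]
picked = stratified_sample(employees, lambda e: e["dept"], per_stratum=4, seed=0)
print(picked)
```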
Cluster Sampling
The population is first divided into separate groups called clusters.
Ideally, each cluster would be a small-scale version (representative)
of the population.
A simple random sample of the clusters is then taken.
All elements within each selected cluster will make up the final
sample.
Example: A primary application is area sampling, where clusters
are city blocks or other well-defined areas (neighborhoods,
precincts, school districts, etc.).
Advantage: The close proximity of elements can be cost and time
effective (i.e. many sample observations can be obtained in
a short time).
Disadvantage: This method generally requires a larger total
sample size than simple or stratified random sampling.
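Cluster sampling can be sketched the same way: draw a simple random sample of clusters, then keep every element inside the chosen clusters. The function name `cluster_sample` and the city-block data below are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and return all of their elements."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # random sample of cluster names
    elements = [e for name in chosen for e in clusters[name]]
    return chosen, elements

# Hypothetical area sample: 8 city blocks with 5 households each.
blocks = {f"block-{b}": [f"house-{b}-{h}" for h in range(5)] for b in range(8)}
chosen, sample = cluster_sample(blocks, n_clusters=3, seed=2)
print(chosen)
print(len(sample))
```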
Convenience Sampling
It is a nonprobability sampling technique.
Items are included in the sample without known
probabilities of being selected.
The sample is identified primarily by convenience.
Example: A professor conducting research might use
student volunteers to constitute a sample.
Advantage: Sample selection and data collection are
relatively easy.
Disadvantage: It is impossible to determine how
representative of the population the sample is.
Sample Size Determination
Standard Deviation—What does it measure?
• Variations/differences in scores among members of a group with
respect to a given characteristic (e.g., test scores for a class, income).
• Standard deviation represents the average distance of a group of
numbers from their mean.
How do we calculate it?
Hint: You can think of it as the average deviation from the
norm/typical.
For a Population: \( \sigma_x = \sqrt{\sum (X - \mu_x)^2 / N} \)
For a Sample: \( S_x = \sqrt{\sum (X - \bar{X})^2 / (n - 1)} \)
Income levels for a particular class like this:
X = Incomes of students in an MBA class

Group                 X          X - Average    (X - Average)^2
Grad Assistants       $6,000       -24,000        576,000,000
Grad Assistants       $6,000       -24,000        576,000,000
Part-Time Employed    $15,000      -15,000        225,000,000
Part-Time Employed    $16,000      -14,000        196,000,000
Part-Time Employed    $39,000        9,000         81,000,000
Part-Time Employed    $38,000        8,000         64,000,000
Part-Time Employed    $50,000       20,000        400,000,000
Part-Time Employed    $70,000       40,000      1,600,000,000
Sum (Σ)               $240,000           0      3,718,000,000

Average = ΣX / N = $240,000 / 8 = $30,000
Variance = \( \sigma_x^2 \) = 3,718,000,000 / 8 = 464,750,000
Std. Dev. = \( \sigma_x = \sqrt{464{,}750{,}000} \) = $21,558.06
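The arithmetic in this worked example can be checked with a short Python sketch. It divides by N = 8, as the slide does (the population formula); the final line, which divides by N - 1, is added only for comparison and is not part of the slide's calculation.

```python
import math

# Incomes of the eight MBA students from the example.
incomes = [6000, 6000, 15000, 16000, 39000, 38000, 50000, 70000]

N = len(incomes)
mean = sum(incomes) / N
squared_devs = [(x - mean) ** 2 for x in incomes]

variance = sum(squared_devs) / N           # population variance (divide by N)
std_dev = math.sqrt(variance)              # population standard deviation

# For comparison only: the sample formula divides by N - 1 instead.
sample_std_dev = math.sqrt(sum(squared_devs) / (N - 1))

print(mean, variance, round(std_dev, 2), round(sample_std_dev, 2))
```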
Suppose the frequency distribution of the life of light bulbs is normal.
X = life of light bulbs (e.g., 3 bulbs lasted 108 hrs each).
What can we say about the expected life of a randomly selected bulb (Xi)?
[Figure: normal frequency distribution of bulb life. Horizontal axis: X = hours, from 85 to 115; vertical axis: frequency. \( \mu_x = 100 \) hrs, \( \sigma_x = 5 \) hrs.]
Life of a randomly drawn light bulb: \( 100 - 5Z \le X \le 100 + 5Z \)
Formula: \( X = \mu_x \pm Z\,\sigma_x \)
Z = 1 for 68% confidence,
Z = 1.96 for 95% confidence,
Z = 3 for 99% confidence (a rounded value: Z = 3 covers about 99.7%, and the exact 99% value is about 2.58)
(Where Z is an index that reflects the level of
confidence/certainty with which we wish to estimate X.)
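As a quick illustration, the interval for the light-bulb example can be computed directly from the formula. The function name `interval` is hypothetical; the values of the mean (100 hrs) and standard deviation (5 hrs) come from the example.

```python
def interval(mu, sigma, z):
    """Return (low, high) = mu -/+ z * sigma for a normally distributed X."""
    return mu - z * sigma, mu + z * sigma

# Light-bulb example: mean life 100 hrs, standard deviation 5 hrs.
for z, label in [(1, "68%"), (1.96, "95%"), (3, "about 99%")]:
    low, high = interval(100, 5, z)
    print(f"{label}: {low:.1f} to {high:.1f} hrs")
```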
Income Distribution for a hypothetical population
$1, $0, $3, $6, $2, $4, $7, $5, $9, $8
True Population Mean = \( \mu_x = \sum X_i / N = 45 / 10 \) = $4.5
Population Standard Deviation: \( \sigma_x = \sqrt{\sum (X_i - \mu_x)^2 / N} = 2.87 \)
Income of a randomly drawn person (Xi) = ?
This formula, \( X = \mu_x \pm Z\,\sigma_x \), is ONLY applicable
when the population distribution is NORMAL.
What is the distribution of our hypothetical population?
[Figure: Distribution of the Hypothetical Population. Horizontal axis: income X from $0 to $9; vertical axis: frequency. Each income value occurs exactly once, i.e., a Uniform Distribution.]
\( X = \mu_x \pm Z\,\sigma_x \)
NOTE that X is the \( \bar{X} \) of a sample of size n = 1.
What is the generic formula for the mean (\( \bar{X} \)) of samples of any
size (any n)?
• That is, what if instead of a single observation/case (X), we draw
a random sample of a particular size from the population? Can
we say something about the mean of that sample, \( \bar{X} \)?
• If (and only if) we know that our sample mean (\( \bar{X} \))
comes from a normally distributed population, the
same formula can be modified and applied.
Rather than \( X = \mu_x \pm Z\,\sigma_x \), use
\( \bar{X} = \mu_{\bar{x}} \pm Z\,\sigma_{\bar{x}} \), where \( \sigma_{\bar{x}} \) is the Std. Error.
But, what does this statement mean?
Sampling Distribution = Frequency distribution of sample means
Sampling Distribution for Samples of Size n = 2 (from our earlier population)

Sample #    SAMPLE      MEAN (\( \bar{X} \))
1           $0 & $1     0.5
2           $0 & $2     1.0
3           $0 & $3     1.5
...
10          $1 & $2     1.5
11          $1 & $3     2.0
12          $1 & $4     2.5
...
18          $2 & $3     2.5
19          $2 & $4     3.0
...
43          $7 & $8     7.5
44          $7 & $9     8.0
45          $8 & $9     8.5

45 possible samples of size n = 2,
thus 45 possible sample means.
The distribution of these 45 sample
means is called the Sampling
Distribution! See next slide.
\( \sigma_{\bar{x}} \) = Standard Error is the standard dev. of these \( \bar{X} \)s.
Mean of all the 45 sample means =
\( \mu_{\bar{x}} = \mu_x = 4.5 \) (i.e., the same as the mean
of the original population).
So, the earlier statement means: if these
sample means are normally distributed,
we can use the related formula.
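The 45 samples and their means can be enumerated with a few lines of Python. This is a sketch that assumes, as the table does, that each sample is an unordered pair of distinct incomes; the variable names are illustrative.

```python
from itertools import combinations
from statistics import pstdev

population = list(range(10))                  # incomes $0 through $9

# Every unordered pair of distinct incomes is one possible sample of size 2.
samples = list(combinations(population, 2))
sample_means = [sum(s) / 2 for s in samples]

mean_of_means = sum(sample_means) / len(sample_means)
std_error = pstdev(sample_means)              # standard deviation of the 45 means

print(len(samples))       # 45
print(mean_of_means)      # 4.5
print(round(std_error, 3))
```

The mean of the 45 sample means equals the population mean of 4.5, as stated above.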
[Figure: Sampling Distribution of Samples of Size n = 2. Frequency (#) of each sample mean \( \bar{X} \), centered on \( \mu_{\bar{x}} \). For example, ($0+$1)/2 = $0.50 occurs once, while $1.50 occurs twice: ($0+$3)/2 and ($1+$2)/2.]
So, if we know that the distribution of our sample means (i.e., the
Sampling Distribution) is NORMAL:
[Figure: normal frequency curve of sample means \( \bar{X} \), centered at \( \mu_{\bar{x}} = \mu_x \), with spread \( \sigma_{\bar{x}} \).]
\( \mu_{\bar{x}} = \mu_x \)
\( \sigma_{\bar{x}} \) = Standard Error = \( \sigma_x / \sqrt{n} \)
We will be able to say the following about the mean (\( \bar{X} \)) of
a randomly selected sample:
\( \bar{X} = \mu_{\bar{x}} \pm Z\,\sigma_{\bar{x}} \)
Since \( \mu_{\bar{x}} = \mu_x \), substitute \( \mu_x \) for \( \mu_{\bar{x}} \):
\( \bar{X} = \mu_x \pm Z\,\sigma_{\bar{x}} \)
QUESTION: What is the primary purpose of sampling?
Answer: To use sample characteristics (e.g., \( \bar{X} \)) as estimates
of population characteristics (e.g., \( \mu_x \)).
What is the significance of this formula? \( \bar{X} = \mu_x \pm Z\,\sigma_{\bar{x}} \)
Answer: It shows the relationship between \( \bar{X} \) and \( \mu_x \).
So, if \( \bar{X} \) comes from a normal distribution, we can rewrite
the formula to estimate \( \mu_x \) based on the value of \( \bar{X} \):
\( \bar{X} = \mu_x \pm Z\,\sigma_{\bar{x}} \) becomes \( \mu_x = \bar{X} \pm Z\,\sigma_{\bar{x}} \)
Question: But, is the sampling distribution (i.e., the
distribution of \( \bar{X} \)) always normal (so that we can use the
above formula)?
Let’s see it!
[Figure: Distribution of Sample Means (\( \bar{X} \)s) for Different Population Distributions. Panels labeled (n = 1): think of these as the distribution of the life of all individual light bulbs (X). Remaining panels: think of these as the distribution of the average life of samples of n light bulbs (\( \bar{X} \)).]
Conclusion?
• As n increases, the sampling distribution (i.e., the
distribution of \( \bar{X} \)s) will more and more resemble
a normal distribution, so that, as a rule of thumb, for all n > 30
the sampling distribution can be treated as approximately normal,
regardless of the distribution of the original
population.
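A small simulation can illustrate this conclusion. The sketch below assumes sampling with replacement from the uniform $0 to $9 population used earlier; the seed, the number of draws, and the name `simulate_sample_means` are arbitrary choices made for the illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)                 # fixed seed so the run is repeatable
population = list(range(10))            # uniform population: $0 through $9

def simulate_sample_means(n, draws=10_000):
    """Means of `draws` random samples of size n, drawn with replacement."""
    return [fmean(rng.choices(population, k=n)) for _ in range(draws)]

for n in (1, 2, 30):
    means = simulate_sample_means(n)
    # The center stays near 4.5 while the spread shrinks roughly like sigma / sqrt(n).
    print(n, round(fmean(means), 2), round(pstdev(means), 2))
```

Plotting a histogram of `means` for each n would show the flat shape at n = 1 giving way to a bell shape at n = 30.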
When the variable of interest X is NOT normally distributed, the sampling distribution can be relied on to be approximately normal only when n ≥ 30 is used.
[Figure: Distribution of Xs (not normal), with mean of Xs = \( \mu_x \) and Std. Dev. of Xs = \( \sigma_x \). Samples of size n1 > 30, n2 > 30, n3 > 30, ... yield sample means \( \bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots \). Distribution of \( \bar{x} \)s for all samples of the same size (Sampling Distribution): mean of \( \bar{x} \)s = \( \mu_{\bar{x}} = \mu_x \), Std. Error = \( \sigma_{\bar{x}} \).]
So, for samples of n ≥ 30:
\( \mu_x = \bar{X} \pm Z\,\sigma_{\bar{x}} \)
Standard Error = \( \sigma_{\bar{x}} = \sigma_x / \sqrt{n} \)
SO, \( \mu_x = \bar{X} \pm Z\,\sigma_x / \sqrt{n} \)
Now, let’s examine the elements of this formula!
1) We are interested in estimating \( \mu_x \) from \( \bar{X} \)
2) Estimation involves a margin of error, that is
3) Actual Score = Estimate ± Margin of Error
\( \mu_x = \bar{X} \pm Z\,\sigma_x / \sqrt{n} \)
where \( \mu_x \) is the Actual Score, \( \bar{X} \) is the Estimate, and
\( Z\,\sigma_x / \sqrt{n} \) is the Margin of Error; let's call it “E”.
So, when using random samples of size n > 30, the margin of
error in estimation would be:
\( E = Z\,\sigma_x / \sqrt{n} \)
\( E = Z\,\sigma_x / \sqrt{n} \)
Square both sides of the equation:
\( E^2 = Z^2 \sigma_x^2 / n \)
Rewrite it to solve for n:
\( n = Z^2 \sigma_x^2 / E^2 \)
• \( \sigma_x \) (population Std. Dev.) is often unknown. \( S_x \) (Std. Dev.
of a sample) is a reasonable estimate (substitute) for it.
• \( S_x \) can be estimated based on previous studies or a pilot study.
\( n = Z^2 S_x^2 / E^2 \)
Sample size required for estimating a population mean* (\( \mu_x \)):
\( n = Z^2 S_x^2 / E^2 \)
• n = Sample size required
• E = Margin of error we are willing/able to tolerate
in estimating the population characteristic (mean)
• Z = An index reflecting the degree of confidence/
certainty we wish to have in achieving the level
of precision/accuracy represented by E above.
• S = An estimate of Std. Dev. of the characteristic
being estimated/studied.
* The case of n for estimating a population proportion will be covered later.
An example: \( n = Z^2 S^2 / E^2 \)
Suppose you were to use a random sample to estimate the
average IQ of adult males. Suppose you know, from a
pilot study, that the Std. Dev. of males’ IQ is about 16
points. What size sample should you use if you wish to be
95% sure that your margin of error in estimating average
IQ is no more than 3 points (that is, if you wish to be 95%
sure that the estimate you will obtain from the sample
would be within ±3 points of the actual/true average IQ
of the adult male population)?
Z = ?  Z = 2 (1.96 rounded up)
S = ?  S = 16
E = ?  E = 3
\( n = 2^2 (16)^2 / 3^2 = 113.78 \), round up = 114
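The calculation above can be sketched as a small Python function. The name `sample_size_for_mean` is hypothetical, and the rounding guard before `math.ceil` is an implementation detail added here so that floating-point noise cannot push an exact answer up by one.

```python
import math

def sample_size_for_mean(z, s, e):
    """n = Z^2 * S^2 / E^2, rounded up to the next whole respondent."""
    raw = (z ** 2) * (s ** 2) / (e ** 2)
    # round() first so tiny floating-point error does not inflate the ceiling.
    return math.ceil(round(raw, 9))

print(sample_size_for_mean(z=2, s=16, e=3))      # 114, as in the example
print(sample_size_for_mean(z=1.96, s=16, e=3))   # 110 with the unrounded Z
```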
Assuming a worst-case scenario when S is unknown:
\( n = Z^2 S^2 / E^2 \)
If no information is available on S, you can use a rough,
conservative estimate by setting S = ¼ of the Range.
An Example:
Suppose we were to use a random sample to estimate the average IQ of adult
males. Further suppose that we have absolutely no basis for determining
the Std. Dev. of males’ IQ. But, we know that the IQ of the overwhelming
majority of adult males ranges between 80 and 120. What size sample
should we use if we wish to be 99% sure that our margin of error in
estimating the average IQ is no more than 2 points (that is, if we wish to
be 99% sure that the estimate we will obtain from the sample would be
within ±2 points of the actual/true average IQ of the adult male
population)?
Range = 120 – 80 = 40
S = 40/4 = 10
Z = 3
E = 2
\( n = 3^2 (10)^2 / 2^2 = 225 \)
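The same steps, with S approximated as a quarter of the range, can be sketched as follows. The function name `sample_size_from_range` and its parameter names are hypothetical.

```python
import math

def sample_size_from_range(z, low, high, e):
    """Estimate S as (high - low) / 4, then apply n = Z^2 * S^2 / E^2."""
    s = (high - low) / 4
    raw = (z ** 2) * (s ** 2) / (e ** 2)
    return math.ceil(round(raw, 9))      # guard against floating-point noise

print(sample_size_from_range(z=3, low=80, high=120, e=2))   # 225
```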
Assessing Resulting Accuracy/Precision of the Estimates,
Given a Particular Sample Size:
• Suppose we used a survey with lots of 7-point scale items,
• Collected data from 225 respondents, and
• Descriptive statistics on the data show that the typical Std. Dev. on
most items/variables is in the 1.3 to 1.5 range.
• What can we say about the precision/accuracy of our
results, say, with 95% confidence/certainty?
\( n = Z^2 S^2 / E^2 \)
\( E^2 = Z^2 S^2 / n \)
\( E = Z S / \sqrt{n} \)
\( E = 2 (1.5) / \sqrt{225} = 3/15 = 0.2 \)
We can be 95% certain that the sample mean for a typical variable is
not off from the true population mean by more than two-tenths of a
point (e.g., if the reported sample mean on a given variable is 4.7,
we can be 95% sure that the actual population mean is between 4.5
and 4.9).
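Solving the formula for E instead of n is a one-line computation. The sketch below uses the hypothetical name `margin_of_error` and evaluates both ends of the 1.3 to 1.5 range mentioned in the example.

```python
import math

def margin_of_error(z, s, n):
    """E = Z * S / sqrt(n) for a sample that has already been collected."""
    return z * s / math.sqrt(n)

# 225 respondents, Z = 2 for roughly 95% confidence.
for s in (1.3, 1.5):
    print(s, round(margin_of_error(z=2, s=s, n=225), 3))
```

The larger standard deviation (1.5) gives the 0.2-point margin used in the example; the smaller one gives a slightly tighter margin.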
Sample size determination for estimating Proportions (p):
EXAMPLE: Projecting the percentage of people who would be voting
for a particular candidate in a presidential election.
In such cases, dispersion is measured by pq (instead of variance, \( S^2 \)),
where p = proportion of the population that is expected to have the
attribute under study, and
q = (1 - p), the proportion of the population that is expected
NOT to have that attribute.
So, the sample size formula will change to:
\( n = Z^2 pq / E^2 \)
Or:
\( n = Z^2 p(1-p) / E^2 \)
NOTE:
If we have no basis for judging the expected value of p, we can
assume maximum variability (i.e., err on the side of overestimating
the required sample size) by setting p at p = 0.50 (see the example
on the next slide).
Sample size determination for Estimating Proportions:
\( n = Z^2 p(1-p) / E^2 \)
EXAMPLE:
Suppose you are to project the percentage of potential voters who
would be expected to vote for the Republican candidate in the
upcoming presidential election. Suppose you have no basis for
estimating/guessing what the percentage could possibly be. Also,
suppose that you want to be 99% confident/certain that your margin
of error would be no more than 3% (i.e., 99% certain that your
projection/estimate will be within ±3% of the actual number).
What size sample will you need?
• Z = 3
• p = 0.50
• E = 0.03
\( n = 3^2 (0.5)(0.5) / 0.03^2 \)
\( n = 9 (0.25) / 0.0009 = 2500 \)
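A short Python sketch of the proportion formula follows. The function name `sample_size_for_proportion` is hypothetical, and the rounding guard is added because 0.03 squared is not exactly representable in floating point. The last lines check, over a grid of candidate values, that p = 0.50 gives the largest p(1 - p).

```python
import math

def sample_size_for_proportion(z, p, e):
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole respondent."""
    raw = (z ** 2) * p * (1 - p) / (e ** 2)
    # round() first so floating-point error in e ** 2 cannot add a respondent.
    return math.ceil(round(raw, 9))

print(sample_size_for_proportion(z=3, p=0.5, e=0.03))   # 2500

# p = 0.50 maximizes p * (1 - p) among these candidate proportions.
candidates = [i / 100 for i in range(1, 100)]
worst_case_p = max(candidates, key=lambda p: p * (1 - p))
print(worst_case_p)   # 0.5
```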
QUESTIONS OR COMMENTS
?