Chapter 4 & 8 Sampling Spring 2016

Chapter 8 Sampling Page 1 of 14
Chapter 4 & 8
Sampling
Sample bias
Margin of error
Spring 2016 12/29/15
Learning Objectives: samples, surveys, polls
population N
sample n
sample proportion $\hat{p}$
population proportion p
one sample survey
two sample survey
sample bias
margin of error E
95% confidence range
Chapter 8.1 Sampling, polling, and surveying are terms used to describe estimating
what a large group of objects (usually people) will do based on what a small group of
similar objects do.
Sample = n = a small group of objects intended to represent a much larger group.
Population = N = large group of objects.
ONE SAMPLE Survey
How many SUNY-Oswego students have a personal computer?
There are 10,000 people at SUNY- Oswego. N = 10,000
There are 1,000 students in the Math Dept.
n = 1,000
Select the 1,000 Math students as a sample and ask if they have a personal computer.
216 said yes
784 said no
1,000 sample
The proportion of people who answered “YES” = 216/1000 = .216 = 21.6%
21.6% is called the sample proportion, written as $\hat{p}$ (“p-hat”).
Based on the 1,000 student sample, we will temporarily “assume” that 21.6% of all
10,000 people at SUNY- Oswego have a computer. Later, we will change that
assumption with some statistics, but let it go for now.
Assume 21.6% (10,000) = .216 (10,000) = 2,160 of all SUNY- Oswego people have a
computer. 21.6% is now also called the population proportion, called p
$\hat{p}$ = p (for now)
sample proportion = population proportion for now
ONE SAMPLE Survey
How many of the 100,000 light bulbs made in your department each month are defective?
N = population = 100,000 each month
Randomly select 800 bulbs as a sample and test them. 17 are defective.
n = sample = 800 randomly selected
17 are defective
783 are OK
800 sample
$\hat{p}$ = sample proportion = 17/800 = .02125 = 2.1% are defective
Assume $\hat{p}$ = p for now.
Estimate the number of bulbs that are defective in a month:
defective = $\hat{p}$(N) = p(N) = 2.1% of (100,000)
= .021 (100,000) = 2,100
(Using the unrounded .02125 instead of 2.1% gives 2,125.)
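The one-sample estimate used in the two examples above can be sketched in a few lines of Python. The function name `estimate_count` is made up for illustration, and the code simply assumes that the sample proportion equals the population proportion, as the notes do "for now".

```python
def estimate_count(hits, n, N):
    """Estimate how many of N population members share a trait,
    assuming the sample proportion p-hat equals the population proportion p."""
    p_hat = hits / n      # sample proportion
    return p_hat * N      # projected count in the whole population

# Light bulbs: 17 defective out of 800 sampled, 100,000 made per month.
bulbs = estimate_count(17, 800, 100_000)

# Personal computers: 216 "yes" out of 1,000 sampled, 10,000 people on campus.
computers = estimate_count(216, 1_000, 10_000)

print(round(bulbs), round(computers))
```

Because the code keeps the unrounded proportion .02125, the bulb estimate comes out as 2,125 rather than the figure obtained by rounding to 2.1% first.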
ONE SAMPLE Survey
You were hired to determine how many voters in USA will vote Republican and how
many will vote Democrat. You selected 485 voters in the City of Oswego to sample.
268 said they will vote Democrat
100 said they will vote Republican
117 did not reply or said it was none of your business
485 original sample n
The sample size has changed from n = 485 to n = 268 + 100 = 368 because 117 sampled voters
provided no useful information and are disqualified.
$\hat{p}$ (Democrat) = Democrat sample proportion = 268/368 = .728260869 = 72.8%
$\hat{p}$ (Republican) = Republican sample proportion = 100/368 = .27173913 = 27.2%
There are 122,500,000 voters in the USA.
Assume for now that $\hat{p}$ = p.
For now, Democrat sample proportion $\hat{p}$ = Democrat population proportion p = 72.8%
72.8% (122,500,000) = 89,180,000 will vote Democrat
For now, Republican sample proportion $\hat{p}$ = Republican population proportion p = 27.2%
27.2% (122,500,000) = 33,320,000 will vote Republican
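The voter example can be sketched in Python as follows. The helper name `proportions_among_respondents` is hypothetical; the point it illustrates is that the sample size n is recomputed from the usable answers only, so the non-responses drop out before any proportion is taken.

```python
def proportions_among_respondents(counts):
    """counts: dict mapping each usable answer to its tally.
    Returns the usable sample size n and the sample proportion of each answer."""
    n = sum(counts.values())
    return n, {answer: tally / n for answer, tally in counts.items()}

sampled = 485                                   # voters originally contacted
answers = {"Democrat": 268, "Republican": 100}  # usable replies only
n, p_hat = proportions_among_respondents(answers)
non_responses = sampled - n                     # voters who gave no usable answer

VOTERS = 122_500_000                            # population N from the notes
# Assumes p-hat = p, as the notes do "for now".
projected = {answer: p * VOTERS for answer, p in p_hat.items()}
```

The projections here use the unrounded proportions, so they differ slightly from the hand calculation, which rounds to 72.8% and 27.2% before multiplying.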
Two sample survey - how many pike (N) are in the lake?
Procedure: - Using a net, capture a sample of all the fish you can in a period of time.
- Count only the pike. There are 200 pike in the net. n1 = 200
- Tag all n1 = 200 pike and return them all to the lake.
- …sometime later, recapture another sample of all the fish you can and count only
the pike, tagged and untagged. n2 = 150
- Count the tagged pike = 21
Summary: n1 = 200 pike (1st capture), all of which were tagged
n2 = 150 pike (2nd capture), of which 21 were tagged in 1st capture
N = all pike in the lake
Assume the proportion of all pike in the lake that were captured and tagged in the 1st sample,
200/N = (captured pike)/(all pike in lake),
is the same as the proportion of tagged pike in the 2nd sample,
21/150 = (captured tagged pike)/(captured pike):
$$\frac{200}{N} = \frac{21}{150}$$
Number of pike in the lake:
$$N = \frac{(n_1)(n_2)}{21} = \frac{200(150)}{21} \approx 1429 \text{ pike in lake}$$
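The two-sample (capture and recapture) estimate can be sketched in Python as below. The function name `capture_recapture` is chosen for illustration; the ratio itself is commonly known as the Lincoln-Petersen estimator. The guard against zero tagged recaptures is an addition not discussed in the notes.

```python
def capture_recapture(n1, n2, tagged_in_second):
    """Estimate population size N from a two-sample survey.

    n1: animals captured and tagged in the 1st sample
    n2: animals captured in the 2nd sample
    tagged_in_second: how many of the 2nd sample already carried a tag
    Solves n1 / N = tagged_in_second / n2 for N.
    """
    if tagged_in_second == 0:
        raise ValueError("no tagged animals recaptured; N cannot be estimated")
    return n1 * n2 / tagged_in_second

pike = capture_recapture(200, 150, 21)
print(round(pike))  # the notes round this to 1429 pike
```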
In all of the examples above, we’ve temporarily assumed (for now) that the sample proportion
$\hat{p}$ equals the population proportion p.
They are nearly equal, but two things prevent them from being exactly equal: sample bias and
statistical error (covered in the next 2 lectures).
Practice problems from page 132
1.) A large jar contains N = 200 gumballs of two different colors: red and green. A sample
of n = 25 gumballs is randomly drawn.
8 are red
17 are green
25 sample
Estimate the number of red gumballs in the jar.
One sample survey - calculate $\hat{p}$
$\hat{p}$ = 8/25 = 0.32
assume $\hat{p}$ = p = .32
# red = p(200) = 64
3.) Madison County population is 34,522. Estimate the number of people with blood type A– based
on a random sample of 253 patients, of which 17 were A–.
One sample survey - calculate $\hat{p}$
$\hat{p}$ = 17/253 = .067193676 (leave this in your calculator)
assume $\hat{p}$ = p = .067193676
# A– = p(34,522) = 2320
8.) A rookery has an unknown quantity (N) of fur seal pups. 4965 were captured and
tagged. Later, 900 were captured. Of these, 218 were found to be tagged previously.
Estimate the total fur seal pup population.
Two sample survey - how many (N) fur seal pups are in rookery?
- capture 1st sample of 4965 (n1) fur seal pups and tag each
- capture 2nd sample of 900 (n2) fur seal pups and count 218 tagged previously
N = (n1)(n2)/218 = 4965(900)/218 = 20,498 pups
10.) Maui has an unknown quantity (N) of dolphins. 26 were captured and tagged. Later,
27 were captured and 12 were found to be tagged previously. Estimate the dolphin
population.
Two sample survey - how many (N) dolphins in Maui?
- capture 1st sample of 26 (n1) dolphins and tag each
- capture 2nd sample of 27 (n2) dolphins and count 12 tagged previously
N = (n1)(n2)/12 = 26(27)/12 = 58.5 ≈ 59 dolphins
Practice problems from page 233
13a.) 6,523 of 12,345 people moved within last 5 years.
N = 12,345 people
6,523 moved
population proportion p = 6,523/12,345 = .528 = 52.8%
13b.) 245 of sample 500 moved within last 5 years.
n = 500 people
245 moved
sample proportion $\hat{p}$ = 245/500 = .49 = 49% Note: p ≠ $\hat{p}$
14a.) 269 of 2,444 people are left handed.
p = 269/2,444 = .110 = 11%
14b.) 8 of sample 50 are left handed.
$\hat{p}$ = 8/50 = .16 = 16% Note: p ≠ $\hat{p}$
We assumed $\hat{p}$ = p: the sample proportion = the population proportion.
This is often not exactly true. Samples are often biased, and bias moves $\hat{p}$ away from p.
p ≠ $\hat{p}$
…here’s one reason why:
Sample Bias - a built-in tendency (whether intentional or not) which excludes a
particular group or characteristic within the population, or includes those
that shouldn’t be included.
Common type of sample bias:
Convenience sampling bias - the selection of individuals dictated by what is easiest or
cheapest to sample.
- How many of 10,000 SUNY-Oswego people have a
personal computer? Use Math students as sample because
they are convenient for us to contact? It’s well known that
Math students buy computers much more often than all
other students. By selecting only Math students for the
sample, we will bias for more computers.
- Selecting only Math students is biased another way. Many
of the 10,000 population don’t need a personal computer.
Infants in the day-care center, landscapers, plumbers,
carpenters, senior citizens, visitors, etc. are included in the
10,000 and must be included in the sample.
Another example (convenience bias)
You were hired to determine how many USA people will
vote Republican and how many Democrat. You selected
485 voters in City of Oswego because you live here and
Oswego voters are convenient to contact. Oswego is,
however, a dominantly Democrat city. By selecting only
Oswego voters, the sample proportion for Democrat
voters will bias the result toward Democrats.
Non-response bias - many individuals do not respond to a survey.
- Of 10 million people selected in our text page 119 to survey voter
preference, only 2.4 million responded, resulting in a low 24%
response rate. A low response rate often means that people aren’t
interested now, but usually interested later (on election day).
How to prevent sample bias:
Random Sampling - the best alternative is to let the laws of chance, randomness,
determine the selection of a sample. This means that any group of
members of the population should have the same chance of being in
the sample as any other group of the same size. The personal
computer sample should have included students and non students,
Math students and other students. The voter sample should have
included voters from other counties, other states.
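A simple random sample can be sketched in Python with the standard library. This is only an illustration: the ID numbers standing in for the 10,000 campus members, the helper name `simple_random_sample`, and the seed value are all assumptions, not part of the original example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every group of size n is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical sampling frame: ID numbers for everyone on campus,
# not just the Math students.
campus = list(range(10_000))
sample = simple_random_sample(campus, 1_000, seed=2016)
```

Every member of the list has the same chance of selection, which is what removes the convenience bias of asking only Math students.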
Quota sampling - a systematic effort to force the sample to be representative of a given
population through the use of quotas. The sample should have so many
women, so many men, so many blacks, so many whites, so many people
living in urban areas, so many people living in rural areas.
Stratified sampling - an alternative to simple random sampling. Divide the sampling
frame into categories, called strata, and then (unlike quota sampling)
randomly choose a sample from these strata. The chosen strata are
then further divided into categories, called substrata, and a random
sample is taken from these substrata. The selected substrata are
further subdivided, a random sample is taken from them, etc. The
process goes on for a predetermined number of steps (usually four
or five). Stratified sampling has generally proved to be a reliable
way to collect national data.
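As a rough Python sketch, here is the single-level version that practice problem 15.b describes: divide the frame into strata and draw the same fraction at random from every stratum. It does not implement the repeated subdivision into substrata. The strata names, their sizes, and the function name `stratified_sample` are made up for illustration.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """strata: dict mapping stratum name -> list of members.
    Draws the same fraction at random from each stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)    # how many to draw from this stratum
        chosen[name] = rng.sample(members, k)
    return chosen

# Hypothetical frame: four class years with made-up sizes.
frame = {
    "freshman": [f"fr{i}" for i in range(400)],
    "sophomore": [f"so{i}" for i in range(300)],
    "junior": [f"ju{i}" for i in range(200)],
    "senior": [f"se{i}" for i in range(100)],
}
picked = stratified_sample(frame, 0.05, seed=8)
sizes = {name: len(members) for name, members in picked.items()}
```

Unlike quota sampling, the members inside each stratum are still chosen by chance rather than by the surveyor.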
Cause and effect in medical community
Sampling sometimes suggests a cause and effect relationship which can’t be proven. If
formal proof is needed, use a clinical study or clinical trial.
Examples of faulty studies/trials:
Hormone replacement therapy studies (text page 125) yielded conflicting
results.
Coffee drinking extends life (text page 126) yielded confounded results.
Alar apple study (text page 127) yielded scary misleading results.
Salk Polio vaccine study (text page 128) yielded confused results.
Methods to prevent faulty studies/trials:
Controlled study - the subjects are divided into two groups: one that gets the
treatment (treatment group) and another that doesn’t (control group). The
control group is there for comparison purposes only; it gives
the experimenters a baseline to see if the treatment group does better or not.
Placebo effect - just the idea that one is getting a treatment can
produce positive results in suggestible people.
Blind study - a study in which neither the members of the treatment group nor
the members of the control group know to which of the two groups they
belong.
Double-blind study - a controlled placebo study in which neither the subjects
nor the scientists conducting the experiment know which subjects are in the
treatment group and which are in the control group.
Practice problems page 133
15.a) Convenience - selecting sample close by, not random. George peeks only at scores
of nearby classmates.
15.b) Stratified - this is good random sampling. Population divided into (4) strata, then
5% sampled randomly from each stratum.
15.c) Simple - all players in sample; selecting random sample from entire population.
15.d) Quota - forced sample to have a specific trait (seniors only)…not random.
17a.) The sample population is registered Cleansburg voters only.
17b.) The sample is 680 registered Cleansburg voters surveyed by phone.
17c.) The sampling method is simple random selection.
18a.) The sampling proportion is 680 randomly chosen out of 8325 registered voters
sampling proportion = n/N = 680/8325 = .08168 = 8.2%
18b.) 306 out of 680 sampled stated they would vote for Smith. The sample statistic
estimating the percentage of the vote going to Smith = 306/680 = .45 = 45%
19.) Smith actually received 42% compared to 45% estimated
sample error = 45% - 42% = 3%
Jones estimated percentage from the survey was = 272/680 = .4 = 40%
Jones actually received 43%
sample error = 43% actual - 40% estimated = 3%
Brown’s estimated percentage from the survey was 102/680 = .15 = 15%
Brown actually received 15%
sample error = 15% - 15% = 0%
20.) The error appears to be due to chance because the sample was selected randomly. Also,
since there was a 100% response rate, non-response bias can be disregarded.
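The sample-error arithmetic in problems 18 and 19 can be checked with a short Python sketch. The variable names are illustrative. The code reports estimated minus actual, so a negative value means the survey underestimated the candidate, while the notes report only the size of the error.

```python
n = 680                                               # voters surveyed
survey = {"Smith": 306, "Jones": 272, "Brown": 102}   # survey tallies
actual = {"Smith": 42, "Jones": 43, "Brown": 15}      # actual vote, in percent

# Estimated percentage for each candidate from the survey.
estimated = {name: 100 * count / n for name, count in survey.items()}

# Signed sample error in percentage points: estimated minus actual.
errors = {name: estimated[name] - actual[name] for name in survey}

for name in survey:
    print(name, round(estimated[name], 1), round(errors[name], 1))
```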
37a.) The target population is people who have a cold and are likely to buy medication to
help.
37b.) The sample frame is college students in San Diego area.
37c.) The sample is 500 students from a warm weather climate. Also, they are young,
likely to overlook a cold, and likely to spend their limited money on other things.
38a.) The study was not a controlled study. There was no control group.
38b.) List four possible causes other than the effectiveness of vitamin X itself that could
have confounded the results of the study.
1.) San Diego students are young, healthy compared to the target population which
includes older adults, young children, people that live in cold wintry weather.
2.) Students were paid to participate.
3.)
4.) There was no control group.
39. List four different problems with the study that indicate poor design.
1.) College students don’t represent the population in age, health, finances.
2.) They were paid to participate.
3.) San Diego isn’t “cold” country compared to Northeast.
4.) The medical response to the vitamin was self-reported, not medically
determined.
40.) List 4 recommendations to improve this study:
1.) Select sample participants randomly from all over the country
2.) Set up a control group who are given identical looking placebos
3.) Have medical staff determine actual improvement like temperature, congestion.
4.) Make it a double-blind study where no one knows who is getting vitamin and
who is getting placebo.
…another reason why $\hat{p}$ ≠ p
Chapter 8.3 – Sample error
We assumed $\hat{p}$ = p, and if the sample wasn’t biased, the two will be close but not exactly
the same. There will be a margin of error even if there were no bias.
How closely $\hat{p}$ represents p (sample proportion vs. population proportion) is
expressed by a margin of error E:
$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
and a 95% confidence interval given by the range: ($\hat{p}$ - E) to ($\hat{p}$ + E)
The 95% confidence interval tells us that we can be 95% sure that p lies within that
range.
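One way to see what "95% sure" means is to simulate many random samples from a population whose true p is known and count how often the range from p-hat minus E to p-hat plus E actually contains p. The Python sketch below does this. The true proportion 0.3, the sample size 500, the number of trials, and the function names are all assumptions chosen for illustration.

```python
import math
import random

def margin_of_error(p_hat, n):
    """E = 2 * sqrt(p_hat * (1 - p_hat) / n), the formula from the notes."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)

def coverage(p, n, trials, seed=0):
    """Fraction of simulated samples whose interval p_hat +/- E contains the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each sampled member answers "yes" with probability p.
        yes = sum(rng.random() < p for _ in range(n))
        p_hat = yes / n
        e = margin_of_error(p_hat, n)
        if p_hat - e <= p <= p_hat + e:
            hits += 1
    return hits / trials

rate = coverage(p=0.3, n=500, trials=1000)
print(rate)  # should come out close to 0.95
```

The exact value printed depends on the seed, but it should land near 0.95, which is where the "95% confidence" label comes from.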
Example: Nielsen Rating Service is paid to estimate how many of the world’s
2,000,000,000 sports fans are watching World Cup Soccer. They randomly
sampled 5,000 households and found 3,615 were watching.
$\hat{p}$ = sample proportion = 3,615/5,000 = .723. In words, 72.3% of the sample
were watching. But p ≠ $\hat{p}$.
The margin of error E is:
$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.723(1-.723)}{5{,}000}} = 2\sqrt{\frac{.200271}{5{,}000}} = 2\sqrt{.0000400542} = 2(.006328838756) = .012657677$$
= .013 = 1.3%
With 95% confidence, p (population proportion) will be somewhere between
($\hat{p}$ - E) to ($\hat{p}$ + E)
(72.3% - 1.3%) to (72.3% + 1.3%)
= 71% to 73.6%
In words, the population proportion p will be in the range 71% to 73.6% of sports fans.
Let’s assume p is at the low end of the range, 71%. With 95% confidence, Nielsen Rating
Service can say that at least 71% of the 2,000,000,000 sports fans are watching World Cup Soccer.
71% (2,000,000,000) = .71(2,000,000,000) = 1,420,000,000 fans
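The Nielsen calculation can be reproduced with a short Python sketch. The function name `confidence_interval` is illustrative, and the code keeps full precision, so its low-end fan count is slightly different from the hand calculation that rounds to 71% first.

```python
import math

def confidence_interval(hits, n):
    """Return p_hat, the margin of error E, and the range (p_hat - E, p_hat + E),
    using E = 2 * sqrt(p_hat * (1 - p_hat) / n) from the notes."""
    p_hat = hits / n
    e = 2 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, e, (p_hat - e, p_hat + e)

# 3,615 of 5,000 sampled households were watching.
p_hat, e, (low, high) = confidence_interval(3_615, 5_000)

# Low-end estimate of how many of the 2,000,000,000 fans are watching.
fans_low = low * 2_000_000_000

print(round(p_hat, 3), round(e, 3), round(low, 3), round(high, 3))
```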
Example: Yankee Stadium Rock Concert
Theater Concert Promotions Inc., was asked about the possibility of producing a rock
concert at Yankee Stadium. They needed to estimate how many from the likely 4,000,000
person market (18-35 year olds) would purchase $175 tickets for this proposed event.
After randomly selecting 600 people in this age range, they acquired the following data:
Summary:
11 yes
564 no
575 sample (yes + no)
25 no response
What was the population N?
N = 4,000,000 person market
What was the sample size n?
n = 575 people (the no responses are excluded)
What is the sample proportion $\hat{p}$ used to estimate p?
$\hat{p}$ = 11/575 = .019 = 1.9%
Calculate the range in which p will occur with a 95% confidence level:
$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.019(1-.019)}{575}} = 2\sqrt{.000032} = 2(.0057) = .011$$
E = .011 = 1.1%
95% confidence range is: 1.9% - 1.1% = .8% to 1.9% + 1.1% = 3%
Assume worst case that p will be the low end (.8%) of the range.
How many will attend the concert? .8% of 4,000,000 = .008(4,000,000) = 32,000
If Yankee Stadium requires $10,000,000 guaranteed revenue to book the concert, will
Theater Concert Promotions Inc., be able to guarantee it?
(32,000 attendance)($175 per ticket) = $5,600,000 … no guarantee is possible if $10,000,000
is needed.
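The whole concert decision can be laid out as a Python sketch. The variable names are illustrative. Because the code does not round p-hat and E along the way, its worst-case attendance comes out somewhat lower than the 32,000 in the hand calculation, but the conclusion is the same.

```python
import math

MARKET = 4_000_000              # population N: likely 18-35 year old market
TICKET_PRICE = 175              # dollars per ticket
REQUIRED_REVENUE = 10_000_000   # guarantee Yankee Stadium asks for

yes, no = 11, 564               # the 25 non-responses are excluded
n = yes + no                    # usable sample size

p_hat = yes / n
e = 2 * math.sqrt(p_hat * (1 - p_hat) / n)   # margin of error E
low = p_hat - e                              # worst case: low end of the range

worst_case_attendance = low * MARKET
worst_case_revenue = worst_case_attendance * TICKET_PRICE
can_guarantee = worst_case_revenue >= REQUIRED_REVENUE

print(round(worst_case_attendance), round(worst_case_revenue), can_guarantee)
```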
Practice problems page 237:
68.) Estimate the margin of error E and a 95% confidence interval for n = 1,000 and
$\hat{p}$ = 0.4 = 40%
$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.4(1-.4)}{1{,}000}} = 2\sqrt{.00024}$$
E = .031 = 3.1%
So, with 95% confidence we can estimate the actual p ranges between:
($\hat{p}$ - E) to ($\hat{p}$ + E)
(40% - 3.1%) to (40% + 3.1%)
36.9% to 43.1%
With 95% confidence, p will be somewhere in the range 36.9% to 43.1%.
82.) Based on sample n = top 400 movies,
$\hat{p}$ = 98% of them involve drugs, etc.
a.) Estimate the population proportion p of all movies that contain drugs, etc.
For all movies, we can say with 95% confidence, p will be somewhere inside the
range:
($\hat{p}$ - E) to ($\hat{p}$ + E)
98% - E to 98% + E
b.) What is E:
$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.98(1-.98)}{400}} = 2\sqrt{.000049}$$
E = .014 = 1.4%
So, with 95% confidence we can estimate p ranges between:
(98% - 1.4%) and (98% + 1.4%)
96.6% and 99.4%
With 95% confidence, we can say p ranges between 96.6% and 99.4% of all movies.
c.) Are the top 400 movies a random sample? No, they’re the top 400. Middle and bottom
are not included in the sample. The sample is not random; it’s biased toward the top.