4.1 Sampling and Surveys

The Federalist Papers
The Authors
Word Bank
• Anarchy
• Difficulty
• Oblige
• Cede
• Mankind
• Instituted
• Perfection
• Natural
• Governed
Alexander Hamilton (Federalist No. 65)
• [On Government]
• “If mankind were to resolve to
agree in no institution of
government, until every part of
it had been adjusted to the most
exact standards of perfection,
society would soon become a
general scheme of anarchy, and
the world a desert.”
John Jay (Federalist No. 2)
• [On Government] “Whenever
and however it is instituted, the
people must cede to it, some of
their natural rights in order to
vest it with requisite powers.”
James Madison (Federalist No. 51)
• [On Government and Human
Nature]
• “The great difficulty lies in this:
you must first enable the
government to control the
governed; and in the next place,
oblige it to control itself.”
Activity: Sampling from The Federalist Papers
The Federalist Papers are a series of 85 essays supporting the ratification of the U.S.
Constitution. At the time they were published, the identity of the authors was a
secret known to just a few people. Over time, however, the authors were identified
as Alexander Hamilton, James Madison, and John Jay. The authorship of 73 of the
essays is fairly certain, leaving 12 in dispute. However, thanks in some part to
statistical analysis, most scholars now believe that the 12 disputed essays were
written by Madison alone or in collaboration with Hamilton.
There are several ways to use statistics to help determine the authorship of a
disputed text. One example is to estimate the average word length in a disputed
text and compare it to the average word lengths of works where the authorship is
not in dispute.
• Directions: The following passage is the opening paragraph of Federalist Paper #51,
one of the disputed essays. The theme of this essay is the separation of powers
between the three branches of government. Choose 5 words from this passage,
count the number of letters in each of the words you selected and find the average
word length. Share your estimate with the class and create a class dotplot.
• To what expedient, then, shall we finally resort, for
maintaining in practice the necessary partition of power among
the several departments, as laid down in the Constitution? The
only answer that can be given is, that as all these exterior
provisions are found to be inadequate, the defect must be
supplied, by so contriving the interior structure of the
government as that its several constituent parts may, by their
mutual relations, be the means of keeping each other in their
proper places. Without presuming to undertake a full
development of this important idea, I will hazard a few general
observations, which may perhaps place it in a clearer light,
and enable us to form a more correct judgment of the principles
and structure of the government planned by the convention.
Directions: Use a table of
random digits or a random
number generator to select a
simple random sample (SRS) of 5
words from the opening passage
to the Federalist Paper #51. Once
you have chosen the words,
count the number of letters in
each of the words you selected
and find the average word length.
Share your estimate with the
class and create a class dotplot.
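The random-selection directions above can be sketched in Python. This is a minimal sketch, not the class's required method: the function names `passage_words` and `srs_average_length` are made up for illustration, and punctuation is stripped so that only letters are counted.

```python
import random
import string

PASSAGE = """To what expedient, then, shall we finally resort, for
maintaining in practice the necessary partition of power among
the several departments, as laid down in the Constitution? The
only answer that can be given is, that as all these exterior
provisions are found to be inadequate, the defect must be
supplied, by so contriving the interior structure of the
government as that its several constituent parts may, by their
mutual relations, be the means of keeping each other in their
proper places. Without presuming to undertake a full
development of this important idea, I will hazard a few general
observations, which may perhaps place it in a clearer light,
and enable us to form a more correct judgment of the principles
and structure of the government planned by the convention."""


def passage_words(text):
    """Split the passage into words and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.split())
    return [w for w in words if w]


def srs_average_length(words, n=5, seed=None):
    """Choose an SRS of n word positions; return the words and their mean length."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so every set of n
    # positions in the passage is equally likely to be chosen.
    sample = rng.sample(words, n)
    return sample, sum(len(w) for w in sample) / n


words = passage_words(PASSAGE)
sample, avg = srs_average_length(words, n=5, seed=1)
print(sample, avg)
```

Each student can use a different seed (or none) to get a different sample, which is what produces the spread in the class dotplot.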
Trig has a few questions about the activity.
• How does the first dotplot (individual choice)
compare to the second one (random choice)?
Can you think of any reasons why they might be
different?
• Do we see any problems with the first sampling
technique?
• How about this scenario. We want to run a study to
determine the average salary of HS students, so we
go outside into the sophomore hallway and sample
the first 20 students we see. Can you see any
problems with that? What might be a fix to correct
those problems?
• Which sampling technique do you think is more
reliable? Why?
• Think about what would happen if we kept running
simulations. Which technique would center around
the true mean?
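The last question can be explored with a small simulation. This is only a sketch under stated assumptions: the word lengths below are hypothetical (not counted from the essay), and "individual choice" is modeled by assuming longer words are more likely to catch the eye, with a chance of selection proportional to length. Real students may choose differently.

```python
import random
import statistics

# Hypothetical word lengths standing in for a passage.
WORD_LENGTHS = [2, 4, 9, 4, 5, 2, 7, 6, 3, 11, 2, 8, 3, 9,
                9, 2, 5, 5, 3, 7, 11, 2, 4, 4, 2, 3, 12, 3]


def srs_estimate(rng, n=5):
    """Mean length of a simple random sample of n words."""
    return statistics.mean(rng.sample(WORD_LENGTHS, n))


def eye_catching_estimate(rng, n=5):
    """Mean length when longer words are more likely to be picked.

    Assumed model: selection chance proportional to word length.
    For simplicity these draws are made with replacement.
    """
    return statistics.mean(rng.choices(WORD_LENGTHS, weights=WORD_LENGTHS, k=n))


def center(method, reps=2000, seed=0):
    """Average of the estimates over many repeated samples."""
    rng = random.Random(seed)
    return statistics.mean(method(rng) for _ in range(reps))


true_mean = statistics.mean(WORD_LENGTHS)
srs_center = center(srs_estimate)
biased_center = center(eye_catching_estimate)
print(true_mean, srs_center, biased_center)
```

Under these assumptions the SRS estimates center near the true mean, while the length-weighted picks center well above it, which is what a consistent overestimate looks like.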
J.K. Rowling
• A similar statistical technique was used to determine that J.K. Rowling
was actually the author of “The Cuckoo’s Calling,” a crime fiction novel
published in 2013 under the pseudonym Robert Galbraith.
4.1 Sampling and Surveys
Objectives
SWBAT:
1. IDENTIFY the population and sample in a statistical study.
2. IDENTIFY voluntary response samples and convenience samples.
EXPLAIN how these sampling methods can lead to bias.
3. DESCRIBE how to obtain a random sample using slips of paper,
technology, or a table of random digits.
4. DISTINGUISH a simple random sample from a stratified random
sample or cluster sample. Give the advantages and disadvantages
of each sampling method.
5. EXPLAIN how undercoverage, nonresponse, question wording, and
other aspects of a sample survey can lead to bias.
What’s the difference between a population and a sample? What is a
census?
• A population is the entire group of individuals we want information
about.
• A census collects data from every individual in the population.
• Often, it is too time consuming or too costly to reach every
member of the population, so a sample is chosen to represent the
population.
• A sample is a subset of individuals in the population from which we
actually collect data.
What is the icon in the top-right corner of the example on page 210?
• That’s the video icon!!! It means there is a video for this example on
the textbook website. See the link on my teacher page!!!
When sampling, we want to choose a sample that we can trust to
represent the population. There are a number of different methods to
select samples.
What’s the problem with convenience samples?
• A convenience sample is choosing individuals from the population who
are easy to reach.
• For example, if you want to know how long students in a high school spend doing
homework, you might go to the library and ask the first 30 students you see.
• The problem with this is that it is unlikely to represent the study habits of all high
school students. Students sampled in a library are likely to spend more time on
homework than typical students. As a result, we will overestimate the average
homework time of our population.
• Convenience samples are almost guaranteed to show bias.
• Think about this convenience sample. You take a survey at a football game: “Do
you support or oppose cuts to the athletic budget?” How might this sample be
biased?
What is bias?
• The design of a statistical study shows bias if it would consistently
underestimate or consistently overestimate the value you want to know.
• Think of our Federalist Papers activity! Was there any bias?
• On the AP test, if asked to describe how the design of a study leads to
bias, you need to do two things:
• 1) identify a problem with the design
• 2) explain how this problem would lead to an underestimate or overestimate
• Bias also exists in voluntary response samples.
What’s a voluntary response sample? Is this a good method for obtaining a
sample?
• A voluntary response sample consists of people who choose themselves
by responding to a general invitation.
• Think about call-in, text-in, write-in, and Internet polls. Usually the
people in such surveys are not representative of some larger population
of interest, because generally these people feel strongly about an issue,
and often share the same opinion. This leads to bias.
• Think about American Idol. A lot of people watch, but does everyone vote?
Usually your average fan will watch the show but not vote, whereas your super
fans will vote several times. Do the numbers really represent how the population
feels about certain contestants? Probably not.
The Check Your Understanding feature gives you practice to ensure
you understand the concepts covered!
Example: To estimate the proportion of families that oppose budget
cuts to the athletic department, PC principal surveys families as they
enter the football stadium on Friday night. Explain how this plan will
result in bias and how the bias will affect the estimated proportion.
Families attending the football game are likely pro-athletics, so they
are more likely to oppose budget cuts than families in general. The
survey would therefore overestimate the proportion of families that
oppose budget cuts.
Note: This is a convenience sample.
• A sample chosen by chance rules out both favoritism by the sampler and self-selection by respondents.
• Random sampling involves using a chance process to determine which members
of a population are included in the sample.
What’s a simple random sample (SRS)? How can you choose a SRS?
A simple random sample (SRS) of size n is chosen in such a
way that every group of n individuals in the population has an
equal chance to be selected as the sample.
• There are a few ways to choose a SRS.
• One way is with a hat.
• For example, let’s say this class was our population, and we wanted to draw a sample.
We can write everyone’s name on a slip of paper, place the slips into a hat, shake the
hat to shuffle the names, and then randomly start selecting names from the hat.
• Other ways to choose a SRS involve using random numbers generated by a computer
or calculator to choose samples. If you don’t have technology handy, you can use a
table of random digits.
How to Choose an SRS
Choosing an SRS With Technology
Step 1: Label. Give each individual in the population a distinct numerical label from 1 to N.
Step 2: Randomize. Use a random number generator to obtain n different integers from 1 to N.
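The label-and-randomize steps can be sketched in Python. This is a minimal sketch: the roster names and the function name `choose_srs` are hypothetical, and Python's `random.sample` stands in for the calculator's random number generator.

```python
import random


def choose_srs(individuals, n, seed=None):
    """Label individuals 1..N, then pick n different labels at random."""
    rng = random.Random(seed)
    N = len(individuals)
    # Step 1: Label. Position i in the list gets label i + 1.
    # Step 2: Randomize. Draw n different integers from 1 to N.
    labels = rng.sample(range(1, N + 1), n)
    return [(label, individuals[label - 1]) for label in sorted(labels)]


# Hypothetical class roster used as the population.
roster = ["Ana", "Ben", "Cruz", "Dee", "Eli", "Fay", "Gus", "Hana", "Ike", "Jo"]
chosen = choose_srs(roster, n=3, seed=4)
print(chosen)
```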
How to Choose an SRS Using Table D
Step 1: Label. Give each member of the population a numerical label with the same number of
digits. Use as few digits as possible.
Step 2: Randomize. Read consecutive groups of digits of the appropriate length from left to
right across a line in Table D. Ignore any group of digits that wasn’t used as a label or that
duplicates a label already in the sample. Stop when you have chosen n different labels.
Your sample contains the individuals whose labels you find.
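The Table D procedure can be sketched as a short function. This is a sketch under assumptions: the digit string below is just an illustrative run of digits rather than a quotation of any particular line of Table D, labels are assumed to run from 01 (or 001, and so on) up to N, and the name `srs_from_digits` is made up.

```python
def srs_from_digits(digits, population_size, n):
    """Read fixed-width groups of digits left to right to choose n labels."""
    width = len(str(population_size))  # fewest digits that can label everyone
    digits = "".join(ch for ch in digits if ch.isdigit())  # drop spaces
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        # Ignore groups that are not labels and labels already chosen.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
            if len(chosen) == n:
                return chosen
    raise ValueError("ran out of digits before choosing n labels")


# Population of 20 individuals labeled 01 to 20; choose 3.
print(srs_from_digits("19223 95034 05756 28713", population_size=20, n=3))
# Groups read: 19, 22, 39, 50, 34, 05, 75, 62, 87, 13  ->  [19, 5, 13]
```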
What’s the difference between sampling with replacement and
sampling without replacement? How should you account for this
difference when using a table of random digits or other random
number generator?
• Sampling with replacement means the numbers can repeat, or be
selected again.
• Sampling without replacement means the numbers must remain
unique and cannot be selected again.
• Think selecting cards from a standard deck of 52 cards. Let’s say you want to
draw two consecutive cards. With replacement means you will draw the first
card and then put the card back into the deck before you draw the second
card. Without replacement means you will draw the first card and then
discard it before you draw the second.
• To sample without replacement using the table of random digits, skip
any number that has already been selected and keep reading.
• To sample without replacement on the calculator, use option
8:randIntNoRep(
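The card example can be tried in Python as a rough illustration. In the standard `random` module, `choices` draws with replacement and `sample` draws without replacement. The fixed seed is only there to make the run repeatable.

```python
import random

rng = random.Random(0)

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = [f"{rank} of {suit}" for suit in SUITS for rank in RANKS]

# With replacement: the first card goes back, so the same card can come up twice.
two_with = rng.choices(deck, k=2)

# Without replacement: the first card is set aside, so the two cards always differ.
two_without = rng.sample(deck, k=2)

# Drawing 60 times from 52 cards with replacement must repeat at least one card.
sixty_with = rng.choices(deck, k=60)

# Five different labels from 1 to 20, similar in spirit to randIntNoRep.
labels = rng.sample(range(1, 21), 5)

print(two_with, two_without, len(set(sixty_with)), labels)
```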
Example: Mall Hours
The management company of a local mall plans to survey a random sample of 3
stores to determine the hours they would like to stay open during the holiday
season. Use Table D at line 101 to select an SRS of size 3 stores.
First give each member of the population a numerical label with the same number of digits.
Now read the digits from left to right, ignoring digits that weren’t used as labels and ignoring repeat
numbers.
We would survey Star Jewelers, Carter’s for Kids, and Mrs. Fields.
Suppose we want to estimate the yield of our corn field. The field is square
and divided into 16 equally sized plots (4 rows X 4 columns). A river runs along
the eastern edge of the field. We want to take a sample of 4 plots.
Use a random number
generator and pick a SRS of 4
plots. Mark them with an X.
Now randomly choose one plot
from each horizontal row. This
is called a stratified random
sample.
Finally, randomly choose one plot from
each vertical column. This is
also a stratified random
sample.
Which method do you think will work best? Explain.
• Now it’s time for the harvest! The numbers to
the right are the yield for each of the 16 plots.
For each of your three samples, calculate the
average yield.
What is a stratified random sample? How is it different than a simple
random sample?
• The basic idea of sampling is straightforward: take a SRS from the
population and use your sample results to gain information about the
population.
• Sometimes there are statistical advantages to using more complex
sampling methods.
To get a stratified random sample, start by classifying the population
into groups of similar individuals, called strata.
Then choose a separate SRS in each stratum and combine these
SRS’s to form the sample.
When is it beneficial to use a stratified random sample? What is the
benefit? How do you choose a variable to stratify by?
• It is beneficial to use a stratified random sample when subpopulations
within an overall population vary. It is advantageous to sample each
subpopulation independently.
• The benefit is that stratified random samples give more precise
estimates (the estimates will be less variable).
• See the example on the bottom of page 220.
• When choosing which variable to stratify by, pick the variable that is
the best predictor of what you are trying to measure (the variable
that has the strongest association).
• For example, in our harvest example, it was a good idea to stratify by column
because one of the best predictors of corn yield is irrigation. Stratifying by
column guaranteed that our sample included plots at different distances from the
river (the closer plots get more water than the farther ones).
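A simulation can show why the choice of strata matters in the corn field. The yields below are hypothetical, invented so that yield rises toward the river on the eastern edge. They are not the numbers from the class activity. The sketch compares how much the sample average varies under an SRS, one plot per row, and one plot per column.

```python
import random
import statistics

# Hypothetical yields for the 4 x 4 field; the right-hand column is nearest the river.
YIELDS = [
    [4, 29, 94, 150],
    [7, 31, 98, 153],
    [6, 27, 92, 148],
    [5, 32, 97, 147],
]


def srs_mean(rng):
    """Average yield of an SRS of 4 of the 16 plots."""
    plots = [y for row in YIELDS for y in row]
    return statistics.mean(rng.sample(plots, 4))


def by_row_mean(rng):
    """Average yield with one randomly chosen plot from each row."""
    return statistics.mean(rng.choice(row) for row in YIELDS)


def by_column_mean(rng):
    """Average yield with one randomly chosen plot from each column."""
    columns = list(zip(*YIELDS))
    return statistics.mean(rng.choice(col) for col in columns)


def spread(method, reps=2000, seed=0):
    """Standard deviation of the sample averages over many repetitions."""
    rng = random.Random(seed)
    return statistics.pstdev(method(rng) for _ in range(reps))


sd_srs = spread(srs_mean)
sd_row = spread(by_row_mean)
sd_col = spread(by_column_mean)
print(sd_srs, sd_row, sd_col)
```

With yields like these, stratifying by column gives averages that barely change from sample to sample, because every sample includes one plot at each distance from the river.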
What is a cluster sample? Why do we use a cluster sample? How is it
different than a stratified sample?
• Although a stratified random sample can sometimes give more
precise information about a population than an SRS, both sampling
methods are hard to use when populations are large and spread out
over a wide area.
• In that situation, we’d prefer a method that selects groups of
individuals that are “near” one another.
• To get a cluster sample, start by classifying the population into groups
of individuals that are located near each other, called clusters.
• Then choose an SRS of the clusters. All individuals in the chosen
clusters are included in the sample.
• Cluster samples are often used for practical reasons, like saving time
and money.
• Example: The school administration is considering a new schedule
and would like student input. Administration decides to survey 200
randomly selected students. Think about how difficult it would be to
locate 200 students. Administration opts to use a cluster sample.
The clusters are homerooms. They take a SRS of 8 homerooms and
give the survey to all 25 students in each homeroom.
• Question: Why would it be better to cluster by homeroom than to
cluster by math classes?
• Math classes are usually formed by grouping students of the same ability.
These clusters will be more homogeneous than the general population.
Certain types of classes (e.g. honors or remedial) may be over- or
underrepresented. However, homeroom classes are much more mixed.
• Clusters are different from strata. We want each stratum to contain
similar individuals and for there to be large differences between
strata. For a cluster, we’d like each cluster to look just like the
population, but on a smaller scale.
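The homeroom example can be sketched as code. This is an illustration only: the total of 40 homerooms and the student ID format are assumptions, since the notes give only the 8 chosen homerooms of 25 students each.

```python
import random


def cluster_sample(clusters, k, seed=None):
    """Choose an SRS of k clusters and include every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)  # SRS of cluster names
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members


# Hypothetical school: 40 homerooms with 25 students each.
homerooms = {
    f"HR{h:02d}": [f"HR{h:02d}-S{s:02d}" for s in range(1, 26)]
    for h in range(1, 41)
}

chosen, surveyed = cluster_sample(homerooms, k=8, seed=3)
print(chosen, len(surveyed))
```

Note that the random selection happens only at the homeroom level. No individual student is chosen at random within a homeroom.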
Example: A Good Read
A school librarian wants to know the average number of pages in all the books in
the library. The library has 20,000 books, arranged by type (fiction, biography,
history, etc…) in shelves that hold about 50 books each.
a) Explain how to select a simple random sample of 500 books.
Number the books 1 to 20000. Use the command randIntNoRep with a lower
bound of 1, upper bound of 20000, and a sample size of 500 to select 500 unique
numbers between 1 and 20000. Use the books that correspond to those
numbers.
b) Explain how to select a stratified random sample of 500 books. Explain your
choice of strata and one reason why this method might be chosen.
Stratify by type because different types of books might be longer or shorter than
other types. This will provide a more precise estimate of the average page
length than a simple random sample. To select the sample, take an
appropriately sized SRS of each type of book and combine the books selected
from each type to form the sample. For example, if there are 1000 biographies
(5% of 20000), select a SRS of 25 biographies (5% of 500) using the method
described in part (a).
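The "appropriately sized" SRS from each type can be worked out with proportional allocation, as in the biography calculation. In this sketch only the 1000 biographies out of 20000 books come from the example. The other counts are hypothetical, chosen so the proportions give whole numbers, and the function names are made up.

```python
import random

# Hypothetical counts by type; only the biography count is from the example.
BOOKS_BY_TYPE = {
    "fiction": 9000,
    "biography": 1000,
    "history": 4000,
    "science": 3000,
    "reference": 3000,
}


def proportional_allocation(stratum_sizes, sample_size):
    """Give each stratum the same share of the sample as of the population."""
    total = sum(stratum_sizes.values())
    # Rounding can make the sizes miss sample_size when shares are not whole numbers.
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}


def stratified_sample(stratum_sizes, sample_size, seed=None):
    """Take a separate SRS of book numbers within each stratum."""
    rng = random.Random(seed)
    allocation = proportional_allocation(stratum_sizes, sample_size)
    return {name: sorted(rng.sample(range(1, stratum_sizes[name] + 1), k))
            for name, k in allocation.items()}


allocation = proportional_allocation(BOOKS_BY_TYPE, 500)
sample = stratified_sample(BOOKS_BY_TYPE, 500, seed=2)
print(allocation)
```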
c) Explain how to select a cluster sample of 500 books. Explain your choice of
cluster and one reason why this method might be chosen.
Clusters are formed by grouping books that are located near each other, making
it easier for the librarian. We can use each shelf of 50 books as a cluster and
randomly select 10 shelves to obtain the 500 books for our sample. Number the
shelves from 1 to 400 (20000/50) and choose a SRS of 10 shelves using the
method described in part (a). Then use all the books on the 10 selected shelves
as the sample.
d) Discuss a potential drawback with each of the methods above.
SRS: Time consuming and difficult to locate the 500 books. You also might miss
certain book types in the sample (for example if only 1% of books are cat
biographies, there’s a chance a cat biography might not be in the sample).
Stratified random sample: Time consuming (need to find books from all over the
library).
Cluster: Because only 10 shelves are being chosen, it is possible that certain
book types will be left out.
What is inference?
• The purpose of a sample is to give us information about a larger
population.
• The process of drawing conclusions about a population on the basis
of sample data is called inference.
• We infer information about the population from what we know about the
sample.
Why should we rely on random sampling?
1. To avoid bias in selecting samples from the list of available individuals.
2. The laws of probability allow trustworthy inference about the population.
• Results from random samples come with a margin of error that sets
bounds on the size of the likely error.
• Larger random samples give better information about the population
than smaller samples.
What is a margin of error?
• It is unlikely that results from a random sample are exactly the same
as for the entire population. Sample results are only estimates of the
truth about the population.
• If we select two samples at random from the same population, we
will almost certainly choose different individuals, so the sample
results will differ somewhat, by chance.
• A margin of error sets bounds on how far we expect the sample
proportion to be from the true proportion, at most.
• The word “error” does not mean that a mistake has been made!!!
The margin of error simply compensates for the variability that results
from taking a random sample from a population. It does not account
for a mistake (bias) made during the data collection process.
What is the benefit of increasing the sample size?
• Increasing the sample size increases our precision (not accuracy: a
larger sample does not fix bias). Larger samples give better information
about a population than smaller samples. As sample size increases, the
variability due to random chance decreases.
• Think if we were going to perform an experiment flipping a coin, in
order to determine how often heads will appear. Let’s say you flip a
coin 10 times. If heads lands 3 times, does that mean heads will only
appear 30% of the time?
• How can you get a more precise result? Increase your sample size. If
we increased our sample size to 1000, what would you expect would
happen to the success rate for flipping heads? It should move much
closer to 50%. There is too much chance involved with small samples,
but as the sample size increases, the chance factor decreases.
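The coin-flip idea can be checked with a quick simulation. This is a sketch assuming a fair coin: it repeats the experiment many times with 10 flips and with 1000 flips, then compares how much the proportion of heads varies from one run to the next.

```python
import random
import statistics


def proportion_heads(rng, flips):
    """Flip a fair coin `flips` times and return the proportion of heads."""
    return sum(rng.random() < 0.5 for _ in range(flips)) / flips


def spread_of_estimates(flips, reps=500, seed=0):
    """Standard deviation of the proportion of heads across repeated experiments."""
    rng = random.Random(seed)
    return statistics.pstdev(proportion_heads(rng, flips) for _ in range(reps))


spread_10 = spread_of_estimates(10)
spread_1000 = spread_of_estimates(1000)
print(spread_10, spread_1000)
```

The proportions from 1000 flips cluster much more tightly around 50% than the proportions from 10 flips, which is the sense in which a larger sample is more precise.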
What is a sampling frame?
• The list of individuals from which a sample will be drawn is called the
sampling frame.
What is undercoverage and what problems might undercoverage
cause?
• Undercoverage occurs when some members of the population cannot
be chosen in a sample.
• Examples: A sample survey of households will miss homeless people, prison
inmates, and students in dormitories.
The Federalist Paper example only looked at the opening paragraph.
It would have missed any words that appeared in subsequent
paragraphs that weren’t in the first.
• Undercoverage may lead to some bias.
What is nonresponse and what problems might nonresponse cause?
How is it different than voluntary response?
• Nonresponse occurs when an individual chosen for the sample can’t
be contacted or refuses to participate.
• Bias results if people who respond differ from those who don’t, in a
way that is related to the response.
• Nonresponse is different from voluntary response. Nonresponse can
occur only after a sample is selected. In a voluntary response sample,
every individual has opted to take part, so there won’t be any
nonresponse.
What is response bias and what problems might response bias cause?
• A systematic pattern of incorrect responses in a sample survey leads
to response bias.
• Response bias occurs because people may lie about their age,
income, or lifestyle habits, or they may misremember or make up answers.
• The wording of questions, gender, race, age, ethnicity, human nature,
the order of questions, or even the interviewer can affect people’s
responses.
• Obviously this can lead to bias if people are not being truthful.