Chapter 1. Introduction to Statistical Inference: One Proportion

Learning Objectives:
• Begin to understand the process of statistical investigations as it relates to the scientific
  method: from posing a question to collecting data to analyzing data to drawing inferences
  beyond the data to communicating results.
• Be able to distinguish population from sample, parameter from statistic.
• Be able to state null and alternative hypotheses appropriate to a research question and
  conjecture.
• Begin to understand the reasoning process of statistical significance, using a 3S strategy:
  statistic, simulate, strength of evidence.
• Be able to conduct simulations and draw conclusions for inferences regarding strength of
  evidence about a (single) proportion.
• Begin to understand that a dataset allows for rejecting some hypothesized values of a
  parameter while retaining others as plausible values of the parameter.
• Begin to recognize that sample size plays an important role in assessing statistical
  significance and strength of evidence.
Outline:
Section 1.1: Introduction to Statistics
Topics:
• Anecdotal evidence
• Statistics from samples
• Parameters from populations
• Descriptive statistics versus inferential statistics
• Scientific Method

Section 1.2: Introduction to Statistical Reasoning: One Proportion
Example: Matching Names to Faces
Activity: A Preference Study: Friend or Foe?
Topics:
• Criminal Justice System as analogy for Statistics
• Null and Alternative Hypotheses
• Test of significance, p-value
• Simulation
• Strength of evidence
• Stating conclusions

Section 1.3: Statistical Significance for One Proportion—Other Null Hypotheses
Example: Appetizing Dog Food?
Activity: Cola Discrimination?
Topics:
• Testing null value other than 0.5

Section 1.4: Plausible Values
Example: Competitive advantage of red uniforms?
Activity: Kissing Right?
• Interval of plausible values
• Not finding evidence against a value does not mean confirming the value

Section 1.5: Effect of Sample Size
Example: Predicting election results?
Activity: Baseball “big bang”?
• Sample size matters
• All else being equal, larger samples produce stronger evidence against a null hypothesis
  than smaller samples do.

Case Study: Will skipping breakfast lead to a world without men?
Research Article: Stock Monkey
Practice Exercises
Exercises
Section 1.1: Introduction to Statistics
Have you ever heard statements like these?
• “I don’t wear my seatbelt anymore because a friend of mine was in a car accident
and wasn’t wearing a seatbelt when they crashed and she was the only survivor
of the accident.”
• “Don’t get your child vaccinated. I vaccinated my child and now he is autistic.”
• “I’m never going to become a runner, my friend’s dad just started running and he
died of a heart attack last week.”
The people making these statements each use a single case to support a decision about
how they will live their life. They are basing their conclusions on anecdotal evidence.
Key Idea: Anecdotal evidence is information obtained from only one or a handful of
particular cases. It doesn’t take into account that there are many other cases that may
have very different outcomes.
Thought Question: Do you think it is ever reasonable to draw a conclusion based on
anecdotal evidence?
We think most people would agree that drawing conclusions based on a single case is
often a bad idea. However, there are a lot of everyday situations where we use
anecdotal evidence to make decisions. For example, if you’ve been in a windowless
room all day and someone enters the room holding a wet umbrella, you might conclude
that it is raining outside. In this case, the use of a single case to make a conclusion is
reasonable and probably correct.
Of course, there are many scenarios where anecdotal evidence is insufficient for
drawing a conclusion or making a decision. Typically, in these scenarios the risk
involved with being wrong is higher than in situations where we are comfortable using
anecdotal evidence. For example, in the statements earlier, the ramifications of being
wrong about wearing your seatbelt or vaccinating your child are more severe than
getting your hair wet! In these scenarios, we would prefer to see many cases, which
each provide some evidence, in order to make a better conclusion or decision than what
we can with only a few cases.
Specifically, before we decide to stop wearing a seatbelt, we would like to know the
survival rates of individuals wearing seatbelts and individuals not wearing seatbelts in a
large number of similar accidents. Before we decide not to vaccinate our child, we
would like to know the rates of autism in a large number of vaccinated and unvaccinated
children. We would also like to know the rates of disease and death in a large number of
vaccinated and unvaccinated children. Before we put the skids on our running, and
before we even look at a large number of runners and non-runners, we might find out a
little bit more about that friend’s dad. Was he overweight? Was he a smoker? Did he
have a family history of heart attacks? Gathering information or data can help us to
make better informed decisions. This is Statistics at work.
Key Idea: Basing decisions on anecdotal evidence is often not appropriate. Instead, it is
often appropriate to draw conclusions only after we have gathered sufficiently large
amounts of data in a carefully planned manner.
What is Statistics?
Statistics is a discipline that guides us in weighing evidence about phenomena in the
world around us. More specifically, Statistics gives us a formal procedure for gathering
evidence, evaluating that evidence, suggesting conclusions based on that evidence, and
assessing our confidence in those conclusions.
A large part of the discipline of Statistics (capital “S”) is the use of descriptive statistics
(little “s”). You’ve likely come across and used statistics before; they are numbers like
averages and percentages or graphs like bar charts. These numbers and graphs are
called descriptive statistics because they are used to describe data that we have
collected. If we continue beyond the data we have collected and make broader claims,
this process is called Inferential Statistics.
We can use descriptive and inferential statistics when we are trying to learn about a
large and difficult to observe group of people, called the population, but we only have
data on a portion of that population, called the sample. For example, when we are trying
to learn about vaccine use and autism, the population of interest is children, but we
could never investigate all living children. Instead we use a sample of children and
investigate the relationship between vaccinations and autism in the sample. Descriptive
statistics (e.g., percentages of vaccinated children and unvaccinated children in our
sample who are autistic) are calculated, and then used to make conclusions or
inferences about the population of interest, all living children.
Key Idea: Descriptive statistics summarize, with graphs and numbers, what we see in
the sample. Inferential statistics involves weighing the evidence to make conclusions
about the population.
As we’ve already pointed out, numbers that we calculate from our sample are called
statistics. On the other hand, numbers that summarize information about the population
are called parameters. The statistics from our sample can help us draw conclusions
about the corresponding parameters in the population. In our previous example, we use
the descriptive statistic (the percentage of vaccinated children in our sample who are
autistic), to learn about the parameter (the percentage of all vaccinated children who are
autistic).
Key Idea: Descriptive statistics are numerical summaries of the sample, which can be
used to learn about parameters, which are numerical summaries of the population.
Sometimes instead of using statistics to summarize sample data and draw conclusions
about a population, we use statistics to see whether a researcher’s manipulations
have made an impact on a response of interest. For example, in an experiment we might
see whether taking a newly developed drug lowers blood pressure. Specifically, we
might measure the blood pressure of everyone in our sample, then give them the drug
and measure their blood pressure again. If, overall, we see a decrease in blood pressure
after taking the drug, Statistics can help us decide if we should be convinced that this
change can be attributed to taking the drug.
Key Idea: Statistical methods can be used to tell us whether researcher intervention is a
reasonable explanation for changes in a response.
In this book we will use Statistics in a variety of real-life situations to help us to weigh
evidence about research questions of interest, including the following:
• Is swimming with dolphins successful therapy for people diagnosed with severe to
  moderate depression?
• Do pre-verbal infants have preferences between helper toys and hinderer toys?
• Do females have a higher body temperature than males?
• Does vitamin C prevent the common cold?
• Can people correctly distinguish between two brands of cola?
• What is the average weight of newborn babies?
• Are there effects of sleep deprivation on learning?
Scientific Method
The scientific method is a series of techniques used to objectively guide scientific
inquiry. When we use the scientific method we start by asking questions, which we then
refine into testable hypotheses based on prior research (if available). Studies are
designed to test the hypotheses. Results from the studies are analyzed, conclusions are
drawn, and often new research hypotheses are formed. Statistics informs all parts of the
scientific method. See Figure 1.1 for an outline of the scientific method.
Specifically, knowledge of Statistics helps to refine questions into testable hypotheses,
guides many study design decisions, is used to evaluate whether the results of the study
are evidence in favor of the research hypotheses, and guides the ultimate conclusions
drawn from the research. In short, Statistics is used in many studies that use the
scientific method.
Figure 1.1: Flowchart of Scientific Method
Example: The Physicians’ Health Study, conducted by the Harvard Medical School, began
in 1982 and ended in 1995. One question this study hoped to answer was: Is taking an
aspirin every other day beneficial in the prevention of cardiovascular disease? Approximately twenty-two
thousand male physicians ages 40 to 84 were split randomly into two groups of
approximately 11,000 physicians each. One group took an aspirin every other day for
the duration of the study and the other group took a placebo (sugar pill) every other day
for the duration of the study. It was concluded that aspirin reduced the risk of a first
myocardial infarction (heart attack) by 44%.
The Physicians’ Health Study is an example of applying the scientific method.
Figure 1.2: Flowchart of the scientific method applied to the Physicians’ Health Study.
Section 1.2: Introduction to Statistical Reasoning: One
Proportion
Ei incumbit probatio qui dicit --- The burden of proof rests on one who asserts
Statistics and the Criminal Justice System
In our criminal justice system, each trial is designed to answer the same question: “Is the
defendant guilty?” The system involves a jury that evaluates the strength of the
evidence suggesting guilt of the defendant. An important initial instruction is given to the
jury, to assume that the defendant is innocent, and not to conclude that the defendant is
guilty unless the evidence is “beyond a reasonable doubt.”
Notice that there are two competing hypotheses here. The first hypothesis is that the
defendant is innocent. It is this first hypothesis that the jury is instructed to assume is
true. The second hypothesis is that the defendant is guilty. It is this hypothesis that the
prosecutor believes to be true. Indeed, if the prosecutor did not have a strong reason to
suspect the defendant’s guilt he would not have brought the defendant to trial. In the
trial, evidence is presented. The jury then examines the evidence in order to gauge
whether the evidence strongly points to the defendant’s guilt. The jury must examine
every piece of evidence assuming that the defendant is innocent; asking themselves the
question “Is it possible that the defendant is innocent and we still see this evidence? Is it
unlikely? How unlikely?”
The logic of the criminal justice system is similar to the approach we use for weighing
evidence in Statistics. In Statistics we have a research conjecture we want to evaluate;
we call this research conjecture the alternative hypothesis. To see whether sufficient
evidence exists to conclude that our research conjecture is reasonable, we start by
assuming that the research conjecture is not correct. The statement of the research
conjecture being incorrect, stated in such a way as to communicate no effect or equality,
is known as the null hypothesis. In short, we assume the null hypothesis is true, and
evaluate the data under that assumption in order to weigh the evidence supporting the
alternative hypothesis (our research question). And so, the null hypothesis is like
assuming the defendant is innocent, the data we gather is the evidence, and we are
looking for strong evidence in favor of the alternative hypothesis (research question; the
defendant is guilty).
Key Idea: The alternative hypothesis is the research conjecture we are trying to
establish. The null hypothesis is a statement contrary to the research conjecture, which
is often a statement of no effect or equality. As in our criminal justice system, we will
assume that the null hypothesis is true unless there is evidence (data), beyond a
reasonable doubt, supporting the alternative hypothesis. If we don’t find strong enough
evidence, the null hypothesis remains plausible.
This process of assuming the null hypothesis (or the “dull hypothesis”) to be true,
gathering data, then analyzing that data to see whether we have convincing evidence in
favor of the alternative hypothesis is called a test of significance.
Key Idea: Tests of significance use the data gathered to assess the strength of evidence
in favor of the alternative hypothesis over the null hypothesis.
Example: Matching Names to Faces: Bob or Tim?
A study in Psychonomic Bulletin and Review (Lea, Thomas, Lamkin, & Bell, 2007)
presented evidence that “people use facial prototypes when they encounter different
names.” Participants were given two faces and had to determine which one was Tim and
which one was Bob. The researchers wrote that their participants “overwhelmingly
agreed” on which face belonged to Tim and which face belonged to Bob, but did not
provide the exact results of the study. A recent class of statistics students (our sample)
replicated this study and 23 of the 33 students correctly identified the face that belonged
to Tim. So our statistic could be the 23/33 ≈ 0.697 proportion that made the correct
identification.
But do 23 of the 33 students making that choice convince us that there is something to
this theory? Maybe these students just got lucky? What does that mean here?
If our research conjecture is that people in general (our population) have a tendency to
associate certain facial features with a name, this gives us our alternative hypothesis.
Note: We might want to debate what population you are willing to consider this sample
representative of, but we will come back to that.
Alternative hypothesis: In the population, people have a tendency to associate
certain facial features with a name. In other words, the correct face is matched
with Tim more than half the time (i.e., the parameter, the proportion in the
population, is greater than 0.5).
The null hypothesis then becomes there is no such association. This means that people
trying to match names to faces are essentially blindly guessing which name goes with
which face.
Null hypothesis: People are not more likely to match Tim with one face over the
other. In other words, people match Tim with his face half the time (i.e., the
parameter, the proportion in the population, is equal to 0.5).
So, like in a criminal trial, we will begin our analysis by assuming there is nothing special
going on here; in other words that the null hypothesis is true and people are blindly
guessing which face is Tim’s. If that were the case, would it be possible that in a sample
of 33 students, more than half would guess the Tim face correctly just by chance?
Sure, this is possible. But what if all 33 students had matched Tim correctly? Well then,
we would be pretty convinced there was something going on. Why? Because it’s unlikely
that everyone would get the right answer if they were all blindly guessing. How unlikely?
That would be like flipping a coin 33 times and getting all heads! This would happen
less than 1 in a billion times!
So what about 23 people matching names to faces correctly just by chance? How
unlikely is that? We know that each time we toss a coin 33 times we will get a different
value for the number of heads. Is it unusual to get 23 heads? How unusual? We will
explore this by looking at what “could have happened” when we toss a coin 33 times.
Figure 1.3 shows the results from tossing a coin 33 times, sorted by heads or tails.
Figure 1.3: Results from tossing a coin 33 times
This time we got 15 heads and 18 tails, pretty close to the 50/50 split we would expect.
But when we did it again we got 20 heads as shown in Figure 1.4.
Figure 1.4: Results from tossing a coin 33 times a second time
There are a lot of different possible “number of heads” we could get when flipping a coin
33 times, and the result will potentially vary every time we flip the coin another 33 times.
So to help us assess whether 23 heads is unusual, we need to flip the coin 33 times
over and over again. Below is a graph of the number of heads we obtained in 33 tosses,
when we performed the 33 coin flips 1000 different times.
Figure 1.5: A graph showing 1000 repetitions of flipping a fair coin 33 times and
counting the number of heads.
[In the dotplot, the central dots are labeled as typical outcomes, and the few dots at either
end are labeled as the lower tail and the upper tail.]
Examining the graph in Figure 1.5 (a “dotplot”) confirms that while it is virtually impossible
to get all 33 heads (it never happened), there is clearly some chance we will see 23 heads.
Specifically, in these 1000 sets of 33 tosses, we obtained exactly 23 heads 11 times,
and we got 23 or more heads in 17 of the 1000 times. Therefore, a result like 23 is not
very likely. In fact, 23 lies out in the “tail” of the distribution. So our conclusion would be
that a result of 23 heads in 33 tosses is fairly unusual.
Thought Question: How does the above analysis help us answer our research question
as to whether we have convincing evidence that people are able to match names to
faces more than half the time?
So what does this have to do with our study? Well, if the null hypothesis is true and
people can’t match names to faces any better than with a coin toss, we can model their
behavior with a coin. Heads means they matched the name correctly, tails means they
did not. This doesn’t mean every sample is expected to have exactly half the people
match the right name; we know there will be some variability, just by chance. By
repeating this process over and over again, as we showed in Figure 1.5, we get a sense
for what the pattern of outcomes looks like if people are blindly guessing at which name
belongs to which face. This allows us to evaluate whether the results of our students
appear to be consistent with the random behavior we would see from coin tosses.
Figure 1.6: A graph showing 1000 repetitions of a group of 33 students picking
Tim’s face and counting the number of correct matches under the null hypothesis
that people are equally likely to match his face correctly as not.
[In the dotplot, the central dots are labeled as the typical outcomes if the null is true, and
the few dots at either end are labeled as the lower tail and the upper tail.]
The graph shown in Figure 1.6 is the same as the one shown in Figure 1.5, but we have
changed the context from repeatedly flipping 33 coins and counting the number of heads
to repeatedly sampling 33 people and counting the number of times each group correctly
identifies the correct face. In both cases, the simulation is built on the premise that the
probability of a “success” (heads or a correct guess) equals .5. This graph tells us that, if
people are simply guessing between the two faces, getting an outcome of 23 matches is
rather unlikely. But we DID see 23 correct matches. So we can conclude that the
subjects in the study appear to have behaved differently than they would have if they
were making their decisions by the result of a coin toss. Our data did not appear
consistent with the null hypothesis, so we conclude that we have evidence against the
null hypothesis, beyond a reasonable doubt.
The more unusual the outcome is under the assumption that the null hypothesis is true,
the stronger the evidence provided by the observed data against the null hypothesis that
generated the “could have happened” results.
Note that a key step in this process was assuming the null hypothesis was true so that
the people in the study were behaving like they each tossed a coin to make their choice.
To help us determine whether 23 was an unlikely outcome in 33 tosses, we could either
toss a coin 33 times, over and over, or we could use a computer to carry out the coin
tosses. Typically, instead of actually flipping coins many times, we use a computer to
simulate the “could have happened” data by assuming the people behaved like coins
and then generating random coin toss results.
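To make this concrete, here is one minimal sketch (in Python, which is our choice of
illustration and not part of the book's applets) of how a computer could carry out the 1000
repetitions of 33 coin tosses described above. The repetition counts match the text; the
exact tallies will vary from run to run because the tosses are random.

import random

random.seed(1)  # optional; included only so a run of this illustration can be reproduced

n_tosses = 33   # one repetition = 33 coin tosses (33 students guessing)
n_reps = 1000   # number of repetitions, as in Figure 1.5

head_counts = []
for _ in range(n_reps):
    # count heads in one set of 33 fair tosses (heads = a correct match)
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    head_counts.append(heads)

# Each entry of head_counts corresponds to one dot in the dotplot.
# How often did we see a result as extreme as the observed 23 correct matches?
print("repetitions with 23 or more heads:", sum(h >= 23 for h in head_counts))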
Of course, people don’t behave exactly like coins, even when they are equally choosing
between two choices. But this model gives us a reasonable assessment of how unusual
our sample results are under this null hypothesis.
Key Idea: A simulation is the imitation of a real world process that is typically performed
by a computer. The use of a computer allows us to repeat the process many, many
times very quickly.
To test different hypotheses we will use a three-step strategy called the “three S”
strategy. This process serves as the foundation for a majority of research questions we
will investigate in this book.
Three S Strategy for Test of Significance
1. Statistic: Compute the statistic from the observed data.
2. Simulate: Simulate the process to produce “could have happened” data under the
assumption that the null hypothesis is true and calculate the value of the statistic in that
data. Repeat the simulation process to generate a large number of could have happened
data sets, always assuming the null hypothesis to be true. Examine the “what if the null
hypothesis was true” distribution of these statistics.
3. Strength of evidence: Consider where the observed statistic falls in the “what if the
null was true” distribution. If the statistic falls in the far tail of the distribution, then we
have strong evidence against the null hypothesis. Otherwise, if the value of the observed
sample statistic is not in the tail of the “what if” distribution, then consider the null
hypothesis to be plausible.
For example, in our study the statistic we computed was 23 out of 33 students who
correctly matched Tim to his face. We simulated what “could have happened” if the null
were true (people correctly match Tim with his face half the time and incorrectly half the
time) using a coin flip (heads = correct match). We repeated this simulation 1000
times, each time counting the number of heads out of 33 tosses (number of correct
matches out of 33 just guessing students). The observed number of correct matches by
the students in our study, 23, was far enough in the tail of the “what if the null was true”
distribution to give us some evidence against the null hypothesis.
How far out in the tail of the distribution does the observed statistic need to be so that we
consider the result “beyond a reasonable doubt?”
Thought Question: Looking at the dotplot in Figure 1.6, how many correct matches
would you need to see in a group of 33 students to convince you that their outcome was
better than what you would expect just by random chance?
One way to quantify how unusual an outcome is in the “what if the null was true”
distribution is to calculate a p-value.
Key Idea: The p-value is the probability that we would get a result as extreme as or
more extreme than the one that was actually observed, if the null hypothesis was true.
In our example, we found 23 or more correct matches occurred 11 + 5 + 1 = 17 times,
giving us an approximation to the p-value of 17/1000 = 0.017. [You can calculate this
p-value more exactly, as you will learn in Chapter XX, but 1000 repetitions of our
simulation should give us a very reasonable estimate.] So this gives us a measure for how
unusual it is to get 23 correct matches if everyone in the class was simply guessing.
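For readers who are curious, this exact p-value can also be computed directly from the
binomial probabilities of a fair coin rather than by simulation. The short Python sketch
below (using only the standard library) is offered as one illustration of that calculation,
not as the method used by the applet.

import math

n = 33          # number of students (coin tosses)
observed = 23   # observed number of correct matches (heads)

# P(23 or more heads in 33 tosses of a fair coin)
p_value = sum(math.comb(n, k) for k in range(observed, n + 1)) / 2**n
print(round(p_value, 4))  # about 0.0175, close to the simulated 0.017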
FAQ: Why do we need to include more than just the observed result when computing our p-value?
Student: So 23 out of 33 students correctly matched Tim’s name to his picture, why isn’t
the p-value the likelihood I would just get 23 heads out of 33 flips of a coin? Why do we
have to also include the likelihood of getting more than 23 heads?
Stat Prof: In flipping just 33 coins, the probability of getting 23 heads is very similar to the
probability of getting 23 or more heads, so I can see how this would be confusing. I
have an idea to help you understand this. Do you have any coins with you?
Student: I might, let me check. It looks like I have a few.
Stat Prof: Good. Do you think you can toss the coin “fairly”?
Student: I don’t see why not.
Stat Prof: Start flipping it. As you do this, keep track of the number of heads you get as
well as the total number of flips. Also think about whether or not your results indicate
whether this is an unfair tossing process.
Student: Okay, I flipped the coin 10 times and got 6 heads. Since this is close to 50-50, I
don’t think I could conclude that my coin tossing is unfair.
Stat Prof: Good. The probability of getting 6 heads out of 10 flips is about 0.20 while the
probability of getting 6 or more heads (our actual p-value) is about 0.37. Either way you
think of this, getting 6 heads out of 10 coin flips is not too unlikely. Let’s keep flipping.
Student: Okay, I’ve flipped it 50 times and …
Stat Prof: Keep flipping.
Student: I’ve flipped it now 100 times and…
Stat Prof: Keep flipping.
(1 hour later)
Student: I think my arm is going to fall off!
Stat Prof: I guess you can stop now. What are your results?
Student: I flipped the coin 1000 times and got 505 heads.
Stat Prof: Do you think your results show that your coin tossing is not fair?
Student: Of course not. Getting 505 heads out of 1000 flips is close enough to 50% that
the result is not unexpected.
Stat Prof: Do you mean you expected to get 505 heads?
Student: Well no, not exactly 505, but I did expect to get something close to 500.
Stat Prof: In fact, the probability of getting exactly 505 heads out of 1000 flips of a fair
coin is only about 0.02. Because there are so many different outcomes possible, the
probability of any one particular outcome is rather small. But we wouldn’t want to look at
the probability of 0.02 and consider this a surprising outcome; it is definitely among the
typical values. This is better conveyed by noting that the probability of getting 505 heads
or more in 1000 flips is about 0.39.
Student: And since this p-value is so high, I would not conclude my coin tossing is unfair!
Stat Prof: Exactly!
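The two probabilities quoted by the Stat Prof can be checked with the same kind of
binomial calculation. The Python sketch below is a rough verification offered for
illustration, not part of the dialog.

import math

n = 1000  # coin flips

def prob_exactly(k):
    # probability of exactly k heads in n flips of a fair coin
    return math.comb(n, k) / 2**n

print(round(prob_exactly(505), 3))                                 # about 0.02
print(round(sum(prob_exactly(k) for k in range(505, n + 1)), 3))   # about 0.39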
Key Idea: The p-value for a test of significance is what we base our conclusion on. A
small p-value means it is unlikely the result in our data would have occurred simply by
random chance, assuming the null hypothesis is true.
What if only 18 out of 33 students had correctly identified Tim?
In the class we reported on, 23 out of 33 students correctly identified Tim, but what if that
number had only been 18? In Figure 1.6 we see that 18 is a result that is typical if the
null hypothesis is true. When we compute the p-value we see that it is 0.340 and so we
would not consider this outcome unusual for a class that was simply guessing.
Because the p-value is not small, this says that if people really can’t correctly identify
Tim, it’s a fairly common occurrence to see around 18 students correctly identify Tim
anyway. Thus, these “new” data (18 out of 33 students correctly identifying Tim) would
not provide enough evidence that the population of students can identify Tim. In this
scenario, it is plausible that the population cannot correctly identify Tim and that we just
happened to see a slight preference in our sample by random chance.
Key Idea: If the p-value for our test of significance turns out to be large, then our data do
not provide evidence that the observed outcome is something other than what we would
have expected to see by chance alone assuming the null hypothesis to be true. Note, we
never conclude that the null is true or even that we have evidence in favor of the null.
We can only say that we don’t have evidence against the null hypothesis and so the null
is plausible.
Again, the legal analogy applies: in a trial, a verdict in favor of the defendant is not
“innocent” but rather “not guilty.” We haven’t proven anyone innocent; rather, the
evidence provided does not convince us, beyond a reasonable doubt, of their guilt.
Strength of Evidence
But still, how do we decide whether our p-value is a small number or not? How rare did
the outcome need to be in Figure 1.6 to convince you and your fellow classmates that
the result should not be attributed to random chance? How similar were your answers?
FAQ: What p-value should make us suspicious?
Stat Prof: Did you ever google “Persi Diaconis”?
Student Skeptic: No, why?
Stat Prof: Try it sometime. He’s unusual … even for a statistician he’s unusual. He was
one of the first people to win one of the MacArthur “genius” awards.
Student Skeptic: Is that all? What else can he do?
Stat Prof: Flip a coin and make it come up heads every single time.
SS: I don’t believe anyone can do that. Isn’t this whole dialog just a stat prof’s geeky
way to sneak in p-values?
SP: Sorry, I can’t help it. We stat profs are so transparent sometimes. That’s just the
data. But humor me: How many heads in a row would Persi have to get to make you
begin to think maybe it’s not just chance? Would heads on the first two tosses do it for
you?
SS: Of course not. That happens 25% of the time.
SP: You win a point for extra credit. What about heads on the first three tosses?
SS: You’re making me get serious about this. Three heads in a row wouldn’t happen
very often, but it’s still not unusual enough to be suspicious.
SP: What about four in a row? Five in a row?
SS: Now it’s really getting serious … and suspicious. I find it hard to believe that you
can get five heads in a row just by chance. Sure, it can happen, but five in a row is
enough to make me think maybe there’s something else going on.
It turns out that four heads in a row happens about 6% of the time and 5 heads in a row
happens about 3% of the time. This is about when most people would start to think
there is something suspicious going on. And in fact, many studies consider 5% the cutoff value for what we would consider strong evidence against the null hypothesis. Of
course, in some situations you may want even stronger evidence. (A civil trial only
requires a “preponderance of evidence” as opposed to “beyond a reasonable doubt” in a
criminal trial.) Keep in mind that the smaller the p-value, the stronger the evidence
against the null hypothesis in favor of the alternative hypothesis.
Statisticians have agreed on the following standards for how they evaluate the strength
of evidence conveyed by a p-value.
Standards for judging p-values
0.10 < p-value              not much evidence against null hypothesis
0.05 < p-value < 0.10       moderate evidence
0.01 < p-value < 0.05       strong evidence
0.001 < p-value < 0.01      very strong evidence
p-value < 0.001             extremely strong evidence
When the statistic provides strong evidence against the null hypothesis in favor of the
alternative hypothesis, we often say the result is statistically significant, meaning unlikely
to happen by random chance alone.
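These guidelines are easy to encode. The small Python function below simply restates the
table above as code; the wording of the categories is taken from the table, and the function
is only a convenience for later examples, not a formal statistical rule.

def strength_of_evidence(p_value):
    # translate a p-value into the descriptive categories from the table above
    if p_value > 0.10:
        return "not much evidence against null hypothesis"
    elif p_value > 0.05:
        return "moderate evidence"
    elif p_value > 0.01:
        return "strong evidence"
    elif p_value > 0.001:
        return "very strong evidence"
    else:
        return "extremely strong evidence"

print(strength_of_evidence(0.017))  # the Bob/Tim study: strong evidence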
Activity 1.2: A Preference Study: Friend or Foe?
As adults we know the difference between naughty and nice, but what about pre-verbal
infants? A study in the November 2007 issue of Nature looked at children less than 1
year old to see whether they recognized and had a preference for nice versus naughty
toys.
In one component of the study, 10-month-old infants were shown a “climber” character
(a piece of wood with “googly” eyes glued onto it) that could not make it up a hill in two
tries. Then they were alternately shown two scenarios for the climber’s next try, one
where the climber was pushed to the top of the hill by another character (“helper”) and
one where the climber was pushed back down the hill by another character (“hinderer”).
The infant was alternately shown these two scenarios several times. Then the child was
presented with both pieces of wood (the helper and the hinderer) and asked to pick one
to play with. The helper toy was chosen by 14 of the 16 children. We will consider these
16 infants (our sample) as representative of a larger population of 10-month-old infants
(our population).
To see a video of both the “helper” and “hinderer” scenarios as well as to see infants
choosing the “helper,” visit the link below:
http://www.yale.edu/infantlab/socialevaluation/Helper-Hinderer.html
Keep in mind that the shape and colors of the helper and hinderer toys were changed
across the trials, so focus only on the helper vs. hinderer roles rather than on these other
aspects of shape and color.
Setting up the hypotheses
Remember, we said that the research conjecture becomes the alternative hypothesis
and the null hypothesis is then a related statement of “no effect” or “nothing special
going on.” The “dull” hypothesis.
1. State the null and alternative hypotheses for this research question. (Keep in mind
   there are a couple of different ways to state them, but be sure to state them about the
   population.)
Examining the data
2. Statistic: In the study, 14 of the 16 infants chose the helper toy. Convert this to a
   proportion. Is this statistic in the direction of the alternative hypothesis? That is, does it
   provide initial support for the research conjecture?
Strength of Evidence
Although our data are supportive of the alternative hypothesis, a result as extreme as 14
could have arisen even if the infants were each choosing equally between the helper
and the hinderer. So we again need to consider how unusual such a result would be by
random chance alone.
If the null hypothesis is true, then infants will equally choose either the helper or the
hinderer toy and so each infant’s choice is really just like flipping a coin. Remember that
tests of significance are like the criminal justice system in that we will be making an initial
assumption (the null hypothesis is true). So let’s assume, for the time being, that infants
really don’t prefer the helper toy any more than the hinderer.
3. Simulate: Let’s use coin flipping to simulate “could have happened” results under this
   null hypothesis.
   a. Start by simulating a single infant randomly (with equal probability) choosing the
      helper or hinderer toy. Explain how you will use a coin to do this and what heads
      and tails represent.
   b. Now, simulate 16 infants randomly choosing the helper or hinderer toy. Explain how
      you will use a coin to do this, what heads and tails represent, and how you will
      obtain the statistic from your simulation.
4. Pool your results with your classmates to create the “what if the null was true”
   distribution for the number of infants choosing the helper toy. Based on these results,
   does 14 out of 16 infants choosing the helper toy when the null hypothesis is true and
   they have no genuine preference between the toys (or 14 heads in 16 coin tosses)
   appear to be an unlikely occurrence?
It’s hard to get a really secure feeling about your conclusion having replicated the study
only a few times. It would also be quite time consuming if we had to do many more of
these repetitions flipping a coin. We will use the power of the computer to repeat this
simulation of 16 children’s toy choices under the null hypothesis. The more times we
repeat the simulation, the more accurately we will be able to approximate the p-value for
this study.
5. Let’s use the “Coin Tossing” web applet to generate more repetitions.
   a. First, use the applet to simulate 16 coin flips. Report the statistic you obtained and
      explain how the simulation relates to the infants’ choice of helper or hinderer.
   b. Now change the Number of Repetitions to 20 and press the “16 Tosses” button to
      get a sense for how this statistic varies from trial to trial.
   c. Next change the number of repetitions to 979 and press the “16 Tosses” button to
      create a total of 1000 replications.
   d. Use the applet to count how many of the dots (sets of 16 tosses) had 14 or more
      heads by entering 14 in the “As extreme as” box and pressing Count. The proportion
      of repetitions reported by the applet is your approximation of the p-value: how often,
      when the null hypothesis is true, we get a statistic that is at least as extreme as what
      the researchers found in the actual study.
6. Press the Exact p-value button and the applet will calculate the p-value exactly using
   probability rules (think of repeating the above simulation infinitely many times). How
   does this value compare to your estimate from the 1000 repetitions?
7. Interpret this p-value in the context of this study. Explain what it measures.
8. Would you consider this p-value strong evidence against the null hypothesis?
Explain.
Let’s recap what you have done. We suspect you found that the proportion of repetitions
with 14 or more heads was very small. The exact p-value equals 0.002. This means that
if infants in general are just randomly choosing toys and you did this same study with 16
impartial infants 1000 times, only about 2 out of 1000 times would you get 14 or more
(87.5% or more) of the infants choosing the helper toy, as observed in the
actual study. This provides very strong evidence against the null hypothesis that infants
in the population do not have a genuine preference between the two toys.
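If you would like to double-check the applet, the following Python sketch both simulates
1000 repetitions of 16 “coin-flip” infants and computes the exact p-value from the binomial
probabilities. It is an independent illustration, not the code behind the applet.

import math
import random

random.seed(2)  # only so this illustration is reproducible

n_infants = 16
n_reps = 1000

# simulated p-value: how often do 16 "guessing" infants give 14 or more helper choices?
extreme = 0
for _ in range(n_reps):
    helpers = sum(random.random() < 0.5 for _ in range(n_infants))
    if helpers >= 14:
        extreme += 1
print("simulated p-value:", extreme / n_reps)

# exact p-value: P(14 or more heads in 16 tosses of a fair coin)
exact = sum(math.comb(n_infants, k) for k in range(14, n_infants + 1)) / 2**n_infants
print("exact p-value:", round(exact, 4))  # about 0.002, as reported above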
Warning: Keep in mind that we might be making an error here. It’s possible that this
study could be one of those 2 in 1000 studies that would produce such an extreme
sample even if the null hypothesis was true. However, the chance of that is sufficiently
small that we feel more comfortable concluding something else is going on.
Now, let’s step back a bit further and think about the implications of the fact that these 16
infants seem to be demonstrating a preference for the helper toy. Do you think this
means that all ten-month-old infants can tell the difference between naughty and nice?
An important question here is how representative these 16 infants are of all infants.
We’ll discuss this more in Chapter 2. We would also probably need to do some more
sophisticated testing to make the leap from toy preference to knowledge of difference
between naughty and nice.
9. Use the applet to investigate how many infants would need to choose the helper toy
to convince you that their result is more extreme than you would reasonably expect
from chance alone.
Section 1.3: Statistical Significance for One Proportion—Other
Null Hypotheses
In the previous section, we discussed testing whether or not a null hypothesis of
“choosing equally between two choices” was believable based on some observed
sample data. But there are situations where we may wish to consider proportions other
than .5.
Example: Appetizing Dog Food?
Does canned dog food make for a suitable and inexpensive alternative to blended meat
products such as spam or liverwurst? As unappetizing as this might sound, researchers
investigated this question in a study that was described in the February 20, 2009 issue
of Science magazine. The researchers presented 18 subjects with 5 unlabeled blended
meat products. One of these products was Newman’s Own dog food, prepared with a
food processor to have the texture and appearance of a liver mousse. After tasting all 5
products, subjects were asked to identify which was their least favorite on the basis of
taste. The research question was whether subjects are more likely to identify dog food
as their least favorite than if they were making selections at random.
Statistic: It turned out that 13 of the 18 subjects (≈72%) identified the dog food as their
least favorite.
This study has much in common with the infants’ toy study: We have 18 subjects instead
of 16 infants, and researchers keep track of which selection each subject makes. Also,
the research question is whether a certain selection is made so often that random
chance can be eliminated as a plausible explanation for that tendency. We’ll analyze
these studies in the same way: We’ll simulate the subjects’ selections assuming a null
hypothesis that selections are made by random chance alone. We’ll repeat this
simulation a large number of times, and we’ll see how often we obtain a result as extreme
as (or more extreme than) the actual result from the subjects in the real study. But there’s one
important difference between these studies, which affects how we’ll conduct the
simulation.
Thought Questions
• What is the important difference? Hint: Would it make sense to use a fair coin to
simulate the subjects’ selection process?
• If they are picking among the five foods completely at random, roughly how many of
the 18 subjects would you expect to pick dog food as their least favorite?
Hopefully it is obvious that it would not make sense to use a fair coin here, because
there are five options rather than two available to each subject (and then we will record
whether or not they pick dog food as the least favorite). In order to simulate subjects’
choices of their least favorite product as random selections, we need to give dog food a
1 in 5 chance, rather than a 1 in 2 chance, of being selected. But this is the only change
we need to make to our analysis strategy; the rest will proceed exactly as with the
infants’ toy study.
So our hypotheses will be:
Null hypothesis: Subjects in the population pick equally among the 5 foods (the
probability dog food is picked as the least favorite equals 0.20)
Alternative hypothesis: Subjects in the population are more likely to choose the
dog food as their least favorite. (The probability dog food is picked as the least
favorite is larger than 0.20.)
Simulate: To simulate a process with a 0.2 probability, we could use a computer or a
calculator. The results of 1000 sets of such random selections are shown in Figure 1.7.
Figure 1.7: The “what if the null was true” distribution of the number of 18
subjects randomly choosing dog food as their least favorite
As expected under the null hypothesis, the center of this distribution is close to (1/5)×18
= 3.6. Most of the trials in this simulation saw 2, 3, 4 or 5 people picking the dog food as
their least favorite.
Strength of Evidence: In particular, in our simulation under the null hypothesis we never
saw a result as extreme as 13 of the 18 subjects picking the dog food when subjects are
choosing equally among the five foods. (The exact p-value can be shown to equal
0.0000025, or less than 3 in a million or less often than tossing 18 heads in a row.)
Thought Question
What conclusion would you draw from this very small p-value?
This small p-value suggests that the actual result of the study would be extremely
unlikely to occur if in fact the selections were made purely at random (picking equally
among the five food choices). Therefore, this study provides extremely strong evidence
that dog food really is chosen as least favorite more than 20% of the time.
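As a rough check on these numbers, the Python sketch below simulates 1000 repetitions of
18 subjects each picking a least favorite completely at random among five products (so the
dog food is picked with probability 0.2), and also computes the exact binomial probability
of 13 or more such picks. It is offered only as an illustration of the calculation.

import math
import random

random.seed(3)  # only so this illustration is reproducible

n_subjects = 18
p_dog = 0.2      # chance a purely random pick lands on the dog food
n_reps = 1000

# simulated "what if the null were true" distribution
extreme = 0
for _ in range(n_reps):
    picks = sum(random.random() < p_dog for _ in range(n_subjects))
    if picks >= 13:
        extreme += 1
print("repetitions with 13 or more dog-food picks:", extreme, "out of", n_reps)

# exact p-value: P(13 or more of 18 picks when each pick has probability 0.2)
exact = sum(math.comb(n_subjects, k) * p_dog**k * (1 - p_dog)**(n_subjects - k)
            for k in range(13, n_subjects + 1))
print("exact p-value:", exact)  # about 0.0000025, less than 3 in a million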
Follow-up:
The 18 subjects in this study were also asked to guess which of the 5 products was dog
food. Interestingly, only 3 of the 18 subjects correctly identified which product was dog
food.
Thought Question
Explain why it’s not even necessary to conduct a simulation analysis to investigate
whether this provides strong evidence that people can correctly identify which product is
the dog food more often than random chance would suggest.
In our 3S strategy, the first S (statistic) involves looking at the sample data. One thing to
consider is whether the data are in the direction conjectured by the alternative
hypothesis. In this case, 3 out of 18 or about 16.7% of the subjects correctly identified
the dog food. Since this is less than 20%, such data will certainly not convince us that
people are more likely to correctly identify the dog food than if they were simply
guessing.
Activity 1.3: Discriminating Colas
Do you think that, overall, the cola drinkers in your class can tell the difference between
two similar colas? Let’s find an answer to this question by running a taste test. Students
who are cola drinkers will be given three plain cups. Two contain the same brand of cola
and one contains a different brand of cola. They will then taste from all three cups and
attempt to correctly determine which one is different from the others. Students who
aren’t cola drinkers can set up and run the taste tests. Record the results to see how
many of the students involved in the taste test correctly identify which cup of cola is
different from the other two. We want to know if the proportion of students who can
correctly identify the different cola is greater than what we would expect if they were just
guessing (making their decision by random chance). This is similar to the dog food
example since we have more than two outcomes from which to choose, so we won’t be
able to use a fair coin to simulate each student’s choice. (But we will still classify each
response as “correct” or “not correct.”)
1. If we assume that, overall, the class can’t tell the difference between the two colas,
or each individual is just randomly guessing as to which of the three cups is different,
what is the chance an individual will correctly choose the cup that is different? How
many correct identifications do you think you would get on average in your sample if
students are really just guessing?
2. What are your null and alternative hypotheses?
3. Statistic: Let’s examine the data we collected from our taste test.
a) How many students correctly identified which of the three cups was different?
b) What proportion of students correctly identified which of the three cups was
different?
c) Do you suspect that the evidence is strong enough to convince someone that the
alternative hypothesis is true? Why or why not? What other information would
you like to have in order to decide how convincing the evidence is?
Simulate: Let’s use an applet to simulate 1000 repetitions of the class taste tests, where
we assume that students are just guessing which cup is different (probability of 0.3333
that a student correctly identifies the different cola).
Go to the “One Proportion Inference” applet which lets you specify the conjectured
population proportion that can correctly distinguish the different cup (as specified by the
null hypothesis) as well as the number of observations in each trial (e.g., the number of
subjects in the study).
4. What value are you entering for the null value? What value are you entering for the
number of tosses?
5. Press the “Randomize” button to generate one version of the results that “could have
happened” if the null hypothesis was true. One dot is added to the dotplot. Clearly
explain what this dot represents.
6. Now change the Number of Repetitions box to 999 in order to generate a total of
1000 repetitions. Describe the resulting “what if the null is true” distribution.
7. How many students correctly identified the different cola in the class results? Enter
this value in the “As extreme as” box and press Count. What proportion of times out
of 1000 did you simulate a number as large or larger than the actual result of the
study?
8. What is your proportion from question 7 called?
9. Strength of evidence: What do you conclude about the strength of evidence that
the cola tasters are able to tell the difference between the two colas better than
random guessing based on our results? Why?
10. Is there a larger population to which you can infer your results?
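If the applet is unavailable, the whole simulation for this activity can also be carried out
with a short script. The Python sketch below wraps the “3S” simulation in one reusable
function; the class size of 25 tasters and the 12 correct identifications in the example call
are made-up numbers standing in for your own class results.

import random

def simulated_p_value(n_tasters, p_null, observed, n_reps=1000, seed=None):
    # approximate p-value: chance of 'observed' or more correct identifications
    # when each taster independently guesses correctly with probability p_null
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_reps):
        correct = sum(rng.random() < p_null for _ in range(n_tasters))
        if correct >= observed:
            extreme += 1
    return extreme / n_reps

# hypothetical class results: 25 tasters, 12 correct picks (replace with your own data)
print(simulated_p_value(n_tasters=25, p_null=1/3, observed=12, seed=4))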
Section 1.4: Plausible Values
In this section, we continue to apply the statistical process to explore a population
proportion in more detail. Often our goal is less about examining one particular
conjectured value for the parameter and more about identifying a range of plausible
values for the parameter.
Example: Competitive advantage of red uniforms?
Do athletes who wear red uniforms have an advantage over their competitors? To
investigate this question, Hill and Barton (Nature, 2005) examined the records in the
2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman
wrestling, and freestyle wrestling. Competitors in these sports were randomly assigned
to wear either a red or a blue uniform. The competitor wearing red defeated the
competitor wearing blue in 248 matches, and the competitor wearing blue emerged as
the winner in 209 matches.
1. Statistic: The proportion of matches won by the competitor wearing red is 248 / (248
+ 209) = 248 / 457 ≈ .543.
The question is: Do these data provide strong evidence that competitors wearing red
really do win more than half the time? To investigate this question we ask how unlikely it
would be to obtain 248 or more winners wearing red in 457 matches, if the winner is
really equally likely to be either competitor. In other words, we ask how often a sample
proportion of .543 or higher would result if in fact the population proportion were .5. The
null hypothesis is that 50% of matches are won by the competitor wearing red, and the
alternative hypothesis is that more than 50% of matches are won by the competitor
wearing red.
2. Simulate: We have seen in previous sections that we can use simulation to evaluate
the strength of evidence in favor of the alternative hypothesis. First, we will model the
outcome of the game with a coin flip and we simulate some data that “could have
happened” had the null hypothesis been true. To do this, we flip a coin 457 times,
representing the matches in these sports, and we count how many heads appear,
representing wins by the competitor wearing red. We then repeat this process many,
many times generating the “what if the null hypothesis was true” distribution. We used a
computer applet to simulate 1000 repetitions of flipping a coin 457 times. The results
are shown in Figure 1.8.
Figure 1.8: A “what if the null is true” distribution showing the number of heads in
1000 repetitions of a coin being flipped 457 times. Note that 248 is in the upper
tail and thus an unlikely event if the red competitor wins 50% of the time.
3. Strength of Evidence: We see that the outcome 248, which is the sample number of
matches won by the competitor wearing red, is in the upper tail of the distribution. In
fact, only 35 of the 1000 simulated samples produced 248 or more heads, so the
simulated p-value is .035. This is small enough to conclude that the sample data
provide fairly strong evidence that the competitor wearing red really does have a better
than 50% chance of winning the match.
Now what? We have reason to believe that the competitor wearing red wins more than
half the time, so a natural question to ask next is: How much more than half? In other
words, we have concluded that .5 is not a plausible value for the underlying probability
that the competitor wearing red wins a match, but what do we think are plausible values
for that underlying probability? To investigate this, we can perform a similar test using a
different potential value for this probability in the null hypothesis.
Let’s start by testing the null hypothesis that the competitor wearing red has a .55
probability of winning a match. We’ll simulate 1000 sets of 457 coin tosses with a .55
probability of landing heads. The results are shown in Figure 1.9.
Figure 1.9: When the probability that a competitor wearing red wins a match is .55,
we find that 248 out of 457 matches won by the red competitor is not out in the
tail, but lies fairly close to the center of this “what if the null is true” distribution.
In this case we see that the observed data value (248) is not in the tail of the distribution.
In fact, 39.5% of the above repetitions had a smaller number of successes (red
competitor wins) than 248. So we don’t have evidence against the null hypothesis that
the probability of winning while wearing red is .55.
Notice that although any number between 0 and 1 is a possible value for the underlying
probability that the competitor wearing red wins the match, our analyses above have
revealed that 0.5 does not seem to be a plausible (or believable) value in light of the
sample data, whereas 0.55 does appear to be plausible based on the sample data. This is
not surprising, considering that the observed sample proportion of matches won by the
red-wearing competitor is .543, much closer to .55 than to .5. But we still don’t know the
true probability of winning while wearing red. Using the scientific method, let’s rehypothesize. Maybe we could close in on the true probability by finding an upper bound.
Let’s next test the null hypothesis that the competitor wearing red has a 53% chance of
winning a match. The simulation results shown in Figure 1.10 reveal that an outcome of
248 is again not a surprising outcome in this “what if the null was true” distribution.
Figure 1.10: When the probability that a red-wearing competitor wins the match is
0.53, we find that 248 out of 457 matches won by the person wearing red is among the
typical values of our “what if the null is true” distribution.
So both .53 and .55 are considered plausible values for the probability of the red
competitor winning.
Key Idea: Keep in mind that not gathering enough evidence to support the alternative
hypothesis is not the same thing as concluding the null hypothesis is true. There
are almost always other null hypotheses that would also be plausible.
Let’s next test a 60% chance of winning the match. The simulation result shown in
Figure 1.11 reveals that an outcome of 248 wins (or less, looking in the more extreme
tail) by the red competitor would be very surprising if this null hypothesis were true. The
simulated p-value is .006, so we have very strong evidence that the actual probability of
winning while wearing red is less than .6.
Figure 1.11: When the probability that a red-wearing competitor wins the match is
0.6, we find that 248 out of 457 matches won by the person wearing red is out in
the lower tail of our “what if the null is true” distribution.
We can continue to use this analysis strategy to investigate many different null
hypotheses, each using a different value for the probability that the red-wearing
competitor wins the match. We’ve already tested whether .5, .53, .55, or .6 are plausible
values for this probability, and we’ve concluded so far that the probability is above .5 but
less than .6. We also found that .53 and .55 are plausible, because the observed data do
not fall in either tail of their "what if the null is true" distributions. There must be other values between
.5 and .6 that would not be rejected and thus can also be considered plausible values for
the true probability. Let’s go ahead and test values .51, .52, .53, and so on.
We used the applet to test all of these null hypotheses, with 1000 repetitions for each
analysis. Table 1.1 shows the strength of evidence for each of these hypothesized
values.
Table 1.1: p-values when testing various null hypothesis probabilities that the
competitor wearing red wins the match, based on the sample result of 248 wins in
457 matches
Null value    Simulated p-value    Strength of evidence against null value
0.5           0.04                 Strong
0.51          0.073                Moderate
0.52          0.17                 Not much
0.53          0.319                Not much
0.54          0.469                Not much
0.55          0.375                Not much
0.56          0.253                Not much
0.57          0.109                Not much
0.58          0.044                Strong
0.59          0.021                Strong
0.6           0.008                Very strong
Note: In this table we have always computed the p-value by looking in the tail of the
"what if the null hypothesis is true" distribution. This means that when the null
hypothesis value is less than the observed sample proportion (.543) we look in the upper
tail, and when the null hypothesis value is greater than .543 we look in the lower tail. In Chapter 2 we
introduce the idea of computing our p-values using a “two-sided” testing approach where
we measure strength of evidence based on both tails of the distribution.
This table illustrates that there are many values of the underlying probability that remain
plausible. Specifically, for values from .52 to .57, the observed data do not provide much
evidence against the null hypothesis, so we can say that it’s plausible that the actual
probability that the red competitor wins a match is between .52 and .57. This suggests a
very slight advantage for the competitor wearing red. Notice that these endpoints (.52
and .57) are almost equally far from the observed sample proportion (.543).
We could refine this analysis further by considering more possible values. For instance,
we could consider values from .510 to .530 and from .560 to .580 in order to “zero in”
more precisely on reasonable endpoints. We could also modify this procedure by
deciding that we want to use a more or less stringent rule for evidence against the null
hypothesis, for example by considering a value to be plausible unless the simulation
produces very strong evidence against it.
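If you would rather automate these repeated tests than run the applet once for each null value, the following rough Python sketch (ours, not the book's applet) loops over candidate values from .50 to .60 and keeps those whose simulated p-value exceeds a cutoff. The 0.05 cutoff is an assumption made for illustration; a different cutoff, or simply different simulation runs, can shift an endpoint of the resulting interval by a step.

    # Rough sketch: repeated simulation tests for the red-uniform data
    # (248 wins in 457 matches), one test per candidate null value.
    import random

    random.seed(2)
    n, observed = 457, 248
    observed_prop = observed / n
    repetitions = 1000

    def simulated_p_value(null_prob):
        """One-sided p-value, looking in the tail that contains the observed count."""
        tail_count = 0
        for _ in range(repetitions):
            wins = sum(1 for _ in range(n) if random.random() < null_prob)
            if observed_prop < null_prob:
                tail_count += (wins <= observed)   # observed sits in the lower tail
            else:
                tail_count += (wins >= observed)   # observed sits in the upper tail
        return tail_count / repetitions

    candidates = [round(0.50 + 0.01 * i, 2) for i in range(11)]   # 0.50, 0.51, ..., 0.60
    plausible = [p for p in candidates if simulated_p_value(p) > 0.05]
    print("plausible null values:", plausible)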
Key Idea: We consider a value plausible for the population parameter if a simulation
analysis based on that value does not put the observed data in the tail of the “what if null
were true” distribution. This analysis produces an interval of plausible values for a
population parameter based on the observed sample data. This interval of all such
plausible values is called a confidence interval.
We will further explore confidence intervals in Chapters 5, 6, 7, and 8.
Activity 1.4: Kissing Right?
Most people are right-handed, and for most people the right eye is dominant as well.
Molecular biologists have suggested that late-stage human embryos tend to turn their
heads to the right. In a study reported in Nature (2003), German bio-psychologist Onur
Güntürkün conjectured that this tendency to turn to the right manifests itself in other
ways as well, so he studied kissing couples to see if they tended to lean their heads to
the right while kissing. He and his researchers observed couples in public places such
as airports, train stations, beaches, and parks. They were careful not to include couples
who were holding objects such as luggage that might have affected which direction they
turned. For each couple observed, the researchers noted whether the couple leaned
their heads to the right or to the left. They observed 124 kissing couples, finding that 80
leaned to the right.
The question is: Do these data provide strong evidence that kissing couples really do
lean to the right a majority of the time? To investigate this question we ask how unlikely
it would be to obtain 80 or more couples leaning to the right in a random sample of 124
couples, if the couples are equally likely to lean right or left.
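As a brief illustration of what the applet is doing behind the scenes, here is a sketch of this simulation in Python. It is only an illustration; the activity questions below ask you to describe and run the analysis yourself.

    # Sketch of the kissing-couples simulation: 80 of 124 couples leaning
    # right, under the null hypothesis that right and left are equally likely.
    import random

    random.seed(3)
    n_couples, observed_right = 124, 80
    repetitions = 1000

    counts = [sum(1 for _ in range(n_couples) if random.random() < 0.5)
              for _ in range(repetitions)]
    p_value = sum(1 for c in counts if c >= observed_right) / repetitions
    print("simulated p-value:", p_value)   # typically very small in runs like this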
1. Statistic: Calculate the proportion of the observed couples who leaned to the right.
2. State the null and alternative hypotheses for this study.
This time we don’t want to ask you to take the time to actually do a lot of coin flipping,
but we do want you to think about how you would conduct this simulation analysis if you
had to.
3. Simulate: Describe how you would use a coin to conduct a simulation analysis to
determine whether these data provide strong evidence that in general kissing
couples really do tend (more often than not) to lean to the right. Provide sufficient
details that someone else could implement the analysis based solely on your
description. You should include statements about how you would make a conclusion
based on the simulation results.
4. Use technology (one proportion inference applet) to simulate 1000 repetitions of 124
couples, assuming the null hypothesis that couples are equally likely to lean right or
left is true. Where is the distribution centered? Also report your simulated p-value.
Would you consider the result of 80 or more out of 124 unlikely assuming the null
hypothesis is true?
5. Strength of evidence: Based on your simulation results, would you conclude that
the researchers’ data (80 of 124 couples leaning to the right) provide strong
evidence that couples in general really do tend to lean to the right more often
than to the left? Explain the reasoning behind your answer.
Now suppose the researchers believed that couples are more likely to lean to the right
when kissing. In fact, they believe this happens about 2/3 of the time, as that is similar to
other right-sided tendencies that have been observed (right-handedness, right-eye
dominance, embryos turning to the right in the womb, and so on).
6. Carry out a test of significance to decide whether you have convincing evidence that
couples turn to the right less than 2/3 of the time. Make sure you state the null and
alternative hypotheses, explain how you generate the simulation, and report the
empirical p-value.
7. What do you conclude about whether 2/3 (roughly .667) is a plausible value for the
probability a couple leans to the right?
8. Carry out a test of significance to decide whether you have convincing evidence that
couples lean to the right less than 75% of the time. Make sure you state the null and
alternative hypotheses, explain how you generate the simulation, and report your
simulated p-value.
9. What do you conclude about whether .75 is a plausible value for the probability a
couple leans to the right?
Confidence Interval: We can continue to use our analysis strategy to investigate many
different null hypotheses, each stipulating a different value for the probability that a
couple leans to the right. We’ve already tested whether 0.5, 0.67, or 0.75 is a plausible
value. Those analyses rejected the null hypothesis in two of the three cases, and we
concluded that the proportion of all couples that lean to the right when kissing is more
than 0.5 and less than 0.75. So there must be values between 0.50 and 0.75 that would not be
rejected and thus be plausible values for the true proportion of couples that lean to the
right when kissing. Let’s go ahead and test values such as 0.51, 0.52, 0.53, and so on.
Granted, this will get tedious, but with technology it’s not too cumbersome.
10. Use the applet to test all of these null hypotheses, starting with values 0.51, 0.52,
0.53, and continuing up to and including 0.74 for the probability that a kissing couple
leans to the right. Use 1000 repetitions for each analysis. Report the hypothesized
values which do not provide strong evidence (p-value > .05) against the null
hypothesis. [Hint: For now, always put the observed result in the “tail” of the
distribution, so you may need to switch the count inequality in the applet.]
11. Where does the observed sample proportion (.645) lie compared to this interval of
plausible values?
12. Why does it make sense that you obtained a large p-value for your test of the null
hypothesis that 65% of couples lean right? Does this mean that you’ve proven that
65% of kissing couples lean right? Why or why not?
Section 1.5: Effect of Sample Size
In this section we explore the important role of sample size in considering statistical
significance.
Example: Predicting Elections from Faces?
Do voters make judgments about political candidates based on how their faces look?
Can you correctly predict the outcome of an election, more often than not, simply by
choosing the candidate whose face is judged to look more competent? Researchers
investigated this question in a study published in Science (Todorov, Mandisodza, Goren,
and Hall, 2005). Participants were shown pictures of two candidates and asked which
one looked more competent. Researchers then predicted the winner to be the one
whose face was judged to look more competent by most of the participants.
For the 32 U.S. Senate races in 2004, this method predicted the winner correctly in 23 of
them. This “competent face” method therefore succeeded in 23/32 ≈ 0.719, or 71.9% of
the 32 races. Is this significantly higher than 50%? We can use the 3S method to
investigate: Assume that the “competent face” method is no better than flipping a coin,
and simulate the set of 32 races, over and over, for a total of 1,000 times. How often do
you get 23 or more correct predictions? Our results are shown in Figure 1.12.
Figure 1.12: A distribution of the number of correct predictions in 32 races if the
“competent face” method works half the time. In our sample of 32 races we found
the “competent face” method to work 23 times.
Only 9 of these 1,000 simulated sets of 32 races show 23 or more correct predictions, so
the simulated p-value is 0.009. This p-value is small enough to provide strong evidence
against the null hypothesis, in favor of concluding that the “competent face” method
makes the correct prediction more than half the time.
These researchers also predicted the outcomes of 279 races for the U.S. House of
Representatives in 2004. [While all 435 House seats were up for reelection in 2004, the
researchers were only able to obtain pictures for the winner and runner-up in 279 of
these races.] The “competent face” method correctly predicted the winner in 189 of
those races, which is a proportion of 189/279 ≈ 0.677, or 67.7% of the 279 House races.
Notice that this percentage is similar but a bit smaller than the 71.9% of correct
predictions in 32 Senate races.
Thought Question: Do you expect the strength of evidence for the “competent face”
method to be stronger, weaker, or essentially the same for the House results as
compared to the Senate results?
Let’s investigate this key question with a simulation of 1,000 repetitions of 279 races,
assuming that the “competent face” method does no better than a coin flip. (See Figure
1.13.)
Figure 1.13: A distribution of the number of correct predictions in 279 races if the
“competent face” method worked half the time. In our sample of 279 races we
found the “competent face” method to work 189 times.
Notice that none of these 1,000 repetitions produced 189 or more correct predictions. In
fact, none came close to that many correct predictions. Let’s try 10,000 repetitions.
Figure 1.14: A larger distribution of the number of correct predictions in 279 races
if the “competent face” method worked half the time. In our sample of 279 races
we found the “competent face” method to work 189 times.
Notice that this “fills in” and “smooths out” the distribution a bit, but we still did not obtain
a single simulated study with as many as 189 correct predictions. So the simulated p-value
from these data is extremely close to zero (less than .0001). These data provide
overwhelmingly strong evidence that the “competent face” method makes a correct
prediction more than half the time.
This analysis reveals that sample size plays a very important role in assessing statistical
significance. Even with a slightly smaller sample proportion of correct predictions, the
data from the 279 House races provide much stronger evidence that the “competent
face” method works better than a coin flip, as compared to the data from the 32 Senate
races.
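To make the comparison concrete, here is a sketch (ours) that estimates both p-values with 10,000 repetitions, as in Figure 1.14; the only inputs taken from the studies are the counts 23 of 32 and 189 of 279.

    # Similar statistic, two sample sizes: Senate (23 of 32) versus
    # House (189 of 279), each simulated under a coin-flip null hypothesis.
    import random

    random.seed(4)
    repetitions = 10_000

    def simulated_p_value(n_races, observed_correct):
        counts = [sum(1 for _ in range(n_races) if random.random() < 0.5)
                  for _ in range(repetitions)]
        return sum(1 for c in counts if c >= observed_correct) / repetitions

    print("Senate (23 of 32): ", simulated_p_value(32, 23))    # around .01
    print("House (189 of 279):", simulated_p_value(279, 189))  # essentially zero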
Key idea: Whenever you need to decide whether a sample proportion is strong
evidence against the null hypothesis, be sure to take into account the sample size of the
study. If two studies have the same value of the statistic, a larger sample provides
stronger evidence against a null hypothesis than a smaller sample does.
Also keep in mind that increasing the number of repetitions in the simulation gives us a
more accurate estimate of the p-value; unlike increasing the sample size, it does not
systematically make the p-value smaller.
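The following small sketch (ours) illustrates that point with the Senate data: raising the number of repetitions only makes the estimate of the p-value more precise; it does not drive the p-value toward zero the way a larger sample does.

    # More repetitions sharpen the estimate of the p-value; they do not
    # shrink it.  (Contrast this with the effect of a larger sample size.)
    import random

    random.seed(5)
    n_races, observed_correct = 32, 23   # the Senate data again

    def estimate_p_value(repetitions):
        tail = sum(1 for _ in range(repetitions)
                   if sum(1 for _ in range(n_races) if random.random() < 0.5)
                      >= observed_correct)
        return tail / repetitions

    print("1,000 repetitions:  ", estimate_p_value(1_000))
    print("100,000 repetitions:", estimate_p_value(100_000))
    # Both estimates hover around the same small value; the second is simply
    # a more precise estimate of it.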
Frequently Asked Question: More extreme versus bigger sample?
House Member: “My big House has a lot more data than your little Senate. So
I’m betting the House evidence is stronger, which means the House p-value will
be closer to 0 than your Senate p-value.”
Senator: “Big House indeed. Your percentage is only 67.7%. Mine is 71.9%,
which is more extreme – farther from 50/50 – so my evidence is stronger.”
House Member: “Hold on! Your sample is so small that there’s a lot of shakiness in your
percentage. It could easily have been lower. My percentage is much more solid,
because it’s based on a lot more data.”
The debate between the House Member and the Senator illustrates that the strength of
evidence, as captured by the p-value, is really driven by two components. The first
component is the distance of the sample proportion from the null hypothesis proportion,
and the other is the sample size. Both components contribute to how strong the evidence is
against the null hypothesis, and the p-value takes both into account. In later chapters,
we will see how these components are directly associated with the statistical power, a
measurement of how likely the study is to find evidence in favor of the alternative
hypothesis.
Activity 1.5: Baseball Big Bang
A reader wrote in to the “Ask Marilyn” column in Parade magazine to say that his
grandfather told him that in 3/4 of all baseball games, the winning team scores more
runs in one inning than the losing team scores in the entire game. (This phenomenon is
known as a “big bang.”) Marilyn responded that this probability seemed to be too high to
be believable.
1. Identify the relevant parameter in this study.
2. State the grandfather’s claim and Marilyn’s response, in terms of this parameter.
Also identify which is the null hypothesis and which is the alternative.
3. Suppose that you take a sample of Major League Baseball (MLB) games and find
that half of them contain a big bang. What more do you need to know before you
can test whether this provides strong evidence against the grandfather’s claim?
4. Suppose that Jose and Maria both take a sample of MLB games and keep track of
which contain a big bang and which do not. Suppose that Jose finds that half of his
sample of 10 games contains a big bang, and Maria finds that half of her sample of
50 games has a big bang. Which person do you expect to have the smaller p-value,
and therefore stronger evidence against the grandfather’s claim and in favor of
Marilyn’s alternative?
5. Use the One Proportion Inference applet to conduct a simulation analysis to simulate
the p-value for both Jose and Maria’s sample data. You will need to toggle the
greater than sign to less than for this test. Report the simulated p-value, and
summarize the strength of evidence, in each case. Also comment on whether your
analysis supports or refutes your answer to question 4.
6. Collect your own data on a sample of baseball games. Use a sample size of at least
20, but preferably much larger, and try to make your sample as representative of the
population as possible. Feel free to use Major League or minor league or college or
high school or Little League games. For each game take note of whether the game
contains a big bang or not. Describe how you collect your sample data, and report
your sample proportion of games with a big bang.
7. Continuing from question 6, conduct a simulation analysis to investigate how much
evidence your sample provides against the grandfather’s claim in favor of Marilyn’s
alternative. Submit a graph of the “what if the null is true” distribution, and report the
simulated p-value. Write a paragraph summarizing your conclusions and explaining
the reasoning process by which they follow from your analysis.
Chapter Summary
Basing decisions on data is better than relying on anecdotal information. The study of
statistics provides us with a systematic procedure for gathering data (evidence),
evaluating that evidence, suggesting conclusions based on that evidence, and assessing
our confidence in those conclusions.
Descriptive statistics involves summarizing information from data to detect patterns and
tendencies. Numbers that describe a sample are called statistics, whereas numbers that
describe a population are called parameters. In practice, we almost never know the
values of parameters, and so we use sample statistics to learn about population
parameters. This process is called inferential statistics.
Statistical investigations make use of the scientific method: Ask research question, form a
hypothesis, collect data, analyze results, draw conclusions, communicate results, repeat.
The research conjecture is stated as the alternative hypothesis, to be compared against
the null hypothesis that is often a statement of no effect or equality (the “dull”
hypothesis). As with the presumption of innocence in the criminal justice system, a test
of significance begins with the assumption that the null hypothesis is true. After
gathering and examining sample data, we conduct a test of significance to assess the
strength of evidence that the data provide against the null hypothesis in favor of the
alternative hypothesis. To assess the strength of evidence, we outlined a three-step
process that we call the 3S strategy:
1. Statistic: Calculate the value of a statistic from the sample data.
2. Simulate: Use simulation to hypothetically replicate the data collection process many
times, but under the assumption that the null hypothesis is true. Examine the “what if
null were true” distribution to see what are typical and unusual values of the statistic
when the null hypothesis is true.
3. Strength of evidence: If the observed data value falls in the tail of the simulated “what
if null were true” distribution, then the sample data provide strong evidence against
the null hypothesis in favor of the alternative hypothesis.
The key component of this strategy is the p-value, which is the probability of obtaining a
result as extreme as or more extreme than the one from the sample data, assuming the
null hypothesis were true. Small p-values provide us with strong evidence against the
null hypothesis, as they indicate our observed data would be unlikely to occur by chance
alone if the null hypothesis is true. Large p-values don’t provide us with enough evidence
to reject the null in favor of the alternative, so we consider the null hypothesis to be
plausible (but not proven). Also keep in mind the distinction between a possible value for
the parameter and a plausible or believable value for the parameter.
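A compact sketch of the 3S strategy in code (ours, not the book's applet, with the kissing-study counts used only as an example input) may help tie the three steps together.

    # Generic 3S sketch for a single proportion: Statistic, Simulate,
    # Strength of evidence (one-sided simulated p-value).
    import random

    def three_s(n, observed_count, null_prob, repetitions=1000, seed=0):
        random.seed(seed)
        statistic = observed_count / n                         # 1. Statistic
        counts = [sum(1 for _ in range(n) if random.random() < null_prob)
                  for _ in range(repetitions)]                 # 2. Simulate under the null
        if statistic >= null_prob:                             # 3. Strength of evidence
            tail = sum(1 for c in counts if c >= observed_count)
        else:
            tail = sum(1 for c in counts if c <= observed_count)
        return statistic, tail / repetitions

    # Example: 80 of 124 kissing couples leaning right, tested against 0.5
    stat, p_value = three_s(124, 80, 0.5)
    print("statistic:", round(stat, 3), " simulated p-value:", p_value)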
We can use repeated tests of significance under different null hypothesis values to
determine an interval of plausible values for a population parameter. If we get a large
p-value and thus can’t reject the null, the value specified by the null hypothesis is
considered to be plausible. This interval of plausible values is called a confidence interval.
Sample size plays a large role in assessing statistical significance. If the value of a
statistic is the same with two samples, then a larger sample provides stronger evidence
against the null hypothesis than a smaller sample.
So far, you have considered how to test values for a population proportion, but the same
logic will apply to other types of parameters as well.
Case Study: Will skipping breakfast lead to a world without men?
In a recent study that made headlines around the world, researchers at Oxford
University (Mathews et al. 2008) explored whether women who had high food
consumption at the time of conception were more likely to have boys. Two hundred and
forty-one first-time mothers who were classified as having high-food consumption at the
time of pregnancy were followed for nine months until the gender of their child was
identified.
1. Before learning about how the study turned out, specify the null and alternative
hypotheses you will use to test the researchers’ conjecture that the women who had
high food consumption were more likely to have boys than girls. Since all 241 women
were first-time mothers in England, it is important to know that, overall, the proportion
of male babies born to all first-time mothers in England is 51.2%.
2. Statistic: So, how did the study turn out? In 135 of the 241 pregnancies, a boy was
born. What proportion of the births to high-food consumption mothers resulted in a
boy being born?
3. Simulate: How can you use the coin flipping approach to test the hypotheses?
Specifically describe the process you would use (remember, you can’t use a 50/50
coin) to carry out the simulation, including how to calculate the p-value and make a
conclusion. Provide sufficient details that someone else could implement the analysis
based solely on your description. Your answer should not simply explain how to use
the applet.
4. Use technology to carry out the simulation. Produce a rough sketch of the number of
correctly predicted outcomes assuming the null hypothesis is true, and indicate
where the result observed by the researchers falls in that distribution. Also report the
p-value.
5. Strength of evidence: Based on your p-value, state your conclusions about whether
high-food consuming mothers were more likely to have boys.
6. Using non-statistical language, explain the process you conducted in order to arrive
at your conclusion.
7. These researchers also explored whether the likelihood of having boys amongst the
lowest food consuming mothers was less than expected. Write the null and
alternative hypothesis you wish to test on this data.
8. One hundred eight of the 240 lowest food consuming mothers had boys. Use
technology to analyze these data with the 3-S analysis strategy. Produce a rough
sketch of the null distribution, and indicate where the observed research result falls in
that distribution. Also report the p-value. Summarize your conclusions.
9. Compare the two p-values you calculated, remembering that the p-value is a
measure of the strength of evidence. Explain why, based on the data, it makes sense
that one of the p-values is smaller than the other.
10. In another analysis presented in their paper, the researchers tried to pinpoint specific
foods consumed by the mothers associated with the gender of the child. Of 300
women in the study who consumed at least one bowl of cereal per day during the
period of conception, 181 had boys. State hypotheses, analyze the data with the 3S
strategy, report the p-value and summarize your conclusions.
11. Do you think that this analysis proves that higher food and/or breakfast cereal
consumption is causing women to have male babies, or are other explanations
possible? Why or why not?
12. Perhaps she ate a lot of breakfast cereal, but in the early 1900s Annie Grace
Buckland Jones, married to Grover C. Jones Sr., of Peterstown, West Virginia, had
15 children---all boys! The Joneses’ large family size and the 15 consecutive male
children raised quite a stir at the New York World’s Fair in 1940, where they were
invited to be guests of then-President Franklin Delano Roosevelt and were featured
nationally on the radio. Of course, there were some naysayers who couldn’t believe
this to be possible; surely the Joneses must have just been faking it!
a) Using simulation, estimate the probability of getting 15 boys in a row. Make sure
you describe how you’ve carried out your simulation (including any simplifying
assumptions you have made) and estimated the probability.
b) Do you think that this proves that the Jones’ must have been faking it?
c) What are the chances that, after 15 boy babies in a row, the next baby (16th) born
to the Joneses will be a boy? It was!
d) What are the chances that, after having had 16 children, their next baby would be
a girl? It was!
13. We would be remiss in telling you the story of the Joneses if we did not mention
something that made them even more famous than having 16 boys in a row.
While playing horseshoes with one of his sons, “Punch,” Grover C. Jones and his
son found a bluish rock which they believed to be quartz. They placed it in a cigar
box, where it stayed for 14 years while the Joneses struggled through the Great
Depression. Later, in 1942, Grover brought the rock to a geology professor at a
nearby university, who determined that it was a 34.48-carat diamond—the largest
alluvial diamond ever found in the United States. What are the chances of that?
Don’t worry, you don’t need to calculate that value.
Research Article: Stock Monkey
Have you ever heard it said that taking your stockbroker’s investment advice is no
different from letting a monkey choose your stocks? This popular notion was first
introduced by Prof. Burton Malkiel in his book A Random Walk Down Wall Street.
Since that time a number of serious and not-so-serious studies have placed real
monkeys up against experts to “test” the theory.
Read the article from “The Daily Princetonian” which investigates the results of an
ongoing “game” published in the Wall Street Journal. Answer the remaining questions.
1. Did the experts or randomly chosen stocks perform better in the WSJ’s game? What
measure of stock performance do the authors use to make their point?
2. What is Malkiel’s reason for why the expert picks did better? What justification does
he give for being right?
3. Explain intuitively why Malkiel’s “towel” approach of buying multiple stocks makes
sense.
4. How does Malkiel argue that the price of a stock is determined?
5. Why do some people argue that the collapse of Enron means Malkiel’s theory is not
true?
6. How does Malkiel argue that the collapse of Enron gives his theory more evidence?
James “Jim” Cramer is host of the popular Mad Money investment advice show airing
daily on CNBC. In a popular segment called “The Lightning Round,” callers phone in and
ask Jim’s opinion as to whether a stock will go up or down. Over a 30-day period, Jim
Cramer’s picks of whether a stock would go up or down were right 124 times and wrong
122 times. Assume that in the same 30-day period, 50% of stocks went up and 50% went
down.
7. Carry out the 3S strategy for evaluating the quality of Jim’s picks. Make sure to state
null and alternative hypotheses.
8. To see if Jim Cramer’s advice holds up, folks compared Jim’s picks to those of
Leonard the Wonder Monkey who flipped a fair coin to predict whether the stock
would go up or down over the ensuing 30 day period. Is this a fair comparison of
Jim’s picks? If yes, why? If not, what might you do to make it more fair?
9. Imagine if instead of 50% of the stocks in the market went up over the same 30 day
period, that only 40% did. Carry out the 3S strategy for evaluating the quality of Jim’s
picks. Make sure to state null and alternative hypotheses.
10. Is comparing Jim’s picks to those of Leonard the Wonder Monkey a fair comparison
in this scenario? If yes, why? If not, what might make it more fair?
Imagine that you were to test your own stockbroker to see if they could give good short-term picks (like the test of Jim Cramer). You have your stockbroker pick some stocks
that he thinks will go up over the next 30 days. You then track all stocks for the next 30
days, and find that 55% of all stocks increased in price during the 30 day time period.
11. If you had asked your broker to pick 5 stocks, how many stocks would he have
needed to get right for you to be convinced he was “better than a monkey”? Why?
12. If you had asked your broker to pick 15 stocks, how many stocks would he have
needed to get right for you to be convinced he was “better than a monkey”? Why?
13. If you had asked your broker to pick 30 stocks, how many stocks would he have
needed to get right for you to be convinced he was “better than a monkey”? Why?
14. Calculate the proportion of “correct picks” needed for each of your answers to 11 to
13. Why is this different in the different scenarios?
Practice Exercises
1. Review the graph that summarizes the simulation for the Bob or Tim example. The
graph shows the results of 1000 different sets of 33 tosses of a fair coin. The goal of
the simulation was to find out how likely it is to get 23 or more heads in 33 tosses.
For each of (a) – (d) below, indicate what it corresponds to in the example.
a. Each toss that lands heads
A. An individual student
b. Each set of 33 tosses
B. A class of students
c. Each toss of a coin
C. The number of Tim matches
d. Each bar of the graph
D. A single Tim match
2. Review Activity 1.1, Friend or Foe. For each of (a) – (d) below, indicate what it
corresponds to in the activity.
a. Each toss that lands heads
A. An individual infant
b. Each set of 16 tosses
B. The 16 infants
c. Each toss of a coin
C. The number of helpers chosen
d. Each stack of dots in the graph
D. An infant choosing the helper toy
3. Review the Dog Food example and Discriminating Cola activity. Compare them with
the Bob/Tim and Friend/Foe scenarios. In both the Dog Food and Cola scenarios, a
set of individuals is asked to make a choice from a set of options. In all four cases,
the _______ (null/alternative) hypothesis says that the _______ (options/individuals)
are equally likely. Explain.
4. Review the graphs for the Bob/Tim, Dog Food, and Kissing examples.
• Each stack of dots corresponds to a value of the __________ (parameter/statistic).
• The height of the stack reveals _____________ (choose one of a-e):
a. How many individuals were asked to make a choice
b. How many options there were
c. How many times the value of the statistic occurred
d. How many times the value of the parameter occurred
e. The total number of repetitions
5. In these graphs, the p-value corresponds to ________ (choose one):
a. The height of the bar above the observed value of the statistic.
b. The area of the bars above or to the right of the observed value
c. The area of the bars above or to the left of the observed value
d. The number of bars above or to the right of the observed value
e. The number of bars above or to the left of the observed value
6. The set of individuals you want to know about is called the ________
(sample/population). The set of individuals you get to see is called the ________
(sample/population).
7. Suppose you take several different samples from the same population. You are
likely to get several different values for the _________ (statistic/parameter), but the
value of the ________ (statistic/parameter) will not change.
8. For almost all practical problems, once you have the data, you can compute the
value of the _________ (statistic/parameter) but you can’t compute the value of the
_________ (statistic/parameter).
9. For an election poll, the set of likely voters is often used as the _______
(population/sample).
10. Which one of the following statements is meaningful? (The other three are
meaningless.)
a. “The null hypothesis was statistically significant.”
b. “The observed difference was statistically significant.”
c. “The parameter was statistically significant.”
d. “The sample size was statistically significant.”
11. True or false, and explain:
a. The p-value depends on the sample.
b. The p-value depends on the null hypothesis.
c. The p-value is the probability that the null hypothesis is true.
d. The p-value is the probability that the alternative hypothesis is true.
12. If the p-value is below 0.05, the evidence against the _______ (null/alternative)
hypothesis is ________ (strong/weak).
13. Statistical testing allows you to reject the ________ (null/alternative) hypothesis
based on the data, but cannot justify rejecting the _________ (null/alternative)
hypothesis.
14. When you use the 3S method for hypothesis testing, the simulation you use to find
the p-value is based on the assumption that the ________ (null/alternative)
hypothesis is true.
15. When you use the 3S method, you use simulation to create a large number of
hypothetical _______ (samples/populations). For each one, you compute a value of
the ___________ (statistic/population) and look at the distribution of these values.
16. Multiple Choice. In a sample of size 20, we observe 12 (60%) of the subjects with
the characteristic of interest. In conducting the related test of significance we have a
computer applet simulate counting the number of heads when 20 fair coins are
flipped. This process is repeated 1000 times and a distribution of the number of
heads is made. This distribution represents:
a. Repeated results if the null hypothesis is true.
b. Repeated results if the alternative hypothesis is true.
c. Repeated results if the population proportion is the same as your sample
percentage of 60%.
d. Repeated results if the population proportion is more than your sample proportion
of 60%.
17. Multiple Choice. Suppose you sample 100 students at your school and find that 60%
of them exercise regularly. A friend at another school samples 200 students and
finds that 60% of that sample also exercises regularly. You both conduct tests of
significance and find p-values.
a. The sample with the larger sample size will have the smaller p-value.
b. The sample with the smaller sample size will have the smaller p-value.
c. Since the sample proportions are the same, the p-values will be the same.
d. There is no way to tell which sample will have the smaller p-value.
Exercises
Section 1.1 Introduction to Statistics
1. Pro-life or pro-choice In the May 16, 2009 edition of the Grand Rapids Press, the
article entitled “Most of us say we’re pro-life, poll finds” reported the following
information: A telephone interview conducted by Gallup involved 1,015 adults
nationwide. The poll found that 51% called themselves pro-life rather than pro-choice on the issue of abortion. This is the first time that a majority gave that answer
in the 15 years that Gallup has asked the question.
a. What is the population of interest in this study?
b. What is the sample?
c. Identify the statistic(s) in this study. If you don’t know the value of the statistic,
describe it in words.
d. Describe the parameter(s) in this study. If you don’t know the value of the
parameter, describe it in words.
2. Education and Government In August 2010, a Gallup poll asked a random sample
of 1,013 U.S. adults the following question, “In terms of public education in this
country, do you think the federal government [should be more involved in education
than it currently is, should keep its involvement the same, (or) should be less
involved in education than it currently is]?”
Here’s a bar chart that shows the distribution of responses.
a. Approximately what proportion of respondents want the federal government’s
involvement in public education to remain unchanged?
b. Is your proportion from part (a) a parameter or a statistic? How do you know?
c. How many respondents want the federal government’s involvement in public
education to remain unchanged?
d. Suppose that we want to use these data to test whether less than half of
American adults want the federal government to be more involved in public
education than it currently is. State appropriate null and alternative
hypotheses for this scenario.
3. Practice Suppose that our population of interest is all households in San Luis
Obispo County. Identify which of the following are possible parameters.
a. Mean number of residents in all households in the county
b. Whether there are any children under 12 years of age living in the household
c. Proportion of households with children under 12 years of age
d. Whether household is in an urban area
e. Whether size (number of members) of the household is related to the household
expenditure
4. Oil spill Following the oil spill in the Gulf of Mexico, The Pew Research Center for
the People and the Press surveyed a random sample of 994 U.S. adults in early May
2010, and found that 547 of the respondents thought that the oil spill was a major
environmental disaster. In the context of the study, answer the following.
a. Identify the population of interest.
b. Identify the sample.
c. Describe, in words, the relevant parameter of interest.
d. Describe, in words, the statistic corresponding to the parameter described in part
(c). Also, specify the numeric value of the statistic.
5. U.S. Census Did you know that the U.S. Census is required by law? A survey of
1,504 U.S. adults conducted by The Pew Research Center for the People and the
Press in early January 2010, found only 31% of the respondents knew this fact.
a. Identify the population of interest.
b. Identify the sample.
c. Is 31% a parameter or a statistic? How do you know?
d. Do the data provide evidence that less than a third of U.S. adults are “census-aware”? State appropriate null and alternative hypotheses for this scenario.
6. Practice The data collected from a sample of students are analyzed below with a bar
chart and frequency table. Based on this information:
a. What percentage of the sample are freshmen?
b. What percentage of the sample are sophomores?
c. What percentage of the sample are juniors?
d. What percentage of the sample are seniors?
e. What should be true about these four percentages?
[Bar chart of class standing for the sample, summarized in the frequency table below.]
Class        Count
Freshman     25
Sophomore    19
Junior       30
Senior       13
Total        87
Section 1.2 Introduction to Statistical Reasoning: One Proportion
7. Injection drug use Do a majority of women in Rhode Island Prisons who test
positive for Hepatitis C virus report injection drug use? A recent article in the
American Journal of Public Health (Macalino, G.F., et al., 2005) reported that, in a
representative sample of inmates at the time of intake to the prison system, 197
women tested positive for the Hepatitis C virus. Of these 197 women, 110 reported
injection drug use.
a. What proportion of women with Hepatitis C virus in this study reported injection
drug use?
b. Specify a null hypothesis and an alternative hypothesis to reflect the research
question that a majority of women who test positive for Hepatitis C report
injection drug use.
c. Use the coin flipping applet to generate 1000 repetitions of 197 Rhode Island
women prisoners who tested positive for Hepatitis C virus under the null
hypothesis. Save a screenshot of your dotplot and p-value.
d. Based on the study’s result, what is the p-value for this test?
e. What are your conclusions based on the p-value you found in part d?
f. To what population are you willing to generalize these results? Explain.
8. Staying up late? In a random sample of 40 students from a small college, 30 went
to bed after 11 pm the previous night. From these data we want to know whether we
can conclude that the majority of all students at the college go to bed after 11 pm.
a. What are the null and alternative hypotheses?
b. Statistic: What proportion of the sample went to bed after 11pm?
c. Simulate: Use the coin flipping applet to generate 1000 repetitions of 40 students
under the null hypothesis. Save a screenshot of your dotplot and p-value.
d. Strength of evidence: What are your conclusions based on the p-value you found
in part c?
9. Credit card usage Do a majority of female college freshmen at a certain college
have at least one credit card? To answer this question, some student researchers
collected data from 42 female freshmen at their school. They found that 27 of them
had at least one credit card.
a. State the null and alternative hypotheses for a test of significance of the research
question given above.
b. Use the coin flipping applet to answer the following questions and compute the p-value for this test by using 1000 repetitions.
i) What value did you enter for the “Probability of heads”?
ii) What value did you enter for the “Number of tosses”?
iii) What value did you enter for the “Number of repetitions”?
c. The graph generated is centered at approximately what number?
d. What is your p-value?
e. Use the p-value to give a conclusion for your significance test.
10. Multiple Choice The p-value of a test of significance is:
a. The probability, assuming the null hypothesis is true, that we would get a result
as extreme as the one that was actually observed.
b. The probability, assuming the alternative hypothesis is true, that we would get a
result as extreme as the one that was actually observed.
c. The probability the null hypothesis is true.
d. The probability the alternative hypothesis is true.
11. Planning to vote? The following figure shows the results of the coin tossing applet.
Suppose the research question was, “Do a majority of students say they plan to vote
in the next election?” Explain what the following numbers shown in the applet mean
in terms of the students being sampled and the process and results from the test of
significance.
a. What does the 0.5 for the probability of heads represent?
b. What does the 100 for the number of tosses represent?
c. What do the 1000 repetitions represent?
d. What does the 55 in the “as extreme as” cell represent?
e. What does the 0.202 for the proportion of repetitions represent?
f. What would be an appropriate conclusion for this test?
12. Monkey see A recent article (Hauser, Glynn, and Wood, 2007) described a study
that investigated whether rhesus monkeys have some ability to understand gestures
made by humans. In one part of the study, the experimenter approached individual
rhesus monkeys and placed 2 boxes an equal distance from the monkey. The
experimenter then placed food in one of the boxes, making sure that the monkey
could tell that one of the boxes received food without revealing which one. Finally,
the researcher made eye contact with the monkey and then gestured toward the box
with the food by jerking his head toward that box. This process was repeated for a
total of 40 rhesus monkeys. It turned out that 30 of the monkeys approached the box
that the human had gestured toward, and 10 approached the other box.
a. Describe how you could use a coin to conduct a simulation analysis of this study
and its result. Give sufficient detail that someone else could implement this
simulation analysis based on your description. Be sure to indicate how you
would decide whether the observed data provide convincing evidence that
rhesus monkeys can read human gestures better than random chance.
b. Use software to conduct a simulation analysis with at least 1000 repetitions.
Report the approximate p-value, and summarize the conclusion that you would
draw about the research question of whether rhesus monkeys have some ability
to understand gestures made by humans.
13. CPR on pets A national survey conducted on October 1-5, 2009 asked pet owners
whether they would perform CPR on their pet in the event of a medical emergency.
In the sample of 1116 pet owners, 58% said that they are at least somewhat likely to
perform CPR on their pet. Investigate whether this sample result provides strong
evidence that more than half of all pet owners in the U.S. are at least somewhat
willing to perform CPR on their pet.
14. Practice Indicate whether each of the following p-values provides extremely strong
evidence against the null hypothesis, moderate evidence against the null hypothesis,
or not much evidence against the null hypothesis.
a. 0.052
b. 0.00035
c. 0.417
Section 1.3 Statistical Significance for One Proportion: Other Null
Hypotheses
15. Practice What are the null and alternative hypotheses in the following scenarios?
a. Do a majority of males believe that cigars smell good?
b. Hershey’s claims they put 45% orange Reese’s Pieces in their Reese’s Pieces
candy mixture. Is this true, or is the true percent different from 45%?
c. Do a majority of Americans believe in life-after-death?
d. Do less than 30% of students at your school read for pleasure during the term?
e. Do more than 12% of students at your school skip breakfast every morning?
f. Do more than 25% of the students at your school exercise at least three times
per week?
g. Do fewer than 5% of people aged 14 or older get arrested per year?
16. Psychic powers Statistician Jessica Utts has conducted extensive analysis of
studies that have investigated psychic functioning. One type of study involves
having one person (called the “sender”) concentrate on an image while a person in
another room (the “receiver”) tries to determine which image is being “sent.” The
receiver is given four images to choose from, one of which is the actual image that
the sender is concentrating on.
a. If the subjects in these studies have no psychic ability, what is the probability that
they identify the correct image? Is this a null hypothesis or an alternative
hypothesis?
b. Utts (1995) cites research from Bern and Honorton (1994) that analyzed studies
using a technique called ganzfeld. These researchers analyzed a total of 329
sessions (http://www.ics.uci.edu/~jutts/air.pdf). Use software to simulate 1000
repetitions of this study, assuming the null hypothesis to be true. Produce a well-labeled graph of the results.
c. Based on the graph of your simulation results, about how many of these 329
sessions would have to produce a “hit” (correct identification of the image being
“sent”) in order to provide very strong evidence against the null hypothesis?
Explain how you arrive at this number based on your graph.
d. Utts reported that Bern and Honorton found a total of 106 “hits” in the 329
sessions. Does this result provide very strong evidence against the null
hypothesis? Explain, and summarize your conclusion in the context of this study.
17. Reese’s pieces An astute statistics student wants to determine whether there are
more than 45% orange Reese’s Pieces in small bags of Reese’s Pieces. Being low
on funds, she purchases only one small bag of Reese’s Pieces. There are 25 total
Reese’s Pieces in her bag. Sixteen are orange. The student quickly simulates the
distribution of the number of orange Reese’s Pieces from 1000 samples of 25 total
Reese’s Pieces assuming there are indeed 45% orange in the population. The
graph resulting from 1000 repetitions of this simulation is displayed below.
[Dotplot: number of orange Reese’s Pieces in 1000 simulated samples of 25, assuming 45% orange.]
a. State a null hypothesis and an alternative hypothesis for this student.
b. What percentage of the student’s Reese’s pieces is orange? Is this more than
the 45% she is testing?
c. Use the graph above to find the p-value for this test. What conclusion would you
draw about the null hypothesis from this p-value?
d. Suppose the student’s original bag had 13 orange instead of 16. Now, what
percentage of the student’s Reese’s pieces is orange? Use the graph above to
find the p-value for this test.
e. Which p-value offers more support for the alternative hypothesis that there are
more than 45% orange Reese’s Pieces? Explain why this makes sense using
the percentage of orange Reese’s Pieces you found in (b) and (d).
18. Fantasy golf A statistics professor is in a fantasy golf league with 4 friends. Each
week one of the 5 people in the league is the winner of that week’s tournament.
During the 2010 season, this particular professor was the winner in 7 of the first 12
weeks of the season. Does this constitute strong evidence that his probability of
winning in one week was larger than would be expected if the 5 competitors were
equally likely to win? Conduct a simulation analysis to investigate this question.
Write a paragraph summarizing your conclusion and explaining the reasoning
process by which your conclusion follows from the simulation analysis.
20. Breakfast Findings at James Madison University indicate that 21% of students eat
breakfast 6 or 7 times a week. A similar question was asked of a random sample of
159 students at a different college. Of the 97 who responded, 35 reported eating
breakfast 6 or 7 times a week. Do students at the other college have healthier
breakfast habits than James Madison students? More specifically, do more than
21% of all students at the other college eat breakfast 6 or 7 times weekly?
a. State your null and alternative hypotheses.
b. Statistic: What sample proportion of students at the other college eat breakfast 6
or 7 times per week?
c. Simulate: Use the coin flipping applet to simulate “could have been” outcomes
under the null hypothesis 1000 times.
d. Based on the study’s result, what is the p-value for this test?
e. Strength of evidence: What are your conclusions based on the p-value you found
in part d?
f. What are your thoughts about the fact that only 97 out of the random sample of
159 responded?
21. Cloning humans In a May 2010 poll of a random sample of 1,029 U.S. adults,
Gallup found that 88% thought that the cloning of humans was morally unacceptable.
Explain what, if anything, is incorrect in the following statements.
a. The population is the 1,029 U.S. adults who were interviewed.
b. The population is U.S. adults who think that cloning of humans is morally
unacceptable.
c. The number 88% is a parameter.
d. The sample is all U.S. adults.
e. The statistic is 1,029 U.S. adults.
f. The statistic is the average number of U.S. adults who think cloning humans is
morally unacceptable.
g. If we repeatedly poll random samples of 1,029 U.S. adults, the percentage of
respondents who think that the cloning of humans is morally unacceptable is
always going to be 88%.
22. Cloning humans (contd.) Reconsider the previous exercise. Suppose that we want
to use the data to test whether the proportion of U.S. adults who think that the
cloning of humans is morally unacceptable is over 0.80. What, if anything, is
incorrect in the following?
a. Alternative hypothesis: The proportion of U.S. adults who think that the cloning of
humans is morally unacceptable is less than 0.80.
b. The p-value is the probability of observing a sample proportion of 0.88, if the
proportion of U.S. adults who think that the cloning of humans is morally
unacceptable is 0.80.
c. Another survey of 1,029 randomly selected U.S. adults, run by a different survey
team found 85% of their respondents think that the cloning of humans is morally
unacceptable. These data would be stronger evidence against the null
hypothesis, than the data collected by Gallup.
23. Pro-choice or pro-life (contd.) Reconsider the Pro-life or pro-choice exercise (Exercise 1). Suppose that we want to
use the data from Gallup’s survey to test whether a majority of U.S. adults are now
pro-life rather than pro-choice. Using the Simulating Coin Tossing applet the
simulated p-value was found to be 0.276. What, if anything, is incorrect in the
following statements?
a. Null hypothesis: A majority of U.S. adults are now pro-life rather than pro-choice.
b. Since the p-value is fairly large, we have strong evidence that exactly half of all
U.S. adults are pro-life and exactly half are pro-choice.
c. Since the p-value is fairly small, we have strong evidence that a majority of U.S.
adults are now pro-life.
d. Since the p-value is fairly large, this survey has provided no information about
the proportion of U.S. adults who are now pro-life.
Section 1.4 Plausible Values
24. Multiple Choice If you don’t have much evidence against the null hypothesis, you
can:
a. Conclude the null hypothesis must be true
b. Conclude that the null hypothesis value is one of a set of plausible values
25. Democrat or Republican A political poll finds that 305 of 600 likely voters are in
favor of the Democrat over the Republican in a recent local election between the two
candidates. Is this evidence that the majority of likely voters are in favor of the
Democrat candidate?
a. What is the proportion of likely voters in favor of the Democrat candidate in this
sample?
b. Use the 3S process to evaluate whether this sample is evidence that the majority
of likely voters in the population favor the Democrat candidate.
c. Does your answer to b) mean that you have proven that the population of all
likely voters is perfectly split (50/50) between the two candidates?
26. Outbreak Recently a small college in the Midwest experienced an outbreak of
norovirus which forced its temporary closure. In a survey of a random sample of
students after the outbreak, 34 out of 187 students reported experiencing symptoms
of norovirus during the outbreak.
a. Find the proportion of students reporting symptoms of norovirus.
b. What is the population parameter of interest?
c. State null and alternative hypotheses and then use the 3S process to investigate
whether this sample provides evidence that more than 10% of students at the
college experienced symptoms.
d. State null and alternative hypotheses and then use the 3S process to investigate
whether this sample provides evidence that less than 30% of students at the
college experienced symptoms.
e. Use your results from c) and d) and more computer simulations to find a range of
plausible values of the population parameter. Remember that a range of
plausible values is a set of numbers against which your sample does not provide
much evidence. Make your range accurate to the nearest 1 percentage point.
Clarify how you determined convincing evidence or not much evidence.
Section 1.5 Effect of Sample Size
27. Phone home Suppose we want to test the hypothesis that more than 50% of
college students call home at least once a week, and that we use the 3S analysis
strategy to estimate the p-value. Also suppose that four students collect data to try
to answer the question. Each student collects a different sample size (one of 10,
one of 20, one of 40, and one of 80). Surprisingly, in each case the results showed
that 70% of the students in the respective samples called home at least once a
week.
Use the following dotplots of 1,000 repetitions of these simulations to decide whether
each result is statistically significant (small p-value) and whether each student can
therefore conclude that more than 50% of students call home at least once a week.
Also comment on what happened to the p-value as the sample size increased even
though the sample proportion stayed the same. (A software version of this
simulation is sketched after the dotplots.)
[Four dotplots, one per student, each showing 1,000 simulated sample results with
the horizontal axis labeled "number of college students calling home." The captions
report what each student observed: 7 out of 10 (or 70%), 14 out of 20 (or 70%),
28 out of 40 (or 70%), and 56 out of 80 (or 70%) of the students in the sample called
home at least once a week.]
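If you would rather generate the four null distributions yourself than read them off the dotplots, the following Python sketch (numpy assumed, as in the earlier sketches) repeats the same fair-coin simulation at each of the four sample sizes; in every case the observed result is 70% of the sample.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    reps = 1000  # matches the 1,000 repetitions shown in the dotplots

    for n, observed in [(10, 7), (20, 14), (40, 28), (80, 56)]:
        # Null hypothesis: 50% of students call home, so each sample is n tosses of a fair coin
        sim_counts = rng.binomial(n=n, p=0.5, size=reps)
        p_value = np.mean(sim_counts >= observed)
        print(f"n = {n:3d}: observed 70%, approximate p-value = {p_value:.3f}")

Watch how the approximate p-value changes as n grows even though the sample proportion never moves from 70%.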
28. Story falls flat A legendary campus story tells of two students who miss an exam
because they are off partying. When they return to campus, they sheepishly
approach the professor and say that they missed the exam because of a flat tire.
The students are delighted when the professor grants them an opportunity to take a
make-up exam. But when they are sent to separate rooms to take the make-up
exam, they find that one question is worth 95 points: Which tire was flat? It’s been
conjectured that when students are asked this question and forced to give an answer
(left front, left rear, right front, or right rear) off the top of their head, they tend to
answer “right front” more than would be expected by random chance. To test this
conjecture, this question was asked of a recent class of 32 students, with the
following results:
Left front   Left rear   Right front   Right rear
    5            4            18            5
a. State the appropriate null and alternative hypotheses to be tested.
b. Produce a bar graph to display the student responses, and comment on what the
graph reveals.
c. Statistic: Calculate the sample proportion who answered “right front.” Does this
statistic appear to support the research conjecture? Explain.
d. Simulate: Use software to conduct a simulation analysis investigating whether the
sample data provide strong evidence for the research conjecture. Submit a graph
of the "what if the null hypothesis were true" distribution, and report the
approximate p-value. (One possible software sketch follows this exercise.)
e. Strength of Evidence: Explain what this p-value means. In other words, this is
the probability of what, assuming what? (Do not draw a conclusion from the
p-value yet; that's the next question.)
f. Summarize the conclusion that you draw from this test. Also explain the
reasoning process behind your conclusion.
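Part (d) asks for a software simulation; any tool that simulates a spinner with a 1-in-4 chance of landing on "right front" will do. One possible Python sketch, again assuming numpy is available, is:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    n = 32          # students asked which tire was flat
    observed = 18   # answered "right front"
    reps = 10_000

    # Null hypothesis: answers are random guesses, so "right front" has probability 1/4
    sim_counts = rng.binomial(n=n, p=0.25, size=reps)

    # p-value: chance of 18 or more "right front" answers from guessing alone.
    # A histogram of sim_counts (for example, matplotlib's plt.hist) is one way to draw
    # the "what if the null hypothesis were true" distribution requested in part (d).
    p_value = np.mean(sim_counts >= observed)
    print(f"approximate p-value: {p_value:.4f}")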
29. Story falls flat (contd.) Reconsider the previous exercise. Suppose another class
conducts the same study with exactly half as many students, and suppose the
proportion answering "right front" is identical to the previous exercise. In other
words, 9 out of 16 students answered "right front."
a. Before you analyze the data, would you expect to find stronger evidence for the
research conjecture (that people pick the right front tire more than ¼ of the time),
weaker evidence, or the same strength of evidence? Explain your thinking.
b. Conduct a simulation analysis to produce an approximate p-value. How does it
compare to the p-value from the previous exercise? Is this what you expected?
Explain.
30. Haiti earthquake On January 12, 2010, a 7.0 magnitude earthquake shook Haiti,
affecting about three million people. Aid started pouring in from all parts of the world
in the form of volunteer services, donations, etc. In a February 2010 survey of 1,383
randomly chosen U.S. adults, the Pew Research Center for the People and the
Press found that 719 of the respondents had made a donation to Haiti victims. We
want to use these data to investigate whether more than half of U.S. adults made a
donation to Haiti victims.
a. Identify the population of interest.
b. Describe the relevant parameter of interest.
c. State the null hypothesis and the alternative hypothesis.
d. Of the people that the Pew Research Center interviewed, what proportion made
a donation to Haiti victims?
e. Is the proportion you calculated in part (d) a parameter or a statistic? How do you
know?
f. The study description says that we are looking at data from “randomly chosen
U.S. adults.” What is the main advantage of having data from “randomly chosen
U.S. adults”?
g. Investigate whether the data provide evidence that more than half of U.S. adults
made a donation to Haiti victims. Be sure to include the p-value, an interpretation
of the p-value, and your conclusion in the context of the study.
h. A high school student surveys a random sample of 200 adults in his city, and
finds that 52% of the respondents made donations to Haiti victims. The high
school student decides to use his data to test whether more than half of the
adults in the city made a donation to Haiti victims. Will the p-value from his
analysis be larger than, smaller than, or the same as your p-value from part (g)?
Explain your choice.
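Parts (g) and (h) invite the same fair-coin simulation at two different sample sizes. In the sketch below (Python with numpy, as before), the high-school count of 104 is simply 52% of 200, our own arithmetic rather than a figure reported in the exercise; running both simulations shows how sample size affects the p-value when the sample proportions are nearly identical.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    reps = 10_000

    surveys = [
        ("Pew survey", 1383, 719),         # 719 of 1,383 respondents, about 52%
        ("high-school survey", 200, 104),  # 52% of 200 respondents (our arithmetic)
    ]

    for label, n, observed in surveys:
        # Null hypothesis: exactly half of the adults in question made a donation
        sim_counts = rng.binomial(n=n, p=0.5, size=reps)
        p_value = np.mean(sim_counts >= observed)
        print(f"{label}: observed proportion = {observed / n:.3f}, "
              f"approximate p-value = {p_value:.3f}")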