Methods for Two Categorical Variables – Fisher`s Exact Test

advertisement
Methods for Two Categorical Variables – Fisher’s Exact Test
Example: MythBusters, a popular television show on the Discovery Channel, once conducted an
experiment to investigate whether or not yawning is contagious. The premise of the experiment was to
invite a stranger to sit in a booth for an extended period of time. Fifty subjects were said to be tested in
total, 34 of which were “seeded” with a yawn by the person conducting the experiment. The other 16
were not given a yawn seed. Using a two-way mirror and a hidden camera, the experimenters observed
and recorded the results given below.
Seeded with a yawn
Not seeded with a yawn
Yawned
10
4
Did not yawn
24
12
Total
34
16
Research Question – Are those individuals “seeded” with a yawn more likely to actually yawn?
When we analyze data on two variables, the first step is to determine which is the __________________
Variable and which is the ___________________ (or predictor) variable.

Response variable: The __________________ variable on which comparisons are made.

Explanatory (or predictor) variable: This defines the ____________ to be compared.
Questions:
1. What are the variables in the MythBusters yawning experiment? Are they categorical or
numerical?
2. Which is the response variable? Which is the explanatory variable?
Descriptive Methods for Two Categorical Variables

A ______________________________ shows the joint frequencies of the two categorical
variables. The rows of the table denote the categories for the first variable and the columns
denote the categories of the second variable.

A _________________________ is a visual representation of the relationship between two
categorical variables. A mosaic plot graphically represents the information given in the
contingency table.
1
To obtain these descriptive summaries, the data must be entered into JMP in the following manner.
Next, click Analyze Fit Y by X. Put the response variable (Yawned) in the Y, Response box , the
explanatory variable (Group) in the X, Factor box , and Count in the Freq box as shown below.
Click OK and JMP will return the following output.
Questions:
1. Find the proportion that yawn in the Seeded group.
2
2. Find the proportion that yawn in the Not Seeded group.
3. Find the difference in the proportion that yawn between these two groups. Do these
proportions differ in the direction conjectured by the researchers?
4. Even if the seeding of a yawn had absolutely no effect on whether or not a subject actually
yawned, is it possible to have obtained a difference such as this by random chance alone?
Explain.
Conducting a Simulation Study using Playing Cards
The above descriptive analysis tells us what we learned about the 50 subjects in the study. Can we
make inferences beyond what happened in the study (i.e., general statements about the population)?
Does the higher proportion of yawners in the seeded group provide convincing evidence that being
seeded with a yawn actually makes a person more likely to yawn? Note that it is possible that random
chance alone could have led to this large of a difference. That is, while it is possible that yawn seeding
had no effect and the MythBusters happened to observe more yawners in the seeded group just by
chance, the key question is whether it is probably (likely to happen).
We will answer this question by replicating the experiment over and over again, but in a situation where
we know that yawn seeding has no effect (i.e., the null model). We’ll start with 14 yawners and 36 nonyawners and we’ll randomly assign 34 of the 50 subjects to the seeded group and the remaining 16 to
the not seeded group.
Note that we could use playing cards to replicate this experiment: let 14 red cards represent the
yawners and 36 black cards represent the non-yawners. Shuffle the cards well, and randomly deal of 34
to be the seeded group. Then construct the contingency table to show the number of yawners and nonyawners in each group.
Seeded with a yawn
Not seeded with a yawn
Total
Yawned
Did not yawn
14
36
Total
34
16
50
3
Next, note that if you know the number of yawners in the Seeded group, then you can fill in the rest of
the cells in the contingency table. So, we need only focus on the number of yawners in the seeded
group. Record the number of yawners in the Seeded group from your first simulated experiment in the
table below. Then, repeat this randomization process nine more times, recording your results in the
table below.
Simulated Experiment
Number of Yawners in
Seeded Group
1
2
3
4
5
6
7
8
9
10
Next, create a dotplot of the results.
Number of Yawners Randomly Assigned to Seeded Group
Questions:
5. How many randomizations were performed by the class as a whole? What proportion gave
results at least as extreme as the actual study (10 or more yawner in the seeded group)?
6. Note that the random process used in this simulation study models the situation where the
yawn seeding has no effect on whether the subjects actually yawns – we simply assume there
are 14 people who will yawn no matter what group they are in, and they are assigned to the
two groups at random. Based on the simulation study, does it appear that random assignment
of yawners to groups will result in 10 or more yawners in the seeded group just by chance?
Explain.
7. Recall that the MythBusters obtained 10 yawners in the seeded group. Given your answer to
Question 6, would you say the data provides evidence to support the research question?
Explain.
4
Conducting a Simulation Study using Tinkerplots 2®
Note that we could use Tinkerplots 2® to simulate this process more efficiently. This will enable us to
obtain many more simulated results so that we can be more confident in our answer to Question 7
above.
Open the file YawnExperiment.tp from the course website. Note that each mixer contains 50
“subjects.” The first mixer contains 14 yawners and 36 non-yawners. The second mixer contains 34
subjects in the seeded group and 16 in the not seeded group. Make sure the Repeat value is set to 50
(since there were 50 subjects in the study) and set BOTH mixers to sample without replacement as
shown below.
Click Run and you’ll see the results for the first simulated trial.
5
To get a graph of these results, drag a new plot to the workspace. Drag the predictor variable (group) to
the x-axis and the response variable (yawn) to the y-axis. You can also show the counts and vertically
stack the points as shown below.
Recall, we are interested in the number of yawns in the seeded group, so that is what we’ll want to track
and collect our statistic on. Right click and collect 99 more trials.
Finally, we can graph our 100 trials as we’ve done previously.
6
Questions:
8. What does each dot on the plot represent?
9. What value(s) occurred most often by chance under the null model? Explain why this makes
sense.
10. How often did we see results at least as extreme as the observed data (10 or more yawners in
the seeded group) under the null model? Calculate the proportion of simulated results in which
we observed 10 or more yawners in the seeded group. Note that this is an approximate pvalue!
11. The MythBusters reported the following results: 25% yawned of those not given a yawn seed,
and 29% yawned of those given a yawn seed. Then, they cited the "large sample size" and the
4% difference in the proportion that yawned between the seeded and non-seeded group to
confidently conclude that yawn seed had a significant effect on the subjects. Therefore, they
concluded that the yawn is decisively contagious. Do you agree or disagree with their answer?
Justify your reasoning.
Fisher’s Exact Test to Obtain Exact p-values
We just used a simulation study to answer the research question of interest to the MythBusters. This is
a valid analysis. However, it provides only an approximate p-value. We can obtain the exact p-value
using probability theory and a distribution known as the _________________ distribution. Details about
the distribution are beyond the scope of this class, but we will discuss __________________________
test which uses the hypergeometric distribution to calculate p-values.
7
Note the remaining output obtained when choosing Analyze  Fit Y by X in JMP.
JMP has already used the hypergeometric distribution to find the probability of observing results as
extreme as was observed in the MythBusters experiment. Now, we can use the output to carry out the
formal hypothesis test.
Step 0: Define the research question
Are those individuals “seeded” with a yawn more likely to actually yawn?
Step 1: Determine the null and alternative hypotheses
H0: The proportion of yawners is equal for the seeded and not seeded groups (i.e., the
yawn seeding has no effect on whether or not a person yawns).
Ha: The proportion of yawners in the seeded group is larger than the not seeded group
(i.e., those seeded with a yawn are more likely to yawn).
We can also write the hypotheses in symbols as follows:
H0: pseeded = pnot-seeded
Ha: pseeded > pnot-seeded
Step 2: Find the p-value
8
Step 3: Write the conclusion in terms of the research question
Example: Studies of children with reading disabilities typically focus on “early-emerging” difficulties
identified prior to fourth grade. Psychologists at Haskins Laboratories recently studied children with
“late-emerging” reading difficulties (i.e. children who appeared to undergo a fourth-grade “slump” in
reading achievement) and published their findings in the Journal of Educational Psychology (June 2003).
A sample of 161 children was selected from fourth and fifth graders at elementary schools in
Philadelphia. In addition to recording the grade level, the researchers determined whether each child
had a previously undetected reading disability. Sixty-six children were diagnosed with a reading
disability. Of these children, 32 were fourth graders and 34 were fifth graders. Similarly, of the 95
children with normal reading achievement, 55 were fourth graders and 40 were fifth graders.
Questions:
12. Define the explanatory (predictor) variable for this study.
13. Define the response variable for this study.
14. Using the information from the scenario above, fill in the contingency table below.
Normal Reading
Achievement
4th Graders
5th Graders
Total
Diagnosed Reading
Disability
Total
161
15. Carry out the appropriate hypothesis test to answer the following research question.
Step 0: Define the research question
Is there evidence that the proportion of 4th graders with a diagnosed reading
disability less than the proportion of 5th graders with a diagnosed reading
disability?
9
Step 1: Determine the null and alternative hypotheses
Step 2: Find the p-value
Step 3: Write the conclusion in terms of the research study
Example: Some researchers hypothesized that swimming with dolphins would be therapeutic for
patients suffering from clinical depression. To investigate the possibility researchers recruited 30
subjects aged 18-65 with a clinical diagnosis of mild to moderate depression. Subjects were required to
discontinue the use of any antidepressant drugs or psychotherapy four weeks prior to the experiment
and throughout the experiment. These 30 subjects were taken to an island off the coast of Honduras,
where they were randomly assigned to one of the two treatment groups. Both groups engaged in the
same amount of swimming and snorkeling each day, but one group did so in the presence of bottlenose
dolphins and the other did not. At the end of two weeks, each subjects’ level of depression was
evaluated, as it had been at the beginning of the study. The results are summarized in the following
table.
Improvement
Group
Yes
No
Totals
Dolphin Therapy
10
5
15
Control Group (no dolphins)
2
13
15
Totals
12
18
30
10
Questions:
16. Define the explanatory (predictor) variable for this study.
17. Define the response variable for this study.
18. Carry out the appropriate hypothesis test to answer the following research question.
Step 0: Define the research question
Is there evidence that swimming with dolphins would be therapeutic for patients
suffering from clinical depression (i.e., they would show improvement) ?
Step 1: Determine the null and alternative hypotheses
Step 2: Find the p-value
Step 3: Write the conclusion in terms of the research study
11
Download