Statistical Pitfalls in Cognitive Neuroscience (and Beyond)
Eric-Jan Wagenmakers
Overview
 The cliff effect
 Why p-values pollute
 The hidden prevalence of exploratory work
Pitfall
 “Escape latency in the Morris water maze was affected by lesions of the entorhinal cortex (P < 0.05), but was spared by lesions of the perirhinal and postrhinal cortices (both P values > 0.1), pointing to a specific role for the entorhinal cortex in spatial memory.”
Pitfall
 The difference between significant and not
significant is itself not necessarily significant!
(Gelman & Stern, 2006)
 “Surely, God loves the 0.06 nearly as much as
the 0.05” (Rosnow & Rosenthal, 1989)
 Instead of considering the difference in p-values, we should consider the p-value for the difference.
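A minimal sketch of this pitfall (all numbers hypothetical, not from the slides): two effects can land on opposite sides of .05 while the test of their difference is nowhere near significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lesion_a = rng.normal(0.5, 1.0, 20)   # hypothetical effect, "significant" on its own
lesion_b = rng.normal(0.3, 1.0, 20)   # hypothetical effect, "not significant" on its own

# Separate tests against zero: one may fall below .05, the other above .1
print(stats.ttest_1samp(lesion_a, 0).pvalue)
print(stats.ttest_1samp(lesion_b, 0).pvalue)

# The correct question: test the difference directly
print(stats.ttest_ind(lesion_a, lesion_b).pvalue)   # typically far from .05
```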
The Imager’s Fallacy
 By painting the brain according to a voxel’s p-value, imagers are particularly susceptible to this pitfall.
 One area has a pretty color, the other area does not. Conclusion: the areas differ from one another.
 This conclusion is wrong; the difference itself
was never tested.
One Possible Solution
 Determine the least-significant voxel (LSV).
 Compare the non-significant voxels against the LSV:
   If the difference is significant, these non-significant voxels differ from the significant voxels;
   If the difference is not significant, however, these voxels are “in limbo” and more data are needed for their classification.
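A minimal sketch of this LSV comparison, assuming independent voxel-level estimates and a simple z test on the difference (all numbers hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical voxel-wise contrast estimates and standard errors
est = np.array([2.5, 2.1, 1.4, 0.9])
se  = np.array([0.5, 0.5, 0.5, 0.5])
p = 2 * stats.norm.sf(np.abs(est / se))   # voxel-wise two-sided p-values
sig = p < .05

# Least-significant voxel (LSV): the significant voxel with the largest p-value
lsv = np.where(sig)[0][np.argmax(p[sig])]

for v in np.where(~sig)[0]:
    # z test on the difference between voxel v and the LSV
    z_diff = (est[lsv] - est[v]) / np.sqrt(se[lsv] ** 2 + se[v] ** 2)
    p_diff = 2 * stats.norm.sf(abs(z_diff))
    verdict = "differs from significant voxels" if p_diff < .05 else "in limbo"
    print(f"voxel {v}: p_diff = {p_diff:.3f} -> {verdict}")
```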
Imager’s Fallacy
[Cartoon] NB. Other ACME products may work as well!
Overview
 The cliff effect
 Why p-values pollute
 The hidden prevalence of exploratory work
The Violent Bias
“Classical significance tests are violently
biased against the null hypothesis.”
(Edwards, 1965).
What is the p-value?
“The probability of obtaining a
test statistic at least as extreme as
the one you observed, given that
the null hypothesis is true.”
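As a small illustration of that definition, the two-sided p-value for a hypothetical t statistic:

```python
from scipy import stats

t_obs, df = 2.1, 28   # hypothetical test statistic and degrees of freedom
# Probability of a statistic at least as extreme as t_obs, given H0
p = 2 * stats.t.sf(abs(t_obs), df)
print(p)   # ~0.045
```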
The p-Value and
Statistical Evidence
 Note that the p-value only considers how rare
the observed data are under H0.
 The fact that the observed data may also be
rare (or more rare) under H1 does not enter
consideration.
Bayesian Hypothesis Test
 Suppose we have two models, M1 and M2.
 After seeing the data, which one is preferable?
 The one that has the highest posterior probability!
Bayesian Hypothesis Test
$$\frac{P(M_1 \mid D)}{P(M_2 \mid D)} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)} \;\times\; \frac{P(M_1)}{P(M_2)}$$

Posterior odds = Bayes factor × Prior odds
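A one-line numerical illustration of this identity (all numbers hypothetical):

```python
# Posterior odds = Bayes factor x prior odds
bayes_factor = 5.0    # hypothetical: data are 5x more likely under M1 than under M2
prior_odds = 0.5      # hypothetical: M1 judged half as plausible as M2 beforehand
posterior_odds = bayes_factor * prior_odds            # 2.5
posterior_prob_m1 = posterior_odds / (1 + posterior_odds)
print(posterior_odds, posterior_prob_m1)              # 2.5, ~0.71
```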
Guidelines for Interpretation
of the Bayes Factor
BF         Evidence
1 – 3      Anecdotal
3 – 10     Moderate
10 – 30    Strong
30 – 100   Very strong
> 100      Extreme
Bayes Factor for the t Test
$$\text{BF}_{01} = \frac{p(\text{data} \mid H_0)}{p(\text{data} \mid H_1)}$$
 H0 states that effect size δ = 0.
 But how do we specify H1?
Effect size δ under H1    Likelihood ratio
δ = .1                    p(data | H0) / p(data | δ = .1)
δ = .3                    p(data | H0) / p(data | δ = .3)
δ = .5                    p(data | H0) / p(data | δ = .5)

The Bayes factor is the weighted average of the likelihood ratios. The weights are given by the prior plausibility assigned to the effect sizes.
So we need to assign weights to the different values of effect size. These weights reflect the relative plausibility of the effect sizes before seeing the data. The most popular default choice is to assume a standard Normal distribution for δ under H1.
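A minimal numerical sketch of this weighted averaging, assuming a one-sample t test with a hypothetical t statistic and sample size, and the standard Normal prior on δ:

```python
import numpy as np
from scipy import stats, integrate

t_obs, n = 2.2, 30   # hypothetical observed t statistic and sample size
df = n - 1

# p(data | H0): central t density at the observed statistic
like_h0 = stats.t.pdf(t_obs, df)

# p(data | delta): noncentral t density, noncentrality parameter sqrt(n) * delta
def like_h1(delta):
    return stats.nct.pdf(t_obs, df, np.sqrt(n) * delta)

# Average the likelihood over the standard Normal prior on delta
marg_h1, _ = integrate.quad(lambda d: like_h1(d) * stats.norm.pdf(d), -8, 8)

bf01 = like_h0 / marg_h1   # Bayes factor for H0 over H1
print(bf01)
```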
But we could choose the width of the prior
differently. When we expect small effects, we
could set the width low. Each different value of
width gives a different answer. What to do?
Here we explore what happens for all possible values of the prior width. We could cheat and cherry-pick the prior width that makes H1 look best. This is useful because it gives an upper bound on the evidence.
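A sketch of that sweep (self-contained, same hypothetical data as the previous block); the maximum Bayes factor across all prior widths is the cherry-picked upper bound:

```python
import numpy as np
from scipy import stats, integrate

t_obs, n = 2.2, 30   # hypothetical observed t statistic and sample size
df = n - 1
like_h0 = stats.t.pdf(t_obs, df)
like_h1 = lambda d: stats.nct.pdf(t_obs, df, np.sqrt(n) * d)

# Sweep the width of the zero-centered Normal prior on delta
best_bf10 = 0.0
for w in np.linspace(0.05, 3.0, 60):
    marg_h1, _ = integrate.quad(lambda d: like_h1(d) * stats.norm.pdf(d, scale=w),
                                -8, 8, points=[0.0])
    best_bf10 = max(best_bf10, marg_h1 / like_h0)

print(best_bf10)   # if < 3: even the best case is merely anecdotal evidence
```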
Can it be the case that the upper bound yields no more than anecdotal evidence, whereas the p-value is smaller than .05?
YES!
Example
Even at best: Worth no more than a
bare mention!
Overview
 The cliff effect
 Why p-values pollute
 The hidden prevalence of exploratory work
Exploration
 The usual statistics are meant for purely
confirmatory research efforts.
 In “fishing expeditions” the data are used
twice: First to provide the hypothesis, and
then to test it.
Using the Data Twice
 Methodology 101: there is a conceptual
distinction between hypothesis-generating and
hypothesis-testing research (De Groot,
1956/2014).
 When the data inspire a hypothesis, you cannot
use those same data to test that hypothesis.
Wonky Stats
Preregistration
of Experiments
 Recently adopted by AP&P, Perspectives on
Psych Science, Cortex, and many others.
 Cleanly separates pre-planned from post-hoc
analyses.
 Feynman: “(…) you must not fool yourself—
and you are the easiest person to fool.”
Preregistration Debate
 Some brilliant researchers do not like
preregistration. They argue:
   I have always done my work without preregistration and it's pretty good stuff.
   You want more bureaucracy?
   You want to kill scientific serendipity?
Argument 1: Medicine
 Q: If you don't like preregistration as a
scientific method, how about eliminating this
requirement for clinical trials that assess the
efficacy of medication?
 A: Well, you should only be forced to
preregister your work if it is important[!]
Argument 2: ESP
 ESP is the jester in the court of academia.
 Research on ESP deserves far more attention than it currently receives, because it so vividly demonstrates that our current methodology is not fool-proof.
Argument 2: ESP
 How can we show that the jester is only fooling us and that the phenomenon does not exist?
 One way only: study preregistration!
Replication
 Goal: replicate a series of correlations between structural MRI patterns and behavior (e.g., people with bigger amygdalae have more Facebook friends).
Replication
 Copy the original authors' methods as closely
as possible;
 Preregister the analysis plan;
 Collect data (N=36);
 Report results.
Replication
 In all replications, Bayes factors support the
null hypothesis.
 However, the prior for the correlation can
(should?) be based on the original study.
 Posterior distributions suggest that some replication attempts are more informative than others.
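A minimal sketch of such a replication Bayes factor for a correlation, using the Fisher-z approximation to the likelihood; the prior width here is a hypothetical choice, and could instead be informed by the original study's posterior:

```python
import numpy as np
from scipy import stats, integrate

def bf10_correlation(r, n, prior_sd=0.3):
    # Fisher-z approximation: atanh(r) ~ Normal(atanh(rho), var = 1 / (n - 3))
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    like = lambda rho: stats.norm.pdf(z, np.arctanh(rho), se)
    prior = lambda rho: stats.norm.pdf(rho, 0.0, prior_sd)  # hypothetical prior
    marg_h1, _ = integrate.quad(lambda rho: like(rho) * prior(rho), -0.99, 0.99)
    return marg_h1 / like(0.0)   # BF10: H1 (rho != 0) versus H0 (rho = 0)

print(bf10_correlation(r=0.10, n=36))   # hypothetical replication outcome
```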
Thanks for Your Attention!