Statistical Pitfalls in Cognitive Neuroscience (and Beyond)
Eric-Jan Wagenmakers

Overview
- The cliff effect
- Why p-values pollute
- The hidden prevalence of exploratory work

Pitfall
"Escape latency in the Morris water maze was affected by lesions of the entorhinal cortex (P < 0.05), but was spared by lesions of the perirhinal and postrhinal cortices (both P values > 0.1), pointing to a specific role for the entorhinal cortex in spatial memory."

Pitfall
The difference between significant and not significant is itself not necessarily significant! (Gelman & Stern, 2006)
"Surely, God loves the .06 nearly as much as the .05" (Rosnow & Rosenthal, 1989).
Instead of considering the difference in p-values, we should consider the p-value for the difference.

The Imager's Fallacy
By painting the brain according to a voxel's p-value, imagers are particularly susceptible to this pitfall. One area has a pretty color, the other area does not; conclusion: the areas differ from one another. This conclusion is wrong, because the difference itself was never tested.

One Possible Solution
Determine the least-significant voxel (LSV) and compare the non-significant voxels against it. If the difference is significant, these non-significant voxels differ from the significant voxels; if the difference is not significant, however, these voxels are "in limbo" and more data are needed for their classification.

Imager's Fallacy
NB. Other ACME products may work as well!

Overview
- The cliff effect
- Why p-values pollute
- The hidden prevalence of exploratory work

The Violent Bias
"Classical significance tests are violently biased against the null hypothesis." (Edwards, 1965)

What Is the p-Value?
"The probability of obtaining a test statistic at least as extreme as the one you observed, given that the null hypothesis is true."

The p-Value and Statistical Evidence
Note that the p-value only considers how rare the observed data are under H0. The fact that the observed data may also be rare (or more rare) under H1 does not enter into consideration.

Bayesian Hypothesis Test
Suppose we have two models, M1 and M2. After seeing the data, which one is preferable? The one that has the highest posterior probability!

\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{Posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \times \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{Prior odds}}

Guidelines for Interpretation of the Bayes Factor

    BF          Evidence
    1 – 3       Anecdotal
    3 – 10      Moderate
    10 – 30     Strong
    30 – 100    Very strong
    > 100       Extreme

Bayes Factor for the t Test
\text{BF}_{01} = \frac{p(\text{data} \mid H_0)}{p(\text{data} \mid H_1)}: the probability of the data under the null hypothesis relative to the probability of the data under the alternative hypothesis.
H0 states that the effect size δ = 0. But how do we specify H1?

    Effect size δ under H1    Likelihood ratio
    δ = .1                    p(data | H0) / p(data | δ = .1)
    δ = .3                    p(data | H0) / p(data | δ = .3)
    δ = .5                    p(data | H0) / p(data | δ = .5)
    ...                       ...

The Bayes factor combines these comparisons by averaging: under H1, the probability of the data is the weighted average of p(data | δ) across the candidate effect sizes. So we need to assign weights to the different values of effect size; these weights reflect the relative plausibility of the effect sizes before seeing the data. The most popular default choice is a standard normal distribution for δ. But we could choose the width of this prior differently: when we expect small effects, we could set the width low. Each value of the width gives a different answer. What to do? Here we explore what happens for all possible values of the prior width. We could even cheat and cherry-pick the prior width that makes H1 look best.
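To make this concrete, here is a minimal sketch (Python with NumPy/SciPy) of the idea just described: the Bayes factor for a one-sample t test is obtained by averaging the likelihood of the data over a zero-centred normal prior on the effect size δ, and a grid of prior widths is then scanned to find the width most favourable to H1. The observed t value, the sample size, and the grid of widths are made-up illustration values, not numbers from the talk.

    # Sketch: Bayes factor for a one-sample t test with a zero-centred normal
    # prior on effect size delta, computed for a range of prior widths.
    # BF10 > 1 favours H1; BF10 = 1 / BF01. All inputs are hypothetical.
    import numpy as np
    from scipy import stats, integrate

    def bf10_normal_prior(t, n, prior_sd):
        """BF10 for an observed t statistic, with delta ~ Normal(0, prior_sd^2)."""
        df = n - 1
        # Likelihood of t given delta: noncentral t, noncentrality delta * sqrt(n).
        def integrand(delta):
            return stats.nct.pdf(t, df, delta * np.sqrt(n)) * \
                   stats.norm.pdf(delta, loc=0.0, scale=prior_sd)
        # p(data | H1): weighted average of p(data | delta) under the prior.
        m1, _ = integrate.quad(integrand, -10, 10)
        m0 = stats.t.pdf(t, df)  # p(data | H0), i.e. delta = 0
        return m1 / m0

    t_obs, n = 2.1, 40                    # a hypothetical "just significant" result
    widths = np.linspace(0.05, 2.0, 100)  # candidate prior widths
    bfs = np.array([bf10_normal_prior(t_obs, n, w) for w in widths])
    best = widths[np.argmax(bfs)]
    print(f"cherry-picked width = {best:.2f}, maximum BF10 = {bfs.max():.2f}")

Scanning the widths in this way also reveals the largest Bayes factor that any prior width could deliver for these data.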
This is useful because it gives an upper bound on the evidence. Can it be the case that the upper bound yields no more than anecdotal evidence, whereas the p-value is smaller than .05? Yes!

Examples
Even at best: worth no more than a bare mention!

Overview
- The cliff effect
- Why p-values pollute
- The hidden prevalence of exploratory work

Exploration
The usual statistics are meant for purely confirmatory research efforts. In "fishing expeditions" the data are used twice: first to suggest the hypothesis, and then to test it.

Using the Data Twice
Methodology 101: there is a conceptual distinction between hypothesis-generating and hypothesis-testing research (De Groot, 1956/2014). When the data inspire a hypothesis, you cannot use those same data to test that hypothesis.

Wonky Stats

Preregistration of Experiments
Recently adopted by AP&P, Perspectives on Psychological Science, Cortex, and many others. Preregistration cleanly separates pre-planned from post-hoc analyses. Feynman: "(…) you must not fool yourself—and you are the easiest person to fool."

Preregistration Debate
Some brilliant researchers do not like preregistration. They argue: "I have always done my work without preregistration and it's pretty good stuff. You want more bureaucracy? You want to kill scientific serendipity?"

Argument 1: Medicine
Q: If you don't like preregistration as a scientific method, how about eliminating this requirement for clinical trials that assess the efficacy of medication?
A: "Well, you should only be forced to preregister your work if it is important" [!]

Argument 2: ESP
ESP is the jester in the court of academia. Research on ESP should be much more important than it is right now, because it so vividly demonstrates that our current methodology is not fool-proof. How can we expose the fact that the jester is merely fooling us and the phenomenon does not exist? One way only: study preregistration!

Replication
Goal: replicate a series of correlations between structural MRI patterns and behavior (e.g., people with bigger amygdalas have more Facebook friends).
Procedure: copy the original authors' methods as closely as possible; preregister the analysis plan; collect data (N = 36); report the results.
In all replications, the Bayes factors support the null hypothesis. However, the prior for the correlation can (should?) be based on the original study. Posterior distributions suggest that some replication attempts are more informative than others. (A sketch of this prior-from-the-original idea appears after the closing slide.)

Thanks for Your Attention!
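As a footnote to the Replication slides, here is a minimal sketch of the prior-from-the-original idea. This is not the analysis reported in the study: it uses the Fisher-z normal approximation to compute a "replication Bayes factor" for a correlation, with the original study's result serving as the prior under H1. The correlations and the original sample size are hypothetical; only N = 36 echoes the replication sample size mentioned above.

    # Sketch: replication Bayes factor for a correlation (Fisher-z approximation).
    # H1: rho as estimated by the original study (its posterior becomes the prior);
    # H0: rho = 0. BF_r0 < 1 means the replication data favour the null.
    # The correlations and n_orig below are hypothetical.
    import numpy as np
    from scipy import stats

    def replication_bf_correlation(r_orig, n_orig, r_rep, n_rep):
        z_orig, se_orig = np.arctanh(r_orig), 1 / np.sqrt(n_orig - 3)
        z_rep, se_rep = np.arctanh(r_rep), 1 / np.sqrt(n_rep - 3)
        # Marginal likelihood of the replication z under H1: a normal whose
        # variance adds the sampling variance and the original's posterior variance.
        m1 = stats.norm.pdf(z_rep, loc=z_orig, scale=np.sqrt(se_rep**2 + se_orig**2))
        # Likelihood of the replication z under H0 (rho = 0).
        m0 = stats.norm.pdf(z_rep, loc=0.0, scale=se_rep)
        return m1 / m0

    # e.g. original r = .45 with n = 20; replication r = .05 with n = 36
    bf_r0 = replication_bf_correlation(r_orig=0.45, n_orig=20, r_rep=0.05, n_rep=36)
    print(f"BF_r0 = {bf_r0:.2f}")

Basing the H1 prior on the original study turns the question from "is there any effect at all?" into "is the effect reported in the original study present in the replication data?", which is one way to see why some replication attempts are more informative than others.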