Lies, damn lies, and statistics: How to avoid misleading yourself and others with data analysis
Hiram Brownell, Department of Psychology
Research and Scholarship Integrity Program, March 21, 2015
Quote popularized by Mark Twain; attributed to Benjamin Disraeli (?)

Goals
Identify issues in empirical research, mostly tied to hypothesis testing.
Be cautious rather than prescriptive:
--all practices mentioned have a role;
--be aware of how things can appear to others.
Advocate less reliance on p values.
Advocate greater use of confidence intervals and effect sizes in empirical research.

Not all empirical research relies on statistics to the same extent, but statistical thinking has had a vast influence in many fields.
Mid 19th century: rise of statistical thinking. Nightingale's graphic of fatalities in the Crimean War.
Do you use p values and hypothesis testing in your research?

Brief history of hypothesis testing: R. A. Fisher
As an undergraduate at Cambridge (UK) in 1911, formed the Eugenics Society with John Maynard Keynes, R. C. Punnett, and Horace Darwin (son of Charles Darwin). Enduring interest in genetics.
After WWI, Rothamsted Experimental Station: ANOVA, small-sample work, hypothesis testing.
Competing approach by Jerzy Neyman and Egon Pearson. The Neyman-Pearson approach: signal detection theory (effect size).
Bitter rivalry between Fisher and Neyman-Pearson.
Today, the presentation of hypothesis testing in textbooks is a mixture of the two approaches.

I have sinned in the application of hypothesis testing. Have you? How many have done the following?
Collected lots of data, then decided later whether or not to include a variable in an analysis (e.g., stepwise regression, analysis of covariance)?
Reported, or not reported, a "pilot" study?
Decided whether or not to transform the data (e.g., a log transform)?
Removed outliers beyond +/- 2 or 3 SDs?
Tested a few more subjects because the results were close to significance?

Fraud is not the issue here; assume honest researchers. Still, "p-hacking" is a real concern.
Consider the public perception of science as well as the quality of the research literature.
Let us not do bad science. Let us not be bad for science.

What follows: scare mongering by the presenter. Not a balanced presentation (no discussion of exploratory data analysis). A worst-case scenario, but worth thinking about.

Bad press for science: past and present. Life is not fair.
A Statistical Model to Explain the Mendel-Fisher Controversy. Ana M. Pires and Joao A. Branco, Statistical Science, 2010, Vol. 25, No. 4, 545-565.
G. Mendel (1822-1884); R. A. Fisher (1890-1962).

Mendel's experiments, 1856-1863: 29,000 pea plants. Informal analysis; modern statistical tools not yet developed. Publication: 1865, 1866. [1902: Karl Pearson's chi-square test.]
Fisher questioned Mendel's statistical results in 1911 (as an undergraduate) and in 1936: Mendel's results were too good to be true.

Example of what to expect
If you have a biased coin, the probability of heads is .60. With N = 20 flips, expect around 12 heads; sometimes more, sometimes fewer. Can you document that the coin is biased?
Binomial approximation to the normal curve: if the null hypothesis is true, expect about 10 heads; to reject the null hypothesis, you need 13.7 or more heads (one-tailed test).
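Where the 13.7 cutoff comes from, and how much power the 20-flip experiment really has: a minimal sketch in Python (an illustration added here, not part of the talk; SciPy assumed; the numbers follow the slide's normal approximation).

    # Minimal sketch: normal approximation to the binomial, one-tailed alpha = .05.
    from scipy.stats import norm

    n, p_null, p_true, alpha = 20, 0.5, 0.6, 0.05

    # Cutoff under the null hypothesis: mean 10 heads, SD = sqrt(20 * .5 * .5)
    cutoff = n * p_null + norm.ppf(1 - alpha) * (n * p_null * (1 - p_null)) ** 0.5
    print(f"Reject the null with {cutoff:.1f} or more heads")   # about 13.7

    # Power: chance of clearing that cutoff when the coin really lands heads 60% of the time
    mean_true, sd_true = n * p_true, (n * p_true * (1 - p_true)) ** 0.5
    power = 1 - norm.cdf((cutoff - mean_true) / sd_true)
    print(f"Power of the 20-flip experiment: {power:.2f}")      # roughly .2

A genuinely biased coin clears that cutoff only about one time in four or five. The same power logic is what the re-analysis below applies to Mendel: given the effects and sample sizes, only some proportion of honest experiments should have "worked."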
Re-analysis of Mendel's data
Application of the chi-square test (1902) to obtain p values, then examination of the distribution of those p values.
Define the proportion of experiments that should "work," given the size of the effects and the sample sizes. Mendel's report is NOT what would be expected given the effects tested.
Explanation? "Although no explanation can be expected to be satisfactory, it remains a possibility among others that Mendel was deceived by some assistant who knew too well what was expected." (Fisher, 1936, p. 132)

1964-2007: 50+ papers on the controversy.
Mendel's data are "too good to be true" under the assumption that all the experiments Mendel performed were reported.
Did Mendel follow rules for the rigorous application of techniques developed much later? No. Is the criticism fair?
Alternative: suppose Mendel repeated some experiments, presumably those that deviated most from his theory, and reported only the better of the two.
Simulation of what might have gone on: an experiment is repeated whenever its p-value is smaller than alpha, where alpha (0 <= alpha <= 1) is a parameter fixed by the experimenter, and only the experiment with the largest p-value is reported. (A small simulation sketch of this strategy appears below, after the stopping-rule slide.)
Mendel today: views respected, but a risky career move.

Problems with the scientific literature continue, and public awareness has recently increased.

Bad news from medical research
Drug development: Raise standards for preclinical cancer research. C. Glenn Begley & Lee M. Ellis, Nature 483, 531-533 (29 March 2012). Haematology and Oncology Department, Amgen (California, USA).
53 papers deemed "landmark" studies. Selection bias: the papers described something completely new (fresh approaches to targeting cancers, alternative clinical uses for existing therapeutics).
The scientific findings were confirmed by Amgen via replication in only 6 (11%) of the cases.

Bad news: Why Most Published Research Findings Are False. J. Ioannidis (2005). PLoS Med 2(8): e124.
Content on these slides copied, modified, and pasted from the article summary and other sources.
http://www.economist.com/blogs/graphicdetail/2013/10/daily-chart-2
(Next couple of slides used only if the URL does not open during the talk. The demonstration could be made to look better or worse by varying the outline of the problem.)
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

The customary approach to statistical significance ignores three things:
the "statistical power" of the study (a measure of its ability to avoid Type II errors, false negatives in which a real signal is missed in the noise);
the unlikeliness of the hypothesis being tested;
the bias favoring publication of claims to have found something new.

A research finding is less likely to be true when:
studies are smaller (small sample sizes);
effect sizes are smaller;
there is a greater number, and less preselection, of tested relationships;
there is greater flexibility in designs, definitions, outcomes, and analytic modes;
there is greater financial and other interest and prejudice;
more teams are involved in the chase for statistical significance.
For many fields, claimed research findings may often be simply accurate measures of the prevailing bias.

Test for Excess Significance (TES): an example of post hoc scrutiny
The frequency of excess success for articles in Psychological Science. Gregory Francis, Psychon Bull Rev (2014) 21:1180-1187.
Papers with 4 or more studies; usually internal replication is a good thing.
Calculate the power of the individual studies. Power: the probability of finding an effect that is truly present, i.e., a correct rejection of the null hypothesis.
If each study in a five-study paper has power of .6, you should not expect each and every study to reject the null: the probability that all 5 studies in a single paper reject the null is (.6)(.6)(.6)(.6)(.6), or about .08.
Francis: cannot say much about any one paper (but...). Looking over several years of Psychological Science (2009-2012): over 900 articles, of which 44 could be analyzed. Big picture: too good to be true.

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Joseph P. Simmons, Leif D. Nelson, & Uri Simonsohn, Psychological Science, 2011.
Researcher degrees of freedom: inflation of the false-positive rate.
Study 2: musical contrast and chronological rejuvenation. Using the same method as in Study 1, 20 undergraduates listened to either "When I'm Sixty-Four" by The Beatles or "Kalimba." In an ostensibly unrelated task, participants indicated their birth date (mm/dd/yyyy) and their father's age; father's age was used to control for variation in baseline age across participants.
Silly results look "good" because of researcher flexibility. An ANCOVA revealed the predicted effect: according to their birth dates, participants were nearly a year and a half younger after listening to "When I'm Sixty-Four" (adjusted M = 20.1 years) than after listening to "Kalimba" (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.

Adding subjects after a first look at the data: how good is your stopping rule? (Fig. 1)
Handy citation for stopping rules: Improved stopping rules for the design of efficient small-sample experiments in biomedical and biobehavioral research. Douglas A. Fitts, Behavior Research Methods, 2010, 42(1), 3-22.
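To make the stopping-rule worry concrete, here is a minimal simulation sketch in Python (an illustration added here; NumPy and SciPy assumed). The rule "run 10 more subjects per cell whenever p lands between .05 and .10" is an illustrative choice, not a procedure taken from any of the papers above. Both groups are drawn from the same population, so every rejection is a false positive.

    # Minimal sketch of "testing a few more subjects if results are close to significance".
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    false_positives = 0
    n_sims = 5000

    for _ in range(n_sims):
        a, b = rng.normal(size=20), rng.normal(size=20)    # 20 per cell, no real effect
        p = ttest_ind(a, b).pvalue
        if 0.05 < p < 0.10:                                # "close to significance"...
            a = np.concatenate([a, rng.normal(size=10)])   # ...so run 10 more per cell
            b = np.concatenate([b, rng.normal(size=10)])
            p = ttest_ind(a, b).pvalue
        if p < 0.05:
            false_positives += 1

    print(f"Nominal alpha: .05   Actual false-positive rate: {false_positives / n_sims:.3f}")

The nominal .05 level is no longer the real Type I error rate, and the more flexible the rule and the more looks the data get, the worse the inflation becomes.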
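And, returning to the Mendel slides above, a minimal sketch of the "repeat the worst-looking experiment and report the better result" mechanism in the spirit of Pires and Branco (again an added illustration; NumPy and SciPy assumed; the 3:1 ratio, 600 plants per experiment, and alpha = .2 are illustrative choices, not Mendel's actual numbers).

    # Minimal sketch of selective reporting of repeated experiments.
    import numpy as np
    from scipy.stats import chisquare

    rng = np.random.default_rng(7)
    n_plants, expected_ratio = 600, np.array([0.75, 0.25])   # true 3:1 segregation
    alpha_repeat = 0.2                                        # "redo it if the fit looks poor"

    def one_experiment():
        counts = rng.multinomial(n_plants, expected_ratio)
        return chisquare(counts, f_exp=n_plants * expected_ratio).pvalue

    reported = []
    for _ in range(2000):
        p = one_experiment()
        if p < alpha_repeat:                 # deviates "too much" from theory...
            p = max(p, one_experiment())     # ...repeat once and keep the better-looking result
        reported.append(p)

    # Honest experiments give p < .2 about 20% of the time; selective reporting hides them.
    print(f"Share of reported p-values below .2: {np.mean(reported) and np.mean(np.array(reported) < 0.2):.3f}")

Honest experiments on a true 3:1 ratio give p-values below .2 about 20% of the time; under selective reporting that share collapses, which is exactly the "too good to be true" signature in the reported fits.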
Some solutions, or vaccinating your work against problems

Requirements if you rely on p < .05 (from Simmons et al.):
1. Decide the stopping rule before data collection.
2. Collect 20+ observations per cell.
3. List all variables collected in a study.
4. Report all experimental conditions, including failed manipulations.
5. If observations are eliminated, report the results with as well as without those observations included.
6. If an analysis includes a covariate, also report the results without the covariate.

Guidelines for journal editors:
1. Ensure that authors follow the requirements.
2. Be more tolerant of imperfections in results.
3. Require a demonstration that the results do not hinge on arbitrary analytic decisions.
4. Require authors to conduct an exact replication if the analysis, etc., is not compelling.

Pre-registration of studies and data.

Geoff Cumming: the new statistics. Effect sizes, confidence intervals.
http://www.psychologicalscience.org/index.php/members/new-statistics
Road to redemption: effect sizes, confidence intervals, nuanced interpretation of p values.

Defensive data analysis: be aware of possible criticisms; protect yourself; use prospective strategies.

The take-home message, in the biggest type font: take care with hypothesis testing.
Exploratory data analysis (not discussed here) is extremely important.

Thank you