Lies, damn lies, and statistics:
How to avoid misleading yourself
and others with data analysis
Hiram Brownell
Department of Psychology
Research and Scholarship Integrity Program
March 21, 2015
Quote popularized by Mark Twain.
Attributed to Benjamin Disraeli (?)
Goals
 Identify issues in empirical research, mostly
tied to hypothesis testing
 Be cautious rather than prescriptive:
--all practices mentioned have a role;
--be aware of how things can appear to others
 Advocate less reliance on p values
 Advocate greater use of confidence intervals and effect sizes in empirical research
Not all empirical research relies on statistics to the same extent.
Statistical thinking has had vast influence in many fields.
Mid-19th century: rise of statistical thinking
Nightingale’s graphic of fatalities in
Crimean War
Use p values and hypothesis testing
in your research?
 Brief history of hypothesis testing: RA Fisher
 As undergrad at Cambridge (UK) (1911)
 Formed Eugenics Society with John Maynard
Keynes, R. C. Punnett, and Horace Darwin
(son of Charles Darwin). Enduring interest in
genetics
 WWI
 Rothamsted Experimental Station: ANOVA
 Small-sample work: hypothesis testing
Competing approach by Jerzy
Neyman and Egon Pearson
 Neyman-Pearson approach: Signal
detection theory (effect size)
 Bitter rivalry between Fisher and Neyman-Pearson
 Today, the presentation of hypothesis testing in textbooks is a mixture of the two approaches.
I have sinned in the application of hypothesis testing. Have you?
How many have done the following?
 Collect lots of data, then decide later whether or not to include a variable in an analysis (e.g., stepwise regression, Analysis of Covariance)?
 Report, or not report, a “pilot” study?
 Decide after the fact whether or not to use a transform of the data (e.g., log transform)?
 Remove outliers at +/- 2 or 3 SDs?
 Test a few more subjects if results are close to significance?
Fraud not an issue here.
Assumption: honest researchers.
Still… “p-hacking” is a real concern.
Consider public perception of science
as well as quality of research literature.
Let us not do bad science.
Let us not be bad for science.
What follows:
 Scare-mongering by the presenter
 Not a balanced presentation
--no discussion of exploratory data analysis
 Worst-case scenario
 Worth thinking about
Bad press for science:
past and present
Life is not fair
A Statistical Model to Explain the Mendel–
Fisher Controversy
Ana M. Pires and Joao A. Branco
Statistical Science
2010, Vol. 25, No. 4, 545–565
G. Mendel (1822-1884), R. A. Fisher (1890-1962)
Mendel
Experiments 1856-1863: 29,000 pea plants.
Informal analysis; modern statistical tools not yet developed.
Publication: 1865, 1866.
[1902 Karl Pearson’s Chi Square test]
Fisher questioned Mendel’s statistical results:
1911 (as undergrad), 1936.
Mendel’s results too good to be true
Example of what to expect
 If you have a biased coin, p of H is .60
 N=20, expect around 12 Heads
 Sometimes more, sometimes fewer
 Can you document that the coin is biased?
Binomial approximation to the normal curve
If the null hypothesis is true, expect about 10 heads.
To reject the null hypothesis, need 13.7 or more heads (one-tailed test).
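A minimal sketch in Python (my addition; scipy is assumed, the slides give only the numbers) showing where the 13.7 cutoff comes from and how much power the test has against the biased coin:

```python
# Normal approximation to the binomial for the coin example:
# N = 20 tosses, fair coin under H0 (p = .5), biased alternative (p = .6).
from scipy.stats import norm

N, p0, p1, alpha = 20, 0.5, 0.6, 0.05

mu0 = N * p0                          # 10 heads expected under H0
sd0 = (N * p0 * (1 - p0)) ** 0.5      # ~2.24
cutoff = mu0 + norm.ppf(1 - alpha) * sd0
print(f"reject H0 at {cutoff:.1f} or more heads")   # ~13.7

# Power: the chance of clearing that cutoff when the coin really is biased.
mu1, sd1 = N * p1, (N * p1 * (1 - p1)) ** 0.5
power = 1 - norm.cdf(cutoff, loc=mu1, scale=sd1)
print(f"power against p = .6: {power:.2f}")         # ~0.22
```

With only 20 tosses the test detects the bias roughly one time in five, which is the power problem later slides return to.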
 Re-analysis of Mendel’s data
 Application of Chi square (1902): p values
 Examine distribution of p values
 Define the proportion of experiments that should “work” based on effect size and sample size.
 Mendel’s report NOT what would be
expected given the effects tested.
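A hedged sketch (my construction, not Pires and Branco's actual method) of the underlying idea: when the null model is exactly true, goodness-of-fit p-values are roughly uniform, so a surplus of very large p-values is itself a red flag:

```python
# Simulate chi-square goodness-of-fit p-values for experiments where the
# 3:1 Mendelian ratio is exactly true. Sample sizes are my assumption.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
n_experiments, n_plants = 10_000, 600

pvals = []
for _ in range(n_experiments):
    dominant = rng.binomial(n_plants, 0.75)          # true 3:1 ratio
    obs = [dominant, n_plants - dominant]
    p = chisquare(obs, f_exp=[0.75 * n_plants, 0.25 * n_plants]).pvalue
    pvals.append(p)

pvals = np.array(pvals)
# For honest data this should be near 0.10; a much larger fraction of
# near-perfect fits suggests selection of results.
print("fraction of p-values above .9:", (pvals > 0.9).mean())
```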
 Explanation?
 “Although no explanation can be
expected to be satisfactory, it remains a
possibility among others that Mendel was
deceived by some assistant who knew
too well what was expected.” (Fisher,
1936, page 132).
 1964-2007: 50+ papers on controversy
 Mendel’s data “too good to be true” under the assumption that all experiments Mendel performed were reported.
 Did Mendel follow rules for the rigorous application of techniques developed much later? No.
 Is the criticism fair?
Alternative:
 Suppose Mendel repeated some experiments, presumably those that deviated most from his theory, and reported only the better of the two.
Simulation of what might
have gone on
 An experiment is repeated whenever its p-value is smaller than alpha.
 0 ≤ alpha ≤ 1: a parameter fixed by the experimenter.
 Only the experiment with the largest p-value is reported.
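A minimal sketch of that repeat-and-report-the-best rule (variable names and sample sizes are my assumptions; Pires and Branco's simulation is more elaborate):

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)

def one_experiment(n_plants=600):
    """Chi-square p-value for one honest experiment under a true 3:1 ratio."""
    dominant = rng.binomial(n_plants, 0.75)
    return chisquare([dominant, n_plants - dominant],
                     f_exp=[0.75 * n_plants, 0.25 * n_plants]).pvalue

def reported_p(alpha):
    """Repeat once if p < alpha, then report the larger (better-looking) p."""
    p1 = one_experiment()
    if p1 < alpha:
        return max(p1, one_experiment())
    return p1

for alpha in (0.0, 0.2, 1.0):
    ps = np.array([reported_p(alpha) for _ in range(5_000)])
    print(f"alpha = {alpha:.1f}: mean reported p = {ps.mean():.3f}")
# alpha = 0 reproduces honest reporting (mean near .5); larger alpha shifts
# the reported p-values upward, i.e., the fit looks "too good."
```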
Mendel today: views respected,
but risky career move
 Problems with scientific literature continue.
 Recent increase in public awareness
Bad news from medical research
 Drug development: Raise standards
for preclinical cancer research
 C. Glenn Begley & Lee M. Ellis
Nature 483, 531–533 (29 March 2012)
Haematology and Oncology Dept, Amgen
(California, USA)
 53 papers deemed 'landmark' studies
 Selection bias: papers described something completely new (fresh approaches to targeting cancers, alternative clinical uses for existing therapeutics).
 Scientific findings confirmed by Amgen via
replication in only 6 (11%) cases.
Bad news:
Why Most Published Research Findings
Are False
 J Ioannidis
 (2005). PLoS Med 2(8): e124.
 Content on slides copied, modified, pasted
from article summary and other sources.
 http://www.economist.com/blogs/graphicdetail/2013/10/daily-chart-2
 The next couple of slides are used only if the URL does not open during the talk.
 The demonstration could be made to look better or worse by varying the outline of the problem.
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble
 The customary approach to statistical significance ignores three things:
 the “statistical power” of the study (a measure of its ability to avoid Type II errors, false negatives in which a real signal is missed in the noise);
 the unlikeliness of the hypothesis being tested;
 bias favoring publication of claims to have found something new.
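A worked sketch of why those three things matter (the numbers are mine, following the positive-predictive-value logic of Ioannidis, 2005, and ignoring bias for simplicity):

```python
def ppv(power, prior, alpha=0.05):
    """P(hypothesis true | significant result), ignoring bias."""
    true_pos = power * prior          # true effects that reach significance
    false_pos = alpha * (1 - prior)   # null effects that reach significance
    return true_pos / (true_pos + false_pos)

# Well-powered test of a plausible hypothesis:
print(round(ppv(power=0.8, prior=0.5), 2))    # ~0.94
# Underpowered test of a long-shot hypothesis:
print(round(ppv(power=0.2, prior=0.05), 2))   # ~0.17
```

In the second case, most “significant” findings are false, even though every test used p < .05.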
A research finding is less likely to be true when:
 studies are smaller (small sample size);
 effect sizes are smaller;
 there is a greater number and less preselection of tested relationships;
 there is greater flexibility in designs, definitions, outcomes, and analytical modes;
 there are greater financial and other interests and prejudices;
 more teams are involved in the chase for statistical significance.
 For many fields, claimed research findings
may often be simply accurate measures of
prevailing bias.
Test for Excess Significance: TES
Example of post hoc scrutiny
The frequency of excess success for
articles in Psychological Science
Gregory Francis
Psychon Bull Rev (2014) 21:1180–1187
Papers with 4 or more studies.
Internal replication is usually considered good.
 Calculate the power of the individual studies.
 Power: the probability of finding an effect that is truly present (a correct rejection of the null hypothesis).
 If each study in a paper with five studies has power of .6, then we should not expect each and every study to reject the null.
 p that all 5 studies in a single paper reject the null: (.6)(.6)(.6)(.6)(.6) = .08
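The same arithmetic, generalized (my sketch): for k independent studies with estimated powers p1..pk, the chance that every one rejects the null is the product of the powers:

```python
from math import prod

powers = [0.6] * 5                   # five studies, each with power .6
p_all_significant = prod(powers)
print(round(p_all_significant, 3))   # 0.078 -- "too good" if seen routinely
```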
 Francis: cannot say much about any one
paper (but…)
 Looking over several years of Psychological
Science (2009-2012), over 900 articles
 44 could be analyzed.
 Big picture: too good to be true
 False-Positive Psychology: Undisclosed
Flexibility in Data Collection and Analysis
Allows Presenting Anything as
Significant
 Joseph P. Simmons, Leif D. Nelson, &
Uri Simonsohn
 Psychological Science, 2011
researcher degrees of freedom: inflation of false-positive rates
Study 2: musical contrast and chronological
rejuvenation
 Using the same method as in Study 1, 20 undergraduates listened to either “When I’m Sixty-Four” by The Beatles or “Kalimba.”
 In an ostensibly unrelated task, participants indicated their birth date (mm/dd/yyyy) and their father’s age.
 Father’s age was used to control for variation in baseline age across participants.
Silly results look “good” due to
researcher flexibility
 An ANCOVA revealed the predicted effect:
 According to their birth dates, participants were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years),
 F(1, 17) = 4.92, p = .040.
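A hedged illustration with simulated data (not Simmons et al.'s data; statsmodels is assumed) of how an ANCOVA with a freely chosen covariate can be run on such a design:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(64)
n = 20
df = pd.DataFrame({
    "song": rng.permutation(["sixty_four"] * 10 + ["kalimba"] * 10),
    "age": rng.normal(21, 2, n),           # true age: unrelated to song
})
df["father_age"] = df["age"] + rng.normal(27, 3, n)   # noisy covariate

# ANCOVA: age by condition, adjusting for father's age.
model = smf.ols("age ~ C(song) + father_age", data=df).fit()
print(model.summary().tables[1])
# With no real effect, ~5% of such analyses reach p < .05 by chance --
# and far more if the covariate is kept only when it "helps."
```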
Adding subjects after a first look at the data:
how good is your stopping rule? (Fig. 1)
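A minimal simulation (my construction) of that practice: peek at the data after every few added subjects and stop as soon as p < .05. Even with no true effect, the false-positive rate climbs well above the nominal 5%:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def peeking_study(start=20, step=10, n_max=100):
    """Run a two-group study, testing after each batch of added subjects."""
    a, b = rng.normal(size=start), rng.normal(size=start)  # null is true
    while True:
        if ttest_ind(a, b).pvalue < 0.05:
            return True                    # "significant" -- stop and publish
        if len(a) >= n_max:
            return False
        a = np.concatenate([a, rng.normal(size=step)])
        b = np.concatenate([b, rng.normal(size=step)])

hits = sum(peeking_study() for _ in range(2_000))
print(f"false-positive rate with peeking: {hits / 2_000:.2f}")  # well above .05
```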
Handy citation for stopping rule
Improved stopping rules for the design of
efficient small-sample experiments in
biomedical and biobehavioral research.
Douglas A. Fitts
Behavior Research Methods
2010, 42 (1), 3-22.
Some solutions, or
vaccinating your work against
problems
Requirements if you rely on p < .05
 1. Decide the stopping rule before data collection.
 2. Collect 20+ observations per cell.
 3. List all variables collected in a study.
 4. Report all experimental conditions, including failed manipulations.
 5. If observations are eliminated, report results both with and without those observations included (see the sketch below).
 6. If a covariate is used, also report results without the covariate.
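For point 5, a small sketch (the helper and the SD cutoff are my assumptions) of reporting the same test with and without the excluded observations:

```python
import numpy as np
from scipy.stats import ttest_ind

def report_both(a, b, sd_cutoff=2.0):
    """t-test on the full data and on data trimmed at +/- sd_cutoff SDs."""
    def trim(x):
        return x[np.abs(x - x.mean()) <= sd_cutoff * x.std()]
    print(f"all observations: p = {ttest_ind(a, b).pvalue:.3f}")
    print(f"outliers removed: p = {ttest_ind(trim(a), trim(b)).pvalue:.3f}")

rng = np.random.default_rng(3)
report_both(rng.normal(0.0, 1, 30), rng.normal(0.4, 1, 30))
```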
Guidelines for journal editors
1. Ensure authors follow the requirements.
2. Be more tolerant of imperfections in results.
3. Require demonstration that results do not hinge on arbitrary analytic decisions.
4. Require authors to conduct an exact replication if the analysis, etc., is not compelling.
 Preregistration of studies
 Registration of data
Geoff Cumming: the new statistics
Effect sizes, confidence intervals
 http://www.psychologicalscience.org/index.php/members/new-statistics
Road to redemption: effect sizes,
confidence intervals, nuanced
interpretation of p values
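A minimal sketch of that style of reporting (my code; numpy and scipy assumed): an effect size (Cohen's d) and a 95% confidence interval for the mean difference, rather than a bare p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a, b = rng.normal(0.5, 1, 40), rng.normal(0.0, 1, 40)

diff = a.mean() - b.mean()
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
             / (len(a) + len(b) - 2))          # pooled SD
d = diff / sp                                  # Cohen's d
se = sp * np.sqrt(1 / len(a) + 1 / len(b))
t_crit = stats.t.ppf(0.975, df=len(a) + len(b) - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"Cohen's d = {d:.2f}; 95% CI for the difference: "
      f"[{ci[0]:.2f}, {ci[1]:.2f}]")
```

The interval conveys both the size of the effect and the uncertainty around it, which a lone p-value hides.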
Defensive data analysis
 Be aware of possible criticisms
 Protect yourself
 Prospective strategies
 The message in the biggest type font: take care with hypothesis testing
 Exploratory data analysis (not discussed)
extremely important
Thank you