The Case against Null-Hypothesis
Statistical Significance Tests:
Flaws, Alternatives and Action Plans
Andreas Schwab
Iowa State University
Institute of Technology - Bandung
Universitas Gadjah Mada
April 30, 2014
Who would you really like to be here?
Bill Starbuck
University of Oregon
Research Method Division
Professional Development Workshop
The Case Against Null Hypothesis Significance Testing:
Flaws, Alternatives, and Action Plans
William Starbuck
Andreas Schwab
Eric Abrahamson
Bruce Thompson
Donald Hatfield
Jose Cortina
Ray Hubbard
Lisa Lambert
Atlanta 2006, Philadelphia 2007, Anaheim 2008,
Chicago 2009, Montreal 2010, San Antonio 2011,
Orlando 2013.
Perspective Article
Researchers Should Make Thoughtful Assessments
Instead of Null-Hypothesis Significance Tests
Andreas Schwab, Iowa State University
Eric Abrahamson, Columbia University
Bill Starbuck, University of Oregon
Fiona Fidler, La Trobe University
2011 Organization Science, 22(4), 1105-1120.
What is wrong with
Null-Hypothesis Significance Testing ?
 Formal Statistics Perspective: Nothing!
 Application Perspective: Nearly everything!
Main Message:
 NHST simply does not answer the questions
we are really interested in.
 Our ritualized NHST applications impede
scientific progress.
NHSTs have been controversial for a
long time
 Fisher proposed NHSTs in 1925
 Immediately, Neyman & Pearson questioned
testing a null-hypothesis without testing any
alternative hypothesis.
 Other complaints have been added over time.
 Statistics textbooks teach a ritualized use of
NHSTs without reference to these complaints.
 Many scholars remain unaware of the strong
arguments against NHSTs.
NHSTs make assumptions that many
studies do not satisfy
NHSTs calculate statistical significance based on a sampling distribution for a random sample.
For any other type of sample, NHST results have no meaningful interpretation:
Non-random samples
Population data
If data are incomplete, the missing data are unlikely to be missing at random
NHSTs portray truth as dichotomous and
definite (= real, important, and certain)
Either reject or fail to reject the null hypothesis.
Ritualized choice of the same arbitrary significance level for all studies (p < .05).
"Cliff effects" amplify very small differences in the data into very large differences in implications.
The lack of explicit discussion and reporting of detailed uncertainty information impedes model testing and development (dichotomous thinking).
NHSTs do not answer the questions we
are really interested in
H0: A new type of training has no effect on knowledge of nurses.
NHSTs estimate the probability of observing an effect at least as large as the one in our data due to random sampling -- assuming H0 is true.
If p is small, we consider H0 unlikely to be true.
... and we conclude training has an effect on nurses' knowledge.
NHSTs do not answer the questions we
are really interested in
Problem 1: In most cases, we already know H0 is never true. Any intervention will have some effect – potentially a small one. (nil hypotheses)
Problem 2: The apparent validity of findings becomes a function of researchers' efforts, due to the sample-size sensitivity of NHSTs. (sample-size sensitivity)
Problem 3: The important question is not whether an effect is different from zero, but whether the effect is large enough to matter. (effect-size evaluation)
Problem 4: No direct probability statements about whether H0 or H1 is true given the data. (inverse probability fallacy & infused meaning)
Higher-order negative consequences of
the ritualized NHST applications
 Risks of false-positive findings
 Risks of false-negative findings
 Corrosion of research ethics
Higher-order consequences:
Risk of false-positive findings
NHSTs use a low threshold for what is considered important (p < .05; typical sample sizes).
 Empirical research is a search for "needles in a
haystack"
(Webster & Starbuck, 1988).
 In management research, the average correlation
between unrelated variables is not zero but 0.09.
 When choosing two variables at random, NHST offers
a 67% chance of significant findings on the first try,
and a 96% chance with three tries for average
reported sample sizes.
 Hence, we mistake lots of “straws” for “needles”
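A minimal arithmetic sketch (Python) of how those chances compound over repeated tries; the 67% single-try figure is taken from the slide, not recomputed here.

```python
# Minimal sketch: compounding of chance "hits" over repeated tries.
# The 67% single-try figure is taken from the slide above (average
# reported sample sizes, average |r| = 0.09 between unrelated variables).
p_single = 0.67                       # chance of p < .05 on one try
p_three = 1 - (1 - p_single) ** 3     # chance of at least one "hit" in three tries
print(f"Chance of a 'significant' finding within three tries: {p_three:.0%}")  # ~96%
```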
Second-order consequences:
“Significant” findings often do not replicate
 Published NHST research findings often do
not replicate or duplicate.
Three-eighths of the most cited and discussed medical treatments supported by significant results in initial studies were later disconfirmed (Type 1 error). (Ioannidis, 2005)
 Refusal of management journals to publish
successful or failed replications:
 Discourages replication studies.
 Distorts meta-analyses.
 Supports belief in false claims.
Second-order consequences:
“Significant” findings often do not replicate
P-values and replication:
p = .01 – false-positive 11%
p = .05 – false-positive 29%
P-hacking:
effect-size sensitivity
choice of alternative dependent variables
choice of alternative independent and control variables
choices within statistical procedures
choice of moderating variables
Simulation studies show the combined effect of these choices: 60% or more false positives (Simmons et al., 2011) – see the toy simulation below.
Clustering of published p-values just below .05, .01, and .001 suggests p-hacking (Simonsohn et al., 2013).
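A toy Python simulation (illustrative assumptions only, not the Simmons et al. setup) of a single researcher degree of freedom: with no true effect, testing several alternative dependent variables and reporting whichever comes out significant inflates the false-positive rate well beyond 5%.

```python
# Toy p-hacking simulation: choose among several dependent variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_dvs, n_studies = 30, 4, 10_000         # assumed sample size and number of DVs
false_pos = 0
for _ in range(n_studies):
    group_a = rng.normal(size=(n, n_dvs))   # the null is true for every DV
    group_b = rng.normal(size=(n, n_dvs))
    pvals = stats.ttest_ind(group_a, group_b).pvalue
    false_pos += (pvals < 0.05).any()       # report the "best" DV
print(f"False-positive rate: {false_pos / n_studies:.0%}")   # well above the nominal 5%
```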
Second-order consequences:
Risks of false-negative findings
 For extremely beneficial or detrimental
outcomes, the p < .05 threshold can be too
high. (Type 2 error)
 Example: Hormone treatments
 NHSTs with fixed significance thresholds
ignore important trade-offs between costs
and benefits of research outcomes.
Third-order consequences:
NHSTs corrode researchers’ motivation
and ethics
 Often repeated and very public misuse of
NHSTs creates cynicism and confusion.
 Familiar applications of NHST are published.
 Justified deviations from the familiar attract
extra scrutiny followed by rejection.
 Research feels more like a game played to
achieve promotion or visibility -- less of a
search for truth or relevant solutions.
 Accumulation of useful scientific knowledge is
hindered.
NHSTs have severe limitations
How can we do better?
Start by Considering Contingencies
 One attraction of NHSTs is superficial
versatility. Researchers can use the same
tests in most contexts.
 However, this appearance of similarity is
deceptive and, in itself, causes poor
evaluations.
 Research contexts in management are
extremely diverse.
Researchers should take account of and discuss these contingencies (methodological toolbox).
Improvements – an example
 Effects of training on 59 nurses’ knowledge about
nutrition. Traditional NHST told us that training
had a “statistically significant” effect, but it did
not show us:
 How much knowledge changed (effect size).
 The actual variability and uncertainty of these
changes.
1: Focus on effect-size measures
and tailor them to research contexts
What metrics best capture changes in the dependent variables?
Describe effects in the meaningful units used to measure the dependent variables – tons, numbers of people, bales, barrels.
Example: percentage of correct answers by nurses on knowledge tests!
Other effect-size measures (e.g., ΔR², Cohen's d, f², ω², Glass's Δ) (Cumming, 2011) – see the sketch below.
Would multiple assessments be informative?
Nurses, patients, hospital administrators, and society may need different measures of effects.
Triangulation opportunities.
Should measures capture both benefits and their costs?
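A minimal sketch (Python, with made-up nurse scores) of reporting an effect both in meaningful units and as a standardized effect size such as Cohen's d:

```python
# Effect-size sketch: meaningful units plus Cohen's d (scores are illustrative).
import numpy as np

trained   = np.array([74, 79, 66, 88, 52, 85, 80, 69])  # % correct, trained nurses
untrained = np.array([52, 61, 48, 70, 55, 63, 58, 49])  # % correct, untrained nurses

raw_diff = trained.mean() - untrained.mean()             # effect in meaningful units
pooled_sd = np.sqrt((trained.var(ddof=1) + untrained.var(ddof=1)) / 2)
cohens_d = raw_diff / pooled_sd                          # standardized effect size
print(f"Difference: {raw_diff:.1f} percentage points, Cohen's d = {cohens_d:.2f}")
```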
2: Report the uncertainty associated
with measures of effects
Report variability and uncertainty of effect estimates (e.g., confidence intervals; see the sketch below).
 Although nurses’ knowledge rose 21% on average,
changes ranged from -23% to +73%. Some nurses
knew less after training!
 Alternatives to CIs include likelihood ratios of
alternative hypotheses and posterior distributions
of estimated parameters.
Show graphs of complete distributions – say, the probability distribution of effect sizes. (Tukey, 1977; Kosslyn, 2006)
Reporting CIs supports aggregation of findings across studies (meta-analyses). (Cumming, 2010)
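A minimal sketch (Python) of reporting a 95% confidence interval for the average change; the per-nurse changes below are illustrative numbers chosen to match the slide's summary (mean +21%, range -23% to +73%).

```python
# Confidence-interval sketch for the mean knowledge change (illustrative data).
import numpy as np
from scipy import stats

changes = np.array([-23, 73, 12, 40, 5, 33, 21, 7])      # percentage-point changes
mean, se = changes.mean(), stats.sem(changes)
ci_low, ci_high = stats.t.interval(0.95, df=len(changes) - 1, loc=mean, scale=se)
print(f"Mean change {mean:.1f}, 95% CI [{ci_low:.1f}, {ci_high:.1f}]")
```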
Endorsement of effects size and CI
reporting by APA Manual
"The degree to which any journal emphasizes
(or de-emphasizes) NHST is a decision of the
individual editor. However, complete reporting
of all tested hypotheses and estimates of
appropriate effect sizes and confidence
intervals are the minimum expectation for all
APA journals."
APA Manual (2010, p. 33)
3: Compare new data with baseline
models rather than null hypotheses
 Compare favored theories with hypotheses
more challenging than a no-effect hypothesis.
 Alternative treatments as baselines
 Naïve Baseline type 1: Data arise from very
simple random processes.
 Example: Suppose that organizational survival is a
random walk.
Naïve baseline type 2: Crude stability or momentum processes.
Example: Tomorrow will be the same as today (see the sketch below).
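A minimal sketch (Python, simulated data) of comparing a favored model against a naïve "tomorrow will be the same as today" baseline instead of a nil hypothesis; the "favored model" here is just a placeholder.

```python
# Baseline-model sketch: does the favored model beat a naive persistence baseline?
import numpy as np

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=200))        # stand-in outcome series (random walk)

actual     = y[1:]
naive_pred = y[:-1]                        # baseline: today's value predicts tomorrow's
model_pred = 0.9 * y[:-1]                  # placeholder for the favored theory's prediction

mae_naive = np.mean(np.abs(actual - naive_pred))
mae_model = np.mean(np.abs(actual - model_pred))
print(f"MAE naive baseline: {mae_naive:.2f}, MAE favored model: {mae_model:.2f}")
# The favored model earns support only if it clearly outperforms the baseline.
```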
3: ... more information on baseline models
Research Methodology in Strategy and Management
Using Baseline Models to Improve Theories About
Emerging Markets
Advances in International Management Research
Why Baseline Modelling is better than Null-Hypothesis Testing:
Examples from International Business Research
Andreas Schwab
Iowa State University
William H. Starbuck
University of Oregon
4: Can Bayesian statistics help?
Revisit: NHSTs answer the wrong question.
Probability of observing the data assuming the null hypothesis is true:
Pr(data|H0)
 Question of interest:
 Probability of proposed hypothesis being true given
the observed data
Pr(H1|data)
(Arbuthnot, 1710; male vs. female birth rates)
Bayesian approaches try to answer the latter question! (See the sketch below.)
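A minimal sketch (Python) of the Bayesian question Pr(H1|data) via Bayes' rule, using assumed prior and likelihood values rather than numbers from any real study:

```python
# Bayes'-rule sketch: posterior probability of H1 given the data (assumed inputs).
prior_h1 = 0.5                     # prior probability that H1 is true
p_data_given_h1 = 0.80             # likelihood of the observed data under H1
p_data_given_h0 = 0.05             # likelihood of the observed data under H0

posterior_h1 = (p_data_given_h1 * prior_h1) / (
    p_data_given_h1 * prior_h1 + p_data_given_h0 * (1 - prior_h1)
)
print(f"Pr(H1 | data) = {posterior_h1:.2f}")   # ~0.94 with these inputs
```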
4: … more information on Bayesian stats
Research Method Division
Professional Development Workshops
Advanced Bayesian Statistics:
How to Conduct and Publish High-Quality Bayesian Studies
William H. Starbuck
University of Oregon
Eugene D. Hahn
Salisbury University
Andreas Schwab
Iowa State University
Zhanyun Zhao
Rider University
Philadelphia, August 2014
How to promote and support
methodological change
 Please speak up –
 When null hypotheses cannot be true
 When researchers apply NHSTs to non-random
samples or to entire populations
When people misinterpret significance tests
When researchers draw definitive conclusions from results that are inherently uncertain and probabilistic
When statistically non-significant findings may be substantively very important
 When researchers do not report effect sizes
Critics of NHSTs and P-Values
NHSTs and P-Values have been likened to:
Mosquitoes (annoying and impossible to swat away)
 Emperor's New Clothes (fraught with obvious
problems that everyone ignores)
 Sterile Intellectual Rake (that ravishes science
but leaves it with no progeny)
 "Statistical Hypothesis Inference Testing"
(because it provides a more fitting acronym)
… and support your colleagues
when they raise such issues!
The Case against Null-Hypothesis
Statistical Significance Tests:
Flaws, Alternatives and Action Plans
Andreas Schwab
Iowa State University
Institute of Technology - Bandung
Universitas Gadjah Mada
April 30, 2014
The Case Against
Null Hypothesis Significance Testing
Additional Slides
NHSTs are frequently misinterpreted
 Individuals infuse more meaning into NHSTs
than these tests can offer.
 NHSTs estimate the probability that the data
would occur in a random sample -- if the H0
were true.
p does NOT represent the probability that the null hypothesis is true given the data (see the toy simulation below).
 1 – p does NOT represent the probability
that H1 is true.
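A toy Python simulation (illustrative assumptions: 80% of studied effects are truly nil, modest sample sizes) showing that among results with p < .05 the null hypothesis can still be true far more often than the p-value suggests:

```python
# Inverse-probability sketch: p < .05 does not mean Pr(H0 | data) < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n = 5_000, 30
null_true = rng.random(n_studies) < 0.8          # assumed share of truly nil effects
effects = np.where(null_true, 0.0, 0.5)          # assumed effect size when H1 holds

sig_total = sig_and_null = 0
for i in range(n_studies):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effects[i], 1.0, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        sig_total += 1
        sig_and_null += null_true[i]
print(f"Share of 'significant' results where H0 was true: {sig_and_null / sig_total:.0%}")
```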
NH Significance depends on researchers’
efforts
 With large samples, NHSTs can turn random
errors, measurement errors and trivial
differences into statistically significant findings.
Consequently, a researcher who gathers a large enough sample can reject any point null hypothesis (see the sketch below).
 Computer technology facilitates efforts to
obtain larger samples.
However, using smaller samples is not the solution, because low statistical power also helps turn “noise” into significant effects.
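A minimal sketch (Python, simulated data) of the sample-size sensitivity described above: a trivially small true difference becomes "statistically significant" once the sample is large enough.

```python
# Sample-size sensitivity sketch: a trivial effect turns significant with large n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
tiny_effect = 0.02                               # trivial true difference in means
for n in (100, 1_000, 100_000):                  # observations per group
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(tiny_effect, 1.0, n)
    print(f"n = {n:>7}: p = {stats.ttest_ind(a, b).pvalue:.3f}")
# With 100,000 per group, even this trivial effect is usually p < .05.
```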
NH Significance and Statistical Power
 Small samples offer limited protection against
false positives.
 NHSTs can turn random errors, measurement
errors and trivial differences into statistically
significant findings.
Risks of exploiting instability of estimates.
Journals should require the following final sentence: "... or maybe this will turn out to be unreplicable noise" – in a font size of (3000 ÷ N).
Recommended literature
Cumming, Geoff (2011):
Understanding the new
statistics: Effect sizes,
confidence intervals, and
meta-analysis.
Routledge, New York.
Recommended literature
Stephen M. Kosslyn (2006)
Graph Design for the Eye
and Mind.
John W. Tukey (1977)
Exploratory Data Analysis.
5: Use robust statistics to make estimates,
especially robust regression
Actual distributions of data often deviate from the probability distributions that tests assume.
 Example: Even with samples from perfect Normal
populations, ordinary least-squares regression
(OLS) makes inaccurate coefficient estimates for
samples smaller than 400.
 With samples from non-Normal distributions, OLS
becomes even more unreliable.
Robust statistics seek to provide more accurate estimates when data deviate from these assumptions (see the sketch below).
Figure 3: Percentage errors in estimated regression coefficients (0%–70%), OLS versus MM-robust estimation, for sample sizes from 50 to 800. 1% of data have data-entry errors that shift the decimal point. Light lines are quartiles.
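A minimal sketch (Python with statsmodels, simulated data) in the spirit of the figure above: one decimal-point data-entry error distorts OLS estimates, while a robust M-estimator (Huber weights here; the figure's MM-estimator is similar in spirit) stays close to the true slope.

```python
# Robust-regression sketch: OLS versus a robust M-estimator with one gross error.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, true_slope = 100, 2.0
x = rng.normal(size=n)
y = true_slope * x + rng.normal(scale=0.5, size=n)
y[np.argmax(np.abs(x))] *= 10      # simulate a shifted decimal point on a high-leverage case

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(f"OLS slope:    {ols_fit.params[1]:.2f}")
print(f"Robust slope: {rlm_fit.params[1]:.2f}   (true slope = {true_slope})")
```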
4: To support generalization and
replicability, frame hypotheses within
very simple models
• If seeking applicability, beware of using many
independent variables.
• If seeking generalization to new data, beware
of using many independent variables.
• A few independent variables are useful, but the optimum is reached after only a few (see the sketch after the figure below).
• Additional variables fit random noise or idiosyncratic effects.
Figure 2: Ockham's Hill – statistical accuracy (0–50) as a function of the number of independent variables (0–7).
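A minimal sketch (Python, simulated data) of Ockham's Hill: adding predictors beyond the handful that matter keeps improving in-sample fit but degrades out-of-sample accuracy.

```python
# Ockham's Hill sketch: out-of-sample error versus number of predictors used.
import numpy as np

rng = np.random.default_rng(5)
n_train, n_test, n_vars = 40, 1_000, 10
X_train = rng.normal(size=(n_train, n_vars))
X_test  = rng.normal(size=(n_test, n_vars))
beta = np.array([1.0, 0.8, 0.5] + [0.0] * (n_vars - 3))   # only three variables matter
y_train = X_train @ beta + rng.normal(scale=2.0, size=n_train)
y_test  = X_test  @ beta + rng.normal(scale=2.0, size=n_test)

for k in (1, 3, 6, 10):                                    # number of predictors used
    coefs, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)
    mse = np.mean((y_test - X_test[:, :k] @ coefs) ** 2)
    print(f"{k:>2} predictors: out-of-sample MSE = {mse:.2f}")
# Accuracy typically peaks near the true handful of predictors, then declines.
```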