The Case against Null-Hypothesis Statistical Significance Tests: Flaws, Alternatives, and Action Plans
Andreas Schwab, Iowa State University
Institute of Technology – Bandung / Universitas Gadjah Mada, April 30, 2014

Who you would really like to be here: Bill Starbuck, University of Oregon

Research Methods Division Professional Development Workshop
The Case Against Null Hypothesis Significance Testing: Flaws, Alternatives, and Action Plans
William Starbuck, Andreas Schwab, Eric Abrahamson, Bruce Thompson, Donald Hatfield, Jose Cortina, Ray Hubbard, Lisa Lambert
Atlanta 2006, Philadelphia 2007, Anaheim 2008, Chicago 2009, Montreal 2010, San Antonio 2011, Orlando 2013

Perspective Article: Researchers Should Make Thoughtful Assessments Instead of Null-Hypothesis Significance Tests
Andreas Schwab (Iowa State University), Eric Abrahamson (Columbia University), Bill Starbuck (University of Oregon), Fiona Fidler (La Trobe University)
Organization Science, 2011, 22(4), 1105-1120

What is wrong with Null-Hypothesis Significance Testing?
• Formal statistics perspective: Nothing!
• Application perspective: Nearly everything!
Main message: NHST simply does not answer the questions we are really interested in. Our ritualized NHST applications impede scientific progress.

NHSTs have been controversial for a long time
• Fisher proposed NHSTs in 1925.
• Immediately, Neyman and Pearson questioned testing a null hypothesis without testing any alternative hypothesis.
• Other complaints have been added over time.
• Statistics textbooks teach a ritualized use of NHSTs without reference to these complaints.
• Many scholars remain unaware of the strong arguments against NHSTs.

NHSTs make assumptions that many studies do not satisfy
• NHSTs calculate statistical significance based on the sampling distribution of a random sample. For any other type of sample, NHST results have no meaningful interpretation:
  – non-random samples,
  – population data,
  – incomplete data, because missing data are unlikely to be missing at random.

NHSTs portray truth as dichotomous and definite (= real, important, and certain)
• Researchers either reject or fail to reject the null hypothesis.
• The same arbitrary significance level (p < .05) is chosen ritually for all studies.
• "Cliff effects" amplify very small differences in the data into very large differences in implications.
• The lack of explicit discussion and reporting of detailed uncertainty information impedes model testing and development (dichotomous thinking).

NHSTs do not answer the questions we are really interested in
• H0: A new type of training has no effect on the knowledge of nurses.
• NHST estimates the probability of observing the actual effect in our data due to random sampling, assuming H0 is true.
• If p is small, we consider H0 unlikely to be true ... and we conclude that training has an effect on nurses' knowledge.

NHSTs do not answer the questions we are really interested in
• Problem 1: In most cases, we already know that H0 is never exactly true; any intervention will have some effect, however small (nil hypotheses).
• Problem 2: The apparent validity of findings becomes a function of researchers' efforts, because NHST results are sensitive to sample size (sample-size sensitivity).
• Problem 3: The important question is not whether an effect differs from zero, but whether the effect is large enough to matter (effect-size evaluation).
• Problem 4: NHSTs provide no direct probability statements about whether H0 or H1 is true, given the data (inverse-probability fallacy and infused meaning).
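To make Problem 2 concrete, here is a minimal sketch (not part of the original slides) of how a fixed, trivially small effect becomes "statistically significant" once the sample is large enough. The effect size d = 0.05 and the sample sizes are illustrative assumptions, not values from any cited study.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# A trivially small standardized mean difference (Cohen's d = 0.05),
# far too small to matter in most applied settings.
d = 0.05

for n_per_group in (100, 500, 2_000, 10_000, 50_000):
    # Large-sample z statistic for a two-sample comparison with equal group sizes.
    z = d * math.sqrt(n_per_group / 2)
    p = two_sided_p_from_z(z)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n_per_group:>6}:  z = {z:5.2f},  p = {p:8.5f}  -> {verdict}")
```

The effect never changes; only the verdict does. That is exactly the sample-size sensitivity the slide criticizes.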
Higher-order negative consequences of ritualized NHST applications
• Risks of false-positive findings
• Risks of false-negative findings
• Corrosion of research ethics

Higher-order consequences: Risk of false-positive findings
• NHSTs use a low threshold for what is considered important (p < .05 at typical sample sizes).
• Empirical research is a search for "needles in a haystack" (Webster & Starbuck, 1988).
• In management research, the average correlation between unrelated variables is not zero but about 0.09.
• When choosing two variables at random, NHST offers a 67% chance of a significant finding on the first try, and a 96% chance within three tries, at average reported sample sizes (a numerical sketch of this calculation appears at the end of this part).
• Hence, we mistake lots of "straws" for "needles".

Second-order consequences: "Significant" findings often do not replicate
• Published NHST research findings often do not replicate or duplicate.
• Three-eighths of the most cited and discussed medical treatments supported by significant results in initial studies were later disconfirmed (Type 1 errors; Ioannidis, 2005).
• Management journals' refusal to publish successful or failed replications:
  – discourages replication studies,
  – distorts meta-analyses,
  – supports belief in false claims.

Second-order consequences: "Significant" findings often do not replicate
• P values and replication: a result at p = .01 carries roughly an 11% false-positive risk; at p = .05, roughly 29%.
• P-hacking exploits the sensitivity of results to researcher choices: alternative dependent variables, alternative independent and control variables, choices within statistical procedures, and choices of moderating variables.
• Simulation studies show that the combined effect of such choices produces 60% or more false positives (Simmons et al., 2011).
• The clustering of published p-values just below .05, .01, and .001 suggests p-hacking (Simonsohn et al., 2013).

Second-order consequences: Risks of false-negative findings
• For extremely beneficial or detrimental outcomes, the p < .05 threshold can be too demanding (Type 2 errors). Example: hormone treatments.
• NHSTs with fixed significance thresholds ignore important trade-offs between the costs and benefits of research outcomes.

Third-order consequences: NHSTs corrode researchers' motivation and ethics
• The frequent and very public misuse of NHSTs creates cynicism and confusion.
• Familiar applications of NHST get published; justified deviations from the familiar attract extra scrutiny followed by rejection.
• Research feels more like a game played to achieve promotion or visibility, and less like a search for truth or relevant solutions.
• The accumulation of useful scientific knowledge is hindered.

NHSTs have severe limitations. How can we do better? Start by considering contingencies.
• One attraction of NHSTs is their superficial versatility: researchers can use the same tests in most contexts.
• However, this appearance of similarity is deceptive and, in itself, causes poor evaluations.
• Research contexts in management are extremely diverse. Researchers should account for and discuss these contingencies (methodological toolbox).

Improvements – an example
• Effects of training on 59 nurses' knowledge about nutrition.
• A traditional NHST told us that training had a "statistically significant" effect, but it did not show us:
  – how much knowledge changed (effect size), or
  – the actual variability and uncertainty of those changes.
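Here is the sketch of the false-positive arithmetic referenced in the "needles in a haystack" slide above. It uses an approximate power calculation (Fisher z transformation) for a true correlation of 0.09. The sample size of 700 is an assumption chosen only because it roughly reproduces the 67% and 96% figures quoted on the slide; it is not a value stated in the original material.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_for_correlation(r, n):
    """Approximate power of a two-sided test of H0: rho = 0,
    based on the Fisher z transformation of the sample correlation."""
    z_crit = 1.96                                # two-sided critical value for p < .05
    mu = math.atanh(r) * math.sqrt(n - 3)        # expected value of the test statistic
    return normal_cdf(mu - z_crit) + normal_cdf(-mu - z_crit)

r_unrelated = 0.09   # average correlation between "unrelated" variables (per the slide)
n = 700              # hypothetical "average" reported sample size -- an assumption here

p1 = power_for_correlation(r_unrelated, n)
p3 = 1 - (1 - p1) ** 3   # at least one "significant" result in three independent tries

print(f"chance of a significant result on the first try: {p1:.0%}")
print(f"chance of at least one significant result in three tries: {p3:.0%}")
```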
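For the nurse-training example just described, the following sketch shows the kind of reporting that the next two recommendations call for: an effect size and a confidence interval alongside the p-value. The pre/post scores are simulated for illustration only; they are not the data from the actual study, whose results appear in the slides that follow.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Simulated pre/post knowledge scores (% correct) for 59 nurses -- illustrative only.
n = 59
pre = rng.normal(loc=55, scale=12, size=n)
post = pre + rng.normal(loc=21, scale=20, size=n)   # noisy average gain around 21 points
change = post - pre

# What NHST reports: a p-value for H0 "training has no effect".
res = stats.ttest_rel(post, pre)

# What a thoughtful assessment adds: effect size and uncertainty.
mean_change = change.mean()
sd_change = change.std(ddof=1)
cohens_d = mean_change / sd_change                      # standardized paired effect size
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * sd_change / np.sqrt(n)
ci_low, ci_high = mean_change - half_width, mean_change + half_width

print(f"p-value                  : {res.pvalue:.4f}")
print(f"mean change              : {mean_change:+.1f} percentage points")
print(f"95% CI for mean change   : [{ci_low:+.1f}, {ci_high:+.1f}]")
print(f"Cohen's d (paired)       : {cohens_d:.2f}")
print(f"range of individual gains: {change.min():+.1f} to {change.max():+.1f}")
```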
1: Focus on effect-size measures and tailor them to research contexts
• What metrics best capture changes in the dependent variables?
  – Describe effects in the meaningful units used to measure dependent variables: tons, numbers of people, bales, barrels. Example: the percentage of correct answers by nurses on knowledge tests.
  – Other effect-size measures are available (e.g., ∆R², Cohen's d, f², ω², Glass's ∆) (Cumming, 2011).
• Would multiple assessments be informative? Nurses, patients, hospital administrators, and society may need different measures of effects. Triangulation opportunities.
• Should measures capture both benefits and their costs?

2: Report the uncertainty associated with measures of effects
• Report the variability and uncertainty of effect estimates (e.g., confidence intervals). Although nurses' knowledge rose 21% on average, individual changes ranged from -23% to +73%. Some nurses knew less after training!
• Alternatives to CIs include likelihood ratios of alternative hypotheses and posterior distributions of estimated parameters.
• Show graphs of complete distributions – say, the probability distribution of effect sizes (Tukey, 1977; Kosslyn, 2006).
• Reporting CIs supports the aggregation of findings across studies in meta-analyses (Cumming, 2010).

Endorsement of effect-size and CI reporting by the APA Manual
"The degree to which any journal emphasizes (or de-emphasizes) NHST is a decision of the individual editor. However, complete reporting of all tested hypotheses and estimates of appropriate effect sizes and confidence intervals are the minimum expectation for all APA journals." (APA Manual, 2010, p. 33)

3: Compare new data with baseline models rather than null hypotheses
• Compare favored theories with hypotheses more challenging than a no-effect hypothesis.
• Alternative treatments can serve as baselines.
• Naïve baseline type 1: data arise from very simple random processes. Example: suppose that organizational survival is a random walk.
• Naïve baseline type 2: crude stability or momentum processes. Example: tomorrow will be the same as today.
(A small sketch of a baseline-model comparison appears at the end of this part.)

3: ... more information on baseline models
• Research Methodology in Strategy and Management: "Using Baseline Models to Improve Theories About Emerging Markets"
• Advances in International Management Research: "Why Baseline Modelling Is Better than Null-Hypothesis Testing: Examples from International Business Research"
• Andreas Schwab (Iowa State University) and William H. Starbuck (University of Oregon)

4: Can Bayesian statistics help?
• Revisit: NHSTs answer the wrong question – the probability of observing the data assuming the null hypothesis is true, Pr(data | H0).
• The question of interest is the probability of the proposed hypothesis being true given the observed data, Pr(H1 | data). (Arbuthnot, 1710; male vs. female birth rates)
• Bayesian approaches try to answer the latter question.

4: ... more information on Bayesian statistics
• Research Methods Division Professional Development Workshop: "Advanced Bayesian Statistics: How to Conduct and Publish High-Quality Bayesian Studies"
• William H. Starbuck (University of Oregon), Eugene D. Hahn (Salisbury University), Andreas Schwab (Iowa State University), Zhanyun Zhao (Rider University)
• Philadelphia, August 2014
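A minimal Bayesian sketch of the Pr(H1 | data) question discussed above. The birth counts are hypothetical stand-ins in the spirit of Arbuthnot's male-versus-female comparison, not his actual 1710 data, and a uniform Beta(1, 1) prior is assumed.

```python
from scipy import stats

# Hypothetical birth counts, illustrative only (not Arbuthnot's 1710 data).
male_births = 5_250
total_births = 10_300

# Beta-Binomial model with a uniform Beta(1, 1) prior on the probability p
# that a birth is male.
posterior = stats.beta(1 + male_births, 1 + total_births - male_births)

# A direct answer to the question of interest, Pr(H1 | data),
# where H1 is "male births are more likely than female births" (p > 0.5).
prob_h1 = posterior.sf(0.5)
low, high = posterior.ppf([0.025, 0.975])

print(f"Pr(p > 0.5 | data) = {prob_h1:.4f}")
print(f"95% credible interval for p: [{low:.3f}, {high:.3f}]")
```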
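And here is the baseline-model sketch referenced in point 3 above. It uses a synthetic series (an assumption, not data from the cited chapters) to show why beating a naïve "tomorrow will be the same as today" baseline is a more demanding hurdle for a theory than beating a no-effect null.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Synthetic "performance" series: close to a random walk with a weak drift.
# Purely illustrative -- a stand-in for organizational data.
T = 120
y = np.cumsum(0.05 + rng.normal(scale=1.0, size=T))
train, test = y[:100], y[100:]

# Favored model: a linear time trend fitted to the training window.
t_train = np.arange(len(train))
slope, intercept = np.polyfit(t_train, train, deg=1)
t_test = np.arange(len(train), len(y))
pred_model = intercept + slope * t_test

# Naïve baseline type 2: each forecast is simply the previous observed value.
pred_naive = y[99:119]

mae_model = np.mean(np.abs(test - pred_model))
mae_naive = np.mean(np.abs(test - pred_naive))

print(f"MAE, fitted trend model: {mae_model:.2f}")
print(f"MAE, naive baseline    : {mae_naive:.2f}")
# Beating "no effect" is easy; beating a naive baseline is a far more
# informative hurdle, and here the naive baseline is typically hard to beat.
```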
How to promote and support methodological change
Please speak up:
• when null hypotheses cannot be true,
• when researchers apply NHSTs to non-random samples or to entire populations,
• when people misinterpret significance tests,
• when researchers draw definitive conclusions from results that are inherently uncertain and probabilistic,
• when statistically non-significant findings may be substantively very important,
• when researchers do not report effect sizes.

Critics of NHSTs and p-values
NHSTs and p-values have been likened to:
• mosquitoes (annoying and impossible to swat away),
• the emperor's new clothes (fraught with obvious problems that everyone ignores),
• a sterile intellectual rake (that ravishes science but leaves it with no progeny),
• "Statistical Hypothesis Inference Testing" (because it provides a more fitting acronym).
... and support your colleagues when they raise such issues!

The Case Against Null Hypothesis Significance Testing: Additional Slides

NHSTs are frequently misinterpreted
• Individuals infuse more meaning into NHSTs than these tests can offer.
• NHSTs estimate the probability that the data would occur in a random sample if H0 were true.
• p does NOT represent the probability that the null hypothesis is true given the data.
• 1 - p does NOT represent the probability that H1 is true.

NH significance depends on researchers' efforts
• With large samples, NHSTs can turn random errors, measurement errors, and trivial differences into statistically significant findings.
• Consequently, a researcher who gathers a large enough sample can reject any point null hypothesis.
• Computer technology facilitates efforts to obtain larger samples.
• However, using smaller samples is not the solution, because low statistical power helps turn "noise" into significant effects.

NH significance and statistical power
• Small samples offer limited protection against false positives: NHSTs can turn random errors, measurement errors, and trivial differences into statistically significant findings.
• There are risks of exploiting the instability of estimates.
• Journals should require the following final sentence: "... or maybe this will turn out to be unreplicable noise" – in a font size of (3000 ÷ N).

Recommended literature
• Cumming, Geoff (2011). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, New York.
• Kosslyn, Stephen M. (2006). Graph Design for the Eye and Mind.
• Tukey, John W. (1977). Exploratory Data Analysis.

5: Use robust statistics to make estimates, especially robust regression
• Actual distributions of data often deviate from the probability distributions that tests assume.
• Example: even with samples from perfectly Normal populations, ordinary least-squares (OLS) regression makes inaccurate coefficient estimates for samples smaller than 400. With samples from non-Normal distributions, OLS becomes even more unreliable.
• Robust statistics seek to provide more accurate estimates when data deviate from assumptions.
[Figure 3: Percentage errors in estimated regression coefficients (0% to 70%), OLS versus MM-robust estimation, for sample sizes from 50 to 800. 1% of the data have data-entry errors that shift the decimal point; light lines are quartiles.]
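A hedged sketch in the spirit of Figure 3: simulated data in which about 1% of the outcome values suffer a decimal-point entry error, comparing OLS with a robust alternative. statsmodels offers Huber M-estimation rather than the MM-estimator shown in the figure, so the robust fit here is a stand-in, and the data and coefficients are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)

# Simulated data: y = 1 + 2*x + noise, with roughly 1% of y values suffering a
# decimal-point data-entry error (multiplied by 10).
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)
bad = rng.choice(n, size=2, replace=False)
y[bad] = y[bad] * 10

X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
# Huber M-estimation as a stand-in for the MM-estimator used in Figure 3.
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("true slope           : 2.00")
print(f"OLS slope            : {ols_fit.params[1]:.2f}")
print(f"robust (Huber) slope : {robust_fit.params[1]:.2f}")
# The OLS estimate is typically pulled away from the true value by the
# contaminated observations, while the robust estimate stays close.
```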
4: To support generalization and replicability, frame hypotheses within very simple models
• If seeking applicability, beware of using many independent variables.
• If seeking generalization to new data, beware of using many independent variables.
• A few independent variables are useful, but the optimum occurs at just a few.
• Additional variables fit random noise or idiosyncratic effects.
[Figure 2. Ockham's Hill: statistical accuracy (0 to 50) versus number of independent variables (0 to 7).]
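A small simulation in the spirit of Ockham's Hill, added here for illustration: only two of ten candidate predictors really matter, and out-of-sample prediction error typically bottoms out after those two and then worsens as further variables fit noise. All data and parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Ten candidate predictors, but only the first two actually drive the outcome.
n_train, n_test, p_max = 40, 400, 10
X_train = rng.normal(size=(n_train, p_max))
X_test = rng.normal(size=(n_test, p_max))
true_beta = np.array([2.0, 1.5] + [0.0] * (p_max - 2))
y_train = X_train @ true_beta + rng.normal(scale=2.0, size=n_train)
y_test = X_test @ true_beta + rng.normal(scale=2.0, size=n_test)

def out_of_sample_mse(k):
    """Fit OLS with the first k predictors (plus intercept) on the training
    sample and return mean squared prediction error on the test sample."""
    A_train = np.column_stack([np.ones(n_train), X_train[:, :k]])
    A_test = np.column_stack([np.ones(n_test), X_test[:, :k]])
    beta_hat, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean((y_test - A_test @ beta_hat) ** 2)

for k in range(1, p_max + 1):
    print(f"{k:2d} predictors: out-of-sample MSE = {out_of_sample_mse(k):.2f}")
# Prediction error typically reaches its minimum after the two real predictors
# and then creeps back up as additional variables fit noise -- Ockham's Hill.
```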