Statistical Foundations: Hypothesis Testing
Psychology 790, Lecture #9
9/19/2006

Today's Class
• Hypothesis testing.
– General terms and philosophy.
– Specific examples.

Hypothesis Testing

Rules of the NHST Game
• Recall our discussion of Null Hypothesis Significance Testing (NHST) from the last lecture.
• The probability value it produces is often called a p-value, or simply p.
– When p < .05, a result is said to be "statistically significant."
• In short, when a result is statistically significant (p < .05), we conclude that the difference we observed was unlikely to be due to sampling error alone. We "reject the null hypothesis."
• If the statistic is not statistically significant (p > .05), we conclude that sampling error is a plausible interpretation of the results. We "fail to reject the null hypothesis."

Hypothesis Testing Notes
• It is important to keep in mind that NHSTs were developed for the purpose of making yes/no decisions about the null hypothesis.
– As a consequence, the null is either accepted or rejected on the basis of the p-value.
• For logical reasons, some people are uneasy "accepting the null hypothesis" when p > .05, and prefer to say that they "failed to reject the null hypothesis" instead.

Hypothesis Testing: Items of Interest
• Very important points about significance testing:
1. The term "significant" does not mean important, substantial, or worthwhile.

Points, continued
2. The null and alternative hypotheses are often constructed to be mutually exclusive. If one is true, the other must be false.
• As a consequence:
– When you reject the null hypothesis, you accept the alternative.
– When you fail to reject the null hypothesis, you reject the alternative.
• This may seem tricky because NHSTs do not test the research hypothesis per se.
– Formally, only the null hypothesis is tested.

Points, continued
3. Because NHSTs are often used to make a yes/no decision about whether the null hypothesis is a viable explanation, mistakes can be made.

Errors in Hypothesis Testing

Errors in Inference Using NHST
• NHST can lead to decisions that are not correct:
• Type I error: Your test is significant (p < .05), so you reject the null hypothesis, but the null hypothesis is actually true.
• Type II error: Your test is not significant (p > .05), so you don't reject the null hypothesis, but you should have because it is false.

Errors in Inference Using NHST
• The probability of making a Type I error is determined by the experimenter. It is often called the alpha value, and is usually set to 5%.
• The probability of making a Type II error is also determined by the experimenter. It is often called the beta value, and is usually ignored by social science researchers.

Errors in Inference Using NHST
• The converse of a Type II error is called power:
– The probability of rejecting the null hypothesis when it is false (a correct decision).
– Power = 1 - beta.

More on Power
• Power is strongly influenced by sample size.
– With larger N, we are more likely to reject the null if it is false.
– Power analyses are conducted to determine the size of a sample needed to reject a null hypothesis.
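To make the alpha and beta definitions concrete, here is a minimal simulation sketch in Python (assuming NumPy and SciPy are available) that estimates both error rates for a one-sample z-test. The population values, sample size, and helper name are illustrative choices, not values from the lecture.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    alpha, sigma, n, n_sims = 0.05, 15, 25, 10_000

    def z_test_rejects(true_mean, null_mean=100):
        # Draw one sample and run a two-sided one-sample z-test of H0: mu = null_mean.
        sample = rng.normal(true_mean, sigma, n)
        z = (sample.mean() - null_mean) / (sigma / np.sqrt(n))
        p = 2 * (1 - norm.cdf(abs(z)))
        return p < alpha

    # Null is true: rejections are Type I errors and should occur about 5% of the time.
    type_i = np.mean([z_test_rejects(true_mean=100) for _ in range(n_sims)])
    # Null is false: rejections are correct decisions; their long-run rate is the power.
    power = np.mean([z_test_rejects(true_mean=106) for _ in range(n_sims)])
    print(f"Type I error rate ~ {type_i:.3f} (alpha)")
    print(f"Power ~ {power:.3f}; Type II error rate ~ {1 - power:.3f} (beta)")

With these particular (made-up) numbers the simulated Type I rate hovers near .05, while power comes out around .5, so roughly half of the "null is false" samples produce a Type II error.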
Inferential Errors and NHST
• The four possible outcomes of a test:

                                    Real world: null is true    Real world: null is false
  Conclusion: null is true          Correct decision            Type II error
  (fail to reject)
  Conclusion: null is false         Type I error                Correct decision
  (reject)

Points of Interest
• The example we explored previously was an example of what is called a z-test of a sample mean.
• Significance tests have been developed for a number of statistics:
– difference between two group means: t-test
– difference between two or more group means: ANOVA
– differences between proportions: chi-square

How Do We Control Type I Errors?
• The Type I error rate is typically controlled by the researcher.
• It is called the alpha rate, and corresponds to the probability cut-off that one uses in a significance test.
• By convention, researchers often use an alpha rate of .05.
– In other words, they will only reject the null hypothesis when a statistic would occur 5% of the time or less when the null hypothesis is true.
• In principle, any probability value could be chosen for making the accept/reject decision.
– 5% is used by convention.

Type I Errors
• What does 5% mean in this context?
• It means that we will make a decision error only 5% of the time if the null hypothesis is true.
• If the null hypothesis is false, the Type I error rate is undefined.

How Do We Control Type II Errors?
• Type II errors can also be controlled by the experimenter.
• The Type II error rate is sometimes called beta.
• How can the beta rate be controlled? The easiest way to control Type II errors is by increasing the statistical power of a test.

Statistical Power
• Statistical power is defined as the probability of rejecting the null hypothesis when it is false, a correct decision (1 - beta).
• Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null hypothesis if it is truly false.
– (As N increases, the standard error shrinks. Sampling error becomes less problematic, and true differences are easier to detect.)

Power and Correlation
[Figure: power of the significance test for a correlation as a function of sample size (N = 50 to 200), for a population r = .30.]
• This graph shows how the power of the significance test for a correlation varies as a function of sample size, for a population r = .30.
• Notice that when N = 80, there is about an 80% chance of correctly rejecting the null hypothesis (beta = .20). When N = 45, we only have a 50% chance of making the correct decision, a coin toss (beta = .50).

Power and Correlation
[Figure: power as a function of sample size (N = 50 to 200) for population correlations r = .00, .20, .40, .60, and .80.]
• Power also varies as a function of the size of the correlation.
• When the population correlation is large (e.g., .80), it requires fewer subjects to correctly reject the null hypothesis that the population correlation is 0.
• When the population correlation is smallish (e.g., .20), it requires a large number of subjects to correctly reject the null hypothesis.
• When the population correlation is 0, the probability of rejecting the null is constant at 5% (alpha). Here "power" is technically undefined because the null hypothesis is true.

Low Power Studies
[Figure: the same power curves, highlighting sample sizes below about 60.]
• Because correlations in the .2 to .4 range are typically observed in non-experimental research, one would be wise not to trust research based on sample sizes of less than 60 or so.
• Why? Because such research only stands a 50% chance of yielding the correct decision, if the null is false.
• It would be more efficient (and, importantly, just as accurate) to flip a coin to make the decision rather than collecting data and using a significance test.
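The power curves described above can be approximated analytically with Fisher's r-to-z transformation. Below is a minimal sketch (assuming SciPy is available; the helper name is mine), with values chosen to mirror the r = .30 example.

    import numpy as np
    from scipy.stats import norm

    def correlation_power(rho, n, alpha=0.05):
        # Approximate power of the two-sided test of H0: rho = 0, using the
        # Fisher z approximation: z = arctanh(r) with standard error 1/sqrt(n - 3).
        z_crit = norm.ppf(1 - alpha / 2)
        noncentrality = np.arctanh(rho) * np.sqrt(n - 3)
        return norm.cdf(-z_crit - noncentrality) + 1 - norm.cdf(z_crit - noncentrality)

    for n in (45, 80, 130, 200):
        print(n, round(correlation_power(0.30, n), 2))
    # Roughly .52 at N = 45 and .78 at N = 80, in line with the coin-toss and
    # ~80% figures quoted for the r = .30 curve above.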
A Sad Fact
• In 1962, Jacob Cohen surveyed all articles in the Journal of Abnormal and Social Psychology and determined that the typical power of research conducted in this area was 53%.
• An even sadder fact: In 1989, Sedlmeier and Gigerenzer surveyed studies in the same journal (now called the Journal of Abnormal Psychology) and found that the power had decreased slightly.
• Researchers, unfortunately, pay little attention to power. As a consequence, the Type II error rate of research in psychology is likely to be dangerously high, maybe as high as 50%.

Power in Research Design
• Power is important to consider, and should be used to design research projects.
– Given an educated guess about what the population parameter might be (e.g., a correlation of .30, a mean difference of .5 SD), one can determine the number of subjects needed for a desired level of power.
– Cohen and others recommend that researchers try to obtain a power level of about 80%.

Power in Research Design
• Thus, if one used an alpha level of 5% and collected enough subjects to ensure a power of 80% for an assumed effect, one would know, before the study was done, what the theoretical error rates are for the statistical test.
• Although these error rates correspond to long-run outcomes, one could get a sense of whether the research design was a credible one: whether it is likely to minimize the two kinds of errors that are possible in NHST and, correspondingly, maximize the likelihood of making a correct decision.

Misconceptions About Hypothesis Testing

Three Common Misinterpretations of Significance Tests and p-values
1. The p-value indicates the probability that the results are due to sampling error or "chance."
2. A statistically significant result is a "reliable" result.
3. A statistically significant result is a powerful, important result.

Misinterpretation #1
• The p-value is a conditional probability: the probability of observing a specific range of sample statistics GIVEN (i.e., conditional upon) that the null hypothesis is true, P(D | H0).
• This is not equivalent to the probability of the null hypothesis being true, given the data:
– P(H0 | D) ≠ P(D | H0)

Misinterpretation #2
• Is a significant result a "reliable," easily replicated result?
• Not necessarily. The p-value is a poor indicator of the replicability of a finding.
• Replicability (assuming a real effect exists, that is, that the null hypothesis is false) is primarily a function of statistical power.

Misinterpretation #2
• If a study had statistical power of 80%, what is the probability of obtaining a "significant" result twice?
• The probability of two independent events both occurring is the simple product of the probability of each of them occurring.
– .80 × .80 = .64
• If power = 50%? .50 × .50 = .25
• Bottom line: The likelihood of replicating a result is determined by statistical power, not the p-value derived from a significance test. When the power of the test is low, the likelihood of a long-run series of replications is even lower.
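The replication arithmetic above is easy to verify directly. Here is a tiny Python sketch under the same independence assumption; the three-study case is added purely for illustration.

    # Under the independence assumption used above, the chance that k studies
    # all reach p < .05 is simply power ** k.
    for power in (0.80, 0.50):
        for k in (2, 3):
            print(f"power = {power:.2f}: P(all {k} studies significant) = {power ** k:.3f}")
    # power = 0.80 -> .640 for two studies, .512 for three
    # power = 0.50 -> .250 for two studies, .125 for three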
Misinterpretation #3
• Is a significant result a powerful, important result?
• Not necessarily.
• The importance of the result, of course, depends on the issue at hand, the theoretical context of the finding, etc.

Misinterpretation #3
• We can measure the practical or theoretical significance of an effect using an index of effect size.
• An effect size is a quantitative index of the strength of the relationship between two variables.
• Some common measures of effect size are correlations, regression weights, t-values, and R-squared.

Misinterpretation #3
• Importantly, the same effect size can have different p-values, depending on the sample size of the study.
• For example, a correlation of .30 would not be statistically significant with a sample size of 30, but would be statistically significant with a sample size of 130. (A short numerical check of this example appears at the end of these notes.)
• Bottom line: The p-value is a poor way to evaluate the practical "significance" of a research result.

Wrapping Up
• Today was another fun lecture about the philosophy of hypothesis testing.
• We do hypothesis testing all the time.
– That doesn't make it something without error, though.

Next Time
• Office hours today (1pm-4pm, 449 Fraser).
• Lab tonight (examples of hypothesis tests).
• Hypothesis testing example.
• Confidence intervals (Ch. 6.8 – 6.11).
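Finally, the numerical check promised under Misinterpretation #3: a small Python sketch (assuming SciPy is available; the helper name is mine) showing that the same correlation, r = .30, crosses the .05 threshold at N = 130 but not at N = 30.

    import numpy as np
    from scipy.stats import t as t_dist

    def correlation_p_value(r, n):
        # Two-sided p-value for H0: rho = 0, using t = r * sqrt(n - 2) / sqrt(1 - r^2)
        # with n - 2 degrees of freedom.
        t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
        return 2 * t_dist.sf(abs(t_stat), n - 2)

    for n in (30, 130):
        print(n, round(correlation_p_value(0.30, n), 4))
    # N = 30  -> p ~ .11   (not significant at alpha = .05)
    # N = 130 -> p ~ .0005 (significant)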