Lecture 11 - Power
Copyright © 2005 by Michael Biderman

Power: the probability of rejecting the null hypothesis in those situations in which the null is false.

In terms of the 2 x 2 table of outcomes: the combination of two possible states of the population with two possible decisions leads to four possible outcomes.

                           Situation that exists in the population
  Experiment outcome       Null is true                  Null is false

  Retain null              Correct retention             Incorrect retention
                                                         (probability: Type II error rate)

  Reject null              Incorrect rejection           Correct rejection
                           (probability: significance    (probability: power)
                           level)

Note that power is not an issue for the left side of the 2x2 table. If we're in the left side of the table, the null hypothesis is true, and the only thing that affects the probability of rejection (or the probability of retention) is the significance level of the statistical test. This is set in advance (usually at .05) and does not depend on the outcome of the research.

Power is an issue only for the right side of the 2x2 table. If we're in the right side of the table, the null is false: there is some difference between the population means. In that case, a whole collection of factors affects the probability of rejection, the probability of our making the correct decision. Those are the factors we're considering here. Obviously, if the null is false, you want to do whatever you can to put yourself in the lower right cell.

To recap: If the null is true, power is not an issue; there is no difference between population means, or the population correlation coefficient is 0. Oh woe is me! If the null is false, there IS a difference in the population means, or the population correlation coefficient is different from zero. You want your research project to be able to detect that falseness, so you want the most powerful design you can afford. You want to reject the null.
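To make the two sides of the table concrete, here is a minimal simulation sketch. It assumes Python with numpy and scipy installed (none of this is in the original handout): when the null is true, the rejection rate hovers near the significance level no matter how large the samples are; when the null is false, the rejection rate, which is the power, climbs with sample size.

```python
# Monte Carlo illustration of the 2x2 table (assumes numpy and scipy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(true_d, n_per_group, reps=5000, alpha=0.05):
    """Proportion of replications in which a two-sample t-test rejects H0."""
    rejections = 0
    for _ in range(reps):
        g1 = rng.normal(0.0, 1.0, n_per_group)     # "control" population
        g2 = rng.normal(true_d, 1.0, n_per_group)  # "treatment" population
        if stats.ttest_ind(g1, g2).pvalue < alpha:
            rejections += 1
    return rejections / reps

for n in (25, 100):
    # Null true (d = 0): the rate stays near .05 regardless of n.
    # Null false (d = 0.5): the rate, i.e., the power, grows with n
    # (roughly .4 at n = 25 per group and roughly .94 at n = 100).
    print(n, rejection_rate(0.0, n), rejection_rate(0.5, n))
```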
Factors that affect Power, in order of importance

1. The effect size, the actual size of the effect in the population. When comparing means: how big the difference between population means actually is. When doing correlational research: how strong the correlation actually is in the population.

Definitions:
  When comparing two population means, effect size is d = (μE − μC) / σ, the difference between the experimental and control population means in within-population standard deviation units.
  When investigating a relationship, effect size = the population r.

If the population means are equal, or the population r = 0, then the effect size is 0, the null is true, and power is not an issue. The larger the effect size, the more likely we are to detect it.

Analogy: the brightness of a distant star. The brighter it is, the easier it will be to detect.

What are small, medium, and big effect sizes? See Lance, C. E., & Vandenberg, R. J. (2008). Statistical and Methodological Myths and Urban Legends. Routledge. For a recent update on correlation effect sizes, see Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2014, October 13). Correlational Effect Size Benchmarks. Journal of Applied Psychology. Advance online publication. http://dx.doi.org/10.1037/a0038047. Characterizations of effect sizes in terms of what Cohen considered small, medium, and large are presented below.

2. The sample size. This is the only factor we really have control over: the larger the sample size, the greater the power. Sample size is the primary method of manipulating power.

Analogy: the size of our telescope. The larger the telescope, the greater the chance of detecting a star.

Note that increasing the sample size has no effect on the probabilities computed in those situations in which the null is true, the left side of the 2x2 table above. If the null is true, the probability of incorrectly rejecting it depends only on the significance level, which is set before the research is conducted.

3. The particular test chosen. For example, in the comparison of two groups, if the assumptions of the t-test are met, the t-test is the most powerful way to compare means. The Mann-Whitney U test is less powerful as a test to compare means than the t-test when those assumptions are met.

4. The significance level. The larger the significance level, the larger the power. But you can't have your cake and eat it too: increasing the significance level also increases the probability of a Type I error, since that is exactly what the significance level is.

5. The variability of scores within each population. Recall that for two populations, the effect size is the difference in population means divided by the population standard deviation, (μ1 − μ0) / σ. Proper conduct of the experiment may reduce the value of σ, and the smaller the value of σ, the larger the power. Manipulating σ does not affect the probability of a Type I error.

Telescope analogy: get rid of random atmospheric distortion.

6. Direction of the alternative hypothesis. All other things being equal, a one-tailed alternative hypothesis is more powerful than a two-tailed alternative, provided you've specified the direction correctly.

Analogy: detecting a true difference is like detecting a distant star.

  Effect size .............. brightness of the star
  Sample size .............. diameter of the telescope
  Test chosen .............. type of telescope (refractive vs. reflective)
  Significance level ....... willingness to call a spot of light "the star"
  Variability of scores .... cleanliness of the lenses
  One-tailed H1 ............ knowing where to look

Summarizing

  Manipulation                          If there is no difference in     If there is a difference in
                                        population means (left side      population means (right side
                                        of the 2x2 table)                of the 2x2 table)

  Increase the effect size              No effect at all                 Increases power
  Increase the sample size              No effect at all                 Increases power
  Choose a more powerful test           No effect at all                 Increases power
  Make the significance level larger    Increases the Type I error rate  Increases power
  Decrease variability within groups    No effect at all                 Increases power
  Choose the appropriate one-tailed
    alternative hypothesis              No effect at all                 Increases power

How big is an effect size?

Measures of Effect Size for Common Statistical Tests

One-population t
  Population value: (actual population mean − hypothesized population mean) / population SD
  Sample estimate:  d = (sample mean − hypothesized population mean) / sample SD
  Small = .2    Medium = .5    Large = .8

Two independent samples t
  Population value: (population mean 1 − population mean 2) / population SD
  Sample estimate:  d = (sample mean 1 − sample mean 2) / sqrt(pooled variance, S²p)
  Small = .2    Medium = .5    Large = .8

Two correlated samples t
  Population value: (population mean 1 − population mean 2) / population SD
  Sample estimate:  d = (sample mean 1 − sample mean 2) / sqrt((S²1 + S²2) / 2)
  Small = .2    Medium = .5    Large = .8
  Note: the correlation of the paired scores, r, influences the actual effect size.

One-way independent samples ANOVA
  Population value: f = SD of the population means / population SD
  Sample estimate:  f = SD of the sample means / sqrt(MS within)
  Small = .1    Medium = .25    Large = .4

Eta squared (printed by SPSS in some procedures)
  η² = f² / (1 + f²)
  Small = .010    Medium = .059    Large = .138

Pearson r between two variables
  Population value: the population r.  Sample estimate: the sample r.
  Small = .10    Medium = .30    Large = .50
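The sample-estimate column of the table can be computed directly from data. Here is a brief sketch in Python (numpy assumed; the function names are mine, not part of the handout):

```python
# Sample effect-size estimates from the table above (assumes numpy).
import numpy as np

def cohens_d_independent(x1, x2):
    """d = (M1 - M2) / sqrt(pooled variance), two independent samples."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * np.var(x1, ddof=1) +
                  (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)

def cohens_f_oneway(groups):
    """f = SD of the sample means / sqrt(MS within); equal-n shortcut."""
    means = [np.mean(g) for g in groups]
    ms_within = np.mean([np.var(g, ddof=1) for g in groups])
    return np.std(means) / np.sqrt(ms_within)

def eta_squared(f):
    """Eta squared from f: f^2 / (1 + f^2)."""
    return f ** 2 / (1 + f ** 2)
```

So a d of .5 says the two sample means sit half a within-group standard deviation apart, regardless of the units in which the scores were measured.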
Determining Sample Size for upcoming research

It is important to take power into account when planning the sample size for research. Following is an illustration of what must be considered when comparing two groups, a common situation.

I. Determine how big the population effect size is that you're trying to detect. That is, determine how big a difference you'll be trying to discover in your research.

Commonly asked question: How can we know what the difference will be in the population before we've conducted the experiment to discover whether there is a difference? A Catch-22 situation. One answer is to estimate the effect size from prior research. For example, a correlation table from Lance, C. E., & Vandenberg, R. J. (2008), Statistical and Methodological Myths and Urban Legends, Routledge (not reproduced here; the nonredundant correlations were marked in red), gives a mean nonredundant correlation of -.11, an estimate of the effect size for inconsistency as a predictor of GPA.

II. Determine the desired power, the probability of detecting the difference we think our manipulation will make. Typically we want that probability to be as large as possible (1 would be great), but realistically we usually settle for the value .8. That value is to power analysis and sample size determination what .05 is to significance levels.

III. Consult sample size tables or a computer program such as SamplePower 3 or G*Power to determine the sample size required to detect the estimated effect with the desired power. A collection of power tables is available at www.utc.edu/Michael-Biderman -> Psychology 2010 -> Power Tables. Choose a sample large enough to yield the power identified in II to detect the effect size identified in I.

Example 1 - Two Groups Research

You plan to investigate a new method of teaching statistics. Prior to the research, you wish to determine how many participants will be required.

I. Population effect size: Hmm. If your new method will yield only a small effect, it probably wouldn't be worth your effort to pursue it. So you're interested in the new method only if it yields a medium effect size, d = 0.5. Plan the statistical analysis so that it will be likely to detect a medium or larger effect size. If the effect size is smaller than medium, your analysis might not detect it, but that's OK, since a small effect size would mean that the method wasn't that effective.

II. Power: We'd like at least a 90% chance of detecting a medium effect size. There's no point in doing the research and the analysis if we can't be quite sure that we'll detect a useful difference.

III. SamplePower output (not reproduced here): SamplePower indicates that we'll need 90 + 90 = 180 participants in order to have power of 92% to detect a difference of 0.5 standard deviations, a medium effect. The same figure can be read from Biderman's power tables at www.utc.edu/Michael-Biderman -> Psychology 2010 -> Power Tables.

So we'll use 90 persons per group and have a .92 probability of detecting a difference of .5 SDs.
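For readers without SamplePower, the Example 1 numbers can be approximated in Python with the statsmodels package (my assumption; the handout itself mentions only SamplePower, G*Power, and the Biderman tables):

```python
# Example 1: independent-groups t, d = 0.5, alpha = .05, two-tailed.
# Assumes the statsmodels package is installed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# n per group for 90% power: about 86 after rounding up.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.90)
print(n_per_group)

# Conversely, the power that 90 per group buys: about .92,
# in line with the SamplePower output quoted above.
print(analysis.solve_power(effect_size=0.5, nobs1=90, alpha=0.05))
```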
Example 2 - Correlational Research

You are investigating a new test for predicting the performance of students in a statistics curriculum. How big a sample should you use?

1. Effect size: Hmm. Cognitive ability (CA) correlates about .5 with performance, but Conscientiousness correlates only about .2 with performance in academia. You decide that you're not interested in doing any more work on your test unless it correlates more highly with performance than Conscientiousness does. So a medium effect size correlation coefficient, r = .3, is the effect size you are most interested in.

2. Power: Let's choose a sample that will have a 90% chance of detecting a correlation of .3.

3. SamplePower output (not reproduced here): The output indicates that you'll need 110 participants in order to have a probability of .90 of detecting a correlation of .3. (A do-it-yourself approximation is sketched below, after the discussion of why power matters.)

Biderman's power table output: Argh. Biderman didn't prepare a power table for correlations. Somebody get him to do that.

Why be concerned about Power?

1. Assuming we create treatments to make a difference, it only makes sense to conduct research that has the greatest probability of detecting the difference we set out to make.

2. To provide insight into the reasons for a failure to reject the null (a failure to find differences). If we fail to reject the null, it will be due to one of at least two reasons:
  a. The manipulation we implemented had no effect; that is, the actual effect size was zero. Our treatment did not make a difference. We've learned something, although it may not be what we wanted to know.
  b. The manipulation had an effect, but the statistical test had insufficient power to detect it. Our treatment made a difference, but we were too lazy or poor or ignorant to use enough participants, and we didn't detect it.

FOR study example: We performed a study investigating the effect of Frame Of Reference (FOR) instructions on the validity of Conscientiousness as a predictor of GPA. Our original sample had 150 students. The FOR effect was not significant, and for this and other reasons the study was not accepted at a conference. We followed up by adding 150 more participants, on the assumption that the population FOR effect size was small, e.g., r = .1 or .2. For the 300-participant sample, the difference in validities between the non-FOR and FOR conditions was .07: quite small, but statistically significant.

If you fail to reject, you should estimate the actual effect size in your data. If the estimated effect size is small, that indicates your manipulation was not as powerful as you might have expected. But if the sample estimate of the effect size was large while your statistical test was not significant, that suggests your sample size was too small.
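Since there is no Biderman power table for correlations, here is the do-it-yourself approximation promised in Example 2: the standard Fisher-z large-sample formula (plain Python, scipy assumed). It is not SamplePower's exact method, so it lands a few participants above the 110 quoted earlier.

```python
# Sample size to detect a population correlation r with given power,
# via the Fisher z approximation: n = ((z_alpha/2 + z_power) / atanh(r))^2 + 3.
# Assumes scipy is installed.
import math
from scipy.stats import norm

def n_for_correlation(r, power=0.90, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)  # two-tailed critical value
    z_power = norm.ppf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

print(n_for_correlation(0.3))  # about 113, near SamplePower's 110
```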
When we don't want high power

In general, high power is good. If the null hypothesis is false, we want to be able to correctly reject it. There are instances, however, when we may not want to detect a difference even if it is there. Examples:

1) We're not interested in the difference. E.g., we're interested in the effect of Type of Training, and a Gender difference is found. We're not interested in gender differences. Nuts! Now we have to deal with them.

2) We're overwhelmed by differences already and don't have time to deal with any others. We've conducted research evaluating Type of Training, Sex, Type of Job, and Age of Employee, and a Gender difference is found. Rats! We don't have time to deal with the gender effect at the present time.

3) The difference is incredibly small. Suppose the average statistical test performance of the population of I/O students is 84.3 while the average for the population of Research Methods (RM) students is 84.31. With 10,000 I/O students and 10,000 RM students, the difference would be statistically significant. Oh wow! I really care!

This is what is referred to as the issue of statistical vs. practical significance. Any difference, however small or inconsequential, can be made statistically significant by increasing power (usually through larger samples). But whether a statistically significant difference is worth dealing with is another question; many times it is not. For this reason, it has become common practice to report not only the statistical significance of a difference but also a measure of sample effect size, the estimated size of the difference measured in a standardized fashion. That way, small differences that were detected by extremely powerful statistical procedures can be recognized for what they are: small differences. The GLM procedure in SPSS can print such sample effect sizes.

Using SamplePower to obtain power and sample sizes

SamplePower 3 is an add-on module available with the SPSS suite of programs. It can be used to compute power. More often than not, however, it is used to compute the sample size required to achieve a prespecified power for proposed research. That's what will be illustrated here. SamplePower opens with a blank screen, except for a randomly chosen tip. Pull down File and choose New.

Independent Groups t-test
1) Specify the effect size. Do that by changing one of the population means to the desired effect size; either the mean of population 1 or the mean of population 2 can be changed.
2) Adjust the N per group until the desired power appears below. To get the exact sample size for power = 80%, pull down the Tools menu and choose "Sample size for 80% Power."

Population Correlation Coefficient
1) Set the population correlation to the desired value.
2) Pull down Tools and choose "Sample Size for 80% Power."

One Way Analysis of Variance
1) On the initial screen, click on the "Number of Levels" field or the "Effect Size" field.
2) Enter the number of categories in the field on the right.
3) Click on the appropriate effect size. Pull down the Tools menu and choose "Sample Size for 80% Power."
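As a cross-check on the SamplePower ANOVA screens, the corresponding computation can be done with statsmodels. The sketch below assumes three groups and a medium effect, f = .25; both choices are mine for illustration, not values from the handout.

```python
# Total N for a one-way ANOVA: Cohen's f = .25, alpha = .05, power = .80,
# 3 groups. Assumes the statsmodels package is installed.
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.80, k_groups=3)
print(n_total)  # about 158 in total, i.e., roughly 53 per group
```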
Two Way Chi-square

Chi-square is unusual in that the power, and thus the sample size required for a given difference between proportions, depends on the specific values those proportions take on.

Sample size required to detect a .05 difference: .40 vs. .45
1) Choose the two population proportions whose difference you want to detect.
2) Pull down the Tools menu and choose "Sample size for 80% Power."

Sample size required to detect a .05 difference: .05 vs. .10 (same steps, with the new proportions).

Note that the sample size required to detect the difference between .05 and .10 is much smaller than that required to detect the difference between .40 and .45. The bottom line is that when testing hypotheses about population proportions, you must specify not only the difference in proportions but also the two specific proportions whose difference you wish to detect.
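The chi-square point can be verified with statsmodels, using Cohen's effect size h for proportions (an arcsine transformation). The numbers below are my computation at 80% power, not the SamplePower screen values.

```python
# Per-group n to detect a difference between two proportions at 80% power,
# alpha = .05, via Cohen's h. Assumes statsmodels is installed.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

solver = NormalIndPower()
for p1, p2 in [(0.40, 0.45), (0.05, 0.10)]:
    h = abs(proportion_effectsize(p1, p2))  # same .05 gap, very different h
    n = solver.solve_power(effect_size=h, alpha=0.05, power=0.80)
    print(p1, p2, round(n))  # roughly 1530 per group vs. roughly 424
```

The same .05 difference is far harder to detect near .50 than near 0 or 1, because the variance of a proportion, p(1 - p), is largest at .50.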