MISSOURI STATE UNIVERSITY
Research, Statistical Training, Analysis & Technical Support

D. Wayne Mitchell, Ph.D. - RStats Consultant, Psychology Department
Kayla N. Jordan - Statistical Analyst, RStats Institute; Graduate Student, Experimental Master's Track, Psychology Department

Power and Effect Size Workshop
Spring 2014

The information presented in this document is for your personal use. It is not to be quoted and/or distributed to others without the written permission of D. Wayne Mitchell and RStats' administrative staff.

Preface

We would like to thank each one of you for attending this RStats Power and Effect Size workshop. Most of the information planned for today's presentation is contained in this handout. We view this workshop as a dynamic presenter-attendee interaction, but we wanted you to have a hard copy of the basic information for future reference. We encourage questions and comments regarding the material presented, as well as requests for information beyond what is presented. Should questions arise after the workshop, please feel free to contact us.

During this workshop we will primarily use commonsensical, practical terminology rather than strict statistical jargon and formal proofs. However, we will need to define a few basic statistical terms and concepts, and several basic effect size formulas will be presented and discussed. The majority of today's discussion will focus on Effect Size (its meaning, its calculation from reported studies, and its interpretation), for without a priori knowledge of Effect Size, power calculations are a moot issue.

Much of the information contained in this handout was presented in previous Power Analysis and Effect Size Workshops in the Fall of 2008 and Spring of 2009. The impetus behind this workshop is that over the past couple of years I have been asked to conduct power analyses for researchers, both here at MSU and for outside consults. The problem most often encountered is that researchers tend either to state that a medium effect size is expected without specifying an actual effect size value, or to expect a medium effect size based upon Cohen's d without appreciating what that implies for sample size. Herein lies the nature of the problems:

(1) Consider the first example: stating that "a medium effect size is expected" without indicating an effect size value. Without a value, one cannot conduct a power analysis. Determining an estimated effect size value requires the researcher to do a meta-analysis within one's research area. Estimating an effect size can be a problem if the dependent measures and treatments that you (the researcher) wish to employ have not been investigated and published previously. Also, many studies do not report effect sizes, so you have to estimate effect sizes from the information provided in the publications.

(2) A second problem is that merely stating the expectation of a small to medium effect size based upon Cohen's d has resulted in researchers complaining to me (after a power analysis was conducted) that the sample size required to detect such an effect is too large and unreasonable.
As you will see, to detect a significant difference between two groups (e.g., an experimental group versus a control group) with an independent t-test, a Cohen's small effect size requires a sample of 788 participants (394 per group), and a Cohen's medium effect size requires a sample of 128 participants (64 per group). These power analysis estimates assume a power of .80. So, what follows today in this workshop is three-fold: a review of (1) statistical terms, (2) the cookbook formulas for estimating effect sizes from published research articles, and (3) a brief outline of how to estimate effect size from your own data.

Statistical Concepts

(1) Type I Error - concluding there is a treatment effect or relationship between an independent variable and a dependent variable when, in fact, there is not. This error is noted in nearly all of our reported studies in the form of p < .05 (the alpha level we tend to live and die by in our research).

(2) Type II Error - concluding that there is no treatment effect or relationship between an independent variable and a dependent variable when, in fact, there is one. This type of error can be very costly, especially in clinical research.

The above error types tend to haunt us in our research endeavors, and they will be important to our discussions of Power and Effect Size later in this document and workshop. We would like to credit and applaud Jacob Cohen, who created and popularized most of the early work on power analysis and introduced the importance of calculating and reporting Effect Size in statistical results (both for results that were statistically significant and for those that were not).

(3) Power - the probability of detecting an effect (treatment, relationship, etc.) that is really there. To say it another way, the power of a statistical test is the probability that the test will yield statistically significant results (typically that infamous p < .05). Hence, we can then conclude there is a significant treatment effect/relationship between our independent variable and dependent variable and go on to publish, get promoted, become famous, and live happily ever after!

(4) Effect Size - a name given to a family of indices that measure the magnitude of a treatment effect or relationship. The types of indices reported vary with the preferences of the researcher, the journal, and/or the type of statistical test. The most popular indices, and the ones that will be discussed today, are r² (r-squared), ω² (omega-squared), η² (eta-squared), and d (Cohen's d). There are variations of eta-squared (e.g., partial eta-squared) and of Cohen's d (e.g., f) that are employed in complex Analyses of Variance (ANOVAs) and Multivariate Analyses of Variance (MANOVAs). In complex correlational analyses (e.g., multiple regression/correlation analyses), R² is the effect size employed. The interpretations are the same, just more complex due to the complexity of the research design/questions and the subsequent analyses. Do note: these Effect Size indices are for parametric statistical tests (those for mean comparisons and correlations).
There are also Effect Size indices for non-parametric statistical tests and results (e.g., for Chi-Square, Phi or Cramér's Phi; for Odds Ratios, Chinn's (2000) conversion, d = ln(odds ratio)/1.81), and, again, the interpretations are equivalent. However, given the limited time of our workshop, only the more common Effect Size indices for parametric statistical tests will be discussed.

The Four Most Common Effect Size Indices

(1) r² (r-squared): This is an Effect Size with which all of you are probably familiar. Historically, it has been inferred but not reported or discussed. The calculation, however, is very simple. For example, suppose a researcher found a significant relationship between an individual's number of years of education and salary (r(98) = .40, p < .05) and concluded that the more years of education one has obtained, the higher one's salary. The Effect Size here is r² = .16, which means that approximately 16% of the variability in salary (why there are individual differences in salary) can be attributed to (explained by) the number of years of education one has. Apparently, based upon this report, approximately 84% of the variability in salary is not explained (we typically say it is due to error).

(2) ω² (omega-squared): Omega-squared is usually reported for t-test, ANOVA, and MANOVA results. Like r², omega-squared ranges from 0 to 1.00 and is interpreted in the same manner as r². If r-squared is converted to an omega-squared, or vice versa, the values will be very similar. In fact, some researchers report r-squared with t-tests.

(3) η² (eta-squared) or η²p (partial eta-squared): Eta-squared is also generally reported with t-test, ANOVA, and MANOVA results. Like r² and omega-squared, eta-squared ranges from 0 to 1.00, and the interpretation is the same as for r-squared and omega-squared. So, you ask, "What is the difference between omega-squared and eta-squared?" Well... from a technical standpoint, there is not much difference, at least for the most part. Overall eta-squared (and partial eta-squared) calculations will be larger than omega-squared Effect Size estimates and tend to overestimate the expected Effect Size in the population (it is a math thing, like the biased and unbiased calculations of the standard deviation, where one divides by n versus n - 1, respectively). But, fortunately (or unfortunately), our most popular statistical package (SPSS) employs eta-squared and partial eta-squared. Do be aware that not all SPSS statistical output reports Effect Size (sigh!); hence the importance of the information to be presented later!

(4) d (Cohen's d): Although Cohen's d is an important Effect Size index, it is often difficult for many to interpret and/or understand. But, hopefully, we can fix that today. Cohen's d is a standardized difference between two means, and therefore the calculated d can be greater than 1.00; it is an estimate of the degree of overlap between the two groups' distributions (which is a unique viewpoint). Attached are two tables (from web.uccs.edu/lbecker/Psy590/es.htm): Table 1 shows the correspondence between d, r, and r²; Table 2 indicates the degree of overlap as a function of the size of d.
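Before turning to the worked examples, here is a small set of conversion helpers for the relationships just described. This is a minimal sketch in Python (our choice of tool, not the workshop's); the function names are ours, and the d-to-r conversion assumes two groups of equal size:

import math

def d_to_r(d):
    """Convert Cohen's d to r (two groups of equal size assumed)."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r):
    """Convert r back to Cohen's d (the inverse of d_to_r)."""
    return 2 * r / math.sqrt(1 - r ** 2)

def odds_ratio_to_d(odds_ratio):
    """Chinn's (2000) conversion: d = ln(odds ratio) / 1.81."""
    return math.log(odds_ratio) / 1.81

print(f"{d_to_r(1.2):.3f}")           # 0.514, matching Table 1
print(f"{d_to_r(1.2) ** 2:.3f}")      # 0.265, the corresponding r-squared
print(f"{odds_ratio_to_d(4.0):.2f}")  # an odds ratio of 4 gives d of about 0.77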
Let us apply some Effect Size calculations to reported statistical results from a variety of statistical tests.

(1) Example 1 - Comparison of means (t-test): Imagine a basic science researcher investigating the effects of a new drug intended to improve blood flow in individuals with heart disease. The researcher found a significant difference between the control and experimental groups on blood flow; that is, the drug significantly increased blood flow above that of the control group (t(22) = 4.16, p < .05).

To calculate an omega-squared, use the following formula:

ω² = (t² - 1) / (t² + df + 1) = .40

(interpreted as: 40% of the difference in blood flow between the control group and the experimental group can be attributed to the new drug).

To calculate an r² or η², use the following formula:

r² = t² / (t² + df) = .44

(the interpretation is like that of the omega-squared above).

And to calculate Cohen's d, use the following formula:

d = 2t / √df = 1.77

(the interpretation of Cohen's d is as follows: the experimental group mean is 1.77 standard deviation units above the mean of the control group).

(2) Example 2 - Comparison of means (One-way ANOVA): Assume that Professor Lutz (our resident clinical psychologist) compared the effects of two therapy conditions and a control condition on weight gain in young anorexic women. The control condition received no intervention, one group received cognitive-behavioral therapy, and one group received family therapy. A One-way ANOVA revealed a significant treatment effect (F(2, 69) = 5.42, p < .05). And, of course, Professor Lutz ran a series of post hoc tests (e.g., Tukey's HSD) to determine which condition means differed significantly from one another. He found that the participants in the family therapy group gained significantly more weight than both the control and cognitive-behavioral therapy groups (p < .05), and the participants in the cognitive-behavioral group gained significantly more weight than the control group (p < .05). The sample mean weight gains were 2.45, 5.16, and 8.26 for the control, cognitive-behavioral, and family therapy conditions, respectively. The standard deviation of the control group was 2.9 (we will need this later). Now, Professor Lutz did not report the corresponding Effect Size, but we can determine it using the following formulas:

Omega-squared:
ω² = df_between(F - 1) / [df_between(F - 1) + N]
ω² = 2(5.42 - 1) / [2(5.42 - 1) + 72] = .11

Eta-squared:
η² = df_between(F) / [df_between(F) + df_within]
η² = 2(5.42) / [2(5.42) + 69] = .14

Note: For a One-way ANOVA you can determine the total sample size by adding df_between + df_within + 1; in this case Professor Lutz had a total of 72 participants.

The above Effect Sizes represent only an overall Effect Size, and, based upon the sample means and post hoc tests, family therapy is the therapy of choice. But what was the Effect Size of each of these therapies? They both worked, but how well did they work? To answer this, more Effect Size calculations are needed. One could run three independent t-tests and calculate an omega-squared or r-squared as described in (1) above; a short script implementing the formulas from both examples appears below. Another approach to determining the effect of each treatment relative to the control is to compute a variation of Cohen's d, called Glass's Delta (defined just after the script).
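The formulas from Examples 1 and 2 are easy to script. Below is a minimal sketch in Python (the function names are ours); it reproduces the values calculated above from nothing more than the reported test statistics and degrees of freedom:

import math

def omega_sq_from_t(t, df):
    """Omega-squared from an independent t-test: (t^2 - 1) / (t^2 + df + 1)."""
    return (t ** 2 - 1) / (t ** 2 + df + 1)

def r_sq_from_t(t, df):
    """r-squared (equivalently eta-squared) from a t-test: t^2 / (t^2 + df)."""
    return t ** 2 / (t ** 2 + df)

def cohens_d_from_t(t, df):
    """Cohen's d from an independent t-test: 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

def omega_sq_from_f(f, df_between, n_total):
    """Omega-squared from a one-way ANOVA result."""
    return df_between * (f - 1) / (df_between * (f - 1) + n_total)

def eta_sq_from_f(f, df_effect, df_within):
    """Eta-squared (or partial eta-squared) from an F result."""
    return df_effect * f / (df_effect * f + df_within)

# Example 1: t(22) = 4.16
print(f"{omega_sq_from_t(4.16, 22):.2f}")  # 0.40
print(f"{r_sq_from_t(4.16, 22):.2f}")      # 0.44
print(f"{cohens_d_from_t(4.16, 22):.2f}")  # 1.77

# Example 2: F(2, 69) = 5.42, N = 72
print(f"{omega_sq_from_f(5.42, 2, 72):.2f}")  # 0.11
print(f"{eta_sq_from_f(5.42, 2, 69):.2f}")    # 0.14

The same eta_sq_from_f helper also reproduces the partial eta-squared values in Example 3 below: eta_sq_from_f(41.44, 1, 60) gives .41 and eta_sq_from_f(39.46, 2, 120) gives .40.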
Glass's Delta = (Experimental Group Mean - Control Group Mean) / Standard Deviation of the Control Group

To compare the cognitive-behavioral treatment with the control: (5.16 - 2.45) / 2.9 = .93 (or, converted to an r², .17).

To compare the family therapy treatment with the control: (8.26 - 2.45) / 2.9 = 2.00 (or, converted to an r², .50).

(3) Example 3 - Comparison of means (a complex Mixed ANOVA): Suppose a researcher investigated the effects of a new drug to improve blood flow in individuals with heart disease over a 3-month period. The design was a 2 (Group: Control vs. Experimental) x 2 (Gender: Male vs. Female) x 3 (Time: Pretest, Post-test 1, Post-test 2). The researcher found no Gender difference and no significant Group by Gender interaction, but did find a significant Group effect: the experimental group improved significantly with regard to increased blood flow (F(1, 60) = 41.44, p < .001). There was also a significant Time effect (F(2, 120) = 39.46, p < .001), that is, a significant change in blood flow from pretest to post-test, and a significant Group by Time interaction (F(2, 120) = 25.91, p < .001), that is, the Experimental group improved from pretest to post-test while the Control group did not.

To calculate the Effect Sizes from these reported results, use the following formula. For the Group (Control vs. Experimental) effect (F(1, 60) = 41.44, p < .001):

η²p (partial eta-squared) = df_effect(F_effect) / [df_effect(F_effect) + df_within]
η²p = 1(41.44) / [1(41.44) + 60] = .41

To determine the Effect Size for the magnitude of change across time (Pretest to Post-test 2), apply the same formula to the given result (F(2, 120) = 39.46, p < .001):

η²p = 2(39.46) / [2(39.46) + 120] = .40

We have now learned ways to estimate Effect Size from statistical results when the Effect Size was not reported. Determining Effect Size is necessary to gauge the magnitude of an effect, and Effect Size is needed to estimate power for future research.

Now Let Us Consider Power…

As Jacob Cohen pointed out, power is a function of the alpha level one selects, the Effect Size, and the sample size. In most power analyses one wants to determine the sample size needed to find the anticipated Effect Size at p < .05. So, how much power is needed? Currently most researchers assume a minimum power of .80, and the Effect Size is determined from the field of study: via reported meta-analyses, from your own published research, and/or from your own meta-analyses. Do realize that the reported/calculated Effect Sizes from studies are still influenced by the research design, measurement error, and sampling error. Therefore, replication is very important in determining the final Effect Size employed in one's estimates of power and future sample size.

Of more importance: what is considered a small, medium, or large Effect Size varies between and within areas of research. And since researchers tend to design complex studies with more than one independent variable, and sometimes multiple dependent variables, how does one decide which Effect Size to use in determining the appropriate sample size? And why is power important?
A Mitchell and Jordan Partial Solution and a Suggested Approach…

We tend to favor elementary statistics and the most parsimonious approach to effect size and power calculations: the comparison of two means (between two independent groups or two repeated measures). Since most researchers will have complex designs comparing more than two means, here are our suggestions:

(1) Pick the mean difference that is most important to you (that is, the two means between which you want to detect a significant difference at p < .05) and that has the smallest expected effect size of any of the mean differences you wish to test. Ah, the question: "How do I know which mean difference is going to have the smallest effect size?"

(a) You could review the literature in your area, sample similar studies that have used the same or similar independent and dependent measures, calculate the corresponding effect sizes, and take an average of those effect sizes; essentially you are doing a mini meta-analysis. Use that average effect size to conduct your power analysis.

(b) If you cannot find similar studies, consider your own work. Compare your two means of choice via a t-test, convert the observed t value to an omega-squared, and use that effect size to conduct your power analysis.

(c) If you cannot find similar studies and you are just starting a new research program, use pilot data. Assuming you have reliable and valid dependent measure(s), appropriate sampling and randomization procedures, and appropriate/strong treatment(s), test 5 to 10 participants per group, compare your two group means via a t-test, convert the observed t value to an omega-squared, and use that effect size to conduct your power analysis.

(2) Finally, be cautious when reviewing the literature and estimating effect sizes, for studies can be weak in power, and you should therefore question the results (spurious or not?). Once you have estimated a study's effect size (comparing two means), consider the sample size, and then examine the power of that study. If the power is extremely low, that study should be put in your 'red flag' category regarding its results. Do not forget to consider both statistical results that were statistically significant and those reported as not statistically significant.

(3) Power, participants, money, time, and Type II Error… What follows is a series of power analyses calculated for Cohen's small, medium, and a very large effect size, to give an idea of the sample sizes needed to detect an anticipated effect at the p < .05 alpha level using GPower.
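For those who prefer to script these calculations, the three GPower runs reproduced below can be cross-checked in Python with the statsmodels package. This is a sketch under our own tooling choice (the workshop itself used GPower), and the post hoc example at the end uses a hypothetical d and sample size for illustration only:

import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: required sample size per group for a two-tailed independent
# t-test at alpha = .05 and power = .80 with equal allocation.
for d in (0.2, 0.5, 1.5):
    n1 = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                              ratio=1.0, alternative='two-sided')
    print(d, math.ceil(n1))  # 0.2 -> 394 per group, 0.5 -> 64, 1.5 -> 9

# Post hoc (point (2) above): the achieved power of a reported study.
# Hypothetical example: an assumed d = 0.5 with only 20 participants per group.
power = analysis.solve_power(effect_size=0.5, nobs1=20, alpha=0.05,
                             ratio=1.0, alternative='two-sided')
print(round(power, 2))  # roughly one chance in three -- a 'red flag' study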
Small Effect Size (applying Cohen's d = .2):

t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:    Tail(s) = Two
          Effect size d = 0.2
          α err prob = 0.05
          Power (1-β err prob) = 0.80
          Allocation ratio N2/N1 = 1
Output:   Critical t = 1.962987
          df = 786
          Sample size group 1 = 394
          Sample size group 2 = 394
          Total sample size = 788
          Actual power = 0.800593

Medium Effect Size (applying Cohen's d = .5):

t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:    Tail(s) = Two
          Effect size d = 0.5
          α err prob = 0.05
          Power (1-β err prob) = 0.80
          Allocation ratio N2/N1 = 1
Output:   Critical t = 1.978971
          df = 126
          Sample size group 1 = 64
          Sample size group 2 = 64
          Total sample size = 128
          Actual power = 0.801460

Very Large Effect Size (applying Cohen's d = 1.5):

t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:    Tail(s) = Two
          Effect size d = 1.5
          α err prob = 0.05
          Power (1-β err prob) = 0.80
          Allocation ratio N2/N1 = 1
Output:   Critical t = 2.119905
          df = 16
          Sample size group 1 = 9
          Sample size group 2 = 9
          Total sample size = 18
          Actual power = 0.847610

Selected Power and Effect Size Articles for Reference

Campbell, J. M. (2004). Statistical comparison of four effect sizes for single-subject designs. Behavior Modification, 28(2), 234-246.

Chinn, S. (2000). A simple method for converting an odds ratio to effect size for use in meta-analysis. Statistics in Medicine, 19, 3127-3131.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Lawrence Erlbaum Associates.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.

Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170-180.

Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpretation. Journal of Consumer Research, 23, 89-105.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.

Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). New Jersey: Prentice Hall.

Kier, F. J. (1999). Effect size measures: What they are and how to compute them. Advances in Social Science Methodology, 5, 87-100.

Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28(4), 612-625.

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147-163.

McCartney, K., & Rosenthal, R. (2000). Effect size, practical importance, and social policy for children. Child Development, 71(1), 173-180.

Murphy, K. R., & Myors, B. (2003). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). New Jersey: Lawrence Erlbaum Associates.

Nourbakhsh, M. R., & Ottenbacher, K. J. (1994).
The statistical analysis of single-subject data: A comparative examination. Physical Therapy, 74(8), 768-776.

Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434-447.

Trusty, J., Thompson, B., & Petrocelli, J. V. (2004). Practical guide for reporting effect size in quantitative research in the Journal of Counseling and Development. Journal of Counseling & Development, 82, 107-110.

Wilson-VanVoorhis, C., & Levonian-Morgan, B. (2001). Statistical rules of thumb: What we don't want to forget about sample sizes. Psi Chi Journal of Undergraduate Research, 6(4), 139-141.

Table 1
The Correspondence Between d, r, and r²

Cohen's Standard     d      r      r²
                    2.0   .707   .500
                    1.9   .689   .474
                    1.8   .669   .448
                    1.7   .648   .419
                    1.6   .625   .390
                    1.5   .600   .360
                    1.4   .573   .329
                    1.3   .545   .297
                    1.2   .514   .265
                    1.1   .482   .232
                    1.0   .447   .200
                    0.9   .410   .168
LARGE               0.8   .371   .138
                    0.7   .330   .109
                    0.6   .287   .083
MEDIUM              0.5   .243   .059
                    0.4   .196   .038
                    0.3   .148   .022
SMALL               0.2   .100   .010
                    0.1   .050   .002
                    0.0   .000   .000

As noted in the definition sections above, d can be converted to r and vice versa. For example, a d value of 1.2 corresponds to an r value of .514. The square of the r value (.265) is the percentage of variance in the dependent variable that is accounted for by membership in the independent variable groups; for a d value of 1.2, the amount of variance in the dependent variable accounted for by membership in the treatment and control groups is 26.5%. In meta-analysis studies, rs rather than r²s are typically presented.

Table 2
The Interpretation of Cohen's d

Cohen's Standard   Effect Size (d)   Percentile Standing   Percent of Nonoverlap
                        2.0                97.7                   81.1%
                        1.9                97.1                   79.4%
                        1.8                96.4                   77.4%
                        1.7                95.5                   75.4%
                        1.6                94.5                   73.1%
                        1.5                93.3                   70.7%
                        1.4                91.9                   68.1%
                        1.3                90                     65.3%
                        1.2                88                     62.2%
                        1.1                86                     58.9%
                        1.0                84                     55.4%
                        0.9                82                     51.6%
LARGE                   0.8                79                     47.4%
                        0.7                76                     43.0%
                        0.6                73                     38.2%
MEDIUM                  0.5                69                     33.0%
                        0.4                66                     27.4%
                        0.3                62                     21.3%
SMALL                   0.2                58                     14.7%
                        0.1                54                     7.7%
                        0.0                50                     0%

Cohen (1988) hesitantly defined effect sizes as "small, d = .2," "medium, d = .5," and "large, d = .8," stating that "there is a certain risk inherent in offering conventional operational definitions for these terms for use in power analysis in as diverse a field of inquiry as behavioral science" (p. 25).

Effect sizes can also be thought of as the average percentile standing of the average treated (or experimental) participant relative to the average untreated (or control) participant. An effect size of 0.0 indicates that the mean of the treated group is at the 50th percentile of the untreated group. An effect size of 0.8 indicates that the mean of the treated group is at the 79th percentile of the untreated group. An effect size of 1.7 indicates that the mean of the treated group is at the 95.5th percentile of the untreated group.

Effect sizes can also be interpreted in terms of the percent of nonoverlap of the treated group's scores with those of the untreated group; see Cohen (1988, pp. 21-23) for descriptions of additional measures of nonoverlap. An effect size of 0.0 indicates that the distribution of scores for the treated group overlaps completely with the distribution of scores for the untreated group: there is 0% nonoverlap. An effect size of 0.8 indicates a nonoverlap of 47.4% in the two distributions.
An effect size of 1.7 indicates a nonoverlap of 75.4% in the two distributions.
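Both columns of Table 2 can be recomputed directly from d. Below is a sketch (Python with scipy; our addition, not part of the original workshop) that assumes normally distributed scores with equal variances in the two groups and takes Cohen's U1 as the nonoverlap measure, which reproduces the tabled values:

from scipy.stats import norm

def percentile_standing(d):
    """Percentile of the average treated participant within the control group."""
    return 100 * norm.cdf(d)

def percent_nonoverlap(d):
    """Cohen's U1: percent of the two distributions that does not overlap."""
    p = norm.cdf(d / 2)
    return 100 * (2 * p - 1) / p

print(f"{percentile_standing(0.8):.1f}")  # 78.8 (tabled, rounded, as 79)
print(f"{percent_nonoverlap(0.8):.1f}")   # 47.4
print(f"{percent_nonoverlap(1.7):.1f}")   # 75.4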