MISSOURI STATE UNIVERSITY
Research, Statistical Training, Analysis & Technical Support
D. Wayne Mitchell, Ph.D. – RStats Consultant – Psychology Department
Kayla N. Jordan – Statistical Analyst – RStats Institute
Graduate Student, Experimental Master's Track – Psychology Department
Power and Effect Size Workshop
Spring 2014
The information presented in this document is for your personal use. It is not to be quoted and/or
distributed to others without the written permission of D. Wayne Mitchell and RStats’
administrative staff.
Power and Effect Size Workshop
Preface
We would like to thank each of you for attending this RStats Power and Effect Size
workshop. Most of the information planned for today's presentation is contained in this
handout. We view this workshop as a dynamic presenter-attendee interaction, but we wanted to
provide a hard copy of the basic information for future reference. We encourage questions and
comments regarding the material presented, as well as requests for information beyond what is
presented. Should questions arise after the workshop, please feel free to contact us. During the
course of this workshop we will primarily use commonsensical, practical terminology rather than
strict statistical/formula-proof jargon. However, a few basic statistical terms and concepts will
need to be defined, and several basic effect size formulas will be presented and discussed. We
would like to point out that the majority of today's discussion will focus on Effect Size (its
meaning, its calculation from reported studies, and its interpretation), for without a priori
knowledge of Effect Size, power calculations are a moot issue.
Much of the information contained in this handout was presented in previous Power
Analysis and Effect Size Workshops in the Fall of 2008 and Spring of 2009. The impetus behind
this workshop is that over the past couple of years I have been asked to conduct power analyses
for researchers, both here at MSU and for outside consults. The problems most often encountered
are that researchers either expect a medium effect size without specifying an actual effect size
value, or simply state that a medium effect size, based upon Cohen's d, is expected. Herein lies the
nature of the problems:
(1) Consider the first example, stating that a "medium effect size is expected" without indicating
an effect size value. If there is no value, one cannot conduct a power analysis.
Determining an estimated effect size value requires the researcher to do a meta-analysis
within his or her research area. Estimating an effect size can be a problem if the dependent
measures and treatments that you (the researcher) wish to employ have not been
investigated and published previously. Also, many published studies do not report effect
sizes, so you have to estimate effect sizes based upon the information provided
in the publications.
(2) A second problem is that merely stating the expectation of a small to medium effect size,
based upon Cohen's d, has resulted in researchers complaining to me (after a power
analysis was conducted) that the sample size required to detect such an effect is too large and
unreasonable. As you will see, detecting a significant difference between two groups (e.g.,
an experimental group versus a control group) with an independent t-test requires a sample
size of 788 participants (394 per group) for a Cohen's small effect size, and a sample size
of 128 participants (64 per group) for a Cohen's medium effect size. These power analysis
estimates assume a power of .80.
So, what follows today in this workshop is three-fold: a review of (1) statistical terms, (2)
cookbook formulas for estimating effect sizes from published research articles, and (3) a brief
outline of how to estimate effect size from your own data.
Statistical Concepts
(1) Type I Error – concluding there is a treatment effect or relationship between an
independent variable and a dependent variable when, in fact, there is not. This error is
acknowledged in nearly all of our reported studies in the form of p < .05 (the alpha level we
tend to live and die by in our research).
(2) Type II Error – concluding that there is no treatment effect or relationship between an
independent variable and a dependent variable when, in fact, there is an
effect/relationship between the independent variable and dependent variable. This type of
error can be very costly, especially in clinical research.
The above error types tend to haunt us in our research endeavors, and they will be important to
our discussions of Power and Effect Size later in this document and workshop. We would like to
credit and applaud Jacob Cohen, who created and popularized much of the early work on
power analysis and introduced the importance of calculating and reporting Effect Size in
statistical results (both for results that were statistically significant and for those that were
not).
(3) Power – The probability of detecting an effect (treatment, relationship, etc.) that is really
there. To say it another way, the power of a statistical test is the probability that it (the
statistical test) will yield statistically significant results (typically yielding that infamous,
p < .05). Hence, we can then conclude there is a significant treatment effect/relationship
between our independent variable and dependent variable and go on to publish, get
promoted, become famous, and live happily ever after!
(4) Effect Size – Effect size is a name given to a family of indices that measure the magnitude
of a treatment effect or relationship. The types of indices reported vary with the
preferences of the researcher, the journal, and/or the type of statistical test. The most popular
indices, and the ones that will be discussed today, are: r² (r-squared), ω² (omega-squared), η² (eta-squared), and d (Cohen's d).
There are variations of eta-squared (e.g., partial eta-squared) and Cohen's d (e.g., Cohen's f) that are
employed in complex Analysis of Variance (ANOVA) and Multivariate Analysis of Variance (MANOVA) designs.
In complex correlational analyses (e.g., multiple regression/correlation analyses), R² is the
effect size employed. The interpretations are the same, just more complex due to the complexity
of the research design/questions and the subsequent analyses.
Do note: these Effect Size indices are for parametric statistical tests (those for mean
comparisons and correlations). There are also Effect Size indices for non-parametric statistical
tests/results; for example, Phi or Cramér's Phi for Chi-Square results, and Chinn's (2000)
conversion of an odds ratio to d (ln(odds ratio)/1.81). And, again, the
Effect Size interpretations are equivalent. However, given the limited time of our workshop,
only the more common Effect Size indices for parametric statistical tests will be discussed.
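
Although the non-parametric indices are beyond today's scope, Chinn's (2000) odds-ratio conversion mentioned above is simple enough to sketch. Below is a minimal Python illustration (the function name is ours):

import math

def odds_ratio_to_d(odds_ratio):
    # Chinn (2000): convert an odds ratio to an approximate Cohen's d.
    return math.log(odds_ratio) / 1.81

# Example: an odds ratio of 3.0 corresponds to roughly d = 0.61
print(round(odds_ratio_to_d(3.0), 2))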
The Four Most Common Effect Size Indices
(1) r² (r-squared): This is an Effect Size with which all of you are probably familiar. Historically, it
has been inferred, but not reported or discussed. However, the calculation is very simple.
For example, suppose a researcher found a significant relationship between an
individual's number of years of education and salary (r (98) = .40, p < .05). The
researcher concluded that the more years of education one has obtained, the higher one's
salary. The Effect Size here is r² = .16, which means that approximately 16% of the
variability in salary (why there are individual differences in salary) can be attributed to
(explained by) the number of years of education one has. Apparently, based upon this
report, approximately 84% of the variability in salary is not explained (we typically say it is
due to error).
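
As a quick illustration, here is a minimal Python sketch of that arithmetic (note that for a bivariate correlation df = N − 2, so r(98) implies 100 participants):

r = 0.40               # reported correlation, r(98) = .40
df = 98                # degrees of freedom reported with r
n = df + 2             # df = N - 2 for a correlation, so N = 100

r_squared = r ** 2             # proportion of variance explained (.16)
unexplained = 1 - r_squared    # proportion left unexplained (.84)
print(n, round(r_squared, 2), round(unexplained, 2))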
(2) ω² (omega-squared): Omega-squared is usually reported for t-test, ANOVA, and
MANOVA results. Like r², omega-squared ranges from 0 to 1.00 and is interpreted in the
same manner as r². If r-squared is converted to an omega-squared, or vice versa, the
values will be very similar. In fact, some researchers report r-squared with t-tests.
(3) η² (eta-squared or η²p – partial eta-squared): Eta-squared is also generally reported with t-test, ANOVA, and MANOVA results. Like r² and omega-squared, eta-squared ranges
from 0 to 1.00, and the interpretation is the same as that of r-squared and omega-squared.
So, you ask, "What is the difference between omega-squared and eta-squared?" Well… from
a technical standpoint, there is not much difference, at least for the most part. Overall eta-squared (and partial eta-squared) calculations will be larger than omega-squared Effect Size
estimates and tend to overestimate the expected Effect Size in the population (it is a math thing,
like the biased and unbiased calculations of the standard deviation, where one divides by n
versus n − 1, respectively). But, fortunately (or unfortunately), our most popular
statistical package (SPSS) employs eta-squared and partial eta-squared. Do be aware that not
all SPSS statistical output reports Effect Size (sigh!); hence the importance of the information
to be presented later!
(4) d (Cohen's d): Although Cohen's d is an important Effect Size index, it is often difficult
for many to interpret and/or understand. But, hopefully, we can fix that today. Cohen's d
is a standardized difference between two means; therefore, the calculated d can be
greater than 1.00, as it is an estimate of the degree of overlap between the two groups'
distributions (which is a unique viewpoint).
Attached are two tables (from web.uccs.edu/lbecker/Psy590/es.htm). Table 1 shows the
correspondence between d, r, and r². Table 2 indicates the degree of overlap as a function
of the size of d.
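
Table 1's correspondence follows from the standard conversions r = d / √(d² + 4) and d = 2r / √(1 − r²), which assume two groups of equal size. A minimal Python sketch:

import math

def d_to_r(d):
    # Convert Cohen's d to r (assumes two groups of equal size).
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r):
    # Convert r back to Cohen's d.
    return 2 * r / math.sqrt(1 - r ** 2)

# Reproduces a row of Table 1: d = 1.2 -> r = .514, r-squared = .265
r = d_to_r(1.2)
print(round(r, 3), round(r ** 2, 3))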
Let us apply some Effect Size calculations to reported statistical results from a variety of
statistical tests…
(1) Example 1 Comparison of means (t-test): Imagine a basic science researcher was
investigating the effects of a new drug intended to improve blood flow in individuals with heart
disease. The researcher found a significant difference between the control and
experimental groups on blood flow; that is, the drug significantly increased blood flow
above that of the control group (t (22) = 4.16, p < .05). To calculate an effect size (an
omega-squared), use the following formula:
Omega-squared: ω² = (t² − 1) / (t² + df + 1) = .40 (interpreted as: 40% of the difference
in blood flow between the control group and the experimental group can be attributed to the
new drug).

and to calculate an r² or η², use the following formula:

r² = t² / (t² + df) = .44 (the interpretation is like that of the omega-squared stated above).

and to calculate Cohen's d, use the following formula:

Cohen's d = 2t / √df = 1.77 (the interpretation of a Cohen's d is as follows: the
experimental group mean is 1.77 standard deviation units above the mean of the control
group).
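
As a quick check, here is a minimal Python sketch of these three conversions (the function name is ours), reproducing the values above from t(22) = 4.16:

import math

def t_to_effect_sizes(t, df):
    # Convert an independent-samples t statistic and its df into effect sizes.
    omega_sq = (t ** 2 - 1) / (t ** 2 + df + 1)
    r_sq = t ** 2 / (t ** 2 + df)
    d = 2 * t / math.sqrt(df)
    return omega_sq, r_sq, d

omega_sq, r_sq, d = t_to_effect_sizes(4.16, 22)
print(round(omega_sq, 2), round(r_sq, 2), round(d, 2))   # approximately .40, .44, 1.77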
(2) Example 2 Comparison of means (One-way ANOVA): Assume that Professor Lutz (our
resident clinical psychologist) compared the effects of two therapy conditions to a control
condition on weight gain in young anorexic women. The control condition received
no intervention, one group received cognitive-behavioral therapy, and one group received
family therapy. The results of a One-way ANOVA revealed a significant treatment effect
(F (2, 69) = 5.42, p < .05). And, of course, Professor Lutz ran a series of post hoc tests
(e.g., Tukey's HSD) to determine which condition means were significantly different
from one another. He found that the participants in the family therapy group had
significantly greater weight gain than both the control and cognitive-behavioral therapy groups
(p < .05), and the participants in the cognitive-behavioral group had significantly greater
weight gain than the control group (p < .05). The sample mean weight gains were
2.45, 5.16, and 8.26 for the control, cognitive-behavioral, and family therapy groups, respectively.
The standard deviation of the control group was 2.9 (we will need this later). Now,
Professor Lutz did not report his corresponding Effect Sizes, but we can determine them
using the following formulas:
Omega-squared:
ω² = df between (F − 1) / [df between (F − 1) + N]
ω² = 2(5.42 − 1) / [2(5.42 − 1) + 72] = .11

Eta-squared:
η² = df between (F) / [df between (F) + df within]
η² = 2(5.42) / [2(5.42) + 69] = .14

Note: For One-Way ANOVAs you can determine the total sample size by adding df
between + df within + 1; in this case Professor Lutz had a total of 72 participants.
The above Effect Sizes only represent an overall Effect Size and, based upon the sample means
and post hoc tests, family therapy is the therapy of choice. But what was the Effect Size of
each of these therapies? They both worked; but how well did they work? To answer this, more
Effect Size calculations are needed. One could run three independent t-tests and calculate an omega-squared or r-squared as described in (1) above. Another approach to determining the effect of
each treatment compared to the control is to compute a variation of Cohen's d, called Glass's
Delta:

Glass's Delta = (Experimental Group Mean − Control Group Mean) /
Standard Deviation of the Control Group

To compare the cognitive-behavioral treatment with the control:
(5.16 − 2.45) / 2.9 = .93 (or, converted to an r², .17)

To compare the family therapy treatment versus the control group:
(8.26 − 2.45) / 2.9 = 2.00 (or, converted to an r², .50)
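
A minimal Python sketch of Example 2's calculations (function names are ours; the r² conversion uses r = d / √(d² + 4), as in Table 1):

import math

def omega_sq_from_F(F, df_between, n_total):
    # Approximate omega-squared from a one-way ANOVA F statistic.
    return (df_between * (F - 1)) / (df_between * (F - 1) + n_total)

def eta_sq_from_F(F, df_between, df_within):
    # Eta-squared recovered from a one-way ANOVA F statistic.
    return (df_between * F) / (df_between * F + df_within)

def glass_delta(mean_treatment, mean_control, sd_control):
    # Glass's Delta: mean difference standardized by the control group's SD.
    return (mean_treatment - mean_control) / sd_control

print(round(omega_sq_from_F(5.42, 2, 72), 2))    # approximately .11
print(round(eta_sq_from_F(5.42, 2, 69), 2))      # approximately .14

delta_cbt = glass_delta(5.16, 2.45, 2.9)          # about 0.93
delta_fam = glass_delta(8.26, 2.45, 2.9)          # about 2.00
r_fam = delta_fam / math.sqrt(delta_fam ** 2 + 4)
print(round(r_fam ** 2, 2))                       # approximately .50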
(3) Example 3 Comparison of Means (a complex Mixed ANOVA): Suppose a researcher
investigated the effects of a new drug to improve blood flow in individuals with heart
disease over a 3-month period. The design was a 2 (Group: Control vs.
Experimental) X 2 (Gender: Male vs. Female) X 3 (Time: Pretest, Post-test 1, Post-test 2)
mixed design. The researcher found no Gender effect and no significant Group by Gender
interaction, but did find a significant Group effect: the experimental group improved
significantly with regard to increased blood flow (F (1, 60) = 41.44, p < .001). Also, there
was a significant Time effect (F (2, 120) = 39.46, p < .001), that is, there was significant
change in blood flow from pretest to post-test, and a significant Group by Time
interaction (F (2, 120) = 25.91, p < .001), that is, the Experimental group improved
from pretest to post-test while the Control group did not.
To calculate the Effect Sizes from these reported results, use the following formulas:

For the Group (Control vs. Experimental) effect (F (1, 60) = 41.44, p < .001):
η²p (partial eta-squared) = df effect (F effect) / [df effect (F effect) + df error]
η²p = 1(41.44) / [1(41.44) + 60] = .41

To determine the Effect Size for the magnitude of change across time (Pretest to Post-test 2), apply the same formula to the reported result (F (2, 120) = 39.46, p < .001):
η²p = 2(39.46) / [2(39.46) + 120] = .40
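
A minimal Python sketch of this partial eta-squared conversion (the function name is ours):

def partial_eta_sq(F, df_effect, df_error):
    # Partial eta-squared recovered from a reported F statistic.
    return (df_effect * F) / (df_effect * F + df_error)

print(round(partial_eta_sq(41.44, 1, 60), 2))    # approximately .41 (Group effect)
print(round(partial_eta_sq(39.46, 2, 120), 2))   # approximately .40 (Time effect)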
We have learned ways to estimate Effect Size from statistical results when the Effect Size was not
reported. Estimating Effect Size is necessary to gauge the magnitude of an effect, and an Effect
Size estimate is needed to compute power for future research.
Now Let Us Consider Power…
As Jacob Cohen pointed out, power is a function of the alpha level one selects, the Effect Size,
and the sample size. In most power analyses one wants to determine the sample size needed to
detect the anticipated Effect Size at p < .05. So, how much power is needed? Currently most
researchers assume a minimum power of .80, and the Effect Size is determined from the field of
study: from reported meta-analyses, from your own published research, and/or from your own
meta-analyses. Do realize that the Effect Sizes reported or calculated from studies are still influenced by the
research design, measurement error, and sampling error. Therefore, replication is very important
in determining the final Effect Size employed in one's estimates of power and future sample size.
*Of more importance: what is considered a small, medium, or large Effect Size varies between
and within areas of research. And, too, since researchers tend to design complex studies with
more than one independent variable and sometimes multiple dependent variables, how does one
decide which Effect Size to use in determining the appropriate sample size? And why is power
important?
A Mitchell and Jordan Partial Solution and a Suggested Approach…
We tend to be in favor of elementary statistics and the most parsimonious approach to effect size
and power calculations; that is, the comparison of two means (between two independent
groups or between two repeated measures). And since most researchers will have complex designs,
comparing more than two means, here are our suggestions:
(1) Pick the mean comparison that is most important to you (that is, the two means between
which you want to detect a significant difference at p < .05) and that has the smallest expected
effect size of all of the mean differences you wish to detect. Ah, the question: "How do I know
which mean difference is going to have the smallest effect size?"
(a) You could review the literature in your area, sample similar studies that have
used the same or similar independent and dependent measures, calculate the
corresponding effect sizes, and take an average of those effect sizes…
essentially you are doing a mini meta-analysis… Use that average effect size
to conduct your power analysis.
(b) If you cannot find similar studies, consider your own work. Here compare
your two means of choice via a t-test, convert the observed t value to an
omega-squared, and use that effect size to conduct your power analysis.
(c) If you cannot find similar studies and you are just starting a new research
program, use your pilot data. Assuming you have a reliable/valid dependent
measure(s) and are employing appropriate sampling and randomization
procedures, and appropriate/strong treatment(s), then test 5 to 10 participants
per group, compare your two group means via a t-test, convert the observed t
value to an omega-squared, and use that effect size to conduct your power
analysis.
(2) Finally, be cautious when reviewing the literature and estimating effect sizes, for studies
can be weak in power, and therefore you should question the results (spurious or not?).
Once you have estimated a study's effect size (comparing two means), consider the
sample size, and then examine the power of that study. If the power is extremely low, that
study's results should go into your 'red flag' category. Do not forget to
consider statistical results that were statistically significant as well as those reported as not
statistically significant.
(3) Power, participants, money, time and Type II Error…
What follows is a series of power analyses, calculated with GPower, for Cohen's small,
medium, and a very large effect size, to give an idea of the sample sizes needed to detect an
anticipated effect at the p < .05 alpha level.
Small Effect Size (applying Cohen’s d = .2):
t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:   Tail(s) = Two
         Effect size d = 0.2
         α err prob = 0.05
         Power (1-β err prob) = 0.80
         Allocation ratio N2/N1 = 1
Output:  Critical t = 1.962987
         df = 786
         Sample size group 1 = 394
         Sample size group 2 = 394
         Total sample size = 788
         Actual power = 0.800593
Medium Effect Size (applying Cohen’s d = .5):
t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:   Tail(s) = Two
         Effect size d = 0.5
         α err prob = 0.05
         Power (1-β err prob) = 0.80
         Allocation ratio N2/N1 = 1
Output:  Critical t = 1.978971
         df = 126
         Sample size group 1 = 64
         Sample size group 2 = 64
         Total sample size = 128
         Actual power = 0.801460
Very Large Effect Size (applying Cohen’s d = 1.5):
t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input:   Tail(s) = Two
         Effect size d = 1.5
         α err prob = 0.05
         Power (1-β err prob) = 0.80
         Allocation ratio N2/N1 = 1
Output:  Critical t = 2.119905
         df = 16
         Sample size group 1 = 9
         Sample size group 2 = 9
         Total sample size = 18
         Actual power = 0.847610
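
These GPower results can be cross-checked in Python with the statsmodels package (a sketch assuming statsmodels is installed; solve_power returns the required sample size per group, which we round up as GPower does):

import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 1.5):
    # Required n per group: two-tailed independent t-test,
    # alpha = .05, power = .80, equal group sizes.
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       ratio=1.0, alternative='two-sided')
    print(d, math.ceil(n_per_group))   # roughly 394, 64, and 9 per group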
Selected Power and Effect Size Articles for Reference
Campbell, J. M. (2004). Statistical comparison of four effect sizes for single-subject designs.
Behavior Modification, 28 (2), 234-246.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey:
Lawrence Erlbaum Associates.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45 (12), 1304-1312.
Chinn, S. (2000). A simple method for converting an odds ratio to effect size for use in meta-analysis. Statistics in Medicine, 19, 3127-3131.
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read
pictures of data. American Psychologist, 60 (2), 170-180.
Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpretation.
Journal of Consumer Research, 23, 89-105.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2 (8),
e124.
Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd ed.). New Jersey:
Prentice Hall.
Kier, F. J. (1999). Effect size measures: What they are and how to compute them. Advances in
Social Science Methodology, 5, 87-100.
Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of
effect size in communication research. Human Communication Research, 28 (4),
612-625.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research:
Causes, consequences, and remedies. Psychological Methods, 9 (2), 147-163.
Murphy, K. R., & Myors, B. (2003). Statistical power analysis: A simple and general model for
traditional and modern hypothesis tests (2nd ed.). New Jersey: Lawrence Erlbaum
Associates.
McCartney, K., & Rosenthal, R. (2000). Effect size, practical importance, and social policy for
children. Child Development, 71 (1), 173-180.
Nourbakhsh, M. R., & Ottenbacher, K. J. (1994). The statistical analysis of single-subject data: A
comparative examination. Physical Therapy, 74 (8), 768-776.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect
size for some common research designs. Psychological Methods, 8 (4), 434-447.
Trusty, J., Thompson, B., & Petrocelli, J.V. (2004). Practical guide for reporting effect size in
quantitative research in the Journal of Counseling and Development. Journal of
Counseling & Development, 82, 107-110.
Wilson-VanVoorhis, C., & Levonian-Morgan, B. (2001). Statistical rules of thumb: What we
don’t want to forget about sample sizes. Psi Chi Journal of Undergraduate Research,
6 (4), 139-141.
Table 1
Correspondence Between d, r, and r²

Cohen's Standard     d      r      r²
                    2.0   .707   .500
                    1.9   .689   .474
                    1.8   .669   .448
                    1.7   .648   .419
                    1.6   .625   .390
                    1.5   .600   .360
                    1.4   .573   .329
                    1.3   .545   .297
                    1.2   .514   .265
                    1.1   .482   .232
                    1.0   .447   .200
                    0.9   .410   .168
LARGE               0.8   .371   .138
                    0.7   .330   .109
                    0.6   .287   .083
MEDIUM              0.5   .243   .059
                    0.4   .196   .038
                    0.3   .148   .022
SMALL               0.2   .100   .010
                    0.1   .050   .002
                    0.0   .000   .000

As noted in the definition sections above, d can be converted to r and vice versa. For example,
the d value of 1.2 corresponds to an r value of .514. The square of that r value (.265) is the
percentage of variance in the dependent variable that is accounted for by membership in the
independent variable groups; so, for a d value of 1.2, the amount of variance in the dependent
variable explained by membership in the treatment and control groups is 26.5%. In meta-analysis
studies, rs are typically presented rather than r².
Table 2
The Interpretation of Cohen's d

Cohen's Standard    Effect Size (d)    Percentile Standing    Percent of Nonoverlap
                         2.0                 97.7                  81.1%
                         1.9                 97.1                  79.4%
                         1.8                 96.4                  77.4%
                         1.7                 95.5                  75.4%
                         1.6                 94.5                  73.1%
                         1.5                 93.3                  70.7%
                         1.4                 91.9                  68.1%
                         1.3                 90                    65.3%
                         1.2                 88                    62.2%
                         1.1                 86                    58.9%
                         1.0                 84                    55.4%
                         0.9                 82                    51.6%
LARGE                    0.8                 79                    47.4%
                         0.7                 76                    43.0%
                         0.6                 73                    38.2%
MEDIUM                   0.5                 69                    33.0%
                         0.4                 66                    27.4%
                         0.3                 62                    21.3%
SMALL                    0.2                 58                    14.7%
                         0.1                 54                    7.7%
                         0.0                 50                    0%
Cohen (1988) hesitantly defined effect sizes as
"small, d = .2," "medium, d = .5," and "large, d
= .8," stating that "there is a certain risk
inherent in offering conventional operational
definitions for those terms for use in power
analysis in as diverse a field of inquiry as
behavioral science" (p. 25).
Effect sizes can also be thought of as the
average percentile standing of the average
treated (or experimental) participant relative to
the average untreated (or control) participant.
An effect size of 0.0 indicates that the mean of
the treated group is at the 50th percentile of
the untreated group. An effect size of 0.8
indicates that the mean of the treated group is
at the 79th percentile of the untreated group.
An effect size of 1.7 indicates that the mean of
the treated group is at the 95.5 percentile of
the untreated group.
Effect sizes can also be interpreted in terms of
the percent of nonoverlap of the treated
group's scores with those of the untreated
group; see Cohen (1988, pp. 21-23) for
descriptions of additional measures of
nonoverlap. An effect size of 0.0 indicates that
the distribution of scores for the treated group
overlaps completely with the distribution of
scores for the untreated group; there is 0%
nonoverlap. An effect size of 0.8 indicates a
nonoverlap of 47.4% in the two distributions.
An effect size of 1.7 indicates a nonoverlap of
75.4% in the two distributions.
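
Both interpretations in Table 2 can be reproduced from the normal distribution. Below is a minimal Python sketch (assuming scipy is available; the nonoverlap measure is Cohen's U1, computed under equal-variance normal distributions):

from scipy.stats import norm

def percentile_standing(d):
    # Percentile of the untreated distribution at the treated group's mean.
    return 100 * norm.cdf(d)

def percent_nonoverlap(d):
    # Cohen's U1: percent of the two distributions that does not overlap.
    p = norm.cdf(abs(d) / 2)
    return 100 * (2 * p - 1) / p

print(round(percentile_standing(0.8), 1), round(percent_nonoverlap(0.8), 1))   # roughly 78.8 and 47.4
print(round(percentile_standing(1.7), 1), round(percent_nonoverlap(1.7), 1))   # roughly 95.5 and 75.4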