Statistical Power of Psychological Research

Journal of Consulting and Clinical Psychology
1990. Vol. 58, No. 5, 646-656
Statistical Power of Psychological Research:
What Have We Gained in 20 Years?
Joseph S. Rossi
Cancer Prevention Research Center
University of Rhode Island
Power was calculated for 6,155 statistical tests in 221 journal articles published in the 1982 volumes of the Journal of Abnormal Psychology, Journal of Consulting and Clinical Psychology, and
Journal of Personality and Social Psychology. Power to detect small, medium, and large effects was
.17, .57, and .83, respectively. Twenty years after Cohen (1962) conducted the first power survey, the
power of psychological research is still low. The implications of these results concerning the proliferation of Type I errors in the published literature, the failure of replication studies, and the
interpretation of null (negative) results are emphasized. An example is given of the use of power
analysis to help interpret null results by setting probable upper bounds on the magnitudes of effects.
Limitations of statistical power analyses, suggestions for future research, sources of computational
information, and recommendations for improving power are discussed.
The power of a statistical test is the probability that the test will correctly reject the null hypothesis. Because the aim of behavioral research is to discover important relations between variables, a consideration of power might be regarded as a natural and important part of the planning and interpretation of research. Cohen (1977) has expressed the point aptly: "Since statistical significance is so earnestly sought and devoutly wished for by behavioral scientists, one would think that the a priori probability of its accomplishment would be routinely determined and well understood" (p. 1). Unfortunately, this seems not to be the case, and it is probably not an exaggeration to assert that most researchers know little more about statistical power than its definition, even though a routine consideration of power has several beneficial consequences.
1. Knowledge of the power of a statistical test indicates the likelihood of obtaining a statistically significant result. Presumably, most researchers would not want to conduct an investigation of low statistical power. The time, effort, and resources required to conduct research are usually sufficiently great that a reasonable chance of obtaining a successful outcome is at least implicitly desirable. Thus, if a priori power estimates are low, the researcher might elect either to increase power or to abandon the proposed research altogether if the costs of increasing power are too high, or if the costs of conducting research of low power cannot be justified.
2. Knowledge of the power of a statistical test facilitates interpretation of null results. It is often stressed in textbooks,
although seldom adhered to in practice, that failure to reject the
null hypothesis does not mean that the null hypothesis is true,
only that there is insufficient evidence to reject the null. In fact,
the proper interpretation of a null result is conditional on the
power of the test. If power was low, then it is reasonable to
suggest that, a priori, there was not a fair chance of rejecting the
null hypothesis and that the failure to reject the null should not
weigh so heavily against the alternative hypothesis (the "alibi"
use of power). However, if power was high, then failure to reject
the null can, within limits, be considered as an affirmation of
the null hypothesis, because the probability of a Type II error
must be low. Thus, in the same way that a statistically significant test result permits the rejection of the null hypothesis with
only a small probability of error (alpha, the Type I error rate),
high power permits the rejection of the alternative hypothesis
with a relatively small probability of error (beta, the Type II
error rate). The qualification on accepting the null hypothesis
involves the concept of probable upper bounds on effect sizes
and will be addressed in greater detail in the Discussion section.
3. Statistical power provides insight concerning entire research domains. When the average statistical power of an entire research literature is low, the veracity of even statistically significant results may be questioned, because the probability of rejecting a true null hypothesis may then be only slightly smaller than the probability of rejecting the null hypothesis when the alternative is true (Bakan, 1966). Thus, a substantial proportion of published significant results may be Type I errors. When power is marginal (e.g., approximately .50), an inconsistent pattern of results may be obtained in which some studies yield significant results while others do not (Kazdin & Bass, 1989; Rossi, 1982, 1986).
This brief survey of the value of power analysis raises the question of why researchers ignore an issue of such clear and obvious merit. Indeed, the benefits of a routine consideration of statistical power may strike some readers as almost too good to be true. Certainly researchers have been admonished periodically to pay more attention to statistical power and less to statistical significance (Cohen, 1965; Greenwald, 1975; Rosnow & Rosenthal, 1989). Yet historically, behavioral researchers have been concerned primarily with Type I errors and the associated concept of statistical "significance" and have largely ignored the Type II error and its concomitant, statistical power (Chase & Tucker, 1976; Cowles & Davis, 1982; Taylor, 1959).
Cohen's Power Survey
What effect has the lack of attention to statistical power had on the power of research? Cohen (1962) was the first to conduct a systematic survey of statistical power, and his paper is frequently credited with introducing power into the literature of the social sciences (but for an earlier account see Mosteller & Bush, 1954). Cohen (1962) surveyed all of the articles published in the Journal of Abnormal and Social Psychology (JASP) for the year 1960. Eliminated from consideration were articles in which no statistical tests were conducted (e.g., case reports, factor-analytic studies). Also eliminated from the survey were statistical tests considered to be peripheral, that is, tests that did not address major hypotheses.
The determination of power was not without difficulty. Power is determined by three separate factors: sample size, alpha level, and effect size; and all three factors must be known to estimate the power of a test. Sample sizes were (usually) determined easily from the research report, and alpha was assumed to be .05 and nondirectional. However, the estimation of effect size proved to be a major difficulty, because authors never reported expected population effect sizes. Therefore, Cohen (1962) elected to estimate the power of each statistical test for a range of effect sizes: small, medium, and large. This procedure resulted in three power estimates for each of 2,088 statistical tests. Power for each journal article was determined by averaging across all of the statistical tests reported in an article, and the power of abnormal-social psychological research as a whole was estimated by averaging the results of the 70 journal articles included in the survey. The average statistical power for the JASP articles was .18 to detect small effects, .48 for medium effects, and .83 for large effects.
Results of Other Power Surveys
Numerous power surveys have been conducted since Cohen (1962), no doubt encouraged by the publication of Cohen's (1969, 1977) convenient power tables. These surveys represent a wide variety of disciplines (see Table 1) and have closely followed the procedures described by Cohen (1962). The principal difference in method between Cohen's (1962) survey and those conducted by later investigators concerns the definitions of effect size. Changes in these definitions were suggested by Cohen (1969) himself so that the definitions of small, medium, and large effects would be more consistent across different types of statistical tests. The differences between the earlier and later definitions of effect size are relatively minor, but they deserve note. For example, for the t test the definitions were changed from 0.25, 0.50, and 1.00 standard deviation units to 0.20, 0.50, and 0.80 standard deviation units for small, medium, and large effects, respectively. For the Pearson correlation coefficient, the definitions were changed from correlations of .20, .40, and .60 to correlations of .10, .30, and .50 for small, medium, and large effects, respectively.
Because these changes resulted in smaller effect size definitions, a decrease in power might have been expected for subsequent power surveys. Instead, results were generally similar to those obtained by Cohen (1962; see Table 1). With few exceptions, most surveys produced remarkably consistent results, especially given the diverse nature of the research fields covered. The average statistical power for all 25 power surveys (including Cohen's) was .26 for small effects, .64 for medium effects, and .85 for large effects and was based on 40,000 statistical tests published in over 1,500 journal articles.
Power of Psychological Research
The low statistical power indicated by the results of these surveys has created some concern among methodologists about the health of psychological research, especially with respect to the possible inflation of Type I error rates and the interpretation of null results (e.g., Bakan, 1966; Cohen, 1965; Greenwald, 1975; Overall, 1969; Tversky & Kahneman, 1971). The irony of these results is that the low statistical power often attributed to psychological research is based primarily on power surveys of research in other disciplines (see Table 1) and on Cohen's results, which were gathered over 25 years ago. Since the publication of Cohen's survey, only two other power surveys of psychological research have been conducted. Sedlmeier and Gigerenzer (1989) surveyed the power of research reported in the Journal of Abnormal Psychology. Using Cohen's (1962) original effect size definitions, they found mean power values of .21, .50, and .84, respectively, for small, medium, and large effect sizes, almost identical to those reported by Cohen. These results are disappointing, because they suggest that there has been no increase in statistical power since Cohen's survey.
The survey conducted by Chase and Chase (1976) for the Journal of Applied Psychology is somewhat more encouraging. Using Cohen's (1969) smaller effect size definitions, Chase and Chase (1976) nevertheless found substantially greater power for medium-sized effects (.67) than either Cohen (1962) or Sedlmeier and Gigerenzer (1989). Unfortunately, the results obtained by Chase and Chase (1976) may not be representative of research in other areas of psychology. In particular, the sample sizes characteristic of studies in the Journal of Applied Psychology are substantially larger than the sample sizes typically used in most psychological research. Average sample size in the Journal of Applied Psychology is approximately 350 (Muchinsky,
1979) but is only about 50 or 60 in most other psychological journals (Holmes, 1979).
The purpose of the research reported here was to provide a more current and representative assessment of the power of psychological research. Three journals covering a wide range of psychological research were selected for the survey: the Journal of Abnormal Psychology, the Journal of Consulting and Clinical Psychology, and the Journal of Personality and Social Psychology. The Journal of Abnormal Psychology and the Journal of Personality and Social Psychology are the direct journal "descendants" of the Journal of Abnormal and Social Psychology, the journal examined by Cohen (1962). Thus, a second purpose of this study was to provide a direct comparison to Cohen's results, to discover what changes have occurred over the past 20 years in the power of psychological research.
Selection Procedures: Journal Articles
All of the articles published in the Journal of Abnormal Psychology, 1982, Volume 91; the Journal of Consulting and Clinical Psychology, 1982, Volume 50; and the Journal of Personality and Social Psychology, 1982, Volume 42, were examined, and articles not reporting any statistical tests were eliminated from the study. In addition, some articles were excluded because they contained statistical methods for which power could not be determined.
Selection Procedures: Statistical Tests
As in previous surveys, a distinction was made between major and peripheral statistical tests. Major tests were those that bore directly on the research hypotheses of the study, whereas peripheral tests did not. Peripheral tests, which were excluded from the survey, included all of the correlation coefficients of a factor analysis, unhypothesized higher order analysis of variance interactions, manipulation checks, interrater reliability coefficients, reliabilities of psychometric tests (internal consistency, test-retest), post hoc analysis of variance procedures (means comparisons tests, simple effects), and tests of statistical assumptions.
For major statistical tests, power was determined for the following: t test, Pearson r, z test for the difference between two independent correlation coefficients, sign test, z test for the difference between two independent proportions, chi-square test, F test in the analysis of variance and covariance, and multiple regression F test.
Primarily excluded from the survey were most nonparametric techniques (e.g., Mann-Whitney U, rank order correlation tests), multivariate methods (e.g., canonical correlation, multivariate analysis of variance), and various other methods for which the concept of power is not directly relevant (e.g., factor analysis, cluster analysis, and multidimensional scaling). At the time of Cohen's (1962) study, nonparametric techniques were much more popular than they are now, whereas the use of multivariate techniques has increased. Because the power characteristics of many nonparametric statistics were unavailable, Cohen (1962) substituted power estimates for the analogous parametric tests (e.g., t test for the Mann-Whitney U test). This procedure seems problematical in that it overestimates power in many situations. Because the frequency of use of nonparametric methods is now quite low, Cohen's procedure was not continued in the present survey. Multivariate statistics were also excluded, because neither convenient computer algorithms nor suitable power tables for most multivariate techniques were available at the time the survey was conducted. It is unlikely that the inclusion of either nonparametric or multivariate statistical methods would have affected the results of the power survey.
Determination of Statistical Power
Cohen's (1977) tables were used to determine power for sign tests and chi-square tests. Tables and formulas in Rossi (1985b) were used to determine power for z tests for the difference between independent correlation coefficients and for z tests for the difference between independent proportions. The tabled values in these sources are accurate to about one digit in the second decimal place when compared with exact values.
Computer programs were written by the author to expedite the determination of power for the remaining tests (t, r, and F). This was done because of the frequency of occurrence of these statistics and to avoid the interpolation errors inevitably encountered with the use of tables or charts. All programs were written in double precision IBM BASIC (version 3.10) to run on an IBM PC/AT microcomputer. The computer program to determine the power of the t test was based on the normal approximation to the noncentral t distribution given by Cohen (1977). Because this formula assumes that sample sizes are equal, it was modified slightly to permit unequal n power calculations. The determination of power for the Pearson correlation coefficient was based on the normal score approximation for r provided by the hyperbolic arctangent transformation, plus a correction factor for small sample sizes (Cohen, 1977). The cube root normal approximation of the noncentral F distribution was used for the analysis of variance power program (Laubscher, 1960). The accuracy of this formula has been found to be quite good, with errors appearing only in the third or fourth decimal places for α = .05 (Cohen & Nee, 1987). Although more accurate approximations exist, the small gain in precision did not justify the additional computational complexity. For all computer programs, normal score approximations were converted to probability (power) values using formula 26.2.19 in Abramowitz and Stegun (1965, p. 932), who give the accuracy of the algorithm as ±1.5 × 10⁻⁷. The resulting power values agreed to two decimal places with those obtained from Cohen's (1977) power tables and recently available computer program (Borenstein & Cohen, 1988).
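The general shape of such calculations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reconstruction of the BASIC programs described above: the t test uses a plain large-sample normal approximation rather than Cohen's (1977) formula, the r test uses the hyperbolic arctangent transformation without the small-sample correction, and the function names (phi, z_crit, power_t, power_r) are hypothetical. The coefficients for formula 26.2.19 are quoted as commonly tabulated and should be checked against the handbook before being relied on.

```python
from math import atanh, sqrt
from statistics import NormalDist

# Coefficients of Abramowitz and Stegun formula 26.2.19, as commonly
# tabulated (assumption: verify against the handbook before relying on them).
_D = (0.0498673470, 0.0211410061, 0.0032776263,
      0.0000380036, 0.0000488906, 0.0000053830)


def phi(x: float) -> float:
    """Standard normal CDF via the 26.2.19 polynomial approximation."""
    if x < 0:
        return 1.0 - phi(-x)  # the formula is stated for x >= 0; use symmetry
    poly = 1.0 + sum(d * x ** (k + 1) for k, d in enumerate(_D))
    return 1.0 - 0.5 * poly ** -16


def z_crit(alpha: float = 0.05) -> float:
    """Two-tailed critical z value (hypothetical helper)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def power_t(d: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Approximate two-tailed power of an independent-groups t test.

    Large-sample normal approximation, allowing unequal group sizes.
    """
    delta = d * sqrt(n1 * n2 / (n1 + n2))  # equals d * sqrt(n / 2) if n1 == n2
    zc = z_crit(alpha)
    return phi(delta - zc) + phi(-delta - zc)


def power_r(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-tailed power for testing a Pearson r against zero.

    Uses the hyperbolic arctangent transformation; no small-sample correction.
    """
    delta = atanh(r) * sqrt(n - 3)
    zc = z_crit(alpha)
    return phi(delta - zc) + phi(-delta - zc)


# Cohen's (1977) small, medium, and large definitions for the t test.
for d in (0.20, 0.50, 0.80):
    print(d, round(power_t(d, 30, 30), 2))
```

When the effect size is zero, both functions return a value equal to alpha (within the error of the approximation), which is the situation taken up later in the discussion of Type I errors.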
A total of 312 articles were examined: 58 in the Journal of Abnormal Psychology (JAP), 146 in the Journal of Consulting and Clinical Psychology (JCCP), and 108 in the Journal of Personality and Social Psychology (JPSP). Of the total, 91 articles were excluded: 9 in JAP, 68 in JCCP, and 14 in JPSP. The majority of excluded articles reported no statistical tests at all (n = 62). Only 29 articles contained statistical tests for which power was not or could not be determined (primarily data reduction techniques, such as cluster analysis and factor analysis). The large number of excluded articles for JCCP was due to the publication of a special issue of the journal containing only review articles. Power was determined for statistical tests reported in the remaining 221 articles: 49 in JAP, 78 in JCCP, and 94 in JPSP. The total number of tests for which power was calculated was 6,155: 1,289 in JAP, 2,231 in JCCP, and 2,635 in JPSP. The frequency of occurrence for each test is given in Table 2. The sample is dominated by the traditional statistical tests: Pearson r, analysis of variance, and the t test, which constituted 90% of
the reported techniques. Because this tabulation excludes tests for which power was not determined, the listed frequencies are not necessarily representative of the use of statistics in psychological research (see Moore, 1981).
Table 2
Frequency Distribution for Statistical Tests in Power Survey
[Table body not recoverable. Rows listed under "Statistical test": F test: ANOVA and ANCOVA; Pearson r; chi-square test; F test: multiple regression; z test: difference between rs; z test: difference between ps; sign test.]
Note. JAP = Journal of Abnormal Psychology; JCCP = Journal of Consulting and Clinical Psychology; JPSP = Journal of Personality and Social Psychology. Freq. = frequency of occurrence. ANOVA = analysis of variance; ANCOVA = analysis of covariance.
Because the number of statistical tests appearing in each journal article varied greatly (from 1 to 256), the article was used as the unit of analysis.¹ Power was determined by averaging across the statistical tests reported in the article so that each article contributed equally to the overall power assessment of the journal literature (following Cohen, 1962). Six separate power estimates were made for each statistical test. Three estimates were based on Cohen's (1962) original definitions of small, medium, and large effects. A second set of estimates was based on Cohen's (1969, 1977) "modern" definitions. In most practical respects, power did not depend greatly on which definitions were used. Therefore, except where noted below, only the results that were based on Cohen's more recent and frequently used definitions will be reported. An alpha level of .05 (two-tailed) was assumed for all power calculations.
The mean power to detect small, medium, and large effects for all 221 journal articles is given in Table 3. These results are collapsed across journals, inasmuch as a comparison of the results for the three journals revealed very similar levels of statistical power (see Table 4). The average statistical power combined across journals was .17 to detect small effects, .57 to detect medium effects, and .83 to detect large effects. Compared with the results of other surveys (see Table 1), these results suggest somewhat lower statistical power for JAP, JCCP, and JPSP.² For small effects, t(220) = 9.39, p < .001, ω² = .283; for medium effects, t(220) = 4.33, p < .001, ω² = .074; and for large effects, t(220) = 2.24, p < .05, ω² = .018. In general, the magnitude of the discrepancy decreases with increasing effect size.
Power for the three journals was also computed using Cohen's (1962) earlier definitions of effect size, to ensure comparability with the results reported by Cohen (1962). Using these somewhat larger definitions, the mean power to detect small, medium, and large effects was .25, .60, and .90, respectively. Cohen (1962) reported results of .18, .48, and .83 for small, medium, and large effects. For all levels of effect size, the increases in power were statistically significant:³ For small effects, t(289) = 3.03, p < .01, ω² = .027; for medium effects, t(289) = 3.71, p < .001, ω² = .042; and for large effects, t(289) = 3.80, p < .001, ω² = .044.
¹ Use of either statistical tests or separate studies reported within articles as the unit of analysis gave results essentially identical to those found using articles as the unit of analysis.
² These are one-sample t tests using the mean power values of the 25 previously conducted power surveys reported in Table 1 as normative (population) estimates and compared with the mean power values reported in Table 3 for the 221 studies used in the present survey.
³ Two-sample (independent groups) t tests were used to compare the results of the 70 studies analyzed by Cohen (1962) with the 221 studies analyzed in the present survey. Standard deviations of the mean power values reported here are .08, .20, and .16 for small, medium, and large effect sizes, respectively, for Cohen (1962) and .19, .25, and .13 for small, medium, and large effect sizes, respectively, for the present survey.
Table 3
Power of 221 Studies Published in the Journal of Abnormal Psychology, the Journal of Consulting and Clinical Psychology, and the Journal of Personality and Social Psychology
[Table body not recoverable. Columns: small effects, medium effects, large effects. Surviving entries for the 95% confidence interval on M: .148-.186 and .801-.849.]
Note. Q1 and Q3 are the first and third quartiles of each frequency distribution.
Table 4
Comparison of the Power of Studies Published in the Journal of Abnormal Psychology, the Journal of Consulting and Clinical Psychology, and the Journal of Personality and Social Psychology
[Table body not recoverable. Columns: small effects, medium effects, large effects.]
Note. JAP = Journal of Abnormal Psychology; JCCP = Journal of Consulting and Clinical Psychology; JPSP = Journal of Personality and Social Psychology.
Power of Current Psychological Research Compared With Cohen's (1962) Survey
The results of this study seem to suggest that statistical power has increased slightly since Cohen (1962) conducted his survey. However, these increases are no cause for joy. In fact, the general character of the statistical power of psychological research remains the same today as it was then: Power to detect small effects continues to be poor, power to detect medium effects continues to be marginal, and power to detect large effects continues to be adequate. Furthermore, any increases in power are probably effectively eliminated by the increased use of alpha-adjusted procedures since Cohen's (1962) survey was conducted (Sedlmeier and Gigerenzer, 1989). Thus, the increases in power reported here do not appear to be practically meaningful.
Nevertheless, from a purely technical standpoint, these results are not entirely consistent with those recently reported by Sedlmeier and Gigerenzer (1989), who found no increase in statistical power compared with Cohen (1962). Unlike the present survey, which analyzed studies appearing in the 1982 volumes of three journals, Sedlmeier and Gigerenzer (1989) examined studies published in the 1984 volume of one journal, the Journal of Abnormal Psychology. However, in the current study all three journals yielded virtually identical results. Examining only the Journal of Abnormal Psychology results for the present survey and using Cohen's (1962) original effect size definitions, power to detect small, medium, and large effects was .24, .60, and .91, respectively, compared with .21, .50, and .84, respectively, obtained by Sedlmeier and Gigerenzer (1989). Power is somewhat greater for medium and large effects in the present survey. This discrepancy is somewhat puzzling, inasmuch as these results are based on the same journal published only two years apart. It is possible that there may be significant year-to-year variations in statistical power, although it is not clear why this should be so.
Power of Current Psychological Research Compared With the Power of Other Surveys
Power estimates that are based on Cohen's (1977) more recent and smaller effect size definitions are naturally somewhat lower than power estimates that are based on Cohen's (1962) earlier definitions. These results (Table 3) suggest that the power of current psychological research is somewhat lower than the power of research in other social and behavioral science disciplines (Table 1). However, it is likely that these differences are not very important. The disparity in power is most apparent for small effects, diminishing rapidly with increasing effect size. In general, the power of current psychological research is quite consistent with the range of results obtained by most of the power surveys shown in Table 1. Thus, the common tendency to generalize to psychological research the low power obtained by studies in other areas of social and behavioral research seems to have been justified.
Methodologists have typically recommended that researchers design their studies so that there is an 80% chance of
detecting the effect under investigation (e.g., Cohen, 1965). In
this respect, the results of these power surveys are not encouraging. The average statistical power of research found in our survey exceeded .80 only for large effects, and more than a third
(35%) of all studies were unable to attain this level of power even
for large effects. More than 75% of all studies in the survey
failed to achieve power of .80 for medium-sized effects, and
almost half of the studies did not have even a 50% chance of
detecting effects of this size. The prospects of detecting small
effects are dismal: More than 90% of the surveyed studies had
less than one chance in three of detecting a small effect.
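To give the .80 recommendation a concrete scale, the following sketch estimates the sample size per group that a two-group comparison would need to reach a target power. It is only an illustration: it relies on the large-sample normal approximation (exact values based on the noncentral t distribution run slightly larger), and the function name n_per_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate n per group for a two-tailed, two-group comparison.

    Normal approximation: n = 2 * ((z_alpha + z_beta) / d) ** 2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical value
    z_beta = z.inv_cdf(power)               # quantile for the target power
    return ceil(2.0 * ((z_alpha + z_beta) / d) ** 2)


# Cohen's (1977) small, medium, and large effect sizes for the t test.
for d in (0.20, 0.50, 0.80):
    print(d, n_per_group(d))
```

Under this approximation, roughly 393, 63, and 25 subjects per group are needed for small, medium, and large effects, respectively. Set against typical sample sizes of about 50 or 60, these figures suggest why power of .80 was reached on average only for large effects.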
Implications of Survey Results
Whether power is based on Cohen's earlier or later effect size
definitions, the results of these power surveys are not encouraging. Although power may be adequate if a large effect is postulated, there is often no way to judge whether or not such an
assumption is justified, because most researchers do not report
obtained effect sizes. Surveys of the magnitude of effects in
psychological research suggest that typical effect sizes may be
no larger than Cohen's (1977) definition of a medium-sized effect (Cooper & Findley, 1982; Haase, Waechter, & Solomon,
1982). Even small effects can no longer be ignored in psychological research, thanks to the illuminating analyses provided by
Abelson (1985) and Rosenthal and Rubin (1982) demonstrating
the potential impact of such effects. The levels of statistical
power found in the present survey have several serious consequences depending on how large we suppose population effect
sizes are in psychological research.
What if effect sizes are small? Power for small effects was
very low (.17) in our survey. Of course, it might be argued that
the power of psychological research could not be this low, inasmuch as a large proportion of all published studies report statistically significant results (Greenwald, 1975; Sterling, 1959).
Unfortunately, not only does low power suggest that there may
be a large number of Type II errors, but low power also suggests
the possibility of a proliferation of Type I errors in the research
literature. Consider a researcher who conducts an investigation
for which the null hypothesis happens to be true. With effect
size equal to zero, the probability of obtaining a statistically
significant result will be equal to alpha, because there is no true
alternative to detect. Should our unfortunate researcher (or
more likely, other researchers) persist in conducting research in
this area, in the long run significant results will occur at a rate
of 5%, if alpha is .05. Assuming an editorial publication bias
favoring statistically significant results, disproportionately
more of the significant results are likely to be published, while
the nonsignificant results will for the most part remain in "file
drawers" (Rosenthal, 1979). Because the population effect size
in this case is zero, 100% of all published significant results
would be Type I errors, despite a Type I error rate of 5%!
Although this situation is admittedly contrived, the situation
with power greater than alpha but still low is not much better.
When the power of research is low, the probability of rejecting a
true null hypothesis may be only slightly smaller than the probability of rejecting the null hypothesis when the alternative is
true. That is, the ratio of Type I errors to power may be uncomfortably large, indicating that a substantial proportion of all
significant results may be due to false rejections of valid null
hypotheses. On the basis of the results of the present survey, this
ratio is .05:.17, suggesting one Type I error for approximately
every three valid rejections of the null hypothesis, for a "true"
Type I error rate of about .23. Nor is this situation unknown to
methodologists, many of whom have commented on various
aspects of the problem (e.g., Bakan, 1966; Cohen, 1965; Overall,
1969; Selvin & Stuart, 1966; Sterling, 1959; Tversky & Kahneman, 1971), some going so far as to suggest that all published
significant results may be Type I errors. It is in this way that low
power undermines the confidence that can be placed even in
statistically significant results. This may well be the legacy of
low statistical power for small effects.
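The arithmetic behind the "true" Type I error rate of about .23 can be written out as a short sketch. The .05:.17 ratio implicitly treats true and false null hypotheses as equally common among the tests conducted; the prior_null parameter below makes that assumption explicit and is not part of the original argument. The function name false_positive_share is hypothetical.

```python
def false_positive_share(alpha: float, power: float,
                         prior_null: float = 0.5) -> float:
    """Share of significant results that are false rejections of a true null.

    prior_null is the assumed proportion of tested null hypotheses that are
    actually true (an illustrative assumption; 0.5 reproduces the .05:.17 ratio).
    """
    false_pos = prior_null * alpha          # true nulls rejected by chance
    true_pos = (1.0 - prior_null) * power   # false nulls correctly rejected
    return false_pos / (false_pos + true_pos)


# Mean power for small, medium, and large effects in the present survey.
for power in (0.17, 0.57, 0.83):
    print(power, round(false_positive_share(0.05, power), 3))
```

With power at .17 the share is .05 / (.05 + .17), or about .23. Setting prior_null to 1.0 reproduces the contrived case described above, in which every published significant result is a Type I error.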
What if effect sizes are medium? Power to detect medium
effects was .57 in our survey. The problem of increased Type I
errors in the published literature is less serious here than if
effect sizes are small. Yet, a problem of a different sort arises
when power is marginal, that is, in the general vicinity of .50.
The problem is that an inconsistent pattern of results may be
obtained in which some studies yield significant results while
others do not. Such a pattern of results is especially troublesome for research that is directed at a specific issue or problem
area and frequently results in the failure to replicate an experimental finding. An example may be found in the literature on
the spontaneous recovery of verbal associations. During the
period when it was still a hot topic (1948-1969), spontaneous
recovery became quite controversial, there being roughly equal
numbers of significant and nonsignificant findings. The controversy was resolved by applying meta-analysis and power analysis to the entire body of spontaneous recovery studies (Rossi,
1982, 1986). The meta-analysis revealed a rather small effect
size of .032 (proportion of variance accounted for). Rather than
conduct a power survey of the spontaneous recovery literature
using Cohen's range of small to large effect sizes, power for each
study was computed using the actually obtained effect size. The
average power of all 48 studies was .375, suggesting that 37.5%
of spontaneous recovery studies should have found statistically
significant results, in good agreement with the observed rate of
significance of 41.7% (20 of 48 significant results). This analysis
suggested that the sample sizes of spontaneous recovery studies
were inadequate to ensure detection of the effect in most studies but were sufficient to guarantee some statistically significant results. It is easy to see how the controversy over the existence of the effect was generated under these circumstances.
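The agreement between the average power and the observed rate of significance can be checked with a few lines of arithmetic. The sketch below is illustrative only. It treats the 48 studies as independent trials that share the mean power of .375, which is a simplifying assumption (the individual studies had different power values), and the function name is hypothetical.

```python
from math import sqrt

def expected_significant(n_studies, mean_power):
    """Expected count of significant studies and its binomial SD.

    Assumes every study has the same power (mean_power) and that the
    studies are independent; both are simplifications.
    """
    expected = n_studies * mean_power
    sd = sqrt(n_studies * mean_power * (1.0 - mean_power))
    return expected, sd

n_studies = 48      # spontaneous recovery studies in the literature
mean_power = 0.375  # average power at the obtained effect size
observed = 20       # studies that reported significant results

expected, sd = expected_significant(n_studies, mean_power)
z = (observed - expected) / sd  # distance of observed count from expectation

print(f"expected significant studies: {expected:.1f} (SD {sd:.2f})")
print(f"observed: {observed} ({observed / n_studies:.3f}), z = {z:.2f}")
```

Under these assumptions the expected count is 18 of 48, and the observed 20 lies well within one standard deviation of it.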
Current texts regard spontaneous recovery as ephemeral, and
the issue was never resolved so much as it was abandoned. The
entire episode is reminiscent of Meehl's (1978) characterization
of the fate of theories in "soft psychology" as never dying, just
slowly fading away as researchers lose interest to pursue new
theories. This may be the legacy of marginal levels of statistical
power for medium effects.
What if effect sizes are large? If effect sizes in psychological research are large, then the results of the present survey suggest that power will be somewhat greater than .80. But it is doubtful that anyone seriously believes average effect sizes in psychology are large, especially for research conducted outside the laboratory, and there is certainly no evidence to support such a belief. The few surveys that have been conducted suggest average effect sizes approximating Cohen's (1977) definition of medium effects (Cooper & Findley, 1982; Haase et al., 1982). Informal observations of the effect sizes reported in published meta-analyses are consistent with this view.
Limitations of Power Surveys
Power surveys have some important limitations that should be kept in mind when one is evaluating the results reported in this article. Several have already been mentioned, including limited time sampling (usually only one year) and a small number of journals (usually only one). More comprehensive surveys are certainly needed. Unfortunately, other problems may not be so easily addressed. One critical issue concerns the specific definitions assumed to represent small, medium, and large effects. Lacking comprehensive survey data on effect sizes in behavioral research, Cohen (1962, 1969, 1977) chose admittedly arbitrary definitions for small, medium, and large effects. If Cohen's range of effect sizes is not representative of psychological research, then the results of power surveys will be similarly not representative. Fortunately, as it has already been pointed out, surveys of effect sizes have supported Cohen's estimates. However, such surveys have been few, have covered a very limited range of disciplines, and have for the most part been based on relatively small sample sizes. There is a real need to conduct more comprehensive effect size surveys over a wide range of psychological journals. Because reporting of obtained effect sizes is still not a common publication practice, a really comprehensive survey will be both difficult and tedious to complete. A survey of the effect sizes reported in published meta-analyses might be a good place to start.
Another limitation of power surveys is that they have concentrated on very general research domains, primarily by analyzing all of the studies appearing in one or more specific journals. The research published in any journal necessarily covers a heterogeneous range of topics. Thus, even if the range of effect sizes on which power analyses are based is generally accurate, the results of power surveys are nevertheless strictly applicable only to the broad domain of research covered in the particular journal that is surveyed. Of course, the knowledge gained through such surveys has considerable value, as I have tried to point out throughout this article. Yet it may be argued that the impact of such knowledge is diffused across the many specific research areas represented in the journal, leading to the speculation that the accumulated impact of low statistical power on research may be similarly diffuse. It may be for this reason that surveys of statistical power have had so little influence on the behavior of practicing researchers.
One solution to this problem would be to conduct power surveys on specific research literatures instead of entire journals. By specific research literatures I mean a set of studies for which a specific research hypothesis has been clearly defined and operationalized, such as gender differences in mathematical ability, psychotherapy outcome studies, or the existence of spontaneous recovery. A few such studies exist, all of which have reported relatively low statistical power (Beaumont & Breslow, 1981; Bones, 1972; Crane, 1976; Freiman, Chalmers, Smith, & Kuebler, 1978; Katzell & Dyer, 1977; King, 1985; Rothpearl, Mohs, & Davis, 1981). These studies are limited, however, in the same way that traditional power surveys have been limited: All have estimated statistical power using either Cohen's range of effect size estimates or other similarly arbitrary effect sizes. Instead it would be much more interesting to use meta-analysis to determine the actual obtained effect size of the specific research literature and to use this estimate of effect size as the population parameter against which to compute power. This has been done, as previously described, for the research on spontaneous recovery (Rossi, 1982) and for several other research areas as well (Rossi, 1986), with generally similar results. The only other such study was conducted by Kazdin and Bass (1989), who analyzed the power of comparative psychotherapy outcome studies. They found adequate statistical power for treatment versus no-treatment studies, but relatively marginal power for treatment versus treatment and treatment versus active control comparisons. Thus, the frequently obtained result of no-treatment differences in psychotherapy outcome research may well be due to inadequate statistical power, similar to the situation described for spontaneous recovery. The importance of this result should not be underestimated in view of the enormous amount of effort that has gone into the detection of differential treatment effects over the past 25 years. Power surveys of specific research areas, such as those described here, may bring into focus more clearly the impact of low statistical power on research.
Use of Statistical Power Analysis in the
Interpretation of Negative Results
Perhaps the most frequent use of statistical power analysis is in determining sample sizes for planned investigations. Somewhat overlooked is the fact that power analysis may also be useful after a study has been completed, especially when significant results have not been obtained. Therefore, it seems worth noting in some detail how statistical power analysis may be useful in helping to interpret negative results. As an example, it will be convenient to reconsider the discrepancy in results between those obtained by Sedlmeier and Gigerenzer (1989) and those obtained in the present survey.
One possible explanation for the discrepancy is the relatively small number of articles on which the Journal of Abnormal Psychology results are based, 54 for Sedlmeier and Gigerenzer (1989) and 49 in our study, thus leading to relatively wide confidence intervals around the mean power values. For example, for medium effects, the 95% confidence interval for Sedlmeier and Gigerenzer (1989) is .426-.574 and for the present survey is .533-.671. These results suggest that the inability of Sedlmeier and Gigerenzer (1989) to detect a change in statistical power is itself a problem in statistical power. The best estimate of the population effect size for the increase in statistical power since Cohen (1962) is ω² = .038, the average of the three ω²s for the increases in power for small, medium, and large effects found in our survey. For sample sizes of n1 = 70 (Cohen, 1962) and n2 = 54 (Sedlmeier & Gigerenzer, 1989), an independent groups t test with alpha (two-tailed) of .05 has power of only .585 to detect an effect accounting for 3.8% of total variance. To achieve power of .80, Sedlmeier and Gigerenzer (1989) would have had to include 178 studies in their survey. In contrast, the power of our study was considerably higher, .822, but could be obtained only by pooling results across the three journals included in the survey. If only the Journal of Abnormal Psychology had been included in the present study, power would have been only .561.
Power analysis can also help put Sedlmeier and Gigerenzer's (1989) null results into some perspective. In particular, it would be valuable to know, given the power of their study, how large an increase in statistical power might have escaped detection. Table 5 gives the power and the probability of a Type II error of their study against a range of effect sizes. For example, for a population effect size accounting for 8% of total variance, the Type II error rate is approximately 10%. Using tables and formulas given in Cohen (1977), it can be shown that an effect of this magnitude would have been obtained by Sedlmeier and Gigerenzer (1989) if the power to detect medium effects reported in the Journal of Abnormal Psychology had increased to .62 (or decreased to .33). Thus it may be concluded that the probability that power has increased to .62 and that Sedlmeier and Gigerenzer (1989) failed to detect the increase is .10. If you are willing to live with this probability of error, then this value of .62 may be regarded as a probable upper bound on the increase in statistical power since Cohen (1962). Table 5 lists additional upper bounds for a wide range of Type II error rates for the Sedlmeier and Gigerenzer (1989) survey. In general, information similar to that presented in Table 5 can be invaluable for interpreting negative results. It does not, strictly speaking, permit the acceptance of the null hypothesis, because the null hypothesis is essentially always false (Bakan, 1966; Meehl, 1978). Rather, this information permits the researcher to put an approximate upper limit on the magnitude of the effect that escaped detection. Such a procedure is commonplace in the more developed sciences (Rossi, 1985a).
Table 5
Probable Upper Bounds on the Increase in Statistical Power to Detect Medium Effects for the Sedlmeier & Gigerenzer (1989) Survey
Effect size    Type II error rate    Upper bound on increase in power
Note. Effect size is reported in terms of proportion of variance accounted for. Upper bounds are for increases in power to detect medium effects since Cohen (1962).
Conclusions and Recommendations
The results of this study suggest that, more than 20 years after Cohen (1962) conducted his study, the power of psychological research is still low. These results have serious implications not only for individual researchers conducting their own studies but for the entire discipline, especially with respect to the proliferation of Type I errors in the published literature and the frequent failure of replication studies. It is clear that the power of psychological research needs to be improved, but it is less clear how best to accomplish this goal. Reviewers, editors, and researchers themselves are unlikely to be comfortable with increasing alpha levels beyond the currently accepted standard of .05. Of course the easiest recommendation is to yet again admonish researchers to increase the size of their samples, but such increases are not always practical, and they are almost always expensive. This would seem to indicate that increasing the magnitude of effects may be the only practical alternative to expensive increases in sample size as a means for increasing the statistical power of psychological research.
We tend to think of effect size (when we think of it at all) as a fixed and immutable quantity that we attempt to detect. It may be more useful to think of effect size as a manipulable parameter that can, in a sense, be made larger through greater measurement accuracy. This can be done through the use of more effective measurement models, more sensitive research designs, and more powerful statistical techniques. Examples might include more reliable psychometric tests; better control of extraneous sources of variance through the use of blocking, covariates, factorial designs, and repeated measurement designs; and, in general, through the use of any procedures that effectively reduce the "noise" in the system. Inasmuch as behavior has multiple causes, the increased use of multivariate statistical methods needs to be encouraged. Toward this end, additional work must be done to make the computation of multivariate power more accessible to researchers, along the lines started by Cohen (1988) and Stevens (1980, 1986).
Sources of additional information on statistical power for nontechnical readers, including computational details and worked examples, have become much more widely available in the past few years. Notable among these is the newly revised edition of Cohen's (1988) seminal power handbook and the monographs by Kraemer and Thiemann (1987) and Lipsey (1990). Unfortunately, most current general statistics textbooks still do not contain much information on the calculation of statistical power. Those that do are usually couched in terms of noncentrality parameters or are limited to the consideration of simple cases, such as t tests and one-way analysis of variance F tests. Numerous computer programs are also now available to help microcomputer users determine their power and sample size requirements. Thirteen such programs covering a wide range of techniques are comprehensively reviewed by Goldstein (1989). To date, the only mainframe computer package that provides statistical power as standard output is the multivariate analysis of variance (MANOVA) routine in SPSS-X (1988). Users are cautioned, however, that this program provides only retrospective power, that is, power based on the effect size actually obtained in the analysis.
Finally, because power analysis requires a knowledge of effect
size, researchers are strongly encouraged to use and report indices of effect size in their research reports or to provide enough
data to permit others to do so. If negative results are found, then
a probable upper bound on effect size should be determined.
Future researchers will then have an adequate data base on
which to estimate the power of their own research, for example,
through the use of meta-analytic procedures. If a researcher has
no past research or other knowledge to serve as a guide in the
selection of an appropriate effect size, then estimating power
over a broad range of effect sizes is the least that should be done.
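Estimating power over a range of effect sizes can be sketched in a few lines. The example below is a rough illustration, not the procedure used in this article. It assumes a two-group comparison, converts proportion of variance (ω²) to a standardized mean difference with the equal-weights relation d = 2f, where f² = ω² / (1 - ω²), and replaces the noncentral t distribution with a normal approximation. The function names and the particular grid of effect sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used in place of noncentral t

def omega_sq_to_d(omega_sq):
    """Convert proportion of variance to d, assuming two equally weighted groups."""
    f = sqrt(omega_sq / (1.0 - omega_sq))
    return 2.0 * f

def approx_power_two_groups(omega_sq, n1, n2, alpha=0.05):
    """Approximate two-tailed power of an independent groups test.

    Normal approximation: power = P(Z > z_crit - delta) + P(Z < -z_crit - delta),
    with delta = d * sqrt(n1 * n2 / (n1 + n2)).
    """
    d = omega_sq_to_d(omega_sq)
    delta = d * sqrt(n1 * n2 / (n1 + n2))
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(delta - z_crit) + _STD_NORMAL.cdf(-delta - z_crit)

# Sample sizes taken from the comparison of the two surveys discussed earlier.
n1, n2 = 70, 54
effect_sizes = [0.01, 0.02, 0.038, 0.06, 0.08, 0.10]  # illustrative grid

power_table = {w: approx_power_two_groups(w, n1, n2) for w in effect_sizes}

for w, power in power_table.items():
    print(f"omega^2 = {w:.3f}  power = {power:.3f}  Type II error = {1 - power:.3f}")
```

With n1 = 70 and n2 = 54, this approximation gives power of about .59 for ω² = .038 and a Type II error rate near .10 for ω² = .08, close to the values reported above. An exact calculation based on the noncentral t distribution would differ slightly.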
