Power Analysis
Power is the conditional probability that one will reject the null hypothesis given that the null
hypothesis is really false.
Imagine that we are evaluating the effect of a putative memory-enhancing drug. We have
randomly sampled 25 people from a population known to be normally distributed with a μ of 100 and
a σ of 15. We administer the drug, wait a reasonable time for it to take effect, and then test our
subjects’ IQ. Assume that we were so confident in our belief that the drug would either increase IQ or
have no effect that we entertained directional hypotheses. Our null hypothesis is that after
administering the drug μ ≤ 100; our alternative hypothesis is μ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the null is easy: it
becomes μ = 100. The alternative is more troublesome. If we knew that the effect of the drug was to
increase IQ by 15 points, our exact alternative hypothesis would be μ = 115, and we could compute
power, the probability of correctly rejecting the false null hypothesis given that μ is really equal to 115
after drug treatment, not 100 (normal IQ). But if we already knew how large the effect of the drug
was, we would not need to do inferential statistics.
One solution is to decide on a minimum nontrivial effect size. What is the smallest effect
that you would consider to be nontrivial? Suppose that you decide that if the drug increases IQ by 2
or more points, then that is a nontrivial effect, but if the mean increase is less than 2 then the effect is
trivial.
Now we can test the null of μ = 100 versus the alternative of μ = 102. Look at the figure on the
following page (if you are reading this on paper, in black and white, I recommend that you obtain an
electronic copy of this document from our BlackBoard site and open it in Word so you can see the
colors). Let the left curve represent the distribution of sample means if the null hypothesis were true,
μ = 100. This sampling distribution has a μ of 100 and a σx̄ = 15/√25 = 3. Let the right curve represent
the sampling distribution if the exact alternative hypothesis is true, μ = 102. Its μ is 102 and,
assuming the drug has no effect on the variance in IQ scores, σx̄ = 15/√25 = 3.
The red area in the upper tail of the null distribution is α. Assume we are using a one-tailed α
of .05. How large would a sample mean need be for us to reject the null? Since the upper 5% of a
normal distribution extends from 1.645 σx̄ above the μ up to positive infinity, the sample mean IQ
would need be 100 + 1.645(3) = 104.935 or more to reject the null. What are the chances of getting a
sample mean of 104.935 or more if the alternative hypothesis is correct, if the drug increases IQ by 2
points? The area under the alternative curve from 104.935 up to positive infinity represents that
probability, which is power. Assuming the alternative hypothesis is true, that μ = 102, the probability
of rejecting the null hypothesis is the probability of getting a sample mean of 104.935 or more in a
normal distribution with μ = 102, σ = 3. Z = (104.935 − 102)/3 = 0.98, and P(Z > 0.98) = .1635. That
is, power is about 16%. If the drug really does increase IQ by an average of 2 points, we have a 16%
chance of rejecting the null. If its effect is even larger, we have a greater than 16% chance.
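The arithmetic above is easy to check with a short sketch that makes the logic explicit. This is a minimal illustration using Python's standard library; the function name and rounding are my own, not from the text:

```python
from math import sqrt
from statistics import NormalDist

def power_one_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tailed one-sample z test."""
    se = sigma / sqrt(n)                               # standard error of the mean
    crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # critical sample mean, 104.935 here
    return 1 - NormalDist(mu1, se).cdf(crit)           # P(mean >= crit | mu = mu1)

print(round(power_one_tailed(100, 102, 15, 25), 3))    # ≈ .164, the 16% power in the text
```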

Suppose we consider 5 the minimum nontrivial effect size. This will separate the null and
alternative distributions more, decreasing their overlap and increasing power. Now,
Z = (104.935 − 105)/3 = −0.02, P(Z > −0.02) = .5080, or about 51%. It is easier to detect large
effects than small effects.
Suppose we conduct a two-tailed test, since the drug could actually decrease IQ; α is now split
into both tails of the null distribution, .025 in each tail. We shall reject the null if the sample mean is
1.96 or more standard errors away from the μ of the null distribution. That is, if the mean is 100 +
1.96(3) = 105.88 or more (or if it is 100 − 1.96(3) = 94.12 or less) we reject the null. The probability of
that happening if the alternative is correct (μ = 105) is: Z = (105.88 − 105)/3 = .29, P(Z > .29) =
.3859, power = about 39%. We can ignore P(Z < (94.12 − 105)/3) = P(Z < −3.63), which is very, very small.
Note that our power is less than it was with a one-tailed test. If you can correctly predict the
direction of effect, a one-tailed test is more powerful than a two-tailed test.
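The two-tailed computation can be sketched the same way; here the lower rejection region's tiny contribution is included rather than ignored (names are mine, not from the text):

```python
from math import sqrt
from statistics import NormalDist

def power_two_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z test."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    hi, lo = mu0 + z_crit * se, mu0 - z_crit * se  # 105.88 and 94.12 here
    alt = NormalDist(mu1, se)
    return (1 - alt.cdf(hi)) + alt.cdf(lo)         # mass in both rejection regions

print(round(power_two_tailed(100, 105, 15, 25), 3))  # ≈ .385, the 39% in the text
```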
Consider what would happen if you increased sample size to 100. Now σx̄ = 15/√100 = 1.5.
With the null and alternative distributions less plump, they should overlap less, increasing power.
With σx̄ = 1.5, the sample mean will need be 100 + (1.96)(1.5) = 102.94 or more to reject the null. If
the drug increases IQ by 5 points, power is: Z = (102.94 − 105)/1.5 = −1.37, P(Z > −1.37) = .9147, or
between 91 and 92%. Anything that decreases the standard error will increase power. This
may be achieved by increasing the sample size or by reducing the σ of the dependent variable.
The σ of the criterion variable may be reduced by reducing the influence of extraneous variables upon
the criterion variable (eliminating “noise” in the criterion variable makes it easier to detect the signal,
the grouping variable’s effect on the criterion variable).
Now consider what happens if you change α. Let us reduce α to .01. Now the sample mean
must be 2.58 or more standard errors from the null μ before we reject the null. That is, 100 +
2.58(1.5) = 103.87. Under the alternative, Z = (103.87 − 105)/1.5 = −0.75, P(Z > −0.75) = 0.7734, or
about 77%, less than it was with α at .05, ceteris paribus. Reducing α reduces power.
Please note that all of the above analyses have assumed that we have used a normally
distributed test statistic, as Z = (X̄ − μ)/σx̄ will be if the criterion variable is normally distributed in the
population or if sample size is large enough to invoke the CLT. Remember that using Z also requires
that you know the population σ rather than estimating it from the sample data. We more often
estimate the population σ, using Student’s t as the test statistic. If N is fairly large, Student’s t is
nearly normal, so this is no problem. For example, with a two-tailed α of .05 and N = 25, we went out
± 1.96 standard errors to mark off the rejection region. With Student’s t on N − 1 = 24 df we should
have gone out ± 2.064 standard errors. But 1.96 versus 2.06 is a relatively trivial difference, so we
should feel comfortable with the normal approximation. If, however, we had N = 5, df = 4, critical t = ±
2.776, and the normal approximation would not do. A more complex analysis would be needed.
One Sample Power the Easy Way
Howell presents a simple method of doing power analyses for various designs. His method
also assumes a normal sampling distribution, so it is most valid with large sample sizes. The first
step in using Howell’s method is to compute d, the effect size in σ units. For the one sample test,
d = (μ1 − μ0)/σ. For our IQ problem with minimum nontrivial effect size at 5 IQ points,
d = (105 − 100)/15 = 1/3. We combine d with N to get δ (delta), the noncentrality parameter. For the one
sample test, δ = d√N. For our IQ problem with N = 25, δ = (1/3)(5) = 1.67. Once δ is obtained,
power is obtained using the power table in our textbook. For a .05 two-tailed test, power = 36% for a
δ of 1.60 and 40% for a δ of 1.70. By linear interpolation, power for δ = 1.67 is 36% + .7(40% − 36%)
= 38.8%, within rounding error of the result we obtained using the normal curve. For a one-tailed test,
use the column in the table with α twice its one-tailed value. For α of .05 one-tailed, use the .10 two-tailed
column. For δ = 1.67 power then = 48% + .7(52% − 48%) = 50.8%, the same answer we got
with the normal curve.
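If you do not have the table at hand, the normal approximation underlying it is easy to compute directly. A minimal sketch (function name is mine; the negligible opposite tail is ignored, as is conventional):

```python
from math import sqrt
from statistics import NormalDist

def power_from_delta(delta, alpha=0.05, tails=2):
    """Normal-approximation stand-in for Howell's power table."""
    z_crit = NormalDist().inv_cdf(1 - alpha / tails)
    return 1 - NormalDist().cdf(z_crit - delta)    # ignores the negligible opposite tail

delta = (1 / 3) * sqrt(25)                          # d * sqrt(N) = 1.67
print(round(power_from_delta(delta), 2))            # ≈ .38 two-tailed
print(round(power_from_delta(delta, tails=1), 2))   # ≈ .51 one-tailed
```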
If we were not able to reject the null hypothesis in our research on the putative IQ drug, and
our power analysis indicated about 39% power, we would be in an awkward position. Although we
could not reject the null, we also could not accept it, given that we only had a relatively small (39%)
chance of rejecting it even if it were false. We might decide to repeat the experiment using an n large
enough to allow us to accept the null if we cannot reject it. In my opinion, if 5% is a reasonable
risk for a Type I error (α), then 5% is also a reasonable risk for a Type II error (β), so let us use power
= 1 − β = 95%. From the power table, to have power = .95 with α = .05, δ is 3.60. n = (δ/d)². For a
five-point minimum IQ effect, n = (3.6/(1/3))² = 116.64. Thus, if we repeat the experiment with 117
subjects and still cannot reject the null, we can accept the null and conclude that the drug has no
nontrivial (≥ 5 IQ points) effect upon IQ. The null hypothesis we are accepting here is a “loose null
hypothesis” [95 < μ < 105] rather than a “sharp null hypothesis” [μ = exactly 100]. Sharp null
hypotheses are probably very rarely ever true.
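Solving n = (δ/d)² for the required sample size can be wrapped in a two-line helper (the helper name is mine; it rounds up, since subjects come in whole numbers):

```python
from math import ceil

def n_for_power(delta_required, d):
    """Required N for a one-sample test: n = (delta / d) ** 2, rounded up."""
    return ceil((delta_required / d) ** 2)

print(n_for_power(3.60, 1 / 3))   # 117 subjects, as in the text
```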
Others could argue with your choice of the minimum nontrivial effect size. Cohen has defined
a small effect as d = .20, a medium effect as d = .50, and a large effect as d = .80. If you defined
minimum d at .20, you would need even more subjects for 95% power.
A third approach one can take is to find the smallest effect that one could have detected with
high probability given n. If that d is small, and the null hypothesis is not rejected, then it is accepted.
For example, I used 225 subjects in the IQ enhancer study. For power = 95%, δ = 3.60 with α = .05
two-tailed, and d = δ/√N = 3.60/√225 = 3.60/15 = 0.24. If I can’t reject the null, I accept it, concluding that if the drug
has any effect, it is a small effect, since I had a 95% chance of detecting an effect as small as .24σ.
The loose null hypothesis accepted here would be that the population μ differs from 100 by less than
.24σ.
Two Independent Samples
For the Two Group Independent Sampling Design, δ = d√(n/2), where n = the number of scores
in one group, and both groups have the same n; d = (μ1 − μ2)/σ, where σ is the standard deviation in
either population, assuming σ is identical in both populations.
If n1 ≠ n2, use the harmonic mean sample size, ñ = 2/(1/n1 + 1/n2).
For a fixed total N, the harmonic mean (and thus power) is higher the more nearly equal n1 and
n2 are. This is one good reason to use equal n designs. Other good reasons are computational
simplicity with equal n’s and greater robustness to violation of assumptions. Try computing the
effective (harmonic) sample size for 100 subjects evenly split into two groups of 50 each and
compare that with the effective sample size obtained if you split them into 10 in one group and 90 in
the other.
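The exercise just suggested can be checked with a one-line function (a sketch; the name is mine):

```python
def harmonic_n(n1, n2):
    """Effective per-group sample size for unequal n: 2 / (1/n1 + 1/n2)."""
    return 2 / (1 / n1 + 1 / n2)

print(harmonic_n(50, 50))   # 50.0 -- an even split loses nothing
print(harmonic_n(10, 90))   # 18.0 -- a 10/90 split wastes most of the 100 subjects
```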
You should be able to rearrange the above formula for  to solve for d or for n as required.
Consider the following a priori (done before gathering the data) power analysis. We wish to
compare the mean psychology achievement test score of students in Research Psychology graduate
programs with that of those in Clinical Psychology graduate programs. We decide that we will be
satisfied if we have enough data to have an 80% chance of detecting an effect of 1/3 of a standard
deviation, employing a .05 criterion of significance. How many scores do we need in each group, if
we have the same number of scores in each group? From the table in Howell, we obtain the value of
2.8 for δ. n = 2(δ/d)² = 2(2.8/(1/3))² = 141 scores in each group, a total of 282 scores.
Consider the following a posteriori power analysis. We have available only 36 scores from
students in clinical programs and 48 scores from students in general programs. What are our
chances of detecting a difference of 40 points (which is that actually observed at ECU way back in
1981) if we use a .05 criterion of significance and the standard deviation is 98? The standardized
effect size, d, is 40/98 = .408. The harmonic mean sample size is ñ = 2/(1/36 + 1/48) = 41.14.
δ = d√(ñ/2) = .408√(41.14/2) = 1.85. From our power table, power is 46% (halfway between .44 and .48).
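Putting the pieces of this a posteriori analysis together (harmonic mean, then δ, with a normal approximation standing in for the power table; the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def power_two_indep(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power for two independent samples."""
    n_h = 2 / (1 / n1 + 1 / n2)                   # harmonic mean sample size
    delta = d * sqrt(n_h / 2)                     # noncentrality parameter
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - delta)   # negligible far tail ignored

print(round(power_two_indep(40 / 98, 36, 48), 2))  # ≈ .46, matching the table value
```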
Correlated Samples
The correlated samples t test is mathematically equivalent to a one-sample t test conducted
on the difference scores (for each subject, score under one condition less score under the other
condition). The greater ρ12, the correlation between the scores in the one condition and those in the
second condition, the smaller the standard deviation of the difference scores and the greater the
power, ceteris paribus. By the variance sum law, the standard deviation of the difference scores is
σDiff = √(σ1² + σ2² − 2ρ12σ1σ2). If we assume equal variances, this simplifies to σDiff = σ√(2(1 − ρ)).
The ρ12 (rho) is the correlation between the scores in the one population and those in the other
population. With independent samples, rho = 0, and everything to the right of the minus sign drops
out. With correlated samples, the rho should be positive and, hopefully, considerable in magnitude.
This results in our subtracting from the standard error a positive magnitude, which will reduce the
standard error. Since the standard error is in the denominator – t = difference in means divided by
standard error – the absolute value of t is increased, lowering the value of p, and increasing power.
The reduction in the standard error should increase power relative to an independent samples
test with the same number of scores, but the correlated t has only half the degrees of freedom as the
independent t, which causes some loss of power. The gain in power from reducing the standard error
will generally greatly exceed the loss of power due to loss of half of the degrees of freedom, but one
could actually have less power with the correlated t than with the independent t if sample size were
low and the ρ12 low. Please see my document Correlated t versus Independent t .
When conducting a power analysis for the correlated samples design, we can take into
account the effect of ρ12 by computing dDiff, an adjusted value of d: dDiff = (μ1 − μ2)/σDiff = d/√(2(1 − ρ12)),
where d is the effect size as computed above, with independent samples. We can then compute
power via δ = dDiff√n or the required sample size via n = (δ/dDiff)².
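A sketch of the adjustment and the resulting sample-size formula (helper names are mine):

```python
from math import sqrt

def d_diff(d, rho):
    """Adjust d for the correlation between conditions: d / sqrt(2 * (1 - rho))."""
    return d / sqrt(2 * (1 - rho))

def n_pairs_for_power(delta_required, d, rho):
    """Required number of pairs: n = (delta / d_diff) ** 2."""
    return (delta_required / d_diff(d, rho)) ** 2

print(round(d_diff(0.18, 0.75), 3))               # .255, as in the cortisol example below
print(round(n_pairs_for_power(3.6, 0.18, 0.75)))  # 200; 199.3 in the text uses the rounded .255
```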
Please note that using the standard error of the difference scores, rather than the standard
deviation of the criterion variable, as the denominator of dDiff, is simply Howell’s method of
incorporating into the analysis the effect of the correlation produced by matching. If we were
computing estimated d (Hedges’ g) as an estimate of the standardized effect size given the obtained
results, we would use the standard deviation of the criterion variable in the denominator, not the
standard deviation of the difference scores. I should admit that on rare occasions I have argued that,
in a particular research context, it made more sense to use the standard deviation of the difference
scores in the denominator of g.
Consider the following a posteriori power analysis. One of my colleagues has data on level of
cortisol in humans under two conditions: 1.) in a relaxed setting, and 2.) in a setting thought to
increase anxiety. The standard deviations in her samples are 108 and 114. If the effect size in the
population were 20 points (which she considers the smallest nontrivial effect), how much power would
she have? The estimated correlation resulting from the within-subjects design is .75.
The pooled standard deviation is √(.5(108²) + .5(114²)) = 111. d = 20/111 = .18.
dDiff = .18/√(2(1 − .75)) = .255. δ = .255√15 = 1.02. From the power table, power = 17%. My colleague
is not very happy about this low power. I point out that maybe she is lucky and has such a big
difference in means that it is significant despite the low power. Regretfully, that is not so. She starts
planning to repeat the study with a larger sample size.
How much data will my colleague need? She wants to have enough power to be able to
detect an effect as small as 20 points (.18 of a standard deviation) with 95% power. She
considers Type I and Type II errors equally serious and is employing a .05 criterion of statistical
significance, so she wants beta to be not more than .05. n = (δ/dDiff)² = (3.6/.255)² = 199.3.
Here is another example of power analysis with matched pairs.
Pearson r
For a Pearson Correlation Coefficient, d = the size of the correlation coefficient in the
population, and δ = d√(n − 1), where n = the number of pairs of scores in the sample.
Consider the following a priori power analysis. We wish to determine whether or not there is a
correlation between misanthropy and support for animal rights. We shall measure these attributes
with instruments that produce scores for which it is reasonable to treat the variables as continuous.
How many respondents would we need to have a 95% probability of obtaining significant results if we
employed a .05 criterion of significance and if the true value of the correlation (in the population) was
0.2? n = (δ/d)² + 1 = (3.6/.2)² + 1 = 325.
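As a check on the arithmetic, a minimal sketch (function name is mine):

```python
from math import ceil

def n_for_correlation(delta_required, rho):
    """Pairs needed to detect a population correlation rho: n = (delta / rho)**2 + 1."""
    return ceil((delta_required / rho) ** 2 + 1)

print(n_for_correlation(3.6, 0.2))   # 325 respondents, as in the text
```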
Type III Errors and Three-Choice Tests
Leventhal and Huynh (Psychological Methods, 1996, 1, 278-292) note that it is common
practice, following rejection of a nondirectional null, to conclude that the direction of difference in the
population is the same as what it is in the sample. This procedure is what they call a "directional two-tailed
test." They also refer to it as a "three-choice test" (I prefer that language), in that the three
hypotheses entertained are: parameter = null value, parameter < null value, and parameter > null
value. This makes possible a Type III error: correctly rejecting the null hypothesis, but incorrectly
inferring the direction of the effect - for example, when the population value of the tested parameter is
actually more than the null value, getting a sample value that is so much below the null value that you
reject the null and conclude that the population value is also below the null value. The authors show
how to conduct a power analysis that corrects for the possibility of making a Type III error. See my
summary at: http://core.ecu.edu/psyc/wuenschk/StatHelp/Type_III.htm
Links
POWER -- An overview
In Power Point format
G*Power3: Tests available
Internet Resources
Correlated t versus Independent t
Power Conventions: What effect size should I assume and how much power should I want?
Power, Retrospective -- Foolishness
Copyright 2015, Karl L. Wuensch - All rights reserved.