3.2 Power and sample size

Before we begin an experiment, we would like to know how many subjects to include,
and the probability that our experiment will yield a significant p-value.
Suppose we want to test if a drug is better than a placebo, or if a higher dose is better
than a lower dose.
Sample size: How many patients should we include in our clinical trial, to give ourselves
a good chance of detecting any effects of the drug?
Power: Assuming that the drug has an effect, what is the probability that our clinical trial
will give a significant result?
Power for a t-test
We plan a test to determine if a drug is more effective than a placebo.
Power is the probability that our experiment will detect a significant difference between
the treatment groups. To calculate power we assume that there is a real difference.
Note that this is the opposite of the usual assumption: when we calculate a p-value, we
start by assuming that there is no difference between the treatment groups.
For clinical trials and biology experiments, we typically aim for power of 80%, 90%, or
higher.
Simulation to illustrate power and sample size
We'll use the R statistics software to simulate how power changes with the difference of
responses between treatment groups, the variability of responses within treatment
groups, and the sample size. Suppose that we have the following situation.
We have a drug that lowers mean blood pressure by 10 units. We have two populations:
Create a population of 10,000 patients who receive a placebo. We'll generate these to
have approximately mean = 150, standard deviation = 20.
placebo= rnorm(10000, 150, 20)
mean(placebo)
sd(placebo)
> mean(placebo)
[1] 149.4223
> sd(placebo)
[1] 19.63531
Create a population of 10,000 patients who receive a drug to reduce blood pressure.
We'll generate these to have approximately mean = 140, standard deviation = 20.
drug = rnorm(10000, 140, 20)
mean(drug)
sd(drug)
> mean(drug)
[1] 140.0633
> sd(drug)
[1] 20.13818
Plot the two populations.
plot(density(placebo), xlim= c(50, 250),ylim=c(0,.025))
lines(density(drug), lty=2)
Take samples of size n = 30 from the placebo group and the drug group.
placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)
Plot the two samples.
plot(density(placebo.sample), xlim= c(50, 250),ylim=c(0, .025))
lines(density(drug.sample), lty=2)
Perform a t-test to compare the two samples.
t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value
> t.test(placebo.sample, drug.sample, var.equal=TRUE)
Two Sample t-test
data: placebo.sample and drug.sample
t = 1.1682, df = 58, p-value = 0.2475
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.679472 17.795567
sample estimates:
mean of x mean of y
149.0953 142.5372
> ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
> ttest.result$p.value
[1] 0.2475168
Repeat sampling and t-test a few times.
placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)
t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value
What is the probability that we will detect a significant difference (p < 0.05) if we take
many samples of size n=30? To answer this question, we'll do a simulation of 1000
samples and t-tests, and look at the distribution of p-values.
n = 30
pvalue.list = c()
for (i in 1:1000)
{
placebo.sample = sample(placebo, size=n)
drug.sample = sample(drug, size=n)
pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
Plot a histogram of the p-values.
hist(pvalue.list, xlim= c(0, 1), breaks=seq(0,1,.05), ylim=c(0,1000))
What percent of the 1000 simulated samples give a p-value less than 0.05?
Approximately 450 out of 1000 simulated samples give p < 0.05. So the probability that
our experiment will give us a significant result is 450/1000, or 45%. The simulation
indicates that we have 45% power.
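Instead of reading the count off the histogram, you can compute the proportion of significant p-values directly; this one-liner is a small addition to the original code, and the same line works after each of the simulations below.
# Fraction of the 1000 simulated t-tests with p < 0.05 (the estimated power)
mean(pvalue.list < 0.05)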
Increase sample size
If we increase the sample size, we increase power. What is the probability that we will detect
a significant difference (p < 0.05) if we increase the sample size to n=50?
Do a simulation of 1000 samples and t-tests, and look at the distribution of p-values.
n = 50
pvalue.list = c()
for (i in 1:1000)
{
placebo.sample = sample(placebo, size=n)
drug.sample = sample(drug, size=n)
pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
Plot a histogram of the p-values.
hist(pvalue.list, xlim= c(0, 1), breaks=seq(0,1,.05), ylim=c(0,1000))
What percent of the 1000 simulated samples give a p-value less than 0.05?
Approximately 630 out of 1000 simulated samples give p < 0.05. So the probability that
our experiment will give us a significant result is 630/1000, or 63%. The simulation
indicates that we have 63% power.
Decrease the unexplained variance
If we decrease the unexplained variance, we increase power. Suppose that we know
that other factors affect our y variable. For example, age, sex, medical history, and
baseline disease severity may all affect the y variable, regardless of treatment. As we
will see, including these variables in the model will reduce the unexplained variance.
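As a quick illustration of why this works (a sketch with made-up numbers and a hypothetical age covariate; it is separate from the simulations below): if blood pressure depends on age as well as on treatment, a model that includes age leaves much less unexplained variation than a model with treatment alone.
# Hypothetical data: blood pressure depends on treatment and on age
set.seed(1)
n = 200
treatment = rep(c("placebo", "drug"), each = n/2)
age = runif(n, 40, 80)
bp = 100 + 0.8*age - 10*(treatment == "drug") + rnorm(n, 0, 10)
# Residual standard deviation (unexplained variation) without and with age
summary(lm(bp ~ treatment))$sigma        # roughly 13-14: age is unexplained
summary(lm(bp ~ treatment + age))$sigma  # close to 10: age is now explained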
Create a population of 10,000 patients who receive a placebo. We'll generate these to
have approximately mean = 150, standard deviation = 10.
placebo= rnorm(10000, 150, 10)
Create a population of 10,000 patients who receive a drug to reduce blood pressure.
We'll generate these to have approximately mean = 140, standard deviation = 10.
drug = rnorm(10000, 140, 10)
Plot the two populations.
plot(density(placebo), xlim= c(50, 250), ylim=c(0, .05))
lines(density(drug), lty=2)
Take samples of size n = 30 from each group.
placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)
Plot the two samples.
plot(density(placebo.sample), xlim= c(50, 250), ylim=c(0, .05))
lines(density(drug.sample), lty=2)
Perform a t-test to compare the two samples.
t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value
> t.test(placebo.sample, drug.sample, var.equal=TRUE)
Two Sample t-test
data: placebo.sample and drug.sample
t = 4.305, df = 58, p-value = 6.512e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
5.412883 14.821128
sample estimates:
mean of x mean of y
152.0005 141.8835
> ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
> ttest.result$p.value
[1] 6.512177e-05
What is the probability that we will detect a significant difference (p < 0.05) if we take
many samples of size n=30? Do a simulation of 1000 samples and t-tests, and look at the
distribution of p-values.
n = 30
pvalue.list = c()
for (i in 1:1000)
{
placebo.sample = sample(placebo, size=n)
drug.sample = sample(drug, size=n)
pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
Plot a histogram of the p-values.
hist(pvalue.list, xlim= c(0, 1), breaks=seq(0,1,.05), ylim=c(0,1000))
What percent of the 1000 simulated samples give a p-value less than 0.05?
Approximately 980 out of 1000 simulated samples give p < 0.05. So the probability that
our experiment will give us a significant result is 980/1000, or 98%. The simulation
indicates that we have 98% power.
So which would you prefer to do: increase the sample size (spending more time and
money), or decrease the unexplained variance by controlling for other variables that
affect the response?
Power increases as the effect size increases
Effect size is the difference between the means of the two groups.
If we have a more effective drug, the difference between the means of the two groups
will increase, so the effect size increases, and power increases. Suppose the drug
decreases blood pressure by 20 units (rather than 10 units as in the previous examples).
Create a population of 10,000 patients who receive a placebo. We'll generate these to
have approximately mean = 150, standard deviation = 20.
placebo= rnorm(10000, 150, 20)
Create a population of 10,000 patients who receive a drug to reduce blood pressure.
We'll generate these to have approximately mean = 130, standard deviation = 20.
drug = rnorm(10000, 130, 20)
Plot the two populations.
plot(density(placebo), xlim= c(50, 250), ylim=c(0, .025))
lines(density(drug), lty=2)
Take samples of size n = 30 from each group.
placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)
Plot the two samples.
plot(density(placebo.sample), xlim= c(50, 250), ylim=c(0, .025))
lines(density(drug.sample), lty=2)
Perform a t-test to compare the two samples.
t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value
> ttest.result$p.value
[1] 0.00677975
What is the probability that we will detect a significant difference (p < 0.05) if we take
many samples of size n=30? Do a simulation of 1000 samples and t-tests, and look at
the distribution of p-values.
n = 30
pvalue.list = c()
for (i in 1:1000)
{
placebo.sample = sample(placebo, size=n)
drug.sample = sample(drug, size=n)
pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
Plot a histogram of the p-values.
hist(pvalue.list, xlim= c(0, 1), breaks=seq(0,1,.05), ylim=c(0,1000))
What percent of the 1000 simulated samples give a p-value less than 0.05?
Approximately 970 out of 1000 simulated samples give p < 0.05. So the probability that
our experiment will give us a significant result is 970/1000, or 97%. The simulation
indicates that we have 97% power.
Software for power and sample size
There are several commercial software packages that do power and sample size
calculations, including NCSS PASS, SAS, nQuery, Minitab, JMP, Statistica, and others.
The R statistics software is free and has functions for power and sample size
calculations. The base R distribution has a limited set of functions for power and sample
size, including power.t.test, power.anova.test and power.prop.test. You will need other
packages for other analyses. Two tools I use are Russ Lenth's piface Java applet and the pwr package.
Piface is a free Java applet for power and sample size developed by Russ Lenth at the
University of Iowa. The website is http://www.stat.uiowa.edu/~rlenth/Power. You can
download the software (the file "piface.jar") to your computer or run it from the website.
As an alternative, you can install the piface R package from R-Forge, which launches the
same applet from R.
install.packages("piface", repos="http://R-Forge.R-project.org")
library(piface)
piface()
pwr is another R package for power and sample size calculation. It is a little more
difficult to use than piface, but it is quite useful. Install and load it in the same way as
other R packages:
install.packages("pwr")
library(pwr)
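The pwr functions work with standardized effect sizes (Cohen's d, the difference in means divided by the standard deviation) rather than raw means and standard deviations. A minimal sketch with illustrative values:
# Sample size per group for a two-sample t-test with a standardized effect
# of d = 0.5, alpha = 0.05, and 80% power (illustrative values)
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")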
How to calculate power and sample size
To calculate power for a t-test we need to specify the following:
- Effect size: what is the difference between the means of the two treatment groups?
- Standard deviation: the average standard deviation of the two treatment groups.
- Sample size: how many subjects will be in each group?
To calculate sample size for a t-test we need to specify the following:
- Effect size: what is the difference between the means of the two treatment groups?
- Standard deviation: the average standard deviation of the two treatment groups.
- Power: what power do we want the test to have, e.g., 80% power?
Here are examples using the base R functions for power calculations.
help(power.t.test)
power.t.test(n = NULL, delta = NULL, sd = 1, sig.level = 0.05, power = NULL,
             type = c("two.sample", "one.sample", "paired"),
             alternative = c("two.sided", "one.sided"), strict = FALSE)
n: Number of observations (per group)
delta: True difference in means
sd: Standard deviation
See also help(power.prop.test)
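power.prop.test works the same way for comparing two proportions, for example two response rates. A brief sketch with made-up rates: the sample size per group needed to detect an improvement from a 20% response rate to a 35% response rate with 80% power.
# Sample size per group for 20% vs 35% response rates (illustrative values)
power.prop.test(p1 = 0.20, p2 = 0.35, sig.level = 0.05, power = 0.8)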
Estimate sample size for a two-sample t-test
# difference in means, delta = 0.5
# standard deviation, sd = 0.5
# alpha, sig.level = 0.01
# desired power, power = 0.9
power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, power = 0.9)
Two-sample t test power calculation
n = 31.46245
delta = 0.5
sd = 0.5
sig.level = 0.01
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
Estimate power for a two-sample t-test
# difference in means, delta = 0.5
# standard deviation, sd = 0.5
# alpha, sig.level = 0.01
# sample size, n = 31
power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, n = 31)
You can give several values for an argument, such as several different sample sizes.
power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01,
n=c(20, 30, 40, 50, 100))
Estimate sample size using data from a pilot study
Suppose you plan an experiment and want to determine the sample size. You don't have
information on the mean effect size or standard deviation. You perform a pilot study to
collect information on the mean and standard deviation of two groups, and get the following
results.
group1=c(1.1, 2.3, 4.8, 5.4, 7.9)
group2=c(3.3, 4.2, 4.7, 7.6, 9.2)
mean(group2)-mean(group1)
sd(group1)
sd(group2)
> mean(group2)-mean(group1)
[1] 1.5
> sd(group1)
[1] 2.676752
> sd(group2)
[1] 2.490984
The estimated effect size is the difference between the means, which is 1.5.
The standard deviations of the two groups are similar. It is usually better to use the
larger estimate of the standard deviation when estimating sample size, so we'll use
sd = 2.7. We'll use power = 0.8.
power.t.test(delta = 1.5, sd = 2.7, sig.level = 0.05, power=0.8)
Two-sample t test power calculation
              n = 51.83884
          delta = 1.5
             sd = 2.7
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
NOTE: n is number in *each* group
The power calculation indicates we need 52 subjects per group, for a total of 104
subjects. The main reason for the large sample size is the large standard deviation
(sd = 2.7) compared to the mean effect (delta = 1.5). If we want to reduce the sample
size, we need to find a way to reduce the within-group standard deviation, which is the
unexplained variance. We'll see methods to reduce this unexplained variance shortly.
Assay variability affects power
Suppose you plan an experiment using a 96 well plate. You run identical specimens in all
96 wells. You find that the standard deviation of the identical specimens in the 96 wells
is 10 units. The effect size you hope to detect is 5 units. What sample size will you need?
power.t.test(delta=5, sd=10, sig.level = 0.05, power=0.8)
Two-sample t test power calculation
              n = 63.76576
          delta = 5
             sd = 10
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
NOTE: n is number in *each* group
You think that by fixing the assay, you may be able to reduce the standard deviation of
identical specimens in the 96 wells to 5 units. How does this affect sample size?
power.t.test(delta=5, sd=5, sig.level = 0.05, power=0.8)
Two-sample t test power calculation
              n = 16.71477
          delta = 5
             sd = 5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
NOTE: n is number in *each* group
Calculate power for several different effect sizes (delta) and create a graph of power
versus delta.
res.power= power.t.test(delta = c(0.25, 0.5, 0.75, 1), sd = 0.5, sig.level = 0.01, n=30)
res.power
res.power$delta
[1] 0.25 0.50 0.75 1.00
res.power$power
[1] 0.2437156 0.8820913 0.9989076 0.9999996
plot(res.power$delta, res.power$power)
Create a plot of power versus sample size.
res.2 = power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, n=c(20, 30, 40, 50, 75, 100))
plot(res.2$n, res.2$power)
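To see how power, sample size, and effect size interact, you can also draw several power curves on one plot. This sketch is not part of the original code; it reuses sd = 0.5 and sig.level = 0.01 from the examples above.
# Power curves for several effect sizes across a range of sample sizes
n.seq = seq(10, 100, by = 5)
delta.seq = c(0.25, 0.5, 0.75, 1)
power.mat = sapply(delta.seq, function(d)
  power.t.test(delta = d, sd = 0.5, sig.level = 0.01, n = n.seq)$power)
matplot(n.seq, power.mat, type = "l", lty = 1:4, col = 1:4,
  xlab = "n per group", ylab = "power")
legend("bottomright", legend = paste("delta =", delta.seq), lty = 1:4, col = 1:4)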
Power for ANOVA
The R function power.anova.test calculates power or sample size for one-way ANOVA.
See the documentation from "?power.anova.test".
power.anova.test(groups = NULL, n = NULL,
between.var = NULL, within.var = NULL,
sig.level = 0.05, power = NULL)
Arguments
groups       Number of groups
n            Number of observations (per group)
between.var  Between group variance
within.var   Within group variance
sig.level    Significance level (Type I error)
power        Power of test (1 minus Type II error)
Details
Exactly one of the parameters groups, n, between.var, power,
within.var, and sig.level must be passed as NULL, and that parameter is
determined from the others. Notice that sig.level has non-NULL default
so NULL must be explicitly passed if you want it computed.
power.anova.test requires that we specify the within group variance (within.var) and
the between group variance (between.var). Notice that these are variances, rather than
standard deviations. So if our previous experience suggests that the within group
standard deviation is 20, then we specify the within group variance as within.var = 20^2 =
400. The easiest way to specify the between group variance is to specify the expected
mean values of the treatment groups, and then calculate the variance of those means,
as follows.
Estimated group means are 120, 130, 140, 150.
groupmeans <- c(120, 130, 140, 150)
between.var=var(groupmeans)
Examples
Example 1. We wish to calculate the sample size for the following experiment.
4 groups
Estimated group means are 120, 130, 140, 150.
power = 0.9
Within group standard deviation is 20. So within group variance is 20^2 = 400, giving
within.var = 400.
Here's the R code:
groupmeans <- c(120, 130, 140, 150)
power.anova.test(groups = length(groupmeans), between.var=var(groupmeans),
within.var=400, power=.90)
> power.anova.test(groups = length(groupmeans),
between.var=var(groupmeans), within.var=400, power=.90)
Balanced one-way analysis of variance power calculation
         groups = 4
              n = 12.36350
    between.var = 166.6667
     within.var = 400
      sig.level = 0.05
          power = 0.9
NOTE: n is number in each group
We get n=12.36 per group, so we round up to 13 per group.
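As a rough cross-check on this answer (a simulation sketch in the same spirit as the t-test simulations earlier, not part of the original notes), we can generate data with these group means, a within-group standard deviation of 20, and n = 13 per group, and see how often a one-way ANOVA gives p < 0.05.
# Simulated power for a one-way ANOVA with 4 groups and n = 13 per group
groupmeans = c(120, 130, 140, 150)
n = 13
pvalue.list = c()
for (i in 1:1000)
{
group = factor(rep(1:4, each = n))
y = rnorm(4*n, mean = rep(groupmeans, each = n), sd = 20)
pvalue.list[i] = anova(lm(y ~ group))$"Pr(>F)"[1]
}
mean(pvalue.list < 0.05)   # should come out near 0.9, a little above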
Example 2. We wish to calculate power for the following one-way ANOVA.
4 groups
n = 5 subjects per group
within group variance = 4, (assumes sd = sqrt(4) = 2)
between group variance = 1
Here's the R code.
power.anova.test(groups=4, n=5, between.var=1, within.var=4)
> power.anova.test(groups=4, n=5, between.var=1, within.var=4)
Balanced one-way analysis of variance power calculation
         groups = 4
              n = 5
    between.var = 1
     within.var = 4
      sig.level = 0.05
          power = 0.2713118
NOTE: n is number in each group
Our power is only 27%. We should reconsider doing this experiment. We should try to
find ways to reduce the within group variance (by controlling for other variables,
improving the assay, or using exclusion criteria), or consider increasing the sample size. What
sample size do we need for power = 0.8?
power.anova.test(groups=4, power=0.8, between.var=1, within.var=4)
> power.anova.test(groups=4, power=0.8, between.var=1, within.var=4)
Balanced one-way analysis of variance power calculation
         groups = 4
              n = 15.54913
    between.var = 1
     within.var = 4
      sig.level = 0.05
          power = 0.8
NOTE: n is number in each group
We could use 15.5 (round up to 16) subjects per group to get power = 0.8.
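To pull the required sample size out of the result and round it up in code (a small addition, not in the original notes):
res.aov = power.anova.test(groups=4, power=0.8, between.var=1, within.var=4)
ceiling(res.aov$n)   # 16 subjects per group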
power.anova.test(groups=4, power=0.8, between.var=1, within.var=4, sig.level=0.01)
> power.anova.test(groups=4, power=0.8, between.var=1, within.var=4,
sig.level=0.01)
Balanced one-way analysis of variance power calculation
         groups = 4
              n = 22.05539
    between.var = 1
     within.var = 4
      sig.level = 0.01
          power = 0.8
NOTE: n is number in each group
To have 80% power to reach an alpha significance level of 0.01, we need 22 subjects per
group.
NCSS PASS software example for t-test sample size
Here is an example of the output from the NCSS PASS software for a t-test.
Two-Sample T-Test Power Analysis
Numeric Results for Two-Sample T-Test
Null Hypothesis: Mean1=Mean2. Alternative Hypothesis: Mean1<>Mean2
The standard deviations were assumed to be unknown and equal.
Power     N1    N2    Ratio   Alpha     Beta      Mean1   Mean2   S1     S2
0.90596   32    32    1.000   0.01000   0.09404   130.0   150.0   20.0   20.0
0.81840   26    26    1.000   0.01000   0.18160   130.0   150.0   20.0   20.0
0.91250   23    23    1.000   0.05000   0.08750   130.0   150.0   20.0   20.0
0.80704   17    17    1.000   0.05000   0.19296   130.0   150.0   20.0   20.0
0.90163   120   120   1.000   0.01000   0.09837   140.0   150.0   20.0   20.0
0.80455   96    96    1.000   0.01000   0.19545   140.0   150.0   20.0   20.0
0.90323   86    86    1.000   0.05000   0.09677   140.0   150.0   20.0   20.0
0.80146   64    64    1.000   0.05000   0.19854   140.0   150.0   20.0   20.0
Report Definitions
Power is the probability of rejecting a false null hypothesis. Power should be close to one.
N1 and N2 are the number of items sampled from each population. To conserve resources, they should be
small.
Alpha is the probability of rejecting a true null hypothesis. It should be small.
Beta is the probability of accepting a false null hypothesis. It should be small.
Mean1 is the mean of populations 1 and 2 under the null hypothesis of equality.
Mean2 is the mean of population 2 under the alternative hypothesis. The mean of population 1 is
unchanged.
S1 and S2 are the population standard deviations. They represent the variability in the populations.
Summary Statements
Group sample sizes of 32 and 32 achieve 91% power to detect a difference of -20.0 between the
null hypothesis that both group means are 130.0 and the alternative hypothesis that the mean of
group 2 is 150.0 with estimated group standard deviations of 20.0 and 20.0 and with a
significance level (alpha) of 0.01000 using a two-sided two-sample t-test.
[Two PASS charts of required sample size: N1 vs M1 by Alpha, with M2 = 150.0, S1 = 20.0, S2 = 20.0, N2 = N1, two-sided t-test; one chart for Power = 0.90 and one for Power = 0.80, each with curves for Alpha = 0.01 and Alpha = 0.05.]