Power and sample size

Before we begin an experiment, we would like to know how many subjects to include, and the probability that our experiment will yield a significant p-value. Suppose we want to test whether a drug is better than a placebo, or whether a higher dose is better than a lower dose.

Sample size: How many patients should we include in our clinical trial to give ourselves a good chance of detecting any effect of the drug?

Power: Assuming that the drug has an effect, what is the probability that our clinical trial will give a significant result?

Power for a t-test

We plan a test to determine if a drug is more effective than a placebo. Power is the probability that our experiment will detect a significant difference between the treatment groups. To calculate power we assume that there is a real difference. Note that power makes the opposite assumption from the usual null hypothesis, which assumes that there is no difference between treatment groups. For clinical trials and biology experiments, we typically aim for power of 80%, 90%, or higher.

Simulation to illustrate power and sample size

We'll use the R statistics software to simulate how power changes with the difference in responses between treatment groups, the variability of responses within treatment groups, and the sample size. Suppose we have a drug that lowers mean blood pressure by 10 units, and two populations.

Create a population of 10,000 patients who receive a placebo. We'll generate these to have approximately mean = 150, standard deviation = 20.

placebo = rnorm(10000, 150, 20)
mean(placebo)
sd(placebo)

> mean(placebo)
[1] 149.4223
> sd(placebo)
[1] 19.63531

Create a population of 10,000 patients who receive a drug to reduce blood pressure. We'll generate these to have approximately mean = 140, standard deviation = 20.

drug = rnorm(10000, 140, 20)
mean(drug)
sd(drug)

> mean(drug)
[1] 140.0633
> sd(drug)
[1] 20.13818

Plot the two populations.

plot(density(placebo), xlim=c(50, 250), ylim=c(0, .025))
lines(density(drug), lty=2)

Take samples of size n = 30 from the placebo group and the drug group.

placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)

Plot the two samples.

plot(density(placebo.sample), xlim=c(50, 250), ylim=c(0, .025))
lines(density(drug.sample), lty=2)

Perform a t-test to compare the two samples.

t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value

> t.test(placebo.sample, drug.sample, var.equal=TRUE)

        Two Sample t-test

data:  placebo.sample and drug.sample
t = 1.1682, df = 58, p-value = 0.2475
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.679472 17.795567
sample estimates:
mean of x mean of y
 149.0953  142.5372

> ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
> ttest.result$p.value
[1] 0.2475168

Repeat the sampling and t-test a few times.

placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)
t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value

What is the probability that we will detect a significant difference (p < 0.05) if we take many samples of size n = 30? To answer this question, we'll do a simulation of 1000 samples and t-tests, and look at the distribution of p-values.
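As a point of comparison, the analytic power for this design can also be computed directly with the base R function power.t.test, which is introduced more formally later in these notes. A minimal sketch, assuming the nominal population values above (a difference of 10 units, standard deviation 20, and n = 30 per group):

# analytic power for a two-sample t-test with delta = 10, sd = 20, n = 30 per group
power.t.test(n = 30, delta = 10, sd = 20, sig.level = 0.05)

This reports power of roughly 0.48, so we expect the simulation that follows to give a significant result in a little under half of the repeated samples.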
n = 30
pvalue.list = c()
for (i in 1:1000) {
  placebo.sample = sample(placebo, size=n)
  drug.sample = sample(drug, size=n)
  pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
pvalue.list

Plot the p-values.

hist(pvalue.list, xlim=c(0, 1), breaks=seq(0, 1, .05), ylim=c(0, 1000))

What percent of the 1000 simulated samples give a p-value less than 0.05? Approximately 450 out of 1000 simulated samples give p < 0.05. So the probability that our experiment will give us a significant result is 450/1000, or 45%. The simulation indicates that we have 45% power.

Increase sample size

If we increase the sample size, we increase power. What is the probability that we will detect a significant difference (p < 0.05) if we increase the sample size to n = 50? Do a simulation of 1000 samples and t-tests, and look at the distribution of p-values.

n = 50
pvalue.list = c()
for (i in 1:1000) {
  placebo.sample = sample(placebo, size=n)
  drug.sample = sample(drug, size=n)
  pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
pvalue.list

Plot the p-values.

hist(pvalue.list, xlim=c(0, 1), breaks=seq(0, 1, .05), ylim=c(0, 1000))

What percent of the 1000 simulated samples give a p-value less than 0.05? Approximately 630 out of 1000 simulated samples give p < 0.05. So the probability that our experiment will give us a significant result is 630/1000, or 63%. The simulation indicates that we have 63% power.

Decrease the unexplained variance

If we decrease the unexplained variance, we increase power. Suppose that we know that other factors affect our y variable. For example, age, sex, medical history, and baseline disease severity may all affect the y variable, regardless of treatment. As we will see, including these variables in the model will reduce the unexplained variance.

Create a population of 10,000 patients who receive a placebo. We'll generate these to have approximately mean = 150, standard deviation = 10.

placebo = rnorm(10000, 150, 10)

Create a population of 10,000 patients who receive a drug to reduce blood pressure. We'll generate these to have approximately mean = 140, standard deviation = 10.

drug = rnorm(10000, 140, 10)

Plot the two populations.

plot(density(placebo), xlim=c(50, 250), ylim=c(0, .05))
lines(density(drug), lty=2)

Take a sample of size n = 30 from each group.

placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)

Plot the two samples.

plot(density(placebo.sample), xlim=c(50, 250), ylim=c(0, .05))
lines(density(drug.sample), lty=2)

Perform a t-test to compare the two samples.

t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value

> t.test(placebo.sample, drug.sample, var.equal=TRUE)

        Two Sample t-test

data:  placebo.sample and drug.sample
t = 4.305, df = 58, p-value = 6.512e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  5.412883 14.821128
sample estimates:
mean of x mean of y
 152.0005  141.8835

> ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
> ttest.result$p.value
[1] 6.512177e-05

What is the probability that we will detect a significant difference (p < 0.05) if we take many samples of size n = 30? Do a simulation of 1000 samples and t-tests, and look at the distribution of p-values.
n = 30
pvalue.list = c()
for (i in 1:1000) {
  placebo.sample = sample(placebo, size=n)
  drug.sample = sample(drug, size=n)
  pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
pvalue.list

Plot the p-values.

hist(pvalue.list, xlim=c(0, 1), breaks=seq(0, 1, .05), ylim=c(0, 1000))

What percent of the 1000 simulated samples give a p-value less than 0.05? Approximately 980 out of 1000 simulated samples give p < 0.05. So the probability that our experiment will give us a significant result is 980/1000, or 98%. The simulation indicates that we have 98% power.

So which would you prefer to do: increase the sample size (spending more time and money), or decrease the unexplained variance by controlling for other variables that affect the response?

Power increases as the effect size increases

Effect size is the difference between the means of the two groups. If we have a more effective drug, the difference between the means of the two groups will increase, so the effect size increases, and power increases. Suppose the drug decreases blood pressure by 20 units (rather than 10 units as in the previous examples).

Create a population of 10,000 patients who receive a placebo. We'll generate these to have approximately mean = 150, standard deviation = 20.

placebo = rnorm(10000, 150, 20)

Create a population of 10,000 patients who receive a drug to reduce blood pressure. We'll generate these to have approximately mean = 130, standard deviation = 20.

drug = rnorm(10000, 130, 20)

Plot the two populations.

plot(density(placebo), xlim=c(50, 250), ylim=c(0, .025))
lines(density(drug), lty=2)

Take a sample of size n = 30 from each group.

placebo.sample = sample(placebo, size=30)
drug.sample = sample(drug, size=30)

Plot the two samples.

plot(density(placebo.sample), xlim=c(50, 250), ylim=c(0, .025))
lines(density(drug.sample), lty=2)

Perform a t-test to compare the two samples.

t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result = t.test(placebo.sample, drug.sample, var.equal=TRUE)
ttest.result$p.value

> ttest.result$p.value
[1] 0.00677975

What is the probability that we will detect a significant difference (p < 0.05) if we take many samples of size n = 30? Do a simulation of 1000 samples and t-tests, and look at the distribution of p-values.

n = 30
pvalue.list = c()
for (i in 1:1000) {
  placebo.sample = sample(placebo, size=n)
  drug.sample = sample(drug, size=n)
  pvalue.list[i] = t.test(placebo.sample, drug.sample, var.equal=TRUE)$p.value
}
pvalue.list

Plot the p-values.

hist(pvalue.list, xlim=c(0, 1), breaks=seq(0, 1, .05), ylim=c(0, 1000))

What percent of the 1000 simulated samples give a p-value less than 0.05? Approximately 970 out of 1000 simulated samples give p < 0.05. So the probability that our experiment will give us a significant result is 970/1000, or 97%. The simulation indicates that we have 97% power.

Software for power and sample size

There are several commercial software packages that do power and sample size calculations, including NCSS PASS, SAS, nQuery, Minitab, JMP, Statistica, and others. The R statistics software is free and has functions for power and sample size calculations. The base R distribution has a limited set of functions for power and sample size, including power.t.test, power.anova.test, and power.prop.test. You will need other packages for other analyses. Two packages I use are piface and pwr.

piface is based on piface.jar, a free Java applet for power and sample size developed by Russ Lenth at the University of Iowa. The website is http://www.stat.uiowa.edu/~rlenth/Power.
You can download the software to your computer or run it from the website. The software is in the file "piface.jar". As an alternative, you can install the piface R package, which runs the Java applet from R:

install.packages("piface", repos="http://R-Forge.R-project.org")
library(piface)
piface()

pwr is another R package for power and sample size calculation. It is a little more difficult to use than piface, but it is quite useful. Install and load it in the same way as other R packages:

install.packages("pwr")
library(pwr)

How to calculate power and sample size

To calculate power for a t-test we need to specify the following:
Effect size: what is the difference between the means of the two treatment groups?
Standard deviation: the average standard deviation of the two treatment groups.
Sample size: how many subjects will be in each group?

To calculate sample size for a t-test we need to specify the following:
Effect size: what is the difference between the means of the two treatment groups?
Standard deviation: the average standard deviation of the two treatment groups.
Power: what power do we want the test to have, e.g., 80% power?

Here are examples using the base R functions for power calculations.

help(power.t.test)

power.t.test(n = NULL, delta = NULL, sd = 1, sig.level = 0.05, power = NULL,
             type = c("two.sample", "one.sample", "paired"),
             alternative = c("two.sided", "one.sided"), strict = FALSE)

n: number of observations (per group)
delta: true difference in means
sd: standard deviation

See also help(power.prop.test).

Estimate sample size for a two-sample t-test

# difference in means, delta = 0.5
# standard deviation, sd = 0.5
# alpha, sig.level = 0.01
# desired power, power = 0.9
power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, power = 0.9)

     Two-sample t test power calculation

              n = 31.46245
          delta = 0.5
             sd = 0.5
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Estimate power for a two-sample t-test

# difference in means, delta = 0.5
# standard deviation, sd = 0.5
# alpha, sig.level = 0.01
# sample size, n = 31
power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, n = 31)

You can give several values for an argument, such as several different sample sizes.

power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, n = c(20, 30, 40, 50, 100))

Estimate power using data from a pilot study

Suppose you plan an experiment and want to estimate power, but you don't have information on the mean effect size or the standard deviation. You perform a pilot study to collect information on the mean and standard deviation of two groups, and get the following results.

group1 = c(1.1, 2.3, 4.8, 5.4, 7.9)
group2 = c(3.3, 4.2, 4.7, 7.6, 9.2)
mean(group2) - mean(group1)
sd(group1)
sd(group2)

> mean(group2) - mean(group1)
[1] 1.5
> sd(group1)
[1] 2.676752
> sd(group2)
[1] 2.490984

The estimated effect size is the difference between the means, which is 1.5. The standard deviations of the two groups are similar. It is usually better to use the larger estimate of the standard deviation when estimating sample size, so we'll use sd = 2.7. We'll use power = 0.8.

power.t.test(delta = 1.5, sd = 2.7, sig.level = 0.05, power = 0.8)

     Two-sample t test power calculation

              n = 51.83884
          delta = 1.5
             sd = 2.7
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

The power calculation indicates we need 52 subjects per group, for a total of 104 subjects. The main reason for the large sample size is the large standard deviation (sd = 2.7) compared to the mean effect (delta = 1.5).
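As a cross-check, the pwr package introduced above can do the same calculation. pwr.t.test works with Cohen's d (the difference in means divided by the standard deviation) rather than with delta and sd separately. A minimal sketch, assuming pwr has been installed:

library(pwr)
# Cohen's d from the pilot study estimates: delta / sd = 1.5 / 2.7 (about 0.56)
pwr.t.test(d = 1.5/2.7, sig.level = 0.05, power = 0.8, type = "two.sample")

This returns n of approximately 52 per group, in agreement with the power.t.test result above.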
If we want to reduce the sample size, we need to find a way to reduce the within-group standard deviation, which is the unexplained variance. We'll see methods to reduce this unexplained variance shortly.

Assay variability affects power

Suppose you plan an experiment using a 96-well plate. You run identical specimens in all 96 wells, and you find that the standard deviation of the identical specimens in the 96 wells is 10 units. The effect size you hope to detect is 5 units. What sample size will you need?

power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8)

     Two-sample t test power calculation

              n = 63.76576
          delta = 5
             sd = 10
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

You think that by fixing the assay, you may be able to reduce the standard deviation of identical specimens in the 96 wells to 5 units. How does this affect the sample size?

power.t.test(delta = 5, sd = 5, sig.level = 0.05, power = 0.8)

     Two-sample t test power calculation

              n = 16.71477
          delta = 5
             sd = 5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Calculate power for several different effect sizes (delta) and create a graph of power versus delta.

res.power = power.t.test(delta = c(0.25, 0.5, 0.75, 1), sd = 0.5, sig.level = 0.01, n = 30)
res.power

res.power$delta
[1] 0.25 0.50 0.75 1.00
res.power$power
[1] 0.2437156 0.8820913 0.9989076 0.9999996

plot(res.power$delta, res.power$power)

Create a plot of power versus sample size.

res.2 = power.t.test(delta = 0.5, sd = 0.5, sig.level = 0.01, n = c(20, 30, 40, 50, 75, 100))
plot(res.2$n, res.2$power)

Power for ANOVA

The R function power.anova.test calculates power or sample size for one-way ANOVA. See the documentation from "?power.anova.test".

power.anova.test(groups = NULL, n = NULL, between.var = NULL, within.var = NULL,
                 sig.level = 0.05, power = NULL)

Arguments
groups: number of groups
n: number of observations (per group)
between.var: between-group variance
within.var: within-group variance
sig.level: significance level (Type I error)
power: power of test (1 minus Type II error)

Details
Exactly one of the parameters groups, n, between.var, power, within.var, and sig.level must be passed as NULL, and that parameter is determined from the others. Notice that sig.level has a non-NULL default, so NULL must be explicitly passed if you want it computed.

power.anova.test requires that we specify the within-group variance (within.var) and the between-group variance (between.var). Notice that these are variances, rather than standard deviations. So if our previous experience suggests that the within-group standard deviation is 20, then we specify the within-group variance as within.var = 20^2 = 400.

The easiest way to specify the between-group variance is to specify the expected mean values of the treatment groups, and then calculate the variance of those means, as follows. Suppose the estimated group means are 120, 130, 140, and 150.

groupmeans <- c(120, 130, 140, 150)
between.var = var(groupmeans)

Examples

Example 1. We wish to calculate the sample size for the following experiment.
4 groups
Estimated group means are 120, 130, 140, 150.
power = 0.9
Within-group standard deviation is 20, so the within-group variance is 20^2 = 400, giving within.var = 400.
Here's the R code:

groupmeans <- c(120, 130, 140, 150)
power.anova.test(groups = length(groupmeans), between.var=var(groupmeans), within.var=400, power=.90)

> power.anova.test(groups = length(groupmeans), between.var=var(groupmeans), within.var=400, power=.90)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 12.36350
    between.var = 166.6667
     within.var = 400
      sig.level = 0.05
          power = 0.9

NOTE: n is number in each group

We get n = 12.36 per group, so we round up to 13 per group.

Example 2. We wish to calculate power for the following one-way ANOVA.
4 groups
n = 5 subjects per group
within-group variance = 4 (assumes sd = sqrt(4) = 2)
between-group variance = 1

Here's the R code.

power.anova.test(groups=4, n=5, between.var=1, within.var=4)

> power.anova.test(groups=4, n=5, between.var=1, within.var=4)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 5
    between.var = 1
     within.var = 4
      sig.level = 0.05
          power = 0.2713118

NOTE: n is number in each group

Our power is only 27%. We should reconsider doing this experiment. We should try to find ways to reduce the within-group variance (by controlling for other variables, improving the assay, or using exclusion criteria), or consider increasing the sample size.

What sample size do we need for power = 0.8?

power.anova.test(groups=4, power=0.8, between.var=1, within.var=4)

> power.anova.test(groups=4, power=0.8, between.var=1, within.var=4)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 15.54913
    between.var = 1
     within.var = 4
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

We could use 15.5 (round up to 16) subjects per group to get power = 0.8.

power.anova.test(groups=4, power=0.8, between.var=1, within.var=4, sig.level=0.01)

> power.anova.test(groups=4, power=0.8, between.var=1, within.var=4, sig.level=0.01)

     Balanced one-way analysis of variance power calculation

         groups = 4
              n = 22.05539
    between.var = 1
     within.var = 4
      sig.level = 0.01
          power = 0.8

NOTE: n is number in each group

To have 80% power at a significance level (alpha) of 0.01, we need about 22 subjects per group (22.06, so round up to 23).

NCSS PASS software example for t-test sample size

Here is an example of the output from the NCSS PASS software for a two-sample t-test.

Two-Sample T-Test Power Analysis

Numeric Results for Two-Sample T-Test
Null Hypothesis: Mean1=Mean2.  Alternative Hypothesis: Mean1<>Mean2
The standard deviations were assumed to be unknown and equal.

Power     N1    N2   Ratio   Alpha    Beta     Mean1   Mean2   S1     S2
0.90596   32    32   1.000   0.01000  0.09404  130.0   150.0   20.0   20.0
0.81840   26    26   1.000   0.01000  0.18160  130.0   150.0   20.0   20.0
0.91250   23    23   1.000   0.05000  0.08750  130.0   150.0   20.0   20.0
0.80704   17    17   1.000   0.05000  0.19296  130.0   150.0   20.0   20.0
0.90163  120   120   1.000   0.01000  0.09837  140.0   150.0   20.0   20.0
0.80455   96    96   1.000   0.01000  0.19545  140.0   150.0   20.0   20.0
0.90323   86    86   1.000   0.05000  0.09677  140.0   150.0   20.0   20.0
0.80146   64    64   1.000   0.05000  0.19854  140.0   150.0   20.0   20.0

Report Definitions
Power is the probability of rejecting a false null hypothesis. Power should be close to one.
N1 and N2 are the number of items sampled from each population. To conserve resources, they should be small.
Alpha is the probability of rejecting a true null hypothesis. It should be small.
Beta is the probability of accepting a false null hypothesis. It should be small.
Mean1 is the mean of populations 1 and 2 under the null hypothesis of equality.
Mean2 is the mean of population 2 under the alternative hypothesis. The mean of population 1 is unchanged.
S1 and S2 are the population standard deviations. They represent the variability in the populations.

Summary Statements
Group sample sizes of 32 and 32 achieve 91% power to detect a difference of -20.0 between the null hypothesis that both group means are 130.0 and the alternative hypothesis that the mean of group 2 is 150.0 with estimated group standard deviations of 20.0 and 20.0 and with a significance level (alpha) of 0.01000 using a two-sided two-sample t-test.

[Plot: N1 versus M1 by Alpha (0.01 and 0.05), with M2 = 150.0, S1 = 20.0, S2 = 20.0, Power = 0.90, N2 = N1, two-sided t-test.]

[Plot: N1 versus M1 by Alpha (0.01 and 0.05), with M2 = 150.0, S1 = 20.0, S2 = 20.0, Power = 0.80, N2 = N1, two-sided t-test.]
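The first row of the PASS table can be reproduced approximately with the base R function used earlier in these notes. A sketch, using the same means (130 and 150, so delta = 20), standard deviation (20), significance level (0.01), and target power (0.90):

# cross-check of the first PASS row with base R
power.t.test(delta = 20, sd = 20, sig.level = 0.01, power = 0.90)

This gives roughly 32 subjects per group (about 31.6 before rounding up), consistent with N1 = N2 = 32 in the first row of the PASS output.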