Power and Sample Size: A Practical Point of View

Tatsuki Koyama, Ph.D.
Department of Biostatistics
Vanderbilt University School of Medicine
582 Preston
615-936-1232
tatsuki.koyama@vanderbilt.edu

Topics
1. Basics
2. Specific Situations
   • Comparing a proportion to a constant
   • Comparing two means
   • Comparing three or more means
3. Software
4. Writing a Power Analysis

1 Basics

Suppose you have already developed an analysis plan. Now you need to determine the sample size, N. Sample size is closely related to
• the type I error rate, α.
• the type II error rate, β. Power = 1 − β.

What are these?
• Type I error = rejecting H0 when it is true: concluding that the new treatment is effective when it is not.
• Type II error = failing to reject H0 when it is false: failing to conclude that the new treatment is effective when it is.

More on type I and II errors:
• As α ↓ or β ↓, N ↑.
• Type I and II errors are usually not of equal importance.
  – Without proper control of the type I error rate, the conclusion is not valid. (Conventionally α = .05.)
  – The experimenter can set β.
  – A small β (large power) costs more.
  – A large β (small power) may be unethical.

2 Specific Situations

2.1 Comparing a proportion to a constant

Treatment of interest: Autologous GM-CSF gene-modified tumor vaccines in combination with anti-VEGF antibodies. [Treatment]
Primary endpoint: The response rate in the patient population.
Hypothesis of interest: The response rate is higher than 12%.

• Type I error = concluding that the response rate is higher than 12% when it is actually 12%. Set α = 0.05.
• Type II error = failing to conclude that the response rate is higher than 12% when it actually is. Set β = 0.10 if the true response rate is 15%. Power is 1 − 0.10 = 0.90; i.e., if the true response rate is 15%, we will correctly reject H0 with probability 0.90.
• Sample size calculation using PASS 6.0: sample size = 1133. No! We can't afford that! Let's try power = .80.
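The one-sample proportion calculation above can be sketched with the standard normal approximation. This is only a sketch: PASS uses exact binomial methods, so its answers (such as the 1133 above) will be somewhat larger than the approximation below. The function name is my own for illustration.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_one_sample_proportion(p0, p1, alpha=0.05, beta=0.10):
    """Normal-approximation sample size for the one-sided test
    H0: p = p0 vs H1: p = p1 (> p0)."""
    za = norm.ppf(1 - alpha)  # one-sided critical value
    zb = norm.ppf(1 - beta)   # quantile for the desired power
    num = za * sqrt(p0 * (1 - p0)) + zb * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

# 12% vs 15%, power 90%: a very large N, in the ballpark of PASS's 1133
print(n_one_sample_proportion(0.12, 0.15))
# 12% vs 20%, power 80%: far smaller
print(n_one_sample_proportion(0.12, 0.20, beta=0.20))
```

The second call illustrates the point of this section: widening the detectable difference from 3 to 8 percentage points shrinks N by roughly an order of magnitude.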
With power .80 if the true response rate is 15%, N = 801. What else can we do to reduce the sample size? It costs too much to try to detect such a small difference (12% vs 15%). Suppose instead we want power of .80 if the true response rate is 20%. Then the necessary sample size is 132.

[Figure: Sample Size and Power — power and type I error rate plotted against sample size (100 to 150), marking n = 127 (power .81), n = 132 (power .80), and n = 143 (power .80).]

One more subtle point: sample size also depends on the location of the proportions.
p0 = 12%, p1 = 20%, α = .05, β = .20 give N = 127.
p0 = 42%, p1 = 50%, α = .05, β = .20 give N = 249.

2.2 Comparing two means

Primary endpoint: Proliferation index (PI) = the number of positively stained cells per tissue sample.
Hypothesis of interest: The average proliferation index differs between the control and treatment groups.

Type I error = concluding that the PI is different when it is the same for both groups. α = .05.
Type II error = failing to conclude that the PI is different when it is different. But what do you mean by "different"? The sample size depends on the "difference" and the "standard deviation" — more precisely, on the "standardized difference."

[Figure: Assumed distributions of the proliferation index for the control and treatment groups, shown with SD = 5 and SD = 10.]

Suppose that α = 5% and we want 90% power to detect a difference of 5. If we assume SD = 5, the sample size is 22 per group. If we assume SD = 10, the sample size is 85 per group.

[Figure: Sample size per group as a function of standard deviation (α = 5%): σ = 4 → N = 14; σ = 5 → N = 22; σ = 6 → N = 31; σ = 8 → N = 54; σ = 10 → N = 85. Power as a function of standard deviation (α = 5%, N = 22): σ = 4 → 99%; σ = 5 → 90%; σ = 6 → 79%; σ = 7 → 66%; σ = 8 → 55%.]

How do we know the SD (equivalently, the variance)?
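The per-group sample sizes quoted above (22 for SD = 5, 85 for SD = 10) follow from the standard normal-approximation formula n = 2((z₁₋α/₂ + z_power)·σ/Δ)², rounded up. A minimal sketch, with a function name of my own choosing:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Normal-approximation per-group sample size for a two-sided,
    two-sample comparison of means with true difference delta."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z * sigma / delta) ** 2)

print(n_per_group(delta=5, sigma=5))   # SD = 5  -> 22 per group
print(n_per_group(delta=5, sigma=10))  # SD = 10 -> 85 per group
```

Note that doubling σ quadruples the (unrounded) sample size, since only the standardized difference Δ/σ enters the formula.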
• Preliminary studies
  – Run a few samples of the new experiment (or collect pilot data).
  – Look at your own previous data that may be similar.
• Literature (other studies that have measured similar things)
  – Use estimates from similar published studies. (This is especially effective for "baseline" measures.)
• Clinical/scientific expertise (a guess!)
  – Make your best guess at what you hope to see. (This can be problematic: if the guess is far off, you may end up with too little power.)

Or, report the power analysis in standardized units:

Power Analysis: A sample size of 85 per group (170 total) provides 90% power to detect a difference of 0.5 × standard deviation between the control and treatment groups with a type I error rate of 5%.

2.3 Comparing three or more means

Use analysis of variance (ANOVA) to test
H0: µ1 = µ2 = µ3 = µ4
H1: At least two means are different from each other.

[Figure: Boxplots of Groups 1–4 illustrating the ANOVA setting.]

We still need to know the variance of each group. (ANOVA assumes they are all equal.) This is the "within-group variance." But there's more: we also need to specify the "between-group variance" — how much variability is there between the groups? If the variability between groups is small, you will need larger samples; if the differences between groups are enormous, smaller samples will do.

[Figure: Boxplots contrasting large between-group variance (small sample size) with small between-group variance (large sample size), Groups 1–4.]

Suppose α = 5%, power = 80%, within-group standard deviation = 5, and between-group variance = 10. Then, using nQuery, we get n = 8 per group.

Issues: How do we get these variance estimates? And do we really want to test whether "at least two means are different from each other"? Sometimes we want to compare Group 1 (placebo) against everybody else.
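The nQuery calculation above can be approximated in statsmodels, which parameterizes ANOVA power by Cohen's f = (between-group SD)/(within-group SD); here f = √10 / 5 ≈ 0.63. This is a sketch under that assumption, and the per-group n may differ by one or two from nQuery's 8 because the two programs solve and round differently.

```python
from math import ceil, sqrt
from statsmodels.stats.power import FTestAnovaPower

within_sd = 5.0
between_var = 10.0
f = sqrt(between_var) / within_sd  # Cohen's f, about 0.63

# solve_power returns the required TOTAL sample size across all k groups
n_total = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.80, k_groups=4)
print(ceil(n_total / 4))  # per-group n, close to nQuery's 8
```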
Then the hypotheses can be rewritten as:
HA0: µ1 = µ2 vs HA1: µ1 ≠ µ2
HB0: µ1 = µ3 vs HB1: µ1 ≠ µ3
HC0: µ1 = µ4 vs HC1: µ1 ≠ µ4

Each test then needs a smaller α. A simple (Bonferroni) adjustment is α = .05/3 ≈ .0167. The power for each test may be set low, and the overall power may still be adequate.

Even when the analysis plan is complicated, the sample size calculation may be based on a simpler method, even though you will not carry out the simpler analysis. For example:

For outcome variables with repeated measures, we will use a mixed model to obtain the overall effect of [intervention 1] and [intervention 2] in a manner similar to that described under GLM. The sample size calculations were performed using Student's t-test. Each experimental and control group will have n = 12 measurements, giving about 80% power to detect a difference of 1.2 × SD in an outcome variable with a two-sided significance level of 5%. It should be emphasized that this power calculation is based on the most conservative univariate approach. The main analysis will use GLM and a mixed model, which are more powerful than the univariate tests.
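The t-test-based calculation in the example paragraph can be checked with statsmodels. This is a sketch, not the authors' original computation: their n = 12 per group came from their own software, and the value below agrees up to rounding.

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

# Per-group n for a two-sided two-sample t-test detecting a
# standardized difference of 1.2 SD with 80% power and alpha = .05
n = TTestIndPower().solve_power(effect_size=1.2, alpha=0.05,
                                power=0.80, alternative='two-sided')
print(ceil(n))  # rounds up to the quoted 12 per group
```

The same call with alpha = 0.05/3 would give the Bonferroni-adjusted per-comparison sample size discussed above.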