Power and Sample Size: A Practical Point of View

Tatsuki Koyama, Ph.D.
Department of Biostatistics
Vanderbilt University School of Medicine
582 Preston
615-936-1232
tatsuki.koyama@vanderbilt.edu
Topics
1. Basics
2. Specific Situations
• Comparing a proportion to a constant
• Comparing two means
• Comparing three or more means
3. Software
4. Writing a Power Analysis
1 Basics
Suppose you have already developed an analysis plan.
Now you need to determine the sample size, N.
Sample size is closely related to
• type I error rate, α.
• type II error rate, β.
Power = 1 − β.
What are these?
• type I error = Reject H0 when it is true.
Conclude that the new treatment is effective when it
is not effective.
• type II error = Fail to reject H0 when it is false.
Fail to conclude that the new treatment is effective
when it is effective.
More on type I and II errors.
• As α ↓ or β ↓, N ↑.
• Type I and II are (usually) not of equal importance.
– Without proper control of the type I error rate, the
conclusion is not valid. (Conventionally, α = .05.)
– The experimenter can set β.
– A small β (large power) costs more.
– A large β (small power) may be unethical.
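These relationships can be checked numerically. Below is a minimal sketch (not the calculation used by the software mentioned later): the normal-approximation sample size for a one-sided one-sample z-test, with an assumed standardized effect size of 0.5.

```python
from math import ceil

from scipy.stats import norm

def n_one_sample_z(alpha, beta, effect_size):
    """Normal-approximation sample size for a one-sided one-sample z-test:
    n = ((z_{1-alpha} + z_{1-beta}) / d)^2, where d is the standardized effect."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return ceil((z / effect_size) ** 2)

# Effect size d = 0.5 is an assumed illustration value.
print(n_one_sample_z(alpha=0.05, beta=0.20, effect_size=0.5))  # 25
print(n_one_sample_z(alpha=0.01, beta=0.20, effect_size=0.5))  # 41: smaller alpha, larger N
print(n_one_sample_z(alpha=0.05, beta=0.10, effect_size=0.5))  # 35: smaller beta, larger N
```

Shrinking either error rate pushes the required N up, as the bullets above state.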
2 Specific Situations
2.1 Comparing a proportion to a constant
Treatment of interest
Autologous GM-CSF gene-modified tumor vaccines in
combination with anti-VEGF antibodies. [Treatment]
Primary endpoint
The response rate among the patient population.
Hypothesis of interest
The response rate is higher than 12%.
• type I error = Conclude that the response rate is
higher than 12% when it is actually 12%.
α = 0.05
• type II error = Fail to conclude that the response
rate is higher than 12% when it is actually higher
than 12%.
Set β = 0.10 if the response rate is actually 15%.
Power is 1 − 0.10 = 0.90; i.e., if the true response
rate is 15%, then we will be able to correctly reject
H0 with probability 0.90.
• Sample size calculation using PASS 6.0.
Sample size = 1133.
No! We can’t afford that!
Let’s try: Power = .80.
Power is .80 if the true response rate is 15%. N = 801.
What else can we do to reduce the sample size?
It costs too much to try to detect such a small difference (12% vs 15%).
We want a power of .80 if the true response rate is 20%.
Then the necessary sample size is 132.
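PASS reports these sample sizes using exact binomial calculations. A sketch of the same idea in Python (assuming a one-sided exact binomial test of H0: p = 12%) confirms that n = 132 gives roughly 80% power at a true rate of 20%:

```python
from scipy.stats import binom

def exact_binomial_power(n, p0, p1, alpha=0.05):
    """Power of the one-sided exact binomial test of H0: p = p0 vs. H1: p > p0.
    Reject H0 when X >= c, where c is the smallest count whose
    rejection probability under p0 does not exceed alpha."""
    c = int(binom.ppf(1 - alpha, n, p0)) + 1
    while binom.sf(c - 1, n, p0) > alpha:  # safety: enforce size <= alpha
        c += 1
    return binom.sf(c - 1, n, p1)

# Slide scenario: n = 132 should give power close to .80 at a true rate of 20%.
print(round(exact_binomial_power(132, 0.12, 0.20), 3))
```

Because the binomial is discrete, power does not increase smoothly in n, which is why nearby sample sizes can give almost identical power.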
[Figure: Sample Size and Power — power and type I error rate (y-axis, 0.0–1.0) against sample size (x-axis, 100–150), with annotated points n = 127 (power .81), n = 132 (power .80), and n = 143 (power .80).]
One more subtle point:
Sample size also depends on where the proportions lie (their location), not just on their difference.
p0 = 12%, p1 = 20%, α = .05, β = .20 give N = 127.
p0 = 42%, p1 = 50%, α = .05, β = .20 give N = 249.
2.2 Comparing two means
Primary endpoint
Proliferation index = the number of positively stained
cells per tissue sample.
Hypothesis of interest
Average proliferation index (PI) is different in the control and treatment groups.
type I error = Conclude that the PI is different when
it is the same for both groups. α = .05.
type II error = Fail to conclude that the PI is different
when it is different. What do you mean by “different”?
The sample size depends on “difference” and “standard
deviation”.
Actually, it depends on “standardized difference.”
[Two figures: Assumed Distributions of Proliferation Index, one with SD = 5 and one with SD = 10; each shows the control and treatment densities over a proliferation-index range of 30–70.]
Suppose that α = 5% and Power=90% to detect a difference of 5.
If we assume that SD = 5, then the sample size is 22
from each group.
If we assume that SD = 10, then the sample size is 85
from each group.
[Figure: Sample Size as a Function of Standard Deviation (α = 5%, 90% power to detect a difference of 5). Per-group sample sizes:]
σ = 4 → N = 14
σ = 5 → N = 22
σ = 6 → N = 31
σ = 8 → N = 54
σ = 10 → N = 85
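The σ-to-N values in the figure can be reproduced with the standard normal-approximation formula for comparing two means (a sketch; exact t-based software can differ by a unit):

```python
from math import ceil

from scipy.stats import norm

def n_two_means(delta, sigma, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided two-sample z-test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z * sigma / delta) ** 2)

for sigma in (4, 5, 6, 8, 10):
    print(sigma, n_two_means(delta=5, sigma=sigma))  # 14, 22, 31, 54, 85
```

Note that only the standardized difference δ/σ enters the formula, as stated above.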
[Figure: Power as a Function of Standard Deviation (α = 5%, N = 22 per group):]
σ = 4 → Power = 99%
σ = 5 → Power = 90%
σ = 6 → Power = 79%
σ = 7 → Power = 66%
σ = 8 → Power = 55%
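Fixing N = 22 per group and solving for power instead recovers the figure's values to within about a percentage point (a normal-approximation sketch of the t-test):

```python
from math import sqrt

from scipy.stats import norm

def power_two_means(n, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n per group:
    power = Phi(delta / (sigma * sqrt(2/n)) - z_{1-alpha/2})."""
    se = sigma * sqrt(2.0 / n)  # standard error of the difference in means
    return norm.cdf(delta / se - norm.ppf(1 - alpha / 2))

for sigma in (4, 5, 6, 7, 8):
    print(sigma, round(power_two_means(n=22, delta=5, sigma=sigma), 2))
```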
How do we know the SD (equivalently variance)?
• Preliminary studies
– Run a few samples of new experiment (or collect
pilot data).
– Look at previous data that you have observed
that may be similar.
• Literature (other studies that have measured similar
things)
– Use estimates based on similar studies that have been published. (This is effective for determining “baseline” measures.)
• Clinical/scientific expertise (guess!)
– Make your best guess at what you hope to see.
(This method can be problematic: if the guess is
“way off,” you won’t have enough power in the end.)
Or ...
Power Analysis: A sample size of 85 per group
(170 total) provides 90% power to detect a difference of
0.5 × standard deviation between the control and treatment groups with a type I error rate of 5%.
2.3 Comparing three or more means
Use Analysis of Variance (ANOVA) to test
H0: µ1 = µ2 = µ3 = µ4
H1: At least two means are different from each other.
[Figure: dot plot of data (range roughly 20–80) for Group1 through Group4.]
We still need to know the variance for each group. (ANOVA
assumes that they are all equal.)
This is called “Within-Group Variance.”
But there’s more...
We need to specify “Between-Group Variance.”
How much variability between groups?
If there is only small variability between groups, you
will need larger samples.
If the differences between groups are enormous, then
you will need only smaller samples.
[Two figures: dot plots for Group1–Group4. Large between-group variance → a small sample size suffices; small between-group variance → a large sample size is needed.]
Suppose α = 5%, Power=80%,
Within-Group Standard Deviation = 5,
Between-Group Variance = 10.
Then, using nQuery, we get n = 8 per group.
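A sketch of this calculation with scipy's noncentral F distribution. It assumes "between-group variance" means the variance of the k group means, giving noncentrality λ = n · k · σ²_between / σ²_within; conventions differ across packages, so treat this as illustrative rather than as nQuery's exact method:

```python
from scipy.stats import f, ncf

def anova_power(n, k, var_between, sd_within, alpha=0.05):
    """Power of the one-way ANOVA F-test with k groups and n per group.
    Noncentrality lambda = n * k * var_between / sd_within**2
    (var_between taken as the variance of the k group means -- an
    assumed convention; check your software's definition)."""
    df1, df2 = k - 1, k * (n - 1)
    crit = f.ppf(1 - alpha, df1, df2)          # central F critical value
    nc = n * k * var_between / sd_within ** 2  # noncentrality parameter
    return ncf.sf(crit, df1, df2, nc)

# Slide scenario: 4 groups, within-group SD 5, between-group variance 10, n = 8.
print(round(anova_power(n=8, k=4, var_between=10, sd_within=5), 2))
```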
Issues
How do we get these variance estimates?
Do we really want to test to see if “at least two means
are different from each other?”
Sometimes, we want to compare Group 1 (Placebo) against
everybody else.
Then the hypothesis can be rewritten as:
HA0: µ1 = µ2 vs. HA1: µ1 ≠ µ2
HB0: µ1 = µ3 vs. HB1: µ1 ≠ µ3
HC0: µ1 = µ4 vs. HC1: µ1 ≠ µ4
Each test needs to have a smaller α.
A simple (Bonferroni) adjustment is α = .05/3 ≈ .0167.
Power for each test may be set low and the overall power
may still be adequate.
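The cost of the adjustment can be illustrated with the normal-approximation two-sample formula. The difference of 5 and SD of 10 below are assumed illustration values, not taken from the slides:

```python
from math import ceil

from scipy.stats import norm

def n_per_group(delta, sigma, alpha, power):
    """Per-group n for a two-sided two-sample z-test (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z * sigma / delta) ** 2)

# Three comparisons against placebo: Bonferroni-adjusted per-test alpha.
alpha_each = 0.05 / 3  # ~0.0167

print(n_per_group(5, 10, 0.05, 0.80))        # unadjusted per-test n
print(n_per_group(5, 10, alpha_each, 0.80))  # larger: the price of multiplicity
```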
Even when the analysis plan is complicated, sample size
calculation may be based on a simpler method even
though you will not carry out the simpler analysis.
For example :
For outcome variables with repeated measures, we will
use a mixed model to obtain the overall effect of [intervention 1] and [intervention 2] in a similar manner
as described under GLM. The sample size calculations
were performed using a Student’s t-test. Each experimental and control group will have n = 12 measurements
to give about 80% power to detect a difference
of 1.2-fold SD in an outcome variable with a two-sided
significance level of 5%. It needs to be emphasized that this power calculation is based
on the most conservative univariate approach.
The main analysis will use GLM and a mixed
model, which are more powerful than the univariate tests.