FUNDAMENTALS OF MEDICAL RESEARCH
Ed Gracely, Ph.D.
Family, Community, and Preventive Medicine
Power and sample size, and
Common tests
July 16, 2014
Goals for the session: You should come away able to:
1. Explain the concept of statistical power and why it is important.
2. Define type II error probability and beta. Recognize how they relate to power.
3. Explain how sample size and power figure into the interpretation of a non-significant
study.
4. Explain how the magnitude of the real effect and the sample size affect the power.
5. Briefly describe a few common statistical tests and when they might be used.
A) Lead-in example
Ex: A researcher is studying the duration of a certain infection.
It is known that:
Untreated, the infection lasts a mean of 16 days, SD = 6
The researcher expects and wants to show that:
Treatment reduces mean duration by 2 days (to 14 days), SD = 6.
Before doing the study, the researcher approaches me and says: "Oh wise and
magnanimous statistician, we humbly seek your advice. We think our new treatment will
reduce the duration of infection by a mean of 2 days (from 16 to 14). Duration has a
standard deviation of about 6. We have 20 subjects per group. We beg of you, please tell
us we will obtain a significant result and be able to publish in some august journal."
Answer: “Not even the wise can guess the decisions that will be made by the editors of
august journals! But, I can tell you that if your sample results look exactly like you
hypothesize (mean reduction in duration of 2 days, SD = 6) it will not be significant!
You need sample results better than what you hypothesize to get a significant result.
The likelihood of getting lucky in the right way is only about 30%. Otherwise your
study will be non-significant, and even editors of lesser journals will laugh at it! I
proclaim that you need more subjects. The wise recommend enough subjects to have
an 80% probability of a significant result (if your assumptions are correct). To
accomplish this you need about 120 subjects per group.” Fade to gasping and
choking from consultees….
Note: The 30% and the 80% (for different sample sizes) are the power of the study for
those sample sizes.
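The figures in this example can be checked with a quick normal-approximation sketch. The helper below is an illustration (standard-library Python only, not the output of any particular power package); it assumes a one-sided α = 0.05, which roughly reproduces the "about 30%" and 80% quoted above, though exact software output will differ slightly.

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def approx_power(delta, sd, n_per_group, z_alpha=1.645):
    """Approximate power of a two-group comparison of means
    (normal approximation; z_alpha = 1.645 is one-sided alpha = 0.05)."""
    se_diff = sd * sqrt(2.0 / n_per_group)   # standard error of the difference in means
    return norm_cdf(delta / se_diff - z_alpha)

print(round(approx_power(2, 6, 20), 2))    # ~0.28, near the "about 30%" above
print(round(approx_power(2, 6, 120), 2))   # ~0.83, near the 80% target
```

Note how quadrupling-plus the sample size (20 to 120 per group) is what it takes to move the power from roughly 30% to roughly 80% for this small effect.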
B) Basic concepts
What is power?
Power is a probability.
The probability of what?
That the study will find a statistically significant result.
Under any particular conditions?
Yes, if a specific magnitude of difference or effect exists in the population
(reality).
Can you give an example?
OK, suppose SBP has a standard deviation of 20. A researcher with 30 subjects in
each of two groups would have an 80% probability of a statistically significant result
if the population mean difference between the two treatments was 15. Thus if that is
the true (population) difference, the power of the study is 80%.
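Conversely, one can solve for the n per group needed to reach a target power. The sketch below is an illustrative normal approximation (not the formula any particular package uses); with the SBP numbers above (difference 15, SD 20, two-sided α = 0.05, 80% power) it lands near the 30 per group quoted.

```python
from math import ceil

def n_per_group(delta, sd, z_alpha=1.96, z_beta=0.8416):
    """Normal-approximation sample size per group for comparing two means.
    Defaults: two-sided alpha = 0.05 (z = 1.96) and 80% power (z = 0.8416)."""
    return ceil(2.0 * ((z_alpha + z_beta) * sd / delta) ** 2)

print(n_per_group(15, 20))   # 28 -- close to the 30 per group in the example
```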
How does sample size come in?
All else equal, the more subjects in the study, the larger the power.
How about a published example? Sure!
Ex: "Our target sample size for patients completing the study was approximately 396
patients (198 per treatment group). Assuming a dropout rate of 20%, we planned
to randomly assign 496 patients. This sample size would have provided 90%
power to detect, at a two-sided level of significance of 0.05, a treatment difference
of 18% in the primary end point …" Parving et al., NEJM, June 5, 2008
This means that if the new treatment (a direct renin inhibitor added to standard
treatments for diabetic retinopathy) actually works as well as the researcher thinks
it does, and if he or she has 198 per group, there is a 90% probability that the
sample data will be statistically significant in support of the superiority of the new
treatment.
Q: So this researcher has a 90% probability of obtaining a significant result and
deciding in favor of a difference if: a
a. There is no true (that is, in the population) difference.
b. There is some population difference, to be determined.
c. The true (population) effect is a difference of 18% in the primary end point.
Q: By luck, they ended up with *more* subjects than planned. What would this mean
for the power? [Decrease // no change // increase ].
Type II error is failing to find a difference or effect that really exists. This is exactly what
having a large sample size (and enough power) is designed to prevent! The probability of a
type II error is symbolized with the Greek letter beta (β).
If there really is a difference (or an effect), say that renin inhibitor truly (in the population)
reduces the primary end point (the albumin to creatinine ratio) by 18% more than standard
therapy, there are two possible outcomes in your study:
i) Correctly find a benefit. The probability of finding a benefit under the condition that a
difference really exists is the power of the study.
ii) Fail to find a benefit = type II error. Probability = beta. So power + beta = 100% (or
1.0).
Q: Researcher reports a beta error probability of 30%. What is the power? b
Q: Researcher reports a power of 90%. How likely is the researcher to commit a type
II error? c
In the literature, you may see power, or type II error probability, or beta (β). You
should know how to move between them. Remember the above and it isn’t tricky.
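Moving between power, beta, and type II error probability is just subtraction from 1, as this trivial sketch shows:

```python
def power_from_beta(beta):
    """Power = 1 - beta (the type II error probability)."""
    return 1.0 - beta

def beta_from_power(power):
    """Beta = 1 - power."""
    return 1.0 - power

print(power_from_beta(0.30))   # 0.7 -- matches the first question above
print(beta_from_power(0.90))   # 0.1 -- matches the second
```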
C) Applications to planning and critiquing studies
1.
In planning a study, the proper way to determine the number of subjects to use is to pick a
number that will provide sufficient power.
This should be done as early as possible, and certainly in advance of collecting data or
even submitting a proposal.
If too little information is available, you may need to do a pilot study.
2.
In critiquing a study, it is often important to establish whether or not the researchers had
sufficient power, especially if they fail to find a difference.
Ex: Researcher A finds a difference of 10 mmHg between two treatments, p < 0.001.
Researcher B finds a difference of 5, p = 0.25.
Q: For which of the two are you more likely to be concerned about the number of
subjects? d
Key point: Unless the researcher has a large N (and thus a large power), a real and important
difference can easily be missed.
Thus, the failure to find a statistically significant difference is difficult to interpret when N (and
power) are small.
Therefore, a nonsignificant result can ONLY provide substantial support for the absence of a
real difference (or at least a real difference of importance) when the researcher had a high power
to detect the effect of interest.
To a reader the question becomes: what difference would be clinically or theoretically important
to me, such that I would want to be sure the study had enough power for that difference?
Ex: “A total sample size of 200 patients was calculated to demonstrate with a power of 90%
that antibiotic prophylaxis reduces the proportion of patients with infected pancreatic
necrosis from 40% on placebo (PLA) to 20% on ciprofloxacin/metronidazole (CIP/MET).”
Gastroenterology. 126(4):997-1004, 2004 Apr.
Q: How sure did these authors want to be that they would find a clinically significant
difference if one existed? e
Q: What difference do they seem to have defined as clinically significant? f
The actual key result was: “Twelve percent of the CIP/MET group developed infected
pancreatic necrosis compared with 9% of the PLA group (P = 0.585).” How might you
report this result in journal club, using the p-value, the power, and the actual differences? g
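For two proportions, like the 40% versus 20% infection rates in this example, a normal-approximation sketch (a hypothetical helper, standard-library only) gives a power in the same neighborhood as the 90% the authors report for 100 per group; the exact figure depends on the formula and corrections their software used.

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_two_proportions(p1, p2, n_per_group, z_alpha=1.96):
    """Approximate power for comparing two proportions
    (normal approximation, two-sided alpha = 0.05)."""
    p_bar = (p1 + p2) / 2.0
    # standard error under the null (pooled) and under the alternative
    se_null = sqrt(2.0 * p_bar * (1.0 - p_bar) / n_per_group)
    se_alt = sqrt(p1 * (1.0 - p1) / n_per_group + p2 * (1.0 - p2) / n_per_group)
    return norm_cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)

print(round(power_two_proportions(0.40, 0.20, 100), 2))   # ~0.88
```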
D) Common statistical tests:
We only have time to hit a few highlights of these. They are not usually critical for readers to
know, but it’s worth recognizing the commonest terms.
t-tests: For comparing two means, when the data are Gaussian (normally distributed).
ANOVA (analysis of variance): a family of techniques for comparing *more than* two
means when the data are Gaussian. There are complex ANOVA designs with several
grouping variables.
Nonparametric tests: A wide variety, for comparing two or more groups or conditions
when the outcome data, albeit numeric, are not Gaussian, or (sometimes) fail other
assumptions. There are nonparametric equivalents to both t-tests and ANOVA.
Chi-square: A common technique used when there are two or more groups to be compared
and the outcome is “yes/no”, like “Had an adverse reaction” versus “Did not have
one”.
Measures of association between numeric variables, such as correlation and regression.
Methods to statistically combine a number of studies to get a summary result: meta-analysis.
This could be a class of its own. There are a variety of issues in doing one.
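As one concrete illustration, a 2x2 chi-square test (the "yes/no" situation above) can be computed by hand. This is a from-scratch sketch with made-up counts, using the shortcut formula for 2x2 tables and the fact that a 1-df chi-square variable is a squared standard normal:

```python
from math import sqrt, erf

def chi2_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table [[a, b], [c, d]]
    (shortcut formula, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Two-sided p-value for a 1-df chi-square, via the normal distribution."""
    z = sqrt(chi2)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

# Hypothetical counts: 30/100 adverse reactions on treatment A vs. 15/100 on B
stat = chi2_2x2(30, 70, 15, 85)
print(round(stat, 2), round(p_value_1df(stat), 3))   # 6.45 0.011
```

In practice a statistics package would do this for you; the point is only that the test compares observed yes/no counts against what chance would produce.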
QUIZ
1. Which category of tests would be used for each of the following?
a. Comparing 3 body types of subjects on whether or not each of them can climb to the top of a
knotted rope.
b. Comparing males and females on mean SBP, which is Gaussian.
c. Comparing old and young subjects on cholesterol, which is non-Gaussian.
d. Looking for an association between heart rate and blood pressure.
e. You have treated and untreated subjects, broken down into male and female, as well as with
and without previous treatment. You want to use all of these grouping variables in one
analysis. The variable of interest is SBP, Gaussian.
2. All else equal, increasing the sample size will:
1) Increase power
2) Decrease power
3) Have no effect on power
3. The best time to determine the proper sample size is:
1) Before doing the study.
2) After obtaining non-significant results, to help interpret them.
3) After the reviewer rejects the paper for small sample sizes.
4) After you've killed 500 rats (so you can justify using so many).
4.
A researcher wants to show that animals excrete more uric acid under dietary condition A than
under dietary condition B. To be relatively sure that any such effect is detected, the researcher
should have:
1) A small power.
2) A large power.
3) A power of 0, if possible.
5.
"Given the sample size and analysis we will use, we have a 5% probability of finding a
difference that does not exist (by chance) and a 60% probability of finding a difference if one
truly exists (i.e., correctly finding a difference)." What is the power of this study?
1) 5%
2) 40%
3) 60%
4) 95%
6.
In a study comparing methods of weaning patients off of ventilators, the authors say, "..31
patients were needed in each group to detect at a power of 80% a difference in weaning times
between groups of two days." NEJM 2-9-95. With 31 patients, these authors have an 80%
probability of finding statistical significance if:
1) There really is no difference between groups.
2) There is a sample difference of two days between groups.
3) The true difference between groups is two days.
4) None of these -- the probability of finding a difference is 100 - power = 20%.
Answers to Quiz questions
1 a) chi-square b) t-test c) nonparametric test d) correlation/regression e) ANOVA
2: 1    3: 1    4: 2    5: 60%    6: 3
a. (c). Power is calculated for the specific difference of interest.
b. 70%
c. 10%
d. (B), since the small N could easily lead to missed differences.
e. 90%
f. A 20% difference (40% vs. 20%).
g. A complete answer should note that they had a high power to detect a difference (40% versus 20%, which may
strike you as rather large). The actual results showed the treated group as slightly worse than the placebo
group, with a completely non-significant p-value. This argues against any benefit for the treatment. Perhaps
part of the problem is that they appear to have greatly overestimated the rate of infection in the placebo group.