FUNDAMENTALS OF MEDICAL RESEARCH
Ed Gracely, Ph.D.
Family, Community, and Preventive Medicine
Power and sample size, and Common tests
July 16, 2014

Goals for the session: You should come away able to:
1. Explain the concept of statistical power and why it is important.
2. Define type II error probability and beta. Recognize how they relate to power.
3. Explain how sample size and power figure into the interpretation of a non-significant study.
4. Explain how the magnitude of the real effect and the sample size affect the power.
5. Briefly describe a few common statistical tests and when they might be used.

A) Lead-in example

Ex: A researcher is studying the duration of a certain infection. It is known that:
Untreated, the infection lasts a mean of 16 days, SD = 6.
The researcher expects and wants to show that:
Treatment reduces mean duration by 2 days (to 14 days), SD = 6.

Before doing the study, the researcher approaches me and says: "Oh wise and magnanimous statistician, we humbly seek your advice. We think our new treatment will reduce the duration of infection by a mean of 2 days (from 16 to 14). Duration has a standard deviation of about 6. We have 20 subjects per group. We beg of you, please tell us we will obtain a significant result and be able to publish in some august journal."

Answer: "Not even the wise can guess the decisions that will be made by the editors of august journals! But I can tell you that if your sample results look exactly like you hypothesize (mean reduction in duration of 2 days, SD = 6), it will not be significant! You need sample results better than what you hypothesize to get a significant result. The likelihood of getting lucky in the right way is only about 30%. Otherwise your study will be non-significant, and even editors of lesser journals will laugh at it! I proclaim that you need more subjects. The wise recommend enough subjects to have an 80% probability of a significant result (if your assumptions are correct). To accomplish this you need about 120 subjects per group."

Fade to gasping and choking from consultees….

Note: The 30% and the 80% (for different sample sizes) are the power of the study for those sample sizes.

B) Basic concepts

What is power? Power is a probability.
The probability of what? That the study will find a statistically significant result.
Under any particular conditions? Yes, if a specific magnitude of difference or of effect exists in the population (reality).
Can you give an example? OK, suppose SBP (systolic blood pressure) has a standard deviation of 20. A researcher with 30 subjects in each of two groups would have an 80% probability of a statistically significant result if the population mean difference between the two treatments was 15. Thus, if that is the true (population) difference, the power of the study is 80%.
How does sample size come in? All else equal, the more subjects in the study, the larger the power.
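Can I calculate power myself? Yes, with any power calculator or statistics package. Below is a minimal sketch in Python, assuming the statsmodels package (my tooling choice, not something this handout prescribes). It computes the power of a two-sample t-test for the lead-in example (difference = 2 days, SD = 6, 20 subjects per group) and solves for the group size needed for 80% power. A one-sided test, in the spirit of the researcher's directional hypothesis, is assumed here; it lands in the same ballpark as the rounded figures in the dialogue, though the exact answers depend on the details of the calculation.

    # Sketch only: assumes statsmodels is installed; figures are approximate.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    effect = 2 / 6  # hypothesized mean difference divided by SD (Cohen's d)

    # Power of a one-sided two-sample t-test with 20 subjects per group
    power_20 = analysis.power(effect_size=effect, nobs1=20, alpha=0.05,
                              alternative="larger")

    # Subjects per group needed for 80% power
    n_80 = analysis.solve_power(effect_size=effect, power=0.80, alpha=0.05,
                                alternative="larger")

    print(f"Power with n = 20 per group: {power_20:.2f}")  # about 0.26
    print(f"n per group for 80% power: {n_80:.0f}")        # about 112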
How about a published example? Sure!

Ex: "Our target sample size for patients completing the study was approximately 396 patients (198 per treatment group). Assuming a dropout rate of 20%, we planned to randomly assign 496 patients. This sample size would have provided 90% power to detect, at a two-sided level of significance of 0.05, a treatment difference of 18% in the primary end point …" Parving et al., NEJM, June 5, 2008.

(The 496 is just the completer target inflated for the expected 20% dropout: 198 ÷ 0.8 = 247.5, about 248 per group.)

This means that if the new treatment (a direct renin inhibitor added to standard treatments for diabetic nephropathy) actually works as well as the researcher thinks it does, and if he or she has 198 per group, there is a 90% probability that the sample data will be statistically significant in support of the superiority of the new treatment.

Q: So this researcher has a 90% probability of obtaining a significant result and deciding in favor of a difference if: [a]
a. There is no true (that is, in the population) difference.
b. There is some population difference, to be determined.
c. The true (population) effect is a difference of 18% in the primary end point.

Q: By luck, they ended up with *more* subjects than planned. What would this mean for the power? [Decrease // no change // increase]

Type II error is failing to find a difference or effect that really exists. This is exactly what having a large sample size (and enough power) is designed to prevent! The probability of a type II error is symbolized with the Greek letter beta (β).

If there really is a difference (or an effect), say that the renin inhibitor truly (in the population) reduces the primary end point (the albumin-to-creatinine ratio) by 18% more than standard therapy, there are two possible outcomes in your study:
i) Correctly find a benefit. The probability of finding a benefit, under the condition that a difference really exists, is the power of the study.
ii) Fail to find a benefit = type II error. Probability = beta.
So power + beta = 100% (or 1.0).

Q: Researcher reports a beta error probability of 30%. What is the power? [b]
Q: Researcher reports a power of 90%. How likely is the researcher to commit a type II error? [c]

In the literature, you may see power, or type II error probability, or beta (β). You should know how to move between them. Remember the above and it isn't tricky.

C) Applications to planning and critiquing studies

1. In planning a study, the proper way to determine the number of subjects to use is to pick a number that will provide sufficient power. This should be done as early as possible, and certainly in advance of collecting data or even submitting a proposal. If too little information is available, you may need to do a pilot study.

2. In critiquing a study, it is often important to establish whether or not the researchers had sufficient power, especially if they fail to find a difference.

Ex: Researcher A finds a difference of 10 mmHg between two treatments, p < 0.001. Researcher B finds a difference of 5 mmHg, p = 0.25.
Q: For which of the two are you more likely to be concerned about the number of subjects? [d]

Key point: Unless the researcher has a large N (and thus a large power), a real and important difference can easily be missed. Thus, the failure to find a statistically significant difference is difficult to interpret when N (and power) are small. Therefore, a non-significant result can ONLY provide substantial support for the absence of a real difference (or at least a real difference of importance) when the researcher had a high power to detect the effect of interest.

To a reader, the question becomes: what difference would be clinically or theoretically important to me, such that I would want to be sure the study had enough power for that difference?

Ex: "A total sample size of 200 patients was calculated to demonstrate with a power of 90% that antibiotic prophylaxis reduces the proportion of patients with infected pancreatic necrosis from 40% on placebo (PLA) to 20% on ciprofloxacin/metronidazole (CIP/MET)." Gastroenterology. 126(4):997-1004, 2004 Apr.

Q: How sure did these authors want to be that they would find a clinically significant difference if one existed? [e]
Q: What difference do they seem to have defined as clinically significant? [f]

The actual key result was: "Twelve percent of the CIP/MET group developed infected pancreatic necrosis compared with 9% of the PLA group (P = 0.585)."

How might you report this result in journal club, using the p-value, the power, and the actual differences? [g]
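You can check the arithmetic behind such a design yourself. Here is a minimal sketch, again assuming Python with statsmodels (my choice of tool, not the authors'), of the power for comparing infection rates of 40% versus 20% with 100 patients per group (200 total):

    # Sketch only: approximate check of the published power calculation.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Effect size (Cohen's h) for 40% (placebo) vs. 20% (antibiotic) rates
    h = proportion_effectsize(0.40, 0.20)

    # Power with 100 patients per group, two-sided alpha = 0.05
    power = NormalIndPower().power(effect_size=h, nobs1=100, alpha=0.05)
    print(f"Power: {power:.2f}")  # about 0.88, close to the authors' 90%

The small gap from the authors' 90% presumably reflects a different approximation on their end. The substantive point stands: 200 patients give high power against a 40% versus 20% difference, but that says nothing about power for smaller differences.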
D) Common statistical tests

We only have time to hit a few highlights of these. They are not usually critical for readers to know, but it's worth recognizing the most common terms.

t-tests: For comparing two means, when the data are Gaussian (normally distributed).

ANOVA (analysis of variance): a family of techniques for comparing *more than* two means when the data are Gaussian. There are complex ANOVA designs with several grouping variables.

Nonparametric tests: A wide variety, for comparing two or more groups or conditions when the outcome data, although numeric, are not Gaussian, or (sometimes) fail other assumptions. There are equivalents to both t-tests and ANOVA.

Chi-square: A common technique used when there are two or more groups to be compared and the outcome is "yes/no", like "Had an adverse reaction" versus "Did not have one".

Measures of association between numeric variables, such as correlation and regression.

Methods to statistically combine a number of studies to get a summary result: meta-analysis. This could be a class of its own. There are a variety of issues in doing one.
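To make these names concrete, here is a minimal sketch of what calling each test looks like in Python, assuming the scipy and numpy packages; the data are invented purely for illustration.

    # Sketch only: illustrative calls to common tests, with made-up data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(120, 15, size=30)  # e.g., SBP under treatment A
    group_b = rng.normal(125, 15, size=30)  # e.g., SBP under treatment B

    # t-test: compare two means, Gaussian data
    t, p_t = stats.ttest_ind(group_a, group_b)

    # Nonparametric equivalent (Mann-Whitney U): non-Gaussian outcome
    u, p_u = stats.mannwhitneyu(group_a, group_b)

    # Chi-square: two groups, yes/no outcome (adverse reaction or not)
    table = np.array([[12, 18],   # group A: yes, no
                      [ 5, 25]])  # group B: yes, no
    chi2, p_chi, dof, expected = stats.chi2_contingency(table)

    # Correlation: association between two numeric variables
    r, p_r = stats.pearsonr(group_a, group_b)

    # One-way ANOVA: compare more than two means, Gaussian data
    group_c = rng.normal(130, 15, size=30)
    f, p_f = stats.f_oneway(group_a, group_b, group_c)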
QUIZ

1. Which category of tests would be used for each of the following?
a. Comparing 3 body types of subjects on whether or not each of them can climb to the top of a knotted rope.
b. Comparing males and females on mean SBP, which is Gaussian.
c. Comparing old and young subjects on cholesterol, which is non-Gaussian.
d. Looking for an association between heart rate and blood pressure.
e. You have treated and untreated subjects, broken down into male and female, as well as with and without previous treatment. You want to use all of these grouping variables in one analysis. The variable of interest is SBP, Gaussian.

2. All else equal, increasing the sample size will:
1) Increase power
2) Decrease power
3) Have no effect on power

3. The best time to determine the proper sample size is:
1) Before doing the study.
2) After obtaining non-significant results, to help interpret them.
3) After the reviewer rejects the paper for small sample sizes.
4) After you've killed 500 rats (so you can justify using so many).

4. A researcher wants to show that animals excrete more uric acid under dietary condition A than under dietary condition B. To be relatively sure that any such effect is detected, the researcher should have:
1) A small power.
2) A large power.
3) A power of 0, if possible.

5. "Given the sample size and analysis we will use, we have a 5% probability of finding a difference that does not exist (by chance) and a 60% probability of finding a difference if one truly exists (i.e., correctly finding a difference)." What is the power of this study?
5%   40%   60%   95%

6. In a study comparing methods of weaning patients off of ventilators, the authors say, "…31 patients were needed in each group to detect at a power of 80% a difference in weaning times between groups of two days." NEJM 2-9-95. With 31 patients, these authors have an 80% probability of finding statistical significance if:
1) There really is no difference between groups.
2) There is a sample difference of two days between groups.
3) The true difference between groups is two days.
4) None of these -- the probability of finding a difference is 100 - power = 20%.

Answers to Quiz questions
1. a) chi-square  b) t-test  c) nonparametric test  d) correlation/regression  e) ANOVA
2: 1
3: 1
4: 2
5: 60%
6: 3

Answers to lettered questions in the text
a: (c). Power is calculated for the specific difference of interest.
b: 70%
c: 10%
d: (B), since the small N could easily lead to missed differences.
e: 90%
f: A 20% difference (40% vs. 20%).
g: A complete answer should note that they had a high power to detect a difference (40% versus 20%, which may strike you as rather large). The actual results showed the treated group as slightly worse than the placebo group, with a completely non-significant p-value. This argues against any benefit for the treatment. Perhaps part of the problem is that they appear to have greatly overestimated the rate of infection in the placebo group.