Sample Size Calculation
PD Dr. Rolf Lefering
IFOM - Institut für Forschung in der Operativen Medizin
Universität Witten/Herdecke, Campus Köln-Merheim

Sample Size Calculation
Sample size balances uncertainty against costs, effort and time.

Sample Size Calculation
Single study group
- continuous measurement
- count of events
Comparative trial (2 or more groups)
- continuous measurement
- count of events

Confidence Interval
Which true value is compatible with the observation?
Confidence interval: the range in which the true value lies with a high probability (usually 95%)

Confidence Interval
Example: 56 patients with open fractures, 9 of whom developed an infection (16%)
Sample: n = 56, infection rate 16%
All patients with open fractures: true value ???

Confidence Interval
Formula for event rates
n = sample size
p = percentage
$$ CI_{95} = p \pm 1.96 \cdot \sqrt{\frac{p \cdot (100 - p)}{n}} $$
Example: n = 56, p = 16%
$$ CI_{95} = 16 \pm 1.96 \cdot \sqrt{\frac{16 \cdot 84}{56}} = 16 \pm 9.6 \quad [\,6.4 - 25.6\,] $$

Confidence Interval
[Figure: 95% confidence interval around a 20% incidence rate; x-axis: sample size (10 to 150), y-axis: incidence rate (%), 0 to 50]

Confidence Interval
Formula for continuous variables (mean)
M = mean
SE = standard error
SD = standard deviation
n = sample size
Remember: $SE = SD / \sqrt{n}$
$$ CI_{95} = M \pm 1.96 \cdot SE $$
Multiplier: 1.65 for 90%, 1.96 for 95%, 2.58 for 99%

Sample Size Calculation
Comparative trials
"What is the sample size to show that early weight-bearing therapy is better?"
"Which key should I press here now?"
"What is the sample size to show that early weight-bearing therapy, as compared to standard therapy, is able to reduce the time until return to work from 10 weeks to 8 weeks, where time to return to work has an SD of 3?"
36 cases per group!
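The two worked numbers above (the interval 6.4 to 25.6 and the 36 cases per group) can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `ci_rate` and `n_per_group_means` are made up for illustration, the sample size uses the common normal-approximation formula n = 2 * ((z_alpha + z_beta) * SD / difference)^2 with two-sided alpha = 0.05 and 80% power (the slides do not say which formula was used), and software based on the t-distribution may return a slightly larger number.

```python
from math import ceil, sqrt
from statistics import NormalDist


def ci_rate(p_percent, n, level=0.95):
    """Confidence interval for an event rate given in percent (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)        # 1.96 for a 95% interval
    half_width = z * sqrt(p_percent * (100 - p_percent) / n)
    return p_percent - half_width, p_percent + half_width


def n_per_group_means(diff, sd, alpha=0.05, power=0.80):
    """Cases per group needed to detect a difference of means (assumed formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)             # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sd / diff) ** 2)


# Open fractures: 9 infections in 56 patients (16%)
low, high = ci_rate(16, 56)
print(f"CI95: {low:.1f} to {high:.1f}")              # CI95: 6.4 to 25.6

# Return to work: 10 weeks versus 8 weeks, SD 3
print(n_per_group_means(diff=2, sd=3))               # 36
```

A vague question ("is it better?") gives the second function nothing to work with; it needs a difference, an SD, and the two error levels.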
Outcome Measures
Survival, organ failure, hospital stay, recurrence, complication rate, sepsis, lab values, wound infection, mobility, wellbeing, pain, fear, independence / autonomy, depression, fatigue, social status, blood pressure, anxiety

Select Outcome Measure
• Relevance
  Does this endpoint convince the patient / the scientific community?
• Reliability; measurability
  Can the outcome be measured easily, without much variation, also by different people?
• Sensitivity
  Does the intervention lead to a significant change in the outcome measure?
• Robustness
  How much is the endpoint influenced by other factors?

Select Outcome Measure
• Primary endpoint
  Main hypothesis or core question; aim of the study
  Statistics: confirmative
• Secondary endpoints
  Other interesting questions, additional endpoints
  Statistics: explorative (could be confirmative in case of a large difference)
  Advantage: prospective selection in the study protocol
• Retrospectively selected endpoints
  Selected when the trial is done, based on subgroup differences
  Statistics: ONLY explorative!

Sample Size Calculation
Sample size depends on:
- certainty: α error
- power
- difference to be detected

Statistical Testing
A statistical test is a method (or tool) to decide whether an observed difference* is really present or just based on variation by chance.
* this is true for a test for difference, which is the most frequently applied one in medicine

Statistical Testing
Test for difference: "Intervention A is better than B"
Test for equivalence: "Intervention A and B have the same effect"
Test for non-inferiority: "Intervention A is not worse than B"

Statistical Testing
How a test procedure works
1. Want to show: there is a difference
2. Assume: there is NO difference between the groups ("equal effects", null hypothesis)
3. Try to disprove this assumption:
   - perform study / experiment
   - measure the difference
4. Calculate: the probability that such a difference could occur although the assumption ("no difference") is true = p-value

Statistical Testing
Statistical test for difference:
The p-value is the probability that a difference like the observed one occurs just by chance.
Simplified: p is the probability for "no difference".

Statistical Testing
Null hypothesis: "Germany and Spain are equally strong soccer teams!"
Trial: game tonight, n = 6
Result: 6 : 0 for Germany
Statistical test: p = 0.031
The p-value says how big the chance is that one of two equally strong teams scores 6 goals and the other one none.
Spain could still be as strong as Germany, but the chance is small (3.1%).

Statistical Testing
                    small sample    large sample
small difference    p = 0.68        p = 0.05
large difference    p = 0.05        p < 0.001

Statistical Testing
The more cases are included, the better "equality" can be disproved.
Example: drug A has a success rate of 80%, while drug B is better with a healing rate of 90%.

sample size    drug A (80%)    drug B (90%)    p-value
20             8/10            9/10            0.53
40             16/20           18/20           0.38
100            40/50           45/50           0.16
200            80/100          90/100          0.048
400            160/200         180/200         0.005
1000           400/500         450/500         <0.001

Statistical Testing
A "significant" p-value does NOT prove the size of the difference, but only excludes equality!

Statistical Testing
p-value large (> 0.05)
- The observed difference is probably caused by chance only, or the sample size is not sufficient to exclude chance
- null hypothesis is maintained
- "no difference"
p-value small (≤ 0.05)
- chance alone is not sufficient to explain this difference
- there is a systematic difference
- null hypothesis is rejected
- "significant difference"

Statistical Testing
Errors
The decision
- for a difference (significance, p ≤ 0.05)
- or against it ("equality", not significant, p > 0.05)
is not certain but only a probability (p-value).
Therefore, errors are possible:
Type 1 error: decision for a difference although there is none => wrong finding
Type 2 error: decision for "equality" although there is a difference => missed finding

Statistical Testing
Errors
                                      Truth
Test says ...      no difference                       difference
significant        type 1 error (α): wrong finding     correct
not significant    correct                             type 2 error (β): missed finding

Statistical Testing
                  type 1 error (α): "wrong finding"         type 2 error (β): "missed finding"
Fire detector     false alarm                               no alarm in case of fire
Court             conviction of an innocent person          a criminal is set free
Clinical study    difference was "significant" by chance    difference was missed

Power
"What is the power of the study?"
Type 2 error β: probability to miss a difference
Power = 1 - β: probability to detect a difference
Power depends on:
- the magnitude of the difference
- the sample size
- the variation of the outcome measure
- the significance level (α)

Power
"What is the power of the study?"
POWER is the probability to detect a certain difference X as significant (at level α) with the given sample size n.
"Does the study have enough power to detect a difference of size X?"

Power
When to perform power calculations?
1. Planning phase (sample size calculation): if the assumed difference really exists, what risk would I take to miss this difference?
2. Final analysis, in case of a non-significant result: what size of difference could be rejected with the present data?

Power
Example
Clinical trial: laparoscopic versus open appendectomy
Endpoint: maximum post-operative pain intensity (VAS 0-100 points)
Patients: 30 cases per group
Results: lap.: 28 (SD 18), open: 32 (SD 17)
p = 0.38, not significant!
What is the power of the study ???
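One way to approach the question on the appendectomy slide is sketched below. The slides give no answer, so the numbers are my own calculation under stated assumptions: a two-sided test at α = 0.05, a normal approximation instead of the t-distribution, and the observed SDs taken as the true ones. The function names `power_two_means` and `detectable_difference` are hypothetical. Under these assumptions the power to detect the observed 4-point difference comes out at roughly 14%, and the smallest difference detectable with 80% power is about 13 VAS points.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def standard_error(sd1, sd2, n_per_group):
    """Standard error of the difference between two group means."""
    return sqrt(sd1 ** 2 / n_per_group + sd2 ** 2 / n_per_group)


def power_two_means(diff, sd1, sd2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test for a true difference `diff`."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(diff) / standard_error(sd1, sd2, n_per_group)
    # Probability of landing in either rejection region when the difference is real
    return STD_NORMAL.cdf(shift - z_alpha) + STD_NORMAL.cdf(-shift - z_alpha)


def detectable_difference(sd1, sd2, n_per_group, alpha=0.05, power=0.80):
    """Smallest true difference that would be detected with the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_beta) * standard_error(sd1, sd2, n_per_group)


# Laparoscopic 28 (SD 18) versus open 32 (SD 17), 30 cases per group
print(f"power for a 4-point difference: {power_two_means(4, 18, 17, 30):.2f}")
print(f"detectable with 80% power: {detectable_difference(18, 17, 30):.1f} points")
```

Read this way, the non-significant result says little about small differences: a study of this size would miss a true 4-point difference most of the time.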
Sample Size Calculation
Sample size depends on:
- certainty: α = 0.05
  α error: risk to find a difference by chance
- power: β = 0.20
  β error: risk to miss a real difference
- difference to be detected: PT & PC, or difference & SD
  Event rates: percentages in the treatment group (PT) and the control group (PC)
  Continuous measures: difference of means and standard deviation

Sample Size Calculation
Continuous Endpoints
SD unknown
If the variation (standard deviation) is not known, the expected advantage can be expressed as an "effect size", which is the difference in units of the (unknown) SD.
Example:
• pain values are at least 1 SD below the control group (effect size = 1.0)
• the difference will be at least half an SD (effect size = 0.5)

Sample Size Calculation
Continuous Endpoints
Test with non-parametric rank statistics
• non-normal distribution, or non-metric values
• Mann-Whitney U-test; Wilcoxon test
Use the t-test for sample size calculation and add 10% of cases.

Sample Size Calculation
Guess ...
How many patients are needed to show that a new intervention is able to reduce the complication rate from 20% to 14%?
(α = 0.05; β = 0.20, i.e. 80% power)

Sample Size Calculation
Dupont WD, Plummer WD
Power and Sample Size Calculations: A Review and Computer Program
Contr. Clin. Trials (1990) 11:116-128
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize

Sample Size Calculation
Multiple Testing
• More than one experimental / treatment group
• Several outcome measures
• Several follow-up time points
• Interim analyses
• Subgroup analyses
Multiple testing increases the risk of significant results that arise by chance.
Overall statistical error in 8 tests at the 0.05 level:
$$ \alpha = 1 - 0.95^{8} = 1 - 0.66 = 0.34 $$

Multiple Testing
                                  correct    at least 1 error
• 1 test (with 5% error)          95%        5%
• 2 tests (with 5% error each)    90.25%     9.75%
• 3 tests
• 4 tests
• 5 tests
• ...
Two tests in detail: both correct 90.25%, error in the first only 4.75%, error in the second only 4.75%, errors in both 0.25%
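The overall error quoted above for 8 tests, and the two-test breakdown, follow from multiplying the per-test probabilities. A minimal sketch, assuming the tests are independent of each other (which the 1 - 0.95^8 formula implies); the function name `familywise_error` is chosen for illustration only.

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false-positive result in k independent tests."""
    return 1 - (1 - alpha) ** k


# Overall error for 1 to 5 tests and for the 8-test example
for k in (1, 2, 3, 4, 5, 8):
    all_correct = (1 - 0.05) ** k
    print(f"{k} test(s): correct {all_correct:.1%}, "
          f"at least 1 error {familywise_error(k):.1%}")

# Breakdown for two tests at the 5% level
alpha = 0.05
both_correct = (1 - alpha) * (1 - alpha)   # 90.25%
only_one_wrong = (1 - alpha) * alpha       # 4.75%, once for each of the two tests
both_wrong = alpha * alpha                 # 0.25%
print(f"{both_correct:.2%} {only_one_wrong:.2%} {only_one_wrong:.2%} {both_wrong:.2%}")
```

With correlated endpoints the true overall error is usually somewhat lower than this independent-tests figure, so the numbers are best read as an upper guide.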
Multiple Testing
                                  correct    at least 1 error
• 1 test (with 5% error)          95%        5%
• 2 tests (with 5% error each)    90.2%      9.8%
• 3 tests                         85.7%      14.3%
• 4 tests                         81.5%      18.5%
• 5 tests                         77.4%      22.6%
• ...

Multiple Testing
What could you do?
• Select ONE primary and multiple secondary questions
• Combination of endpoints
  - multiple complications: "negative event"
  - multiple time points: AUC, maximum value, time to normal
  - multiple endpoints: sum score acc. to O'Brien
• Adjustment of p-values, i.e. each endpoint is tested at a stricter α level
  e.g. Bonferroni: k tests at level α / k (5 tests at the 1% level, instead of 1 test at the 5% level)
• A priori ordered hypotheses
  predefine the order of tests (each at the 5% level)

Interim Analysis
• Fixed sample size: analysis at the end of the trial
• Sequential design: analysis after each case
• Group sequential design: analysis after each step
• Adaptive design: analysis after each step

Interim Analysis
from: TR Fleming, DP Harrington, PC O'Brien
Designs for group sequential tests.
Contr. Clin. Trials (1984) 5: 348-361