Susan Stewart, Ph.D.
UC Davis School of Medicine
November, 2014

Intro to sample size determination
Basic concepts
Estimating sample size parameters
Response variables
◦ Continuous
◦ Categorical
◦ Time-to-event
Components of sample size estimation

Why is it a good idea to do a sample size calculation?
Why shouldn’t you just pick a size that’s convenient?
Because the sample might be too small to help you answer your research question,
Or the sample might be much larger than you need.

Primary objective of a clinical trial: to evaluate the efficacy and safety of an intervention.
Efficacy evaluation
◦ Compare the average response in the intervention and control groups in the study sample.
◦ Decide whether the difference between the groups indicates a true difference between treatments.
Usually the efficacy evaluation is performed in the context of a hypothesis test.

Problem: Determine whether or not the population means of the intervention and control groups truly differ with respect to the outcome of interest.
◦ We regard the intervention and control samples as being drawn from the target population.
Solution: Assume that the two groups do not differ, and see if the sample data disagree with this assumption. That is, perform a hypothesis test.

The null hypothesis (H0) assumes that there is no difference in outcome between the two groups.
The alternative hypothesis (HA) assumes that one group has a more favorable outcome than the other.
The research hypothesis is usually the alternative hypothesis.

To do a hypothesis test:
◦ Calculate a test statistic from the data.
◦ Determine whether the value of the test statistic is likely or unlikely under the null hypothesis.
◦ If the value is very unlikely, reject the null hypothesis.

Problem: we might reject the null hypothesis when it is true.
◦ That is, we might commit Type I error.
Solution: Construct the test so that there is only a 5% chance of incorrectly rejecting the null hypothesis.
◦ That is, the level of the test (alpha) is 0.05.

Hypothesis tests can be 1-sided or 2-sided
◦ 1-sided: tests for differences in one direction only, e.g., higher response rate in the intervention group than in the control group
◦ 2-sided: tests for differences in both directions, e.g., either higher or lower response rate in the intervention group than in the control group
Even if you are primarily interested in one direction, it is customary to do a 2-sided test.

The p-value is the probability under the null hypothesis of obtaining data at least as extreme as that of the sample.
◦ That is, the p-value measures the strength of the evidence against the null hypothesis.
For a level 0.05 test, we reject the null hypothesis if the p-value is 0.05 or less.

Problem: we might fail to reject the null hypothesis when the alternative is true.
◦ That is, we might commit Type II error.
Solution: Select a large enough sample so that there is an 80% chance of rejecting the null hypothesis if the alternative is true.
◦ Then the power to detect the alternative is 80%.

Specify null and alternative hypotheses, type I error rate, and power.
Define the population under study.
Gather information relevant to parameters.
If measuring time to failure, model recruitment process and choose length of follow-up period.
Calculate sample size over range of parameters.
Select sample size to use.
Epidemiol Rev, 2002; 24(1):39-53

Parameters include
◦ Variability of the response
◦ Level of the response variable in the control group
◦ Difference anticipated or judged clinically relevant
May also need to consider
◦ Loss to follow-up
◦ Noncompliance
Sources of information
◦ Pilot studies: external or internal
◦ Literature: what others have found in similar studies

When a response variable is normally distributed, the difference between the means of two independent samples is assessed with a 2-sample t-test.
◦ The t-test is robust to departures from normality.
◦ May need to transform the response variable (e.g., log transform) to obtain approximate normality.
The sample size for a z-test usually can be used to estimate the sample size for a t-test.
◦ A z-test assumes that the population standard deviation is known.

$$z = \frac{\bar{x} - \bar{y}}{\sigma\sqrt{2/n}}$$
$\bar{x}$ = intervention group mean
$\bar{y}$ = control group mean
$\sigma^2$ = common variance in each group
$n$ = sample size in each group
Epidemiol Rev, 2002; 24(1):39-53 (eq. 1)

$$n = \frac{2\sigma^2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta_A^2}$$
$\sigma^2$ = common variance in each group
$z_{1-\alpha/2}$ = critical value for 2-sided level $\alpha$ test
$z_{1-\beta}$ = value of a standard normal variable with cumulative probability equal to $1-\beta$ (power)
$\Delta_A$ = difference corresponding to alternative hypothesis
Epidemiol Rev, 2002; 24(1):39-53 (eq. 2)

Randomized, age-matched
Healthy post-menopausal Chinese women within 10 years of menopause onset
Exclusion criteria
◦ Regular participation in exercise
◦ Hormone replacement therapy or drug treatment affecting bone density
◦ Hypo- or hyper-parathyroidism, hypo- or hyperthyroidism, renal or liver disease
◦ History of fractures
◦ BMI over 30
Arch Phys Med Rehabil 2004; 85:717-22

Intervention: Supervised TCC exercise (Yang style) 50 minutes a day, 5 times a week, for 12 months
Control: Retained sedentary lifestyle
Primary outcome: Change in bone mineral density over 12 months
◦ Areal BMD at lumbar spine and proximal femur measured by dual x-ray absorptiometry (DXA)
◦ Volumetric BMD in distal tibia measured by multislice peripheral quantitative computed tomography (pQCT)

Null hypothesis
◦ Rate of bone mineral loss is the same in both study arms.
Alternative hypothesis
◦ Rate of bone mineral loss is different (i.e., lower) in the intervention (TCC) group.
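
As a rough illustration of eq. 2, the following Python sketch computes the per-group sample size for a 2-sided two-sample z-test and then inflates it for dropout. The function and parameter names are hypothetical (they do not come from the slides), and the usage lines plug in the planning values of the TCC trial (pooled SD 2.37%, difference 1.4%, 25% dropout).

```python
from statistics import NormalDist


def n_per_group_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a 2-sided two-sample z-test (eq. 2), before rounding.

    sigma: common standard deviation in each group
    delta: difference in means under the alternative hypothesis
    """
    std_normal = NormalDist()  # mean 0, SD 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)           # about 0.842 for 80% power
    return 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2


def inflate_for_dropout(n, dropout):
    """Number to enroll so that about n remain after the given dropout fraction."""
    return n / (1 - dropout)


if __name__ == "__main__":
    n = n_per_group_means(sigma=2.37, delta=1.4)
    print("per group:", round(n))
    print("per group, allowing 25% dropout:", round(inflate_for_dropout(n, 0.25)))
```

Dividing by (1 - dropout) is one simple way to allow for loss to follow-up; it assumes dropouts contribute no outcome data.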
Level of the test: 0.05 (2-sided)
Power: 80%
Mean bone loss in control group: 2.8%
◦ Average annual trabecular bone loss in previous study in same population
Mean bone loss in intervention group: 1.4%
◦ 50% reduction
Standard deviation in each group
◦ Based on previous study, ~same as mean: 3.0% in control group, 1.5% in intervention group (say)
◦ Compute pooled SD = 2.37%
Dropout: 25% in one year

$\sigma^2$ = common variance in each group = $2.37^2 = 5.62$
$z_{1-\alpha/2}$ = critical value for 2-sided level $\alpha$ test = 1.96
$z_{1-\beta}$ = value of a standard normal variable with cumulative probability equal to $1-\beta$ (power) = 0.842
$\Delta_A$ = difference corresponding to alternative hypothesis = 1.4
$$n = \frac{2\sigma^2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta_A^2} = \frac{2(5.62)(1.96 + 0.842)^2}{1.4^2} = 45 \text{ per group}$$
$45/0.75 = 60$ per group, accounting for dropouts
Actual enrollment n=132 total
https://stattools.crab.org/

When a response variable is categorical, a chi-square test of independence is often used to compare two groups.
When there are only 2 categories, this is the same as testing for a difference in proportions.
Need to specify the response proportion in the control group and
◦ The response proportion in the intervention group, or
◦ The odds ratio

$$n = \frac{\left[z_{1-\alpha/2}\sqrt{2\bar{\pi}(1-\bar{\pi})} + z_{1-\beta}\sqrt{\pi_c(1-\pi_c) + \pi_t(1-\pi_t)}\right]^2}{(\pi_c - \pi_t)^2}$$
$$n' = \frac{n}{4}\left[1 + \sqrt{1 + \frac{4}{n\,|\pi_c - \pi_t|}}\right]^2$$
$\pi_c$ = probability of event in control group
$\pi_t$ = probability of event in intervention group
$\bar{\pi}$ = average probability of event
$n'$ = number needed in each group
Epidemiol Rev, 2002; 24(1):39-53 (eq. 7B, 7C)

Study aim: test an outreach and counseling intervention to reduce cervical cancer incidence & mortality in low income women
Setting: Highland General Hospital (HGH)
Time frame: 3 years
Outcome measure: proportion of women who received initial follow-up at Highland within 6 months of an abnormal Pap test
Prev Med 2005; 41: 741-8

Null hypothesis
◦ Rate of follow-up of abnormal Pap tests is the same in both study arms.
Alternative hypothesis
◦ Rate of follow-up of abnormal Pap tests is different (i.e., greater) in the intervention group.

Assume 60% follow-up in control group based on previous research
Assume 75% follow-up in intervention group, a clinically important difference achieved in similar interventions
To detect this difference at the 0.05 level (2-sided) with 80% power: n=165 per arm
No loss to follow-up: outcome ascertained through medical records

$$n = \frac{\left[1.96\sqrt{2(0.675)(0.325)} + 0.842\sqrt{(0.6)(0.4) + (0.75)(0.25)}\right]^2}{(0.60 - 0.75)^2} = 152$$
$$n' = \frac{152}{4}\left[1 + \sqrt{1 + \frac{4}{152\,|0.60 - 0.75|}}\right]^2 = 165$$
$\pi_c$ = probability of event in control group = 0.60
$\pi_t$ = probability of event in intervention group = 0.75
$\bar{\pi}$ = average probability of event = 0.675
$n'$ = number needed in each group = 165
https://stattools.crab.org/

The log rank test is often used to compare two survival curves.
Most sample size calculations assume an exponential survival distribution.
$S(t) = e^{-\lambda t}$, where $t$ = time, $S(t)$ = probability of survival to time $t$, and $\lambda$ = hazard rate = risk of an event per time unit

Hazard rate: number of events per 100 person years
Median survival time = $\log_e(2)$/(hazard rate)
Hazard rate = $\log_e(2)$/(median survival time)
Hazard rate = $-\log_e(S(t))/t$, where $S(t)$ = probability of surviving to time $t$ = expected proportion without an event by $t$

$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2\left[\phi(\lambda_C) + \phi(\lambda_I)\right]}{(\lambda_I - \lambda_C)^2}$$
where
$$\phi(\lambda) = \frac{\lambda^2}{1 - \left[e^{-\lambda(T - T_0)} - e^{-\lambda T}\right]/(\lambda T_0)}$$
$n$ = number per group
$\lambda_I$ = hazard rate in intervention group
$\lambda_C$ = hazard rate in control group
$T$ = total time of trial (first entry to end of study)
$T_0$ = recruitment time (first entry to last entry)

$$D = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{p(1-p)\left(\ln(\lambda_C/\lambda_I)\right)^2}$$
where
$D$ = number of events required to detect the hazard ratio with power $1-\beta$ at level $\alpha$ (2-sided)
$\lambda_I$ = hazard rate in intervention group
$\lambda_C$ = hazard rate in control group
$p$ = proportion of participants in the control group

Primary research goal: Determine whether performing surgery of the primary tumor followed by
systemic therapy improves survival in a certain patient population, compared with systemic therapy only.
Patient population: Patients with synchronous unresectable metastases of colorectal cancer and few or absent symptoms
Primary outcome: Overall survival
Study design: Multi-center randomized phase III trial.
BMC Cancer 2014; 14:741

Null hypothesis
◦ Overall survival is not affected by surgery of the primary tumor before systemic therapy in this patient population.
Alternative hypothesis
◦ Surgery of the primary tumor improves overall survival in this patient population.

Level of the test: 0.05 (2-sided)
Power: 80%
Median survival in control group: 13 months
Median survival in intervention group: 19 months
◦ Minimal difference to justify a surgical procedure
Recruitment period: 30 months
Minimum follow-up: 8 months
Total sample size: 360

$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2\left[\phi(\lambda_C) + \phi(\lambda_I)\right]}{(\lambda_I - \lambda_C)^2}$$
where
$$\phi(\lambda) = \frac{\lambda^2}{1 - \left[e^{-\lambda(T - T_0)} - e^{-\lambda T}\right]/(\lambda T_0)}$$
$\alpha = 0.05$; $z_{1-\alpha/2} = 1.96$; $\beta = 0.20$; $z_{1-\beta} = 0.842$
$\lambda_I$ = hazard rate in intervention group = $\ln(2)/(19/12) = 0.438$ per year
$\lambda_C$ = hazard rate in control group = $\ln(2)/(13/12) = 0.640$ per year
hazard ratio = 19/13 = 1.46
$T$ = total time of trial (first entry to end of study) = 38/12 = 3.167 years
$T_0$ = recruitment time (first entry to last entry) = 2.5 years
$\phi(\lambda_C) = 0.607$; $\phi(\lambda_I) = 0.351$; $2n = 368$; required # of events = 218
https://stattools.crab.org/

$\alpha$ (level): larger → smaller sample size
$1-\beta$ (power): larger → larger sample size
Variance: larger → larger sample size
◦ Binary variable: $\pi$ (probability of event) = 0.5 has largest variance
Difference to detect: larger → smaller sample size

Problem: Sometimes the sample size required is too large.
Solutions:
◦ Be content to detect with less power (allow more type II error).
◦ Increase the level of the test (allow more type I error).
◦ Pick a more extreme alternative.
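
The trade-offs listed above can be explored numerically. The following is a minimal Python sketch using the two-proportion formulas cited earlier (eq. 7B, 7C). The function name is hypothetical, and the control response of 50% is an assumption: the slides do not state the control rate behind the table of sample sizes that follows.

```python
import math
from statistics import NormalDist


def n_per_group_props(p_c, p_t, alpha=0.05, power=0.80):
    """Per-group n for comparing two proportions: eq. 7B adjusted by eq. 7C."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_c + p_t) / 2   # average probability of event
    diff = abs(p_c - p_t)
    # eq. 7B: unadjusted sample size n
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p_c * (1 - p_c) + p_t * (1 - p_t))) ** 2 / diff**2
    # eq. 7C: adjusted sample size n'
    return n / 4 * (1 + math.sqrt(1 + 4 / (n * diff))) ** 2


if __name__ == "__main__":
    P_CONTROL = 0.50  # assumed control response; not given in the slides
    for alpha, power in [(0.05, 0.90), (0.05, 0.80), (0.10, 0.80)]:
        row = [round(n_per_group_props(P_CONTROL, p_t, alpha, power))
               for p_t in (0.60, 0.65)]
        print(f"level {alpha:.0%}, power {power:.0%}: n per group = {row}")
```

Each of the three solutions shows up as a smaller result: lowering power from 90% to 80%, raising the level from 5% to 10%, or moving the alternative from 60% to 65%.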

                  % Response in Intervention Group
Level    Power    60%      65%
5%       90%      538      239
5%       80%      407      182
10%      80%      325      146

Parameters used to estimate sample size are estimates
◦ Often based on small studies
Effectiveness of the intervention
◦ May be based on a different population
◦ May be overestimated
Inclusion and exclusion criteria may change
Control group participants may do better than expected
Mathematical models for sample size calculations are approximate

www.statpages.org
www.swogstat.org/statoolsout.html
https://stattools.crab.org/
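
For checking the time-to-event example without the online calculators listed above, here is a minimal Python sketch of the exponential-survival formulas from the slides. The function names are hypothetical, times are expressed in years, and equal allocation (p = 0.5) is an assumption for the events calculation.

```python
import math
from statistics import NormalDist


def phi(lam, total_time, recruit_time):
    """phi(lambda) as defined in the slides; T = total_time, T0 = recruit_time."""
    decay = math.exp(-lam * (total_time - recruit_time)) - math.exp(-lam * total_time)
    return lam**2 / (1 - decay / (lam * recruit_time))


def survival_sample_size(median_c, median_i, total_time, recruit_time,
                         alpha=0.05, power=0.80, p_control=0.5):
    """Return (n per group, required number of events) for a 2-sided test."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    lam_c = math.log(2) / median_c  # hazard rate, control group
    lam_i = math.log(2) / median_i  # hazard rate, intervention group
    n = (z_sum**2
         * (phi(lam_c, total_time, recruit_time) + phi(lam_i, total_time, recruit_time))
         / (lam_i - lam_c) ** 2)
    events = z_sum**2 / (p_control * (1 - p_control) * math.log(lam_c / lam_i) ** 2)
    return n, events


if __name__ == "__main__":
    # Medians of 13 and 19 months, 30 months recruitment plus 8 months follow-up.
    n, events = survival_sample_size(13 / 12, 19 / 12, 38 / 12, 30 / 12)
    print("total sample size (2n):", round(2 * n))
    print("required events:", round(events))
```

Because the inputs are themselves estimates, it is worth rerunning such a sketch over a range of medians and recruitment times rather than relying on a single value.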