How many patients do I need for my study?
Realistic Sample Size Estimates for Clinical Trials

Sample Size Estimation
1. General considerations
2. Continuous response variable
   – Parallel group comparisons
     • Comparison of response after a specified period of follow-up
     • Comparison of changes from baseline
   – Crossover study
3. Success/failure response variable
   – Impact of non-compliance, lag
   – Realistic estimates of control event rate (Pc) and event rate pattern
   – Use of epidemiological data to obtain realistic estimates of experimental group event rate (Pe)
4. Time to event designs and variable follow-up

Useful References
• Lachin JM, Cont Clin Trials, 2:93-113, 1981 (a general overview)
• Shih J, Cont Clin Trials, 16:395-407, 1995 (time to event studies with dropouts, dropins, and lag issues) – see size program on biostatistics network
• Farrington CP and Manning G, Stat Med, 9:1447-1454, 1990 (sample size for equivalence trials)
• Whitehead J, Stat Med, 12:2257-2271, 1993 (sample size for ordinal outcomes)
• Donner A, Amer J Epid, 114:906-914, 1981 (sample size for cluster randomized trials)

Key Points
• Sample size should be specified in advance (often it is not)
• Sample size estimation requires collaboration and some time to do it right (it is not solely a statistical exercise)
• Often sample size is based on uncertain assumptions (estimates should consider a range of values for key parameters, and the impact on power of small deviations from the final assumptions should be considered)
• Parameters that do not involve the treatment difference (e.g., SD) on which sample size was based should be evaluated by protocol leaders (who are blinded to treatment differences) during the trial
• It pays to be conservative; however, the ultimate size and duration of a study involve compromises, e.g., power, costs, timeliness.
Some Evidence that Sample Size is Not Considered Carefully: A Survey of 71 “Negative” Trials (Freiman et al., NEJM, 1978)
Trials included:
• Authors stated “no difference”
• P-value > 0.10 (2-sided)
• Success/failure endpoint
• Expected number of events > 5 in control and experimental groups
Method:
• Using the stated Type I error and control group event rate, power was determined corresponding to:
  – 25% difference between groups
  – 50% difference between groups

[Figure: Frequency distribution of power estimates for 71 “negative” trials, 25% reduction. x-axis: power (1 - β) in bands from 0-9 to 90-99; y-axis: number of trials. Annotated value: 5.63%. Reference: Freiman et al, NEJM 1978.]

[Figure: Frequency distribution of power estimates for 71 “negative” trials, 50% reduction. x-axis: power (1 - β) in bands from 0-9 to 90-99; y-axis: number of trials. Annotated value: 29.58%. Reference: Freiman et al, NEJM 1978.]

Implications of Review by Freiman et al.
• Many investigations do not estimate sample size in advance
• Many studies should never have been initiated; some were stopped too soon
• A “non-significant” difference does not mean there is not an important difference
• Design estimates (in Methods) are important to interpret study findings
• Confidence intervals should be used to summarize treatment differences

[Figure: Percent of studies with at least 80% power to detect 25% and 50% differences, by year (1975, 1980, 1985, 1990). Moher et al, JAMA, 272:122-124, 1994.]

These Results Emphasize the Importance of Understanding that the Size of the P-Value Depends on:
• Magnitude of difference (strength of association); and
• Sample size
“Absence of evidence is not evidence of absence”, Altman and Bland, BMJ 1995; 311:485.
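The kind of calculation behind the survey above (power for a 25% or 50% reduction, given the control event rate and Type I error) can be sketched as follows. This is a minimal illustration using a common normal approximation for comparing two proportions, not necessarily the method Freiman et al. used; the function name `power_two_proportions`, the control event rate of 0.30, and the group size of 50 are all hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, reduction, n_per_group, alpha=0.05):
    """Approximate power of a 2-sided test comparing two event rates.

    Normal approximation with equal group sizes; the negligible
    probability of rejecting in the wrong direction is ignored.
    """
    nd = NormalDist()
    p_exp = p_control * (1.0 - reduction)        # event rate after the assumed reduction
    p_bar = (p_control + p_exp) / 2.0            # pooled rate under the null hypothesis
    se_null = sqrt(2.0 * p_bar * (1.0 - p_bar) / n_per_group)
    se_alt = sqrt((p_control * (1.0 - p_control) + p_exp * (1.0 - p_exp)) / n_per_group)
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf((abs(p_control - p_exp) - z_crit * se_null) / se_alt)


# Hypothetical small trial: 50 patients per group, 30% control event rate.
for reduction in (0.25, 0.50):
    pw = power_two_proportions(0.30, reduction, 50)
    print(f"{int(reduction * 100)}% reduction: power = {pw:.2f}")
```

Under these assumed inputs the power for a 25% reduction is far below the conventional 80%, which is the pattern the survey reports for most of the 71 trials.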
Steps in Planning a Study
1) Specify the precise research question
2) Define target population
3) Assess feasibility of studying question (compute sample size)
4) Decide how to recruit study participants, e.g., single center, multi-center, and make sure you have back-up plans

Beginning: A Protocol Stating Null and Alternative Hypotheses Along with Significance Level and Power
Null hypothesis (H0)
  Hypothesis of no difference or no association
Alternative hypothesis (HA)
  Hypothesis that there is a specified difference (Δ)
  – No direction specified (2-tailed)
  – A direction specified (1-tailed)
Significance Level (α): Type I Error
  The probability of rejecting H0 given that H0 is true
Power = (1 - β): (β = Type II Error)
  Power is the probability of rejecting H0 when the true difference is Δ

End: Test of Significance According to Protocol
Statistically significant?
• Yes: Reject H0. Sampling variation is an unlikely explanation for the discrepancy.
• No: Do not reject H0. Sampling variation is a likely explanation for the discrepancy.

Normal Distribution
If Z is large (lies in yellow area), we assume the difference in means is unlikely to have come from a distribution with mean zero.

Continuous Outcome Example
Observations:
• Many people have stage 1 (mild) hypertension (SBP 140-159 or DBP 90-99 mmHg)
• For most, treatment is life-long
• Many drugs which lower BP produce undesirable symptoms and metabolic effects (new drugs are needed)
Research Question: Can new drug T adequately control BP for patients with mild hypertension?
Objective: To compare new drug T with diuretic treatment for lowering diastolic blood pressure (DBP)

Parallel Group Design Comparing Average Diastolic BP (DBP) After One Year
Hypothesis
H0: DBP after one year of treatment with new drug T equals the DBP for patients given a diuretic (control)
HA: DBP after one year is different for patients given new drug T compared to diuretic treatment (difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: DBP at year 1
• Diuretic: DBP at year 1

Parallel Group Design Comparing Average Difference (Year 1 – Baseline) in DBP
Hypothesis
H0: DBP change from baseline after one year of treatment with new drug T equals the DBP change from baseline after one year for patients given a diuretic (control)
HA: DBP change from baseline after one year of treatment with new drug T is different from the DBP change from baseline after one year for patients given a diuretic (control) (difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: Change in DBP (Year 1 – Baseline)
• Diuretic: Change in DBP (Year 1 – Baseline)

Why Δ = 4 mmHg?
An important difference on a population-wide basis
Observational studies (Lancet 2002;360:1903-13)
• 58 studies; 958,074 participants
• 5 mmHg lower DBP among those 40-59 years
• 41% (30%) lower risk of death from stroke (CHD)
Clinical trials (Lancet 1990;335:827-38)
• 14 randomized trials; 36,908 participants
• 5-6 mmHg DBP difference (treatment vs. control)
• 28% reduction in fatal/non-fatal CVD

Considerations in Specifying Treatment Effect (Delta)
• Smallest difference of clinical significance/interest
• Stage of research
• Realistic and plausible estimates based on:
  – previous research
  – expected non-compliance and switchover rates
  – consideration of type of participants to be studied
• Resources (compromise)
Delta is a difference that is important NOT to miss if present.
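How strongly the choice of Delta drives the size of a trial can be sketched with the usual normal-approximation formula for two equal groups, n per group = 2σ²(Z_{1-α/2} + Z_{1-β})²/Δ², which these slides develop next. The sketch below assumes the single-reading DBP variance of 94.7 (mmHg)² quoted in these slides, α = 0.05 (2-sided) and 90% power; the function name `n_per_group` is an illustrative choice.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, variance, alpha=0.05, power=0.90):
    """Patients per group for a 2-sided comparison of two means (equal allocation)."""
    nd = NormalDist()
    constant = (nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)) ** 2
    return 2.0 * variance * constant / delta ** 2


# Sample size is proportional to 1/Delta^2: halving Delta quadruples n.
for delta in (2.0, 4.0, 6.0, 8.0):
    print(f"Delta = {delta:.0f} mmHg: n per group = {ceil(n_per_group(delta, 94.7))}")
```

Because n scales with 1/Δ², an optimistic Delta is the cheapest way to make a trial look feasible and the most common way to make it underpowered.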
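Principal Determinants of Sample Size
• Size of difference considered important (Delta)
• Type I error (α) or significance level
• Type II error (β), or power (1 - β)
• Variability of response/frequency of event
Type I and Type II error enter the formula as constants.

Sample Size for Two Groups: Equal Allocation
General Formula
$$N \text{ per group} = \frac{2 \times \text{Variability} \times [\text{Constant}(\alpha, \beta)]^2}{\text{Delta}^2}$$
Delta = Δ = clinically relevant and plausible treatment difference

Sample Size Formula Derivation: One Sample Situation
Sample size has to satisfy:
$$\Pr(|Z| > Z_{1-\alpha/2}) = \alpha \ \text{if } H_0 \text{ is true, and } \Pr(|Z| > Z_{1-\alpha/2}) = 1-\beta \ \text{if } H_A \text{ is true}$$
$$H_0: \mu = \mu_0; \qquad H_A: \mu = \mu_0 + \Delta$$

Sample Size Derivation (cont.)
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}$$
Reject H0 if $|Z| > Z_{1-\alpha/2}$; for α = 0.05 (2-sided), $Z_{1-\alpha/2} = 1.96$.
$$\Pr\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}} > Z_{1-\alpha/2}\right) = 1-\beta \ \text{under } H_A$$
$$\Pr\left(\frac{\bar{X} - \mu_0 - \Delta}{\sigma/\sqrt{N}} > Z_{1-\alpha/2} - \frac{\Delta}{\sigma/\sqrt{N}}\right) = 1-\beta, \qquad \frac{\bar{X} - \mu_0 - \Delta}{\sigma/\sqrt{N}} \sim N(0,1)$$
so that $Z_{1-\alpha/2} - \sqrt{N}\,\Delta/\sigma = Z_{\beta}$. Note: $Z_{1-\beta} = -Z_{\beta}$; solve for N:
$$N = \frac{\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$

Weighing the Errors
• Type 2 error: Sponsor’s concern
• Type 1 error: Regulator’s concern

Typical Values for (Z_{1-α/2} + Z_{1-β})², Which Is the Numerator of Sample Size

Type I error (α), 2-sided (Z_{1-α/2})    Power (1 - β) (Z_{1-β})    (Z_{1-α/2} + Z_{1-β})²
0.05 (1.96)                               0.80 (0.84)                 7.84
0.05 (1.96)                               0.90 (1.28)                10.50
0.05 (1.96)                               0.95 (1.645)               13.00
0.01 (2.575)                              0.80 (0.84)                11.67
0.01 (2.575)                              0.90 (1.28)                14.86
0.01 (2.575)                              0.95 (1.645)               17.81

Example: Hypertension Study
[Figure: sampling distributions of the difference under H0 (centered at 0) and HA (centered at 4 mmHg).]
$$H_0: \mu_1 = \mu_2; \ \mu_1 - \mu_2 = 0 \qquad H_A: \mu_1 \neq \mu_2; \ \mu_1 - \mu_2 = 4 \text{ mmHg}$$
Usually formulated in terms of change from baseline (e.g., H0: D1 - D2 = 0)

Another Derivation
With $n_1 = n_2 = N$ and d the critical value of the observed difference:
$$Z = \frac{d - 0}{\sigma\sqrt{2/N}}, \qquad \Pr(Z > Z_{1-\alpha/2}) = \alpha/2 \ \text{under } H_0$$
$$Z = \frac{d - \Delta}{\sigma\sqrt{2/N}}, \qquad \Pr(Z > -Z_{1-\beta}) = 1-\beta \ \text{under } H_A$$
$$\frac{d - 0}{\sigma\sqrt{2/N}} = Z_{1-\alpha/2}, \qquad \frac{d - \Delta}{\sigma\sqrt{2/N}} = -Z_{1-\beta}$$
$$\Delta = Z_{1-\alpha/2}\,\sigma\sqrt{2/N} + Z_{1-\beta}\,\sigma\sqrt{2/N} \quad\Rightarrow\quad N = \frac{2\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$
Solve for N using these 2 equations and by noting that Δ = sum of 2 parts from the previous figure.

Sources of Variability of BP Measurements
Ref: Rose GA. Standardization of Observers in Blood Pressure Measurement. Lancet 1965;1:673-4.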
Variability of blood pressure readings
• True variations in arterial pressure
  – Known factors
    • Recent physical activity
    • Emotional state
    • Position of subject and arm
    • Room temperature and season of year
  – Unknown factors
• Measurement errors
  – Instrument
    • Inaccuracy of sphygmomanometer
    • Cuff width and length
  – Observer
    • Chiefly affecting the mean pressure estimate: mental concentration; hearing acuity; confusion of auditory and visual; interpretation of sounds; rates of inflation and deflation; reading of moving column
    • Distorting the frequency distribution curve (and sometimes affecting the mean): terminal digit preference; prejudice, e.g., excess of readings at 120/80

Estimates of Variability for Diastolic Blood Pressure Measurements (MRFIT) Estimated Using Random-Zero (R-Z) Readings

Variance component        Symbol    Estimate (mmHg)²
Between subjects          σ_s²      58.4
Within subjects           σ_e²      36.3

Estimates of Variability for Diastolic Blood Pressure Measurements Estimated Using Random-Zero (R-Z) Readings at Screen 2 and Screen 3 in MRFIT (2 Readings at Each Visit)
Within-subject variability analyzed further:

Variance component        Symbol    Estimate (mmHg)²
Between subjects          σ_s²      58.4
Between visits            σ_v²      26.1
Between readings          σ_e²      10.2

Consequences on Sample Size of Using Multiple Readings for Defining Diastolic BP
α = 0.05, 1 - β = 0.90
Inter-subject variability = 58.4 (mmHg)²
Between visit variability = 26.1 (mmHg)²
Within visit variability = 10.2 (mmHg)²

No. of visits    No. of readings/visit    N per group (Δ = 8)    N per group (Δ = 4)
1                1                        31                     124
1                2                        30                     118
2                1                        25                     100
2                2                        24                      97

Parallel Group Design Comparing Average DBP After One Year.
Hypothesis
H0: DBP after one year of treatment with new Drug T equals the DBP for patients given a diuretic (control)
HA: DBP after one year is different for patients given new Drug T compared to diuretic treatment (difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: DBP at year 1
• Diuretic: DBP at year 1

Parallel Group Studies Comparing Average DBP After One Year
1 measure, 1 visit (α = 0.05, β = 0.10)
$$\sigma^2 = 58.4 + 26.1 + 10.2 = 94.7$$
Δ = 4 mmHg:
$$H_0: \mu_T = \mu_C; \qquad H_A: \mu_T \neq \mu_C; \ \mu_C - \mu_T = 4$$
$$n_T = n_C = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2} = \frac{2(94.7)(10.5)}{4^2} = 124.3 \approx 125$$
Δ = 8 mmHg:
$$H_0: \mu_T = \mu_C; \qquad H_A: \mu_T \neq \mu_C; \ \mu_C - \mu_T = 8$$
$$n_T = n_C = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2} = \frac{2(94.7)(10.5)}{8^2} = 31.07 \approx 32$$

Parallel Group Design Comparing Average Difference (Year 1 – Baseline) in DBP
Hypothesis (2-Tailed)
H0: DBP change from baseline after one year of treatment with new Drug T equals the DBP change from baseline after one year for patients given a diuretic (control)
HA: DBP change from baseline after one year of treatment with new Drug T is different from the DBP change from baseline after one year for patients given a diuretic (control) (difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: Change in DBP (Year 1 – Baseline)
• Diuretic: Change in DBP (Year 1 – Baseline)

Sample Size for Two Groups: Equal Allocation
General Formula
$$N \text{ per group} = \frac{2 \times \text{Variability} \times [\text{Constant}(\alpha, \beta)]^2}{\text{Delta}^2}$$
Delta = Δ = clinically relevant and plausible treatment difference

Estimate of Variability for Change Outcome
• Prior studies (For MRFIT, SD of DBP change after 12 months = 9.0 mmHg [baseline is one visit, 2 readings; follow-up is one visit, 2 readings]. For comparison, SD of 12 month DBP is 9.5 mmHg)
• Use correlation (ρ) of repeat readings for participants to estimate σ_e².
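(For MRFIT, correlation of DBP at baseline and 12 months is 0.55; note that the variance of the difference can be written as 2σ_T²(1 - ρ) = 2σ_e² = 2(81)(1 - 0.55) = 72.9, so SD of change ≈ 8.5 mmHg)
• Estimate of SD of change using analysis of covariance (regression of change on baseline) (For MRFIT, SD = 7.9 mmHg)

Let $y_B$ = baseline measurement, $y_F$ = follow-up measurement, and $\sigma_t^2 = \sigma_s^2 + \sigma_e^2$.
$$\mathrm{Var}(y_F - y_B) = \mathrm{Var}(y_F) + \mathrm{Var}(y_B) - 2\,\mathrm{cov}(y_F, y_B)$$
$$\mathrm{cov}(y_F, y_B) = \rho\,\sigma_{y_F}\sigma_{y_B} = \rho\,\sigma_t^2 \quad \text{if } \sigma_{y_F} = \sigma_{y_B} = \sigma_t$$
$$\mathrm{Var}(y_F - y_B) = 2\sigma_t^2 - 2\rho\,\sigma_t^2 = 2\sigma_t^2 (1 - \rho)$$
so, with $\rho = \sigma_s^2 / (\sigma_s^2 + \sigma_e^2)$,
$$\mathrm{Var}(y_F - y_B) = 2(\sigma_s^2 + \sigma_e^2)\left(1 - \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}\right) = 2(\sigma_s^2 + \sigma_e^2)\,\frac{\sigma_e^2}{\sigma_s^2 + \sigma_e^2} = 2\sigma_e^2$$

Crossover Group Design Comparing Average Difference (Diuretic – Drug T) in DBP
Hypothesis
H0: The average of the paired differences across the two treatment sequences is zero.
HA: The average is 4 mmHg or more.
Study Population: Those with mild hypertension
• Sequence I: Drug T, washout period, Diuretic
• Sequence II: Diuretic, washout period, Drug T

Crossover Study Design

Sequence    Period 1    Period 2    Diff.
I           y1          y2          d_I
II          y1          y2          d_II

$$\mathrm{Var}(d_I) = 2\sigma_e^2, \qquad \mathrm{Var}(d_{II}) = 2\sigma_e^2$$
$$\Delta = T_T - T_C = E\left[\frac{\bar{d}_I + \bar{d}_{II}}{2}\right] = \frac{D_I + D_{II}}{2}$$

With parallel group comparison we had:
$H_0: \mu_T = \mu_C$ or $H_0: D_T = D_C$, where $D_T$ and $D_C$ refer to the difference between follow-up and baseline levels of outcome
With crossover we have:
$H_0: (D_I + D_{II})/2 = 0$, or equivalently $H_0: T_T - T_C = 0$

Variance for Sample Size Formula:
$$\mathrm{Var}\left(\frac{d_I + d_{II}}{2}\right) = \frac{1}{4}\left[\mathrm{var}(d_I) + \mathrm{var}(d_{II})\right] = \frac{1}{4}\left[2\sigma_e^2 + 2\sigma_e^2\right] = \sigma_e^2$$
Substitution into Sample Size Formula Gives:
$$n_c = n_I = n_{II} = \frac{\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2}$$
$n_I = n_{II}$ = number randomly allocated to each sequence, I (AB) or II (BA). This follows because the variance of the pooled treatment difference across the 2 sequences is ¼(2σ_e² + 2σ_e²).

Crossover Sample Size Compared to Parallel Design (no baseline)
$$\frac{n_c}{n} = \frac{\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2}{2(\sigma_s^2 + \sigma_e^2)(z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2} = \frac{\sigma_e^2}{2(\sigma_s^2 + \sigma_e^2)}$$

Crossover Sample Size Compared to Parallel Design (no baseline)
$$\rho = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}$$
$$\frac{n_c}{n} = \frac{\sigma_e^2}{2(\sigma_s^2 + \sigma_e^2)} = \frac{1 - \rho}{2}, \quad \text{since } 1 - \rho = \frac{\sigma_e^2}{\sigma_s^2 + \sigma_e^2}$$
$$n_c = \frac{(1 - \rho)\,n}{2}$$
But the crossover design will require twice the number of measurements.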
So, if ρ = 0 then the number of measurements is equal, but the sample size for the crossover is ½ that of the parallel design.

Consider an Experiment with Diastolic BP Response
Type 1 error = 0.05 (2-sided) and Power = 0.95
σ_s² = 58.4 (mmHg)², σ_e² = 36.3 (mmHg)²
$$\frac{n_c}{n} = \frac{36.3}{2(58.4 + 36.3)} = 0.19$$
5 times more patients needed for parallel group design
For Δ = 5: n_c = 19, n = 99

Examples

Response                                   σ_s²     σ_e²    ρ       n_c/n
DBP (mmHg)                                 58       36      0.62    0.19
Cholesterol (mg/dl)                        1200     400     0.75    0.125
Overnight urine excretion Na+ (meq/8 hours)
  1 overnight                              325      625     0.34    0.33
  2 overnights                             325      312     0.51    0.24
  7 overnights                             325      90      0.78    0.11

With parallel group comparisons with baseline we need to consider
$$\mathrm{Var}(y_A - y_B) = 2(\sigma_s^2 + \sigma_e^2) \quad \text{or} \quad \mathrm{Var}(d_A - d_B) = 4\sigma_e^2$$
With crossover we need to consider
$$\mathrm{Var}\left(\frac{d_I + d_{II}}{2}\right) = \frac{1}{4}\left(2\sigma_e^2 + 2\sigma_e^2\right) = \sigma_e^2$$
$$\frac{n_c \ (\text{crossover})}{n \ (\text{parallel with baseline})} = \frac{\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2}{4\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2} = \frac{1}{4}$$
Regardless of what σ_e² or ρ is, 4 times more patients are required for a parallel group design which uses baseline compared to crossover.

Sample size for α = .05 (2-sided) and β = .05

                                          Standardized difference (Δ/σ)
Design                                    0.4     0.6     0.8     1.0
Parallel, number/group (no baseline)      163     72      41      26
Parallel with baseline, number/group
  (ρ = 0.75)                              80      36      20      12
Crossover, number/sequence
  ρ = 0.00                                82      36      21      13
  ρ = 0.25                                62      27      15      10
  ρ = 0.50                                41      18      10      7
  ρ = 0.75                                20      9       5       3

Key Points
• Sample size should be specified in advance
• Sample size estimation requires collaboration
• Often sample size is based on uncertain assumptions; therefore estimates should consider a range of values for key parameters (i.e., investigate the impact on power if the planned sample size and treatment effect are not achieved)
• Parameters on which sample size is based should be evaluated during the trial
• It pays to be conservative; however, the ultimate size and duration of a study involve compromises, e.g., power, costs, timeliness.

Power
A measure of how likely the study will detect a specific treatment difference (∆), if present.
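The definition can be made concrete numerically. The following is a minimal sketch, assuming a 2-sided test on the difference of two means, a normal approximation, equal group sizes, and a common known variance; the function name `power_two_means` is an illustrative choice, and the inputs (σ² = 100, Δ = 4, n = 100 or 200 per group) are the ones these slides use in their sensitivity table.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(delta, variance, n_per_group, alpha=0.05):
    """Approximate power of a 2-sided z-test comparing two means."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    # True difference expressed in standard-error units of (mean1 - mean2).
    shift = abs(delta) / sqrt(2.0 * variance / n_per_group)
    upper_tail = 1.0 - nd.cdf(z_crit - shift)   # rejection in the direction of the effect
    lower_tail = nd.cdf(-z_crit - shift)        # rejection in the wrong direction (near zero)
    return upper_tail + lower_tail


for alpha in (0.05, 0.01):
    for n in (100, 200):
        pw = power_two_means(4.0, 100.0, n, alpha)
        print(f"alpha = {alpha}, n = {n}: power = {pw:.2f}")
```

Keeping both tails makes the function valid for negative differences as well; for any realistic Δ one of the two terms is negligible.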
$$\Pr(\text{rej } H_0 \mid H_A \text{ is true}) = 1 - \beta$$
$$1 - \beta = \Pr(\text{rej } H_0 \mid \mu_1 - \mu_2 = \Delta) = \Pr\left(\frac{\bar{x}_1 - \bar{x}_2 - 0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} > Z_{1-\alpha/2} \,\middle|\, \mu_1 - \mu_2 = \Delta\right) + \Pr\left(\frac{\bar{x}_1 - \bar{x}_2 - 0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} < -Z_{1-\alpha/2} \,\middle|\, \mu_1 - \mu_2 = \Delta\right)$$
Assume $\sigma_1^2 = \sigma_2^2 = \sigma^2$ and $n_1 = n_2 = n$.

Power (cont.)
$$1 - \beta = \Pr\left(\frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{2\sigma^2/n}} > Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}}\right) + \Pr\left(\frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{2\sigma^2/n}} < -Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}}\right)$$
$$= \Pr\left(Z > Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}}\right) + \Pr\left(Z < -Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}}\right)$$
Usually one of these probabilities will be very close to zero, depending on whether ∆ is positive or negative.

Power (cont.)
If Δ > 0, then the 2nd probability ≈ 0, and
$$1 - \beta = \Pr\left(Z > Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}}\right) = \Pr(Z > Z_c)$$
α = 0.05: $Z_{1-\alpha/2} = 1.96$
α = 0.01: $Z_{1-\alpha/2} = 2.575$

Sensitivity of Power to Variations in Other Sample Size Parameters (Assume σ² = 100)

α       ∆     n      Z_c       Power
0.05    4     100    -0.87     0.81
0.01    4     100    -0.25     0.60
0.05    6     100    -2.28     0.99
0.01    6     100    -1.67     0.95
0.05    4     200    -2.04     0.98
0.01    4     200    -1.425    0.92

Unequal Sample Sizes
$N_T = 2N_c$ (2:1 allocation):
$$\sigma^2\left(\frac{1}{N_c} + \frac{1}{2N_c}\right) = \frac{1.5\,\sigma^2}{N_c}, \qquad N_c = \frac{1.5\,\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$
$N_T = kN_c$ (k:1 allocation):
$$N_c = \frac{(1 + 1/k)\,\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}, \qquad \mathrm{SE(diff)} = \sigma\sqrt{\frac{1 + 1/k}{N_c}}$$
Relative total sample size for k:1 versus 1:1 allocation:
$$\frac{2 + k + 1/k}{4}$$
For k = 2: 4.5/4, i.e., a 12.5% increase.

Another Formulation: Unequal Allocation
Comparison of means (Treatment C vs. E):
$$\text{Total } N = \frac{\sigma^2\left(\dfrac{1}{P_C} + \dfrac{1}{P_E}\right)(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$
$P_C$ and $P_E$ = fraction of patients assigned to control (C) and experimental treatment (E); $P_C + P_E = 1$

Total Sample Size for Different Allocation Ratios
σ² = 94.7 (mmHg)²; α = .05 (2-sided test); Power (1 - β) = 0.90. Sample sizes rounded up.

                Allocation ratio (E:C)
∆ (mmHg)        1:1     2:1     3:1     1:2
4               250     280     332     280
8               64      70      84      70

Multiple Treatments and Unequal Allocation
Example: m experimental treatments and control; comparison of means
Let
  m = no. of experimental treatments
  n = no. of patients on each experimental treatment
  N - mn = no. of patients on control arm
  $\bar{X}_c$ = response to control treatment
  $\bar{X}_e$ = response to each experimental treatment
$$V(\bar{X}_e - \bar{X}_c) = \frac{\sigma_e^2}{n} + \frac{\sigma_c^2}{N - mn}$$
Problem: Find n which minimizes the variance
Solution: Take the derivative with respect to n and set it to zero. Then
$$\frac{N - mn}{n} = \sqrt{m}\,\frac{\sigma_c}{\sigma_e}$$
and if $\sigma_e = \sigma_c$,
$$N - mn = n\sqrt{m}$$
No. of patients in control group = no. of patients in each experimental group times the square root of the no. of treatments

Other Issues with Multiple Groups
• Multiple comparisons (α adjustment)
• Interim analyses – possible early termination of some, but not all, treatment groups

Minimum Clinically Important Difference (MCID)
For a given sample size (N), the null hypothesis (H0: difference in means = 0) will be rejected if the observed difference (d) satisfies
$$|d| > Z_{1-\alpha/2}\,\sigma\sqrt{2/N}$$
If N was determined so that
$$N = \frac{2\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\mathrm{MCID})^2}$$
then H0 is rejected when
$$|d| > \mathrm{MCID} \times \frac{Z_{1-\alpha/2}}{Z_{1-\alpha/2} + Z_{1-\beta}}$$
For α = 0.05, β = 0.10: P < 0.05 if d is 61% of MCID
d can be smaller than MCID and p < 0.05!
Chuang-Stein C et al. Pharmaceutical Stat 2010.

Sample Size for Dual Criteria: Statistical Significance and Clinical Significance
• In some cases, you may want to establish with high probability that the treatment effect is as large as the MCID
  – For example, a new HIV vaccine might be assumed to have 60% efficacy but the study is designed to have sufficient power to rule out efficacy lower than 30%
  – This will require a larger sample size
  – For example, if Δ = 2 × MCID, then the sample size is 4 times greater

Summary (General)
• It is important that sample size be large enough to achieve the goals of the study; too many studies are conducted which are underpowered.
• Sample size assumptions are frequently very rough, so they should be re-evaluated as the study progresses.
• A good knowledge of the subject matter (background on disease and intervention, outcomes, and target population) is necessary to estimate sample size.

Summary (Crossover versus Parallel Group)
• Efficiency of crossover increases as ρ increases.
• A design using change from baseline as the response is better than a design which just uses follow-up responses if ρ > 0.50.
• With multiple measurements on each patient to establish baseline and follow-up levels, sample size can be reduced.
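The crossover versus parallel group comparison in the summary can be sketched numerically. This is a minimal illustration, assuming the normal-approximation formulas from these slides with variances 2(σ_s² + σ_e²) for follow-up only, 4σ_e² for change from baseline, and σ_e² per sequence for the crossover; the function names are illustrative, and the inputs are the MRFIT DBP values quoted in these slides (σ_s² = 58.4, σ_e² = 36.3, Δ = 5 mmHg, α = 0.05 2-sided, power 0.95).

```python
from math import ceil
from statistics import NormalDist


def z_constant(alpha=0.05, power=0.95):
    """(Z_{1-alpha/2} + Z_{1-beta})^2 for a 2-sided test."""
    nd = NormalDist()
    return (nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)) ** 2


def design_sizes(var_between, var_within, delta, alpha=0.05, power=0.95):
    """Unrounded sample sizes for three designs comparing two treatments."""
    c = z_constant(alpha, power) / delta ** 2
    return {
        # Intraclass correlation of repeat measurements on one patient.
        "rho": var_between / (var_between + var_within),
        # Parallel groups, follow-up value only (per group).
        "parallel_followup_only": 2.0 * (var_between + var_within) * c,
        # Parallel groups, change from baseline as response (per group).
        "parallel_change_from_baseline": 4.0 * var_within * c,
        # Two-period crossover (per sequence).
        "crossover_per_sequence": var_within * c,
    }


sizes = design_sizes(58.4, 36.3, 5.0)
print(f"rho = {sizes['rho']:.2f}")
for name in ("parallel_followup_only", "parallel_change_from_baseline", "crossover_per_sequence"):
    print(f"{name}: {ceil(sizes[name])}")
```

With these inputs ρ exceeds 0.50, so the change-from-baseline design needs fewer patients per group than the follow-up-only design, and the crossover needs one quarter of the change-from-baseline number per sequence, at the cost of measuring every patient in both periods.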