Sample Size & Power
An Introduction to Clinical Trials (7)
October 4, 2007
Inkyung Jung, Ph.D.

Background
• Quantitative properties of clinical trial designs
  - Correct precision or power
  - Best sample size
  - Optimal study duration
• Sample size and power deserve particular attention in comparative treatment efficacy (CTE) and safety and activity (SA) studies

Methods for Determining Sample Size
• Frequentist approaches:
  - Confidence interval
  - Hypothesis testing
• Likelihood ratio based methods
• Bayesian approaches

Quantitative Design Parameters in Clinical Trials
• Power: 1-β
• β level: Type II error probability
• α level: Type I error probability
• Likelihood ratio: Relative strength of evidence
• Sample size: Number of experimental subjects
• Effect size: Treatment difference expressed as a number of s.d.
• Number of events: Number of experimental subjects with the outcome
• Study duration: Interval from beginning of trial to end of follow-up (f/u)
• Percent censoring: Percentage of study participants left without an event by the end of f/u
• Allocation ratio: Ratio of sample sizes in the treatment groups
• Accrual rate: New subjects entered per unit of time
• Loss to follow-up: Rate at which study participants are lost before the outcome can be observed
• Follow-up period: Interval from end of accrual to end of study
• Δ: Smallest treatment effect of interest based on clinical considerations

Principles
• Precision
  - Indirect specification through CI (absolute or relative width of CI)
• Power
  - Power of a statistical hypothesis test
• Evidence
  - Likelihood ratio (LR): relative strength of evidence

Sample Size & Power
• The calculations are approximations.
  - The equations are based on approximations
  - Predicting the number of subjects depends on guesses about parameters
• The relationship between power/precision and sample size is quadratic.
  - Sample size (SS) increases as the square of the s.d.
    of the treatment (tx) difference and the normal quantiles for the type I & II error rates

Translational Trials
• Small, usually fewer than 20 subjects
• The size can be motivated formally
• Issue: reliably estimate the mean of some measurement

Measurements: samples from $N(\mu, \sigma^2)$
$\bar{X}$: unbiased estimator (UE) of $\mu$; $|\bar{X} - \mu|$: absolute error
Requirement: $\Pr(|\bar{X} - \mu| \le d) \ge 1 - \alpha$

$$\Pr(|\bar{X} - \mu| \le d) = \Pr\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} \le \frac{d}{\sigma/\sqrt{n}}\right) = \Phi\left(\frac{d}{\sigma/\sqrt{n}}\right) - \Phi\left(\frac{-d}{\sigma/\sqrt{n}}\right) = 2\Phi\left(\frac{d}{\sigma/\sqrt{n}}\right) - 1 \ge 1 - \alpha$$

$$\Phi\left(\frac{d}{\sigma/\sqrt{n}}\right) \ge 1 - \alpha/2$$

$$\frac{d}{\sigma/\sqrt{n}} \ge Z_{1-\alpha/2}; \quad \Phi(Z_\gamma) = \gamma$$

$$n \ge \left(Z_{1-\alpha/2}\,\frac{\sigma}{d}\right)^2$$

Bernoulli trials
$p$: success probability, $r$: # successes, $n$: # trials, $r \sim B(n, p)$

$$\Pr\left(\left|\frac{r}{n} - p\right| \le d\right) \ge 1 - \alpha$$

Using the normal approximation with $\sigma^2 = p(1-p)$:

$$n \ge \left(Z_{1-\alpha/2}\,\frac{\sqrt{p(1-p)}}{d}\right)^2$$

Safety and Activity Studies
• Single cohort study: the treatment effect is compared to a standard
• Estimate the unconditional probability of benefit (or lack of benefit), and thus form a basis for deciding whether or not to investigate the treatment in a larger, lengthier, more expensive trial with an internal control (i.e. a randomized trial)
• Goal of SA studies: to estimate a clinical endpoint with a specified precision
  - e.g. average blood or tissue levels of a drug, the proportion of patients responding, population failure rates
  - Dichotomous outcome: summarized as a proportion

Dichotomous Outcome
100(1 - α)% CI for $p$:

$$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}; \quad \Phi(Z_{1-\alpha/2}) = 1 - \alpha/2$$

• $p$ unknown: substitute $\hat{p}$ for $p$
• Not accurate for extreme values of $\hat{p}$ and small sample sizes

Example: Consider a trial in which patients with esophageal cancer are treated with chemotherapy prior to surgical resection. A complete response is defined as the absence of macroscopic tumor at the time of surgery. We suspect that this might occur 35% of the time and would like the 95% CI of our estimate to be ±15%.
$$0.15 = 1.96 \times \sqrt{0.35(1 - 0.35)/n}; \quad n = 39$$

Exact Binomial Confidence Limits
Chance of $r$ or fewer successes in $n$ trials (lower tail probability):

$$\Pr[X \le r] = \sum_{k=0}^{r} \binom{n}{k} p^k (1-p)^{n-k}$$

Lower & upper bounds of $p$ for a 100(1 - α)% CI:

$$\frac{\alpha}{2} = \sum_{k=r}^{n} \binom{n}{k} p^k (1-p)^{n-k} \Rightarrow p_L, \qquad \frac{\alpha}{2} = \sum_{k=0}^{r} \binom{n}{k} p^k (1-p)^{n-k} \Rightarrow p_U$$

See Tables 11.2 and 11.3 (p.261 & p.262)

Bayesian Binomial CI
Using a uniform prior for $p$, the posterior distribution function for $p$ is

$$F(p) = \frac{\int_0^p u^r (1-u)^{n-r}\,du}{\int_0^1 u^r (1-u)^{n-r}\,du} = \frac{(n+1)!}{r!\,(n-r)!}\int_0^p u^r (1-u)^{n-r}\,du$$

$$\frac{\alpha}{2} = \frac{(n+1)!}{r!\,(n-r)!}\int_0^{p_L} u^r (1-u)^{n-r}\,du = \sum_{k=r+1}^{n+1} \binom{n+1}{k} p_L^k (1-p_L)^{n+1-k}$$

$$\frac{\alpha}{2} = \frac{(n+1)!}{r!\,(n-r)!}\int_{p_U}^{1} u^r (1-u)^{n-r}\,du = \sum_{k=0}^{r} \binom{n+1}{k} p_U^k (1-p_U)^{n+1-k}$$

• See Tables 11.4 & 11.5 (p.264 & 265)
• A Bayesian approach can use prior information

Likelihood-based Approach

$$e^{L(p)} = p^k (1-p)^{n-k}$$

$$e^{\Lambda} = \frac{p_1^k (1-p_1)^{n-k}}{p_0^k (1-p_0)^{n-k}}$$

$$\Lambda = k \log\frac{p_1}{p_0} + (n-k)\log\frac{1-p_1}{1-p_0}$$

Setting $k = n p_1$ (the expected number of successes under $p_1$) and solving for $n$:

$$n = \frac{\Lambda}{p_1 \log\left(\dfrac{p_1/(1-p_1)}{p_0/(1-p_0)}\right) + \log\dfrac{1-p_1}{1-p_0}} = \frac{\Lambda}{\log\dfrac{p_1}{p_0} - (1-p_1)\log\left(\dfrac{p_1/(1-p_1)}{p_0/(1-p_0)}\right)}$$

• See Table 11.6 (p.267)

CI for a Mean
• Used when the treatment effect of interest is the mean of a distribution
• Assume the sample mean has a normal distribution
  - Need to specify both mean and s.d.
  - No bounded range for mean and s.d.
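The dichotomous-outcome calculations above (the normal-approximation sample size from the esophageal cancer example and the exact binomial limits) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method used to build the cited tables; the function names `n_for_proportion`, `binom_cdf` and `exact_binomial_limits` are hypothetical, and the tail equations are solved by simple bisection.

```python
from math import ceil, comb
from statistics import NormalDist


def n_for_proportion(p, d, alpha=0.05):
    """Smallest n with Z_{1-alpha/2} * sqrt(p(1-p)/n) <= d (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / d) ** 2 * p * (1 - p))


def binom_cdf(r, n, p):
    """Lower tail probability Pr[X <= r] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_limits(r, n, alpha=0.05):
    """Exact limits (p_L, p_U) for r successes in n trials."""
    # Upper limit solves Pr[X <= r | p_U] = alpha/2 (the cdf decreases in p).
    p_u = 1.0 if r == n else _bisect(lambda p: binom_cdf(r, n, p) - alpha / 2)
    # Lower limit solves Pr[X >= r | p_L] = alpha/2,
    # i.e. Pr[X <= r-1 | p_L] = 1 - alpha/2.
    p_l = 0.0 if r == 0 else _bisect(
        lambda p: binom_cdf(r - 1, n, p) - (1 - alpha / 2)
    )
    return p_l, p_u


if __name__ == "__main__":
    print(n_for_proportion(0.35, 0.15))   # esophageal cancer example
    print(exact_binomial_limits(14, 39))  # hypothetical data: 14 responses in 39
```

With p = 0.35 and d = 0.15 the first function reproduces the n = 39 of the example; the 14-of-39 data in the second call are made up purely to exercise the code.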
CI for a Mean
$\hat{\mu}$: estimated mean from $n$ observations
100(1 - α)% CI: $\hat{\mu} \pm Z_{1-\alpha/2}\,\sigma/\sqrt{n}$
If our tolerance for the half-width of the CI is $w$:

$$w = Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \Rightarrow n = \left(\frac{Z_{1-\alpha/2}\,\sigma}{w}\right)^2$$

Precision can be expressed relative to $\mu$ or $\sigma$:

$$\text{If } w = w'\sigma: \quad n = \left(\frac{Z_{1-\alpha/2}\,\sigma}{w'\sigma}\right)^2 = \left(\frac{Z_{1-\alpha/2}}{w'}\right)^2$$

$$\text{If } w = w''\mu \ (\mu \ne 0): \quad n = \left(\frac{Z_{1-\alpha/2}\,\sigma}{w''\mu}\right)^2 = \left(\frac{Z_{1-\alpha/2}}{w''}\right)^2\left(\frac{\sigma}{\mu}\right)^2$$

$w'$, $w''$: desired tolerance expressed as a fraction of the s.d. and of the mean, respectively
$\sigma/\mu$: coefficient of variation

CI for Event Rates
• Time-to-event measurements with censoring (death, recurrence, or overall failure rate)
• Common in cancer trials
  - e.g. a new drug may not shrink tumors but might still improve survival
• Assume
  - accrual is constant at rate $a$ per unit time over some interval $T$
  - a period of continued f/u is used to observe additional events
  - the failure rate is constant over time (exponential event times)
  - there are no losses to f/u

Estimated failure rate:

$$\hat{\lambda} = \frac{d}{\sum t_i}$$

$d$: number of events or failures; the sum is over all f/u times $t_i$
Approximate CI for $\lambda$:

$$\hat{\lambda} \pm \frac{Z_{1-\alpha/2}\,\hat{\lambda}}{\sqrt{d}} \quad \text{or} \quad \log(\hat{\lambda}) \pm \frac{Z_{1-\alpha/2}}{\sqrt{d}}$$

If $w$ is the desired width of the CI expressed as a fraction of $\lambda$,

$$\frac{Z_{1-\alpha/2}\,\lambda}{\sqrt{d}} = w\lambda \Rightarrow d = \left(\frac{Z_{1-\alpha/2}}{w}\right)^2$$

Likelihood-based Approach
Assuming a normal model for $\log(\lambda)$,

$$e^{L(\lambda)} = \exp\left(-\frac{\left(\log(\hat{\lambda}) - \log(\lambda)\right)^2}{2/d}\right)$$

$\lambda$: true hazard, $\hat{\lambda}$: observed hazard

$$\Lambda = \frac{\left(\log(\hat{\lambda}) - \log(\lambda)\right)^2}{2/d} \Rightarrow d = \frac{2\Lambda}{\left(\log(\hat{\lambda}) - \log(\lambda)\right)^2} = \frac{2\Lambda}{\left(\log(\Delta)\right)^2}, \quad \Delta = \hat{\lambda}/\lambda$$

Comparative Trials
• An approach based on a planned hypothesis test
• Convenient and frequently used for determining sample size
• H0: equivalence between the treatments
• The alternative value is chosen to be the smallest difference of clinical importance between treatments
• Size is planned to yield a high power (1-β) at a pre-specified α-level

Type I and II error rates
• Convention: two-sided α-level at 0.05 and 80 or 90% power
• In practice: they should be chosen to reflect the consequences of making the
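  particular type of error

Comparison of Two Means
• Treatment comparison: testing the difference of the estimated means of two groups
• $\mu_1$ & $\mu_2$: true means of the two groups
• $\sigma$: s.d. of the measurement
• $\Delta = \mu_1 - \mu_2$
• H0: $\Delta = 0$ vs. Ha: $|\Delta| > 0$
• Reject the null if $|\hat{\Delta}| > c = Z_{1-\alpha/2}\,\sigma_{\hat{\Delta}}$

Want to have a power of $1 - \beta$ at $\Delta\ (> 0)$:

$$1 - \beta = \Pr[\hat{\Delta} > c \mid \Delta] = \Pr\left[\frac{\hat{\Delta} - \Delta}{\sigma_{\hat{\Delta}}} > \frac{c - \Delta}{\sigma_{\hat{\Delta}}} \,\middle|\, \Delta\right]$$

$$\frac{c - \Delta}{\sigma_{\hat{\Delta}}} = Z_\beta = -Z_{1-\beta}; \quad \Phi(Z_\beta) = \beta$$

$$-Z_{1-\beta} = \frac{Z_{1-\alpha/2}\,\sigma_{\hat{\Delta}} - \Delta}{\sigma_{\hat{\Delta}}} = Z_{1-\alpha/2} - \frac{\Delta}{\sigma_{\hat{\Delta}}}$$

$$Z_{1-\alpha/2} + Z_{1-\beta} = \frac{\Delta}{\sigma_{\hat{\Delta}}}, \quad \sigma_{\hat{\Delta}} = \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

$$\frac{1}{n_1} + \frac{1}{n_2} = \frac{\Delta^2}{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\sigma^2}$$

$$n_1 = r n_2 \Rightarrow n_2 = \frac{r+1}{r}\,\frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\Delta/\sigma)^2}$$

Likelihood-based Approach

$$e^{L(X \mid \mu)} = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

$$e^{\Lambda} = \frac{\exp\left(-\sum_{i=1}^{n}(x_i - \mu_a)^2 / 2\sigma^2\right)}{\exp\left(-\sum_{i=1}^{n}(x_i - \mu_b)^2 / 2\sigma^2\right)} = \exp\left(\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left((x_i - \mu_b)^2 - (x_i - \mu_a)^2\right)\right)$$

$$\Lambda = \frac{n(\mu_a - \mu_b)}{\sigma^2}\,(\bar{x} - \mu_{ab}); \quad \mu_{ab} = (\mu_a + \mu_b)/2$$

$$n = \frac{\Lambda\sigma^2}{(\mu_a - \mu_b)(\bar{x} - \mu_{ab})}$$

Dichotomous Responses

              Treatment
  Success     A      B
  Yes         a      b
  No          c      d

• Comparing the proportions of successes or failures: a/(a+c) vs. b/(b+d)
• Fisher's exact test or chi-square test with or without continuity correction

• $\chi^2$ test without continuity correction

$$n_2 = \frac{\left(Z_{1-\alpha/2}\sqrt{(r+1)\,\bar{\pi}(1-\bar{\pi})} + Z_{1-\beta}\sqrt{r\pi_1(1-\pi_1) + \pi_2(1-\pi_2)}\right)^2}{r\Delta^2};$$

$$\bar{\pi} = (\pi_1 + r\pi_2)/(r+1), \quad \Delta = \pi_1 - \pi_2$$

When $r = 1$,

$$n_2 = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\left(\pi_1(1-\pi_1) + \pi_2(1-\pi_2)\right)}{\Delta^2}$$

(see Tables 11.12 & 11.13)

• $\chi^2$ test with continuity correction

$$n_2^* = \frac{n_2}{4}\left(1 + \sqrt{1 + \frac{2(r+1)}{r n_2 |\Delta|}}\right)^2$$

Hazard Comparison
• With event time endpoints, it is common to compare the ratio of hazards (vs.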
  H0: hazard ratio = 1)
• Power depends on the number of events (recurrence or death)
• The number of patients placed on study differs from the number of events required for the trial to have the intended properties

Parametric Approach (Exponential)
• If event times are exponentially distributed, the MLE of the hazard $\lambda$ is

$$\hat{\lambda} = \frac{d}{\sum t_i}; \quad d \text{ is the number of uncensored observations, } t_i \text{ are the f/u times}$$

$$2d\lambda/\hat{\lambda} \sim \chi^2(2d)$$

With $\Delta = \lambda_1/\lambda_2$ and $\hat{\Delta} = \hat{\lambda}_1/\hat{\lambda}_2$,

$$\frac{2d_1\lambda_1/\hat{\lambda}_1}{2d_2\lambda_2/\hat{\lambda}_2} = \frac{d_1\,\Delta}{d_2\,\hat{\Delta}}$$

and dividing each chi-square variable by its degrees of freedom gives

$$\frac{\Delta}{\hat{\Delta}} \sim F(2d_1, 2d_2)$$

• Can be used to construct tests and CIs. Writing $F_{2d_1, 2d_2, \gamma}$ for the $\gamma$ quantile, a 100(1 - α)% CI is

$$\hat{\Delta}\,F_{2d_1, 2d_2, \alpha/2} < \Delta < \hat{\Delta}\,F_{2d_1, 2d_2, 1-\alpha/2}$$

Other Parametric Approaches
• Under the null, $\log(\hat{\Delta})$ is approximately normally distributed with mean 0 and variance $1/d_1 + 1/d_2$, so the standardized statistic has variance 1. With equal numbers of events in the two groups,

$$D = \frac{4\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\log(\Delta))^2}; \quad D: \text{total number of events required}$$

More general form:

$$\frac{1}{d_1} + \frac{1}{d_2} = \frac{(\log(\Delta))^2}{(Z_{1-\alpha/2} + Z_{1-\beta})^2}$$

$$\text{Using } r = d_2/d_1: \quad d_1 = \frac{r+1}{r}\,\frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\log(\Delta))^2}$$

Nonparametric Approaches
• No parametric assumptions about the distribution of event times

$$D = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,(\Delta + 1)^2}{(\Delta - 1)^2}; \quad \text{total number of events needed on the study}$$

Example: To detect a hazard ratio of 1.75 as being statistically significantly different from 1.0 using a two-sided 0.05 α-level test with 90% power requires

$$\frac{(1.96 + 1.282)^2\,(1.75 + 1)^2}{(1.75 - 1)^2} \approx 141 \text{ events}$$

Suppose 30% of subjects will remain event free at the end of the trial. Then

$$n = \frac{141}{1 - 0.3} \approx 202$$

Noninferiority
• H0: "the treatments are different"
• Ha: "the treatments are the same"
• Naturally one-sided
• Roles of α and β are reversed
• The sample size depends strongly on the quantitative definition of equivalence

• For testing the difference of two proportions using the $\chi^2$ test without continuity correction

$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\left(\pi_1(1-\pi_1) + \pi_2(1-\pi_2)\right)}{\Delta^2}$$

• For testing equivalence of two proportions

$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\left(\pi_1(1-\pi_1) + \pi_2(1-\pi_2)\right)}{\left(\delta - (\pi_1 - \pi_2)\right)^2}$$

if we declare two proportions equivalent when $|\pi_1 - \pi_2| \le \delta$
• Other approaches: confidence intervals, likelihood methods

ES (Expanded Safety) Trials
• Large safety trials (postmarketing surveillance)
• Intended to uncover and accurately estimate the frequency of uncommon side effects
• Size depends on how low the event rate of interest is and how powerful the study needs to be

Poisson Distribution
Assume:
  - the study population is large
  - the probability of an event is small
  - all subjects are followed for approximately the same length of time

The probability of observing exactly $r$ events:

$$\Pr[D = r] = \frac{(\lambda m)^r e^{-\lambda m}}{r!}; \quad \lambda: \text{event rate}, \ m: \text{cohort size}$$

The chance of observing $r$ or fewer events:

$$\Pr[D \le r] = \sum_{k=0}^{r} \frac{(\lambda m)^k e^{-\lambda m}}{k!}$$

Example. The chance of seeing at least one event under the Poisson distribution:

$$\beta = 1 - \Pr[D = 0] = 1 - e^{-\lambda m}; \quad m = -\log(1 - \beta)/\lambda$$

If λ = 0.001 and β = 0.99, then m ≈ 4605

Likelihood Approach
Relative evidence for an observed Poisson event rate $\lambda$ vs. a hypothetical rate $\mu$:

$$e^{\Lambda} = \left(\frac{\lambda}{\mu}\right)^r e^{-m(\lambda - \mu)}, \quad \text{where } r \text{ events are observed with cohort size } m$$

$$\Lambda = r\log\left(\frac{\lambda}{\mu}\right) - m(\lambda - \mu) \Rightarrow m = \frac{r\log(\lambda/\mu) - \Lambda}{\lambda - \mu}$$

$$\lambda = r/m \Rightarrow m = \frac{\Lambda}{\lambda\log(\lambda/\mu) + (\mu - \lambda)}$$

Other Considerations
• Cluster randomization requires an increased sample size
• It is possible to perform a simple cost optimization using unbalanced allocation and information about the relative cost of the two treatment groups
• Increase the sample size to allow for nonadherence
• Simulation is a powerful and flexible design alternative

Computer Programs
• Power and Sample Size (PASS) by NCSS software
• nQuery Advisor by Statistical Solutions
• SAS v9.1: power procedure
• And others…

Power Curves
[Figure: power curves for a two-sample t-test of the group mean difference at α-level = 0.05]

Summary
• Sample size is motivated by precision, power, or relative evidence
• Size quantifications are useful and necessary for designing trials
• The hypothesis-testing framework is often adopted for sample size considerations
• Important design parameters: α-level, number of events, accrual rate and duration, losses to f/u, allocation ratio, total study duration, the smallest treatment difference of clinical importance
• Phase I trials
  - The sample size
    needed for the trial is usually an outcome of the study
  - The exact sample size usually cannot be specified in advance
• Developmental studies (SA trials)
  - Look for evidence of treatment efficacy
  - A fixed sample size
  - The sample size can be determined as a consequence of the precision required to estimate the response, success or failure rate
  - When faced with evidence of low efficacy, investigators wish to stop an SA trial as early as possible
• CTE trials
  - Sample size and power depend on the particular test statistic used to compare the treatment groups
  - Sample size increases as the type I & II error rates decrease
  - Sample size decreases as the treatment difference increases or as the variance of the treatment difference decreases
  - For event-time studies, the currency of design is the number of events required to detect a particular hazard ratio
  - Nonadherence with assigned treatment may increase the required sample size dramatically
• Noninferiority designs
  - Often require very large sample sizes
  - Require a relatively narrow confidence interval (high precision), increasing the sample size compared to superiority designs
• Statistical simulation may be a useful approach
• Specialized, flexible computer programs are necessary for performing the required calculations efficiently
• Depending on the shape of the power curve, small changes in parameters can have large or small effects on power against a fixed alternative
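
As a closing illustration of the hypothesis-testing formulas in these slides, the following Python sketch evaluates three of them: the two-means sample size $n_2 = \frac{r+1}{r}(Z_{1-\alpha/2} + Z_{1-\beta})^2/(\Delta/\sigma)^2$, the nonparametric event count for a hazard ratio, and the Poisson cohort size for seeing at least one event. It uses only the standard library; the function names are hypothetical, and the results are returned unrounded so that the rounding convention stays with the user. It is a sketch of the arithmetic under the slides' assumptions, not a substitute for dedicated software such as the programs listed earlier.

```python
from math import log
from statistics import NormalDist


def z(q):
    """Standard normal quantile Z_q, i.e. Phi(Z_q) = q."""
    return NormalDist().inv_cdf(q)


def n2_two_means(effect_size, alpha=0.05, power=0.90, r=1.0):
    """Group-2 size for comparing two means, with n1 = r * n2.

    effect_size is Delta / sigma, the difference in s.d. units.
    """
    zsum = z(1 - alpha / 2) + z(power)
    return (r + 1) / r * zsum**2 / effect_size**2


def events_for_hazard_ratio(hr, alpha=0.05, power=0.90):
    """Total events D = (Z_{1-a/2} + Z_{1-b})^2 (hr + 1)^2 / (hr - 1)^2."""
    zsum = z(1 - alpha / 2) + z(power)
    return zsum**2 * (hr + 1) ** 2 / (hr - 1) ** 2


def subjects_from_events(events, event_free_fraction):
    """Subjects needed when a given fraction is expected to stay event free."""
    return events / (1 - event_free_fraction)


def poisson_cohort_size(rate, prob_at_least_one):
    """Cohort size m with 1 - exp(-rate * m) = prob_at_least_one."""
    return -log(1 - prob_at_least_one) / rate


if __name__ == "__main__":
    d = events_for_hazard_ratio(1.75)             # about 141 events
    print(d, subjects_from_events(round(d), 0.3))  # about 202 subjects
    print(poisson_cohort_size(0.001, 0.99))        # about 4605 subjects
    print(n2_two_means(0.5))                       # hypothetical effect size 0.5
```

The hazard-ratio and Poisson calls reproduce the worked examples in the slides (about 141 events, about 202 subjects, and a cohort of about 4605); the effect size of 0.5 in the last call is an arbitrary value chosen only to show the call.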