Early stopping for phase II cancer studies: a likelihood approach
Elizabeth Garrett-Mayer, PhD
Associate Professor of Biostatistics
The Hollings Cancer Center, The Medical University of South Carolina
garrettm@musc.edu

Motivation
- Oncology phase II studies: single arm, evaluation of efficacy
- Historically, 'clinical response' is the outcome of interest
- Evaluated within several months (cycles) of enrollment
- Early stopping often incorporated for futility

Early Stopping in Phase II Studies: Binary Outcome (Clinical Response)
- Attractive solutions exist for this setting
- A common design is Simon's two-stage (Simon, 1989), which preserves type I and type II error
  - Procedure: enroll N1 patients (stage 1); if x or more respond, enroll N2 more (stage 2); if fewer than x respond, stop
  - Appropriate for binary responses
- Bayesian approaches have also been implemented
  - Binary likelihood with a beta prior gives a beta-binomial model; other forms are possible; a prior is required
  - Lee and Liu: predictive probability design (Clinical Trials, 2008)

Alternative Approach for Early Stopping
- Use a likelihood-based approach (Royall, 1997; Blume, 2002)
- Similar to Bayesian approaches: parametric and model-based, with no "penalties" for early looks
- But with differences: no prior information is included, and the "probability of misleading evidence" is controlled
- Can make statements about the probability of misleading evidence

Law of Likelihood
"If hypothesis A implies that the probability of observing some data X is PA(X), and hypothesis B implies that the probability is PB(X), then the observation X = x is evidence supporting A over B if PA(x) > PB(x), and the likelihood ratio, PA(x)/PB(x), measures the strength of that evidence." (Hacking, 1965; Royall, 1997)

Likelihood Approach
- Likelihood ratios (LR): take the ratio of the heights of the likelihood function L at different values of λ
- Example: L(λ = 0.030) = 0.78 and L(λ = 0.035) = 0.03, so LR = 26 (a small worked example appears after the Likelihood Inference slide below)
[Figure: likelihood function plotted against λ, for λ between 0.01 and 0.05]

Likelihood-Based Approach
- Use the likelihood ratio to determine whether there is sufficient evidence in favor of one hypothesis over another
- Error rates are bounded
- Implication: we can look at the data frequently without concern over mounting errors

Key Difference Between the Likelihood and Frequentist Paradigms
- Consideration of the alternative hypothesis
- Frequentist hypothesis testing: H0 (null hypothesis) vs. H1 (alternative hypothesis)
- Frequentist p-values are calculated assuming the null is true and have no regard for the alternative hypothesis
- The likelihood ratio compares the evidence for two hypotheses; acceptance or rejection of the null depends on the alternative

Example
- Assume H0: λ = 0.12 vs. H1: λ = 0.08; what if the true λ = 0.10?
- Simulated data, N = 300
- Frequentist: p = 0.01, reject the null
- Likelihood: LR = 1/4, weak evidence in favor of the null
[Figure: likelihood function plotted against λ, for λ between 0.08 and 0.12]

Example: Why?
- The p-value looks for evidence against the null
- The LR compares the evidence for both hypotheses
- When the "truth" is in the middle, which makes more sense?
[Figure: likelihood function plotted against λ on a log scale, for λ between 0.08 and 0.12]

Likelihood Inference
- Weak evidence: at the end of the study, there is not sufficiently strong evidence in favor of either hypothesis
  - This can be controlled by choosing a large enough sample size
  - But if neither hypothesis is correct, we can end up with weak evidence even if N is seemingly large (appropriate)
- Strong evidence
  - Correct evidence: strong evidence in favor of the correct hypothesis
  - Misleading evidence: strong evidence in favor of the incorrect hypothesis
- This is our interest today: what is the probability of misleading evidence?
- This is analogous to the alpha (type I) and beta (type II) errors that frequentists worry about
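To make the likelihood-ratio idea concrete, here is a minimal sketch in the binomial (response-rate) setting used for the trial designs later in the talk. The response rates 0.20 and 0.40 match the motivating example; the observed data (7 responses among 25 patients) are hypothetical and not taken from the slides.

```python
from math import comb

def binomial_likelihood(x, n, p):
    """Probability of observing x responses among n patients when the true response rate is p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Two simple hypotheses about the response rate
p0, p1 = 0.20, 0.40

# Hypothetical data: 7 responses observed among 25 patients
x, n = 7, 25

L0 = binomial_likelihood(x, n, p0)
L1 = binomial_likelihood(x, n, p1)

# Law of Likelihood: the data support H0 over H1 if L0 > L1,
# and the ratio of the two likelihood heights measures the strength of that evidence
print(f"L(p = {p0}) = {L0:.4f}")
print(f"L(p = {p1}) = {L1:.4f}")
print(f"likelihood ratio L0/L1 = {L0 / L1:.2f}")
```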
Operating Characteristics: Simon Two-Stage
[Figure: probability of Accept H0 and Reject H0 as a function of the true p, for true p from 0.0 to 0.7]

Operating Characteristics: Likelihood Approach
[Figure: probability of Accept H0, Accept HA, and Weak Evidence as a function of the true p, for true p from 0.0 to 0.7]

Misleading Evidence in the Likelihood Paradigm
- Universal bound: under H0, P(L1/L0 ≥ k) ≤ 1/k (Birnbaum, 1962; Smith, 1953)
- In words, the probability that the likelihood ratio exceeds k in favor of the wrong hypothesis can be no larger than 1/k
- In certain cases an even smaller bound applies (Royall, 2000), e.g., for a difference between normal means or with a large sample size
- Common choices for k are 8 (strong), 10, and 32 (very strong)

Implications
- Important result: for a sequence of independent observations, the universal bound still holds (Robbins, 1970)
- Implication: we can look at the data as often as desired and our probability of misleading evidence remains bounded
- That is, if k = 10, the probability of misleading strong evidence is ≤ 10%
- A reasonable bound, considering that β = 10-20% and α = 5-10% in most studies

Motivating Example
- New cancer treatment agent; anticipated response rate is 40%
- Null response rate is 20%: the standard of care yields 20%, and it is not worth pursuing a new treatment with the same response rate as the current treatment
- Using the frequentist approach: Simon two-stage with alpha = beta = 10%; optimality criterion: smallest E(N)
  - First stage: enroll 17; if 4 or more respond, continue
  - Second stage: enroll 20 more; if 11 or more respond in total, conclude success

Likelihood Approach
- Recall: we can look at the data after each patient
- Use the binomial likelihood to compare the two hypotheses; the difference in the log-likelihoods provides the log likelihood ratio
- It simplifies to a simple expression (yᵢ = 1 if patient i responds, 0 otherwise):
  log L1 − log L0 = Σyᵢ · [log(p1/p0) − log((1 − p1)/(1 − p0))] + N · log((1 − p1)/(1 − p0))

Implementation
- Look at the data after each patient
- Compute the difference between log L0 and log L1
- Rule: if log L0 − log L1 ≥ log(k), stop for futility; otherwise, continue
  (a code sketch of this calculation appears after the Simon design example below)

Likelihood Approach
- But, given the discrete nature of the data, only certain looks provide an opportunity to stop
- Current example: stop the study if…
  - 0 responses in 9 patients
  - 1 response in 12 patients
  - 2 responses in 15 patients
  - 3 responses in 19 patients
  - 4 responses in 22 patients
  - 5 responses in 26 patients
  - 6 responses in 29 patients
  - 7 responses in 32 patients
  - 8 responses in 36 patients
- Although the total N can be as large as 37, there are only 9 thresholds for the futility early stopping assessment

Design Performance Characteristics
- How does the proposed approach compare to the optimal Simon two-stage design?
- What performance characteristics would we be interested in?
  - small E(N) under the null hypothesis
  - frequent stopping under the null (similar to the above)
  - infrequent stopping under the alternative
  - acceptance of H1 under H1
  - acceptance of H0 under H0

Example 1: Simon Designs
- H0: p = 0.20 vs. H1: p = 0.40; power ≥ 90% and alpha ≤ 0.10
- Optimal design: stage 1: N1 = 17, r1 = 3; stage 2: N = 37, r = 10
  - Enroll 17 in stage 1; stop if 3 or fewer responses
  - If more than three responses, enroll to a total N of 37
  - Reject H0 if more than 10 responses are observed in 37 patients
- Minimax design: stage 1: N1 = 22, r1 = 4; stage 2: N = 36, r = 10
  - Enroll 22 in stage 1; stop if 4 or fewer responses
  - If more than four responses, enroll to a total N of 36
  - Reject H0 if more than 10 responses are observed in 36 patients
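The futility thresholds listed above (0 of 9, 1 of 12, …, 8 of 36) can be computed directly from the binomial log-likelihood-ratio rule on the Implementation slide. Below is a minimal sketch of that calculation. It assumes k = 10 and the stopping rule log L0 − log L1 ≥ log(k); the slides list 8, 10, and 32 as common choices for k without stating which value generated these boundaries, but k = 10 reproduces the thresholds shown for the p0 = 0.20 vs. p1 = 0.40 design.

```python
import math

def futility_boundaries(p0, p1, k, n_max):
    """For each possible response count y, find the smallest number of patients n at
    which log L0 - log L1 >= log(k), i.e. the binomial likelihood favors the null
    response rate p0 over the alternative p1 by a factor of at least k:
        log L0 - log L1 = y*[log(p0/p1) - log((1-p0)/(1-p1))] + n*log((1-p0)/(1-p1))
    """
    a = math.log(p0 / p1) - math.log((1 - p0) / (1 - p1))  # contribution of each response
    b = math.log((1 - p0) / (1 - p1))                      # contribution of each enrolled patient
    boundaries = []
    for y in range(n_max + 1):
        n = math.ceil((math.log(k) - a * y) / b)           # smallest n that crosses the threshold
        if y <= n <= n_max:
            boundaries.append((y, n))
    return boundaries

# Motivating example: H0 p = 0.20 vs. H1 p = 0.40, maximum N of 37, with k = 10 assumed
for y, n in futility_boundaries(p0=0.20, p1=0.40, k=10, n_max=37):
    print(f"stop for futility with {y} responses in the first {n} patients")
```

Running the same function with p0 = 0.05, p1 = 0.20, n_max = 32, and k = 10 gives the three thresholds (0 of 14, 1 of 23, 2 of 32) quoted on the "Another scenario" slide later in the deck.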
Simon Optimal vs. Likelihood (N = 37)
[Figure: probability of Accept HA and Accept H0 for the likelihood and Simon designs, plus Weak Evidence for the likelihood design, as a function of the true p (0.1 to 0.6)]

Simon Minimax vs. Likelihood (N = 36)
[Figure: probability of Accept HA and Accept H0 for the likelihood and Simon designs, plus Weak Evidence for the likelihood design, as a function of the true p (0.1 to 0.6)]

Probability of Early Stopping
[Figure: probability of stopping early as a function of the true p, for the likelihood design (optimal N), the likelihood design (minimax N), Simon optimal, and Simon minimax]

Expected Sample Size
[Figure: expected sample size as a function of the true p, for the likelihood design (optimal N), the likelihood design (minimax N), Simon optimal, and Simon minimax]

Another Scenario
- Lower chance of success: H0: p = 0.05 vs. H1: p = 0.20
- Now there are only 3 criteria for stopping (used in the simulation sketch at the end of the deck):
  - 0 out of 14
  - 1 out of 23
  - 2 out of 32

Simon Designs
- H0: p = 0.05 vs. H1: p = 0.20; power ≥ 90% and alpha ≤ 0.10
- Optimal design: stage 1: N1 = 12, r1 = 0; stage 2: N = 37, r = 3
  - Enroll 12 in stage 1; stop if 0 responses
  - If at least one response, enroll to a total N of 37
  - Reject H0 if more than 3 responses are observed in 37 patients
- Minimax design: stage 1: N1 = 18, r1 = 0; stage 2: N = 32, r = 3
  - Enroll 18 in stage 1; stop if 0 responses
  - If at least one response, enroll to a total N of 32
  - Reject H0 if more than 3 responses are observed in 32 patients

Simon Optimal vs. Likelihood (N = 37)
[Figure: probability of Accept HA and Accept H0 for the likelihood and Simon designs, plus Weak Evidence for the likelihood design, as a function of the true p (0.0 to 0.4)]

Simon Minimax vs. Likelihood (N = 32)
[Figure: probability of Accept HA and Accept H0 for the likelihood and Simon designs, plus Weak Evidence for the likelihood design, as a function of the true p (0.0 to 0.4)]

Probability of Early Stopping
[Figure: probability of stopping early as a function of the true p, for the likelihood design (optimal N), the likelihood design (minimax N), Simon optimal, and Simon minimax]

Expected Sample Size
[Figure: expected sample size as a function of the true p, for the likelihood design (optimal N), the likelihood design (minimax N), Simon optimal, and Simon minimax]

Comparison with Predictive Probability: Minimax Sample Size
[Figure: comparison of the likelihood and predictive probability designs at the minimax sample size]

Comparison with Predictive Probability: Optimal Sample Size
[Figure: comparison of the likelihood and predictive probability designs at the optimal sample size]

Summary and Conclusions (1)
- Likelihood-based stopping provides another option for trial design in phase II single-arm studies
- We only considered one value of k
  - chosen to be comparable to the frequentist approach
  - other values will lead to more or less conservative results
  - extension: a different k for early stopping versus the final go/no-go decision
- Overall, the sample size is smaller
  - especially marked when you want to stop for futility
  - when early stopping is not expected, there is not much difference in sample size
- For 'ambiguous' cases, the likelihood approach stops early more often than Simon; in the minimax designs it finds 'weak' evidence frequently

Summary and Conclusions (2)
- The 'r' for the final analysis is generally smaller. Why? Because of the notion of comparing hypotheses instead of conditioning only on the null.
- Comparison to the PP (predictive probability) approach is favorable
  - likelihood stopping is less computationally intensive
  - likelihood stopping does not require specification of a prior
  - the "search" for designs is relatively simple

Thank you for your attention!
garrettm@musc.edu
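As a rough companion to the comparison plots above, here is a minimal Monte Carlo sketch of the likelihood design's operating characteristics for the second scenario (H0: p = 0.05 vs. H1: p = 0.20, maximum N of 32). It looks after every patient and stops for futility when L0/L1 ≥ k, which implicitly applies the 0-of-14, 1-of-23, and 2-of-32 thresholds; the choice k = 10 and the symmetric rule at the final analysis (accept H1 if L1/L0 ≥ k, accept H0 if L0/L1 ≥ k, weak evidence otherwise) are assumptions, since the slides do not spell out these details.

```python
import math
import random

def log_lr(y, n, p0, p1):
    """log L0 - log L1 for y responses among n patients under the binomial likelihood."""
    return (y * (math.log(p0 / p1) - math.log((1 - p0) / (1 - p1)))
            + n * math.log((1 - p0) / (1 - p1)))

def operating_characteristics(p_true, p0, p1, n_max, k=10, n_sims=20000, seed=1):
    """Monte Carlo estimate of the accept-H0 / accept-H1 / weak-evidence probabilities
    and the expected sample size, looking at the data after every patient."""
    rng = random.Random(seed)
    log_k = math.log(k)
    counts = {"accept H0": 0, "accept H1": 0, "weak evidence": 0}
    total_n = 0
    for _ in range(n_sims):
        y = 0
        for n in range(1, n_max + 1):
            y += rng.random() < p_true                 # response of one new patient
            if log_lr(y, n, p0, p1) >= log_k:          # evidence favors H0 by a factor of k
                counts["accept H0"] += 1
                break
        else:
            # reached the maximum sample size without stopping for futility
            if -log_lr(y, n_max, p0, p1) >= log_k:     # L1/L0 >= k: strong evidence for H1
                counts["accept H1"] += 1
            else:
                counts["weak evidence"] += 1
        total_n += n
    probs = {outcome: c / n_sims for outcome, c in counts.items()}
    return probs, total_n / n_sims

# Second scenario, evaluated at the null rate, an intermediate rate, and the alternative rate
for p_true in (0.05, 0.10, 0.20):
    probs, expected_n = operating_characteristics(p_true, p0=0.05, p1=0.20, n_max=32)
    print(f"true p = {p_true:.2f}: {probs}, E(N) = {expected_n:.1f}")
```

The same function with p0 = 0.20, p1 = 0.40, and n_max = 37 or 36 gives the corresponding operating characteristics for the motivating example, under the same assumptions about k and the final-analysis rule.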