Early stopping for phase II cancer studies

Early stopping for phase II cancer studies:
a likelihood approach
Elizabeth Garrett-Mayer, PhD
Associate Professor of Biostatistics
The Hollings Cancer Center
The Medical University of South Carolina
garrettm@musc.edu
Motivation

Oncology Phase II studies
  Single arm
  Evaluation of efficacy
  Historically,
    ‘clinical response’ is the outcome of interest
    Evaluated within several months (cycles) of enrollment
    Early stopping often incorporated for futility

Early Stopping in Phase II studies:
Binary outcome (clinical response)

Attractive solutions exist for this setting
Common design is Simon’s two-stage (Simon, 1989)
  Preserves type I and type II error
  Procedure: Enroll N1 patients (stage 1).
    If x or more respond, enroll N2 more (stage 2)
    If fewer than x respond, stop.
  Appropriate for binary responses
Bayesian approaches also implemented
  binary likelihood, beta prior → beta binomial model
  other forms possible
  requires prior
  Lee and Liu: predictive probability design (Clinical Trials, 2008)

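The beta-binomial model mentioned above can be sketched in a few lines. This is only an illustration of the conjugate update and the resulting predictive distribution, not the Lee and Liu design itself; the Beta(1, 1) prior and the interim data (4 responses in 17 patients, 20 patients remaining) are hypothetical values chosen for the example.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_params(a, b, responses, n):
    """Beta(a, b) prior plus binomial data gives Beta(a + responses, b + n - responses)."""
    return a + responses, b + n - responses

def beta_binomial_pmf(y, m, a, b):
    """P(y responses among m future patients) when the response rate is Beta(a, b)."""
    return comb(m, y) * exp(log_beta(a + y, b + m - y) - log_beta(a, b))

# Hypothetical prior and interim data (assumptions, not from the slides).
a_post, b_post = posterior_params(1, 1, responses=4, n=17)
predictive = [beta_binomial_pmf(y, 20, a_post, b_post) for y in range(21)]
print(f"Posterior: Beta({a_post}, {b_post})")
print(f"P(at least 7 responses among the next 20) = {sum(predictive[7:]):.3f}")
```

The need to choose `a` and `b` is the "requires prior" point on the slide.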
Alternative approach for early stopping

Use likelihood-based approach (Royall (1997), Blume (2002))
Similar to Bayesian
  Parametric model-based
  No “penalties” for early looks
But has differences
  No prior information included
  “Probability of misleading evidence” controlled
  Can make statements about probability of misleading evidence

Law of Likelihood
If hypothesis A implies that the probability of
observing some data X is PA(X), and
hypothesis B implies that the probability is
PB(X), then the observation X=x is evidence
supporting A over B if PA(x) > PB(x), and
the likelihood ratio, PA(x)/PB(x), measures the
strength of that evidence.
(Hacking 1965, Royall 1997)
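
As a small illustration of the Law of Likelihood for a binary outcome, the sketch below computes the likelihood ratio for two candidate response rates. The data (8 responders among 20 patients) and the function names are hypothetical; the rates 0.40 and 0.20 match the motivating example used later in the talk.

```python
from math import comb

def binomial_likelihood(p, responses, n):
    """Probability of `responses` successes among `n` patients at response rate p."""
    return comb(n, responses) * p**responses * (1 - p)**(n - responses)

def likelihood_ratio(p_a, p_b, responses, n):
    """Strength of evidence for hypothesis A (rate p_a) over B (rate p_b)."""
    return binomial_likelihood(p_a, responses, n) / binomial_likelihood(p_b, responses, n)

# Hypothetical data: 8 responders among 20 patients.
lr = likelihood_ratio(0.40, 0.20, responses=8, n=20)
print(f"LR for p = 0.40 over p = 0.20: {lr:.1f}")
```

The binomial coefficient cancels in the ratio, so only the observed number of responses and the two hypothesized rates matter.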
Likelihood approach
Likelihood ratios (LR)
Take ratio of heights of L for different values of λ
L(λ=0.030)=0.78; L(λ=0.035)=0.03.
LR = 26
[Figure: likelihood curve, Likelihood (0.0 to 1.0) versus Lambda (0.01 to 0.05)]

Likelihood-Based Approach
Use likelihood ratio to determine if there is sufficient evidence in favor of one hypothesis or the other
Error rates are bounded
Implications: Can look at data frequently without concern over mounting errors

Key difference in likelihood versus frequentist paradigm

Consideration of the alternative hypothesis
Frequentist hypothesis testing:
  H0: null hypothesis
  H1: alternative hypothesis
Frequentist p-values:
  calculated assuming the null is true
  have no regard for the alternative hypothesis
Likelihood ratio:
  Compares evidence for two hypotheses
  Acceptance or rejection of null depends on the alternative



Example:
Assume H0: λ = 0.12 vs. H1: λ = 0.08
What if true λ = 0.10?
Simulated data, N=300
Frequentist: p = 0.01
  Reject the null
Likelihood: LR = 1/4
  Weak evidence in favor of null
[Figure: Likelihood (0.0 to 1.0) versus Lambda (0.08 to 0.12)]

Example:
Why?
  P-value looks for evidence against null
  LR compares evidence for both hypotheses
  When the “truth” is in the middle, which makes more sense?
[Figure: Likelihood on a log scale (1e-04 to 1) versus Lambda (0.08 to 0.12)]

Likelihood Inference

Weak evidence: at the end of the study, there is not sufficiently strong evidence in favor of either hypothesis
  This can be controlled by choosing a large enough sample size
  But, if neither hypothesis is correct, can end up with weak evidence even if N is seemingly large (appropriate)
Strong evidence
  Correct evidence: strong evidence in favor of correct hypothesis
  Misleading evidence: strong evidence in favor of the incorrect hypothesis.
This is our interest today: what is the probability of misleading evidence?
This is analogous to the alpha (type I) and beta (type II) errors that frequentists worry about

Operating Characteristics
[Figure: Simon Two-Stage. Probability (0.0 to 1.0) versus True p (0.0 to 0.7); curves: Accept H0, Reject H0]

Operating Characteristics
[Figure: Likelihood Approach. Probability (0.0 to 1.0) versus True p (0.0 to 0.7); curves: Accept H0, Accept HA, Weak Evidence]

Misleading Evidence in Likelihood Paradigm

Universal bound: Under H0,
$P\left(\frac{L_1}{L_0} \ge k\right) \le \frac{1}{k}$
(Birnbaum, 1962; Smith, 1953)
In words, the probability that the likelihood ratio exceeds k in favor of the wrong hypothesis can be no larger than 1/k.
In certain cases, an even lower bound applies (Royall, 2000)
  Difference between normal means
  Large sample size
Common choices for k are 8 (strong), 10, 32 (very strong).

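The universal bound can be checked exactly for a binomial outcome at a fixed sample size. The sketch below sums the null probability of every outcome whose likelihood ratio favors H1 by at least k. The rates 0.20 and 0.40 and N = 37 are borrowed from the motivating example later in the talk; the function name is hypothetical.

```python
from math import comb, log

def prob_misleading(p0, p1, n, k):
    """Exact P(L1/L0 >= k) when the true response rate is p0 (binomial, fixed n)."""
    total = 0.0
    for y in range(n + 1):
        # Log likelihood ratio for y responses among n patients.
        log_lr = y * log(p1 / p0) + (n - y) * log((1 - p1) / (1 - p0))
        if log_lr >= log(k):
            total += comb(n, y) * p0**y * (1 - p0)**(n - y)
    return total

for k in (8, 10, 32):
    p = prob_misleading(0.20, 0.40, n=37, k=k)
    print(f"k = {k:>2}: P(misleading evidence) = {p:.4f}, bound 1/k = {1 / k:.4f}")
```

The computed probabilities sit below 1/k, as the bound requires; how far below depends on the hypotheses and the sample size.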
Implications

Important result: For a sequence of independent observations, the universal bound still holds (Robbins, 1970)
Implication: We can look at the data as often as desired and our probability of misleading evidence is bounded
That is, if k=10, the probability of misleading strong evidence is ≤ 10%
Reasonable bound: Considering β = 10-20% and α = 5-10% in most studies

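To see the sequential version of the bound in a concrete case, the sketch below computes exactly the probability that the likelihood ratio favors H1 by at least k at any look, checking after every patient, when H0 is true. It tracks the distribution of the response count over paths that have not yet crossed the threshold. The rates, the maximum N of 37, and k = 10 are assumptions taken from the examples in this talk; the function name is hypothetical.

```python
from math import log

def prob_ever_misleading(p0, p1, max_n, k):
    """Exact P(L1/L0 >= k at some look n = 1..max_n) when the truth is p0."""
    step_resp = log(p1 / p0)              # change in log LR per responder
    step_non = log((1 - p1) / (1 - p0))   # change in log LR per non-responder
    alive = {0: 1.0}   # responses so far -> probability of not having crossed yet
    crossed = 0.0
    for n in range(1, max_n + 1):
        nxt = {}
        for y, pr in alive.items():
            for y_new, p_step in ((y + 1, p0), (y, 1 - p0)):
                mass = pr * p_step
                log_lr = y_new * step_resp + (n - y_new) * step_non
                if log_lr >= log(k):
                    crossed += mass       # first crossing happens at this look
                else:
                    nxt[y_new] = nxt.get(y_new, 0.0) + mass
        alive = nxt
    return crossed

p = prob_ever_misleading(0.20, 0.40, max_n=37, k=10)
print(f"P(misleading strong evidence at any of 37 looks) = {p:.4f}, bound = {1 / 10:.2f}")
```

Adding looks can only increase this probability, yet it stays below 1/k however many looks are taken.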
Motivating Example

New cancer treatment agent
Anticipated response rate is 40%
Null response rate is 20%
  the standard of care yields 20%
  not worth pursuing new treatment with same response rate as current treatment
Using frequentist approach:
  Simon two-stage with alpha = beta = 10%
  Optimality criterion: smallest E(N) under the null
  First stage: enroll 17. If 4 or more respond, continue
  Second stage: enroll 20 more. If 11 or more of the 37 respond, conclude success.

Likelihood Approach

Recall: we can look at the data after each patient
Use the binomial likelihood to compare two hypotheses.
Difference in the log-likelihoods provides the log likelihood ratio
Simplifies to a linear function of the number of responses:
$\log L_1 - \log L_0 = \sum_i y_i \left[\log p_1 - \log p_0 - \log(1 - p_1) + \log(1 - p_0)\right] + N \left[\log(1 - p_1) - \log(1 - p_0)\right]$

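The log likelihood ratio on this slide translates directly into code. The sketch below writes it as a slope times the number of responses plus a per-patient term, then checks it against the difference of the two binomial log-likelihoods. The function name and the example counts are hypothetical.

```python
from math import log

def log_lr(responses, n, p0, p1):
    """log L1 - log L0 for `responses` successes among `n` patients."""
    slope = log(p1) - log(p0) - log(1 - p1) + log(1 - p0)   # multiplies sum of y_i
    per_patient = log(1 - p1) - log(1 - p0)                 # multiplies N
    return responses * slope + n * per_patient

# Cross-check against the two log-likelihoods written out in full
# (hypothetical data: 4 responses among 17 patients).
y, n, p0, p1 = 4, 17, 0.20, 0.40
direct = (y * log(p1) + (n - y) * log(1 - p1)) - (y * log(p0) + (n - y) * log(1 - p0))
print(log_lr(y, n, p0, p1), direct)
```

Because the statistic is linear in the response count, each response count has a single sample size at which a stopping threshold is first crossed.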
Implementation
Look at the data after each patient
Compute the difference between logL0 and logL1
Rules:
  if logL0 – logL1 > log(k): stop for futility
  if logL0 – logL1 ≤ log(k): continue

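The monitoring rule can be sketched as a loop over patients in enrollment order. The function below returns the patient count at which the futility rule first fires. The outcome sequence is hypothetical, and k = 10 is an assumption, since the slides do not state the value used in the example.

```python
from math import log

def monitor_for_futility(outcomes, p0, p1, k):
    """Return the number of patients at which the futility rule fires, or None."""
    responses = 0
    for n, y in enumerate(outcomes, start=1):
        responses += y
        # logL0 - logL1 after n patients (binomial coefficient cancels).
        log_l0_minus_l1 = (responses * (log(p0) - log(p1))
                           + (n - responses) * (log(1 - p0) - log(1 - p1)))
        if log_l0_minus_l1 > log(k):
            return n
    return None

# Hypothetical sequence: one early responder followed by non-responders.
outcomes = [0, 0, 1] + [0] * 12
print(monitor_for_futility(outcomes, p0=0.20, p1=0.40, k=10))
```

A sequence with enough responders never triggers the rule, and the function returns `None`.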
Likelihood Approach

But, given the discrete nature of the data, only certain looks provide an opportunity to stop
Current example: stop the study if…
  0 responses in 9 patients
  1 response in 12 patients
  2 responses in 15 patients
  3 responses in 19 patients
  4 responses in 22 patients
  5 responses in 26 patients
  6 responses in 29 patients
  7 responses in 32 patients
  8 responses in 36 patients
Although total N can be as large as 37, there are only 9 thresholds for futility early stopping assessment

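The futility thresholds can be derived by finding, for each response count, the first sample size at which logL0 – logL1 exceeds log(k). In the sketch below, k = 10 is an assumption: the slides do not state the value used, but k = 10 with p0 = 0.20, p1 = 0.40, and a maximum N of 37 gives the same nine thresholds as the list above. The function name is hypothetical.

```python
from math import log

def futility_boundaries(p0, p1, max_n, k):
    """Map each response count y to the first n with logL0 - logL1 > log(k)."""
    bounds = {}
    y = 0
    while True:
        for n in range(max(y, 1), max_n + 1):
            stat = y * log(p0 / p1) + (n - y) * log((1 - p0) / (1 - p1))
            if stat > log(k):
                bounds[y] = n
                break
        else:
            # No n up to max_n stops for this y, so none will for larger y either.
            return bounds
        y += 1

for y, n in futility_boundaries(0.20, 0.40, max_n=37, k=10).items():
    print(f"stop if {y} responses in {n} patients")
```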
Design Performance Characteristics
How does the proposed approach compare to the optimal Simon two-stage design?
What are performance characteristics we would be interested in?
  small E(N) under the null hypothesis
  frequent stopping under null (similar to above)
  infrequent stopping under alternative
  acceptance of H1 under H1
  acceptance of H0 under H0

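These characteristics can be computed exactly for the likelihood design by tracking the response count after every patient. The sketch below rests on several assumptions that the slides do not spell out: k = 10, futility looks after every patient before the last, and a final analysis that uses the same k in both directions (accept HA if L1/L0 > k, accept H0 if L0/L1 > k, otherwise weak evidence). The function name is hypothetical.

```python
from math import log

def likelihood_design_oc(true_p, p0, p1, max_n, k):
    """Exact operating characteristics with a futility look after every patient."""
    step_resp = log(p1 / p0)              # change in log(L1/L0) per responder
    step_non = log((1 - p1) / (1 - p0))   # change in log(L1/L0) per non-responder
    log_k = log(k)
    alive = {0: 1.0}   # responses so far -> probability the study is still enrolling
    stop_early = 0.0
    expected_n = 0.0
    for n in range(1, max_n):             # interim looks at n = 1 .. max_n - 1
        nxt = {}
        for y, pr in alive.items():
            for y_new, p_step in ((y + 1, true_p), (y, 1 - true_p)):
                mass = pr * p_step
                log_lr = y_new * step_resp + (n - y_new) * step_non
                if -log_lr > log_k:       # logL0 - logL1 > log(k): stop for futility
                    stop_early += mass
                    expected_n += n * mass
                else:
                    nxt[y_new] = nxt.get(y_new, 0.0) + mass
        alive = nxt
    # Last patient and final analysis (assumed rule: same k in both directions).
    accept_h0, accept_ha, weak = stop_early, 0.0, 0.0
    for y, pr in alive.items():
        for y_new, p_step in ((y + 1, true_p), (y, 1 - true_p)):
            mass = pr * p_step
            expected_n += max_n * mass
            log_lr = y_new * step_resp + (max_n - y_new) * step_non
            if log_lr > log_k:
                accept_ha += mass
            elif -log_lr > log_k:
                accept_h0 += mass
            else:
                weak += mass
    return {"stop_early": stop_early, "expected_n": expected_n,
            "accept_h0": accept_h0, "accept_ha": accept_ha, "weak": weak}

for true_p in (0.20, 0.30, 0.40):
    print(true_p, likelihood_design_oc(true_p, 0.20, 0.40, max_n=37, k=10))
```

Evaluating the function over a grid of true response rates gives curves of the kind plotted in the comparison figures.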
Example 1: Simon Designs
H0: p = 0.20 vs. H1: p = 0.40. Power ≥ 90% and alpha ≤ 0.10.
Optimal Design:
  Stage 1: N1 = 17, r1 = 3
  Stage 2: N = 37, r = 10
  Enroll 17 in stage 1. Stop if 3 or fewer responses.
  If more than 3 responses, enroll to a total N of 37.
  Reject H0 if more than 10 responses observed in 37 patients
Minimax Design:
  Stage 1: N1 = 22, r1 = 4
  Stage 2: N = 36, r = 10
  Enroll 22 in stage 1. Stop if 4 or fewer responses.
  If more than 4 responses, enroll to a total N of 36.
  Reject H0 if more than 10 responses observed in 36 patients

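For comparison, the operating characteristics of a Simon two-stage design follow from two binomial distributions. The sketch below computes the probability of early termination, the expected sample size, and the probability of rejecting H0 for a given true response rate, using the optimal design parameters from this slide. Function and key names are hypothetical.

```python
from math import comb

def binom_pmf(y, n, p):
    """Binomial probability of y responses among n patients."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def simon_oc(true_p, n1, r1, n, r):
    """Stop after stage 1 if responses <= r1; reject H0 if total responses > r."""
    n2 = n - n1
    pet = sum(binom_pmf(y, n1, true_p) for y in range(r1 + 1))
    reject = 0.0
    for y1 in range(r1 + 1, n1 + 1):
        need = max(r + 1 - y1, 0)   # stage 2 responses needed to exceed r in total
        tail = sum(binom_pmf(y2, n2, true_p) for y2 in range(need, n2 + 1))
        reject += binom_pmf(y1, n1, true_p) * tail
    return {"pet": pet, "expected_n": n1 + (1 - pet) * n2, "reject_h0": reject}

print("under H0:", simon_oc(0.20, n1=17, r1=3, n=37, r=10))
print("under H1:", simon_oc(0.40, n1=17, r1=3, n=37, r=10))
```

Under H0 the rejection probability is the type I error rate, and under H1 it is the power.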
[Figure: Simon Optimal vs. Likelihood (N=37). Probability (0.0 to 1.0) versus True p (0.1 to 0.6); curves: Accept HA, Lik; Accept H0, Lik; Weak Evidence; Accept HA, Simon; Accept H0, Simon]

[Figure: Simon Minimax vs. Likelihood (N=36). Probability (0.0 to 1.0) versus True p (0.1 to 0.6); curves: Accept HA, Lik; Accept H0, Lik; Weak Evidence; Accept HA, Simon; Accept H0, Simon]

[Figure: Probability of Early Stopping. Probability of Stopping Early (0.0 to 1.0) versus True p (0.1 to 0.6); curves: Likelihood (optimal N), Likelihood (minmax N), Simon Optimal, Simon MinMax]

[Figure: Expected Sample Size (10 to 35) versus True p (0.2 to 0.8); curves: Likelihood (optimal N), Likelihood (minmax N), Simon Optimal, Simon MinMax]

Another scenario

Lower chance of success
H0: p = 0.05 vs. H1: p = 0.20
Now, only 3 criteria for stopping:
  0 out of 14
  1 out of 23
  2 out of 32

Simon Designs
H0: p = 0.05 vs. H1: p = 0.20. Power ≥ 90% and alpha ≤ 0.10.
Optimal Design:
  Stage 1: N1 = 12, r1 = 0
  Stage 2: N = 37, r = 3
  Enroll 12 in stage 1. Stop if 0 responses.
  If at least one response, enroll to a total N of 37.
  Reject H0 if more than 3 responses observed in 37 patients
Minimax Design:
  Stage 1: N1 = 18, r1 = 0
  Stage 2: N = 32, r = 3
  Enroll 18 in stage 1. Stop if 0 responses.
  If at least one response, enroll to a total N of 32.
  Reject H0 if more than 3 responses observed in 32 patients

[Figure: Simon Optimal vs. Likelihood (N=37). Probability (0.0 to 1.0) versus True p (0.0 to 0.4); curves: Accept HA, Lik; Accept H0, Lik; Weak Evidence; Accept HA, Simon; Accept H0, Simon]

[Figure: Simon Minimax vs. Likelihood (N=32). Probability (0.0 to 1.0) versus True p (0.0 to 0.4); curves: Accept HA, Lik; Accept H0, Lik; Weak Evidence; Accept HA, Simon; Accept H0, Simon]

[Figure: Probability of Early Stopping. Probability of Stopping Early (0.0 to 1.0) versus True p (0.00 to 0.30); curves: Likelihood (optimal N), Likelihood (minmax N), Simon Optimal, Simon MinMax]

[Figure: Expected Sample Size (15 to 35) versus True p (0.0 to 0.6); curves: Likelihood (optimal N), Likelihood (minmax N), Simon Optimal, Simon MinMax]

Comparison with Predictive Probability: Minimax Sample Size
Comparison with Predictive Probability: Optimal Sample Size

Summary and Conclusions (1)

Likelihood-based stopping provides another option for trial design in phase II single arm studies
We only considered 1 value of K
  chosen to be comparable to frequentist approach
  other values will lead to more/less conservative results
  extension: different K for early stopping versus final go/no go decision
Overall, sample size is smaller
  especially marked when you want to stop for futility
  when early stopping is not expected, not much difference in sample size
For ‘ambiguous’ cases:
  likelihood approach stops early more often than Simon
  In minimax designs, finds ‘weak’ evidence frequently

Summary and Conclusions (2)

‘r’ for final analysis is generally smaller.
  why?
  the notion of comparing hypotheses instead of conditioning only on the null.
Comparison to the predictive probability (PP) approach is favorable
  likelihood stopping (LS) is less computationally intensive
  LS does not require specification of a prior
  “search” for designs is relatively simple

Thank you for your attention!
garrettm@musc.edu