Statistical Evaluation of Surrogate Markers

University of Pennsylvania Annual Conference on
Statistical Issues in Clinical Trials
Statistical Evaluation of Surrogate Markers:
Validity, Efficiency and Sensitivity
Yongming Qu, PhD
Eli Lilly and Company
Indianapolis, Indiana
April 18, 2012
This is based on previous and ongoing research through collaboration with Michael Case,
Somnath Sarkar, Wen Li, and Pandurang M. Kulkarni.
Outline
• Introduction
  • Biomarker, surrogate marker and surrogate endpoint
  • Validity and efficiency of surrogate marker
• Quantities used in statistical validation
  • Proportion of Treatment Effect (PTE)
  • General Association
  • Likelihood reduction factor (LRF)
  • Proportion of Information Gain (PIG)
• Effect of measurement error and adjustment for it
• Summary
UPENN Clinical Trials Conference
April 18, 2012
Biomarker and Surrogate Endpoint (SE)
• Biomarker: “a characteristic that is objectively
  measured and evaluated as an indicator of normal
  biologic processes, pathogenic processes, or
  pharmacologic responses to a therapeutic
  intervention” (Clin Pharmacol Ther
  2001;69:89-95)
• Surrogate endpoint: “a laboratory measurement or
  a physical sign used as a substitute for a clinically
  meaningful endpoint that measures directly how a
  patient feels, functions or survives. Changes
  induced by a therapy on a surrogate endpoint are
  expected to reflect changes in a clinically
  meaningful endpoint” (Temple 1995)
Validation of Surrogate Endpoint (SE)
• A surrogate endpoint is intended to replace the clinical outcome for
  any therapy
• A surrogate endpoint is independent of therapy
• The traditional way of validating a surrogate endpoint, using
  treatment, is not feasible
• A surrogate endpoint needs to be validated
• To evaluate the surrogate endpoint, large confirmatory clinical
  trials need to be conducted for both the surrogate and clinical
  endpoints
• If large confirmatory clinical trials are conducted, the drug’s
  efficacy will already have been established
• There is then no need for a surrogate endpoint for this drug
• The conclusion from this drug cannot be extrapolated to other
  drugs because different drugs may work through different
  pathways
Validation of SE – New Thinking
• Validation of an SE should be based on the disease mechanism,
  not the effect of treatment
• Hemoglobin A1c (HbA1c) is a widely used SE for average
  glucose
  • The validation of this SE is not based on any clinical studies involving
    different treatments
  • It is based on biochemistry and physiology
• Progression-free survival (PFS) is widely used as an SE for
  cancer survival
  • The validation should be based on the biology of the disease and tumor, not
    on individual drugs
Surrogate Marker (SM)
• A surrogate marker for a drug is a marker that can be used to
  predict the drug’s efficacy or safety
• Example of the usefulness of a surrogate marker
  • Suppose bone mineral density (BMD) is a surrogate marker for
    osteoporotic fracture
  • The long-term effect of an osteoporosis drug on fracture is difficult
    and costly to determine
  • A woman takes an osteoporosis drug
  • Clinicians measure her BMD after 6 months of use
  • If BMD has increased, the drug works for this woman and she should
    continue to use it
  • If BMD has not increased, the drug does not work for this woman, and she
    should switch to a different drug
• An SM is very useful for monitoring patients and identifying which drug
  works best for a patient in the early stage of the disease
An SM can NOT be used to replace the clinical outcome for drug approval!
SE, SM and Biomarker
[Diagram labels: SE, SM, Biomarker]
Question: Is a particular biomarker an SM?
SE Validation (Prentice)
• Prentice (Stat. Med. 1989; 8:431–440) proposed a necessary and
  sufficient condition for a surrogate endpoint
  f(S|Z) = f(S) ⇔ f(T|Z) = f(T)
• This definition is too stringent: it essentially requires the surrogate
  endpoint to be “equivalent” to the clinical outcome
• Prentice’s key operational criterion f(T|S, Z) = f(T|S) does not
  guarantee this condition
• This condition can be weakened. A marker is said to be an SE if
  f(S|Z) = f(S)
  f(T|Z)=f(T) for any Z
• Practically, this condition cannot be validated through clinical
  trials testing drug effect
• One can NOT prove a mathematical theory through enumeration!
• One can invalidate an SE if the above relationship does not hold for one
  treatment Z
Surrogate Marker - Concepts
• Validity: A marker S is said to be a valid surrogate
  marker for a clinical outcome T for a particular
  treatment if
  f(T|Z) ≠ f(T)
  f(T|S, Z) = f(T|S)
  where Z is the treatment indicator with Z = 1 for the
  treatment and Z = 0 for control
• Efficiency: For two surrogate markers S1 and S2, we say
  S1 is more efficient than S2 if
  Var[T|Z, S1] < Var[T|Z, S2]
• Validity is a much higher hurdle than efficiency in
  practice
Proportion of Treatment Effect
• Consider two models
  $T \mid Z = \alpha_0 + \alpha_Z Z$
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$
• The PTE (Freedman et al., Stat. Med. 1992;
  11:167-178) is
  $\mathrm{PTE} = 1 - \beta_Z / \alpha_Z$
• Drawbacks of PTE
  • Not bounded by [0,1]
  • Large variability makes the estimate
    uninformative
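The PTE calculation can be sketched in a few lines. This is a minimal illustration, assuming a continuous outcome and ordinary least squares fits of the two models; the helper names (`solve`, `ols`, `pte`) and the toy data are hypothetical and are not taken from Freedman et al.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns (intercept first)."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


def pte(T, Z, S):
    """PTE = 1 - (coefficient of Z adjusted for S) / (unadjusted coefficient of Z)."""
    unadjusted = ols([Z], T)[1]
    adjusted = ols([Z, S], T)[1]
    return 1.0 - adjusted / unadjusted


# Hypothetical toy data: the outcome is exactly T = 2*S + Z
Z = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
S = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
T = [2.0 * s + z for s, z in zip(S, Z)]
pte_value = pte(T, Z, S)
print(round(pte_value, 4))
```

In this toy data set the coefficient of Z drops from 3 to 1 once S enters the model, so PTE = 2/3. Because PTE is a ratio of two estimates, a small unadjusted treatment effect in the denominator makes it unstable, which is the variability drawback listed above.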
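General Association
• Consider two models
  $$Y_{S,j} = \mu_S + \alpha_{Z,S} Z_j + \varepsilon_{S,j}$$
  $$Y_{T,j} = \mu_T + \alpha_{Z,T} Z_j + \varepsilon_{T,j}$$
  where
  $$\mathrm{Var}\begin{pmatrix} \varepsilon_{S,j} \\ \varepsilon_{T,j} \end{pmatrix} = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ \sigma_{ST} & \sigma_{TT} \end{pmatrix}$$
• Buyse and Molenberghs (Biometrics 1998; 54:1014-1029) suggested using the coefficient of
  determination to evaluate the surrogate marker
  $$R^2 = (\sigma_{SS}\sigma_{TT})^{-1}\sigma_{ST}^2$$
  $$\mathrm{Var}[T \mid S] = \sigma_{TT} - \sigma_{SS}^{-1}\sigma_{ST}^2 = \sigma_{TT}(1 - R^2)$$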
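The coefficient of determination of Buyse and Molenberghs can be estimated from treatment-adjusted residuals. The sketch below assumes a binary treatment indicator, so that adjusting for Z amounts to centring each arm at its own mean; the function names and the six-patient data set are hypothetical.

```python
def residuals_by_arm(y, z):
    """Residuals from y ~ 1 + z with binary z: subtract each arm's mean."""
    means = {}
    for arm in set(z):
        vals = [yi for yi, zi in zip(y, z) if zi == arm]
        means[arm] = sum(vals) / len(vals)
    return [yi - means[zi] for yi, zi in zip(y, z)]


def r2_adjusted_association(y_s, y_t, z):
    """R^2 = sigma_ST^2 / (sigma_SS * sigma_TT), estimated from residual sums of products."""
    e_s = residuals_by_arm(y_s, z)
    e_t = residuals_by_arm(y_t, z)
    s_ss = sum(a * a for a in e_s)
    s_tt = sum(b * b for b in e_t)
    s_st = sum(a * b for a, b in zip(e_s, e_t))
    return s_st ** 2 / (s_ss * s_tt)


# Hypothetical data: three control patients followed by three treated patients
z = [0, 0, 0, 1, 1, 1]
y_s = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
y_t = [2.0, 2.0, 5.0, 7.0, 8.0, 9.0]
r2 = r2_adjusted_association(y_s, y_t, z)
print(r2)
```

Here the residual sums of squares and products are 4, 8 and 5, giving R² = 25/32. The common divisor in the variance estimates cancels in the ratio, so sums of squares can be used directly.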
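Artificial Example 1
  $$Y_{S,j} = \mu_S + \alpha_{Z,S} Z_j + \varepsilon_{S,j}$$
  $$Y_{T,j} = \mu_T + \alpha_{Z,T} Z_j + \varepsilon_{T,j}$$
  $$\mathrm{Var}\begin{pmatrix} \varepsilon_{S,j} \\ \varepsilon_{T,j} \end{pmatrix} = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ \sigma_{ST} & \sigma_{TT} \end{pmatrix}, \qquad R^2 = (\sigma_{SS}\sigma_{TT})^{-1}\sigma_{ST}^2$$
• Let $\varepsilon_{S,j} = \varepsilon_{T,j}$; then $R^2 = 1$
  $$E[Y_{T,j} \mid Z_j = 0, Y_{S,j}] = \mu_T - \mu_S + Y_{S,j}$$
  $$E[Y_{T,j} \mid Z_j = 1, Y_{S,j}] = \mu_T - \mu_S + Y_{S,j} + (\alpha_{Z,T} - \alpha_{Z,S})$$
• The relationship between clinical outcome and
  marker depends on treatment group
• $Y_{S,j}$ is not a good surrogate marker!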
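Artificial Example 1 can be checked numerically. The following is a small simulation sketch in which the marker and the outcome share the same error term; the parameter values, sample size and variable names are hypothetical choices made only for illustration.

```python
import random

random.seed(0)
mu_s, mu_t = 1.0, 2.0
alpha_zs, alpha_zt = 0.5, 2.0   # hypothetical treatment effects on marker and outcome

z = [0] * 200 + [1] * 200
eps = [random.gauss(0.0, 1.0) for _ in z]   # shared error: eps_S = eps_T
y_s = [mu_s + alpha_zs * zj + e for zj, e in zip(z, eps)]
y_t = [mu_t + alpha_zt * zj + e for zj, e in zip(z, eps)]


def arm_mean(values, arm):
    """Mean of the values belonging to one treatment arm."""
    picked = [v for v, zj in zip(values, z) if zj == arm]
    return sum(picked) / len(picked)


# Treatment-adjusted residuals: each value minus the mean of its own arm
e_s = [v - arm_mean(y_s, zj) for v, zj in zip(y_s, z)]
e_t = [v - arm_mean(y_t, zj) for v, zj in zip(y_t, z)]
r2 = sum(a * b for a, b in zip(e_s, e_t)) ** 2 / (
    sum(a * a for a in e_s) * sum(b * b for b in e_t))

# The slope of Y_T on Y_S is 1 in both arms, so each arm's intercept is mean(Y_T - Y_S)
diff = [t - s for t, s in zip(y_t, y_s)]
intercept_control = arm_mean(diff, 0)
intercept_treated = arm_mean(diff, 1)
print(round(r2, 6), round(intercept_control, 6), round(intercept_treated, 6))
```

The adjusted R² equals 1 up to rounding, yet the two arms have intercepts that differ by the gap between the two treatment effects (1.5 with these values). A perfect R² therefore does not guarantee that the marker captures the treatment effect on the outcome.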
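Artificial Example 2
  $$Y_{S,j} = \mu_S + \alpha_{Z,S} Z_j + \varepsilon_{S,j}$$
  $$Y_{T,j} = \beta_0 + \beta_1 Y_{S,j} + u_j$$
  $$\begin{pmatrix} \varepsilon_{S,j} \\ u_j \end{pmatrix} \sim \mathrm{NID}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{SS} & 0 \\ 0 & \sigma_{uu} \end{pmatrix} \right)$$
  $$R^2 = \frac{1}{1 + \beta_1^{-2}\sigma_{SS}^{-1}\sigma_{uu}}$$
• Depending on the parameters, $R^2$ can take any
  value between 0 and 1
• The effect of treatment on the clinical outcome acts
  solely through the marker $Y_{S,j}$
• $Y_{S,j}$ is a perfect surrogate marker!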
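The R² expression in Artificial Example 2 follows from substituting the marker model into the outcome model. The short derivation below is a sketch; the Greek symbols are a reading of the slide's notation, and the error term written as ε_{T,j} is introduced here to match the general association model.

```latex
\begin{aligned}
Y_{T,j} &= \beta_0 + \beta_1 Y_{S,j} + u_j
         = (\beta_0 + \beta_1\mu_S) + \beta_1\alpha_{Z,S} Z_j
           + (\beta_1\varepsilon_{S,j} + u_j), \\
\varepsilon_{T,j} &= \beta_1\varepsilon_{S,j} + u_j, \qquad
\sigma_{ST} = \beta_1\sigma_{SS}, \qquad
\sigma_{TT} = \beta_1^2\sigma_{SS} + \sigma_{uu}, \\
R^2 &= \frac{\sigma_{ST}^2}{\sigma_{SS}\,\sigma_{TT}}
     = \frac{\beta_1^2\sigma_{SS}}{\beta_1^2\sigma_{SS} + \sigma_{uu}}
     = \frac{1}{1 + \sigma_{uu}/(\beta_1^2\sigma_{SS})}.
\end{aligned}
```

The ratio of residual outcome noise to marker-driven signal alone determines R², so a marker that fully mediates the treatment effect can still have an R² close to 0 when the outcome is noisy.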
Likelihood Reduction Factor (LRF)
• Consider two models
  $T \mid Z = \alpha_0 + \alpha_Z Z$    (1)
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$    (2)
• Alonso et al. (Biometrics 2004; 60:724-728) defined
  the likelihood reduction factor (LRF) as
  $$\mathrm{LRF}(Z,S:Z) = 1 - \exp(-\mathrm{LRT}(Z,S:Z)/n)$$
  where LRT(Z,S:Z) is the likelihood ratio test statistic
  comparing the two models (2) and (1)
• LRF is bounded by [0,1] but may be unable to
  reach 1 for some models
• An adjusted LRF (LRFa) was proposed
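For intuition about the LRF, it helps to work out one special case. The sketch below assumes that T is continuous and that both models are normal linear regressions fitted by maximum likelihood; the residual sums of squares RSS_Z and RSS_{Z,S} are notation introduced here, not on the slide.

```latex
\begin{aligned}
\mathrm{LRT}(Z,S:Z) &= n \log\frac{\mathrm{RSS}_{Z}}{\mathrm{RSS}_{Z,S}}, \\
\mathrm{LRF}(Z,S:Z) &= 1 - \exp\{-\mathrm{LRT}(Z,S:Z)/n\}
                     = 1 - \frac{\mathrm{RSS}_{Z,S}}{\mathrm{RSS}_{Z}}.
\end{aligned}
```

Under these assumptions the LRF is the sample partial R² of S given Z, that is, the sample counterpart of the treatment-adjusted R² used in the general association approach. The LRF extends that quantity to models where only a likelihood is available.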
A Different Approach
• Instead of comparing (LRFa(Z,S:Z); Alonso, et al)
  $T \mid Z = \alpha_0 + \alpha_Z Z$
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$
• We compare (new quantity)
  $T \mid S = \gamma_0 + \gamma_S S$
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$
Proportion of Information Gain (PIG)
• Consider three models
  $T = c_0$    (1)
  $T \mid S = \gamma_0 + \gamma_S S$    (2)
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$    (3)
• Qu and Case (Biometrics 2007; 63:958-963) defined the
  proportion of information gain (PIG) as
  $$\mathrm{PIG} = \frac{\mathrm{LRT}(S:1)}{\mathrm{LRT}(Z,S:1)}$$
  where LRT(Z,S:1) is the likelihood ratio test statistic
  comparing the models (3) and (1), and LRT(S:1) is the
  likelihood ratio test statistic comparing the models (2)
  and (1)
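The PIG ratio is easy to compute once the three models are fitted. The sketch below assumes a continuous outcome and normal linear models, for which each likelihood ratio statistic is n times the log ratio of residual sums of squares; the helper names and the eight-patient data set are hypothetical.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares from the least-squares fit y ~ 1 + columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def pig_gaussian(T, Z, S):
    """PIG = LRT(S:1) / LRT(Z,S:1) under normal linear models."""
    n = len(T)
    rss_null = rss([], T)
    lrt_s = n * math.log(rss_null / rss([S], T))
    lrt_zs = n * math.log(rss_null / rss([Z, S], T))
    return lrt_s / lrt_zs


# Hypothetical data: T = 2*S + Z + noise, so Z keeps a direct effect beyond S
Z = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
S = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, -0.5, 0.5, -0.5, 0.5, 0.5, -0.5]
T = [2.0 * s + z + e for s, z, e in zip(S, Z, noise)]
pig = pig_gaussian(T, Z, S)
print(round(pig, 4))
```

In this data set the residual sums of squares are 60, 11/3 and 2 for models (1), (2) and (3), so PIG = log(180/11) / log(30), about 0.82. The value falls below 1 because the treatment has a direct effect on T that the marker does not carry.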
A Simple Simulation
  $$\mathrm{logit}\,\Pr(T = 1 \mid S, Z) = -S$$
  $$S = Z + u, \quad u \sim N(0, \sigma^2)$$
• Validity of SE is met
• Compare the performance of PTE, LRFa and
  PIG for various $\sigma^2$
• Sample size = 1,000 (n = 500 per group)
• 1,000 simulation samples
Qu and Case (Biometrics 2007;63:958-963)
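One replicate of this simulation can be sketched as follows. The code is an illustration under stated assumptions, not the authors' program: it fits the logistic models with a hand-written Newton-Raphson routine, reads the dispersion parameter as the standard deviation of u, and uses hypothetical function names and an arbitrary seed.

```python
import math
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def max_loglik(columns, y, iters=15):
    """Maximised log-likelihood of logit Pr(y=1) = b0 + sum_k b_k * column_k."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):          # Newton-Raphson updates
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, row))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for a in range(p):
                grad[a] += (yi - mu) * row[a]
                for c in range(p):
                    hess[a][c] += w * row[a] * row[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    loglik = 0.0
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return loglik


def simulate_pig(sigma, n_per_arm=500, seed=2012):
    """One simulated trial from the slide's model; returns the PIG estimate."""
    rng = random.Random(seed)
    z = [0.0] * n_per_arm + [1.0] * n_per_arm
    s = [zj + rng.gauss(0.0, sigma) for zj in z]
    # logit Pr(T=1 | S, Z) = -S  is equivalent to  Pr(T=1) = 1 / (1 + exp(S))
    t = [1.0 if rng.random() < 1.0 / (1.0 + math.exp(sj)) else 0.0 for sj in s]
    ll_null = max_loglik([], t)
    lrt_s = 2.0 * (max_loglik([s], t) - ll_null)
    lrt_zs = 2.0 * (max_loglik([z, s], t) - ll_null)
    return lrt_s / lrt_zs


pig = simulate_pig(sigma=1.0)
print(round(pig, 3))
```

Because the outcome depends on treatment only through S, adding Z to the model should improve the likelihood very little, and a single replicate is expected to give a PIG close to 1. Repeating the call over many seeds and several values of `sigma` would give means and standard deviations in the spirit of the reported simulation.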
Simulation Results: Mean (SD)
σ       PTE           LRFa          PIG
0.01    1.38 (6.66)   0.02 (0.02)   0.98 (0.02)
0.10    1.04 (0.70)   0.06 (0.06)   0.98 (0.02)
1.00    1.02 (0.20)   0.82 (0.05)   1.00 (0.01)
2.00    1.06 (0.34)   0.96 (0.02)   1.00 (0.00)
4.00    1.28 (1.57)   0.99 (0.01)   1.00 (0.00)
Qu and Case (Biometrics 2007;63:958-963)
EFFECT OF MEASUREMENT
ERROR ON EVALUATION OF
BIOMARKERS
Measurement Error in Biomarker
• Biomarker may be measured with error
  • W = S + U, where S is the true value of the marker, U is the
    measurement error and W is the observed value
• The magnitude of measurement error is
  generally described by
  • Proportion of variation due to measurement
    error: Var(U)/Var(W)
    • <30% is considered small
    • 30-50% is considered moderate
    • >50% is considered large
  • Reliability: Var(S)/Var(W)
• Measurement error could attenuate the
  estimate of PIG (and of PTE, etc.)
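The two summaries of measurement error and the size categories listed above can be written as a small helper. This is a sketch with a hypothetical function name; it assumes the error U is independent of the true value S, so that reliability is one minus the error proportion, and it treats the 30% and 50% boundaries as belonging to the moderate band.

```python
def measurement_error_summary(var_u, var_w):
    """Return (error proportion, reliability, size label) for W = S + U."""
    proportion = var_u / var_w          # Var(U) / Var(W)
    reliability = 1.0 - proportion      # Var(S) / Var(W), assuming U independent of S
    if proportion < 0.30:
        label = "small"
    elif proportion <= 0.50:
        label = "moderate"
    else:
        label = "large"
    return proportion, reliability, label


# Hypothetical variances: 70% of the observed variation is measurement error
proportion, reliability, label = measurement_error_summary(var_u=0.7, var_w=1.0)
print(proportion, round(reliability, 2), label)
```

With 70% of the observed variance due to error, the reliability is 0.3 and the error falls in the large category.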
Simulation extrapolation (SIMEX)
• PIG(X), based on the error-free marker X, is what we want
• PIG(W) is the estimate with measurement error
• $E[\mathrm{PIG}(W + \sqrt{-1}\,U^*) \mid X]$ has the same expectation as PIG(X), where U* and U are IID
• The above quantity is generally hard to estimate. SIMEX is a method that uses
  simulation to estimate the trend of the bias (often assuming a quadratic
  curve) and then extrapolates to obtain a less biased estimator.
  $$E[\mathrm{PIG}(W + \sqrt{\lambda}\,U^*) \mid W]$$
Cook and Stefanski, JASA 1994; 89:1314-1328.
Li and Qu, Stat in Med. 2010: 2338–2346
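The simulation and extrapolation steps can be illustrated on a simpler target than PIG. The sketch below applies SIMEX to a regression slope, assumes the measurement error standard deviation `sd_u` is known, and uses only the three grid points 0, 1 and 2 for the added error; all names, parameter values and the seed are hypothetical.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simex_slope(w, t, sd_u, n_rep=50, rng=None):
    """SIMEX estimate of the slope of t on the error-free marker.

    Simulation step: for lambda in {1, 2}, add extra error with variance
    lambda * sd_u**2 to w and average the naive slope over n_rep replicates.
    Extrapolation step: pass a quadratic through lambda = 0, 1, 2 and evaluate
    it at lambda = -1, which for equally spaced points is 3*g0 - 3*g1 + g2.
    """
    rng = rng or random.Random(0)
    g = [slope(w, t)]
    for lam in (1.0, 2.0):
        scale = (lam ** 0.5) * sd_u
        reps = []
        for _ in range(n_rep):
            w_star = [wi + scale * rng.gauss(0.0, 1.0) for wi in w]
            reps.append(slope(w_star, t))
        g.append(sum(reps) / n_rep)
    return 3.0 * g[0] - 3.0 * g[1] + g[2]


# Hypothetical data: true slope 2, marker variance 1, error variance 0.25
rng = random.Random(2012)
n, sd_u = 4000, 0.5
s = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [si + sd_u * rng.gauss(0.0, 1.0) for si in s]
t = [2.0 * si + rng.gauss(0.0, 1.0) for si in s]

naive = slope(w, t)
corrected = simex_slope(w, t, sd_u, rng=rng)
print(round(naive, 3), round(corrected, 3))
```

With these settings the naive slope is attenuated toward about 1.6, while the quadratic extrapolation moves the estimate most of the way back toward 2. The correction is approximate because the true bias curve is not exactly quadratic, which is why SIMEX is described as giving a less biased rather than an unbiased estimator.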
Bone Mineral Density (BMD) and Fracture
[Figure: healthy spine vs. kyphotic spine with vertebral fracture; dual-energy x-ray absorptiometry (DEXA)]
BMD = BMC / BMA
Multiple Outcomes of Raloxifene Evaluation (MORE)
• The MORE study was a 3-year, randomized, double-blind, placebo-controlled clinical trial
  evaluating the treatment effect of raloxifene on vertebral fracture.
• Vertebral fracture was assessed at years 2 and 3, or when a patient presented with symptoms of back pain.
• BMD was measured at baseline and at years 1, 2 and 3.
Sarkar, et al, J Bone Miner Res 2002;17:1–10
Adjustment for Measurement Error in PIG Estimation
• Objective: to evaluate whether the change in femoral neck BMD is a good surrogate
  marker for vertebral fracture
• Femoral neck BMD was measured twice at baseline
• The estimated standard deviation of the measurement error = 0.023 g/cm²
• The proportion of the variability due to measurement error in the observed BMD change was ~70%
  (Qu, et al. Stat in Med 2007; 26:197-211)

            PIG     95% CI
Naive       0.30    (0.05, 0.62)
Adjusted    0.50    (0.08, 0.91)
Li and Qu, Stat in Med. 2010: 2338–2346

• Even after adjusting for measurement error, change in femoral neck BMD is still not a good
  surrogate marker
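Duplicate baseline measurements make it possible to estimate the measurement error directly. The slide does not say how the 0.023 g/cm² figure was obtained; the sketch below shows one common approach, assuming independent errors with a common variance and no systematic shift between the two readings. The function name and the BMD values are hypothetical and are not data from the MORE study.

```python
import math


def measurement_error_sd(pairs):
    """SD of the measurement error from duplicate readings (w1, w2) per subject."""
    # If w1 = s + u1 and w2 = s + u2 with independent errors of common variance,
    # then Var(w1 - w2) = 2 * Var(U), so Var(U) is estimated by sum(d^2) / (2n).
    diffs = [w1 - w2 for w1, w2 in pairs]
    return math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))


# Hypothetical duplicate femoral neck BMD readings in g/cm^2
pairs = [(0.70, 0.72), (0.65, 0.63), (0.80, 0.80), (0.75, 0.79)]
sd_u = measurement_error_sd(pairs)
print(round(sd_u, 4))
```

An estimate of this kind is what a correction method such as SIMEX needs as input, since the amount of extra error added in the simulation step is scaled by the measurement error variance.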
Summary
• New concepts of surrogate marker and
  surrogate endpoint
• Definition of validity and efficiency of a
  surrogate marker
• PIG is so far a very reasonable quantity for
  evaluating a surrogate marker
• Measurement error in the marker can
  attenuate the estimate of PIG
• SIMEX is a general method to correct for
  bias due to measurement error
Abstract
Statistical Evaluation of Surrogate Markers: Validity, Efficiency and
Sensitivity
Yongming Qu, PhD
Surrogate markers are important in drug development because they may dramatically reduce
development cost and cycle time compared with using actual clinical outcomes.
Statistical evaluation of surrogate markers dates back some thirty years. So far,
little progress has been made in identifying new surrogate endpoints. Demonstrating a
treatment effect on clinical outcomes still remains a mandatory requirement for clinical
drug development in many disease areas. For example, “the FDA approved Avastin for
advanced breast cancer in February 2008, after one clinical trial showed that combining
Avastin with another drug, paclitaxel, delayed the median time before tumors worsened by
5.5 months, compared with using paclitaxel alone. But the women who got Avastin did not
live significantly longer than those who got only paclitaxel, which is also known by its brand
name Taxol” (http://www.nytimes.com/2011/06/27/health/27drug.html). In this research,
we discuss validity, efficiency and sensitivity in the statistical evaluation of surrogate
markers. New definitions, with simulations and examples, are provided.