University of Pennsylvania Annual Conference on Statistical Issues in Clinical Trials

Statistical Evaluation of Surrogate Markers: Validity, Efficiency and Sensitivity

Yongming Qu, PhD
Eli Lilly and Company
Indianapolis, Indiana
April 18, 2012

This talk is based on previous and ongoing research in collaboration with Michael Case, Somnath Sarkar, Wen Li, and Pandurang M. Kulkarni.

Outline

Introduction
Biomarker, surrogate marker and surrogate endpoint
Validity and efficiency of surrogate marker
Quantities used in statistical validation
  Proportion of Treatment Effect (PTE)
  General Association
  Likelihood reduction factor (LRF)
  Proportion of Information Gain (PIG)
Effect of measurement error and adjustment for it
Summary

UPENN Clinical Trials Conference April 18, 2012

Biomarker and Surrogate Endpoint (SE)

Biomarker: "a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention." (Clinical Pharmacology & Therapeutics 2001;69:89-95)
Surrogate endpoint: "a laboratory measurement or a physical sign used as a substitute for a clinically meaningful endpoint that measures directly how a patient feels, functions or survives. Changes induced by a therapy on a surrogate endpoint are expected to reflect changes in a clinically meaningful endpoint" (Temple 1995)

Validation of Surrogate Endpoint (SE)

A surrogate endpoint is intended to replace the clinical outcome for any therapy
A surrogate endpoint is independent of therapy
The traditional way of validating a surrogate endpoint using treatment is not feasible
A surrogate endpoint needs to be validated
To evaluate the surrogate endpoint, large confirmatory clinical trials need to be conducted for both the surrogate and the clinical endpoints
If large confirmatory clinical trials are conducted, the drug's efficacy will already have been established.
There is then no need for a surrogate endpoint for this drug
The conclusion from this drug cannot be extrapolated to other drugs, because different drugs may work through different pathways

Validation of SE – New Thinking

Validation of an SE should be based on the disease mechanism, not the effect of treatment
Hemoglobin A1c (HbA1c) is a widely used SE for average blood glucose
  The validation of this SE is not based on any clinical studies involving different treatments
  It is based on biochemistry and physiology
Progression-free survival (PFS) is widely used as an SE for cancer survival
  The validation should be based on the biology of the disease and tumor, not on individual drugs

Surrogate Marker (SM)

A surrogate marker for a drug is a marker that can be used to predict the drug's efficacy or safety
Example of the usefulness of a surrogate marker
  Suppose bone mineral density (BMD) is a surrogate marker for osteoporotic fracture
  The long-term effect of an osteoporosis drug on fracture is difficult and costly to determine
  A woman takes an osteoporosis drug
  Clinicians measure her BMD after 6 months of use
  If BMD has increased, the drug works for this woman and she should continue to use it
  If BMD has not increased, the drug does not work for this woman, and she should switch to a different drug
An SM is very useful for monitoring patients and for identifying which drug works best for a patient in the early stage of the disease
An SM can NOT be used to replace the clinical outcome for drug approval!

SE, SM and Biomarker

[Diagram: SE, SM, Biomarker]
Question: Is a particular biomarker an SM?

SE Validation (Prentice)

Prentice (Stat. Med.
1989, 8:431-440) proposed a necessary and sufficient condition for a surrogate endpoint:
  f(S|Z) = f(S) ⇔ f(T|Z) = f(T)
This definition is too stringent: it essentially requires that the surrogate endpoint be "equivalent" to the clinical outcome
Prentice's key operational criterion, f(T|S, Z) = f(T|S), does not guarantee this condition
This condition can be weakened. A marker is said to be an SE if
  f(S|Z) = f(S)   f(T|Z) = f(T)   for any Z
Practically, this condition cannot be validated through clinical trials testing drug effects
  One can NOT prove a mathematical theorem through enumeration!
  One can invalidate an SE if the above relationship does not hold for one treatment Z

Surrogate Marker - Concepts

Validity: A marker S is said to be a valid surrogate marker for a clinical outcome T for a particular treatment if
  f(T|Z) ≠ f(T)
  f(T|S, Z) = f(T|S)
where Z is the treatment indicator, with Z = 1 for the treatment and Z = 0 for control
Efficiency: For two surrogate markers S1 and S2, we say S1 is more efficient than S2 if
  Var[T|Z, S1] < Var[T|Z, S2]
Validity is a much higher hurdle than efficiency in practice

Proportion of Treatment Effect

Consider two models
  $T \mid Z = \alpha_0 + \alpha_Z Z$
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$
The PTE (Freedman et al., Stat. Med.
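1992; 11:167-178) is
  $\mathrm{PTE} = 1 - \beta_Z / \alpha_Z$
Drawbacks of PTE
  Not bounded by [0,1]
  Large variability makes the results uninformative

General Association

Consider two models
  $Y_{S,j} = \mu_S + \alpha_{Z,S} Z_j + \varepsilon_{S,j}$
  $Y_{T,j} = \mu_T + \alpha_{Z,T} Z_j + \varepsilon_{T,j}$
where
  $\mathrm{Var}\begin{pmatrix} \varepsilon_{S,j} \\ \varepsilon_{T,j} \end{pmatrix} = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ \sigma_{ST} & \sigma_{TT} \end{pmatrix}$
Buyse and Molenberghs (Biometrics 1998;54:1014-1029) suggested using the coefficient of determination to evaluate the surrogate marker
  $R^2 = (\sigma_{SS} \sigma_{TT})^{-1} \sigma_{ST}^2$
  $\mathrm{Var}[T \mid S] = \sigma_{TT} - \sigma_{SS}^{-1} \sigma_{ST}^2 = \sigma_{TT} (1 - R^2)$

Artificial Example 1

  $Y_{S,j} = \mu_S + \alpha_{Z,S} Z_j + \varepsilon_{S,j}$
  $Y_{T,j} = \mu_T + \alpha_{Z,T} Z_j + \varepsilon_{T,j}$
  $\mathrm{Var}\begin{pmatrix} \varepsilon_{S,j} \\ \varepsilon_{T,j} \end{pmatrix} = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ \sigma_{ST} & \sigma_{TT} \end{pmatrix}$, $R^2 = (\sigma_{SS} \sigma_{TT})^{-1} \sigma_{ST}^2$
Let $\varepsilon_{S,j} = \varepsilon_{T,j}$; then $R^2 = 1$ and
  $E[Y_{T,j} \mid Z_j = 0, Y_{S,j}] = \mu_T - \mu_S + Y_{S,j}$
  $E[Y_{T,j} \mid Z_j = 1, Y_{S,j}] = \mu_T - \mu_S + Y_{S,j} + (\alpha_{Z,T} - \alpha_{Z,S})$
The relationship between the clinical outcome and the marker depends on the treatment group
$Y_{S,j}$ is not a good surrogate marker!

Artificial Example 2

  $Y_{S,j} = \mu_S + \alpha_{Z,S} Z_j + \varepsilon_{S,j}$
  $Y_{T,j} = \beta_0 + \beta_1 Y_{S,j} + u_j$
  $\begin{pmatrix} \varepsilon_{S,j} \\ u_j \end{pmatrix} \sim \mathrm{NID}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{SS} & 0 \\ 0 & \sigma_{uu} \end{pmatrix} \right)$
  $R^2 = \dfrac{1}{1 + \beta_1^{-2} \sigma_{SS}^{-1} \sigma_{uu}}$
Depending on the parameters, $R^2$ can take any value between 0 and 1
The effect of treatment on the clinical outcome acts solely through the marker $Y_{S,j}$
$Y_{S,j}$ is a perfect surrogate marker!

Likelihood Reduction Factor (LRF)

Consider two models
  $T \mid Z = \alpha_0 + \alpha_Z Z$   (1)
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$   (2)
Alonso et al.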
(Biometrics 2004; 60:724-728) defined the likelihood reduction factor (LRF) as
  $\mathrm{LRF}(Z, S : Z) = 1 - \exp(-\mathrm{LRT}(Z, S : Z) / n)$
where LRT(Z,S:Z) is the likelihood ratio test statistic comparing models (2) and (1)
LRF is bounded by [0,1], but for some models it cannot reach 1
An adjusted LRF (LRFa) was therefore proposed

A Different Approach

Instead of comparing (LRFa(Z,S:Z), Alonso et al.)
  $T \mid Z = \alpha_0 + \alpha_Z Z$
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$
we compare (new quantity)
  $T \mid S = \gamma_0 + \gamma_S S$
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$

Proportion of Information Gain (PIG)

Consider three models
  $T = c_0$   (1)
  $T \mid S = \gamma_0 + \gamma_S S$   (2)
  $T \mid S, Z = \beta_0 + \beta_Z Z + \beta_S S$   (3)
Qu and Case (Biometrics 2007;63:958-963) defined the proportion of information gain (PIG) as
  $\mathrm{PIG} = \dfrac{\mathrm{LRT}(S : 1)}{\mathrm{LRT}(Z, S : 1)}$
where LRT(Z,S:1) is the likelihood ratio test statistic comparing models (3) and (1), and LRT(S:1) is the likelihood ratio test statistic comparing models (2) and (1)

A Simple Simulation

  $\mathrm{logit}(\Pr(T = 1 \mid S, Z)) = -S$
  $S = Z + u$, $u \sim N(0, \sigma^2)$
Validity is met: T depends on Z only through S
Compare the performance of PTE, LRFa and PIG for various $\sigma^2$
Sample size = 1,000 (n = 500 per group)
1,000 simulated samples
Qu and Case (Biometrics 2007;63:958-963)

Simulation Results: Mean (SD)

  σ      PTE           LRFa          PIG
  0.01   1.38 (6.66)   0.02 (0.02)   0.98 (0.02)
  0.10   1.04 (0.70)   0.06 (0.06)   0.98 (0.02)
  1.00   1.02 (0.20)   0.82 (0.05)   1.00 (0.01)
  2.00   1.06 (0.34)   0.96 (0.02)   1.00 (0.00)
  4.00   1.28 (1.57)   0.99 (0.01)   1.00 (0.00)

Qu and Case (Biometrics 2007;63:958-963)

EFFECT OF MEASUREMENT ERROR ON EVALUATION OF BIOMARKERS

Measurement Error in Biomarker

A biomarker may be measured with error
  W = S + U, where S is the true value of the marker, U is the measurement error, and W is the observed value
The magnitude of
measurement error is generally described by
  Proportion of variation due to measurement error: Var(U) ÷ Var(W)
    < 30% is considered small
    30-50% is considered moderate
    > 50% is considered large
  Reliability: Var(S) ÷ Var(W)
Measurement error can attenuate the estimate of PIG (and of PTE, etc.)

Simulation extrapolation (SIMEX)

PIG(X) is what we want, where X is the true marker value (denoted S on the previous slide)
PIG(W) is the estimate with measurement error
$E[\mathrm{PIG}(W + \sqrt{-1}\, U^*) \mid X]$ has the same expectation as PIG(X), where U* and U are IID
The above quantity is generally hard to estimate. SIMEX uses simulation to estimate the trend of the bias in
  $E[\mathrm{PIG}(W + \sqrt{\lambda}\, U^*) \mid W]$
as a function of $\lambda \ge 0$ (often assuming a quadratic curve) and then extrapolates to $\lambda = -1$ to obtain a less biased estimator.
Cook and Stefanski, JASA 1994; 89:1314-1328.
Li and Qu, Stat in Med. 2010: 2338-2346

Bone Mineral Density (BMD) and Fracture

[Figure: healthy spine and kyphotic spine with vertebral fracture; dual-energy x-ray absorptiometry (DEXA)]
BMD = BMC / BMA

Multiple Outcomes of Raloxifene Evaluation (MORE)

The MORE study was a 3-year, randomized, double-blind, placebo-controlled clinical trial evaluating the treatment effect of raloxifene on vertebral fracture.
Vertebral fracture was assessed at years 2 and 3, or when a patient had symptoms of back pain
BMD was measured at baseline and at years 1, 2 and 3.
Sarkar et al., J Bone Miner Res 2002;17:1-10

Adjustment for Measurement Error in PIG Estimation

Objective: to evaluate whether the change in femoral neck BMD is a good surrogate marker for vertebral fracture
Femoral neck BMD was measured twice at baseline
The estimated standard deviation of the measurement error = 0.023 g/cm²
The proportion of the variability due to measurement error in the observed BMD change was ~70% (Qu et al.,
Stat in Med 2007; 26:197-211)

             PIG    95% CI
  Naive      0.30   (0.05, 0.62)
  Adjusted   0.50   (0.08, 0.91)

Li and Qu, Stat in Med. 2010: 2338–23
Even after adjusting for measurement error, the change in femoral neck BMD is still not a good surrogate marker

Summary

New concepts of surrogate marker and surrogate endpoint
Definition of validity and efficiency of a surrogate marker
PIG is so far a very reasonable quantity for evaluating a surrogate marker
Measurement error in the marker can attenuate the estimate of PIG
SIMEX is a general method to correct for bias due to measurement error

Abstract

Statistical Evaluation of Surrogate Markers: Validity, Efficiency and Sensitivity
Yongming Qu, PhD

Surrogate markers are important in drug development because, compared with actual clinical outcomes, they may dramatically reduce development cost and cycle time. Statistical evaluation of surrogate markers dates back some thirty years. So far, little progress has been made in identifying new surrogate endpoints. Demonstrating a treatment effect on clinical outcomes still remains a mandatory requirement for clinical drug development in many disease areas. For example, “the FDA approved Avastin for advanced breast cancer in February 2008, after one clinical trial showed that combining Avastin with another drug, paclitaxel, delayed the median time before tumors worsened by 5.5 months, compared with using paclitaxel alone. But the women who got Avastin did not live significantly longer than those who got only paclitaxel, which is also known by its brand name Taxol” (http://www.nytimes.com/2011/06/27/health/27drug.html). In this research, we discuss the validity, efficiency and sensitivity in statistical evaluation of surrogate markers. New definitions, with simulations and examples, are provided.
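
Illustrative sketch: computing PIG in the setting of the "A Simple Simulation" slide

The following is a minimal sketch, not the authors' code, of how PIG = LRT(S:1) / LRT(Z,S:1) might be computed for one simulated data set with logit(Pr(T = 1 | S, Z)) = -S and S = Z + u. The function names (solve, logistic_loglik, simulate_pig), the seed, and the choice of a plain Newton-Raphson fit with a fixed number of iterations are assumptions made for this illustration. It uses only the Python standard library and has no safeguards against separation or numerical overflow.

```python
import math
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def logistic_loglik(rows, t, iters=25):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(rows, t):
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    loglik = 0.0
    for x, y in zip(rows, t):
        eta = sum(b * xi for b, xi in zip(beta, x))
        # Bernoulli log-likelihood contribution: y * eta - log(1 + exp(eta)).
        loglik += y * eta - math.log1p(math.exp(eta))
    return loglik


def simulate_pig(n_per_group=500, sigma=1.0, seed=2012):
    """Simulate one trial and return PIG = LRT(S:1) / LRT(Z,S:1)."""
    rng = random.Random(seed)
    z = [0] * n_per_group + [1] * n_per_group
    s = [zi + rng.gauss(0.0, sigma) for zi in z]
    # Pr(T = 1 | S) = expit(-S) = 1 / (1 + exp(S)).
    t = [1 if rng.random() < 1.0 / (1.0 + math.exp(si)) else 0 for si in s]
    ll_null = logistic_loglik([[1.0] for _ in z], t)                     # model (1)
    ll_s = logistic_loglik([[1.0, si] for si in s], t)                   # model (2)
    ll_zs = logistic_loglik([[1.0, zi, si] for zi, si in zip(z, s)], t)  # model (3)
    lrt_s = 2.0 * (ll_s - ll_null)
    lrt_zs = 2.0 * (ll_zs - ll_null)
    return lrt_s / lrt_zs


if __name__ == "__main__":
    print(simulate_pig(n_per_group=500, sigma=1.0))
```

Because model (2) is nested in model (3), LRT(S:1) cannot exceed LRT(Z,S:1) once both fits have converged, so the returned ratio lies between 0 and 1. In this setting T depends on Z only through S, so values close to 1 are expected, in line with the PIG column of the simulation results table.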