Consistent Assessment of Biomarker and Subgroup Identification Methods
H.D. Hollins Showalter
5/20/2014 (MBSW)

Outline
1. Background
2. Data Generation
3. Performance Measurement
4. Example
5. Operationalization
6. Conclusion

Tailored Therapeutics
A medication for which treatment decisions are based on the molecular profile of the patient, the disease, and/or the patient’s response to treatment.
• A tailored therapeutic allows the sponsor to make a regulatory approved claim of an expected treatment effect (efficacy or safety)
• “Tailored therapeutics can significantly increase value—first, for patients—who achieve better outcomes with less risk and, second, for payers— who more frequently get the results they expect.”*
*Opening Remarks at 2009 Investor Meeting, John C. Lechleiter, Ph.D.
Adapted from slides presented by William L. Macias, MD, PhD, Eli Lilly

Achieving Tailored Therapeutics
• Data source: clinical trials (mostly)
• Objective: identify biomarkers and subgroups
• Challenges: complexity, multiplicity
• Need: modern statistical methods

Prognostic vs. Predictive Markers
Prognostic Marker
Single trait or signature of traits that identifies different groups of patients with respect to the risk of an outcome of interest in the absence of treatment
Predictive Marker
Single trait or signature of traits that identifies different groups of patients with respect to the outcome of interest in response to a particular treatment

Statistical Interactions
[Figure: response by marker status (−/+) for treatment vs. no treatment, annotated with the marker effect, the treatment effect, and the treatment-by-marker effect]
$Y = \beta_0 + \beta_1 M + \beta_2 T + \beta_3 M T + \varepsilon$

Types of Predictive Markers
[Figure: four panels of response by marker status (−/+), each with a treatment line and a no-treatment line]

Predictive Marker Example
Subgroup of interest defined on x1 and x2 (entire population split by x1 = 1/0 and x2 = 1/0):
Group | Group size | Trt response | Pl response | Treatment effect
M+ | 25% | -1.39 | -0.19 | -1.20
M− | 75% | -0.33 | -0.20 | -0.13
Subgroup of interest defined on x1 (entire population split by x1 = 1/0):
Group | Group size | Trt response | Pl response | Treatment effect
M+ | 50% | -1.17 | -0.09 | -1.08
M− | not stated | -0.23 | -0.13 | -0.1

BSID vs. “Traditional” Analysis
• Traditional subgroup analysis
  o Interaction testing, one at a time
  o Many statistical issues
  o Many gaps for tailoring
• Biomarker and subgroup identification (BSID)
  o Utilizes modern statistical methods
  o Addresses issues with subgroup analysis
  o Maximizes tailoring opportunities

Simulation to Assess BSID Methods
Objective
Consistent, rigorous, and comprehensive calibration and comparison of BSID methods
Value
• Further improve methodology
  o Identify the gaps (where existing methods perform poorly)
  o Synergy/combining ideas from multiple methods
• Optimize application for specific clinical trials

BSID Simulation: Three Components
1. Data generation
  o Key is consistency
2. BSID
  o “Open” and comprehensive application of analysis method(s)
3. Performance measurement
  o Key is consistency

BSID Simulation: Visual Representation
[Diagram: Data Generation produces the Truth and Datasets 1 to n; BSID turns each dataset into Results 1 to n; Performance Measurement compares each result with the Truth to give Performance Metrics 1 to n, which are combined into Overall Performance Metrics]

Data Generation
• Creating virtual trial data
  o Make assumptions in order to emulate real trial data
  o Knowledge of disease and therapies, including historical data
  o Specific to BSID: must embed markers and subgroups
• In order to measure the performance of BSID methodology the “truth” is needed
  o This is challenging/impossible to discern using real trial data

Data Generation Survey
Attribute | SIDES (2011) [1] | SIDES (2014) [2] | VT [3] | GUIDE [4] | QUINT [5] | IT [6]
n | 900 | 300, 900 | 400 - 2000 | 100 | 200 - 1000 | 300, 450
p | 5 - 20 | 20 - 100 | 15 - 30 | 100 | 5 - 20 | 4
response type | continuous | continuous | binary | binary | continuous | TTE
predictor type | binary | binary | continuous | categorical | continuous | ordinal, categorical
predictor correlation | 0, 0.3 | 0, 0.2 | 0, 0.7 | 0 | 0, 0.2 | 0
treatment assignment | 1:1 | 1:1 | ? | ~1:1 | ~1:1 | ?
# predictive markers | 0-3 | 2 | 0, 2 | 0, 2 | 1-3 | 0, 2
predictive effect(s) | higher order | higher order | higher order | N/A, simple, higher order | simple, higher order | simple
predictive M+ group size (% of n) | 15% - 20% | 50% | N/A, ~25%, ~50% | N/A, ~36% | ~16% - ~50% | N/A, ~25%, ?
# prognostic markers | 0 | 0 | 3 | 0-4 | 1-3 | 0, 2
prognostic effect(s) | N/A | N/A | simple, higher order | N/A, simple, higher order | simple, higher order | simple
model | “contribution model” | “contribution model” | logit model (w/o and with subject-specific effects) | linear model (on probability scale) | “tree model” | exponential model

Data Generation: Recommendations
• Clearly identify attributes and models
  o Transparency
  o Traceability of analysis
• Make sure to capture the “truth” in a way that facilitates performance measurement
• Derive efficiency and synergistic value (more on this later!)

Data Generation: Specifics
• Identify key attributes
  o Sample size
  o Number of predictors
  o Response type
  o Predictor type/correlation
  o Subgroup size
  o Sizes of effects: placebo response, overall treatment effect, predictive effect(s), prognostic effect(s)
  o Others: Missing data, treatment assignment
• Specify model

Data Generation: Reqs
• Format data consistently
• Make code flexible enough to accommodate any/all attributes and models
• Ensure that individual datasets can be reproduced (i.e., various seeds for random number generation)
The resulting dataset(s) should always have the same look and feel

Performance Measurement
• Quantifying the ability of BSID methodology to recapture the “truth” underlying the (generated) data
• If done consistently, allows calibration and comparison of BSID methods

Performance Measurement: Survey
SIDES (2011) [1]
• Selection rate
• Complete match rate
• Partial match rate
• Confirmation rate
• Treatment effect fraction
SIDES (2014) [2]
• Pr(complete match)
• Pr(partial match)
• Pr(selecting a subset)
• Pr(selecting a superset)
• Treatment effect fraction (updated def.)
VT [3]
• Finding correct X’s
• Closeness of $\hat{A}$ to the true A
• Closeness of the size of $\hat{A}$ to the size of the true A
• Power
• Properties of $\hat{Q}(\hat{A})$ as an estimator of $Q(\hat{A})$
GUIDE [4]
• Pr(selection at 1st or 2nd level splits of trees)
• Accuracy
• Pr(nontrivial tree)
QUINT [5]
• (RP1a) Pr(type I errors)
• (RP1b) Pr(type II errors)
• (RP2) Rec. of tree complexity
• (RP3) Rec. of splitting vars and split points
• (RP4) Rec. of assignments of observations to partition classes
IT [6]
• Frequencies of the final tree sizes
• Frequency of (predictor) “hits”
• Bias assessment via likelihood ratio and logrank tests

Performance Measurement: Recommendations
• Marker Level: testing
• Subgroup Level: estimation
• Subject Level: prediction

Perf. Measurement: Survey Revisited
[Table: the measures surveyed above for SIDES (2011) [1], SIDES (2014) [2], VT [3], GUIDE [4], QUINT [5], and IT [6], regrouped by Marker Level (testing), Subgroup Level (estimation), and Subj. Level (prediction); cell assignments not recoverable]

Contingency Table: Marker Level
Identified as Predictive | Predictive Biomarker: True | Predictive Biomarker: False
Yes | True Positive | False Positive
No | False Negative | True Negative
• Sensitivity = True Positive / True Predictive Biomarkers
• Specificity = True Negative / False Predictive Biomarkers
• PPV = True Positive / Identified as Predictive
• NPV = True Negative / Not Identified as Predictive

Performance Measures: Marker Level
# and % of predictors: true vs. identified
• Sensitivity
• Specificity
• PPV
• NPV

Performance Measures: Subgroup Level
• Size of identified subgroup
• Treatment effect in the identified subgroup
  o Average the true “individual” treatment effects under potential outcomes framework
• Accuracy of estimated treatment effect
  o Difference (both absolute and direction) between estimate and true effect

Perf. Measures: Subgroup Level, cont.
• Implications on sample size/time/cost of future trials
  o Given true treatment effect, what is the number of subjects needed in the trial for 90% power?
  o What is the cost of the trial? (mainly driven by # enrolled)
  o How much time will the trial take? (mainly driven by # screened)

Contingency Table: Subject Level
Membership Classification | Potential to Realize Enhanced Treatment Effect*: True | Potential to Realize Enhanced Treatment Effect*: False
M+ | True Positive | False Positive
M- | False Negative | True Negative
*at a meaningful or desired level
• Sensitivity = True Positive / True Enhanced Treatment Effect
• Specificity = True Negative / False Enhanced Treatment Effect
• PPV = True Positive / Classified as M+
• NPV = True Negative / Classified as M-

Performance Measures: Subject Level
Compare subgroup membership on the individual level: true vs. identified
• Sensitivity
• Specificity
• PPV
• NPV

Conditional Performance Measures
• Same metrics with Null submissions removed
Markers/subgroups can be very difficult to find. When a method DOES find something, how accurate is it?
Hard(er) to compare multiple methods when all performance measures are washed out by Null submissions

Cond. Subgroup Level Measures Example
Truth (but x1 very hard to find), 1000 simulations:
Group | Group size | Treatment effect
M+ (x1 = 1) | 50% | 10
M− (x1 = 0) | 50% | 0
x2 = 1 | 50% |
x2 = 0 | 50% |
Results:
BSID Method | Submissions | Unconditional Size | Unconditional Effect | Conditional Size | Conditional Effect
A | 900/1000: Null; 100/1000: x1 = 1 | 0.95 | 5.5 | 0.5 | 10
B | 900/1000: Null; 50/1000: x1 = 1; 50/1000: x2 = 1 | 0.95 | 5.25 | 0.5 | 7.5

Performance Measurement: Reqs
For each application of BSID user proposes:
• List of predictive biomarkers
• The one subgroup for designing the next study
• Estimated treatment effect in this subgroup
In conjunction with the “truth” underlying the generated data, all of the recommended performance measures can be calculated using these elements

Considering the “Three Levels”
What are the most important and relevant measures of a result? Depends on the objective…
• Marker Level: Invest further in the marker(s)
• Subgroup Level: Tailor the next study/design
• Subject Level: Impact in clinical practice

Data Generation Example
Attribute | Value
simulations (datasets) | 200
n | 240
p | 20
response type | continuous ($N(0, 1.13^2)$ errors)
predictor type | ordinal (“genetic”)
predictor correlation | 0
treatment assignment | 1:3 (pl:trt)
placebo response | -0.1 (in weakest responding subgroup)
treatment effect | -0.1 (in weakest responding subgroup)
# predictive markers | 1
predictive effect size(s) (type) | -0.45 (dominant)
predictive M+ group size | ~50% of n
# prognostic markers | 0
prognostic effect size(s) | N/A
model | linear model

Data Generation Example, cont.
[Figure]

Data Generation Example, concl.
[Figure: mean response (axis from 0 to -0.5) by x_1_1_1 status (−/+) for trt 0 and trt 1, one panel per dataset]
Dataset | Trt 0 | Trt 1 | Effect
Dataset 1 | -0.141 | -0.407 | -0.266
Dataset 21 | -0.018 | -0.427 | -0.409

BSID Methods Applied to Example
Approach | Traditional | Virtual Twin [3] | TSDT [7]
Handling treatment-by-subgroup interaction | Model | Transformation | Sequential
Searching for candidate subgroups | Exhaustive | Recursive Partitioning | Recursive Partitioning
Addressing multiplicity | Simple (Sidak Correction) | Permutation | Sub-sampling + Permutation
Alpha controlled at 0.1

Performance Measurement Example
Truth + Proposal = Performance Measures

Perf. Measurement Example, cont.
[Figure]

Perf. Measurement Example, concl.
Measure | Virtual Twin [3] Uncond. | Virtual Twin [3] Cond. | Traditional Uncond. | Traditional Cond. | TSDT [7] Uncond. | TSDT [7] Cond.
Marker Level
Sensitivity | 0.025 | 0.227 | 0.135 | 0.614 | 0.39 | 0.929
Specificity | 0.995 | 0.957 | 0.996 | 0.980 | 0.998 | 0.996
PPV | 0.227 | 0.227 | 0.614 | 0.614 | 0.929 | 0.929
NPV | 0.951 | 0.959 | 0.957 | 0.980 | 0.969 | 0.996
Subgroup Level
Non-Identification (Null) | 89% | | 78% | | 58% |
Subgroup Size | 93.6% | 41.4% | 88.8% | 48.9% | 79.2% | 50.4%
Trt Effect in Subgroup | -0.335 | -0.388 | -0.359 | -0.466 | -0.416 | -0.535
Subject Level
Sensitivity | 0.947 | 0.518 | 0.956 | 0.798 | 0.986 | 0.966
Specificity | 0.076 | 0.689 | 0.180 | 0.820 | 0.406 | 0.966
PPV | 0.523 | 0.639 | 0.576 | 0.814 | 0.702 | 0.967
NPV | 0.592 | 0.592 | 0.805 | 0.805 | 0.965 | 0.965

Strategy
• Develop framework (done/ongoing)
• Present/get input (current)
  o Internal and external forums
  o Workshop
• Establish an open environment (future)
  o R package on CRAN
  o Web portal repository

Predictive Biomarker Project: Vision
• Access Web Portal
  o Reads open description (objective, models, formats, etc.)
• Access web interface for Data Generation
  o Generate data under specified scenarios, or utilize “standard”/pre-existing scenarios
• Apply BSID methodology to datasets
  o Express results in the specified format
• Access web interface for Performance Measurement
  o Compare performance
• Encouraged to contribute to Repository
  o Open sharing of results, descriptions, programs

Pros and Cons
Pros
• More convenient and useful simulation studies to aid research
• Direct comparisons of performance by methods
• Optimization of methods for relevant and important scenarios for drug development
• New insights and collaborations
• Data sets could be applied for other statistical problems
Cons
• Need to develop infrastructure to support simulated data
• Access and upkeep
• Need experts to explicitly define the scope

Conclusion
• Simulation studies are a common approach to assessing BSID methods but there is a lack of consistency in data generation and performance measurement
• The presented framework enables consistent, rigorous, comprehensive calibration and comparison of BSID methods
• Collaborating on this effort will result in efficiency and synergistic value

Acknowledgements
• Richard Zink
• Lei Shen
• Chakib Battioui
• Steve Ruberg
• Ying Ding
• Michael Bell

References
1. Lipkovich I, Dmitrienko A, Denne J, Enas G. Subgroup identification based on differential effect search — a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine 2011; 30:2601–2621. doi:10.1002/sim.4289.
2. Lipkovich I, Dmitrienko A. Strategies for Identifying Predictive Biomarkers and Subgroups with Enhanced Treatment Effect in Clinical Trials Using SIDES. Journal of Biopharmaceutical Statistics 2014; 24:130–153. doi:10.1080/10543406.2013.856024.
3. Foster JC, Taylor JMG, Ruberg SJ. Subgroup identification from randomized clinical trial data. Statistics in Medicine 2011; 30:2867–2880. doi:10.1002/sim.4322.
4. Loh WY, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Presented at Midwest Biopharmaceutical Statistics Workshop 2014.
5. Dusseldorp E, Van Mechelen I. Qualitative interaction trees: a tool to identify qualitative treatment-subgroup interactions. Statistics in Medicine 2014; 33:219–237. doi:10.1002/sim.5933.
6. Su X, Zhou T, Yan X, Fan J, Yang S. Interaction trees with censored survival data. International Journal of Biostatistics 2008; 4(1):Article 2. doi:10.2202/1557-4679.1071.
7. Battioui C, Shen L, Ruberg S. A Resampling-based Ensemble Tree Method to Identify Patient Subgroups with Enhanced Treatment Effect. Proceedings of the 2013 Joint Statistical Meetings.
8. Zink R, Shen L, Wolfinger R, Showalter H. Assessment of Methods to Identify Patient Subgroups with Enhanced Treatment Response in Randomized Clinical Trials. Presented at the 2013 ICSA Applied Statistical Symposium.
9. Shen L, Ding Y, Battioui C. A Framework of Statistical Methods for Identification of Subgroups with Differential Treatment Effects in Randomized Trials. Presented at the 2013 ICSA Applied Statistical Symposium.

Backup Slides

Data Generation: SIDES (2011) [1]
Attribute | Value
simulations (datasets) | 5000
n | 900 (then divided into 3 equal parts: 1 training, 2 test)
p | 5, 10, 20
response type | continuous ($N(0, \sigma^2)$ errors)
predictor type | binary (dichotomized from continuous)
predictor correlation | 0, 0.3
treatment assignment | 1:1 (pl:trt)
placebo response | 0
treatment effect | 0
# predictive markers | 0, 1, 2, 3*
predictive effect size(s) | not explicitly stated
predictive M+ group size | 15% - 20% of n (but not explicitly stated)
# prognostic markers | 0
prognostic effect size(s) | N/A
model | “contribution model”

Data Generation: SIDES (2014) [2]
Attribute | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4
simulations (datasets) | 10000 | 10000 | 10000 | 10000
n | 300 | 300 | 900 | 900
p | 20, 60, 100 | 20, 60, 100 | 20, 60, 100 | 20, 60, 100
response type | continuous ($N(0, \sigma^2)$ errors) | continuous ($N(0, \sigma^2)$ errors) | continuous ($N(0, \sigma^2)$ errors) | continuous ($N(0, \sigma^2)$ errors)
predictor type | binary (dichotomized from continuous) | binary (dichotomized from continuous) | binary (dichotomized from continuous) | binary (dichotomized from continuous)
predictor correlation | 0 | 0.2* | 0 | 0.2*
treatment assignment | 1:1 (pl:trt) | 1:1 (pl:trt) | 1:1 (pl:trt) | 1:1 (pl:trt)
placebo response | 0 | 0 | 0 | 0
treatment effect | 0 | 0 | 0 | 0
# predictive markers | 2** | 2** | 2** | 2**
predictive effect size(s) | 0.35 | 0.35 | 0.6 | 0.6
predictive M+ group size | 0.5 * n = 150 | 0.5 * n = 150 | 0.5 * n = 450 | 0.5 * n = 450
# prognostic markers | 0 | 0 | 0 | 0
prognostic effect size(s) | N/A | N/A | N/A | N/A
model | “contribution model” | “contribution model” | “contribution model” | “contribution model”

Data Generation: Virtual Twins [3]
Attribute | Null | Base | Modifications*
simulations (datasets) | 100 | 100 |
n | 1000 | 1000 | 400 and 2000
p | 15 | 15 | 30
response type | binary | binary |
predictor type | continuous ($N(0, \sigma^2)$ errors) | continuous ($N(0, \sigma^2)$ errors) |
predictor correlation | 0 | 0 | 0.7**
treatment assignment | ? | ? |
placebo response | -1 | -1 |
treatment effect | 0.1 | 0.1 |
# predictive markers | 0 | 2 |
predictive effect size(s) | 0 | 0.9 for X1*X2 | 1.5 for X1*X2
predictive M+ group size | N/A | ~0.25 * n = ~250 | ~0.5 * n = ~500
# prognostic markers | 3 | 3 |
prognostic effect size(s) | 0.5, 0.5, -0.5 for X1, X2, X7; 0.5 for X2*X7 | 0.5, 0.5, -0.5 for X1, X2, X7; 0.5 for X2*X7 |
model | logit model | logit model | logit model with subject-specific effects $a_i$ and $(a_i, b_i)$

Data Generation: GUIDE [4]
Attribute | M1 | M2 | M3
simulations (datasets) | 1000 | 1000 | 1000
n | 100 | 100 | 100
p | 100 | 100 | 100
response type | binary | binary | binary
predictor type | categorical (3 levels)* | categorical (3 levels)* | categorical (3 levels)*
predictor correlation | 0 | 0 | 0
treatment assignment | ~1:1 (pl:trt) | ~1:1 (pl:trt) | ~1:1 (pl:trt)
placebo response | 0.4 | 0.3 | 0.2
treatment effect | 0 | 0 | 0.2
# predictive markers | 2 | 2 | 0
predictive effect size(s) | 0.2, 0.15 for X1, X2; 0.05 for X1*X2 | 0.4 for X1*X2 | N/A
predictive M+ group size | ~0.36 * n = ~36 in strongest M+ group (but not explicitly stated) | ~0.36 * n = ~36 (but not explicitly stated) | N/A
# prognostic markers | 0 | 4 | 2
prognostic effect size(s) | N/A | 0.2 for X3, X4; -0.2 for X1*X2 | 0.2 for X1, X2
model | linear model (on probability scale) | linear model (on probability scale) | linear model (on probability scale)

Data Generation: QUINT [5]
Attribute | Model A | Model B*** | Model C*** | Model D*** | Model E
simulations (datasets) | 100 | 100 | 100 | 100 | 100
n | 200, 300, 400, 500, 1000 | 200, 300, 400, 500, 1000 | 200, 300, 400, 500, 1000 | 200, 300, 400, 500, 1000 | 200, 300, 400, 500, 1000
p | 5, 10, 20 | 5, 10, 20 | 5, 10, 20 | 5, 10, 20 | 5, 10, 20
response type | continuous* | continuous* | continuous* | continuous* | continuous*
predictor type | continuous (multivariate normal)** | continuous (multivariate normal)** | continuous (multivariate normal)** | continuous (multivariate normal)** | continuous (multivariate normal)**
predictor correlation | 0, 0.2 | 0, 0.2 | 0, 0.2 | 0, 0.2 | 0, 0.2
treatment assignment | ~1:1 (trt 1:trt 2) | ~1:1 (trt 1:trt 2) | ~1:1 (trt 1:trt 2) | ~1:1 (trt 1:trt 2) | ~1:1 (trt 1:trt 2)
treatment 1 response | 20*** | 20*** | 20*** | 18.33*** | 30***
treatment 2 effect | -2.5, -5, -10*** | -2.5, -5, -10*** | -2.5, -5, -10*** | -2.5, -5, -10*** | 0***
# predictive markers | 1 | 2 | 3 | 3 | 1
predictive effect size(s) | 5, 10, 20*** | 5, 10, 20*** | 5, 10, 20*** | 5, 10, 20*** | 2.5, 5, 10***
predictive M+ group size | ~0.16 * n (but not explicitly stated)*** | ~0.16 * n (but not explicitly stated)*** | ~0.38 * n (but not explicitly stated)*** | ~0.16 * n (but not explicitly stated)*** | ~0.5 * n (but not explicitly stated)***
# prognostic markers | 1*** | 2*** | 3*** | 3*** | 1***
prognostic effect size(s) | 20*** | 20*** | 20*** | 21.67*** | 10***
model | “tree model” | “tree model” | “tree model” | “tree model” | “tree model”

Data Generation: Interaction Trees [6]
Attribute | Model A | Model B | Model C | Model D
simulations (datasets) | 100 | 100 | 100 | 100
n | 450 test sample method (300 for learning sample, 150 for validation sample), 300 bootstrap method | 450 test sample method (300 for learning sample, 150 for validation sample), 300 bootstrap method | 450 test sample method (300 for learning sample, 150 for validation sample), 300 bootstrap method | 450 test sample method (300 for learning sample, 150 for validation sample), 300 bootstrap method
p | 4 | 4 | 4 | 4
response type | TTE (censoring rates = 0%, 50%) | TTE (censoring rates = 0%, 50%) | TTE (censoring rates = 0%, 50%) | TTE (censoring rates = 0%, 50%)
predictor type | ordinal for X1 and X3, categorical for X2 and X4 | ordinal for X1 and X3, categorical for X2 and X4 | ordinal for X1 and X3, categorical for X2 and X4 | ordinal for X1 and X3, categorical for X2 and X4
predictor correlation | 0 | 0 | 0 | 0
treatment assignment | ? | ? | ? | ?
placebo response | 0.135 | 0.135 | 0.135 | 0.135
treatment effect | 2* | 2* | 2* | 2*
# predictive markers | 0 | 2 | 2 | 2
predictive effect size(s) | N/A | 0.223 for X1*; 4.482 for X2* | 0.741 to 0.050 for X1* **; 1.350 to 20.086 for X2* ** | 0.5 for X1*; 2 for X2*
predictive M+ group size | N/A | ~0.25 * n in strongest M+ group (but not explicitly stated) | not explicitly stated** | ~0.25 * n in strongest M+ group (but not explicitly stated)
# prognostic markers | 2 | 0 | 0 | 0
prognostic effect size(s) | 0.223 for X1*; 4.482 for X2* | N/A | N/A | N/A
model | exponential model | exponential model | exponential model | exponential model

Perf. Measurement: SIDES (2011) [1]
• Selection rate, that is, the proportion of simulation runs in which ≥1 subgroup was identified.
  o Complete match rate: Proportion of simulation runs in which the ideal subgroup was selected as the top subgroup (computed over the runs when at least one subgroup was selected).
  o Partial match rate: Proportion of simulation runs in which the top subgroup was a subset of the ideal subgroup (computed over the runs when at least one subgroup was selected).
• Confirmation rate, that is, the proportion of simulation runs that yielded a confirmed subgroup (which is not necessarily identical to the ideal subgroup). In each run, the top subgroup was identified in terms of the treatment effect p-value in the training data set (if at least one subgroup was selected). The subgroup was classified as ‘confirmed’ if the treatment effect in this subgroup was significant at a two-sided 0.05 level in both test data sets.
• Treatment effect fraction defined as the fraction of the treatment effect (per patient) in the ideal group, which was retained in the top selected or confirmed subgroup. The fraction was defined as follows:

Perf. Measurement: SIDES (2014) [2]
• Probability of a complete match
• Probability of a partial match
  o Probability of selecting a subset
  o Probability of selecting a superset
• Treatment effect fraction (updated definition, not weighted by group sizes):

Perf. Measurement: Virtual Twins [3]
• Finding correct X’s
• Closeness of $\hat{A}$ to the true A. This is measured using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the ROC curve (AUC).
• Closeness of the size of $\hat{A}$ to the size of the true A
• Power. Another quantity of interest is the percentage of times methods find a null $\hat{A}$ when $\theta \neq 0$ and when $\theta = 0$.
• Properties of $\hat{Q}(\hat{A})$ as an estimator of $Q(\hat{A})$

Performance Measurement: GUIDE [4]
• Probabilities that (predictive markers) are selected at first and second level splits of trees
• Accuracy. Let $n(t, y, z)$ denote the number of training samples in node $t$ with $Y = y$ and $Z = z$, and define $n(t, +, z) = \sum_y n(t, y, z)$ and $n_t = \sum_z n(t, +, z)$. Let $S_t$ be the subgroup defined by $t$. The value of $R(S_t)$ is estimated by $\hat{R}(S_t) = |n(t, 1, 1)/n(t, +, 1) - n(t, 1, 0)/n(t, +, 0)|$. The estimate $\hat{S}$ of $S^*$ is the subgroup $S_t$ such that $\hat{R}(S_t)$ is maximum among all terminal nodes. If $S_t$ is not unique, $\hat{S}$ is taken as their union. The “accuracy” of $\hat{S}$ is defined to be $P(\hat{S})/P(S^*)$ if $\hat{S} \subset S^*$ and 0 otherwise.
• Pr(nontrivial tree)

Performance Measurement: QUINT [5]
• (RP1a) Probability of type I errors
• (RP1b) Probability of type II errors
• (RP2) Recovery of tree complexity. Given an underlying true tree with a qualitative treatment–subgroup interaction that has been correctly detected, the probability of successfully identifying the complexity of the true tree.
• (RP3) Recovery of splitting variables and split points. Given an underlying true tree with a qualitative treatment–subgroup interaction that has been correctly detected, the probability of recovering the true tree in terms of the true splitting variables and the true split points.
• (RP4) Recovery of the assignments of the observations to the partition classes

Perf. Measurement: Interaction Trees [6]
• Frequencies of the final tree sizes
• Frequency of (predictor) “hits”
• Bias assessment: the following were calculated for the pooled training and test samples and the validation samples
  o the likelihood ratio test (LRT) for overall interaction
  o the logrank test for treatment effect within the terminal node that showed maximal treatment efficacy (for presentation convenience, the logworth of the p-value, which is defined as -log10(p-value), was used).

Predictive Biomarker Project
Data Generation
• Web interface
• Standard datasets
BSID
• Open methods
• Standard output
Performance Measurement
• Web interface
• Standard summary
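
The three components listed above (data generation, BSID, performance measurement) could be wired together roughly as follows. This is a minimal sketch in Python under stated assumptions, not the project's code: every function name is hypothetical, the model Y = -0.1 - 0.1*T - 0.45*M*T + error is one reading of the Data Generation Example attributes, the allele frequency 0.293 is chosen only so that about half of subjects are M+ under dominant coding ((1 - 0.293)^2 ≈ 0.5), and `toy_bsid` with its threshold is a placeholder standing in for a real BSID method. Undefined ratios (for example PPV when nothing is proposed) are skipped when averaging, which is also an assumption.

```python
import random
from statistics import mean


def generate_dataset(seed, n=240, p=20, sd=1.13, allele_freq=0.293):
    """Simulate one virtual trial and return (rows, truth). All settings are assumptions."""
    rng = random.Random(seed)  # one seed per dataset keeps each dataset reproducible
    rows = []
    for _ in range(n):
        # Ordinal "genetic" predictors coded 0/1/2 (count of copies of an allele).
        x = [sum(rng.random() < allele_freq for _copy in range(2)) for _j in range(p)]
        t = 1 if rng.random() < 0.75 else 0   # 1:3 placebo:treatment assignment
        m = 1 if x[0] >= 1 else 0             # dominant effect of the first predictor
        y = -0.1 - 0.1 * t - 0.45 * m * t + rng.gauss(0.0, sd)
        rows.append({"x": x, "t": t, "y": y, "m_plus": m})
    truth = {"markers": {0}, "p": p}          # the "truth" kept for performance measurement
    return rows, truth


def toy_bsid(rows, p, threshold=0.6):
    """Placeholder method: propose the predictor with the largest difference in
    treatment effect between carriers and non-carriers, or nothing (Null)."""
    def effect(subset):
        trt = [r["y"] for r in subset if r["t"] == 1]
        pl = [r["y"] for r in subset if r["t"] == 0]
        return mean(trt) - mean(pl) if trt and pl else 0.0

    best, best_gap = None, 0.0
    for j in range(p):
        plus = [r for r in rows if r["x"][j] >= 1]
        minus = [r for r in rows if r["x"][j] == 0]
        gap = abs(effect(plus) - effect(minus))
        if gap > best_gap:
            best, best_gap = j, gap
    return {best} if best is not None and best_gap > threshold else set()


def marker_level_metrics(true_markers, proposed, p):
    """Sensitivity, specificity, PPV and NPV from the marker-level contingency table."""
    tp = len(true_markers & proposed)
    fp = len(proposed - true_markers)
    fn = len(true_markers - proposed)
    tn = p - tp - fp - fn

    def ratio(a, b):
        return a / b if b else None           # None marks an undefined ratio

    return {"sensitivity": ratio(tp, tp + fn), "specificity": ratio(tn, tn + fp),
            "ppv": ratio(tp, tp + fp), "npv": ratio(tn, tn + fn)}


def summarize(results, conditional=False):
    """Average metrics over simulations; conditional=True drops Null proposals."""
    kept = [(prop, met) for prop, met in results if prop or not conditional]
    out = {}
    for key in ("sensitivity", "specificity", "ppv", "npv"):
        values = [met[key] for _prop, met in kept if met[key] is not None]
        out[key] = mean(values) if values else None
    return out


if __name__ == "__main__":
    results = []
    for seed in range(200):                   # 200 simulated datasets
        rows, truth = generate_dataset(seed)
        proposed = toy_bsid(rows, truth["p"])
        metrics = marker_level_metrics(truth["markers"], proposed, truth["p"])
        results.append((proposed, metrics))
    print("unconditional:", summarize(results))
    print("conditional:  ", summarize(results, conditional=True))
```

Keeping the truth alongside each dataset and having the method return only its proposal mirrors the idea that the same performance code can score any method; swapping `toy_bsid` for another function with the same inputs and output would leave the other two steps unchanged.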