Development of omics-based clinical tests:
The challenge of achieving statistical robustness and clinical utility
Lisa M. McShane, Ph.D.
Biometric Research Branch
Division of Cancer Treatment and Diagnosis, NCI
University of Pennsylvania 5th Annual Conference on Statistical Issues in
Clinical Trials: Emerging Statistical Issues in Biomarker Validation
Philadelphia, PA
April 18, 2012
• “‘Omics’ is a term encompassing multiple molecular disciplines, which involve the characterization of global sets of biological molecules such as DNAs, RNAs, proteins, and metabolites.”
(IOM. 2012. Evolution of Translational Omics: Lessons Learned and the Path Forward. Washington, DC: The National Academies Press.)
Genomics
Transcriptomics
Proteomics
Metabolomics
Epigenomics
SKY analysis of AML cells
Mutation sequence Surveyor trace
Illumina SNP bead array
cDNA expression microarray
Affymetrix expression GeneChip
MALDI-TOF proteomic spectrum
• Quantify pattern: pre-process raw data; input to classifier or calculate risk score
• Predict clinical outcome or characteristic (e.g., ER+, N0)
• Inform clinical decision
• MammaPrint (70-gene signature), prognostic (Buyse et al. 2006, J Natl Cancer Inst)
• Oncotype DX 21-gene Recurrence Score, prognostic, predictive? (Paik et al. 2006, J Clin Oncol)
• Analytical validity
Does the test accurately and reproducibly measure the analyte or characteristic?
• Clinical/biological validity
Does the test identify a biologic difference (e.g., “pos” vs. “neg”) that may or may not be clinically useful?
• Clinical utility
Do results of the test lead to a clinical decision that has been shown, with a high level of evidence, to improve outcomes?
Teutsch et al. 2009, Genet Med
Simon et al. 2009, J Natl Cancer Inst
Pre-diagnosis
• Risk
• Screening
• Early detection
Diagnosis
• Confirmation
• Staging
• Subtyping
Pre-treatment
• Prognostic
• Predictive
FOCUS: Initial therapy selection
Intra-treatment
• Early response or futility
• Toxicity monitoring
Post-treatment
• Early endpoint
• Recurrence or progression monitoring
• Distinguishing prognostic vs. predictive
• What makes a test “clinically useful”?
• Pitfalls in development of prognostic and predictive tests from high-dimensional omic data
• Challenges in evaluation of tests on retrospective specimen & data sets
• Challenges for prospective evaluation
• PROGNOSTIC: Measurement associated with clinical outcome in the absence of therapy (natural course) or with a standard therapy that all patients are likely to receive
Clinical use: Identify patients having highly favorable outcome in absence of (additional) therapy or extremely poor outcome regardless of
(additional) therapy
Research use: Disease biology, identify drug targets, stratification factor in clinical trials
• PREDICTIVE: Measurement associated with benefit or lack of benefit (or potentially even harm) from a particular therapy relative to other available therapy
Alternate terms:
• Treatment stratification biomarker
• Treatment effect modifier
• Treatment-guiding biomarker
• Treatment-selection biomarker
Examples:
• Endocrine therapies for breast cancer will benefit only patients whose tumors express hormone receptors
• SNPs in the drug metabolizing gene CYP2D6 may confer high risk of serious toxicities from narcotic analgesics
Ideally should be developed synchronously with new therapeutics
• Is the prognostic information sufficiently strong to influence clinical decisions (absolute risk is important)?
• Does the biomarker provide information beyond standard prognostic factors?
Is this prognostic information helpful? A good prognosis group (M−) may forego additional therapy.
[Survival curves: hazard ratio = 0.56; hazard ratio = 0.18; no survival benefit from new treatment; prognostic but not predictive]
[Survival curve scenarios, all prognostic and predictive:
• New treatment for all or for M+ only?*
• New treatment for M+ only
• New treatment for all?*]
Treatment-by-biomarker interaction: Is it sufficient?
(*Different considerations might apply when comparing Standard Treatment vs. New Treatment.)
Qualitative interaction
• Std Trt better for M− (HR = 1.36)
• New Trt better for M+ (HR = 0.63)
• Interaction = 0.63/1.36 = 0.47
Quantitative interaction
• New Trt better for M− (HR = 0.44)
• New Trt better for M+ (HR = 0.63)
• Interaction = 0.63/0.44 = 1.45
Interaction = HR+/HR−, where each HR = hazard(New Trt)/hazard(Std Trt)
(*Different considerations might apply when comparing Standard Treatment vs. New Treatment.)
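To make the arithmetic concrete, a minimal sketch in Python (hazard ratios taken from this slide; the slide's 0.47 and 1.45 were presumably computed from unrounded HRs):

```python
# Minimal sketch of the interaction ratio defined above:
# Interaction = HR(M+) / HR(M-), where each HR = hazard(New Trt) / hazard(Std Trt).
def interaction_ratio(hr_pos: float, hr_neg: float) -> float:
    return hr_pos / hr_neg

# Qualitative interaction: effects point in opposite directions across subgroups.
print(interaction_ratio(0.63, 1.36))  # ~0.46 (slide reports 0.47, likely from unrounded HRs)

# Quantitative interaction: both subgroups favor New Trt, to different degrees.
print(interaction_ratio(0.63, 0.44))  # ~1.43 (slide reports 1.45)
```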
• Most published papers on omic signatures derived from high-dimensional omic data represent biological explorations or computational exercises in search of statistically significant findings
Some biological insights gained, BUT . . .
Few signatures have advanced to the point of having established clinical utility
• Unfocused clinical context (“convenience” specimens)
• Clinical vs. statistical significance
Many published papers on omic signatures have generated spurious findings or used faulty model development or validation strategies
Poor data quality (specimens or assay)
Poor experimental design (e.g. confounding with specimen handling or assay batches)
Multiple testing & model overfitting
Failure to conduct rigorous, independent validation
• Blinding & separation of training and test sets
• Biases introduced by non-randomized treatment
• Pre-specified analyses with appropriate type I error control
• Lack of statistical power
Training sets:
• Generate raw data from selected specimens
• Screen out unsuitable data or specimens
• Raw data pre-processing: normalization, calibration, summary measures
• Identify features (e.g., genes, proteins) relevant to a clinical or pathological distinction
Model development:
• Apply algorithm to develop a classifier or score; INTERNAL VALIDATION
• EXTERNAL VALIDATION on INDEPENDENT set of specimens/data
• Omic assays can be exquisitely sensitive to detect subtle biological differences
• BUT, also exquisitely sensitive to
Specimen processing and handling differences
Assay variation
• Different technical protocol
• Between-lab
• Within-lab, between-batch, between-technician
• BE VIGILANT
Check for artifacts & confounding
Control for them in the experimental design if possible
Assay batch effects: gene expression microarrays
[Figure: density estimates of PM probe intensities (Affymetrix CEL files) for 96 NSCLC specimens; red = batch 1, blue = batch 2, purple & green = outliers? PCA plots after RMA pre-processing with and without the outlier CEL files.]
Normalized data may depend on the other arrays normalized in the same set
(Figure 1 from Owzar et al. 2008, Clin Cancer Res, using data from Beer et al. 2002, Nat Med)
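As an illustration of how such confounding can be checked, a minimal sketch (synthetic data, not the Owzar et al. analysis) of a PCA-by-batch diagnostic:

```python
# Minimal sketch: flag potential batch effects by projecting samples onto
# principal components and checking whether batches separate.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_per_batch, n_genes = 48, 2000
batch_shift = rng.normal(0.0, 0.5, n_genes)            # hypothetical batch artifact
batch1 = rng.normal(size=(n_per_batch, n_genes))
batch2 = rng.normal(size=(n_per_batch, n_genes)) + batch_shift
X = np.vstack([batch1, batch2])                        # samples x genes

pcs = PCA(n_components=2).fit_transform(X)
# If samples cluster by batch on the leading PCs, processing batch is
# confounded with the data and must be addressed before model development.
print("PC1 mean, batch 1:", pcs[:n_per_batch, 0].mean())
print("PC1 mean, batch 2:", pcs[n_per_batch:, 0].mean())
```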
Batch effects for 2nd-generation sequencing data from the 1000 Genomes Project: standardized coverage data shown; same facility, same platform; horizontal lines divide by date.
(Figure 2 from Leek et al. 2010, Nature Rev Genet)
• Selection of informative features
Reduce noise
Reduce dimension
• Building a classifier (predictor) model
Link signature variations to clinical outcome or biological characteristic
• Check for overfitting of the signature model
Internal validation
• External assessment of model performance
• Identify “informative” features (e.g., distinguish favorable vs. unfavorable outcome)
Control false positives
Potentially many distinct, equally informative sets
• Unsupervised dimension reduction
Create “super” features (e.g., “super genes”, pathways)
Example empirical methods:
• Principal components analysis (PCA), or more generally multidimensional scaling
• Clustering to produce cluster level summary features
• Supervised dimension reduction
Feature selection followed by dimension reduction
Example: Supervised principal components
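A minimal sketch of the supervised principal components idea, with an illustrative screening threshold (not any published implementation):

```python
# Minimal sketch of supervised principal components: screen features by
# univariate association with the outcome, then build "super gene" summary
# features from principal components of the selected features only.
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n, p = 100, 5000
X = rng.normal(size=(n, p))                  # expression matrix (synthetic)
y = rng.integers(0, 2, size=n)               # binary outcome, illustrative

# Univariate screen: two-sample t-test per feature (column-wise)
_, pvals = stats.ttest_ind(X[y == 1], X[y == 0])
selected = pvals < 0.01                      # hypothetical screening threshold

# First principal component of the selected features = one "super" feature
super_gene = PCA(n_components=1).fit_transform(X[:, selected])
print(selected.sum(), "features selected; super-feature shape:", super_gene.shape)
```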
Construct classifier function or risk score
Linear predictors (e.g., LDA, SVM):
L(x) = w_1 x_1 + w_2 x_2 + . . . + w_f x_f, to which a cutpoint is often applied
Distance-based (e.g., nearest neighbor, nearest centroid)
Numerous other methods:
• Decision trees
• Random forests
• Completely stochastic or Bayesian model averaging
No “best” method for all problems
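For concreteness, a minimal sketch of the linear-predictor form above; the weights and cutpoint are hypothetical placeholders, not a fitted model:

```python
# Minimal sketch: linear risk score L(x) = w_1*x_1 + ... + w_f*x_f with a
# cutpoint, the form produced by methods such as LDA or a linear SVM.
import numpy as np

w = np.array([0.8, -1.2, 0.3])         # hypothetical weights (f = 3 features)
cutpoint = 0.0                         # hypothetical decision threshold

def classify(x):
    score = float(w @ x)               # L(x) = sum_i w_i * x_i
    return "high risk" if score > cutpoint else "low risk"

print(classify(np.array([1.0, 0.5, -0.2])))   # score = 0.14 -> "high risk"
```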
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship
Model is excessively complex, such as having too many parameters relative to the number of observations
Overfit model will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data
(Source: http://en.wikipedia.org/wiki/Overfitting)
• In high dimensions, true models are always complex and data are always sparse
• VALIDATION OF MODEL PERFORMANCE IS ESSENTIAL
• RESUBSTITUTION (plug in training data) estimates of model performance are highly biased and COMPLETELY USELESS in high-dimensional data setting
• INTERNAL: Within-sample validation
Cross-validation
• (Leave-one-out, split-sample, k-fold, etc.)
Bootstrap and other resampling methods
Method comparisons: Molinaro et al. 2005, Bioinformatics
• EXTERNAL: Independent-sample validation
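A minimal sketch of internal validation done with standard tooling: k-fold cross-validation with the feature screen wrapped in a pipeline so it is re-fit inside every fold (the data here are pure noise, so an honest estimate should hover near 50%):

```python
# Minimal sketch: k-fold cross-validation where feature selection lives
# inside the Pipeline, so it is redone on each fold's training data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2000))       # pure noise markers
y = np.repeat([0, 1], 50)              # arbitrary group labels

pipe = Pipeline([
    ("screen", SelectKBest(f_classif, k=10)),   # screen re-fit per fold
    ("lda", LinearDiscriminantAnalysis()),
])
scores = cross_val_score(pipe, X, y, cv=10)
print(f"Cross-validated accuracy: {scores.mean():.2f}")  # ~0.5, as it should be
```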
Simulation of prognostic model resubstitution method
• Survival data on 129 patients from Bild et al. 2006, Nature
• Expression values for 5000 genes generated randomly from N(0, I_5000) for each patient
• Data divided randomly into training and validation sets
• Prognostic model developed from training set and used to classify patients in both training and validation sets

Simulation | Training p | Validation p
1          | 7.0e-05    | 0.70
2          | 4.2e-07    | 0.54
3          | 2.4e-13    | 0.60
4          | 1.3e-10    | 0.89
5          | 1.8e-13    | 0.36
6          | 5.5e-11    | 0.81
7          | 3.2e-09    | 0.46
8          | 1.8e-07    | 0.61
9          | 1.1e-07    | 0.49
10         | 4.3e-09    | 0.09

(Subramanian and Simon 2010, J Natl Cancer Inst – lung cancer prognostic signatures)
[Kaplan-Meier curves, JBR.10 training set:
• All stages, OBS, n=62: HR=15.02, p<.001, 95% CI=(5.12, 44.04)
• Stage IB, OBS, n=34: HR=13.32, p<.001, 95% CI=(2.86, 62.11)
• Stage II, OBS, n=28: HR=13.47, p<.001, 95% CI=(3.00, 60.43)]
“A 15-gene signature separated OBS patients into high-risk and low-risk subgroups with significantly different survival (hazard ratio [HR], 15.02; 95% CI, 5.12 to 44.04; P<.001; stage I HR, 13.31; P<.001; stage II HR, 13.47; P<.001).” (Zhu et al. 2010, J Clin Oncol)
Figure 1 legend: “Disease-specific survival outcome based on the 15-gene signature in the JBR.10 training set.”
“The prognostic effect was verified in the same 62 OBS patients where gene expression was assessed by qPCR. Furthermore, it was validated consistently in four separate microarray data sets (total 356 stage IB to II patients without adjuvant treatment) and additional JBR.10 OBS patients by qPCR (n=19).”
[Validation results:
• DCC: HR=2.36, p=.026
• UM: HR=3.18, p=.006
• Duke: HR=2.01, p=.08
• NKI: HR=2.02, p=.033
• JBR.10 OBS, RT-qPCR: HR=2.02, p=.033
• JBR.10 ADD, RT-qPCR: HR=2.02, p=.033 (1/9 events)]
What happened to HR=15.02?
Is this still clinically useful?
Failure to maintain separation of training and test sets
Lung Metagene Score Predictor (Figure 5A from Potti et al. 2006, N Engl J Med)

Cohort | Role       | Fraction Stage IA
Duke   | Training   | 39/89
ACOSOG | Validation | 5/25
CALGB  | Validation | 24/84

Over half (39/68) of the cases used to generate the figure were from the training set used to develop the model, and 39/89 of those training cases were Stage IA.
Leave-one-out cross-validation:
• Set aside specimen j
• Build classifier on specimens 1, 2, . . ., j−1, j+1, . . ., N (feature selection, model parameter estimation, etc.)
• “Plug in” specimen j and record predicted class
• Repeat for each j
ALL steps, including feature selection, must be included in the cross-validation loop (see the sketch below)
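A minimal sketch of this loop in Python (the fallback when no feature passes the threshold is an added safeguard, not part of the slide):

```python
# Minimal sketch of the correct LOOCV loop: every step, including feature
# selection, is repeated from scratch with specimen j set aside.
import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def loocv_accuracy(X, y, alpha=0.001):
    n = len(y)
    n_correct = 0
    for j in range(n):                               # set aside specimen j
        train = np.arange(n) != j
        Xtr, ytr = X[train], y[train]
        # Feature selection on the training fold ONLY
        _, pvals = stats.ttest_ind(Xtr[ytr == 1], Xtr[ytr == 0])
        feats = pvals < alpha
        if not feats.any():                          # safeguard: keep best feature
            feats = np.array([int(np.argmin(pvals))])
        clf = LinearDiscriminantAnalysis().fit(Xtr[:, feats], ytr)
        # "Plug in" specimen j and record the predicted class
        n_correct += int(clf.predict(X[j:j+1, feats])[0] == y[j])
    return n_correct / n
```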
• 100 specimens, 1000 simulations
• 6000 markers measured on each specimen
• Marker measurements generated as independent Gaussian white noise (i.i.d. N(0,1))
• Artificially separate specimens into two groups (first 50, last 50) so there is NO true relation between markers and group
• Build predictor of class
Select markers by univariate t-test, α = 0.001
Linear discriminant analysis (LDA)
• TRUE PREDICTION ACCURACY (and misclassification error rate) SHOULD BE 50% (see the sketch below)
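A minimal sketch approximating one replicate of this simulation (smaller dimensions and a looser α so it runs quickly and reliably selects features); the resubstitution estimate comes out far above the true 50%, while the correct LOOCV loop sketched earlier returns roughly 50% on the same data:

```python
# Minimal sketch: pure-noise markers, t-test feature screen, LDA, and the
# badly biased resubstitution accuracy estimate.
import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
n, p, alpha = 100, 1000, 0.01          # slide uses p=6000, alpha=0.001
X = rng.normal(size=(n, p))            # i.i.d. N(0,1): NO true signal
y = np.repeat([0, 1], n // 2)          # artificial groups: first 50 vs last 50

_, pvals = stats.ttest_ind(X[y == 1], X[y == 0])
feats = pvals < alpha                  # ~10 features selected by chance alone
if not feats.any():                    # safeguard: keep best feature
    feats = np.array([int(np.argmin(pvals))])
clf = LinearDiscriminantAnalysis().fit(X[:, feats], y)
resub = (clf.predict(X[:, feats]) == y).mean()
print(f"Resubstitution accuracy: {resub:.2f}  (true accuracy is 0.50)")
```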
True accuracy = 50% is obtained by “LOOCV Correct”
“Resubstitution” is the naïve method of testing model performance by “plugging in” the exact same data as were used to build the model
“LOOCV Wrong” does not reselect features at each iteration of the cross-validation, and it is nearly as biased as the naïve resubstitution estimate
Simulations performed by E. Polley
• 100 specimens, 1000 simulations
• M = 10, 50, or 100 markers measured on each specimen
• Markers i.i.d. N(0,1)
• Randomly separate specimens into two groups (first 50, last 50) so there is NO true relation between markers and group
• Build predictor of class
Select markers by univariate t-test, α = 0.1
Linear discriminant analysis (LDA)
• TRUE PREDICTION ACCURACY (and misclassification error rate) SHOULD BE 50%

Mean % errors:
M   | Correct | Wrong | Resub
10  | 52%     | 44%   | 42%
50  | 51%     | 37%   | 32%
100 | 51%     | 31%   | 24%
• Frequently performed incorrectly (e.g., not including feature selection)
• Cross-validated predictions are sometimes tested with typical statistical inference procedures, but the predictions are not independent
Conventional testing levels and CI widths are not correct
(Lusa et al. 2007, Stat in Med; Jiang et al. 2008, Stat Appl Gen Mol Biol)
• Large variance in estimated accuracy and effects
• Doesn’t protect against biases due to selective inclusion/exclusion of samples
• Doesn’t protect against built-in biases (e.g., lab batch, variable specimen handling)
EXTERNAL VALIDATION IS ESSENTIAL!!!
Is resubstitution acceptable when the model was fit using the control (OBS) arm only? NO! (Fig. 3, Zhu et al. 2010, J Clin Oncol)
[Kaplan-Meier curves: high risk vs. low risk, by microarray and by RT-qPCR]
Figure 1. Genomic Decision Algorithm to Predict Sensitivity of Invasive Breast Cancer to Adjuvant Chemotherapy (CT) or Chemoendocrine Therapy (CT+HT) (Hatzis et al. 2011, JAMA)
Figure 2. Validation Cohort #1: 35% N−, 65% N+; 62% ER+; AT chemotherapy, HT if ER+ (A = anthracycline, T = taxane)
eFigure 6A. Validation Cohort #2: 100% N−; ER+ & ER− (%?); no HT & no CT
Claim: Test is predictive and not prognostic
P=.002 (Fig 2) vs. P=.096 (eFig 6A)
• Comparison of randomized designs (Sargent et al. 2005, J Clin Oncol; Freidlin et al. 2010, J Natl Cancer Inst; Clark and McShane 2011, Stat in Biopharm Res)
Enrichment design
Completely randomized design
Randomized block design
Biomarker-strategy design
Adaptive designs
• Challenges
Big, long, and expensive
Inadequate enforcement of regulatory requirements
Test may become available before trial completes accrual
Difficulties with blinding & compliance
• Detection of model overfitting and biased study designs requires in-depth knowledge of complex statistical approaches to the analysis of high-dimensional omic data.
TRUE or FALSE
• Poor model development practices have few adverse consequences because the models will eventually be tested in rigorously designed clinical studies.
TRUE or FALSE
• Need earlier and more intense focus on clinical utility
• Need rigor in omics-based test development and study design
• EXTERNAL VALIDATION is essential
• Need more complete and transparent reporting of omics studies
REMARK guidelines (McShane et al. 2005, J Natl Cancer Inst)
REMARK Explanation & Elaboration (Altman et al. 2012, BMC Med and PLoS Med)
Availability of data and computer code?
• Need multi-disciplinary collaborative teams with ALL relevant expertise included
• NCI Cancer Diagnosis Program
Barbara Conley (Director), Tracy Lively
• NCI Biometric Research Branch
Richard Simon (Chief), Ed Korn, Boris Freidlin, Eric Polley, Mei Polley
• Institute of Medicine Committee for Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials
(http://iom.edu/Activities/Research/OmicsBasedTests.aspx)
1. SKY AML image: http://www.nature.com/scitable/topicpage/human-chromosome-translocations-and-cancer-23487
2. Mutation sequence Surveyor trace (public domain): http://upload.wikimedia.org/wikipedia/commons/8/89/Mutation_Surveyor_Trace.jpg
3. Illumina SNP bead array: https://www.sanger.ac.uk/Teams/Team67/
4. cDNA expression microarray image (public domain): http://en.wikipedia.org/wiki/File:Microarray2.gif
5. Affy GeneChip expression array: source unknown
6. MALDI-TOF proteomic spectrum: Hodgkinson et al. (Cancer Letters, 2010)