Development of omics-based clinical tests:

The challenge of achieving statistical robustness and clinical utility

Lisa M. McShane, Ph.D.

Biometric Research Branch

Division of Cancer Treatment and Diagnosis, NCI

University of Pennsylvania 5th Annual Conference on Statistical Issues in

Clinical Trials: Emerging Statistical Issues in Biomarker Validation

Philadelphia, PA

April 18, 2012


Omics

• "Omics" is a term encompassing multiple molecular disciplines, which involve the characterization of global sets of biological molecules such as DNAs, RNAs, proteins, and metabolites.

(IOM. 2012. Evolution of Translational Omics: Lessons Learned and the Path Forward. Washington, DC: The National Academies Press.)

– Genomics
– Transcriptomics
– Proteomics
– Metabolomics
– Epigenomics

Example Omic Assays

[Images shown on the slide; sources are listed under "References for images" below.]
• SKY analysis of AML cells
• Mutation sequence surveyor trace
• Illumina SNP bead array
• cDNA expression microarray
• Affymetrix expression GeneChip
• MALDI-TOF proteomic spectrum

GOAL: Omic "signature" as a clinical test
• Quantify the pattern: pre-process the raw data; input to a classifier or calculate a risk score
• Predict a clinical outcome or characteristic
• Inform a clinical decision (slide example: ER+, N0)

• MammaPrint (70-gene signature): prognostic (Buyse et al. 2006, J Natl Cancer Inst)
• Oncotype DX 21-gene Recurrence Score: prognostic, predictive? (Paik et al. 2006, J Clin Oncol)


Definitions

• Analytical validity
– Does the test accurately and reproducibly measure the analyte or characteristic?
• Clinical/biological validity
– Does the test identify a biologic difference (e.g., "pos" vs. "neg") that may or may not be clinically useful?
• Clinical utility
– Do results of the test lead to a clinical decision that has been shown with a high level of evidence to improve outcomes?

Teutsch et al. 2009, Genet Med

Simon et al. 2009, J Natl Cancer Inst


Potential roles for omics-based tests in medicine
• Pre-diagnosis: risk, screening, early detection
• Diagnosis: confirmation, staging, subtyping
• Pre-treatment: prognostic, predictive (FOCUS: initial therapy selection)
• Intra-treatment: early response or futility, toxicity monitoring
• Post-treatment: early endpoint, recurrence or progression monitoring

Overview

• Distinguishing prognostic vs. predictive

• What makes a test “clinically useful”

• Pitfalls in development of prognostic and predictive tests from high-dimensional omic data

• Challenges in evaluation of tests on retrospective specimen & data sets

• Challenges for prospective evaluation


Prognostic test

• PROGNOSTIC: Measurement associated with clinical outcome in the absence of therapy (natural course) or with the standard therapy all patients are likely to receive
– Clinical use: identify patients with a highly favorable outcome in the absence of (additional) therapy, or with an extremely poor outcome regardless of (additional) therapy
– Research use: disease biology, identifying drug targets, stratification factor in clinical trials


Predictive test

• PREDICTIVE: Measurement associated with benefit or lack of benefit (or potentially even harm) from a particular therapy relative to other available therapy

– Alternate terms:

• Treatment stratification biomarker

• Treatment effect modifier

• Treatment-guiding biomarker

• Treatment-selection biomarker

– Examples:

• Endocrine therapies for breast cancer will benefit only patients whose tumors express hormone receptors

• SNPs in the drug-metabolizing gene CYP2D6 may confer high risk of serious toxicities from narcotic analgesics

– Ideally should be developed synchronously with new therapeutics


When is a prognostic test clinically useful?

• Is the prognostic information sufficiently strong to influence clinical decisions (absolute risk is important)?
• Does the biomarker provide information beyond standard prognostic factors? (see the sketch below)
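One way to probe the second question is to compare nested Cox models with and without the omic score. A minimal sketch (my illustration, not from the talk; the file and column names are hypothetical, and the standard factors are assumed to be numerically coded):

```python
# Hedged sketch: does an omic risk score add prognostic information beyond
# standard factors? Likelihood-ratio comparison of nested Cox models.
import pandas as pd
from scipy.stats import chi2
from lifelines import CoxPHFitter

# Hypothetical cohort file with columns: time, event, stage, age, omic_score
data = pd.read_csv("cohort.csv")

base = CoxPHFitter().fit(data[["time", "event", "stage", "age"]],
                         duration_col="time", event_col="event")
full = CoxPHFitter().fit(data[["time", "event", "stage", "age", "omic_score"]],
                         duration_col="time", event_col="event")

# Likelihood-ratio test for the single added parameter (the omic score)
lr_stat = 2 * (full.log_likelihood_ - base.log_likelihood_)
p_value = chi2.sf(lr_stat, df=1)
print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.3g}")
```

A significant likelihood-ratio test speaks to added information; clinical usefulness still depends on whether the absolute risk difference is large enough to change a treatment decision.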

[Figure: survival curves by marker status under two scenarios, hazard ratio = 0.56 and hazard ratio = 0.18. Is this prognostic information helpful? A good-prognosis group (M−) may forego additional therapy.]

Prognostic vs. predictive distinction: importance of control groups
[Figure: survival curves by marker status and treatment arm under two scenarios.]
• No survival benefit from the new treatment in either marker group: prognostic but not predictive
• Marker prognostic and predictive: new treatment for all or for M+ only?*
(*Different considerations might apply for Standard Treatment → New Treatment)

When is a predictive test clinically useful?
• Treatment-by-biomarker interaction: is it sufficient?
[Figure: two scenarios in which the marker is prognostic and predictive — in one, new treatment for M+ only; in the other, new treatment for all?*]

Qualitative interaction
• Std Trt better for M− (HR− = 1.36)
• New Trt better for M+ (HR+ = 0.63)
• Interaction = 0.63/1.36 = 0.47

Quantitative interaction
• New Trt better for M− (HR− = 0.44)
• New Trt better for M+ (HR+ = 0.63)
• Interaction = 0.63/0.44 = 1.45

Interaction = HR+ / HR−, where HR = λ_New / λ_Std (hazard under the new treatment relative to the standard treatment)

(*Different considerations might apply for Standard Treatment → New Treatment)
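In practice the interaction ratio is usually estimated by including a treatment-by-marker product term in a Cox model. A minimal sketch (my illustration, not from the talk; the file and column names are hypothetical):

```python
# Hedged sketch: estimating a treatment-by-marker interaction with a Cox model.
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical randomized-trial file: time, event, new_trt (0/1), marker_pos (0/1)
data = pd.read_csv("trial.csv")
data["trt_x_marker"] = data["new_trt"] * data["marker_pos"]

cph = CoxPHFitter()
cph.fit(data[["time", "event", "new_trt", "marker_pos", "trt_x_marker"]],
        duration_col="time", event_col="event")
cph.print_summary()

# exp(coef) for new_trt estimates HR- (effect of New vs. Std in marker-negative
# patients); exp(coef) for trt_x_marker estimates the interaction ratio HR+/HR-.
```

Whether a statistically significant interaction is sufficient for clinical utility is exactly the question raised above: a quantitative interaction (both groups benefit) may not change the treatment decision for either group.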


Pitfalls in developing prognostic and predictive tests from omic data

• Most published papers on omic signatures derived from high-dimensional omic data represent biological explorations or computational exercises in search of statistically significant findings

– Some biological insights gained, BUT . . .
– Few signatures have advanced to the point of having established clinical utility

• Unfocused clinical context (“convenience” specimens)

• Clinical vs. statistical significance


Pitfalls in developing prognostic and predictive tests from omic data

Many published papers on omic signatures have generated spurious findings or used faulty model development or validation strategies

– Poor data quality (specimens or assay)
– Poor experimental design (e.g., confounding with specimen handling or assay batches)
– Multiple testing & model overfitting
– Failure to conduct rigorous, independent validation
  • Blinding & separation of training and test sets
  • Biases introduced by non-randomized treatment
  • Pre-specified analyses with appropriate type I error control
– Lack of statistical power


Development of an omic signature
Training sets:
1. Generate raw data from selected specimens
2. Screen out unsuitable data or specimens
3. Pre-process raw data: normalization, calibration, summary measures
4. Identify features (e.g., genes, proteins) relevant to a clinical or pathological distinction
5. Model development: apply an algorithm to develop a classifier or score; INTERNAL VALIDATION
6. EXTERNAL VALIDATION on an INDEPENDENT set of specimens/data

Artifacts

• Omic assays can be exquisitely sensitive to detect subtle biological differences

• BUT, also exquisitely sensitive to

– Specimen processing and handling differences
– Assay variation
  • Different technical protocols
  • Between-lab
  • Within-lab, between-batch, between-technician
• BE VIGILANT
– Check for artifacts & confounding (see the PCA sketch below)
– Control for these in the experimental design if possible
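One routine vigilance check is to plot the leading principal components of the normalized data colored by assay batch. A minimal sketch (my illustration, not from the talk; the file names and batch label column are hypothetical):

```python
# Hedged sketch: screen normalized omic data for batch effects with PCA.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

expr = pd.read_csv("expr.csv", index_col=0)               # rows = specimens, columns = features
batch = pd.read_csv("batches.csv", index_col=0)["batch"]  # assay batch per specimen, same row order

scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(expr.values))

for b in sorted(batch.unique()):
    mask = (batch == b).values
    plt.scatter(scores[mask, 0], scores[mask, 1], label=f"batch {b}")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.title("Specimens colored by assay batch")
plt.show()

# Clear separation by batch in the leading components suggests technical
# artifacts that could be confounded with the clinical comparison of interest.
```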

Assay batch effects: gene expression microarrays
[Figure: density estimates of PM probe intensities (Affymetrix CEL files) for 96 NSCLC specimens — red = batch 1, blue = batch 2, purple & green = outliers? — and PCA plots after RMA pre-processing with and without the outlier CEL files. Normalized data may depend on the other arrays normalized in the same set.]
(Figure 1 from Owzar et al. 2008, Clin Cancer Res, using data from Beer et al. 2002, Nat Med)

Assay batch effects: sequence data
[Figure: batch effects for 2nd-generation sequence data from the 1000 Genomes Project; standardized coverage data from the same facility and platform; horizontal lines divide samples by date.]
(Figure 2 from Leek et al. 2010, Nature Rev Genet)


Development and validation of the signature model

• Selection of informative features
– Reduce noise
– Reduce dimension
• Building a classifier (predictor) model
– Link signature variations to clinical outcome or biological characteristic
• Check for overfitting of the signature model
– Internal validation
• External assessment of model performance


Feature selection & data reduction

• Identify "informative" features (e.g., those that distinguish favorable vs. unfavorable outcome)
– Control false positives
– Potentially many distinct, equally informative sets
• Unsupervised dimension reduction
– Create "super" features (e.g., "super genes", pathways)
– Example empirical methods:
  • Principal components analysis (PCA), or more generally multidimensional scaling
  • Clustering to produce cluster-level summary features
• Supervised dimension reduction
– Feature selection followed by dimension reduction
– Example: supervised principal components (see the sketch below)
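A minimal sketch of the supervised principal components idea (my illustration, not code from the talk): screen features by a univariate test against the outcome, then summarize only the selected features with PCA.

```python
# Hedged sketch: supervised principal components for dimension reduction.
# X (n_samples x n_features) and a binary outcome y are assumed given.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.decomposition import PCA

def supervised_pcs(X, y, alpha=0.001, n_components=2):
    """Select features by univariate t-test, then summarize them with PCA."""
    _, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0)
    selected = np.where(pvals < alpha)[0]
    if selected.size == 0:                       # nothing passes the screen
        return np.zeros((X.shape[0], 0)), selected
    pca = PCA(n_components=min(n_components, selected.size))
    return pca.fit_transform(X[:, selected]), selected

# Toy example with pure-noise data (no real signal, so few features should pass):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = np.array([0] * 50 + [1] * 50)
supers, selected = supervised_pcs(X, y)
print(selected.size, "features selected;", supers.shape[1], "super features")
```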


Building the molecular signature model

Construct classifier function or risk score
– Linear predictors (e.g., LDA, SVM): L(x) = w_1 x_1 + w_2 x_2 + . . . + w_f x_f, to which a cutpoint is often applied
– Distance-based (e.g., nearest neighbor, nearest centroid)
– Numerous other methods:
  • Decision trees
  • Random forests
  • Completely stochastic or Bayesian model averaging
No "best" method for all problems
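A minimal sketch of a linear predictor with a cutpoint (my illustration, not from the talk), here using a linear SVM on toy data:

```python
# Hedged sketch: a linear risk score L(x) = w.x + b from a linear SVM,
# dichotomized at a cutpoint to form a two-class signature.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 20))                    # toy training data
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=80) > 0).astype(int)

model = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
scores = model.decision_function(X)              # L(x) = w.x + b for each specimen
cutpoint = 0.0                                   # could be tuned, e.g. for sensitivity
predicted_class = (scores > cutpoint).astype(int)
print(predicted_class[:10], scores[:3].round(2))
```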


Dangers of model overfitting

• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship
– The model is excessively complex, e.g., too many parameters relative to the number of observations
– An overfit model will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data
(Source: http://en.wikipedia.org/wiki/Overfitting)
• In high dimensions, true models are always complex and data are always sparse
• VALIDATION OF MODEL PERFORMANCE IS ESSENTIAL


Model validation

• RESUBSTITUTION (plugging in the training data) estimates of model performance are highly biased and COMPLETELY USELESS in the high-dimensional data setting

• INTERNAL: Within-sample validation
– Cross-validation (leave-one-out, split-sample, k-fold, etc.)
– Bootstrap and other resampling methods
– Method comparisons: Molinaro et al. 2005, Bioinformatics
• EXTERNAL: Independent-sample validation


Simulation of prognostic model resubstitution method
• Survival data on 129 patients from Bild et al. 2006, Nature
• Expression values for 5000 genes generated randomly from N(0, I_5000) for each patient
• Data divided randomly into training and validation sets
• Prognostic model developed from the training set and used to classify patients in both the training and validation sets

Simulation | Training p-value | Validation p-value
1 | 7.0e-05 | 0.70
2 | 4.2e-07 | 0.54
3 | 2.4e-13 | 0.60
4 | 1.3e-10 | 0.89
5 | 1.8e-13 | 0.36
6 | 5.5e-11 | 0.81
7 | 3.2e-09 | 0.46
8 | 1.8e-07 | 0.61
9 | 1.1e-07 | 0.49
10 | 4.3e-09 | 0.09

(Subramanian and Simon 2010, J Natl Cancer Inst — lung cancer prognostic signatures)


Prognostic model resubstitution example
[Figure panels, JBR.10 training set:]
• All stages, OBS, n=62: HR = 15.02, p < .001, 95% CI = (5.12, 44.04)
• Stage IB, OBS, n=34: HR = 13.32, p < .001, 95% CI = (2.86, 62.11)
• Stage II, OBS, n=28: HR = 13.47, p < .001, 95% CI = (3.00, 60.43)

"A 15-gene signature separated OBS patients into high-risk and low-risk subgroups with significantly different survival (hazard ratio [HR], 15.02; 95% CI, 5.12 to 44.04; P <.001; stage I HR, 13.31; P <.001; stage II HR, 13.47; P <.001)." (Zhu et al. 2010, J Clin Oncol)

Figure 1 legend: "Disease-specific survival outcome based on the 15-gene signature in the JBR.10 training set."

Independent validations (?) of the 15-gene prognostic score

"The prognostic effect was verified in the same 62 OBS patients where gene expression was assessed by qPCR. Furthermore, it was validated consistently in four separate microarray data sets (total 356 stage IB to II patients without adjuvant treatment) and additional JBR.10 OBS patients by qPCR (n=19)."

[Figure panels (survival by risk group; x-axes in years): DCC: HR = 2.36, p = .026; UM: HR = 3.18, p = .006; Duke: HR = 2.01, p = .08; NKI: HR = 2.02, p = .033; JBR.10 OBS (RT-qPCR): HR = 2.02, p = .033 (1/9 events); JBR.10 ADD (RT-qPCR): HR = 2.02, p = .033]

What happened to HR = 15.02? Is this still clinically useful?

Partial resubstitution: combining training and test sets
• Failure to maintain separation of training and test sets
• Lung Metagene Score Predictor (Figure 5A from Potti et al. 2006, N Engl J Med)

Cohort | Role | Fraction Stage IA
Duke | Training | 39/89
ACOSOG | Validation | 5/25
CALGB | Validation | 24/84

Over half (39/68) of the cases used to generate the figure were from the training set used to develop the model, and 39/89 of those training cases were Stage IA.

Internal validation: leave-one-out cross-validation (LOOCV)
For each specimen j = 1, . . ., N:
1. Set aside specimen j, keeping specimens 1, 2, . . ., j−1, j+1, . . ., N
2. Build the classifier on the remaining specimens (feature selection, model parameter estimation, etc.)
3. "Plug in" specimen j and record its predicted class
ALL steps, including feature selection, must be included in the cross-validation loop (see the sketch below)
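A minimal sketch of keeping feature selection inside the loop (my illustration, not code from the talk), using a scikit-learn Pipeline so the selection step is re-fit on every leave-one-out training set; SelectKBest with a fixed k stands in for the α-level t-test screen used in the simulations that follow:

```python
# Hedged sketch: LOOCV with feature selection correctly nested inside the loop.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6000))        # pure noise: no real class signal
y = np.array([0] * 50 + [1] * 50)

clf = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),   # re-fit on every training fold
    ("lda", LinearDiscriminantAnalysis()),
])
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy on noise data: {acc:.2f}")   # should hover around 0.50
```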


Simulation of cross-validation approaches
• 100 specimens, 1000 simulations
• 6000 markers measured on each specimen
• Marker measurements generated as independent Gaussian white noise (i.i.d. N(0,1))
• Specimens artificially separated into two groups (first 50, last 50), so there is NO true relation between markers and group
• Build a predictor of class:
– Select markers by univariate t-test, α = 0.001
– Linear discriminant analysis (LDA)
• TRUE PREDICTION ACCURACY (and misclassification error rate) SHOULD BE 50% (see the simulation sketch below)
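The bias is easy to reproduce. A hedged sketch under this setup (one simulated data set rather than 1000, to keep the run quick):

```python
# Hedged sketch: resubstitution vs. LOOCV with feature selection done wrongly
# (once, outside the loop) vs. correctly (inside the loop), on pure-noise data.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, m, alpha = 100, 6000, 0.001
X = rng.normal(size=(n, m))
y = np.array([0] * 50 + [1] * 50)          # arbitrary split: no true signal

def select(X, y, alpha):
    _, p = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.where(p < alpha)[0]

# Resubstitution: select and fit on all data, evaluate on the same data
feats = select(X, y, alpha)
assert feats.size > 0, "no features passed the screen; rerun with another seed"
resub = LinearDiscriminantAnalysis().fit(X[:, feats], y).score(X[:, feats], y)

wrong = correct = 0
for j in range(n):
    train = np.delete(np.arange(n), j)
    # WRONG: reuse the features selected on the full data set
    lda = LinearDiscriminantAnalysis().fit(X[train][:, feats], y[train])
    wrong += lda.predict(X[[j]][:, feats])[0] == y[j]
    # CORRECT: redo feature selection within each leave-one-out training set
    f_j = select(X[train], y[train], alpha)
    if f_j.size == 0:                      # no feature passes: guess the majority class
        correct += (np.bincount(y[train]).argmax() == y[j])
    else:
        lda = LinearDiscriminantAnalysis().fit(X[train][:, f_j], y[train])
        correct += lda.predict(X[[j]][:, f_j])[0] == y[j]

print(f"Resubstitution accuracy: {resub:.2f}")        # inflated well above 50%
print(f"Wrong LOOCV accuracy:    {wrong / n:.2f}")    # still optimistic
print(f"Correct LOOCV accuracy:  {correct / n:.2f}")  # close to the true 50%
```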


Importance of correct cross-validation
• True accuracy = 50% is obtained by "LOOCV Correct"
• "Resubstitution" is the naïve method of testing model performance by "plugging in" the exact same data as were used to build the model
• "LOOCV Wrong" does not reselect features at each iteration of the cross-validation, and it is nearly as biased as the naïve resubstitution estimate

Simulations performed by E. Polley


Incorrect validation: is bias only a problem with a very large number of markers?
• 100 specimens, 1000 simulations
• M = 10, 50, or 100 markers measured on each specimen
• Markers i.i.d. N(0,1); specimens randomly separated into two groups (first 50, last 50), so there is NO true relation between markers and group
• Build a predictor of class:
– Select markers by univariate t-test, α = 0.1
– Linear discriminant analysis (LDA)
• TRUE PREDICTION ACCURACY (and misclassification error rate) SHOULD BE 50%

Mean % errors:
M | Correct | Wrong | Resub
10 | 52% | 44% | 42%
50 | 51% | 37% | 32%
100 | 51% | 31% | 24%

Limitations of internal validation

• Frequently performed incorrectly (e.g., feature selection not included in the loop)
• Cross-validated predictions are often evaluated with typical statistical inference procedures, but:
– the predictions are not independent
– conventional testing levels and CI widths are not correct
(Lusa et al. 2007, Stat in Med; Jiang et al. 2008, Stat Appl Gen Mol Biol)
• Large variance in estimated accuracy and effects
• Doesn't protect against biases due to selective inclusion/exclusion of samples
• Doesn't protect against built-in biases (e.g., lab batch, variable specimen handling)
EXTERNAL VALIDATION IS ESSENTIAL!!!


Assessment of predictive tests: dangers of resubstitution
• Is resubstitution acceptable when the model was fit using the control (OBS) arm only? NO! (Fig. 3, Zhu et al. 2010, J Clin Oncol)
[Figure panels: high-risk and low-risk groups by microarray and by RT-qPCR.]

Assessment of predictive tests: dangers of nonrandomized treatment, different cohorts
(Hatzis et al. 2011, JAMA. Figure 1: Genomic Decision Algorithm to Predict Sensitivity of Invasive Breast Cancer to Adjuvant Chemotherapy (CT) or Chemoendocrine Therapy (CT + HT))
• Figure 2, Validation Cohort #1: 35% N−, 65% N+; 62% ER+; A/T → HT if ER+ (A = anthracycline, T = taxane)
• eFigure 6A, Validation Cohort #2: 100% N−; ER+ & ER− (%?); no HT & no CT
• Claim: test is predictive and not prognostic — P = .002 (Fig 2) vs. P = .096 (eFig 6A)

Prospective trials to evaluate clinical utility of omic tests

• Comparison of randomized designs (Sargent et al. 2005, J Clin Oncol; Freidlin et al. 2010, J Natl Cancer Inst; Clark and McShane 2011, Stat in Biopharm Res)
– Enrichment design
– Completely randomized design
– Randomized block design
– Biomarker-strategy design
– Adaptive designs
• Challenges
– Big, long, and expensive
– Inadequate enforcement of regulatory requirements
– Test may become available before the trial completes accrual
– Difficulties with blinding & compliance


Quiz

• Detection of model overfitting and biased study designs requires in-depth knowledge of complex statistical approaches to the analysis of high-dimensional omic data.
– TRUE or FALSE
• Poor model development practices have few adverse consequences because the models will eventually be tested in rigorously designed clinical studies.
– TRUE or FALSE


Summary remarks

• Need earlier and more intense focus on clinical utility

• Need rigor in omics-based test development and study design

• EXTERNAL VALIDATION is essential

• Need more complete and transparent reporting of omics studies

– REMARK guidelines (McShane et al. 2005, J Natl Cancer Inst)
– REMARK Explanation & Elaboration (Altman et al. 2012, BMC Med and PLoS Med)
– Availability of data and computer code?

• Need multi-disciplinary collaborative teams with ALL relevant expertise included


Acknowledgments

• NCI Cancer Diagnosis Program: Barbara Conley (Director), Tracy Lively
• NCI Biometric Research Branch: Richard Simon (Chief), Ed Korn, Boris Freidlin, Eric Polley, Mei Polley
• Institute of Medicine Committee for Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials (http://iom.edu/Activities/Research/OmicsBasedTests.aspx)


References for images

1. SKY AML image: http://www.nature.com/scitable/topicpage/human-chromosome-translocations-and-cancer-23487
2. Mutation sequence surveyor trace (public domain): http://upload.wikimedia.org/wikipedia/commons/8/89/Mutation_Surveyor_Trace.jpg
3. Illumina SNP bead array: https://www.sanger.ac.uk/Teams/Team67/
4. cDNA expression microarray image (public domain): http://en.wikipedia.org/wiki/File:Microarray2.gif
5. Affymetrix GeneChip expression array: source unknown
6. MALDI-TOF proteomic spectrum: Hodgkinson et al. (Cancer Letters, 2010)
