Systems Approaches to Disease Stratification
Nathan Price
Introduction to Systems Biology Short Course, August 20, 2012

Goals and Motivation
Currently, most diagnoses are based on symptoms and visual features (pathology, histology). However, many diseases that appear deceptively similar are in fact distinct entities from the molecular perspective. This drives the move toward personalized medicine.

Outline
Molecular signature classifiers: main issues — signal to noise, small sample size issues, error estimation techniques, phenotypes and sample heterogeneity. An example study. Advanced topics: network-based classification, and the importance of broad disease context.

Molecular signature classifiers: overall strategy
Molecular signatures for diagnosis. The goals of molecular classification of tumors are to identify subpopulations of cancer and to inform the choice of therapy. Generally, a set of microarray experiments is used, with ~100 patient samples and ~10^4 transcripts (genes). This very small number of samples relative to the number of transcripts is a key issue: feature selection and model selection are dominated by small sample size issues, and error estimation techniques matter greatly. The microarray platform used can also have a significant effect on results.

Randomness
Expression values have randomness arising from both biological and experimental variability. Design, performance evaluation, and application of classifiers must take this randomness into account. Three critical issues arise: (1) Given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population? (2) How does one estimate the error of a designed classifier when data is limited? (3) Given a large set of potential variables, such as the large number of expression levels provided by each microarray, how does one select a set of variables as the input vector to the classifier?
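The dimensionality mismatch above is worth making concrete. A minimal sketch (Python; the sample and transcript counts come from the slide, and restricting attention to two-gene markers is purely illustrative):

```python
from math import comb

n_transcripts = 10_000   # ~10^4 measured transcripts
n_samples = 100          # ~100 patient samples

# Even restricting attention to two-gene markers, the number of
# candidate gene pairs dwarfs the number of samples available
# to evaluate them.
n_pairs = comb(n_transcripts, 2)
print(n_pairs)               # 49,995,000 candidate pairs
print(n_pairs // n_samples)  # ~500,000 candidate pairs per sample
```

With so many candidates and so few samples, some feature sets will look excellent purely by chance, which is why the error estimation issues below dominate.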
Small sample issues
Our task is to predict future events; thus, we must avoid overfitting. It is easy (if the model is complicated enough) to fit the data we already have. Simplicity of the model is vital when data are sparse and the number of possible relationships is large. This is exactly the case in virtually all microarray studies, including ours.

In the clinic
In the end, we want a test that can easily be implemented and actually benefit patients.

Error estimation and variable selection
An error estimator may be unbiased but have a large variance, and therefore often be low. This can produce a large number of gene sets and classifiers with low error estimates. For a small sample, one can end up with thousands of gene sets for which the error estimate from the sample data is near zero!

Overfitting
A complex decision boundary may be unsupported by the data relative to the feature-label distribution. Relative to the sample data, a classifier may have small error; but relative to the feature-label distribution, the error may be severe! A classification rule should not cut up the space in a manner too complex for the amount of sample data available.

Overfitting: example of the k-NN rule (N = 30 training samples with k = 3, versus N = 90).

Example: how to identify appropriate models (regression, but the issues are the same). Given y = f(x) + noise, learn f from the data. Candidate models: linear, quadratic, piecewise linear interpolation. Which one is best?
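The k-NN overfitting effect can be reproduced on toy data. A minimal sketch (pure Python; the synthetic 1-D problem and label-noise rate are invented for illustration, not from the study): a 1-NN rule memorizes the training set and achieves zero training error, while a smoother k = 9 rule typically generalizes better.

```python
import random

def knn_predict(train, x, k):
    # majority vote among the k training points nearest to x
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)

random.seed(0)

def sample(n):
    # 1-D toy problem: true class is sign(x), with 20% of labels flipped
    pts = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        label = (x > 0) != (random.random() < 0.2)  # noisy label
        pts.append((x, label))
    return pts

train, test = sample(30), sample(60)

for k in (1, 9):
    train_err = sum(knn_predict(train, x, k) != y for x, y in train) / len(train)
    test_err = sum(knn_predict(train, x, k) != y for x, y in test) / len(test)
    print(f"k={k}: training error {train_err:.2f}, test error {test_err:.2f}")
```

The k = 1 training error is exactly zero (each training point is its own nearest neighbor), yet that says nothing about performance on future cases — the gap between the two error columns is the overfitting.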
Cross-validation
Simple: just choose the classifier with the best cross-validation error. But (there is always a but) we are training on even less data, so the classifier design is worse; and if the sample size is small, the test set is small and the error estimator has high variance, so we may be fooling ourselves into thinking we have a good classifier. LOOCV (leave-one-out cross-validation) mean square errors for the three candidate models: 2.12, 0.96 (best), 3.33.

Estimating error on future cases
Use cross-validation to estimate accuracy on future cases. Feature selection and model selection must be within the loop to avoid overly optimistic estimates. The data set is resampled: shuffled repeatedly into training and test sets, with no information passage between them. Best case: have an independent test set; otherwise, use resampling techniques. Average performance on the test set provides an estimate of behavior on future cases, which can be MUCH different than behavior on the training set.

Classification methods
k-nearest neighbor; support vector machine (SVM); linear and quadratic discriminants; perceptrons and neural networks; decision trees; k-Top Scoring Pairs; and many others.

Molecular signature classifiers: example study
Diagnosing similar cancers with different treatments. A challenge in medicine: diagnosis, treatment, and prevention of disease suffer from a lack of knowledge. Gastrointestinal stromal tumor (GIST) and leiomyosarcoma (LMS) are morphologically similar and hard to distinguish using current methods, but they call for different treatments, so correct diagnosis is critical. Studying genome-wide patterns of expression aids clinical diagnosis.
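The warning above — feature selection must sit inside the cross-validation loop — can be demonstrated on pure-noise data. A minimal sketch (pure Python; the single-feature nearest-class-mean classifier and all names here are illustrative, not the study's method): when the "best" feature is chosen using all samples before LOOCV begins, information leaks and the estimate looks far better than the honest, inside-the-loop estimate.

```python
import random

random.seed(1)
n, p = 20, 200
# pure-noise data: no feature truly separates the two classes
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

def class_means(rows, labels, j):
    m0 = sum(r[j] for r, lab in zip(rows, labels) if lab == 0) / labels.count(0)
    m1 = sum(r[j] for r, lab in zip(rows, labels) if lab == 1) / labels.count(1)
    return m0, m1

def best_feature(rows, labels):
    # "feature selection": the feature whose class means differ most
    return max(range(p), key=lambda j: abs(class_means(rows, labels, j)[0]
                                           - class_means(rows, labels, j)[1]))

def loocv_accuracy(select_inside):
    j_fixed = best_feature(X, y)  # selected on ALL data (leaks information)
    correct = 0
    for i in range(n):
        rows, labels = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        j = best_feature(rows, labels) if select_inside else j_fixed
        m0, m1 = class_means(rows, labels, j)
        pred = 1 if abs(X[i][j] - m1) < abs(X[i][j] - m0) else 0
        correct += pred == y[i]
    return correct / n

acc_outside = loocv_accuracy(False)  # typically looks far better than chance
acc_inside = loocv_accuracy(True)    # honest estimate: near 0.5 on noise
print(acc_outside, acc_inside)
```

Since the data are pure noise, any accuracy meaningfully above 0.5 from the outside-the-loop estimate is an artifact of selection bias.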
Goal: identify a molecular signature that will accurately differentiate a GIST patient from an LMS patient.

Relative expression reversal classifiers
Find a classification rule of the form: IF gene A > gene B THEN class 1, ELSE class 2. The classifier is chosen by finding the most accurate and robust rule of this type from all possible gene pairs in the dataset. If needed, a set of classifiers of the above form can be used, with the final classification resulting from a majority vote (k-TSP). (Geman, D., et al., Stat. Appl. Genet. Mol. Biol., 3, Article 19, 2004; Tan et al., Bioinformatics, 21:3896-904, 2005.)

Rationale for k-TSP
Based on the concept of relative expression reversals. Advantages: it does not require data normalization; it does not require population-wide cutoffs or weighting functions; it has reported accuracies in the literature comparable to SVMs, PAM, and other state-of-the-art classification methods; and it results in classifiers that are easy to implement. It is also designed to avoid overfitting: with n = number of genes and m = number of samples, the space of candidate pair rules (on the order of n^2) is vastly smaller than the space of possible labelings (on the order of 2^m); for the example shown here, roughly 10^9 << 10^20.

Diagnostic marker pair
[Figure: OBSCN expression vs. C9orf65 expression on log scales; samples on one side of the decision line are classified as GIST, on the other as LMS; clinicopathological diagnosis marked X for GIST and O for LMS.] Accuracy on the data = 99%; predicted accuracy on future data (LOOCV) = 98%. (Price, N.D. et al., PNAS 104:3414-9, 2007.)

RT-PCR classification results
[Figure: difference of average Ct between OBSCN and C9orf65 for each sample separates GIST from LMS.] 100% accuracy on 19 independent samples and 20 samples from the microarray study, including a previously indeterminate case. (Price, N.D. et al., PNAS 104:3414-9, 2007.)
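The single-pair rule above can be sketched directly. A minimal illustration (pure Python; the toy expression matrix and gene indices are invented, not the study's data): score every gene pair by how differently the event "gene A < gene B" behaves in the two classes, and keep the best pair.

```python
from itertools import combinations

def train_tsp(X, y):
    """Top-scoring pair: pick the gene pair whose ordering flips most between classes."""
    c0, c1 = sorted(set(y))
    idx = {c: [i for i, lab in enumerate(y) if lab == c] for c in (c0, c1)}

    def frac_less(a, b, c):
        # fraction of class-c samples where gene a is expressed below gene b
        return sum(X[i][a] < X[i][b] for i in idx[c]) / len(idx[c])

    def score(pair):
        a, b = pair
        return abs(frac_less(a, b, c0) - frac_less(a, b, c1))

    a, b = max(combinations(range(len(X[0])), 2), key=score)
    # orient the rule: which class tends to have gene a below gene b
    hi, lo = (c0, c1) if frac_less(a, b, c0) > frac_less(a, b, c1) else (c1, c0)
    return a, b, hi, lo

def predict_tsp(model, x):
    a, b, hi, lo = model
    return hi if x[a] < x[b] else lo

# toy data: in class "GIST" gene 0 > gene 1; in class "LMS" the order reverses
X = [[5, 1, 3], [6, 2, 3], [1, 5, 3], [2, 6, 3]]
y = ["GIST", "GIST", "LMS", "LMS"]
model = train_tsp(X, y)
print(predict_tsp(model, [4, 1, 3]))   # a GIST-like profile
```

Note that the rule compares two genes within the same sample, which is why no cross-sample normalization is needed — the property the slides emphasize.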
Comparative biomarker accuracies
[Figure: c-kit gene expression vs. the OBSCN/C9orf65 two-gene relative expression classifier; GIST marked X, LMS marked O. (Price, N.D. et al., PNAS 104:3414-9, 2007.)]

Kit protein staining of GIST-LMS
[Figure: top row, GIST-positive staining; bottom row, GIST-negative staining; blue arrows mark GIST, red arrows mark LMS.] Accuracy as a classifier is ~87%. (Price, N.D. et al., PNAS 104:3414-9, 2007.)

A few general lessons
Choosing markers based on relative expression reversals of gene pairs has proven to be very robust, with high predictive accuracy in the sets we have tested so far. It is simple and independent of normalization, and ultimately easy to implement as a clinical test: all that is needed is RT-PCR on two genes. The advantages of this approach may be even more applicable to proteins in the blood, with each decision rule requiring the measurement of the relative concentration of two proteins.

Network-based classification
Feature selection methods can be modified based on networks. This can improve performance (though not always), and it generally improves biological insight by integrating heterogeneous data. It has been shown to improve prediction of breast cancer metastasis, a complex phenotype. (Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:140.)

Rationale: Differential Rank Conservation (DIRAC)
Cancer is a multi-genic disease. Analyze high-throughput data to identify the aspects of the genome-scale network that are most affected. The initial version uses a priori defined gene sets (BioCarta, KEGG, GO, etc.).
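DIRAC's rank-conservation idea can be sketched compactly. A minimal illustration (pure Python; the toy pathway values are invented, and the scoring shown — a majority pairwise-ordering template for a phenotype, plus the fraction of pairs each sample matches — is a simplified outline of the published method):

```python
from itertools import combinations

def rank_template(samples):
    # majority ordering for each gene pair across samples of one phenotype
    n = len(samples[0])
    return {(a, b): sum(s[a] < s[b] for s in samples) * 2 > len(samples)
            for a, b in combinations(range(n), 2)}

def matching_score(template, s):
    # fraction of pairwise orderings in sample s that agree with the template
    agree = sum((s[a] < s[b]) == t for (a, b), t in template.items())
    return agree / len(template)

def rank_conservation(samples):
    # mean matching score of a phenotype's samples against its own template
    t = rank_template(samples)
    return sum(matching_score(t, s) for s in samples) / len(samples)

tight = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 5, 7]]    # tightly regulated pathway
loose = [[1, 2, 3], [3, 2, 1], [2, 3, 1], [1, 3, 2]]  # weakly regulated pathway
print(rank_conservation(tight))  # 1.0: every sample shares one ranking
print(rank_conservation(loose))  # < 1.0: the ranking shuffles between samples
```

Comparing a phenotype's samples against the other phenotype's template gives the cross-phenotype version: a pathway whose ranking is conserved within each class but shuffled between classes is a discriminating pathway.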
Networks or pathways inform the best targets for therapies.
[Figure: the OBSCN vs. C9orf65 diagnostic marker pair plot, repeated; accuracy on the data = 99%, predicted accuracy on future data (LOOCV) = 98%.]

Differential rank conservation (DIRAC) is used for studying (1) expression rank conservation for pathways within a phenotype, and (2) pathways that discriminate well between phenotypes. (Eddy, J.A. et al., PLoS Computational Biology, 2010; Price, N.D. et al., PNAS, 2007.)

Differential rank conservation across pathways in a phenotype: a tightly regulated pathway, with the same gene ranking in every sample, shows the highest conservation; a weakly regulated pathway shows the lowest. Across phenotypes for a pathway: a shuffled pathway ranking between phenotypes (e.g. between GIST and LMS) marks a discriminating pathway.

Visualizing global network rank conservation
[Table: network name, number of genes, and rank conservation µR for the most and least conserved networks — e.g. most conserved: GS (1.000), FOSB (0.981), AKAP13 (0.955); least conserved: IL18 (0.763), LEPTIN (0.728).] Average rank conservation across all 248 networks: 0.903.

Global regulation of networks across phenotypes
From highest to lowest rank conservation: tighter network regulation in normal prostate; looser regulation in primary prostate cancer; loosest regulation in metastatic prostate cancer. (Eddy et al., PLoS Computational Biology, 2010.)

[Figure: differential rank conservation of the MAPK network.] DIRAC classification is comparable to other methods (cross-validation accuracies in prostate cancer).

Differential Rank Conservation (DIRAC): key features
Independent of data normalization. Independent of genes/proteins outside the network. Can reveal massive or complete perturbations, unlike Fisher's exact test (e.g. GO enrichment). Measures the "shuffling" of the network in terms of the hierarchy of expression of its components, which is distinct from enrichment or GSEA. Provides a mathematical classifier that yields a measurement of predictive accuracy on test data, which is stronger than a p-value for determining signal. Code for the method can be found at our website: http://price.systemsbiology.net. (Eddy et al., PLoS Computational Biology, 2010.)

Global Analysis of Human Disease
Importance of broad context to disease diagnosis; the envisioned future of blood diagnostics; next-generation molecular disease screening. Why global disease analyses are essential: organ specificity helps separate signal from noise. Hierarchy of classification: context-independent classifiers are based on organ-specific markers; context-dependent classifiers are based on excellent markers once organ specificity is defined. Global analyses provide context for how disease classifiers should be defined, as well as broad perspective into how separable diseases are and whether disease diagnosis categories seem appropriate.

Global analysis of disease-perturbed transcriptomes in the human brain: example case study
[Figure: multidimensional scaling plot of brain disease data — AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, normal.]

Identification of Structured Signatures And Classifiers (ISSAC)
At each class in the decision tree, a test sample is either allowed to pass down the tree for further classification or rejected (i.e.
'does not belong to this class') and is thus unable to pass further.

Accuracy on randomly split test sets
[Figure: per-class classification accuracies, ranging from 81.8% to 100% across AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, and normal/control.] Average accuracy over all class samples: 93.9%.

The challenge of 'lab effects': sample heterogeneity issues in personalized medicine
[Figure: independent hold-out accuracies for 18 GSE datasets — normal brain (GSE7307, GSE3526), pilocytic astrocytoma (GSE12907, GSE5675), oligodendroglioma (GSE4290, GSE4412), meningioma (GSE16581, GSE9438, GSE4780), medulloblastoma (GSE12992, GSE10327), glioblastoma (GSE4290, GSE9171, GSE8692, GSE4271, GSE4412), and ependymoma (GSE21687, GSE16155).]

Leave-batch-out validation shows the impact of other batch effects
[Figure: per-class sensitivity of ISSAC versus ooSVM (50 features) across classes ADC, SCC, LCLC, AST, COPD, and NORM, as each study batch is excluded in turn.]

Take-home messages
There is tremendous promise in high-throughput approaches to identify biomarkers, but significant challenges remain to their broad success. Integrative systems approaches that link together data very broadly are essential. If the training set is representative of
the population, there are robust signals in the data and excellent accuracy is possible. Forward designs and partnering closely with clinical collaborators are essential, as is standardization of data collection and analysis.

Summary
Molecular signature classifiers provide a promising avenue for disease stratification. Machine-learning approaches are key: the goal is optimal prediction of future data, so overfitting must be avoided (model complexity, feature selection, and model selection). Technical challenges include measurement platforms. Network-based classification and global disease context are key, and lab and batch effects are critical to overcome. Sampling of heterogeneity for some diseases is now sufficient to achieve stability in classification accuracies.

Acknowledgments
Nathan D. Price Research Laboratory, Institute for Systems Biology, Seattle, WA | University of Illinois, Urbana-Champaign, IL.

Price Lab members: Seth Ament, PhD; Daniel Baker; Matthew Benedict; Julie Bletz, PhD; Victor Cassen; Sriram Chandrasekaran; Nicholas Chia, PhD (now Asst. Prof. at Mayo Clinic); John Earls; James Eddy; Cory Funk, PhD; Pan Jun Kim, PhD (now Asst. Prof. at POSTECH); Alexey Kolodkin, PhD; Charu Gupta Kumar, PhD; Ramkumar Hariharan, PhD; Ben Heavner, PhD; Piyush Labhsetwar; Andrew Magis; Caroline Milne; Shuyi Ma; Beth Papanek; Matthew Richards; Areejit Samal, PhD; Vineet Sangar, PhD; Bozenza Sawicka; Evangelos Simeonidis; Jaeyun Sung; Chunjing Wang.

Collaborators: Don Geman (Johns Hopkins); Wei Zhang (MD Anderson).

Funding: NIH / National Cancer Institute Howard Temin Pathway to Independence Award; NSF CAREER; Department of Energy; Energy Biosciences Institute (BP); Department of Defense (TATRC); Luxembourg-ISB Systems Medicine Program; Roy J. Carver Charitable Trust Young Investigator Award; Camille Dreyfus Teacher-Scholar Award.