Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery
Jake Y. Chen, Ph.D.
IUPUI Indiana Center for Systems Biology & Personalized Medicine
http://bio.informatics.iupui.edu

Polyp and Colorectal Cancer
Polyp vs. Colorectal Cancer
• Polyps are benign tumors of the large intestine.
• They do not invade nearby tissue or spread to other parts of the body.
• If not removed from the large intestine, they may become malignant (cancerous) over time.
• Most cancers of the large intestine are believed to have developed from polyps.
[Photo courtesy of National Cancer Institute]
Colon Cancer vs. Rectal Cancer
• Share many commonalities, including molecular mechanisms.
• Tend to be treated differently.

Colorectal Cancer Molecular Pathways
A. Walther, et al. (2009) Nature Reviews Cancer, 9(7), pp. 489-99

Omics/Clinical Data Source
Proteomics/Metabolomics/Lipidomics/Clinical Data
(H = healthy, PP = polyp, CR = colorectal cancer, N = total subjects)
Data set | H | PP | CR | N
LC-MS Proteomics | 80 | 72 | 40 | 192
NMR Metabolomics | 53 | 35 | 15 | 103
GCxGC MS Metabolomics | 83 | 84 | 30 | 197
Lipidomics | 47 | 35 | 15 | 97
Vitamin D | 83 | 81 | 31 | 195
Oxidative Stress | 50 | 32 | 12 | 94
Diet | 70 | 54 | 29 | 153

Scientific Questions to Answer
Data Analysis
• Which Omics data set has the best predictive power?
• Which features in the Omics data are important?
Data Mining
• Does integration of Omics data improve the prediction?
• Which combination of Omics data sets has the best predictive power?
Knowledge Discovery
• Why do those features in the Omics data have the best predictive power?

Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining

Proteomics Data Description
Group: Bindley Biosciences Center at Purdue University
Instruments: Agilent chip cube coupled to the XCT Plus ESI ion trap
Data format at CCE webportal: mzXML
Number of Samples: Normal: 80; Polyp: 72; Colorectal: 40

LC-MS Proteomics Data Processing
[Figure panels: LC/MS data "heat map"; image-enhanced LC/MS data "heat map"; Total Ion Chromatogram (TIC) summarized from the enhanced heat map]
Methods adapted from:
N. Jeffries (2005) Bioinformatics, vol. 21 (no. 14), pp. 3066.
S.A. Kazmi, et al. (2006) Metabolomics, vol. 2 (no. 2), pp. 75-83.

LC-MS Major Protein Identification
~25-28 characteristic proteins identified per sample
• Identify the most informative TIC retention time (R.T.) "grid"
• Apply the R.T. grid to the original spectra
• Use Mascot to search for protein IDs at the R.T. grid regions
No | Scan | RT | Uniprot_ID | Score | Expect | Evidence
1 | 119 | 139.48 | ADAD2_HUMAN | 38 | 3.3 | 0
2 | 229 | 265.87 | NNMT_HUMAN | 43 | 1.1 | 2
3 | 372 | 429.15 | ZSA5D_HUMAN | 42 | 1.2 | 0
4 | 656 | 749.8 | BRAF_HUMAN | 40 | 2.2 | 479
5 | 1162 | 1276.6 | RGS7_HUMAN | 47 | 0.39 | 1
6 | 1310 | 1407.2 | TTC9C_HUMAN | 35 | 6.3 | 0
7 | 1669 | 1713.9 | CP042_HUMAN | 38 | 3.1 | 0
8 | 1866 | 1879.1 | HXD11_HUMAN | 34 | 8.4 | 0
9 | 1987 | 1980.3 | ING4_HUMAN | 38 | 3.1 | 2
10 | 2114 | 2086 | ZN423_HUMAN | 33 | 10 | 0
11 | 2353 | 2285.7 | CL065_HUMAN | 37 | 3.9 | 0
12 | 2539 | 2441.3 | CA5BL_HUMAN | 47 | 0.4 | 1
13 | 2722 | 2594.7 | NPDC1_HUMAN | 38 | 3.6 | 0
14 | 2874 | 2722.2 | DJC27_HUMAN | 37 | 3.8 | 0
15 | 3001 | 2828.5 | BORG4_HUMAN | 40 | 2.2 | 1
16 | 3165 | 2965.1 | KC1G1_HUMAN | 27 | 43 | 0
17 | 3440 | 3196.1 | TPPC5_HUMAN | 40 | 2 | 0
18 | 3656 | 3377.6 | UB2D3_HUMAN | 43 | 0.99 | 1
19 | 3997 | 3665.5 | TM208_HUMAN | 34 | 8.1 | 0
20 | 4257 | 3885.4 | ZBED3_HUMAN | 29 | 23 | 0

Proteomics Result Interpretation
Proteins Identified from Colon Cancer and Health Groups
Uniprot_ID | Frequency in Colon (10) | Frequency in Health (10) | Evidence in PubMed
BRAF_HUMAN | 3 | 0 | 508
DMP46_HUMAN | 3 | 0 | 0
NNMT_HUMAN | 3 | 1 | 4
MRP_HUMAN | 1 | 3 | 0
STK33_HUMAN | 0 | 3 | 0

Proteins Interacting with High-Frequency Proteins from the Colon Cancer Group
Uniprot_ID | Gene | Protein Name | Evidence in PubMed
BRAF1_HUMAN | BRAF | Serine/threonine-protein kinase B-raf | 508
P53_HUMAN | TP53 | Cellular tumor antigen p53 | 443
CD44_HUMAN | CD44 | CD44 antigen | 411
MDM2_HUMAN | MDM2 | E3 ubiquitin-protein ligase Mdm2 | 131
BCR_HUMAN | BCR | Breakpoint cluster region protein | 59
LCK_HUMAN | LCK | Tyrosine-protein kinase Lck | 29
Q7RTZ3_HUMAN | LCK | Tyrosine-protein kinase Lck | 29
CAV1_HUMAN | CAV1 | Caveolin-1 | 21
PNPH_HUMAN | PNP | Purine nucleoside phosphorylase | 13
CBL_HUMAN | CBL | E3 ubiquitin-protein ligase CBL | 11
RAF1_HUMAN | RAF1 | RAF proto-oncogene serine/threonine-protein kinase | 10
CD38_HUMAN | CD38 | ADP-ribosyl cyclase 1 | 8
NNMT_HUMAN | NNMT | Nicotinamide N-methyltransferase | 4
IRAK1_HUMAN | IRAK1 | Interleukin-1 receptor-associated kinase 1 | 3
DMPK_HUMAN | DMPK | Myotonin-protein kinase | 2
ITA5_HUMAN | ITGA5 | Integrin alpha-5 | 1
ITB1_HUMAN | ITGB1 | Integrin beta-1 | 1
ZAP70_HUMAN | ZAP70 | Tyrosine-protein kinase ZAP-70 | 1

Proteomics Result Interpretation: A Network Biology Context
Protein Network Constructed from the Top 3 Differential Proteins
• Green-circled proteins are frequently detected (frequency >= 0.3) in colon cancer patient blood samples using LC/MS.
• Node: protein with evidence from PubMed, found by searching ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
• Edge: protein interaction with confidence score from HAPPI 1.31 (4- and 5-star).

Proteomics Result Interpretation: A Biological Pathway Context
BRAF (Serine/threonine-protein kinase B-raf) plays major roles in the Colorectal Cancer Pathway (KEGG data).

Proteomics Result Interpretation: A Biological Pathway Context for NNMT
NNMT (Nicotinamide N-methyltransferase) is involved in Biological Oxidations / Phase II Conjugation / Methylation (from Reactome).

Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining

Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
1. NMR Data
   Instruments: Bruker Avance 500 MHz NMR
   Data format at CCE webportal: Excel spreadsheet
   Number of Samples: Normal: 53; Polyp: 35; Colorectal: 15
2. GCxGC MS Data
   Instruments: LECO Pegasus 4D GCxGC-TOF
   Data format at CCE webportal: Excel spreadsheet
   Number of Samples: Normal: 83; Polyp: 84; Colorectal: 30

NMR Data Analysis Workflow
• Signal processing
• Extract the peaks' ppm values
• Search against the Human Metabolome Database (2.5) to identify metabolites
• Report only significant metabolites
Sample_ID | Top1 | Top2 | Top3 | Top4 | Top5 | Top6
1 | Delta-Hexanolactone | Hypotaurine | 2,3-Diphosphoglyceric acid | Diethanolamine | 3-Phosphoglyceric acid | 3,7-Dimethyluric acid
2 | Delta-Hexanolactone | Hypotaurine | Diethanolamine | 3,7-Dimethyluric acid | Methyl isobutyl ketone | 1,3,7-Trimethyluric acid
Top7-Top10 (as listed; cell assignment not preserved): 1,3,7-Trimethyluric acid; L-Allothreonine; Cysteine-S-sulfate; L-Allothreonine

NMR Peak Metabolite Identification using the Human Metabolome Database
1) Input the peak lists
2) Get the metabolites; leave out those with fewer than 2 matches

Significant Metabolites Identified from NMR Metabolomics Data
Marker metabolites?
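
The two identification steps listed above (input the peak lists, then keep only metabolites with at least 2 matches) can be sketched in a few lines of Python. This is an illustrative sketch only, not the Human Metabolome Database search itself: the function name `match_metabolites`, the ppm tolerance of 0.02, and the reference table with placeholder metabolite names and shift values are all assumptions introduced for the example.

```python
def match_metabolites(observed_ppm, reference, tolerance=0.02, min_matches=2):
    """Return (metabolite, n_matched_peaks) pairs with at least `min_matches` hits.

    observed_ppm: list of peak positions (ppm) extracted from one spectrum.
    reference:    dict mapping a metabolite name to its reference peak positions.
                  (Hypothetical stand-in for a database lookup.)
    tolerance:    assumed matching window in ppm; not taken from the slides.
    """
    hits = {}
    for name, ref_peaks in reference.items():
        # Count reference peaks that have an observed peak within the tolerance.
        matched = sum(
            1
            for ref in ref_peaks
            if any(abs(ref - obs) <= tolerance for obs in observed_ppm)
        )
        # Step 2 on the slide: leave out metabolites with fewer than 2 matches.
        if matched >= min_matches:
            hits[name] = matched
    # Rank by number of matched peaks, then by name for a stable order.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Placeholder reference table: names and shifts are made up for illustration.
reference = {
    "metabolite_A": [1.20, 3.65, 3.80],
    "metabolite_B": [2.10, 7.30],
    "metabolite_C": [1.21, 5.00],
}
observed = [1.21, 3.66, 7.31]
candidates = match_metabolites(observed, reference)
```

Here only `metabolite_A` survives the filter, because it is the only entry with two reference peaks close to an observed peak. A ranked list of this kind corresponds loosely to the Top1...Top10 columns of the per-sample table.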
Group | Metabolites
Polyp vs Health | D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp | None
Colorectal vs Health | D-Arabitol (2/15 vs 0/53)
Population frequency = number of samples in which the metabolite was identified / number of samples in the group
Shared metabolite: D-Arabitol

D-Arabitol Identified from NMR Results Is Involved in Pentose and Glucuronate Interconversions Pathways

Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining

Results from GCxGC MS Data I
Metabolite identification is more straightforward.
Metabolites (columns on the slide: Polyp vs Healthy; Colorectal vs Polyp; Colorectal vs Healthy; per-column assignment not preserved):
• Methanesulfinic acid, trimethylsilyl ester
• Acetic acid, (methoxyimino)-, trimethylsilyl ester
• Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
• Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
• Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
• L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
• Hexanedioic acid, bis(2-ethylhexyl) ester
• Methanesulfinic acid, trimethylsilyl ester
• Cholesterol trimethylsilyl ether
• Mefloquine
• Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
• Hexanoic acid, trimethylsilyl ester
• Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
• L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
• Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
• Tetradecanoic acid, trimethylsilyl ester
• Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
• Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
• psi,psi-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
• Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
• 3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
• Silanol, trimethyl-, pyrophosphate (4:1)
• Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
• Trimethylsilyl ether of glycerol
• L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
• Ethylbis(trimethylsilyl)amine
• Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
• Benzene, (1-hexadecylheptadecyl)-
• Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester

Results from GCxGC MS Data II
[Figure panels: A. Polyp vs Healthy; B. Polyp vs Colorectal; C. Colorectal vs Healthy]

Comparative Results (Intensity vs. Population)
Marker Metabolite Panel: Clustering of three groups
[Figure panels: intensity-based heat map; population-frequency-based heat map]

Metabolites Identified from GCxGC MS Results Are Involved in Fatty Acid Biosynthesis Pathways

Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining

Data Set Description
Diet, Lipidomics, Oxidative Stress and Vitamin D (VD)
The number of features and the total number of subjects vary by data set.
 | Diet | Lipid | Oxidative | VD
Total Subjects | 150 | 97 | 94 | 195
Total Features | 38 | 49 | 3 | 2
The three classes are balanced to the least common denominator.
Comparisons: Healthy vs. Polyp; Healthy vs. Colorectal; Polyp vs. Colorectal

Predictive Modeling Methods
Raw Dataset -> Data Preprocessing -> Clean Dataset -> Classification Model -> Hypothesis
Data Preprocessing
• Missing value treatment: replaced with the mean value of the attribute in the group
• Filtering outliers (three standard deviations away from the mean)
• Data normalization (transforming to the 0-1 range)
• Categorical data binned using the quantile binning method
Classification Model
• Classifier: support vector machines (SVM)
• Kernel: a Radial Basis Function (RBF) kernel is used
• K-fold cross-validation
Feature Selection Methods
• Approach #1: Two-sample unpaired t-tests at the 5% significance level. Features from the t-tests are filtered using p-values.
• Approach #2: SVM Attribute Evaluator with Ranker algorithm.

Dietary Attributes as Predictors
Columns on the slide: Colorectal vs. Healthy (P-value) | Polyp vs. Healthy (P-value)
Attributes and p-values as listed (column placement of individual values not preserved):
• Ice cream
• Rice: 2.38E-02
• Salad: 2.53E-02
• Tomato: 9.57E-01
• Egg: 3.71E-02
• Milk: 5.60E-02, 4.21E-01
• Tea: 4.11E-02
• Shellfish: 1.21E-01
SVM Predictor Accuracy = 64% | SVM Predictor Accuracy = 65%

Lipidomics T-Test Results
Significant features selected from the t-test with their corresponding p-values
Columns on the slide: Features | Polyp vs. Healthy | Polyp vs. Colorectal | Colorectal vs. Healthy
Features and p-values as listed (column placement of individual values not preserved):
• 16:0/18:1 PE: 1.76E-02
• 24:1 Cer: 6.90E-03
• LPE 18:1, LPE 20:0: <1.00E-04, 1.50E-03, 2.00E-04
• An-16:0 LPA: 3.23E-02
• An-18:1 LPA: 3.38E-02
• AA: 1.13E-02
• 18:2 LPA: 1.13E-02
• 20:4 LPA: 1.33E-02, 4.50E-03, 2.40E-02
• 22:6 FA: 4.28E-02, 3.24E-02
• LPE 16:0: 3.08E-02, 3.40E-03
• LPE 18:0: 3.90E-03, 1.00E-04
• LPE 18:1: 2.18E-02

Integrating Lipidomics with Clinical Features
Performance comparisons
Comparison | Without clinical features: Accuracy | With clinical features: Accuracy (without preselection) | With clinical features: Accuracy (with t-test preselection) | With clinical features: Accuracy (automatic selection)
Polyp vs. Healthy | 0.55 | 0.54 | 0.71 | 0.78
Colorectal vs. Healthy* | 0.60 | 0.57 | 0.63 | 0.73
Polyp vs. Colorectal* | 0.60 | 0.70 | 0.90 | 0.87
* Since the number of subjects was less than 15, 3-fold cross-validation accuracy was reported.

Messages
• Individual Omics data sets have variable predictive performance.
• Thorough statistical filtering plus biological knowledge integration is needed to battle the inherently high level of data noise.
• Integration of different Omics data with clinical data can improve predictive performance.

Acknowledgment
We thank all the members of our team.
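
Supplementary sketch: the preprocessing steps named on the Predictive Modeling Methods slide (group-mean imputation, three-standard-deviation outlier filtering, 0-1 normalization) and the K-fold split can be illustrated with the Python standard library alone. This is a minimal sketch under stated assumptions, not the authors' pipeline: all function names are hypothetical, missing values are assumed to be encoded as `None`, outlier filtering is assumed to drop whole samples, and the folds are assigned by simple striding. The SVM classifier with RBF kernel and the quantile binning step are not implemented here.

```python
from statistics import mean, stdev


def impute_group_mean(rows, labels):
    """Replace None entries with the mean of that attribute within the same group."""
    filled = [list(r) for r in rows]  # copy so the input is left untouched
    for j in range(len(rows[0])):
        for group in set(labels):
            observed = [r[j] for r, g in zip(rows, labels)
                        if g == group and r[j] is not None]
            group_mean = mean(observed)  # assumes >= 1 observed value per group
            for r, g in zip(filled, labels):
                if g == group and r[j] is None:
                    r[j] = group_mean
    return filled


def filter_outliers(rows, labels, k=3.0):
    """Drop samples having any attribute more than k standard deviations from its mean."""
    n_features = len(rows[0])
    stats = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        stats.append((mean(column), stdev(column)))
    kept = [
        (r, g) for r, g in zip(rows, labels)
        if all(s == 0 or abs(r[j] - m) <= k * s for j, (m, s) in enumerate(stats))
    ]
    return [r for r, _ in kept], [g for _, g in kept]


def min_max_normalize(rows):
    """Rescale every attribute to the 0-1 range (constant columns map to 0.0)."""
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]
    return [
        [(r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
         for j in range(n_features)]
        for r in rows
    ]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for f, fold in enumerate(folds) if f != i for idx in fold]
        yield train, test
```

A typical call order would be `impute_group_mean`, then `filter_outliers`, then `min_max_normalize`, followed by training and scoring a classifier on each split from `k_fold_indices` (for example with `k=3`, as in the footnoted comparisons). In practice the normalization bounds would usually be computed on the training fold only, to avoid leaking information from the test fold.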