OMICS and Pathway Integration for Knowledge Discovery

Integrative Colorectal Cancer Omics Data Mining and Knowledge Discovery
Jake Y. Chen, Ph.D.
IUPUI
Indiana Center for Systems Biology & Personalized Medicine
http://bio.informatics.iupui.edu
Polyp and Colorectal Cancer

Polyp vs. Colorectal Cancer
• Polyps are benign tumors of the large intestine.
• They do not invade nearby tissue or spread to other parts of the body.
• If not removed from the large intestine, they may become malignant (cancerous) over time.
• Most cancers of the large intestine are believed to have developed from polyps.
Photo Courtesy of National
Cancer Institute
Colon Cancer vs. Rectal Cancer
• Share many commonalities, including molecular mechanisms.
• Tend to be treated differently.
Colorectal Cancer Molecular Pathways
A. Walther, et al. (2009) Nature
Reviews Cancer, 9(7) pp. 489-99
Omics/Clinical Data Source
Proteomics/Metabolomics/Lipidomics/Clinical Data

Data set                 H     PP    CR    N
LC-MS Proteomics         80    72    40    192
NMR Metabolomics         53    35    15    103
GCxGC MS Metabolomics    83    84    30    197
Lipidomics               47    35    15    97
Vitamin D                83    81    31    195
Oxidative Stress         50    32    12    94
Diet                     70    54    29    153

H = healthy, PP = polyp, CR = colorectal cancer, N = total subjects
Scientific Questions to Answer

Data Analysis
• Which omics data set has the best predictive power?
• Which features in the omics data are important?

Data Mining
• Does integrating omics data improve prediction?
• Which combination of omics data sets has the best predictive power?

Knowledge Discovery
• Why do those features in the omics data have the best predictive power?
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining


Proteomics Data Description

• Group: Bindley Biosciences Center at Purdue University
• Instruments: Agilent's chip cube coupled to the XCT Plus ESI ion trap
• Data format at CCE web portal: mzXML
• Number of samples: Normal: 80; Polyp: 72; Colorectal: 40
LC-MS Proteomics Data Processing
[Figure panels: LC/MS data "heat map"; image-enhanced LC/MS data "heat map"; total ion chromatogram (TIC) summarized from the enhanced heat map]
Methods Adapted from
N. Jeffries (2005) Bioinformatics, vol. 21, (no. 14), pp. 3066.
S.A. Kazmi, et al., (2006) Metabolomics, vol. 2, (no. 2), pp. 75-83
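
The TIC summarization step can be illustrated with a minimal sketch. It assumes the LC/MS heat map is available as a matrix with one row per scan and one column per m/z bin; the matrix values and the function name are hypothetical, and the image-enhancement step from the cited methods is not reproduced here.

```python
def total_ion_chromatogram(intensity_map):
    """Sum intensities over all m/z bins for each scan (one row per scan)."""
    return [sum(row) for row in intensity_map]


# Toy heat map: 4 scans (rows) x 3 m/z bins (columns); values are made up.
heat_map = [
    [0.0, 1.0, 2.0],
    [5.0, 0.5, 0.5],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 4.0],
]

tic = total_ion_chromatogram(heat_map)
# Index of the scan with the largest total ion current.
peak_scan = max(range(len(tic)), key=tic.__getitem__)
print(tic, peak_scan)
```

Each TIC value is simply the summed signal of one scan, so local maxima along the retention-time axis are natural candidates for the R.T. grid regions discussed next.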
LC-MS Major Protein Identification
~25-28 characteristic proteins /sample identified
Identify Most Informative TIC R.T. “Grid”
Use Mascot to Search for Protein ID at R.T. Grid Regions
Apply the R.T. Grid to Original Spectra
No   Scan   RT        Uniprot_ID     Score   Expect   Evidence
1    119    139.48    ADAD2_HUMAN    38      3.3      0
2    229    265.87    NNMT_HUMAN     43      1.1      2
3    372    429.15    ZSA5D_HUMAN    42      1.2      0
4    656    749.8     BRAF_HUMAN     40      2.2      479
5    1162   1276.6    RGS7_HUMAN     47      0.39     1
6    1310   1407.2    TTC9C_HUMAN    35      6.3      0
7    1669   1713.9    CP042_HUMAN    38      3.1      0
8    1866   1879.1    HXD11_HUMAN    34      8.4      0
9    1987   1980.3    ING4_HUMAN     38      3.1      2
10   2114   2086      ZN423_HUMAN    33      10       0
11   2353   2285.7    CL065_HUMAN    37      3.9      0
12   2539   2441.3    CA5BL_HUMAN    47      0.4      1
13   2722   2594.7    NPDC1_HUMAN    38      3.6      0
14   2874   2722.2    DJC27_HUMAN    37      3.8      0
15   3001   2828.5    BORG4_HUMAN    40      2.2      1
16   3165   2965.1    KC1G1_HUMAN    27      43       0
17   3440   3196.1    TPPC5_HUMAN    40      2        0
18   3656   3377.6    UB2D3_HUMAN    43      0.99     1
19   3997   3665.5    TM208_HUMAN    34      8.1      0
20   4257   3885.4    ZBED3_HUMAN    29      23       0
Proteomics Result Interpretation
Proteins Identified in the Colon Cancer and Healthy Groups

Uniprot_ID     Frequency in Colon (10)   Frequency in Health (10)   Evidence in PubMed
BRAF_HUMAN     3                         0                          508
DMP46_HUMAN    3                         0                          0
NNMT_HUMAN     3                         1                          4
MRP_HUMAN      1                         3                          0
STK33_HUMAN    0                         3                          0
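
The frequency columns above can be read as the number of samples (out of 10 per group) in which a protein was identified. A minimal sketch of that count, together with the 0.3 cutoff mentioned on the network slide, follows. The per-sample assignments are made up for illustration; only the totals (3, 3, and 1 out of 10) follow the table, and the function names are hypothetical.

```python
def detection_frequency(samples):
    """Fraction of samples in which each protein was identified.

    samples: list of sets, one set of protein IDs per sample.
    """
    counts = {}
    for identified in samples:
        for protein in identified:
            counts[protein] = counts.get(protein, 0) + 1
    n = len(samples)
    return {protein: c / n for protein, c in counts.items()}


def frequent_proteins(freqs, cutoff=0.3):
    """Proteins detected at or above the cutoff frequency, sorted by ID."""
    return sorted(p for p, f in freqs.items() if f >= cutoff)


# Hypothetical colon group of 10 samples; which sample holds which protein is invented.
colon_group = [
    {"BRAF_HUMAN", "NNMT_HUMAN"},
    {"BRAF_HUMAN"},
    {"BRAF_HUMAN", "NNMT_HUMAN"},
    {"NNMT_HUMAN", "MRP_HUMAN"},
] + [set() for _ in range(6)]

freqs = detection_frequency(colon_group)
print(frequent_proteins(freqs))
```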
Proteins Interacting with High-Frequency Proteins from the Colon Cancer Group

Uniprot_ID      Gene     Protein Name                                          Evidence in PubMed
BRAF1_HUMAN     BRAF     Serine/threonine-protein kinase B-raf                 508
P53_HUMAN       TP53     Cellular tumor antigen p53                            443
CD44_HUMAN      CD44     CD44 antigen                                          411
MDM2_HUMAN      MDM2     E3 ubiquitin-protein ligase Mdm2                      131
BCR_HUMAN       BCR      Breakpoint cluster region protein                     59
LCK_HUMAN       LCK      Tyrosine-protein kinase Lck                           29
Q7RTZ3_HUMAN    LCK      Tyrosine-protein kinase Lck                           29
CAV1_HUMAN      CAV1     Caveolin-1                                            21
PNPH_HUMAN      PNP      Purine nucleoside phosphorylase                       13
CBL_HUMAN       CBL      E3 ubiquitin-protein ligase CBL                       11
RAF1_HUMAN      RAF1     RAF proto-oncogene serine/threonine-protein kinase    10
CD38_HUMAN      CD38     ADP-ribosyl cyclase 1                                 8
NNMT_HUMAN      NNMT     Nicotinamide N-methyltransferase                      4
IRAK1_HUMAN     IRAK1    Interleukin-1 receptor-associated kinase 1            3
DMPK_HUMAN      DMPK     Myotonin-protein kinase                               2
ITA5_HUMAN      ITGA5    Integrin alpha-5                                      1
ITB1_HUMAN      ITGB1    Integrin beta-1                                       1
ZAP70_HUMAN     ZAP70    Tyrosine-protein kinase ZAP-70                        1
Proteomics Result Interpretation
A Network Biology Context
Protein Network Constructed from the Top 3 Differential Proteins
Green-circled proteins are frequently (frequency >= 0.3) detected by LC/MS in blood samples from colon cancer patients.
Node: protein with PubMed evidence from the search ("GENE_SYMBOL" AND ("colon" OR "colorectal") AND ("cancer" OR "carcinoma")).
Edge: protein interaction with a confidence score from HAPPI 1.31 (4- and 5-star).
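
The PubMed evidence search above can be assembled per gene symbol. The sketch below only builds the query string and a request URL for the NCBI E-utilities ESearch endpoint; it makes no network call. Using ESearch to obtain the hit counts is an assumption, since the slides do not say how the counts were collected, and the function names are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(gene_symbol):
    """Fill the slide's query template with one gene symbol."""
    return (
        f'("{gene_symbol}" AND ("colon" OR "colorectal") '
        'AND ("cancer" OR "carcinoma"))'
    )


def esearch_url(gene_symbol):
    """URL whose JSON response would report the number of matching PubMed records."""
    params = {"db": "pubmed", "term": pubmed_query(gene_symbol), "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


print(pubmed_query("BRAF"))
print(esearch_url("NNMT"))
```

Counts retrieved this way change as PubMed grows, so they would not be expected to reproduce the evidence numbers in the tables exactly.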
Proteomics Result Interpretation
A Biological Pathway Context
BRAF (Serine/threonine-protein kinase B-raf) plays a major role in the Colorectal Cancer Pathway (KEGG data)
Proteomics Result Interpretation
A Biological Pathway Context for NNMT
NNMT (Nicotinamide N-methyltransferase) is involved in Biological
Oxidations/Phase II Conjugation/Methylation (from Reactome)
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining

Metabolomics Data Description
Group: Daniel Raftery Laboratory at Purdue University
1. NMR Data
   • Instruments: Bruker Avance 500 MHz NMR
   • Data format at CCE web portal: Excel spreadsheet
   • Number of samples: Normal: 53; Polyp: 35; Colorectal: 15
2. GCxGC MS Data
   • Instruments: LECO Pegasus 4D GCxGC-TOF
   • Data format at CCE web portal: Excel spreadsheet
   • Number of samples: Normal: 83; Polyp: 84; Colorectal: 30

NMR Data Analysis Workflow
1. Signal processing
2. Extract peaks' ppm
3. Search against the Human Metabolome Database (2.5) to identify metabolites
4. Report only significant metabolites

Rank     Sample_ID 1                   Sample_ID 2
Top1     Delta-Hexanolactone           Delta-Hexanolactone
Top2     Hypotaurine                   Hypotaurine
Top3     2,3-Diphosphoglyceric acid    Diethanolamine
Top4     Diethanolamine                3,7-Dimethyluric acid
Top5     3-Phosphoglyceric acid        Methyl isobutyl ketone
Top6     3,7-Dimethyluric acid         1,3,7-Trimethyluric acid
Top7     1,3,7-Trimethyluric acid      Cysteine-S-sulfate
Top8     L-Allothreonine               L-Allothreonine
Top9
Top10
NMR Peak Metabolite Identification Using the Human Metabolome Database
1) Input the peak lists
2) Get the metabolites; leave out those with fewer than 2 matches
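
The two steps above can be sketched as a tolerance-based lookup. This is a toy stand-in for the database search, not the Human Metabolome Database interface: the reference shifts, metabolite names, and the 0.01 ppm tolerance are invented, and "matches" is assumed here to mean the number of reference peaks found in the sample's peak list.

```python
def match_metabolites(peaks_ppm, reference, tol=0.01, min_matches=2):
    """Return {metabolite: matched peak count} for metabolites with enough matches.

    peaks_ppm: observed peak positions (ppm) for one sample.
    reference: {metabolite name: list of reference peak positions (ppm)}.
    """
    hits = {}
    for name, ref_shifts in reference.items():
        matched = sum(
            1
            for shift in ref_shifts
            if any(abs(shift - peak) <= tol for peak in peaks_ppm)
        )
        if matched >= min_matches:  # leave out those with fewer than 2 matches
            hits[name] = matched
    return hits


# Hypothetical reference table and peak list; none of these values are real shifts.
reference = {
    "metabolite_A": [1.20, 3.50, 4.10],
    "metabolite_B": [2.00, 7.30],
}
sample_peaks = [1.205, 3.495, 7.30, 5.00]

print(match_metabolites(sample_peaks, reference))
```

Requiring at least two matched peaks discards metabolites supported by a single coincidental peak, which is one way to read the filtering rule in step 2.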
Significant Metabolites Identified from NMR Metabolomics Data
Marker metabolites?

Group                   Metabolites
Polyp vs Health         D-Arabitol, D-Pantethine (2/35 vs 0/53)
Colorectal vs Polyp     None
Colorectal vs Health    D-Arabitol (2/15 vs 0/53)

Population Frequency =
Shared metabolites
D-Arabitol Identified from NMR Results
Involved in Pentose and Glucuronate Interconversions Pathways
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
  • NMR Data
  • GCxGC MS Data
• Integrative Data Mining

Results from GCxGC MS Data I
Metabolite identification is more straightforward
Polyp vs Healthy
Colorectal vs Polyp
Colorectal vs Healthy
Metabolites
Metabolites
Metabolites
Methanesulfinic acid, trimethylsilyl ester
Acetic acid, (methoxyimino)-, trimethylsilyl ester
Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Propanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Hexanedioic acid, bis(2-ethylhexyl) ester
Methanesulfinic acid, trimethylsilyl ester
Cholesterol trimethylsilyl ether
Mefloquine
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Hexanoic acid, trimethylsilyl ester
Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
L-Valine, N-(trimethylsilyl)-, trimethylsilyl ester
Pentanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Tetradecanoic acid, trimethylsilyl ester
Butanoic acid, 2-[(trimethylsilyl)oxy]-, trimethylsilyl ester
Hexanoic acid, 2-(methoxyimino)-, trimethylsilyl ester
psi,psi-Carotene, 3,3',4,4'-tetradehydro-1,1',2,2'-tetrahydro-1,1'-dimethoxy-2,2'-dioxo-
Cyclohexane, 1,3,5-trimethyl-2-octadecyl-
3,6-Dioxa-2,7-disilaoctane, 2,2,4,7,7-pentamethyl-
Silanol, trimethyl-, pyrophosphate (4:1)
Butanoic acid, 2-(methoxyimino)-3-methyl-, trimethylsilyl ester
Trimethylsilyl ether of glycerol
L-Asparagine, N,N2-bis(trimethylsilyl)-, trimethylsilyl ester
Ethylbis(trimethylsilyl)amine
Cyclotrisiloxane, 2,4,6-trimethyl-2,4,6-triphenyl-
Benzene, (1-hexadecylheptadecyl)-
Pentanedioic acid, 2-(methoxyimino)-, bis(trimethylsilyl) ester
Results from GCxGC MS Data II
A. Polyp vs Healthy
B. Polyp vs Colorectal
C. Colorectal vs Healthy
Comparative Results (Intensity vs. Population)
Marker Metabolite Panel Clustering of Three Groups
[Figure panels: intensity-based heat map; population-frequency-based heat map]
Metabolites identified from GCxGC MS Results
Involved in Fatty Acid Biosynthesis Pathways
Roadmap
• Knowledge Discovery of Proteomics Data
• Knowledge Discovery of Metabolomics Data
• Integrative Data Mining

Data Set Description

• Diet, lipidomics, oxidative stress, and vitamin D (VD)
• The number of features and the total number of subjects vary

                  Diet   Lipid   Oxidative   VD
Total Subjects    150    97      94          195
Total Features    38     49      3           2

• Three classes are balanced to the least common denominator
  • Healthy vs. Polyp
  • Healthy vs. Colorectal
  • Polyp vs. Colorectal
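
One plausible reading of "balanced to the least common denominator" is random downsampling of the larger class to the size of the smaller one in each pairwise comparison. The sketch below shows that reading only; it is an assumption about the procedure, and the function name and seed are hypothetical. The group sizes in the example (47 healthy, 15 colorectal) are the lipidomics counts from the data source slide.

```python
import random


def balance_classes(samples_by_class, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    n = min(len(samples) for samples in samples_by_class.values())
    return {
        label: rng.sample(list(samples), n)
        for label, samples in samples_by_class.items()
    }


# Sample IDs are placeholders; only the group sizes come from the slides.
lipidomics = {"Healthy": range(47), "Colorectal": range(15)}
balanced = balance_classes(lipidomics)
print({label: len(ids) for label, ids in balanced.items()})
```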
Predictive Modeling Methods
Raw Dataset -> Data Preprocessing -> Clean Dataset -> Classification Model -> Hypothesis

Data Preprocessing
• Missing value treatment: replaced with the mean value of the attribute in the group
• Filtering outliers (three standard deviations away from the mean)
• Data normalization (transforming to the 0-1 range)
• Categorical data binned using the quantile binning method

Classification Model
• Support vector machine (SVM) classifier
• Radial basis function (RBF) kernel is used
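
Three of the preprocessing steps can be sketched for a single numeric attribute as follows. This is an illustration under assumptions, not the pipeline used for the reported results: the function names and toy values are hypothetical, the order in which the steps are applied is not stated on the slide, and quantile binning is omitted.

```python
from statistics import mean, stdev


def impute_group_mean(values):
    """Replace missing entries (None) with the mean of the observed values in this group."""
    observed = [v for v in values if v is not None]
    group_mean = mean(observed)
    return [group_mean if v is None else v for v in values]


def drop_outliers(values, k=3.0):
    """Keep values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]


def min_max(values):
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


imputed = impute_group_mean([1.0, None, 3.0])
filtered = drop_outliers([1.0] * 19 + [100.0])
scaled = min_max([2.0, 4.0, 6.0])
print(imputed, len(filtered), scaled)
```

Note that a three-standard-deviation rule can only flag a value when the group is reasonably large (roughly 11 or more values when the sample standard deviation is used), which matters for the smaller colorectal groups.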
Feature Selection Methods
• Approach #1: Two-sample unpaired t-tests at the 5% significance level.
• Approach #2: SVM Attribute Evaluator with the Ranker algorithm.
• Features from t-tests are filtered using p-values
• K-fold cross-validation
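
How K-fold cross-validation partitions the subjects can be shown with a short index generator. The value of K is not given on this slide, so the example uses 3 purely for illustration, and the function name is hypothetical. In practice the subject order would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Assign sample i to fold i % k so fold sizes differ by at most one.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        yield train_idx, test_idx


splits = list(k_fold_indices(6, 3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every subject is held out exactly once, so the accuracy averaged over folds uses each subject for testing once and for training K-1 times.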
Dietary Attributes as Predictors
Colorectal vs. Healthy
Polyp vs. Healthy
P-value
P-value
Ice cream
Rice
2.38E-02
Salad
2.53E-02
Tomato
9.57E-01
Egg
3.71E-02
Milk
5.60E-02
4.21E-01
Tea
4.11E-02
Shellfish
1.21E-01
SVM Predictor Accuracy = 64%
SVM Predictor Accuracy = 65%
Lipidomics T-Tests Results
Significant features selected by t-test, with their corresponding p-values
Features
Polyp vs. Healthy
16:0/18:1 PE
1.76E-02
24:1 Cer
6.90E-03
Polyp vs. Colorectal
LPE 18:1
LPE 20:0
Colorectal vs. Healthy
<1.00E-04
1.50E-03
2.00E-04
An-16:0 LPA
3.23E-02
An-18:1 LPA
3.38E-02
AA
1.13E-02
18:2 LPA
1.13E-02
20:4 LPA
1.33E-02
4.50E-03
2.40E-02
22:6 FA
4.28E-02
3.24E-02
LPE 16:0
3.08E-02
3.40E-03
LPE 18:0
3.90E-03
1.00E-04
LPE 18:1
2.18E-02
Integrating lipidomics with clinical features
Performance comparisons
Without Clinical Features

Comparison                 Accuracy
Polyp vs. Healthy          0.55
Colorectal vs. Healthy*    0.60
Polyp vs. Colorectal*      0.60

With Clinical Features

                           Accuracy          Accuracy            Accuracy
                           (without          (with t-test        (automatic
Comparison                 preselection)     preselection)       selection)
Polyp vs. Healthy          0.54              0.71                0.78
Colorectal vs. Healthy*    0.57              0.63                0.73
Polyp vs. Colorectal*      0.70              0.90                0.87

* Since the number of subjects was fewer than 15, 3-fold cross-validation accuracy was reported.
Messages



• Individual omics data sets have variable predictive performance
• Thorough statistical filtering plus biological knowledge integration is needed to combat the inherently high level of data noise
• Integrating different omics data with clinical data can improve predictive performance
Acknowledgment
We thank all the members in our team.