Compound Classification Biological/Non-Biological?

The Use of Graph Matching Algorithms to Identify Biochemical Substructures in Synthetic Chemical Compounds: Application to Metabolomics
Mai Hamdalla, David Grant, Ion Mandoiu, Dennis Hill, Sanguthevar Rajasekaran and Reda Ammar
University of Connecticut
Genome (DNA) → Transcriptome (RNA) → Proteome (Proteins) → Metabolome (Metabolites: Sugars, Nucleotides, Amino Acids, Lipids) → Phenotype/Function
Identification Process
SMILES (simplified molecular-input line-entry system)
List of Candidate Chemical Structures:
• C8H7N: C1=CC=C2C(=C1)C=CN2
• C9H18O8: C(C1C(C(C(C(O1)OCC(CO)O)O)O)O)O
• C6H12O6: C(C1C(C(C(O1)(CO)O)O)O)O
Mammalian Metabolite Identifier
Ranked list of Candidate Structures with mammalian substructures
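As a quick sanity check on the candidate list, the non-hydrogen atoms in each SMILES string can be counted and compared with the molecular formula. The sketch below is only an illustration: it handles the bracket-free organic subset of SMILES with a regular expression, and the helper name `heavy_atom_counts` is hypothetical, not part of the presented system.

```python
import re
from collections import Counter

# Tokens for the bracket-free "organic subset" of SMILES. Two-letter
# symbols come first so "Cl" is not read as "C" followed by "l".
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a bracket-free SMILES string."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    # Aromatic atoms are written in lowercase; capitalize() maps them
    # back to the element symbol and leaves "Cl"/"Br" unchanged.
    return Counter(tok.capitalize() for tok in _ATOM.findall(smiles))

# Formula/SMILES pairs taken from the candidate list above.
candidates = {
    "C8H7N": "C1=CC=C2C(=C1)C=CN2",
    "C9H18O8": "C(C1C(C(C(C(O1)OCC(CO)O)O)O)O)O",
    "C6H12O6": "C(C1C(C(C(O1)(CO)O)O)O)O",
}

for formula, smiles in candidates.items():
    print(formula, dict(heavy_atom_counts(smiles)))
```

Hydrogens are implicit in these strings, so only the C, N and O counts are checked against each formula.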
Identification Process
Figure (pipeline diagram): List of Candidate Compound Structures → Filtration → List of Filtered Candidate Compounds → Structure Matching → Ranked list of identified Compounds. The diagram also shows the Mammalian Scaffolds List (Sugars, Nucleotides, Amino Acids, Lipids) and non-Biological Scaffolds.
Collection and Curation of Scaffolds
• Retrieve all compounds in a metabolic pathway in the KEGG database.
• Keep participants of mammalian metabolic pathway groups (91 KEGG pathways): Carbohydrate, Energy, Lipid, Nucleotide, Amino Acid, Glycan, and Cofactors and Vitamins Metabolism.
• Remove entries that were single elements, metals, or inorganic.
• Remove compounds that did not have an entry in the PubChem database.
• Result: 1,987 compounds (30 – 1,000 Da)
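The curation steps read as a chain of filters over compound records. A minimal sketch follows, assuming each record already carries the three facts the filters need; the `Compound` fields and the toy records are hypothetical and are not data from the study.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Compound:
    name: str
    in_mammalian_pathway: bool   # participant of a kept pathway group
    is_organic: bool             # False for single elements, metals, inorganics
    pubchem_cid: Optional[int]   # None when no PubChem entry exists

def curate(compounds: List[Compound]) -> List[Compound]:
    """Keep only compounds that pass all three curation filters."""
    return [
        c for c in compounds
        if c.in_mammalian_pathway
        and c.is_organic
        and c.pubchem_cid is not None
    ]

# Hypothetical records for illustration only.
toy = [
    Compound("compound A", True, True, 101),
    Compound("compound B", True, False, 102),   # dropped: inorganic
    Compound("compound C", False, True, 103),   # dropped: not a mammalian pathway
    Compound("compound D", True, True, None),   # dropped: no PubChem entry
]

kept = curate(toy)
print([c.name for c in kept])
```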
Structure Matching
• SMSD (Small Molecule Sub-graph Detector) toolkit is used for molecule similarity searches.
$$\text{Similarity Score} = \frac{N_{SBS}}{N_{SPR}}$$
Where:
$N_{SBS}$: the number of atoms in the substructure and
$N_{SPR}$: the number of atoms in the superstructure.
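The score is a plain atom-count ratio, so it can be sketched in a few lines. The function name `similarity_score` is an assumption for illustration; the substructure search itself is done by SMSD and is not reproduced here. The inputs 6/14 and 4/14 are ratios that appear in the scaffold-matching example of this deck.

```python
def similarity_score(n_sbs: int, n_spr: int) -> float:
    """Ratio of substructure atoms (n_sbs) to superstructure atoms (n_spr)."""
    if n_spr <= 0 or not 0 <= n_sbs <= n_spr:
        raise ValueError("expected 0 <= n_sbs <= n_spr and n_spr > 0")
    return n_sbs / n_spr

# A 6-atom scaffold and a 4-atom scaffold inside a 14-atom candidate.
print(round(similarity_score(6, 14), 2))
print(round(similarity_score(4, 14), 2))
```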
Scaffolds-Structure Matching
Candidate Structure: C10H7NO3, C1=CC=C2C(=C1)C(=O)C=C(N2)C(=O)O
Figure: the Candidate Structure matched against individual Mammalian Scaffolds, with scores including 0.43 (6/14), 0.36, and 0.29 (4/14).
Similarity Score = 0.43
Union Scaffold Structure
Figure: the Mammalian Scaffolds matching the Candidate Structure (individual scores 0.43, 0.36, 0.29) combined into a Union Scaffold.
Similarity Score = 0.71 (10/14)
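One way to read the union scaffold score is as the fraction of candidate atoms covered by at least one matched scaffold. The sketch below assumes each match is available as a set of candidate atom indices; the index sets are hypothetical and chosen only so that 10 of 14 atoms are covered, as in the slide. How the presented system builds the union is not stated, so this is an interpretation rather than its implementation.

```python
from typing import Iterable, Set

def union_scaffold_score(matches: Iterable[Set[int]], n_candidate_atoms: int) -> float:
    """Fraction of candidate atoms covered by the union of all scaffold matches."""
    covered = set().union(*matches)   # empty set when there are no matches
    return len(covered) / n_candidate_atoms

# Hypothetical atom-index sets for scaffolds matched in a 14-atom candidate.
matches = [
    {0, 1, 2, 3, 4, 5},   # 6 atoms, individual score 6/14
    {4, 5, 6, 7},         # 4 atoms, individual score 4/14
    {6, 7, 8, 9},         # 4 atoms, individual score 4/14
]

print(round(union_scaffold_score(matches, 14), 2))
```

Overlapping matches are counted once, which is why the union score can be lower than the sum of the individual scores.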
Superstructure Scaffolds Matching
• About 30% of the mammalian structures were missed (FN).
• Example candidate: Union Scaffold Score = 0, yet it was found to be a substructure of 38 Scaffolds!
• Superstructure scaffold scores shown: 0.6 (9/15), 0.9 (9/10), 0.75 (9/12); a value of 0.45 also appears in the figure.
• Similarity Score = 0.9
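When the candidate is itself a substructure of larger scaffolds, the ratio is taken with the candidate in the numerator. The sketch below assumes the reported score is the best ratio over all scaffolds that contain the candidate, which is consistent with the 0.9 shown on the slide but is not stated explicitly; the function name is hypothetical.

```python
from typing import Iterable

def superstructure_score(n_candidate_atoms: int, scaffold_sizes: Iterable[int]) -> float:
    """Best candidate/scaffold atom ratio over scaffolds containing the candidate.

    Returns 0.0 when no scaffold contains the candidate.
    """
    return max((n_candidate_atoms / n for n in scaffold_sizes), default=0.0)

# A 9-atom candidate found inside scaffolds of 15, 10 and 12 atoms.
print(superstructure_score(9, [15, 10, 12]))
```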
Scoring Methods
Figure: a Candidate Structure with its Union Scaffold Structure (0.71) and Superstructure Scaffold Structure (0.93).
• US: Union Scaffold Score = 0.71
• MS: Maximum Score (Union Scaffold Score, Superstructure Score) = 0.93
• SS: Sum of Scores (Union Scaffold Score, Superstructure Score) = 1.64
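The three scoring methods combine the same two inputs, so they can be sketched together. The function and key names below are illustrative assumptions; the inputs are the 0.71 and 0.93 from the example above.

```python
from typing import Dict

def combined_scores(union_score: float, superstructure_score: float) -> Dict[str, float]:
    """Return the US, MS and SS scores for one candidate structure."""
    return {
        "US": union_score,                              # union scaffold score alone
        "MS": max(union_score, superstructure_score),   # maximum of the two
        "SS": union_score + superstructure_score,       # sum of the two
    }

scores = combined_scores(0.71, 0.93)
print({name: round(value, 2) for name, value in scores.items()})
```

Note that SS ranges from 0 to 2, so any cutoff applied to it would be on a different scale from US and MS.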
Collection and Curation of Synthetic Compounds
• Retrieve synthetic compounds from the ChemBridge and ChemSynthesis databases.
– Restricted to the 6 biological elements C, H, N, O, P, and S.
• The mass distribution:
– ChemBridge (150 – 700 Da)
– ChemSynthesis (50 – 300 Da)
• Mammalian scaffold list reduced to 1,400 compounds (50 – 700 Da).
• 1,400 compounds were randomly selected for training and 5,320 compounds were randomly chosen for testing.
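The element restriction can be illustrated with a small check on molecular formulas. This is a sketch under the assumption that formulas are plain strings such as `C10H7NO3`, without charges, brackets or isotopes; the function names are hypothetical.

```python
import re
from typing import Set

BIOLOGICAL_ELEMENTS = frozenset({"C", "H", "N", "O", "P", "S"})

# An element symbol is one uppercase letter, optionally followed by a
# lowercase letter, optionally followed by a count.
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")

def elements_in(formula: str) -> Set[str]:
    """Return the set of element symbols in a simple molecular formula."""
    return {symbol for symbol, _count in _ELEMENT.findall(formula)}

def is_chnops_only(formula: str) -> bool:
    """True when the formula uses only C, H, N, O, P and S."""
    elements = elements_in(formula)
    return bool(elements) and elements <= BIOLOGICAL_ELEMENTS

print(is_chnops_only("C10H7NO3"))   # formula of the candidate used earlier
print(is_chnops_only("C6H5Cl"))     # contains chlorine
```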
Cross Validation Average Accuracy Results
Metric  Stat    US     MS     SS     5US    5MS    5SS
SENS    AVG     70%    59%    88%    83%    84%    86%
SENS    STDEV   2%     2%     2%     1%     1%     1%
SPEC    AVG     65%    71%    57%    75%    76%    78%
SPEC    STDEV   3%     3%     3%     2%     2%     2%
MCC     AVG     0.36   0.3    0.47   0.57   0.6    0.64
MCC     STDEV   2%     2%     2%     2%     2%     1%
$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP}$$
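The reported metrics can be computed from confusion-matrix counts as sketched below. Sensitivity and specificity follow the formulas on the slide; MCC is assumed here to mean the Matthews correlation coefficient in its standard form, which the slide does not spell out. The counts are hypothetical and are not results from the study.

```python
import math

def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): fraction of mammalian compounds recognized as such."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): fraction of non-mammalian compounds rejected."""
    return tn / (tn + fp)

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion-matrix counts for illustration only.
tp, fn, tn, fp = 90, 10, 80, 20
print(sensitivity(tp, fn), specificity(tn, fp), round(mcc(tp, tn, fp, fn), 2))
```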
Leave-One-Out Accuracy
Figure: bar chart of Mammalian vs. Non-Mammalian classifications per mass bin (x-axis: Bin Mass (Da); y-axis: 0% to 120%). Bar labels shown: 15%, 3%, 2%, 3%, 5%, 4%, 3%, 5%, 2%, 10%, 8%, 0%, 0%.
Sensitivity = 96%
Prospective Results of Synthetic Compounds
Figure: bar chart of Mammalian vs. Non-Mammalian classifications per mass bin (x-axis: Bin Mass (Da); y-axis: 0% to 90%). Bar labels shown: 60%, 63%, 55%, 68%, 63%, 53%, 47%, 36%, 34%, 27%, 26%, 36%, 22%.
54% eliminated as non-mammalian
Conclusions
• A novel way of utilizing known mammalian
metabolites (scaffolds database) to identify
synthetic chemical compounds with
mammalian substructures.
• The results show a sensitivity of 96% in the
mammalian scaffolds leave-one-out
experiments.
• The system was able to eliminate 54% of a
random set of synthetic compounds.
Ongoing Work
• Exploring further improvements in accuracy
by using known biological pathway
information.
• Annotating PubChem
• Annotating existing and potential drugs
• Database-independent compound search
– Generate all possible structures of a given formula and rank them
Thank you!