The Use of Graph Matching Algorithms to Identify Biochemical Substructures in Synthetic Chemical Compounds: Application to Metabolomics

Mai Hamdalla, David Grant, Ion Mandoiu, Dennis Hill, Sanguthevar Rajasekaran and Reda Ammar
University of Connecticut

[Figure: Genome (DNA) → Transcriptome (RNA) → Proteome (Proteins) → Metabolome (Metabolites: Sugars, Nucleotides, Amino Acids, Lipids) → Phenotype/Function]

Identification Process

SMILES (simplified molecular-input line-entry system)

List of Candidate Chemical Structures:
  C8H7N     C1=CC=C2C(=C1)C=CN2
  C9H18O8   C(C1C(C(C(C(O1)OCC(CO)O)O)O)O)O
  C6H12O6   C(C1C(C(C(O1)(CO)O)O)O)O

List of Candidate Chemical Structures → Mammalian Metabolite Identifier → Ranked list of Candidate Structures with mammalian substructures

Identification Process

Inputs: List of Candidate Compound Structures; Mammalian Scaffolds List (Sugars, Nucleotides, Amino Acids, Lipids); non-Biological Scaffolds
Steps: Filtration → List of Filtered Candidate Compounds → Structure Matching → Ranked list of identified Compounds

Collection and Curation of Scaffolds

• Retrieve all compounds in a metabolic pathway in the KEGG database.
• Keep participants of mammalian metabolic pathway groups (91 KEGG pathways): Carbohydrate, Energy, Lipid, Nucleotide, Amino Acid, Glycan, and Cofactors and Vitamins metabolism.
• Remove entries that were single elements, metals, or inorganic.
• Remove compounds that did not have an entry in the PubChem database.

Result: 1,987 compounds (30 – 1,000 Da)

[Figure: Identification Process flow diagram, repeated; final output labeled "List of Identified Compounds"]

Structure Matching

• The SMSD (Small Molecule Sub-graph Detector) toolkit is used for molecule similarity searches.

\[ \text{Similarity Score} = \frac{N_{SBS}}{N_{SPR}} \]

Where:
N_{SBS}: the number of atoms in the substructure, and
N_{SPR}: the number of atoms in the superstructure.
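The similarity score defined above is a simple ratio of atom counts, which a minimal Python sketch can illustrate. The helper names `heavy_atom_count` and `similarity_score` are hypothetical and are not part of SMSD. Counting only non-hydrogen atoms is an assumption, inferred from the worked example in which the candidate C10H7NO3 is scored out of 14 atoms.

```python
import re


def heavy_atom_count(formula: str) -> int:
    """Count non-hydrogen atoms in a simple molecular formula such as 'C10H7NO3'.

    Assumption: hydrogens are excluded from the atom counts used in the score.
    """
    total = 0
    # Each match is an element symbol followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element != "H":
            total += int(count) if count else 1
    return total


def similarity_score(n_sbs: int, n_spr: int) -> float:
    """Return N_SBS / N_SPR, the substructure-to-superstructure atom ratio."""
    if n_spr <= 0 or not 0 <= n_sbs <= n_spr:
        raise ValueError("expected 0 <= n_sbs <= n_spr and n_spr > 0")
    return n_sbs / n_spr


# A scaffold covering 6 atoms of the 14-atom candidate C10H7NO3.
n_candidate = heavy_atom_count("C10H7NO3")
score = similarity_score(6, n_candidate)
```

In practice the atom counts would come from the matched sub-graph reported by the matching toolkit rather than from a formula string; the formula parser is only a stand-in to keep the sketch self-contained.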
Scaffolds-Structure Matching

Candidate structure: C10H7NO3, SMILES C1=CC=C2C(=C1)C(=O)C=C(N2)C(=O)O

[Figure: individual mammalian scaffolds matched against the candidate structure, each labeled with its similarity score; labels shown include 0.43 (6/14), 0.36 and 0.29 (4/14)]

Union Scaffold

[Figure: the matched mammalian scaffolds combined into a union scaffold structure and compared with the candidate structure]

Similarity Score = 0.71 (10/14)

About 30% of the mammalian structures were missed (FN).

Superstructure Scaffolds Matching

• Union Scaffold Score = 0
• Similarity Score = 0.9
• Found to be a substructure of 38 scaffolds!

[Figure: candidate structure and superstructure scaffolds labeled 0.6 (9/15), 0.9 (9/10) and 0.75 (9/12); one further label reads 0.45]

Scoring Methods

[Figure: union scaffold structure (0.71), candidate structure, and superstructure scaffold structure (0.93)]

• US: Union Scaffold Score = 0.71
• MS: Maximum Score (Union Scaffold Score, Superstructure Score) = 0.93
• SS: Sum of Scores (Union Scaffold Score, Superstructure Score) = 1.64

Collection and Curation of Synthetic Compounds

• Retrieve synthetic compounds from ChemBridge and ChemSynthesis databases.
  – restricted to the 6 biological elements C, H, N, O, P, and S.
• The mass distribution
  – ChemBridge (150 – 700 Da)
  – ChemSynthesis (50 – 300 Da)
• Mammalian scaffold list reduced to 1,400 compounds (50 – 700 Da)
• 1,400 compounds were randomly selected for training and 5,320 compounds were randomly chosen for testing.
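The three scoring methods (US, MS, SS) from the Scoring Methods slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, scaffold matches are represented as sets of candidate atom indices, and the superstructure score is taken as the best ratio over all superstructure scaffolds, which follows the slide example (0.6, 0.9, 0.75 gives 0.9) but is not stated explicitly.

```python
from typing import Dict, Iterable, Set


def union_scaffold_score(matched_atoms: Iterable[Set[int]], n_candidate_atoms: int) -> float:
    """Fraction of candidate atoms covered by the union of all scaffold matches.

    Each element of matched_atoms holds the candidate atom indices that one
    scaffold (a substructure of the candidate) maps onto.
    """
    covered = set().union(*matched_atoms)
    return len(covered) / n_candidate_atoms


def superstructure_score(n_candidate_atoms: int, scaffold_sizes: Iterable[int]) -> float:
    """Best candidate-to-scaffold atom ratio over scaffolds that contain the candidate.

    Assumption: the highest ratio is the one reported; 0.0 if no scaffold contains it.
    """
    return max((n_candidate_atoms / n for n in scaffold_sizes), default=0.0)


def combined_scores(union_score: float, super_score: float) -> Dict[str, float]:
    """Return the US, MS and SS scores named on the slide."""
    return {
        "US": union_score,
        "MS": max(union_score, super_score),
        "SS": union_score + super_score,
    }


# Two overlapping scaffold matches that together cover 10 of 14 candidate atoms.
us = union_scaffold_score([{0, 1, 2, 3, 4, 5}, {4, 5, 6, 7, 8, 9}], 14)
# A 9-atom candidate contained in scaffolds of 15, 10 and 12 atoms.
sup = superstructure_score(9, [15, 10, 12])
```

The atom index sets here are made up for illustration; in a real pipeline they would come from the sub-graph mappings returned by the matching toolkit.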
Cross Validation Average Accuracy Results

              US     MS     SS     5US    5MS    5SS
SENS  AVG     70%    59%    88%    83%    84%    86%
      STDEV   2%     2%     2%     1%     1%     1%
SPEC  AVG     65%    71%    57%    75%    76%    78%
      STDEV   3%     3%     3%     2%     2%     2%
MCC   AVG     0.36   0.3    0.47   0.57   0.6    0.64
      STDEV   2%     2%     2%     2%     2%     1%

\[ \text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP} \]

Leave-One-Out Accuracy

[Figure: bar chart of the percentage of scaffolds classified as Mammalian vs. Non-Mammalian per mass bin; x-axis: Bin Mass (Da)]

Sensitivity = 96%

Prospective Results of Synthetic Compounds

[Figure: bar chart of the percentage of synthetic compounds classified as Mammalian vs. Non-Mammalian per mass bin; x-axis: Bin Mass (Da)]

54% eliminated as non-mammalian

Conclusions

• A novel way of utilizing known mammalian metabolites (scaffolds database) to identify synthetic chemical compounds with mammalian substructures.
• The results show a sensitivity of 96% in the mammalian scaffolds leave-one-out experiments.
• The system was able to eliminate 54% of a random set of synthetic compounds.

Ongoing Work

• Exploring further improvements in accuracy by using known biological pathway information.
• Annotating PubChem
• Annotating existing and potential drugs
• Database-independent compound search
  – Generate all possible structures of a given formula and rank them

[Figure: Identification Process flow diagram, repeated: Candidate Structures → Filtration → List of Filtered Candidate Compounds → Structure Matching → Ranked Compounds]

Thank you!
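As a supplement to the cross-validation slide, the three reported metrics (SENS, SPEC, MCC) can be computed from confusion-matrix counts as sketched below. The sensitivity and specificity formulas are the ones given on the slide; the MCC formula is the standard Matthews correlation coefficient, which the slides name but do not spell out. The function name and the counts in the usage lines are hypothetical and are not the study's data.

```python
import math
from typing import Tuple


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Standard Matthews correlation coefficient; defined as 0.0 when any
    # marginal total is zero and the denominator vanishes.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for illustration only (not taken from the slides).
sens, spec, mcc = classification_metrics(tp=70, fn=30, tn=65, fp=35)
```

MCC is useful alongside sensitivity and specificity because it summarizes all four confusion-matrix cells in a single value between -1 and 1.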