Applying Semantic Technologies to the Glycoproteomics Domain
W. S. York
May 15, 2006

Some Goals of Glycoproteomics
• How do changes in the expression levels of specific genes alter the expression of specific glycans on the cell surface?
• Are changes in the expression of specific glycans at the cell surface related to cell function, cell development, and disease?
• What are the mechanisms by which specific glycans at the cell surface affect cell function, cell development, and the progression of disease?

Challenges of Glycoproteomics
• Vast amounts of data collected by high-throughput experiments – better methods for data archival, retrieval, and analysis are needed
• Complex structures of glycans and glycoproteins – better methods for representing branched structures and finding structural and functional homologies are needed
• Complex biology and biochemistry – better methods to find relationships between the glycoproteome and biological processes are needed

Glycoproteomics Solutions
• Brute-force analysis of flat data files
  • Too much data
  • Data is heterogeneous
  • What does the data represent?
• Relational databases
  • Data is well organized
  • Data organization is relatively rigid
  • What does the data represent?
• Semantic Technologies
  • Data is well organized
  • Data organization is flexible
  • Concepts represented by data are accessible
  • Relationships between concepts are accessible

What is Semantic Technology?
Semantics:
1. (Linguistics) The study or science of meaning in language.
2. (Linguistics) The study of relationships between signs and symbols and what they represent.
The American Heritage® Dictionary of the English Language, Fourth Edition
The implication is that enabling computers to “understand” meanings and relationships between concepts will allow them to reason and communicate in a way that is analogous to the way humans do.
Semantic Technology: The use of formal representations of concepts and their relationships to enable efficient, intelligent software.
Ontology (Computer Science): A model that represents a domain and is used to reason about the objects in that domain and the relations between them.
http://en.wikipedia.org/wiki/Ontology_(computer_science)

A Simple Ontology
[Figure: is_a hierarchy. Organism has the subclasses Animal and Plant; Animal has the subclasses Lion, Deer, and Cow; Plant has the subclasses Hosta and Alfalfa. Instances: Elsa and Simba (Lion), Bambi (Deer), Elsie (Cow), My Hosta (Hosta), Peter’s Alfalfa (Alfalfa). Four "ate" relations connect instances.]

A Simple Ontology
[Figure: the same ontology with the added classes Herbivore and Carnivore and the class-level relations Herbivore eats Plant and Carnivore eats Animal.]

The Structure of GlycO – Concept Taxonomy
[Figure: is_a taxonomy containing the concepts chemical entity, molecule, molecular fragment, carbohydrate moiety, residue, monoglycosyl moiety, glycan moiety, amino acid residue, carbohydrate residue, N-glycan, and O-glycan.]

The Structure of GlycO – Concept Taxonomy
[Figure: detail of the taxonomy showing residue, glycan moiety, amino acid residue, carbohydrate residue, N-glycan, and O-glycan.]

The Structure of GlycO – Instances
[Figure: concept taxonomy and properties. Labels: N-glycan_00020, N-glycan a-D-Manp 4, N-glycan core b-D-Manp, amino acid residue, carbohydrate residue, glycan moiety, N-glycan, O-glycan. Properties: has_residue, is_linked_to, is_instance_of, is_a.]

The GlycO Ontology in Protégé
3 Top-Level Classes are Defined in GlycO

The GlycO Ontology in Protégé
Semantics Include Chemical Context
This Class Inherits from 2 Parents

The GlycO Ontology in Protégé
The a-D-Manp residues in N-glycans are found in 8 different chemical environments

GlycoTree – A Canonical Representation of N-Glycans
We give a residue in this position the same name, regardless of the specific structure it resides in. Semantics!
b-D-GlcpNAc-(1-6)+
b-D-GlcpNAc-(1-2)-a-D-Manp-(1-6)+
b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-b-D-GlcpNAc
b-D-GlcpNAc-(1-4)-a-D-Manp-(1-3)+
b-D-GlcpNAc-(1-2)+
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

The GlycO Ontology in Protégé
Bisecting b-D-GlcpNAc

The GlycO Ontology in Protégé
1,3-linked a-L-Fucp

Ontology Population Workflow
[Figure: flowchart with the steps Semagix Freedom knowledge extractor, Instance Data, Has CarbBank ID? (YES/NO), IUPAC to LINUCS, LINUCS to GLYDE, Compare to Knowledge Base, Already in KB? (YES: next instance; NO), and Insert into KB.]

LINUCS string shown at the IUPAC to LINUCS step:
[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}

GLYDE XML shown at the LINUCS to GLYDE step:
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>

The ProPreO Ontology in Protégé
3 Top-Level Classes are Defined in ProPreO

The ProPreO Ontology in Protégé
This Class Inherits from 2 Parents

Semantic Annotation of MS Data
ms/ms peaklist data:
830.9570   194.9604   2     (parent ion m/z, parent ion abundance, parent ion charge)
580.2985   0.3592           (fragment ion m/z, fragment ion abundance)
688.3214   0.2526
779.4759   38.4939
784.3607   21.7736
1543.7476  1.3822
1544.7595  2.9977
1562.8113  37.4790
1660.7776  476.5043

Semantically Annotated MS Data
[Callout on the element and attribute names: Ontological Concepts]
<ms/ms_peak_list>
  <parameter instrument=micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer mode="ms/ms"/>
  <parent_ion m/z=830.9570 abundance=194.9604 z=2/>
  <fragment_ion m/z=580.2985 abundance=0.3592/>
  <fragment_ion m/z=688.3214 abundance=0.2526/>
  <fragment_ion m/z=779.4759 abundance=38.4939/>
  <fragment_ion m/z=784.3607 abundance=21.7736/>
  <fragment_ion m/z=1543.7476 abundance=1.3822/>
  <fragment_ion m/z=1544.7595 abundance=2.9977/>
  <fragment_ion m/z=1562.8113 abundance=37.4790/>
  <fragment_ion m/z=1660.7776 abundance=476.5043/>
</ms/ms_peak_list>

Web Services Based Workflow for Proteomics¹
[Figure: workflow in which an agent runs each step, taking an input (I) and producing an output (O): Biological Sample Analysis by MS/MS → Raw Data → Raw Data to Standard Format → Standard Format Data → Data Preprocess² → Filtered Data → DB Search (Mascot/Sequest) → Search Results → Results Postprocess (ProValt³) → Final Output → Storage, Biological Information.]
1 Design and Implementation of Web Services based Workflow for proteomics. Journal of Proteome Research. Submitted.
2 Computational tools for increasing confidence in protein identifications. Association of Biomolecular Resource Facilities Annual Meeting, Portland, OR, 2004.
3 A Heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics. 4(6), 762-772.

An Integrated Semantic Information System
• Formalized domain knowledge is in ontologies
  • The schema defines the concepts
  • Instances represent individual objects
  • Relationships provide expressiveness
• Data is annotated using concepts from the ontologies
  • The semantic annotations facilitate the identification and extraction of relevant information
  • The semantic relationships allow knowledge that is implicit in the data to be discovered

Satya Sahoo, Christopher Thomas, Cory Henson, Ravi Pavagada, Amit Sheth, Krzysztof Kochut, John Miller, James Atwood, Lin Lin, Alison Nairn, Gerardo Alvarez-Manilla, Saeed Roushanzamir, Michael Pierce, Ron Orlando, Kelley Moremen, Parastoo Azadi, Alfred Merrill
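
The points made under "An Integrated Semantic Information System" (a schema of concepts, instances, and relationships that expose implicit knowledge) can be illustrated with the animal and plant example from the "A Simple Ontology" slides. The following is only a minimal Python sketch, not the GlycO or ProPreO implementation. The triple layout and the function names (build_index, ancestors, is_consistent_meal) are hypothetical, the placement of Lion, Deer, and Cow under Carnivore and Herbivore is one reading of the figure, and the two "ate" facts are assumed for illustration because the slide text does not say which instance ate which.

```python
from collections import defaultdict

# Facts as (subject, predicate, object) triples.
# Class and instance names come from the "A Simple Ontology" slides.
# The Carnivore/Herbivore placements are one reading of the figure,
# and the two "ate" facts are ASSUMED for illustration only.
TRIPLES = [
    ("Animal", "is_a", "Organism"),
    ("Plant", "is_a", "Organism"),
    ("Herbivore", "is_a", "Animal"),
    ("Carnivore", "is_a", "Animal"),
    ("Lion", "is_a", "Carnivore"),
    ("Deer", "is_a", "Herbivore"),
    ("Cow", "is_a", "Herbivore"),
    ("Hosta", "is_a", "Plant"),
    ("Alfalfa", "is_a", "Plant"),
    ("Elsa", "is_a", "Lion"),
    ("Simba", "is_a", "Lion"),
    ("Bambi", "is_a", "Deer"),
    ("Elsie", "is_a", "Cow"),
    ("My Hosta", "is_a", "Hosta"),
    ("Peter's Alfalfa", "is_a", "Alfalfa"),
    ("Herbivore", "eats", "Plant"),
    ("Carnivore", "eats", "Animal"),
    ("Bambi", "ate", "My Hosta"),         # assumed
    ("Elsie", "ate", "Peter's Alfalfa"),  # assumed
]


def build_index(triples):
    """Group triples as index[predicate][subject] -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[predicate][subject].add(obj)
    return index


def ancestors(index, node):
    """Return every concept reachable from node by following is_a links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in index["is_a"].get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_consistent_meal(index, eater, food):
    """True if some class of eater has an 'eats' link to some class of food."""
    eater_classes = ancestors(index, eater)
    food_classes = ancestors(index, food)
    return any(
        diet in food_classes
        for cls in eater_classes
        for diet in index["eats"].get(cls, ())
    )


INDEX = build_index(TRIPLES)

# Implicit knowledge: nothing states directly that Elsa is an Organism.
print(sorted(ancestors(INDEX, "Elsa")))

# Check each recorded "ate" fact against the class-level "eats" relations.
for eater, foods in INDEX["ate"].items():
    for food in foods:
        print(eater, "ate", food, "->", is_consistent_meal(INDEX, eater, food))
```

Storing the schema and the instance data in the same triple form means one traversal serves both: the is_a walk answers "what kind of thing is this?", and the same walk lets class-level relations such as "eats" be applied to individual instances.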