Semantics powered Bioinformatics

Amit Sheth, William S. York, et al.
Large Scale Distributed Information Systems Lab & Complex Carbohydrate Research Center
University of Georgia
http://lsdis.cs.uga.edu

Project Information:
Background: SW for Life Sciences
• Bioinformatics of Glycan Expression – component of the NCRR "Integrated Technology Resource for Biomedical Glycomics".
• W3C Interest Group on Semantic Web for Health care and Life Sciences
• Deployed Active Semantic Electronic Medical Patient Record application at the Athens Heart Center

Agenda
• Review of Accomplishments/Ongoing Work:
  o GLYDE standard
  o GlycO Ontology
  o ProPreO Ontology
  o Semantic Analytical Glycomics Workflow
  o Visualization
  o Semantic Web Services: WSDL-S/METEOR-S

GLYDE standard
• An XML-based representation format for glycan structures
• Inter-convertible with existing data represented using IUPAC or LINUCS
• In progress: incorporation of probability-based representation
• In progress: incorporation of aspects for visualization of structures using GLYDE (XML) files
GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.

Collaborative GlycoInformatics
• Enable querying and export of query results in GLYDE format
• Using GLYDE representation for disambiguation, mapping and matching
[Figure labels: GLYDE, MonosaccharideDB, SweetDB, KEGG; QUERY and RESULT are each shown as a GLYDE document of the form <glyde> <residue> . . </residue> </glyde>]

Collaborative GlycoInformatics
• Development of GLYDE semantic web portal
• Integration with www.glycosciences.de
  o Visualization aspect integrated with LiGraph (Heidelberg) or OntoVista (UGA)
• Semantic annotation of publications in the GlycoProteomics domain
[Figure labels: MonosaccharideDB, www.glycosciences.de, GLYDE Semantic Portal, KEGG]

Collaborative GlycoInformatics
Evolving collaboration between:
• LSDIS/CCRC: Will York, Amit Sheth, Michael Pierce
• EUROCarbDB (German Cancer Research Center): Willi von der Lieth
• Consortium for Functional Glycomics (CFG): Rahul Raman, Ram Sasisekharan, Thomas Lütteke
• N.D. Zelinsky Institute of Organic Chemistry (Moscow): Yuriy Knirel
• Mitsui Knowledge Industry (Japan): Hisashi Narimatsu, Norihiro Kikuchi
• Kyoto Encyclopedia of Genes and Genomes (KEGG): Minoru Kanehisa, Kiyoko F. Aoki-Kinoshita
• Palo Alto Research Center (PARC): David Goldberg

Semantic GlycoInformatics - Ontologies
• GlycO: A domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)
  o Contains 600+ classes and 100+ properties – describe structural features of glycans; unique population strategy
  o URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: a comprehensive process ontology modeling experimental proteomics
  o Contains 330 classes, 6 million+ instances
  o Models three phases of experimental proteomics
  o URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo

GlycO taxonomy
[Figure: The first levels of the GlycO taxonomy]

Most relationships and attributes in GlycO
GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.

Pathway representation in GlycO
Pathways do not need to be explicitly defined in GlycO.
The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.

Zooming in a little …
[Figure labels: Reaction R05987, catalyzed by enzyme 2.4.1.145, adds_glycosyl_residue N-glycan_b-D-GlcpNAc_13]
The product of this reaction is the Glycan with KEGG ID 00020. The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145.

Ontology Population
• The next slides show the different steps that were necessary to populate GlycO with glycan structures from multiple sources.
• GLYDE is used to disambiguate between representations from multiple sources

Ontology population workflow
[Flowchart boxes: Semagix Freedom knowledge extractor; Instance Data; Has CarbBank ID? (YES / NO); IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Already in KB? (YES: next Instance; NO: Insert into KB)]

Output of the "IUPAC to LINUCS" step:
  [][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}

Output of the "LINUCS to GLYDE" step:
  <Glycan>
    <aglycon name="Asn"/>
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
        <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
          <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
            <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
            </residue>
            <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
            </residue>
          </residue>
          <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
            <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
            </residue>
          </residue>
        </residue>
      </residue>
    </residue>
  </Glycan>

ProPreO
• ProPreO: A process ontology to capture proteomics experimental lifecycle:
  o Separation
  o Mass spectrometry
  o Analysis
  o 330 classes
  o 110 properties
  o 6 million+ instances

Usage: Mass spectrometry analysis
[Figure: Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.]
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Semantic Annotation of Experimental Data
• Enables ontology-mediated disambiguation
• Allows correlation between disparate entities using semantic relations
P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Semantic GlycoProteomics Workflow
[Workflow figure, sample preparation: Cell Culture (extract) -> Glycoprotein Fraction (proteolysis) -> Glycopeptides Fraction (Separation technique I, 1 to n) -> n Glycopeptides Fractions (PNGase) -> Peptide Fraction (Separation technique II) -> n*m Peptide Fractions (Mass spectrometry) -> ms data, ms/ms data]
[Workflow figure, data processing labels: Data reduction; ms peaklist; ms/ms peaklist; binning; Glycopeptide identification and quantification; N-dimensional array; Signal integration; Data reduction; Peptide identification; Peptide list; Data correlation]

Web Services based Workflow = Web Process
[Figure: a WORKFLOW composed of Web Service 1 to Web Service 4 (WS1 to WS4), hosted on different platforms: Windows XP, LINUX, MAC, Solaris]

BOWSER
• Use semantics for describing Web Services
• WSDL-S (LSDIS/IBM)
• Use service-level annotation of Web Services
• Graphical
traversal of taxonomy of biological concepts to search for Web Services
• http://128.192.9.11:8080/stargate/bowser.jsp

Semantic Annotation of Scientific Data
ms/ms peaklist data:
  830.9570 194.9604 2
  580.2985 0.3592
  688.3214 0.2526
  779.4759 38.4939
  784.3607 21.7736
  1543.7476 1.3822
  1544.7595 2.9977
  1562.8113 37.4790
  1660.7776 476.5043

Annotated ms/ms peaklist data:
  <ms/ms_peak_list>
    <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
    <parent_ion_mass>830.9570</parent_ion_mass>
    <total_abundance>194.9604</total_abundance>
    <z>2</z>
    <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
    <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
    <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
    <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
    <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
    <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
    <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
    <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
  </ms/ms_peak_list>

Discovery of relationship between biological entities
[Figure labels: ProPreO; process; GlycO; Lectin; Gene Ontology (GO); Fragment of Specific protein; Specific cellular process; Identified and quantified peptides; Collection of N-glycan ligands; Genomic database (Mascot/Sequest); Collection of Biosynthetic enzymes]
The inference: instances of the class "collection of biosynthetic enzymes" (GNT-V) are involved in the specific cellular process (metastasis).

Semantic Web Services using WSDL-S
• Formalize description and classification of Web Services using ProPreO concepts

WSDL (ModifyDB): description of a Web Service using the Web Service Description Language
  <?xml version="1.0" encoding="UTF-8"?>
  <wsdl:definitions targetNamespace="urn:ngp"
      …..
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <wsdl:types>
      <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
        …..
        </complexType>
      </schema>
    </wsdl:types>
    <wsdl:message name="replaceCharacterRequest">
      <wsdl:part name="in0" type="soapenc:string"/>
      <wsdl:part name="in1" type="soapenc:string"/>
      <wsdl:part name="in2" type="soapenc:string"/>
    </wsdl:message>
    <wsdl:message name="replaceCharacterResponse">
      <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
    </wsdl:message>

WSDL-S (ModifyDB): the same description annotated with concepts defined in the process ontology
  <?xml version="1.0" encoding="UTF-8"?>
  <wsdl:definitions targetNamespace="urn:ngp"
      ……
      xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
      xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
    <wsdl:types>
      <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
        ……
        </complexType>
      </schema>
    </wsdl:types>
    <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
      <wsdl:part name="in0" type="soapenc:string"/>
      <wsdl:part name="in1" type="soapenc:string"/>
      <wsdl:part name="in2" type="soapenc:string"/>
    </wsdl:message>

[Figure: fragment of the ProPreO process ontology with the concepts data, sequence, peptide_sequence]

Semantic Visualization
• Ontologies are meant for machine consumption
• Often too convoluted for the human eye
• The scientist needs to know the concepts she uses for annotation
• Build a visualization environment that translates the formal concepts into a representation the domain expert understands well
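The WSDL-S excerpt above attaches a ProPreO concept to a message through the wssem:modelReference attribute. As a minimal sketch of how such an annotation could be read back for service discovery, the Python below collects the model reference of every annotated message. The document WSDL_S is a hypothetical, trimmed-down version of the slide's excerpt; the wsdl namespace URI is the standard WSDL 1.1 one, which the excerpt elides, and the function name model_references is an illustration only, not part of WSDL-S or METEOR-S.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # standard WSDL 1.1 namespace (assumed here)
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"  # from the slide

# Hypothetical, trimmed-down WSDL-S document modeled on the slide's excerpt.
WSDL_S = """
<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>
"""


def model_references(wsdl_text):
    """Map each annotated message name to its wssem:modelReference value."""
    root = ET.fromstring(wsdl_text.strip())
    refs = {}
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        ref = message.get(f"{{{WSSEM_NS}}}modelReference")
        if ref is not None:  # unannotated messages are skipped
            refs[message.get("name")] = ref
    return refs


refs = model_references(WSDL_S)
print(refs)
```

Only the request message carries an annotation in this example, so the response message does not appear in the result. A registry or browser could index services by these concept references rather than by message names.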
Single Glycan

Customizable Layouts
• Using customizable layouts, knowledge can be formalized in a machine-understandable way and then visually translated for the user's needs.
  – Cartoonist representation for the glycobiologist
  – Chemical reactions shown as left side and right side, instead of the convoluted representation in the ontology

Ongoing and Future Work
• SemURI: Semantic URI based provenance scheme using ProPreO
• RDF-based version of the GLYDE schema
• A framework for semantic annotation of experimental data
• Integration of large datasets (~500MB) into ProPreO for reasoning

Further details at:
• http://lsdis.cs.uga.edu/projects/glycomics/
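The "LINUCS to GLYDE" step of the ontology population workflow can be illustrated with a small recursive parser. This is a sketch under stated assumptions, not the converter used in the project: the element and attribute names follow the GLYDE fragment shown on the slide, the function names (to_glyde, parse_nodes, add_node) are hypothetical, and residue names are assumed to follow the pattern anomer-chirality-stem with an optional ring letter (p or f) that is dropped, as in GlcpNAc becoming GlcNAc.

```python
import re
import xml.etree.ElementTree as ET

# One LINUCS node looks like [linkage][name]{children}; this pattern consumes
# everything up to and including the opening brace.
NODE = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]\{")
# Linkage such as (4+1): position on the parent, then the anomeric carbon.
LINK = re.compile(r"\((\d+)\+(\d+)\)")
# Residue name such as b-D-GlcpNAc (assumed naming pattern, see lead-in).
NAME = re.compile(r"([ab])-([DL])-([A-Z][a-z]{2})[pf]?(\w*)")


def add_node(parent, linkage, name):
    """Append the element for one node; return the element its children attach to."""
    if linkage == "":
        # Empty linkage: treated as the aglycon, an empty sibling element.
        ET.SubElement(parent, "aglycon", {"name": name})
        return parent
    link = LINK.fullmatch(linkage)
    sugar = NAME.fullmatch(name)
    if link is None or sugar is None:
        raise ValueError(f"unrecognized node: [{linkage}][{name}]")
    anomer, chirality, stem, substituents = sugar.groups()
    return ET.SubElement(parent, "residue", {
        "link": link.group(1),
        "anomeric_carbon": link.group(2),
        "anomer": anomer,
        "chirality": chirality,
        "monosaccharide": stem + substituents,
    })


def parse_nodes(text, pos, parent):
    """Parse sibling nodes from text[pos] up to the parent's closing brace."""
    while pos < len(text) and text[pos] != "}":
        match = NODE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected LINUCS text at offset {pos}")
        element = add_node(parent, match.group(1), match.group(2))
        pos = parse_nodes(text, match.end(), element)
        if pos >= len(text) or text[pos] != "}":
            raise ValueError("unbalanced braces in LINUCS string")
        pos += 1  # step over the closing brace of this node
    return pos


def to_glyde(linucs):
    """Convert a LINUCS string into a GLYDE-like <Glycan> element tree."""
    text = "".join(linucs.split())  # whitespace is not significant here
    glycan = ET.Element("Glycan")
    if parse_nodes(text, 0, glycan) != len(text):
        raise ValueError("trailing text after LINUCS structure")
    return glycan


# The LINUCS string from the workflow slide.
SLIDE_LINUCS = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
    "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
    "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

glycan = to_glyde(SLIDE_LINUCS)
print(ET.tostring(glycan, encoding="unicode"))
```

Because both formats are trees, the conversion is a direct structural mapping: each LINUCS node becomes one nested residue element, which is what makes a canonical GLYDE form usable for comparing entries from different sources.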