Semantic empowerment of Life Science Applications
October 2006
Amit Sheth
LSDIS Lab, Department of Computer Science, University of Georgia
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.

Computation, data and semantics in life sciences
• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
• "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins
• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.

Semantic Web and Life Science
• Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
• How much is that?
  – Compare it to the estimate of the total words ever spoken by humans = 12 exabytes
• Death by data
• The need for
  – Search
  – Integration
  – Analysis, decision support
  – Discovery
Not data, but analysis and insight, leading to decisions and discovery

Semantic empowerment of Life Science Applications
Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world. We need more automated ways for integration and analysis leading to insight and discovery: to understand cellular components, molecular functions and biological processes, and, more importantly, the complex interactions and interdependencies between them.

Benefits of Semantics
• Development of large domain-specific knowledge – for reference, common nomenclature, tagging
• Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases
• Semantic search, browsing, integration, analysis, and discovery
Faster and more reliable discovery leading to quality of life improvements

What is semantics & Semantic Web
• Meaning and use of data
• From syntax and structure to semantics (beyond formatting, organization, query interfaces, …)
• XML -> RDF -> OWL -> Rules -> Trust
• Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge
• (Automatic) Semantic annotation, reasoning, …
• Also, increasing use of Service-Oriented Architecture -> semantic Web services
• W3C SW for Health Care and Life Sciences

Semantic empowerment of Life Science Applications
This talk will demonstrate some of the efforts in:
• Building large (populated) life science ontologies (GlycO, ProPreO)
• Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry)
• Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition
• Ontology-driven applications developed

Semantic Applications
• Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule-based decision support
• Semantic Browser Demo: contextual browsing of PubMed aided by ontology and schema (in future instance) level relationships
• N-glycosylation workflow process: an example of scientific
• Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data
Others we will not discuss: SemBowser, SemDrug, ….

Let us start with a couple of simple applications

Life Science Ontologies
• GlycO
  • An ontology for structure and function of Glycopeptides
  • 573 classes, 113 relationships
  • Published through the National Center for Biomedical Ontology (NCBO)
• ProPreO
  • An ontology for capturing process and lifecycle information related to proteomic experiments
  • 398 classes, 32 relationships
  • 3.1 million instances
  • Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)

N-Glycosylation metabolic pathway
[Figure labels: N-glycan_beta_GlcNAc_9; GNT-I attaches GlcNAc at position 2; N-acetyl-glucosaminyl_transferase_V; N-glycan_alpha_man_4; GNT-V attaches GlcNAc at position 6]
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021

GlycO ontology
• Challenge – model hundreds of thousands of complex carbohydrate entities
• But the differences between the entities are small (e.g., just one component)
• How to model all the concepts but preclude redundancy → ensure maintainability, scalability

GlycoTree
b-D-GlcpNAc-(1-2)-a-D-Manp-(1-6)+
                                 b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-b-D-GlcpNAc
b-D-GlcpNAc-(1-4)-a-D-Manp-(1-3)+
b-D-GlcpNAc-(1-2)+
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

EnzyO
• The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms
• GlycO together with EnzyO contains all the information that is needed for the description of metabolic pathways – e.g. N-Glycan Biosynthesis

Pathway representation in GlycO
Pathways do not need to be explicitly defined in GlycO.
The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.

Zooming in a little …
[Figure labels: Reaction R05987; catalyzed by enzyme 2.4.1.145; adds_glycosyl_residue; N-glycan_b-D-GlcpNAc_13]
The product of this reaction is the Glycan with KEGG ID 00020. The N-Glycan with KEGG ID 00015 is the substrate of reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145.

GlycO population
• Multiple data sources used in populating the ontology
  o KEGG - Kyoto Encyclopedia of Genes and Genomes
  o SWEETDB
  o CarbBank Database
• Each data source has a different schema for storing data
• There is significant overlap of instances in the data sources
• Hence, entity disambiguation and a common representational format are needed

Ontology population workflow
[Figure: flowchart. Nodes: Semagix Freedom knowledge extractor; Instance Data; Has CarbBank ID? (YES / NO); Compare to Knowledge Base; Already in KB? (YES: next Instance / NO); Insert into KB; IUPAC to LINUCS; LINUCS to GLYDE]

Example shown in the workflow (LINUCS):
[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}

Example shown in the workflow (GLYDE):
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>

ProPreO ontology
• Two aspects of glycoproteomics:
  o What is it? → identification
  o How much of it is there?
    → quantification
• Heterogeneity in data generation process, instrumental parameters, formats
• Need data and process provenance → ontology-mediated provenance
• Hence, ProPreO models both the glycoproteomics experimental process and attendant data

ProPreO population: transformation to RDF
[Figure: Scientific Data → Computational Methods → Ontology instances]
Key: Protein Path, Peptide Path
Scientific data: Protein Data (amino-acid sequence)
Computational methods:
• Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence
• Determine N-glycosylation Consensus
• Calculate Chemical Mass
• Calculate Monoisotopic Mass
RDF outputs: Chemical Mass RDF, Monoisotopic Mass RDF, Amino-acid Sequence RDF, “Protein RDF”, “Peptide RDF”
Properties shown on “Protein RDF” and “Peptide RDF”: n-glycosylation consensus, chemical mass, monoisotopic mass, amino-acid sequence, parent protein

Semantic empowerment of Life Science Applications
This talk will demonstrate some of the efforts in:
• building large life science ontologies (GlycO, an ontology for structure and function of Glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications
• entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data
• semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive
• semantic applications developed

Relationship extraction from unstructured data
(other related research: biological entity extraction)

Overview
[Figure labels: UMLS classes: Biologically active substance, Lipid, Disease or Syndrome; UMLS relationships: affects, complicates, causes; instance_of links from MeSH terms Fish Oils and Raynaud’s Disease; unknown relationship “???????” between Fish Oils and Raynaud’s Disease; PubMed: 9284 documents, 5 documents, 4733 documents]

About the data used
• UMLS – A high level schema of the biomedical domain
  – 136 classes and 49 relationships
  – Synonyms of all relationships, using variant lookup (tools from NLM)
• MeSH
  – Terms already asserted as instance of one or more classes in UMLS
• PubMed
  – Abstracts annotated with one or more MeSH terms
Example relationship synonyms for T147: effect, induce, etiology, cause, effecting, induced

Example PubMed abstract (for the domain expert)
[Figure labels: Abstract; Classification/Annotation]

Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous)) (NN stimulation)) (PP (IN by) (NP (NN estrogen)))) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia)) (PP (IN of) (NP (DT the) (NN endometrium)))))))

Method – Identify entities and Relationships in Parse Tree
[Figure labels: Modifiers; Modified entities; Composite Entities]

Method – Fact Extraction from Parse Tree

Semantic annotation of scientific/experimental data

ProPreO: Ontology-mediated provenance
Mass Spectrometry (MS) Data: ms/ms peaklist data
  830.9570   194.9604   2     (parent ion m/z, parent ion abundance, parent ion charge)
  580.2985     0.3592         (fragment ion m/z, fragment ion abundance)
  688.3214     0.2526
  779.4759    38.4939
  784.3607    21.7736
  1543.7476    1.3822
  1544.7595    2.9977
  1562.8113   37.4790
  1660.7776  476.5043

ProPreO: Ontology-mediated provenance
Semantically Annotated MS Data (element and attribute names are ontological concepts):
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>

N-Glycosylation Process (NGP)
[Figure: workflow. Cell Culture → (extract) → Glycoprotein Fraction → (proteolysis) → Glycopeptides Fraction → (Separation technique I, 1 … n) → Glycopeptides Fraction → (PNGase) → Peptide Fraction → (Separation technique II, n*m) → Peptide Fraction → (Mass spectrometry) → ms data, ms/ms data → (Data reduction) → ms peaklist, ms/ms peaklist. Further steps shown: binning; N-dimensional array; Signal integration; Glycopeptide identification and quantification; Data reduction; Peptide identification; Peptide list; Data correlation]

Semantic Web Process to incorporate provenance
[Figure: process steps, each run by an Agent with inputs (I) and outputs (O): Biological Sample Analysis by MS/MS; Raw Data to Standard Format; Data Preprocess; DB Search (Mascot/Sequest); Results Postprocess (ProValt). Data passed between steps: Raw Data; Standard Format Data; Filtered Data; Search Results; Final Output. Other labels: Semantic Annotation Applications; Storage; Biological Information]
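
The provenance-aware process above can be illustrated with a minimal sketch in which every step logs who ran it, what it consumed and what it produced, so that any result can be traced back through the chain. This is an illustration only: the step and data names are taken from the figure labels, but the predicate names (`performed_by`, `has_input`, `has_output`), the agent name and the `ProvenanceLog` class are hypothetical and are not ProPreO terms or part of the original system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Steps and data items follow the figure labels; the linear ordering is an assumption.
PIPELINE = [
    ("raw_data_to_standard_format", "raw_data", "standard_format_data"),
    ("data_preprocess", "standard_format_data", "filtered_data"),
    ("db_search", "filtered_data", "search_results"),
    ("results_postprocess", "search_results", "final_output"),
]


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory store of provenance statements."""
    triples: List[Triple] = field(default_factory=list)

    def record(self, step: str, agent: str, used: str, generated: str) -> None:
        # Three statements per step: who ran it, what went in, what came out.
        self.triples.append((step, "performed_by", agent))
        self.triples.append((step, "has_input", used))
        self.triples.append((step, "has_output", generated))

    def lineage(self, artifact: str) -> List[str]:
        """Return the steps that led to `artifact`, earliest first."""
        steps: List[str] = []
        current = artifact
        while True:
            producers = [s for (s, p, o) in self.triples
                         if p == "has_output" and o == current]
            if not producers:
                break  # reached data that no logged step produced
            step = producers[0]
            steps.append(step)
            current = next(o for (s, p, o) in self.triples
                           if s == step and p == "has_input")
        return list(reversed(steps))


def run_pipeline(agent: str) -> ProvenanceLog:
    """Log provenance for each step; the actual processing is omitted here."""
    log = ProvenanceLog()
    for step, used, generated in PIPELINE:
        log.record(step, agent, used, generated)
    return log


log = run_pipeline("agent_1")
print(log.lineage("final_output"))
```

Keeping provenance as subject/predicate/object statements mirrors the RDF style used elsewhere in the talk, which is why a plain tuple list was chosen over a nested record.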

Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene
Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)

Biomedical Knowledge Repository
[Figure labels: …, Entrez, Biomedical Knowledge Repository]

Implementation
[Figure: elements shown: Entrez Gene; Entrez Gene XML; XSLT; Entrez Gene RDF; Entrez Gene RDF graph; Web interface]

Connecting different genes
Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene?
[Figure labels: APP gene [Homo sapiens]; APP gene [Gallus gallus]; APP gene [Canis familiaris]; protein names: protease nexin-II, A4 amyloid protein, amyloid-beta protein, beta-amyloid peptide, cerebral vascular amyloid peptide, amyloid beta A4 protein, amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease), amyloid protein; property: eg:has_protein_reference_name_E]

Integrated Semantic Information and Knowledge System (ISIS)
Have I made an error? Is the result erroneous?
Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results.
[Figure labels: SPARQL query-based User Interface; ProPreO ontology; Experimental Data; Semantic Annotation; Semantic Metadata Registry; Metadata File]
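
The ISIS question above (find result files from a similar organism, cell, preparation and mass spectrometric conditions) amounts to a pattern match over annotated metadata. The following is a minimal sketch in plain Python rather than SPARQL, assuming metadata is available as subject/predicate/object statements. The property names, file names and values are hypothetical illustrations and are not taken from ProPreO or from the actual ISIS interface.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical metadata for three result files (illustrative values only).
METADATA: List[Triple] = [
    ("result_A", "organism", "Homo sapiens"),
    ("result_A", "cell_type", "cell_line_1"),
    ("result_A", "preparation", "protocol_1"),
    ("result_A", "ms_mode", "ms-ms"),
    ("result_B", "organism", "Homo sapiens"),
    ("result_B", "cell_type", "cell_line_1"),
    ("result_B", "preparation", "protocol_1"),
    ("result_B", "ms_mode", "ms-ms"),
    ("result_C", "organism", "Mus musculus"),
    ("result_C", "cell_type", "cell_line_1"),
    ("result_C", "preparation", "protocol_1"),
    ("result_C", "ms_mode", "ms-ms"),
]

# Properties that define "similar" in this sketch (an assumption).
SIMILARITY_KEYS = ("organism", "cell_type", "preparation", "ms_mode")


def describe(subject: str, triples: Sequence[Triple]) -> Dict[str, str]:
    """Collect all property values recorded for one result file."""
    return {p: o for (s, p, o) in triples if s == subject}


def similar_results(target: str, triples: Sequence[Triple],
                    keys: Sequence[str] = SIMILARITY_KEYS) -> List[str]:
    """Return other result files whose values match the target on every key."""
    wanted = describe(target, triples)
    others = sorted({s for (s, _, _) in triples if s != target})
    return [s for s in others
            if all(describe(s, triples).get(k) == wanted.get(k) for k in keys)]


print(similar_results("result_A", METADATA))
```

Exact equality on every key is the simplest reading of "similar"; an ontology-backed system could instead relax the match, for example by accepting organisms that share a parent class.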

[Figure: proteomics workflow over ProteomeCommons experimental data. Files: Raw; mzXML; Pkl; pSplit; MASCOT result; ProValt result. Steps: Raw2mzXML; mzXML2Pkl; Pkl2pSplit; MASCOT Search; ProValt]

Summary, Observations, Conclusions
• We now have semantics- and services-enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
• http://lsdis.cs.uga.edu
• http://knoesis.org
http://lsdis.cs.uga.edu/projects/asdoc/
http://lsdis.cs.uga.edu/projects/glycomics/