Semantics powered Bioinformatics Amit Sheth, William S. York, et al

Large Scale Distributed Information Systems Lab &
Complex Carbohydrate Research Center
University of Georgia
http://lsdis.cs.uga.edu
Project Information:
Background: SW for Life Sciences
• Bioinformatics of Glycan Expression –
component of the NCRR "Integrated
Technology Resource for Biomedical
Glycomics".
• W3C Interest Group on Semantic Web for
Health care and Life Sciences
• Deployed Active Semantic Electronic
Medical Patient Record application at the
Athens Heart Center
Agenda
• Review of Accomplishments/Ongoing
Work:
o GLYDE standard
o GlycO Ontology
o ProPreO Ontology
o Semantic Analytical Glycomics Workflow
o Visualization
o Semantic Web Services: WSDL-S/METEOR-S
GLYDE standard
• An XML based representation format for
glycan structures
• Inter-convertible with existing data
represented using IUPAC or LINUCS.
• In progress: Incorporation of Probability
based representation
• In progress: Incorporation of aspects for
visualization of structures using GLYDE
(XML) files
GLYDE - An expressive XML standard for the representation of glycan
structure. Carbohydrate Research, 340 (18), Dec 30, 2005.
Collaborative GlycoInformatics
• Enable querying and export of query results in GLYDE
format
• Using GLYDE representation for disambiguation, mapping
and matching
[Figure: a QUERY expressed in GLYDE (<glyde><residue>…</residue></glyde>) is posed
against MonosaccharideDB, SweetDB and KEGG; the RESULT is returned in the same
GLYDE format.]
Collaborative GlycoInformatics
• Development of GLYDE semantic web portal
• Integration with www.glycosciences.de
o Visualization aspect integrated with LiGraph (Heidelberg)
or OntoVista (UGA)
• Semantic Annotation of publications in GlycoProteomics
domain
[Figure: the GLYDE Semantic Portal shown together with MonosaccharideDB,
www.glycosciences.de and KEGG.]
Collaborative GlycoInformatics
Evolving collaboration between:
• LSDIS/CCRC:
Will York, Amit Sheth, Michael Pierce
• EUROCarbDB (German Cancer Research Center):
Willi von der Lieth
• Consortium for Functional Glycomics (CFG):
Rahul Raman, Ram Sasisekharan, Thomas Lütteke
• N.D. Zelinsky Institute of Organic Chemistry (Moscow)
Yuriy Knirel
• Mitsui Knowledge Industry (Japan):
Hisashi Narimatsu, Norihiro Kikuchi
• Kyoto Encyclopedia of Genes and Genomes (KEGG):
Minoru Kanehisa, Kiyoko F. Aoki-Kinoshita
• Palo Alto Research Center (PARC):
David Goldberg
Semantic GlycoInformatics - Ontologies
• GlycO: A domain ontology for glycan structures,
glycan functions and enzymes (embodying knowledge
of the structure and metabolisms of glycans)
o Contains 600+ classes and 100+ properties –
describe structural features of glycans; unique
population strategy
o URL:
http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: a comprehensive process Ontology modeling
experimental proteomics
o Contains 330 classes, 6 million+ instances
o Models three phases of experimental proteomics
o URL:
http://lsdis.cs.uga.edu/projects/glycomics/propreo
GlycO taxonomy
[Figure: The first levels of the GlycO taxonomy]
[Figure: Most relationships and attributes in GlycO]
GlycO exploits the expressiveness of OWL-DL.
Cardinality constraints, value constraints, and
existential and universal restrictions on the range
and domain of properties allow the classification
of unknown entities as well as the deduction of
implicit relationships.
Pathway representation in GlycO
Pathways do not need to be
explicitly defined in GlycO. The
residue-, glycan-, enzyme- and
reaction descriptions contain
all the knowledge necessary to
infer pathways.
Zooming in a little …
Reaction R05987
catalyzed by enzyme 2.4.1.145
adds_glycosyl_residue
N-glycan_b-D-GlcpNAc_13
The product of this
reaction is the
Glycan with KEGG
ID 00020.
The N-Glycan with KEGG
ID 00015 is the substrate to
the reaction R05987, which
is catalyzed by an enzyme
of the class EC 2.4.1.145.
Ontology Population
• The next slides show the different steps
that were necessary to populate GlycO
with glycan structures from multiple
sources.
• GLYDE is used to disambiguate between
representations from multiple sources
Ontology population workflow
• Semagix Freedom knowledge extractor delivers
Instance Data
• Has CarbBank ID?
o YES: Compare to Knowledge Base
o NO: IUPAC to LINUCS, then LINUCS to GLYDE,
then Compare to Knowledge Base
• Already in KB?
o YES: next Instance
o NO: Insert into KB
Workflow step "IUPAC to LINUCS": the instance in LINUCS notation
[][Asn]{[(4+1)][b-D-GlcpNAc]
{[(4+1)][b-D-GlcpNAc]
{[(4+1)][b-D-Manp]
{[(3+1)][a-D-Manp]
{[(2+1)][b-D-GlcpNAc]
{}[(4+1)][b-D-GlcpNAc]
{}}[(6+1)][a-D-Manp]
{[(2+1)][b-D-GlcpNAc]{}}}}}}
Workflow step "LINUCS to GLYDE": the same instance in GLYDE
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
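The "LINUCS to GLYDE" step can be sketched in Python as a small recursive parser. This is an illustration written for these notes, not the converter used in the project. It assumes that every LINUCS node has the form [linkage][residue]{children}, that a linkage such as (4+1) maps to link="4" and anomeric_carbon="1", and that the ring-form letter (p or f) is dropped from the residue name, as the slide's example suggests. The function and pattern names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# LINUCS string from the slide, split across lines the same way.
LINUCS = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]"
    "{[(4+1)][b-D-GlcpNAc]"
    "{[(4+1)][b-D-Manp]"
    "{[(3+1)][a-D-Manp]"
    "{[(2+1)][b-D-GlcpNAc]"
    "{}[(4+1)][b-D-GlcpNAc]"
    "{}}[(6+1)][a-D-Manp]"
    "{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

NODE = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]\{")       # [linkage][residue]{
LINKAGE = re.compile(r"\((\d+)\+(\d+)\)")                # (4+1)
RESIDUE = re.compile(r"([ab])-([DL])-([A-Z][a-z]{2})[pf]?(.*)")  # b-D-GlcpNAc


def parse_siblings(text, pos):
    """Parse consecutive sibling nodes starting at pos; return (nodes, pos)."""
    nodes = []
    while pos < len(text) and text[pos] == "[":
        match = NODE.match(text, pos)
        if match is None:
            raise ValueError(f"malformed node at offset {pos}")
        children, pos = parse_siblings(text, match.end())
        if pos >= len(text) or text[pos] != "}":
            raise ValueError(f"missing '}}' at offset {pos}")
        nodes.append((match.group(1), match.group(2), children))
        pos += 1  # consume the closing brace of this node
    return nodes, pos


def add_residue(parent, node):
    """Append one <residue> element (and its subtree) to parent."""
    linkage, name, children = node
    link, anomeric_carbon = LINKAGE.fullmatch(linkage).groups()
    anomer, chirality, stem, rest = RESIDUE.fullmatch(name).groups()
    element = ET.SubElement(parent, "residue", {
        "link": link,
        "anomeric_carbon": anomeric_carbon,
        "anomer": anomer,
        "chirality": chirality,
        "monosaccharide": stem + rest,  # ring-form letter dropped (assumption)
    })
    for child in children:
        add_residue(element, child)


def linucs_to_glyde(linucs):
    """Convert a LINUCS string with a single root (the aglycon) to a GLYDE-like tree."""
    nodes, pos = parse_siblings(linucs, 0)
    if pos != len(linucs) or len(nodes) != 1:
        raise ValueError("expected exactly one root node")
    _, aglycon, children = nodes[0]
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", {"name": aglycon})
    for child in children:
        add_residue(glycan, child)
    return glycan


print(ET.tostring(linucs_to_glyde(LINUCS), encoding="unicode"))
```

Because both notations describe the same tree, the conversion amounts to a change of syntax. That is what makes a canonical GLYDE form usable for comparing instances that arrive from different sources.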
ProPreO
• ProPreO: A process ontology to capture
proteomics experimental lifecycle:
o Separation
o Mass spectrometry
o Analysis
o 330 classes
o 110 properties
o 6 million+ instances
Usage: Mass spectrometry analysis
Manual annotation of mouse kidney spectrum by a human expert.
For clarity, only 19 of the major peaks have been annotated.
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Semantic Annotation of Experimental Data
• Enables Ontology-mediated Disambiguation
• Allows correlation between disparate entities using Semantic Relations
P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Semantic GlycoProteomics Workflow
[Figure: workflow diagram]
Sample preparation: Cell Culture → (extract) → Glycoprotein Fraction →
(proteolysis) → Glycopeptides Fraction (1) → Separation technique I →
Glycopeptides Fraction (n) → (PNGase) → Peptide Fraction (n) →
Separation technique II → Peptide Fraction (n*m) → Mass spectrometry
Data produced: ms data, ms/ms data
Data reduction: ms peaklist, ms/ms peaklist
Further processing steps shown: binning; N-dimensional array; Signal integration;
Glycopeptide identification and quantification; Peptide identification
(Peptide list); Data correlation
Web Services based Workflow = Web Process
[Figure: a WORKFLOW composed of Web Service 1 to Web Service 4 (WS1 to WS4),
hosted on different platforms: Windows XP, LINUX, MAC, Solaris.]
BOWSER
• Use semantics for describing Web Services
• WSDL-S (LSDIS/IBM)
• Use service-level annotation of Web Services
• Graphical traversal of taxonomy of biological
concepts to search for Web Services
• http://128.192.9.11:8080/stargate/bowser.jsp
Semantic Annotation of Scientific Data
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
ms/ms peaklist data
<ms/ms_peak_list>
<parameter
instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
mode="ms/ms"/>
<parent_ion_mass>830.9570</parent_ion_mass>
<total_abundance>194.9604</total_abundance>
<z>2</z>
<mass_spec_peak m/z="580.2985" abundance="0.3592"/>
<mass_spec_peak m/z="688.3214" abundance="0.2526"/>
<mass_spec_peak m/z="779.4759" abundance="38.4939"/>
<mass_spec_peak m/z="784.3607" abundance="21.7736"/>
<mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
<mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
<mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
<mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
Annotated ms/ms peaklist data
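The step from the raw peak list to its annotated form can be sketched in Python as follows. It assumes, going by the two listings, that the first raw line holds the parent ion mass, the total abundance and the charge z, and that each later line holds one m/z and abundance pair. Because "/" is not permitted in XML element or attribute names, the sketch uses the stand-in names ms_ms_peak_list and mz. Those names and the function peaklist_to_xml are assumptions made for this example, not part of the original format.

```python
import xml.etree.ElementTree as ET

# Raw ms/ms peak list as shown on the slide.
RAW = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def peaklist_to_xml(raw, instrument, mode="ms/ms"):
    """Wrap a whitespace-separated peak list in descriptive XML elements."""
    rows = [line.split() for line in raw.splitlines() if line.strip()]
    parent_ion_mass, total_abundance, charge = rows[0]

    root = ET.Element("ms_ms_peak_list")  # stand-in for "ms/ms_peak_list"
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion_mass").text = parent_ion_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    for mz, abundance in rows[1:]:
        # Values stay as strings so the original precision is preserved.
        ET.SubElement(root, "mass_spec_peak", {"mz": mz, "abundance": abundance})
    return root


print(ET.tostring(peaklist_to_xml(RAW, INSTRUMENT), encoding="unicode"))
```

Attribute values such as the instrument name would, in the annotated data, point at concepts in ProPreO; here they are plain strings.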
Discovery of relationship between biological entities
[Figure: concepts from the ProPreO process ontology, GlycO, the Gene Ontology (GO)
and a genomic database (Mascot/Sequest) linked by semantic relations. Entities
shown: identified and quantified peptides; fragment of specific protein; Lectin;
collection of N-glycan ligands; collection of Biosynthetic enzymes; specific
cellular process.]
The inference: instances of the class
collection of Biosynthetic enzymes
(GNT-V) are involved in the specific
cellular process (metastasis).
Semantic Web Services using WSDL-S
• Formalize description and classification of Web Services
using ProPreO concepts
WSDL (ModifyDB): description of a Web Service using the Web Service
Description Language. The input is described only as data (a sequence).
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp"
……
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<wsdl:types>
<schema targetNamespace="urn:ngp"
xmlns="http://www.w3.org/2001/XMLSchema">
…..
</complexType>
</schema>
</wsdl:types>
<wsdl:message name="replaceCharacterRequest">
<wsdl:part name="in0" type="soapenc:string"/>
<wsdl:part name="in1" type="soapenc:string"/>
<wsdl:part name="in2" type="soapenc:string"/>
</wsdl:message>
<wsdl:message name="replaceCharacterResponse">
<wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
</wsdl:message>
WSDL-S (ModifyDB): the same description using concepts defined in the
ProPreO process Ontology. The input refers to peptide_sequence.
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp"
…..
xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl" >
<wsdl:types>
<schema targetNamespace="urn:ngp"
xmlns="http://www.w3.org/2001/XMLSchema">
……
</complexType>
</schema>
</wsdl:types>
<wsdl:message name="replaceCharacterRequest"
wssem:modelReference="ProPreO#peptide_sequence">
<wsdl:part name="in0" type="soapenc:string"/>
<wsdl:part name="in1" type="soapenc:string"/>
<wsdl:part name="in2" type="soapenc:string"/>
</wsdl:message>
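How a tool might read such an annotation can be sketched in Python. The embedded document is a cut-down, self-contained variant of the WSDL-S fragment: the types section elided on the slide is left out, the response message is copied from the plain WSDL version, and the wsdl prefix is bound to the standard WSDL 1.1 namespace, which the slide does not show and is therefore an assumption. The function model_references and the ONTOLOGIES table are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace (assumed)
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

# Prefix used inside modelReference values, mapped to the ontology URI
# declared on the slide.
ONTOLOGIES = {"ProPreO": "http://lsdis.cs.uga.edu/ontologies/ProPreO.owl"}

WSDL_S = """
<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
    xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>
"""


def model_references(wsdl_text):
    """Map each annotated message name to the ontology concept it refers to."""
    root = ET.fromstring(wsdl_text.strip())
    references = {}
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        reference = message.get(f"{{{WSSEM_NS}}}modelReference")
        if reference is None:
            continue  # message carries no semantic annotation
        prefix, _, concept = reference.partition("#")
        base = ONTOLOGIES.get(prefix, prefix)
        references[message.get("name")] = f"{base}#{concept}"
    return references


print(model_references(WSDL_S))
```

Only the request message is reported, since it is the only one carrying a wssem:modelReference attribute. An unannotated WSDL file would yield an empty mapping, which is the practical difference between the two descriptions.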
Semantic Visualization
• Ontologies are meant for machine
consumption
• Often too convoluted for the human eye
• The scientist needs to know the concepts
she uses for annotation
• Build a visualization environment that
translates the formal concepts into a
representation the domain expert
understands well
Single Glycan
Customizable Layouts
• Using customizable layouts, knowledge
can be formalized in a machine
understandable way and then visually
translated for the user’s needs.
– Cartoon representation for the Glycobiologist
– Chemical reactions as left side → right side,
instead of convoluted representation in the
ontology.
Ongoing and Future Work
• SemURI: Semantic URI based provenance
scheme using ProPreO
• RDF-based version of the GLYDE schema
• A framework for semantic annotation of
experimental data
• Integration of large datasets (~500MB)
into ProPreO for reasoning
Further details at:
• http://lsdis.cs.uga.edu/projects/glycomics/