Genome Data and Tool Interoperation

Genome Data and Tool Interoperation
over the “Semantic” Web
By
Kei-Hoi Cheung, Ph.D.
Assistant Professor
Yale Center for Medical Informatics
MB&B 452b/752b, April 20, 2005, Yale University
Outline
• Introduction
• Semantic Web
– Resource Description Framework (RDF)
– Life Sciences Identifiers (LSID)
– YeastHub: yeast genome data interoperation
– Web Services for tool interoperation
• Collaborative projects
– Biosphere
– Taverna
• Semantic Web Services
• Conclusion
• Future directions
Eras of Computing
• Mainframe computing (many people share one computer)
• Personal computing (one person uses one computer)
• Ubiquitous computing (one person is served by many
computers over the network)
– Client/server computing, grid computing, peer-to-peer computing,
distributed/parallel computing, component-based computing, etc
– World Wide Web (WWW) is one of the main driving
forces
– It provides a globally distributed communication
framework that is essential for almost all scientific
collaboration, including bioinformatics
The World Wide Web
• On the order of 10^8 users
– Used in every country on Earth
• On the order of 10^10 indexed web resources (text) in
Google etc.
– Essentially infinite if one includes “dynamic” web pages
• Massively distributed and open
It is difficult to keep track of these resources
Data Heterogeneity
• Data are exposed in different ways
– Programmatic interfaces
– Web forms or pages
– FTP directory structures
• Data are presented in different ways
– Structured text
• Tab delimited format, XML format, etc
– Free text
– Binary
• Images
• Naming conflicts (e.g., synonyms and homonyms)
Tool heterogeneity
• Server applications
– Web server applications
– Application programming interfaces (API)
• Client applications (downloadable software)
• Different programming languages
• Different operating systems
From Web to Semantic Web
• Human processing → machine processing
• Free text description → ontological
description
• HTML → XML → RDF or its extensions
• Metadata!
HTML Example
Rendered page: a “Readme” link followed by a table of pedigree data
1   1   0   0   1   1
1   2   0   0   2   0
1   3   1   2   2   0
1   4   1   2   1   0
1   5   1   2   1   1
1   6   1   2   1   0
Readme contents:
Col#   Description
1      Pedigree id
2      Person id
3      Father id
4      Mother id
5      Sex
6      Status
<html>
<body>
…
<a href="http://ycmi.med.yale.edu/ped_readme.html">
Readme</a>
<table>
<tr>
<td>1</td> <td>1</td> <td>0</td> <td>0</td> …
</tr>
…
</table>
…
</body>
</html>
XML Example
Other Advantages of Using XML
• It is simple, hierarchical, self-describing,
and computer-readable
• It can be validated using a DTD or XML Schema
• It is a W3C standard
• It has a large base of software support (both
commercial and public domain software
tools)
– Editing tools, DOM, SAX, XSL, etc
Proliferation of Bio-XML Formats
• Sequence: BSML, AGAVE
• Microarray gene expression: GEML, MAML, MAGE-ML
• Pathway: BIND, SBML, PSI-MI
• RDF (e.g., BioPAX): semantically rich ontologies, reasoning
(machine intelligence)
Definition of an Ontology
• Conceptualization of a domain of interest
– Concepts, relations, attributes, constraints, objects,
values
• An ontology is a specification of a
conceptualization
– Formal notation
– Documentation
• A variety of forms, but includes:
– A vocabulary of terms
– Some specification of the meaning of the terms
• Ontologies are defined for reuse
Roles of Ontologies in Bioinformatics
• Success of many biological DBs depends on
– High fidelity ontologies
– Clearly communicating their ontologies
• Prevent errors on data entry and interpretation
• Common framework for multidatabase queries
• Controlled vocabularies for genome annotation
– GO
– EC numbers
• Information-extraction applications
• Reuse is a core aspect of ontologies
– Reuse of existing ontologies is faster than designing new ones
– Reuse decreases semantic heterogeneity of DBs
• Schema-driven Software
– Knowledge-acquisition tools
– Query tools
Example Bio-ontologies
• Gene Ontologies
– http://www.geneontology.org/
• MGED Ontologies
– http://mged.sourceforge.net/
• Open Biomedical Ontologies (OBO)
– http://obo.sourceforge.net/
Are current bio-ontologies adequate?
Ontology desiderata
• Precision
– Formal, unambiguous
– High fidelity
• Explicitness
– Clarity
– Commitment
– Reuse
• Systematic
– Quality
– Clarity
• Flexibility
– Expressivity
– Evolution
machine computable
Semantic Web
• It provides a common framework that allows
semantic interoperability among multiple
resources through the use of ontologies
• It is a collaborative effort led by W3C with
participation from a large number of researchers
and industrial partners
• It is based on the Resource Description
Framework (RDF)
Resource Description Framework
(RDF)
• It is a standard data model (directed labeled graph)
for representing information (metadata) about
resources in the World Wide Web
• In general, it can be used to represent information
about “things” that can be identified (using URIs)
on the Web
• It is intended to provide a simple way to make
statements (descriptions) about Web resources
RDF Statement
An RDF statement consists of:
• Subject: resource identified by a URI
• Predicate: property (as defined in a namespace identified by a
URI)
• Object: property value or a resource
For example, the “dbSNP Website” is a subject, “creator” is a
predicate, and “NCBI” is an object.
A resource can be described by multiple statements.
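As a minimal sketch (not from the original slides), a statement can be held as a (subject, predicate, object) tuple, and a resource's description is then every statement that shares its subject. The dbSNP URL and the use of Dublin Core property URIs here are assumptions chosen to mirror the example above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF statement: subject and predicate are URIs; obj is a URI or a literal."""
    subject: str
    predicate: str
    obj: str


# Assumed URIs, for illustration only.
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP/"
DC = "http://purl.org/dc/elements/1.1/"

statements = [
    Statement(DBSNP, DC + "creator", "NCBI"),
    Statement(DBSNP, DC + "language", "en"),
]


def describe(resource, graph):
    """Collect every property/value pair stated about one resource."""
    return {s.predicate: s.obj for s in graph if s.subject == resource}


properties = describe(DBSNP, statements)
print(properties)
```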
Graphical Representation
RDF/XML Representation
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:ex="http://www.example.org/terms/">
<!-- Subject URI assumed; the slide does not show the enclosing description -->
<rdf:Description rdf:about="http://www.example.org/index.html">
<dc:creator rdf:resource="http://www.example.org/staffid/85740"></dc:creator>
<dc:language>en</dc:language>
<ex:creation-date>August 16, 1999</ex:creation-date>
</rdf:Description>
</rdf:RDF>
Data Integration Using RDF
• GenBank: human hemoglobin, derives from, atagccgtacctgcgagtctagaagct
• Gene Ontology: human hemoglobin, is a, oxygen transport protein
• Protein Data Bank: human hemoglobin, has 3D structure, (structure image)
• Unified view: the three graphs share the node “human hemoglobin”,
so they merge into one graph carrying all three properties
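The merge pictured on this slide can be sketched as a set union, assuming each source exposes its statements as (subject, predicate, object) tuples and all sources use the same identifier for human hemoglobin. The `ex:` names below are hypothetical stand-ins for real URIs.

```python
# Hypothetical shared identifier; in practice this would be a URI or LSID.
HEMOGLOBIN = "ex:human_hemoglobin"

genbank = {(HEMOGLOBIN, "ex:derives_from", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {(HEMOGLOBIN, "ex:is_a", "oxygen transport protein")}
protein_data_bank = {(HEMOGLOBIN, "ex:has_3d_structure", "ex:structure_placeholder")}

# Because all three graphs name the subject identically, a set union merges them.
unified = genbank | gene_ontology | protein_data_bank


def objects(graph, subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(unified, HEMOGLOBIN, "ex:is_a"))
```

The union only works because the subject is named consistently across sources, which is the problem the naming standards discussed later in the talk address.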
Reification
• Making statements about statements
• For example, GenBank provides the following
statement: “human hemoglobin derives from
atagccgtacctgcgagtctagaagct”
Example
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:s="http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=29436">
<rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/Genbank">
<s:derive_from rdf:ID="statement1"> atag… </s:derive_from>
</rdf:Description>
<rdf:Description rdf:about="#statement1">
<s:providedBy>GenBank</s:providedBy>
</rdf:Description>
</rdf:RDF>
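In triple form, reification turns the statement into a resource of type rdf:Statement with rdf:subject, rdf:predicate, and rdf:object properties, so further statements can be made about it. The sketch below uses the standard RDF reification vocabulary; the `ex:` identifiers are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_id, subject, predicate, obj):
    """Expand one statement into the four triples of the RDF reification vocabulary."""
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


# Hypothetical identifiers standing in for real URIs.
triples = reify("#statement1", "ex:human_hemoglobin", "ex:derive_from",
                "atagccgtacctgcgagtctagaagct")

# The statement is now itself a resource, so provenance can be attached to it.
triples.append(("#statement1", "ex:providedBy", "GenBank"))
```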
Other RDF-Based Ontology Languages
• RDFS
• DAML+OIL
• OWL
Life Science Identifiers (“LSID”)
Addresses Data Access Problems
• LSID is a naming standard for distributed data,
specifically:
– Scientifically significant data
– Geographically distributed
– Files, database records, and data objects managed by N-tier
applications
– Public and/or private networks
– Owned and managed by different organizations
LSID Syntax
• 5 Part Format:
URN:LSID:Authority:Namespace:Object[:Revision-ID]
– URN:LSID: is a mandatory prefix
– Authority is the Internet domain of the organization that
assigns an LSID to a resource
– Namespace constrains the scope of the object
– Object is an alphanumeric string identifying the object
– Revision-ID is an optional version of the object
• Examples
– URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1
– URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434
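The syntax above can be checked with a small parser. This is a sketch that simply splits on colons and treats the prefix as case-insensitive; it does not try to cover every rule of the LSID specification, and the `LSID` and `parse_lsid` names are illustrative.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts; the trailing revision is optional."""
    parts = text.split(":")
    has_prefix = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if len(parts) not in (5, 6) or not has_prefix:
        raise ValueError(f"not an LSID: {text!r}")
    if any(not p for p in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


genbank = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1")
pubmed = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434")
```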
LSID: a single naming scheme
• One standard naming scheme
– Named data is unique
– Data integrity is maintained
• Breaking down of “data silos”
– Names no longer only useful in a specific proprietary context
– Integrate any data source using standard naming scheme
– Single LSID protocol replaces proprietary source specific programs
• Access to more data
– Integrate data across discovery and development cycles
• Metadata features
– Standard access to specific data allows them to easily be related
semantically. These semantic links can lead to new insights
LSID-Enabled Applications
• LaunchPad
• BioHaystack
LaunchPad
• It takes an LSID;
• resolves it;
• attempts to match the local applications one uses to
process/view this data.
YeastHub (a semantic web approach to
yeast data integration)
(Collaboration between YCMI and Gerstein
Lab: Kevin Yip, Andrew Smith, Andy Masiar,
Remko deKnikker)
(Accepted for publication and presentation in
ISMB 2005)
Yeast Genome Data
• The budding yeast Saccharomyces cerevisiae was the first
eukaryote to have its genome fully sequenced.
• It is easy to manipulate genetically, and many of its genes are
strikingly similar to human genes.
• It has been studied extensively through a wide range of
biological experiments (e.g., microarray experiments).
• A large variety of yeast genome data (e.g., gene
expression data) have been made available through many
resources (e.g., SGD, MIPS, YPD, TRIPLES, Yeast World,
etc)
• Integration of such a variety of yeast data can facilitate
whole genome analysis
Data Conversion and Integration
[Figure: Resource1, Resource2, …, Resourcen → conversion (XML via
DOM/SAX or XSLT, or a DB-specific tool) → RDF1, RDF2, …, RDFn →
RDF/DB (Sesame) → RDQL → Users/Agents]
Two Levels of RDF Description
• Resource description
• Data description
Resource Description
(Use of Dublin Core Metadata)
Metadata Example
RDF Modeling of Tabular Data
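One common way to model a table in RDF, sketched here under assumptions rather than taken from the slide, is to treat each row as a subject, each column as a predicate, and each cell as an object. The sample rows follow the six-column pedigree layout from the HTML example; the `http://example.org/pedigree/` namespace and the column names are hypothetical.

```python
import csv
import io

# Tab-delimited sample rows in the six-column pedigree layout.
DATA = "1\t1\t0\t0\t1\t1\n1\t2\t0\t0\t2\t0\n"
COLUMNS = ["pedigree_id", "person_id", "father_id", "mother_id", "sex", "status"]
BASE = "http://example.org/pedigree/"  # hypothetical namespace


def rows_to_triples(text):
    """Yield one (subject, predicate, object) triple per table cell."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        record = dict(zip(COLUMNS, row))
        # The pedigree and person ids together identify a row, so they form the subject URI.
        subject = f"{BASE}{record['pedigree_id']}/person/{record['person_id']}"
        for column, value in record.items():
            yield (subject, BASE + column, value)


triples = list(rows_to_triples(DATA))
for triple in triples[:3]:
    print(triple)
```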
Data Conversion
RDF Example
Query Form
RQL Syntax and Query Results
Semantic Web Technologies
Employed in YeastHub
• RDF Site Summary (RSS)
• D2RQ (mapping from relational databases
to RDF)
• Semantic Web Database (Sesame)
• RDF Query Languages (e.g., RQL and
SeRQL)
Tool Interoperation
An Example Scenario
• Comparative genomics
Manual Interoperation
A Better Way of Interoperation
A Better Way of Interoperation
(cont’d)
Web Services
“Creating a Bioinformatics Nation”
(Lincoln Stein)
Web Services
UDDI
WSDL
SOAP
SOAP
• It stands for Simple Object Access Protocol
• It is an XML syntax for exchanging
messages between applications
• It is most commonly transported over HTTP
• It codifies existing practice of using XML
and HTTP together
• It is language and platform independent
RPC implementation
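To make the message format concrete, the sketch below builds a SOAP 1.1 envelope for an RPC-style call using only the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `getSequence` operation, and the `accession` parameter are hypothetical and do not describe any real service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/sequence-service"      # hypothetical service namespace


def build_request(operation, **params):
    """Return a SOAP 1.1 envelope (as a string) that calls `operation` with `params`."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element lives in the service's own namespace.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


message = build_request("getSequence", accession="AF271072")
print(message)
```

In practice this string would be sent as the body of an HTTP POST to the service endpoint, and the response would come back as another envelope.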
WSDL
• It describes the syntax of Web Service
interfaces and their locations
• Programmers can create WSDL files to
describe their Web Services and make them
available over the Internet
WSDL Contents
WSDL Example (XEMBL)
Bioinformatics Web Services Projects
• DAS (http://biodas.org/)
• DDBJ’s Biological Web Services
(http://www.xml.nig.ac.jp/)
• BioMoby (http://biomoby.org)
– Moby-S
– Semantic Moby
• myGrid (http://www.ebi.ac.uk/mygrid/)
– SoapLab (http://industry.ebi.ac.uk/soaplab/)
– Talisman (http://www.ebi.ac.uk/talisman/index.html)
– Taverna (http://taverna.sourceforge.net/)
Bioinformatics Web Service
Collaboration
• Biosphere (YCMI and University of Hong
Kong)
• Web Service Workflow (University of Newcastle
upon Tyne, University of Hong
Kong, and YCMI)
Biosphere
Taverna
Semantic Web Services
Semantic Web Service
• Description using OWL-S
– Profile
– Process
– Grounding (e.g., WSDL)
What to describe?
• Resource provides Service
• Service presents Service profile (what it does): description,
functionalities, functional attributes
• Service is described by Service model (how it works)
• Service supports Service grounding (how to access it)
Semantic Web Services
• Discovery
• Invocation
• Composition
• Monitoring
Conclusion
• The World Wide Web affords unprecedented
access to “globally distributed information”
• Metadata, or structured data about data, helps
automate discovery of and access to such
information
• RDF
– is the W3C Recommendation defining an infrastructure
that enables the encoding, exchange, and reuse of
structured metadata
– allows ontologies defined by different communities
to be shared
– facilitates data and tool interoperability
Future Directions: The Semantic Wave
“Once the web has been sufficiently "populated"
with rich metadata, what can we expect? First,
searching on the web will become easier as search
engines have more information available, and thus
searching can be more focused. Doors will also be
opened for automated software agents to roam the
web, looking for information for us or transacting
business on our behalf. The web of today, the vast
unstructured mass of information, may in the future
be transformed into something more manageable and thus something far more useful.”
(Ora Lassila)
Doing this humanely!
“No amount of automation will replace human
beings, but clumsy and belligerent automation
will alienate them and suppress their creativity.”
(Tony Kazic)
Thanks!