Demonstration of Web Resources

Intersection of Semantic Web
and Life Sciences
Kei Cheung
Yale Center for Medical Informatics
Genomics and Bioinformatics (MBB 452a), November 2, 2005
Outline
• Introduction
• Overview of RDF and LSID
• Semantic web applications
– Connotea
– Piggy Bank
– YeastHub
Two scientific/technological endeavors that have
impacted the world greatly in the past 15 years
• Human Genome Project (HGP)
– International collaboration that began in 1990 and was completed in
2003
– Understand the blueprint of life (moon-landing of the nineties)
– Sequence the entire human genome
• World Wide Web (WWW)
– It was born in 1989/1990 at CERN (developed by Tim Berners-Lee)
– Revolutionize information access and sharing over the Internet
(Gutenberg’s printing press)
– Web browsers (e.g., IE, Netscape, Firefox)
Relationship between HGP and WWW
• HGP transformed life sciences into an information
science, as large amounts of data have been generated,
which need to be stored and analyzed
– GenBank, EMBL, and DDBJ have recently reached a milestone
of 100 billion bases from > 165,000 organisms
– PubMed has > 300,000 articles from > 150 life sciences journals
• WWW has become the most popular medium for life
scientists to distribute, access, share, and integrate
different types of biological data over the Internet
– As of 2005, there are 719 publicly available databases listed in
NAR molecular biology database compilation
Spider-Man: Spidey science gets a
genetic makeover
http://www.genomenewsnetwork.org/articles/05_02/spiderman.php
Spider-Man (Tim Berners-Lee) –
Weaving the Web
Semantic Web
• "The Semantic Web is an extension of the current web in
which information is given well-defined meaning, better
enabling computers and people to work in cooperation."
(Tim Berners-Lee, James Hendler, Ora Lassila, "The Semantic Web",
Scientific American, May 2001)
• It provides a common framework that allows data to be
shared and reused across application, enterprise, and
community boundaries
• It is based on the Resource Description Framework (RDF),
which integrates a variety of applications using XML for
syntax and URIs for naming.
Semantic Web for Life Sciences
(TBL, Bio-IT World Conference, May 2005)
• “… also the people involved in the
Semantic Web pushing it along are also
excited about getting involved in the life
sciences – it’s one of those areas that
affect humankind, finding drugs, curing
AIDS and cancer, etc. There seems to be
a huge energy, and lots of practical
technical reasons why this area is crying
out to be one of the flagship areas that the
Semantic Web really takes off …”
Data → Information → Knowledge
Navarro JD, Niranjan V, Peri S, Jonnalagadda CK, Pandey A. (2003)
From biological databases to platforms for biomedical discovery.
Trends Biotechnol. (6):263-8.
Problem with the Current WWW
Problem with the Current Web
Kei Tsi Daniel Cheng
(this is not me!!)
Kei Cheung
(15 years ago)
Kei Cheung
(9 months ago)
Keyword Search: “regulatory
variation” & “mammals”
Data Heterogeneity
• Lack of standard detailed description of
resources
• Data are exposed in different ways
– Programmatic interfaces
– Web forms or pages
– FTP directory structures
• Data are presented in different ways
– Structured text (e.g., tab delimited format and XML
format)
– Free text
– Binary (e.g., images)
Data Heterogeneity (cont’d)
• Nomenclature problem
– Gene/protein names (based on phenotype, sequence, function,
organisms, etc)
• Armadillo (fruit flies) vs. β-catenin (mice)
• PSM1 (human) = PSM2 (yeast); PSM1 (yeast) = PSM2 (human)
• Sonic Hedgehog
– ID proliferation
• Different ID schemes: 1OF1 (PDB ID) and P06478 (SwissProt ID)
correspond to Herpes Thymidine Kinase
• Lexical variation: GO1234, GO:1234, GO-1234
– Synonyms vs. homonyms
• Dopamine receptor D2: DRD2, DRD-2, D2
• PSA: prostate specific antigen, puromycin-sensitive aminopeptidase,
psoriatic arthritis, pig serum albumin
• “Biologists would rather share their toothbrush than a gene name …
Gene nomenclature is beyond redemption”, said Michael Ashburner
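The lexical-variation case above (GO1234, GO:1234, GO-1234) can be illustrated with a small normalizer. This is a minimal sketch and not part of the original slides: the function name `normalize_go_id` and the choice of `GO:1234` as the canonical spelling are assumptions made for illustration.

```python
import re

# "GO", an optional ":" or "-" separator, then digits (assumed pattern).
_GO_PATTERN = re.compile(r"^GO[:\-]?(\d+)$", re.IGNORECASE)


def normalize_go_id(raw):
    """Return 'GO:<digits>' for a recognised variant, or None otherwise."""
    match = _GO_PATTERN.match(raw.strip())
    if match is None:
        return None
    return "GO:" + match.group(1)


variants = ["GO1234", "GO:1234", "GO-1234"]
normalized = {normalize_go_id(v) for v in variants}
print(normalized)
```

All three spellings collapse to a single identifier, which is what a data integrator needs before it can join records. Synonyms and homonyms such as PSA cannot be resolved this way, since they require context rather than pattern matching.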
From Web to Semantic Web
(cont’d)
• Human processing → Machine
processing
• Use of Metadata
– Free text description → ontological
description
– HTML → XML → RDF or its extensions
• Vision → implementation
HTML Example
Readme
1  1  0  0  1  1
1  2  0  0  2  0
1  3  1  2  2  0
1  4  1  2  1  0
1  5  1  2  1  1
1  6  1  2  1  0

Col#  Description
1     pedigree id
2     Person id
3     Father id
4     Mother id
5     Sex
6     Status
<html>
<body>
…
<a href="http://ycmi.med.yale.edu/ped_readme.html">
Readme</a>
<table>
<tr>
<td>1</td> <td>1</td> <td>0</td> <td>0</td> …
</tr>
…
</table>
…
</body>
</html>
What is XML?
• eXtensible Markup Language
• It is self-describing
• It is hierarchical
• It is human- and computer-readable
• It is a World Wide Web Consortium (W3C) standard
• It can be validated using a DTD or XML Schema
• There is a large base of supporting software
XML Example
Proliferation of Bio-XML
Formats
Sequence
BSML AGAVE
Microarray Gene Expression
GEML
MAML
BIND
MAGE-ML
RDF (e.g., BioPax)
Semantically rich ontologies
Reasoning (machine intelligence)
Pathway
SBML
PSI-MI
XML Representation of
Proteomics Data
AGML
HUP-ML
RDF Representation
Resource Description
Framework (RDF)
• It is a standard data model (directed labeled
graph) for representing information
(metadata) about resources in the World
Wide Web
• In general, it can be used to represent
information about “things” that can be
identified (using URIs) on the Web
• It is intended to provide a simple way to make
statements (descriptions) about Web
resources
RDF Statement
An RDF statement consists of:
• Subject: resource identified by a URI
• Predicate: property (as defined in a name space identified by a
URI)
• Object: property value (literal) or a resource
For example, the dbSNP Website is a subject, creator is a
predicate, NCBI is an object.
A resource can be described by multiple statements.
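A subject/predicate/object statement maps naturally onto a three-field tuple. The sketch below is illustrative only: the `Triple` type and the `describe` helper are hypothetical names, and the URIs are the dbSNP and Dublin Core ones used in this talk.

```python
from collections import namedtuple

# One RDF statement: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

DC = "http://purl.org/dc/elements/1.1/"
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP"

# Two statements about the same resource (the dbSNP Web site).
statements = [
    Triple(DBSNP, DC + "creator", "http://www.ncbi.nlm.nih.gov"),
    Triple(DBSNP, DC + "language", "en"),
]


def describe(resource, triples):
    """Collect every (predicate, object) pair stated about one resource."""
    return [(t.predicate, t.obj) for t in triples if t.subject == resource]


description = describe(DBSNP, statements)
for predicate, obj in description:
    print(predicate, "->", obj)
```

The first object is itself a resource (a URI) while the second is a literal, which matches the two kinds of object named in the definition.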
Graphical & XML Representation
http://www.ncbi.nlm.nih.gov/SNP
http://purl.org/dc/elements/1.1/creator
http://www.ncbi.nlm.nih.gov
http://purl.org/dc/elements/1.1/language
en
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:ex="http://www.example.org/terms">
<rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
<dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"></dc:creator>
<dc:language>en</dc:language>
</rdf:Description>
</rdf:RDF>
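To show how the XML form maps back to statements, here is a minimal sketch that reads a well-formed copy of the dbSNP description with Python's standard `xml.etree.ElementTree`. It assumes straight quotes and an `rdf:about` attribute, and it handles only this simple `rdf:Description` pattern, not RDF/XML in general. The function name `extract_triples` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"></dc:creator>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from simple RDF/XML."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree spells a qualified tag as "{namespace}local".
            namespace, local = prop.tag[1:].split("}")
            predicate = namespace + local
            resource = prop.get("{%s}resource" % RDF_NS)
            # The object is a resource URI if given, else the literal text.
            value = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, predicate, value))
    return triples


triples = extract_triples(RDF_XML)
for triple in triples:
    print(triple)
```

Each child element of `rdf:Description` becomes one arc of the graph: the element name expands to the predicate URI, and the object is either the `rdf:resource` value or the element text.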
Life Sciences Identifiers (LSIDs)
• URL vs. URI vs. URN
– URL http://www.gleaners.org/faq.html
– URI http://www.gleaners.org/faq.html#Q04
– URN www.gleaners.org/faq.html#Q04
• LSID is a form of URN
Problems of URLs
• The web server referenced by the URL may be
broken or become unavailable
• The syntax of the URL may change over time as
the underlying data retrieval program evolves
• The data returned by a URL may change over
time as the underlying database contents
change.
LSID Format and Examples
• URN:LSID:namespace:database:object_id:[revision_id]
• Examples:
– URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072
– URN:LSID:chemacx.cambridgesoft.com:ACX:CAS967582:1
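The colon-separated layout makes an LSID easy to take apart. The following is a sketch only: the field names follow the labels on this slide, the function name `parse_lsid` is hypothetical, and the first example's accession is read as AF271072.

```python
from collections import namedtuple

# Field names follow the slide's labels; the revision is optional.
LSID = namedtuple("LSID", ["namespace", "database", "object_id", "revision_id"])


def parse_lsid(text):
    """Split 'URN:LSID:namespace:database:object_id[:revision_id]' into fields."""
    parts = text.strip().split(":")
    if (len(parts) not in (5, 6)
            or parts[0].upper() != "URN"
            or parts[1].upper() != "LSID"):
        raise ValueError("not an LSID: %r" % text)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


genbank = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072")
chemacx = parse_lsid("URN:LSID:chemacx.cambridgesoft.com:ACX:CAS967582:1")
print(genbank)
print(chemacx)
```

Parsing only splits the name. Turning it into data still requires a resolution service, because the identifier itself carries no location.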
LSID (cont’d)
• Globalness: An LSID is a name with global scope that does not imply
a location. It has the same meaning everywhere.
• Uniqueness: The same LSID will never be assigned to two different
objects.
• Persistence: It is intended that the lifetime of an LSID be
permanent.
• Scalability: LSIDs can be assigned to any data element that might
conceivably be available on the network, for hundreds of years.
• Legacy Support: The LSID naming scheme must permit the
support of existing legacy naming systems
• Extensibility: Any scheme for LSIDs must permit future extensions
to the scheme.
• Independence: It is solely the responsibility of a name issuing
authority to determine conditions under which it will issue a name.
• Resolution: A URN will not impede resolution, i.e., translation to a
URL.
Semantic Web Applications
• Connotea (on-line management of web
resources)
• Piggy Bank (semantic web browser)
• YeastHub: yeast genome data integration
Connotea: Online Reference Management
Service (Nature Publishing Group)
www.connotea.org
• To keep links to the articles/websites of
your interest
• To discover new articles and websites
through sharing your links with other users
• It is web-accessible
TBL’s original vision of the Web
• Active vs. passive
• Collaborative vs. authoritative
• Decentralized vs. centralized
• Semantic vs. syntactic
ALFRED Population Sample
Connotea (ALFRED Example)
ALFRED Example
Google Earth Example
Data Integration Using RDF
• GenBank: sequence (atagccgta cctgcgagt ctagaagct) derives from
human hemoglobin
• Gene Ontology: human hemoglobin is a(n) oxygen transport protein
• Protein Data Bank: human hemoglobin has 3D structure
• Unified view: the three graphs combined through the shared
node "human hemoglobin"
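The unified view comes from a simple property of RDF: graphs that name the same resource can be merged by taking the union of their triples. The sketch below mimics the three sources on this slide. The sequence and labels are copied from the slide, while the `"3D structure record"` object and the helper name `statements_about` are placeholders invented for illustration.

```python
# One small graph per source, written as (subject, predicate, object).
SOURCES = {
    "GenBank": [
        ("atagccgtacctgcgagtctagaagct", "derives from", "human hemoglobin"),
    ],
    "Gene Ontology": [
        ("human hemoglobin", "is a", "oxygen transport protein"),
    ],
    "Protein Data Bank": [
        # Placeholder object standing in for the structure image.
        ("human hemoglobin", "has 3D structure", "3D structure record"),
    ],
}

# Merging RDF graphs is a set union of their triples.
merged = set()
for source_triples in SOURCES.values():
    merged.update(source_triples)


def statements_about(node, triples):
    """Every triple in which the node appears as subject or object."""
    return sorted(t for t in triples if node in (t[0], t[2]))


about = statements_about("human hemoglobin", merged)
for triple in about:
    print(triple)
```

No source had to know about the others. The merge works only because all three use the same name for human hemoglobin, which is why shared identifiers such as URIs and LSIDs matter.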
Piggy Bank
• http://simile.mit.edu/piggy-bank
• It is an extension to the Firefox Web browser
• It turns the Firefox Web browser into a
Semantic Web browser
• It supports tagging and links to Google Maps
RDF is the Common Currency
Piggy Bank (Data Integration Example)
TRIPLES
(Expr. Data)
HubMed
Keyword search
D2RQ
RDF Expr.
Dataset
RDF Bib.
Info.
import
import
Plugin
Browse/
query
TRIPLES Expression Data in RDF
Piggy Bank (PIM1 Gene)
Semantic Bank
YeastHub
YeastHub Team
Kei Cheung
Kevin Yip
Remko deKnikker
Mark Gerstein
Andrew Smith
Andy Masiar
RDF Technologies
• Description of data source using Rich Site
Summary (RSS)
• Data Conversion into RDF
– Relational Database to RDF (D2RQ)
– Tabular-RDF-Conversion
• RDF Database (Sesame)
– RDF-based query languages
Rich Site Summary (RSS)
User
(Application)
aggregator
YeastHub Resource
No
RSS
No
RSS
RSS
RSS
Resources
Resource Description
(Use of Dublin Core Metadata)
RDF Metadata Example
(RSS1.0)
Data Conversion and
Integration
Resource1 Resource2 Resource3
Resourcen
<xml>
…
</xml>
DOM/SAX
RDF1
D2RQ
XSLT
RDF2
RDF3
RDF/DB
<rdf>
…
</rdf>
(Sesame)
RDQL
Users/Agents
RDF Modeling of Tabular Data
Genome Object
Organism
Object type
Collection of specific types
of genome objects
(e.g., genes, proteins)
Tabular-RDF Data Conversion
Example of Data Converted into
RDF
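A tabular-to-RDF conversion can be sketched as follows: each row becomes a resource, and each remaining column becomes a property of that resource. This is an illustration of the general idea, not YeastHub's actual converter. The namespace `http://example.org/yeast/`, the column names, and the rows `gene1` and `gene2` are all hypothetical.

```python
import csv
import io

BASE = "http://example.org/yeast/"  # hypothetical namespace

# A tiny tab-delimited table with made-up rows.
TABLE = (
    "gene_id\tlocalization\tessential\n"
    "gene1\tnucleus\tyes\n"
    "gene2\tcytoplasm\tno\n"
)


def table_to_triples(text, id_column, base=BASE):
    """Turn each row into triples: (row URI, column URI, cell value)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    triples = []
    for row in reader:
        subject = base + row[id_column]
        for column, value in row.items():
            if column != id_column:
                triples.append((subject, base + column, value))
    return triples


triples = table_to_triples(TABLE, "gene_id")
for triple in triples:
    print(triple)
```

The user's only modeling decisions are which column identifies the row and which namespace the properties live in.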
Motivating Example
• Genomic analysis of essentiality within
protein networks.
– H Yu, D Greenbaum, H Xin Lu, X Zhu, M Gerstein
(2004) Trends Genet 20: 227-31.
– Jeong, H., Mason, S., Barabási, A.-L., and Oltvai,
Z. 2001. Lethality and centrality in protein
networks. Nature 411: 41–42
– Fraser, H., Hirsh, A., Steinmetz, L., Scharfe, C.,
and Feldman, M. 2002. Evolutionary rate in the
protein interaction network. Science 296: 750–752
• Important but hard…
Example Integrated Query
Essential genes
YGDP MIPS
BIND
Protein-protein interactions
(connectivity)
SGD
GO
TRIPLES
Annotation
& expression data
Query Form
RQL Syntax and Query Results
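The integrated query (essential genes combined with protein-protein interaction connectivity) amounts to a join across two sets of triples. The sketch below expresses that join in plain Python rather than RQL. The gene names, predicates, and data are invented placeholders, not results from the cited resources.

```python
from collections import Counter

# Placeholder triples standing in for an essentiality source ...
essentiality = [
    ("geneA", "essential", "yes"),
    ("geneB", "essential", "no"),
    ("geneC", "essential", "yes"),
]

# ... and for a protein-protein interaction source.
interactions = [
    ("geneA", "interacts_with", "geneB"),
    ("geneA", "interacts_with", "geneC"),
    ("geneA", "interacts_with", "geneD"),
]


def connectivity(interaction_triples):
    """Count interaction partners per gene (each edge counts for both ends)."""
    degree = Counter()
    for a, _predicate, b in interaction_triples:
        degree[a] += 1
        degree[b] += 1
    return degree


def essential_with_connectivity(essential_triples, interaction_triples):
    """Join the two sources on gene name, keeping essential genes only."""
    degree = connectivity(interaction_triples)
    return {gene: degree[gene]
            for gene, _predicate, flag in essential_triples
            if flag == "yes"}


result = essential_with_connectivity(essentiality, interactions)
print(result)
```

Once both sources are stored in one RDF repository, the same join can be stated declaratively in a query language instead of being hand-coded per pair of databases.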
Next step… Data Mining
• Whole yeast genome analysis (Y6K)
– Subcellular localization of the yeast proteome.
• A Kumar, S Agarwal, JA Heyman, S Matson, M Heidtman, S Piccirillo, L
Umansky, A Drawid, R Jansen, Y Liu, KH Cheung, P Miller, M Gerstein,
GS Roeder, M Snyder (2002) Genes Dev 16: 707-19.
– A Bayesian system integrating expression data with sequence
patterns for localizing proteins: comprehensive application to
the yeast genome.
• A Drawid, M Gerstein (2000) J Mol Biol 301: 1059-75.
• Doing systematic data mining to predict the remaining 3K localizations
• Important but hard ….
“Once the web has been sufficiently "populated"
with rich metadata, what can we expect? First,
searching on the web will become easier as
search engines have more information
available, and thus searching can be more
focused. Doors will also be opened for
automated software agents to roam the web,
looking for information for us or transacting
business on our behalf. The web of today, the
vast unstructured mass of information, may in
the future be transformed into something more
manageable - and thus something far more
useful.”
(Ora Lassila)
Automate humanely!
“No amount of automation will replace
human beings, but clumsy and belligerent
automation will alienate them and suppress
their creativity.”
(Toni Kazic)
Thanks!
Questions?
Semantic Graph
Find the most current image of Kei Cheung who is affiliated with YCMI
affiliated with
Kei
Cheung
YCMI
images
Files
member of
member of
File 1
File n
date
Oct 1, 1990
date
Feb 1, 2005
Research/Technologies Related to
Semantic Web
• Text mining
• Agent computing
• Web services
• Ontological research
Knowledge representation
Jill
has_parent
Joe
has_child
Sue
has_child
?
• A person (Joe) is an uncle iff
– Joe is male
– He has a parent (Jill) who has a second child (Sue)
who is a parent
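The uncle definition can be checked mechanically against a set of has_parent and has_child facts. This is a minimal sketch of the rule as stated on the slide. The function name `is_uncle` is hypothetical, and Sue's unnamed child (the "?" in the diagram) is given the placeholder name "child_of_sue".

```python
# Facts from the diagram, as (subject, predicate, object).
facts = {
    ("Joe", "has_parent", "Jill"),
    ("Jill", "has_child", "Joe"),
    ("Jill", "has_child", "Sue"),
    ("Sue", "has_child", "child_of_sue"),  # placeholder for the "?" node
}
male = {"Joe"}


def is_uncle(person, facts, male):
    """Apply the slide's rule: male, with a parent whose other child is a parent."""
    if person not in male:
        return False
    parents = {o for s, p, o in facts if s == person and p == "has_parent"}
    for parent in parents:
        siblings = {o for s, p, o in facts
                    if s == parent and p == "has_child" and o != person}
        for sibling in siblings:
            if any(s == sibling and p == "has_child" for s, p, _o in facts):
                return True
    return False


print(is_uncle("Joe", facts, male))
print(is_uncle("Sue", facts, male))
```

An ontology language lets such a definition be declared once and applied by a reasoner, instead of being hand-coded as it is here.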
Other things to mention?
• Taxonomy vs. ontology
• OWL overview and example(s)