Providing RDF Metadata for Biological Databases using Semantic Overlays Ben Szekely

advertisement
Providing RDF Metadata for Biological
Databases using Semantic Overlays
Ben Szekely
Lee Feigenbaum (not present but highly responsible)
May, 2006
© 2006 IBM Corporation
IBM Internet Technology
LSID Overview - Syntax
 5 Part Format: urn:lsid:authority:namespace:object[:revision]
–
–
–
–
–
urn:lsid
– Mandatory prefix
authority
– Unique string, e.g. domain name of organization
namespace
– Alphanumeric sequence that constrains the scope
– E.g. to a particular database, species, etc…
object
– Alphanumeric sequence describing the object
[revision]
– Optional alphanumeric sequence describing the version of the object
 Example: urn:lsid:ncbi.nlm.nih.gov:genbank:af271072
 Example: urn:lsid:pdb.org:pdb:1aft:1
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
LSID Overview - Resolution
1a - DDDS NAPTR
1b - SRV Record Lookup
DNS/DDDS
2 - getAvailableServices()
LSID Authority
Client
WSDL
3a - getData()
3b - getMetadata()
Data Service
Metadata Service
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
LSID Overview – Comparison with URLs
URL
LSID
 Tied to physical addresses
 Server structure may change frequently
 Brittle (broken links)
 One location only
 One protocol per URI
 Same name = same content, always
 Location independent
 Enables transparent caching
 Formalized, rich multi-sourced metadata
retrieval
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
The RDF Roadblock
 Consider the great corpus of biological data that could (should?) exist in
RDF.
–
NCBI, Ensembl, GeneOntology, HUGO, PDB
–
Metadata linking experiments, specimens, research papers
 Infeasible to convert all of this data to RDF
–
Development cost
–
– Ontology development, data conversion, query optimization
Standards issues
–
– When data is physically stored as RDF, ontologies are quite important to get right initially.
– Standards bodies have spent a long time settling on XML-based standards over the past
10 years, do they really want to do it all over again?
Lack of technology and tooling
– Does the technology really exist to host the NCBI online database and Web Service on a
system backed entirely by RDF?
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Semantic Overlays – a great start
 Provide an RDF-view of a database
–
Expose logical subsets of the database as named graphs
– Ex: a Pubmed article, an NCBI Protein
–
Named graphs should be exposed as URLs or LSIDs
– Standard resolution
– Simple to include in SPARQL queries
–
Ontologies in the view can evolve over time
– No need to modify the body of data when the ontology changes
– Perfect for experimentation, community feedback
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Semantic Overlays – implementation techniques
 XML/Web Services
–
XSLT: XML -> RDF/XML (or N3)
–
XML-Beans to RDF-Beans (i.e. Castor to Jastor)
 HTML Pages
–
Screen scraping -> RDF
 Relational databases
–
SQL Result Sets -> RDF (Jena, Jastor)
Many other techniques are possible
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Semantic Overlay - limitations
 Cannot query the complete database with SPARQL
– (We can, however, provide a SPARQL extension that wraps the
native query interface of the database. Ex. SPARQL Entrez
Search)
 RDF Translation can be expensive
 RDF Linkage is limited by the intrinsic linkages in
the store Ex. (Genbank -> Pubmed)
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Semantic Overlay – Demo 1
 Provide a simple, one-stop answer to the question:
How can I discover proteins that are
relevant to my work and locate
antibodies that target those proteins?
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
SPARQL is…
 …a query language for selecting values
from RDF graphs
 …a protocol for issuing queries via
HTTP GET, HTTP POST, or SOAP
 …a W3C Candidate Recommendation
 …capable of returning results serialized
as web-friendly JSON structures
 …perfect for mashing up disparate data
sources representable as RDF
PREFIX foaf: <…foaf/0.1/>
PREFIX rdf: <…22-rdf-syntax-ns#>
SELECT ?name ?email
WHERE {
?person rdf:type foaf:Person .
?person foaf:name ?name .
OPTIONAL {
?person foaf:mbox ?email .
}
}
?name
?email
Lee Feigenbaum
feigenbl@us.ibm.com
Grandma Feigenbaum
(unbound)
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
The Data Sources
 Entrez protein sequence and gene databases
–
National Center for Biotechnology Information (NCBI)
–
http://www.ncbi.nlm.nih.gov/
–
RDF  LSID metadata
 Antibody directory
–
Alzheimer Research Forum (AlzForum)
–
http://alzforum.org/res/com/ant/default.asp
–
RDF  HTML scraping
 Mapping data between genes and antibodies
–
Alan Ruttenberg, Millennium
–
RDF  spreadsheet data
 Taxonomy information
–
Wikispecies, free species directory
–
http://www.wikispecies.org
–
RDF  XSLT applied to XHTML
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
The Tools
 JavaScript SPARQL client library
–
Issue SPARQL SELECT queries and retrieve results as JavaScript objects
–
Supports all SPARQL endpoints returning JSON results (SPARQLer, Rasqal,
XMLArmyKnife, …)
–
http://www.thefigtrees.net/lee/sw/sparql.js
 JSON
–
Lightweight serialization of data structures (e.g. SPARQL resultsets)
–
http://www.json.org
 Microtemplates
–
Automagically bind JavaScript-object data to DHTML fragments
–
http://www.microtemplates.org
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
The Demo
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
What We Learned
Take-away Lessons
 “With a query language, a client
can design their own interface.”
- Leigh Dodds
 SPARQL + JSON is a powerful
Web 2.0 environment
 Even data sources not natively
expressed in RDF can be mashed
up with SPARQL
 Life sciences provides a rich
domain of situational problems to
approach with SPARQL-based
mashups
Looking Ahead
 As we deal in larger and larger
data sets, on-the-fly RDF creation
becomes impractical, so:
–
“Smart” federation
–
Dedicated SPARQL endpoints
 Universal naming, merged graphs,
and shared predicates only get us
so far, so:
–
Custom relations
–
owl:sameAs
–
Human-guided curation
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Semantic Overlays – Demo 2
 Annotating Semantic Overlay Named Graphs using LSID
Goal: add custom, resolvable metadata to NCBI objects.
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Demo Technologies
 Insightlink Annotation System
–
Annotations stored in RDF/DB2
–
Custom annotation forms based on identity, context, application
–
Plugins for I.E., PowerPoint, Word, Acrobat, Spotfire
 LSID/NCBI Proxy
–
Web proxy to add LSID tags to NCBI search results
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
IBM Internet Technology
Thanks!
 Questions?
 More information: bhszekel@us.ibm.com
 Demo online at http://thefigtrees.net/lee/sw/demos/antibodies/
 Thanks to:
–
Alan Ruttenberg, Millennium
–
June Kinoshita and Colin Knep, Alzheimer Research Forum
–
Lee Feigenbaum, Elias Torres, Stephen Evanchik, and Alister Lewis-Bowen, IBM
Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays
© 2006 IBM Corporation
Download