Providing RDF Metadata for Biological Databases using Semantic Overlays
Ben Szekely
Lee Feigenbaum (not present but highly responsible)
May, 2006
IBM Internet Technology
© 2006 IBM Corporation

LSID Overview - Syntax
5-part format: urn:lsid:authority:namespace:object[:revision]
– urn:lsid – Mandatory prefix
– authority – Unique string, e.g. domain name of organization
– namespace – Alphanumeric sequence that constrains the scope, e.g. to a particular database, species, etc.
– object – Alphanumeric sequence describing the object
– [revision] – Optional alphanumeric sequence describing the version of the object
Example: urn:lsid:ncbi.nlm.nih.gov:genbank:af271072
Example: urn:lsid:pdb.org:pdb:1aft:1

Providing RDF Metadata for Biological Databases using LSID-based Semantic Overlays © 2006 IBM Corporation

LSID Overview - Resolution
[Diagram: a Client resolving an LSID]
– 1a - DDDS NAPTR, 1b - SRV Record Lookup (Client to DNS/DDDS)
– 2 - getAvailableServices() (Client to LSID Authority; WSDL)
– 3a - getData() (Client to Data Service)
– 3b - getMetadata() (Client to Metadata Service)

LSID Overview – Comparison with URLs
URL
– Tied to physical addresses
– Server structure may change frequently
– Brittle (broken links)
– One location only
– One protocol per URI
LSID
– Same name = same content, always
– Location independent
– Enables transparent caching
– Formalized, rich multi-sourced metadata retrieval

The RDF Roadblock
Consider the great corpus of biological data that could (should?) exist in RDF.
– NCBI, Ensembl, GeneOntology, HUGO, PDB
– Metadata linking experiments, specimens, research papers
Infeasible to convert all of this data to RDF
– Development cost
  – Ontology development, data conversion, query optimization
– Standards issues
  – When data is physically stored as RDF, ontologies are quite important to get right initially.
  – Standards bodies have spent a long time settling on XML-based standards over the past 10 years; do they really want to do it all over again?
– Lack of technology and tooling
  – Does the technology really exist to host the NCBI online database and Web Service on a system backed entirely by RDF?

Semantic Overlays – a great start
Provide an RDF view of a database
– Expose logical subsets of the database as named graphs
  – Ex: a Pubmed article, an NCBI Protein
– Named graphs should be exposed as URLs or LSIDs
  – Standard resolution
  – Simple to include in SPARQL queries
– Ontologies in the view can evolve over time
  – No need to modify the body of data when the ontology changes
  – Perfect for experimentation, community feedback

Semantic Overlays – implementation techniques
XML/Web Services
– XSLT: XML -> RDF/XML (or N3)
– XML-Beans to RDF-Beans (i.e. Castor to Jastor)
HTML Pages
– Screen scraping -> RDF
Relational databases
– SQL Result Sets -> RDF (Jena, Jastor)
Many other techniques are possible

Semantic Overlay - limitations
Cannot query the complete database with SPARQL
– (We can, however, provide a SPARQL extension that wraps the native query interface of the database. Ex. SPARQL Entrez Search)
RDF translation can be expensive
RDF linkage is limited by the intrinsic linkages in the store
– Ex. Genbank -> Pubmed

Semantic Overlay – Demo 1
Provide a simple, one-stop answer to the question: How can I discover proteins that are relevant to my work and locate antibodies that target those proteins?

SPARQL is…
– …a query language for selecting values from RDF graphs
– …a protocol for issuing queries via HTTP GET, HTTP POST, or SOAP
– …a W3C Candidate Recommendation
– …capable of returning results serialized as web-friendly JSON structures
– …perfect for mashing up disparate data sources representable as RDF

  PREFIX foaf: <…foaf/0.1/>
  PREFIX rdf: <…22-rdf-syntax-ns#>
  SELECT ?name ?email
  WHERE {
    ?person rdf:type foaf:Person .
    ?person foaf:name ?name .
    OPTIONAL { ?person foaf:mbox ?email . }
  }

  ?name               ?email
  Lee Feigenbaum      feigenbl@us.ibm.com
  Grandma Feigenbaum  (unbound)

The Data Sources
Entrez protein sequence and gene databases
– National Center for Biotechnology Information (NCBI)
– http://www.ncbi.nlm.nih.gov/
– RDF via LSID metadata
Antibody directory
– Alzheimer Research Forum (AlzForum)
– http://alzforum.org/res/com/ant/default.asp
– RDF via HTML scraping
Mapping data between genes and antibodies
– Alan Ruttenberg, Millennium
– RDF via spreadsheet data
Taxonomy information
– Wikispecies, free species directory
– http://www.wikispecies.org
– RDF via XSLT applied to XHTML

The Tools
JavaScript SPARQL client library
– Issue SPARQL SELECT queries and retrieve results as JavaScript objects
– Supports all SPARQL endpoints returning JSON results (SPARQLer, Rasqal, XMLArmyKnife, …)
– http://www.thefigtrees.net/lee/sw/sparql.js
JSON
– Lightweight serialization of data structures (e.g. SPARQL result sets)
– http://www.json.org
Microtemplates
– Automagically bind JavaScript-object data to DHTML fragments
– http://www.microtemplates.org

The Demo

What We Learned
Take-away Lessons
– “With a query language, a client can design their own interface.” - Leigh Dodds
– SPARQL + JSON is a powerful Web 2.0 environment
– Even data sources not natively expressed in RDF can be mashed up with SPARQL
– Life sciences provides a rich domain of situational problems to approach with SPARQL-based mashups
Looking Ahead
– As we deal in larger and larger data sets, on-the-fly RDF creation becomes impractical, so:
  – “Smart” federation
  – Dedicated SPARQL endpoints
– Universal naming, merged graphs, and shared predicates only get us so far, so:
  – Custom relations
  – owl:sameAs
  – Human-guided curation

Semantic Overlays – Demo 2
Annotating Semantic Overlay Named Graphs using LSID
Goal: add custom, resolvable metadata to NCBI objects.

Demo Technologies
Insightlink Annotation System
– Annotations stored in RDF/DB2
– Custom annotation forms based on identity, context, application
– Plugins for I.E., PowerPoint, Word, Acrobat, Spotfire
LSID/NCBI Proxy
– Web proxy to add LSID tags to NCBI search results

Thanks! Questions?
More information: bhszekel@us.ibm.com
Demo online at http://thefigtrees.net/lee/sw/demos/antibodies/
Thanks to:
– Alan Ruttenberg, Millennium
– June Kinoshita and Colin Knep, Alzheimer Research Forum
– Lee Feigenbaum, Elias Torres, Stephen Evanchik, and Alister Lewis-Bowen, IBM
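
Illustration: the five-part format from the "LSID Overview - Syntax" slide (urn:lsid:authority:namespace:object[:revision]) can be sketched as a small Python parser. This is not code from the talk or from any LSID toolkit. The names LSID and parse_lsid are hypothetical, and two behaviours are assumptions of this sketch rather than statements from the slides: the urn:lsid prefix is matched case-insensitively, and empty components are rejected.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of urn:lsid:authority:namespace:object[:revision]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None  # the optional fifth part


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts (hypothetical helper).

    Raises ValueError when the string does not follow the five-part format.
    """
    parts = text.strip().split(":")
    # "urn" and "lsid" together form the mandatory prefix, so a valid LSID
    # has 5 colon-separated fields, or 6 when a revision is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    # Assumption of this sketch: the prefix is compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing urn:lsid prefix: {text!r}")
    # Assumption of this sketch: no component may be empty.
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    # The two example LSIDs shown on the syntax slide.
    for example in (
        "urn:lsid:ncbi.nlm.nih.gov:genbank:af271072",
        "urn:lsid:pdb.org:pdb:1aft:1",
    ):
        print(parse_lsid(example))
```

A client could use such a helper to pick out the authority before starting resolution, or to tell apart revisions of the same object.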