Chris Maloney's Notes - JHU, Summer, 2013

NCBI E-utilities in RDF

> In the following draft, anything in a block-quote section, like this text, is an editorial comment, or part of an outline, as opposed to draft text that would appear in the finished document.

Introduction

The NCBI E-utilities are a set of APIs that provide programmatic search-and-retrieval access to a large set of NCBI databases, including nucleotide and protein sequence databases (such as GenBank), scientific journal article abstract, citation, and full-text databases (PubMed and PMC), and many others (52 at the time of this writing). Currently, E-utilities provides output in a few different formats, including text and XML, but not in RDF.

In this project, I have created a framework for enhancing the NCBI E-utilities to provide an RDF response format. In this format, the data from the NCBI databases is presented as RDF triples, using a mixture of a standardized NCBI vocabulary (new URI patterns) in newly developed ontologies and existing third-party ontologies.

To demonstrate the results, I have created a simple Perl CGI server that accepts HTTP requests that mimic the calling convention (CGI parameters) of E-utilities, forwards them to the NCBI E-utilities to retrieve XML results, and then passes those results through an XSLT transformation to produce RDF. This is illustrated in the following figure.

The EutilsRDF service is installed at http://chrismaloney.org/eutils. It acts as a proxy between the client and the real NCBI E-utilities. To get an RDF representation of an E-utilities response, you need to translate the E-utilities URL into the form expected by this service. For example, consider the URL

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

To translate this URL:

- Change the first part (domain + first part of the path) from eutils.ncbi.nlm.nih.gov/entrez/eutils to chrismaloney.org/eutils.
- Change .fcgi to .cgi, and add ?retmode=rdf.

This results in

    http://chrismaloney.org/eutils/einfo.cgi?retmode=rdf

If the NCBI E-utilities URL for the results that you are interested in already has a retmode parameter, then change its value to rdf. As another example,

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?retmode=xml&db=pmc&id=14900&version=2.0

would become

    http://chrismaloney.org/eutils/esummary.cgi?retmode=rdf&db=pmc&id=14900&version=2.0

This project has included the development of:

- A standard for NCBI RDF URIs
- A web server framework for transforming the E-utilities results into RDF, including:
  - A driver Perl CGI script
  - XSLT transformations
- Ontologies
- Documentation and examples

The work was done in the open, as a project on GitHub, with some engagement and feedback from staff at NCBI and the wider Semantic Web / Bioinformatics community (detailed in the Acknowledgements).

I believe that this project has the potential to benefit the bioinformatics community. More and more bioinformatics resources are being made available as linked open data, and it is my hope that this project will help to catalyze NCBI's taking steps toward providing more data in RDF format, and thereby integrating their resources more tightly with the rest of the Semantic Web. That, in turn, would make those resources easier to use and quicken the pace of scientific discovery. With that in mind, I tried to do more than a simple, throwaway demonstration, and to do the work in such a way that it might eventually be incorporated into the NCBI APIs.

Project infrastructure

This service is currently deployed to my home web server, at http://chrismaloney.org/eutils. That URL is the top-level "welcome" page, and provides links to sample invocations of the service.
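Each sample invocation follows the URL translation rule given in the Introduction. As a rough illustration, that rule can be sketched in a few lines of Python. This is not the service's actual Perl code: the function name `to_rdf_proxy_url` and the decision to always place `retmode` first in the query string are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Host and path prefixes taken from the examples in the Introduction.
NCBI_HOST = "eutils.ncbi.nlm.nih.gov"
NCBI_PREFIX = "/entrez/eutils/"
PROXY_HOST = "chrismaloney.org"
PROXY_PREFIX = "/eutils/"


def to_rdf_proxy_url(ncbi_url):
    """Translate an NCBI E-utilities URL into the equivalent proxy URL.

    Hypothetical helper: it only mirrors the two documented steps
    (swap host/path prefix, change .fcgi to .cgi and set retmode=rdf).
    """
    parts = urlsplit(ncbi_url)
    if parts.netloc != NCBI_HOST or not parts.path.startswith(NCBI_PREFIX):
        raise ValueError("not an NCBI E-utilities URL: %s" % ncbi_url)

    tool = parts.path[len(NCBI_PREFIX):]
    if not tool.endswith(".fcgi"):
        raise ValueError("expected a .fcgi tool name, got: %s" % tool)
    tool = tool[:-len(".fcgi")] + ".cgi"

    # Drop any existing retmode, then put retmode=rdf first (an assumption;
    # parameter order should not matter to a CGI script).
    params = [(k, v)
              for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "retmode"]
    query = urlencode([("retmode", "rdf")] + params)

    return urlunsplit((parts.scheme, PROXY_HOST, PROXY_PREFIX + tool, query, ""))


if __name__ == "__main__":
    print(to_rdf_proxy_url(
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"))
    print(to_rdf_proxy_url(
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
        "?retmode=xml&db=pmc&id=14900&version=2.0"))
```

Run against the two example URLs from the Introduction, the sketch reproduces the two proxy URLs given there.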
The following are the web pages associated with the GitHub project:

- Project page - Home page of the project, where you can browse and interact with the code repository
- Wiki - Documentation
- Issues - Tasks and bug reports

In addition, the eutilsrdf Google group has been set up to provide a mailing list for discussions. (Unfortunately, it has not been used; discussions have taken place mostly in private email threads.)

NCBI RDF URIs standard document

The initial part of this project comprised working with people inside and outside of NCBI to develop a document to standardize RDF URIs for NCBI. As mentioned above, one of the goals of this project was to develop tools that might be incorporated into the real, production NCBI APIs, and so considerable up-front effort was expended on designing URIs for resources according to widely recognized current best practices, for example those described in "Cool URIs for the Semantic Web" (2008) [3] and "Linked Data: Evolving the Web into a Global Data Space" (Heath and Bizer, 2011) [4]. The document is included in Appendix A.

These standards were used in minting the URIs for the ontologies (described below), covering NCBI database entities (which will be used as RDF subjects and objects) and Entrez filters and links (which will be used as RDF predicates).

> Still to write: Describe collaboration with PubChem group, and others who provided input.

Perl CGI

The backbone of the web service is a Perl CGI script that mimics the E-utilities API, and provides valid RDF data for many of the E-utilities requests.

> Still to describe:
> - URI manipulation
> - Logic of error handling, HTTP responses

XSLT transformations

Transformations were written in XSLT 1.0 (since I used Perl's XML::LibXSLT module to run them, which only supports 1.0).
Three XSLT stylesheets were written, one for each of the E-utilities that were used in this project:

- einfo.xsl
- elink.xsl
- esummary.xsl

The input to each of these is the XML output from the respective NCBI E-utility tool.

> More to cover:
> - Output is an error response or RDF/XML
> - Validation of NCBI responses and error handling

I feel that doing this work on GitHub was ideal, because it makes it easy for stakeholders in the community to provide feedback, or to enhance or add their own transformations, by simply amending the code as they see fit and sending a pull request.

Responses

Here, I will describe some of the various responses provided for specific invocations of E-utilities. A sample table is provided on the service home page.

This utility will always return a well-formed XML document. Depending on whether or not the script is able to parse and handle the URL, it will return one of the following:

- 200 OK, with a valid RDF/XML document.
- 404 Not Found - for cases where it doesn't understand the request, or if the retmode is 'rdf' but the service doesn't support the requested transformation.
- 303 See Other redirect to the NCBI E-utilities URL - if the request is not for RDF (retmode is not given or is not 'rdf').
- 502 Bad Gateway - if this service received a bad response from the NCBI E-utilities.

All of the RDF responses have been validated both with the W3C RDF Validation service and with Virtuoso, in order to ensure that the provided output is valid and useful.

EInfo

The sample response types that have been converted into RDF include the responses from the EInfo tool. There are two types of responses that this tool provides:

1. List of all databases.
2. Database details.

List of all databases

When invoked without any query-string parameters, EInfo provides a listing of all of the Entrez databases.

> Here I will describe, and (if space permits) give sample snippets or a graph diagram of the output from the RDFized versions of this response.
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:entrez="http://rdf.ncbi.nlm.nih.gov/entrez/">
      <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/pubmed">
        <entrez:dbname>pubmed</entrez:dbname>
      </entrez:Db>
      <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/protein">
        <entrez:dbname>protein</entrez:dbname>
      </entrez:Db>
      <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/nuccore">
        <entrez:dbname>nuccore</entrez:dbname>
      </entrez:Db>
      ...
    </rdf:RDF>

Database details

When invoked with the db parameter set to the name of one of the databases, EInfo provides detailed information and current statistics about that database.

ELink

Another E-utilities response type that has a conversion into RDF is the "by id" variant of the ELink utility. The "batch mode" response [2] will not be handled, because it does not lend itself to conversion to RDF. The reason is that the response gives a list of "subjects" and a list of "objects" (in the RDF sense), and the type of link between them, but gives no indication of which object is associated with which subject. This illustrates an important, general limitation of this project: not every E-utility request/response is amenable to conversion into RDF.

> Describe the elink response.

ESummary

> Outline:
> - Limitation: we don't do version 1.0 responses (deprecated)
> - Sample: PMC response

Another sample will be the PMC ESummary result. The ESummary service provides "document summaries", or "docsums", of an entry within any Entrez database. For PMC, the entry is a full-text journal article, and the docsum comprises detailed bibliographic information about that article. The scope of this project includes transforming the PMC docsum output into RDF, using, where appropriate, established, standard ontologies such as Dublin Core and SPAR.

Ontologies

Newly developed ontology

> To do: Use Protege to validate my ontology, and grab a screenshot from OWLViz.
The ontology that I developed for this project is in the GitHub repository, in the file entrez-ontology.rdf. As with the RDF responses described above, this ontology file has been validated both with the W3C RDF Validation service (the output of that validation can be viewed dynamically) and with Virtuoso. Additionally, I imported it into Protege to produce a visualization of the class model.

> Include class-model diagram here.

> Describe the ontology, and the rationale for some of the decisions made, etc. The following bits were copied from my "exercises" paper, but need updating:

In this ontology, I have identified several classes, including:

- NCBI Entrez Database
- Entrez database record
- Entrez link - a subclass of rdf:Property
- A nuccore (nucleotide) sequence
- A gene record
- A PMC article

I've also specified the hierarchy among these, with rdfs:subClassOf relationships.

I've encoded information about two Entrez links. These are relationships between items within NCBI Entrez databases, and they map very nicely to RDF properties. An example that illustrates a link between a nuccore record and the gene records (i.e., it answers the question: what gene corresponds to this nucleotide sequence?) is

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=gene&id=312836839

This is mapped to the RDF property entrez:nuccore_gene, which is a subproperty of entrez:link, and the ontology specifies its domain and range. The reciprocal link is entrez:gene_nuccore; see this query result:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=nuccore&id=159

This is specified as the owl:inverseOf the entrez:nuccore_gene link.

Incorporated third-party ontologies

> Describe the third-party ontologies I've used.

Practical applications

> To do: If time permits, I will pull a few of these results into Virtuoso to demo a few SPARQL queries that can be run against them.
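Short of a full Virtuoso and SPARQL demonstration, the following is a minimal sketch of how a client might consume the EInfo database listing shown in the EInfo section, using only the Python standard library. It is an illustration under stated assumptions, not part of the project: the function name `database_names` is hypothetical, the sample document is abbreviated, and the code handles only this simple RDF/XML shape (typed `entrez:Db` nodes with a literal `entrez:dbname`). A general RDF/XML document should be read with a real RDF parser instead.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ENTREZ_NS = "http://rdf.ncbi.nlm.nih.gov/entrez/"

# Abbreviated copy of the EInfo sample response; the rdf namespace
# declaration is required for the document to parse.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:entrez="http://rdf.ncbi.nlm.nih.gov/entrez/">
  <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/pubmed">
    <entrez:dbname>pubmed</entrez:dbname>
  </entrez:Db>
  <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/protein">
    <entrez:dbname>protein</entrez:dbname>
  </entrez:Db>
</rdf:RDF>"""


def database_names(rdf_xml):
    """Map each entrez:Db subject URI to its entrez:dbname literal.

    Hypothetical helper: it walks the XML tree directly rather than
    building an RDF graph, so it only works for this response shape.
    """
    root = ET.fromstring(rdf_xml)
    names = {}
    for db in root.findall("{%s}Db" % ENTREZ_NS):
        uri = db.get("{%s}about" % RDF_NS)
        names[uri] = db.findtext("{%s}dbname" % ENTREZ_NS)
    return names


if __name__ == "__main__":
    for uri, name in database_names(SAMPLE).items():
        print(name, uri)
```

Each dictionary entry corresponds to one triple of the form (database URI, entrez:dbname, literal name), which is the same information a SPARQL query over the loaded data would return.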
Related work, sites, and services

Work similar to this project

Bio2RDF is a project and a website that provides RDF data from many different data sets in the life sciences. The developers on this project have already done a fantastic amount of work to integrate NCBI's (as well as other institutions') data into the Semantic Web. Among the data sets that they have converted into RDF are NCBI's Gene, GenBank, Homologene, OMIM, PubMed, RefSeq, and Taxonomy databases.

The TogoWS web service, which is based on BioRuby, provides a wrapper for some E-utilities, in a manner very similar to this project.

This project in the Semantic Web community

NCBO BioPortal is a web site / service for registering Bioinformatics / Semantic Web projects and ontologies. This project has been registered with BioPortal.

Conclusions and future work

> Outline:
> - Integrate this work with outside resources, such as (if appropriate) Freebase, identifiers.org, BioPortal, and/or Bio2RDF, for easier discovery and greater interoperability.
> - Provide transformations for other types of responses.

Limited time, unfortunately, prevented me from researching as fully as I would have liked some of the pre-existing work in the Semantic Web that relates to, or overlaps with, this project. In particular, it would be important to harmonize this work with what is already being done in the Bio2RDF project. The Bio2RDF-scripts Wiki [8] provides extensive, detailed information about how that project operates.

Acknowledgements

> In this section, I'll provide details of contributions from others.

People who had input to the development of the NCBI RDF URI standards document:

- Jerven Bolleman (Uniprot)
- Michel Dumontier (Bio2RDF)
- Fu Gang (NCBI)
- Mark Johnson (NCBI)

References

[1] NCBI. Entrez Programming Utilities Help. 2010.
[2] E-utilities Quick Start, Finding Related Data Through Entrez Links. http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1Finding_Related_Data_Through_En_
[3] W3C. Cool URIs for the Semantic Web. 2008.
[4] Heath, Tom and Bizer, Christian (2011). Linked Data: Evolving the Web into a Global Data Space. Accessed: 2013-06-09. (WebCite)
[5] Berners-Lee, Tim (2006). Linked Data - Design Issues. Accessed: 2013-06-09. (WebCite)
[6] Biohackathon 2013 Wiki pages (on GitHub). https://github.com/dbcls/bh13/wiki
[7] Identifiers.org. http://identifiers.org/
[8] Bio2RDF-scripts Wiki. https://github.com/bio2rdf/bio2rdf-scripts/wiki
[9] Expressing Dublin Core metadata using the Resource Description Framework (RDF). http://dublincore.org/documents/dc-rdf/

Appendices

NCBI RDF URI Standard
