Chris Maloney's Notes - JHU, Summer, 2013

NCBI E-utilities in RDF
In the following draft, anything in a block-quote section, like this text, is an
editorial comment, or part of an outline, as opposed to draft text that would
appear in the finished document.
Introduction
The NCBI E-utilities are a set of APIs that provide programmatic search-and-retrieval access to a
large set of NCBI databases, including nucleotide and protein sequence databases (including
GenBank), scientific journal article abstract, citation, and full-text databases (PubMed and
PMC), and many others (52 at the time of this writing). Currently, E-utilities provides output in a
few different formats, including text and XML, but not in RDF.
In this project, I have created a framework for the enhancement of NCBI E-utilities to provide an
RDF response format. In this format, the data from the NCBI databases will be presented as RDF
triples, using a mixture of a standardized NCBI vocabulary (new URI patterns) in newly
developed ontologies, and existing, third-party, ontologies.
To demonstrate the results, I have created a simple Perl CGI server that proxies HTTP
requests that mimic the calling convention (CGI parameters) of E-utilities, sends them to NCBI
E-utilities to retrieve XML results, and then passes those results through an XSLT
transformation to produce RDF. This is illustrated in the following figure.
The EutilsRDF service is installed at http://chrismaloney.org/eutils .
It acts as a proxy between the client and the real NCBI E-utilities. To get an RDF representation
of an E-utilities response, you need to translate the E-utilities URL into the form expected by this
service. For example, consider the URL
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
To translate this URL:
1. Change the first part (domain + first part of the path) from
   eutils.ncbi.nlm.nih.gov/entrez/eutils to chrismaloney.org/eutils,
2. change .fcgi to .cgi, and
3. add ?retmode=rdf
Resulting in
http://chrismaloney.org/eutils/einfo.cgi?retmode=rdf
If the NCBI E-utilities URL for the results that you are interested in already has a retmode
parameter, then change that to rdf. Another example:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?retmode=xml&db=pmc&id=14900&version=2.0
would become
http://chrismaloney.org/eutils/esummary.cgi?retmode=rdf&db=pmc&id=14900&version=2.0
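The translation rules above can be sketched as a small Python function. This is only an illustration of the URL rewriting, not the service's actual code (which is a Perl CGI script). The function name to_rdf_proxy_url is hypothetical, and placing retmode first in the query string is an assumption made so that the output matches the examples above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

NCBI_HOST = "eutils.ncbi.nlm.nih.gov"
NCBI_PREFIX = "/entrez/eutils/"
PROXY_HOST = "chrismaloney.org"
PROXY_PREFIX = "/eutils/"


def to_rdf_proxy_url(eutils_url):
    """Rewrite an NCBI E-utilities URL into the form expected by the proxy.

    Hypothetical helper: it applies the three documented steps (swap the
    domain and path prefix, change .fcgi to .cgi, set retmode=rdf).
    """
    parts = urlsplit(eutils_url)
    if parts.netloc != NCBI_HOST or not parts.path.startswith(NCBI_PREFIX):
        raise ValueError("not an NCBI E-utilities URL: %r" % eutils_url)

    tool = parts.path[len(NCBI_PREFIX):]
    if not tool.endswith(".fcgi"):
        raise ValueError("expected a .fcgi endpoint: %r" % eutils_url)
    tool = tool[: -len(".fcgi")] + ".cgi"

    # Drop any existing retmode and put retmode=rdf first (an assumption,
    # chosen to reproduce the parameter order shown in the examples).
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "retmode"]
    params.insert(0, ("retmode", "rdf"))

    return urlunsplit(
        (parts.scheme, PROXY_HOST, PROXY_PREFIX + tool, urlencode(params), "")
    )


if __name__ == "__main__":
    print(to_rdf_proxy_url(
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"))
```

Any parameters other than retmode are passed through unchanged, which mirrors the statement that the proxy mimics the E-utilities calling convention.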
This project has included the development of:
• A standard for NCBI RDF URIs
• A web server framework for transforming the E-utilities results into RDF, including:
  o A driver Perl CGI script
  o XSLT transformations
• Ontologies
• Documentation and examples
The work was done in the open, as a project on GitHub, and with some engagement and
feedback from staff at NCBI and the wider Semantic Web / Bioinformatics community (detailed
in the Acknowledgements).
I believe that this project has the potential to benefit the bioinformatics community. More and
more bioinformatics resources are being made available as linked open data, and it is my hope
that this project will help to catalyze NCBI’s move toward providing more data in RDF format.
That would integrate their resources more tightly with the rest of the Semantic Web, which
would, in turn, make them easier to use and quicken the pace of scientific discovery.
With that in mind, while performing this work, I tried to do more than a simple, throwaway
demonstration, and to do it in such a way that it might actually get incorporated into the NCBI
APIs, eventually.
Project infrastructure
This service is currently deployed to my home web server, at http://chrismaloney.org/eutils. This
constitutes the top level “welcome” page, and provides links to sample invocations of the
service.
The following are the web pages associated with the GitHub project:
• Project page - Home page of the project, where you can browse and interact with the
  code repository
• Wiki - Documentation
• Issues - Tasks and bug reports
In addition, the eutilsrdf Google group has been set up to provide a mailing list for discussions.
(Unfortunately, it has not been used – discussions have taken place mostly in private email
threads.)
NCBI RDF URIs standard document
The initial part of this project comprised working with people inside and outside of NCBI to
develop a document to standardize RDF URIs for NCBI. As mentioned above, one of the goals
of this project was to develop tools that might be incorporated into the real, production NCBI
APIs, and so considerable up-front effort was expended on designing URIs for resources
according to widely recognized current best practices, such as those described in “Cool
URIs for the Semantic Web” (2008) and “Linked Data: Evolving the Web into a Global Data
Space” (Heath and Bizer, 2011).
The document is included in Appendix A.
These standards were used in the minting of URIs for the ontologies (described below) for NCBI
database entities (which will be used as RDF subjects and objects) and Entrez filters and links
(which will be used as RDF predicates).
Still to write:
Describe collaboration with PubChem group, and others who provided input.
Perl CGI
The backbone of the web service is a Perl CGI script that mimics the E-utilities API, and
provides valid RDF data for many of the E-utilities requests.
Still to describe:
• URI manipulation
• Logic of error handling, HTTP responses
XSLT transformations
Transformations were written in XSLT 1.0, because I ran them with Perl’s XML::LibXSLT module,
which only supports 1.0.
Three XSLT stylesheets were written, one for each of the E-utilities used in this project:
• einfo.xsl
• elink.xsl
• esummary.xsl
The input to each of these is the XML output from the respective NCBI E-utility tool.
More to cover:
• Output is an error response or RDF/XML
• Validation of NCBI responses and error handling
I feel that doing this work on GitHub was ideal, because it makes it easy for stakeholders in
the community to provide feedback, or to enhance or add their own transformations, by simply
amending the code as they see fit and sending a pull request.
Responses
Here, I will describe some of the various responses provided for specific invocations
of E-utilities. A sample table is provided on the service home page.
This utility will always return a well-formed XML document. Depending on whether or not the
script is able to parse and handle the URL, it will return one of the following:
• 200 OK, with a valid RDF/XML document.
• 404 Not Found - for cases where it doesn’t understand the request, or if the retmode is
  ‘rdf’ but the service doesn’t support the requested transformation.
• 303 See Other redirect to the NCBI E-utilities URL - if the request is not for RDF
  (retmode is not given or is not ‘rdf’).
• 502 Bad Gateway - if this service received a bad response from the NCBI E-utilities.
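The choice among these four responses can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the logic of the actual Perl script: the function name choose_response, the upstream_ok flag, and the order in which the checks are applied are all hypothetical. The set of supported tools is taken from the three stylesheets listed earlier.

```python
from urllib.parse import urlencode

# Tools for which an XSLT transformation exists in this project.
SUPPORTED_TOOLS = {"einfo", "elink", "esummary"}
NCBI_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def choose_response(tool, params, upstream_ok=True):
    """Return (status_code, redirect_location_or_None) for a request.

    tool        -- E-utility name without extension, e.g. "einfo"
    params      -- dict of query-string parameters
    upstream_ok -- hypothetical flag: did NCBI return a usable response?
    """
    # Not a request for RDF: redirect the client to the real NCBI service.
    if params.get("retmode") != "rdf":
        query = urlencode(params)
        location = NCBI_BASE + tool + ".fcgi" + ("?" + query if query else "")
        return 303, location

    # RDF requested, but no transformation is available for this tool.
    if tool not in SUPPORTED_TOOLS:
        return 404, None

    # The upstream NCBI response was bad.
    if not upstream_ok:
        return 502, None

    return 200, None


if __name__ == "__main__":
    print(choose_response("einfo", {"retmode": "rdf"}))
    print(choose_response("einfo", {}))
```

Checking retmode first is one possible ordering; a request that cannot be understood at all would presumably need a 404 before any redirect is attempted.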
All of the RDF responses have been validated both with the W3C RDF Validation service and
with Virtuoso, in order to ensure that the provided output is valid and useful.
Einfo
The sample response types that have been converted into RDF include the responses from the
EInfo tool.
There are two types of responses that this tool provides:
1. List of all databases.
2. Database details.
List of all databases
When invoked without any query-string parameters, it provides a listing of all of the Entrez
databases.
Here I will describe, and (if space permits) give sample snippets or a graph diagram
of the output from the RDFized versions of this response.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:entrez="http://rdf.ncbi.nlm.nih.gov/entrez/">
  <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/pubmed">
    <entrez:dbname>pubmed</entrez:dbname>
  </entrez:Db>
  <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/protein">
    <entrez:dbname>protein</entrez:dbname>
  </entrez:Db>
  <entrez:Db rdf:about="http://rdf.ncbi.nlm.nih.gov/entrez/db/nuccore">
    <entrez:dbname>nuccore</entrez:dbname>
  </entrez:Db>
  ...
</rdf:RDF>
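To make the mapping concrete, here is a rough Python sketch that produces output of the same shape from an EInfo database list. It is not the project's method: the real service uses an XSLT 1.0 stylesheet (einfo.xsl), whereas this sketch swaps in Python's standard xml.etree.ElementTree module. The function name einfo_dblist_to_rdf is hypothetical, and the input is assumed to contain DbName elements as in the EInfo XML response.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ENTREZ_NS = "http://rdf.ncbi.nlm.nih.gov/entrez/"

# Ask ElementTree to serialize with the same prefixes as the sample output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("entrez", ENTREZ_NS)


def einfo_dblist_to_rdf(einfo_xml):
    """Convert an EInfo database-list document into RDF/XML (as a string).

    Hypothetical stand-in for the einfo.xsl transformation: each DbName
    element becomes an entrez:Db resource with an entrez:dbname literal.
    """
    source = ET.fromstring(einfo_xml)
    rdf_root = ET.Element("{%s}RDF" % RDF_NS)

    for name_el in source.iter("DbName"):
        name = (name_el.text or "").strip()
        if not name:
            continue  # skip empty entries rather than mint a bad URI
        db = ET.SubElement(
            rdf_root,
            "{%s}Db" % ENTREZ_NS,
            {"{%s}about" % RDF_NS: ENTREZ_NS + "db/" + name},
        )
        ET.SubElement(db, "{%s}dbname" % ENTREZ_NS).text = name

    return ET.tostring(rdf_root, encoding="unicode")


if __name__ == "__main__":
    sample = ("<eInfoResult><DbList>"
              "<DbName>pubmed</DbName><DbName>protein</DbName>"
              "</DbList></eInfoResult>")
    print(einfo_dblist_to_rdf(sample))
```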
Database details
When invoked with the db parameter having a value of one of the databases, it provides detailed
information and current statistics about that database.
ELink
Another E-utilities response type that has a conversion into RDF is the “by id” variant of the
ELink utility. The “batch mode” response[2] will not be handled, because it does not lend itself
to conversion to RDF. The reason for this is that the response indicates a list of “subjects” and
“objects” (in the RDF sense), and the type of link between them, but there is no indication of
which object is associated with which of the subjects.
This illustrates an important, general, limitation of this project: that not every E-utility
request/response is amenable to conversion into RDF.
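The difference between the two ELink variants can be illustrated with a short Python sketch. It assumes a simplified view of a link set (one list of source ids, one list of target ids) and a hypothetical record URI pattern; neither the function names nor the record URI form are taken from the project. With exactly one source id every triple is unambiguous. With several, the subject of each link cannot be determined, so the sketch refuses to produce triples.

```python
ENTREZ = "http://rdf.ncbi.nlm.nih.gov/entrez/"


def record_uri(db, uid):
    # Hypothetical URI pattern for a database record; the real pattern is
    # defined in the NCBI RDF URI standard document, not here.
    return "%s%s/%s" % (ENTREZ, db, uid)


def link_triples(dbfrom, db, source_ids, target_ids):
    """Build (subject, predicate, object) triples for one ELink link set.

    The predicate follows the entrez:<dbfrom>_<db> naming used for link
    properties such as entrez:nuccore_gene.
    """
    if len(source_ids) != 1:
        # Batch mode: several subjects and several objects, with no
        # indication of which object belongs to which subject.
        raise ValueError("cannot pair %d source ids with targets"
                         % len(source_ids))

    subject = record_uri(dbfrom, source_ids[0])
    predicate = "%s%s_%s" % (ENTREZ, dbfrom, db)
    return [(subject, predicate, record_uri(db, t)) for t in target_ids]


if __name__ == "__main__":
    # Placeholder ids, for illustration only.
    for triple in link_triples("nuccore", "gene", ["1"], ["2", "3"]):
        print(triple)
```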
Describe the elink response.
ESummary
Outline:
• Limitation: we don’t do version 1.0 responses (deprecated)
• Sample: PMC response
Another sample will be the PMC ESummary result, for example, this query. The ESummary
service provides “document summaries”, or “docsums”, of an entry within any Entrez database.
For PMC, the entry is a full-text journal article, and the docsum comprises detailed bibliographic
information about that article. Transforming the PMC docsum output into RDF is within the
scope of this project, using, where appropriate, established, standard ontologies such as
Dublin Core and SPAR.
Ontologies
Newly developed ontology
To do: Use Protege to validate my ontology, and grab a screenshot from OWLViz.
The ontology that I developed for this project is in the GitHub repository, entrez-ontology.rdf.
As with the RDF responses described above, this ontology file has been validated both with the
W3C RDF Validation service (output of that validation can be viewed dynamically here) and with
Virtuoso.
Additionally, I imported it into Protege, and here is a visualization of the class model:
Include class-model diagram here.
Describe the ontology, and the rationale for some of the decisions made, etc.
The following bits were copied from my “exercises” paper, but need updating:
In this ontology, I have identified several classes, including:
• NCBI Entrez Database
• Entrez database record
• Entrez link - a subclass of rdf:Property
• A nuccore (nucleotide) sequence
• A gene record
• A PMC article
I’ve also specified the hierarchy among these, with rdfs:subClassOf relationships.
I’ve encoded information about two Entrez links. These are relationships between items within
NCBI Entrez databases, and map very nicely to RDF properties.
An example that illustrates a link between a nuccore record and the gene records (i.e., it answers
the question, what gene corresponds to this nucleotide sequence) is
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=gene&id=312836839
This is mapped to the RDF property entrez:nuccore_gene, which is a subproperty of entrez:link,
and the ontology specifies its domain and range.
The reciprocal link is entrez:gene_nuccore (see this query result:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=nuccore&id=159). This
is specified as the owl:inverseOf the entrez:nuccore_gene link.
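The practical effect of declaring the two links as inverses can be shown with a small Python sketch. It is a toy illustration, not a reasoner and not part of the project: the function name infer_inverses and the urn:example: record identifiers are made up, and triples are represented as plain tuples. Given one link statement and the owl:inverseOf declaration, the reciprocal statement follows.

```python
ENTREZ = "http://rdf.ncbi.nlm.nih.gov/entrez/"
OWL_INVERSE_OF = "http://www.w3.org/2002/07/owl#inverseOf"

NUCCORE_GENE = ENTREZ + "nuccore_gene"
GENE_NUCCORE = ENTREZ + "gene_nuccore"


def infer_inverses(triples):
    """Return the input triples plus those implied by owl:inverseOf.

    Toy, single-pass inference over (subject, predicate, object) tuples.
    """
    triples = set(triples)

    # Collect inverse pairs; owl:inverseOf is symmetric, so record both ways.
    inverse = {}
    for s, p, o in triples:
        if p == OWL_INVERSE_OF:
            inverse[s] = o
            inverse[o] = s

    inferred = set(triples)
    for s, p, o in triples:
        if p in inverse:
            inferred.add((o, inverse[p], s))
    return inferred


if __name__ == "__main__":
    data = {
        (GENE_NUCCORE, OWL_INVERSE_OF, NUCCORE_GENE),
        # Made-up record identifiers, for illustration only.
        ("urn:example:sequence", NUCCORE_GENE, "urn:example:gene"),
    }
    for triple in sorted(infer_inverses(data)):
        print(triple)
```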
Incorporated third-party ontologies
Describe the third-party ontologies I’ve used.
Practical applications
To do: If time permits, I will pull a few of these results into Virtuoso to demo a few
SPARQL queries that can be run against them.
Related work, sites, and services
Work similar to this project
Bio2RDF is a project and a website that provides RDF data from many different data sets in the
life sciences. The developers on this project have already done a fantastic amount of work to
integrate NCBI’s (as well as other institutions’) data into the Semantic Web. Among the data sets
that they have converted into RDF are NCBI’s Gene, GenBank, Homologene, OMIM, PubMed,
RefSeq, and Taxonomy databases.
The TogoWS web service provides a wrapper for some of the E-utilities, in a manner very similar
to this project; for example, here, which is based on BioRuby.
This project in the Semantic Web community
NCBO BioPortal is a web site / service for registering Bioinformatics / Semantic Web projects
and ontologies. This project has been registered with BioPortal here.
Conclusions and future work
Outline:
• Integrate this work with outside resources, such as (if appropriate) Freebase,
  identifiers.org, BioPortal, and/or Bio2RDF, for easier discovery and greater
  interoperability.
• Provide transformations for other types of responses.
Limited time, unfortunately, prevented me from researching as fully as I would have liked some
of the pre-existing work in the Semantic Web that relates to, or overlaps with, this project. In
particular, it would be important to harmonize this work with what is already being done in the
Bio2RDF project.
The Bio2RDF-scripts Wiki provides extensive detailed information about how that project
operates.
Acknowledgements
In this section, I’ll provide details of contributions from others.
People who had input to the development of the NCBI RDF URI standards document:
• Jerven Bolleman (UniProt),
• Michel Dumontier (Bio2RDF),
• Fu Gang (NCBI),
• Mark Johnson (NCBI).
References
[1] NCBI. Entrez Programming Utilities Help. 2010.
[2] E-utilities Quick Start, Finding Related Data Through Entrez Links.
http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1Finding_Related_Data_Through_En_
[3] W3C. Cool URIs for the Semantic Web. 2008.
[4] Heath, Tom and Bizer, Christian (2011). Linked Data: Evolving the Web into a Global Data
Space. Accessed: 2013-06-09. (WebCite)
[5] Berners-Lee, Tim (2006). Linked Data - Design Issues. Accessed: 2013-06-09. (WebCite)
[6] Biohackathon 2013 Wiki pages (on GitHub). https://github.com/dbcls/bh13/wiki
[7] Identifiers.org. http://identifiers.org/
[8] Bio2RDF-scripts Wiki. https://github.com/bio2rdf/bio2rdf-scripts/wiki
[9] Expressing Dublin Core metadata using the Resource Description Framework (RDF).
http://dublincore.org/documents/dc-rdf/
Appendices
NCBI RDF URI Standard
Download