A data retrieval workflow using NCBI E-Utils + Python
John Pinney
Tech talk, Tue 12th Nov

Task
Produce a data set given particular constraints. Allow easy revision/updates as needed. Output some kind of report for a biologist.

(One possible) solution
A number of DBs/tools now accept queries via RESTful* interfaces, in principle allowing up-to-date data set retrieval and fully online analysis workflows.

*REST = Representational State Transfer: a client/server architecture that ensures stateless communication, usually implemented via HTTP requests.

Bioinformatics REST services
- NCBI E-utils: PubMed, other DBs, BLAST
- EBI web services: various
- UniProt: protein sequences
- KEGG: metabolic network data
- OMIM: human genetic disorders
- + many others (see e.g. biocatalogue.org for a registry)

E-Utils services
ESearch, ESummary, EFetch, ELink: all available through http://eutils.ncbi.nlm.nih.gov/

Basic URL API
e.g. retrieve IDs of all human genes:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=9606[TAXID]

    esearch           (which EUtil)
    retmode=xml       (output format)
    db=gene           (which DB)
    term=9606[TAXID]  (query term)

My tasks
1. Produce a list of human genes that are associated with at least one resolved structure in PDB AND at least one genetic disorder in OMIM
2. Make an online table to display them

Easy: Python requests using PyCogent
PyCogent is a Python bioinformatics module that includes convenience methods for interacting with a number of online resources.

    from cogent.db.ncbi import *
    ef = EFetch(id='23491729', rettype='fasta')
    protein = ef.read()

Bit more typing but still easy: Python requests using urllib2
For services that are not available through PyCogent, you can construct your own URLs using urllib2.
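The Basic URL API above is just string concatenation. As a minimal sketch (stdlib only, no request actually issued), the ESearch query from that slide can be assembled from its parts before being handed to a request library:

```python
# Assemble the ESearch URL shown above from its parts.
# No request is sent here; this only builds the query string.
base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = [("retmode", "xml"), ("db", "gene"), ("term", "9606[TAXID]")]
url = base + "?" + "&".join("%s=%s" % (k, v) for k, v in params)
# url == "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=9606[TAXID]"
```
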
    import urllib2
    url = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
           "esummary.fcgi?retmode=xml&db=gene&id=7157")
    result = urllib2.urlopen(url).read()

(TIP: use urllib.quote_plus to escape spaces and other special characters when preparing your URL query.)

Making your life much easier: XML handling using BeautifulSoup
Using retmode=xml ensures consistency in output format, but it can be very difficult to extract the data without a proper XML parser. The simplest and most powerful XML handling in Python I have found is via the BeautifulSoup object model.

Example: extract all structure IDs linked to gene 7153.

    e = ELink(db='structure', dbfrom='gene', id=7153)
    result = e.read()

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(result, 'xml')
    linkset = soup.eLinkResult.LinkSet
    s = [x.Id.text for x in linkset.LinkSetDb.findAll('Link')]

Using WebEnv to chain requests
If you specify usehistory='y', NCBI can remember your output result (e.g. a list of gene IDs) and use it as a batch input for another EUtil request. This is extremely useful for minimising the number of queries for workflows involving large sets of IDs. You keep track of this "environment" using the WebEnv and query_key fields.

    def webenv_search(**kwargs):
        e = ESearch(usehistory='y', **kwargs)
        result = e.read()
        soup = BeautifulSoup(result, 'xml')
        return {'WebEnv': soup.WebEnv.text,
                'query_key': soup.QueryKey.text}

Workflow for gene list
(steps with a grey background = using WebEnv)

    ESearch db=structure, term=9606[TAXID]       -> structure IDs
    ELink   db=protein,   dbfrom=structure       -> protein IDs
    ELink   db=gene,      dbfrom=protein         -> gene IDs

    ESearch db=omim, term="omim medgen"[Filter]  -> OMIM IDs
    ELink   db=gene, dbfrom=omim                 -> gene IDs

    gene IDs & gene IDs                          -> final gene list
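If bs4 is unavailable, the same Id extraction can be done with the standard library's xml.etree.ElementTree. A sketch on a hand-made fragment shaped like ELink output (the Id values here are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A minimal, hand-made fragment in the shape of ELink's XML output
# (the Id values are invented for illustration).
sample = """<eLinkResult><LinkSet><LinkSetDb>
  <Link><Id>12345</Id></Link>
  <Link><Id>67890</Id></Link>
</LinkSetDb></LinkSet></eLinkResult>"""

root = ET.fromstring(sample)
ids = [link.findtext("Id") for link in root.iter("Link")]
# ids == ['12345', '67890']
```
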
My tasks
✓ 1. Produce a list of human genes that are associated with at least one resolved structure in PDB AND at least one genetic disorder in OMIM
  2. Make an online table to display them (next time!)

Summary
Using NCBI EUtils to produce a data set under given constraints was relatively straightforward. The resulting code is highly re-usable for future workflows (especially if written as generic functions).

Python modules used
- PyCogent: simple request handling for the main EUtils. pycogent.org
- urllib2: general HTTP request handler. docs.python.org/2/library/urllib2.html
- BeautifulSoup: amazingly easy-to-use object model for XML/HTML. www.crummy.com/software/BeautifulSoup/bs4/doc/
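The final step of the gene-list workflow above is just a set intersection. A sketch with made-up ID lists standing in for the two ELink results:

```python
# Hypothetical placeholders for the two ELink outputs:
# gene IDs reached via PDB structures, and via OMIM records.
genes_with_structure = {"7157", "7153", "672"}
genes_with_disorder = {"7157", "672", "5594"}

# Genes with at least one resolved structure AND at least one disorder.
gene_list = sorted(genes_with_structure & genes_with_disorder)
# gene_list == ['672', '7157']
```
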