Development guide for Engines for Biology, Revision 1
What is Engines for Biology?
Problems/Limitations
Bioinformatics has a number of well-established, ubiquitous flat-file formats.
These include FASTA (NCBI 2007), GenBank (NCBI 2010), PDB (PDB 2010), FASTQ,
and BLAST output formats (PDB 2010). These formats are employed by a large
number of existing tools for diverse applications such as sequence alignment and
structure prediction and as supported formats in APIs for Python (Cock, Antao et al.
2009), Perl (Stajich, Block et al. 2002), Ruby (Goto, Prins et al.), etc. As the
complexity of computational biology tasks increases, new file formats are created
that merge and integrate information to form new ontologies, such as in systems
biology with SBML (Hucka, Finney et al. 2003) and other related formats.
It is not uncommon for a given computational pipeline, such as those in
*omics research, to rely on a half dozen independent tools that read a variety of
different formats. Instruments on the benchtop or authoritative sources such as the
National Center for Biotechnology Information’s array of databases may generate
the original files. Converting those files into pipeline-specific formats weakens
the provenance tracking of a pipeline and adds unnecessary complexity. Worse,
several tools may compute on the same data yet each require it in a different
format. Finally, performing relational
algebra on the output data sets of those tools is a common task in the same
pipelines, for example, to join annotations from GenBank files to BLAST result sets.
Solution
Engines for Biology (E4B) is a family of ‘Storage Engines’ for the MySQL
relational database management system (RDBMS) (Oracle 2010). Storage engines
are a feature of MySQL that enable developers to present file formats of their
own design through a two-dimensional table model (Oracle 2010).
At the most basic level, the E4B storage engines allow SQL expressions to be
used to manipulate table representations of data stored in common file formats in
bioinformatics. They enable relational algebra on input and output formats that are
routinely employed in computational biology pipelines. This solves the common
problem of joining result sets from tools like BLAST with annotation data from
authoritative sources in SQL. This is done while preserving the original data format,
which aids provenance tracking. However, when format conversion is absolutely
necessary, it also allows file formats to be interchanged freely with SQL statements
that create tables by selecting data from existing tables.
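The kind of join described above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a MySQL server with E4B loaded, and the table names (blast_hits, annotations), column names, and sample rows are all assumptions invented for the example. With E4B, the two tables would instead be backed directly by a BLAST tab-delimited file and an annotation file.

```python
import sqlite3

# Hypothetical sample data: two BLAST tabular hits (query, subject, percent
# identity, e-value) and annotations keyed by subject identifier.
BLAST_ROWS = [
    ("read1", "geneA", 98.5, 1e-50),
    ("read2", "geneB", 91.0, 1e-20),
]
ANNOTATION_ROWS = [
    ("geneA", "DNA polymerase"),
    ("geneB", "hypothetical protein"),
]


def join_hits_to_annotations(blast_rows, annotation_rows):
    """Join BLAST hits to annotations with plain SQL (sqlite3 stand-in)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE blast_hits "
            "(query_id TEXT, subject_id TEXT, pct_identity REAL, evalue REAL)"
        )
        conn.execute("CREATE TABLE annotations (subject_id TEXT, product TEXT)")
        conn.executemany("INSERT INTO blast_hits VALUES (?, ?, ?, ?)", blast_rows)
        conn.executemany("INSERT INTO annotations VALUES (?, ?)", annotation_rows)
        cursor = conn.execute(
            "SELECT h.query_id, h.subject_id, a.product "
            "FROM blast_hits AS h "
            "JOIN annotations AS a ON a.subject_id = h.subject_id "
            "ORDER BY h.query_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in join_hits_to_annotations(BLAST_ROWS, ANNOTATION_ROWS):
        print(row)
```

The SELECT statement is the part that would carry over to MySQL; the CREATE TABLE and INSERT statements exist here only because the stand-in database has no flat files behind it.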
Implementation details
Description of MySQL Storage Engines
MySQL storage engines are dynamic libraries that are loaded at runtime by
the RDBMS. They contain a set of callbacks, defined by the MySQL storage engine API,
that are used to open, read from, insert into, and close tables. Building a storage
engine dynamic library requires access to the MySQL header files corresponding to
the run-time version of the RDBMS. More information can be found at the MySQL
Forge web site (MySQL 2010).
Engines for Biology as Storage Engines
Each file format supported by E4B is handled by a separate storage engine,
compiled as its own dynamic library and loaded by the RDBMS on first use. The
engines as implemented support table scans and append-only inserts. Two sets
of storage engines exist: (1) tested and in use, including FASTA, FASTQ, and BLAST
tab-delimited output; and (2) partially implemented, undocumented, or untested,
including GenBank, PDB, and XML-based formats like SBML. The former are
actively employed in pipelines and include minimal test harnesses, while the latter are
not employed directly in any current work and have no testing apparatus available.
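To make the access pattern concrete, here is a minimal Python model of a table backed by a FASTA file that supports only full scans and appends. It is an analogy, not E4B code: the real engines are dynamic libraries built against the MySQL headers, where a scan is typically driven through handler callbacks such as rnd_init() and rnd_next() and an insert through write_row(). The class name FastaFileTable, the three-column row layout, and the 60-character line width are assumptions made for the sketch.

```python
import os
import tempfile


class FastaFileTable:
    """Toy model of a flat-file-backed table: full scans and appends only."""

    def __init__(self, path):
        self.path = path
        # Create an empty file if none exists so that a scan of a new
        # table succeeds and simply returns no rows.
        if not os.path.exists(path):
            open(path, "w").close()

    def scan(self):
        """Yield (identifier, description, sequence) rows in file order."""
        header, chunks = None, []
        with open(self.path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield self._row(header, chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
        if header is not None:
            yield self._row(header, chunks)

    def append(self, identifier, description, sequence, width=60):
        """Add one record at the end of the file; existing bytes are untouched."""
        with open(self.path, "a") as handle:
            handle.write(f">{identifier} {description}".rstrip() + "\n")
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")

    @staticmethod
    def _row(header, chunks):
        identifier, _, description = header.partition(" ")
        return identifier, description, "".join(chunks)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
    table = FastaFileTable(demo_path)
    table.append("seq1", "example record", "ACGT" * 20)
    for row in table.scan():
        print(row)
```

Note that there is deliberately no update, delete, or keyed lookup: every read walks the file from the beginning, which is the trade-off accepted by leaving the original file format intact.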
As noted above, only table scans and append-only inserts are supported. At this
time, there is no intention to implement indexing. Many tools that use the
targeted file formats can read compressed files, e.g. via zlib (libz). A possible
improvement would be for the storage engines to offer read-only table
access to compressed files. Many of the aforementioned formats compress
quite well, particularly the sequence formats such as FASTA.
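A read-only scan over a possibly compressed file could look roughly like the sketch below. It is written in Python with the standard gzip module purely to illustrate the idea; an actual engine would presumably call zlib from its own code. The function names open_maybe_gzip and scan_fasta_headers are hypothetical, and only gzip is handled.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file for reading, transparently handling gzip."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # two-byte gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")


def scan_fasta_headers(path):
    """Read-only scan yielding FASTA header lines without the leading '>'."""
    with open_maybe_gzip(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].rstrip("\n")
```

Detecting the format from the file's first bytes rather than from its name means the same table definition would work whether or not the underlying file has been compressed. Appends are left out on purpose, which matches the read-only scope suggested in the text.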
At run-time, a user connects an RDBMS table to an existing flat file in the
filesystem by tagging a ‘CREATE TABLE’ statement with both the ‘ENGINE’ attribute,
which identifies the storage engine, and a subsequent ‘CONNECTION’ attribute, which
names the filesystem file. If the flat file does not exist, it is created. While the ENGINE attribute
is relatively common in MySQL, the CONNECTION attribute is appropriated from the
replicated / remote MySQL realm and is much less commonly employed.
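The shape of such a statement can be sketched with a small Python helper that composes it as a string. ENGINE and CONNECTION are genuine MySQL table options, but everything else here is an assumption for illustration: the engine name FASTA, the table name, the column names and types, and the file path are hypothetical and are not taken from the E4B source.

```python
import re

# Only simple identifiers are accepted; this is a sketch, not a full
# SQL quoting routine.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_table_statement(table, columns, engine, file_path):
    """Compose a CREATE TABLE statement that ties a table to a flat file.

    columns is a list of (name, sql_type) pairs. The engine name and the
    column layout are hypothetical; use whatever the installed engine expects.
    """
    for name in [table, engine] + [col for col, _ in columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a simple identifier: {name!r}")
    column_sql = ", ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    # Escape backslashes and single quotes for a MySQL string literal.
    escaped_path = file_path.replace("\\", "\\\\").replace("'", "''")
    return (
        f"CREATE TABLE `{table}` ({column_sql}) "
        f"ENGINE={engine} CONNECTION='{escaped_path}'"
    )


if __name__ == "__main__":
    print(
        create_table_statement(
            "reads",
            [("id", "VARCHAR(255)"), ("sequence", "LONGTEXT")],
            "FASTA",
            "/data/reads.fasta",
        )
    )
```

The resulting string would be sent to the server through any MySQL client; after that, ordinary SELECT and INSERT statements against the table read from and append to the named file.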
Goals
The intention for E4B is to make the library available via the GPL to the
broader community, perhaps through an entity like SourceForge. However, before
this can be done, several features need to be added, specifically for building
E4B storage engine dynamic libraries and providing minimal documentation. These
needs are described below.
Configuration for Multiple Host Platforms
MySQL is available for a range of operating system platforms including Linux,
SunOS, Mac OS X, and Windows, all of which host bioinformatics
pipelines. As such, it is important that E4B can be readily built on these platforms in
the same manner as other ‘third party’ / ‘user contributed’ storage engines and
MySQL itself. For the *NIX-like platforms, GNU Autoconf is commonly
employed to generate the canonical ‘./configure’ script that performs the necessary
Makefile modification / generation specific to a given host environment (GNU
2010). In some cases, for example when Cygwin is available, Autoconf will also work on
Windows. Autoconf should be deployed into the E4B codebase to enable
automatic configuration of E4B for a variety of host environments. As it stands now,
E4B contains a Makefile that must be edited for each host environment, compiler,
and MySQL version (see next section).
Configuration for Installed Version of MySQL
In addition to basic Autoconf deployment for host-specific features, the
configuration environment must also be able to handle reasonable versions of
MySQL, e.g. 5.1 and beyond. Specifically, the configuration tool chain must allow the
user to specify the location of the MySQL include directory that corresponds to the
installed version of MySQL. This could be accomplished by adding a ‘--mysql-include-dir’ flag to ‘./configure’.
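When no include directory is given explicitly, a configuration step could fall back on the mysql_config utility that ships with MySQL, whose --include option prints the compiler include flags for the installed version. The Python sketch below shows that lookup logic only; it is not part of E4B, the function names are hypothetical, and a real Autoconf macro would express the same idea in shell.

```python
import shlex
import subprocess


def parse_include_dirs(flags):
    """Extract directories from compiler flags such as '-I/usr/include/mysql'."""
    return [
        token[2:]
        for token in shlex.split(flags)
        if token.startswith("-I") and len(token) > 2
    ]


def detect_mysql_include_dirs(override=None, mysql_config="mysql_config"):
    """Prefer an explicit directory; otherwise ask mysql_config.

    override models a user-supplied configure option; mysql_config is the
    name or path of the MySQL helper program and must be installed for the
    fallback branch to work.
    """
    if override:
        return [override]
    result = subprocess.run(
        [mysql_config, "--include"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_include_dirs(result.stdout)
```

Letting an explicit value win over auto-detection matters on hosts with several MySQL installations, where the first mysql_config on the PATH may not belong to the server that will load the engine.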
Documentation
At this time, virtually no documentation for E4B exists. The distribution must
include a ‘0README’ file in the distribution root that provides basic installation
instructions via the autoconf apparatus. In addition, the examples that have been
developed thus far will need to be thoroughly vetted, possibly improved upon, and
referenced as tests in the installation / build procedure.
References
Cock, P. J., T. Antao, et al. (2009). "Biopython: freely available Python tools for
computational molecular biology and bioinformatics." Bioinformatics 25(11): 1422-3.
SUMMARY: The Biopython project is a mature open source international
collaboration of volunteer developers, providing Python libraries for a wide
range of bioinformatics problems. Biopython includes modules for reading
and writing different sequence file formats and multiple sequence
alignments, dealing with 3D macro molecular structures, interacting with
common tools such as BLAST, ClustalW and EMBOSS, accessing key online
databases, as well as providing numerical methods for statistical learning.
AVAILABILITY: Biopython is freely available, with documentation and source
code at (www.biopython.org) under the Biopython license.
GNU (2010). "Autoconf - GNU Project - Free Software Foundation (FSF)."
Retrieved September 3, 2010, from http://www.gnu.org/software/autoconf/.
Goto, N., P. Prins, et al. "BioRuby: Bioinformatics software for the Ruby
programming language." Bioinformatics.
SUMMARY: The BioRuby software toolkit contains a comprehensive set of
free development tools and libraries for bioinformatics and molecular
biology, written in the Ruby programming language. BioRuby has
components for sequence analysis, pathway analysis, protein modelling and
phylogenetic analysis; it supports many widely used data formats and
provides easy access to databases, external programs and public web
services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes
with a tutorial, documentation and an interactive environment, which can be
used in the shell, and in the web browser. AVAILABILITY: BioRuby is free and
open source software, made available under the Ruby license. BioRuby runs
on all platforms that support Ruby, including Linux, Mac OS X and Windows.
And, with JRuby, BioRuby runs on the Java Virtual Machine. The source code
is available from http://www.bioruby.org/. CONTACT: Toshiaki Katayama
(katayama@bioruby.org); Queries should be directed to the BioRuby mailing
list.
Hucka, M., A. Finney, et al. (2003). "The systems biology markup language (SBML): a
medium for representation and exchange of biochemical network models."
Bioinformatics 19(4): 524-31.
MOTIVATION: Molecular biotechnology now makes it possible to build
elaborate systems models, but the systems biology community needs
information standards if models are to be shared, evaluated and developed
cooperatively. RESULTS: We summarize the Systems Biology Markup
Language (SBML) Level 1, a free, open, XML-based format for representing
biochemical reaction networks. SBML is a software-independent language for
describing models common to research in many areas of computational
biology, including cell signaling pathways, metabolic pathways, gene
regulation, and others. AVAILABILITY: The specification of SBML Level 1 is
freely available from http://www.sbml.org/
MySQL (2010). "MySQL Internals Custom Engine - MySQL Forge Wiki."
Retrieved September 3, 2010, from
http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine.
NCBI (2007). "FASTA format description." Retrieved September 3, 2010, from
http://www.ncbi.nlm.nih.gov/blast/fasta.shtml.
NCBI (2010). "The DDBJ/EMBL/GenBank Feature Table Definition." Retrieved
September 3, 2010, from http://www.ncbi.nlm.nih.gov/collab/FT/.
Oracle (2010). "MySQL :: The world's most popular open source database."
Retrieved September 3, 2010, from http://www.mysql.com.
PDB (2010). "Atomic Coordinate Entry Format Version 3.2." Retrieved September
3, 2010, from http://www.wwpdb.org/documentation/format32/v3.2.html.
Stajich, J. E., D. Block, et al. (2002). "The Bioperl toolkit: Perl modules for the life
sciences." Genome Res 12(10): 1611-8.
The Bioperl project is an international open-source collaboration of
biologists, bioinformaticians, and computer scientists that has evolved over
the past 7 yr into the most comprehensive library of Perl modules available
for managing and manipulating life-science information. Bioperl provides an
easy-to-use, stable, and consistent programming interface for bioinformatics
application programmers. The Bioperl modules have been successfully and
repeatedly used to reduce otherwise complex tasks to only a few lines of
code. The Bioperl object model has been proven to be flexible enough to
support enterprise-level applications such as EnsEMBL, while maintaining an
easy learning curve for novice Perl programmers. Bioperl is capable of
executing analyses and processing results from programs such as BLAST,
ClustalW, or the EMBOSS suite. Interoperation with modules written in
Python and Java is supported through the evolving BioCORBA bridge. Bioperl
provides access to data stores such as GenBank and SwissProt via a flexible
series of sequence input/output modules, and to the emerging common
sequence data storage format of the Open Bioinformatics Database Access
project. This study describes the overall architecture of the toolkit, the
problem domains that it addresses, and gives specific examples of how the
toolkit can be used to solve common life-sciences problems. We conclude
with a discussion of how the open-source nature of the project has
contributed to the development effort.