Development guide for Engines for Biology, Revision 1

What is Engines for Biology?

Problems/Limitations

Bioinformatics has a number of well-established, ubiquitous flat file formats. These include FASTA (NCBI 2007), GenBank (NCBI 2010), PDB (PDB 2010), FASTQ, and BLAST output formats (PDB 2010). These formats are employed by a large number of existing tools for diverse applications such as sequence alignment and structure prediction, and they are supported in APIs for Python (Cock, Antao et al. 2009), Perl (Stajich, Block et al. 2002), Ruby (Goto, Prins et al.), etc. As the complexity of computational biology tasks increases, new file formats are created that merge and integrate information to form new ontologies, such as SBML in systems biology (Hucka, Finney et al. 2003) and other related formats.

It is not uncommon for a given computational pipeline, such as those in *omics research, to rely on a half dozen independent tools that read a variety of different formats. The original files may be generated by instruments on the benchtop or obtained from authoritative sources such as the National Center for Biotechnology Information’s array of databases. Converting these files to pipeline-specific formats weakens the provenance-tracking capabilities of pipelines and adds unnecessary complexity. Worse, several tools may compute on the same data but require that data to exist in different formats. Finally, performing relational algebra on the output data sets of those tools is a common task in the same pipelines, for example, to join annotations from GenBank files to BLAST result sets.

Solution

Engines for Biology (E4B) is a family of ‘Storage Engines’ for the MySQL relational database management system (RDBMS) (Oracle 2010). Storage engines are a MySQL feature that enables developers to expose file formats of their own design through a two-dimensional table model (Oracle 2010).
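The GenBank-to-BLAST join mentioned under Problems/Limitations is the kind of operation a table model makes straightforward. As a rough illustration only, the sketch below uses Python's built-in sqlite3 module as a stand-in for the RDBMS; it does not use MySQL or E4B. The table names, column names, and rows are all invented for the example and do not reflect the actual E4B table layouts.

```python
import sqlite3

# Stand-in for the RDBMS: an in-memory SQLite database, not MySQL or E4B.
db = sqlite3.connect(":memory:")

# Hypothetical table layouts; the column names are illustrative only.
db.execute("CREATE TABLE blast_hits (query_id TEXT, subject_id TEXT, evalue REAL)")
db.execute("CREATE TABLE annotations (accession TEXT, product TEXT)")

# Made-up rows standing in for a BLAST result set and GenBank annotations.
db.executemany(
    "INSERT INTO blast_hits VALUES (?, ?, ?)",
    [("q1", "ACC001", 1e-30), ("q2", "ACC002", 1e-5), ("q3", "ACC999", 0.5)],
)
db.executemany(
    "INSERT INTO annotations VALUES (?, ?)",
    [("ACC001", "example kinase"), ("ACC002", "example transporter")],
)

# Relational join: attach an annotation to every hit that has one.
joined = db.execute(
    """
    SELECT h.query_id, h.subject_id, a.product
    FROM blast_hits AS h
    JOIN annotations AS a ON a.accession = h.subject_id
    ORDER BY h.query_id
    """
).fetchall()
```

In this sketch the data has to be copied into the database first. With a storage engine, the same kind of query would run against tables backed directly by the original flat files.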
At the most basic level, the E4B storage engines allow SQL expressions to manipulate table representations of data stored in common bioinformatics file formats. They enable relational algebra on the input and output formats that are routinely employed in computational biology pipelines. This solves the common problem of joining result sets from tools like BLAST with annotation data from authoritative sources, using SQL. The original data format is preserved throughout, which aids provenance tracking. When format conversion is absolutely necessary, however, file formats can be interchanged freely with SQL statements that create tables by selecting data from existing tables.

Implementation details

Description of MySQL Storage Engines

MySQL storage engines are dynamic libraries that are loaded at runtime by the RDBMS. They contain a set of callbacks, defined by the MySQL storage engine API, that are used to open, read from, insert into, and close tables. Building a storage engine dynamic library requires access to the MySQL header files corresponding to the run-time version of the RDBMS. More information can be found at the MySQL Forge web site (MySQL 2010).

Engines for Biology as Storage Engines

A separate storage engine handles each file format supported by E4B. Each engine is compiled as a separate dynamic library that the RDBMS loads on first use. The engines as implemented support table scanning and append-only inserts. Two sets of storage engines exist: (1) tested and in use, including FASTA, FASTQ, and BLAST tab-delimited, and (2) partially implemented, undocumented, or untested, including GenBank, PDB, and XML-based formats like SBML. The former are actively employed in pipelines and include minimal test harnesses, while the latter are not employed directly in any current work and have no testing apparatus available. As noted, only table scans and append-only inserts are supported.
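To make the two supported operations concrete, here is a minimal sketch in Python of a file-backed FASTA table that offers a full scan and an append-only insert. It is not the E4B code and does not use the MySQL storage engine API; the class name FastaTable, the row layout (identifier, description, sequence), and the example records are assumptions made for illustration.

```python
import os
import tempfile
from typing import Iterator, NamedTuple


class FastaRow(NamedTuple):
    """One table row: a two-dimensional view of a single FASTA record."""
    identifier: str
    description: str
    sequence: str


class FastaTable:
    """Hypothetical file-backed table; not the MySQL storage engine interface."""

    def __init__(self, path: str) -> None:
        self.path = path
        # The guide states that a missing flat file is created.
        open(self.path, "a").close()

    def scan(self) -> Iterator[FastaRow]:
        """Full table scan: yield every record in file order."""
        header, chunks = None, []
        with open(self.path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        yield self._row(header, chunks)
                    header, chunks = line[1:], []
                elif line and header is not None:
                    chunks.append(line)
        if header is not None:
            yield self._row(header, chunks)

    def insert(self, row: FastaRow) -> None:
        """Append-only insert: new records are written at the end of the file."""
        header = f"{row.identifier} {row.description}".strip()
        with open(self.path, "a") as handle:
            handle.write(f">{header}\n{row.sequence}\n")

    @staticmethod
    def _row(header: str, chunks: list) -> FastaRow:
        # Split the header line into an identifier and a free-text description.
        identifier, _, description = header.partition(" ")
        return FastaRow(identifier, description, "".join(chunks))


# Usage: create a table over a new file, append two records, then scan.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
table = FastaTable(path)
table.insert(FastaRow("seq1", "first example", "ACGT"))
table.insert(FastaRow("seq2", "", "GGCC"))
rows = list(table.scan())
```

Because there are no updates or deletes, the file only ever grows at its end, and a scan is a single sequential pass over the file.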
At this time, there is no intention to implement indexing. Many tools that use the targeted file formats can read compressed files, e.g. via libz. A possible improvement would be read-only table access to compressed files by the storage engines. Many of the aforementioned formats compress quite well, particularly sequence formats such as FASTA.

At run-time, a user connects an RDBMS table to an existing flat file in the filesystem by tagging a ‘CREATE TABLE’ statement with both the ‘ENGINE’ attribute, which identifies the storage engine, and a subsequent ‘CONNECTION’ attribute, which names the file. If the flat file does not exist, it is created. While the ENGINE attribute is relatively common in MySQL, the CONNECTION attribute is appropriated from the replicated / remote MySQL realm and is much less commonly employed.

Goals

The intention is to make the E4B library available to the broader community under the GPL, perhaps through a hosting service like SourceForge. Before this can be done, however, several features need to be added, specifically related to building the E4B storage engine dynamic libraries and providing minimal documentation. These needs are described below.

Configuration for Multiple Host Platforms

MySQL is available for a range of operating system platforms including Linux, SunOS, Mac OS X, and Windows. These are all operating environments for bioinformatics pipelines. As such, it is important that E4B can be readily built on these platforms in the same manner as other ‘third party’ / ‘user contributed’ storage engines and MySQL itself. For the *NIX-like platforms, the GNU Autoconf kit is commonly employed to generate the canonical ‘./configure’ script that performs the necessary Makefile modification / generation specific to a given host environment (GNU 2010). In some cases, for example when Cygwin is available, autoconf will work for Windows.
The autoconf system should be deployed into the E4B codebase to enable automatic configuration of E4B for a variety of host environments. As it stands now, E4B contains a Makefile that must be edited for each host environment, compiler, and MySQL version (see next section).

Configuration for Installed Version of MySQL

In addition to basic autoconf deployment for host-specific features, the configuration environment must also handle reasonable versions of MySQL, e.g. 5.1 and beyond. Specifically, the configuration tool chain must allow the user to specify the location of the MySQL include directory that corresponds to the installed version of MySQL. This could be accomplished by adding a ‘--mysqlinclude-dir’ flag to ‘./configure’.

Documentation

At this time, virtually no documentation for E4B exists. The distribution must include a ‘0README’ file in its root that provides basic installation instructions via the autoconf apparatus. In addition, the examples developed thus far will need to be thoroughly vetted, possibly improved upon, and referenced as tests in the installation / build procedure.

References

Cock, P. J., T. Antao, et al. (2009). "Biopython: freely available Python tools for computational molecular biology and bioinformatics." Bioinformatics 25(11): 1422-3.

GNU (2010). "Autoconf - GNU Project - Free Software Foundation (FSF)." Retrieved September 3, 2010, from http://www.gnu.org/software/autoconf/.

Goto, N., P. Prins, et al. "BioRuby: Bioinformatics software for the Ruby programming language." Bioinformatics.

Hucka, M., A. Finney, et al. (2003). "The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models." Bioinformatics 19(4): 524-31.

MySQL (2010). "MySQL Internals Custom Engine - MySQL Forge Wiki." Retrieved September 3, 2010, from http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine.

NCBI (2007). "FASTA format description." Retrieved September 3, 2010, from http://www.ncbi.nlm.nih.gov/blast/fasta.shtml.

NCBI (2010). "The DDBJ/EMBL/GenBank Feature Table Definition." Retrieved September 3, 2010, from http://www.ncbi.nlm.nih.gov/collab/FT/.

Oracle (2010). "MySQL :: The world's most popular open source database." Retrieved September 3, 2010, from http://www.mysql.com.

PDB (2010). "Atomic Coordinate Entry Format Version 3.2." Retrieved September 3, 2010, from http://www.wwpdb.org/documentation/format32/v3.2.html.

Stajich, J. E., D. Block, et al. (2002). "The Bioperl toolkit: Perl modules for the life sciences." Genome Res 12(10): 1611-8.