ISMB2010_poster

BioPerl at 15: New Features, New Directions
Christopher J. Fields, University of Illinois, cjfields@uiuc.edu
*Mark A. Jensen, Fortinbras Research and SRA International, mark_jensen@sra.com
Jason E. Stajich, University of California at Riverside, jason.stajich@ucr.edu
The BioPerl Project, an open-source Perl toolkit for bioinformatics, was initiated in 1995 and became instrumental in the automated organization and analysis of original Human
Genome Project data. Since then, BioPerl has become a complete object-oriented Perl environment for bioinformatics development, with modules to perform a wide range of
bioinformatics functions, including multi-format parsing and translation, object-relational model databasing, EMBL and NCBI web service access, and external program execution.
The BioPerl developer community is actively responding to the far-reaching changes in the field that have taken place over the last several years. Major goals are: (1) to provide new
functionality useful to researchers at the cutting edge of bioinformatics, (2) to reorganize BioPerl into smaller application-oriented packages, (3) to deprecate older modules whose
utility has declined substantially, and (4) to continue to expand and improve documentation, so that BioPerl remains useful and relevant in the years ahead.
Community participation and development
New features
New directions
Timeline
Next-gen sequencing support
Shattering the Monolith
BioPerl has grown in its user and
developer base since those early days.
New developers and collaborations have
contributed not only key modules, but
also important design methodologies and
refactoring over the years that have
helped BioPerl to maintain its usefulness
and relevance. Discontinuities followed by
increases in lines of code over time reflect
a high level of community flexibility and
dedication in pursuit of DTWT.
Bringing BioPerl up to speed for next-gen sequence data handling has led to efforts along three lines: file format
standardization, common command-line tool wrapping, and BioPerl object system I/O integration tailored to next-gen data.
BioPerl continues to be distributed as just a handful of
packages. The core package in particular has grown to 341
files, comprising 874 classes with 23,146 tests. Maintenance
and installation issues are barriers to developers and users
alike. We are in the process of splitting the core into
reasonable, application-related chunks. This plus the git
migration should significantly improve BioPerl management.
Formats
BioPerl and other Bio* projects recently published a
collaborative effort to standardize FASTQ formats,
including variants for Illumina and Solexa platforms.
These formats are now in use across BioPerl and the
Bio* projects.
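The difference between the FASTQ variants can be sketched in a few lines of plain Perl. This is a minimal illustration and not BioPerl's parser code: the function names and the %VARIANT table are hypothetical, and the offsets and scales are the commonly described ones (Sanger: ASCII offset 33, Phred scale; Solexa: offset 64, Solexa scale; Illumina 1.3+: offset 64, Phred scale).

```perl
use strict;
use warnings;

# ASCII offset and score scale for each FASTQ variant
# (illustrative table; names are hypothetical).
my %VARIANT = (
    sanger   => { offset => 33, scale => 'phred'  },
    solexa   => { offset => 64, scale => 'solexa' },
    illumina => { offset => 64, scale => 'phred'  },
);

# Convert a Solexa-scaled score to the equivalent Phred score.
sub solexa_to_phred {
    my ($q) = @_;
    return 10 * log(10 ** ($q / 10) + 1) / log(10);
}

# Decode a quality string into a list of Phred scores.
sub decode_quality {
    my ($qual, $variant) = @_;
    my $spec = $VARIANT{$variant}
        or die "Unknown FASTQ variant '$variant'\n";
    my @scores = map { ord($_) - $spec->{offset} } split //, $qual;
    @scores = map { solexa_to_phred($_) } @scores
        if $spec->{scale} eq 'solexa';
    return @scores;
}

# Two different encodings of the same Phred scores.
my @sanger   = decode_quality('II5+', 'sanger');
my @illumina = decode_quality('hhTJ', 'illumina');
print "sanger:   @sanger\n";
print "illumina: @illumina\n";
```

Reading a file under the wrong variant shifts every score by 31 or changes the scale, which is why an agreed set of named variants matters.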
Support for important binary formats (BAM, BigWig) is
provided by wrappers for command-line tools and by the
integration of fast XS-based Perl modules such as
Lincoln Stein's Bio-SamTools and Bio-BigFile CPAN
packages.
source: http://www.ohloh.net/p/bioperl
Wrappers
Tools wrapped: bedtools, bowtie, bwa, minimo, newbler, samtools
Enhancements to the Bio::Tools::Run::WrapperBase
system have made it easier to add BioPerl wrapper
modules for external programs, and to integrate
these into other modules that implement pipelines
using BioPerl sequence and alignment objects as I/O.
The BioPerl wiki
(http://bioperl.org)
The wiki is now the central
location for all BioPerl
documentation: installation,
module POD, HOWTO articles,
code snippets, and personnel
descriptions. It has played an
important role as the new face
of BioPerl and as a landing place for
the developer discussions that
are taking BioPerl forward.
Intermediate layers for large file handling and generic parsing
BioPerl parsers generally take raw data directly to Perl objects, with no intermediate layer. This incurs
prohibitive overhead when parsing large files and can also limit user flexibility: a user may want the parsing
but not the BioPerl objects. The first problem is being tackled by attaching backend handlers to
container class constructors that are able to persist records of large files efficiently, creating BioPerl objects
only as needed or desired. The second problem has led to experiments in generic parsing: data file
records are parsed into a simple stream of hashes, which can then be directed wherever the user desires,
whether into the creation of BioPerl objects as usual or elsewhere.
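The "stream of hashes" idea can be sketched in plain Perl as follows. This is a minimal illustration under stated assumptions, not the experimental BioPerl code: fasta_hash_stream is a hypothetical name, FASTA is chosen only because it is simple, and the hash keys (id, desc, seq) are made up for the example.

```perl
use strict;
use warnings;

# Return an iterator (closure) that yields one FASTA record per call
# as a plain hash reference, or undef at end of input.
sub fasta_hash_stream {
    my ($fh) = @_;
    my $pending;    # header read ahead while finishing the previous record
    return sub {
        my ($header, @seq_lines) = ($pending);
        undef $pending;
        while (my $line = <$fh>) {
            chomp $line;
            next unless length $line;
            if ($line =~ /^>(.*)/) {
                if (defined $header) { $pending = $1; last; }
                $header = $1;
            }
            elsif (defined $header) {
                push @seq_lines, $line;
            }
        }
        return unless defined $header;
        my ($id, $desc) = split /\s+/, $header, 2;
        return { id => $id // '', desc => $desc // '',
                 seq => join('', @seq_lines) };
    };
}

my $data = ">seq1 first test\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $next = fasta_hash_stream($fh);
while (my $rec = $next->()) {
    # The consumer decides what each hash becomes: a sequence object,
    # a database row, or just a report line, as here.
    printf "%s\t%d\n", $rec->{id}, length $rec->{seq};
}
```

Because only one record is held in memory at a time and no objects are built unless the consumer asks for them, the same loop works for files of any size.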
General wrapper facility
maq assembly pipeline
Convert plain text sequence: fasta2bfa, fastq2bfq
Map reads to reference seq: map, mapmerge
Assemble map into consensus: assemble
BioPerl object support : Bio::Assembly
The Bio::Assembly system has been extensively updated, to
include reading and/or writing assemblies in MAQ, BAM,
SAM, BWA, and other formats. Assembly object support is
integrated into run wrappers for bwa, bedtools, maq, and
samtools. Future work will incorporate new sequence objects
that are optimized for large files (through the work of GSoC
student Jun Yin).
A set of modules (Bio::Tools::WrapperMaker) is under
development that will increase the responsiveness of
BioPerl development by providing an XML-based way
for users themselves to specify the interface for their
favorite command-line programs, at the same time
creating a common, consistent API for executing
those programs and accessing output.
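The general idea of a declarative interface description can be sketched as follows. This is not the Bio::Tools::WrapperMaker API: the program "mytool", its flags, the %SPEC layout, and build_command are all hypothetical, and a Perl hash stands in for the XML document that would carry the same information.

```perl
use strict;
use warnings;

# Hypothetical interface description for an imaginary program "mytool".
# In an XML-based system these facts would live in an XML file.
my %SPEC = (
    program  => 'mytool',
    commands => {
        align => {
            options  => { threads => '-t', out => '-o' },  # take a value
            switches => { quiet => '-q' },                 # boolean flags
        },
    },
);

# Translate a command name plus friendly option names into an argument list.
sub build_command {
    my ($spec, $command, $opts, @files) = @_;
    my $def = $spec->{commands}{$command}
        or die "Unknown command '$command'\n";
    my @argv = ($spec->{program}, $command);
    for my $name (sort keys %$opts) {
        if (my $flag = $def->{options}{$name}) {
            push @argv, $flag, $opts->{$name};
        }
        elsif ($flag = $def->{switches}{$name}) {
            push @argv, $flag if $opts->{$name};
        }
        else {
            die "Option '$name' is not defined for '$command'\n";
        }
    }
    return (@argv, @files);
}

my @argv = build_command(\%SPEC, 'align',
                         { threads => 4, quiet => 1 }, 'reads.fq');
print "@argv\n";
```

Because every program is described by data rather than by a hand-written module, callers see one consistent way to name commands and options, and unknown options are rejected before anything is executed. The resulting list could be passed to system() in list form, which avoids shell quoting problems.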
Google Summer of Code
BioPerl has provided mentorship for GSoC projects for the past three years. These have resulted in material
additions to the codebase, and have been focused on expanding BioPerl's capabilities in format parsing and large
file processing.
Year  Sponsoring Institution  Student       Project                          Example Module
2008  NESCent                 Mira Han      PhyloXML parsing                 Bio::TreeIO::phyloxml
2009  NESCent                 Chase Miller  NeXML parsing                    Bio::Nexml
2010  OBF                     Jun Yin       Alignment subsystem refactoring  in progress
Extract info from consensus: mapview, cns2fq
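use Bio::Tools::Run::Maq;
# Create a wrapper factory, then run the maq assembly pipeline on
# two read files against a reference sequence.
my $maq = Bio::Tools::Run::Maq->new();
my $assy_obj = $maq->run('read1.fastq',    # reads
                         'refseq.fas',     # reference sequence
                         'read2.fastq');   # second read file (paired-end data)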
Biome and BioPerl 6
BioPerl has been object-oriented from the beginning, but suffers the weaknesses of Perl 5 objects: very
high overhead, loose encapsulation, limited object introspection, and the lack of built-in interfaces and
roles, among other things.
These issues are being addressed in two ways: in Perl 5 through the Moose classes and dependencies,
and in the creation of Perl 6. BioPerl is exploring both paths to true objects with the experimental Biome
(BioPerl with Metaobject Extensions) and BioPerl 6 projects.
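What "a role as an interface" means can be sketched in plain Perl 5. This is a hand-rolled stand-in for the kind of check that Moose::Role's `requires` performs at composition time; it is not Biome code, and every package and function name below is hypothetical.

```perl
use strict;
use warnings;

# A "role" here is just a named list of methods a class must provide.
my %ROLE_REQUIRES = (
    'Sketch::Role::Describable' => [qw(display_id description)],
);

# Die unless $class implements every method the role requires.
sub assert_does {
    my ($class, $role) = @_;
    my @missing = grep { !$class->can($_) } @{ $ROLE_REQUIRES{$role} };
    die "$class is missing [@missing] required by $role\n" if @missing;
    return 1;
}

package Sketch::Seq;
sub new         { my ($class, %args) = @_; return bless {%args}, $class }
sub display_id  { $_[0]{id} }
sub description { $_[0]{desc} }

package Sketch::Broken;
sub new        { return bless {}, shift }
sub display_id { 'none' }

package main;

assert_does('Sketch::Seq', 'Sketch::Role::Describable');
print "Sketch::Seq satisfies the role\n";

my $ok = eval { assert_does('Sketch::Broken', 'Sketch::Role::Describable') };
print "Sketch::Broken rejected: $@" unless $ok;
```

In plain Perl 5 this check has to be written and called by hand, which is the gap the poster describes; a metaobject system makes the same contract declarative and enforces it when the class is built.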
Tracking NCBI developments
In the past year, NCBI has released a fully updated BLAST toolkit, blast+†, and has been encouraging a move from
their EUtilities RESTful interface to a newer SOAP interface‡.
BioPerl on GitHub
(http://github.com/bioperl)
BioPerl recently migrated all active repositories to
GitHub from OBF-hosted Subversion. With the
move to git comes decentralization and more fluid,
independent development. We expect this to
improve the BioPerl response time both to bugs
and to new developments in the field, as well as
increase new developer recruitment and
community participation.
BioPerl has responded with Bio::Tools::Run::StandAloneBlastPlus and Bio::DB::SoapEUtilities. These were designed
not only to update the API interface, but also to add I/O layers that accept and parse messages into familiar BioPerl
objects, and to build in straightforward methods for creating pipelines of blast+ program analyses or EUtilities
fetches.
†ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
‡http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html
[Figure: Biome role as interface. Diagram labels: Role, Class, main::]
The BioPerl Core Development Team is Sendu Bala, Rob Buels, Christopher Fields, Mark Jensen, Hilmar Lapp, Heikki Lehväslaiho, Aaron Mackey, Dave Messina, Brian Osborne, Jason Stajich, and Lincoln Stein.
Key support is provided by Chris Dagdigian and Mauricio Herrera Cuadra. Florent Angly and Dan Kortschak are lead developers of projects discussed here.