BioPerl at 15: New Features, New Directions

Christopher J. Fields, University of Illinois, cjfields@uiuc.edu
*Mark A. Jensen, Fortinbras Research and SRA International, mark_jensen@sra.com
Jason E. Stajich, University of California at Riverside, jason.stajich@ucr.edu

The BioPerl Project, an open-source Perl toolkit for bioinformatics, was initiated in 1995 and became instrumental in the automated organization and analysis of original Human Genome Project data. Since then, BioPerl has become a complete object-oriented Perl environment for bioinformatics development, with modules to perform a wide range of bioinformatics functions, including multi-format parsing and translation, object-relational model databasing, EMBL and NCBI web service access, and external program execution. The BioPerl developer community is actively responding to the far-reaching changes in the field that have taken place over the last several years. Major goals are: (1) to provide new functionality useful to researchers at the cutting edge of bioinformatics, (2) to reorganize BioPerl into smaller application-oriented packages, (3) to deprecate older modules whose utility has declined substantially, and (4) to continue to expand and improve documentation, so that BioPerl remains useful and relevant in the years ahead.

Sections: Community participation and development (Timeline); New features (Next-gen sequencing support); New directions (Shattering the Monolith)

Community participation and development

Timeline

BioPerl has grown in its user and developer base since those early days. New developers and collaborations have contributed not only key modules, but also important design methodologies and refactoring over the years that have helped BioPerl to maintain its usefulness and relevance. Discontinuities followed by increases in lines of code over time reflect a high level of community flexibility and dedication in pursuit of DTWT.
[Timeline figure source: http://www.ohloh.net/p/bioperl]

The BioPerl wiki (http://bioperl.org)

The wiki is now the central location for all BioPerl documentation: installation, module POD, HOWTO articles, code snippets, and personnel descriptions. It has played an important role as the new face of BioPerl and as a landing place for the developer discussions that are taking BioPerl forward.

Next-gen sequencing support

Bringing BioPerl up to speed for next-gen sequence data handling has led to efforts along three lines: file format standardization, common command-line tool wrapping, and BioPerl object system I/O integration tailored to next-gen data.

Formats

BioPerl and other Bio* projects recently published a collaborative effort to standardize FASTQ formats, including variants for Illumina and Solexa platforms. These formats are now in use across BioPerl and the Bio* projects. Support for important binary formats (BAM, BigWig) is provided by wrappers for command-line tools and by the integration of fast XS-based Perl modules, such as Lincoln Stein's Bio-SamTools and Bio-BigFile CPAN packages.

Wrappers

Enhancements to the Bio::Tools::Run::WrapperBase system have made it easier to add BioPerl wrapper modules for external programs, and to integrate these into other modules that implement pipelines using BioPerl sequence and alignment objects as I/O. Tools shown: bedtools, bowtie, bwa, minimo, newbler, samtools.

Shattering the Monolith

BioPerl continues to be distributed as just a handful of packages. The core package in particular has grown to 341 files, comprising 874 classes with 23,146 tests. Maintenance and installation issues are barriers to developers and users alike. We are in the process of splitting the core into reasonable, application-related chunks. Together with the git migration, this should significantly improve BioPerl management.

Intermediate layers for large file handling and generic parsing

BioPerl parsers generally take raw data to Perl objects with no intermediate layer.
This induces prohibitive overhead when parsing large files, and can also limit user flexibility: parsing may be desired, but not the BioPerl objects. The first problem is being tackled by attaching backend handlers onto container class constructors that are able to persist records of large files efficiently, creating BioPerl objects only as needed or desired. The second problem has led to experiments in generic parsing: data file records are parsed into a simple stream of hashes, which can then be directed wherever the user desires, whether into the creation of BioPerl objects as usual or elsewhere.

maq assembly pipeline

Convert plain text sequence: fasta2bfa, fastq2bfq
Map reads to reference seq: map, mapmerge
Assemble map into consensus: assemble

BioPerl object support: Bio::Assembly

The Bio::Assembly system has been extensively updated to include reading and/or writing assemblies in MAQ, BAM, SAM, BWA, and other formats. Assembly object support is integrated into run wrappers for bwa, bedtools, maq, and samtools. Future work will incorporate new sequence objects that are optimized for large files (through the work of GSoC student Jun Yin).

General wrapper facility

A set of modules (Bio::Tools::WrapperMaker) is under development that will increase the responsiveness of BioPerl development by providing an XML-based way for users themselves to specify the interface for their favorite command-line programs, while creating a common, consistent API for executing those programs and accessing their output.

Google Summer of Code

BioPerl has provided mentorship for GSoC projects for the past three years. These have resulted in material additions to the codebase, and have been focused on expanding BioPerl's capabilities in format parsing and large file processing.
Year  Sponsoring Institution  Student       Project                          Example Module
2008  NESCent                 Mira Han      PhyloXML parsing                 Bio::TreeIO::phyloxml
2009  NESCent                 Chase Miller  NeXML parsing                    Bio::Nexml
2010  OBF                     Jun Yin       Alignment subsystem refactoring  in progress

maq assembly pipeline, continued

Extract info from consensus: mapview, cns2fq

Wrapper example:

    use strict;
    use warnings;
    use Bio::Tools::Run::Maq;

    # Run the wrapped maq pipeline on the read and reference files.
    my $maq = Bio::Tools::Run::Maq->new();
    my $assy_obj = $maq->run('read1.fastq', 'refseq.fas', 'read2.fastq');

BioPerl on GitHub (http://github.com/bioperl)

BioPerl recently migrated all active repositories to GitHub from OBF-hosted Subversion. With the move to git comes decentralization and more fluid, independent development. We expect this to improve the BioPerl response time both to bugs and to new developments in the field, as well as to increase new developer recruitment and community participation.

Tracking NCBI developments

In the past year, NCBI has released a fully updated BLAST toolkit, blast+†, and has been encouraging a move from their EUtilities RESTful interface to a newer SOAP interface‡. BioPerl has responded with Bio::Tools::Run::StandAloneBlastPlus and Bio::DB::SoapEUtilities. These were designed not only to update the API, but also to add I/O layers that accept and parse messages into familiar BioPerl objects, and to build in straightforward methods for creating pipelines of blast+ program analyses or EUtilities fetches.

Biome and BioPerl 6

BioPerl has been object-oriented from the beginning, but suffers the weaknesses of Perl 5 objects: very high overhead, loose encapsulation, limited object introspection, and the lack of built-in interfaces and roles, among other things. These issues are being addressed in two ways: in Perl 5 through the Moose classes and dependencies, and in the creation of Perl 6. BioPerl is exploring both paths to true objects with the experimental Biome (BioPerl with Metaobject Extensions) and BioPerl 6 projects.

[Figure labels: Role, Class]
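To make the idea of a role acting as an interface concrete, here is a minimal sketch in plain Perl 5. It is not the Biome or Moose implementation, and every package and method name in it (My::Role::Describable, My::Seq, My::Incomplete, summary, does) is hypothetical. It only illustrates the general pattern: a role names the methods a class must provide, refuses to compose with a class that lacks them, and contributes shared behaviour to classes that qualify.

```perl
use strict;
use warnings;

# Hypothetical role: names the methods a consuming class must provide
# and contributes one shared method, summary().
package My::Role::Describable;

sub required_methods { return qw(display_id description) }

sub apply {
    my ($role, $class) = @_;
    # Enforce the interface at composition time, not at first call.
    for my $method ($role->required_methods) {
        die "$class must implement '$method' to consume $role\n"
            unless $class->can($method);
    }
    no strict 'refs';
    # Install the behaviour the role provides.
    *{"${class}::summary"} = sub {
        my $self = shift;
        return $self->display_id . ': ' . $self->description;
    };
    # Record the composition so objects can be asked what they do.
    push @{"${class}::ROLES"}, $role;
    return 1;
}

# Hypothetical class that satisfies the role.
package My::Seq;

sub new         { my ($class, %args) = @_; return bless {%args}, $class }
sub display_id  { return $_[0]{display_id} }
sub description { return $_[0]{description} }
sub does {
    my ($self, $role) = @_;
    my $class = ref($self) || $self;
    no strict 'refs';
    return scalar grep { $_ eq $role } @{"${class}::ROLES"};
}

# Hypothetical class that lacks description(), so composition must fail.
package My::Incomplete;

sub display_id { return 'x' }

package main;

My::Role::Describable->apply('My::Seq');
my $seq = My::Seq->new(display_id => 'seq1', description => 'example record');
print $seq->summary, "\n";

my $ok = eval { My::Role::Describable->apply('My::Incomplete'); 1 };
print "My::Incomplete rejected: $@" unless $ok;
```

In a Moose-based design the same checks would presumably be declared rather than hand-written, which is part of the appeal of that route over raw Perl 5 objects.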
[Figure: Biome role as interface (label: main::)]

†ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
‡http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html

The BioPerl Core Development Team is Sendu Bala, Rob Buels, Christopher Fields, Mark Jensen, Hilmar Lapp, Heikki Lehväslaiho, Aaron Mackey, Dave Messina, Brian Osborne, Jason Stajich, and Lincoln Stein. Key support is provided by Chris Dagdigian and Mauricio Herrera Cuadra. Florent Angly and Dan Kortschak are lead developers of projects discussed here.