Application Note GeneTrack – a genomic data processing and visualization framework Istvan Albert1,2*, Shinichiro Wachi2 , Cizhong Jiang2, and B. Franklin Pugh2 1 Huck Institutes of the Life Sciences, 2Center for Comparative Genomics and Bioinformatics, Center for Gene Regulation, and Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, 16802, USA Received on XXXXX; revised on XXXXX; accepted on XXXXX Associate Editor: XXXXXXX ABSTRACT Motivation: High-throughput ‘ChIP-chip’ and ‘ChIP-seq’ methodologies generate sufficiently large data sets that analysis poses significant informatics challenges, particularly for research groups with modest computational support. To address this challenge, we devised a software platform for storing, analyzing and visualizing high resolution genome-wide binding data. GeneTrack automates several steps of a typical data processing pipeline, including smoothing and peak detection, and facilitates dissemination of the results via the web. Our software is freely available via the Google Project Hosting environment at http://genetrack.googlecode.com 1 INTRODUCTION Current genomic technologies generate millions of data points from a single biological experiment. As these technologies are now mainstream, the resulting data are straining bioinformatics pipelines. Consequently, our understanding of genome regulation is becoming more analysis-limited rather than data-limited. First generation genome technologies (such as microarrays) increased data gathering capacity significantly. Bioinformatics tools to manage, display, and analyze this data were developed [1], including the UC-Santa Cruz genome browser [2], the Generic Genome Browser [3], GenoViz platform and the Integrated Genome Browser, NCBI for raw archiving, and GALAXY [4] for data analysis tools. Second generation technologies (such as tiling microarrays and whole-genome sequencing) have the potential to increase data acquisition dramatically. Here, we introduce a new software platform named GeneTrack that we have developed to automate and facilitate large-scale downstream data processing of chromatin immunoprecipitation data obtained via high-throughput sequencing and tiling arrays [5]. Our software is unique in that it integrates multiple steps of a typical data analysis process: smoothing, fitting, peak detection and visualization into a single workflow and that it performs to specifications on limited hardware. GeneTrack employs rapid processing and has low computational demands which make it suitable for exploratory data analyses where different fitting and peak detection parameters are varied and the results compared on the same display..GeneTrack was developed *To whom correspondence should be addressed. © Oxford University Press 2005 in the Python programming language, using the hierarchical data format (HDF) as its data storage backend and the NumPy library for numerical computation. For a better integration with existing tools final results are written into a relational database with support for all major databases: including MySQL, PostgreSQL, Oracle, and MS-SQL. Notably, the system does not require the presence of a separate database nor webserver for small scale deployments. Our software is able to use the SQLite relational database engine that is distributed with Python 2.5, moreover, it can present data through the web via its own embedded web server, thus is ideal for smaller, individual research groups. Setup, maintenance, and administration are controlled via a single program to provide the following functionality: store and retrieve data efficiently, quickly smooth data over an entire chromosome, combine strand specific data into a composite value, detect peaks rapidly over a chromosome-wide scale, display and publish results via an embedded webserver. 2 METHODS The hierarchical data format (HDF) provides a versatile data model that represents complex data objects and a wide variety of metadata in a portable file format with no limit on the number or size of data objects in the collection. We adopted HDF as the data storage backend for GeneTrack and we chose NumPyto provide highspeed numerical computation. GeneTrack works iteratively, processing data in individual, overlapping chunks, thus keeping the memory consumption low and independent of characteristics of the data. Data smoothing and averaging is accomplished by a Gaussian smoothing procedure as follows: a measurement at a genomic coordinate is approximated by values taken from a normal distribution of height equal to the measurement, centered over the measurement’s coordinate, and, with a standard deviation that acts as an external, tunable parameter (fitting tolerance). Consequently, each measurement is expanded into many values over a contiguous set of coordinates. Next, all individually fitted values are summed at each genomic coordinate resulting in a smooth and continuous “probabilistic landscape” where the peaks of the curve indicate the most likely positions of the measured genomic feature. Finally, the peak detection algorithm in GeneTrack operates by selecting the maximal non-neighboring subset from all local maxima in the 1 I.Albert et al. data. That is, the algorithm selects the highest peak along the chromosome, then establishes an exclusion zone (typically a few hundred bp), within which no other subsequent peaks are allowed to fall. In the rare occasion that two close peaks (inside the exclusion zone) have exactly equal height only one of them will be selected as valid. The process is repeated iteratively over the remain- Fig. 1. GeneTrack Composite Plot. Black vertical bars represent experimental sequencing reads that map to a genomic coordinate. The purple trace represents the smoothed fit over the read values. The purple horizontal tracks represent the 147 bp wide nucleosome predictions as given by the peak detection algorithm. The blue and red tracks are yeast ORFs on the forward (blue) and reverse (red) strands. ing space until no other peaks may be placed. This algorithm is well suited for problems where genomic features have a certain apriori known width (e.g. 147 bp nucleosomal DNA), Setting the exclusion zone to 0 turns off this feature, and allows the algorithm to to determine the optimal placement of heterogeneously-sized DNA fragments that are typically generated in ChIP-seq experiments. Fig. 1 illustrates the output derived from 1.3 million ChIPseq “reads” of the Saccharomyces genome. 3 RESULTS GeneTrack is controlled via a single program and a configuration file that specifies the processes that it should perform (various data analysis steps or web serving and/or plotting). The input for the analysis must be a tab-delimited text file that lists chromosome, genomic coordinates and values on forward or reverse strands. For experimental methodologies (ChIP-chip) that do not separate strands, the values on the reverse strand may be set to zero. Detailed documentation is available in a searchable Wiki format, containing installation instructions and other operational details (see main website). The code distribution also includes a dataset and configuration files for the work published elsewhere [5], packaged such that users may repeat a full data analysis and view the results via the embedded web server within minutes. Internally, all information is represented on forward, reverse and “composite” strands, where the composite strand is a combination (in the simplest case a sum) of the data on each individual strand. Since data on the forward and reverse strand represent two independent determinations of binding, this approach allows for individual error evaluations to be made whenever the data on the two strands are in disagreement.GeneTrack will automatically operate on each strand separately (Figure 2) and will derive the values for the composite strand as well. The software was designed with extensibility in mind. There is a clear separation of the database schemas, parsing, fitting and pre- 2 diction modules, to the extent that the schema or prediction algorithm that is to be invoked in a certain analysis run can be changed via the configuration file. Similarly, the output tracks and graphs are fully customizable and may be entirely replaced although this requires Python expertise. The currently distributed schemas are for sequencing data, but we are preparing a set of modules to Fig. 2. GeneTrack Two Strand Plot. Within the software full strand information is maintained. The same plot from Figure 1 is now visualized on the separate strands forward (blue) and reverse (red). An ideal signal would be perfectly mirrored; this type of display allows one to indentify systematic shifts in the data. Note how the fitting and peak prediction algorithms also operate on separate strands. streamline tiling array data processing. We are committed to providing a smooth data exchange with other existing data analysis and visualization platforms provided by UCSC and Ensemble. To that end we have implemented export functionality that produces results in BED, GFF or wiggle format. The software has been tested on Windows and Linux platforms and is believed to work on all major operating systems that can run Python and its extension libraries for HDF and Numerical Python. We maintain several GeneTrack instances to disseminate our results (see http://atlas.bx.psu.edu). Funding for the project has been provided by NIH R01-HG004160. REFERENCES 1. 2. 3. 4. 5. Quackenbush J: Computational approaches to analysis of DNA microarray data. Methods Inf Med 2006, 45 Suppl 1:91-103. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996-1006. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A et al: The generic genome browser: a building block for a model organism system database. Genome Res 2002, 12(10):1599-1610. Blankenberg D, Taylor J, Schenck I, He J, Zhang Y, Ghent M, Veeraraghavan N, Albert I, Miller W, Makova KD et al: A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome Res 2007, 17(6):960-964. Albert I, Mavrich TN, Tomsho LP, Qi J, Zanton SJ, Schuster SC, Pugh BF: Translational and rotational settings of H2A.Z nucleosomes across the Saccharo- GeneTrack – a genomic data processing and visualization framework myces cerevisiae genome. Nature 2007, 446(7135):572576. 3