Application Note
GeneTrack – a genomic data processing and visualization framework
Istvan Albert1,2*, Shinichiro Wachi2, Cizhong Jiang2, and B. Franklin Pugh2
1Huck Institutes of the Life Sciences, 2Center for Comparative Genomics and Bioinformatics, Center for Gene Regulation,
and Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania,
16802, USA
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Motivation: High-throughput ‘ChIP-chip’ and ‘ChIP-seq’ methodologies generate sufficiently large data sets that analysis poses significant informatics challenges, particularly for research groups with
modest computational support. To address this challenge, we devised a software platform for storing, analyzing and visualizing
high-resolution genome-wide binding data. GeneTrack automates several
steps of a typical data processing pipeline, including smoothing and
peak detection, and facilitates dissemination of the results via the
web. Our software is freely available via the Google Project Hosting
environment at http://genetrack.googlecode.com
1 INTRODUCTION
Current genomic technologies generate millions of data points
from a single biological experiment. As these technologies are now
mainstream, the resulting data are straining bioinformatics pipelines. Consequently, our understanding of genome regulation is
becoming analysis-limited rather than data-limited. First-generation
genome technologies (such as microarrays) increased
data-gathering capacity significantly. Bioinformatics tools to manage, display, and analyze these data were developed [1], including
the UC Santa Cruz genome browser [2], the Generic Genome
Browser [3], the GenoViz platform and the Integrated Genome
Browser, NCBI for raw data archiving, and GALAXY [4] for data
analysis. Second-generation technologies (such as tiling microarrays and whole-genome sequencing) have the potential to
increase data acquisition dramatically. Here, we introduce a new
software platform named GeneTrack that we have developed to
automate and facilitate large-scale downstream data processing of
chromatin immunoprecipitation data obtained via high-throughput
sequencing and tiling arrays [5]. Our software is unique in that it
integrates multiple steps of a typical data analysis process (smoothing, fitting, peak detection and visualization) into a single workflow
and that it performs well on limited hardware. GeneTrack processes data rapidly and has low computational demands, which makes it suitable for exploratory data analyses where
different fitting and peak detection parameters are varied and the
results compared on the same display.

*To whom correspondence should be addressed.
© Oxford University Press 2005

GeneTrack was developed
in the Python programming language, using the hierarchical data
format (HDF) as its data storage backend and the NumPy library
for numerical computation. For better integration with existing
tools, final results are written into a relational database, with support
for all major databases, including MySQL, PostgreSQL, Oracle,
and MS-SQL. Notably, the system requires neither a separate
database nor a web server for small-scale deployments.
Our software can use the SQLite relational database engine
that is distributed with Python 2.5; moreover, it can present data
through the web via its own embedded web server, and is thus ideal for
smaller, individual research groups. Setup, maintenance, and administration are controlled via a single program that provides the
following functionality:

• store and retrieve data efficiently,
• quickly smooth data over an entire chromosome,
• combine strand-specific data into a composite value,
• detect peaks rapidly on a chromosome-wide scale,
• display and publish results via an embedded web server.
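
As an illustration of the zero-configuration storage path mentioned above, the following minimal sketch stores a few peak records in SQLite through Python's bundled sqlite3 module and reads back a coordinate range. The table name, column names, and example values are hypothetical and are not the schema that GeneTrack ships with; the sketch targets a current Python 3 interpreter.

```python
import sqlite3

# In-memory database for the demonstration; pass a file path to persist.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout, chosen only for this sketch.
conn.execute(
    "CREATE TABLE peaks (chrom TEXT, midpoint INTEGER, strand TEXT, height REAL)"
)
rows = [
    ("chr1", 1200, "+", 3.2),
    ("chr1", 1460, "-", 2.7),
    ("chr2", 88, "+", 1.1),
]
conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Range query of the kind a browser view over one region would issue.
hits = conn.execute(
    "SELECT midpoint, strand, height FROM peaks "
    "WHERE chrom = ? AND midpoint BETWEEN ? AND ? ORDER BY midpoint",
    ("chr1", 1000, 1500),
).fetchall()
```

Because no server process is involved, the same few lines work unchanged on a laptop, which is the deployment scenario the text has in mind for small research groups.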
2 METHODS
The hierarchical data format (HDF) provides a versatile data model
that represents complex data objects and a wide variety of metadata in a portable file format with no limit on the number or size of
data objects in the collection. We adopted HDF as the data storage
backend for GeneTrack and chose NumPy to provide high-speed numerical computation. GeneTrack works iteratively, processing data in individual, overlapping chunks, thus keeping
memory consumption low and independent of the characteristics of the
data. Data smoothing and averaging are accomplished by a Gaussian
smoothing procedure as follows: a measurement at a genomic coordinate is approximated by values taken from a normal distribution with height equal to the measurement, centered over the measurement’s coordinate, and with a standard deviation that acts as an
external, tunable parameter (fitting tolerance). Consequently, each
measurement is expanded into many values over a contiguous set
of coordinates. Next, all individually fitted values are summed at
each genomic coordinate resulting in a smooth and continuous
“probabilistic landscape” where the peaks of the curve indicate the
most likely positions of the measured genomic feature. Finally,
the peak detection algorithm in GeneTrack operates by selecting
the maximal non-neighboring subset from all local maxima in the
data. That is, the algorithm selects the highest peak along the
chromosome, then establishes an exclusion zone (typically a few
hundred bp), within which no other subsequent peaks are allowed
to fall. In the rare occasion that two close peaks (inside the exclusion zone) have exactly equal height, only one of them will be selected as valid. The process is repeated iteratively over the remaining
space until no other peaks may be placed. This algorithm is
well suited for problems where genomic features have a certain a priori
known width (e.g. 147 bp nucleosomal DNA). Setting the
exclusion zone to 0 turns off this feature, and allows the algorithm
to determine the optimal placement of heterogeneously-sized
DNA fragments that are typically generated in ChIP-seq experiments. Fig. 1 illustrates the output derived from 1.3 million ChIP-seq “reads” of the Saccharomyces genome.

Fig. 1. GeneTrack Composite Plot. Black vertical bars represent experimental sequencing reads that map to a genomic coordinate. The purple
trace represents the smoothed fit over the read values. The purple horizontal tracks represent the 147 bp wide nucleosome predictions as given
by the peak detection algorithm. The blue and red tracks are yeast ORFs
on the forward (blue) and reverse (red) strands.
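
The smoothing and peak selection steps described in this section can be sketched as follows. This is a minimal illustration written in plain Python for readability, not the GeneTrack implementation (which uses NumPy and chunked processing). The function names, the truncation of each Gaussian at five standard deviations, the symmetric interpretation of the exclusion zone, and the tie-breaking rule are all assumptions made for the example.

```python
import math


def gaussian_smooth(reads, length, sigma=20.0, width_factor=5):
    """Sum one Gaussian bump per measurement into a per-coordinate signal.

    reads  -- iterable of (coordinate, value) pairs, 0-based coordinates
    sigma  -- standard deviation in bp (the "fitting tolerance")
    Each bump peaks at the measured value, as described in the text.
    """
    signal = [0.0] * length
    half = int(width_factor * sigma)  # assumption: tails cut at 5 sigma
    for pos, value in reads:
        for x in range(max(0, pos - half), min(length, pos + half + 1)):
            signal[x] += value * math.exp(-((x - pos) ** 2) / (2.0 * sigma ** 2))
    return signal


def select_peaks(signal, exclusion=147):
    """Greedy selection: take the highest local maximum, block its
    neighbourhood, and repeat on what is left."""
    maxima = [
        (i, signal[i])
        for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
    ]
    # Highest first; equal heights are ordered by coordinate (arbitrary).
    maxima.sort(key=lambda m: (-m[1], m[0]))
    chosen = []
    for pos, height in maxima:
        # Assumption: the zone extends `exclusion` bp to either side.
        if all(abs(pos - kept) > exclusion for kept, _ in chosen):
            chosen.append((pos, height))
    return sorted(chosen)


# Two nearby reads merge into one peak; a distant read forms its own.
reads = [(100, 3.0), (110, 2.0), (400, 1.0)]
signal = gaussian_smooth(reads, length=600, sigma=10.0)
peaks = select_peaks(signal, exclusion=147)
all_maxima = select_peaks(signal, exclusion=0)
```

With an exclusion zone of 0 every local maximum is kept, which corresponds to the heterogeneous-fragment case mentioned above; a larger zone keeps only the strongest peak in each neighbourhood.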
3 RESULTS
GeneTrack is controlled via a single program and a configuration
file that specifies the processes that it should perform (various data
analysis steps or web serving and/or plotting). The input for the
analysis must be a tab-delimited text file that lists chromosome,
genomic coordinates and values on forward or reverse strands. For
experimental methodologies (ChIP-chip) that do not separate
strands, the values on the reverse strand may be set to zero. Detailed documentation is available in a searchable Wiki format,
containing installation instructions and other operational details
(see main website). The code distribution also includes a dataset
and configuration files for the work published elsewhere [5], packaged such that users may repeat a full data analysis and view the
results via the embedded web server within minutes.
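
To make the input description concrete, the sketch below reads a tab-delimited file of the kind described above. The exact column order (chromosome, coordinate, forward value, reverse value), the absence of a header row, and the comment convention are assumptions made for the example; the GeneTrack documentation is the authority on the real layout.

```python
import csv
import io


def read_track(handle):
    """Yield (chromosome, coordinate, forward, reverse) from tab-delimited text.

    The column order is an assumption made for this sketch.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        chrom, coord, forward, reverse = row[:4]
        yield chrom, int(coord), float(forward), float(reverse)


# ChIP-chip style rows: no strand information, so reverse is set to zero.
example = io.StringIO("chr1\t1200\t2.5\t0\nchr1\t1250\t1.0\t0\n")
records = list(read_track(example))
```

In practice the `io.StringIO` object would be replaced by an open file handle.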
Internally, all information is represented on forward, reverse and
“composite” strands, where the composite strand is a combination
(in the simplest case a sum) of the data on each individual strand.
Since data on the forward and reverse strand represent two independent determinations of binding, this approach allows for individual error evaluations to be made whenever the data on the two
strands are in disagreement. GeneTrack will automatically operate
on each strand separately (Figure 2) and will derive the values for
the composite strand as well.
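
The simplest composite described above, a per-coordinate sum of the two strands, might look like the sketch below. The dictionary representation and both function names are hypothetical, and the disagreement measure is only one possible way to flag coordinates where the strands differ; the paper does not specify how such errors are evaluated.

```python
def composite(forward, reverse):
    """Combine per-coordinate strand values by summation (the simplest case)."""
    coords = sorted(set(forward) | set(reverse))
    return {c: forward.get(c, 0.0) + reverse.get(c, 0.0) for c in coords}


def strand_disagreement(forward, reverse):
    """Absolute per-coordinate gap between strands (illustrative only)."""
    coords = sorted(set(forward) | set(reverse))
    return {c: abs(forward.get(c, 0.0) - reverse.get(c, 0.0)) for c in coords}


forward = {100: 2.0, 105: 1.0}
reverse = {100: 1.5, 110: 3.0}
combined = composite(forward, reverse)
gaps = strand_disagreement(forward, reverse)
```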
The software was designed with extensibility in mind. There is a
clear separation of the database schemas, parsing, fitting and prediction
modules, to the extent that the schema or prediction algorithm that is to be invoked in a certain analysis run can be changed
via the configuration file. Similarly, the output tracks and graphs
are fully customizable and may be entirely replaced, although this
requires Python expertise. The currently distributed schemas are
for sequencing data, but we are preparing a set of modules to
streamline tiling array data processing. We are committed to
providing a smooth data exchange with other existing data analysis
and visualization platforms provided by UCSC and Ensembl. To
that end, we have implemented export functionality that produces
results in BED, GFF or wiggle format. The software has been tested on Windows and Linux platforms and is believed to work on all
major operating systems that can run Python and its extension
libraries for HDF and Numerical Python. We maintain several
GeneTrack instances to disseminate our results (see
http://atlas.bx.psu.edu). Funding for the project has been provided
by NIH R01-HG004160.

Fig. 2. GeneTrack Two Strand Plot. Within the software, full strand information is maintained. The same plot from Figure 1 is now visualized on the
separate strands, forward (blue) and reverse (red). An ideal signal would be
perfectly mirrored; this type of display allows one to identify systematic
shifts in the data. Note how the fitting and peak prediction algorithms also
operate on separate strands.
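
As a rough illustration of what a BED export involves, the sketch below formats a list of predicted peaks as six-column BED lines. It is not GeneTrack's own exporter: the function name, the feature naming scheme, the assumption that peak midpoints are 0-based, and the rescaling of heights to integer scores are all choices made for the example. The only fixed points are the BED conventions themselves (0-based, end-exclusive intervals and scores between 0 and 1000).

```python
def peaks_to_bed(chrom, peaks, width=147, name="nucleosome"):
    """Format (midpoint, height) peaks as six-column BED lines.

    Midpoints are assumed to be 0-based. Heights are rescaled so that the
    tallest peak receives a score of 1000.
    """
    top = max((height for _, height in peaks), default=0.0)
    lines = []
    for i, (mid, height) in enumerate(peaks, start=1):
        start = mid - width // 2
        end = start + width           # end-exclusive, so end - start == width
        start = max(0, start)         # truncate features at the chromosome start
        score = int(round(1000 * height / top)) if top > 0 else 0
        # "." marks a feature without strand, e.g. a composite-strand peak.
        lines.append(f"{chrom}\t{start}\t{end}\t{name}_{i}\t{score}\t.")
    return lines


bed_lines = peaks_to_bed("chrI", [(500, 4.0), (900, 2.0)])
```

Writing `"\n".join(bed_lines)` to a file yields a track that genome browsers accepting BED can load.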
REFERENCES
1. Quackenbush J: Computational approaches to analysis of DNA microarray data. Methods Inf Med 2006, 45 Suppl 1:91-103.
2. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996-1006.
3. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A et al: The generic genome browser: a building block for a model organism system database. Genome Res 2002, 12(10):1599-1610.
4. Blankenberg D, Taylor J, Schenck I, He J, Zhang Y, Ghent M, Veeraraghavan N, Albert I, Miller W, Makova KD et al: A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome Res 2007, 17(6):960-964.
5. Albert I, Mavrich TN, Tomsho LP, Qi J, Zanton SJ, Schuster SC, Pugh BF: Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature 2007, 446(7135):572-576.