mussa_tech_brief_t03_bjw02

Mussa: Transitive Comparative Genomic Analysis
(Comparative Genomic Analysis Using Transitivity - this takes the hidden pun one step further, though not
as pure)
Tristan De Buysscher, Nora Mullaney, Barbara Wold
Abstract
Rationale
Comparative genome analysis is becoming an increasingly important routine lab tool for cell and molecular
biologists as the repertoire of sequenced genomes grows. In particular, we are beginning to
see the power of comparisons integrated over many genomes from within a clade (multiple mammalian genomes,
the sensu stricto yeasts, multiple Drosophilidae, etc.) to resolve functionally important sequence features
(protein coding, RNA coding, cis-acting regulatory, regulatory RNA, or of unknown function) by the fact that
they have been preserved through evolution. However, there are a number of considerations
in performing any comparative analysis. By definition, comparative sequence analysis can only discover
functional sequence elements that are both ancestral to all the sequences in the comparison and still
retained in each genome. Increasing the evolutionary distance (total branch length) between the two or
more genome sequences selected for analysis therefore increases the likelihood that ancestral sequences not
under selective pressure will have diverged.
For comparative studies, such branch length can be achieved in different ways. This strategic choice
has significant impact on the number and character of features that can be identified (XXrefs). This choice also
affects the kind of algorithm, the software tools and the similarity metrics needed to do the work. One
straightforward strategy that has had considerable success is to focus on a pair of genomes that are separated
from each other by considerable distance, such as human and fish (ref) or mouse and chicken (ref). These
long-distance binary comparisons constitute a very stringent filter. For example, in transgenic bioassays, candidate
regulatory elements conserved in these binary analyses often give very strong positive results, indicating that
they have in fact been conserved over a long stretch of evolution because they have a significant biological
function (Xxeddy). However, this single "long branch" analysis strategy also has the effect of eliminating
many sequence features, both coding and regulatory, that are known to be functionally important in at least one
of the organisms (sometimes important in both), though they are no longer detectably shared. Some will have
diverged because they are simply not needed in one of the two organisms (genes and/or regulatory elements that
are clade specific and are, for example, needed in mammals but not in fish or birds – or the reverse). A
different class of functional elements lost from long-distance binary comparisons are those that are able to
execute similar functions with highly altered primary DNA sequences. This is an especially important scenario
for cis-acting regulatory modules whose internal organizational flexibility can be extraordinary (see below,
figureXX). If one goal of a comparative analysis is to identify as many "true positive" functionally important
elements as possible over a locus of interest, then the single long-branch strategy should only be chosen when
there are few genomes to work with.
At the other end of the comparison strategy continuum, the same total branch length used in a given binary
comparison can highlight more functional elements, if that distance comes from using multiple genomes that
are positioned so that each individual genome contributes less total branch length. The expectation is that the
aggregate N-genome branch length sum will deliver high resolution while retaining elements that are important
within the clade being studied. Thus the practical virtue of the latter N-way genome comparison strategy is that
it can highlight a larger number of functional genomic features than does a 2-genome strategy of roughly
similar branch length. At this time, there are two limiting factors for high-N multigenome comparisons.
Depending on the clade of interest and the locus of interest, the number of whole genome sequences available
can still be limiting, even though it continues to increase rapidly (XX), especially in clades surrounding major
experimental model organisms such as S. cerevisiae, D. melanogaster, C. elegans, and A. thaliana. The other
limitation has been in the bioinformatic tools, where user-friendly software is needed first to make appropriately
integrated and flexible N-way analyses and then to couple them with mining tools that help to integrate
comparative conservation information with other pertinent sequence features and relationships. The latter
need motivated this work.
The availability of many genomes means that a biologist studying a given locus needs to make detailed
comparisons with the flexibility to set different sequence similarity thresholds and to include different
combinations of genomes. Once these basic comparisons are made, one needs to relate domains of sequence
conservation to each other, ask if they have shared internal elements, and map additional features such as small
transcription factor binding motifs, regulatory RNA interaction motifs, etc. This kind of analysis needs to be
specifically tailored to the questions, prior knowledge and particulars of a given locus.
The Mussa software package was designed to make possible this kind of user-driven interactive
comparative analysis and annotation of individual loci over N genomes. To do this, it employs a transitivity
filtering algorithm to integrate sequence similarity ties over the entire collection of genomes being compared.
Interactivity features then permit inspection of the analysis at varying levels of resolution, recovery of specific
sequence regions for further external analysis, and user-driven integration of conserved sequence domains with
maps of sequence motifs such as transcription factor binding sites or gene model annotations. (last seems like
unnecessary marketing speech that really says nothing...)
N-way Transitivity
The Mussa N-way analysis occurs in two stages. The first stage consists of a full set of all possible
pairwise sequence comparisons among the N genomes. These are made by the well-known thresholded sliding
window matching algorithm, which is the basis of classic two-gene dot-plot comparisons (looking for ref, it's a
bit confused), most recently re-implemented in the Family Relations software package (Brown et al 2005,
2002). One salient property of comparisons made with these algorithms is that they can capture and highlight
multiple related features within a locus. In a classic 2-genome dot plot, these are the off-diagonal similarity
features. In this manner sliding window algorithms differ from those that are based on initial global alignments
that detect only the single aligned relationship per feature per genome. In noncoding candidate regulatory
DNA, in particular, the occurrence of local duplications of features of potential functional interest means that
these relationships can be biologically pertinent. This basic similarity search design also provides some
tolerance for correctly detecting relatedness in genomes with local misassembly issues.
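The first stage can be illustrated with a minimal Python sketch of a thresholded sliding-window comparison between two sequences. This is not the Mussa or Family Relations implementation: the function name window_matches and its parameters are hypothetical, only the forward strand is compared, and the brute-force loops are written for clarity rather than speed.

```python
def window_matches(seq_a, seq_b, window, threshold):
    """Return (i, j, score) for every window pair with at least `threshold` identical bases.

    i and j are 0-based window start positions in seq_a and seq_b.
    Hypothetical helper for illustration; forward strand only.
    """
    matches = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical bases, position by position, within the two windows.
            score = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# A locally duplicated feature in the first sequence yields two hits against
# the single copy in the second sequence (an off-diagonal dot-plot feature).
hits = window_matches("ACGTACGT", "ACGT", window=4, threshold=4)
print(hits)
```

Because every window in one sequence is tested against every window in the other, a repeated feature is reported once per copy, which is the property that distinguishes this design from a single global alignment.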
The second stage of the Mussa comparison algorithm integrates all possible pairwise matched features by
applying a transitivity test. Simple mathematical transitivity requires:
if A = B and B = C, then A = C
In a 3-way comparison, the pairwise matches must satisfy the analogous relationship. For windows, W, matching
between sequences A, B, and C:
if W_AB and W_BC meet threshold, then W_AC must meet threshold
The thresholded window matches are approximate equivalences and not strict equalities in cases where the
selected threshold T (in bp) is any value less than the window size W (bp), so transitivity is not guaranteed
and must be tested explicitly. In the special case of exact
sequence matches, T=W, transitivity is inherently satisfied in the straight radial method. However, usage
requiring perfect matching assumes that functionally conserved areas are fully conserved, which is not usually
true for informative comparisons between orthologous or paralogous loci. Transitivity filtering produces a
smaller set of matches than does a straight radial comparison by demanding that all pairwise relationships
meet the stated threshold. On the other hand, mismatch is tolerated wherever every pairwise match still scores
at the specified threshold or above. In
effect, this gives equal weight in the comparison to all participating genomes, and the interactive viewer
highlights all relationships that pass the transitivity test on all the genomes in the analysis. In particular, the
identity of the 'reference' genome in a Mussa comparison is simply a graphical annotation device and it does not
treat relatedness between the reference and other genomes differently than the relatedness between any two
genomes within the comparison.
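The transitivity test for three sequences can be sketched in Python as follows. This is an illustrative brute-force version under stated assumptions, not the Mussa code: the names identity and transitive_triples are hypothetical, only the forward strand is considered, and the threshold is given in bp as T is in the text.

```python
def identity(seq_x, i, seq_y, j, window):
    """Count identical bases between seq_x[i:i+window] and seq_y[j:j+window]."""
    return sum(1 for k in range(window) if seq_x[i + k] == seq_y[j + k])


def transitive_triples(a, b, c, window, threshold):
    """Return window starts (i, j, k) where A-B, B-C and A-C all meet the threshold.

    Hypothetical helper for illustration; no sequence is treated as a reference.
    """
    triples = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            if identity(a, i, b, j, window) < threshold:
                continue  # the A-B match fails, so no triple can start here
            for k in range(len(c) - window + 1):
                if (identity(b, j, c, k, window) >= threshold
                        and identity(a, i, c, k, window) >= threshold):
                    triples.append((i, j, k))
    return triples


# A-B and B-C each share 3 of 4 bases, but A-C shares only 2 of 4,
# so the triple is rejected at threshold 3.
print(transitive_triples("TCGT", "ACGT", "ACGA", window=4, threshold=3))
```

In this example the A-B and B-C windows each pass at T=3 while A-C does not, so no triple is reported. All three pairwise tests are applied symmetrically, which is why the choice of reference genome does not change the result.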
Classical 'radial design' algorithms, shown schematically in figure 1B, are the basis for such tools
as MultiPipMaker and Multi-VISTA (refs). In these algorithms conserved windows of sequence are matched,
pairwise, against a designated 'reference' genome. They do not make direct comparisons of the other
participating N-1 genomes with each other. This means that it is possible for the various genomes to be equally
distant from the reference genome, and thus meet the specified similarity threshold, while also being
considerably more distant from each other. In this situation they need not meet the specified threshold with
respect to each other. Thus, for a three-way comparison by a radial algorithm with A as the reference, the A-B
and A-C window matches must each meet the threshold, while the B-C match is never tested.
A practical upshot of the differences in the algorithms, which will vary in extent according to
specifics of the genomes used and the particular locus, is that the radial comparison tools will highlight and
score more features at a given threshold relative to the corresponding Mussa comparison at the same window
size and similarity threshold.
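The difference in the number of features reported can be made concrete with a small Python sketch. It models a radial comparison as "reference versus each other sequence" and then applies the additional pairwise test between the non-reference sequences. The function names are hypothetical, and this is a simplified model of the radial design, not the algorithm used by MultiPipMaker or Multi-VISTA.

```python
def identity(x, i, y, j, window):
    """Count identical bases between the windows x[i:i+window] and y[j:j+window]."""
    return sum(1 for k in range(window) if x[i + k] == y[j + k])


def radial_triples(ref, b, c, window, threshold):
    """Simplified radial design: test ref-b and ref-c only; b-c is never compared."""
    triples = []
    for i in range(len(ref) - window + 1):
        b_hits = [j for j in range(len(b) - window + 1)
                  if identity(ref, i, b, j, window) >= threshold]
        c_hits = [k for k in range(len(c) - window + 1)
                  if identity(ref, i, c, k, window) >= threshold]
        triples.extend((i, j, k) for j in b_hits for k in c_hits)
    return triples


def transitive_subset(triples, b, c, window, threshold):
    """Keep only the radial triples whose b-c windows also meet the threshold."""
    return [(i, j, k) for (i, j, k) in triples
            if identity(b, j, c, k, window) >= threshold]


ref, b, c = "ACGT", "TCGT", "ACGA"
radial = radial_triples(ref, b, c, window=4, threshold=3)
transitive = transitive_subset(radial, b, c, window=4, threshold=3)
print(len(radial), len(transitive))  # 1 0
```

Here both non-reference sequences match the reference at 3 of 4 bases but match each other at only 2 of 4, so the radial model reports one feature and the transitive filter reports none. The transitive result is always a subset of the radial result at the same window size and threshold.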
Interactive Features
Mussa provides a GUI for inspecting the results. This GUI has been extended from that used by
FamilyRelations for use with more than two sequences. A top-level view of all conservation is presented in the
primary connections window (figure 2a). The similarity threshold can be adjusted higher than that of the base
analysis, which is usually set at 70%, based on experience with vertebrate clades. Basic IUPAC motif searching and
annotation marking are available. Sequence areas can be highlighted for detailed inspection in a sequence view
window. In the sequence view window (fig 2b), the user can inspect the actual base pairs comprising conserved
sequence windows identified by Mussa.
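As an illustration of what IUPAC motif searching involves, the following Python sketch expands the standard IUPAC nucleotide ambiguity codes into a regular expression and reports every occurrence, including overlapping ones. The function find_motif is hypothetical and is not Mussa's implementation; it searches the forward strand only.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(sequence, motif):
    """Return (start, matched_text) for each occurrence of an IUPAC motif.

    Hypothetical helper for illustration. A zero-width lookahead is used so
    that overlapping occurrences are all reported.
    """
    pattern = "".join(IUPAC[base] for base in motif.upper())
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence.upper())]


# Search for the E-box consensus CANNTG in a short example sequence.
print(find_motif("TTCAGCTGAACACGTGTT", "CANNTG"))
```

The returned start positions are 0-based and could be used to mark annotations alongside conserved windows. A search of the reverse strand would require running the same function on the reverse complement.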
In many uses, a large-scale Mussa analysis (20-200kb) at an effective window size and sequence similarity
threshold can only identify broad areas of conservation. These are very useful as starting point candidates for
finding functional features, including regulatory modules that consist of multiple clustered binding sites for
transcription factors. A sub-analysis feature in Mussa allows the user to select small subregions from the main
analysis and perform a fine-scale comparative analysis using more traditional dot-plot parameters (e.g. window =
10, threshold = 8). This can reveal fine-level detail of conservation within the larger regions of conservation.
Because this kind of window and threshold is close to the scale of most transcription factor binding motifs, it
can deliver additional insights into relevant functional features, including smaller scale relationships between
different regulatory modules (enhancers and promoters) that are not detectable with longer search windows.
The efficacy of different search window sizes and thresholds over various locus sizes is discussed in the
Mussa web tutorial.
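A sub-analysis of this kind can be sketched in Python as a second comparison restricted to user-selected subregions, with hits reported in the coordinates of the full sequences. The function fine_scale_matches and its region arguments are hypothetical; the defaults of window = 10 and threshold = 8 are taken from the example parameters in the text, and only the forward strand is compared.

```python
def fine_scale_matches(seq_a, seq_b, region_a, region_b, window=10, threshold=8):
    """Compare two selected subregions and report hits in full-sequence coordinates.

    region_a and region_b are (start, end) pairs, 0-based and end-exclusive.
    Hypothetical helper for illustration only.
    """
    (a_start, a_end), (b_start, b_end) = region_a, region_b
    sub_a, sub_b = seq_a[a_start:a_end], seq_b[b_start:b_end]
    hits = []
    for i in range(len(sub_a) - window + 1):
        for j in range(len(sub_b) - window + 1):
            score = sum(1 for k in range(window) if sub_a[i + k] == sub_b[j + k])
            if score >= threshold:
                # Shift window starts back into the coordinates of the parent sequences.
                hits.append((a_start + i, b_start + j, score))
    return hits


seq_a = "GGGGG" + "ACGTACGTAC" + "GGGGG"
seq_b = "TTT" + "ACGTTCGTAC" + "TTT"
# The two selected 10 bp subregions differ at one position (9 of 10 identical).
print(fine_scale_matches(seq_a, seq_b, region_a=(5, 15), region_b=(3, 13)))
```

Restricting the comparison to subregions keeps the number of window pairs small, which matters because short windows with permissive thresholds produce many chance matches over a whole locus.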
Availability and Tutorial
Mussa has been released under the GPL license. Package downloads and tutorial are available at
http://mussa.caltech.edu .
Acknowledgements (xxxfundingxx).
References
(just beginning this, they have to be numbered & ref'd in order of appearance for Genome Bio...)
Brown CT, Rust AG, Clarke PJ, Pan Z, Schilstra MJ, De Buysscher T, Griffin G, Wold BJ, Cameron RA,
Davidson EH, Bolouri H: New computational approaches for analysis of cis-regulatory networks. Dev
Biol. 2002 Jun 1;246(1):86-102.
Brown CT, Xie Y, Davidson EH, Cameron RA: Paircomp, FamilyRelationsII and Cartwheel: tools for
interspecific sequence comparison. BMC Bioinformatics. 2005 Mar 24;6(1):70.
Brudno M, Poliakov A, Salamov A, Cooper GM, Sidow A, Rubin EM, Solovyev V, Batzoglou S, Dubchak I:
Automated whole-genome multiple alignment of rat, mouse, and human. Genome Res. 2004 Apr;14(4):685-92.
Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W:
MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences.
Nucleic Acids Res. 2003 Jul 1;31(13):3518-24.
Eventually, multiple clades will have even deeper, multispecies in the manner that is now becoming possible
for mammals, sensu stricto yeasts, or Drosophilidae. However
Total branch length, however it is achieved, also affects the size of sequence feature one can expect to be able
to identify as conserved because of functional pressure (XX, YY). This, in turn,
common by, for example, mammals rather than being limited to the smaller set of things held in common by
vertebrates.
inherently requires sequences from species whose function elements have also diverged considerably (well,
...depends on the particular system, more basal/fundamental systems are most likely to be retained due to other
systems dependence and the fact they've already likely hit a fairly low (stable) point on the fitness landscape...)
re: theory of both ancestral and retain...where to address repeats? Not here - in type of summary/discussion at
end?
...ultimately, given the very complex manner of differential selection, no single comparative approach can
uncover, purely algorithmically, the intricate details of what is a conserved base and what is not...
COMMENT the thought immediately above: – I agree, but we will not make progress on the basic stuff, if we
do not simplify enough to make a point.
The quotes on this general point go back thousands of years but are also current: You always must simplify to
make basic points you believe to be true, and deal with the complications later, when progress says which ones
you can no longer ignore and how to deal with them.
Revised to perhaps draw on, if needed:
The availability of many genomes means that a biologist studying a given locus now wants to make detailed
comparisons with the flexibility to set different sequence similarity thresholds and to include different
combinations of genomes. Once these basic comparisons are made, a first important level of mining and
integration is to ask if the domains of sequence conservation are related to each other, if they have shared
internal motifs, and if additional features such as small transcription factor binding motifs or regulatory RNA
interaction motifs map onto the locus and into domains of multigenome conservation. Although the global
sequence conservation overviews now provided by genome browsers provide a powerful starting point, in our
experience, the actual usage of these analyses for dissecting gene structure and function often needs to be
specifically tailored to the questions, prior knowledge and particulars of a given locus.