Mussa: Transitive Comparative Genomic Analysis
(Comparative Genomic Analysis Using Transitivity - this takes the hidden pun one step further, though not as pure)

Tristan De Buysscher, Nora Mullaney, Barbara Wold

Abstract

Rationale

Comparative genome analysis, as a routine lab tool for cell and molecular biologists, is becoming increasingly important as the repertoire of sequenced genomes grows. In particular, we are beginning to see the power of comparisons integrated over many genomes from within a clade to resolve functionally important sequence features (protein coding, RNA coding, cis-acting regulatory, regulatory RNA, or of unknown function) by the fact that they have been preserved through evolution among multiple mammalian genomes, sensu stricto yeasts, multiple Drosophilidae, etc.

However, there are a number of considerations in performing any comparative analysis. By definition, comparative sequence analysis can only discover functional sequence elements that are both ancestral to all the sequences in the comparison and still retained in each genome. Increasing the evolutionary distance (total branch length) between the two or more genome sequences selected for analysis therefore increases the likelihood that ancestral sequences not under selective pressure will have diverged. For comparative studies, such branch length can be achieved in different ways. This strategic choice has a significant impact on the number and character of features that can be identified (XXrefs). It also affects the kind of algorithm, the software tools and the similarity metrics needed to do the work.

One straightforward strategy that has had considerable success is to focus on a pair of genomes that are separated from each other by considerable distance, such as human and fish (ref) or mouse and chicken (ref). These long-distance binary comparisons constitute a very stringent filter.
For example, in transgenic bioassays, candidate regulatory elements conserved in these binary analyses often give very strong positive results, indicating that they have in fact been conserved over a long stretch of evolution because they have a significant biological function (Xxeddy). However, this single "long branch" analysis strategy also has the effect of eliminating many sequence features, both coding and regulatory, that are known to be functionally important in at least one of the organisms (sometimes important in both), though they are no longer detectably shared. Some will have diverged because they are simply not needed in one of the two organisms (genes and/or regulatory elements that are clade specific and are, for example, needed in mammals but not in fish or birds – or the reverse). A different class of functional elements lost from long-distance binary comparisons are those that are able to execute similar functions with highly altered primary DNA sequences. This is an especially important scenario for cis-acting regulatory modules whose internal organizational flexibility can be extraordinary (see below, figureXX). If one goal of a comparative analysis is to identify as many "true positive" functionally important elements as possible over a locus of interest, then the single long-branch strategy should only be chosen when there are few genomes to work with. At the other end of the comparison strategy continuum, the same total branch length used in a given binary comparison can highlight more functional elements, if that distance comes from using multiple genomes that are positioned so that each individual genome contributes less total branch length. The expectation is that the aggregate N-genome branch length sum will deliver high resolution while retaining elements that are important within the clade being studied. 
Thus the practical virtue of the latter N-way genome comparison strategy is that it can highlight a larger number of functional genomic features than does a 2-genome strategy of roughly similar branch length.

At this time, there are two limiting factors for high-N multigenome comparisons. Depending on the clade of interest and the locus of interest, the number of whole genome sequences available can still be limiting, even though it continues to increase rapidly (XX), especially in clades surrounding major experimental model organisms such as S. cerevisiae, D. melanogaster, C. elegans, and A. thaliana. The other limitation has been in the bioinformatic tools: user-friendly software is needed first to make appropriately integrated and flexible N-way analyses, and then to couple them with mining tools that help integrate comparative conservation information with other pertinent sequence features and relationships. The latter need motivated this work.

The availability of many genomes means that a biologist studying a given locus needs to make detailed comparisons with the flexibility to set different sequence similarity thresholds and to include different combinations of genomes. Once these basic comparisons are made, one needs to relate domains of sequence conservation to each other, ask if they have shared internal elements, and map additional features such as small transcription factor binding motifs, regulatory RNA interaction motifs, etc. This kind of analysis needs to be specifically tailored to the questions, prior knowledge and particulars of a given locus.

The Mussa software package was designed to make possible this kind of user-driven, interactive comparative analysis and annotation of individual loci over N genomes. To do this, it employs a transitivity filtering algorithm to integrate sequence similarity ties over the entire collection of genomes being compared.
Interactivity features then permit inspection of the analysis at varying levels of resolution, recovery of specific sequence regions for further external analysis, and user-driven integration of conserved sequence domains with maps of sequence motifs such as transcription factor binding sites or gene model annotations. (last seems like unnecessary marketing speech that really says nothing...)

N-way Transitivity

The Mussa N-way analysis occurs in two stages. The first stage consists of a full set of all possible pairwise sequence comparisons among the N genomes. These are made by the well-known thresholded sliding window matching algorithm, which is the basis of classic two-gene dot-plot comparisons (looking for ref, it's a bit confused), most recently re-implemented in the Family Relations software package (Brown et al 2005, 2002). One salient property of comparisons made with these algorithms is that they can capture and highlight multiple related features within a locus. In a classic 2-genome dot plot, these are the off-diagonal similarity features. In this respect sliding window algorithms differ from those based on initial global alignments, which detect only a single aligned relationship per feature per genome. In noncoding candidate regulatory DNA in particular, the occurrence of local duplications of features of potential functional interest means that these relationships can be biologically pertinent. This basic similarity search design also provides some tolerance for correctly detecting relatedness in genomes with local misassembly issues.

The second stage of the Mussa comparison algorithm integrates all possible pairwise matched features by applying a transitivity test.
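The first-stage pairwise comparison described above can be sketched roughly as follows. This is a minimal illustration of a thresholded sliding-window match, not Mussa's actual implementation: the function name `window_matches` and its parameters are hypothetical, only the forward strand is compared, and the brute-force loop ignores the optimizations a real tool would need.

```python
def window_matches(seq_a, seq_b, window, threshold):
    """Return (i, j, score) for every pair of windows sharing >= threshold bases.

    i and j are window start positions in seq_a and seq_b; score is the
    number of identical positions within the two windows.
    Hypothetical helper for illustration only.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count position-by-position identities inside the two windows.
            score = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if score >= threshold:
                hits.append((i, j, score))
    return hits


# A feature that is duplicated in the second sequence is reported twice.
query = "GATTACA"
target = "GATTACACCGATTACA"
duplicated_hits = window_matches(query, target, window=7, threshold=7)
print(duplicated_hits)
```

Because every window pair is tested, both copies of the feature in the second sequence are reported; in a dot plot the second copy would appear as an off-diagonal feature.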
Simple mathematical transitivity requires:

    if A = B and B = C, then A = C

In a 3-way comparison, the pairwise matches must satisfy the corresponding relationship. For windows, W, matching between sequences A, B, and C:

    if W_AB and W_BC meet threshold, then W_AC must meet threshold

The thresholded window matches are similarity relations, not strict equalities, whenever the selected threshold T (in bp) is less than the window size W (in bp), so transitivity is not automatic and must be tested. In the special case of exact sequence matches, T = W, transitivity is inherently satisfied in the straight radial method. However, usage requiring perfect matching assumes that functionally conserved areas are fully conserved, which is not usually true for informative comparisons between orthologous or paralogous loci. Transitivity filtering produces a smaller set of matches than does a straight radial comparison by demanding that all pairwise relationships meet the stated threshold. Within each pairwise relationship, mismatches are still tolerated as long as the match count is at or above the specified threshold. In effect, this gives equal weight in the comparison to all participating genomes, and the interactive viewer highlights all relationships that pass the transitivity test on all the genomes in the analysis. In particular, the identity of the 'reference' genome in a Mussa comparison is simply a graphical annotation device: relatedness between the reference and other genomes is not treated differently from relatedness between any two genomes within the comparison.

Classical 'radial design' algorithms, shown schematically in figure 1B, are the basis for such tools as MultiPipMaker and Multi-VISTA (refs). In these algorithms conserved windows of sequence are matched, pairwise, against a designated 'reference' genome. They do not make direct comparisons of the other participating N-1 genomes with each other.
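The contrast between the radial test and the transitivity test can be illustrated with a small sketch. It assumes one candidate window has already been chosen from each genome, and simply asks which pairwise relationships are checked; the helper names (`match_count`, `passes_radial`, `passes_transitive`) are hypothetical and this is not how Mussa itself is implemented.

```python
from itertools import combinations


def match_count(w1, w2):
    """Number of identical positions between two equal-length windows."""
    return sum(1 for x, y in zip(w1, w2) if x == y)


def passes_radial(windows, threshold, reference=0):
    """Radial test: every window is compared only against the reference."""
    ref = windows[reference]
    return all(
        match_count(ref, w) >= threshold
        for idx, w in enumerate(windows)
        if idx != reference
    )


def passes_transitive(windows, threshold):
    """Transitive test: every pair of windows must meet the threshold."""
    return all(
        match_count(a, b) >= threshold for a, b in combinations(windows, 2)
    )


# Window size W = 10, threshold T = 7 (illustrative values).
win_a = "ACGTACGTAC"  # reference
win_b = "TTTTACGTAC"  # differs from A at positions 0-2
win_c = "ACGTACGAGG"  # differs from A at positions 7-9

radial_ok = passes_radial([win_a, win_b, win_c], threshold=7)
transitive_ok = passes_transitive([win_a, win_b, win_c], threshold=7)
print(radial_ok, transitive_ok)
```

In this constructed example B and C each match the reference A at 7 of 10 positions, but they match each other at only 4, so the radial test accepts the triple and the transitive test rejects it.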
This means that it is possible for the various genomes to be equally distant from the reference genome, and thus meet the specified similarity threshold, while also being considerably more distant from each other. In this situation they need not meet the specified threshold with respect to each other. Thus, for a three-way comparison by a radial algorithm with A as the reference, the relationships produced are:

    W_AB and W_AC meet threshold; W_BC is not tested

A practical upshot of the differences in the algorithms, whose extent will vary according to the specifics of the genomes used and the particular locus, is that the radial comparison tools will highlight and score more features than the corresponding Mussa comparison at the same window size and similarity threshold.

Interactive Features

Mussa provides a GUI for inspecting the results. This GUI has been extended from that used by FamilyRelations for use with more than two sequences. A top level view of all conservation is presented in the primary connections window (figure 2a). The similarity threshold can be adjusted higher than that of the base analysis, which is usually set at 70% based on experience with vertebrate clades. Basic IUPAC motif searching and annotation marking are available. Sequence areas can be highlighted for detailed inspection in a sequence view window. In the sequence view window (figure 2b), the user can inspect the actual base pairs comprising conserved sequence windows identified by Mussa.

In many uses, a large scale analysis for Mussa (20-200kb) at an effective window size and sequence similarity threshold can only identify broad areas of conservation. These are very useful as starting point candidates for finding functional features, including regulatory modules that consist of multiple clustered binding sites for transcription factors.
A sub-analysis feature in Mussa allows the user to select small subregions from the main analysis and perform a fine-scale comparative analysis using more traditional dot-plot parameters (e.g. window = 10, threshold = 8). This can reveal fine level detail of conservation within the larger regions of conservation. Because this kind of window and threshold is close to the scale of most transcription factor binding motifs, it can deliver additional insights into relevant functional features, including smaller scale relationships between different regulatory modules (enhancers and promoters) that are not detectable with longer search windows. The efficacy of different search window sizes and thresholds over various locus sizes is discussed in the Mussa web tutorial.

Availability and Tutorial

Mussa has been released under the GPL license. Package downloads and tutorial are available at http://mussa.caltech.edu .

Acknowledgements

(xxxfundingxx).

References
(just beginning this, they have to be numbered & ref'd in order of appearance for Genome Bio...)

Brown CT, Rust AG, Clarke PJ, Pan Z, Schilstra MJ, De Buysscher T, Griffin G, Wold BJ, Cameron RA, Davidson EH, Bolouri H: New computational approaches for analysis of cis-regulatory networks. Dev Biol. 2002 Jun 1;246(1):86-102.

Brown CT, Xie Y, Davidson EH, Cameron RA: Paircomp, FamilyRelationsII and Cartwheel: tools for interspecific sequence comparison. BMC Bioinformatics. 2005 Mar 24;6(1):70.

Brudno M, Poliakov A, Salamov A, Cooper GM, Sidow A, Rubin EM, Solovyev V, Batzoglou S, Dubchak I: Automated whole-genome multiple alignment of rat, mouse, and human. Genome Res. 2004 Apr;14(4):685-92.

Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 2003 Jul 1;31(13):3518-24.
Eventually, multiple clades will have even deeper, multispecies in the manner that is now becoming possible for mammals, sensu stricto yeasts, or Drosophilidae.

Total branch length, however it is achieved, also affects the size of sequence feature one can expect to be able to identify as conserved because of functional pressure (XX, YY).

This, in turn, common by, for example, mammals rather than being limited to the smaller set of things held in common by vertebrates.

inherently requires sequences from species whose functional elements have also diverged considerably (well, ...depends on the particular system; more basal/fundamental systems are most likely to be retained due to other systems' dependence and the fact that they have likely already hit a fairly low (stable) point on the fitness landscape...)

re: theory of both ancestral and retain...where to address repeats? Not here - in type of summary/discussion at end?

...ultimately, given the very complex manner of differential selection, no single comparative approach can uncover, purely algorithmically, the intricate details of what is a conserved base and what is not...

COMMENT on the thought immediately above: I agree, but we will not make progress on the basic stuff if we do not simplify enough to make a point. The quotes on this general point go back thousands of years but are also current: you must always simplify to make basic points you believe to be true, and deal with the rest later, when progress says which ones you can no longer ignore and how to deal with them...

Revised to perhaps draw on, if needed:

The availability of many genomes means that a biologist studying a given locus now wants to make detailed comparisons with the flexibility to set different sequence similarity thresholds and to include different combinations of genomes. Once these basic comparisons are made, a first important level of mining and integration is to ask if the domains of sequence conservation are related to each other, if they have shared internal motifs, and if additional features such as small transcription factor binding motifs and regulatory RNA interaction motifs map onto the locus and into domains of multigenome conservation. Although the global sequence conservation overviews now provided by genome browsers provide a powerful starting point, in our experience the actual usage of these analyses for dissecting gene structure and function often needs to be specifically tailored to the questions, prior knowledge and particulars of a given locus.