Principal Investigator/Program Director (Dorman, Karin, Saskia):

a. Specific Aims

The specific aims have not been altered from the original competing award.

b. Studies and Results

The first specific aim is to “construct a database of statistically confirmed HIV recombinant crossover points.” Last year, we reported trouble testing HIV sequences for intersubtype recombination because the phylogenetic relationship between the subtypes was not consistent throughout the genome. Such evidence of ancient recombination among the subtype ancestors is problematic for our original method, which, for computational reasons, assumes a fixed phylogenetic relationship for all but the putative recombinant sequences. Last year, we proposed to modify the recombination detection method to simultaneously estimate ancient recombination. However, it is still not computationally feasible to use such a method to achieve our goal of testing all HIV sequences longer than 500 nucleotides for intersubtype recombination. Instead, our current analysis reduces the problem in two ways. First, we divide the HIV genome into many overlapping segments of 500 to 1000 nucleotides. Second, we divide the HIV subtypes into distinct groups that have consistent phylogenetic relationships. A reference alignment is created for each combination of subtype group and genomic region. We align each HIV sequence to all reference alignments with which it has homology and test for recombination. Thus, each HIV sequence is tested for recombination multiple times. Any positive recombination signal is analyzed again using all subtype reference sequences and without fixing the parental tree. The computations are currently underway, and the first results will be ready for motif searching (Aim 2) during the summer.

Figure 1. Accession numbers AF385934 and AF385935, two recombinants with identical intersubtype B/F recombinant structure, are tested for monophyly.
Shown are the HIV genomic map (genes), the probabilities that the recombinants are subtype B, subtype F, split between subtypes B and F, or subtype C, and the probability that the recombinants are monophyletic (mono). There is moderate evidence that nucleotides 1200-1600 descend from distinct subtype F sequences, suggesting that at least one sequence experienced a second recombination event and that position 1600 is a possible hotspot. The polyphyletic 3’ nef region may also indicate multiple events, although a small subtype C-like region inferred there may confound results.

PHS 2590 (Rev. 05/01) Form Page 5

If there is some kind of “recombination signal” present in the RNA sequence that triggers recombination, sequences near confirmed recombination hotspots will likely represent the strongest source of this signal. Thus, the motif search of Aim 2 is likely to be most successful when applied to sequences near recombination hotspots. Unfortunately, descendants of recombination events occurring at a hotspot are difficult to distinguish from descendants of a single, rare recombination that has subsequently spread through a population. During the last year, Dr. Dorman and graduate student Fang Fang developed a method to test whether recombinants with apparently similar structure result from a single, unique recombination event or from multiple, similar recombination events. We conclude that multiple recombination events have occurred when the recombinants with similar structure do not form a monophyletic group in a phylogenetic analysis, i.e., the recombinants do not form a tightly related cluster. Because far fewer trees place the recombinants as monophyletic than as polyphyletic (not monophyletic), a uniform prior on trees will favor the hypothesis of multiple events.
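
As a rough illustration of this counting argument (not the implementation used in the project), the sketch below counts tree topologies to show how strongly a uniform prior disfavors monophyly. It assumes rooted, binary, labeled trees; the report does not state which tree space the analysis uses, and the function names and the 10-sequence example are hypothetical.

```python
from fractions import Fraction
from math import prod


def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted, binary, labeled tree topologies: (2n - 3)!!."""
    if n_taxa < 2:
        return 1
    # Product of the odd numbers 1 * 3 * ... * (2n - 3).
    return prod(range(1, 2 * n_taxa - 2, 2))


def uniform_prior_monophyly(n_taxa: int, n_recombinants: int = 2) -> Fraction:
    """Prior probability, under a uniform prior on topologies, that the
    recombinants form a single clade (are monophyletic)."""
    # Trees with the recombinants as a clade: arrange the recombinants
    # among themselves, then attach that clade as one leaf among the rest.
    monophyletic = rooted_tree_count(n_recombinants) * rooted_tree_count(
        n_taxa - n_recombinants + 1
    )
    return Fraction(monophyletic, rooted_tree_count(n_taxa))


# Hypothetical example: 10 sequences in total, 2 of them recombinants.
p_mono = uniform_prior_monophyly(10, 2)
print(p_mono)                 # 1/17
print((1 - p_mono) / p_mono)  # 16
```

Under these assumptions, a uniform prior gives monophyly a prior probability of only 1/17, that is, prior odds of 16 to 1 in favor of multiple events before any sequence data are examined.
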
Fang Fang developed a prior that weights monophyletic and polyphyletic recombinants equally a priori, leading to a more accurate statistical analysis. Preliminary results suggest that two intersubtype B/F HIV recombinants sharing some crossover points may have arisen from multiple recombination events at a shared site, suggesting a possible hotspot (Figure 1). This work has been accepted for presentation at three international conferences.

The second aim is to search for sequence and secondary structure motifs associated with recombination hotspots. In the last year, the Ashlock lab has developed a technique called multiclustering and a non-linear projection algorithm for understanding the output of RNA structural motif finders. A paper applying multiclustering to non-RNA data has been accepted to the 2005 Genetic and Evolutionary Computation Conference (GECCO). A paper applying the non-linear projection technique to iron response elements (IREs) has been submitted to the 2005 Congress on Evolutionary Computation (CEC). The non-linear projection correctly distinguished the two known forms of IREs. A survey paper on different RNA structural motif finders and a paper on the theoretical basis of multiclustering are in preparation.

The goal of the third aim is to determine if cis-acting sequences can modulate the frequency of HIV-1 recombination in vitro. These studies are intended to verify experimentally the putative recombination hotspots identified through statistical and computational methods. We also planned to examine some putative hotspots identified by others (Zhuang et al., 2002) in order to identify promising signals that should definitely be targeted by the motif searcher (Aim 2). Unfortunately, significant progress on Aim 3 was thwarted in the last year by the sudden departure of Sijun Liu, the postdoc working with Susan Carpenter. A replacement postdoc, XiaoYu Lui, will be hired starting July 1, 2005, to continue progress on this aim.
The first predictions of the motif searcher should be available for in vitro testing by the end of summer 2005.

The fourth aim is to extend the “Bayesian change-point model for recombination and incorporate the significant features identified in Aims 2 & 3.” This year, Dr. Suchard and his graduate student, Vladimir Minin, have completed the work of Section 5.4.2. They have extended the multiple change-point model for recombination detection to include two independent change-point processes. One of these processes directly models recombination, allowing for a non-uniform distribution of crossover points. The other process allows evolutionary pressures to vary independently. Figure 3 demonstrates the improvement of the dual over the single change-point model in an HIV recombination detection problem. A manuscript describing the dual model is in press in Bioinformatics. Dr. Dorman and graduate student Fang Fang used the dual multiple change-point model to re-analyze a set of nearly 400 hepatitis B virus (HBV) sequences that had previously been tested with the original single multiple change-point model and reported last year. The slightly modified results are in preparation for submission to Molecular Biology and Evolution. The overall conclusions changed little from the first analysis; however, a putative recombination hotspot located near a dramatic shift in the average mutation rate was less clear in the second analysis. We suspect that the strong change in selection pressure near the boundary between genes was attracting nearby recombination change points and leading to false evidence of a hotspot. Since accurate detection of recombination hotspots is critical for the motif searcher, the new model is expected to play an important role in the continued search for recombination hotspots in HIV. This model is the one currently being used to accomplish Aim 1.

Figure 2.
Gaussian Markov random field model to identify recombination hotspots in 11 gag gene recombinants.

Dr. Suchard and Mr. Minin have also continued work on Section 5.4.1. They are developing a non-uniform prior over recombination break-points informed by a library of previously identified HIV recombinants. The model is based on a Gaussian Markov random field, and Figure 2 shows the smoothed probabilities of recombination from a collection of 11 gag gene recombinants. A potential recombination hotspot occurs near the border between protein products p17 and p24. This work has been published as a conference proceedings paper, and a manuscript for peer review is in progress. Both Dr. Suchard and Mr. Minin have presented results from these studies at national and international research meetings and invited seminars.

Zhuang J, Jetzt AE, Sun G, Yu H, Klarmann G, Ron Y, Preston BD, Dougherty JP. (2002) Human immunodeficiency virus type 1 recombination: rate, fidelity, and putative hot spots. J. Virol. 76:11273-11282.

c. Significance

Previous recombination models have linked regional changes in evolutionary parameters, such as mutation rate, with changes in topology due to recombination. There is little biological reason to link these processes, but more importantly for the objectives of this grant, it was necessary to disconnect these processes before more complicated models of recombination could be accommodated. For example, if recombination is known to occur preferentially in AT-rich regions, it makes sense to develop a prior that favors configurations placing crossover points in AT-rich regions over GC-rich regions. Attempts to accommodate this bias using previous models would have unwittingly forced changes in evolutionary parameters to also associate with AT-rich regions.
The dual multiple change-point model separates these two processes to allow greater flexibility and realism in the modeling process.

Figure 3. Dual multiple change-point model improves resolution of recombination detection. Results for the single (dashed) and dual (solid) change-point process models offer substantially different (arrows) estimates of the recombination break-point (top plot) and changes in evolutionary pressures (lower plots). In general, dual multiple change-point model estimates are more accurate.

It is commonly assumed, but has never been tested, that recombinant sequences with similar mosaic structure result from the spread and transmission of a single, one-time recombination event rather than from multiple independent events. The methods developed during the last year can distinguish these two outcomes and may reveal that recombination has been far more extensive, and repetitive, than previously suspected. Since the virus may benefit from recombination, for example to achieve multi-drug resistance by combining multiple resistance mutations, a better understanding of how and where recombination occurs is important. Together, the improvements accomplished during the last year will help to better detect past recombination events and may ultimately assist in the design of novel therapeutics to promote or suppress natural recombination.

d. Plans

The plans for the coming year are not substantially changed from those originally proposed.

1. The results of the test for recombination of HIV sequences will be interpreted and submitted to the motif searcher. All sequences with evidence of recombination will be re-examined in careful follow-up analyses, which may produce cleaner datasets for motif searching in subsequent iterations. The identified crossover points, with error bounds, will be made publicly available via the internet.
2.
The HIV genome will be evaluated for its propensity to form secondary structure, and the correlation between this propensity and recombination frequency will be examined. This analysis will help motivate and perhaps guide the motif search for secondary structure.
3. The sequences around HIV crossover points will be analyzed using the sequence and secondary structure motif-searching tools. This will be an iterative process, extending into year 4, since the motifs to be discovered are far less well known than the motifs traditionally sought (e.g., the IRE).
4. The in vitro strand transfer assay will be used to analyze any statistically or computationally identified motifs and a few a priori hypotheses about the type of signal likely to associate with recombination (e.g., A/T richness).
5. The Bayesian model with non-uniform prior on recombination crossover point locations will be adjusted to incorporate the motifs identified in planned objective 4.
6. The Bayesian model will be extended to test whether multiple instances of a crossover point between the same subtypes and in similar locations represent a single recombination event or multiple events. This extension is a slight modification of the existing prior used to test whether two recombinant structures descended from the same or different recombination events.

e. Publications

2004-2005 Papers Directly Related to Grant:
1. Minin VN, Dorman KS, Fang F and Suchard MA. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics, In press.
2. Minin VN, Dorman KS and Suchard MA. Bayesian recombination identification: new models for incorporating prior information. Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association, In press.
3. Ashlock D and Schonfeld J. Nonlinear Projection for the Display of High Dimensional Distance Data. Submitted to CEC 2005.

2004-2005 Papers That Use Methods Developed in Grant:
1. Suchard MA.
Stochastic models for horizontal gene transfer: taking a random walk through tree space. Genetics, In press.
2. Redelings BD and Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, In press.
3. Suchard MA, Weiss RE and Sinsheimer JS. Models for estimating Bayes factors with applications to phylogeny and tests of monophyly. Biometrics, In press.
4. Ashlock D and Kim EY. Techniques for Analysis of Evolved Prisoner's Dilemma Strategies with Fingerprints. Accepted to GECCO 2005.

Talks and Posters Involving Grant Work:
1. Fang F, Suchard MA, Minin VN, Dorman KS. Bayesian Phylogenetic Model to Identify Multiple Recombination Events from the Sequences with Apparently Similar Mosaic Structures. HIV Dynamics and Evolution, Cleveland, OH, April 2005.
2. Kitchen CMR and Suchard MA. Intra-host recombination of between plasma and genital tract conferring resistance to nevirapine. Contributed talk. 2005 Palm Springs Symposium on HIV/AIDS, Palm Springs, CA, March 2005.
3. Suchard MA. Statistics in evolutionary medicine: resolving intra-host phylogenies of rapidly evolving pathogens. Invited seminar. Department of Biostatistics, Johns Hopkins University, Baltimore, MD, January 2005.
4. Suchard MA. Joint Bayesian alignment and phylogeny for intra-host recombination detection. Contributed poster. 2nd Joint Institute of Mathematical Statistics-International Society for Bayesian Analysis Meeting, Bormio, Italy, January 2005.
5. Suchard MA. Resolving the intra-host evolution of rapidly evolving pathogens: incorporating common patterns and shared indel information. Invited seminar. Department of Integrative Biology, UC Berkeley, CA, December 2004.
6. Suchard MA. Resolving the intra-host evolution of rapidly evolving pathogens. Invited talk. International Conference on Bioinformatics, Auckland, New Zealand, September 2004.
7.
Minin VN, Dorman KS and Suchard MA. Bayesian recombination identification: new models and better ways of incorporating prior information. Contributed talk. American Statistical Association, Joint Statistical Meetings, Toronto, Canada, August 2004.
8. Fang F, Rischmiller MA, Suchard MA, Dorman KS. Recombination in Hepatitis B Virus: a survey with evidence for the presence of hotspots. VII International Meeting on Molecular Epidemiology and Evolutionary Genetics of Infectious Diseases, Valencia, Spain, July 2004.

f. Project-Generated Resources

The software to detect recombination via the dual multiple change-point model is available in its original Java version at http://www.biomath.medsch.ucla.edu/msuchard/. The Java version permits estimation of the hierarchical structure on the evolutionary parameters across segments. A slightly faster C version without hierarchical structure estimation is available at http://www.biomath.org/dormanks. The C version also implements the priors needed to test whether recombinants with identical mosaic structure result from a single or multiple past recombination events.