Progress report 2005

Principal Investigator/Program Director (Dorman, Karin, Saskia):
a. Specific Aims
The specific aims have not been altered from the original competing award.
b. Studies and Results
The first specific aim is to “construct a database of statistically confirmed HIV recombinant
crossover points.” Last year, we reported trouble testing HIV sequences for intersubtype
recombination because the phylogenetic relationship between the subtypes was not consistent
throughout the genome. Such evidence of ancient recombination among the subtype ancestors is
problematic for our original method that, for computational reasons, assumes a fixed phylogenetic
relationship for all but the putative recombinant sequences. Last year, we proposed to modify the
recombination detection method to simultaneously estimate ancient recombination. However, it is
still not computationally feasible to use such a method to achieve our goal of testing all HIV
sequences longer than 500 nucleotides for intersubtype recombination. Instead, our current
analysis reduces the problem in two ways. First, we divide the HIV genome into many overlapping
segments of 500 to 1000 nucleotides. Next, we divide the HIV subtypes into distinct groups that
have consistent phylogenetic relationships. A reference alignment is created for each group of
subtypes and genomic region. We align each HIV sequence to all reference alignments with which
it has homology and test for recombination. Thus, each HIV sequence is tested for recombination
multiple times. Any positive recombination signal is analyzed again using all subtype reference
sequences and without fixing the parental tree. The computations are currently underway and first
results will be ready for motif searching (Aim 2) during the summer.
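The segmentation step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the 1000-nucleotide segment length, the 500-nucleotide step, and the minimum-overlap rule are all assumptions chosen to match the 500 to 1000 nucleotide range quoted in the text.

```python
def overlapping_segments(genome_length, segment_length=1000, step=500):
    """Return half-open (start, end) intervals that tile the genome with overlap.

    segment_length and step are illustrative defaults, not values from the report.
    """
    if not 0 < step <= segment_length:
        raise ValueError("step must be positive and no larger than segment_length")
    segments = []
    start = 0
    while True:
        end = min(start + segment_length, genome_length)
        segments.append((start, end))
        if end == genome_length:
            break
        start += step
    return segments


def segments_for_sequence(seq_start, seq_end, segments, min_overlap=500):
    """Pick the reference segments that a query sequence overlaps by at least min_overlap sites."""
    hits = []
    for seg_start, seg_end in segments:
        overlap = min(seq_end, seg_end) - max(seq_start, seg_start)
        if overlap >= min_overlap:
            hits.append((seg_start, seg_end))
    return hits


if __name__ == "__main__":
    # 9719 is used only as an example genome length.
    segs = overlapping_segments(9719)
    # A query covering sites 600..1800 is tested once per sufficiently overlapping segment.
    print(segments_for_sequence(600, 1800, segs))
```

Because consecutive segments overlap, a sequence is typically matched to more than one segment, which is why each HIV sequence ends up being tested for recombination multiple times.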
Figure 1. Accession numbers AF385934 and AF385935, two recombinants with identical intersubtype B/F recombinant
structure, are tested for monophyly. Shown are the HIV genomic map (genes), the probabilities that the recombinants
are subtype B, subtype F, split between subtypes B and F, or subtype C (), and the probability that the recombinants
are monophyletic (mono). There is moderate evidence that nucleotides 1200-1600 descend from distinct subtype F
sequences, suggesting that at least one sequence experienced a second recombination event and a possible hotspot
at position 1600. The polyphyletic 3’ nef region may also indicate multiple events, although a small subtype C-like
segment inferred in this region may confound results.
PHS 2590 (Rev. 05/01)
Form Page 5
If there is some kind of “recombination signal” present in RNA sequence that triggers
recombination, sequences near confirmed hotspots for recombination will likely represent the
strongest source of this signal. Thus, the motif search of Aim 2 is likely to be most successful
when tested against sequences near recombination hotspots. Unfortunately, descendants of
recombination events occurring at a hotspot are difficult to distinguish from descendants of a single,
rare recombination that has subsequently spread through a population. Dr. Dorman and graduate
student Fang Fang developed a method during the last year to test whether recombinants with
apparently similar structure result from a single, unique recombination event or multiple, similar
recombination events. We infer that multiple recombination events have occurred when the
recombinants with similar structure do not form a monophyletic group in a phylogenetic analysis, i.e.,
the recombinants do not form a tightly related cluster. Because there are far fewer trees that place
recombinants as monophyletic than polyphyletic (not monophyletic), a uniform prior on trees will
favor the hypothesis of multiple events. Fang Fang developed a prior that weights monophyletic
and polyphyletic recombinants equally a priori, leading to a more accurate statistical analysis.
Preliminary results suggest that two intersubtype B/F HIV recombinants that share some crossover
points may have recombined multiple times at this site, suggesting a possible hotspot
(Figure 1). This work has been accepted for presentation at three international conferences.
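The imbalance between monophyletic and polyphyletic trees can be made concrete with a small counting sketch. It assumes rooted binary topologies on labelled taxa, for which the number of trees on n taxa is (2n - 3)!! and the number in which k chosen taxa form a clade is the product of the counts for k and n - k + 1 taxa. The report does not state which tree space or exact prior form was used, so the function names and the even split of prior mass below are illustrative assumptions only.

```python
from fractions import Fraction


def odd_double_factorial(m):
    """Return m!! for odd m >= -1, with (-1)!! defined as 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def rooted_tree_count(n_taxa):
    """Number of rooted binary topologies on n labelled taxa: (2n - 3)!!."""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    return odd_double_factorial(2 * n_taxa - 3)


def monophyletic_tree_count(n_taxa, k_recombinants):
    """Rooted topologies in which the k recombinants form a single clade."""
    return rooted_tree_count(k_recombinants) * rooted_tree_count(n_taxa - k_recombinants + 1)


def prior_weights(n_taxa, k_recombinants):
    """Compare a uniform tree prior with a prior giving each hypothesis mass 1/2.

    Returns (P(monophyly) under the uniform prior,
             per-tree weight for monophyletic trees under the balanced prior,
             per-tree weight for polyphyletic trees under the balanced prior).
    """
    if not 2 <= k_recombinants < n_taxa:
        raise ValueError("need 2 <= k_recombinants < n_taxa")
    total = rooted_tree_count(n_taxa)
    mono = monophyletic_tree_count(n_taxa, k_recombinants)
    poly = total - mono
    return Fraction(mono, total), Fraction(1, 2 * mono), Fraction(1, 2 * poly)


if __name__ == "__main__":
    uniform_mono, w_mono, w_poly = prior_weights(10, 2)
    print("P(monophyly | uniform prior) =", uniform_mono)
    print("balanced per-tree weights:", w_mono, w_poly)
```

With two recombinants among ten taxa, a uniform prior puts only a small fraction of its mass on monophyly, so a posterior favoring multiple events could simply reflect the prior. Reweighting the two classes to equal total mass removes that bias.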
The second aim is to search for sequence and secondary structure motifs associated with
recombination hotspots. In the last year the Ashlock lab has developed a technique called
multiclustering and a non-linear projection algorithm for understanding the output of RNA structural
motifs. A paper applying multiclustering to non-RNA data has been accepted to the 2005 Genetic
and Evolutionary Computation Conference (GECCO). A paper applying the non-linear projection technique to
iron response elements (IREs) has been submitted to the 2005 Congress on Evolutionary Computation (CEC).
The non-linear projection correctly distinguished the two known forms of IREs. A survey paper on
different RNA structural motif finders and a paper on the theoretical basis of multiclustering are in
preparation.
The goal of the third aim is to determine whether cis-acting sequences can modulate the frequency of HIV-1 recombination in vitro. These studies are meant to verify experimentally the putative
recombination hot spots identified through statistical and computational methods. We also
planned to examine some putative hotspots identified by others (Zhuang et al., 2002) in order to
identify promising signals that the motif searcher (Aim 2) should target.
Unfortunately, significant progress on Aim 3 was thwarted in the last year by the
sudden departure of Sijun Liu, the postdoc working with Susan Carpenter. A replacement postdoc,
XiaoYu Lui, will be hired starting July 1, 2005, to continue work on this aim. The first
predictions of the motif searcher should be available for in vitro testing by the end of
summer 2005.
The fourth aim is to extend the “Bayesian change-point model for recombination and incorporate the
significant features identified in Aims 2 & 3.” This year, Dr. Suchard and his graduate student,
Vladimir Minin, have completed the work of Section 5.4.2. They have extended the multiple
change-point model for recombination detection to include two independent change-point
processes. One of these processes directly models recombination, allowing for a non-uniform
distribution of crossover points. The other process allows evolutionary pressures to vary
independently. Figure 3 demonstrates the improvement of the dual over the single change-point
model in an HIV recombination detection problem. A manuscript describing the dual model is “in
press” in Bioinformatics. Dr. Dorman and graduate student Fang Fang re-analyzed a set of nearly
400 Hepatitis B viral (HBV) sequences, previously tested using the original single multiple change-point
model and reported last year, with the dual multiple change-point model. The slightly modified
results are in preparation for submission to Molecular Biology and Evolution. The overall
conclusions changed little from the first analysis; however, a putative recombination hotspot located
near a dramatic shift in the average mutation rate was less clear in the second analysis. We
suspect that the strong change in selection pressure near the boundary between genes was
attracting nearby recombination change points and leading to false evidence of a hotspot. Since
accurate detection of recombination hotspots is critical for the motif searcher, the new model is
expected to play an important role in the continued search for recombination hotspots in HIV. This
model is the one currently being used to accomplish Aim 1.
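The separation of the two change-point processes can be illustrated with a small sketch. It only shows the bookkeeping: one set of change points governs the tree topology, an independent set governs an evolutionary parameter, and the alignment is homogeneous only between consecutive points of their union. The function names, the topology labels, and the rate values are hypothetical, and no inference is performed here.

```python
from bisect import bisect_right


def piecewise_track(n_sites, change_points, values):
    """Expand sorted change points into a per-site track.

    change_points holds the site indices at which a new segment starts;
    values holds one entry per segment, so len(values) == len(change_points) + 1.
    """
    if len(values) != len(change_points) + 1:
        raise ValueError("need exactly one value per segment")
    return [values[bisect_right(change_points, site)] for site in range(n_sites)]


def dual_model_segments(n_sites, topology_cps, rate_cps):
    """Half-open segments on which both topology and rate are constant."""
    starts = sorted({0, *topology_cps, *rate_cps})
    ends = starts[1:] + [n_sites]
    return list(zip(starts, ends))


if __name__ == "__main__":
    n_sites = 12
    topology_cps = [6]   # a single recombination crossover (hypothetical)
    rate_cps = [3, 9]    # two shifts in evolutionary rate (hypothetical)

    topology = piecewise_track(n_sites, topology_cps, ["B-like", "F-like"])
    rate = piecewise_track(n_sites, rate_cps, [1.0, 2.5, 0.5])

    for start, end in dual_model_segments(n_sites, topology_cps, rate_cps):
        print(start, end, topology[start], rate[start])
```

In a single change-point process both tracks would be forced to share the same change points, so a sharp rate shift could pull an inferred crossover toward it. Keeping the two lists separate is what allows a rate change without a topology change, and the reverse.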
Figure 2. Gaussian Markov random field model to identify recombination hotspots in 11 gag gene recombinants.
Dr. Suchard and Mr. Minin have also continued work on Section 5.4.1. They are developing a non-uniform prior over recombination break-points informed by a library of previously identified HIV
recombinants. The model is based on a Gaussian Markov random field, and Figure 2 shows
the smoothed probabilities of recombination from a collection of 11 gag gene recombinants. A
potential recombination hotspot occurs near the border between protein products p17 and p24.
This work has been published as a conference proceedings paper, and a manuscript for peer review
is in preparation. Both Dr. Suchard and Mr. Minin have presented results from these studies at
national and international research meetings and invited seminars.
J Zhuang, AE Jetzt, G Sun, H Yu, G Klarmann, Y Ron, BD Preston, JP Dougherty. (2002) Human immunodeficiency
virus type 1 recombination: rate, fidelity, and putative hot spots. J. Virol. 76:11273-11282.
c. Significance
Previous recombination models have linked regional changes in evolutionary parameters, such as
mutation rate, with changes in topology due to recombination. There is little biological reason to
link these processes, but more importantly for the objectives of this grant, it was necessary to
disconnect these processes before more complicated models of recombination could be
accommodated. For example, if recombination is known to occur preferentially in AT-rich regions, it
makes sense to develop a prior that favors configurations placing crossover points in AT-rich
regions over GC-rich regions. Attempts to accommodate this bias using previous models would
have unwittingly forced changes in evolutionary parameters to also associate with AT-rich regions.
The dual multiple change point model separates these two processes to allow greater flexibility and
realism in the modeling process.
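The AT-richness example above can be sketched as a discrete prior over crossover positions. This is an illustration under stated assumptions rather than the model used in the project: the exponential weighting, the window size, and the `strength` parameter are hypothetical choices, and a strength of zero recovers the uniform prior.

```python
import math


def at_fraction(window):
    """Fraction of A/T (or U) bases in a stretch of sequence."""
    window = window.upper()
    return sum(base in "ATU" for base in window) / len(window)


def crossover_prior(sequence, window=5, strength=2.0):
    """Prior probability of a crossover at each site, up-weighting AT-rich neighborhoods.

    The weight of site i is exp(strength * AT fraction in a window centered on i),
    normalized so the weights sum to one. This form is an assumption for illustration.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    half = window // 2
    weights = []
    for i in range(len(sequence)):
        lo = max(0, i - half)
        hi = min(len(sequence), i + half + 1)
        weights.append(math.exp(strength * at_fraction(sequence[lo:hi])))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    prior = crossover_prior("GCGCGCGCATATATAT", window=5, strength=2.0)
    for site, p in enumerate(prior):
        print(site, round(p, 4))
```

Because this prior acts only on crossover locations, it can be combined with a separate process for evolutionary parameters without forcing those parameters to change in AT-rich regions as well.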
Figure 3. Dual multiple change-point model improves resolution of recombination detection. Results for the single
(dashed) and dual (solid) change point process models offer substantially different (arrows) estimates of the
recombination break-point (top plot) and changes in evolutionary pressures (lower plots). In general, dual multiple
change-point model estimates are more accurate.
It is commonly assumed, but never tested, that recombinant sequences with similar mosaic
structure have resulted from the spread and transmission of a single, one-time recombination
event rather than from multiple independent events. The methods developed during the last year can distinguish these two outcomes, and may
reveal that recombination has been far more extensive, and repetitive, than previously suspected.
Since the virus may benefit from recombination, for example to achieve multi-drug resistance by
combining multiple resistance mutations, a better understanding of how and where recombination
occurs is important. Together, the improvements accomplished during the last year will help to
better detect past recombination events and may ultimately assist in the design of novel
therapeutics to promote or suppress natural recombination.
d. Plans
The plans for the coming year are not substantially changed from those originally proposed.
1. The results of the test for recombination of HIV sequences will be interpreted and submitted to
the motif searcher. All sequences with evidence of recombination will be re-examined in careful
follow-up analyses, which may produce cleaner datasets for motif searching in subsequent
iterations. The identified crossover points, with error bounds, will be made publicly available via
the internet.
2. The HIV genome will be assessed for its propensity to form secondary structure, and the
correlation between this propensity and recombination frequency will be estimated. This
analysis will help motivate and perhaps guide the motif search for secondary structure.
3. The sequences around HIV crossover points will be analyzed using the sequence and
secondary structure motif-searching tools. This will be an iterative process, extending into year
4, since the motifs to discover are far less well known than the motifs traditionally sought (e.g.
the IRE).
4. The in vitro strand transfer assay will be used to analyze any statistically or computationally
identified motifs and a few a priori hypotheses about the type of signal likely to associate with
recombination (e.g. A/T richness).
5. The Bayesian model with non-uniform prior on recombination crossover point locations will be
adjusted to incorporate the motifs identified in planned objective 4.
6. The Bayesian model will be extended to test the hypothesis that multiple instances of a
crossover point between the same subtypes and in similar locations represent a single or
multiple recombination event(s). This extension requires only a slight modification of the existing prior used to test
whether two recombinant structures descended from the same or different recombination events.
e. Publications
2004-2005 Papers Directly Related to Grant:
1. Minin VN, Dorman KS, Fang F and Suchard MA. Dual multiple change-point model leads to
more accurate recombination detection. Bioinformatics, In press.
2. Minin VN, Dorman KS and Suchard MA. Bayesian recombination identification: new models for
incorporating prior information. Proceedings of the Section on Bayesian Statistical Science.
Alexandria, VA: American Statistical Association, In press.
3. Ashlock D and Schonfeld J. Nonlinear projection for the display of high dimensional distance data.
Submitted to CEC 2005.
2004-2005 Papers That Use Methods Developed in Grant:
1. Suchard MA. Stochastic models for horizontal gene transfer: taking a random walk through tree
space, Genetics, In press.
2. Redelings BD and Suchard MA. Joint Bayesian estimation of alignment and phylogeny.
Systematic Biology, In press.
4. Suchard MA, Weiss RE and Sinsheimer JS. Models for estimating Bayes factors with
applications to phylogeny and tests of monophyly. Biometrics, In press.
5. Ashlock D and Kim EY. Techniques for analysis of evolved Prisoner's Dilemma strategies with
fingerprints. Accepted to GECCO 2005.
Talks And Posters Involving Grant Work
1. Fang F, Suchard MA, Minin VN, Dorman KS. Bayesian Phylogenetic Model to Identify Multiple
Recombination Events from the Sequences with Apparently Similar Mosaic Structures. HIV
Dynamics and Evolution, Cleveland, OH, April, 2005.
2. Kitchen CMR and Suchard MA. Intra-host recombination of between plasma and genital tract
conferring resistance to nevirapine. Contributed talk. 2005 Palm Springs Symposium on
HIV/AIDS, Palm Springs, CA, March 2005.
3. Suchard MA. Statistics in evolutionary medicine: resolving intra-host phylogenies of rapidly
evolving pathogens. Invited seminar. Department of Biostatistics, Johns Hopkins University,
Baltimore, MD, January 2005.
4. Suchard MA. Joint Bayesian alignment and phylogeny for intra-host recombination detection.
Contributed poster. 2nd Joint Institute of Mathematical Statistics-International Society for
Bayesian Analysis Meeting, Bormio, Italy, January 2005.
5. Suchard MA. Resolving the intra-host evolution of rapidly evolving pathogens: incorporating
common patterns and shared indel information. Invited seminar. Department of Integrative
Biology, UC Berkeley, CA, December 2004.
6. Suchard MA. Resolving the intra-host evolution of rapidly evolving pathogens. Invited talk.
International Conference on Bioinformatics, Auckland, New Zealand, September 2004.
7. Minin VN, Dorman KS and Suchard MA. Bayesian recombination identification: new models and
better ways of incorporating prior information. Contributed talk. American Statistical
Association, Joint Statistical Meetings, Toronto, Canada, August 2004.
8. Fang F, Rischmiller MA, Suchard MA, Dorman KS. Recombination in Hepatitis B Virus: a survey
with evidence for the presence of hotspots. VII International Meeting on Molecular
Epidemiology and Evolutionary Genetics of Infectious Diseases. Valencia, Spain, July, 2004.
f. Project-Generated Resources
The software to detect recombination via the dual multiple change point model is made available in
the original Java version at http://www.biomath.medsch.ucla.edu/msuchard/. The Java version
permits estimation of the hierarchical structure on the evolutionary parameters across segments. A
slightly speedier C version without hierarchical structure estimation is available at
http://www.biomath.org/dormanks. The C version also implements the priors needed to test
whether recombinants with identical mosaic structure result from a single or multiple past
recombination events.