Detecting Evolutionary Signatures of Positive Selection in HIV

Introduction:
It is widely known that the spread of HIV (Human Immunodeficiency Virus)
presents a major threat to human life globally. The World Health Organization
(WHO) estimates that approximately 33.4 million people were living with HIV in
2008 and 2.7 million people were newly infected with HIV just in that year. Of
those previously infected with this retrovirus, about 2 million died. There are two
types of HIV, both of which descended from SIV (Simian Immunodeficiency
Virus). HIV is a lentivirus, a genus characterized by long incubation times and
complex genomes, and it primarily infects CD4+ T cells of the human immune
system. Although many treatments, including highly active antiretroviral
therapy (HAART), can temporarily slow the progression of an HIV infection,
there is currently no cure or vaccine for the virus. HAART greatly reduces
mortality and morbidity in HIV patients but has its own problems of toxicity
within the human body and high cost. Thus, it would be advantageous to develop
novel drug treatments with fewer side effects and greater effectiveness. In
addition, HIV evolves rapidly in response to the use of antiretroviral drugs
and the pressures of the human immune system.
Studying the evolutionary tendencies of HIV could offer some insight into
how to design more effective therapies or perhaps even vaccines for HIV. In the
human body, about 10.3 x 10^9 virus particles are created each day, while
approximately 3 x 10^-5 errors are made per nucleotide per replication cycle in
an RNA genome of approximately 9,700 nucleotides. These rates imply at least
one mutation at every position of the viral genome each day (Hunt 2008). A
mutation in a single codon, which is subsequently translated into an amino
acid, can have dramatic effects on the ability of HIV to evade the adaptive
immune system. This fact, coupled with the high error rate of HIV replication,
allows us to observe strong signals of purifying and positive selection within
the HIV genome. Mutations at sites of structural or functional significance
tend to render the protein dysfunctional and are removed by purifying
selection, while mutations that provide advantages, such as an increase in
virulence, are favored by positive selection. By locating the areas of the HIV
genome that are undergoing positive selection using bioinformatics methods and
tools, we can better predict and prepare for the virus’s course of mutation and
its response to certain antiviral drugs.
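As a rough check on the figures quoted above, the expected number of new mutations at a single genome position per day is the daily particle count multiplied by the error rate. The sketch below is only an illustration: it assumes that the error rate applies per site, that each new particle comes from one replication cycle, and that errors are spread uniformly over the genome. The function name is ours and does not come from any cited source.

```python
def expected_mutations_per_site_per_day(particles_per_day, error_rate_per_site):
    """Expected number of new mutations at one genome position per day.

    Assumes one replication cycle per new particle and a uniform
    per-site error rate (both are simplifying assumptions).
    """
    return particles_per_day * error_rate_per_site


particles_per_day = 10.3e9  # virus particles produced per day (from the text)
error_rate = 3e-5           # errors per site per replication cycle (assumed per site)

per_site = expected_mutations_per_site_per_day(particles_per_day, error_rate)
print(f"Expected mutations per site per day: {per_site:,.0f}")
```

Under these assumptions the product is on the order of 3 x 10^5, far above one, which is why every position can be expected to mutate daily.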
HIV arose from the introduction of SIV into humans, likely through the
handling of primate carcasses in Africa. Two distinct types of HIV exist:
HIV-1, presumed to have come from chimpanzees, and HIV-2, from sooty mangabeys.
HIV-1 is responsible for the modern pandemic. Within HIV-1, three major
groups, M, N, and O, exist, possibly representing three separate introductions
of SIV into the human population. However, about ninety percent of all HIV
infections come from group M, and it is therefore the most studied. Group M is
further subdivided into 9 clades, labeled with the letters A through K,
excluding E and I. These clades tend to be bound by geography, as different
subtypes circulate within each continent. We decided to focus our analysis on
clade B, the dominant subtype in North America and Europe, which has the most
available sequence data.
One means of understanding natural selection at the genetic level is to
compare conserved sequences with variable ones. This strategy can also be
applied to viruses, which, though not considered living, still replicate and
alter their genetic code. The ratio of variable to conserved sites can be
calculated for a region: higher ratios indicate more variation, lower ratios
indicate more conservation, and a ratio of one indicates neutral selection. The
ratio therefore indicates the level and type of selection acting on the area
under consideration. Different genes in HIV have different ratio values, as do
different domains within a protein, so a more specific analysis and model are
required. A classical example in HIV research is the tat gene, which encodes a
transactivator that greatly enhances transcription of the viral genome and
hence the replication of HIV.
Our research goal was to apply a specific evolutionary model to the HIV
genome and to examine how consistent its results are with findings in published
literature that used either bioinformatics or traditional lab approaches to
find sites of conservation and variation. The available phylogenetic analysis
programs use several mathematical models designed specifically to study the
evolutionary process of organisms. These programs range from less detailed
multiple alignments to highly specialized and complex algorithms such as those
using maximum likelihood or Bayesian inference methods. Our project primarily
used the ConSurf tool with maximum likelihood (Ashkenazy et al., 2010).
Phylogenetic analysis of HIV is especially interesting compared to analysis of
mammalian species because of HIV’s rapid evolutionary rate. Having familiarized
ourselves with the program and the databases, we can compare our results with
those found in the literature and eventually extend our analysis to patient
data. This study could help us better understand how the virus co-evolves with
its host cells and how it responds to the barrage of inhibitory drugs.
Materials & Methods:
Several databases were potential sources of sequences for the project. The
first database that we looked at, GenBank, hosted by NCBI, holds all publicly
available sequences. However, we kept running into the inefficiency of
extracting sequences one at a time, since we did not have the computing skills
to write a Perl script to retrieve them in bulk. We therefore looked for other
sources, despite the standard use of GenBank in genetic research, preferring a
database that was strictly HIV and offered a mass download option. Stanford
University hosts an extensive HIV database, but it focuses exclusively on
sequences related to patient data. For our study we wanted to analyze the
overall conservation scores of the virus, and patient data were neither broad
enough nor representative of the virus as a whole. We therefore obtained the
sequences from a database hosted by Los Alamos National Laboratory (LANL), as
it had the most extensive HIV data set available to the public. LANL also has
several features that might be relevant for future projects, such as a
geographic search feature that lists all the HIV sequences from a particular
area.
The Linux platform was fundamental throughout much of our work, since many
bioinformatics tools were written to run on it. Linux provides several
advantages, including speed of calculation, simplicity, and ease of scripting.
At first we used a remote Linux server provided by our mentor; eventually we
installed an Ubuntu-like operating system on our own computers, which let us
fully exploit these advantages. Our advisor aided us in writing Perl scripts
that extracted clade B sequences from the comprehensive sequence set by
separating, sorting, and recombining our desired sequences. Another advantage
of Linux that we found useful was the efficiency of its programs. When we
needed to run multiple sequence alignments with ClustalW locally (the volume of
our sequences was too high for online servers), the process was much faster on
Linux than on Windows (Larkin 2007). We credit one of our teammates for this
discovery, since it greatly improved the efficiency of our work.
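The extraction step described above can be sketched in a few lines. This is not the Perl script we used; it is a minimal Python illustration that assumes FASTA input in which every header begins with the subtype followed by a period (for example `>B.FR.83.SAMPLE1`). That header layout, the function name, and the sample records are all assumptions made for the example.

```python
def filter_fasta_by_subtype(fasta_text, subtype="B"):
    """Return (header, sequence) pairs whose header starts with '<subtype>.'."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    prefix = subtype + "."
    return [(h, s) for h, s in records if h.startswith(prefix)]


# Hypothetical records, invented for illustration only.
example = """>B.FR.83.SAMPLE1
ATGGAGCCAGTA
GATCCTAGA
>C.ZA.04.SAMPLE2
ATGGAGCCAGTAGAT
>B.US.90.SAMPLE3
ATGGAACCAGTAGAT
"""

clade_b = filter_fasta_by_subtype(example, "B")
for name, seq in clade_b:
    print(name, len(seq))
```

Matching on the prefix `B.` rather than on the letter `B` alone keeps recombinant labels that merely start with B out of the result.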
The alignment process allows us to compare the sequences and observe
which regions are conserved, semi-conserved, or not conserved at all. It lines
up the sequences by their consistent regions so that other bioinformatics tools
can analyze them further. From the alignment we can visually compare
phylogenetic relationships and construct trees, find domains of interest, and,
most relevant to our study, analyze the conserved sites. ClustalW and its
graphical counterpart ClustalX are among the most commonly used and reliable
alignment tools (Larkin 2007). Other prominent alignment tools include MUSCLE
(Edgar 2004), which is built into ConSurf, and T-COFFEE (Notredame et al.,
2000), a web-based alignment tool. ConSurf would ultimately be the tool we used
to calculate the various evolutionary rates of the data.
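To illustrate what reading conservation off an alignment means for nucleotides, the sketch below rebuilds a ClustalW-style marker line, with "*" under every column where all sequences agree. It is our own simplified illustration, not ClustalW code, and the three short sequences are invented for the example.

```python
def conservation_line(aligned_seqs):
    """Build a marker line: '*' where every aligned sequence has the same base.

    Columns containing a gap ('-') or any disagreement get a space.
    """
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    marks = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Invented toy alignment: one substitution (column 6) and one gap (column 9).
alignment = [
    "ATGGAGCCAGTA",
    "ATGGAACCAGTA",
    "ATGGAGCC-GTA",
]
marks = conservation_line(alignment)
percent_conserved = 100 * marks.count("*") / len(marks)

for seq in alignment:
    print(seq)
print(marks)
print(f"{percent_conserved:.1f}% of columns fully conserved")
```

Because a nucleotide column is either identical or not, this yields only two categories, which is the limitation that motivates a rate-based method such as ConSurf.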
The majority of our analysis of the HIV sequences was done with ConSurf, a
versatile tool that accepts both nucleotide and amino acid sequences. Its
protocol is shown in Figure 2 in Illustrations. We wished to keep a degree of
control over our alignment parameters and already possessed viable sequences,
so our use of the tool began at step four. The program also constructs
phylogenetic trees, which can be viewed in various software packages. At the
center of the ConSurf analysis is the calculation of the evolutionary
conservation of each position in the sequence using the Rate4Site algorithm
(Ashkenazy et al., 2010).
The conservation of a protein depends on its importance to the virus’s
fitness, because in most cases the mutations that change the genome of a virus
are destructive. However, that is not to say that conserved regions cannot be
affected by natural selection. There are two main methods for estimating the
evolutionary rates: the Bayesian method and maximum likelihood. The resulting
conservation scores measure how consistent the aligned nucleotides are with one
another at each position. The Bayesian method is favored for smaller sets of
sequences, but because we used such expansive data sets, we chose maximum
likelihood, expecting greater accuracy in the conservation scores. The scores
are then divided into nine discrete grades that provide a visual representation
of the data. We also set ConSurf to follow the Tamura 1992 (T92) substitution
model. This model was applicable because HIV’s many point mutations show a
strong transition-transversion bias, which T92 accounts for (Tamura 1992).
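To make the role of the transition-transversion bias concrete, the sketch below computes the pairwise T92 distance between two aligned sequences. It uses the published formula d = -h ln(1 - P/h - Q) - (1/2)(1 - h) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences, h = 2θ(1 - θ), and θ is the G+C content. This is only an illustration of the model: it is not how ConSurf or Rate4Site estimates site rates, and the function name and example sequences are our own.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def t92_distance(seq1, seq2):
    """Tamura (1992) distance between two aligned nucleotide sequences.

    Sites with a gap or ambiguous base in either sequence are skipped.
    """
    valid = PURINES | PYRIMIDINES
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in valid and b in valid]
    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable sites")

    transitions = transversions = gc = 0
    for a, b in pairs:
        gc += (a in "GC") + (b in "GC")
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    p = transitions / n        # proportion of transitional differences
    q = transversions / n      # proportion of transversional differences
    theta = gc / (2 * n)       # pooled G+C content of both sequences
    h = 2 * theta * (1 - theta)

    arg1 = 1 - p / h - q if h > 0 else 1 - q
    arg2 = 1 - 2 * q
    if arg1 <= 0 or arg2 <= 0:
        raise ValueError("sequences too divergent for the T92 correction")
    return -h * math.log(arg1) - 0.5 * (1 - h) * math.log(arg2)


# Invented example: two transitions (G->A and T->C) over 20 sites.
d = t92_distance("ATGGAGCCAGTAGATCCTAG", "ATGGAACCAGTAGACCCTAG")
print(f"T92 distance: {d:.4f}")
```

The corrected distance is slightly larger than the raw proportion of differing sites because the model allows for multiple substitutions at the same position.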
The final step in the ConSurf protocol is to map the conservation levels
onto a query sequence, which must be specified in the initial startup
parameters. The choice of this sequence was arbitrary in our case, so we used
the first sequence in the multiple alignment output as our query sequence. The
output included the query sequence color-coded by conservation, a
corresponding file listing each conservation score and its color assignment,
a file listing the nucleotide variety at each position, and the data for
rendering phylogenetic trees.
Results:
The ClustalW alignments were the first results on which we could conduct
limited analysis. Normally, with protein sequences, ClustalW alignments report
four degrees of variability, with a “*” symbol representing complete
conservation, “:” representing conserved substitutions, “.” representing
semi-conservation, and the absence of a symbol representing variation (Larkin
2007). The large number of substitution choices, 20 amino acids to be exact,
allows for this gradation. In our ClustalW output, however, there were only two
degrees, complete conservation or no conservation, because we were analyzing
nucleotides (Figure 1). This result accentuates the need for a more sensitive
and more inclusive output. To organize our work, we created a wikispace website
that held our literature review as well as our data and results in a
centralized fashion. The raw sequences, extraction scripts, and extracted
sequences can be found at http://hivproject.wikispaces.com/Data. A teammate
with experience in data management set this website up.
We can, however, conduct a limited interpretation of these results. For
example, we can calculate the percentage of fully conserved sites within each
gene. In this analysis the tat gene had a larger proportion of fully conserved
sites than the other genes (Table 1). A significance test on these values
would be needed to tell whether the larger observed conservation in tat is
meaningful. A simple examination of the standard deviation reveals that tat’s
percentage conserved is only between one and two standard deviations above the
mean percentage across the four genes. With so few genes, this comparison
carries little statistical weight, though it is interesting to consider. To
obtain more detailed and comprehensive results, we clearly needed a better form
of analysis than multiple sequence alignment alone. This exercise illustrates
the weakness of relying solely on ClustalW for analyzing sequence data.
Table 1: Summary of ClustalW Alignment Results

Gene   Fully Conserved   Variable   % Fully Conserved   Total Sequence Length
rev    46                305        13.11%              351
tat    66                240        21.57%              306
vif    97                474        16.75%              579
vpr    37                254        12.71%              291
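The percentages in Table 1 and the standard-deviation comparison can be reproduced with a few lines. The sketch below takes the conserved counts and total lengths from Table 1; the use of the sample standard deviation (n - 1 in the denominator) is our assumption, since the report does not say which form was used.

```python
from statistics import mean, stdev

# (fully conserved sites, total alignment length) per gene, from Table 1
table1 = {
    "rev": (46, 351),
    "tat": (66, 306),
    "vif": (97, 579),
    "vpr": (37, 291),
}

percent = {gene: 100 * cons / total for gene, (cons, total) in table1.items()}
avg = mean(percent.values())
sd = stdev(percent.values())  # sample standard deviation (assumed form)

for gene, value in percent.items():
    z = (value - avg) / sd
    print(f"{gene}: {value:.2f}% fully conserved, {z:+.2f} SD from the mean")
```

With only four genes, these z-scores are descriptive at best; a real significance test would need a null model for how conservation varies between genes.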
When we used ConSurf for analysis, we obtained results that rated
conservation in nine grades, which is much more sensitive than the alignment
method. The ConSurf service provides a very convenient display that maps the
conservation level onto the selected query sequence with a color scheme
(Figure 3). To avoid confusion, note that ConSurf includes ConSeq, the program
that processed the nucleotide sequences and calculated the evolutionary
conservation rates; the results are titled “ConSeq Results” for that reason.
The main advantage of the ConSurf method lies in this output, which is
extremely useful for visualizing the variability of the genomic code.
Though the ConSurf result mapped onto the query sequence is a great visual
tool for identifying sites of variation and conservation, it provides no hard
numbers to work with. The conservation scores provided by ConSurf offer a more
quantitative way to analyze the data. These can be found on the results page of
the wikispace (http://hivproject.wikispaces.com/Results) as files called
“Conservation Scores_proteinname.” These files list the conservation score as
well as the assigned color grade in number form. At first we wished to compare
the number of strong variation sites, which we defined as Level 1 as reflected
in Figure 3. Table 2 is a further tabular summary of the variation and
conservation of sites, much like Table 1. Although ConSurf provides a more
in-depth analysis than ClustalW allowed, this summary is remarkably similar to
Table 1 in its lack of detail. The variable-to-conserved ratios are nearly
identical across the four genes and yield no significant observations, though
the individual sites remain significant because of their implications for
mutation rates.
Figure 4 depicts a diagram of the tat protein. This protein is important in
the transcription of the HIV genome and greatly enhances the replication rate
of the virus. It would be important to compare the sites of variability and
sites of conservation to see if they intuitively match up with the functional
areas of the protein. For example, if the protein has an important function at
a given domain, that should translate to conserved structural or functional
amino acid residues there.
Table 2: Sites of Strong Conservation and Strong Variation in Select HIV-1
Clade B Proteins

Gene   Strong Variation   Strong Conservation   Ratio of
       (Level 1)          (Level 9)             Level 1 to Level 9
rev    64                 143                   0.448
tat    59                 143                   0.413
vif    110                253                   0.435
vpr    53                 135                   0.393
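The ratios in Table 2 are simply the Level 1 count divided by the Level 9 count. The short sketch below recomputes them from the table and reports how far apart the four genes are; the variable names are our own.

```python
# (Level 1 sites, Level 9 sites) per gene, from Table 2
table2 = {
    "rev": (64, 143),
    "tat": (59, 143),
    "vif": (110, 253),
    "vpr": (53, 135),
}

ratios = {gene: level1 / level9 for gene, (level1, level9) in table2.items()}
spread = max(ratios.values()) - min(ratios.values())

for gene, ratio in ratios.items():
    print(f"{gene}: Level 1 to Level 9 ratio = {ratio:.3f}")
print(f"Largest difference between genes: {spread:.3f}")
```

The narrow spread is why this two-category summary says little about differences between the genes.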
Recognizing the limitations of the previous analysis, we instead counted the
frequency of each conservation level. The results of this counting are listed
below in Table 3. Because raw numbers are hard to interpret visually, they have
been converted into circle graphs. The graph for the tat gene is displayed in
Figure 5.
In the chart view, one can see that the most conserved and most variable
grades dominate the graph. When the less conserved but still relatively
constant grades seven and eight are counted as conserved, and grades two and
three as variable, conserved sites form the clear majority.
We collected our phylogenetic tree results in Newick format. While this
format is hard for humans to read, it is easy for software packages such as
MEGA (Molecular Evolutionary Genetics Analysis) to render and explore the tree
(Kumar 2008). The phylogenetic trees we generated with ConSurf can again be
found on the wikispace link above, as the files with the extension “.nwk”. An
example of a phylogenetic tree built with the maximum likelihood model from the
tat gene data is displayed in Figure 6. The sheer number of branches is
overwhelming and highlights the size of our data source. However, the visual
layout does distinguish the major clusters of sequences and the evolutionary
groups they form. In tree rendering software, it would be easier to determine
numerical values for the evolutionary closeness of the clade B strains.
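For readers unfamiliar with Newick, the format nests parentheses to describe the tree, with each leaf written as a name and an optional branch length. The sketch below pulls the leaf names out of a small, invented Newick string with a regular expression. It is a simplification for illustration only (quoted labels and comments are not handled) and is not how MEGA reads the files.

```python
import re


def newick_leaf_names(newick):
    """Return leaf labels from a simple Newick string.

    A leaf label is taken to be a name that directly follows '(' or ','.
    Quoted labels and bracketed comments are not handled.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Invented four-leaf tree with branch lengths, for illustration only.
tree = ("((B.FR.83.S1:0.01,B.US.90.S3:0.02):0.005,"
        "(B.GB.86.S4:0.03,B.TH.90.S5:0.04):0.01);")

leaves = newick_leaf_names(tree)
print(f"{len(leaves)} leaves:", ", ".join(leaves))
```

Counting leaves this way gives a quick check that a tree file contains as many sequences as the alignment it was built from.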
Table 3: Frequency of Each Level of Conservation in Each Gene

Conservation     Frequencies of Conservation for Each Gene:
Level (Color):   rev    tat    vif    vpr
1                64     59     110    53
2                10     6      12     11
3                12     10     15     8
4                17     12     19     15
5                17     17     25     12
6                20     14     40     16
7                20     20     41     15
8                48     25     64     26
9                143    143    253    135
Total            351    306    579    291
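The grade counts in Table 3 can be grouped to check how the conserved and variable categories compare. In the sketch below, grades 1 to 3 are treated as variable, 4 to 6 as intermediate, and 7 to 9 as conserved; the intermediate band and the function name are our own choices for the example.

```python
# Number of sites at conservation grades 1..9 for each gene, from Table 3
grade_counts = {
    "rev": [64, 10, 12, 17, 17, 20, 20, 48, 143],
    "tat": [59, 6, 10, 12, 17, 14, 20, 25, 143],
    "vif": [110, 12, 15, 19, 25, 40, 41, 64, 253],
    "vpr": [53, 11, 8, 15, 12, 16, 15, 26, 135],
}


def grouped_fractions(counts):
    """Return the fractions of variable, intermediate and conserved sites."""
    total = sum(counts)
    variable = sum(counts[0:3])      # grades 1-3
    intermediate = sum(counts[3:6])  # grades 4-6
    conserved = sum(counts[6:9])     # grades 7-9
    return variable / total, intermediate / total, conserved / total


fractions = {gene: grouped_fractions(c) for gene, c in grade_counts.items()}
for gene, (var, mid, cons) in fractions.items():
    print(f"{gene}: variable {var:.1%}, intermediate {mid:.1%}, conserved {cons:.1%}")
```

The same three fractions are what a circle graph like Figure 5 would show if its nine slices were merged into three.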
Illustrations:
Figure 1: Section of Multiple Sequence Alignment (Pairwise) of DNA sequences
using ClustalW for the tat gene
Figure 2: ConSurf Protocol (Ashkenazy et al., 2010)
Figure 3: Conservation Output for the tat Gene using ConSurf
Figure 4: Functional Regions of tat gene (Doherty 2005)
Figure 5: Circle Graph of Conservation Grades
Figure 6: Phylogenetic Tree Demonstrating the Vastness of our Data in MEGA
(tat gene)
Discussion:
By analyzing the mutations in a genome over a period of time, evolution can
be observed not just over millennia, but also on a very short time scale. This
analysis is especially applicable to fast-replicating viruses, whose
evolutionary rates are astounding. The places in the genome that can change
HIV’s rate of survival deserve particular scrutiny. The combination of
conservation and variability at these sites essentially governs evolution at
the molecular level.
As mentioned earlier in the report, many multiple sequence alignment
programs such as ClustalW can show the degree of variability across both
protein sequences and nucleotide sequences. However, the outputs of these
programs are rarely reported as final results in modern bioinformatics because
of their low sensitivity. Instead, they often serve as inputs to more advanced
algorithms.
We decided to focus our analysis on the tat gene, which is a potential
target for antiretroviral drugs, and a few other regulatory and accessory genes
that enhance HIV replication efficiency. These genes are less understood than
the three essential and characteristic lentivirus genes env, gag, and pol. To
our knowledge, conservation analysis of their nucleic acid sequences has not
been done, though positive selection studies have been. Much of the focus of
HIV specialists lies on the three major genes of HIV, though antiretroviral
therapy includes drugs for many stages of the HIV lifecycle. Perhaps our
research can help the scientific community gain more knowledge of the less
studied accessory proteins.
Based on our findings, the mutational ability of HIV can be taken into
account as another dimension during the research and development of new
antiretroviral drugs. We know that HIV evolves incredibly quickly, but now we
can pinpoint the locations of this variability. This information could increase
the efficiency of the drug development pipeline and aid in the development of
new drugs that counter viral adaptations. Personalized drugs may even become
possible if a refined method of choosing drugs based on the strains of virus
present within the body were developed, though expense and toxicity would
remain major obstacles for HIV patients. Such technologies may also improve the
prospects for a vaccine. Pharmaceutical companies can target the areas of the
virus that change most frequently, the variation sites, and look for ways to
make those areas less volatile or to predict when and where mutations will
occur. The variable sites can be mapped for each of the HIV proteins in all of
the clades. Even though our focus was on one clade of the predominant group M
of HIV-1, the other groups and types can be examined as well. Scientists are
continually sequencing new HIV strains as they arise and making them available
to other institutions for analysis. Through this ongoing scientific process,
ways to slow HIV may eventually be found.
In our project, the goal was to compare the conserved and variable regions
of HIV sequences. We measured the ratios between the sites that were most
conserved and most variable. But this big picture tells us little about what is
occurring at the level of individual nucleotide sites. Some sites exhibited
noticeably higher variability, and we wanted to analyze their implications.
Such sites are unlikely to belong to a vital structural or functional domain,
but they could be part of a binding site that interacts with human proteins
such as receptors. Mutations at these sites could provide an evolutionary
advantage for the virus in both evading human defenses and adapting to human
counter-adaptations. Our methodology for calculating these rates can be
replicated for further analysis of the evolutionary rates in other clades of
HIV and other lentiviruses.
When the genetic code of HIV or another retrovirus mutates, a single point
mutation can change the amino acid sequence produced by translation. Some
mutations are silent, because most amino acids are encoded by multiple codons.
Some types of statistical analysis use amino acid sequences to calculate
protein variability. For example, in conservation analysis, variation is scored
from the evolutionary rate of each amino acid position using either Bayesian or
maximum likelihood methods (Ashkenazy 2010). Our method uses nucleotides
instead, because they also capture silent changes and therefore give us more
evolutionary information than would be possible using amino acid sequences.
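The difference between a silent and an amino-acid-changing mutation can be shown with the standard genetic code. The sketch below is our own illustration rather than part of the ConSurf workflow: it builds the standard codon table and classifies a single-base change in one codon. The example codon is arbitrary.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_point_mutation(codon, position, new_base):
    """Classify a single-base change within one codon.

    `position` is 0, 1 or 2. RNA input (U) is converted to DNA letters (T).
    Returns (mutated codon, amino acid before, amino acid after, kind).
    """
    codon = codon.upper().replace("U", "T")
    new_base = new_base.upper().replace("U", "T")
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "nonsynonymous"
    return mutated, before, after, kind


# Arbitrary example codon: GAA encodes glutamate (E).
silent = classify_point_mutation("GAA", 2, "G")    # third-position change
changing = classify_point_mutation("GAA", 1, "C")  # second-position change
print(silent)
print(changing)
```

A protein-level alignment would record only the second of these two changes, whereas a nucleotide-level analysis sees both.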
Our data do have an interpretational problem. The size of our dataset
means that it contains very diverse strains, producing a very divergent and
perhaps misleading output from ConSurf. Our results are therefore somewhat
inconsistent with literature that uses smaller datasets. However, the purpose
of our project was to analyze the overall conservation of the genome, and our
results must be interpreted in that light.
Conclusion:
In our research, we identified sites of variation in the Human
Immunodeficiency Virus that were reasonably consistent with existing
literature. These sites contribute to diversity and differentiation within the
viral population, allowing the virus to dodge both natural and artificial
inhibitory measures such as antibodies and HAART. During our research and
calculation, we directly compared the conserved and variable sites. While
conserved sites outnumbered variable sites as expected, there was a very large
number of highly variable sites compared to, for example, a human gene. Having
a few hundred sequences from various strains contributed to the high
variability. The variable sites have been identified in our results, and this
methodology could be very useful in predicting such sites for future
pharmaceutical applications.
The objectives of this project were met for the most part, though there is
plenty more we can do in this field. Another program package, PAML, could be
used in place of ConSurf, with similar results though a different approach. It
estimates positive and purifying selection from the ratio of nonsynonymous to
synonymous substitution rates, which is related to conservation and variability
scores. Experimental analysis, rather than statistical analysis alone, may be
more accurate. There was some margin for error, most of which would have
stemmed from the original sequence data: errors made when the sequences were
gathered would carry over into our results.
This project was very educational, as we knew virtually nothing about
bioinformatics or the structure of HIV before we undertook it. Plenty of
available documentation helped us troubleshoot and get the project running
successfully. Our understanding of HIV has become quite comprehensive for high
school students, and we all endeavor to continue this research through college
and beyond, in the hope of eventually contributing to a cure for HIV. To
confirm that our results are accurate, we would need the opportunity to test
them experimentally. This project thus far has been solely computer-based, but
once enough data are gathered, we would welcome an opportunity to run some
wet-lab experimental trials.
Future work on this project includes extending our analysis to the other
clades as well as to other lentiviruses such as SIV. These analyses could
provide insights into the phylogenetic differences within the lentivirus family
and show whether any consistencies in sites of positive selection exist. In
addition, it would be important to study how adaptive human proteins respond to
HIV. The human response to HIV in an individual or in a population could be
better understood if it were further documented. Another relevant study would
be to investigate the response of the virus to the pressures of suppressive
drug therapies such as HAART.