Detecting Evolutionary Signatures of Positive Selection in HIV

Introduction:

The spread of HIV (Human Immunodeficiency Virus) presents a major threat to human life globally. The World Health Organization (WHO) estimates that approximately 33.4 million people were living with HIV in 2008 and that 2.7 million people were newly infected in that year alone. Of those previously infected with this retrovirus, about 2 million died. There are two types of HIV, both of which descended from SIV (Simian Immunodeficiency Virus). HIV is a lentivirus, a group characterized by long incubation times and complex genomes, and it primarily infects the CD4+ T cells of the human immune system. Although many treatments, including highly active antiretroviral therapy (HAART), can temporarily slow the progression of an HIV infection, there is currently no cure or vaccine for the virus. HAART greatly reduces mortality and morbidity in HIV patients but has its own problems, namely toxicity within the human body and high cost. Thus, it is advantageous to develop novel drug treatments with fewer side effects and greater effectiveness.

In addition, HIV has been evolving rapidly in response to the use of antiretroviral drugs and the pressures of the human immune system. Studying the evolutionary tendencies of HIV could offer insight into how to design more effective therapies or perhaps even vaccines. In the human body, about 10.3 x 10^9 virus particles are created each day, while approximately 3 x 10^-5 errors are made per nucleotide per replication cycle in an RNA genome of roughly 9,700 nucleotides. These rates translate to the virus having at least one mutation at every position of the viral genome each day (Hunt 2008). A mutation in a single codon of the coding sequence, which is subsequently translated into an amino acid, can have dramatic effects on the ability of HIV to evade the adaptive immune system.
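As a rough check on the arithmetic behind that claim, the short Python sketch below multiplies the two figures quoted above. It rests on two assumptions that are not stated in the text: that the quoted error rate applies per nucleotide, and that each new virus particle corresponds to one replication cycle. The function and variable names are illustrative only.

```python
# Back-of-the-envelope estimate of the daily mutation supply at one genome position.
# Assumptions (not from the source): the error rate is per nucleotide per cycle,
# and every new virus particle represents one replication cycle.

def expected_mutations_per_site(virions_per_day: float, error_rate_per_site: float) -> float:
    """Expected number of new mutations hitting one given position per day."""
    return virions_per_day * error_rate_per_site

VIRIONS_PER_DAY = 10.3e9       # figure quoted in the text
ERROR_RATE_PER_SITE = 3e-5     # figure quoted in the text, read as per nucleotide

per_site = expected_mutations_per_site(VIRIONS_PER_DAY, ERROR_RATE_PER_SITE)
# Each position can change to any of the 3 other bases; assume they are equally likely.
per_specific_change = per_site / 3

print(f"Expected mutations per site per day: {per_site:,.0f}")
print(f"Expected copies of each specific substitution per day: {per_specific_change:,.0f}")
```

Under these assumptions the expected count per position is in the hundreds of thousands per day, which is far above one and so consistent with the statement attributed to Hunt (2008).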
The potential impact of a single-codon change, coupled with the high error rate of HIV replication, allows us to observe strong signals of purifying and positive selection within the HIV genome. Mutations at sites of structural or functional significance tend to render the protein dysfunctional and are removed by purifying selection, while mutations that provide an advantage, such as an increase in virulence, are favored by positive selection. By locating the areas of the HIV genome that are undergoing positive selection using bioinformatics methods and tools, we can better predict and prepare for a virus’s course of mutation and its response to certain antiviral drugs.

HIV represents the introduction of SIV into humans, likely transmitted through the handling of monkey carcasses in Africa. Two distinct types of HIV exist: HIV-1, presumed to be from chimpanzees, and HIV-2, from sooty mangabeys. HIV-1 is responsible for the modern pandemic. Within HIV-1, three major groups (M, N, and O) exist, possibly representing three separate introductions of SIV into the human population. However, about ninety percent of all HIV infections come from group M, and it is therefore the most studied. Group M is further subdivided into nine clades, labeled with the letters A through K, excluding E and I. These clades tend to be bound by geography, as different subtypes circulate within each continent. We decided to focus our analysis on clade B, the dominant strain in North America and Europe, which has the most available sequence data.

One way to understand natural selection in organisms at the genetic level is to compare conserved sequences with variable ones. This strategy can also be applied to viruses, which, though not considered living, still replicate and alter their genetic code.
The ratio between the variable and conserved rates was then calculated: higher ratios indicate more variation, lower ratios indicate more conservation, and a ratio of one indicates neutral selection, so the ratio reflects the level and type of selection acting on the area under consideration. Not only do different domains in HIV have different ratio values, but so do the different proteins of each domain, which requires a more specific analysis and model. A classical example in HIV research is the tat gene, whose product greatly enhances the replication of HIV. Our research goal was to apply a specific evolutionary model to the HIV genome and to examine how consistent its results are with findings in the published literature that used either bioinformatics or traditional laboratory approaches to find sites of conservation and variation.

The available phylogenetic analysis programs use several mathematical models designed specifically to study the evolutionary process of organisms. These programs range from less detailed multiple alignments to highly specialized and complex algorithms, such as those using maximum likelihood or Bayesian inference. Our project primarily used the ConSurf tool for analysis by maximum likelihood (Ashkenazy et al., 2010). Phylogenetic analysis of HIV is especially interesting compared with analysis of mammalian species because of HIV’s rapid evolutionary rate. Having familiarized ourselves with the program and the databases, we can compare our results with those found in the literature and eventually extend our analysis to patient data. This study could help us better understand how viruses co-evolve with their host cells and how they respond to the barrage of inhibitory drugs.

Materials & Methods:

There were several databases that were potential sources for the project.
The first database that we looked at, GenBank, hosted by NCBI, holds the collection of all publicly available sequences. However, we kept running into the inefficiency of extracting sequences one at a time, since we did not have the computing skills to write a Perl script to retrieve them; we therefore looked for other sources despite the standard use of GenBank in genetic research. We chose to move to databases that were strictly HIV-specific and hoped to find a bulk sequence download option. Stanford University hosts an extensive HIV database, but it focuses exclusively on sequences related to patient data. For our study we wanted to analyze the overall conservation scores of the virus, so patient data was neither broad enough nor representative as a whole. We therefore obtained the sequences from a database hosted by Los Alamos National Laboratory (LANL), as it had the most extensive HIV data set available to the public. LANL has several features that might be relevant for future projects, such as its geographic search feature, which lists all the HIV sequences from a particular area.

The Linux platform was fundamental throughout much of our work, since many bioinformatics tools were written to run on it. Linux provides several advantages, including speed of calculation, simplicity, the efficiency of its programs, and the ability to script tasks that would otherwise be difficult on other operating systems. At first we used a remote Linux server provided by our mentor; eventually we installed an Ubuntu-like operating system on our own computers, and having a personal Linux system let us fully exploit these advantages. Our advisor helped us write Perl scripts that extracted clade B sequences from the comprehensive sequence set by separating, sorting, and recombining our desired sequences.
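The clade B extraction step described above could be sketched roughly as follows. This is not the project's Perl script: it is an illustrative Python version, and it assumes that each FASTA header begins with the subtype as the first dot-separated field (for example "B.XX.00.name.accession"). The header layout, sequence names, and sequences shown here are made up for the example.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)

def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_subtype(records: Iterable[Record], subtype: str = "B") -> Iterator[Record]:
    """Keep records whose header starts with the given subtype field.

    Assumes headers look like "<subtype>.<country>.<year>.<name>.<accession>";
    this layout is an assumption made for the sketch.
    """
    for header, seq in records:
        if header.split(".")[0] == subtype:
            yield header, seq

# Toy input with hypothetical headers and sequences.
example = """>B.XX.00.toy1.ACC0001
ATGGAGCCAGTA
GATCCT
>C.XX.00.toy2.ACC0002
ATGGAGCCAGTAGATCCC
>B.XX.00.toy3.ACC0003
ATGGAACCAGTAGATCCT
""".splitlines()

clade_b = list(filter_by_subtype(parse_fasta(example), "B"))
for header, seq in clade_b:
    print(f">{header}\n{seq}")
```

Reading the file as a stream of records keeps memory use low even for a large download, which matters when the full sequence set is much larger than the clade of interest.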
When we needed to run multiple sequence alignments with ClustalW locally (the volume of our sequences was too high for online servers), the process was much faster on Linux than on Windows (Larkin 2007). We credit one of our teammates for this discovery, since it greatly improved the efficiency of our work. The alignment process allows us to compare the sequences and observe which regions are conserved, semi-conserved, or not conserved at all. It lines up the sequences by their consistent regions so that other bioinformatics tools can analyze them further. From the alignment we can visually compare phylogenetic relationships and construct trees, find domains of interest, and, especially relevant to our study, analyze the conserved sites. ClustalW and its graphical (windowed) counterpart, ClustalX, are among the most commonly used and reliable alignment tools (Larkin 2007). Other prominent alignment tools include MUSCLE (Edgar 2004), which is built into ConSurf, and T-COFFEE (Notredame et al., 2000), a web-based alignment tool.

ConSurf was ultimately the tool we used to calculate the evolutionary rates of the data, and the majority of our analysis of the HIV sequences was done with it. This versatile tool is flexible in the sequences it accepts: it can analyze both nucleotide and amino acid sequences. Its protocol is shown in Figure 2 in Illustrations. We wished to have a degree of control over our alignment parameters, and we already possessed viable sequences, so our execution of the tool began at step four. The program also constructs phylogenetic trees, which can be viewed in various software packages. At the center of the ConSurf analysis is the calculation of the evolutionary conservation of the amino acid positions in the proteins using the Rate4Site algorithm (Ashkenazy et al., 2010).
The conservation of a protein depends on its importance to the virus’s fitness, because in most cases mutations that change the genome of a virus are destructive. However, that is not to say that conserved regions cannot be affected by natural selection. There are two main methods for measuring evolutionary rates: the Bayesian method and maximum likelihood. The conservation scores quantify how consistent the sequences are with one another at each position. The Bayesian method is favored for smaller sets of sequences, but because we decided to use such expansive data sets, we expected maximum likelihood to give more accurate conservation scores. Moreover, the conservation scores determine the nine discrete grades that provide a visual representation of the data. We also set ConSurf to follow the Tamura 1992 model (T92). This model was applicable because it accounts for the strong transition-transversion bias among the many point mutations that HIV accumulates (Tamura 1992).

The final step in the ConSurf protocol is to map the conservation levels onto a query sequence, which needs to be specified in the initial startup parameters. The selection of this sequence was arbitrary in our case, so we used the first sequence in the multiple alignment output as our query sequence. The output included a query sequence color-coded by conservation, a corresponding file listing the conservation score and color assignment for each position, a file listing the nucleotide variety at each position, and the data for rendering phylogenetic trees.

Results:

The alignments from ClustalW were the first results on which we could conduct limited analysis. Normally, with protein sequences, ClustalW alignments report four degrees of variability: a “*” symbol represents complete conservation, “:” represents conserved substitutions, “.” represents semi-conserved substitutions, and the absence of a symbol represents variation (Larkin 2007).
The large number of possible substitutions, among 20 amino acids to be exact, allows for this variability. In our ClustalW output, however, there were even fewer degrees of variability, with only complete conservation or no conservation displayed, because we were analyzing nucleotides (Figure 1). This result accentuates the need for a more sensitive and more inclusive output.

To organize our work we created a wikispace website that centralized our literature review process as well as our data and results. The raw sequences, extraction scripts, and extracted sequences can be found at http://hivproject.wikispaces.com/Data. A teammate with experience in data management set this website up.

We can, however, conduct limited interpretation of the alignment results. For example, we can calculate the percentage of fully conserved sites within each gene. In this analysis the tat gene had a larger proportion of fully conserved sites than the other genes (Table 1). A significance test would be needed to determine whether the larger observed conservation in tat is meaningful. A simple examination of the standard deviation reveals that tat’s percentage conserved is only about one standard deviation away from the overall mean percentage. This sort of analysis is therefore inconclusive, though interesting to consider. To obtain more detailed and comprehensive results, we clearly needed a better form of analysis than multiple sequence alignments; this exercise demonstrates the weakness of relying solely on ClustalW for analyzing sequence data.

Table 1: Summary of ClustalW Alignment Results

Gene   Fully Conserved   Variable   % Fully Conserved   Total Sequence Length
rev    46                305        13.11%              351
tat    66                240        21.57%              306
vif    97                474        16.75%              579
vpr    37                254        12.71%              291

When we used ConSurf for analysis, we obtained results that rated conservation on a scale of nine grades.
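The Table 1 percentages amount to counting alignment columns in which every sequence carries the same nucleotide, and the standard-deviation comparison can be reproduced from the counts in the table. The Python sketch below illustrates both steps. The toy alignment is invented for the example, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def fully_conserved_fraction(alignment):
    """Fraction of columns where all sequences share the same non-gap character."""
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = sum(
        1 for column in zip(*alignment)
        if column[0] != "-" and all(c == column[0] for c in column)
    )
    return conserved / n_cols

# Toy alignment (made up): 5 of 6 columns are fully conserved.
toy_alignment = ["ATGGAG", "ATGCAG", "ATG-AG"]
toy_fraction = fully_conserved_fraction(toy_alignment)

# Fully conserved counts and total lengths taken from Table 1.
table1 = {"rev": (46, 351), "tat": (66, 306), "vif": (97, 579), "vpr": (37, 291)}
percent = {gene: 100 * c / total for gene, (c, total) in table1.items()}

# How many sample standard deviations tat lies from the mean of the four genes.
values = list(percent.values())
tat_z = (percent["tat"] - mean(values)) / stdev(values)

print(f"toy alignment: {toy_fraction:.2%} fully conserved")
for gene, p in percent.items():
    print(f"{gene}: {p:.2f}% fully conserved")
print(f"tat deviation from mean: {tat_z:.2f} sample standard deviations")
```

With only four genes, tat sits a little over one sample standard deviation above the mean, so this comparison cannot by itself establish that tat is unusually conserved.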
A nine-grade scale is clearly much more sensitive than the alignment method. The ConSurf service provides a very convenient display of the results that maps the conservation level, using a color scheme, onto the selected query sequence (Figure 3). To avoid confusion, note that ConSurf includes ConSeq, the program that processed the nucleotide sequences and calculated the evolutionary conservation rates; the results are titled “ConSeq Results” for that reason. The ConSurf method’s main advantage lies in this output, which is extremely useful for visualizing the variability of the genomic code.

Although the ConSurf result mapped onto the query sequence is a great visual tool for identifying sites of variation and conservation, by itself it provides no hard numbers to work with. The conservation scores provided by ConSurf offer a more quantitative way to analyze the data. These results can be found on the results page of the wikispace (http://hivproject.wikispaces.com/Results) as files called “Conservation Scores_proteinname.” These files list the conservation value as well as the assigned color in numeric form. At first we wished to compare the number of strong variation sites, which we defined as Level 1, as reflected in Figure 3. Table 2 is a further summary of the variation and conservation of sites, much like Table 1. Because ConSurf provides a more in-depth analysis, we are able to take this further than ClustalX allowed. Even so, this kind of analysis is remarkably similar to that presented in Table 1 in its lack of detail. The ratios of variable to conserved sites are nearly identical across genes and yield almost no significant observations, although individually the sites are significant because of their implications for mutation rates.

Figure 4 depicts a diagram of the tat protein. This protein is a transactivator of viral transcription and greatly enhances the replication rate of HIV.
It would be important to compare the sites of variability and the sites of conservation to see whether they intuitively match up with the functional areas of the protein. For example, if the protein has an important function in a given domain, that may translate to important structural or functional amino acid residues there.

Table 2: Sites of Strong Conservation and Strong Variation in Select HIV-1 Clade B Proteins

Gene   Strong Variation (Level 1)   Strong Conservation (Level 9)   Ratio of Level 1 to Level 9
rev    64                           143                             0.448
tat    59                           143                             0.413
vif    110                          253                             0.435
vpr    53                           135                             0.393

Recognizing the limitations of the previous analysis, we instead counted the frequency of each conservation level. The results of this counting are listed in Table 3. Because raw numbers are difficult to analyze visually, they have been converted into circle graphs; the graph for the tat gene is displayed in Figure 5. In the chart, one can see that the most conserved and most variable sites dominate the graph. When the less conserved but nevertheless relatively constant grades seven and eight are included in the conserved category, and grades two and three in the variable category, conserved sites form the clear majority.

We collected our phylogenetic tree results in Newick format. While this format is hard for humans to read, it is easy for software packages such as MEGA (Molecular Evolutionary Genetics Analysis) to render and explore the tree (Kumar 2008). The phylogenetic trees we developed with the help of ConSurf can again be found on the wikispace link above; they are the files with the extension “.nwk”. An example of a phylogenetic tree built with the maximum likelihood model from the tat gene data is displayed in Figure 6. The number of branches is overwhelming and highlights the size of our data source. However, the visual effect can still distinguish the major clusters of sequences, that is, the evolutionary groups calculated from the data.
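To illustrate what a Newick file contains, the Python sketch below pulls the leaf names and their branch lengths out of a small Newick string. Newick writes a tree as nested parentheses, with an optional branch length after a colon. The tree and strain names here are invented, and the regular expression only handles plain Newick without quoted labels or comments, so it is a reading aid rather than a full parser.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")".
# The optional second group captures a branch length written after ":".
LEAF = re.compile(r"[(,]\s*([^(),:;\s][^(),:;]*)(?::\s*([0-9.eE+-]+))?")

def newick_leaves(newick: str):
    """Return a list of (leaf_name, branch_length or None) from a plain Newick string."""
    leaves = []
    for name, length in LEAF.findall(newick):
        leaves.append((name.strip(), float(length) if length else None))
    return leaves

# Hypothetical four-strain tree; names and lengths are made up.
toy_tree = "((strainA:0.10,strainB:0.20):0.05,(strainC:0.15,strainD:0.25):0.07);"
leaves = newick_leaves(toy_tree)

for name, length in leaves:
    print(f"{name}\tterminal branch length = {length}")
```

A dedicated tree viewer remains the better choice for exploring a tree of several hundred sequences; a helper like this is mainly useful for quick checks, such as confirming how many sequences a ".nwk" file contains.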
In tree-rendering software, it would be easier to determine numerical values for the evolutionary closeness of each clade B strain.

Table 3: Frequency of Each Level of Conservation in Each Gene

                             Frequencies of Conservation for Each Gene:
Conservation Level (Color)   rev   tat   vif   vpr
1                            64    59    110   53
2                            10    6     12    11
3                            12    10    15    8
4                            17    12    19    15
5                            17    17    25    12
6                            20    14    40    16
7                            20    20    41    15
8                            48    25    64    26
9                            143   143   253   135
Total                        351   306   579   291

Illustrations:

Figure 1: Section of Multiple Sequence Alignment (Pairwise) of DNA sequences using ClustalW for the tat gene
Figure 2: ConSurf Protocol (Ashkenazy et al., 2010)
Figure 3: Conservation Output for the tat Gene using ConSurf
Figure 4: Functional Regions of tat gene (Doherty 2005)
Figure 5: Circle Graph of Conservation Grades
Figure 6: Phylogenetic Tree Demonstrating the Vastness of our Data in MEGA (tat gene)

Discussion:

By analyzing the mutations to a genome over a period of time, evolution can be observed not just over millennia but also on a much shorter time scale. This analysis is especially applicable to fast-replicating viruses, whose evolutionary rates are astounding. The places in the genome that can change HIV’s rate of survival are of particular interest. The combination of conservation and variability at these sites essentially governs evolution at the molecular level.

As mentioned earlier in the report, many multiple sequence alignment programs, such as ClustalW, can calculate the degree of variability across both protein and nucleotide sequences. However, the outputs of these programs are rarely used directly as results in modern bioinformatics because of their low sensitivity. Instead, they often serve as inputs to more advanced algorithms. We decided to focus our analysis on the tat gene, which has been investigated as an antiretroviral drug target, and a few other accessory proteins that enhance HIV replication efficiency.
In addition, these genes are less well understood than the three essential and characteristic lentivirus genes env, gag, and pol. To our knowledge, conservation analysis of the nucleic acid sequences has not been done, though positive selection studies have. Much of the focus of HIV specialists lies on the three major genes of HIV, though antiretroviral therapy includes drugs for many stages of the HIV life cycle. Perhaps our research can help the scientific community gain more knowledge about the less studied accessory proteins.

Based on our findings, the mutational ability of HIV can be taken into account as another dimension during the R&D of new antiretroviral drugs. We know that HIV evolves incredibly quickly, but now we can pinpoint the locations of this variability. This information could increase the efficiency of the drug development pipeline and aid in the development of new drugs that counter viral adaptations. Personalized drugs may even become possible if a refined method of choosing drugs based on the strains of virus present within the body were developed, though expense and toxicity would remain major obstacles for HIV patients. A viable vaccine may also become more attainable with these technologies. Pharmaceutical companies can target the areas of the virus that change most frequently, the variation sites, finding methods to make these areas less volatile or to predict when and where mutations will occur. The variable sites can be mapped for each of the HIV proteins in all of the clades. Even though our focus was on a clade belonging to the predominant group M of HIV-1, the other types can be examined as well. Scientists are continually sequencing new strains of HIV as they arise and making them available to other institutions for analysis. Through this ongoing scientific process, ways to slow HIV will eventually be found.
In our project, the goal was to compare the conserved and variable regions of HIV sequences. We measured the ratios between the most variable and most conserved sites. But the big picture tells us little about what is occurring at the level of individual nucleotide sites. Some sites exhibited noticeably higher variability, and we wanted to analyze the implications of these sites. Such sites are unlikely to belong to a vital structural or functional domain, but they could be part of a binding site that interacts with human proteins such as receptors. Mutations at these sites could provide an evolutionary advantage for the virus, both in evading human defenses and in adapting to human counter-adaptations. Our methodology for calculating this rate can be replicated for further analysis of the evolutionary rates in other clades of HIV and in other lentiviruses.

When the genetic code of HIV and other retroviruses mutates, a single point mutation can change the amino acid sequence produced by translation. Sometimes mutations are silent, because most amino acids are represented by multiple codons. Some types of statistical analysis use amino acid sequences to calculate protein variability. For example, in conservation analysis, variation is scored based on the evolutionary rates of the amino acids using either Bayesian or maximum likelihood methods (Ashkenazy et al., 2010). Our method uses nucleotides because they offer a higher level of sensitivity at the nucleic acid level and give us more evolutionary information than would be possible using amino acid sequences.

Our data does have an interpretational problem. The size of our dataset means that it contains very diverse strains, producing a very divergent and perhaps misleading output from ConSurf. This is somewhat inconsistent with literature that uses smaller datasets. However, the purpose of our project was to analyze the overall conservation of the genome, and our results must be interpreted in that light.
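The point made earlier in this discussion about silent mutations can be made concrete with a small Python sketch. It builds the standard genetic code in DNA letters and labels a single-nucleotide change in a codon as silent, missense, or nonsense. The function names and example codons are illustrative choices for this sketch and are not taken from the project's analysis.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_point_mutation(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change at `position` (0, 1, or 2) within a codon."""
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"
    return "nonsense" if after == "*" else "missense"

def count_silent_neighbors(codon: str) -> int:
    """Count how many of the 9 possible single-base changes leave the amino acid unchanged."""
    codon = codon.upper()
    return sum(
        classify_point_mutation(codon, pos, base) == "silent"
        for pos in range(3)
        for base in BASES
        if base != codon[pos]
    )

print(classify_point_mutation("CTA", 2, "G"))  # leucine codon, third position changed
print(classify_point_mutation("GAA", 1, "T"))  # glutamate codon, second position changed
print(classify_point_mutation("TGG", 2, "A"))  # tryptophan codon, third position changed
print(count_silent_neighbors("CTA"), "of 9 single-base changes to CTA are silent")
```

Because silent changes are visible in a nucleotide alignment but invisible in a protein alignment, a nucleotide-level analysis registers variation that an amino-acid-level analysis would not.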
Conclusion:

In our research, we identified sites of variation in the Human Immunodeficiency Virus that were reasonably consistent with existing literature. These sites contribute to diversity and differentiation within the viral population, allowing the virus to evade both natural and artificial inhibitory measures such as antibodies and HAART. During our research and calculation, we directly compared the conserved and variable sites. While conserved sites outnumbered variable sites, as expected, there was a very large number of highly variable sites compared with, for example, a human gene. Having a few hundred sequences from various strains contributed to the high amount of variability. The variable sites have been identified in our results, and this methodology can be very useful in predicting such sites for future pharmaceutical applications.

The objectives of this project were met for the most part, though there is plenty more we can do in this field. Another program package, PAML, can be used in place of ConSurf, giving similar results through a different approach. This method would calculate positive and purifying selection rates, which are similar to conservation and variability scores. Experimental analysis, rather than purely statistical analysis, may be more accurate for these purposes. There was some margin for error, most of which would have stemmed from the original sequence data itself, since mistakes can be made when sequences are gathered.

This project was very educational, as we knew virtually nothing about bioinformatics or the structure of HIV before we undertook it. There is plenty of documentation available, which helped us troubleshoot and get the project running successfully. Our understanding of HIV has become quite comprehensive for high school students, and we all endeavor to continue this research through college and beyond to eventually find a cure for HIV.
To confirm the accuracy of our results, we would need the opportunity to test them experimentally. This project has so far been solely computer-based, but once enough data has been gathered, we would welcome an opportunity to run wet-lab experimental trials.

Future work on this project includes extending our analysis to the other clades as well as to other lentiviruses such as SIV. These analyses could provide insight into the phylogenetic differences within the lentivirus family and into whether any consistencies in sites of positive selection exist. In addition, it would be important to study the response of adaptive human proteins to HIV. The human response to HIV’s influence, in an individual or in a population, could be better understood if it were further documented. Another relevant study would investigate the response of the virus to the pressures of suppressive drug therapies such as HAART.