Variation in conservation among different genes within the Herpes Simplex Virus type 1, and its correlation with function

Kerri Callahan & Samantha Nadeau
Molecular Evolution
Thursday, December 11, 2014

Abstract

An analysis of three functionally different HSV-1 genes reveals patterns of purifying selection as well as correlations between gene function and variation. Gene redundancy decreases conservation while vital functionality increases conservation. The study examined HSV-1 because of its prevalence in American adults. Developing vaccines and treatments requires an understanding of gene conservation and evolutionary patterns. The gene products studied were a membrane protein (UL20), a tegument protein (UL49), and a terminase enzyme (UL15; UL28; UL33). In this study, multiple tests rejected the null hypothesis that dN=dS and supported the original hypothesis that UL20 is less conserved than UL49. Findings regarding the UL15; UL28; UL33 complex are inconclusive because the assumption that the individual genes could be grouped together proved false.

Introduction:

Herpes Simplex Virus type 1, more commonly known as oral herpes, affects 90% of American adults (1). Herpes Simplex Virus type 2, or genital herpes, affects 20% of American adults (1). Because these viruses are so prevalent, researchers have completed numerous studies investigating both types. Most of that research compared type 1 with type 2 rather than focusing on one of the two viruses. Narrowing in on Herpes Simplex Virus type 1 specifically, and on the different morphological regions of the virion, makes it possible to compare genes with regard to their evolutionary conservation and to correlate conservation with function. The three gene products compared in this study are a membrane protein (encoded by UL20), a tegument protein (encoded by UL49), and a terminase enzyme (encoded by UL15, UL28, and UL33).
The hypothesis states that the UL20 gene, which encodes the membrane protein, will be the least conserved, because it is believed to be less vital to the function of the virus than the other genes: in addition to UL20, many glycoproteins and other membrane-associated proteins perform similar, if not the same, functions (5).

Background:

Herpes Simplex Virus type 1 is a double-stranded DNA virus with a lytic life cycle (2). After the virus infects a cell through lytic infection, it travels to the spinal ganglia and resides there as a latent infection until reactivation occurs (3). The virion consists of three major morphological structures: the envelope, the tegument, and the nucleocapsid (4). Within the envelope, a lipid bilayer composing the outermost surface of the virion, lies the tegument, which contains proteins that are released into the cytoplasm of the infected cell (4). Tegument proteins are responsible for the egress of virion progeny (4). Within the tegument, the nucleocapsid contains the double-stranded DNA necessary for HSV-1 function (4).

Herpes Simplex Virus 1 encodes over fifteen membrane proteins, one of which is the UL20 gene product (5). The UL20 protein is an intrinsic membrane protein involved in the distribution of virions into the extracellular space of the infected cell (5). It works with other membrane-associated proteins and glycoproteins to transport virions from the infected cell into the extracellular space (5). UL20 therefore plays a role in virion transmission; however, virion transmission requires other genes as well (5).

Within the virion, the tegument layer lies between the nucleocapsid and the envelope (4). VP22 is a major tegument protein, encoded by the UL49 gene (4). This protein is needed for the redistribution of viral proteins from the nucleus to the cytoplasm of infected cells (4).
This is essential for the egress of virion progeny (4).

During replication of Herpes Simplex Virus type 1, concatemers accumulate in the infected cell (6). For progeny virions to assemble, the concatemers must be cleaved by a terminase enzyme and packaged into capsids (6). The UL15, UL28, and UL33 protein subunits make up the terminase enzyme in Herpes Simplex Virus type 1 (6). The subunits are assembled in the cytoplasm and, using a mechanism on the UL15 protein, the complex exits the cytoplasm and enters the nucleus (6). There, UL28 binds the complex to the DNA packaging signal and, with the help of UL15 and UL33, cleaves the concatemer and packages the genome into the capsid (6).

Prior research into the evolution of Herpes Simplex Virus shows UL15, UL28, and UL33 to be highly conserved (7,8). Previous studies on UL33, UL31, and UL34 (orthologs) indicated that highly conserved protein interactions may occur whether or not high sequence similarity exists between the genes. However, since the literature stated that interactions between UL15, UL28, and UL33 exist in multiple species, we chose to group the three genes together for analysis under the assumption that they were co-evolving to become more similar to one another (8).

Methods:

Partial genomes of HSV-1 strains CJ394, CJ360, CJ311, CJ970, OD4, and TFT401 were imported in FASTA (text) format from the NCBI database (9) into MEGA software version 6.06 (10). The locations of genes UL15, UL20, UL28, UL33, and UL49 were found on NCBI for each strain and used to crop each partial genome to the targeted genes (9).

Sequence Alignments:

Three separate alignments, each containing six isolates, were made using CLUSTALW. The first alignment consisted of the UL20 gene from each of the six strains. The second alignment likewise consisted of a single gene from each of the six strains, in this case UL49.
The third alignment consisted of two isolates of UL15 (one from strain CJ311 and one from strain OD4), two isolates of UL28 (one from strain CJ394 and one from strain CJ360), and two isolates of UL33 (one from strain TFT401 and one from strain CJ970).

Sequence Analysis:

The differences in synonymous and nonsynonymous mutations were tested using the codon-based Z-test of selection. The three alternative hypotheses were dN>dS, dN≠dS, and dS>dN; the null hypothesis in each case was dN=dS. The Nei-Gojobori (Proportion) method was used on all three sets of sequences, and gaps were treated using pairwise deletion.

Selection at codons was estimated via HyPhy using the maximum likelihood method, the standard genetic code table, and the Felsenstein 1981 model. Gaps were treated using complete deletion. The dN values for all codons were summed, the dS values for all codons were summed, and the first sum was divided by the second to give the dN/dS value. This test was repeated for all three sequence sets.

Pairwise distance estimates were computed using the bootstrap method (500 replications) and the Tamura-Nei model, with p-values of less than 0.05 considered significant. Additionally, maximum likelihood trees were constructed for each of the three alignments using the bootstrap test of phylogeny and the Tamura-Nei model. Branch lengths were measured in substitutions per site and included in the trees.

Results:

UL20:

The codon-based Z-test of selection resulted in a probability of 1.0 and a Z-statistic of -2.572 for HA: dN>dS, a probability of 0.003 and a Z-statistic of -3.017 for HA: dN≠dS, and a probability of 0.004 and a Z-statistic of 2.669 for HA: dN<dS (Table 1). Selection at codons, estimated using HyPhy, resulted in a dN/dS value of 0.749 (Table 2). Of the 15 p-values reported from the pairwise distance analysis, 11 were significant in rejecting the null hypothesis that dN=dS (Table 3).
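
The dN/dS values reported in this study follow the arithmetic described in Methods: the per-codon dN estimates and the per-codon dS estimates are each summed, and the two sums are divided. A minimal Python sketch of that step is given here. The function names and the four per-codon values are hypothetical and chosen only for illustration; they are not output from HyPhy or data from this study.

```python
def dn_ds_ratio(dn_per_codon, ds_per_codon):
    """Ratio of summed per-codon dN to summed per-codon dS."""
    if len(dn_per_codon) != len(ds_per_codon):
        raise ValueError("dN and dS lists must cover the same codons")
    total_dn = sum(dn_per_codon)
    total_ds = sum(ds_per_codon)
    if total_ds == 0:
        raise ValueError("summed dS is zero, so dN/dS is undefined")
    return total_dn / total_ds


def interpret(ratio):
    """Conventional reading of dN/dS relative to 1."""
    if ratio < 1:
        return "purifying selection"
    if ratio > 1:
        return "positive selection"
    return "neutral"


# Hypothetical per-codon estimates for four codons (not data from this study).
example_dn = [0.0, 0.5, 0.0, 1.0]
example_ds = [1.0, 0.5, 0.5, 0.0]

ratio = dn_ds_ratio(example_dn, example_ds)  # 1.5 / 2.0
print(ratio, interpret(ratio))
```

Dividing the sums, rather than averaging per-codon ratios, keeps the result defined even when an individual codon has dS = 0, as the last hypothetical codon does.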
The maximum likelihood tree showed two general clades, one consisting of strains OD4 and CJ970 and one consisting of strains CJ394, CJ360, CJ311, and TFT401. The bootstrap value for the latter clade was 100, whereas CJ970 was condensed. The branch lengths for all clades were less than 0.1, with all except two being less than 0.01 (Figure 1). The average branch length was 0.00949.

UL15+UL28+UL33:

The codon-based Z-test of selection for the UL15; UL28; UL33 group gave a probability of 1.0 and a Z-statistic of -1.895 for HA: dN>dS, a probability of 0.08 and a Z-statistic of -1.764 for HA: dN≠dS, and a probability of 0.007 and a Z-statistic of 1.798 for HA: dN<dS (Table 1). Selection at codons, estimated via HyPhy, gave a dN/dS value of 0.972 (Table 2). Three of the 15 p-values from the pairwise distance analysis were significant in rejecting the null hypothesis that dN=dS (Table 3).

Three clades were clear in the maximum likelihood tree. The first had a bootstrap value of 100 and consisted of strains CJ394 and CJ360. The second had a bootstrap value of 99 and consisted of strains TFT401 and CJ970. These two clades originated from a common ancestor. The third clade had a bootstrap value of 100 and consisted of strains CJ311 and OD4. The average branch length was 0.185 for the entire tree, 0.008 for the UL28 clade, 0.0 for the UL33 clade, and 0.0025 for the UL15 clade (Figure 2).

UL49:

The codon-based Z-test of selection for the UL49 gene gave a probability of 1.0 and a Z-statistic of -1.794 for HA: dN>dS, a probability of 0.078 and a Z-statistic of -7.8 for HA: dN≠dS, and a probability of 0.043 and a Z-statistic of 1.73 for HA: dN<dS (Table 1). The dN/dS value estimated using selection at codons via HyPhy was 0.306 (Table 2). All of the p-values from the pairwise distance analysis were significant in rejecting the null hypothesis that dN=dS (Table 3).
The maximum likelihood tree showed one common ancestor for strain CJ360, strain OD4, and the group of strains TFT401, CJ970, CJ394, and CJ311, which had a bootstrap value of 70. Branch lengths for all clades were between 0.0 and 0.0035 (Figure 3). The average branch length was 0.0011.

Patterns:

The Z-statistics for all codon-based Z-tests of selection using the alternative hypotheses dN>dS and dN≠dS were negative, indicating that there were more synonymous mutations than nonsynonymous mutations (Table 1). The probabilities were all greater than 0.05, except for UL20 under HA: dN≠dS, meaning that these tests did not reject the null hypothesis of neutral selection (Table 1). The codon-based Z-tests of selection for the alternative hypothesis dN<dS all have positive Z-statistics and probabilities less than 0.05 (Table 1). This indicates that there are more synonymous mutations than nonsynonymous mutations and allows for the rejection of the null hypothesis that dN=dS for all genes.

The dN/dS value was less than one for all three sets, indicating purifying selection. UL49 had the lowest value (closest to zero), and the UL15; UL28; UL33 complex had the highest value (closest to one).

The branch lengths were generally different for all three trees, with the UL49 tree having the shortest average branch length. The UL15; UL28; UL33 complex had the highest overall average branch length, but when broken down by gene, all branch length averages (for UL15, UL28, and UL33 individually) were lower than the average branch length for the UL20 gene (Figures 1, 2, 3).

                     HA: dN>dS         HA: dN≠dS         HA: dN<dS
Gene                 Z-stat   Prob     Z-stat   Prob     Z-stat   Prob
UL20                 -2.572   1.0      -3.017   0.003    2.669    0.004
UL15; UL28; UL33     -1.895   1.0      -1.764   0.080    1.798    0.007
UL49                 -1.794   1.0      -7.80    0.078    1.73     0.043

Table 1. Codon-Based Z-Test of Selection: Z-statistic values (for dN-dS, dN-dS, and dS-dN, respectively) and probability values for each of the three alternative hypotheses.
Probabilities <0.05 are significant in rejecting the null hypothesis that dN=dS and are highlighted in purple.

Gene                        dN/dS value
UL20                        0.749
UL15; UL28; UL33 complex    0.972
UL49                        0.306

Table 2. dN/dS Values Based on HyPhy Testing: UL49 has the lowest dN/dS value while the UL15; UL28; UL33 complex has the highest. The dN/dS values are less than one for all of the genes.

Table 3. Estimates of Evolutionary Divergence between Sequences: pairwise distances and number of base substitutions per site between sequences are shown. Codon positions included 1st, 2nd, and noncoding. All ambiguous positions were deleted. Significant p-values are highlighted in yellow.

[Figure 1: tree image; tips are HSV-1 strains CJ394, CJ360, CJ311, TFT401, CJ970, and OD4.]

Figure 1. Maximum Likelihood Tree of Molecular Phylogenetic Analysis for the UL20 Gene: Evolutionary history of the UL20 gene inferred via the Maximum Likelihood method. Branch lengths shown next to branches were measured as the number of substitutions per site. Codon positions included 1st, 2nd, and non-coding.

[Figure 2: tree image; tips are HSV-1 strains CJ394, CJ360, TFT401, CJ970, CJ311, and OD4.]

Figure 2. Maximum Likelihood Tree Based on Molecular Phylogenetic Analysis for UL15; UL28; UL33 Gene Complex: Evolutionary history of the UL15; UL28; UL33 gene complex inferred via the Maximum Likelihood method. Branch lengths shown next to branches were measured as the number of substitutions per site.
Codon positions included 1st, 2nd, and non-coding.

[Figure 3: tree image; tips are HSV-1 strains TFT401, CJ970, CJ394, CJ311, OD4, and CJ360.]

Figure 3. Maximum Likelihood Tree Based on Molecular Phylogenetic Analysis for the UL49 Gene: Evolutionary history of the UL49 gene inferred via the Maximum Likelihood method. Branch lengths shown next to branches were measured as the number of substitutions per site. Codon positions included 1st, 2nd, and non-coding.

Discussion:

According to the values computed through HyPhy, the gene with the smallest dN/dS value was UL49, signifying that UL49 is the most highly conserved of the three genes studied (Table 2). This supports the part of the hypothesis stating that the UL49 gene would be the most highly conserved. In comparison, the largest dN/dS value belonged to the UL15; UL28; UL33 complex, signifying that this group of genes is the least conserved (Table 2). This gene complex makes up a terminase enzyme needed for the cleavage and packaging of the viral DNA into capsids, and it was hypothesized to be highly conserved. The large dN/dS value, however, indicates that it is not.

The hypothesis also states that the UL20 gene will be the least conserved, since the literature suggests viral function depends less on this gene than on the others. This, however, was not the case when comparing the dN/dS values computed via HyPhy. The UL20 gene had a dN/dS value of 0.749, while the UL15; UL28; UL33 gene complex had a dN/dS value of 0.972 (Table 2). Although UL20 has a higher dN/dS value than UL49, its value is still smaller than that of the UL15; UL28; UL33 gene complex.
This could be due to several reasons. One is that virion transmission could be more vital to virus function than DNA encapsidation. However, virion transmission involves genes besides UL20. Ward et al., in a study on the function of UL20, infected cells with either a UL20- virus (HSV-1 lacking the UL20 gene) or a wild-type virus (HSV-1 containing the UL20 gene) (5). Cells infected with the UL20- virus accumulated many virions between the nuclear and plasma membranes but lacked virions in the extracellular space (5). In contrast, cells infected with the wild-type virus showed virions both between the nuclear and plasma membranes and within the extracellular space (5). This indicates that, although other proteins perform similar or the same functions, virion egress into the extracellular space still requires UL20. UL20 is therefore necessary for virus transmission, which would give the gene a lower evolutionary rate than expected.

Another possible explanation for why the UL15; UL28; UL33 gene complex appears less conserved than UL20 is the variety of genes within the complex. Although these three genes work together to encapsidate the DNA, they have separate functions within the terminase enzyme. The maximum likelihood tree shows three separate clades, each representing a different gene within the complex. The branch leading to each individual clade is long, but the branches composing the clade are short. Each clade corresponds to the two strains used for each individual gene (Figure 2). This means that each gene is highly conserved across its strains, even if the genes are less conserved as a complex. In other words, comparing the different genes within this complex against each other produced inconclusive results.
In addition, if the individual genes were compared only against themselves, the dN/dS values for each of these genes would be expected to be lower than the dN/dS value for UL20.

Similarly, the assumption that the UL15; UL28; UL33 gene complex would be conserved because the products of all three genes work together to form the terminase enzyme was not supported by the pairwise distance matrices. Only three out of fifteen p-values for the UL15; UL28; UL33 gene complex matrix rejected the null hypothesis that dN=dS (Table 3). This means that either all of these genes are under neutral selection (unlikely) or the assumption is flawed (likely). The assumption was most likely flawed because, although the three gene products make up the same enzyme, they are responsible for different functions within that enzyme. The pairwise distance matrices support this explanation, since the only significant values were the three instances in which each individual gene (UL15, UL28, and UL33) was compared against a different strain of itself (Table 3). Additionally, tips composed of two strains of the same gene had short branch lengths, whereas the branches separating the three clades were long (Figure 2). The combination of the pairwise matrices and maximum likelihood trees suggests that the genes vary too greatly in composition to be considered as one complex. They should instead be treated as individual genes, each analyzed in isolation and then compared against one another, rather than grouped.

On the other hand, both UL20 and UL49 provide strong support against the null hypothesis that dN=dS. The codon-based Z-test of selection for HA: dN>dS resulted in negative Z-statistics and probabilities of 1.0, suggesting that the test was too restrictive but also that the number of synonymous mutations was greater than the number of nonsynonymous mutations (Table 1).
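
A negative Z-statistic paired with a probability near 1.0 is what a one-sided test returns when the data point in the direction opposite to the alternative hypothesis. The sketch below illustrates this on the UL20 row of Table 1. It assumes the probabilities are tail areas of a standard normal distribution, which is an assumption made here for illustration and may not match the software's exact procedure; the function names are hypothetical.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def z_test_probability(z, two_sided=False):
    """Probability for a Z-statistic oriented in the direction of HA.

    One-sided: upper-tail area beyond z.
    Two-sided: twice the upper-tail area beyond |z|.
    """
    if two_sided:
        return 2.0 * upper_tail(abs(z))
    return upper_tail(z)


# UL20 row of Table 1 (Z-statistics as reported in the table).
p_greater = z_test_probability(-2.572)                    # HA: dN>dS, statistic is dN-dS
p_not_equal = z_test_probability(-3.017, two_sided=True)  # HA: dN≠dS, statistic is dN-dS
p_less = z_test_probability(2.669)                        # HA: dN<dS, statistic is dS-dN

print(round(p_greater, 3), round(p_not_equal, 3), round(p_less, 3))
```

Under this normal assumption the three UL20 probabilities come out close to 1, about 0.003, and about 0.004, in line with the reported row: the same excess of synonymous changes that makes HA: dN<dS significant is what pushes the HA: dN>dS probability toward 1.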
After changing the alternative hypothesis to the two-sided HA: dN≠dS, the only gene with a probability significant enough to reject the null was UL20, and again all Z-statistics were negative (Table 1). Finally, the alternative hypothesis was changed to dN<dS, and all Z-statistics were positive with significant probabilities (Table 1). This suggests that the null hypothesis of neutral selection can be rejected in favor of the alternative hypothesis that all genes are under purifying selection.

All p-values computed via pairwise distance for UL49 are statistically significant in rejecting the null (Table 3). This gives higher confidence that the UL49 gene is under selection. Because the dN/dS value was 0.306, the hypothesis that the UL49 gene is under purifying selection is supported (Table 2). The combination of the unanimous p-values and the low dN/dS value strongly supports the main hypothesis that the UL49 gene is the most highly conserved.

Similarly, the dN/dS value for the UL20 gene is less than one (0.749), suggesting purifying selection (Table 2). However, since this number is closer to one than the dN/dS value for UL49, and not all p-values for UL20 reject the null hypothesis, there is less confidence that the UL20 gene is under selection, and so the hypothesis that UL20 is less highly conserved than UL49 is supported. This is also supported by the UL49 gene having the lowest dN/dS value (Table 2).

In conclusion, our hypotheses that UL49 would be highly conserved and that UL20 would not be as highly conserved are supported, but our findings regarding the conservation of the UL15; UL28; UL33 gene complex are inconclusive.

References:

1. Ehrlich SD. 2013. Herpes simplex virus. Complementary and Alternative Medicine Guide. University of Maryland Medical Center, Baltimore, MD. [Online.] http://umm.edu/health/medical/altmed/condition/herpes-simplex-virus
2. Jenkins FJ, Turner SL. 1996. Herpes simplex virus: a tool for neuroscientists. Frontiers in Bioscience. 1:241-247
3. Spear PG. 2004. Herpes simplex virus: receptors and ligands for cell entry. Cell Microbiol. 6(5):401-410
4. Tanaka M, Kato A, Satoh Y, Ide T, Sagou K, Kimura K, Hasegawa H, Kawaguchi Y. 2012. Herpes simplex virus 1 VP22 regulates translocation of multiple viral and cellular proteins and promotes neurovirulence. Journal of Virology. 86:5264-5277
5. Ward PL, Campadelli-Fiume G, Avitabile E, Roizman B. 1994. Localization and putative function of the UL20 membrane protein in cells infected with Herpes Simplex Virus 1. Journal of Virology. 68:7406-7417
6. Higgs MR, Preston VG, Stow NG. 2008. The UL15 protein of herpes simplex virus type 1 is necessary for the localization of the UL28 and UL33 proteins to viral DNA replication centres. Journal of General Virology. 89:1709-1715
7. Brown J. 2004. Effect of gene location on the evolutionary rate of amino acid substitutions in herpes simplex virus proteins. Virology. 330:209-220. [Online.] http://www.sciencedirect.com/science/article/pii/S0042682204006233
8. Fossum E, Friedel CD, Rajagopala SV, Titz B, Baiker A, Schmidt T, Kraus T, Stellberger T, Rutenberg C, Suthram S, Bandyopadhyay S, Rose D, Von Brunn A, Uhlmann M, Zeretzke C, Dong Y, Boulet H, Koegl M, Bailer SM, Koszinowski U, Ideker T, Uetz P, Zimmer R, Haas J. 2009. Evolutionarily conserved herpesviral protein interaction networks. PLOS Pathogens. [Online.] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2731838/#
9. Kolb AW, Adams M, Cabot EL, Craven M, Brandt CR. 2011. Human herpesvirus 1 strains CJ360, CJ311, OD4, TFT401, CJ970, CJ394 partial genomes. [Online.] http://www.ncbi.nlm.nih.gov
10. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. 2013. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Molecular Biology and Evolution. 30:2725-2729