SUPPLEMENTARY MATERIAL

Consensus differences and genetic distances between genotypes 1a and 3a

Overall, the consensus amino acid sequence differed between HCV genotypes 1a and 3a at 515 of the 2,147 residues examined (24%; Supplementary Table 1). Of the consensus amino acids that differed between the genotypes, 350 (68%) involved a 1-nucleotide substitution, with a consistent distribution across the non-structural proteins (Supplementary Table 1). For these sites, where the genetic barrier (number of substitution steps) between HCV genotypes is low, the amino acid variation of one genotype included the alternative consensus at 166 residues (47% of 1-nucleotide distant sites). For the remaining sites (n=184, 53%), the amino acid variation of one genotype did not contain the consensus sequence of the alternative genotype despite a low genetic barrier. Of these sites, amino acid substitutions were physicochemically conservative in 62% of cases, indicating that factors other than genetic distance serve to preserve distinct intergenotype differences at many of these sites.

For the 165 (32%) residues where there was a higher genetic barrier between consensus amino acids, there were only two sites at which the alternative consensus amino acid was also represented as an intra-genotype variant. Otherwise, these genetically divergent sites defined highly genotype-specific mutational pathways.

Consensus differences and proteasomal cleavage predictions

Given that the processing of viral antigens relies on relatively monomorphic proteins to direct appropriate peptide cleavage, and that abrogation of this antigen processing pathway provides a means of evading CTL responses through ‘processing escape’ (1;2), we next examined the effects of genotype-specific differences in consensus sequences on antigen processing using the NetChop prediction tool (3;4).
This analysis identified 102 sites (20% of all discordant consensus residues) where consensus sequence variation was predicted to abrogate proteasomal cleavage (55 for genotype 1a; 47 for genotype 3a), as shown in Supplementary Table 1. The large number of different consensus amino acids and predicted processing sites between the genotypes is likely to result in differences in the repertoire of HLA-restricted viral epitopes across the genomes.

Analyses of positive and negative codon selection and amino acid co-variation

Analysis of positive and negative selection for the HCV non-structural protein sequences utilised the Single-Likelihood Ancestor Counting algorithm implemented in the program HyPhy, consistent with the approach adopted by Campo et al. (5), in which selection of HCV genotype 1b sequences was considered. The assessment of amino acid co-variation at sites of HLA-associated polymorphism utilised Fisher’s exact tests, with residues classified as consensus vs non-consensus amino acids, using S-Plus 8.0 (Insightful Corporation, Seattle, USA). Here, sites at which amino acid polymorphism was significantly associated with the presence of each HLA-associated polymorphism described in Table 2 (cut-off p<0.01) were identified.

Evidence of positive and negative selection in HCV genotypes 1a and 3a

We also examined these HCV sequences for evidence of positive and negative selection at each residue position, using the Single-Likelihood Ancestor Counting algorithm implemented in the program HyPhy (6). In keeping with previous studies (5), we identified a dominant overall pattern of negative selection, implying a relative abundance of synonymous codon substitutions and suggesting that extensive amino acid variation is not tolerated. Overall, only 30 of the 2,147 codon sites analyzed (1.4%) showed evidence of positive selection in genotype 1a (listed in Supplementary Table 2) compared with 1,415 sites under negative selection (66%) across the non-structural HCV proteins.
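As a rough illustration of how a counting-based method can separate positively from negatively selected codons, the sketch below applies a simple binomial test to the numbers of nonsynonymous and synonymous substitutions inferred at a single codon site. It is a simplification written for illustration and is not the HyPhy implementation; the function names, the significance threshold `alpha` and the example counts are assumptions rather than values from this study.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_site(nonsyn, syn, p_syn, alpha=0.1):
    """Label one codon site as 'positive', 'negative' or 'neutral'.

    nonsyn, syn : counted nonsynonymous / synonymous substitutions at the site
    p_syn       : expected proportion of synonymous changes under neutrality
    alpha       : significance threshold (an assumed value, not from the study)
    """
    total = nonsyn + syn
    if total == 0:
        return "neutral"  # no substitutions observed, nothing to test
    # Excess of nonsynonymous changes relative to the neutral expectation.
    p_positive = upper_tail(nonsyn, total, 1.0 - p_syn)
    # Excess of synonymous changes relative to the neutral expectation.
    p_negative = upper_tail(syn, total, p_syn)
    if p_positive < alpha:
        return "positive"
    if p_negative < alpha:
        return "negative"
    return "neutral"


# Hypothetical counts for three codon sites.
print(classify_site(nonsyn=0, syn=8, p_syn=0.25))
print(classify_site(nonsyn=6, syn=0, p_syn=0.6))
print(classify_site(nonsyn=2, syn=1, p_syn=0.3))
```

A site dominated by synonymous changes is labelled as negatively selected, which is the pattern the text describes as predominant across the non-structural proteins.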
Similar results were obtained for genotype 3a, where evidence of positive selection was identified at only 13 sites (0.6%) while negative selection was evident at 1,395 sites (65%) (Supplementary Table 2). Interestingly, only five positively selected sites were common to both genotypes 1a and 3a.

Codon selection and co-variation at HLA-associated sites

There were a number of HLA-associated polymorphic sites without evidence of co-variation, suggesting that the selection of amino acid variation at these sites is primarily determined by site-specific characteristics. These putative ‘independent’ sites accounted for the majority of HCV genotype 3a associations (13/18 sites = 72%) but were less common for genotype 1a (12/32 sites = 37.5%). For genotype 1a associations of this type, it is notable that a number of sites are characterized by negative codon selection (e.g. NS3-1398 and 1403 (HLA-B*0801); NS3-1495 and 1503 (HLA-A*0101); NS2-841 (HLA-C*0401) and 851 (HLA-C*1502)), implying a locally determined fitness cost of mutation that is in keeping with experimental data (7). This was not as apparent for ‘independent’ HLA-associated polymorphisms within genotype 3a, which tended (with the exception of NS3-1133) to demonstrate neutral selection.

As shown in Supplementary Table 3, there were also a number of HLA-associated polymorphisms that did appear to exist within highly integrated networks, involving amino acid co-variation throughout the non-structural HCV proteins. These ‘integrated’ sites included several examples in which a residue co-varying with an HLA-associated polymorphism was itself identified as a ‘primary’ HLA-association site. This could in some cases be attributed to a common HLA restriction (e.g.
linked amino acid variation at NS2 residues 958 and 1006, as well as NS4b residue 1723, each associated with HLA-B*3701; and co-varying NS5b polymorphism at positions 2841 and 2846 associated with HLA-B*2705), where it is conceivable that the selection of a primary CTL escape mutation (e.g. at residue 958, as shown in Figure 2) could entrain compensatory mutations elsewhere that preserve viral fitness. However, we also noted several links between ‘integrated’ sites associated with distinct HLA alleles; for example, NS2-957 polymorphism (HLA-B*1302) is associated with co-variation at NS4a residue 1695, which is in turn an HLA-B*2705-associated site. Each of these examples, involving seven polymorphic sites in genotype 1a, is highlighted in bold in Supplementary Table 3.

Reference List

1. Kimura Y, Gushima T, Rawale S, Kaumaya P, Walker CM. Escape mutations alter proteasome processing of major histocompatibility complex class I-restricted epitopes in persistent hepatitis C virus infection. J Virol 2005;79(8):4870-4876.
2. Seifert U, Liermann H, Racanelli V, Halenius A, Wiese M, Wedemeyer H et al. Hepatitis C virus mutation affects proteasomal epitope processing. J Clin Invest 2004;114(2):250-259.
3. Nielsen M, Lundegaard C, Lund O, Kesmir C. The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 2005;57(1-2):33-41.
4. Saxova P, Buus S, Brunak S, Kesmir C. Predicting proteasomal cleavage sites: a comparison of available methods. Int Immunol 2003;15(7):781-787.
5. Campo DS, Dimitrova Z, Mitchell RJ, Lara J, Khudyakov Y. Coordinated evolution of the hepatitis C virus. Proc Natl Acad Sci U S A 2008;105(28):9685-9690.
6. Pond SL, Frost SD, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics 2005;21(5):676-679.
7. Salloum S, Oniangue-Ndza C, Neumann-Haefelin C, Hudson L, Giugliano S, aus dem SM et al. Escape from HLA-B*08-restricted CD8 T cells by hepatitis C virus is associated with fitness costs. J Virol 2008;82(23):11803-11812.
8. Kumar S, Tamura K, Nei M. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Brief Bioinform 2004;5(2):150-163.

Supplementary Figure legends

Supplementary Figure 1. HLA Class I allele distribution across genotypes (A) and cohorts (B).

Supplementary Figure 2. Polymorphism profile and correlation of polymorphism rates in non-structural HCV proteins. Polymorphism profile in the non-structural HCV proteins (left panels). Vertical bars indicate the proportion of sequences with non-consensus residues, for genotype 1a above the line and for genotype 3a below the line. A red circle along the x-axis and red bars indicate residues with a different consensus amino acid for the genotypes. Correlation of polymorphism rates between the genotypes (right panels). Black dots indicate residues with identical consensus for both genotypes; red dots indicate residues where the consensus differs between the genotypes (see also legend to Figure 3).

Supplementary Figure 3. Comparison of genotype 1a and 3a HCV NS3 and NS5a sequences in the Australian, Swiss and United Kingdom study cohorts. A phylogenetic analysis of NS3 and NS5B was generated from the amino acid alignments described above, using the Neighbor-Joining method based on the p-distance model with pairwise deletion. Mean genetic distance between the genotypes was calculated using the same alignments. All analyses were performed using MEGA v3.1 (8).

Supplementary Table 1. Consensus differences and genetic distances for genotype 1a and 3a non-structural HCV proteins (NS2-NS5B).
| Protein | Amino acids | Different consensus between genotypes | 1 nucleotide change between genotypes | >1 nucleotide change between genotypes | Genotype differences with loss of proteasomal cleavage |
|---|---|---|---|---|---|
| NS2 | 217 | 92 (42%) | 59 (27%) | 33 (15%) | 30 (14%) |
| NS3 | 631 | 114 (18%) | 81 (13%) | 33 (5%) | 33 (5%) |
| NS4A | 54 | 11 (18%) | 10 (18%) | 1 (2%) | 3 (5%) |
| NS4B | 261 | 52 (20%) | 39 (15%) | 13 (5%) | 11 (4%) |
| NS5A | 448 | 124 (28%) | 80 (18%) | 45 (10%) | 6 (1%) |
| NS5B | 536 | 122 (23%) | 81 (15%) | 41 (8%) | 19 (4%) |

Supplementary Table 2. Codon sites with evidence of positive selection.

| Protein | Genotype 1a sites | Genotype 3a sites |
|---|---|---|
| NS2 | 814, **824**, 837, 856, 859, 896, 904, 921, 925, 998, 1019 | **824**, 849 |
| NS3 | 1411, 1582, 1634, 1640 | 1444, 1503, 1607 |
| NS4A | Nil | Nil |
| NS4B | **1873** | **1873** |
| NS5A | 2079, 2107, 2197, 2281, 2319, 2374 | 1979, 2263, 2307 |
| NS5B | **2486**, 2510, **2534**, 2537, 2540, 2600, **2626**, 2729 | **2486**, 2497, **2534**, **2626** |

Bold indicates sites under positive selection for both genotypes 1a and 3a.

Supplementary Table 3A. Codon selection and co-variation at HLA-associated sites (GT1a).
HLA association Protein Site HLA P NS2 824 C*1502 NS2 841N C*0401 N NS2 851 C*1502 NS2 - Co-varying amino acid residues NS3 NS4a NS4b NS5a 1093N, NS2 856N B*3503 814P,963 1223N 2079P, 2217, - 1967 1344N NS2 957 NS2 958 NS2 NS2 962 998P NS2 1006 NS2 1017N B*1302 C*0602 B*3701 C*0602 B*3503 B*1302 B*3701 C*0602 B*1501 - 2252, 2268, 2493N 2362, 2373 834N, 846N 883,904 NS5b - 1695 941,969 1968N 1969N 2079P, 2102, 2287, 2298, 2369, 2373 2435N, 2475N 1006 - - - - - 885,925P - - - - 2586N 958 - - 1723 2330 - 814P - - 1946 - 2501 - - - - - - - 2169 - 1106 1686 1969N 2093, 2102, 2435N, 2369 2475N 2008N, 2169, 2485, 2171N, 2185, 2604, 2213, 2339N, 2674, 2362 2747 - 1093N, NS3 1341 A*1101 821,963 1223N, 965 1344N, 1408N NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 1366 1368 1398N 1403N 1444N 1495N 1503N 1635 C*1502 B*5101 B*0801 B*0801 A*0101 A*0101 C*1203 A*1101 NS4a 1695N B*2705 846N,883 957,969 834N,879 883,885, NS4b 1759* B*3701 904, 1066, 1200 1694N 1964 938,941 1006,1018 NS4b NS5a NS5a NS5a 1876N 2000N 2036N 2155 B*4001 C*0401 A*1402 B*3501 852N 941 - 1444N - - 1873P - 2075N 2324N NS5a 2227 B*4403 - - - - 2216, 2234 - NS5a 2234 B*5101 - - - - 2227 - NS5b 2467 B*1501 941 - - 1964 2375 NS5b NS5b 2510P 2796 A*3101 C*0303 906N - - - - - NS5b 2841* B*2705 - - - - - NS5b 2846* B*2705 - - - - - 2630, 2747 2844N, 2846 2841

Supplementary Table 3B. Codon selection and co-variation at HLA-associated sites (GT3a).
HLA association Protein Site HLA NS2 981 B*4403 NS3 1073 A*2402 NS3 1133N A*0301 NS3 1383 B*5101 NS3 1416N B*0702 NS3 1444 A*0101 NS3 1560N A*2402 A*1101 NS3 1637 B*4403 NS3 1646 A*0101 NS4b 1759 B*5701 NS5a 1982 B*5701 NS5a 2248 B*3501 A*0201 NS5a 2320 C*1502 NS5a 2321 C*0602 NS5a 2354 B*3501 NS5a 2367 B*5101 NS5a 2372 A*1101 NS5b 2467 B*1501 - Co-varying amino acid residues NS3 NS4a NS4b NS5a 2219 1607P 1570N - - 1074N - - - - - - - - - - - - - - - - - - - - - - NS2 983 - NS5b -

Bold = HLA-associated site; * = association in unrestricted dataset only; P = positively selected site; N = negatively selected site.
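The co-variation entries in Supplementary Table 3 rest on Fisher’s exact tests applied to residues classified as consensus vs non-consensus, with a cut-off of p<0.01. A minimal sketch of such a test for one pair of alignment columns is given below. The original analysis used S-Plus 8.0, so this Python version, its function names and the toy residue columns are illustrative assumptions only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def covariation_p(column_x, column_y, consensus_x, consensus_y):
    """Test association between non-consensus residues at two sites.

    column_x, column_y : residues at the two sites, one character per sequence
    consensus_x/_y     : consensus residue at each site
    """
    a = b = c = d = 0
    for rx, ry in zip(column_x, column_y):
        var_x, var_y = rx != consensus_x, ry != consensus_y
        if var_x and var_y:
            a += 1  # both sites non-consensus
        elif var_x:
            b += 1  # only the first site non-consensus
        elif var_y:
            c += 1  # only the second site non-consensus
        else:
            d += 1  # both sites consensus
    return fisher_exact_two_sided(a, b, c, d)


# Hypothetical alignment columns for ten sequences (not data from the study).
p = covariation_p("AAAAATTTTT", "GGGGGSSSSS", "A", "G")
print(p, p < 0.01)
```

In this toy example the two sites vary together in every sequence, so the pair would pass the p<0.01 cut-off; with real alignments the same test would be repeated for each HLA-associated site against every other polymorphic residue.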