Supporting Information

TITLE: Ecological connectivity shapes quasispecies structure of RNA virus in an Antarctic lake

AUTHORS: Alberto López-Bueno1, Alberto Rastrojo1, Ramón Peiró1, Miguel Arenas1 & Antonio Alcamí1*

FULL POSTAL ADDRESS: 1 Department of Virology and Microbiology. Centro de Biología Molecular “Severo Ochoa” (Consejo Superior de Investigaciones Científicas-Universidad Autónoma de Madrid). Nicolás Cabrera 1, Cantoblanco, 28049, Madrid, Spain

CORRESPONDING AUTHOR: * Antonio Alcamí. Tel. +34 911964560, Fax +34 911964420, Email aalcami@cbm.csic.es

Supporting Fig. 1. Taxonomy overview of reads contained in contigs from Lake Limnopolar viromes. BLASTx (E-value < 0.001) comparison of contigs obtained with Newbler assembler against a dataset of reference viral genomes. (A) Breakdown in assembled or unassembled metagenomic reads (singletons). (B) Taxonomic profile of reads in contigs assigned to viral families.

Supporting Fig. 2. Genomic organization of four APLVs. ORFs with >100 codons are drawn as grey arrows. Blocks within ORFs represent different domains: green for RNA_helicase (pfam00910), yellow for 3C_cysteine_proteinase (pfam00548), light blue for RNA_dependent_RNA_polymerase (pfam00680), red for Rhv_like capsid domain (pfam00073), blue for Dicistroviridae VP4 protein (pfam11492) and light green for CRPV-Capsid domain (pfam08762). Final lengths include 20 consecutive adenines at the 3´-end.

Supporting Fig. 3. Principal Component Analysis (PCA) of the dinucleotide bias of APLV14 and a comprehensive set of viruses from the order Picornavirales. Blue, red and green dots represent values for individual viruses known to infect plants, invertebrates and microorganisms respectively. Ellipses contain 67% of the total variance of virus groups with the same host.

Supporting Fig. 4. Genome stability of APLVs. (A) Nucleotide and amino acid differences affecting the consensus sequence of APLV1, 2 and 3 among water and cyanobacterial mat samples.
(B) Shared SNVs among quasispecies of APLV1 considering the 50 most abundant SNVs. (C) Ranking of the eight most abundant SNVs of APLV1 with colours indicating identical SNVs among samples.

Supporting Fig. 5. Statistical analysis of complexity indexes. Statistical comparisons were performed using the unpaired Mann-Whitney test in the GraphPad software package. A P-value < 0.05 was considered significant. (A) Average SNV frequencies obtained with CLC are shown. (B) Entropy was calculated for each position using the formula $H = -\sum_{i \in \{A, C, G, T\}} P_i \log_2 P_i$, where $P_i$ represents the probability of the ith base. Average entropy is represented. (C) Diversity was defined as the proportion of different nucleotide states for each position along the genome and was calculated with the formula $D = 1 - \sum_{i \in \{A, C, G, T\}} P_i^2$. Average diversity is represented. (D) dN-dS for each codon was calculated as described in Material and Methods using an in-house script. Horizontal lines inside boxes represent the median and the whiskers represent minimum and maximum values. Error bars represent standard deviation in A, B and C.

Supporting Fig. 6. SNV calling calculated by four alternative methods. (A) Distribution of SNVs along APLV1 and APLV2 genomes. As indicated, each coloured dot represents the method used to validate SNVs forming the quasispecies of APLV1 in 2006 and APLV2 in 2007. Genome organization schemes are drawn above graphs to properly locate the SNVs. (B) The ten most abundant SNVs called under each of the four alternative methods are shown.

Supporting Table 1. Details of sample collection.

| Virome | L2006 | L2007 | L2010 | CyaMat |
|---|---|---|---|---|
| Sample | Freshwater (4 m depth) | Freshwater (4 m depth) | Freshwater (4 m depth) | Cyanobacteria Mat (Pool 5 samples) |
| Physical factors | Ice-covered | Open lake | Open lake | Non ice-covered mats |
| Collecting date | 27th Nov 2006 | 22nd Jan 2007 | 1st Feb 2010 | 2nd Feb 2008 |
| Season | Spring | Summer | Summer | Summer |

Supporting Table 2. Oligonucleotides used for scaffolding and 3´ end sequencing.

| Virus | Region | Name | Sequence (5´-3´) |
|---|---|---|---|
| APLV1 | 3´-end* | 7739F-rS9B1 | GAGGTATAGCTTCAACGC |
| APLV1 | 3´-end* | 7977F-rS9B1 | TGATGCACGTTGTTTCGC |
| APLV2 | Inner | 4504F-rS9A3 | GAGCAAGGTCAAATTTGGA |
| APLV2 | Inner | 5237R-rS9A3 | CATCTTGCTCTTAGGCCC |
| APLV2 | Inner | 5493F-rS9A3 | TGGTACGCTTGCTATCTG |
| APLV2 | Inner | 6556R-rS9A3 | CAAGCCTACAATGCCAAG |
| APLV2 | Inner | 7114F-rS9A3 | CTAGACCCTACTGAATCAC |
| APLV2 | Inner | 7839R-rS9A3 | ATTGAGCCTCGCCAATAC |
| APLV2 | 3´-end* | 8630F-rS9A3 | TTACTCCGGCTTTCGCAG |
| APLV2 | 3´-end* | 8986F-rS9A3 | TTCTCAGTGATGAAGGCG |
| APLV3 | Inner | 4065F-rS9A17 | GCTCGACCATATCTACGG |
| APLV3 | Inner | 4904R-rS9A17 | GCCTACACGTAATACCAC |
| APLV3 | 3´-end* | 7892F-rS9A17 | GGAAGTGTTGGGAGAAGG |
| APLV3 | 3´-end* | 8304F-rS9A17 | GATACGCTTAGTGTAAGGG |
| APLV4 | 3´-end* | 8088F-rS9Z121 | CGACTCAAAGTGTTGAACC |
| APLV4 | 3´-end* | 8360F-rS9Z121 | ACGTAAGTGTACTTTTGCG |

*A nested PCR with Oligo dT was employed to sequence the 3´-end region of the four APLV genomes.

Supporting Table 3. Metagenomic reads statistics and accession numbers.

| | | L2006 | L2007 | L2010 |
|---|---|---|---|---|
| Input | Reads | 42,653 | 65,525 | 95,425 |
| Input | Average length | 374 bp | 325 bp | 361 bp |
| Input | Mbp | 16.0 | 21.3 | 34.4 |
| Primers clipping, quality filtering | Reads (%) | 37,956 (89.0) | 54,727 (83.5) | 83,379 (87.4) |
| Primers clipping, quality filtering | Average length | 359 bp | 316 bp | 347 bp |
| Primers clipping, quality filtering | Mbp (%) | 13.7 (85.9) | 17.5 (82) | 29.1 (84.4) |
| Repository | SRA accession no. | SRS668074 | SRS668081 | SRS668082 |
| Repository | MG-RAST id. | 4572579.3 | 4572623.3 | 4572643.3 |
| Repository | Metavir id. | 4488 | 4489 | 4490 |

Metagenomic reads statistics before and after primer-clipping and quality filtering. Accession numbers of RNA viromes in three repositories of metagenomic reads. Sequence Read Archive (SRA) numbers correspond to Biosamples and they are included in the Bioproject SRP044919.

Supporting Table 4. Taxonomic profile of contigs.

Taxa are listed in the order of the original table, which groups them by genome type (dsDNA, ssDNA, dsRNA, ssRNA).

| | L2006 | L2007 | L2010 |
|---|---|---|---|
| Number of contigs (% reads in contigs) | 79 (72.8%) | 222 (70.7%) | 201 (72.4%) |
| Number viral contigs (% reads in contigs) | 39 (66.6%) | 60 (41.8%) | 96 (60.7%) |
| Myoviridae | 0 | 3 | 0 |
| Podoviridae | 1 | 0 | 1 |
| Siphoviridae | 1 | 7 | 0 |
| Mimiviridae | 0 | 2 | 1 |
| Phycodnaviridae | 0 | 2 | 0 |
| Poxviridae | 2 | 1 | 0 |
| Satellites | 1 | 0 | 0 |
| Circoviridae | 3 | 0 | 0 |
| Inoviridae | 3 | 1 | 0 |
| Microviridae | 12 | 3 | 0 |
| Nanoviridae | 1 | 0 | 0 |
| Unclassified | 3 | 0 | 0 |
| Totiviridae | 0 | 1 | 3 |
| Reoviridae | 0 | 0 | 1 |
| Alphatetraviridae | 0 | 0 | 1 |
| Bromoviridae | 0 | 0 | 1 |
| Sobemovirus | 0 | 0 | 1 |
| Tombusviridae | 1 | 0 | 8 |
| Virgaviridae | 0 | 1 | 0 |
| Dicistroviridae | 4 | 4 | 18 |
| Secoviridae | 0 | 1 | 2 |
| Picornaviridae | 0 | 1 | 1 |
| Bacillarnavirus | 4 | 10 | 15 |
| Iflaviridae | 0 | 1 | 0 |
| Labyrnavirus | 0 | 1 | 3 |
| Marnaviridae | 1 | 1 | 5 |
| Environmental Picornavirales | 2 | 18 | 33 |
| Unclassified ssRNA viruses | 0 | 2 | 2 |

Reads were assembled with Newbler 2.5.3 using 97% identity over a minimum overlap of 90%. Binning was performed by comparison with the reference viral proteins dataset of the NCBI using BLASTx (threshold of 10^-3 on the BLAST E-value) and contigs were ascribed to viral families or groups based on the best hit.

Supporting Table 5. BLASTp against GenBank nr of the APLVs-ORFs.

| ORF | E-Value | Query | Identity | Family | Species |
|---|---|---|---|---|---|
| APLV1(ORF-1) | 3E-164 | 27% | 58% | Dicistroviridae | Nedicistrovirus TFN-2012 |
| APLV1(ORF-2) | 5E-155 | 99% | 37% | Dicistroviridae | Nedicistrovirus TFN-2012 |
| APLV2(ORF-1) | 0 | 84% | 40% | Environmental | JP-A |
| APLV2(ORF-2) | 2E-118 | 98% | 32% | Bacillarnavirus | AglaRNAV |
| APLV3(ORF-1) | 6E-59 | 60% | 24% | Bacillarnavirus | CtenRNAV |
| APLV4(ORF-1) | 3E-95 | 69% | 30% | Environmental | JP-A |
| APLV4(ORF-1) | 1E-99 | 97% | 33% | Bacillarnavirus | AglaRNAV |

Query indicates the percentage of aa in the alignment region and Identity the percentage of aa identity. JP-A: Marine RNA virus JP-A, AglaRNAV: Asterionellopsis glacialis RNA virus, CtenRNAV: Chaetoceros tenuissimus RNA virus.

Supporting Table 6. Percentage of amino acid identity among conserved domains of APLVs.

Rhv domains (pfam00073)*

| | APLV1-1 | APLV1-2 | APLV2-1 | APLV2-2 | APLV4-1 | APLV4-2 |
|---|---|---|---|---|---|---|
| APLV1-1 | 100 | 20.2 | 19.6 | 14.4 | 23.1 | 12.6 |
| APLV1-2 | 20.2 | 100 | 17.0 | 25.2 | 20.3 | 23.5 |
| APLV2-1 | 19.6 | 17.0 | 100 | 18.5 | 40.6 | 14.9 |
| APLV2-2 | 14.4 | 25.2 | 18.5 | 100 | 16.3 | 39.4 |
| APLV4-1 | 23.1 | 20.3 | 40.6 | 16.3 | 100 | 16.4 |
| APLV4-2 | 12.6 | 23.5 | 14.9 | 39.4 | 16.4 | 100 |

RdRp domain (pfam00680)

| | APLV1 | APLV2 | APLV3 | APLV4 |
|---|---|---|---|---|
| APLV1 | 100 | 23.6 | 22.4 | 23.0 |
| APLV2 | 23.6 | 100 | 32.9 | 34.4 |
| APLV3 | 22.4 | 32.9 | 100 | 29.9 |
| APLV4 | 23.0 | 34.4 | 29.9 | 100 |

* There are two Rhv domains in the structural gene of APLV1, 2 and 4.

Supporting Table 7. Complexity indexes of APLV1 and APLV2 viral quasispecies.

APLV1-L2006

| Variant calling method | CLC | MPileup | MPileup +Realig. | MPileup +Shorah |
|---|---|---|---|---|
| No. SNVs | 134 | 266 | 153 | 118 |
| Coverage | 831 | 572 | 621 | 932 |
| Quality | 34.6 | ND | ND | ND |
| Length studied | 8,333 | 8,315 | 8,315 | 8,150 |
| Mean diversity | 1.61 | 3.20 | 1.84 | 1.45 |
| SNV frequency | 2.58 | 2.12 | 2.19 | 3.23 |
| Ti/Tv | 11.18 | 2.97 | 7.50 | 0.62 |
| Sn | 20.8 | 37.3 | 21.5 | 22.5 |
| Sn norm | 0.0012 | 0.0022 | 0.0013 | 0.0014 |
| π (%) | 0.077 | 0.129 | 0.076 | 0.087 |

APLV2-L2007

| Variant calling method | CLC | MPileup | MPileup +Realig. | MPileup +Shorah |
|---|---|---|---|---|
| No. SNVs | 475 | 453 | 455 | 255 |
| Coverage | 447 | 300 | 254 | 293 |
| Quality | 35.4 | ND | ND | ND |
| Length studied | 8,877 | 8,600 | 8,600 | 8,397 |
| Mean diversity | 5.35 | 5.27 | 5.29 | 3.04 |
| SNV frequency | 11.89 | 9.54 | 7.29 | 14.33 |
| Ti/Tv | 4.22 | 4.46 | 7.77 | 4.20 |
| Sn | 208.8 | 163.9 | 135.5 | 123.1 |
| Sn norm | 0.0118 | 0.0095 | 0.0079 | 0.0073 |
| π (%) | 0.964 | 0.758 | 0.614 | 0.620 |

Complexity indexes of two viral quasispecies obtained following four distinct SNV-calling strategies. No. SNVs: number of SNVs >1%; Coverage refers to the average coverage for SNV sites; Length studied (coverage >25×; bp); Mean diversity (% SNV sites); SNV frequency: average percentage of SNVs; Ti/Tv represents the ratio of nucleotide transitions to transversions; Sn and Sn norm are the Shannon index and the normalized Shannon index respectively; π is the nucleotide diversity; ND: not determined.

Supporting Table 8. Cyanobacterial Mat RNA viral contigs in the Lake Limnopolar.

| Contigs/Genomes | Length (bp) | Reads CyaMat | Reads L2006 | Reads L2007 | Reads L2010 |
|---|---|---|---|---|---|
| C-26 | 3,874 | 6,086 | 0 | 0 | 0 |
| APLV3 | 8,644 | 4,151 | 1 | 2,918 | 749 |
| APLV2 | 9,357 | 2,776 | 13 | 4,219 | 634 |
| C-79 | 2,207 | 2,071 | 0 | 0 | 1 |
| C-130 | 1,628 | 1,165 | 0 | 0 | 2 |
| C-1 | 8,251 | 676 | 0 | 8 | 16 |
| C-2 | 7,561 | 485 | 0 | 44 | 68 |

ORF taxonomy (values in table order; the per-contig alignment is not preserved in this text). Non-structural: JP-A, CtenRNAV, JP-A, CsfrRNAV, JP-A, RsRNAV. Structural: AglaRNAV, AglaRNAV, CtenRNAV, AglaRNAV.

Subsamples of 25,000 reads from each virome were aligned to full-length RNA viral genomes or contigs previously assembled from the Cyanobacterial Mat virome. ORF taxonomy was obtained by BLASTp against the GenBank nr database. JP-A: Marine RNA virus JP-A, CtenRNAV: Chaetoceros tenuissimus RNA virus, AglaRNAV: Asterionellopsis glacialis RNA virus, CsfrRNAV: Chaetoceros socialis f. radians RNA virus, RsRNAV: Rhizosolenia setigera RNA virus.

Supporting Table 9. Alternative methods for global dN/dS rates estimation.

Columns 1 to 17 give the values of each row in source order. The sample headings in the source combine virus (APLV genomes, including APLV2 and APLV3), virome (L2006, L2007, L2010, CyaMat) and ORF (ORF1, ORF2); their alignment to individual columns is not preserved in this text.

| Method | Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KaKs_Calculator | NG (JC69) | 0.2607 | 0.1405 | 0.1865 | 0.1759 | 0.2004 | 0.1285 | 0.2312 | 0.1705 | 0.0344 | 0.1138 | 0.0536 | 0.1160 | 0.1723 | 0.0601 | 0.0978 | 0.1637 | 0.2812 |
| KaKs_Calculator | YN (HKY) | 0.3567 | 0.1118 | 0.2443 | 0.1204 | 0.2624 | 0.0896 | 0.3445 | 0.2061 | 0.0406 | 0.1229 | 0.0627 | 0.1283 | 0.2115 | 0.0707 | 0.1088 | 0.1659 | 0.2462 |
| KaKs_Calculator | MYN (TN93) | 0.4001 | 0.1120 | 0.2930 | 0.1296 | 0.3520 | 0.0935 | 0.3611 | 0.2033 | 0.0484 | 0.1206 | 0.0706 | 0.1267 | 0.2101 | 0.0715 | 0.1189 | 0.2062 | 0.2818 |
| KaKs_Calculator | MS | 0.4096 | 0.1459 | 0.2998 | 0.1823 | 0.3463 | 0.1061 | 0.3249 | 0.2248 | 0.0512 | 0.1423 | 0.0735 | 0.1355 | 0.2343 | 0.0778 | 0.1214 | 0.1813 | 0.2641 |
| KaKs_Calculator | MA | 0.3994 | 0.1376 | 0.2928 | 0.1766 | 0.3457 | 0.0947 | 0.3385 | 0.2228 | 0.0512 | 0.1415 | 0.0735 | 0.1382 | 0.2338 | 0.0773 | 0.1216 | 0.1813 | 0.2607 |
| HyPhy | Average (HKY) | 0.4847 | 0.2102 | 0.4785 | 0.1009 | 0.4672 | 0.1377 | 0.3187 | 0.3444 | 0.0347 | 0.1600 | 0.0535 | 0.1412 | 0.1173 | 0.0683 | 0.0839 | 0.0963 | 0.2207 |
| HyPhy | SD (HKY) | 0.0240 | 0.0260 | 0.0395 | 0.0177 | 0.0325 | 0.0198 | 0.0147 | 0.0346 | 0.0003 | 0.0015 | 0.0004 | 0.0014 | 0.0019 | 0.0006 | 0.0007 | 0.0014 | 0.0114 |
| HyPhy | Average (TN93) | 0.6271 | 0.2537 | 0.6668 | 0.1443 | 0.6756 | 0.1825 | 0.3908 | 0.3546 | 0.0485 | 0.1756 | 0.0702 | 0.1523 | 0.1337 | 0.0788 | 0.0977 | 0.1152 | 0.2630 |
| HyPhy | SD (TN93) | 0.0322 | 0.0312 | 0.0507 | 0.0253 | 0.0434 | 0.0271 | 0.0172 | 0.0385 | 0.0005 | 0.0017 | 0.0005 | 0.0013 | 0.0021 | 0.0008 | 0.0007 | 0.0019 | 0.0147 |
| HyPhy | Average (F81) | 0.2850 | 0.1556 | 0.2835 | 0.0729 | 0.2769 | 0.1129 | 0.1930 | 0.2689 | 0.0215 | 0.1263 | 0.0340 | 0.1095 | 0.0803 | 0.0482 | 0.0593 | 0.0703 | 0.1847 |
| HyPhy | SD (F81) | 0.0162 | 0.0251 | 0.0249 | 0.0161 | 0.0177 | 0.0215 | 0.0080 | 0.0285 | 0.0003 | 0.0014 | 0.0002 | 0.0011 | 0.0012 | 0.0004 | 0.0004 | 0.0011 | 0.0107 |
| HyPhy | Average (GTR) | 0.6191 | 0.2595 | 0.6612 | 0.1445 | 0.6743 | 0.1871 | 0.3946 | 0.3538 | 0.0491 | 0.1783 | 0.0710 | 0.1542 | 0.1336 | 0.0790 | 0.0978 | 0.1169 | 0.2755 |
| HyPhy | SD (GTR) | 0.0316 | 0.0313 | 0.0491 | 0.0346 | 0.0470 | 0.0257 | 0.0160 | 0.0368 | 0.0005 | 0.0016 | 0.0005 | 0.0015 | 0.0043 | 0.0007 | 0.0009 | 0.0015 | 0.0152 |
| In-house script | JC69 | 0.2942 | 0.1517 | 0.2162 | 0.1719 | 0.2304 | 0.1260 | 0.2513 | 0.1792 | 0.0472 | 0.1256 | 0.0691 | 0.1295 | 0.1936 | 0.0713 | 0.1150 | 0.1853 | 0.3151 |
| In-house script | HKY | 0.2941 | 0.1518 | 0.2162 | 0.1719 | 0.2304 | 0.1260 | 0.2513 | 0.1792 | 0.0470 | 0.1241 | 0.0688 | 0.1287 | 0.1936 | 0.0712 | 0.1153 | 0.1845 | 0.3199 |
| In-house script | F81 | 0.2941 | 0.1518 | 0.2162 | 0.1719 | 0.2304 | 0.1260 | 0.2513 | 0.1792 | 0.0472 | 0.1256 | 0.0691 | 0.1294 | 0.1936 | 0.0714 | 0.1150 | 0.1855 | 0.3139 |
| In-house script | TN93 | 0.2925 | 0.1518 | 0.2149 | 0.1719 | 0.2304 | 0.1260 | 0.2513 | 0.1792 | 0.0468 | 0.1236 | 0.0689 | 0.1279 | 0.1936 | 0.0711 | 0.1152 | 0.1847 | 0.3191 |
| In-house script | GTR | 0.2929 | 0.1518 | 0.2151 | 0.1719 | 0.2304 | 0.1260 | 0.2513 | 0.1792 | 0.0468 | 0.1228 | 0.0687 | 0.1276 | 0.1936 | 0.0711 | 0.1155 | 0.1840 | 0.3232 |

Global dN/dS rates estimated by three alternative methods that implement several nucleotide substitution models. Information regarding the KaKs calculator substitution matrices can be retrieved from http://services.cbu.uib.no/tools/kaks. dN/dS values estimated by the HyPhy package from simulated reads reflect the average value from 100 replicates; the standard deviation is also provided.
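
The per-position entropy and diversity formulas given in the legend of Supporting Fig. 5, and the "Mean diversity (% SNV sites)" index of Supporting Table 7, can be illustrated with a minimal Python sketch. This is not the authors' in-house script. The input format (a dictionary of read counts per base at each position), the function names and the convention that a base with zero frequency contributes nothing to the entropy sum are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def base_frequencies(counts):
    """Convert read counts per base at one position into frequencies P_i.

    `counts` is a hypothetical input format: a dict such as {"A": 98, "G": 2}.
    """
    total = sum(counts.get(b, 0) for b in BASES)
    if total == 0:
        raise ValueError("position has no coverage")
    return [counts.get(b, 0) / total for b in BASES]


def shannon_entropy(counts):
    """H = -sum_i P_i * log2(P_i) over A, C, G, T.

    Assumption: bases with P_i = 0 are skipped (0 * log 0 treated as 0).
    """
    return sum(-p * math.log2(p) for p in base_frequencies(counts) if p > 0)


def diversity(counts):
    """D = 1 - sum_i P_i^2 over A, C, G, T."""
    return 1.0 - sum(p * p for p in base_frequencies(counts))


def mean_over_positions(metric, pileup):
    """Average a per-position metric over a list of per-position counts."""
    return sum(metric(c) for c in pileup) / len(pileup)


def percent_snv_sites(n_snvs, length_studied):
    """Mean diversity expressed as the percentage of sites carrying an SNV."""
    return 100.0 * n_snvs / length_studied


if __name__ == "__main__":
    # Toy three-position pileup (hypothetical counts, not data from the study).
    toy_pileup = [{"A": 100}, {"A": 50, "G": 50}, {"C": 90, "T": 10}]
    print(mean_over_positions(shannon_entropy, toy_pileup))
    print(mean_over_positions(diversity, toy_pileup))
    # APLV1-L2006 with CLC in Supporting Table 7: 134 SNVs over 8,333 bp.
    print(round(percent_snv_sites(134, 8333), 2))
```

With the Supporting Table 7 values for APLV1-L2006 called by CLC (134 SNVs over 8,333 bp studied), the last function reproduces the reported mean diversity of 1.61%.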