Supplemental Material: Single Cell Genomes of Marine Thaumarchaeota Reveal Insights into Population Differentiation by Depth

Haiwei Luo, Bradley B. Tolar, Brandon K. Swan, Chuanlun L. Zhang, Ramunas Stepanauskas, Mary Ann Moran, James T. Hollibaugh

Supplemental Methods

Single cell sample collection and construction of single amplified genome (SAG) libraries

Water samples for single cell analyses were collected, and replicate 1 mL subsamples were cryopreserved with 6% glycine betaine (Sigma) and stored at –80 °C (Cleland et al., 2004). Prior to cell sorting, samples with prokaryote cell abundances above 5 × 10⁵ mL⁻¹ were diluted 10x with filter-sterilized field samples and screened through a 70 µm mesh-size cell strainer (BD). For heterotrophic prokaryote detection, diluted subsamples (1-3 mL) were incubated for 10-120 min with SYTO-9 DNA stain (5 µM; Invitrogen). Cell sorting was performed with a MoFlo™ (Beckman Coulter) flow cytometer using a 488 nm argon laser for excitation, a 70 µm nozzle orifice and a CyClone™ robotic arm for droplet deposition into microplates. The cytometer was triggered on side scatter. The “single 1 drop” mode was used for maximal sort purity. Prokaryote cells were separated from eukaryotes, viruses, and detritus based on SYTO-9 fluorescence (proxy for nucleic acid content) and light side scatter (proxy for particle size) (del Giorgio et al., 1996). Synechococcus cells were excluded based on their autofluorescence signal. Target cells were deposited into 384-well plates containing 600 nL per well of either a) 1x TE buffer or b) prepGEM™ Bacteria (Zygem) reaction mix and stored at –80 °C until further processing. Of the 384 wells, 315 were dedicated to single cells, 66 were used as negative controls (no droplet deposited) and 3 received 10 cells each (positive controls).
The accuracy of droplet deposition was determined by depositing 10 µm fluorescent beads into 384-well plates and microscopically verifying the presence of beads in the plate wells. Of the 2-3 plates examined each sort day, with one bead deposited per well, fewer than 2% of wells were found to contain no bead and 0.4% to contain more than one bead. The latter is most likely caused by co-deposition of two beads attached to each other, which at certain orientations may have optical properties similar to those of single beads.

Cells sorted into TE buffer were lysed and their DNA was denatured using cold KOH (Raghunathan et al., 2005). Genomic DNA from the lysed cells was amplified using multiple displacement amplification (MDA) (Dean et al., 2002; Raghunathan et al., 2005) in 10 µL final volume. The MDA reactions contained 2 U/µL Repliphi polymerase (Epicentre), 1x reaction buffer (Epicentre), 0.4 mM each dNTP (Epicentre), 2 mM DTT (Epicentre), 50 µM phosphorylated random hexamers (IDT) and 1 µM SYTO-9 (Invitrogen) (all final concentrations). The MDA reactions were run at 30 °C for 12-16 h, then inactivated by a 15 min incubation at 65 °C. Amplified genomic DNA was stored at –80 °C until further processing. We refer to the MDA products originating from individual cells as single amplified genomes (SAGs).

Prior to cell sorting, the instrument and the workspace were decontaminated for DNA as previously described (Stepanauskas and Sieracki, 2007). High molecular weight DNA contaminants were removed from all MDA reagents by a UV treatment in a Stratalinker (Stratagene) (Woyke et al., 2011). During UV treatment, reagents were placed on ice to avoid overheating. The UV exposure was optimized empirically to remove all detectable contaminants without inactivating the reaction. Cell sorting and MDA setup were performed in a HEPA-filtered environment.
As a quality control, the kinetics of all MDA reactions were monitored by measuring SYTO-9 fluorescence using either a LightCycler 480 (Roche) or a FLUOstar Omega (BMG). The critical point (Cp) was determined for each MDA reaction as the time required to produce half of the maximal fluorescence. The Cp is inversely correlated with the amount of DNA template (Zhang and Fang, 2006).

PCR screening of SAG libraries

MDA products were diluted 50-fold in TE buffer, and 500 nL aliquots of diluted MDA product served as the template DNA in real-time PCR screens of 5 µL final volume. All PCR reactions were performed using LightCycler 480 SYBR Green I Master Mix (Roche) and the Roche LightCycler® 480 II real-time thermal cycler. PCR amplification of archaeal SSU rRNA genes from SAGs was done using primers Arch_344F (ACG GGG YGC AGC AGG CGC GA) and Arch_915R (GTG CTC CCC CGC CAA TTC CT) (Lane, 1991). Forward (5´–GTAAAACGACGGCCAGT–3´) and reverse (5´–CAGGAAACAGCTATGACC–3´) M13 sequencing primers were added to the 5´ ends of each target primer pair to aid direct sequencing of PCR products. All PCR reactions were run for 40 cycles at the appropriate annealing temperature, followed by melting curve analysis performed as follows: 95°C for 5 s, 52°C for 1 min, and a continuous temperature ramp (0.11°C/s) from 52 to 97°C. Real-time PCR kinetics and amplicon melting curves served as proxies for detecting SAGs positive for target genes. New 20 µL PCR reactions were set up for all PCR-positive SAGs, and amplicons were sequenced from both ends using Sanger technology by Beckman Coulter Genomics. Single cell sorting, whole genome amplification, real-time PCR screens and PCR product sequence analyses were performed at the Bigelow Laboratory Single Cell Genomics Center following protocols described on their web site (www.bigelow.org/scgc).
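The Cp definition used above for the MDA kinetics (time required to reach half of the maximal fluorescence) can be illustrated with a short sketch. This is not the instrument software's algorithm: the function name `critical_point`, the linear interpolation between readings, and the absence of baseline subtraction are all assumptions made for illustration.

```python
def critical_point(times, fluorescence):
    """Estimate Cp: the time at which fluorescence first reaches half its maximum.

    `times` and `fluorescence` are equal-length sequences of readings.
    Linear interpolation between the two bracketing readings is an
    assumption of this sketch, not a documented instrument behavior.
    """
    if len(times) != len(fluorescence) or len(times) == 0:
        raise ValueError("times and fluorescence must be non-empty and equal in length")
    half_max = max(fluorescence) / 2.0
    for i, f in enumerate(fluorescence):
        if f >= half_max:
            if i == 0:
                return float(times[0])
            # The previous reading is below half_max and this one is at or
            # above it, so the denominator below is strictly positive.
            t0, t1 = times[i - 1], times[i]
            f0, f1 = fluorescence[i - 1], f
            return t0 + (half_max - f0) * (t1 - t0) / (f1 - f0)
    raise ValueError("fluorescence never reaches half of its maximum")


# Hypothetical readings (hours, arbitrary fluorescence units).
cp = critical_point([0, 1, 2, 3, 4], [0, 0, 2, 8, 10])
```

With these made-up readings the half-maximum (5 units) is crossed between hours 2 and 3, so an earlier Cp would indicate more starting template.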
Antarctic SAGs were also screened at the University of Georgia for the presence of archaeal amoA genes using primers and qPCR conditions described in Francis et al. (2005) and Wuchter et al. (2006). PCR products were sequenced at the Georgia Genomics Facility to verify amplification of the target gene.

SAG sequencing and analysis

A total of 46 Thaumarchaeota SAGs were chosen for whole genome sequencing based on multiple displacement amplification (MDA) kinetics, presence of metabolic genes from PCR screening and geographic location of the sampling site. Three approaches were used for sequencing marine Thaumarchaeota SAGs: 1) a combination of Illumina and 454 shotgun sequencing (AAA007-O23), or Illumina only (AB-661-I02, AB-661-L21, AB-661-M19, AB-663-F14, AB-663-G14, AB-663-N18, AB-663-O07, AB-663-P07, AAA160-J20, AAA001-A19), as described in Swan et al. (2011); 2) a combination of Illumina and PacBio long read sequence data (AAA007-N19, AAA288-I14, and AAA288-J14) as described in Martinez-Garcia et al. (2012), assembled using Velvet-SC (Chitsaz et al., 2011) and PBcR (Koren et al., 2012); and 3) 454 shotgun sequencing of Nextera-prepared libraries followed by dual assembly with Newbler v2.4 and Geneious Pro v5.5.6 (Drummond et al., 2011) (all remaining SAGs; total of 32).

For each of these 32 SAGs, raw 454 sequences were trimmed in Geneious Pro v5.5.6 and any remaining transposons were removed using TagCleaner v0.11 (Schmieder et al., 2010). Sequences were then assembled separately in Newbler v2.4 (Roche) using default settings and in Geneious using the high-sensitivity setting. The Newbler-assembled sequences were imported into Geneious and co-assembled with both the Geneious-assembled contigs and the unused reads. The dual assembled contigs and all other contigs longer than 300 bp were pooled and annotated.
Nextera-prepared sequencing libraries were generated using the Roche Titanium-Compatible kit with MDA product as the input DNA, following the manufacturer’s instructions (Adey et al., 2010). A total of 32 Nextera sequencing libraries constructed from SAGs were barcoded and sequenced (454 FLX Titanium chemistry) on 1/2 of a microtiter plate. Whole-genome sequence data for all Thaumarchaeota SAGs are available in IMG under accession numbers listed in Supplementary Table S1.

SAG whole genome sequence quality control

Each raw sequence data set was screened against all finished bacterial and archaeal genome sequences (downloaded from NCBI) and the human genome to identify potential contamination in the sample. Reads were mapped against reference genomes with bwa version 0.5.9 (Li and Durbin, 2009) using default parameters (96% identity threshold). None of the libraries showed significant contamination. Additionally, gene sequences of the final assemblies (see below) were compared against the GenBank nr database by BLASTX and taxonomically classified using MEGAN (57).

To further verify the absence of contaminating sequences in the assemblies, tetramer frequencies were extracted from all scaffolds using two alternative settings: 1) a sliding window of 1000 bp with a 100 bp step size and 2) a sliding window of 5000 bp with a 500 bp step size. Reverse-complementary tetramers were combined and the frequencies represented as an N×136 feature matrix, where N is the number of windows and each column of the matrix corresponds to the frequency of one of the 136 possible tetramers. Principal component analysis (PCA) was then used to extract the most important components of this high dimensional feature matrix. The analysis produced a unimodal distribution along the first four PCs for the majority of SAGs, suggesting homogeneous DNA sources. Scaffolds representing extremes on the first four PCs were identified and manually examined for their closest TBLASTX hits against the NCBI nt database.
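The construction of the N×136 tetramer feature matrix described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, tetramers containing ambiguous bases are simply skipped, and the PCA step is not shown. The count of 136 arises because 16 of the 256 tetramers are their own reverse complement, giving (256 − 16)/2 + 16 = 136 combined classes.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# The 136 tetramer classes left after merging reverse-complementary pairs.
CANONICAL_TETRAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
_COLUMN = {kmer: i for i, kmer in enumerate(CANONICAL_TETRAMERS)}


def window_frequencies(window):
    """Relative frequency of each canonical tetramer in one sequence window."""
    counts = [0] * len(CANONICAL_TETRAMERS)
    total = 0
    for i in range(len(window) - 3):
        column = _COLUMN.get(canonical(window[i:i + 4]))
        if column is None:  # tetramer contains N or another ambiguity code
            continue
        counts[column] += 1
        total += 1
    return [c / total for c in counts] if total else [0.0] * len(counts)


def tetramer_matrix(scaffold, window=1000, step=100):
    """Build the N x 136 matrix: one row per sliding window along a scaffold."""
    seq = scaffold.upper()
    return [
        window_frequencies(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy scaffold of 1200 bp; real input would be assembled SAG scaffolds.
matrix = tetramer_matrix("ACGT" * 300, window=1000, step=100)
```

The rows of such a matrix could then be passed to any PCA implementation; scaffolds shorter than the window length produce no rows in this sketch.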
SAG annotation

The gene modeling program Prodigal (http://prodigal.ornl.gov/) was run on the draft single cell genomes, using default settings that permit overlapping genes and using ATG, GTG, and TTG as potential starts. The resulting protein translations were compared to the GenBank non-redundant database (NR), the Swiss-Prot/TrEMBL, Pfam, TIGRFam, Interpro, KEGG, and COGs databases using BLASTP or HMMER. From these results, product assignments were made. Initial criteria for automated functional assignment set priority based on TIGRFam, Pfam, COG, Interpro profiles, pairwise BLAST versus Swiss-Prot/TrEMBL, and KO groups. The annotation was imported into the Joint Genome Institute Integrated Microbial Genomes (IMG; http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) (Markowitz et al., 2010).

Phylogenomic tree construction

We compiled two data sets for phylogenomic analyses of marine Thaumarchaeota. The first data set used sequence data from 46 single cell genomes and the single cultured isolate Nitrosopumilus maritimus SCM1 genome. The second set included all 8 published composite Thaumarchaeota genomes in addition to N. maritimus and the 46 single cell genomes used in the first compilation. These composite genomes are Candidatus Nitrosoarchaeum koreensis MY1, Candidatus Nitrosoarchaeum limnia SFB1, Candidatus Cenarchaeum symbiosum A, Candidatus Nitrosoarchaeum limnia BG20, Candidatus Nitrosopumilus salaria BD31, Candidatus Nitrosopumilus koreensis AR1, Candidatus Nitrosopumilus sediminis AR2, and Candidatus Nitrososphaera gargensis Ga9.2. Genome sequences from 2 Crenarchaeota, Pyrobaculum islandicum DSM 4184 and Sulfolobus acidocaldarius DSM 639, were included as outgroups. These two data sets were analyzed separately, because it is not clear how composite genomes may affect the phylogenomic reconstruction. The two data sets were processed in an identical way based on the following procedure.
Orthologous gene families were identified using the OrthoMCL software (Li et al., 2003). Inparalog copies in a gene family were discarded, and gene members assigned to different COGs were also discarded. For the remaining single copy orthologous families, only those found in genomes from at least 25 (in the first data set) or 33 (in the second data set) Thaumarchaeota and 1 outgroup member were retained. This resulted in retention of 97 (in the first data set) or 83 (in the second data set) gene families. Members in each gene family were aligned at the amino acid level using MAFFT (Katoh et al., 2005) and the alignments were trimmed using TrimAl (Capella-Gutiérrez et al., 2009) with the criteria of “-automated1 -resoverlap 0.55 -seqoverlap 60”. Then the trimmed alignments were concatenated, with missing sequences treated as gaps. To account for heterogeneity in the evolutionary processes among different genes, we applied a data partition model during phylogenetic construction using the RAxML v7.3.0 software (Stamatakis, 2006). The PartitionFinder software (Lanfear et al., 2012) grouped the 97 proteins into 16 partitions and grouped the 83 proteins into 14 partitions, respectively, and estimated the best-fit substitution matrix for each partition using a maximum likelihood framework. Gamma distribution of rate variation was also applied in RAxML analysis. Genomes obtained from single cells have many missing genes and taxa with insufficient phylogenetic signal may become rogues that take uncertain positions in a phylogenetic tree. We applied the RogueNaRok software (Aberer et al., 2013) and identified one rogue, SCGC AAA008-M23. Another RAxML phylogenomic tree was constructed with this genome excluded, but the bootstrap support for unresolved branches was only slightly improved compared to the original tree. Therefore, only the original RAxML tree containing sequences from all SAGs is presented. Orthologous protein sequences are available upon request. 
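Two steps of the procedure above, the occupancy filter on gene families and the concatenation of trimmed alignments with missing sequences treated as gaps, can be sketched in a few lines. This is an illustrative sketch only, not the scripts used in the study: the function names and the dictionary-based alignment representation are assumptions.

```python
def passes_occupancy(members, ingroup, outgroup, min_ingroup, min_outgroup=1):
    """Keep a gene family only if enough ingroup and outgroup genomes carry it.

    For example, min_ingroup would be 25 for the first data set and 33 for
    the second, with at least 1 outgroup member in both cases.
    """
    members = set(members)
    return (len(members & set(ingroup)) >= min_ingroup
            and len(members & set(outgroup)) >= min_outgroup)


def concatenate_alignments(alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix row per taxon.

    `alignments` is a list of dicts mapping taxon name to aligned sequence.
    A taxon missing from a gene alignment is padded with gap characters
    spanning that alignment's full width.
    """
    pieces = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("all sequences in an alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            pieces[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in pieces.items()}


# Toy example with hypothetical taxa A, B, C and two short gene alignments.
supermatrix = concatenate_alignments(
    [{"A": "MK-L", "B": "MKAL"}, {"A": "GG", "C": "GA"}],
    taxa=["A", "B", "C"],
)
```

In the toy example, taxon B lacks the second gene and taxon C lacks the first, so both receive gap-only blocks for the missing gene while the column boundaries between genes stay aligned for partitioned analysis.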
Comparative analysis of genome content

All of the predicted amino acid sequences from the 46 SAGs and Nitrosopumilus maritimus SCM1 were clustered into orthologous gene families using the OrthoMCL software (Li et al., 2003). The occurrence rate of each family was then calculated separately for the 4 epipelagic clade SAGs and the 42 mesopelagic clade SAGs. The most ecologically relevant gene families with a higher occurrence rate in one clade than in the other were identified and are listed in Table S3.

Analysis of photolyase and catalase

Inferred amino acid sequences closely related to homologs of photolyase and catalase were identified in the Global Ocean Survey (GOS) metagenomic database using a three-step procedure. First, the GOS DNA read sequences were translated to amino acid sequences using all 6 reading frames. Peptide fragments with at least 60 amino acids were retained. Next, the photolyase and catalase amino acid sequences identified in the SAGs were used as query sequences to search against GOS using the BLASTp program. The criteria to retain GOS hits for further analyses were similarity scores ≥60, alignment lengths ≥100, and bit scores ≥100 for photolyase, and similarity scores ≥75, alignment lengths ≥310, and bit scores ≥500 for catalase. These parameter values were chosen based on preliminary phylogenetic analyses showing that sequences recovered using more relaxed criteria were not related to Thaumarchaeota. Finally, the GOS DNA sequences corresponding to the retained photolyase and catalase peptide fragments were extracted and searched against the NCBI non-redundant database using the BLASTx program to confirm that these GOS reads encoded photolyase or catalase.

The photolyase and catalase sequences we retrieved were analyzed phylogenetically using an identical procedure. Since the homologous sequences are very divergent, 7 alignment methods were used and compared to better account for alignment uncertainty.
These methods include ClustalW (Larkin et al., 2007), MAFFT (Katoh et al., 2005), MUSCLE (Edgar, 2004), T-coffee (Notredame et al., 2000), DIALIGN (Morgenstern, 2004), Kalign (Lassmann and Sonnhammer, 2005), and OPAL (Wheeler and Kececioglu, 2007). The qualities of the alignments were compared using the TrimAl software (Capella-Gutiérrez et al., 2009); the best alignment was selected according to the consistency score calculated by TrimAl. Next, the amino acid substitution model was determined using the ProtTest v3 software (Darriba et al., 2011). A phylogenetic tree was constructed using the MrBayes v3.1.2 software (Ronquist and Huelsenbeck, 2003). One cold and three heated Markov chain Monte Carlo (MCMC) chains were run for 1,000,000 generations with trees sampled every 100 generations. Two independent runs of MCMC were performed. The first 25% of each run was discarded as ‘burn-in’. A 50% majority-rule consensus tree was constructed from the post-burn-in trees. The average standard deviation of split frequencies reached <0.01, indicative of convergence.

Supplemental Figure Legends

Figure S1. Maximum likelihood phylogenetic tree of marine Thaumarchaeota 16S rRNA genes. The tree was constructed using the RAxML v7.3.0 software using the GTR substitution model with Gamma distributed rate heterogeneity among sites. Values at the nodes show the number of times the clade defined by that node appeared in the 100 bootstrapped datasets. Bootstrap values below 50 are not shown. The taxa included in the tree are the Thaumarchaeota SAGs, cultures, and a few environmental sequences, with 5 soil and hot spring sequences as outgroups. The epi- and mesopelagic clades are indicated by shading.

Figure S2. Maximum likelihood phylogenomic analysis of 55 Thaumarchaeota genomes. The tree was constructed using the RAxML v7.3.0 software using a concatenated amino acid sequence of 83 genes with 24,061 sites, with a data partition model determined by the PartitionFinder software.
Values at the nodes show the number of times the clade defined by that node appeared in the 100 bootstrapped datasets. Two Crenarchaeota outgroup species are not shown. Details of tree construction can be found in Supplemental Material. The epi- and mesopelagic clades are indicated by shading. Single cell genomes from different water masses/locations/depths are marked with different colors as identified in the legend inset.

Supplemental Table Legends

Table S1. Accession numbers and environmental characteristics of the 46 marine Thaumarchaeota single-cell amplified genomes used in this study.

Table S2. COG annotations of the 97 proteins used for phylogenomic analysis.

Table S3. Examples of gene families distributed in the epi- and mesopelagic clades of marine Thaumarchaeota.

References

Aberer, A.J., Krompass, D., and Stamatakis, A. (2013). Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Syst Biol 62: 162-166.
Adey A, Morrison H, Asan, Xun X, Kitzman J, Turner E et al. (2010). Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biology 11: R119.
Capella-Gutiérrez, S., Silla-Martínez, J.M., and Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972-1973.
Chitsaz H, Yee-Greenbaum JL, Tesler G, Lombardo M-J, Dupont CL, Badger JH et al. (2011). Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol 29: 915–921.
Cleland D, Krader P, McCree C, Tang J, Emerson D (2004). Glycine betaine as a cryoprotectant for prokaryotes. Journal of Microbiological Methods 58: 31-38.
Darriba, D., Taboada, G.L., Doallo, R., and Posada, D. (2011). ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27: 1164-1165.
Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P et al. (2002). Comprehensive human genome amplification using multiple displacement amplification. Proceedings of the National Academy of Sciences of the United States of America 99: 5261-5266.
del Giorgio PA, Bird DF, Prairie YT, Planas D (1996). Flow cytometric determination of bacterial abundance in lake plankton with the green nucleic acid stain SYTO 13. Limnol Oceanogr 41: 783–789.
Drummond AJ, Ashton B, Buxton S, Cheung M, Cooper A, Duran C et al. (2011). Geneious v5.4, Available from http://www.geneious.com/.
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-1797.
Francis, C.A., Roberts, K.J., Beman, J.M., Santoro, A.E., and Oakley, B.B. (2005). Ubiquity and diversity of ammonia-oxidizing Archaea in water columns and sediments of the ocean. Proceedings of the National Academy of Sciences of the USA 102(41): 14683-14688.
Katoh, K., Kuma, K.-i., Toh, H., and Miyata, T. (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33: 511-518.
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G et al. (2012). Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30: 693–700.
Lane, D.J. (1991). 16S/23S rRNA sequencing. In E. Stackebrandt and M. Goodfellow (ed.), Nucleic acid techniques in bacterial systematics. John Wiley, Chichester, UK.
Lanfear, R., Calcott, B., Ho, S.Y.W., and Guindon, S. (2012). PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol Biol Evol 29: 1695-1701.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H. et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-2948.
Lassmann, T., and Sonnhammer, E. (2005). Kalign - an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6: 298.
Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25: 1754–1760.
Li, L., Stoeckert, C.J., and Roos, D.S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178-2189.
Markowitz VM, Chen I-MA, Palaniappan K, Chu K, Szeto E, Grechkin Y et al. (2010). The integrated microbial genomes system: an expanding comparative analysis resource. Nucleic Acids Res 38: D382–D390.
Martinez-Garcia M, Brazel DM, Swan BK, Arnosti C, Chain PSG, Reitenga KG et al. (2012). Capturing single cell genomes of active polysaccharide degraders: An unexpected contribution of Verrucomicrobia. PLoS ONE 7: e35314.
Morgenstern, B. (2004). DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res 32: W33-W36.
Notredame, C., Higgins, D., and Heringa, J. (2000). T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
Raghunathan A, Ferguson HR, Jr., Bornarth CJ, Song W, Driscoll M, Lasken RS (2005). Genomic DNA amplification from a single bacterium. Applied and Environmental Microbiology 71: 3342-3347.
Ronquist, F., and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572-1574.
Schmieder R, Lim YW, Rohwer F, Edwards R (2010). TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 11: 341.
Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.
Stepanauskas R, Sieracki ME (2007). Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proceedings of the National Academy of Sciences 104: 9052-9057.
Swan BK, Martinez-Garcia M, Preston CM, Sczyrba A, Woyke T, Lamy D et al. (2011). Potential for chemolithoautotrophy among ubiquitous bacteria lineages in the dark ocean. Science 333: 1296–1300.
Wheeler, T.J., and Kececioglu, J.D. (2007). Multiple alignment by aligning alignments. Bioinformatics 23: i559-i568.
Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, Clingenpeel S et al. (2011). Decontamination of MDA reagents for single cell whole genome amplification. PLoS ONE 6: e26161.
Wuchter, C., Abbas, B., Coolen, M.J.L., Herfort, L., van Bleijswijk, J., Timmers, P., Strous, M., Teira, E., Herndl, G.J., Middelburg, J.J., Schouten, S., and Sinninghe Damste, J.S. (2006). Archaeal nitrification in the ocean. Proceedings of the National Academy of Sciences of the USA 103(33): 12317-12322.
Zhang T, Fang H (2006). Applications of real-time polymerase chain reaction for quantification of microorganisms in environmental samples. Applied Microbiology and Biotechnology 70: 281-289.