Supplemental Material:
Single Cell Genomes of Marine Thaumarchaeota Reveal Insights into Population
Differentiation by Depth
Haiwei Luo, Bradley B. Tolar, Brandon K. Swan, Chuanlun L. Zhang, Ramunas Stepanauskas,
Mary Ann Moran, James T. Hollibaugh
Supplemental Methods
Single cell sample collection and construction of single amplified genome (SAG) libraries
Water samples for single cell analyses were collected and replicate, 1 mL subsamples
were cryopreserved with 6% glycine betaine (Sigma) and stored at –80 ºC (Cleland et al., 2004).
Prior to cell sorting, samples with prokaryote cell abundances above 5×10⁵ mL⁻¹ were diluted
10x with filter-sterilized field samples and screened through a 70 µm mesh-size cell strainer
(BD). For heterotrophic prokaryote detection, diluted subsamples (1-3 mL) were incubated for
10-120 min with SYTO-9 DNA stain (5 µM; Invitrogen). Cell sorting was performed with a
MoFlo™ (Beckman Coulter) flow cytometer using a 488 nm argon laser for excitation, a 70 µm
nozzle orifice and a CyClone™ robotic arm for droplet deposition into microplates. The
cytometer was triggered on side scatter. The “single 1 drop” mode was used for maximal sort
purity. Prokaryote cells were separated from eukaryotes, viruses, and detritus based on SYTO-9
fluorescence (proxy to nucleic acid content) and light side scatter (proxy to particle size) (del
Giorgio et al., 1996). Synechococcus cells were excluded, based on their autofluorescence
signal. Target cells were deposited into 384-well plates containing 600 nL per well of either a)
1x TE buffer or b) prepGEM™ Bacteria (Zygem) reaction mix and stored at –80 ºC until further
processing. Of the 384 wells, 315 were dedicated for single cells, 66 were used as negative
controls (no droplet deposited) and 3 received 10 cells each (positive controls).
The accuracy of droplet deposition was determined by depositing 10 µm fluorescent
beads into 384-well plates and microscopically verifying the
presence of beads in the plate wells. Of the 2-3 plates examined each sort day, with one bead
deposited per well, fewer than 2% of wells were found to contain no bead and 0.4% to contain
more than one bead. The latter is most likely caused by co-deposition of two beads attached to
each other, which at certain orientation may have similar optical properties to single beads.
Cells sorted into TE buffer were lysed and their DNA was denatured using cold
KOH (Raghunathan et al., 2005). Genomic DNA from the lysed cells was amplified using
multiple displacement amplification (MDA) (Dean et al., 2002; Raghunathan et al., 2005) in 10
µL final volume. The MDA reactions contained 2 U/µL Repliphi polymerase (Epicentre), 1x
reaction buffer (Epicentre), 0.4 mM each dNTP (Epicentre), 2 mM DTT (Epicentre), 50 mM
phosphorylated random hexamers (IDT) and 1 µM SYTO-9 (Invitrogen) (all final
concentration). The MDA reactions were run at 30 °C for 12-16 h, then inactivated by a 15 min
incubation at 65 °C. Amplified genomic DNA was stored at -80 °C until further processing. We
refer to the MDA products originating from individual cells as single amplified genomes
(SAGs).
Prior to cell sorting, the instrument and the workspace were decontaminated for DNA as
previously described (Stepanauskas and Sieracki, 2007). High molecular weight DNA
contaminants were removed from all MDA reagents by a UV treatment in Stratalinker
(Stratagene) (Woyke et al., 2011). During UV treatment, reagents were placed on ice to avoid
overheating. An empirical optimization of the UV exposure was performed to remove all
detectable contaminants without inactivating the reaction. Cell sorting and MDA setup were
performed in a HEPA-filtered environment. As a quality control, the kinetics of all MDA
reactions was monitored by measuring the SYTO-9 fluorescence using either LightCycler 480
(Roche) or FLUOstar Omega (BMG). The critical point (Cp) was determined for each MDA
reaction as the time required to produce half of the maximal fluorescence. The Cp is inversely
correlated to the amount of DNA template (Zhang and Fang, 2006).
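The Cp definition above can be illustrated with a minimal sketch. The function name `critical_point` and the linear interpolation between the two readings that bracket the half-maximal signal are assumptions made for illustration; the instrument software may determine this value differently.

```python
def critical_point(times_h, fluorescence):
    """Return the time at which fluorescence first reaches half its maximum.

    Linear interpolation between the bracketing readings is an assumption of
    this sketch; the text only defines Cp as the time to half-maximal signal.
    """
    if not times_h or len(times_h) != len(fluorescence):
        raise ValueError("times and fluorescence must be non-empty and equal length")
    half_max = max(fluorescence) / 2.0
    for i, value in enumerate(fluorescence):
        if value >= half_max:
            if i == 0:
                return times_h[0]
            t0, t1 = times_h[i - 1], times_h[i]
            f0, f1 = fluorescence[i - 1], value
            # f0 < half_max <= f1 here, so the denominator is positive.
            return t0 + (half_max - f0) * (t1 - t0) / (f1 - f0)
    raise ValueError("half-maximal fluorescence was never reached")


# Hypothetical readings (hours, arbitrary fluorescence units).
cp = critical_point([0, 1, 2, 3, 4], [0, 0, 2, 8, 10])
```

A reaction with more starting template would reach half-maximal fluorescence sooner and therefore yield a smaller Cp.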
PCR screening of SAG libraries
MDA products were diluted 50-fold in TE buffer and 500 nL aliquots of diluted MDA
product served as the template DNA in 5 µL final volume real-time PCR screens. All PCR
reactions were performed using LightCycler 480 SYBR Green I Master Mix (Roche) and the
Roche LightCycler® 480 II real-time thermal cycler. PCR amplification of Archaeal SSU rRNA
from SAGs was done using primers Arch_344F (ACG GGG YGC AGC AGG CGC GA) and
Arch_915R (GTG CTC CCC CGC CAA TTC CT) (Lane, 1991). Forward (5´–
GTAAAACGACGGCCAGT–3´) and reverse (5´–CAGGAAACAGCTATGACC–3´) M13
sequencing primers were added to the 5´ ends of each target primer pair to aid direct sequencing
of PCR products. All PCR reactions were run for 40 cycles at the appropriate annealing
temperature, followed by melting curve analysis performed as follows: 95°C for 5 s, 52°C for 1
min, and a continuous temperature ramp (0.11°C/s) from 52 to 97°C. Real-time PCR kinetics and
amplicon melting curves served as proxies for detecting SAGs positive for target genes. New,
20 µL PCR reactions were set up for all PCR-positive SAGs and amplicons were sequenced
from both ends using Sanger technology by Beckman Coulter Genomics.
Single cell sorting, whole genome amplification, real-time PCR screens and PCR product
sequence analyses were performed at the Bigelow Laboratory Single Cell Genomics Center
following protocols described on their web site (www.bigelow.org/scgc). Antarctic SAGs were
also screened at the University of Georgia for the presence of Archaeal amoA genes using
primers and qPCR conditions described in Francis et al. (2005) and Wuchter et al. (2006). PCR
products were sequenced at the Georgia Genomics Facility to verify amplification of the target
gene.
SAG sequencing and analysis
A total of 46 Thaumarchaeota SAGs were chosen for whole genome sequencing based on
multiple displacement amplification (MDA) kinetics, presence of metabolic genes from PCR
screening and geographic location of the sampling site. Three approaches were used for
sequencing marine Thaumarchaeota SAGs: 1) A combination of Illumina and 454 shotgun
sequencing (AAA007-O23), or Illumina only (AB-661-I02, AB-661-L21, AB-661-M19,
AB-663-F14, AB-663-G14, AB-663-N18, AB-663-O07, AB-663-P07, AAA160-J20,
AAA001-A19), as described in Swan et al. (2011); 2) a combination of Illumina and PacBio long read
sequence data (AAA007-N19, AAA288-I14, and AAA288-J14) as described in Martinez-Garcia
et al. (2012) and assembled using Velvet-SC (Chitsaz et al., 2011) and PBcR (Koren et al.,
2012); and 3) 454 shotgun sequencing of Nextera-prepared libraries followed by dual assembly
with Newbler v2.4 and Geneious Pro v5.5.6 (Drummond et al., 2011) (all remaining SAGs; total
of 32). For each of these 32 SAGs, raw 454 sequences were trimmed in Geneious Pro v5.5.6 and
any remaining transposons were removed using TagCleaner v0.11 (Schmieder et al., 2010).
Sequences were then assembled separately in Newbler v2.4 (Roche) using default settings and
Geneious using the high-sensitivity setting. The Newbler-assembled sequences were imported
into Geneious and co-assembled with both the Geneious-assembled contigs and the unused
reads. The dual assembled contigs and all other contigs longer than 300 bp were pooled and
annotated. Nextera-prepared sequencing libraries were generated using the Roche
Titanium-Compatible kit with MDA product as the input DNA, following the manufacturer’s instructions
(Adey et al., 2010). A total of 32 Nextera sequencing libraries constructed from SAGs were
barcoded and sequenced (454 FLX Titanium chemistry) on 1/2 microtiter plate. Whole-genome
sequence data for all Thaumarchaeota SAGs are available in IMG under accession numbers
listed in Supplementary Table S1.
SAG whole genome sequence quality control
Each raw sequence data set was screened against all finished bacterial and archaeal
genome sequences (downloaded from NCBI) and the human genome to identify potential
contamination in the sample. Reads were mapped against reference genomes with bwa version
0.5.9 (Li and Durbin, 2009) using default parameters (96% identity threshold). None of the
libraries showed significant contamination. Additionally, gene sequences of the final assemblies
(see below) were compared against the GenBank nr database by BLASTX and taxonomically
classified using MEGAN (57).
To further verify the absence of contaminating sequences in the assemblies, tetramer
frequencies were extracted from all scaffolds using two alternative settings: 1) sliding window of
1000 bp and 100 bp step size and 2) sliding window of 5000 bp and 500 bp step size.
Reverse-complementary tetramers were combined and the frequencies represented as an N×136 feature
matrix, where N is the number of windows and each column of the matrix corresponds to the
frequency of one of the 136 possible tetramers. Principal component analysis (PCA) was then
used to extract the most important components of this high dimensional feature matrix. The
analysis produced a unimodal distribution along the first four PCs for the majority of SAGs,
suggesting homogeneous DNA sources. Scaffolds representing extremes on the first four PCs
were identified and manually examined for their closest TBLASTX hits against the NCBI nt
database.
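The construction of the N×136 feature matrix can be sketched as follows. This is an illustration only, not the pipeline used in the study: the function and variable names are hypothetical, tetramers containing ambiguous bases are skipped (an assumption), and the PCA step is left to any standard routine.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Collapse a tetramer and its reverse complement onto one representative."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# 256 tetramers collapse to 136 classes: 16 palindromes plus 240 / 2 pairs.
CANONICAL_TETRAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_TETRAMERS)}


def tetramer_matrix(sequence, window=1000, step=100):
    """Return one row of 136 tetramer frequencies per sliding window."""
    sequence = sequence.upper()
    rows = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        counts = [0] * len(CANONICAL_TETRAMERS)
        for i in range(len(chunk) - 3):
            idx = INDEX.get(canonical(chunk[i:i + 4]))
            if idx is not None:  # skip tetramers with ambiguous bases such as N
                counts[idx] += 1
        total = sum(counts)
        rows.append([c / total if total else 0.0 for c in counts])
    return rows


# Toy scaffold; the second setting in the text would use window=5000, step=500.
matrix = tetramer_matrix("ACGT" * 300, window=1000, step=100)
```

Each row of `matrix` corresponds to one window, so windows from a contaminating source would be expected to separate from the bulk along the leading principal components.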
SAG annotation
The gene modeling program Prodigal (http://prodigal.ornl.gov/) was run on the draft
single cell genomes, using default settings that permit overlapping genes and using ATG, GTG,
and TTG as potential starts. The resulting protein translations were compared to the GenBank
non-redundant database (NR), the Swiss-Prot/TrEMBL, Pfam, TIGRFam, Interpro, KEGG, and
COGs databases using BLASTP or HMMER. From these results, product assignments were
made. Initial criteria for automated functional assignment set priority based on TIGRFam, Pfam,
COG, Interpro profiles, pairwise BLAST versus Swiss-Prot/TrEMBL, and KO groups. The
annotation was imported into the Joint Genome Institute Integrated Microbial Genomes (IMG;
http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) (Markowitz et al., 2010).
Phylogenomic tree construction
We compiled two data sets for phylogenomic analyses of marine Thaumarchaeota. The
first data set used sequence data from 46 single cell genomes and the single cultured isolate
Nitrosopumilus maritimus SCM1 genome. The second set included all 8 published composite
Thaumarchaeota genomes in addition to N. maritimus and the 46 single cell genomes used in the
first compilation. These composite genomes are Candidatus Nitrosoarchaeum koreensis MY1,
Candidatus Nitrosoarchaeum limnia SFB1, Candidatus Cenarchaeum symbiosum A, Candidatus
Nitrosoarchaeum limnia BG20, Candidatus Nitrosopumilus salaria BD31, Candidatus
Nitrosopumilus koreensis AR1, Candidatus Nitrosopumilus sediminis AR2, and Candidatus
Nitrososphaera gargensis Ga9.2. Genome sequences from 2 Crenarchaeota, Pyrobaculum
islandicum DSM 4184 and Sulfolobus acidocaldarius DSM 639, were included as outgroups.
These two data sets were analyzed separately, because it is not clear how composite genomes
may affect the phylogenomic reconstruction.
The two data sets were processed in an identical way based on the following procedure.
Orthologous gene families were identified using the OrthoMCL software (Li et al., 2003).
Inparalog copies in a gene family were discarded, and gene members assigned to different COGs
were also discarded. For the remaining single copy orthologous families, only those found in
genomes from at least 25 (in the first data set) or 33 (in the second data set) Thaumarchaeota and
1 outgroup member were retained. This resulted in retention of 97 (in the first data set) or 83 (in
the second data set) gene families. Members in each gene family were aligned at the amino acid
level using MAFFT (Katoh et al., 2005) and the alignments were trimmed using TrimAl
(Capella-Gutiérrez et al., 2009) with the criteria of “-automated1 -resoverlap 0.55 -seqoverlap
60”. Then the trimmed alignments were concatenated, with missing sequences treated as gaps.
To account for heterogeneity in the evolutionary processes among different genes, we applied a
data partition model during phylogenetic construction using the RAxML v7.3.0 software
(Stamatakis, 2006). The PartitionFinder software (Lanfear et al., 2012) grouped the 97 proteins
into 16 partitions and grouped the 83 proteins into 14 partitions, respectively, and estimated the
best-fit substitution matrix for each partition using a maximum likelihood framework. Gamma
distribution of rate variation was also applied in RAxML analysis. Genomes obtained from
single cells have many missing genes and taxa with insufficient phylogenetic signal may become
rogues that take uncertain positions in a phylogenetic tree. We applied the RogueNaRok software
(Aberer et al., 2013) and identified one rogue, SCGC AAA008-M23. Another RAxML
phylogenomic tree was constructed with this genome excluded, but the bootstrap support for
unresolved branches was only slightly improved compared to the original tree. Therefore, only
the original RAxML tree containing sequences from all SAGs is presented. Orthologous protein
sequences are available upon request.
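The occupancy filter and the concatenation step, in which a genome lacking a gene contributes only gap characters for that gene, can be sketched as below. The function names and the toy alignments are hypothetical and serve only to illustrate the logic; the partition coordinates are 1-based ranges of the kind commonly supplied to partitioned analyses.

```python
def passes_occupancy(family_taxa, ingroup, outgroup, min_ingroup, min_outgroup=1):
    """Keep a gene family only if enough ingroup and outgroup genomes carry it."""
    present = set(family_taxa)
    return (len(present & ingroup) >= min_ingroup
            and len(present & outgroup) >= min_outgroup)


def concatenate_alignments(alignments, taxa):
    """Concatenate per-gene alignments, filling missing taxa with gaps.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    Returns the supermatrix and the 1-based (start, end) range of each gene.
    """
    parts = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # A genome without this gene is represented by gaps only.
            parts[taxon].append(alignment.get(taxon, "-" * length))
        partitions.append((position + 1, position + length))
        position += length
    return {taxon: "".join(p) for taxon, p in parts.items()}, partitions


# Toy example with three hypothetical genomes and two trimmed alignments.
genes = [{"A": "MK-L", "B": "MKIL"}, {"A": "GG", "C": "GA"}]
supermatrix, partitions = concatenate_alignments(genes, ["A", "B", "C"])
```

With the thresholds given in the text, `min_ingroup` would be 25 for the first data set and 33 for the second.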
Comparative analysis of genome content
All of the predicted amino acid sequences from the 46 SAGs and Nitrosopumilus
maritimus SCM1 were clustered into orthologous gene families using the OrthoMCL software
(Li et al., 2003). The occurrence rate of each family was then calculated separately for the
4 epipelagic clade SAGs and the 42 mesopelagic clade SAGs. Ecologically relevant gene
families that had a higher occurrence rate in one clade than in the other were
identified; the most interesting examples are listed in Table S3.
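As an illustration of the occurrence-rate comparison, the sketch below computes, for each gene family, the fraction of genomes in each clade that contain at least one member. The function name, family names and genome labels are hypothetical.

```python
def occurrence_rates(family_to_genomes, clade_a, clade_b):
    """Return {family: (rate in clade_a, rate in clade_b)}.

    A family counts once per genome regardless of copy number.
    """
    rates = {}
    for family, genomes in family_to_genomes.items():
        present = set(genomes)
        rates[family] = (len(present & clade_a) / len(clade_a),
                         len(present & clade_b) / len(clade_b))
    return rates


# Toy data: two epipelagic and four mesopelagic genomes (hypothetical labels).
epipelagic = {"E1", "E2"}
mesopelagic = {"M1", "M2", "M3", "M4"}
families = {"fam_photolyase": ["E1", "E2", "M1"], "fam_other": ["M1", "M2", "M3"]}
rates = occurrence_rates(families, epipelagic, mesopelagic)
```

Because the epipelagic clade contains only 4 SAGs, rates for that clade can take only a few distinct values, so differences between clades are best read as indicative.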
Analysis of photolyase and catalase
Inferred amino acid sequences closely related to homologs of photolyase and catalase
were identified in the Global Ocean Survey (GOS) metagenomic database using a three-step
procedure. Firstly, the GOS DNA read sequences were translated to amino acid sequences using
all 6 reading frames. Peptide fragments with at least 60 amino acids were retained. Next, the
photolyase and catalase amino acid sequences identified in the SAGs were used as query
sequences to search against GOS using the BLASTp program. The criteria to retain GOS hits for
further analyses were similarity scores ≥60, alignment lengths ≥100, and bit scores ≥100 for
photolyase, and similarity scores ≥75, alignment lengths ≥310, and bit scores ≥500 for
catalase. These parameter values were estimated based on preliminary phylogenetic analyses
showing that sequences recovered using more relaxed criteria were not related to
Thaumarchaeota. Finally, GOS DNA sequences identified as photolyase and catalase peptide
fragments were extracted and searched against the NCBI non-redundant database using the
BLASTx program to confirm that these GOS reads encoded photolyase or catalase.
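The first step of this procedure, six-frame translation with retention of peptide fragments of at least 60 amino acids, can be sketched as follows. Two assumptions are made for illustration: the standard codon assignments are used, and "fragments" are taken to be the stretches between stop codons. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard codon assignments, built in TCAG order.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; codons with ambiguous bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(read, min_length=60):
    """Return (strand, frame offset, peptide) for fragments >= min_length."""
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]
    peptides = []
    for strand, sequence in (("+", read), ("-", reverse)):
        for offset in range(3):
            protein = translate(sequence[offset:])
            # Assumption: fragments are the stretches between stop codons.
            for fragment in protein.split("*"):
                if len(fragment) >= min_length:
                    peptides.append((strand, offset, fragment))
    return peptides


# Toy read: a start codon, 60 alanine codons and a stop codon.
peptides = six_frame_peptides("ATG" + "GCT" * 60 + "TAA")
```

The retained peptides would then form the database searched with the SAG-derived photolyase and catalase queries.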
Phylogenetic analysis of the photolyase and catalase sequences we retrieved followed an
identical procedure. Since the homologous sequences are very divergent, 7 alignment methods
were used and compared to better account for alignment uncertainty. These methods include
Clustal W (Larkin et al., 2007), MAFFT (Katoh et al., 2005), MUSCLE (Edgar, 2004), T-coffee
(Notredame et al., 2000), DIALIGN (Morgenstern, 2004), Kalign (Lassmann and Sonnhammer,
2005), and OPAL (Wheeler and Kececioglu, 2007). The qualities of the alignments were
compared using the TrimAl software (Capella-Gutiérrez et al., 2009); the best alignment was
selected according to the consistency score calculated by TrimAl. Next, the amino acid
substitution model was determined using the ProtTest v3 software (Darriba et al., 2011). A
phylogenetic tree was constructed using the MrBayes v3.1.2 software (Ronquist and
Huelsenbeck, 2003). One cold and three heated Markov chain Monte Carlo (MCMC) chains
were run for 1,000,000 generations with trees sampled every 100 generations. Two independent
runs of MCMC were performed. The first 25% of all runs were discarded as ‘burn-in’. A 50%
majority-rule consensus tree was constructed from the post-burn-in trees. The average standard
deviation of split frequencies reached <0.01, indicative of convergence.
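The burn-in and consensus steps can be illustrated with a small sketch. It is not the MrBayes implementation: each sampled tree is represented simply as a set of clades, all names are hypothetical, and a clade is retained only if it occurs in strictly more than half of the post-burn-in trees (an assumption about how ties at exactly 50% are handled).

```python
from collections import Counter


def discard_burnin(samples, fraction=0.25):
    """Drop the first `fraction` of the sampled trees."""
    return samples[int(len(samples) * fraction):]


def majority_rule_clades(tree_samples, threshold=0.5):
    """Return {clade: frequency} for clades found in more than `threshold` of trees.

    tree_samples: list of trees, each a set of clades (frozensets of taxon names).
    """
    counts = Counter(clade for tree in tree_samples for clade in tree)
    total = len(tree_samples)
    return {clade: count / total
            for clade, count in counts.items()
            if count / total > threshold}


# Toy sample of 8 trees over four hypothetical taxa.
AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
sampled = [{AC}] * 2 + [{AB, CD}] * 4 + [{AB}, {AC}]
support = majority_rule_clades(discard_burnin(sampled))
```

With the settings in the text, 1,000,000 generations sampled every 100 generations give about 10,000 trees per run, of which roughly 7,500 remain after the 25% burn-in.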
Supplemental Figure Legends
Figure S1. Maximum likelihood phylogenetic tree of marine Thaumarchaeota 16S rRNA
genes. The tree was constructed using the RAxML v7.3.0 software using the GTR substitution
model with Gamma distributed rate heterogeneity among sites. Values at the nodes show the
number of times the clade defined by that node appeared in the 100 bootstrapped datasets.
Bootstrap values below 50 are not shown. The taxa included in the tree are the Thaumarchaeota
SAGs, cultures, and a few environmental sequences, with 5 soil and hot spring sequences as
outgroups. The epi- and mesopelagic clades are indicated by shading.
Figure S2. Maximum likelihood phylogenomic analysis of 55 Thaumarchaeota genomes.
The tree was constructed using the RAxML v7.3.0 software using a concatenated amino acid
sequence of 83 genes with 24,061 sites, with a data partition model determined by the
PartitionFinder software. Values at the nodes show the number of times the clade defined by that
node appeared in the 100 bootstrapped datasets. Two Crenarchaeota outgroup species are not
shown. Details of tree construction can be found in Supplemental Material. The epi- and
mesopelagic clades are indicated by shading. Single cell genomes from different water
masses/locations/depths are marked with different colors as identified in the legend inset.
Supplemental Table Legends
Table S1. Accession numbers and environmental characteristics of the 46 marine
Thaumarchaeota single-cell amplified genomes used in this study.
Table S2. COG annotations of the 97 proteins used for phylogenomic analysis.
Table S3. Examples of gene families distributed in the epi- and mesopelagic clades of
marine Thaumarchaeota.
References
Aberer, A.J., Krompass, D., and Stamatakis, A. (2013). Pruning rogue taxa
improves phylogenetic accuracy: an efficient algorithm and webservice. Syst Biol 62:
162-166.
Adey A, Morrison H, Asan, Xun X, Kitzman J, Turner E et al. (2010). Rapid,
low-input, low-bias construction of shotgun fragment libraries by high-density in vitro
transposition. Genome Biology 11: R119.
Capella-Gutiérrez, S., Silla-Martínez, J.M., and Gabaldón, T. (2009). trimAl: a
tool for automated alignment trimming in large-scale phylogenetic analyses.
Bioinformatics 25: 1972-1973.
Chitsaz H, Yee-Greenbaum JL, Tesler G, Lombardo M-J, Dupont CL, Badger JH
et al. (2011). Efficient de novo assembly of single-cell bacterial genomes from short-read
data sets. Nat Biotechnol 29: 915–921.
Cleland D, Krader P, McCree C, Tang J, Emerson D (2004). Glycine betaine as a
cryoprotectant for prokaryotes. Journal of Microbiological Methods 58: 31-38.
Darriba, D., Taboada, G.L., Doallo, R., and Posada, D. (2011). ProtTest 3: fast
selection of best-fit models of protein evolution. Bioinformatics 27: 1164-1165.
Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P et al. (2002).
Comprehensive human genome amplification using multiple displacement amplification.
Proceedings of the National Academy of Sciences of the United States of America 99:
5261-5266.
del Giorgio PA, Bird DF, Prairie YT, Planas D (1996). Flow cytometric
determination of bacterial abundance in lake plankton with the green nucleic acid stain
SYTO 13. Limnol Oceanogr 41: 783–789.
Drummond AJ, Ashton B, Buxton S, Cheung M, Cooper A, Duran C et al.
(2011). Geneious v5.4, Available from http://www.geneious.com/.
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy
and high throughput. Nucleic Acids Res 32: 1792-1797.
Francis, C. A., K. J. Roberts, J. M. Beman, A. E. Santoro and B. B. Oakley
(2005). "Ubiquity and diversity of ammonia-oxidizing Archaea in water columns and
sediments of the ocean." Proceedings of the National Academy of Sciences of the USA
102(41): 14683-14688.
Katoh, K., Kuma, K.-i., Toh, H., and Miyata, T. (2005). MAFFT version 5:
improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33: 511-518.
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G et al.
(2012). Hybrid error correction and de novo assembly of single-molecule sequencing
reads. Nat Biotechnol 30: 693–700.
Lane, D. J. 1991. 16S/23S rRNA sequencing. In E. Stackebrandt and M.
Goodfellow (ed.), Nucleic acid techniques in bacterial systematics. John Wiley,
Chichester, UK.
Lanfear, R., Calcott, B., Ho, S.Y.W., and Guindon, S. (2012). PartitionFinder:
combined selection of partitioning schemes and substitution models for phylogenetic
analyses. Mol Biol Evol 29: 1695-1701.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A.,
McWilliam, H. et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23:
2947-2948.
Lassmann, T., and Sonnhammer, E. (2005). Kalign - an accurate and fast multiple
sequence alignment algorithm. BMC Bioinformatics 6: 298.
Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows–
Wheeler transform. Bioinformatics 25: 1754–1760.
Li, L., Stoeckert, C.J., and Roos, D.S. (2003). OrthoMCL: identification of
ortholog groups for eukaryotic genomes. Genome Res 13: 2178-2189.
Markowitz VM, Chen I-MA, Palaniappan K, Chu K, Szeto E, Grechkin Y et al.
(2010). The integrated microbial genomes system: an expanding comparative analysis
resource. Nucleic Acids Res 38: D382–D390.
Martinez-Garcia M, Brazel DM, Swan BK, Arnosti C, Chain PSG, Reitenga KG
et al. (2012). Capturing single cell genomes of active polysaccharide degraders: An
unexpected contribution of Verrucomicrobia. PLoS ONE 7: e35314.
Morgenstern, B. (2004). DIALIGN: multiple DNA and protein sequence
alignment at BiBiServ. Nucleic Acids Res 32: W33-W36.
Notredame, C., Higgins, D., and Heringa, J. (2000). T-coffee: a novel method for
fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.
Raghunathan A, Ferguson HR, Jr., Bornarth CJ, Song W, Driscoll M, Lasken RS
(2005). Genomic DNA amplification from a single bacterium. Applied and
Environmental Microbiology 71: 3342-3347.
Ronquist, F., and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic
inference under mixed models. Bioinformatics 19: 1572-1574.
Schmieder R, Lim YW, Rohwer F, Edwards R (2010). TagCleaner: Identification
and removal of tag sequences from genomic and metagenomic datasets. BMC
Bioinformatics 11: 341.
Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based
phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:
2688-2690.
Stepanauskas R, Sieracki ME (2007). Matching phylogeny and metabolism in the
uncultured marine bacteria, one cell at a time. Proceedings of the National Academy of
Sciences 104: 9052-9057.
Swan BK, Martinez-Garcia M, Preston CM, Sczyrba A, Woyke T, Lamy D et al.
(2011). Potential for chemolithoautotrophy among ubiquitous bacteria lineages in the
dark ocean. Science 333: 1296–1300.
Wheeler, T.J., and Kececioglu, J.D. (2007). Multiple alignment by aligning
alignments. Bioinformatics 23: i559-i568.
Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, Clingenpeel S et al. (2011).
Decontamination of MDA reagents for single cell whole genome amplification. PLoS
ONE 6: e26161.
Wuchter, C., B. Abbas, M. J. L. Coolen, L. Herfort, J. van Bleijswijk, P.
Timmers, M. Strous, E. Teira, G. J. Herndl, J. J. Middelburg, S. Schouten and J. S.
Sinninghe Damste (2006). "Archaeal nitrification in the ocean." Proceedings of the
National Academy of Sciences of the USA 103(33): 12317-12322.
Zhang T, Fang H (2006). Applications of real-time polymerase chain reaction for
quantification of microorganisms in environmental samples. Applied Microbiology and
Biotechnology 70: 281-289.