1. Experimental procedures

Supporting Information
Functional analysis of the archaea, bacteria, and viruses from
a halite endolithic microbial community
Alexander Crits-Christoph, Diego R. Gelsinger, Bing Ma, Jacek Wierzchos, Jacques
Ravel, Alfonso Davila, M. Cristina Casero, and Jocelyne DiRuggiero
1. Experimental procedures
Sampling site characterization
Entire halite nodules with distinct colonization patterns were collected in 2013 in Salar
Grande, a hydrologically inactive salar located in the southwest Tarapacá Region of the
Atacama Desert, sealed in sterile Whirlpacks, and stored dry at room temperature until
analysis (Robinson et al., 2015). Rainfall is extremely rare in this area,
whereas intense fog events, called “camanchaca”, are frequently recorded. Mean
photosynthetically active radiation (PAR) was previously reported at 552.54 µmol s⁻¹ m⁻²,
mean air temperature was 19.6°C, and mean relative humidity (RH) was 53.8% between
April 2010 and 2011 (Robinson et al., 2015).
DNA extraction, sequencing, and analysis
Total genomic DNA was extracted from pooled halite nodules using the PowerSoil DNA
Isolation kit (MoBio Laboratories Inc., Solana Beach, CA) as previously described
(Robinson et al., 2015). Sequencing libraries were prepared using the Nextera XT DNA
sample preparation kit (Illumina, San Diego, CA), with an average insert size of 400 bp,
and sequenced on the Illumina HiSeq2500 platform. For quality control of the paired-end
reads, small subunit (SSU) ribosomal RNA reads were removed by mapping them with
Bowtie (v1) (Langmead et al., 2009) to the SILVA reference database (Pruesse et al., 2007).
Both reads of a pair were discarded if either read mapped to the rRNA database.
Low-quality bases, as determined by base calling (threshold: Phred quality score of 20,
which corresponds to an error probability of 1%), were trimmed from the end of each raw
read, and sequences longer than 75% of the original read length were retained.
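The trimming rule described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names are hypothetical, Phred+33 quality encoding is assumed, and trimming is assumed to remove only trailing bases whose score falls below the threshold.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q (Phred+33 assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, qual, min_q=20, min_fraction=0.75):
    """Return the trimmed read if it is longer than min_fraction of the
    original read length, otherwise None."""
    trimmed_seq, trimmed_qual = trim_3prime(seq, qual, min_q)
    if len(trimmed_seq) > min_fraction * len(seq):
        return trimmed_seq, trimmed_qual
    return None
```

With these assumptions, a Phred score of 20 maps to an error probability of 0.01, and an 8-base read that loses one low-quality trailing base is kept, whereas one that loses four is discarded.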
Assembly and annotation of the algae genome
Contigs in genomic bin 18 that were identified as belonging to eukaryotic green algae using
BLASTP were mapped with PROmer from the MUMmer package (Delcher et al., 2003) to
the closest reference genome, Ostreococcus tauri. Contigs that mapped to haloarchaeal
reference genomes, including those of Halococcus, Halorubrum, and Haloarcula, were
removed. Gene finding was performed on the remaining contigs using the self-training
program GeneMarkS with the sequence type set to “intronless eukaryote” (Besemer et
al., 2001). Individual genes were annotated using BLASTP against the nr database, and
contigs with proteins mapping exclusively to bacteria were removed. This resulted in a
dataset of 89 genes on 73 contigs, all with close known homologs in either green algae or
plant taxa. The isoelectric points of all predicted protein products for this dataset were
calculated using a custom Python script
(https://github.com/alexcritschristoph/MicrobialGenomicsScripts) and compared to the
reference proteomes of Micromonas sp. RCC299, Ostreococcus tauri, and Dunaliella
salina. In genomic bin 9, which had a low mean G+C content, putative eukaryotic organelle
contigs were identified using BLASTP and annotated using DOGMA (Wyman et al., 2004).
Predicted Photosystem II proteins from a chloroplast genomic contig were concatenated
and aligned to homologs from multiple characterized algal species with MUSCLE
(Edgar, 2004). Phylogenetically informative regions were extracted from the alignment
using Gblocks (Castresana, 2000), and a concatenated protein phylogeny was built using
FastTree (Price et al., 2010).
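For readers unfamiliar with isoelectric point calculations, the following Python sketch shows one common way to estimate the pI of a protein sequence. It is not the custom script referenced above, whose implementation is not described here. The pKa values are an illustrative set (published sets differ slightly), and the function names are hypothetical. The net charge is computed with the Henderson-Hasselbalch equation and the pH of zero net charge is found by bisection.

```python
# Illustrative pKa values (an assumption; published sets differ slightly).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(seq, ph):
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # C-terminal carboxyl
    for residue in seq.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Find the pH at which the net charge is zero by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid    # still positively charged: pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0
```

Applying `isoelectric_point` to every predicted protein in a proteome and plotting the resulting values as a histogram gives a distribution of the kind compared in this analysis; a sequence rich in aspartate and glutamate yields a low pI, and one rich in lysine and arginine a high pI.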
Assembly of the Nanohaloarchaea genome
Protein products of four of the largest contigs that binned together were predicted with
Prodigal (Hyatt et al., 2010), and all putative Nanohaloarchaea contigs were isolated using
a BLASTP reference-based approach requiring 25% of the genes on a contig to have their
closest matches to reference Nanohaloarchaeal genes. Each of the identified contigs was then
reassembled by mapping reads with Bowtie2 (Langmead and Salzberg, 2012), and the contigs
were reassembled together using SOAPdenovo2 (Luo et al., 2012). This assembly produced 4
larger contigs with a read coverage of around 20× and highly similar G+C content (~46.4%).
These contigs overlapped each other by 100+ base pairs of near 100% identity in
contiguous genes; the overlaps were removed and the contigs concatenated, resulting in a
single 1.1 Mbp contig. The start and end of the concatenated contig were contained
within the 16S rRNA gene, indicating that only ~1 kbp was missing from the assembly.
The 16S rRNA gene was reassembled from the metagenome using EMIRGE (Miller et
al., 2011), found to share >95% identity overlaps with both ends of the large contig, and
added to the completed assembly.
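The overlap removal and concatenation step can be sketched in Python as follows. This is a simplified illustration rather than the procedure actually used: it assumes the contigs are already ordered and in the same orientation, and it requires exact suffix-prefix matches, whereas the overlaps reported above were of near 100% identity and would in practice call for an alignment that tolerates mismatches. The function names are hypothetical.

```python
def longest_overlap(left, right, min_overlap=100):
    """Length of the longest suffix of `left` that exactly equals a prefix
    of `right`; 0 if no overlap of at least min_overlap bases exists."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_contigs(contigs, min_overlap=100):
    """Concatenate ordered contigs, keeping each overlapping region once."""
    merged = contigs[0]
    for nxt in contigs[1:]:
        overlap = longest_overlap(merged, nxt, min_overlap)
        if overlap == 0:
            raise ValueError("no sufficient overlap between adjacent contigs")
        merged += nxt[overlap:]   # drop the duplicated overlap from the next contig
    return merged
```

For example, with a toy threshold of 3 bases, `merge_contigs(["AAACCCGGG", "CCGGGTTT", "GTTTAA"], min_overlap=3)` joins the three fragments into a single sequence in which each shared region appears only once.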
2. Supporting tables and figures
Table S1: List of putative viral genomes
Table S2: Predicted functions for algal genes
Figure S1: Major taxonomic groups using 16S rRNA sequences
Figure S2: Halite functional analysis summary
Figure S3: Halite functional analysis: carbon metabolism
Figure S4: Taxonomic distribution of RubisCO
Figure S5: Taxonomic distribution of major proteins for PS I and II
Figure S6: Taxonomic distribution for light harvesting complexes
Figure S7: Halite functional analysis: phototrophy
Figure S8: Halite functional analysis: nitrogen metabolism
Figure S9: Phylogenetic tree for PS II proteins
Figure S10: Isoelectric point distributions
Table S1: List of putative viral genomes
Contig  Size   GC%    Genome Structure  Putative Host                VirSorter Category
32      70.0   51.5   Linear            Halobacteria                 cat2
38      64.0   54.9   Circular          Halobacteria                 cat2
68      52.7   50.3   Linear            Halobacteria                 cat2
82      47.1   63.3   Circular          Halobacteria                 cat2
86      46.2   63.8   Linear            Halobacteria                 cat2
92      44.5   63.9   Circular          Halobacteria                 cat2
104     41.5   57.2   Linear            Halobacteria                 cat2
127     36.8   58.2   Linear            Halobacteria                 cat3
135     35.1   63.9   Linear            Halobacteria                 cat2
139     34.1   43.3   Circular          Nanohaloarchaea              cat2
146     33.0   57.4   Linear            Halobacteria                 cat2
155     32.3   60.4   Circular          Halobiforma nitratireducens  cat3
161     31.6   60.1   Linear            Halobacteria                 cat2
186     29.1   58.8   Linear            Halobacteria                 cat2
192     28.1   63.4   Linear            Halobacteria                 cat2
216     26.4   65     Circular          Halobacteria                 cat3
232     24.7   63.9   Linear            Halobacteria                 cat2
238     24.0   47.2   Linear            Halothece                    cat2
257     23.3   63.2   Circular          Halobacteria                 cat3
279     22.2   65.9   Linear            Halobacteria                 cat2
299     21.8   63.8   Linear            Halobacteria                 cat2
313     21.4   61.1   Linear            Halobacteria                 cat2
322     21.0   45.2   Linear            Halothece                    n/a
354     20.1   52.3   Linear            Halobacteria                 cat2
403     18.8   59.6   Circular          Halobacteria                 cat3
523     16.5   65.6   Circular          Halobacteria                 cat3
526     16.5   61.5   Circular          Halobacteria                 cat3
557     15.9   60.6   Circular          Halobacteria                 cat2
589     15.9   61.6   Circular          Halobacteria                 cat2
627     15.1   55.1   Circular          Halobacteria                 n/a
687     14.4   61.4   Circular          Halobacteria                 cat2
929     12.3   62.5   Circular          Halobacteria                 cat2
934     12.3   57.6   Circular          Halobacteria                 cat3
966     12.0   63.9   Circular          Halobacteria                 cat3
1987    8.4    46.1   Linear            Halobacteria                 n/a
cat1: most confident; cat2: likely prediction; cat3: possible prediction (Roux et al., 2015)
Table S2: Predicted functions for algal genes
Algae Contig #  Algae Gene #  Gene Length  Predicted Function (from Web BLAST)
7838            1             749          protein farnesyltransferase subunit beta
7838            2             458          Unknown
7838            3             974          chloroplast envelope protein translocase family [Micromonas sp. RCC299]
8477            4             485          6-phosphogluconate dehydrogenase (ISS) [Ostreococcus tauri]
8477            5             563          PREDICTED: ras-related protein Rab7-like [Nelumbo nucifera]
8477            6             491          Unknown
8477            7             374          hypothetical protein EMIHUDRAFT_97090 [Emiliania huxleyi CCMP1516]
12175           8             2876         alanine--tRNA ligase [Eucalyptus grandis]
12259           9             2213         SNF2 super family [Micromonas sp. RCC299]
12259           10            317          PREDICTED: ATP-dependent helicase BRM-like [Sesamum indicum]
12846           11            1403         hypothetical protein CHLREDRAFT_139213 [Chlamydomonas reinhardtii]
12846           12            368          Isoleucine-tRNA synthetase [Ostreococcus lucimarinus CCE9901]
12846           13            569          ribose-phosphate pyrophosphokinase [Bathycoccus prasinos]
13069           14            506          threonyl-tRNA synthetase [filamentous cyanobacterium ESFC-1] [OK]
13069           15            1073         putative threonyl-tRNA synthetase (ISS) [Ostreococcus tauri]
13134           16            2678         PREDICTED: inositol hexakisphosphate and diphosphoinositol-pentakisphosphate kinase 2-like isoform X2 [Sesamum indicum]
14068           17            956          PREDICTED: DNA replication licensing factor MCM2 [Jatropha curcas]
14068           18            1295         MCM-domain-containing protein [Coccomyxa subellipsoidea C-169]
14665           19            1169         FAD-dependent oxidoreductase family protein (ISS) [Ostreococcus tauri]
14665           20            305          Eukaryotic translation initiation factor 1A
15192           21            368          hypothetical protein [Halothece sp. PCC 7418] [OK]
15192           22            458          predicted protein [Ostreococcus lucimarinus CCE9901]
15192           23            1061         phosphoglycolate phosphatase precursor (ISS) [Ostreococcus tauri]
15192           24            422          GPN-loop GTPase 2 [Morus notabilis]
15504           25            1229         bifunctional dihydrofolate reductase-thymidylate synthase 2 / DHFR-TS (ISS) [Ostreococcus tauri]
15504           26            971          predicted protein [Ostreococcus lucimarinus CCE9901]
16823           27            974          Lipoyl synthase [Coccomyxa subellipsoidea C-169]
16823           28            920          asparagine synthase [Micromonas sp. RCC299]
16847           29            1931         ATPase type 13A (ISS) [Ostreococcus tauri]
17149           30            419          PREDICTED: 40S ribosomal protein S3-1 [Nicotiana sylvestris]
17149           31            737          putative plant SNARE 12 [Auxenochlorella protothecoides]
17149           32            329          Unknown
18046           33            1838         putative chloroplast 1-hydroxy-2-methyl-2-(E)-butenyl-4-diphosphate synthase precursor [Coccomyxa subellipsoidea C-169]
18500           34            1652         cell division protein (ISS) [Ostreococcus tauri]
18905           35            1190         U5 snRNP spliceosome subunit (ISS) [Ostreococcus tauri]
20779           36            1499         PREDICTED: nucleolar GTP-binding protein 2-like [Oryza brachyantha]
20779           37            359          Unknown
21304           38            1946         DNA-directed RNA polymerase (ISS) [Ostreococcus tauri]
21616           39            1442         2-oxoglutarate dehydrogenase E2 subunit-like protein [Ostreococcus lucimarinus CCE9901]
21917           40            1349         glutamate-1-semialdehyde 2
22322           41            818          Mismatch repair ATPase MSH4 (MutS family) (ISS) [Ostreococcus tauri]
22971           42            665          AAA+-type ATPase (ISS) [Ostreococcus tauri]
22971           43            458          Unknown
23040           44            1334         UDP-galactopyranose mutase (ISS) [Ostreococcus tauri]
23221           45            1676         mitochondrial elongation factor [Micromonas sp. RCC299]
23680           46            902          predicted protein [Micromonas sp. RCC299]
23967           47            1847         RNA polymerase I
24870           48            1571         phosphofructokinase [Micromonas pusilla CCMP1545]
25659           49            848          14-3-3 protein [Coccomyxa subellipsoidea C-169]
26146           50            1352         PREDICTED: heat shock protein 83 [Populus euphratica]
26718           51            1712         acetyl-CoA carboxylase (ISS) [Ostreococcus tauri]
27451           52            587          PREDICTED: 60S ribosomal protein L30-like [Glycine max]
27865           53            845          predicted protein [Micromonas sp. RCC299]
28066           54            1112         vacuolar-type H+-pyrophosphatase (ISS) [Ostreococcus tauri]
28253           55            1643         DEAD/DEAH box helicase [Micromonas sp. RCC299]
28994           56            326          predicted protein [Ostreococcus lucimarinus CCE9901]
30012           57            485          predicted protein [Micromonas sp. RCC299]
31934           58            1502         PREDICTED: putative pre-mRNA-splicing factor ATP-dependent RNA helicase DHX16 isoform X3 [Vitis vinifera]
32083           59            1436         1-deoxy-D-xylulose-5-phosphate synthase plastid precursor [Ostreococcus lucimarinus CCE9901]
32276           60            1397         PREDICTED: probable pre-mRNA-splicing factor ATP-dependent RNA helicase [Amborella trichopoda]
32478           61            1394         pyruvate dehydrogenase E1 component beta subunit
33817           62            380          NHP2-like protein 1 [Morus notabilis]
35119           63            1244         PREDICTED: monodehydroascorbate reductase
35476           64            980          WD40 repeat-like protein [Coccomyxa subellipsoidea C-169]
35676           65            449          PREDICTED: 40S ribosomal protein S5-like [Sesamum indicum]
37046           66            929          Pre-mRNA-processing-splicing factor isoform 3 [Theobroma cacao]
37835           67            1328         imidazole glycerol phosphate synthase hisHF [Metarhizium acridum CQMa 102]
38761           68            800          RecQL4 DNA/RNA helicase [Guillardia theta CCMP2712]
39163           69            1292         P-ATPase family transporter: cadmium ion [Ostreococcus lucimarinus CCE9901]
39738           70            1280         Eukaryotic translation initiation factor 5B [Auxenochlorella protothecoides]
39747           71            788          PREDICTED: transcription regulatory protein SNF2-like isoform X1 [Glycine max]
40156           72            911          L-myo inositol-1 phosphate synthase [Medicago truncatula]
42188           73            716          acetyl-coa carboxylase [Micromonas pusilla CCMP1545]
42323           74            869          ATP-sulfurylase [Chlamydomonas reinhardtii]
42407           75            611          MC family transporter: ADP/ATP [Ostreococcus lucimarinus CCE9901]
43213           76            395          60S ribosomal L1/L10a protein [Theileria orientalis strain Shintoku]
44658           77            1040         PREDICTED: tRNA-splicing ligase RtcB homolog [Harpegnathos saltator]
45694           78            1019         Ferredoxin-dependent glutamate synthase [Morus notabilis]
46829           79            740          PREDICTED: eukaryotic initiation factor 4A-III [Tribolium castaneum]
48310           80            773          PREDICTED: cell division protein FtsZ homolog 2-2
48607           81            311          hypothetical protein CHLNCDRAFT_32043 [Chlorella variabilis]
49317           82            323          PREDICTED: zinc finger CCCH domain-containing protein 64-like [Amborella trichopoda]
49317           83            605          phosphoglucomutase (ISS) [Ostreococcus tauri]
49629           84            1085         putative polyamine oxidase (ISS) [Ostreococcus tauri]
51636           85            1031         Pre-mRNA-processing-splicing factor isoform 3 [Theobroma cacao]
51863           86            1049         Splicing factor 3B subunit 1 [Auxenochlorella protothecoides]
54109           87            899          PREDICTED: ruBisCO large subunit-binding protein subunit alpha
54270           88            620          dynamin family protein (ISS) [Ostreococcus tauri]
54867           89            1004         Heat Shock Protein 70 [Ostreococcus lucimarinus CCE9901]
[Bar chart labels. Y-axis: Relative Abundance (0 to 1). Sample: SG (EMIRGE). Legend:
Desulfuromonadales (Proteobacteria); Halobacteriales genera Halalkalicoccus, Halobacterium,
Halomicrobium, Halomarina, Halopiger, Halobaculum, Halosimplex, Halorubrum, Haloarcula,
Halococcus, Natronomona, and Halorhabdus; other Halobacteriales; Halothece (Cyanobacteria);
Salinibacter (Bacteroidetes); Chloroplast (Eukarya).]
Figure S1: Major taxonomic groups using reconstructed 16S rRNA full-length gene
sequences from the metagenomic dataset using EMIRGE (Miller et al., 2011); p:phylum;
o: order; g: genus.
Figure S2: Halite functional analysis summary. Relative abundance of major functional
categories using the SEED subsystem hierarchy. The data were compared to SEED
subsystems using a maximum e-value of 1e-5, a minimum identity of 60%, and a
minimum alignment length of 15 amino acids.
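As an illustration of how the three thresholds in this caption act together, the Python sketch below filters a list of similarity hits and computes the relative abundance of each functional category among the retained hits. The `Hit` record and function names are hypothetical and the input format is assumed; this is not the annotation server's own implementation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str          # read or gene identifier
    subsystem: str      # functional category assigned to the hit
    evalue: float
    identity: float     # percent identity
    align_length: int   # alignment length in amino acids


def passes_thresholds(hit, max_evalue=1e-5, min_identity=60.0, min_length=15):
    """True if the hit satisfies all three cutoffs stated in the caption."""
    return (hit.evalue <= max_evalue
            and hit.identity >= min_identity
            and hit.align_length >= min_length)


def relative_abundance(hits):
    """Fraction of retained hits assigned to each functional category."""
    kept = [h for h in hits if passes_thresholds(h)]
    counts = {}
    for h in kept:
        counts[h.subsystem] = counts.get(h.subsystem, 0) + 1
    total = len(kept)
    return {name: n / total for name, n in counts.items()} if total else {}
```

A hit is dropped if it fails any one of the cutoffs, so the relative abundances are computed only over hits that meet all three.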
[Pie chart labels. (a) Carbon metabolism: Sugar metabolism, 26%; Central carbohydrate
metabolism, 30%; Organic acids, 7%; Fermentation, 10%; CO2 fixation, 8%; One-carbon
metabolism, 19%. (b) CO2 fixation: CO2 uptake, carboxysome, 7%; Photorespiration
(oxidative C2 cycle), 57%; Calvin-Benson cycle, 29%; Carboxysome, 7%.]
Figure S3: Halite functional analysis: (a) carbon metabolism and (b) metabolic pathways
for CO2 fixation. Relative abundance of major functional categories using the SEED
subsystem hierarchy.
Figure S4: Taxonomic distribution of RubisCO type I large and small chains in the halite
metagenome using SEED subsystem functional abundance and Best Hit Classification in
MG-RAST.
[Pie chart labels. Chart 1: unassigned, 24%; unclassified Eukaryota, 1%; Cyanobacteria, 56%;
Streptophyta, 7%; Chlorophyta, 12%. Chart 2: unassigned, 26%; Cyanobacteria, 57%;
unclassified Viruses, 3%; unclassified Eukaryota, 3%; Streptophyta, 9%; Chlorophyta, 2%.]
Figure S5: Taxonomic distribution of major proteins for Photosystems I and II in the
halite metagenome using SEED subsystem functional abundance and Best Hit
Classification in MG-RAST. (a) PsaA protein, encoding a major PS I reaction center
protein subunit, and (b) proteins D1 (PsbA) and D2 (PsbD), encoding major protein
subunits of PS II reaction center.
[Pie chart labels: algae, 3%; unassigned, 19%; Cyanobacteria, 78%.]
Figure S6: Taxonomic distribution for Allophycocyanin, Phycobilisome, Phycocyanin,
Phycocyanobilin, and Phycoerythrin light harvesting complexes in the halite metagenome
using SEED subsystem functional abundance and Best Hit Classification in MG-RAST.
Figure S7: Halite functional analysis: phototrophy. Relative abundance of major
functional categories using the SEED subsystem hierarchy.
[Pie chart labels: Allantoin utilization, 0.4%; Denitrification, 1%; Cyanate hydrolysis, 1%;
Dissimilatory nitrite reductase, 4%; Nitric oxide synthase, 5%; Nitrate and nitrite
ammonification, 30%; Urea ABC transporters, 0.2%; Nitrosative stress, 0.3%; Nitrogen
fixation, <0.1%; Ammonia assimilation, 59%.]
Figure S8: Halite functional analysis: nitrogen metabolism. Relative abundance of major
functional categories using the SEED subsystem hierarchy.
Figure S9: Maximum-likelihood phylogeny built with FastTree using a concatenation of
six predicted Photosystem II proteins (psbN, psbH, psbL, psbT, psbI, and psbJ) from
algal chloroplast genomes. The scale bar represents 0.04% sequence divergence. Bootstrap
values (1000 replicates) are shown at nodes. The tree was rooted with Chlamydomonas
reinhardtii. SG Algae: Atacama halite algae.
Figure S10: Isoelectric point distributions for all predicted proteins in three “salt-in”
species: Haloarcula hispanica ATCC (red), Candidatus Nanopetramus SG9 (orange),
Salinibacter ruber M8 (blue), and one non salt-in species, Halothece PCC 7418 (green).
References
Besemer J, Lomsadze A, Borodovsky M (2001). GeneMarkS: a self-training method for
prediction of gene starts in microbial genomes. Implications for finding sequence
motifs in regulatory regions. Nucleic Acids Res 29: 2607-2618.
Castresana J (2000). Selection of conserved blocks from multiple alignments for their use
in phylogenetic analysis. Mol Biol Evol 17: 540-552.
Delcher AL, Salzberg SL, Phillippy AM (2003). Using MUMmer to identify similar
regions in large sequence sets. Curr Protoc Bioinformatics Chapter 10: Unit 10 13.
Edgar RC (2004). MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res 32: 1792-1797.
Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010). Prodigal:
prokaryotic gene recognition and translation initiation site identification. BMC
Bioinformatics 11: 119.
Langmead B, Trapnell C, Pop M, Salzberg SL (2009). Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol 10: R25.
Langmead B, Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nat
Methods 9: 357-359.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J et al. (2012). SOAPdenovo2: an
empirically improved memory-efficient short-read de novo assembler. Gigascience
1: 18.
Price MN, Dehal PS, Arkin AP (2010). FastTree 2--approximately maximum-likelihood
trees for large alignments. PLoS One 5: e9490.
Pruesse E, Quast C, Knittel K, Fuchs B, Ludwig W, Peplies J et al. (2007). SILVA: a
comprehensive online resource for quality checked and aligned ribosomal RNA
sequence data compatible with ARB. Nucleic Acids Res 35: 7188-7196.
Robinson CK, Wierzchos J, Black C, Crits-Christoph A, Ma B, Ravel J et al. (2015).
Microbial diversity and the presence of algae in halite endolithic communities are
correlated to atmospheric moisture in the hyper-arid zone of the Atacama Desert.
Environ Microbiol 17: 299-315.
Roux S, Enault F, Hurwitz BL, Sullivan MB (2015). VirSorter: mining viral signal from
microbial genomic data. PeerJ 3: e985.
Wyman C, Ristic D, Kanaar R (2004). Homologous recombination-mediated double-strand
break repair. DNA Repair 3: 827-833.