Supporting Information

Functional analysis of the archaea, bacteria, and viruses from a halite endolithic microbial community

Alexander Crits-Christoph, Diego R. Gelsinger, Bing Ma, Jacek Wierzchos, Jacques Ravel, Alfonso Davila, M. Cristina Casero, and Jocelyne DiRuggiero

1. Experimental procedures

Sampling site characterization

Entire halite nodules with distinct colonization patterns were collected in 2013 in Salar Grande, a hydrologically inactive salar located in the southwest Tarapacá Region of the Atacama Desert, sealed in sterile Whirlpacks, and stored dry at room temperature until analysis (Robinson et al., 2015). Rainfall episodes are extremely rare in this area, whereas intense fog events, called “camanchaca”, are frequently recorded. Mean photosynthetically active radiation (PAR) was previously reported at 552.54 µmol s⁻¹ m⁻², mean air temperature was 19.6°C, and mean relative humidity (RH) was 53.8% between April 2010 and 2011 (Robinson et al., 2015).

DNA extraction, sequencing, and analysis

Total genomic DNA was extracted from pooled halite nodules using the PowerSoil DNA Isolation kit (MoBio Laboratories Inc., Solana Beach, CA) as previously described (Robinson et al., 2015). Sequencing libraries were prepared using the Nextera XT DNA sample preparation kit (Illumina, San Diego, CA), with an average insert size of 400 bp, and sequenced on the Illumina HiSeq2500 platform. For quality control of the paired-end reads, small-subunit (SSU) ribosomal RNA reads were removed by mapping to the SILVA reference database (Pruesse et al., 2007) with Bowtie (v1) (Langmead et al., 2009). Both reads of a pair were filtered out if either read mapped to the rRNA database. Low-quality bases, as determined by base calling using a Phred quality score threshold of 20 (which corresponds to an error probability of 1%), were trimmed from the end of each raw read, and sequences longer than 75% of the original read length were retained.
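As an illustration of the read-level quality control described above, the following Python sketch shows trimming from the 3' end at a Phred threshold of 20, the 75% length criterion, and removal of read pairs in which either mate matched the rRNA database. The trimming software actually used is not named in this document, so the function names (`trim_3prime`, `keep_read`, `drop_rrna_pairs`) and the simple base-by-base trimming rule are assumptions made for illustration only.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quals, min_q=20):
    """Remove trailing bases whose quality is below min_q.

    Assumption: bases are removed one at a time from the 3' end until a
    base with quality >= min_q is reached; the original tool may differ.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(original_len, trimmed_len, min_fraction=0.75):
    """Retain a read only if it is longer than min_fraction of its original length."""
    return trimmed_len > min_fraction * original_len


def drop_rrna_pairs(pairs, rrna_hit_ids):
    """Discard a read pair if either mate mapped to the rRNA database.

    pairs: list of (mate1_id, mate2_id); rrna_hit_ids: set of read IDs
    that mapped to the rRNA reference (hypothetical input format).
    """
    return [
        (r1, r2)
        for r1, r2 in pairs
        if r1 not in rrna_hit_ids and r2 not in rrna_hit_ids
    ]


if __name__ == "__main__":
    seq, quals = trim_3prime("ACGTACGT", [30, 30, 30, 30, 30, 30, 10, 5])
    print(seq, keep_read(8, len(seq)))
    print(drop_rrna_pairs([("a/1", "a/2"), ("b/1", "b/2")], {"b/2"}))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), which is why a threshold of 20 corresponds to 1%.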
Assembly and annotation of the algae genome

Contigs in genomic bin 18 that were identified by BLASTP as belonging to eukaryotic green algae were then mapped with PROmer from the MUMmer package (Delcher et al., 2003) to the closest reference genome, Ostreococcus tauri. Contigs that mapped to haloarchaeal reference genomes, including those of Halococcus, Halorubrum, and Haloarcula, were removed. Gene finding was performed on the remaining contigs using the self-training program GeneMarkS with the sequence type set to “intronless eukaryote” (Besemer et al., 2001). Individual genes were annotated using BLASTP against the nr database, and contigs with proteins mapping exclusively to bacteria were removed. This resulted in a dataset of 89 genes on 73 contigs, all with close known homologs in either green algae or plant taxa. The isoelectric points of all predicted protein products for this dataset were calculated using a custom Python script (https://github.com/alexcritschristoph/MicrobialGenomicsScripts) and compared to the reference proteomes for Micromonas sp. RCC299, Ostreococcus tauri, and Dunaliella salina.

In genomic bin 9, which had a low mean G+C%, putative eukaryotic organelle contigs were identified using BLASTP and annotated using DOGMA (Wyman et al., 2004). Predicted Photosystem II proteins from a chloroplast genomic contig were concatenated and aligned to homologs from multiple characterized algae species with MUSCLE (Edgar, 2004). Phylogenetically informative regions were extracted from the alignment using Gblocks (Castresana, 2000), and a concatenated protein phylogeny was built using FastTree (Price et al., 2010).

Assembly of the Nanohaloarchaea genome

Protein products of four of the largest contigs that binned together were predicted with Prodigal (Hyatt et al., 2010), and all putative Nanohaloarchaea contigs were isolated using a BLASTP reference-based approach requiring 25% of the genes on any contig to have closest matches to reference Nanohaloarchaeal genes.
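The reference-based contig selection described above can be sketched in Python as follows. This is a minimal illustration and not the authors' script: the input format (one tuple per predicted gene carrying the taxon of its best BLASTP hit), the function name `select_contigs`, and the reading of the 25% requirement as "at least 25%" are all assumptions.

```python
from collections import defaultdict


def select_contigs(best_hits, reference_taxon="Nanohaloarchaea", min_fraction=0.25):
    """Return contigs on which enough genes have a best hit to the reference taxon.

    best_hits: iterable of (contig_id, gene_id, best_hit_taxon) tuples, where
    best_hit_taxon is None for genes without any hit (hypothetical format).
    """
    total = defaultdict(int)      # number of predicted genes per contig
    matching = defaultdict(int)   # genes whose best hit is the reference taxon
    for contig_id, _gene_id, taxon in best_hits:
        total[contig_id] += 1
        if taxon == reference_taxon:
            matching[contig_id] += 1
    # Assumption: "requiring 25%" is interpreted as a fraction >= 0.25.
    return sorted(c for c in total if matching[c] / total[c] >= min_fraction)


if __name__ == "__main__":
    hits = [
        ("c1", "g1", "Nanohaloarchaea"), ("c1", "g2", "Halobacteria"),
        ("c1", "g3", None), ("c1", "g4", "Halobacteria"),
        ("c2", "g1", "Halobacteria"), ("c2", "g2", None), ("c2", "g3", "Halobacteria"),
        ("c3", "g1", "Nanohaloarchaea"), ("c3", "g2", "Nanohaloarchaea"),
    ]
    print(select_contigs(hits))
```

Counting genes without any hit in the denominator makes the filter more conservative; whether the original analysis did so is not stated.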
Each identified contig was then reassembled by mapping reads with Bowtie2 (Langmead and Salzberg, 2012), and the contigs were reassembled together using SOAPdenovo2 (Luo et al., 2012). This assembly produced four larger contigs with a coverage of around 20 and highly similar G+C content (~46.4%). These contigs overlapped one another by more than 100 base pairs of near-100% identity in contiguous genes; the overlaps were removed and the contigs concatenated, resulting in a single 1.1 Mbp contig. The start and end of the concatenated contig were contained within the 16S rRNA gene, indicating that only ~1 kbp was missing from the assembly. The 16S rRNA gene was reassembled from the metagenome using EMIRGE (Miller et al., 2011), found to overlap both ends of the large contig with >95% identity, and added to the completed assembly.

2. Supporting tables and figures

Table S1: List of putative viral genomes
Table S2: Predicted functions for algal genes
Figure S1: Major taxonomic groups using 16S rRNA sequences
Figure S2: Halite functional analysis summary
Figure S3: Halite functional analysis: carbon metabolism
Figure S4: Taxonomic distribution of RubisCO
Figure S5: Taxonomic distribution of major proteins for PS I and II
Figure S6: Taxonomic distribution for light harvesting complexes
Figure S7: Halite functional analysis: phototrophy
Figure S8: Halite functional analysis: nitrogen metabolism
Figure S9: Phylogenetic tree for PS II proteins
Figure S10: Isoelectric point distributions

Table S1: List of putative viral genomes

| Contig | Size | GC% | Genome Structure | Putative Host | VirSorter Category |
|---|---|---|---|---|---|
| 32 | 70.0 | 51.5 | Linear | Halobacteria | cat2 |
| 38 | 64.0 | 54.9 | Circular | Halobacteria | cat2 |
| 68 | 52.7 | 50.3 | Linear | Halobacteria | cat2 |
| 82 | 47.1 | 63.3 | Circular | Halobacteria | cat2 |
| 86 | 46.2 | 63.8 | Linear | Halobacteria | cat2 |
| 92 | 44.5 | 63.9 | Circular | Halobacteria | cat2 |
| 104 | 41.5 | 57.2 | Linear | Halobacteria | cat2 |
| 127 | 36.8 | 58.2 | Linear | Halobacteria | cat3 |
| 135 | 35.1 | 63.9 | Linear | Halobacteria | cat2 |
| 139 | 34.1 | 43.3 | Circular | Nanohaloarchaea | cat2 |
| 146 | 33.0 | 57.4 | Linear | Halobacteria | cat2 |
| 155 | 32.3 | 60.4 | Circular | Halobiforma nitratireducens | cat3 |
| 161 | 31.6 | 60.1 | Linear | Halobacteria | cat2 |
| 186 | 29.1 | 58.8 | Linear | Halobacteria | cat2 |
| 192 | 28.1 | 63.4 | Linear | Halobacteria | cat2 |
| 216 | 26.4 | 65 | Circular | Halobacteria | cat3 |
| 232 | 24.7 | 63.9 | Linear | Halobacteria | cat2 |
| 238 | 24.0 | 47.2 | Linear | Halothece | cat2 |
| 257 | 23.3 | 63.2 | Circular | Halobacteria | cat3 |
| 279 | 22.2 | 65.9 | Linear | Halobacteria | cat2 |
| 299 | 21.8 | 63.8 | Linear | Halobacteria | cat2 |
| 313 | 21.4 | 61.1 | Linear | Halobacteria | cat2 |
| 322 | 21.0 | 45.2 | Linear | Halothece | n/a |
| 354 | 20.1 | 52.3 | Linear | Halobacteria | cat2 |
| 403 | 18.8 | 59.6 | Circular | Halobacteria | cat3 |
| 523 | 16.5 | 65.6 | Circular | Halobacteria | cat3 |
| 526 | 16.5 | 61.5 | Circular | Halobacteria | cat3 |
| 557 | 15.9 | 60.6 | Circular | Halobacteria | cat2 |
| 589 | 15.9 | 61.6 | Circular | Halobacteria | cat2 |
| 627 | 15.1 | 55.1 | Circular | Halobacteria | n/a |
| 687 | 14.4 | 61.4 | Circular | Halobacteria | cat2 |
| 929 | 12.3 | 62.5 | Circular | Halobacteria | cat2 |
| 934 | 12.3 | 57.6 | Circular | Halobacteria | cat3 |
| 966 | 12.0 | 63.9 | Circular | Halobacteria | cat3 |
| 1987 | 8.4 | 46.1 | Linear | Halobacteria | n/a |

cat1: most confident; cat2: likely prediction; cat3: possible prediction (Roux et al., 2015)

Table S2: Predicted functions for algal genes

| Algae Contig # | Algae Gene # | Gene Length | Predicted Function (from Web BLAST) |
|---|---|---|---|
| 7838 | 1 | 749 | protein farnesyltransferase subunit beta |
| 7838 | 2 | 458 | Unknown |
| 7838 | 3 | 974 | chloroplast envelope protein translocase family [Micromonas sp. RCC299] |
| 8477 | 4 | 485 | 6-phosphogluconate dehydrogenase (ISS) [Ostreococcus tauri] |
| 8477 | 5 | 563 | PREDICTED: ras-related protein Rab7-like [Nelumbo nucifera] |
| 8477 | 6 | 491 | Unknown |
| 8477 | 7 | 374 | hypothetical protein EMIHUDRAFT_97090 [Emiliania huxleyi CCMP1516] |
| 12175 | 8 | 2876 | alanine--tRNA ligase [Eucalyptus grandis] |
| 12259 | 9 | 2213 | SNF2 super family [Micromonas sp. RCC299] |
| 12259 | 10 | 317 | PREDICTED: ATP-dependent helicase BRM-like [Sesamum indicum] |
| 12846 | 11 | 1403 | hypothetical protein CHLREDRAFT_139213 [Chlamydomonas reinhardtii] |
| 12846 | 12 | 368 | Isoleucine-tRNA synthetase [Ostreococcus lucimarinus CCE9901] |
| 12846 | 13 | 569 | ribose-phosphate pyrophosphokinase [Bathycoccus prasinos] |
| 13069 | 14 | 506 | threonyl-tRNA synthetase [filamentous cyanobacterium ESFC-1] [OK] |
| 13069 | 15 | 1073 | putative threonyl-tRNA synthetase (ISS) [Ostreococcus tauri] |
| 13134 | 16 | 2678 | PREDICTED: inositol hexakisphosphate and diphosphoinositol-pentakisphosphate kinase 2-like isoform X2 [Sesamum indicum] |
| 14068 | 17 | 956 | PREDICTED: DNA replication licensing factor MCM2 [Jatropha curcas] |
| 14068 | 18 | 1295 | MCM-domain-containing protein [Coccomyxa subellipsoidea C-169] |
| 14665 | 19 | 1169 | FAD-dependent oxidoreductase family protein (ISS) [Ostreococcus tauri] |
| 14665 | 20 | 305 | Eukaryotic translation initiation factor 1A |
| 15192 | 21 | 368 | hypothetical protein [Halothece sp. PCC 7418] [OK] |
| 15192 | 22 | 458 | predicted protein [Ostreococcus lucimarinus CCE9901] |
| 15192 | 23 | 1061 | phosphoglycolate phosphatase precursor (ISS) [Ostreococcus tauri] |
| 15192 | 24 | 422 | GPN-loop GTPase 2 [Morus notabilis] |
| 15504 | 25 | 1229 | bifunctional dihydrofolate reductase-thymidylate synthase 2 / DHFR-TS (ISS) [Ostreococcus tauri] |
| 15504 | 26 | 971 | predicted protein [Ostreococcus lucimarinus CCE9901] |
| 16823 | 27 | 974 | Lipoyl synthase [Coccomyxa subellipsoidea C-169] |
| 16823 | 28 | 920 | asparagine synthase [Micromonas sp. RCC299] |
| 16847 | 29 | 1931 | ATPase type 13A (ISS) [Ostreococcus tauri] |
| 17149 | 30 | 419 | PREDICTED: 40S ribosomal protein S3-1 [Nicotiana sylvestris] |
| 17149 | 31 | 737 | putative plant SNARE 12 [Auxenochlorella protothecoides] |
| 17149 | 32 | 329 | Unknown |
| 18046 | 33 | 1838 | putative chloroplast 1-hydroxy-2-methyl-2-(E)-butenyl-4-diphosphate synthase precursor [Coccomyxa subellipsoidea C-169] |
| 18500 | 34 | 1652 | cell division protein (ISS) [Ostreococcus tauri] |
| 18905 | 35 | 1190 | U5 snRNP spliceosome subunit (ISS) [Ostreococcus tauri] |
| 20779 | 36 | 1499 | PREDICTED: nucleolar GTP-binding protein 2-like [Oryza brachyantha] |
| 20779 | 37 | 359 | Unknown |
| 21304 | 38 | 1946 | DNA-directed RNA polymerase (ISS) [Ostreococcus tauri] |
| 21616 | 39 | 1442 | 2-oxoglutarate dehydrogenase E2 subunit-like protein [Ostreococcus lucimarinus CCE9901] |
| 21917 | 40 | 1349 | glutamate-1-semialdehyde 2 |
| 22322 | 41 | 818 | Mismatch repair ATPase MSH4 (MutS family) (ISS) [Ostreococcus tauri] |
| 22971 | 42 | 665 | AAA+-type ATPase (ISS) [Ostreococcus tauri] |
| 22971 | 43 | 458 | Unknown |
| 23040 | 44 | 1334 | UDP-galactopyranose mutase (ISS) [Ostreococcus tauri] |
| 23221 | 45 | 1676 | mitochondrial elongation factor [Micromonas sp. RCC299] |
| 23680 | 46 | 902 | predicted protein [Micromonas sp. RCC299] |
| 23967 | 47 | 1847 | RNA polymerase I |
| 24870 | 48 | 1571 | phosphofructokinase [Micromonas pusilla CCMP1545] |
| 25659 | 49 | 848 | 14-3-3 protein [Coccomyxa subellipsoidea C-169] |
| 26146 | 50 | 1352 | PREDICTED: heat shock protein 83 [Populus euphratica] |
| 26718 | 51 | 1712 | acetyl-CoA carboxylase (ISS) [Ostreococcus tauri] |
| 27451 | 52 | 587 | PREDICTED: 60S ribosomal protein L30-like [Glycine max] |
| 27865 | 53 | 845 | predicted protein [Micromonas sp. RCC299] |
| 28066 | 54 | 1112 | vacuolar-type H+-pyrophosphatase (ISS) [Ostreococcus tauri] |
| 28253 | 55 | 1643 | DEAD/DEAH box helicase [Micromonas sp. RCC299] |
| 28994 | 56 | 326 | predicted protein [Ostreococcus lucimarinus CCE9901] |
| 30012 | 57 | 485 | predicted protein [Micromonas sp. RCC299] |
| 31934 | 58 | 1502 | PREDICTED: putative pre-mRNA-splicing factor ATP-dependent RNA helicase DHX16 isoform X3 [Vitis vinifera] |
| 32083 | 59 | 1436 | 1-deoxy-D-xylulose-5-phosphate synthase plastid precursor [Ostreococcus lucimarinus CCE9901] |
| 32276 | 60 | 1397 | PREDICTED: probable pre-mRNA-splicing factor ATP-dependent RNA helicase [Amborella trichopoda] |
| 32478 | 61 | 1394 | pyruvate dehydrogenase E1 component beta subunit |
| 33817 | 62 | 380 | NHP2-like protein 1 [Morus notabilis] |
| 35119 | 63 | 1244 | PREDICTED: monodehydroascorbate reductase |
| 35476 | 64 | 980 | WD40 repeat-like protein [Coccomyxa subellipsoidea C-169] |
| 35676 | 65 | 449 | PREDICTED: 40S ribosomal protein S5-like [Sesamum indicum] |
| 37046 | 66 | 929 | Pre-mRNA-processing-splicing factor isoform 3 [Theobroma cacao] |
| 37835 | 67 | 1328 | imidazole glycerol phosphate synthase hisHF [Metarhizium acridum CQMa 102] |
| 38761 | 68 | 800 | RecQL4 DNA/RNA helicase [Guillardia theta CCMP2712] |
| 39163 | 69 | 1292 | P-ATPase family transporter: cadmium ion [Ostreococcus lucimarinus CCE9901] |
| 39738 | 70 | 1280 | Eukaryotic translation initiation factor 5B [Auxenochlorella protothecoides] |
| 39747 | 71 | 788 | PREDICTED: transcription regulatory protein SNF2-like isoform X1 [Glycine max] |
| 40156 | 72 | 911 | L-myo inositol-1 phosphate synthase [Medicago truncatula] |
| 42188 | 73 | 716 | acetyl-coa carboxylase [Micromonas pusilla CCMP1545] |
| 42323 | 74 | 869 | ATP-sulfurylase [Chlamydomonas reinhardtii] |
| 42407 | 75 | 611 | MC family transporter: ADP/ATP [Ostreococcus lucimarinus CCE9901] |
| 43213 | 76 | 395 | 60S ribosomal L1/L10a protein [Theileria orientalis strain Shintoku] |
| 44658 | 77 | 1040 | PREDICTED: tRNA-splicing ligase RtcB homolog [Harpegnathos saltator] |
| 45694 | 78 | 1019 | Ferredoxin-dependent glutamate synthase [Morus notabilis] |
| 46829 | 79 | 740 | PREDICTED: eukaryotic initiation factor 4A-III [Tribolium castaneum] |
| 48310 | 80 | 773 | PREDICTED: cell division protein FtsZ homolog 2-2 |
| 48607 | 81 | 311 | hypothetical protein CHLNCDRAFT_32043 [Chlorella variabilis] |
| 49317 | 82 | 323 | PREDICTED: zinc finger CCCH domain-containing protein 64-like [Amborella trichopoda] |
| 49317 | 83 | 605 | phosphoglucomutase (ISS) [Ostreococcus tauri] |
| 49629 | 84 | 1085 | putative polyamine oxidase (ISS) [Ostreococcus tauri] |
| 51636 | 85 | 1031 | Pre-mRNA-processing-splicing factor isoform 3 [Theobroma cacao] |
| 51863 | 86 | 1049 | Splicing factor 3B subunit 1 [Auxenochlorella protothecoides] |
| 54109 | 87 | 899 | PREDICTED: ruBisCO large subunit-binding protein subunit alpha |
| 54270 | 88 | 620 | dynamin family protein (ISS) [Ostreococcus tauri] |
| 54867 | 89 | 1004 | Heat Shock Protein 70 [Ostreococcus lucimarinus CCE9901] |

[Figure S1 chart. Y-axis: Relative Abundance (0 to 1). Sample: SG (EMIRGE). Legend: p_Proteobacteria; o_Desulfuromonadales; o_Halobacteriales; g_Halalkalicoccus; o_Halobacteriales; g_Halobacterium; o_Halobacteriales; g_Halomicrobium; o_Halobacteriales; g_Halomarina; d_Eukarya; Chloroplast; o_Halobacteriales; g_Halopiger; o_Halobacteriales; g_Halobaculum; o_Halobacteriales; g_Halosimplex; o_Halobacteriales; g_Halorubrum; p_Cyanobacteria; g_Halothece; o_Halobacteriales; g_Haloarcula; p_Bacteroidetes; g_Salinibacter; o_Halobacteriales; others; o_Halobacteriales; g_Halococcus; o_Halobacteriales; g_Natronomona; o_Halobacteriales; g_Halorhabdus]

Figure S1: Major taxonomic groups based on full-length 16S rRNA gene sequences reconstructed from the metagenomic dataset using EMIRGE (Miller et al., 2011); p: phylum; o: order; g: genus.

Figure S2: Halite functional analysis summary. Relative abundance of major functional categories using the SEED subsystem hierarchy. The data were compared to SEED subsystems using a maximum e-value of 1e-5, a minimum identity of 60%, and a minimum alignment length of 15 amino acids for protein databases.

[Figure S3 charts. (a) Sugar metabolism 26%; Central carbohydrate metabolism 30%; Organic acids 7%; Fermentation 10%; CO2 fixation 8%; One-carbon metabolism 19%. (b) CO2 uptake, carboxysome 7%; Photorespiration (oxidative C2 cycle) 57%; Calvin-Benson cycle 29%; Carboxysome 7%]

Figure S3: Halite functional analysis: (a) carbon metabolism and (b) metabolic pathways for CO2 fixation. Relative abundance of major functional categories using the SEED subsystem hierarchy.

Figure S4: Taxonomic distribution of RubisCO type I large and small chains in the halite metagenome using SEED subsystem functional abundance and Best Hit Classification in MG-RAST.
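The similarity thresholds stated in the Figure S2 caption (maximum e-value of 1e-5, minimum identity of 60%, minimum alignment length of 15 amino acids) amount to a simple filter over annotation hits. The Python sketch below illustrates that filter only; the `Hit` record and the function name `passes_thresholds` are hypothetical and do not correspond to any MG-RAST interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One protein-database hit (hypothetical record layout)."""
    evalue: float
    identity_pct: float
    alignment_length_aa: int


def passes_thresholds(hit, max_evalue=1e-5, min_identity_pct=60.0,
                      min_alignment_length_aa=15):
    """Return True if the hit satisfies all three cutoffs."""
    return (
        hit.evalue <= max_evalue
        and hit.identity_pct >= min_identity_pct
        and hit.alignment_length_aa >= min_alignment_length_aa
    )


if __name__ == "__main__":
    hits = [Hit(1e-20, 85.0, 40), Hit(1e-3, 90.0, 50), Hit(1e-10, 45.0, 60)]
    print([passes_thresholds(h) for h in hits])
```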
[Figure S5 charts. (a) unassigned 24%; unclassified Eukaryota 1%; Cyanobacteria 56%; Streptophyta 7%; Chlorophyta 12%. (b) unassigned 26%; Cyanobacteria 57%; unclassified Viruses 3%; unclassified Eukaryota 3%; Streptophyta 9%; Chlorophyta 2%]

Figure S5: Taxonomic distribution of major proteins for Photosystems I and II in the halite metagenome using SEED subsystem functional abundance and Best Hit Classification in MG-RAST. (a) PsaA protein, a major subunit of the PS I reaction center, and (b) proteins D1 (PsbA) and D2 (PsbD), major subunits of the PS II reaction center.

[Figure S6 chart. algae 3%; unassigned 19%; Cyanobacteria 78%]

Figure S6: Taxonomic distribution for Allophycocyanin, Phycobilisome, Phycocyanin, Phycocyanobilin, and Phycoerythrin light harvesting complexes in the halite metagenome using SEED subsystem functional abundance and Best Hit Classification in MG-RAST.

Figure S7: Halite functional analysis: phototrophy. Relative abundance of major functional categories using the SEED subsystem hierarchy.

[Figure S8 chart. Allantoin Utilization 0.4%; Denitrification 1%; Cyanate hydrolysis 1%; Dissimilatory nitrite reductase 4%; Nitric oxide synthase 5%; Nitrate and nitrite ammonification 30%; Urea ABC transporters 0.2%; Nitrosative stress 0.3%; Nitrogen Fixation <0.1%; Ammonia assimilation 59%]

Figure S8: Halite functional analysis: nitrogen metabolism. Relative abundance of major functional categories using the SEED subsystem hierarchy.

Figure S9: Maximum-likelihood phylogeny built with FastTree using a concatenation of six predicted Photosystem II proteins (psbN, psbH, psbL, psbT, psbI, and psbJ) from algae chloroplast genomes. Bar represents 0.04% sequence divergence. Bootstrap values (1000 replicates) are shown at nodes. The tree was rooted with Chlamydomonas reinhardtii. SG Algae: Atacama halite algae.
Figure S10: Isoelectric point distributions for all predicted proteins in three “salt-in” species: Haloarcula hispanica ATCC (red), Candidatus Nanopetramus SG9 (orange), Salinibacter ruber M8 (blue), and one non-salt-in species, Halothece PCC 7418 (green).

References

Besemer J, Lomsadze A, Borodovsky M (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29: 2607-2618.

Castresana J (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17: 540-552.

Delcher AL, Salzberg SL, Phillippy AM (2003). Using MUMmer to identify similar regions in large sequence sets. Curr Protoc Bioinformatics Chapter 10: Unit 10.3.

Edgar RC (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-1797.

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11: 119.

Langmead B, Trapnell C, Pop M, Salzberg SL (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25.

Langmead B, Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357-359.

Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J et al. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1: 18.

Price MN, Dehal PS, Arkin AP (2010). FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 5: e9490.

Pruesse E, Quast C, Knittel K, Fuchs B, Ludwig W, Peplies J et al. (2007). SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35: 7188-7196.

Robinson CK, Wierzchos J, Black C, Crits-Christoph A, Ma B, Ravel J et al. (2015). Microbial diversity and the presence of algae in halite endolithic communities are correlated to atmospheric moisture in the hyper-arid zone of the Atacama Desert. Environ Microbiol 17: 299-315.

Roux S, Enault F, Hurwitz BL, Sullivan MB (2015). VirSorter: mining viral signal from microbial genomic data. PeerJ 3: e985.

Wyman C, Ristic D, Kanaar R (2004). Homologous recombination-mediated double-strand break repair. DNA Repair 3: 827-833.