Diversification and niche adaptations of Nitrospina-like bacteria in the polyextreme interfaces of the Atlantis II Deep brine from the Red Sea

David Kamanda Ngugi1*, Jochen Blom2, Ramunas Stepanauskas3, and Ulrich Stingl1

1Red Sea Research Centre, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; 2Bioinformatics and Systems Biology, Justus Liebig University Giessen, Germany; 3Bigelow Laboratories for Ocean Sciences, 60 Bigelow Drive, East Boothbay, ME 04544-0380, USA.

*Corresponding author: david.ngugi@kaust.edu.sa

SUPPLEMENTARY FIGURE LEGENDS

Figure S1: Hierarchical clustering of OTUs from 16S rRNA gene sequences from the brine-seawater interfaces (BSI) of five different brines that were previously classified as Nitrospina-like, based on data from Ngugi et al. (2014). Operational taxonomic units were clustered at the 97% identity level. Only OTUs with an abundance of 1% of all reads that were assigned as “Nitrospina” are shown.

Figure S2: Phylum-level taxonomic assignment of the predicted proteomes of Nitrospina gracilis and Ca. Nitromaritima SAGs from the Red Sea (RS), the N. Pacific (NP), and the N. Atlantic (NA) as predicted using MAPLE (see Methods for details).

Figure S3: Venn diagram showing the core genesets (in bold) shared between the (pan)genomes of the three geographically separated Ca. Nitromaritima species and Nitrospina gracilis. Values in brackets show the unique genesets in each (pan)genome.

Figure S4: Gene cluster organization of the pyruvate:ferredoxin oxidoreductase (PFOR) operon in one of our SAGs (SCGC AAA799-C22), Nitrospina gracilis 3/211, and Ca. Nitromaritima sp. B18 (from the N. Atlantic Ocean). Note that Ca. Nitromaritima sp. B18 lacks a putative rubrerythrin-like gene.

Figure S5: Maximum-likelihood tree showing the phylogenetic placement of the β-subunit of the nitrite oxidoreductase (nxrB) gene sequences from low- and high-affinity NOB (blue and red leaves, respectively) among selected members of the type II dimethyl sulfoxide (DMSO) reductase enzyme family. Sequences from Ca. Nitromaritima single-cell amplified genomes are shown in bold.

Figure S6: Ratios between the frequency of nxrA gene homologues from low-affinity (LNOB) and high-affinity NOB (HNOB) in metagenomic datasets. Note that the ratios were calculated excluding nxrA gene homologues from anaerobic ammonia-oxidizing (anammox) bacteria.

Figure S7: Organization of gene clusters for the biosynthesis of ectoine and hydroxyectoine (ectABCD) in one of our SAGs (SCGC AAA799-C22) and in Nitrococcus mobilis Nb-231. Note that among all genome-sequenced NOB, only these two carry the ectABC operon, and only our SAGs have the ABC-type transporter for ectoine (ehuACBD) and the gene encoding ectoine hydroxylase (ectD).

Figure S8: Box plots showing the distribution of the isoelectric point (pI) of predicted proteomes among members of the proposed candidate phylum Nitrospinae (in grey), representatives of various nitrite-oxidizers and anaerobic ammonia-oxidizing bacteria (in yellow), aerobic ammonia-oxidizing prokaryotes (in purple), and typical planktonic bacteria (in white) relative to extreme (in red) and moderate (in light blue) halophiles. The three panels show data for the overall predicted proteome (a), as well as for protein-coding genes without any transmembrane domains (b) or with a single transmembrane domain (c). The dashed blue line demarcates a pI of 7.0, while triangles show the mean. Our single-cell genomes are shown in bold.

Figure S9: Phylogenetic analysis of putative oxidases encoded in the Nitrospinae genomes (in red and blue) relative to the bd-type quinol oxidases (in grey) and the proposed “cytochrome bd-like oxidase” from Ca. Nitrospira
defluvii (NIDE0901, in bold; see Lücker et al., 2010). (a) Maximum-likelihood tree of the genes putatively encoding the α-subunit of the predicted oxidases, including those from N. gracilis (in blue). Branches with bootstrap values above 85% (based on 100 iterations) are marked with a black dot on the node. Characterized bd-type quinol oxidases (in grey) were used as the outgroup. (b) Multiple sequence alignment of the above proteins highlighting residues that putatively possess functions homologous to those in typical cytochrome c oxidases, including residues involved in the binding of heme groups (in red) and their alternatives (in yellow), those involved in copper binding (in turquoise), and those that are conserved in cyt. c oxidases (in green). Residues that are required for quinol binding in bd-type quinol oxidases are highlighted as well (in purple). Note the divergence of “cyt. bd-like oxidases” (harbouring copper-binding sites but no quinol-binding residues) and “bd-like enzymes” (lacking both copper- and quinol-binding sites).

SUPPLEMENTARY TABLE LEGENDS

Table S1: Metadata of the sampling location where the single-cell genomes were obtained.

Table S2: Distance matrix of the 16S rRNA genes of all Nitrospina-like bacterial sequences used for constructing the tree presented in Figure 1.

Table S3: Estimation of genome completeness based on 104 single-copy genes as determined using CheckM.

Table S4: 454 metagenomic libraries and their associated metadata that were used for fragment recruitment analyses.

Table S5: The percentage of overlap (upper triangle) and average nucleotide identity (ANI, lower triangle) between pairs of genomes, including our SAGs (in bold), the related Ca. Nitromaritima SAGs and fosmids, and canonical NOB.

Table S6: Genesets in the core genome of Nitrospina gracilis and the pan-genome of our SAGs.
The first two columns show the corresponding orthologous proteins and their predicted functions based on annotations in NCBI (text highlighted in grey), while the subsequent columns are based on our automated annotation in INDIGO. Text highlighted in pink indicates genes encoding enzymes that are discussed in the main text or depicted in Figure 6.

Table S7: Spearman correlation coefficient in the occurrence of genes.

Table S8: Unique genesets in the pan-genome of Ca. Nitromaritima RS (relative to the genome of Nitrospina gracilis). All the genes discussed in the main text or depicted in Figure 6 are highlighted in red.

Table S9: List of transporters predicted to be encoded in the (pan)genomes of N. gracilis (S9A), Ca. Nitromaritima RS (S9B), Ca. Nitromaritima NA (S9C), and Ca. Nitromaritima sp. NP (S9D) based on automated annotation via the web-based transporter annotation tool TransAAP (http://www.membranetransport.org/).

SUPPLEMENTARY MATERIALS & METHODS

MATERIALS AND METHODS

Phylogenetic analyses

All phylogenetic analyses were conducted in Geneious Pro v7.1.2 (Biomatters Ltd, Auckland, NZ; http://www.geneious.com). The 16S (≥1400 bp) and 23S (≥2600 bp) rRNA gene-based phylogenetic trees were constructed by first aligning the respective sequences using the SINA alignment webtool based on the SILVA 115 database (http://www.arb-silva.de/aligner). Phylogenetic analyses were then performed by importing the aligned sequences into Geneious Pro and computing a maximum-likelihood tree with PhyML (Guindon & Gascuel, 2003) based on the WAG substitution model (100 bootstraps), an estimated gamma distribution parameter (I), and four discrete substitution rate categories (Γ4). The best nucleotide substitution model (i.e., WAG) was selected prior to phylogenetic inference using jModelTest (Posada, 2008). A Bayesian consensus tree was also constructed using MrBayes v3.2.1 (Ronquist & Huelsenbeck, 2003) from the same alignment with the GTR + I + Γ4 model.
MrBayes was run with four chains for 1 million generations, and trees were sampled every 200 generations. To construct the consensus tree, 10% of the sampled points were discarded as “burn-in”.

To infer phylogeny using the internal transcribed spacer (ITS) region between the 16S and 23S rRNA genes, we first extracted the ITS sequences from each genome based on the 16S and 23S rRNA gene coordinates; average ITS sequence sizes were 454 and 478 bp for the Nitrospina-like SAGs and N. gracilis, respectively, with both encoding tRNAs for isoleucine and alanine. The closest BLAST hits were then obtained from NCBI's non-redundant (nr) and whole-genome shotgun databases. Sequences were aligned using MUSCLE (Edgar, 2004), followed by phylogenetic inference as described above.

The phylogenetic analysis of the genes encoding nitrite oxidoreductase (NXR) was performed by aligning the protein sequences of the NxrA (1,416 amino acid positions; Figure 5B) and NxrB (561 amino acid positions; Figure S5) subunits from the “Nitromaritima” SAGs with existing databases of type II DMSO reductase enzyme family proteins (Lücker et al., 2013) using MUSCLE (Edgar, 2004). The best amino acid substitution model (WAG + I + Γ4) was then selected using ProtTest 3 (Darriba et al., 2011) prior to phylogenetic inference as described above using PhyML and MrBayes.

Phylogenetic analysis of the genes encoding the putative alpha subunit of “cytochrome bd-like” oxidases was performed by generating a consensus multiple sequence alignment from multiple independent alignments of protein sequences that had been aligned using MAFFT (Katoh et al., 2005) based on different amino acid substitution matrices, as implemented in MergeAlign (http://mergealign.appspot.com/; Collingridge & Kelly, 2012). Aligned positions (and gaps) with a mean score of less than 50% were then removed, resulting in an alignment with 543 aligned positions.
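
The column-filtering step described above can be illustrated with a minimal Python sketch. It assumes that the consensus alignment is available as equal-length strings and that each column carries a score between 0 and 1; the function name `filter_columns`, the data layout, and the toy sequences and scores are hypothetical and are not taken from MergeAlign or from the original analysis.

```python
from typing import Dict, List


def filter_columns(
    alignment: Dict[str, str],
    column_scores: List[float],
    min_score: float = 0.5,
) -> Dict[str, str]:
    """Drop alignment columns whose score is below min_score.

    Hypothetical helper: `alignment` maps sequence names to aligned
    sequences (gaps as '-'), and `column_scores` holds one score per column.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    (length,) = lengths
    if length != len(column_scores):
        raise ValueError("need exactly one score per alignment column")

    # Indices of the columns that reach the threshold (score >= min_score).
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy data (invented for illustration only).
toy_alignment = {"seq_a": "MK-LV", "seq_b": "MKALV", "seq_c": "MR-LI"}
toy_scores = [0.95, 0.80, 0.20, 0.60, 0.45]

filtered = filter_columns(toy_alignment, toy_scores)
print(filtered)
```

In this toy case the third and fifth columns fall below the 50% threshold and are removed, so each sequence is reduced from five to three positions.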
A maximum-likelihood tree was then reconstructed from the optimized alignment using PhyML (100 bootstraps) with the amino acid substitution model WAG. cydA sequences of characterized bd-type quinol oxidases were used as the outgroup, namely those from E. coli (Acc. No. EDV65286), Bacillus subtilis (Acc. No. BAA11727), and Azotobacter vinelandii (Acc. No. ACO78197).

REFERENCES

Collingridge PW, Kelly S. (2012). MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments. BMC Bioinformatics 13:117.

Darriba D, Taboada GL, Doallo R, Posada D. (2011). ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164–1165.

Edgar RC. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32:1792–1797.

Guindon S, Gascuel O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biol 52:696–704.

Katoh K, Kuma K-I, Toh H, Miyata T. (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 33:511–518.

Lücker S, Nowka B, Rattei T, Spieck E, Daims H. (2013). The genome of Nitrospina gracilis illuminates the metabolism and evolution of the major marine nitrite oxidizer. Front Microbiol 4:27.

Posada D. (2008). jModelTest: phylogenetic model averaging. Molecular Biology and Evolution 25:1253–1256.

Ronquist F, Huelsenbeck JP. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574.