SUPPLEMENTARY MATERIAL

for

Novel Microbial Populations in Ambient and Mesophilic Biogas-producing and Phenol-degrading Consortia Unraveled by High-throughput Sequencing

Feng Ju, Tong Zhang*

Environmental Biotechnology Laboratory, Department of Civil Engineering, The University of Hong Kong, Hong Kong SAR

Submitted to Microbial Ecology

*Corresponding author phone: +852-28578551; fax: +852-25595337; e-mail: zhangt@hku.hk

Supporting Information S1: Specific methanogenic activity tests
Supporting Information S2: Bioinformatics analysis

Tables
Table S1 454 pyrosequencing datasets of the seed sludge, AT and MT enrichments
Table S2 Illumina sequencing data of the seed sludge, AT and MT enrichments
Table S3 Total bacterial classes shared among the seed sludge, AT and MT enrichments
Table S4 Total bacterial genera shared among the seed sludge, AT and MT enrichments

Figures
Figure S1 The flowchart of data processing in this study
Figure S2 Variation of phenol, VFAs and alcohols concentrations with time in the AT (a) and MT (b) reactors during Batch 18
Figure S3 Phenol-degrading profiles at different initial concentrations in SMA tests
Figure S4 Rarefaction curves of the seed sludge, ambient and mesophilic enrichments at dissimilarity cutoffs of 3% (a) and 6% (b)
Figure S5 Shift in Phylum Proteobacteria before and after enrichment under ambient and mesophilic conditions
Figure S6 Phylogenetic trees of 16S rRNA gene sequences constructed for the most abundant species detected in the AT (a) and MT (b) phenol-degrading enrichments
Figure S7 Rank-abundance curves for bacterial genera in the seed sludge, AT and MT reactors, respectively

Supporting Information S1: Specific methanogenic activity tests

The specific methanogenic activity (SMA) tests of the phenol-degrading sludge were conducted in 166 ml batch serum bottles (working volume 50 ml) at 20 and 37 °C, respectively.
The sludge for the batch tests was sampled from the AT and MT reactors on Day 193, when the AT and MT sludge could tolerate phenol concentrations as high as 875 and 1000 mg·L⁻¹ (Figure S2), corresponding to phenol loadings of 365 and 417 mg·L⁻¹·d⁻¹, respectively, while removing almost 100% of the phenol. Initial phenol concentrations in the SMA tests varied from 100 to 1000 mg·L⁻¹, and phenol depletion was monitored until the phenol concentration fell below the detection limit. The sludge concentration was determined at the end of each test; the volatile suspended solids concentration in each batch was 0.55 g·L⁻¹ for the AT sludge and 0.73 g·L⁻¹ for the MT sludge.

Supporting Information S2: Bioinformatics analysis

(1) Processing of high-throughput sequencing datasets

The sequencing data in this study were analyzed following the procedures shown in Figure S1. The 454 pyrosequencing datasets were analyzed with the QIIME (Quantitative Insights Into Microbial Ecology, v 1.5.0) pipeline [1]. First, the 454 reads (sequences) were separated into samples based on their nucleotide barcodes. Then, the sequences in each sample were denoised by AmpliconNoise using the default parameters, except that the Perseus algorithm for chimera removal was disabled. After that, chimera checking was performed using ChimeraSlayer [2]. The reads remaining after denoising and chimera removal are referred to as “effective reads”. Although bacteria-specific primers were used, a very small number of undesired archaeal reads were still obtained. To exclude them, the effective reads of each sample were submitted to the online RDP Classifier [3] to identify the archaeal and bacterial reads, and the archaeal reads were discarded. To compare all samples fairly at the same sequencing depth, the bacterial sequence number was normalized by randomly extracting 8150 sequences from each 454 dataset.
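As a hedged illustration of the normalization step described above, the following Python sketch draws the same number of reads at random, without replacement, from each dataset. It is not the authors' script: the function name `subsample_reads`, the fixed seed and the placeholder read identifiers are assumptions made for illustration. Only the bacterial read counts (Table S1) and the depth of 8150 come from this document.

```python
import random


def subsample_reads(reads, depth, seed=None):
    """Return `depth` reads drawn at random without replacement."""
    if depth > len(reads):
        raise ValueError("dataset has fewer reads than the requested depth")
    rng = random.Random(seed)  # seeded only so that the sketch is repeatable
    return rng.sample(reads, depth)


# Bacterial read counts per sample, taken from Table S1.
BACTERIAL_READS = {"SEED": 14197, "AT": 11292, "MT": 8165}
DEPTH = 8150

# Hypothetical placeholder identifiers stand in for the real sequences.
datasets = {
    name: [f"{name}_read_{i}" for i in range(count)]
    for name, count in BACTERIAL_READS.items()
}

normalized = {
    name: subsample_reads(reads, DEPTH, seed=1)
    for name, reads in datasets.items()
}

for name, reads in normalized.items():
    print(name, len(reads))
```

Sampling without replacement keeps every retained read unique, so each normalized dataset is a true subset of the corresponding set of bacterial reads.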
For the metagenomic dataset, reads containing one or more uncalled bases, or containing bases with a quality score < 30, were removed. Then, the reads were de-replicated to remove the duplicate and near-duplicate reads derived from Illumina sequencing [4], following the MG-RAST guideline [5]. After that, all paired-end (PE) reads were merged, allowing a minimum overlap region of 10 bp. The sequences obtained after read overlapping are referred to as “tags” and were used for downstream analysis.

(2) Taxonomic analysis

The bacterial composition was analyzed based on the bacterial 16S rRNA sequences obtained from 454 pyrosequencing. The bacterial sequences were searched against the GreenGenes database using NCBI’s BLASTN tool at an e-value cutoff of 1e-20 to identify the 16S rRNA gene fragments. All bacterial sequences with hits of e-value < 1e-20 were used for taxonomic analysis. The top 100 hits of all qualified bacterial sequences in each sample were imported into MEGAN and annotated by the Lowest Common Ancestor (LCA) algorithm using the default parameters, except that the Percent Identity Filter was activated. This filter is specifically designed to filter 16S rRNA gene sequences by similarity based on the following principle: the percent identity of a match must exceed the given value to be assigned at the given rank: Species 99%, Genus 97%, Family 95%, Order 90%, Class 85%, and Phylum 80%.

The archaeal populations were analyzed using the 16S rRNA gene tags identified from the metagenomic tags by performing BLASTN against two 16S rRNA gene databases, GreenGenes and SILVA SSU, at an e-value cutoff of 1e-20. To minimize short random similarities, only 16S rRNA gene tags with a read length of 150-190 bp and an alignment length of > 100 bp were used for taxonomic analysis with the MEGAN-LCA strategy, following the above procedures [6].
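The rank thresholds of the Percent Identity Filter described above can be illustrated with a short Python sketch. It only mirrors the threshold list given in the text and is not MEGAN code: the function name `deepest_rank` is hypothetical, and reading "must exceed" as a strict inequality is an assumption, since the text does not say how a match exactly at a threshold is treated.

```python
# Minimum percent identity for each rank, as listed in the text,
# ordered from the most specific rank to the least specific one.
RANK_THRESHOLDS = [
    ("Species", 99.0),
    ("Genus", 97.0),
    ("Family", 95.0),
    ("Order", 90.0),
    ("Class", 85.0),
    ("Phylum", 80.0),
]


def deepest_rank(percent_identity):
    """Return the most specific rank a match may be assigned to, or None."""
    for rank, threshold in RANK_THRESHOLDS:
        # Assumption: "must exceed" is read as strictly greater than.
        if percent_identity > threshold:
            return rank
    return None  # below 80%: not assignable even at phylum level


for identity in (99.6, 97.8, 92.0, 81.5, 75.0):
    print(identity, deepest_rank(identity))
```

Under this reading, a hit with 97.8% identity can support an assignment no deeper than genus, while a hit with 75% identity is not used at any of the listed ranks.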
(3) Construction of rarefaction curves and phylogenetic trees

The normalized 8150 bacterial sequences of the seed sludge, AT and MT enrichments were individually submitted to the RDP Pyrosequencing Pipeline [7], in which they were first aligned by Infernal according to the bacteria-alignment model in the Align module of the RDP [8], then assigned to phylotype clusters at dissimilarity cutoffs of 3% and 6% using Complete Linkage Clustering; finally, the rarefaction curves were computed from the clusters.

The phylogenetic trees were constructed for the most abundant species detected in the AT (Figure S6a) and MT (Figure S6b) phenol-degrading enrichments, together with reference sequences retrieved from GenBank and SILVA SSU 111, using the MEGA5 package [9]. In brief, the relevant sequences were extracted from the 454 datasets and aligned using the ClustalW program provided in the MEGA5 package, and the trees were constructed using the neighbor-joining algorithm with the Jukes-Cantor model (1000 bootstrap replicates).

References

1. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7: 335-336.
2. Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E (2011) Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 21: 494-504.
3. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73: 5261-5267.
4. Burriesci MS, Lehnert EM, Pringle JR (2012) Fulcrum: condensing redundant reads from high-throughput sequencing studies. Bioinformatics 28: 1324-1327.
5. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9: 386.
6. Ghai R, Rodríguez-Valera F, McMahon KD, Toyama D, Rinke R, de Oliveira TCS, Garcia JW, de Miranda FP, Henrique-Silva F (2011) Metagenomics of the water column in the pristine upper course of the Amazon river. PLoS One 6: e23785.
7. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen A, McGarrell D, Marsh T, Garrity GM (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141-D145.
8. Nawrocki EP, Eddy SR (2007) Query-dependent banding (QDB) for faster RNA similarity searches. PLoS Comput Biol 3: e56.
9. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28: 2731-2739.

Table S1 454 pyrosequencing datasets of the seed sludge, AT and MT enrichments

                       SEED      AT      MT
Raw reads             19170   24331   18419
Denoised reads        14250   11364    8305
Effective reads¹      14238   11345    8289
Bacterial reads       14197   11292    8165
Archaeal reads           41      53      15
Normalized reads       8150    8150    8150
OTUs at 3%²             381     150     106
OTUs at 6%²             315     125      91

¹ Effective reads refers to denoised reads after removing chimeric sequences.
² The diversity measurement (OTUs) was performed at a rarefaction depth of 8150 sequences using the RDP Pyrosequencing Pipeline Rarefaction Tool.
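Footnote 2 of Table S1 refers to OTU counts rarefied to 8150 sequences. As a rough sketch of what a rarefaction calculation involves, the Python below evaluates the classical analytical expectation of the number of OTUs observed in a random subsample of n sequences, E[S_n] = Σ_i (1 − C(N − N_i, n) / C(N, n)), where N_i is the size of OTU i and N is the total number of sequences. This is not the code of the RDP Rarefaction Tool, whose internal method is not described here; the function name `expected_otus` and the OTU sizes are hypothetical toy values.

```python
from math import comb


def expected_otus(otu_sizes, n):
    """Expected number of OTUs seen when n sequences are drawn at random
    without replacement from a community with the given OTU sizes."""
    total = sum(otu_sizes)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total sequence count")
    all_draws = comb(total, n)
    # For each OTU: 1 minus the probability that none of its members is drawn.
    return sum(1 - comb(total - size, n) / all_draws for size in otu_sizes)


# Hypothetical toy community: 4 OTUs containing 10 sequences in total.
toy_sizes = [5, 3, 1, 1]
curve = [expected_otus(toy_sizes, n) for n in range(sum(toy_sizes) + 1)]

for n, value in enumerate(curve):
    print(n, round(value, 3))
```

Plotting `curve` against n gives a rarefaction curve; the flatter its end, the fewer new OTUs are expected from additional sequencing.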
Table S2 Illumina metagenomic sequencing datasets of the seed sludge, AT and MT enrichments

Category   Item                                  SEED1*       SEED2*           AT         MT1*         MT2*
Tags¹      Assembled gene tags (~190 bp)     13,192,974   15,060,113   13,113,118   13,119,590   13,106,796
           Normalized gene tags (~190 bp)    13,106,796   13,106,796   13,106,796   13,106,796   13,106,796
           16S rRNA gene tags² (150-190 bp)       9,469        9,913        9,614        8,699        8,516

Reads      Raw (100 bp) and After de-replication (values as extracted, in source order):
           17,655,000x2 14,444,685x2 17,560,000x2 14,394,059x2 21,475,060x2 19,282,148x2 20,313,343x2 18,335,191 17,665,000x2 15,285,922x2

¹ Tags means the sequences obtained from overlapping the PE metagenomic reads.
² 16S rRNA gene tags were identified by BLASTN against GreenGenes at an e-value cutoff of 1e-20.
* SEED1 & SEED2 and MT1 & MT2 were the replicate metagenomic datasets derived from DNA replicates extracted from the seed sludge and MT enrichment, respectively.

Table S3 Total bacterial classes shared among the seed sludge, AT and MT enrichments. The results are based on taxonomic analysis of the 454 datasets. The number inside the bracket of the first column indicates the number of classes shared.
Shared by¹       Class                     Reads number         Percent (%) in total bacterial reads
                                          SEED    AT    MT    SEED      AT      MT
AT&MT&SEED (10)  Bacteroidia               331  1341   120    4.06   16.45    1.47
                 Deltaproteobacteria       183  1262  1652    2.25   15.48   20.27
                 Clostridia               1276   730   681   15.66    8.96    8.36
                 Synergistia               172   391   415    2.11    4.80    5.09
                 Actinobacteria            452   369    78    5.55    4.53    0.96
                 Alphaproteobacteria       835   203    29   10.25    2.49    0.36
                 Gammaproteobacteria       308   157    47    3.78    1.93    0.58
                 Anaerolineae             1464    54   305   17.96    0.66    3.74
                 Betaproteobacteria        256    52  1051    3.14    0.64   12.90
                 Spirochaetia              189    32    27    2.32    0.39    0.33
AT&MT (4)        Epsilonproteobacteria       0  2640   795    0.00   32.39    9.75
                 WWE1                        0   121  2040    0.00    1.48   25.03
                 Thermomicrobia              0    67    17    0.00    0.82    0.21
                 Erysipelotrichi             0     7    56    0.00    0.09    0.69
AT&SEED (2)      Sphingobacteriia          103    20     0    1.26    0.25    0.00
                 Negativicutes              37     8     0    0.45    0.10    0.00
MT&SEED (1)      Bacilli                   115     0    12    1.41    0.00    0.15
AT (3)           Chlorobia                   0   363     0    0.00    4.45    0.00
                 Solibacteres                0     9     0    0.00    0.11    0.00
                 Elusimicrobia               0     6     0    0.00    0.07    0.00
MT (1)           Thermotogae                 0     0    13    0.00    0.00    0.16
SEED (23)        Caldilineae               239     0     0    2.93    0.00    0.00
                 Opitutae                   93     0     0    1.14    0.00    0.00
                 Planctomycetia             89     0     0    1.09    0.00    0.00
                 Aquificae                  53     0     0    0.65    0.00    0.00
                 Verrucomicrobiae           47     0     0    0.58    0.00    0.00
                 Cytophagia                 41     0     0    0.50    0.00    0.00
                 Deinococci                 36     0     0    0.44    0.00    0.00
                 Acidobacteriia             22     0     0    0.27    0.00    0.00
                 Chlamydiia                  8     0     0    0.10    0.00    0.00
                 Nitrospira                  5     0     0    0.06    0.00    0.00

¹ The number inside the bracket indicates the number of classes shared.

Table S4 Total bacterial genera shared among the seed sludge, AT and MT enrichments. The results are based on taxonomic analysis of the 454 datasets. The number inside the bracket of the first column indicates the number of genera shared.
Shared by¹       Genus                               Reads number         Percent (%) in total bacterial reads
                                                    SEED    AT    MT    SEED      AT      MT
AT&MT&SEED (7)   Syntrophorhabdus                      7   669  1433    0.09    8.21   17.58
                 Synergistes                          23   122    29    0.28    1.50    0.36
                 Mycobacterium                       139   119    23    1.71    1.46    0.28
                 Aminobacterium                        5   111   133    0.06    1.36    1.63
                 Acinetobacter                        35    58    38    0.43    0.71    0.47
                 T78                                 834    24   158   10.23    0.29    1.94
                 Brevundimonas                        10    12    11    0.12    0.15    0.13
AT&MT (7)        Campylobacterales-related genus       0  2608   795    0.00   32.00    9.75
                 Pelotomaculum                         0   514    89    0.00    6.31    1.09
                 Desulfovibrio                         0   329    90    0.00    4.04    1.10
                 Syntrophus                            0   162    99    0.00    1.99    1.21
                 W22                                   0   121  2040    0.00    1.48   25.03
                 Desulfomicrobium                      0    20     8    0.00    0.25    0.10
                 Bellilinea                            0    11    30    0.00    0.13    0.37
AT&SEED (3)      Rhodoplanes                          22     9     0    0.27    0.11    0.00
                 Levilinea                           122     8     0    1.50    0.10    0.00
                 Iamia                                14    12     0    0.17    0.15    0.00
MT&SEED (3)      Thermovirga                          20     0   232    0.25    0.00    2.85
                 Clostridium                          15     0    11    0.18    0.00    0.13
                 Syntrophobacter                      50     0     9    0.61    0.00    0.11
AT (10)          Rhodopseudomonas                      0   143     0    0.00    1.75    0.00
                 Chlorobaculum                         0    58     0    0.00    0.71    0.00
                 Rhodococcus                           0    43     0    0.00    0.53    0.00
                 Geobacter                             0    41     0    0.00    0.50    0.00
                 Sulfuricurvum                         0    23     0    0.00    0.28    0.00
                 Treponema                             0    19     0    0.00    0.23    0.00
                 Alishewanella                         0    11     0    0.00    0.13    0.00
                 Candidatus Solibacter                 0     9     0    0.00    0.11    0.00
                 Arcobacter                            0     9     0    0.00    0.11    0.00
                 Thauera                               0     5     0    0.00    0.06    0.00
MT (11)          Brachymonas                           0     0   584    0.00    0.00    7.17
                 Moorella                              0     0   489    0.00    0.00    6.00
                 Thermonema                            0     0    95    0.00    0.00    1.17
                 Turicibacter                          0     0    56    0.00    0.00    0.69
                 Alcaligenes                           0     0    23    0.00    0.00    0.28
                 Corynebacterium                       0     0    11    0.00    0.00    0.13
                 Rhodobacter                           0     0    10    0.00    0.00    0.12
                 Fervidobacterium                      0     0    10    0.00    0.00    0.12
                 Bacillus                              0     0     9    0.00    0.00    0.11
                 Arthrobacter                          0     0     7    0.00    0.00    0.09
                 Pseudomonas                           0     0     7    0.00    0.00    0.09
SEED (29)        Sedimentibacter                     524     0     0    6.43    0.00    0.00
                 Caldilinea                          239     0     0    2.93    0.00    0.00
                 Spirochaeta                         160     0     0    1.96    0.00    0.00
                 Lysobacter                           82     0     0    1.01    0.00    0.00
                 Syntrophomonas                       78     0     0    0.96    0.00    0.00
                 Streptococcus                        61     0     0    0.75    0.00    0.00
                 Bradyrhizobium                       56     0     0    0.69    0.00    0.00
                 Novosphingobium                      51     0     0    0.63    0.00    0.00
                 Parabacteroides                      50     0     0    0.61    0.00    0.00
                 Butyrivibrio                         47     0     0    0.58    0.00    0.00
                 Planctomyces                         47     0     0    0.58    0.00    0.00
                 Adhaeribacter                        41     0     0    0.50    0.00    0.00
                 Weissella                            37     0     0    0.45    0.00    0.00
                 Deinococcus                          36     0     0    0.44    0.00    0.00
                 Sphingomonas                         35     0     0    0.43    0.00    0.00
                 Zoogloea                             28     0     0    0.34    0.00    0.00
                 Aminiphilus                          23     0     0    0.28    0.00    0.00
                 Candidatus Microthrix                20     0     0    0.25    0.00    0.00
                 Pedomicrobium                        16     0     0    0.20    0.00    0.00
                 Gemmata                              15     0     0    0.18    0.00    0.00
                 Longilinea                           14     0     0    0.17    0.00    0.00
                 Flavisolibacter                      13     0     0    0.16    0.00    0.00
                 Dehalobacterium                      11     0     0    0.13    0.00    0.00
                 Prosthecobacter                      10     0     0    0.12    0.00    0.00
                 Propionibacterium                     8     0     0    0.10    0.00    0.00
                 Selenomonas                           8     0     0    0.10    0.00    0.00
                 Thermomonas                           6     0     0    0.07    0.00    0.00
                 Nitrospira                            5     0     0    0.06    0.00    0.00
                 Thermosinus                           5     0     0    0.06    0.00    0.00

¹ The number inside the bracket indicates the number of genera shared.

Figure S1 The flowchart of data processing in this study. The green and reddish-brown squares show the analysis procedures for the 454 datasets and the metagenomic datasets, respectively. The purple square shows the method used for handling both the 454 and metagenomic datasets.

[Figure S2, panels (a) and (b). x-axis: Time (days); y-axis: Concentration (mg/L); series: Phenol, Benzoate, Ethanol, Butanol, Acetic acid]

Figure S2 Variation of phenol, VFAs and alcohols concentrations with time in the AT (a) and MT (b) reactors during Batch 18

[Figure S3, panels (a) and (b). x-axis: Time (days); y-axis: Phenol concentration (mg/L); series: initial phenol concentrations of 100, 200, 400, 600 and 1000 mg/L]

Figure S3 Phenol-degrading profiles at different initial concentrations in SMA tests using sludge collected from the AT (a) and MT (b) reactors, respectively.
[Figure S4, panels (a) and (b). x-axis: Number of sequences; y-axis: OTUs; series: Seed sludge, AT, MT; end-point slope labels as extracted: 0.88%, 0.44%, 0.66%, 0.22%, 0.22%, 0.11%]

Figure S4 Rarefaction curves of the seed sludge, ambient and mesophilic enrichments at dissimilarity cutoffs of 3% (a) and 6% (b). The rarefaction curves were computed using the RDP Pyrosequencing Pipeline Rarefaction Tool. The samples are arranged in descending order of OTU number. The value above each curve is the slope at its end point, which indicates the increase in the number of novel OTUs for every additional 100 sequences.

Figure S5 Shift in Phylum Proteobacteria before and after enrichment at ambient and mesophilic temperatures. The percentage in brackets indicates the relative abundance of Proteobacteria in the total bacterial population of each sample.

Figure S6 Phylogenetic trees of 16S rRNA gene sequences constructed for the most abundant species detected in the AT (a) and MT (b) phenol-degrading enrichments. The relevant sequences were aligned using the ClustalW program provided in the MEGA5 package. The trees were constructed using the neighbor-joining algorithm with the Jukes-Cantor model (1000 bootstrap replicates).

[Figure S7. x-axis: Genus rank; y-axis: Sequence abundance (ticks doubling from 4 to 2048); series: AT (20 °C), MT (37 °C), Seed sludge]

Figure S7 Rank-abundance curves for bacterial genera in the seed sludge, AT and MT reactors, respectively. The abundance is represented by the number of sequences assigned to each genus in the 454 datasets of 16S rRNA gene sequences.
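The ranking behind a rank-abundance curve such as Figure S7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to draw the figure: the function name `rank_abundance` is hypothetical, and the input is a small subset of the AT read counts from Table S4 rather than the full genus list.

```python
def rank_abundance(counts):
    """Return (rank, genus, reads) tuples sorted by decreasing read count.

    Genera with zero reads are dropped, since they are absent from the sample.
    """
    detected = [(genus, reads) for genus, reads in counts.items() if reads > 0]
    detected.sort(key=lambda item: item[1], reverse=True)
    return [
        (rank, genus, reads)
        for rank, (genus, reads) in enumerate(detected, start=1)
    ]


# Subset of AT read counts taken from Table S4 (illustration only).
AT_COUNTS = {
    "Syntrophus": 162,
    "Syntrophorhabdus": 669,
    "Desulfovibrio": 329,
    "Pelotomaculum": 514,
    "Brachymonas": 0,  # detected only in the MT enrichment
}

NORMALIZED_READS = 8150  # sequencing depth after normalization (Table S1)

for rank, genus, reads in rank_abundance(AT_COUNTS):
    percent = 100 * reads / NORMALIZED_READS
    print(rank, genus, reads, f"{percent:.2f}%")
```

Dividing each count by the normalized depth of 8150 gives the percentages reported in Tables S3 and S4; for example, 669 reads of Syntrophorhabdus correspond to 8.21%.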