Microbial community successional patterns in beach sands impacted by the Deepwater Horizon oil spill. L. M. Rodriguez-R, W. A. Overholt, C. Hagan, M. Huettel, J. E. Kostka, and K. T. Konstantinidis SUPPLEMENTARY INFORMATION SUPPORTING METHODS Sample collection and nucleic acid extractions During the 2010 sampling trips, a 10×1×1 meter trench was excavated, samples were collected using cutoff 5 mL syringes exuded into 15 mL Falcon tubes, immediately frozen on dry ice, and stored at -80ºC, as described in detail by Kostka et al. (2011). Samples from June 2011 were collected using sediment cores, sectioned at 10 cm intervals, homogenized, and immediately frozen on dry ice. Total genomic DNA was extracted using a MoBio Powersoil DNA extraction kit (MoBio Laboratories, Carlsbad, CA) according to the manufacturer’s protocol. Due to low DNA yields, DNA extracts were pooled according to depth and similar degree of oiling (Table 1). DNA libraries were prepared and sequenced by the Genomics Sequencing Facility at Los Alamos National Laboratory, using Illumina TruSeq DNA library prep and Illumina HiSeq 2000, with a paired-end strategy and reads of 102 bp in length. 16S rRNA gene amplicon analysis and annotation 16S rRNA gene amplicon libraries were prepared on both pooled and original samples. Genomic DNA was sequenced at the Argonne National Laboratory using Illumina MiSeq with paired-end strategy and reads of 151bp in length, using custom sequencing primers and procedures as previously described (Caporaso et al., 2012). Raw 16S rRNA gene amplicons were processed in the software package QIIME v1.7.0 (Caporaso et al., 2010). Reads were truncated if three consecutive bases had a PHRED score of less than 20, and only reads greater than 113 bases were retained. Reads less than 60% similar to any sequence in the Greengenes release 13-5 database (DeSantis et al., 2006) were discarded. Sequences were clustered into OTUs at 97% 2 similarity using UCLUST (Edgar, 2010). All OTUs that represented < 0.005% of the total sequences were discarded to minimize the effect of sequence errors and focus on the dominant OTUs (Bokulich et al., 2013). Representative sequences of each OTU were classified using the RDP Classifier at 50% confidence (Wang et al., 2007). The number of OTUs were subsequently rarefied using the R package vegan (Oksanen et al., 2013). The final OTU counts at 10,000 reads were used to compute and project complexity curves using preseq (Daley & Smith, 2013) with increasing step size from 2,000 to 10,000 sequences until convergent models were found. Samples that didn’t produce a convergent model at 10,000 sequences step size were ignored in the projection. Sequence trimming, community coverage, and metagenomic assembly Metagenomic sequences were trimmed using SolexaQA (Cox et al., 2010) with a PHRED score cutoff of 20 and a minimum fragment length of 50 bp. Only coupled reads with both sisters longer than 50 bp after trimming were used further. Reads including Illumina adaptors in the 3’ end were also discarded. Average coverage of the community was estimated and projected using Nonpareil, an algorithm that determines coverage based on the level of redundancy of the sequence reads of metagenomes, with default parameters (Rodriguez-R & Konstantinidis, 2014b). Each metagenomic dataset was assembled following the procedures outlined in (Luo et al., 2012). Briefly, sequences were assembled with Velvet (Zerbino & Birney, 2008) and SOAP de novo (Li et al., 2010) using k-mer values ranging from 21 to 63. Three assemblies that maximized the N50 and/or the number of reads used were selected from each method, and combined using Newbler. 3 Annotation of metagenomic fragments Coding sequences were predicted on contigs larger than 500 bp using MetaGeneMark.hmm with default parameters (Zhu et al., 2010) and functionally annotated based on protein-level searches against the SwissProt database (Wu et al., 2006) using the Gene Ontology (GO) terms (Ashburner et al., 2000), and against the SEED database using the Subsystems categories (Overbeek et al., 2005). Query genes were assigned the functional annotation of their best-match in SwissProt containing GO terms or in SEED containing Subsystem categories, based on blastp (Altschul et al., 1990) searches with a minimum bit-score for a match of 60. The bacterial and archaeal taxonomy of the contigs was assessed using MyTaxa (Luo et al., 2014), with default parameters. The input file to MyTaxa was the blastp results of the query genes against the predicted proteins of all closed and draft prokaryotic genomes available in NCBI as of January 2014, with minimum score of 50 bits and alignment length ≥ 80% of the metagenomic protein and only the top-5 hits per query being considered. The fraction of sequences originating from each domain was estimated by directly comparing sequencing reads against UniRef50 (Suzek et al., 2007) using PAUDA (Huson & Xie, 2013) with a minimum score of 60 bits. The assembled genes were used to estimate the minimum generation time using growthpred (Vieira-Silva & Rocha, 2010) with parameters “-b -m -c 0 -r -i -s” separately for each metagenome. Gene functional and taxonomic profiles Functionally and taxonomically annotated genes from different samples were pooled together. The abundance of each gene on each dataset (i.e., the sequencing depth of a gene) was estimated by the number of reads mapped on the genes. Read 4 mapping was performed using BLAT with default parameters (Kent, 2002), considering only the best match with alignment length ≥ 80bp and identity ≥ 97%. Annotation terms and taxa with significantly different abundance between groups of samples were identified using a negative binomial test as implemented in the DESeq2 package (Anders & Huber, 2010). The profiles of eukaryotic taxonomy were estimated based on putative 18S rRNA reads identified by Metaxa (Bengtsson et al., 2011) with e-value below 0.1. To taxonomically identify eukaryotic sequences, a reference 18S rRNA gene tree was first built using high-quality aligned sequences (pintail, sequence, and alignment qualities above 90) from the SILVA reference database r114 (Quast et al., 2012) and FastTree v2.1.7 with GTR model for nucleotides (Price et al., 2010). The tree was next pruned to reduce complexity by iteratively collapsing closely related leaf nodes into single nodes (10 cycles), using a custom R script and the picante library (Kembel et al., 2010). The putative 18S rRNA gene-encoding metagenomic reads identified by Metaxa were next aligned to the reference 18S rRNA gene alignment using MAFFT’s addfragments option with 6mer pairs (Katoh et al., 2002) and placed onto the reference tree with pplacer v1.1.alpha14 (Matsen et al., 2010). The taxonomy of the reads was identified using the Silva taxonomy and ancillary custom scripts and the tools pplacer and taxtastic (Matsen et al., 2010). Complete taxonomic profiles were visualized using Krona (Ondov et al., 2011). The profiles of eukaryotic taxonomy were estimated based on putative 18S rRNA reads identified by Metaxa (Bengtsson et al., 2011) with e-value below 0.1. To taxonomically identify eukaryotic sequences, a reference 18S rRNA gene tree was first built using high-quality aligned sequences (pintail, sequence, and alignment qualities above 90) from the SILVA reference database r114 (Quast et al., 2012) and 5 FastTree v2.1.7 with GTR model for nucleotides (Price et al., 2010). The tree was next pruned to reduce complexity by iteratively collapsing closely related leaf nodes into single nodes (10 cycles), using a custom R script (enve.prune.dist from enveomics.R; http://github.com/lmrodriguezr/enveomics) and the picante library (Kembel et al., 2010). The putative 18S rRNA gene-encoding metagenomic reads identified by Metaxa were next aligned to the reference 18S rRNA gene alignment using MAFFT’s addfragments option with 6mer pairs (Katoh et al., 2002) and placed onto the reference tree with pplacer v1.1.alpha14 (Matsen et al., 2010). The taxonomy of the reads was identified using the Silva taxonomy and ancillary custom scripts and the tools pplacer and taxtastic (Matsen et al., 2010). Complete taxonomic profiles were visualized using Krona (Ondov et al., 2011). Genome equivalents quantification In order to measure the average number of genes per cell with a given functional annotation (genome equivalents), a set of universally conserved singlecopy genes were identified among the assembled gene sequences from the metagenomes. All genes were compared against a collection of 101 HMMs (Dupont et al., 2012), mainly derived from the Genome Property database, entry GenProp0799 “bacterial core gene set, exactly 1 per genome” (Haft et al., 2005), using HMMER3 (http://hmmer.janelia.org/) with default settings and trusted cutoff. Ten models were eliminated because more than one model represented the same gene family (rpoC, pheT, proS, and glyS) or represented extremely conserved families (era and tRNA synthase class I). The median sequencing depth (in reads/bp) of the remaining 91 models was used as the normalizing factor for each dataset. The sequencing depth of genes with a given annotation (see below) was estimated on each dataset (in 6 reads/bp), added up, and divided by the normalizing factor of the corresponding dataset. Hydrocarbon analyses The analyses of the crude oil hydrocarbons in the sediment samples followed the recommended EPA standard procedures. Hydrocarbons were extracted from homogenized sediment sections using an automated Buchi Soxhlet extraction system according to EPA method 3541. The extraction with a hexane:acetone 1:1 v/v mixture was followed by a rinsing step with the same solvent mixture. The solvents containing the extracts were concentrated using a Turbovap-II rapid solvent evaporator (55ºC) to less than 5 mL, and then transferred to cyclohexane by adding 20 mL of that solvent. Concentration was continued until the sample was reduced to roughly a volume of 2 mL that then was cleaned according to the EPA 3630C method. The cleaned hydrocarbon extracts were analyzed on an Agilent 7890A Series GC (Column: Agilent HP-5MS, 30 m × 250 μm × 0.25 μm) with Agilent 7890A FID. To determine the concentrations of aromatic hydrocarbons, samples were injected into an Agilent 7890A Series GC (Column: Agilent DB-EUPAH, 20 m × 180 μm × 0.14 μm) coupled to an Agilent 7000 triple quadrupole MS system. 7 SUPPLEMENTARY FIGURES A Nonpareil curves for metagenomes Estimated average coverage (%) 100 90 80 70 60 50 40 30 pre (4) A, B, C (3) D (1) 20 10 E, F, G (3) H (1) I, J (4) 0 100 M 500 M 97% OTUs (x 1,000) B 1G 5G Sequencing effort (bp) 10 G 50 G 100 G Rarefaction curves for 16S rRNA gene amplicons 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 pre (5) A (2) B (3) 100 D (3) E (5) G (2) 1K H (2) I (7) J (7) 10 K 100 K 1M 10 M Sequencing effort (Reads) Supplementary Figure S1: Sequence coverage of the datasets used in this study. (A) Nonpareil curves for the metagenomic datasets in this study. The abundanceweighted average coverage was estimated based on the redundancy of the reads for different sequencing efforts (solid lines) using the Nonpareil algorithm (Rodriguez-R & Konstantinidis, 2014b). The inter-quartile range of 1,024 replicates is presented as shaded areas, and the estimated coverage of the entire datasets is presented as empty circles. The curves were projected using a cumulative gamma distribution, and the convergent models are presented as dashed lines. The horizontal red line indicates 8 95% average coverage. Nonpareil diversity index for pre-oil samples ranged between 20.35 and 20.92 (average 20.73), for oiled samples ranged between 20.65 and 21.31 (average 21.07), and for recovered samples ranged between 22.03 and 22.63 (average 22.39), indicating and increase in sequence complexity from pre-oil to oiled samples and from oiled to recovered samples. (B) Rarefaction curves for 97% OTUs in 16S rRNA gene amplicon datasets were constructed using the vegan R package (Oksanen et al., 2013). The colored areas indicate the 95% confidence intervals. Curves were projected using preseq (Anders & Huber, 2010), and convergent models are presented as dashed lines. Samples are named after the corresponding metagenomic dataset, and the number of libraries (replicates) for each group is indicated in parenthesis (legends). All samples were colored following table 1: Pre-oil samples (green), Oiled samples from July 2010 (mauve), Weathered oil samples from July 2010 (red), Oiled samples from October 2010 (sea green), Weathered oil samples from October 2010 (pink), and recovered samples (olive). Note that samples with lower Nonpareil curves (A; olive), indicating higher diversity, correspond to those with highest estimated richness (B). 9 0 ●0 ●0 0 ● May-2010 ● Pre-spill 5 ●0 Jul−2010 ● Pre-spill ●15 ●25 15 ● 35 ● 35 ● 0 ● 35● ● 35 Jul−2010 Oiled 15 ● 55 ● 55 ● 40 ● Jun−2011 Recovered 60 ●35● 50 35 ● ●55 ● 35 40 ●● 30 40 55 ● ● ● 35 45 ● 40 ● ● Oct−2010 Oiled Jul-2010 Weathered oil ● 50 55 ● 50 ● Oct−2010 Weathered oil Supplementary Figure S2: Sampling depth played a limited role in structuring beach microbial communities relative to oiling. Non-metric multidimensional scaling plot generated from weighted-UNIFRAC distances based on 16S rRNA OTUs defined at 97% identity (2D-stress 14%). Samples were colored by groups following Table 1, with the additional group Jul-02-10 (Pre-spill) in light olive. Depth (cm) for each sample is indicated with red text. The graph shows that the presence/absence of oil had the largest effect on microbial community structure, followed by sampling date. Depth played a relatively small role in structuring beach sand microbial communities (ADONIS partial R2 0.03) compared to oiling status and sampling date 10 (ADONIS partial R2 0.754). For instance, samples collected on June 14th, 2011 reveal surficial and deep sands are more similar to each other than to the oiled layers. The pre-oiled metagenomes used in the analysis were obtained from the May, 2010 samples. 11 E pre 7e-6 0.08 2e-4 2e-6 1e-4 2e-12 pre pre pre pre A A B B B D D E E E E E G G H H I I I I I I I J J J J J J J 0.007 0.6 5e-6 pre pre pre pre A A B B B D D E E E E E G G H H I I I I I I I J J J J J J J 0.7 7e-4 97% 16S rRNA OTUs 0 100 200 300 Diversity (equivalent GO terms) D pre 0.004 pre pre pre pre A A B B B D D E E E E E G G H H I I I I I I I J J J J J J J 0.006 C pre 0.002 0.002 0 100 200 300 400 Diversity (equivalent subsystems) 0.53 0.16 S1 S2 S3 S4 A B C D E F G H I600 I606 J598 J604 0.01 0.003 GO: molecular function B S2 S3 S4 A B C D E F G H I600 I606 J598 J604 0.12 Subsystems A S1 27,948 0 1,000 2,000 Diversity (equivalent OTUs) 0 104 2x104 0 Richness (OTUs) 0.1 0.2 Evenness 0.3 Supplementary Figure S3: Effect of community recovery in taxonomic and functional diversity. The horizontal colored bars represent the α component of functional (A-B) and taxonomic (C-E) diversity, and the grey horizontal bars represent the corresponding γ component. Comparisons between groups were performed using two-sided t-tests, and the p-values are reported for the following comparisons: Pre-oil vs Oiled, Oiled vs Recovered, and Oiled from July 2010 vs Oiled from October 2010. Samples were colored by groups following Table 1. 12 A 24 52 22 Minimum doubling time (h) 20 18 14 12 10 6 1192 446 384 362 4 0 Average genome size (Mbp) 474 8 2 B 38 16 331 92 767 653 603 632 47 736 S1 S2 S3 S4 A B C D E F G H 49 I I J J 50 40 30 20 10 5 4 3 2 1 Supplementary Figure S4: Estimated minimum doubling time and average genome size of the microbial communities sampled in this study. The minimum doubling time was estimated using growthpred (Vieira-Silva & Rocha, 2010) and the values are presented for each individual metagenomic sample. Open circles indicate average value, and the bars indicate the range of mean ± 1.96 × standard deviation (approximating the 95% confidence interval). The number of identified ribosomal 13 genes is indicated below the bars. Samples were colored by groups following Table 1. The estimated optimal temperature was in the range 29-37ºC (34.2 ± 2.5 ºC), except for J604 (40ºC). The average genome size was estimated based on the detected abundance of conserved single-copy genes in the bacterial and archeal fraction. Open circles indicate median value between estimators (single-copy genes) and the bars indicate the range of the 95% confidence interval. In both panels, the horizontal lines indicate the average per group of samples. 14 SUPPLEMENTARY FILES Supplementary File S1: Assembly, gene prediction, and read mapping statistics per dataset. This file (File_S1.Stats.xlsx) is a spreadsheet with a single tab in Microsoft Excel format. Supplementary File S2: Distribution of bacterial and archaeal taxa in the studied metagenomes. This self-contained website presents an interactive view of the different bacterial (hues of red) and archaeal (hues of green) taxa in the metagenomic samples in the form of hierarchical pie charts. The size of the sections indicates the abundance (as number of reads) in the datasets based on the classification of reads derived from MyTaxa. The different samples can be explored in the top-left list. This file (File_S2.BacArc-OS.html) was produced with Krona (Ondov et al., 2011). Supplementary File S3: Distribution of eukaryotic taxa in the studied metagenomes. This self-contained website presents an interactive view of the different eukaryotic taxa in the metagenomic samples in the form of hierarchical pie charts. The size of the sections indicates the abundance (as number of reads) in the datasets based on the classification of reads derived from PAUDA. The different samples can be explored in the top-left list. This file (File_S3.Euk-OS.html) was produced with Krona (Ondov et al., 2011). Supplementary File S4: Comparisons of category abundances between groups of datasets. Tabs are named for the comparison performed and the groups of datasets compared. The categories correspond to GO terms from the Molecular Function (MF), Biological Process (BP), and Cellular Component (CC) domains; and the Subsystems categories (SS). The groups of datasets contrasted were, following table 15 1, Pre-oil from May 2010 (P), Oiled from July and October 2010 (O), Recovered from July 2011 (R), Oiled from July 2010 (O1), and Oiled from October 2010 (O2). Comparisons were performed using the DESeq2 package, and the columns indicate the base mean abundance after normalization (baseMean), the log2 of the fold-change between compared groups (log2FoldChange) and its estimated standard error (lfcSE), the statistic for the negative binomial test (stat), the corresponding p-value (pvalue), and its correction for multiple testing (padj). Only categories with significant differences are reported (p-value adjusted ≤ 0.05), and the rows are colored by the group of datasets in which the corresponding category is significantly more abundant. This file (File_S4.Annotation.xlsx) is a spreadsheet with 16 tabs in Microsoft Excel format. 16 SUPPLEMENTARY REFERENCES Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990). Basic local alignment search tool. J Mol Biol 215:403–410. Anders S, Huber W. (2010). Differential expression analysis for sequence count data. Genome Biol 11:R106. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29. Bengtsson J, Eriksson KM, Hartmann M, Wang Z, Shenoy BD, Grelet G-A, et al. (2011). Metaxa: a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets. Antonie Van Leeuwenhoek 100:471–475. Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, Knight R, et al. (2013). Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods 10:57–59. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:335–336. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Huntley J, Fierer N, et al. (2012). Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J 6:1621–1624. Cox MP, Peterson DA, Biggs PJ. (2010). SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485. Daley T, Smith AD. (2013). Predicting the molecular complexity of sequencing libraries. Nat Methods 10:325–327. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. (2006). Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol 72:5069–5072. Dupont CL, Rusch DB, Yooseph S, Lombardo M-J, Richter RA, Valas R, et al. (2012). Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J 6:1186–1199. Edgar RC. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461. 17 Haft DH, Selengut JD, Brinkac LM, Zafar N, White O. (2005). Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinforma Oxf Engl 21:293–306. Huson DH, Xie C. (2013). A poor man’s BLASTX—high-throughput metagenomic protein database search using PAUDA. Bioinformatics btt254. Katoh K, Misawa K, Kuma K, Miyata T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066. Kembel SW, Cowan PD, Helmus MR, Cornwell WK, Morlon H, Ackerly DD, et al. (2010). Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26:1463–1464. Kent WJ. (2002). BLAT--the BLAST-like alignment tool. Genome Res 12:656–664. Kostka JE, Prakash O, Overholt WA, Green SJ, Freyer G, Canion A, et al. (2011). Hydrocarbon-Degrading Bacteria and the Bacterial Community Response in Gulf of Mexico Beach Sands Impacted by the Deepwater Horizon Oil Spill. Appl Environ Microbiol 77:7962–7974. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. (2010). De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20:265– 272. Luo C, Rodriguez-R LM, Konstantinidis KT. (2014). MyTaxa: an advanced taxonomic classifier for genomic and metagenomic sequences. Nucleic Acids Res 42:e73–e73. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT. (2012). Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PloS One 7:e30087. Matsen FA, Kodner RB, Armbrust EV. (2010). pplacer: linear time maximumlikelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11:538. Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, et al. (2013). vegan: Community Ecology Package. http://cran.rproject.org/web/packages/vegan/index.html (Accessed November 13, 2013). Ondov BD, Bergman NH, Phillippy AM. (2011). Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12:385. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang H-Y, Cohoon M, et al. (2005). The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes. Nucleic Acids Res 33:5691–5702. 18 Price MN, Dehal PS, Arkin AP. (2010). FastTree 2 – Approximately MaximumLikelihood Trees for Large Alignments. PLoS ONE 5:e9490. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. (2012). The SILVA ribosomal RNA gene database project: improved data processing and webbased tools. Nucleic Acids Res 41:D590–D596. Rodriguez-R LM, Konstantinidis KT. (2014). Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics 30:629–635. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. (2007). UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23:1282–1288. Vieira-Silva S, Rocha EPC. (2010). The Systemic Imprint of Growth and Its Uses in Ecological (Meta)Genomics. PLoS Genet 6:e1000808. Wang Q, Garrity GM, Tiedje JM, Cole JR. (2007). Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol 73:5261–5267. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, et al. (2006). The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34:D187–191. Zerbino DR, Birney E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829. Zhu W, Lomsadze A, Borodovsky M. (2010). Ab initio gene identification in metagenomic sequences. Nucleic Acids Res 38:e132. 19