April 12, 2010

Jizhong Zhou, PhD
Editor, Applied and Environmental Microbiology (AEM)

Re: Diversity of 16S rRNA genes within individual prokaryotic genomes (AEM02953-09 Version 1)

Dear Dr. Zhou,

Thank you for reviewing our manuscript, “Diversity of 16S rRNA genes within individual prokaryotic genomes” (AEM02953-09 Version 1). We appreciate the opportunity to have the updated manuscript considered for re-review. The reviewers’ comments have been very helpful in improving the manuscript’s quality and content. Please find below an item-by-item disposition of each of the reviewers’ suggestions, followed by our responses.

Sincerely,
Zhiheng Pei, MD, PhD

REVIEWER 1:

1. Several times the manuscript give 3% difference in 16S rRNA as the species boundary "by general definition" (p 17 li 19), citing the 1994 work of Stackebrandt and Goebel. The authors are apparently unaware that Stackebrandt later revised his estimate to between 1 and 1.5 percent (Stackebrandt & Ebers. Microbiology Today, Nov 2006 pp. 153-155.)

We thank the reviewer for pointing us to the updated operational definition of species based on 16S rRNA gene sequences. Stackebrandt & Ebers suggested using 1-1.3% (not 1-1.5%) to replace the old value of 3% as the species boundary. With the new values, intragenomic variation of 16S rRNA genes exceeds this boundary in 24 species (see updated Table 2). We have incorporated the new definition into our manuscript. Please note that many of the changes in the revised manuscript follow from this updated definition.

2. Normally, for phylogenetic analysis, only those positions that can be well-aligned between most sequences are included in the analysis. Hypervariable regions and regions of variable length and secondary structure are usually "masked out". It would greatly improve the paper if, in addition to the overall percent difference, percent differences were calculated after using such a mask.
The 1991 Lane mask would be a good choice here. (Lane, D.J. 1991. 16S/23S rRNA sequencing. In Stackebrandt, E. and Goodfellow, M. (eds), Nucleic Acid Techniques in Bacterial Systematics. John Wiley and Sons, New York, pp. 115–175.)

We thank the reviewer for this thoughtful suggestion. As suggested, we aligned the 16S rRNA genes from the 24 highly diversified species listed in the updated Table 2 and masked the aligned sequences with the Lane mask. Intragenomic differences were recalculated on the masked sequences, and the results have been added to Table 2. We have updated the Methods section with this new analysis.

3. Related to the above comment, if a high proportion of the changes occur in these hypervariable regions that are not normally used for phylogenetic inference, then the portions of the manuscript dealing with the effect on inter-organism comparisons will need to be revised.

Of the 24 highly diversified species listed in Table 2, the effect of masking hypervariable positions was remarkable for 14 species, reducing the diversity from between 1.06% and 2.07% to <0.66% (Table 2). This level of diversity will not have a significant impact on phylogenetic inference. However, the variation after masking remained high for H. marismortui (4.86%), T. tengcongensis (5.01%), Candidatus Protochlamydia amoebophila (1.53%), Carboxydothermus hydrogenoformans (1.24%), Deinococcus geothermalis (1.09%), and Geobacillus thermodenitrificans (1.09%). For B. afzelii, the two 16S rRNA genes are too divergent to be aligned using available algorithms (our original alignment of the full sequences of these two genes was done manually). Such diversified 16S rRNA genes will be troublesome not only for threshold-based taxonomic assignment using full-length sequences but also for phylogenetic inference using masked sequences, because 16S rRNA genes from within the same genome are not monophyletic in these species. We have updated the Results section with these new data.

4.
A more complete position-by-position comparison of the intra-genomic vs inter-genomic rates of changes should be presented. In the current manuscript this is done for only a single organism (fig 3).

We presented this type of analysis with two examples, Thermoanaerobacter tengcongensis (Figure 1) and Borrelia afzelii (Figure 4). These examples were selected to demonstrate important conclusions from this study. T. tengcongensis was used to demonstrate the power of ribosomal constraint at the secondary structure level, while B. afzelii was used to show the loss of ribosomal constraint in a pseudogene. Although more figures like these could be included in the manuscript, they are too complex, as Reviewer 3 pointed out (see question 8 from Reviewer 3). To balance the suggestions of Reviewers 1 and 3, we would like to keep Figs. 1 and 3 unchanged to preserve the details requested by Reviewer 1, but we will not add more figures like these because doing so would exacerbate the concern raised by Reviewer 3.

5. p 12 li 3 should be "Table 3",not "Table 1"

“Table 1” has been changed to “Table 3”.

6. Figure 1. I can't distinguish the "large letters" mentioned in the legend.

We have changed “large letters” to “colored letters” in the legend.

7. Order of figures doesn't match first use in text?

The figures have been renumbered in the order of their first appearance in the text.

REVIEWER 2:

1. Minimum-free energy folding on single sequences is widely known to perform poorly. Strongly recommend re-doing this part of the analysis using constraints on conserved regions/folds and-or using multiple-sequence folding approaches, which work much better (see e.g. Paul Gardner's reviews on this topic).

We understand the reviewer’s concern with minimum-free energy folding on single sequences. The minimum-free energy folding approach we used effectively predicted ribosomal constraint at the secondary structure level for nearly all species without the need for alternative folding strategies.
We believe this success was due to the availability of consensus 16S rRNA models that were used to guide the folding. However, we encountered difficulty, as the reviewer predicted, when using the same approach to fold the whole 16S rRNA molecules for 5 species: S. woodyi, P. profundum, C. cellulolyticum, Desulfitobacterium hafniense, and Syntrophomonas wolfei (Table 3 and Fig. 3). The difficulty was caused by a high concentration of substitutions in certain regions of these 16S rRNA molecules, which prevented more detailed comparison. As the reviewer suggested, for 16S rRNA genes that displayed high levels of regional diversity, the regions in question were folded using the KnetFold program (Bindewald 2006). This folding method creates secondary structures based on multiple sequences. The output from KnetFold was entered into jViz.RNA 2.0 to visualize the secondary structure (Wiese 2005). jViz.RNA 2.0 allows for the creation of complex secondary structures that may contain pseudoknots. The multiple-sequence folding was verified using another program, Murlet (Kiryu 2007) (see the update in the Methods section). The results are included in the Results and Discussion sections and in Fig. 3.

2. What implications do the results have for chimera detection? If homogeneity within a genome is maintained by gene conversion, do the recombinants cause problems for chimera detection algorithms?

As the reviewer suggested, we checked for chimeras in all 16S rRNA genes from the species listed in Table 2, using Bellerophon (Huber 2004). No chimeras were detected. This outcome was somewhat expected, as chimera detection relies on obvious breakpoints where two phylogenetically distinct parent molecules are joined. The subtle recombination events produced by gene conversion would fall below the typical sensitivity of chimera detection algorithms as commonly employed.

3. Would be useful to add to table 1 min, max, and standard deviation of # rRNA genes per genome.
Could the authors comment on how copy number variation is likely to bias our 16S-based estimates of community composition, and whether these biases are likely to matter in practice?

Table 1 has been updated with the min, max, and standard deviation of the number of rRNA genes per genome, as recommended. Because the inter-quartile range and median were used to describe the data in the original manuscript, this change created an inconsistency between the tables and the text. To resolve it, the inter-quartile range and median were replaced with the min, max, and standard deviation throughout the revised manuscript.

It is well known that the copy number of the 16S rRNA gene varies widely among species (Lee 2009, Rastogi 2009). Currently, it is common practice to describe the composition of a microbial community using 16S gene composition rather than cell composition. It would be desirable to convert 16S gene composition to cell composition, but for a large number of organisms in a complex microbiome this conversion is not possible because the copy numbers of the 16S rRNA gene in their genomes are unknown. To illustrate the difference, consider an artificial example in which a microbial community contains 100 bacterial cells: 90 cells from Borrelia turicatae and 10 cells from Brevibacillus brevis. Because there is one 16S rRNA gene per cell for B. turicatae and 15 copies of the 16S rRNA gene per cell for Brevibacillus brevis, this community contains 240 16S rRNA genes, 90 from B. turicatae and 150 from Brevibacillus brevis. Consequently, this community is dominated by cells from B. turicatae (90/100) but by 16S rRNA genes from Brevibacillus brevis (150/240). Thus, 16S gene composition is an acceptable way to describe a microbial community, provided that the difference between 16S gene composition and cell composition is understood. This discussion has now been included in the revised manuscript.

4.
Are there any changes in diversity correlated with differences in GC content?

There is no correlation for most species; the exceptions are the three species with the highest diversity. In addition to H. marismortui and T. tengcongensis, which were discussed in the original manuscript, the discussion now covers B. afzelii. Of the two 16S rRNA genes in B. afzelii, the pseudogene has a much lower GC content (38.1%) than the functional copy (46.5%). It appears that random mutations in the pseudogene have been bringing its GC content towards the baseline for the whole genome (28%).

5. How do the variability estimates in Fig. 1. compare with traditional estimates of variability from environmental sequencing projects?

To our knowledge, Thermoanaerobacter tengcongensis, as shown in Fig. 1, harbors the most diversified 16S genes among all known prokaryotic species except for Borrelia afzelii, whose high diversity is related to a pseudogene. This level of diversity is comparable to that found using traditional PCR cloning techniques in Haloarcula marismortui (5%) and Thermobispora bispora (6.4%).

6. Does the availability of high-quality complete genome sequence allow the avoidance of low quality read problems that can artificially increase variability in estimates from single-pass environmental sequencing projects?

No. The genome sequences are helpful but limited in this regard because the genome database covers only a very minor fraction of the true variation of 16S rRNA genes in the natural world. The database is too small to allow sequence errors to be identified or corrected by cross-reference to the 16S rRNA genes it contains.

REVIEWER 3:

1. P11L14 - Any evidence that this gene is expressed in B afzelii? Also, is there any evidence for horizontal gene transfer or is this merely the accumulation of deleterious mutations?
This is an important consideration as intra-genomic variation is often used by the Ford Doolittles of the world to critique the use of 16S rRNA genes to infer organismal phylogeny?

No experiment has yet been designed to examine the expression of rrnA in B. afzelii. The gene does not appear to have been horizontally transferred into B. afzelii from another species, as it is closer to rrnB of B. afzelii than to the 16S rRNA gene of a species in any other genus. We have updated P11L14 with these sentences.

2. P14L16 - What is the evidence that genes have been lost? Do other genes in operon appear to be pseudogenes?

We understand the reviewer’s concern about the word “lost”, which suggests complete deletion of 16S rRNA genes. Instead of implying that a loss or deletion event occurred, we now simply describe the status of the rRNA operons involved as partial rRNA operons missing the 16S rRNA gene. We updated P14 with the following information.

Absence of a whole 16S rRNA gene from an rRNA operon, as evidenced by the presence of 23S or 5S rRNA genes but absence of a 16S rRNA gene, was observed in rRNA operons in 95 species (Table S2). This ranges from an absence of one 16S copy to an absence of eight copies in S. wolfei. The 23S or 5S rRNA genes in the partial rRNA operons appear functional because none of the genes exhibit the excessive random mutations characteristic of a pseudogene. Interestingly, intragenomic diversity among 16S rRNA genes in 6 of the 95 species was borderline or slightly above the 1-1.3% threshold for separation of species (Table 2). These species include Shewanella sp. ANA-3 (1.09%), Escherichia coli (1.10%), Bacillus clausii (1.15%), Bifidobacterium adolescentis (1.30%), Shewanella baltica (1.36%), and Syntrophomonas wolfei (1.67%). As described before, the high diversity in S. wolfei was also associated with IVS in 16S rRNA genes.

3. P17L19 - I know the authors probably feel compelled to comment on the concept of an operational species definition.
A larger point could be made that 16S rRNA genes probably evolve at different rates. In fact, this would be a very opportune time to look at how intra- and inter-genomic variation compares within and between species.

We have updated the manuscript with the following discussion, as recommended by the reviewer.

The definition for prokaryotic species is polyphasic in that it requires a distinct set of biological characteristics and corresponding DNA reassociation values greater than 70%. However, there is not a simple, universal definition. 16S rRNA genes have been used as a surrogate marker for operationally defining species. Initially, >3% difference between the 16S rRNA genes from two organisms was required to claim that the two organisms belong to different species (Stackebrandt 1994). Later, the threshold was lowered to 1-1.3% (26). This operational definition is helpful in taxonomic classification using 16S rRNA genes, especially for studies of complex microbiomes using cultivation-independent techniques in which biological characteristics and DNA reassociation values cannot be determined for individual bacterial cells/species. Nevertheless, it is critical to understand the limitations of the 16S rRNA-based operational definition for species. The main limitation is that 16S rRNA genes evolve at different rates but the operational species threshold (1-1.3%) is relatively rigid. As a consequence, closely related species whose 16S rRNA genes evolve slowly will be grouped as a single species by the operational definition; for example, Streptococcus pseudopneumoniae and Streptococcus pneumoniae (Arbique 2004) differ by only 5 bp between their 16S rRNA genes, corresponding to a difference of only about 0.3%. Another limitation is that 16S does not represent the entire genomic content that determines the biological characteristics for a species.
It is by now quite evident that significant differences in genome composition may be present in bacterial species that are completely identical or that differ only slightly in their 16S rRNA genes. For example, isolates of Vibrio splendidus exhibit up to 25% genotypic difference (Thompson 2005), and strains of E. coli may differ by up to 40% in the number of genes in their genomes (Perna 2001, Kudva 2002). The three members of the Bacillus cereus group, B. cereus, B. anthracis, and B. thuringiensis, can be classified as a single species by their nearly identical 16S rRNA genes but differ greatly in the number and type of genes they harbor due to the presence of large plasmids (Rasko 2005). In both E. coli and the B. cereus group, these differences confer various biological capabilities and pathogenicity. Intragenomic variation of 16S rRNA genes is another limitation, encountered when classifying species that harbor 16S rRNA genes with diversity greater than the threshold set by the operational definition (Table 2); it will lead to overestimation of species diversity in a microbiome. Thus, it can be expected that community structures determined using the 16S-based operational species definition approximate but do not necessarily reflect the true community structures.

4. The authors comment that intra-genomic variation could confound taxonomic classifications. It would be interesting to see whether 16S rRNA gene from within the same genome are monophyletic.

Please see our reply to question 3 from Reviewer 1.

5. P8L6 - 19 Archaea + 408 Bacteria = 427 total - not 425

We sincerely thank the reviewer for the careful review of our data. The error was due to our partial correction of a miscalculation. There are in fact 425 species, comprising 19 Archaea + 406 Bacteria. Finding this error prompted us to verify all numbers in the tables. No additional errors were identified in Tables 1-3. However, two errors were found in Table S1 (568 prokaryotic species analyzed in this study).
Table S1 included two genomes for Vibrio fischeri. Removal of the redundant Vibrio fischeri genome reduced the number of unique species in this study to 568 from 569. The number of 16S rRNA genes in Mycobacterium leprae was listed as zero but should be one. Four species were removed from Table S2 (95 prokaryotic species with partial rRNA operons missing 16S rRNA genes) because the species either had no partial rRNA operons (Streptomyces coelicolor) or had partial rRNA operons missing 23S rRNA genes instead of 16S rRNA genes (Persephonella marina, Sulfurihydrogenibium azorense, Vibrio harveyi).

6. P9L3 - How are distances calculated when there are IVS's?

If gaps were determined to be caused by intervening sequences (IVS) (inserts >10 bp), they were recorded and removed, the sequences were realigned, and the distances were recalculated. Please see P6L14 in the Methods section for details.

7. P10L13 - Possible to state the variable regions that these occur?

P10L13 has been updated to state the variable regions, as recommended.

8. The tables and figures are brutal. Is there anyway to simplify these to highlight the important points?

Please see our reply to question 4 of Reviewer 1.

9. Organismal names need to be italicized throughout.

All organismal names have been italicized in the revised manuscript.
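
Supplementary illustration for the reply to Reviewer 2, comment 3. The artificial two-species community described there can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not part of the analysis in the manuscript; the function name `composition` and the dictionary layout are assumptions made for this example, and the cell counts and copy numbers are the ones quoted in the reply.

```python
def composition(cell_counts, copies_per_cell):
    """Return per-species gene counts, cell fractions, and 16S gene fractions.

    cell_counts:     species name -> number of cells in the community
    copies_per_cell: species name -> 16S rRNA gene copies per genome
    (Both inputs are hypothetical structures chosen for this sketch.)
    """
    # Each cell contributes as many 16S genes as its genome carries copies.
    gene_counts = {sp: n * copies_per_cell[sp] for sp, n in cell_counts.items()}
    total_cells = sum(cell_counts.values())
    total_genes = sum(gene_counts.values())
    cell_frac = {sp: n / total_cells for sp, n in cell_counts.items()}
    gene_frac = {sp: g / total_genes for sp, g in gene_counts.items()}
    return gene_counts, cell_frac, gene_frac


# Values taken from the artificial example in the reply.
cells = {"Borrelia turicatae": 90, "Brevibacillus brevis": 10}
copies = {"Borrelia turicatae": 1, "Brevibacillus brevis": 15}

gene_counts, cell_frac, gene_frac = composition(cells, copies)
for sp in cells:
    print(f"{sp}: cells {cell_frac[sp]:.1%}, 16S genes {gene_frac[sp]:.1%}")
```

With these inputs the majority species by cell count (B. turicatae, 90 of 100 cells) becomes the minority by gene count (90 of 240 genes), which is the bias the reply describes. The reverse conversion, from gene composition back to cell composition, would require dividing each gene count by its copy number and is only possible when the copy numbers are known.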