High throughput sequencing to assess microbial diversity in four western hemispheric soils.

Luiz F.W. Roesch1,7, Roberta R. Fulthorpe2, Alberto Riva3, George Casella4, Angela D. Kent5, Samira Daroub6, Flávio A. O. Camargo7, William G. Farmerie8, and Eric W. Triplett1

1 Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611-0700 USA
2 Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, Ontario, M1C 1A4 Canada
3 Department of Molecular Genetics and Microbiology, College of Medicine, Gainesville, FL 32610-0266 USA
4 Department of Statistics, University of Florida, Gainesville, FL 32611-8545 USA
5 Department of Natural Resources and Environmental Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
6 Department of Soil and Water Science, Everglades Research and Education Center, Institute for Food and Agricultural Sciences, University of Florida, Belle Glade, FL 33430 USA
7 Department of Soil Science, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
8 Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32610 USA

ABSTRACT

The objective of this work was to estimate the number of operational taxonomic units (OTUs) in a gram of soil using 454 pyrosequencing. Soil samples from four sites across a large transect of the western hemisphere were collected, DNA isolated, and the 3'-end of the 16S rRNA gene amplified. Sites included a boreal forest soil from northwestern Ontario, a temperate zone agricultural soil from the Morrow Plots in Illinois, a tropical, highly organic soil from the Everglades Agricultural Area in Florida, and a temperate agricultural soil from southern Brazil. Pyrosequencing was performed so that only the most hypervariable region (approximately 1390-1500) of the 16S rRNA gene was sequenced.
The sequences were aligned using MUSCLE, distance matrices were determined by DNADist, and clustering was done by DOTUR. From the DOTUR output, the theoretical maximum OTU number was determined by the Transform Both Sides Power of x (TBS-PX) transformation of Michaelis-Menten kinetics. The number of sequences available for each site varied from 26,140 to 53,533. Crenarchaeotal sequences were also amplified, with far more obtained from the agricultural soils than from the forest soil. The most abundant groups in all four soils appear to be the Bacteroidetes and the - and -Proteobacteria. The number of OTUs was calculated using parametric and non-parametric estimators and at five levels of dissimilarity in nucleotide sequence. Even at the lowest level of dissimilarity using the most generous estimator, the number of OTUs in a gram of soil never exceeded 35,000, considerably less than the 8.3 million estimated by Gans et al. (2005). The bacterial diversity of the forest soil was strikingly different from those of the three agricultural soils. The forest soil appears to be phylum rich and relatively species poor compared to the agricultural soils, which appeared to be species rich but phylum poor.

INTRODUCTION

Recently, much attention has been paid to estimating the number of species or operational taxonomic units (OTUs) in a gram of soil. This is of particular interest as soil is considered to harbor the most diverse populations of bacteria of any environment on earth. Even those soils with large particle sizes provide an enormous surface area for many bacteria. The number of potential microhabitats within a small soil sample is extraordinary. In the first culture-independent analysis of a bacterial species census in soil, Torsvik et al. (1990) isolated DNA from cultured species and estimated the number of bacterial genomes present in a mixed sample using DNA:DNA hybridization.
Taking the admittedly risky approach of extrapolating such data from a mixture containing a few species to a complex mixture of DNA from soil, Torsvik et al. (1990) estimated that the number of bacterial species in a gram of boreal forest soil was approximately 10,000. Gans et al. (2005) re-evaluated this approach and predicted nearly 10^7 microbial species per gram of soil and proposed that this estimate could be practically verified with current DNA sequencing technology. In the meantime, others have attempted to estimate the number of bacterial species in a gram of soil by extrapolation based on a sample size of amplified, cloned, and sequenced 16S rRNA genes. Schloss and Handelsman (2006) defined an operational taxonomic unit as 3% dissimilarity in the comparison of 16S rRNA gene sequences. Using that definition and two sequenced clone libraries from different sites, they estimated the number of OTUs in a gram of soil to be between 2,000 and 5,000. These two libraries were derived from Alaskan and Minnesota soils and contained 1,033 and 600 16S rRNA clones, respectively. Hong et al. (2006) used a nonparametric approach in the analysis of 556 16S rRNA clones and estimated the number of species per gram of marine sediment to be between 2,000 and 3,000. None of the estimates to date used a sufficient number of sequences to reach the limit of a rarefaction curve depicting the number of species versus the number of clones sequenced. In this work, we used pyrosequencing to obtain well over 25,000 16S rRNA gene fragments from each of four soils. This is an order of magnitude higher than done previously. A statistical transformation of the resulting rarefaction curves and species richness indicators permits an estimate of the number of OTUs, as defined by specific levels of sequence identity, for each soil.
The goal of this work was to obtain an estimate of soil bacterial diversity based on a number of sequences that exceeds or approaches the estimates to date of the number of species in a gram of soil. To do this, soil DNA was isolated and 454 pyrosequencing was employed following the amplification of a hypervariable region of the highly conserved 16S rRNA gene. Another objective was to determine how this estimate might vary across a broad geographical scale. Soils from four disparate sites across the western hemisphere were chosen. Three of these soils were agricultural soils. One agricultural soil was selected from a maize field in the southernmost state of Brazil, Rio Grande do Sul. Another agricultural soil came from a sugarcane field in the Everglades Agricultural Area in Florida, where organic matter content is very high. A third agricultural soil came from the Morrow Plots at the University of Illinois in Urbana. The fourth soil was derived from a boreal forest site in northwestern Ontario, Canada.

METHODS

Soil sampling sites and methods

From each site, soil was collected from the top 10 cm from the surface. For the boreal forest soil, sphagnum and duff layers were removed prior to collection. All soil was packed on ice upon collection and sent to Gainesville, Florida by express mail for further analysis. Soils were stored in the lab at -80°C until extraction. At the time of collection, the soil from the Morrow Plots in Illinois had been in a corn-oats-alfalfa rotation for 53 years and had received fertilization with limestone, nitrogen, phosphorus, and potassium. The Everglades soil in Florida was planted with sugarcane and the Brazilian plot was at the end of a maize growing cycle.

DNA extraction and 16S rRNA gene amplification

Bacterial DNA was isolated from soil samples using the FastDNA® Kit (Qbiogene, Inc., Calif.) following the manufacturer's instructions.
The DNA obtained was cleaned of the organic and inorganic compounds that contaminate soil DNA preparations using the agarose-embedded genomic DNA method described by Moreira (1998). To amplify 16S rRNA gene sequences from bacteria, the conserved areas of sequences in the ribosomal database were examined to determine the variability and length of each of the V1-V9 variable regions in the gene. This was necessary since the read length obtained from GS-20 454 pyrosequencing is only about 105 bases on average. In addition, we needed to identify a region of about 700-800 bases to amplify, since 454 library construction is optimal with a fragment of this size. One end of this region must flank a highly variable region that will be sequenced to give the maximum amount of phylogenetic information possible. The analysis included aligning representative sequences from each phylum using ClustalW and showed that the V9 region was optimal in length and variability for this purpose. To amplify a fragment of the appropriate size, primers 780f and a modification of 1492r were chosen. The 1492r primer has been heavily used in 16S rRNA gene amplifications since its publication in 1991 (Lane 1991). However, the original 1492r sequence (5'-GGTTACCTTGTTACGACTT-3') is not conserved at all sites. The first G and first T are variable according to an alignment of sequences from a newer database. Therefore, a modified primer, 1492-rm, was used (5'-GNTACCTTGTTACGACTT-3'), where the 3' end corresponds to the 1492 position of the E. coli 16S rRNA gene, but the primer is one bp shorter. This primer sequence was found in 36,850 of the 262,030 bacterial sequences in RDP II, as compared to only 10,860 of 262,030 for the original 1492r primer. These numbers are low overall since not many of the published sequences have data at this end. The modified primer, 1492-rm, gave the highest number of hits for this region of any of the modified 1492 primers examined.
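The kind of primer-versus-database comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure actually run against RDP II: the helper names (`reverse_complement`, `to_regex`, `reverse_primer_matches`, `count_hits`) and the three toy template sequences are hypothetical, and only exact matches to the degenerate (IUPAC) primer are counted.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of every IUPAC code (e.g. R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def to_regex(seq):
    """Compile a degenerate sequence into a regular expression."""
    return re.compile("".join("[" + IUPAC[base] + "]" for base in seq.upper()))


def reverse_primer_matches(primer, template):
    """True if a reverse primer (written 5'->3') has an exact site in the
    template, which is given as the sense strand of the gene."""
    site = to_regex(reverse_complement(primer))
    return site.search(template.upper()) is not None


def count_hits(primer, templates):
    """Number of templates containing a site for the reverse primer."""
    return sum(reverse_primer_matches(primer, t) for t in templates)


ORIGINAL_1492R = "GGTTACCTTGTTACGACTT"   # sequence quoted in the text
MODIFIED_1492RM = "GNTACCTTGTTACGACTT"   # sequence quoted in the text

# Hypothetical toy templates (not real database entries).
templates = [
    "GGCT" + reverse_complement("GGTTACCTTGTTACGACTT") + "AA",  # canonical site
    "GGCT" + reverse_complement("GATACCTTGTTACGACTT") + "AA",   # variant site
    "ACGTACGTACGT",                                             # no site
]

original_hits = count_hits(ORIGINAL_1492R, templates)
modified_hits = count_hits(MODIFIED_1492RM, templates)
print(original_hits, modified_hits)
```

In this toy set the modified primer finds both the canonical and the variant site, while the original primer finds only the canonical one. For reference, the counts reported above correspond to about 14.1% (36,850 of 262,030) and 4.1% (10,860 of 262,030) of the RDP II bacterial sequences.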
For the other end of the area, 787f was chosen, just prior to the V5 region. This primer (5'-AATAGATACCCNGGTAG-3') matched 157,088 of 262,030 sequences in RDP II; the high number reflects the greater number of sequences from this region in the database. Assuming 1 x 10^9 cells/g of soil, an average bacterial genome size of 3 Mb, and an average 16S rRNA copy number of one per cell, approximately 20 PCR cycles were required to obtain 3 µg of amplified 16S rRNA gene product with 96 PCR reactions each having 10 µg of template DNA. Three µg of amplified 16S rRNA gene product from each soil was required to construct the four libraries for 454 sequencing. The PCR conditions used were 94°C for 2 minutes, 20 cycles of 94°C for 45 s denaturation, 55°C for 45 s annealing, and 72°C for 1 min extension, followed by 72°C for 6 min. After the first 20 rounds of amplification, another three rounds of amplification were done to add the A and B adapters required for 454 pyrosequencing (Margulies et al. 2005) to specific ends of the amplified 16S rRNA fragment for library construction. For this purpose, a new primer was synthesized where the A adaptor sequence was immediately upstream of the 1492-rm primer sequence. The resulting sequence was 5'-(CCATCTCATCCCTGCGTGTCCCATCTGTTCCCTCCCTGTCTCAG)GITACCTTGTTACGACTT-3' with the A adaptor sequence in parentheses. Similarly, the biotinylated B-adaptor sequence was added upstream of the 780f primer, resulting in 5'-(BioTEG/CCTATCCCCTGTGTGCCTTGCCTATCCCCTGTTGCGTGTCTCAG)ATTAGATACCCIGGTAG-3' with the adaptor sequence in parentheses. The resulting PCR product was nebulized and used in the emPCR process necessary to make the single strands on beads as required for 454 pyrosequencing (Margulies et al. 2005). Sequencing was done by separating the microtiter plate into four sections so that all four samples could be sequenced in a single 454 run.
The number of reads per sample, the maximum and average read lengths per sample, and GenBank accession numbers are shown in Table 1.

Phylogenetic assignment

The 16S rRNA gene fragments were phylogenetically assigned according to their best matches to sequences in the NAST (Nearest Alignment Space Termination) database (DeSantis et al. 2006). The sequences were submitted to a web-interface tool (http://greengenes.lbl.gov/NAST) which compares the sequences with a 16S rRNA reference database of ~100,000 aligned sequences. Each sequence was aligned and categorized according to the best match in the reference database. Rarefaction curves, richness estimators and diversity indexes were calculated based on the best match for each query sequence to sequences in NAST and their frequency of recovery. Phylogenetic inferences were made according to the closest sequences assigned in the NAST database.

Alignment and clustering of 16S rRNA gene fragments

A second approach was to cluster sequences into groups of defined sequence similarity by using DOTUR (Schloss and Handelsman, 2005). Multiple sequence alignment was done using MUSCLE (Edgar, 2004) (with parameters -maxiters 1, -diags1 and -sv). Based on the alignment, a distance matrix was constructed using DNAdist from the PHYLIP suite of programs, version 3.6, with default parameters (Felsenstein 1989, 2005). These pairwise distances served as input to DOTUR (Schloss and Handelsman, 2005) for clustering the sequences into OTUs of defined sequence identity that ranged from unique sequences (no variation) to 20% dissimilarity. These clusters served as OTUs for generating rarefaction curves and for making calculations with the species richness, sequence coverage, and species diversity indexes. To obtain ACE, Chao1, and log normal estimators, the distance matrices were further analysed by DOTUR (Schloss and Handelsman 2005). The capacity of each of these programs was exceeded by these datasets on traditional laboratory PCs.
Hence, these programs were run on an HP DL585 ProLiant server with 64 gigabytes of RAM and 2 dual-core Opteron processors at 1.8 GHz, running the Red Hat Enterprise Linux OS.

Statistical analysis of taxa richness - Rarefaction

At each percent and sample size, a Michaelis-Menten equation was fit to the DOTUR curve. (For example, for the Brazil data, the DOTUR output was obtained for sample sizes 5000, 10000, 15000, 20000, 25000, and 26140, and percents 0, 0.03, 0.05, 0.1, and 0.2.) The MM equation has parameters V and K with the equation y=Vx/(K+x), and the estimate of V is our estimate of the asymptote. We found that the Michaelis-Menten equation was not a good fit to the rarefaction curve, as two parameters were not able to sufficiently model the curvature. The resulting asymptote estimates were actually below the observed data by a substantial amount. The problem was that the curvature at the beginning of the rarefaction curve was different from that at the end. To solve this problem we used a model of two Michaelis-Menten equations, one for the beginning and one for the end of the curve, that were joined continuously at a point that was optimized by the fit. The curves were fit by the method of maximum likelihood, using the transform-both-sides methods of Ruppert et al. (1989). This approach better models the error terms and provides improved estimators over standard approaches such as Lineweaver-Burk. The result of this analysis was an estimate of the asymptote for each percent.

Statistical analysis of taxa richness - Chao1, Ace and Rarefaction output

For each region we have an asymptote estimate for each sample size. To this curve (see picture) we fit an MM equation to estimate the overall asymptote. (Here we typically have 5-7 data points, and the ordinary MM, with the TBS fit, works fine.) We used the bootstrap to estimate standard deviations for both V and K for each curve. We can also translate that into standard deviations for estimating proportions of V.
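As a rough illustration of the single Michaelis-Menten fit y=Vx/(K+x) described above, the sketch below estimates V and K by ordinary least squares. It is a simplification under stated assumptions: it does not implement the transform-both-sides maximum likelihood fit or the two-curve joined model used in this work, the function names are hypothetical, and the data are synthetic values generated from V = 1000 and K = 5000 at the Brazil sample sizes, not DOTUR output. Setting Vn/(K+n) = pV and solving for n gives n = (p/(1-p))K, which is what `sequences_needed` returns.

```python
import math


def mm(x, V, K):
    """Michaelis-Menten curve y = V*x / (K + x)."""
    return V * x / (K + x)


def best_V(xs, ys, K):
    """For a fixed K the least-squares V has a closed form."""
    f = [x / (K + x) for x in xs]
    return sum(y * fi for y, fi in zip(ys, f)) / sum(fi * fi for fi in f)


def sse(xs, ys, K):
    """Sum of squared errors at K, with V set to its best value."""
    V = best_V(xs, ys, K)
    return sum((y - mm(x, V, K)) ** 2 for x, y in zip(xs, ys))


def fit_mm(xs, ys):
    """Ordinary least-squares fit: coarse log-spaced grid over K,
    then golden-section refinement around the best grid point."""
    grid = [10 ** (i / 10) for i in range(0, 81)]          # K from 1 to 1e8
    j = min(range(len(grid)), key=lambda i: sse(xs, ys, grid[i]))
    lo, hi = grid[max(j - 1, 0)], grid[min(j + 1, len(grid) - 1)]
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(200):
        a = hi - phi * (hi - lo)
        b = lo + phi * (hi - lo)
        if sse(xs, ys, a) < sse(xs, ys, b):
            hi = b          # minimum lies in [lo, b]
        else:
            lo = a          # minimum lies in [a, hi]
    K = (lo + hi) / 2
    return best_V(xs, ys, K), K


def sequences_needed(p, K):
    """Sample size n at which the curve reaches a fraction p of V."""
    return p / (1 - p) * K


# Synthetic, noise-free example (assumed V and K, for illustration only).
xs = [5000, 10000, 15000, 20000, 25000, 26140]
ys = [mm(x, 1000.0, 5000.0) for x in xs]
V_hat, K_hat = fit_mm(xs, ys)
n90 = sequences_needed(0.9, K_hat)
print(round(V_hat, 3), round(K_hat, 3), round(n90, 1))
```

With real rarefaction output the residuals are not well described by ordinary least squares, which is the motivation given above for the transform-both-sides approach.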
For example, if you want to know the necessary sample for getting p*V species, where p is (say) 0.9, the answer is sample size (p/(1-p))*K with standard error (p/(1-p))*Std(K). Note that at low percents, the estimates are quite variable, but still in the same ballpark. For higher percents, the estimates become remarkably similar.

RESULTS AND DISCUSSION

Distribution of divisions identified

With over 40% of the total bacterial sequences, the Proteobacteria represented the dominant division in each soil. The -proteobacteria were the dominant subdivision among the Proteobacteria in all soils except Brazil, where the -proteobacteria were dominant. With 15-25% of the bacterial sequences in each sample, the second most abundant division in all four soils was the Bacteroidetes. Other prominent divisions were the Acidobacteria, the Actinobacteria, and the Firmicutes. Among the other divisions, the Gemmatimonadetes represented 3.5% of the sequences in the Canadian sample, while in Florida and Brazil, the Nitrospira represented 3 and 2% of total sequences, respectively. In Illinois and Canada, the Verrucomicrobia were represented by more than 2% of the sequences. In Illinois, nearly 4% of the sequences were in the TM7 division. Approximately 6-12% of the sequences from each sample remained unclassified.

Crenarchaeota diversity

The primers used in this work also amplified the crenarchaeota (Table 1). In the agricultural soils, the crenarchaeota represented a fairly significant proportion, 4.6-12.5%, of the total sequences. The surprising result was the relative dearth of crenarchaeotal sequences in the Canadian sample, with just five of 53,251 sequences being in this group. These five sequences were all related to the ammonia oxidizing archaea, and the low number of these sequences may be attributable to the low pH of this soil. However, the Brazilian soil had an even lower pH and yet over 4% of the total sequences from Brazil were crenarchaeotal.
Further work is needed to confirm this observation of a low proportion of crenarchaeotal sequences in the Canadian forest soil. The cause of this may be that nitrogen is more limiting in forest soils compared to fertilized agricultural soils. Of the crenarchaeota observed from each of the agricultural soils, about one-third of the sequences are closely related to the ammonia oxidizing archaea. While 16S rRNA sequence alone cannot describe the physiology of an organism, it is not unreasonable to expect a high number of ammonia oxidizing archaea in these soils given recent work (Leininger et al. 2006; Treusch et al. 2005).

Estimation of the number of OTUs in each soil

With an average read length of 105 bases in each fragment, it is impossible to define species even with the hypervariable region sequenced. However, even the entire length of the 16S rRNA gene cannot be used in a species definition. Genera can be defined at approximately the 5% level of dissimilarity. Five levels of dissimilarity are presented here for each sample so that the reader can choose a level of discrimination of interest. Three estimates of the number of OTUs at each % dissimilarity are presented. One of these estimates is parametric (an estimate based on rarefaction) while two are based on non-parametric measures (Chao1 and Ace). Using these estimates, the number of sequences required to reach 90 and 95% of the maximum number of OTUs at each % dissimilarity is presented. Even at 0% dissimilarity, the maximum number of OTUs in any soil, using parametric or non-parametric estimates, is less than 35,000. This is over 200-fold lower than the estimate of OTUs in a gram of soil suggested by Gans et al. (2005). The maximum number of sequences required to identify the 35,000 OTUs is never more than 600,000. Thus, using the methods here, 95% of all OTUs can be identified with over 10-fold fewer sequences than the number of species suggested by Gans et al. (2005).
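The two fold-differences quoted above follow directly from the figures given in the text (8.3 million species per gram from Gans et al. 2005, an upper bound of 35,000 OTUs, and at most 600,000 sequences). The short check below uses only those three numbers; the variable names are chosen here for illustration.

```python
# Figures as quoted in the text.
gans_species = 8.3e6      # species per gram estimated by Gans et al. (2005)
max_otus = 35_000         # upper bound on OTUs per gram stated in this work
max_sequences = 600_000   # upper bound on sequences required, as stated

# Ratio of the Gans et al. estimate to the OTU upper bound.
fold_fewer_otus = gans_species / max_otus

# Ratio of the Gans et al. estimate to the sequencing effort required.
fold_fewer_sequences = gans_species / max_sequences

print(f"{fold_fewer_otus:.0f}-fold, {fold_fewer_sequences:.1f}-fold")
```

The first ratio is roughly 237, consistent with "over 200-fold", and the second is roughly 13.8, consistent with "over 10-fold".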
As a result, we strongly disagree with the statement of Gans et al. (2005) that “the survey size required for accurate analysis of soil communities is impractically large”. In fact, with the latest improvement in throughput in 454 pyrosequencing, this survey can be completed in less than one day of operation of the 454 sequencer. We are aware of the biases involved in our methods. We are aware that our soil extraction method likely did not completely extract all DNA present in the soil. We estimate that we may have extracted only 50% of the DNA present in each sample. We are also aware that the primers we chose almost certainly missed a proportion of the taxa present in soil. Assuming only a 50% extraction of DNA and that we may have missed half of the taxa present in soil using our primers, our estimates for the number of OTUs present in soil may be underestimated by a factor of four. However, for the number of species proposed by Gans et al. (2005) to be correct, our methods would have to have missed over 99.5% of the species in soil. Given the biases and limitations of our methods described by others, this is simply not possible.

Comparison of species richness in each soil

Assessing diversity in a variety of soils has significant merit (Fig. 3). The patterns of the rarefaction curves for the Canadian forest sample are very different from those of the agricultural soils (Table 2). OTU abundance is particularly high at high levels of dissimilarity, while in the other samples the diversity is low at high levels of dissimilarity. Conversely, at low levels of dissimilarity the OTU abundance of the Canadian sample is lower than that of the other samples. This result is interpreted as the forest sample being very phylum rich but species poor, while the agricultural samples are species rich and phylum poor. The cause of this difference is unknown. Agricultural systems may reduce phylum diversity but increase speciation.
The Canadian sample may have higher phylum diversity because forest soils support a wider diversity of bacteria. Speciation in this forest soil may be lower because cold temperatures most of the year inhibit bacterial growth. In contrast to the situation with bacterial diversity, the opposite is true regarding crenarchaeota diversity. The number of crenarchaeotal sequences recovered was remarkably low in Canada and highest in Illinois. Again, the cause of this is unknown. The age of inexpensive, high throughput pyrosequencing allows the assessment of the full taxonomic diversity of bacteria in soil for the first time. As a result, much better estimates of OTU richness can be obtained from any sample. Here no attempt is made to define a species. A significant debate exists in the literature regarding the species concept for bacteria. No judgment is made on this here. However, it is clear that regardless of the % dissimilarity chosen by the reader to define a particular level of taxonomic discrimination, the number of OTUs by any definition cannot be as high as described by Gans et al. (2005).

REFERENCES

Borneman, J. and E.W. Triplett. 1997. Molecular microbial diversity in soils from eastern Amazonia: evidence for unusual microorganisms and population shifts associated with deforestation. Appl. Environ. Microbiol. 63:2647-2653.
Borneman, J., P.W. Skroch, J.A. Jansen, K.L. O'Sullivan, J.A. Palus, N.G. Rumjanek, J. Nienhuis, and E.W. Triplett. 1996. Microbial diversity of an agricultural soil in Wisconsin. Appl. Environ. Microbiol. 62:1935-1943.
DeSantis, T.Z., P. Hugenholtz, N. Larsen, M. Rojas, E.L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G.L. Anderson. 2006. Greengenes, a chimera-checked 16S rRNA database and workbench compatible with ARB. Appl. Environ. Microbiol. 72:5069-5072.
Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.
Felsenstein, J. 2005.
PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5:164-166.
Fierer, N. and R.B. Jackson. 2006. The diversity and biogeography of soil bacterial communities. Proc. Natl. Acad. Sci. (USA) 103:626-631.
Gans, J., M. Wolinsky, and J. Dunbar. 2005. Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science 309:1387-1390.
Hong, S.-H., J. Bunge, S.-O. Jeon, and S.S. Epstein. 2006. Predicting microbial species richness. Proc. Natl. Acad. Sci. (USA) 103:117-122.
Kent, A.D. and E.W. Triplett. 2002. Microbial communities and their interactions in soil and rhizosphere ecosystems. Annu. Rev. Microbiol. 56:211-236.
Lane, D.J. 1991. 16S/23S rRNA sequencing. In: Stackebrandt E, Goodfellow M (eds) Nucleic Acid Techniques in Bacterial Systematics. John Wiley & Sons, New York, pp 115-175.
Leininger, S., T. Urich, M. Schloter, L. Schwark, J. Qi, G.W. Nicol, J.I. Prosser, S.C. Schuster, and C. Schleper. 2006. Archaea predominate among ammonia-oxidizing prokaryotes in soils. Nature 442:806-809.
Margulies, M., M. Egholm, W.E. Altman, S. Attiya, J.S. Bader, L.A. Bemben, J. Berka, M.S. Braverman, Y.-J. Chen, Z. Chen, S.B. Dewell, L. Du, J.M. Fierro, X.V. Gomes, B.C. Godwin, W. He, S. Helgesen, C.H. Ho, G.P. Irzyk, S.C. Jando, M.L.I. Alenquer, T.P. Jarvie, K.B. Jirage, J.-B. Kim, J.R. Knight, J.R. Lanza, J.H. Leamon, S.M. Lefkowitz, M. Lei, J. Li, K.L. Lohman, H. Lu, V.B. Makhijani, K.E. McDade, M.P. McKenna, E.W. Myers, E. Nickerson, J.R. Nobile, R. Plant, B.P. Puc, M.T. Ronan, G.T. Roth, G.J. Sarkis, J.F. Simons, J.W. Simpson, M. Srinivasan, K.R. Tartaro, A. Tomasz, K.A. Vogt, G.A. Volkmer, S.H. Wang, Y. Wang, M.P. Weiner, P. Yu, R.F. Begley, and J.M. Rothberg. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376-380.
Moreira, D. 1998.
Efficient removal of PCR inhibitors using agarose-embedded DNA preparations. Nucleic Acids Res. 26:3309-3310.
Mullins, T.D., T.B. Britschgi, R.L. Krest, and S.J. Giovannoni. 1995. Genetic comparisons reveal the same unknown bacterial lineages in Atlantic and Pacific bacterioplankton communities. Limnol. Oceanogr. 40:148-158.
Neufeld, J.D. and W.W. Mohn. 2005. Unexpectedly high bacterial diversity in arctic tundra relative to boreal forest soils, revealed by serial analysis of ribosomal sequence tags. Appl. Environ. Microbiol. 71:5710-5718.
Ruppert, D., N. Cressie, and R.J. Carroll. 1989. A transformation/weighting model for estimating Michaelis-Menten parameters. Biometrics 45:637-656.
Schloss, P.D. and J. Handelsman. 2006. Toward a census of bacteria in soil. PLoS Computational Biology 2:786-793.
Schloss, P.D. and J. Handelsman. 2005. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl. Environ. Microbiol. 71:1501-1506.
Sogin, M.L., H.G. Morrison, J.A. Huber, D.M. Welch, S.M. Huse, P.R. Neal, J.M. Arrieta, and G.J. Herndl. 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc. Natl. Acad. Sci. (USA) 103:12115-12120.
Torsvik, V., J. Goksoyr, and F.L. Daae. 1990. High diversity in DNA of soil bacteria. Appl. Environ. Microbiol. 56:782-787.
Treusch, A.H., S. Leininger, A. Kletzin, S.C. Schuster, H.-P. Klenk, and C. Schleper. 2005. Novel genes for nitrite reductase and Amo-related proteins indicate a role of uncultivated mesophilic crenarchaeota in nitrogen cycling. Environ. Microbiol. 7:1985-1995.
Tringe, S.G., C. von Mering, A. Kobayashi, A.A. Salamov, K. Chen, H. Chang, M. Podar, J. Short, E. Mathur, J. Detter, et al. 2005. Comparative metagenomics of microbial communities. Science 308:554-557.

Table 1. Locations, pH, elevations, and types of soils as well as statistics on the 454 pyrosequencing.
Soil      pH   Latitude/longitude         Elevation (m)  Soil classification
Brazil    4.6  -29.539671S, 55.107556W    132            Dystrophic oxisol
Florida   7.0  26.663199N, -80.628500W    2              Lauderhill euic hyperthermic Lithic
Illinois  6.4  40.104616N, -88.226517W    224            mesic Aquic Argiudoll
Canada    5.4  52.743203N, -91.718433W    314            Dystric Brunisol

Soil      GenBank accession numbers  No. of bacterial 16S rRNA gene sequences  No. of crenarchaeotal 16S rRNA gene sequences  Maximum read length (bp)  Average read length (bp)
Brazil    EF222481-EF248596          26,116                                    1,259                                          141                       103
Florida   EF248597-EF276844          28,248                                    3,546                                          159                       102
Illinois  EF276845-EF308590          31,746                                    4,530                                          144                       104
Canada    EF308591-EF361836          53,246                                    5                                              160                       102

Table 2. Estimated number of OTUs for each sample using parametric (rarefaction) and nonparametric estimators (Chao1 and Ace). The number of sequences required to reach 90% and 95% of the maximum OTU level at each % dissimilarity was calculated using the rarefaction estimator.

Soil      % dissimilarity  Rarefaction  Chao1   Ace     Sequences needed for 90% max. OTUs  Sequences needed for 95% max. OTUs
Brazil    0                12,196       25,972  26,395  209,864                             443,047
Brazil    3                3,611        5,021   4,888   113,205                             238,988
Brazil    5                2,395        3,183   3,046   98,495                              207,934
Brazil    10               648          756     767     68,172                              143,919
Brazil    20               298          338     344     65,500                              138,279
Florida   0                16,485       34,380  37,537  266,429                             562,462
Florida   3                4,477        5,666   5,820   127,715                             269,621
Florida   5                3,033        3,686   3,753   104,457                             220,514
Florida   10               1,061        1,110   1,198   85,361                              180,207
Florida   20               529          509     539     82,610                              174,398
Illinois  0                15,797       43,652  51,378  583,003                             615,392
Illinois  3                4,010        6,040   5,890   282,214                             297,892
Illinois  5                2,651        3,403   3,553   247,373                             261,116
Illinois  10               926          1,057   1,154   228,840                             241,554
Illinois  20               538          537     575     263,225                             277,849
Canada    0                20,920       20,244  20,596  842,692                             1,779,017
Canada    3                15,188       20,244  13,329  762,838                             1,610,438
Canada    5                13,805       12,282  12,234  735,323                             1,552,349
Canada    10               10,604       10,230  10,213  661,496                             1,396,492
Canada    20               9,376        9,246   9,236   617,608                             1,303,840

Figure 1. Google Earth images of the four locations used for soil collection.

Fig. 2. The effect of the number of sequences available on the estimation of the number of OTUs.

Fig. 3.
Rarefaction curves depicting the effect of % dissimilarity on the number of OTUs identified. Note the comparatively high species richness of the agricultural samples while the Canadian forest soil has very high phylum richness.

Fig. 4. Relative abundance of phylogenetic divisions for each soil library in which 16S sequences were classified according to the nearest neighbor in the Greengenes database (http://greengenes.lbl.gov).