Estimating Microbial Diversity
John Bunge
jab18@cornell.edu
Department of Statistical Science
Cornell University

Thanks to:
• Amy Willis
• Fiona Walsh
• David Mark Welch
• Colleagues too numerous to mention
Bunge, J., Willis, A. and Walsh, F. (2013) Estimating the number of species in microbial diversity studies. Ann. Rev. of Statist. and its Appl. v.1. Forthcoming.

Statisticians

Bioinformaticists

Statistics is not a collection of formulae, nor computer programs, but a conceptual framework, an intellectual stance, a point of view, a theory of knowledge
• Fundamental idea: distinction between sample and population
• Classical or frequentist statistics is fundamentally dualistic

Plato’s Republic, VII,7
Behold! human beings living in an underground den, which has a mouth open towards the light and reaching all along the den; here they have been from their childhood […] Above and behind them a fire is blazing at a distance, […] you will see, if you look, a low wall built along the way, like the screen which marionette players have in front of them, over which they show the puppets. […] They see only their own shadows, or the shadows of one another, which the fire throws on the opposite wall of the cave […] To them, I said, the truth would be literally nothing but the shadows of the images.

Old Testament, Ecclesiastes 1:15
What is crooked cannot be straightened; what is lacking cannot be counted.
New Testament, 1 Corinthians 13:12
For now we see through a glass, darkly, but then face to face: now I know in part; but then shall I know even as also I am known.

The knowledge problem in microbiome studies
Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. - Wikipedia
DNA extraction bias notwithstanding, metagenomics is the most unrestricted and comprehensive approach.
Our ability to interpret these data is always improving, and we stand on a precipice of unprecedented discovery […] Microbes are not the only group to benefit from these surveys; viruses exist at 10 times the abundance of microbes […]. - Gilbert, 2011
BUT: METAGENOMIC SURVEYS RECOVER ONLY A SMALL FRACTION OF THE EXTANT DIVERSITY. NONETHELESS, MANY METHODS TREAT THE OBSERVED SAMPLE AS THE POPULATION.

MACHINES
The fundamental idea of statistics: distinction between population (or universe) and sample (or data)

THE SAMPLE IS A SUBSET OF THE POPULATION
• Population: universe, reality, state of nature, truth, parameters
• Sample: finite, random, noise, error, perturbation, shock, statistics
Statistical inference: extract maximum information from sample in order to draw conclusions about population
Inductive, not deductive

Question: In a microbial diversity study, what is the population?
• Collect 1L seawater @ 500m depth in ocean
• From 1L, remove 5ml & exhaustively sequence microbial DNA
• Cluster sequences into OTUs
• From OTUs, calculate frequency count data
• Compute estimate of total species richness
Question: Richness of what population? Original 1L of water? Surrounding environment? Entire pelagic microbiome?
Definition: The population is what would be observed if the operative sampling and analysis protocols were carried out to infinite effort.

How do we statistically estimate total microbial taxonomic richness?

Physical DNA sample
Next-generation sequencing; bioinformatic preprocessing
Collection of sequences
Bioinformatic processing: alignment, clustering, counting
• Cluster sequences at some % “identity,” typically 97%
• {clusters} = {OTUs}
• OTU = “operational taxonomic unit”

Statistical problem: Estimate total population diversity (number of species, classes, taxa, OTUs) based on frequency count data
Data = # of units observed exactly once in sample (singletons); # observed exactly twice (doubletons); # observed exactly three times; …
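
As a small illustration of the frequency count data just described, the sketch below tallies how many OTUs were observed exactly once, exactly twice, and so on. The OTU labels and abundances are invented for illustration, and `frequency_counts` is a hypothetical helper, not part of any software named in these slides.

```python
from collections import Counter

def frequency_counts(otu_abundances):
    """Map j -> f_j, the number of OTUs observed exactly j times in the sample."""
    # OTUs with zero reads are unobserved by definition, so they are skipped.
    tally = Counter(n for n in otu_abundances.values() if n > 0)
    return dict(sorted(tally.items()))

# Hypothetical toy sample: OTU label -> number of sequences assigned to it.
toy_sample = {"OTU_a": 1, "OTU_b": 1, "OTU_c": 2, "OTU_d": 5, "OTU_e": 1}

freq = frequency_counts(toy_sample)
print(freq)  # {1: 3, 2: 1, 5: 1}: three singletons, one doubleton, one OTU seen 5 times
```

The number of observed taxa is the sum of the f_j; the number of unobserved taxa, f_0, is what the rest of the talk tries to estimate.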
Frequency count data example
Microbial ecology: Fiona Walsh et al.
• Data from soil in apple orchards
• Use of antibiotics on bacterial populations in soil ecosystems
• Singletons ≈ 2x doubletons; may be 10x!
• Goal is to estimate taxonomic richness of community
• Change with respect to intervention/covariates/metadata

freq  count    freq  count
1     317      124   1
2     179      128   1
3     127      133   1
4     77       134   1
5     66       149   1
6     61       159   1
7     39       170   1
8     42       184   1
9     29       195   1
10    24       208   1
11    12       232   1
12    27       …
               262   1

Walsh F, Owens S, Duffy B, Smith DP, Frey JE. 2013. Streptomycin use in apple orchards did not alter the soil bacterial communities

[Figure: Apple orchard data, original scale (count vs. frequency)]
[Figure: Apple orchard data, log scale (count vs. frequency)]
Issues:
• High diversity
• Typical of microbial data
• Singletons ~ 2x doubletons
• Data acquisition / bioinformatic issues
• Spurious singletons?
• Correct at what stage? Statistical approach?

Statistical inference from frequency count data
STANDARD MODEL
• C classes/taxa/species in population. Each species independently contributes Poisson-distributed # of representatives to the sample.
$$X_1 \sim \mathrm{Poisson}(\lambda_1),\quad X_2 \sim \mathrm{Poisson}(\lambda_2),\quad X_3 \sim \mathrm{Poisson}(\lambda_3),\quad \ldots,\quad X_C \sim \mathrm{Poisson}(\lambda_C)$$
• Counts ~ zero-truncated mixed Poisson.

The mixed-Poisson model
• Species (taxon) i contributes a Poisson-distributed number Xi of replicates to the sample, i.e., taxon i appears in the sample Xi times.
• Units appear independently in the sample
• Fundamental problem: heterogeneity, i.e., unequal Poisson means λi
• Standard approach: model λi’s as i.i.d. replicates from some mixing distribution F
• Frequency counts fi are then marginally i.i.d.
F-mixed Poisson random variables
• Zero-truncated since zero counts Xi are unobservable

The mixed-Poisson model cont’d
• Mixing distribution F, i.e., distribution of sampling intensities λ, is also called species abundance distribution
  o Probably a misnomer
• Mathematical treatment (marginalization) implies that each species’ contribution to the sample is independent and identically distributed
  o Both assumptions are certainly wrong
• How to account for dependent or differently distributed species counts? Not in standard model.

Mixing distributions F
Parametric, low-dimensional parameter vector
• None ≡ point mass at λ ≡ all equal species sizes
• Gamma (Fisher, 1943)
• Lognormal
• Inverse Gaussian, generalized inverse Gaussian (Sichel)
• Pareto
• Log-t
• Stable
Finite mixture of exponentials: semiparametric

Richness estimation under the Poisson model
Diversity estimate is then
$$\hat{N}_F := \frac{\#\text{ taxa in sample}}{1 - P_F(0)}$$
where $P_F(0)$ = F-mixed Poisson probability of 0:
$$P_F(0) = \int e^{-\lambda}\, dF(\lambda) = E_F\left[e^{-\lambda}\right]$$
$\hat{N}_F$ is the Horvitz-Thompson estimator (HTE) and is uniformly minimum variance unbiased (UMVU).
Require empirical version of $\hat{N}_F$, i.e., require estimate of $P_F(0)$ (frequentist version).

Richness estimation under the Poisson model, cont’d
Require empirical version of HTE
$$\hat{N}_F := \frac{\#\text{ taxa in sample}}{1 - P_F(0)}, \qquad F = F(\lambda, \theta)$$
Estimate θ by ML, using zero-truncated F-mixed Poisson, conditional on # of observed taxa.
Final estimator:
$$\hat{N}_F := \frac{\#\text{ taxa in sample}}{1 - P_F(0, \hat{\theta})}$$
• SE via Fisher information
• CI via (approximation to) profile likelihood

CatchAll software
www.northeastern.edu/catchall or: STAMPS!
Developed under NSF grant DEB-0816638 by JB/LW/SC, in C# & C
• Implements
  o finite mixtures of 0-4 exponential components (F)
  o weighted linear regression procedure
  o all Chao-type nonparametric procedures
  o model evaluation/GOF/selection/outlier assessment
• Produces estimates, SEs, & CIs
• Fast, efficient, platform-independent
• Excel graphics (VBA) package
• Summary or copious output (text files)
Bunge J, Woodard L, Böhning D, Foster JA, Connolly S, Allen HK. 2012b. Estimating population diversity with CatchAll. Bioinformatics 28:1045--47

Partial CatchAll summary output for apple orchard data
Total Number of Observed Species = 1187

                 Model          Tau  Observed Sp  Estimated Total Sp  SE     Lower CB  Upper CB  GOF0    GOF5
Best Parm Model  ThreeMixedExp  184  1183         1823.5              122.4  1625.1    2111.6    0.0118  0.6038
Parm Model 2a    ThreeMixedExp  118  1175         1854.9              158    1609.8    2242.3    0.1428  0.3632
Parm Model 2b    ThreeMixedExp  262  1187         1797.6              101.6  1628.6    2031.3    0       0.4029
Parm Model 2c    TwoMixedExp    23   1087         1865.5              141    1640.4    2202.2    0.0001  0.0208
WLRM             UnTransf       10   961          2285.8              572.7  1607.4    4058.9    0.0206
Parm Max Tau     ThreeMixedExp  262  1187         1797.6              101.6  1628.6    2031.3    0       0.4029
WLRM Max Tau     LogTransf      31   1114         1390.3              30.4   1338.9    1459.2

[Figure: CatchAll fitted models for apple orchard data (counts vs. frequency). Series: Observed; Best (ThreeMixedExp, τ = 184); Other 1 (ThreeMixedExp, τ = 118); Other 2 (ThreeMixedExp, τ = 262); Other 3 (TwoMixedExp, τ = 23). Cutoff τ = 184 marked.]

Data-analytic considerations
• Problem of right cutoff point τ
  o Typically no parametric model will fit complete frequency count dataset
  o Too many right outliers (highly abundant taxa in sample) with large gaps between counts
  o Nonparametric methods do even worse with outliers, diverging to ∞ as outliers are included in data
• Data-analytic solution: remove large frequency counts for frequencies > some cutoff τ
  o Chao1: τ = 2
  o Chao-type coverage-based nonparametric methods: τ = 10 (arbitrary)
  o Parametric mixture models: τ
selected by goodness-of-fit algorithm
  o Weighted linear regression model: selected by goodness-of-fit
• Further problem: model selection and outlier deletion confounded
  o Computational solution: compute all methods at every τ
  o Requires optimized code
  o Use double selection algorithm to select “best of the best”
  o Introduces simultaneous inference problem: large number of simultaneous GOF tests. Little theory exists to correct for this.

Statistical analysis of standard model: The bigger picture
Philosophy/approach: Frequentist
  o Parametric: Maximum likelihood (Bunge et al.); Weighted linear regression (Rocchetti et al. 2011)
  o Nonparametric: Coverage-based (Chao et al.); Zelterman; NPMLE (Böhning et al.)
Philosophy/approach: Bayesian
  o Parametric: Objective Bayes (Barger et al.; Quince et al.)
  o Nonparametric: ??? (Tardella et al. for capture-recapture)

Statistical analysis of standard model: Chao-type nonparametrics
• Coverage-based approaches
• Coverage = proportion of population represented in sample
• Random variable not parameter
• Can interpret $1 - P_F(0)$ as surrogate for coverage
• Turing’s estimate of $P_F(0)$: $f_1 / n$, where n = # of individual units in sample
• Good-Turing estimate of diversity: $\dfrac{\#\text{ of taxa in sample}}{1 - f_1/n}$
• Chao’s abundance-based coverage estimators (ACE): Good-Turing + adjustment for heterogeneity
Chao, A. & J. Bunge. 2002. Estimating the number of species in a stochastic abundance model.
Biometrics 58: 531--539

Coverage-based estimators diverge to infinity as large frequency counts are included
[Figure: Estimated count vs. Tau. Series: Observed Sp; Est Sp for NonParametric, Poisson, SingleExp, TwoMixedExp, ThreeMixedExp, and FourMixedExp models.]
Hence coverage-based estimators require τ ≤ 10

Statistical analysis of standard model: general nonparametrics
• Nonparametric maximum likelihood estimation
• Leave species abundance distribution F unspecified, i.e., F varies across all possible distributions
• Mathematical implications: F is actually non-identifiable
• Nevertheless NPMLE is possible in principle.
• Computational issues: difficult numerical search, highly complex error estimation.
• Software CAMCR
Böhning D, Kuhnert R. 2009. CAMCR: Computer-Assisted Mixture model analysis for Capture-Recapture count data. AStA Adv. Stat. Anal. 93:61--71

The Bayesian paradigm
• Rev. Thomas Bayes
• Bayesian statistics: Probabilistic & statistical statements concern degrees of belief
• Usually parametric: statements concern values of parameters, e.g., species richness. Nonparametric Bayes is possible but complex.
• Procedure:
  1. Investigator first declares existing belief about population value: this is prior distribution
  2. Collect sample data
  3. Update prior, based on data, to obtain posterior, i.e., final state of knowledge or belief about population.

The Bayesian paradigm cont’d
Bayes’ Theorem:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$$
Posterior distribution:
$$P(\text{parameters} \mid \text{data}) \propto \underbrace{P(\text{data} \mid \text{parameters})}_{\text{likelihood}} \times \underbrace{P(\text{parameters})}_{\text{prior}}$$
Bayesian computation is now fairly well established

Bayesian estimation of taxonomic richness based on the standard model
• Species abundance distribution F is parametric: F depends on a small number of parameters (typically 2-3), called θ
• Parameter of interest is total richness C
• Procedure:
  1. Establish prior distributions for θ and C
  2. Likelihood function is known (based on mixed-Poisson)
  3. Run Bayesian machinery
  4. Obtain posterior distribution, estimate, “credible interval,” etc.
• Quince et al.: quasi-noninformative priors; Barger et al.: formal objective priors. Active research area in statistics.
Quince C, Curtis TP, Sloan WT. 2008. The rational exploration of microbial diversity. ISME J. 2:997--1006; Barger K, Bunge J. 2011. Objective Bayesian estimation for the number of species. J. Bayesian Analysis 5:765--86

A New Hope
Is it possible to estimate taxonomic richness without
• a species abundance distribution
• independent species contributions to the sample
• identically distributed species contributions to the sample?
Yes, using ratios of frequency counts.

breakaway: Estimating taxonomic richness based on ratios of frequency counts

j   count  (j+1)f_(j+1)/f_j
1   317    1.13
2   179    2.13
3   127    2.43
4   77     4.29
5   66     5.55
6   61     4.48
7   39     8.62
8   42     6.21
9   29     8.28
10  24     5.50
11  12     27.00
12  27     7.70

[Figure: Ratio plot, apple orchard data: (j+1)f_(j+1)/f_j vs. j]
Idea: ratios are ~ linear
Project line downward to obtain f0 = # of unobserved species
$$r(j) := \frac{(j+1)\, f_{j+1}}{f_j}$$

breakaway: Estimating taxonomic richness based on ratios of frequency counts, cont’d
Some issues:
• Straight-line fit may go negative!
• Can be fixed by ad hoc log-transformation (Rocchetti et al.)
• Broad generalization: represent ratio of frequency counts as ratio of polynomials
• Deep probabilistic justification; corrects negativity
$$\frac{f_{j+1}}{f_j} \approx \frac{\beta_0 + \beta_1 j + \beta_2 j^2 + \beta_3 j^3}{1 + \alpha_1 j + \alpha_2 j^2 + \alpha_3 j^3}$$
Rocchetti I, Bunge J, Böhning D. 2011. Population size estimation based upon ratios of recapture probabilities. Ann. Appl. Stat. 5:1512--33; Willis A. and Bunge J. (2013) in prep.
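
The ratio-plot idea can be sketched in a few lines. Since r(0) = f_1/f_0 by the definition of r(j), extrapolating a fitted line to j = 0 gives f_0 ≈ f_1 / r(0). The sketch below uses plain unweighted least squares on the first ten ratios of the apple orchard counts; both choices are simplifying assumptions made here for illustration. It is not the weighted procedure of Rocchetti et al. or the rational-function model in breakaway, and the function names are hypothetical.

```python
def ratio_points(freq):
    """Return (j, (j+1)*f_{j+1}/f_j) for every j where f_j and f_{j+1} are both positive."""
    return [(j, (j + 1) * freq[j + 1] / freq[j])
            for j in sorted(freq)
            if freq[j] > 0 and freq.get(j + 1, 0) > 0]

def fit_line(points):
    """Ordinary (unweighted) least squares; returns (intercept, slope)."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def estimate_f0(freq, max_j=10):
    """Extrapolate the fitted line to j = 0 and solve r(0) = f_1 / f_0 for f_0."""
    points = [(j, r) for j, r in ratio_points(freq) if j <= max_j]
    intercept, _ = fit_line(points)
    if intercept <= 0:
        # The straight-line fit can go negative, in which case no estimate exists.
        raise ValueError("fitted line is non-positive at j = 0; no estimate of f0")
    return freq[1] / intercept

# Apple orchard frequency counts f_1..f_12 from the table above.
apple = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
         7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}

f0_hat = estimate_f0(apple)
print(f"estimated unobserved taxa: {f0_hat:.0f}")
print(f"estimated total richness:  {1187 + f0_hat:.0f}")  # 1187 observed taxa
```

Calling `estimate_f0(apple, max_j=11)` pulls in the outlying j = 11 ratio (27.00), and with these data the fitted intercept then falls below zero, so the function raises. That is one concrete instance of the "straight-line fit may go negative" issue listed above.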
breakaway: Estimating taxonomic richness based on ratios of frequency counts, cont’d
################## Smoothed weights ##################
The best estimate of total diversity is 1800 with std error 256
The model employed was model_1_1
The function selected was
f_{x+1}/f_{x} ~ (beta0+beta1*(x-xbar))/(1+alpha1*(x-xbar))
        Coef estimates  Coef std errors
beta0   1.11078693      0.13241518
beta1   0.05383757      0.02916098
alpha1  0.03002143      0.03840271

breakaway: Estimating taxonomic richness based on ratios of frequency counts, cont’d
• Nonlinear regression
• Heteroscedastic (changing variance)
• Autocorrelated: f2/f1 is correlated with f3/f2, etc.
• Collinear: parameter estimates of α’s and β’s highly correlated unless corrected
• Multiple significant numerical challenges
Statistical questions
• Model selection: degree of numerator and denominator polynomials
• Error estimation
• Underlying probability theory: what do these models imply, and what are they implied by?

Noise and unreliable low frequency counts
Next generation sequencing technology […] has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, […] because of the large read numbers and the lack of consensus sequences it is vital to distinguish noise from true sequence diversity in this data. Otherwise this leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. - Quince et al. (2011)

Methods to address unreliable low frequency counts
I. Fix the data at the source!
• Example: PyroNoise and AmpliconNoise aim at “separately removing 454 sequencing errors and PCR single base errors.” (Quince 2011)
• Direct, non-statistical approach

Methods to address unreliable low frequency counts

Methods to address unreliable low frequency counts
III. Deleting the high-diversity component of a mixture model
Bunge J, Böhning D, Allen H, Foster JA. 2012a.
Estimating population diversity with unreliable low frequency counts. In Biocomputing 2012: Proceedings of the Pacific Symposium, pp. 203--12. Hackensack, NJ: World Sci. Publ

Methods to address unreliable low frequency counts
IV. Bayesian approaches
• Informative or subjective: investigator specifies non-trivial downweighting or rapidly decreasing prior for higher diversity values
• Specific choice of prior?

Numerical results from viral phage data: Lower bounds and component deletion

Method                   EstDiv  SE    LCB    UCB
Poisson                  8730    103   8535   8938
GoodTuring               11690   346   11050  12407
ThreeMixedExp            67792   8656  53009  87195
Discounted: TwoMixedExp  1727    221   1410   2305

Some notes on β-diversity
• Crucial to distinguish between
  o Statistical inference procedures that (attempt to) account for unobserved as well as observed diversity
  o Procedures (computational, graphical, or qualitative) that treat the observed sample as the population. UniFrac, “ordination” methods, co-inertia.
• Only the former considered here. Estimation of population parameters, possible hypothesis testing.

Statistical inference for comparing taxonomic diversity across populations
• Simplest version: Estimate richness in each population, with associated standard errors and confidence intervals, & compare (e.g., do CI’s overlap?)
• Can be done with existing methods: parametric, nonparametric, Bayesian, etc.
• Exactly ONE known inferential procedure. Lower bound for # of shared taxa:
$$\hat{S}_{12} = D_{12} + a\,\frac{f_{1+}^2}{2 f_{2+}} + b\,\frac{f_{+1}^2}{2 f_{+2}} + ab\,\frac{f_{11}^2}{4 f_{22}}$$
(D12 = observed # of shared species, fjk = # of species observed j times in sample 1 and k times in sample 2, a subscript + = observed at least once in that sample, a and b = constants)
Pan HY, Chao A, Foissner W. 2009. A nonparametric lower bound for the number of species shared by multiple communities. J. Agric. Biol. Environ. Stat. 14:452--68

Statistical inference for β-diversity: other scenarios
• Inference for the Jaccard index, accounting for unobserved species (Chao et al.)
• Inference for “the probability of a draw from one distribution not being observed in k draws from another distribution.” (Hampton et al.)
• Statistical work in this area not extensive; very fertile area for research.
Chao A, Chazdon RL, Colwell RK, Shen T-J. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62:361--71; Hampton J, Lladser ME. 2012. Estimation of distribution overlap of urn models. PLoS ONE 7:e42368

NEVER throw away data when doing statistical inference
“Not even wrong” - Wolfgang Pauli

There is no post hoc statistical fix for
• Ill-posed research problem
• Vaguely defined population
• Statistical model not appropriate for
  o population description
  o sample generation process
• Model must compromise between detailed phenomenological description and parsimony
• “To what extent can we idealize the properties of the system and still obtain satisfactory results? The answer to this question can only be given in the end by experiment. Only the comparison of the answers provided by analysis of our model with the results of the experiment will enable us to judge whether the idealization is legitimate.” Andronov (1937) Theory of Oscillators.

On the sociology of science
• Fact: Universities have statistics departments!
  o Cornell: www.stat.cornell.edu
  o At least 131 university stat dept’s in U.S.
Random sample of 10:
    • University of California, Berkeley, Division of Biostatistics
    • Princeton University, Program in Statistics and Operations Research
    • Bowling Green State University, Department of Applied Statistics and Operations Research
    • University of Illinois, Urbana-Champaign, Department of Statistics
    • University of South Carolina, Department of Statistics
    • Columbia School of Public Health, Division of Biostatistics
    • Medical College of Georgia, Office of Biostatistics and Bioinformatics
    • Duke University, Institute of Statistics and Decision Sciences
    • Yale University, Department of Statistics
    • University of Michigan, Department of Biostatistics
• Collaboration extremely valuable in both directions (even though academic incentive structure may not immediately reward it)
• Be persistent: “Fall down seven times, get up eight”

CatchAll
• http://www.northeastern.edu/catchall/ or STAMPS!
• V.4 now available; mothur uses v.3 (?)
• Two programs: basic analysis program + Excel graphics spreadsheet (macros)
• Windows GUI, Windows command-line: .Net framework must be installed
• Mac OS/Linux command-line: mono must be installed.
Input data file structure: *.csv (comma-separated values)
1,f1
2,f2
…
m,fm

CatchAll cont’d
• Read in data
• Go!
(Can set option to omit most complex model, if too time-consuming; see manual)
• Output files appear in “Output” folder/directory
  o datasetname_Analysis.csv: Complete listing of all analyses
  o datasetname_BestModelsAnalysis.csv: Column-formatted summary analysis output
  o datasetname_BestModelsFits.csv: Fitted values for the "best models" as selected by the model selection algorithm
  o datasetname_BubblePlot.csv: Data to generate bubble plots using Excel spreadsheet

CatchAll cont’d: BestModelsAnalysis file
• Total number of observed species: self-explanatory
• Model: see manual
• Tau: upper-frequency cutoff
• Observed Sp: number of species (counts) with frequencies up to τ only
• Estimated total Sp: final estimate of the total number of species in the population
• SE: standard error of preceding estimate
• Lower CB, Upper CB: lower and upper 95% confidence bounds
• GOF0, GOF5: Pearson goodness-of-fit p-values, uncorrected and corrected

CatchAll cont’d: BestModelsAnalysis file
• Best Parm Model; Parm Model 2a, 2b, 2c: parametric models (and τ’s) selected by various goodness-of-fit criteria
• WLRM: weighted linear regression model
• Parm Max Tau, WLRM Max Tau: best parametric model and WLRM computed on entire dataset
• Best Discounted: best parametric model with low-frequency/high-diversity component deleted
• Non-P 1: Chao1, nonparametric lower bound for total number of species
• Non-P 2: Chao’s ACE or high-diversity variant ACE1 (τ ≤ 10)
• Non-P 3: Chao’s ACE (τ ≤ 10)

CatchAll cont’d: Analysis file
• All models & procedures computed by CatchAll, including several not reported in summary analysis
• All cutoffs τ
• All supplementary/supporting information (GOF etc.)
• Question: what if no “best” parametric model selected?
  o Means no model passed most stringent GOF criteria
  o Revert to alternative models (2a-c)
  o If necessary revert to lower bounds (Chao1 etc.)
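
For the fallback to lower bounds mentioned above, the two simplest nonparametric quantities in this talk can be computed by hand. The sketch below assumes the classical Chao1 form, observed taxa + f_1^2 / (2 f_2), and the Good-Turing estimate as defined earlier, observed taxa / (1 - f_1/n). The function names are hypothetical, the toy counts are invented, and this is not CatchAll's implementation, which also reports standard errors and confidence bounds.

```python
def chao1(observed, f1, f2):
    """Classical Chao1 lower bound: observed taxa + f1^2 / (2 * f2)."""
    if f2 == 0:
        raise ValueError("this basic form needs at least one doubleton")
    return observed + f1 ** 2 / (2 * f2)

def good_turing(freq):
    """Good-Turing estimate: observed taxa / (1 - f1 / n), n = # of individual units."""
    observed = sum(freq.values())
    n = sum(j * fj for j, fj in freq.items())
    f1 = freq.get(1, 0)
    if f1 >= n:
        raise ValueError("every unit is a singleton; estimated coverage is zero")
    return observed / (1 - f1 / n)

# Apple orchard data: 1187 observed taxa, 317 singletons, 179 doubletons.
print(round(chao1(1187, 317, 179), 1))   # 1467.7

# Hypothetical toy frequency counts {j: f_j}, used because the full orchard
# table (and hence n) is truncated in the slides.
toy = {1: 3, 2: 1, 5: 1}
print(round(good_turing(toy), 2))        # 5 taxa, n = 10, f1 = 3 -> 7.14
```

Chao1 uses only f_1 and f_2 (τ = 2), which is why it stays stable when highly abundant taxa are present, at the cost of being a lower bound rather than an estimate of total richness.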