Estimating Microbial Diversity
John Bunge
jab18@cornell.edu
Department of Statistical Science
Cornell University
Thanks to:
Amy Willis
Fiona Walsh
David Mark Welch
Colleagues too numerous to mention
Bunge, J., Willis, A. and Walsh, F. (2013)
Estimating the number of species in microbial
diversity studies. Ann. Rev. of Statist. and its
Appl. v.1. Forthcoming.
Statisticians
Bioinformaticists
Statistics is not a collection of formulae, nor computer
programs, but a conceptual framework, an intellectual
stance, a point of view, a theory of knowledge
Fundamental idea:
distinction between sample and population
Classical or frequentist statistics is fundamentally
dualistic
Plato’s Republic, VII,7
Behold! human beings living in an underground
den, which has a mouth open towards the light
and reaching all along the den; here they have
been from their childhood
[…]
Above and behind them a fire is blazing at a
distance, […] you will see, if you look, a low wall
built along the way, like the screen which
marionette players have in front of them, over
which they show the puppets.
[…]
They see only their own shadows, or the
shadows of one another, which the fire throws on
the opposite wall of the cave […]
To them, I said, the truth would be literally
nothing but the shadows of the images.
Old Testament
Ecclesiastes 1:15
What is crooked cannot be straightened;
what is lacking cannot be counted.
New Testament
1 Corinthians 13:12
For now we see through a glass, darkly, but then face to face: now
I know in part; but then shall I know even as also I am known.
The knowledge problem in microbiome studies
Metagenomics is the study of metagenomes, genetic material
recovered directly from environmental samples.
-Wikipedia
DNA extraction bias notwithstanding, metagenomics is the most
unrestricted and comprehensive approach. Our ability to
interpret these data is always improving, and we stand on a
precipice of unprecedented discovery […] Microbes are not the
only group to benefit from these surveys; viruses exist at 10
times the abundance of microbes […]. - Gilbert, 2011
BUT: METAGENOMIC SURVEYS RECOVER ONLY A SMALL FRACTION
OF THE EXTANT DIVERSITY. NONETHELESS, MANY METHODS TREAT
THE OBSERVED SAMPLE AS THE POPULATION.
MACHINES
The fundamental idea of statistics:
Distinction between
Population (or universe) and Sample (or data)
THE SAMPLE IS A SUBSET OF THE POPULATION
Population: universe, reality, state of nature, truth; described by parameters
Sample: finite, random; subject to noise, error, perturbation, shock; described by statistics
Statistical inference: Extract maximum information from
sample in order to draw conclusions about population
Inductive not deductive
Question: In a microbial diversity study, what is the population?
• Collect 1L seawater @ 500m depth in ocean
• From 1L, remove 5ml & exhaustively sequence microbial DNA
• Cluster sequences into OTUs
• From OTUs, calculate frequency count data
• Compute estimate of total species richness
• Question: Richness of what population? Original 1L of water? Surrounding environment? Entire pelagic microbiome?
Definition
The population is what would be observed if the operative
sampling and analysis protocols were carried out to infinite
effort.
How do we statistically estimate
total microbial taxonomic
richness?
Physical DNA sample
→ Next-generation sequencing; bioinformatic preprocessing
→ Collection of sequences
→ Bioinformatic processing: alignment, clustering, counting
Cluster sequences at some % “identity,” typically 97%
{clusters} = {OTUs}
OTU = “operational taxonomic unit”
Statistical problem:
Estimate total population diversity – number of
species, classes, taxa, OTUs – based on
frequency count data
Data =
# of units observed exactly once in
sample (singletons);
# observed exactly twice (doubletons);
# observed exactly three times; … .
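As a minimal sketch of how such frequency count data arise from clustered sequences, the snippet below tallies f_j (the number of OTUs observed exactly j times) from a list of per-OTU abundances. The function name and the toy abundance list are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter


def frequency_counts(otu_abundances):
    """Return {j: f_j}, where f_j is the number of OTUs observed exactly j times."""
    counts = Counter(a for a in otu_abundances if a > 0)  # zero counts are unobservable
    return dict(sorted(counts.items()))


# Hypothetical OTU table: number of sequences assigned to each of ten OTUs.
abundances = [1, 1, 1, 2, 2, 3, 7, 1, 2, 15]
freq = frequency_counts(abundances)
print(freq)  # {1: 4, 2: 3, 3: 1, 7: 1, 15: 1}
```

Here f_1 = 4 singletons and f_2 = 3 doubletons; the sum of the f_j is the number of observed OTUs and the sum of j * f_j is the number of sequences.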
Frequency count data example
Microbial ecology
Fiona Walsh et al.
• Data from soil in apple orchards
• Use of antibiotics on bacterial populations in
soil ecosystems
• Singletons ≈ 2x doubletons – may be 10x!
• Goal is to estimate taxonomic richness of
community
• Change with respect to
intervention/covariates/metadata
freq  count      freq  count
   1    317       124      1
   2    179       128      1
   3    127       133      1
   4     77       134      1
   5     66       149      1
   6     61       159      1
   7     39       170      1
   8     42       184      1
   9     29       195      1
  10     24       208      1
  11     12       232      1
  12     27 …     262      1
Walsh F, Owens S, Duffy B, Smith DP, Frey JE. 2013.
Streptomycin use in apple orchards did not alter the soil bacterial
communities
[Figure: Apple orchard data - original scale. Count (0 to 350) vs. frequency (0 to 300).]
[Figure: Apple orchard data - log scale. Count (1 to 1000, log axis) vs. frequency (0 to 300).]
Issues:
• High diversity
• Typical of microbial data
• Singletons ~ 2x doubletons
• Data acquisition / bioinformatic issues
• Spurious singletons?
• Correct at what stage? Statistical approach?
Statistical inference from frequency count data
STANDARD MODEL
• C classes/taxa/species in population. Each species independently
contributes Poisson-distributed # of representatives to the sample.
$X_1 \sim \mathrm{Poisson}(\lambda_1),\; X_2 \sim \mathrm{Poisson}(\lambda_2),\; X_3 \sim \mathrm{Poisson}(\lambda_3),\; \ldots,\; X_C \sim \mathrm{Poisson}(\lambda_C)$
→ sample
• Counts ~ zero-truncated mixed Poisson.
The mixed-Poisson model
• Species (taxon) i contributes a Poisson-distributed number Xi of replicates to the sample, i.e., taxon i appears in the sample Xi times.
• Units appear independently in the sample
• Fundamental problem: heterogeneity, i.e., unequal Poisson means λi
  o Standard approach: model the λi's as i.i.d. replicates from some mixing distribution F
  o The counts Xi are then marginally i.i.d. F-mixed Poisson random variables
  o Zero-truncated, since zero counts Xi are unobservable
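The standard model can be simulated directly. The sketch below assumes, purely for illustration, an exponential mixing distribution F with mean 2 and C = 2000 taxa; the function names are hypothetical. It draws each λi from F, draws Xi ~ Poisson(λi), and discards the zeros, which is exactly the zero-truncation described above.

```python
import math
import random
from collections import Counter


def poisson_draw(lam, rng):
    """One Poisson(lam) variate via Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_sample(C, mean_intensity, seed=0):
    """Simulate the mixed-Poisson model with exponential F (an illustrative assumption)."""
    rng = random.Random(seed)
    # expovariate takes a rate, so 1/mean gives intensities with the requested mean.
    lambdas = [rng.expovariate(1.0 / mean_intensity) for _ in range(C)]
    x = [poisson_draw(lam, rng) for lam in lambdas]
    observed = [xi for xi in x if xi > 0]  # taxa with Xi = 0 never appear in the data
    return Counter(observed), len(observed)


freq, c = simulate_sample(C=2000, mean_intensity=2.0)
print("observed taxa:", c, "of 2000; singletons:", freq[1], "doubletons:", freq[2])
```

With an exponential F of mean 2 the probability that a taxon goes unobserved is 1/3, so roughly two thirds of the 2000 taxa should appear in the simulated sample.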
The mixed-Poisson model cont’d
• Mixing distribution F, i.e., distribution of sampling intensities λ, is also called species abundance distribution
  o Probably a misnomer
• Mathematical treatment (marginalization) implies that each species' contribution to the sample is independent and identically distributed
  o Both assumptions are certainly wrong
• How to account for dependent or differently distributed species counts? Not in standard model.
Mixing distributions F
• Parametric, low-dimensional parameter vector
  o None ≡ point mass at λ ≡ all equal species sizes
  o Gamma (Fisher, 1943)
  o Lognormal
  o Inverse Gaussian, generalized inverse Gaussian (Sichel)
  o Pareto
  o Log-t
  o Stable
• Finite mixture of exponentials - semiparametric
Richness estimation under the Poisson model
• Diversity estimate is then
$$\hat{N}_F := \frac{\#\,\text{taxa in sample}}{1 - P_F(0)}$$
where PF(0) = F-mixed Poisson probability of 0:
$$P_F(0) = \int e^{-\lambda}\, dF(\lambda) = E_F\left[e^{-\lambda}\right]$$
$\hat{N}_F$ is the Horvitz-Thompson estimator (HTE) and is uniformly minimum variance unbiased (UMVU).
• Require empirical version of $\hat{N}_F$, i.e., require estimate of PF(0) (frequentist version).
Richness estimation under the Poisson model, cont’d
• Require empirical version of HTE
$$\hat{N}_F := \frac{\#\,\text{taxa in sample}}{1 - P_F(0)}, \qquad F = F(\lambda, \theta)$$
• Estimate θ by ML, using zero-truncated F-mixed Poisson, conditional on # of observed taxa. Final estimator:
$$\hat{N}_F := \frac{\#\,\text{taxa in sample}}{1 - P_F(0, \hat{\theta})}$$
• SE via Fisher information
• CI via (approximation to) profile likelihood
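For intuition, the simplest non-trivial case of the ML-plus-HTE recipe can be worked in closed form. The sketch below assumes F is a single exponential, so the counts are geometric, and zero-truncated ML then gives the estimate c/n for PF(0), where c = # of taxa and n = # of individual units. The function name is hypothetical, only the twelve apple orchard frequencies listed in the slides are used (as if they were the whole sample), and this is an illustration of the principle rather than CatchAll's algorithm.

```python
def geometric_hte(freq):
    """Horvitz-Thompson richness estimate under a single-exponential mixing distribution.

    With exponential F the counts are geometric; the zero-truncated log-likelihood
    (n - c) * log(p) + c * log(1 - p) is maximised at p = 1 - c/n, so P(0)-hat = c/n.
    """
    c = sum(freq.values())                    # observed taxa
    n = sum(j * f for j, f in freq.items())   # observed individual units
    if n <= c:
        raise ValueError("every taxon is a singleton; the estimate is unbounded")
    p0_hat = c / n
    return c / (1.0 - p0_hat), p0_hat


# First twelve apple orchard frequency counts from the slides (j -> f_j).
apple = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
         7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}
estimate, p0_hat = geometric_hte(apple)
print(round(estimate, 1), round(p0_hat, 4))
```

Richer mixing distributions have no such closed form, which is why the ML step generally requires numerical optimisation.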
CatchAll software
www.northeastern.edu/catchall
or: STAMPS!
• Developed under NSF grant DEB-0816638 by JB/LW/SC, in C# & C
• Implements
  o finite mixtures of 0-4 exponential components (F)
  o weighted linear regression procedure
  o all Chao-type nonparametric procedures
  o model evaluation/GOF/selection/outlier assessment
• Produces estimates, SEs, & CIs
• Fast, efficient, platform-independent
• Excel graphics (VBA) package
• Summary or copious output (text files)
Bunge J, Woodard L, Böhning D, Foster JA, Connolly S, Allen HK.
2012b. Estimating population diversity with CatchAll.
Bioinformatics 28:1045--47
Partial CatchAll summary output for apple orchard data
Total Number of Observed Species = 1187
                 Model          Tau  Observed Sp  Estimated Total Sp     SE  Lower CB  Upper CB    GOF0    GOF5
Best Parm Model  ThreeMixedExp  184         1183              1823.5  122.4    1625.1    2111.6  0.0118  0.6038
Parm Model 2a    ThreeMixedExp  118         1175              1854.9  158      1609.8    2242.3  0.1428  0.3632
Parm Model 2b    ThreeMixedExp  262         1187              1797.6  101.6    1628.6    2031.3  0       0.4029
Parm Model 2c    TwoMixedExp     23         1087              1865.5  141      1640.4    2202.2  0.0001  0.0208
WLRM             UnTransf        10          961              2285.8  572.7    1607.4    4058.9  0.0206
Parm Max Tau     ThreeMixedExp  262         1187              1797.6  101.6    1628.6    2031.3  0       0.4029
WLRM Max Tau     LogTransf       31         1114              1390.3   30.4    1338.9    1459.2
[Figure: CatchAll fitted models for apple orchard data. Counts (0 to 350) vs. frequency (0 to 300). Series: Observed; Best--ThreeMixedExp/Tau 184; Other 1--ThreeMixedExp/Tau 118; Other 2--ThreeMixedExp/Tau 262; Other 3--TwoMixedExp/Tau 23. Cutoff marked at τ = 184.]
Data-analytic considerations
• Problem of right cutoff point τ
o Typically no parametric model will fit complete frequency count dataset
o Too many right outliers – highly abundant taxa in sample – with large gaps
between counts
o Nonparametric methods do even worse with outliers, diverging to ∞ as
outliers are included in data
• Data-analytic solution: remove large frequency counts for
frequencies > some cutoff τ
o Chao1: τ = 2
o Chao-type coverage-based nonparametric methods: τ = 10 (arbitrary)
o Parametric mixture models: τ selected by goodness-of-fit algorithm
o Weighted linear regression model: selected by goodness-of-fit
• Further problem: model selection and outlier deletion confounded
o Computational solution: compute all methods at every τ
o Requires optimized code
o Use double selection algorithm to select “best of the best”
o Introduces simultaneous inference problem: large number of simultaneous
GOF tests. Little theory exists to correct for this.
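Chao1 is the simplest of the procedures listed above, since it uses only the singleton and doubleton counts (τ = 2). A minimal sketch, using the classical form of the estimator (observed taxa plus f1² / (2 f2)) and the apple orchard values reported in the slides; the function name is hypothetical.

```python
def chao1(observed_taxa, f1, f2):
    """Classical Chao1 lower bound: observed taxa + f1^2 / (2 * f2)."""
    if f2 <= 0:
        raise ValueError("the classical form needs at least one doubleton")
    return observed_taxa + f1 * f1 / (2.0 * f2)


# Apple orchard data from the slides: 1187 observed species, f1 = 317, f2 = 179.
estimate = chao1(observed_taxa=1187, f1=317, f2=179)
print(round(estimate, 1))
```

Because only f1 and f2 enter, the estimate is unaffected by the highly abundant outliers, but it is a lower bound rather than an estimate of total richness.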
Statistical analysis of standard model: The bigger picture
Philosophy/approach:
• Frequentist, parametric: Maximum likelihood (Bunge et al.); weighted linear regression (Rocchetti et al. 2011)
• Frequentist, nonparametric: Coverage-based (Chao et al.); Zelterman; NPMLE (Böhning et al.)
• Bayesian, parametric: Objective Bayes (Barger et al.; Quince et al.)
• Bayesian, nonparametric: ??? (Tardella et al. for capture-recapture)
Statistical analysis of standard model – Chao-type nonparametrics
• Coverage-based approaches
• Coverage = proportion of population represented in sample
• Random variable not parameter
• Can interpret 1 – PF(0) as surrogate for coverage
• Turing’s estimate of PF(0): $f_1 / n$, where n = # of individual units in sample
• Good-Turing estimate of diversity: $\dfrac{\#\,\text{taxa in sample}}{1 - f_1 / n}$
• Chao’s abundance-based coverage estimators (ACE):
Good-Turing + adjustment for heterogeneity
Chao, A. & J. Bunge. 2002. Estimating the number of species in a
stochastic abundance model. Biometrics 58: 531–539
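The Good-Turing estimate is a one-line computation once the frequency counts are available. The sketch below applies it to the twelve apple orchard frequencies listed in the slides, treating them as the entire sample purely for illustration; the function name is hypothetical and the heterogeneity adjustment used by ACE is not included.

```python
def good_turing(freq):
    """Good-Turing richness estimate: (# taxa in sample) / (1 - f1 / n)."""
    c = sum(freq.values())                    # taxa in sample
    n = sum(j * f for j, f in freq.items())   # individual units in sample
    f1 = freq.get(1, 0)
    if f1 >= n:
        raise ValueError("all units are singletons; estimated coverage is zero")
    return c / (1.0 - f1 / n)


# First twelve apple orchard frequency counts from the slides (j -> f_j).
apple = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
         7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}
estimate = good_turing(apple)
print(round(estimate, 1))
```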
[Figure: estimated count (0 to 7000) vs. Tau (10 to 1000, log axis). Series: Observed Sp; Est Sp for NonParametric, Poisson, SingleExp, TwoMixedExp, ThreeMixedExp and FourMixedExp models.]
Coverage-based estimators diverge to infinity as large frequency counts are included.
Hence coverage-based estimators require τ ≤ 10.
Statistical analysis of standard model: general nonparametrics
• Nonparametric maximum likelihood estimation
• Leave species abundance distribution F unspecified, i.e., F
varies across all possible distributions
• Mathematical implications: F is actually non-identifiable
• Nevertheless NPMLE is possible in principle.
• Computational issues: difficult numerical search, highly
complex error estimation.
• Software CAMCR
Böhning D, Kuhnert R. 2009. CAMCR: Computer-Assisted Mixture
model analysis for Capture-Recapture count data. AStA Adv. Stat.
Anal. 93:61--71
The Bayesian paradigm
• Rev. Thomas Bayes
• Bayesian statistics: Probabilistic & statistical statements concern
degrees of belief
• Usually parametric: statements concern values of parameters, e.g.,
species richness. Nonparametric Bayes is possible but complex.
• Procedure:
1. Investigator first declares existing belief about population value:
this is prior distribution
2. Collect sample data
3. Update prior, based on data, to obtain posterior, i.e., final state of
knowledge or belief about population.
The Bayesian paradigm cont’d
Bayes’ Theorem:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$$
Posterior distribution:
$$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\, P(\text{parameters}) = \text{likelihood} \times \text{prior}$$
Bayesian computation is now fairly well established
Bayesian estimation of taxonomic richness
based on the standard model
• Species abundance distribution F is parametric: F depends
on a small number of parameters (typically 2-3), called θ
• Parameter of interest is total richness C
• Procedure:
1. Establish prior distributions for θ and C
2. Likelihood function is known (based on mixed Poisson)
3. Run Bayesian machinery
4. Obtain posterior distribution, estimate, “credible
interval,” etc.
• Quince et al. quasi-noninformative priors; Barger et al.
formal objective priors. Active research area in statistics.
Quince C, Curtis TP, Sloan WT. 2008. The rational exploration of
microbial diversity. ISME J. 2:997--1006; Barger K, Bunge J.
2011. Objective Bayesian estimation for the number of species. J.
Bayesian Analysis 5:765--86
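The four-step procedure can be made concrete in the simplest setting. The sketch below assumes a single-exponential F, so the counts are geometric with parameter p, together with flat priors on p in (0, 1) and on C over a finite grid. These priors are illustrative assumptions and are not the priors of Quince et al. or Barger et al. Under them the likelihood is proportional to C!/(C-c)! (1-p)^C p^n, and p integrates out analytically to a Beta function, leaving a one-dimensional posterior for C. All names are hypothetical.

```python
import math


def richness_posterior(c, n, c_max):
    """Posterior over total richness C on the grid c..c_max (flat priors on p and C).

    log posterior(C) = 2*lgamma(C+1) - lgamma(C-c+1) - lgamma(n+C+2) + constant,
    where c = observed taxa and n = observed individual units.
    """
    grid = list(range(c, c_max + 1))
    log_w = [2.0 * math.lgamma(C + 1) - math.lgamma(C - c + 1) - math.lgamma(n + C + 2)
             for C in grid]
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]  # subtract the peak to avoid underflow
    total = sum(w)
    return grid, [x / total for x in w]


def credible_interval(grid, post, level=0.95):
    """Equal-tailed credible interval from a discrete posterior."""
    tail = (1.0 - level) / 2.0
    cum, lower, upper, found_lower = 0.0, grid[0], grid[-1], False
    for C, p in zip(grid, post):
        cum += p
        if not found_lower and cum >= tail:
            lower, found_lower = C, True
        if cum >= 1.0 - tail:
            upper = C
            break
    return lower, upper


# c and n computed from the twelve apple orchard frequencies listed in the slides.
grid, post = richness_posterior(c=1000, n=3626, c_max=5000)
mode = grid[post.index(max(post))]
lower, upper = credible_interval(grid, post)
print(mode, (lower, upper))
```

With realistic mixing distributions the integral over θ is no longer available in closed form, and the "Bayesian machinery" of step 3 is typically a sampling algorithm.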
A New Hope
Is it possible to estimate taxonomic richness without
• a species abundance distribution
• independent species contributions to the sample
• identically distributed species contributions to the sample?
Yes, using ratios of frequency counts.
breakaway: Estimating taxonomic richness based on
ratios of frequency counts
 j  count  (j+1)f_(j+1)/f_j
 1    317              1.13
 2    179              2.13
 3    127              2.43
 4     77              4.29
 5     66              5.55
 6     61              4.48
 7     39              8.62
 8     42              6.21
 9     29              8.28
10     24              5.50
11     12             27.00
12     27              7.70
[Figure: Ratio plot - apple orchard data. (j+1)f_(j+1)/f_j (0 to 80) vs. j (0 to 35).]
Idea: ratios are ~ linear
$$r(j) := \frac{(j+1)\, f_{j+1}}{f_j} \approx \beta_0 + \beta_1 j$$
Project line downward to j = 0, where r(0) = f1/f0, to obtain
f0 = # of unobserved species
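A minimal sketch of the ratio idea: fit a straight line to the ratios r(j) = (j+1) f_(j+1) / f_j, read off the fitted value at j = 0, and use r(0) = f1/f0 to solve for f0. For brevity the fit below is unweighted least squares, which is a simplification; the weighted regression of Rocchetti et al. and breakaway's ratio-of-polynomials model are not reproduced here. The function name and the choice τ = 10 are assumptions.

```python
def ratio_line_estimate(freq, tau, observed_taxa):
    """Fit r(j) = b0 + b1*j by ordinary least squares for j = 1..tau, then f0 = f1 / b0."""
    js = [j for j in range(1, tau + 1)
          if freq.get(j, 0) > 0 and freq.get(j + 1, 0) > 0]
    rs = [(j + 1) * freq[j + 1] / freq[j] for j in js]
    j_bar = sum(js) / len(js)
    r_bar = sum(rs) / len(rs)
    sxx = sum((j - j_bar) ** 2 for j in js)
    sxy = sum((j - j_bar) * (r - r_bar) for j, r in zip(js, rs))
    slope = sxy / sxx
    intercept = r_bar - slope * j_bar
    if intercept <= 0:
        raise ValueError("fitted line is not positive at j = 0; f0 is undefined")
    f0_hat = freq[1] / intercept  # because r(0) = 1 * f1 / f0
    return intercept, slope, observed_taxa + f0_hat


# First twelve apple orchard frequency counts from the slides (j -> f_j).
apple = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
         7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}
intercept, slope, total = ratio_line_estimate(apple, tau=10, observed_taxa=1187)
print(round(intercept, 3), round(slope, 3), round(total, 1))
```

The explicit check on the intercept reflects a real weakness of the straight-line fit: nothing constrains the fitted line to stay positive at j = 0.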
breakaway: Estimating taxonomic richness based on
ratios of frequency counts, cont’d
Some issues:
• Straight-line fit may go negative!
• Can be fixed by ad hoc log-transformation (Rocchetti et
al.)
• Broad generalization: represent ratio of frequency counts
as ratio of polynomials
• Deep probabilistic justification; corrects negativity
$$\frac{f_{j+1}}{f_j} \approx \frac{\beta_0 + \beta_1 j + \beta_2 j^2 + \beta_3 j^3 + \cdots}{1 + \alpha_1 j + \alpha_2 j^2 + \alpha_3 j^3 + \cdots}$$
Rocchetti I, Bunge J, Böhning D. 2011. Population size
estimation based upon ratios of recapture probabilities. Ann.
Appl. Stat. 5:1512--33; Willis A. and Bunge J. (2013) in prep.
breakaway: Estimating taxonomic richness based on
ratios of frequency counts, cont’d
################## Smoothed weights ##################
The best estimate of total diversity is 1800
with std error 256
The model employed was model_1_1
The function selected was
f_{x+1}/f_{x} ~ (beta0+beta1*(x-xbar))/(1+alpha1*(x-xbar))
        Coef estimates  Coef std errors
beta0   1.11078693      0.13241518
beta1   0.05383757      0.02916098
alpha1  0.03002143      0.03840271
breakaway: Estimating taxonomic richness based on
ratios of frequency counts, cont’d
• Nonlinear regression
• Heteroscedastic (changing variance)
• Autocorrelated: f2/f1 is correlated with f3/f2, etc.
• Collinear: parameter estimates of α’s and β’s highly
correlated unless corrected
• Multiple significant numerical challenges
Statistical questions
• Model selection – degree of numerator and denominator
polynomials
• Error estimation
• Underlying probability theory: what do these models
imply, and what are they implied by?
Noise and unreliable low frequency counts
Next generation sequencing technology […] has
revolutionised the study of microbial diversity as it is
now possible to sequence a substantial fraction of the
16S rRNA genes in a community. However, […] because
of the large read numbers and the lack of consensus
sequences it is vital to distinguish noise from true
sequence diversity in this data. Otherwise this leads to
inflated estimates of the number of types or operational
taxonomic units (OTUs) present.
- Quince et al. (2011)
Methods to address unreliable low
frequency counts
I. Fix the data at the source!
• Example: PyroNoise and AmpliconNoise
- aim at “separately removing 454 sequencing
errors and PCR single base errors.” (Quince 2011)
• Direct, non-statistical approach
Methods to address unreliable low
frequency counts
Methods to address unreliable low
frequency counts
III. Deleting the high-diversity component of a
mixture model
Bunge J, Böhning D, Allen H, Foster JA. 2012a. Estimating
population diversity with unreliable low frequency counts. In
Biocomputing 2012: Proceedings of the Pacific Symposium, pp.
203--12. Hackensack, NJ: World Sci. Publ
Methods to address unreliable low
frequency counts
IV. Bayesian approaches
• Informative or subjective: investigator specifies
non-trivial downweighting or rapidly decreasing
prior for higher diversity values
• Specific choice of prior?
Numerical results from viral phage data:
Lower bounds and component deletion
Method                   EstDiv    SE    LCB    UCB
Poisson                    8730   103   8535   8938
GoodTuring                11690   346  11050  12407
ThreeMixedExp             67792  8656  53009  87195
Discounted: TwoMixedExp    1727   221   1410   2305
Some notes on β-diversity
• Crucial to distinguish between
  o Statistical inference procedures that (attempt to) account
for unobserved as well as observed diversity
  o Procedures (computational, graphical, or qualitative) that
treat the observed sample as the population: UniFrac,
“ordination” methods, co-inertia.
• Only the former considered here. Estimation of population
parameters, possible hypothesis testing.
Statistical inference for comparing taxonomic
diversity across populations
• Simplest version: Estimate richness in each population, with
associated standard errors and confidence intervals, & compare
(e.g., do CI’s overlap?)
• Can be done with existing methods: parametric, nonparametric,
Bayesian, etc.
• Exactly ONE known inferential procedure. Lower bound for # of
shared taxa:
$$\hat{S}_{12} = D_{12} + \frac{a f_{1+}^2}{2 f_{2+}} + \frac{b f_{+1}^2}{2 f_{+2}} + \frac{a b f_{11}^2}{4 f_{22}}$$
(D12 = observed # of shared species, fjk = # of species observed j
times in sample 1 and k times in sample 2, a "+" subscript = any
positive count in that sample, a and b = constants)
Pan HY, Chao A, Foissner W. 2009. A nonparametric
lower bound for the number of species shared by
multiple communities. J. Agric. Biol. Environ. Stat.
14:452--68
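The shared-species lower bound is plain arithmetic once the joint frequency counts are tabulated. The sketch below reads the formula as D12 plus three correction terms, following the form given by Pan et al., where f1+ and f2+ count shared species seen once or twice in sample 1 (and at least once in sample 2), and f+1 and f+2 are the mirror counts for sample 2. The slides do not specify the constants a and b, so they are plain arguments here and default to 1 only as a placeholder; the function name and all input values are hypothetical.

```python
def shared_lower_bound(d12, f1p, f2p, fp1, fp2, f11, f22, a=1.0, b=1.0):
    """Lower bound for the number of shared taxa.

    d12      : observed number of shared species
    f1p, f2p : shared species observed once / twice in sample 1
    fp1, fp2 : shared species observed once / twice in sample 2
    f11, f22 : shared species observed once in both / twice in both samples
    a, b     : constants (left unspecified in the slides; 1.0 is a placeholder)
    """
    if min(f2p, fp2, f22) <= 0:
        raise ValueError("this form requires f2+, f+2 and f22 to be positive")
    return (d12
            + a * f1p ** 2 / (2.0 * f2p)
            + b * fp1 ** 2 / (2.0 * fp2)
            + a * b * f11 ** 2 / (4.0 * f22))


# Hypothetical counts, chosen only so the arithmetic is easy to follow.
bound = shared_lower_bound(d12=50, f1p=12, f2p=6, fp1=10, fp2=5, f11=4, f22=2)
print(bound)  # 50 + 12 + 10 + 2 = 74.0
```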
Statistical inference for β-diversity:
other scenarios
• Inference for the Jaccard index, accounting for unobserved
species (Chao et al.)
• Inference for “the probability of a draw from one distribution
not being observed in k draws from another distribution.”
(Hampton et al.)
• Statistical work in this area not extensive – very fertile area for
research.
Chao A, Chazdon RL, Colwell RK, Shen T-J. 2006.
Abundance-based similarity indices and their
estimation when there are unseen species in samples.
Biometrics 62:361--71; Hampton J, Lladser ME. 2012.
Estimation of distribution overlap of urn models. PLoS
ONE 7:e42368
NEVER
throw away data when doing
statistical inference
“Not even wrong” – Wolfgang Pauli
There is no post hoc statistical fix for
• Ill-posed research problem
• Vaguely defined population
• Statistical model not appropriate for
o population description
o sample generation process
• Model must compromise between detailed phenomenological
description and parsimony
• “To what extent can we idealize the properties of the system and
still obtain satisfactory results? The answer to this question can
only be given in the end by experiment. Only the comparison of
the answers provided by analysis of our model with the results of
the experiment will enable us to judge whether the idealization is
legitimate.” Andronov (1937) Theory of Oscillators.
On the sociology of science
• Fact: Universities have statistics departments!
o Cornell: www.stat.cornell.edu
o At least 131 university stat dept’s in U.S. – random
sample of 10:
• University of California, Berkeley, Division of Biostatistics •
Princeton University, Program in Statistics and Operations
Research • Bowling Green State University, Department of
Applied Statistics and Operations Research • University of
Illinois, Urbana-Champaign, Department of Statistics •
University of South Carolina, Department of Statistics •
Columbia School of Public Health, Division of Biostatistics •
Medical College of Georgia, Office of Biostatistics and
Bioinformatics • Duke University, Institute of Statistics and
Decision Sciences • Yale University Department of Statistics •
University of Michigan, Department of Biostatistics
• Collaboration extremely valuable in both directions
(even though academic incentive structure may not
immediately reward it)
• Be persistent: “Fall down seven times, get up eight”
CatchAll
• http://www.northeastern.edu/catchall/ or STAMPS!
• V.4 now available; mothur uses v.3 (?)
• Two programs: basic analysis program + Excel
graphics spreadsheet (macros)
• Windows GUI, Windows command-line – .Net
framework must be installed
• Mac OS/Linux command-line – mono must be
installed.
• Input data file structure: *.csv (comma-separated
values)
1,f1
2,f2
…
m,fm
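A minimal sketch of writing frequency counts in the two-column layout described above, one "j,fj" pair per line with no header row. The function name and the file name are hypothetical, and nothing here invokes CatchAll itself.

```python
import csv
import os
import tempfile


def write_frequency_csv(freq, path):
    """Write {j: f_j} as comma-separated lines 'j,fj', sorted by frequency j."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for j in sorted(freq):
            writer.writerow([j, freq[j]])


# First twelve apple orchard frequency counts from the slides (j -> f_j).
apple = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
         7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}
path = os.path.join(tempfile.gettempdir(), "apple_orchard.csv")
write_frequency_csv(apple, path)

with open(path) as handle:
    lines = handle.read().splitlines()
print(lines[:3])  # ['1,317', '2,179', '3,127']
```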
CatchAll cont’d
• Read in data
• Go! (Can set option to omit most complex model, if too
time-consuming; see manual)
• Output files appear in “Output” folder/directory
datasetname_Analysis.csv
Complete listing of all analyses
datasetname_BestModelsAnalysis.csv
Column‐formatted summary analysis output
datasetname_BestModelsFits.csv
Fitted values for the "best models" as selected by the model
selection algorithm
datasetname_BubblePlot.csv
Data to generate bubble plots using Excel spreadsheet
CatchAll cont’d: BestModelsAnalysis file
• Total number of observed species: self-explanatory
• Model: see manual
• Tau: upper-frequency cutoff
• Observed Sp: number of species (counts) with
frequencies up to τ only
• Estimated total Sp: final estimate of the total number of
species in the population
• SE: standard error of preceding estimate
• Lower CB, Upper CB: lower and upper 95% confidence
bounds
• GOF0, GOF5: Pearson goodness-of-fit p-values,
uncorrected and corrected
CatchAll cont’d: BestModelsAnalysis file
• Best Parm Model; Parm Model 2a, 2b, 2c. Parametric
models (and τ’s) selected by various goodness‐of‐fit
criteria
• WLRM: weighted linear regression model
• Parm Max Tau, WLRM Max Tau: best parametric
model and WLRM computed on entire dataset
• Best Discounted: best parametric model with
low‐frequency/high‐diversity component deleted
• Non‐P 1: Chao1, nonparametric lower bound for total
number of species
• Non‐P 2. Chao’s ACE or high‐diversity variant ACE1 (τ
≤ 10)
• Non‐P 3. Chao’s ACE (τ ≤ 10)
CatchAll cont’d: Analysis file
• All models & procedures computed by CatchAll,
including several not reported in summary analysis
• All cutoffs τ
• All supplementary/supporting information (GOF etc.)
• Question: what if no “best” parametric model selected?
o Means no model passed most stringent GOF
criteria
o Revert to alternative models (2a-c)
o If necessary revert to lower bounds (Chao1 etc.)