Lecture 2: Diversity, Distances, adonis

• "Diversity": alpha, beta (, gamma)
• Beta-diversity in practice: ecological distances
• Unsupervised learning: clustering, etc.
• Ordination: e.g. PCA, UniFrac/PCoA, DPCoA
• Testing: permutational multivariate ANOVA

Some slides from Prof. A. Alekseyenko (NYU) and Prof. S. Holmes (Stanford).

Alpha-Diversity

Alpha diversity definition(s)
• Alpha diversity describes the diversity of a single community (specimen).
• In statistical terms, it is a scalar statistic computed for a single observation (column) that represents the diversity of that observation.
• There are many statistics that describe diversity: e.g. taxonomic richness, evenness, dominance, etc.

Rank abundance plots
[Figure: rank abundance plots]

Species richness
• Suppose we observe a community that can contain up to k 'species'.
• The relative proportions of the species are P = {p1, …, pk}.
• Richness is computed as R = 1(p1) + 1(p2) + … + 1(pk), where 1(.) is an indicator function, i.e. 1(pi) = 1 if pi ≠ 0, and 0 otherwise.
• Higher R means greater diversity.
• Richness is very dependent on the depth of sampling and sensitive to the presence of rare species.

Rarefaction Curves
[Figure: rarefaction curve — number of species vs. number of observations (library size / # reads / sample size), with the Sanders 1968 non-parametric richness estimate and coverage indicated]
• Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist.

Shannon index
• Suppose we observe a community that can contain up to k 'species'.
• The relative proportions of the species are P = {p1, …, pk}.
• The Shannon index is related to the notion of information content from information theory. It roughly represents the amount of information (uncertainty) in the distribution P.
• When pi = pj for all i and j, we have no information about which species a random draw will yield. As the proportions become more unequal, we gain more information about the likely outcome of the draw. The Shannon index captures this property of the distribution.
• The Shannon index is computed as
  Sk = –p1·log2(p1) – p2·log2(p2) – … – pk·log2(pk)
  Note: as pi ➔ 0, log2(pi) ➔ –∞; we therefore define pi·log2(pi) = 0 in that case.
• Higher Sk means higher diversity.
• "Shannon entropy": http://en.wikipedia.org/wiki/Entropy_(information_theory)

From Shannon to Evenness
• The Shannon index for a community of k species has a maximum of log2(k).
• We can make different communities more comparable if we normalize by this maximum.
• The evenness index is computed as Ek = Sk / log2(k).
• Ek = 1 means total evenness.

Simpson index
• Suppose we observe a community that can contain up to k 'species'.
• The relative proportions of the species are P = {p1, …, pk}.
• The Simpson index is based on the probability of resampling the same species on two consecutive draws with replacement.
• Suppose on the first draw we picked species i; this event has probability pi, hence the probability of drawing that species twice is pi·pi.
• The Simpson index is usually computed as
  D = 1 – (p1^2 + p2^2 + … + pk^2)
  In this form, the index represents the probability that two individuals randomly selected from a sample belong to different species.
• D = 0 means no diversity (one species is completely dominant).
• D = 1 means complete diversity.

Numbers equivalent diversity
• Often it is convenient to talk about alpha diversity in terms of equivalent units: how many equally abundant taxa would it take to get the same diversity as we see in a given community?
• For richness there is no difference in the statistic.
• For Shannon, recall that log2(k) is the maximum, attained when all species have equal abundance. Hence the diversity in equivalent units is 2^Sk.
• For Simpson, the equivalent-units measure of diversity is 1/(1 – D), sometimes called the "inverse Simpson index".
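The following is a minimal R sketch of these alpha-diversity statistics, not code from the lecture: it assumes a hypothetical samples-by-taxa count matrix `otu` (one row per specimen) and uses the vegan package; vegan's diversity() defaults to natural logarithms, so base 2 is requested explicitly to match the slides.

library(vegan)

# Hypothetical toy data: 2 samples (rows) x 4 taxa (columns).
otu <- matrix(c(10, 5, 0, 2,
                 3, 3, 3, 3), nrow = 2, byrow = TRUE)

richness   <- specnumber(otu)                             # R: number of non-zero taxa per sample
shannon    <- diversity(otu, index = "shannon", base = 2) # Sk, with log base 2 as in the slides
evenness   <- shannon / log2(richness)                    # Ek = Sk / log2(k), k = observed richness
simpson    <- diversity(otu, index = "simpson")           # D = 1 - sum(pi^2)
invsimp    <- diversity(otu, index = "invsimpson")        # 1 / sum(pi^2) = 1 / (1 - D)
shannon_eq <- 2^shannon                                   # numbers-equivalent Shannon diversity

For phyloseq objects, estimate_richness() computes several of these (plus richness estimators such as Chao1) in a single call.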
Beta-Diversity

• Microbial ecologists typically use "beta diversity" as a broad umbrella term that can refer to any of several indices related to compositional differences (differences in species content between samples).
• For some reason this is contentious, and there appears to be an ongoing (and pointless?) argument over the possible definitions.
• For our purposes, and for microbiome research, when you hear "beta-diversity" you can probably think: "diversity of species composition".

Summary of diversity "types"
• α – diversity within a community; number of species only.
• β – diversity between communities (differentiation); species identity is taken into account.
• γ – (global) diversity of the site.
• Theoretically, one would wish to use measures such that γ = α × β.
• This is only possible if α and β are independent of each other.
• http://en.wikipedia.org/wiki/Beta_diversity

Beta-Diversity "in practice"
1. Compute a UniFrac or Bray-Curtis distance between samples
2. MDS ("PCoA")
3. Plot the first two axes
4. Admire clusters
5. Write paper
6. Choose new microbiomes
7. Return to Step 1; repeat
(Steps 1–3 and the formal test from the last slide are sketched in code at the end of this lecture.)

Dimensional Reduction
• Why? Let's back up. This is one option in an arsenal of dimensional-reduction methods that come from "unsupervised learning" in "exploratory data analysis".
[Figure: regress disc on weight vs. regress weight on disc]

Dimensional Reduction
• Minimize the distance to the line in both directions: the purple line is the principal component line.
• Principal components are linear combinations of the 'old' variables.
• PCA finds the projection that maximizes the area of the "shadow"; an equivalent measurement is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible, and that is what PCA does.

The PCA workflow
[Figure: the PCA workflow]

Ordination Using the Tree
1. UniFrac-PCoA
2. Double Principal Coordinates (DPCoA)

(Un)supervised Learning: Ordination Best Practice
1. Always look at the scree plot
2. Variables, samples
3. Biplot
4. All together (if readable)

library(ade4)  # dudi.pca() and scatter() are from the ade4 package
pca.turtles <- dudi.pca(Turtles[, -1], scannf = FALSE, nf = 2)
scatter(pca.turtles)

(Un)supervised Learning: What did we "learn"? Depends on the data.
• How many axes are probably useful?
• Are there clusters? How many?
• Are there gradients?
• Are the patterns consistent with covariates (e.g. sample observations)?
• How might we test this?

• Are there clusters? How many? ➔ Gap statistic
• Are there gradients? ➔ PCA regression
• Are the patterns consistent with covariates? How might we test this? ➔ (Permutational) multivariate ANOVA: vegan::adonis( )
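To make the "beta-diversity in practice" workflow and the adonis test concrete, here is a minimal R sketch under stated assumptions; it is not the lecture's own code. `otu` is a hypothetical samples-by-taxa count matrix and `meta` a hypothetical data frame of per-sample covariates with a `Group` column; a UniFrac distance would additionally require a phylogenetic tree (e.g. via phyloseq::UniFrac()). The slides name vegan::adonis(); newer vegan releases provide adonis2() with the same formula interface.

library(vegan)
library(cluster)

# Hypothetical inputs: 20 samples x 50 taxa of simulated counts,
# plus a covariate table with a two-level Group factor.
set.seed(1)
otu  <- matrix(rpois(20 * 50, lambda = 5), nrow = 20)
meta <- data.frame(Group = factor(rep(c("A", "B"), each = 10)))

# 1. Ecological distance between samples (Bray-Curtis here).
bc <- vegdist(otu, method = "bray")

# 2. MDS / PCoA on the distance matrix.
pcoa <- cmdscale(bc, k = 2, eig = TRUE)

# 3. Plot the first two axes, coloured by the covariate; the eigenvalues
#    (pcoa$eig) serve as a scree plot for judging how many axes are useful.
plot(pcoa$points, col = meta$Group, xlab = "PCoA 1", ylab = "PCoA 2")

# "Are there clusters? How many?" - gap statistic on the ordination scores.
gap <- clusGap(pcoa$points, FUN = kmeans, nstart = 20, K.max = 5, B = 50)

# "How might we test this?" - PERMANOVA of the distances on the covariate.
adonis2(bc ~ Group, data = meta, permutations = 999)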