Integrating Macroecological Metrics and Community Taxonomic Structure

John Harte, Andrew Rominger, Wenyu Zhang

SUPPLEMENTARY MATERIAL

Here we fill in mathematical details of how the results in the third column of Table 1 are derived from the definitions in Table 2. Then we briefly discuss two possible extensions of the theory presented here. For completeness, however, and at the risk of repetition, we first summarize the ideas that lead to the constraints and defining equations for the metrics in Table 2.

Constraints and Defining Equations

The Extended Ecological Structure Function. An “ecological structure function”, denoted R(n, ε|S0, N0, E0), is the core of ASNE (Harte et al., 2008). R is a joint conditional distribution over abundance (n) and metabolic rate (ε), defined so that R·dε is the probability that if a species is picked at random from the species pool, then it has abundance n, and if an individual is picked at random from that species, then its metabolic energy requirement is in the interval (ε, ε + dε). To construct AGSNE, we augment the list of state variables, A0, S0, N0, E0, by adding G0, the number of genera in area A0. In analogy to R, a new joint conditional probability distribution, Q(m, n, ε|G0, S0, N0, E0), can be defined as follows: pick a genus at random from the pool of genera; then Q·dε is the probability that it has m species and, if you pick one of those species at random, that it has n individuals, and, if you pick one of those individuals, that it has a metabolic rate in the interval (ε, ε + dε). Note that Q, the ecological structure function in the extended theory, is a function of the discrete variables m and n and of a continuous variable ε. For notational simplicity we will use the term ‘distribution’ regardless of whether the independent variable is discrete or continuous, with the understanding that in the latter case a probability density function is intended.
Also, for notational convenience, in what follows we replace integrals over ε with sums over discrete values of ε, but in the actual calculations we use integrals over continuous values of ε. As in ASNE, we define the unit of energy such that ε = 1 is the lowest metabolic rate among the N0 individuals. We leave the limits off of summations, which are understood to range from 1 to S0, N0, E0 for m, n, and ε respectively. Finally, we do not show the state variables on which distributions are contingent unless they are needed for clarity.

The Constraints. The constraints on Q follow immediately from the definitions of the state variables and of Q and are listed in Table 2.

$$ \langle m \rangle = \frac{S_0}{G_0} = \sum_{m,n,\varepsilon} m\,Q(m,n,\varepsilon) \tag{S-1} $$

$$ \langle n_G \rangle = \frac{N_0}{G_0} = \sum_{m,n,\varepsilon} m n\,Q(m,n,\varepsilon) \tag{S-2} $$

$$ \langle \varepsilon_G \rangle = \frac{E_0}{G_0} = \sum_{m,n,\varepsilon} m n \varepsilon\,Q(m,n,\varepsilon) \tag{S-3} $$

Here ⟨ ⟩ indicates an expectation value, ⟨m⟩ is the average number of species per genus, ⟨nG⟩ is the average number of individuals per genus, and ⟨εG⟩ is the average metabolic rate per genus (summed over the individuals in the genus). In similar notation, ⟨nS⟩ = N0/S0 = (G0/S0)⟨nG⟩, and ⟨ε⟩ = E0/N0 = (G0/N0)⟨εG⟩.

The Metrics. The MaxEnt solution for the AGSNE ecological structure function, Q, is readily found using the same methods as in Harte et al. (2008) and explained in more detail in Harte (2011):

$$ Q(m,n,\varepsilon) = \frac{1}{Z(\lambda_1,\lambda_2,\lambda_3)}\, e^{-\lambda_1 m}\, e^{-\lambda_2 m n}\, e^{-\lambda_3 m n \varepsilon} \tag{S-4} $$

Z, a normalizing factor, is evaluated from

$$ Z(\lambda_1,\lambda_2,\lambda_3) = \sum_{m,n,\varepsilon} e^{-\lambda_1 m}\, e^{-\lambda_2 m n}\, e^{-\lambda_3 m n \varepsilon} \tag{S-5} $$

Macroecological metrics extended to higher taxa describe the probability distributions of metabolic energy rates, abundances, and species richness. Each metric is obtained directly from Q either as a marginal distribution or as a conditional distribution obtained as the ratio of other metrics in conformity with the identity P(x|y) = P(x,y)/P(y).

The distribution of species richness, m, over genera, Γ(m), is given by:

$$ \Gamma(m) = \sum_{n,\varepsilon} Q(m,n,\varepsilon) \tag{S-6} $$

The distribution of abundances over species, φ(n), is given by:

$$ \varphi(n) = \frac{\sum_{m,\varepsilon} m\,Q(m,n,\varepsilon)}{\sum_{m,n,\varepsilon} m\,Q(m,n,\varepsilon)} = \frac{G_0}{S_0} \sum_{m,\varepsilon} m\,Q(m,n,\varepsilon) \tag{S-7} $$

where the second equality follows from Eq. S-1. The distribution of abundances over species belonging to a genus with m species, φ(n|m), is given by:

$$ \varphi(n|m) = \frac{\sum_{\varepsilon} Q(m,n,\varepsilon)}{\sum_{\varepsilon,n} Q(m,n,\varepsilon)} \tag{S-8} $$

Using Eq. S-6, this can be re-expressed as:

$$ \varphi(n|m) = \frac{\sum_{\varepsilon} Q(m,n,\varepsilon)}{\Gamma(m)} \tag{S-9} $$

For later use we note that the average abundance of a species in any genus with m species is given by

$$ \langle n|m \rangle = \frac{\sum_{n,\varepsilon} n\,Q(m,n,\varepsilon)}{\Gamma(m)} \tag{S-10} $$

The distribution of metabolic rates over all individuals, Ψ(ε), is given by:

$$ \Psi(\varepsilon) = \frac{\sum_{m,n} m n\,Q(m,n,\varepsilon)}{\sum_{m,n,\varepsilon} m n\,Q(m,n,\varepsilon)} = \frac{G_0}{N_0} \sum_{m,n} m n\,Q(m,n,\varepsilon) \tag{S-11} $$

where the second equality follows from Eq. S-2. The distribution of metabolic rates over all individuals in a species with abundance n that is in a genus with m species is given by:

$$ \Theta(\varepsilon|m,n) = \frac{Q(m,n,\varepsilon)}{\sum_{\varepsilon} Q(m,n,\varepsilon)} \tag{S-12} $$

Using Eq. S-9, this can be re-expressed as

$$ \Theta(\varepsilon|m,n) = \frac{Q(m,n,\varepsilon)}{\Gamma(m)\,\varphi(n|m)} \tag{S-13} $$

The ratios of state variables that appear in front of some of the summation signs above, and the factors of m or mn that appear in the summands, arise from the way Q is defined; derivations follow closely the example given in Box 7.2 of Harte (2011).

The last metric we introduce is an expression for the distribution of metabolic rates, ε, over the individuals in a species selected at random from the pool of all species in genera with m species:

$$ \xi(\varepsilon|m) = \sum_{n} \Theta(\varepsilon|m,n)\,\varphi(n|m) \tag{S-14} $$

The λi are Lagrange multipliers that are determined numerically from the values of the state variables.

Deriving the Lagrange multipliers and the results in column 3 of Table 1

Equations S-1 to S-14 are exact and can be solved numerically, but with some approximations that are justified for most data sets we have examined, we can derive useful closed-form expressions for the metrics from them.
Upon inspection, these expressions permit an understanding of the nature of the predicted patterns.

The first assumption is that the number of genera in the data set is sufficiently large that terms of order exp(-G0) can be ignored compared to terms of order 1. In practice, G0 > 5 (exp(-G0) < 0.01) is adequate. With that assumption alone, most of the equations can be solved analytically, but some of the resulting expressions for the metrics are quite complicated. Further simplification occurs if E0 >> N0 >> S0 >> G0 and N0 >> S0G0, where in practice the double inequality means a factor of at least five difference between the left- and right-hand sides. In the results shown below, we use an equal sign (=) to indicate that only the first assumption (large enough G0) is needed, and an inexact equality sign (≈) if the inequalities above are assumed as well. In every census data set we use here for theory testing these assumptions are justified, and in most data sets we have encountered they are also satisfied.

The mathematical steps that we do not describe here are tedious but straightforward, involving nothing more than summations, integrations, and Taylor series expansions. Throughout, we make frequent use of:

$$ \sum_{m=1}^{M} \frac{e^{-\lambda m}}{m} = \ln\!\left(\frac{1}{1-e^{-\lambda}}\right) \approx \ln\!\left(\frac{1}{\lambda}\right) \tag{S-15} $$

where the equality is strictly correct only as exp(-λM) → 0 and the approximate equality is valid if λ << 1. In our use of Eq. S-15, λ will be of order G0/S0 and M will equal S0, and so exp(-λM) is of order exp(-G0), which we are assuming is very small.

The summation and integration over n and ε in Eq. S-5 are readily carried out and result in:

$$ Z(\lambda_1,\lambda_2,\lambda_3) = \sum_{m,n} \frac{e^{-m(\lambda_1+\beta n)}}{\lambda_3\, m n} = \frac{1}{\lambda_3} \sum_{m} \frac{e^{-\lambda_1 m}}{m} \ln\!\left(\frac{1}{1-e^{-\beta m}}\right) \tag{S-16} $$

where β ≡ λ2 + λ3. To obtain expressions for the Lagrange multipliers as functions of the state variables, we have to evaluate Eqs. S-1 to S-3. From Eq. S-1:

$$ \langle m \rangle = \frac{S_0}{G_0} = \frac{1}{Z\lambda_3} \sum_{m} e^{-\lambda_1 m} \ln\!\left(\frac{1}{1-e^{-\beta m}}\right) \tag{S-17} $$

while, from Eq. S-2:

$$ \langle n_G \rangle = \frac{N_0}{G_0} = \sum_{m,n} \frac{e^{-m(\lambda_1+\beta n)}}{Z\lambda_3} = \frac{1}{Z\lambda_3} \sum_{m} \frac{e^{-\lambda_1 m}\, e^{-\beta m}}{1-e^{-\beta m}} \tag{S-18} $$

From Eq. S-3, and doing the integral over ε, we have:

$$ \langle \varepsilon_G \rangle = \frac{E_0}{G_0} = \sum_{m,n} \frac{e^{-m(\lambda_1+\beta n)}}{Z} \left( \frac{1}{\lambda_3^{2}\, m n} + \frac{1}{\lambda_3} \right) = \frac{1}{\lambda_3} + \frac{N_0}{G_0} \tag{S-19} $$

The terms 1/λ3 and N0/G0 in the final equality arise from use of Eqs. S-16 and S-18, respectively. Hence, the third Lagrange multiplier is given by

$$ \lambda_3 = \frac{G_0}{E_0 - N_0} \tag{S-20} $$

To determine without further approximation the values of λ1 and β, Eqs. S-17 to S-19 have to be evaluated numerically. If N0 >> S0 >> G0 >> 1 and, in addition, N0 >> S0G0, then the summation in Eq. S-16 can be approximated as:

$$ Z\lambda_3 \approx \ln\!\left(\frac{1}{\beta}\right) \ln\!\left(\frac{1}{\lambda_1}\right) \tag{S-21} $$

Under the same assumptions, Eq. S-17 simplifies to:

$$ \langle m \rangle = \frac{S_0}{G_0} \approx \frac{\ln(1/\beta)}{Z\lambda_3\lambda_1} \approx \frac{1}{\lambda_1 \ln(1/\lambda_1)} \tag{S-22} $$

and Eq. S-18 simplifies to:

$$ \langle n_G \rangle = \frac{N_0}{G_0} \approx \frac{1}{\beta \ln(1/\beta)} \tag{S-23} $$

We note that from Eq. S-22, λ1 < G0/S0 << 1; from Eq. S-23, β < G0/N0 << 1; and because N0 >> S0 by assumption, we have β << λ1. We also note that βm will be < 1 if S0G0 < N0 because m < S0. We can use the more exact Eqs. S-16 to S-18 when testing theory, but Eqs. S-21 to S-23 provide a way to obtain analytically initial guesses for λ1, β, and Z for numerical evaluation, and they can also be used to provide initial insight into the behavior of the metrics derived below.

Next, we evaluate the metrics Γ, φ, and Ψ. The sum and integral over n and ε in the derivation of Γ(m) are straightforward, resulting in:

$$ \Gamma(m) = \frac{e^{-\lambda_1 m}}{Z\lambda_3\, m} \ln\!\left(\frac{1}{1-e^{-\beta m}}\right) \tag{S-24} $$

where we have assumed that βmN0 >> 1, which will always be true if G0 >> 1. If βm << 1, which is the case if N0 >> G0S0, then Eq. S-21 is valid, and Eq. S-24 simplifies to:

$$ \Gamma(m) \approx \frac{e^{-\lambda_1 m}}{m \ln(1/\lambda_1)} \tag{S-25} $$

The difference between the m-dependence of the exact (Eq. S-24) and approximate (Eq. S-25) distributions is that the former falls slightly more rapidly with increasing m because it carries the additional factor ln(1/(1-exp(-βm))), which is a slowly decreasing function of m. Numerical comparisons of Eqs. S-24 and S-25 for realistic combinations of the state variables indicate that they differ by at most a few percent.
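As a rough illustration of how Eqs. S-20 to S-25 might be used in practice, the following Python sketch obtains the analytic initial guesses for λ1 and β by solving Eqs. S-22 and S-23 with bisection, computes λ3 from Eq. S-20, and then tabulates Γ(m) from Eq. S-24 (with Zλ3 taken from the sum in Eq. S-16) next to the approximation in Eq. S-25. The function names and the state-variable values are hypothetical choices made for this example; they are not taken from any data set used in the paper. The multipliers here are only the approximate initial guesses, not the numerically refined solutions of Eqs. S-16 to S-18.

```python
import math


def solve_x_ln_inv_x(c, lo=1e-15):
    """Solve x * ln(1/x) = c for x in (0, 1/e), where the left side is increasing."""
    hi = math.exp(-1.0)  # x * ln(1/x) peaks at x = 1/e with value 1/e
    if not 0.0 < c < hi:
        raise ValueError("c must lie in (0, 1/e) for a root on the small-x branch")
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisect on a logarithmic scale
        if mid * math.log(1.0 / mid) < c:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def approx_multipliers(G0, S0, N0, E0):
    """Initial guesses for (lambda1, beta) from Eqs. S-22, S-23; lambda3 from Eq. S-20."""
    lam1 = solve_x_ln_inv_x(G0 / S0)  # lambda1 * ln(1/lambda1) = G0/S0
    beta = solve_x_ln_inv_x(G0 / N0)  # beta * ln(1/beta) = G0/N0
    lam3 = G0 / (E0 - N0)
    return lam1, beta, lam3


def gamma_exact(lam1, beta, S0):
    """Gamma(m), m = 1..S0, from Eq. S-24 with Z*lambda3 given by the sum in Eq. S-16."""
    terms = [
        math.exp(-lam1 * m) / m * -math.log(-math.expm1(-beta * m))
        for m in range(1, S0 + 1)
    ]
    z_lam3 = sum(terms)
    return [t / z_lam3 for t in terms]


def gamma_approx(lam1, S0):
    """Gamma(m), m = 1..S0, from the closed-form approximation in Eq. S-25."""
    norm = math.log(1.0 / lam1)
    return [math.exp(-lam1 * m) / (m * norm) for m in range(1, S0 + 1)]


# Hypothetical state variables chosen so that E0 >> N0 >> S0 >> G0 and N0 >> S0*G0.
G0, S0, N0, E0 = 50, 300, 200_000, 20_000_000
lam1, beta, lam3 = approx_multipliers(G0, S0, N0, E0)
exact = gamma_exact(lam1, beta, S0)
approx = gamma_approx(lam1, S0)

for m in (1, 2, 5, 10, 50):
    print(m, exact[m - 1], approx[m - 1])
```

Because the sketch uses only the approximate multipliers, the two columns it prints should be read as a qualitative comparison of the m-dependence; a quantitative test of the theory would first refine λ1 and β against Eqs. S-16 to S-18.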
Turning to the species abundance distribution, φ(n), the sum over m and integral over ε in Eq. S-7 result in:

$$ \varphi(n) = \frac{G_0}{S_0}\, \frac{e^{-(\lambda_1+\beta n)}}{Z\lambda_3\, n \left[ 1 - e^{-(\lambda_1+\beta n)} \right]} \tag{S-26} $$

If the approximations leading to Eqs. S-21 to S-23 are valid, Eq. S-26 simplifies to:

$$ \varphi(n) \approx \frac{\lambda_1\, e^{-(\lambda_1+\beta n)}}{n \ln(1/\beta) \left( 1 - e^{-(\lambda_1+\beta n)} \right)} \tag{S-27} $$

For small n such that βn << λ1 << 1, which is roughly equivalent to n << N0/S0, Eq. S-27 can be approximated by:

$$ \varphi(n) \approx \frac{e^{-\beta n}}{n \ln(1/\beta)} \tag{S-28} $$

Numerical evaluation of Eqs. S-26 and S-28 for realistic choices of state variables shows that the two expressions differ by no more than 3 or 4% over the range of values of n. For much larger n, such that βn >> λ1, or N0/G0 >> n >> N0/S0, Eq. S-27 is approximately:

$$ \varphi(n) \approx \frac{\lambda_1\, e^{-\beta n}}{n \ln(1/\beta) \left( 1 - e^{-\beta n} \right)} \approx \frac{N_0}{G_0}\, \frac{\lambda_1\, e^{-\beta n}}{n^{2}} \tag{S-29} $$

where the second approximate equality holds if βn << 1. In general, however, for large n the expression in Eq. S-27 is needed because for n > N0/G0, βn is not small.

The distribution of abundances over the set of species in all the genera with m species, φ(n|m), can also be evaluated from Eq. S-21, giving:

$$ \varphi(n|m) = \frac{e^{-\beta m n}}{n \ln\!\left( \frac{1}{1-e^{-\beta m}} \right)} \approx \frac{e^{-\beta m n}}{n \ln\!\left( \frac{1}{\beta m} \right)} \tag{S-30} $$

From Eq. S-30, the average abundance of the species in all the genera with m species is given by

$$ \langle n|m \rangle = \sum_{n} \frac{n\, e^{-\beta m n}}{n \ln\!\left( \frac{1}{1-e^{-\beta m}} \right)} = \frac{e^{-\beta m}}{\left( 1-e^{-\beta m} \right) \ln\!\left( \frac{1}{1-e^{-\beta m}} \right)} \tag{S-31} $$

If βm << 1, which will be the case for all m if G0S0 << N0, or at least for small m if G0 << N0, then Eq. S-31 can be approximated by

$$ \langle n|m \rangle \approx \frac{e^{-\beta m}}{\beta m \ln\!\left( \frac{1}{\beta m} \right)} \approx \frac{N_0}{G_0\, m} \tag{S-32} $$

where we have used Eq. S-23 and the inequality βm << 1 in the approximations in Eq. S-32. To derive the variance of n at a given value of m, we need to calculate

$$ \langle n^{2}|m \rangle = \sum_{n} \frac{n^{2}\, e^{-\beta m n}}{n \ln\!\left( \frac{1}{1-e^{-\beta m}} \right)} \tag{S-33} $$

which in the same approximation as above leads to

$$ \mathrm{variance}(n|m) = \langle n^{2}|m \rangle - \langle n|m \rangle^{2} \approx \frac{N_0^{2}}{G_0^{2}\, m^{2}} \ln\!\left( \frac{1}{\beta} \right) \tag{S-34} $$

From Eq. S-30 we can also derive an approximate value for the expected abundance of the species, in a genus with m species, with the maximum abundance.
To do this we first note that

$$ \sum_{n=1}^{n_{max}} \varphi(n|m) = 1 - \frac{1}{2m} \tag{S-35} $$

Here we have used the fact that on a rank-abundance curve, each of the m species has on average a share of 1/m of total probability, and thus the sum from nmax to N0 is 1/(2m). To approximate the sum in Eq. S-35, we can no longer use Eq. S-15 because we cannot assume that exp(-βmnmax) → 0. But we also cannot assume that βmnmax ≈ 0, in which case the summation would yield a logarithm of nmax. Instead, we use a relatively accurate numerically derived approximation that is discussed in more detail in Harte (2011, Box C.1) and is valid if βmnmax is of order 1:

$$ \sum_{n=1}^{n_{max}} \frac{e^{-\beta m n}}{n} \approx 0.643 \ln\!\left( \frac{n_{max}^{1.55}}{0.408 + (\beta m\, n_{max})^{1.55}} \right) \tag{S-36} $$

Combining Eqs. S-30, S-35, and S-36, we arrive at

$$ n_{max} \approx \frac{0.56}{\beta m\, (1-\beta m)^{0.643}} \tag{S-37} $$

Because βm << 1, the m-dependence of this expression is ~ 1/m.

Turning to the distribution of metabolic rates, Ψ(ε), we simplify the notation by defining

$$ \gamma(\varepsilon) \equiv \lambda_2 + \varepsilon \lambda_3 \tag{S-38} $$

The summation over n in Eq. S-11 can be carried out, giving

$$ \Psi(\varepsilon) = \frac{G_0}{N_0 Z} \sum_{m=1}^{S_0} m\, e^{-\lambda_1 m}\, \frac{e^{-\gamma(\varepsilon) m} - e^{-\gamma(\varepsilon) m N_0} - \left( 1 - e^{-\gamma(\varepsilon) m} \right) N_0\, e^{-\gamma(\varepsilon) m N_0}}{\left( 1 - e^{-\gamma(\varepsilon) m} \right)^{2}} \tag{S-39} $$

The terms with exp(-mγ(ε)N0) can be neglected if λ2N0 >> 1, which is true if G0 >> 1. In that case, Eq. S-39 simplifies to

$$ \Psi(\varepsilon) = \frac{G_0}{N_0 Z} \sum_{m} \frac{m\, e^{-\lambda_1 m}\, e^{-\gamma(\varepsilon) m}}{\left( 1 - e^{-\gamma(\varepsilon) m} \right)^{2}} \tag{S-40} $$

Using Eq. S-21, this can be approximated as:

$$ \Psi(\varepsilon) \approx \frac{\beta \lambda_3}{\ln(1/\lambda_1)} \sum_{m} \frac{m\, e^{-(\lambda_1 + \gamma(\varepsilon)) m}}{\left( 1 - e^{-\gamma(\varepsilon) m} \right)^{2}} \tag{S-41} $$

Although the summation in Eq. S-40 or S-41 is not expressible in finite form, if γ(ε)m << 1 over the range of m-values that contribute significantly to the sum, which roughly speaking will be the case if ε << E0/(S0G0), and β + λ1 << 1, then using Eqs. S-15 and S-22, this simplifies to

$$ \Psi(\varepsilon) \approx \frac{\beta \lambda_3}{\gamma(\varepsilon)^{2}} \tag{S-42} $$

The distribution of metabolic rates over all individuals in a species with n individuals that is in a genus with m species, as expressed in Eq. S-12, is readily evaluated:

$$ \Theta(\varepsilon|m,n) = \lambda_3\, m n\, e^{-\lambda_3 m n (\varepsilon - 1)} \tag{S-43} $$

Finally, using Eqs. S-30 and S-43 we can evaluate Eq. S-14:

$$ \xi(\varepsilon|m) = \frac{\lambda_3\, m}{\ln\!\left( \frac{1}{1-e^{-\beta m}} \right)} \cdot \frac{e^{-m(\lambda_3(\varepsilon-1)+\beta)}}{1 - e^{-m(\lambda_3(\varepsilon-1)+\beta)}} \tag{S-44} $$

At large ε, this can be approximated by:

$$ \xi(\varepsilon|m) \approx \frac{\lambda_3\, m}{\ln\!\left( \frac{1}{\beta m} \right)}\, e^{-\lambda_3 m \varepsilon} \tag{S-45} $$

We note that this is an increasing function of m for values of λ3m << 1, which is generally the case for realistic values of the state variables. Taking the mean of Eq. S-45, we obtain the approximate result:

$$ \langle \varepsilon|m \rangle \approx \frac{1}{(\lambda_3 m) \ln\!\left( \frac{1}{\beta m} \right)} \tag{S-46} $$

and for the variance we obtain:

$$ \langle \varepsilon^{2}|m \rangle - \langle \varepsilon|m \rangle^{2} \approx \frac{1}{m^{2} \ln(1/\beta)\, \lambda_3^{2}} \left( 2 - \frac{1}{\ln(1/\beta)} \right) \tag{S-47} $$

In ASNE, the analog metric to ξ(ε|m), defined in Table 1 in the text, is given by ∑n Θ(ε|n)φ(n). We note that this function was incorrectly defined and derived in Harte (2011) and incorrectly defined and tested in Newman et al. (2015), cited in the reference section of the text.

Two Possible Extensions

The genera-area relationship (GAR). In the ASNE version of METE, the species-area relationship (SAR) is derived from the expression:

$$ S(A) = S_0 \sum_{n_0} \varphi(n_0) \left[ 1 - \Pi(0|n_0, A, A_0) \right] \tag{S-48} $$

where Π(0|n0, A, A0) is the probability that if a species has n0 individuals in area A0, then it has 0 individuals in area A, and S0 is the number of species in area A0. Both functions, φ and Π, are predicted from the theory. In the extended theory, the genera-area relationship (GAR) can be derived if Π(0|m, n0, A, A0) is known. A genus will be found in an area A if at least one individual of at least one species in that genus is found in area A. Hence we expect the GAR to be flatter than the SAR. We can derive the GAR using the conditional abundance distribution φ(n|m) as a weighting function and noting that a genus with m species will not be found in area A only if each of the m species in that genus is not in A. For such a genus, the probability of not occurring in an area A is the product of the probabilities Π(0|m, n, A, A0) that each species in the genus does not occur in the area.
Determining the m-dependence of Π(0) and deriving the explicit functional form of the GAR await further analysis.

Additional taxonomic categories. The methods used above to extend METE by adding an additional taxonomic category, genus, can be generalized to an arbitrary number of nested categories. For example, the natural extension of the entities R(n, ε|S0, N0, E0) in ASNE and Q(m, n, ε|G0, S0, N0, E0) in AGSNE is, for AFGSNE, T(l, m, n, ε|F0, G0, S0, N0, E0), defined as follows: pick a family at random from the pool of families; then T·dε is the probability that it has l genera, and, if you pick one of those genera at random, that it has m species, and, if you pick one of those species at random, that it has n individuals, and, if you pick one of those individuals, that it has a metabolic rate in the interval (ε, ε + dε). Paralleling Eq. S-4, the MaxEnt solution is of the form

$$ T(l,m,n,\varepsilon) = \frac{e^{-\lambda_1 l}\, e^{-\lambda_2 l m}\, e^{-\lambda_3 l m n}\, e^{-\lambda_4 l m n \varepsilon}}{Z(\lambda_1,\lambda_2,\lambda_3,\lambda_4)} \tag{S-49} $$

Generalizing further, the constraints imposed by an entire taxonomic tree can be included within the MaxEnt framework. We further note that the constraints could arise from knowledge of the structure of the phylogenetic, rather than the taxonomic, tree. A full treatment of such entire trees also awaits further analysis.