Supplementary information for Ecological Interactions on Macroevolutionary Time Scales: Clams and Brachiopods are more than Ships that Pass in the Night

Liow L.H., Reitan, T. & Harnik P.G.

Materials and Methods

1. Data

We downloaded all available brachiopod and bivalve occurrence data recorded from marine deposits in the Paleobiology Database (PaleoDB, downloaded 16 Dec 2014). We altered the following download options: only marine deposits were downloaded; genus names with the qualifiers aff., cf., ?, ", ex gr. and sensu lato were excluded; and genus names were replaced by subgenus names where available. Note that taxa not identified to species level, e.g. “Genus sp.”, are included in our data. This means that dynamics will be more muted (Wagner et al. 2007). Subgenera and genera are often treated as equal in rank in macroevolutionary studies of fossil marine invertebrates (Roy et al. 1996; Sepkoski 2002; Jablonski et al. 2003; Simpson & Harnik 2009; Foote & Miller 2013) because the morphological distinction between these ranks can be arbitrary (Roy et al. 1996). This protocol is standard in global-scale analyses of fossil marine bivalves (Roy et al. 1996; Jablonski et al. 2003; Simpson & Harnik 2009). However, workers concerned about taxonomic over-splitting in certain groups, such as fossil brachiopods (Cooper 1970; Carlson & Fitzgerald 2007), may favor the genus level as the operational taxonomic unit (Harnik et al. 2014; Powell et al. 2015). All analyses reported here treat subgenera and genera as equal in rank; a preliminary analysis in which we substituted available subgenus names for genus names for bivalves only, and not for brachiopods, gave qualitatively similar results. In the main text, we explain why we analyzed data for (sub)genera rather than species.
We used “ma_max” and “ma_min” as reported in the PaleoDB as the age range of each observed occurrence, where “ma_max” is the maximum age estimate of the fossil occurrence, based on geochronology if available and otherwise on the interval name, and “ma_min” is the equivalent minimum age estimate, both in millions of years ago. Before capture-recapture analyses (see next section), we removed data whose reported age ranges are greater than the longest age interval in the ICS chart (Cohen et al. 2013) (18.5 My, the duration of the Carnian); we also removed occurrences assigned a Cambrian age. For brachiopods, we have 135251 data points representing observations of 3420 (sub)genera, while for bivalves we have 156011 data points representing 2679 (sub)genera. These are the data that we use for capture-recapture analyses (see below). M. Clapham (30.6%), W. Kiessling (14.6%) and A. Miller (11.7%) were the top three authorizers for the brachiopod data we used, while for the bivalve data the top three authorizers were W. Kiessling (18.7%), A. Hendy (16.4%) and M. Clapham (13.7%); the numbers in parentheses indicate the proportion of total hours authorized.

2. Capture-recapture models for paleontological data: Pradel seniority model

Capture-recapture and related approaches have their roots in population ecology (King 2014), and we and others have previously used such approaches to infer diversification parameters using fossil observation data that are analogous to capture-recapture data in ecology (Nichols & Pollock 1983; Connolly & Miller 2001b, a, 2002; Kröger 2005; Liow et al. 2008; Liow & Finarelli 2014). Here, we briefly outline the approach we use and refer readers to Pradel (1996), Connolly and Miller (2001b) and Liow & Finarelli (2014) for details. The data available to us are the recorded observations of genera in the fossil record.
Assuming that there are no errors in taxonomic identification or age assignment, we can infer that a taxon was extant and sampled if there is an observation of it in our database. For instance, in the table below, Taxon A was sampled in Time 1 (the oldest interval) and again in Time 3 (a younger interval), but not in Time 2, so we can infer that it must have been extant during Time 2 but simply not sampled. For Taxon B, however, we know it was extant in Time 1, but we do not know whether it was extant but simply not sampled in Times 2, 3 or 4.

          Time 1   Time 2   Time 3   Time 4
Taxon A     1        0        1        1
Taxon B     1        0        0        0

Given a dataset of such sampling histories, in this classic Cormack-Jolly-Seber (CJS) example (Cormack 1964; Jolly 1965; Seber 1965), where survival probabilities are conditioned on first observation, the probability of observing the sampling history (sh) of Taxon A can be written as

$$\Pr(sh=[1011]) = \varphi_1(1-p_2)\,\varphi_2\,p_3\,\varphi_3\,p_4 \qquad (1)$$

For Taxon B, it is

$$\Pr(sh=[1000]) = (1-\varphi_1) + \varphi_1(1-p_2)(1-\varphi_2) + \varphi_1(1-p_2)\varphi_2(1-p_3)(1-\varphi_3) + \varphi_1(1-p_2)\varphi_2(1-p_3)\varphi_3(1-p_4) \qquad (2)$$

where $\varphi$ is the survival probability between the subscripted time interval and the next and $p$ is the sampling probability within the subscripted time interval. Note that multiple observations of the same taxon in a given time interval are treated simply as “observation”, as opposed to “non-observation”, in this set-up. While it is potentially fruitful to utilize multiple observations, such an approach is beyond the scope of this work but is part of our on-going work. To simplify the modeling, we assume that all the taxa within our dataset have the same survival probabilities and sampling probabilities within the same time intervals (see later paragraph on assumptions).
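As an illustration of Eqns (1) and (2), the following Python sketch computes the probability of an arbitrary sampling history under the CJS set-up, conditioned on first observation. It is not the code used in our analyses; the function name `history_probability` and the numerical parameter values are hypothetical and chosen only for this example.

```python
def history_probability(history, phi, p):
    """Probability of a 0/1 sampling history, conditioned on first observation.

    history : list of 0/1 flags; history[0] must be 1 (the first observation).
    phi     : phi[i] is the survival probability from interval i to i+1
              (0-based), so len(phi) == len(history) - 1.
    p       : p[i] is the sampling probability in interval i; p[0] is unused
              because the history is conditioned on the first observation.
    """
    n = len(history)
    last = max(i for i, seen in enumerate(history) if seen == 1)

    # Up to the last observation the taxon is known to have survived.
    prob = 1.0
    for i in range(1, last + 1):
        prob *= phi[i - 1] * (p[i] if history[i] else 1.0 - p[i])

    # After the last observation: either extinct, or extant but unsampled.
    # Built backwards from the final interval (nested form of Eqn 2).
    unseen = 1.0
    for i in range(n - 2, last - 1, -1):
        unseen = (1.0 - phi[i]) + phi[i] * (1.0 - p[i + 1]) * unseen
    return prob * unseen


# Hypothetical parameter values, for illustration only.
phi = [0.9, 0.8, 0.7]        # phi_1, phi_2, phi_3
p = [0.0, 0.5, 0.6, 0.4]     # p_1 (unused), p_2, p_3, p_4

prob_taxon_a = history_probability([1, 0, 1, 1], phi, p)  # Eqn (1)
prob_taxon_b = history_probability([1, 0, 0, 0], phi, p)  # Eqn (2)
```

The backward recursion is equivalent to summing the four terms of Eqn (2) explicitly, but it extends to histories of any length.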
We could further assume that sampling is time-varying but survival is not, in which case the likelihood is

$$L(\varphi, p_2, p_3, p_4) = \{\varphi^3(1-p_2)\,p_3\,p_4\}^{N_{1011}} \, [\,1-\varphi+\varphi(1-p_2)(1-\varphi+\varphi(1-p_3)(1-\varphi+\varphi(1-p_4)))\,]^{N_{1000}} \qquad (3)$$

where $N$ indicates how many cases in the dataset have the corresponding sampling histories exemplified by taxa A and B. The maximum likelihood estimates of the parameters can then be found.

The Pradel seniority model is a modification of the classic CJS model (described above) in which survival probabilities (the complement of extinction probability) and “seniority” probabilities (the complement of origination probability if genus observation data are used) can be estimated simultaneously with sampling probabilities. Upon reparametrization of the Pradel model (Pradel 1996; Liow & Finarelli 2014), net diversification (origination minus extinction) can also be estimated.

The assumptions of capture-recapture approaches are well known and the effects of violating these assumptions have been well studied (Pollock et al. 1990; Williams et al. 2002; Liow & Nichols 2010). To summarize, using the CJS model as an example, i) sampling and survival probabilities are assumed to be equal for all taxa in the dataset, ii) the sampling intervals are short relative to the timespan over which survival is estimated and iii) the fate of each taxon is independent of other taxa in the dataset. By making parameters taxon- or time-specific, we can relax the first assumption. The second assumption is always violated for paleontological data, but simulations have shown that the effects of this violation are not substantial (see Pollock et al. 1990, Williams et al. 2002, Liow and Nichols 2010 and references therein). Violation of the third assumption leads to over-dispersion and overly narrow confidence intervals for the estimates, but this can be corrected using a variance inflation factor if desired.
In our study, we estimated diversification and sampling parameters for brachiopods and bivalves separately and assumed a fully time-varying model such that each of the 86 time intervals had a priori different estimates. We also removed time points with large uncertainties (see Methods in main text and Table S1). We supply the datasets we used as separate files in our supplementary information (bivalves16dec2014Marine.csv and brachiopods16dec2014Marine.csv) and in Table S1 we supply the untransformed estimates from the Pradel model.

3. Inference of macroevolutionary processes using SDE: background

Paleontological datasets often span a substantial amount of time and are composed of observations of a given system at various time points. These temporal observations, like other time series data, can exhibit considerable time dependency, i.e. the state of the system at one time point can be correlated with that at another time point. Paleontologists often wish to know whether a paleobiological time series (e.g. phenotype, extinction rates) is correlated with or driven by another such biotic time series and/or an environmental time series (e.g. predator density, climate, sea level). Given the time dependency inherent in many time series (including paleontological and environmental time series), attempts to identify relationships among time series using ordinary regression and related techniques are unreliable. Although the estimate of the effect might be unbiased, the uncertainty of that effect will be seriously underestimated because ordinary regression assumes independent noise terms. This can often lead to rejection of a true null hypothesis (Type I error). If, however, the time dependency is addressed using a time series model that incorporates the auto-correlation present in the data, reliable tests for effects can be performed.
Common tools for time series analysis, predominantly based on ARIMA models, tacitly assume that the given time series have been sampled regularly (equidistantly) in time. Time series in paleontology, like the ones we are interested in here, are however often not temporally equidistant. For instance, the ages of observed fossils are haphazardly dispersed in time and geological stages are also not regularly spaced temporally. Thus it is necessary to have statistical tools for handling observations that have irregularly spaced time points. Stochastic differential equations (SDEs) describe processes that are continuous in time, such that analyses based on such models are able to deal with measurements distributed irregularly in time. In addition, paleobiological processes are often continuous even though our observations are discrete. SDEs can be linear or nonlinear and we use linear SDEs here because their likelihoods are analytically tractable. For a detailed insight into SDEs, see Øksendal (2010) and Evans (2013). Linear SDEs have been used in paleobiological studies, for example Brownian motion (Raup 1977) and the Ornstein-Uhlenbeck process (Lande 1976). However, linear SDEs also easily allow for studying the effect of one time series on another (Hansen 1997) or even the effect of unmeasured time series on measured time series (Hansen et al. 2008; Reitan et al. 2012). The general expression for a linear SDE can be written as

$$dX(t) = \left(A(t)X(t) + m(t)\right)dt + \Sigma(t)\,dB(t) \qquad (4)$$

where the first part of Eqn (4) is deterministic and the second part stochastic. In the first part of the equation, $X(t)$ is the vector process of interest (potentially having both measured and hidden components), $m(t)$ is a vector and $A(t)$ and $\Sigma(t)$ are matrices with explicit time dependency; in the second part, $B(t)$ represents stochastic contributions to the system.
The additive vector $m$, the so-called pull matrix $A$ and the covariance matrix $\Sigma$ are often assumed to be time-independent, which simplifies calculations. Assuming a constant pull matrix $A$, the SDE can formally be transformed into a stochastic integral of the form

$$X(t) = e^{A(t-t_0)}X(t_0) + \int_{t_0}^{t} e^{A(t-u)}m(u)\,du + \int_{t_0}^{t} e^{A(t-u)}\Sigma(u)\,dB(u) \qquad (5)$$

which is a function of the previously known state at time $t_0$. If no previous state is known, one can set $t_0 = -\infty$. Equation (5) retains the normality of the stochastic contributions, thus only the expectation vector and the covariance matrix of the vectorial process $X(t)$ (that is, for all possible combinations of time points) are necessary for deriving a likelihood for different states of the process. Assuming that one can find an eigenvector decomposition $VA = \Lambda V$, where $\Lambda$ is a diagonal matrix of eigenvalues, the expectation vector will be

$$E[X(t) \mid X(t_0)] = V^{-1}e^{\Lambda(t-t_0)}VX(t_0) + V^{-1}\int_{t_0}^{t} e^{\Lambda(t-u)}V m(u)\,du \qquad (6)$$

and the covariance matrix of the state at two different time points $t$ and $v$, where $t>v$, will be

$$\mathrm{Cov}[X(v), X(t) \mid X(t_0)] = V^{-1}\left(\int_{t_0}^{v} e^{\Lambda(v-u)}V\Sigma(u)\Sigma(u)'V'e^{\Lambda(t-u)}\,du\right)(V^{-1})' \qquad (7)$$

where $V'$ stands for the transpose of matrix $V$. When $m$ and $\Sigma$ are also constant, algebraic expressions for the expectation (Eqn (8)) and covariance (Eqn (9)) can be found:

$$E[X(t) \mid X(t_0)] = V^{-1}e^{\Lambda(t-t_0)}VX(t_0) - V^{-1}\Lambda^{-1}\left(I - e^{\Lambda(t-t_0)}\right)Vm \qquad (8)$$

$$\mathrm{Cov}[X(v), X(t) \mid X(t_0)] = V^{-1}\,\Xi(t,v,t_0)\,(V^{-1})' \qquad (9)$$

where

$$\Xi(t,v,t_0)_{i,j} = -\frac{e^{\lambda_j(t-v)} - e^{\lambda_i(v-t_0)+\lambda_j(t-t_0)}}{\lambda_i+\lambda_j}\,\Omega_{i,j} \quad \text{and} \quad \Omega = V\Sigma\Sigma'V'$$

(see Reitan et al. 2012, including supplementary material). In addition to the stochasticity of one state conditioned on the previous state, independent measurement errors also contribute to the overall stochasticity of the observations. Assuming that these errors are also normally distributed, one can either express the likelihood through a multi-normal distribution of all measurement points or use a Kalman filter (see Kalman 1960 and Reitan et al.
2012 for use in the linear SDE setting), which incrementally calculates likelihood contributions one measurement at a time. We use a Kalman filter because of its computational efficiency.

General vector and matrix expressions allow total flexibility in linear SDE modeling. However, using these general expressions directly as models introduces unnecessary complexity, possibly making some parameters unidentifiable and rendering results difficult to interpret. Imposing restrictions and extra structure on these equations alleviates this problem and makes it possible to answer questions such as “is the process stationary?” or “is there a relationship between process 1 and process 2 and, if so, is it a causal relationship (one process affecting/driving the other) or one of simple correlation?” For example, stationarity can be studied by comparing a Brownian motion (BM) process, which is non-stationary, to an Ornstein-Uhlenbeck (OU) process, which is stationary. BM can be described using only one stochastic variable in the SDE (although it can be expanded to the multivariate case), that is, $dX(t) = \sigma\,dB(t)$. It is characterized by being normal, having a variance that increases with time, $\mathrm{var}(X(t)) = \sigma^2 t$, and having increments that are independent. BM has been proposed as a null hypothesis for evolutionary processes (Raup 1977). In contrast, the OU process, described by $dX(t) = -\alpha(X(t) - \mu)\,dt + \sigma\,dB(t)$, is stationary with expectation $\mu$, stationary variance $\mathrm{var}(X(t)) = \sigma^2/2\alpha$ and a correlation between the state at time $t$ and time $u$ of $\rho(t-u) = e^{-\alpha|t-u|}$. The OU process has been used as a model for phenotypic evolution where the fitness landscape is bell-shaped (see Lande 1976) with an optimum at $\mu$. The pull $\alpha$ then describes the strength of the selection. The stochastic part represents genetic drift in this model.
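Because the OU process $dX(t) = -\alpha(X(t) - \mu)\,dt + \sigma\,dB(t)$ has normal transitions with a known mean and variance, it can be simulated exactly at irregularly spaced time points, which is the property that makes linear SDEs suitable for unevenly sampled series. The Python sketch below illustrates this under assumed parameter values; the function names `ou_transition` and `simulate_ou` and the example time points are hypothetical and are not taken from our analysis code.

```python
import math
import random


def ou_transition(x0, dt, alpha, mu, sigma):
    """Mean and variance of X(t0 + dt) given X(t0) = x0 for an OU process."""
    decay = math.exp(-alpha * dt)
    mean = mu + (x0 - mu) * decay
    # Approaches the stationary variance sigma^2 / (2 alpha) as dt grows.
    var = sigma ** 2 / (2.0 * alpha) * (1.0 - decay ** 2)
    return mean, var


def simulate_ou(times, alpha, mu, sigma, seed=0):
    """Draw one OU path at the given (possibly irregular) increasing times."""
    rng = random.Random(seed)
    # Start from the stationary distribution.
    x = rng.gauss(mu, sigma / math.sqrt(2.0 * alpha))
    path = [x]
    for t_prev, t_next in zip(times, times[1:]):
        mean, var = ou_transition(x, t_next - t_prev, alpha, mu, sigma)
        x = rng.gauss(mean, math.sqrt(var))
        path.append(x)
    return path


# Hypothetical, irregularly spaced observation times (arbitrary time units).
times = [0.0, 0.3, 1.7, 1.9, 4.2, 9.0]
path = simulate_ou(times, alpha=0.5, mu=2.0, sigma=1.0)
```

No discretization error is introduced here, in contrast to step-wise approximations, because each step uses the exact conditional distribution.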
The OU process has also been proposed as a model for the process of change in the optimum of the fitness landscape itself (Hansen 1997). The pull is often re-parameterized using either the characteristic time $t_c = 1/\alpha$ or the half-life, $\Delta t_{1/2} = \log(2)/\alpha$. The half-life describes the time it takes for the expected distance between the process and $\mu$ to be halved, as well as the time it takes for the correlation to drop to 0.5. The incremental variance can also be re-parameterized as the stationary variance $\sigma^2/2\alpha$. The OU process has also been described as a model for stabilizing selection (Estes & Arnold 2007), but when the phenotype is far from the optimum, it can also be viewed as a model for directional selection.

Although we can use linear SDEs to encode different ideas (e.g. a mean-reverting process, a random process, etc.), the same verbal idea can be modeled in different ways. For instance, directional evolution can be described by a non-zero value for the additive term in Brownian motion, such that $dX(t) = m\,dt + \sigma\,dB(t)$ (Smaers & Vinicius 2009), or by a linear expectation term in the OU process, such that $\mu(t) = \mu_0 + \gamma t$ (Estes & Arnold 2007). Model comparisons can help clarify model differences, even if precise interpretations of each model may be wanting. One must also be aware that there might be multiple interpretations of the favored model (see Reitan et al. 2012).

So far, only single-variable processes have been considered. Frequently, however, the objective is to examine the relationship between two such processes. With linear SDEs this is possible even if the measurements were not sampled at the same time points and even if there are non-equidistant gaps in the data. Overlapping measurement periods are of course necessary. Relationships can be investigated with a correlated noise model:

$$dX_1(t) = -\alpha_1(X_1(t) - \mu_1)\,dt + \sigma_1\,dB_1(t) \qquad (10)$$

$$dX_2(t) = -\alpha_2(X_2(t) - \mu_2)\,dt + \sigma_2\sqrt{1-\rho^2}\,dB_2(t) + \rho\,\sigma_2\,dB_1(t) \qquad (11)$$

Eqn (10) is an OU process, whereas Eqn (11) has not only its independent contribution, but also a noise term shared with Eqn (10).
While the expression given here is asymmetrical in Eqns (10) and (11), the two equations can be expressed in a symmetric fashion with the same statistical properties. Each process by itself will have the same properties as an OU process, but the two processes seen together will be correlated. If the pull (and half-life) is the same for the two processes, the correlation between the states of the two processes at the same time will simply be $\rho$, while if the pulls are different, the correlation will be smaller, namely $2\rho\sqrt{\alpha_1\alpha_2}/(\alpha_1+\alpha_2)$. Such processes can be expected to have peaks and troughs at approximately the same time points, with neither process systematically preceding the other. Note that BM processes might be similarly linked.

Another way of connecting two processes is through a causal relationship. This can be unidirectional, with one process driving the other, or bidirectional, with each process driving the other. A causal relationship between two processes exists when the state of one process influences the outcome of the other (Granger 1969; Schweder 1970). Such directional pulls are most easily modeled in OU-like processes. If process $X_1(t)$ influences $X_2(t)$ in a linear fashion, we can write

$$dX_2(t) = -\alpha_2\left(X_2(t) - \mu_2 - \beta(X_1(t) - \mu_1)\right)dt + \sigma_2\,dB_2(t) \qquad (12)$$

If we compare Eqn (12) with Eqn (10), we can see that $X_2(t)$ is an OU-like tracking process that has an additive term that is influenced by $X_1(t)$. The two processes will be correlated, but the peaks and troughs are now expected to occur first in $X_1(t)$ and then in $X_2(t)$. The relationship between the two processes can be summarized by a regression-like parameter, $\beta$, instead of the correlation term, $\rho$. It is also fairly straightforward to exchange the OU process for BM as the driving process, as was done in Hansen et al. (2008).
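To make the causal (tracking) structure concrete, the sketch below simulates a driving OU process $X_1$ and a process $X_2$ that tracks it with regression-like parameter $\beta$. It is a minimal illustration that uses a simple Euler-Maruyama approximation on a fine regular grid, which is an assumption of this example and not the exact method used for inference; the function name `simulate_tracking` and all parameter values are hypothetical.

```python
import math
import random


def simulate_tracking(n_steps, dt, alpha1, mu1, sigma1,
                      alpha2, mu2, sigma2, beta,
                      x1_start=None, x2_start=None, seed=0):
    """Euler-Maruyama paths of a driver X1 (OU) and a tracker X2.

    X2 is pulled towards mu2 + beta * (X1 - mu1), so fluctuations in X1
    are followed, with a delay governed by alpha2, by fluctuations in X2.
    """
    rng = random.Random(seed)
    x1 = mu1 if x1_start is None else x1_start
    x2 = mu2 if x2_start is None else x2_start
    xs1, xs2 = [x1], [x2]
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        dx1 = -alpha1 * (x1 - mu1) * dt + sigma1 * sqrt_dt * rng.gauss(0.0, 1.0)
        dx2 = (-alpha2 * (x2 - mu2 - beta * (x1 - mu1)) * dt
               + sigma2 * sqrt_dt * rng.gauss(0.0, 1.0))
        x1, x2 = x1 + dx1, x2 + dx2
        xs1.append(x1)
        xs2.append(x2)
    return xs1, xs2


# Hypothetical parameter values: a slow driver and a faster tracker.
driver, tracker = simulate_tracking(
    n_steps=5000, dt=0.01,
    alpha1=math.log(2) / 20.0, mu1=2.0, sigma1=1.0,
    alpha2=math.log(2) / 5.0, mu2=2.0, sigma2=0.5, beta=1.0)
```

Setting both noise terms to zero and holding the driver fixed away from its mean makes the tracker settle at the level implied by $\beta$, which is a quick way to check the sign conventions.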
Note also that if process $X_1(t)$ is not measured, its influence can still be detectable in the measured process, as the auto-correlation of the measured process will be different from that of an OU process. This is the basis for multi-layered SDE analysis (Reitan et al. 2012). A given time series can be fitted to a single-layer SDE, which can be a BM process, an OU process, an OU process with a trend, etc., but it can sometimes be better described by multiple SDEs that are encapsulated in causal layers (Reitan et al. 2012). In such a multi-layered SDE, variables describing the data are ordered with a directed causal flow. We then number the multiple processes associated with a single measured time series (Layer 1), such that for a three-layered process, Layer 3 causally affects Layer 2, which in turn causally affects Layer 1. Layer 1 would thus be an OU-like tracking of Layer 2, which in turn would be an OU-like tracking of Layer 3. The process of the lowest layer (Layer 3 in our terminology) could be either an OU process, an OU process with a linear time trend or a BM process. For identifiability, we set $\beta = 1$ (see SI of Reitan et al. 2012), and $\mu$ is set for the lowest layer as the other processes will inherit this expectation. So if the lowest layer (Layer 3) is an OU process, the whole system looks like this (where subscripts denote layers):

$$dX_1(t) = -\alpha_1(X_1(t) - X_2(t))\,dt + \sigma_1\,dB_1(t)$$
$$dX_2(t) = -\alpha_2(X_2(t) - X_3(t))\,dt + \sigma_2\,dB_2(t)$$
$$dX_3(t) = -\alpha_3(X_3(t) - \mu)\,dt + \sigma_3\,dB_3(t)$$

Data quantity and quality can set limits on what relationships are detectable with SDE analyses. With too few data points or with too little overlap between the time periods of the measurements, a correlative or causal relationship between two truly correlated or causally related processes might not be found. If the measurements are too sparsely sampled in time to detect the auto-correlation in each process, causal relationships may be inferred as correlative ones.
If a hidden (layered) process, A, affects B and process B responds too quickly relative to the sparseness of the measurements, only the dynamics of process A will be detected.

4. Bayesian framework and model comparison

Likelihood landscapes for evolutionary processes can be very complex and we have previously encountered numerical problems in our maximum likelihood approaches to analyzing linear SDEs (see Reitan et al. 2012). Hence we use Bayesian Markov chain Monte Carlo (MCMC) analyses here to obtain samples from the posterior distributions of the parameters, given the data and prior distributions. For each parameter, we assigned an independent prior distribution, which was set to be wide but informative: the expected value, $\mu$; the logged stochastic contribution, $\log(\sigma)$; the logged half-life, $\log(\Delta t_{1/2})$; the causal connection, $\beta$; the linear time trend, $\gamma$; and the logit-transformed correlation, $\log((1+\rho)/(1-\rho))$, were all assigned normal distributions adjusted to give a target 95% credibility interval (CI) on the original scale. The expected value, $\mu$, was given a 95% CI of (-20, 20) for the logged biotic rates, (-1000, 1000) for sea level, (-1, 1) for normalized abiotic data (see Section 6 and main text) and (-10, 10) for the other abiotic series. The stochastic contribution, $\sigma$, was given a vague prior with a 95% CI of (0.01, 100) for logged biotic rates and (0.00001, 100) for the remaining time series. The half-life, $\Delta t_{1/2}$, and the correlation, $\rho$, were assigned (0.001, 1000) and (-0.96, 0.96) respectively as 95% CIs for all series. Linear time trends, $\gamma$, and causal connections, $\beta$, were both given a 95% CI of (-10, 10).

Model comparison was performed using the Bayesian Model Likelihood (BML), defined as the probability density of the data given the prior parameter distribution of the model:

$$\mathrm{BML}(M) \equiv f(D \mid M) = \int f(D \mid \theta_M)\,f(\theta_M)\,d\theta_M$$

where $f(D \mid \theta_M)$ is the likelihood and $f(\theta_M)$ is the prior distribution.
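The prior construction described earlier in this section, a normal distribution on a transformed scale matched to a target 95% interval on the original scale, can be sketched as follows. This is an illustration only: the helper names `normal_prior_from_interval` and `correlation_transform` are hypothetical, and we assume here that the interval endpoints are mapped to the 2.5% and 97.5% quantiles of the normal prior.

```python
import math

# 97.5% quantile of the standard normal distribution.
Z975 = 1.959963984540054


def normal_prior_from_interval(lower, upper, transform=lambda v: v):
    """Mean and standard deviation of a normal prior on the transformed
    scale whose central 95% interval maps back to (lower, upper)."""
    lo, hi = transform(lower), transform(upper)
    mean = 0.5 * (lo + hi)
    sd = (hi - lo) / (2.0 * Z975)
    return mean, sd


def correlation_transform(rho):
    """Map a correlation in (-1, 1) to the real line."""
    return math.log((1.0 + rho) / (1.0 - rho))


# Intervals quoted in the text.
half_life_prior = normal_prior_from_interval(0.001, 1000.0, math.log)
sigma_biotic_prior = normal_prior_from_interval(0.01, 100.0, math.log)
mu_biotic_prior = normal_prior_from_interval(-20.0, 20.0)
rho_prior = normal_prior_from_interval(-0.96, 0.96, correlation_transform)
```

For example, the half-life interval (0.001, 1000) is symmetric around zero on the log scale, so the resulting prior for the logged half-life is centered at 0.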
The BML can be used for deriving the Bayes factor (which compares two models), $B(M_1, M_2) = f(D \mid M_1)/f(D \mid M_2)$, or to calculate model probabilities in a collection of $M$ models,

$$P(M_j \mid D) = \frac{f(D \mid M_j)\,P(M_j)}{\sum_{i=1}^{M} f(D \mid M_i)\,P(M_i)}.$$

For our purposes, a given null model (of no relationship between two time series) is assigned a prior probability of 50%, with the alternative models sharing the remaining 50% probability equally. Thus the Bayes factor for the null model versus the set of alternatives is $B(M_0, M_A) = P(M_0 \mid D)/(1 - P(M_0 \mid D))$. This can equivalently be reported as the posterior probability of the null model, $P(M_0 \mid D)$. Our numerical method for calculating the BML uses a multivariate normal distribution adjusted to the MCMC samples as a proposal distribution in an importance sampling scheme, as described in Reitan and Petersen-Øverleir (2009). In short, while the model probabilities do depend on the prior distribution, they are robust. That is, the model probabilities are not very sensitive to changes in the prior distribution, as long as these changes are not dramatic (several orders of magnitude; see Reitan et al. 2012).

5. Comparing SDE with ordinary regression based approaches with simulations

We simulated three scenarios and then applied SDE, as well as regression techniques, to recover the relationships between time series pairs. In the first scenario the two simulated time series are i) independent; in the second they are ii) correlated, but not causally related; and in the third they are iii) causally related such that one time series drives the other (see main text Eqns 2 to 4). Each time series pair was simulated 100 times, each series with 200 data points drawn uniformly over a time period [0, 100] such that no time points were found in both series (i.e. zero overlap of temporal data).
While this sounds extreme, we note that it is exceedingly rare that an isotope measurement (paleoenvironmental proxy) stems from the same shell that contributed to a taxon observation in the PaleoDB. Even taxa observed in the same named stage are unlikely to be from the same decade or century, all the more so when they are from different locations. The OU parameters were $\mu = 2$, $\sigma = 2$, $\Delta t_{1/2} = 5$ for the first series in each pair and $\mu = 2$, $\sigma = 1$, $\Delta t_{1/2} = 20$ for the second. To enable linear regression analysis, the simulated data were binned into 2 My intervals. Note that there is a mean of 48 time bins in the binned data as some of the 100 time intervals contain no data points due to the randomized drawing of data. In addition to performing regression directly on the binned simulated data, we also performed the same analysis after shifting the data of one time series one time bin forward relative to the other and vice versa. Bonferroni correction was used to deal with the multiple testing. The code for generating and analyzing these simulated time series can be found in simcausal.zip (attached here with our submission and could also be deposited at Dryad or another online repository).

Given that 95 out of 100 cases were classified as independent when they were indeed independent, we can say that linear regression on first-differenced time series is well calibrated at the 95% confidence level in this set of simulations (Table S8). However, linear regression on the un-differenced time series gave a false positive rate of 7+18+18=43% (Table S8). This invalidates the true positive rates of the correlated and causally related cases. Note also that linear regression on causally connected first-differenced time series performed poorly: 73 causal cases were classified as independent. In contrast, SDE performed well in the classification of relationships in all simulated scenarios (Table S8).
It must be stressed that linear regression on shifted time series is not the same as inferring a true Granger causal relationship (Granger 1969, Schweder 1970, see rest of section). We have nevertheless assumed such an interpretation in Table S8, as time-lagged analyses have been used in the paleontological literature as a test for causal relationships.

In the above simulations, the false positive and false negative rates both vary between approaches. The Bayesian test uses BML based on vague prior knowledge and is not constructed to yield 5% false positives, as a well-calibrated classic test is supposed to (if a 95% confidence level is desired). However, by varying the target Bayes factor for linear SDE analysis and the p-value for classic regression, one can create a ROC curve. A ROC (Receiver Operating Characteristic) curve is a plot that shows the relationship between false positives and true positives for varying test sensitivities, such that comparisons between tests of different philosophies are possible. An entirely random test will yield a diagonal line, while a near-perfect test will yield a line squeezed towards the upper left corner of a ROC plot (see Fig. below). The strength of a test calibrated to a 5% significance level can be read from the true positive rate (TPR) when the false positive rate (FPR) is 5%. Since we are now concerned only with inferring whether or not there is a link, the nature of the relationship (correlated or causal) is not of concern here. For correlated pairs of time series, linear regression on first-differenced data performed only slightly better than linear SDE for our example, while linear regression on untransformed data performed very poorly (see panel A in the figure below). For causally connected pairs, both regression methods performed much worse than linear SDE analysis and in this particular case almost as poorly as a random test (see panel B in the figure below).
It must be noted that the test strength given for the two regression tests at a 5% significance level must be regarded as a best-case scenario, since one must first use time series models to calibrate these tests.

Figure: ROC curves for A) correlated time series pairs, B) causally connected time series pairs (x1 -> x2). The y-axes are true positive rates (TPR) and the x-axes are false positive rates (FPR). In both A and B, blue lines indicate linear regression on binned untransformed data, red lines indicate linear regression on binned first-differenced data and green lines indicate linear SDE analysis on original data. The black diagonal lines show the ROC curve of a purely random test. The vertical line shows where to read the test strength for a test calibrated to a 5% significance level.

The poor performance of linear regression on untransformed data is unsurprising given the high number of false positives seen in our simulations (Table S8). While linear regression on first differences yields better results than regression on untransformed data, it is still far less efficient than linear SDE analysis. A causal connection x1 -> x2 means that changes in x1 have an accumulated effect on x2, with a “memory” proportional to the half-life. This is not the same as a correlation for a specific temporal shift between the two time series. For temporally un-shifted data, both linear regression methods performed very poorly on causally related time series. In our simulated example, the two tests performed almost as poorly as an entirely random test. When we simulated x1 (the cause) as the slow series and x2 (the effect) as the fast series, linear regression on untransformed data gave better results than linear regression on first-differenced data for lag=0. The reason might be that since x1 at one time point is now well correlated with x1 at a later time point and x2 is correlated with x1 at the former time point, x1 and x2 will also be correlated at the same time point.
Thus we cannot take for granted that first-differenced data will yield better regression results than untransformed data in all cases. For the reasons laid out in this section, we did not perform linear regression on our empirical data.

6. Data preparation for SDE

SDE analyses assume normality of data. While all the biotic time series and the δ13C, δ34S and eustatic sea-level series are normally distributed, as verified with a Kolmogorov-Smirnov test, we had to transform δ18O and 87Sr/86Sr using $x = \Phi^{-1}(F_y(y))$, where $y$ is the original value, $x$ is the transformed value, $F_y$ is the estimated cumulative distribution function of the original values and $\Phi$ is the cumulative distribution function of the standard normal distribution. A kernel smoother was used to estimate $F_y$, using the “density” function in R (R Core Team 2014) with adjustment 0.1 such that the estimate is closer to the empirical distribution function.

Results

Here, we list the results presented in the supplementary Tables file and follow these with some short remarks.

Table S1. Pradel seniority estimates for origination, extinction and sampling probabilities
Table S2. Stand-alone time series summary: multi-layered process inference
Table S3. Relationships among paleoenvironmental proxies and biotic time series (single-layered process inference)
Table S4. Relationships among paleoenvironmental proxies and biotic time series (multi-layered process inference)
Table S5. Relationships among paleoenvironmental proxies (single-layered process inference)
Table S6. Relationships among paleoenvironmental proxies (multi-layered process inference)
Table S7. Relationships among biotic time series of bivalves and brachiopods (multi-layered process inference)
Table S8. Classification results from simulations

Relationships among abiotic time series

Of the five abiotic time series, only low latitude δ18O and δ13C have a significant positive correlative relationship with one another (Tables S5-S6) after a Bonferroni-like correction (see Methods). In contrast, a study using a different statistical approach (Hannisdal 2011) found that δ13C and δ34S, low latitude δ18O and δ13C, and δ34S and 87Sr/86Sr are the pairs of time series exhibiting the strongest signals of information transfer (IT), with the first two relationships being bidirectional while the last was unidirectional from δ34S to 87Sr/86Sr. Hannisdal also found a unidirectional coupling from δ34S to δ18O that can be explained by the other pairs of relationships. To summarize, we recovered only one of the relationships found by Hannisdal (2011), who used a non-parametric information transfer approach. The discrepancies between Hannisdal’s study and ours may be due to some or all of the following causes:
1) Discrepancies could simply be due to type I (Hannisdal 2011) and type II (our results) errors.
2) Hannisdal (2011) did not account for multiple testing in the IT analyses.
3) Hannisdal conditioned pairwise relationships on a third time series, an approach that we did not use.
4) Information transfer is better suited than linear SDE to inferring non-linear relationships. However, one would not expect non-linear relationships between two time series that are both normally distributed, even though it is technically possible. Three of the abiotic time series are normally distributed while we transformed the other two.

References cited

1. Carlson, S.J. & Fitzgerald, P.C. (2007).
Sampling taxa, estimating phylogeny and inferring macroevolution: an example from Devonian terebratulide brachiopods. Earth and Environmental Science Transactions of the Royal Society of Edinburgh, 98, 311-325.
2. Cohen, K.M., Finney, S.C., Gibbard, P.L. & Fan, J.-X. (2013). The ICS International Chronostratigraphic Chart. Episodes, 36, 199-204.
3. Connolly, S.R. & Miller, A.I. (2001a). Global Ordovician faunal transitions in the marine benthos: proximate causes. Paleobiology, 27, 779-795.
4. Connolly, S.R. & Miller, A.I. (2001b). Joint estimation of sampling and turnover rates from fossil databases: Capture-Mark-Recapture methods revisited. Paleobiology, 27, 751-767.
5. Connolly, S.R. & Miller, A.I. (2002). Global Ordovician faunal transitions in the marine benthos: ultimate causes. Paleobiology, 28, 26-40.
6. Cooper, G.A. (1970). Generic characters of brachiopods. In: North American Paleontological Convention (ed. Yochelson, EL). Allen Press Field Museum of Natural History, Chicago, pp. 194-263.
7. Cormack, R.M. (1964). Estimates of survival from sightings of marked animals. Biometrika, 51, 429-438.
8. Estes, S. & Arnold, S.J. (2007). Resolving the paradox of stasis: models with stabilizing selection explain evolutionary divergence on all time scales. The American Naturalist, 169, 227-244.
9. Evans, L.C. (2013). An Introduction to Stochastic Differential Equations. American Mathematical Society.
10. Foote, M. & Miller, A.I. (2013). Determinants of early survival in marine animal genera. Paleobiology, 39, 171-192.
11. Granger, C.W.J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 424-438.
12. Hannisdal, B. (2011). Non-parametric inference of causal interactions from geological records. American Journal of Science, 311, 315-334.
13. Hansen, T.F. (1997). Stabilizing selection and the comparative analysis of adaptation. Evolution, 51, 1341-1351.
14. Hansen, T.F., Pienaar, J. & Orzack, S.H. (2008).
A comparative method for studying adaptation to a randomly evolving environment. Evolution, 62, 1965-1977.
15. Harnik, P.G., Fitzgerald, P.C., Payne, J.L. & Carlson, S.J. (2014). Phylogenetic signal in extinction selectivity in Devonian terebratulide brachiopods. Paleobiology, 40, 675-692.
16. Jablonski, D., Roy, K., Valentine, J.W., Price, R.M. & Anderson, P.S. (2003). The impact of the pull of the recent on the history of marine diversity. Science, 300, 1133-1135.
17. Jolly, G.M. (1965). Explicit estimates from capture-recapture data with both death and immigration-stochastic model. Biometrika, 52, 225-247.
18. Kalman, R.E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82, 35-45.
19. King, R. (2014). Statistical Ecology. In: Annual Review of Statistics and Its Application, Vol 1 (ed. Fienberg, SE). Annual Reviews, Palo Alto, pp. 401-426.
20. Kröger, B. (2005). Adaptive evolution in Paleozoic coiled cephalopods. Paleobiology, 31, 253-268.
21. Lande, R. (1976). Natural selection and random genetic drift in phenotypic evolution. Evolution, 30, 314-334.
22. Liow, L.H. & Finarelli, J.A. (2014). A dynamic global equilibrium in carnivoran diversification over 20 million years. Proceedings of the Royal Society B: Biological Sciences, 281.
23. Liow, L.H., Fortelius, M., Bingham, E., Lintulaakso, K., Mannila, H., Flynn, L. et al. (2008). Higher origination and extinction rates in larger mammals. Proceedings of the National Academy of Sciences of the United States of America, 105, 6097-6102.
24. Liow, L.H. & Nichols, J.D. (2010). Estimating rates and probabilities of origination and extinction using taxonomic occurrence data: Capture-recapture approaches. In: Short Courses in Paleontology: Quantitative Paleobiology (eds. Hunt, G & Alroy, J). Paleontological Society, pp. 81-94.
25. Nichols, J.D. & Pollock, K.H. (1983).
Estimating taxonomic diversity, extinction rates, and speciation rates from fossil data using capture-recapture models. Paleobiology, 9, 150-163.
26. Øksendal, B. (2010). Stochastic Differential Equations: An Introduction with Applications. Sixth edn. Springer.
27. Pollock, K.H., Nichols, J.D., Brownie, C. & Hines, J.E. (1990). Statistical inference for capture-recapture experiments. Wildlife Monographs, 1-97.
28. Powell, M.G., Moore, B.R. & Smith, T.J. (2015). Origination, extinction, invasion, and extirpation components of the brachiopod latitudinal biodiversity gradient through the Phanerozoic Eon. Paleobiology.
29. Pradel, R. (1996). Utilization of capture-mark-recapture for the study of recruitment and population growth rate. Biometrics, 52, 703-709.
30. R Core Team (2014). R: A Language and Environment for Statistical Computing.
31. Raup, D.M. (1977). Probabilistic models in evolutionary paleobiology. American Scientist, 65, 50-57.
32. Reitan, T. & Petersen-Øverleir, A. (2009). Bayesian methods for estimating multisegment discharge rating curves. Stochastic Environmental Research and Risk Assessment, 23, 627-642.
33. Reitan, T., Schweder, T. & Henderiks, J. (2012). Phenotypic evolution studied by layered stochastic differential equations. Annals of Applied Statistics, 6, 1531-1551.
34. Roy, K., Jablonski, D. & Valentine, J.W. (1996). Higher taxa in biodiversity studies: patterns from eastern Pacific marine molluscs. Philosophical Transactions of the Royal Society B: Biological Sciences, 351, 1605-1613.
35. Schweder, T. (1970). Composable Markov processes. Journal of Applied Probability, 7, 400-410.
36. Seber, G.A.F. (1965). A note on multiple-recapture census. Biometrika, 52, 249-259.
37. Sepkoski, J.J. (2002). A compendium of fossil marine animal genera. Bulletins of American Paleontology, 363, 1-560.
38. Simpson, C. & Harnik, P.G. (2009). Assessing the role of abundance in marine bivalve extinction over the post-Paleozoic. Paleobiology, 35, 631-647.
39. Smaers, J.B. & Vinicius, L. (2009). Inferring macro-evolutionary patterns using an adaptive peak model of evolution. Evolutionary Ecology Research, 11, 991-1015.
40. Wagner, P.J., Aberhan, M., Hendy, A. & Kiessling, W. (2007). The effects of taxonomic standardization on sampling-standardized estimates of historical diversity. Proceedings of the Royal Society B: Biological Sciences, 274, 439-444.
41. Williams, B.K., Nichols, J. & Conroy, M.J. (2002). Analysis and Management of Animal Populations. Academic Press, San Diego, San Francisco, New York, Boston, London, Sydney, Tokyo.
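
Illustrative sketch for Section 6 (Data preparation for SDE)
The transformation to normality described in Section 6 maps each original value through a kernel-smoothed cumulative distribution function and then through the inverse of the standard normal cumulative distribution function. The Python sketch below is a minimal illustration of that idea under stated assumptions and is not the code used for the analyses, which relied on the “density” function in R. The Gaussian kernel, the rule-of-thumb bandwidth, the function names and the input series are all assumptions made for illustration; only the adjustment of 0.1 comes from the text.

```python
from statistics import NormalDist, quantiles, stdev

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is its inverse


def rule_of_thumb_bandwidth(values):
    """Assumed bandwidth rule: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = min(stdev(values), (q3 - q1) / 1.34)
    if spread <= 0:
        raise ValueError("need non-degenerate data to choose a bandwidth")
    return 0.9 * spread * n ** (-0.2)


def smoothed_cdf(y, values, bandwidth):
    """CDF of a Gaussian kernel density estimate, evaluated at y.

    Integrating a Gaussian kernel density gives the average of normal
    CDFs centred on the observations.
    """
    return sum(STD_NORMAL.cdf((y - v) / bandwidth) for v in values) / len(values)


def normal_scores(values, adjust=0.1):
    """Return x = Phi^{-1}(F_y(y)) for every y in values.

    A small `adjust` shrinks the bandwidth, so the smoothed CDF stays
    close to the empirical distribution function.
    """
    h = adjust * rule_of_thumb_bandwidth(values)
    return [STD_NORMAL.inv_cdf(smoothed_cdf(y, values, h)) for y in values]


# Hypothetical right-skewed series standing in for a proxy record.
skewed = [0.1, 0.2, 0.25, 0.4, 0.5, 0.9, 1.5, 3.0, 7.0, 20.0]
scores = normal_scores(skewed)
```

Because both the smoothed cumulative distribution function and the inverse normal cumulative distribution function are increasing, the transformation preserves the rank order of the original values and changes only their spacing.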