Liow Supplement

Supplementary Material for Liow (Paleobiology)

PRESENCE

PRESENCE is a Windows-based program written specifically for occupancy modeling. The program, worked examples, and a user manual can be downloaded at http://www.mbr-pwrc.usgs.gov/software/presence.html. The program MARK can also be used to run occupancy models, but this supplementary material is based solely on PRESENCE.

The following text guides a novice user through the occupancy modeling analyses presented for Hebertella occidentalis in the main text. Questions about PRESENCE other than those directly related to the analyses presented here should be directed to www.phidot.org, the discussion forum for PRESENCE and other related methods and software.

Data

Data files for Hebertella occidentalis (hereafter Hebertella) are included in the Excel file Hebertella_data.xls. These data are also downloadable from the Paleobiology Database, as are those for the other taxa analyzed in the main text.

There are a total of 698 samples, where a sample is a replicate from a specific location and sequence (see sheet labeled “Samples” in Hebertella_data.xlsx), that is, a collection made from one bed at a particular locality, equivalent to a collection in the Paleobiology Database. Each location typically contains multiple samples (e.g. sample numbers 2-4), and this replication allows occupancy modeling. Each site has associated facies information that is inferred independently of its faunal composition (Holland and Patzkowsky 2007). Although the area of the outcrop sampled was measured for some samples, this information is incomplete, and hence this potentially important covariate of occupancy is not considered here. Only the C1-C5 sequences are used for analyses because of the small sample size from C6.

Single season model (= Single time interval model)

Input format

A site is defined here as a unique combination of a geographic location, depositional sequence (e.g., C1-C5), and facies (offshore, deep subtidal, or shallow subtidal). With this definition, one geographic location that exposes a thick succession of strata with multiple depositional sequences and facies is composed of multiple sites. Each site must have at least two samples to be used in the analyses. For instance, sample 1 is the only sample from the C2 deep subtidal at AA Cold Spring, whereas samples 6-17 are considered replicates because they are all from the C1 offshore at Cold Spring.

The localities are arranged as rows (sheet labeled “Single season input format”). Of these, site 13 (C1 offshore at Maysville 62-68) has the greatest number of replicate samples within sites in the C1 sequence (26). Therefore, the number of samples for a C1 analysis is set at 26. C1 sites that have fewer than 26 replicates are padded with dashes (indicating missing data) to bring their number of replicates to 26. One could choose to use fewer replicates so that there would be less missing data. Although confidence intervals on the estimates narrow with added data, the estimates themselves do not change unless the distributions of presences and absences (1’s and 0’s in the input matrix) change.

Data are blocked as C1 through C5, as these are analyzed separately. The columns for facies are coded as 1 or 0 as appropriate for each site. For example, site 13 (an offshore site) would have a 1 for offshore and a 0 for both shallow subtidal and deep subtidal facies.

Entering Data into Presence

Open PRESENCE. Start a new project (under File). This demonstration is based on the data from C4.
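Before pasting into PRESENCE, it can help to picture the input layout described under “Input format”: one row per site, one column per replicate, and dashes for missing replicates. The following is a minimal Python sketch of that padding step. The site names and records are made up for illustration, and `pad_histories` is a hypothetical helper, not part of PRESENCE or of the Excel file.

```python
def pad_histories(histories, n_occasions, missing="-"):
    """Pad each site's 0/1 records with a missing-data marker.

    histories: dict mapping site name -> list of 0/1 records (hypothetical data).
    n_occasions: number of columns every row should have after padding.
    """
    padded = {}
    for site, records in histories.items():
        if len(records) > n_occasions:
            raise ValueError(f"{site} has more than {n_occasions} replicates")
        padded[site] = list(records) + [missing] * (n_occasions - len(records))
    return padded


# Made-up detection records for three hypothetical sites.
histories = {
    "site_A": [1, 1, 0, 1],
    "site_B": [0, 0, 0],
    "site_C": [0, 1, 0, 0, 0, 0, 0],
}

# The site with the most replicates sets the number of occasions.
n_occasions = max(len(records) for records in histories.values())
padded = pad_histories(histories, n_occasions)

for site, row in padded.items():
    print(site, " ".join(str(value) for value in row))
```

Choosing a smaller `n_occasions` would instead require trimming the longest rows, which is the trade-off mentioned above between missing data and the number of replicates used.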
Select the data cells for C4 from the sheet labeled “Single season input format” (highlighted in yellow), including the locality names (but not the facies information, which will be entered separately). Click on “Input Data Form”, then go to Edit, then Paste with site names. Check that there are 21 rows, since C4 has 21 sites. The “number of occasions” in PRESENCE is the same as our number of replicates; set this to 10, which is how many replicate samples per site are available in C4. The “number of site covariates” in PRESENCE is equal to the number of facies. Set this to 3. A new spreadsheet will appear. Select the cells labeled “S,D,O” in C4 (also highlighted in yellow) and then click Edit, then Paste with covariate names (and with site names if you wish). Save this data file with a unique name and say “no” to using the last column as frequency. Enter a title for this dataset, such as “Hebertella single timebin C4”. The data filename and results filename should be created automatically. Check that they are in the directory you desire. Click OK. A folder will be created to hold the results of the analyses.

Analyses and Results

Four simple models will be demonstrated for Hebertella in C4. Click on “Run”, and select “Analysis: Single-season”. Click on “Custom” under “Model” and a design matrix will appear. To understand more about design matrices, refer to Cooch and White (2006) and Liow and Nichols (2010). We will now run the simplest model, in which a single preservation probability is estimated and no covariates are considered. No change to the current design matrix is required. Leave the model name as psi(.),p(.) but click on the Options box labeled “assess model fit”. Click “OK” to run and enter 1000 for the number of bootstraps for assessing model fit. The DOS window that opens automatically when PRESENCE starts should indicate that PRESENCE is running.
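For orientation, the quantity being maximized for psi(.),p(.) can be sketched in a few lines of Python. This is the standard single-season occupancy likelihood with constant psi and p (see MacKenzie et al. 2006), written here as an illustration and not taken from PRESENCE itself. The detection histories are made up, `None` stands for a missing replicate, and the coarse grid search is only a stand-in for a proper optimizer.

```python
import math


def site_likelihood(history, psi, p):
    """Likelihood of one site's detection history under psi(.),p(.).

    history: list of 1 (detected), 0 (not detected) or None (missing).
    """
    observed = [h for h in history if h is not None]
    detections = sum(observed)
    surveys = len(observed)
    # Site occupied, with this particular pattern of detections and misses.
    occupied = psi * p ** detections * (1 - p) ** (surveys - detections)
    if detections > 0:
        return occupied
    # All-zero history: the site may also be truly unoccupied.
    return occupied + (1 - psi)


def log_likelihood(histories, psi, p):
    """Sum of log-likelihoods over sites (sites assumed independent)."""
    return sum(math.log(site_likelihood(h, psi, p)) for h in histories)


# Made-up histories for four hypothetical sites.
histories = [[1, 1, 0, 1], [0, 0, 0, None], [0, 1, 0, 0], [0, 0, 0, 0]]

# Coarse grid search over (psi, p); an illustration, not PRESENCE's optimizer.
grid = [i / 100 for i in range(1, 100)]
best = max((log_likelihood(histories, psi, p), psi, p) for psi in grid for p in grid)
print("log-likelihood %.3f at psi=%.2f, p=%.2f" % best)
```

The second branch of `site_likelihood` is what separates occupancy modeling from a naive tally: a row of zeros contributes both "occupied but missed" and "unoccupied" to the likelihood.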
When the run is done, say “yes” to appending results, and the model we just ran, with its AIC and other associated values, should show up on the results spreadsheet in PRESENCE. The details of the model output can be viewed by right clicking on the model. For a full discussion of the output of the model, see the user manual of PRESENCE. Here, we focus on only a few lines of the output. The “Naive occupancy estimate” is 0.3810, and this is the proportion of sites where Hebertella is recorded as present without any modeling effort (see main text Fig. 2 for Hebertella at C4). Scrolling down, we see that the “Individual Site estimates of <Psi>” has a value of 0.3870 with standard error 0.1077 and a 95% confidence interval of 0.2059 - 0.6059. Note that for this model and these data, the naïve estimate is close to the non-naïve estimate. However, we now also have a detection probability estimate (0.7072) with standard error 0.0794 and a 95% confidence interval of 0.5326 - 0.8366. It is instructive to examine the “DERIVED parameter - Psi-conditional” while checking back on our input data. At the locality Lincoln County Line, for instance, Hebertella is found in all 4 available replicates. Hence the model output tells us that psi-conditional is 1 (i.e., the probability that this locality is occupied by Hebertella in C4 given its detection history, given by the presence-absence data). Now look at the localities Bypass 4, with 3 replicates of zeros, and Stonelick Spillway, with 7 replicates of zeros. The conditional psi for Bypass 4 is 0.0156 while that for Stonelick Spillway is 0.0001. Although we might have naively thought that neither of these localities contains Hebertella in C4, given our raw data with so many zeros or absences, what this model tells us is that there is some small probability that Hebertella is present but we just didn’t sample it.
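The psi-conditional values quoted above can be reproduced from the reported estimates of psi (0.3870) and p (0.7072). The sketch below assumes the standard conditional-occupancy expression for a site with no detections in K surveys, psi(1-p)^K / [psi(1-p)^K + (1-psi)]. It is an illustration that agrees with the reported values, not PRESENCE's own code, and the function name is hypothetical.

```python
def psi_conditional(psi, p, n_surveys, n_detections):
    """Probability that a site is occupied, given its detection history.

    Assumes constant psi and p, as in the psi(.),p(.) model.
    """
    if n_detections > 0:
        # Any detection means the site is certainly occupied.
        return 1.0
    missed_every_time = (1 - p) ** n_surveys
    return psi * missed_every_time / (psi * missed_every_time + (1 - psi))


psi_hat, p_hat = 0.3870, 0.7072  # estimates reported for psi(.),p(.) in C4

bypass_4 = psi_conditional(psi_hat, p_hat, n_surveys=3, n_detections=0)
stonelick = psi_conditional(psi_hat, p_hat, n_surveys=7, n_detections=0)
lincoln = psi_conditional(psi_hat, p_hat, n_surveys=4, n_detections=4)

print(f"Bypass 4:            {bypass_4:.4f}")   # 0.0156
print(f"Stonelick Spillway:  {stonelick:.4f}")  # 0.0001
print(f"Lincoln County Line: {lincoln:.4f}")    # 1.0000
```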
The more absences recorded, the more certain we are that it is unlikely to be present (compare the Bypass 4 and Stonelick Spillway values). Scrolling past the detailed results of the goodness of fit test, we see that “Estimate of c-hat = 0.4004.” If the reported “c-hat” value is greater than 1, there is likely more variation in the data than expected, given the model, while a value that is smaller than 1 indicates that there is less variation (MacKenzie and Bailey 2004). A value closer to 1 is desirable, and the c-hat value from the most saturated model in the set can be used to correct the AIC ranking (Cooch and White 2006).

We will now change the design matrix to run 3 other models using the same dataset. Click on “Custom” under “Model” again and change the model name to psi(F),p(F), where “F” indicates that we will now include facies as a site-specific covariate for estimating occupancy and detection. In the design matrix, add 2 columns to the psi spreadsheet by right clicking, then select “init” and put D, S, and O in each of the cells. Go to the detection spreadsheet and repeat this for each column. This specifies that there should be separate estimates of psi (occupancy) and p (detection) for different facies. Run this model. The model output should show that there are 6 parameters (3 each for psi and p). You will also see that this model has a higher AIC value, and hence a lower AIC weight, relative to the simpler model we ran before. This does not mean that facies are unimportant in predicting the occupancy of Hebertella, only that the simpler model is more parsimonious, given the sample size and the fewer parameters of the simple model. If we wanted to estimate how much facies contribute to the occupancy of Hebertella, we would not consider models without facies as a covariate (e.g. the first model we ran here). An easy first reference to read about constructing models is Burnham and Anderson (2002).
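The relationship between AIC values, AIC weights, and a c-hat correction can be sketched as follows. The formulas are the usual ones from Burnham and Anderson (2002): weights proportional to exp(-ΔAIC/2), and QAIC = -2 log L / c-hat + 2K. The AIC and likelihood numbers below are hypothetical placeholders, not the values PRESENCE reports for these data, and the function names are illustrative.

```python
import math


def akaike_weights(aic_values):
    """Convert a dict of model name -> AIC into Akaike weights."""
    best = min(aic_values.values())
    relative = {m: math.exp(-0.5 * (aic - best)) for m, aic in aic_values.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


def qaic(neg2_log_likelihood, n_params, c_hat):
    """Quasi-likelihood AIC: the deviance term is scaled by c-hat.

    Some authors also add 1 to n_params to account for estimating c-hat.
    """
    return neg2_log_likelihood / c_hat + 2 * n_params


# Hypothetical AIC values for two of the models discussed above.
aic_values = {"psi(.),p(.)": 100.0, "psi(F),p(F)": 104.0}
weights = akaike_weights(aic_values)
for model, weight in weights.items():
    print(f"{model}: weight = {weight:.3f}")

# Hypothetical -2 log-likelihood, 2 parameters, c-hat of 2.0.
print("QAIC =", qaic(100.0, 2, 2.0))
```

A difference of 4 AIC units already shifts most of the weight to the lower-AIC model, which is why a model with more parameters needs a clearly better fit to rank first.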
Now let’s look at the model output for this model. Reading the values from “untransformed estimates of coefficients for covariates (Beta's)” will give the values for the first two columns in the following table. These “beta” estimates (see text) can be transformed to “true estimates” (i.e. probability values) via $\psi = e^{\beta} / (1 + e^{\beta})$ or $p = e^{\beta} / (1 + e^{\beta})$. So for instance, Psi(D) = 0.023405 for the untransformed estimate, and the transformed estimate is obtained by $e^{0.023405} / (1 + e^{0.023405}) = 0.5059$. These transformed values can also be read off the output at “Individual Site estimates”, but one would have to check the covariate for the corresponding site to extract the values. Why are the standard errors for Psi(O) and p(O) so large? Let’s check the input data. It turns out that no offshore facies has been sampled in C4, so that information is lacking. Hence, even though we have included all of the facies in the model, it is not possible to extract any reasonable estimate for offshore facies. Our results tell us that, based on this model and these data, Hebertella has a higher occupancy probability in deep subtidal facies although there is a higher sampling probability in shallow subtidal facies. This contrast between sampling and true occupancy estimates is one of the strengths of occupancy modeling as described in our main text. Scrolling down to the c-hat value, we see that it is 0.2756, indicating that there is less variation in the data than specified in this particular model.

Summary table showing results for psi(F)p(F) in C4; green values are referred to in the next section.

            Untransformed   Standard error of untransformed          Transformed
            estimate        estimate (green value in parentheses)    estimate
Psi(D)       0.023406       0.584502 (0.337462)                      0.5059
Psi(S)      -1.249642       0.802366 (0.463246)                      0.2228
Psi(O)       0.000000       31622776601.683792                       NA
p(D)         2.068462       1.067849 (0.616523)                      0.6476
p(S)         0.608449       0.422802 (0.244105)                      0.8878
p(O)         0.000000       31622776601.683792                       NA

Running the other two models presented in our main text should now be very straightforward. To run Psi(.)p(F), simply click run model, then retrieve Psi(F)p(F) and change the Psi design matrix such that there is only 1 column with a value of 1 (e.g. manually or by clicking “constant” on “init”). Similarly, for Psi(F)p(.), retrieve Psi(F)p(F) and change the p design matrix. The estimates for Psi and p are very similar to those for Psi(F)p(F). The AIC ranking indicates that the simplest model is the most parsimonious, but all the models we built have some credibility.

The data are also formatted for C1-C3 and C5 as used in the main text. The models can be run as described above for C4. For an interesting exercise, run C1, where most of the localities are assigned offshore facies and where there is only one locality with a deep subtidal facies.

The Effect of Increased Sampling

In C4, there are only 21 sites (with varying numbers of replicates). If we increase the number of sites by a factor of three (see sheet labeled tripled-data C4), we could increase our confidence in our estimates quite a bit. Assuming that the data would be identical if we collected more of them, we re-ran the single-season analysis for C4 using triple the amount of data. Tripling the amount of data shrinks the standard errors of the estimates (see table above, values in green, given in parentheses) and leaves the estimates unchanged. For the replicates, the occupancy probability is assumed to be the same although the preservation/sampling is a chance event.
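The size of this shrinkage can be checked directly against the table. Assuming the parenthesized (green) values are the standard errors from the tripled dataset, each is very close to the original standard error divided by the square root of 3, which is what one expects when the same data pattern is repeated three times. The short Python check below uses only numbers from the table; the variable names are illustrative.

```python
import math

# Standard errors of the untransformed estimates, copied from the table above.
se_original = {"Psi(D)": 0.584502, "Psi(S)": 0.802366, "p(D)": 1.067849, "p(S)": 0.422802}
# Parenthesized (green) values, assumed to come from the tripled dataset.
se_tripled = {"Psi(D)": 0.337462, "Psi(S)": 0.463246, "p(D)": 0.616523, "p(S)": 0.244105}

predicted = {}
for name, se in se_original.items():
    # Standard errors scale roughly with 1 / sqrt(amount of data).
    predicted[name] = se / math.sqrt(3)
    print(f"{name}: predicted {predicted[name]:.6f}, tabulated {se_tripled[name]:.6f}")
```

By the same scaling, halving a standard error would take roughly four times as many sites, which is useful to keep in mind when planning field collection.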
The area or volume of the replicates taken need not be large (say a 20 cm x 20 cm outcrop surface area, which takes only a couple of seconds to scan for the brachiopod of interest). Although the number of replicates per site needs to be as large as in this dataset, the number of sites (which are spatial replicates) should ideally be larger. It is in fact very useful to run some simulated datasets before field collection in order to determine the rough number of replicates and sites needed for the types of questions one wants to answer.

Multiple season model (= Multiple time interval model)

Input format

A multiple time interval model allows one to relax the assumption that there is no migration or colonization during the period of study by explicitly introducing an extra term in the model. For instance, if we were interested in estimating the migration rate of a certain taxon out of the spatial area of interest, we could collect data from a time interval before the suspected migration and from the following time interval(s), and model the changing occupancy and migration of this taxon through these time intervals. Compare this with the single time interval model, where we could only estimate separate, unlinked occupancy probabilities for the separate time intervals. For an ideal “multi-season” dataset in a paleontological context, one would collect samples at the same location representing different time intervals (e.g. up a section). This Cincinnatian brachiopod dataset was not collected specifically for occupancy modeling and hence does not conform to this format.

For our dataset, we assume that samples can be considered replicates as long as they were collected from the same facies and depositional sequence. In other words, information on location is removed (relative to the data used for the single time interval models).
So if two samples are from C1 and represent offshore facies, we consider them “the same thing” even though they might be from totally different places. Because the location information is removed and because replication is needed for each “site”, samples with the same facies and sequences are randomly assigned across the five sequences. The exact input data used for the analyses shown in the main text are found in the spreadsheet labeled “Multi-season input format.” The data are trimmed such that each “time interval” or “season” has 10 replicates (“occasions” in occupancy modeling terminology). Analyses using other runs of randomly assigned replicates did not change the qualitative conclusions in the main paper (detailed results not shown). This is because our model is relatively simple, and we do not model any differences among the samples or among “occasions”. If we did (given that we had more data for modeling more complex models), the results would most likely change with this random assignment. The number of occasions, and hence of missing values, does affect the confidence intervals of the estimates although not the estimates themselves (see above). These observations should serve as a reminder that the ideal dataset should contain replicates up core or section for multiple time interval analyses and that missing values are ideally avoided. The data as presented have 33 sites and 10 replicates (occasions) in each time interval.

Entering into Presence

Open PRESENCE. Start a new project (under File). Click on “Input Data Form”, then select all the green cells in the Excel spreadsheet labeled “multi-season input format” and paste. There should be 33 rows of data. Change the cell “No. of occ/season” to 10 and the “no. site covar” cell to 3. A new site covariates sheet should show up.
Paste the yellow cells from the Excel spreadsheet into this new sheet in PRESENCE using the Edit pull-down menu. Save the file and say no to using the last column as frequency. Click “ok” on the main menu, and say “OK” to creating a new folder to hold results from analyses of this dataset.

Analyses and Results

We will now illustrate 2 of the extreme member models for Hebertella, out of the 8 that were run for the multi-season analyses reported in the main text, namely

1) psi(t),gamma(t),p(t)
2) psi(F,t),gamma(F,t),p(F,t)

Click on Run and select Analysis: Multi-season. For model parameterization, click on “seasonal occupancy and colonization, detection” in order to replicate the results in the main text. The various ways of model parameterization are theoretically equivalent (MacKenzie et al. 2006) although different parameters are estimated. For model (1), set the occupancy and colonization sheets with “init” to “full identity”. For the detection sheet, click “seasonal effects”. This ensures that there will be a single estimate for each of the parameters psi, gamma and p for each time interval (or season). Click OK to run. The output is similar to that of the single season model except that now there is an extra parameter, gamma. Reading the values from “untransformed estimates of coefficients for covariates (Beta's)” will give the values for the first two columns in the following table. As before, these beta estimates can be transformed to “true estimates” (i.e. probability values) via $e^{\beta} / (1 + e^{\beta})$, since we chose the logit link function. So for instance, Psi(C1) = -0.606586 for the untransformed estimate, and the transformed estimate is obtained by $e^{-0.606586} / (1 + e^{-0.606586}) = 0.3528$.
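The back-transformation is easy to check by hand or in a few lines of Python. The sketch below implements the inverse logit as 1 / (1 + e^(-β)), which is algebraically the same as e^β / (1 + e^β), and applies it to two untransformed estimates quoted in this supplement. The function name is illustrative.

```python
import math


def inv_logit(beta):
    """Inverse of the logit link: maps a beta estimate to a probability.

    1 / (1 + exp(-beta)) equals exp(beta) / (1 + exp(beta)).
    Adequate for estimates of ordinary size; math.exp overflows for
    extremely negative beta (below about -709).
    """
    return 1.0 / (1.0 + math.exp(-beta))


psi_c1 = inv_logit(-0.606586)  # Psi(C1), multi-season model (1)
psi_d = inv_logit(0.023406)    # Psi(D), single-season psi(F)p(F) table

print(f"Psi(C1) = {psi_c1:.4f}")  # 0.3528
print(f"Psi(D)  = {psi_d:.4f}")   # 0.5059
```

A beta of 0 maps to a probability of exactly 0.5, so the sign of an untransformed estimate tells at a glance whether the corresponding probability is above or below one half.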
These transformed values can also be read off the output at “Individual Site estimates” or further down at “real parameters.” Note that the various “surveys” have the same output values because we have not modeled any differences among the sampling occasions. In fact, we cannot, because we cannot differentiate these replicates as ecologists can. It pays to check why Psi(C1) and Psi(C2) have zeros for standard errors (as with gamma(C1-2)). In this case, it is largely because Hebertella is not sampled in C1 (see data matrix in spreadsheet); hence both the occupancy in C1 and that in the next time-step, C2, which is affected by the colonization parameter, are not well constrained. This effect percolates through to C2 such that gamma(C1-2) has a zero standard error and that for gamma(C2-3) is rather large, as with the standard error for p(C1).

Summary table showing results for psi(t),gamma(t),p(t)

               Untransformed   Standard error              Transformed
               estimate        (untransformed estimate)    estimate
Psi(C1)        -0.606586       0.000000                    0.3528
Psi(C2)         0.660279       0.000000                    0.6593
Psi(C3)         0.332957       0.769356                    0.5825
Psi(C4)        -0.244741       0.653325                    0.4391
Psi(C5)         0.523016       0.441302                    0.6279
gamma(C1-2)    -0.105766       0.000000                    0.4736
gamma(C2-3)     0.015198       1.910512                    0.5038
gamma(C3-4)    -0.021979       1.408863                    0.4945
gamma(C4-5)     0.475942       0.801564                    0.6168
p(C1)          -0.180332       1.116662                    0.4550
p(C2)          -0.020765       0.277056                    0.4948
p(C3)           0.189343       0.306836                    0.5472
p(C4)           0.435745       0.369953                    0.6072
p(C5)           0.961263       0.191215                    0.7234

For model (2), click on Run and select Analysis: Multi-season and “seasonal occupancy and colonization, detection” again. Set the occupancy and colonization sheets with “init” to “full identity” and “seasonal effects” for the detection sheet, as for model (1). But in each case, add 3 columns and specify “init” as D, S, and O respectively in each column.
Run the model; there should be 23 parameters, and this model should have a greater AIC weight than model (1). Now we have estimates of psi, gamma, and p specific to each time interval and facies. Scrolling down and comparing the “real estimates” with those from model (1), one sees that values are higher in general (because differences in facies are taken into account in different time intervals). Standard errors are larger because there is now actually less data for a more complex model where colonization is involved.

References cited

Burnham, K. P., and D. R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York.

Cooch, E., and G. White. 2006. Program Mark: A Gentle Introduction (http://www.phidot.org/software/mark/docs/book/).

Holland, S. M., and M. E. Patzkowsky. 2007. Gradient ecology of a biotic invasion: biofacies of the type Cincinnatian series (Upper Ordovician), Cincinnati, Ohio region, USA. Palaios 22:392-407.

Liow, L. H., and J. D. Nichols. 2010. Estimating rates and probabilities of origination and extinction using taxonomic occurrence data: Capture-recapture approaches. Pp. 81-94. In G. Hunt, and J. Alroy, eds. Short Courses in Paleontology: Quantitative Paleobiology. Paleontological Society.

MacKenzie, D., and L. Bailey. 2004. Assessing the fit of site-occupancy models. Journal of Agricultural, Biological, and Environmental Statistics 9(3):300-318.

MacKenzie, D., J. Nichols, A. Royle, K. Pollock, L. Bailey, and J. Hines. 2006. Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence. Academic Press.