Liow_Supplementary_Text.

Liow Supplement 1
Supplementary Material for Liow (Paleobiology)
PRESENCE
PRESENCE is a Windows-based program specifically written for occupancy modeling. The program,
worked examples, and a user manual can be downloaded at
http://www.mbr-pwrc.usgs.gov/software/presence.html. The program MARK can also be used to run
occupancy models, but this supplementary material is based solely on PRESENCE.
The following text guides a novice user through the occupancy modeling analyses presented for
Hebertella occidentalis in the main text. Questions regarding PRESENCE other than those directly
related to the analyses presented here should be directed to www.phidot.org, the discussion forum for
PRESENCE and other related methods and software.
Data
Data files for Hebertella occidentalis (hereafter Hebertella for short) are included in the Excel file,
Hebertella_data.xls. These data are also downloadable from the Paleobiology Database, as are those for
the other taxa analyzed in the main text.
There are a total of 698 samples, where a sample is a replicate from a specific location and sequence
(see sheet labeled “Samples” in Hebertella_data.xlsx), that is, a collection made from one bed at a
particular locality, equivalent to a collection in the Paleobiology Database. Each location typically
contains multiple samples (e.g. sample numbers 2-4), and this replication allows occupancy modeling.
Each site has associated facies information that is inferred independently of its faunal composition
(Holland and Patzkowsky 2007). Although the area of the outcrop sampled was measured for some
samples, this information is incomplete, and hence this potentially important covariate of occupancy
is not considered here. Only sequences C1-C5 are used for analyses because of the small sample size
from C6.
Single season model (= Single time interval model)
Input format
A site is defined here as a unique combination of a geographic location, depositional sequence (e.g.,
C1-C5), and facies (either offshore, deep subtidal, or shallow subtidal). With this definition, one
geographic location that exposes a thick succession of strata with multiple depositional sequences and
facies would therefore be composed of multiple sites. Each site must have at least two samples to be
used in the analyses. For instance, sample 1 is the only sample from the C2 deep subtidal at AA Cold
Spring, so that site cannot be used, whereas samples 6-17 are considered replicates because they are
all from the C1 offshore at Cold Spring.
The localities are arranged as rows (sheet labeled “Single season input format”). Of these, site 13 (C1
offshore at Maysville 62-68) has the greatest number of replicate samples of any site in the C1
sequence (26). Therefore, the number of samples for a C1 analysis is set at 26. C1 sites that have fewer
than 26 replicates are padded with dashes (indicating missing data) to bring their number of replicates
to 26. One could choose to use fewer replicates so that there would be less missing data. Although
confidence intervals on the estimates narrow with added data, the estimates themselves do not
change unless the distributions of presences and absences (1’s and 0’s in the input matrix) change.
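The padding step described above can be sketched in a few lines of Python. This is only an illustration: the site names and detection histories below are made up, and `pad_histories` is a hypothetical helper, not part of PRESENCE or of the Excel file.

```python
# Hypothetical detection histories: 1 = detected, 0 = not detected.
# Site names and values are invented for illustration only.
histories = {
    "site_A": [1, 0, 1],
    "site_B": [0, 0],
    "site_C": [1, 1, 0, 1, 0],
}


def pad_histories(histories, missing="-"):
    """Pad every site's history with `missing` up to the longest history."""
    n_occasions = max(len(h) for h in histories.values())
    return {
        site: [str(v) for v in h] + [missing] * (n_occasions - len(h))
        for site, h in histories.items()
    }


padded = pad_histories(histories)
for site, row in padded.items():
    print(site, " ".join(row))
```

Each printed row then has the same number of columns, with dashes marking the occasions for which a site has no sample.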
Data are blocked as C1 through C5, as these are analyzed separately. The columns for facies
are coded as 1 or 0 as appropriate for each site. For example, site 13 (an offshore site) would have a 1
for offshore and a 0 for both shallow subtidal and deep subtidal facies.
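As a small sketch of this 1/0 coding, the hypothetical helper below turns a single facies code into the three indicator columns. The column order S, D, O follows the covariate labels used in the spreadsheet; the function name and the dictionary layout are assumptions made for illustration.

```python
# Facies codes in the order of the covariate columns:
# S = shallow subtidal, D = deep subtidal, O = offshore.
FACIES = ["S", "D", "O"]


def facies_indicators(facies_code):
    """Return a 1/0 indicator for each facies column."""
    if facies_code not in FACIES:
        raise ValueError(f"unknown facies code: {facies_code!r}")
    return {f: int(f == facies_code) for f in FACIES}


# An offshore site such as site 13 gets a 1 for O and 0 elsewhere.
site_13 = facies_indicators("O")
print(site_13)
```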
Entering Data into PRESENCE
Open PRESENCE. Start a new project (under File). This demonstration is based on the data from C4.
Select the data cells for C4 from the sheet labeled “Single season input format” (highlighted in yellow),
including the locality names (but not the facies information, which will be entered separately). Click
on “Input Data Form”, then go to Edit and choose Paste with site names. Check that there are 21 rows,
since C4 has 21 sites. The “number of occasions” in PRESENCE is the same as our number of replicates;
set this to 10, which is the maximum number of replicate samples per site in C4. The “number of site
covariates” in PRESENCE is equal to the number of facies; set this to 3. A new spreadsheet will
appear. Select the cells labeled “S,D,O” in C4 (also highlighted in yellow), then click Edit and Paste
with covariate names (and with site names if you wish). Save this data file with a unique name and say
“no” to using the last column as frequency. Enter a title for this dataset, such as “Hebertella single
timebin C4”. The data file name and results file name should be created automatically. Check that they
are in the directory you desire. Click OK. A folder will be created to hold the results of the analyses.
Analyses and Results
Four simple models will be demonstrated for Hebertella in C4. Click on “Run”, and select “Analysis:
Single-season”. Click on “Custom” under “Model” and a design matrix will appear. To understand
more about design matrices, refer to Cooch and White (2006) and Liow and Nichols (2010). We will now
run the simplest model, in which a single occupancy probability and a single detection (preservation)
probability are estimated and no covariates are considered. No change to the current design matrix is
required. Leave the model name as psi(.),p(.) but click on the Options box labeled “assess model fit”.
Click “OK” to run and enter 1000 for the number of bootstraps for assessing model fit. The DOS window
that opens automatically when PRESENCE starts should indicate that PRESENCE is running. When the
run is done, say “yes” to appending results, and the model we just ran, with its AIC and other
associated values, should show up on the results spreadsheet in PRESENCE. The details of the model
output can be viewed by right-clicking on the model. For a full discussion of the model output, see the
PRESENCE user manual. Here, we discuss only a few lines of the output. The “Naive occupancy
estimate” is 0.3810; this is the proportion of sites where Hebertella is recorded as present without
any modeling effort (see main text Fig. 2 for Hebertella at C4). Scrolling down, we see that the
“Individual Site estimates of <Psi>” has a value of 0.3870 with standard error 0.1077 and a 95%
confidence interval of 0.2059 - 0.6059. Note that for this model and these data, the naïve estimate is
close to the modeled estimate. However, we now also have a detection probability estimate (0.7072)
with standard error 0.0794 and a 95% confidence interval of 0.5326 - 0.8366. It is instructive to examine
the “DERIVED parameter - Psi-conditional” while checking back on our input data. At the locality
Lincoln County Line, for instance, Hebertella is found in all 4 available replicates. Hence the
model output tells us that psi-conditional is 1 (i.e., the probability that this locality is occupied by
Hebertella in C4, given its detection history in the presence-absence data, is 1). Now look at the
localities Bypass 4, with 3 replicates of zeros, and Stonelick Spillway, with 7 replicates of zeros. The
conditional psi for Bypass 4 is 0.0156 while that for Stonelick Spillway is 0.0001. Although we might
have naively thought that neither of these localities contains Hebertella in C4, given our raw data with
so many zeros or absences, what this model tells us is that there is some small probability that
Hebertella is present but we just didn’t sample it. The more absences recorded, the more certain we
are that it is truly absent (compare the Bypass 4 and Stonelick Spillway values). Scrolling past the
detailed results of the goodness of fit test, we see that “Estimate of c-hat = 0.4004.” If the reported
“c-hat” value is greater than 1, there is likely more variation in the data than expected, given the model,
while a value smaller than 1 indicates that there is less variation (MacKenzie and Bailey 2004).
A value close to 1 is desirable, and the c-hat value from the most saturated model in the set can be used
to correct the AIC ranking (Cooch and White 2006).
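To see where the psi-conditional values come from, the following Python sketch applies the standard constant-p formula for a site with no detections, psi(1-p)^K / [psi(1-p)^K + (1-psi)], to the estimates quoted above (psi = 0.3870, p = 0.7072). The function names are hypothetical and this is not PRESENCE code; the naive estimate of 0.3810 is also consistent with 8 occupied sites out of 21, which is an inference from the reported value rather than something stated in the output.

```python
def naive_occupancy(histories):
    """Proportion of sites with at least one detection (1) in their history."""
    detected = sum(1 for h in histories if any(v == 1 for v in h))
    return detected / len(histories)


def psi_conditional(psi, p, history):
    """Probability that a site is occupied given its detection history.

    Assumes constant occupancy (psi) and detection (p); entries that are
    None are treated as missing occasions and ignored.
    """
    observed = [v for v in history if v is not None]
    if any(v == 1 for v in observed):
        return 1.0  # a detection means the site is occupied
    miss = (1.0 - p) ** len(observed)  # chance of missing it every time
    return psi * miss / (psi * miss + (1.0 - psi))


PSI_HAT, P_HAT = 0.3870, 0.7072  # estimates quoted in the text for C4

bypass_4 = psi_conditional(PSI_HAT, P_HAT, [0, 0, 0])
stonelick = psi_conditional(PSI_HAT, P_HAT, [0] * 7)
lincoln = psi_conditional(PSI_HAT, P_HAT, [1, 1, 1, 1])
print(round(bypass_4, 4), round(stonelick, 4), lincoln)
```

With these inputs the three values should agree, to rounding, with the 0.0156, 0.0001, and 1 reported for Bypass 4, Stonelick Spillway, and Lincoln County Line.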
We will now change the design matrix to run 3 other models using the same dataset. Click on “Custom”
under “Model” again and change the model name to psi(F),p(F), where “F” indicates that we will now
include facies as a site-specific covariate for estimating occupancy and detection. In the design matrix,
add 2 columns to the psi spreadsheet by right clicking, then select “init” and put D, S, and O in the
respective cells. Go to the detection spreadsheet and repeat this for each column. This specifies that
there should be separate estimates of psi (occupancy) and p (detection) for different facies. Run this
model. The model output should show that there are 6 parameters (3 each for psi and p). You will also
see that this model has a higher AIC value, and hence a lower AIC weight, than the simpler model we
ran before. This does not mean that facies are unimportant in predicting the occupancy of Hebertella,
only that the simpler model is more parsimonious, given the sample size and the fewer parameters of
the simple model. If we wanted to estimate how much facies contribute to the occupancy of Hebertella,
we would not consider models without facies as a covariate (e.g. the first model we ran here). An easy
first reference to read about constructing models is Burnham and Anderson (2002).
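The link between AIC values and AIC weights can be illustrated with the usual Akaike-weight formula, w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), where Δ_i is the difference between a model's AIC and the smallest AIC in the set. The AIC values in this sketch are invented purely for illustration and are not the values PRESENCE reports for these data.

```python
import math


def akaike_weights(aic_values):
    """Convert a {model name: AIC} mapping into Akaike weights."""
    best = min(aic_values.values())
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic_values.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


# Hypothetical AIC values, NOT taken from the PRESENCE output.
example_aic = {"psi(.),p(.)": 100.0, "psi(F),p(F)": 104.0}
weights = akaike_weights(example_aic)
for model, w in weights.items():
    print(model, round(w, 3))
```

The model with the lower AIC always receives the larger weight, and the weights across the model set sum to 1.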
Now let’s look at the output for this model. Reading the values from “untransformed estimates
of coefficients for covariates (Beta's)” will give the values for the first two columns in the following
table. These “beta” estimates (see text) can be transformed to “true estimates” (i.e. probability values)
via $\psi = e^{\beta} / (1 + e^{\beta})$ or $p = e^{\beta} / (1 + e^{\beta})$. So for instance, Psi(D) = 0.023405 for the
untransformed estimate and the transformed estimate is obtained by $e^{0.023405} / (1 + e^{0.023405})$.
These transformed values can also be read off the output at “Individual Site estimates” but one would
have to check the covariate for the corresponding site to extract the values. Why are the standard
errors for Psi(O) and p(O) so large? Let’s check the input data. It turns out that no offshore facies
has been sampled in C4, so that information is lacking. Hence, even though we have included all of the
facies in the model, it is not possible to extract any reasonable estimate for offshore facies. Our results
tell us that based on this model and these data, Hebertella has a higher occupancy probability in deep
subtidal facies although there is a higher sampling probability in shallow subtidal facies. This contrast
between sampling and true occupancy estimates is one of the strengths of occupancy modeling as
described in our main text. Scrolling down to the c-hat value, we see that it is 0.2756, indicating that
there is less variation in the data than expected under this particular model.

Summary table showing results for psi(F)p(F) in C4. Values in parentheses (green in the original) are
referred to in the next section.

Parameter   Untransformed estimate   Standard error (untransformed estimate)   Transformed estimate
Psi(D)       0.023406                0.584502 (0.337462)                       0.5059
Psi(S)      -1.249642                0.802366 (0.463246)                       0.2228
Psi(O)       0.000000                31622776601.683792                        NA
p(D)         2.068462                1.067849 (0.616523)                       0.6476
p(S)         0.608449                0.422802 (0.244105)                       0.8878
p(O)         0.000000                31622776601.683792                        NA
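The logit back-transformation used for the table can be checked with a few lines of Python. This is a minimal sketch: `inverse_logit` is simply the formula e^beta / (1 + e^beta) written in an equivalent form, and the two beta values are copied from the Psi(D) and Psi(S) rows of the table.

```python
import math


def inverse_logit(beta):
    """Transform an untransformed (logit-scale) estimate to a probability."""
    return 1.0 / (1.0 + math.exp(-beta))


# Untransformed estimates copied from the summary table for psi(F)p(F) in C4.
betas = {"Psi(D)": 0.023406, "Psi(S)": -1.249642}
transformed = {name: inverse_logit(beta) for name, beta in betas.items()}
for name, value in transformed.items():
    print(name, round(value, 4))
```

The results should match the transformed estimates of 0.5059 and 0.2228 listed for the two occupancy parameters.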
Running the other two models presented in our main text should now be straightforward. To run
Psi(.)p(F), simply click run model, retrieve Psi(F)p(F), and change the Psi design matrix so
that there is only 1 column with a value of 1 (e.g. manually or by clicking “constant” on “init”).
Similarly, for Psi(F)p(.), retrieve Psi(F)p(F) and change the p design matrix. The estimates for Psi and
p are very similar to those for Psi(F)p(F). The AIC ranking indicates that the simplest model is the
most parsimonious, but all the models we built have some support.
The data are also formatted for C1-C3 and C5 as used in the main text. The models can be run
as described above for C4. As an interesting exercise, run C1, where most of the localities are assigned
offshore facies and where there is only one locality with a deep subtidal facies.
The Effect of Increased Sampling
In C4, there are only 21 sites (with varying numbers of replicates). If we increase the number of sites
by a factor of three (see sheet labeled tripled-data C4), we could increase our confidence in our
estimates quite a bit. Assuming that the data would be identical if we collected more of them, we re-ran
the single-season analysis for C4 using triple the amount of data. Tripling the amount of data shrinks
the standard errors of the estimates (see table above, values in parentheses, shown in green in the
original) and leaves the estimates unchanged. For the replicates, the occupancy probability is assumed
to be the same although preservation/sampling is a chance event. The area or volume of the replicates
taken need not be large (say a 20 cm x 20 cm outcrop surface area, which takes only a couple of seconds
to scan for the brachiopod of interest). Although the number of replicates per site needs to be as large
as in this dataset, the number of sites (which are spatial replicates) should ideally be larger. It is in
fact very useful to run some simulated datasets before field collection in order to determine the rough
number of replicates and sites needed for the types of questions one wants to answer.
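Both points in this section can be explored with a short Python sketch. The first function reflects the fact that replicating identical data k times divides the standard errors by the square root of k, which reproduces the parenthesized value for Psi(D) in the table above. The second is a hypothetical simulator of detection histories for planning purposes; the chosen psi, p, site count, and seed are assumptions for illustration, and fitting an occupancy model to the simulated data would still be done in PRESENCE or similar software.

```python
import math
import random


def se_after_replication(se, factor):
    """Standard error after identical data are replicated `factor` times."""
    return se / math.sqrt(factor)


def simulate_histories(n_sites, n_reps, psi, p, seed=0):
    """Simulate 1/0 detection histories under constant psi and p."""
    rng = random.Random(seed)
    histories = []
    for _ in range(n_sites):
        occupied = rng.random() < psi  # true (unobserved) occupancy state
        histories.append(
            [int(occupied and rng.random() < p) for _ in range(n_reps)]
        )
    return histories


# Psi(D) standard error from the table, with the data tripled.
tripled = se_after_replication(0.584502, 3)
print(round(tripled, 6))

# A made-up planning scenario: 21 sites, 10 replicates each.
sim = simulate_histories(n_sites=21, n_reps=10, psi=0.39, p=0.71)
naive = sum(any(h) for h in sim) / len(sim)
print(naive)
```

Repeating the simulation for different numbers of sites and replicates gives a rough feel for how often an occupied site would be missed entirely under a given sampling design.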
Multiple season model (= Multiple time interval model)
Input format
A multiple time interval model allows one to relax the assumption that there is no migration or
colonization during the period of study by explicitly introducing an extra term in the model. For
instance, if we were interested in estimating the migration rate of a certain taxon out of the spatial area
of interest, we could collect data from a time interval before the suspected migration and from the
following time interval(s), and model the changing occupancy and migration of this taxon through these
time intervals. Compare this with the single time interval model, where we could only estimate separate,
unlinked occupancy probabilities for the separate time intervals. For an ideal “multi-season” dataset in
a paleontological context, one would collect samples at the same location representing different time
intervals (e.g. up a section). This Cincinnatian brachiopod dataset was not collected specifically for
occupancy modeling and hence does not conform to this format.
For our dataset, we assume that samples can be considered replicates as long as they were collected
from the same facies and depositional sequence. In other words, information on location is removed
(relative to the data used for the single time interval models). So if two samples are from C1 and
represent offshore facies, we consider them “the same thing” even though they might be from totally
different places. Because the location information is removed and because replication is needed for
each “site”, samples with the same facies and sequences are randomly assigned across the five
sequences. The exact input data used for the analyses shown in the main text are found in the
spreadsheet labeled “Multi-season input format.” The data are trimmed such that each “time interval”
or “season” has 10 replicates (“occasions” in occupancy modeling terminology). Analyses using other
runs of randomly assigned replicates did not change the qualitative conclusions in the main paper
(detailed results not shown). This is because our model is relatively simple, and we do not model any
differences among the samples or among “occasions”. If we did (given more data for fitting more
complex models), the results would most likely change with this random assignment. The number of
occasions, and hence of missing values, does affect the confidence intervals of the estimates although
not the estimates themselves (see above). These observations should serve as a reminder that the ideal
dataset should contain replicates up core or section for multiple time interval analyses and that missing
values are ideally avoided. The data as presented have 33 sites and 10 replicates (occasions) in each
time interval.
Entering Data into PRESENCE
Open PRESENCE. Start a new project (under File). Click on “Input Data Form”, then select all the
green cells in the Excel spreadsheet labeled “multi-season input format” and paste. There should be 33
rows of data. Change the cell “No. of occ/season” to 10 and the “no. site covar” cell to 3. A new site
covariates sheet should show up. Paste the yellow cells from the Excel spreadsheet into this new sheet
in PRESENCE using the Edit pull-down menu. Save the file and say “no” to using the last column as
frequency. Click “OK” on the main menu, and say “OK” to creating a new folder to hold results from
analyses of this dataset.
Analyses and Results
We will now illustrate 2 of the extreme member models for Hebertella, out of the 8 that were run for
the multi-season analyses reported in the main text, namely
1) psi(t),gamma(t),p(t)
2) psi(F,t),gamma(F,t),p(F,t)
Click on Run and select Analysis: Multi-season. For model parameterization, click on “seasonal
occupancy and colonization, detection” in order to replicate the results in the main text. The various
ways of model parameterization are theoretically equivalent (MacKenzie et al. 2006) although
different parameters are estimated. For model (1), set the occupancy and colonization sheets with “init”
to “full identity”. For the detection sheet, click “seasonal effects”. This ensures that there will be a
single estimate for each of the parameters psi, gamma and p for each time interval (or season). Click
OK to run. The output is similar to that of the single season model except that now there is an extra
parameter, gamma. Reading the values from “untransformed estimates of coefficients for covariates
(Beta's)” will give the values for the first two columns in the following table. As before, these beta
estimates can be transformed to “true estimates” (i.e. probability values) via
$e^{\beta} / (1 + e^{\beta})$ since we chose the logit link function. So for instance, Psi(C1) = -0.606586
for the untransformed estimate and the transformed estimate is obtained by
$e^{-0.606586} / (1 + e^{-0.606586}) = 0.3528$. These transformed values can also be read
off the output at “Individual Site estimates” or further down at “real parameters.” Note that the various
“surveys” have the same output values because we have not modeled any differences among the
sampling occasions. In fact, we cannot, because we cannot differentiate these replicates as ecologists
can. It pays to check why Psi(C1) and Psi(C2) have zeros for standard errors (as does gamma(C1-2)).
In this case, it is largely because Hebertella is not sampled in C1 (see data matrix in spreadsheet);
hence both the occupancy in C1 and that in the next time step C2, which is affected by the colonization
parameter, are not well constrained. This effect percolates through to C2 such that gamma(C1-2) has
a zero standard error and that for gamma(C2-3) is rather large, as is the standard error for p(C1).
Summary table showing results for psi(t),gamma(t),p(t)

Parameter     Untransformed estimate   Standard error (untransformed estimate)   Transformed estimate
Psi(C1)       -0.606586                0.000000                                  0.3528
Psi(C2)        0.660279                0.000000                                  0.6593
Psi(C3)        0.332957                0.769356                                  0.5825
Psi(C4)       -0.244741                0.653325                                  0.4391
Psi(C5)        0.523016                0.441302                                  0.6279
gamma(C1-2)   -0.105766                0.000000                                  0.4736
gamma(C2-3)    0.015198                1.910512                                  0.5038
gamma(C3-4)   -0.021979                1.408863                                  0.4945
gamma(C4-5)    0.475942                0.801564                                  0.6168
p(C1)         -0.180332                1.116662                                  0.4550
p(C2)         -0.020765                0.277056                                  0.4948
p(C3)          0.189343                0.306836                                  0.5472
p(C4)          0.435745                0.369953                                  0.6072
p(C5)          0.961263                0.191215                                  0.7234
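As a quick consistency check on this table, the sketch below back-transforms a few of the untransformed estimates with the logit formula and compares them with the reported transformed values. The three rows are copied from the table; `inverse_logit` and the dictionary layout are illustrative choices rather than anything produced by PRESENCE.

```python
import math


def inverse_logit(beta):
    """Transform a logit-scale estimate to a probability."""
    return 1.0 / (1.0 + math.exp(-beta))


# (untransformed estimate, reported transformed estimate) from the table.
rows = {
    "Psi(C1)": (-0.606586, 0.3528),
    "gamma(C1-2)": (-0.105766, 0.4736),
    "p(C5)": (0.961263, 0.7234),
}

checked = {}
for name, (beta, reported) in rows.items():
    checked[name] = inverse_logit(beta)
    print(name, round(checked[name], 4), reported)
```

The same check can be run on any other row of the table by adding it to `rows`.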
For model (2), click on Run and select Analysis: Multi-season and “seasonal occupancy and
colonization, detection” again. Set the occupancy and colonization sheets with “init” to “full identity”
and the detection sheet to “seasonal effects”, as for model (1). But in each case, add 3 columns and
specify “init” as D, S, and O respectively in each column. Run the model; there should be 23
parameters, and this model should have a greater AIC weight than model (1). Now we have estimates
of psi, gamma, and p specific to each time interval and facies. Scrolling down and comparing the
“real estimates” with those from model (1), one sees that values are higher in general (because
differences in facies are taken into account in different time intervals). Standard errors are larger
because there are now effectively fewer data for a more complex model in which colonization is involved.
References cited
Burnham, K. P., and D. R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical
Information-Theoretic Approach. Springer, New York.
Cooch, E., and G. White. 2006. Program MARK: A Gentle Introduction
(http://www.phidot.org/software/mark/docs/book/).
Holland, S. M., and M. E. Patzkowsky. 2007. Gradient ecology of a biotic invasion: biofacies of the
type Cincinnatian series (Upper Ordovician), Cincinnati, Ohio region, USA. Palaios 22:392-407.
Liow, L. H., and J. D. Nichols. 2010. Estimating rates and probabilities of origination and extinction
using taxonomic occurrence data: Capture-recapture approaches. Pp. 81-94 in G. Hunt and J.
Alroy, eds. Short Courses in Paleontology: Quantitative Paleobiology. Paleontological Society.
MacKenzie, D., and L. Bailey. 2004. Assessing the fit of site-occupancy models. Journal of
Agricultural, Biological, and Environmental Statistics 9(3):300-318.
MacKenzie, D., J. Nichols, A. Royle, K. Pollock, L. Bailey, and J. Hines. 2006. Occupancy Estimation
and Modeling: Inferring Patterns and Dynamics of Species Occurrence. Academic Press.