SUPPORTING INFORMATION
Using virtual species to study species distributions and model performance
Christine N. Meynard and David M. Kaplan
Journal of Biogeography
Appendix S1 Theoretical expected values for different measures of performance based
on predictions of presence–absence including sample bias.
Some preliminary definitions
The methodology and resulting equations will be presented both in continuous form for
probabilities of occupancy defined on a continuous landscape, and discrete form for those
defined for a finite set of observation locations. The discrete form is presented as a particular
case of the continuous form. The former is more appropriate and attractive for theoretical
work, whereas the latter is best for real data on gridded or discrete landscapes.
In what follows, we distinguish between species prevalence (i.e. the real proportion of sites
occupied by the species in the real or virtual landscape) and sample prevalence (i.e. the
proportion of sites occupied by the species in the sample used to model the probability of
occurrence). We also distinguish between theoretical quantities (e.g. ‘theoretical species
prevalence’, i.e. the prevalence of a species over a landscape that would be obtained by
averaging over multiple probabilistic realizations of the species on that landscape) and non-theoretical quantities (e.g. plain ‘species prevalence’, indicating the prevalence of a species on
a landscape in a single probabilistic realization of presence–absence for that species). The
word ‘theoretical’ may be omitted wherever there is no risk of confusion.
The landscape is presumed to be described by a set of $n$ environmental variables,
$\vec{x} = (x_1, x_2, \ldots, x_n)$. These environmental variables will typically include
temperature, rainfall, altitude, etc. The probability density that a given site in the landscape
is described by a particular set of environmental variables is denoted $\rho(\vec{x})$. This
probability density is normalized so that:

$$\iint \rho(\vec{x})\, d\vec{x} = 1,$$

where $\iint \ldots\, d\vec{x}$ represents integration over the $n$ environmental variables
(though there may be three or more environmental variables, only a double integral, $\iint$, is
shown for notational simplicity). The probability of occupancy of a site described by a given
set of environmental variables is denoted by $P(\vec{x})$. Using this notation, the theoretical
species prevalence is given by the mean value of this probability of occupancy over the entire
landscape:

$$\bar{P} = \iint P(\vec{x})\, \rho(\vec{x})\, d\vec{x}.$$

In discrete form, $\vec{x}_i$ are the environmental conditions at site $i$, $\rho_i = 1/N$ is the
fraction of the total number of sites, $N$, that site $i$ represents, and $P_i = P(\vec{x}_i)$ is
the probability of occupancy of site $i$. The theoretical species prevalence is then given by:

$$\bar{P} = \sum_{i=1}^{N} P_i\, \rho_i = \frac{1}{N} \sum_{i=1}^{N} P_i.$$
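As a minimal illustration of the discrete form, the sketch below computes the theoretical
species prevalence as the weighted mean of the per-site probabilities of occupancy. The
function name `theoretical_prevalence` and the four-site landscape are hypothetical and are
not taken from the original analysis; sites are assumed to carry equal weight, 1/N, unless
weights are supplied.

```python
def theoretical_prevalence(p_occ, weights=None):
    """Weighted mean of per-site occupancy probabilities.

    p_occ   -- real probability of occupancy at each site
    weights -- fraction of the landscape each site represents
               (assumed equal, 1/N, when omitted)
    """
    n = len(p_occ)
    if weights is None:
        weights = [1.0 / n] * n
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, p_occ))


# Hypothetical four-site landscape (illustrative values only).
p_sites = [0.9, 0.5, 0.1, 0.0]
prevalence = theoretical_prevalence(p_sites)
```

With equal weights this reduces to the plain average of the site probabilities.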
The confusion matrix
Suppose that we have modelled the probability of occupancy of a species over the landscape.
We denote this modelled probability of occupancy as $\hat{P}(\vec{x})$ so as to distinguish it
from the real probability of occupancy, $P(\vec{x})$.
The confusion matrix is a way of summarizing the correspondence between presence–absence
predictions derived from the real probability of occupancy and presence–absence predictions
derived from the modelled probability of occupancy. It is a table that contrasts the number of
observed presences and absences with the number of predicted presences and absences
(Fielding & Bell, 2007):
                        Observed values
Predicted values        Presence        Absence
Presence                A               B
Absence                 C               D
Notice that these statistics require the generation of presence–absence predictions, as the
probabilities of occurrence (both real and modelled) need to be converted to occurrences
before the confusion matrix can be calculated. Typically, the predicted presences and
absences are derived from the modelled probability of occupancy by assuming that all sites
whose modelled probability of occupancy exceeds a certain threshold, $t$, are occupied. The
‘best’ threshold value is determined by maximizing some statistic derived from the confusion
matrix. In contrast, the presence–absence pattern for the virtual species (i.e. the ‘real’ pattern
of occupancy) needs to be generated in a way that reflects the probability of occurrence.
That is, a site with a probability of 0.5 should be occupied, on average, 5 out of 10 times. For
this, the probability of occurrence at each location is compared with a random number drawn
uniformly between 0 and 1. If the random number is less than the probability of occurrence at
that location, the site is considered occupied. This procedure is repeated for all sites.
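The random-number procedure for generating a realization of the virtual species can be
sketched as follows. This is an illustrative sketch only: the function name
`realize_presence_absence`, the seed and the three example probabilities are assumptions, and
Python's standard `random` module stands in for whatever generator was actually used.

```python
import random


def realize_presence_absence(p_occ, rng):
    """Draw one presence-absence realization (1 = present, 0 = absent).

    A site is occupied when a uniform random number in [0, 1) falls
    below its probability of occurrence.
    """
    return [1 if rng.random() < p else 0 for p in p_occ]


rng = random.Random(42)        # fixed seed, for reproducibility only
p_sites = [0.0, 0.5, 1.0]      # hypothetical probabilities of occurrence

n_realizations = 20000
totals = [0] * len(p_sites)
for _ in range(n_realizations):
    for i, occupied in enumerate(realize_presence_absence(p_sites, rng)):
        totals[i] += occupied

# Mean occupancy over many realizations approaches the probabilities.
mean_occupancy = [t / n_realizations for t in totals]
```

Averaging over many realizations recovers the probability of occurrence at each site, which
is the property the theoretical confusion matrix relies on.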
Using presences and absences is necessary for real observations taken at discrete locations,
but when the real probability of occupancy is known, one can calculate a theoretical confusion
matrix that is what one would obtain in the limit of a large, unbiased set of observations.
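Suppose that we treat as occupied all sites with a modelled probability of occupancy greater
than $t$. Then the modelled distribution of presences and absences will be as follows.
For a given set of environmental conditions, $\vec{x}$, for which the modelled probability of
occupancy, $\hat{P}(\vec{x})$, is greater than $t$, the model will on average correctly predict
presence a fraction $P(\vec{x})$ of the time (i.e. if a large number of realizations of the
virtual species are generated from the real probability of occupancy, the mean occupancy of a
site will approach the real probability of occupancy). Repeating this logic, one can find all the
elements of the confusion matrix. In continuous form, this gives:

$$
\begin{aligned}
A &= \iint_{\hat{P}(\vec{x}) > t} P(\vec{x})\, \rho(\vec{x})\, d\vec{x}, &
B &= \iint_{\hat{P}(\vec{x}) > t} \left[1 - P(\vec{x})\right] \rho(\vec{x})\, d\vec{x}, \\
C &= \iint_{\hat{P}(\vec{x}) \le t} P(\vec{x})\, \rho(\vec{x})\, d\vec{x}, &
D &= \iint_{\hat{P}(\vec{x}) \le t} \left[1 - P(\vec{x})\right] \rho(\vec{x})\, d\vec{x},
\end{aligned}
\tag{1}
$$

where $\rho(\vec{x})$ is the probability density of environmental conditions defined above, and
it is understood that the integrals are performed only over the part of environmental
parameter space that satisfies the condition indicated below the integral sign. In discrete
form, with the subscript $i$ denoting evaluation at site $i$, this becomes:

$$
\begin{aligned}
A &= \sum_{i:\, \hat{P}_i > t} P_i\, \rho_i, &
B &= \sum_{i:\, \hat{P}_i > t} (1 - P_i)\, \rho_i, \\
C &= \sum_{i:\, \hat{P}_i \le t} P_i\, \rho_i, &
D &= \sum_{i:\, \hat{P}_i \le t} (1 - P_i)\, \rho_i.
\end{aligned}
\tag{2}
$$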
This gives the theoretical confusion matrix for a given threshold, real probability of
occupancy and modelled probability of occupancy. In practice, this theoretical confusion
matrix can be calculated numerically for an arbitrary probability of occupancy and then
appropriate statistics and optimal thresholds are derived from this theoretical confusion
matrix.
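A numerical calculation of the discrete theoretical confusion matrix might look like the
sketch below. The function name `theoretical_confusion`, the equal site weights and the
example probabilities are assumptions made for illustration; sensitivity and specificity are
computed with their usual definitions, A/(A + C) and D/(B + D).

```python
def theoretical_confusion(p_real, p_model, threshold):
    """Expected confusion matrix (A, B, C, D) for equally weighted sites.

    A: predicted present, really present
    B: predicted present, really absent
    C: predicted absent, really present
    D: predicted absent, really absent
    """
    if len(p_real) != len(p_model):
        raise ValueError("p_real and p_model must have the same length")
    w = 1.0 / len(p_real)
    a = b = c = d = 0.0
    for p, p_hat in zip(p_real, p_model):
        if p_hat > threshold:      # site predicted as occupied
            a += w * p
            b += w * (1.0 - p)
        else:                      # site predicted as unoccupied
            c += w * p
            d += w * (1.0 - p)
    return a, b, c, d


# Hypothetical real and modelled probabilities at four sites.
p_real = [0.9, 0.6, 0.3, 0.1]
p_model = [0.8, 0.4, 0.5, 0.2]
A, B, C, D = theoretical_confusion(p_real, p_model, threshold=0.45)

sensitivity = A / (A + C)
specificity = D / (B + D)
```

The four entries sum to 1, and A + C equals the theoretical species prevalence, which
provides a simple consistency check on any implementation.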
A special case of this methodology is given when the modelled probability of occupancy is
taken to be the same as the true probability of occupancy, $\hat{P}(\vec{x}) = P(\vec{x})$. This
can be done to
examine the behaviour of the confusion matrix and derived statistics in the hypothetical case
of an ‘optimal’ model that perfectly predicts the probability of occupancy (Meynard &
Kaplan, 2012). When the real probability of occupancy is not a threshold function of
environmental variables, this will produce non-trivial results. Statistics derived from the
confusion matrix, such as sensitivity, will be below their maximal theoretical value, e.g. the
expected value of sensitivity will be <1 even for a model that recovers perfectly the true
probability of occurrence.
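The following sketch illustrates this point for a single hypothetical environmental gradient.
The logistic parameterization in `response`, the gradient from -2 to 2 and the threshold of
0.5 are assumptions chosen for illustration, not the configuration used in the manuscript.
The model is set equal to the real probability of occupancy, and the expected sensitivity is
compared between a step (threshold) response and a gradual one.

```python
import math


def expected_sensitivity(p_real, threshold):
    """Expected sensitivity A / (A + C) when the model equals the truth."""
    a = sum(p for p in p_real if p > threshold)     # predicted present
    c = sum(p for p in p_real if p <= threshold)    # predicted absent
    return a / (a + c)


def response(x, alpha):
    """Hypothetical response: logistic in x, or a step when alpha == 0."""
    if alpha == 0:
        return 1.0 if x > 0 else 0.0
    return 1.0 / (1.0 + math.exp(-x / alpha))


# Hypothetical environmental gradient, evenly spaced from -2 to 2.
xs = [-2.0 + 4.0 * i / 200 for i in range(201)]

sens_threshold = expected_sensitivity([response(x, 0.0) for x in xs], 0.5)
sens_gradual = expected_sensitivity([response(x, 0.5) for x in xs], 0.5)
```

Under these assumptions the step response yields an expected sensitivity of exactly 1,
whereas the gradual response yields a value below 1 even though the model is perfect.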
Effects of sample bias
Often, one wishes to examine the effects of sample bias on the behaviour of species
distribution models and SDM performance statistics derived from the confusion matrix. It is
not possible to provide a general theoretical response to the effects of sample bias because
sample bias can come about via a number of different mechanisms that affect presence and
absence observations in numerous different ways. Simulations are generally necessary in
these cases. However, when a particular known mechanism for generating sample bias is
used, analytic approaches may be applicable.
Here we will consider one of the simplest possible mechanisms for generating sample bias:
over- or undercounting presences and absences. In these cases, sample prevalence will be
different from the species prevalence in the landscape. We assume that our sample is biased in
the number of presences and absences with respect to the true relative numbers, but that the
presences are randomly selected from the set of presences and the absences are randomly
selected from the set of absences (so that the distribution of each is not biased with
respect to their true distributions). This may come about if, for example, presence is
consistently under-detected in a way that is directly proportional to the probability of
occupancy. Even in cases where sample bias is more complicated than this, as is often the
case, this simple mechanism may provide a useful baseline for evaluating the consequences of
sample bias.
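Given this assumption, the probability density distribution of presences in a biased sample is:

$$\rho_p(\vec{x}) = \frac{s}{\bar{P}}\, P(\vec{x})\, \rho(\vec{x}), \tag{3}$$

where $\bar{P}$ is the theoretical species prevalence and $s$ is sample prevalence, i.e. the
fraction of presences in the sample. Similarly, the distribution of absences in the sample is
given by:

$$\rho_a(\vec{x}) = \frac{1 - s}{1 - \bar{P}} \left[1 - P(\vec{x})\right] \rho(\vec{x}). \tag{4}$$

These probability densities can be used to determine the optimal model fit to a biased sample
in the limit of large sample size, as well as the confusion matrix derived from such a biased
sample and the model fit to that sample. The log likelihood of fit of a model,
$\hat{P}(\vec{x}; \vec{\beta})$, where $\vec{\beta}$ is a vector of model parameters, to such a
biased sample is given by:

$$\ln L(\vec{\beta}) = \iint \left\{ \rho_p(\vec{x}) \ln \hat{P}(\vec{x}; \vec{\beta})
+ \rho_a(\vec{x}) \ln \left[1 - \hat{P}(\vec{x}; \vec{\beta})\right] \right\} d\vec{x}.$$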
This function can be numerically integrated to calculate the log likelihood for this model and
numerical optimization can be used to determine optimal parameter values. This will give the
model that best represents a sample that is very large (infinite), but consistently biased in the
sample prevalence with respect to the species prevalence. Note that this log likelihood is
quite similar to the log likelihood that is maximized to determine the model parameters that
best fit a given distribution of presences and absences, except that in this case one passes
directly from the real probability of occupancy to the optimal model without having to
generate a presence–absence distribution and a biased sample of that presence–absence
distribution. This technique is best used in comparison with simulations to separate the
effects of sample bias from the effects of sample size.
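To make the effect of biased sample prevalence on the optimal model concrete, the sketch
below evaluates a discrete, equally weighted analogue of this log likelihood. All names and
example values are hypothetical. Instead of a parametric model such as a GLM, it uses the
site-by-site maximizer of a*log(q) + b*log(1 - q), namely q = a/(a + b); a parametric model
would in general only approximate this unconstrained optimum.

```python
import math


def bias_weights(p_real, sample_prev):
    """Rescaling factors for presences and absences in the biased sample."""
    prevalence = sum(p_real) / len(p_real)
    return sample_prev / prevalence, (1.0 - sample_prev) / (1.0 - prevalence)


def biased_log_likelihood(p_real, p_model, sample_prev):
    """Large-sample log likelihood per observation, equally weighted sites."""
    w_pres, w_abs = bias_weights(p_real, sample_prev)
    total = 0.0
    for p, q in zip(p_real, p_model):
        total += (w_pres * p * math.log(q)
                  + w_abs * (1.0 - p) * math.log(1.0 - q))
    return total / len(p_real)


def pointwise_optimum(p_real, sample_prev):
    """Site-by-site maximizer q = a / (a + b) of a*log(q) + b*log(1 - q)."""
    w_pres, w_abs = bias_weights(p_real, sample_prev)
    return [w_pres * p / (w_pres * p + w_abs * (1.0 - p)) for p in p_real]


p_real = [0.8, 0.4, 0.2, 0.05]   # hypothetical real probabilities (mean 0.3625)
s = 0.5                          # sample prevalence higher than species prevalence
p_opt = pointwise_optimum(p_real, s)
ll_true = biased_log_likelihood(p_real, p_real, s)
ll_opt = biased_log_likelihood(p_real, p_opt, s)
```

When presences are over-represented, the optimum lies above the real probability of
occupancy at every site, and it coincides with the real probability when sample prevalence
equals species prevalence.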
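The confusion matrix for this biased sample and a model fit to this sample are calculated as in
Eq. (1), except that the distribution of presences and absences must be replaced by the
distributions in Eqs. (3) and (4), denoted here $\rho_p(\vec{x})$ and $\rho_a(\vec{x})$:

$$
\begin{aligned}
A_s &= \iint_{\hat{P}(\vec{x}) > t} \rho_p(\vec{x})\, d\vec{x}, &
B_s &= \iint_{\hat{P}(\vec{x}) > t} \rho_a(\vec{x})\, d\vec{x}, \\
C_s &= \iint_{\hat{P}(\vec{x}) \le t} \rho_p(\vec{x})\, d\vec{x}, &
D_s &= \iint_{\hat{P}(\vec{x}) \le t} \rho_a(\vec{x})\, d\vec{x}.
\end{aligned}
$$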
This confusion matrix can then be used to calculate optimal thresholds, which can then be
applied to the original unbiased probability of occupancy as described in Eq. (1) to examine
the effect of sample bias on model performance.
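The log likelihood for a discrete landscape is given by:

$$\ln L(\vec{\beta}) = \sum_{i=1}^{N} \rho_i \left\{ \frac{s}{\bar{P}}\, P_i \ln \hat{P}_i(\vec{\beta})
+ \frac{1 - s}{1 - \bar{P}}\, (1 - P_i) \ln \left[1 - \hat{P}_i(\vec{\beta})\right] \right\}, \tag{5}$$

where $s$ is the sample prevalence and $\bar{P}$ is the theoretical species prevalence.
Similarly, the confusion matrix is given by:

$$
\begin{aligned}
A_s &= \frac{s}{\bar{P}} \sum_{i:\, \hat{P}_i > t} P_i\, \rho_i, &
B_s &= \frac{1 - s}{1 - \bar{P}} \sum_{i:\, \hat{P}_i > t} (1 - P_i)\, \rho_i, \\
C_s &= \frac{s}{\bar{P}} \sum_{i:\, \hat{P}_i \le t} P_i\, \rho_i, &
D_s &= \frac{1 - s}{1 - \bar{P}} \sum_{i:\, \hat{P}_i \le t} (1 - P_i)\, \rho_i.
\end{aligned}
\tag{6}
$$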
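The discrete confusion matrix for a biased sample, and its use in choosing a threshold, can
be sketched as follows. The function names, the example probabilities and the candidate
thresholds are hypothetical. Cohen's kappa is used as the statistic to maximize purely as an
assumed example; any other statistic derived from the confusion matrix could be substituted.

```python
def biased_confusion(p_real, p_model, threshold, sample_prev):
    """Expected confusion matrix for a sample with prevalence sample_prev.

    Sites are equally weighted; presences are rescaled by
    sample_prev / prevalence and absences by
    (1 - sample_prev) / (1 - prevalence).
    """
    n = len(p_real)
    prevalence = sum(p_real) / n
    w_pres = sample_prev / (prevalence * n)
    w_abs = (1.0 - sample_prev) / ((1.0 - prevalence) * n)
    a = b = c = d = 0.0
    for p, p_hat in zip(p_real, p_model):
        if p_hat > threshold:      # site predicted as occupied
            a += w_pres * p
            b += w_abs * (1.0 - p)
        else:                      # site predicted as unoccupied
            c += w_pres * p
            d += w_abs * (1.0 - p)
    return a, b, c, d


def kappa(a, b, c, d):
    """Cohen's kappa for a confusion matrix whose entries sum to 1."""
    observed = a + d
    expected = (a + b) * (a + c) + (c + d) * (b + d)
    return (observed - expected) / (1.0 - expected)


def best_threshold(p_real, p_model, sample_prev, candidates):
    """Candidate threshold with the largest kappa (an assumed criterion)."""
    def score(t):
        return kappa(*biased_confusion(p_real, p_model, t, sample_prev))
    return max(candidates, key=score)


p_real = [0.9, 0.6, 0.3, 0.1]    # hypothetical real probabilities
p_model = p_real                 # model assumed equal to the truth
candidates = [0.2, 0.5, 0.8]     # hypothetical candidate thresholds
A, B, C, D = biased_confusion(p_real, p_model, 0.5, sample_prev=0.3)
t_best = best_threshold(p_real, p_model, 0.3, candidates)
```

By construction A + C equals the sample prevalence and B + D equals its complement, so the
marginal totals of the matrix reflect the biased sample rather than the landscape.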
Case study
The equations above can be used to calculate the expected effects of this simple sample bias
mechanism on optimal thresholds and statistics derived from the confusion matrix. For the
concrete example used in the manuscript, the true pattern of species occupancy was created
from four environmental variables varying over the European landscape as described in Jiménez-Valverde & Lobo (2007). Variables were centred to zero mean and normalized to have a variance of 1.
True species occupancy was a double logit in these four normalized environmental variables
with inflection points of the logistic curves occurring at ± a single fixed fraction of 1 (the
standard deviation of all variables). In what follows we use the terminology as in Meynard &
Kaplan (2012), where α is the inverse of the slope of a logistic curve at the inflection point.
Several different slopes for the logistic curves were used (for each configuration, a single
unique slope was used for both logistic responses on all four variables), ranging from a
threshold environmental response (i.e. α = 0) to a very gradual response varying over the
same scale as the environmental data (i.e. α = 0.5). In practice, the inflection point of the
logistic responses was varied so that the theoretical species prevalence would be comparable
between configurations with different values for the logistic slope, α. Species prevalence was
the same as in Jiménez-Valverde & Lobo (2007), i.e. 0.17.
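The adjustment of the inflection point to reach a target prevalence can be sketched as below.
The exact functional form of the double logit is defined in Meynard & Kaplan (2012) and is not
reproduced in this appendix, so the form used here (for each variable, a rising logistic at -c
multiplied by a falling logistic at +c, with the product taken over variables) is an assumed
reading. The synthetic standard-normal landscape, the bisection search and all function
names are likewise illustrative assumptions.

```python
import math
import random


def logistic(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def double_logit(x, c, alpha):
    """Rising logistic at -c times falling logistic at +c (assumed form)."""
    if alpha == 0:                         # threshold response
        return 1.0 if -c < x < c else 0.0
    return logistic((x + c) / alpha) * logistic((c - x) / alpha)


def occupancy(site, c, alpha):
    """Product of double-logit responses over all variables of a site."""
    prob = 1.0
    for x in site:
        prob *= double_logit(x, c, alpha)
    return prob


def calibrate_inflection(sites, alpha, target, lo=0.0, hi=10.0, iters=60):
    """Bisect on c so that mean occupancy matches the target prevalence."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        prevalence = sum(occupancy(s, mid, alpha) for s in sites) / len(sites)
        if prevalence < target:
            lo = mid                       # window too narrow: widen it
        else:
            hi = mid                       # window too wide: shrink it
    return 0.5 * (lo + hi)


rng = random.Random(0)
# Synthetic stand-in landscape: four standard-normal variables per site.
sites = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
c_fit = calibrate_inflection(sites, alpha=0.5, target=0.17)
prevalence_fit = sum(occupancy(s, c_fit, 0.5) for s in sites) / len(sites)
```

Bisection is adequate here because, for α > 0, the mean occupancy increases continuously
with c; the threshold case α = 0 is a step function of c and would require a tolerance-based
search instead.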
These four normalized environmental variables were Box–Cox transformed and renormalized
to zero mean and a variance of 1. Principal component analysis (PCA) was then performed on the transformed
variables. The first two principal components representing a total of 88% of the variance in
the original Box–Cox transformed variables were kept and used to model probability of
occupancy. Models were based on a GLM including the two PCA variables plus all terms and
interactions out to third order. Unlike Jiménez-Valverde & Lobo (2007), stepwise regression
was not used because this is not readily available for the alternative logitreg GLM fitting
scheme (Venables & Ripley, 2002) that we used in addition to the standard GLM function in
R.
In addition to modelling the effects of sample prevalence on SDM performance using multiple
realizations of the virtual species presence–absence distribution and the subsampling strategy
described in Jiménez-Valverde & Lobo (2007), theoretical statistics were calculated using the
equations above. Red dash-dotted curves in Figure 2 are based on statistics derived from the
model maximizing the log likelihood given in Eq. (5) for a given level of sample prevalence.
Optimal thresholds were then derived from this optimal model via the confusion matrix in Eq.
(6). To test performance of such models, sensitivity and specificity were calculated from the
true, unbiased pattern of species occupancy using Eq. (2). Yellow solid curves in Figure 2 are
similar to the red dash-dotted curves, except that the step of obtaining the optimal model was
skipped and the true pattern of species occupancy, $P_i$, was used instead of the modelled
probability of occupancy, $\hat{P}_i$, in Eq. (6). Horizontal, dashed blue lines in the figures
represent theoretical statistics for a large unbiased sample and a model that perfectly
reproduces the true probability of occupancy. Vertical, dashed blue lines in Figure 2 indicate
the true species prevalence, at which one expects the biased (yellow) and unbiased (dashed
blue) results to match.
REFERENCES
Jiménez-Valverde, A. & Lobo, J.M. (2007) Threshold criteria for conversion of probability of
species presence to either–or presence–absence. Acta Oecologica – International
Journal of Ecology, 31, 361–369.
Meynard, C.N. & Kaplan, D.M. (2012) The effect of a gradual response to the environment
on species distribution modeling performance. Ecography, 35, 468–480.
Venables, W.N. & Ripley, B.D. (2002) Modern applied statistics with S, 4th edn.
Springer-Verlag, New York.