SUPPORTING INFORMATION

Using virtual species to study species distributions and model performance

Christine N. Meynard and David M. Kaplan

Journal of Biogeography

Appendix S1 Theoretical expected values for different measures of performance based on predictions of presence–absence including sample bias.

Some preliminary definitions

The methodology and resulting equations will be presented both in continuous form, for probabilities of occupancy defined on a continuous landscape, and in discrete form, for those defined for a finite set of observation locations. The discrete form is presented as a particular case of the continuous form. The continuous form is more appropriate and attractive for theoretical work, whereas the discrete form is best for real data on gridded or discrete landscapes.

In what follows, we distinguish between species prevalence (i.e. the real proportion of sites occupied by the species in the real or virtual landscape) and sample prevalence (i.e. the proportion of sites occupied by the species in the sample used to model the probability of occurrence). We also distinguish between theoretical quantities (e.g. ‘theoretical species prevalence’, i.e. the prevalence of a species over a landscape that would be obtained by averaging over multiple probabilistic realizations of the species on that landscape) and non-theoretical quantities (e.g. plain ‘species prevalence’, indicating the prevalence of a species on a landscape in a single probabilistic realization of presence–absence for that species). The word ‘theoretical’ may be omitted wherever there is no risk of confusion.

The landscape is presumed to be described by a set of $n$ environmental variables, $\mathbf{x} = (x_1, \ldots, x_n)$. These environmental variables will typically include temperature, rainfall, altitude, etc. The probability density that a given site in the landscape is described by a particular set of environmental variables is denoted $\rho(\mathbf{x})$.
This probability density, $\rho(\mathbf{x})$, is normalized so that:

$$\iint \rho(\mathbf{x})\,d\mathbf{x} = 1,$$

where $\iint d\mathbf{x}$ represents integration over the $n$ environmental variables, $\mathbf{x}$ (though there may be 3 or more environmental variables, only a double integral, $\iint$, is shown for notational simplicity). The probability of occupancy of a site described by a given set of environmental variables is denoted by $p(\mathbf{x})$. Using this notation, the theoretical species prevalence, $\bar{p}$, is given by the mean value of this probability of occupancy over the entire landscape:

$$\bar{p} = \iint p(\mathbf{x})\,\rho(\mathbf{x})\,d\mathbf{x}.$$

In discrete form, $\mathbf{x}_i$ are the environmental conditions at site $i$, $\rho_i$ is the fraction of the total number of sites, $N$, that site $i$ represents, and $p_i = p(\mathbf{x}_i)$ is the probability of occupancy of site $i$. The theoretical species prevalence is then given by:

$$\bar{p} = \sum_{i=1}^{N} p_i\,\rho_i.$$

The confusion matrix

Suppose that we have modelled the probability of occupancy of a species over the landscape. We denote this modelled probability of occupancy as $\hat{p}(\mathbf{x})$ so as to distinguish it from the real probability of occupancy, $p(\mathbf{x})$. The confusion matrix is a way of summarizing the correspondence between presence–absence predictions derived from the real probability of occupancy and presence–absence predictions derived from the modelled probability of occupancy. It is a table that contrasts the number of observed presences and absences with the number of predicted presences and absences (Fielding & Bell, 2007):

                      Observed values
Predicted values      Presence    Absence
Presence              A           B
Absence               C           D

Notice that the confusion matrix, and any statistic derived from it, requires the generation of presence–absence predictions, as the probabilities of occurrence (both real and modelled) need to be converted to occurrences before the confusion matrix can be calculated. Typically, the predicted presences and absences are derived from the modelled probability of occupancy by assuming that all sites whose modelled probability of occupancy is greater than a certain threshold, $t$, are occupied. The ‘best’ threshold value is determined by maximizing some statistic derived from the confusion matrix.
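The conversion from modelled probabilities to a confusion matrix, and the search for a ‘best’ threshold, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `confusion_counts` and `best_threshold` are hypothetical, the toy data are invented, and the statistic maximized here (sensitivity plus specificity) is only one assumed choice among the statistics that can be derived from the confusion matrix.

```python
def confusion_counts(observed, p_model, t):
    """Count the confusion-matrix cells A, B, C, D for threshold t.

    observed: 0/1 observed absences/presences, one per site
    p_model:  modelled probability of occupancy at the same sites
    A site is predicted present when its modelled probability exceeds t.
    """
    A = B = C = D = 0
    for obs, pm in zip(observed, p_model):
        predicted = pm > t
        if predicted and obs:
            A += 1      # predicted presence, observed presence
        elif predicted:
            B += 1      # predicted presence, observed absence
        elif obs:
            C += 1      # predicted absence, observed presence
        else:
            D += 1      # predicted absence, observed absence
    return A, B, C, D


def best_threshold(observed, p_model, thresholds):
    """Return the candidate threshold maximizing sensitivity + specificity
    (an assumed criterion, chosen only for illustration)."""
    def score(t):
        A, B, C, D = confusion_counts(observed, p_model, t)
        sensitivity = A / (A + C) if (A + C) else 0.0
        specificity = D / (B + D) if (B + D) else 0.0
        return sensitivity + specificity
    return max(thresholds, key=score)


# Invented toy data: six sites, three observed presences.
observed = [1, 1, 1, 0, 0, 0]
p_model = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]

counts_low = confusion_counts(observed, p_model, 0.25)
t_best = best_threshold(observed, p_model, [0.25, 0.5, 0.75])
```

With these toy data, a threshold of 0.25 predicts presence at five sites, two of which are observed absences, whereas a threshold of 0.5 separates presences from absences exactly and is therefore the one selected.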
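In contrast, the presence–absence pattern for the virtual species (i.e. the ‘real’ pattern of occupancy) needs to be generated in a way that reflects the probability of occurrence. That is, a site with a probability of 0.5 should be occupied on average 5 out of 10 times. For this, the probability of occurrence at each location is compared with a random number drawn uniformly between 0 and 1. If the random number is less than the probability of occurrence at a location, the site is considered occupied. This procedure is repeated for all sites.

Using presences and absences is necessary for real observations taken at discrete locations, but when the real probability of occupancy is known, one can calculate a theoretical confusion matrix, i.e. the one that would be obtained in the limit of a large, unbiased set of observations. Suppose that we treat as occupied all sites with a modelled probability of occupancy, $\hat{p}(\mathbf{x})$, greater than a threshold $t$. Then the model distribution of presences and absences will be as follows. For a given set of environmental conditions, $\mathbf{x}$, for which the modelled probability of occupancy, $\hat{p}(\mathbf{x})$, is greater than $t$, the model will on average correctly predict presence a fraction $p(\mathbf{x})$ of the time (i.e. if a large number of realizations of the virtual species are generated from the real probability of occupancy, the mean occupancy of a site will approach the real probability of occupancy), and will incorrectly predict presence the remaining fraction $1 - p(\mathbf{x})$ of the time. Repeating this logic, one can find all the elements of the confusion matrix. In continuous form, this gives:

$$
\begin{aligned}
A(t) &= \iint_{\hat{p}(\mathbf{x}) > t} p(\mathbf{x})\,\rho(\mathbf{x})\,d\mathbf{x}, &
B(t) &= \iint_{\hat{p}(\mathbf{x}) > t} \left[1 - p(\mathbf{x})\right]\rho(\mathbf{x})\,d\mathbf{x}, \\
C(t) &= \iint_{\hat{p}(\mathbf{x}) \le t} p(\mathbf{x})\,\rho(\mathbf{x})\,d\mathbf{x}, &
D(t) &= \iint_{\hat{p}(\mathbf{x}) \le t} \left[1 - p(\mathbf{x})\right]\rho(\mathbf{x})\,d\mathbf{x},
\end{aligned}
\tag{1}
$$

where $p(\mathbf{x})$ is the real probability of occupancy, $\rho(\mathbf{x})$ is the probability density of environmental conditions over the landscape, and it is understood that the integrals are performed only over the part of environmental parameter space that satisfies the condition indicated below the integral sign. In discrete form, with $p_i$, $\hat{p}_i$ and $\rho_i$ denoting the values of these quantities at site $i$, this becomes:

$$
\begin{aligned}
A(t) &= \sum_{i:\,\hat{p}_i > t} p_i\,\rho_i, &
B(t) &= \sum_{i:\,\hat{p}_i > t} (1 - p_i)\,\rho_i, \\
C(t) &= \sum_{i:\,\hat{p}_i \le t} p_i\,\rho_i, &
D(t) &= \sum_{i:\,\hat{p}_i \le t} (1 - p_i)\,\rho_i.
\end{aligned}
\tag{2}
$$

This gives the theoretical confusion matrix for a given threshold, real probability of occupancy and modelled probability of occupancy.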
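The two procedures described above, generating a realization of the virtual species and computing the discrete theoretical confusion matrix, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `realize_occupancy` and `theoretical_confusion` are hypothetical, and the four-site landscape is invented for the example.

```python
import random


def realize_occupancy(p, rng):
    """One probabilistic realization of presence-absence: a site is occupied
    when a uniform random number in [0, 1) is less than its probability
    of occupancy."""
    return [1 if rng.random() < p_i else 0 for p_i in p]


def theoretical_confusion(p, p_model, rho, t):
    """Discrete theoretical confusion matrix (A, B, C, D) for threshold t.

    p:       real probability of occupancy at each site
    p_model: modelled probability of occupancy at each site
    rho:     fraction of the landscape each site represents (sums to 1)
    """
    A = B = C = D = 0.0
    for p_i, pm_i, rho_i in zip(p, p_model, rho):
        if pm_i > t:                     # model predicts presence
            A += p_i * rho_i             # ... site occupied
            B += (1.0 - p_i) * rho_i     # ... site unoccupied
        else:                            # model predicts absence
            C += p_i * rho_i
            D += (1.0 - p_i) * rho_i
    return A, B, C, D


# Invented example: four equally weighted sites and a model that
# reproduces the real probability of occupancy exactly (p_model = p).
p = [0.9, 0.8, 0.2, 0.1]
rho = [0.25] * 4
A, B, C, D = theoretical_confusion(p, p, rho, t=0.5)
sensitivity = A / (A + C)
specificity = D / (B + D)

# The mean over many realizations approaches the real probability of occupancy.
rng = random.Random(0)
n_rep = 2000
mean_occ = [0.0] * len(p)
for _ in range(n_rep):
    for i, occ in enumerate(realize_occupancy(p, rng)):
        mean_occ[i] += occ / n_rep
```

In this example, sensitivity and specificity both equal 0.85 rather than 1, even though the modelled and real probabilities of occupancy coincide, because the real probability of occupancy is not a threshold function.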
In practice, this theoretical confusion matrix can be calculated numerically for an arbitrary probability of occupancy, and appropriate statistics and optimal thresholds can then be derived from it.

A special case of this methodology arises when the modelled probability of occupancy is taken to be the same as the true probability of occupancy, $\hat{p}(\mathbf{x}) = p(\mathbf{x})$. This can be done to examine the behaviour of the confusion matrix and derived statistics in the hypothetical case of an ‘optimal’ model that perfectly predicts the probability of occupancy (Meynard & Kaplan, 2012). When the real probability of occupancy is not a threshold function of the environmental variables, this will produce non-trivial results. Statistics derived from the confusion matrix, such as sensitivity, will be below their maximal theoretical value, e.g. the expected value of sensitivity will be <1 even for a model that perfectly recovers the true probability of occurrence.

Effects of sample bias

Often, one wishes to examine the effects of sample bias on the behaviour of species distribution models (SDMs) and on SDM performance statistics derived from the confusion matrix. It is not possible to provide a general theoretical treatment of the effects of sample bias because sample bias can come about via a number of different mechanisms that affect presence and absence observations in numerous different ways. Simulations are generally necessary in these cases. However, when a particular known mechanism for generating sample bias is used, analytic approaches may be applicable. Here we will consider one of the simplest possible mechanisms for generating sample bias: over- or undercounting presences and absences. In these cases, sample prevalence will be different from the species prevalence in the landscape.
We assume that our sample is biased in the number of presences and absences with respect to the true relative numbers, but that the presences are randomly selected from the set of presences and the absences are randomly selected from the set of absences (so that the distribution of each is not biased with respect to its true distribution). This may come about if, for example, presence is consistently under-detected in a way that is directly proportional to the probability of occupancy. Even in cases where sample bias is more complicated than this, as is often the case, this simple mechanism may provide a useful baseline for evaluating the consequences of sample bias.

Given this assumption, the probability density distribution of presences in a biased sample is:

$$\rho_P(\mathbf{x}) = s\,\frac{p(\mathbf{x})\,\rho(\mathbf{x})}{\bar{p}}, \tag{3}$$

where $p(\mathbf{x})$ is the real probability of occupancy, $\rho(\mathbf{x})$ is the probability density of environmental conditions over the landscape, $\bar{p}$ is the species prevalence and $s$ is sample prevalence, i.e. the fraction of presences in the sample. Similarly, the distribution of absences in the sample is given by:

$$\rho_A(\mathbf{x}) = (1 - s)\,\frac{\left[1 - p(\mathbf{x})\right]\rho(\mathbf{x})}{1 - \bar{p}}. \tag{4}$$

Integrated over the landscape, the presence density sums to $s$ and the absence density to $1 - s$. These probability densities can be used to determine the optimal model fit to a biased sample in the limit of large sample size, as well as the confusion matrix derived from such a biased sample and the model fit to that sample. The log likelihood of fit of a model, $\hat{p}(\mathbf{x}; \boldsymbol{\beta})$, where $\boldsymbol{\beta}$ is a vector of model parameters, to such a biased sample is given by:

$$\ln L(\boldsymbol{\beta}) = \iint \left\{ \rho_P(\mathbf{x}) \ln \hat{p}(\mathbf{x}; \boldsymbol{\beta}) + \rho_A(\mathbf{x}) \ln\left[1 - \hat{p}(\mathbf{x}; \boldsymbol{\beta})\right] \right\} d\mathbf{x}.$$

This function can be numerically integrated to calculate the log likelihood for this model, and numerical optimization can be used to determine optimal parameter values. This will give the model that best represents a sample that is very large (infinite), but consistently biased in the sample prevalence with respect to the species prevalence.
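The numerical maximization of this log likelihood can be sketched for a discrete landscape as follows. The sketch rests on several assumptions that are not taken from the appendix: a single invented environmental variable, a two-parameter logistic model, and plain gradient ascent as the optimizer (the case study itself relies on GLM fitting in R). All function names (`sigmoid`, `biased_weights`, `log_likelihood`, `fit`) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def biased_weights(p, rho, s):
    """Weights of each site as a presence and as an absence in a biased
    sample of prevalence s (presences sum to s, absences to 1 - s)."""
    p_bar = sum(p_i * r_i for p_i, r_i in zip(p, rho))   # species prevalence
    w_pres = [s * p_i * r_i / p_bar for p_i, r_i in zip(p, rho)]
    w_abs = [(1.0 - s) * (1.0 - p_i) * r_i / (1.0 - p_bar)
             for p_i, r_i in zip(p, rho)]
    return w_pres, w_abs, p_bar


def log_likelihood(beta, x, w_pres, w_abs):
    """Log likelihood of a one-variable logistic model for the biased sample."""
    total = 0.0
    for x_i, wp, wa in zip(x, w_pres, w_abs):
        pm = sigmoid(beta[0] + beta[1] * x_i)
        total += wp * math.log(pm) + wa * math.log(1.0 - pm)
    return total


def fit(x, w_pres, w_abs, step=0.5, n_iter=5000):
    """Maximize the log likelihood by fixed-step gradient ascent
    (an assumed, deliberately simple optimizer)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x_i, wp, wa in zip(x, w_pres, w_abs):
            pm = sigmoid(b0 + b1 * x_i)
            r = wp * (1.0 - pm) - wa * pm    # derivative w.r.t. linear predictor
            g0 += r
            g1 += r * x_i
        b0 += step * g0
        b1 += step * g1
    return b0, b1


# Invented landscape: 21 equally weighted sites along one normalized variable,
# with a real probability of occupancy that is itself logistic.
x = [-2.0 + 0.2 * i for i in range(21)]
rho = [1.0 / 21] * 21
p = [sigmoid(-1.0 + 1.5 * x_i) for x_i in x]

s = 0.5                                   # sample prevalence
w_pres, w_abs, p_bar = biased_weights(p, rho, s)
b0, b1 = fit(x, w_pres, w_abs)
expected_b0 = -1.0 + math.log(s * (1.0 - p_bar) / ((1.0 - s) * p_bar))
```

Because the real probability of occupancy in this example is logistic, the maximizer can be checked by hand: the slope is unchanged and the intercept is shifted by $\ln\{s(1-\bar{p})/[(1-s)\bar{p}]\}$, which is what `expected_b0` encodes. For other forms of the real probability of occupancy no such closed form is assumed, and the numerical optimum has to be used directly.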
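Note that this log likelihood is quite similar to the log likelihood that is maximized to determine the model parameters that best fit a given distribution of presences and absences, except that in this case one passes directly from the real probability of occupancy to the optimal model without having to generate a presence–absence distribution and a biased sample of that presence–absence distribution. This technique is best used in comparison with simulations to separate the effects of sample bias from the effects of sample size.

The confusion matrix for this biased sample and a model fit to this sample is calculated as in Eq. (1), except that the distributions of presences and absences must be replaced by the distributions in Eqs. (3) and (4):

$$
\begin{aligned}
A_s(t) &= \iint_{\hat{p}(\mathbf{x}) > t} \rho_P(\mathbf{x})\,d\mathbf{x}, &
B_s(t) &= \iint_{\hat{p}(\mathbf{x}) > t} \rho_A(\mathbf{x})\,d\mathbf{x}, \\
C_s(t) &= \iint_{\hat{p}(\mathbf{x}) \le t} \rho_P(\mathbf{x})\,d\mathbf{x}, &
D_s(t) &= \iint_{\hat{p}(\mathbf{x}) \le t} \rho_A(\mathbf{x})\,d\mathbf{x},
\end{aligned}
$$

where $\rho_P(\mathbf{x})$ and $\rho_A(\mathbf{x})$ are the biased-sample densities of presences and absences, $\hat{p}(\mathbf{x})$ is the modelled probability of occupancy and $t$ is the threshold. This confusion matrix can then be used to calculate optimal thresholds, which can then be applied to the original unbiased probability of occupancy as described in Eq. (1) to examine the effect of sample bias on model performance.

The log likelihood for a discrete landscape, with sample prevalence $s$, species prevalence $\bar{p}$, and real probability of occupancy $p_i$ and landscape fraction $\rho_i$ at site $i$, is given by:

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left\{ s\,\frac{p_i\,\rho_i}{\bar{p}} \ln \hat{p}_i(\boldsymbol{\beta}) + (1 - s)\,\frac{(1 - p_i)\,\rho_i}{1 - \bar{p}} \ln\left[1 - \hat{p}_i(\boldsymbol{\beta})\right] \right\}. \tag{5}$$

Similarly, the confusion matrix is given by:

$$
\begin{aligned}
A_s(t) &= \frac{s}{\bar{p}} \sum_{i:\,\hat{p}_i > t} p_i\,\rho_i, &
B_s(t) &= \frac{1 - s}{1 - \bar{p}} \sum_{i:\,\hat{p}_i > t} (1 - p_i)\,\rho_i, \\
C_s(t) &= \frac{s}{\bar{p}} \sum_{i:\,\hat{p}_i \le t} p_i\,\rho_i, &
D_s(t) &= \frac{1 - s}{1 - \bar{p}} \sum_{i:\,\hat{p}_i \le t} (1 - p_i)\,\rho_i.
\end{aligned}
\tag{6}
$$

Case study

The equations above can be used to calculate the expected effects of this simple sample bias mechanism on optimal thresholds and statistics derived from the confusion matrix. For the concrete example used in the manuscript, the true pattern of species occupancy was created from four environmental variables varying over the European landscape as described in Jiménez-Valverde & Lobo (2007). Variables were centred (mean subtracted) and normalized to have a variance of 1. True species occupancy was a double logit in these four normalized environmental variables, with inflection points of the logistic curves occurring at ± a single fixed fraction of 1 (the standard deviation of all variables). In what follows we use the terminology of Meynard & Kaplan (2012), where α is the inverse of the slope of a logistic curve at the inflection point.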
Several different slopes for the logistic curves were used (for each configuration, a single unique slope was used for both logistic responses on all four variables), ranging from a threshold environmental response (i.e. α = 0) to a very gradual response varying over the same scale as the environmental data (i.e. α = 0.5). In practice, the inflection point of the logistic responses was varied so that the theoretical species prevalence would be comparable between configurations with different values for the logistic slope, α. Species prevalence was the same as in Jiménez-Valverde & Lobo (2007), i.e. 0.17.

These four normalized environmental variables were Box–Cox transformed and renormalized to zero mean and variance of 1. Principal components analysis (PCA) was then performed on the transformed variables. The first two principal components, representing a total of 88% of the variance in the original Box–Cox transformed variables, were kept and used to model probability of occupancy. Models were based on a GLM including the two PCA variables plus all terms and interactions out to third order. Unlike Jiménez-Valverde & Lobo (2007), stepwise regression was not used because this is not readily available for the alternative logitreg GLM fitting scheme (Venables & Ripley, 2002) that we used in addition to the standard GLM function in R.

In addition to modelling the effects of sample prevalence on SDM performance using multiple realizations of the virtual species presence–absence distribution and the subsampling strategy described in Jiménez-Valverde & Lobo (2007), theoretical statistics were calculated using the equations above. Red dash-dotted curves in Figure 2 are based on statistics derived from the model maximizing the log likelihood given in Eq. (5) for a given level of sample prevalence. Optimal thresholds were then derived from this optimal model via the confusion matrix in Eq. (6).
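The chain from a biased confusion matrix to an optimal threshold, and from that threshold back to the unbiased landscape, can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the calculations behind Figure 2: the six-site landscape and the candidate thresholds are invented, the function names are hypothetical, the modelled probability is simply set equal to the real one, and the statistic maximized (the proportion correctly classified, A + D) is an assumed choice made because it depends on prevalence, whereas sensitivity and specificity at a fixed threshold are unchanged by this rescaling of presences and absences.

```python
def confusion(p, p_model, rho, t, s=None):
    """Theoretical confusion matrix (A, B, C, D) at threshold t.

    With s=None the matrix is unbiased. Otherwise presences are rescaled
    to make up a fraction s of the sample and absences a fraction 1 - s.
    """
    p_bar = sum(p_i * r_i for p_i, r_i in zip(p, rho))   # species prevalence
    k_pres = 1.0 if s is None else s / p_bar
    k_abs = 1.0 if s is None else (1.0 - s) / (1.0 - p_bar)
    A = B = C = D = 0.0
    for p_i, pm_i, r_i in zip(p, p_model, rho):
        pres = k_pres * p_i * r_i
        absn = k_abs * (1.0 - p_i) * r_i
        if pm_i > t:        # model predicts presence
            A += pres
            B += absn
        else:               # model predicts absence
            C += pres
            D += absn
    return A, B, C, D


def best_threshold(p, p_model, rho, thresholds, s=None):
    """Candidate threshold maximizing A + D (assumed criterion)."""
    def correct(t):
        A, _, _, D = confusion(p, p_model, rho, t, s)
        return A + D
    return max(thresholds, key=correct)


# Invented landscape: six equally weighted sites, species prevalence 0.375.
p = [0.05, 0.1, 0.2, 0.4, 0.6, 0.9]
rho = [1.0 / 6] * 6
thresholds = [0.075, 0.15, 0.3, 0.5, 0.75]

t_unbiased = best_threshold(p, p, rho, thresholds)
t_biased = best_threshold(p, p, rho, thresholds, s=0.8)

# Apply the threshold chosen on the biased sample to the unbiased landscape.
A, B, C, D = confusion(p, p, rho, t_biased)
sensitivity = A / (A + C)
specificity = D / (B + D)
```

In this toy example, a sample prevalence of 0.8 (against a species prevalence of 0.375) lowers the selected threshold from 0.5 to 0.15, which raises sensitivity and lowers specificity once the threshold is applied to the unbiased pattern of occupancy.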
To test the performance of such models, sensitivity and specificity were calculated from the true, unbiased pattern of species occupancy using Eq. (2). Yellow solid curves in Figure 2 are similar to the red dash-dotted curves, except that the step of obtaining the optimal model was skipped and the true pattern of species occupancy, $p_i$, was used instead of the modelled probability of occupancy, $\hat{p}_i$, in Eq. (6). Horizontal, dashed blue lines in the figures represent theoretical statistics for a large unbiased sample and a model that perfectly reproduces the true probability of occupancy. Vertical, dashed blue lines in Figure 2 indicate the true species prevalence, at which one expects the biased (yellow) and unbiased (dashed blue) results to match.

REFERENCES

Jiménez-Valverde, A. & Lobo, J.M. (2007) Threshold criteria for conversion of probability of species presence to either–or presence–absence. Acta Oecologica – International Journal of Ecology, 31, 361–369.

Meynard, C.N. & Kaplan, D.M. (2012) The effect of a gradual response to the environment on species distribution modeling performance. Ecography, 35, 468–480.

Venables, W.N. & Ripley, B.D. (2002) Modern applied statistics with S, 4th edn. Springer-Verlag, New York.