jbi12213-sup-0001-AppendixS1

Journal of Biogeography
SUPPORTING INFORMATION
Accounting for geographical variation in species–area relationships improves the
prediction of plant species richness at the global scale
Katharina Gerstner, Carsten F. Dormann, Tomáš Václavík, Holger Kreft and Ralf Seppelt
Appendix S1 Aggregation of biomes and land-cover classes
Table S1.1 Aggregation of biomes using regression tree analysis (De’ath & Fabricius, 2000) and
regarding the ratio of log species richness and log area as response. It is evident that this ratio
increases from biome group 1 to 4.
Table S1.2 Aggregation of land-cover classes
Appendix S2 Performance of the logarithmic model S = c + z × log(A)
Although the power law is generally found to be the most appropriate model for describing SARs (Connor &
McCoy, 1979; Dengler, 2009; Triantis et al., 2012), the best-fit model for a particular set of data can
only be determined empirically (Connor & McCoy, 1979; Dengler, 2009; Triantis et al., 2012). Hence,
we tested the performance of the logarithmic model (Gleason, 1922) as an alternative to the power law
model. The logarithmic model is also widely used and sometimes found to outcompete the power
model (Triantis et al., 2012). The procedure remained the same as for the power model. Firstly, we
selected a neighbourhood distance based on minimizing AIC and residual spatial autocorrelation
(RSA). Secondly, we fitted the SAR parameters for different models and the appropriate distance and
thereby calculated AIC. Finally, we calculated R² with 10-fold cross-validation using the fitted
parameters from the second step.
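
The fitting and validation steps can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit of S = c + z × log(A) without the spatial error term, a base-10 logarithm (the base only rescales z), a Gaussian AIC up to an additive constant, and a pooled 10-fold cross-validation; all function names are hypothetical and none of this is taken from the original analysis code.

```python
import math
import random


def fit_log_sar(areas, richness):
    """Least-squares fit of S = c + z * log10(A); returns (c, z)."""
    x = [math.log10(a) for a in areas]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(richness) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, richness))
    z = sxy / sxx
    c = mean_y - z * mean_x
    return c, z


def predict(c, z, areas):
    """Predicted richness for each area under the logarithmic model."""
    return [c + z * math.log10(a) for a in areas]


def gaussian_aic(observed, fitted, n_params):
    """AIC of a Gaussian model, n*log(RSS/n) + 2k, up to an additive constant."""
    n = len(observed)
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return n * math.log(rss / n) + 2 * n_params


def pseudo_r2(observed, predicted):
    """Squared Pearson correlation between predicted and observed values."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def cross_validated_r2(areas, richness, k=10, seed=0):
    """Pseudo-R2 of out-of-fold predictions from k-fold cross-validation."""
    idx = list(range(len(areas)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(areas)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        c, z = fit_log_sar([areas[i] for i in train],
                           [richness[i] for i in train])
        for i in fold:
            preds[i] = c + z * math.log10(areas[i])
    return pseudo_r2(richness, preds)
```

In this sketch each sampling unit is predicted by a model that was fitted without it, so the resulting R² measures predictive ability rather than fit.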
Figure S2.1 Observed relationship of species richness against log area.
Table S2.1 Models compared by degrees of freedom, ΔAIC values and mean predictive ability. The R²
was computed by 10-fold cross-validation. Variation of SARs improves prediction of the species
richness pattern.
In summary, we found results similar to those for the power law. The AIC was minimized at a
neighbourhood distance of 700 km (data not shown) and the model performance, expressed as
percentage of explained variance, did not differ much (± 4%) (Fig. S2.1 and Table 1). The use of a
logarithmic model did not improve on the best power law model (maximum explained variance was
46.1% for the biome model).
Appendix S3 Model selection, forecast uncertainty and spatial dependency of parameter
estimates
Model selection
Because spatial autocorrelation is present in the data, we performed simultaneous autoregressive
modelling of spatial error type using the function spautolm (package SPDEP) of the software R (R
Development Core Team, 2011). This model extends the usual OLS regression model Y = Xβ + ε by
splitting the error term into an independent, standard normally distributed term ε and a spatially
dependent error term µ. The simultaneous autoregressive spatial error model thus takes the form
Y = Xβ + λWµ + ε
where λ is the spatial autoregression coefficient and W is the spatial weights matrix representing the
spatial structure (Dormann et al., 2007; Bivand et al., 2008). For model selection, we had to
define a neighbourhood that best models the spatial structure in the residuals and thus minimizes
spatial autocorrelation in the ε-term. To investigate the impact of different neighbourhood
distances on model performance and on the spatial autocorrelation in the ε-error term, we first plotted
a histogram of neighbourhood distances (Fig. S3.1) and decided to test a sequence of models
considering neighbourhood distances from 0 to 5000 km in increments of 100 km.
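
As an illustration of how such a sequence of neighbourhoods translates into spatial weights matrices W, the following sketch builds a distance-based, row-standardized W for each candidate distance. The great-circle distance, the Earth radius of 6371 km, the row standardization and all function names are assumptions made for this example; the text does not state which weighting scheme was used with spautolm.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(p, q):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def weights_matrix(coords, max_km):
    """Row-standardized weights: units within max_km of each other are neighbours."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in range(n)
                      if j != i and great_circle_km(coords[i], coords[j]) <= max_km]
        for j in neighbours:
            w[i][j] = 1.0 / len(neighbours)
    return w  # units without neighbours keep an all-zero row


# Candidate neighbourhood distances: 0 to 5000 km in steps of 100 km.
candidate_distances_km = list(range(0, 5001, 100))
```

One model would then be fitted per entry of candidate_distances_km, and the resulting AIC, R² and RSA values compared across distances.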
Figure S3.1 Histogram of neighbourhood distances in km.
We compared model performances using several statistics (cf. Kissling & Carl, 2008): the Akaike
information criterion (AIC), R² and residual spatial autocorrelation (RSA), which sums up the absolute
Moran’s I values. The Moran’s I correlogram was calculated with the function correlog() from the R
package NCF. R² was assessed with a pseudo-R² value calculated as the coefficient of determination of
the linear regression of predicted versus observed species numbers. AIC values allow the selection of
models based on both model fit and model complexity. Model selection criteria were minimizing AIC,
maximizing R² and minimizing RSA. Figure S3.2 shows that the AIC was minimized by considering a
neighbourhood of 700 km, which minimized the RSA at the same time. In contrast, R² decreased for
this neighbourhood distance. However, there is no reason why R² and RSA should identify the same
optimal scale for the neighbourhood, as they quantify different model qualities. R² quantifies the fit to
the data, irrespective of why the fit was high. RSA quantifies the degree to which the assumption of
independence was violated. Thus, at scales where RSA is high (e.g. 100 km), R² reports good fit for a
model that strongly violates the assumption of independence. Moreover, R² usually increases when
including more predictors. In our case the number of predictors remained equal but the information
added from the neighbourhood differed and might even be contrasting for certain distances. Hence, for
the selection of a neighbourhood distance, the joint use of RSA and AIC is the preferred approach,
which is in accordance with the recommendation made by Kissling & Carl (2008).
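
To make the RSA statistic concrete, the sketch below computes a textbook Moran's I for successive distance classes and sums the absolute values. It assumes binary weights within each distance class and a precomputed pairwise distance matrix; the function names are hypothetical, and the calculation is not meant to reproduce the internals of correlog() or the exact set of distance classes used in the analysis.

```python
def morans_i(values, dist, lower, upper):
    """Moran's I with binary weights for pairs with lower < distance <= upper.

    Returns None if no pair of units falls into the distance class.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross_products = 0.0
    n_pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and lower < dist[i][j] <= upper:
                cross_products += dev[i] * dev[j]
                n_pairs += 1
    if n_pairs == 0:
        return None
    return (n / n_pairs) * (cross_products / sum(d * d for d in dev))


def residual_spatial_autocorrelation(residuals, dist, breaks):
    """Sum of absolute Moran's I values over the distance classes given by breaks."""
    total = 0.0
    for lower, upper in zip(breaks[:-1], breaks[1:]):
        i_value = morans_i(residuals, dist, lower, upper)
        if i_value is not None:
            total += abs(i_value)
    return total
```

Because absolute values are summed, positive and negative autocorrelation at different lags cannot cancel, so a small RSA indicates residuals that are close to independent across all distance classes considered.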
Figure S3.2 Comparison of model performances depending on different neighbourhood distances.
Plots refer to AIC (top), coefficient of determination R² (centre) and residual spatial autocorrelation
(RSA, bottom). The AIC is minimized by considering a neighbourhood of 700 km, which minimizes
the RSA at the same time but is not in accordance with R², which is far from the optimum for this
neighbourhood distance.
Model forecast uncertainty
We quantified uncertainty in the model forecasts using formula (3) in the main paper for calculating
the standard deviation of the forecast (Neter et al., 1996). Plots of the 95% confidence intervals of the
model forecast for the biome and the land-cover model are presented in Fig. S3.3. Particular SAR
curves are considered significantly different from the global SAR if the
curves and their 95% confidence intervals do not intersect over the entire area range. For the biome model the lengths
of the SAR curves reflect the size range of the sampling units that the relationship was fitted to.
For biomes:
For land cover:
Figure S3.3 Specific differences in the biome- and land-cover SAR curves (red) versus the global
SAR curve (black) and their 95% confidence intervals (CIs; dashed lines). Differences from the global
SAR are significant when the corresponding 95% CIs do not intersect with the 95% CIs of the global
model over the entire range of the sampling area.
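
The non-intersection criterion can be expressed as a simple check, sketched here under stated assumptions: the forecast standard deviations are taken as given inputs (formula (3) of the main paper is not reproduced), the intervals use the normal approximation of ±1.96 standard deviations, and both curves are evaluated at the same area values. The function names are hypothetical.

```python
def confidence_band(fitted, forecast_sd, z=1.96):
    """Approximate 95% band as (lower, upper) pairs; z = 1.96 is a normal approximation."""
    return [(f - z * s, f + z * s) for f, s in zip(fitted, forecast_sd)]


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def significantly_different(band_specific, band_global):
    """True only if the two bands are disjoint at every evaluated area value."""
    return all(not intervals_overlap(s, g)
               for s, g in zip(band_specific, band_global))
```

A single overlapping pair of intervals anywhere along the area range is enough for the specific SAR to be treated as not significantly different from the global SAR.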
Spatial dependency of parameter estimates
Figure S3.4 Spatial dependency of parameter estimates: Comparison of parameter estimates using the
entire dataset (error estimate from spautolm function) and 1000 bootstrap samples. Parameter
estimates are unbiased but the conditional error (due to spatial arrangement) tends to be
underestimated during the fit.
Spatial dependency of raw data was more important in the biome model than in the global or in the
land-cover model. This is caused by the number of samples used to fit the effect in each group of
predictors (Table 2). The global model used all samples to determine the area effect, whereas the biome
model used only the samples within each biome to determine the area effect for that biome separately.
For instance, in flooded grasslands and savannas only three samples were used. Excluding one of
them led to considerably different parameter estimates.
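
The sensitivity of estimates based on very few samples can be illustrated with a small resampling sketch. It assumes an ordinary least-squares slope (for example of log richness against log area) instead of the spatial error model, and the function names are hypothetical; it shows the general bootstrap and leave-one-out ideas rather than the exact procedure behind Fig. S3.4.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x, or None if x has fewer than two distinct values."""
    if len(set(x)) < 2:
        return None
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Slopes from n_boot resamples drawn with replacement; degenerate resamples are skipped."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slope = ols_slope([x[i] for i in idx], [y[i] for i in idx])
        if slope is not None:
            slopes.append(slope)
    return slopes


def leave_one_out_slopes(x, y):
    """Slope re-estimated after excluding each sample in turn."""
    return [ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]
```

With only three samples, each leave-one-out fit rests on two points, so the slope can change substantially depending on which sample is excluded.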