Biomathematics and Statistics Scotland

Prediction of New Colonies – Seabird Tracking Data
(Under Agreement C10-0206-0387)

CONTRACT No: C10-0206-0387
Report submitted to: Joint Nature Conservation Committee
November 2012
Authors: Mark J Brewer, Jackie M Potts, Elizabeth I Duff, David A Elston

CONTENTS
1. Non-Technical Summary
2. Introduction
3. Data
4. Methodology
5. Results
6. Prediction Maps of Usage
7. Discussion

In addition to this report, there are two further documents associated with this project: (i) BioSS Terns Report II – Results Appendix; (ii) BioSS Terns Report II – Software; and also ancillary files: (i) spreadsheet files of grid predictions for each of the thirteen species/colony combinations for unsurveyed colonies; (ii) R code files for ordination, fitting models to a combination of sites, cross-validation and grid predictions; (iii) cleaned and standardised versions of the data files (for survey data and grids).

This report is © Biomathematics and Statistics Scotland 2012

1. Non-Technical Summary

The Joint Nature Conservation Committee (JNCC) is working on the identification of important marine areas around the UK that are used by five species of tern during the breeding season. For the four larger tern species (Arctic, common, roseate and Sandwich terns), data are available from boat surveys, using both visual tracking and transect survey methods. Following a competitive tendering process, in June 2012 BioSS was tasked with making predictions of usage and preference for Arctic, common and Sandwich terns for colonies lacking visual tracking data. This work forms part of Phase II of a larger tern project JNCC are undertaking, and follows on from a previous project completed by BioSS earlier in 2012 as part of Phase I of the tern project.
The Phase I work we undertook previously used visual tracking data to learn about important associations between terns’ usage/preference and environmental covariates, and to map usage/preference for each tern species for those colonies with tracking data. The methodology developed in the Phase I project, essentially a flexible weighted logistic regression model, is used again in this Phase II project. Predictive models were defined for all new colonies by combining data from all relevant colonies, for each species separately. An evaluative procedure (assessment by a form of cross-validation) determined that, for the colonies with data, better predictions were obtained by combining data in this way than by using ecological considerations or multivariate analysis of environmental data to suggest subsets of “similar” colonies for prediction. As part of the cross-validation analysis, we found that the most important predictors are distance to colony, distance to shore, bathymetry and chlorophyll concentration. Based on this analysis, predictions and prediction maps were produced for all requested unsurveyed colonies.

2. Introduction

2.1 Background – Previous Phase

This project represents Phase II of the analysis of data sets on four species of tern in several colonies in UK offshore waters. Phase I was concerned with developing models specifically designed for the type of tracking data available. The report for Phase I (Brewer et al., 2012) will be referred to throughout this document as “the Phase I report”. The Phase I analysis determined that a weighted logistic regression was appropriate for analysing the data; the data themselves consisted of individual tracks of known foraging instances (forming cases) with sets of randomly generated perturbations of those tracks (forming controls). The analysis thus took a case-control form.
It was found that hierarchical (or random effect, mixed) models, as used in previous tracking analysis work by Aarts et al. (2008) and Wakefield et al. (2011), were not required; the difference is that for JNCC’s dataset there were no known repeat observations per individual. In this framework, the cases represent “presences” and the controls represent “absences”. A number of explanatory variables were included in the regression, representing the environmental conditions at different locations, but also including the measures “distance to colony” and “distance to shore”. Different forms of weighted logistic models were considered during the Phase I analyses, using a range of facilities in R. Both GLMs and GAMs were considered, with various model selection strategies where appropriate. Spatial autocorrelation of the response data was addressed, both by the weighting in the regression and via a spatial correlation network derived using the INLA (Integrated Nested Laplace Approximation) package in R (INLA, 2012). Different models were appropriate depending on the purpose of the modelling, for example whether the aim was to identify significant relationships with the environmental covariates or to make predictions of usage and/or preference by each species in each location. Further details can be found in the Phase I report itself.

2.2 Second Phase – Colonies Without Tracking Data

JNCC wish to provide predictions for a number of colonies which have no tracking data available (Phase II). The task at hand is to use data from surveyed colonies, with models such as those developed in Phase I, to make predictions for these new colonies. The new colonies, as specified in the invitation to tender and subsequently modified by JNCC (with agreement from BioSS), are the following (with colony names we shall use in the rest of this report in bold):

Table 1. Colonies without tracking data available, for which predictions are to be made.
Common tern:
  Dungeness
  Foulness (Greater Thames)
  Breydon Water (Norfolk)
  Liverpool Bay (The Dee estuary; Ribble & Alt estuaries)
  Strangford Lough (N Ireland)
  Carlingford Lough (N Ireland)
  Farne Islands (Northumberland)
  Isle of May (Firth of Forth)
Sandwich tern:
  Liverpool Bay (Duddon Estuary)
  Carlingford Lough (N Ireland)
  Strangford Lough (N Ireland)
Arctic tern:
  Strangford Lough (N Ireland)
  Isle of May (Firth of Forth)

These colonies supplement the list of colonies in the Phase I report; however, we provide a list here of colonies with data, as some new colonies (indicated with *) have been added for this Phase II work:

Table 2. Colonies with tracking data available.

Common tern:
  Coquet and Farne* Islands (Northumberland)
  Larne Lough (Northern Ireland)
  Glas Eileanan / South Shian (Mull area, west Scotland)
  Leith Docks (Firth of Forth)
  Cemlyn (Anglesey)
Sandwich tern:
  Coquet and Farne Islands (Northumberland)
  Larne Lough and Cockle* Island (Northern Ireland)
  Sands of Forvie (Aberdeenshire)
  Cemlyn (Anglesey)
Arctic tern:
  Coquet and Farne Islands (Northumberland)
  Copeland / Cockle Islands (Outer Ards, Northern Ireland)

The question of how to determine which of the colonies with survey data should be used to predict which of the colonies without is addressed in the methodology (Section 4). This required us to determine suitable metrics for comparing models (within species) fitted using different subsets of colonies and different selected covariates.

3. Data

3.1 Data Summary

The environmental covariates for this phase are the same as for Phase I: see Section 2 of the Phase I report for full details. As part of this second phase, we were required to conduct a deeper inspection of the data in order to justify the “extrapolation” required in producing predictions and maps for the new colonies. Boxplots were used to compare the ranges of the environmental covariates between colonies; this is discussed in Section 4.2.
Such differences in ranges were not of concern in Phase I as each colony was analysed separately; only when multiple colonies are considered together is range mismatch a potential problem. Boxplots were also used to identify outliers and variables which have a skewed distribution. Section 3.2, which follows, contains a discussion of outliers in some of the environmental variables; this follows up a recommendation made by us in the discussion (Section 6) of the Phase I report. We also considered whether we could use logged versions of chlorophyll concentrations and wave and current shear stresses; on the basis of our new findings, we would recommend that this transformation could have been applied during the Phase I analysis.

Some of the covariates considered in Phase I of the project were not considered further in Phase II. Eastness, northness, slope and sand were not considered because they were not selected in any of the Phase I models. (There was one exception, where slope was selected by the AIC criterion but was not significant.) The interannual standard deviations of the probability of a frequent thermal front in spring and in summer were also excluded from the model selection process for Phase II. Although they were selected in some Phase I models, it did not seem biologically realistic to suppose that the birds would respond to these variables while not responding to the probability of a frequent thermal front itself. We would recommend excluding these from the Phase I models also.

3.2 Variable Inspection – Outliers

The boxplots in Figure 1 illustrate the range of values for each environmental covariate; as the predictive grids for Sandwich terns (out to 55 km from the colony) are different from those for the other three species (out to 31 km from the colony), there is a separate plot having only the colonies relevant to Sandwich terns.
The variable name is indicated by the y-axis, and the key for the colony codes on the x-axis is as follows:

  ce  Cemlyn
  co  Coquet
  fa  Farnes
  ll  Larne Lough
  le  Leith
  mu  Mull
  oa  Outer Ards
  br  Breydon
  ca  Carlingford Lough
  dn  Dungeness
  fn  Foulness
  im  Isle of May
  ri  Ribble
  st  Strangford Lough
  fo  Forvie
  du  Duddon

The boxplots show a negatively skewed distribution for sea surface temperature, with low values occurring near the shore. The extent to which these data are reliable is uncertain. Removal of the values that were considered unreliable would have resulted in considerable loss of data, particularly around the shore, so sea surface temperature was excluded from the analysis instead. Chlorophyll concentrations and wave and current shear stresses had highly positively skewed distributions and were therefore log-transformed prior to further analysis. (The log-transformed versions are shown in the boxplots below.) There was no reason to question the reliability of these, as lognormal distributions frequently arise for variables such as chemical concentration which have low mean values, high variances, and cannot be negative. On the other hand, some of the sea surface temperature values, particularly those below 0 °C, seemed unrealistic.

Figure 1. Boxplots of environmental covariates. Colonies to the left of the vertical line are those with tracking data and those to the right are those without.

4. Methodology

4.1 Weighted Logistic Regression via a Case-Control Design

As noted earlier, the form of statistical model used for analysing the tern tracking data was a weighted logistic regression based on a case-control design. Full details of the modelling procedure and the generation of the control data can be found in the Phase I report. We did not include INLA in Phase II, as it was only used for model checking in Phase I and not for making predictions.
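As an illustration only, the following Python sketch shows one way a weighted logistic regression can be fitted to case-control data by iteratively reweighted least squares. It is not the R code used in this project. The function name fit_weighted_logistic, the simulated distances, and the choice of a weight of 1/12 for each control (so that the 12 controls generated per case together carry the weight of one case) are assumptions made for the example; the weighting actually used is described in the Phase I report.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-8):
    """Weighted logistic regression by iteratively reweighted least squares.

    X : (n, p) covariates (no intercept column); y : 0/1 responses;
    w : non-negative observation weights. Returns [intercept, slopes...].
    """
    A = np.column_stack([np.ones(len(y)), X])    # add intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))    # fitted probabilities
        grad = A.T @ (w * (y - p))               # weighted score vector
        hess = (A * (w * p * (1.0 - p))[:, None]).T @ A  # weighted information matrix
        step = np.linalg.solve(hess, grad)       # Newton step
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated example (hypothetical data, not from the tern surveys).
rng = np.random.default_rng(1)
n_cases, controls_per_case = 200, 12
dist_case = np.minimum(rng.exponential(5.0, n_cases), 30.0)      # cases cluster near the colony
dist_ctrl = rng.uniform(0.0, 30.0, n_cases * controls_per_case)  # controls spread over the range

dist_col = np.concatenate([dist_case, dist_ctrl])
y = np.concatenate([np.ones(n_cases), np.zeros(dist_ctrl.size)])
# Assumed weighting: each control counts 1/12, each case counts 1.
w = np.where(y == 1, 1.0, 1.0 / controls_per_case)

beta = fit_weighted_logistic(dist_col[:, None], y, w)
print("intercept, slope for dist_col:", beta)
```

In this simulated setting the fitted slope for distance to colony is negative, reflecting cases concentrated near the colony relative to the controls. The intercept depends on the chosen ratio and weighting of controls to cases and so carries no biological meaning.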
4.2 Comparisons of Environmental Data Between Colonies

One extremely important aspect of this project is to determine which colony or colonies can be used to build models to make predictions for new colonies lacking tracking data. For each species, we decided to compare the similarity or otherwise across colonies (or, more specifically, the foraging ranges of colonies) of the environmental covariates used in the modelling. The reasoning is that if a set of colonies appears to contain approximately the same environments, this might justify using a model from one or more colonies within the set to obtain predictions for another. On the other hand, colonies which are well separated in multivariate environment space may present radically different environments to terns, and therefore a model from one such colony may not be suitable for predicting for another.

We compare the environmental data between colonies in two ways: firstly, we look at simple boxplot summaries (presented in Section 3.2) for each environmental covariate in turn; secondly, we use a principal component analysis (PCA) to study the combination of information from all covariates simultaneously. Principal component analysis takes a set of variables and replaces them with a smaller number of new variables (the principal components) in such a way that as much as possible of the information in the original variables is retained in the new ones. This allows us to plot the data in a concise way, for example by plotting the second principal component (PC2) against the first principal component (PC1). Colonies which are close together in this plot will be similar in terms of the original set of environmental covariates. This exploratory analysis then helps us to select suitable subsets of colonies with which to build models for making usage and preference predictions for the new colonies.
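The idea of comparing colonies in principal component space can be sketched as follows. This is a minimal Python illustration, not the R ordination code supplied with the project. The function name pca_scores, the simulated covariates and the colony labels "A" and "B" are hypothetical, and standardising each covariate before the PCA is an assumption of the sketch. Rows containing a missing value are dropped before the decomposition, as is common for PCA routines.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA on standardised columns, using complete rows only.

    Returns (scores, explained_ratio, keep), where keep flags the rows used.
    """
    keep = ~np.isnan(X).any(axis=1)                     # drop any row with a missing value
    Z = X[keep]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)    # centre and scale each covariate
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T                    # coordinates on the leading components
    explained = (s ** 2) / np.sum(s ** 2)               # share of variance per component
    return scores, explained[:n_components], keep

# Hypothetical grid cells for two colonies, three covariates each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)),     # colony "A"
               rng.normal(3.0, 1.0, (50, 3))])    # colony "B", shifted environment
colony = np.array(["A"] * 50 + ["B"] * 50)
X[0, 1] = np.nan                                  # one missing covariate value

scores, explained, keep = pca_scores(X)
for name in ("A", "B"):
    centre = scores[colony[keep] == name].mean(axis=0)
    print(name, "mean (PC1, PC2):", np.round(centre, 2))
```

Colonies whose points overlap in the PC2 against PC1 plane would be regarded as environmentally similar; here the two simulated colonies separate along PC1 because their covariate means differ.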
Visual inspection of the boxplots in Section 3.2 can indicate which variables may be unsuitable for extrapolating from one colony to another. We found two such variables: (i) salinity is a significant covariate at Cemlyn, but the boxplots show that the distribution of salinity at Cemlyn is very different from that at other colonies; (ii) wave and current shear stress are significant covariates at Outer Ards, but the distribution of wave shear stress was different from that at many of the other colonies.

The set of variables to be considered in the PCA was: bathy_1sec, strat_temp, summ_front, spring_front, log_chl_apr, log_chl_may, log_chl_june, log_ss_wave, log_ss_current, sal_spring, sal_summ. However, some of the environmental variables are not available for some of the new colonies. For example, at Dungeness ss_wave, ss_current, sal_spring and sal_summ are entirely missing. The PCA function in R will remove entirely any row that contains a missing value; as such, trying to use all of the variables above can result in all data for one or more colonies being removed. The offending variables are the final four in the set above; hence, for each species, we conduct a PCA both on the above set of covariates (All Variables) and on the following smaller set (Reduced Set of Variables): bathy_1sec, strat_temp, summ_front, spring_front, log_chl_apr, log_chl_may, log_chl_june.

4.3 Cross Validation for Selecting Predictive Models for Colonies/Species

JNCC supplied us with suggested groupings for prediction purposes, which were taken into consideration in the cross-validation exercise. These were based loosely on geographical similarities and are summarised briefly below. Some of these, such as the close similarity between Coquet and Farne Islands, were confirmed by the PCA.

Table 3.
Suggested colony groupings

Common tern
Group 1 2 3 4 5 6
Model Coquet Island, Farne Islands (very little data) Coquet Island, Farne Islands (very little data), Cemlyn Larne Lough
Prediction Farne Islands, Isle of May Dungeness
Larne Lough Cemlyn Coquet Island, Farne Islands, Leith Docks Larne Lough, Glas Eileanan / South Shian (Mull), Cemlyn Strangford Lough, Carlingford Lough Strangford Lough, Carlingford Lough Foulness, Breydon Water Liverpool Bay (Ribble)

Arctic tern
Group 1. Model: Coquet Island, Farne Islands. Prediction: Isle of May.
Group 2. Model: Outer Ards. Prediction: Strangford Lough.

Sandwich tern
Group 1. Model: Larne Lough, Cockle Island. Prediction: Carlingford Lough, Strangford Lough.
Group 2. Model: Larne Lough, Cockle Island, Cemlyn. Prediction: Duddon Estuary, Carlingford Lough, Strangford Lough.

The suggested ecological groupings and the PCA exercise indicated which colonies might be similar in terms of environment and resulted in a series of colony groupings. Data from each colony within each resulting grouping were combined to produce models that could be used to make predictions for new colonies. Cross-validation was used to select which colonies to use for prediction. This was done by holding out the tracking data from a particular colony and comparing the fit of predictions from (i) a model developed using the remaining colonies in a proposed grouping with the fit from (ii) a model developed using data from all the remaining colonies. For example, it was suggested that data from Coquet and Farnes might be used to predict Arctic terns at the Isle of May. We therefore tested whether Farnes was better predicted using a model developed for Coquet alone, or a model using all available Arctic tern data (Coquet and Outer Ards together). The assessment was carried out on the tracking data (observations and controls) rather than on the grid data because we did not have presence-absence data in the form of a grid.
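The comparison of two candidate models on a held-out colony can be sketched in Python as below, using the two scores and the intercept adjustment defined in the remainder of this section. This is an illustration under stated assumptions, not the project's R code. The names adjusted_probabilities and prediction_scores, the simulated held-out data, and the two stand-in linear predictors eta_all and eta_group are hypothetical, and bisection is simply one convenient way of finding the constant to add to the intercept.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adjusted_probabilities(eta, y, lo=-30.0, hi=30.0, n_iter=100):
    """Shift the linear predictor eta by a constant, found by bisection,
    so that the predicted probabilities sum to the number of cases in y."""
    target = y.sum()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sigmoid(eta + mid).sum() < target:
            lo = mid    # probabilities too small: raise the intercept
        else:
            hi = mid    # probabilities too large: lower the intercept
    return sigmoid(eta + 0.5 * (lo + hi))

def prediction_scores(eta, y):
    """Return (sum of squared errors, log-likelihood score) after adjustment."""
    p = np.clip(adjusted_probabilities(eta, y), 1e-12, 1 - 1e-12)
    sse = np.sum((y - p) ** 2)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return sse, ll

# Hypothetical held-out colony: foraging probability falls with distance.
rng = np.random.default_rng(2)
dist_col = rng.uniform(0.0, 30.0, 2000)
y = (rng.uniform(size=2000) < sigmoid(1.0 - 0.2 * dist_col)).astype(float)

eta_all = -0.2 * dist_col     # stands in for a model built on all remaining colonies
eta_group = 0.05 * dist_col   # stands in for a poorly matched grouping model

sse_all, ll_all = prediction_scores(eta_all, y)
sse_group, ll_group = prediction_scores(eta_group, y)
print("all colonies: SSE %.1f, LL %.1f" % (sse_all, ll_all))
print("grouping:     SSE %.1f, LL %.1f" % (sse_group, ll_group))
```

The candidate with the lower sum of squared errors and the higher (less negative) log-likelihood score would be preferred for predicting the held-out colony.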
Two scores were used to assess the quality of predictions (fitted to the tracking data):

(1) The sum of squared errors

$$\sum_i (y_i - p_i)^2$$

If this quantity is divided by the number of observations, it gives the mean squared error, also known as the Brier score when applied to probabilistic predictions (Brier, 1950);

(2) A score related to the log-likelihood

$$\sum_i \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$$

where $y_i$ is the binary variable indicating foraging behaviour and $p_i$ is the predicted probability for observation $i$.

The intercept is arbitrary for case-control data as it depends on the ratio of controls to cases, which we have chosen, and which has no biological meaning. An adjustment was therefore made to the intercept for each model before calculating the two scores. A constant was added to the intercept to ensure that the sum of the predicted probabilities was equal to the sum of the values of the binary variable.

There are many other measures that could have been used; see Liu et al. (2011) for a review. For example, the area under the receiver operating characteristic (ROC) curve, known as the AUC, is widely used, although it has received some criticism (Lobo et al., 2008). It is unlikely that the overall conclusions would have changed had we used a different metric: the results in the end were clear and consistent in terms of prediction assessment, and predicted maps tended to vary only slightly for the better-fitting models in any case.

Results and interpretation from this analysis are found in Section 5. The predictions themselves can be found in supplemental spreadsheets, while maps of predictions are presented in Section 6.

5. Results

5.1 Principal Component Analysis - Comparisons of Environmental Data

5.1.1 Common Tern – All Variables

With the full set of PCA variables, it can be seen that Farnes, Isle of May, Coquet and Leith all occupy the same space in PC1 and PC2, suggesting that these colonies are similar in terms of the major sources of variation in environmental conditions.
Strangford and Carlingford Loughs appear to lie between (but overlapping) Ribble and Larne Lough. Cemlyn seems to be something of an outlier here, but note that there are a large number of missing values for the variable ss_current, which removes most of the data points for that colony.

5.1.2 Common Tern – Reduced Set of Variables

With the reduced set of PCA variables, the picture changes dramatically. In the plot of PC2 vs PC1, the colonies do not seem to separate out at all well. The next plot, showing PC4 against PC3, shows that we need to go to the third principal component before we start getting clear colony distinction. This in itself suggests that differences between colonies are not the major source of variability in environmental conditions. In the PC4 vs PC3 plot, the patterns are similar to those in the plot in Section 5.1.1, but note that there are now more colonies included (those with all missing values in the excluded variables). Interestingly, Dungeness seems to sit well with the Irish Sea colonies, although we should stress again that from the PC2 vs PC1 plot, Dungeness does not appear noticeably different from other colonies. In either plot, Foulness and Breydon seem similar to each other; they resemble Ribble and Dungeness most closely in PC2 vs PC1, but are linked with Coquet in PC4 vs PC3.

5.1.3 Sandwich Tern – All Variables

With the full set of PCA variables, we see that Coquet, Forvie and Farnes are similar, and that Duddon overlaps Cemlyn. Larne Lough, Carlingford and Strangford lie in between these two groups.

5.1.4 Sandwich Tern – Reduced Set of Variables

With the reduced set of PCA variables, as with Common Terns we see a much less clear separation of colonies. What we do see is that Duddon now overlaps Cemlyn very well, and that the Loughs Larne, Strangford and Carlingford lie between Duddon/Cemlyn and Coquet/Farnes/Forvie.
5.1.5 Arctic Tern – All Variables

With the full set of PCA variables, there is very clear separation into groups. Coquet, Farnes and Isle of May form one group, while Carlingford and Outer Ards form another. Cemlyn is something of an outlier but, as noted for Common Terns, very many missing values for one variable mean that most observations are deleted.

5.1.6 Arctic Tern – Reduced Set of Variables

With the reduced set of PCA variables, the group separation is less clear than with the full set, but still apparent. Coquet, Farnes and Isle of May still overlap strongly, whereas Cemlyn, now less of an outlier, overlaps Outer Ards. Carlingford Lough lies between these two overlapping groups.

5.2 Cross-Validation for Selecting Predictive Models for Colonies/Species

The aim was to find a set of variables that were consistent predictors across the different colonies for which we have data, as it is more likely that these will be successful at making predictions for new colonies. We took this approach, rather than considering all variables when selecting a model for a combination of different colonies, because the latter approach would have tended to select variables that explain a difference in intercept between colonies (which is of no interest), as well as those which explain the pattern of foraging within a colony. In theory, as we have used a ratio of 12 controls to each data point, we would expect the intercept to be the same for each colony. In practice, however, it differs because points have been excluded where control tracks fell on land and where there are missing covariate values.

The variables that are consistently selected are dist_col, dist_shore, bathy_1sec and chl_june. When considering models for combinations of sites in the cross-validation exercise, the candidate variables were reduced to this set. The variable that most commonly appeared to have a nonlinear effect in the Phase I models was dist_col.
We therefore considered GAM models with a nonlinear term in dist_col as possible candidate models, but constrained the other terms to be linear.

Removal of some variables and log-transformation of others, as discussed in Section 3.2, led to some changes in the models for single colonies developed in Phase I of the project. New models selected using either AIC (Akaike’s Information Criterion) or likelihood ratio tests (LRT) are shown below. This is to demonstrate that dist_col, dist_shore, bathy_1sec and chl_june were being selected consistently; where AIC selects additional variables that are not on this list we present only the results for LRT.

Arctic terns
  Coquet: dist_col, chl_june, bathy_1sec (AIC and LRT)
  Farnes: dist_col, dist_shore, sal_spring (AIC; LRT omits dist_shore)
  Outer Ards: dist_col, chl_june, ss_wave, ss_current (AIC and LRT)

Common terns
  Cemlyn: dist_col, bathy_1sec (AIC and LRT; salinity excluded)
  Leith: dist_col, dist_shore, chl_may, chl_june, sal_summ (LRT)
  Coquet: dist_col, bathy_1sec, chl_june (LRT)
  Larne Lough: dist_col, dist_shore, chl_june, bathy_1sec (LRT)

Sandwich terns
  Cemlyn: dist_col, dist_shore, chl_apr, chl_june (AIC; LRT omits dist_shore and chl_june)
  Coquet: dist_col, dist_shore (LRT)
  Farnes: dist_shore, summ_front, spring_front, bathy_1sec, sal_summ (AIC)
  Forvie: dist_shore, strat_temp (AIC and LRT)
  Larne Lough: dist_col, dist_shore (LRT, after excluding covariates with large numbers of missing values)
  Cockle Island: dist_col, chl_june, ss_current (AIC and LRT)

Cross-validation results are shown in Table 4 below. Note that due to the large number of missing chlorophyll values for Larne Lough, chlorophyll was excluded when making predictions for Larne Lough, and from any models in which Larne Lough is the sole colony used to make the predictions. In general, predictions are better when data from all available colonies for that species are combined.
There are some cases where predictions based on a single colony are slightly better than those based on all colonies combined, but they can be considerably worse. In the final models we have therefore chosen to use data from all available colonies for each species, to provide a consistent approach. The use of GAM models with a non-linear term for distance to colony sometimes makes predictions worse when the model is applied to another colony. Chakraborty et al. (2011) note in general that GAMs can be poor for out-of-sample prediction. Linear terms only were therefore used in the final predictive models.

The following covariates were used for each species in the final models:

  Arctic terns: distance to colony and bathymetry
  Common terns: distance to colony, distance to shore and bathymetry
  Sandwich terns: distance to colony, distance to shore, bathymetry and June chlorophyll concentration. (June chlorophyll concentration was omitted for Strangford and Carlingford Loughs due to the large number of missing values.)

Full details of the models are presented in the Results Appendix.

Table 4. Results of cross-validation. Models with lower values of the sum of squared errors and higher values (i.e. lower absolute values) of the LL score are better; the best model in each case is shown in bold type. The notation s(dist_col) indicates a GAM with a nonlinear term in distance to colony.
Species | Colony to predict | Model developed for | Covariates | LL Score | Sum of Squared Errors
Arctic | Coquet | Farnes | dist_col | -9612 | 2582
Arctic | Coquet | Farnes, Outer Ards | dist_col, bathy_1sec | -9560 | 2558
Arctic | Farnes | Coquet | dist_col, chl_june, bathy_1sec | -4641 | 1306
Arctic | Farnes | Coquet | s(dist_col), bathy_1sec, chl_june | -4770 | 1332
Arctic | Farnes | Coquet, Outer Ards | dist_col, chl_june, bathy_1sec | -4297 | 1217
Arctic | Farnes | Coquet, Outer Ards | dist_col, bathy_1sec | -4132 | 1180
Common | Coquet | Cemlyn | dist_col, bathy_1sec | -7657 | 1970
Common | Coquet | Leith | dist_col, dist_shore, chl_may, chl_june, sal_summ | -5979 | 1749
Common | Coquet | Leith | dist_col, chl_june | -5868 | 1744
Common | Coquet | Cemlyn, Leith, Larne Lough, Mull, Farnes | dist_col, bathy_1sec | -6165 | 1769
Common | Coquet | Cemlyn, Leith, Larne Lough, Mull, Farnes | dist_col, bathy_1sec, dist_shore | -6108 | 1761
Common | Coquet | Cemlyn, Leith, Larne Lough, Mull, Farnes | s(dist_col), bathy_1sec | -6123 | 1772
Common | Cemlyn | Coquet | dist_col, bathy_1sec, chl_june | -3026 | 916
Common | Cemlyn | Coquet | s(dist_col), bathy_1sec, chl_june | -3021 | 929
Common | Cemlyn | Larne Lough | s(dist_col), bathy_1sec, dist_shore | -4497 | 1326
Common | Cemlyn | Coquet, Leith, Farnes | dist_col, bathy_1sec, chl_june, dist_shore | -3213 | 971
Common | Cemlyn | Coquet, Leith, Farnes | s(dist_col), bathy_1sec, dist_shore | -3325 | 1049
Common | Cemlyn | Coquet, Leith, Larne Lough, Mull, Farnes | dist_col, bathy_1sec, dist_shore | -3331 | 1007
Common | Cemlyn | Coquet, Leith, Larne Lough, Mull, Farnes | dist_col, bathy_1sec, chl_june, dist_shore | -3324 | 1004
Common | Leith | Coquet | dist_col, bathy_1sec, chl_june | -16574 | 4514
Common | Leith | Coquet | s(dist_col), bathy_1sec, chl_june | -16091 | 4422
Common | Leith | Coquet, Cemlyn, Farnes | dist_col, bathy_1sec, chl_june | -17024 | 4627
Common | Leith | Cemlyn, Coquet, Larne Lough, Mull, Farnes | dist_col, dist_shore, bathy_1sec | -14806 | 4139
Common | Leith | Cemlyn, Coquet, Larne Lough, Mull, Farnes | s(dist_col), bathy_1sec | -14761 | 4134
Common | Larne Lough | Cemlyn | dist_col, bathy_1sec | -9214 | 1477
Common | Larne Lough | Coquet, Farnes, Leith, Mull | dist_col, bathy_1sec, dist_shore | -5034 | 1373
Common | Larne Lough | Cemlyn, Coquet, Leith, Mull, Farnes | dist_col, bathy_1sec, dist_shore | -5046 | 1375
Common | Larne Lough | Cemlyn, Coquet, Leith, Mull, Farnes | s(dist_col), bathy_1sec, dist_shore | -5185 | 1415
Sandwich | Larne Lough | Cockle Island | dist_col, dist_shore | -2692 | 794
Sandwich | Larne Lough | Cockle Island, Cemlyn | dist_col, dist_shore, bathy_1sec | -2523 | 736
Sandwich | Larne Lough | Cockle Island, Cemlyn | s(dist_col), dist_shore | -2942 | 879
Sandwich | Larne Lough | Cockle Island, Cemlyn, Coquet, Farnes, Forvie | dist_col, dist_shore, bathy_1sec | -2063 | 595
Sandwich | Larne Lough | Cockle Island, Cemlyn, Coquet, Farnes, Forvie | s(dist_col), dist_shore, bathy_1sec | -2250 | 697
Sandwich | Cemlyn | Larne Lough, Cockle Island | dist_col, bathy_1sec | -6742 | 2016
Sandwich | Cemlyn | Larne Lough, Cockle Island, Coquet, Farnes, Forvie | dist_col, bathy_1sec, dist_shore | -6729 | 1834
Sandwich | Cemlyn | Larne Lough, Cockle Island, Coquet, Farnes, Forvie | s(dist_col), dist_shore, bathy_1sec | -6606 | 1834
Sandwich | Cockle Island | Cemlyn, Larne Lough | dist_col, bathy_1sec, dist_shore, chl_june | -5990 | 1387
Sandwich | Cockle Island | Cemlyn, Larne Lough, Coquet, Farnes, Forvie | dist_col, bathy_1sec, dist_shore, chl_june | -4760 | 1282

6. Prediction Maps of Usage

To calculate usage, preference is divided by distance to colony and multiplied by a scale factor which ensures that the probabilities sum to one. For mapping purposes, the probabilities have been multiplied by 1000. A very small number of points closest to the colony were removed if this gave a value greater than 50.

Maps are provided for the following species and colonies:
  Common Tern, Dungeness
  Common Tern, Foulness
  Common Tern, Breydon Water
  Common Tern, Dee estuary
  Common Tern, Ribble estuary
  Common Tern, Strangford Lough
  Common Tern, Carlingford Lough
  Common Tern, Farne Islands
  Common Tern, Isle of May
  Arctic Tern, Strangford Lough
  Arctic Tern, Isle of May
  Sandwich Tern, Duddon estuary
  Sandwich Tern, Strangford Lough
  Sandwich Tern, Carlingford Lough

7. Discussion

Cross-validation has shown that it is generally better to combine all available data for a tern species when making predictions for a new colony, rather than basing predictions on a colony or colonies that appear to be ecologically similar. This means that the model developed for each species is more robust, because it is based on data from a larger number of tracks, and covering a wider range of environments.
It might be possible to give the colonies differing weights, but it is unclear how such weights should be chosen. They would need to take account of the amount of data available for each colony, as well as its ecological similarity to the colony being predicted, which is difficult to measure in relation to a tern’s assessment of its environment. The analysis has also shown that, whereas the best models for predicting the available data from a colony may involve many covariates and non-linear terms, simple linear models with a small number of variables (in particular distance to colony, distance to shore, bathymetry and chlorophyll concentration) are better for extrapolating to a different colony.

References

Aarts, G., MacKenzie, M., McConnell, B., Fedak, M. and Matthiopoulos, J. (2008) Estimating space-use and habitat preference from wildlife telemetry data. Ecography, 31, 140-160.

Brewer, M.J., Potts, J.M., Duff, E.I. and Elston, D.A. (2012) To carry out tern modelling under the Framework Agreement C10-0206-0387. Report submitted to: Joint Nature Conservation Committee.

Brier, G.W. (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1-3.

Chakraborty, A., Gelfand, A.E., Wilson, A.M., Latimer, A.M. and Silander, J.A. (2011) Point pattern modelling for degraded presence-only data over large regions. Applied Statistics, 60, 757-776.

INLA (2012) R-Package, http://www.r-inla.org/

Liu, C., White, M. and Newell, G. (2011) Measuring and comparing the accuracy of species distribution models with presence-absence data. Ecography, 34, 232-243.

Lobo, J.M., Jiménez-Valverde, A. and Real, R. (2008) AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17, 145-151.

Wakefield, E.D., Phillips, R.A., Trathan, P.N., Arata, J., Gales, R., Huin, N., Robertson, G., Waugh, S.M., Weimerskirch, H. and Matthiopoulos, J. (2011) Habitat preference, accessibility, and competition limit the global distribution of breeding Black-browed Albatrosses. Ecological Monographs, 81, 141-167.