NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA
Analysis of Temporal (Stratigraphic) and Spatial Data
John Birks

ANALYSIS OF TEMPORAL AND SPATIAL DATA
Introduction
Temporal stratigraphic data
  Single sequence
    Partitioning or zonation
    Sequence splitting
    Rate-of-change analysis
    Gradient analysis and summarisation
    Analogue matching
    Relationships between two or more sets of variables in same sequence
  Two or more sequences
    Sequence comparison and correlation
  Multi-proxy studies
  Hypothesis testing
Spatial geographical data
  Spatial autocorrelation
  Spatially constrained clusterings
  Spatially constrained ordinations
  Predictive models for spatial data

INTRODUCTION
Analysis of quadrats, lakes, streams, etc.: assume no autocorrelation, namely that the values of a variable at some point in space cannot be predicted from known values at other sampling points.
PALAEOECOLOGY – fixed sample order in time; strong autocorrelation (temporal autocorrelation)
STRATIGRAPHICAL DATA
– biostratigraphic, lithostratigraphic, geochemical, geophysical, morphometric, isotopic
– multivariate
– continuous or discontinuous time series
– ordering very important – display, partitioning, trends, interpretation
SPATIAL DATA
– many types, spatial autocorrelation
– spatial or geographical co-ordinates very important
– raises problems of statistical inference as samples are not independent

TEMPORAL STRATIGRAPHIC DATA
ANALYSIS OF SINGLE SEQUENCE

ZONATION OR PARTITIONING
Useful for:
1) description
2) discussion and interpretation
3) comparisons in time and space
A zone is a “sediment body with a broadly similar composition that differs from underlying and overlying sediment bodies in the kind and/or amount of its composition”.

CONSTRAINED CLUSTERINGS
1) Constrained agglomerative procedures: CONSLINK, CONISS
2) Constrained binary divisive procedures: SPLITLSQ, SPLITINF
   Partition into g groups by placing g – 1 boundaries.
   Number of possibilities: $\binom{n-1}{g-1}$, which is $n - 1$ for $g = 2$.
   Compared with the non-constrained situation: $2^{n-1} - 1$.
   Criteria
   – within-group sum-of-squares or variance
   – within-group information $\sum_{i=1}^{n} \sum_{k=1}^{m} p_{ik} \log (p_{ik}/q_{ik})$
3) Constrained optimal divisive analysis: OPTIMAL
   2 groups: one boundary (n1); 3 groups: two boundaries (n1, n2); 4 groups: three boundaries (n1, n2, n3)
4) Variable barriers approach: BARRIER
All methods in one program: ZONE
RIOJA

Pollen diagram and numerical zonation analyses for the complete Abernethy Forest 1974 data set. Birks & Gordon (1985)
CONISS = constrained incremental sum-of-squares (= constrained Ward's minimum variance)

OPTIMAL SUM OF SQUARES PARTITIONS OF THE ABERNETHY FOREST 1974 DATA
Number of groups g (zones)   Percentage of total sum-of-squares   Markers
 2                           59.3                                 15
 3                           28.4                                 15 32
 4                           18.9                                 15 33 41
 5                           14.7                                 15 33 41 45
 6                           10.6                                 15 32 34 41 45
 7                            8.1                                 15 26 32 34 41 45
 8                            5.8                                 8 15 26 32 34 41 45
 9                            4.7                                 8 15 24 29 32 34 41 45
10                            3.9                                 8 15 24 29 32 33 34 41 45

HOW MANY ZONES?
K D Bennett (1996) Determination of the number of zones in a biostratigraphical sequence. New Phytologist 132, 155-170
Broken stick model: $\Pr = \frac{1}{n} \sum_{i=k}^{n} \frac{1}{i}$
RIOJA (R), BSTICK

Ioannina Basin, Tzedakis (1994)
Pollen percentage diagram plotted against depth. Lithostratigraphic column is represented; symbols are based on Troels-Smith (1955).
Variance accounted for by the nth zone as a proportion of the total variance (fluctuating curve, original data) compared with values from a broken-stick model (smooth curve): (a) randomized data set, (b) original data set. Zonation method: binary divisive using the information content statistic. Data set: Ioannina. Bennett (1996)

Technical Point
It turns out that the binary divisive procedures SPLITINF and SPLITLSQ of Gordon and Birks (1972) are an early implementation of De’ath’s (2002) multivariate regression trees (MRT) discussed in the Modern Regression lecture.
Both are MRTs where a vector of sample depths or ages is used as the sole explanatory predictor variable.
SPLITINF = distance-based MRT with information content as the dissimilarity measure
SPLITLSQ = MRT with Euclidean distance as the distance measure
The advantage of MRT over SPLITINF/SPLITLSQ as a zonation procedure is that the k-fold cross-validation in CARTs provides a simple way to assess the number of zones into which the stratigraphical sequence should be split. MRT using the optimal partitioning approach is still to be implemented.
mvpart (R)

SEQUENCE SPLITTING
Walker & Wilson (1978) J. Biogeog. 5, 1–21
Walker & Pittelkow (1981) J. Biogeog. 8, 37–51
SPLIT, SPLIT2, BOUND2
Need statistically ‘independent’ curves:
– Pollen influx (grains cm⁻² year⁻¹)
– PCA or CA or DCA axes
– Aitchison log-ratio transformation: $Z_{ik} = \log (p_{ik} / \bar{p}_i)$, where $\log \bar{p}_i = \frac{1}{m} \sum_{k=1}^{m} \log p_{ik}$
CANOCO, LOGRATIO

Correlograms of sequence splits with charcoal, inorganic matter and total pollen influxes for three sections of the pollen record. The vertical scales give correlations; the horizontal scales give time lag in years (assuming a sampling interval of 50 years).

Technical Point
The sequence splitting of Walker and Wilson (1978) is a precursor of regression trees within CART (see Modern Regression lecture). In a regression tree a quantitative response variable, in our case a stratigraphical sequence of taxon A, is repeatedly split so that at each partition the sequence is divided into two mutually exclusive groups, each of which is as homogeneous as possible. In the regression tree implementation, a vector of sample depths or ages is used as the sole explanatory predictor variable. The splitting is then applied to each group separately until some stopping rule is reached. Usually k-fold cross-validation is used to find the optimal tree size using cost-complexity (CC) pruning.
$CC = T_{impurity} + \alpha (T_{complexity})$
where $T_{impurity}$ is the impurity of the current tree over all terminal nodes; $T_{complexity}$ is the number of terminal leaves; and $\alpha$, a real number > 0, is the tuning parameter of CC pruning. CC represents the trade-off between tree size and goodness-of-fit. Small values of $\alpha$ give large trees; large values of $\alpha$ lead to small trees. Starting with the full tree, search to identify the terminal node that results in the lowest CC for a given value of $\alpha$. As the penalty $\alpha$ on tree complexity is increased, the tree that minimises CC becomes smaller and smaller until the penalty is so great that a tree with a single node (i.e. the original data) has the lowest CC. The search produces a sequence of progressively smaller trees with associated CC. k-fold cross-validation is used to find the optimal value of $\alpha$ that gives the minimal root mean squared error (RMSE). An alternative is to select the smallest tree that lies within 1 standard error of the RMSE of the best tree.
rpart (R)

RATE OF CHANGE ANALYSIS
Amount of palynological compositional change per unit time.
Calculate dissimilarity between pollen assemblages of two adjacent samples and standardise to a constant time unit, e.g. 250 14C years.
Jacobson & Grimm (1986) Ecology 67, 958-966
Grimm & Jacobson (1992) Climate Dynamics 6, 179-184
RATEPOL, POLSTACK (TILIA)
Graph of distance (number of standard deviations) moved every 100 yr in the first three dimensions of the ordination vs age. Greater distance indicates greater change in pollen spectra in 100 yr. Jacobson & Grimm (1986)

GRADIENT ANALYSIS OF SINGLE SEQUENCE
Ordination methods
– CA/DCA joint plot or PCA biplot
– Constrained CA or PCA
– Sample summary: CA/DCA/PCA
– Species arrangement: CCA or simple discriminants
CA = correspondence analysis
DCA = detrended correspondence analysis
PCA = principal components analysis
CCA = canonical correspondence analysis
VEGAN, CANOCO

PCA Biplot (74.6%), Gordon, 1982
Biplot of the Kirchner Marsh data; C2 = 0.746.
The lengths of the Picea and Quercus vectors have been scaled down relative to the other vectors. Stratigraphically neighbouring levels are joined by a line.

CA Joint Plot (62%), Gordon, 1982
Correspondence analysis representation of the Kirchner Marsh data; C2 = 0.620. Stratigraphically neighbouring levels are joined by a line.

Stratigraphical plot of sample scores on the first correspondence analysis axis (left) and of rarefaction estimate of richness (E(Sn)) (right) for Diss Mere, England. Major pollen-stratigraphical and cultural levels are also shown. The vertical axis is depth (cm). The scale for sample scores runs from –1.0 (left) to +1.2 (right).

Haberle & Bennett 2004
The 1st and 2nd axes of the Detrended Correspondence Analysis for Laguna Oprasa and Laguna Facil plotted against calibrated calendar age (cal yr BP). The 1st axis contrasts taxa from warmer forested sites with cooler herbaceous sites. The 2nd axis contrasts taxa preferring wetter sites with those preferring drier sites.

Species arrangement
Percentage pollen and spore diagram from Abernethy Forest, Inverness-shire. The percentages are plotted against time, the age of each sample having been estimated from the deposition time. Nomenclatural conventions follow Birks (1973a) unless stated in Appendix 1. The sediment lithology is indicated on the left side, using the symbols of Troels-Smith (1955). The pollen sum, P, includes all non-aquatic taxa. Aquatic taxa, pteridophytes, and algae are calculated on the basis of P + group as indicated.
Pollen types re-arranged on the basis of the weighted average for depth.
TRAN

ANALOGUE ANALYSIS
Modern training set
– similar taxonomy
– similar sedimentary environment
Compare fossil sample 1 with all modern samples using an appropriate dissimilarity coefficient (DC), find the sample in the modern set ‘most like’ (i.e. lowest DC) fossil sample 1, call it the ‘closest analogue’, repeat for fossil sample 2, etc.
Overpeck et al. (1985) Quat. Res.
23, 87–108
ANALOG, MATCH, MAT
ANALOGUE – R package
RIOJA
Flow of the procedure:
– Compare fossil sample i with modern sample j
– Calculate similarity between i and j (Sij)
– Repeat for all modern samples
– Find modern sample with highest similarity: 'ANALOGUE'
– Evaluation?
– Repeat for all fossil samples

Dissimilarity coefficients, radiocarbon dates, pollen zones, and vegetation types represented by the top ten analogues from the Lake West Okoboji site.
Maps of squared chord distance values with modern samples at selected time intervals.
Plots of minimum squared chord distance for each fossil spectrum at each of the eight sites.

Analogues and lake restoration
Flower et al. (1997)
A schematic representation of how fossil diatom zones/samples in a sediment core from an acidified lake can be compared numerically with modern surface sediment samples collected from potential modern analogue lakes. In this space-for-time model the vertical axis represents sedimentary diatom zones defined by depth and time; the horizontal axis represents spatially distributed modern analogue lakes and the dotted lines indicate good floristic matches (dij < 0.65), as defined by the mean squared Chi-squared estimate of dissimilarity (SCD, see text). Flower et al. (1997)

COMPARISON AND CORRELATION BETWEEN TIME SERIES
Two or more stratigraphical sets of variables from same sequence. Are the temporal patterns similar?
(1) Separate ordinations
    Oscillation log-likelihood G-test or χ² test
(2) Constrained ordinations
    Pollen data – 3 or 4 ordination axes or major patterns of variation: Y
    Chemical data – 3 or 4 ordination axes: X
    Depth as a covariable
    Does 'chemistry' explain or predict 'pollen'? i.e. is variance in Y well explained by X?
Lotter et al. (1992) J. Quat. Sci.
Pollen, 16O/18O, (depth): 34%, 16%, 12%, 79%, 12%, 4%, 1%

Pollen, oxygen-isotope stratigraphy, and sediment composition of Aegelsee core AE-1 (after Wegmüller and Lotter 1990)
Pollen and oxygen-isotope stratigraphy of Gerzensee core G-III (after Eicher and Siegenthaler 1976)
Is there a statistically significant relationship between the pollen stratigraphy and the stable-isotope record?

Summary of the results from detrended correspondence analysis (DCA) of late-glacial pollen spectra from five sequences. The percentage variance represented by each DCA axis is listed.
Reduce pollen data to DCA axes. Use these then as ‘responses’.
Site              No. of samples   No. of taxa   DCA axis 1   Axis 2   Axis 3   Axis 4
Aegelsee AE-1     100              26            57.2         12.0     2.3      1.4
Aegelsee AE-3      54              32            44.3          3.3     1.5      1.4
Gerzensee G-III    65              28            37.6          4.0     1.2      0.9
Faulenseemoos      62              25            44.1         18.8     5.0      3.8
Rotsee RL-250      44              23            38.2         13.3     3.1      2.3

Results of redundancy analysis and partial redundancy analysis permutation tests for the significance of axis 1 when oxygen isotopes and depth are predictor variables, when oxygen isotopes are the only predictor, and when oxygen isotopes are the predictor variable and depth is a covariable.
Site              Predictors: δ18O and depth   Predictor: δ18O   Predictor: δ18O, covariable: depth   No. of response variables (pollen DCA axes)
Aegelsee AE-1     0.01a                        0.01a             0.02a                                2
Aegelsee AE-3     0.01a                        0.16              0.20                                 1
Gerzensee G-III   0.01a                        0.46              0.57                                 1
Faulenseemoos     0.01a                        0.01a             0.01a                                3
Rotsee RL-250     0.01a                        0.21              0.08                                 2
a Significant at p < 0.05 (Lotter et al. 1992)

MULTI-PROXY STUDIES
In multi-proxy studies (e.g. pollen, diatoms, chironomids, etc.
studied on the same core), an important question is ‘are the major stratigraphical patterns of variation (‘signal’) the same in all proxies?’

Laguna Facil, southern Chile
Massaferro et al. 2005 Quaternary Science Reviews 24: 2510-2522
Pollen and chironomids studied on the same core.
Simplified each data-set to the first ordination axes of a correspondence analysis (CA) and a principal components analysis (PCA) for both data-sets.
Chironomid stratigraphy. Massaferro et al. 2005
Pollen stratigraphy. Massaferro et al. 2005
Can detect similarities and differences in both proxies:
1. Major change in both prior to 14,700 cal yr BP.
2. Changes in the chironomids tend to lag behind changes in the pollen. Perhaps a chironomid response to changes in vegetation (tree canopy and forest type) or lake chemistry, resulting from changes in catchment soils as a result of vegetational change.
3. At about 7200 cal yr BP, chironomids change before the pollen. May be a response to climate change.
4. Strong correlations between the charcoal stratigraphy and pollen and chironomid stratigraphies. Probable importance of fire and/or volcanism in influencing both vegetational and limnological dynamics.
Massaferro et al. 2005

Can use ordination methods to summarise several palaeoecological proxies and to compare with other proxies.
Haberle et al. 2006
Lake Euramoo, NE Queensland, last 800 years
Major changes between pre-European period (A) and European settlement (B).
Tested how well different proxies ‘predict’ or ‘explain’ (in a statistical sense) other proxies.
The only proxy that significantly predicted other proxies was pollen, which predicted changes in diatoms (25.4%) and chironomids (15.4%).
Illustrates the importance of the catchment and its vegetation for the lake and its biota.

Assessing Potential External 'Drivers' on an Aquatic Ecosystem
Bradshaw et al.
2005 The Holocene 15: 1152-1162
Dalland Sø, a small (15 ha), shallow (2.6 m) lowland eutrophic lake on the island of Funen, Denmark.
Catchment (153 ha) today:
– agriculture 77 ha
– built-up areas 41 ha
– woodland 32 ha
– wetlands 3 ha
Nutrient rich – total P 65-120 mg l-1
Map of Dalland Sø
Multi-proxy study to assess role of potential external 'drivers' or forcing functions on changes in the lake ecosystem in the last 7000 yrs.

Data                                             No. of samples   Transformation
Sediment loss-on-ignition %                      560              None
Sediment dry mass accumulation rate              560              Log (x + 1)
Sediment minerogenic matter accumulation rate    560              Log (x + 1)
Plant macrofossil concentrations                 280              Log (x + 1)
Pollen %                                          90              None
Diatoms %                                        118              None
Diatom inferred total P                          118              None
Biogenic silica                                   84              Not used
Pediastrum %                                      90              None
Zooplankton                                       31              Not used

Terrestrial landscape or catchment development. Bradshaw et al. 2005
Aquatic ecosystem development. Bradshaw et al. 2005

DCA of pollen and diatom data separately to summarise major underlying trends in both data sets.
Pollen – high scores for trees, low scores for light-demanding herbs and crops
Diatom – high scores mainly planktonic and large benthic types, low scores for Fragilaria spp. and eutrophic spp. (e.g. Cyclostephanos dubius)
Bradshaw et al. 2005

Major contrast between samples before and after Late Bronze Age forest clearances ('Lake' and 'Catchment' panels).
Prior to clearance, lake experienced few impacts. After the clearance, lake heavily impacted.
Bradshaw et al. 2005

Canonical Correspondence Analysis
Response variables: Diatom taxa
Predictor variables: Pollen taxa, LOI, dry mass and minerogenic accumulation rates, plant macrofossils, Pediastrum
Covariable: Age
69 matching samples
Partial CCA with age partialled out as a covariable.
Makes interpretation of effects of predictors easier by removing temporal trends and temporal autocorrelation.
Partial CCA, all variables: 18.4% of variation in diatom data explained by Poaceae pollen, Cannabis-type pollen, and Daphnia ephippia, the only three independent and statistically significant predictors.
As different external factors may be important at different times, the data were divided into 50 overlapping data sets of 20 samples each – samples 1-20, 2-21, 3-22, etc.
Bradshaw et al. 2005

CCA of 50 subsets from bottom to top and % variance explained
1. 4520-1840 BC: Poaceae is sole predictor variable (20-22% of diatom variance)
2. 3760-1310 BC: LOI and Populus pollen (16-33%)
3. 3050-600 BC: Betula, Ulmus, Populus, Fagus, Plantago, etc. (17-40%)
   i.e. in these early periods, diatom change influenced to some degree by external catchment processes and terrestrial vegetation change.
4. 2570 BC – 1260 AD: Erosion indicators (charcoal, dry mass accumulation), retting indicator Linum capsules, Daphnia ephippia, Secale and Hordeum pollen (11-52%)
   i.e. changing water depth and external factors
5. 160 BC – 1900 AD: Hordeum, Fagus, Cannabis pollen, Pediastrum boryanum, Nymphaea seeds (22-47%)
   i.e. nutrient enrichment as a result of retting hemp, also changes in water depth and water clarity
Bradshaw et al. 2005
Strong link between inferred catchment change and within-lake development. Timing and magnitude are not always perfectly matched, e.g. transition to Mediæval Period.

ANALYSIS OF TWO OR MORE SEQUENCES
Regional zones, description of common features, interpretation, detection of unique features.
Sequence comparison and correlation.
Sequence slotting: SLOTSEQ, FITSEQ, CONSSLOT
Combined scaling of two or more sequences: CANOCO

SLOTSEQ
Slotting of the sequences S1 (A1, A2, ..., A10) and S2 (B1, B2, ..., B7), illustrating the contributions to the measure of discordance ψ(S1, S2) and the 'length' of the sequences, m(S1, S2).
The results of sequence slotting of the Wolf Creek and Horseshoe Lake pollen sequences (ψ = 2.095). Radiocarbon dates for the pollen zone boundaries are also given, expressed as radiocarbon years before present (BP). Birks & Gordon (1985)

Comparison of oxygen-isotope records from Swiss lakes Aegelsee (AE-3), Faulenseemoos (FSM) and Gerzensee (G-III) with the Greenland Dye 3 record (Dansgaard et al., 1982). LST marks the position of the Laacher See Tephra (11,000 yr BP). Letters and numbers mark the position of synchronous events (for details see text). Lotter et al. (1992)

Psi (ψ) values for pair-wise sequence slotting of the stable-isotope stratigraphy at five Swiss late-glacial sites and the Dye 3 site in Greenland. Values above the diagonal are for constrained slotting, using the three major shifts shown in the previous figure; values below the diagonal are for sequence slotting in the absence of any external constraints. The mean δ18O and standard deviation for each sequence are also listed.
CONSLOXY

FUGLA NESS, Shetland
Pollen diagram from Sel Ayre showing the frequencies of all determinable and indeterminable pollen and spores expressed as percentages of total pollen and spores (P). Abbreviations: undiff. = undifferentiated, indet = indeterminable.

Comparison of Bjärsjöholmssjön and Färskesjön using principal component analysis. The mean scores of the local pollen zones and the ranges of the sample scores in each zone are plotted on the first and second principal components, and are joined up in stratigraphic order. The Blekinge regional pollen assemblage zones are also shown. Birks & Berglund (1979)

Comparison of Färskesjön and Lösensjön using principal component analysis. The mean scores of the local pollen zones and the ranges of the sample scores in each zone are plotted on the first and second principal components, and are joined up in stratigraphic order. The regional pollen assemblage zones are also shown.
Haberle & Bennett, 2004
The 1st and 2nd axes of the Detrended Correspondence Analysis for Laguna Oprasa and Laguna Facil plotted against calibrated calendar age (cal yr BP). The 1st axis contrasts taxa from warmer forested sites with cooler herbaceous sites. The 2nd axis contrasts taxa preferring wetter sites with those preferring drier sites.

Tzedakis & Bennett (1995)
Pollen percentage diagram of selected taxa plotted against depth. Lithostratigraphic symbols are based on Troels-Smith (1955). For correlations and ages see Tzedakis (1993, 1994).
Pollen percentage diagrams of selected arboreal taxa of the Metsovon, Zista, Pamvotis, and Dodoni I and II forest periods of Ioannina 249. Stages: 5e, 7c, 9c, 11a + b + c. Tzedakis & Bennett (1995)

Solar insolation values of mid-month day for selected periods at latitude 39º40'N. Values are given for July and January extremes and July minus January for each interglacial period calculated at thousand-year intervals. Values are expressed in cal cm⁻² day⁻¹. In parentheses are percentage differences from 10 ka values. Timing of extreme insolation excursions is also given. Data from a computer program written by N.G. Pisias, based on Berger (1978). Chronology based on Imbrie et al. (1984) and Martinson et al. (1987). Tzedakis & Bennett, 1995

Combined plot of sample scores on the first two principal components for Metsovon, Zista, Pamvotis, and Dodoni I forest periods. Asterisks indicate the base of the intervals considered.

Results of comparison of vegetation and climatic signatures of different interglacial periods. '+' sign means similar and '-' means different. First sign refers to climate and second to vegetation character.
Different climate, similar pollen in one comparison.

TEMPORAL DATA - few (10–20) points e.g.
from monitoring
  Rate of change
  Gradient analysis (unconstrained, constrained)
  Principal response curves
  Variance partitioning
  Trend analysis – regression against time, Monte Carlo permutation testing
- many (>100) points
  Time-series analysis – see Gavin Simpson’s lecture

HYPOTHESIS TESTING
Lake Development and Catchment Change
Assessing potential 'drivers' on aquatic ecosystems. What determines changes in lake organisms and lake sediments?
1. External climate forcing functions
2. Catchment forcing functions
3. Lake as isolated system that evolves through time with its own internal dynamics
Birks et al. 2000

(a) Sägistalsee, Bernese Oberland, Swiss Alps (Andy Lotter)
A.F. Lotter et al. 2003 J. Paleolimnology 30: 253-342
Age-depth model; sedimentation rate. Lotter & Birks 2003
Figures: Lotter & Birks 2003; Wick et al. 2003; Heiri & Lotter 2003

Sägistalsee, Switzerland
Ideal study:
1. Critical ecological situation at tree-line today; sensitive
2. One core. Many proxies (pollen, macros, chironomids, cladocera, grain size, sediment magnetics, sediment geochemistry)
3. Well dated; 18 AMS 14C dates on terrestrial plant material
4. Well co-ordinated by A.F. Lotter
5. High quality data:
   Data-set        No. of samples   No. of taxa/variables
   Pollen          212              203
   Plant macros    372               53
   Chironomids      82               30
   Cladocera       112                7
   Geochemistry    176               14
   Grain-size      294                6
   Magnetics       504                5
6. Consistent numerical methodology on all proxies
7. Numerical methods used to test hypotheses about the influence of climate and catchment processes on the aquatic ecosystem in the perspective of the Holocene time-scale (partial redundancy analysis with restricted Monte Carlo permutation tests).
   Of the catchment changes, the main ones appear to be the spread of Picea abies at about 6300 cal BP and Bronze Age and subsequent forest clearances and conversion to grazing pastures.
8.
Split proxy data into one predictor variable (plant macrofossils as a reflection of catchment vegetation) and several response variables (cladocera, chironomids, pollen, sediment grain-size, magnetics, geochemistry) Predictor variables: Lotter & Birks 2003 Hypotheses tested: 1. Climate has had a significant control on lake ecosystem changes 2. Catchment vegetation has played significant role on lake changes "Responses" (proxies) Terrestrial Pollen Macrofossils Lake biotic Chironomids Cladocera Lake abiotic Grain size Magnetics Geochemistry Scale Climate a significant predictor? Catchment vegetation a significant predictor? Y Y - - Lake Lake N N Y Y Lake Lake Lake * Y Y (Y) # Catchment & regional Catchment * Tested against insolation, central European cold phases, & Atlantic IRD record # Veg phases: Betula-Pinus cembra; Alnus-Pinus cembra; Picea abies ~ 6300 cal BP; Pasture phases from Bronze Age to present SPATIAL GEOGRAPHICAL DATA Geographical co-ordinates X, Y Spatial analysis Legendre & Fortin (1989) Vegetatio 80: 107-138 Legendre (1993) Ecology 74: 1659-1673 Koenig (1999) Trends in Ecology & Evolution 14: 22-26 Borcard et al. (2004) Ecology 85: 1826-1832 STATISTICAL ANALYSIS Random sample assumption Spatial autocorrelation Effect of spatial autocorrelation on tests of correlation coefficients for randomly generated, positively autocorrelated data r -1 Confidence interval of a correlation coefficient 0 +1 True interval: r not significantly different from zero Confidence interval computed from the usual tables r 0 *** ‘Liberal’ results – too many coefficients will be judged statistically significant when, in reality, they are not SPATIAL AUTOCORRELATION Classical statistics assumes independence of observations. Ecological variables very commonly show spatial structure in the sample space. 
A variable is autocorrelated when it is possible to predict values of this variable at some points in space from the known values at other sampling points whose spatial positions are known.
Correlation in relative mean density of mountain hares between eleven provinces in Finland over 39 years (1946-85) plotted against distance between centres of provinces.

HOW TO TEST FOR SPATIAL STRUCTURE?
Spatial autocorrelation coefficients – Moran's I
H0 – no spatial autocorrelation. Each value of the I coefficient is equal to $E(I) = -(n-1)^{-1} \approx 0$, where E(I) is the expected I and n is the number of data points.
H1 – there is significant spatial autocorrelation. The value of I is significantly different from E(I).

$I(d) = \frac{n \sum_{i} \sum_{j} w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{W \sum_{i} (y_i - \bar{y})^2}$

where the y's are the values of the variable and all summations are for i and j varying from 1 to n, the number of data points, but excluding the cases where i = j. The $w_{ij}$'s take the value 1 when the pair (i, j) relates to distance class d (the one being computed) and 0 otherwise. W is the sum of the $w_{ij}$'s, i.e. the number of pairs (in the whole square matrix of distances between points) taken into account when computing the coefficient for a given distance class. I(d) is computed for each distance class d.

Moran's I usually lies between -1 and +1 but can exceed these values.
Positive I suggests positive autocorrelation. Negative I suggests negative autocorrelation.
Can test for significance by standard errors and confidence intervals or by randomisation tests.
Behaves like Pearson's correlation coefficient r, as its numerator is a sum of cross-products of centred terms (covariance term), comparing in turn the values found at all pairs of points in the given distance class. Sensitive to extreme values, like r.
Plot a CORRELOGRAM where Moran's I is plotted against distance (d).
All-directional correlogram – assumes that the phenomenon is isotropic, namely that the autocorrelation function is the same whatever direction is considered.
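As a minimal sketch of the Moran's I calculation for a single distance class, together with a simple randomisation test, the following Python/NumPy code may help. The function names `morans_i` and `permutation_p_value`, the distance-class bounds `d_lo` and `d_hi`, and the two-sided test statistic |I - E(I)| are illustrative assumptions, not part of any of the programs named in these notes.

```python
import numpy as np

def morans_i(y, coords, d_lo, d_hi):
    """Moran's I for the distance class d_lo < distance <= d_hi (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)
    if coords.ndim == 1:                      # allow a simple transect
        coords = coords[:, None]
    n = y.size
    # Euclidean distances between all pairs of sampling points
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # w_ij = 1 if pair (i, j) falls in the distance class, 0 otherwise; i = j excluded
    w = ((dist > d_lo) & (dist <= d_hi)).astype(float)
    np.fill_diagonal(w, 0.0)
    W = w.sum()
    dev = y - y.mean()                        # centred values
    ss = (dev ** 2).sum()
    if W == 0 or ss == 0:
        return float("nan")
    return float(n * (w * np.outer(dev, dev)).sum() / (W * ss))

def permutation_p_value(y, coords, d_lo, d_hi, n_perm=999, seed=0):
    """Randomisation test: permute y over the points, compare |I - E(I)| (assumed statistic)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    expected = -1.0 / (y.size - 1)            # E(I) = -(n - 1)^-1
    observed = morans_i(y, coords, d_lo, d_hi)
    hits = 0
    for _ in range(n_perm):
        perm_i = morans_i(rng.permutation(y), coords, d_lo, d_hi)
        if abs(perm_i - expected) >= abs(observed - expected):
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Four points 1 unit apart on a transect, first distance class (0, 1]
trend = morans_i([1, 2, 3, 4], [0, 1, 2, 3], 0, 1)          # 1/3: neighbours are alike
alternating = morans_i([1, -1, 1, -1], [0, 1, 2, 3], 0, 1)  # -1: neighbours are opposite
```

Calling `morans_i` once per distance class and plotting the values against distance gives the correlogram described above.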
Correlograms for artificial data. Black squares are significant at α = 0.05. Legendre & Fortin 1989
Moran's I correlogram for cross-validation residuals for transfer functions. See low I in MAT and ANN, high I in WA and GLR (ML) (spatial autocorrelation not sucked in by these methods), intermediate I in WAPLS.

SPATIALLY CONSTRAINED CLUSTERINGS
Legendre (1987) In: Evolutionary Biogeography of the Marine Algae of the North Atlantic (eds. D.J. Garbary & R.R. Soult). Springer
Legendre & Legendre (1984) Can. J. Fish. Aquat. Sci. 41, 1781-1802
Andersson (1988) Vegetatio 74, 95-106
Openshaw (1974) Computer Applic. 3-4, 136-160
Webster & Burrough (1972) J. Soil Sc. 23, 222-234
REGIONALISATION

REGULAR GRID
A) Only group objects if they are adjacent: CONCLUST
– DC matrix of objects: D
– Adjacency matrix (1/0): A (adjacent if have side or corner in common)
– Compare D and A. If not adjacent, flag as negative DC and ignore.
– Generalised agglomerative strategy, 7 methods
– As fuse, update adjacency matrix: if Dab or Dbc positive, Dabc must be positive
– Plot results as map for 10, 9, 8 ... 2 groups: CONCMAP, CONCSCR (printer, screen colours)
Observations:
1) Little difference in results between clustering methods (cf. unconstrained cluster analysis). Little difference with different DCs (within reason!).
2) Faster than unconstrained cluster analysis.
3) Spatial constraints with biogeographical data make little difference, i.e. data strongly structured themselves.

IRREGULAR GRID
B) Weight DC matrix between objects (Dij) by geographical distance (dij). Webster & Burrough (1972)
– distance weighting (dij relative to dmax, with weighting factor w)
– inverse square weighting
– exponential weighting
CONDCMAT
Weighting factor w
Similar results to CONCLUST, but does not have to be grid pattern.
Andersson (1988) neighbour weighting
1/0 data for species (variable) analysis
NEIWEI
Score = 1 for the occupied cell plus the number of occupied neighbouring cells, e.g. 1 + 8 = 9 score; 1 + 3 = 4 score ('pseudofrequency' scores).
Figure: grid of presences (+) and the resulting scores for Species A.

SPATIALLY CONSTRAINED ORDINATIONS
CCA or RDA detect simple gradients using x and y co-ordinates: $b_1 x + b_2 y$
Direction of gradient is $\tan^{-1}(b_2/b_1)$
Complex gradients:
– quadratic: $b_1 x + b_2 y + b_3 x^2 + b_4 xy + b_5 y^2$
– cubic: $b_1 x + b_2 y + b_3 x^2 + b_4 xy + b_5 y^2 + b_6 x^3 + b_7 x^2 y + b_8 xy^2 + b_9 y^3$
Trend-surface analysis
Can partial out spatial effects – remove effects of spatial autocorrelation.
CANOCO

Maps obtained by block kriging for the sample scores, on canonical axes 1 (top) and 2 (bottom), in the species space (left; CCA site scores as WA of species scores) and in the trend-surface geographic space (right; CCA site scores as linear combinations of env. variables); values multiplied by 100 for mapping. Peaks are shadowed. No samples had been taken from the blanked area on the left.

VARIANCE PARTITIONING INTO FOUR ADDITIVE COMPONENTS
a) Non-spatial environmental variation, i.e. environmental effects after partialling geographical variation (local environmental)
b) Spatially structured environmental variation, i.e. spatially covarying environmental variation (regional environmental)
c) Spatial variation not shared by environmental variables, i.e. spatial effects after partialling environmental variables (pure spatial)
d) Unexplained

Analysis          Explanatory vars   Covariables   Sum of canonical eigenvalues   %
1) CCA            Envir              -             0.268                          18.6
2) CCA            Geography          -             0.373                          25.9
3) partial CCA    Envir              Geography     0.156                          10.8
4) partial CCA    Geography          Envir         0.261                          18.1
Total inertia                                      1.443

a) Non-spatial (analysis 3): 10.8%
b) Spatially covarying environmental variation (analysis 1 minus analysis 3): 7.8%
c) Pure spatial (analysis 4): 18.1%
d) Unexplained: 63.3%

Variation partitioning of a species data table, showing that fraction (b) is the intersection of the environmental and spatial components of the species variation.
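The arithmetic behind the four fractions can be sketched in a few lines of Python. The function name `partition_variation` and its argument names are hypothetical; the input values are the sums of canonical eigenvalues and total inertia from the table above.

```python
def partition_variation(total_inertia, env, space, env_given_space, space_given_env):
    """Return fractions a-d (as % of total inertia) from four (partial) CCA results.

    env, space: sums of canonical eigenvalues of analyses 1 and 2
    env_given_space, space_given_env: sums for the partial analyses 3 and 4
    """
    def pct(x):
        return 100.0 * x / total_inertia

    a = pct(env_given_space)          # non-spatial environmental (analysis 3)
    c = pct(space_given_env)          # pure spatial (analysis 4)
    b = pct(env) - a                  # spatially structured environmental (1 minus 3)
    b_check = pct(space) - c          # same fraction obtained as analysis 2 minus 4
    d = 100.0 - (a + b + c)           # unexplained
    return {"a": a, "b": b, "b_check": b_check, "c": c, "d": d}

result = partition_variation(1.443, env=0.268, space=0.373,
                             env_given_space=0.156, space_given_env=0.261)
print({k: round(v, 1) for k, v in result.items()})
# a = 10.8, b = 7.8, c = 18.1, d = 63.3 (matching the percentages listed above)
```

Fraction (b) is not fitted directly; it is obtained by subtraction, and computing it both ways (analysis 1 minus 3, and analysis 2 minus 4) is a quick consistency check on the four analyses.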
[Figure: species variation divided into fractions (a), (b), (c), (d); environmental variance covers (a) + (b), spatial structure variance covers (b) + (c), (d) is unexplained.]

Variation partitioning of the oribatid mites data matrix
[Figure: stacked bar of percent of variation for the oribatids. Environment 13.7%, Env + space 31.0%, Space 12.2%, Undetermined 43.0%.]

Fraction A  Non-spatial environmental variation  13.7%
            'Local environment', 'pure environment' independent of space
Fraction B  Spatially-structured environmental variation  31.0%
            (Spatial component of the environmental influence)
            Substrate moisture content
Fraction C  Non-environmentally explained variation  12.2%
            Spatial structure independent of the environmental variables, 'pure spatial'

Theoretical causal relationships between environmental variables (representing processes) and community structure. Fractions (a), (b), (c) and (d) of the community data variation refer to Figure 5. ECM: Environmental control model. BCM: Biotic control model. HD: Historical dynamics. Asterisks (*) indicate factors not explicitly spelled out in the model.

(a) Non-spatial environmental variation; (b) Spatially structured env. variation; (c) Non-envir. spatial variation; (d) Unexplained

Fraction | Causal factor | Process | Effect
(a)  | Environmental factor | ECM | Community structure
(a)* | Non-spatially structured factor not included in the analysis | ECM | Env. variable in the analysis - Non-spatial community var.
(a)* | Historical events without spatial structure at the study scale | HD | Env. variable in the analysis - Non-spatial community var.
(b)  | Env. factor with spatial structure | ECM | Community spatial structure
(b)* | Spatially structured env. factor not included in the analysis | ECM | Env. variable in the analysis - Community spatial structure
(b)* | Spatially structured historical events | HD | Env. variable in the analysis - Community spatial structure
(c)* | Spatially structured factors not included in the analysis | ECM | Community spatial structure
(c)* | Spatially structured historical events | HD | Community spatial structure
(c)* | Predation, competition, etc. | BCM | Community spatial structure
(d)* | Factor not included in the analysis, not spatially structured (at study scale) | ECM | Non-explained community var.
(d)* | Biotic control factors not spatially structured (at study scale) | BCM | Non-explained community var.
(d)* | Random variation, sampling error, etc. | Noise | Non-explained community var.

Fractions: local environment; covariation between environment and space; spatial.
Major limitation of this approach is that it is unsuitable for spatial structures present at a WIDE range of different spatial scales.

Principal co-ordinates analysis of neighbour matrices (PCNM)
Borcard & Legendre (2002) Ecological Modelling 153: 51-68
Borcard et al. (2004) Ecology 85: 1826-1832
Eigenvalue decomposition of a truncated matrix of geographic distances between the sampling sites. Eigenvectors corresponding to positive eigenvalues are used as spatial descriptors in regression or canonical ordinations.
Software: SPACEMAKER, PCNM (R), spacemakeR

Borcard & Legendre (2002)
PCNM of linear transect of 100 samples, 1 m apart. Set distance threshold at 1 m to retain only the closest neighbours; replaced other distances by 1 m x 4 = 4 m.
Principal co-ordinates correspond to a series of sinusoids with decreasing periods. Largest period is n + 1, smallest is ~3.

Borcard & Legendre (2002)
Ecological data – Adiantum tomentosum abundance along transects in NE Peru. 260 adjacent 5 x 5 m subplots.
[Figure panels: (a) fern (thick line), PCNM (thin line); (b) very broad scale (thick line), broad scale (thin line); (c) medium scale; (d) fine scale.]

Oribatid mites and PCNM – irregular two-dimensional sampling
PCNM gives 43 variables with truncation distance of 1.012 m.
Show coarse broad-scale patterns and fine-scale patterns.
Forward selection in RDA retains 12 PCNM variables. Explains 45.1% of variance (cf.
43.2% in simple RDA)
RDA Axis 1   22.6% variance   R2 = 0.48   shrubs or no shrubs
RDA Axis 2    8.4% variance   R2 = 0.11   shrubs or hummocks
RDA Axis 3    4.5% variance   R2 = 0.34   areas of low water content and no shrubs
When environmental variables and a simple X-Y trend are used as covariables in an RDA with PCNM variables, two significant axes remain. These may reflect unmeasured abiotic or biotic mechanisms, such as food sources.

Atlantic foraminifera & SST
Telford & Birks (2005)
Matrix of PCNM variables created from matrix of distances between N Atlantic sites truncated at 781 km, the minimum distance that links all sites into a single network.
385 orthogonal PCNM variables representing space. Forward selection in CCA retained 37 of these. They represent large spatial patterns.
   SST independent of space          1.8% variance
   Covariation between SST & space  29.9% variance
   Space independent of SST         42.5% variance
   Unexplained                      25.7%
Pure space explains most. Therefore there are important unknown spatial structures in the data.
If only considering SST, expect strong spatial autocorrelation in residuals of SST transfer function models.
   Lowest autocorrelation in MAT and ANN residuals
   Highest autocorrelation in WA and GLR (= ML) residuals
Highlights 'secret assumption' of transfer functions.

PREDICTIVE MODELS FROM SPATIAL DATA
Nature management – well explored areas, poorly explored areas
Lesotho bird atlas
Habitat variables reduced to PCA axes.
Logistic regression to model species occurrences and absences in terms of habitat PCA:
   $\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5$
   $x_1$ to $x_4$ = PCA site scores, $x_5$ = recording effort

Wildlife management, GIS
Mt Graham red squirrel in relation to env vars. Logistic regression.
Pereira & Itami (1991) Photogr. Engin. & Remote Sensing 57, 1475–1486

Summary of the overall logistic models. The upper data are regression coefficients with their standard errors in brackets.
            Pied crow      Ground woodpecker   Cape vulture 1   Cape vulture 2*
PC1         -0.90 (0.28)    0.54 (0.18)         0.40 (0.14)      0.85 (0.28)
PC2         -0.14 (0.41)   -0.72 (0.29)         0.02 (0.22)     -0.15 (0.25)
PC3         -0.49 (0.35)    0.01 (0.28)        -0.31 (0.23)     -0.44 (0.27)
PC4         -0.34 (0.29)   -0.24 (0.29)         0.02 (0.29)      0.76 (0.48)
Effort       0.15 (0.09)    0.31 (0.14)         0.04 (0.03)      0.10 (0.04)
Constant    -2.43 (0.92)   -1.52 (0.84)        -0.75 (0.42)     -1.96 (0.79)
Deviance    33.95          45.21               62.73            48.88
Df          49             49                  49               47
P-value+     0.95           0.63                0.09             0.40
* Cape vulture 2 excludes data for two squares identified as having a disproportionate effect on the model using all the data (Cape vulture 1).
+ The P-value is best interpreted as a measure of standardized deviance, useful for comparing models with differing degrees of freedom.

Distribution maps for three bird species in Lesotho produced by logistic modelling of presence-absence data. Higher probabilities of occurrence are indicated by increasing circle size and actual field records are shown as filled circles.

Hill (1991) J. Biogeogr. 18, 247–255
CCA: species data (+/–); environmental data: max altitude, annual rainfall, mean temperature, geology, presence of coast
   $\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_2 + b_4 x_3 + b_5 x_4$
   $x_1$ to $x_4$ are site scores in CCA
Predict distributions given simple environmental data.

Actual and predicted distributions of species using logit regression with six parameters. The species are Dipper (Cinclus cinclus), Little Ringed Plover (Charadrius dubius) and Common Rockrose (Helianthemum nummularium). Circles of increasing size signify categories of probability as follows: 1-4%; 5-10%; 11-30%; 31-50%; 51-75%; 76-100%.
[Map panels: actual and predicted distributions for Dipper, Little Ringed Plover, and Rockrose.]

PREDICTION OF UPLAND PLANT COMMUNITY DISTRIBUTION USING LOGISTIC REGRESSION
54 upland vegetation types recorded in 1,514 ten-kilometre grid squares in the uplands of Scotland, England, and Wales.
Environmental variables from National Land Characteristics Data Bank.
   Topography   13 variables (22 possible)
   Climate      18 variables (29 possible)
   Geology      19 variables (29 possible)
   Soil types    8 variables (8 possible)
   Land-use      2 variables (22 possible)
Reduced 31 Topography + Climate variables to 5 PCA axes (63.6% variance) and 27 Geology + Soil type variables to 2 PCA axes (20.3%).
Used 5 PCA axes + their square terms, the 2 PCA axes, + Land-use variables as predictors in logistic regression using the +/- of each vegetation type as the response variable.

54 models
    7 have rho (r²) < 0.20
   26 have rho 0.20 - 0.40
   20 have rho 0.40 - 0.60
    2 have rho > 0.60

Mean rho values
   Calcareous grassland     0.38
   Heaths                   0.41
   Mires                    0.26
   Other grasslands         0.41
   Woodland & scrub         0.40
   Alpine snow-beds etc.    0.52
Poorest fits: Heaths 1, Mires 5, Grasslands 1

Predicted and known 10 km square distribution of NVC U20 (Pteridium aquilinum – Galium saxatile community). Predictions were not made for lowland areas.
Predicted and known 10 km square distribution of NVC U10 (Carex bigelowii – Racomitrium lanuginosum moss-heath).
Predicted and known 10 km square distribution of NVC H13 (Calluna vulgaris – Cladonia arbuscula heath).
Predicted and known 10 km square distribution of NVC H9 (Calluna vulgaris – Deschampsia flexuosa heath) in the uplands.
Predicted and known 10 km square distribution of NVC M6 (Carex echinata – Sphagnum recurvum/auriculatum mire).
Predicted and known 10 km square distribution of NVC M10 (Carex dioica – Pinguicula vulgaris mire).
Predicted and known 10 km square distribution of NVC W19 (Juniperus communis – Oxalis acetosella woodland).
Salix herbacea-Racomitrium heterostichum, snow-bed
Cryptogramma crispa-Athyrium distentifolium, snow-bed
Luzula sylvatica-Geum rivale, tall-herb community
Saxifraga aizoides-Alchemilla glabra, banks
Nardus stricta-Galium saxatile, grassland
Festuca ovina-Agrostis capillaris-Galium saxatile, grassland
Festuca ovina-Agrostis capillaris-Rumex acetosella, grassland
Calluna vulgaris-Erica cinerea, heath
Erica tetralix-Sphagnum compactum, wet heath
Erica tetralix-Sphagnum papillosum, raised and blanket mire

PREDICTING THE PROBABILITY OF SPECIES OCCURRENCE USING SURVEY DATA
Le Duc et al. (1992) Watsonia 19: 97-105
Le Duc et al. (1992) Aspects of Applied Biology 29: 41-48
Firbank et al. (1998) Weed Research 35: 1-10

Plant recording: 10 km grid squares
Tetrads: 2 km grid squares
Impossible to record all tetrads, so only 3 are recorded (A, J, and W).
Convert tetrad data to probabilities of species occurrence, introducing some spatial smoothing in the interpolation.

Layout of the botanical monitoring scheme of the BSBI.
Gaussian smoothing of occurrence in tetrads: species occurrence converted to probability of species occurrence.
To predict species occurrence, need external predictors (e.g. soil type, land-use classes) and logistic regression.

Veronica montana
[Map panels: (a) data; (b) estimated probability; (c) estimated probability using soil groups; (d) estimated probability using land-use classes.]
Soil type is the main predictor.

Predicting weed distribution using tetrad data and soil types. Firbank et al. (1998)
   $\log_e\left(\frac{p}{1-p}\right) = a + b \times \mathrm{Smooth} + c \times \mathrm{Soil}$
   Soil: 16 classes

Alopecurus myosuroides
[Map panels: (a) tetrads; (b) smoothed probability of occurrence; (c) prediction using (b) + soils; (d) 10 km square map.]
[Map panels: (a) Elymus repens; (b) Legousia hybrida; (c) Papaver rhoeas; (d) Senecio jacobaea.]

Species pool of cereal weeds greatest in central and southern England. Does not entirely coincide with distribution of arable farming.
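The two steps outlined above (Gaussian smoothing of tetrad records, then a logit model with Smooth and Soil terms) might be sketched as follows. This is only a sketch under stated assumptions: the kernel-weighted mean, the bandwidth, the coordinates, the coefficients, and the soil class names are all hypothetical and are not taken from Le Duc et al. or Firbank et al.

```python
import math


def gaussian_smooth(records, target, bandwidth_km):
    """Kernel-weighted mean of 1/0 records from recorded tetrads.

    records       dict mapping (x_km, y_km) tetrad centres to 1 or 0
    target        (x_km, y_km) location to estimate
    bandwidth_km  standard deviation of the Gaussian kernel (assumed)
    """
    num = den = 0.0
    for (x, y), present in records.items():
        dist2 = (x - target[0]) ** 2 + (y - target[1]) ** 2
        weight = math.exp(-dist2 / (2.0 * bandwidth_km ** 2))
        num += weight * present
        den += weight
    return num / den


def occurrence_probability(a, b, smooth, soil_effect):
    """Inverse logit of a + b * Smooth + (coefficient for the soil class)."""
    eta = a + b * smooth + soil_effect
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical tetrad records and coefficients, for illustration only.
records = {(1.0, 1.0): 1, (5.0, 1.0): 0, (9.0, 7.0): 1}
smooth = gaussian_smooth(records, target=(3.0, 1.0), bandwidth_km=2.0)
soil_effects = {"soil_A": 0.0, "soil_B": -1.2}   # one coefficient per class
p_soil_a = occurrence_probability(-2.0, 4.0, smooth, soil_effects["soil_A"])
p_soil_b = occurrence_probability(-2.0, 4.0, smooth, soil_effects["soil_B"])
```

Because Soil is a categorical predictor, the term c x Soil is represented here as one additive coefficient per soil class, with the first class as the reference level.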
[Map panels: (a) grass weeds of cereals; (b) broad-leaved weeds; (c) distribution of arable land.]

PREDICTION OF FUTURE CHANGES - TROLLIUS EUROPAEUS
Watt et al. (1997)
[Map panels: OBSERVED today; PREDICTED today; PREDICTED future.]
Known distribution of globeflower (Trollius europaeus) (data from the Biological Records Centre).
Predicted current distribution using Jan min. & July max. temp and annual precipitation as independent variables in a logistic regression.
Predicted distribution in 2050 using the same model but imposing the UK transient climate scenario for 2050.

KEY RESEARCHERS IN ANALYSIS OF TEMPORAL PALAEOECOLOGICAL DATA
Steve Juggins, Ed Cushing, Eric Grimm, Bent Odgaard, Allan Gordon, Keith Bennett, Andy Lotter

KEY RESEARCHERS IN SPATIAL ANALYSIS OF ECOLOGICAL DATA
Daniel Borcard, Pierre Legendre, Mark Hill, Richard Telford, Marie-Josée Fortin