Using Ordination Methods in Palaeoecology

John Birks
University of Bergen, University College London, University of Oxford
Tilia Workshop, Liverpool, May 2011

• Introduction
• Ordination methods and palaeoecological functions
• Uses in palaeoecology: data summarisation, data analysis, data interpretation
• Strengths and weaknesses
• Conclusions

Introduction

• Ordination: a term first introduced into ecology by David Goodall in 1954, derived from the German 'Ordnung'.
• Ordering of samples and species in relation to their overall similarity (indirect gradient analysis) or to their environment (direct gradient analysis).
• The end result is a low-dimensional representation of multivariate data (many objects, many variables). Axes are chosen to fulfil certain mathematical properties.
• Of great use in data summarisation, data analysis, and data interpretation.

A simple example of data summarisation using ordination: European food (Reader's Digest survey)

Food codes: GC ground coffee; IC instant coffee; TB tea or tea bags; SS sugarless sugar; BP packaged biscuits; SP soup (packages); ST soup (tinned); IP instant potatoes; FF frozen fish; VF frozen vegetables; AF fresh apples; OF fresh oranges; FT tinned fruit; JS jam (shop); CG garlic clove; BR butter; ME margarine; OO olive, corn oil; YT yoghurt; CD crispbread.

% of households having each food, by country. Values are listed in the food order GC IC TB SS BP SP ST IP FF VF AF OF FT JS CG BR ME OO YT CD.

Country  % food
D        90 49 88 19 57 51 19 21 27 21 81 75 44 71 22 91 85 74 30 26
I        82 10 60  2 55 41  3  2  4  2 67 71  9 46 80 66 24 94  5 18
F        88 42 63  4 76 53 11 23 11  5 87 84 40 45 88 94 47 36 57  3
NL       96 62 98 32 62 67 43  7 14 14 83 89 61 81 16 31 97 13 53 15
B        94 38 48 11 74 37 25  9 13 12 76 76 42 57 29 84 80 83 20  5
L        97 61 86 28 79 73 12  7 26 23 85 94 83 20 91 94 94 84 31 24
GB       27 86 99 22 91 55 76 17 20 24 76 68 89 91 11 95 94 57 11 28
P        72 26 77  2 22 34  1  5 20  3 22 51  8 16 89 65 78 92  6  9
A        55 31 61 15 29 33  1  5 15 11 49 42 14 41 51 51 72 28 13 11
CH       73 72 85 25 31 69 10 17 19 15 79 70 46 61 64 82 48 61 48 30
S        97 13 93 31 43 43 39 54 45 56 78 53 75 9 68 32 48 2 93 (19 values listed; one entry is missing, so the positions do not map one-to-one onto the food codes)
DK       96 17 92 35 66 32 32 11 51 42 81 72 50 64 11 92 91 30 11 34
N        96 17 83 13 62 51  4 17 30 15 61 72 34 51 11 63 94 28  2 62
SF       98 12 84 20 64 27 10 8 18 12 50 57 22 37 15 96 94 17 64 (19 values listed; one entry is missing)
E        70 40 40 62 43 2 14 23 7 59 77 30 38 86 44 51 91 16 13 (19 values listed; one entry is missing)
IRL      13 52 99 11 80 75 18  2  5  3 57 52 46 89  5 97 25 31  3  9

Ordination: correspondence analysis

Key to countries: A Austria, B Belgium, CH Switzerland, D West Germany, DK Denmark, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland.

Correspondence analysis of percentages of households in 16 European countries having each of 20 types of food. Minimum spanning tree fitted to the full 15-dimensional correspondence analysis solution.

What has this to do with pollen-stratigraphical data and palaeoecology?

Multivariate data

Pollen data: 2 pollen types (variables) x 15 samples. Depths are in centimetres, and the units for pollen frequencies may be either grains counted or percentages.

Sample  Depth (cm)  Type A  Type B
1       0           10      50
2       10          12      42
3       20          15      47
4       30          17      38
5       40          18      43
6       50          22      37
7       60          23      35
8       70          26      26
9       80          35      23
10      90          37      22
11      100         43      18
12      110         38      17
13      120         47      15
14      130         42      12
15      140         50      10

Adam (1970)

Alternate representations of the pollen data

(a) Palynological representation: the data are plotted as a standard diagram.
(b) Geometrical representation: the data are plotted using the geometric model.
Units along the axes may be either pollen counts or percentages. Adam (1970)

Of course, palaeoecological data consist of more than two pollen types and 15 samples. Main features are
• Many taxa (50-300)
• Many samples or objects (50-500)
• Many zero values in the data matrix ('sparse' data)
• Few abundant taxa, many rare taxa
• Data are usually expressed as percentages or proportions ('closed' compositional data)
• Data are not normally distributed in a statistical sense, so classical statistical tests are not appropriate
• Stratigraphical data form temporal series with a fixed sample order

Why do ordinations?

1. Data simplification and data reduction: "signal from noise".
2. Detect features that might otherwise escape attention.
3. Hypothesis generation and prediction.
4. Data exploration as an aid to further data collection.
5. Communication of results of complex data. Ease of display of complex data.
6. Aids communication and forces us to be explicit. "The more orthodox amongst us should at least reflect that many of the same imperfections are implicit in our own cerebrations and welcome the exposure which numbers bring to the muddle which words may obscure". D Walker (1972)
7. Tackle problems not otherwise soluble. Hopefully better science.
8. Fun!

Ordination Methods and Palaeoecological Functions

Biological data Y only: ordination, classical ordination, indirect gradient analysis, classical or metric scaling, non-metric multidimensional scaling
• Principal components analysis (PCA)
• Correspondence analysis (CA)
• Detrended correspondence analysis (DCA)
Also:
• Principal coordinates analysis (metric scaling) (PCoA)
• Non-metric multidimensional scaling (NMDS)

Biological data Y and environmental data X: canonical ordination, constrained ordination, direct gradient analysis, multivariate regression
• Redundancy analysis (RDA)
• Canonical correspondence analysis (CCA)
• Detrended canonical correspondence analysis (DCCA)
Also:
• Canonical analysis of principal coordinates (CAP)

Aims of indirect gradient analysis

1. Summarise multivariate data in a convenient low-dimensional geometric way. Dimension-reduction technique.
2. Uncover the fundamental underlying structure of data. Assume that there is underlying LATENT structure: occurrences of all species are determined by a few unknown environmental variables, LATENT VARIABLES, according to a simple response model. In ordination we are trying to recover and identify that underlying structure.

PCA, CA, DCA, PCoA, and NMDS all fulfil aim 1.
Only PCA, CA, and DCA fulfil aim 2. Will discuss only these as they are trying to uncover the underlying structure.
(PCoA can also fulfil aim 2 if you use the same distance measures that are implicit in PCA or CA!)

Underlying response models

Linear: a straight line, $y = a + bx$, displays the linear relation between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data (●). (a = intercept; b = slope or regression coefficient.)

Unimodal: a Gaussian curve, $y = c\,\exp\!\left(-\frac{(x-u)^2}{2t^2}\right)$, displays a unimodal relation between the abundance value (y) of a species and an environmental variable (x). (u = optimum or mode; t = tolerance; c = maximum = exp(a).)

Besides making a low-dimensional map of multivariate data, the more difficult but biologically more important task is the ordination problem, namely:

Construct the single hypothetical variable (latent variable) that gives the best fit, in a statistical sense, to the species data according to an assumed linear response model (PCA) or an assumed unimodal species response model (CA, DCA).

• PCA is the ordination technique that constructs the theoretical latent variable that minimises the total residual sum-of-squares after fitting straight lines or planes to the species data.
• CA is the ordination technique that constructs the theoretical latent variable that maximises the dispersion of the species scores after fitting unimodal curves or surfaces to the species data.

Repeated for PCA axis 2, 3, ..., n with the constraint that all axes are uncorrelated with each other.

Three-dimensional view of a plane fitted by least-squares regression of responses (●) on two explanatory variables, PCA axis 1 and PCA axis 2. The residuals, i.e. the vertical distances between the responses and the fitted plane, are shown. Least-squares regression determines the plane by minimising the sum of these squared distances.
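The least-squares view of PCA described above can be sketched with a singular value decomposition. This is only a minimal illustration in Python with NumPy, not the implementation used in CANOCO or R. The function names (`pca_svd`, `residual_ss`) and the random count data are assumptions made for the example.

```python
import numpy as np

def pca_svd(Y):
    """PCA of a samples x species matrix via SVD of the species-centred data."""
    Yc = Y - Y.mean(axis=0)                  # centre each species (column)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    eigenvalues = s ** 2                     # sum of squares captured by each axis
    site_scores = U * s                      # sample positions on the axes
    species_loadings = Vt.T                  # unit-length species vectors
    return Yc, eigenvalues, site_scores, species_loadings

def residual_ss(Yc, x):
    """Total residual sum of squares after regressing every species on x."""
    x = x - x.mean()
    b = Yc.T @ x / (x @ x)                   # least-squares slope for each species
    return float(((Yc - np.outer(x, b)) ** 2).sum())

# Hypothetical data: 15 samples x 6 taxa of made-up counts, for illustration only
rng = np.random.default_rng(0)
Y = rng.poisson(5.0, size=(15, 6)).astype(float)

Yc, eig, sites, loadings = pca_svd(Y)
total_ss = float((Yc ** 2).sum())
rss_axis1 = residual_ss(Yc, sites[:, 0])                    # PCA axis 1 as latent variable
rss_random = residual_ss(Yc, rng.normal(size=Y.shape[0]))   # an arbitrary latent variable

print("proportion of total sum of squares per axis:", np.round(eig / total_ss, 3))
print("residual SS using PCA axis 1:", round(rss_axis1, 2))
print("residual SS using a random variable:", round(rss_random, 2))
```

No other single variable can leave a smaller residual sum of squares than PCA axis 1, which is why the random variable in the sketch always does at least as badly.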
Representations of PCA results as biplots of axes 1 and 2

Correlation (= covariance) biplot scaling
• Species scores: sum of squares = λ
• Site scores: scaled to unit sum of squares
• Emphasis on species

Distance biplot scaling
• Site scores: sum of squares = λ
• Species scores: scaled to unit sum of squares
• Emphasis on sites

Total sum-of-squares (variance) = 1598 = sum of eigenvalues
Axis 1 = 471 = 29%; Axis 2 = 344 = 22%; axes 1 and 2 together = 51% of the total variance

PCA biplots
• Axes must have identical scales
• Species loadings and site scores in the same plot: a graphical rank-2 approximation of the data
• Origin: species averages. Points near the origin are average or are poorly represented
• Species increase in the direction of the arrow, and decrease in the opposite direction
• The longer the arrow, the stronger the increase
• Angles between vector arrows approximate their correlations (correlation r = cos θ, where θ is the angle between two arrows)
• Distance from origin reflects magnitude of change
• Approximation: project site point onto species vector

Biplot interpretation for Agrostis stolonifera
Summarises abundance of species in samples. In the correlation (covariance) biplot, site scores are scaled to unit sum of squares and the sum of squared species scores equals the eigenvalue.

Representation of CA results as joint plots
CA ordination diagram of the Dune Meadow Data in Hill's scaling. λ1 = 0.53, λ2 = 0.40, λ3 = 0.26, λ4 = 0.17. Software: CANOCO, R.

CA: joint plot interpretation
Joint plot with weighted Chi-squared metric: species and sites in the same plot with Hill's scaling.
• Distance from the origin: Chi-squared difference from the profile
• Points at the origin are either average or poorly explained
• Distant species are often rare, close species usually common
• Unimodal centroid interpretation: species optima and gradient values, at least for well-explained species
• Can also construct CA biplots
• Samples close together are inferred to resemble one another in species composition
• Samples with similar species composition are assumed to be from similar environments
J. Oksanen (2002)

Detrended correspondence analysis (DCA)

Aims to correct three 'artefacts' or 'faults' in CA:
1. Detrending to remove 'spurious' curvature in the ordination of strong single gradients
2. Rescaling to correct shrinking at the ends of ordination axes, which results in packing of sites at gradient ends
3. Downweighting to reduce the influence of rare species

Implemented originally in DECORANA and now in CANOCO and R (vegan).

Allows estimation of gradient length, or the amount of compositional turnover along the DCA axes (in standard deviation units). 4 SD units represent complete turnover along the gradient.

CA applied to artificial data (- denotes absence). Column a: the table looks chaotic. Column b: after rearrangement of species and sites in order of their scores on the first CA axis (u_k and x_i), a two-way Petrie matrix appears (λ1 = 0.87).

Column a
Species   Sites: 1 2 3 4 5 6 7
A                1 - - - - - -
B                1 - - - - - 1
C                1 1 - - - - 1
D                - - - 1 1 1 -
E                - 1 - 1 - - 1
F                - 1 - 1 - 1 -
G                - - 1 - 1 1 -
H                - - 1 - 1 - -
I                - - 1 - - - -

Column b
Species   Sites: 1 7 2 4 6 5 3    u_k
A                1 - - - - - -   -1.40
B                1 1 - - - - -   -1.24
C                1 1 1 - - - -   -1.03
E                - 1 1 1 - - -   -0.56
F                - - 1 1 1 - -    0.00
D                - - - 1 1 1 -    0.56
G                - - - - 1 1 1    1.03
H                - - - - - 1 1    1.24
I                - - - - - - 1    1.40

Figure annotations: λ1 = 0.87, λ2 = 0.57; arch effect; 'seriation' to arrange data into a sequence; distorted distances.

Ordination by CA of the two-way Petrie matrix in the table above.
a: Arch effect in the ordination diagram (Hill's scaling; sites labelled as in table above; species not shown). b: One-dimensional CA ordination (the first axis scores of Figure a), showing that sites at the ends of the axis are closer together than sites near the middle of the axis. c: One-dimensional DCA ordination, obtained by nonlinearly rescaling the first CA axis. The sites would not show variation on ...

The cause of the arch in CA
• There is a curve in the species space, and PCA shows it correctly.
• CA may be able to deal with unimodal responses, but if there is one dominant gradient, the second axis is the first axis folded. This occurs when the first axis is at least twice as long as the second 'real' axis.
• Problems clearly arise when there is one strong dominant gradient.
J. Oksanen (2002)

Implicit distances between objects in PCA and CA

Euclidean distance, implicit in PCA, involves absolute differences in species abundances between sites.
Chi-squared distance, implicit in CA, involves proportional differences in abundances of species between sites.
Differences in site and species totals are therefore less influential in CA than in PCA, unless some transformation is used in PCA to correct for this effect (e.g. percentage transformations).

Data transformations in PCA

1. Centred species data: PCA of the variance-covariance matrix. Species are implicitly weighted by the variance of their values.
2. Standardised PCA: PCA of the correlation matrix. Centre species and divide by standard deviation (zero mean, unit variance). All species receive equal weight, including rare species. Use when data are in different units, e.g. pH, LOI, Ca.
3. Square-root transformation of percentage data. Chord distance or Hellinger distance. Excellent with % data.
4. Log (y + 1) transformation for abundance data.
5. Log transformation and centring by species and samples = log-linear contrast PCA for closed % data (few variables, e.g. blood groups).

Data transformations in CA

Not as critical as in PCA, as CA must have data in identical units (cf. PCA of correlation matrix).
1. Square-root transformation of percentage data. Reduces the impact of abundant species and optimises the 'signal to noise' ratio.

How many ordination axes to retain for interpretation?

Jackson, D.A. (1993) Ecology 74, 2204–2214. PCA; applicable to CA, PCoA, ?DCA.

Assessment of eigenvalues:
• Scree plot
• Broken-stick model: if the total variance is divided randomly amongst the axes, the eigenvalues follow a broken-stick distribution

$b_k = \sum_{i=k}^{p} \frac{1}{i}$

where p = number of variables and b_k = expected size of the k-th eigenvalue; as a percentage of the total variance this is $100\,b_k/p$.

e.g. for 6 eigenvalues the expected % variance is 40.8, 24.2, 15.8, 10.3, 6.1, 2.8.

Simple to calculate, robust, and reliable.

Figure: observed eigenvalues plotted against the broken-stick model expectation; a 3-axis model is appropriate. Software: R.

Aims of direct gradient analysis

Prior to 1986 and the development of canonical correspondence analysis (CCA) by Cajo ter Braak, approaches to the interpretation of PCA/CA/DCA results were:
1. Plot or contour values of external environmental variables on the ordination plot
2. Plot external variables against ordination axes
3. Regress ordination axis (composite response variable) on external variables

Limitations of direct gradient analysis
1. External variables may turn out to be poorly related to the first few ordination axes
2. Strong relationships with, say, axis 4 or 5 are easily overlooked

Limitations overcome by canonical or constrained ordination = multivariate direct gradient analysis.

Canonical ordination techniques

Ordination and regression in one technique (Cajo ter Braak 1986).
Search for a weighted sum of environmental variables that fits the species best, i.e. that gives the maximum regression sum of squares.

The ordination diagram shows
1) patterns of variation in the species data
2) main relationships between species and each environmental variable

• Redundancy analysis (RDA): constrained or canonical PCA
• Canonical correspondence analysis (CCA): constrained CA
• Detrended CCA (DCCA): constrained DCA

Axes are constrained to be linear combinations of environmental variables.
In effect PCA or CA or DCA with one extra step: do a multiple regression of site scores on the environmental variables and take as new site scores the fitted values of this regression. Multivariate regression of Y on X.

Primary data in gradient analysis
• Indirect GA: species data (abundances or +/-) = response variables Y
• Direct GA: the above PLUS environmental variables (values or classes) = predictor or explanatory variables X

CCA triplot

            DCA     CCA
λ1          0.54    0.46
λ2          0.40    0.29
R axis 1    0.87    0.96
R axis 2    0.83    0.89

CCA of the Dune Meadow Data. a: Ordination diagram with environmental variables represented by arrows. The c scale applies to environmental variables, the u scale to species and sites. The types of management are shown by closed squares at the centroids of the meadows of the corresponding types of management. b: Inferred ranking of the species along the variable amount of manure, based on the biplot interpretation of part a of this figure.
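The "one extra step" described above (regress on the environmental variables, then ordinate the fitted values) can be sketched for the linear case, RDA. This is a bare-bones illustration in Python with NumPy under stated assumptions: the function name `rda`, the synthetic data, and the unscaled scores are choices made for the example and do not reproduce CANOCO or vegan output scalings.

```python
import numpy as np

def rda(Y, X):
    """Sketch of RDA: multivariate regression of Y on X, then PCA of the fitted values."""
    Yc = Y - Y.mean(axis=0)                        # centre species
    Xc = X - X.mean(axis=0)                        # centre environmental variables
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)    # regression coefficients
    Y_fit = Xc @ B                                 # fitted species values
    U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    keep = s > 1e-10 * s.max()                     # canonical axes only
    eig = s[keep] ** 2                             # canonical eigenvalues
    site_scores = U[:, keep] * s[keep]             # linear combinations of X
    species_scores = Vt[keep].T
    explained = float(eig.sum() / (Yc ** 2).sum()) # constrained share of total variation
    return eig, site_scores, species_scores, explained

# Synthetic example (assumed data): 30 samples, 2 environmental variables, 8 taxa
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
Y = X @ rng.normal(size=(2, 8)) + rng.normal(scale=0.5, size=(30, 8))

eig, site_scores, species_scores, explained = rda(Y, X)
print("canonical eigenvalues:", np.round(eig, 2))
print("proportion of variation explained by X:", round(explained, 3))
```

The number of canonical axes cannot exceed the number of environmental variables, which is why only two eigenvalues are returned here.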
Redundancy analysis: constrained PCA

• Short (< 2 SD) compositional gradients
• Linear or monotonic responses
• Also known as: reduced-rank regression; PCA of y with respect to x; two-block mode C PLS; PCA of instrumental variables (Rao 1964)

PCA: the best hypothetical latent variable is the one that gives the smallest total residual sum of squares.
RDA: selects the linear combination of environmental variables that gives the smallest total residual sum of squares.

ter Braak (1994) Ecoscience 1, 127–140. Canonical community ordination. Part I: Basic theory and linear methods.

RDA ordination diagram of the Dune Meadow Data with environmental variables represented as arrows. The scale of the diagram is: 1 unit in the plot corresponds to 1 unit for the sites, to 0.067 units for the species, and to 0.4 units for the environmental variables. Biplot interpretation.

Statistical testing of constrained ordination results

Statistical significance of species-environment relationships. Monte Carlo permutation tests: distribution-free tests, but they assume exchangeability of samples.

1. Randomly permute the environmental data and relate them to the species data ('random data set').
2. Calculate the first eigenvalue (λ1) and the sum of all canonical eigenvalues (trace).
3. Repeat many times (e.g. 999).

If species react to the environmental variables, the observed test statistic (λ1 or trace) for the observed data should be larger than most (e.g. 95%) of the test statistics calculated from random data. If the observed value is among the top 5% of values, conclude that species are significantly related to the environmental variables.

Special 'restricted' permutation tests are needed for time-ordered data as occur in palaeoecology.

Statistical significance of constraining variables
• CCA or RDA maximise correlation with constraining variables and eigenvalues.
• Permutation tests can be used to assess statistical significance:
  - Permute rows of environmental data.
  - Repeat CCA or RDA with permuted data many times.
  - If the observed statistic is higher than (most of) the permuted values, it is regarded as statistically significant.
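The permutation procedure above can be sketched for RDA using the trace (sum of all canonical eigenvalues) as the test statistic. This is a simplified illustration in Python with NumPy: the function names and synthetic data are assumptions, and it uses unrestricted permutations, which, as noted above, are not appropriate for time-ordered stratigraphical data.

```python
import numpy as np

def rda_trace(Y, X):
    """Sum of all canonical eigenvalues = sum of squares of Y fitted by X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    return float(((Xc @ B) ** 2).sum())

def permutation_test(Y, X, n_perm=999, seed=0):
    """Unrestricted Monte Carlo permutation test of the species-environment relation."""
    rng = np.random.default_rng(seed)
    observed = rda_trace(Y, X)
    n = Y.shape[0]
    # Permute the rows of the environmental data and recompute the statistic
    null = np.array([rda_trace(Y, X[rng.permutation(n)]) for _ in range(n_perm)])
    # The observed value is counted as one member of the reference distribution
    p_value = (1 + int((null >= observed).sum())) / (n_perm + 1)
    return observed, null, p_value

# Synthetic example (assumed data) in which species respond strongly to X
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
Y = X @ rng.normal(size=(2, 8)) + rng.normal(scale=0.5, size=(30, 8))

observed, null, p_value = permutation_test(Y, X)
print("observed trace:", round(observed, 2))
print("95th percentile of permuted traces:", round(float(np.quantile(null, 0.95)), 2))
print("permutation p-value:", p_value)
```

With 999 permutations the smallest attainable p-value is 1/1000, which is why 999 rather than 1000 permutations is the usual choice.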
J. Oksanen (2002)

Partial constrained ordinations (partial CCA, RDA, etc.)

e.g. pollution effects (variables of interest); seasonal effects (COVARIABLES Z)

Eliminate (partial out) the effect of covariables. Relate residual variation to pollution variables.
Replace environmental variables by their residuals, obtained by regressing each pollution variable on the covariables.

Analysis is conditioned on specified variables or covariables. These conditioning variables may typically be 'random' or background variables, and their effect is removed from the CCA or RDA based on the 'fixed' or interesting variables.

Very useful in testing competing hypotheses, as one can test the significance of sets of variables when other sets are partialled out.

Partial CCA

Natural variation due to sampling season and due to the gradient from fresh to brackish water is partialled out by partial CCA. Variation due to pollution could now be assessed.

Ordination diagram of a partial canonical correspondence analysis of diatom species (A) in dykes, with 24 variables-of-interest (arrows) as explanatory variables and 2 covariables (chloride concentration and season). The diagram is symmetrically scaled and shows selected species and standardized variables and, instead of individual dykes, centroids (•) of dyke clusters. The variables-of-interest shown are: BOD = biological oxygen demand, Ca = calcium, Fe = ferrous compounds, N = Kjeldahl-nitrogen, O2 = oxygen, P = orthophosphate, Si = silicon compounds, WIDTH = dyke width, and soil types (CLAY, PEAT). All variables except BOD, WIDTH, CLAY and PEAT were transformed to logarithms because of their skewed distributions.

PCA or CA/DCA?

PCA: linear response model
CA/DCA: unimodal response model

How to know which to use? Gradient lengths are important. Estimate them with DCA.
• If short, there are good statistical reasons to use LINEAR methods.
• If long, linear methods become less effective and UNIMODAL methods become more effective.
• In the range 1.5–3.0 standard deviations both are effective.
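The rule of thumb above can be written down as a small helper. This is only a sketch in Python: the function name `suggest_response_model` and its return labels are hypothetical, the gradient length is assumed to have been estimated beforehand with DCA, and the cut-offs are simply the 1.5 and 3.0 SD values quoted above.

```python
def suggest_response_model(gradient_length_sd, lower=1.5, upper=3.0):
    """Suggest a response model from a DCA gradient length given in SD units."""
    if gradient_length_sd < 0:
        raise ValueError("gradient length cannot be negative")
    if gradient_length_sd < lower:
        return "linear"      # short gradient: PCA-type methods
    if gradient_length_sd > upper:
        return "unimodal"    # long gradient: CA/DCA-type methods
    return "either"          # intermediate range: both are effective

for length in (1.0, 2.5, 3.6):
    print(length, "SD:", suggest_response_model(length))
```

The thresholds are exposed as parameters because different cut-offs are in use; passing `lower=2.0, upper=2.0` gives a single 2 SD cut-off.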
In practice: do a DCA first and establish the gradient length.
• If less than 2 SD, responses are monotonic. Use PCA.
• If more than 2 SD, use CA or DCA.

When to use CA or DCA is more difficult. Ideally use CA (fewer assumptions), but if an arch is present, use DCA.

DCA results can be unstable when eigenvalues 1 and 2 are close to each other (e.g. 0.55, 0.54) (Oksanen (1988) Vegetatio 74, 29–32). Always do a CA as well to assess the effect of detrending on the data-set.

(a) The response curves of 3 species along a gradient; 12 quadrats are located at the numbered points marked with arrowheads (artificial data). (b) Ordinations of the 12 data points by PCA (hollow dots, dashed line) and by CA (solid dots, solid line). Both ordinations exhibit the arch effect. The CA ordination also shows scale contractions at both extremities.

Hypothetical diagram of the occurrence of species A-J over an environmental gradient. The length of the gradient is expressed in standard deviation units (SD units). Broken lines (A', C', H', J') describe fitted occurrences of species A, C, H and J respectively. If sampling takes place over a gradient range <1.5 SD, the occurrences of most species are best described by a linear model (A' and C'). If sampling takes place over a gradient range >3 SD, occurrences of most species are best described by a unimodal model (H' and J').

Outline of ordination techniques. DCA (detrended correspondence analysis) was applied for the determination of the length of the gradient (LG). LG is important for choosing between ordination based on a linear or on a unimodal response model. In cases where LG <3, ordination based on linear response models is considered to be the most appropriate. PCA (principal component analysis) visualises variation in species data in relation to best-fitting theoretical variables. Environmental variables explaining this visualised variation are deduced afterwards, hence, indirectly.
RDA (redundancy analysis) visualises variation in species data directly in relation to quantified environmental variables. Before analysis, covariables may be introduced in RDA to compensate for systematic differences in experimental units. After RDA, a permutation test can be used to examine the significance of effects.

Indirect gradient analysis or direct gradient analysis?

1. Direct methods (RDA, CCA, DCCA) study the part of the variation in the species data that can be explained by a particular set of external variables.
2. Indirect methods (PCA, CA, DCA) focus on the major patterns of variation in the species data, irrespective of any external variables.

If external data are available, the direct approach is likely to be more effective than the traditional indirect approach.
The choice depends on the research questions and hypotheses being considered and on the data available.

Uses in Palaeoecology

Consider selected examples in data summarisation, data analysis, and data interpretation.
Major aim throughout is to help the palaeoecologist summarise and understand her/his data, to generate hypotheses, or to test hypotheses.

Data summarisation

1. Gradient analysis or ordination of a single stratigraphical sequence. PCA, CA, or DCA; or RDA, CCA, or DCCA constrained by depth or age.

PCA biplot (74.6%), Gordon (1982)
Biplot of the Kirchner Marsh data; C2 = 0.746. The lengths of the Picea and Quercus vectors have been scaled down relative to the other vectors. Stratigraphically neighbouring levels are joined by a line.

CA joint plot (62%), Gordon (1982)
Correspondence analysis representation of the Kirchner Marsh data; C2 = 0.620. Stratigraphically neighbouring levels are joined by a line.

Stratigraphical plot of sample scores on the first correspondence analysis axis (left) and of rarefaction estimate of richness (E(Sn)) (right) for Diss Mere, England. Major pollen-stratigraphical and cultural levels are also shown. The vertical axis is depth (cm).
The scale for sample scores runs from -1.0 (left) to +1.2 (right). Birks et al. (1988)

Adam (1974): stratigraphic plot of PCA axes 1-6, Osgood Swamp, California. Only axes 1-3 exceed broken-stick model expectations.

2. Gradient analysis or ordination of two or more stratigraphical sequences

Fugla Ness, Shetland. Birks & Ransom (1969)

Birks & Peglar (1979): Pollen diagram from Sel Ayre showing the frequencies of all determinable and indeterminable pollen and spores expressed as percentages of total pollen and spores (P). Abbreviations: undiff. = undifferentiated, indet = indeterminable.

Birks & Berglund (1979): Comparison of Färskesjön and Lösensjön using principal component analysis. The mean scores of the local pollen zones and the ranges of the sample scores in each zone are plotted on the first and second principal components, and are joined up in stratigraphic order. The regional pollen assemblage zones are also shown.

Birks & Berglund (1979): Comparison of Bjärsjöholmssjön and Färskesjön using principal component analysis. The mean scores of the local pollen zones and the ranges of the sample scores in each zone are plotted on the first and second principal components, and are joined up in stratigraphic order. The Blekinge regional pollen assemblage zones are also shown.

Haberle & Bennett (2004): The 1st and 2nd axes of the detrended correspondence analysis for Laguna Oprasa and Laguna Facil plotted against calibrated calendar age (cal yr BP). The 1st axis contrasts taxa from warmer forested sites with cooler herbaceous sites. The 2nd axis contrasts taxa preferring wetter sites with those preferring drier sites.

3. Arrangement of taxa along the major axis of variation, namely depth or age

Abernethy Forest. Birks & Mathews (1973)
Percentage pollen and spore diagram from Abernethy Forest, Inverness-shire. The percentages are plotted against time, the age of each sample having been estimated from the deposition time.
Nomenclatural conventions follow Birks (1973a) unless stated in Appendix 1. The sediment lithology is indicated on the left side, using the symbols of Troels-Smith (1955). The pollen sum, P, includes all non-aquatic taxa. Aquatic taxa, pteridophytes, and algae are calculated on the basis of P + group as indicated.

Birks (1993): pollen types re-arranged on the basis of the weighted average for depth (TRAN) = CCA with depth as external variable (CANOCO).

Data analysis

Techniques that estimate particular numerical characteristics from palaeoecological data, such as inferred past environment or compositional turnover.

1. Ordination as a tool in testing if a given environmental reconstruction is statistically significant (Telford & Birks 2011 Quat Sci Rev doi: 10.1016/j.quascirev.2011.03.002)

The basic idea of quantitative environmental reconstruction is a two-step process:

1. $X_m = Y_m \hat{U}_m$
   where $X_m$ = modern environmental variable(s), $Y_m$ = modern biological assemblages in surface samples, $\hat{U}_m$ = estimated modern calibration ('transfer') functions
2. $\hat{X}_f = Y_f \hat{U}_m$
   where $\hat{X}_f$ = inferred past environmental variable(s), $Y_f$ = fossil assemblages, $\hat{U}_m$ = estimated modern calibration ('transfer') functions

Various numerical ways of doing this: two-way weighted averaging and WA-PLS (assuming unimodal responses); inverse linear regression and partial least squares regression (assuming linear responses). See Birks et al. 2010 The Open Ecology Journal 3: 68-110.

Obtain a reconstruction of, say, July air temperature. Is it statistically significant or is it a result of chance? Various steps:

1. Do a PCA of the fossil data ($Y_f$) and see how much variance is explained by the first axis, the maximum possible for a single latent variable. Say it is 32%.
2. Do an RDA of the fossil data ($Y_f$) with the reconstructed environmental variable ($\hat{X}_f$) as sole external variable. Say it explains 19% of the variation in the fossil data.
3. Using the modern data $X_m$ and $Y_m$, generate 999 random environmental reconstructions to give a null distribution for $\hat{X}_f$.
4. Compare the observed variation explained (19%) with the null distribution and estimate the statistical significance of $\hat{X}_f$. Telford & Birks (2011)
5. If two or more environmental reconstructions have been generated from the same fossil data, can test if any of them are statistically significant using a forward-selection procedure in RDA (= partial RDA).

2. Using ordination to estimate compositional turnover as a means of comparing dynamics of different ecological systems

Use detrended canonical correspondence analysis (DCCA) with palaeoecological data as response variables and age or depth as sole external variable. With Hill's scaling in terms of standard deviation units, can estimate turnover in palaeoecological data (Birks 2007 Vegetation History and Archaeobotany 16: 197-202).

Figure: Kråkenes, western Norway (Birks & Birks 2008 The Holocene 18: 19-30). Early Holocene, major taxa: pollen percentages of the calculation sum plotted against depth (cm) and age, with lithology, loss-on-ignition, and zones 1-7. Fine-resolution diagram from the end of the Younger Dryas 11500 years ago to 9175 years ago.
Turnover estimates

Kråkenes                        Turnover (SD)                          Duration (yrs)
Total pollen record             2.75                                   2450
Younger Dryas to Betula zone    2.42                                   720
260 yrs since Younger Dryas     1.91                                   260
Glacial forelands               2.98-3.81 (mean = 3.32, sd 2.66)       260

Greater compositional change in 260 yrs on glacial forelands since the 'Little Ice Age' than at Kråkenes in the early Holocene.

Compare amount of change at many sites over the same time interval: 'meta-analysis'

Smol et al. 2005 PNAS 102: 4397-4402
• Diatom stratigraphies for the last 150 years in 42 arctic lakes: turnover 0.70-2.84 SD
• Compared with turnover in the last 150 years in unimpacted temperate lakes: turnover 0.72-1.39 SD, median 1.02 SD. Use as a baseline.
• Turnover >1 SD in arctic lakes suggests greater compositional change relative to undisturbed temperate lakes
Moritz et al. 2002; Smol et al. (2005)

Back to Kråkenes

Turnover in the diatom stratigraphy in the first 150 years since the Younger Dryas is 2.81 SD, about the same as in lakes in Arctic Canada (Ellesmere Island) in the last 150 years.
Interesting parallel: rapid biotic turnover in response to climate change.

3. Comparing fossil and modern assemblages

Jacobson & Grimm (1986), DCA: Graph of distance (number of standard deviations) moved every 100 yr in the first three dimensions of the ordination vs age. Greater distance indicates greater change in pollen spectra in 100 yr.

Jacobson & Grimm (1986), DCA: Ordination of the 100 BP analogue pollen assemblages and a 5-sample running average of the Billy's Lake fossil pollen samples. Points on this running average curve represent the position every 100 yr (rather than the position of each sample). Time marked for each 1000 yr (k).

Birks et al. (1990): Passive fossil samples added into CCA of modern diatom-chemistry data. Fossil samples are fitted on the basis of overall composition into CCA species-environment space.

Canonical correspondence analysis (CCA) time-tracks of selected cores from the Round Loch of Glenhead: (a) K5, (b) K2, (c) K16, (d) K86, (e) K6, (f) environmental variables.
Cores are presented in order of decreasing sediment accumulation rate. Allott et al. (1992) Hypothesis testing using constrained ordinations 1.Assessing potential external ‘drivers’ on an aquatic ecosystem Bradshaw et al. 2005 The Holocene 15: 1152-1162 Dalland Sø, a small (15 ha), shallow (2.6 m) lowland eutrophic lake on the island of Funen, Denmark. Catchment (153 ha) today agriculture 77 ha built-up areas 41 ha woodland wetlands 32 ha 3 ha Nutrient rich – total P 65-120 mg l-1 Terrestrial landscape or catchment development Bradshaw et al. (2005) Aquatic ecosystem development Bradshaw et al. (2005) DCA of pollen and diatom data separately to summarise major underlying trends in both data sets Pollen – high scores for trees, low scores for lightdemanding herbs and crops Diatom -high scores mainly planktonic and large benthic types, low scores for Fragilaria spp. and eutrophic spp. (e.g. Cyclostephanos dubius) Bradshaw et al. (2005) Major contrast between samples before and after Late Bronze Age forest clearances 'Lake' Prior to clearance, lake experienced few impacts. After the clearance, lake heavily impacted. 'Catchment' Bradshaw et al. (2005) Canonical correspondence analysis Response variables: Diatom taxa Predictor external variables: Pollen taxa, LOI, dry mass and minerogenic accumulation rates, plant macrofossils, Pediastrum Covariable: Age 69 matching samples Partial CCA with age partialled out as a covariable. Makes interpretation of effects of predictors easier by removing temporal trends and temporal autocorrelation Partial CCA all variables: 18.4% of variation in diatom data explained by Poaceae pollen, Cannabis-type pollen, and Daphnia ephippia, the only three independent and statistically significant predictors. As different external factors may be important at different times, divided data into 50 overlapping data sets – sample 1-20, 2-21, 3-22, etc. Bradshaw et al. (2005) CCA of 50 subsets from bottom to top and % variance explained 1. 
4520-1840 BC: Poaceae is the sole predictor variable (20-22% of diatom variance)
2. 3760-1310 BC: LOI and Populus pollen (16-33%)
3. 3050-600 BC: Betula, Ulmus, Populus, Fagus, Plantago, etc. (17-40%)
   i.e. in these early periods, diatom change influenced to some degree by external catchment processes and terrestrial vegetation change.
4. 2570 BC – 1260 AD: Erosion indicators (charcoal, dry mass accumulation), retting indicator Linum capsules, Daphnia ephippia, Secale and Hordeum pollen (11-52%)
   i.e. changing water depth and external factors
5. 160 BC – 1900 AD: Hordeum, Fagus, Cannabis pollen, Pediastrum boryanum, Nymphaea seeds (22-47%)
   i.e. nutrient enrichment as a result of retting hemp, also changes in water depth and water clarity
Bradshaw et al. (2005)

Strong link between inferred catchment change and within-lake development. Timing and magnitude are not always perfectly matched, e.g. transition to Medieval Period

2. Lake Euramoo, NE Queensland
Can use ordination methods to summarise several palaeoecological proxies and to compare with other proxies over last 800 years
Major changes between pre-European period (A) and European settlement (B)
Haberle et al. (2006)

Tested using RDA how well different proxies ‘predict’ or ‘explain’ (in a statistical sense) other proxies
Only proxy that significantly predicted other proxies was pollen, which predicted changes in diatoms (25.4%) and chironomids (15.4%)
Illustrates the importance of catchment and its vegetation on the lake and its biota

Strengths and Weaknesses

Merits and drawbacks of indirect ordination methods
1. Can distract attention from individual species responses by focussing on the overall multivariate response only.
2. As it is a correlative method, it can help with hypothesis generation.
3. It can rarely, if ever, demonstrate causality.
4. Ordinations can provide useful low-dimensional representations of complex data. Valuable for summarisation and for hypothesis generation.
5.
Ordination is a tool and a means to an end. It is not an end in itself.

Current uses of indirect ordination methods
Hill, M.O. (1988) Bull. Soc. Roy. Bot. Belg. 121, 134–41
“Ordination is a rather artificial technique. The idea that the world consists of a series of environmental gradients, along which we should place our vegetation samples, is attractive. But this remains an artificial view of vegetation. In the end the behaviour of vegetation should be interpreted in terms of its structure, the autoecology of its species and, above all, the time factor. At this level, trends become unimportant and multivariate analysis is perhaps irrelevant. Ordination is useful to provide a first description but it cannot provide deeper biological insights.”

Current uses of direct ordination methods
Direct gradient analysis or constrained ordination techniques allow hypothesis testing, not only hypothesis generation
Need fossil data and relevant external data. Major challenge to have relevant external data that are ecologically independent of the fossil data
Possible in a few examples – fossil data and volcanic tephra; split-sampling of fossil data into external predictors of vegetation type (e.g. macrofossils) and biological responses (e.g. diatoms, chironomids)

[Figure: flow diagrams for exploratory data analysis and confirmatory data analysis. Box labels: Real world ‘facts’; Hypotheses; Observations; Measurements; Data; Data analysis; Patterns; ‘Information’; Statistical testing; Hypothesis testing; Narratives; Theory]

EXPLORATORY DATA ANALYSIS versus CONFIRMATORY DATA ANALYSIS

Exploratory: How can I optimally describe or explain variation in data set?
Confirmatory: Can I reject the null hypothesis that the species are unrelated to a particular environmental factor or set of factors?

Exploratory: Samples can be collected in many ways, including subjective sampling.
Confirmatory: Samples must be representative of universe of interest – random, stratified random, systematic.

Exploratory: ‘Data-fishing’ permissible, post-hoc analyses, explanations, hypotheses, narrative okay.
Confirmatory: Analysis must be planned a priori.

Exploratory: P-values only a rough guide.
Confirmatory: P-values meaningful.

Exploratory: Stepwise techniques (e.g. forward selection) useful and valid.
Confirmatory: Stepwise techniques not strictly valid.

Exploratory: Main purpose is to find ‘pattern’ or ‘structure’ in nature. Inherently subjective, personal activity. Interpretations not repeatable.
Confirmatory: Main purpose is to test hypotheses about patterns. Inherently analytical and rigorous. Interpretations repeatable.

A well-designed modern palaeoecological study combines both

1) Two-phase study
- Initial phase is exploratory, perhaps involving pilot data or previous data to generate hypotheses.
- Second phase is confirmatory, collection of new data from defined sampling scheme, planned data analysis.

2) Split-sampling
- Large data set (>100 objects), randomly split into two (75/25) – exploratory set and confirmatory set.
- Generate hypotheses from exploratory set (allow data fishing); test hypotheses with confirmatory set.
- Rarely done in palaeoecology.

Data diving with cross-validation: an investigation of broad-scale gradients in Swedish weed communities
Hallgren et al. 1999 J Ecology 87: 1037-1051
[Flow chart boxes: Full data set; Remove observations with missing data; Clean data set; Ideas for more analysis; Random split; Exploratory data set; Hypotheses; Choice of variables; Some previously removed data; Confirmatory data set; Hypothesis tests; Combined data set; Analyses for display; RESULTS]
Flow chart for the sequence of analyses. Solid lines represent the flow of data and dashed lines the flow of analysis.

Split-sampling is very data-demanding
Palaeoecological data collection is very labour-intensive
Exciting developments at Massey University, New Zealand towards automated pollen counting
AutoStage – can now identify 50 taxa with a reliability of 98% or more and can flag others as unknown to be looked at by the palynologist.
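
The split-sampling scheme described above (a large data set randomly split 75/25 into an exploratory set and a confirmatory set) can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of Hallgren et al. (1999): the function name split_samples, its parameters, and the fixed seed are hypothetical choices made here, and the samples are represented simply by their index numbers. A plain random split also ignores the fixed stratigraphical order of fossil samples, which is an assumption worth checking for any real data set.

```python
import random


def split_samples(n_samples, exploratory_fraction=0.75, seed=0):
    """Randomly split sample indices into exploratory and confirmatory sets.

    Hypothetical helper for illustration: samples are identified by
    their index (0 .. n_samples - 1), not by depth or age.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)               # random order, so the split is random
    n_explore = round(n_samples * exploratory_fraction)
    exploratory = sorted(indices[:n_explore])
    confirmatory = sorted(indices[n_explore:])
    return exploratory, confirmatory


# Example: 120 samples split 75/25
exploratory, confirmatory = split_samples(120)
print(len(exploratory), len(confirmatory))  # prints "90 30"
```

Hypotheses would then be generated freely from the exploratory indices only, and each planned test run once on the confirmatory indices, which are not inspected beforehand.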
http://autopollen.massey.ac.nz

AutoStage flow
Pollen types shown: Betula pendula (silver birch); Ligustrum lucidum (privet); Dactylis glomerata (cocksfoot); Cupressus macrocarpa (macrocarpa tree); Wattle acacia; Pinus radiata
Much greater standard deviation for ‘people’ compared to machine, especially for pollen types 3 (Cupressus) and 4 (Ligustrum), i.e. we are more variable than a machine at pollen counting.

Time
60 pollen types and a count of 2000-3000 grains take 3 hours (quicker than an experienced pollen analyst).
Also AutoStage can analyse 24 hours a day, 7 days a week, 52 weeks a year. Can count 3000 samples in a year, compared to 500 samples a year for a hard-working pollen analyst, i.e. 6 times as many!

Cost
About $10,000 = £7000 = 70,000 Norwegian kroner – about 2-3 months’ salary!
A major breakthrough – but how do we prepare that number of samples?

Conclusions
Ordination techniques are useful tools in palaeoecology for data summarisation, data analysis, and data interpretation
Limiting factor in their full exploitation is availability of external data and of data-sets large enough for the split-sampling cross-validation needed in hypothesis testing
AutoStage and automated pollen counting are major challenges for next 5 years
Good reasons for selecting PCA, CA, or DCA (indirect methods) or RDA, CCA, or DCCA (direct methods)
Good reasons for selecting PCA (linear) or CA (unimodal) and CA (no curvature) or DCA (curvature)
No real role for NMDS
PCoA and CAP potentially useful if there are good reasons to use distance measures not possible in PCA, CA, RDA, or CCA

Andrew Lang (1844-1912)
“He uses statistics as a drunken man uses lamp-posts – for support rather than illumination.”
From MacKay, 1977, and reproduced through the courtesy of the Institute of Physics.
Statistics are for illumination!

Post-1987
Pre-1987
Sketches illustrating statistical zap and shotgun approaches to data analysis
Cajo ter Braak 1987 Wageningen

Major players in ordination theory
Karl Pearson David W.
Goodall: Pearson invented PCA (1901); Goodall made the first use of PCA in ecology (1954)
John C. Gower: 1966, popularised PCoA, invented Procrustes rotation
Joseph B. Kruskal: 1964, development of NMDS
Mark O. Hill: 1973, popularised CA in ecology; 1980, DCA
Cajo J.F. ter Braak: 1985, unified PCA and CA in terms of response models
Jari Oksanen: continuous questioning about ordination methods, championing NMDS, developing the R package vegan
Pierre Legendre: major developments in extending direct ordination methods
Petr Šmilauer: developed CanoDraw and CANOCO for Windows
Marti J. Anderson: extending RDA and CCA to other distance measures and developing CAP
Richard J. Telford: statistical testing of environmental reconstructions

Key researchers in the quantitative analysis of palaeoecological data
Andy Lotter, Keith Bennett, Eric Grimm, Allan Gordon, Bent Odgaard, Steve Juggins, Ed Cushing, Gavin Simpson

Acknowledgements
Allan Gordon, Mark Hill, Cajo ter Braak, Petr Šmilauer, Steve Juggins, Richard Telford, Pierre Legendre, Cathy Jenks