NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA
Lecture 12: Overview
John Birks

OVERVIEW
• Topics covered
• Exploratory data analysis
• Clustering
• Gradient analysis
• Hypothesis testing
• Principle of parsimony in data analysis
• Possible future developments
• Conventional
• Less conventional
• Some applications
• Volcanic tephras
• Scotland's most famous product
• Integrated analyses
• Problems of percentage compositional data
• Log-ratios
• Chameleons of CA and CCA
• Software availability
• Web sites
• Final comments

EXPLORATORY DATA ANALYSIS
Essential first step.
Get a feel for the data: ranges, need for transformations, rogue or outlying observations.
NEVER FORGET THE GRAPH

CLUSTERING
Can be useful for some purposes: basic description, summarisation of large data sets.
Fraught with problems and difficulties: choice of DC, choice of clustering method, difficulties of validation and evaluation.
Good general purpose: TWINSPAN – ORBACLAN – COINSPAN

GRADIENT ANALYSIS
Regression, calibration, ordination, constrained ordination, discriminant analysis and canonical variates analysis, analysis of stratigraphical and spatial data.

HYPOTHESIS TESTING
Randomisation tests, Monte Carlo permutation tests.

[Photograph: Cajo ter Braak, 1987, Wageningen]

Classification of gradient analysis techniques by type of problem, response model and method of estimation.
Type of problem | Linear response model: least-squares estimation | Unimodal response model: maximum likelihood estimation | Unimodal response model: weighted averaging estimation
Regression | Multiple regression | Gaussian regression | Weighted averaging of site scores (WA)
Calibration | Linear calibration; 'inverse regression' | Gaussian calibration | Weighted averaging of species scores (WA)
Ordination | Principal components analysis (PCA) | Gaussian ordination | Correspondence analysis (CA); detrended CA (DCA)
Constrained ordination (a) | Redundancy analysis (RDA) (d) | Gaussian canonical ordination | Canonical CA (CCA); detrended CCA (DCCA)
Partial ordination (b) | Partial components analysis | Partial Gaussian ordination | Partial CA; partial DCA
Partial constrained ordination (c) | Partial redundancy analysis | Partial Gaussian canonical ordination | Partial CCA; partial detrended CCA

(a) Constrained multivariate regression
(b) Ordination after regression on covariables
(c) Constrained ordination after regression on covariables = constrained partial multivariate regression
(d) "Reduced-rank regression" = "PCA of y with respect to x"

[Figure] A straight line displays the linear relation between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data. (a = intercept; b = slope or regression coefficient.)

[Figure] A unimodal relation between the abundance value (y) of a species and an environmental variable (x). (u = optimum or mode; t = tolerance; c = maximum.)

GRADIENT ANALYSIS
Linear-based methods or unimodal-based methods?
A critical question, not a matter of personal preference.
If gradients are short, there are sound statistical reasons to use linear methods: Gaussian-based methods break down, edge effects in CA and related techniques become serious, and biplot interpretations are easy.
If gradients are long, linear methods become ineffective ('horseshoe' effect).
How to estimate gradient length?
Type of problem | Method
Regression | Hierarchical series of response models, GLM and HOF
Calibration | GLM, DCCA (single x variable)
Ordination | DCA (detrending by segments, non-linear rescaling)
Constrained ordination | DCCA (detrending by segments, non-linear rescaling)
Partial ordination | Partial DCA (detrending by segments, non-linear rescaling)
Partial constrained ordination | Partial DCCA (detrending by segments, non-linear rescaling)

HYPOTHESIS TESTING
Monte Carlo permutation tests and randomisation tests.
Distribution free: do not require normality of the error distribution.
Do require INDEPENDENCE or EXCHANGEABILITY.
Validity of permutation test results depends on the validity of the type of permutation for the data set at hand.
- Completely randomised observations: completely random permutation is appropriate = randomisation test.
- Randomised block design: permutation must be conditioned on blocks. For example, if type of farm is declared as a covariable and randomisation is conditioned on it, permutations are restricted to within farm.
- Time series or line transect: restricted permutations, data kept in order.
- Spatial data on grid: restricted permutations, data kept in position.
- Repeated measurements: BACI.

PRINCIPLE OF PARSIMONY IN DATA ANALYSIS
William of Occam (Ockham), 14th century English nominalist philosopher.
Insisted that, given a set of equally good explanations for a given phenomenon, the explanation to be favoured is the SIMPLEST EXPLANATION.
Strong appeal to common sense.
Entities should not be multiplied without necessity.
It is vain to do with more what can be done with less.
An explanation of the facts should be no more complicated than necessary.
Among competing hypotheses or models, favour the simplest one that is consistent with the data.
'Shaved' explanations to the minimum.
In data analysis:
1) Models should have as few parameters as possible.
2) Linear models should be preferred to non-linear models.
3) Models relying on few assumptions should be preferred to those relying on many.
4) Models should be simplified/pared down until they are MINIMAL ADEQUATE.
5) Simple explanations should be preferred to complex explanations.

RELEVANCE OF PRINCIPLE OF PARSIMONY TO DATA ANALYSIS

MINIMAL ADEQUATE MODEL (MAM)
- as statistically acceptable as the most complex model
- only contains significant parameters
- high explanatory power
- large number of degrees of freedom
- there may be more than one MAM

CLUSTERING
- prefer simple cluster analysis methods (few assumptions, simple values of α, β, γ)
- intuitively sensible

REGRESSION
- GAM – GLM
- In GAM, simplest smoothers to be used
- In GLM, model simplification to find MAM (e.g. AIC)

CALIBRATION
- minimum number of components for lowest RMSEP in PLS or WA-PLS

ORDINATION
- retain smallest number of statistically significant axes (broken stick test)
- retain 'signal' at expense of noise

PARTIAL ORDINATION
- remove effects of 'nuisance variables' (covariables or concomitant variables) by partialling out their effects
- ordination of residuals
- retain smallest number of statistically significant axes (broken stick test)
- retain 'signal' at expense of 'noise' and 'nuisance variables'

CONSTRAINED ORDINATION
- most powerful if the number of predictor variables is small compared to the number of samples: constraints are strong, arch effects are avoided, there is no need for detrending, and outlier effects are minimised
- minimal adequate model (forward selection, VIF, variable selection, AIC)
- only retain statistically significant axes

PARTIAL CONSTRAINED ORDINATION
- as above + partial ordination

STRATIGRAPHICAL DATA ANALYSIS
- only retain statistically significant zones
- simplify data to major axes or gradients of variation

CHOICE BETWEEN INDIRECT & DIRECT GRADIENT ANALYSIS
Indirect gradient analysis: two steps.
Direct gradient analysis: one combined step.
If relevant environmental data are to hand, the direct approach is likely to be more effective and simpler than the indirect approach.
Generally achieve a simpler model from direct gradient analysis.
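
The broken stick test named under ORDINATION above can be sketched in a few lines. This is only an illustration, assuming the usual broken-stick expectation b_k = (1/p) Σ_{i=k..p} 1/i for the k-th of p axes and the common rule of retaining the leading axes whose observed share of variance exceeds that expectation. The function names and the eigenvalues are invented for the example.

```python
def broken_stick(p):
    """Expected proportion of variance for each of p axes under the broken-stick model."""
    expected = []
    tail = 0.0
    # Accumulate 1/p, 1/p + 1/(p-1), ... so that the k-th entry is sum_{i=k}^{p} 1/i.
    for i in range(p, 0, -1):
        tail += 1.0 / i
        expected.append(tail / p)
    expected.reverse()  # order as axis 1, axis 2, ..., axis p
    return expected


def n_axes_to_retain(eigenvalues):
    """Count leading axes whose observed proportion exceeds the broken-stick expectation."""
    total = sum(eigenvalues)
    observed = [ev / total for ev in eigenvalues]
    expected = broken_stick(len(eigenvalues))
    n = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break  # stop at the first axis that does not beat the expectation
        n += 1
    return n


# Hypothetical eigenvalues from an ordination with five axes.
example_eigenvalues = [4.0, 2.5, 0.8, 0.4, 0.3]
retained = n_axes_to_retain(example_eigenvalues)
print("axes retained:", retained)
```

Stopping at the first axis that fails the comparison keeps the retained set contiguous, which matches the aim of keeping the smallest number of interpretable axes.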
CHOICE BETWEEN REGRESSION & CONSTRAINED ORDINATION
Both are regression procedures! One Y or many Y.
Depends on purpose: is it an advantage to analyse all species simultaneously or individually?
Community assemblage or individual taxa?

CONSTRAINED ORDINATION | REGRESSION
HOLISTIC | INDIVIDUALISTIC
COMMON GRADIENTS | SEPARATE GRADIENTS
QUICK, SIMPLE | SLOW, COMPLEX, DEMANDING
LITTLE THEORY | MUCH THEORY (GLM)
EXPLORATORY | MORE CONFIRMATORY, IN DEPTH

LIMITING FACTORS
- Research questions
- Hypotheses to be tested and evaluated
- Data quality

TYPES OF GRADIENT ANALYSIS METHODS BASED ON WEIGHTED AVERAGING
Community data: incidences (1/0) or abundances (≥ 0) of species at sites.
Environmental data: quantitative and/or qualitative (1/0) variables at the same sites.
Use weighted averages of species scores (appropriate for unimodal biological data) and linear combinations (weighted sums) of environmental variables (appropriate for linear environmental data).

Method | Abbreviation | Response variables (y) | Predictors (x) | Lecture
Correspondence analysis | CA (also DCA) | Community data | - | 6
Canonical correspondence analysis | CCA (also DCCA) | Community data | Environmental variables | 7
CCA partial least squares | CCA-PLS | Community data | Many environmental variables | 11
Weighted averaging calibration | WA | Environmental variable | Community data | 8
WA partial least squares | WA-PLS | Environmental variable(s) | Community data | 8
Co-correspondence analysis | CO-CA | Community data | Community data | 11

Also partial CA, partial DCA, partial CCA, partial DCCA.

POSSIBLE FUTURE DEVELOPMENTS - CONVENTIONAL

Lecture | Topic | Possible developments
2 | Exploratory data analysis | Model-specific 'outlier' detection; interactive graphics
3 | Clustering | COINSPAN; better randomisation tests; CART; latent class analysis
4, 5 | Regression analysis | GLM and GAM framework; evaluation by cross-validation. Give up SS, deviance, t, etc!
6 | Indirect gradient analysis | ? Quest for the 'ideal' ordination method; 2-matrix CA and PCA
7 | Direct gradient analysis | 3-matrix CCA and RDA (biology, environment, species attributes); multi-component variance partitioning; vector-based reduced rank models with GAMs
8 | Calibration and reconstruction | WA-PLS; non-linear deshrinking; ? ML; mixed response models; chemometrics; Bayesian framework; more consideration of spatial autocorrelation
9 | Classification | ? Give up classical methods; use permutation tests; classification and regression trees and random forests
10 | Stratigraphical and spatial data | ? More consideration of temporal and spatial autocorrelation
11 | Hypothesis testing | More realistic permutation tests (restrictions); better p estimation

NEURAL NETWORKS: THE LESS CONVENTIONAL DATA ANALYSIS APPROACH IN THE FUTURE?
Back propagation neural network: layers containing neurons.
input vector → input layer → hidden layer → output layer → output vector

Clearly can have different types of input and output vectors, e.g.

INPUT VECTORS | OUTPUT VECTORS | Equivalent method
> 1 Predictor | 1 or more Responses | Regression
> 1 'Responses' | 1 or more 'Predictors' | Inverse regression or calibration
> 1 Variables | 2 or more Classes | Discriminant analysis

CALIBRATION (INVERSE REGRESSION) AND ENVIRONMENTAL RECONSTRUCTIONS
Malmgren & Nordlund (1997) Palaeo-3 136, 359–373
Planktonic foraminifera, 54 core-top samples.
Summer water and winter water temperatures.
Core E48–22: extends to oxygen stage 9 (320,000 years).
Compared neural network as a calibration tool with:
- Imbrie & Kipp principal component regression
- Modern analog technique (MAT)
- 2-block PLS (SIMCA)
- WA-PLS

CRITERION FOR NETWORK SUCCESS
Leave-one-out cross-validation.
Estimate RMSE (average error rate in training set) and RMSEP (predictions based on leave-one-out cross-validation).
Best configuration: 3 neurons, 600–700 cycles.

Method | RMSEP Summer (°C) | RMSEP Winter (°C) | r (summer) | r (winter)
Neural network | 0.71 | 0.76 | 0.99 | 0.98
PLS | 1.01 | 1.05 | 0.98 | 0.97
MAT | 1.26 | 1.14 | 0.97 | 0.96
Imbrie & Kipp | 1.22 | 1.05 | 0.97 | 0.96
WA-PLS | 1.04 | 0.86 | 0.97 | 0.96

[Figure] Changes in
root-mean-square errors (RMSE) for S in relation to number of training epochs for 3-layer BP neural networks with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. The networks were trained over 50 intervals of 100 epochs each (a total of 5,000 epochs). As expected, the RMSEs decrease as training proceeds. The minimum RMSE, 0.3539, was obtained after training a network with 10 neurons in the hidden layer over 5,000 epochs. Similar results were also obtained for W (not shown in diagram).

[Figure] Changes in root-mean-square errors of prediction (RMSEP) for S with increasing number of training epochs in a 3-layer back propagation neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. These error rates were determined using the Leave-One-Out technique, implying training of the networks over 54 sets consisting of 53 observations each, with one observation left out for later testing. The lowest RMSEPs for both S and W, 0.7176 and 0.7636, respectively, were obtained for a configuration with 3 neurons (only the results for S are shown in the diagram). Note that set-ups with 1, 2, and 3 neurons gave lower RMSEPs than those with 4, 5, and 10 neurons.

[Figure: panels Summer and Winter] Relationships between observed and predicted S and W using a 3-layer BP neural network with 3 neurons in the hidden layer. Lines are linear regression lines. The product-moment correlation coefficients (r) are shown in the lower right-hand corners.

Prediction errors for different network configurations: root-mean-square errors for the differences between observed and predicted S and W using a 3-layer BP neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer.

No. neurons | S: RMSEP | S: No. epochs | W: RMSEP | W: No. epochs
1 | 0.8779 | 500 | 0.8796 | 300
2 | 0.7850 | 1800 | 0.9013 | 700
3 | 0.7176 | 600 | 0.7636 | 700
4 | 1.0621 | 700 | 0.8776 | 700
5 | 1.0032 | 2200 | 0.9206 | 3600
10 | 1.2108 | 500 | 0.9332 | 3000

Root-mean-square errors of prediction (RMSEP) are based on the Leave-One-Out technique, in which each of the 54 observations in the data set is left out one at a time and the network is trained on the remaining observations. The trained network is then used to predict the excluded observation. The network was run over 50 intervals of 100 epochs each, and the error rates were recorded after each interval.

Prediction error for different methods: root-mean-square errors of prediction (RMSEP) for S and W obtained from a 3-layer BP network, Imbrie-Kipp Transfer Functions (IKTF), the Modern Analog Technique (MAT), and Soft Independent Modelling of Class Analogy (SIMCA).

Method | S | W
BP network | 0.7136 | 0.7636
IKTF | 1.2224 | 1.0550
MAT | 1.2610 | 1.1346
SIMCA (PLS) | 1.0058 | 1.0501
WA-PLS | 1.0419 | 0.8560

Predictions were made using the Leave-One-Out technique.

[Figure: panels WA-PLS and Neural Network] Predictions of S and W in core E48-22 from the southern Indian Ocean based on a BP network, compared to the oxygen isotope (δ18O of Globorotalia truncatulinoides) curve presented by Williams (1976) for the uppermost 440 cm of the core. The cross-correlation coefficients for the relationships between δ18O and the predicted S and W are –0.68 and –0.71, respectively, for zero lags (p < 0.001). Interglacial isotope stages 1, 5, 7, and 9, as interpreted here, are indicated in the diagram.

PROBLEMS WITH ANN IMPLEMENTATION AND CROSS-VALIDATION
Easy to over-fit the model.
Leave-one-out cross-validation is not a stringent test, as the ANN will continue to train and optimise its network to the one sample left out.
Need a training set (ca. 80%) and an optimisation (or selection) set (ca. 10%) to select the ANN model with the lowest prediction error, AND an independent test set (ca. 10%) whose prediction error is calculated using the model selected by the optimisation set.

Telford et al.
(2004) Palaeoceanography 19.
947 Atlantic foraminifera samples. Split randomly 100 times into training set (747 samples), optimisation set (100 samples), and test set (100 samples).

Median RMSEP (ºC) | ANN | MAT
Training set | 0.72 | 0.94
Optimisation set | 0.94 | 0.94
Test set | 1.11 | 1.02

No advantage in the hours of ANN computing when cross-validated rigorously.
ANN appears to be a very complicated (and slow) way of doing a MAT!
May not be so good after all!

DIATOMS AND NEURAL NETWORKS
Descriptive statistics for the SWAP diatom-pH data set

No. of samples | 167
No. of taxa | 267
% no. of +ve values in data | 18.47
Total inertia | 3.39

Statistic | Min. | Median | Mean | Max. | S.D. | Range
N2 for samples | 5.13 | 28.58 | 29.22 | 57.18 | |
N2 for taxa | 1 | 14.99 | 23.76 | 120.86 | |
pH | 4.33 | 5.27 | 5.56 | 7.25 | 0.77 | 2.92

[Figure] SWAP data-set, 167 lakes: convergence of the Artificial Neural Network. Yves Prairie & Julien Racca (2002)
[Figure] SWAP data-set, 167 lakes: jack-knife predicted pH against observed pH. Yves Prairie & Julien Racca (2002)
[Figure] pH reconstruction by ANN and WA-PLS (RLGH core). Yves Prairie & Julien Racca (2002)

SKELETONISATION ALGORITHM
Pruning algorithm comparable to BACKWARD ELIMINATION in regression models.
1. Measure relevance Pi for each taxon i: Pi = E(without i) – E(with i), where E = RMSE.
2. Train network with all taxa using back-propagation.
3. Compute relevance Pi based on error propagation and weights.
4. Delete the taxon with the smallest estimated relevance Pi. [Did this in 5% classes of importance.]
5. Re-train the network to a minimum again. [After deleting a taxon, the values of the remaining taxa are not re-calculated, so the input data are always the same original relative abundance values.]
Racca et al. (2003)

[Figure: N2 against ANN functionality]
[Figure: Round Loch of Glenhead, predicted pH: 0% pruned ANN, 30% pruned ANN, 60% pruned ANN, 85% pruned ANN, all taxa WA, all taxa ML]
[Table] General characteristics of the 37 most functional taxa for calibration based on ANN modelling approach.
[Table: columns Apparent and Cross-validation] Summary statistics of the SWAP diatom pH inference models according to the classes of taxa included, based on the Skeletonisation procedure.

Ideally, apparent RMSE should be a reliable measure of the actual predictive ability of a model. The difference between apparent and cross-validated RMSE indicates the extent to which the model has overfitted the data.

[Table] Examples of recently published diatom-based inference models used in palaeolimnology.

CURSE OF DIMENSIONALITY
Related to the ratio of number of taxa to number of lakes, as this ratio determines the ratio of the dimensional space in which the function is determined to the number of observations for which the function is determined.
MAXIMUM ROBUSTNESS: ratio of taxa : lakes as small as possible
(1) increase the number of lakes
(2) decrease the number of taxa

"Neural networks have the potential for data analysis and represent a viable alternative to more conventional data-analytical methods". Malmgren & Nordlund (1997)

Advantages:
1) Mixed linear and non-linear responses.
2) Good empirical performance.
3) Wide applicability.
4) Many predictors and many 'responses'.

Disadvantages:
1) Very much a black box.
2) Conceptually complex.
3) Little underlying theory.
4) Easy to misuse and report erroneous model performance statistics.
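
The training / optimisation / test design recommended above for evaluating ANN models can be sketched as follows. This is a hedged illustration only: the data are synthetic, and a one-dimensional k-nearest-neighbour average (loosely in the spirit of MAT) stands in for the candidate models, because the point is the evaluation protocol and not the network itself. The set sizes 747 / 100 / 100 follow the split quoted above; all function names are hypothetical.

```python
import math
import random


def three_way_split(n, n_opt, n_test, seed=0):
    """Randomly partition indices 0..n-1 into training, optimisation, and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    test = idx[:n_test]
    opt = idx[n_test:n_test + n_opt]
    train = idx[n_test + n_opt:]
    return train, opt, test


def rmsep(observed, predicted):
    """Root-mean-square error of prediction."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def knn_predict(x_train, y_train, x_new, k):
    """Stand-in model: mean response of the k training samples closest to x_new."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))[:k]
    return sum(y_train[i] for i in nearest) / k


def evaluate(x, y, fit_idx, eval_idx, k):
    """Fit on fit_idx (here: just store the data) and return RMSEP on eval_idx."""
    xt = [x[i] for i in fit_idx]
    yt = [y[i] for i in fit_idx]
    predicted = [knn_predict(xt, yt, x[i], k) for i in eval_idx]
    return rmsep([y[i] for i in eval_idx], predicted)


# Synthetic data (an assumption for illustration): one predictor, noisy linear response.
rng = random.Random(1)
x = [rng.uniform(0.0, 1.0) for _ in range(947)]
y = [5.0 + 20.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

train, opt, test = three_way_split(len(x), n_opt=100, n_test=100, seed=2)

# Model selection uses ONLY the optimisation set.
candidates = (1, 3, 5, 10)
opt_error = {k: evaluate(x, y, train, opt, k) for k in candidates}
best_k = min(opt_error, key=opt_error.get)

# The reported performance uses ONLY the untouched test set.
test_error = evaluate(x, y, train, test, best_k)
print("selected k:", best_k, "test RMSEP:", round(test_error, 3))
```

Because the test samples play no part in choosing among candidates, the test RMSEP is not biased by the selection step, which is the weakness of leave-one-out noted above.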
PATTERN RECOGNITION
Unsupervised (cluster analysis, indirect gradient analysis) or supervised (discriminant analysis, direct gradient analysis).

[Diagram: approaches to pattern recognition]
- Statistical theory: linear methods (LDA); discriminants & decision theory
- Neural network
- BELIEF NETWORKS
- Non-parametric methods: CART trees; nearest-neighbour (K-NN)

VOLCANIC TEPHRAS IN N.W. EUROPE OF LATEGLACIAL AND EARLY HOLOCENE AGE

Tephra | Age | Sites
Vedde Ash (rhyolitic type) | mid Younger Dryas, ca 10600 14C yrs BP | Kråkenes, Norway; several other sites in W Norway; Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland
Vedde (basaltic type) | | Kråkenes, W Norway
Borrobol | Lower LG Interstadial, ca 12500 14C yrs BP | Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland
Saksunarvatn | early Holocene, ca 9000 14C yrs BP = 9930 – 10010 cal yr | Faeroes; Kråkenes, W Norway; Dallican Water, Shetland

Oxides measured: SiO2, TiO2, Al2O3, FeO, MnO, MgO, CaO, Na2O, K2O

"The way in which correlation by tephrochronology may revolutionise approaches to reconstructing the sequence of events in the N.E. Atlantic region..." Lowe & Turney (1997)

[Figure: SiO2, Al2O3, MgO, TiO2, CaO, FeO, K2O, and Na2O for each tephra; groups labelled V, VB, B, S]

CANONICAL VARIATES ANALYSIS (= multiple discriminant analysis)
λ1 = 0.988 (32.9%), λ2 = 0.841 (28%)
[Figure: CVA group means for Saksunarvatn, Vedde, Borrobol, Vedde B.]
[Figure: CVA, individual samples]
[Figure: CVA biplot of variables. Groups: Vedde Scotland + a few Vedde Norway; Borrobol; Saksunarvatn; Vedde Basaltic; Vedde Norway; Vedde Scotland]

[Figure: minimum-variance cluster analysis, √% data = chord distance, cophenetic correlation 0.955. Groups: Borrobol, Saksunarvatn, Vedde Basaltic, Vedde Norway, Vedde Scotland]

[Figure: PCA of √% data, all samples. λ1 = 0.96 (95.9%), λ2 = 0.016 (1.6%); 97.4% in total. Groups: Vedde Norway, Saksunarvatn, Borrobol, Vedde Scotland, Vedde Basaltic]

"Tephrochronology offers the potential of overcoming problems of correlation because ash layers provide time-parallel markers and therefore precise comparisons between sequences"
"The geochemical signature of each ash is unmistakable"
Lowe & Turney (1997); Turney et al. (1997)

SCOTLAND'S MOST FAMOUS PRODUCT
Lapointe & Legendre (1994) Applied Statistics 43, 237-257

[Figure] Dendrogram representing the minimum variance hierarchical classification of single-malt Scotch whiskies: two scales are provided at the top of the graph, the number of groups formed by cutting the dendrogram vertically at the given points and the fusion distances of the hierarchical classification (represented by vertical segments in the dendrogram); the vertical order of the whiskies is partly arbitrary, as swapping the branches of a dendrogram does not change the corresponding cophenetic matrix (the 12 groups detailed in Appendix A are labelled A-L here).

[Figure] Map of Scotland showing the positions of the Scottish distilleries, divided into 11 groups (symbols) in the regional classification of single-malt whiskies (Appendix B) (the six Speyside groups are deferred to Fig. 3): distillery names are represented by four-letter abbreviations (see Fig. 3); the names of regions and of some major cities are also indicated. Notice that two Scotches in the present study come from the Springbank distillery; Springbank pertains to the western group whereas Longrow is a member of the Islay group.
[Figure] Map of the Speyside region showing six of the 11 groups (symbols) of Scotch distilleries of the regional classification of single-malt whiskies (Appendix B) (the names of regions and of some major cities are also indicated) and abbreviations and full names of the distilleries.

Looked at spatially constrained classification and constrained ordination (RDA).
Looked at similarities between results based on:
- Colour
- Nose
- Body
- Palate
- Finish
All give consistent results. Can use one to predict the other, except for finish.

TEST OF CONGRUENCE AMONG DISTANCE MATRICES (CADM)
Legendre & Lapointe (2005)
5 data sets:
1 - colour (14 variables +/-)
2 - nose (12 variables +/-)
3 - body (8 variables +/-)
4 - palate (15 variables +/-)
5 - finish (19 variables +/-)
(1 - Jaccard coefficient)½ to give 5 distance matrices.

Overall CADM test: null hypothesis (H0) of incongruence rejected (p = 0.0001).

Compare | Result
1 with 2-5 | H0 rejected
2 with 1, 3-5 | H0 rejected
3 with 1, 2, 4, 5 | H0 rejected
4 with 1-3, 5 | H0 rejected
5 with 1-4 | H0 not rejected

Mantel test (2 matrices): Finish not related to Colour, Nose, Body or Palate.
[Figure] Principal co-ordinates analysis of Mantel-test statistics. Axis 1 = 28.7%, axis 2 = 26.3%.

Why is FINISH so different? It is important!
How were the whiskies tested by the tasters? Did they swallow or spit? If the latter, the finish variables may not be fully detected.
ONLY WHEN SWALLOWING CAN ONE TOTALLY CAPTURE THE AFTERTASTE.
But, "some professional blenders work only with their nose, not finding it necessary to let the whisky pass their lips".
SINGLE MALTS MUST BE SWALLOWED!

INTEGRATED ANALYSES OF BIOLOGICAL AND ENVIRONMENTAL DATA
For nature conservation and management purposes, it is useful to have an overview of the natural zonation of the area as a whole. Such zonation should:
1. Have characteristic or indicator species or life-forms
2. Correspond to a circumscribed range of environments
3. Have some geographical coherence
Requires integrated analysis of biological and environmental data.

INDIRECT CLUSTERING APPROACH
Biological data → clusters (e.g. TWINSPAN) → biological clusters related to environmental data (e.g. DISCRIM, canonical variates analysis, RIVPACS)
cf. Indirect gradient analysis: biological data → PCA or CA → regression with environmental data

DIRECT CLUSTERING APPROACH
Biological data + environmental data → clusters or zones

1. Latent class analysis with biological data as +/- or counts following a binomial or Poisson distribution and environmental data following, after log transformation, a normal distribution.
ter Braak et al. (2003) Ecological Modelling 160: 235-248

2. CCA, RDA, or DCCA of biological and environmental data combined in multivariate direct gradient analysis, followed by minimum-variance cluster analysis (Ward's method) or k-means minimum-variance cluster analysis. Estimate characteristic species for each cluster.
Carey et al. (1995) J. Ecology 83: 833-845. Biogeographical zonation of Scotland.
[Table: characteristic species of biogeographical zones]

3. Principal co-ordinates analysis of mixed (biological and environmental) data using Gower's (1971) coefficient.

$$ s_{ij} = \frac{\sum_{k=1}^{m} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{m} w_{ijk}} $$

where $s_{ijk}$ is the similarity between sites i and j as measured by variable k, and $w_{ijk}$ is typically 1 or 0 depending on whether or not the comparison is valid for variable k. Weights of zero are assigned when k is unknown for one or both sites, or to binary variables to exclude negative matches. For binary variables $s_{ij}$ is then the Jaccard coefficient. For categorical data the component similarity $s_{ijk}$ is one when the two sites have the same value and zero otherwise.
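
A minimal sketch of Gower's coefficient for mixed data, applied to the three-site example that accompanies this definition. It assumes the usual range-scaled component for quantitative variables, s_ijk = 1 − |x_ik − x_jk| / R_k, and it treats Moisture and Age as categorical; that treatment is an assumption, chosen because it reproduces the quoted value for sites 1 and 2. The function name and the encoding of +/- as True/False are illustrative choices.

```python
def gower_similarity(a, b, kinds, ranges):
    """Gower similarity between two sites described by mixed variables.

    kinds[k] is "quantitative", "categorical", or "binary";
    ranges[k] is the range R_k of variable k (used for quantitative variables only).
    """
    numerator = 0.0
    denominator = 0.0
    for k, kind in enumerate(kinds):
        xa, xb = a[k], b[k]
        if xa is None or xb is None:
            continue  # weight 0: value unknown for one or both sites
        if kind == "binary" and not xa and not xb:
            continue  # weight 0: negative match excluded
        if kind == "quantitative":
            s = 1.0 - abs(xa - xb) / ranges[k]
        else:
            s = 1.0 if xa == xb else 0.0
        numerator += s      # weight 1 times component similarity
        denominator += 1.0  # weight 1
    return numerator / denominator if denominator else float("nan")


# Variables: Altitude, Moisture, Limestone, Sheep, Age
kinds = ["quantitative", "categorical", "binary", "binary", "categorical"]
ranges = [150 - 110, None, None, None, None]  # altitude range = 40

site1 = (120, 1, False, False, 1)
site2 = (150, 2, True, False, 2)
site3 = (110, 3, True, True, 3)

s12 = gower_similarity(site1, site2, kinds, ranges)
s13 = gower_similarity(site1, site3, kinds, ranges)
s23 = gower_similarity(site2, site3, kinds, ranges)
print(s12, s13, s23)
```

Skipping a variable entirely is how a zero weight is expressed here: the variable contributes to neither the numerator nor the denominator.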
For quantitative data

$$ s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k} $$

where $R_k$ is the range of variable k.

AN EXAMPLE

Site | Altitude | Moisture | Limestone | Sheep | Age
Site 1 | 120 | 1 | - | - | 1
Site 2 | 150 | 2 | + | - | 2
Site 3 | 110 | 3 | + | + | 3

$$ s_{12} = \frac{1(1 - 30/40) + 1(0) + 1(0) + 0(1) + 1(0)}{1 + 1 + 1 + 0 + 1} = 0.0625 $$

Clusters can then be defined using the principal co-ordinate axes scores in a minimum-variance cluster analysis or a partitioning of the sites on the basis of the ordination scores.

4. Constrained indicator species analysis (COINSPAN)
Carleton, T.J. et al. (1996) J. Vegetation Science 7: 125-130
Like TWINSPAN (biological data only), but uses the CCA first axis instead of the CA first axis (as in TWINSPAN) as the basis for ordering samples prior to creating dichotomies. The resulting clustering is based on CCA axis 1, a linear combination of environmental variables that maximises the dispersion of species scores. COINSPAN clustering thus integrates biology and environment together. Surprisingly little used; has considerable potential.

PROBLEMS OF PERCENTAGE (COMPOSITIONAL) DATA
Jackson D.A. (1997) Ecology 78, 929–940
Simulated data SIM: 200 observations x 5 variables, with different means and variances.

Variable | x1 | x2 | x3 | x4 | x5
Mean | 30 | 60 | 60 | 120 | 120
Variance | 16 | 16 | 64 | 64 | 4096

Correlations between all variables = 0.
Transformed into percentages.
Raw data: BASIS
Transformed data (percentages or proportions): COMPOSITION

[Figure: panels COMPOSITION and BASIS, with r] Bivariate casement plots of the basis (lower triangular matrix) and composition (upper triangular matrix) for the simulated data SIM. The basis relationships are independently generated, and correlations approximate zero. Note the strong linear relationships in the composition arising from the constant-sum constraint, i.e. matrix closure. S1-S5 represent variables.

[Figure: panels COMPOSITION and BASIS] Frequency distributions of the bivariate correlations for SIM obtained under randomization.
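
The closure effect described for SIM can be reproduced with a small simulation. This is a sketch under stated assumptions: it draws five independent normal variables with the means and variances listed for SIM, converts each observation to percentages, and compares correlations before and after. The random seed and the function names are arbitrary, so the numbers will not match the published figure.

```python
import math
import random

MEANS = [30, 60, 60, 120, 120]
VARIANCES = [16, 16, 64, 64, 4096]


def pearson(a, b):
    """Product-moment correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def simulate_basis(n_obs=200, seed=42):
    """Independent normal variables: the 'basis' (raw) data."""
    rng = random.Random(seed)
    # With variance 4096 a few x5 draws can be negative; they are kept as drawn.
    return [[rng.gauss(m, math.sqrt(v)) for m, v in zip(MEANS, VARIANCES)]
            for _ in range(n_obs)]


def to_percentages(rows):
    """Close each observation to a constant sum of 100: the 'composition'."""
    return [[100.0 * x / sum(row) for x in row] for row in rows]


def correlation_matrix(rows):
    cols = list(zip(*rows))
    m = len(cols)
    return [[pearson(cols[i], cols[j]) for j in range(m)] for i in range(m)]


basis = simulate_basis()
composition = to_percentages(basis)
r_basis = correlation_matrix(basis)
r_comp = correlation_matrix(composition)

for name, r in (("basis", r_basis), ("composition", r_comp)):
    print(name)
    for row in r:
        print("  " + " ".join(f"{v:6.2f}" for v in row))
```

The high-variance variable x5 dominates the row totals, so after closure the other variables are largely driven by the total rather than by their own values, which is what induces the spurious correlations.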
Each plot corresponds to the correlation between two variables from the basis (lower triangular matrix) or the composition (upper triangular matrix) used in the previous figure. The basis matrix was randomized within each column, the composition recalculated, and the correlation recalculated. Each plot is a frequency distribution of the correlations obtained from 10 000 randomized matrices.

[Table] Eigenvector coefficients from a principal component analysis of the correlation matrix of SIM. Results from a PCA of the basis and the composition are presented.

[Figure: SIM, Composition and Basis] Scree plots of the eigenvalues for each component from the (a) simulated data (SIM) and (b) herbivorous zooplankton data (ZOO). The solid line represents the eigenvalues from the basis data (i.e. non-standardised), and the dashed line represents the eigenvalues from the compositional data (i.e. proportions).

[Figure: panels Basis and Composition] Scatterplots of the first two components from a principal component analysis of SIM using the (a) basis and (b) composition in calculating the correlation matrix. Letters refer to the points positioned at the ends of axes 1 and 2.

CLUSTER ANALYSIS
[Figure: panels BASIS and COMPOSITION] UPGMA cluster analysis based on a correlation matrix of the variables (S1-S5 and H1-H5) from: (a) the basis data of the simulation data (SIM); (b) the compositional data of SIM; (c) the basis data of the zooplankton data (ZOO); and (d) the compositional data of ZOO.

POSSIBLE SOLUTIONS

1) CENTRED LOG RATIO
Aitchison (1986)
All variables are retained in the analysis but are standardised by dividing each variable by a denominator based on a geometric composite of all variables.
PCA of the covariance matrix

$$ Y_{ij} = \operatorname{cov}\!\left[\log\frac{x_i}{g(x)},\ \log\frac{x_j}{g(x)}\right], \qquad i, j = 1, \ldots, m $$

where g(x) is the geometric mean of the variables, i.e.

$$ g(x) = \left(\prod_{i=1}^{m} x_i\right)^{1/m} $$

Advantages:
1. All variables are retained.
2. Pairwise relationships are the same regardless of using basis or compositional data.
Problems:
1. With SIM, correlations still very strong! 0.412, 0.843, -0.799, -0.906
2. Zero values have an undefined log-ratio value. Replace zero values by a small value.
3. Matrix is singular, so only m-1 components.

2) CORRESPONDENCE ANALYSIS
Only considers proportional relationships between variables; unaffected by using basis or compositional data.
CA/DCA/CCA: focuses on relative abundances.
PCA/RDA: focuses on absolute abundance.
If an environmental variable influences total biomass but leaves the species composition unchanged, the variable will be important in PCA/RDA but not at all important in CA/DCA/CCA.
One approach:
- analyse total biomass separately by regression
- analyse species composition by CCA
Analyses are fully complementary.
(PCA/RDA would probably give results close to the regression analysis.)

UNRESOLVED QUESTION SINCE 1986 IN CA/CCA
How can CA and CCA
1. Model a unimodal function (cf. WA as approximate Gaussian ML regression)
and
2. Be linear, with fit

$$ \frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}} \approx 1 + b_{k1} x_{i1} + \ldots $$

Partial answer: CA and CCA model compositional data (proportions).
This compares with Aitchison's log-ratio model and the polytomous GLM, which are linear in centred logs but unimodal in the original data.

THE TWO FACES OF CORRESPONDENCE ANALYSIS AND CANONICAL CORRESPONDENCE ANALYSIS
CA and CCA are methods for analysing unimodal data.
CA and CCA are CHAMELEONS:
1) Unimodal methods
2) Linear methods
CCA can be derived as a weighted form of reduced rank regression = redundancy analysis = principal component analysis with respect to instrumental variables.
The key element is that the relative abundance is a linear function of the environmental variables (relative here means relative to sample total and species total).
As unimodality and compositional data often go hand in hand, the common element is that CCA models compositional (i.e. relative) abundance data instead of absolute abundance data.

ECOLOGICAL TERMS
CCA (and CA) models relative abundances; takes sample size for granted.
Usually the diversity of a sample increases with its size. CCA and CA take that aspect of α-diversity for granted and focus, instead, on the β-diversity (dissimilarity between sites). If the trend in α-diversity coincides with β-diversity (e.g. species disappear one by one along a gradient), CA and CCA can extract such trends.

In the unimodal context, species scores are weighted averages of sample scores and vice versa.
In the linear context, species scores are derived from a weighted linear regression of transformed species data on to the sample scores.

The linear context is most useful when gradient length is < 3 SD.
The unimodal context is most useful when gradient length is > 4 SD.
For intermediate lengths, either context may be useful.

Can transform the unimodal model into a linear model by 'take logarithms and double centre' (for data with no zeroes).
If data contain zeroes, there is no explicit linearising data transformation because we cannot take logarithms.
In CA and CCA, a transformation is implicit that is close to the exact transformation.

EXACT

$$ \log \tilde{y}_{ik} \quad \text{with} \quad \tilde{y}_{ik} = \frac{y_{ik}\, g}{g_i\, g_k} $$

where $g_i$ and $g_k$ are geometric averages across rows and columns, respectively, and $g$ is the overall geometric average.

CA/CCA

$$ \tilde{y}_{ik} = \frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}} $$

where $y_{i+}$ and $y_{+k}$ are the abundance totals across species in site i and across samples for species k, and $y_{++}$ is the grand total.

CA AND CCA INHERIT THEIR TWO FACES FROM MODELS OF COMPOSITIONAL DATA.

DATA TYPE AND CHOICE OF ORDINATION METHOD
Besides gradient length (standard deviations), data type is also important in selecting an ordination method.

Analysis | Absolute abundance | Relative abundance (compositional differences)
Unconstrained | PCA (linear) | CA, DCA (unimodal)
Constrained | RDA (linear) | CCA, DCCA (unimodal)
Constrained | PRC (linear) | -

PCA/RDA are weighted summations; CA/CCA are weighted averages, hence the difference between modelling absolute values (PCA/RDA) or relative values (CA/CCA).
Cannot currently model absolute abundances satisfactorily over long gradients.
Need to partition the data into smaller gradients first (e.g. TWINSPAN).

SPECIES ABSENCES IN DATA SETS
Besides removing the absolute abundance effect, CA, DCA, CCA, and partial CCA (and WA and WA-PLS) do not consider species absences or zero values in the biological data.
Zero values:
• Show real absence?
• Reflect incomplete sampling?
• Chance?
Is this an advantage or a disadvantage?

SOFTWARE AVAILABILITY
CANOCO & CANODRAW
MicroComputer Power, 111 Clover Lane, Ithaca, NY 14850, USA
FurnesR@microcomputerpower.com
http://www.microcomputerpower.com

MAT, ZONE, WINTRAN, C2
Steve Juggins, Geography Department, University of Newcastle, Newcastle upon Tyne NE1 7RH
(Stephen.Juggins@newcastle.ac.uk)
http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/

HOF
Jari Oksanen, Department of Biology, University of Oulu, Oulu, Finland
(jari.oksanen@oulu.fi)
http://cc.oulu.fi/~jarioksa/

TWINSPAN (Mark Hill), DISCRIM (Cajo ter Braak), TWINGRP, RATEPOL, SPLIT, etc.
John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway
(John.Birks@bio.uib.no)

QUERIES
John.Birks@bio.uib.no
John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway
Fax: (+47) 55 58 96 67
gavin.simpson@ucl.ac.uk
Gavin Simpson, Environmental Change Research Centre, University College London, Gower Street, London, WC1E 6BT, UK
http://www.homepages.ucl.ac.uk/~ucfagls/ncourse/

VALUABLE WEB SITES FOR NUMERICAL ECOLOGISTS AND PALAEOECOLOGISTS
www.okstate.edu/artsci/botany/ordinate
Mike Palmer's ordination site with masses of documentation, explanatory notes, links, details of software, etc.
www.canoco.com
Cajo ter Braak's site about CANOCO, with answers to many frequently asked questions (FAQ)
www.microcomputerpower.com
Richard Furnas' site about CANOCO and related software availability and ordering
www.canodraw.com
Petr Šmilauer's site about CANODRAW, CANOCO, and related software
http://regent.bf.jcu.cz/maed
Details of Petr Šmilauer and Jan Lepš' course and data on multivariate analysis of ecological data
http://cc.oulu.fi/~jarioksa/
Jari Oksanen's site with his R vegan package, lecture notes, programs (e.g. HOF), documentation, comments, FAQ, and much more
http://www.bio.umontreal.ca/legendre/indexEnglish.html
Pierre Legendre's site with details of publications, software, activities, etc.
http://www.bio.umontreal.ca/Casgrain/en/labo/index.html
Software from Pierre Legendre's lab
http://labdsv.nr.usu.edu/
Dave Roberts' site about quantitative vegetation ecology with lecture notes, software details, etc.
www.nku.edu/~boycer/fso/
Rick Boyce's site about fuzzy set ordination
http://cran.r-project.org/
R website
www.stat.auckland.ac.nz/~mja/
Marti Anderson's site with new software, details of publications, activities, etc.
www.campus.ncl.ac.uk/staff/Stephen.Juggins
Steve Juggins' site for C2, WinTran, ZONE, etc.
www.chrono.qub.ac.uk/psimpoll/psimpoll.html
Keith Bennett's site for palaeoecological software, notes, etc.
www.chrono.qub.ac.uk/inqua
Keith Bennett's site of INQUA Data Analysis Sub-Commission software, newsletters, etc.
www.env.duke.edu/landscape/classes/env358/env358.html
Dean Urban's site with excellent lecture notes on Multivariate Methods for Environmental Applications

FINAL COMMENTS
Numerical Analysis of Biological Data
Basic building-blocks and concepts and the resulting numerical methods
[Diagram, items in extraction order] Continuum concept, Niches, Weighted averaging, 'Communities', CA/DCA, TWINSPAN, Cluster analysis, Metric scaling, Non-metric scaling, Indicator-species analysis, INDVAL

Numerical Analysis of Environmental Data
Basic building-blocks and concepts and the resulting numerical methods
[Diagram, items in extraction order] GLM, Regression models, Multiple regression, Gradients, Correlation & covariance + PCA, Linear combinations, PLS, Crossvalidation, RDA, Permutation tests, Cluster analysis, Procrustes rotation, Co-inertia analysis, Linear discriminant analysis, canonical correlation analysis

Numerical Analysis of Biological and Environmental Data
Basic building-blocks and concepts and the resulting numerical methods
[Diagram, items in extraction order] GLM & GAM, Regression models, Niches & Gradients, Weighted averaging, Multiple regression + CA/DCA, TWINSPAN, WA, WA-PLS, Crossvalidation, CCA-PLS, Co-CA, Co-inertia analysis, DISCRIM, PLS, COINSPAN, Permutation tests, Cluster analysis, CCA, Distance-based PCoA, Canonical analysis of principal co-ordinates (CAP), Multiple discriminant analysis

"He uses statistics as a drunken man uses lampposts – for support rather than illumination."
Andrew Lang (1844-1912)
Statistics are for illumination!

[Figure] Sketches illustrating statistical zap and shotgun. From MacKay, 1977, and reproduced through the courtesy of the Institute of Physics.

THE PEOPLE WHO HAVE MADE THE STATISTICAL ZAP POSSIBLE
Mark O. Hill, Marti J. Anderson, Cajo J.F. ter Braak, Richard Telford, Steve Juggins, Pierre Legendre, Gavin Simpson