Using ordination methods in palaeoecology

Using Ordination Methods
in Palaeoecology
John Birks
University of Bergen
University College London
University of Oxford
Tilia Workshop, Liverpool, May 2011
Introduction
Ordination methods and palaeoecological
functions
Uses in palaeoecology
Data summarisation
Data analysis
Data interpretation
Strengths and weaknesses
Conclusions
Introduction
Ordination – term first presented in ecology by
David Goodall in 1954, derived from German
‘ordnung’
Ordering of samples and species in relation to
their overall similarity (indirect gradient analysis) or
to their environment (direct gradient analysis)
End result is a low-dimensional representation of
multivariate data (many objects, many variables).
Axes are chosen to fulfil certain mathematical
properties
Great use in data summarisation, data analysis,
and data interpretation
A simple example of data summarisation using
ordination – European food (Reader’s Digest survey)
Percentage of households having each food (% food), by country

Food codes:
GC ground coffee
IC instant coffee
TB tea or tea bags
SS sugarless sugar
BP packaged biscuits
SP soup (packages)
ST soup (tinned)
IP instant potatoes
FF frozen fish
VF frozen vegetables
AF fresh apples
OF fresh oranges
FT tinned fruit
JS jam (shop)
CG garlic clove
BR butter
ME margarine
OO olive, corn oil
YT yoghurt
CD crispbread

Country  GC  IC  TB  SS  BP  SP  ST  IP  FF  VF  AF  OF  FT  JS  CG  BR  ME  OO  YT  CD
D        90  49  88  19  57  51  19  21  27  21  81  75  44  71  22  91  85  74  30  26
I        82  10  60   2  55  41   3   2   4   2  67  71   9  46  80  66  24  94   5  18
F        88  42  63   4  76  53  11  23  11   5  87  84  40  45  88  94  47  36  57   3
NL       96  62  98  32  62  67  43   7  14  14  83  89  61  81  16  31  97  13  53  15
B        94  38  48  11  74  37  25   9  13  12  76  76  42  57  29  84  80  83  20   5
L        97  61  86  28  79  73  12   7  26  23  85  94  83  20  91  94  94  84  31  24
GB       27  86  99  22  91  55  76  17  20  24  76  68  89  91  11  95  94  57  11  28
P        72  26  77   2  22  34   1   5  20   3  22  51   8  16  89  65  78  92   6   9
A        55  31  61  15  29  33   1   5  15  11  49  42  14  41  51  51  72  28  13  11
CH       73  72  85  25  31  69  10  17  19  15  79  70  46  61  64  82  48  61  48  30
DK       96  17  92  35  66  32  32  11  51  42  81  72  50  64  11  92  91  30  11  34
N        96  17  83  13  62  51   4  17  30  15  61  72  34  51  11  63  94  28   2  62
IRL      13  52  99  11  80  75  18   2   5   3  57  52  46  89   5  97  25  31   3   9

Countries with only 19 of the 20 values given (values in the order listed; the position of the missing value is not indicated):
S        97  13  93  31  43  43  39  54  45  56  78  53  75   9  68  32  48   2  93
SF       98  12  84  20  64  27  10   8  18  12  50  57  22  37  15  96  94  17  64
E        70  40  40  62  43   2  14  23   7  59  77  30  38  86  44  51  91  16  13
Ordination – correspondence analysis
Key:
Countries:
A Austria,
B Belgium,
CH Switzerland,
D West Germany,
DK Denmark,
E Spain,
F France,
GB Great Britain,
I Italy,
IRL Ireland,
L Luxembourg,
N Norway,
NL Holland,
P Portugal,
S Sweden,
SF Finland
Correspondence analysis of percentages of
households in 16 European countries having
each of 20 types of food.
Minimum spanning tree fitted to the full 15-dimensional correspondence analysis solution
What has this to do with pollen-stratigraphical data and palaeoecology?
Multivariate data
Pollen data - 2 pollen types x 15 samples
Variables are the pollen types (columns); samples are the rows.
Depths are in centimetres, and the units for pollen frequencies
may be either in grains counted or percentages.

Sample  Depth  Type A  Type B
1       0      10      50
2       10     12      42
3       20     15      47
4       30     17      38
5       40     18      43
6       50     22      37
7       60     23      35
8       70     26      26
9       80     35      23
10      90     37      22
11      100    43      18
12      110    38      17
13      120    47      15
14      130    42      12
15      140    50      10
Adam (1970)
Alternate representations of the pollen data
Palynological
representation
Geometrical
representation
In (a) the data are plotted as a standard diagram, and in (b)
they are plotted using the geometric model. Units along the
axes may be either pollen counts or percentages.
Adam (1970)
Of course palaeoecological data consist of more than
two pollen types and 15 samples
Main features are
• Many taxa (50-300)
• Many samples or objects (50-500)
• Many zero values in data matrix (‘sparse’ data)
• Few abundant taxa, many rare taxa
• Data are usually expressed as percentages or
proportions (‘closed’ compositional data)
• Data are not normally distributed in a statistical
sense so classical statistical tests are not
appropriate
• Stratigraphical data form temporal-series with a
fixed sample order
Why do ordinations?
1. Data simplification and data reduction - “signal from noise”
2. Detect features that might otherwise escape attention.
3. Hypothesis generation and prediction.
4. Data exploration as aid to further data collection.
5. Communication of results of complex data. Ease of display
of complex data.
6. Aids communication and forces us to be explicit.
“The more orthodox amongst us should at least reflect that
many of the same imperfections are implicit in our own
cerebrations and welcome the exposure which numbers
bring to the muddle which words may obscure”. D Walker (1972)
7. Tackle problems not otherwise soluble. Hopefully better
science.
8. Fun!
Ordination Methods and
Palaeoecological Functions
Biological data Y only - ordination, classical ordination,
indirect gradient analysis, classical or metric scaling,
non-metric multidimensional scaling
Principal components analysis (PCA)
Correspondence analysis (CA)
Detrended correspondence analysis (DCA)
Also:
Principal coordinates analysis (metric scaling) (PCoA)
Non-metric multidimensional scaling (NMDS)
Biological data Y and environmental data X –
canonical ordination, constrained ordination, direct gradient
analysis, multivariate regression
Redundancy analysis (RDA)
Canonical correspondence analysis (CCA)
Detrended canonical correspondence analysis (DCCA)
Also:
Canonical analysis of principal coordinates (CAP)
Aims of indirect gradient analysis
1. Summarise multivariate data in a convenient low-dimensional
geometric way. Dimension-reduction technique
2. Uncover the fundamental underlying structure of data.
Assume that there is underlying LATENT structure.
Occurrences of all species are determined by a few
unknown environmental variables, LATENT VARIABLES,
according to a simple response model. In ordination
trying to recover and identify that underlying
structure
PCA, CA, DCA, PCoA, and NMDS all fulfil aim 1
Only PCA, CA, and DCA fulfil aim 2. Will discuss only
these as they are trying to uncover the underlying
structure. (PCoA can if you use the same distance measures
implicit in PCA or CA!)
Underlying response models
A straight line displays the
linear relation between the
abundance value (y) of a
species and an environmental
variable (x), fitted to artificial
data (●). (a = intercept; b =
slope or regression
coefficient).
A Gaussian curve displays
a unimodal relation
between the abundance
value (y) of a species and
an environmental variable
(x). (u = optimum or
mode; t = tolerance;
c = maximum = exp(a)).
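The two response models can be written out directly. The following is a minimal sketch, assuming NumPy; the function names and the parameter values are illustrative and not from the slides.

```python
import numpy as np

def linear_response(x, a, b):
    """Linear model: y = a + b*x (a = intercept, b = slope)."""
    return a + b * np.asarray(x, dtype=float)

def gaussian_response(x, u, t, c):
    """Gaussian (unimodal) model: y = c * exp(-(x - u)^2 / (2 * t^2)).

    u = optimum (mode), t = tolerance, c = maximum abundance.
    """
    x = np.asarray(x, dtype=float)
    return c * np.exp(-0.5 * ((x - u) / t) ** 2)

# Hypothetical gradient and parameter values, for illustration only
x = np.linspace(0.0, 10.0, 11)
y_lin = linear_response(x, a=2.0, b=0.5)
y_uni = gaussian_response(x, u=5.0, t=1.5, c=40.0)
```

Over a short stretch of the gradient on one side of the optimum, the Gaussian curve is close to monotonic, which is why linear methods can work well for short gradients.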
Besides making a low-dimensional map of
multivariate data, more difficult but biologically more
important is the ordination problem, namely
Construct the single hypothetical variable (latent
variable) that gives the best fit in a statistical sense
to the species data according to an assumed linear
response model (PCA) or assumed unimodal species
response model (CA, DCA)
PCA is the ordination technique that constructs the
theoretical latent variable that minimises the total
residual sum-of-squares after fitting linear lines or
planes to the species data
CA is the ordination technique that constructs the
theoretical latent variable that maximises the
dispersion of the species scores after fitting unimodal
curves or surfaces to the species data
Repeated for PCA axis 2, 3, …, n with constraint
that all axes are uncorrelated with each other
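The least-squares property of PCA can be checked numerically. This is a sketch under stated assumptions: the data matrix is random and purely hypothetical, and PCA is computed by singular value decomposition of the centred data, which is one standard route rather than necessarily the one used by the software mentioned in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 30 sites x 5 species
Y = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))

Yc = Y - Y.mean(axis=0)                      # centre each species
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)

eigenvalues = s ** 2                         # sum of squares accounted for by each axis
site_scores = U * s                          # latent variables (one column per axis)
loadings = Vt.T                              # species slopes on each axis

# Residual sum of squares after fitting straight lines on PCA axis 1 only
fitted_1 = np.outer(site_scores[:, 0], loadings[:, 0])
rss_1 = ((Yc - fitted_1) ** 2).sum()

# Any other candidate latent variable gives a residual sum of squares at least as large
x_rand = rng.normal(size=30)
x_rand -= x_rand.mean()
b = Yc.T @ x_rand / (x_rand @ x_rand)        # least-squares slope for each species
rss_rand = ((Yc - np.outer(x_rand, b)) ** 2).sum()
```

The site scores on different axes are uncorrelated, and the eigenvalues sum to the total sum of squares of the centred data.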
Three dimensional view of a
plane fitted by least-squares
regression of responses (●)
on two explanatory variables
PCA axis 1 and PCA axis 2.
The residuals, i.e. the
vertical distances between
the responses and the fitted
plane are shown. Least
squares regression
determines the plane by
minimization of the sum of
these squared distances.
Representations of PCA results as
biplots of axes 1 and 2
Correlation (=covariance) biplot scaling
Species scores sum of squares = λ
Site scores scaled to unit sum of squares
Emphasis on species
Distance biplot scaling
Site scores sum of squares = λ
Species scores scaled to unit sum of
squares
Emphasis on sites
Total sum-of-squares (variance) = 1598 = sum of eigenvalues
Axis 1 = 471 = 29%; Axis 2 = 344 = 22%; Axes 1 and 2 together = 51% of total variance
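The two biplot scalings differ only in which set of scores carries the eigenvalue. The sketch below uses hypothetical count data and assumes eigenvalues are expressed as sums of squares, as in the totals quoted above; both scalings reproduce the centred data when site and species scores are multiplied together.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.poisson(5.0, size=(20, 6)).astype(float)   # hypothetical 20 sites x 6 species
Yc = Y - Y.mean(axis=0)

U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
lam = s ** 2                                       # eigenvalues as sums of squares

# Distance biplot: site scores carry the eigenvalue, species scores have unit sum of squares
sites_dist, species_dist = U * s, Vt.T

# Correlation (covariance) biplot: species scores carry the eigenvalue, sites have unit sum of squares
sites_corr, species_corr = U, Vt.T * s

percent = 100.0 * lam / lam.sum()                  # percentage of variance per axis

# Arithmetic for the figures quoted on the slide
slide_percent = 100.0 * np.array([471.0, 344.0]) / 1598.0
```

Plotting only the first two columns of either pair gives the rank-2 biplot; the choice of scaling decides whether distances between sites or angles between species arrows are best preserved.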
PCA biplots
•Axes must have identical scales
•Species loadings and site scores in the same plot:
a graphical rank-2 approximation of the data
•Origin: species averages. Points near the origin are
average or are poorly represented
•Species increase in the direction of the arrow, and
decrease in the opposite direction
•The longer the arrow, the stronger the increase
•Angles between vector arrows approximate their
correlations (r = cos θ, where θ is the angle between two arrows)
•Distance from origin reflects magnitude of change
•Approximation: project site point onto species vector
Biplot
interpretation for
Agrostis stolonifera
Summarises
abundance of
species in samples.
In correlation (covariance) biplot, site
scores scaled to unit sum of squares and
sum of squared species scores equals
eigenvalue
Representation of CA results as joint plots
CA ordination diagram of the Dune Meadow Data in Hill’s scaling.
Eigenvalues: λ1 = 0.53, λ2 = 0.40, λ3 = 0.26, λ4 = 0.17
Panels: CANOCO, R
CA: joint plot interpretation
Joint plot with weighted Chi-squared metric: species
and sites in the same plot with Hill's scaling.
• Distance from the origin: Chi-squared
difference from the profile
• Points at the origin either average or
poorly explained
• Distant species often rare, close species
usually common
• Unimodal centroid interpretation: species
optima and gradient values – at least for
well-explained species
• Can also construct CA biplots
• Samples close together are inferred to
resemble one another in species composition
• Samples with similar species composition are
assumed to be from similar environments
J. Oksanen (2002)
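Correspondence analysis can be sketched as a singular value decomposition of the standardised residuals from the row-by-column independence model. The example below assumes NumPy and uses a small artificial presence/absence table of nine species (A to I) in seven sites; the variable names are illustrative. Sorting species and sites by their first-axis scores arranges the table along its diagonal.

```python
import numpy as np

species = list("ABCDEFGHI")
sites = [1, 2, 3, 4, 5, 6, 7]
# Presence (1) / absence (0); rows = species A..I, columns = sites 1..7
N = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
], dtype=float)

P = N / N.sum()                      # relative frequencies
r = P.sum(axis=1)                    # species (row) masses
c = P.sum(axis=0)                    # site (column) masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
eig = sv ** 2                        # CA eigenvalues

species_scores = U[:, 0] / np.sqrt(r)    # first-axis scores (sign is arbitrary)
site_scores = Vt[0] / np.sqrt(c)

species_order = [species[i] for i in np.argsort(species_scores)]
site_order = [sites[j] for j in np.argsort(site_scores)]
```

Because the sign of an ordination axis is arbitrary, the orderings may come out reversed; only the sequence matters.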
Detrended correspondence analysis
(DCA)
Aim to correct three 'artefacts' or 'faults' in CA:
1. Detrending to remove 'spurious' curvature in the
ordination of strong single gradients
2. Rescaling to correct shrinking at the ends of ordination
axes resulting in packing of sites at gradient ends
3. Downweighting to reduce the influence of rare species
Implemented originally in DECORANA and now in CANOCO
and R (vegan)
Allows estimation of gradient length or the amount of
compositional turnover along the DCA axes (in standard
deviation units). 4 sd units represent complete turnover
along gradient.
CA applied to artificial data (– denotes absence). Column a: The
table looks chaotic. Column b: After rearrangement of species and
sites in order of their scores on the first CA axis (u_k and x_i), a
two-way Petrie matrix appears: λ1 = 0.87

Column a
Species  Sites
         1  2  3  4  5  6  7
A        1  –  –  –  –  –  –
B        1  –  –  –  –  –  1
C        1  1  –  –  –  –  1
D        –  –  –  1  1  1  –
E        –  1  –  1  –  –  1
F        –  1  –  1  –  1  –
G        –  –  1  –  1  1  –
H        –  –  1  –  1  –  –
I        –  –  1  –  –  –  –

Column b
Species  Sites                                             u_k
         1      7      2      4      6      5      3
A        1      –      –      –      –      –      –      -1.4
B        1      1      –      –      –      –      –      -1.24
C        1      1      1      –      –      –      –      -1.03
E        –      1      1      1      –      –      –      -0.56
F        –      –      1      1      1      –      –       0
D        –      –      –      1      1      1      –       0.56
G        –      –      –      –      1      1      1       1.03
H        –      –      –      –      –      1      1       1.24
I        –      –      –      –      –      –      1       1.4
x_i     -1.40  -1.08  -0.60   0.00   0.60   1.08   1.40

'Seriation' to arrange data into a sequence
[Figure: CA ordination of the sites, axis 1 (λ1 = 0.87) against axis 2 (λ2 = 0.57), showing the arch effect and the distorted distances along axis 1]
Ordination by CA of the two-way Petrie matrix
in the table above. a: Arch effect in the
ordination diagram (Hill’s scaling; sites
labelled as in table above; species not
shown). b: One-dimensional CA ordination
(the first axis scores of Figure a), showing that
sites at the ends of the axis are closer
together than sites near the middle of the
axis. c: One-dimensional DCA ordination,
obtained by nonlinearly rescaling the first CA
axis. The sites would not show variation on
The cause of the arch in CA
• There is a curve in the species space, and PCA shows it correctly.
• CA may be able to deal with unimodal responses, but if there is one
dominant gradient, the second axis is the first axis folded. Occurs
when the first axis is at least twice as long as the second 'real' axis.
• Problems clearly arise when there is one strong dominant gradient.
J. Oksanen (2002)
Implicit distances between objects in
PCA and CA
Euclidean distance implicit in PCA involves
absolute differences of species between sites.
Chi-squared distance implicit in CA involves
proportional differences in abundances of species
between sites.
Differences in site and species totals are therefore
less influential in CA than in PCA unless some
transformation is used in PCA to correct for this
effect (e.g. percentage transformations)
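The contrast between the two implicit distances can be illustrated with a toy table. This is a sketch assuming NumPy; the function names and the three-site data are hypothetical. Sites 0 and 1 have identical proportions but very different totals.

```python
import numpy as np

def euclidean_distance(Y, i, j):
    """Distance based on absolute differences in abundance (implicit in PCA)."""
    Y = np.asarray(Y, dtype=float)
    return np.sqrt(((Y[i] - Y[j]) ** 2).sum())

def chi_squared_distance(Y, i, j):
    """Distance based on differences in site profiles (proportions), with each
    species weighted by the inverse of its total (implicit in CA)."""
    Y = np.asarray(Y, dtype=float)
    site_totals = Y.sum(axis=1)
    species_totals = Y.sum(axis=0)
    diff = Y[i] / site_totals[i] - Y[j] / site_totals[j]
    return np.sqrt(Y.sum() * (diff ** 2 / species_totals).sum())

Y = np.array([
    [10.0, 20.0, 70.0],   # site 0
    [1.0, 2.0, 7.0],      # site 1: same proportions as site 0, one tenth of the total
    [70.0, 20.0, 10.0],   # site 2: different composition
])

d_euclid_01 = euclidean_distance(Y, 0, 1)
d_chi_01 = chi_squared_distance(Y, 0, 1)
d_chi_02 = chi_squared_distance(Y, 0, 2)
```

The Euclidean distance separates sites 0 and 1 because of their totals, whereas the chi-squared distance between them is zero.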
Data transformations in PCA
1. Centred species data = PCA of variance–covariance matrix.
Species implicitly weighted by the variance of their
values
2. Standardised PCA = PCA of correlation matrix. Centre
species and divide by standard deviation (zero mean,
unit variance). All species receive equal weight, including
rare species. Use when data are in different units, e.g.
pH, LOI, Ca
3. Square root transformation of percentage data. Chord
distance or Hellinger distance. Excellent with % data
4. Log (y + 1) transformation for abundance data
5. Log transformation and centre by species and samples =
log-linear contrast PCA for closed % data (few variables,
e.g. blood groups)
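The transformations listed above are each a line or two of array arithmetic. The following is a minimal sketch assuming NumPy; the function names and the small percentage table are hypothetical. The log-contrast version requires strictly positive values, which is an assumption that real pollen data with zeros will not meet without further treatment.

```python
import numpy as np

def centre(Y):
    """Centre each species (column): PCA of the variance-covariance matrix."""
    return Y - Y.mean(axis=0)

def standardise(Y):
    """Zero mean and unit variance per species: PCA of the correlation matrix."""
    return (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

def hellinger(Y):
    """Square root of proportions; Euclidean distance then equals Hellinger distance."""
    return np.sqrt(Y / Y.sum(axis=1, keepdims=True))

def log_abundance(Y):
    """log(y + 1) for abundance data."""
    return np.log1p(Y)

def log_contrast(Y):
    """Log transform, then centre by species and by samples (needs Y > 0)."""
    L = np.log(Y)
    return L - L.mean(axis=0) - L.mean(axis=1, keepdims=True) + L.mean()

# Hypothetical percentage data: 4 samples x 3 taxa, rows sum to 100
pct = np.array([
    [60.0, 30.0, 10.0],
    [20.0, 50.0, 30.0],
    [10.0, 20.0, 70.0],
    [40.0, 40.0, 20.0],
])

Z = standardise(pct)
H = hellinger(pct)
LC = log_contrast(pct)
```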
Data transformations in CA
Not as critical as in PCA as CA must have data
in identical units (cf. PCA of correlation matrix)
1. Square root transformation of percentage
data. Reduces impact of abundant species,
optimises ‘signal to noise’ ratio
How many ordination axes to retain for
interpretation?
Jackson, D.A. (1993) Ecology 74,
2204–2214
PCA – applicable to CA, PCoA, ?DCA
Assessment of eigenvalues:
Scree plot
Broken-stick
Total variance (=) divided randomly amongst the axes,
eigenvalues follow a broken stick distribution.
p
bk  
i k
1
i
p = number of variables (= no)
e.g. 6 eigenvalues
bk = size of eigenvalue
% variance – 40.8, 24.2,
15.8, 10.7, 6.1, 2.8
Simple to calculate, robust and reliable
= observed eigenvalues
= broken-stick model expectation
3 axis model appropriate
R
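The broken-stick expectations are easy to compute. The sketch below assumes NumPy; the function names are illustrative, and the retention rule coded here (keep the leading axes whose share of variance exceeds the expectation) is one common reading of the criterion.

```python
import numpy as np

def broken_stick(p):
    """Expected proportion of variance for axes 1..p: b_k = (1/p) * sum_{i=k}^{p} 1/i."""
    inverse = 1.0 / np.arange(1, p + 1)
    # Reversed cumulative sum gives sum_{i=k}^{p} 1/i for each k
    return inverse[::-1].cumsum()[::-1] / p

def n_axes_to_retain(eigenvalues):
    """Number of leading axes whose proportion of variance exceeds expectation."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    proportion = eigenvalues / eigenvalues.sum()
    expected = broken_stick(len(eigenvalues))
    count = 0
    for observed, null in zip(proportion, expected):
        if observed <= null:
            break
        count += 1
    return count

expected_percent = 100.0 * broken_stick(6)

# Hypothetical observed eigenvalues, for illustration only
n_keep = n_axes_to_retain([50.0, 30.0, 17.0, 1.5, 1.0, 0.5])
```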
Aims of direct gradient analysis
Prior to 1986 and the development of canonical
correspondence analysis (CCA) by Cajo ter Braak,
approaches to interpretation of PCA/CA/DCA results
were
1. Plot or contour values of external environmental
variables on ordination plot
2. Plot external variables against ordination axes
3. Regress ordination axis (composite response
variable) on external variables
Limitations of these indirect approaches
1. External variables may turn out to be poorly
related to the first few ordination axes
2. Strong relationships with, say, axis 4 or 5
easily overlooked
Limitations overcome by canonical or constrained
ordination = multivariate direct gradient
analysis
Canonical ordination techniques
Ordination and regression in one technique – Cajo ter Braak 1986
Search for a weighted sum of environmental variables that fits the
species best, i.e. that gives the maximum regression sum of
squares
Ordination diagram
1) patterns of variation in the species data
2) main relationships between species & each environmental variable
Redundancy analysis (RDA) = constrained or canonical PCA
Canonical correspondence analysis (CCA) = constrained CA
Detrended CCA (DCCA) = constrained DCA
Axes constrained to be linear combinations of environmental variables.
In effect PCA or CA or DCA with one extra step:
Do a multiple regression of site scores on the environmental variables
and take as new site scores the fitted values of this regression.
Multivariate regression of Y on X.
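For the linear case, the "one extra step" can be sketched as a multivariate regression of Y on X followed by a PCA of the fitted values. This is a simplified illustration of redundancy analysis assuming NumPy, with hypothetical simulated data and a hypothetical function name; it is not the iterative algorithm used in CANOCO.

```python
import numpy as np

def rda(Y, X):
    """Minimal RDA sketch: regress centred Y on centred X, then PCA of fitted values."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)    # multivariate regression of Y on X
    fitted = Xc @ B
    U, s, Vt = np.linalg.svd(fitted, full_matrices=False)
    eigenvalues = s ** 2                            # canonical eigenvalues
    return {
        "site_scores": U * s,                       # linear combinations of X
        "species_scores": Vt.T,
        "eigenvalues": eigenvalues,
        "explained": eigenvalues.sum() / (Yc ** 2).sum(),
    }

# Hypothetical data: 25 sites, 2 environmental variables, 6 species
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
Y = X @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(25, 6))
result = rda(Y, X)
```

With two environmental variables there can be at most two canonical axes, so the remaining eigenvalues are zero apart from rounding error.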
Primary data in gradient analysis
Indirect GA: species abundances or +/- variables (response variables Y)
Direct GA: the above PLUS environmental variables, as values or classes
(predictor or explanatory variables X)
CCA triplot
CCA of the Dune Meadow Data. a: Ordination diagram with
environmental variables represented by arrows. The c scale applies to
environmental variables, the u scale to species and sites. The types of
management are also shown by closed squares at the centroids of the
meadows of the corresponding types of management.
b: Inferred ranking of the species along the variable amount of manure,
based on the biplot interpretation of Part a of this figure.

          DCA   CCA
λ1        0.54  0.46
λ2        0.40  0.29
R axis 1  0.87  0.96
R axis 2  0.83  0.89
Redundancy analysis – constrained PCA
Short (< 2SD) compositional gradients
Linear or monotonic responses
Reduced-rank regression
PCA of y with respect to x
Two-block mode C PLS
PCA of instrumental variables
Rao (1964)
PCA - best hypothetical latent variable is the one that gives
the smallest total residual sum of squares
RDA - selects linear combination of environmental variables
that gives smallest total residual sum of squares
ter Braak (1994) Ecoscience 1, 127–140 Canonical community
ordination Part I: Basic theory and linear methods
RDA ordination diagram of the Dune Meadow Data with environmental
variables represented as arrows. The scale of the diagram is: 1 unit in
the plot corresponds to 1 unit for the sites, to 0.067 units for the species
and to 0.4 units for the environmental variables. Biplot interpretation
Statistical testing of constrained
ordination results
Statistical significance of species-environmental relationships.
Monte Carlo permutation tests. Distribution-free tests but assume
exchangeability of samples.
Randomly permute the environmental data and relate them to the species data
to give a ‘random data set’. Calculate the first eigenvalue (λ1) and the sum of all
canonical eigenvalues (trace). Repeat many times (e.g. 999).
If species react to the environmental variables, the observed test
statistic (λ1 or trace) for the observed data should be larger than most
(e.g. 95%) of the test statistics calculated from random data. If the
observed value is in the top 5% of values, conclude species are
significantly related to the environmental variables.
Special ‘restricted’ permutation tests for time-ordered data as
occur in palaeoecology.
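The permutation procedure can be sketched in a few lines. The example assumes NumPy, uses the trace (as a fraction of total variance) from a linear, RDA-type fit as the test statistic, and runs on hypothetical simulated data. It uses unrestricted permutations, so it assumes exchangeable samples and is not the restricted scheme needed for time-ordered stratigraphical data.

```python
import numpy as np

def constrained_fraction(Y, X):
    """Trace statistic: fraction of the variance in Y fitted by a linear regression on X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    return ((Xc @ B) ** 2).sum() / (Yc ** 2).sum()

def permutation_test(Y, X, n_perm=999, seed=0):
    """Unrestricted Monte Carlo permutation test of the species-environment relation."""
    rng = np.random.default_rng(seed)
    observed = constrained_fraction(Y, X)
    n_as_large = 0
    for _ in range(n_perm):
        X_perm = X[rng.permutation(len(X))]         # permute rows of environmental data
        if constrained_fraction(Y, X_perm) >= observed:
            n_as_large += 1
    # The observed value counts as one member of the reference distribution
    return observed, (n_as_large + 1) / (n_perm + 1)

# Hypothetical data with a strong species-environment relationship
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 1))
Y = X @ np.array([[1.0, -0.8, 0.5, 0.0]]) + 0.3 * rng.normal(size=(30, 4))
observed, p_value = permutation_test(Y, X)
```

With 999 permutations the smallest attainable p-value is 0.001.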
Statistical significance of constraining
variables
• CCA or RDA maximise correlation
with constraining variables and
eigenvalues.
• Permutation tests can be used to
assess statistical significance:
- Permute rows of environmental
data.
- Repeat CCA or RDA with permuted
data many times.
- If observed λ higher than (most)
permutations, it is regarded as
statistically significant.
J. Oksanen (2002)
Partial constrained ordinations
(partial CCA, RDA, etc)
e.g. pollution effects (variables of interest),
seasonal effects (COVARIABLES Z)
Eliminate (partial out) effect of covariables. Relate residual
variation to pollution variables.
Replace environmental variables by their residuals obtained by
regressing each pollution variable on the covariables.
Analysis is conditioned on specified variables or covariables.
These conditioning variables may typically be 'random' or
background variables, and their effect is removed from the CCA
or RDA based on the 'fixed' or interesting variables.
Very useful in testing competing hypotheses as one can test
significance of sets of variables when other sets are partialled
out.
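Partialling out covariables amounts to working with residuals. The sketch below is a linear (partial RDA type) illustration assuming NumPy; the function names and simulated data are hypothetical. Here the species respond only to the covariable Z, while the variable of interest X is merely correlated with Z.

```python
import numpy as np

def residualise(A, Z):
    """Residuals of each column of A after regression on covariables Z (with intercept)."""
    design = np.column_stack([np.ones(len(A)), Z])
    coef, *_ = np.linalg.lstsq(design, A, rcond=None)
    return A - design @ coef

def partial_fraction(Y, X, Z):
    """Fraction of the total variance in Y explained by X after partialling out Z."""
    Y_res = residualise(Y, Z)
    X_res = residualise(X, Z)
    B, *_ = np.linalg.lstsq(X_res, Y_res, rcond=None)
    total = ((Y - Y.mean(axis=0)) ** 2).sum()
    return ((X_res @ B) ** 2).sum() / total

# Hypothetical data: species respond to Z only; X is correlated with Z
rng = np.random.default_rng(3)
n = 40
Z = rng.normal(size=(n, 1))
X = Z + 0.3 * rng.normal(size=(n, 1))
Y = Z @ np.array([[2.0, -1.5, 1.0, 0.5, -2.0]]) + 0.3 * rng.normal(size=(n, 5))

no_covariables = np.empty((n, 0))
naive = partial_fraction(Y, X, no_covariables)   # X appears to explain a lot
partial = partial_fraction(Y, X, Z)              # little is left once Z is removed
```

Comparing the two fractions shows how an apparent effect of X can disappear once the covariable is conditioned on.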
Partial CCA
Natural variation due to sampling season
and due to gradient from fresh to brackish
water partialled out by partial CCA.
Variation due to pollution could now be
assessed.
Ordination diagram of a partial
canonical correspondence analysis
of diatom species (A) in dykes with
as explanatory variables 24
variables-of-interest (arrows) and
2 covariables (chloride
concentration and season). The
diagram is symmetrically scaled
and shows selected species and
standardized variables and,
instead of individual dykes,
centroids (•) of dyke clusters. The
variables-of-interest shown are:
BOD = biological oxygen demand,
Ca = calcium, Fe = ferrous
compounds, N = Kjeldahl-nitrogen,
O2 = oxygen, P = orthophosphate, Si = silicium compounds, WIDTH = dyke width,
and soil types (CLAY, PEAT). All
variables except BOD, WIDTH,
CLAY and PEAT were transformed
to logarithms because of their
skew distribution.
PCA or CA/DCA?
PCA – linear response model
CA/DCA – unimodal response model
How to know which to use?
Gradient lengths important. Estimate with DCA
If short, good statistical reasons to use LINEAR methods.
If long, linear methods become less effective, UNIMODAL methods
become more effective.
Range 1.5–3.0 standard deviations both are effective.
In practice:
Do a DCA first and establish gradient length.
If less than 2 SD, responses are monotonic. Use PCA.
If more than 2 SD, use CA or DCA.
When to use CA or DCA more difficult.
Ideally use CA (fewer assumptions) but if arch is present, use DCA.
DCA results can be unstable when eigenvalues 1 and 2 are close to
each other (e.g. 0.55, 0.54) (Oksanen (1988) Vegetatio 74, 29–32).
Always do a CA to assess the effect of detrending on the data-set.
(a) The response curves of
3 species along a gradient;
12 quadrats are located at
the numbered points
marked with arrowheads
(artificial data).
(b) Ordinations of the 12
data points by PCA (hollow
dots, dashed line) and by
CA (solid dots, solid line).
Both ordinations exhibit the
arch effect. The CA
ordination also shows scale
contractions at both
extremities.
Hypothetical diagram of the occurrence of species A-J over an environmental
gradient. The length of the gradient is expressed in standard deviation units (SD
units). Broken lines (A’, C’, H’, J’) describe fitted occurrences of species A, C, H and
J respectively. If sampling takes place over a gradient range <1.5 SD, this means
the occurrences of most species are best described by a linear model (A’ and C’). If
sampling takes place over a gradient range >3 SD, occurrences of most species
are best described by a unimodal model (H’ and J’).
Outline of ordination techniques. DCA
(detrended correspondence analysis)
was applied for the determination of
the length of the gradient (LG). LG is
important for choosing between
ordination based on a linear or on a
unimodal response model. In cases
where LG <3, ordination based on
linear response models is considered to
be the most appropriate. PCA (principal
component analysis) visualises
variation in species data in relation to
best fitting theoretical variables.
Environmental variables explaining this
visualised variation are deduced
afterwards, hence, indirectly. RDA
(redundancy analysis) visualises
variation in species data directly in
relation to quantified environmental
variables. Before analysis, covariables
may be introduced in RDA to
compensate for systematic differences
in experimental units. After RDA, a
permutation test can be used to
examine the significance of effects.
Indirect gradient analysis or direct
gradient analysis?
1. Direct methods (RDA, CCA, DCCA) study the part
of the variation in the species data that can be
explained by a particular set of external variables
2. Indirect methods (PCA, CA, DCA) focus on the
major patterns of variation in the species data,
irrespective of any external variables
If external data available, direct approach likely to be
more effective than traditional indirect approach
Depends on research questions and hypotheses
being considered and on the data available
Uses in Palaeoecology
Consider selected examples in data summarisation,
data analysis, and data interpretation. Major aim
throughout is to help the palaeoecologist
summarise and understand her/his data, to
generate hypotheses, or to test hypotheses
Data summarisation
1. Gradient analysis or ordination of a single
stratigraphical sequence. PCA, CA, or DCA, RDA
or CCA or DCCA constrained by depth or age
PCA Biplot
74.6%
Gordon (1982)
Biplot of the Kirchner
Marsh data; C2 =
0.746. The lengths of
the Picea and Quercus
vectors have been
scaled down relative to
the other vectors.
Stratigraphically
neighbouring levels are
joined by a line.
CA Joint Plot
62%
Gordon (1982)
Correspondence analysis representation of the Kirchner Marsh data; C2
= 0.620. Stratigraphically neighbouring levels are joined by a line.
Stratigraphical
plot of sample
scores on the first
correspondence
analysis axis (left)
and of rarefaction
estimate of
richness (E(Sn))
(right) for Diss
Mere, England.
Major pollen-stratigraphical and
cultural levels are
also shown. The
vertical axis is
depth (cm). The
scale for sample
scores runs from
–1.0 (left) to +1.2 (right).
Birks et al. (1988)
Adam (1974)
Stratigraphic plot of PCA axes 1-6, Osgood Swamp,
California. Only axes 1-3 exceed broken-stick model
expectations
2. Gradient analysis or ordination of two or
more stratigraphical sequences
Fugla Ness,
Shetland
Birks &
Ransom
(1969)
Birks & Peglar (1979)
Pollen diagram from Sel Ayre showing the frequencies of all
determinable and indeterminable pollen and spores expressed as
percentages of total pollen and spores (P).
Abbreviations: undiff. = undifferentiated, indet = indeterminable.
Birks & Peglar (1979)
Birks &
Berglund
(1979)
Comparison of Färskesjön and Lösensjön using principal component analysis. The
mean scores of the local pollen zones and the ranges of the sample scores in each
zone are plotted on the first and second principal components, and are joined up in
stratigraphic order. The regional pollen assemblage zones are also shown.
Birks &
Berglund
(1979)
Comparison of Bjärsjöholmssjön and Färskesjön using principal
component analysis. The mean scores of the local pollen zones and the
ranges of the sample scores in each zone are plotted on the first and
second principal components, and are joined up in stratigraphic order.
The Blekinge regional pollen assemblage zones are also shown.
Haberle & Bennett (2004)
The 1st and 2nd axis of the Detrended Correspondence Analysis for
Laguna Oprasa and Laguna Facil plotted against calibrated calendar age
(cal yr BP). The 1st axis contrasts taxa from warmer forested sites with
cooler herbaceous sites. The 2nd axis contrasts taxa preferring wetter
sites with those preferring drier sites
3. Arrangement of taxa along the major axis of
variation, namely depth or age
Abernethy
Forest
Birks & Mathews (1973)
Percentage pollen and spore diagram from Abernethy Forest, Inverness-shire. The
percentages are plotted against time, the age of each sample having been
estimated from the deposition time. Nomenclatural conventions follow Birks
(1973a) unless stated in Appendix 1. The sediment lithology is indicated on the left
side, using the symbols of Troels-Smith (1955). The pollen sum, P, includes all
non-aquatic taxa. Aquatic taxa, pteridophytes, and algae are calculated on the basis
of P + Σ group as indicated.
Birks
(1993)
Pollen types re-arranged on the basis of the weighted average
for depth (TRAN) = CCA with depth as external variable (CANOCO)
Data analysis
Techniques that estimate particular numerical
characteristics from palaeoecological data such as
inferred past environment or compositional turnover
1. Ordination as a tool in testing if a given
environmental reconstruction is statistically
significant (Telford & Birks 2011 Quat Sci Rev
doi: 10.1016/j.quascirev.2011.03.002)
Basic idea of quantitative environmental reconstruction is
two-step process
1. $X_m = Y_m \hat{U}_m$
where
$X_m$ = modern environmental variable(s)
$Y_m$ = modern biological assemblages in surface samples
$\hat{U}_m$ = estimated modern calibration (‘transfer’) functions
2. $\hat{X}_f = Y_f \hat{U}_m$
where
$\hat{X}_f$ = inferred past environmental variable(s)
$Y_f$ = fossil assemblages
$\hat{U}_m$ = estimated modern calibration (‘transfer’) functions
Various numerical ways of doing this – two-way weighted
averaging, WAPLS (assuming unimodal responses),
inverse linear regression, partial least squares regression
(assuming linear responses)
See Birks et al. 2010 The Open Ecology Journal 3: 68-110
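The two-step logic can be illustrated with simple two-way weighted averaging. This is a sketch under stated assumptions: NumPy, simulated Gaussian species responses, hypothetical function names, and no deshrinking step, so the reconstructed values are pulled towards the middle of the modern gradient.

```python
import numpy as np

def wa_optima(Ym, xm):
    """Step 1: species optima as abundance-weighted averages of the modern variable."""
    return (Ym * xm[:, None]).sum(axis=0) / Ym.sum(axis=0)

def wa_reconstruct(Yf, optima):
    """Step 2: inferred values as abundance-weighted averages of the species optima."""
    return (Yf * optima[None, :]).sum(axis=1) / Yf.sum(axis=1)

def simulate(x, species_optima, tolerance=1.5, maximum=50.0):
    """Hypothetical assemblages from Gaussian responses (illustration only)."""
    return maximum * np.exp(-0.5 * ((x[:, None] - species_optima[None, :]) / tolerance) ** 2)

true_optima = np.linspace(1.0, 9.0, 9)

# Modern training set: 41 surface samples along a gradient from 0 to 10
xm = np.linspace(0.0, 10.0, 41)
Ym = simulate(xm, true_optima)
optima = wa_optima(Ym, xm)

# 'Fossil' samples generated at known values, to see how well they are recovered
x_true = np.array([2.0, 4.0, 6.0, 8.0])
Yf = simulate(x_true, true_optima)
x_hat = wa_reconstruct(Yf, optima)
```

The reconstructed values preserve the order of the true values, which is the minimum one would expect of a usable transfer function.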
Obtain reconstruction of, say, July air temperature.
Is it statistically significant or is it a result of chance?
Various steps
1. Do PCA of fossil data (Yf) and see how much variance
is explained by the first axis – maximum possible
latent variable. Say it is 32%
2. Do RDA of fossil data (Yf) with reconstructed
environmental variable (Xf) as sole external variable.
Say it explains 19% of the variation in the fossil data.
3. Using the modern data Xm and Ym, generate 999
random environmental reconstructions to generate a
null distribution of Xf
4. Compare observed variation (19%) with null
distribution and estimate statistical significance of Xf
Telford & Birks
(2011)
5. If two or more environmental reconstructions
have been generated from the same fossil data,
can test if any of them are statistically significant
using a forward-selection procedure in RDA (=
partial RDA)
2. Using ordination to estimate compositional
turnover as a means of comparing
dynamics of different ecological systems
Use detrended canonical correspondence
analysis (DCCA) with palaeoecological data as
response variables and age or depth as sole
external variable. With Hill’s scaling in terms of
standard deviation units, can estimate turnover
in palaeoecological data (Birks 2007 Vegetation
History and Archaeobotany 16: 197-202).
[Figure: Early Holocene percentage pollen diagram of the major taxa from Kråkenes, western Norway (Birks & Birks 2008 The Holocene 18: 19-30), plotted against depth (cm) with an age scale (9200-11600), lithology, loss-on-ignition at 550 °C, and zones 1-7. Values are percentages of the calculation sum.]
Fine resolution diagram from end of Younger Dryas
11500 years ago to 9175 years ago.
Turnover estimates

                                 Turnover (SD)                      Duration (yrs)
Kråkenes
  Total pollen record            2.75                               2450
  Younger Dryas to Betula zone   2.42                               720
  260 yrs since Younger Dryas    1.91                               260
Glacial forelands                2.98-3.81 (mean = 3.32, sd 2.66)   260
Greater compositional change in 260 yrs on glacial
forelands since ‘Little Ice Age’ than at Kråkenes early
Holocene
Compare amount of change at many sites over
the same time interval – ‘meta-analysis’
Smol et al. 2005 PNAS 102: 4397-4402
Diatom stratigraphies for last 150 years in 42
arctic lakes
Turnover
0.70-2.84 SD
Compared with turnover in last 150 years in
unimpacted temperate lakes
Turnover
0.72-1.39 SD, median 1.02 SD
Turnover >1 SD in arctic lakes suggests greater
compositional change relative to undisturbed
temperate lakes
Use as a baseline
Moritz et al. 2002
Smol et al. (2005)
Back to Kråkenes
Turnover in diatom stratigraphy in first 150 years
since Younger Dryas is 2.81 SD, about the same
as in lakes in Arctic Canada (Ellesmere Island) in
the last 150 years
Interesting parallel
Rapid biotic turnover in response to climate
change
3. Comparing fossil and modern assemblages
Jacobson & Grimm (1986)
DCA
Graph of distance (number of
standard deviations) moved every
100 yr in the first three dimensions
of the ordination vs age. Greater
distance indicates greater change
in pollen spectra in 100yr.
Jacobson & Grimm (1986)
DCA
Ordination of the 100
BP analogue pollen
assemblages and a 5-sample running
average of the Billy’s
Lake fossil pollen
samples. Points on this
running average curve
represent the position
every 100 yr (rather
than the position of
each sample). Time
marked for each 1000
yr (k).
Birks et al. (1990)
Passive fossil samples
added into CCA of modern
diatom-chemistry data
Fossil samples fitted on the basis of their overall composition
into CCA species-environment space
Canonical
correspondence
analysis (CCA)
time-tracks of
selected cores from
the Round Loch of
Glenhead; (a) K5,
(b) K2, (c) K16, (d)
K86, (e) K6, (f)
environmental
variables. Cores are
presented in order
of decreasing
sediment
accumulation rate.
Allott et al. (1992)
Hypothesis testing using constrained
ordinations
1. Assessing potential external ‘drivers’ on an
aquatic ecosystem
Bradshaw et al. 2005 The Holocene 15: 1152-1162
Dallund Sø, a small (15 ha), shallow (2.6 m) lowland
eutrophic lake on the island of Funen, Denmark.
Catchment (153 ha) today:
agriculture 77 ha
built-up areas 41 ha
woodland 32 ha
wetlands 3 ha
Nutrient rich – total P 65-120 µg l⁻¹
Terrestrial landscape or catchment development
Bradshaw et al. (2005)
Aquatic ecosystem development
Bradshaw et al. (2005)
DCA of pollen and diatom data separately to summarise
major underlying trends in both data sets
Pollen – high scores for trees, low scores for
light-demanding herbs and crops
Diatoms – high scores for mainly planktonic and large
benthic types, low scores for Fragilaria spp. and
eutrophic spp. (e.g. Cyclostephanos dubius)
Bradshaw et al. (2005)
Major contrast between samples before and
after Late Bronze Age forest clearances
Prior to clearance, the lake experienced few impacts.
After the clearance, the lake was heavily impacted.
Plot labels: 'Lake', 'Catchment'
Bradshaw et al. (2005)
Canonical correspondence analysis
Response variables:
Diatom taxa
Predictor external variables:
Pollen taxa, LOI, dry mass and minerogenic accumulation
rates, plant macrofossils, Pediastrum
Covariable:
Age
69 matching samples
Partial CCA with age partialled out as a covariable. Makes
interpretation of effects of predictors easier by removing
temporal trends and temporal autocorrelation
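The effect of partialling out a covariable can be illustrated with the linear analogue of partial CCA (partial RDA). Responses and predictors are both regressed on the covariable, and only the residuals are related to each other. This sketch is not the weighted-averaging algorithm of CCA itself. The function names and demonstration data are hypothetical, and fractions are expressed relative to the total variation, which may differ from the convention used in a given program.

```python
import numpy as np

def residualise(M, Z):
    """Residuals of each column of M after least-squares regression on Z plus an intercept."""
    design = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(design, M, rcond=None)
    return M - design @ coef

def partial_rda_fractions(Y, X, Z):
    """Partition the variation in Y among covariables Z, predictors X (given Z) and residual.

    Y : responses (samples x taxa), X : predictors (samples x p),
    Z : covariables (samples x q). All inputs are 2-D arrays.
    """
    Y, X, Z = (np.asarray(a, dtype=float) for a in (Y, X, Z))
    Yc = Y - Y.mean(axis=0)
    total = (Yc ** 2).sum()
    Y_res = residualise(Y, Z)   # variation left after removing the covariable
    X_res = residualise(X, Z)   # predictors made independent of the covariable
    coef, *_ = np.linalg.lstsq(X_res, Y_res, rcond=None)
    fitted = X_res @ coef
    return {
        "covariable": 1.0 - (Y_res ** 2).sum() / total,
        "predictors": (fitted ** 2).sum() / total,
        "residual": ((Y_res - fitted) ** 2).sum() / total,
    }

# Hypothetical noise-free data: two taxa respond to age and to one predictor.
rng = np.random.default_rng(1)
age = np.arange(30.0).reshape(-1, 1)
x = rng.normal(size=(30, 1))
Y_demo = np.column_stack([2 * age + 3 * x, -age + x])
parts = partial_rda_fractions(Y_demo, x, age)
```

The three fractions sum to one, so the "predictors" entry is the share of variation attributable to the predictors once the temporal trend has been removed.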
Partial CCA all variables:
18.4% of variation in diatom data explained by Poaceae
pollen, Cannabis-type pollen, and Daphnia ephippia, the only
three independent and statistically significant predictors.
As different external factors may be important at different times,
the data were divided into 50 overlapping subsets of 20 samples
each (samples 1-20, 2-21, 3-22, etc.)
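The moving-window subsetting can be sketched as below. The window width of 20 is inferred from the ranges quoted (1-20, 2-21, ...), and with 69 matching samples this gives exactly 50 windows. The function name is hypothetical.

```python
def moving_windows(n_samples, width):
    """Return (start, end) pairs, 1-based and inclusive, for overlapping windows."""
    return [(start, start + width - 1) for start in range(1, n_samples - width + 2)]

windows = moving_windows(69, 20)
print(len(windows))   # 50
print(windows[:3])    # [(1, 20), (2, 21), (3, 22)]
print(windows[-1])    # (50, 69)
```

Each window would then be analysed separately, so that the set of significant predictors can change through time.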
Bradshaw et al. (2005)
CCA of 50 subsets from bottom to top and % variance explained
1. 4520-1840 BC Poaceae is sole predictor variable
(20-22% of diatom variance)
2. 3760-1310 BC LOI and Populus pollen (16-33%)
3. 3050-600 BC Betula, Ulmus, Populus, Fagus,
Plantago, etc. (17-40%)
i.e. in these early periods, diatom change influenced to
some degree by external catchment processes and
terrestrial vegetation change.
4. 2570 BC – 1260 AD Erosion indicators (charcoal,
dry mass accumulation), retting indicator Linum
capsules, Daphnia ephippia, Secale and Hordeum
pollen (11-52%)
i.e. changing water depth and external factors
5. 160 BC – 1900 AD Hordeum, Fagus, Cannabis
pollen, Pediastrum boryanum, Nymphaea seeds
(22-47%)
i.e. nutrient enrichment as a result of retting
hemp, also changes in water depth and water
clarity
Bradshaw et al. (2005)
Strong link between inferred catchment change and within-lake
development. Timing and magnitude are not always perfectly matched,
e.g. transition to Medieval Period
2. Lake Euramoo, NE Queensland
Can use ordination methods to summarise several
palaeoecological proxies and to compare with
other proxies over last 800 years
Major changes between pre-European period (A)
and European settlement (B)
Haberle et al. (2006)
Tested using RDA how well different proxies ‘predict’
or ‘explain’ (in a statistical sense) other proxies
The only proxy that significantly predicted other proxies
was pollen, which predicted changes in diatoms
(25.4%) and chironomids (15.4%)
Illustrates the importance of catchment and its
vegetation on the lake and its biota
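Testing how well one proxy ‘predicts’ another with RDA can be sketched as a multivariate regression followed by a permutation test. The code below is a simplified illustration, not the analysis of Haberle et al. (2006). The function names and data are hypothetical, and the unrestricted permutation used here ignores the temporal ordering of stratigraphic samples, for which a restricted permutation scheme would normally be preferred.

```python
import numpy as np

def rda_fraction(Y, X):
    """Fraction of the total variation in Y explained by a linear model on X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ coef
    return (fitted ** 2).sum() / (Yc ** 2).sum()

def permutation_test(Y, X, n_perm=999, seed=0):
    """Observed explained fraction and its permutation p-value.

    Rows of Y are shuffled relative to X. With a fixed number of predictors the
    explained fraction ranks permutations in the same order as a pseudo-F statistic.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    observed = rda_fraction(Y, X)
    hits = sum(
        rda_fraction(Y[rng.permutation(len(Y))], X) >= observed for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: three response taxa driven by one predictor plus small noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(40, 1))
Y_demo = X_demo @ np.array([[1.0, -2.0, 0.5]]) + 0.05 * rng.normal(size=(40, 3))
r2, p_value = permutation_test(Y_demo, X_demo, n_perm=199)
```

The smallest p-value attainable is 1 / (n_perm + 1), so the number of permutations limits how small a significance level can be reported.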
Strengths and Weaknesses
Merits and drawbacks of indirect
ordination methods
1. Can distract attention from individual species responses
by focussing on the overall multivariate response only.
2. As it is a correlative method, it can help with hypothesis
generation.
3. It can rarely, if ever, demonstrate causality.
4. Ordinations can provide useful low-dimensional
representations of complex data. Valuable for
summarisation and for hypothesis generation.
5. Ordination is a tool and a means to an end. It is not an
end in itself.
Current uses of indirect ordination
methods
Hill, M.O. (1988) Bull. Soc. Roy. Bot. Belg. 121, 134–41
“Ordination is a rather artificial technique. The idea that the
world consists of a series of environmental gradients, along
which we should place our vegetation samples, is attractive.
But this remains an artificial view of vegetation. In the end
the behaviour of vegetation should be interpreted in terms of
its structure, the autoecology of its species and, above all, the
time factor. At this level, trends become unimportant and
multivariate analysis is perhaps irrelevant. Ordination is
useful to provide a first description but it cannot provide
deeper biological insights.”
Current uses of direct ordination
methods
Direct gradient analysis or constrained ordination
techniques allow hypothesis testing, not only
hypothesis generation
Need fossil data and relevant external data. Major
challenge to have relevant external data that are
ecologically independent of the fossil data
Possible in a few examples – fossil data and
volcanic tephra; split-sampling of fossil data into
external predictors of vegetation type (e.g.
macrofossils) and biological responses (e.g.
diatoms, chironomids)
[Diagram contrasting EXPLORATORY DATA ANALYSIS and CONFIRMATORY
DATA ANALYSIS. Box labels: real world ‘facts’; observations and
measurements; data; data analysis; patterns; ‘information’;
hypotheses; statistical testing; hypothesis testing; narratives;
theory]
EXPLORATORY DATA ANALYSIS compared with CONFIRMATORY DATA ANALYSIS

Exploratory: How can I optimally describe or explain variation in data set?
Confirmatory: Can I reject the null hypothesis that the species are unrelated to a particular environmental factor or set of factors?

Exploratory: Samples can be collected in many ways, including subjective sampling.
Confirmatory: Samples must be representative of universe of interest – random, stratified random, systematic.

Exploratory: ‘Data-fishing’ permissible, post-hoc analyses, explanations, hypotheses, narrative okay.
Confirmatory: Analysis must be planned a priori.

Exploratory: P-values only a rough guide.
Confirmatory: P-values meaningful.

Exploratory: Stepwise techniques (e.g. forward selection) useful and valid.
Confirmatory: Stepwise techniques not strictly valid.

Exploratory: Main purpose is to find ‘pattern’ or ‘structure’ in nature. Inherently subjective, personal activity. Interpretations not repeatable.
Confirmatory: Main purpose is to test hypotheses about patterns. Inherently analytical and rigorous. Interpretations repeatable.
A well-designed modern palaeoecological study
combines both
1) Two-phase study
- Initial phase is exploratory, perhaps involving pilot data or
previous data to generate hypotheses.
- Second phase is confirmatory, collection of new data from
defined sampling scheme, planned data analysis.
2) Split-sampling
- Large data set (>100 objects), randomly split into two (75/25) –
exploratory set and confirmatory set.
- Generate hypotheses from exploratory set (allow data fishing);
test hypotheses with confirmatory set.
- Rarely done in palaeoecology.
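A random 75/25 split of the kind described under split-sampling can be sketched as follows. The function name is hypothetical, and assigning the larger share to the exploratory set is an assumption based on the order in which the two sets are listed.

```python
import numpy as np

def split_sample(n_objects, exploratory_fraction=0.75, seed=0):
    """Randomly split object indices into exploratory and confirmatory sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_objects)
    n_explore = int(round(exploratory_fraction * n_objects))
    return np.sort(order[:n_explore]), np.sort(order[n_explore:])

# Hypothetical data set of 120 objects.
explore_idx, confirm_idx = split_sample(120)
```

Hypotheses are generated only from the objects in `explore_idx`; the objects in `confirm_idx` are left untouched until the planned tests are run.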
Data diving with cross-validation: an investigation of
broad-scale gradient in Swedish weed communities
Hallgren et al. 1999 J Ecology 87: 1037-1051
[Flow chart with boxes: Full data set; Remove observations with
missing data; Clean data set; Random split; Exploratory data set;
Hypotheses; Choice of variables; Ideas for more analysis; Some
previously removed data; Confirmatory data set; Hypothesis tests;
Combined data set; Analyses for display; RESULTS]
Flow chart for the sequence of analyses. Solid lines represent the
flow of data and dashed lines the flow of analysis.
Split-sampling is very data-demanding
Palaeoecological data collection is very labour-intensive
Exciting developments at Massey University, New
Zealand towards automated pollen counting
AutoStage can now identify 50 taxa with a
reliability of 98% or more and can flag others as
unknown to be looked at by the palynologist.
http://autopollen.massey.ac.nz
[Figure: AutoStage flow. Pollen types shown: Betula pendula (silver
birch), Ligustrum lucidum (privet), Dactylis glomerata (cocksfoot),
Cupressus macrocarpa (macrocarpa tree), Wattle acacia, Pinus radiata]
Much greater standard deviation for ‘people’ compared to machine,
especially for pollen types 3 (Cupressus) and 4 (Ligustrum) i.e. we
are more variable than a machine at pollen counting.
Time: 60 pollen types and a count of 2000-3000 grains
take 3 hours (quicker than an experienced pollen
analyst).
AutoStage can also analyse 24 hours a day, 7 days
a week, 52 weeks a year.
It can count 3000 samples in a year, compared with
500 samples a year for a hard-working pollen analyst,
i.e. 6 times more!
Cost: about $10,000 = £7000 = 70,000 Norwegian
kroner, about 2-3 months’ salary!
A major breakthrough – but how do we prepare that
number of samples?
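The throughput figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the values stated on the slide.

```python
HOURS_PER_SAMPLE = 3           # quoted time for one count of 2000-3000 grains
HOURS_PER_YEAR = 24 * 7 * 52   # continuous operation, as stated

machine_samples = HOURS_PER_YEAR // HOURS_PER_SAMPLE
analyst_samples = 500          # quoted figure for a hard-working analyst

print(machine_samples)                 # 2912, i.e. roughly 3000 samples a year
print(round(3000 / analyst_samples))   # 6
```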
Conclusions
Ordination techniques are useful tools in
palaeoecology for data summarisation, data
analysis, and data interpretation
Limiting factor in their full exploitation is
availability of external data and data-sets
large enough for split-sampling cross-validation
needed in hypothesis testing
AutoStage and automated pollen counting are
major challenges for next 5 years
Good reasons for selecting PCA, CA, or DCA
(indirect methods) or RDA, CCA, or DCCA (direct
methods)
Good reasons for selecting PCA (linear) or CA
(unimodal) and CA (no curvature) or DCA
(curvature)
No real role for NMDS
PCoA and CAP potentially useful if there are good
reasons to use distance measures not possible in
PCA, CA, RDA, or CCA
Andrew Lang (1844-1912): “He uses statistics as a drunken
man uses lamp-posts – for support rather than illumination.”
From MacKay, 1977, and reproduced through the courtesy of
the Institute of Physics.
Statistics are for
illumination!
Post-1987
Pre-1987
Sketches illustrating statistical zap and shotgun approaches to
data analysis
Cajo ter Braak
1987 Wageningen
Major players in ordination theory
Karl Pearson – 1901 Invented PCA
David W. Goodall – 1954 First use of PCA in ecology
Joseph B. Kruskal – 1964 Development of NMDS
John C. Gower – 1966 Popularised PCoA, invented Procrustes rotation
Mark O. Hill – 1973 Popularised CA in ecology, 1980 DCA
Cajo J.F. ter Braak – 1985 Unified PCA and CA in terms of response models
Jari Oksanen – Continuous questioning about ordination methods, championing NMDS, developing R (vegan)
Pierre Legendre – Major developments in extending direct ordination methods
Petr Šmilauer – Developed CanoDraw and CANOCO for Windows
Marti J. Anderson – Extending RDA and CCA to other distance measures and developing CAP
Richard J. Telford – Statistical testing of environmental reconstructions
Key researchers in the quantitative
analysis of palaeoecological data
Andy Lotter
Keith Bennett
Eric Grimm
Allan Gordon
Bent Odgaard
Steve Juggins
Ed Cushing
Gavin Simpson
Acknowledgements
Allan Gordon
Mark Hill
Cajo ter Braak
Petr Šmilauer
Steve Juggins
Richard Telford
Pierre Legendre
Cathy Jenks