NUMERICAL ANALYSIS OF
BIOLOGICAL AND
ENVIRONMENTAL DATA
Lecture 4
Regression Analysis
John Birks
REGRESSION ANALYSIS
Introduction, Aims, and Main Uses
Response model
Types of response variables y
Types of predictor variables x
Types of response curves
Transformations
Types of regression
Null hypothesis, alternative hypothesis, type I and II errors, α and β
Quantitative response variable
Nominal explanatory (predictor) variables
Quantitative explanatory (predictor) variables
General linear model
REGRESSION ANALYSIS continued
Presence/absence response variable
Nominal explanatory (predictor) variables
Quantitative explanatory (predictor) variables
Generalised linear model (GLM)
Multiple linear regression
Multiple logit regression
Selecting explanatory variables
Nominal or nominal and quantitative explanatory
variables
Assessing assumptions of regression model
Simple weighted average regression
Model II regression
Software for basic regression analysis
INTRODUCTION
Explore relationships between species and their environment:
+/– (presence/absence) or abundances of species (responses);
individual species in relation to one or more environmental variables (predictors).
Species abundance or presence/absence – response variable Y
Environmental variables – explanatory or predictor variables X
AIMS
1. To describe response variable as a function of one or more explanatory
variables. This RESPONSE FUNCTION usually cannot be chosen so that the
function will predict responses without error. Try to make these errors as
small as possible and to average them to zero.
2. To predict the response variable under some new value of an explanatory
variable. The value predicted by the response function is the expected
response, the response with the error averaged out. cf. CALIBRATION
3. To express a functional relationship between two variables thought, a
priori, to be related by a simple mathematical relationship, but where only
one of the variables is known exactly. cf. MODEL II REGRESSION
MAIN USES
(1) Estimate ecological parameters for species, e.g. optimum,
amplitude (tolerance) ESTIMATION AND DESCRIPTION
(2) Assess which explanatory variables contribute most to a
species response and which explanatory variables appear to
be unimportant. Statistical testing MODELLING
(3) Predict species responses (+/–, abundance) from sites with
observed values of explanatory variables PREDICTION
(4) Predict environmental variables from species data CALIBRATION or ‘INVERSE REGRESSION’
Fox (2002)
Sokal & Rohlf (1995)
Draper & Smith (1981)
Montgomery & Peck (1992)
Crawley (2002, 2005)
RESPONSE MODEL
Systematic part – regression equation
Error part – statistical distribution of error

Y = b0 + b1x + ε

Y = response variable; x = explanatory variable; ε = error
b0, b1 = fixed but unknown coefficients (b0 = intercept, b1 = slope)

Ey = b0 + b1x    SYSTEMATIC PART

Error part is the distribution of ε, the random variation of the observed response around the expected response.
Aim is to estimate the systematic part from data while taking account of the error part of the model.
In fitting a straight line, the systematic part is estimated simply by estimating b0 and b1.
Least-squares estimation – error part assumed to be normally distributed.
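As an illustration, a minimal R sketch (simulated data, not the lecture's): the systematic part Ey = b0 + b1x plus a normally distributed error part, fitted by least squares with lm().

# Simulated example data (hypothetical values)
set.seed(1)
x <- runif(20, 10, 100)                      # explanatory variable
y <- 4.4 - 0.04 * x + rnorm(20, sd = 0.7)    # systematic part + normal error
m <- lm(y ~ x)                               # least-squares estimates of b0, b1
coef(m)                                      # b0 (intercept) and b1 (slope)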
TYPES OF RESPONSE VARIABLES - y
Quantitative (log transformation)
% quantitative
Nominal including +/–
TYPES OF EXPLANATORY or PREDICTOR VARIABLES - x
Quantitative
Nominal
Ordinal (ranks) - treat as nominal 1/0 if few classes, quantitative if many
classes
TYPES OF RESPONSE CURVES
With one explanatory variable x, regression consists of fitting curves through the data.
What type of curve?
(i) EDA scatter plots of y and x.
(ii) Underlying theory and available knowledge.
TYPES OF RESPONSE CURVES
Shapes of response curves. The expected response (Ey) is plotted against the environmental variable (x). The curves can be constant (a: horizontal line), monotonically increasing (b: sigmoid curve, c: straight line), monotonically decreasing (d: sigmoid curve), unimodal (e: parabola, f: symmetric Gaussian curve, g: asymmetric curve and a block function) or bimodal (h).
Response curves derived from a bimodal curve
by restricting the sampling interval. The curve
is bimodal in the interval a-f, unimodal in a-c
and in d-f, monotonic in b-c and c-e and almost
constant in c-d. Ey = expected response; x =
environmental variable.
TRANSFORMATIONS
Usually needed
TYPES OF REGRESSION

                    Explanatory variables x
                    One                                       Many
Response y          Nominal           Quantitative            Nominal               Quantitative
Quantitative        ANOVA             Linear and non-linear   Multiple LR with      Multiple LR
                                      regression              nominal dummy
                                                              variables
+/–                 χ² test           Logit regression        Multiple logit        Multiple logit
                    [Log-linear                               regression (dummy     regression
                    contingency                               variables)
                    tables]

(LR = Linear Regression)
Also weighted averaging regression and model II regressions
NULL HYPOTHESIS, ALTERNATIVE HYPOTHESIS,
TYPE I ERROR, TYPE II ERROR, α, AND β
Null hypothesis H0
‘y not correlated with x’
No difference, no association, no correlation. Hypothesis
to be tested, usually by some type of significance test.
Alternative hypothesis H1
Postulates non-zero difference, association, correlation.
Hypothesis against which null hypothesis is tested.
Tests of statistical hypotheses are probabilistic
Can just as well estimate the degree to which an effect is felt as judge whether
the effect exists or not.
As a result, can compute probabilities of two types of error.
Type I error ()
probability that we have mistakenly rejected a true null hypothesis
Type II error ()
probability that we have mistakenly failed to reject a false null
hypothesis
                 DECISION
TRUTH            Accept H0              Reject H0
H0 true          No error: 1 – α        Type I error: α
H0 false         Type II error: β       No error: 1 – β

Power of a test is simply the probability of not making a type II error, namely 1 – β. The higher the power, the more likely it is to show, statistically, an effect that really exists.
α   Conventionally set at 0.05, 0.01, or 0.001.
    Type I error: the error that results when the null hypothesis is FALSELY REJECTED.
β   Rarely estimated. A function of the critical value α, the sample size, and the magnitude of the effect being looked for.
    Type II error: the error that results when the null hypothesis is FALSELY ACCEPTED.
QUANTITATIVE RESPONSE VARIABLE,
NOMINAL EXPLANATORY VARIABLE
Relative cover (log-transformed) of a plant species () in relation to the soil types
of clay, peat and sand. The horizontal arrows indicate the mean value in each
type. The solid vertical bars show the 95% confidence interval for the expected
values in each type and the dashed vertical lines the 95% prediction interval for
the log-transformed cover in each type.
QUANTITATIVE RESPONSE VARIABLE,
NOMINAL EXPLANATORY VARIABLE
y = plant cover (response); x = soil type (3 classes)

Response model – Systematic part: 3 expected responses, one for each soil type.
Error part – observed responses vary around the expected responses in each soil type; normally distributed, with the same variance within each soil type.
Estimate, assuming responses are independent.
→ ANALYSIS OF VARIANCE (ANOVA)
Expected responses in 3 soil types. Least squares. Sum over all sites of squared
differences between observed and expected response to be minimal. Parameter that
minimises this SS is the mean.
Difference between Ey and observed response is residual. Least squares minimises
sum of squared vertical distances. Residual SS.
Ey, standard error, and 95% confidence interval = Estimate ± t0.05(v) × s.e.
(t0.05(v) is the 5% critical value in a 2-tailed test.)
Degrees of freedom (v) = n – q, where q = number of parameters.
Means and ANOVA table of the transformed relative cover of the above figure

Term            mean    s.e.    95% confidence interval
Clay            1.70    0.33    (1.00, 2.40)
Peat            3.17    0.38    (2.37, 3.97)
Sand            2.33    0.38    (1.53, 3.13)
Overall mean    2.33

95% confidence interval = Estimate ± t0.05(v) × s.e. The value of t0.05(v) depends on the number of degrees of freedom (v) of the residual; with v = 17, t0.05(17) = 2.11.

ANOVA table

Term          d.f.          s.s.     m.s. (= ss/df)
Regression    q – 1 = 2     7.409    3.704
Residual      n – q = 17    14.826   0.872
Total         n – 1 = 19    22.235   1.17 (= total variance)

q = number of parameters = 3; n = number of objects = 20
F = ms regression / ms residual = 3.704 / 0.872 = 4.24, with (q – 1, n – q) = (2, 17) df; the critical value of F at the 5% level is 3.59.
Total ss = Regression ss (q – 1 = 2 df) + Residual ss (n – q = 17 df), with n – 1 = 19 df in total.
R2 = 1 – (residual sum of squares / total sum of squares) = 1 – (14.826/22.235) = 0.333
R2adj = 1 – (residual variance / total variance) = 1 – (0.872/1.17) = 0.25
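The same kind of analysis in R, sketched on hypothetical cover values (the lecture's 20 observations are not reproduced here): ANOVA fitted as a linear model with a nominal explanatory variable.

# Hypothetical log-cover values for three soil types
cover <- data.frame(
  logcover = c(1.5, 2.0, 1.6, 3.1, 3.3, 3.0, 2.1, 2.6, 2.3),
  soil     = factor(rep(c("clay", "peat", "sand"), each = 3))
)
m <- lm(logcover ~ soil, data = cover)
anova(m)      # regression (soil) and residual d.f., s.s., m.s., F
summary(m)    # estimates, s.e., t-values, R2, R2adj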
QUANTITATIVE RESPONSE VARIABLE,
QUANTITATIVE EXPLANATORY VARIABLE
Straight line fitted by least-squares regression of log-transformed relative cover on mean water-table. The vertical bar on the far right has length equal to twice the sample standard deviation (√total m.s.); the other two smaller vertical bars are twice the length of the residual standard deviation (√residual m.s.). The dashed line is a parabola fitted to the same data (●).

Error part – responses independent and normally distributed around expected values Ey.
Straight line fitted by least-squares: parameter estimates and ANOVA table for the transformed relative cover of the figure above

Term          Parameter   Estimate   s.e.      t (= estimate/s.e.)
Constant      b0          4.411      0.426     10.35
Water-table   b1          -0.037     0.00705   -5.25

ANOVA table

Term          d.f.                   s.s.    m.s.    F       df
Regression    parameters – 1 = 1     13.45   13.45   27.56   1, 18
Residual      n – parameters = 18    8.78    0.488
Total         n – 1 = 19             22.23   1.17

R2 = 0.61; R2adj = 0.58; r = 0.78
QUANTITATIVE RESPONSE VARIABLE,
QUANTITATIVE EXPLANATORY VARIABLE
Does the expected response depend on water table?
F = MS regression / MS residual = 27.56 >> 4.41 (critical value at 5%, df (1, 18))
(df = parameters – 1, n – parameters)

Does slope b1 = 0?
t of b1 = b1 / se(b1) = -5.25; |t| exceeds the critical value of a two-tailed t-test at 5%, t0.05,18 = 2.10
→ b1 not equal to 0
[exactly equivalent to the F test: (b1 / se(b1))² = F]

Construct a 95% confidence interval for b1:
estimate ± t0.05,v × se = -0.037 ± 2.10 × 0.00705 = (-0.052, -0.022)
Does not include 0 → 0 is an unlikely value for b1 (see the R sketch below)
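In R, the same interval comes from confint(); a sketch refitting the straight line to the simulated x and y from the first example.

m <- lm(y ~ x)              # straight-line fit to the first sketch's x and y
confint(m, level = 0.95)    # 95% CIs for b0 and b1
# by hand: estimate +/- t(0.05, v) * s.e.
b  <- coef(summary(m))["x", "Estimate"]
se <- coef(summary(m))["x", "Std. Error"]
b + c(-1, 1) * qt(0.975, df.residual(m)) * se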
Check assumptions of response model
Plot residuals against x and Ey
Could we fit a curve to these data better than a straight line?
Parabola
Ey = b0 + b1x + b2x2
Polynomial regression
Parabola fitted by least-squares regression: parameter estimates and ANOVA table for the transformed relative cover of the above figure

Term             Parameter   Estimate    s.e.       t
Constant         b0          3.988       0.819      4.88
Water-table      b1          -0.0187     0.0317     -0.59   (not different from 0)
(Water-table)²   b2          -0.000169   0.000284   -0.59   (not different from 0)

ANOVA table (one extra parameter, one less residual d.f. than the straight line)

Term          d.f.   s.s.    m.s.    F
Regression    2      13.63   6.815   13.47
Residual      17     8.61    0.506
Total         19     22.23   1.17

R2adj = 0.57 (R2adj = 0.58 for the linear model)
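In R, the parabola adds a squared term, and the improvement over the straight line can be tested with an F-test; a sketch reusing the simulated x and y from the first example.

m1 <- lm(y ~ x)             # straight line
m2 <- lm(y ~ x + I(x^2))    # parabola: one extra parameter
anova(m1, m2)               # F-test for the quadratic term
summary(m2)$adj.r.squared   # compare with summary(m1)$adj.r.squared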
GENERAL LINEAR MODEL
Regression Analysis Summary
Response variable
Y = EY + e
where EY is the expected value of Y for particular values of the
predictors and e is the variability ("error") of the true values around
the expected values EY.
The expected value of the response variable is a function of the
predictor variables
EY = f(X1, ..., Xm)
EY = systematic component, e = stochastic or error component.
Simple linear regression
EY = f(X) = b0 + b1X
Polynomial regression
EY = b0 + b1X + b2X2
Null model
EY = b0
EY = Ŷ = b0 + Σj bjXj    (j = 1, ..., p)
Fitted values allow you to estimate the error component, the regression residuals
ei = Yi – Ŷi
Total sum of squares (variability of response variable):
TSS = Σi=1,...,n (Yi – Ȳ)²,  where Ȳ = mean of Y
This can be partitioned into:
(i) the variability of Y explained by the fitted model, the regression or model sum of squares
MSS = Σi=1,...,n (Ŷi – Ȳ)²
(ii) the residual sum of squares
RSS = Σi=1,...,n (Yi – Ŷi)² = Σi=1,...,n ei²
Under the null hypothesis that the response variable is independent of the
predictor variables MSS = RSS if both are divided by their respective number of
degrees of freedom.
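The partition TSS = MSS + RSS can be verified directly in R for any fitted model, e.g. the straight-line fit from the earlier sketch.

m   <- lm(y ~ x)                      # fit from the simulated example
TSS <- sum((y - mean(y))^2)           # total sum of squares
MSS <- sum((fitted(m) - mean(y))^2)   # model (regression) sum of squares
RSS <- sum(residuals(m)^2)            # residual sum of squares
all.equal(TSS, MSS + RSS)             # TRUE: the partition is exact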
PARABOLA FITTED TO LOG-ABUNDANCE DATA,
fitting a Gaussian unimodal response curve to original
abundance data

z = c exp[-0.5(x – u)²/t²]

Gaussian response curve with its three ecologically important parameters: maximum (c), optimum (u) and tolerance (t). Vertical axis (z): species abundance. Horizontal axis (x): environmental variable. The range of occurrence of the species is seen to be about 4t.

loge z = b0 + b1x + b2x² = loge(c) – 0.5(x – u)²/t²

Optimum u = -b1 / (2b2)
Tolerance t = 1 / √(-2b2)
Maximum c = exp(b0 + b1u + b2u²)
If b2 > 0, the fitted curve has a minimum rather than a maximum.
Approximate SE of u and t can be calculated
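A sketch in R on simulated abundances: fit the quadratic to log abundance and back-transform the coefficients to u, t and c with the formulae above.

set.seed(2)
x <- runif(50, 4, 8)                                           # e.g. a pH gradient
z <- exp(2 - 0.5 * (x - 6)^2 / 0.5^2 + rnorm(50, sd = 0.2))    # abundances (> 0)
m <- lm(log(z) ~ x + I(x^2))
b <- coef(m)
u    <- -b[2] / (2 * b[3])                  # optimum
tol  <- 1 / sqrt(-2 * b[3])                 # tolerance (requires b2 < 0)
cmax <- exp(b[1] + b[2] * u + b[3] * u^2)   # maximum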
PRESENCE-ABSENCE RESPONSE VARIABLE,
NOMINAL EXPLANATORY VARIABLE
Numbers of fields in which Achillea ptarmica is present and absent in meadows with
different types of agricultural use and frequency of occurrence of each type (unpublished
data from Kruijne et al., 1967). The types are pure hayfield (ph), hay pastures (hp),
alternate pasture (ap) and pure pasture (pp).
Response: presence/absence of Achillea ptarmica; explanatory variable: agricultural use (4 classes)

χ² = Σ (o – e)² / e
o = observed frequency; e = expected frequency; (r – 1)(c – 1) degrees of freedom

             ph      hp      ap      pp      Total
Present      37      40      27      9       113
Absent       109     356     402     558     1425
Total        146     396     429     567     1538
Frequency    0.254   0.101   0.063   0.016   0.073
Relative frequency of occurrence is 113/1538 = 0.073.
Under the null hypothesis, the expected number of fields with Achillea ptarmica present is, for pure hayfield (ph), 0.073 × 146 = 10.7; for hay pasture (hp), 0.073 × 396; etc. The calculated χ² = 102.1, compared with the critical value of 7.81 at the 0.05 level with 3 df. Conclude that the occurrence of A. ptarmica depends on field type.
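The same test in R, using the counts from the table above.

tab <- matrix(c(37, 109, 40, 356, 27, 402, 9, 558), nrow = 2,
              dimnames = list(c("present", "absent"),
                              c("ph", "hp", "ap", "pp")))
chisq.test(tab)            # X-squared = 102.1 on 3 d.f. (cf. text above)
chisq.test(tab)$expected   # e.g. 10.7 expected presences for ph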
PRESENCE-ABSENCE RESPONSE VARIABLE,
QUANTITATIVE EXPLANATORY VARIABLE
Sigmoid curve fitted by logit regression of the presences (● at p = 1)
and absences (● at p = 0) of a species on acidity (pH). In the
display, the sigmoid curve looks like a straight line but it is not. The
curve expresses the probability (p) of occurrence of the species in
relation to pH.
Systematic part – defined as shown:
1: Ey = b0 + b1x                                    (can be negative)
2: Ey = exp(b0 + b1x)                               (can be > 1)
3: Ey = p = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)],  with (b0 + b1x) the linear predictor

Straight line (a), exponential curve (b) and sigmoid curve (c) representing equations 1, 2 and 3, respectively.

Error part – the response can only take two values, therefore a binomial error distribution.
Cannot estimate the parameters by least-squares regression, as the errors are not normally distributed and do not have constant variance.

LOGIT REGRESSION – special case of GLM
GENERALISED LINEAR MODEL (GLM)
Not the same as the General Linear Model – more generalised.
Logit: loge[p / (1 – p)] = linear predictor
or p = [exp(linear predictor)] / [1 + exp(linear predictor)]
Estimation in GLM by maximum likelihood.
Likelihood is defined for a set of parameter values as the probability of responses
actually observed when that set of values is the true set of parameter values. ML
chooses the set of parameter values for which likelihood is maximum.
Measure deviation of observed responses from fitted responses, not by residual SS as in least-squares, but by RESIDUAL DEVIANCE.
[Least-squares principle equivalent to ML if errors are independent and follow
normal distribution].
Least-squares regression is one type of GLM.
Solved iteratively.
GLIM
GENSTAT
R or S-PLUS
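A minimal sketch in R on simulated presence-absence data (not the lecture's data set): glm() with family = binomial fits the logit model by maximum likelihood, iteratively.

set.seed(3)
pH   <- runif(35, 4, 8)
pres <- rbinom(35, 1, plogis(2 - 0.5 * pH))   # true sigmoid response to pH
m <- glm(pres ~ pH, family = binomial)        # logit link is the default
summary(m)    # ML estimates, s.e., residual deviance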
Sigmoid curve fitted by logit regression: parameter estimates and deviance table for the presence-absence data of the above figure

Term       Parameter   Estimate   s.e.    t
Constant   b0          2.03       1.98    1.03
pH         b1          -0.484     0.357   -1.36 (|t| below the 5% critical value)

            d.f.   Deviance   Mean deviance
Residual    33     43.02      1.304

Not different from a horizontal line, as the t-test of b1 = 0 is not rejected.
Parabola (a), Gaussian curve (b) and Gaussian logit curve (c) representing the equations, respectively.
If we take for the linear predictor the logit transformation of p, loge[p/(1 – p)] = linear predictor, then
p = [exp(linear predictor)] / [1 + exp(linear predictor)]
For a parabola (b0 + b1x + b2x²) we get p = [exp(b0 + b1x + b2x²)] / [1 + exp(b0 + b1x + b2x²)]
or loge[p / (1 – p)] = b0 + b1x + b2x²
GAUSSIAN LOGIT CURVE
Gaussian logit curve fitted by logit regression of the presences (● at p = 1)
and absences (● at p = 0) of a species on acidity (pH). u = optimum; t =
tolerance; pmax = maximum probability of occurrence.
Gaussian logit curve fitted by logit regression: parameter estimates and deviance table for presence-absence data

Term       Parameter   Estimate   s.e.   t
Constant   b0          -128.8     51.1   -2.52
pH         b1          49.4       19.8   2.50
pH²        b2          -4.68      1.90   -2.47

(each |t| > 1.96)

            d.f.   Deviance   Mean deviance
Residual    32     23.17      0.724

u = -b1 / (2b2)
t = 1 / √(-2b2)
pmax = 1 / {1 + exp(-b0 – b1u – b2u²)}
t-tests of b2, b1 and b0.
Deviance tests: Gaussian logit curve → linear logit (sigmoid) curve → null model.
The residual deviance of a model is compared with that of an extended model. The additional parameters in the extended model (e.g. Gaussian logit) are significant when the drop in residual deviance is larger than the critical value of a χ² distribution with k degrees of freedom (k = number of additional parameters), e.g. drop in deviance > 3.84 for k = 1.

Example:
Gaussian logit model – residual deviance = 23.17
Sigmoid model – residual deviance = 43.02
43.02 – 23.17 = 19.85, which is >> χ²0.05(1) = 3.84
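In R, the drop-in-deviance test is a comparison of nested models; a sketch reusing pres and pH from the earlier simulated logit example.

m1 <- glm(pres ~ pH, family = binomial)             # sigmoid (linear logit)
m2 <- glm(pres ~ pH + I(pH^2), family = binomial)   # Gaussian logit
anova(m1, m2, test = "Chisq")    # drop in deviance against chi-squared, 1 d.f.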
RESPONSE VARIABLE WITH MANY ZERO VALUES
Counts 0, 1, 2, 3, ...
Log-linear or Poisson regression: log Ey = linear predictor
The linear predictor can be (b0 + b1x), an exponential curve, or (b0 + b1x + b2x²), a Gaussian curve (if b2 < 0).
[Poisson error distribution, link function log]
Can transform to PSEUDOSPECIES (as in TWINSPAN) and use as +/– response variables in logit regression.
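A sketch of Poisson regression in R on simulated counts; with the quadratic linear predictor and b2 < 0 the fitted curve is Gaussian on the original scale.

set.seed(4)
x      <- runif(50, 0, 10)
counts <- rpois(50, exp(1.5 - 0.1 * (x - 5)^2))    # many zeros in the tails
m <- glm(counts ~ x + I(x^2), family = poisson)    # log link is the default
summary(m)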
QUANTITATIVE RESPONSE VARIABLE,
MANY QUANTITATIVE EXPLANATORY VARIABLES
Response variable expressed as a function of two or more explanatory variables.
Not the same as separate analyses because of correlations between explanatory
variables and interaction effects.
MULTIPLE LEAST-SQUARES LINEAR REGRESSION
Planes
Ey = b0 + b1x1 + b2x2
explanatory variables
b0 – expected response when x1 and x2 = 0
b1 – rate of change in expected response along x1 axis
b2 – rate of change in expected response along x2 axis
b1 measures change of Ey with x1 for a fixed value of x2
b2 measures change of Ey with x2 for a fixed value of x1
A straight line displays the linear relationship between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data (●). (a = intercept; b = slope or regression coefficient.)

A plane displays the linear relation between the abundance value (y) of a species and two environmental variables (x1 and x2), fitted to artificial data (●).
Three-dimensional view of a plane fitted by
least-squares regression of responses (●) on
two explanatory variables x1 and x2. The
residuals, i.e. the vertical distances
between the responses and the fitted plane
are shown. Least-squares regression
determines the plane by minimization of the
sum of these squared vertical distances.
Estimates of b0, b1, b2 with standard errors and t (= estimate / s.e.)
ANOVA: total SS, residual SS, regression SS
R2 = 1 – (Residual SS / Total SS)
R2adj = 1 – (Residual MS / Total MS)
Ey = b0 + b1x1 + b2x2 + b3x3 + b4x4 + ... + bmxm
MULTICOLLINEARITY
Selection of explanatory variables: forward selection, backward selection, 'best-set' selection
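A sketch in R of fitting a plane with two explanatory variables (simulated data).

set.seed(5)
x1 <- runif(30); x2 <- runif(30)
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(30, sd = 0.3)
m <- lm(y ~ x1 + x2)    # Ey = b0 + b1*x1 + b2*x2
summary(m)              # estimates, s.e., t, R2 and R2adj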
REGRESSION AND ANOVA
In multiple regression, where the yi are n independent response observations, the familiar linear model is:

yi = β0 + β1xi1 + β2xi2 + ... + βkxik + εi    (A1)

where the xij (k predictor variables) are known constants, β0, β1, ..., βk are unknown parameters and the εi are independent normal random variables. In matrix notation, the model is written as y = Xβ + ε, with matrices:

y = (y1, y2, ..., ynT)′,  β = (β0, β1, ..., βk)′,  ε = (ε1, ε2, ..., εnT)′

    | 1   x11    . .   x1k  |
X = | 1   x21    . .   x2k  |
    | .   .            .    |
    | 1   xnT1   . .   xnTk |

where nT = total number of replicates. The least-squares estimates b of the parameters β are obtained by the normal equations:

X′Xb = X′y    (A2)

and, taking the inverse of X′X, we have:

b = [X′X]-1 [X′y]    (A3)
In a similar fashion, consider the linear model for a one-way ANOVA:

yij = μ + αi + εij    (A4)

where yij is the value of the jth replicate in the ith treatment, μ is the overall parametric mean, αi is the effect of the ith treatment and εij is the random normal error associated with that replicate. The model for the expectation of y in any particular treatment is:

E(yi) = μ + ti    (A5)

with ti the ith treatment effect. If there were, for example, three treatments, the model could be written as:

E(y) = μX0 + t1X1 + t2X2 + t3X3    (A6)

The values of Xi required to reproduce the model E(yi) = μ + ti for a given yi, using equation A6, are X0 = 1, and Xi = 1 if the ith treatment is applied, otherwise Xi = 0.
This can be expressed by the following matrices:

y = (y11, ..., y1j, y21, ..., y2j, y31, ..., y3j)′,  b = (μ, t1, t2, t3)′

    | 1  1  0  0 |
    | .  .  .  . |   (replicates of treatment 1)
    | 1  1  0  0 |
    | 1  0  1  0 |
X = | .  .  .  . |   (replicates of treatment 2)
    | 1  0  1  0 |
    | 1  0  0  1 |
    | .  .  .  . |   (replicates of treatment 3)
    | 1  0  0  1 |

where the columns of the matrix X correspond to X0, X1, X2 and X3, respectively.
A least-squares solution may again be obtained by the equation:

X′Xb = X′y    (A7)
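A sketch in R of the normal equations for the one-way ANOVA model. Note that R's model.matrix() uses a full-rank reference-class coding, not the overparameterised (μ, t1, t2, t3) matrix shown above, which has no unique inverse.

y   <- c(5.1, 4.9, 5.3, 6.2, 6.0, 6.4, 4.1, 4.3, 3.9)   # 3 treatments x 3 replicates
trt <- factor(rep(1:3, each = 3))
X <- model.matrix(~ trt)              # columns: intercept, trt2, trt3
b <- solve(t(X) %*% X, t(X) %*% y)    # b = (X'X)^-1 X'y, as in (A3)
cbind(b, coef(lm(y ~ trt)))           # identical to lm()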
RESPONSE SURFACES

PARABOLA:          Ey = b0 + b1x + b2x²
QUADRATIC SURFACE: Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2²    (5 parameters)

If log Y: Gaussian curve; bivariate Gaussian response surface if b2 and b4 are both negative.

t-tests to test whether each b ≠ 0. Test whether the surface is unimodal in the direction of x1 by testing the null hypothesis b2 ≥ 0 against b2 < 0 (t of b2); b4 tests whether the surface is unimodal in the direction of x2.
Can also test whether x2 influences the abundance of y in addition to x1, i.e. do b3 and b4 = 0?

MORE COMPLEX MODELS
Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x3 + b6x3² + ... + btxm²
Hence the need for selecting explanatory variables.
PRESENCE-ABSENCE RESPONSE VARIABLE,
MANY QUANTITATIVE EXPLANATORY VARIABLES
MULTIPLE LOGIT REGRESSION

Multiple logit regression (2 explanatory variables):
loge[p / (1 – p)] = b0 + b1x1 + b2x2
Test for effects of x1 and x2: t-tests of b1 and b2.

Bivariate Gaussian logit surface (2 explanatory variables):
loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x2²
Three-dimensional view of a
bivariate Gaussian logit surface
with the probability of
occurrence (p) plotted
vertically and the two
explanatory variables x1 and x2
plotted in the horizontal plane.
Elliptical contours of the
probability of occurrence p plotted
in the plane of the explanatory
variables x1 and x2. One main axis
of the ellipses is parallel to the x1
axis and the other to the x2 axis.
Gaussian logit surface
INTERACTION EFFECTS OF x1 AND x2
Product terms x1x2:
Ey = b0 + b1x1 + b2x2 + b3x1x2
   = (b0 + b2x2) + (b1 + b3x2)x1
Intercept and slope with respect to x1, and hence the effect of x1, depend on x2; the effect of x2 likewise depends on x1.
If b3 = 0, NO INTERACTION between x1 and x2.

Quadratic surface:
Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x1x2
If b2 + b4 < 0 and 4b2b4 – b5² > 0, we have a unimodal surface with ellipsoidal contours, but the axes are not necessarily orthogonal.
Can calculate the overall optimum (u1, u2), as in the sketch below:
u1 = (b5b3 – 2b1b4) / d
u2 = (b5b1 – 2b3b2) / d
where d = 4b2b4 – b5²

Gaussian logit surface:
loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x1x2
If b5 ≠ 0, the optimum with respect to x1 does depend on the value of x2.
If b5 = 0, the optimum with respect to x1 does not depend on the value of x2, i.e. NO INTERACTION.
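A sketch in R: fit the quadratic logit surface with an interaction term on simulated data and recover the overall optimum (u1, u2) from the formulae above (coefficient names follow R's formula labels).

set.seed(6)
x1 <- runif(100, 0, 10); x2 <- runif(100, 0, 10)
eta  <- -8 + 2 * x1 - 0.2 * x1^2 + 1.5 * x2 - 0.15 * x2^2 + 0.02 * x1 * x2
pres <- rbinom(100, 1, plogis(eta))
m <- glm(pres ~ x1 + I(x1^2) + x2 + I(x2^2) + x1:x2, family = binomial)
b <- coef(m)
d  <- 4 * b["I(x1^2)"] * b["I(x2^2)"] - b["x1:x2"]^2
u1 <- (b["x1:x2"] * b["x2"] - 2 * b["x1"] * b["I(x2^2)"]) / d
u2 <- (b["x1:x2"] * b["x1"] - 2 * b["x2"] * b["I(x1^2)"]) / d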
SELECTING EXPLANATORY VARIABLES
• If model is balanced, parameters can be entered or removed in any order
• Adequate model: Non-significantly different from the best model
• Best subset method for selecting variables
Try all possible combinations, select the best
Look at the others as well
• Automatic selection of variables does not necessarily give the best subset
Backward elimination: Start with all variables, then remove
variables starting with the worst, and continue until all
remaining are significant
Forward selection: Start with nothing, add best, as long as
the new variables are significant
Stepwise: Start with forward selection, but try backward
elimination after every step
J.D. Olden & D.A. Jackson (2000) Ecoscience 7, 501-510.
Torturing data for the sake of generality: how valid are our regression models?
AKAIKE INFORMATION CRITERION (AIC)
Index of fit that takes account of the parsimony of the
model by penalising for the number of parameters. The
more parameters in a model, the better the fit. You get a
perfect fit if you have a parameter for every data point but
the model has no explanatory power.
Trade-off between goodness of fit and the number of
parameters required by parsimony.
AIC useful as it explicitly penalises any superfluous
parameters in the model by adding 2p to the variance or
deviance.
AIC = -2 x (maximised log-likelihood) + 2 x (number of
parameters)
Small values are indicative of a good fit to the data.
In multiple regression, AIC is just the residual variance plus
twice the number of regression coefficients (including the
intercept).
Used to compare the fit of alternative models with different
numbers of parameters, and thus useful in model selection.
Smaller the AIC, better the fit.
Given the alternative models involving different numbers of
parameters, select the model with the lowest AIC.
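A sketch of AIC-guided selection in R on simulated data; step() compares models by AIC and keeps removing terms while AIC decreases.

set.seed(7)
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(40)
full <- lm(y ~ x1 + x2 + x3, data = dat)
AIC(full)                            # -2 * logLik + 2 * number of parameters
step(full, direction = "backward")   # the superfluous x3 should be dropped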
MANY EXPLANATORY NOMINAL OR NOMINAL
AND QUANTITATIVE VARIABLES
Three soil types - clay, peat, sand
Clay - reference class
Peat - dummy variable x2
Sand - dummy variable x3
x2 = 1 when peat, 0 when clay or sand
x3 = 1 when sand, 0 when clay or peat
k classes, k – 1 dummy variables
Systematic part
Ey = b1 + b2x2 + b3x3
b1 = expected response in reference class (clay)
b2 = difference in expected response between peat and clay
b3 = difference in response between sand and clay
Multiple logit regression: +/– response variable, one continuous variable (x1, entered as a quadratic) and one nominal variable with 3 classes (dummy variables x2, x3):

loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x3
Response curves for Equisetum fluviatile fitted by
multiple logit regression of the occurrence of E.
fluviatile in freshwater ditches on the logarithm of
electrical conductivity (EC) and soil type surrounding the
ditch (clay, peat, sand). Data from de Lange (1972).
Residual deviance tests to test if maxima are different
by dropping x2 and x3.
ASSESSING ASSUMPTIONS OF REGRESSION MODEL
Regression diagnostics – Faraway (2005) chapter 4
Linear least-squares regression:
1. relationship between Y and X is linear, perhaps after transformation
2. variance of random error is constant for all observations
3. errors are normally distributed
4. errors for the n observations are independently distributed
Assumption (2) is required to justify choosing the estimates of the b parameters so as to minimise the residual SS, and is needed in tests of t and F values. Clearly, in minimising the SS of residuals, it is essential that no residuals should be systematically larger than others.
Assumption (3) needed to justify significance tests and confidence
intervals.
RESIDUAL PLOTS
Plot (Y – Ŷ) against Ŷ or X
RESIDUAL PLOTS
Residual plots from the multiple regression of
gene frequencies on environmental variables
for Euphydryas editha:
(a) standardised residuals plotted against Y
values from the regression equation,
(b) standardised residuals against X1,
(c) standardised residuals against X2,
(d) standardised residuals against X3,
(e) standardised residuals against X4, and
(f) normal probability plot.
Normal probability plot – plot the ordered standardised residuals against their expected values assuming a standard normal distribution. If (Yi – Ŷi) is the ith ordered standardised residual, its expected value is the value for a standardised normal distribution that exceeds the proportion {i – (3/8)} / (n + (1/4)) of values in the full population.

Standardised residual = (Y – Ŷ) / √MSE
SIMPLE WEIGHTED AVERAGE REGRESSION

OPTIMA
+/– data:        ûk = (1/n) Σi=1,...,n xi    (mean of x over the n sites where species k occurs)
Abundance data:  ûk = Σi yik xi / Σi yik    (yik = abundance of species k at site i)

TOLERANCES
+/– data:        t̂k = [(1/n) Σi=1,...,n (xi – ûk)²]^1/2
Abundance data:  t̂k = [Σi yik (xi – ûk)² / Σi yik]^1/2

Software: WACALIB, CALIB, C2
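A direct R sketch of the abundance-data formulae for one species (hypothetical values).

x <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0)    # environmental values at six sites
y <- c(0, 2, 8, 10, 5, 1)               # abundances of species k
u_hat <- sum(y * x) / sum(y)                      # weighted-average optimum
t_hat <- sqrt(sum(y * (x - u_hat)^2) / sum(y))    # weighted-average tolerance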
DISREGARDS ABSENCES - DEPENDS ON
DISTRIBUTION OF EXPLANATORY VARIABLE X
ter Braak & Looman (1986) Vegetatio 65: 3-11
+/– data - WA just as good as GLR when:
1. species is rare and has narrow tolerance
2. distribution of environmental variable amongst sites is
reasonably homogenous over range of species occurrences
3. site scores (xi) are closely spaced in comparison with
species amplitude or tolerance
Abundance data:
1. Poisson distributed
2. sites homogeneously distributed
WEIGHTED AVERAGES ARE GOOD ESTIMATES

... of species optima if:
1. Sites x are evenly distributed about the optimum u
2. Sites are close to each other

... of gradient values if:
1. Species optima u are evenly distributed about the site value x
2. All species have equal response widths t
3. All species have equal maximum abundance h
4. Optima u are close to each other

Conditions are strictly true only for infinite gradients.
J. Oksanen (2002)
BIAS AND TRUNCATION IN WEIGHTED AVERAGING
Weighted averages are usually good estimates of Gaussian optima, unless the response is truncated: overestimation at the low end of the gradient, underestimation at the high end of the gradient.
Slight bias towards the gradient centre: shrinkage of WA estimates.
[Figure: WA and GLR optimum estimates compared]
J. Oksanen (2002)
MODEL II REGRESSION
When both the response and predictor variables of the model are
random (not controlled by the researcher), there is error associated
with measurements of both x and y.
This is model II regression
Examples:
Body mass and length
In vivo fluorescence and chlorophyll a
Respiration rate and biomass
Want to estimate the parameters of the equation that describes the
relationship between pairs of random variables.
Must use model II regression for parameter estimation, as the slope
found by ordinary least-squares regression (model I regression) may
be biased by the presence of measurement error in the predictor
variable.
MODEL II REGRESSION METHODS
Choice of model II regression method depends on the reasons for use and on the features of the data.

Method   Use and data                                                      Test possible
OLS      Error on y >> error on x                                          Yes
MA       Distribution is bivariate normal; variables are in the same       Yes
         physical units or dimensionless; variance of error about the
         same for x and y
RMA      Distribution is bivariate normal; error variance on each axis     Yes
         proportional to the variance of the corresponding variable;
         check scatter diagram: no outliers
SMA      Correlation r is significant                                      No
OLS      Distribution is not bivariate normal; relationship between x      Yes
         and y is linear
OLS      To compute forecasted (fitted) or predicted y values              Yes
         (regression equation and confidence intervals are irrelevant)
MA       To compare observations to model predictions                      Yes

OLS = ordinary least squares regression; MA = major axis regression;
SMA = standard major axis regression; RMA = ranged major axis regression

MODEL II program (www.fas.umontreal.ca/biol/legendre)
MODEL II REGRESSION METHODS (continued)
(1) Major axis regression (MA) is the first principal component of the scatter
of points. This axis minimises the squared Euclidean distances between
the points and the regression line instead of the vertical distances as in
OLS
(2) Standard major axis regression (SMA) is a way to make the variables dimensionally homogeneous prior to regression:
i) standardise variables x and y (subtract mean, divide by standard deviation)
ii) compute MA regression on the standardised x and y
iii) back-transform the slope estimate to the original units by multiplying it by sy/sx, where sy and sx are the standard deviations of y and x.
MODEL II REGRESSION METHODS (continued)
(3) Ranged major axis regression (RMA)
A disadvantage of SMA regression is that the standardisation makes the variances equal. In RMA, variables are made dimensionally homogeneous by ranging:
yi′ = (yi – ymin) / (ymax – ymin)
i) transform variables x and y by ranging
ii) compute MA regression on the ranged y and x
iii) back-transform the slope estimate to the original units by multiplying it by the ratio of the ranges, (ymax – ymin) / (xmax – xmin)
(4) Ordinary least squares regression (OLS)
Assumes no error on x. If error on y >> error on x, OLS can be used
to estimate the slope parameter
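The OLS, SMA and MA slopes can be computed directly; a sketch in R with simulated bivariate data. (The MODEL II program, or Pierre Legendre's lmodel2 package for R, provides these methods with permutation tests and confidence intervals.)

set.seed(8)
x <- rnorm(50); y <- 0.8 * x + rnorm(50, sd = 0.5)
b_ols <- cov(x, y) / var(x)                  # OLS slope (error only on y)
b_sma <- sign(cor(x, y)) * sd(y) / sd(x)     # SMA slope
e <- eigen(cov(cbind(x, y)))$vectors[, 1]    # first principal axis
b_ma <- e[2] / e[1]                          # MA slope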
STATISTICAL TESTING FOR MODEL II
REGRESSION
Confidence intervals – with all methods, confidence intervals are large
when n is small. Become smaller as n reaches about 60, after which they
change very slowly. Model II regression should ideally be used with data
sets with 60 or more observations. Confidence intervals for slope and
intercept possible for MA, SMA, RMA, and OLS.
Statistical significance of slope – can be assessed by permutation tests for
the slopes of MA, OLS, and RMA and for the correlation coefficient r.
Cannot test by permutation the slope in SMA as the slope estimate is sy/sx
and for all permuted data sy/sx is constant. All one can do is to test the
correlation rxy instead of testing bSMA.
General advice is to compute MA, RMA, SMA, and OLS and evaluate
results carefully in light of the features of the data (magnitude of errors,
distributions) and the purpose of the regression.
Legendre & Legendre (1998) pp. 500-517
McArdle (1998) Can. J. Zool. 66, 2329-2339
COMPUTING SOFTWARE FOR
REGRESSION ANALYSIS
Basic regression
MINITAB
SYSTAT
GENSTAT or GLIM
STATISTIX (SX)
R or S-PLUS
Weighted average regression
C2
Model II regression
MODEL II (available for download: www.fas.umontreal.ca/biol/legendre)