Correlation and linear regression

Lecture 12
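The least squares method of Carl Friedrich Gauß.

[Figure: scatter plot of Y against X with the fitted line y = ax + b (OLRy); the vertical deviation Δy of a data point from the line is marked, and its square Δy² is the quantity to be minimized.]

Sum of squared deviations:

$$D = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} [y_i - (a x_i + b)]^2$$

Setting the partial derivatives to zero:

$$\frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0$$

$$\frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0$$

$$b = \bar{y} - a\bar{x}$$

$$a = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$

$$y = ax + \bar{y} - a\bar{x} \quad\Rightarrow\quad (y - \bar{y}) = a(x - \bar{x})$$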
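The closed-form expressions for a and b can be checked numerically. The following is a minimal sketch in plain Python, assuming the brachypterous (y) and macropterous (x) species counts listed in this lecture's raw-data table; the function name `least_squares` is illustrative and not part of the lecture.

```python
# Brachypterous (y) and macropterous (x) species counts on 17 islands,
# copied from the lecture's raw-data table.
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def least_squares(x, y):
    """Return slope a and intercept b of y = a*x + b from the normal equations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    a = (sum_xy - n * mean_x * mean_y) / (sum_xx - n * mean_x ** 2)
    b = mean_y - a * mean_x
    return a, b


a, b = least_squares(x, y)
print(f"y = {a:.4f}x + {b:.4f}")  # y = 0.1920x + 0.4671
```

The result agrees with the trend line y = 0.192x + 0.4671 reported for these data in the lecture.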
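Dividing numerator and denominator by n expresses the slope as covariance divided by variance:

$$a = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}$$

Numerator: covariance s_xy.
Denominator: variance s_x².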
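Correlation coefficient

$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

$$a s_x^2 = s_{xy} = r s_x s_y \quad\Rightarrow\quad r = \frac{s_{xy}}{s_x s_y} = a\,\frac{s_x}{s_y}$$

$$-1 \le r \le 1$$

Slope a and coefficient of correlation r are zero if the covariance is zero.

$$r^2 = \frac{\sigma_{xy}^2}{\sigma_x^2 \sigma_y^2}, \qquad 0 \le r^2 \le 1$$

Coefficient of determination

$$R^2 = \frac{\text{Explained variance}}{\text{Total variance}}$$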
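The relations between covariance, variance, slope and r can be verified on the beetle data. This is a small sketch in plain Python, assuming the brachypterous (y) and macropterous (x) counts from this lecture's raw-data table; the helper names `mean` and `covariance` are illustrative.

```python
from math import sqrt

# Brachypterous (y) and macropterous (x) species counts on 17 islands.
y = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
x = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Covariance with divisor n, as in the lecture; covariance(u, u) is the variance."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)


s_xy = covariance(x, y)   # covariance
s_x2 = covariance(x, x)   # variance of x
s_y2 = covariance(y, y)   # variance of y

a = s_xy / s_x2                  # regression slope
r = s_xy / sqrt(s_x2 * s_y2)     # coefficient of correlation
r_squared = r ** 2               # coefficient of determination

print(round(s_xy, 4), round(s_x2, 3), round(s_y2, 4))
print(round(a, 4), round(r, 4), round(r_squared, 4))
```

The slope equals r · s_y / s_x, so slope and correlation always share the sign of the covariance.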
Relationships between macropterous, dimorphic and brachypterous ground beetles on 17 Mazurian lake islands

[Figure: brachypterous species (y) against macropterous species (x); fitted line y = 0.192x + 0.4671, R² = 0.1723.]

Positive correlation; r = √(r²) = 0.41.
The regression is weak.
Macropterous species richness explains only 17% of the variance in brachypterous species richness.
We have some islands without brachypterous species.
We really don't know which variable is the independent one.
There is no clear-cut logical connection.
[Figure: dimorphic species (y) against macropterous species (x); fitted line y = 0.3875x + 3.7188, R² = 0.4455.]

Positive correlation; r = √(r²) = 0.67.
The regression is moderate.
Macropterous species richness explains only 45% of the variance in dimorphic species richness.
The relationship appears to be non-linear. Log-transformation is indicated (no zero counts).
We really don't know which variable is the independent one.
There is no clear-cut logical connection.
[Figure: brachypterous species (y) against isolation (x); fitted line y = -36.203x + 5.5585, R² = 0.2311.]

Negative correlation; r = -√(r²) = -0.48.
The regression is weak.
Island isolation explains only 23% of the variance in brachypterous species richness.
We have two apparent outliers. Without them the whole relationship would vanish, that is, R² ≈ 0.
Outliers have to be eliminated from regression analysis.
We have a clear hypothesis about the logical relationships. Isolation should be the predictor of species richness.

[Figure: species richness (y) against ln Area (x); fitted line y = 0.4894x + 22.094, R² = 0.0037.]

No correlation; r = √(r²) = 0.06.
The regression slope is nearly zero.
Area explains less than 1% of the variance in brachypterous species richness.
We have a clear hypothesis about the logical relationships. Area should be the predictor of species richness.
The matrix perspective

[Data: 17 islands with the columns Brachy (response Y), Macro and Constant (= 1 for every island).]

[Figure: brachypterous species (y) against macropterous species (x); fitted line y = 0.192x + 0.4671, R² = 0.1723.]

The regression equations for all islands, written as one matrix equation Y = Xa:

$$\begin{pmatrix} 4 \\ 6 \\ 3 \\ \dots \\ 2 \end{pmatrix} = \begin{pmatrix} a_0 + 7a_1 \\ a_0 + 12a_1 \\ a_0 + 13a_1 \\ \dots \\ a_0 + 6a_1 \end{pmatrix} = a_1 \begin{pmatrix} 7 \\ 12 \\ 13 \\ \dots \\ 6 \end{pmatrix} + a_0 \begin{pmatrix} 1 \\ 1 \\ 1 \\ \dots \\ 1 \end{pmatrix} = \begin{pmatrix} 7 & 1 \\ 12 & 1 \\ 13 & 1 \\ \dots & \dots \\ 6 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_0 \end{pmatrix}$$

X is not square. It doesn't possess an inverse. Multiplying with the transpose gives a square matrix:

$$Y = Xa \quad\Rightarrow\quad X^T Y = X^T X a$$

$$(X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T X a = Ia = a$$

$$a = (X^T X)^{-1} X^T Y$$

Transpose:

$$X^T = \begin{pmatrix} 7 & 12 & 13 & \dots & 6 \\ 1 & 1 & 1 & \dots & 1 \end{pmatrix}$$

Dispersion matrix:

$$X^T X = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix}$$

Inverse:

$$(X^T X)^{-1} = \begin{pmatrix} 0.003248 & -0.03687 \\ -0.03687 & 0.477455 \end{pmatrix}$$

$$X^T Y = \begin{pmatrix} 570 \\ 45 \end{pmatrix}$$

Coefficients:

$$a_1 = 0.192014, \qquad a_0 = 0.467138$$
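The matrix solution a = (XᵀX)⁻¹XᵀY can be reproduced step by step. The sketch below uses plain Python lists instead of a matrix library and assumes the Macro and Brachy columns from this lecture's raw-data table; the helpers `transpose`, `matmul` and `inverse_2x2` are illustrative names.

```python
# Macro (predictor) and Brachy (response) counts on 17 islands.
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]
brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


X = [[xi, 1] for xi in macro]   # columns: Macro, Constant
Y = [[yi] for yi in brachy]     # column vector of responses

XtX = matmul(transpose(X), X)   # dispersion matrix
XtY = matmul(transpose(X), Y)
a = matmul(inverse_2x2(XtX), XtY)

print(XtX)                      # [[2499, 193], [193, 17]]
print(XtY)                      # [[570], [45]]
print(round(a[0][0], 6), round(a[1][0], 6))   # slope a1, intercept a0
```

For more than two columns the explicit 2×2 inverse has to be replaced by a general linear solver.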
Dispersion matrix

The elements of the dispersion matrix are sums of squares and cross products of the columns of X:

$$X^T X = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i\,\mathrm{const} \\ \sum_{i=1}^{n} x_i\,\mathrm{const} & \sum_{i=1}^{n} \mathrm{const}^2 \end{pmatrix} = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix}$$
Variance

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Covariance

$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Raw data

Brachy  Macro  Constant
4       7      1
6       12     1
3       13     1
4       18     1
1       10     1
4       14     1
2       7      1
5       22     1
1       9      1
0       7      1
0       15     1
0       13     1
1       8      1
4       10     1
2       8      1
6       14     1
2       6      1
Arithmetic mean

X holds the columns Brachy, Macro and Constant. M has the same shape as X; each of its 17 rows contains the column means:

Brachy   Macro    Constant
2.64706  11.3529  1

Dispersion matrix and squared means

$$X^T X = \begin{pmatrix} 185 & 570 & 45 \\ 570 & 2499 & 193 \\ 45 & 193 & 17 \end{pmatrix}, \qquad M^T M = \begin{pmatrix} 119.12 & 510.88 & 45 \\ 510.88 & 2191.1 & 193 \\ 45 & 193 & 17 \end{pmatrix}$$

$$X^T X - M^T M = \begin{pmatrix} 65.882 & 59.118 & 0 \\ 59.118 & 307.88 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

Variance and covariance

$$\Sigma = \frac{1}{n}(X^T X - M^T M) = \frac{1}{n}(X - M)^T (X - M)$$

With n = 17 (rows and columns in the order Brachy, Macro, Constant):

$$\Sigma = \begin{pmatrix} 3.8754 & 3.4775 & 0 \\ 3.4775 & 18.111 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

Correlation

$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad R = V \Sigma V, \qquad V = \begin{pmatrix} 1/\sigma_x & 0 \\ 0 & 1/\sigma_y \end{pmatrix}$$

For Brachy and Macro:

$$V = \begin{pmatrix} 0.508 & 0 \\ 0 & 0.235 \end{pmatrix}, \qquad V\Sigma = \begin{pmatrix} 1.9686 & 1.7665 \\ 0.8171 & 4.2557 \end{pmatrix}, \qquad V\Sigma V = \begin{pmatrix} 1 & 0.4151 \\ 0.4151 & 1 \end{pmatrix}$$

r = 0.4151, r² = 0.1723

The general covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1n} \\ \sigma_{12} & \sigma_2^2 & \dots & \sigma_{2n} \\ \dots & \dots & \dots & \dots \\ \sigma_{1n} & \sigma_{2n} & \dots & \sigma_n^2 \end{pmatrix}$$

The diagonal elements are the variances; the off-diagonal elements are the covariances.
The covariance matrix is square and symmetric.
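The step from the covariance matrix Σ to the correlation matrix VΣV can be sketched in a few lines of plain Python. The data are the Brachy and Macro columns of this lecture's raw-data table; because V is diagonal, the product VΣV reduces to scaling each element of Σ by the two standard deviations. The variable names are illustrative.

```python
from math import sqrt

brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]

data = [brachy, macro]          # one list per variable
n = len(brachy)
means = [sum(v) / n for v in data]

# Covariance matrix: (1/n) * (X - M)^T (X - M)
sigma = [[sum((u - means[i]) * (w - means[j])
              for u, w in zip(data[i], data[j])) / n
          for j in range(len(data))]
         for i in range(len(data))]

# Diagonal of V: reciprocal standard deviations
v = [1 / sqrt(sigma[i][i]) for i in range(len(data))]

# Correlation matrix R = V * Sigma * V (V is diagonal)
corr = [[v[i] * sigma[i][j] * v[j] for j in range(len(data))]
        for i in range(len(data))]

for row in sigma:
    print([round(value, 4) for value in row])
for row in corr:
    print([round(value, 4) for value in row])
```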
Non-linear relationships
Ground beetles on Mazurian lake islands

[Figure: species richness (y) against number of individuals (x), fitted with a linear, a logarithmic and a power function; the power function is also shown on logarithmic axes.]

Function     Fitted equation             R²
Linear       y = 0.0056x + 24.305        0.2963
Logarithmic  y = 6.0987 ln(x) - 8.3513   0.6003
Power        y = 6.7337 x^0.2306         0.67

The species–individuals relationship is obviously non-linear.
The power function has the highest R² and therefore explains most of the variance in species richness.
The coefficient of determination is a measure of goodness of fit.

Taking logarithms turns the power function into a straight line:

$$S = 6.733\, I^{0.2308}$$

$$\ln S = \ln(6.733) + 0.2308 \ln I = 1.907 + 0.2308 \ln I$$

The intercept is 1.907, the slope is 0.2308.
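Fitting a power function by linear regression on log-transformed values can be sketched as follows. This is an illustration in plain Python, not the spreadsheet procedure used for the figure: the check data are hypothetical points generated from the fitted curve S = 6.733 I^0.2308, and `fit_line` and `fit_power` are illustrative helper names.

```python
from math import exp, log


def fit_line(x, y):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def fit_power(individuals, species):
    """Fit S = c * I**z through a straight line ln S = ln c + z ln I."""
    slope, intercept = fit_line([log(i) for i in individuals],
                                [log(s) for s in species])
    return exp(intercept), slope


# Hypothetical check data lying exactly on the lecture's fitted curve.
individuals = [50, 100, 500, 1000, 3000]
species = [6.733 * i ** 0.2308 for i in individuals]

c, z = fit_power(individuals, species)
print(round(c, 3), round(z, 4), round(log(c), 3))   # 6.733 0.2308 1.907
```

Note that the R² of such a fit refers to the log-transformed values, not to the untransformed species counts.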
Having more than one predictor
Island  Species  Individuals  Area   Isolation
1pog    13       55           0.01   0.088719
2pog    24       149          0.9    0.088592
3pog    31       206          2.1    0.081131
cor     29       3450         6.84   0.089384
dab     31       505          10     0.080644
ful     37       996          9.9    0.094508
gil     54       1895         10     0.093676
guc     27       476          0.92   0.097195
hel     25       325          2.3    0.088938
lip     30       459          4.19   0.088367
mil     34       1410         0.2    0.089204
sos     33       829          20.09  0.087405
swi     34       1704         2.08   0.096915
ter     16       91           0.03   0.085875
wil     21       102          1      0.096584
wron    28       342          0.15   0.01
wros    21       258          0.15   0.01

Describe species richness as a function of the number of individuals, area, and isolation of islands.
We need a clear hypothesis about the dependent variable and the independent predictors.
Use a block diagram.

[Block diagram: the predictors Individuals, Area and Isolation act on the response Species.]
Collinearity

[Block diagram: Area and Isolation act on Individuals as well as on Species.]

Predictors are not independent.
The number of individuals depends on area and degree of isolation.

We need linear relationships.
We use ln-transformed values of species, area, and individuals.
We check for non-linearities using plots.

Check for multicollinearity using a correlation matrix.
The correlation between area and individuals is highly significant.
The probability of H0 is 0.004.
Of the predictors, area and individuals are highly correlated.
In linear regression analysis, correlations between predictors below 0.7 are acceptable.
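A correlation matrix of the predictors can be computed as sketched below. The values are the ln-transformed individuals and areas and the isolation index of the 17 islands as listed in this lecture's final data table; the 0.7 threshold is the rule of thumb stated above, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt

predictors = {
    "ln_Ind": [4.007333, 5.003946, 5.327876, 8.14613, 6.224558, 6.903747,
               7.546974, 6.165418, 5.783825, 6.12905, 7.251345, 6.72022,
               7.440734, 4.51086, 4.624973, 5.834811, 5.55296],
    "ln_Area": [-4.60517, -0.10536, 0.741937, 1.922788, 2.302585, 2.292535,
                2.302585, -0.08338, 0.832909, 1.432701, -1.60944, 3.000222,
                0.732368, -3.50656, 0.0, -1.89712, -1.89712],
    "Isolation": [0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
                  0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
                  0.096915, 0.085875, 0.096584, 0.01, 0.01],
}


def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


names = list(predictors)
corr = {(p, q): pearson_r(predictors[p], predictors[q])
        for p in names for q in names}

for p in names:
    print(p.ljust(10), [round(corr[(p, q)], 3) for q in names])

# Pairs of different predictors whose correlation reaches the 0.7 rule of thumb
too_collinear = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
                 if abs(corr[(p, q)]) >= 0.7]
print(too_collinear)
```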
The final data for our analysis

Island  ln_S      Constant  ln_Ind.   ln_Area   Isolation
1pog    2.564949  1         4.007333  -4.60517  0.088719
2pog    3.164068  1         5.003946  -0.10536  0.088592
3pog    3.427515  1         5.327876  0.741937  0.081131
cor     3.366817  1         8.14613   1.922788  0.089384
dab     3.443352  1         6.224558  2.302585  0.080644
ful     3.609114  1         6.903747  2.292535  0.094508
gil     3.985008  1         7.546974  2.302585  0.093676
guc     3.294602  1         6.165418  -0.08338  0.097195
hel     3.236061  1         5.783825  0.832909  0.088938
lip     3.401197  1         6.12905   1.432701  0.088367
mil     3.521447  1         7.251345  -1.60944  0.089204
sos     3.483143  1         6.72022   3.000222  0.087405
swi     3.531251  1         7.440734  0.732368  0.096915
ter     2.772589  1         4.51086   -3.50656  0.085875
wil     3.060271  1         4.624973  0         0.096584
wron    3.332205  1         5.834811  -1.89712  0.01
wros    3.020425  1         5.55296   -1.89712  0.01

The vector Y (column ln_S) contains the response variable.
The matrix X (columns Constant, ln_Ind., ln_Area, Isolation) contains the effect (predictor) variables.

The model

$$\ln S = a_0 + a_1 \ln Ind + a_2 \ln Area + a_3\, Isolation$$

The predictor variables have to contain different information.
If XᵀX is singular, no inverse exists.
Multiple linear regression

$$Y = Xa$$

$$a = (X^T X)^{-1} X^T Y$$

$$F = t^2 = (n - 2)\,\frac{r^2}{1 - r^2}$$

[Table: regression output listing for each predictor the coefficient, its standard error, t, and the probability that the coefficient deviates from zero.]

$$\ln S = 2.48 + 0.15 \ln Ind + 0.07 \ln Area - 0.91\, Isolation$$

The probability that R² is zero is only 0.01%.
With more than 99.9% probability R² > 0, hence R² is statistically significant.
The model explains 78.6% of the variance in species richness.
21.4% of the variance remains unexplained.
The output also gives the probabilities that the coefficients deviate from zero.
Isolation is not a significant predictor.
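A multiple regression of this form can be sketched by solving the normal equations XᵀXa = XᵀY directly. The example below is plain Python and assumes the 17 rows of the final data table of this lecture (ln_S, ln_Ind., ln_Area, Isolation); the Gaussian elimination routine `solve` is an illustrative stand-in for the matrix inverse and is not the procedure of any particular statistics package.

```python
rows = [  # (ln_S, ln_Ind, ln_Area, Isolation) for the 17 islands
    (2.564949, 4.007333, -4.60517, 0.088719), (3.164068, 5.003946, -0.10536, 0.088592),
    (3.427515, 5.327876, 0.741937, 0.081131), (3.366817, 8.14613, 1.922788, 0.089384),
    (3.443352, 6.224558, 2.302585, 0.080644), (3.609114, 6.903747, 2.292535, 0.094508),
    (3.985008, 7.546974, 2.302585, 0.093676), (3.294602, 6.165418, -0.08338, 0.097195),
    (3.236061, 5.783825, 0.832909, 0.088938), (3.401197, 6.12905, 1.432701, 0.088367),
    (3.521447, 7.251345, -1.60944, 0.089204), (3.483143, 6.72022, 3.000222, 0.087405),
    (3.531251, 7.440734, 0.732368, 0.096915), (2.772589, 4.51086, -3.50656, 0.085875),
    (3.060271, 4.624973, 0.0, 0.096584), (3.332205, 5.834811, -1.89712, 0.01),
    (3.020425, 5.55296, -1.89712, 0.01),
]
Y = [row[0] for row in rows]
X = [[1.0, row[1], row[2], row[3]] for row in rows]   # Constant first


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


k = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
XtY = [sum(row[i] * yi for row, yi in zip(X, Y)) for i in range(k)]
a = solve(XtX, XtY)   # a0 (constant), a1 (ln_Ind), a2 (ln_Area), a3 (Isolation)

fitted = [sum(ai * xi for ai, xi in zip(a, row)) for row in X]
mean_y = sum(Y) / len(Y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(Y, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in Y)
r_squared = 1 - ss_res / ss_tot   # explained / total variance

print([round(ai, 2) for ai in a], round(r_squared, 3))
```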
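What distance to minimize?

[Figure: scatter plot of Y against X with two regression lines. OLRy minimizes the squared vertical deviations Δy², OLRx minimizes the squared horizontal deviations Δx².]

$$a_{OLRy} = \frac{s_{xy}}{s_x^2}, \qquad a_{OLRx} = \frac{s_y^2}{s_{xy}}$$

$$a_{OLRy}\, s_x^2 = s_{xy} = \frac{s_y^2}{a_{OLRx}} \quad\Rightarrow\quad a_{OLRx} \cdot a_{OLRy} = \frac{s_y^2}{s_x^2}$$

Model I regression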
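[Figure: the same scatter plot with the reduced major axis (RMA) line lying between OLRy and OLRx; the deviations Δx and Δy of a point from the line are marked.]

$$a_{RMA} = \sqrt{a_{OLRx}\, a_{OLRy}} = \sqrt{\frac{s_y^2}{s_{xy}} \cdot \frac{s_{xy}}{s_x^2}} = \frac{s_y}{s_x}$$

$$a_{RMA} = \frac{a_{OLRy}}{r} = \frac{s_y}{s_x}$$

$$|a_{RMA}| \ge |a_{OLRy}| \quad \text{because} \quad |r| \le 1$$

The reduced major axis slope is the geometric mean of a_OLRy and a_OLRx.

Model II regression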
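The three slopes can be compared on the beetle data. This is a minimal sketch in plain Python, assuming the Brachy (y) and Macro (x) columns of this lecture's raw-data table; giving the RMA slope the sign of the covariance is a common convention and an assumption here, and the variable names are illustrative.

```python
from math import copysign, sqrt

brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]     # y
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]   # x


def cov(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n


s_xy = cov(macro, brachy)
s_x2 = cov(macro, macro)
s_y2 = cov(brachy, brachy)

a_olry = s_xy / s_x2              # minimizes vertical deviations
a_olrx = s_y2 / s_xy              # minimizes horizontal deviations, as a y-on-x slope
r = s_xy / sqrt(s_x2 * s_y2)

# Geometric mean of both slopes; the sign follows the covariance.
a_rma = copysign(sqrt(a_olrx * a_olry), s_xy)

print(round(a_olry, 4), round(a_rma, 4), round(a_olrx, 4))
```

With r = 0.415 the RMA slope is more than twice the OLRy slope, which illustrates how strongly the choice of regression model matters for weak correlations.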
Past standard output of linear regression

[Screenshot: Past output with the reduced major axis fit, the parameters and their standard errors, the parametric probability for r = 0, and a permutation test for statistical significance.]

Parametric probability for r = 0:

$$t = r\,\sqrt{\frac{n - 2}{1 - r^2}}, \qquad df = n - 2$$

$$F = t^2 = (n - 2)\,\frac{r^2}{1 - r^2}$$

We don't have a clear hypothesis about the causal relationships.
In this case RMA is indicated.
Both tests indicate that Brach and Macro are not significantly correlated.
The RMA regression slope is insignificant.
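The parametric test statistic can be computed directly from r and n. The sketch below assumes the observed correlation between brachypterous and macropterous species richness (r = 0.41508801, n = 17) given in this lecture; the critical value 2.131 is the standard tabulated two-sided 5% limit of the t distribution with 15 degrees of freedom and is not taken from the lecture.

```python
from math import sqrt

r = 0.41508801   # observed correlation, Brach vs. Macro
n = 17           # number of islands

t = r * sqrt((n - 2) / (1 - r ** 2))
F = (n - 2) * r ** 2 / (1 - r ** 2)

T_CRITICAL_15_DF = 2.131   # tabulated two-sided 5% critical value, df = 15

print(round(t, 3), round(F, 3))
print("significant at 5%" if abs(t) > T_CRITICAL_15_DF else "not significant at 5%")
```

Because t stays below the critical value, the parametric test agrees with the statement that the two variables are not significantly correlated.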
Permutation test for statistical significance

[Spreadsheet: the columns Brach and Macro, each paired with a column of random numbers, Los(), that is used to put the values into random order; r is recalculated for every randomization.]

N           1000
Mean r      0.061
Lower CL    -0.538
Upper CL    0.768
Observed r  0.41508801

[Figure: histogram of the randomized r values (x axis from -0.8 to 1) with the lower CL, the upper CL and the observed r marked.]
Randomize x or y 1000 times.
Calculate r each time.
Plot the distribution of the randomized r values and calculate the lower and upper confidence limits.
Calculating confidence limits

Rank all 1000 coefficients of correlation and take the values at rank positions 25 and 975.

The 95% confidence limits of the regression slope mark the range within which the regression slope lies with 95% probability.
The lower CL is negative, hence a zero slope lies within the 95% CL.
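The randomization procedure can be sketched in plain Python. The example shuffles the Macro column 1000 times, recalculates r each time, and reads the limits at rank positions 25 and 975. It assumes the Brachy and Macro columns of this lecture's raw-data table; the function names and the fixed seed are illustrative choices, and the limits it returns depend on the random numbers drawn, so they will not reproduce the values of the spreadsheet.

```python
import random
from math import sqrt

brachy = [4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2]
macro = [7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6]


def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def permutation_limits(x, y, n_perm=1000, seed=1):
    """Return (lower CL, upper CL, sorted r values) from shuffling x n_perm times."""
    rng = random.Random(seed)      # fixed seed only to make the sketch repeatable
    shuffled = list(x)
    r_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)      # destroys any association between x and y
        r_values.append(pearson_r(shuffled, y))
    r_values.sort()
    lower = r_values[round(0.025 * n_perm) - 1]   # rank position 25 of 1000
    upper = r_values[round(0.975 * n_perm) - 1]   # rank position 975 of 1000
    return lower, upper, r_values


observed_r = pearson_r(macro, brachy)
lower_cl, upper_cl, r_values = permutation_limits(macro, brachy)

print(round(observed_r, 4), round(lower_cl, 3), round(upper_cl, 3))
print("significant" if not lower_cl <= observed_r <= upper_cl else "not significant")
```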
The RMA regression has a much steeper slope. This slope is often intuitively better.
The coefficient of correlation is independent of the regression method.
In OLRy regression, insignificance of the slope also means insignificance of r and R².
[Figure: scatter plot of Y against X with the OLRy line and an outlier.]

Outliers have a disproportionate influence on correlation and regression.
Outliers should be eliminated from regression analysis.
Instead of the Pearson coefficient of correlation, use Spearman's rank order correlation.

[Figure: scatter plot of Y against X; rPearson = 0.79, rSpearman = 0.77.]

Spearman's coefficient is the normal (Pearson) correlation calculated on ranked data.
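Spearman's coefficient as "normal correlation on ranked data" can be sketched as follows. The example is plain Python; the data with a single outlier are hypothetical and only illustrate how ranking removes the influence of the extreme value, and `ranks`, `pearson_r` and `spearman_r` are illustrative helper names.

```python
from math import sqrt


def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def ranks(values):
    """Ranks starting at 1; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend the block of tied values
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_r(u, v):
    return pearson_r(ranks(u), ranks(v))


# Hypothetical data: a monotonic relationship with one extreme outlier.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 60]

print(round(pearson_r(x, y), 3), round(spearman_r(x, y), 3))
```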
Home work and literature

Refresh:
• Coefficient of correlation
• Pearson correlation
• Spearman correlation
• Linear regression
• Non-linear regression
• Model I and model II regression
• RMA regression

Prepare for the next lecture:
• F-test
• F-distribution
• Variance

Literature:
Łomnicki: Statystyka dla biologów
http://statsoft.com/textbook/