Lecture 4
EEOB 590C
•Regression: Y~X
•X & Y are continuous
•H₀: no relationship between X & Y
•Model: $Y_i = b_0 + b_1 X_i + e_i$  ($b_0$ is intercept, $b_1$ is slope, and $e_i$ is error)
•Test with $F = \frac{s^2_{model}}{s^2_{error}}$
[Figure: scatterplot of headwdth vs. SVL with fitted regression line]
•Name from Galton’s work: offspring values ‘regress’ toward those of parents
1. Independence: error in Y ($e_i$) must be independent
2. Normality: error in Y ($e_i$) must be normally distributed
3. Homoscedasticity: samples along the regression line are homoscedastic (variance doesn’t ‘balloon’ along the regression)
4. Values of X are independent and measured WITHOUT ERROR
•Fit the line that minimizes the sum of squared deviations (LS fit) from the Y-variates to the line (vertical deviations b/c no error in X)
•Slope calculated as: $b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{SS_x}$
•Intercept: $b_0 = \bar{Y} - b_1\bar{X}$
[Figure: scatterplot of headwdth vs. SVL with least-squares regression line]
•Note: this is Model I regression with 1 Y for each X. Slight alterations exist for multiple
Y for each X: see Biometry.
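The slope and intercept formulas above can be checked directly. A minimal Python sketch (the course examples use R; Python is used here only to illustrate the arithmetic, and the data are made up):

```python
def ls_fit(x, y):
    """Least-squares fit: b1 = sum((Xi - Xbar)(Yi - Ybar)) / SSx,
    b0 = Ybar - b1 * Xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    sp_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sp_xy / ss_x        # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# perfectly linear toy data: Y = 2X, so b0 = 0 and b1 = 2
b0, b1 = ls_fit([1, 2, 3, 4], [2, 4, 6, 8])
print(b0, b1)  # 0.0 2.0
```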
•Significance assessed using F-ratio (MS_Regr vs. MS_Err)
•SS error: $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$
•SS explained: $SSM = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$, where $\hat{Y}_i$ are predicted values
•Test using F-ratio:

Source       df     SS            MS       F         P
Regression   1      SSM           SSM/df   MSM/MSE
Error        n-2    SSE           SSE/df
Total        n-1    SST=SSM+SSE

•Tests whether regression explains a significant portion of variation
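The SS decomposition and F-ratio above can be sketched in a few lines of Python (the course uses R; this toy example with made-up numbers just traces the arithmetic of the table):

```python
def regression_anova(x, y):
    """SS decomposition and F-ratio for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]                      # predicted values
    ssm = sum((yh - ybar) ** 2 for yh in yhat)             # SS explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # SS error
    f = (ssm / 1) / (sse / (n - 2))                        # MSM / MSE
    return ssm, sse, f

ssm, sse, f = regression_anova([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
# SSM + SSE recovers the total SS of Y (SST = SSM + SSE)
```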
•Standard approach:
•Compare F-ratios to F-distribution with appropriate df
•Resampling Alternative:
•Residual randomization: shuffle Y across X for single-factor regression
[Figure: schematic of observed data (E_obs) vs. randomized data (E_rand)]
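The resampling alternative can be sketched as follows: shuffle Y across X many times, recompute the F-ratio each time, and take the proportion of F-ratios at least as large as the observed one. A Python illustration with made-up data (the seed, sample values, and permutation count are arbitrary):

```python
import random

def f_ratio(x, y):
    """F-ratio for a simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    ssm = sum((yh - ybar) ** 2 for yh in yhat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    return (ssm / 1) / (sse / (n - 2))

def perm_test(x, y, n_perm=999, seed=1):
    """Permutation p-value: shuffle Y across X, compare F_rand to F_obs."""
    rng = random.Random(seed)
    f_obs = f_ratio(x, y)
    y_perm = list(y)
    count = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if f_ratio(x, y_perm) >= f_obs:
            count += 1
    return count / (n_perm + 1)

p = perm_test([1, 2, 3, 4, 5, 6, 7, 8], [2, 3, 5, 4, 6, 7, 6, 9])
```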
•F-test assesses variance explained, but not how much Y changes with X
•Slope test ($b_1 \neq 0$): $t = \frac{b_1 - 0}{s_{b_1}}$, with $s_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$ (n-2 df)
•Intercept test ($b_0 \neq 0$): $t = \frac{b_0 - 0}{s_{b_0}}$, with $s_{b_0} = \sqrt{\frac{MSE \sum_{i=1}^{n} X_i^2}{n \sum_{i=1}^{n}(X_i - \bar{X})^2}}$ (n-2 df)
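These standard errors and t-statistics are straightforward to compute by hand; a Python sketch with made-up data (the course itself uses R):

```python
import math

def coef_tests(x, y):
    """t-statistics for b1 (slope) and b0 (intercept), each with n-2 df."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
    b0 = ybar - b1 * xbar
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    s_b1 = math.sqrt(mse / ss_x)                                   # SE of slope
    s_b0 = math.sqrt(mse * sum(xi ** 2 for xi in x) / (n * ss_x))  # SE of intercept
    return (b1 - 0) / s_b1, (b0 - 0) / s_b0

t_slope, t_int = coef_tests([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
```

For simple regression the squared slope t equals the ANOVA F-ratio, which ties this slide back to the F-test of the previous one.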
> summary(lm(Y~X1))#TotalLength ~ AlarExtent
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3045 0.3383 3.856 0.000179 ***
X1 0.6848 0.0615 11.136 < 2e-16 ***
Multiple R-squared: 0.4806, Adjusted R-squared: 0.4768
> anova(lm(Y~X1))#provides the ANOVA table for the regression
Df Sum Sq Mean Sq F value Pr(>F)
X1 1 0.032412 0.032412 124.01 < 2.2e-16 ***
[Figure: scatterplot of Y vs. X1 with fitted line Y = 1.3045 + 0.6848X]
•Correlation and regression closely linked
•Regression: $b_{Y\cdot X} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{SS_x}$
•Correlation: $r_{xy} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{SS_x\,SS_y}}$
•Difference is in the denominator, so they are related as: $r_{xy} = b_{Y\cdot X}\,\frac{s_x}{s_y}$; r is a standardized regression coefficient (i.e. slope in standard deviation units)
•Meaning: b: error in Y direction only (direction of scatter), r : error in X & Y (dispersion of scatter)
•Thus, regression models causation (Y as function of X) and correlation models association (covariation in X & Y)
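The relation $r_{xy} = b\,s_x/s_y$ can be verified numerically; a Python illustration with made-up data (the lecture's own examples are in R):

```python
import math

def corr_and_slope(x, y):
    """Return the regression slope b, the correlation r, and b*sx/sy
    (which should equal r)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    ss_y = sum((yi - ybar) ** 2 for yi in y)
    b = sp / ss_x                      # regression slope
    r = sp / math.sqrt(ss_x * ss_y)    # correlation coefficient
    s_x = math.sqrt(ss_x / (n - 1))    # sd of X
    s_y = math.sqrt(ss_y / (n - 1))    # sd of Y
    return b, r, b * s_x / s_y         # last value should equal r

b, r, r_from_b = corr_and_slope([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
```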
Regression: $b_{Y\cdot X} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{SS_x}$  vs.  Correlation: $r_{xy} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{SS_x\,SS_y}}$
[Figures: two scatterplots of headwdth vs. SVL, one illustrating regression and one illustrating correlation]
•When both X & Y contain error, model I regression underestimates slope
•Model II regression accounts for this by minimizing deviations perpendicular to regression line (not in Y direction only)
•Different types of model II regression, depending on data (major axis, reduced major axis, etc.)
•X & Y in ‘same’ units/scale: major axis regression (PCA)
•X & Y not in same units/scale: standard (reduced) major axis regression
•Regression using slope of major axis of correlation ellipse
•Found with PCA (Principal components analysis)
•Steps:
•Calculate covariance matrix for X & Y: $\mathbf{VCV} = \begin{bmatrix} s_X^2 & s_{XY} \\ s_{XY} & s_Y^2 \end{bmatrix}$
•Find principal eigenvalue ($\lambda_1$) and eigenvector of VCV
•For a 2 x 2 matrix: $\lambda_1 = \frac{s_X^2 + s_Y^2 + \sqrt{(s_X^2 + s_Y^2)^2 - 4\,(s_X^2 s_Y^2 - s_{XY}^2)}}{2}$
•Slope of major axis (principal eigenvector): $b_{MA} = \frac{\lambda_1 - s_X^2}{s_{XY}}$
•More on PCA and eigenanalysis in a few weeks
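The eigenvalue and major-axis slope for the 2 x 2 case can be computed directly from the closed-form expressions above; a Python sketch with made-up data (the course would do this in R):

```python
import math

def major_axis_slope(x, y):
    """Slope of the major axis (principal eigenvector) of the 2x2
    covariance matrix of X and Y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s2x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    s2y = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    # principal eigenvalue of [[s2x, sxy], [sxy, s2y]]
    lam1 = (s2x + s2y +
            math.sqrt((s2x + s2y) ** 2 - 4 * (s2x * s2y - sxy ** 2))) / 2
    return (lam1 - s2x) / sxy

b_ma = major_axis_slope([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
```

On these toy data the OLS slope is 0.9 but the MA slope is 1.0, illustrating the attenuation of model I regression when X also carries error.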
•Model II regression for when X & Y are in different units/scale
•Either:
•Standardize the variables (to mean 0 and unit variance)
•Major axis regression of the standardized variables
•Or:
•Calculate SMA slope as: $b_{SMA} = \frac{b_{OLS}}{r_{XY}}$
•Also called geometric mean or standard major axis regression
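The shortcut $b_{SMA} = b_{OLS}/r$ is easy to verify numerically; a Python sketch with made-up data (R would be used in practice, e.g. via lmodel2):

```python
import math

def sma_slope(x, y):
    """Standard (reduced) major axis slope: b_OLS / r."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    ss_y = sum((yi - ybar) ** 2 for yi in y)
    b_ols = sp / ss_x                  # model I (OLS) slope
    r = sp / math.sqrt(ss_x * ss_y)    # correlation
    return b_ols / r

b_sma = sma_slope([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
```

Note that $b_{OLS}/r$ reduces to $\pm\,s_y/s_x$ (hence the name geometric mean regression): dividing by r always steepens the OLS slope.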
> lmodel2(Y~X1,nperm=999)
Angle between the two OLS regression lines = 20.53316 degrees
Method Intercept Slope Angle (degrees) P-perm (1-tailed)
1 OLS 1.3044593 0.6847984 34.40328 0.001
2 MA -0.3329159 0.9824064 44.49152 0.001
3 SMA -0.3624212 0.9877692 44.64746 NA
*Note: SMA slope = b_OLS / r
[Figure: MA regression; OLS slope (blue) vs. MA slope (red)]
•Identify relationship between several X and continuous Y
•Predict Y using several variables simultaneously
•Model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$
•$b_i$ are partial regression coefficients (effect of $X_i$ holding effects of other X constant)
•For 2 X variables, think of fitting a plane to the data
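For two predictors the partial regression coefficients have a closed form from the centered normal equations; a Python sketch (R's lm() does this in practice; the data here are constructed exactly from Y = 1 + 2*X1 + 3*X2 with no error, so the fit must recover those coefficients):

```python
def two_predictor_fit(x1, x2, y):
    """Partial regression coefficients for Y = b0 + b1*X1 + b2*X2 + e,
    solved from the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det   # effect of X1, holding X2 constant
    b2 = (s2y * s11 - s1y * s12) / det   # effect of X2, holding X1 constant
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# exact data from Y = 1 + 2*X1 + 3*X2
b0, b1, b2 = two_predictor_fit([1, 2, 3, 4], [2, 1, 4, 3], [9, 8, 19, 18])
```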
> summary(lm(Y~X1+X2))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.82521 0.34843 5.238 6.18e-07 ***
X1 0.52516 0.07142 7.354 1.77e-11 ***
X2 0.11042 0.02835 3.895 0.000155 ***
> anova(lm(Y~X1+X2))
Df Sum Sq Mean Sq F value Pr(>F)
X1 1 0.032412 0.032412 137.118 < 2.2e-16 ***
X2 1 0.003585 0.003585 15.168 0.0001551 ***
[Figure: 3D scatterplot of Y vs. X1 and X2 with fitted plane]
•$b'_i$: standardized partial regression coefficient (effect of $X_i$ holding $X_j$ constant)
•Expresses change in normalized units (standard normal deviates: $\frac{Y - \bar{Y}}{s_Y}$)
•ADVANTAGE: can be directly compared for each variable
•Found from variable correlations: $b'_2 = \frac{r_{Y2} - r_{Y1}\,r_{12}}{1 - r_{12}^2}$, where $r_{12} = r_{X_1 X_2}$
•Interpret, then calculate back to original units for conventional partial regression coefficients: $b_2 = b'_2\,\frac{s_Y}{s_{X_2}}$
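The correlation-based formula and the back-transformation can be checked against each other; a Python sketch reusing the exact-plane toy data from above (Y = 1 + 2*X1 + 3*X2, made up for illustration):

```python
import math

def standardized_coefs(x1, x2, y):
    """Standardized partial regression coefficients b'_1, b'_2 from the
    pairwise correlations among X1, X2, and Y."""
    n = len(y)

    def center(v):
        m = sum(v) / n
        return [a - m for a in v]

    def ss(u, v):
        return sum(a * b for a, b in zip(u, v))

    d1, d2, dy = center(x1), center(x2), center(y)
    r12 = ss(d1, d2) / math.sqrt(ss(d1, d1) * ss(d2, d2))
    r1y = ss(d1, dy) / math.sqrt(ss(d1, d1) * ss(dy, dy))
    r2y = ss(d2, dy) / math.sqrt(ss(d2, d2) * ss(dy, dy))
    bp1 = (r1y - r2y * r12) / (1 - r12 ** 2)
    bp2 = (r2y - r1y * r12) / (1 - r12 ** 2)
    return bp1, bp2

bp1, bp2 = standardized_coefs([1, 2, 3, 4], [2, 1, 4, 3], [9, 8, 19, 18])
# back-transforming with s_Y / s_Xi recovers the raw coefficients 2 and 3
```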
•For all models, R²: proportion of variance explained by the model
•Should I add another variable? Test the difference in R²
•NOTE: R² will ‘top’ out as you add variables (so frequently adjusted R² is used)
•Can also compare models using ‘anova’ in R, using AIC, etc.
•How to add variables: stepwise addition, stepwise deletion, random addition, etc.
•NOTE: different methods yield different results (b/c the $b_j$ depend on the other variables in the model)
•CAREFUL WITH BIOLOGICAL INTERPRETATION!!!
•To compare 2 regression lines, calculate F-ratio as:
$F = \frac{(b_1 - b_2)^2}{\bar{s}^2_{Y\cdot X}\left(\frac{1}{\sum(X_1 - \bar{X}_1)^2} + \frac{1}{\sum(X_2 - \bar{X}_2)^2}\right)}$, with df = 1, $(n_1 + n_2 - 4)$
•Where $\bar{s}^2_{Y\cdot X}$ is the weighted average of $s^2_{Y\cdot X}$ from the two regressions
•Procedure can be generalized to compare > 2 regression lines (see Biometry)
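This two-slope F-ratio can be sketched numerically; a Python illustration with two made-up samples (in practice this would be done in R, and the pooled residual variance here is the SSE-weighted average of the two regressions):

```python
def compare_slopes(x1, y1, x2, y2):
    """F-ratio for H0: equal slopes in two simple regressions
    (df = 1, n1 + n2 - 4)."""
    def fit(x, y):
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        ss_x = sum((xi - xbar) ** 2 for xi in x)
        b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
        b0 = ybar - b * xbar
        sse = sum((yi - (b0 + b * xi)) ** 2 for xi, yi in zip(x, y))
        return b, ss_x, sse, n

    b1, ssx1, sse1, n1 = fit(x1, y1)
    b2, ssx2, sse2, n2 = fit(x2, y2)
    # pooled (weighted average) residual variance of the two regressions
    s2 = (sse1 + sse2) / ((n1 - 2) + (n2 - 2))
    return (b1 - b2) ** 2 / (s2 * (1 / ssx1 + 1 / ssx2))

f = compare_slopes([1, 2, 3, 4, 5], [2, 3, 5, 4, 6],
                   [1, 2, 3, 4, 5], [1, 2, 2, 4, 4])
```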
•Often, one wishes to compare regression lines AND the groups
•ANCOVA ‘combines’ regression and ANOVA
•H₀: no difference among slopes, no difference among groups
•Must first compare slopes, then compare groups (ANOVA) while holding effects of covariate constant
•Several possible outcomes:
•Groups the same, slopes the same (overlapping data)
•Groups the same, slopes the same (non-overlapping data)
•Slopes the same, groups different
•Slopes different, ‘easily’ identified pattern
•Slopes different, ‘complicated’ pattern
•Model: $Y_{ij} = \mu + \alpha_i + b_{within}(X_{ij} - \bar{X}_i) + e_{ij}$
•Calculate SS for:
1: covariate (common regression)
2: group x covariate interaction (slopes test)
3: groups

Source        df     SS      MS         F          P
Group         a-1    SSG     SSG/df     MSG/MSEG
Covariate     1      SSR     SSR/df     MSR/MSE
Groups x Cov  a-1    SSInt   SSInt/df   MSInt/MSE
Error         n-2a   SSE     SSE/df

•Note: MSEG = MSE + MSInt
•Standard approach:
•Compare F-ratios for each factor to F-distribution with appropriate df
•Resampling Alternative:
•Residual randomization: shuffle residuals from the reduced model to assess that factor (e.g., remove A×cov and use randomization to test the A×cov factor)
[Figure: schematic of observed data (E_obs) vs. randomized data (E_rand)]
•First, we’re testing whether or not slopes for each group are statistically different (slopes test = test of interaction)
•If they are different, we cannot compare groups, as the answer depends upon WHERE along covariate we compare (though, an ad hoc approach could be used)
•If slopes are NOT different, fit a common slope, and then test groups while holding effects of covariate constant (ANOVA on adjusted group means)
> anova(lm(Y~X2*SexBySurv))
Df Sum Sq Mean Sq F value Pr(>F)
X2 1 0.023215 0.0232149 77.3890 8.139e-15 ***
SexBySurv 3 0.005172 0.0017239 5.7467 0.001014 **
X2:SexBySurv 3 0.000651 0.0002172 0.7239 0.539496
Residuals 128 0.038397 0.0003000
•Fit common-slope model
> anova(lm(Y~X2+SexBySurv))
Df Sum Sq Mean Sq F value Pr(>F)
X2 1 0.023215 0.0232149 77.8814 6.015e-15 ***
SexBySurv 3 0.005172 0.0017239 5.7833 0.0009591 ***
[Figure: scatterplot of Y vs. X2 with fitted lines by group]
1. Perform ANOVA on regression residuals: NOT the same as ANCOVA (different df, pooled b, etc.). Also, you lose the test of slopes, which is important (see J. Anim. Ecol. 2001. 70:708-711)
2. Significant gp x cov interaction, but still compare groups: not useful, as the answer depends upon where along the regression you compare
3. “Size may be a covariate, so I’ll use a small size range to ‘standardize’ for it”: choosing animals of similar sizes will eliminate the covariate, but will also eliminate potentially important biological information (e.g., what if male head width grows relatively faster than in females, i.e. a size x head interaction?)
•Many other models possible
•PATH ANALYSIS & SEM (structural equation modeling): looks at partial correlation coefficients and partial regression coefficients to explain variance in data according to a hypothesized ‘causal’ path between variables
[Figure: path diagram relating X1, X2, X3 and Y]
•Curvilinear regression: similar idea as above, but with ‘adjusted’ X variables: X, X², X³, etc.
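Curvilinear regression is just multiple regression with X² (and possibly X³, etc.) as extra predictors. A Python sketch of the quadratic case, with data constructed exactly from Y = 1 + 2X + 3X² so the fit must recover those coefficients (made-up numbers; R's lm(Y ~ X + I(X^2)) would do the same):

```python
def quadratic_fit(x, y):
    """Fit Y = b0 + b1*X + b2*X^2 + e by treating X^2 as a second
    predictor in the centered normal equations."""
    n = len(x)
    x2 = [v * v for v in x]                      # the 'adjusted' variable X^2
    m1, m2, my = sum(x) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# exact data from Y = 1 + 2*X + 3*X^2
b0, b1, b2 = quadratic_fit([0, 1, 2, 3], [1, 6, 17, 34])
```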