Regression Models & ANCOVA

Advanced Biostatistics

Dean C. Adams

Lecture 4

EEOB 590C

Linear Regression Models

•Regression: Y~X

•X & Y are continuous

•H0: no relationship between X & Y

•Model: Y = b0 + b1X + e (b0 is intercept, b1 is slope, and e is error)

•Test statistic: F-ratio (ratio of variances): F = s²model / s²error

[Figure: scatterplot of headwdth vs. SVL with fitted regression line]

•Name from Galton's work: offspring values 'regress' to that of parents (tend towards)

Model I Regression Assumptions

1. Independence: error in Y (e) must be independent

2. Normality: error in Y (e) must be normally distributed

3. Homoscedasticity: samples along regression line are homoscedastic (variance doesn't 'balloon' along regression)

4. Values of X are independent and measured WITHOUT ERROR

Model I Regression Calculations

•Fit line that minimizes sum of squared deviations (LS fit) from Y-variates to line (vertical deviations b/c no error in X)

•Slope calculated as: b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = SPxy / SSx

•Intercept calculated as: b0 = Ȳ − b1X̄

[Figure: scatterplot of headwdth vs. SVL with least-squares fit]

•Note: this is Model I regression with 1 Y for each X. Slight alterations exist for multiple Y for each X: see Biometry.
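The two formulas above can be sketched directly. The lecture's own examples use R; the snippet below is a Python/NumPy illustration with made-up data points chosen to lie exactly on Y = 1 + 2X:

```python
import numpy as np

# Minimal sketch of the least-squares formulas on this slide:
# b1 = SP_xy / SS_x,  b0 = Ybar - b1 * Xbar  (made-up illustrative data).
def ols_fit(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # exactly Y = 1 + 2X
b0, b1 = ols_fit(x, y)
```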

Significance Testing of Model I Regression

•Significance assessed using F-ratio (MSRegr vs. MSErr)

•SS explained: SSM = Σ(Ŷi − Ȳ)², where Ŷi are predicted values

•SS error: SSE = Σ(Yi − Ŷi)²

•Test using F-ratio:

Source       df    SS            MS       F         P
Regression   1     SSM           SSM/df   MSM/MSE
Error        n-2   SSE           SSE/df
Total        n-1   SST=SSM+SSE

•Tests whether regression explains a significant portion of variation
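The table above can be sketched as follows (Python/SciPy illustration with made-up data; the lecture's own examples use R):

```python
import numpy as np
from scipy import stats

# Sketch of the regression ANOVA table: SSM from predicted values,
# SSE from residuals, F = MSM/MSE with df = (1, n - 2).
def regression_anova(x, y):
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    yhat = y.mean() + b1 * (x - x.mean())   # predicted values
    ssm = np.sum((yhat - y.mean())**2)      # SS explained
    sse = np.sum((y - yhat)**2)             # SS error
    f = (ssm / 1) / (sse / (n - 2))
    return f, stats.f.sf(f, 1, n - 2)       # F-ratio and P

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=30)
f_ratio, p_value = regression_anova(x, y)
```

Because t² = F for a single predictor, the P value agrees with the slope t-test reported by standard software.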

Regression: Assessing Significance

•Standard approach:

•Compare F-ratios to F-distribution with appropriate df

•Resampling Alternative:

•Residual randomization: shuffle Y across X for single-factor regression

[Figure: schematic of observed data (E obs) vs. randomized data (E rand)]
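The resampling alternative can be sketched as below (Python illustration, made-up data; the function name and the choice of SS-explained as the test statistic are this sketch's assumptions):

```python
import numpy as np

# Permutation sketch: shuffle Y across X, refit each time, and compare
# each randomized SS-explained to the observed SS-explained.
def perm_test_regression(x, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    def ss_model(x, y):
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
        yhat = y.mean() + b1 * (x - x.mean())
        return np.sum((yhat - y.mean())**2)
    obs = ss_model(x, y)
    n_ge = 1  # the observed value counts as one permutation
    for _ in range(n_perm):
        if ss_model(x, rng.permutation(y)) >= obs:
            n_ge += 1
    return obs, n_ge / (n_perm + 1)

x = np.arange(30.0)
y = 2.0 * x + np.random.default_rng(1).normal(scale=2.0, size=30)
ss_obs, p_value = perm_test_regression(x, y)   # strong signal, small p
```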

Model I Regression: Parameter Tests

•F-test assesses variance explained, but not how much Y changes with X

•Slope test (b1 ≠ 0): t = (b1 − 0) / sb1 with (n−2) df, where sb1 = √( MSE / Σ(Xi − X̄)² )

•Intercept test (b0 ≠ 0): t = (b0 − 0) / sb0 with (n−2) df, where sb0 = √( MSE ΣXi² / ( n Σ(Xi − X̄)² ) )
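The slope test can be sketched as follows (Python illustration, made-up data; the lecture's examples use R):

```python
import numpy as np
from scipy import stats

# Sketch of the slope t-test: t = b1 / s_b1 with s_b1 = sqrt(MSE / SS_x),
# compared to a t distribution with n - 2 df.
def slope_t_test(x, y):
    n = len(x)
    ssx = np.sum((x - x.mean())**2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / ssx
    resid = y - (y.mean() + b1 * (x - x.mean()))
    mse = np.sum(resid**2) / (n - 2)
    t = b1 / np.sqrt(mse / ssx)
    return t, 2 * stats.t.sf(abs(t), n - 2)   # two-sided p

rng = np.random.default_rng(2)
x = rng.normal(size=25)
y = 1.0 + 0.4 * x + rng.normal(scale=0.5, size=25)
t_stat, p_value = slope_t_test(x, y)
```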

Example: Model I Regression

> summary(lm(Y~X1))#TotalLength ~ AlarExtent

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.3045 0.3383 3.856 0.000179 ***

X1 0.6848 0.0615 11.136 < 2e-16 ***

Multiple R-squared: 0.4806, Adjusted R-squared: 0.4768

> anova(lm(Y~X1))  # provides the ANOVA table

Df Sum Sq Mean Sq F value Pr(>F)

X1 1 0.032412 0.032412 124.01 < 2.2e-16 ***

[Figure: scatterplot of Y vs. X1 with fitted line Y = 1.3045 + 0.6848X]

Regression vs. Correlation

•Correlation and regression closely linked:

bY·X = Σ(Xi − X̄)(Yi − Ȳ) / SSx        rxy = Σ(Xi − X̄)(Yi − Ȳ) / ( (n − 1) sx sy )

•Difference is in the denominator, so related as: rxy = b (sx / sy); r is a standardized regression coefficient (i.e. slope in standard deviation units)

•Meaning: b: error in Y direction only (direction of scatter); r: error in X & Y (dispersion of scatter)

•Thus, regression models causation (Y as function of X) and correlation models association (covariation in X & Y)

Regression vs. Correlation

Regression: bY·X = Σ(Xi − X̄)(Yi − Ȳ) / SSx

Correlation: rxy = Σ(Xi − X̄)(Yi − Ȳ) / ( (n − 1) sx sy )

[Figure: two headwdth vs. SVL scatterplots contrasting regression and correlation]
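The shared numerator makes the relation r = b(sx/sy) easy to verify numerically (Python sketch; the values below are made-up data modeled on the figure's axis ticks):

```python
import numpy as np

# Numerical check: slope and correlation share a numerator (SP_xy),
# so r = b * (s_x / s_y) exactly.
x = np.array([19.58, 27.65, 35.73, 43.80, 51.88])  # SVL-like values
y = np.array([3.03, 3.73, 4.44, 5.14, 5.85])       # headwdth-like values

sp_xy = np.sum((x - x.mean()) * (y - y.mean()))    # shared numerator
b = sp_xy / np.sum((x - x.mean())**2)              # regression slope
r = sp_xy / ((len(x) - 1) * x.std(ddof=1) * y.std(ddof=1))  # correlation
```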

Model II Regression

•When both X & Y contain error, model I regression underestimates slope

•Model II regression accounts for this by minimizing deviations perpendicular to regression line (not in Y direction only)

•Different types of model II regression, depending on data (major axis, reduced major axis, etc.)

•X & Y in ‘same’ units/scale: major axis regression (PCA)

•X & Y not in same units/scale: standard (reduced) major axis regression


Major Axis Regression

•Regression using slope of major axis of correlation ellipse

•Found with PCA (principal components analysis)

•Steps:

•Calculate covariance matrix for X & Y:

VCV = | s²X   sXY |
      | sXY   s²Y |

•Find principal eigenvalue (λ1) and eigenvector of VCV

•For a 2 x 2 matrix:

λ1 = ( s²X + s²Y + √( (s²X + s²Y)² − 4(s²X s²Y − s²XY) ) ) / 2

•Slope of major axis (principal eigenvector): b = (λ1 − s²X) / sXY

•More on PCA and eigenanalysis in a few weeks
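The eigenanalysis route and the closed-form 2 x 2 result give the same slope, which can be checked numerically (Python sketch, made-up data):

```python
import numpy as np

# Major-axis regression sketch: the MA slope is the slope of the
# principal eigenvector of the 2x2 covariance matrix of X & Y.
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)

vcv = np.cov(x, y)                       # [[s2_X, s_XY], [s_XY, s2_Y]]
eigvals, eigvecs = np.linalg.eigh(vcv)   # eigh: eigenvalues in ascending order
v = eigvecs[:, np.argmax(eigvals)]       # principal eigenvector
b_ma = v[1] / v[0]                       # slope = v_Y / v_X

# Closed-form 2x2 result from the slide gives the same slope:
lam1 = eigvals.max()
b_closed = (lam1 - vcv[0, 0]) / vcv[0, 1]
```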

Standard (Reduced) Major Axis Regression

•Model II regression for when X & Y are in different units/scale

•Either:

•Standardize variables (subtract the mean, divide by the standard deviation)

•Major axis regression of the standardized variables

•Or:

•Calculate SMA slope as: bSMA = bOLS / rXY

•Also called geometric mean or standard major axis regression
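Dividing the OLS slope by the correlation works out to a slope of magnitude sY/sX, which the sketch below checks numerically (Python illustration, made-up data):

```python
import numpy as np

# SMA sketch: b_SMA = b_OLS / r, which reduces to s_Y / s_X in magnitude
# (with the sign of the correlation).
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)

b_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
r = np.corrcoef(x, y)[0, 1]
b_sma = b_ols / r   # SMA (reduced major axis) slope
```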

Example: Model II Regression

> lmodel2(Y~X1,nperm=999)

Angle between the two OLS regression lines = 20.53316 degrees

Method Intercept Slope Angle (degrees) P-perm (1-tailed)

1 OLS 1.3044593 0.6847984 34.40328 0.001

2 MA -0.3329159 0.9824064 44.49152 0.001

3 SMA -0.3624212 0.9877692 44.64746 NA

[Figure: MA regression of Y on X1, showing OLS slope (blue) and MA slope (red)]

*Note: SMA slope = bOLS / r

Multiple Regression

•Identify relationship between several X and continuous Y

•Predict Y using several variables simultaneously

•Model: Yi = b0 + b1X1i + b2X2i + … + ei

•bi are partial regression coefficients (effect of Xi holding effects of other X constant)

•For 2 X variables, think of fitting a plane to the data
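Fitting the plane can be sketched with a design matrix and least squares (Python illustration, made-up data):

```python
import numpy as np

# Multiple-regression sketch: with two X variables, least squares fits a
# plane; each coefficient is a partial slope, holding the other X constant.
rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.1 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # b0, b1, b2
```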

Example: Multiple Regression

> summary(lm(Y~X1+X2))

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.82521 0.34843 5.238 6.18e-07 ***

X1 0.52516 0.07142 7.354 1.77e-11 ***

X2 0.11042 0.02835 3.895 0.000155 ***

> anova(lm(Y~X1+X2))

Df Sum Sq Mean Sq F value Pr(>F)

X1 1 0.032412 0.032412 137.118 < 2.2e-16 ***

X2 1 0.003585 0.003585 15.168 0.0001551 ***

[Figure: scatterplot of Y against X1 and X2 with fitted regression plane]

Multiple Regression Cont.

•Standardized partial regression coefficients (b'i): express the change in Y per change in Xi, holding Xj constant, in normalized units (standard normal deviates)

•ADVANTAGE: can be directly compared for each variable

•Found from variable correlations: b'2 = ( rYX2 − rYX1 rX1X2 ) / ( 1 − r²X1X2 )

•Interpret, then calculate back to original units for conventional partial regression coefficients: b2 = b'2 (sY / sX2)
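For two predictors, the correlation-based formula and the back-conversion recover the ordinary partial regression coefficient exactly, which the sketch below verifies (Python illustration, made-up data):

```python
import numpy as np

# Standardized partial regression coefficients from pairwise correlations;
# converting back with s_Y / s_X2 recovers the ordinary coefficient.
rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)           # correlated predictors
y = 1.0 + 0.5 * x1 + 0.2 * x2 + rng.normal(scale=0.1, size=n)

r_y1 = np.corrcoef(y, x1)[0, 1]
r_y2 = np.corrcoef(y, x2)[0, 1]
r_12 = np.corrcoef(x1, x2)[0, 1]
b2_std = (r_y2 - r_y1 * r_12) / (1 - r_12**2)    # standardized b'2
b2 = b2_std * y.std(ddof=1) / x2.std(ddof=1)     # back to original units

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary partial coefficients
```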

Multiple Regression: Adding X Variables

•For all models, R 2 : proportion of variance explained by model

•Should I add another variable? Test difference in R 2

•NOTE: R 2 will ‘top’ out as you add variables (so frequently adjusted R 2 used)

•Can also compare models using ‘aov’ in R, using AIC, etc.

•How to add variables: stepwise addition, stepwise deletion, random addition, etc.

•NOTE: different methods yield different results (b/c bj depend on other variables in model)

CAREFUL WITH BIOLOGICAL INTERPRETATION!!!


Comparing Regression Lines

•To compare 2 regression lines, calculate F-ratio as:

F = (b1 − b2)² / [ s²Y·X ( 1/Σ(X1i − X̄1)² + 1/Σ(X2i − X̄2)² ) ]

Df = 1, (n1 + n2 − 4)

•Where s²Y·X is the weighted average residual variance of the two regressions

•Procedure can be generalized to compare > 2 regression lines (see Biometry)
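The two-slope F-ratio can be sketched as below (Python illustration, made-up data; the function name is this sketch's own):

```python
import numpy as np
from scipy import stats

# Two-slope comparison: F is the squared slope difference over the pooled
# residual variance times (1/SSx1 + 1/SSx2), with df = 1 and (n1 + n2 - 4).
def slope_f_test(x1, y1, x2, y2):
    def fit(x, y):
        ssx = np.sum((x - x.mean())**2)
        b = np.sum((x - x.mean()) * (y - y.mean())) / ssx
        resid = y - (y.mean() + b * (x - x.mean()))
        return b, ssx, np.sum(resid**2), len(x)
    b1, ssx1, sse1, n1 = fit(x1, y1)
    b2, ssx2, sse2, n2 = fit(x2, y2)
    s2 = (sse1 + sse2) / (n1 + n2 - 4)            # weighted average s^2_{Y.X}
    f = (b1 - b2)**2 / (s2 * (1/ssx1 + 1/ssx2))
    return f, stats.f.sf(f, 1, n1 + n2 - 4)

rng = np.random.default_rng(7)
x1 = rng.normal(size=30); y1 = 1.0 * x1 + rng.normal(scale=0.2, size=30)
x2 = rng.normal(size=30); y2 = 3.0 * x2 + rng.normal(scale=0.2, size=30)
f_stat, p_value = slope_f_test(x1, y1, x2, y2)   # very different slopes
```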

Analysis of Covariance: ANCOVA

•Often, one wishes to compare regression lines AND the groups

•ANCOVA ‘combines’ regression and ANOVA

•H0: no difference among slopes, no difference among groups

•Must first compare slopes, then compare groups (ANOVA) while holding effects of covariate constant

•Several possible outcomes

•Groups the same, slopes the same (overlapping data)

•Groups the same, slopes the same (non-overlapping data)

•Slopes the same, groups different

•Slopes different, 'easily' identified pattern

•Slopes different, 'complicated' pattern

ANCOVA: Computations

•Model: Yij = αi + bwithin(Xij − X̄i) + eij, where αi is the group effect

•Calculate SS for:

1: covariate (common regression)

2: group x covariate interaction (slopes test)

3: groups

Source        df     SS      MS         F          P
Group         a-1    SSG     SSG/df     MSG/MSEG
Covariate     1      SSR     SSR/df     MSR/MSE
Group x Cov   a-1    SSInt   SSInt/df   MSInt/MSE
Error         n-2a   SSE     SSE/df

•Note: MSEG = MSE + MSInt
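The slopes test (step 2) can be sketched as a comparison of the separate-slopes and common-slope models (Python illustration with made-up data for a = 2 groups; the design-matrix approach is this sketch's own framing):

```python
import numpy as np
from scipy import stats

# ANCOVA slopes-test sketch: compare separate-slopes vs. common-slope
# models by the change in residual SS, as an F with (1, error df).
def sse(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef)**2)

rng = np.random.default_rng(8)
n = 60
x = rng.normal(size=n)
g = np.repeat([0.0, 1.0], n // 2)   # group indicator (a = 2)
y = 1.0 + 0.5 * x + 0.3 * g + 1.0 * g * x + rng.normal(scale=0.2, size=n)

common = np.column_stack([np.ones(n), g, x])            # common slope
separate = np.column_stack([np.ones(n), g, x, g * x])   # adds group x cov
df_err = n - separate.shape[1]
f = (sse(common, y) - sse(separate, y)) / (sse(separate, y) / df_err)
p = stats.f.sf(f, 1, df_err)   # small p here: slopes are heterogeneous
```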

ANCOVA: Assessing Significance

•Standard approach:

•Compare F-ratios for each factor to F-distribution with appropriate df

•Resampling Alternative:

•Residual randomization: shuffle residuals from reduced model to assess that factor (e.g., remove A×cov and use randomization to test A×cov factor)

[Figure: schematic of observed data (E obs) vs. randomized data (E rand)]

ANCOVA: What are We Doing?

•First, we’re testing whether or not slopes for each group are statistically different (slopes test = test of interaction)

•If they are different, we cannot compare groups, as the answer depends upon WHERE along covariate we compare (though, an ad hoc approach could be used)

•If slopes are NOT different, fit a common slope, and then test groups while holding effects of covariate constant (ANOVA on adjusted group means)


Example: ANCOVA

> anova(lm(Y~X2*SexBySurv))

Df Sum Sq Mean Sq F value Pr(>F)

X2 1 0.023215 0.0232149 77.3890 8.139e-15 ***

SexBySurv 3 0.005172 0.0017239 5.7467 0.001014 **

X2:SexBySurv 3 0.000651 0.0002172 0.7239 0.539496

Residuals 128 0.038397 0.0003000

Fit common-slope model

> anova(lm(Y~X2+SexBySurv))

Df Sum Sq Mean Sq F value Pr(>F)

X2 1 0.023215 0.0232149 77.8814 6.015e-15 ***

SexBySurv 3 0.005172 0.0017239 5.7833 0.0009591 ***

[Figure: Y vs. X2 with common-slope fits for the SexBySurv groups]

ANCOVA Data: Common Errors

1. Perform ANOVA on regression residuals: NOT the same as ANCOVA (different df, pooled b, etc.). Also, you lose the test of slopes, which is important (see J. Anim. Ecol. 2001. 70:708-711)

2. Significant gp * cov, but still compare groups: not useful, as answer depends upon where along regression you compare

3. "Size may be a covariate, so I'll use a small size range to 'standardize' for it": choosing animals of similar sizes will eliminate the covariate, but also will eliminate potentially important biological information (e.g., what if male head width grows relatively faster than that of females, i.e. a size * sex interaction?)

Other Regression Models

•Many other models possible

•PATH ANALYSIS & SEM (structural equation modeling): looks at partial correlation coefficients and partial regression coefficients to explain variance in data according to hypothesized 'causal' path between variables

[Figure: path diagram with arrows from X1, X2, and X3 to Y]

•Curvilinear regression: similar idea as above, but with 'adjusted' X variables: X, X², X³, etc.
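Adding the 'adjusted' X variables amounts to extra columns in the design matrix (Python sketch, made-up data):

```python
import numpy as np

# Curvilinear (polynomial) regression sketch: add X^2 as an extra
# predictor column and fit by ordinary least squares.
rng = np.random.default_rng(9)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.1, size=100)

X = np.column_stack([np.ones_like(x), x, x**2])  # design: 1, X, X^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # b0, b1, b2
```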
