Multiple regression - University of Dayton

Multiple Regression
Analysis of Biological Data
Ryan McEwan and Julia Chapman
Department of Biology
University of Dayton
ryan.mcewan@udayton.edu
Simple linear regression is a way of understanding the relationship between two variables
where the data analyst assumes that one variable (predictor; independent variable) drives a
second variable (response; dependent).
This is extremely useful, and yet in most biological situations any given response variable is
likely to be determined by more than a single predictor.
In this case, wing length is related to age, but you can imagine that nutritional status or
gender could be important as well.
Here is aboveground biomass (Y axis) in a forest
and stem density in that forest.
You see a relationship, but a messy one.
Maybe adding other variables would help
explain aboveground biomass (AGB).
How about soil nitrogen?
How about species diversity?
How about mean temperature at each point?
Etc.
In biology you may be collecting a slew of values that might serve as
predictors for a potential response.
Consider a correlation matrix!!
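As a rough sketch of what a correlation matrix looks like in practice, the example below builds one with NumPy. The data are simulated and the variable names (soil_nitrogen, canopy_openness, litter_depth, herb_cover) are hypothetical stand-ins, not measurements from this lecture.

```python
import numpy as np

# Simulated (hypothetical) plot-level measurements; none of these values come from the lecture.
rng = np.random.default_rng(42)
n_plots = 50
soil_nitrogen = rng.normal(10.0, 2.0, n_plots)
canopy_openness = rng.normal(30.0, 8.0, n_plots)
# Litter depth is deliberately simulated to track canopy openness (negatively).
litter_depth = 8.0 - 0.15 * canopy_openness + rng.normal(0.0, 0.5, n_plots)
herb_cover = 5.0 + 2.0 * soil_nitrogen + 0.8 * canopy_openness + rng.normal(0.0, 4.0, n_plots)

names = ["herb_cover", "soil_nitrogen", "canopy_openness", "litter_depth"]
data = np.column_stack([herb_cover, soil_nitrogen, canopy_openness, litter_depth])

# rowvar=False tells NumPy that each column (not each row) is a variable.
corr = np.corrcoef(data, rowvar=False)

# Print the matrix with row and column labels.
print(" " * 16 + "".join(f"{name:>16}" for name in names))
for name, row in zip(names, corr):
    print(f"{name:>16}" + "".join(f"{r:16.2f}" for r in row))
```

Each cell is the Pearson correlation between two variables. The first row shows how strongly each candidate predictor tracks the response, and the off-diagonal cells among predictors show which predictors overlap with one another.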
You are building a model!!
[Slide graphic: Herbaceous cover = predictor 1 + predictor 2 + predictor 3 + predictor 4]
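One way to picture "Herbaceous cover = predictor + predictor + ..." is as an ordinary least-squares fit with one coefficient per predictor plus an intercept. The sketch below assumes simulated data and two hypothetical predictors (soil_nitrogen, canopy_openness); it is an illustration, not the analysis behind these slides.

```python
import numpy as np

# Simulated (hypothetical) data: herbaceous cover driven by two predictors plus noise.
rng = np.random.default_rng(0)
n = 60
soil_nitrogen = rng.normal(10.0, 2.0, n)
canopy_openness = rng.normal(30.0, 8.0, n)
herb_cover = 5.0 + 2.0 * soil_nitrogen + 0.8 * canopy_openness + rng.normal(0.0, 3.0, n)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), soil_nitrogen, canopy_openness])

# Least-squares estimate of the coefficients.
coef, *_ = np.linalg.lstsq(X, herb_cover, rcond=None)
intercept, b_nitrogen, b_canopy = coef

fitted = X @ coef
residuals = herb_cover - fitted

print(f"herb_cover = {intercept:.2f} + {b_nitrogen:.2f} * soil_nitrogen "
      f"+ {b_canopy:.2f} * canopy_openness")
```

Each coefficient estimates the change in the response per unit change in that predictor, holding the other predictors fixed.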
Multiple regression is a process of figuring out
statistically which suite of variables best predicts a
particular response…
…okay how do you proceed?
Forward selection:
(1) select the variable that forms the best regression relationship
with the response variable.
(2) Add each remaining variable in the pool, in a stepwise fashion, to
find the best relationship, throwing the weaker ones back into the pool.
(3) Repeat step 2 until adding in variables no longer makes a
stronger relationship.
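The forward selection steps above can be sketched in code. This is a minimal version under stated assumptions: "best relationship" is scored with AIC (one common choice; the slides do not prescribe a criterion), the data are simulated, and the helper names aic_ols and forward_select are hypothetical.

```python
import numpy as np


def aic_ols(y, columns):
    """AIC (up to a constant) of a least-squares fit of y on an intercept plus `columns`."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k


def forward_select(y, pool):
    """Greedy forward selection; `pool` maps predictor name -> 1-D array."""
    selected = []
    best_aic = aic_ols(y, [])  # start from the intercept-only model
    while len(selected) < len(pool):
        remaining = [name for name in pool if name not in selected]
        # Try adding each remaining variable, one at a time.
        scores = {
            name: aic_ols(y, [pool[v] for v in selected + [name]])
            for name in remaining
        }
        best_name = min(scores, key=scores.get)
        if scores[best_name] >= best_aic:
            break  # no addition strengthens the model, so stop
        selected.append(best_name)  # keep the winner; the rest stay in the pool
        best_aic = scores[best_name]
    return selected, best_aic


# Simulated (hypothetical) data: only the first two predictors truly affect the response.
rng = np.random.default_rng(1)
n = 80
pool = {
    "soil_nitrogen": rng.normal(10.0, 2.0, n),
    "canopy_openness": rng.normal(30.0, 8.0, n),
    "soil_ph": rng.normal(6.0, 0.5, n),
    "slope": rng.normal(12.0, 4.0, n),
}
herb_cover = (5.0 + 2.0 * pool["soil_nitrogen"] + 0.8 * pool["canopy_openness"]
              + rng.normal(0.0, 3.0, n))

selected, final_aic = forward_select(herb_cover, pool)
print("selected predictors:", selected)
```

Swapping AIC for adjusted R² or a p-value threshold only changes the scoring function; the loop structure stays the same.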
Backward selection:
(1) Start with all variables in the model
(2) Eject each one and test the relationship
(3) Throw back into the pool the variable(s) that weaken, or fail to
strengthen the relationship.
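Backward selection can be sketched the same way, running the loop in the opposite direction. As before, this assumes AIC as the measure of model strength (an assumption, not something the slides specify), simulated data, and hypothetical helper names.

```python
import numpy as np


def aic_ols(y, columns):
    """AIC (up to a constant) of a least-squares fit of y on an intercept plus `columns`."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k


def backward_select(y, pool):
    """Greedy backward elimination; `pool` maps predictor name -> 1-D array."""
    selected = list(pool)  # start with every variable in the model
    best_aic = aic_ols(y, [pool[v] for v in selected])
    while selected:
        # Eject each variable in turn and score the reduced model.
        scores = {
            name: aic_ols(y, [pool[v] for v in selected if v != name])
            for name in selected
        }
        drop = min(scores, key=scores.get)
        if scores[drop] >= best_aic:
            break  # every remaining variable earns its place, so stop
        selected.remove(drop)  # throw the weakest variable back into the pool
        best_aic = scores[drop]
    return selected, best_aic


# Simulated (hypothetical) data: only the first two predictors truly affect the response.
rng = np.random.default_rng(2)
n = 80
pool = {
    "soil_nitrogen": rng.normal(10.0, 2.0, n),
    "canopy_openness": rng.normal(30.0, 8.0, n),
    "soil_ph": rng.normal(6.0, 0.5, n),
    "slope": rng.normal(12.0, 4.0, n),
}
herb_cover = (5.0 + 2.0 * pool["soil_nitrogen"] + 0.8 * pool["canopy_openness"]
              + rng.normal(0.0, 3.0, n))

kept, final_aic = backward_select(herb_cover, pool)
print("predictors kept:", kept)
```

Forward and backward selection are both greedy searches, so they are not guaranteed to land on the same final model.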
A few more things to cover:
(a) How to evaluate models?
(b) What about correlated variables?
(c) What about categorical variables?
(a) How to evaluate models?
(1) P-value
(2) R²
(3) Akaike Information Criterion (AIC)
AIC is a way of comparing the information content of different models. It does not
provide a statistical test, per se, but rather provides a quantitative way to assess model
fit vs. model complexity. Among candidate models fit to the same data, the best model is
the one with the lowest AIC.
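To make the fit vs. complexity trade-off concrete, the sketch below computes R² and AIC for three nested models. It assumes the usual least-squares form of AIC for normally distributed errors, n ln(RSS/n) + 2k, which drops a constant shared by all models fit to the same data. The data are simulated and the helper name fit_stats is hypothetical.

```python
import numpy as np


def fit_stats(y, columns):
    """Return R^2 and AIC (up to a constant) for y regressed on an intercept plus `columns`."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))     # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))     # total sum of squares
    k = X.shape[1] + 1                           # coefficients plus the error variance
    return {"r2": 1.0 - rss / tss, "aic": n * np.log(rss / n) + 2 * k}


# Simulated (hypothetical) data: slope has no real effect on the response.
rng = np.random.default_rng(3)
n = 80
soil_nitrogen = rng.normal(10.0, 2.0, n)
canopy_openness = rng.normal(30.0, 8.0, n)
slope = rng.normal(12.0, 4.0, n)
herb_cover = 5.0 + 2.0 * soil_nitrogen + 0.8 * canopy_openness + rng.normal(0.0, 3.0, n)

models = {
    "nitrogen": [soil_nitrogen],
    "nitrogen + canopy": [soil_nitrogen, canopy_openness],
    "nitrogen + canopy + slope": [soil_nitrogen, canopy_openness, slope],
}
results = {name: fit_stats(herb_cover, cols) for name, cols in models.items()}

for name, stats in results.items():
    print(f"{name:<28} R2 = {stats['r2']:.3f}   AIC = {stats['aic']:.1f}")
```

R² can never go down when a predictor is added to a nested model, so on its own it always favors the bigger model. AIC adds a penalty of 2 per extra parameter, so a predictor has to improve the fit by enough to pay for itself.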
(b) What about correlated variables?
Strongly correlated variables effectively contain the
same information, and thus should not be inserted into
the same model.
The data analyst needs to assess “multicollinearity” among the variables in the model. One
simple way to think about it is a correlation matrix. Formally, a model building procedure
generally includes calculating “Variance Inflation Factors” and ejecting from the model one
of any two variables that are highly correlated.
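A variance inflation factor (VIF) can be computed by regressing each predictor on all of the other predictors and taking 1 / (1 - R²) from that fit. The sketch below does this with NumPy on simulated data; the predictor names and the helper name vif are hypothetical, and light_at_ground is deliberately simulated to track canopy_openness.

```python
import numpy as np


def vif(predictors):
    """Variance inflation factor for each predictor; `predictors` maps name -> 1-D array."""
    names = list(predictors)
    factors = {}
    for name in names:
        target = predictors[name]
        others = [predictors[other] for other in names if other != name]
        # Regress this predictor on all of the other predictors (with an intercept).
        X = np.column_stack([np.ones(len(target))] + others)
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        rss = float(np.sum((target - X @ coef) ** 2))
        tss = float(np.sum((target - target.mean()) ** 2))
        r2 = 1.0 - rss / tss
        factors[name] = 1.0 / (1.0 - r2)
    return factors


# Simulated (hypothetical) predictors: two overlap strongly, one is independent.
rng = np.random.default_rng(4)
n = 100
canopy_openness = rng.normal(30.0, 8.0, n)
predictors = {
    "canopy_openness": canopy_openness,
    "light_at_ground": 0.5 * canopy_openness + rng.normal(0.0, 1.0, n),
    "soil_nitrogen": rng.normal(10.0, 2.0, n),
}

factors = vif(predictors)
for name, value in factors.items():
    print(f"{name:<18} VIF = {value:.2f}")
```

A VIF near 1 means the predictor shares little information with the others; large values flag a predictor that the rest of the set can nearly reproduce. Where to draw the cutoff is a judgment call for the analyst.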
(c) What about categorical variables?
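Multiple regression models CAN incorporate yes/no variables or even categorical
variables as predictors, typically by coding them as 0/1 indicator ("dummy") variables.
When the response variable itself is yes/no, logistic regression is used instead.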
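As a sketch of how a yes/no or categorical predictor can enter a regression, the example below codes each one as 0/1 indicator columns, leaving out one category as the reference level. The data are simulated, and the variable names (burned, forest_type) and category labels are hypothetical.

```python
import numpy as np

# Simulated (hypothetical) data with one continuous, one yes/no, and one categorical predictor.
rng = np.random.default_rng(7)
n = 90
soil_nitrogen = rng.normal(10.0, 2.0, n)
burned = rng.integers(0, 2, n).astype(float)          # yes/no coded as 1/0
forest_type = rng.choice(["maple", "oak", "pine"], size=n)

type_effect = {"maple": 0.0, "oak": 4.0, "pine": -8.0}
herb_cover = (
    20.0 + 1.5 * soil_nitrogen + 6.0 * burned
    + np.array([type_effect[t] for t in forest_type.tolist()])
    + rng.normal(0.0, 2.0, n)
)

# One indicator column per category, except the first (the reference level).
levels = sorted(set(forest_type.tolist()))
reference, other_levels = levels[0], levels[1:]
dummies = [(forest_type == level).astype(float) for level in other_levels]

X = np.column_stack([np.ones(n), soil_nitrogen, burned] + dummies)
coef, *_ = np.linalg.lstsq(X, herb_cover, rcond=None)

labels = ["intercept", "soil_nitrogen", "burned"] + [f"forest_type={lv}" for lv in other_levels]
for label, value in zip(labels, coef):
    print(f"{label:<20} {value:8.2f}")
```

Each indicator coefficient estimates the difference in the response between that category and the reference level (here the alphabetically first category), with the other predictors held fixed.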