Regression and Variable Selection
Maria Sagot (msagot1@lsu.edu)
November 20, 2009; with subsequent modifications on Nov. 30, 2009
Background
Regression analysis “is a method used for analyzing a relationship between two or more variables in such a manner that one variable can be predicted or explained by using information on the others” (Freund and Wilson 2003). It is used when both the response and predictor variables are continuous (Crawley 2007). The simplest linear model describing the relationship between the response variable and the explanatory variable(s) has the form:

y = β0 + β1x + ε

where y is the response variable, x is a single continuous explanatory variable, β0 is the intercept (the value of y when x = 0), β1 is the slope (the change in y corresponding to a unit change in x) and ε is the error term (Freund and Wilson 2003).
Assumptions:
1. The linear model is appropriate
2. The error terms are independent
3. The error terms are (approximately) normally distributed
4. The error terms have a common variance (Freund and Wilson 2003)
In R, you can perform simple and multiple linear regressions using the functions lm or glm from the package stats. These functions can be used to fit linear models, for example regressions, single-stratum analysis of variance or analysis of covariance.
Normality
One critical assumption in regression analysis that is often not met, especially in biological studies, is normality. Non-normal variables can be normalized by transformations; the most frequent are logarithmic and square-root transformations (Crawley 2007). Most authors agree that the best test to detect deviations from normality is the Shapiro-Wilk test (Conover 1999). In R, this test can be performed with the function shapiro.test from the package stats.
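For example (a minimal sketch with simulated, right-skewed data rather than the habitat data used later in this tutorial):

> x=rexp(100, rate=0.5)    # simulated right-skewed data (hypothetical example)
> shapiro.test(x)          # typically a small p-value: evidence against normality
> shapiro.test(log(x))     # re-test after a logarithmic transformation
> shapiro.test(sqrt(x))    # re-test after a square-root transformation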
Poisson Regression
In biological studies we often find data sets that are counts, and data sets with many zeros. In these cases a Poisson distribution (rather than the normal) is more appropriate, since in a Poisson distribution the variance is equal to the mean. In R, Poisson regressions can be performed with the function glm, specifying family=poisson, from the package stats.
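For example (a minimal sketch with simulated count data; the object names counts and x are hypothetical):

> counts=rpois(50, lambda=2)              # simulated counts, including zeros
> x=rnorm(50)                             # a continuous predictor
> pois.mod=glm(counts~x, family=poisson)  # Poisson regression
> summary(pois.mod)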
Variable selection
Variable selection in regression identifies the best subset among many variables to include in the model. This problem arises when one wants to model the relationship between the response variable and a subset of many predictor variables, but there is uncertainty about which subset to use. This situation is particularly important when the number of predictors is large and the data set contains many redundant or irrelevant explanatory variables (Geaghan 2007). In variable selection, the addition or removal of any variable changes the estimated effects of all other variables in the model; therefore, this process should be done one variable at a time. The most commonly used variable selection methods are Forward, Backward and Stepwise Selection (Geaghan 2007). Backward Selection starts with the full model, and then a selection criterion is established (removal of non-significant variables). The least significant variable is examined and, if it does not meet the criterion, it is deleted and the model is refit. This process is repeated until all variables that do not meet the criterion are eliminated from the model (Geaghan 2007). Forward Selection examines all possible one-variable regressions and selects the best one to start with. The first variable in this case is the most significant one and it remains in the model for the whole analysis. This step is repeated until no more variables meet the criterion (Geaghan 2007). Stepwise Selection is like Forward Selection except that at each step the analysis checks whether the variables already in the model still meet the criterion; if one or more of them do not, they are removed from the model (Geaghan 2007). In R, variable selection can be performed with the function stepAIC from the package MASS. The function boot.stepAIC, from the package bootStepAIC, can also perform variable selection, as sketched below.
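As a sketch of how the three procedures map onto R (assuming a fitted full model full.model and an intercept-only model null.model; these object names are hypothetical):

> library(MASS)
> stepAIC(full.model, direction="backward")                            # Backward Selection
> stepAIC(null.model, scope=formula(full.model), direction="forward")  # Forward Selection
> stepAIC(full.model, direction="both")                                # Stepwise Selection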
And this is how it is performed:
1) Simple Regression
A sample data set (habitat.txt) is available with this tutorial.
Before you begin the analyses, save the file to your desktop.
> data=read.table(file="/Users/mariasagot/Desktop/habitat.txt", sep="\t", header=T)
read.table() reads the contents of a data file and creates a data frame. In this function, the argument header=T indicates that the data set contains labeled columns, or headers.
> attach(data)
Attaches the data set called data, so its columns can be referenced by name
> data[1:10,]
Displays the first 10 rows of the data set
density1 density2 pres.abs light grown.cover dbh num.trees height opening inclination sts
etc…
> names(data)
Displays column names
[1] "density1""density2""pres.abs""light""grown.cover""dbh"
[7] "num.trees""height""opening""inclination""sts""num.ind"
> linear=lm(density1~opening)
The function lm is used to perform regression, analysis of variance and analysis of covariance. These models have the form response~terms, where the response is a numeric vector and the terms specify linear predictors for the response. Terms can be specified in different ways: 1) first+second indicates that all the terms in the first predictor are taken together with all the terms in the second predictor; 2) first:second indicates the interaction of all the terms in the first predictor with all the terms in the second predictor; 3) first*second indicates the cross of the first and second predictors; 4) first+second+first:second is the same as first*second.
> anova(linear)
The anova function returns an analysis of variance table for the results of the function lm.
Analysis of Variance Table

Response: density1
          Df  Sum Sq Mean Sq F value    Pr(>F)
opening    1 22.6665 22.6665  5050.9 < 2.2e-16 ***
Residuals 96  0.4308  0.0045
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(linear)
summary returns summaries of the results of various model-fitting functions. The function invokes particular methods depending on the class of the first argument. In the case of the lm function it displays the coefficient table plus other statistics such as significance and the R-squared value.
Call:
lm(formula = density1 ~ opening)

Residuals:
      Min        1Q    Median        3Q       Max
-0.114083 -0.053098 -0.006836  0.044171  0.151146

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1600410  0.0151026  -10.60   <2e-16 ***
opening      0.0092248  0.0001298   71.07   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06699 on 96 degrees of freedom
Multiple R-squared: 0.9813, Adjusted R-squared: 0.9812
F-statistic: 5051 on 1 and 96 DF, p-value: < 2.2e-16
2) Multiple Regression
> Linear.model=lm(data$density1~data$light+data$grown.cover+data$dbh+data$num.trees+data$height+data$opening+data$inclination+data$sts)
Performs a multiple regression using the function lm. The model has the same form as the simple linear regression (response~terms), where the response is a numeric vector and the terms specify linear predictors for the response.
> summary(Linear.model)
Call:
lm(formula = data$density1 ~ data$light + data$grown.cover +
data$dbh + data$num.trees + data$height + data$opening +
data$inclination + data$sts)
etc…
> gen.lin=glm(data$density1~data$light+data$grown.cover+data$dbh+data$num.trees+data$height+data$opening+data$inclination+data$sts)
The function glm is used to fit generalized linear models. The model has the same form as in the lm function (response~terms), where the response is a numeric vector and the terms specify linear predictors for the response. Called without a family argument, as here, glm uses the default gaussian family and fits the same model as lm.
> summary(gen.lin)
Call:
glm(formula = data$density1 ~ data$light + data$grown.cover +
data$dbh + data$num.trees + data$height + data$opening +
data$inclination + data$sts)
etc...
3) Normality
> residuals=resid(gen.lin)
This function returns the residuals of the regression.
           1            2            3            4            5            6            7
0.2435404341 0.1281088722 0.0194094075 0.0424278422 0.1396625054 0.0471302476 0.0252759542
etc…
> shapiro.test(residuals)
The function shapiro.test() performs the Shapiro-Wilk test of normality, which detects different types of departure from normality; here it is applied to the residuals of the regression. The null hypothesis of the test is that the sample is taken from a normal distribution.
Shapiro-Wilk normality test

data:  residuals
W = 0.9421, p-value = 0.0003017
4) Poisson Regression
> poisson.reg=glm(data$num.ind~data$light+data$grown.cover+data$dbh+data$num.trees+data$height+data$opening+data$inclination+data$sts, family=poisson)
The family argument of glm is used to specify the error distribution (and link function) of the model. Some of the distributions available are: binomial(link = "logit"), gaussian(link = "identity"), Gamma(link = "inverse"), inverse.gaussian(link = "1/mu^2"), poisson(link = "log"), quasi(link = "identity", variance = "constant"), quasibinomial(link = "logit") and quasipoisson(link = "log"). In this particular example we are specifying the use of a Poisson distribution.
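The link function can also be given explicitly, and if the counts turn out to be overdispersed (variance greater than the mean), quasipoisson is a common alternative. A minimal sketch reusing two columns of the habitat data:

> glm(data$num.ind~data$opening, family=poisson(link="log"))  # explicit log link
> glm(data$num.ind~data$opening, family=quasipoisson)         # allows for overdispersion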
> summary(poisson.reg)
Call:
glm(formula = data$num.ind ~ data$light + data$grown.cover +
data$dbh + data$num.trees + data$height + data$opening +
data$inclination + data$sts, family = poisson)
etc…
5) Variable Selection
> library(MASS)
Loads the package MASS from the library
> stepAIC(Linear.model, data, direction="both")
The function stepAIC( ) performs stepwise model selection using the Akaike information criterion (AIC). In the function you have to specify the object, the data and the direction ("forward", "backward" or "both").
Start: AIC=-554.92
data$density1 ~ data$light + data$grown.cover + data$dbh + data$num.trees +
    data$height + data$opening + data$inclination + data$sts

                    Df Sum of Sq  RSS     AIC
- data$inclination   1 0.0001612 0.28 -556.87
- data$num.trees     1 0.0003763 0.28 -556.79
- data$sts           1 0.0006276 0.28 -556.71
- data$light         1 0.0026434 0.29 -556.01
<none>                           0.28 -554.92
- data$height        1      0.02 0.30 -551.65
- data$grown.cover   1      0.02 0.31 -548.64
- data$dbh           1      0.05 0.33 -541.18
- data$opening       1      1.63 1.91 -369.71

Step: AIC=-556.87
etc…
> library(bootStepAIC)
> boot.stepAIC(Linear.model, data, B=10, direction="both")
The function boot.stepAIC( ) is similar to stepAIC from MASS, but it additionally implements a bootstrap procedure to investigate the variability of the selection. Here you also have to specify the object, the data and the direction. The procedure currently supports models fitted by the following functions: lm, aov, glm, negbin, polr, survreg, and coxph.
Summary of Bootstrapping the 'stepAIC()' procedure for
Call:
lm(formula = data$density1 ~ data$light + data$grown.cover +
data$dbh + data$num.trees + data$height + data$opening +
data$inclination + data$sts)
Bootstrap samples: 10
Direction: both
Penalty: 2 * df
etc…
6) Different plots for Regression Analysis
> data1=data[c(5,6,7,8,9,10,11,12)]
Selects a subset of the variables from the data set
> library(tree)
> model=tree(data$density1~.,data=data1)
The function tree( ) performs a binary recursive partitioning of the data. It splits the data using the terms on the right-hand side of the formula, choosing the most influential variable at each split.
> plot(model)
Plots the tree from the function tree()
> text(model)
Writes the information on the tree plotted by the function plot
>plot(opening,density1)
Plots the data for the simple linear regression specified
> abline(lm(density1~opening))
Add a regression line to the plot
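R can also draw the standard diagnostic plots (residuals vs. fitted values, normal Q-Q, scale-location and residuals vs. leverage) for any model fitted with lm, which helps check the assumptions listed in the Background section:

> par(mfrow=c(2,2))   # arrange the four diagnostic plots in a 2 x 2 grid
> plot(linear)        # diagnostic plots for the simple regression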
References
Conover WJ. 1999. Practical Nonparametric Statistics. Wiley: USA. 592 pp.
Crawley MJ. 2007. The R Book. Wiley: USA. 942 pp.
Freund RJ and Wilson WJ. 2003. Statistical Methods. Academic Press: USA. 673 pp.
Geaghan JP. 2007. EXST 7015 Statistical Techniques II. Course notes. James P. Geaghan: USA. 403 pp.
Hastie TJ and Pregibon D. 1992. Generalized linear models. In: JM Chambers and TJ Hastie (eds). Statistical Models in S. Wadsworth & Brooks/Cole: USA. 624 pp.