Regression and Variable Selection
Maria Sagot (msagot1@lsu.edu)
November 20, 2009; modified November 30, 2009

Background

Regression analysis "is a method used for analyzing a relationship between two or more variables in such a manner that one variable can be predicted or explained by using information on the others" (Freund and Wilson 2003). It is used when both the response and the predictor variables are continuous (Crawley 2007).

The simplest linear model describing the relationship between the response variable and an explanatory variable has the form:

    y = β₀ + β₁x + ε

where y is the response variable, x is a single continuous explanatory variable, β₀ is the intercept (the value of y when x = 0), β₁ is the slope (the change in y corresponding to a unit change in x), and ε is the error term (Freund and Wilson 2003).

Assumptions:
1. The linear model is appropriate
2. The error terms are independent
3. The error terms are (approximately) normally distributed
4. The error terms have a common variance
(Freund and Wilson 2003)

In R, you can perform simple and multiple linear regressions using the functions lm or glm from the package stats. These functions fit linear models such as regressions, single-stratum analyses of variance and analyses of covariance.

Normality

One critical assumption of regression analysis that is often not met, especially in biological studies, is normality. Non-normal data can frequently be normalized by a transformation; the most common are the logarithm and the square root (Crawley 2007). Most authors agree that the best test for detecting deviations from normality is the Shapiro-Wilk test (Conover 1999). In R, this test is performed by the function shapiro.test from the package stats.

Poisson Regression

In biological studies we often encounter data sets that are counts or that contain many zeros. In these cases a Poisson distribution (rather than the normal) is more appropriate, since in a Poisson distribution the variance is equal to the mean. In R, Poisson regressions can be performed with the function glm, specifying family=poisson, from the package stats.

Variable selection

Variable selection in regression identifies the best subset of many candidate variables to include in the model. The problem arises when one wants to model the relationship between the response variable and a subset of the predictor variables but is uncertain which subset to use. It is particularly important when the number of predictors is large and the data set contains many redundant or irrelevant explanatory variables (Geaghan 2007).

Adding or removing a variable changes the estimates for all the other variables in the model, so selection should proceed one variable at a time. The most commonly used variable selection methods are forward, backward and stepwise selection (Geaghan 2007).

Backward selection starts with the full model and establishes a removal criterion (e.g. elimination of non-significant variables). The least significant variable is examined and, if it does not meet the criterion, it is deleted and the model is refit. This is repeated until all variables that fail the criterion have been eliminated from the model (Geaghan 2007).

Forward selection begins by examining all possible one-variable regressions and selecting the best one. This first, most significant variable remains in the model for the whole analysis; further variables are added one at a time in the same way until no remaining variable meets the entry criterion (Geaghan 2007).

Stepwise selection is like forward selection, except that at each step the analysis checks whether the variables already in the model still meet the criterion; any that do not are removed (Geaghan 2007).

In R, variable selection can be performed with the function stepAIC from the package MASS. The function boot.stepAIC, from the package bootStepAIC, performs the same selection with an added bootstrap.
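Both functions rank candidate models by the Akaike information criterion, AIC = −2 log L + 2k, where L is the maximized likelihood of the model and k is the number of fitted parameters; smaller values are better, and the 2k term penalizes each added variable. As a minimal sketch of the idea on simulated data (the objects x1, x2 and y here are hypothetical, not part of the tutorial's data set):

> set.seed(1)
> x1=rnorm(50); x2=rnorm(50)
> y=2+0.5*x1+rnorm(50)          # only x1 truly affects y
> AIC(lm(y~x1), lm(y~x1+x2))    # the x1-only model will usually show the smaller AIC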
And this is how it is performed:

1) Simple Regression

A sample data set (habitat.txt) is available with this tutorial. Before you begin the analyses, save the file to your desktop.

> data=read.table(file="/Users/mariasagot/Desktop/habitat.txt", sep="\t", header=T)

read.table() reads the contents of a data set and creates a data frame. In this function, the argument header=T indicates that the data set contains labeled columns, or headers.

> attach(data)

Attaches the data set called data, so that its columns can be referenced by name.

> data[1:10,]

Displays the first 10 rows of the data set:

density1 density2 pres.abs light grown.cover dbh num.trees height opening inclination sts num.ind
etc…

> names(data)

Displays the column names.

[1] "density1"    "density2"    "pres.abs"    "light"       "grown.cover" "dbh"
[7] "num.trees"   "height"      "opening"     "inclination" "sts"         "num.ind"

> linear=lm(density1~opening)

The function lm is used to perform regression, analysis of variance and analysis of covariance. Models have the form response~terms, where the response is a numeric vector and the terms specify the linear predictors for the response. Terms can be combined in several ways (illustrated in the sketch at the end of this section):
1) first+second: includes all the terms in first together with all the terms in second.
2) first:second: includes the interactions between the terms in first and the terms in second.
3) first*second: the cross of first and second.
4) first+second+first:second: equivalent to first*second.

> anova(linear)

The anova function returns an analysis of variance table for the results of the function lm.

Analysis of Variance Table

Response: density1
          Df  Sum Sq Mean Sq F value    Pr(>F)
opening    1 22.6665 22.6665  5050.9 < 2.2e-16 ***
Residuals 96  0.4308  0.0045
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(linear)

summary returns summaries of the results of various model-fitting functions, invoking a particular method depending on the class of its first argument. For the lm function it displays the coefficient table plus other statistics such as significance and the R² value.

Call:
lm(formula = density1 ~ opening)

Residuals:
      Min        1Q    Median        3Q       Max
-0.114083 -0.053098 -0.006836  0.044171  0.151146

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1600410  0.0151026  -10.60   <2e-16 ***
opening      0.0092248  0.0001298   71.07   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06699 on 96 degrees of freedom
Multiple R-squared: 0.9813,     Adjusted R-squared: 0.9812
F-statistic:  5051 on 1 and 96 DF,  p-value: < 2.2e-16
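To make the formula operators listed above concrete, here is a minimal sketch on two of the habitat variables (the pairing of light and dbh is chosen purely for illustration):

> m1=lm(density1~light+dbh)   # main effects of light and dbh only
> m2=lm(density1~light:dbh)   # the light-by-dbh interaction term only
> m3=lm(density1~light*dbh)   # equivalent to light+dbh+light:dbh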
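The individual pieces of the fit can also be extracted directly from the model object; a minimal sketch using the linear object created above (the opening values passed to predict are hypothetical):

> coef(linear)      # intercept and slope estimates
> confint(linear)   # 95% confidence intervals for both coefficients
> predict(linear, newdata=data.frame(opening=c(20,25)))   # predicted density1 at two opening values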
2) Multiple Regression

>Linear.model=lm(data$density1~data$light+data$grown.cover+data$dbh+data$num.trees+data$height+data$opening+data$inclination+data$sts)

Performs a multiple regression using the function lm. The model has the same form as the simple linear regression (response~terms), where the response is a numeric vector and the terms specify the linear predictors for the response.

> summary(Linear.model)

Call:
lm(formula = data$density1 ~ data$light + data$grown.cover +
    data$dbh + data$num.trees + data$height + data$opening +
    data$inclination + data$sts)
etc……

>gen.lin=glm(data$density1~data$light+data$grown.cover+data$dbh+data$num.trees+data$height+data$opening+data$inclination+data$sts)

The function glm is used to fit generalized linear models. The model has the same form as in the lm function (response~terms), where the response is a numeric vector and the terms specify the linear predictors; with the default gaussian family it fits the same model as lm.

> summary(gen.lin)

Call:
glm(formula = data$density1 ~ data$light + data$grown.cover +
    data$dbh + data$num.trees + data$height + data$opening +
    data$inclination + data$sts)
etc...

3) Normality

> residuals=resid(gen.lin)

This function returns the residuals of the regression.

           1            2            3            4            5            6
0.2435404341 0.1281088722 0.0194094075 0.0424278422 0.1396625054 0.0471302476
           7
0.0252759542
etc..

> shapiro.test(residuals)

The function shapiro.test() performs the Shapiro-Wilk test of normality, here applied to the residuals of the regression. The null hypothesis of the test is that the sample is drawn from a normal distribution.

Shapiro-Wilk normality test

data:  residuals
W = 0.9421, p-value = 0.0003017

Because the p-value is well below 0.05, the null hypothesis is rejected: the residuals deviate from normality, so a transformation of the response (see Normality above) or a non-normal error distribution should be considered.

4) Poisson Regression

>poisson.reg=glm(data$num.ind~data$light+data$grown.cover+data$dbh+data$num.trees+data$height+data$opening+data$inclination+data$sts, family=poisson)

The family argument of glm specifies the error distribution (and link function) used by the model. Some of the available families are: binomial(link = "logit"), gaussian(link = "identity"), Gamma(link = "inverse"), inverse.gaussian(link = "1/mu^2"), poisson(link = "log"), quasi(link = "identity", variance = "constant"), quasibinomial(link = "logit") and quasipoisson(link = "log"). In this particular example we specify a Poisson distribution.

> summary(poisson.reg)

Call:
glm(formula = data$num.ind ~ data$light + data$grown.cover +
    data$dbh + data$num.trees + data$height + data$opening +
    data$inclination + data$sts, family = poisson)
etc…
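One further check worth running on a Poisson fit: the Poisson family assumes that the variance equals the mean. A quick diagnostic is the ratio of the residual deviance to the residual degrees of freedom; values well above 1 indicate overdispersion, in which case the quasipoisson family gives more realistic standard errors. A minimal sketch using the poisson.reg object from above (quasi.reg is a hypothetical name):

> poisson.reg$deviance/poisson.reg$df.residual        # a ratio well above 1 suggests overdispersion
> quasi.reg=update(poisson.reg, family=quasipoisson)  # refit the same model with quasipoisson
> summary(quasi.reg)   # same coefficients; standard errors scaled by the estimated dispersion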
5) Variable Selection

> library(MASS)

Loads the package MASS.

> stepAIC(Linear.model, data, direction="both")

The function stepAIC() performs stepwise model selection using the Akaike information criterion. In the function you specify the object, the data and the direction ("forward", "backward" or "both").

Start:  AIC=-554.92
data$density1 ~ data$light + data$grown.cover + data$dbh + data$num.trees +
    data$height + data$opening + data$inclination + data$sts

                   Df Sum of Sq  RSS     AIC
- data$inclination  1 0.0001612 0.28 -556.87
- data$num.trees    1 0.0003763 0.28 -556.79
- data$sts          1 0.0006276 0.28 -556.71
- data$light        1 0.0026434 0.29 -556.01
<none>                          0.28 -554.92
- data$height       1 0.02      0.30 -551.65
- data$grown.cover  1 0.02      0.31 -548.64
- data$dbh          1 0.05      0.33 -541.18
- data$opening      1 1.63      1.91 -369.71

Step:  AIC=-556.87
etc…

> library(bootStepAIC)
> boot.stepAIC(Linear.model, data, B=10, direction="both")

The function boot.stepAIC() is similar to stepAIC from MASS, but it additionally implements a bootstrap procedure (here B=10 bootstrap samples) to investigate the variability of the selection. Again you specify the object, the data and the direction. boot.stepAIC currently supports models fitted by the functions lm, aov, glm, negbin, polr, survreg and coxph.

Summary of Bootstrapping the 'stepAIC()' procedure for

Call:
lm(formula = data$density1 ~ data$light + data$grown.cover +
    data$dbh + data$num.trees + data$height + data$opening +
    data$inclination + data$sts)

Bootstrap samples: 10
Direction: both
Penalty: 2 * df
etc…

6) Different plots for Regression Analysis

> data1=data[c(5,6,7,8,9,10,11,12)]

Selects a subset of the variables (columns 5 through 12) from the data set.

> library(tree)
> model=tree(data$density1~.,data=data1)

The function tree() performs binary recursive partitioning of the data: it repeatedly splits the data on the most influential of the right-hand-side variables.

> plot(model)

Plots the tree produced by the function tree().

> text(model)

Writes the split labels on the tree plotted by the function plot.

>plot(opening,density1)

Plots the scatterplot for the simple linear regression fitted above.

> abline(lm(density1~opening))

Adds the fitted regression line to the plot.

References

Conover WJ. 1999. Practical Nonparametric Statistics. Wiley: USA. 592 pp.

Crawley MJ. 2007. The R Book. Wiley: USA. 942 pp.

Freund RJ and Wilson WJ. 2003. Statistical Methods. Academic Press: USA. 673 pp.

Geaghan JP. 2007. EXST7015 Statistical Techniques II. Course notes. James P. Geaghan: USA. 403 pp.

Hastie TJ and Pregibon D. 1992. Generalized linear models. In: JM Chambers and TJ Hastie (eds). Statistical Models in S. Wadsworth & Brooks/Cole: USA. 624 pp.