Linear Regression with R and R-commander
Linear regression is a method for modeling the relationship between two variables: one independent (x)
and one dependent (y). The history and mathematical models behind regression analysis are fairly
complex (e.g., http://en.wikipedia.org/wiki/Linear_regression), but the way it is used is straightforward. Scientists are typically interested in getting the equation of the line that describes the best
least-squares fit between two datasets. They may also be interested in the coefficient of determination
(r²), which describes the proportion of variability in the data that is accounted for by the linear model.
Let's look at an example:
Example 1: R comes with a dataset called “cars” that contains the speed (mph) and stopping distance (ft)
of cars in the 1920s (source: Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.). Let's load
the data, look at its structure, and get some basic summary statistics.
data(cars) #load dataset
str(cars) #display its structure
numSummary(cars) #summary stats (requires the Rcmdr package to be installed and loaded)
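If Rcmdr isn't available, base R can produce similar statistics (a minimal alternative using only built-in functions):
summary(cars) #min, quartiles, median, and mean for each column
sd(cars$speed) #standard deviation of speed
sd(cars$dist) #standard deviation of dist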
Notice the data is stored in two numeric variables called speed and dist. There are 50 observations
(rows of data), and the summary command gives us the mean, standard deviation, min, max, median,
and 25% and 75% quantiles of the data. Let's attach this data to R's search path so that we can call the
variables by name and create an x,y plot of the data. (Note: by attaching the dataset, we can type the
names of variables in the dataset to access them. E.g., typing "speed" returns object not found until we
attach cars.)
attach(cars)
plot(speed,dist)
[Figure: scatterplot of dist versus speed for the cars data]

Notice that the data is plotted in a cloud such that cars with low speeds also had short stopping
distances. As speed increases, stopping distance also increases, but it's not a 1:1 increase. For example,
there is a range of braking distances at a speed of 20 mph. Let's request a vector containing all the
distance values for speed=20.

dist[speed==20]

Wow... they range from 32 to 64 ft. That seems like a lot of variation.

Now let's fit a line to the data. We'll fit the line so that the sum of squared errors (the square of the
difference between measured and predicted y for a given x) is minimized. This is typically what is
meant by doing a regression analysis.

abline(lm(dist ~ speed))

Remember, in linear regression, we are fitting a line of the form Y=mX+b to the data (m is the slope
and b is the y-intercept). Let's see what the slope and y-intercept are:

lm(dist ~ speed)
What's the correlation coefficient (r)?
cor(speed,dist)
What is the coefficient of determination (r²)? (Hint: use the up arrow and add ^2.)
cor(speed,dist)^2
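The same value also appears as "Multiple R-squared" in the model summary; a quick cross-check:
summary(lm(dist ~ speed))$r.squared #identical to cor(speed,dist)^2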
This suggests that 65% of the variation in the braking distance of cars can be explained solely as a linear
function of speed. We'll come back to this dataset, but first let's try another example.
detach(cars) #remove cars variables from the search path
(Note: by removing cars from the search path, we can use the variable names speed and dist for something
else. If you type them now, R will say that the object is not found.)
Example 2: The big-bang model of the origin of the universe says that it expands uniformly and
locally according to Hubble's Law. Hubble's Law states that there is a linear relationship between
recessional velocity and distance such that v = Ho * D, where v is recessional velocity, typically
expressed in km/s, and D is a galaxy's distance in megaparsecs (1 parsec = 3.08568025 × 10^16 meters).
The data are stored in a dataset called hubble in the gamair package (more information at: http://cran.r-project.org/doc/packages/gamair.pdf). Let's open it and plot the data.
library(gamair) #open the library (make sure gamair is installed)
data(hubble) #open the dataset
str(hubble) #examine the structure of the dataset
attach(hubble) #attach the dataset to the search path
plot(x,y) #create a scatterplot of the data
abline(lm(y~x)) #fit a line to the data with y as a function of x
Let's change the labels from x and y to something more descriptive. (Hint: copy and paste into R.)
plot(x,y,xlab="Distance (Mpc)",ylab="Velocity (kms)",main="Hubble Data")
abline(lm(y~x))
[Figure: "Hubble Data" scatterplot of Velocity (kms) versus Distance (Mpc), with fitted line]

Better. Again, notice the syntax. We used the 'plot' command as before, but we added some options to
customize the look of our graph. Type
?plot
and see if you can figure out what options we used.
Now, let's look at the slope and intercept.
lm(y~x)
Oops. We've fit a linear model with an intercept that is not zero. It's probably more appropriate to
assume the galaxy started with a velocity of zero at the big bang (distance=0). To set the intercept to
zero, we just add a -1 to the model.
lm(y~x-1) #no intercept
Ho=coef(lm(y~x-1)) #store the fitted slope (Hubble's constant) in Ho
Now let's calculate the age of the universe. A megaparsec is approximately 3.09 × 10^19 km, so the
reciprocal of the slope divided by 3.09e19 should be the age in seconds (e.g., v = Ho * D + 0 fits the form
y = mx + b with v in km/s and D in Mpc, so dividing Ho by 3.09e19 km/Mpc leaves units of 1/seconds).
age.seconds=1/(Ho/3.09e19) #Ho/3.09e19 has units of 1/seconds; its reciprocal is the age in seconds
Now let's convert seconds into years.
age.years=age.seconds/(60*60*24*365.25)
age.years
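To express that in billions of years (a one-line convenience):
age.years/1e9 #age in billions of years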
The answer is approximately 13 billion years. Now let's determine the confidence interval for the
slope. The CI for the slope of a line is b0 ± t*sb0, where b0 is the slope estimate, t is the t-statistic, and
sb0 is the standard error of the slope.
R has a function called qt that will give us the t-statistic. Type ?qt. You'll notice that the qt function
has five arguments, but only the first two need to be set. The first is a vector containing the probabilities
that we want t-values for, and the second is the number of degrees of freedom.
tail95=c(0.025, 0.975) #two tails, so 0.025 and 0.975 for a 95% CI
df=length(x)-1 #degrees of freedom
Ho.tdist=qt(tail95,df) #t-values at those probabilities
Now we just need the standard error of the slope. Let's look at the entire set of summary statistics
for the regression. Let's create a variable for the regression model and then summarize it.
lin.regress=lm(y~x-1) #creates a variable to hold the regression model
summary(lin.regress) #summarize the model
Now this gives us a lot of information, including some information on the residuals (we'll come to that
later), coefficients, r-squared, F-statistic, and p-value (the probability that the slope is equal to zero). For
now, we just want the standard error of the coefficient (sb0) for our model.
se.coef=coef(summary(lin.regress))[2]
Notice the [2] at the end of the statement we typed above. If we type coef(summary(lin.regress)), we
see it returns four values. The 2 in brackets tells R that we just want the 2nd value. Using this
information, the 95% CI for the slope is:
Ho.CI=Ho+Ho.tdist*se.coef
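R can also compute this interval in one step with the built-in confint() function (a quick cross-check; the confidence level defaults to 0.95):
confint(lin.regress) #95% CI for the slope, straight from the model object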
And the age of the universe is
sort(1/(Ho.CI*60*60*24*365.25/3.09e19))
between 11.5 and 14.3 billion years. We've played with R and you can start to see what a powerful
mathematical tool it can be. That said, we'll focus on basic statistics from here on out.
Important Note: Assumptions
Like most other statistical tests, regression analysis requires that a set of assumptions about the data be
met. Some of the assumptions include:
1) the data are measured without error, so that observations are equally reliable and have equal influence
2) the errors are normally distributed and independent of each other
3) the mean of the errors is zero and they have a constant variance
So, how do we test these assumptions? One way is to look at the residuals. Let's go back to the cars
dataset.
detach(hubble)
attach(cars)
plot(cars)
abline(lm(dist~speed))
summary(lm(dist~speed))
We've seen this before, and it looks like we have a decent fit. But now that we've been thinking about
slopes and intercepts, we realize that it makes no sense to fit a model with a non-zero intercept. After
all, a car moving at zero miles per hour has no stopping distance, but a car moving slowly still has some
stopping distance. Let's get rid of the intercept (remember: use the up arrow to quickly edit commands, or
copy and paste).
plot(dist~speed-1)
abline(lm(dist~speed-1))
summary(lm(dist~speed-1))
Better. Now the model makes more sense and it explains 89% of the variance. Now let's look at the
residuals. There will be four plots, and we'll look at each in turn:

plot(lm(dist~speed-1)) # & hit Return

[Illustration 1: Model Residuals (Residuals vs Fitted)]

The first plot we get shows the model residuals plotted against the fitted values. The residuals should
be evenly scattered above and below the zero line. A trend in the mean of the residuals suggests a
violation of the assumption of independent response variables. A trend in the variability of the
residuals suggests that the variance is related to the mean, violating the constant variance assumption.
These data appear to exhibit a slight trend with increasing variance from left to right.

Hit return.

[Illustration 2: Normal Q-Q]

The next plot is the normal Q-Q (quantile-quantile) plot. In this plot the standardized residuals are
plotted against the quantiles of a standard normal distribution. If the residuals are normally distributed,
the data should plot along the line. In this plot, the data do not appear to be normally distributed. See
http://en.wikipedia.org/wiki/Q-Q_plot for more information.
[Illustration 3: Scale-Location plot]
The third plot is the scale-location plot, in which the raw residuals are standardized by their estimated
standard deviations. The scale-location plot is used to detect whether the spread of the residuals is
constant over the range of fitted values. Again, we see a trend in the data such that high values exhibit
more variance.
Hit return.

[Illustration 4: Residuals vs Leverage (with Cook's distance contours)]
The fourth plot shows the standardized residuals plotted against the leverage of each datum (i.e., data
point). The plot shows which data points are exerting the most influence. If the Cook's distance line
encompasses a data point, it suggests that the analysis may be very sensitive to that point, and it may be
prudent to repeat the analysis with that point excluded. We could exclude point 40, which appears to be
an outlier, but we won't for now.
We can put the four plots in a single plot window by typing:
par(mfrow=c(2,2))
plot(lm(dist~speed-1))
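When you're done with the 2×2 layout, you can reset the graphics window to a single panel:
par(mfrow=c(1,1)) #back to one plot per window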
Because we see an increasing variance (see: http://en.wikipedia.org/wiki/Heteroscedasticity), we apply a
log transformation to the data.
ln.speed=log(speed) #note that log() computes the natural (base-e) logarithm
ln.dist=log(dist)
plot(lm(ln.dist~ln.speed-1))
Notice that you see less of a trend in all but the Q-Q plot, which now appears to fit much better. Let's look
at the data summary.
summary(lm(ln.dist~ln.speed-1))
Notice that the model now explains almost 99% of the variance in the data. How do we interpret the
result? ln(y) = 1.33466 * ln(x), so y = x^1.33466, which means that stopping distance increases as a power
function of speed.
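As a quick illustration of using the fitted power function (a minimal sketch; pwr.model is just an illustrative name for the log-log model fit above), we can back-transform a prediction from log units back to feet:
pwr.model=lm(ln.dist~ln.speed-1) #refit and store the log-log model
exp(predict(pwr.model, newdata=data.frame(ln.speed=log(20)))) #predicted stopping distance (ft) at 20 mph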
If the semi-log plot (x vs. log(y)) is approximately a straight line with slope m, then the function is
approximately exponential. If the log-log plot of log(x) vs. log(y) is approximately a straight line with
slope m, then the function is approximately a power function.
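To make that concrete (a minimal sketch reusing the transformed cars variables created above; the model names are illustrative), we can fit both forms and see which one straightens the data more:
exp.model=lm(ln.dist~speed) #semi-log fit: a straight line here suggests an exponential
pow.model=lm(ln.dist~ln.speed) #log-log fit: a straight line here suggests a power function
summary(exp.model)$r.squared #compare: the larger r-squared is the better straight-line fit
summary(pow.model)$r.squared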
Example 3: Regression analysis in R-Commander
So far we've been using R for all our analyses, but using R assumes that we have learned the syntax.
While the syntax isn't that difficult, it can be intimidating to the casual user. Therefore, someone has
written a GUI interface for R that will allow you to perform many of the basic statistical functions of R
without knowing any of the syntax.
R-Commander is installed on all machines in the CAL. If the interface is not open, type the following
commands to open it. If you are working at home, first make sure the package is installed
(install.packages("Rcmdr", dependencies=TRUE)).
library(Rcmdr)
Commander()
Once Rcmdr is open, we will open the cars dataset. Go to the top of the Commander window and open
the pull-down menu called
Data
(Blue Arial text means "menu option")
Data in Packages
Read data set from an attached package
Select "datasets" by double-clicking on it. In the Data set window, select "cars" by double-clicking on it.
Note: it is very far down the list, which is sorted by case and then letter. Look in the script window.
Notice that we could also have typed:
data(cars, package="datasets")
into the script window to do the same thing. Rcmdr is mostly just typing the commands for you, which is
a great way to learn what they are.

[Figure: scatterplot of dist versus speed]
Now let's learn more about the cars dataset by creating a scatter plot. To create a scatter plot, open the
drop-down menu
Graphs
Scatterplot
Notice there are many options listed here; these are just some of the options available in R. We'll
choose speed as our x-variable and dist as our y-variable. Leave boxplots, least-squares line, and
smooth line checked. Hit OK.
Notice the syntax in the Output Window. This graph has both the best-fit line (dashed) and a
smooth line. You can type
?scatterplot
in R or Rcmdr to find out more about scatterplots. (When you type a command in the script window,
don't hit return to enter the command; instead, hit the Submit button when you are done typing the
command.) By looking at the help menu, we see that 'smooth' fits a lowess nonparametric regression
line that describes the shape of the data (see http://en.wikipedia.org/wiki/Lowess). As we can see here,
the data appear to fit an exponential curve.
Let's look at the diagnostic plots. If we open:
Models
Graphs
we see that the options are greyed out. That's because we first need to create the model. Open:
Statistics
Fit Model
Linear Model
The formula should be dist ~ speed - 1. Now let's look at the diagnostic plots.
Models
Graphs
Basic Diagnostic Graphs
As we saw before, they suggest increasing variance. Let's transform the data using a power function
(log-log). We can do this in two ways: 1) we can compute new variables for log(speed) and log(dist)
using
Data
Manage Variables in active dataset
Compute new variable
or 2) we can just change the form of the model. We'll do the second.
Model Formula = log(dist) ~ log(speed) - 1
Let's look at the diagnostic plots again.
Models
Graphs
Basic Diagnostic Graphs
Again, it looks like the power model fits the data better, but how can we be sure? One way is to use the
model which does the best job of predicting the data. This can be done by selecting the model with the
smallest Akaike Information Criterion (AIC). While the AIC function isn't built into R-Commander, it is
built into R.
When we created our models, the dialog box prompted us for the name of the model using
LinearModel.1 as the default name. The next model we created would have had the name
LinearModel.2 by default. Let's use the AIC function to determine the AIC statistic for the two
different models we tested. We'll enter these commands directly into the script window:
AIC(LinearModel.1, LinearModel.2) # use your model names if different
We want to use the model with the lowest AIC score, and we see that it is clearly the power model
(log-log). Try another model and see how it fits. Examples might be:
log(dist)~speed-1
dist ~ I(speed^2)
(Note: inside a model formula the ^ operator has a special meaning, so a squared term must be wrapped
in I() to be interpreted as arithmetic.)
Now let's get the 95% CI on the slope for our log-log model. (If you can't remember which model is
which, you can choose different models (button next to View data set) and summarize them.) Once
you've found your model, go to the drop-down menu
Models
Confidence intervals
Finally, let's do a test to determine whether the correlation is significant. Go to the drop-down menu
Statistics
Summaries
Correlation Test
Oops... this function won't let us choose our model, so we'll have to create two extra variables named
ln.dist and ln.speed.
Data
Manage Variables in active dataset
Compute new variable
Enter: "ln.speed" as the variable name and "log (dist)" (with the space between log and speed) as the
expression. Do the same for speed. Now go back to the coorelation test and make sure that Pearson
product-moment is selected and Correlation>0 is also selected. Hit OK. [Note: We are
performing a one tail test that the correlation is positive. If we didn't know that the relationship would
be positive (i.e. we thought that stopping distance could decrease with speed) then we'd use the two
sided test].
The p-value tells us that, based on the sample, the probability that the correlation is less than or equal to
zero is very small (~1.11e-15).
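For reference, the same test can be run from the script window with base R's cor.test() function (a sketch; it assumes the new ln.speed and ln.dist columns were added to the cars dataset, and alternative="greater" requests the one-tailed test that the correlation is positive):
cor.test(cars$ln.speed, cars$ln.dist, alternative="greater", method="pearson")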
Exercise 1: Use R-Commander to analyze the Hubble data. Provide the equation of the line, r²,
Pearson correlation p-value, 95% confidence intervals, a scatter plot, and basic diagnostic plots of the
data.