Linear Regression with R and R-commander

Linear regression is a method for modeling the relationship between two variables: one independent (x) and one dependent (y). The history and mathematical models behind regression analysis are complex (e.g., http://en.wikipedia.org/wiki/Linear_regression), but the way it is used is straightforward. Scientists are typically interested in getting the equation of the line that describes the best least-squares fit between two datasets. They may also be interested in the coefficient of determination (r²), which describes the proportion of variability in the data that is accounted for by the linear model. Let's look at an example:

Example 1: R comes with a dataset called "cars" that contains the speed (mph) and stopping distance (ft) of cars in the 1920s (source: Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.). Let's load the data, look at its structure, and get some basic summary statistics.

data(cars) #load dataset
str(cars) #display its structure
numSummary(cars) #summary stats (assuming Rcmdr is installed)

Notice the data are stored in two numeric variables called speed and dist. There are 50 observations (rows of data), and the summary command gives us the mean, standard deviation, min, max, median, and the 25% and 75% quantiles of the data. Let's attach this data to R's search path so that we can call the variables by name, and create an x,y plot of the data. (Note: by attaching the dataset, we can type the names of variables in the dataset to access them. E.g., typing "speed" returns "object not found" until we attach cars.)

attach(cars)
plot(speed,dist)

Notice that the data are plotted in a cloud such that cars with low speeds also had short stopping distances. As speed increases, stopping distance also increases, but it's not a 1:1 increase. For example, there is a range of braking distances at a speed of 20 mph. Let's request a vector containing all the distance values for speed=20.

dist[speed==20]

Wow... they range from 32 to 64 ft. That seems like a lot of variation.

Now let's fit a line to the data. Remember, in linear regression we are fitting a line of the form Y=mX+b to the data (m is the slope and b is the y-intercept). We'll fit the line so that the sum of squared errors (the square of the difference between the measured and predicted y for a given x) is minimized. This is typically what is meant by doing a regression analysis.

abline(lm(dist ~ speed))

Let's see what the slope and y-intercept are:

lm(dist ~ speed)

What's the correlation coefficient (r)?

cor(speed,dist)

What is the coefficient of determination (r²)? (Hint... use the up arrow and add ^2.)

cor(speed,dist)^2

This suggests that 65% of the braking distance of cars can be explained solely as a linear function of speed. We'll come back to this dataset, but first let's try another example.

detach(cars) #removes cars variables from the search path

(Note: by removing cars from the search path, we can use the variables speed and dist for something else. If you type them now, R will say that the object is not found.)

Example 2: The big-bang model of the origin of the universe says that it expands uniformly, and locally according to Hubble's Law. Hubble's Law states that there is a linear relationship between recessional velocity and distance such that v = Ho * D, where v is recessional velocity, typically expressed in km/s, and D is a galaxy's distance in megaparsecs (1 parsec = 3.08568025 × 10^16 meters).
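As an aside before we load the data: because Hubble's Law has no intercept (v = 0 when D = 0), the least-squares slope for a line through the origin has a simple closed form, Ho = Σ(D·v)/Σ(D²). Here is a minimal base-R sketch with made-up toy numbers (the variables d.toy and v.toy are invented purely for illustration; the -1 in the formula drops the intercept, which we explain below):

d.toy=c(1, 2, 4, 8) #toy distances (Mpc); invented values
v.toy=c(80, 140, 290, 570) #toy velocities (km/s); invented values
sum(d.toy*v.toy)/sum(d.toy^2) #closed-form slope for a line through the origin
coef(lm(v.toy ~ d.toy - 1)) #the same slope from lm() with no intercept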
The data are stored in a dataset called hubble in the gamair package (more information at: http://cran.r-project.org/doc/packages/gamair.pdf). Let's open it and plot the data.

library(gamair) #open the library (make sure gamair is installed)
data(hubble) #open the dataset
str(hubble) #examine the structure of the dataset
attach(hubble) #attach the dataset to the search path
plot(x,y) #create a scatterplot of the data
abline(lm(y~x)) #fit a line to the data with y as a function of x

Let's change the labels from x and y to something more descriptive. (Hint... copy and paste into R.)

plot(x,y,xlab="Distance (Mpc)",ylab="Velocity (km/s)",main="Hubble Data")
abline(lm(y~x))

Again, notice the syntax. We used the 'plot' command as before, but we added some options to customize the look of our graph. Type ?plot and see if you can figure out what options we used.

Now, let's look at the slope and intercept.

lm(y~x)

Oops. We've fit a linear model with an intercept that is not zero. It's probably more appropriate to assume the galaxy started with a velocity of zero at the big bang (distance=0). To set the intercept at zero, we just add a -1 to the model.

lm(y~x-1) #no intercept

Better.

Ho=coef(lm(y~x-1)) #Set variable Ho

Now let's calculate the age of the universe. A megaparsec is approximately 3.09 × 10^19 km, so the reciprocal of the slope divided by 3.09e19 should be the age in seconds (e.g., v = Ho * D + 0 fits the form y=mx+b; with v in km/s and D in km, Ho has units of 1/seconds).

age.seconds=1/(Ho/3.09e19)

Now let's convert seconds into years.

age.years=age.seconds/(60*60*24*365.25)
age.years

The answer is approximately 13 billion years.

Now let's determine the confidence intervals for the slope. The CI for the slope of a line is b0 ± t*sb0, where b0 is the slope, t is the t-statistic, and sb0 is the standard error of the slope. R has a function called qt that will give us the t-statistic. Type

?qt

You'll notice that the qt function has 5 options, but only the first two need to be set. The first is a vector containing the probabilities that we want t-values for, and the second is the number of degrees of freedom.

tail95=c(0.025, 0.975)
df=length(x)-1
Ho.tdist=qt(tail95,df) # two tails, so 0.025 and 0.975 for a 95% CI

Now we just need the standard error of the regression. Let's look at the entire set of summary statistics for the regression. We'll create a variable for the regression statistics and then summarize them.

lin.regress=lm(y~x-1) #creates a variable to hold the regression stats
summary(lin.regress) #summarize the stats

This gives us a lot of information, including some information on the residuals (we'll come to that later), coefficients, r-squared, the F-statistic, and the p-value (the probability that the slope is equal to zero). For now, we just want the standard error (sb0) of the coefficient for our model.

se.coef=coef(summary(lin.regress))[2]

Notice the [2] at the end of the statement we typed above. If we type coef(summary(lin.regress)), we see it returns four values. The 2 in brackets tells R that we just want the 2nd value.

Using this information, the 95% CI for the slope is:

Ho.CI=Ho+Ho.tdist*se.coef

And the age of the universe is

sort(1/(Ho.CI*60*60*24*365.25/3.09e19))

between 11.5 and 14.3 billion years.

We've played with R, and you can start to see what a powerful mathematical tool it can be. That said, we'll focus on basic statistics from here on out.
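As an aside, base R's confint() function performs this same slope-CI calculation in one step; here is a minimal sketch (assuming the hubble data are still attached and lin.regress is the model created above):

confint(lin.regress, level=0.95) #95% CI for the slope; should match Ho.CI
sort(1/(confint(lin.regress)*60*60*24*365.25/3.09e19)) #the same age bounds in years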
Important Note: Assumptions

Like most other statistical tests, regression analyses require that a set of assumptions about the data be met. Some of the assumptions include:
1) the data are measured without error, so that observations are equally reliable and have equal influence
2) the errors are normally distributed and independent of each other
3) the mean of the errors is zero and they have a constant variance

So, how do we test these assumptions? One way is to look at the residuals. Let's go back to the cars dataset.

detach(hubble)
attach(cars)
plot(cars)
abline(lm(dist~speed))
summary(lm(dist~speed))

We've seen this before, and it looks like we have a decent fit. But now that we've been thinking about slopes and intercepts, we realize that it makes no sense to fit a model with a non-zero intercept. After all, a car moving at zero miles per hour has no stopping distance, but a car moving slowly still has some stopping distance. Let's get rid of the intercept (remember... up arrow to quickly edit commands, or copy and paste).

plot(dist~speed-1)
abline(lm(dist~speed-1))
summary(lm(dist~speed-1))

Better. Now the model makes more sense, and it explains 89% of the variance. Now let's look at the residuals. There will be four plots, and we'll look at them each in turn:

plot(lm(dist~speed-1)) # & hit Return

The first plot shows the model residuals plotted against the fitted values (Illustration 1: Model Residuals). The residuals should be evenly scattered above and below the zero line. A trend in the mean residuals suggests a violation of the assumption of independent response variables. A trend in the variability of the residuals suggests that the variance is related to the mean, violating the constant variance assumption. These data appear to exhibit a slight trend, with increasing variance from left to right.

Hit return. The next plot is the normal Q-Q (quantile-quantile) plot (Illustration 2: Normal Q-Q). In this plot the standardized residuals are plotted against the quantiles of a standard normal distribution. If the residuals are normally distributed, the data should plot along the line. In this plot, the data do not appear to be normally distributed. See http://en.wikipedia.org/wiki/Q-Q_plot for more information.

Hit return. The third plot is the scale-location plot (Illustration 3: Scale-Location plot), in which the raw residuals are standardized by their estimated standard deviations. The scale-location plot is used to detect whether the spread of the residuals is constant over the range of fitted values. Again, we see a trend in the data such that high values exhibit more variance.

Hit return. The fourth plot shows the standardized residuals plotted against the leverage of each datum (i.e., data point), along with Cook's distance (Illustration 4: Cook's distance plot). The plot shows which data points are exerting the most influence.
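If you want the influence values themselves rather than reading them off the plot, base R will report Cook's distance directly. A quick sketch (assuming cars is still attached; the 4/n cutoff in the last line is just one common rule of thumb, not part of this exercise):

mod=lm(dist~speed-1)
cd=cooks.distance(mod) #one Cook's distance value per observation
head(sort(cd, decreasing=TRUE)) #the most influential points first
which(cd > 4/length(cd)) #flag points above a common 4/n rule-of-thumb cutoff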
If the Cook's distance line encompasses a data point, it suggests that the analysis may be very sensitive to that point, and it may be prudent to repeat the analysis with those data excluded. We could exclude point 40, which appears to be an outlier, but we won't for now. We can put the four plots in a single graphics window by typing:

par(mfrow=c(2,2))
plot(lm(dist~speed-1))

Because we see an increasing variance (see: http://en.wikipedia.org/wiki/Heteroscedasticity), we apply a log transformation to the data.

ln.speed=log(speed) #note: the log command fits a base-e (natural) log
ln.dist=log(dist)
plot(lm(ln.dist~ln.speed-1))

Notice that you see less of a trend in all of the plots, and the Q-Q plot now appears to fit much better. Let's look at the data summary.

summary(lm(ln.dist~ln.speed-1))

Notice that the model now explains almost 99% of the variance in the data. How do we interpret the data? ln(y) = 1.33466 * ln(x), so y = x^1.33466, which means that stopping distance increases as a power function of speed. If a semi-log plot (x vs. log(y)) is approximately a straight line with slope m, then the function is approximately exponential. If a log-log plot (log(x) vs. log(y)) is approximately a straight line with slope m, then the function is approximately a power function.

Example 3: Regression analysis in R-Commander

So far we've been using R for all our analyses, but using R assumes that we have learned the syntax. While the syntax isn't that difficult, it can be intimidating to the casual user. Therefore, someone has written a GUI interface for R that will allow you to perform many of the basic statistical functions of R without knowing any of the syntax. R-Commander is installed on all machines in the CAL. If the interface is not open, type the following commands to open it. If you are working at home, first make sure the package is installed (install.packages("Rcmdr", dependencies=TRUE)).

library(Rcmdr)
Commander()

Once Rcmdr is open, we will open the cars dataset. Go to the top of the Commander window and open the pull-down menu called Data (blue Arial text means "menu option"), then:

Data
Data in packages
Read data set from an attached package

Select datasets by double clicking on it. In the Data set window, select "cars" by double clicking on it. Note: it is very far down the list, which is sorted by case and then letter. Look into the script window. Notice that we could also have typed:

data(cars, package="datasets")

into the script window to do the same thing. Rcmdr is mostly just typing the commands for you, which is a great way to learn what they are.

Now let's learn more about the cars dataset by creating a scatter plot. To create a scatter plot, open the drop-down menu:

Graphs
Scatterplot

Notice there are many options listed here. These are just some of the options available in R. We'll choose speed as our x-variable and dist as our y-variable. Leave boxplots, least-squares line, and smooth line checked. Hit OK. Notice the syntax in the Output Window. This graph has both the best-fit line (dashed) and a smooth line. You can type ?scatterplot in R or Rcmdr to find out more about scatterplots. (When you type a command in the script window, don't hit return to enter the command; instead, hit the Submit button when you are done typing the command.)
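For reference, the scatterplot() function that Rcmdr calls for you comes from the car package, so you can also type a minimal version yourself. A sketch (assuming car is installed, as it is wherever Rcmdr runs; the exact option names for the boxplots, fit line, and smoother vary between car versions, so check ?scatterplot if yours differs):

library(car) #scatterplot() lives in the car package
scatterplot(dist ~ speed, data=cars) #defaults add a least-squares line and a smoother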
By looking at the help menu, we see that 'smooth' fits a lowess nonparametric regression line that describes the shape of the data (see http://en.wikipedia.org/wiki/Lowess). As we can see here, the data appear to fit an exponential curve. Let's look at the diagnostic plots. If we open:

Models
Graphs

we see that the options are greyed out. That's because we first need to create the model. Open:

Statistics
Fit Model
Linear Model

The formula should be dist ~ speed-1. Now let's look at the diagnostic plots.

Models
Graphs
Basic Diagnostic Graphs

As we saw before, they suggest increasing variance. Let's transform the data using a power function (log~log). We can do this two ways: 1) we can compute new variables for log(speed) and log(dist) using

Data
Manage Variables in active dataset
Compute new variable

or 2) we can just change the form of the model. We'll do the second.

Model Formula = log(dist) ~ log(speed)-1

Let's look at the diagnostic plots again.

Models
Graphs
Basic Diagnostic Graphs

Again, it looks like the power model fits the data better, but how can we be sure? One way is to choose the model that does the best job of predicting the data. This can be done by selecting the model with the smallest Akaike Information Criterion (AIC). While the AIC function isn't built into R-Commander, it is built into R. When we created our models, the dialog box prompted us for the name of the model, using LinearModel.1 as the default name. The next model we created would have had the name LinearModel.2 by default. Let's use the AIC function to determine the AIC statistic for the two different models we tested. We'll enter these commands directly into the script window:

AIC(LinearModel.1,LinearModel.2) # use your model names if different

We want to use the model with the lowest AIC score, and we see that it is clearly the power model (log~log). Try another model and see how it fits. Examples might be:

log(dist)~speed-1
dist ~ I(speed^2) #note: I() is needed so ^2 is treated as arithmetic inside a formula

Now let's get the 95% CI on the slope for our log~log model. (If you can't remember which model is which, you can choose different models (button next to View data set) and summarize them.) Go to the drop-down menu:

Models
Confidence intervals

Finally, let's do a test to determine whether the correlation is significant. Go to the drop-down menu:

Statistics
Summaries
Correlation Test

Oops... this function won't let us choose our model, so we'll have to create two extra variables named ln.dist and ln.speed.

Data
Manage Variables in active dataset
Compute new variable

Enter "ln.dist" as the variable name and "log(dist)" as the expression. Do the same for speed ("ln.speed" with the expression "log(speed)"). Now go back to the Correlation Test, and make sure that Pearson product-moment is selected and that Correlation > 0 is also selected. Hit OK. [Note: We are performing a one-tailed test that the correlation is positive. If we didn't know that the relationship would be positive (i.e., if we thought that stopping distance could decrease with speed), then we'd use the two-sided test.]

The p-value tells us that, based on the sample, the probability that the correlation is less than or equal to zero is very small (~1.11e-15).

Exercise 1: Use R-Commander to analyze the Hubble data. Provide the equation of the line, r², the Pearson correlation p-value, 95% confidence intervals, a scatter plot, and basic diagnostic plots of the data.
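If you want to check your Rcmdr output for Exercise 1 against the command line, here is a minimal sketch of the same pieces, built from commands used earlier in this handout (one possible approach, not the only correct answer; the name mod is just a convenient label):

library(gamair)
data(hubble)
mod=lm(y~x-1, data=hubble) #Hubble model through the origin
summary(mod) #slope and r-squared
confint(mod) #95% CI on the slope
cor.test(hubble$x, hubble$y, alternative="greater") #one-sided Pearson correlation test
plot(y~x, data=hubble); abline(mod) #scatter plot with the fitted line
par(mfrow=c(2,2)); plot(mod) #basic diagnostic plots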