In-Class Exercise: Linear Regression in R

You’ll need two files to do this exercise: linearRegression.r (the R script file) and mtcars.csv (the data file¹). Both of
those files can be found on the course site. The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.
Download both files and save them to the folder where you keep your R files.
Part 1: Look at the Data File
1) Start RStudio.
2) Open the mtcars.csv data file. You’ll see something like this:
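(The rows below are the first few lines of the standard mtcars data set; the file from the course site should look the same.)

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4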
This is the raw data for our analysis. It is a comma-separated values (CSV) file, which just means that each data value is separated by a comma.
Now look at the contents of the file. The first line contains the names of the fields (think of them like columns
in a spreadsheet). You can see the first field is called model, the second field is called mpg, the third field is
called cyl, and so on. The remaining lines of the file contain the data for each car model.
Here is the full list of the variables:

Variable Name   Variable Description
mpg             Miles/(US) gallon (i.e., fuel efficiency)
cyl             Number of cylinders
disp            Displacement (cu. in.)
hp              Gross horsepower
drat            Rear axle ratio
wt              Weight (1,000 lbs)
qsec            1/4 mile time
vs              Engine (0 = V-shaped, 1 = straight)
am              Transmission (0 = automatic, 1 = manual)
gear            Number of forward gears
carb            Number of carburetors
We will use this data set to predict the miles per gallon (mpg) based on any combination of the remaining variables
(i.e., cyl, wt, etc.).
mpg is a typical outcome variable for regression analysis because it is a continuous value.
¹ Adapted from the built-in R mtcars data set.
3) Close the mtcars.csv file by selecting File/Close. If it asks you to save the file, choose “Don’t Save”.
Part 2: Explore the linearRegression.r Script
1) Open the linearRegression.r file. This contains the R script that performs the linear regression analysis.
2) Look at lines 8 through 14. These contain the parameters for the script. Here’s a rundown:
Variable Name in R   Value                  Description
INPUT_FILENAME       mtcars.csv             The data is contained in mtcars.csv
OUTPUT_FILENAME      RegressionOutput.txt   The text output of the analysis
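For reference, those lines in the script presumably look something like the following (a sketch, not the script’s exact code; check lines 8 through 14 in your copy):

INPUT_FILENAME  <- "mtcars.csv"            # data file to read
OUTPUT_FILENAME <- "RegressionOutput.txt"  # file where the analysis output is written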
3) One piece of good news about this analysis: we do not need to install any additional packages, because lm() is part of base R.
4) Now let’s look at the simple linear regression model with only one predictor. Scroll down to lines 31 through
37:
fit = lm(mpg ~ wt, data = mtcars)
You can see a few things at work:
•	The lm() function is used to fit linear regression models.
•	The model formula has the form outcome ~ predictor1 + predictor2 + etc. (a simple linear regression has just one predictor).
•	mpg is the outcome variable you’re trying to predict (i.e., fuel efficiency).
•	Variable(s) to the right of the ~ are used to predict the outcome. Here we have only one predictor, wt (the car’s weight).
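If you want to try this step directly in the console, here is a minimal sketch (it assumes mtcars.csv is in your working directory):

mtcars <- read.csv("mtcars.csv")    # read the CSV into a data frame
fit <- lm(mpg ~ wt, data = mtcars)  # regress mpg on weight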
5) Now let’s look at a multiple linear regression model with more than one predictor. Scroll down to lines 43
through 47:
mfit = lm(mpg ~ wt + disp + cyl, data = mtcars)
The only change compared to the previous model is that now we have more than one predictor (i.e., wt, disp, and cyl). Specifically, we are now looking at the effect of not just the weight, but also the number of cylinders and the volume (displacement) of the engine, on fuel efficiency.
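If you are curious, you can also pull the raw coefficient estimates out of either fitted model with coef(), a base R function:

coef(mfit)  # returns a named vector: (Intercept), wt, disp, cyl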
Part 3: Execute the linearRegression.r Script
1) Select Session/Set Working Directory/To Source File Location to change the working directory to the location
of your R script.
2) Select Code/Run Region/Run All. The script should only take a few seconds to run.
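Alternatively, once the working directory is set, you can run the whole script from the console:

source("linearRegression.r")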
Part 4: Interpreting the Output
We fit a model to our data. That's great! But the important question is, is it any good?
There are lots of ways to evaluate model fit. R consolidates some of the most popular ones into the summary() function: you can invoke summary() on any model you’ve fit with lm() and get some metrics indicating the quality of the fit.
Now we can look at the details of this fit with the summary function:
> summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
This output contains a lot of information. Let's look at a few parts of it.
(1) Briefly, it first shows the Call: the way the function was called, i.e., miles per gallon (y) explained by weight (x) using the mtcars data. The regression equation we would like to fit is ŷ = b0 + b1·wt, where ŷ denotes the predicted mpg.
(2) The next part summarizes the residuals: that's how much the model got each of those predictions wrong, i.e., how different the predictions were from the actual results.
(3) This table, the most interesting part, shows the coefficients: the actual predictors and the significance of each.
•	First, we have our estimate of the intercept (b0): the estimated value of b0 is 37.2851. Hypothetically, if we had a car with a weight of 0, the predicted miles per gallon based on our linear model would be 37.2851.
•	Then we can see the effect of the weight variable on miles per gallon (b1): the estimated value of b1 is -5.3445. This value, also called the coefficient or slope of the weight, shows that there is a negative relationship: increasing the weight decreases the miles per gallon. In particular, increasing the weight by 1,000 pounds decreases the efficiency by about 5.3 miles per gallon.
•	So the table gives us the fitted regression line: ŷ = 37.2851 - 5.3445·wt. You can then use this equation to predict the gas mileage of a car that has a weight of, say, 4,500 pounds (see the sketch after this list).
•	The second column is the standard error: we won't examine it in depth here, but in short, it represents the amount of uncertainty in our estimate of the slope.
•	The third column is the t-value of the coefficient estimate, a mathematically relevant value used to compute the last column, the p-value, which describes whether this relationship could be due to chance alone. Smaller p-values (typically p < 0.05) indicate that the relationship is statistically significant.
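To make that prediction concrete, here is a minimal sketch (recall from the variable table that wt is measured in 1,000-lb units, so 4,500 pounds is wt = 4.5):

predict(fit, newdata = data.frame(wt = 4.5))
# Equivalent to computing by hand: 37.2851 - 5.3445 * 4.5, which is about 13.2 mpg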
(4) Multiple R-squared (R²): used to evaluate the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the amount of variability in what you're predicting that is explained by the model. In this instance, about 75% of the variation in mpg can be explained by the car’s weight.
(5) Adjusted R-squared: similar to multiple R-squared, but it applies a small penalty for each additional variable you include.
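If you want just these two numbers, you can extract them from the summary object directly:

summary(fit)$r.squared      # multiple R-squared, 0.7528
summary(fit)$adj.r.squared  # adjusted R-squared, 0.7446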
Here is the full list of output values returned by the summary() function:
1. Residuals: The residuals are the difference between the actual values of the outcome variable and the values predicted by your regression, y − ŷ. For most regressions you want your residuals to look like a normal distribution when plotted. If the residuals are normally distributed, the mean of the difference between the predictions and the actual values is close to 0 (good).

2. Significance codes: The stars are shorthand for significance levels, with the number of asterisks displayed according to the p-value computed: *** for high significance and * for low significance. In this case, *** indicates that it is very unlikely that no relationship exists between a car's weight and its mpg.

3. Coefficient Estimates: The coefficient estimates are the values calculated by the regression. With a regression model ŷ = b0 + b1x1 + b2x2 + …, the b0, b1, b2 are the coefficients that we would like to estimate. These values measure the marginal effect of each predictor variable on the outcome variable.

4. Standard Error of the Coefficient Estimate (Std. Error): A measure of the variability in the estimate for the coefficient. Lower means better, but this number is relative to the value of the coefficient. As a rule of thumb, you'd like this value to be at least an order of magnitude less than the coefficient estimate.

5. t value of the Coefficient Estimate: A score that measures whether or not the coefficient for this variable is meaningful for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.

6. Pr(>|t|) (i.e., the variable's p-value): Another score that measures whether or not the coefficient for this variable is meaningful for the model. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation. In our example, the intercept's p-value of < 2e-16 means that the odds it is meaningless are less than about 1 in 5,000,000,000,000,000.

7. Significance Legend: The more punctuation there is next to your variables, the better. Blank = bad, dots = pretty good, stars = good, more stars = very good.

8. Residual Std Error / Degrees of Freedom: The residual standard error is just the standard deviation of your residuals. You'd like this number to be proportional to the quantiles of the residuals in #1; for a normal distribution, the 1st and 3rd quartiles should fall within roughly ±1.5 times this standard error. The degrees of freedom is the difference between the number of observations in your training sample and the number of variables used in your model (the intercept counts as a variable).

9. R-squared: A metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the amount of variability in what you're predicting that is explained by the model. In this instance, about 75% of the variation in mpg is explained by the car's weight. WARNING: While a high R-squared indicates good correlation, correlation does not always imply causation.

10. F-statistic & resulting p-value: Performs an F-test on the model. This takes the parameters of our model (in our case we only have one) and compares it to a model with fewer parameters. In theory, the model with more parameters should fit better. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the added parameters probably give no significant boost). If the model with more parameters is better, you will have a lower p-value. The DF, or degrees of freedom, pertains to how many predictor variables are in the model; in our case there is one predictor, so there is one degree of freedom.
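To check the residual behavior described in row 1, you could plot the residuals of your fit; a quick sketch using base R graphics:

hist(resid(fit))                        # should look roughly bell-shaped
qqnorm(resid(fit)); qqline(resid(fit))  # points should fall close to the line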
Try it:
Look at the results returned by summary(mfit) and try to interpret the output.
> summary(mfit)

Call:
lm(formula = mpg ~ wt + disp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4035 -1.4028 -0.4955  1.3387  6.0722

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.107678   2.842426  14.462 1.62e-14 ***
wt          -3.635677   1.040138  -3.495  0.00160 **
disp         0.007473   0.011845   0.631  0.53322
cyl         -1.784944   0.607110  -2.940  0.00651 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.595 on 28 degrees of freedom
Multiple R-squared:  0.8326,    Adjusted R-squared:  0.8147
F-statistic: 46.42 on 3 and 28 DF,  p-value: 5.399e-11
Questions:
(1) Which predictor variables are statistically significant in predicting mpg?
(2) How does this model's fit compare to that of the simple linear regression model?
Answers:
(1) wt and cyl are statistically significant (p < 0.05); disp is not (p = 0.53322).
(2) The multiple R-squared is 0.8326, which is larger than 0.7528, indicating a better fit to the data. Because this model has more predictors, the adjusted R-squared is the fairer comparison, and it also improves (0.8147 vs. 0.7446).
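If you want a more formal comparison than eyeballing R-squared values, one option is the partial F-test described in row 10 of the table above; since the simple model is nested inside the multiple one, you can run:

anova(fit, mfit)  # a small p-value means the extra predictors significantly improve the fit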