In-Class Exercise: Linear Regression in R

You'll need two files to do this exercise: linearRegression.r (the R script file) and mtcars.csv (the data file[1]). Both of those files can be found on the course site. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Download both files and save them to the folder where you keep your R files.

[1] Adapted from a data set built into R.

Part 1: Look at the Data File

1) Start RStudio.

2) Open the mtcars.csv data file. This is the raw data for our analysis. It is a comma-separated values (CSV) file, which just means that each data value is separated by a comma. Now look at the contents of the file. The first line contains the names of the fields (think of them as columns in a spreadsheet): the first field is called model, the second field is called mpg, the third field is called cyl, and so on. The remaining lines of the file contain the data for each car model.

Here is the full list of the variables:

Variable Name   Variable Description
mpg             Miles per (US) gallon (i.e., fuel efficiency)
cyl             Number of cylinders
disp            Displacement (cu. in.)
hp              Gross horsepower
drat            Rear axle ratio
wt              Weight (in 1000 lbs)
qsec            1/4 mile time
vs              Engine shape (V/S)
am              Transmission (0 = automatic, 1 = manual)
gear            Number of forward gears
carb            Number of carburetors

We will use this data set to predict the miles per gallon (mpg) based on any combination of the remaining variables (e.g., cyl, wt). mpg is a typical outcome variable for regression analysis because it is a continuous value.

3) Close the mtcars.csv file by selecting File/Close. If it asks you to save the file, choose "Don't Save".

Part 2: Explore the linearRegression.r Script

1) Open the linearRegression.r file. This contains the R script that performs the linear regression analysis.

2) Look at lines 8 through 14. These contain the parameters for the script. Here's a rundown:

Variable Name in R   Value                  Description
INPUT_FILENAME       mtcars.csv             The input data file for the analysis
OUTPUT_FILENAME      RegressionOutput.txt   The text output of the analysis

3) One piece of good news about this analysis: we do not need to install any additional packages.

4) Now let's look at the simple linear regression model with only one predictor. Scroll down to lines 31 through 37:

fit = lm(mpg ~ wt, data = mtcars)

You can see a few things at work:
- The lm() function is used to fit linear regression models.
- The general formula for a linear regression model is outcome ~ predictor1 + predictor2 + ....
- mpg is the outcome you're trying to predict (i.e., fuel efficiency).
- The variable(s) to the right of the ~ are used to predict the outcome. Here we have only one predictor: wt (weight).

5) Now let's look at the multiple linear regression model with more than one predictor. Scroll down to lines 43 through 47:

mfit = lm(mpg ~ wt + disp + cyl, data = mtcars)

The only change compared to the previous model is that we now have more than one predictor (wt, disp, and cyl). Specifically, we are now looking at the effect on fuel efficiency of not just the weight, but also the number of cylinders and the volume, or displacement, of the engine.

Part 3: Execute the linearRegression.r Script

1) Select Session/Set Working Directory/To Source File Location to change the working directory to the location of your R script.

2) Select Code/Run Region/Run All. It could take a few seconds to run. Be patient!
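If anything goes wrong, or you just want to see the whole analysis in miniature, the snippet below is an illustrative sketch of what the script boils down to. It is not the full linearRegression.r (which also writes its results to RegressionOutput.txt and may differ in its details):

# Read the raw data (assumes mtcars.csv is in the working directory)
mtcars <- read.csv("mtcars.csv")

fit  <- lm(mpg ~ wt, data = mtcars)               # simple regression (Part 2, step 4)
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)  # multiple regression (Part 2, step 5)

summary(fit)    # the output interpreted in Part 4
summary(mfit)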
Part 4: Interpreting the Output

We fit a model to our data. That's great! But the important question is: is it any good? There are lots of ways to evaluate model fit, and R consolidates some of the most popular ones into the summary() function. You can invoke summary() on any model you've fit with lm() and get metrics indicating the quality of the fit.

Now we can look at the details of this fit with the summary function:

> summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

This output contains a lot of information. Let's look at a few parts of it.

(1) Briefly, it first shows the call, that is, the way the function was invoked: miles per gallon (y) explained by weight (x) using the mtcars data. The regression equation we would like to fit is mpg-hat = b0 + b1 * wt, where mpg-hat denotes the predicted mpg.

(2) The next part summarizes the residuals: how much the model got each prediction wrong, i.e., how different the predictions were from the actual values.

(3) The coefficients table, the most interesting part, shows the actual predictors and the significance of each. First, we have our estimate of the intercept (b0): the estimated value is 37.2851. Hypothetically, if we had a car with a weight of 0, the predicted miles per gallon based on our linear model would be 37.2851. Then we can see the effect of the weight variable on miles per gallon (b1): the estimated value is -5.3445. This value, also called the coefficient or the slope of weight, shows that there is a negative relationship: increasing the weight decreases the miles per gallon. In particular, increasing the weight by 1000 pounds decreases efficiency by about 5.3 miles per gallon. So the table gives us the fitted regression line: mpg-hat = 37.2851 - 5.3445 * wt. You can then use this equation to predict the gas mileage of a car that weighs, say, 4500 pounds (see the example below).

The second column is the standard error: we won't examine it here, but in short, it represents the amount of uncertainty in our estimate of the coefficient. The third column is the t-value of the coefficient estimate, an intermediate quantity used to compute the last column, the p-value, which describes whether this relationship could be due to chance alone. Smaller p-values (typically p < 0.05) indicate that the relationship is statistically significant.

(4) Multiple R-squared (R²): used to evaluate the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the amount of variability in what you're predicting that is explained by the model. In this instance, about 75% of the variation in mpg can be explained by the car's weight.

(5) Adjusted R-squared: similar to multiple R-squared, but it applies a small penalty for each additional variable you include.
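For example, to predict the mileage of that 4500-pound car, recall that wt is measured in thousands of pounds, so we plug in wt = 4.5. You can do the arithmetic by hand or let R's built-in predict() function do it. A quick illustrative snippet (not part of linearRegression.r):

fit <- lm(mpg ~ wt, data = mtcars)             # same model the script fits
predict(fit, newdata = data.frame(wt = 4.5))   # predicted mpg for wt = 4.5
# By hand: 37.2851 - 5.3445 * 4.5 is about 13.2 mpg

Either way, the model predicts that a 4500-pound car gets roughly 13 miles per gallon.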
Here is the full list of output values returned by the summary() function:

1) Residuals: The differences between the actual values of the outcome variable and the values predicted by your regression, y - y-hat. For most regressions you want your residuals to look like a normal distribution when plotted; normally distributed residuals indicate that the mean of the difference between your predictions and the actual values is close to 0 (good).

2) Significance codes: The stars are shorthand for significance levels, with the number of asterisks displayed according to the computed p-value: *** for high significance and * for low significance. In this case, *** indicates that it is very unlikely that no relationship exists between a car's weight and its mpg.

3) Coefficient Estimates: The values calculated by the regression. With a regression model y-hat = b0 + b1*x1 + b2*x2 + ..., the values b0, b1, b2, ... are the coefficients we would like to estimate. They measure the marginal importance of each predictor variable for the outcome variable.

4) Standard Error of the Coefficient Estimate (Std. Error): A measure of the variability in the estimate for the coefficient. Lower means better, but this number is relative to the value of the coefficient. As a rule of thumb, you'd like this value to be at least an order of magnitude less than the coefficient estimate.

5) t value of the Coefficient Estimate: A score that measures whether or not the coefficient for this variable is meaningful for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.

6) Pr(>|t|) (i.e., the variable's p-value): Another score that measures whether or not the coefficient for this variable is meaningful for the model. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation: in our example, < 2e-16 means that the odds that wt is meaningless are about 1 in 5,000,000,000,000,000.

7) Significance Legend: The more punctuation there is next to your variables, the better. Blank = bad, dots = pretty good, stars = good, more stars = very good.

8) Residual Std Error / Degrees of Freedom: The residual standard error is just the standard deviation of your residuals. You'd like this number to be in proportion to the quantiles of the residuals in #1 (for a normal distribution, the 1st and 3rd quartiles sit at roughly +/- 0.67 standard errors). The degrees of freedom is the difference between the number of observations in your training sample and the number of variables used in your model (the intercept counts as a variable): here, 32 - 2 = 30.

9) R-squared: A metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the amount of variability in what you're predicting that is explained by the model. In this instance, ~75% of the variation in mpg is explained by the car's weight. WARNING: While a high R-squared indicates good correlation, correlation does not always imply causation.

10) F-statistic & resulting p-value: Performs an F-test on the model. This takes the parameters of our model (in our case we have only one predictor) and compares them to a model that has fewer parameters. In theory, the model with more parameters should fit better; if the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (probably not a significant improvement). If the model with more parameters is better, you will have a lower p-value. The DF, or degrees of freedom, pertains to how many predictor variables are in the model: in our case there is one variable, so the F-statistic is reported on 1 (and 30) DF.
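Several of these checks are easy to run yourself in base R; for instance, you can plot the residuals from #1 to see whether they look roughly normal. A short illustrative sketch (not part of linearRegression.r):

fit <- lm(mpg ~ wt, data = mtcars)
hist(resid(fit), main = "Residuals of mpg ~ wt", xlab = "Residual")  # roughly bell-shaped?
mean(resid(fit))   # essentially 0 for any least-squares fit
sd(resid(fit))     # compare with the residual standard error in #8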
Try it: Look at the results returned by summary(mfit) and try to interpret the output.

> summary(mfit)

Call:
lm(formula = mpg ~ wt + disp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4035 -1.4028 -0.4955  1.3387  6.0722

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.107678   2.842426  14.462 1.62e-14 ***
wt          -3.635677   1.040138  -3.495  0.00160 **
disp         0.007473   0.011845   0.631  0.53322
cyl         -1.784944   0.607110  -2.940  0.00651 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.595 on 28 degrees of freedom
Multiple R-squared:  0.8326,    Adjusted R-squared:  0.8147
F-statistic: 46.42 on 3 and 28 DF,  p-value: 5.399e-11

Questions:
(1) Which predictor variables are statistically significant in predicting mpg?
(2) How does this model's fit compare to the simple linear regression model?

Answers:
(1) wt (p = 0.0016) and cyl (p = 0.0065). disp is not statistically significant (p = 0.533).
(2) The multiple R-squared is 0.8326, which is larger than the simple model's 0.7528, so this model explains more of the variation in mpg. The adjusted R-squared, which penalizes the extra variables, also improves (0.8147 vs. 0.7446), so the gain is not just from adding more predictors.
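If you'd rather pull these numbers out programmatically than read them off the printout, summary() returns an object whose components you can access by name. An illustrative snippet (not part of linearRegression.r; it refits both models so it can run on its own):

fit  <- lm(mpg ~ wt, data = mtcars)               # simple model
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)  # multiple model

summary(fit)$r.squared        # 0.7528
summary(mfit)$r.squared       # 0.8326
summary(fit)$adj.r.squared    # 0.7446
summary(mfit)$adj.r.squared   # 0.8147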