5 Linear regression

Learning objectives and outcomes: Once you have completed learning unit 5, you should be able to do the following:

1. Calculate and interpret linear correlation.
2. Determine and interpret the equation of the linear regression line.
3. Demonstrate knowledge of least squares regression and apply it to datasets.
4. Apply visual and numerical regression diagnostics.
5. Apply inference for the true slope and prediction intervals for the regression model.

5.1 Linear correlation

In this section you will learn how to estimate the linear relationship between two variables. Pearson's correlation coefficient, r, is the relevant measure of association (see learning unit 2, STA1501). It measures, on a continuous scale, the strength of the linear relationship between two variables, and can take any value between −1 and +1.

Example 5.1.1. Figure 5.1 shows eight of the 30 records of employees' Salary and Years of experience. First load the dataset salaries.xls, which is available on myModules under Additional Resources, into R and View it.

> # load the readxl library and import the .xls file into R
> library(readxl)
> empl_salaries <- read_excel("salaries.xls")
> View(empl_salaries)

Figure 5.1: Eight records of Years of experience and related Salary of employees

The following R code calculates the correlation coefficient between the Years of experience and the corresponding Salary from the attached dataset:

> # attach the empl_salaries dataset
> attach(empl_salaries)
> # calculate the linear correlation
> cor(Years_of_experience, Salary)
[1] 0.9152571

The resulting output of 0.92 indicates a strong positive linear correlation between the variables Salary and Years of experience. Figure 5.2 illustrates the scatter plot of Salary against Years of experience.

> options(scipen = 999)
> plot(Years_of_experience, Salary,
+      main = 'Salary vs Years of experience',
+      xlab = 'Years of experience', ylab = 'Salary',
+      col = "red")

Figure 5.2: Scatter plot of Salary against Years of experience

Figure 5.2 indicates an increase in Salary as the number of Years of experience increases, which means there is a positive linear association.

Complete Activity 5.1 in the Exercise Manual before you proceed to the next section.

5.2 Linear regression line equation

The equation of a simple linear regression line defines the relationship between two variables. It determines the value of the dependent or response variable, y, for a predetermined value of the independent variable, x. In essence, it is used to predict the value of y given a value of x. Refer to learning unit 2 in STA1501 and learning unit 4 in STA1502 to refresh your memory and for more background information.

To draw a linear regression line in R, we need the following two functions, which form part of the built-in stats package that comes with the installation of R:

1. The abline() function is used to add one or more straight lines to the current plot. It has the following format:

   abline(a = NULL, b = NULL, h = NULL, v = NULL, ...)

   Parameters: a, b specify the intercept and the slope of the line; h specifies the y-value(s) for horizontal line(s); v specifies the x-value(s) for vertical line(s). The function draws the corresponding straight line(s) on the plot.
2. The lm() function, which stands for linear model, is used to fit a simple regression model. It has the following format:

   lm(formula, data)

   Parameters: formula describes the relation between x and y; data is the data frame to which the formula will be applied. The function returns the estimated parameters of the relationship between x and y.

Applying the abline() function to the scatter plot of Salary against Years of experience in Figure 5.2 yields the following:

> # apply the abline() function
> abline(lm(Salary ~ Years_of_experience, data = empl_salaries),
+        col = 'blue', lwd = 2)

Figure 5.3: The abline function applied to a linear regression model of Salary against Years of experience

Figure 5.3 illustrates the best-fit line (linear regression line) for the scatter plot of Salary against Years of experience. The following model is returned by the lm() function, and its output is displayed by calling the summary() function on the model (the line numbers of the listing are referred to below):

 1  > salary.lm <- lm(Salary ~ ., data = empl_salaries)
 2  > summary(salary.lm)
 3
 4  Call:
 5  lm(formula = Salary ~ ., data = empl_salaries)
 6
 7  Residuals:
 8        Min       1Q   Median       3Q      Max
 9   -28921.5  -7380.9   -920.6   8728.1  22411.3
10
11  Coefficients:
12                       Estimate  Std. Error  t value
13  (Intercept)           18832.8      3697.3    5.094
14  Years_of_experience    6386.3       531.2   12.021
15                                Pr(>|t|)
16  (Intercept)           0.00002146820188 ***
17  Years_of_experience   0.00000000000143 ***
18  ---
19  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
20
21  Residual standard error: 11240 on 28 degrees of freedom
22  Multiple R-squared:  0.8377, Adjusted R-squared:  0.8319
23  F-statistic: 144.5 on 1 and 28 DF,  p-value: 0.000000000001428

The intercept of 18 832.8 at line number 13 is the expected value of an employee's Salary when accounting for zero Years of experience. Line number 14 shows the estimated slope: the employee Salary rises by R6 386.30 for every one-year increase in Years of experience. The standard error estimates the uncertainty in each coefficient estimate. The other crucial quantity is the t-value, a measure of how many standard errors the estimate lies from zero. It must be sufficiently far from zero (in absolute value) for us to reject the null hypothesis of the model and claim that there is a relationship between Years of experience and Salary. The summary output confirms that the t-values are certainly far from zero, indicating that there is a significant relationship between Salary and Years of experience.

Usually, a p-value of 5% or less is the commonly used significance level (or alpha level) in hypothesis testing. The three asterisks (lines number 16 to 17) represent highly significant regression coefficients. In this example, the p-values for the intercept and the Years_of_experience slope are very small, so we can reject the null hypothesis that there is no effect. This implies that there exists a relationship between Salary and Years of experience.

Lastly, R² shows how much of the variation in the dependent variable is explained by the linear relationship with the independent variable in the model. That means about 83.8% of the variation in the variable Salary is explained by the linear relationship with the variable Years of experience.
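Several of the quantities in the summary can be reproduced directly from the output above. A minimal sketch (assuming empl_salaries is still attached and salary.lm has been fitted as above):

> # the slope t value is the estimate divided by its standard error
> 6386.3 / 531.2                       # about 12.02, matching the t value on line 14
> coef(summary(salary.lm))             # the full coefficient table as a matrix
> confint(salary.lm, level = 0.95)     # 95% confidence intervals for the coefficients
> # for simple regression, Multiple R-squared equals the squared correlation
> cor(Years_of_experience, Salary)^2   # about 0.8377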
At this point, you should be able to load any dataset in R using the appropriate library and View the dataset.

Example 5.2.1. Consider an advertising dataset that tracks Sales as a function of spending on TV, Radio and Newspaper advertising. Each of these four variables is measured in thousands of rands. To get a general overview of the dataset structure, complete the following steps after loading the file advertising.csv, which is available under Additional Resources on the module site, and naming it ad_sales.

> ad_sales <- read.csv("advertising.csv")
> View(ad_sales)
> dim(ad_sales)
[1] 200   4
> class(ad_sales)
[1] "data.frame"
> str(ad_sales)
'data.frame':  200 obs. of 4 variables:
 $ TV       : num 230.1 44.5 17.2 151.5 180.8 ...
 $ Radio    : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
 $ Newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
 $ Sales    : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
>

The output reveals that the dataset has four variables with 200 observations in total: Sales and the three media channel budgets for TV, Radio and Newspaper. The data frame variables are all numeric, which informs the kinds of analysis that can be done.

To check that the data are reasonable and within expectations, the summary() function can be used. This function generates descriptive statistics for a given dataset or object. When you apply summary() to a data frame or a vector, it provides a concise summary of the data, including measures of central tendency (e.g. mean and median), measures of spread (e.g. minimum and maximum, quartiles), and other relevant statistics depending on the data type.

> # high-level overview of the ad_sales dataset
> summary(ad_sales)
       TV             Radio          Newspaper          Sales
 Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60
 1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38
 Median :149.75   Median :22.900   Median : 25.75   Median :12.90
 Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02
 3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40
 Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00
>

The output contains a five-number summary and the mean for each of the four variables. For instance, the minimum and maximum budgets for TV advertisements are 0.7 and 296.40, respectively, which translate to R700 and R296 400, while the mean budget for TV advertisements is 147.04, which is equal to R147 040.

Let us further inspect the data to see whether there exists any association between the advertising Sales and the advertisement budgets for TV, Radio and Newspaper by determining the linear correlation.

> # DataExplorer: designed for fast exploratory data analysis
> library(DataExplorer)
> plot_correlation(ad_sales)

Figure 5.4: Correlation plot of TV, Radio, Newspaper and Sales

Figure 5.4 reveals a significant positive linear correlation between Sales and the TV and Radio spending budgets, as well as a weak positive linear association between Sales and the Newspaper budget. The correlation plot results are further supported by diagrams of each of the three budgets plotted against Sales. A positive linear correlation coefficient means that as one variable increases, the other variable also tends to increase; here, an increase in the Radio and TV budgets corresponds to an increase in Sales.
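If you prefer numbers to colours, the same correlations can be printed as a matrix. A quick sketch, assuming ad_sales is loaded as above:

> # numeric counterpart of the correlation plot in Figure 5.4
> round(cor(ad_sales), 2)

The entries in the Sales row (or column) correspond to the colours shown in Figure 5.4.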
> attach(ad_sales)
> par(mfrow = c(1, 3))
> plot(TV, Sales, xlab = "TV", ylab = "Sales", col = "cyan")
> plot(Radio, Sales, xlab = "Radio", ylab = "Sales", col = "blue")
> plot(Newspaper, Sales, xlab = "Newspaper", ylab = "Sales", col = "darkred")

Figure 5.5: Scatter plots of Sales against TV, Radio and Newspaper budgets

Figure 5.5 indicates positive linear correlations between the Radio and TV budgets and Sales, as opposed to almost no pattern between Newspaper and Sales. The plots can include the simple least squares lines that represent the line of best fit for predicting Sales based on the TV, Radio and Newspaper variables. These lines of best fit depict a straightforward linear model that can be used to predict Sales from the respective media budget variables.

> par(mfrow = c(1, 3))
> TV_lm        <- lm(Sales ~ TV, data = ad_sales)
> radio_lm     <- lm(Sales ~ Radio, data = ad_sales)
> newspaper_lm <- lm(Sales ~ Newspaper, data = ad_sales)
> plot(TV, Sales, xlab = "TV", ylab = "Sales", col = "cyan",
+      main = "Sales against TV", frame = FALSE)
> abline(TV_lm, col = "red", lwd = 2)
> plot(Radio, Sales, xlab = "Radio", ylab = "Sales", col = "blue",
+      main = "Sales against Radio", frame = FALSE)
> abline(radio_lm, col = "red", lwd = 2)
> plot(Newspaper, Sales, xlab = "Newspaper", ylab = "Sales", col = "darkred",
+      main = "Sales against Newspaper", frame = FALSE)
> abline(newspaper_lm, col = "red", lwd = 2)

Figure 5.6: The least squares regression equation fitted to each scatter plot of Sales against TV, Radio and Newspaper

Figure 5.6 shows only a weak positive linear correlation in the plot of Sales against Newspaper: the fitted line increases slightly, but the relationship between these two variables is not strong.

5.3 Least squares regression

The goal of a linear regression line is to estimate or predict the values of the response variable, y, given values of the explanatory variable, x. Usually the estimates are not exact; each differs slightly from the actual value. An acceptable criterion for determining which line is the "best" fit is the line with the least total inaccuracy among all straight lines. This section illustrates this criterion and shows how the best straight line, in terms of the criterion, can be identified.

Let us illustrate how the estimated values and regression residuals may be obtained in R. The given data points are x_1, ..., x_n and y_1, ..., y_n. The estimated responses, or the ordinary least squares (OLS) fitted values, are defined by

    \hat{y}_i = \hat{\alpha} + \hat{\beta}\, x_i   for all i = 1, ..., n,    (5.3.1)

where \hat{\alpha} and \hat{\beta} denote the estimated intercept and slope of the regression line, respectively. The residuals are the differences between the actual responses and the fitted values, that is,

    e_i = y_i - \hat{y}_i,    (5.3.2)

such that the residual sum of squares (RSS) of the errors is defined as

    RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \geq 0.    (5.3.3)
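For reference, minimising the RSS in (5.3.3) over the intercept and slope (the least squares principle is formalised below) gives the familiar closed-form estimates

    \hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x},

which is what the lm() function computes numerically in the demonstration that follows.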
Using the dataset salaries from Section 5.1, let us assign the Salary variable to Y and the Years of experience to X, and then compute the OLS fitted values and residuals according to Equations (5.3.1) and (5.3.2). View the R demonstrations.

> attach(empl_salaries)
> # define variables
> Y <- cbind(Salary)
> X <- cbind(Years_of_experience)
> # ordinary least squares
> OLS <- lm(Y ~ X)
> # predicted or estimated values
> y.hat <- fitted(OLS)
> # residuals of the OLS
> err <- resid(OLS)
> # column bind of all 4 variables
> table <- cbind(Y, X, y.hat, err)
> table
   Salary Years_of_experience    y.hat         err
1   19143                 1.0 25219.16  -6076.1636
2   26205                 1.3 27135.06   -930.0615
3   17531                 1.5 28412.33 -10881.3268
4   23325                 2.0 31605.49  -8280.4899
5   19691                 2.2 32882.76 -13191.7552
6   36442                 2.9 37353.18   -911.1836
7   39950                 3.0 37991.82   1958.1838
8   34245                 3.2 39269.08  -5024.0815
9   44245                 3.2 39269.08   4975.9185
10  36989                 3.7 42462.24  -5473.2447
11  43018                 3.9 43739.51   -721.5099
12  35594                 4.0 44378.14  -8784.1426
13  36757                 4.0 44378.14  -7621.1426
14  36881                 4.1 45016.78  -8135.7752
15  40911                 4.5 47571.31  -6660.3057

To illustrate how Equations (5.3.1) and (5.3.2) are used, let us manually confirm the values of \hat{y}_1 and e_1 shown in the first row of the result:

    \hat{y}_1 = 18832.8 + 6386.3\, x_1 = 18832.8 + 6386.3\,(1) = 25219.1 (approximately 25 219.16 in the table, which uses the unrounded coefficients),
    e_1 = y_1 - \hat{y}_1 = 19143 - 25219.16 = -6076.16.

The least squares principle can be summarised as follows:

1. Select \alpha and \beta such that the RSS in (5.3.3) is minimised.
2. The problem

       \min_{\alpha, \beta} \text{RSS} = \min_{\alpha, \beta} \sum_{i=1}^{n} e_i^2    (5.3.4)

   is called the least squares problem.

Example 5.3.1. Examine the connection between advertising budgets and Sales using the ad_sales dataset from Section 5.2, Example 5.2.1. Minimising the sum of squared errors gives the least squares fit for the regression of Sales onto TV, as illustrated in the following steps:

> par(mfrow = c(1, 1))
> TV_lm <- lm(Sales ~ TV, data = ad_sales)
> plot(TV, Sales, xlab = "TV", ylab = "Sales", col = "blue",
+      pch = 19, main = "Actual vs Predicted")
> abline(TV_lm, col = "red", lwd = 2)
> segments(TV, Sales, TV, predict(TV_lm), col = "grey")

Figure 5.7: The sum of squared errors between the actual and predicted Sales in relation to the TV budget

The line of best fit (the red line in Figure 5.7) minimises the sum of squares of all the grey line segments, which represent the errors.

For further investigation, consider the problem of predicting the value of a response Y based on multiple explanatory variables. Let us first review the following definition.

Definition 5.3.2. Multiple linear regression models are defined by the equation

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon,

where

- \beta_0 = intercept, the value of Y when X_i = 0 for all i = 1, ..., n.
- \beta_i, for i = 1, ..., n, are the slopes, defined as the change in Y for a one-unit change in X_i, keeping the other X's constant.
- X_i, for i = 1, ..., n, are the independent variables (or explanatory variables).
- Y = dependent variable (or response variable).
- \varepsilon = random error that explains the deviation of the points (X, Y) about the line.

The only difference between Definition 5.3.2 and Equation (5.3.1) for simple linear regression is the presence of several independent variables. The method of least squares, which applies simple linear regression principles to n dimensions, is used to estimate the regression coefficients.
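In matrix notation, with \mathbf{y} the vector of responses and X the matrix whose columns are a column of ones and the explanatory variables, the same principle gives

    \hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} \mathbf{y},

a standard result stated here only for reference; the lm() function used below solves this system numerically.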
The lm() function can also be used to build a multiple regression model of Sales based on the three advertising budgets of the TV, Radio and Newspaper media channel variables in R as follows:

> ad_sales.lm <- lm(Sales ~ ., data = ad_sales)
> summary(ad_sales.lm)

Call:
lm(formula = Sales ~ ., data = ad_sales)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422 <0.0000000000000002 ***
TV           0.045765   0.001395  32.809 <0.0000000000000002 ***
Radio        0.188530   0.008611  21.893 <0.0000000000000002 ***
Newspaper   -0.001037   0.005871  -0.177                 0.86
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 0.00000000000000022

The output illustrates a multiple regression model for estimating Sales based on the three advertising budgets. Interpreting the F-statistic and the corresponding p-value, located at the bottom of the model summary, is the first step in evaluating the multiple regression analysis. As can be observed, the p-value of the F-statistic is less than 0.00000000000000022, which indicates a very significant relationship between at least one of the explanatory variables (TV, Radio and Newspaper) and the response variable (Sales).

Another way is to look closely at the table of coefficients produced in the following step, which displays the estimated beta regression coefficients and the corresponding t-statistics, to determine which explanatory variables are significantly different from zero:

> # a table of beta coefficients: b_0, b_1, b_2 and b_3
> summary(ad_sales.lm)$coefficient
                 Estimate  Std. Error    t value
(Intercept)   2.938889369 0.311908236  9.4222884
TV            0.045764645 0.001394897 32.8086244
Radio         0.188530017 0.008611234 21.8934961
Newspaper    -0.001037493 0.005871010 -0.1767146

It follows that \beta_0 = 2.938889369, \beta_1 = 0.045764645, \beta_2 = 0.188530017 and \beta_3 = -0.001037493, which can be used to express the following multiple regression model with the beta coefficients rounded to three decimal places:

    \hat{y} = 2.94 + 0.046\,(TV) + 0.189\,(Radio) - 0.001\,(Newspaper),

where \hat{y} is the expected or predicted value of Sales. Observe that changes in the Newspaper advertising budget are not significantly correlated with changes in Sales, whereas changes in the TV and Radio advertising budgets are significantly correlated with changes in Sales. Keeping the TV and Radio budgets constant, an additional R1 000 spent on Newspapers results in a 1 000 × 0.001 = 1 rand decrease in Sales. On the other hand, keeping the Radio and Newspaper budgets constant, investing an additional R1 000 in the TV budget results in an increase in Sales of 1 000 × 0.046 = 46 rand. The interpretation for the Radio budget follows the same argument as for the TV budget.
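To see the fitted equation in use, a small illustrative prediction can be made with the predict() function. The budget values below (TV = 100, Radio = 20 and Newspaper = 30, i.e. R100 000, R20 000 and R30 000) are chosen purely for illustration:

> # hand calculation with the rounded coefficients:
> # 2.94 + 0.046*100 + 0.189*20 - 0.001*30 = 11.29, i.e. Sales of about R11 290
> predict(ad_sales.lm, newdata = data.frame(TV = 100, Radio = 20, Newspaper = 30))

predict() uses the unrounded coefficients, so its answer will differ slightly from the hand calculation.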
The Newspaper variable can be eliminated from the model, as illustrated in the following step, because it is not statistically significant:

> # new model: removes the Newspaper variable
> ad_sales_new.lm <- lm(Sales ~ TV + Radio, data = ad_sales)
> summary(ad_sales_new.lm)

Call:
lm(formula = Sales ~ TV + Radio, data = ad_sales)

Residuals:
    Min      1Q  Median      3Q     Max
-8.7977 -0.8752  0.2422  1.1708  2.8328

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)
(Intercept)  2.92110    0.29449   9.919 <0.0000000000000002 ***
TV           0.04575    0.00139  32.909 <0.0000000000000002 ***
Radio        0.18799    0.00804  23.382 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 0.00000000000000022

The output depicts the modified multiple linear regression model after the elimination of the Newspaper budget variable.

Complete Activity 5.2 in the Exercise Manual before you proceed with this section.

The dataset in Example 5.3.3 records the sales of certain residential properties in Ames, Iowa, from 2006 to 2010. It includes 2 930 observations as well as a sizable number of explanatory variables (23 nominal, 23 ordinal, 14 discrete and 20 continuous) that are used to determine the values of homes.

Example 5.3.3. For the city of Ames, Iowa, the following dataset of house prices and attributes was gathered over a number of years. This example concentrates on a subset of the columns; from the other columns, we will attempt to predict the SalePrice column. Initialise R by loading the dataset and the necessary libraries.

> # load the required libraries
> library(tidyverse)
> library(dplyr)
> # read.csv: read the CSV file into R; as_tibble(): convert it to a tibble object
> all_sales <- read.csv("house.csv", header = TRUE) %>% as_tibble()
> # filter: used to apply the filtering conditions
> # select: used to select the required columns
> sales <- all_sales %>%
+   filter(`Bldg.Type` == "1Fam", `Sale.Condition` == "Normal") %>%
+   select(`SalePrice`, `X1st.Flr.SF`, `X2nd.Flr.SF`, `Total.Bsmt.SF`,
+          `Garage.Area`, `Wood.Deck.SF`, `Open.Porch.SF`, `Lot.Area`,
+          `Year.Built`, `Yr.Sold`) %>%
+   # arrange: sort the resulting sales tibble by the SalePrice
+   arrange(SalePrice)

Note that successive data manipulation operations are chained together using the %>% operator, from the line where sales is created onwards. Only rows with a Bldg.Type of 1Fam and a Sale.Condition of Normal are chosen using the filter() function. The columns of interest are kept using the select() function. The SalePrice column is used to order the resulting sales dataset using the arrange() function.

Figure 5.8: Ten records of the sorted sales by the SalePrice column
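For readers less familiar with the %>% operator, a rough base-R equivalent of the pipeline above is sketched here (it assumes the same column names and category labels as in the code above; the tidyverse version remains the recommended approach):

> keep <- c("SalePrice", "X1st.Flr.SF", "X2nd.Flr.SF", "Total.Bsmt.SF",
+           "Garage.Area", "Wood.Deck.SF", "Open.Porch.SF", "Lot.Area",
+           "Year.Built", "Yr.Sold")
> rows <- all_sales$Bldg.Type == "1Fam" & all_sales$Sale.Condition == "Normal"
> sales_base <- all_sales[rows, keep]                      # filter and select
> sales_base <- sales_base[order(sales_base$SalePrice), ]  # arrange by SalePrice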
The following R code creates a histogram with 32 bins using the ggplot2 package, where the x-axis corresponds to the SalePrice column of the sales data frame. The plot title and axis labels are set using the labs() function. The dollar_format() function from the scales package is used by scale_x_continuous() to format the x-axis labels as dollar amounts.

> # geom_histogram() is used to create the histogram
> options(scipen = 999)
> ggplot(sales, aes(x = SalePrice)) +
+   geom_histogram(bins = 32, color = "black", fill = "cyan") +
+   scale_x_continuous(labels = scales::dollar_format(prefix = "$")) +
+   labs(x = "SalePrice", y = "Count", title = "Histogram of SalePrice")

Figure 5.9: Histogram of the SalePrice

There is a lot of variation and a clearly skewed distribution in Figure 5.9. A few homes with extraordinarily high prices can be seen in the long tail to the right. There are no homes in the short left tail that sold for less than $35 000.

The following command lines create a scatter plot, where the x-axis corresponds to the X1st.Flr.SF column and the y-axis corresponds to the SalePrice column of the sales data frame.

> # aes(): map 'X1st.Flr.SF' to the x-axis and 'SalePrice' to the y-axis
> # geom_point(): create a scatter plot with red points
> options(scipen = 999)
> ggplot(sales, aes(x = `X1st.Flr.SF`, y = SalePrice)) +
+   geom_point(color = "red") +
+   labs(x = "X1st.Flr.SF", y = "SalePrice",
+        title = "SalePrice against X1st.Flr.SF") +
+   scale_y_continuous(labels = function(y) paste0("$", y))

Figure 5.10: Scatter plot of the SalePrice by X1st.Flr.SF

The SalePrice cannot be predicted by a single attribute alone. For example, the X1st.Flr.SF variable, expressed in square feet, correlates with SalePrice but only partly explains its variability. Note that one square foot (sq ft) is approximately equal to 0.0929 square metres (sq m). The SalePrice and X1st.Flr.SF variables show a strong positive correlation, as seen in Figure 5.10 and confirmed by the calculated correlation coefficient of 0.6424663 in the following R code:

> # calculate the correlation coefficient
> cor(sales$SalePrice, sales$`X1st.Flr.SF`)
[1] 0.6424663
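Because this is a relationship between SalePrice and a single attribute, squaring the correlation coefficient gives the proportion of the variation in SalePrice that X1st.Flr.SF explains on its own (a quick check):

> cor(sales$SalePrice, sales$`X1st.Flr.SF`)^2
> # about 0.41, i.e. roughly 41% of the variation in SalePrice; for a simple
> # regression this equals the Multiple R-squared of lm(SalePrice ~ `X1st.Flr.SF`, data = sales)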
Area " , " Wood . Deck . SF " , + " Open . Porch . SF " , " Lot . Area " ," Year . Built " , " Yr . Sold " ) ]) X1st . Flr . SF X2nd . Flr . SF Total . Bsmt . SF Garage . Area Wood . Deck . SF 0.64246625 0.35752189 0.65297863 0.63859449 0.35269867 Open . Porch . SF Lot . Area Year . Built Yr . Sold Section 5.3. Least squares regression 13 0.33690942 0.29082346 Page 83 0.56516475 0.02594858 Note that all the individual variables, with the exception of the SaleP rice itself, do not show a correlation coefficient with SaleP rice of more than 0.7. Complete Activity 5.3 in the Exercise Manual before you proceed with this section. Multiple linear regression involves using numerical input variables to predict a numerical output. In order to do this, it multiplies the value of each variable by a specific slope and aggregate the outputs. For instance, in this illustration, the slope for X1st.F lr.SF represents the contribution of the first-floor space of the house to the overall prediction. Before making predictions, data are split into two equal sets: a training set and a test set. 1 2 3 4 5 6 > # split data into train and test sets > set . seed (1001) > train <- sales [1:1001 , ] > test <- sales [1002: nrow ( sales ) , ] > cat ( nrow ( train ) , ’ training and ’ , nrow ( test ) , ’ test instances .\ n ’) 1001 training and 1001 test instances . In multiple regression, the slopes form an array with a single slope value for each attribute. By multiplying each attribute by the slope and adding the results, we can predict the SaleP rice. 1 2 3 4 5 6 7 8 9 10 11 12 > # define predict function > predict <- function ( slopes , row ) { + sum ( slopes * as . numeric ( row ) ) + } > example _ row <- test [ , ! names ( test ) % in % " SalePrice " ][1 , ] > cat ( ’ Predicting sale price for : ’ , paste0 ( example _ row , collapse = " , " ) , " \ n " ) Predicting sale price for : 1287 , 0 , 1063 , 576 , 364 , 17 , 9830 , 1959 , 2010 > example _ slopes <- rnorm ( length ( example _ row ) , mean = 10 , sd = 1) > cat ( ’ Using slopes : ’ , paste0 ( example _ slopes , collapse = " , " ) , " \ n " ) Using slopes : 12.1886480934024 , 9.82245266527506 , 9.81472472040473 , 7.49346378650813 , 9.44268866267632 , 9.85644054650458 , 11.0915017022335 , 9.37705626728088 , 9.0925396147486 > cat ( ’ Result : ’ , predict ( example _ slopes , example _ row ) , " \ n " ) Result : 179715.9 A predicted SaleP rice is the end result, which can be compared with the actual SaleP rice to determine whether the slopes are reliable predictors. We should not count on the example slopes above to make any predictions at all since they were selected at random. 1 2 3 4 > cat ( ’ Actual sale price : ’ , test $ SalePrice [1] , " \ n " ) Actual sale price : 162000 > cat ( ’ Predicted sale price using random slopes : ’ , predict ( example _ slopes , example _ row ) , " \ n " ) Predicted sale price using random slopes : 179715.9 The definition of the least squares objective is the next step in performing multiple regression. In order to determine the root mean squared error (RMSE) of the predictions made using the actual prices, we first make the prediction for each row in the training set. 1 2 3 > rmse <- function ( slopes , attributes , prices ) { + errors <- sapply (1: length ( prices ) , function ( i ) { + predicted <- predict ( slopes , attributes [i , ]) Section 5.3. 
The next step in performing multiple regression is to define the least squares objective. To determine the root mean squared error (RMSE) of the predictions against the actual prices, we first make a prediction for each row in the training set.

> rmse <- function(slopes, attributes, prices) {
+   errors <- sapply(1:length(prices), function(i) {
+     predicted <- predict(slopes, attributes[i, ])
+     actual <- prices[i]
+     (predicted - actual)^2
+   })
+   mean(errors)^0.5
+ }
> train_prices <- train$SalePrice
> train_attributes <- train[ , !names(train) %in% "SalePrice"]
> rmse_train <- function(slopes) {
+   rmse(slopes, train_attributes, train_prices)
+ }
> cat('RMSE of all training examples using random slopes:', rmse_train(example_slopes), "\n")
RMSE of all training examples using random slopes: 58433.83

The following R code uses the nloptr package to find the best slopes for a linear regression model. The best slopes are found by minimising the RMSE on the training dataset. The RMSE measures how well a model fits a set of data points: the lower the RMSE, the better the fit. Note that the slopes in example_slopes are the first-guess values supplied in the x0 input. The rmse_train function is the one that will be minimised, as specified by the eval_f argument. nloptr's optimisation choices are listed in the opts parameter. Here the algorithm is NLOPT_LN_SBPLX and xtol_rel is set to 1.0e-6, indicating the desired relative tolerance on the optimisation parameters (the slopes).

> # define best_slopes using the nloptr() optimiser from the nloptr package
> install.packages("nloptr")
> library(nloptr)
> best_slopes <- nloptr(x0 = example_slopes, eval_f = rmse_train,
+                       opts = list("algorithm" = "NLOPT_LN_SBPLX",
+                                   "xtol_rel" = 1.0e-6))
> cat('The best slopes for the training set:\n')
The best slopes for the training set:
> data.frame(names(train_attributes), best_slopes$x) %>%
+   rename("Feature" = 1, "Coefficient" = 2) %>%
+   knitr::kable()

|Feature       | Coefficient|
|:-------------|-----------:|
|X1st.Flr.SF   |   12.188648|
|X2nd.Flr.SF   |    9.822453|
|Total.Bsmt.SF |    9.814725|
|Garage.Area   |    7.493464|
|Wood.Deck.SF  |    9.442689|
|Open.Porch.SF |    9.856440|
|Lot.Area      |   11.091502|
|Year.Built    |    9.377056|
|Yr.Sold       |    9.092540|

The output table shows the best slopes for each feature in the linear regression model. The coefficients indicate the degree to which each feature affects the SalePrice target variable.

> cat('RMSE of all training examples using the best slopes:', rmse_train(best_slopes$x), "\n")
RMSE of all training examples using the best slopes: 58433.83

The function rmse_test, defined in the following code, takes a vector of slopes as input, computes the root mean squared error (RMSE) for a multiple linear regression model using these slopes and the test data, and returns the RMSE value.

> # define rmse_test function
> test_prices <- test$SalePrice
> test_attributes <- test[ , !names(test) %in% "SalePrice"]
> rmse_test <- function(slopes) {
+   rmse(slopes, test_attributes, test_prices)
+ }
> rmse_linear <- rmse_test(best_slopes$x)
> cat('Test set RMSE for multiple linear regression:', rmse_linear, "\n")
Test set RMSE for multiple linear regression: 117683.7

Interpreting the results, the RMSE of 117 683.7 measures how well the multiple linear regression model fits the test data. A smaller RMSE value usually indicates a better fit; therefore, the relatively high RMSE in our output suggests that the model is not performing well on the test data.
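As a cross-check on these results, the same training data can be fitted with lm(), which also estimates an intercept. This sketch assumes the train and test objects created above and is not part of the nloptr optimisation:

> ols_fit <- lm(SalePrice ~ ., data = train)
> # use stats::predict explicitly, because a custom predict() function was defined earlier
> ols_pred <- stats::predict(ols_fit, newdata = test)
> cat('Test set RMSE for lm() with an intercept:',
+     sqrt(mean((ols_pred - test$SalePrice)^2)), "\n")

Comparing this RMSE with rmse_linear indicates how much of the error above is attributable to fitting slopes without an intercept.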
> # define fit function
> fit <- function(row) {
+   sum(best_slopes$x * as.numeric(row))
+ }
> test %>%
+   mutate(Fitted = apply(select(., -SalePrice), 1, fit)) %>%
+   ggplot(aes(x = Fitted, y = SalePrice)) +
+   geom_point(color = "blue") +
+   geom_abline(intercept = 0, slope = 1, color = "red", lwd = 1.5) +
+   ggtitle("Scatter Plot of Fitted vs SalePrice")
> dev.off()

Figure 5.11: Scatter plot of the fitted values vs SalePrice

The majority of the blue dots are clustered fairly tightly around the red line in Figure 5.11, demonstrating that the multiple linear regression model fits the test data reasonably well. However, certain outliers suggest that not all data points match the model perfectly. Overall, Figure 5.11 offers a quick and simple method for assessing how well the multiple linear regression model performed on the test dataset.

A residual plot is a graphical tool for visually assessing the quality of fit of the multiple linear regression model. It plots the residuals, the gap between the predicted and actual sale prices, against the actual sale prices. Let us plot the residual plot of SalePrice:

> library(ggplot2)
> test %>%
+   mutate(Residual = test_prices - apply(select(., -SalePrice), 1, fit)) %>%
+   ggplot(aes(x = SalePrice, y = Residual)) +
+   geom_point(color = "blue") +
+   geom_hline(yintercept = 0, color = "red", lwd = 1.5) +
+   xlim(0, 7e5) +
+   ggtitle("A residual plot for multiple regression")

Figure 5.12: A residual plot for multiple regression

Figure 5.12 suggests that the multiple linear regression model fits the test data reasonably well, since the majority of the blue points are spread around the zero line. However, there are certain patterns in the residuals, such as a slight curve, which raise the possibility that the model does not capture all of the patterns in the data. This can be a sign of non-linear relationships or other variables that have not been taken into account. The residual plot thus provides a visual performance evaluation of the multiple linear regression model, and areas for model improvement can be identified in this way.

Complete Activity 5.4 and Activity 5.5 in the Exercise Manual before you proceed to the next section.

5.4 Regression diagnostics

Refer to section 4.3 in STA1502 for background information on the assumptions of a linear regression model as well as the diagnostic tools for verifying those assumptions. The examination of the residual errors is one of the diagnostic methods for verifying the assumptions. Generally, regression diagnostics are used to determine whether a model is consistent with its assumptions and whether one or more observations are inadequately represented by the model. With the aid of these tools, researchers can assess whether a model appropriately represents the data in their study. This section evaluates the quality of a linear regression analysis visually with the use of residual plots. These evaluations are also known as diagnostics. The diagnostic plots display residuals in four distinct ways:

1. Normal Q-Q plot. This is used to examine whether or not the residuals are normally distributed. The normal probability plot of the residuals should approximately follow a straight line.
For the salaries dataset in Section 5.1, Example 5.1.1, it follows that:

> library(car)   # qqPlot(), ncvTest() and spreadLevelPlot() come from the car package
> salary.model <- lm(Salary ~ ., data = empl_salaries)
> qqPlot(resid(salary.model), main = "Normal Q-Q plot")
[1] 30 24

According to the output, points 24 and 30 are out of line with the rest of the observations; they are flagged as outliers.

> qqPlot(resid(salary.model), main = "Normal Q-Q plot")
> qqline(resid(salary.model), col = "steelblue", lwd = 2)

Figure 5.13: The normal Q-Q plot for the salary.model developed from the empl_salaries data

The data in Figure 5.13 demonstrate that the residual points fall approximately along the reference line within the interval [−2, 2]. We are nevertheless a little hesitant to say that the residuals are normally distributed in this plot, because there is a bit of a pattern and some noticeable non-linearity. The results should serve as a reminder that no model is perfect, even if we do not necessarily reject the model on the basis of this one check. The residual histogram in Figure 5.14 corroborates these interpretations.

> library(MASS)   # studres() comes from the MASS package
> # standardised residuals
> std_resid <- studres(salary.model)
> hist(std_resid, freq = FALSE,
+      main = "Distribution of standardised residuals",
+      xlab = "Standardised residuals")
> xsalary.model <- seq(min(std_resid), max(std_resid), length = 30)
> ysalary.model <- dnorm(xsalary.model)
> lines(xsalary.model, ysalary.model, lwd = 2, col = "red")

Figure 5.14: Distribution of the standardised residuals

2. Residuals vs Fitted. This is used to verify the linearity assumption. In Figure 5.15, a horizontal red line with no trend in the observations around it is an indicator of a linear relationship.

> # residuals vs fitted plot
> plot(fitted(salary.model), resid(salary.model),
+      main = "Residual vs Fitted plot", col = "blue",
+      ylab = "Residuals", xlab = "Fitted values", frame = FALSE)
> # add a horizontal line at 0
> abline(0, 0, col = "red", lwd = 2)

Figure 5.15: Residuals against fitted values

Figure 5.15 compares the residuals for each observation to the fitted (expected mean) values. The observations are not uniformly distributed around the reference line (the red line). Since the majority of the observations fall between 20 000 and 80 000, we can conclude that the variation of the estimates leans slightly to the right. Additionally, since there are not enough observations between 90 000 and 120 000, it is probable that we have a problem with heteroscedasticity. A regression model is said to be heteroscedastic when the variance of the residual (error) term fluctuates significantly or is non-constant.

3. Spread-Location. This is used to check for homogeneity of variance, or homoscedasticity, of the residuals. If the assumption holds, the plot should show a horizontal line with evenly spread points. The ncvTest, also known as the non-constant variance score test, is a test for heteroscedasticity; in a normal linear model the variance of the residuals is assumed to be constant. Let us examine the syntax in R:

> ncvTest(salary.model)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 13.1946, Df = 1, p = 0.00028076

The variance appears NOT to be constant, because the ncvTest p-value is less than 0.05.
The following command line generates a spreadLevelPlot based on the salary.model:

> spreadLevelPlot(salary.model,
+                 main = "Spread-Level Plot for salary model")

Suggested power transformation:  -0.1104097

Figure 5.16: Spread-level plot for salary.model

Figure 5.16 plots the absolute studentised residuals from the salary.model against the fitted values to reveal potential spread-level dependencies. The positive relationship between the estimated Salary and the spread of the residuals further demonstrates the presence of heteroscedasticity.

4. Residuals vs Leverage. This is used to find influential or extreme values. The inclusion or exclusion of these points from the analysis may affect the findings of the linear regression. Take a look at the following R code and results:

> par(mfrow = c(1, 2))
> plot(salary.model, 4)
> plot(salary.model, 5, lwd = 2)

Figure 5.17: Cook's distance and residuals against leverage

Cook's distance, a metric for the impact of observations with a large distance, is presented on the left-hand side of Figure 5.17. Possible outliers include points 24, 29 and 30. Since point 30 has a low residual and low leverage (on the right-hand side of Figure 5.17), the model did not adequately account for it and it had a significant impact on the outcomes. All potential outliers identified by the Cook's distance that could affect the findings can be removed from the model to improve the output.

Complete Activity 5.6 in the Exercise Manual before you proceed to the next section.

5.5 Inference and prediction intervals

A linear regression model can be useful for two main reasons:

1. Evaluating the relationship between one or more independent variables and the dependent variable.
2. Predicting future values using the regression model.

In relation to point 2, it is sometimes of interest to predict both an exact value and an interval that comprises a range of likely values. This interval is known as a prediction interval. Note that, in contrast to the confidence interval, which represents uncertainty around the mean predicted value, the prediction interval reflects the uncertainty around a single value. This section demonstrates how to apply the prediction interval for the regression model. Use the regression model in Example 5.1.1 to predict the value of Salary using the fitted regression model and a new value of 18 for Years of experience.

> # fit simple regression model
> model <- lm(Salary ~ Years_of_experience, data = empl_salaries)
> # confidence interval
> predict(model, data.frame(Years_of_experience = 18), interval = "confidence")
       fit    lwr      upr
1 133786.7 119851 147722.4
> # prediction interval
> predict(model, data.frame(Years_of_experience = 18), interval = "prediction")
       fit      lwr      upr
1 133786.7 106879.1 160694.3

The output indicates that someone with 18 Years of experience is predicted to earn R133 786.70. It also indicates that the 95% prediction interval for that Salary lies between R106 879.10 and R160 694.30. Note that the margins of error for the prediction interval and the confidence interval differ: the prediction interval is wider than the confidence interval, since it predicts a single value rather than the average value.
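Inference for the true slope (learning outcome 5) can be drawn from the same fitted model. The t-test reported by summary(model) tests the null hypothesis that the true slope is zero, and confint() provides a confidence interval for it. A short sketch using the model object fitted above:

> # 95% confidence intervals for the intercept and the true slope
> confint(model, level = 0.95)
> # the Years_of_experience row holds the slope estimate, its standard error,
> # the t value and the p-value for H0: true slope = 0
> summary(model)$coefficients

If the interval for Years_of_experience excludes zero (equivalently, if its p-value is below 0.05), we conclude that there is a significant linear relationship between Salary and Years of experience.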
Now that you have reached the end of Learning Unit 5, start to complete Assessment 6 as outlined in the Activities section on the module site and ensure that you submit the completed assessment for formal evaluation once you have reached the end of Learning Unit 6.