
DSC2608 Learning Unit 5

5. Linear regression
Learning objectives and outcomes: Once you have completed learning unit 5, you should be able to do
the following:
1. Calculate and interpret linear correlation.
2. Determine and interpret the equation of the linear regression line.
3. Demonstrate knowledge of least squares regression and apply it to datasets.
4. Apply visual and numerical regression diagnostics.
5. Apply inference for the true slope and prediction intervals for the regression model.
5.1 Linear correlation
In this section you will learn how to estimate the linear relationship between two variables. Pearson's correlation coefficient, r, is the relevant measure of association to use (see learning unit 2, STA1501). It is a continuous-scale measure of the strength of the linear relationship between two variables, and it can take any value between −1 and +1.
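As a quick illustration of this scale, the following minimal sketch uses simulated data (not part of the prescribed examples) to show variables with a strong positive, a strong negative and a near-zero linear relationship (output not shown):

> # illustrative only: simulated data showing the range of r
> set.seed(123)
> x <- rnorm(100)
> cor(x,  2 * x + rnorm(100, sd = 0.1))   # close to +1
> cor(x, -2 * x + rnorm(100, sd = 0.1))   # close to -1
> cor(x, rnorm(100))                      # close to 0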
Example 5.1.1. Figure 5.1 shows eight of the 30 records of employees' Salary and Years of experience. First load the dataset salaries.xls, which is available on myModules under Additional Resources, into R and View it.
> # load the readxl library and import the .xls file into R
> library(readxl)
> empl_salaries <- read_excel("salaries.xls")
> View(empl_salaries)
Figure 5.1: Eight records of Years of experience and related Salary of employees
The following R code calculates the correlation coefficient between the Years of experience and the
corresponding Salary from the attached dataset:
> # attach the empl_salaries dataset
> attach(empl_salaries)
> # calculate the linear correlation
> cor(Years_of_experience, Salary)
[1] 0.9152571
The resulting output of 0.92 indicates a strong positive linear correlation between the variables Salary
and Years of experience. Figure 5.2 illustrates the scatter plot of Salary against Years of experience.
> options(scipen = 999)
> plot(Years_of_experience, Salary,
+      main = 'Salary vs Years of experience',
+      xlab = 'Years of experience', ylab = 'Salary',
+      col = "red")
Figure 5.2: Scatter plot of Salary against Years of experience
Figure 5.2 indicates an increase in Salary as the number of Years of experience increases, which means there is a positive linear association.
Complete Activity 5.1 in the Exercise Manual before you proceed to the next section.
5.2 Linear regression line equation
The equation of a simple linear regression line defines the relationship between two variables. It determines the value of the dependent or response variable, y, for a predetermined value of the independent
variable, x. In essence, it is used to predict the value of y given a value for x. Refer to learning unit
2 in STA1501 and learning unit 4 in STA1502 to refresh your memory and get more background
information.
To draw a linear regression line in R, we need the following two functions that form part of the built-in
stats package that comes with the installation of R:
1. The abline() function is used to add one or more straight lines through the current plot. It has
the following format:
abline(a = NULL, b = NULL, h = NULL, v = NULL, ...)
Parameters: a and b specify the intercept and the slope of the line; h specifies the y-value(s) for horizontal line(s); v specifies the x-value(s) for vertical line(s). The function draws the corresponding straight line(s) on the current plot.
2. The lm() function, which stands for linear model, can also be used to create a simple regression
model. It has the following format:
lm(formula, data)
Parameters: formula and data specify the relationship between y and x, and the data frame to which the formula will be applied, respectively. The function returns the estimated parameters of the equation for the relationship between x and y.
Applying the abline() function to the scatter plot of Salary against Years of experience in Figure 5.2 yields the following:
> # apply the abline() function
> abline(lm(Salary ~ Years_of_experience, data = empl_salaries),
+        col = 'blue', lwd = 2)
Figure 5.3: The abline function applied to a linear regression model of Salary against
Years of experience
Figure 5.3 illustrates a best fit line (linear regression line) for the scatter plot of Salary against
Years of experience.
The following model is returned by deploying the lm() function, and the output is displayed by calling the summary() function on the model:
> salary.lm <- lm(Salary ~ ., data = empl_salaries)
> summary(salary.lm)

Call:
lm(formula = Salary ~ ., data = empl_salaries)

Residuals:
     Min       1Q   Median       3Q      Max
-28921.5  -7380.9   -920.6   8728.1  22411.3

Coefficients:
                    Estimate Std. Error t value         Pr(>|t|)
(Intercept)          18832.8     3697.3   5.094 0.00002146820188 ***
Years_of_experience   6386.3      531.2  12.021 0.00000000000143 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11240 on 28 degrees of freedom
Multiple R-squared:  0.8377,  Adjusted R-squared:  0.8319
F-statistic: 144.5 on 1 and 28 DF,  p-value: 0.000000000001428
The intercept of 18 832.8 is the expected value of the employee Salary when Years of experience is zero. The Years_of_experience row shows the estimated slope: the employee Salary rises by R6 386.30 for every one-year increase in Years of experience.
The standard error quantifies the precision of each estimated coefficient. The t-value, the number of standard errors by which the estimate differs from zero, is the other crucial quantity. It must be far from zero in order for us to reject the null hypothesis of no relationship and claim that there is a relationship between Years of experience and Salary. The summary output confirms that the t-values are certainly far from zero, indicating that there is a significant relationship between Salary and Years of experience.
Usually, a p-value of 5% or less is the commonly used significance level (or alpha level) in hypothesis testing. The three asterisks next to the coefficients represent highly significant regression coefficients. In this example, the p-values of the intercept and the Years of experience coefficient are very small. That means we can reject the null hypothesis that there is no effect or no difference, which implies that there exists a relationship between Salary and Years of experience. Lastly, R² shows how much of the variation in the dependent variable is explained by the linear relationship with the independent variable in the model: about 83% of the variation in the variable Salary is explained by the linear relationship with the variable Years of experience.
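These quantities can also be extracted from the fitted model object rather than read off the printed summary. The following is a minimal sketch, assuming the salary.lm object fitted above is still in the workspace (output not shown):

> coef(salary.lm)                   # intercept and slope estimates
> confint(salary.lm, level = 0.95)  # 95% confidence intervals for the coefficients
> summary(salary.lm)$r.squared      # proportion of the variation in Salary explained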
At this point, you should be able to load any dataset in R using the appropriate library and View the
dataset.
Example 5.2.1. Consider a dataset for advertising that tracks Sales as a function of spending on TV, Radio and Newspaper advertising. Each of these four variables is measured in thousands of rands. To get a general overview of the dataset structure, complete the next steps after loading the advertising.csv dataset, which is available under Additional Resources on the module site, and naming it ad_sales.
> ad_sales <- read.csv("advertising.csv")
> View(ad_sales)
> dim(ad_sales)
[1] 200   4
> class(ad_sales)
[1] "data.frame"
> str(ad_sales)
'data.frame':  200 obs. of  4 variables:
 $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
 $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
 $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
 $ Sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
>
The output reveals that the dataset has four variables with 200 observations in total: Sales and three media channel budgets for TV, Radio and Newspaper. The data frame variables are all numeric, which informs the kinds of analysis that can be done.
To check that the data are reasonable and within expectations, the summary() function can be used.
This function is used to generate descriptive statistics for a given dataset or object. When you apply
summary() to a data frame or a vector, it provides a concise summary of the data, including measures of
central tendency (e.g. mean and median), measures of spread (e.g. minimum and maximum, quartiles),
and other relevant statistics depending on the data type.
> # high-level overview of ad_sales dataset
> summary(ad_sales)
       TV             Radio          Newspaper          Sales
 Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60
 1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38
 Median :149.75   Median :22.900   Median : 25.75   Median :12.90
 Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02
 3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40
 Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00
>
The findings contain a five-number summary and the mean for each of the four variables. For instance, the minimum and maximum budgets for TV advertisements are 0.70 and 296.40, respectively, which translate to R700 and R296 400. The mean budget for TV advertisements is 147.04, which is equal to R147 040.
Let us further inspect the data to see whether there exists any association between the advertising Sales and the advertisement budgets for TV, Radio and Newspaper by determining the linear correlation.
> # DataExplorer: designed for fast exploratory data analysis
> library(DataExplorer)
> plot_correlation(ad_sales)
Figure 5.4: Correlation plot of TV, Radio, Newspaper and Sales
Figure 5.4 reveals a significant positive linear correlation between Sales and the TV and Radio spending budgets, as well as a weak positive linear association between Sales and the Newspaper budget. The correlation plot results are further supported by diagrams of each of the three budgets plotted against Sales. A positive linear correlation coefficient means that as one variable increases, the other variable also tends to increase. An increase in Radio and TV budgets will correspond to an increase in Sales.
> attach(ad_sales)
> par(mfrow = c(1, 3))
> plot(TV, Sales, xlab = "TV", ylab = "Sales", col = "cyan")
> plot(Radio, Sales, xlab = "Radio", ylab = "Sales", col = "blue")
> plot(Newspaper, Sales, xlab = "Newspaper", ylab = "Sales", col = "darkred")
Figure 5.5: Scatter plots of Sales against TV, Radio and Newspaper budgets
Figure 5.5 indicates the positive linear correlations between the Radio and TV budgets and Sales, as opposed to almost no pattern between Newspaper and Sales.
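As a numerical complement to the correlation plot, the pairwise correlation coefficients can also be printed directly with cor(); a short sketch, assuming the ad_sales data frame loaded earlier (output not shown):

> # pairwise linear correlations between all four variables, rounded to two decimals
> round(cor(ad_sales), 2)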
The plots can also include the simple least squares equations that represent the lines of best fit for predicting Sales from the TV, Radio and Newspaper variables. These lines of best fit depict a straightforward linear model that can be used to predict Sales by considering the respective media budget variables.
> par(mfrow = c(1, 3))
> TV_lm <- lm(Sales ~ TV, data = ad_sales)
> radio_lm <- lm(Sales ~ Radio, data = ad_sales)
> newspaper_lm <- lm(Sales ~ Newspaper, data = ad_sales)
> plot(TV, Sales, xlab = "TV", ylab = "Sales", col = "cyan",
+      main = "Sales against TV", frame = FALSE)
> abline(TV_lm, col = "red", lwd = 2)
> plot(Radio, Sales, xlab = "Radio", ylab = "Sales", col = "blue",
+      main = "Sales against Radio", frame = FALSE)
> abline(radio_lm, col = "red", lwd = 2)
> plot(Newspaper, Sales, xlab = "Newspaper", ylab = "Sales",
+      col = "darkred", main = "Sales against Newspaper",
+      frame = FALSE)
> abline(newspaper_lm, col = "red", lwd = 2)
Figure 5.6: The least squares regression equation fitted to each scatter plot of Sales against TV, Radio and Newspaper
Figure 5.6 demonstrates a weak positive linear correlation in the plot of Sales against Newspaper, which explains why the relationship between these two variables tends to increase only slightly, even though it is not very strong.
5.3 Least squares regression
The goal of a linear regression line is to estimate or predict the values of the response variable, y, given values of the explanatory variable, x. Usually, estimates are not exact; each one is slightly different from the actual value. An acceptable criterion for determining which line is the "best" fit is to choose the straight line with the least total error. This section illustrates this criterion in order to determine whether it is possible to identify the best straight line in terms of the criterion.
Let us illustrate how the estimated values and regression residuals may be obtained in R. The given data points are x_1, . . . , x_n and y_1, . . . , y_n. The estimated responses, or the ordinary least squares (OLS) fitted values, are defined by

ŷ_i = α̂ + β̂ x_i   for all   i = 1, . . . , n,                    (5.3.1)
where α̂ and β̂ denote the estimated intercept and slope of the regression line, respectively. The residuals are the differences between the actual responses and the fitted values, that is,
e_i = y_i − ŷ_i,                    (5.3.2)
such that the residual sum of squares (RSS) of the errors is defined as

RSS = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² ≥ 0.                    (5.3.3)
Using the dataset salaries from Section 5.1, let us assign the Salary variable to Y and the Years of experience
to X, and then compute the OLS fitted values and residuals according to Equations (5.3.1) and (5.3.2).
View the R demonstrations.
> attach(empl_salaries)
> # define variables
> Y <- cbind(Salary)
> X <- cbind(Years_of_experience)
> # ordinary least squares
> OLS <- lm(Y ~ X)
> # predicted or estimated values
> y.hat <- fitted(OLS)
> # residuals of the OLS
> err <- resid(OLS)
> # column bind of all 4 variables
> table <- cbind(Y, X, y.hat, err)
> table
   Salary Years_of_experience    y.hat          err
1   19143                 1.0 25219.16   -6076.1636
2   26205                 1.3 27135.06    -930.0615
3   17531                 1.5 28412.33  -10881.3268
4   23325                 2.0 31605.49   -8280.4899
5   19691                 2.2 32882.76  -13191.7552
6   36442                 2.9 37353.18    -911.1836
7   39950                 3.0 37991.82    1958.1838
8   34245                 3.2 39269.08   -5024.0815
9   44245                 3.2 39269.08    4975.9185
10  36989                 3.7 42462.24   -5473.2447
11  43018                 3.9 43739.51    -721.5099
12  35594                 4.0 44378.14   -8784.1426
13  36757                 4.0 44378.14   -7621.1426
14  36881                 4.1 45016.78   -8135.7752
15  40911                 4.5 47571.31   -6660.3057
To illustrate how the fitted equations in (5.3.1) and (5.3.2) are used, let us manually confirm the values of ŷ_1 and e_1 that are highlighted in the result, as follows:

ŷ_1 = 18832.8 + 6386.3 x_1 = 18832.8 + 6386.3 (1) = 25219.16
e_1 = y_1 − ŷ_1 = 19143 − 25219.16 = −6076.16
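The same check can be performed in R directly from the fitted model object; a minimal sketch, assuming the OLS object and the attached empl_salaries data from the block above (output not shown):

> # verify the first fitted value and residual from the estimated coefficients
> b <- coef(OLS)
> b[1] + b[2] * Years_of_experience[1]                # manual y-hat for the first record
> Salary[1] - (b[1] + b[2] * Years_of_experience[1])  # manual residual e_1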
The least squares principle can be summarised as follows:
1. Select α and β such that the RSS in (5.3.3) is minimised.
2. The problem

   min_{α,β} RSS = min_{α,β} Σ_{i=1}^{n} e_i²                    (5.3.4)

   is called the least squares problem. The sketch below illustrates this principle numerically.
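The following minimal sketch (using the OLS fit and the attached salaries data from above) compares the RSS of the least squares line with the RSS obtained when the slope is perturbed while the intercept is held at its least squares value; any such perturbation should increase the RSS (output not shown):

> # RSS of the least squares fit
> sum(resid(OLS)^2)
> # RSS after perturbing the slope, holding the fitted intercept fixed
> b <- coef(OLS)
> sum((Salary - (b[1] + (b[2] + 500) * Years_of_experience))^2)   # larger than the RSS above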
Example 5.3.1. Examine the connection between advertising budgets and Sales using the ad_sales dataset from Section 5.2, Example 5.2.1. Minimising the sum of squared errors will allow you to find the least squares fit for the regression of Sales onto TV, as illustrated in the following steps:
> par(mfrow = c(1, 1))
> TV_lm <- lm(Sales ~ TV, data = ad_sales)
> plot(TV, Sales, xlab = "TV", ylab = "Sales",
+      col = "blue", pch = 19, main = "Actual vs Predicted")
> abline(TV_lm, col = "red", lwd = 2)
> segments(TV, Sales, TV, predict(TV_lm),
+          col = "grey")
Figure 5.7: The squared sum of errors between the actual and predicted Sales in relation to the TV budget
The line of best fit (the red line in Figure 5.7) minimises the sum of squares of the grey line segments, which represent the errors.
Consider the issue of predicting the value of a response Y based on multiple explanatory variables for
further investigation. Let’s first review the following definition:
Definition 5.3.2. Multiple linear regression models are defined by the equation

Y = β_0 + β_1 X_1 + β_2 X_2 + . . . + β_n X_n + ε,

where
• β_0 is the intercept, the value of Y when X_i = 0 for all i = 1, . . . , n.
• β_i, for i = 1, . . . , n, are the slopes of the line, defined as the change in Y for a one-unit change in X_i, holding the other explanatory variables constant.
• X_i, for i = 1, . . . , n, are the independent variables (or explanatory variables).
• Y is the dependent variable (or response variable).
• ε is the random error that explains the deviation of the points (X, Y) about the line.
The only difference between Definition 5.3.2 and Equation 5.3.1 for simple linear regression is the
presence of several independent variables. The method of least squares, which applies simple linear
regression principles to n dimensions, is used to estimate the regression coefficients.
The lm() function can also be used to build a multiple regression model of Sales based on the three advertising budgets of the TV, Radio and Newspaper media channel variables in R as follows:
> ad_sales.lm <- lm(Sales ~ ., data = ad_sales)
> summary(ad_sales.lm)

Call:
lm(formula = Sales ~ ., data = ad_sales)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422 <0.0000000000000002 ***
TV           0.045765   0.001395  32.809 <0.0000000000000002 ***
Radio        0.188530   0.008611  21.893 <0.0000000000000002 ***
Newspaper   -0.001037   0.005871  -0.177                0.86
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,  Adjusted R-squared:  0.8956
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 0.00000000000000022
The output illustrates a multiple regression model for estimating Sales based on the three advertising budgets. Interpreting the F-statistic and the corresponding p-value, which are located at the bottom of the model summary, is the first step in evaluating the multiple regression analysis. As can be observed, the p-value of the F-statistic is less than 0.00000000000000022, which indicates a very significant relationship. This indicates a meaningful relationship between at least one of the explanatory variables (TV, Radio and Newspaper) and the response variable (Sales).
Another way is to look closely at the table of coefficients shown in the following step, which displays the estimated beta regression coefficients and the corresponding p-values of the t-statistics, in order to determine which explanatory variables are significantly different from zero:
> # a table of beta coefficients: b_0, b_1, b_2 and b_3
> summary(ad_sales.lm)$coefficient
                 Estimate  Std. Error    t value
(Intercept)   2.938889369 0.311908236  9.4222884
TV            0.045764645 0.001394897 32.8086244
Radio         0.188530017 0.008611234 21.8934961
Newspaper    -0.001037493 0.005871010 -0.1767146
It follows that β0 = 2.938889369, β1 = 0.045764645, β2 = 0.188530017 and β3 = −0.001037493,
which can be used to express the following multiple regression model with beta coefficients rounded to
three decimal places:
ŷ = 2.94 + 0.046(TV) + 0.189(Radio) − 0.001(Newspaper),
where ŷ is the expected or predicted value of Sales. Observe that changes in the Newspaper advertising budget are not significantly correlated with changes in Sales, while changes in the TV and Radio advertising budgets are significantly correlated with changes in Sales. Keeping the TV and Radio budgets constant, an additional R1 000 spent on Newspapers results in a 1 000 × 0.001 = 1 rand decrease in Sales. On the other hand, keeping the Radio and Newspaper budgets constant, investing R1 000 in the TV budget results in an increase in Sales of 1 000 × 0.046 = 46 rand. The interpretation for the Radio budget follows a similar argument to that for the TV budget.
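Predictions for new budget combinations can be obtained directly from the fitted model object instead of substituting values into the equation by hand. The following is a minimal sketch with hypothetical budget values (in thousands of rands), assuming the ad_sales.lm model fitted above (output not shown):

> # predicted Sales for a hypothetical budget of R100 000 on TV,
> # R30 000 on Radio and R20 000 on Newspaper
> new_budget <- data.frame(TV = 100, Radio = 30, Newspaper = 20)
> predict(ad_sales.lm, newdata = new_budget)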
The Newspaper variable can be eliminated from the model, as illustrated in the following step, because it is not statistically significant:
> # new model: removes Newspaper variable
> ad_sales_new.lm <- lm(Sales ~ TV + Radio, data = ad_sales)
> summary(ad_sales_new.lm)

Call:
lm(formula = Sales ~ TV + Radio, data = ad_sales)

Residuals:
    Min      1Q  Median      3Q     Max
-8.7977 -0.8752  0.2422  1.1708  2.8328

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept)  2.92110    0.29449   9.919 <0.0000000000000002 ***
TV           0.04575    0.00139  32.909 <0.0000000000000002 ***
Radio        0.18799    0.00804  23.382 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972,  Adjusted R-squared:  0.8962
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 0.00000000000000022
A modified multiple linear regression model is depicted in the output following the elimination of the Newspaper budget variable.
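One way to confirm that dropping Newspaper does not degrade the model is to compare the two fits formally with an F-test. A short sketch, assuming both models fitted above are still in the workspace (output not shown):

> # compare the reduced model with the full model
> anova(ad_sales_new.lm, ad_sales.lm)

A large p-value for this comparison would indicate that the Newspaper term does not significantly improve the fit, which is consistent with its non-significant coefficient in the full model.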
Complete Activity 5.2 in the Exercise Manual before you proceed with this section.
The dataset used in Example 5.3.3 describes the sales of certain residential properties in Ames, Iowa, from 2006 to 2010. The dataset includes 2 930 observations as well as a sizable number of explanatory variables (23 nominal, 23 ordinal, 14 discrete and 20 continuous) that are used to determine the values of homes.
Example 5.3.3. For the city of Ames, Iowa, the following dataset of house prices and attributes was gathered over a number of years. This example concentrates on a subset of the columns; from the other columns, we will attempt to predict the SalePrice column. Initialise R by loading the dataset and the necessary libraries.
> # load the required libraries
> library(tidyverse)
> library(dplyr)
> # read the CSV file into a tibble object
> all_sales <- read.csv("house.csv", header = TRUE) %>% as_tibble()
> # filter: used to apply the filtering conditions
> # select: used to select the required columns
> sales <- all_sales %>%
+   filter(`Bldg.Type` == "1Fam", `Sale.Condition` == "Normal") %>%
+   select(`SalePrice`, `X1st.Flr.SF`, `X2nd.Flr.SF`, `Total.Bsmt.SF`, `Garage.Area`,
+          `Wood.Deck.SF`, `Open.Porch.SF`, `Lot.Area`, `Year.Built`, `Yr.Sold`) %>%
+   # arrange: sort the resulting sales tibble by the SalePrice
+   arrange(SalePrice)
Note that successive data manipulation operations are chained together using the %>% operator. Only rows with a Bldg.Type of 1Fam and a Sale.Condition of Normal are chosen using the filter() function. The columns of interest are kept using the select() function. The resulting sales dataset is ordered by the SalePrice column using the arrange() function.
Figure 5.8: Ten records of the sales data sorted by the SalePrice column
The following R code creates a histogram with 32 bins using the ggplot2 package, where the x-axis corresponds to the SalePrice column of the sales data frame. The plot title and axis labels are set using the labs() function. The dollar_format() function from the scales package is used by the scale_x_continuous() function to format the x-axis labels as dollar amounts.
> # geom_histogram() is used to create the histogram
> options(scipen = 999)
> ggplot(sales, aes(x = SalePrice)) +
+   geom_histogram(color = "black", fill = "cyan") +
+   scale_x_continuous(labels = scales::dollar_format(prefix = "$")) +
+   labs(x = "SalePrice", y = "Count", title = "Histogram of SalePrice")
Figure 5.9: Histogram of the SalePrice
There is a lot of variation and a clearly skewed distribution in Figure 5.9. A few homes with extraordinarily
high prices can be seen in the long tail to the right. There are no homes in the short left tail that sold
for less than $35 000.
The following command lines create a scatter plot, where the x-axis corresponds to the X1st.Flr.SF column and the y-axis corresponds to the SalePrice column of the sales data frame.
> # aes(): map 'X1st.Flr.SF' to the x-axis and 'SalePrice' to the y-axis
> # geom_point(): create a scatter plot with red points
> options(scipen = 999)
> ggplot(sales, aes(x = `X1st.Flr.SF`, y = SalePrice)) +
+   geom_point(color = "red") +
+   labs(x = "X1st.Flr.SF", y = "SalePrice", title = "SalePrice against X1st.Flr.SF") +
+   scale_y_continuous(labels = function(y) paste0("$", y))
Figure 5.10: Scatter plot of the SalePrice by X1st.Flr.SF
The SalePrice cannot be predicted well by a single attribute alone. For example, the X1st.Flr.SF variable, expressed in square feet, correlates with SalePrice but only partly explains its variability. Note that one square foot (sq ft) is approximately equal to 0.0929 square metres (sq m). The SalePrice and X1st.Flr.SF variables show a strong positive correlation, as seen in Figure 5.10 and confirmed by the calculated correlation coefficient of 0.6424663 in the following R code:
> # calculate the correlation coefficient
> cor(sales$SalePrice, sales$`X1st.Flr.SF`)
[1] 0.6424663
The following R code selects all the columns of interest using the sales[c()] syntax and then computes the correlation matrix between them using the cor() function. Each cell in the resulting 10 × 10 matrix stored in the corr_matrix variable reflects the correlation coefficient between two variables. To extract the correlation coefficients between SalePrice and the other variables, subset corr_matrix by choosing the row corresponding to SalePrice and the columns relating to the other variables.
> # calculate correlation coefficients of each attribute against SalePrice
> corr_matrix <- cor(sales[c("SalePrice", "X1st.Flr.SF", "X2nd.Flr.SF",
+                            "Total.Bsmt.SF", "Garage.Area", "Wood.Deck.SF",
+                            "Open.Porch.SF", "Lot.Area", "Year.Built", "Yr.Sold")])
>
> # print the correlation coefficients
> print(corr_matrix["SalePrice", c("X1st.Flr.SF", "X2nd.Flr.SF",
+                                  "Total.Bsmt.SF", "Garage.Area", "Wood.Deck.SF",
+                                  "Open.Porch.SF", "Lot.Area", "Year.Built", "Yr.Sold")])
  X1st.Flr.SF   X2nd.Flr.SF Total.Bsmt.SF   Garage.Area  Wood.Deck.SF
   0.64246625    0.35752189    0.65297863    0.63859449    0.35269867
Open.Porch.SF      Lot.Area    Year.Built       Yr.Sold
   0.33690942    0.29082346    0.56516475    0.02594858
Note that none of the individual variables, apart from SalePrice itself, has a correlation coefficient with SalePrice of more than 0.7.
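If you want the attributes ranked by the strength of their linear association with SalePrice, the extracted row can be sorted; a minimal sketch, assuming the corr_matrix object from the previous block (output not shown):

> # rank the attributes by their correlation with SalePrice (strongest first)
> sort(corr_matrix["SalePrice", -1], decreasing = TRUE)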
Complete Activity 5.3 in the Exercise Manual before you proceed with this section.
Multiple linear regression involves using numerical input variables to predict a numerical output. In order to do this, it multiplies the value of each variable by a specific slope and aggregates the results. For instance, in this illustration, the slope for X1st.Flr.SF represents the contribution of the first-floor space of the house to the overall prediction.
Before making predictions, data are split into two equal sets: a training set and a test set.
> # split data into train and test sets
> set.seed(1001)
> train <- sales[1:1001, ]
> test <- sales[1002:nrow(sales), ]
> cat(nrow(train), 'training and', nrow(test), 'test instances.\n')
1001 training and 1001 test instances.
In multiple regression, the slopes form an array with a single slope value for each attribute. By multiplying each attribute by its slope and adding the results, we can predict the SalePrice.
> # define predict function
> predict <- function(slopes, row) {
+   sum(slopes * as.numeric(row))
+ }
> example_row <- test[, !names(test) %in% "SalePrice"][1, ]
> cat('Predicting sale price for:', paste0(example_row, collapse = ", "), "\n")
Predicting sale price for: 1287, 0, 1063, 576, 364, 17, 9830, 1959, 2010
> example_slopes <- rnorm(length(example_row), mean = 10, sd = 1)
> cat('Using slopes:', paste0(example_slopes, collapse = ", "), "\n")
Using slopes: 12.1886480934024, 9.82245266527506, 9.81472472040473, 7.49346378650813, 9.44268866267632, 9.85644054650458, 11.0915017022335, 9.37705626728088, 9.0925396147486
> cat('Result:', predict(example_slopes, example_row), "\n")
Result: 179715.9
A predicted SalePrice is the end result, which can be compared with the actual SalePrice to determine whether the slopes are reliable predictors. We should not expect the example slopes above to make good predictions at all, since they were selected at random.
> cat('Actual sale price:', test$SalePrice[1], "\n")
Actual sale price: 162000
> cat('Predicted sale price using random slopes:', predict(example_slopes, example_row), "\n")
Predicted sale price using random slopes: 179715.9
The definition of the least squares objective is the next step in performing multiple regression. In order to determine the root mean squared error (RMSE) of the predictions against the actual prices, we first make a prediction for each row in the training set.
> rmse <- function(slopes, attributes, prices) {
+   errors <- sapply(1:length(prices), function(i) {
+     predicted <- predict(slopes, attributes[i, ])
+     actual <- prices[i]
+     (predicted - actual) ^ 2
+   })
+   mean(errors) ^ 0.5
+ }
> train_prices <- train$SalePrice
> train_attributes <- train[, !names(train) %in% "SalePrice"]
> rmse_train <- function(slopes) {
+   rmse(slopes, train_attributes, train_prices)
+ }
> cat('RMSE of all training examples using random slopes:', rmse_train(example_slopes), "\n")
RMSE of all training examples using random slopes: 58433.83
The following R code uses the nloptr package to find the best slopes for a linear regression model. The best slopes are found by minimising the RMSE on the training dataset. The RMSE is a measure of how well a model fits a set of data points; the lower the RMSE, the better the fit of the model.
Note that the slopes in example_slopes are first-guess values supplied through the x0 input. The rmse_train function is the one that will be minimised, as specified by the eval_f argument. The nloptr optimisation choices are listed in the opts parameter. Here, the algorithm is NLOPT_LN_SBPLX and the value of xtol_rel is set to 1.0e-6, which specifies the relative tolerance used as the stopping criterion for the optimisation.
> # define best_slopes by minimising rmse_train with the nloptr package
> install.packages("nloptr")
> library(nloptr)
> best_slopes <- nloptr(x0 = example_slopes,
+                       eval_f = rmse_train,
+                       opts = list("algorithm" = "NLOPT_LN_SBPLX", "xtol_rel" = 1.0e-6))
> cat('The best slopes for the training set:\n')
The best slopes for the training set:
> data.frame(names(train_attributes), best_slopes$x) %>%
+   rename("Feature" = 1, "Coefficient" = 2) %>%
+   knitr::kable()

|Feature       | Coefficient|
|:-------------|-----------:|
|X1st.Flr.SF   |   12.188648|
|X2nd.Flr.SF   |    9.822453|
|Total.Bsmt.SF |    9.814725|
|Garage.Area   |    7.493464|
|Wood.Deck.SF  |    9.442689|
|Open.Porch.SF |    9.856440|
|Lot.Area      |   11.091502|
|Year.Built    |    9.377056|
|Yr.Sold       |    9.092540|
The output table shows the best slopes for each feature in the linear regression model. The coefficients indicate the degree to which each feature affects the SalePrice target variable.
> cat('RMSE of all training examples using the best slopes:', rmse_train(best_slopes$x), "\n")
RMSE of all training examples using the best slopes: 58433.83
The function rmse_test in the following code takes a vector of slopes as input, computes the root mean squared error (RMSE) for the multiple linear regression model using these slopes and the test data, and returns the RMSE value.
> # define rmse_test function
> test_prices <- test$SalePrice
> test_attributes <- test[, !names(test) %in% "SalePrice"]
> rmse_test <- function(slopes) {
+   rmse(slopes, test_attributes, test_prices)
+ }
> rmse_linear <- rmse_test(best_slopes$x)
> cat('Test set RMSE for multiple linear regression:', rmse_linear, "\n")
Test set RMSE for multiple linear regression: 117683.7
Interpreting the results, we can say that the RMSE of 117 683.7 is a measure of how well the multiple linear regression model fits the test data. A smaller RMSE value usually indicates a better fit. Therefore, the high RMSE value in our output suggests that the model is not performing well on the test data.
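To put the test RMSE into perspective, it can be compared with the scale and spread of the test-set prices themselves; a short sketch, assuming the objects defined above (output not shown):

> # test RMSE relative to the average and the standard deviation of the test prices
> rmse_linear / mean(test_prices)
> rmse_linear / sd(test_prices)

If the RMSE is of the same order as the standard deviation of the prices, the model is doing little better than predicting the mean price for every house.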
> # define fit function
> fit <- function(row) {
+   sum(best_slopes$x * as.numeric(row))
+ }
> test %>%
+   mutate(Fitted = apply(select(., -SalePrice), 1, fit)) %>%
+   ggplot(aes(x = Fitted, y = SalePrice)) +
+   geom_point(color = "blue") +
+   geom_abline(intercept = 0, slope = 1, color = "red", lwd = 1.5) +
+   ggtitle("Scatter Plot of Fitted vs SalePrice")
> dev.off()
Figure 5.11: Scatter plot of the fitted values vs SalePrice
The majority of the blue dots are clustered tightly around the red line in Figure 5.11, demonstrating that the multiple linear regression model fits the test data reasonably well. However, certain outliers suggest that not all data points may match the model perfectly. Overall, Figure 5.11 offers a quick and simple method for assessing how well the multiple linear regression model performed on the test dataset.
A residual plot is a graphical tool for visually assessing the quality of fit of the multiple linear regression model. It plots the residuals, that is, the gap between the predicted and actual sale prices, against the actual sale prices. Let us draw the residual plot of SalePrice:
> library(ggplot2)
> test %>%
+   mutate(Residual = test_prices - apply(select(., -SalePrice), 1, fit)) %>%
+   ggplot(aes(x = SalePrice, y = Residual)) +
+   geom_point(color = "blue") +
+   geom_hline(yintercept = 0, color = "red", lwd = 1.5) +
+   xlim(0, 7e5) +
+   ggtitle("A residual plot for multiple regression")
Figure 5.12: A residual plot for multiple regression
Figure 5.12 suggests that the multiple linear regression model fits the test data reasonably well, since the majority of the blue points are scattered around the zero line. However, there are certain patterns in the residuals, such as a slight curve, which raise the possibility that not all of the patterns in the data are captured by the model. This can be a sign of non-linear relationships or of other variables that have not been taken into account. The residual plot therefore provides a visual performance evaluation of the multiple linear regression model, and areas for model improvement can also be identified in this way.
Complete Activity 5.4 and Activity 5.5 in the Exercise Manual before you proceed to the next section.
5.4 Regression diagnostics
Refer to section 4.3 in STA1502 for background information on the assumptions of a linear regression
model as well as diagnostic tools for verifying those assumptions. The examination of the residual error
is one of the diagnostic methods for verifying the assumptions.
Generally, regression diagnostics are used to determine whether a model is consistent with its assumptions; and whether one or more observations are inadequately represented by the model. With the aid
of these tools, researchers can assess whether a model appropriately represents the data in their study.
This section evaluates the quality of a linear regression study visually with the use of residual plots.
These evaluations are also known as diagnostics. The diagnostic plots additionally display residuals in
four distinct ways:
1. Normal Q-Q plot. This is used to examine whether or not the residuals are distributed normally.
The normal probability plot of residuals should approximately follow a straight line. For the
salaries dataset in Section 5.1, Example 5.1.1, it follows that:
> # qqPlot() comes from the car package
> library(car)
> salary.model <- lm(Salary ~ ., data = empl_salaries)
> qqPlot(resid(salary.model), main = "Normal Q-Q plot")
[1] 30 24
According to the output, points 24 and 30 are out of line with the observations. They are
recognised as outliers.
> qqPlot(resid(salary.model), main = "Normal Q-Q plot")
> qqline(resid(salary.model), col = "steelblue", lwd = 2)
Figure 5.13: The normal Q-Q plot for the salary.model developed from the empl_salaries data
The data in Figure 5.13 show that the residual points fall approximately along the reference line within the interval [−2, 2]. Even so, we are a little hesitant to say that the residuals are normally distributed, because there is a little more of a pattern and some noticeable non-linearities in the plot. The results should serve as a reminder that no model is perfect, even if we do not necessarily reject the model on the basis of this one check. The residual histogram in Figure 5.14 corroborates these interpretations.
> # standardised residuals; studres() comes from the MASS package
> library(MASS)
> std_resid <- studres(salary.model)
> hist(std_resid, freq = FALSE,
+      main = "Distribution of standardised residuals",
+      xlab = "Standardised residuals")
> xsalary.model <- seq(min(std_resid), max(std_resid), length = 30)
> ysalary.model <- dnorm(xsalary.model)
> lines(xsalary.model, ysalary.model, lwd = 2, col = "red")
Figure 5.14: Distribution of the standardised residuals
2. Residuals vs Fitted. This is used to verify the assumption of linearity. In Figure 5.15, a horizontal red line with no trend in the observations around it is an indicator of a linear relationship.
> # residual vs fitted plot
> plot(fitted(salary.model), resid(salary.model),
+      main = "Residual vs Fitted plot", col = "blue",
+      ylab = "Residuals", xlab = "Fitted values",
+      frame = FALSE)
> # add a horizontal line at 0
> abline(0, 0, col = "red", lwd = 2)
Figure 5.15: Residuals against fitted values
Figure 5.15 plots the residual of each observation against its fitted value. There is not a uniform distribution of observations around the reference line (the red line). Since the majority of the observations fall between fitted values of 20 000 and 80 000, we can conclude that the variation of the estimates is concentrated slightly to the right. Additionally, since there are not enough observations between 90 000 and 120 000, it is probable that we have a problem with heteroscedasticity. A regression model is said to be heteroscedastic when the variance of the residual term, or error term, fluctuates significantly or is non-constant.
3. Spread-Location. This is done to check for homogeneity in the variance or homoscedasticity
of the residuals. If this is the case, there should be a horizontal line with evenly spread points.
The ncvTest, also known as the non-constant variance test, is a test for heteroscedasticity. The
variance of the residuals in a normal linear model is assumed to be constant. Let’s examine the
syntax in R:
> ncvTest(salary.model)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 13.1946, Df = 1, p = 0.00028076
The variance appears to NOT be constant because the ncvTest p-value is less than 0.05. The
following command-line generates a spreadLevelPlot based on the salary.model:
> spreadLevelPlot(salary.model,
+                 main = "Spread-Level Plot for salary model")

Suggested power transformation:  -0.1104097
Figure 5.16: Spread-level plot for salary.model
Figure 5.16 shows potential spread-level dependencies, that is, how the spread of the studentised residuals from the salary.model changes with the fitted values. The positive relationship between the estimated Salary and the error in the estimation of Salary further demonstrates the presence of heteroscedasticity.
4. Residuals vs Leverage. This is used to find influential or extreme values. The inclusion or exclusion of these data points from the study may have an effect on the findings of the linear regression. Take a look at the following R code and results:
> par(mfrow = c(1, 2))
> plot(salary.model, 4)
> plot(salary.model, 5, lwd = 2)
Figure 5.17: Cook’s distance and residuals against leverage
Cook's distance, a metric for the influence of individual observations, is presented on the left-hand side of Figure 5.17. Possible outliers include points 24, 29 and 30. Point 30 also stands out in the residuals-versus-leverage plot (right-hand side of Figure 5.17), which indicates that the model did not account for it adequately and that it has a noticeable influence on the outcomes. All potential outliers flagged by Cook's distance that could affect the findings can be eliminated from the model to improve the output, as sketched below.
Complete Activity 5.6 in the Exercise Manual before you proceed to the next section.
5.5 Inference and prediction intervals
A linear regression model can be useful for two main reasons:
1. Evaluating the relationship between one or more independent variables and the dependent variable.
2. Predicting future values using the regression model.
In relation to point number 2, it is sometimes of relevance to predict both the exact values and an
interval that comprises a range of likely values. The interval is known as a prediction interval. Note that
in contrast to the confidence interval, which represents uncertainty around mean predicted values, the
prediction interval reflects uncertainty around a single value. This section demonstrates how to apply
the prediction interval for the regression model.
Use the regression model in Example 5.1.1 to predict the value of Salary from the fitted regression model for a new value of 18 Years of experience.
> # fit simple regression model
> model <- lm(Salary ~ Years_of_experience, data = empl_salaries)
> # confidence interval
> predict(model, data.frame(Years_of_experience = 18), interval = "confidence")
       fit    lwr      upr
1 133786.7 119851 147722.4
> # prediction interval
> predict(model, data.frame(Years_of_experience = 18), interval = "prediction")
       fit      lwr      upr
1 133786.7 106879.1 160694.3
The output indicates that someone with 18 Years of experience is predicted to earn R133 786.70. The findings also indicate that the 95% prediction interval for that Salary is between R106 879.10 and R160 694.30. Also keep in mind that the margins of error for the prediction interval and the confidence interval differ slightly: the prediction interval is wider than the confidence interval because it predicts a single value rather than the average value.
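The same predict() call accepts several new values at once; a short sketch with hypothetical Years of experience values chosen for illustration (output not shown):

> # prediction intervals for several hypothetical values of Years of experience
> new_exp <- data.frame(Years_of_experience = c(5, 10, 18))
> predict(model, new_exp, interval = "prediction")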
Now that you have reached the end of Learning Unit 5, start to complete Assessment 6 as outlined
in the Activities section on the module site and ensure that you submit the completed assessment for
formal evaluation once you have reached the end of Learning Unit 6.