Correlation and Linear Regression

Chapter 13
McGraw-Hill/Irwin
Copyright © 2012 by The McGraw-Hill Companies, Inc. All rights reserved.
Topics
1. Correlation
   - Scatter diagram
   - Correlation coefficient
   - Test on the correlation coefficient
2. Simple Linear Regression Analysis
   - Estimation
   - Validity of the model—test on the slope coefficient
   - Fitness of the model—coefficient of determination
   - Prediction
   - The error term and residuals
13-2
Regression Analysis Introduction
Recall that in Chapter 4 we used Applewood Auto Group data to show the relationship between two variables with a scatter diagram. The profit for each vehicle sold and the age of the buyer were plotted on an XY graph. The graph showed that, in general, the profit for each vehicle increased as the age of the buyer increased.
This chapter develops numerical measures to express the strength of the relationship between two variables.
In addition, an equation is used to express the relationship between variables, allowing us to estimate one variable on the basis of another.
EXAMPLES
- Does the amount Healthtex spends per month on training its sales force affect its monthly sales?
- Is the number of square feet in a home related to the cost to heat the home in January?
- In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?
- Does the number of hours that students studied for an exam influence the exam score?
13-3
Dependent Versus Independent Variable
The Dependent Variable is the variable being predicted or
estimated.
The Independent Variable provides the basis for estimation. It
is the predictor variable.
In each of the questions below, which variable is dependent and which is independent?
1. Does the amount Healthtex spends per month on training its sales force affect its monthly sales?
2. Is the number of square feet in a home related to the cost to heat the home in January?
3. In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?
4. Does the number of hours that students studied for an exam influence the exam score?
13-4
Scatter Diagram Example
The sales manager of Copier Sales of America, which has a large sales
force throughout the United States and Canada, wants to determine
whether there is a relationship between the number of sales calls made
in a month and the number of copiers sold that month. The manager
selects a random sample of 10 representatives and determines the number
of sales calls each representative made last month and the number of
copiers sold.
13-5
Minitab Scatter Plots
[Three Minitab scatter plots illustrating strong positive correlation, no correlation, and weak negative correlation]
13-6
The Coefficient of Correlation, r
The coefficient of correlation (r) is a measure of the strength of the relationship between two variables.
- It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables.
- It can range from -1.00 to +1.00.
- Values of -1.00 or +1.00 indicate perfect correlation.
- Values close to 0.0 indicate weak correlation.
- Negative values indicate an inverse (negative) relationship; positive values indicate a direct (positive) relationship.
13-7
Correlation Coefficient - Interpretation
13-8
Correlation Coefficient - Example
For the copier sales data (X = sales calls, Y = copiers sold):
X̄ = 22, sX = 9.189
Ȳ = 45, sY = 14.337
Excel: =CORREL(VAR1, VAR2) returns r = 0.759
What does a correlation of 0.759 mean?
First, it is positive, so there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong.
However, does this mean that more sales calls cause more sales?
No, we have not demonstrated cause and effect here, only that the two variables—sales calls and copiers sold—are related.
13-9
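As a check on the Excel result, r can be reproduced directly from the ten observations in the copier example. This is a minimal Python sketch using only the standard library; the variable names are my own:

```python
import math

# Copier Sales of America sample: sales calls made (X) and copiers sold (Y)
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n  # 22.0
mean_y = sum(sales) / n  # 45.0

# Sums of squares and cross-products about the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
sxx = sum((x - mean_x) ** 2 for x in calls)
syy = sum((y - mean_y) ** 2 for y in sales)

# Pearson coefficient of correlation (the same value Excel's =CORREL returns)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.759
```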
Testing the Significance of the Correlation Coefficient
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Test statistic: t = r√(n - 2) / √(1 - r²), with n - 2 degrees of freedom
Rejection region: reject H0 if t > t(α/2, n-2) or t < -t(α/2, n-2)
13-10
Testing the Significance of the Correlation Coefficient – Copier Sales Example
1. H0: ρ = 0 (the correlation in the population is 0)
   H1: ρ ≠ 0 (the correlation in the population is not 0)
2. α = .05
3. Computing t: t = r√(n - 2) / √(1 - r²) = .759√(10 - 2) / √(1 - .759²) = 3.297
4. Reject H0 if t > t(α/2, n-2) or t < -t(α/2, n-2); here t(0.025, 8) = 2.306, so reject if t > 2.306 or t < -2.306.
5. The computed t (3.297) falls in the rejection region, so we reject H0. The correlation in the population is not zero. From a practical standpoint, this tells the sales manager that, in the population of salespeople, the number of sales calls made and the number of copiers sold are correlated.
13-11
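The arithmetic in steps 3 through 5 can be sketched in a few lines of Python. The critical value 2.306 is read from the t table, as on the slide, rather than computed:

```python
import math

r, n = 0.759, 10

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_crit = 2.306  # t(0.025, 8) from the t table (two-tail test, alpha = .05)
reject_h0 = t > t_crit or t < -t_crit

print(round(t, 3), reject_h0)  # 3.297 True
```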
One-tail Test of the Correlation Coefficient – Applewood Example
Sometimes, instead of a two-tail test on the correlation coefficient, we may want to test whether the coefficient takes a specific sign.
Example: in Chapter 4 we constructed a scatter plot of the profit earned on a vehicle sale against the age of the purchaser for the Applewood Auto Group. The correlation coefficient is found to be .262. Test at the 5% significance level to determine whether the relationship is positive. (n = 180)
[Scatter plot "Plot of Profit and Age": Profit ($0 to $3,500) versus Age (0 to 80)]
13-12
One-tail Test of the Correlation Coefficient – Applewood Example
1. H0: ρ = 0 (the correlation in the population is 0)
   H1: ρ > 0 (the correlation in the population is positive/direct)
2. α = .05
3. Computing t: t = r√(n - 2) / √(1 - r²) = .262√(180 - 2) / √(1 - .262²) = 3.622
4. Reject H0 if t > t(α, n-2); here t(0.05, 178) = 1.653, so reject if t > 1.653.
5. The computed t (3.622) falls in the rejection region, so we reject H0. The correlation in the population is positive. From a practical standpoint, there is a positive correlation between profit and age in the population.
13-13
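The one-tail version follows the same pattern as the two-tail test; only the alternative hypothesis and the critical value change (the 1.653 cutoff is the slide's t(0.05, 178) value):

```python
import math

r, n = 0.262, 180  # Applewood: correlation of profit and buyer age

# Same test statistic as in the two-tail case
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_crit = 1.653  # t(0.05, 178): one-tail critical value, alpha = .05
reject_h0 = t > t_crit  # reject only for large positive t

print(round(t, 3), reject_h0)  # 3.622 True
```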
Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
- The relationship between the variables is linear.
- Both variables must be at least interval scale.
- The least squares criterion is used to determine the equation.
REGRESSION EQUATION An equation that expresses the LINEAR relationship between two variables.
LEAST SQUARES PRINCIPLE Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
13-14
Linear Regression Model
Simple Linear Regression Model:
Y = α + βX + ε
Y is the dependent variable and X is the independent variable.
α is the Y-intercept and β is the slope; both are population parameters that must be estimated using sample data.
ε is the error term.
The model represents the linear relationship between the two variables in the population.
Estimated Linear Regression Equation:
Ŷ = a + bX
The estimated equation represents the linear relationship between the two variables as estimated from the sample.
13-15
Regression Analysis – Least Squares Principle
The least squares principle is used to obtain a and b.
13-16
Computing the Slope of the Line and the Y-intercept
13-17
Regression Equation Example
Recall the example involving
Copier Sales of America. The
sales manager gathered
information on the number of
sales calls made and the
number of copiers sold for a
random sample of 10 sales
representatives. Use the least
squares method to determine a
linear equation to express the
relationship between the two
variables.
What is the expected number of
copiers sold by a representative
who made 20 calls?
13-18
Regression Equation Example

Representative   Calls  Sales
Tom Keller         20     30
Jeff Hall          40     60
Brian Virost       20     40
Greg Fish          30     60
Susan Welch        10     30
Carlos Ramirez     10     40
Rich Niles         20     40
Mike Keil          20     50
Mark Reynolds      20     30
Soni Jones         30     70

Excel: Data -> Data Analysis -> Descriptive Statistics
(See the Excel instructions in Lind et al., p. 329, #5 for more details.)

Descriptive statistics:
                     Calls     Sales
Mean                22.000    45.000
Standard Error       2.906     4.534
Median              20.000    40.000
Mode                20.000    30.000
Standard Deviation   9.189    14.337
Sample Variance     84.444   205.556
Kurtosis             0.396    -1.001
Skewness             0.601     0.566
Range               30.000    40.000
Minimum             10.000    30.000
Maximum             40.000    70.000
Sum                220.000   450.000
Count               10.000    10.000

Step 1 – Find the slope (b) of the line.
Step 2 – Find the y-intercept (a).
Finding and Fitting the Regression Equation Example
The regression equation is:
Ŷ = a + bX
Ŷ = 18.9476 + 1.1842X
For a representative who made 20 calls, the expected number of copiers sold is
Ŷ = 18.9476 + 1.1842(20) = 42.6316
13-20
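The least squares estimates and the prediction for 20 calls can be reproduced from the raw data. This is a sketch using the standard least squares formulas b = Sxy/Sxx and a = Ȳ - bX̄:

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
sxx = sum((x - mean_x) ** 2 for x in calls)

b = sxy / sxx            # slope
a = mean_y - b * mean_x  # y-intercept

# Expected copiers sold for a representative who made 20 calls
y_hat = a + b * 20
print(round(b, 4), round(a, 4), round(y_hat, 2))  # 1.1842 18.9474 42.63
```

Note that a rounds to 18.9474 here, matching the Excel output shown later in the deck (18.94737).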
Validity of the Model – Copier Sales Example
1. H0: β = 0 (the slope is 0; there is no linear relationship; the model is invalid)
   H1: β ≠ 0 (the slope is not 0; there is a linear relationship; the model is valid)
2. α = .05
3. Test statistic: t = (b - 0) / sb = (1.1842 - 0) / 0.3591 = 3.297
4. Rejection region: reject H0 if t > t(α/2, n-2) or t < -t(α/2, n-2); here t(0.025, 8) = 2.306, so reject if t > 2.306 or t < -2.306.
5. Conclusion: the slope of the equation is significantly different from zero; there is a linear relationship between the two variables, and thus the model is valid.
13-21
One-tail Test on Slope Coefficient – Copier Sales Example
1. H0: β = 0
   H1: β > 0 (positive slope)
2. α = .05
3. Test statistic: t = (b - 0) / sb = (1.1842 - 0) / 0.3591 = 3.297
4. Rejection region: reject H0 if t > t(α, n-2); here t(0.05, 8) = 1.860, so reject if t > 1.860.
5. Conclusion: the test statistic (3.297) exceeds the critical value of 1.860, so we reject the null hypothesis and conclude that the slope is positive.
13-22
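The standard error of the slope, sb, used in both slope tests can be computed from the residuals. A sketch, using sb = se/√Sxx, where se is the standard error of estimate:

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx
a = mean_y - b * mean_x

# Sum of squared residuals (SSE), standard error of estimate, and sb
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sales))
s_e = math.sqrt(sse / (n - 2))  # standard error of estimate
s_b = s_e / math.sqrt(sxx)      # standard error of the slope

t = (b - 0) / s_b
print(round(s_b, 4), round(t, 3))  # 0.3591 3.297
```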
Fitness of the Model—Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained, or accounted for, by the variation in the independent variable (X).
- It is the square of the coefficient of correlation.
- It ranges from 0 to 1.
- The higher r² is, the better the model fits the data.
13-23
Coefficient of Determination (r²) – Copier Sales Example
Variation in Y = variation explained by variation in X + unexplained variation
r² = (variation explained by variation in X) / (total variation in Y) = SSR / SS Total
- The coefficient of determination, r², is 0.576, found by (0.759)² [recall r = 0.759].
- Interpretation: 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.
13-24
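The identity r² = SSR / SS Total can be verified numerically from the raw data. A sketch, computing the explained variation SSR as b·Sxy:

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
sxx = sum((x - mean_x) ** 2 for x in calls)
syy = sum((y - mean_y) ** 2 for y in sales)  # SS Total = 1850

b = sxy / sxx
ssr = b * sxy          # explained variation (SSR)
r_squared = ssr / syy  # coefficient of determination

print(round(ssr, 3), round(r_squared, 3))  # 1065.789 0.576
```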
Confidence Interval and Prediction Interval
Estimates of Y
• A confidence interval reports the mean value of
Y for a given X.
• A prediction interval reports the range of values
of Y for a particular value of X.
13-25
Confidence Interval Estimate - Example
We return to the Copier Sales of America illustration.
Determine a 95 percent confidence interval for all sales
representatives who make 25 calls.
Ŷ = 18.9476 + 1.1842X = 18.9476 + 1.1842(25) = 48.5526
t(α/2, n-2) = t(.025, 8) = 2.306
Thus, the 95 percent confidence interval for the average sales of all sales
representatives who make 25 calls is from 40.9170 up to 56.1882 copiers.
13-26
Prediction Interval Estimate - Example
We return to the Copier Sales of America illustration.
Determine a 95 percent prediction interval for Sheila
Baker, a West Coast sales representative who made
25 calls.
If Sheila Baker makes 25 sales calls, the number of copiers she
will sell will be between about 24 and 73 copiers.
13-27
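Both intervals can be reproduced with the standard formulas: ŷ ± t·se·√(1/n + (x - x̄)²/Sxx) for the confidence interval, with an extra 1 under the root for the prediction interval. A sketch (2.306 is t(0.025, 8) from the t table):

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx
a = mean_y - b * mean_x

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sales))
s_e = math.sqrt(sse / (n - 2))  # standard error of estimate
t_crit = 2.306                  # t(0.025, 8)

x0 = 25
y_hat = a + b * x0

# 95% confidence interval for the MEAN sales of all reps making 25 calls
ci = t_crit * s_e * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)
# 95% prediction interval for ONE representative making 25 calls
pi = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)

print(round(y_hat - ci, 3), round(y_hat + ci, 3))  # 40.917 56.188
print(round(y_hat - pi, 2), round(y_hat + pi, 2))  # 24.48 72.63
```

The prediction interval is wider than the confidence interval because it must cover the variation of an individual observation, not just the uncertainty in the estimated mean.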
Confidence and Prediction Intervals –
Minitab Illustration
13-28
Regression Analysis Using Excel
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.759014   <- equals |r|; the sign of r matches the sign of b
R Square            0.576102   <- coefficient of determination, r²
Adjusted R Square   0.523115
Standard Error      9.900824
Observations        10

SSR = 1065.789 and SS Total = 1850, so r² = 1065.789 / 1850 = .576

ANOVA
             df   SS         MS         F          Significance F
Regression    1   1065.789   1065.789   10.87248   0.010902
Residual      8   784.2105   98.02632
Total         9   1850

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    18.94737      8.498819        2.229412  0.056349  -0.65094   38.54568
Calls        1.184211      0.359141        3.297345  0.010902   0.356031   2.01239

The Coefficients column gives the estimated intercept a and slope b; the t Stat and P-value in the Calls row test the slope.
See the Excel instructions in Lind et al., p. 510, #2.
13-29
Error Term and Residuals
Simple Linear Regression Model:
Y = α + βX + ε
Error term: ε
The error term accounts for all other variables, measurable and immeasurable, that are not part of the model but can affect the magnitude of Y.
In the copier sales example, the error term may represent a salesperson's skills, effort, etc. The error term varies from one salesperson to another even if they make the same number of calls, that is, for the same value of X.
The error term makes the model probabilistic and thus differentiates it from deterministic models like Profit = Revenue - Cost.
13-30
Error Term and Residuals
Assumptions about the error term:
- The probability distribution of ε is normal.
- The mean of the distribution is 0.
- The variance of ε is constant regardless of the value of Ŷ.
- The value of ε associated with any particular value of Y is independent of the ε associated with any other value of Y. In other words, the error terms are independent of each other.
13-31
Error Term and Residuals
- Residuals are the estimated error terms, denoted ε̂.
- The residuals can be used to investigate whether the assumptions about the error term are satisfied for a regression analysis. We will explore this later.
- The residuals measure the vertical distance between each observation and the regression line.
- The residual for a specific observation is calculated as:
  ε̂ = Y - Ŷ = Y - (a + bX)
13-32
Error Term and Residuals—Example
In the copier sales example, the estimated regression equation is
Ŷ = 18.9476 + 1.1842X
Let's calculate the residual that corresponds to the observation of Soni Jones, who made 30 calls (X) and sold 70 copiers (Y).
ε̂ = Y - Ŷ = 70 - (18.9476 + 1.1842 × 30) = 70 - 54.4736 = 15.5264
13-33
Error Term and Residuals—Example
LEAST SQUARES PRINCIPLE Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
If we obtain the residuals from the regression fitted with the least squares principle, square them, and sum the squares, the total is the smallest among all possible regression lines that could be fit to the original data.

Representative   Calls (X)  Sales (Y)     Ŷ        ε̂       ε̂²
Tom Keller          20         30       42.63   -12.63   159.56
Jeff Hall           40         60       66.32    -6.32    39.89
Brian Virost        20         40       42.63    -2.63     6.93
Greg Fish           30         60       54.47     5.53    30.54
Susan Welch         10         30       30.79    -0.79     0.62
Carlos Ramirez      10         40       30.79     9.21    84.83
Rich Niles          20         40       42.63    -2.63     6.93
Mike Keil           20         50       42.63     7.37    54.29
Mark Reynolds       20         30       42.63   -12.63   159.56
Soni Jones          30         70       54.47    15.53   241.07
SUM                                                      784.21  (smallest possible)
13-34
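The full residual column and the 784.21 total can be regenerated in a few lines (a sketch; the residual for each representative is e = Y - (a + bX)):

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual for each observation, then the sum of squared residuals (SSE)
residuals = [y - (a + b * x) for x, y in zip(calls, sales)]
sse = sum(e ** 2 for e in residuals)

print([round(e, 2) for e in residuals])
print(round(sse, 2))  # 784.21
```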