
COMPLETE BUSINESS STATISTICS
by Amir D. Aczel & Jayavel Sounderpandian
7th edition
Prepared by Lloyd Jaisingh, Morehead State University
Chapter 10
Simple Linear Regression and Correlation
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
10-2
10 Simple Linear Regression and Correlation
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression
10-3
10 LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable
10-4
10 LEARNING OBJECTIVES (continued)
After studying this chapter, you should be able to:
• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check whether the assumptions about the regression model are valid
• Solve regression problems using spreadsheet templates
• Use the LINEST function to carry out a regression
10-5
10-1 Using Statistics
• Regression refers to the statistical technique of modeling the
relationship between variables.
• In simple linear regression, we model the relationship
between two variables.
• One of the variables, denoted by Y, is called the dependent
variable and the other, denoted by X, is called the
independent variable.
• The model we will use to depict the relationship between X and
Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
10-6
10-1 Using Statistics
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis.
[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]
We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.
10-7
Examples of Other Scatterplots
[Figure: six scatterplots illustrating different patterns of association between X and Y]
10-8
Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.
A statistical model separates the systematic component of a relationship from the random component:
Data = Systematic component + Random errors
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
10-9
10-2 The Simple Linear Regression Model
The population simple linear regression model:
Y = β0 + β1X + ε
where β0 + β1X is the nonrandom (systematic) component and ε is the random component, and:
• Y is the dependent variable, the variable we wish to explain or predict
• X is the independent variable, also called the predictor variable
• ε is the error term, the only random component in the model, and thus the only source of randomness in Y
• β0 is the intercept of the systematic component of the regression relationship
• β1 is the slope of the systematic component
The conditional mean of Y: E[Y|X] = β0 + β1X
10-10
Picturing the Simple Linear Regression Model
[Figure: regression plot of the line E[Y] = β0 + β1X, showing the intercept β0, the slope β1, and the error εi between an observed point (Xi, Yi) and the line]
The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
E[Yi] = β0 + β1Xi
Actual observed values of Y differ from the expected value by an unexplained or random error:
Yi = E[Yi] + εi = β0 + β1Xi + εi
10-11
Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εi.
• The errors εi are normally distributed with mean 0 and variance σ², and are uncorrelated (not related) in successive observations. That is: εi ~ N(0, σ²).
[Figure: the regression line E[Y] = β0 + β1X with identical normal distributions of errors, all centered on the regression line]
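To make these assumptions concrete, here is a minimal Python sketch that simulates data from a simple linear regression model with fixed X values and normal errors. The intercept, slope, and error standard deviation used here (5, 2, and 1.5) are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 5.0, 2.0, 1.5        # illustrative population parameters
x = np.linspace(1, 20, 25)                 # fixed (non-random) X values
eps = rng.normal(0.0, sigma, size=x.size)  # i.i.d. N(0, sigma^2) errors
y = beta0 + beta1 * x + eps                # Y = beta0 + beta1*X + error

# The conditional mean E[Y|X] is the straight line beta0 + beta1*X;
# the simulated points scatter around it with constant variance.
print(np.column_stack((x, y))[:5])
```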
10-12
10-3 Estimation: The Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
The estimated regression equation:
Y = b0 + b1X + e
where b0 estimates the intercept of the population regression line, β0; b1 estimates the slope of the population regression line, β1; and e stands for the observed errors, the residuals from fitting the estimated regression line b0 + b1X to a set of n points.
The estimated regression line:
Ŷ = b0 + b1X
where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
10-13
Fitting a Regression Line
[Figure: four panels showing (1) the data, (2) three errors from a fitted line, (3) three errors from the least squares regression line, and (4) that errors from the least squares regression line are minimized]
10-14
Errors in Regression
[Figure: the fitted regression line Ŷ = b0 + b1X, an observed data point (Xi, Yi), and the predicted value Ŷi for Xi]
The error for observation i is the difference between the observed value and the value predicted by the fitted regression line:
ei = Yi − Ŷi
10-15
Least Squares Regression
The sum of squared errors in regression is:
SSE = Σi=1..n ei² = Σi=1..n (yi − ŷi)²
The least squares regression line is the one that minimizes the SSE with respect to the estimates b0 and b1.
The normal equations:
Σyi = n·b0 + b1·Σxi
Σxiyi = b0·Σxi + b1·Σxi²
[Figure: SSE plotted as a surface over (b0, b1); at the least squares values of b0 and b1, SSE is minimized]
10-16
Sums of Squares, Cross Products, and Least Squares Estimators
Sums of squares and cross products:
SSx = Σ(x − x̄)² = Σx² − (Σx)²/n
SSy = Σ(y − ȳ)² = Σy² − (Σy)²/n
SSxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Least-squares regression estimators:
b1 = SSXY / SSX
b0 = ȳ − b1·x̄
10-17
Example 10-1

Miles    Dollars   Miles²        Miles×Dollars
1211     1802      1466521       2182222
1345     2405      1809025       3234725
1422     2005      2022084       2851110
1687     2511      2845969       4236057
1849     2332      3418801       4311868
2026     2305      4104676       4669930
2133     3016      4549689       6433128
2253     3385      5076009       7626405
2400     3090      5760000       7416000
2468     3694      6091024       9116792
2699     3371      7284601       9098329
2806     3998      7873636       11218388
3082     3555      9498724       10956510
3209     4692      10297681      15056628
3466     4244      12013156      14709704
3643     5298      13271449      19300614
3852     4801      14837904      18493452
4033     5147      16265089      20757852
4267     5738      18207288      24484046
4498     6420      20232004      28877160
4533     6059      20548088      27465448
4804     6426      23078416      30870504
5090     6321      25908100      32173890
5233     7026      27384288      36767056
5439     6964      29582720      37877196
Total:   79,448    106,605       293,426,946    390,185,014
SSx = Σx² − (Σx)²/n = 293,426,946 − (79,448)²/25 = 40,947,557.84

SSxy = Σxy − (Σx)(Σy)/n = 390,185,014 − (79,448)(106,605)/25 = 51,402,852.4

b1 = SSXY / SSX = 51,402,852.4 / 40,947,557.84 = 1.255333776 ≈ 1.26

b0 = ȳ − b1·x̄ = 106,605/25 − (1.255333776)(79,448/25) = 274.85
10-18
Template (partial output) that can be
used to carry out a Simple Regression
10-19
Template (continued) that can be used
to carry out a Simple Regression
10-20
Template (continued) that can be used
to carry out a Simple Regression
Residual Analysis. The plot shows the absence of a relationship
between the residuals and the X-values (miles).
10-21
Template (continued) that can be used
to carry out a Simple Regression
Note: The normal probability plot is approximately linear. This
would indicate that the normality assumption for the errors has not
been violated.
10-22
Total Variance and Error Variance
[Figure: two views of the same data, showing the total variation of Y when looking at Y alone, and the smaller error variance of Y seen when looking along the regression line]
10-23
10-4 Error Variance and the Standard Errors of Regression Estimators
Degrees of freedom in regression:
df = (n − 2)  (n total observations, less one degree of freedom for each parameter estimated, b0 and b1)
Square and sum all regression errors to find SSE:
SSE = Σ(Y − Ŷ)² = SSY − (SSXY)²/SSX = SSY − b1·SSXY
An unbiased estimator of σ², denoted by s²:
MSE = SSE / (n − 2)
Example 10-1:
SSE = SSY − b1·SSXY = 66,855,898 − (1.255333776)(51,402,852.4) = 2,328,161.2
MSE = SSE/(n − 2) = 2,328,161.2/23 = 101,224.4
s = √MSE = √101,224.4 = 318.158
10-24
Standard Errors of Estimates in Regression
The standard error of b0 (intercept):
s(b0) = s·√(Σx² / (n·SSX)),  where s = √MSE
The standard error of b1 (slope):
s(b1) = s / √SSX
Example 10-1:
s(b0) = s·√(Σx²/(n·SSX)) = 318.158·√(293,426,946 / ((25)(40,947,557.84))) = 170.338
s(b1) = s/√SSX = 318.158/√40,947,557.84 = 0.04972
10-25
Confidence Intervals for the Regression Parameters
A (1 − α)100% confidence interval for β0:
b0 ± t(α/2, n−2)·s(b0)
A (1 − α)100% confidence interval for β1:
b1 ± t(α/2, n−2)·s(b1)
Example 10-1 (95% confidence intervals):
b0 ± t(0.025, 23)·s(b0) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]
b1 ± t(0.025, 23)·s(b1) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]
Note that the interval for the slope does not contain 0, so 0 is not a plausible value of the regression slope at the 95% level.
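A minimal Python sketch of the same interval computation, using scipy's t quantile function and the Example 10-1 estimates and standard errors from above:

```python
from scipy import stats

# Example 10-1 estimates and standard errors (from the slides above)
b0, se_b0 = 274.85, 170.338
b1, se_b1 = 1.25533, 0.04972
n = 25

t_crit = stats.t.ppf(0.975, df=n - 2)    # about 2.069 for 23 df

ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)   # roughly [-77.6, 627.3]
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # roughly [1.152, 1.358]
print(ci_b0, ci_b1)
```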
10-26
Template (partial output) that can be used to obtain Confidence Intervals for β0 and β1
10-27
10-5 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The population correlation, denoted by ρ, can take on any value from −1 to 1:
ρ = −1 indicates a perfect negative linear relationship
−1 < ρ < 0 indicates a negative linear relationship
ρ = 0 indicates no linear relationship
0 < ρ < 1 indicates a positive linear relationship
ρ = 1 indicates a perfect positive linear relationship
The absolute value of ρ indicates the strength or exactness of the relationship.
10-28
Illustrations of Correlation
[Figure: scatterplots illustrating ρ = −1, ρ = −0.8, ρ = 0 (two patterns), ρ = 0.8, and ρ = 1]
10-29
Covariance and Correlation
The covariance of two random variables X and Y:
Cov(X, Y) = E[(X − μX)(Y − μY)]
where μX and μY are the population means of X and Y respectively.
The population correlation coefficient:
ρ = Cov(X, Y) / (σX σY)
The sample correlation coefficient*:
r = SSXY / √(SSX·SSY)
*Note: If ρ < 0, then β1 < 0; if ρ = 0, then β1 = 0; if ρ > 0, then β1 > 0.
Example 10-1:
r = SSXY/√(SSX·SSY) = 51,402,852.4/√((40,947,557.84)(66,855,898)) = 51,402,852.4/52,321,943.29 = 0.9824
10-30
Hypothesis Tests for the Correlation Coefficient
H0: ρ = 0  (no linear relationship)
H1: ρ ≠ 0  (some linear relationship)
Test statistic:
t(n−2) = r / √((1 − r²)/(n − 2))
Example 10-1:
t(n−2) = 0.9824 / √((1 − 0.9651)/(25 − 2)) = 0.9824/0.0389 = 25.25
t(0.005, 23) = 2.807 < 25.25, so H0 is rejected at the 1% level.
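A small Python sketch of the sample correlation and its t test, using the Example 10-1 sums of squares from above:

```python
import math
from scipy import stats

# Sample correlation and its t test for Example 10-1
ss_x, ss_y, ss_xy = 40_947_557.84, 66_855_898.0, 51_402_852.4
n = 25

r = ss_xy / math.sqrt(ss_x * ss_y)               # about 0.9824
t_stat = r / math.sqrt((1 - r**2) / (n - 2))     # about 25.25
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value, essentially 0
print(r, t_stat, p_value)
```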
10-31
10-6 Hypothesis Tests about the Regression Relationship
[Figure: three cases in which Y is not linearly related to X: constant Y, unsystematic variation, and a nonlinear relationship]
A hypothesis test for the existence of a linear relationship between X and Y:
H0: β1 = 0
H1: β1 ≠ 0
Test statistic for the existence of a linear relationship between X and Y:
t(n−2) = b1 / s(b1)
where b1 is the least-squares estimate of the regression slope and s(b1) is the standard error of b1. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.
Hypothesis Tests for the Regression Slope
Example 10-1:
H0: β1 = 0
H1: β1 ≠ 0
t(n−2) = b1/s(b1) = 1.25533/0.04972 = 25.25
t(0.005, 23) = 2.807 < 25.25
H0 is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.
Example 10-4:
H0: β1 = 1
H1: β1 ≠ 1
t(n−2) = (b1 − 1)/s(b1) = (1.24 − 1)/0.21 = 1.14
t(0.05, 58) = 1.671 > 1.14
H0 is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
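A hedged Python sketch of both slope tests; the helper function name is my own, and the sample sizes are taken from the degrees of freedom quoted above (23 and 58):

```python
from scipy import stats

def slope_t_test(b1, se_b1, n, beta1_null=0.0):
    """Two-sided t test for H0: beta1 = beta1_null in simple linear regression."""
    t_stat = (b1 - beta1_null) / se_b1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    return t_stat, p_value

# Example 10-1: test H0: beta1 = 0 (t is about 25.25, p essentially 0)
print(slope_t_test(1.25533, 0.04972, n=25))

# Example 10-4: test H0: beta1 = 1 (t is about 1.14, not significant)
print(slope_t_test(1.24, 0.21, n=60, beta1_null=1.0))
```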
10-33
10-7 How Good is the Regression?
The coefficient of determination, r², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.
Total deviation = Unexplained deviation (error) + Explained deviation (regression):
(y − ȳ) = (y − ŷ) + (ŷ − ȳ)
Squaring and summing over all observations:
Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
SST = SSE + SSR
r² = SSR/SST = 1 − SSE/SST, the percentage of total variation explained by the regression.
[Figure: a data point's total deviation from ȳ split into the unexplained deviation about the regression line and the explained deviation of the line from ȳ]
10-34
The Coefficient of Determination
[Figure: scatterplots illustrating r² = 0, r² = 0.50, and r² = 0.90, with SST partitioned into SSE and SSR, together with the Example 10-1 plot of dollars against miles]
Example 10-1:
r² = SSR/SST = 64,527,736.8/66,855,898 = 0.96518
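A two-line check of this r² in Python, using the sums of squares from the Example 10-1 ANOVA table below:

```python
# Coefficient of determination for Example 10-1
ssr = 64_527_736.8      # regression sum of squares
sst = 66_855_898.0      # total sum of squares
sse = sst - ssr         # error sum of squares, about 2,328,161

r_squared = ssr / sst               # about 0.96518
print(r_squared, 1 - sse / sst)     # same value either way
```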
10-35
10-8 Analysis-of-Variance Table and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              (n − 2)              MSE
Total                 SST              (n − 1)              MST

Example 10-1

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square    F Ratio   p Value
Regression            64,527,736.8     1                    64,527,736.8   637.47    0.000
Error                 2,328,161.2      23                   101,224.4
Total                 66,855,898.0     24
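A minimal Python sketch of the F test from this table, using scipy's F distribution and the Example 10-1 sums of squares:

```python
from scipy import stats

# ANOVA F test for the Example 10-1 regression
ssr, sse = 64_527_736.8, 2_328_161.2
df_reg, df_err = 1, 23

msr = ssr / df_reg
mse = sse / df_err
f_ratio = msr / mse                            # about 637.5
p_value = stats.f.sf(f_ratio, df_reg, df_err)  # essentially 0
print(f_ratio, p_value)
```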
10-36
Template (partial output) that displays Analysis of
Variance and an F Test of the Regression Model
10-37
10-9 Residual Analysis and Checking for Model Inadequacies
[Figure: four plots of residuals against x, ŷ, or time]
• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals increases as x changes.
• Residuals exhibit a linear trend with time.
• A curved pattern in the residuals results from an underlying nonlinear relationship.
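A minimal sketch of how such a residual plot can be produced in Python (matplotlib), using simulated data rather than the textbook example; the point is simply residuals on the vertical axis against x on the horizontal axis, with a reference line at zero:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)  # simulated straight-line data

b1, b0 = np.polyfit(x, y, 1)       # least-squares fit (slope, intercept)
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residuals vs. x: a random, even scatter suggests no model inadequacy")
plt.show()
```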
10-38
Normal Probability Plot of the
Residuals
Flatter than Normal
10-39
Normal Probability Plot of the
Residuals
More Peaked than Normal
10-40
Normal Probability Plot of the
Residuals
Positively Skewed
10-41
Normal Probability Plot of the
Residuals
Negatively Skewed
10-42
10-10 Use of the Regression Model for Prediction
• Point Prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.
• Prediction Interval
   For a value of Y given a value of X: accounts for variation in the regression line estimate and variation of points around the regression line.
   For an average value of Y given a value of X: accounts for variation in the regression line estimate only.
10-43
Errors in Predicting E[Y|X]
[Figure: two panels showing (1) uncertainty about the slope of the regression line, bounded by upper and lower limits on the slope, and (2) uncertainty about the intercept of the regression line, bounded by upper and lower limits on the intercept]
10-44
Prediction Interval for E[Y|X]
[Figure: the regression line with the prediction band for E[Y|X]]
• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.
10-45
Additional Error in Predicting an Individual Value of Y
[Figure: variation of individual points around the regression line, and the prediction band for an individual Y compared with the narrower prediction band for E[Y|X]]
10-46
Prediction Interval for a Value of Y
A (1 − α)100% prediction interval for Y:
ŷ ± t(α/2, n−2) · s · √(1 + 1/n + (x − x̄)²/SSX)
Example 10-1 (X = 4,000):
{274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84)
= 5,296.05 ± 676.62 = [4,619.43, 5,972.67]
10-47
Prediction Interval for the Average Value of Y
A (1 − α)100% prediction interval for E[Y|X]:
ŷ ± t(α/2, n−2) · s · √(1/n + (x − x̄)²/SSX)
Example 10-1 (X = 4,000):
{274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1/25 + (4,000 − 3,177.92)²/40,947,557.84)
= 5,296.05 ± 156.48 = [5,139.57, 5,452.53]
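A Python sketch that reproduces both intervals at X = 4,000, using the Example 10-1 quantities quoted above:

```python
import math
from scipy import stats

# Prediction intervals at X = 4,000 for Example 10-1 (values from the slides above)
b0, b1 = 274.85, 1.2553
s, n = 318.16, 25
x_bar, ss_x = 3177.92, 40_947_557.84
x_new = 4000

y_hat = b0 + b1 * x_new
t_crit = stats.t.ppf(0.975, df=n - 2)
leverage = 1 / n + (x_new - x_bar) ** 2 / ss_x

half_width_individual = t_crit * s * math.sqrt(1 + leverage)  # about 677
half_width_mean = t_crit * s * math.sqrt(leverage)            # about 156
print((y_hat - half_width_individual, y_hat + half_width_individual))
print((y_hat - half_width_mean, y_hat + half_width_mean))
```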
10-48
Template Output with Prediction
Intervals
10-49
10-11 The Excel Solver Method for
Regression
The solver macro available in EXCEL can also be used to conduct a
simple linear regression. See the text for instructions.
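The Solver approach amounts to searching numerically for the intercept and slope that minimize SSE rather than using the closed-form formulas. A rough Python analogue of that idea (not the Excel procedure itself), on a small hypothetical data set, might look like this:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical small data set; Solver-style fitting minimizes SSE directly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

def sse(params):
    b0, b1 = params
    return np.sum((y - (b0 + b1 * x)) ** 2)

result = minimize(sse, x0=[0.0, 1.0])  # numerical search over (b0, b1)
print(result.x)                         # close to the least-squares estimates
```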
10-50
Using Minitab Fitted-Line Plot for Regression
[Figure: Minitab fitted line plot, Y = −0.8465 + 1.352 X, with S = 0.184266, R-Sq = 95.2%, R-Sq(adj) = 94.8%]