Chapter 6
Simple Linear Regression and Correlation
Introduction
Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis is a statistical technique for modeling the relationship between variables.
There are two types of regression models: the multiple regression model and the simple linear regression model.
A regression model that contains more than one regressor (independent, or predictor) variable is called a multiple regression model.
A simple linear regression model has only one independent variable, or regressor.
Simple linear regression is used to model the relationship between two variables:
• One of the variables, denoted by Y, is called the dependent variable, and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y is a straight-line relationship.
Scatter diagram/ plot
A scatter diagram (scatter plot) is a graph on which each (x, y) pair is represented as a point plotted in a two-dimensional coordinate system.
[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]
This scatter plot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
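As a quick how-to, a scatter diagram like this can be drawn in a few lines of Python with matplotlib; the sketch below is illustrative only, and the advertising/sales numbers in it are made-up placeholders, not the data behind the figure above.

    # Minimal sketch: drawing a scatter diagram of two variables with matplotlib.
    # The advertising/sales values are hypothetical placeholders.
    import matplotlib.pyplot as plt

    advertising = [5, 10, 14, 18, 22, 27, 31, 36, 40, 45]      # X: advertising expenditures
    sales = [28, 41, 47, 62, 71, 80, 94, 102, 115, 126]        # Y: sales

    plt.scatter(advertising, sales)
    plt.xlabel("Advertising")
    plt.ylabel("Sales")
    plt.title("Scatterplot of Advertising Expenditures (X) and Sales (Y)")
    plt.show()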
Therefore, it is probably reasonable to assume that the mean of the random variable Y is related to x by a straight-line relationship. This line represents the nature of the relationship on average.
Examples of Other Scatterplots
[Figure: six example scatterplots showing different patterns of association between X and Y.]
Simple Linear Regression Model
The mean of the random variable Y is related to x by the following straight-line relationship:

E[Y | x] = β₀ + β₁x

where the slope β₁ and the intercept β₀ of the line are called the regression coefficients.
While the mean of Y is assumed to be a linear function of x, the actual observed value y does not fall exactly on a straight line. The actual value of Y is determined by the mean value function (the linear model) plus a random error term, ε.
Therefore, the actual value of Y (the dependent variable) is determined by the following model:

Y = β₀ + β₁X + ε

where β₀ + β₁X is the nonrandom (systematic) component and ε is the random component.
We will call this model the simple linear regression model, where:
• Y is the dependent variable, the variable we wish to explain or predict
• X is the independent variable, also called the regressor or predictor variable
• ε is the error term, the only random component in the model
• β₀ is the intercept of the systematic component of the regression relationship
• β₁ is the slope of the systematic component
Note: The choice of the model is based on inspection of a scatter diagram
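To make the systematic and random components concrete, the following is a small simulation sketch of the model; the values of β₀, β₁, σ and the x grid are arbitrary illustrative choices, not quantities from the text.

    # Simulation sketch of the simple linear regression model Y = b0 + b1*X + eps,
    # with eps ~ N(0, sigma^2). All parameter values are arbitrary illustrative choices.
    import numpy as np

    rng = np.random.default_rng(seed=1)
    beta0, beta1, sigma = 50.0, 2.0, 10.0                  # illustrative coefficients and error SD

    x = np.linspace(0, 50, 25)                             # fixed values of the regressor X
    eps = rng.normal(loc=0.0, scale=sigma, size=x.size)    # random component
    y = beta0 + beta1 * x + eps                            # responses scatter around E[Y | x]

    print(y.round(1))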
Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The errors εᵢ are normally distributed with mean 0 and variance σ², and the errors are uncorrelated (not related) in successive observations. That is, ε ~ N(0, σ²).
[Figure: the regression line E[Y] = β₀ + β₁X, with identical normal distributions of errors, all centered on the regression line.]
Estimation: The Method of Least Squares
Suppose that we have n pairs of observations (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) from the model

Y = β₀ + β₁X + ε

Estimating a simple linear regression relationship involves finding estimates of the regression coefficients β₀ and β₁ of the linear regression line.
The estimates of β₀ and β₁ should result in a line that is (in some sense) a "best fit" to the data.
The values of the regression coefficients β₀ and β₁ should be determined so that the sum of the squares of the vertical deviations of the points from the fitted line is minimized.
This criterion for estimating the regression coefficients is called the method of least squares.
Errors in Regression
[Figure: errors in regression. Yᵢ is an observed data point, Ŷ = b₀ + b₁X is the fitted regression line, and Ŷᵢ is the predicted value of Y for Xᵢ. The error εᵢ = Yᵢ − Ŷᵢ is the vertical distance between the observed point and the fitted line at Xᵢ.]
Least Squares Regression
The sum of squared errors in regression is:

SSE = L = Σεᵢ² = Σ(yᵢ − ŷᵢ)²,   where Ŷ = b₀ + b₁X and the sums run over i = 1, …, n.

The least squares regression line is the one that minimizes the SSE with respect to the estimates b₀ and b₁.
The least squares estimators of β₀ and β₁, say b₀ and b₁, must satisfy the following conditions (the partial derivatives of L with respect to b₀ and b₁ set to zero):

∂L/∂b₀ = −2 Σ(yᵢ − b₀ − b₁xᵢ) = 0
∂L/∂b₁ = −2 Σ(yᵢ − b₀ − b₁xᵢ)xᵢ = 0
Simplifying these two equations yields:

Σyᵢ = n b₀ + b₁ Σxᵢ
Σxᵢyᵢ = b₀ Σxᵢ + b₁ Σxᵢ²

These equations are called the least squares normal equations. The solution to the normal equations gives the least squares estimators b₀ and b₁.
The least squares estimates of the slope and intercept are

b₁ = ( Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n ) / ( Σxᵢ² − (Σxᵢ)²/n )
b₀ = ȳ − b₁x̄

Notationally, it is occasionally convenient to give special symbols to the numerator and denominator of the expression for b₁:

Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n
Sxx = Σxᵢ² − (Σxᵢ)²/n

Therefore, b₁ = Sxy / Sxx.
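Equivalently, the normal equations form a 2×2 linear system in b₀ and b₁ and can be solved numerically. The sketch below does this with NumPy from the summary sums n, Σx, Σy, Σx², Σxy; the helper function name is ours, and the sums plugged in are those of the miles/dollars example that follows.

    # Sketch: solving the least squares normal equations
    #   sum(y)  = n*b0      + b1*sum(x)
    #   sum(xy) = b0*sum(x) + b1*sum(x^2)
    # as a 2x2 linear system.
    import numpy as np

    def solve_normal_equations(n, sum_x, sum_y, sum_x2, sum_xy):
        A = np.array([[n, sum_x],
                      [sum_x, sum_x2]], dtype=float)
        rhs = np.array([sum_y, sum_xy], dtype=float)
        b0, b1 = np.linalg.solve(A, rhs)
        return b0, b1

    # Sums from the miles/dollars example below:
    b0, b1 = solve_normal_equations(25, 79_448, 106_605, 293_426_946, 390_185_014)
    print(b0, b1)   # approximately 274.85 and 1.2553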
The fitted or estimated regression line is therefore

Ŷ = b₀ + b₁X

Note that each pair of observations satisfies the relationship

yᵢ = b₀ + b₁xᵢ + eᵢ

where eᵢ = yᵢ − Ŷᵢ is called the residual. The residual describes the error in the fit of the model to the ith observation yᵢ.
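Once b₀ and b₁ are available, fitted values and residuals follow directly; a minimal sketch (the x and y arrays and the estimates are placeholders):

    # Sketch: fitted values and residuals for given estimates b0 and b1.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # placeholder regressor values
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])    # placeholder responses
    b0, b1 = 0.25, 1.95                        # placeholder estimates

    y_hat = b0 + b1 * x                        # fitted values on the estimated line
    residuals = y - y_hat                      # e_i = y_i - y_hat_i
    print(residuals)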
Example
Miles     Dollars     Miles²        Miles × Dollars
1211      1802        1466521       2182222
1345      2405        1809025       3234725
1422      2005        2022084       2851110
1687      2511        2845969       4236057
1849      2332        3418801       4311868
2026      2305        4104676       4669930
2133      3016        4549689       6433128
2253      3385        5076009       7626405
2400      3090        5760000       7416000
2468      3694        6091024       9116792
2699      3371        7284601       9098329
2806      3998        7873636       11218388
3082      3555        9498724       10956510
3209      4692        10297681      15056628
3466      4244        12013156      14709704
3643      5298        13271449      19300614
3852      4801        14837904      18493452
4033      5147        16265089      20757852
4267      5738        18207288      24484046
4498      6420        20232004      28877160
4533      6059        20548088      27465448
4804      6426        23078416      30870504
5090      6321        25908100      32173890
5233      7026        27384288      36767056
5439      6964        29582720      37877196
Total:    79,448      106,605       293,426,946     390,185,014
x 2

2
S xx   x 
n
2
79
,
448
 293,426,946 
 40,947,557 .84
25
 x( y)
S xy   xy   
n
 390,185,014  (79,448)(106,605)  51,402,852.4
25
S
b  XY  51,402,852 .4  1.255333776  1.26
1 S
40,947,557 .84
XX


b  y  b x  106,605  (1.255333776 ) 79,448 
0
1
25
 25 
 274 .85
10-17
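The same quantities can be reproduced directly from the Miles and Dollars columns of the table; a sketch (last-digit differences from the hand calculation are possible, since a few tabled squares are rounded):

    # Sketch: computing Sxx, Sxy, b1 and b0 from the miles/dollars data above.
    miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
             2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
             4533, 4804, 5090, 5233, 5439]
    dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
               3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
               6059, 6426, 6321, 7026, 6964]

    n = len(miles)
    sum_x, sum_y = sum(miles), sum(dollars)
    sum_x2 = sum(x * x for x in miles)
    sum_xy = sum(x * y for x, y in zip(miles, dollars))

    Sxx = sum_x2 - sum_x ** 2 / n
    Sxy = sum_xy - sum_x * sum_y / n
    b1 = Sxy / Sxx                        # slope estimate, about 1.2553
    b0 = sum_y / n - b1 * (sum_x / n)     # intercept estimate, about 274.85
    print(Sxx, Sxy, b1, b0)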
Error Variance and the Standard Errors of Regression Estimators

Degrees of freedom in regression: df = n − 2

The error sum of squares is found by squaring and summing all regression errors:

SSE = Σ(Yᵢ − Ŷᵢ)² = Syy − (Sxy)²/Sxx = Syy − b₁Sxy

An unbiased estimator of σ², denoted by s², is

MSE = SSE / (n − 2)

Example:
SSE = Syy − b₁Sxy = 66,855,898 − (1.255333776)(51,402,852.4) = 2,328,161.2
MSE = SSE / (n − 2) = 2,328,161.2 / 23 = 101,224.4
s = √MSE = √101,224.4 = 318.158
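A sketch of the same computation, reusing the summary quantities Syy, Sxy, b₁ and n from the example:

    # Sketch: error sum of squares, mean squared error and residual standard deviation.
    import math

    n = 25
    Syy = 66_855_898.0
    Sxy = 51_402_852.4
    b1 = 1.255333776

    SSE = Syy - b1 * Sxy      # about 2,328,161
    MSE = SSE / (n - 2)       # unbiased estimate of sigma^2, about 101,224
    s = math.sqrt(MSE)        # about 318.16
    print(SSE, MSE, s)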
Standard Errors of Estimates in Regression

The standard error of b₀ (intercept) is

s(b₀) = s · √( Σx² / (n · Sxx) ),   where s = √MSE

The standard error of b₁ (slope) is

s(b₁) = s / √Sxx

Example:
s(b₀) = 318.158 · √( 293,426,946 / ((25)(40,947,557.84)) ) = 170.338
s(b₁) = 318.158 / √40,947,557.84 = 0.04972
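The two standard errors can likewise be computed from the summary quantities; a sketch:

    # Sketch: standard errors of the intercept (b0) and slope (b1) estimates.
    import math

    n = 25
    s = 318.158                    # residual standard deviation, sqrt(MSE)
    sum_x2 = 293_426_946.0         # sum of x_i^2 from the example
    Sxx = 40_947_557.84

    se_b0 = s * math.sqrt(sum_x2 / (n * Sxx))   # about 170.3
    se_b1 = s / math.sqrt(Sxx)                  # about 0.0497
    print(se_b0, se_b1)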
Confidence Intervals for the Regression Parameters

In addition to point estimates of the slope and intercept, it is possible to obtain confidence interval estimates of these parameters. The width of these confidence intervals is a measure of the overall quality of the regression line. A 100(1 − α)% confidence interval for a regression coefficient is given by b ± t(α/2, n−2) · s(b), where b is its least squares estimate and s(b) its standard error.
Example: 95% confidence intervals (n − 2 = 23 degrees of freedom, t(0.025, 23) = 2.069):

b₀ ± t(0.025, 25−2) · s(b₀) = 274.85 ± (2.069)(170.338)
  = 274.85 ± 352.43
  = [−77.58, 627.28]

b₁ ± t(0.025, 25−2) · s(b₁) = 1.25533 ± (2.069)(0.04972)
  = 1.25533 ± 0.10287
  = [1.15246, 1.35820]
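The same intervals can be obtained programmatically, taking the t quantile from scipy; a sketch using the estimates and standard errors computed above:

    # Sketch: 95% confidence intervals for the regression coefficients,
    # using the t quantile with n - 2 degrees of freedom.
    from scipy import stats

    n = 25
    b0, se_b0 = 274.85, 170.338
    b1, se_b1 = 1.255333776, 0.04972

    t_crit = stats.t.ppf(0.975, df=n - 2)                 # about 2.069

    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)    # about (-77.6, 627.3)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)    # about (1.152, 1.358)
    print(t_crit, ci_b0, ci_b1)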
Prediction of New Observations

An important application of a regression model is predicting new or future observations Y corresponding to a specified level of the regressor variable x. If x₀ is the value of the regressor variable of interest, then

Ŷ₀ = b₀ + b₁x₀

is the point estimator of the new or future value of the response Y₀.
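A short sketch of the point prediction, using the fitted coefficients from the example and an arbitrary illustrative value x₀ = 4000 miles:

    # Sketch: point prediction of a new observation at x0 (x0 is an arbitrary value).
    b0, b1 = 274.85, 1.255333776   # fitted coefficients from the example
    x0 = 4000
    y0_hat = b0 + b1 * x0          # predicted dollars, about 5296
    print(y0_hat)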
Correlation
The correlation between two random variables, X and Y, is a measure of the
degree of linear association between the two variables.
The population correlation, denoted by ρ, can take on any value from −1 to 1:

ρ = −1       indicates a perfect negative linear relationship
−1 < ρ < 0   indicates a negative linear relationship
ρ = 0        indicates no linear relationship
0 < ρ < 1    indicates a positive linear relationship
ρ = 1        indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.
Illustrations of Correlation
[Figure: six scatterplots illustrating correlations of ρ = −1, ρ = −0.8, ρ = 0, ρ = 0, ρ = 0.8, and ρ = 1.]
Covariance and Correlation
The covariance of two random variables X and Y is

Cov(X, Y) = E[(X − μX)(Y − μY)]

where μX and μY are the population means of X and Y, respectively.

The population correlation coefficient is

ρ = Cov(X, Y) / (σX σY)

The sample correlation coefficient* is

r = Sxy / √(Sxx · Syy)

*Note: if ρ < 0 then b₁ < 0; if ρ = 0 then b₁ = 0; if ρ > 0 then b₁ > 0.

Example:
r = Sxy / √(Sxx · Syy) = 51,402,852.4 / √((40,947,557.84)(66,855,898))
  = 51,402,852.4 / 52,321,943.29 = 0.9824
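A sketch of the sample correlation computation from the same summary quantities:

    # Sketch: sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)
    # for the miles/dollars example.
    import math

    Sxx = 40_947_557.84
    Syy = 66_855_898.0
    Sxy = 51_402_852.4

    r = Sxy / math.sqrt(Sxx * Syy)   # about 0.982
    print(r)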