Chapter 17
Simple Linear Regression and Correlation
17.1 Introduction
• In this chapter we employ regression analysis to examine the relationship among quantitative variables.
• The technique is used to predict the value of one variable (the dependent variable, y) based on the values of other variables (the independent variables x1, x2, …, xk).
17.2 The Model
• The first-order linear model:

    y = β0 + β1x + ε

  y = dependent variable
  x = independent variable
  β0 = y-intercept
  β1 = slope of the line (= rise/run)
  ε = error variable
• β0 and β1 are unknown population parameters and therefore are estimated from the data.
17.3 Estimating the Coefficients
• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics, and
  – producing a straight line that cuts into the data.
[Scatter plot of sample points. The question is: which straight line fits best?]
The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

Let us compare two lines for the four points (1,2), (2,4), (3,1.5), (4,3.2): the line through the predicted values 1, 2, 3, 4 (i.e., ŷ = x), and a horizontal line at ŷ = 2.5.

Sum of squared differences for the first line = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences for the horizontal line = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
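The comparison above can be reproduced in a few lines of Python. This is a sketch using the four points and the two candidate lines stated in the example:

```python
# Four sample points from the illustration: (1, 2), (2, 4), (3, 1.5), (4, 3.2)
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical differences between the points and a line y = f(x)."""
    return sum((y - line(x)) ** 2 for x, y in points)

ssd1 = sum_sq_diff(points, lambda x: x)     # the line y = x
ssd2 = sum_sq_diff(points, lambda x: 2.5)   # the horizontal line y = 2.5

print(ssd1)   # ≈ 7.89
print(ssd2)   # ≈ 3.99, so the horizontal line fits these points better
```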
To calculate the estimates of the coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:

    b1 = cov(X, Y) / s_x²
    b0 = ȳ - b1·x̄

The regression equation that estimates the equation of the first-order linear model is:

    ŷ = b0 + b1x
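A minimal sketch of these formulas from first principles, applied to the four points of the preceding fitting illustration (the resulting line is for those points only, not for any textbook example):

```python
# Least-squares estimates b1 = cov(X, Y) / s_x^2 and b0 = ybar - b1 * xbar
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s2_x = sum((x - xbar) ** 2 for x in xs) / (n - 1)

b1 = cov_xy / s2_x        # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate
print(b1, b0)             # ≈ 0.11 and ≈ 2.4
```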
• Example 17.1 Relationship between odometer reading and a used car's selling price.
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded.
  – Find the regression line.

      Car   Odometer   Price
       1     37,388    5,318
       2     44,758    5,061
       3     45,833    5,008
       4     30,862    5,795
       5     31,705    5,784
       6     34,010    5,359
       .        .         .

  Independent variable: x = Odometer
  Dependent variable: y = Price
• Solution
  – Solving by hand
    • To calculate b0 and b1 we need to calculate several statistics first:

        x̄ = 36,009.45;   ȳ = 5,411.41

        s_x² = Σ(xi - x̄)² / (n - 1) = 43,528,688

        cov(X, Y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) = -1,356,256

      where n = 100.

        b1 = cov(X, Y) / s_x² = -1,356,256 / 43,528,688 = -.0312

        b0 = ȳ - b1·x̄ = 5,411.41 - (-.0312)(36,009.45) = 6,533

        ŷ = b0 + b1x = 6,533 - .0312x
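The hand calculation can be checked numerically from the summary statistics quoted above. A sketch (the inputs are the slide's rounded values, so the results agree only to the precision shown):

```python
# Summary statistics for Example 17.1 (n = 100), as given on the slide
xbar, ybar = 36_009.45, 5_411.41
s2_x = 43_528_688        # sample variance of odometer readings
cov_xy = -1_356_256      # sample covariance of odometer and price

b1 = cov_xy / s2_x       # slope
b0 = ybar - b1 * xbar    # intercept
print(b1, b0)            # ≈ -0.0312 and ≈ 6,533
```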
– Using the computer (see file Xm17-01.xls)
  Tools > Data Analysis > Regression > [Shade the y range and the x range] > OK

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.806308
  R Square             0.650132
  Adjusted R Square    0.646562
  Standard Error       151.5688
  Observations         100

ANOVA
              df    SS         MS         F          Significance F
  Regression   1    4,183,528  4,183,528  182.1056   4.4435E-24
  Residual    98    2,251,362  22,973.09
  Total       99    6,434,890

              Coefficients   Standard Error   t Stat     P-value
  Intercept    6533.383       84.51232        77.30687   1.22E-89
  Odometer    -0.03116        0.002309       -13.4947    4.44E-24

[Scatter plot of Price against Odometer with the fitted line ŷ = 6,533 - .0312x]
[Plot of the regression line ŷ = 6,533 - .0312x; there are no data near x = 0.]

The intercept is b0 = 6533. Do not interpret the intercept as the "price of cars that have not been driven": the sample contains no cars with odometer readings near zero.
The slope is b1 = -.0312: for each additional mile on the odometer, the price decreases by an average of $0.0312.
17.4 Error Variable: Required Conditions
• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is σε, a constant, for all values of x.
  – The errors associated with different values of y are all independent.
From the first three assumptions we have: y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σε.

[Figure: normal distributions of y at x1, x2, x3, centered at E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, E(y|x3) = β0 + β1x3. The standard deviation remains constant, but the mean value changes with x.]
17.5 Assessing the Model
• The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
• Consequently, it is important to assess how well
the linear model fits the data.
• Several methods are used to assess the model:
– Testing and/or estimating the coefficients.
– Using descriptive measurements.
• Sum of squares for errors (SSE)
  – This is the sum of squared differences between the points and the regression line.
  – It can serve as a measure of how well the line fits the data:

      SSE = Σ i=1..n (yi - ŷi)²

  – A shortcut formula:

      SSE = (n - 1)[s_y² - (cov(X, Y))² / s_x²]

  – This statistic plays a role in every statistical technique we employ to assess the model.
• Standard error of estimate
  – The mean error is equal to zero.
  – If σε is small the errors tend to be close to zero (close to the mean error), and the model fits the data well.
  – Therefore, we can use σε as a measure of the suitability of using a linear model.
  – An unbiased estimator of σε² is s_ε², and the standard error of estimate is

      s_ε = √( SSE / (n - 2) )
• Example 17.2
  – Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.
• Solution
  – Calculated before:

      s_y² = Σ(yi - ȳ)² / (n - 1) = 6,434,890 / 99 = 64,999

      SSE = (n - 1)[s_y² - (cov(X, Y))² / s_x²]
          = 99[64,999 - (-1,356,256)² / 43,528,688] = 2,251,363

  – Thus,

      s_ε = √( SSE / (n - 2) ) = √( 2,251,363 / 98 ) = 151.6

  – It is hard to assess the model based on s_ε even when compared with the mean value of y: s_ε = 151.6, ȳ = 5,411.4.
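The shortcut formula and the standard error can be verified numerically. A sketch using the slide's rounded summary statistics (so the results agree only to the precision shown):

```python
import math

n = 100
s2_y = 64_999            # sample variance of price
s2_x = 43_528_688        # sample variance of odometer
cov_xy = -1_356_256      # sample covariance

# Shortcut: SSE = (n - 1) * [s_y^2 - cov(X, Y)^2 / s_x^2]
sse = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)
s_e = math.sqrt(sse / (n - 2))   # standard error of estimate
print(sse, s_e)                  # ≈ 2,251,363 and ≈ 151.6
```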
• Testing the slope
  – When no linear relationship exists between two variables, the regression line should be horizontal.

[Figure: two scatter plots.
  Left: linear relationship. Different inputs (x) yield different outputs (y); the slope is not equal to zero.
  Right: no linear relationship. Different inputs (x) yield the same output (y); the slope is equal to zero.]
• We can draw inference about β1 from b1 by testing
    H0: β1 = 0
    H1: β1 ≠ 0  (or < 0, or > 0)
  – The test statistic is

      t = (b1 - β1) / s_b1,   where   s_b1 = s_ε / √((n - 1) s_x²)

    is the standard error of b1.
  – If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n - 2.
• Solution
  – Solving by hand
    To compute t we need the values of b1 and s_b1:

      b1 = -.0312

      s_b1 = s_ε / √((n - 1) s_x²) = 151.6 / √((99)(43,528,688)) = .00231

      t = (b1 - β1) / s_b1 = (-.0312 - 0) / .00231 = -13.49

  – Using the computer

                  Coefficients    Standard Error   t Stat     P-value
      Intercept    6533.383035     84.51232199     77.30687   1.22E-89
      Odometer    -0.031157739     0.002308896    -13.4947    4.44E-24

    There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
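The slope test can be reproduced from the quantities above; a sketch using the more precise b1 from the computer output:

```python
import math

n = 100
s_e = 151.6            # standard error of estimate
s2_x = 43_528_688      # sample variance of x
b1 = -0.031158         # slope estimate (from the regression output)

s_b1 = s_e / math.sqrt((n - 1) * s2_x)   # standard error of the slope
t = (b1 - 0) / s_b1                      # test statistic under H0: beta1 = 0
print(s_b1, t)                           # ≈ 0.00231 and ≈ -13.49
```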
• Coefficient of determination
  – When we want to measure the strength of the linear relationship, we use the coefficient of determination:

      R² = [cov(X, Y)]² / (s_x² s_y²)   or equivalently   R² = 1 - SSE / Σ(yi - ȳ)²

  – To understand the significance of this coefficient, note that the overall variability in y is partitioned into the part explained by the regression model and the part attributed to the error.
Two data points (x1,y1) and (x2,y2) of a certain sample are shown.

  Total variation in y = (y1 - ȳ)² + (y2 - ȳ)²
    = Variation explained by the regression line: (ŷ1 - ȳ)² + (ŷ2 - ȳ)²
    + Unexplained variation (error): (y1 - ŷ1)² + (y2 - ŷ2)²
Variation in y = SSR + SSE

• R² measures the proportion of the variation in y that is explained by the variation in x:

    R² = 1 - SSE / Σ(yi - ȳ)² = [Σ(yi - ȳ)² - SSE] / Σ(yi - ȳ)² = SSR / Σ(yi - ȳ)²

• R² takes on any value between zero and one.
    R² = 1: perfect match between the line and the data points.
    R² = 0: there is no linear relationship between x and y.
• Example 17.4
  – Find the coefficient of determination for Example 17.1; what does this statistic tell you about the model?
• Solution
  – Solving by hand:

      R² = [cov(X, Y)]² / (s_x² s_y²) = [-1,356,256]² / [(43,528,688)(64,999)] = .6501

  – Using the computer. From the regression output we have:

      Regression Statistics
        Multiple R           0.8063
        R Square             0.6501
        Adjusted R Square    0.6466
        Standard Error       151.57
        Observations         100

    65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
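Both definitions of R² can be checked against the slide's statistics; a sketch (SSE is taken from Example 17.2):

```python
n = 100
cov_xy = -1_356_256
s2_x = 43_528_688
s2_y = 64_999
sse = 2_251_363                      # from Example 17.2
ss_total = (n - 1) * s2_y            # Sum of (y_i - ybar)^2

r2_a = cov_xy ** 2 / (s2_x * s2_y)   # definition via covariance
r2_b = 1 - sse / ss_total            # definition via SSE
print(r2_a, r2_b)                    # both ≈ 0.6501
```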
17.6 Finance Application: Market Model
• One of the most important applications of linear regression is the market model.
• It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm):

    R = β0 + β1Rm + ε

  R = rate of return on a particular stock
  Rm = rate of return on some major stock index
  The beta coefficient (β1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
• Example 17.5 The market model
  – Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange.
  – Data consisted of monthly percentage returns for Nortel and monthly percentage returns for all the stocks (the TSE index).

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.560079
  R Square             0.313688
  Adjusted R Square    0.301855
  Standard Error       0.063123
  Observations         60

ANOVA
              df    SS         MS         F          Significance F
  Regression   1    0.10563    0.10563    26.50969   3.27E-06
  Residual    58    0.231105   0.003985
  Total       59    0.336734

              Coefficients   Standard Error   t Stat     P-value
  Intercept    0.012818       0.008223        1.558903   0.12446
  TSE          0.887691       0.172409        5.148756   3.27E-06

The slope is a measure of the stock's market-related risk: in this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%. R Square is a measure of the risk embedded in the Nortel stock that is market-related: specifically, 31.37% of the variation in Nortel's return is explained by the variation in the TSE's returns.
17.7 Using the Regression Equation
• Before using the regression model, we need to
assess how well it fits the data.
• If we are satisfied with how well the model fits
the data, we can use it to make predictions for y.
• Illustration
  – Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1):

      ŷ = 6533 - .0312x = 6533 - .0312(40,000) = 5,285
• Prediction interval and confidence interval
  – Two intervals can be used to discover how closely the predicted value will match the true value of y:
    • Prediction interval - for a particular value of y.
    • Confidence interval - for the expected value of y.
  – The prediction interval:

      ŷ ± t_α/2 · s_ε · √( 1 + 1/n + (xg - x̄)² / Σ(xi - x̄)² )

  – The confidence interval:

      ŷ ± t_α/2 · s_ε · √( 1/n + (xg - x̄)² / Σ(xi - x̄)² )

  The prediction interval is wider than the confidence interval.
• Example 17.6 Interval estimates for the car auction price
  – Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
  – Solution
    • The dealer would like to predict the price of a single car.
    • The prediction interval (95%), with t.025,98 = 1.984:

        ŷ ± t_α/2 · s_ε · √( 1 + 1/n + (xg - x̄)² / Σ(xi - x̄)² )

        = [6533 - .0312(40,000)] ± 1.984(151.6) √( 1 + 1/100 + (40,000 - 36,009)² / 4,309,340,160 )

        = 5,285 ± 303
– The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.
– Solution
  • The dealer needs to estimate the mean price per car.
  • The confidence interval (95%):

      ŷ ± t_α/2 · s_ε · √( 1/n + (xg - x̄)² / Σ(xi - x̄)² )

      = [6533 - .0312(40,000)] ± 1.984(151.6) √( 1/100 + (40,000 - 36,009)² / 4,309,340,160 )

      = 5,285 ± 35
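Both interval half-widths for xg = 40,000 can be reproduced with the figures quoted in the two examples; a sketch (all inputs are the slide's rounded values):

```python
import math

n = 100
xbar = 36_009
x_g = 40_000
s_e = 151.6
sxx = 4_309_340_160      # Sum of (x_i - xbar)^2, as printed on the slide
t_crit = 1.984           # t_.025,98
y_hat = 6533 - 0.0312 * x_g                            # point prediction

d2 = (x_g - xbar) ** 2 / sxx
pred_half = t_crit * s_e * math.sqrt(1 + 1 / n + d2)   # prediction interval half-width
conf_half = t_crit * s_e * math.sqrt(1 / n + d2)       # confidence interval half-width
print(y_hat, pred_half, conf_half)                     # ≈ 5,285, ±303, ±35
```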
• The effect of the given value of x on the interval
  – As xg moves away from x̄ the interval becomes longer; the shortest interval is found at xg = x̄.

[Figure: confidence intervals for ŷ = b0 + b1xg at xg = x̄, x̄ ± 1, and x̄ ± 2. The width of the interval grows with the term (xg - x̄)² / Σ(xi - x̄)², which equals 0 at xg = x̄, 1² / Σ(xi - x̄)² at xg = x̄ ± 1, and 2² / Σ(xi - x̄)² at xg = x̄ ± 2.]
17.8 Coefficient of Correlation
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between -1 and 1.
  – If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
  – If r = 0 there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.
• Testing the coefficient of correlation
  – When there is no linear relationship between two variables, ρ = 0.
  – The hypotheses are:
      H0: ρ = 0
      H1: ρ ≠ 0
  – The test statistic is:

      t = r √( (n - 2) / (1 - r²) )

    where r = cov(X, Y) / (s_x s_y) is the sample coefficient of correlation.
    The statistic has a Student t distribution with d.f. = n - 2, provided the variables are bivariate normally distributed.
• Example 17.7 Testing for linear relationship
  – Test the coefficient of correlation to determine whether a linear relationship exists in the data of Example 17.1.
• Solution
  – We test H0: ρ = 0 against H1: ρ ≠ 0.
  – Solving by hand:
    • The rejection region is |t| > t_α/2,n-2 = t.025,98 = 1.984.
    • The sample coefficient of correlation is r = cov(X, Y) / (s_x s_y) = -.806.
    • The value of the t statistic is

        t = r √( (n - 2) / (1 - r²) ) = -13.49

  – Conclusion: There is sufficient evidence at α = 5% to infer that there is a linear relationship between the two variables.
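The correlation test statistic can be checked numerically; a sketch using r = -0.8063 (the Multiple R from the regression output, negated because the slope is negative):

```python
import math

n = 100
r = -0.8063   # sample correlation; the sign follows the negative slope

t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(t)      # ≈ -13.49, matching the slope test
```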
• Spearman rank correlation coefficient
  – The Spearman rank test is used to test whether a relationship exists between variables in cases where
    • at least one variable is ranked, or
    • both variables are quantitative but the normality requirement is not satisfied.
  – The hypotheses are:
    • H0: ρs = 0
    • H1: ρs ≠ 0
  – The test statistic is

      rs = cov(a, b) / (s_a s_b)

    where a and b are the ranks of the data.
  – For a large sample (n > 30), rs is approximately normally distributed, and

      z = rs √(n - 1)
• Example 17.8
  – A production manager wants to examine the relationship between
    • aptitude test score given prior to hiring, and
    • performance rating three months after starting work.
  – A random sample of 20 production workers was selected. The test scores as well as the performance ratings were recorded.

      Employee   Aptitude test   Performance rating
         1           59                 3
         2           47                 2
         3           58                 4
         4           66                 3
         5           77                 2
         .            .                 .

    Aptitude test scores range from 0 to 100; performance ratings range from 1 to 5.

• Solution
  – The problem objective is to analyze the relationship between two variables.
  – Performance rating is ranked.
  – The hypotheses are:
    • H0: ρs = 0
    • H1: ρs ≠ 0
  – The test statistic is rs, and the rejection region is |rs| > rs,critical (taken from the Spearman rank correlation table).
      Employee   Aptitude test   Rank(a)   Performance rating   Rank(b)
         1            59            9              3              10.5
         2            47            3              2               3.5
         3            58            8              4              17
         4            66           14              3              10.5
         5            77           20              2               3.5
         .             .            .              .                .

    Ties are broken by averaging the ranks.

  – Solving by hand:
    • Rank each variable separately.
    • Calculate s_a = 5.92; s_b = 5.50; cov(a, b) = 12.34.
    • Thus rs = cov(a, b) / (s_a s_b) = .379.
    • The critical value for α = .05 and n = 20 is .450.
  – Conclusion: Do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.
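The Spearman statistic for Example 17.8 follows directly from the hand-calculated summary values; a sketch:

```python
s_a, s_b = 5.92, 5.50    # standard deviations of the ranks
cov_ab = 12.34           # covariance of the ranks

r_s = cov_ab / (s_a * s_b)
r_crit = 0.450           # critical value for alpha = .05, n = 20

print(r_s)               # ≈ 0.379
print(abs(r_s) > r_crit) # False, so do not reject H0
```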
17.9 Regression Diagnostics - I
• The three conditions required for the validity of
the regression analysis are:
– The error variable is normally distributed.
– The error variance is constant for all values of x.
– The errors are independent of each other.
• How can we diagnose violations of these conditions?
• Residual Analysis
  – By examining the residuals (or standardized residuals), we can identify violations of the required conditions.
  – Example 17.1 - continued
    • Nonnormality
      – Use Excel to obtain the standardized residual histogram.
      – Examine the histogram and look for a bell-shaped diagram with mean close to zero.

A partial list of standardized residuals (RESIDUAL OUTPUT):

      Observation   Residuals       Standard Residuals
          1         -50.45749927    -0.334595895
          2         -77.82496482    -0.516076186
          3         -97.33039568    -0.645421421
          4         223.2070978      1.480140312
          5         238.4730715      1.58137268

For each residual we calculate the standard deviation as follows:

    s_ri = s_ε √(1 - hi),   where   hi = 1/n + (xi - x̄)² / Σ(xj - x̄)²

    Standardized residual i = Residual i / Standard deviation

[Histogram of the standardized residuals: roughly bell shaped with mean close to zero.]

We can also apply the Lilliefors test or the χ² test of normality.
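A sketch of the standardization above, applied to the four illustration points used earlier in the chapter (not the 100-car data); `h` holds the leverage hi of each observation:

```python
import math

# Illustration data: the four points used earlier in the chapter
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)

# Fit the least-squares line for these points first
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s_e = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))   # standard error of estimate

# Leverage h_i and standardized residual of each observation
h = [1 / n + (x - xbar) ** 2 / sxx for x in xs]
std_resid = [e / (s_e * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]
print(std_resid)   # values beyond +/-2 would flag suspected outliers
```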
• Heteroscedasticity
  – When the requirement of a constant variance is violated we have heteroscedasticity.

[Figure: plot of the residuals against ŷ; the spread of the residuals increases with ŷ.]
• When the requirement of a constant variance is not violated we have homoscedasticity.

[Figure: plot of the residuals against ŷ; the spread of the data points does not change much.]
[Figure: residuals against ŷ with an even spread; as far as the even spread is concerned, this is a much better situation.]
• Nonindependence of error variables
  – Data collected over time constitute a time series.
  – If the errors are independent, no pattern should be observed when the residuals are examined over time.
  – When a pattern is detected, the errors are said to be autocorrelated.
  – Autocorrelation can be detected by graphing the residuals against time.
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.

[Figure: two residual-versus-time plots.
  Left: note the runs of positive residuals, replaced by runs of negative residuals.
  Right: note the oscillating behavior of the residuals around zero.]
• Outliers
  – An outlier is an observation that is unusually small or large.
  – Several possibilities need to be investigated when an outlier is observed:
    • There was an error in recording the value.
    • The point does not belong in the sample.
    • The observation is valid.
  – Identify outliers from the scatter diagram.
  – It is customary to suspect that an observation is an outlier if its |standard residual| > 2.
[Figure: scatter plots of an outlier and of an influential observation. Some outliers may be very influential: the outlier causes a shift in the regression line.]
• Procedure for regression diagnostics
– Develop a model that has a theoretical basis.
– Gather data for the two variables in the model.
– Draw the scatter diagram to determine whether a
linear model appears to be appropriate.
– Check the required conditions for the errors.
– Assess the model fit.
– If the model fits the data, use the regression
equation.