Chapter 2.3 – Least Squares Regression

Stat 226 – Introduction to Business Statistics I
Spring 2009
Professor: Dr. Petrutza Caragea
Section A
Tuesdays and Thursdays 9:30–10:50 a.m.

If we find a linear association between two quantitative variables (e.g. through a scatterplot), we can use this knowledge to help predict the value of one variable (y) given the value of another variable (x).

What are plausible values of y when x = 7?
We can often use a straight line, a so-called regression or prediction line, to

- describe how a response variable y changes as an explanatory variable x changes
- predict a value of y given a specific value of x
How shall we fit a line, i.e. what is a good line?

We will use the so-called "least squares" principle to find the best line:

- the best line should minimize the sum of the squared errors, i.e. give the closest fit to all data points.

This best line through the data is called the "least squares regression line" or "prediction line".

Least Squares regression line (LS regression line)

The equation of the least squares regression line is given by

ŷ = a + b · x

where

- ŷ corresponds to the predicted value of y for a given value of x
- b = r · (sy/sx) is the slope of the LS regression line
- a = ȳ − b · x̄ is the intercept of the LS regression line
- r is the correlation coefficient between x and y
- x̄, sx correspond to the mean and standard deviation of x
- ȳ, sy correspond to the mean and standard deviation of y
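These formulas are straightforward to compute in software. The course examples shown later use JMP; purely as an illustration, here is a minimal Python sketch (assuming numpy; the helper name least_squares_line is my own) that computes a and b exactly as defined above, using sample standard deviations:

```python
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) for the LS regression line y-hat = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]                # correlation coefficient r
    b = r * y.std(ddof=1) / x.std(ddof=1)      # slope: b = r * sy / sx
    a = y.mean() - b * x.mean()                # intercept: a = ybar - b * xbar
    return a, b
```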
Example: recall previous data

x:  2   5   8   8  10  12
y:  2   4   7   6   9  10

we found x̄ = 7.5, sx = 3.5637, ȳ = 6.3333, sy = 3.0111, and r = 0.9878

so,

b = r · (sy/sx) = 0.9878 · (3.0111/3.5637) ≈ 0.835

and

a = ȳ − b · x̄ = 6.3333 − 0.835 · 7.5 ≈ 0.0735

⇒ LS regression line is given by: ŷ = 0.0735 + 0.835 · x

Let's discuss next how to plot the LS regression line and the meaning/interpretation of the slope b and intercept a.

How to plot the LS regression line on the scatterplot

Find two points on the prediction line and connect both, e.g. x = 3 and x = 9:

ŷ(3) = 0.0735 + 0.835 · 3 ≈ 2.58 and ŷ(9) = 0.0735 + 0.835 · 9 ≈ 7.59

plot both points (x, ŷ) on the scatterplot and connect them
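As an illustrative sketch (not the course's JMP output; assuming numpy and matplotlib, with the coefficients from the slide), the two-point plotting recipe looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2, 5, 8, 8, 10, 12])   # radio ads aired per week
y = np.array([2, 4, 7, 6, 9, 10])    # sales

a, b = 0.0735, 0.835                 # intercept and slope from the slide

plt.scatter(x, y, label="data")
x_two = np.array([3, 9])             # two convenient x-values
plt.plot(x_two, a + b * x_two, "r-", label="LS regression line")
plt.xlabel("radio ads per week (x)")
plt.ylabel("sales (y)")
plt.legend()
plt.show()
```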
interpretation of intercept a and slope b

- intercept a: corresponds to the predicted value of y when x = 0
- slope b: corresponds to the change in the predicted value ŷ when x increases by 1 unit
  - b > 0: ŷ increases as x increases (positive association)
  - b < 0: ŷ decreases as x increases (negative association)

What can we use the line for?

We can use the line to predict the sales amount based on the number of radio ads aired per week, e.g. what is ŷ when x = 6?

ŷ = 0.0735 + 0.835 · 6 ≈ 5.08
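A quick numerical check of both the prediction and the slope interpretation (a hypothetical predict helper of my own naming):

```python
a, b = 0.0735, 0.835            # LS line from the radio-ads example

def predict(x):
    """Predicted sales for x radio ads aired per week."""
    return a + b * x

print(predict(6))               # about 5.08
print(predict(7) - predict(6))  # equals b = 0.835: one more ad raises predicted sales by b
```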
Can we use the prediction line to predict the sales amount when the number of radio ads aired per week is 15? That is, what is ŷ when x = 15?

Plugging in gives ŷ = 0.0735 + 0.835 · 15 ≈ 12.6, but caution is needed: x = 15 lies outside the observed range of x-values (2 to 12). Predicting outside this range is called extrapolation, and such predictions can be unreliable because we have no data telling us whether the linear pattern continues there.
Some facts about LS-Regression

1. In regression, we must clearly know which variable is the explanatory variable and which variable is the response variable. Switching x and y will change the LS regression line (its slope and intercept), but will not affect the value of the correlation r.

2. The least squares regression line always goes through the point (x̄, ȳ).
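Both facts are easy to verify numerically. A small sketch (assuming numpy) on the radio-ads data shows that swapping x and y changes the slope but leaves r unchanged, and that the fitted line passes through (x̄, ȳ):

```python
import numpy as np

x = np.array([2, 5, 8, 8, 10, 12])
y = np.array([2, 4, 7, 6, 9, 10])

def fit(u, v):
    """LS regression of v on u: returns (intercept, slope, correlation)."""
    r = np.corrcoef(u, v)[0, 1]
    b = r * v.std(ddof=1) / u.std(ddof=1)
    return v.mean() - b * u.mean(), b, r

a_yx, b_yx, r_yx = fit(x, y)    # regression of y on x
a_xy, b_xy, r_xy = fit(y, x)    # roles switched: regression of x on y
print(b_yx, b_xy)               # different slopes: about 0.835 vs 1.169
print(np.isclose(r_yx, r_xy))   # True: correlation is symmetric
print(np.isclose(a_yx + b_yx * x.mean(), y.mean()))  # True: line goes through (xbar, ybar)
```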
Some facts about LS-Regression cont'd

3. r² is called the coefficient of determination and corresponds to the amount (percent) of variation in the y-values that is accounted for by the "regression of y on x" (tells us how good the predictions will be).

4. We want r² close to 1; in general 0 ≤ r² ≤ 1.

5. Based on the value of r², how do you know if r is negative or positive (that is, the direction of the association)? You cannot tell from r² alone; r takes the sign of the slope b.

Example: weekly radio ads and the sales amount: r = .9878. So r² = .9878² ≈ 0.9758. Thus about 97.6% of the variation in sales (y) can be explained by the least squares regression of sales on the number of advertisements (x).
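As an illustration (assuming numpy, using the slide's fitted line), r² can be computed either by squaring r or as the proportion of variation in y explained by the regression; both routes agree:

```python
import numpy as np

x = np.array([2, 5, 8, 8, 10, 12])
y = np.array([2, 4, 7, 6, 9, 10])

r = np.corrcoef(x, y)[0, 1]
y_hat = 0.0735 + 0.835 * x                      # predictions from the LS line
ss_total = ((y - y.mean()) ** 2).sum()          # total variation in y
ss_resid = ((y - y_hat) ** 2).sum()             # variation left unexplained
print(r**2, 1 - ss_resid / ss_total)            # both approximately 0.9758
```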
residuals and residual plots

A residual is defined as the difference between an observed value y and its predicted value ŷ based on the prediction line ŷ = a + b·x:

residual = observed y − predicted y = y − ŷ

A residual can be thought of as an error that we commit when using the prediction line. Unless a y-value falls right on the prediction line, a residual will be either

- positive, when the observed y-value falls above the prediction line
- negative, when the observed y-value falls below the prediction line

so a residual is only zero ⇔ ŷ = y.

[Figure: scatterplot with the prediction line ŷ = a + bx; the vertical distance y − ŷ from an observed point to the line is marked as the residual.]
example: radio ads vs. sales

LS regression line is ŷ = 0.0735 + 0.835 · x

radio ads x | sales y |    ŷ    | residual y − ŷ
     2      |    2    |  1.7435 |  0.2565
     5      |    4    |  4.2485 | −0.2485
     8      |    7    |  6.7535 |  0.2465
     8      |    6    |  6.7535 | −0.7535
    10      |    9    |  8.4235 |  0.5765
    12      |   10    | 10.0935 | −0.0935
   Total    |         |         |  ≈ 0

In order to plot residuals we plot the x-values on the horizontal axis and the corresponding residuals on the vertical axis.
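A short sketch (assuming numpy) reproduces the table; note that the residuals sum to approximately zero, as they always do for a least squares fit (exactly zero up to the rounding of a and b):

```python
import numpy as np

x = np.array([2, 5, 8, 8, 10, 12])
y = np.array([2, 4, 7, 6, 9, 10])

y_hat = 0.0735 + 0.835 * x       # predicted values from the LS line
residuals = y - y_hat            # observed minus predicted
for xi, yi, yh, res in zip(x, y, y_hat, residuals):
    print(f"x={xi:2d}  y={yi:2d}  y_hat={yh:7.4f}  residual={res:+.4f}")
print(f"sum of residuals: {residuals.sum():+.4f}")  # approximately 0
```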
what to look for in a residual plot

residual plots can tell us whether the fitted linear model is adequate

in general

- residuals appear to be randomly scattered around 0 → good fit
- any pattern in a residual plot always indicates a bad fit

some examples:

- good fit: residuals are scattered around 0, and we have about as many residuals below zero as above
- curved pattern: data are not linear (straight line fits poorly)
- increasing/decreasing spread of residuals: the variability of y about the line changes with x, so predictions are more accurate for some x-values than for others
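As an illustration of the curved-pattern case, here is a sketch (assuming numpy and matplotlib, with made-up data) that fits a straight line to genuinely curved data; the residual plot shows the telltale U-shape:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, x.size)   # truly curved relationship

b, a = np.polyfit(x, y, 1)                      # fit a straight line anyway
residuals = y - (a + b * x)

plt.axhline(0, color="gray")
plt.scatter(x, residuals)                       # clear U-shaped pattern:
plt.xlabel("x")                                 # the straight line fits poorly
plt.ylabel("residual")
plt.show()
```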
influential points and outliers

outlier: an observation that is separated from the main bulk of the data

An observation that has a considerable effect on the fitted regression model (e.g. on the correlation r, intercept a, and/or slope b) is considered influential.

Observations that are outliers with respect to their x-values tend to be more influential than observations that are outliers with respect to their y-values.

Such observations (outlying w.r.t. the x-values) are said to have high leverage; they can alter the fitted least squares regression line significantly.

Consider the next example: data on treadmill time (until exhaustion) versus ski time for biathletes.
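Before looking at the biathlete output, the effect can be previewed with a small sketch (hypothetical numbers, assuming numpy): refitting the line with and without a single high-leverage point changes the slope drastically:

```python
import numpy as np

# illustrative data: the last point is an outlier in x (high leverage)
x = np.array([1, 2, 3, 4, 5, 15.0])
y = np.array([1.2, 1.9, 3.2, 3.9, 5.1, 2.0])

b_all, a_all = np.polyfit(x, y, 1)             # fit including the unusual point
b_sub, a_sub = np.polyfit(x[:-1], y[:-1], 1)   # fit with the point removed
print(f"with point:    slope = {b_all:+.3f}")  # about -0.006
print(f"without point: slope = {b_sub:+.3f}")  # about +0.980: a drastic change
```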
[JMP output: scatterplots of ski time vs. treadmill time with fitted lines, summary tables, and residual plots. Three linear fits appear:

- ski time = 74.109138 − 0.7075958 · treadmill time (13 observations, R² = 0.1253; slope t = −1.26, p = 0.2354)
- ski time = 91.119389 − 2.5464928 · treadmill time (13 observations, R² = 0.7104; slope t = −5.19, p = 0.0003)
- ski time = 93.870923 − 2.7939803 · treadmill time (12 observations, R² = 0.8783; slope t = −8.50, p < .0001)

The large differences among the fitted slopes and R² values illustrate how strongly a single influential, high-leverage observation can change the fitted line.]