Section 7.3 Least Squares Lines

Least Squares Regression
Fitting a Line to Bivariate Data
Linear Relationships
Avg. occupants per car
• 1980: 6/car
• 1990: 3/car
• 2000: 1.5/car
• By the year 2010 every fourth car will have nobody in it!


Food for Thought
• What kind of mathematical relationship is there between year and avg. no. of occupants per car?
• Why might the relationship break down by 2010?
Basic Terminology
• Scatterplots, correlation: interested in the association between 2 variables (assign x and y arbitrarily)
• Least squares regression: does one quantitative variable explain or cause changes in another variable?

Basic Terminology (cont.)
• Explanatory variable: explains or causes changes in the other variable; the x-variable (independent variable).
• Response variable: the y-variable; it responds to changes in the x-variable (dependent variable).

Examples
• Fertilizer (x) → corn yield (y)
• Advertising $ (x) → store income (y)
• Drug dose (x) → blood pressure (y)
• Daily temperature (x) → natural gas demand (y)
• Change in min wage (x) → unemployment rate (y)

Simplest Relationship
• Simplest equation that describes the dependence of variable y on variable x: \(y = b_0 + b_1 x\)
• linear equation
• graph is a line with slope \(b_1\) and y-intercept \(b_0\)
Graph
[Figure: the line \(y = b_0 + b_1 x\), with y-intercept \(b_0\) and slope \(b_1\) = rise/run.]
Notation
• Data: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\)
• Draw the line \(y = b_0 + b_1 x\) through the scatterplot; the point on the line corresponding to \(x_i\) is \(\hat{y}_i = b_0 + b_1 x_i\).
• \(\hat{y}_i\) is the value of y predicted by the line \(y = b_0 + b_1 x\) when \(x = x_i\).
• \(y_i\) is the observed value of y when \(x = x_i\).
Observed y, Predicted y
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. The data point (2.7, 3.6) gives the observed y = 3.6 at x = 2.7; the predicted y at x = 2.7 is the height of the line there, \(\hat{y} = b_0 + b_1(2.7)\).]
Scatterplot: Fuel Consumption vs Car Weight
[Figure: scatterplot of fuel consumption (gal/100 miles) against car weight (1000 lbs). Which line is the "best" line?]
Scatterplot with least squares prediction line
[Figure: FUEL CONSUMPTION (gal/100 miles) vs CAR WEIGHT (1000 lbs) with the fitted line y = 1.639x - 0.3631, r^2 = 0.9538.]
How do we draw the line?
Residuals
The i-th residual is the vertical deviation of the i-th data point from the line:
\[ e_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) \]
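Residuals: graphically
[Figure: graphical display of residuals. A point above the line has a positive residual and a point below the line has a negative residual; at \(x_i\) the residual is \(e_i = y_i - \hat{y}_i\).]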
Criterion for choosing what line to draw: method of least squares
• The method of least squares chooses the line that makes the sum of squares of the residuals as small as possible.
• This line has the slope \(b_1\) and intercept \(b_0\) that minimize
\[ \sum_{i=1}^{n} \bigl[y_i - (b_0 + b_1 x_i)\bigr]^2 \]
for the given observations \((x_i, y_i)\).
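Least Squares Line y = b0 + b1x: Slope b1 and Intercept b0
For the data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\):
\[ \text{slope: } b_1 = r\,\frac{s_y}{s_x} \qquad \text{y-intercept: } b_0 = \bar{y} - b_1\bar{x} \]
where
\[ s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} \quad \text{is the standard deviation of } x_1, x_2, \ldots, x_n \]
\[ s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}} \quad \text{is the standard deviation of } y_1, y_2, \ldots, y_n \]
\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y} \quad \text{is the correlation between x and y} \]
\[ SSE = \sum_{i=1}^{n} y_i^2 - b_0\sum_{i=1}^{n} y_i - b_1\sum_{i=1}^{n} x_i y_i \]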
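The formulas above translate almost line for line into code. The following is a minimal sketch using only the Python standard library; the function names `least_squares_line` and `sse_shortcut` and the four-point data set are illustrative assumptions, not part of the slides.

```python
import statistics


def least_squares_line(x, y):
    """Return (b0, b1, r) for the least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # both divide by n - 1
    # Correlation: sum of cross-products over (n - 1) * s_x * s_y
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = s_xy / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x        # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1, r


def sse_shortcut(x, y, b0, b1):
    """SSE = sum(y^2) - b0*sum(y) - b1*sum(x*y).

    This shortcut holds only when b0, b1 are the least squares values.
    """
    sum_y2 = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return sum_y2 - b0 * sum(y) - b1 * sum_xy


# Illustrative data lying exactly on y = 1 + 2x (not from the slides)
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
b0, b1, r = least_squares_line(xs, ys)
print(b0, b1, r, sse_shortcut(xs, ys, b0, b1))
```

Because the illustrative points lie exactly on a line, the fit should recover that line with r = 1 and an SSE of 0, up to floating-point rounding.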
Example: Income vs Consumption Expenditure
(both in $1,000's)
Income (x)   Consumption Expenditure (y)
1            7
5            6
9            9
13           8
17           10
Questions
• Construct a scatterplot; determine if a linear model is appropriate. If so …
• … find the least squares prediction line.
• Estimate consumption expenditure in a household with an income of (i) $6,000 (ii) $25,000. Are you comfortable with these estimates?
• Compute the residuals.

Scatterplot
[Figure: Consumption Expenditure ($1,000's) vs Household Income ($1,000's) for the five households.]
Solution
Inc. x   Exp. y   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
1        7        -8        64            -1        1             8
5        6        -4        16            -2        4             8
9        9        0         0             1         1             0
13       8        4         16            0         0             0
17       10       8         64            2         4             16
Sum: 45  40       0         160           0         10            32

\[ \bar{x} = \frac{45}{5} = 9; \quad \bar{y} = \frac{40}{5} = 8 \]
\[ s_x = \sqrt{\frac{160}{4}} = \sqrt{40} = 6.325; \quad s_y = \sqrt{\frac{10}{4}} = \sqrt{2.5} = 1.581 \]
\[ r = \frac{32}{4(6.325)(1.581)} = .8 \]
Calculations
\[ b_1 = r\,\frac{s_y}{s_x} = .8\left(\frac{1.581}{6.325}\right) = .2 \]
\[ b_0 = \bar{y} - b_1\bar{x} = 8 - .2(9) = 8 - 1.8 = 6.2 \]
least squares prediction line: \(\hat{y} = b_0 + b_1 x = 6.2 + .2x\)
income = $6,000, x = 6: \(\hat{y} = 6.2 + .2(6) = 7.4\) ($7,400)
income = $25,000, x = 25: \(\hat{y} = 6.2 + .2(25) = 11.2\) ($11,200)
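The hand calculation can be checked numerically. The sketch below recomputes the slope from the deviation sums in the solution table, using the algebraically equivalent form b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² (which equals r·sy/sx), and then evaluates the two predictions. Variable and function names are illustrative.

```python
income = [1, 5, 9, 13, 17]        # x, in $1,000's
expenditure = [7, 6, 9, 8, 10]    # y, in $1,000's

n = len(income)
x_bar = sum(income) / n           # 45 / 5
y_bar = sum(expenditure) / n      # 40 / 5

# Deviation sums, as in the last row of the solution table
s_xx = sum((x - x_bar) ** 2 for x in income)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, expenditure))

b1 = s_xy / s_xx                  # slope
b0 = y_bar - b1 * x_bar           # intercept


def predict(x):
    """Predicted expenditure ($1,000's) for an income of x ($1,000's)."""
    return b0 + b1 * x


print(f"y-hat = {b0:.1f} + {b1:.1f} x")
for x in (6, 25):
    inside = min(income) <= x <= max(income)
    note = "within data range" if inside else "extrapolation"
    print(x, round(predict(x), 1), note)
```

The flag on x = 25 reflects that the observed incomes only run from 1 to 17, which is one reason to be less comfortable with the $25,000 estimate.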
Least Squares Prediction Line
[Figure: Consumption Expenditure ($1,000's) vs Household Income ($1,000's) with the line y = 6.2 + 0.2x.]
Consumption Expenditure Prediction When x=$6,000
[Figure: Consumption Expenditure ($1,000's) vs Household Income ($1,000's) with the line y = 6.2 + 0.2x; the prediction 7.4 is marked at x = 6.]
Consumption Expenditure Prediction When x=$25,000
[Figure: Consumption Expenditure ($1,000's) vs Household Income ($1,000's) with the line y = 6.2 + 0.2x extended to x = 25; the prediction 11.2 is marked.]
The least squares line always goes through the point with coordinates \((\bar{x}, \bar{y})\)
[Figure: Consumption Expenditure vs Income with the line y = 0.2x + 6.2 passing through \((\bar{x}, \bar{y}) = (9, 8)\).]
C. Compute the Residuals
Inc. x   ConE y   y-hat = 6.2 + .2x   y - y-hat   (y - y-hat)^2
1        7        6.4                 .6          .36
5        6        7.2                 -1.2        1.44
9        9        8                   1           1
13       8        8.8                 -.8         .64
17       10       9.6                 .4          .16
Sum of residuals = 0; sum of (residuals)^2 = 3.6
Residuals
[Figure: Consumption Expenditure ($1,000's) vs Household Income ($1,000's) with the line y = 6.2 + 0.2x.]
Income Residual Plot
[Figure: residuals (vertical axis, -2 to 2) plotted against income (0 to 20).]
residuals, (residuals)2

Note that
* residuals = 0
 (residuals)2 = 3.6
* From formula in box on p. 7:
SSE=yi2 – b0*yi – b1*xiyi
330 – 6.2*40 - .2*392
= 330 – 248 – 78.4 = 3.6
Any other line drawn through the
scatterplot will have
(residuals)2 > 3.6
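These claims are easy to verify by direct computation. The sketch below computes the residuals for the fitted line, checks the shortcut SSE formula, and evaluates the SSE of a few competing lines. The competing lines are arbitrary choices made here for illustration, so this only shows the claim on examples and does not prove it.

```python
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]


def sse(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(income, expenditure))


b0, b1 = 6.2, 0.2    # least squares line from the calculations
residuals = [y - (b0 + b1 * x) for x, y in zip(income, expenditure)]

# Shortcut formula: SSE = sum(y^2) - b0*sum(y) - b1*sum(x*y)
sum_y2 = sum(y * y for y in expenditure)                    # 330
sum_xy = sum(x * y for x, y in zip(income, expenditure))    # 392
shortcut = sum_y2 - b0 * sum(expenditure) - b1 * sum_xy

print("sum of residuals:", round(sum(residuals), 10))
print("SSE direct / shortcut:", round(sse(b0, b1), 10), round(shortcut, 10))

# Arbitrarily chosen competing lines (assumed for illustration only)
others = [(6.2, 0.3), (6.0, 0.2), (7.0, 0.1), (5.0, 0.35)]
for c0, c1 in others:
    print((c0, c1), round(sse(c0, c1), 4))
```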
Car Weight, Fuel Consumption Example, cont.
(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)
[Figure: scatterplot of FUEL CONSUMP. (gal/100 miles) vs WEIGHT (1000 lbs).]
Wt (x)   Fuel (y)   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
3.4      5.5        .5        .25           1.11      1.2321        .555
3.8      5.9        .9        .81           1.51      2.2801        1.359
4.1      6.5        1.2       1.44          2.11      4.4521        2.532
2.2      3.3        -.7       .49           -1.09     1.1881        .763
2.6      3.6        -.3       .09           -.79      .6241         .237
2.9      4.6        0         0             .21       .0441         0
2.0      2.9        -.9       .81           -1.49     2.2201        1.341
2.7      3.6        -.2       .04           -.79      .6241         .158
1.9      3.1        -1.0      1             -1.29     1.6641        1.29
3.4      4.9        .5        .25           .51       .2601         .255
col. sum: 29  43.9  0         5.18          0         14.589        8.49
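Calculations
\[ \bar{x} = 2.9; \quad \bar{y} = 4.39; \quad s_x = \sqrt{\frac{5.18}{9}} = .7587; \quad s_y = \sqrt{\frac{14.589}{9}} = 1.2732 \]
\[ r = \frac{8.49}{9(.7587)(1.2732)} = .9766 \]
\[ \text{slope } b_1 = r\,\frac{s_y}{s_x} = .9766\left(\frac{1.2732}{.7587}\right) = 1.639 \]
\[ \text{intercept } b_0 = \bar{y} - b_1\bar{x} = 4.39 - 1.639(2.9) = -.3631 \]
least squares prediction line: \(\hat{y} = b_0 + b_1 x = -.3631 + 1.639x\)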
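As a cross-check of the arithmetic, the same quantities can be computed from the ten data points. This is a sketch using the standard library's `statistics` module (whose `stdev` divides by n − 1, matching the definition of s used here); the variable names are illustrative.

```python
import statistics

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # y, gal/100 miles

n = len(weight)
x_bar, y_bar = statistics.mean(weight), statistics.mean(fuel)
s_x, s_y = statistics.stdev(weight), statistics.stdev(fuel)

# Correlation from the sum of cross-products
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, fuel))
r = s_xy / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept

print(f"x_bar={x_bar:.2f}  y_bar={y_bar:.2f}  s_x={s_x:.4f}  s_y={s_y:.4f}")
print(f"r={r:.4f}  b1={b1:.3f}  b0={b0:.4f}")
```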
Scatterplot with least squares prediction line
[Figure: FUEL CONSUMP. (gal/100 miles) vs WEIGHT (1000 lbs) with the fitted line y = 1.639x - 0.3631, r^2 = 0.9538.]
The Least Squares Line Always Goes Through \((\bar{x}, \bar{y})\)
[Figure: FUEL CONSUMP. (gal/100 miles) vs WEIGHT (1000 lbs); the line y = 1.639x - 0.3631 passes through \((\bar{x}, \bar{y}) = (2.9, 4.39)\).]
Using the least squares line for prediction.
Fuel consumption of 3,000 lb car? (x = 3)
\[ \hat{y} = -.3631 + 1.639(3) = 4.5539 \]
[Figure: Fuel Consumption vs Car Weight, scatterplot and least squares line y = -0.3631 + 1.639x, with the predicted point (3.0, 4.5539) marked.]
Be Careful!
Fuel consumption of 500 lb car? (x = .5)
\[ \hat{y} = -.3631 + 1.639(.5) = .4564 \text{ gal/100 miles (219 mpg)} \]
[Figure: FUEL CONSUMP. (gal/100 miles) vs WEIGHT (1000 lbs) with the fitted line y = 1.639x - 0.3631, r^2 = 0.9538.]
x = .5 is outside the range of the x-data that we used to determine the least squares line.
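One way to guard against this mistake in practice is to have the prediction routine check the requested x against the range of the data used for the fit. The sketch below is one possible approach; the function name `predict_fuel` and the choice to warn (rather than refuse) are assumptions made here, not something the slides prescribe.

```python
import warnings

B0, B1 = -0.3631, 1.639      # fitted line from the calculations
X_MIN, X_MAX = 1.9, 4.1      # lightest and heaviest car in the data (1000 lbs)


def predict_fuel(weight):
    """Predicted gal/100 miles; warns when weight is outside the observed range."""
    if not X_MIN <= weight <= X_MAX:
        warnings.warn(
            f"x = {weight} is outside [{X_MIN}, {X_MAX}]: extrapolation"
        )
    return B0 + B1 * weight


inside = predict_fuel(3.0)    # 3,000 lb car, within the data
outside = predict_fuel(0.5)   # 500 lb car, far below the lightest car observed
print(round(inside, 4), round(outside, 4), round(100 / outside), "mpg")
```

The line itself will return a number for any x; the check only makes it visible that the 219 mpg figure rests on a region where no data were observed.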
Avoid GIGO! Evaluating the least squares line
1. Create scatterplot. Approximately linear?
2. Calculate \(r^2\), the square of the correlation coefficient
3. Examine residual plot
\(r^2\): The Variation Accounted For
• The square of the correlation coefficient r gives important information about the usefulness of the least squares line.
• \(-1 \le r \le 1\) implies \(0 \le r^2 \le 1\)
• The square of the correlation coefficient, \(r^2\), is the fraction of the variation in y that is explained by the least squares regression of y on x (that is, by the variation in x).
Example: car weight, fuel consumption
• x = car weight, y = fuel consumption
• \(r^2 = (.9766)^2 \approx .95\)
• About 95% of the variation in fuel consumption (y) is explained by the linear relationship between car weight (x) and fuel consumption (y).
• What else affects fuel consumption?
  – Driver, size of engine, tires, road, etc.
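The "fraction of variation explained" reading can be made concrete. For a least squares line, the total variation Σ(y − ȳ)² splits into the part left in the residuals (SSE) and the part accounted for by the line, so r² = 1 − SSE / Σ(y − ȳ)². This is a standard identity rather than something stated on the slides; the sketch below checks it on the car data, with illustrative variable names.

```python
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # y, gal/100 miles

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(fuel) / n

s_xx = sum((x - x_bar) ** 2 for x in weight)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, fuel))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sst = sum((y - y_bar) ** 2 for y in fuel)                           # total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(weight, fuel))   # left in the residuals

r = s_xy / (s_xx * sst) ** 0.5
r_squared = 1 - sse / sst

print(round(r ** 2, 4), round(r_squared, 4))   # the two routes should agree
```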

Example: SAT scores
[Figure: SAT Mean per State vs % Seniors Taking Test. Mean SAT score plotted against % of seniors taking the test, with the fitted line y = -2.2375x + 1023.4, R^2 = 0.7542.]
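SAT scores: calculations
\[ \bar{x} = 33.882, \quad s_x = 24.103, \quad \bar{y} = 947.549, \quad s_y = 62.1, \quad r = -.868 \]
\[ b_1 = r\,\frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1\bar{x} \]
\[ \text{slope } b_1 = -.868\left(\frac{62.1}{24.103}\right) = -2.23635 \]
\[ \text{intercept } b_0 = 947.549 - (-2.236)(33.882) = 1023.309 \]
least squares prediction line: \(\hat{y} = 1023.309 - 2.236x\)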
SAT scores: result
\[ r^2 = (-.868)^2 = .7534 \]
[Figure: SAT Mean per State vs % Seniors Taking Test, with the fitted line y = -2.2375x + 1023.4, R^2 = 0.7542.]
If 57% of NC seniors take the SAT, the predicted mean score is
\[ \hat{y} = 1023.309 - 2.23635(57) = 895.84 \]
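Because the slope and intercept depend only on the five summary statistics, the line can be fitted without the raw state-by-state data. The sketch below does this from the values quoted above; variable names are illustrative.

```python
# Summary statistics quoted in the calculations
x_bar, s_x = 33.882, 24.103      # % of seniors taking the test
y_bar, s_y = 947.549, 62.1       # mean SAT score
r = -0.868                       # negative: scores fall as participation rises

b1 = r * s_y / s_x               # slope
b0 = y_bar - b1 * x_bar          # intercept
r_squared = r ** 2

pct = 57                         # % of seniors taking the SAT
y_hat = b0 + b1 * pct            # predicted mean score
print(round(b1, 5), round(b0, 3), round(r_squared, 4), round(y_hat, 2))
```

The code carries the slope at full precision when computing the intercept, whereas the hand calculation rounds it to 2.236 first, so the printed intercept and prediction can differ from the slide values in the last digits.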
Residuals
residual = observed y - predicted y = \(y - \hat{y}\)
Properties of residuals
1. The residuals always sum to 0 (therefore the mean of the residuals is 0)
2. The least squares line always goes through the point \((\bar{x}, \bar{y})\)
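Both properties follow from the intercept formula \(b_0 = \bar{y} - b_1\bar{x}\). A short derivation, using only the definitions already given and writing \(e_i\) for the i-th residual:

```latex
\begin{aligned}
\sum_{i=1}^{n} e_i
  &= \sum_{i=1}^{n} \bigl[y_i - (b_0 + b_1 x_i)\bigr] \\
  &= n\bar{y} - n b_0 - n b_1 \bar{x} \\
  &= n\bar{y} - n(\bar{y} - b_1\bar{x}) - n b_1 \bar{x} = 0, \\[4pt]
\hat{y}\,\big|_{x = \bar{x}}
  &= b_0 + b_1\bar{x} = (\bar{y} - b_1\bar{x}) + b_1\bar{x} = \bar{y}.
\end{aligned}
```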
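Graphically
[Figure: residual = \(y - \hat{y}\). At \(x_i\), the residual \(e_i = y_i - \hat{y}_i\) is the vertical distance between the observed \(y_i\) and the predicted \(\hat{y}_i\) on the line.]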
Residual Plot
• Residuals help us determine if fitting a least squares line to the data makes sense.
• When a least squares line is appropriate, it should model the underlying relationship; nothing interesting should be left behind.
• We make a scatterplot of the residuals in the hope of finding… NOTHING!
Car Wt/ Fuel Consump: Residuals
CAR WT.   FUEL CONSUMP.   Pred FUEL CONSUMP.   Residuals
3.4       5.5             5.209498069          0.290501931
3.8       5.9             5.865096525          0.034903475
4.1       6.5             6.356795367          0.143204633
2.2       3.3             3.242702703          0.057297297
2.6       3.6             3.898301158          -0.29830115
2.9       4.6             4.39                 0.21
2         2.9             2.914903475          -0.01490347
2.7       3.6             4.062200772          -0.46220077
1.9       3.1             2.751003861          0.348996139
3.4       4.9             5.209498069          -0.309498069
Example: Car wt/fuel consump. residual plot page 13
[Figure: RESIDUALS vs WT(X). Residuals (about -0.6 to 0.4) plotted against car weight (1.5 to 4.5).]
SAT Residuals
[Figure: %TAKE Residual Plot. Residuals (-100 to 100) plotted against %TAKE (0 to 80).]
Linear Relationship?
[Figure: "Linear(?)" scatterplot of Y (0 to 60) against X (-4 to 8).]
Garbage In Garbage Out
[Figure: GIGO. Scatterplot of Y (0 to 60) against X (-4 to 8) with the fitted line y = 4x + 11.]
Residual Plot – Clue to GIGO
[Figure: Residual Plot. Residuals (-20 to 20) plotted against the X variable (-4 to 8).]
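To see how a residual plot exposes a bad fit, it helps to fit a line to data that are known to be curved. The data behind the plots above are not given, so the sketch below uses a hypothetical data set (y = x² at integer x from -3 to 7) chosen here purely for illustration; its fitted line therefore differs from the y = 4x + 11 shown on the slide.

```python
# Hypothetical curved data (NOT the data behind the slides' plots): y = x^2
xs = list(range(-3, 8))        # -3, -2, ..., 7
ys = [x * x for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx               # slope of the least squares line
b0 = y_bar - b1 * x_bar        # intercept

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"line: y = {b0:.1f} + {b1:.1f}x")
print("sum of residuals:", round(sum(residuals), 10))
print("residual signs, left to right:", signs)
```

The residuals still sum to 0, as they always do, but their signs come in long runs (positive, then negative, then positive). That U-shaped pattern is the "something" a residual plot is meant to catch: the line was computed correctly, yet a line is the wrong model.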
[Figure: GIGO. The scatterplot with the fitted line y = 4x + 11 shown together with its residual plot (residuals against the X variable).]