Linear Regression

A student wonders if tall women tend to date taller
men than short women do. She measures herself,
her dormitory roommate, and the women in the
adjoining rooms. Then she measures the next man
each woman dates. Draw and discuss the scatterplot
and calculate the correlation coefficient.
Women (x)    Men (y)
66           72
64           68
66           70
65           68
70           71
65           65
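As a rough check on the hand calculation, the correlation for this table can be sketched in Python. The function name `correlation` and the variable names are illustrative choices, not part of the original slides; the formula is the usual Pearson correlation written with sums of deviations about the means.

```python
from math import sqrt

# Heights copied from the table above.
women = [66, 64, 66, 65, 70, 65]   # x: each woman's height
men = [72, 68, 70, 68, 71, 65]     # y: height of the next man she dated


def correlation(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(women, men)
print(f"r = {r:.3f}")
```

For these six pairs the result is about 0.565, a moderate positive association.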
Linear Regression
Guess the correlation coefficient

http://istics.net/stat/Correlations/
Can we make a Line of Best Fit?
Regression Line

This is a line that describes how a response
variable (y) changes as an explanatory
variable (x) changes.

It’s used to predict the value of (y) for a given
value of (x).

Unlike correlation, regression requires that
we have an explanatory variable.
Let’s try some!

http://illuminations.nctm.org/ActivityDetail.aspx?ID=146
Regression Line

The following data show the number of miles driven and
advertised price for 11 used Honda CR-Vs from the 2002-2006
model years (prices found at www.carmax.com). The
scatterplot below shows a strong, negative linear association
between number of miles and advertised price. The
correlation is -0.874. The line on the plot is the regression
line for predicting advertised price based on number of miles.
Thousand Miles Driven    Cost (dollars)
22                       17998
29                       16450
35                       14998
39                       13998
45                       14599
49                       14988
55                       13599
56                       14599
69                       11998
70                       14450
86                       10998
The regression line is shown below….
Use it to answer the following.
Slope:
Y-intercept:
Predict the price for a Honda with
50,000 miles.
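The plotted line is not reproduced in this text, so one way to work the questions above is to fit the least-squares line directly from the table. The sketch below is only that: the helper name `least_squares` is an assumption, and it uses the standard formulas slope = Sxy / Sxx and intercept = ȳ − slope · x̄.

```python
# Mileage (thousands of miles) and advertised price (dollars) from the table above.
miles = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]
price = [17998, 16450, 14998, 13998, 14599, 14988,
         13599, 14599, 11998, 14450, 10998]


def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line for ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return slope, intercept


slope, intercept = least_squares(miles, price)
predicted_50 = intercept + slope * 50   # 50 means 50,000 miles

print(f"slope: {slope:.2f} dollars per thousand miles")
print(f"y-intercept: {intercept:.0f} dollars")
print(f"predicted price at 50,000 miles: {predicted_50:.0f} dollars")
```

Note that x is measured in thousands of miles, so 50,000 miles enters the equation as x = 50.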
Extrapolation

This refers to using a regression line for
prediction far outside the interval of values of
the explanatory variable x used to obtain the
line.

Such predictions are usually not very
accurate.
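A simple guard against accidental extrapolation is to compare a new x value with the range of x values used to fit the line. The sketch below uses the Honda mileage values as an example; the function name `is_extrapolation` and the optional `margin` parameter are hypothetical, and how far outside counts as "far" remains a judgment call.

```python
# Mileage values (thousands of miles) from the Honda CR-V table.
observed_miles = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]


def is_extrapolation(x_new, xs, margin=0.0):
    """True when x_new lies outside [min(xs) - margin, max(xs) + margin]."""
    return x_new < min(xs) - margin or x_new > max(xs) + margin


for x_new in (50, 86, 150):
    if is_extrapolation(x_new, observed_miles):
        label = "outside the observed range (extrapolation)"
    else:
        label = "inside the observed range"
    print(f"x = {x_new}: {label}")
```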

Slope:

Y-int:

Predict weight after 16 wk

Predict weight at 2 years:
Residual

The equation of the least-squares
regression line for the sprint time and long-jump
distance data is
predicted long-jump distance = 304.56 – 27.3 (sprint time).
Find and interpret the residual for the
student who had a sprint time of 8.09
seconds.
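The arithmetic can be sketched as follows. The student's observed long-jump distance is not given in this text, so the code only computes the predicted value and defines the residual as observed minus predicted; the function names are illustrative assumptions.

```python
def predicted_long_jump(sprint_time):
    """Predicted long-jump distance from the regression equation above."""
    return 304.56 - 27.3 * sprint_time


def residual(observed, predicted):
    """Residual = observed - predicted."""
    return observed - predicted


predicted = predicted_long_jump(8.09)
print(f"predicted long-jump distance: {predicted:.2f}")
# To finish the problem, call residual(observed, predicted) with the
# student's measured distance, which is not listed in this section.
```

A positive residual means the student jumped farther than the line predicts; a negative residual means the jump was shorter than predicted.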
Regression

Let’s see how a regression line is calculated.
Fat vs Calories in Burgers
Fat (g)    Calories
19         410
31         580
34         590
35         570
39         640
39         680
43         660
Let’s standardize the variables
Fat (g)    Calories    z_x       z_y
19         410         -1.959    -2
31         580         -0.42     -0.1
34         590         -0.036    0
35         570         0.09      -0.2
39         640         0.6       0.56
39         680         0.6       1
43         660         1.12      0.78
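The standardized columns can be reproduced with a few lines of Python. This is a minimal sketch, assuming the sample standard deviation (dividing by n − 1), which is what `statistics.stdev` computes; the helper name `z_scores` is illustrative.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]               # grams of fat
calories = [410, 580, 590, 570, 640, 680, 660]


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)   # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]


z_fat = z_scores(fat)
z_cal = z_scores(calories)

for f, c, zx, zy in zip(fat, calories, z_fat, z_cal):
    print(f"{f:>3} {c:>4} {zx:>7.3f} {zy:>7.3f}")
```

The printed values agree with the table up to the rounding used on the slide.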
The line must contain the point $(\bar{x}, \bar{y})$. For standardized
variables that point is the origin, so the line passes through the origin.
Let’s clarify a little. (Just watch & listen)
The equation for a line that passes through the origin
can be written with just a slope and no intercept: $y = mx$.
But we're using z-scores, so our equation should
reflect this: $\hat{z}_y = m z_x$.
Many lines with different slopes pass through the origin.
Which one fits our data best? That is, which
slope determines the line that minimizes the sum of
the squared residuals?
Line of Best Fit – Least Squares Regression Line
It’s the line for which the sum of the squared residuals
is smallest. We want to find the mean squared
residual.
Residual = Observed - Predicted
Focus on the vertical deviations from the line.
Let’s find it. (just watch & soak it in)
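$$\text{MSR} = \frac{\sum (z_y - \hat{z}_y)^2}{n-1}$$

Since $\hat{z}_y = m z_x$:

$$\text{MSR} = \frac{\sum (z_y - m z_x)^2}{n-1}$$

$$\text{MSR} = \frac{\sum \left(z_y^2 - 2 m z_x z_y + m^2 z_x^2\right)}{n-1}$$

$$\text{MSR} = \frac{\sum z_y^2}{n-1} - 2m\,\frac{\sum z_x z_y}{n-1} + m^2\,\frac{\sum z_x^2}{n-1}$$

The middle fraction, $\frac{\sum z_x z_y}{n-1}$, is r!
The St. Dev of z-scores is 1, so the variance is 1 also; the first and last fractions each equal 1.

$$\text{MSR} = 1 - 2mr + m^2$$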
Continue……
Since this is a parabola, it reaches its minimum at $x = \dfrac{-b}{2a}$.
This gives us
$$m = \frac{-(-2r)}{2(1)} = r$$
Hence – the slope of the best fit line for z-scores is the
correlation coefficient → r.
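The result can be checked numerically on the burger data. The sketch below standardizes both variables, evaluates the mean squared residual for many candidate slopes m, and compares the best one with r. The grid search and the names `msr` and `candidates` are illustrative assumptions, not the method used on the slides.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]


def z_scores(values):
    """Standardize using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx, zy = z_scores(fat), z_scores(calories)
n = len(zx)

# Correlation as the "average" product of z-scores (dividing by n - 1).
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)


def msr(m):
    """Mean squared residual of the line z_y-hat = m * z_x."""
    return sum((b - m * a) ** 2 for a, b in zip(zx, zy)) / (n - 1)


# Try slopes from -2.000 to 2.000 in steps of 0.001 and keep the best.
candidates = [i / 1000 for i in range(-2000, 2001)]
best_m = min(candidates, key=msr)

print(f"r = {r:.4f}, best slope on the grid = {best_m:.3f}")
print(f"MSR at m = r: {msr(r):.4f}  (compare 1 - r^2 = {1 - r**2:.4f})")
```

The best grid slope lands next to r, and the minimum value of the MSR matches 1 − 2r·r + r² = 1 − r².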
Slope – rise over run
A slope of r for z-scores means that for every increase
of 1 standard deviation in $z_x$, there is an increase of
r standard deviations in $z_y$. “Over 1 and up r.”
Translate back to x & y values: “over one standard
deviation in x, up r standard deviations in y.”
Slope of the regression line is:
$$b = \frac{r\,s_y}{s_x}$$
Why is correlation “r”

Because it was calculated from the
regression of y on x after standardizing the
variables, just as we have done here. The letter
r was therefore chosen to stand for (standardized)
regression.
The number of miles (in thousands) for the 11 used
Hondas has a mean of 50.5 and a standard deviation
of 19.3. The asking prices had a mean of $14,425 and
a standard deviation of $1,899. The correlation for
these variables is r = -0.874. Find the equation of the
least-squares regression line and explain what change
in price we would expect for each additional 19.3
thousand miles.
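One way to work this problem from the summary statistics alone is sketched below. It uses the slope formula b = r·s_y/s_x and the fact that the line passes through (x̄, ȳ), which gives the intercept as ȳ − b·x̄; the variable names are illustrative.

```python
# Summary statistics quoted in the problem.
x_bar, s_x = 50.5, 19.3       # miles driven, in thousands
y_bar, s_y = 14425, 1899      # asking price, in dollars
r = -0.874

slope = r * s_y / s_x                 # dollars per thousand miles
intercept = y_bar - slope * x_bar     # line passes through (x_bar, y_bar)

# One standard deviation in x (19.3 thousand miles) changes the
# predicted price by r standard deviations in y.
change_per_sd = slope * s_x           # equals r * s_y

print(f"predicted price = {intercept:.1f} + ({slope:.2f}) * (thousand miles)")
print(f"change per additional 19.3 thousand miles: {change_per_sd:.2f} dollars")
```

So each additional 19.3 thousand miles is associated with a predicted drop of roughly $1,660 in asking price, which is 0.874 standard deviations of price.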
So let’s write the equation!
$y = mx + b$ (from algebra)
$\hat{y} = b_0 + b_1 x$
$b_0$: y-intercept
$b_1$: slope
Slope:
Explain the slope:
Fat (g)    Calories
19         410
31         580
34         590
35         570
39         640
39         680
43         660
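Putting the pieces together for the burger data, the regression equation can be sketched as follows. The code computes r from the z-scores, then applies b1 = r·s_y/s_x and b0 = ȳ − b1·x̄; the variable names are illustrative, and the sample standard deviation (n − 1) is assumed throughout.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]               # x, grams of fat
calories = [410, 580, 590, 570, 640, 680, 660]   # y

x_bar, y_bar = mean(fat), mean(calories)
s_x, s_y = stdev(fat), stdev(calories)           # sample standard deviations
n = len(fat)

# Correlation: sum of products of z-scores, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(fat, calories)) / (n - 1)

b1 = r * s_y / s_x        # slope: calories per gram of fat
b0 = y_bar - b1 * x_bar   # y-intercept

print(f"r = {r:.3f}")
print(f"predicted calories = {b0:.1f} + {b1:.2f} * (fat in grams)")
```

The slope of about 11 says that each additional gram of fat is associated with roughly 11 more predicted calories.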
Homework
Page 191 (27-32, 35, 37, 39, 41)